Inference
From the perspective of the Text Client
For development, a basic REPL client is used as a chat interface, built around
Typer. The bulk of the logic is in
`inference.text-client.text_client_utils.DebugClient`. The basic steps are as
follows:
1. The debug client first authenticates with a GET request to
   `/auth/callback/debug`, which returns a bearer token that is then set in
   future request headers.
2. After authenticating, the `DebugClient` creates a "chat" by posting to
   `/chats`, which returns a `chat_id` for the session.
3. The script then collects the user prompt.
4. The `DebugClient` posts to the endpoint `/chats/{chat_id}/messages`.
   Included in this request are the message content, a `parent_id: str` (the
   ID of the assistant's response to the prior message, if there is one), the
   `model_config_name: str`, and the inference `sampling_parameters: dict`. In
   the response, the server will return a `message_id: str`.
5. The client will use this id to make a GET request to
   `{backend_url}/chats/{chat_id}/messages/{message_id}/events`. Critically,
   `stream=True` is passed to this `requests` `get()` call, meaning that the
   response content will not be immediately downloaded but rather streamed.
   Additionally, the GET request must have `"Accept": "text/event-stream"` in
   its headers to let the server know we are awaiting an event stream.
6. The response is then used to instantiate an `SSEClient`, which, via its
   `events()` method, returns an iterable that can be used to print out
   inference results, one token at a time (a minimal sketch of this whole flow
   follows the list).
7. After exhausting the events iterable (i.e. inference is complete), the user
   is prompted for a new message.
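Below is a minimal sketch of that flow using `requests` and `sseclient-py`
rather than the actual `DebugClient`. The endpoint paths are the ones described
above, but the JSON field names (`access_token`, `id`, `text`), the example
model config name, and the event payload shape are assumptions made for
illustration.

```python
import json

import requests
import sseclient

backend_url = "http://localhost:8000"  # assumed local inference server URL

# 1. Authenticate via the debug callback; it returns a bearer token that is
#    attached to all later requests.
auth = requests.get(f"{backend_url}/auth/callback/debug")
auth.raise_for_status()
token = auth.json()["access_token"]  # field name assumed
headers = {"Authorization": f"Bearer {token}"}

# 2. Create a chat and remember its id for the session.
chat = requests.post(f"{backend_url}/chats", json={}, headers=headers).json()
chat_id = chat["id"]  # field name assumed

# 3./4. Post a prompter message; the response carries the message_id to stream.
message = requests.post(
    f"{backend_url}/chats/{chat_id}/messages",
    json={
        "content": "Hello!",
        "parent_id": None,  # id of the assistant's prior reply, if any
        "model_config_name": "example-model",  # assumed model config name
        "sampling_parameters": {"temperature": 0.8},
    },
    headers=headers,
).json()
message_id = message["id"]  # field name assumed

# 5./6. Stream the events for that message and print tokens as they arrive.
events_response = requests.get(
    f"{backend_url}/chats/{chat_id}/messages/{message_id}/events",
    headers={**headers, "Accept": "text/event-stream"},
    stream=True,  # don't download the body eagerly; iterate it as it arrives
)
for event in sseclient.SSEClient(events_response).events():
    data = json.loads(event.data)
    print(data.get("text", ""), end="", flush=True)  # payload shape assumed
```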
From the perspective of the OA Inference Server
The inference server is built around FastAPI.
When the client posts to `/chats`, the `UserChatRepository` - an interface
between the application logic and the chat and message tables - creates a chat
in the `chat` table, which assigns the needed chat id that is returned in the
response to the client.

Next, the client POSTs to `/chats/{chat_id}/prompter_message`, which uses the
`UserChatRepository` to perform the following steps (a simplified stand-in is
sketched after the list):

- Check whether the message content included in the POST request is longer
  than the configured settings allow.
- Using the database session and the `user_id` that was configured when this
  instance of `UserChatRepository` was instantiated, look up in the `chat`
  table the entry corresponding to this `chat_id` and `user_id`.
- Check whether the total number of assistant + prompter messages in the chat
  is greater than the configured settings allow.
- If this is a new chat (i.e. no `parent_id` is set) and the chat table
  contains no title for the chat of that id, update the chat table with the
  initial user prompt as the title.
- Create a message in the message table for the user prompt.
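The real checks run inside `UserChatRepository` against the database session;
the in-memory stand-in below only illustrates their order and intent. The limit
constants, the `Chat` dataclass, and the function name are invented for the
example.

```python
from dataclasses import dataclass, field

MAX_MESSAGE_LENGTH = 1000      # assumed stand-in for the configured setting
MAX_MESSAGES_PER_CHAT = 100    # assumed stand-in for the configured setting


@dataclass
class Chat:
    id: str
    title: str | None = None
    messages: list[dict] = field(default_factory=list)


def add_prompter_message(chat: Chat, content: str, parent_id: str | None) -> dict:
    # Reject prompts that exceed the configured maximum length.
    if len(content) > MAX_MESSAGE_LENGTH:
        raise ValueError("message content is too long")
    # Reject chats that already hold the maximum number of messages.
    if len(chat.messages) >= MAX_MESSAGES_PER_CHAT:
        raise ValueError("chat already contains too many messages")
    # A brand-new chat gets the initial prompt as its title.
    if parent_id is None and chat.title is None:
        chat.title = content
    # Record the user prompt as a prompter message.
    message = {"role": "prompter", "content": content, "parent_id": parent_id}
    chat.messages.append(message)
    return message
```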
The client then POSTs to `/chats/{chat_id}/assistant_message` (a schematic
version of this handler is sketched after the list):

- First we load the model config for the `model_config_name` specified in the
  client's request.
- Then, we use the `UserChatRepository` to create a message in the `message`
  table, in a pending state. Note that we also update the state of any other
  currently pending messages in the chat to `inference.MessageState.cancelled`.
- After updating the `message` table, we create a `RedisQueue` for this
  specific message and enqueue the message.
- Finally, we return an `inference.MessageRead` (a Pydantic model) to the
  client. This is the object that contains the needed `message_id`.
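Here is a schematic version of that handler, using `redis-py` directly in place
of the server's own `RedisQueue` wrapper. The queue key, the payload shape, and
the plain-dict message record are assumptions; the real handler persists the
message and its state changes through `UserChatRepository`.

```python
import json
import uuid

import redis  # the real server wraps this in its own RedisQueue helper

redis_client = redis.Redis()


def create_assistant_message(chat_id: str, parent_id: str,
                             model_config_name: str,
                             sampling_parameters: dict) -> dict:
    # A row would be written to the message table in a pending state here, and
    # any other pending messages in the chat marked as cancelled (the database
    # work is elided from this sketch).
    message_id = str(uuid.uuid4())
    message = {"id": message_id, "chat_id": chat_id,
               "parent_id": parent_id, "state": "pending"}

    # Create a queue keyed to this specific message and enqueue the work item
    # for a worker to pick up; key and payload shape are assumptions.
    work_item = {"message_id": message_id,
                 "model_config_name": model_config_name,
                 "sampling_parameters": sampling_parameters}
    redis_client.lpush(f"work:{message_id}", json.dumps(work_item))

    # The endpoint then returns a MessageRead-like payload carrying message_id.
    return message
```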
Once the client has the `message_id`, it will make a GET request to
`/chats/{chat_id}/messages/{message_id}/events`. Upon receiving this request,
the server retrieves the message from the message table and verifies that the
message isn't finished and that the message role is assistant.
After this, we get the Redis queue for this specific message id. From here, we
poll the queue using its `dequeue()` method. By default this method blocks for
up to 1 second; if a message is added to the queue within that interval, it is
returned immediately. Otherwise, `dequeue()` returns `None`.

If `None` is returned and this is the first time we have polled the queue since
its creation, we yield a `chat_schema.PendingResponseEvent`. The loop then
continues, blocking for 1 second (by default) on each iteration, until a
message item is returned. A message item is a tuple whose first element is the
message id as a string and whose second element is a response. After some
checks on the message response type (for example, whether there is a safety
intervention or an internal error from the worker), we yield a
`chat_schema.MessageResponseEvent`. The message polling and yielding is bundled
in an `EventSourceResponse` object.
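A condensed sketch of that pattern follows, assuming `sse-starlette` provides
`EventSourceResponse` and using a plain `redis-py` `brpop` in place of the
server's `dequeue()`; the queue key, event names, payload fields, and the
`is_final` terminator flag are assumptions.

```python
import json

from fastapi import FastAPI
from redis import asyncio as aioredis
from sse_starlette.sse import EventSourceResponse

app = FastAPI()
redis_client = aioredis.Redis()


@app.get("/chats/{chat_id}/messages/{message_id}/events")
async def message_events(chat_id: str, message_id: str):
    # Verification that the message exists, is unfinished, and has the
    # assistant role is elided from this sketch.

    async def event_generator():
        first_poll = True
        while True:
            # Block for up to 1 second, standing in for RedisQueue.dequeue();
            # brpop returns (key, value) or None on timeout.
            item = await redis_client.brpop(f"events:{message_id}", timeout=1)
            if item is None:
                if first_poll:
                    # Nothing yet: tell the client the message is still pending.
                    yield {"event": "pending",
                           "data": json.dumps({"message_id": message_id})}
                    first_poll = False
                continue
            _, raw = item
            response = json.loads(raw)
            # The real server inspects the response type (safety intervention,
            # worker error, ...) before yielding a MessageResponseEvent.
            yield {"event": "message", "data": json.dumps(response)}
            if response.get("is_final"):  # assumed terminator flag
                break

    return EventSourceResponse(event_generator())
```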
The `EventSourceResponse` is a Server-Sent Events (SSE) plugin for
Starlette/FastAPI, which takes content and returns it as part of an HTTP event
stream. You can check out the `EventSourceResponse` library's source code for
more details about how this works.
From the perspective of the OA Worker
This section is not yet written. If you are interested in helping write it,
please get in touch on GitHub or Discord.