Previously, the flow for parsing chat completion messages and rendering
from the prompt template was disconnected between endpoints. Now, create
a common function that renders the prompt and handles the rendered output
consistently across endpoints.
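As a rough sketch (the function and parameter names here are illustrative,
not the actual TabbyAPI symbols, and the template is assumed to expose a
Jinja2-style render method), the shared path could look like:

    def format_messages_with_template(
        messages: list, prompt_template, add_generation_prompt: bool = True
    ) -> str:
        # Single code path shared by /v1/chat/completions and /v1/encode:
        # render the messages through the prompt template, then hand the
        # resulting prompt to whatever the endpoint needs next.
        return prompt_template.render(
            messages=messages,
            add_generation_prompt=add_generation_prompt,
        )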
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, the messages were a list of dicts, which are untyped
and don't provide strict type hinting. Add types for chat completion
messages and reformat existing code.
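For illustration, the typed messages could look roughly like this (field
names follow the OAI chat schema; the exact models in the codebase may differ):

    from typing import List, Optional, Union
    from pydantic import BaseModel

    class ChatCompletionImageUrl(BaseModel):
        url: str

    class ChatCompletionMessagePart(BaseModel):
        type: str = "text"
        text: Optional[str] = None
        image_url: Optional[ChatCompletionImageUrl] = None

    class ChatCompletionMessage(BaseModel):
        role: str = "user"
        content: Optional[Union[str, List[ChatCompletionMessagePart]]] = None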
Signed-off-by: kingbri <bdashore3@proton.me>
* More robust checks for OAI chat completion message lists on /v1/encode endpoint
* Added TODO to support other aspects of chat completions
* Fix oversight where embeddings was not defined in advance on /v1/chat/completions endpoint
* Support image_url inputs containing URLs or base64 strings following OAI vision spec
* Use an async LRU cache for image embeddings (see the sketch below)
* Add generic wrapper class for multimodal embeddings
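The caching idea referenced above, assuming the third-party async-lru
package (the fetch and embed helpers are hypothetical):

    from async_lru import alru_cache

    @alru_cache(maxsize=16)
    async def get_image_embedding(image_url: str):
        # Fetch (or base64-decode) the image and embed it once; repeated
        # references to the same URL are served from the cache.
        image = await fetch_image(image_url)        # hypothetical helper
        return await vision_model.embed(image)      # hypothetical embed call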
When the request is cancelled, cancel the load task. In addition,
when checking if a model container exists, also check if the model
is fully loaded.
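A sketch of the intent (not the exact TabbyAPI code):

    import asyncio

    async def load_endpoint(load_params):
        load_task = asyncio.create_task(load_model(load_params))  # hypothetical loader
        try:
            await load_task
        except asyncio.CancelledError:
            # The client disconnected: stop the in-flight load too
            load_task.cancel()
            raise

    def model_is_loaded(container) -> bool:
        # A container existing isn't enough; it must report a finished load
        return container is not None and getattr(container, "model_loaded", False)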
Signed-off-by: kingbri <bdashore3@proton.me>
- add models for config options
- add function to regenerate config.yml
- replace references to config with pydantic-compatible references
- remove unnecessary unwrap() statements
TODO:
- auto generate env vars
- auto generate argparse
- test loading a model
The config categories can have defined separation, but preserve
the dynamic nature of adding new config options by keeping all of the
internal class vars as dictionaries.
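A minimal sketch of that layout (category names are examples):

    class GlobalConfig:
        # Categories stay separated, but each one is a plain dict so new
        # config keys can be added without touching the class definition.
        network: dict = {}
        model: dict = {}
        developer: dict = {}

        @classmethod
        def load(cls, raw_config: dict):
            for category, options in raw_config.items():
                if hasattr(cls, category):
                    getattr(cls, category).update(options or {})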
This was necessary because storing global callbacks captured the state
of the previous global_config var, which wasn't populated.
Signed-off-by: kingbri <bdashore3@proton.me>
Storing a pathlib type makes it easier to manipulate the model
directory path in the long run without constantly fetching it
from the config.
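For example (a trivial sketch):

    from pathlib import Path

    # Resolved once at startup instead of re-reading the config everywhere
    model_dir = Path("models").resolve()

    def model_path(model_name: str) -> Path:
        # Path objects compose cleanly without manual string joins
        return model_dir / model_name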
Signed-off-by: kingbri <bdashore3@proton.me>
Use Infinity as a separate backend and handle the model within the
common module. This separates out the embeddings model from the endpoint
which allows for model loading/unloading in core.
Signed-off-by: kingbri <bdashore3@proton.me>
Infinity-emb is an async batching engine for embeddings. This is
preferable to sentence-transformers since it handles scalable use cases
without the need for external thread intervention.
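Typical usage looks roughly like this (based on infinity_emb's async
engine; exact names and signatures may vary between versions):

    from infinity_emb import AsyncEmbeddingEngine, EngineArgs

    async def embed_texts(texts: list):
        engine = AsyncEmbeddingEngine.from_args(
            EngineArgs(model_name_or_path="BAAI/bge-small-en-v1.5")
        )
        async with engine:  # starts and stops the internal batching loop
            embeddings, usage = await engine.embed(sentences=texts)
        return embeddings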
Signed-off-by: kingbri <bdashore3@proton.me>
Place OAI specific routes in the appropriate folder. This is in
preparation for adding new API servers that can be optionally enabled.
Signed-off-by: kingbri <bdashore3@proton.me>
Identify which request is being processed to help users disambiguate
which logs correspond to which request.
Signed-off-by: kingbri <bdashore3@proton.me>
API keys are not allowed to view all of the admin's models, templates,
draft models, loras, etc. In short, anything visible on the filesystem
beyond what is currently loaded is only returned when an admin key is
present.
This change helps preserve user privacy while not erroring out on
list endpoints that the OAI spec requires.
Signed-off-by: kingbri <bdashore3@proton.me>
Pass a request and internally unwrap the headers. In addition, allow
the X-admin-key header to be checked on API key requests.
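A sketch of the check (the header names match this change; the key
stores are placeholders):

    from fastapi import HTTPException, Request

    valid_api_keys = {"example-api-key"}        # placeholder store
    valid_admin_keys = {"example-admin-key"}    # placeholder store

    async def check_api_key(request: Request):
        # The whole request is passed in and the headers are unwrapped here
        api_key = request.headers.get("x-api-key")
        admin_key = request.headers.get("x-admin-key")

        # An admin key also satisfies an API key requirement
        if api_key in valid_api_keys or admin_key in valid_admin_keys:
            return
        raise HTTPException(401, "Invalid API key")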
Signed-off-by: kingbri <bdashore3@proton.me>
Depending on the day of the week, Starlette surfaces a client disconnect
either as a CancelledError or through await request.is_disconnected().
Run the same behavior for both cases and allow cancellation.
Streaming requests now set an event to cancel the batched job and break
out of the generation loop.
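The two paths boil down to something like this sketch (names are illustrative):

    import asyncio

    async def watch_for_disconnect(request, gen_task, abort_event: asyncio.Event):
        try:
            while not gen_task.done():
                # Path 1: Starlette reports the disconnect when polled
                if await request.is_disconnected():
                    abort_event.set()   # signals the batched job to stop
                    gen_task.cancel()
                    break
                await asyncio.sleep(0.5)
        except asyncio.CancelledError:
            # Path 2: Starlette raises CancelledError into the handler
            abort_event.set()
            gen_task.cancel()
            raise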
Signed-off-by: kingbri <bdashore3@proton.me>
List comprehensions are the more "pythonic" way to approach mapping
values to a list. They're also more flexible across different collection
types than the built-in map function. It's best to keep one convention
rather than splitting between two.
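For example:

    values = ["1", "2", "3"]

    # Preferred: list comprehension
    numbers = [int(value) for value in values]

    # Avoided: the built-in map
    numbers = list(map(int, values))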
Signed-off-by: kingbri <bdashore3@proton.me>
Add a sequential lock and wait until jobs are completed before executing
any loading requests that directly alter the model. However, we also
need to block any new requests that come in until the load is finished,
so add a condition that triggers once the lock is free.
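Conceptually, the scheme looks like this sketch (the job-draining helper
is hypothetical):

    import asyncio

    load_lock = asyncio.Lock()
    load_condition = asyncio.Condition()

    async def load_model_safely(loader):
        async with load_lock:            # one load/unload at a time
            await wait_for_jobs()        # hypothetical: drain in-flight generations
            await loader()
        async with load_condition:
            load_condition.notify_all()  # release requests queued behind the load

    async def wait_for_load():
        # New generation requests park here until no load is in progress
        async with load_condition:
            await load_condition.wait_for(lambda: not load_lock.locked())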
Signed-off-by: kingbri <bdashore3@proton.me>
These both take an array of glob strings to state what files or
directories to include or exclude when parsing the download list.
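A sketch of the filtering (glob semantics via fnmatch; the argument names
mirror the option names):

    from fnmatch import fnmatch

    def filter_repo_files(files, include=None, exclude=None):
        # Globs narrow the file list before anything is downloaded
        if include:
            files = [f for f in files if any(fnmatch(f, pat) for pat in include)]
        if exclude:
            files = [f for f in files if not any(fnmatch(f, pat) for pat in exclude)]
        return files

    # e.g. keep safetensors and configs, drop pickled weights
    files = ["model.safetensors", "config.json", "pytorch_model.bin"]
    filter_repo_files(files, include=["*.safetensors", "*.json"], exclude=["*.bin"])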
Signed-off-by: kingbri <bdashore3@proton.me>
Add an asynchronous huggingface downloader that uses HF hub to fetch
all repo files. The current HF hub package has a snapshot_download
function that does not cancel on KeyboardInterrupt.
Instead, make a downloader that uses the Rich progress bar styling
along with a cancellable interface. Finally, link this to TabbyAPI.
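A condensed sketch of the approach (using huggingface_hub's file listing
and per-file download with Rich for progress; the real downloader is more
involved):

    import asyncio
    from huggingface_hub import HfApi, hf_hub_download
    from rich.progress import Progress

    async def download_repo(repo_id: str, revision: str = "main"):
        files = HfApi().list_repo_files(repo_id, revision=revision)
        with Progress() as progress:
            task = progress.add_task(f"Downloading {repo_id}", total=len(files))
            for filename in files:
                # Each blocking download runs off the event loop; cancelling
                # this coroutine (e.g. on Ctrl+C) stops scheduling more files
                await asyncio.to_thread(
                    hf_hub_download, repo_id, filename, revision=revision
                )
                progress.advance(task)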
Signed-off-by: kingbri <bdashore3@proton.me>
Having many utility functions for initialization doesn't make much sense.
Instead, handle anything regarding template creation inside the
class itself, which reduces the number of function imports.
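For instance, a classmethod constructor keeps the creation logic on the
class (a sketch; names are illustrative):

    from pathlib import Path
    from jinja2 import Template

    class PromptTemplate:
        def __init__(self, name: str, raw_template: str):
            self.name = name
            self.template = Template(raw_template)

        @classmethod
        def from_file(cls, path: Path):
            # Creation lives with the class instead of scattered helpers
            return cls(path.stem, path.read_text())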
Signed-off-by: kingbri <bdashore3@proton.me>
response_format allows a user to request a valid but arbitrary JSON
object from the API. This is a new part of the OAI spec.
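On the wire, a request opting in looks roughly like this (per the OAI spec):

    payload = {
        "model": "example-model",
        "messages": [{"role": "user", "content": "List three colors as JSON."}],
        # Constrain the output to a syntactically valid, but otherwise
        # arbitrary, JSON object
        "response_format": {"type": "json_object"},
    }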
Signed-off-by: kingbri <bdashore3@proton.me>