* Ensure that the length of the positive/negative prompt + max_tokens does not exceed max_seq_len
* Ensure that the total pages required for a CFG request do not exceed the allocated cache_size (see the sketch below)
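A minimal sketch of both checks, assuming a paged cache with a fixed page size (the names `PAGE_SIZE`, `max_seq_len`, and `cache_size` are illustrative, not tabby's actual internals):

```python
PAGE_SIZE = 256  # assumed page size for the paged KV cache


def validate_cfg_request(
    prompt_len: int,
    negative_prompt_len: int,
    max_tokens: int,
    max_seq_len: int,
    cache_size: int,
):
    """Reject CFG requests that can't fit in the context or the cache."""

    # Both the positive and negative sequences must fit the context window
    for length in (prompt_len, negative_prompt_len):
        if length + max_tokens > max_seq_len:
            raise ValueError(
                f"Prompt length {length} + max_tokens {max_tokens} "
                f"exceeds max_seq_len {max_seq_len}"
            )

    # A CFG request holds cache pages for both sequences at once
    def pages(length: int) -> int:
        return -(-(length + max_tokens) // PAGE_SIZE)  # ceiling division

    required = pages(prompt_len) + pages(negative_prompt_len)
    if required * PAGE_SIZE > cache_size:
        raise ValueError(
            f"CFG request needs {required} cache pages, exceeding the "
            f"allocated cache_size of {cache_size} tokens"
        )
```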
Most software has moved to CUDA 12, and cards that aren't supported by
11.8 don't use tabby anyway.
Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
The vision module from the ExllamaV2 backend is used in files outside
the backends folder. Therefore, import ExllamaV2 as an optional
dependency here.
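A sketch of the optional-import pattern described above; the module path and function name are assumptions for illustration:

```python
# Import the vision helper lazily so this module still loads
# when the ExllamaV2 backend isn't installed
try:
    from backends.exllamav2.vision import get_image_embedding

    HAS_EXLLAMAV2 = True
except ImportError:
    get_image_embedding = None
    HAS_EXLLAMAV2 = False


def embed_image(url: str):
    if not HAS_EXLLAMAV2:
        raise ImportError(
            "ExllamaV2 is required for vision support but isn't installed"
        )

    return get_image_embedding(url)
```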
Signed-off-by: kingbri <bdashore3@proton.me>
The strings weren't being concatenated properly. Only add the combined
text if the chat completion message content is a list.
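A sketch of the corrected flow, assuming OAI-style messages where content is either a plain string or a list of typed parts:

```python
def flatten_content(content) -> str:
    """Collapse a chat completion message's content into one string."""

    # Plain string content needs no concatenation
    if isinstance(content, str):
        return content

    # Only combine text when the content is a list of parts
    if isinstance(content, list):
        return "".join(
            part.get("text", "")
            for part in content
            if part.get("type") == "text"
        )

    return ""
```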
Signed-off-by: kingbri <bdashore3@proton.me>
If vision is enabled and the model doesn't support it, send an
error asking the user to reload. Also, add a method to unload the
vision tower.
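Sketched below with illustrative names (`use_vision`, `vision_model`); the actual container attributes may differ:

```python
class ExllamaV2Container:
    use_vision: bool = False
    vision_model = None

    def check_vision(self):
        # Fail early if vision was requested but the loaded model
        # has no vision tower; the user must reload to fix this
        if self.use_vision and self.vision_model is None:
            raise ValueError(
                "Vision is enabled, but the current model doesn't support "
                "it. Please reload with a vision-capable model."
            )

    def unload_vision(self):
        # Free the vision tower's weights and drop the reference
        if self.vision_model is not None:
            self.vision_model.unload()
            self.vision_model = None
```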
Signed-off-by: kingbri <bdashore3@proton.me>
The internal model_type reference was changed to an enum for a more
extensible loading process. Return the current model type when loading
a new model.
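Roughly the shape of the change; the enum members are illustrative:

```python
from enum import Enum


class ModelType(Enum):
    MODEL = "model"
    DRAFT = "draft"
    EMBEDDING = "embedding"
    VISION = "vision"


def load_model(model_path: str, model_type: ModelType = ModelType.MODEL) -> ModelType:
    # Dispatch on the enum instead of comparing ad-hoc strings,
    # so new model kinds only need a new member
    ...

    # Report the type of the model that was just loaded
    return model_type
```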
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, the flow for parsing chat completion messages and rendering
them through the prompt template was disconnected between endpoints. Now,
create a common function that renders the prompt and handles the result
appropriately for each endpoint.
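A sketch of the shared renderer; `flatten_content` and the template API are assumptions carried over from the sketches above:

```python
def render_chat_messages(messages: list, prompt_template) -> str:
    """One place where chat completion messages become a prompt string."""

    # Normalize each message's content before templating
    rendered = [
        {"role": message["role"], "content": flatten_content(message["content"])}
        for message in messages
    ]

    # Both /v1/chat/completions (generate) and /v1/encode (tokenize)
    # now call this instead of duplicating the parsing logic
    return prompt_template.render({"messages": rendered})
```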
Signed-off-by: kingbri <bdashore3@proton.me>
Migrate the add method into the class itself. Also, a BaseModel isn't
needed here since this class is never serialized.
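Roughly the resulting shape, using the multimodal embeddings wrapper from the commit below as a stand-in (names are guesses):

```python
class MultimodalEmbeddingWrapper:
    """Plain container: it's never serialized, so BaseModel isn't needed."""

    def __init__(self):
        self.content: list = []

    def add(self, embedding):
        # Formerly a free function mutating the container from outside;
        # the class now owns its own state changes
        self.content.append(embedding)
```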
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, the messages were a list of dicts. These are untyped
and don't provide strict hinting. Add types for chat completion
messages and reformat existing code.
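A sketch of the typed messages, assuming Pydantic models and the OAI content-part shape:

```python
from pydantic import BaseModel


class ChatCompletionImageUrl(BaseModel):
    url: str


class ChatCompletionMessagePart(BaseModel):
    type: str  # "text" or "image_url"
    text: str | None = None
    image_url: ChatCompletionImageUrl | None = None


class ChatCompletionMessage(BaseModel):
    role: str = "user"
    content: str | list[ChatCompletionMessagePart] | None = None
```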
Signed-off-by: kingbri <bdashore3@proton.me>
* More robust checks for OAI chat completion message lists on the /v1/encode endpoint
* Added a TODO to support other aspects of chat completions
* Fix an oversight where embeddings were not defined in advance on the /v1/chat/completions endpoint
* Support image_url inputs containing URLs or base64 strings following the OAI vision spec (see the sketch below)
* Use async lru cache for image embeddings
* Add generic wrapper class for multimodal embeddings
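A sketch of the image_url resolution and caching, assuming the third-party async_lru package for the async-aware cache and aiohttp for fetching (neither is confirmed as tabby's actual choice):

```python
import base64
import io

import aiohttp
from async_lru import alru_cache
from PIL import Image


@alru_cache(maxsize=32)
async def fetch_image(url: str) -> Image.Image:
    """Resolve an OAI-style image_url into a PIL image, with caching."""

    if url.startswith("data:"):
        # Base64 data URI: data:image/png;base64,<payload>
        data = base64.b64decode(url.split(",", 1)[1])
    else:
        # Remote URL
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                data = await response.read()

    return Image.open(io.BytesIO(data))
```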
If a client sends a dummy model name, the server shouldn't error, since
it caters to clients that expect specific OAI model names. This is a
problem with inline model loading, where those names would error by
default. Therefore, add an exception if the provided name is in the
dummy model names (which also double as inline strict exceptions).
However, the dummy model names weren't configurable, so add a new
option to specify exception names; the default is gpt-3.5-turbo.
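A sketch of the resulting check; the config key name is hypothetical:

```python
DEFAULT_DUMMY_NAMES = ["gpt-3.5-turbo"]


def is_dummy_model(requested_model: str, config: dict) -> bool:
    """Dummy names pass through instead of triggering an inline load."""

    # Configurable exception names, falling back to the default
    dummy_names = config.get("dummy_model_names") or DEFAULT_DUMMY_NAMES

    # These names also double as strict exceptions for inline loading
    return requested_model in dummy_names
```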
Signed-off-by: kingbri <bdashore3@proton.me>
The admin key check was running even when inline loading was disabled.
Fix this bug while preserving the existing permission system when
inline loading is enabled.
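Roughly the corrected gate; the key-validation helpers are illustrative:

```python
async def check_model_permission(request, inline_loading_enabled: bool):
    # Previously the admin check ran unconditionally; only escalate
    # to an admin key when inline loading can actually swap models
    if not inline_loading_enabled:
        return await validate_api_key(request)

    # Inline loading can load a new model, so keep the admin requirement
    return await validate_admin_key(request)
```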
Signed-off-by: kingbri <bdashore3@proton.me>