When the request is cancelled, cancel the load task. In addition,
when checking if a model container exists, also check if the model
is fully loaded.
Signed-off-by: kingbri <bdashore3@proton.me>
There are two ways to load a model:
1. Via the load endpoint
2. Inline with a completion
The defaults were not applying on the inline load, so rewrite to fix
that. However, while doing this, set up a defaults dictionary rather
than comparing it at runtime and remove the pydantic default lambda
on all the model load fields.
This makes the code cleaner and establishes a clear config tree for
loading models.
Signed-off-by: kingbri <bdashore3@proton.me>
Using aiofiles, there's no longer a possiblity of blocking file operations
that can hang up the event loop. In addition, partially migrate
classes to use asynchronous init instead of the normal python magic method.
The only exception is config, since that's handled in the synchonous
init before the event loop starts.
Signed-off-by: kingbri <bdashore3@proton.me>
The config categories can have defined separation, but preserve
the dynamic nature of adding new config options by making all the
internal class vars as dictionaries.
This was necessary since storing global callbacks stored a state
of the previous global_config var that wasn't populated.
Signed-off-by: kingbri <bdashore3@proton.me>
If a user requesting a model change isn't admin, error.
Better to place the load function before the generate functions.
Signed-off-by: kingbri <bdashore3@proton.me>
Using "auto" for rope alpha removes ambiguity on how to explicitly
enable automatic rope calculation. The same behavior of None -> auto
calculate still exists, but can be overwritten if a model's tabby_config.yml
includes `rope_alpha`.
Signed-off-by: kingbri <bdashore3@proton.me>
It's best to pass them down the config stack.
API/User config.yml -> model config.yml -> model config.json -> fallback.
Doing this allows for seamless flow and yielding control to each
member in the stack.
Signed-off-by: kingbri <bdashore3@proton.me>
Storing a pathlib type makes it easier to manipulate the model
directory path in the long run without constantly fetching it
from the config.
Signed-off-by: kingbri <bdashore3@proton.me>
* Add healthcheck
- localhost only /healthcheck endpoint
- cURL healthcheck in docker compose file
* Update Healthcheck Response
- change endpoint to /health
- remove localhost restriction
- add docstring
* move healthcheck definition to top of the file
- make the healthcheck show up first in the openAPI spec
* Tree: Format
Use the tensor parallel loader when the flag is enabled. The new loader
has its own autosplit implementation, so gpu_split_auto isn't valid
here.
Also make it easier to determine which cache type to use rather than
multiple if/else statements.
Signed-off-by: kingbri <bdashore3@proton.me>
Metadata is generated via a template's module. This requires a single
iteration through the template. If a template tries to access a passed
variable that doesn't exist, it will error.
Therefore, generate the metadata at runtime to prevent these errors
from happening. To optimize further, cache the metadata after the
first generation to prevent the expensive call of making a template
module.
Signed-off-by: kingbri <bdashore3@proton.me>
* returning stop str if exists from gen
* added chat template for firefunctionv2
* pulling tool vars from template
* adding parsing for tool inputs/outputs
* passing tool data from endpoint to chat template, adding tool_start to the stop list
* loosened typing on the response tool call, leaning more on the user supplying a quality schema if they want a particular format
* non streaming generation prototype
* cleaning template
* Continued work with type, ingestion into template, and chat template for fire func
* Correction - streaming toolcall comes back as delta obj not inside chatcomprespchoice per chat_completion_chunk.py inside OAI lib.
* Ruff Formating
* Moved stop string and tool updates out of prompt creation func
Updated tool pydantic to match OAI
Support for streaming
Updated generate tool calls to use flag within chat_template and insert tool reminder
* Llama 3.1 chat templates
Updated fire func template
* renamed llama3.1 to chatml_with_headers..
* update name of template
* Support for calling a tool start token rather than the string.
Simplified tool_params
Warning when gen_settings are being overidden becuase user set temp to 0
Corrected schema and tools to correct types for function args. Str for some reason
* draft groq tool use model template
* changed headers to vars for readablity (but mostly because some models are weird about newlines after headers, so this is an easier way to change globally)
* Clean up comments and code in chat comp
* Post processed tool call to meet OAI spec rather than forcing model to write json in a string in the middle of the call.
* changes example back to args as json rather than string of json
* Standardize chat templates to each other
* cleaning/rewording
* stop elements can also be ints (tokens)
* Cleaning/formatting
* added special tokens for tools and tool_response as specified in description
* Cleaning
* removing aux templates - going to live in llm-promp-templates repo instead
* Tree: Format
Signed-off-by: kingbri <bdashore3@proton.me>
* Chat Completions: Don't include internal tool variables in OpenAPI
Use SkipJsonSchema to supress inclusion with the OpenAPI JSON. The
location of these variables may need to be changed in the future.
Signed-off-by: kingbri <bdashore3@proton.me>
* Templates: Deserialize metadata on template load
Since we're only looking for specific template variables that are
static in the template, it makes more sense to render when the template
is initialized.
Signed-off-by: kingbri <bdashore3@proton.me>
* Tools: Fix comments
Adhere to the format style of comments in the rest of the project.
Signed-off-by: kingbri <bdashore3@proton.me>
---------
Co-authored-by: Ben Gitter <gitterbd@gmail.com>
Signed-off-by: kingbri <bdashore3@proton.me>
Embedding models are managed on a separate backend, but are run
in parallel with the model itself. Therefore, manage this in a separate
container with separate routes.
Signed-off-by: kingbri <bdashore3@proton.me>
Use Infinity as a separate backend and handle the model within the
common module. This separates out the embeddings model from the endpoint
which allows for model loading/unloading in core.
Signed-off-by: kingbri <bdashore3@proton.me>
Infinity-emb is an async batching engine for embeddings. This is
preferable to sentence-transformers since it handles scalable usecases
without the need for external thread intervention.
Signed-off-by: kingbri <bdashore3@proton.me>
This is necessary for Kobold's API. Current models use bad_words_ids
in generation_config.json, but for some reason, they're also present
in the model's config.json.
Signed-off-by: kingbri <bdashore3@proton.me>
Some of the parameters the API provides are aliases for their OAI
equivalents. It makes more sense to move them to the common file.
Signed-off-by: kingbri <bdashore3@proton.me>
These are faster event loops for asyncio which should improve overall
performance. Gate these under an experimental flag for now to stress
test these loops.
Signed-off-by: kingbri <bdashore3@proton.me>