Use the tensor parallel loader when the flag is enabled. The new loader
has its own autosplit implementation, so gpu_split_auto isn't valid
here.
Also, make it easier to determine which cache type to use instead of
chaining multiple if/else statements.
Signed-off-by: kingbri <bdashore3@proton.me>
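A minimal sketch of the dict-based cache selection described above, assuming exllamav2's cache class names; the mapping keys and helper name are illustrative:

```python
# Possible dict-based replacement for chained if/else cache selection.
# Class names come from exllamav2; the keys and helper are illustrative.
from exllamav2 import (
    ExLlamaV2Cache,
    ExLlamaV2Cache_8bit,
    ExLlamaV2Cache_Q4,
)

CACHE_CLASSES = {
    "FP16": ExLlamaV2Cache,
    "FP8": ExLlamaV2Cache_8bit,
    "Q4": ExLlamaV2Cache_Q4,
}


def resolve_cache_class(cache_mode: str):
    """Look up the cache class for a config value, defaulting to FP16."""
    return CACHE_CLASSES.get(cache_mode.upper(), ExLlamaV2Cache)
```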
Metadata is generated via a template's module. This requires a single
iteration through the template. If a template tries to access a passed
variable that doesn't exist, it will error.
Therefore, generate the metadata at runtime to prevent these errors
from happening. To optimize further, cache the metadata after the
first generation to avoid the expensive call of creating a template
module.
Signed-off-by: kingbri <bdashore3@proton.me>
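A minimal sketch of the runtime metadata generation with caching, assuming Jinja2 templates; the class, attribute, and variable names are illustrative:

```python
# Sketch of runtime metadata extraction with caching. Class and variable
# names are illustrative.
from jinja2 import Environment


class PromptTemplate:
    def __init__(self, raw_template: str):
        environment = Environment()
        self.template = environment.from_string(raw_template)
        self._metadata = None

    def extract_metadata(self, template_vars: dict) -> dict:
        # Creating a template module is expensive, so do it once and
        # reuse the cached result for every subsequent request.
        if self._metadata is None:
            module = self.template.make_module(vars=template_vars)
            self._metadata = {
                "stop_strings": getattr(module, "stop_strings", []),
                "tool_start": getattr(module, "tool_start", None),
            }

        return self._metadata
```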
* returning stop str if exists from gen
* added chat template for firefunctionv2
* pulling tool vars from template
* adding parsing for tool inputs/outputs
* passing tool data from endpoint to chat template, adding tool_start to the stop list
* loosened typing on the response tool call, leaning more on the user supplying a quality schema if they want a particular format
* non streaming generation prototype
* cleaning template
* Continued work with type, ingestion into template, and chat template for fire func
* Correction: a streaming tool call comes back as a delta object, not inside the chat completion response choice, per chat_completion_chunk.py in the OAI lib.
* Ruff formatting
* Moved stop string and tool updates out of prompt creation func
Updated tool pydantic to match OAI
Support for streaming
Updated generate tool calls to use flag within chat_template and insert tool reminder
* Llama 3.1 chat templates
Updated fire func template
* renamed llama3.1 to chatml_with_headers..
* update name of template
* Support for calling a tool start token rather than the string.
Simplified tool_params
Warning when gen_settings are being overridden because the user set temp to 0
Corrected schema and tools to the correct types for function args (str, for some reason)
* draft groq tool use model template
* changed headers to vars for readability (but mostly because some models are weird about newlines after headers, so this is an easier way to change them globally)
* Clean up comments and code in chat comp
* Post processed tool call to meet OAI spec rather than forcing model to write json in a string in the middle of the call.
* changed example back to args as JSON rather than a string of JSON
* Standardize chat templates to each other
* cleaning/rewording
* stop elements can also be ints (tokens)
* Cleaning/formatting
* added special tokens for tools and tool_response as specified in description
* Cleaning
* removing aux templates - going to live in llm-promp-templates repo instead
* Tree: Format
Signed-off-by: kingbri <bdashore3@proton.me>
* Chat Completions: Don't include internal tool variables in OpenAPI
Use SkipJsonSchema to suppress inclusion in the OpenAPI JSON; a sketch
follows below. The location of these variables may need to be changed
in the future.
Signed-off-by: kingbri <bdashore3@proton.me>
* Templates: Deserialize metadata on template load
Since we're only looking for specific template variables that are
static in the template, it makes more sense to render when the template
is initialized.
Signed-off-by: kingbri <bdashore3@proton.me>
* Tools: Fix comments
Adhere to the format style of comments in the rest of the project.
Signed-off-by: kingbri <bdashore3@proton.me>
---------
Co-authored-by: Ben Gitter <gitterbd@gmail.com>
Signed-off-by: kingbri <bdashore3@proton.me>
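A minimal sketch of the SkipJsonSchema change mentioned in the list above; the request model and field names are placeholders:

```python
# Sketch of hiding internal tool variables from the generated OpenAPI JSON
# with Pydantic's SkipJsonSchema. The model and field names are placeholders.
from typing import Optional

from pydantic import BaseModel
from pydantic.json_schema import SkipJsonSchema


class ChatCompletionRequest(BaseModel):
    messages: list

    # Usable internally, but excluded from the OpenAPI schema
    tool_call_start: SkipJsonSchema[Optional[str]] = None
    tool_call_schema: SkipJsonSchema[Optional[dict]] = None
```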
Embedding models are managed on a separate backend, but are run
in parallel with the model itself. Therefore, manage them in a separate
container with separate routes.
Signed-off-by: kingbri <bdashore3@proton.me>
Use Infinity as a separate backend and handle the model within the
common module. This separates out the embeddings model from the endpoint
which allows for model loading/unloading in core.
Signed-off-by: kingbri <bdashore3@proton.me>
Infinity-emb is an async batching engine for embeddings. This is
preferable to sentence-transformers since it handles scalable use cases
without the need for external thread intervention.
Signed-off-by: kingbri <bdashore3@proton.me>
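A minimal usage sketch, assuming infinity_emb's AsyncEmbeddingEngine/EngineArgs interface; the exact constructor arguments and model name are illustrative and may differ across versions:

```python
# Illustrative infinity_emb usage; constructor details may vary by version.
import asyncio

from infinity_emb import AsyncEmbeddingEngine, EngineArgs


async def embed_texts(texts: list) -> list:
    engine = AsyncEmbeddingEngine.from_args(
        EngineArgs(model_name_or_path="BAAI/bge-small-en-v1.5", engine="torch")
    )

    # The engine batches concurrent requests internally, so no manual
    # thread management is needed on the caller's side.
    async with engine:
        embeddings, usage = await engine.embed(sentences=texts)

    return embeddings


if __name__ == "__main__":
    asyncio.run(embed_texts(["hello world"]))
```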
This is necessary for Kobold's API. Current models use bad_words_ids
in generation_config.json, but for some reason, they're also present
in the model's config.json.
Signed-off-by: kingbri <bdashore3@proton.me>
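A minimal sketch of reading bad_words_ids from either config file; the helper name and lookup order are assumptions:

```python
# Sketch: look for bad_words_ids in generation_config.json first, then fall
# back to the model's config.json. The helper name is illustrative.
import json
import pathlib


def read_bad_words_ids(model_dir: str) -> list:
    model_path = pathlib.Path(model_dir)

    for filename in ("generation_config.json", "config.json"):
        config_file = model_path / filename
        if config_file.exists():
            contents = json.loads(config_file.read_text())
            if "bad_words_ids" in contents:
                return contents["bad_words_ids"]

    return []
```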
Some of the parameters the API provides are aliases for their OAI
equivalents. It makes more sense to move them to the common file.
Signed-off-by: kingbri <bdashore3@proton.me>
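A minimal sketch of accepting an aliased parameter alongside its OAI name with Pydantic; the field and alias names are illustrative:

```python
# Sketch of accepting an aliased parameter alongside its OAI name with
# Pydantic. The field and alias names are illustrative.
from typing import Optional

from pydantic import AliasChoices, BaseModel, Field


class CommonCompletionRequest(BaseModel):
    # "max_tokens" is the OAI name; "max_length" is accepted as an alias
    max_tokens: Optional[int] = Field(
        default=None,
        validation_alias=AliasChoices("max_tokens", "max_length"),
    )
```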
These are faster event loops for asyncio, which should improve overall
performance. Gate them under an experimental flag for now to stress
test the loops.
Signed-off-by: kingbri <bdashore3@proton.me>
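A minimal sketch of gating the loop installation behind the experimental flag; using winloop as the Windows counterpart of uvloop is an assumption:

```python
# Sketch of gating the faster event loops behind an experimental flag.
# Using winloop as the Windows counterpart of uvloop is an assumption.
import platform


def maybe_install_fast_loop(experimental: bool) -> None:
    if not experimental:
        return

    try:
        if platform.system() == "Windows":
            import winloop

            winloop.install()
        else:
            import uvloop

            uvloop.install()
    except ImportError:
        # Fall back to asyncio's default event loop if the package is missing
        pass
```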
Add an API parameter to set the timeout in seconds. Keep it at None
by default for uninterrupted downloads.
Signed-off-by: kingbri <bdashore3@proton.me>
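A minimal sketch of wiring the timeout parameter into the download client; aiohttp as the HTTP client is an assumption, and None means no time limit:

```python
# Sketch of passing an optional timeout (in seconds) to the download client.
# aiohttp as the HTTP client is an assumption; None means no time limit.
from typing import Optional

import aiohttp


def make_download_session(timeout: Optional[float] = None) -> aiohttp.ClientSession:
    client_timeout = aiohttp.ClientTimeout(total=timeout)
    return aiohttp.ClientSession(timeout=client_timeout)
```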
Always enable the core endpoints and allow servers to be selected
as needed. Use the OAI server by default.
Signed-off-by: kingbri <bdashore3@proton.me>
Place OAI-specific routes in the appropriate folder. This is in
preparation for adding new API servers that can be optionally enabled.
Signed-off-by: kingbri <bdashore3@proton.me>
Uvicorn can log a cancellation in both the request disconnect handler
and the CancelledError handler. However, either one can sometimes fail
to fire, so both need to be checked. But don't log twice if one of them
works.
Signed-off-by: kingbri <bdashore3@proton.me>
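A minimal sketch of checking both cancellation paths without double logging; the helper name, flag, and log messages are illustrative:

```python
# Sketch of checking both cancellation paths without logging twice.
# The helper name, flag, and log messages are illustrative.
import asyncio

from fastapi import Request
from loguru import logger


async def run_generation(request: Request, gen_task: asyncio.Task, request_id: str):
    disconnect_logged = False

    try:
        while not gen_task.done():
            # Path 1: the disconnect handler notices the client went away
            if await request.is_disconnected():
                disconnect_logged = True
                logger.info(f"Request {request_id} cancelled by disconnect")
                gen_task.cancel()
                break
            await asyncio.sleep(0.1)

        return await gen_task
    except asyncio.CancelledError:
        # Path 2: only log here if the disconnect handler didn't already
        if not disconnect_logged:
            logger.info(f"Request {request_id} was cancelled")
        raise
```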
Log all the parts of a request if the config flag is set. The logged
fields are all server side anyway, so nothing is being exposed to
clients.
Signed-off-by: kingbri <bdashore3@proton.me>
Middleware runs on both the request and the response. Therefore, streaming
responses had increased latency when processing tasks and sending
data to the client, which resulted in erratic streaming behavior.
Use a dependency (Depends) to add request IDs, since it only executes when
the request is run rather than waiting for the response to be sent as well.
For the future, it would be best to think about limiting the time
between each tick of chunk data to be safe.
Signed-off-by: kingbri <bdashore3@proton.me>
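A minimal sketch of attaching request IDs through a dependency instead of middleware; the dependency name and state attribute are illustrative:

```python
# Sketch of attaching a request ID via a dependency instead of middleware so
# it only runs on the request path. Names are illustrative.
from uuid import uuid4

from fastapi import Depends, FastAPI, Request


def add_request_id(request: Request) -> str:
    request_id = uuid4().hex
    request.state.id = request_id
    return request_id


app = FastAPI(dependencies=[Depends(add_request_id)])


@app.get("/health")
async def health(request: Request):
    # Handlers and log calls can read the ID back from request.state
    return {"request_id": request.state.id}
```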
Identify which request is being processed to help users disambiguate
which logs correspond to which request.
Signed-off-by: kingbri <bdashore3@proton.me>
Place the logic into its proper utility functions and clean up
the code with formatting.
Also, OAI's docs specify that a [DONE] return is needed when everything
is finished.
Signed-off-by: kingbri <bdashore3@proton.me>
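A minimal sketch of terminating the stream with the [DONE] sentinel; the route and chunk contents are illustrative:

```python
# Sketch of ending a streaming completion with the [DONE] sentinel per OAI's
# docs. The route and chunk contents are illustrative.
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


async def stream_chunks(chunks):
    for chunk in chunks:
        yield f"data: {json.dumps(chunk)}\n\n"

    # OAI clients expect a literal [DONE] event once generation finishes
    yield "data: [DONE]\n\n"


@app.post("/v1/completions")
async def completions():
    chunks = [{"choices": [{"text": "Hello"}]}, {"choices": [{"text": " world"}]}]
    return StreamingResponse(stream_chunks(chunks), media_type="text/event-stream")
```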
API keys are not allowed to view all the admin's models, templates,
draft models, loras, etc. Essentially, anything that can be viewed
on the filesystem beyond whatever is currently loaded is not returned
unless an admin key is present.
This change helps preserve user privacy while not erroring out on
list endpoints that the OAI spec requires.
Signed-off-by: kingbri <bdashore3@proton.me>
Pass a request and internally unwrap the headers. In addition, allow
X-admin-key to be checked in an API key request.
Signed-off-by: kingbri <bdashore3@proton.me>
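A minimal sketch of unwrapping auth headers from the request, where an admin key also passes the API key check; header names other than X-admin-key and the key stores are assumptions:

```python
# Sketch of unwrapping auth headers from the request itself, where an admin
# key also satisfies the API key check. The x-api-key header name and key
# stores are assumptions.
from fastapi import HTTPException, Request

VALID_API_KEYS = {"example-api-key"}
VALID_ADMIN_KEYS = {"example-admin-key"}


async def check_api_key(request: Request) -> str:
    api_key = request.headers.get("x-api-key")
    admin_key = request.headers.get("x-admin-key")

    # An admin key is a superset of an API key, so accept either one
    if admin_key in VALID_ADMIN_KEYS:
        return admin_key
    if api_key in VALID_API_KEYS:
        return api_key

    raise HTTPException(401, "Invalid API key")
```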