use_as_default was not being properly applied to model overrides.
For compartmentalization's sake, apply all overrides in a single function
to avoid clutter.
In addition, fix where the traditional /v1/model/load endpoint checks
for draft options. These can be applied via an inline config, so let
any failures fall through.
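Consolidating the override logic might look like the following minimal sketch; the function and key names are illustrative assumptions, not TabbyAPI's actual API:

```python
# Hypothetical sketch: apply all model overrides in one function instead of
# scattering them across the load path. Keys listed under "use_as_default"
# only fill in values the request left unset; other overrides always win.
def apply_overrides(load_params: dict, overrides: dict) -> dict:
    merged = dict(load_params)
    defaults = overrides.get("use_as_default", [])

    for key, value in overrides.items():
        if key == "use_as_default":
            continue
        if key in defaults:
            # Default-only overrides never clobber an explicit request value
            merged.setdefault(key, value)
        else:
            merged[key] = value

    return merged
```

Keeping the merge in one place makes the use_as_default semantics auditable at a glance.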
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
It's useful for the client to know what the T/s and total time for
generation are per-request.
Works with both completions and chat completions.
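The per-request stats boil down to simple arithmetic; this sketch uses illustrative field names, not the exact response schema:

```python
def generation_stats(token_count: int, start_time: float, end_time: float) -> dict:
    """Compute per-request throughput stats (field names are illustrative)."""
    total_time = end_time - start_time
    tokens_per_second = token_count / total_time if total_time > 0 else 0.0
    return {
        "total_time": round(total_time, 2),
        "tokens_per_second": round(tokens_per_second, 2),
    }
```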
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
These added extra complexity and should be removed and replaced
with a single parameter.
Changes:
- /v1/model/load must use model_name and draft_model_name
- /v1/model/embedding/load must use embedding_model_name
- /v1/template/switch must use prompt_template_name
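Illustrative request bodies after the rename (the model and template names here are placeholders):

```python
# Each endpoint now takes a single, explicitly named parameter
load_request = {
    "model_name": "MyModel-exl2",
    "draft_model_name": "MyDraft-exl2",
}
embedding_load_request = {"embedding_model_name": "my-embedder"}
template_switch_request = {"prompt_template_name": "chatml"}
```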
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Matching YALS, if the model has add_bos_token enabled, then remove
an extra BOS token at the start of the prompt. This usually happens
with misconfigured templates such as Llama 3.
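The dedup check can be sketched roughly as follows, assuming the engine prepends BOS itself when add_bos_token is enabled (helper name is hypothetical):

```python
def strip_duplicate_bos(prompt: str, bos_token: str, add_bos_token: bool) -> str:
    """If the engine will prepend BOS, drop one BOS the template already
    baked into the prompt (common with misconfigured Llama 3 templates)."""
    if add_bos_token and bos_token and prompt.startswith(bos_token):
        prompt = prompt[len(bos_token):]
    return prompt
```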
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Tools must be None by default. Chat completion message content can
be None, a string, or a list, so default to None. Exclude all None
values from a CC message since the template can say the variable
"exists" despite being None, causing an error.
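The None-exclusion amounts to filtering the serialized message before it reaches the template; a minimal sketch with a hypothetical helper name:

```python
def message_template_vars(message: dict) -> dict:
    """Drop None-valued keys so a Jinja `is defined` check doesn't see
    variables that 'exist' but are None (illustrative helper)."""
    return {k: v for k, v in message.items() if v is not None}
```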
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Messages were mistakenly being sent as Pydantic objects, but templates
expect dictionaries. Properly convert these before render.
In addition, initialize all Optional lists as an empty list since
this will cause the least problems when interacting with other parts
of API code, such as templates.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
When fetching special tokens from the model, don't factor in the
add_bos_token and ban_eos_token parameters as switches.
In addition, change the internal handling of add_bos_token to an optional
boolean. This allows us to fallback to the model when selecting whether
or not to add the BOS token, especially for chat completions.
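The three-state switch reduces to a None check, sketched here with assumed names:

```python
def resolve_add_bos(request_value, model_default: bool) -> bool:
    """Optional[bool] switch: None means 'defer to the model's config'."""
    return model_default if request_value is None else request_value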
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Since jobs are tracked via request IDs now, each generation task should
be uniquely identified in the event of cancellation.
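A minimal sketch of tagging each generation task, assuming UUIDs are an acceptable request ID format:

```python
import uuid

def new_request_id() -> str:
    """Uniquely tag each generation task so a cancellation can target
    exactly one in-flight job."""
    return str(uuid.uuid4())
```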
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Jobs should be started and immediately cleaned up when calling the
generation stream. Expose a stream_generate function and append
this to the base class since it's more idiomatic than generate_gen.
The exl2 container's generate_gen function is now internal.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
The model card is a unified structure for sharing model params.
Rather than kwargs, use this instead.
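A unified card could look like the following dataclass sketch; the field set here is an assumption for illustration:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class ModelCard:
    """Unified structure for sharing model parameters instead of loose
    kwargs (fields are illustrative)."""
    id: str
    max_seq_len: int = 4096
    prompt_template: Optional[str] = None

card = ModelCard(id="MyModel-exl2", max_seq_len=8192)
```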
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
kwargs is pretty ugly when figuring out which arguments to use. The
base request falls back to defaults anyway, so pass in the params
object as-is.
However, since Python's typing isn't like TypeScript's, where types
can be transformed, the type hinting has a possibility of None showing
up despite there always being a value for some params.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
* Add non-JSON version of `tools` and `functions` to `template_vars`.
Increase compatibility with vLLM templates, which use a non-JSON tools object.
* Add list of tool template variables to the documentation
* Use Jinja templates to provide `tools_json` and `functions_json`
This should be functionally equivalent, but the JSON won't be produced
unless it's needed.
* Make message.tool_calls match the JSON from ToolCallProcessor
* Log something when generating tool calls
* Add template for Qwen QwQ 32b
* Only log if tool calls have been detected
* API: Fix tool call variable assignments
Jinja functions do not run when the variables are referenced, so use
json.dumps instead. In addition, log the request ID when stating that
a tool call was fired.
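Eagerly serializing with json.dumps when building the template vars can be sketched as follows (variable names assumed for illustration):

```python
import json

tools = [
    {"name": "get_weather", "parameters": {"city": "string"}},
]

# Provide both forms: templates can iterate `tools` directly (vLLM style)
# or embed `tools_json`; json.dumps runs eagerly here rather than relying
# on a Jinja-side function call that never fires.
template_vars = {
    "tools": tools,
    "tools_json": json.dumps(tools),
}
```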
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
* Add `ToolCallProcessor.dump()` to get the list of processed dicts
* Remove qwen_qwq_32b.jinja
This will be added to the following repository at a later date:
https://github.com/theroyallab/llm-prompt-templates
---------
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Co-authored-by: kingbri <8082010+kingbri1@users.noreply.github.com>
This shouldn't even be an exposed option since changing it always
breaks inference with the model. Let the model's config.json handle
it.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
* Fix Tool Call JSON Serialization Error
* Incorporate changes from PR 292
kingbri note: Adjusts the tool JSON formation and incorporates finish
reasons. Added both authors as co-authors due to edits on this commit
from the original PR.
Co-authored-by: David Allada <dallada1@vt.edu>
Co-authored-by: Benjamin Oldenburg <benjamin.oldenburg@ordis.co.th>
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
* API: Cleanup tool call JSON parsing
Split pre and post-processing of tool calls to its own class. This
cleans up the chat_completion utility module and also fixes the
JSON serialization bug.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
---------
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Co-authored-by: David Allada <dallada1@vt.edu>
Co-authored-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Infinity expects a list when embedding, so convert to a list if the
input is a string.
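The conversion is a one-liner, sketched with a hypothetical helper name:

```python
def normalize_embedding_input(texts):
    """Infinity's embed call takes a list, so wrap a bare string."""
    return [texts] if isinstance(texts, str) else texts
```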
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
For the TP loader, GPU split cannot be an empty array. However,
defaulting the parameter to an empty array makes it easier to calculate
the device list. Therefore, cast an empty array to None using
falsy comparisons at load time.
Also add draft_gpu_split to the load request.
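The falsy cast described above amounts to the following (function name assumed for illustration):

```python
def resolve_gpu_split(gpu_split):
    """An empty list is convenient for device-list math but invalid for
    the TP loader, so cast falsy values to None at load time."""
    return gpu_split or None
```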
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
The props endpoint is a standard used by llama.cpp-style APIs that
returns various properties of a model. It's still recommended to
use /v1/model to get all the parameters a TabbyAPI model has.
Also include the contents of a prompt template when fetching the current
model.
Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
The strings weren't being concatenated properly. Only add the combined
text if the chat completion type is a List.
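The fixed concatenation can be sketched like this, assuming OAI-style content parts (helper name is hypothetical):

```python
def flatten_message_content(content):
    """Join the text fragments of a list-typed chat completion message;
    string content passes through unchanged."""
    if isinstance(content, list):
        return "".join(
            part.get("text", "") for part in content if part.get("type") == "text"
        )
    return content
```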
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, the flow for parsing chat completion messages and rendering
from the prompt template was disconnected between endpoints. Now, create
a common function to render and handle everything appropriately afterwards.
Signed-off-by: kingbri <bdashore3@proton.me>
Migrate the add method into the class itself. Also, a BaseModel isn't
needed here since this isn't a serialized class.
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, the messages were a list of dicts. These are untyped
and don't provide strict hinting. Add types for chat completion
messages and reformat existing code.
Signed-off-by: kingbri <bdashore3@proton.me>
* More robust checks for OAI chat completion message lists on /v1/encode endpoint
* Added TODO to support other aspects of chat completions
* Fix oversight where embeddings was not defined in advance on /v1/chat/completions endpoint
* Support image_url inputs containing URLs or base64 strings following OAI vision spec
* Use async lru cache for image embeddings
* Add generic wrapper class for multimodal embeddings
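Dispatching on the two input forms from the OAI vision spec can be sketched as follows; the helper name and return values are illustrative:

```python
def classify_image_url(image_url: str) -> str:
    """Per the OAI vision spec, image_url may be a remote URL or a
    base64 data URI; dispatch on the scheme."""
    return "base64" if image_url.startswith("data:") else "url"
```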