Commit graph

205 commits

Author SHA1 Message Date
kingbri
43f9483bc4 Model: Add tensor_parallel_backend option
This allows for users to use nccl or native depending on the GPU setup.
NCCL is only available with Linux built wheels.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-08-17 22:35:10 -04:00
kingbri
0b4ca567f8 API: Persist request IDs and append full_text to finish chunk
Adding these to each generation chunk helps remove redundancy and
unecessary request ID operations.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-25 12:27:44 -04:00
kingbri
707d005aad API: Default tool call ID and type
Doing this helps reduce the model's burden of generating the tool
call ID and type (which is always "function"). Follow mistral's spec
for tool call IDs by using a 9 character alphanumeric string.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-11 01:11:09 -04:00
kingbri
5b1db3ad83 API: Don't do a second re-render when tool calling
Re-rendering the template is an expensive operation when it's possible
to just concatenate the prompt and current generation text together.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-06 11:32:36 -04:00
kingbri
3dfa965019 API: Add tool_call_id for role = tool
If a message with role = tool is present, the tool_call_id should
also be given to the template.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-05 21:52:58 -04:00
kingbri
879f4cee7e API: Modify tool calling for wider compat
When revisiting tool calls, the formats have more or less become standard.
For greater compatibility with templates, primarily use the message.tools
parameter and remove the extra custom metadata that is no longer required.

However, unlike other backends, tabbyAPI still uses template metadata
to declare what the tool start string is. This allows for template-level
customization along with giving more power to the user while the server
exists to consume rather than work on a case-by-case basis.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-05 14:28:12 -04:00
kingbri
b6a26da50c API: Fix tool call serialization
To render in the template, tool call start tokens needed to have less
checks and remove the line to convert message.tool_calls to a dict
since that breaks the rest of the chain by disconnecting the types.
model_dump on the message itself already accomplishes this.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-04 15:02:49 -04:00
kingbri
d23fefbecd API + Model: Fix application of defaults
use_as_default was not being properly applied into model overrides.
For compartmentalization's sake, apply all overrides in a single function
to avoid clutter.

In addition, fix where the traditional /v1/model/load endpoint checks
for draft options. These can be applied via an inline config, so let
any failures fallthrough.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-03 14:37:34 -04:00
kingbri
2913ce29fc API: Add timings to usage stats
It's useful for the client to know what the T/s and total time for
generation are per-request.

Works with both completions and chat completions.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-17 22:54:51 -04:00
kingbri
a3c780ae58 API: Core: Remove load/template aliases
These added extra complexity and should be removed and replaced
with a single parameter.

Changes:
- /v1/model/load must use model_name and draft_model_name
- /v1/model/embedding/load must use embedding_model_name
- /v1/template/switch must use prompt_template_name

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-13 14:57:24 -04:00
kingbri
2d89c96879 API: Re-add BOS token stripping in template render
Matching YALS, if the model has add_bos_token enabled, then remove
an extra BOS token at the start of the prompt. This usually happens
with misconfigured templates such as Llama 3.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-24 21:11:53 -04:00
kingbri
10fbe043a4 API: Fix typing for chat templates in CC requests
Tools must be None by default. Chat completion message content can
be None, a string, or a list, so default to None. Exclude all None
values from a CC message since the template can say the variable
"exists" despite being None, causing an error.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-24 21:06:05 -04:00
kingbri
54b8a20a19 API: Fix types for chat completions
Messages were mistakenly being sent as Pydantic objects, but templates
expect dictionaries. Properly convert these before render.

In addition, initialize all Optional lists as an empty list since
this will cause the least problems when interacting with other parts
of API code, such as templates.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 18:10:34 -04:00
kingbri
0858b6d4b2 Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 00:46:40 -04:00
kingbri
7900b72848 API: Add chat_template_kwargs alias for template_vars
This key is used in VLLM and SGLang.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 15:48:39 -04:00
kingbri
8996dc7b02 API: Add default for backend in model load request
Should be None so pydantic doesn't complain.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 09:51:09 -04:00
Brian
b555eeb6e7
Merge pull request #339 from Maaaxiii/fix/tool-calling-embeddings
fix: Aligned Parameter Name in chat completions generate_tool_calls
2025-05-11 20:41:58 -04:00
kingbri
f4adca1f3e API: Remove default fallback from backend param
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-11 09:56:53 -04:00
kingbri
6379081dd8 Sampling: Make add_bos_token override concise
Also set the default to None so text completions follows the same
pattern.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-10 19:07:35 -04:00
Brian
527afc206b
Merge pull request #329 from DocShotgun/exl3
Exllamav3 cache quantization
2025-05-08 23:11:45 -04:00
Maximilian Klem
22f7f1e1ec fix: flipped parameter name with variable name 2025-05-07 21:04:30 +02:00
kingbri
bc0a84241a API: Patch kobold generation call
Calling the model requires different args now.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-05 22:11:21 -04:00
DocShotgun
45b966363e Tree: Format 2025-05-03 21:01:03 -07:00
turboderp
036af02bf6 Common: No default add_bos_token value for chat completion requests 2025-05-04 05:25:58 +02:00
kingbri
7c6a053747 Model: Add option to select backend
Changing the backend switches the container that's used.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:32:39 -04:00
kingbri
aa657fa6e9 API: Ignore add_bos_token in chat completions
When fetching special tokens from the model, don't factor in the
add_bos_token and ban_eos_token parameters as switches.

In addition, change the internal handling of add_bos_token to an optional
boolean. This allows us to fallback to the model when selecting whether
or not to add the BOS token, especially for chat completions.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-01 22:51:15 -04:00
kingbri
3960612d38 API: Format and fix message naming
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-28 22:36:30 -04:00
kingbri
9157be3e34 API: Append task index to generations with n > 1
Since jobs are tracked via request IDs now, each generation task should
be uniquely identified in the event of cancellation.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-28 22:29:48 -04:00
kingbri
3649d3bb51 Tree: Format + Lint
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-26 02:14:30 -04:00
kingbri
f070587e9f Model: Add proper jobs cleanup and fix var calls
Jobs should be started and immediately cleaned up when calling the
generation stream. Expose a stream_generate function and append
this to the base class since it's more idiomatic than generate_gen.

The exl2 container's generate_gen function is now internal.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-24 21:30:55 -04:00
kingbri
3f09fcd8c9 Model: Make model params return a model card
The model card is a unified structure for sharing model params.
Rather than kwargs, use this instead.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-21 23:15:46 -04:00
kingbri
3084ef9fa1 Model + API: Migrate to use BaseSamplerParams
kwargs is pretty ugly when figuring out which arguments to use. The
base requests falls back to defaults anyways, so pass in the params
object as is.

However, since Python's typing isn't like TypeScript where types
can be transformed, the type hinting has a possiblity of None showing
up despite there always being a value for some params.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-16 00:50:05 -04:00
Andrew Phillips
436ce752da
Support more common tool variables in templates (tools, message.tool_calls) (#308)
* Add non-JSON version of `tools` and `functions` to `template_vars`.

Increase the compatibility with VLLM templates which use a non-JSON tools object.

* Add list of tool template variables to the documentation

* Use Jinja templates to provide `tools_json` and `functions_json`

This should be functionally equivelant, but the JSON won't be produced
unless it's needed.

* Make message.tool_calls match the JSON from ToolCallProcessor

* Log something when generating tool calls

* Add template for Qwen QwQ 32b

* Only log if tool calls have been detected

* API: Fix tool call variable assignments

Jinja functions do not run when variables are called. Use json.dumps
instead. In addition, log the request ID when stating that a tool
call was fired.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>

* Add `ToolCallProcessor.dump()` to get the list of processed dicts

* Remove qwen_qwq_32b.jinja

This will be added to the following repository at a later date:
https://github.com/theroyallab/llm-prompt-templates

---------

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Co-authored-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-23 13:23:00 -04:00
kingbri
79f9c6e854 Model: Remove num_experts_per_token
This shouldn't even be an exposed option since changing it always
breaks inference with the model. Let the model's config.json handle
it.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-19 11:52:10 -04:00
Benjamin Oldenburg
a20abe2d33
Bugfix: Chat completion requests fail with UnboundLocalError: finish_reason variable not initialized (#307)
* fix issue #306

* removed whitespaces for ruff
2025-03-15 20:31:21 -04:00
kingbri
d98c0bd3f6 API: Add tools class
Was mistakenly not added in PR 302.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-14 15:07:11 -04:00
Benjamin Oldenburg
a2a14ea114
Fix Tool Call JSON Serialization Error (#302)
* Fix Tool Call JSON Serialization Error

* Incorporate changes from PR 292

kingbri note: Adjusts the tool JSON formation and incorporates finish
reasons. Added both authors as co-authors due to edits on this commit
from the original PR.

Co-Authored-by: David Allada <dallada1@vt.edu>
Co-Authored-by: Benjamin Oldenburg <benjamin.oldenburg@ordis.co.th>
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>

* API: Cleanup tool call JSON parsing

Split pre and post-processing of tool calls to its own class. This
cleans up the chat_completion utility module and also fixes the
JSON serialization bug.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>

---------

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Co-authored-by: David Allada <dallada1@vt.edu>
Co-authored-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-14 15:01:33 -04:00
kingbri
35fe372f2b Embeddings: Handle case if embedding input is passed as a string
Infinity expects a list when embedding, so convert to a list if the
input is a string.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-23 00:39:21 -05:00
kingbri
9f649647f0 Model + API: GPU split updates and fixes
For the TP loader, GPU split cannot be an empty array. However,
defaulting the parameter to an empty array makes it easier to calculate
the device list. Therefore, cast an empty array to None using
falsy comparisons at load time.

Also add draft_gpu_split to the load request.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-15 21:50:14 -05:00
kingbri
e290b88568 Args: Expose api-servers to subcommands
This is required for the export-openapi action.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-10 23:39:46 -05:00
kingbri
6da65a8fd3 Embeddings: Fix base64 return
A base64 embedding can be a string post-encoding.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2025-01-01 16:15:12 -05:00
kingbri
7878d351a7 Endpoints: Add props endpoint and add more values to model params
The props endpoint is a standard used by llamacpp APIs which returns
various properties of a model to a server. It's still recommended to
use /v1/model to get all the parameters a TabbyAPI model has.

Also include the contents of a prompt template when fetching the current
model.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-26 17:32:19 -05:00
kingbri
fa8035ef72 Dependencies: Update sse-starlette and formatron
Also pin newer versions of dependencies and fix an import from sse-starlette

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-21 23:14:55 -05:00
Brian
fe44e4a524
Merge pull request #253 from randoentity/workaround-toolcall
workaround for tool calling
2024-11-28 23:30:00 -05:00
kingbri
2e06fb01d3 OAI: Pass mm_embeddings to tool call generation
Don't exclude the vision embeddings when regenerating for a tool call.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-11-28 23:27:59 -05:00
Brian
b81dcdaf66
Merge pull request #232 from AlpinDale/serviceinfo_uri
feat: add serviceinfo URI
2024-11-28 23:19:52 -05:00
kingbri
5fadaa728a API: Move serviceinfo to core
Best to expose this endpoint to all APIs as its an information endpoint.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-11-28 23:07:58 -05:00
randoentity
a52610fb19 workaround for tool calling 2024-11-24 13:40:33 +01:00
kingbri
388d36e6bd OAI: Fix chat completion list parsing
The strings weren't being concatenated properly. Only add the combined
text if the chat completion type is a List.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-11-22 17:30:29 -05:00
kingbri
902045edbb API: Fix chat completion formatting flow
Previously, the flow for parsing chat completion messages and rendering
from the prompt template was disconnected between endpoints. Now, create
a common function to render and handle everything appropriately afterwards.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-11-21 17:51:14 -05:00