Commit graph

1050 commits

Author SHA1 Message Date
AUTOMATIC
056527ceb3 add logprobs support for exl3 2025-08-03 11:42:32 +03:00
Brian
03d72a37be
Merge pull request #371 from DocShotgun/main
Config: Remove developer arg cuda_malloc_backend
2025-08-01 14:02:57 -04:00
DocShotgun
102af306e5 Config: Remove developer arg cuda_malloc_backend
* cudaMallocAsync is now enabled by default on supported configurations
2025-08-01 10:59:13 -07:00
kingbri
113643c0df Main: Enable cudaMallocAsync backend by default
Works on CUDA 12.4 and up. If CUDA doesn't exist, then don't enable
the backend. This is an env var that needs to be set, so it's not really
possible to set it via config.yml.

This used to be experimental, but it's probably fine to keep it enabled
since it only provides a benefit.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-27 22:31:38 -04:00
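A minimal sketch of how such an opt-in might look (the helper name is hypothetical; `PYTORCH_CUDA_ALLOC_CONF` is PyTorch's real allocator env var):

```python
import os

def enable_cuda_malloc_async() -> None:
    """Hypothetical helper: opt into cudaMallocAsync on CUDA >= 12.4."""
    try:
        import torch
    except ImportError:
        return  # no torch, nothing to configure

    if not torch.cuda.is_available() or torch.version.cuda is None:
        return  # CUDA doesn't exist, so don't enable the backend

    major, minor, *_ = (int(p) for p in torch.version.cuda.split("."))
    if (major, minor) >= (12, 4):
        # Must be set before the allocator initializes, hence an env var
        # rather than a config.yml option.
        os.environ.setdefault(
            "PYTORCH_CUDA_ALLOC_CONF", "backend:cudaMallocAsync"
        )
```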
kingbri
0b4ca567f8 API: Persist request IDs and append full_text to finish chunk
Adding these to each generation chunk helps remove redundancy and
unnecessary request ID operations.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-25 12:27:44 -04:00
kingbri
e77fa0b7a8 Docs: Edit inline loading for breaking changes
Add the model key for the YAML examples.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-24 18:11:42 -04:00
kingbri
ab04a6ed60 Dependencies: Bump ExllamaV3
v0.0.5

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-18 22:56:35 -04:00
kingbri
bf936f5c39 Dependencies: Update exllamav2
v0.3.2

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-13 23:33:12 -04:00
Brian
2419d2d0a3
Merge pull request #364 from theroyallab/tool-calls
Streamline tool calling
2025-07-11 11:34:10 -04:00
kingbri
707d005aad API: Default tool call ID and type
Doing this helps reduce the model's burden of generating the tool
call ID and type (which is always "function"). Follow Mistral's spec
for tool call IDs by using a 9-character alphanumeric string.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-11 01:11:09 -04:00
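A sketch of generating such an ID (the function name is illustrative, not the actual tabbyAPI helper):

```python
import secrets
import string

_ALPHANUMERIC = string.ascii_letters + string.digits

def default_tool_call_id() -> str:
    # 9-character alphanumeric string, matching Mistral's tool call ID format
    return "".join(secrets.choice(_ALPHANUMERIC) for _ in range(9))
```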
kingbri
5b1db3ad83 API: Don't do a second re-render when tool calling
Re-rendering the template is an expensive operation when it's possible
to just concatenate the prompt and current generation text together.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-06 11:32:36 -04:00
kingbri
3dfa965019 API: Add tool_call_id for role = tool
If a message with role = tool is present, the tool_call_id should
also be given to the template.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-05 21:52:58 -04:00
kingbri
1c3f84151f Docs: Update tool calling
For new variables and format.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-05 21:43:04 -04:00
kingbri
871f71c4e7 Templates: Adjust tool call example
Use the new tool call variables and formatting. Also prettify the template.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-05 21:42:23 -04:00
kingbri
879f4cee7e API: Modify tool calling for wider compat
When revisiting tool calls, the formats have more or less become standard.
For greater compatibility with templates, primarily use the message.tools
parameter and remove the extra custom metadata that is no longer required.

However, unlike other backends, tabbyAPI still uses template metadata
to declare what the tool start string is. This allows for template-level
customization and gives more power to the user, while the server
consumes templates rather than handling models case by case.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-05 14:28:12 -04:00
kingbri
b6a26da50c API: Fix tool call serialization
To render in the template, tool call start tokens needed fewer
checks. Also remove the line that converts message.tool_calls to a dict,
since that breaks the rest of the chain by disconnecting the types;
model_dump on the message itself already accomplishes this.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-04 15:02:49 -04:00
kingbri
d23fefbecd API + Model: Fix application of defaults
use_as_default was not being properly applied to model overrides.
For compartmentalization's sake, apply all overrides in a single function
to avoid clutter.

In addition, fix where the traditional /v1/model/load endpoint checks
for draft options. These can be applied via an inline config, so let
any failures fall through.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-03 14:37:34 -04:00
kingbri
d339139fb6 Config: Deep merge model overrides
Anything below the first level of kwargs was not being merged properly.
A more bulletproof solution would be to refactor the loading code
to separate draft and normal model parameters.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-03 12:17:09 -04:00
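A generic deep merge along these lines (a sketch, not the project's actual implementation):

```python
def deep_merge(base: dict, overrides: dict) -> dict:
    """Recursively merge overrides into base.

    Nested dicts are merged key by key, so values below the first level
    of kwargs survive; anything else is replaced wholesale.
    """
    merged = dict(base)
    for key, value in overrides.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged
```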
kingbri
0152a1665b Downloader: Switch to use API sizes
Rather than relying on Content-Length which can be unreliable, ping
the API to get file sizes and work from there.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-30 12:49:53 -04:00
kingbri
03ff4c3128 Downloader: Handle if Content-Length is undefined
Usually, both the client and server are aware of the file size via
the Content-Length header. However, HuggingFace has changed its
headers and no longer always sends Content-Length.

In this case, show an indeterminate progress bar and mark the download
as complete once it finishes.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-30 11:43:22 -04:00
turboderp
0ae878712e Exl3: Clear image embedding cache on unload 2025-06-25 23:56:21 +02:00
Brian
e362319a4d
Merge pull request #358 from theroyallab/breaking
Breaking changes for configuration
2025-06-17 23:10:16 -04:00
kingbri
a02d39de31 Model: Remove rogue print
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-17 23:09:07 -04:00
kingbri
2913ce29fc API: Add timings to usage stats
It's useful for the client to know what the T/s and total generation
time are per request.

Works with both completions and chat completions.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-17 22:54:51 -04:00
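The per-request math is simple; a sketch with illustrative field names (the actual usage schema may differ):

```python
def generation_timings(token_count: int, elapsed_s: float) -> dict:
    """Per-request stats appended to the usage block.

    Field names here are illustrative, not the exact API schema.
    """
    tps = token_count / elapsed_s if elapsed_s > 0 else 0.0
    return {
        "total_time_s": round(elapsed_s, 3),
        "tokens_per_second": round(tps, 2),
    }
```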
kingbri
5d94d4d022 Merge branch 'main' into breaking 2025-06-17 22:24:32 -04:00
turboderp
122d87ac36 Tree: Format 2025-06-15 19:33:14 +02:00
turboderp
21c5af48e1 Tree: Format 2025-06-15 19:30:38 +02:00
turboderp
1c9891bf04 Exl3: Add vision capability 2025-06-15 19:22:51 +02:00
turboderp
4605c0f6bd Common: Refactor get_image to common functions 2025-06-15 19:20:36 +02:00
turboderp
d357f100d0 Dependencies: Bump ExllamaV3 2025-06-15 19:12:45 +02:00
turboderp
a0c16bba2a Exl2: Fix banned_strings (move outside of assign_gen_params) 2025-06-15 16:51:42 +02:00
kingbri
2096c9bad2 Model: Default max_seq_len to 4096
A common problem in TabbyAPI is that users who want to get up and
running with a model often hit OOMs caused by max_seq_len. This is
because model devs set max context values in the millions, which
requires a lot of VRAM.

To idiot-proof first time setup, make the fallback default 4096 so
users can run their models. If a user still wants to use the model's
max_seq_len, set it to -1.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-13 14:57:24 -04:00
kingbri
322f9b773a Model: Migrate inline config to new format
This matches config.yml and all model overrides should go under the
"model" block.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-13 14:57:24 -04:00
kingbri
a3c780ae58 API: Core: Remove load/template aliases
These added extra complexity and should be removed and replaced
with a single parameter.

Changes:
- /v1/model/load must use model_name and draft_model_name
- /v1/model/embedding/load must use embedding_model_name
- /v1/template/switch must use prompt_template_name

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-13 14:57:24 -04:00
kingbri
0ea56382f0 Dependencies: Fix unsupported dependency error
Log the package name provided to the check function.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-13 14:57:02 -04:00
kingbri
f4ee56ba13 Update README
Include ExllamaV3

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-13 14:57:01 -04:00
turboderp
691a080ac7 Dependencies: Bump ExllamaV3 and ExllamaV2 2025-05-31 23:55:04 +02:00
kingbri
2d89c96879 API: Re-add BOS token stripping in template render
Matching YALS, if the model has add_bos_token enabled, then remove
an extra BOS token at the start of the prompt. This usually happens
with misconfigured templates such as Llama 3.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-24 21:11:53 -04:00
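A sketch of the stripping step (string-level for clarity; the real code may operate on token IDs):

```python
def strip_duplicate_bos(prompt: str, bos_token: str, add_bos_token: bool) -> str:
    """If the tokenizer will prepend BOS anyway, drop one copy that a
    misconfigured template (e.g. Llama 3) already baked into the prompt."""
    if add_bos_token and bos_token and prompt.startswith(bos_token):
        return prompt[len(bos_token):]
    return prompt
```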
kingbri
10fbe043a4 API: Fix typing for chat templates in CC requests
Tools must be None by default. Chat completion message content can
be None, a string, or a list, so default to None. Exclude all None
values from a CC message since the template can say the variable
"exists" despite being None, causing an error.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-24 21:06:05 -04:00
kingbri
0c4cc1eba3 Model: Add prompt logging to ExllamaV3
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 22:05:18 -04:00
Brian
729caaeddc
Merge pull request #346 from gakada/main
Exl3: some models aren't functional without add_bos?
2025-05-17 22:05:15 -04:00
kingbri
0646d358a2 Main: Log auth and sampler overrides after model load
Like YALS, logging all pertinent information after model load makes
it easier to parse by the user.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 18:10:34 -04:00
kingbri
54b8a20a19 API: Fix types for chat completions
Messages were mistakenly being sent as Pydantic objects, but templates
expect dictionaries. Properly convert these before render.

In addition, initialize all Optional lists as empty lists, since
this causes the fewest problems when interacting with other parts
of API code, such as templates.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 18:10:34 -04:00
gakada
ba6248eec0
Exl3: fix add_bos in generator 2025-05-17 19:10:49 +09:00
Brian
81170eee00
Merge pull request #312 from davidallada/add-file-based-logging
Add file based logging in addition to the normal console logs
2025-05-17 01:24:19 -04:00
kingbri
17f3dca6fc Packaging: Add agnostic method to check version of packages
Some packages such as ExllamaV2 and V3 require specific versions for
the latest features. Rather than creating repetitive functions, create
an agnostic function that checks the installed package and tells
the user to upgrade.

This is also sent to requests for loading and unloading, so keep the
error short.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 01:04:24 -04:00
kingbri
084916c04f Model: Fix autosplit reserve crash with GPU split
ExllamaV3 does not accept autosplit_reserve and gpu_split at the same
time.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 00:51:14 -04:00
kingbri
0858b6d4b2 Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 00:46:40 -04:00
kingbri
fa534fe551 Dependencies: Update Ruff
v0.11.10

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 00:46:25 -04:00
kingbri
390daeb92f Model: Create universal HFModel class
The HFModel class serves to coalesce all config files that contain
random keys which are required for model usage.

Adding this base class allows us to expand as HuggingFace randomly
changes their JSON schemas over time, reducing the burden on backend
devs when the next model isn't supported.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-13 18:12:38 -04:00
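A minimal sketch of what such a coalescing class might look like (the file list and merge order are assumptions, not tabbyAPI's actual implementation):

```python
import json
from pathlib import Path

class HFModel:
    """Coalesce the assorted HuggingFace config files into one mapping,
    so backend code has a single place to look up required keys."""

    CONFIG_FILES = ("config.json", "generation_config.json", "tokenizer_config.json")

    def __init__(self, model_dir: str):
        self.config: dict = {}
        for name in self.CONFIG_FILES:
            path = Path(model_dir) / name
            if path.exists():
                # Later files override earlier ones on key collisions
                self.config.update(json.loads(path.read_text()))
```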