Commit graph

1033 commits

Author SHA1 Message Date
kingbri
d339139fb6 Config: Deep merge model overrides
Anything below the first level of kwargs was not being merged properly.
A more bulletproof solution would be to refactor the loading code
to separate draft and normal model parameters.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-03 12:17:09 -04:00
kingbri
0152a1665b Downloader: Switch to use API sizes
Rather than relying on Content-Length which can be unreliable, ping
the API to get file sizes and work from there.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-30 12:49:53 -04:00
kingbri
03ff4c3128 Downloader: Handle if Content-Length is undefined
Usually, the client and server both are aware of the file size by
sending a Content-Length header. However, HuggingFace has changed
their headers and now does not always send Content-Length.

In this case, show an indeterminate progressbar and mark as complete
once the download finishes.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-30 11:43:22 -04:00
turboderp
0ae878712e Exl3: Clear image embedding cache on unload 2025-06-25 23:56:21 +02:00
Brian
e362319a4d
Merge pull request #358 from theroyallab/breaking
Breaking changes for configuration
2025-06-17 23:10:16 -04:00
kingbri
a02d39de31 Model: Remove rogue print
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-17 23:09:07 -04:00
kingbri
2913ce29fc API: Add timings to usage stats
It's useful for the client to know what the T/s and total time for
generation are per-request.

Works with both completions and chat completions.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-17 22:54:51 -04:00
kingbri
5d94d4d022 Merge branch 'main' into breaking 2025-06-17 22:24:32 -04:00
turboderp
122d87ac36 Tree: Format 2025-06-15 19:33:14 +02:00
turboderp
21c5af48e1 Tree: Format 2025-06-15 19:30:38 +02:00
turboderp
1c9891bf04 Exl3: Add vision capability 2025-06-15 19:22:51 +02:00
turboderp
4605c0f6bd Common: Refactor get_image to common functions 2025-06-15 19:20:36 +02:00
turboderp
d357f100d0 Dependencies: Bump ExllamaV3 2025-06-15 19:12:45 +02:00
turboderp
a0c16bba2a Exl2: Fix banned_strings (move outside of assign_gen_params) 2025-06-15 16:51:42 +02:00
kingbri
2096c9bad2 Model: Default max_seq_len to 4096
A common problem in TabbyAPI is that users who want to get up and
running with a model always had issues with max_seq_len causing OOMs.
This is because model devs set max context values in the millions which
requires a lot of VRAM.

To idiot-proof first time setup, make the fallback default 4096 so
users can run their models. If a user still wants to use the model's
max_seq_len, set it to -1.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-13 14:57:24 -04:00
kingbri
322f9b773a Model: Migrate inline config to new format
This matches config.yml and all model overrides should go under the
"model" block.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-13 14:57:24 -04:00
kingbri
a3c780ae58 API: Core: Remove load/template aliases
These added extra complexity and should be removed and replaced
with a single parameter.

Changes:
- /v1/model/load must use model_name and draft_model_name
- /v1/model/embedding/load must use embedding_model_name
- /v1/template/switch must use prompt_template_name

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-13 14:57:24 -04:00
kingbri
0ea56382f0 Dependencies: Fix unsupported dependency error
Log the package name provided to the check function.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-13 14:57:02 -04:00
kingbri
f4ee56ba13 Update README
Include ExllamaV3

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-13 14:57:01 -04:00
turboderp
691a080ac7 Dependencies: Bump ExllamaV3 and ExllamaV2 2025-05-31 23:55:04 +02:00
kingbri
2d89c96879 API: Re-add BOS token stripping in template render
Matching YALS, if the model has add_bos_token enabled, then remove
an extra BOS token at the start of the prompt. This usually happens
with misconfigured templates such as Llama 3.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-24 21:11:53 -04:00
kingbri
10fbe043a4 API: Fix typing for chat templates in CC requests
Tools must be None by default. Chat completion message content can
be None, a string, or a list, so default to None. Exclude all None
values from a CC message since the template can say the variable
"exists" despite being None, causing an error.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-24 21:06:05 -04:00
kingbri
0c4cc1eba3 Model: Add prompt logging to ExllamaV3
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 22:05:18 -04:00
Brian
729caaeddc
Merge pull request #346 from gakada/main
Exl3: some models aren't functional without add_bos?
2025-05-17 22:05:15 -04:00
kingbri
0646d358a2 Main: Log auth and sampler overrides after model load
Like YALS, logging all pertinent information after model load makes
it easier to parse by the user.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 18:10:34 -04:00
kingbri
54b8a20a19 API: Fix types for chat completions
Messages were mistakenly being sent as Pydantic objects, but templates
expect dictionaries. Properly convert these before render.

In addition, initialize all Optional lists as an empty list since
this will cause the least problems when interacting with other parts
of API code, such as templates.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 18:10:34 -04:00
gakada
ba6248eec0
Exl3: fix add_bos in generator 2025-05-17 19:10:49 +09:00
Brian
81170eee00
Merge pull request #312 from davidallada/add-file-based-logging
Add file based logging in addition to the normal console logs
2025-05-17 01:24:19 -04:00
kingbri
17f3dca6fc Packaging: Add agnostic method to check version of packages
Some packages such as ExllamaV2 and V3 require specific versions for
the latest features. Rather than creating repetitive functions, create
an agnostic function to check the installed package and then report
to the user to upgrade.

This is also sent to requests for loading and unloading, so keep the
error short.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 01:04:24 -04:00
kingbri
084916c04f Model: Fix autosplit reserve crash with GPU split
ExllamaV3 does not accept autosplit_reserve and gpu_split at the same
time.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 00:51:14 -04:00
kingbri
0858b6d4b2 Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 00:46:40 -04:00
kingbri
fa534fe551 Dependencies: Update Ruff
v0.11.10

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 00:46:25 -04:00
kingbri
390daeb92f Model: Create universal HFModel class
The HFModel class serves to coalesce all config files that contain
random keys which are required for model usage.

Adding this base class allows us to expand as HuggingFace randomly
changes their JSON schemas over time, reducing the brunt that backend
devs need to feel when their next model isn't supported.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-13 18:12:38 -04:00
kingbri
7900b72848 API: Add chat_template_kwargs alias for template_vars
This key is used in VLLM and SGLang.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 15:48:39 -04:00
kingbri
c9dc0b2aa4 Dependencies: Bump ExllamaV3 and ExllamaV2
v0.0.2 and v0.3.0 respectively

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 15:29:31 -04:00
kingbri
bd3fec929c Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 11:32:27 -04:00
kingbri
a524ac3c0f Model: Fix cache mode again
If statements can be difficult to work with.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 11:30:47 -04:00
kingbri
20cad851e9 Model: Fix param call
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 09:52:28 -04:00
kingbri
d15eb55f20 Model: Fix exl2 cache mode check
FP16 was not included in the validation step.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 09:51:09 -04:00
kingbri
8996dc7b02 API: Add default for backend in model load request
Should be None so pydantic doesn't complain.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 09:51:09 -04:00
Brian
b555eeb6e7
Merge pull request #339 from Maaaxiii/fix/tool-calling-embeddings
fix: Aligned Parameter Name in chat completions generate_tool_calls
2025-05-11 20:41:58 -04:00
kingbri
f4adca1f3e API: Remove default fallback from backend param
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-11 09:56:53 -04:00
Brian
3674d7b9b5
Merge pull request #341 from theroyallab/exl3
Exl3
2025-05-10 23:43:02 -04:00
kingbri
6379081dd8 Sampling: Make add_bos_token override concise
Also set the default to None so text completions follows the same
pattern.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-10 19:07:35 -04:00
kingbri
656af41b5d Model: Always enable decode_special_tokens
The frontend should handle the special tokens if they get emitted.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-09 22:25:50 -04:00
kingbri
83826b56be Main: Remove unnecessary import
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-09 22:14:11 -04:00
kingbri
42346c6b39 Sampling: Remove skip_special_tokens
This parameter is way too confusing and does not make sense in
the modern LLM space.

Change approved by all maintainers.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-09 22:11:33 -04:00
kingbri
25c77ebf77 Model: Remove exllamav2-specific version check
No longer necessary thanks to the agnostic check.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-09 22:08:15 -04:00
kingbri
48ea1737cf Startup: Check agnostically for inference deps
If an inference dep isn't present, force exit the application. This
occurs after all subcommands have been appropriately processed.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-09 21:59:00 -04:00
kingbri
33ac016023 Dependencies: Add ExllamaV3
v0.0.1

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-09 21:42:07 -04:00