Commit graph

265 commits

kingbri
0b4ca567f8 API: Persist request IDs and append full_text to finish chunk
Adding these to each generation chunk helps remove redundancy and
unnecessary request ID operations.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-25 12:27:44 -04:00
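A minimal sketch of the chunk shape this commit describes; the field
and function names are assumptions, not the actual TabbyAPI schema:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GenerationChunk:
    """Hypothetical shape of a streamed generation chunk."""

    request_id: str                   # persisted on every chunk
    text: str = ""                    # incremental delta for this chunk
    finish_reason: Optional[str] = None
    full_text: Optional[str] = None   # only populated on the finish chunk


def make_finish_chunk(request_id: str, pieces: list[str], reason: str) -> GenerationChunk:
    # The finish chunk carries the concatenated text, so clients no longer
    # need to accumulate deltas or re-derive the request ID per chunk.
    return GenerationChunk(
        request_id=request_id,
        finish_reason=reason,
        full_text="".join(pieces),
    )
```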
kingbri
a02d39de31 Model: Remove rogue print
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-17 23:09:07 -04:00
kingbri
2913ce29fc API: Add timings to usage stats
It's useful for the client to know the tokens-per-second (T/s) and
total generation time per request.

Works with both completions and chat completions.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-17 22:54:51 -04:00
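A sketch of the per-request timing math; only the T/s and total-time
arithmetic comes from the commit, the names are assumptions:

```python
import time


def build_timings(start_time: float, completion_tokens: int) -> dict:
    """Compute total generation time and tokens/sec for one request."""
    total_time = time.time() - start_time
    tokens_per_second = completion_tokens / total_time if total_time > 0 else 0.0
    return {
        "completion_tokens": completion_tokens,
        "total_time": total_time,                # seconds for this request
        "tokens_per_second": tokens_per_second,  # the T/s figure
    }
```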
kingbri
5d94d4d022 Merge branch 'main' into breaking 2025-06-17 22:24:32 -04:00
turboderp
1c9891bf04 Exl3: Add vision capability 2025-06-15 19:22:51 +02:00
turboderp
4605c0f6bd Common: Refactor get_image to common functions 2025-06-15 19:20:36 +02:00
turboderp
a0c16bba2a Exl2: Fix banned_strings (move outside of assign_gen_params) 2025-06-15 16:51:42 +02:00
kingbri
2096c9bad2 Model: Default max_seq_len to 4096
A common problem in TabbyAPI is that users who want to get up and
running with a model hit OOMs caused by max_seq_len. This is because
model devs set max context values in the millions, which requires a
lot of VRAM.

To idiot-proof first-time setup, make the fallback default 4096 so
users can run their models. If a user still wants to use the model's
max_seq_len, they can set it to -1.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-13 14:57:24 -04:00
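The fallback described above, sketched as a helper (the function and
parameter names are assumptions):

```python
DEFAULT_MAX_SEQ_LEN = 4096


def resolve_max_seq_len(configured: int | None, model_max: int) -> int:
    """Pick a context length that won't OOM on first-time setup."""
    if configured is None:
        # Safe fallback instead of the model's advertised max,
        # which can be in the millions of tokens
        return DEFAULT_MAX_SEQ_LEN
    if configured == -1:
        # Explicit opt-in to the model's full max_seq_len
        return model_max
    return configured
```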
turboderp
691a080ac7 Dependencies: Bump ExllamaV3 and ExllamaV2 2025-05-31 23:55:04 +02:00
kingbri
17f3dca6fc Packaging: Add agnostic method to check version of packages
Some packages such as ExllamaV2 and V3 require specific versions for
the latest features. Rather than creating repetitive functions, create
an agnostic function that checks the installed package and prompts the
user to upgrade.

This error is also returned by load and unload requests, so keep it
short.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 01:04:24 -04:00
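A plausible shape for the agnostic check, using importlib.metadata and
the packaging library; the helper name and error wording are
assumptions:

```python
from importlib.metadata import PackageNotFoundError, version

from packaging.version import Version


def check_package_version(package: str, minimum: str) -> None:
    """Raise a short, user-facing error if a package is missing or too old."""
    try:
        installed = Version(version(package))
    except PackageNotFoundError:
        raise RuntimeError(f"{package} is not installed.")

    if installed < Version(minimum):
        # Kept short because this message is surfaced in load/unload requests
        raise RuntimeError(f"{package} {installed} < {minimum}, please upgrade.")


# e.g. check_package_version("exllamav2", "0.2.8")
```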
kingbri
0858b6d4b2 Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 00:46:40 -04:00
kingbri
390daeb92f Model: Create universal HFModel class
The HFModel class coalesces all of the config files that contain
assorted keys required for model usage.

Adding this base class allows us to expand as HuggingFace randomly
changes their JSON schemas over time, reducing the burden on backend
devs when the next model isn't supported.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-13 18:12:38 -04:00
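One way such a base class could look; the file names follow
HuggingFace conventions, everything else is a sketch:

```python
import json
import pathlib
from typing import Any


class HFModel:
    """Hypothetical base class that coalesces HuggingFace config files.

    Each JSON file contributes whatever keys it happens to define, so
    new or renamed schema fields only need handling in one place.
    """

    def __init__(self, model_dir: pathlib.Path):
        self.config = self._read_json(model_dir / "config.json")
        self.generation_config = self._read_json(model_dir / "generation_config.json")
        self.tokenizer_config = self._read_json(model_dir / "tokenizer_config.json")

    @staticmethod
    def _read_json(path: pathlib.Path) -> dict[str, Any]:
        # Missing files are common; treat them as empty configs
        return json.loads(path.read_text()) if path.exists() else {}

    def get(self, key: str, default: Any = None) -> Any:
        # Search the configs in priority order for a given key
        for source in (self.config, self.generation_config, self.tokenizer_config):
            if key in source:
                return source[key]
        return default
```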
kingbri
bd3fec929c Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 11:32:27 -04:00
kingbri
a524ac3c0f Model: Fix cache mode again
If statements can be difficult to work with.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 11:30:47 -04:00
kingbri
20cad851e9 Model: Fix param call
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 09:52:28 -04:00
kingbri
d15eb55f20 Model: Fix exl2 cache mode check
FP16 was not included in the validation step.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 09:51:09 -04:00
kingbri
656af41b5d Model: Always enable decode_special_tokens
The frontend should handle the special tokens if they get emitted.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-09 22:25:50 -04:00
kingbri
42346c6b39 Sampling: Remove skip_special_tokens
This parameter is way too confusing and does not make sense in
the modern LLM space.

Change approved by all maintainers.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-09 22:11:33 -04:00
kingbri
25c77ebf77 Model: Remove exllamav2-specific version check
No longer necessary thanks to the agnostic check.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-09 22:08:15 -04:00
DocShotgun
9dcde59c57 Model: Check for unsupported cache mode in exllamav2 2025-05-06 01:18:15 -07:00
DocShotgun
68a660bdb3 Model: Initial Exl3 cache quantization support 2025-05-03 20:35:35 -07:00
kingbri
e8f00412f6 Model: Fetch from generation_config and tokenizer_config in Exl3
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:33:25 -04:00
kingbri
bdc5189a4b Exl3: Add chunk size, cache size, and model info
Use the same algorithm for estimating and adjusting the cache size:
round it to a multiple of 256 at or above max_seq_len.

The same applies to the chunk size.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:33:25 -04:00
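A sketch of that adjustment, assuming the rule is "a multiple of 256
that is at least max_seq_len":

```python
def adjust_size(requested: int, max_seq_len: int) -> int:
    """Clamp a cache (or chunk) size to max_seq_len, then round up
    to the next multiple of 256."""
    size = max(requested, max_seq_len)
    return -(-size // 256) * 256  # ceiling division, scaled back up


# e.g. adjust_size(5000, 4096) -> 5120; adjust_size(4000, 4096) -> 4096
```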
randoentity
daae9ec43d Exl3: Couldn't wait
Just copied some stuff around and it ended up working for basic use.
2025-05-02 21:33:25 -04:00
kingbri
0c1d794390 Model: Add exl3 and associated load functions
Initial exl3 compat and loading functionality.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:32:39 -04:00
kingbri
242f6b7d2a Model: Simplify add_bos_token handling
Set add_bos_token to True by default in the tokenizer_config stub.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:32:28 -04:00
kingbri
4cb3e5d5b1 Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 00:23:15 -04:00
kingbri
47cb2a0de9 Model: Add TokenizerConfig stub and add_eos_token fallback
This stub fetches the add_eos_token field from the HF tokenizer config.
Ideally, this should be in the backend rather than tabby.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 00:08:01 -04:00
kingbri
aa657fa6e9 API: Ignore add_bos_token in chat completions
When fetching special tokens from the model, don't factor in the
add_bos_token and ban_eos_token parameters as switches.

In addition, change the internal handling of add_bos_token to an optional
boolean. This allows us to fall back to the model when selecting whether
or not to add the BOS token, especially for chat completions.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-01 22:51:15 -04:00
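The optional-boolean fallback, sketched with assumed names:

```python
from typing import Optional


def resolve_add_bos_token(request_value: Optional[bool], model_default: bool) -> bool:
    # None means the request didn't specify, so defer to the model;
    # an explicit True/False from the client always wins.
    return model_default if request_value is None else request_value
```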
kingbri
b43f0983c8 Model: Fix max_seq_len fallbacks
The rope alpha calculation caused an error when max_seq_len wasn't
provided, because the model's max sequence length was not stored as
the target for the alpha calculation.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-28 14:09:31 -04:00
kingbri
f070587e9f Model: Add proper jobs cleanup and fix var calls
Jobs should be started and immediately cleaned up when calling the
generation stream. Expose a stream_generate function and append
this to the base class since it's more idiomatic than generate_gen.

The exl2 container's generate_gen function is now internal.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-24 21:30:55 -04:00
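A self-contained sketch of the start-then-always-clean-up pattern; the
stand-in job generator is hypothetical:

```python
import asyncio
from typing import AsyncIterator


async def fake_job(request_id: str) -> AsyncIterator[str]:
    # Stand-in for a backend generator job
    for piece in ("Hello", ", ", "world"):
        await asyncio.sleep(0)
        yield piece


async def stream_generate(request_id: str) -> AsyncIterator[str]:
    """Start a job, stream its chunks, and always clean it up."""
    job = fake_job(request_id)
    try:
        async for chunk in job:
            yield chunk
    finally:
        # Runs on completion, client disconnect (cancellation), or error,
        # so jobs never outlive the generation stream
        await job.aclose()


async def main() -> None:
    async for chunk in stream_generate("req-1"):
        print(chunk, end="")
    print()


asyncio.run(main())
```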
kingbri
7e007f0761 Model: Handle finish chunks and logprobs in separate functions
Helps split up and trim the generate_gen function.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-24 21:19:03 -04:00
kingbri
3f09fcd8c9 Model: Make model params return a model card
The model card is a unified structure for sharing model params. Use it
instead of kwargs.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-21 23:15:46 -04:00
kingbri
13beef8021 Model: Move find_template function to templating
Makes sense to extract to a utility function instead.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-20 18:27:53 -04:00
kingbri
8e238fa8f6 Model: Move calculate_rope_alpha from backend
Makes more sense to use as a utility function. Also clarify how the
vars are set.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-20 18:20:19 -04:00
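For reference, a commonly cited quadratic fit for NTK rope alpha; the
project's exact coefficients may differ, so treat this as
illustrative:

```python
def calculate_rope_alpha(base_seq_len: int, target_seq_len: int) -> float:
    """Estimate a rope alpha for extending context past the model's
    native length (illustrative coefficients)."""
    ratio = target_seq_len / base_seq_len
    if ratio <= 1.0:
        return 1.0  # no extension needed
    return -0.13436 + 0.80541 * ratio + 0.28833 * ratio**2
```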
kingbri
b751e0a1d5 Model: Move inline overrides to common
This is applied across containers. Doesn't make sense to put this method
in the backend.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-20 17:51:57 -04:00
kingbri
034682fcf1 Backends: Add base model container
Base class for all model containers. Used as the interface in the
shared model file.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-20 17:24:10 -04:00
kingbri
f15ac1f69d Model: Reject model requests when unloading
If a model is being unloaded, that means it's being shut down, and
no requests should be accepted from then on.

Also, remove model_is_loaded since we simply check if the container
is None now.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-19 22:34:06 -04:00
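A sketch of the gate with hypothetical names; the key point is that
checking the container for None now doubles as the loaded check:

```python
class ModelState:
    def __init__(self) -> None:
        self.container = None   # replaces the old model_is_loaded flag
        self.unloading = False

    def check_model_ready(self) -> None:
        # Reject requests once unloading starts or when nothing is loaded
        if self.container is None or self.unloading:
            raise RuntimeError("No model is loaded or it is shutting down.")
```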
kingbri
3f1d5d396e Model: Store active jobs in tabby
Rather than relying on the generator, use tabby to store the active
job IDs.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-16 13:17:55 -04:00
kingbri
1afc9b983e Model: Remove generate_window
Not required since we error when exceeding max_seq_len.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-16 12:59:02 -04:00
kingbri
2f5235e1a3 Model: Extract settings creation to a separate function
Maybe move this out of the class entirely, but for now, it makes
sense to encapsulate this logic.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-16 12:57:27 -04:00
kingbri
5697204e47 Merge branch 'main' into model-rewrite 2025-04-16 02:15:46 -04:00
kingbri
6bb5f8f599 Sampling: Rewrite mirostat_mode parameter
Apparently the "mirostat" parameter has been updated by frontends
to pass a number. ExllamaV2 expects a boolean, but most pass a number
anyway, so just alias mirostat_mode and mirostat together.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-16 02:13:55 -04:00
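Assuming Pydantic v2, aliasing the two keys together might look like
this; the field name and default are assumptions:

```python
from pydantic import AliasChoices, BaseModel, Field


class SamplerParams(BaseModel):
    # Accept either key: frontends now send "mirostat" as a number,
    # so treat it and "mirostat_mode" as the same field
    mirostat_mode: int = Field(
        default=0,
        validation_alias=AliasChoices("mirostat_mode", "mirostat"),
    )


print(SamplerParams.model_validate({"mirostat": 2}).mirostat_mode)  # prints 2
```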
kingbri
3084ef9fa1 Model + API: Migrate to use BaseSamplerParams
kwargs is pretty ugly when figuring out which arguments to use. The
base request falls back to defaults anyway, so pass in the params
object as-is.

However, since Python's typing isn't like TypeScript, where types
can be transformed, the type hints have a possibility of None showing
up despite there always being a value for some params.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-16 00:50:05 -04:00
kingbri
dcb36e9ab2 Model: Remove extra unwraps
The base sampler request already specifies the defaults, so don't
unwrap in this way.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-15 23:38:46 -04:00
kingbri
11ed3cf5ee Model: Cleanup logging and remove extraneous declarations
Log the parameters passed into the generate_gen function rather than
the generation settings to reduce complexity.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-15 23:31:12 -04:00
kingbri
79f9c6e854 Model: Remove num_experts_per_token
This shouldn't even be an exposed option since changing it always
breaks inference with the model. Let the model's config.json handle
it.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-19 11:52:10 -04:00
kingbri
9f649647f0 Model + API: GPU split updates and fixes
For the TP loader, GPU split cannot be an empty array. However,
defaulting the parameter to an empty array makes it easier to calculate
the device list. Therefore, cast an empty array to None using
falsy comparisons at load time.

Also add draft_gpu_split to the load request.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-15 21:50:14 -05:00
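The falsy cast is a one-liner, sketched here with assumed names:

```python
from typing import Optional


def normalize_gpu_split(gpu_split: list[float]) -> Optional[list[float]]:
    # An empty list simplifies device-list math while the request is
    # being built, but the TP loader rejects it, so cast falsy values
    # to None at load time.
    return gpu_split or None
```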
Brian
2e491472d1 Merge pull request #254 from lucyknada/main
add draft_gpu_split option for spec decoding
2025-02-11 16:48:03 -05:00
kingbri
0dcbb7a722 Dependencies: Update torch, exllamav2, and flash-attn
Torch - 2.6.0
ExllamaV2 - 0.2.8
Flash-attn - 2.7.4.post1

CUDA wheels are now 12.4 instead of 12.1, so feature names need to be
migrated over.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-09 01:27:48 -05:00