Commit graph

290 commits

Author SHA1 Message Date
kingbri
17f3dca6fc Packaging: Add agnostic method to check version of packages
Some packages such as ExllamaV2 and V3 require specific versions for
the latest features. Rather than creating repetitive functions, create
an agnostic function to check the installed package and then report
to the user to upgrade.

This is also sent to requests for loading and unloading, so keep the
error short.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 01:04:24 -04:00
kingbri
084916c04f Model: Fix autosplit reserve crash with GPU split
ExllamaV3 does not accept autosplit_reserve and gpu_split at the same
time.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 00:51:14 -04:00
kingbri
0858b6d4b2 Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 00:46:40 -04:00
kingbri
390daeb92f Model: Create universal HFModel class
The HFModel class serves to coalesce all config files that contain
random keys which are required for model usage.

Adding this base class allows us to expand as HuggingFace randomly
changes their JSON schemas over time, reducing the brunt that backend
devs need to feel when their next model isn't supported.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-13 18:12:38 -04:00
kingbri
bd3fec929c Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 11:32:27 -04:00
kingbri
a524ac3c0f Model: Fix cache mode again
If statements can be difficult to work with.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 11:30:47 -04:00
kingbri
20cad851e9 Model: Fix param call
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 09:52:28 -04:00
kingbri
d15eb55f20 Model: Fix exl2 cache mode check
FP16 was not included in the validation step.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 09:51:09 -04:00
kingbri
656af41b5d Model: Always enable decode_special_tokens
The frontend should handle the special tokens if they get emitted.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-09 22:25:50 -04:00
kingbri
42346c6b39 Sampling: Remove skip_special_tokens
This parameter is way too confusing and does not make sense in
the modern LLM space.

Change approved by all maintainers.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-09 22:11:33 -04:00
kingbri
25c77ebf77 Model: Remove exllamav2-specific version check
No longer necessary thanks to the agnostic check.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-09 22:08:15 -04:00
kingbri
638eef401a Model: Move cache creation to a common function
Prevents repetitiveness while also creating a Cache class.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-08 23:10:03 -04:00
DocShotgun
9dcde59c57 Model: Check for unsupported cache mode in exllamav2 2025-05-06 01:18:15 -07:00
DocShotgun
45b966363e Tree: Format 2025-05-03 21:01:03 -07:00
DocShotgun
a635a719d7 Model: Enable draft model q-cache in Exl3
* Remove unneeded default fp16 cache layer import
2025-05-03 20:59:36 -07:00
DocShotgun
58e34ba4c5 Model: Exl3 cache quant settings lenient with whitespace 2025-05-03 20:35:35 -07:00
DocShotgun
68a660bdb3 Model: Initial Exl3 cache quantization support 2025-05-03 20:35:35 -07:00
turboderp
92ea7ee7cd Model: Add draft model/speculative decoding 2025-05-04 01:27:42 +02:00
turboderp
1db2cb99cb Model: Avoid initializing class variables 2025-05-04 01:26:42 +02:00
turboderp
0405a94a89 Model: Cast penalty range to int 2025-05-03 22:28:36 +02:00
turboderp
58c380b8ca Model: Create generator on load 2025-05-03 18:33:37 +02:00
turboderp
0d949d00b9 Model: Set default max_batch_size 2025-05-03 18:33:37 +02:00
turboderp
8c75b29923 Model: Fix some warnings 2025-05-03 18:33:36 +02:00
kingbri
15cc480cb0 Exl3: Simplify add_bos_token handling
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:50:42 -04:00
randoentity
d8a8ccfc2a Model: fix add_bos_token 2025-05-02 21:33:25 -04:00
kingbri
0d02af3c81 Model: Set model_dir on init
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:33:25 -04:00
kingbri
c89bea030e Model: Add template fetching to Exl3
Use the same functionality as exl2's loader.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:33:25 -04:00
kingbri
e8f00412f6 Model: Fetch from generation_config and tokenizer_config in Exl3
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:33:25 -04:00
kingbri
eca403a0e4 Model: Add Exllamav3 sampler
File was not included in previous commit.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:33:25 -04:00
kingbri
bdc5189a4b Exl3: Add chunk size, cache size, and model info
Use the same algorithm for estimating and adjusting cache size based
on multiples of 256 and above max seq len.

Same applies for chunk size.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:33:25 -04:00
kingbri
303e2dde12 Model: Correct exl3 generation, add concurrency, and cleanup
Fixes application of sampler parameters by adding a new sampler builder
interface. Also expose the generator class-wide and add wait_for_jobs.

Finally, allow inline loading to specify the backend.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:33:25 -04:00
randoentity
c744790f14 fixup: add sampler logs
Also passing sampler to job with this, no idea if this is correct
2025-05-02 21:33:25 -04:00
randoentity
b35c48da37 fixup: some metrics 2025-05-02 21:33:25 -04:00
randoentity
c0f268f33e fixup: autosplit, start work on metrics 2025-05-02 21:33:25 -04:00
randoentity
306fc7cd15 fixup: autosplit reserve
this probably breaks v2 support
2025-05-02 21:33:25 -04:00
randoentity
acb3adb953 fixup: auto split 2025-05-02 21:33:25 -04:00
randoentity
14fb573371 fixup: max_seq_len
Whoops
2025-05-02 21:33:25 -04:00
randoentity
daae9ec43d Exl3: Couldn't wait
Just copied some stuff around and it ended up working for basic use.
2025-05-02 21:33:25 -04:00
kingbri
b4ff2f23cf Exl3: Add token encode, decode, and special token fetch
Base class methods

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:32:53 -04:00
kingbri
0c1d794390 Model: Add exl3 and associated load functions
Initial exl3 compat and loading functionality.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:32:39 -04:00
kingbri
242f6b7d2a Model: Simplify add_bos_token handling
Set add_bos_token to True by default in the tokenizer_config stub.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:32:28 -04:00
kingbri
4cb3e5d5b1 Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 00:23:15 -04:00
kingbri
47cb2a0de9 Model: Add TokenizerConfig stub and add_eos_token fallback
This stub fetches the add_eos_token field from the HF tokenizer config.
Ideally, this should be in the backend rather than tabby.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 00:08:01 -04:00
kingbri
aa657fa6e9 API: Ignore add_bos_token in chat completions
When fetching special tokens from the model, don't factor in the
add_bos_token and ban_eos_token parameters as switches.

In addition, change the internal handling of add_bos_token to an optional
boolean. This allows us to fallback to the model when selecting whether
or not to add the BOS token, especially for chat completions.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-01 22:51:15 -04:00
kingbri
b43f0983c8 Model: Fix max_seq_len fallbacks
The rope alpha calculation caused an error if max seq len isn't
provided. This is because the model's max sequence length was not
stored as the target for alpha calculation.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-28 14:09:31 -04:00
kingbri
f070587e9f Model: Add proper jobs cleanup and fix var calls
Jobs should be started and immediately cleaned up when calling the
generation stream. Expose a stream_generate function and append
this to the base class since it's more idiomatic than generate_gen.

The exl2 container's generate_gen function is now internal.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-24 21:30:55 -04:00
kingbri
7e007f0761 Model: Handle finish chunks and logprobs in separate functions
Helps split up and trim the generate_gen function.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-24 21:19:03 -04:00
kingbri
f2c7da2faf Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-21 23:21:26 -04:00
kingbri
3f09fcd8c9 Model: Make model params return a model card
The model card is a unified structure for sharing model params.
Rather than kwargs, use this instead.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-21 23:15:46 -04:00
kingbri
13beef8021 Model: Move find_template function to templating
Makes sense to extract to a utility function instead.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-20 18:27:53 -04:00