1051 commits

Author SHA1 Message Date
randoentity
c744790f14 fixup: add sampler logs
Also passing sampler to job with this, no idea if this is correct
2025-05-02 21:33:25 -04:00
randoentity
b35c48da37 fixup: some metrics 2025-05-02 21:33:25 -04:00
randoentity
c0f268f33e fixup: autosplit, start work on metrics 2025-05-02 21:33:25 -04:00
randoentity
306fc7cd15 fixup: autosplit reserve
this probably breaks v2 support
2025-05-02 21:33:25 -04:00
randoentity
acb3adb953 fixup: auto split 2025-05-02 21:33:25 -04:00
randoentity
14fb573371 fixup: max_seq_len
Whoops
2025-05-02 21:33:25 -04:00
randoentity
daae9ec43d Exl3: Couldn't wait
Just copied some stuff around and it ended up working for basic use.
2025-05-02 21:33:25 -04:00
kingbri
b4ff2f23cf Exl3: Add token encode, decode, and special token fetch
Base class methods

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:32:53 -04:00
kingbri
0c1d794390 Model: Add exl3 and associated load functions
Initial exl3 compat and loading functionality.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:32:39 -04:00
kingbri
7c6a053747 Model: Add option to select backend
Changing the backend switches the container that's used.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:32:39 -04:00
kingbri
242f6b7d2a Model: Simplify add_bos_token handling
Set add_bos_token to True by default in the tokenizer_config stub.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:32:28 -04:00
kingbri
4cb3e5d5b1 Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 00:23:15 -04:00
kingbri
47cb2a0de9 Model: Add TokenizerConfig stub and add_eos_token fallback
This stub fetches the add_eos_token field from the HF tokenizer config.
Ideally, this should be in the backend rather than tabby.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 00:08:01 -04:00
kingbri
aa657fa6e9 API: Ignore add_bos_token in chat completions
When fetching special tokens from the model, don't factor in the
add_bos_token and ban_eos_token parameters as switches.

In addition, change the internal handling of add_bos_token to an optional
boolean. This allows us to fall back to the model when selecting whether
or not to add the BOS token, especially for chat completions.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-01 22:51:15 -04:00
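
The optional-boolean fallback described in aa657fa6e9 can be illustrated with a minimal sketch. The function name and parameters below are hypothetical and are not taken from the repository; the sketch only shows the idea that an unset request value defers to the model's own setting.

```python
from typing import Optional


def resolve_add_bos_token(requested: Optional[bool], model_default: bool) -> bool:
    """Pick the effective add_bos_token value (hypothetical helper).

    None means the client did not set the parameter, so the model's
    own default decides. An explicit True or False always wins.
    """
    return model_default if requested is None else requested
```

With this shape, an explicit `False` from the client is preserved, which a plain `requested or model_default` expression would not do.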
kingbri
3960612d38 API: Format and fix message naming
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-28 22:36:30 -04:00
kingbri
9157be3e34 API: Append task index to generations with n > 1
Since jobs are tracked via request IDs now, each generation task should
be uniquely identified in the event of cancellation.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-28 22:29:48 -04:00
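
A rough sketch of the idea in 9157be3e34: when one request asks for n > 1 generations, each task gets its own identifier derived from the request ID so it can be cancelled individually. The `-{index}` suffix format and the function name are assumptions for illustration; the commit does not state the exact format.

```python
from typing import List


def task_ids(request_id: str, n: int) -> List[str]:
    """Build one job ID per generation task (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if n == 1:
        # A single generation can keep the plain request ID.
        return [request_id]
    # Assumed format: request ID followed by the task index.
    return [f"{request_id}-{index}" for index in range(n)]
```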
kingbri
b43f0983c8 Model: Fix max_seq_len fallbacks
The rope alpha calculation caused an error if max seq len isn't
provided. This is because the model's max sequence length was not
stored as the target for alpha calculation.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-28 14:09:31 -04:00
kingbri
755f98a338 Docker: Move to venv for running
Newer versions of Python don't allow system package installation
unless --break-system-packages is specified. I'd like to avoid this
if possible.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-27 00:38:07 -04:00
kingbri
f70eb11db3 Docker: Use python 3.12
Ubuntu 24.04 ships with 3.12 by default.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-27 00:24:32 -04:00
kingbri
09ddfa8ffb Docker: Update to Cuda 12.8 and Ubuntu 24.04
Use more modern versions of dependencies for the containerized image.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-26 21:29:36 -04:00
kingbri
2b3ed3fc79 Dependencies: Switch back to official exl2 wheels
These wheels are built properly and have the correct version and
filename.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-26 21:27:28 -04:00
Brian
b081aa9fa3
Merge pull request #322 from theroyallab/model-rewrite
Model rewrite
2025-04-26 02:15:48 -04:00
kingbri
3649d3bb51 Tree: Format + Lint
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-26 02:14:30 -04:00
kingbri
eb435f79e3 Dependencies (TEMP): Use my wheels for exl2
Use these until exl2 updates its wheels so that the version matches the
filename.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-26 02:11:33 -04:00
kingbri
f4757d31bd Model: Raise a 503 exception with model checks
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-25 00:00:15 -04:00
kingbri
136c8139f9 Dependencies: Update PyTorch, Exllamav2, and FA2
PyTorch: v2.7.0 on cuda 128 + ROCm 6.3
Exllamav2: v0.2.9
FA2: v2.7.4.post1 on cuda 128

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-24 21:52:48 -04:00
kingbri
f070587e9f Model: Add proper jobs cleanup and fix var calls
Jobs should be started and immediately cleaned up when calling the
generation stream. Expose a stream_generate function and append
this to the base class since it's more idiomatic than generate_gen.

The exl2 container's generate_gen function is now internal.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-24 21:30:55 -04:00
kingbri
7e007f0761 Model: Handle finish chunks and logprobs in separate functions
Helps split up and trim the generate_gen function.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-24 21:19:03 -04:00
David Allada
bc1bef3324 Fix logs path 2025-04-22 21:14:45 -04:00
kingbri
f2c7da2faf Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-21 23:21:26 -04:00
kingbri
3f09fcd8c9 Model: Make model params return a model card
The model card is a unified structure for sharing model params.
Rather than kwargs, use this instead.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-21 23:15:46 -04:00
kingbri
9834c7f99b Dependencies: Ungate numpy
numpy v2 now works with Torch

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-21 23:14:14 -04:00
kingbri
d26260b332 Model: Add fixes for kwargs and add note for migration
One goal is to try migrating away from kwargs and use the ModelLoadRequest
instead. However, Pydantic doesn't support async validators, which makes
parsing of the inline config impossible due to its use of aiofiles.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-21 22:39:07 -04:00
Brian
93854a3107
Merge pull request #320 from Vhallo/model-rewrite
Fix RoPE Ratio
2025-04-21 10:55:55 -04:00
Vhallo
1aefa01a68
Fix RoPE Ratio 2025-04-21 01:46:18 +02:00
kingbri
13beef8021 Model: Move find_template function to templating
Makes sense to extract to a utility function instead.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-20 18:27:53 -04:00
kingbri
8e238fa8f6 Model: Move calculate_rope_alpha from backend
Makes more sense to use as a utility function. Also clarify how the
vars are set.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-20 18:20:19 -04:00
kingbri
027ffce05d Utils: Remove unused defer utils
These did not work anyways

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-20 17:59:09 -04:00
kingbri
b751e0a1d5 Model: Move inline overrides to common
This is applied across containers. Doesn't make sense to put this method
in the backend.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-20 17:51:57 -04:00
kingbri
034682fcf1 Backends: Add base model container
Base class for all model containers. Used in the shared model file
for interface.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-20 17:24:10 -04:00
kingbri
f15ac1f69d Model: Reject model requests when unloading
If a model is being unloaded, that means it's being shut down and
no requests should be accepted from then on.

Also, remove model_is_loaded since we simply check if the container
is None now.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-19 22:34:06 -04:00
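
The check described in f15ac1f69d might look roughly like the sketch below. All names here are hypothetical, and the 503 status is borrowed from the later commit f4757d31bd rather than from this one. The sketch assumes a "loaded" model is simply a container that is not None, as the commit message says.

```python
from typing import Optional


class ModelUnavailableError(Exception):
    """Hypothetical error that an API layer could map to HTTP 503."""

    status_code = 503


class ModelState:
    """Hypothetical holder for the active container and its unload flag."""

    def __init__(self) -> None:
        self.container: Optional[object] = None
        self.unloading: bool = False


def check_model_ready(state: ModelState) -> None:
    """Raise if no model is loaded or the current one is shutting down."""
    if state.container is None:
        raise ModelUnavailableError("No model is currently loaded.")
    if state.unloading:
        raise ModelUnavailableError("The model is being unloaded.")
```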
kingbri
552a64c723 Model: Have load take the highest priority
The admin takes priority over the regular user. Therefore, if a model
is loading, ignore all incoming generation requests.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-18 22:08:48 -04:00
kingbri
3f1d5d396e Model: Store active jobs in tabby
Rather than relying on the generator, use tabby to store the active
job IDs.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-16 13:17:55 -04:00
kingbri
1afc9b983e Model: Remove generate_window
Not required since we raise an error when max_seq_len is exceeded.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-16 12:59:02 -04:00
kingbri
2f5235e1a3 Model: Extract settings creation to a separate function
Maybe move this out of the class entirely, but for now, it makes
sense to encapsulate this logic.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-16 12:57:27 -04:00
kingbri
5697204e47 Merge branch 'main' into model-rewrite 2025-04-16 02:15:46 -04:00
kingbri
6bb5f8f599 Sampling: Rewrite mirostat_mode parameter
Apparently the "mirostat" parameter has been updated by frontends
to pass a number. ExllamaV2 expects a boolean, but most pass a number
anyway, so just alias mirostat_mode and mirostat together.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-16 02:13:55 -04:00
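
One way to picture the aliasing in 6bb5f8f599 is a small normalizer that accepts either field and produces the boolean the backend expects. This is a sketch under assumptions: the function name is hypothetical, and treating any non-zero number as "enabled" is an assumed rule, not something the commit states.

```python
from typing import Optional, Union


def normalize_mirostat(
    mirostat: Optional[Union[bool, int]] = None,
    mirostat_mode: Optional[Union[bool, int]] = None,
) -> bool:
    """Collapse the two aliased fields into a single boolean switch.

    Assumption: any non-zero number means mirostat is enabled, and a
    missing value means disabled. "mirostat" wins if both are given.
    """
    value = mirostat if mirostat is not None else mirostat_mode
    return bool(value)
```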
kingbri
3084ef9fa1 Model + API: Migrate to use BaseSamplerParams
kwargs is pretty ugly when figuring out which arguments to use. The
base request falls back to defaults anyways, so pass in the params
object as is.

However, since Python's typing isn't like TypeScript where types
can be transformed, the type hinting has a possibility of None showing
up despite there always being a value for some params.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-16 00:50:05 -04:00
kingbri
dcb36e9ab2 Model: Remove extra unwraps
The base sampler request already specifies the defaults, so don't
unwrap in this way.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-15 23:38:46 -04:00
kingbri
11ed3cf5ee Model: Cleanup logging and remove extraneous declarations
Log the parameters passed into the generate gen function rather than
the generation settings to reduce complexity.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-15 23:31:12 -04:00