Commit graph

1038 commits

Author SHA1 Message Date
kingbri
aa657fa6e9 API: Ignore add_bos_token in chat completions
When fetching special tokens from the model, don't factor in the
add_bos_token and ban_eos_token parameters as switches.

In addition, change the internal handling of add_bos_token to an optional
boolean. This allows us to fallback to the model when selecting whether
or not to add the BOS token, especially for chat completions.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-01 22:51:15 -04:00
kingbri
3960612d38 API: Format and fix message naming
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-28 22:36:30 -04:00
kingbri
9157be3e34 API: Append task index to generations with n > 1
Since jobs are tracked via request IDs now, each generation task should
be uniquely identified in the event of cancellation.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-28 22:29:48 -04:00
kingbri
b43f0983c8 Model: Fix max_seq_len fallbacks
The rope alpha calculation caused an error if max seq len isn't
provided. This is because the model's max sequence length was not
stored as the target for alpha calculation.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-28 14:09:31 -04:00
kingbri
755f98a338 Docker: Move to venv for running
Newer versions of Python don't allow system package installation
unless --break-system-packages are specified. I'd like to avoid this
if possible.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-27 00:38:07 -04:00
kingbri
f70eb11db3 Docker: Use python 3.12
Ubuntu 24.04 ships with 3.12 by default.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-27 00:24:32 -04:00
kingbri
09ddfa8ffb Docker: Update to Cuda 12.8 and Ubuntu 24.04
Use more modern versions of dependencies for the containerized image.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-26 21:29:36 -04:00
kingbri
2b3ed3fc79 Dependencies: Switch back to official exl2 wheels
These wheels are built properly and have the correct version and
filename.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-26 21:27:28 -04:00
Brian
b081aa9fa3
Merge pull request #322 from theroyallab/model-rewrite
Model rewrite
2025-04-26 02:15:48 -04:00
kingbri
3649d3bb51 Tree: Format + Lint
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-26 02:14:30 -04:00
kingbri
eb435f79e3 Dependencies (TEMP): Use my wheels for exl2
Use these until exl2 updates its wheels to have the version equal the
filename.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-26 02:11:33 -04:00
kingbri
f4757d31bd Model: Raise a 503 exception with model checks
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-25 00:00:15 -04:00
kingbri
136c8139f9 Dependencies: Update PyTorch, Exllamav2, and FA2
PyTorch: v2.7.0 on cuda 128 + ROCm 6.3
Exllamav2: v0.2.9
FA2: v2.7.4.post1 on cuda 128

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-24 21:52:48 -04:00
kingbri
f070587e9f Model: Add proper jobs cleanup and fix var calls
Jobs should be started and immediately cleaned up when calling the
generation stream. Expose a stream_generate function and append
this to the base class since it's more idiomatic than generate_gen.

The exl2 container's generate_gen function is now internal.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-24 21:30:55 -04:00
kingbri
7e007f0761 Model: Handle finish chunks and logprobs in separate functions
Helps split up and trim the generate_gen function.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-24 21:19:03 -04:00
David Allada
bc1bef3324 FIx logs path 2025-04-22 21:14:45 -04:00
kingbri
f2c7da2faf Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-21 23:21:26 -04:00
kingbri
3f09fcd8c9 Model: Make model params return a model card
The model card is a unified structure for sharing model params.
Rather than kwargs, use this instead.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-21 23:15:46 -04:00
kingbri
9834c7f99b Dependencies: Ungate numpy
numpy v2 now works with Torch

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-21 23:14:14 -04:00
kingbri
d26260b332 Model: Add fixes for kwargs and add note for migration
One goal is to try migrating away from kwargs and use the ModelLoadRequest
instead. However, Pydantic doesn't support async validators making
parsing of the inline config impossible due to its use of aiofiles.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-21 22:39:07 -04:00
Brian
93854a3107
Merge pull request #320 from Vhallo/model-rewrite
Fix RoPE Ratio
2025-04-21 10:55:55 -04:00
Vhallo
1aefa01a68
Fix RoPE Ratio 2025-04-21 01:46:18 +02:00
kingbri
13beef8021 Model: Move find_template function to templating
Makes sense to extract to a utility function instead.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-20 18:27:53 -04:00
kingbri
8e238fa8f6 Model: Move calculate_rope_alpha from backend
Makes more sense to use as a utility function. Also clarify how the
vars are set.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-20 18:20:19 -04:00
kingbri
027ffce05d Utils: Remove unused defer utils
These did not work anyways

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-20 17:59:09 -04:00
kingbri
b751e0a1d5 Model: Move inline overrides to common
This is applied across containers. Doesn't make sense to put this method
in the backend.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-20 17:51:57 -04:00
kingbri
034682fcf1 Backends: Add base model container
Base class for all model containers. Used in the shared model file
for interface.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-20 17:24:10 -04:00
kingbri
f15ac1f69d Model: Reject model requests when unloading
If a model is being unloaded, that means its being shut down and
no requests should be accepted from then on.

Also, remove model_is_loaded since we simply check if the container
is None now.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-19 22:34:06 -04:00
kingbri
552a64c723 Model: Have load take the highest priority
The admin takes priority over the regular user. Therefore, if a model
is loading, ignore all incoming generation requests

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-18 22:08:48 -04:00
kingbri
3f1d5d396e Model: Store active jobs in tabby
Rather than relying on the generator, use tabby to store the active
job IDs.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-16 13:17:55 -04:00
kingbri
1afc9b983e Model: Remove generate_window
Not required since we error with exceeding the max_seq_len

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-16 12:59:02 -04:00
kingbri
2f5235e1a3 Model: Extract settings creation to a separate function
Maybe move this out of the class entirely, but for now, it makes
sense to encapsulate this logic.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-16 12:57:27 -04:00
kingbri
5697204e47 Merge branch 'main' into model-rewrite 2025-04-16 02:15:46 -04:00
kingbri
6bb5f8f599 Sampling: Rewrite mirostat_mode parameter
Apparently the "mirostat" parameter has been updated by frontends
to pass a number. ExllamaV2 expects a boolean, but most pass a number
anyway, so just alias mirostat_mode and mirostat together.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-16 02:13:55 -04:00
kingbri
3084ef9fa1 Model + API: Migrate to use BaseSamplerParams
kwargs is pretty ugly when figuring out which arguments to use. The
base requests falls back to defaults anyways, so pass in the params
object as is.

However, since Python's typing isn't like TypeScript where types
can be transformed, the type hinting has a possiblity of None showing
up despite there always being a value for some params.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-16 00:50:05 -04:00
kingbri
dcb36e9ab2 Model: Remove extra unwraps
The base sampler request already specifies the defaults, so don't
unwrap in this way.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-15 23:38:46 -04:00
kingbri
11ed3cf5ee Model: Cleanup logging and remove extraneous declarations
Log the parameters passed into the generate gen function rather than
the generation settings to reduce complexity.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-15 23:31:12 -04:00
Andrew Phillips
436ce752da
Support more common tool variables in templates (tools, message.tool_calls) (#308)
* Add non-JSON version of `tools` and `functions` to `template_vars`.

Increase the compatibility with VLLM templates which use a non-JSON tools object.

* Add list of tool template variables to the documentation

* Use Jinja templates to provide `tools_json` and `functions_json`

This should be functionally equivelant, but the JSON won't be produced
unless it's needed.

* Make message.tool_calls match the JSON from ToolCallProcessor

* Log something when generating tool calls

* Add template for Qwen QwQ 32b

* Only log if tool calls have been detected

* API: Fix tool call variable assignments

Jinja functions do not run when variables are called. Use json.dumps
instead. In addition, log the request ID when stating that a tool
call was fired.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>

* Add `ToolCallProcessor.dump()` to get the list of processed dicts

* Remove qwen_qwq_32b.jinja

This will be added to the following repository at a later date:
https://github.com/theroyallab/llm-prompt-templates

---------

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Co-authored-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-23 13:23:00 -04:00
David Allada
d31d17e5a2 Trigger ruff formatting 2025-03-23 17:04:09 +00:00
David Allada
bcd3413628 Try to fix ruff format 2025-03-23 17:02:52 +00:00
David Allada
0256d3b2a2 Fix the comment from 10MB to 20MB 2025-03-23 16:51:47 +00:00
David Allada
6750c291db Add file based logging in addition to the normal console logs 2025-03-23 16:49:58 +00:00
kingbri
ccf23243c1 Docs: Update getting started with downloading from private repos
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-19 12:02:48 -04:00
kingbri
529c90b93e Tree: Format and lint
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-19 11:55:02 -04:00
kingbri
d990bbc431 Args: Remove action arguments
Superseded by subcommands to perform the same action.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-19 11:53:47 -04:00
kingbri
79f9c6e854 Model: Remove num_experts_per_token
This shouldn't even be an exposed option since changing it always
breaks inference with the model. Let the model's config.json handle
it.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-19 11:52:10 -04:00
kingbri
698d8339cb Config + Docs: Clarify YaRN rope scaling changes
In ExllamaV2, if a model has YaRN support, linear RoPE options are
not applied. Users can set max_seq_len and exl2 will take care of
the rest.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-19 11:47:49 -04:00
Benjamin Oldenburg
a20abe2d33
Bugfix: Chat completion requests fail with UnboundLocalError: finish_reason variable not initialized (#307)
* fix issue #306

* removed whitespaces for ruff
2025-03-15 20:31:21 -04:00
kingbri
d98c0bd3f6 API: Add tools class
Was mistakenly not added in PR 302.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-14 15:07:11 -04:00
Brian
51b32621e1
Update README.md 2025-03-14 15:04:24 -04:00