1076 commits

Author SHA1 Message Date
kingbri
136c8139f9 Dependencies: Update PyTorch, Exllamav2, and FA2
PyTorch: v2.7.0 on cuda 128 + ROCm 6.3
Exllamav2: v0.2.9
FA2: v2.7.4.post1 on cuda 128

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-24 21:52:48 -04:00
kingbri
f070587e9f Model: Add proper jobs cleanup and fix var calls
Jobs should be started and immediately cleaned up when calling the
generation stream. Expose a stream_generate function and append
this to the base class since it's more idiomatic than generate_gen.

The exl2 container's generate_gen function is now internal.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-24 21:30:55 -04:00
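
The start-then-clean-up pattern described in f070587e9f could look roughly like the sketch below. It is a minimal illustration, not the repository's code: `active_jobs`, the stand-in `_generate_gen` body, and the `collect` helper are assumptions made for the example; only the names `stream_generate` and `generate_gen` come from the commit message.

```python
import asyncio
from typing import AsyncIterator, Dict, List

# Hypothetical registry of in-flight jobs, keyed by request ID.
active_jobs: Dict[str, str] = {}


async def _generate_gen(prompt: str) -> AsyncIterator[str]:
    """Internal generator (stand-in): yields the prompt back word by word."""
    for token in prompt.split():
        await asyncio.sleep(0)  # Give control back to the event loop
        yield token


async def stream_generate(request_id: str, prompt: str) -> AsyncIterator[str]:
    """Public entry point: register the job, stream chunks, always clean up."""
    active_jobs[request_id] = prompt
    try:
        async for chunk in _generate_gen(prompt):
            yield chunk
    finally:
        # Runs on normal completion, on error, and when the stream is closed
        active_jobs.pop(request_id, None)


async def collect(request_id: str, prompt: str) -> List[str]:
    """Helper for the example: drain the stream into a list."""
    return [chunk async for chunk in stream_generate(request_id, prompt)]
```

Putting the cleanup in a `finally` block means the job entry is removed whether the stream finishes, raises, or is closed early by the consumer.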
kingbri
7e007f0761 Model: Handle finish chunks and logprobs in separate functions
Helps split up and trim the generate_gen function.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-24 21:19:03 -04:00
David Allada
bc1bef3324 Fix logs path 2025-04-22 21:14:45 -04:00
kingbri
f2c7da2faf Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-21 23:21:26 -04:00
kingbri
3f09fcd8c9 Model: Make model params return a model card
The model card is a unified structure for sharing model params.
Rather than kwargs, use this instead.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-21 23:15:46 -04:00
kingbri
9834c7f99b Dependencies: Ungate numpy
numpy v2 now works with Torch

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-21 23:14:14 -04:00
kingbri
d26260b332 Model: Add fixes for kwargs and add note for migration
One goal is to try migrating away from kwargs and use the ModelLoadRequest
instead. However, Pydantic doesn't support async validators making
parsing of the inline config impossible due to its use of aiofiles.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-21 22:39:07 -04:00
Brian
93854a3107
Merge pull request #320 from Vhallo/model-rewrite
Fix RoPE Ratio
2025-04-21 10:55:55 -04:00
Vhallo
1aefa01a68
Fix RoPE Ratio 2025-04-21 01:46:18 +02:00
kingbri
13beef8021 Model: Move find_template function to templating
Makes sense to extract to a utility function instead.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-20 18:27:53 -04:00
kingbri
8e238fa8f6 Model: Move calculate_rope_alpha from backend
Makes more sense to use as a utility function. Also clarify how the
vars are set.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-20 18:20:19 -04:00
kingbri
027ffce05d Utils: Remove unused defer utils
These did not work anyways

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-20 17:59:09 -04:00
kingbri
b751e0a1d5 Model: Move inline overrides to common
This is applied across containers. Doesn't make sense to put this method
in the backend.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-20 17:51:57 -04:00
kingbri
034682fcf1 Backends: Add base model container
Base class for all model containers. Used in the shared model file
for interface.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-20 17:24:10 -04:00
kingbri
f15ac1f69d Model: Reject model requests when unloading
If a model is being unloaded, that means it's being shut down and
no requests should be accepted from then on.

Also, remove model_is_loaded since we simply check if the container
is None now.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-19 22:34:06 -04:00
kingbri
552a64c723 Model: Have load take the highest priority
The admin takes priority over the regular user. Therefore, if a model
is loading, ignore all incoming generation requests

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-18 22:08:48 -04:00
kingbri
3f1d5d396e Model: Store active jobs in tabby
Rather than relying on the generator, use tabby to store the active
job IDs.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-16 13:17:55 -04:00
kingbri
1afc9b983e Model: Remove generate_window
Not required since we error when exceeding max_seq_len

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-16 12:59:02 -04:00
kingbri
2f5235e1a3 Model: Extract settings creation to a separate function
Maybe move this out of the class entirely, but for now, it makes
sense to encapsulate this logic.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-16 12:57:27 -04:00
kingbri
5697204e47 Merge branch 'main' into model-rewrite 2025-04-16 02:15:46 -04:00
kingbri
6bb5f8f599 Sampling: Rewrite mirostat_mode parameter
Apparently the "mirostat" parameter has been updated by frontends
to pass a number. ExllamaV2 expects a boolean, but most pass a number
anyway, so just alias mirostat_mode and mirostat together.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-16 02:13:55 -04:00
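
One way to picture the aliasing described in 6bb5f8f599 is the sketch below. It is an assumption-laden illustration rather than the project's implementation: the function name `resolve_mirostat` and the rule that any non-zero number counts as enabled are hypothetical; only the parameter names `mirostat` and `mirostat_mode` come from the commit message.

```python
from typing import Any, Dict


def resolve_mirostat(params: Dict[str, Any]) -> bool:
    """Collapse `mirostat` / `mirostat_mode` into the boolean the backend expects."""
    # Check both spellings; the first one that is present and not None wins.
    for key in ("mirostat", "mirostat_mode"):
        value = params.get(key)
        if value is not None:
            # Assumption: 0/False means disabled, anything else means enabled
            return bool(value)
    return False
```

With this sketch, `{"mirostat_mode": 2}` and `{"mirostat": True}` both resolve to enabled, while an empty request resolves to disabled.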
kingbri
3084ef9fa1 Model + API: Migrate to use BaseSamplerParams
kwargs is pretty ugly when figuring out which arguments to use. The
base request falls back to defaults anyway, so pass in the params
object as is.

However, since Python's typing isn't like TypeScript where types
can be transformed, the type hinting has a possibility of None showing
up despite there always being a value for some params.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-16 00:50:05 -04:00
kingbri
dcb36e9ab2 Model: Remove extra unwraps
The base sampler request already specifies the defaults, so don't
unwrap in this way.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-15 23:38:46 -04:00
kingbri
11ed3cf5ee Model: Cleanup logging and remove extraneous declarations
Log the parameters passed into the generate gen function rather than
the generation settings to reduce complexity.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-15 23:31:12 -04:00
Andrew Phillips
436ce752da
Support more common tool variables in templates (tools, message.tool_calls) (#308)
* Add non-JSON version of `tools` and `functions` to `template_vars`.

Increase the compatibility with VLLM templates which use a non-JSON tools object.

* Add list of tool template variables to the documentation

* Use Jinja templates to provide `tools_json` and `functions_json`

This should be functionally equivalent, but the JSON won't be produced
unless it's needed.

* Make message.tool_calls match the JSON from ToolCallProcessor

* Log something when generating tool calls

* Add template for Qwen QwQ 32b

* Only log if tool calls have been detected

* API: Fix tool call variable assignments

Jinja functions do not run when variables are called. Use json.dumps
instead. In addition, log the request ID when stating that a tool
call was fired.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>

* Add `ToolCallProcessor.dump()` to get the list of processed dicts

* Remove qwen_qwq_32b.jinja

This will be added to the following repository at a later date:
https://github.com/theroyallab/llm-prompt-templates

---------

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Co-authored-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-23 13:23:00 -04:00
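
The "use json.dumps instead" fix noted in 436ce752da amounts to serializing the tool list eagerly before the template is rendered. The sketch below shows that idea under stated assumptions: the function name `build_tool_template_vars` and the `indent=2` formatting are hypothetical; the variable names `tools` and `tools_json` are the ones listed in the commit message.

```python
import json
from typing import Any, Dict, List


def build_tool_template_vars(tools: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Provide both the raw tools object and a pre-serialized JSON string."""
    return {
        # Non-JSON object, for templates that iterate over tools directly
        "tools": tools,
        # Serialized up front, so the template only has to print the string
        "tools_json": json.dumps(tools, indent=2),
    }
```

The trade-off is that the JSON string is always produced, even when a template never references it.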
David Allada
d31d17e5a2 Trigger ruff formatting 2025-03-23 17:04:09 +00:00
David Allada
bcd3413628 Try to fix ruff format 2025-03-23 17:02:52 +00:00
David Allada
0256d3b2a2 Fix the comment from 10MB to 20MB 2025-03-23 16:51:47 +00:00
David Allada
6750c291db Add file based logging in addition to the normal console logs 2025-03-23 16:49:58 +00:00
kingbri
ccf23243c1 Docs: Update getting started with downloading from private repos
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-19 12:02:48 -04:00
kingbri
529c90b93e Tree: Format and lint
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-19 11:55:02 -04:00
kingbri
d990bbc431 Args: Remove action arguments
Superseded by subcommands to perform the same action.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-19 11:53:47 -04:00
kingbri
79f9c6e854 Model: Remove num_experts_per_token
This shouldn't even be an exposed option since changing it always
breaks inference with the model. Let the model's config.json handle
it.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-19 11:52:10 -04:00
kingbri
698d8339cb Config + Docs: Clarify YaRN rope scaling changes
In ExllamaV2, if a model has YaRN support, linear RoPE options are
not applied. Users can set max_seq_len and exl2 will take care of
the rest.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-19 11:47:49 -04:00
Benjamin Oldenburg
a20abe2d33
Bugfix: Chat completion requests fail with UnboundLocalError: finish_reason variable not initialized (#307)
* fix issue #306

* removed whitespaces for ruff
2025-03-15 20:31:21 -04:00
kingbri
d98c0bd3f6 API: Add tools class
Was mistakenly not added in PR 302.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-14 15:07:11 -04:00
Brian
51b32621e1
Update README.md 2025-03-14 15:04:24 -04:00
Benjamin Oldenburg
a2a14ea114
Fix Tool Call JSON Serialization Error (#302)
* Fix Tool Call JSON Serialization Error

* Incorporate changes from PR 292

kingbri note: Adjusts the tool JSON formation and incorporates finish
reasons. Added both authors as co-authors due to edits on this commit
from the original PR.

Co-Authored-by: David Allada <dallada1@vt.edu>
Co-Authored-by: Benjamin Oldenburg <benjamin.oldenburg@ordis.co.th>
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>

* API: Cleanup tool call JSON parsing

Split pre and post-processing of tool calls to its own class. This
cleans up the chat_completion utility module and also fixes the
JSON serialization bug.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>

---------

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Co-authored-by: David Allada <dallada1@vt.edu>
Co-authored-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-14 15:01:33 -04:00
kingbri
de77955428 Docs: Update
Update getting started and server options

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-12 00:41:17 -04:00
David Allada
4196bb6bc8
Update the behavior of start.py so that we can do a full build AND sa… (#293)
* Update the behavior of start.py so that we can do a full build AND save the options, so we can build in a docker image

* Add actual args RIP

* Start: Move start_options write before dependency install message

This ensures that start options are properly written before
determining to exit.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>

---------

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Co-authored-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-11 23:54:34 -04:00
kingbri
73688670a6 Docs: Add model and inline loading documentation
Sorely required due to the number of questions about how inline
loading works.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-25 00:09:18 -05:00
kingbri
35fe372f2b Embeddings: Handle case if embedding input is passed as a string
Infinity expects a list when embedding, so convert to a list if the
input is a string.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-23 00:39:21 -05:00
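
The string-to-list conversion described in 35fe372f2b can be sketched as follows. The function name `as_input_list` is hypothetical and the snippet does not call the Infinity library itself; it only shows the normalization step.

```python
from typing import List, Union


def as_input_list(embedding_input: Union[str, List[str]]) -> List[str]:
    """Wrap a single string in a list; pass lists through as a new list."""
    if isinstance(embedding_input, str):
        return [embedding_input]
    return list(embedding_input)
```

Checking for `str` explicitly matters here, because a string is itself iterable and `list("abc")` would split it into characters.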
kingbri
c580893054 Downloader: log errors when downloading
If an error is returned from HuggingFace, raise it to the calling
function.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-19 23:16:17 -05:00
kingbri
48bb78c614 Logger: Switch to ISO timestamp formatting
I thought this was previously enabled, but turns out I labeled with
the wrong date format.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-19 21:48:23 -05:00
kingbri
d6b8c7db4b Docs: Update getting started guide
Add downloader options and edit some points.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-18 12:17:14 -05:00
kingbri
830301b2b4 Actions: Update and add Wiki publish
Publishes the GitHub wiki and runs these in concurrency groups
to avoid spawning multiple actions at a time.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-17 23:47:38 -05:00
kingbri
5614b342a7 Tree: Migrate docs into repository
This will auto-publish to the GitHub wiki via an action.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-17 23:39:35 -05:00
kingbri
9f649647f0 Model + API: GPU split updates and fixes
For the TP loader, GPU split cannot be an empty array. However,
defaulting the parameter to an empty array makes it easier to calculate
the device list. Therefore, cast an empty array to None using
falsy comparisons at load time.

Also add draft_gpu_split to the load request.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-15 21:50:14 -05:00
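
The falsy cast described in 9f649647f0 can be illustrated with a short sketch. The function name `normalize_gpu_split` is hypothetical and not taken from the repository; the behavior shown is plain Python truthiness.

```python
from typing import List, Optional


def normalize_gpu_split(gpu_split: Optional[List[float]]) -> Optional[List[float]]:
    """Return None for an empty or missing split, otherwise the list unchanged."""
    # An empty list is falsy, so `or None` maps both [] and None to None
    return gpu_split or None
```

Defaulting the parameter to an empty list keeps device-list calculations simple, while the cast at load time gives the loader the `None` it needs.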
Brian
304df16543
Update README.md 2025-02-15 12:14:06 -05:00