Jobs are now tracked via request IDs, so each generation task is
uniquely identifiable when a cancellation request comes in.
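A minimal sketch of the idea, assuming a registry keyed by request ID;
the names below are illustrative, not the actual tabbyAPI internals:

```python
import asyncio

# Hypothetical registry: one entry per in-flight generation task
active_jobs: dict[str, asyncio.Task] = {}

async def start_job(request_id: str, coro) -> asyncio.Task:
    task = asyncio.create_task(coro)
    active_jobs[request_id] = task
    # Drop the entry automatically once the task finishes
    task.add_done_callback(lambda _: active_jobs.pop(request_id, None))
    return task

def cancel_job(request_id: str) -> bool:
    # A unique request ID lets us cancel exactly one generation task
    task = active_jobs.get(request_id)
    return task.cancel() if task is not None else False
```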
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
The rope alpha calculation raised an error when max_seq_len wasn't
provided. This was because the model's maximum sequence length was
never stored as the target for the alpha calculation.
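A sketch of the fix: fall back to the model's native max_seq_len so the
alpha target is always defined. The quadratic fit below is the common
NTK-aware approximation; the exact coefficients used in tabbyAPI may
differ.

```python
def calculate_rope_alpha(base_seq_len: int, target_seq_len: int | None) -> float:
    # Previously, a missing target caused the error described above
    if target_seq_len is None:
        target_seq_len = base_seq_len

    ratio = target_seq_len / base_seq_len
    if ratio <= 1:
        return 1.0
    # Common NTK-aware alpha fit (illustrative)
    return -0.13436 + 0.80541 * ratio + 0.28833 * ratio**2
```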
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Newer versions of Python don't allow system package installation
unless --break-system-packages is specified. I'd like to avoid this
if possible.
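For reference, a minimal stdlib sketch of the alternative: installing
into a virtual environment, so packages never touch the system
site-packages (the directory name is arbitrary):

```python
import venv

# Creates ./venv with pip available; installs through this
# environment's pip avoid --break-system-packages entirely
venv.EnvBuilder(with_pip=True).create("venv")
```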
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
PyTorch: v2.7.0 on CUDA 12.8 + ROCm 6.3
Exllamav2: v0.2.9
FA2 (Flash Attention 2): v2.7.4.post1 on CUDA 12.8
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Jobs should be started and immediately cleaned up when calling the
generation stream. Expose a stream_generate function and add it to
the base class, since it's more idiomatic than generate_gen.
The exl2 container's generate_gen function is now internal.
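An illustrative shape of the change; the real signatures and the helper
names (_start_job, _generate_gen, _cleanup_job) are assumptions, not
the actual codebase API:

```python
class BaseModelContainer:
    async def stream_generate(self, request_id: str, prompt: str, **kwargs):
        """Public streaming entrypoint: starts a job and always cleans it up."""
        job = await self._start_job(request_id, prompt, **kwargs)
        try:
            async for chunk in self._generate_gen(job):
                yield chunk
        finally:
            # Runs even if the client disconnects mid-stream
            await self._cleanup_job(job)
```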
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
The model card is a unified structure for sharing model params.
Use it instead of kwargs.
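A rough shape of the idea; the field names here are illustrative, not
the exact tabbyAPI schema:

```python
from pydantic import BaseModel, Field

class ModelCardParameters(BaseModel):
    max_seq_len: int | None = None
    rope_alpha: float | None = None
    cache_size: int | None = None

class ModelCard(BaseModel):
    id: str = "unknown"
    parameters: ModelCardParameters = Field(default_factory=ModelCardParameters)

# A single typed object replaces a loose **kwargs bag:
card = ModelCard(id="my-model", parameters=ModelCardParameters(max_seq_len=4096))
```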
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
One goal is to migrate away from kwargs and use the ModelLoadRequest
instead. However, Pydantic doesn't support async validators, which
makes parsing the inline config impossible since that code path
relies on aiofiles.
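Since validators must be synchronous, the usual workaround is to do the
async I/O before constructing the model and hand Pydantic the already
read data. A sketch under that assumption (field names illustrative):

```python
import aiofiles
from pydantic import BaseModel

class ModelLoadRequest(BaseModel):
    model_name: str
    max_seq_len: int | None = None

async def parse_inline_config(path: str) -> ModelLoadRequest:
    # Do the async read outside the model...
    async with aiofiles.open(path) as f:
        raw = await f.read()
    # ...then let Pydantic validate synchronously
    return ModelLoadRequest.model_validate_json(raw)
```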
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
This is applied across containers, so it doesn't make sense to put
this method in the backend.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
If a model is being unloaded, that means it's being shut down and
no requests should be accepted from then on.
Also, remove model_is_loaded since we now simply check whether the
container is None.
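A sketch of the simplified check; the names here are illustrative:

```python
container = None  # module-level handle to the loaded model container

def check_model_container():
    # No separate model_is_loaded flag anymore: a None container means
    # nothing is loaded (or an unload already cleared it), so reject
    if container is None:
        raise RuntimeError("No model is loaded. Load a model first.")
```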
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
The admin takes priority over the regular user. Therefore, if a model
is loading, ignore all incoming generation requests.
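One possible shape of the gate, assuming a lock held for the duration
of the load (names are hypothetical):

```python
import asyncio

load_lock = asyncio.Lock()

async def guard_generation():
    # Admin loads take priority: while the lock is held, generation
    # requests are rejected instead of being queued behind the load
    if load_lock.locked():
        raise RuntimeError("A model is currently loading. Try again shortly.")
```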
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Maybe move this out of the class entirely, but for now, it makes
sense to encapsulate this logic.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Apparently frontends have updated the "mirostat" parameter to pass a
number. ExllamaV2 expects a boolean, but since most frontends send a
number anyway, just alias mirostat_mode and mirostat together.
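One way to express the alias in Pydantic v2; a sketch, not the exact
tabbyAPI definition:

```python
from pydantic import AliasChoices, BaseModel, Field

class SamplerParams(BaseModel):
    # Accept either key; frontends send a number for both these days
    mirostat_mode: int = Field(
        default=0,
        validation_alias=AliasChoices("mirostat_mode", "mirostat"),
    )

    @property
    def mirostat(self) -> bool:
        # ExllamaV2 wants a boolean: any non-zero mode enables mirostat
        return self.mirostat_mode > 0

# SamplerParams.model_validate({"mirostat": 2}).mirostat  -> True
```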
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
kwargs is pretty ugly when figuring out which arguments to use. The
base request falls back to defaults anyway, so pass in the params
object as is.
However, since Python's typing isn't like TypeScript's, where types
can be transformed, the type hints may show None as a possibility
even though some params always have a value.
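An illustration of that typing wart, with hypothetical fields: the
schema needs Optional so clients can omit values, but the base request
always fills in a default.

```python
from typing import Optional
from pydantic import BaseModel

class BaseSamplerRequest(BaseModel):
    temperature: Optional[float] = 1.0  # always has a value after parsing

def apply_params(params: BaseSamplerRequest) -> float:
    # Type checkers still see Optional[float] here, so a None check is
    # needed even though the default guarantees a value in practice
    return params.temperature if params.temperature is not None else 1.0
```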
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
The base sampler request already specifies the defaults, so don't
unwrap in this way.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Log the parameters passed into the generate_gen function rather than
the generation settings, to reduce complexity.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
* Add non-JSON versions of `tools` and `functions` to `template_vars`.
Increases compatibility with vLLM templates, which use a non-JSON tools object.
* Add a list of tool template variables to the documentation
* Use Jinja templates to provide `tools_json` and `functions_json`
This should be functionally equivalent, but the JSON won't be produced
unless it's needed (see the Jinja sketch below).
* Make message.tool_calls match the JSON from ToolCallProcessor
* Log something when generating tool calls
* Add template for Qwen QwQ 32b
* Only log if tool calls have been detected
* API: Fix tool call variable assignments
Jinja does not invoke functions when a variable is referenced, so use
json.dumps instead. In addition, log the request ID when stating that
a tool call was fired.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
* Add `ToolCallProcessor.dump()` to get the list of processed dicts
* Remove qwen_qwq_32b.jinja
This will be added to the following repository at a later date:
https://github.com/theroyallab/llm-prompt-templates
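A sketch of the final approach described above: render both forms,
serializing with json.dumps up front since Jinja won't call a function
just because a variable is referenced. The template string and tool
list are made up for illustration.

```python
import json
from jinja2 import Environment

env = Environment()
template = env.from_string(
    "{% if tools %}Available tools: {{ tools_json }}{% endif %}"
)

tools = [{"type": "function", "function": {"name": "get_weather"}}]
print(
    template.render(
        tools=tools,                   # plain list, vLLM-style templates
        tools_json=json.dumps(tools),  # pre-serialized JSON string
    )
)
```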
---------
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Co-authored-by: kingbri <8082010+kingbri1@users.noreply.github.com>
This shouldn't even be an exposed option since changing it always
breaks inference with the model. Let the model's config.json handle
it.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
In ExllamaV2, if a model has YaRN support, linear RoPE options are
not applied. Users can set max_seq_len and exl2 will take care of
the rest.
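A hypothetical sketch of that behavior; the real check lives in the
exl2 loading path, and the use_yarn flag name is an assumption:

```python
def apply_rope_settings(config, rope_scale: float | None, rope_alpha: float | None):
    # Hypothetical flag: model ships with YaRN, so exl2 scales from
    # max_seq_len on its own and linear RoPE options would conflict
    if getattr(config, "use_yarn", False):
        return
    if rope_scale is not None:
        config.scale_pos_emb = rope_scale
    if rope_alpha is not None:
        config.scale_alpha_value = rope_alpha
```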
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
* Fix Tool Call JSON Serialization Error
* Incorporate changes from PR 292
kingbri note: Adjusts the tool JSON formation and incorporates finish
reasons. Added both authors as co-authors due to edits on this commit
from the original PR.
Co-authored-by: David Allada <dallada1@vt.edu>
Co-authored-by: Benjamin Oldenburg <benjamin.oldenburg@ordis.co.th>
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
* API: Cleanup tool call JSON parsing
Split pre- and post-processing of tool calls into its own class (see
the sketch below). This cleans up the chat_completion utility module
and also fixes the JSON serialization bug.
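A rough shape of that split; dump() comes from the earlier commit, but
the other method names here are illustrative:

```python
import json
from dataclasses import dataclass, field

@dataclass
class ToolCallProcessor:
    tool_calls: list[dict] = field(default_factory=list)

    @classmethod
    def from_generation(cls, text: str) -> "ToolCallProcessor":
        # Pre-processing: parse the model's raw output into tool call dicts
        return cls(tool_calls=json.loads(text))

    def dump(self) -> list[dict]:
        # Post-processing: plain dicts that json.dumps can serialize,
        # avoiding the serialization bug mentioned above
        return list(self.tool_calls)
```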
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
---------
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Co-authored-by: David Allada <dallada1@vt.edu>
Co-authored-by: kingbri <8082010+kingbri1@users.noreply.github.com>