Commit graph

12 commits

kingbri
113643c0df Main: Enable cudaMallocAsync backend by default
Works on CUDA 12.4 and up. If CUDA isn't available, the backend is not enabled. The backend is selected through an environment variable that has to be set before startup, so it can't be configured via config.yml.

This used to be experimental, but it's probably fine to keep it enabled
since it only provides a benefit.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-27 22:31:38 -04:00
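
For illustration only: the commit doesn't name the variable, but PyTorch selects its CUDA allocator backend through the PYTORCH_CUDA_ALLOC_CONF environment variable, which must be set before CUDA is initialized. A minimal sketch of the conditional enable described above, assuming that env var is the mechanism used:

```python
import os
import shutil

def maybe_enable_cuda_malloc_async() -> None:
    # Hypothetical check: treat the presence of nvidia-smi as "CUDA exists".
    # Whether tabbyAPI uses exactly this check is an assumption.
    if shutil.which("nvidia-smi") is None:
        return  # No CUDA: leave the default allocator alone

    # Must run before `import torch` (or at least before CUDA init),
    # which is why a config.yml option would be too late.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "backend:cudaMallocAsync")

maybe_enable_cuda_malloc_async()
```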
kingbri
e77fa0b7a8 Docs: Edit inline loading for breaking changes
Add the model key for the YAML examples.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-24 18:11:42 -04:00
kingbri
1c3f84151f Docs: Update tool calling
For new variables and format.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-05 21:43:04 -04:00
kingbri
7900b72848 API: Add chat_template_kwargs alias for template_vars
This key is used in VLLM and SGLang.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 15:48:39 -04:00
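
As a rough illustration of the alias: a client can send either key in an OpenAI-style request body. The endpoint, port, and the `enable_thinking` variable below are placeholders, not taken from the commit:

```python
import requests  # assumed HTTP client; any client works

payload = {
    "model": "my-model",  # placeholder model name
    "messages": [{"role": "user", "content": "Hello"}],
    # Either key should be accepted after this commit; "chat_template_kwargs"
    # mirrors what vLLM/SGLang clients already send.
    "chat_template_kwargs": {"enable_thinking": False},  # hypothetical template var
    # "template_vars": {"enable_thinking": False},       # equivalent spelling
}

resp = requests.post(
    "http://localhost:5000/v1/chat/completions",  # assumed local endpoint
    json=payload,
    timeout=60,
)
print(resp.json())
```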
Andrew Phillips
436ce752da Support more common tool variables in templates (tools, message.tool_calls) (#308)
* Add non-JSON version of `tools` and `functions` to `template_vars`.

Increases compatibility with VLLM templates that use a non-JSON tools object.

* Add list of tool template variables to the documentation

* Use Jinja templates to provide `tools_json` and `functions_json`

This should be functionally equivalent, but the JSON won't be produced
unless it's needed.

* Make message.tool_calls match the JSON from ToolCallProcessor

* Log something when generating tool calls

* Add template for Qwen QwQ 32b

* Only log if tool calls have been detected

* API: Fix tool call variable assignments

Jinja functions do not run when they are referenced like plain variables,
so use json.dumps instead (see the sketch after this entry). In addition,
log the request ID when stating that a tool call was fired.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>

* Add `ToolCallProcessor.dump()` to get the list of processed dicts

* Remove qwen_qwq_32b.jinja

This will be added to the following repository at a later date:
https://github.com/theroyallab/llm-prompt-templates

---------

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Co-authored-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-23 13:23:00 -04:00
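
To make the bullet points above concrete, here is a rough sketch of how the template variables could be assembled: plain `tools`/`functions` objects for templates that iterate over them, plus `tools_json`/`functions_json` built eagerly with json.dumps, as the fix above describes. The helper name and the way `functions` is derived are illustrative assumptions, not code from the PR:

```python
import json

def build_tool_template_vars(tools: list[dict]) -> dict:
    """Illustrative only: expose tools both as objects and as JSON strings."""
    # Assumed OpenAI-style shape: {"type": "function", "function": {...}}
    functions = [t.get("function", t) for t in tools]
    return {
        # Non-JSON versions, for VLLM-style templates that iterate over tools
        "tools": tools,
        "functions": functions,
        # JSON strings built up front, since Jinja does not call functions
        # that are referenced like plain variables
        "tools_json": json.dumps(tools),
        "functions_json": json.dumps(functions),
    }
```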
kingbri
ccf23243c1 Docs: Update getting started with downloading from private repos
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-19 12:02:48 -04:00
kingbri
79f9c6e854 Model: Remove num_experts_per_token
This shouldn't even be an exposed option since changing it always
breaks inference with the model. Let the model's config.json handle
it.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-19 11:52:10 -04:00
kingbri
698d8339cb Config + Docs: Clarify YaRN rope scaling changes
In ExllamaV2, if a model has YaRN support, linear RoPE options are
not applied. Users can set max_seq_len and exl2 will take care of
the rest.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-19 11:47:49 -04:00
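
As a hedged illustration of "set max_seq_len and exl2 will take care of the rest": for a YaRN-capable model, only the context length needs to be set, and the linear RoPE keys (names assumed here, not taken from the commit) are simply left unset. Expressed as the equivalent Python dict rather than config.yml:

```python
# Illustrative model config for a YaRN-capable model (key names are
# assumptions based on common exl2/tabbyAPI options).
model_config = {
    "max_seq_len": 32768,   # the only knob the user needs to touch
    # "rope_scale": ...,    # intentionally unset: not applied when YaRN is used
    # "rope_alpha": ...,    # intentionally unset: not applied when YaRN is used
}
```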
kingbri
de77955428 Docs: Update
Update getting started and server options

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-12 00:41:17 -04:00
kingbri
73688670a6 Docs: Add model and inline loading documentation
Sorely needed due to the number of questions about how inline
loading works.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-25 00:09:18 -05:00
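
Purely as an illustration of what "inline loading" means in practice (the exact option name and behavior are assumptions, not stated in the commit): with inline model loading enabled, a request that names an unloaded model asks the server to load it before generating.

```python
# Hypothetical chat completion payload: with inline model loading enabled in
# config.yml (option name assumed), naming an unloaded model here triggers a
# load from the models directory before the response is generated.
inline_load_request = {
    "model": "some-exl2-model",  # placeholder: folder name under the models dir
    "messages": [{"role": "user", "content": "ping"}],
}
```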
kingbri
d6b8c7db4b Docs: Update getting started guide
Add downloader options and edit some points.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-18 12:17:14 -05:00
kingbri
5614b342a7 Tree: Migrate docs into repository
This will auto-publish to the GitHub wiki via an action.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-17 23:39:35 -05:00