On Windows, checking for a command raises a FileNotFound error if
the utility isn't found. This led to complicated logic, which can
be simplified by using which instead.
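
A minimal sketch of the simpler check, assuming "which" here refers to
Python's shutil.which. The helper name has_command is hypothetical and
is not taken from the start script.

```python
import shutil


def has_command(name: str) -> bool:
    """Return True if an executable called `name` can be found on PATH."""
    # shutil.which returns the resolved path as a string, or None when
    # nothing is found, so no exception handling is needed on Windows.
    return shutil.which(name) is not None
```

Because a missing utility yields None rather than an exception, the
same branch works on every platform.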
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
UV is now supported first-party in tabbyAPI's start script, so
add a dedicated section for it and recommend it over miniconda.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Uv is the definitive package installation tool for Python, so add
support to check for it via the start script.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Change the sampling subsection to sampler overrides and add a warning
about the default preset.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
This allows users to use nccl or native depending on the GPU setup.
NCCL is only available with Linux-built wheels.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Works on CUDA 12.4 and up. If CUDA doesn't exist, then don't enable
the backend. This is an env var that needs to be set, so it's not really
possible to set it via config.yml.
This used to be experimental, but it's probably fine to keep it enabled
since it only provides a benefit.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Adding these to each generation chunk helps remove redundancy and
unnecessary request ID operations.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Doing this helps reduce the model's burden of generating the tool
call ID and type (which is always "function"). Follow Mistral's spec
for tool call IDs by using a 9-character alphanumeric string.
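
A minimal sketch of server-side ID generation under these rules. The
function names and the dict layout are assumptions for illustration and
do not reflect tabbyAPI's actual code.

```python
import secrets
import string

# Alphanumeric alphabet: a-z, A-Z, 0-9
ALPHABET = string.ascii_letters + string.digits


def generate_tool_call_id(length: int = 9) -> str:
    """Return a random alphanumeric ID (9 characters by default)."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def make_tool_call(name: str, arguments: str) -> dict:
    """Attach the ID and the fixed type so the model only emits the function."""
    return {
        "id": generate_tool_call_id(),
        "type": "function",  # always "function"
        "function": {"name": name, "arguments": arguments},
    }
```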
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Re-rendering the template is an expensive operation when it's possible
to just concatenate the prompt and current generation text together.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
If a message with role = tool is present, the tool_call_id should
also be given to the template.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
When revisiting tool calls, the formats have more or less become standard.
For greater compatibility with templates, primarily use the message.tools
parameter and remove the extra custom metadata that is no longer required.
However, unlike other backends, tabbyAPI still uses template metadata
to declare what the tool start string is. This allows for template-level
customization and gives more power to the user, while the server simply
consumes that metadata rather than handling templates case by case.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
To render in the template, the tool call start tokens needed fewer
checks. Also remove the line that converts message.tool_calls to a dict,
since that breaks the rest of the chain by disconnecting the types.
model_dump on the message itself already accomplishes this.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
use_as_default was not being properly applied to model overrides.
For compartmentalization's sake, apply all overrides in a single function
to avoid clutter.
In addition, fix where the traditional /v1/model/load endpoint checks
for draft options. These can be applied via an inline config, so let
any failures fall through.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Anything below the first level of kwargs was not being merged properly.
A more bulletproof solution would be to refactor the loading code
to separate draft and normal model parameters.
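
A minimal sketch of a recursive merge that handles nested levels. The
function name deep_merge and the example keys are hypothetical and are
not the project's actual implementation.

```python
from copy import deepcopy


def deep_merge(base: dict, overrides: dict) -> dict:
    """Return a new dict with overrides merged into base at every level."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: recurse instead of replacing wholesale.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged
```

A shallow dict.update would replace a nested dict entirely and drop its
sibling keys, which matches the failure described above.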
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Rather than relying on Content-Length, which can be unreliable, ping
the API to get file sizes and work from there.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Usually, both the client and the server know the file size because a
Content-Length header is sent. However, HuggingFace has changed
their headers and now does not always send Content-Length.
In this case, show an indeterminate progress bar and mark it as complete
once the download finishes.
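
A minimal sketch of the fallback, assuming the response headers arrive
as a plain dict. The helper names parse_total and describe_progress are
hypothetical, and a real progress bar library would replace the strings.

```python
from typing import Optional


def parse_total(headers: dict) -> Optional[int]:
    """Return the file size from Content-Length, or None if it is unusable."""
    value = headers.get("Content-Length")
    try:
        return int(value) if value is not None else None
    except ValueError:
        return None


def describe_progress(downloaded: int, total: Optional[int]) -> str:
    """Show a percentage when the size is known, else an indeterminate label."""
    if total is None or total <= 0:
        return f"{downloaded} bytes (size unknown)"
    return f"{downloaded * 100 // total}%"
```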
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
It's useful for the client to know what the T/s and total time for
generation are per-request.
Works with both completions and chat completions.
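
A minimal sketch of how such per-request figures could be computed. The
function name and the field names total_time and tokens_per_sec are
assumptions and may not match the actual response schema.

```python
def generation_stats(token_count: int, start: float, end: float) -> dict:
    """Compute total generation time in seconds and tokens per second."""
    elapsed = end - start
    # Guard against division by zero for instantaneous or empty generations.
    tokens_per_sec = token_count / elapsed if elapsed > 0 else 0.0
    return {
        "total_time": round(elapsed, 2),
        "tokens_per_sec": round(tokens_per_sec, 2),
    }
```

The start and end values would typically come from time.perf_counter()
taken around the generation call.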
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>