The Granite3 default template uses the strftime_now function.
Currently, Jinja2 raises an exception because strftime_now is
undefined, so the /v1/chat/completions endpoint doesn't work with
these models when a template from the model metadata is used.
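One possible fix, sketched below: define strftime_now and register it
on the template environment's globals (names and setup here are
illustrative, not the exact TabbyAPI code):

```python
from datetime import datetime

from jinja2.sandbox import ImmutableSandboxedEnvironment


def strftime_now(fmt: str) -> str:
    """Return the current time formatted with the given strftime string."""
    return datetime.now().strftime(fmt)


# Hypothetical environment setup; the project's actual wiring may differ
env = ImmutableSandboxedEnvironment(trim_blocks=True, lstrip_blocks=True)
env.globals["strftime_now"] = strftime_now

template = env.from_string("Today is {{ strftime_now('%Y-%m-%d') }}.")
print(template.render())
```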
The api-servers arg is already passed when running subcommands, so use
that instead of replicating the arg.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Migrate OpenAPI and sample config export to subcommands "export-openapi"
and "export-config".
Also add a "download" subcommand that passes args to the TabbyAPI
downloader. This allows models to be downloaded via the API and
CLI args.
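Roughly, the CLI shape this produces, sketched with argparse (the arg
names beyond the subcommands themselves are assumptions):

```python
import argparse

# A parent parser holds args shared by the root command and the
# subcommands, so --api-servers isn't redeclared per subcommand
shared = argparse.ArgumentParser(add_help=False)
shared.add_argument("--api-servers", nargs="+", default=["oai"])  # assumed default

parser = argparse.ArgumentParser(prog="tabbyapi", parents=[shared])
subparsers = parser.add_subparsers(dest="command")

# Export subcommands
subparsers.add_parser("export-openapi", parents=[shared])
subparsers.add_parser("export-config", parents=[shared])

# Download subcommand forwards its args to the downloader
download = subparsers.add_parser("download", parents=[shared])
download.add_argument("repo_id")
download.add_argument("--revision", default="main")

args = parser.parse_args()
```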
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Uvloop/Winloop provides real advantages over asyncio's standard event
loops (e.g. the Proactor loop on Windows), so remove the experimental
status.
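For context, the swap itself is small (a sketch; the project's actual
wiring may differ):

```python
import asyncio
import sys

# Pick the accelerated event loop for the current platform; fall back
# to stdlib asyncio if neither package is installed
try:
    if sys.platform == "win32":
        from winloop import EventLoopPolicy  # uvloop port for Windows
    else:
        from uvloop import EventLoopPolicy
    asyncio.set_event_loop_policy(EventLoopPolicy())
except ImportError:
    pass  # standard loop (e.g. Proactor on Windows) is used instead
```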
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Torch - 2.6.0
ExllamaV2 - 0.2.8
Flash-attn - 2.7.4.post1
CUDA wheels are now built for 12.4 instead of 12.1, so feature names
need to be migrated over.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
The previous code overrode the existing gpu_split and device idx
values. This now sets an independent draft_gpu_split value and
adjusts the gpu_devices check only if the draft_gpu_split array
is larger than the gpu_split array.
The draft gpu split is not tensor parallel, and it defaults to
gpu_split_auto if a split is not provided.
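A simplified sketch of the new logic (variable names assumed, not the
exact diff):

```python
def resolve_devices(gpu_split: list[float], draft_gpu_split: list[float]) -> dict:
    """Assumed names; a simplified sketch, not the exact diff."""
    # No explicit draft split -> the draft model falls back to autosplit
    draft_gpu_split_auto = not draft_gpu_split

    # Only widen the visible device range when the draft split spans
    # more GPUs than the main model's split
    gpu_devices = list(range(max(len(gpu_split), len(draft_gpu_split))))

    return {
        "draft_gpu_split_auto": draft_gpu_split_auto,
        "gpu_devices": gpu_devices,
    }
```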
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Was against this for a while because long timestamps clog the console,
but it makes sense to know when something goes wrong.
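For illustration, with stdlib logging the change is one formatter
tweak (TabbyAPI's actual logger setup may differ):

```python
import logging

# A short 24-hour timestamp keeps console lines from getting too long
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s: %(message)s",
    datefmt="%H:%M:%S",
)
logging.getLogger(__name__).info("Model loaded")
```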
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
On a basic Python class, mutable class attributes are shared by
reference, meaning that every instance of embeddings would attach to
that shared reference and allocate more memory.
Switch to a Pydantic class with default factories when instantiating.
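The pitfall and the fix, in miniature (names are illustrative):

```python
from pydantic import BaseModel, Field


class BadContainer:
    # Class attribute: one shared list for every instance
    embeddings: list = []


a, b = BadContainer(), BadContainer()
a.embeddings.append([0.1, 0.2])
assert b.embeddings  # b sees a's data; memory accumulates across instances


class GoodContainer(BaseModel):
    # default_factory builds a fresh list per instance
    embeddings: list = Field(default_factory=list)


c, d = GoodContainer(), GoodContainer()
c.embeddings.append([0.1, 0.2])
assert not d.embeddings  # each instance owns its own list
```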
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
The previous template was compatible with Jinja2 in Python, but it
was not cross-platform compatible according to HF's standards (a
template must also render under non-Python Jinja implementations).
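An example of the kind of change involved: Python method calls get
swapped for built-in Jinja filters, which also work in ports like
minijinja (snippet is illustrative, not the actual template):

```python
# Works only in Python's Jinja2: .strip() is a Python str method
python_only = "{{ message['content'].strip() }}"

# Portable per HF's chat template guidance: use the trim filter instead
cross_platform = "{{ message['content'] | trim }}"
```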
Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
The props endpoint is a standard used by llamacpp APIs which returns
various properties of the loaded model to the client. It's still
recommended to use /v1/model to get all the parameters a TabbyAPI
model has.
Also include the contents of the prompt template when fetching the
current model.
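A rough sketch of such a route in FastAPI (the field names follow
llamacpp's /props response, but the template helper is hypothetical):

```python
from fastapi import APIRouter

router = APIRouter()


def load_prompt_template_text() -> str:
    """Hypothetical helper; TabbyAPI would read the active template here."""
    return "{{ messages }}"


@router.get("/props")
async def get_props() -> dict:
    # Illustrative payload mirroring llamacpp's /props fields
    return {
        "chat_template": load_prompt_template_text(),
        "default_generation_settings": {},
        "total_slots": 1,
    }
```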
Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
* Ensure that the length of the positive/negative prompt plus max_tokens does not exceed max_seq_len
* Ensure that the total pages required for a CFG request do not exceed the allocated cache_size (see the sketch below)
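A sketch of both checks under assumed names (the page size and the
function signature are illustrative):

```python
PAGE_SIZE = 256  # assumed cache page granularity


def validate_cfg_request(
    prompt_len: int,
    negative_len: int,
    max_tokens: int,
    max_seq_len: int,
    cache_size: int,
) -> None:
    # Check 1: either prompt plus the requested tokens must fit the context
    if max(prompt_len, negative_len) + max_tokens > max_seq_len:
        raise ValueError("prompt + max_tokens exceeds max_seq_len")

    # Check 2: CFG runs two sequences, so pages for both must fit the cache
    def pages(tokens: int) -> int:
        return -(-tokens // PAGE_SIZE)  # ceiling division

    total = pages(prompt_len + max_tokens) + pages(negative_len + max_tokens)
    if total * PAGE_SIZE > cache_size:
        raise ValueError("CFG request exceeds the allocated cache_size")
```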
Most software has moved to CUDA 12, and cards that still require 11.8
don't use tabby anyway.
Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
The vision module from the ExllamaV2 backend is used in files outside
the backend's contained folder. Therefore, import ExllamaV2 as an
optional dependency here.
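The usual optional-import pattern, sketched (the exact symbol name may
differ):

```python
# Import ExllamaV2's vision support lazily so environments without the
# backend installed can still import this module
try:
    from exllamav2 import ExLlamaV2VisionTower  # actual symbol may differ

    HAS_EXLLAMAV2 = True
except ImportError:
    ExLlamaV2VisionTower = None
    HAS_EXLLAMAV2 = False
```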
Signed-off-by: kingbri <bdashore3@proton.me>