* Update start.py so that it can do a full build AND save the start options, which allows building inside a Docker image
* Add actual args RIP
* Start: Move start_options write before dependency install message
This ensures that start options are properly written before
deciding whether to exit.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
---------
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Co-authored-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Infinity expects a list when embedding, so convert to a list if the
input is a string.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
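The string-to-list conversion described in the embedding fix above can be sketched as follows. This is a minimal illustration, not the actual TabbyAPI code: the function name `normalize_embedding_input` is hypothetical, and the only behavior taken from the commit message is that a lone string is wrapped in a list.

```python
from typing import List, Union


def normalize_embedding_input(text: Union[str, List[str]]) -> List[str]:
    """Return the input as a list of strings (hypothetical helper)."""
    # A bare string is wrapped so the backend always receives a list
    if isinstance(text, str):
        return [text]
    # Lists are copied so the caller's object is never mutated downstream
    return list(text)
```

Calling `normalize_embedding_input("hello")` yields a one-element list, while a list input passes through unchanged.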
I thought this was previously enabled, but it turns out I labeled it
with the wrong date format.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Publishes the GitHub wiki and runs these actions in concurrency groups
to avoid spawning multiple actions at a time.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
For the TP loader, GPU split cannot be an empty array. However,
defaulting the parameter to an empty array makes it easier to calculate
the device list. Therefore, cast an empty array to None using
falsy comparisons at load time.
Also add draft_gpu_split to the load request.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
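A rough sketch of the falsy cast described in the GPU split commit above. Both function names are hypothetical and the real loader call is not shown. The point is only that an empty list default keeps the device calculation simple, while `or None` converts it at load time.

```python
from typing import List, Optional


def used_devices(gpu_split: List[float]) -> List[int]:
    """Device indices implied by the split (hypothetical helper).

    An empty-list default needs no None check here: len([]) is simply 0.
    """
    return list(range(len(gpu_split)))


def split_for_tp_loader(gpu_split: List[float]) -> Optional[List[float]]:
    """Cast an empty split to None before handing it to the loader."""
    # An empty list is falsy, so `or None` maps [] to None
    return gpu_split or None
```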
The Granite3 default template uses the strftime_now function.
Currently, Jinja2 raises an exception because strftime_now is undefined, so the /v1/chat/completions endpoint doesn't work with these models when a template from the model metadata is used.
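One way to provide the missing function is sketched below. This is an assumption about the shape of the fix, not the code from this commit: a small helper that formats the current time, which could then be exposed to templates through a Jinja2 environment's `globals` mapping (for example `environment.globals["strftime_now"] = strftime_now`).

```python
from datetime import datetime


def strftime_now(format_string: str) -> str:
    """Format the current local time with a strftime pattern."""
    return datetime.now().strftime(format_string)


# Example call, similar to what a chat template might request
today = strftime_now("%B %d, %Y")
```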
The api-servers arg is passed when running subcommands, so use that
instead of replicating the arg again.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Migrate OpenAPI and sample config export to subcommands "export-openapi"
and "export-config".
Also add a "download" subcommand that passes args to the TabbyAPI
downloader. This allows models to be downloaded via the API and
CLI args.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
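The subcommand layout described above could be wired up with argparse roughly as follows. Only the three subcommand names come from the commit message. The program name and the `repo_id` and `--revision` arguments are hypothetical placeholders, not the real downloader arguments.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the three subcommands (illustrative only)."""
    parser = argparse.ArgumentParser(prog="tabby-sketch")
    subparsers = parser.add_subparsers(dest="command")

    # Export subcommands take no extra arguments in this sketch
    subparsers.add_parser("export-openapi", help="Export the OpenAPI schema")
    subparsers.add_parser("export-config", help="Export a sample config")

    # Hypothetical downloader arguments, forwarded to the downloader
    download = subparsers.add_parser("download", help="Download a model")
    download.add_argument("repo_id")
    download.add_argument("--revision", default=None)
    return parser


args = build_parser().parse_args(["download", "some/repo", "--revision", "main"])
```

With no subcommand given, `args.command` is `None`, so the caller can fall back to starting the server normally.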
Uvloop/Winloop do provide advantages for asyncio over the standard
Proactor loop, so remove the experimental status.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Torch - 2.6.0
ExllamaV2 - 0.2.8
Flash-attn - 2.7.4.post1
CUDA wheels are now 12.4 instead of 12.1, so feature names need to be
migrated over.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
The previous code overrode the existing gpu split and device idx
values. This now sets an independent draft_gpu_split value and
adjusts the gpu_devices check only if the draft_gpu_split array
is larger than the gpu_split array.
Draft gpu split is not Tensor Parallel, and defaults to gpu_split_auto
if a split is not provided.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
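The device check described in the draft split commit above might look something like this. It is a sketch under assumptions: the function names are hypothetical, and the only rules taken from the commit message are that the draft split is tracked independently, that it widens the device check only when it is longer than gpu_split, and that an empty draft split falls back to auto split.

```python
from typing import List


def gpu_devices_to_check(gpu_split: List[float],
                         draft_gpu_split: List[float]) -> List[int]:
    """Device indices that need checking (hypothetical helper)."""
    device_count = len(gpu_split)
    # The draft split only matters when it spans more devices
    if len(draft_gpu_split) > device_count:
        device_count = len(draft_gpu_split)
    return list(range(device_count))


def use_draft_auto_split(draft_gpu_split: List[float]) -> bool:
    """Fall back to auto split when no draft split is provided."""
    return not draft_gpu_split
```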
Was against this for a while due to the length of timestamps clogging
the console, but it makes sense to know when something goes wrong.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
On a basic Python class, class attributes are shared by reference,
meaning that every instance of embeddings would attach to that reference
and allocate more memory.
Switch to a Pydantic class and factory methods when instantiating.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
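The shared-reference pitfall described above can be reproduced with the standard library alone. To keep the sketch dependency-free, a dataclass with `default_factory` stands in for the Pydantic class. Pydantic's `Field(default_factory=...)` plays the analogous role. The class names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


class SharedContainer:
    # Class attribute: a single list object shared by every instance
    loaded: List[str] = []


@dataclass
class PerInstanceContainer:
    # The factory runs once per instance, so each gets its own list
    loaded: List[str] = field(default_factory=list)


a, b = SharedContainer(), SharedContainer()
a.loaded.append("model-a")  # also visible through b, same object

c, d = PerInstanceContainer(), PerInstanceContainer()
c.loaded.append("model-c")  # d.loaded is unaffected
```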
The previous template was compatible with Jinja2 in Python, but it
was not cross-platform compatible according to HF's standards.
Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
The props endpoint is a standard used by llamacpp APIs which returns
various properties of a model to a server. It's still recommended to
use /v1/model to get all the parameters a TabbyAPI model has.
Also include the contents of a prompt template when fetching the current
model.
Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
* Ensure that length of positive/negative prompt + max_tokens does not exceed max_seq_len
* Ensure that total required pages for CFG request does not exceed allocated cache_size
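The two checks listed above can be sketched as follows. This is an illustration under stated assumptions: the function name is hypothetical, the page size of 256 tokens is an assumed value, and cache_size is assumed to be measured in tokens.

```python
import math

PAGE_SIZE = 256  # assumed cache page size in tokens


def validate_cfg_request(positive_len: int, negative_len: int,
                         max_tokens: int, max_seq_len: int,
                         cache_size: int, page_size: int = PAGE_SIZE) -> int:
    """Return the pages a CFG request needs, or raise ValueError."""
    # Check 1: each prompt plus the generation budget must fit max_seq_len
    for name, length in (("positive", positive_len), ("negative", negative_len)):
        if length + max_tokens > max_seq_len:
            raise ValueError(f"{name} prompt + max_tokens exceeds max_seq_len")

    # Check 2: both sequences are cached, so their pages are summed
    pages = sum(
        math.ceil((length + max_tokens) / page_size)
        for length in (positive_len, negative_len)
    )
    if pages * page_size > cache_size:
        raise ValueError("CFG request needs more pages than cache_size allows")
    return pages
```

For example, prompts of 100 and 50 tokens with max_tokens of 100 need one page each under these assumptions, so the call returns 2.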
Most software has moved to CUDA 12 and cards that aren't supported by
11.8 don't use tabby anyways.
Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>