This allows users to choose nccl or native depending on their GPU setup.
NCCL is only available with Linux-built wheels.
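
For illustration, a hypothetical config.yml snippet (the key names here
are assumptions, not necessarily the actual option names):

    model:
      tensor_parallel: true
      # "nccl" requires Linux-built wheels; "native" works everywhere
      tp_backend: native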
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
A common problem in TabbyAPI is that users who want to get up and
running with a model often hit OOMs caused by max_seq_len. This is
because model devs set max context values in the millions, which
require a lot of VRAM.
To idiot-proof first-time setup, make the fallback default 4096 so
users can run their models. If a user still wants the model's native
max_seq_len, they can set it to -1.
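
For example (illustrative snippet; assumes the usual model section in
config.yml):

    model:
      # Safe fallback default; raise this once you know your VRAM budget
      max_seq_len: 4096
      # ...or set -1 to use the model's native context length
      # max_seq_len: -1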
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
A trailing comma after the description string converts it to a tuple,
which isn't parseable by argparse's help.
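
A minimal repro of the bug (illustrative only):

    # A stray trailing comma turns the string into a 1-tuple...
    description = "Load a model into the server",
    assert isinstance(description, tuple)

    # ...while argparse expects a plain str here
    description = "Load a model into the server"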
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Fixes the application of sampler parameters by adding a new sampler
builder interface. Also exposes the generator class-wide and adds
wait_for_jobs. Finally, allows inline loading to specify the backend.
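
A rough sketch of the builder idea (class and method names are
hypothetical, not the actual interface):

    class SamplerBuilder:
        """Accumulates sampler parameters and builds backend settings."""

        def __init__(self):
            self.params = {}

        def set(self, key, value):
            # Skip unset parameters so backend defaults apply
            if value is not None:
                self.params[key] = value
            return self

        def build(self):
            return dict(self.params)

    settings = SamplerBuilder().set("temperature", 0.7).set("top_p", 0.9).build()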
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
This shouldn't even be an exposed option since changing it always
breaks inference with the model. Let the model's config.json handle
it.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Uvloop/Winloop do provide advantages over asyncio's default event
loops (such as the Proactor loop on Windows), so remove the
experimental status.
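
For reference, a minimal way these loops are typically enabled
(illustrative; not necessarily TabbyAPI's exact code):

    import asyncio
    import platform

    if platform.system() == "Windows":
        # Winloop is the uvloop port for Windows
        import winloop
        asyncio.set_event_loop_policy(winloop.EventLoopPolicy())
    else:
        import uvloop
        asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())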
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
The previous code overrode the existing gpu_split and device idx
values. It now sets an independent draft_gpu_split value and
adjusts the gpu_devices check only if the draft_gpu_split array
is larger than the gpu_split array.
The draft GPU split is not tensor parallel, and it defaults to
gpu_split_auto if a split is not provided.
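
Roughly the new check (a sketch; variable names are approximate):

    # The draft model may span more GPUs than the main model
    draft_gpu_split = config.draft_gpu_split or []
    gpu_split = config.gpu_split or []

    if len(draft_gpu_split) > len(gpu_split):
        # Widen the device check to cover the draft split
        gpu_device_count = len(draft_gpu_split)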
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
If an API request sends a dummy model name, the server shouldn't
error, since it caters to clients that expect specific OAI model
names. This is a problem with inline model loading, since those names
would error by default. Therefore, add an exception if the provided
name is in the dummy model names (which also double as inline strict
exceptions). However, the dummy model names weren't configurable, so
add a new option to specify exception names; otherwise the default is
gpt-3.5-turbo.
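
A hypothetical config.yml snippet (the key name is an assumption):

    model:
      # Requests for these names are accepted without an inline load error
      dummy_model_names: ["gpt-3.5-turbo"]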
Signed-off-by: kingbri <bdashore3@proton.me>
Adds the ability to load vision parts of text + image models. Requires
an explicit flag in config because there isn't a way to automatically
determine whether the vision tower should be used.
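
A hypothetical config.yml snippet (the flag name is an assumption):

    model:
      # Also load the vision tower for text + image models
      vision: true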
Signed-off-by: kingbri <bdashore3@proton.me>
There's no native way to handle case insensitivity in pydantic, so
add a validator which converts the API server input to lowercase.
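
A minimal sketch using pydantic v2 (the field name is hypothetical):

    from pydantic import BaseModel, field_validator

    class ServerConfig(BaseModel):
        log_level: str = "info"

        @field_validator("log_level", mode="before")
        @classmethod
        def lowercase_input(cls, value):
            # Accept "INFO", "Info", etc. from the API server input
            return value.lower() if isinstance(value, str) else value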
Signed-off-by: kingbri <bdashore3@proton.me>
This is not ideal because users may still have trouble understanding
what a lora entry includes, but adding an example comment helps more
than leaving a blank line.
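
For instance, the kind of commented example this adds (the entry
fields are illustrative):

    loras:
      # - name: lora1
      #   scaling: 1.0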
Signed-off-by: kingbri <bdashore3@proton.me>
If a sub-field exists in the model provided to the file generator,
use it. Otherwise, always fall back to the default factory. This
prevents subsequent errors caused by setting None.
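
A sketch of the fallback (names hypothetical; field_info mirrors
pydantic's FieldInfo):

    def resolve_field(model, field_name, field_info):
        """Use the provided sub-field if present, else the default factory."""
        value = getattr(model, field_name, None)
        if value is not None:
            return value
        # Never write None into the generated file; build a fresh default
        if field_info.default_factory is not None:
            return field_info.default_factory()
        return field_info.default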
Signed-off-by: kingbri <bdashore3@proton.me>
It makes sense for the LLM model groups to be clustered together,
with the least-used groups towards the bottom.
Signed-off-by: kingbri <bdashore3@proton.me>
These changes fix the number and order of newlines so the output
looks pleasing to the user. However, the changes here are kind of
hacky and need a proper fix that keeps the same level of efficiency.
Signed-off-by: kingbri <bdashore3@proton.me>
Remove access to private attributes and use safer functions. Also
move generalized functions into utils files.
Signed-off-by: kingbri <bdashore3@proton.me>