A common problem in TabbyAPI is that users who want to get up and
running with a model often hit OOMs caused by max_seq_len. This is
because model devs set max context values in the millions, which
requires a lot of VRAM.
To idiot-proof first-time setup, make the fallback default 4096 so
users can run their models. Users who still want the model's own
max_seq_len can set the value to -1.
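A minimal sketch of the fallback rule described above, not the actual
TabbyAPI code. The function name and the cap at the model's own maximum
are assumptions made for illustration.

```python
from typing import Optional

FALLBACK_MAX_SEQ_LEN = 4096  # fallback default named in this commit


def resolve_max_seq_len(requested: Optional[int], model_max: int) -> int:
    """Pick the context length to allocate (hypothetical helper)."""
    if requested is None:
        # Nothing configured: use the fallback. Capping at the model's
        # own maximum is an assumption, not something the commit states.
        return min(FALLBACK_MAX_SEQ_LEN, model_max)
    if requested == -1:
        # -1 opts back in to the model's advertised maximum
        return model_max
    return requested
```

With a model that advertises a context of 1,000,000 tokens, an unset
value resolves to 4096 and -1 resolves to 1,000,000.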
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Some packages such as ExllamaV2 and V3 require specific versions for
the latest features. Rather than creating repetitive functions, create
a package-agnostic function that checks the installed version and
tells the user to upgrade.
The error is also sent back on load and unload requests, so keep it
short.
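A rough sketch of what a package-agnostic check could look like, using
only the standard library. The function names and the error wording are
hypothetical, and the version parsing is deliberately simplistic; a real
implementation would more likely rely on a proper version parser.

```python
from importlib import metadata
from itertools import takewhile
from typing import Callable, Optional, Tuple


def _parse(version: str) -> Tuple[int, ...]:
    """Turn '0.2.3' or '0.2.3+cu121' into (0, 2, 3). Simplified on purpose."""
    public = version.split("+")[0]
    return tuple(
        int("".join(takewhile(str.isdigit, piece)) or 0)
        for piece in public.split(".")
    )


def _is_older(installed: str, required: str) -> bool:
    a, b = _parse(installed), _parse(required)
    width = max(len(a), len(b))
    # Pad with zeros so '0.3' and '0.3.0' compare as equal
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a < b


def check_package_version(
    name: str,
    required: str,
    get_version: Callable[[str], str] = metadata.version,
) -> Optional[str]:
    """Return a short error string, or None if the requirement is met."""
    try:
        installed = get_version(name)
    except metadata.PackageNotFoundError:
        return f"{name} is not installed."
    if _is_older(installed, required):
        return f"{name} {installed} is too old; upgrade to {required} or newer."
    return None
```

Returning a string instead of raising keeps the message short enough to
forward in an API response as well as print to the console.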
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
The HFModel class coalesces the various config files that contain
scattered keys required for model usage.
Adding this base class allows us to adapt as HuggingFace changes
their JSON schemas over time, reducing the burden on backend devs
when their next model isn't supported.
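To illustrate the idea of coalescing config files, here is a hedged
sketch. HFModelSketch and its methods are hypothetical and do not mirror
the real HFModel class; only the file names and JSON keys are standard
HuggingFace ones.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict


def _read_json(path: Path) -> Dict[str, Any]:
    """Return the parsed JSON file, or an empty dict if it is missing."""
    if not path.is_file():
        return {}
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


@dataclass
class HFModelSketch:
    """Hypothetical stand-in: one object wrapping several config files."""

    config: Dict[str, Any] = field(default_factory=dict)
    generation_config: Dict[str, Any] = field(default_factory=dict)
    tokenizer_config: Dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_directory(cls, model_dir: Path) -> "HFModelSketch":
        return cls(
            config=_read_json(model_dir / "config.json"),
            generation_config=_read_json(model_dir / "generation_config.json"),
            tokenizer_config=_read_json(model_dir / "tokenizer_config.json"),
        )

    def add_eos_token(self) -> bool:
        # Defaulting to False when the key is absent is an assumption
        return bool(self.tokenizer_config.get("add_eos_token", False))


def demo() -> HFModelSketch:
    """Write two small config files to a temp dir and load them back."""
    with tempfile.TemporaryDirectory() as tmp:
        model_dir = Path(tmp)
        (model_dir / "config.json").write_text(
            json.dumps({"max_position_embeddings": 8192}), encoding="utf-8"
        )
        (model_dir / "tokenizer_config.json").write_text(
            json.dumps({"add_eos_token": True}), encoding="utf-8"
        )
        return HFModelSketch.from_directory(model_dir)
```

Callers then ask the wrapper for a value instead of knowing which file
holds the key, so a schema change only touches one class.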
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
This parameter is way too confusing and does not make sense in
the modern LLM space.
Change approved by all maintainers.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
If an inference dep isn't present, force exit the application. This
occurs after all subcommands have been appropriately processed.
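A small sketch of such a startup check, using importlib to probe for
modules without importing them. The helper names are hypothetical, and
exiting when any listed module is missing is an assumption about the
intended policy.

```python
import importlib.util
import sys
from typing import Iterable, List


def missing_deps(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def require_inference_deps(names: Iterable[str]) -> None:
    """Exit the process with a short message if a dependency is missing."""
    missing = missing_deps(names)
    if missing:
        # sys.exit with a string prints it to stderr and exits with code 1
        sys.exit("Missing inference dependencies: " + ", ".join(missing))
```

Calling this only after subcommands have run lets install or update
subcommands work on a machine that has no backend yet.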
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Seemed out of place in the common load function. In addition, rename
the transformers utils signature which actually takes a directory
instead of a file.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Adding a comma in the description converts the string to a tuple,
which isn't parseable by argparse's help.
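For reference, one way this kind of bug shows up in Python is a stray
trailing comma on an assignment. The variable names and help text below
are illustrative only, not taken from the codebase.

```python
import argparse

# A stray trailing comma turns the string into a one-element tuple
broken_help = "Path to the model directory",
fixed_help = "Path to the model directory"

print(type(broken_help).__name__)  # tuple
print(type(fixed_help).__name__)   # str

# argparse expects help text to be a plain string
parser = argparse.ArgumentParser(prog="demo")
parser.add_argument("--model-dir", help=fixed_help)
help_text = parser.format_help()
```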
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Fixes application of sampler parameters by adding a new sampler builder
interface. Also expose the generator class-wide and add wait_for_jobs.
Finally, allow inline loading to specify the backend.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
This stub fetches the add_eos_token field from the HF tokenizer config.
Ideally, this should be in the backend rather than tabby.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Jobs should be started and immediately cleaned up when calling the
generation stream. Expose a stream_generate function and add
it to the base class since it's more idiomatic than generate_gen.
The exl2 container's generate_gen function is now internal.
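The pattern can be sketched with an async generator that registers a
job, streams from an internal generator, and cleans up in a finally
block. SketchContainer, its fields, and the whitespace "tokenizer" are
hypothetical; only the stream_generate and generate_gen names come from
this commit.

```python
import asyncio
from typing import AsyncIterator, List


class SketchContainer:
    """Hypothetical container; not the real exl2 container."""

    def __init__(self) -> None:
        self.active_jobs: List[str] = []

    async def _generate_gen(self, prompt: str) -> AsyncIterator[str]:
        # Internal generator: stand-in that yields whitespace-split "tokens"
        for token in prompt.split():
            await asyncio.sleep(0)
            yield token

    async def stream_generate(
        self, request_id: str, prompt: str
    ) -> AsyncIterator[str]:
        # Public entry point: start the job, stream, then always clean up
        self.active_jobs.append(request_id)
        try:
            async for chunk in self._generate_gen(prompt):
                yield chunk
        finally:
            self.active_jobs.remove(request_id)


async def collect(container: SketchContainer, prompt: str) -> List[str]:
    return [chunk async for chunk in container.stream_generate("req-1", prompt)]
```

Keeping the cleanup in the public wrapper means callers cannot forget
it, which is the main reason to hide the raw generator.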
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
One goal is to migrate away from kwargs and use the ModelLoadRequest
instead. However, Pydantic doesn't support async validators, which makes
parsing the inline config impossible because it relies on aiofiles.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
This is applied across containers. Doesn't make sense to put this method
in the backend.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
The admin takes priority over the regular user. Therefore, if a model
is loading, ignore all incoming generation requests.
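A toy asyncio sketch of the rule above. ModelStateSketch, the loading
flag, and the error message are all assumptions for illustration; the
real server may signal rejection differently.

```python
import asyncio
from typing import List


class ModelStateSketch:
    """Hypothetical holder for the 'a load is in progress' flag."""

    def __init__(self) -> None:
        self.loading = False

    async def load(self, seconds: float = 0.05) -> None:
        self.loading = True
        try:
            await asyncio.sleep(seconds)  # stand-in for the real model load
        finally:
            self.loading = False

    async def generate(self, prompt: str) -> str:
        if self.loading:
            # The admin's load wins; the user's request is turned away
            raise RuntimeError("Model is loading; generation request ignored.")
        return f"echo: {prompt}"


async def demo() -> List[str]:
    state = ModelStateSketch()
    results: List[str] = []
    load_task = asyncio.create_task(state.load())
    await asyncio.sleep(0)  # yield once so the load task can start
    try:
        await state.generate("hi")
    except RuntimeError:
        results.append("rejected")
    await load_task
    results.append(await state.generate("hi"))
    return results
```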
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Apparently the "mirostat" parameter has been updated by frontends
to pass a number. ExllamaV2 expects a boolean, but most pass a number
anyway, so just alias mirostat_mode and mirostat together.
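A hedged sketch of treating the two keys as aliases. The helper name is
hypothetical, and both the "any non-zero value means enabled" mapping and
the precedence of mirostat_mode are assumptions, not behavior confirmed
by this commit.

```python
from typing import Any, Mapping


def mirostat_enabled(params: Mapping[str, Any]) -> bool:
    """Accept either key and collapse the value to a boolean."""
    # Assumption: mirostat_mode wins when both keys are sent
    if "mirostat_mode" in params:
        raw = params["mirostat_mode"]
    else:
        raw = params.get("mirostat", 0)
    # Assumption: any non-zero number (or True) counts as enabled
    return bool(raw)
```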
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
kwargs is pretty ugly when figuring out which arguments to use. The
base request falls back to defaults anyway, so pass in the params
object as is.
However, since Python's typing isn't like TypeScript where types
can be transformed, the type hinting has a possibility of None showing
up despite there always being a value for some params.
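The typing wrinkle can be shown with a small sketch. SamplerParamsSketch
and its fields are hypothetical stand-ins for the real params object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SamplerParamsSketch:
    # The hint has to allow None because a client may send null, even
    # though a default is always filled in before the value is used.
    temperature: Optional[float] = 1.0
    top_k: Optional[int] = 0


def effective_temperature(params: SamplerParamsSketch) -> float:
    # Narrow Optional[float] to float so a type checker accepts the return
    if params.temperature is None:
        return 1.0
    return params.temperature
```

The explicit None check is redundant at runtime when defaults are
applied upstream, but it is what keeps static type checkers quiet.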
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>