The HFModel class serves to coalesce all config files that contain
random keys which are required for model usage.
Adding this base class allows us to expand as HuggingFace randomly
changes their JSON schemas over time, reducing the brunt that backend
devs need to feel when their next model isn't supported.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Seemed out of place in the common load function. In addition, rename
the transformers utils signature which actually takes a directory
instead of a file.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Jobs should be started and immediately cleaned up when calling the
generation stream. Expose a stream_generate function and append
this to the base class since it's more idiomatic than generate_gen.
The exl2 container's generate_gen function is now internal.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
One goal is to try migrating away from kwargs and use the ModelLoadRequest
instead. However, Pydantic doesn't support async validators making
parsing of the inline config impossible due to its use of aiofiles.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
This is applied across containers. Doesn't make sense to put this method
in the backend.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
The admin takes priority over the regular user. Therefore, if a model
is loading, ignore all incoming generation requests
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
The model_type internal reference was changed to an enum for
a more extendable loading process. Return the current model type
when loading a new model.
Signed-off-by: kingbri <bdashore3@proton.me>
Adds the ability to load vision parts of text + image models. Requires
an explicit flag in config because there isn't a way to automatically
determine whether the vision tower should be used.
Signed-off-by: kingbri <bdashore3@proton.me>
ExllamaV2 should check for solely exllamav2, otherwise errors don't
make sense. Migrate the combined "exl2" computed property to "inference"
since those are the required dependencies for minimal inference.
Signed-off-by: kingbri <bdashore3@proton.me>
There are two ways to load a model:
1. Via the load endpoint
2. Inline with a completion
The defaults were not applying on the inline load, so rewrite to fix
that. However, while doing this, set up a defaults dictionary rather
than comparing it at runtime and remove the pydantic default lambda
on all the model load fields.
This makes the code cleaner and establishes a clear config tree for
loading models.
Signed-off-by: kingbri <bdashore3@proton.me>
Using aiofiles, there's no longer a possiblity of blocking file operations
that can hang up the event loop. In addition, partially migrate
classes to use asynchronous init instead of the normal python magic method.
The only exception is config, since that's handled in the synchonous
init before the event loop starts.
Signed-off-by: kingbri <bdashore3@proton.me>
The config categories can have defined separation, but preserve
the dynamic nature of adding new config options by making all the
internal class vars as dictionaries.
This was necessary since storing global callbacks stored a state
of the previous global_config var that wasn't populated.
Signed-off-by: kingbri <bdashore3@proton.me>
Using "auto" for rope alpha removes ambiguity on how to explicitly
enable automatic rope calculation. The same behavior of None -> auto
calculate still exists, but can be overwritten if a model's tabby_config.yml
includes `rope_alpha`.
Signed-off-by: kingbri <bdashore3@proton.me>
It's best to pass them down the config stack.
API/User config.yml -> model config.yml -> model config.json -> fallback.
Doing this allows for seamless flow and yielding control to each
member in the stack.
Signed-off-by: kingbri <bdashore3@proton.me>
Storing a pathlib type makes it easier to manipulate the model
directory path in the long run without constantly fetching it
from the config.
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, when a SIGINT was emitted and a model load is running,
the API didn't shut down until the load finished due to waitng for
the lock. However, when shutting down, the lock doesn't matter since
the process is being killed anyway.
Signed-off-by: kingbri <bdashore3@proton.me>
Embedding models are managed on a separate backend, but are run
in parallel with the model itself. Therefore, manage this in a separate
container with separate routes.
Signed-off-by: kingbri <bdashore3@proton.me>
Use Infinity as a separate backend and handle the model within the
common module. This separates out the embeddings model from the endpoint
which allows for model loading/unloading in core.
Signed-off-by: kingbri <bdashore3@proton.me>
Move OpenAPI export as an env var within the main function. This
allows for easy export by running main.
In addition, an env variable provides global and explicit state to
disable conditional wheel imports (ex. Exl2 and torch) which caused
errors at first.
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, the parameters under the "model" block in config.yml only
handled the loading of a model on startup. This meant that any subsequent
API request required each parameter to be filled out or use a sane default
(usually defaults to the model's config.json).
However, there are cases where admins may want an argument from the
config to apply if the parameter isn't provided in the request body.
To help alleviate this, add a mechanism that works like sampler overrides
where users can specify a flag that acts as a fallback.
Therefore, this change both preserves the source of truth of what
parameters the admin is loading and adds some convenience for users
that want customizable defaults for their requests.
This behavior may change in the future, but I think it solves the
issue for now.
Signed-off-by: kingbri <bdashore3@proton.me>
Add a sequential lock and wait until jobs are completed before executing
any loading requests that directly alter the model. However, we also
need to block any new requests that come in until the load is finished,
so add a condition that triggers once the lock is free.
Signed-off-by: kingbri <bdashore3@proton.me>
If the model didn't load properly, the container still exists until
unload is called. However, the name check still registered as true.
Signed-off-by: kingbri <bdashore3@proton.me>
Run these iterators on the background thread. On startup, the API
spawns a background thread as needed to run sync code on without blocking
the event loop.
Use asyncio's run_thread function since it allows for errors to be
propegated.
Signed-off-by: kingbri <bdashore3@proton.me>
Async generation helps remove many roadblocks to managing tasks
using threads. It should allow for abortables and modern-day paradigms.
NOTE: Exllamav2 itself is not an asynchronous library. It's just
been added into tabby's async nature to allow for a fast and concurrent
API server. It's still being debated to run stream_ex in a separate
thread or manually manage it using asyncio.sleep(0)
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, generation function were bundled with the request function
causing the overall code structure and API to look ugly and unreadable.
Split these up and cleanup a lot of the methods that were previously
overlooked in the API itself.
Signed-off-by: kingbri <bdashore3@proton.me>
Use the module singleton pattern to share global state. This can also
be a modified version of the Global Object Pattern. The main reason
this pattern is used is for ease of use when handling global state
rather than adding extra dependencies for a DI parameter.
Signed-off-by: kingbri <bdashore3@proton.me>
This is a shared module which manages the model container and provides
extra utility functions around it to help slim down the API.
Signed-off-by: kingbri <bdashore3@proton.me>