The loader takes in the "draft" parameter, so map the config model
to that when creating kwargs for initial load.
Also map the old "draft" key to the new "draft_model" key.
Signed-off-by: kingbri <bdashore3@proton.me>
Using aiofiles, there's no longer a possiblity of blocking file operations
that can hang up the event loop. In addition, partially migrate
classes to use asynchronous init instead of the normal python magic method.
The only exception is config, since that's handled in the synchonous
init before the event loop starts.
Signed-off-by: kingbri <bdashore3@proton.me>
- add models for config options
- add function to regenerate config.yml
- replace references to config with pydantic compatible references
- remove unnecessary unwrap() statements
TODO:
- auto generate env vars
- auto generate argparse
- test loading a model
The config categories can have defined separation, but preserve
the dynamic nature of adding new config options by making all the
internal class vars as dictionaries.
This was necessary since storing global callbacks stored a state
of the previous global_config var that wasn't populated.
Signed-off-by: kingbri <bdashore3@proton.me>
Embedding models are managed on a separate backend, but are run
in parallel with the model itself. Therefore, manage this in a separate
container with separate routes.
Signed-off-by: kingbri <bdashore3@proton.me>
Use Infinity as a separate backend and handle the model within the
common module. This separates out the embeddings model from the endpoint
which allows for model loading/unloading in core.
Signed-off-by: kingbri <bdashore3@proton.me>
Realtime process priority assigns resources to point to tabby's
processes. Running as administrator will give realtime priority
while running as a normal user will set as high priority.
Signed-off-by: kingbri <bdashore3@proton.me>
These are faster event loops for asyncio which should improve overall
performance. Gate these under an experimental flag for now to stress
test these loops.
Signed-off-by: kingbri <bdashore3@proton.me>
Move OpenAPI export as an env var within the main function. This
allows for easy export by running main.
In addition, an env variable provides global and explicit state to
disable conditional wheel imports (ex. Exl2 and torch) which caused
errors at first.
Signed-off-by: kingbri <bdashore3@proton.me>
Add the ability to override uvicorn's signal handler in addition
to using main's signal handler for any SIGINTs before the API server
starts.
Signed-off-by: kingbri <bdashore3@proton.me>
Async generation helps remove many roadblocks to managing tasks
using threads. It should allow for abortables and modern-day paradigms.
NOTE: Exllamav2 itself is not an asynchronous library. It's just
been added into tabby's async nature to allow for a fast and concurrent
API server. It's still being debated to run stream_ex in a separate
thread or manually manage it using asyncio.sleep(0)
Signed-off-by: kingbri <bdashore3@proton.me>
Moving the API into its own directory helps compartmentalize it
and allows for cleaning up the main file to just contain bootstrapping
and the entry point.
Signed-off-by: kingbri <bdashore3@proton.me>
Use the module singleton pattern to share global state. This can also
be a modified version of the Global Object Pattern. The main reason
this pattern is used is for ease of use when handling global state
rather than adding extra dependencies for a DI parameter.
Signed-off-by: kingbri <bdashore3@proton.me>
This is a shared module which manages the model container and provides
extra utility functions around it to help slim down the API.
Signed-off-by: kingbri <bdashore3@proton.me>
Similar to Gradio, fall back to port + 1 if the config port isn't
bindable. If both ports aren't available, let the user know and exit.
An infinite loop of finding a port isn't advisable.
Signed-off-by: kingbri <bdashore3@proton.me>
Starlette's StreamingResponse has an issue where it yields after
a request has disconnected. A bugfix to starlette will fix this
issue, but FastAPI uses starlette <= 0.36 which isn't ideal.
Therefore, switch back to sse-starlette which handles these disconnects
correctly.
Also don't try yielding after the request is disconnected. Just return
out of the generator instead.
Signed-off-by: kingbri <bdashore3@proton.me>
Loguru is a flexible logger that allows for easier hooking and imports
into Rich with no problems. Also makes progress bars stick to the
bottom of the terminal window.
Signed-off-by: kingbri <bdashore3@proton.me>
Rich is a more mature library for displaying progress bars, logging,
and console output. This should help properly align progress bars
within the terminal.
Side note: "We're Rich!"
Signed-off-by: kingbri <bdashore3@proton.me>
Add this in addition to 8bit cache and 16bit cache. Passing "Q4" with
the cache_mode request parameter will set this on model load.
Signed-off-by: kingbri <bdashore3@proton.me>
Make a disconnect on load error consistently. It should be safer to
warn the user to run unload (or re-run load) if a model does not
load correctly.
Also don't log the traceback for request errors that don't have one.
Signed-off-by: kingbri <bdashore3@proton.me>
According to FastAPI docs, if you're using a generic function, running
it in async will make it more performant (which makes sense since
running def functions for routes will automatically run the caller
through a threadpool).
Tested and everything works fine.
Signed-off-by: kingbri <bdashore3@proton.me>