Add the ability to override uvicorn's signal handler in addition
to using main's signal handler for any SIGINTs before the API server
starts.
Signed-off-by: kingbri <bdashore3@proton.me>
If the model didn't load properly, the container still exists until
unload is called. However, the name check still registered as true.
Signed-off-by: kingbri <bdashore3@proton.me>
Run these iterators on the background thread. On startup, the API
spawns a background thread as needed to run sync code on without blocking
the event loop.
Use asyncio's run_thread function since it allows for errors to be
propegated.
Signed-off-by: kingbri <bdashore3@proton.me>
Async generation helps remove many roadblocks to managing tasks
using threads. It should allow for abortables and modern-day paradigms.
NOTE: Exllamav2 itself is not an asynchronous library. It's just
been added into tabby's async nature to allow for a fast and concurrent
API server. It's still being debated to run stream_ex in a separate
thread or manually manage it using asyncio.sleep(0)
Signed-off-by: kingbri <bdashore3@proton.me>
These are mainly used for some clients that ping to see if the request
is alive. However, we don't need this.
Signed-off-by: kingbri <bdashore3@proton.me>
Speculative ngram decoding is like speculative decoding without the
draft model. It's not as useful because it only decodes on predictable
sequences, but it depends on the usecase.
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, generation function were bundled with the request function
causing the overall code structure and API to look ugly and unreadable.
Split these up and cleanup a lot of the methods that were previously
overlooked in the API itself.
Signed-off-by: kingbri <bdashore3@proton.me>
Moving the API into its own directory helps compartmentalize it
and allows for cleaning up the main file to just contain bootstrapping
and the entry point.
Signed-off-by: kingbri <bdashore3@proton.me>
Use the module singleton pattern to share global state. This can also
be a modified version of the Global Object Pattern. The main reason
this pattern is used is for ease of use when handling global state
rather than adding extra dependencies for a DI parameter.
Signed-off-by: kingbri <bdashore3@proton.me>
This is a shared module which manages the model container and provides
extra utility functions around it to help slim down the API.
Signed-off-by: kingbri <bdashore3@proton.me>
Similar to Gradio, fall back to port + 1 if the config port isn't
bindable. If both ports aren't available, let the user know and exit.
An infinite loop of finding a port isn't advisable.
Signed-off-by: kingbri <bdashore3@proton.me>
Rich markup sequences inside the log string were causing issues
with printing. Fix this by using their escape function.
Signed-off-by: kingbri <bdashore3@proton.me>
Starlette's StreamingResponse has an issue where it yields after
a request has disconnected. A bugfix to starlette will fix this
issue, but FastAPI uses starlette <= 0.36 which isn't ideal.
Therefore, switch back to sse-starlette which handles these disconnects
correctly.
Also don't try yielding after the request is disconnected. Just return
out of the generator instead.
Signed-off-by: kingbri <bdashore3@proton.me>
Loguru is a flexible logger that allows for easier hooking and imports
into Rich with no problems. Also makes progress bars stick to the
bottom of the terminal window.
Signed-off-by: kingbri <bdashore3@proton.me>
Rich is a more mature library for displaying progress bars, logging,
and console output. This should help properly align progress bars
within the terminal.
Side note: "We're Rich!"
Signed-off-by: kingbri <bdashore3@proton.me>
Add this in addition to 8bit cache and 16bit cache. Passing "Q4" with
the cache_mode request parameter will set this on model load.
Signed-off-by: kingbri <bdashore3@proton.me>
Make a disconnect on load error consistently. It should be safer to
warn the user to run unload (or re-run load) if a model does not
load correctly.
Also don't log the traceback for request errors that don't have one.
Signed-off-by: kingbri <bdashore3@proton.me>
According to FastAPI docs, if you're using a generic function, running
it in async will make it more performant (which makes sense since
running def functions for routes will automatically run the caller
through a threadpool).
Tested and everything works fine.
Signed-off-by: kingbri <bdashore3@proton.me>
The semaphore/queue model for Tabby is as follows:
- Any load requests go through the semaphore by default
- Any load request can include the skip_queue parameter to bypass
the semaphore
- Any unload requests are immediately executed
- All completion requests are placed inside the semaphore by default
This model preserves the parallelism of single-user mode with extra
convenience methods for queues in multi-user. It also helps mitigate
problems that were previously present in the concurrency stack.
Also change how the program's loop runs so it exits when the API thread
dies.
Signed-off-by: kingbri <bdashore3@proton.me>
This is the first in many future commits that will overhaul the API
to be more robust and concurrent. The model is admin-first where the
admin can do anything in-case something goes awry.
Previously, calls to long running synchronous background tasks would
block the entire API, making it ignore any terminal signals until
generation is completed.
To fix this, levrage FastAPI's run_in_threadpool to offload the long
running tasks to another thread. However, signals to abort the process
still kept the background thread running and made the terminal hang.
This was due to an issue with Uvicorn not propegating the SIGINT signal
across threads in its event loop. To fix this in a catch-all way, run
the API processes in a separate thread so the main thread can still
kill the process if needed.
In addition, make request error logging more robust and refer to the
console for full error logs rather than creating a long message on the
client-side.
Finally, add state checks to see if a model is fully loaded before
generating a completion.
Signed-off-by: kingbri <bdashore3@proton.me>
Using the Outlines library, add support to supply EBNF strings and
pass them to the library for parsing.
From there, a wrapper is created and a filter is passed to generation.
Replace with an in-house solution at some point that's more flexible.
Signed-off-by: kingbri <bdashore3@proton.me>