When the model is processing a prompt, add the ability to abort
on request cancellation. This is also a catch for a SIGINT.
Signed-off-by: kingbri <bdashore3@proton.me>
OAI expects finish_reason to be "stop" or "length" (there are others,
but they're not in the current scope of this project).
Make all completions and chat completions responses return this
from the model generation itself rather than putting a placeholder.
Signed-off-by: kingbri <bdashore3@proton.me>
Run these iterators on the background thread. On startup, the API
spawns a background thread as needed to run sync code on without blocking
the event loop.
Use asyncio's run_thread function since it allows for errors to be
propegated.
Signed-off-by: kingbri <bdashore3@proton.me>
Async generation helps remove many roadblocks to managing tasks
using threads. It should allow for abortables and modern-day paradigms.
NOTE: Exllamav2 itself is not an asynchronous library. It's just
been added into tabby's async nature to allow for a fast and concurrent
API server. It's still being debated to run stream_ex in a separate
thread or manually manage it using asyncio.sleep(0)
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, generation function were bundled with the request function
causing the overall code structure and API to look ugly and unreadable.
Split these up and cleanup a lot of the methods that were previously
overlooked in the API itself.
Signed-off-by: kingbri <bdashore3@proton.me>
Moving the API into its own directory helps compartmentalize it
and allows for cleaning up the main file to just contain bootstrapping
and the entry point.
Signed-off-by: kingbri <bdashore3@proton.me>