Loguru is a flexible logger that is easier to hook into and integrates
cleanly with Rich. It also makes progress bars stick to the bottom of
the terminal window.
Signed-off-by: kingbri <bdashore3@proton.me>
Rich is a more mature library for displaying progress bars, logging,
and console output. This should help properly align progress bars
within the terminal.
Side note: "We're Rich!"
Signed-off-by: kingbri <bdashore3@proton.me>
Add this in addition to 8bit cache and 16bit cache. Passing "Q4" with
the cache_mode request parameter will set this on model load.
Signed-off-by: kingbri <bdashore3@proton.me>
Make a disconnect during load raise an error consistently. It is safer
to warn the user to run unload (or re-run load) if a model does not
load correctly.
Also don't log the traceback for request errors that don't have one.
Signed-off-by: kingbri <bdashore3@proton.me>
According to the FastAPI docs, declaring a generic route function with
async def makes it more performant (which makes sense, since plain def
route functions are automatically run through a threadpool).
Tested and everything works fine.
Signed-off-by: kingbri <bdashore3@proton.me>
The semaphore/queue model for Tabby is as follows:
- Any load requests go through the semaphore by default
- Any load request can include the skip_queue parameter to bypass
the semaphore
- Any unload requests are immediately executed
- All completion requests are placed inside the semaphore by default
This model preserves the parallelism of single-user mode with extra
convenience methods for queues in multi-user. It also helps mitigate
problems that were previously present in the concurrency stack.
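
The queue rules above can be sketched with asyncio alone. This is an
illustration, not the project's code: RequestQueue, run, and the job
names are hypothetical, and a single-slot semaphore is assumed.

```python
import asyncio


class RequestQueue:
    """Hypothetical gate: one request at a time unless skip_queue is set."""

    def __init__(self) -> None:
        # Assumption: a single slot; the real limit may differ.
        self._semaphore = asyncio.Semaphore(1)

    async def run(self, coro_fn, *args, skip_queue: bool = False):
        if skip_queue:
            # Bypass the semaphore entirely (e.g. an admin load request).
            return await coro_fn(*args)
        async with self._semaphore:
            return await coro_fn(*args)


async def demo() -> list:
    queue = RequestQueue()
    order = []

    async def job(name: str, delay: float) -> str:
        order.append(f"start {name}")
        await asyncio.sleep(delay)
        order.append(f"end {name}")
        return name

    await asyncio.gather(
        queue.run(job, "completion-1", 0.05),
        queue.run(job, "completion-2", 0.01),
        queue.run(job, "admin-load", 0.0, skip_queue=True),
    )
    return order


order = asyncio.run(demo())
```

In this sketch the two queued jobs never overlap, while the job that
passes skip_queue starts immediately even though the semaphore is held.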
Also change how the program's loop runs so it exits when the API thread
dies.
Signed-off-by: kingbri <bdashore3@proton.me>
This is the first of many commits that will overhaul the API to be
more robust and concurrent. The model is admin-first, where the
admin can do anything in case something goes awry.
Previously, calls to long running synchronous background tasks would
block the entire API, making it ignore any terminal signals until
generation is completed.
To fix this, leverage FastAPI's run_in_threadpool to offload the long
running tasks to another thread. However, signals to abort the process
still kept the background thread running and made the terminal hang.
This was due to an issue with Uvicorn not propagating the SIGINT signal
across threads in its event loop. To fix this in a catch-all way, run
the API processes in a separate thread so the main thread can still
kill the process if needed.
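
A minimal sketch of the separate-thread approach, using only the
standard library. serve_api is a hypothetical stand-in for the blocking
server call, and run_until_api_exits is an illustrative name; neither
is taken from the actual code.

```python
import threading
import time


def serve_api(run_for: float) -> None:
    """Hypothetical stand-in for the blocking API server call."""
    time.sleep(run_for)


def run_until_api_exits(target, *args, poll_interval: float = 0.01) -> bool:
    """Run target in a worker thread while the main thread keeps control.

    Returns True if the worker finished on its own, False on Ctrl+C.
    """
    # daemon=True lets the interpreter exit even if the worker is still busy.
    api_thread = threading.Thread(target=target, args=args, daemon=True)
    api_thread.start()
    try:
        # Joining with a timeout keeps the main thread responsive to SIGINT.
        while api_thread.is_alive():
            api_thread.join(timeout=poll_interval)
    except KeyboardInterrupt:
        return False
    return True


finished = run_until_api_exits(serve_api, 0.05)
```

Because Python delivers signals to the main thread, keeping that thread
out of the blocking work is what allows the process to be killed.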
In addition, make request error logging more robust and refer to the
console for full error logs rather than creating a long message on the
client-side.
Finally, add state checks to see if a model is fully loaded before
generating a completion.
Signed-off-by: kingbri <bdashore3@proton.me>
Automatically unload the existing model when calling /load. This was
requested many times, and does make more sense in the long run.
Signed-off-by: kingbri <bdashore3@proton.me>
This option saves some VRAM, but does have the chance to error out.
Add this in the experimental config section.
Signed-off-by: kingbri <bdashore3@proton.me>
Returns token offsets, selected tokens, probabilities of tokens
post-sampling, and normalized probability of selecting a token
pre-sampling (for efficiency purposes).
Only for text completions. Chat completions in a later commit.
Signed-off-by: kingbri <bdashore3@proton.me>
Split the get tokens function into separate wrapper encode and decode
functions for overall code cleanliness.
Signed-off-by: kingbri <bdashore3@proton.me>
Many APIs automatically ask for request streaming without giving
the user the option to turn it off. Therefore, give the user more
freedom by giving a server-side kill switch.
Signed-off-by: kingbri <bdashore3@proton.me>
Add the ability to use an unsafe config flag if needed and migrate
the exl2 check to a different file within the exl2 backend code.
Signed-off-by: kingbri <bdashore3@proton.me>
Exllamav2 is currently supported on all GPUs and versions. Therefore,
it should be expected that users use the latest version of exllamav2 to
get the latest features.
Doing this helps reduce checks that don't really serve any purpose.
Signed-off-by: kingbri <bdashore3@proton.me>
Allow users to switch the currently overridden samplers via the API
so a restart isn't required to switch the overrides.
Signed-off-by: kingbri <bdashore3@proton.me>
Unify API sampler params into a superclass which should make them
easier to manage and inherit generic functions from.
Not all frontends expose all sampling parameters because they target
OAI (which handles sampling itself, with the exception of a few
sliders).
Add the ability for the user to customize fallback parameters from
server-side.
In addition, parameters can be forced to a certain value server-side
in case the repo automatically sets other sampler values in the
background that the user doesn't want.
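
One way to picture fallback and forced values is shown below. The
override table layout, its keys ("override", "force"), and
resolve_param are assumptions made for illustration only.

```python
from typing import Any, Dict, Optional

# Hypothetical server-side override table.
overrides: Dict[str, Dict[str, Any]] = {
    "temperature": {"override": 0.7, "force": False},  # fallback only
    "top_p": {"override": 0.9, "force": True},         # always wins
}


def resolve_param(name: str, requested: Optional[Any]) -> Any:
    """Pick the value to sample with for one parameter."""
    entry = overrides.get(name)
    if entry is None:
        # No server-side rule: use whatever the client sent.
        return requested
    if entry["force"] or requested is None:
        # Forced, or the client did not send a value: use the override.
        return entry["override"]
    return requested
```

A request that omits temperature gets 0.7, a request that sends
temperature keeps its own value, and top_p is always 0.9.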
Signed-off-by: kingbri <bdashore3@proton.me>
Move common functions into their own folder and refactor the backends
to use their own folder as well.
Also clean up imports and alphabetize the import statements.
Finally, move colab and docker into their own folders as well.
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, a load was skipped only if model_name was commented out.
Also handle the case where model_name or loras is left blank, which
parses as None in YAML.
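
A small sketch of the blank-key case. The parsed dictionary mimics what
a YAML parser returns for keys with no value; value_or_default is a
hypothetical helper name.

```python
from typing import Any, Optional


def value_or_default(value: Optional[Any], default: Any) -> Any:
    """Return default when a key was left blank (parsed as None)."""
    return default if value is None else value


# Shape produced for a config such as:
#   model:
#     model_name:
#   lora:
#     loras:
parsed = {"model": {"model_name": None}, "lora": {"loras": None}}

model_name = value_or_default(parsed["model"].get("model_name"), "")
loras = value_or_default(parsed["lora"].get("loras"), [])

# Only attempt a load when a model name was actually provided.
should_load = bool(model_name)
```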
Signed-off-by: kingbri <bdashore3@proton.me>
Add an argparser that casts over to dictionaries of subgroups to
integrate with the config.
This argparser doesn't contain everything in the config due to complexity
issues with CLI args, but will eventually progress to parity. In addition,
it's used to override the config.yml rather than replace it.
A config arg is also provided if the user wants to fully override the
config yaml with another file path.
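
The casting step might look like the sketch below. The group layout,
option names, and helper functions are hypothetical; only the idea
(CLI args grouped into per-section dictionaries that override the
loaded config) comes from the description above.

```python
import argparse
from typing import Any, Dict

# Hypothetical subgroup layout; the real config has more groups and keys.
GROUPS: Dict[str, Dict[str, type]] = {
    "model": {"model_name": str, "max_seq_len": int},
    "network": {"host": str, "port": int},
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", type=str, default=None)
    for group_name, options in GROUPS.items():
        group = parser.add_argument_group(group_name)
        for dest, kind in options.items():
            flag = "--" + dest.replace("_", "-")
            group.add_argument(flag, dest=dest, type=kind, default=None)
    return parser


def args_to_groups(args: argparse.Namespace) -> Dict[str, Dict[str, Any]]:
    """Cast the flat namespace into one dictionary per subgroup."""
    grouped: Dict[str, Dict[str, Any]] = {}
    for group_name, options in GROUPS.items():
        # Keep only the options the user actually passed.
        grouped[group_name] = {
            dest: getattr(args, dest)
            for dest in options
            if getattr(args, dest) is not None
        }
    return grouped


def apply_overrides(config: dict, cli_groups: dict) -> dict:
    """Overlay CLI values on the config without mutating the original."""
    merged = {name: dict(section or {}) for name, section in config.items()}
    for name, values in cli_groups.items():
        merged.setdefault(name, {}).update(values)
    return merged


args = build_parser().parse_args(["--port", "5001", "--model-name", "my-model"])
config = {
    "model": {"model_name": "from-yaml", "max_seq_len": 4096},
    "network": {"port": 5000},
}
merged = apply_overrides(config, args_to_groups(args))
```

Values absent from the command line stay as they were in the config, so
the CLI overrides rather than replaces it.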
Signed-off-by: kingbri <bdashore3@proton.me>
Similar to the transformers library, add an error handler when an
exception is fired. This relays the error to the user.
Signed-off-by: kingbri <bdashore3@proton.me>
These are commonly seen in Hugging Face-provided chat templates and
aren't that difficult to add.
For feature parity, honor the add_bos_token and ban_eos_token
parameters when constructing the prompt.
Signed-off-by: kingbri <bdashore3@proton.me>
This creates a massive security hole, but it's gated behind a flag
for users who only use localhost.
A warning will pop up when users disable authentication.
Signed-off-by: kingbri <bdashore3@proton.me>
Non-streaming tasks were not regulated by the semaphore, causing these
tasks to interfere with streaming generations. Add helper functions
to take in both sync and async functions for callbacks and sequential
blocking with the semaphore.
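
A helper of this kind could be sketched as follows. The name
call_with_semaphore and the use of run_in_executor for sync callbacks
are assumptions; the actual helpers may dispatch differently.

```python
import asyncio
import inspect


async def call_with_semaphore(semaphore: asyncio.Semaphore, callback, *args):
    """Run a sync or async callback while holding the semaphore."""
    async with semaphore:
        if inspect.iscoroutinefunction(callback):
            return await callback(*args)
        # Sync callbacks go to a worker thread so the event loop stays free.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, callback, *args)


def sync_task(x: int) -> int:
    return x * 2


async def async_task(x: int) -> int:
    await asyncio.sleep(0)
    return x + 1


async def demo() -> list:
    semaphore = asyncio.Semaphore(1)
    return await asyncio.gather(
        call_with_semaphore(semaphore, sync_task, 10),
        call_with_semaphore(semaphore, async_task, 10),
    )


results = asyncio.run(demo())
```

Both kinds of callback pass through the same gate, so a non-streaming
task waits its turn instead of interleaving with a streaming one.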
Signed-off-by: kingbri <bdashore3@proton.me>