Log all the parts of a request if the config flag is set. The logged
fields are all server side anyways, so nothing is being exposed to
clients.
Signed-off-by: kingbri <bdashore3@proton.me>
Middleware runs on both the request and response. Therefore, streaming
responses had increased latency when processing tasks and sending
data to the client which resulted in erratic streaming behavior.
Use a depends to add request IDs since it only executes when the
request is run rather than expecting the response to be sent as well.
For the future, it would be best to think about limiting the time
between each tick of chunk data to be safe.
Signed-off-by: kingbri <bdashore3@proton.me>
Identify which request is being processed to help users disambiguate
which logs correspond to which request.
Signed-off-by: kingbri <bdashore3@proton.me>
Raise a 422 exception for the disconnect. This prevents pydantic
errors when returning a "response" which doesn't contain anything
in this case.
Signed-off-by: kingbri <bdashore3@proton.me>
It's possible that tracebacks can give too much info about a system
when sent over the API. Gate this under a flag to send them only
when debugging since this feature is still useful.
Signed-off-by: kingbri <bdashore3@proton.me>
API keys are not allowed to view all the admin's models, templates,
draft models, loras, etc. Basically anything that can be viewed
on the filesystem outside of anything that's currently loaded is
not allowed to be returned unless an admin key is present.
This change helps preserve user privacy while not erroring out on
list endpoints that the OAI spec requires.
Signed-off-by: kingbri <bdashore3@proton.me>
Pass a request and internally unwrap the headers. In addition, allow
X-admin-key to get checked in an API key request.
Signed-off-by: kingbri <bdashore3@proton.me>
Move OpenAPI export as an env var within the main function. This
allows for easy export by running main.
In addition, an env variable provides global and explicit state to
disable conditional wheel imports (ex. Exl2 and torch) which caused
errors at first.
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, the parameters under the "model" block in config.yml only
handled the loading of a model on startup. This meant that any subsequent
API request required each parameter to be filled out or use a sane default
(usually defaults to the model's config.json).
However, there are cases where admins may want an argument from the
config to apply if the parameter isn't provided in the request body.
To help alleviate this, add a mechanism that works like sampler overrides
where users can specify a flag that acts as a fallback.
Therefore, this change both preserves the source of truth of what
parameters the admin is loading and adds some convenience for users
that want customizable defaults for their requests.
This behavior may change in the future, but I think it solves the
issue for now.
Signed-off-by: kingbri <bdashore3@proton.me>
List comprehensions are the more "pythonic" way to approach mapping
values to a list. They're also more flexible across different collection
types rather than the inbuilt map method. It's best to keep one convention
rather than splitting down two.
Signed-off-by: kingbri <bdashore3@proton.me>
Add a sequential lock and wait until jobs are completed before executing
any loading requests that directly alter the model. However, we also
need to block any new requests that come in until the load is finished,
so add a condition that triggers once the lock is free.
Signed-off-by: kingbri <bdashore3@proton.me>
The new async dynamic job allows for native async support without the
need of threading. Also add logprobs and metrics back to responses.
Signed-off-by: kingbri <bdashore3@proton.me>
At any point for any request cancellation, the semaphore will be
decremented. This is an issue since an arbitrary request can desync
the semaphore, causing multiple tasks to be processed at once and
break generation.
Remove this from the networking handlers and therefore, remove the
release_semaphore function itself.
Signed-off-by: kingbri <bdashore3@proton.me>
If an override was iterable, any modifications to the returned value
would alter the reference to the global storage dict.
Therefore, copy the structure if it's an iterable so any modification
won't alter the original override. Also apply this for the function
that checks for forced overrides.
Signed-off-by: kingbri <bdashore3@proton.me>
From exllamav2: List of strings that the generator will refuse to output. As soon as a partial match happens, a checkpoint is saved that the generator can rewind to if need be. Subsequent tokens are then held until the full string is resolved (match or no match) and either emitted or discarded, accordingly.
Bans the EOS token until the generation reaches a minimum length. This will not prevent the model from otherwise ending the generation early by outputting other stop conditions.
This reverts commit 7556dcf134.
The Optionals allowed requests to send "null" in the body for optional
parameters which should be allowed.
Signed-off-by: kingbri <bdashore3@proton.me>
These both take an array of glob strings to state what files or
directories to include or exclude when parsing the download list.
Signed-off-by: kingbri <bdashore3@proton.me>
Use None-ish coalescing instead of unwrap optional handling. This means
that any value that is "empty" for python will default to the fallback.
Ex. print("" or "test") will print out "test"
Signed-off-by: kingbri <bdashore3@proton.me>
Adds an asynchronous huggingface downloader that uses HF hub to fetch
all repo files. The current HF hub package has a snapshot_download
function that does not cancel on KeyboardInterrupt.
Instead, make a downloader that uses the Rich progress bar styling
along with a cancellable interface. Finally, link this to TabbyAPI.
Signed-off-by: kingbri <bdashore3@proton.me>
Appends the banned tokens to the generation. This is equivalent of
setting logit bias to -100 on a specific set of tokens.
Signed-off-by: kingbri <bdashore3@proton.me>
Having many utility functions for initialization doesn't make much sense.
Instead, handle anything regarding template creation inside the
class which reduces the amount of function imports.
Signed-off-by: kingbri <bdashore3@proton.me>
HuggingFace updated transformers to provide templates in a list for
tokenizers. Update to support this new format. Providing the name
of a template for the "prompt_template" value in config.yml will also
look inside the template list.
In addition, log if there's a template exception, but continue model
loading since it shouldn't shut down the application.
Signed-off-by: kingbri <bdashore3@proton.me>