Move the OpenAPI export behind an env var check within the main function.
This allows for easy export by running main.
In addition, an env variable provides global and explicit state to
disable conditional wheel imports (e.g. exl2 and torch), which caused
errors at first.
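
A minimal sketch of the idea (the env var name and import path are
illustrative, not the actual ones):

    import json
    import os

    EXPORT_OPENAPI = os.getenv("EXPORT_OPENAPI", "0") == "1"

    def main():
        if EXPORT_OPENAPI:
            # Importing the app lazily skips the conditional wheel imports
            # (e.g. exl2, torch) that a plain schema export doesn't need.
            from endpoints.app import app  # hypothetical module path

            with open("openapi.json", "w") as f:
                json.dump(app.openapi(), f)  # FastAPI's schema generator
            return
        start_server()  # placeholder for the normal startup path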
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, the parameters under the "model" block in config.yml only
handled the loading of a model on startup. This meant that any subsequent
API request required each parameter to be filled out or use a sane default
(usually from the model's config.json).
However, there are cases where admins may want an argument from the
config to apply if the parameter isn't provided in the request body.
To help alleviate this, add a mechanism that works like sampler overrides
where users can specify a flag that acts as a fallback.
Therefore, this change both preserves the source of truth of what
parameters the admin is loading and adds some convenience for users
that want customizable defaults for their requests.
This behavior may change in the future, but I think it solves the
issue for now.
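
A rough sketch of the fallback resolution, assuming a per-parameter
flag (the flag name here is made up):

    def resolve(request_value, config_entry):
        """Prefer the request body; fall back to config only when flagged."""
        if request_value is not None:
            return request_value
        if config_entry.get("use_as_default"):  # hypothetical flag name
            return config_entry["value"]
        return None  # defer to the model's config.json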
Signed-off-by: kingbri <bdashore3@proton.me>
* Model: Clean up paged attention checks
* Model: Move cache_size checks after paged attn checks
Cache size is only relevant in paged mode
* Model: Fix no_flash_attention
* Model: Remove no_flash_attention
Ability to use flash attention is auto-detected, so this flag is unneeded. Uninstall flash attention to disable it on supported hardware.
Use a queue-based system to get choices independently and send them
in the overall streaming payload. This method allows for unordered
streaming of generations.
The system is a bit redundant, so the code may be optimized
in the future.
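
A condensed sketch of the queue-based fan-in (names are illustrative):

    import asyncio

    async def stream_choices(generators):
        queue = asyncio.Queue()

        async def worker(index, gen):
            async for chunk in gen:
                await queue.put((index, chunk))
            await queue.put((index, None))  # sentinel: this choice finished

        tasks = [asyncio.create_task(worker(i, g)) for i, g in enumerate(generators)]
        finished = 0
        while finished < len(tasks):
            index, chunk = await queue.get()
            if chunk is None:
                finished += 1
            else:
                yield index, chunk  # chunks arrive in completion order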
Signed-off-by: kingbri <bdashore3@proton.me>
For multiple generations in the same request, nested arrays kept their
original reference, resulting in duplications. This will occur with
any collection type.
For optimization purposes, a deepcopy isn't run for the first iteration,
since that one can safely keep the original references.
This is not the most elegant solution, but it works for the described
cases.
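
In essence (variable names assumed):

    from copy import deepcopy

    # The first generation keeps the original references; later ones get
    # their own copies so nested collections aren't shared between choices.
    all_params = [base_params if i == 0 else deepcopy(base_params)
                  for i in range(num_choices)]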
Signed-off-by: kingbri <bdashore3@proton.me>
This adds the ability to add multiple choices to a generation. For now,
this is only available for non-streaming gens; porting it over to
streaming requires some more work.
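
Conceptually, the non-streaming path fans out like this (generate() is
a stand-in for the real call):

    import asyncio

    async def gather_choices(prompt, params, n):
        results = await asyncio.gather(
            *(generate(prompt, **params) for _ in range(n))
        )
        return [{"index": i, "text": text} for i, text in enumerate(results)]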
Signed-off-by: kingbri <bdashore3@proton.me>
Waiting for request disconnect takes some extra time and allows
generation chunks to pile up, resulting in large payloads being sent
at once rather than a smooth stream.
Use the polling method in non-streaming requests by creating a background
task and then checking whether the task is done, which signifies that the
request has been disconnected.
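
The polling pattern looks roughly like this (request.is_disconnected()
is Starlette's; the surrounding wiring is illustrative):

    import asyncio

    async def poll_disconnect(request):
        while not await request.is_disconnected():
            await asyncio.sleep(0.5)

    async def handle(request, generate):
        disconnect_task = asyncio.create_task(poll_disconnect(request))
        async for chunk in generate():
            if disconnect_task.done():
                break  # the client disconnected mid-generation
        disconnect_task.cancel()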
Signed-off-by: kingbri <bdashore3@proton.me>
Depending on the day of the week, Starlette surfaces a disconnect either
as a CancelledError or through await request.is_disconnected(). Run the
same behavior for both cases and allow cancellation.
Streaming requests now set an event to cancel the batched job and break
out of the generation loop.
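
Both signals funnel into the same abort path, roughly (the abort_event
plumbing is assumed):

    import asyncio

    async def stream(request, generator, abort_event: asyncio.Event):
        try:
            async for chunk in generator:
                if await request.is_disconnected():
                    abort_event.set()  # cancels the batched job
                    break
                yield chunk
        except asyncio.CancelledError:
            abort_event.set()
            raise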
Signed-off-by: kingbri <bdashore3@proton.me>
List comprehensions are the more "pythonic" way to map values to a list.
They're also more flexible across different collection types than the
built-in map function. It's best to keep one convention rather than
splitting across two.
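
For example:

    # Preferred:
    texts = [token.strip() for token in tokens]
    # Instead of:
    texts = list(map(str.strip, tokens))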
Signed-off-by: kingbri <bdashore3@proton.me>
Add a sequential lock and wait until jobs are completed before executing
any loading requests that directly alter the model. However, we also
need to block any new requests that come in until the load is finished,
so add a condition that triggers once the lock is free.
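
A sketch of the coordination (the job-draining and model-swap helpers
are placeholders):

    import asyncio

    load_lock = asyncio.Lock()
    load_condition = asyncio.Condition()

    async def load_model():
        async with load_lock:
            await wait_for_active_jobs()  # let in-flight generations drain
            await swap_model()
        async with load_condition:
            load_condition.notify_all()  # wake requests queued during the load

    async def generate():
        if load_lock.locked():
            async with load_condition:
                await load_condition.wait()  # block new requests mid-load
        ...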
Signed-off-by: kingbri <bdashore3@proton.me>
The dynamic generator requires Flash attention 2.5.7 or higher to
be installed. This is only supported on Nvidia's 30 series and higher.
If a card is AMD or older than the 30 series, switch to compatibility
mode, which functions the same way as the older generator, except
without parallel batching and any features that depend on it, such as
CFG.
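
The detection amounts to something like this (thresholds per the text
above; AMD detection is omitted for brevity):

    import torch
    from importlib.metadata import version
    from packaging.version import parse

    def supports_dynamic_generator() -> bool:
        if not torch.cuda.is_available():
            return False
        major, _ = torch.cuda.get_device_capability()
        if major < 8:  # pre-Ampere, i.e. older than the 30 series
            return False
        try:
            return parse(version("flash_attn")) >= parse("2.5.7")
        except Exception:
            return False  # flash_attn missing or unparsable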
Signed-off-by: kingbri <bdashore3@proton.me>
The new async dynamic job allows for native async support without the
need for threading. Also add logprobs and metrics back to responses.
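
Usage then reduces to iterating the job directly (shown as in
exllamav2's async examples; the result fields are assumptions):

    # job is an ExLlamaV2DynamicJobAsync attached to the generator
    async for result in job:
        if result.get("stage") == "streaming":
            text += result.get("text", "")
        # logprobs and metrics ride along in the result dicts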
Signed-off-by: kingbri <bdashore3@proton.me>
This reverts commit 7556dcf134.
The Optionals allowed requests to send "null" in the body for optional
parameters, which should be allowed.
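
With Pydantic, the distinction looks like:

    from typing import Optional

    from pydantic import BaseModel

    class CompletionRequest(BaseModel):
        # Optional permits an explicit "temperature": null in the JSON body
        temperature: Optional[float] = None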
Signed-off-by: kingbri <bdashore3@proton.me>
These both take an array of glob strings to state what files or
directories to include or exclude when parsing the download list.
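
The filtering reduces to something like this (fnmatch is stdlib; the
function name is made up):

    from fnmatch import fnmatch

    def filter_repo_files(files, include=None, exclude=None):
        if include:
            files = [f for f in files if any(fnmatch(f, p) for p in include)]
        if exclude:
            files = [f for f in files if not any(fnmatch(f, p) for p in exclude)]
        return files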
Signed-off-by: kingbri <bdashore3@proton.me>
Adds an asynchronous HuggingFace downloader that uses HF hub to fetch
all repo files. The current HF hub package's snapshot_download
function does not cancel on KeyboardInterrupt.
Instead, make a downloader that uses the Rich progress bar styling
along with a cancellable interface. Finally, link this to TabbyAPI.
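
A trimmed-down sketch (list_repo_files and hf_hub_download are real
hf_hub calls; the cancellation wiring is illustrative):

    import asyncio

    from huggingface_hub import HfApi, hf_hub_download
    from rich.progress import Progress

    async def download_repo(repo_id: str):
        files = HfApi().list_repo_files(repo_id)
        with Progress() as progress:
            task = progress.add_task(repo_id, total=len(files))
            try:
                for filename in files:
                    await asyncio.to_thread(hf_hub_download, repo_id, filename)
                    progress.advance(task)
            except asyncio.CancelledError:
                progress.stop()  # a KeyboardInterrupt lands here as a cancel
                raise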
Signed-off-by: kingbri <bdashore3@proton.me>
response_prefix is used to add a prefix before generating the next
message. This is useful in many cases, such as continuing a prompt
(see #96).
Also, if a template has a BOS token specified, add_bos_token will
produce two BOS tokens. Add a check that strips a starting BOS token
from the prompt if it exists.
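
The stripping check is roughly (names assumed):

    def strip_double_bos(prompt: str, bos_token: str, add_bos_token: bool) -> str:
        # If the rendered template already starts with a BOS token and the
        # tokenizer will add another, drop the one in the prompt text.
        if add_bos_token and prompt.startswith(bos_token):
            prompt = prompt[len(bos_token):]
        return prompt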
Signed-off-by: kingbri <bdashore3@proton.me>
Having many utility functions for initialization doesn't make much sense.
Instead, handle anything regarding template creation inside the
class, which reduces the number of function imports.
Signed-off-by: kingbri <bdashore3@proton.me>
A chat completion can now declare extra template_vars to pass when
a template is rendered, opening up the possibility of using state
outside of huggingface's parameters.
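
Roughly (field and variable names assumed):

    def render_prompt(template, messages, template_vars=None):
        render_vars = {
            "messages": messages,
            "add_generation_prompt": True,
            **(template_vars or {}),  # extra state from the request body
        }
        return template.render(**render_vars)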
Signed-off-by: kingbri <bdashore3@proton.me>
response_format allows a user to request a valid but arbitrary JSON
object from the API. This is a new part of the OAI spec.
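
Per the spec, the request carries:

    {"response_format": {"type": "json_object"}}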
Signed-off-by: kingbri <bdashore3@proton.me>
Fix the wrong class attribute name used for max_attention_size and
the declaration of the draft model's chunk_size.
Also expose the parameter to the end user in both config and model
load.
Signed-off-by: kingbri <bdashore3@proton.me>
Template modules grab all set vars, including ones that use runtime
vars. If a template var is set to a runtime var and a module is created,
an UndefinedError fires.
Use make_module instead to pass runtime vars when creating a template
module.
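
make_module accepts the runtime vars directly (this is standard Jinja2):

    # Passing vars here keeps runtime-dependent template vars defined
    module = template.make_module(vars=template_vars)
    stop_strings = getattr(module, "stop_strings", [])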
Resolves #92
Signed-off-by: kingbri <bdashore3@proton.me>
Best to move the inner workings into an inner function. Also fix
an edge case where stop strings can be a string rather than an array.
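
The edge case reduces to a normalization step:

    if isinstance(stop, str):
        stop = [stop]  # the API accepts a bare string or an array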
Signed-off-by: kingbri <bdashore3@proton.me>
Adding the stop_strings var to chat templates allows the template
creator to specify stopping strings to add onto chat completions.
These get appended to the existing stopping strings that are passed
in the API request. However, a sampler override with force: true will
override all stopping strings.
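
The merge behaves roughly like this (names assumed):

    def combine_stop_strings(request_stops, template_stops, override=None):
        if override and override.get("force"):
            return list(override["value"])  # a forced override wins outright
        return list(request_stops) + list(template_stops)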
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, some request errors were only sent to the client, but
some clients don't log the full error, so log it in the console as well.
Signed-off-by: kingbri <bdashore3@proton.me>
This causes spamming of warn statements on SIGINT. The message also
gets printed on a normal shutdown (one that isn't in the middle of a
request).
Signed-off-by: kingbri <bdashore3@proton.me>
Works the same way as streaming gens. If the request is cancelled,
it will log an error to the user and release the semaphore if it's
holding anything.
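
In outline (the generation and semaphore helpers are placeholders):

    import asyncio
    import logging

    logger = logging.getLogger(__name__)

    async def non_streaming_gen(request):
        try:
            return await run_generation(request)
        except asyncio.CancelledError:
            logger.error("Request cancelled by the client.")
            release_semaphore()  # only if this request is holding it
            raise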
Signed-off-by: kingbri <bdashore3@proton.me>
When the model is processing a prompt, add the ability to abort
on request cancellation. This also acts as a catch for a SIGINT.
Signed-off-by: kingbri <bdashore3@proton.me>
OAI expects finish_reason to be "stop" or "length" (there are others,
but they're outside the current scope of this project).
Make all completion and chat completion responses return this
from the model generation itself rather than using a placeholder.
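
The mapping is small (the generation's end-condition field is an
assumption):

    # OAI allows "stop" (EOS/stop string hit) or "length" (token limit hit)
    finish_reason = "length" if gen["stop_reason"] == "max_tokens" else "stop"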
Signed-off-by: kingbri <bdashore3@proton.me>
This is a definite way to check whether an authorized key is API or admin.
The endpoint only runs if the key is valid in the first place, to stay
in line with the API's security model.
Signed-off-by: kingbri <bdashore3@proton.me>