Commit graph

71 commits

Author SHA1 Message Date
kingbri
64c2cc85c9 OAI: Migrate model depends into proper file
Use amongst multiple routers.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-23 13:59:56 -04:00
kingbri
d1706fb067 OAI: Remove double logging if request is cancelled
Uvicorn can log in both the request disconnect handler and the
CancelledError. However, these sometimes don't work and both
need to be checked. But, don't log twice if one works.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-22 21:48:59 -04:00
kingbri
3826815edb API: Add request logging
Log all the parts of a request if the config flag is set. The logged
fields are all server side anyways, so nothing is being exposed to
clients.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-22 21:40:00 -04:00
kingbri
ad4d17bca2 Tree: Format
Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-22 12:24:34 -04:00
kingbri
0eedc8ca14 API: Switch from request ID middleware to depends
Middleware runs on both the request and response. Therefore, streaming
responses had increased latency when processing tasks and sending
data to the client which resulted in erratic streaming behavior.

Use a depends to add request IDs since it only executes when the
request is run rather than expecting the response to be sent as well.

For the future, it would be best to think about limiting the time
between each tick of chunk data to be safe.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-22 12:19:46 -04:00
kingbri
cae94b920c API: Add ability to use request IDs
Identify which request is being processed to help users disambiguate
which logs correspond to which request.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-21 21:01:05 -04:00
kingbri
38185a1ff4 Auth: Fix key check coalesce
Prefer the auth-specific headers before the generic authorization
header.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-19 10:08:57 -04:00
kingbri
c1b61441f4 OAI: Fix usage chunk return
Place the logic into their proper utility functions and cleanup
the code with formatting.

Also, OAI's docs specify that a [DONE] return is needed when everything
is finished.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-12 14:37:20 -04:00
Volodymyr Kuznetsov
b149d3398d OAI: support stream_options argument 2024-07-11 18:37:50 -07:00
kingbri
9fc3fc4c54 OAI: Amend comments
Clarify what the user can and can't see.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-11 14:22:50 -04:00
kingbri
1f46a1130c OAI: Restrict list permissions for API keys
API keys are not allowed to view all the admin's models, templates,
draft models, loras, etc. Basically anything that can be viewed
on the filesystem outside of anything that's currently loaded is
not allowed to be returned unless an admin key is present.

This change helps preserve user privacy while not erroring out on
list endpoints that the OAI spec requires.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-11 14:22:50 -04:00
kingbri
dfb4c51d5f OAI: Fix function idioms
Make functions mean the same thing to avoid confusion.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-11 14:22:50 -04:00
kingbri
b9a58ff01b Auth: Make key permission check work on Requests
Pass a request and internally unwrap the headers. In addition, allow
X-admin-key to get checked in an API key request.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-11 14:22:49 -04:00
kingbri
5c293499bd OAI: Reorder functions
Reordering routes changes the order of appearance on documentation.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-08 15:27:08 -04:00
kingbri
521d21b9f2 OAI: Add return types for docs
Adding return types allows for responses to get included in the
autogenerated docs.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-08 15:23:41 -04:00
kingbri
6613e38436 Main: Make openapi export store locally
This runs faster than always making a syscall to check if the env
var is set.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-08 14:54:06 -04:00
kingbri
933268f7e2 API: Integrate OpenAPI export script
Move OpenAPI export as an env var within the main function. This
allows for easy export by running main.

In addition, an env variable provides global and explicit state to
disable conditional wheel imports (ex. Exl2 and torch) which caused
errors at first.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-08 12:34:32 -04:00
kingbri
5e82b7eb69 API: Add standalone method to fetch OpenAPI docs
Generates and stores an export of the openapi.json file for use in
static websites.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-07 21:35:52 -04:00
kingbri
27d2d5f3d2 Config + Model: Allow for default fallbacks from config for model loads
Previously, the parameters under the "model" block in config.yml only
handled the loading of a model on startup. This meant that any subsequent
API request required each parameter to be filled out or use a sane default
(usually defaults to the model's config.json).

However, there are cases where admins may want an argument from the
config to apply if the parameter isn't provided in the request body.
To help alleviate this, add a mechanism that works like sampler overrides
where users can specify a flag that acts as a fallback.

Therefore, this change both preserves the source of truth of what
parameters the admin is loading and adds some convenience for users
that want customizable defaults for their requests.

This behavior may change in the future, but I think it solves the
issue for now.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-06 17:50:58 -04:00
DocShotgun
156b74f3f0
Revision to paged attention checks (#133)
* Model: Clean up paged attention checks

* Model: Move cache_size checks after paged attn checks
Cache size is only relevant in paged mode

* Model: Fix no_flash_attention

* Model: Remove no_flash_attention
Ability to use flash attention is auto-detected, so this flag is unneeded. Uninstall flash attention to disable it on supported hardware.
2024-06-09 17:28:11 +02:00
DocShotgun
55d979b7a5
Update dependencies, support Python 3.12, update for exl2 0.1.5 (#134)
* Dependencies: Add wheels for Python 3.12

* Model: Switch fp8 cache to Q8 cache

* Model: Add ability to set draft model cache mode

* Dependencies: Bump exllamav2 to 0.1.5

* Model: Support Q6 cache

* Config: Add Q6 cache and draft_cache_mode to config sample
2024-06-09 17:27:39 +02:00
Orion
6cc3bd9752
feat: list support in message.content (#122) 2024-06-03 19:57:15 +02:00
turboderp
1951f7521c
Forward exceptions from _stream_collector to stream_generate_(chat)_completion (#126) 2024-06-03 19:42:45 +02:00
turboderp
a011c17488 Revert "Forward exceptions from _stream_collector to stream_generate_chat_completion"
This reverts commit 1bb8d1a312.
2024-06-02 15:37:37 +02:00
turboderp
1bb8d1a312 Forward exceptions from _stream_collector to stream_generate_chat_completion 2024-06-02 15:13:30 +02:00
kingbri
e95e67a000 OAI: Add validation to "n"
n must be greater than 1 to generate.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-28 00:52:30 -04:00
kingbri
e2a8b6e8ae OAI: Add "n" support for streaming generations
Use a queue-based system to get choices independently and send them
in the overall streaming payload. This method allows for unordered
streaming of generations.

The system is a bit redundant, so maybe make the code more optimized
in the future.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-28 00:52:30 -04:00
kingbri
c8371e0f50 OAI: Copy gen params for "n"
For multiple generations in the same request, nested arrays kept their
original reference, resulting in duplications. This will occur with
any collection type.

For optimization purposes, a deepcopy isn't run for the first iteration
since original references are created.

This is not the most elegant solution, but it works for the described
cases.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-28 00:52:30 -04:00
kingbri
b944f8d756 OAI: Add "n" for non-streaming generations
This adds the ability to add multiple choices to a generation. This
is only available for non-streaming gens for now, it requires some
more work to port over to streaming.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-28 00:52:30 -04:00
DocShotgun
7ab7ffd562 Tree: Format 2024-05-26 15:48:18 -07:00
DocShotgun
767e6a798a API + Model: Add support for specifying k/v cache size 2024-05-26 14:17:01 -07:00
kingbri
d710a1b441 OAI: Switch to background task for disconnect checks
Waiting for request disconnect takes some extra time and allows
generation chunks to pile up, resulting in large payloads being sent
at once not making up a smooth stream.

Use the polling method in non-streaming requests by creating a background
task and then check if the task is done, signifying that the request
has been disconnected.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-26 13:52:20 -04:00
kingbri
660f9b8432 OAI: Fix request cancellation behavior
Depending on the day of the week, Starlette can work with a CancelledError
or using await request.is_disconnected(). Run the same behavior for both
cases and allow cancellation.

Streaming requests now set an event to cancel the batched job and break
out of the generation loop.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-26 13:00:33 -04:00
kingbri
9fbbc5afca Tree: Swap from map to list comprehensions
List comprehensions are the more "pythonic" way to approach mapping
values to a list. They're also more flexible across different collection
types rather than the inbuilt map method. It's best to keep one convention
rather than splitting down two.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
kingbri
43cd7f57e8 API + Model: Add blocks and checks for various load requests
Add a sequential lock and wait until jobs are completed before executing
any loading requests that directly alter the model. However, we also
need to block any new requests that come in until the load is finished,
so add a condition that triggers once the lock is free.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
kingbri
408c66a1f2 Model: Change FA2 and paged attention checks
The dynamic generator requires Flash attention 2.5.7 or higher to
be installed. This is only supported on Nvidia's 30 series and higher.

If a card is AMD or lower than the 30 series, switch to compatability
mode which functions the same way as the older generator, except
without parallel batching and any features that depend on it, such as
CFG.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
kingbri
06ff47e2b4 Model: Use true async jobs and add logprobs
The new async dynamic job allows for native async support without the
need of threading. Also add logprobs and metrics back to responses.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
kingbri
0e9385e023 API: Fix usage reporting for chat completions
Resolves #106

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-17 00:03:15 -04:00
kingbri
6f4012d20d API: Add preset listing for sampler overrides
Querying the overrides list endpoint now returns the selected preset
and a list of presets to use.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-12 01:34:51 -04:00
kingbri
ab526f7278 Revert "API: Remove unncessary Optional signatures"
This reverts commit 7556dcf134.

The Optionals allowed requests to send "null" in the body for optional
parameters which should be allowed.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-02 21:23:48 -04:00
kingbri
7556dcf134 API: Remove unncessary Optional signatures
Optional isn't necessary if the function signature has a default
value.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-01 00:04:52 -04:00
kingbri
50e0b71690 Downloader: Fix handling of include pattern
If an include or exclude pattern is provided, include should include
all files by default.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-30 01:13:06 -04:00
kingbri
21a01741c9 Downloader: Add include and exclude parameters
These both take an array of glob strings to state what files or
directories to include or exclude when parsing the download list.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-30 00:58:54 -04:00
kingbri
55ccd1baad API: Add HuggingFace downloader
Adds an asynchronous huggingface downloader that uses HF hub to fetch
all repo files. The current HF hub package has a snapshot_download
function that does not cancel on KeyboardInterrupt.

Instead, make a downloader that uses the Rich progress bar styling
along with a cancellable interface. Finally, link this to TabbyAPI.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-29 01:15:02 -04:00
kingbri
fb1d2f34c1 OAI: Add response_prefix and fix BOS token issues in chat completions
response_prefix is used to add a prefix before generating the next
message. This is used in many cases such as continuining a prompt
(see #96).

Also if a template has BOS token specified, add_bos_token will
append two BOS tokens. Add a check which strips a starting BOS token
from the prompt if it exists.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-25 00:54:43 -04:00
kingbri
cab789e685 Templates: Migrate to class
Having many utility functions for initialization doesn't make much sense.
Instead, handle anything regarding template creation inside the
class which reduces the amount of function imports.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-21 23:28:14 -04:00
kingbri
515b3c2930 OAI: Tokenize chat completion messages
Since chat completion messages are a structure, format the prompt
before checking in the tokenizer.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-15 14:17:16 -04:00
kingbri
2a0aaa2e8a OAI: Add ability to pass extra vars in jinja templates
A chat completion can now declare extra template_vars to pass when
a template is rendered, opening up the possibility of using state
outside of huggingface's parameters.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-11 09:49:25 -04:00
kingbri
b1f3baad74 OAI: Add response_format parameter
response_format allows a user to request a valid, but arbitrary JSON
object from the API. This is a new part of the OAI spec.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-09 21:33:31 -04:00
kingbri
d759a15559 Model: Fix chunk size handling
Wrong class attribute name used for max_attention_size and fixes
declaration of the draft model's chunk_size.

Also expose the parameter to the end user in both config and model
load.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-07 18:39:19 -04:00