Commit graph

143 commits

Author SHA1 Message Date
kingbri
b944f8d756 OAI: Add "n" for non-streaming generations
This adds the ability to add multiple choices to a generation. This
is only available for non-streaming gens for now, it requires some
more work to port over to streaming.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-28 00:52:30 -04:00
DocShotgun
7ab7ffd562 Tree: Format 2024-05-26 15:48:18 -07:00
DocShotgun
767e6a798a API + Model: Add support for specifying k/v cache size 2024-05-26 14:17:01 -07:00
kingbri
d710a1b441 OAI: Switch to background task for disconnect checks
Waiting for request disconnect takes some extra time and allows
generation chunks to pile up, resulting in large payloads being sent
at once not making up a smooth stream.

Use the polling method in non-streaming requests by creating a background
task and then check if the task is done, signifying that the request
has been disconnected.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-26 13:52:20 -04:00
kingbri
660f9b8432 OAI: Fix request cancellation behavior
Depending on the day of the week, Starlette can work with a CancelledError
or using await request.is_disconnected(). Run the same behavior for both
cases and allow cancellation.

Streaming requests now set an event to cancel the batched job and break
out of the generation loop.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-26 13:00:33 -04:00
kingbri
9fbbc5afca Tree: Swap from map to list comprehensions
List comprehensions are the more "pythonic" way to approach mapping
values to a list. They're also more flexible across different collection
types rather than the inbuilt map method. It's best to keep one convention
rather than splitting down two.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
kingbri
43cd7f57e8 API + Model: Add blocks and checks for various load requests
Add a sequential lock and wait until jobs are completed before executing
any loading requests that directly alter the model. However, we also
need to block any new requests that come in until the load is finished,
so add a condition that triggers once the lock is free.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
kingbri
408c66a1f2 Model: Change FA2 and paged attention checks
The dynamic generator requires Flash attention 2.5.7 or higher to
be installed. This is only supported on Nvidia's 30 series and higher.

If a card is AMD or lower than the 30 series, switch to compatability
mode which functions the same way as the older generator, except
without parallel batching and any features that depend on it, such as
CFG.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
kingbri
06ff47e2b4 Model: Use true async jobs and add logprobs
The new async dynamic job allows for native async support without the
need of threading. Also add logprobs and metrics back to responses.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
kingbri
0e9385e023 API: Fix usage reporting for chat completions
Resolves #106

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-17 00:03:15 -04:00
kingbri
6f4012d20d API: Add preset listing for sampler overrides
Querying the overrides list endpoint now returns the selected preset
and a list of presets to use.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-12 01:34:51 -04:00
kingbri
ab526f7278 Revert "API: Remove unncessary Optional signatures"
This reverts commit 7556dcf134.

The Optionals allowed requests to send "null" in the body for optional
parameters which should be allowed.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-02 21:23:48 -04:00
kingbri
7556dcf134 API: Remove unncessary Optional signatures
Optional isn't necessary if the function signature has a default
value.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-01 00:04:52 -04:00
kingbri
50e0b71690 Downloader: Fix handling of include pattern
If an include or exclude pattern is provided, include should include
all files by default.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-30 01:13:06 -04:00
kingbri
21a01741c9 Downloader: Add include and exclude parameters
These both take an array of glob strings to state what files or
directories to include or exclude when parsing the download list.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-30 00:58:54 -04:00
kingbri
55ccd1baad API: Add HuggingFace downloader
Adds an asynchronous huggingface downloader that uses HF hub to fetch
all repo files. The current HF hub package has a snapshot_download
function that does not cancel on KeyboardInterrupt.

Instead, make a downloader that uses the Rich progress bar styling
along with a cancellable interface. Finally, link this to TabbyAPI.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-29 01:15:02 -04:00
kingbri
fb1d2f34c1 OAI: Add response_prefix and fix BOS token issues in chat completions
response_prefix is used to add a prefix before generating the next
message. This is used in many cases such as continuining a prompt
(see #96).

Also if a template has BOS token specified, add_bos_token will
append two BOS tokens. Add a check which strips a starting BOS token
from the prompt if it exists.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-25 00:54:43 -04:00
kingbri
cab789e685 Templates: Migrate to class
Having many utility functions for initialization doesn't make much sense.
Instead, handle anything regarding template creation inside the
class which reduces the amount of function imports.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-21 23:28:14 -04:00
kingbri
515b3c2930 OAI: Tokenize chat completion messages
Since chat completion messages are a structure, format the prompt
before checking in the tokenizer.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-15 14:17:16 -04:00
kingbri
2a0aaa2e8a OAI: Add ability to pass extra vars in jinja templates
A chat completion can now declare extra template_vars to pass when
a template is rendered, opening up the possibility of using state
outside of huggingface's parameters.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-11 09:49:25 -04:00
kingbri
b1f3baad74 OAI: Add response_format parameter
response_format allows a user to request a valid, but arbitrary JSON
object from the API. This is a new part of the OAI spec.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-09 21:33:31 -04:00
kingbri
d759a15559 Model: Fix chunk size handling
Wrong class attribute name used for max_attention_size and fixes
declaration of the draft model's chunk_size.

Also expose the parameter to the end user in both config and model
load.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-07 18:39:19 -04:00
kingbri
5bb4995a7c API: Move OAI to APIRouter
This makes the API more modular for other API implementations in the
future.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-06 01:25:31 -04:00
kingbri
f9f8c97c6d Templates: Fix stop_string parsing
Template modules grab all set vars, including ones that use runtime
vars. If a template var is set to a runtime var and a module is created,
an UndefinedError fires.

Use make_module instead to pass runtime vars when creating a template
module.

Resolves #92

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-02 00:44:04 -04:00
kingbri
e8b6a02aa8 API: Move prompt template construction to utils
Best to move the inner workings within its inner function. Also fix
an edge case where stop strings can be a string rather than an array.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-29 02:24:13 -04:00
kingbri
dc456f4cc2 Templates: Add stop_strings meta param
Adding the stop_strings var to chat templates will allow for the
template creator to specify stopping strings to add onto chat completions.

Thes get appended with existing stopping strings that are passed
in the API request. However, a sampler override with force: true will
override all stopping strings.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-27 22:22:07 -04:00
kingbri
db62d1e649 OAI: Log request errors to console
Previously, some request errors were only sent to the client, but
some clients don't log the full error, so log it in console.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-23 20:29:17 -04:00
kingbri
f952b81ccf API: Remove uvicorn signal handler injection
This causes spamming of warn statements on SIGINT. The message also
gets printed on a normal shutdown (that isn't in the middle of a
request).

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-21 23:23:45 -04:00
kingbri
6dfcbbd813 Common: Migrate request utils to networking
Helps organize the project better. Utils is meant to be for simple
functions like unwrap.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-21 23:21:57 -04:00
kingbri
2961c5f3f9 API: Handle request disconnect on non-streaming gens
Works the same way as streaming gens. If the request is cancelled,
it will log an error to the user and release the semaphore if it's
holding anything.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-21 23:12:59 -04:00
kingbri
56fdfb5f8e OAI: Add stream to gen params
Good for logging.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-21 00:55:44 -04:00
kingbri
07d9b7cf7b Model: Add abort on generation
When the model is processing a prompt, add the ability to abort
on request cancellation. This is also a catch for a SIGINT.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-20 15:21:37 -04:00
kingbri
2704ff8344 Tree: Format
Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-18 16:02:29 -04:00
kingbri
5c7fc69ded API: Fix finish_reason returns
OAI expects finish_reason to be "stop" or "length" (there are others,
but they're not in the current scope of this project).

Make all completions and chat completions responses return this
from the model generation itself rather than putting a placeholder.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-18 15:59:28 -04:00
kingbri
25f5d4a690 API: Cleanup permission endpoint
Don't return an OAI specific type from a common file.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-18 15:13:26 -04:00
kingbri
3c08f46c51 Endpoints: Add key permission checker
This is a definite way to check if an authorized key is API or admin.
The endpoint only runs if the key is valid in the first place to keep
inline with the API's security model.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-18 00:53:27 -04:00
kingbri
14d8ec2007 Signal: Fix signal handlers for uvicorn
Add the ability to override uvicorn's signal handler in addition
to using main's signal handler for any SIGINTs before the API server
starts.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-16 23:23:31 -04:00
kingbri
2755fd1af0 API: Fix blocking iterator execution
Run these iterators on the background thread. On startup, the API
spawns a background thread as needed to run sync code on without blocking
the event loop.

Use asyncio's run_thread function since it allows for errors to be
propegated.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-16 23:23:31 -04:00
kingbri
7fded4f183 Tree: Switch to async generators
Async generation helps remove many roadblocks to managing tasks
using threads. It should allow for abortables and modern-day paradigms.

NOTE: Exllamav2 itself is not an asynchronous library. It's just
been added into tabby's async nature to allow for a fast and concurrent
API server. It's still being debated to run stream_ex in a separate
thread or manually manage it using asyncio.sleep(0)

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-16 23:23:31 -04:00
kingbri
33e2df50b7 API: Disable SSE ping chunks
These are mainly used for some clients that ping to see if the request
is alive. However, we don't need this.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-14 20:47:05 -04:00
kingbri
1ec8eb9620 Tree: Format
Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-13 00:02:55 -04:00
kingbri
6f03be9523 API: Split functions into their own files
Previously, generation function were bundled with the request function
causing the overall code structure and API to look ugly and unreadable.

Split these up and cleanup a lot of the methods that were previously
overlooked in the API itself.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-12 23:59:30 -04:00
kingbri
104a6121cb API: Split into separate folder
Moving the API into its own directory helps compartmentalize it
and allows for cleaning up the main file to just contain bootstrapping
and the entry point.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-12 23:59:30 -04:00