Commit graph

641 commits

kingbri
5c94894a1a Dependencies: Update Flash Attention
v2.5.6

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-30 16:58:24 -04:00
kingbri
b11aac51e2 Model: Add torch.inference_mode() to generator function
Provides a speedup to the model's forward pass.
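
A rough sketch of the pattern (the model and helper names are
illustrative, not tabby's actual ones):

    import torch

    @torch.inference_mode()  # re-applied around each resume of the generator
    def generate_gen(model, input_ids):
        # Skipping autograd bookkeeping speeds up pure-inference forwards.
        for token in stream_tokens(model, input_ids):  # hypothetical helper
            yield token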

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-30 10:45:28 -04:00
kingbri
e8b6a02aa8 API: Move prompt template construction to utils
It's best to keep the inner workings inside their own utility function.
Also fix an edge case where stop strings can arrive as a single string
rather than an array.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-29 02:24:13 -04:00
kingbri
190a0b26c3 Model: Fix generation when stream = false
References #91. Check if the length of the generation array is > 0
after popping the finish reason.
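
A sketch of the guard, with assumed names for the chunk list and fields:

    # The final chunk may carry only the finish reason, so pop it first,
    # then make sure actual generated chunks remain before joining them.
    finish_reason = generations.pop().get("finish_reason")
    if len(generations) > 0:
        full_text = "".join(chunk.get("text", "") for chunk in generations)
    else:
        full_text = ""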

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-29 02:15:56 -04:00
kingbri
d4280e1378 Dependencies: Add pytorch-triton-rocm
Required for AMD installs.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-28 11:02:56 -04:00
kingbri
271f5ba7a4 Templates: Modify alpaca and chatml
Add the stop_strings metadata parameter.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-27 22:28:41 -04:00
kingbri
dc456f4cc2 Templates: Add stop_strings meta param
Adding the stop_strings var to chat templates allows the template
creator to specify stopping strings to append to chat completions.

These are appended to any existing stopping strings passed in the API
request. However, a sampler override with force: true will override
all stopping strings.
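
A minimal sketch of the merge logic, assuming the override carries
force and value fields:

    def merge_stop_strings(request_stops, template_stops, override=None):
        # A forced sampler override wins outright; otherwise template
        # stop strings are appended to the ones from the API request.
        if override is not None and override.get("force"):
            return list(override["value"])
        return list(request_stops) + list(template_stops)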

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-27 22:22:07 -04:00
kingbri
277c540c98 Colab: Update
Switch to pyproject

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-24 21:48:48 -04:00
kingbri
db62d1e649 OAI: Log request errors to console
Previously, some request errors were only sent to the client, but some
clients don't log the full error, so log it to the console as well.
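
Roughly the pattern (the status code and helper name are assumptions):

    import logging

    from fastapi import HTTPException

    logger = logging.getLogger(__name__)

    def handle_request_error(message: str) -> HTTPException:
        # Log server-side first so the full error survives even when
        # the client truncates or hides it, then return it to the client.
        logger.error(message)
        return HTTPException(status_code=422, detail=message)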

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-23 20:29:17 -04:00
kingbri
26496c4db2 Dependencies: Require tokenizers
This is used by some models and isn't too large (compared to other
HuggingFace dependencies), so include it by default.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-23 01:12:21 -04:00
kingbri
1755f284cf Model: Prompt users to install extras if dependencies don't exist
Ex: tokenizers, lmfe, outlines.
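
Roughly the shape of the prompt (the extras name is an assumption):

    try:
        import outlines  # optional dependency
    except ImportError as exc:
        raise ImportError(
            "This feature needs an optional dependency. "
            "Install the extras, e.g. pip install .[extras]"
        ) from exc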

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-22 22:13:55 -04:00
kingbri
f952b81ccf API: Remove uvicorn signal handler injection
This causes a spam of warning statements on SIGINT. The message also
gets printed on a normal shutdown (one that isn't in the middle of a
request).

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-21 23:23:45 -04:00
kingbri
6dfcbbd813 Common: Migrate request utils to networking
Helps organize the project better. Utils is meant for simple
functions like unwrap.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-21 23:21:57 -04:00
kingbri
2961c5f3f9 API: Handle request disconnect on non-streaming gens
Works the same way as streaming gens. If the request is cancelled,
it will log an error to the user and release the semaphore if it's
holding anything.
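
A sketch of the watcher, assuming a task-based generation and a
polling interval:

    import asyncio

    async def watch_disconnect(request, gen_task, semaphore):
        # Poll the connection; if the client goes away, cancel the
        # generation and free the semaphore slot it was holding.
        while not gen_task.done():
            if await request.is_disconnected():  # starlette Request API
                gen_task.cancel()
                semaphore.release()
                break
            await asyncio.sleep(0.5)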

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-21 23:12:59 -04:00
kingbri
44b7319710 Start: Print pip install command
Helps with debugging.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-21 18:14:48 -04:00
kingbri
5055a98e41 Model: Wrap load in inference_mode
Some tensors were being taken out of inference mode during each
iteration of exllama's load_autosplit_gen. This causes errors since
autograd is off.

Therefore, give the shared load_gen_sync function an overarching
inference_mode context to prevent forward issues. This should allow
the generator to iterate across each thread call.
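
A sketch of the wrapper; the decorator form re-enters the context on
every resume, so each iteration stays in inference mode regardless of
which thread drives it:

    import torch

    @torch.inference_mode()
    def load_gen_sync(model, *args):
        # Every step of exllama's autosplit load generator now runs
        # inside inference mode, even across separate thread calls.
        yield from model.load_autosplit_gen(*args)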

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-21 18:06:50 -04:00
kingbri
37a80334a8 Dependencies: Add packaging
This is a required dependency.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-21 11:27:27 -04:00
kingbri
56fdfb5f8e OAI: Add stream to gen params
Good for logging.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-21 00:55:44 -04:00
kingbri
69e41e994c Model: Fix generation with non-streaming and logprobs
Finish_reason was giving an empty offset. Fix this by grabbing the
finish reason first and then handling the static generation as normal.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-21 00:47:24 -04:00
kingbri
345bcc30c7 Dependencies: Add extras feature
Installs all optional dependencies into the venv.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-21 00:09:38 -04:00
kingbri
51b289cab2 Actions: Fix workflows
Adapt to the new pyproject install method

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-20 15:21:37 -04:00
kingbri
1e7cf1e5a4 Start: Prompt user for GPU/lib
There is no platform-agnostic way to fetch the CUDA/ROCm version
since environment variables vary and users don't necessarily need
CUDA or ROCm installed to run PyTorch (PyTorch installs the necessary
libs if they don't exist).

Therefore, prompt the user for their GPU lib and store the result in
a text file so the user doesn't need to re-enter the preference each time.
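
Roughly (the filename and choices here are assumptions):

    from pathlib import Path

    PREF_FILE = Path("gpu_lib.txt")  # hypothetical preference file

    def get_gpu_lib() -> str:
        # Reuse the saved answer so the user isn't asked on every start.
        if PREF_FILE.exists():
            return PREF_FILE.read_text().strip()
        choice = input("Select your GPU lib (cuda/rocm): ").strip()
        PREF_FILE.write_text(choice)
        return choice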

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-20 15:21:37 -04:00
kingbri
7e669527ed Model: Fix tokenizer bugs
Some tokenizer variables don't get cleaned up on init, so these can
persist. Clean these up manually before creating a new tokenizer for
now.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-20 15:21:37 -04:00
kingbri
07d9b7cf7b Model: Add abort on generation
When the model is processing a prompt, add the ability to abort
on request cancellation. This also acts as a catch for SIGINT.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-20 15:21:37 -04:00
kingbri
7020a0a2d1 Dependencies: Update Exllamav2
v0.0.16

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-20 15:21:37 -04:00
kingbri
061e1d94c2 Ruff: Migrate to pyproject
Removes unnecessary ruff.toml.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-20 15:21:37 -04:00
kingbri
1059101b23 Dependencies: Remove requirements-*.txt files
Pyproject.toml replaces these files.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-20 15:21:37 -04:00
kingbri
72b08624a3 Start: Update to use pyproject
Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-20 15:21:37 -04:00
kingbri
b1ca435695 Tree: Add pyproject.toml
This will manage dependencies from now on since it's a more flexible
format, similar to the manifests used by other packaging tools like
npm and cargo.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-20 15:21:37 -04:00
kingbri
b74603db59 Model: Log metrics before yielding a stop
Yielding the finish reason before the logging causes the function to
terminate early. Instead, log before yielding and breaking out of the
generation loop.
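
The ordering in sketch form (stream and log_metrics are illustrative):

    def generate():
        for chunk in stream():
            if chunk.finish_reason:
                # The consumer may stop iterating once it sees the
                # finish chunk, so log before yielding or this never runs.
                log_metrics(chunk)
                yield chunk
                break
            yield chunk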

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-20 01:17:04 -04:00
kingbri
09a4c79847 Model: Auto-scale max_tokens by default
If max_tokens is None, it automatically scales to fill the remaining
context. This does not mean the generation will fill that context,
since EOS and other stop conditions still apply.

Originally suggested by #86
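
The default in sketch form, with assumed names:

    def resolve_max_tokens(max_tokens, max_seq_len, prompt_len):
        # None means "scale to whatever context remains after the
        # prompt"; EOS and stop strings can still end generation sooner.
        return max_seq_len - prompt_len if max_tokens is None else max_tokens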

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-18 22:54:59 -04:00
kingbri
8cbb59d6e1 Model: Cleanup some comments
Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-18 22:20:45 -04:00
kingbri
4f75fb5588 Model: Adjust max output len
Max output len should be hardcoded to 16 since it's the number of
tokens to predict per forward pass. 16 is a good value for both
normal inference and speculative decoding, and it also saves VRAM
compared to the previous default of 2048.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-18 22:16:53 -04:00
kingbri
2704ff8344 Tree: Format
Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-18 16:02:29 -04:00
kingbri
5c7fc69ded API: Fix finish_reason returns
OAI expects finish_reason to be "stop" or "length" (there are others,
but they're not in the current scope of this project).

Make all completions and chat completions responses return this
from the model generation itself rather than putting a placeholder.
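
The mapping in sketch form (the function name is illustrative):

    def to_finish_reason(hit_token_limit: bool) -> str:
        # "length" when the token budget cut generation off,
        # "stop" for an EOS token or a stop string.
        return "length" if hit_token_limit else "stop"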

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-18 15:59:28 -04:00
kingbri
25f5d4a690 API: Cleanup permission endpoint
Don't return an OAI specific type from a common file.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-18 15:13:26 -04:00
kingbri
3c08f46c51 Endpoints: Add key permission checker
This is a definitive way to check whether an authorized key is API or
admin. The endpoint only runs if the key is valid in the first place,
keeping in line with the API's security model.
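
A sketch of the endpoint; the route path, header, and checker are
assumptions, not the actual API surface:

    from fastapi import APIRouter, Header

    router = APIRouter()

    def is_admin_key(key: str) -> bool:
        # Stand-in for the real key store lookup.
        return key in {"example-admin-key"}

    @router.get("/v1/auth/permission")
    async def key_permission(x_api_key: str = Header(None)):
        # Auth middleware has already rejected invalid keys by this
        # point, so only the API/admin distinction is decided here.
        return {"permission": "admin" if is_admin_key(x_api_key) else "api"}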

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-18 00:53:27 -04:00
kingbri
c9a6d9ae1f Model: Switch to begin_stream_ex
Allows logprobs params to be passed dynamically instead of assuming
them when the generator is initialized.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-17 14:41:16 -04:00
kingbri
08bcc6307a Config: Update description part 2
Forgot to change wording.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-17 01:07:23 -04:00
kingbri
7abbac098a Config: Update Q4 in comments
Wasn't present when the option was added.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-17 01:04:12 -04:00
kingbri
14d8ec2007 Signal: Fix signal handlers for uvicorn
Add the ability to override uvicorn's signal handler in addition
to using main's signal handler for any SIGINTs before the API server
starts.
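
Roughly the known uvicorn pattern: override install_signal_handlers
on a Server subclass so uvicorn doesn't replace the handler main
installed (class and handler names here are illustrative):

    import signal
    import sys

    import uvicorn

    def sigint_handler(signum, frame):
        # Covers SIGINT before the API server starts.
        sys.exit(0)

    signal.signal(signal.SIGINT, sigint_handler)

    class TabbyServer(uvicorn.Server):
        def install_signal_handlers(self):
            # Skip uvicorn's own handlers so ours stay in control.
            pass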

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-16 23:23:31 -04:00
kingbri
95e44c20d6 Model: Fix load if model didn't load properly
If the model didn't load properly, the container still exists until
unload is called. However, the name check still registered it as loaded.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-16 23:23:31 -04:00
kingbri
2755fd1af0 API: Fix blocking iterator execution
Run these iterators on the background thread. On startup, the API
spawns a background thread as needed to run sync code without
blocking the event loop.

Use asyncio's to_thread function since it allows errors to be
propagated.
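
A sketch of driving a blocking iterator from async code with to_thread:

    import asyncio

    async def iterate_in_thread(sync_gen):
        # Pull each item on a worker thread so a blocking generator
        # never stalls the event loop; exceptions raised by the
        # iterator propagate back to the awaiting coroutine.
        sentinel = object()
        while (item := await asyncio.to_thread(next, sync_gen, sentinel)) is not sentinel:
            yield item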

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-16 23:23:31 -04:00
kingbri
7fded4f183 Tree: Switch to async generators
Async generation removes many roadblocks to managing tasks with
threads. It should allow for abortable generations and modern async
paradigms.

NOTE: Exllamav2 itself is not an asynchronous library. It's just
been wrapped into tabby's async nature to allow for a fast and
concurrent API server. It's still under debate whether to run
stream_ex in a separate thread or to manage it manually using
asyncio.sleep(0).
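
One of the two approaches under debate, sketched:

    import asyncio

    async def generate_async(sync_stream):
        # Keep the sync exllamav2 iterator on the event loop, but
        # yield control after every chunk so other requests stay
        # responsive.
        for chunk in sync_stream:
            yield chunk
            await asyncio.sleep(0)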

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-16 23:23:31 -04:00
kingbri
33e2df50b7 API: Disable SSE ping chunks
These are mainly used by clients that ping to check whether the
request is still alive, which isn't needed here.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-14 20:47:05 -04:00
kingbri
7006fa4cc8 Tree: Format
Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-13 23:33:18 -04:00
kingbri
efc01d947b API + Model: Add speculative ngram decoding
Speculative ngram decoding is like speculative decoding without the
draft model. It's not as useful because it only speeds up predictable
sequences, but that depends on the use case.
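
A toy illustration of the idea (not exllamav2's implementation):

    def ngram_draft(tokens, n=3, max_draft=8):
        # Find the most recent earlier occurrence of the last n tokens
        # and propose whatever followed it as free draft tokens.
        key = tuple(tokens[-n:])
        for i in range(len(tokens) - n - 1, -1, -1):
            if tuple(tokens[i:i + n]) == key:
                return tokens[i + n:i + n + max_draft]
        return []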

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-13 23:32:11 -04:00
kingbri
2ebefe8258 Logging: Move metrics to gen logging
This didn't have a place in the generation function.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-13 23:13:55 -04:00
kingbri
1ec8eb9620 Tree: Format
Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-13 00:02:55 -04:00
kingbri
8e4745920c Requirements: Update Ruff
v0.3.2

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-13 00:02:55 -04:00