Commit graph

641 commits

Author SHA1 Message Date
kingbri
f196f1177d Requirements: Update exllamav2 to 0.0.11
Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-16 19:33:42 -05:00
kingbri
1a331afe3a OAI: Add cache_mode parameter to model
Forgot that the user can choose which cache mode to use
when loading a model.

Also add it when fetching model info.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-16 02:47:50 -05:00
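
A hedged sketch of passing cache_mode on a load request; the endpoint
path, auth header, and accepted values are assumptions drawn from the
commit message, not confirmed against this revision:

    import requests

    payload = {
        "name": "my-exl2-model",  # hypothetical model folder name
        "cache_mode": "FP8",      # assumed values: "FP16" (default) or "FP8"
    }
    response = requests.post(
        "http://localhost:5000/v1/model/load",  # assumed route
        headers={"x-admin-key": "your-admin-key"},  # hypothetical header
        json=payload,
    )
    print(response.json())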
kingbri
ed868fd262 OAI: Remove unused parameters
Seed and low_mem aren't used, so comment them out.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-15 14:56:43 -05:00
kingbri
59729e2a4a Tests: Fix linting
Also change how wheel_test works to test imports safely rather than
importing the package directly, which can cause system issues.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-13 23:05:50 -05:00
kingbri
036ba2669c Auth: Migrate to Pydantic
It's easier to work with Pydantic dataclasses than with standard
Python classes.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-12 23:58:22 -05:00
kingbri
eb8ccb9783 Tree: Fix linter issues
Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-12 23:58:19 -05:00
kingbri
083df7d585 Tree: Add generation logging support
Generations can be logged in the console along with sampling parameters
if the user enables it in config.

Metrics are always logged at the end of each prompt. In addition,
the model endpoint tells the user whether they're being logged
for transparency purposes.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-12 23:43:35 -05:00
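
A minimal sketch of what the config-gated logging could look like;
the flag names are assumptions drawn from the commit message, not the
actual config schema:

    log_config = {"log_prompt": True, "log_generation_params": True}

    def log_generation(prompt: str, params: dict) -> None:
        # Both flags are opt-in via config
        if log_config["log_generation_params"]:
            print(f"Sampling params: {params}")
        if log_config["log_prompt"]:
            print(f"Prompt: {prompt}")

    log_generation("Hello there", {"temperature": 0.8, "top_p": 0.9})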
kingbri
b364de1005 Update README
Add alternatives if the user doesn't agree with the focus of TabbyAPI.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-10 16:05:46 -05:00
kingbri
db87efde4a OAI: Add ability to specify fastchat prompt template
Sometimes fastchat may not be able to detect the prompt template from
the model path. Therefore, add the ability to set it in config.yml or
via the request object itself.

Also send the provided prompt template on model info request.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-10 15:43:58 -05:00
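
A minimal sketch of the override described above, using FastChat's
real template helpers; the function and fallback order here are
illustrative, not the repo's actual code:

    from fastchat.conversation import get_conv_template
    from fastchat.model.model_adapter import get_conversation_template

    def fetch_template(model_path, template_name=None):
        if template_name:
            # Template explicitly set in config.yml or the request
            return get_conv_template(template_name)
        # Otherwise let FastChat guess from the model path
        return get_conversation_template(model_path)

    conv = fetch_template("/models/my-model", template_name="llama-2")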
kingbri
9f195af5ad Main: Fix function calls
Some function names were declared twice.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-10 13:28:21 -05:00
kingbri
fd9f3eac87 Model: Add params to current model endpoint
Grabs the current model rope params, max seq len, and the draft model
if applicable.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-10 00:40:56 -05:00
kingbri
0f4290f05c Model: Format
Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-09 22:48:42 -05:00
kingbri
5ae2a91c04 Tree: Use unwrap and coalesce for optional handling
Python doesn't have proper handling of optionals. The only ways to
handle them are checking whether the value is None via an if
statement or using the "or" keyword to unwrap.

Previously, I used the "or" method to unwrap, but this caused issues
due to falsy values falling back to the default. This is especially
the case with booleans, where "False" changed to "True".

Instead, add two new functions, unwrap and coalesce, which properly
implement None coalescing in a functional style.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-09 21:52:17 -05:00
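
For reference, a sketch of what such helpers can look like (the exact
bodies in the repo may differ):

    def unwrap(wrapped, default=None):
        # Return wrapped unless it is None, else the default.
        return default if wrapped is None else wrapped

    def coalesce(*args):
        # Return the first argument that is not None.
        return next((arg for arg in args if arg is not None), None)

    # The falsy pitfall described above:
    use_cache = False
    print(use_cache or True)        # True  -- "or" clobbers a real False
    print(unwrap(use_cache, True))  # False -- only None falls back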
DocShotgun
7380a3b79a Implement lora support (#24)
* Model: Implement basic lora support

* Add ability to load loras from config on launch
* Supports loading multiple loras and lora scaling
* Add function to unload loras

* Colab: Update for basic lora support

* Model: Test vram alloc after lora load, add docs

* Git: Add loras folder to .gitignore

* API: Add basic lora-related endpoints

* Add /loras/ endpoint for querying available loras
* Add /model/lora endpoint for querying currently loaded loras
* Add /model/lora/load endpoint for loading loras
* Add /model/lora/unload endpoint for unloading loras
* Move lora config-checking logic to main.py for better compat with API endpoints

* Revert bad CRLF line ending changes

* API: Add basic lora-related endpoints (fixed)

* Add /loras/ endpoint for querying available loras
* Add /model/lora endpoint for querying currently loaded loras
* Add /model/lora/load endpoint for loading loras
* Add /model/lora/unload endpoint for unloading loras
* Move lora config-checking logic to main.py for better compat with API endpoints

* Model: Unload loras first when unloading model

* API + Models: Cleanup lora endpoints and functions

Condenses endpoint and model load code. Also makes the lora routes
behave the same way as model routes to avoid confusing the end user.

Signed-off-by: kingbri <bdashore3@proton.me>

* Loras: Optimize load endpoint

Return successes and failures along with consolidating the request
to the rewritten load_loras function.

Signed-off-by: kingbri <bdashore3@proton.me>

---------

Co-authored-by: kingbri <bdashore3@proton.me>
Co-authored-by: DocShotgun <126566557+DocShotgun@users.noreply.github.com>
2023-12-08 23:38:08 -05:00
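
A hedged usage sketch of the lora endpoints listed in the PR above.
The paths come from the PR description; the /v1 prefix, auth header,
and request shape are assumptions:

    import requests

    BASE = "http://localhost:5000/v1"  # assumed prefix
    HEADERS = {"x-api-key": "your-api-key"}  # hypothetical auth header

    # Query available loras
    print(requests.get(f"{BASE}/loras", headers=HEADERS).json())

    # Load loras with optional scaling (request shape is assumed)
    payload = {"loras": [{"name": "my-lora", "scaling": 0.8}]}
    print(requests.post(f"{BASE}/model/lora/load", headers=HEADERS, json=payload).json())

    # Query, then unload, the currently loaded loras
    print(requests.get(f"{BASE}/model/lora", headers=HEADERS).json())
    requests.post(f"{BASE}/model/lora/unload", headers=HEADERS)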
kingbri
161c9d2c19 Tests: Fix wheel test
FastChat is named fschat from the package index's point of view.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-08 01:15:24 -05:00
kingbri
fa1e99daf6 Model: Remove unused print statement
Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-07 21:13:52 -05:00
kingbri
47176a2a1e Requirements: Fix torch install
Use --extra-index-url to install pytorch. This should be secure enough
since dependency confusion attacks aren't possible with just installing
the torch package.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-06 19:04:35 -05:00
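
For reference, the relevant requirements lines look something like
this (the cu118 index URL is PyTorch's published wheel index; treat
the exact layout as illustrative):

    --extra-index-url https://download.pytorch.org/whl/cu118
    torch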
kingbri
f8e9e22c43 API: Fix model load endpoint with draft
Draft wasn't being parsed correctly with the new changes, which
removed the draft_enabled bool. There's still some more work to be done with
returning exceptions.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-06 18:05:55 -05:00
kingbri
6a71890d45 Model: Fix sampler bugs
Lots of bugs were unearthed when switching to the new fallback
handling. Fix them and make sure samplers are set properly.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-06 17:29:58 -05:00
kingbri
9f34af4906 Tests: Create
Add a few tests so the user can check that things work.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-06 00:53:42 -05:00
kingbri
21c25fd806 Update README
Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-06 00:24:49 -05:00
kingbri
b83e1b704e Requirements: Split for configurations
Add self-contained requirements for cuda 11.8 and ROCm

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-06 00:00:30 -05:00
kingbri
4c0e686e7d Model: Cleanup and fix fallbacks
Use the standard "dict.get("key") or default" pattern to fetch values
from kwargs and fall back to a default without raising errors.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-05 23:28:16 -05:00
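
A minimal sketch of this fallback pattern (the key name is
illustrative):

    kwargs = {"rope_scale": None}
    # dict.get avoids KeyError; "or" supplies the fallback
    rope_scale = kwargs.get("rope_scale") or 1.0
    print(rope_scale)  # 1.0

Note that commit 5ae2a91c04 above later replaces this pattern with
unwrap, since "or" also overrides legitimate falsy values such as
False and 0.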
Brian Dashore
0ef2fe9b95
Merge pull request #23 from DocShotgun/main
Expose draft_rope_scale
2023-12-05 22:24:53 -05:00
kingbri
d8f7b93c54 Model: Fix fetching of draft args
Mistakenly fetched these from parent kwargs instead of the scoped
draft_config var.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-05 22:24:27 -05:00
DocShotgun
3f2fcbcc45
Add fallback to draft_rope_scale to 1.0
2023-12-05 18:51:36 -08:00
DocShotgun
39f7a2aabd
Expose draft_rope_scale
2023-12-05 12:59:32 -08:00
Brian Dashore
e085b806e8
Merge pull request #22 from DocShotgun/main
Update colab, expose additional args
2023-12-05 01:22:33 -05:00
DocShotgun
67507105d0
Update colab, expose additional args
* Exposed draft model args for speculative decoding
* Exposed int8 cache, dummy models, and no flash attention
* Resolved CUDA 11.8 dependency issue
2023-12-04 22:20:46 -08:00
Brian Dashore
37f8f3ef8b
Merge pull request #20 from veryamazinglystupid/main
make colab better, fix libcudart errors
2023-12-05 01:14:21 -05:00
kingbri
621e11b940 Update documentation
Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-05 00:33:43 -05:00
kingbri
8ba3bfa6b3 API: Fix load exception handling
Models do not fully unload if an exception is caught in load. Therefore,
leave it to the client to unload on cancel.

Also add handlers in the event an SSE stream is cancelled. These
packets can't be sent back to the client since the client has severed
the connection, so print them in the terminal.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-05 00:23:15 -05:00
kingbri
7c92968558 API: Fix mistaken debug statement
Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-04 18:07:12 -05:00
kingbri
5e54911cc8 API: Fix semaphore handling and chat completion errors
Chat completions previously always yielded a final packet to say that
a generation finished. However, this caused errors where a yield was
executed after GeneratorExit. The error is correct: Python's garbage
collector can't clean up the generator after exit because the finally
block still executes.

In addition, SSE endpoints close off the connection, so the finish
packet can only be yielded while the response is still open; skip the
yield on exception.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-04 15:51:25 -05:00
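
A hedged sketch of the streaming behavior being fixed; the packet
contents and names are illustrative:

    def stream_completion(chunks):
        # If the client disconnects mid-stream, the generator is closed
        # and GeneratorExit is raised at the paused yield; yielding again
        # at that point raises a RuntimeError. Only send the finish
        # packet on a clean completion.
        try:
            for chunk in chunks:
                yield chunk
        except GeneratorExit:
            print("Stream cancelled by client")  # can't yield here
            raise
        else:
            yield "data: [DONE]\n\n"  # finish packet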
kingbri
30fc5b3d29 Merge branch 'main' of github.com:theroyallab/tabbyAPI
2023-12-03 22:55:51 -05:00
kingbri
ed6c962aad API: Fix sequential requests
FastAPI is kinda weird with queueing. If an await is used within an
async def, requests aren't executed sequentially. Restore sequential
requests by using a semaphore to limit concurrent execution of
generator functions.

Also scaffold the framework to move generator functions to their own
file.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-03 22:54:34 -05:00
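
A minimal runnable sketch of the semaphore approach, with a stand-in
for the actual generator call:

    import asyncio

    generate_semaphore = asyncio.Semaphore(1)  # one generation at a time

    async def run_generation(prompt: str) -> str:
        await asyncio.sleep(0.1)  # stand-in for the real model call
        return f"echo: {prompt}"

    async def generate_with_lock(prompt: str) -> str:
        # Awaits inside async endpoints let FastAPI interleave requests;
        # the semaphore restores sequential execution.
        async with generate_semaphore:
            return await run_generation(prompt)

    print(asyncio.run(generate_with_lock("hello")))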
veryamazinglystupid
ad1a12a0f2
make colab better, fix libcudart errors
:3
2023-12-03 14:07:52 +05:30
DocShotgun
2a9e4ca051 Add Colab example
*note: this uses wheels for Python 3.10 and torch 2.1.0+cu118, which is the current default in Colab
2023-12-03 02:21:51 -05:00
kingbri
e740b53478 Requirements: Update Flash Attention 2
Bump to 2.3.6

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-03 01:56:29 -05:00
kingbri
c67c9f6d66 Model + Config: Remove low_mem option
Low_mem doesn't work in exl2 and it was an experimental option to
begin with. Keep the loading code commented out in case it gets fixed
in the future.

A better alternative is to use the 8-bit cache, which works and helps
save VRAM.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-03 01:07:42 -05:00
Brian Dashore
109e4223e0
Merge pull request #18 from DocShotgun/main
Add automatic NTK-aware alpha scaling to model
2023-12-03 01:06:50 -05:00
kingbri
27fc0c0069 Model: Cleanup and compartmentalize auto rope functions
Also handle an edge case if ratio <= 1 since NTK scaling is only
used for values > 1.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-03 01:05:09 -05:00
DocShotgun
bd2c5d0d09
Force auto-alpha to 1.0 if config ctx == base ctx
2023-12-02 21:19:59 -08:00
DocShotgun
1c398b0be7
Add automatic NTK-aware alpha scaling to model
* Enables automatic calculation of NTK-aware alpha scaling for models if the rope_alpha arg is not passed in the config, using the same formula as for draft models
2023-12-02 21:02:29 -08:00
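
A sketch of the auto-alpha calculation described above, including the
ratio <= 1 edge case from commit 27fc0c0069; the quadratic
coefficients are an assumption about the formula the repo uses:

    def calculate_rope_alpha(base_seq_len: int, target_seq_len: int) -> float:
        # Estimate NTK-aware rope alpha from the context-length ratio.
        ratio = target_seq_len / base_seq_len
        if ratio <= 1:
            return 1.0  # NTK scaling only applies when extending context
        # Quadratic fit from ratio to alpha (coefficients are assumed)
        return -0.13436 + 0.80541 * ratio + 0.28833 * ratio**2

    print(calculate_rope_alpha(4096, 8192))  # ratio 2.0 -> alpha ~2.63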
kingbri
61f6e51fdb OAI: Add separator style fallback
Some models may return None for separator style with FastChat. Fall
back to LLAMA2 if this is the case.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-01 23:30:19 -05:00
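
A minimal sketch of the fallback, using FastChat's real
SeparatorStyle enum; the surrounding handling is illustrative:

    from fastchat.conversation import SeparatorStyle, get_conv_template

    conv = get_conv_template("llama-2")
    # Some templates may leave sep_style unset; fall back to LLAMA2
    sep_style = conv.sep_style if conv.sep_style is not None else SeparatorStyle.LLAMA2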
kingbri
ae69b18583 API: Use FastAPI streaming instead of sse_starlette
sse_starlette kept firing a ping response if it was taking too long
to set an event. Rather than using a hacky workaround, switch to
FastAPI's built-in streaming response and construct SSE packets with
a utility function.

This helps the API become more robust and removes an extra requirement.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-01 01:54:35 -05:00
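
A hedged sketch of serving SSE with FastAPI's built-in
StreamingResponse and a small packet-formatting helper; the route and
payloads are hypothetical:

    import json
    from fastapi import FastAPI
    from fastapi.responses import StreamingResponse

    app = FastAPI()

    def get_sse_packet(data: dict) -> str:
        # SSE frames are "data: <payload>\n\n"
        return f"data: {json.dumps(data)}\n\n"

    @app.get("/v1/stream-demo")  # hypothetical demo route
    async def stream_demo():
        async def generator():
            for i in range(3):
                yield get_sse_packet({"token": i})
        return StreamingResponse(generator(), media_type="text/event-stream")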
kingbri
6493b1d2aa OAI: Add ability to send dummy models
Some APIs require an OAI model to be returned by the models endpoint.
Fix this by adding a GPT 3.5 turbo entry first in the list to cover
as many APIs as possible.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-01 00:27:28 -05:00
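
A hedged sketch of the model list with the dummy entry first; beyond
the basic OAI /v1/models schema, the field values are assumptions:

    def model_list(local_models):
        # Dummy OAI entry first so strict clients find a known model id
        entries = [{"id": "gpt-3.5-turbo", "object": "model"}]
        entries += [{"id": name, "object": "model"} for name in local_models]
        return {"object": "list", "data": entries}

    print(model_list(["my-exl2-model"]))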
kingbri
aef411bed5 OAI: Fix chat completion streaming
The OAI spec requires chat completions to provide a finish reason
once streaming is completed. This is different from a non-streaming
chat completion response.

Also fix some errors that were raised from the endpoint.

References #15

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-01 00:14:24 -05:00
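
For reference, the shape the OAI spec expects for the final streamed
chunk (fields trimmed for brevity):

    final_chunk = {
        "object": "chat.completion.chunk",
        "choices": [{
            "index": 0,
            "delta": {},
            "finish_reason": "stop",  # non-null only on the final chunk
        }],
    }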
Brian Dashore
c4d8c901e1
Merge pull request #13 from ziadloo/main
Adding the usage stat support (prompt_tokens, completion_tokens, and total_tokens)
2023-11-30 01:57:44 -05:00
kingbri
8a5ac5485b Model: Fix rounding
generated_tokens is always a whole number.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-30 01:55:46 -05:00