Commit graph

1051 commits

Author SHA1 Message Date
kingbri
7cbc08fc72 Templates: Add auto-detection from path
This replicates FastChat's model path detection.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-18 23:53:47 -05:00
kingbri
e895eaa4bd OAI: Clarify types in docs
Adding field descriptions shows which parameters are used solely for
OAI compliance and not actually parsed in the model code.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-18 23:53:47 -05:00
kingbri
51ca1ff396 Tree: Switch to Pydantic 2
Pydantic 2 has more modern methods and better stability than Pydantic 1.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-18 23:53:47 -05:00
kingbri
f631dd6ff7 Templates: Switch to Jinja2
Jinja2 is a lightweight template parser that's used in Transformers
for parsing chat completions. It's much more efficient than FastChat
and can be imported as part of requirements.

Also allows for unblocking Pydantic's version.

Users now have to provide their own template if needed. A separate
repo may be usable for common prompt template storage.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-18 23:53:47 -05:00
kingbri
95fd0f075e Model: Fix no flash attention
Was being called wrong from config.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-17 23:31:58 -05:00
kingbri
ad8807a830 Model: Add support for num_experts_by_token
New parameter that's safe to edit in exllamav2 v0.0.11. Only recommended
for people who know what they're doing.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-17 18:03:01 -05:00
kingbri
70fbee3edd OAI: Fix model parameter placement
Accidentally edited the Model Card parameters vs the model load request
ones.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-17 14:36:28 -05:00
kingbri
1d0bdfa77c Model + OAI: Fix parameter parsing
Rope alpha changes don't require removing the 1.0 default
from Rope scale.

Keep defaults when possible to avoid errors.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-17 14:28:18 -05:00
Veden
3e57125025
OAI: adding optional draft model properties for draft_rope alpha and scale (#28)
* OAI: adding optional draft model properties for draft_rope alpha and scale

* Forgot to set the properties to None
2023-12-17 19:23:45 +00:00
kingbri
528d58f841 Requirements: Fix AMD
Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-17 00:45:43 -05:00
kingbri
f196f1177d Requirements: Update exllamav2 to 0.0.11
Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-16 19:33:42 -05:00
kingbri
1a331afe3a OAI: Add cache_mode parameter to model
Mistakenly forgot that the user can choose what cache mode to use
when loading a model.

Also add when fetching model info.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-16 02:47:50 -05:00
kingbri
ed868fd262 OAI: Remove unused parameters
Seed and low_mem aren't used, so comment them out.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-15 14:56:43 -05:00
kingbri
59729e2a4a Tests: Fix linting
Also change how wheel_test works for safe import testing rather than
trying to import the package which can cause system issues.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-13 23:05:50 -05:00
kingbri
036ba2669c Auth: Migrate to Pydantic
It's easier to work with Pydantic dataclasses rather than standard
python classes.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-12 23:58:22 -05:00
kingbri
eb8ccb9783 Tree: Fix linter issues
Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-12 23:58:19 -05:00
kingbri
083df7d585 Tree: Add generation logging support
Generations can be logged in the console along with sampling parameters
if the user enables it in config.

Metrics are always logged at the end of each prompt. In addition,
the model endpoint tells the user if they're being logged or not
for transparency purposes.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-12 23:43:35 -05:00
kingbri
b364de1005 Update README
Add alternatives if the user doesn't agree with the focus of TabbyAPI.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-10 16:05:46 -05:00
kingbri
db87efde4a OAI: Add ability to specify fastchat prompt template
Sometimes fastchat may not be able to detect the prompt template from
the model path. Therefore, add the ability to set it in config.yml or
via the request object itself.

Also send the provided prompt template on model info request.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-10 15:43:58 -05:00
kingbri
9f195af5ad Main: Fix function calls
Some function names were declared twice.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-10 13:28:21 -05:00
kingbri
fd9f3eac87 Model: Add params to current model endpoint
Grabs the current model rope params, max seq len, and the draft model
if applicable.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-10 00:40:56 -05:00
kingbri
0f4290f05c Model: Format
Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-09 22:48:42 -05:00
kingbri
5ae2a91c04 Tree: Use unwrap and coalesce for optional handling
Python doesn't have proper handling of optionals. The only way to
handle them is checking via an if statement if the value is None or
by using the "or" keyword to unwrap optionals.

Previously, I used the "or" method to unwrap, but this caused issues
due to falsy values falling back to the default. This is especially
the case with booleans, where "False" changed to "True".

Instead, add two new functions: unwrap and coalesce. Both provide
a functional way of doing proper "None" coalescing.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-09 21:52:17 -05:00
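A minimal sketch of the pitfall described in 5ae2a91c04, and of what unwrap and coalesce helpers could look like. The signatures and the `stream` field below are assumptions for illustration, not the repository's actual code.

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def unwrap(value: Optional[T], default: T) -> T:
    """Return value unless it is None, in which case return default."""
    return value if value is not None else default


def coalesce(*values):
    """Return the first argument that is not None, or None if all are None."""
    return next((v for v in values if v is not None), None)


# Hypothetical request field: the caller explicitly passed False.
kwargs = {"stream": False}

# "or" treats every falsy value as missing, so the explicit False is lost.
with_or = kwargs.get("stream") or True

# unwrap only falls back when the value is actually None.
with_unwrap = unwrap(kwargs.get("stream"), True)
```

Here `with_or` ends up `True` while `with_unwrap` keeps the caller's `False`. The same applies to other falsy values such as `0`, `0.0`, and empty strings.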
DocShotgun
7380a3b79a Implement lora support (#24)
* Model: Implement basic lora support

* Add ability to load loras from config on launch
* Supports loading multiple loras and lora scaling
* Add function to unload loras

* Colab: Update for basic lora support

* Model: Test vram alloc after lora load, add docs

* Git: Add loras folder to .gitignore

* API: Add basic lora-related endpoints

* Add /loras/ endpoint for querying available loras
* Add /model/lora endpoint for querying currently loaded loras
* Add /model/lora/load endpoint for loading loras
* Add /model/lora/unload endpoint for unloading loras
* Move lora config-checking logic to main.py for better compat with API endpoints

* Revert bad CRLF line ending changes

* API: Add basic lora-related endpoints (fixed)

* Add /loras/ endpoint for querying available loras
* Add /model/lora endpoint for querying currently loaded loras
* Add /model/lora/load endpoint for loading loras
* Add /model/lora/unload endpoint for unloading loras
* Move lora config-checking logic to main.py for better compat with API endpoints

* Model: Unload loras first when unloading model

* API + Models: Cleanup lora endpoints and functions

Condenses down endpoint and model load code. Also makes the routes
behave the same way as model routes to help not confuse the end user.

Signed-off-by: kingbri <bdashore3@proton.me>

* Loras: Optimize load endpoint

Return successes and failures along with consolidating the request
to the rewritten load_loras function.

Signed-off-by: kingbri <bdashore3@proton.me>

---------

Co-authored-by: kingbri <bdashore3@proton.me>
Co-authored-by: DocShotgun <126566557+DocShotgun@users.noreply.github.com>
2023-12-08 23:38:08 -05:00
kingbri
161c9d2c19 Tests: Fix wheel test
Fastchat is named fschat from the package's point of view.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-08 01:15:24 -05:00
kingbri
fa1e99daf6 Model: Remove unused print statement
Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-07 21:13:52 -05:00
kingbri
47176a2a1e Requirements: Fix torch install
Use --extra-index-url to install pytorch. This should be secure enough
since dependency confusion attacks aren't possible with just installing
the torch package.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-06 19:04:35 -05:00
kingbri
f8e9e22c43 API: Fix model load endpoint with draft
Draft wasn't being parsed correctly with the new changes which removed
the draft_enabled bool. There's still some more work to be done with
returning exceptions.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-06 18:05:55 -05:00
kingbri
6a71890d45 Model: Fix sampler bugs
Lots of bugs were unearthed when switching to the new fallback changes.
Fix them and make sure samplers are being set properly.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-06 17:29:58 -05:00
kingbri
9f34af4906 Tests: Create
Add a few tests for the user to check if stuff works.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-06 00:53:42 -05:00
kingbri
21c25fd806 Update README
Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-06 00:24:49 -05:00
kingbri
b83e1b704e Requirements: Split for configurations
Add self-contained requirements for cuda 11.8 and ROCm

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-06 00:00:30 -05:00
kingbri
4c0e686e7d Model: Cleanup and fix fallbacks
Use the standard "dict.get("key") or default" to handle fetching values
from kwargs and get a fallback value without possible errors.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-05 23:28:16 -05:00
Brian Dashore
0ef2fe9b95
Merge pull request #23 from DocShotgun/main
Expose draft_rope_scale
2023-12-05 22:24:53 -05:00
kingbri
d8f7b93c54 Model: Fix fetching of draft args
Mistakenly fetched these from parent kwargs instead of the scoped
draft_config var.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-05 22:24:27 -05:00
DocShotgun
3f2fcbcc45
Add fallback to draft_rope_scale to 1.0
2023-12-05 18:51:36 -08:00
DocShotgun
39f7a2aabd
Expose draft_rope_scale
2023-12-05 12:59:32 -08:00
Brian Dashore
e085b806e8
Merge pull request #22 from DocShotgun/main
Update colab, expose additional args
2023-12-05 01:22:33 -05:00
DocShotgun
67507105d0
Update colab, expose additional args
* Exposed draft model args for speculative decoding
* Exposed int8 cache, dummy models, and no flash attention
* Resolved CUDA 11.8 dependency issue
2023-12-04 22:20:46 -08:00
Brian Dashore
37f8f3ef8b
Merge pull request #20 from veryamazinglystupid/main
make colab better, fix libcudart errors
2023-12-05 01:14:21 -05:00
kingbri
621e11b940 Update documentation
Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-05 00:33:43 -05:00
kingbri
8ba3bfa6b3 API: Fix load exception handling
Models do not fully unload if an exception is caught in load. Therefore,
leave it to the client to unload on cancel.

Also add handlers in the event a SSE stream is cancelled. These packets
can't be sent back to the client since the client has severed the
connection, so print them in terminal.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-05 00:23:15 -05:00
kingbri
7c92968558 API: Fix mistaken debug statement
Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-04 18:07:12 -05:00
kingbri
5e54911cc8 API: Fix semaphore handling and chat completion errors
Chat completions previously always yielded a final packet to say that
a generation finished. However, this caused errors that a yield was
executed after GeneratorExit. The error is legitimate because Python's
garbage collector can't clean up the generator after it exits while the
finally block is still executing a yield.

In addition, SSE endpoints close off the connection, so the finish packet
can only be yielded when the response has completed, so ignore yield on
exception.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-04 15:51:25 -05:00
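The GeneratorExit problem described in 5e54911cc8 can be reproduced with a plain synchronous generator. This is a simplified sketch: the real endpoints are async SSE generators, and the function names and the "[DONE]" packet are assumptions made for illustration.

```python
def bad_stream(chunks):
    """Always yields a finish packet, even while being closed."""
    try:
        for chunk in chunks:
            yield chunk
    finally:
        # Runs during close() too, so this yield happens after GeneratorExit.
        yield "[DONE]"


def stream_chunks(chunks, log):
    """Yields a finish packet only when the stream ran to completion."""
    try:
        for chunk in chunks:
            yield chunk
        # Reached only if every chunk was consumed.
        yield "[DONE]"
    except GeneratorExit:
        # The consumer went away: nothing can be sent, so just record it.
        log.append("cancelled")
        raise


# Closing the bad generator mid-stream raises RuntimeError.
bad = bad_stream(["a", "b"])
next(bad)
try:
    bad.close()
    bad_raised = False
except RuntimeError:
    bad_raised = True

# Closing the fixed generator mid-stream is clean.
cancel_log = []
good = stream_chunks(["a", "b"], cancel_log)
next(good)
good.close()
```

Moving the final yield out of the finally block means it runs only on normal completion, which matches the "ignore yield on exception" behavior the commit describes.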
kingbri
30fc5b3d29 Merge branch 'main' of github.com:theroyallab/tabbyAPI
2023-12-03 22:55:51 -05:00
kingbri
ed6c962aad API: Fix sequential requests
FastAPI is kinda weird with queueing. If an await is used within an
async def, requests aren't executed sequentially. Get the sequential
requests back by using a semaphore to limit concurrent execution from
generator functions.

Also scaffold the framework to move generator functions to their own
file.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-03 22:54:34 -05:00
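A rough sketch of the semaphore approach from ed6c962aad, using only asyncio rather than FastAPI. The generator and helper names are hypothetical; the point is that a semaphore with a single permit makes concurrent generator functions run one after another.

```python
import asyncio


async def generate(name, semaphore, log):
    """Stream two chunks while holding the shared semaphore."""
    async with semaphore:
        log.append(f"{name} start")
        for i in range(2):
            await asyncio.sleep(0)  # give other tasks a chance to run
            yield f"{name}-{i}"
        log.append(f"{name} end")


async def consume(name, semaphore, log):
    """Stand-in for a request handler that drains the generator."""
    return [chunk async for chunk in generate(name, semaphore, log)]


async def main():
    semaphore = asyncio.Semaphore(1)  # one generation at a time
    log = []
    results = await asyncio.gather(
        consume("a", semaphore, log),
        consume("b", semaphore, log),
    )
    return results, log


results, log = asyncio.run(main())
```

Even though both requests are started concurrently and each awaits inside its loop, the second generator cannot begin until the first has released the semaphore, so the log entries for "a" and "b" never interleave.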
veryamazinglystupid
ad1a12a0f2
make colab better, fix libcudart errors
:3
2023-12-03 14:07:52 +05:30
DocShotgun
2a9e4ca051 Add Colab example
*note: this uses wheels for python 3.10 and torch 2.1.0+cu118 which is the current default in colab
2023-12-03 02:21:51 -05:00
kingbri
e740b53478 Requirements: Update Flash Attention 2
Bump to 2.3.6

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-03 01:56:29 -05:00
kingbri
c67c9f6d66 Model + Config: Remove low_mem option
Low_mem doesn't work in exl2 and it was an experimental option to
begin with. Keep the loading code commented out in case it gets fixed
in the future.

A better alternative is to use 8bit cache which works and helps save
VRAM.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-03 01:07:42 -05:00