Commit graph

65 commits

Author SHA1 Message Date
AlpinDale
fa47f51f85
feat: workflows for formatting/linting (#35)
* add github workflows for pylint and yapf

* yapf

* docstrings for auth

* fix auth.py

* fix generators.py

* fix gen_logging.py

* fix main.py

* fix model.py

* fix templating.py

* fix utils.py

* update formatting.sh to include subdirs for pylint

* fix model_test.py

* fix wheel_test.py

* rename utils to utils_oai

* fix OAI/utils_oai.py

* fix completion.py

* fix token.py

* fix lora.py

* fix common.py

* add pylintrc and fix model.py

* finish up pylint

* fix attribute error

* main.py formatting

* add formatting batch script

* Main: Remove unnecessary global

Linter suggestion.

Signed-off-by: kingbri <bdashore3@proton.me>

* switch to ruff

* Formatting + Linting: Add ruff.toml

Signed-off-by: kingbri <bdashore3@proton.me>

* Formatting + Linting: Switch scripts to use ruff

Also remove the file and recent file change functions from both
scripts.

Signed-off-by: kingbri <bdashore3@proton.me>

* Tree: Format and lint

Signed-off-by: kingbri <bdashore3@proton.me>

* Scripts + Workflows: Format

Signed-off-by: kingbri <bdashore3@proton.me>

* Tree: Remove pylint flags

We use ruff now.

Signed-off-by: kingbri <bdashore3@proton.me>

* Tree: Format

Signed-off-by: kingbri <bdashore3@proton.me>

* Formatting: Line length is 88

Use the same value as Black.

Signed-off-by: kingbri <bdashore3@proton.me>

* Tree: Format

Update to new line length rules.

Signed-off-by: kingbri <bdashore3@proton.me>

---------

Authored-by: AlpinDale <52078762+AlpinDale@users.noreply.github.com>
Co-authored-by: kingbri <bdashore3@proton.me>
2023-12-22 16:20:35 +00:00
kingbri
a14abfe21c Templates: Support bos_token and eos_token fields
These are commonly seen in Hugging Face-provided chat templates and
aren't difficult to add.

For feature parity, honor the add_bos_token and ban_eos_token
parameters when constructing the prompt.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-22 10:33:11 -05:00
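A minimal sketch of how these fields might be honored at render time; the helper name, the tokenizer attributes, and the template signature are assumptions for illustration, not tabbyAPI's actual code:

```python
# Hedged sketch: honor add_bos_token / ban_eos_token while exposing
# bos_token and eos_token to the template. All names are illustrative.
def render_prompt(template, messages, tokenizer,
                  add_bos_token=True, ban_eos_token=False):
    return template.render(
        messages=messages,
        bos_token=tokenizer.bos_token if add_bos_token else "",
        eos_token="" if ban_eos_token else tokenizer.eos_token,
    )
```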
kingbri
5d80a049ae Templates: Switch to common function for JSON loading
Remove redundant code when loading templates. However, loading
a template from config.json may be a mistake since tokenizer_config.json
is the main place where chat templates are stored.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-21 23:08:51 -05:00
Brian Dashore
87a9dfc8c4
Merge pull request #34 from veden/dev
Templates: Added automatic detection of chat templates from tokenizer_config.json
2023-12-21 22:34:53 -05:00
kingbri
1a8afcb6ad Generator: Fix semaphore scheduling
Non-streaming tasks were not regulated by the semaphore, causing these
tasks to interfere with streaming generations. Add helper functions
to take in both sync and async functions for callbacks and sequential
blocking with the semaphore.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-21 21:39:45 -05:00
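A rough sketch of the helper pattern this commit describes, assuming a single asyncio.Semaphore guards generation; the function name is hypothetical:

```python
import asyncio
import inspect

# Hypothetical sketch: serialize all generations behind one semaphore,
# accepting either a sync or an async callback.
generate_semaphore = asyncio.Semaphore(1)

async def run_with_semaphore(callback):
    async with generate_semaphore:
        result = callback()
        # Await the result only if the callback was async
        if inspect.isawaitable(result):
            result = await result
        return result
```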
Aaron Veden
f53c98db94
Templates: Added automatic detection of chat templates from tokenizer_config.json 2023-12-20 22:45:55 -08:00
kingbri
5728b9fffb Model: Don't error out if a generation is empty
When stream is false, the generation can be empty, which means
there are no chunks present in the final generation array, causing
an error.

Instead, return a dummy value if the generation is falsy (an empty
array or None).

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-20 00:51:33 -05:00
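The fix amounts to a guard like the following sketch; the dummy chunk's shape is an assumption:

```python
def ensure_generation(generations):
    """Return a dummy chunk when the generation list is falsy
    (None or an empty array), so downstream code has something to join."""
    if not generations:
        return [{"text": "", "prompt_tokens": 0, "generated_tokens": 0}]
    return generations
```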
kingbri
ab10b263fd Model: Add override base seq len
Some models (such as Mistral and Mixtral) set their base sequence
length to 32k on the assumption that sliding window attention is
supported.

Therefore, add this parameter to override the base sequence length
of a model, which helps with auto-calculation of rope alpha.

If auto-calculation of rope alpha isn't being used, the max_seq_len
parameter works fine as is.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-20 00:45:39 -05:00
kingbri
ce2602df9a Model: Fix max seq len handling
Previously, the max sequence length was overridden by the user's
config and never took the model's config.json into account.

Now, set the default to 4096, but include config.prepare when
selecting the max sequence length. The YAML config and API request
now serve as overrides rather than parameters.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-19 23:37:52 -05:00
kingbri
d3246747c0 Templates: Attempt loading from model config
Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-19 22:58:47 -05:00
kingbri
0a144688c6 Templates: Add clarity statements
Lets the user know when a file-not-found error (OSError) occurs and prints
the applied template on model load.

Also fix some remaining references to fastchat.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-19 08:13:04 -05:00
kingbri
c3f7898967 OAI: Add logit bias support
Use exllamav2's token bias, which is the functional equivalent of
OAI's logit bias parameter.

Strings are cast to integers on request, and an error is raised if
an invalid value is passed.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-18 23:53:47 -05:00
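The string-to-integer casting could look roughly like this; the function name and error message are illustrative, though the cast-and-raise behavior follows the commit:

```python
# Hedged sketch: OAI-style logit_bias keys arrive as strings, so cast
# them to token ids and raise on invalid input.
def parse_logit_bias(logit_bias: dict[str, float]) -> dict[int, float]:
    token_bias = {}
    for token, bias in logit_bias.items():
        try:
            token_bias[int(token)] = bias
        except ValueError as exc:
            raise ValueError(f"Invalid logit bias token: {token!r}") from exc
    return token_bias
```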
kingbri
7cbc08fc72 Templates: Add auto-detection from path
This replicates FastChat's model path detection.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-18 23:53:47 -05:00
kingbri
f631dd6ff7 Templates: Switch to Jinja2
Jinja2 is a lightweight template parser that's used in Transformers
for parsing chat completions. It's much more efficient than FastChat
and can be imported as part of requirements.

Also allows unpinning Pydantic's version.

Users now have to provide their own template if needed. A separate
repo may be used for common prompt template storage.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-18 23:53:47 -05:00
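For illustration, rendering a chat prompt with Jinja2 is roughly this simple; the template string below is an example, not a shipped default:

```python
from jinja2 import Template

# Example only: a trivial chat template in the Hugging Face style.
template = Template(
    "{% for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }}\n"
    "{% endfor %}assistant:"
)

prompt = template.render(messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```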
kingbri
95fd0f075e Model: Fix no flash attention
Was being called wrong from config.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-17 23:31:58 -05:00
kingbri
ad8807a830 Model: Add support for num_experts_by_token
New parameter that's safe to edit in exllamav2 v0.0.11. Only recommended
for people who know what they're doing.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-17 18:03:01 -05:00
kingbri
1d0bdfa77c Model + OAI: Fix parameter parsing
Rope alpha changes don't require removing the 1.0 default
from Rope scale.

Keep defaults when possible to avoid errors.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-17 14:28:18 -05:00
kingbri
eb8ccb9783 Tree: Fix linter issues
Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-12 23:58:19 -05:00
kingbri
083df7d585 Tree: Add generation logging support
Generations can be logged in the console along with sampling parameters
if the user enables it in config.

Metrics are always logged at the end of each prompt. In addition,
the model endpoint tells the user whether they're being logged,
for transparency purposes.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-12 23:43:35 -05:00
kingbri
db87efde4a OAI: Add ability to specify fastchat prompt template
Sometimes FastChat may not be able to detect the prompt template from
the model path. Therefore, add the ability to set it in config.yml or
via the request object itself.

Also send the provided prompt template on model info request.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-10 15:43:58 -05:00
kingbri
fd9f3eac87 Model: Add params to current model endpoint
Grabs the current model rope params, max seq len, and the draft model
if applicable.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-10 00:40:56 -05:00
kingbri
0f4290f05c Model: Format
Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-09 22:48:42 -05:00
kingbri
5ae2a91c04 Tree: Use unwrap and coalesce for optional handling
Python doesn't have proper handling of optionals. The only way to
handle them is to check whether the value is None in an if statement
or to use the "or" keyword to unwrap them.

Previously, I used the "or" method to unwrap, but this caused issues
due to falsy values falling back to the default. This is especially
the case with booleans, where "False" changed to "True".

Instead, add two new functions: unwrap and coalesce. Both properly
implement "None" coalescing in a functional way.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-09 21:52:17 -05:00
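A plausible shape for the two helpers; the real signatures may differ:

```python
def unwrap(value, default=None):
    """Return value unless it is None. Unlike `value or default`,
    falsy values such as False, 0, and "" are preserved."""
    return value if value is not None else default


def coalesce(*values):
    """Return the first value that is not None, like SQL's COALESCE."""
    return next((v for v in values if v is not None), None)
```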
DocShotgun
7380a3b79a Implement lora support (#24)
* Model: Implement basic lora support

* Add ability to load loras from config on launch
* Supports loading multiple loras and lora scaling
* Add function to unload loras

* Colab: Update for basic lora support

* Model: Test vram alloc after lora load, add docs

* Git: Add loras folder to .gitignore

* API: Add basic lora-related endpoints

* Add /loras/ endpoint for querying available loras
* Add /model/lora endpoint for querying currently loaded loras
* Add /model/lora/load endpoint for loading loras
* Add /model/lora/unload endpoint for unloading loras
* Move lora config-checking logic to main.py for better compat with API endpoints

* Revert bad CRLF line ending changes

* API: Add basic lora-related endpoints (fixed)

* Add /loras/ endpoint for querying available loras
* Add /model/lora endpoint for querying currently loaded loras
* Add /model/lora/load endpoint for loading loras
* Add /model/lora/unload endpoint for unloading loras
* Move lora config-checking logic to main.py for better compat with API endpoints

* Model: Unload loras first when unloading model

* API + Models: Cleanup lora endpoints and functions

Condenses the endpoint and model load code. Also makes the routes
behave the same way as the model routes to avoid confusing the end user.

Signed-off-by: kingbri <bdashore3@proton.me>

* Loras: Optimize load endpoint

Return successes and failures, and consolidate the request handling
into the rewritten load_loras function.

Signed-off-by: kingbri <bdashore3@proton.me>

---------

Co-authored-by: kingbri <bdashore3@proton.me>
Co-authored-by: DocShotgun <126566557+DocShotgun@users.noreply.github.com>
2023-12-08 23:38:08 -05:00
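A condensed sketch of what the lora routes from this PR could look like in FastAPI; the request model, response shapes, and handler bodies are assumptions, only the paths follow the PR description:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class LoraLoadRequest(BaseModel):  # hypothetical request model
    loras: list[dict]  # e.g. [{"name": "my-lora", "scaling": 1.0}]

@app.get("/loras/")
async def list_loras():
    return {"data": []}  # would enumerate the loras folder

@app.get("/model/lora")
async def current_loras():
    return {"data": []}  # would report currently loaded loras

@app.post("/model/lora/load")
async def load_loras(request: LoraLoadRequest):
    # Would apply the requested loras, returning successes and failures
    return {"success": [lora["name"] for lora in request.loras], "failure": []}

@app.post("/model/lora/unload")
async def unload_loras():
    return {"success": True}  # would detach all loras from the model
```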
kingbri
fa1e99daf6 Model: Remove unused print statement
Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-07 21:13:52 -05:00
kingbri
6a71890d45 Model: Fix sampler bugs
Lots of bugs were unearthed when switching to the new fallback behavior.
Fix them and make sure samplers are being set properly.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-06 17:29:58 -05:00
kingbri
4c0e686e7d Model: Cleanup and fix fallbacks
Use the standard "dict.get("key") or default" to handle fetching values
from kwargs and get a fallback value without possible errors.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-05 23:28:16 -05:00
kingbri
d8f7b93c54 Model: Fix fetching of draft args
Mistakenly fetched these from parent kwargs instead of the scoped
draft_config var.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-05 22:24:27 -05:00
DocShotgun
3f2fcbcc45
Add fallback to draft_rope_scale to 1.0 2023-12-05 18:51:36 -08:00
DocShotgun
39f7a2aabd
Expose draft_rope_scale 2023-12-05 12:59:32 -08:00
kingbri
c67c9f6d66 Model + Config: Remove low_mem option
Low_mem doesn't work in exl2 and it was an experimental option to
begin with. Keep the loading code commented out in case it gets fixed
in the future.

A better alternative is to use 8bit cache which works and helps save
VRAM.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-03 01:07:42 -05:00
kingbri
27fc0c0069 Model: Cleanup and compartmentalize auto rope functions
Also handle the edge case where ratio <= 1, since NTK scaling is
only used for values > 1.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-03 01:05:09 -05:00
DocShotgun
bd2c5d0d09
Force auto-alpha to 1.0 if config ctx == base ctx 2023-12-02 21:19:59 -08:00
DocShotgun
1c398b0be7
Add automatic NTK-aware alpha scaling to model
* Enables automatic calculation of NTK-aware alpha scaling for models when the rope_alpha arg is not passed in the config, using the same formula as draft models
2023-12-02 21:02:29 -08:00
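The ratio-driven calculation is presumably close to the quadratic NTK-aware fit widely used in the exllama community; the constants below are that community approximation, assumed here rather than quoted from the commit:

```python
def calculate_rope_alpha(base_seq_len: int, target_seq_len: int) -> float:
    """Hedged sketch of automatic NTK-aware alpha scaling; the exact
    constants used by tabbyAPI may differ."""
    ratio = target_seq_len / base_seq_len
    # NTK scaling only applies when extending past the base context
    if ratio <= 1:
        return 1.0
    return -0.13436 + 0.80541 * ratio + 0.28833 * ratio**2
```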
kingbri
ae69b18583 API: Use FastAPI streaming instead of sse_starlette
sse_starlette kept firing a ping response if it was taking too long
to set an event. Rather than using a hacky workaround, switch to
FastAPI's inbuilt streaming response and construct SSE payloads with
a utility function.

This helps the API become more robust and removes an extra requirement.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-01 01:54:35 -05:00
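The switch boils down to an async generator plus StreamingResponse; the route, payload shape, and utility name here are illustrative:

```python
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

def get_sse_packet(data: dict) -> str:
    """Utility sketch: wrap a JSON payload in the SSE wire format."""
    return f"data: {json.dumps(data)}\n\n"

@app.get("/stream")  # illustrative route, not tabbyAPI's actual path
async def stream():
    async def generator():
        for token in ("Hello", " ", "world"):
            yield get_sse_packet({"text": token})
        yield "data: [DONE]\n\n"

    return StreamingResponse(generator(), media_type="text/event-stream")
```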
kingbri
8a5ac5485b Model: Fix rounding
generated_tokens is always a whole number.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-30 01:55:46 -05:00
kingbri
e703c716ee Merge branch 'main' of https://github.com/ziadloo/tabbyAPI into ziadloo-main 2023-11-30 01:01:48 -05:00
kingbri
3957316b79 Revert "API: Rename repetition_decay -> repetition_slope"
This reverts commit cad144126f.

Change this parameter back to repetition_decay. This is different from
rep_pen_slope used in other backends such as Kobold and NAI.

Still keep the fallback condition though.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-29 22:03:45 -05:00
kingbri
94696543bc Model: Warn user if context > max_seq_len
Unlike other backends, tabby attempts to generate even if the context
is greater than the max sequence length by truncating the given
context.

Rather than artificially erroring out, warn the user that the console
metrics output will be incorrect and that they should keep
context <= max_seq_len.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-29 01:35:32 -05:00
kingbri
cad144126f API: Rename repetition_decay -> repetition_slope
Also fix the fallback to use 0 for sanity checking and validation.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-29 01:13:05 -05:00
Mehran Ziadloo
b0c42d0f05 Leveraging local variables 2023-11-27 20:56:56 -08:00
Mehran Ziadloo
ead503c75b Adding token usage support 2023-11-27 20:05:05 -08:00
kingbri
d47c39da54 API: Don't include draft directory in response
The draft directory should instead be returned by a draft model request (TBD).

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-23 00:07:56 -05:00
kingbri
71b9a53336 API: Add temperature_last support
Documented in previous commits. Also make sure that version checking
inspects the value in kwargs rather than whether the key is present,
since requests pass default values.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-21 21:20:59 -05:00
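The distinction being drawn is a value check versus a presence check, roughly:

```python
def apply_samplers(**kwargs):
    # Presence check is wrong here: clients send temperature_last with a
    # default value on every request, so the key is always present.
    # Checking the value only fires when the sampler is actually enabled.
    if kwargs.get("temperature_last"):
        ...  # apply the sampler (hypothetical)
```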
turboderp
3337fe6acc Warning if unsupported samplers are used 2023-11-21 18:35:22 +01:00
turboderp
a54de11cf3 Add new samplers 2023-11-21 18:16:53 +01:00
Veden
f960fac8ff
Fix incorrect ratio calculation for draft model 2023-11-19 13:12:53 -08:00
kingbri
4cddd0400c Model: Fix draft model loading
Use draft_config to find the path instead of kwargs.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-19 02:04:02 -05:00
kingbri
31bc418795 Model: Add context in response output
When printing to the console, give information about the context
(ingestion token count).

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-19 00:49:32 -05:00
kingbri
6b9af58cc1 Tree: Fix extraneous bugs and update T/s print
Model: Add extra information to the printout and fix the divide-by-zero error.
Auth: Fix validation of API and admin keys to look for the entire key.

References #7 and #6

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-18 22:34:40 -05:00
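Guarding the tokens-per-second print against a zero elapsed time is the natural fix; a sketch:

```python
def tokens_per_second(generated_tokens: int, elapsed: float) -> float:
    """Hedged sketch: avoid dividing by zero when generation finishes
    within the timer's resolution."""
    return generated_tokens / elapsed if elapsed > 0 else 0.0
```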