Commit graph

641 commits

Author SHA1 Message Date
kingbri
5c293499bd OAI: Reorder functions
Reordering routes changes their order of appearance in the generated
documentation.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-08 15:27:08 -04:00
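For context, a minimal FastAPI sketch (with hypothetical routes) of why reordering matters: the autogenerated OpenAPI schema lists paths in the order their handlers are registered.

```python
from fastapi import FastAPI

app = FastAPI()

# Hypothetical routes: the order these decorators run is the order the
# paths appear in the generated OpenAPI schema (and thus the docs).
@app.get("/health")
async def health():
    return {"status": "ok"}

@app.get("/v1/models")
async def list_models():
    return {"data": []}

# The schema's "paths" keys preserve registration order.
print(list(app.openapi()["paths"].keys()))  # ['/health', '/v1/models']
```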
kingbri
521d21b9f2 OAI: Add return types for docs
Adding return types allows responses to be included in the
autogenerated docs.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-08 15:23:41 -04:00
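A small sketch of the idea, using a hypothetical ModelCard response model and path: annotating the handler's return type lets FastAPI document the response schema.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ModelCard(BaseModel):
    id: str
    owned_by: str = "tabbyAPI"

# With a return type annotation, FastAPI treats ModelCard as the response
# model, so the 200 response schema shows up in the autogenerated OpenAPI
# docs instead of an empty response body.
@app.get("/v1/model")
async def current_model() -> ModelCard:
    return ModelCard(id="example-model")
```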
kingbri
62e495fc13 Model: Grammar: Fix lru_cache clear function
It's cache_clear, not clear_cache.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-08 15:10:15 -04:00
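A minimal sketch of the underlying functools API; the cached function here is a hypothetical stand-in for the grammar tokenizer-data builder.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def get_tokenizer_data(model_id: str):
    # Stand-in for the expensive tokenizer data construction
    return {"model": model_id}

get_tokenizer_data("example")

# functools exposes cache_clear(), not clear_cache(); calling the latter
# raises AttributeError, which is the bug this commit fixes.
get_tokenizer_data.cache_clear()
```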
Brian Dashore
17438288c7
Merge pull request #146 from theroyallab/tokenizer_data_fix
Tokenizer data fix
2024-07-08 15:08:29 -04:00
kingbri
c7ce97f119 Tree: Ruff lint
Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-08 15:06:28 -04:00
kingbri
8a81fe2eb4 Actions: Add Github Pages deploy
Deploys the OpenAPI documentation to GitHub Pages.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-08 15:04:27 -04:00
kingbri
6613e38436 Main: Make openapi export store locally
Storing the value locally is faster than re-reading the env var every
time the flag is checked.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-08 14:54:06 -04:00
kingbri
ae66e8f9ba Ruff: Lint
Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-08 13:44:12 -04:00
kingbri
b907421285 Main: Fix launch if EXPORT_OPENAPI is unset
A default needs to be provided with getenv. Fix that with an empty
string.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-08 13:41:44 -04:00
kingbri
a59e8ef9e7 Main: Make EXPORT_OPENAPI only work if true or 1
Use truthy values instead of checking if the variable is set.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-08 12:51:24 -04:00
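A minimal sketch of the combined behavior of the EXPORT_OPENAPI commits above (the helper name is hypothetical): provide a default so an unset variable doesn't break the check, and only accept explicit truthy values.

```python
import os

def openapi_export_requested() -> bool:
    # Default to "" so an unset EXPORT_OPENAPI doesn't return None and
    # break the comparison below.
    value = os.getenv("EXPORT_OPENAPI", "").lower()

    # Only explicit truthy values enable export, rather than treating the
    # mere presence of the variable as "on".
    return value in ("true", "1")
```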
kingbri
e58e197f0b Ruff: Remove deprecated rule E999
The syntax error rule is removed since syntax errors are always shown
when linting anyway.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-08 12:36:15 -04:00
kingbri
933268f7e2 API: Integrate OpenAPI export script
Move the OpenAPI export behind an env var checked in the main function.
This allows for easy export by running main.

In addition, the env variable provides global and explicit state to
disable conditional wheel imports (ex. Exl2 and torch), which caused
errors at first.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-08 12:34:32 -04:00
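A rough sketch of the flow this commit describes, not the actual TabbyAPI entrypoint; the function names and the early-return path are assumptions.

```python
import json
import os

from fastapi import FastAPI

app = FastAPI()

def export_openapi_schema(api: FastAPI, path: str = "openapi.json"):
    # Dump the autogenerated schema to disk for static hosting.
    with open(path, "w") as f:
        json.dump(api.openapi(), f, indent=2)

def main():
    if os.getenv("EXPORT_OPENAPI", "").lower() in ("true", "1"):
        # Export the schema and skip normal startup; heavyweight
        # conditional imports (e.g. exllamav2, torch) can key off the same
        # variable and stay disabled during export.
        export_openapi_schema(app)
        return
    # ...normal server startup would go here...

if __name__ == "__main__":
    main()
```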
turboderp
e97ad9cb27 RUFF 2024-07-08 03:51:14 +02:00
turboderp
8bbce3455c RUFF 2024-07-08 03:49:26 +02:00
kingbri
5e82b7eb69 API: Add standalone method to fetch OpenAPI docs
Generates and stores an export of the openapi.json file for use in
static websites.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-07 21:35:52 -04:00
turboderp
4cf79c5ae1 Clear tokenizer_data cache when unloading model 2024-07-08 03:31:05 +02:00
turboderp
b7e7df1220 Move tokenizer_data cache to global scope 2024-07-08 02:54:49 +02:00
turboderp
4d0bb1ffc3 Cache creation of tokenizer_data in LMFE 2024-07-08 00:51:59 +02:00
turboderp
bb8b02a60a
Wrap arch_compat_overrides in try block
Quick fix until exllamav2 0.1.7 releases, since the function isn't defined for 0.1.6.
2024-07-07 07:54:05 +02:00
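A minimal sketch of the quick fix; the exception type caught here is an assumption.

```python
def apply_arch_compat_overrides(config):
    # arch_compat_overrides() only exists in newer exllamav2 releases
    # (0.1.7+), so guard the call until that version can be required.
    try:
        config.arch_compat_overrides()
    except AttributeError:
        # Assumption: an older ExLlamaV2Config simply lacks the method.
        pass
```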
kingbri
773639ea89 Model: Fix flash-attn checks
If flash attention is already turned off by exllamaV2 itself, don't
try creating a paged generator. Also condense all the redundant
logic into one if statement.

Also check arch_compat_overrides to see if flash attention should
be disabled for a model arch (ex. Gemma 2).

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-06 20:58:24 -04:00
kingbri
27d2d5f3d2 Config + Model: Allow for default fallbacks from config for model loads
Previously, the parameters under the "model" block in config.yml only
handled the loading of a model on startup. This meant that any subsequent
API request required each parameter to be filled out or use a sane default
(usually taken from the model's config.json).

However, there are cases where admins may want an argument from the
config to apply if the parameter isn't provided in the request body.
To help alleviate this, add a mechanism that works like sampler overrides
where users can specify a flag that acts as a fallback.

Therefore, this change both preserves the source of truth of what
parameters the admin is loading and adds some convenience for users
that want customizable defaults for their requests.

This behavior may change in the future, but I think it solves the
issue for now.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-06 17:50:58 -04:00
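A hypothetical helper sketching the fallback mechanism described above; the name and signature are illustrative, not TabbyAPI's actual API.

```python
def resolve_load_param(request_value, config_value, use_as_default: bool, fallback=None):
    """Pick a model load parameter, optionally falling back to config.yml.

    The request body wins when the field is provided; otherwise the config
    value is used only if the admin flagged it as a default, similar to
    sampler overrides.
    """
    if request_value is not None:
        return request_value
    if use_as_default and config_value is not None:
        return config_value
    return fallback

# e.g. max_seq_len not in the request, admin marked the config value as a fallback
print(resolve_load_param(None, 8192, use_as_default=True))  # 8192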
kingbri
d03752e31b Issues: Fix template
Correct Discord invite link.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-06-23 21:52:01 -04:00
kingbri
45fae89af6 Update README
Signed-off-by: kingbri <bdashore3@proton.me>
2024-06-23 21:50:17 -04:00
kingbri
c5ea2abe24 Dependencies: Update ExllamaV2
v0.1.6

Signed-off-by: kingbri <bdashore3@proton.me>
2024-06-23 21:45:04 -04:00
kingbri
d85b526644 Dependencies: Pin numpy
v2.x breaks many upstream dependencies (torch). Pin until repos are
fixed.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-06-23 21:40:09 -04:00
DocShotgun
107436f601
Dependencies: Fix AMD triton (#139) 2024-06-18 15:19:27 +02:00
Brian Dashore
06ee610a97
Update README
Signed-off-by: kingbri <bdashore3@proton.me>
2024-06-17 03:56:47 +00:00
kingbri
c575105e41 ExllamaV2: Cleanup log placements
Move the large import errors into the check functions themselves.
This makes it easier to tell where errors are coming from.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-06-16 00:16:03 -04:00
Glenn Maynard
8da7644571
Fix exception unloading models. (#138)
self.generator is None if a model load fails or is cancelled.
2024-06-15 23:44:29 +02:00
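A minimal sketch of the guard this fix adds; class and method names are stand-ins for the real container.

```python
class ModelContainer:
    """Illustrative stand-in for the class that owns the generator."""

    def __init__(self):
        self.generator = None  # stays None if a load fails or is cancelled

    def unload(self):
        # The previous code used self.generator unconditionally and raised
        # when a load had failed or been cancelled; guard it first.
        if self.generator is not None:
            print("Cleaning up generator")  # stand-in for real cleanup
        self.generator = None
```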
DocShotgun
85387d97ad
Fix disabling flash attention in exl2 config (#136)
* Model: Fix disabling flash attention in exl2 config

* Model: Pass no_flash_attn to draft config

* Model: Force torch flash SDP off in compatibility mode
2024-06-12 20:00:46 +02:00
DocShotgun
156b74f3f0
Revision to paged attention checks (#133)
* Model: Clean up paged attention checks

* Model: Move cache_size checks after paged attn checks
Cache size is only relevant in paged mode

* Model: Fix no_flash_attention

* Model: Remove no_flash_attention
Ability to use flash attention is auto-detected, so this flag is unneeded. Uninstall flash attention to disable it on supported hardware.
2024-06-09 17:28:11 +02:00
DocShotgun
55d979b7a5
Update dependencies, support Python 3.12, update for exl2 0.1.5 (#134)
* Dependencies: Add wheels for Python 3.12

* Model: Switch fp8 cache to Q8 cache

* Model: Add ability to set draft model cache mode

* Dependencies: Bump exllamav2 to 0.1.5

* Model: Support Q6 cache

* Config: Add Q6 cache and draft_cache_mode to config sample
2024-06-09 17:27:39 +02:00
DocShotgun
dcd9428325
Model: Warn if cache size is too small for CFG (#132) 2024-06-05 19:40:14 +02:00
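A rough sketch of such a warning; the 2x threshold and the names are assumptions. The reasoning: CFG runs an extra negative-prompt sequence, so the cache needs room for roughly two full sequences.

```python
import logging

def check_cfg_cache(cache_size: int, max_seq_len: int, use_cfg: bool):
    # CFG evaluates a second (negative-prompt) sequence alongside the main
    # one, so the cache roughly needs room for two full sequences.
    if use_cfg and cache_size < 2 * max_seq_len:
        logging.warning(
            "Cache size %d is less than 2 * max_seq_len (%d); "
            "CFG may not have enough cache to run.",
            cache_size,
            2 * max_seq_len,
        )
```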
DocShotgun
e391d84e40
More extensive checks for paged mode support (#121)
* Model: More extensive checks for paged attention
Previously, TabbyAPI only checked whether the user's hardware supports flash attention before deciding whether to enable paged mode.
This adds checks for whether no_flash_attention is set, whether flash-attn is installed, and whether the installed version supports paged attention.

* Tree: Format

* Tree: Lint

* Model: Check GPU architecture first
Check GPU arch prior to checking whether flash attention 2 is installed
2024-06-05 09:33:21 +02:00
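A hypothetical condensed version of the checks this PR describes; the minimum flash-attn version for paged support is an assumption.

```python
from importlib.metadata import PackageNotFoundError, version

from packaging.version import parse


def supports_paged_mode(no_flash_attention: bool, compute_capability: tuple) -> bool:
    """Hypothetical condensed check mirroring the PR description."""
    # User explicitly disabled flash attention.
    if no_flash_attention:
        return False

    # Check GPU architecture first: flash attention 2 needs an
    # Ampere-or-newer GPU (compute capability 8.0+).
    if compute_capability < (8, 0):
        return False

    # flash-attn must actually be installed...
    try:
        installed = parse(version("flash-attn"))
    except PackageNotFoundError:
        return False

    # ...and be a build recent enough to support paged attention
    # (the exact minimum version here is an assumption).
    return installed >= parse("2.5.7")
```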
turboderp
dbdcb38ad7
Allow either "[" or "{" prefix to support JSON grammar with top level arrays (#129) 2024-06-04 02:32:39 +02:00
turboderp
e889fa3efe
Bump exllamav2 to v0.1.4 (#128) 2024-06-04 02:32:08 +02:00
Orion
6cc3bd9752
feat: list support in message.content (#122) 2024-06-03 19:57:15 +02:00
turboderp
1951f7521c
Forward exceptions from _stream_collector to stream_generate_(chat)_completion (#126) 2024-06-03 19:42:45 +02:00
turboderp
0eb8fa5d1e
[fix] Bring draft progress and model progress in sync with model loader (#125)
* Bring draft progress and model progress in sync with model loader

* Fix formatting
2024-06-03 19:41:02 +02:00
turboderp
a011c17488 Revert "Forward exceptions from _stream_collector to stream_generate_chat_completion"
This reverts commit 1bb8d1a312.
2024-06-02 15:37:37 +02:00
turboderp
1bb8d1a312 Forward exceptions from _stream_collector to stream_generate_chat_completion 2024-06-02 15:13:30 +02:00
kingbri
e95e67a000 OAI: Add validation to "n"
n must be greater than 1 to generate.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-28 00:52:30 -04:00
kingbri
e2a8b6e8ae OAI: Add "n" support for streaming generations
Use a queue-based system to get choices independently and send them
in the overall streaming payload. This method allows for unordered
streaming of generations.

The system is a bit redundant, so the code may be optimized further in
the future.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-28 00:52:30 -04:00
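A minimal asyncio sketch of the queue-based, unordered streaming described above; generation is faked and the names are hypothetical.

```python
import asyncio


async def fake_generation(index: int, queue: asyncio.Queue):
    # Stand-in for one streamed generation; each chunk is tagged with its
    # choice index so the consumer can emit chunks as they arrive.
    for token in ("Hello", " world"):
        await queue.put({"index": index, "text": token})
    await queue.put({"index": index, "done": True})


async def stream_n_choices(n: int):
    queue: asyncio.Queue = asyncio.Queue()
    tasks = [asyncio.create_task(fake_generation(i, queue)) for i in range(n)]

    finished = 0
    while finished < n:
        chunk = await queue.get()
        if chunk.get("done"):
            finished += 1
            continue
        # In the real server this would be formatted as an OAI streaming
        # payload; chunks from different choices may interleave (unordered).
        print(chunk)

    await asyncio.gather(*tasks)


asyncio.run(stream_n_choices(2))
```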
kingbri
c8371e0f50 OAI: Copy gen params for "n"
For multiple generations in the same request, nested arrays kept their
original reference, resulting in duplications. This will occur with
any collection type.

For optimization purposes, a deepcopy isn't run for the first iteration,
which keeps the original references.

This is not the most elegant solution, but it works for the described
cases.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-28 00:52:30 -04:00
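A small sketch of the copy strategy described above; the helper name is hypothetical.

```python
from copy import deepcopy

def expand_gen_params(params: dict, n: int) -> list[dict]:
    """Hypothetical helper showing the copy strategy for n > 1."""
    expanded = []
    for i in range(n):
        # The first iteration can keep the original object; later ones get
        # a deepcopy so nested collections (lists, dicts) don't share
        # references and duplicate appended data across choices.
        expanded.append(params if i == 0 else deepcopy(params))
    return expanded

gens = expand_gen_params({"stop": ["</s>"]}, 3)
gens[1]["stop"].append("\n\n")
print(gens[0]["stop"], gens[1]["stop"])  # ['</s>'] ['</s>', '\n\n']
```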
kingbri
b944f8d756 OAI: Add "n" for non-streaming generations
This adds the ability to return multiple choices for a generation. This
is only available for non-streaming gens for now, since it requires some
more work to port over to streaming.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-28 00:52:30 -04:00
kingbri
8d31a5aed1 Dependencies: Update Flash Attention 2
v2.5.9.post1

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-28 00:45:35 -04:00
Brian Dashore
516b52b341
Merge pull request #112 from DocShotgun/main
Separate new prompt tokens from those reused from cache in metric logging
2024-05-27 18:04:43 -04:00
kingbri
19961f4126 Dependencies: Update ExllamaV2
v0.1.1

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-27 13:38:07 -04:00
kingbri
04cbed16e8 Update README
Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-27 13:37:57 -04:00
kingbri
4087586449 Start: Create config.yml if it doesn't exist
While TabbyAPI doesn't need a config.yml to run, new users can get
confused by the task of copying config_sample.yml to config.yml.
Therefore, automatically do this in the start script to immediately
expose options to the user.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-26 21:37:52 -04:00
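A minimal sketch of what the start script now does; paths are relative to the repo root and the function name is hypothetical.

```python
import pathlib
import shutil

def ensure_config_exists():
    # New users often skip copying the sample config; do it for them on
    # start so the options are immediately visible and editable.
    config = pathlib.Path("config.yml")
    sample = pathlib.Path("config_sample.yml")
    if not config.exists() and sample.exists():
        shutil.copyfile(sample, config)
        print("Created config.yml from config_sample.yml")
```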