Commit graph

1069 commits

Author SHA1 Message Date
kingbri
30f02e5453 Main: Remove uvloop/winloop from experimental status
Uvloop/Winloop do provide advantages over the standard asyncio
Proactor loop, so remove the experimental status.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-10 21:30:48 -05:00
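The loop swap above can be sketched as follows. This is a minimal, hedged sketch: uvloop/winloop are optional installs, the function name is hypothetical, and TabbyAPI's actual startup wiring may differ.

```python
import asyncio
import sys

def install_fast_loop():
    """Prefer uvloop (POSIX) or winloop (Windows) over the default loop."""
    try:
        if sys.platform == "win32":
            import winloop as fast
        else:
            import uvloop as fast
        asyncio.set_event_loop_policy(fast.EventLoopPolicy())
    except ImportError:
        pass  # neither package installed: keep the standard asyncio loop

install_fast_loop()
```

If the faster loop is unavailable, the fallback leaves asyncio's default policy in place, so the swap is safe to attempt unconditionally at startup.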
kingbri
0dcbb7a722 Dependencies: Update torch, exllamav2, and flash-attn
Torch - 2.6.0
ExllamaV2 - 0.2.8
Flash-attn - 2.7.4.post1

CUDA wheels are now 12.4 instead of 12.1, so feature names need to be
migrated over.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-09 01:27:48 -05:00
kingbri
beb6d8faa5 Model: Adjust draft_gpu_split and add to config
The previous code overrode the existing gpu split and device idx
values. This now sets an independent draft_gpu_split value and
adjusts the gpu_devices check only if the draft_gpu_split array
is larger than the gpu_split array.

Draft gpu split is not Tensor Parallel, and defaults to gpu_split_auto
if a split is not provided.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-08 16:09:46 -05:00
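The device-list adjustment described above can be sketched like this (function and variable names are hypothetical, not TabbyAPI's actual code):

```python
def resolve_gpu_devices(gpu_split, draft_gpu_split):
    """Return the device indices needed by the main and draft splits."""
    devices = list(range(len(gpu_split)))
    # draft_gpu_split is independent of gpu_split; only widen the device
    # list when the draft model spans more GPUs than the main model
    if draft_gpu_split and len(draft_gpu_split) > len(gpu_split):
        devices = list(range(len(draft_gpu_split)))
    return devices

assert resolve_gpu_devices([20, 20], []) == [0, 1]
assert resolve_gpu_devices([20, 20], [10, 10, 10]) == [0, 1, 2]
```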
kingbri
bd8256d168 Merge branch 'main' into draft-split 2025-02-08 15:10:44 -05:00
kingbri
dcbf2de9e5 Logger: Add timestamps
Was against this for a while due to the length of timestamps clogging
the console, but it makes sense to know when something goes wrong.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-07 18:40:28 -05:00
kingbri
54fda0dc09 Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-07 18:03:33 -05:00
kingbri
96e8375ec8 Multimodal: Fix memory leak with MMEmbeddings
In a plain Python class, a mutable class attribute is one object shared
by reference across all instances, so every embeddings instance attached
to the same object and kept allocating more memory.

Switch to a Pydantic class and factory methods when instantiating.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-02 12:21:19 -05:00
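The leak pattern above can be reproduced in a few lines. This is an illustrative sketch with hypothetical class names; the fix here uses a stdlib dataclass `default_factory`, which behaves like the Pydantic factory approach the commit adopted.

```python
from dataclasses import dataclass, field

# Buggy pattern: a mutable class attribute is one object shared by reference
class LeakyEmbeddings:
    content = []  # every instance appends into this same list

a, b = LeakyEmbeddings(), LeakyEmbeddings()
a.content.append("embedding-1")
assert b.content == ["embedding-1"]  # b sees a's data: memory accumulates

# Fixed pattern: a factory builds a fresh list per instance
@dataclass
class FixedEmbeddings:
    content: list = field(default_factory=list)

c, d = FixedEmbeddings(), FixedEmbeddings()
c.content.append("embedding-1")
assert d.content == []  # no shared state between instances
```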
kingbri
bd16681825 Start: Mark cuda 11.8 as unsupported
Temporary until existing cuda 11.8 scripts can be migrated to cuda 12.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2025-01-12 21:50:41 -05:00
Brian
566e5b5937
Merge pull request #271 from lifo9/bump-formatron
Bump formatron to `0.4.11`
2025-01-07 23:19:35 -05:00
Jakub Filo
f8d9cfb5fd Bump formatron to 0.4.11 2025-01-08 00:48:25 +01:00
kingbri
cfb439c0e6 Dependencies: Update exllamav2 and pytorch for ROCm
Exllama v0.2.7, pytorch v2.5.1 across all cards.

AMD now requires ROCm 6.2

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2025-01-01 16:22:10 -05:00
kingbri
6da65a8fd3 Embeddings: Fix base64 return
A base64 embedding can be a string post-encoding.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2025-01-01 16:15:12 -05:00
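The fix amounts to letting the return type be a string. A stdlib sketch of the encoding (the helper name and packing format are assumptions, not TabbyAPI's actual function):

```python
import base64
import struct

def encode_embedding(values, encoding_format="float"):
    """Return raw floats, or a base64 string of packed floats when requested."""
    if encoding_format == "base64":
        raw = struct.pack(f"<{len(values)}f", *values)
        return base64.b64encode(raw).decode()  # a str after encoding
    return list(values)

assert isinstance(encode_embedding([0.5, -1.0], "base64"), str)
assert encode_embedding([0.5, -1.0]) == [0.5, -1.0]
```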
kingbri
245bd5c008 Templates: Alter chatml_with_headers to fit huggingface spec
The previous template was compatible with Jinja2 in Python, but it
was not cross-platform compatible according to HF's standards.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-30 14:00:44 -05:00
Brian
709493837b
Merge pull request #264 from DocShotgun/robust-length-checking
Robust request length checking in generator
2024-12-26 23:37:53 -05:00
kingbri
b994aae995 Model: Cleanup generation length and page checks
Reduce the number of if statements and combine duplicated parts of the code.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-26 23:13:08 -05:00
kingbri
ba2579ff74 Merge branch 'main' into robust-length-checks 2024-12-26 18:00:26 -05:00
kingbri
7878d351a7 Endpoints: Add props endpoint and add more values to model params
The props endpoint is a standard used by llama.cpp APIs that returns
various properties of the loaded model to the client. It's still
recommended to use /v1/model to get all the parameters a TabbyAPI model has.

Also include the contents of a prompt template when fetching the current
model.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-26 17:32:19 -05:00
kingbri
fa8035ef72 Dependencies: Update sse-starlette and formatron
Also pin newer versions of dependencies and fix an import from sse-starlette.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-21 23:14:55 -05:00
kingbri
b579fd46b7 Dependencies: Remove outlines from optional check
Outlines is no longer a dependency that's used in TabbyAPI.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-18 11:56:40 -05:00
DocShotgun
4d11323c17 Tree: Format 2024-12-17 09:37:33 -08:00
DocShotgun
5da335eb3d Model: Robust request length checking in generator
* Ensure that length of positive/negative prompt + max_tokens does not exceed max_seq_len
* Ensure that total required pages for CFG request does not exceed allocated cache_size
2024-12-17 09:34:43 -08:00
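The two checks above can be sketched as follows. The page size, function name, and doubling rule for CFG are assumptions for illustration; exllamav2's paged cache uses fixed-size pages, taken here as 256 tokens.

```python
PAGE_SIZE = 256  # assumed paged-attention page size

def check_request(prompt_len, max_tokens, max_seq_len, cache_size, cfg=False):
    """Raise if the request cannot fit in the context or the cache."""
    if prompt_len + max_tokens > max_seq_len:
        raise ValueError("prompt + max_tokens exceeds max_seq_len")
    pages = -(-(prompt_len + max_tokens) // PAGE_SIZE)  # ceiling division
    if cfg:
        pages *= 2  # positive and negative prompts each need their own pages
    if pages * PAGE_SIZE > cache_size:
        raise ValueError("required pages exceed allocated cache_size")

check_request(1000, 500, max_seq_len=4096, cache_size=2048)  # fits
```

With `cfg=True` the same request needs twice the pages (3072 tokens of cache here) and is rejected against a 2048-token cache, which is the failure mode the commit guards against.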
kingbri
c23e406f2d Sampling: Add max_completion_tokens
Conforms with OAI's updated spec

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-13 01:02:37 -05:00
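OAI's updated spec prefers `max_completion_tokens` over the older `max_tokens`. A hedged sketch of accepting both (helper name and default are hypothetical):

```python
def resolve_max_tokens(payload: dict, default: int = 150) -> int:
    """Prefer max_completion_tokens, fall back to max_tokens, then default."""
    for key in ("max_completion_tokens", "max_tokens"):
        value = payload.get(key)
        if value is not None:
            return value
    return default

assert resolve_max_tokens({"max_completion_tokens": 100}) == 100
assert resolve_max_tokens({"max_tokens": 50}) == 50
assert resolve_max_tokens({}) == 150
```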
kingbri
bc3c154c96 Dependencies: Pin tokenizers
Use a version greater than 0.20.0 for newer model support.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-13 00:58:25 -05:00
Brian
1ba33bf646
Merge pull request #252 from DocShotgun/main
Switch grammar backend to Formatron
2024-12-13 00:55:20 -05:00
kingbri
f25ac4b833 Dependencies: Update ExllamaV2
v0.2.6

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-13 00:47:29 -05:00
kingbri
8df8ba3ddb Dependencies: Update ExllamaV2
v0.2.6

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-11 21:58:25 -05:00
DocShotgun
7f899734c0 Grammar: Cache the engine vocabulary
* Avoid rebuilding the KBNF engine vocabulary on every grammar-enabled request
2024-12-05 21:36:37 -08:00
kingbri
8ccd7a12a2 Merge branch 'main' into formatron 2024-12-05 23:01:22 -05:00
kingbri
ac85e34356 Dependencies: Update Torch, FA2, and Exl2
Torch: 2.5, FA2 2.7.0.post2, Exl2 v0.2.5

Don't update torch for ROCm, as exl2 isn't built for ROCm 6.2.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-03 22:57:00 -05:00
kingbri
ca86ab5477 Dependencies: Remove CUDA 11.8
Most software has moved to CUDA 12, and the older cards that still
require 11.8 don't run tabby anyways.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-03 22:37:03 -05:00
kingbri
3c4211c963 Dependencies: Ensure updated kbnf
Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-02 15:10:20 -05:00
Brian
fe44e4a524
Merge pull request #253 from randoentity/workaround-toolcall
workaround for tool calling
2024-11-28 23:30:00 -05:00
kingbri
2e06fb01d3 OAI: Pass mm_embeddings to tool call generation
Don't exclude the vision embeddings when regenerating for a tool call.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-11-28 23:27:59 -05:00
Brian
b81dcdaf66
Merge pull request #232 from AlpinDale/serviceinfo_uri
feat: add serviceinfo URI
2024-11-28 23:19:52 -05:00
kingbri
5fadaa728a API: Move serviceinfo to core
Best to expose this endpoint to all APIs, as it's an information endpoint.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-11-28 23:07:58 -05:00
lucy
ab1f4b7a6a
add draft_gpu_split option 2024-11-27 02:52:19 +01:00
DocShotgun
6f2dc2ea99 Grammar: Fix syntax, lint 2024-11-24 11:35:45 -08:00
DocShotgun
8f209efb99 Grammar: Clean up KBNF implementation
* Also remove empty cache clear function
2024-11-24 10:44:45 -08:00
randoentity
a52610fb19 workaround for tool calling 2024-11-24 13:40:33 +01:00
DocShotgun
a9f39bcff3 Grammar: Preliminary Formatron KBNF support 2024-11-23 12:05:41 -08:00
DocShotgun
0836a9317f Grammar: Initial Formatron regex and JSON schema implementation
* Replace LMFE's regex and JSON schema filters with Formatron's
* Remove Outlines EBNF filter in preparation for Formatron KBNF filter
* TODO: Implement Formatron KBNF filter
2024-11-23 10:27:37 -08:00
kingbri
aa4ccd03d4 Infinity: Use a runtime type hint for engine
Remove the antipattern of the conditional type for the Async engine
and use a string-based type annotation instead.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-11-22 18:06:08 -05:00
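The pattern replaces a conditional type, which only exists when the optional package imports successfully, with a quoted annotation that never needs the name at runtime. A sketch, assuming `infinity_emb` is the optional package:

```python
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # only evaluated by static type checkers, never at runtime
    from infinity_emb import AsyncEmbeddingEngine

# the quoted name is a forward reference; no import is needed to run this
engine: Optional["AsyncEmbeddingEngine"] = None

assert engine is None
```

Type checkers still resolve the full type, while servers without the extra installed import the module cleanly.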
kingbri
242ff4f892 Dependencies: Fix OpenAPI generation
The vision module from the ExllamaV2 backend is used in files outside
the backend's folder. Therefore, import ExllamaV2 as an optional
dependency here.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-11-22 17:59:20 -05:00
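The optional-dependency import can be sketched as follows (the module path and flag name are illustrative assumptions, not TabbyAPI's actual identifiers):

```python
try:
    # the vision class may be absent, e.g. when only generating OpenAPI docs
    from exllamav2.generator import ExLlamaV2MMEmbedding  # hypothetical path
    HAS_EXLLAMAV2 = True
except ImportError:
    ExLlamaV2MMEmbedding = None
    HAS_EXLLAMAV2 = False

# callers guard on the flag instead of importing unconditionally
assert isinstance(HAS_EXLLAMAV2, bool)
```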
kingbri
9cd7fcaf99 Pyproject: Add pillow to deps
Signed-off-by: kingbri <bdashore3@proton.me>
2024-11-22 17:48:56 -05:00
Brian
9c8186c138
Merge pull request #249 from theroyallab/vision
Vision
2024-11-22 17:45:49 -05:00
kingbri
388d36e6bd OAI: Fix chat completion list parsing
The strings weren't being concatenated properly. Only add the combined
text if the chat completion type is a List.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-11-22 17:30:29 -05:00
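The fix above can be sketched like this; the message shape follows the OAI multi-part content format, and the helper name is hypothetical:

```python
def combine_content(content):
    """Join text parts only when content is a list; pass strings through."""
    if isinstance(content, list):
        return "".join(
            part.get("text", "")
            for part in content
            if part.get("type") == "text"
        )
    return content

assert combine_content("hello") == "hello"
parts = [{"type": "text", "text": "hel"}, {"type": "text", "text": "lo"}]
assert combine_content(parts) == "hello"
```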
kingbri
eadc71a4c3 Model: Add unload and error messages for vision
If vision is enabled and the model doesn't support it, send an
error asking the user to reload. Also, add a method to unload the
vision tower.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-11-22 14:25:03 -05:00
kingbri
c49047eea1 Model: Fix load packets
The model_type internal reference was changed to an enum for
a more extendable loading process. Return the current model type
when loading a new model.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-11-21 18:06:47 -05:00
kingbri
0ab393f09c Model: Set vision load to False by default
Mistake in unwrapping. Vision should be false to allow normal model
loading when the flag isn't provided.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-11-21 17:54:42 -05:00
kingbri
902045edbb API: Fix chat completion formatting flow
Previously, the flow for parsing chat completion messages and rendering
from the prompt template was disconnected between endpoints. Now, create
a common function to render and handle everything appropriately afterwards.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-11-21 17:51:14 -05:00