Commit graph

1076 commits

Author SHA1 Message Date
Brian
ba9fae808e
Merge pull request #281 from mefich/main
Add strftime_now to Jinja2 to use with Granite3 models
2025-02-13 22:52:51 -05:00
kingbri
7f6294a96d Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-13 22:42:59 -05:00
mefich
3b482d80e4
Add strftime_now to Jijnja2 to use with Granite3 models
Granite3 default template uses strftime_now function.
Currently Jinja2 raises an exception because strftime_now is undefined and /v1/chat/completions endpoint doesn't work with these models when a template from the model metadata is used.
2025-02-13 18:08:24 +02:00
Brian
2e491472d1
Merge pull request #254 from lucyknada/main
add draft_gpu_split option for spec decoding
2025-02-11 16:48:03 -05:00
kingbri
e290b88568 Args: Expose api-servers to subcommands
This is required for the export-openapi action.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-10 23:39:46 -05:00
kingbri
153dac496c Args: Fix imports and handling of export openapi
The api-servers arg is passed when running subcommands, so use that
instead of replicating the arg again.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-10 23:19:44 -05:00
kingbri
30ab8e04b9 Args: Add subcommands to run actions
Migrate OpenAPI and sample config export to subcommands "export-openapi"
and "export-config".

Also add a "download" subcommand that passes args to the TabbyAPI
downloader. This allows models to be downloaded via the API and
CLI args.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-10 23:14:22 -05:00
kingbri
30f02e5453 Main: Remove uvloop/winloop from experimental status
Uvloop/Winloop does provide advantages to asyncio vs the standard
Proactor loop, so remove experimental status.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-10 21:30:48 -05:00
kingbri
0dcbb7a722 Dependencies: Update torch, exllamav2, and flash-attn
Torch - 2.6.0
ExllamaV2 - 0.2.8
Flash-attn - 2.7.4.post1

Cuda wheels are now 12.4 instead of 12.1, feature names need to be
migrated over.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-09 01:27:48 -05:00
kingbri
beb6d8faa5 Model: Adjust draft_gpu_split and add to config
The previous code overrode the existing gpu split and device idx
values. This now sets an independent draft_gpu_split value and
adjusts the gpu_devices check only if the draft_gpu_split array
is larger than the gpu_split array.

Draft gpu split is not Tensor Parallel, and defaults to gpu_split_auto
if a split is not provided.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-08 16:09:46 -05:00
kingbri
bd8256d168 Merge branch 'main' into draft-split 2025-02-08 15:10:44 -05:00
kingbri
dcbf2de9e5 Logger: Add timestamps
Was against this for a while due to the length of timestamps clogging
the console, but it makes sense to know when something goes wrong.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-07 18:40:28 -05:00
kingbri
54fda0dc09 Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-07 18:03:33 -05:00
kingbri
96e8375ec8 Multimodal: Fix memory leak with MMEmbeddings
On a basic python class, class attributes are handled by reference,
meaning that every instance of embeddings would attach to that reference
and allocate more memory.

Switch to a Pydantic class and factory methods when instantiating.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-02 12:21:19 -05:00
kingbri
bd16681825 Start: Mark cuda 11.8 as unsupported
Temporary until existing cuda 11.8 scripts can be migrated to cuda 12.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2025-01-12 21:50:41 -05:00
Brian
566e5b5937
Merge pull request #271 from lifo9/bump-formatron
Bump formatron to `0.4.11`
2025-01-07 23:19:35 -05:00
Jakub Filo
f8d9cfb5fd Bump formatron to 0.4.11 2025-01-08 00:48:25 +01:00
kingbri
cfb439c0e6 Dependencies: Update exllamav2 and pytorch for ROCm
Exllama v0.2.7, pytorch v2.5.1 across all cards.

AMD now requires ROCm 6.2

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2025-01-01 16:22:10 -05:00
kingbri
6da65a8fd3 Embeddings: Fix base64 return
A base64 embedding can be a string post-encoding.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2025-01-01 16:15:12 -05:00
kingbri
245bd5c008 Templates: Alter chatml_with_headers to fit huggingface spec
The previous template was compatible with Jinja2 in Python, but it
was not cross-platform compatible according to HF's standards.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-30 14:00:44 -05:00
Brian
709493837b
Merge pull request #264 from DocShotgun/robust-length-checking
Robust request length checking in generator
2024-12-26 23:37:53 -05:00
kingbri
b994aae995 Model: Cleanup generation length and page checks
Reduce the amount of if statements and combine parts of code.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-26 23:13:08 -05:00
kingbri
ba2579ff74 Merge branch 'main' into robust-length-checks 2024-12-26 18:00:26 -05:00
kingbri
7878d351a7 Endpoints: Add props endpoint and add more values to model params
The props endpoint is a standard used by llamacpp APIs which returns
various properties of a model to a server. It's still recommended to
use /v1/model to get all the parameters a TabbyAPI model has.

Also include the contents of a prompt template when fetching the current
model.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-26 17:32:19 -05:00
kingbri
fa8035ef72 Dependencies: Update sse-starlette and formatron
Also pin newer versions of dependencies and fix an import from sse-starlette

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-21 23:14:55 -05:00
kingbri
b579fd46b7 Dependencies: Remove outlines from optional check
Outlines is no longer a dependency that's used in TabbyAPI.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-18 11:56:40 -05:00
DocShotgun
4d11323c17 Tree: Format 2024-12-17 09:37:33 -08:00
DocShotgun
5da335eb3d Model: Robust request length checking in generator
* Ensure that length of positive/negative prompt + max_tokens does not exceed max_seq_len
* Ensure that total required pages for CFG request does not exceed allocated cache_size
2024-12-17 09:34:43 -08:00
kingbri
c23e406f2d Sampling: Add max_completion_tokens
Conforms with OAI's updated spec

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-13 01:02:37 -05:00
kingbri
bc3c154c96 Dependencies: Pin tokenizers
Use a version greater than 0.20.0 for newer model support.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-13 00:58:25 -05:00
Brian
1ba33bf646
Merge pull request #252 from DocShotgun/main
Switch grammar backend to Formatron
2024-12-13 00:55:20 -05:00
kingbri
f25ac4b833 Dependencies: Update ExllamaV2
v0.2.6

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-13 00:47:29 -05:00
kingbri
8df8ba3ddb Dependencies: Update ExllamaV2
v0.2.6

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-11 21:58:25 -05:00
DocShotgun
7f899734c0 Grammar: Cache the engine vocabulary
* Avoid rebuilding the KBNF engine vocabulary on every grammar-enabled request
2024-12-05 21:36:37 -08:00
kingbri
8ccd7a12a2 Merge branch 'main' into formatron 2024-12-05 23:01:22 -05:00
kingbri
ac85e34356 Depenedencies: Update Torch, FA2, and Exl2
Torch: 2.5, FA2 2.7.0.post2, Exl2 v0.2.5

Don't update torch for rocm as exl2 isn't built for rocm 6.2

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-03 22:57:00 -05:00
kingbri
ca86ab5477 Dependencies: Remove CUDA 11.8
Most software has moved to CUDA 12 and cards that aren't supported by
11.8 don't use tabby anyways.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-03 22:37:03 -05:00
kingbri
3c4211c963 Dependencies: Ensure updated kbnf
Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-02 15:10:20 -05:00
Brian
fe44e4a524
Merge pull request #253 from randoentity/workaround-toolcall
workaround for tool calling
2024-11-28 23:30:00 -05:00
kingbri
2e06fb01d3 OAI: Pass mm_embeddings to tool call generation
Don't exclude the vision embeddings when regenerating for a tool call.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-11-28 23:27:59 -05:00
Brian
b81dcdaf66
Merge pull request #232 from AlpinDale/serviceinfo_uri
feat: add serviceinfo URI
2024-11-28 23:19:52 -05:00
kingbri
5fadaa728a API: Move serviceinfo to core
Best to expose this endpoint to all APIs as its an information endpoint.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-11-28 23:07:58 -05:00
lucy
ab1f4b7a6a
add draft_gpu_split option 2024-11-27 02:52:19 +01:00
DocShotgun
6f2dc2ea99 Grammar: Fix syntax, lint 2024-11-24 11:35:45 -08:00
DocShotgun
8f209efb99 Grammar: Clean up KBNF implementation
* Also remove empty cache clear function
2024-11-24 10:44:45 -08:00
randoentity
a52610fb19 workaround for tool calling 2024-11-24 13:40:33 +01:00
DocShotgun
a9f39bcff3 Grammar: Preliminary Formatron KBNF support 2024-11-23 12:05:41 -08:00
DocShotgun
0836a9317f Grammar: Initial Formatron regex and JSON schema implementation
* Replace LMFE's regex and JSON schema filters with Formatron's
* Remove Outlines EBNF filter in preparation for Formatron KBNF filter
* TODO: Implement Formatron KBNF filter
2024-11-23 10:27:37 -08:00
kingbri
aa4ccd03d4 Infinity: Use a runtime type hint for engine
Remove the antipattern of the conditional type for the Async engine
and use string-based type inference.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-11-22 18:06:08 -05:00
kingbri
242ff4f892 Dependencies: Fix OpenAPI generation
The vision module from the ExllamaV2 backend is used in files outside
the backends contained folder. Therefore, import ExllamaV2 as an
optional dependency here.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-11-22 17:59:20 -05:00