Commit graph

1036 commits

Author SHA1 Message Date
David Allada
4196bb6bc8
Update the behavior of start.py so that we can do a full build AND sa… (#293)
* Update the behavior of start.py so that we can do a full build AND save the options, allowing builds inside a Docker image

* Add actual args RIP

* Start: Move start_options write before dependency install message

This ensures that start options are properly written before
deciding whether to exit.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>

---------

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Co-authored-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-11 23:54:34 -04:00
kingbri
73688670a6 Docs: Add model and inline loading documentation
Sorely needed due to the number of questions about how inline
loading works.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-25 00:09:18 -05:00
kingbri
35fe372f2b Embeddings: Handle case if embedding input is passed as a string
Infinity expects a list when embedding, so convert to a list if the
input is a string.
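
A minimal sketch of the normalization this describes (the function and parameter names are assumptions, not TabbyAPI's actual code):

```python
from typing import Union


def normalize_embeddings_input(embeddings_input: Union[str, list[str]]) -> list[str]:
    # Infinity's embedding call expects a list, so wrap a lone string.
    if isinstance(embeddings_input, str):
        return [embeddings_input]
    return embeddings_input


print(normalize_embeddings_input("hello world"))  # -> ['hello world']
```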

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-23 00:39:21 -05:00
kingbri
c580893054 Downloader: log errors when downloading
If an error is returned from HuggingFace, raise it to the calling
function.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-19 23:16:17 -05:00
kingbri
48bb78c614 Logger: Switch to ISO timestamp formatting
I thought this was previously enabled, but it turns out I had
labeled it with the wrong date format.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-19 21:48:23 -05:00
kingbri
d6b8c7db4b Docs: Update getting started guide
Add downloader options and edit some points.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-18 12:17:14 -05:00
kingbri
830301b2b4 Actions: Update and add Wiki publish
Publishes the GitHub wiki and runs these workflows in concurrency
groups to avoid spawning multiple actions at a time.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-17 23:47:38 -05:00
kingbri
5614b342a7 Tree: Migrate docs into repository
This will auto-publish to the GitHub wiki via an action.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-17 23:39:35 -05:00
kingbri
9f649647f0 Model + API: GPU split updates and fixes
For the TP loader, GPU split cannot be an empty array. However,
defaulting the parameter to an empty array makes it easier to calculate
the device list. Therefore, cast an empty array to None using
falsy comparisons at load time.

Also add draft_gpu_split to the load request.
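
A hedged sketch of the falsy cast described above (names are illustrative, not the loader's actual code):

```python
def prepare_gpu_split(gpu_split: list[float]) -> tuple[list[int], list[float] | None]:
    # Defaulting the parameter to [] keeps the device-list calculation simple.
    gpu_devices = list(range(len(gpu_split)))

    # The TP loader rejects an empty split, so cast falsy values ([] or None)
    # to None at load time.
    return gpu_devices, (gpu_split or None)
```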

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-15 21:50:14 -05:00
Brian
304df16543
Update README.md 2025-02-15 12:14:06 -05:00
Brian
ba9fae808e
Merge pull request #281 from mefich/main
Add strftime_now to Jinja2 to use with Granite3 models
2025-02-13 22:52:51 -05:00
kingbri
7f6294a96d Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-13 22:42:59 -05:00
mefich
3b482d80e4
Add strftime_now to Jinja2 to use with Granite3 models
Granite3's default template uses the strftime_now function.
Currently Jinja2 raises an exception because strftime_now is undefined, so the /v1/chat/completions endpoint doesn't work with these models when a template from the model metadata is used.
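
A minimal sketch of how such a helper can be exposed to Jinja2 (the exact wiring in TabbyAPI's template handling may differ):

```python
from datetime import datetime

from jinja2 import Environment


def strftime_now(fmt: str) -> str:
    # Mirror of the helper that Granite3's chat template expects.
    return datetime.now().strftime(fmt)


env = Environment()
env.globals["strftime_now"] = strftime_now

# The template can now call the helper without raising UndefinedError.
print(env.from_string("Today is {{ strftime_now('%B %d, %Y') }}.").render())
```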
2025-02-13 18:08:24 +02:00
Brian
2e491472d1
Merge pull request #254 from lucyknada/main
add draft_gpu_split option for spec decoding
2025-02-11 16:48:03 -05:00
kingbri
e290b88568 Args: Expose api-servers to subcommands
This is required for the export-openapi action.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-10 23:39:46 -05:00
kingbri
153dac496c Args: Fix imports and handling of export openapi
The api-servers arg is passed when running subcommands, so use that
instead of replicating the arg again.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-10 23:19:44 -05:00
kingbri
30ab8e04b9 Args: Add subcommands to run actions
Migrate OpenAPI and sample config export to subcommands "export-openapi"
and "export-config".

Also add a "download" subcommand that passes args to the TabbyAPI
downloader. This allows models to be downloaded via the API and
CLI args.
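
A rough sketch of this subcommand layout using argparse (any argument names beyond the subcommand names are assumptions):

```python
import argparse

parser = argparse.ArgumentParser(prog="tabbyapi")
subcommands = parser.add_subparsers(dest="subcommand")

# Actions migrated from standalone flags to subcommands.
subcommands.add_parser("export-openapi", help="Export the OpenAPI schema")
subcommands.add_parser("export-config", help="Export a sample config file")

# Downloader subcommand that forwards its args to the TabbyAPI downloader.
download = subcommands.add_parser("download", help="Download a model")
download.add_argument("repo_id", help="HuggingFace repo to download")

args = parser.parse_args(["download", "some-org/some-model"])
print(args.subcommand, args.repo_id)
```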

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-10 23:14:22 -05:00
kingbri
30f02e5453 Main: Remove uvloop/winloop from experimental status
Uvloop/Winloop do provide advantages over asyncio's standard
Proactor loop, so remove the experimental status.
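
A sketch of how an alternate event loop policy is typically installed (the actual selection logic in main.py may differ, and the winloop import assumes it mirrors uvloop's API):

```python
import asyncio
import platform

# Swap in the faster event loop implementation for the current platform.
if platform.system() == "Windows":
    # Assumption: winloop exposes the same EventLoopPolicy as uvloop.
    from winloop import EventLoopPolicy
else:
    from uvloop import EventLoopPolicy

asyncio.set_event_loop_policy(EventLoopPolicy())
```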

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-10 21:30:48 -05:00
kingbri
0dcbb7a722 Dependencies: Update torch, exllamav2, and flash-attn
Torch - 2.6.0
ExllamaV2 - 0.2.8
Flash-attn - 2.7.4.post1

CUDA wheels are now built for 12.4 instead of 12.1, so feature names
need to be migrated over.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-09 01:27:48 -05:00
kingbri
beb6d8faa5 Model: Adjust draft_gpu_split and add to config
The previous code overrode the existing gpu split and device idx
values. This now sets an independent draft_gpu_split value and
adjusts the gpu_devices check only if the draft_gpu_split array
is larger than the gpu_split array.

Draft gpu split is not Tensor Parallel, and defaults to gpu_split_auto
if a split is not provided.
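
A hedged sketch of the widened check described above (variable names are illustrative, not the actual model container code):

```python
def resolve_device_count(gpu_split: list[float], draft_gpu_split: list[float]) -> int:
    # Widen the gpu_devices check only when the draft split spans more GPUs
    # than the main split; a missing draft split falls back to autosplit behaviour.
    if draft_gpu_split and len(draft_gpu_split) > len(gpu_split):
        return len(draft_gpu_split)
    return len(gpu_split)
```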

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-08 16:09:46 -05:00
kingbri
bd8256d168 Merge branch 'main' into draft-split 2025-02-08 15:10:44 -05:00
kingbri
dcbf2de9e5 Logger: Add timestamps
I was against this for a while because long timestamps clog
the console, but it makes sense to know when something goes wrong.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-07 18:40:28 -05:00
kingbri
54fda0dc09 Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-07 18:03:33 -05:00
kingbri
96e8375ec8 Multimodal: Fix memory leak with MMEmbeddings
In a basic Python class, class attributes are shared by reference,
meaning that every embeddings instance would attach to the same object
and allocate more memory.

Switch to a Pydantic class and factory methods when instantiating.
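
A toy illustration of the leak pattern and the factory-based fix (class names here are illustrative, not the actual MMEmbeddings code):

```python
from pydantic import BaseModel, Field


class LeakyEmbeddings:
    # Class attribute: the SAME list object is shared by every instance,
    # so appended embeddings accumulate for the lifetime of the process.
    content: list = []


class SafeEmbeddings(BaseModel):
    # default_factory builds a fresh list per instance, so embeddings are
    # released with the instance instead of piling up on the class.
    content: list = Field(default_factory=list)


a, b = LeakyEmbeddings(), LeakyEmbeddings()
a.content.append("embedding blob")
assert b.content  # leaked: b sees a's data

c, d = SafeEmbeddings(), SafeEmbeddings()
c.content.append("embedding blob")
assert not d.content  # independent per-instance storage
```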

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-02 12:21:19 -05:00
kingbri
bd16681825 Start: Mark cuda 11.8 as unsupported
Temporary until existing CUDA 11.8 scripts can be migrated to CUDA 12.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2025-01-12 21:50:41 -05:00
Brian
566e5b5937
Merge pull request #271 from lifo9/bump-formatron
Bump formatron to `0.4.11`
2025-01-07 23:19:35 -05:00
Jakub Filo
f8d9cfb5fd Bump formatron to 0.4.11 2025-01-08 00:48:25 +01:00
kingbri
cfb439c0e6 Dependencies: Update exllamav2 and pytorch for ROCm
ExllamaV2 v0.2.7, PyTorch v2.5.1 across all cards.

AMD now requires ROCm 6.2.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2025-01-01 16:22:10 -05:00
kingbri
6da65a8fd3 Embeddings: Fix base64 return
A base64 embedding can be a string post-encoding.
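
For reference, a generic sketch of what a base64 embedding payload looks like (not TabbyAPI's exact serialization):

```python
import base64
import struct


def encode_embedding_base64(vector: list[float]) -> str:
    # Pack the floats into bytes and base64-encode; the result is a plain string.
    packed = struct.pack(f"<{len(vector)}f", *vector)
    return base64.b64encode(packed).decode("utf-8")


print(encode_embedding_base64([0.1, 0.2, 0.3]))
```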

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2025-01-01 16:15:12 -05:00
kingbri
245bd5c008 Templates: Alter chatml_with_headers to fit huggingface spec
The previous template was compatible with Jinja2 in Python, but it
was not cross-platform compatible according to HF's standards.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-30 14:00:44 -05:00
Brian
709493837b
Merge pull request #264 from DocShotgun/robust-length-checking
Robust request length checking in generator
2024-12-26 23:37:53 -05:00
kingbri
b994aae995 Model: Cleanup generation length and page checks
Reduce the number of if statements and combine related parts of the code.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-26 23:13:08 -05:00
kingbri
ba2579ff74 Merge branch 'main' into robust-length-checks 2024-12-26 18:00:26 -05:00
kingbri
7878d351a7 Endpoints: Add props endpoint and add more values to model params
The props endpoint is a standard used by llamacpp APIs which returns
various properties of the loaded model. It's still recommended to
use /v1/model to get all the parameters a TabbyAPI model has.

Also include the contents of a prompt template when fetching the current
model.
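
A hedged sketch of what a llamacpp-compatible /props route can look like in FastAPI (the response fields shown are assumptions, not TabbyAPI's exact schema):

```python
from fastapi import APIRouter

router = APIRouter()


@router.get("/props")
async def get_props() -> dict:
    # Hypothetical response shape; llamacpp clients mainly want basic model
    # properties such as context length and the prompt template contents.
    return {
        "default_generation_settings": {"n_ctx": 4096},
        "total_slots": 1,
        "chat_template": "{# prompt template contents #}",
    }
```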

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-26 17:32:19 -05:00
kingbri
fa8035ef72 Dependencies: Update sse-starlette and formatron
Also pin newer versions of dependencies and fix an import from sse-starlette.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-21 23:14:55 -05:00
kingbri
b579fd46b7 Dependencies: Remove outlines from optional check
Outlines is no longer a dependency that's used in TabbyAPI.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-18 11:56:40 -05:00
DocShotgun
4d11323c17 Tree: Format 2024-12-17 09:37:33 -08:00
DocShotgun
5da335eb3d Model: Robust request length checking in generator
* Ensure that length of positive/negative prompt + max_tokens does not exceed max_seq_len
* Ensure that total required pages for CFG request does not exceed allocated cache_size
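
A sketch of the two guards in plain Python (names like max_seq_len and cache_size follow the bullets above; the page-size constant and error handling are assumptions):

```python
PAGE_SIZE = 256  # assumed paged-attention page size


def check_generation_request(
    prompt_tokens: int,
    negative_prompt_tokens: int,
    max_tokens: int,
    max_seq_len: int,
    cache_size: int,
    use_cfg: bool,
) -> None:
    # Guard 1: the longer of the positive/negative prompts plus the requested
    # tokens must fit within the model's sequence length.
    longest_prompt = max(prompt_tokens, negative_prompt_tokens)
    if longest_prompt + max_tokens > max_seq_len:
        raise ValueError("Request length exceeds max_seq_len")

    # Guard 2: a CFG request occupies pages for both prompts, which must fit
    # in the allocated cache.
    if use_cfg:
        pages_needed = sum(
            -(-(tokens + max_tokens) // PAGE_SIZE)  # ceiling division
            for tokens in (prompt_tokens, negative_prompt_tokens)
        )
        if pages_needed * PAGE_SIZE > cache_size:
            raise ValueError("CFG request does not fit in the allocated cache_size")
```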
2024-12-17 09:34:43 -08:00
kingbri
c23e406f2d Sampling: Add max_completion_tokens
Conforms to OAI's updated spec.
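
A sketch of accepting both field names with Pydantic v2 (assuming the request model keeps max_tokens internally):

```python
from pydantic import AliasChoices, BaseModel, Field


class CompletionRequest(BaseModel):
    # Accept OAI's newer `max_completion_tokens` alongside the legacy
    # `max_tokens` and map both onto one internal field.
    max_tokens: int | None = Field(
        default=None,
        validation_alias=AliasChoices("max_tokens", "max_completion_tokens"),
    )


print(CompletionRequest.model_validate({"max_completion_tokens": 256}).max_tokens)
```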

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-13 01:02:37 -05:00
kingbri
bc3c154c96 Dependencies: Pin tokenizers
Use a version greater than 0.20.0 for newer model support.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-13 00:58:25 -05:00
Brian
1ba33bf646
Merge pull request #252 from DocShotgun/main
Switch grammar backend to Formatron
2024-12-13 00:55:20 -05:00
kingbri
f25ac4b833 Dependencies: Update ExllamaV2
v0.2.6

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-13 00:47:29 -05:00
kingbri
8df8ba3ddb Dependencies: Update ExllamaV2
v0.2.6

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-11 21:58:25 -05:00
DocShotgun
7f899734c0 Grammar: Cache the engine vocabulary
* Avoid rebuilding the KBNF engine vocabulary on every grammar-enabled request
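
A generic sketch of the caching idea (the builder below is a placeholder; Formatron's actual vocabulary API may differ):

```python
# Cache keyed by tokenizer, so the expensive vocabulary build runs once
# instead of on every grammar-enabled request.
_vocab_cache: dict[str, list[str]] = {}


def build_vocabulary(tokenizer_id: str) -> list[str]:
    # Placeholder for the costly KBNF vocabulary construction.
    return [f"{tokenizer_id}-token-{i}" for i in range(10)]


def get_vocabulary(tokenizer_id: str) -> list[str]:
    if tokenizer_id not in _vocab_cache:
        _vocab_cache[tokenizer_id] = build_vocabulary(tokenizer_id)
    return _vocab_cache[tokenizer_id]
```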
2024-12-05 21:36:37 -08:00
kingbri
8ccd7a12a2 Merge branch 'main' into formatron 2024-12-05 23:01:22 -05:00
kingbri
ac85e34356 Dependencies: Update Torch, FA2, and Exl2
Torch: 2.5, FA2 2.7.0.post2, Exl2 v0.2.5

Don't update torch for ROCm as exl2 isn't built for ROCm 6.2.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-03 22:57:00 -05:00
kingbri
ca86ab5477 Dependencies: Remove CUDA 11.8
Most software has moved to CUDA 12, and cards that still require
11.8 don't use tabby anyway.

Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-03 22:37:03 -05:00
kingbri
3c4211c963 Dependencies: Ensure updated kbnf
Signed-off-by: kingbri <8082010+bdashore3@users.noreply.github.com>
2024-12-02 15:10:20 -05:00
Brian
fe44e4a524
Merge pull request #253 from randoentity/workaround-toolcall
workaround for tool calling
2024-11-28 23:30:00 -05:00
kingbri
2e06fb01d3 OAI: Pass mm_embeddings to tool call generation
Don't exclude the vision embeddings when regenerating for a tool call.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-11-28 23:27:59 -05:00