Commit graph

83 commits

kingbri
43f9483bc4 Model: Add tensor_parallel_backend option
This allows users to choose nccl or native depending on the GPU setup.
NCCL is only available with the Linux-built wheels.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-08-17 22:35:10 -04:00
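
A minimal config.yml sketch of the new option (the placement under the
model block and the default value shown are assumptions):

```yaml
model:
  # Tensor parallel communication backend: "nccl" or "native".
  # NCCL is only available with the Linux-built wheels.
  tensor_parallel_backend: native
```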
DocShotgun
102af306e5 Config: Remove developer arg cuda_malloc_backend
* cudaMallocAsync is now enabled by default on supported configurations
2025-08-01 10:59:13 -07:00
kingbri
2096c9bad2 Model: Default max_seq_len to 4096
A common problem in TabbyAPI is that users who want to get up and
running with a model often hit OOMs caused by max_seq_len, because
model devs set max context values in the millions, which requires a
lot of VRAM.

To idiot-proof first-time setup, make the fallback default 4096 so
users can run their models. If a user still wants the model's native
max_seq_len, set it to -1.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-13 14:57:24 -04:00
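
Hedged illustration of the resulting fallback in config.yml (values
are the ones described in the commit):

```yaml
model:
  # Fallback default is 4096 to avoid OOMs on first-time setup.
  # Set to -1 to use the model's own max_seq_len from config.json.
  max_seq_len: 4096
```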
Brian
02a8d68e17
Merge branch 'exl3' into backend-detect 2025-05-08 23:50:33 -04:00
DocShotgun
f8070e7707 Model: Auto detect model backend from config
* Use exllamav3 for exl3 models, exllamav2 otherwise
2025-05-06 18:51:58 -07:00
DocShotgun
68a660bdb3 Model: Initial Exl3 cache quantization support 2025-05-03 20:35:35 -07:00
kingbri
7c6a053747 Model: Add option to select backend
Changing the backend switches the container that's used.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:32:39 -04:00
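
A sketch of the selection described above; the key name `backend` and
the accepted values are assumptions based on this commit and the
auto-detection commit above:

```yaml
model:
  # Assumed key name; switches the model container that is used.
  # exl3 models auto-detect exllamav3, everything else uses exllamav2.
  backend: exllamav2
```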
kingbri
79f9c6e854 Model: Remove num_experts_per_token
This shouldn't even be an exposed option since changing it always
breaks inference with the model. Let the model's config.json handle
it.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-19 11:52:10 -04:00
kingbri
698d8339cb Config + Docs: Clarify YaRN rope scaling changes
In ExllamaV2, if a model has YaRN support, linear RoPE options are
not applied. Users can set max_seq_len and exl2 will take care of
the rest.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-03-19 11:47:49 -04:00
Brian
2e491472d1
Merge pull request #254 from lucyknada/main
add draft_gpu_split option for spec decoding
2025-02-11 16:48:03 -05:00
kingbri
30f02e5453 Main: Remove uvloop/winloop from experimental status
Uvloop/Winloop do provide advantages over asyncio's standard Proactor
loop, so remove the experimental status.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-10 21:30:48 -05:00
kingbri
beb6d8faa5 Model: Adjust draft_gpu_split and add to config
The previous code overrode the existing gpu split and device idx
values. This now sets an independent draft_gpu_split value and
adjusts the gpu_devices check only if the draft_gpu_split array
is larger than the gpu_split array.

Draft gpu split is not Tensor Parallel, and defaults to gpu_split_auto
if a split is not provided.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-08 16:09:46 -05:00
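
Illustrative draft-model block; only draft_gpu_split and gpu_split_auto
are named in the commit, the block name and model name key are
assumptions:

```yaml
draft_model:
  # Hypothetical key for the draft model to load
  draft_model_name: some-draft-model
  # Independent split for the draft model; not tensor parallel.
  # Falls back to gpu_split_auto when omitted.
  draft_gpu_split: [8.0, 16.0]
```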
kingbri
0fadb1e5e8 Merge branch 'main' into vision 2024-11-19 21:19:21 -05:00
DocShotgun
c42655336b Config: Add option to disable fetching content from URLs 2024-11-17 23:05:17 -08:00
kingbri
bd9e78e19e API: Add inline exception for dummy models
If a request sends a dummy model name, it shouldn't error, since the
server is catering to clients that expect specific OAI model names.
This is a problem with inline model loading, where these names would
error by default. Therefore, add an exception if the provided name is
in the dummy model names (which also double as inline strict
exceptions).

However, the dummy model names weren't configurable, so add a new
option to specify exception names; otherwise the default is
gpt-3.5-turbo.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-11-17 21:15:45 -05:00
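
A hedged sketch of the new option; the key name `dummy_model_names`
is an assumption, only the gpt-3.5-turbo default comes from the
commit:

```yaml
model:
  # Assumed key name: names that never error on inline loading,
  # for OAI clients that hardcode specific model ids.
  dummy_model_names: ["gpt-3.5-turbo"]
```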
kingbri
69ac0eb8aa Model: Add vision loading support
Adds the ability to load vision parts of text + image models. Requires
an explicit flag in config because there isn't a way to automatically
determine whether the vision tower should be used.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-11-11 12:10:11 -05:00
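
Minimal sketch, assuming the explicit flag lives under the model block
and is named `vision`:

```yaml
model:
  # Assumed flag name: load the vision tower of a text + image model.
  # Must be set explicitly; there is no auto-detection.
  vision: true
```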
DocShotgun
603760cecb Model: Remove override_base_seq_len 2024-10-30 10:03:08 +08:00
kingbri
126a44483c Tree: Remove fasttensors
Now a noop in upstream.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-09-30 00:18:47 -04:00
kingbri
63634beb5e Config: Clarify Rope alpha options
Leaving it blank will use the model's set value or auto-calculate.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-09-17 23:03:28 -04:00
kingbri
a34bd9a684 Config: Alter YAML generation script for formatting adherence
Properly add comments and newlines where they need to go.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-09-17 22:44:42 -04:00
TerminalMan
bb4dd7200e fix defaults for api_servers 2024-09-17 15:41:32 +01:00
kingbri
7fe0dbd62f Tree: Update config_sample
Uses the new YAML generator.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-09-16 23:32:54 -04:00
kingbri
21f14d4318 API: Update inline load
- Add a config flag
- Migrate support to /v1/completions
- Unify the load function

Signed-off-by: kingbri <bdashore3@proton.me>
2024-09-03 23:37:28 -04:00
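
A hedged sketch of the config flag; the key name `inline_model_loading`
is an assumption:

```yaml
model:
  # Assumed flag name: allow completion requests to load a model
  # inline by the name given in the request body.
  inline_model_loading: true
```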
kingbri
4aebe8a2a5 Config: Use an explicit "auto" value for rope_alpha
Using "auto" for rope alpha removes ambiguity on how to explicitly
enable automatic rope calculation. The same behavior of None -> auto
calculate still exists, but can be overwritten if a model's tabby_config.yml
includes `rope_alpha`.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-31 22:59:56 -04:00
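
Example of the explicit value from the two rope_alpha commits above
(syntax illustrative):

```yaml
model:
  # "auto" (or leaving the value blank) auto-calculates rope alpha.
  # A model's tabby_config.yml can still override it with a number.
  rope_alpha: auto
```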
kingbri
364032e39e Config: Remove development flag from tensor parallel
Exists in stable ExllamaV2 version.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-22 14:15:19 -04:00
kingbri
871c89063d Model: Add Tensor Parallel support
Use the tensor parallel loader when the flag is enabled. The new loader
has its own autosplit implementation, so gpu_split_auto isn't valid
here.

Also make it easier to determine which cache type to use rather than
multiple if/else statements.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-22 14:15:19 -04:00
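
A sketch of the interaction described above; the `tensor_parallel` key
name is an assumption:

```yaml
model:
  # Assumed flag name: use the tensor parallel loader.
  tensor_parallel: true
  # The TP loader has its own autosplit, so gpu_split_auto isn't valid here.
  gpu_split_auto: false
  gpu_split: [20.0, 20.0]
```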
kingbri
3e42211c3e Config: Embeddings: Make embeddings_device a default when API loading
When loading from the API, the fallback for embeddings_device will be
the value set in the config.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-01 13:59:49 -04:00
Brian Dashore
1bf062559d
Merge pull request #158 from AlpinDale/embeddings
feat: add embeddings support via Infinity-emb
2024-07-31 20:33:12 -04:00
kingbri
46304ce875 Model: Properly pass in max_batch_size from config
The override wasn't being passed in before. Also, the default is now
none since Exl2 can automatically calculate the max batch size.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-30 18:42:25 -04:00
kingbri
dc3dcc9c0d Embeddings: Update config, args, and parameter names
Use embeddings_device as the parameter for device to remove ambiguity.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-30 15:32:26 -04:00
kingbri
bfa011e0ce Embeddings: Add model management
Embedding models are managed on a separate backend, but are run
in parallel with the model itself. Therefore, manage this in a separate
container with separate routes.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-30 15:19:27 -04:00
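
Putting the embeddings commits together, a hedged config sketch; only
embeddings_device is named here, the block and model name key are
assumptions:

```yaml
embeddings:
  # Hypothetical key for the embedding model to load
  embedding_model_name: some-embedding-model
  # Device for the embedding backend; also the fallback when a device
  # isn't given in an API load request.
  embeddings_device: cpu
```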
kingbri
3f21d9ef96 Embeddings: Switch to Infinity
Infinity-emb is an async batching engine for embeddings. This is
preferable to sentence-transformers since it handles scalable use
cases without the need for external thread intervention.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-29 13:42:03 -04:00
kingbri
42bc4adcfb Config: Add option to set priority to realtime
Realtime process priority directs system resources toward tabby's
processes. Running as administrator gives realtime priority, while
running as a normal user sets high priority.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-24 21:50:06 -04:00
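
A minimal sketch, assuming a flag named `realtime_process_priority`
under a developer block:

```yaml
developer:
  # Assumed flag name: realtime priority when run as administrator,
  # high priority otherwise.
  realtime_process_priority: true
```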
kingbri
5c082b7e8c Async: Add option to use Uvloop/Winloop
These are faster event loops for asyncio that should improve overall
performance. Gate them under an experimental flag for now so they can
be stress tested.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-24 18:59:20 -04:00
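
A hedged sketch of the experimental flag; the key name `uvloop` and
its placement are assumptions:

```yaml
developer:
  # Assumed flag name: use uvloop (Linux/macOS) or winloop (Windows)
  # as the asyncio event loop.
  uvloop: true
```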
kingbri
300f034233 API: Add config option to select servers
Always enable the core endpoints and allow servers to be selected
as needed. Use the OAI server by default.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-23 14:27:42 -04:00
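
Hedged example of the server selection; the placement is an
assumption, the OAI default comes from the commit:

```yaml
# Core endpoints are always enabled; extra servers are opt-in.
api_servers: ["OAI"]
```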
kingbri
3826815edb API: Add request logging
Log all the parts of a request if the config flag is set. The logged
fields are all server side anyway, so nothing is being exposed to
clients.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-22 21:40:00 -04:00
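
A minimal sketch, assuming a `log_requests` flag under a logging
block:

```yaml
logging:
  # Assumed flag name: log all parts of incoming requests.
  # The fields are already server side, so nothing new reaches clients.
  log_requests: true
```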
kingbri
6019c93637 Networking: Gate sending tracebacks over the API
It's possible that tracebacks can give too much info about a system
when sent over the API. Gate this under a flag to send them only
when debugging since this feature is still useful.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-14 10:30:11 -04:00
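
A hedged sketch; the key name `send_tracebacks` and its placement are
assumptions:

```yaml
network:
  # Assumed flag name: include tracebacks in API error responses.
  # Enable only when debugging.
  send_tracebacks: false
```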
kingbri
27d2d5f3d2 Config + Model: Allow for default fallbacks from config for model loads
Previously, the parameters under the "model" block in config.yml only
handled the loading of a model on startup. This meant that any subsequent
API request required each parameter to be filled out or to fall back to a
sane default (usually the model's config.json).

However, there are cases where admins may want an argument from the
config to apply if the parameter isn't provided in the request body.
To help alleviate this, add a mechanism that works like sampler overrides
where users can specify a flag that acts as a fallback.

Therefore, this change both preserves the source of truth of what
parameters the admin is loading and adds some convenience for users
that want customizable defaults for their requests.

This behavior may change in the future, but I think it solves the
issue for now.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-06 17:50:58 -04:00
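
A hedged sketch of the fallback mechanism; the `use_as_default` key
name is an assumption, while the behavior (a config value applies only
when the request body omits the parameter) comes from the commit:

```yaml
model:
  max_seq_len: 8192
  # Assumed key name: parameters listed here act as request-time
  # fallbacks, similar to sampler overrides, instead of startup-only
  # values.
  use_as_default: ["max_seq_len"]
```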
DocShotgun
55d979b7a5
Update dependencies, support Python 3.12, update for exl2 0.1.5 (#134)
* Dependencies: Add wheels for Python 3.12

* Model: Switch fp8 cache to Q8 cache

* Model: Add ability to set draft model cache mode

* Dependencies: Bump exllamav2 to 0.1.5

* Model: Support Q6 cache

* Config: Add Q6 cache and draft_cache_mode to config sample
2024-06-09 17:27:39 +02:00
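
Illustrative cache settings from this PR; `cache_mode` is an assumed
key name, the Q6/Q8 values and draft_cache_mode come from the commit
messages:

```yaml
model:
  # Q8 replaces the old fp8 cache; Q6 is newly supported.
  cache_mode: Q6
draft_model:
  # Assumed block name; cache quantization for the draft model.
  draft_cache_mode: Q8
```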
kingbri
bec919e202 Config: Change cache_size description and location
Makes more sense to place cache_size with the other cache options.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-26 20:50:56 -04:00
DocShotgun
767e6a798a API + Model: Add support for specifying k/v cache size 2024-05-26 14:17:01 -07:00
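
Illustrative value for the option named in this commit and the
relocation commit above (units assumed to be tokens):

```yaml
model:
  # K/V cache size, kept alongside the other cache options.
  cache_size: 16384
```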
kingbri
408c66a1f2 Model: Change FA2 and paged attention checks
The dynamic generator requires Flash Attention 2.5.7 or higher to
be installed. This is only supported on Nvidia's 30 series and higher.

If a card is AMD or older than the 30 series, switch to compatibility
mode, which functions the same way as the older generator, except
without parallel batching and any features that depend on it, such as
CFG.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
kingbri
d759a15559 Model: Fix chunk size handling
The wrong class attribute name was used for max_attention_size; this
also fixes the declaration of the draft model's chunk_size.

Also expose the parameter to the end user in both config and model
load.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-07 18:39:19 -04:00
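
Hedged example of the newly exposed parameter (value illustrative):

```yaml
model:
  # Prompt processing chunk size, exposed in config and on model load.
  chunk_size: 2048
```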
kingbri
46ac3beea9 Templates: Support list style chat_template keys
HuggingFace updated transformers to provide tokenizer templates as a
list. Update to support this new format. Providing the name of a
template for the "prompt_template" value in config.yml will also
search inside the template list.

In addition, log if there's a template exception, but continue model
loading since it shouldn't shut down the application.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-07 11:20:25 -04:00
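
Hedged example of the config side of this change (template name
illustrative):

```yaml
model:
  # Template name; also matched against entries when the tokenizer
  # provides chat_template as a list of named templates.
  prompt_template: chatml
```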
kingbri
08bcc6307a Config: Update description part 2
Forgot to change wording.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-17 01:07:23 -04:00
kingbri
7abbac098a Config: Update Q4 in comments
Wasn't present when the option was added.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-17 01:04:12 -04:00
DocShotgun
8245488926
Additional clarification for override_base_seq_len 2024-03-02 09:29:50 -08:00
kingbri
949248fb94 Config: Add experimental torch cuda malloc backend
This option saves some VRAM, but does have the chance to error out.
Add this in the experimental config section.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-14 21:45:56 -05:00
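
A minimal sketch, assuming the flag sits under the developer section
named in the later removal commit:

```yaml
developer:
  # Assumed placement: use torch's cudaMallocAsync allocator backend.
  # Saves some VRAM but may error out on some setups.
  cuda_malloc_backend: true
```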
kingbri
2f568ff573 Config: Expose auto GPU split reserve config
The GPU reserve is used as a VRAM buffer to prevent GPU overflow
when automatically deciding how to load a model on multiple GPUs.
Make this configurable.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-08 22:09:50 -05:00
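
A sketch of the exposed reserve; the key name `autosplit_reserve` and
the MB units are assumptions:

```yaml
model:
  # Assumed key name: per-GPU VRAM buffer (MB) kept free during
  # automatic multi-GPU splitting to prevent overflow.
  autosplit_reserve: [96]
```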
kingbri
58590a6c57 Config: Add option to force streaming off
Many API clients automatically request streaming without giving the
user the option to turn it off. Therefore, give the user more freedom
with a server-side kill switch.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-07 21:09:59 -05:00
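
Minimal sketch of the kill switch, assuming a flag named
`disable_request_streaming`:

```yaml
network:
  # Assumed flag name: ignore stream=true in request bodies and
  # always respond non-streaming.
  disable_request_streaming: true
```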