Commit graph

60 commits

Author SHA1 Message Date
kingbri
4aebe8a2a5 Config: Use an explicit "auto" value for rope_alpha
Using "auto" for rope alpha removes ambiguity on how to explicitly
enable automatic rope calculation. The same behavior of None -> auto
calculate still exists, but can be overwritten if a model's tabby_config.yml
includes `rope_alpha`.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-31 22:59:56 -04:00
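
A minimal sketch of how the explicit "auto" value can be resolved; the helper name and the auto-calculation callback are illustrative assumptions, not tabbyAPI's actual code:

```python
def resolve_rope_alpha(rope_alpha, auto_calculate):
    """Treat both None and the explicit "auto" string as auto-calculation."""
    if rope_alpha is None or rope_alpha == "auto":
        return auto_calculate()
    return float(rope_alpha)

# A rope_alpha pinned in a model's tabby_config.yml overrides the auto path.
alpha = resolve_rope_alpha(2.5, auto_calculate=lambda: 1.0)  # -> 2.5
```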
kingbri
364032e39e Config: Remove development flag from tensor parallel
Exists in stable ExllamaV2 version.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-22 14:15:19 -04:00
kingbri
871c89063d Model: Add Tensor Parallel support
Use the tensor parallel loader when the flag is enabled. The new loader
has its own autosplit implementation, so gpu_split_auto isn't valid
here.

Also make it easier to determine which cache type to use rather than
multiple if/else statements.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-22 14:15:19 -04:00
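
As a hedged illustration of the cache selection cleanup, a lookup table can replace chained if/else branches; this assumes exllamav2 0.1.5+ exports the quantized cache classes at the package top level, and the exact mapping tabbyAPI uses may differ:

```python
from exllamav2 import (
    ExLlamaV2Cache,
    ExLlamaV2Cache_Q4,
    ExLlamaV2Cache_Q6,
    ExLlamaV2Cache_Q8,
)

# Map a cache_mode string from config to a cache class, defaulting to FP16.
CACHE_CLASSES = {
    "FP16": ExLlamaV2Cache,
    "Q4": ExLlamaV2Cache_Q4,
    "Q6": ExLlamaV2Cache_Q6,
    "Q8": ExLlamaV2Cache_Q8,
}

def pick_cache_class(cache_mode: str):
    return CACHE_CLASSES.get(cache_mode.upper(), ExLlamaV2Cache)
```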
kingbri
3e42211c3e Config: Embeddings: Make embeddings_device a default when API loading
When loading from the API, the fallback for embeddings_device will be
the same as the config.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-01 13:59:49 -04:00
Brian Dashore
1bf062559d
Merge pull request #158 from AlpinDale/embeddings
feat: add embeddings support via Infinity-emb
2024-07-31 20:33:12 -04:00
kingbri
46304ce875 Model: Properly pass in max_batch_size from config
The override wasn't being passed in before. Also, the default is now
None since Exl2 can automatically calculate the max batch size.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-30 18:42:25 -04:00
kingbri
dc3dcc9c0d Embeddings: Update config, args, and parameter names
Use embeddings_device as the parameter for device to remove ambiguity.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-30 15:32:26 -04:00
kingbri
bfa011e0ce Embeddings: Add model management
Embedding models are managed on a separate backend, but are run
in parallel with the model itself. Therefore, manage this in a separate
container with separate routes.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-30 15:19:27 -04:00
kingbri
3f21d9ef96 Embeddings: Switch to Infinity
Infinity-emb is an async batching engine for embeddings. This is
preferable to sentence-transformers since it handles scalable use cases
without the need for external thread intervention.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-29 13:42:03 -04:00
kingbri
42bc4adcfb Config: Add option to set priority to realtime
Realtime process priority directs system resources toward tabby's
processes. Running as administrator will give realtime priority,
while running as a normal user will set high priority.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-24 21:50:06 -04:00
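
A rough sketch of the priority bump using psutil; this is an assumption, and tabbyAPI's actual implementation may differ:

```python
import os
import psutil

def elevate_process_priority():
    """Ask for realtime priority; unelevated processes end up at high priority.

    On Windows, a realtime request without administrator rights is silently
    downgraded to high priority by the OS, matching the behavior above.
    """
    proc = psutil.Process()
    try:
        if os.name == "nt":
            proc.nice(psutil.REALTIME_PRIORITY_CLASS)
        else:
            proc.nice(-20)  # Closest POSIX analogue; requires root.
    except psutil.AccessDenied:
        print("Not enough permissions to raise process priority")
```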
kingbri
5c082b7e8c Async: Add option to use Uvloop/Winloop
These are faster event loops for asyncio which should improve overall
performance. Gate these under an experimental flag for now to stress
test these loops.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-24 18:59:20 -04:00
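
A minimal sketch of the loop swap, assuming uvloop on POSIX and its Winloop fork on Windows; the exact wiring behind the experimental flag may differ:

```python
import asyncio
import platform

def install_fast_event_loop():
    """Swap asyncio's default loop for uvloop (POSIX) or winloop (Windows)."""
    if platform.system() == "Windows":
        from winloop import EventLoopPolicy
    else:
        from uvloop import EventLoopPolicy
    asyncio.set_event_loop_policy(EventLoopPolicy())
```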
kingbri
300f034233 API: Add config option to select servers
Always enable the core endpoints and allow servers to be selected
as needed. Use the OAI server by default.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-23 14:27:42 -04:00
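
A hedged sketch of the selection logic with FastAPI routers; the router names and config value are stand-ins, not tabbyAPI's actual module layout:

```python
from fastapi import APIRouter, FastAPI

# Stand-in routers; the real project keeps these in separate server modules.
core_router = APIRouter()
oai_router = APIRouter(prefix="/v1")

def build_app(api_servers=None) -> FastAPI:
    """Core endpoints are always mounted; extra servers come from config."""
    app = FastAPI()
    app.include_router(core_router)

    available = {"oai": oai_router}
    for name in api_servers or ["oai"]:  # OAI server is the default
        if name in available:
            app.include_router(available[name])
    return app
```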
kingbri
3826815edb API: Add request logging
Log all the parts of a request if the config flag is set. The logged
fields are all server-side anyway, so nothing is being exposed to
clients.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-22 21:40:00 -04:00
kingbri
6019c93637 Networking: Gate sending tracebacks over the API
It's possible that tracebacks can give too much info about a system
when sent over the API. Gate this under a flag to send them only
when debugging since this feature is still useful.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-14 10:30:11 -04:00
kingbri
27d2d5f3d2 Config + Model: Allow for default fallbacks from config for model loads
Previously, the parameters under the "model" block in config.yml only
handled the loading of a model on startup. This meant that any subsequent
API request required each parameter to be filled out or use a sane default
(usually defaults to the model's config.json).

However, there are cases where admins may want an argument from the
config to apply if the parameter isn't provided in the request body.
To help alleviate this, add a mechanism that works like sampler overrides
where users can specify a flag that acts as a fallback.

Therefore, this change both preserves the source of truth of what
parameters the admin is loading and adds some convenience for users
that want customizable defaults for their requests.

This behavior may change in the future, but I think it solves the
issue for now.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-06 17:50:58 -04:00
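
A minimal sketch of the fallback mechanism, assuming a hypothetical use_as_default list that marks which config keys may fill in missing request fields:

```python
def apply_load_fallbacks(request_params: dict, model_config: dict) -> dict:
    """Fill unset request fields from config keys flagged as fallbacks."""
    merged = dict(request_params)
    for key in model_config.get("use_as_default", []):
        if merged.get(key) is None and key in model_config:
            merged[key] = model_config[key]
    return merged

# Example: max_seq_len falls back to config because the request omitted it.
config = {"use_as_default": ["max_seq_len"], "max_seq_len": 8192}
print(apply_load_fallbacks({"max_seq_len": None, "cache_mode": "Q4"}, config))
```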
DocShotgun
55d979b7a5
Update dependencies, support Python 3.12, update for exl2 0.1.5 (#134)
* Dependencies: Add wheels for Python 3.12

* Model: Switch fp8 cache to Q8 cache

* Model: Add ability to set draft model cache mode

* Dependencies: Bump exllamav2 to 0.1.5

* Model: Support Q6 cache

* Config: Add Q6 cache and draft_cache_mode to config sample
2024-06-09 17:27:39 +02:00
kingbri
bec919e202 Config: Change cache_size description and location
Makes more sense to place cache_size with the other cache options.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-26 20:50:56 -04:00
DocShotgun
767e6a798a API + Model: Add support for specifying k/v cache size 2024-05-26 14:17:01 -07:00
kingbri
408c66a1f2 Model: Change FA2 and paged attention checks
The dynamic generator requires Flash attention 2.5.7 or higher to
be installed. This is only supported on Nvidia's 30 series and higher.

If a card is AMD or lower than the 30 series, switch to compatibility
mode which functions the same way as the older generator, except
without parallel batching and any features that depend on it, such as
CFG.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-25 21:16:14 -04:00
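
A hedged sketch of such a capability gate, using the thresholds named in the commit (flash-attn >= 2.5.7, compute capability 8.0+ for the 30 series and up); the real check may differ:

```python
from importlib.metadata import version

import torch
from packaging.version import parse

def supports_paged_attention() -> bool:
    """Return False (compatibility mode) for AMD cards, pre-Ampere GPUs,
    or a missing/old flash-attn install."""
    if not torch.cuda.is_available() or torch.version.hip:
        return False
    major, _ = torch.cuda.get_device_capability()
    if major < 8:
        return False
    try:
        return parse(version("flash_attn")) >= parse("2.5.7")
    except Exception:
        return False
```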
kingbri
d759a15559 Model: Fix chunk size handling
The wrong class attribute name was used for max_attention_size; this
also fixes the declaration of the draft model's chunk_size.

Also expose the parameter to the end user in both config and model
load.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-07 18:39:19 -04:00
kingbri
46ac3beea9 Templates: Support list style chat_template keys
HuggingFace updated transformers to provide templates in a list for
tokenizers. Update to support this new format. Providing the name
of a template for the "prompt_template" value in config.yml will also
look inside the template list.

In addition, log if there's a template exception, but continue model
loading since it shouldn't shut down the application.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-07 11:20:25 -04:00
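
A small sketch of handling both template formats; the "default" fallback name follows the transformers convention and is an assumption about tabbyAPI's lookup:

```python
def find_chat_template(chat_template, requested_name=None):
    """Resolve a string or list-style chat_template to a template string."""
    if chat_template is None or isinstance(chat_template, str):
        return chat_template
    wanted = requested_name or "default"
    for entry in chat_template:
        if entry.get("name") == wanted:
            return entry.get("template")
    return None
```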
kingbri
08bcc6307a Config: Update description part 2
Forgot to change wording.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-17 01:07:23 -04:00
kingbri
7abbac098a Config: Update Q4 in comments
Wasn't present when the option was added.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-17 01:04:12 -04:00
DocShotgun
8245488926
Additional clarification for override_base_seq_len 2024-03-02 09:29:50 -08:00
kingbri
949248fb94 Config: Add experimental torch cuda malloc backend
This option saves some VRAM, but does have the chance to error out.
Add this in the experimental config section.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-14 21:45:56 -05:00
kingbri
2f568ff573 Config: Expose auto GPU split reserve config
The GPU reserve is used as a VRAM buffer to prevent GPU overflow
when automatically deciding how to load a model on multiple GPUs.
Make this configurable.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-08 22:09:50 -05:00
kingbri
58590a6c57 Config: Add option to force streaming off
Many APIs automatically ask for request streaming without giving
the user the option to turn it off. Therefore, give the user more
freedom by giving a server-side kill switch.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-07 21:09:59 -05:00
kingbri
c0ad647fa7 Model: Auto-detect a one GPU setup and fix gpu_split_auto
It makes more sense to use GPU split parameters when the user has
more than one GPU. Otherwise, set split and split_auto to False and
save the user some VRAM.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-06 23:08:57 -05:00
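
A minimal sketch of the detection, assuming torch is available to count devices:

```python
import torch

def resolve_gpu_split(gpu_split_auto: bool, gpu_split):
    """Disable split options on single-GPU machines to save the extra VRAM
    buffer that autosplit would otherwise reserve."""
    if torch.cuda.device_count() <= 1:
        return False, None
    return gpu_split_auto, gpu_split
```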
kingbri
849179df17 Model: Make loading use less VRAM
The model loader was using more VRAM on a single GPU compared to
base exllamav2's loader. This was because single-GPU setups were
running with the autosplit config, which allocates an extra VRAM
buffer for safe loading. Turn this off for single-GPU setups (and
turn it off by default).

This change should allow users to run models which require the
entire card with hopefully faster T/s. For example, Mixtral with
3.75bpw increased from ~30T/s to 50T/s due to the extra vram headroom
on Windows.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-06 22:29:56 -05:00
kingbri
1919bf7705 Launch: Make exllamav2 requirement more friendly
Add the ability to use an unsafe config flag if needed and migrate
the exl2 check to a different file within the exl2 backend code.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-02 23:36:17 -05:00
kingbri
fc4570220c API + Model: Add new parameters and clean up documentation
The example JSON fields were changed because of the new sampler
default strategy. Fix these by manually changing the values.

Also add support for fasttensors and expose generate_window to
the API. It's recommended not to adjust generate_window, as it's
dynamically scaled based on max_seq_len by default.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-01-25 00:15:40 -05:00
kingbri
6c30f24c83 Tree: Unify sampler parameters and add override support
Unify API sampler params into a superclass which should make them
easier to manage and inherit generic functions from.

Not all frontends expose all sampling parameters due to their
connections with OAI (which handles sampling itself, with the
exception of a few sliders).

Add the ability for the user to customize fallback parameters from
server-side.

In addition, parameters can be forced to a certain value server-side
in case the repo automatically sets other sampler values in the
background that the user doesn't want.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-01-25 00:15:40 -05:00
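
A hedged sketch of the precedence described above (forced value, then request value, then server-side fallback); the override/force keys mirror the sampler override idea but are illustrative:

```python
def resolve_sampler_value(name: str, request_value, overrides: dict):
    """Apply server-side forced values first, then the request, then fallback."""
    entry = overrides.get(name, {})
    if entry.get("force"):
        return entry.get("override")
    if request_value is not None:
        return request_value
    return entry.get("override", request_value)

# Example: temperature is forced server-side, top_p falls back to the default.
overrides = {
    "temperature": {"override": 0.7, "force": True},
    "top_p": {"override": 0.9, "force": False},
}
print(resolve_sampler_value("temperature", 1.2, overrides))  # 0.7
print(resolve_sampler_value("top_p", None, overrides))       # 0.9
```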
kingbri
6b04463051 API: Fix CFG reporting
The model endpoint wasn't reporting if CFG is on.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-01-02 13:54:16 -05:00
kingbri
b378773d0a Model: Add CFG support
CFG, or classifier-free guidance, helps push a model in different
directions based on what the user provides.

Currently, CFG is ignored if the negative prompt is blank (it shouldn't
be used in that way anyways).

Signed-off-by: kingbri <bdashore3@proton.me>
2024-01-02 01:46:51 -05:00
kingbri
4136f19058 Config: Make the sample a drop-in solution
With the new wiki, all parameters are fully documented along with
comments in the YAML file itself. This should help new users who
pull, copy the config, and can't start the API due to subsections
being uncommented and read.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-29 01:36:21 -05:00
kingbri
ec929728d9 Model: Read scale_pos_emb from config
In newer versions of exllamav2, this value is read from the model's
config.json. This value will still default to 1.0 anyway.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-28 21:14:24 -05:00
kingbri
c72d30918c Config: Default None -> Empty in comments
Empty makes more sense when talking about empty fields.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-28 00:32:29 -05:00
kingbri
3622710582 API: Fix num_experts_per_token reporting
This wasn't linked to the model config. This value can be 1 if
a MoE model isn't loaded.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-28 00:31:14 -05:00
kingbri
8fa764bfbe Auth: Add option to disable authentication
This creates a massive security hole, but it's gated behind a flag
for users who only use localhost.

A warning will pop up when users disable authentication.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-21 23:40:16 -05:00
kingbri
72e19dbc12 Config: Change default dirs in sample
Models and draft models default to the models directory while
loras default to the loras directory.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-21 22:35:03 -05:00
kingbri
bee758dae9 Config: Clarify rope parameters
Blank = automatic calculation of alpha value.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-20 21:15:06 -05:00
kingbri
ab10b263fd Model: Add override base seq len
Some models (such as Mistral and Mixtral) set their base sequence
length to 32k due to assumed support for sliding window
attention.

Therefore, add this parameter to override the base sequence length
of a model which helps with auto-calculation of rope alpha.

If auto-calculation of rope alpha isn't being used, the max_seq_len
parameter works fine as is.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-20 00:45:39 -05:00
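
To make the interaction concrete, here is a hedged sketch of automatic alpha calculation: the quadratic fit is a commonly used approximation, not necessarily tabbyAPI's exact coefficients, but it shows why a misreported base length (e.g. 32k) skews the result unless override_base_seq_len corrects it:

```python
def calculate_rope_alpha(base_seq_len: int, target_seq_len: int) -> float:
    """Derive rope alpha from the ratio of desired to native context length."""
    ratio = target_seq_len / base_seq_len
    if ratio <= 1:
        return 1.0
    return -0.13436 + 0.80541 * ratio + 0.28833 * ratio**2

# With the base overridden to 4096, extending to 8192 yields alpha ~ 2.6;
# a misreported 32k base would instead return 1.0 (no scaling applied).
print(calculate_rope_alpha(base_seq_len=4096, target_seq_len=8192))
```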
kingbri
ce2602df9a Model: Fix max seq len handling
Previously, the max sequence length was overridden by the user's
config and never took the model's config.json into account.

Now, set the default to 4096, but include config.prepare when
selecting the max sequence length. The yaml and API request
now serve as overrides rather than parameters.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-19 23:37:52 -05:00
kingbri
de9a19b5d3 Templating: Add generation prompt appending
Append generation prompts if given the flag on an OAI chat completion
request.

This appends the "assistant" message to the instruct prompt. Defaults
to true since this is intended behavior.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-18 23:53:47 -05:00
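
A minimal Jinja2 illustration of what appending the generation prompt looks like; the ChatML-style template text is an example, not tabbyAPI's bundled template:

```python
from jinja2 import Template

CHATML = Template(
    "{% for m in messages %}<|im_start|>{{ m.role }}\n{{ m.content }}<|im_end|>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
)

prompt = CHATML.render(
    messages=[{"role": "user", "content": "Hello"}],
    add_generation_prompt=True,  # defaults to true, matching the commit above
)
print(prompt)
```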
kingbri
f631dd6ff7 Templates: Switch to Jinja2
Jinja2 is a lightweight template parser that's used in Transformers
for parsing chat completions. It's much more efficient than Fastchat
and can be imported as part of requirements.

Also allows for unblocking Pydantic's version.

Users now have to provide their own template if needed. A separate
repo may be usable for common prompt template storage.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-18 23:53:47 -05:00
kingbri
ad8807a830 Model: Add support for num_experts_per_token
New parameter that's safe to edit in exllamav2 v0.0.11. Only recommended
for people who know what they're doing.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-17 18:03:01 -05:00
kingbri
083df7d585 Tree: Add generation logging support
Generations can be logged in the console along with sampling parameters
if the user enables it in config.

Metrics are always logged at the end of each prompt. In addition,
the model endpoint tells the user if they're being logged or not
for transparency purposes.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-12 23:43:35 -05:00
kingbri
db87efde4a OAI: Add ability to specify fastchat prompt template
Sometimes fastchat may not be able to detect the prompt template from
the model path. Therefore, add the ability to set it in config.yml or
via the request object itself.

Also send the provided prompt template on model info request.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-10 15:43:58 -05:00
DocShotgun
7380a3b79a Implement lora support (#24)
* Model: Implement basic lora support

* Add ability to load loras from config on launch
* Supports loading multiple loras and lora scaling
* Add function to unload loras

* Colab: Update for basic lora support

* Model: Test vram alloc after lora load, add docs

* Git: Add loras folder to .gitignore

* API: Add basic lora-related endpoints

* Add /loras/ endpoint for querying available loras
* Add /model/lora endpoint for querying currently loaded loras
* Add /model/lora/load endpoint for loading loras
* Add /model/lora/unload endpoint for unloading loras
* Move lora config-checking logic to main.py for better compat with API endpoints

* Revert bad CRLF line ending changes

* API: Add basic lora-related endpoints (fixed)

* Add /loras/ endpoint for querying available loras
* Add /model/lora endpoint for querying currently loaded loras
* Add /model/lora/load endpoint for loading loras
* Add /model/lora/unload endpoint for unloading loras
* Move lora config-checking logic to main.py for better compat with API endpoints

* Model: Unload loras first when unloading model

* API + Models: Cleanup lora endpoints and functions

Condenses down endpoint and model load code. Also makes the routes
behave the same way as model routes to help not confuse the end user.

Signed-off-by: kingbri <bdashore3@proton.me>

* Loras: Optimize load endpoint

Return successes and failures along with consolidating the request
to the rewritten load_loras function.

Signed-off-by: kingbri <bdashore3@proton.me>

---------

Co-authored-by: kingbri <bdashore3@proton.me>
Co-authored-by: DocShotgun <126566557+DocShotgun@users.noreply.github.com>
2023-12-08 23:38:08 -05:00
DocShotgun
39f7a2aabd
Expose draft_rope_scale 2023-12-05 12:59:32 -08:00