This option saves some VRAM, but it has a chance of erroring out.
Add it in the experimental config section.
Signed-off-by: kingbri <bdashore3@proton.me>
The GPU reserve is used as a VRAM buffer to prevent GPU overflow
when automatically deciding how to load a model on multiple GPUs.
Make this configurable.
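A rough sketch of how such a reserve might be applied during autosplit
loading (assuming exllamav2's load_autosplit accepts a per-GPU
reserve_vram list in bytes; the helper and option names are illustrative):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config

def load_with_reserve(model_dir: str, autosplit_reserve_mb: list[int]):
    """Sketch: convert a per-GPU reserve (MB) to bytes and pass it to autosplit."""
    config = ExLlamaV2Config()
    config.model_dir = model_dir
    config.prepare()

    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, lazy=True)

    # Assumption: load_autosplit takes a reserve_vram list, one entry per GPU
    reserve_bytes = [mb * 1024 ** 2 for mb in autosplit_reserve_mb]
    model.load_autosplit(cache, reserve_vram=reserve_bytes)
    return model, cache
```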
Signed-off-by: kingbri <bdashore3@proton.me>
Many APIs automatically request streaming without giving the user
the option to turn it off. Therefore, give the user more freedom
by providing a server-side kill switch.
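A minimal sketch of such a kill switch, assuming a
disable_request_streaming option in the config (the names are
illustrative):

```python
def resolve_stream_flag(request_stream: bool, config: dict) -> bool:
    """Ignore the client's stream flag when the operator disabled streaming."""
    if config.get("disable_request_streaming", False):
        return False
    return request_stream
```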
Signed-off-by: kingbri <bdashore3@proton.me>
It makes more sense to use GPU split parameters when the user has
more than one GPU. Otherwise, set split and split_auto to False and
save the user some VRAM.
Signed-off-by: kingbri <bdashore3@proton.me>
The model loader was using more VRAM on a single GPU compared to
base exllamav2's loader. This was because single GPUs were being
loaded with the autosplit config, which allocates an extra VRAM
buffer for safe loading. Turn this off for single-GPU setups (and
turn it off by default).
This change should allow users to run models that require the
entire card, hopefully with faster T/s. For example, Mixtral at
3.75bpw increased from ~30 T/s to ~50 T/s due to the extra VRAM
headroom on Windows.
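Roughly, the loading decision could branch like this (a sketch based
on exllamav2's load/load_autosplit API; helper and option names are
illustrative):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config

def load_model(model_dir: str, gpu_count: int, gpu_split: list[float] | None):
    config = ExLlamaV2Config()
    config.model_dir = model_dir
    config.prepare()

    model = ExLlamaV2(config)

    if gpu_count > 1 and gpu_split:
        # Manual split across multiple GPUs
        model.load(gpu_split)
        cache = ExLlamaV2Cache(model)
    elif gpu_count > 1:
        # Autosplit: reserves an extra VRAM buffer for safe loading
        cache = ExLlamaV2Cache(model, lazy=True)
        model.load_autosplit(cache)
    else:
        # Single GPU: plain load, no autosplit buffer
        model.load()
        cache = ExLlamaV2Cache(model)

    return model, cache
```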
Signed-off-by: kingbri <bdashore3@proton.me>
Add the ability to use an unsafe config flag if needed and migrate
the exl2 check to a different file within the exl2 backend code.
Signed-off-by: kingbri <bdashore3@proton.me>
The example JSON fields were changed because of the new sampler
default strategy. Fix these by manually changing the values.
Also add support for fasttensors and expose generate_window to
the API. It's recommended not to adjust generate_window, as it's
dynamically scaled based on max_seq_len by default.
Signed-off-by: kingbri <bdashore3@proton.me>
Unify API sampler params into a superclass, which should make them
easier to manage and lets them inherit generic functions from one
place.
Not all frontends expose every sampling parameter due to their ties
to OAI (which handles sampling itself, with the exception of a few
sliders).
Add the ability for the user to customize fallback parameters
server-side.
In addition, parameters can be forced to a certain value server-side
in case a frontend automatically sets other sampler values in the
background that the user doesn't want.
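A rough sketch of the superclass idea using Pydantic (field and
method names are illustrative, not the project's exact schema):

```python
from pydantic import BaseModel

class BaseSamplerRequest(BaseModel):
    """Shared sampling fields that the API request classes inherit from."""
    temperature: float = 1.0
    top_k: int = 0
    top_p: float = 1.0
    repetition_penalty: float = 1.0

    def resolve(self, fallbacks: dict, forced: dict) -> dict:
        """Priority: class defaults < server fallbacks < client values < forced values."""
        params = self.model_dump()                           # start from defaults
        params.update(fallbacks)                             # server-side fallbacks
        params.update(self.model_dump(exclude_unset=True))   # client-provided values win
        params.update(forced)                                # forced values always win
        return params
```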
Signed-off-by: kingbri <bdashore3@proton.me>
CFG, or classifier-free guidance, helps push a model in different
directions based on what the user provides.
Currently, CFG is ignored if the negative prompt is blank (it
shouldn't be used that way anyway).
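The guard can be as simple as this sketch (argument names are
illustrative):

```python
def should_apply_cfg(cfg_scale: float | None, negative_prompt: str | None) -> bool:
    """Skip CFG entirely when no (non-blank) negative prompt is provided."""
    return bool(cfg_scale) and bool(negative_prompt and negative_prompt.strip())
```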
Signed-off-by: kingbri <bdashore3@proton.me>
With the new wiki, all parameters are fully documented along with
comments in the YAML file itself. This should help new users who
pull the repo, copy the config, and then can't start the API because
subsections were uncommented and read.
Signed-off-by: kingbri <bdashore3@proton.me>
In newer versions of exllamav2, this value is read from the model's
config.json. This value will still default to 1.0 anyway.
Signed-off-by: kingbri <bdashore3@proton.me>
This creates a massive security hole, but it's gated behind a flag
for users who only use localhost.
A warning will pop up when users disable authentication.
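A minimal sketch of the gated check, assuming a disable_auth flag in
the config (names are illustrative):

```python
import logging

logger = logging.getLogger(__name__)

def check_api_key(provided_key: str, valid_keys: set[str], disable_auth: bool) -> bool:
    """Skip key checks only when the operator explicitly disables auth."""
    if disable_auth:
        logger.warning(
            "Authentication is disabled. Only do this if the API is bound to localhost!"
        )
        return True
    return provided_key in valid_keys
```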
Signed-off-by: kingbri <bdashore3@proton.me>
Some models (such as Mistral and Mixtral) set their base sequence
length to 32k because they assume support for sliding window
attention.
Therefore, add this parameter to override the base sequence length
of a model which helps with auto-calculation of rope alpha.
If auto-calculation of rope alpha isn't being used, the max_seq_len
parameter works fine as is.
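As a rough illustration of why the base length matters for the
auto-calculation (the heuristic below is a generic NTK-style ratio,
not necessarily the exact formula used here):

```python
def estimate_rope_alpha(max_seq_len: int, base_seq_len: int) -> float:
    """Hypothetical auto-calculation: scale rope alpha with the context ratio."""
    ratio = max_seq_len / base_seq_len
    return 1.0 if ratio <= 1.0 else ratio

# A config.json reporting 32768 (sliding-window assumption) suppresses scaling:
print(estimate_rope_alpha(16384, 32768))  # 1.0 -- no scaling applied
# Overriding the base to the model's real trained context (example: 8192):
print(estimate_rope_alpha(16384, 8192))   # 2.0 -- alpha scaling kicks in
```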
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, the max sequence length was overridden by the user's
config and never took the model's config.json into account.
Now, set the default to 4096, but include config.prepare when
selecting the max sequence length. The yaml and API request
now serve as overrides rather than parameters.
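A sketch of the resulting override order, assuming exllamav2's
config.prepare() fills max_seq_len from the model's config.json
(argument names are illustrative):

```python
from exllamav2 import ExLlamaV2Config

def resolve_max_seq_len(model_dir: str, yaml_value: int | None, request_value: int | None) -> int:
    """Override order: API request > config.yml > model config.json > 4096."""
    config = ExLlamaV2Config()
    config.model_dir = model_dir
    config.prepare()  # reads the sequence length from config.json

    return request_value or yaml_value or config.max_seq_len or 4096
```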
Signed-off-by: kingbri <bdashore3@proton.me>
Append a generation prompt if given the flag on an OAI chat completion
request.
This appends the "assistant" message header to the instruct prompt.
Defaults to true since this is the intended behavior.
Signed-off-by: kingbri <bdashore3@proton.me>
Jinja2 is a lightweight templating engine that's used in Transformers
for rendering chat templates. It's much more efficient than Fastchat
and can be installed as part of requirements.
This also allows Pydantic's version to be unpinned.
Users now have to provide their own template if needed. A separate
repo may be usable for common prompt template storage.
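A minimal example of rendering a chat template with Jinja2, following
the Transformers chat-template convention (the ChatML template here
is just an example; real templates are user-provided):

```python
from jinja2 import Template

# Tiny ChatML-style template for illustration only
CHATML_TEMPLATE = (
    "{% for message in messages %}"
    "<|im_start|>{{ message['role'] }}\n{{ message['content'] }}<|im_end|>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
)

prompt = Template(CHATML_TEMPLATE).render(
    messages=[{"role": "user", "content": "Hello!"}],
    add_generation_prompt=True,  # appends the assistant header (see commit above)
)
print(prompt)
```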
Signed-off-by: kingbri <bdashore3@proton.me>
New parameter that's safe to edit in exllamav2 v0.0.11. Only recommended
for people who know what they're doing.
Signed-off-by: kingbri <bdashore3@proton.me>
Generations can be logged in the console along with sampling parameters
if the user enables it in config.
Metrics are always logged at the end of each prompt. In addition,
the model endpoint tells the user whether they're being logged,
for transparency purposes.
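A rough sketch of the logging behavior (option names are assumptions):

```python
import logging
import time

logger = logging.getLogger(__name__)

def log_generation(log_config: dict, prompt: str, params: dict,
                   token_count: int, start_time: float) -> None:
    """Log prompt/sampler params only when enabled; always log metrics."""
    if log_config.get("prompt", False):
        logger.info(f"Prompt: {prompt}")
    if log_config.get("generation_params", False):
        logger.info(f"Sampler params: {params}")

    elapsed = max(time.time() - start_time, 1e-6)
    logger.info(
        f"Metrics: {token_count} tokens generated in {elapsed:.2f}s "
        f"({token_count / elapsed:.2f} T/s)"
    )
```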
Signed-off-by: kingbri <bdashore3@proton.me>
Sometimes fastchat may not be able to detect the prompt template from
the model path. Therefore, add the ability to set it in config.yml or
via the request object itself.
Also send the provided prompt template on model info request.
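A sketch of the resolution order, using FastChat's
get_conversation_template for the path-based fallback (option names
are illustrative):

```python
from fastchat.model import get_conversation_template

def resolve_conversation(model_path: str, config_template: str | None,
                         request_template: str | None):
    """Priority: request override > config.yml > detection from the model path."""
    template_name = request_template or config_template
    if template_name:
        return get_conversation_template(template_name)
    # Fall back to FastChat's detection from the model path
    return get_conversation_template(model_path)
```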
Signed-off-by: kingbri <bdashore3@proton.me>
* Model: Implement basic lora support
* Add ability to load loras from config on launch
* Supports loading multiple loras and lora scaling
* Add function to unload loras
* Colab: Update for basic lora support
* Model: Test vram alloc after lora load, add docs
* Git: Add loras folder to .gitignore
* API: Add basic lora-related endpoints
* Add /loras/ endpoint for querying available loras
* Add /model/lora endpoint for querying currently loaded loras
* Add /model/lora/load endpoint for loading loras
* Add /model/lora/unload endpoint for unloading loras
* Move lora config-checking logic to main.py for better compat with API endpoints
* Revert bad CRLF line ending changes
* API: Add basic lora-related endpoints (fixed)
* Model: Unload loras first when unloading model
* API + Models: Cleanup lora endpoints and functions
Condenses endpoint and model load code. Also makes the lora routes
behave the same way as the model routes to avoid confusing the end user.
Signed-off-by: kingbri <bdashore3@proton.me>
* Loras: Optimize load endpoint
Return successes and failures along with consolidating the request
handling into the rewritten load_loras function (see the sketch at
the end of this commit message).
Signed-off-by: kingbri <bdashore3@proton.me>
---------
Co-authored-by: kingbri <bdashore3@proton.me>
Co-authored-by: DocShotgun <126566557+DocShotgun@users.noreply.github.com>
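A rough sketch of the consolidated load endpoint's shape (route,
request, and helper names are illustrative, based on the endpoints
listed above):

```python
from fastapi import APIRouter
from pydantic import BaseModel

router = APIRouter()

class LoraLoadRequest(BaseModel):
    # e.g. [{"name": "my-lora", "scaling": 1.0}]
    loras: list[dict]

def load_lora(name: str, scaling: float) -> None:
    """Stand-in for the backend loader; the real logic lives in the model code."""
    ...

@router.post("/model/lora/load")
async def load_loras_endpoint(request: LoraLoadRequest):
    """Report which loras loaded and which failed."""
    successes, failures = [], []
    for lora in request.loras:
        try:
            load_lora(lora["name"], lora.get("scaling", 1.0))
            successes.append(lora["name"])
        except Exception:
            failures.append(lora["name"])
    return {"success": successes, "failure": failures}
```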
low_mem doesn't work in exl2, and it was an experimental option to
begin with. Keep the loading code commented out in case it gets fixed
in the future.
A better alternative is the 8-bit cache, which works and helps save
VRAM.
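A sketch of the cache selection, assuming exllamav2's
ExLlamaV2Cache_8bit class (the flag name is illustrative):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Cache_8bit

def create_cache(model: ExLlamaV2, use_8bit_cache: bool):
    """Pick the cache class from a config flag."""
    cache_class = ExLlamaV2Cache_8bit if use_8bit_cache else ExLlamaV2Cache
    return cache_class(model)
```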
Signed-off-by: kingbri <bdashore3@proton.me>
Some APIs require an OAI model name to be returned from the models
endpoint. Fix this by adding a gpt-3.5-turbo entry first in the list
to cover as many APIs as possible.
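A sketch of what the returned list could look like (field names
follow the OAI models spec; the owned_by value is illustrative):

```python
import time

def build_model_list(loaded_model: str) -> dict:
    """Dummy gpt-3.5-turbo entry first, then the actually loaded model."""
    entries = ["gpt-3.5-turbo", loaded_model]
    return {
        "object": "list",
        "data": [
            {"id": name, "object": "model", "created": int(time.time()), "owned_by": "tabbyAPI"}
            for name in entries
        ],
    }
```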
Signed-off-by: kingbri <bdashore3@proton.me>
This helps clarify things when users are configuring for the first
time. For example, some users were putting the model name in the
"model" block instead of the "model_name" field.
Signed-off-by: kingbri <bdashore3@proton.me>
Speculative decoding makes use of a smaller draft model that proposes
tokens which the main model then verifies, speeding up generation.
Add options in the config to support this. API options will occur
in a different commit.
Signed-off-by: kingbri <bdashore3@proton.me>
Models can be loaded and unloaded via the API. Also add authentication
to use the API and for administrator tasks.
Both types of authorization use different keys.
Also fix the unload function to properly free all used VRAM.
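A rough sketch of an unload that actually releases VRAM, using the
standard gc + torch.cuda.empty_cache() pattern (the container
attributes are illustrative):

```python
import gc

import torch

def unload_model(container) -> None:
    """Drop all references, then force CUDA to release the memory."""
    if container.model is not None:
        container.model.unload()  # assumes exllamav2's ExLlamaV2.unload()
    container.model = None
    container.cache = None
    container.tokenizer = None

    gc.collect()
    torch.cuda.empty_cache()
```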
Signed-off-by: kingbri <bdashore3@proton.me>
YAML is a more flexible format when it comes to configuration.
Command-line arguments are difficult to remember and configure,
especially for an API with complicated option names. Rather than
using half-baked text files, implement a proper config solution.
Also add a progress bar when loading models in the commandline.
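A minimal sketch of reading the config with safe defaults (section
names are illustrative):

```python
import yaml

def read_config(path: str = "config.yml") -> dict:
    """Read config.yml; a missing file or empty document yields an empty dict."""
    try:
        with open(path, encoding="utf8") as config_file:
            return yaml.safe_load(config_file) or {}
    except FileNotFoundError:
        return {}

config = read_config()
model_config = config.get("model", {})
network_config = config.get("network", {})
```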
Signed-off-by: kingbri <bdashore3@proton.me>