
641 commits

Author SHA1 Message Date
kingbri
cce97deea5 Model: Switch logprobs to use post-sampling
Previously, pre-sampling logprobs were used from the raw logits,
but newer versions of exl2 allow for returning token probs post-sampling.
Convert these to logprobs and send to the user.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-14 21:51:25 -05:00
kingbri
949248fb94 Config: Add experimental torch cuda malloc backend
This option saves some VRAM, but does have the chance to error out.
Add this in the experimental config section.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-14 21:45:56 -05:00
kingbri
664e2c417e Model: Fix GPU split args loading
Autosplit was overwriting a manual GPU split if the YAML parameter
wasn't set.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-14 17:42:20 -05:00
kingbri
a79c42ff4c Sampling: Make validators simpler
Injecting into Pydantic fields caused issues with serialization for
documentation rendering. Rather than reinvent the wheel again,
switch to a chain of if statements for now. This may change in the
future if subclasses from the base sampler request need to be
validated as well.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-11 15:28:43 -05:00
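As an illustration of the "chain of if statements" approach described in this commit, here is a minimal sketch. The function name, the parameters checked, and the ranges are assumptions chosen for the example and are not taken from the repository.

```python
def validate_sampler_params(temperature: float = 1.0, top_p: float = 1.0) -> None:
    """Raise ValueError on the first out-of-range sampler value.

    Hypothetical helper: the parameter names and ranges are illustrative.
    """
    if temperature < 0.0:
        raise ValueError(f"temperature must be >= 0, got {temperature}")
    if not 0.0 <= top_p <= 1.0:
        raise ValueError(f"top_p must be within [0, 1], got {top_p}")
```

Plain checks like these keep the request model free of injected metadata, which is the serialization concern the commit message raises.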
kingbri
f627485534 OAI: Fix completion token fetching
The generator returns generated_tokens in the dict.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-11 01:12:13 -05:00
kingbri
7e730e3507 Sampling: Add universal validation system
Rather than maintaining yet another function to validate sampler
ranges/values, embed them in fields, which allows for less
maintenance in the future.

Also add validation for existing samplers that can corrupt
the sampling stack if set improperly.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-10 14:59:23 -05:00
kingbri
9f1d891490 Packages: Fix exllamav2 version check
Post versions are ok to use for checking if the user is on the correct
exllamav2 wheel.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-10 14:00:26 -05:00
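A rough sketch of a version check that accepts post releases, as this commit describes. It uses only the standard library and compares the numeric release segment; `release_tuple` and `meets_requirement` are hypothetical names, and the actual check may rely on a dedicated version-parsing library instead.

```python
import re
from typing import Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Extract the numeric release segment, ignoring suffixes such as .post1."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_requirement(installed: str, required: str) -> bool:
    """True when the installed release is at least the required release."""
    return release_tuple(installed) >= release_tuple(required)
```

Under this scheme a wheel reporting `0.0.13.post1` satisfies a requirement of `0.0.13`.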
kingbri
8d8cf5dc69 Model: Fix dynatemp fallback
Set to 1.0 if the condition fails.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-10 12:02:31 -05:00
Brian Dashore
17636ed899 Create pull request template
Asks users to give more information when committing a pull request.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-09 14:53:29 -05:00
Brian Dashore
c3601bdd18 Issues: Disable blank issues
Users must follow the appropriate issue templates.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-09 14:48:03 -05:00
Brian Dashore
aa56ff829f Add issue templates
Creates templates for issues to help guide users in the right direction when making a bug report or request.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-09 14:43:33 -05:00
kingbri
2f568ff573 Config: Expose auto GPU split reserve config
The GPU reserve is used as a VRAM buffer to prevent GPU overflow
when automatically deciding how to load a model on multiple GPUs.
Make this configurable.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-08 22:09:50 -05:00
kingbri
43bba526bf Model: Fix logprobs unwrapping
Take the log of the token probs, since they're already normalized;
this reflects the proper value. Also, don't error out if a token prob
doesn't exist in the dict; return None from zip instead.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-08 21:26:53 -05:00
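The unwrapping described in this commit could look roughly like the sketch below. It assumes the probabilities arrive as a dict keyed by token and are already normalized; `unwrap_logprobs` and its arguments are hypothetical names, not the project's API.

```python
import math
from typing import Dict, List, Optional


def unwrap_logprobs(
    tokens: List[str], token_probs: Dict[str, float]
) -> Dict[str, Optional[float]]:
    """Map each token to log(prob), or None when no probability is available."""
    logprobs: Dict[str, Optional[float]] = {}
    for token in tokens:
        prob = token_probs.get(token)
        # Missing or zero probabilities have no finite log, so yield None
        logprobs[token] = math.log(prob) if prob else None
    return logprobs
```

Returning None for a missing entry matches the commit title "OAI: Update logprobs type", which makes the logprobs type optional.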
kingbri
c7428f0bcd API: Add logprobs for chat completions
Adds chat completion logprob support using OAI's spec. Tokens are
not converted to tiktoken here since that will add an extra dependency
for no real reason.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-08 21:26:53 -05:00
kingbri
c02fe4d1db API: Fix response creation
Change chat completion and text completion responses to be more
flexible.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-08 21:26:53 -05:00
kingbri
0af6a38af3 Model: Add logprobs support
Returns token offsets, selected tokens, probabilities of tokens
post-sampling, and normalized probability of selecting a token
pre-sampling (for efficiency purposes).

Only for text completions. Chat completions in a later commit.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-08 21:26:53 -05:00
kingbri
2642ef7156 OAI: Update logprobs type
Some logprobs cannot exist, so make the type optional

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-08 21:26:53 -05:00
kingbri
284f20263f API: Clean up tokenizing endpoint
Split the get tokens function into separate wrapper encode and decode
functions for overall code cleanliness.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-08 21:26:53 -05:00
AliCat
bb48f77ca1 Neutralize samplers (#59)
* Update sample_preset.yml

Neutralized the samplers.

* Sampling: Fix dynatemp defaults

Default max temp and min temp is 1.0

* Sampling: Fix TFS defaults

Default is 1.0

---------

Co-authored-by: AliCat <86847834+alicat22@users.noreply.github.com>
Co-authored-by: kingbri <bdashore3@proton.me>
2024-02-08 00:23:09 -05:00
kingbri
321c9a1ea9 Requirements: Fix FA2 version number
The URL wasn't edited correctly

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-07 21:37:30 -05:00
kingbri
58590a6c57 Config: Add option to force streaming off
Many APIs automatically ask for request streaming without giving
the user the option to turn it off. Therefore, give the user more
freedom by giving a server-side kill switch.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-07 21:09:59 -05:00
kingbri
d0027bce32 Requirements: Update flash attention 2 for Windows
Version 2.5.2

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-07 20:44:23 -05:00
kingbri
c0ad647fa7 Model: Auto-detect a one GPU setup and fix gpu_split_auto
It makes more sense to use gpu split parameters when the user has
>1 GPUs. Otherwise, set split and split_auto to False and save
the user some VRAM.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-06 23:08:57 -05:00
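A minimal sketch of the single-GPU detection described above, combined with the rule from "Model: Fix GPU split args loading" that a manual split should not be overwritten by autosplit. The function name and signature are assumptions; the GPU count is passed in rather than queried so the example stays self-contained.

```python
from typing import List, Optional, Tuple


def resolve_gpu_split(
    gpu_count: int,
    gpu_split_auto: bool,
    gpu_split: Optional[List[float]],
) -> Tuple[bool, Optional[List[float]]]:
    """Return the (auto, manual) split settings to actually use."""
    if gpu_count <= 1:
        # Nothing to split across, so disable both modes
        return False, None
    if gpu_split:
        # A manual split takes precedence over autosplit (assumed ordering)
        return False, gpu_split
    return gpu_split_auto, None
```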
kingbri
849179df17 Model: Make loading use less VRAM
The model loader was using more VRAM on a single GPU compared to
base exllamav2's loader. This was because single GPUs were running
using the autosplit config which allocates an extra VRAM buffer
for safe loading. Turn this off for single-GPU setups (and turn
it off by default).

This change should allow users to run models which require the
entire card with hopefully faster T/s. For example, Mixtral with
3.75bpw increased from ~30T/s to 50T/s due to the extra VRAM headroom
on Windows.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-06 22:29:56 -05:00
kingbri
fedebadc81 Model: Fix generate window fallback
Use max_seq_len as the numerator, not the max_tokens. Mismatched
parameter.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-06 14:48:42 -05:00
kingbri
543a9b68c8 Requirements: Update Exllamav2 to 0.0.13.post1
Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-04 21:25:57 -05:00
kingbri
f10a5cfee6 Auth: Create keys on different exception
FileNotFoundError is the proper exception to catch here rather than
OSError.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-04 01:56:42 -05:00
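To show why the narrower exception matters, here is a hedged sketch of a load-or-create routine. It stores keys as JSON purely to stay within the standard library; the file format, key names, and `load_or_create_keys` itself are assumptions for illustration.

```python
import json
import secrets
from pathlib import Path
from typing import Dict


def load_or_create_keys(path: Path) -> Dict[str, str]:
    """Load keys from path, generating new ones only if the file is absent."""
    try:
        with path.open("r", encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        # Only a missing file triggers key creation. Other OSError
        # subclasses (for example PermissionError) propagate to the caller.
        keys = {
            "api_key": secrets.token_hex(16),
            "admin_key": secrets.token_hex(16),
        }
        with path.open("w", encoding="utf-8") as handle:
            json.dump(keys, handle)
        return keys
```

Catching OSError here would also swallow permission problems and then try to write over a file that exists but cannot be read.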
erinmaybe
fa2acb2828 Adds aliases for min_temp and max_temp (#58)
* Adds aliases for min_temp and max_temp

* Sampling: Add dynatemp_exponent alias
2024-02-03 21:51:29 -05:00
kingbri
a769d90bad Args: Fix developer group
Wasn't being added to the parser

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-03 00:16:47 -05:00
kingbri
f1ea15d77e Model: Remove backwards compatibility hacks
Now that exllamav2 is required to be the latest, don't add attribute
checks unless the feature is not in the release build.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-02 23:53:45 -05:00
kingbri
6eeb62b82c Requirements: Update exllamav2, torch, and FA2
Torch to 2.2, exllamav2 to 0.0.13, FA2 to 2.4.2 on Windows and 2.5.2
on Linux.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-02 23:53:42 -05:00
kingbri
1919bf7705 Launch: Make exllamav2 requirement more friendly
Add the ability to use an unsafe config flag if needed and migrate
the exl2 check to a different file within the exl2 backend code.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-02 23:36:17 -05:00
kingbri
b827bcbb44 Sampling: Cleanup and update
Clean up how overrides are handled, class naming, and adopt exllamav2's
model class to enforce latest stable version methods rather than
adding multiple backwards compatibility checks.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-02 23:36:17 -05:00
kingbri
2ea063cea9 Tree: Require exllamav2 version for startup
Exllamav2 is currently supported on all GPUs and versions. Therefore,
it should be expected that users use the latest version of exllamav2 to
get the latest features.

Doing this helps reduce checks that don't really serve any purpose.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-02 23:36:17 -05:00
kingbri
d3781920b3 OAI: Split up utility functions
Just like types, put utility functions in their own separate module
based on the route.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-02 23:36:17 -05:00
kingbri
634d299fd9 Sampling: Fix smoothing factor default fallback
default_factory, not default_factor

Signed-off-by: kingbri <bdashore3@proton.me>
2024-02-02 23:35:15 -05:00
Alexander Abushady
d7c18855e7 added quadratic sampling (#56)
* added quadratic sampling

* Update sample_preset.yml

* oops missed a spot

* Sampling: Fix smoothing factor semantics
2024-02-02 22:12:59 -05:00
kingbri
4a7b8b1b7a Samplers: Add dynamic temperature
Does not work if max_temp is less than or equal to min_temp. Sampler
validation will have to be refactored in the future, so the dynamic
temperature check will also be changed.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-01-31 01:20:59 -05:00
kingbri
3605067898 Requirements: Don't use torch 2.2
Pytorch released 2.2 without letting the community know first. Pin
the torch version to 2.1.2 until exllamav2 builds for torch 2.2

Signed-off-by: kingbri <bdashore3@proton.me>
2024-01-29 23:30:10 -05:00
kingbri
751627e571 OAI: Add fasttensors to model load endpoint
Also fix logging when loading prompt templates.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-01-25 01:08:02 -05:00
kingbri
fc4570220c API + Model: Add new parameters and clean up documentation
The example JSON fields were changed because of the new sampler
default strategy. Fix these by manually changing the values.

Also add support for fasttensors and expose generate_window to
the API. It's recommended to not adjust generate_window as it's
dynamically scaled based on max_seq_len by default.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-01-25 00:15:40 -05:00
kingbri
90fb41a77a Model: Fix prompt template initialization
The previous commit iterated through multiple try conditions which
made it so the user has to provide a dummy prompt template. Now,
template loading is fallback based.

Run through a loop of functions and return if one of them succeeds.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-01-25 00:15:40 -05:00
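The fallback-based loading described in this commit ("run through a loop of functions and return if one of them succeeds") can be sketched as follows. The loader callables and the exception types treated as a failed attempt are assumptions made for the example.

```python
from typing import Callable, Iterable, Optional


def first_successful(
    loaders: Iterable[Callable[[], Optional[str]]]
) -> Optional[str]:
    """Return the first non-None result from loaders, skipping ones that fail."""
    for loader in loaders:
        try:
            result = loader()
        except (FileNotFoundError, LookupError):
            # This source is unavailable, so fall through to the next one
            continue
        if result is not None:
            return result
    return None
```

With this shape, no dummy template is needed: if every loader fails, the caller simply receives None.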
kingbri
740b0215dd Model: Dynamically scale generate_window
Allows for adjustment of reservation space at the end of the context
before rolling it. This should be scaled as a model's max_seq_len
goes up.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-01-25 00:15:40 -05:00
kingbri
b14c5443fd API: Add sampler override switching
Allow users to switch the currently overridden samplers via the API
so a restart isn't required to switch the overrides.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-01-25 00:15:40 -05:00
kingbri
de0ba7214c API: Add template switching and unload endpoints
Templates can be switched and unloaded without reloading the entire
model.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-01-25 00:15:40 -05:00
kingbri
6c30f24c83 Tree: Unify sampler parameters and add override support
Unify API sampler params into a superclass which should make them
easier to manage and inherit generic functions from.

Not all frontends expose all sampling parameters because they are
built around OAI (which handles sampling itself, with the exception
of a few sliders).

Add the ability for the user to customize fallback parameters from
server-side.

In addition, parameters can be forced to a certain value server-side
in case the repo automatically sets other sampler values in the
background that the user doesn't want.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-01-25 00:15:40 -05:00
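The two override modes described in this commit, a server-side fallback and a forced value, might be resolved along these lines. This is only a sketch: the `override` and `force` keys and the `apply_overrides` helper are hypothetical and chosen for the example.

```python
from typing import Any, Dict


def apply_overrides(
    request_params: Dict[str, Any],
    overrides: Dict[str, Dict[str, Any]],
) -> Dict[str, Any]:
    """Merge server-side overrides into the parameters sent by a client."""
    resolved = dict(request_params)
    for name, rule in overrides.items():
        if rule.get("force", False):
            # Forced values always win over what the client sent
            resolved[name] = rule["override"]
        elif resolved.get(name) is None:
            # Otherwise the override is only a fallback for unset values
            resolved[name] = rule["override"]
    return resolved
```

For example, a forced `top_k` replaces whatever a frontend sets in the background, while a non-forced `temperature` applies only when the request omits it.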
kingbri
78f920eeda Tree: Refactor code organization
Move common functions into their own folder and refactor the backends
to use their own folder as well.

Also clean up imports and alphabetize the import statements themselves.

Finally, move colab and docker into their own folders as well.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-01-25 00:15:40 -05:00
kingbri
ee99349a78 Requirements: Bump exllamav2
0.0.12

Signed-off-by: kingbri <bdashore3@proton.me>
2024-01-22 21:13:31 -05:00
kingbri
902e841c39 Main: Add logging for API routes
Helps users get started with accessing the docs.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-01-10 23:50:11 -05:00
kingbri
7a29664f06 API: Add alias names to field descriptions
Helps with understanding API aliases. These aliases should not be
used but are helpful for developers who want frontend compat.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-01-08 23:00:33 -05:00