Commit graph

411 commits

Author SHA1 Message Date
kingbri
4d158dac90 Start: Fix when reading from gpu_lib file
The wrong variable was being set, so fix that.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-13 12:24:30 -04:00
kingbri
2a0aaa2e8a OAI: Add ability to pass extra vars in jinja templates
A chat completion can now declare extra template_vars to pass when
a template is rendered, opening up the possibility of using state
outside of huggingface's parameters.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-11 09:49:25 -04:00
kingbri
b1f3baad74 OAI: Add response_format parameter
response_format allows a user to request a valid, but arbitrary JSON
object from the API. This is a new part of the OAI spec.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-09 21:33:31 -04:00
kingbri
de41e9f7e9 Start: Add gpu_lib argument
Argument to override the selected GPU library. Useful for daemonization
when running for the first time.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-08 23:33:19 -04:00
kingbri
d759a15559 Model: Fix chunk size handling
The wrong class attribute name was used for max_attention_size. Also fix
the declaration of the draft model's chunk_size.

Also expose the parameter to the end user in both config and model
load.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-07 18:39:19 -04:00
kingbri
30c4554572 Requirements: Update Exllamav2
v0.0.18

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-07 18:00:56 -04:00
kingbri
46ac3beea9 Templates: Support list style chat_template keys
HuggingFace updated transformers to provide templates in a list for
tokenizers. Update to support this new format. Providing the name
of a template for the "prompt_template" value in config.yml will also
look inside the template list.

In addition, log if there's a template exception, but continue model
loading since it shouldn't shut down the application.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-07 11:20:25 -04:00
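
A minimal sketch of the lookup described in 46ac3beea9, assuming the list-style chat_template value is a list of entries with "name" and "template" keys. The helper name find_chat_template and the "default" fallback name are hypothetical and are not taken from the project's code.

```python
from typing import Optional, Union


# Hypothetical helper: the project's real function names and defaults may differ.
def find_chat_template(
    chat_template: Union[str, list, None],
    name: Optional[str] = None,
) -> Optional[str]:
    """Return a template string from a string-style or list-style value."""
    if chat_template is None:
        return None
    # Older format: the key holds a single template string.
    if isinstance(chat_template, str):
        return chat_template
    # Newer format (assumed shape): a list of {"name": ..., "template": ...}.
    wanted = name or "default"
    for entry in chat_template:
        if entry.get("name") == wanted:
            return entry.get("template")
    # No entry with that name was found.
    return None
```

Returning None instead of raising matches the commit's intent that a template problem should not stop model loading; the caller can log and carry on.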
kingbri
5bb4995a7c API: Move OAI to APIRouter
This makes the API more modular for other API implementations in the
future.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-06 01:25:31 -04:00
kingbri
8bdc19124f Start: Fix gpu lib when reading from file
Readline doesn't strip out newlines or spaces.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-02 22:04:01 -04:00
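
The behavior behind 8bdc19124f can be reproduced with the standard library alone. In this small sketch an in-memory buffer stands in for the saved preference file, and the value "cu121" is only an illustrative placeholder.

```python
import io

# Stand-in for the saved preference file; the value is a placeholder.
saved = io.StringIO("cu121 \n")

raw = saved.readline()   # readline keeps the trailing space and newline
gpu_lib = raw.strip()    # strip removes the surrounding whitespace

print(repr(raw))      # 'cu121 \n'
print(repr(gpu_lib))  # 'cu121'
```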
Brian Dashore
cdb96e4f74 Merge pull request #93 from AlpinDale/chore/log-level
chore: make log level configurable via env variable
2024-04-02 00:52:06 -04:00
kingbri
f9f8c97c6d Templates: Fix stop_string parsing
Template modules grab all set vars, including ones that use runtime
vars. If a template var is set to a runtime var and a module is created,
an UndefinedError fires.

Use make_module instead to pass runtime vars when creating a template
module.

Resolves #92

Signed-off-by: kingbri <bdashore3@proton.me>
2024-04-02 00:44:04 -04:00
AlpinDale
1650e6e640 ruff
2024-04-01 23:11:30 +00:00
AlpinDale
5e599ddbd4 typo
2024-04-01 23:08:28 +00:00
AlpinDale
6c4a1a9c70 make log level a global var
2024-04-01 23:07:30 +00:00
AlpinDale
031349133b properly order imports
2024-04-01 23:03:16 +00:00
AlpinDale
e90ead3b35 chore: make log level configurable via env variable
2024-04-01 22:57:56 +00:00
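
One way to make a log level configurable through an environment variable, as e90ead3b35 describes, is sketched below. The variable name APP_LOG_LEVEL and the helper resolve_log_level are hypothetical; the log does not say which name the project reads.

```python
import logging
import os

# Hypothetical variable name, used here only for illustration.
ENV_VAR = "APP_LOG_LEVEL"


def resolve_log_level(default: int = logging.INFO) -> int:
    """Map the environment variable to a logging level, else use the default."""
    name = os.environ.get(ENV_VAR, "").strip().upper()
    if not name:
        return default
    # getLevelName returns an int for known names and a string otherwise.
    level = logging.getLevelName(name)
    return level if isinstance(level, int) else default
```

A caller could pass the result to logging.basicConfig(level=resolve_log_level()) at startup.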
kingbri
6ecce1604b Model: Fix log if exl2 version is too low
Switch to pyproject syntax.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-31 23:11:21 -04:00
kingbri
f534930270 Dependencies: Bump Exllamav2
v0.0.17

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-31 23:10:28 -04:00
kingbri
d716527b92 Sampling: Add additive param to overrides
Additive is used to add collections together. Currently, it's used
for lists, but it can be used for dictionaries in the future.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-31 01:10:55 -04:00
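
The additive behavior in d716527b92 can be sketched roughly as follows. The function name apply_override, its parameters, and the rule that force takes precedence over additive are assumptions made for illustration, not the project's actual code.

```python
from typing import Any


# Hypothetical helper; parameter names mirror the terms used in the log.
def apply_override(
    request_value: Any,
    override_value: Any,
    force: bool = False,
    additive: bool = False,
) -> Any:
    # Assumption: a forced override always replaces the request value.
    if force:
        return override_value
    # Additive: join two lists instead of choosing one of them.
    if (
        additive
        and isinstance(request_value, list)
        and isinstance(override_value, list)
    ):
        return request_value + override_value
    # Otherwise the override only fills in a missing request value.
    if request_value is None:
        return override_value
    return request_value
```

Supporting dictionaries later, as the commit suggests, would mean adding a second isinstance branch that merges the two mappings.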
kingbri
05b5700334 Dependencies: Update torch
v2.2.2

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-30 17:03:37 -04:00
kingbri
5c94894a1a Dependencies: Update Flash Attention
v2.5.6

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-30 16:58:24 -04:00
kingbri
b11aac51e2 Model: Add torch.inference_mode() to generator function
Provides a speedup to model forward.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-30 10:45:28 -04:00
kingbri
e8b6a02aa8 API: Move prompt template construction to utils
Best to move the inner workings within its inner function. Also fix
an edge case where stop strings can be a string rather than an array.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-29 02:24:13 -04:00
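
The edge case fixed in e8b6a02aa8, where stop strings arrive as a single string rather than an array, is commonly handled with a small normalizing step. The helper name normalize_stop is hypothetical.

```python
from typing import List, Union


# Hypothetical helper name; shown only to illustrate the edge case.
def normalize_stop(stop: Union[str, List[str], None]) -> List[str]:
    """Always return a list, whatever shape the request used."""
    if stop is None:
        return []
    # A bare string becomes a one-item list instead of being iterated
    # character by character.
    if isinstance(stop, str):
        return [stop]
    return list(stop)
```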
kingbri
190a0b26c3 Model: Fix generation when stream = false
References #91. Check if the length of the generation array is > 0
after popping the finish reason.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-29 02:15:56 -04:00
kingbri
d4280e1378 Dependencies: Add pytorch-triton-rocm
Required for AMD installs.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-28 11:02:56 -04:00
kingbri
271f5ba7a4 Templates: Modify alpaca and chatml
Add the stop_strings metadata parameter.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-27 22:28:41 -04:00
kingbri
dc456f4cc2 Templates: Add stop_strings meta param
Adding the stop_strings var to chat templates will allow for the
template creator to specify stopping strings to add onto chat completions.

These get appended to existing stopping strings that are passed
in the API request. However, a sampler override with force: true will
override all stopping strings.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-27 22:22:07 -04:00
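
The merge rule stated in dc456f4cc2 can be sketched as below. The name merge_stop_strings, its parameters, and the removal of duplicates are assumptions added for illustration.

```python
from typing import List, Optional


# Hypothetical helper; not taken from the project's code.
def merge_stop_strings(
    request_stops: List[str],
    template_stops: List[str],
    forced_stops: Optional[List[str]] = None,
) -> List[str]:
    # A sampler override with force: true replaces every other source.
    if forced_stops is not None:
        return list(forced_stops)
    # Otherwise template strings are appended to the request's strings.
    merged = list(request_stops)
    for stop in template_stops:
        if stop not in merged:  # assumption: skip duplicates
            merged.append(stop)
    return merged
```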
kingbri
277c540c98 Colab: Update
Switch to pyproject

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-24 21:48:48 -04:00
kingbri
db62d1e649 OAI: Log request errors to console
Previously, some request errors were only sent to the client, but
some clients don't log the full error, so log it in console.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-23 20:29:17 -04:00
kingbri
26496c4db2 Dependencies: Require tokenizers
This is used for some models and isn't too big in size (compared to
other huggingface dependencies), so include it by default.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-23 01:12:21 -04:00
kingbri
1755f284cf Model: Prompt users to install extras if dependencies don't exist
Ex: tokenizers, lmfe, outlines.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-22 22:13:55 -04:00
kingbri
f952b81ccf API: Remove uvicorn signal handler injection
This causes spamming of warn statements on SIGINT. The message also
gets printed on a normal shutdown (that isn't in the middle of a
request).

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-21 23:23:45 -04:00
kingbri
6dfcbbd813 Common: Migrate request utils to networking
Helps organize the project better. Utils is meant to be for simple
functions like unwrap.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-21 23:21:57 -04:00
kingbri
2961c5f3f9 API: Handle request disconnect on non-streaming gens
Works the same way as streaming gens. If the request is cancelled,
it will log an error to the user and release the semaphore if it's
holding anything.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-21 23:12:59 -04:00
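
The pattern described in 2961c5f3f9 (log on cancellation, then release the semaphore) can be illustrated with plain asyncio. This is a sketch with hypothetical names; task.cancel() stands in for whatever disconnect detection the project uses.

```python
import asyncio


async def guarded_generation(semaphore: asyncio.Semaphore, work) -> str:
    """Run work() under the semaphore and report a cancellation."""
    async with semaphore:  # released on normal exit and on cancellation
        try:
            return await work()
        except asyncio.CancelledError:
            print("Generation cancelled: the client disconnected.")
            raise


async def demo() -> bool:
    semaphore = asyncio.Semaphore(1)

    async def slow_generation() -> str:
        await asyncio.sleep(10)
        return "done"

    task = asyncio.create_task(guarded_generation(semaphore, slow_generation))
    await asyncio.sleep(0.01)  # let the task acquire the semaphore
    task.cancel()              # stand-in for a detected disconnect
    try:
        await task
    except asyncio.CancelledError:
        pass
    # False means the semaphore is free for the next request.
    return semaphore.locked()
```

Re-raising CancelledError after logging keeps the cancellation visible to the caller, while the async with block guarantees the release.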
kingbri
44b7319710 Start: Print pip install command
Helps for debugging.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-21 18:14:48 -04:00
kingbri
5055a98e41 Model: Wrap load in inference_mode
Some tensors were being taken out of inference mode during each
iteration of exllama's load_autosplit_gen. This causes errors since
autograd is off.

Therefore, make the shared load_gen_sync function have an overarching
inference_mode context to prevent forward issues. This should allow for
the generator to iterate across each thread call.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-21 18:06:50 -04:00
kingbri
37a80334a8 Dependencies: Add packaging
This is a required dependency.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-21 11:27:27 -04:00
kingbri
56fdfb5f8e OAI: Add stream to gen params
Good for logging.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-21 00:55:44 -04:00
kingbri
69e41e994c Model: Fix generation with non-streaming and logprobs
Finish_reason was giving an empty offset. Fix this by grabbing the
finish reason first and then handling the static generation as normal.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-21 00:47:24 -04:00
kingbri
345bcc30c7 Dependencies: Add extras feature
Installs all optional dependencies to the venv.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-21 00:09:38 -04:00
kingbri
51b289cab2 Actions: Fix workflows
Adapt to the new pyproject install method.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-20 15:21:37 -04:00
kingbri
1e7cf1e5a4 Start: Prompt user for GPU/lib
There is no platform agnostic way to fetch CUDA/ROCm's versions
since environment variables change and users don't necessarily need
CUDA or ROCm installed to run pytorch (pytorch installs the necessary
libs if they don't exist).

Therefore, prompt the user for their GPU lib and store the result in
a text file so the user doesn't need to constantly enter a preference.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-20 15:21:37 -04:00
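
A rough sketch of the prompt-and-remember flow from 1e7cf1e5a4. The function name get_gpu_lib, the prompt text, and the injectable ask callback are hypothetical; the real start script may be organized differently.

```python
from pathlib import Path
from typing import Callable


# Hypothetical helper; `ask` defaults to input() but can be swapped in tests.
def get_gpu_lib(pref_file: Path, ask: Callable[[str], str] = input) -> str:
    """Return the saved GPU library, asking the user only when none is saved."""
    if pref_file.exists():
        saved = pref_file.read_text().strip()
        if saved:
            return saved
    # Nothing saved yet: ask once and remember the answer.
    choice = ask("Select your GPU library: ").strip()
    pref_file.write_text(choice)
    return choice
```

Stripping both the saved value and the typed answer avoids stray whitespace ending up in the stored preference.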
kingbri
7e669527ed Model: Fix tokenizer bugs
Some tokenizer variables don't get cleaned up on init, so these can
persist. Clean these up manually before creating a new tokenizer for
now.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-20 15:21:37 -04:00
kingbri
07d9b7cf7b Model: Add abort on generation
When the model is processing a prompt, add the ability to abort
on request cancellation. This is also a catch for a SIGINT.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-20 15:21:37 -04:00
kingbri
7020a0a2d1 Dependencies: Update Exllamav2
v0.0.16

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-20 15:21:37 -04:00
kingbri
061e1d94c2 Ruff: Migrate to pyproject
Removes unnecessary ruff.toml.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-20 15:21:37 -04:00
kingbri
1059101b23 Dependencies: Remove requirements-*.txt files
Pyproject.toml replaces these files.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-20 15:21:37 -04:00
kingbri
72b08624a3 Start: Update to use pyproject
Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-20 15:21:37 -04:00
kingbri
b1ca435695 Tree: Add pyproject.toml
This will manage dependencies from now on since it's a more flexible
file that's similar to other packaging utilities like npm and cargo.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-20 15:21:37 -04:00
kingbri
b74603db59 Model: Log metrics before yielding a stop
Yielding the finish reason before the logging causes the function to
terminate early. Instead, log before yielding and breaking out of the
generation loop.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-03-20 01:17:04 -04:00
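
One way the problem in b74603db59 can arise with Python generators: if the consumer stops reading once it receives the finish reason, any code placed after that final yield never runs. The sketch below uses hypothetical names and a list in place of real metrics logging.

```python
def generate_log_after(chunks, log):
    for chunk in chunks:
        yield chunk
    yield {"finish_reason": "stop"}
    log.append("metrics")  # only runs if the consumer asks for another item


def generate_log_before(chunks, log):
    for chunk in chunks:
        yield chunk
    log.append("metrics")  # runs before the final item is handed over
    yield {"finish_reason": "stop"}


def consume(gen):
    """Stop reading as soon as a finish reason arrives."""
    for item in gen:
        if isinstance(item, dict) and "finish_reason" in item:
            break


late, early = [], []
consume(generate_log_after(["a", "b"], late))
consume(generate_log_before(["a", "b"], early))
print(late)   # []
print(early)  # ['metrics']
```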