A chat completion request can now declare extra template_vars to pass
when a template is rendered, opening up the possibility of using state
outside of HuggingFace's parameters.
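For example (a sketch; the endpoint and the extra variable names are
illustrative assumptions):

    import requests

    payload = {
        "messages": [{"role": "user", "content": "Hello!"}],
        # Forwarded to the Jinja template when it renders
        "template_vars": {"persona": "pirate"},
    }
    requests.post("http://localhost:5000/v1/chat/completions", json=payload)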
Signed-off-by: kingbri <bdashore3@proton.me>
response_format allows a user to request a valid but otherwise arbitrary
JSON object from the API. This is a new part of the OAI spec.
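A sketch of a client using it (the URL is an assumption; the
response_format shape follows the OAI spec):

    import requests

    payload = {
        "messages": [{"role": "user", "content": "List three colors as JSON."}],
        # Asks the server to return a valid JSON object
        "response_format": {"type": "json_object"},
    }
    requests.post("http://localhost:5000/v1/chat/completions", json=payload)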
Signed-off-by: kingbri <bdashore3@proton.me>
The wrong class attribute name was used for max_attention_size. Also
fix the declaration of the draft model's chunk_size.
Also expose the parameter to the end user in both config and model
load.
Signed-off-by: kingbri <bdashore3@proton.me>
HuggingFace updated transformers to provide templates in a list for
tokenizers. Update to support this new format. Providing a template's
name as the "prompt_template" value in config.yml will now also search
the template list.
In addition, log if there's a template exception, but continue model
loading, since a bad template shouldn't shut down the application.
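A sketch of the name lookup, assuming the tokenizer config stores
chat_template either as a plain string or as a list of name/template
entries (the helper name is illustrative):

    def find_prompt_template(chat_template, name):
        """Resolve a template by name from the new list format."""
        if isinstance(chat_template, str):
            return chat_template
        if isinstance(chat_template, list):
            for entry in chat_template:
                if entry.get("name") == name:
                    return entry.get("template")
        return None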
Signed-off-by: kingbri <bdashore3@proton.me>
Template modules evaluate all set vars, including ones that reference
runtime vars. If a template var is set from a runtime var and a module
is created, an UndefinedError fires.
Use make_module instead to pass runtime vars when creating a template
module.
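A minimal repro and fix using Jinja2's make_module (the variable names
are illustrative):

    from jinja2 import Environment

    env = Environment()
    template = env.from_string("{% set greeting = 'Hi ' + user %}")

    # template.module evaluates the set block with no variables, so the
    # concatenation with the undefined `user` raises UndefinedError.
    # make_module passes the runtime vars up front instead:
    module = template.make_module(vars={"user": "kingbri"})
    print(module.greeting)  # Hi kingbri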
Resolves #92
Signed-off-by: kingbri <bdashore3@proton.me>
Additive is used to merge collections together. Currently it's used
for lists, but it can be extended to dictionaries in the future.
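A sketch of the merge behavior (the function name is illustrative):

    def merge_override(existing, override, additive=False):
        """Apply an override; with additive=True, lists are concatenated
        instead of replaced. Dictionaries could be merged the same way
        in the future."""
        if additive and isinstance(existing, list) and isinstance(override, list):
            return existing + override
        return override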
Signed-off-by: kingbri <bdashore3@proton.me>
Move the inner workings into the inner function. Also fix an edge case
where stop strings can arrive as a single string rather than an array.
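A sketch of the coercion (the function name is illustrative):

    def coerce_stop_strings(stop):
        """Normalize the stop parameter to a list of strings."""
        if isinstance(stop, str):
            return [stop]
        return stop or []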
Signed-off-by: kingbri <bdashore3@proton.me>
Adding the stop_strings var to chat templates allows the template
creator to specify stopping strings to add onto chat completions.
These get appended to any existing stopping strings passed in the API
request. However, a sampler override with force: true will override
all stopping strings.
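A sketch of reading the var from a template module (the template body
is illustrative):

    from jinja2 import Environment

    env = Environment()
    template = env.from_string("{% set stop_strings = ['</s>', '<|user|>'] %}")
    module = template.make_module()

    request_stops = ["\n\n"]  # stops passed in the API request
    # Template stops are appended to the request's stops
    all_stops = request_stops + list(module.stop_strings)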
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, some request errors were only sent to the client, but some
clients don't log the full error, so also log it to the console.
Signed-off-by: kingbri <bdashore3@proton.me>
This dependency is used for some models and isn't too big (compared to
other HuggingFace dependencies), so include it by default.
Signed-off-by: kingbri <bdashore3@proton.me>
This spams warn statements on SIGINT. The message also gets printed on
a normal shutdown (one that isn't in the middle of a request).
Signed-off-by: kingbri <bdashore3@proton.me>
This works the same way as streaming gens. If the request is cancelled,
log an error to the user and release the semaphore if it's holding
anything.
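A minimal sketch of the guard, assuming an asyncio semaphore gates
generation (the inner call is hypothetical):

    import asyncio
    import logging

    logger = logging.getLogger(__name__)

    async def generate_completion(prompt, semaphore: asyncio.Semaphore):
        async with semaphore:  # released even if the task is cancelled
            try:
                return await run_generation(prompt)  # hypothetical inner call
            except asyncio.CancelledError:
                logger.error("Completion request was cancelled by the client.")
                raise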
Signed-off-by: kingbri <bdashore3@proton.me>
Some tensors were being taken out of inference mode during each
iteration of exllama's load_autosplit_gen. This caused errors since
autograd is off.
Therefore, give the shared load_gen_sync function an overarching
inference_mode context to prevent forward-pass issues. This allows the
generator to keep iterating across each thread call.
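A sketch of the idea, not the exact function (the generator argument
stands in for exllama's load_autosplit_gen):

    import torch

    def load_gen_sync(load_gen):
        """Drive the loading generator under one overarching
        inference-mode context so tensors never re-enter autograd."""
        with torch.inference_mode():
            for _ in load_gen:
                pass  # each step loads and forwards part of the model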
Signed-off-by: kingbri <bdashore3@proton.me>
finish_reason was being given an empty offset. Fix this by grabbing the
finish reason first and then handling the static generation as normal.
Signed-off-by: kingbri <bdashore3@proton.me>
There is no platform-agnostic way to fetch CUDA/ROCm's versions since
environment variables change, and users don't necessarily need CUDA or
ROCm installed to run PyTorch (PyTorch installs the necessary libs if
they don't exist).
Therefore, prompt the user for their GPU lib and store the result in a
text file so the user doesn't need to constantly enter a preference.
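A sketch of the preference cache (the filename and prompt text are
illustrative assumptions):

    from pathlib import Path

    PREF_FILE = Path("gpu_lib.txt")

    def get_gpu_lib() -> str:
        if PREF_FILE.exists():
            return PREF_FILE.read_text().strip()
        choice = input("GPU library to use (cuda/rocm): ").strip().lower()
        PREF_FILE.write_text(choice)
        return choice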
Signed-off-by: kingbri <bdashore3@proton.me>
Some tokenizer variables don't get cleaned up on init, so they can
persist. Clean them up manually before creating a new tokenizer for
now.
Signed-off-by: kingbri <bdashore3@proton.me>
When the model is processing a prompt, add the ability to abort on
request cancellation. This also acts as a catch for a SIGINT.
Signed-off-by: kingbri <bdashore3@proton.me>
This file will manage dependencies from now on since it's more flexible
and similar to the manifests used by other packaging tools like npm and
cargo.
Signed-off-by: kingbri <bdashore3@proton.me>
Yielding the finish reason before the logging causes the function to
terminate early, since the consumer stops iterating once it sees the
finish chunk. Instead, log before yielding and breaking out of the
generation loop.
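A sketch of the ordering (names are illustrative; log_response is a
hypothetical helper):

    async def stream_generation(gen):
        async for chunk in gen:
            if chunk.get("finish_reason"):
                log_response(chunk)  # log before the final yield
                yield chunk
                break
            yield chunk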
Signed-off-by: kingbri <bdashore3@proton.me>