skip_special_tokens is in stable exl2. Also default the parameters
if they are not present in the function signature.
Signed-off-by: kingbri <bdashore3@proton.me>
From exllamav2: List of strings that the generator will refuse to output. As soon as a partial match happens, a checkpoint is saved that the generator can rewind to if need be. Subsequent tokens are then held until the full string is resolved (match or no match) and either emitted or discarded, accordingly.
Bans the EOS token until the generation reaches a minimum length. This will not prevent the model from otherwise ending the generation early by outputting other stop conditions.
This reverts commit 7556dcf134.
The Optionals allowed requests to send "null" in the body for optional
parameters which should be allowed.
Signed-off-by: kingbri <bdashore3@proton.me>
These both take an array of glob strings to state what files or
directories to include or exclude when parsing the download list.
Signed-off-by: kingbri <bdashore3@proton.me>
Use None-ish coalescing instead of unwrap optional handling. This means
that any value that is "empty" for python will default to the fallback.
Ex. print("" or "test") will print out "test"
Signed-off-by: kingbri <bdashore3@proton.me>
Adds an asynchronous huggingface downloader that uses HF hub to fetch
all repo files. The current HF hub package has a snapshot_download
function that does not cancel on KeyboardInterrupt.
Instead, make a downloader that uses the Rich progress bar styling
along with a cancellable interface. Finally, link this to TabbyAPI.
Signed-off-by: kingbri <bdashore3@proton.me>
Appends the banned tokens to the generation. This is equivalent of
setting logit bias to -100 on a specific set of tokens.
Signed-off-by: kingbri <bdashore3@proton.me>
response_prefix is used to add a prefix before generating the next
message. This is used in many cases such as continuining a prompt
(see #96).
Also if a template has BOS token specified, add_bos_token will
append two BOS tokens. Add a check which strips a starting BOS token
from the prompt if it exists.
Signed-off-by: kingbri <bdashore3@proton.me>
Torch errors if float values are passed (because bytes are not float
types). Therefore, overestimate and cast to an int type.
Resolves#97
Signed-off-by: kingbri <bdashore3@proton.me>
Having many utility functions for initialization doesn't make much sense.
Instead, handle anything regarding template creation inside the
class which reduces the amount of function imports.
Signed-off-by: kingbri <bdashore3@proton.me>
GenerationConfig is meant to override various parts of the model
on generation within the transformers lib. Rather than implementing
the entire GenerationConfig framework (since it's pretty redundant),
add in multi eos_token support like VLLM.
The GenerationConfig is used only for generation, but can be used
for other uses if needed.
If there's more necessary parameters in the future, add those in
as well.
Signed-off-by: kingbri <bdashore3@proton.me>
Automatically create a config.yml on build. Also use the cuda runtime
image which is much lighter than the previous cuda devel image.
Signed-off-by: kingbri <bdashore3@proton.me>
LM format enforcer has some latency on token ingestion, so use an
optimized fork instead. Also add this in as a base dependency since
the size is small.
Signed-off-by: kingbri <bdashore3@proton.me>