The appropriate branches weren't firing when the frequency penalty was
0.0. Also fix the repetition penalty being overridden.
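A minimal sketch of the corrected branching (illustrative names, not the
actual sampler code):

    def build_penalty_settings(freq_pen: float = 0.0, rep_pen: float = 1.0) -> dict:
        # Repetition penalty keeps its own value instead of being overridden.
        settings = {"token_repetition_penalty": rep_pen}

        # 0.0 is the neutral value, so only take the frequency penalty
        # branch when the user actually set one.
        if freq_pen != 0.0:
            settings["token_frequency_penalty"] = freq_pen

        return settings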
Signed-off-by: kingbri <bdashore3@proton.me>
The previous behavior aliased the frequency penalty to the repetition
penalty. Keep this behavior when the frequency penalty parameter is used
with a legacy exllamav2 version, rather than ignoring both entirely.
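Roughly, the fallback looks like this (a sketch with illustrative names;
the version check is assumed):

    def apply_freq_pen(settings: dict, freq_pen: float, has_native_freq_pen: bool) -> dict:
        if freq_pen == 0.0:
            return settings

        if has_native_freq_pen:
            # Newer exllamav2: use the real frequency penalty.
            settings["token_frequency_penalty"] = freq_pen
        else:
            # Legacy exllamav2: reuse the value as the repetition penalty,
            # matching the previous aliasing behavior.
            settings["token_repetition_penalty"] = freq_pen

        return settings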
Signed-off-by: kingbri <bdashore3@proton.me>
With the new wiki, all parameters are fully documented, along with
comments in the YAML file itself. This should help new users who pull
the repo, copy the config, and can't start the API because uncommented
subsections were being read.
Signed-off-by: kingbri <bdashore3@proton.me>
In newer versions of exllamav2, this value is read from the model's
config.json. It will still default to 1.0 anyway.
Signed-off-by: kingbri <bdashore3@proton.me>
All penalties can have a sustain (range) applied to them in exl2,
so clarify the parameter.
However, the default behavior changes based on whether the frequency OR
presence penalty is enabled. For the sanity of OAI users, have the
frequency and presence penalties only apply to the output tokens when
the range is -1 (the default).
Repetition penalty, however, still functions the same way: -1 means
the range is the max seq len.
Doing this prevents gibberish output when using the more modern frequency
and presence penalties, similar to llamacpp.
NOTE: This logic is still subject to change in the future, but I believe
it hits a happy medium between users who want defaults and users who want
to tinker with the sampling knobs.
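In pseudo-Python, the intended defaulting (parameter names are
approximate):

    def resolve_penalty_range(
        penalty_range: int,
        generated_tokens: int,
        max_seq_len: int,
        freq_or_pres_pen_enabled: bool,
    ) -> int:
        if penalty_range < 0:
            if freq_or_pres_pen_enabled:
                # OAI-style default: only penalize the output tokens.
                return generated_tokens
            # Repetition penalty keeps the old default: the max seq len.
            return max_seq_len

        # An explicit range is passed through untouched.
        return penalty_range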
Signed-off-by: kingbri <bdashore3@proton.me>
Direct python can be used for requirements checking. Remove the ps1
script and create a venv purely in batch.
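For example, the requirements check can be done from Python directly
(a sketch; the package names are illustrative):

    from importlib.metadata import PackageNotFoundError, version

    required = ["fastapi", "uvicorn"]  # illustrative package names

    missing = []
    for package in required:
        try:
            print(f"{package} {version(package)} is installed")
        except PackageNotFoundError:
            missing.append(package)

    if missing:
        print(f"Missing packages: {', '.join(missing)}")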
Signed-off-by: kingbri <bdashore3@proton.me>
Building from source is common for many wheels, so add an option
to skip wheel upgrades/installation when using the start script.
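Something along these lines (the flag name is hypothetical and the
install command is illustrative):

    import argparse
    import subprocess
    import sys

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--nowheel",
        action="store_true",
        help="Skip wheel upgrades/installation (useful when building from source)",
    )
    args, _ = parser.parse_known_args()

    if not args.nowheel:
        subprocess.run(
            [sys.executable, "-m", "pip", "install", "-U", "-r", "requirements.txt"]
        )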
Signed-off-by: kingbri <bdashore3@proton.me>
This maps the config file to its absolute path when loading it, making
it safer to find and load the correct file.
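For example (a sketch; the config location is assumed):

    import pathlib

    # Resolve the config file to an absolute path next to this script so it
    # is found regardless of the current working directory.
    config_path = (pathlib.Path(__file__).parent / "config.yml").resolve()
    print(f"Loading config from {config_path}")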
Signed-off-by: kingbri <bdashore3@proton.me>
Similar to the transformers library, add an error handler for when an
exception is raised. This relays the error to the user.
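A sketch of the idea, assuming a FastAPI app (the handler shape is
illustrative):

    from fastapi import FastAPI, Request
    from fastapi.responses import JSONResponse

    app = FastAPI()

    @app.exception_handler(Exception)
    async def relay_error(request: Request, exc: Exception):
        # Relay the exception text back to the caller instead of a bare 500.
        return JSONResponse(
            status_code=500,
            content={"error": {"message": str(exc), "type": type(exc).__name__}},
        )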
Signed-off-by: kingbri <bdashore3@proton.me>
These are commonly seen in HuggingFace-provided chat templates and
aren't that difficult to add.
For feature parity, honor the add_bos_token and ban_eos_token
parameters when constructing the prompt.
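A rough sketch of honoring both flags (the helper and token names are
illustrative):

    def finalize_prompt(
        prompt: str,
        bos_token: str,
        eos_token: str,
        add_bos_token: bool = True,
        ban_eos_token: bool = False,
    ):
        # Prepend the BOS token only when requested and not already present.
        if add_bos_token and not prompt.startswith(bos_token):
            prompt = bos_token + prompt

        # Ban the EOS token at sampling time rather than editing the prompt.
        banned_tokens = [eos_token] if ban_eos_token else []
        return prompt, banned_tokens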
Signed-off-by: kingbri <bdashore3@proton.me>
This creates a massive security hole, but it's gated behind a flag
for users who only use localhost.
A warning will pop up when users disable authentication.
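Roughly (a sketch; the flag name is illustrative):

    import logging

    logger = logging.getLogger(__name__)

    def check_auth_disabled(disable_auth: bool) -> bool:
        if disable_auth:
            logger.warning(
                "Authentication is DISABLED. Anyone with network access can "
                "use the API. Only do this if the server stays on localhost."
            )
        return disable_auth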
Signed-off-by: kingbri <bdashore3@proton.me>
Fix redundancy in code when loading templates. However, loading
a template from config.json may be a mistake since tokenizer_config.json
is the main place where chat templates are stored.
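For reference, a sketch of reading the template from
tokenizer_config.json (the function name is illustrative):

    import json
    import pathlib
    from typing import Optional

    def load_chat_template(model_dir: str) -> Optional[str]:
        tokenizer_config = pathlib.Path(model_dir) / "tokenizer_config.json"
        if not tokenizer_config.exists():
            return None

        with open(tokenizer_config, encoding="utf-8") as file:
            return json.load(file).get("chat_template")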
Signed-off-by: kingbri <bdashore3@proton.me>
Non-streaming tasks were not regulated by the semaphore, causing them
to interfere with streaming generations. Add helper functions that
accept both sync and async callbacks and block sequentially on the
semaphore.
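A sketch of the helper idea (names are illustrative):

    import asyncio
    import inspect

    generate_semaphore = asyncio.Semaphore(1)

    async def run_with_semaphore(func, *args, **kwargs):
        # Hold the generation semaphore while the callback runs so
        # non-streaming requests queue behind streaming ones.
        async with generate_semaphore:
            result = func(*args, **kwargs)

            # Await the result if the callback was async.
            if inspect.isawaitable(result):
                result = await result

            return result

Usage (illustrative): response = await run_with_semaphore(model.generate, prompt)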
Signed-off-by: kingbri <bdashore3@proton.me>