Similar to the transformers library, add an error handler that runs when an
exception is raised. This relays the error to the user.
Signed-off-by: kingbri <bdashore3@proton.me>
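A minimal sketch of the handler, assuming a FastAPI app (the payload shape is illustrative):

```python
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

@app.exception_handler(Exception)
async def relay_exception(request: Request, exc: Exception):
    # Relay the error message to the client instead of a bare 500 page
    return JSONResponse(
        status_code=500,
        content={"error": {"message": str(exc)}},
    )
```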
These are commonly seen in Hugging Face-provided chat templates and
aren't that difficult to add.
For feature parity, honor the add_bos_token and ban_eos_token
parameters when constructing the prompt.
Signed-off-by: kingbri <bdashore3@proton.me>
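A rough sketch of honoring both flags, assuming the tokenizer exposes bos_token/eos_token_id and the sampler settings support disallow_tokens (as exllamav2's do); the helper name is hypothetical:

```python
def apply_token_flags(prompt: str, params, tokenizer, gen_settings):
    # Prepend the BOS string only when the client asks for it
    if params.add_bos_token:
        prompt = tokenizer.bos_token + prompt

    # Banning EOS keeps the model generating until max_tokens is hit
    if params.ban_eos_token:
        gen_settings.disallow_tokens(tokenizer, [tokenizer.eos_token_id])

    return prompt
```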
This creates a massive security hole, but it's gated behind a flag
for users who only use localhost.
A warning is printed when users disable authentication.
Signed-off-by: kingbri <bdashore3@proton.me>
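Roughly, the gate looks like this (the flag name is illustrative):

```python
import logging

logger = logging.getLogger(__name__)

def check_auth_config(disable_auth: bool):
    # Loudly warn anyone who turns authentication off
    if disable_auth:
        logger.warning(
            "Authentication is disabled! This is a security risk; "
            "only run this way on localhost or a trusted network."
        )
```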
Fix redundant code when loading templates. However, loading
a template from config.json may be a mistake, since tokenizer_config.json
is the main place where chat templates are stored.
Signed-off-by: kingbri <bdashore3@proton.me>
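A sketch of the deduplicated lookup, assuming templates live under the chat_template key as in HF tokenizer configs (the helper name is hypothetical):

```python
import json
import pathlib

def read_template_from_json(model_dir: str, filename: str = "tokenizer_config.json"):
    # tokenizer_config.json is the canonical home of HF chat templates
    path = pathlib.Path(model_dir) / filename
    if not path.exists():
        return None
    return json.loads(path.read_text()).get("chat_template")
```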
Non-streaming tasks were not regulated by the semaphore, causing them
to interfere with streaming generations. Add helper functions that
accept both sync and async callbacks and block sequentially on the
semaphore.
Signed-off-by: kingbri <bdashore3@proton.me>
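A minimal sketch of the helper (names are mine):

```python
import asyncio
import inspect

generate_semaphore = asyncio.Semaphore(1)

async def call_with_semaphore(callback):
    # Queue streaming and non-streaming generations behind the same
    # semaphore so they can't interleave on the GPU
    async with generate_semaphore:
        result = callback()

        # Accept plain functions as well as coroutine functions
        if inspect.isawaitable(result):
            result = await result

        return result
```

Callers wrap the generation in a lambda, e.g. `await call_with_semaphore(lambda: model.generate(prompt))`.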
When stream is false, the generation can be empty, which means
there are no chunks present in the final generation array, causing
an error.
Instead, return a dummy value if the generation is falsy (an empty
array or None).
Signed-off-by: kingbri <bdashore3@proton.me>
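Something like the following (field names are illustrative):

```python
def finalize_generation(generations):
    # stream=False can legitimately produce nothing; return a dummy
    # chunk instead of indexing into an empty list
    if not generations:  # covers both None and []
        return {"text": "", "prompt_tokens": 0, "generated_tokens": 0}

    return generations[-1]
```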
Some models (such as Mistral and Mixtral) set their base sequence
length to 32k because they assume support for sliding window
attention.
Therefore, add this parameter to override the base sequence length
of a model, which helps with the auto-calculation of RoPE alpha.
If auto-calculation of RoPE alpha isn't being used, the max_seq_len
parameter works fine as is.
Signed-off-by: kingbri <bdashore3@proton.me>
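For reference, a sketch of the auto-calculation; the quadratic fit below is one that circulates in the exllama community and may not match this repo's exact coefficients:

```python
def calculate_rope_alpha(base_seq_len: int, target_seq_len: int) -> float:
    # NTK-aware alpha grows with how far past the base length we stretch
    ratio = target_seq_len / base_seq_len
    if ratio <= 1:
        return 1.0
    return -0.13436 + 0.80541 * ratio + 0.28833 * ratio**2
```

Overriding the base length (e.g. 4096 instead of 32k for Mistral) changes the ratio, and therefore the alpha, without touching max_seq_len.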
Previously, the max sequence length was overridden by the user's
config and never took the model's config.json into account.
Now, set the default to 4096, but include config.prepare when
selecting the max sequence length. The YAML and API request values
now serve as overrides rather than the sole source.
Signed-off-by: kingbri <bdashore3@proton.me>
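A sketch of the fallback chain, assuming exllamav2's config API (user_max_seq_len stands in for the YAML/API value):

```python
from exllamav2 import ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = str(model_path)
config.prepare()  # populates config.max_seq_len from the model's config.json

# User values override the model's own value; fall back to 4096 only
# if neither source provides one
config.max_seq_len = user_max_seq_len or config.max_seq_len or 4096
```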
Let the user know when a file-not-found error (OSError) occurs, and
print the applied template on model load.
Also fix some remaining references to Fastchat.
Signed-off-by: kingbri <bdashore3@proton.me>
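Roughly (the loader and logging are illustrative):

```python
try:
    template = load_template(template_path)  # hypothetical loader
    print(f"Using chat template: {template_path.name}")
except OSError:
    print(f"Chat template file not found: {template_path}")
    template = None
```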
Use exllamav2's token bias, which is the functional equivalent of
OAI's logit bias parameter.
Strings are cast to integers on request, and an error is raised if
an invalid value is passed.
Signed-off-by: kingbri <bdashore3@proton.me>
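A sketch of the cast, assuming the sampler accepts a per-token bias tensor over the vocabulary (as exllamav2's token_bias does); the helper name is mine:

```python
import torch

def build_token_bias(logit_bias: dict, vocab_size: int) -> torch.Tensor:
    # OAI clients send {"token_id": bias} with string keys
    bias = torch.zeros(vocab_size)

    for token, value in logit_bias.items():
        try:
            bias[int(token)] = value
        except ValueError as exc:
            raise ValueError(f"Invalid logit bias token id: {token}") from exc

    return bias
```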
Append a generation prompt if the flag is given on an OAI chat
completion request.
This appends the "assistant" message to the instruct prompt. Defaults
to true since this is the intended behavior.
Signed-off-by: kingbri <bdashore3@proton.me>
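In HF-style templates this is just a render variable; a sketch with illustrative names:

```python
prompt = template.render(
    messages=messages,
    bos_token=bos_token,
    eos_token=eos_token,
    # When true, the template emits the empty assistant turn that the
    # model is expected to complete
    add_generation_prompt=request.add_generation_prompt,
)
```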
OSError means that a file wasn't found, so the auth tokens should
be regenerated. Otherwise, raise the error and exit.
Signed-off-by: kingbri <bdashore3@proton.me>
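The shape of the fix (filename and token layout are illustrative):

```python
import secrets
import yaml

try:
    with open("api_tokens.yml") as auth_file:
        keys = yaml.safe_load(auth_file)
except OSError:
    # File missing or unreadable: regenerate a fresh set of tokens
    keys = {"api_key": secrets.token_hex(16), "admin_key": secrets.token_hex(16)}
    with open("api_tokens.yml", "w") as auth_file:
        yaml.safe_dump(keys, auth_file)
```

Any other exception propagates and exits as before.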
Validation wasn't properly run on older Pydantic versions, so a
ChatCompletionRespChoice was being sent instead of a
ChatCompletionMessage when streaming responses.
Signed-off-by: kingbri <bdashore3@proton.me>
Adding field descriptions shows which parameters are used solely for
OAI compliance and aren't actually parsed in the model code.
Signed-off-by: kingbri <bdashore3@proton.me>
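For example (field names are illustrative):

```python
from typing import Optional

from pydantic import BaseModel, Field

class ChatCompletionRequest(BaseModel):
    # Parsed and forwarded to the generator
    max_tokens: int = 150

    # Accepted for OAI spec compliance, but never read by the model code
    user: Optional[str] = Field(
        default=None,
        description="Unused parameter. Present for OAI compliance only.",
    )
```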
Jinja2 is a lightweight template parser that's used in Transformers
for rendering chat templates. It's much more efficient than Fastchat
and can be imported as part of the requirements.
This also allows Pydantic's version to be unpinned.
Users now have to provide their own template if needed. A separate
repo may be usable for common prompt template storage.
Signed-off-by: kingbri <bdashore3@proton.me>
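Loading a user-provided template then becomes a few lines of stock Jinja2 (the function name is mine):

```python
import pathlib

from jinja2 import Environment

def load_chat_template(path: pathlib.Path):
    # Compile the template file once so renders are cheap
    env = Environment(trim_blocks=True, lstrip_blocks=True)
    return env.from_string(path.read_text())
```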
New parameter that's safe to edit in exllamav2 v0.0.11. Only recommended
for people who know what they're doing.
Signed-off-by: kingbri <bdashore3@proton.me>
RoPE alpha changes don't require removing the 1.0 default
from RoPE scale.
Keep defaults when possible to avoid errors.
Signed-off-by: kingbri <bdashore3@proton.me>