Many wheels may be built from source, so add an option to skip wheel
upgrades/installation when the user runs the start script.
Signed-off-by: kingbri <bdashore3@proton.me>
Use the absolute path when loading the config file. This makes
locating and loading the correct file safer.
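A minimal sketch of the idea, not the actual change. The helper name
resolve_config_path and the default file name config.yml are
assumptions made for illustration.

```python
from pathlib import Path


def resolve_config_path(name: str = "config.yml") -> Path:
    """Return an absolute path for a config file name (hypothetical helper)."""
    # resolve() anchors a relative path to the current working directory,
    # so later lookups do not depend on how the path was originally written.
    return Path(name).resolve()
```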
Signed-off-by: kingbri <bdashore3@proton.me>
Similar to the transformers library, add an error handler for when an
exception is raised. This relays the error to the user.
Signed-off-by: kingbri <bdashore3@proton.me>
These are commonly seen in chat templates provided by Hugging Face and
aren't difficult to add.
For feature parity, honor the add_bos_token and ban_eos_token
parameters when constructing the prompt.
Signed-off-by: kingbri <bdashore3@proton.me>
This creates a massive security hole, but it's gated behind a flag
for users who only use localhost.
A warning will pop up when users disable authentication.
Signed-off-by: kingbri <bdashore3@proton.me>
Fix redundancy in code when loading templates. However, loading
a template from config.json may be a mistake since tokenizer_config.json
is the main place where chat templates are stored.
Signed-off-by: kingbri <bdashore3@proton.me>
Non-streaming tasks were not regulated by the semaphore, causing these
tasks to interfere with streaming generations. Add helper functions
that accept both sync and async functions as callbacks and run them
sequentially, blocking on the semaphore.
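One way such a helper could look, as a sketch only. The name
call_with_semaphore and its signature are assumptions, not the helpers
added here.

```python
import asyncio
import inspect


async def call_with_semaphore(semaphore: asyncio.Semaphore, callback):
    """Run a zero-argument sync or async callback while holding the semaphore."""
    async with semaphore:
        result = callback()
        # An async function returns an awaitable that still has to be awaited;
        # a sync function has already produced its final value.
        if inspect.isawaitable(result):
            result = await result
        return result
```

With a semaphore of size 1, every caller waits its turn, so a
non-streaming task cannot run while a streaming generation holds the
semaphore.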
Signed-off-by: kingbri <bdashore3@proton.me>
When stream is false, the generation can be empty, which means
that there are no chunks present in the final generation array,
causing an error.
Instead, return a dummy value if generation is falsy (empty array
or None).
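The guard can be sketched as follows. The function name final_chunk and
the shape of the dummy value are assumptions for illustration.

```python
def final_chunk(generation, dummy):
    """Return the last chunk, or the dummy value when nothing was generated."""
    # "not generation" is true for both None and an empty list.
    if not generation:
        return dummy
    return generation[-1]
```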
Signed-off-by: kingbri <bdashore3@proton.me>
Some models (such as mistral and mixtral) set their base sequence
length to 32k due to assumptions of support for sliding window
attention.
Therefore, add this parameter to override the base sequence length
of a model which helps with auto-calculation of rope alpha.
If auto-calculation of rope alpha isn't being used, the max_seq_len
parameter works fine as is.
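To illustrate why the base length matters, the sketch below computes
only the ratio of the target length to the base length. The function
and argument names are assumptions, and the curve that maps this ratio
to a rope alpha value is left out because it is not described here.

```python
def rope_scale_ratio(max_seq_len: int, base_seq_len: int) -> float:
    """Ratio of the requested context length to the model's base length."""
    if base_seq_len <= 0:
        raise ValueError("base_seq_len must be positive")
    return max_seq_len / base_seq_len
```

With a reported base of 32768, a target of 8192 gives a ratio of 0.25,
which suggests no scaling is needed. Overriding the base to 4096 gives
a ratio of 2.0 for the same target.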
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, the max sequence length was overridden by the user's
config and never took the model's config.json into account.
Now, set the default to 4096, but include config.prepare when
selecting the max sequence length. The yaml and API request
now serve as overrides rather than parameters.
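The resulting precedence can be sketched as below. The function name
and arguments are assumptions. Only the 4096 default and the ordering
(override, then model config, then default) come from this message.

```python
from typing import Optional

DEFAULT_MAX_SEQ_LEN = 4096


def pick_max_seq_len(
    model_value: Optional[int], override: Optional[int] = None
) -> int:
    """Choose the max sequence length: override > model config > default."""
    if override is not None:
        # A value from the yaml or the API request wins when present.
        return override
    if model_value is not None:
        # Otherwise use what the model's own config reports.
        return model_value
    return DEFAULT_MAX_SEQ_LEN
```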
Signed-off-by: kingbri <bdashore3@proton.me>
Let the user know when a file-not-found error (OSError) occurs and
print the applied template on model load.
Also fix some remaining references to fastchat.
Signed-off-by: kingbri <bdashore3@proton.me>
Use exllamav2's token bias which is the functional equivalent of
OAI's logit bias parameter.
Strings are cast to integers on request, and an error is raised if an
invalid value is passed.
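A sketch of the key conversion, assuming the request carries logit
bias as a mapping of token ID strings to bias values. The function name
parse_logit_bias is hypothetical.

```python
from typing import Dict


def parse_logit_bias(logit_bias: Dict[str, float]) -> Dict[int, float]:
    """Convert string token IDs to integers, rejecting invalid keys."""
    parsed: Dict[int, float] = {}
    for token_id, bias in logit_bias.items():
        try:
            parsed[int(token_id)] = bias
        except ValueError as exc:
            # Surface a clear message instead of a bare conversion error.
            raise ValueError(
                f"Invalid token ID in logit bias: {token_id!r}"
            ) from exc
    return parsed
```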
Signed-off-by: kingbri <bdashore3@proton.me>
Append generation prompts if given the flag on an OAI chat completion
request.
This appends the "assistant" message to the instruct prompt. Defaults
to true since this is intended behavior.
Signed-off-by: kingbri <bdashore3@proton.me>
OSError means that a file wasn't found, which means auth tokens should
be regenerated. Otherwise, raise the error and exit.
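The control flow can be sketched as below. The JSON file format, the
key names, and the function name load_or_create_tokens are assumptions
for illustration.

```python
import json
import secrets
from pathlib import Path


def load_or_create_tokens(path: Path) -> dict:
    """Load auth tokens, regenerating them only when the file cannot be opened."""
    try:
        with open(path, "r", encoding="utf-8") as token_file:
            return json.load(token_file)
    except OSError:
        # The file is missing or unreadable, so create fresh tokens.
        tokens = {
            "api_key": secrets.token_hex(16),
            "admin_key": secrets.token_hex(16),
        }
        with open(path, "w", encoding="utf-8") as token_file:
            json.dump(tokens, token_file)
        return tokens
    # Any other exception (for example a malformed file) propagates to the caller.
```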
Signed-off-by: kingbri <bdashore3@proton.me>
Validation wasn't properly run on older pydantic, so ChatCompletionRespChoice
was being sent instead of a ChatCompletionMessage when streaming
responses.
Signed-off-by: kingbri <bdashore3@proton.me>