Some models (such as mistral and mixtral) set their base sequence
length to 32k because they assume support for sliding window
attention.
Therefore, add a parameter that overrides the base sequence length
of a model, which helps with auto-calculation of rope alpha.
If auto-calculation of rope alpha isn't being used, the max_seq_len
parameter works fine as is.
Signed-off-by: kingbri <bdashore3@proton.me>
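
A minimal sketch of how an overridden base sequence length could feed
into automatic rope alpha calculation. The function name, the
`override_base_seq_len` argument, and the polynomial coefficients are
assumptions for illustration; none of them are confirmed by this commit.

```python
from typing import Optional


def auto_rope_alpha(
    max_seq_len: int,
    model_base_seq_len: int,
    override_base_seq_len: Optional[int] = None,
) -> float:
    """Estimate rope alpha from the ratio of target to base sequence length."""
    # Prefer the user-supplied base length when the model's config reports
    # an inflated value (e.g. 32k because of sliding window attention).
    base_seq_len = override_base_seq_len or model_base_seq_len
    ratio = max_seq_len / base_seq_len

    # No scaling is needed when the target fits within the base length.
    if ratio <= 1.0:
        return 1.0

    # Hypothetical quadratic fit; the coefficients are illustrative only.
    return -0.13436 + 0.80541 * ratio + 0.28833 * ratio**2
```

With a reported base of 32768, a 16384-token context gives a ratio below
1 and no scaling. Overriding the base to 4096 gives a ratio of 4, so the
sketch returns an alpha above 1.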
Previously, the max sequence length was overridden by the user's
config and never took the model's config.json into account.
Now, set the default to 4096, but include config.prepare when
selecting the max sequence length. The yaml and API request
now serve as overrides rather than parameters.
Signed-off-by: kingbri <bdashore3@proton.me>
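
The precedence described above can be sketched roughly as follows. The
function name and the use of the `max_position_embeddings` key are
assumptions made for illustration; the real code relies on
config.prepare to read the model's config.json.

```python
from typing import Optional

# Fallback used when neither the model nor the user supplies a value.
DEFAULT_MAX_SEQ_LEN = 4096


def resolve_max_seq_len(
    model_config: dict,
    user_override: Optional[int] = None,
) -> int:
    """Pick the max sequence length: override > model config > default."""
    # Start from the default, then let the model's own config refine it.
    max_seq_len = model_config.get("max_position_embeddings", DEFAULT_MAX_SEQ_LEN)

    # A value from the yaml or the API request acts as an override.
    if user_override is not None:
        max_seq_len = user_override

    return max_seq_len
```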
Adding field descriptions shows which parameters exist solely for
OAI compliance and are not actually parsed in the model code.
Signed-off-by: kingbri <bdashore3@proton.me>
New parameter that's safe to edit in exllamav2 v0.0.11. Only recommended
for people who know what they're doing.
Signed-off-by: kingbri <bdashore3@proton.me>
Rope alpha changes don't require removing the 1.0 default
from Rope scale.
Keep defaults when possible to avoid errors.
Signed-off-by: kingbri <bdashore3@proton.me>
Mistakenly forgot that the user can choose which cache mode to use
when loading a model.
Also add the cache mode when fetching model info.
Signed-off-by: kingbri <bdashore3@proton.me>
Generations can be logged in the console along with sampling parameters
if the user enables it in config.
Metrics are always logged at the end of each prompt. In addition,
the model endpoint tells the user whether they're being logged
for transparency purposes.
Signed-off-by: kingbri <bdashore3@proton.me>
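
A rough sketch of the logging behavior described above: prompt and
sampling parameter logging are opt-in, while metrics are always emitted.
The `LogPreferences` class, its field names, and `build_log_lines` are
hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LogPreferences:
    """Hypothetical opt-in flags, both disabled unless set in config."""
    prompt: bool = False
    generation_params: bool = False


def build_log_lines(
    prefs: LogPreferences, prompt: str, params: dict, metrics: str
) -> List[str]:
    """Collect the console lines for one request."""
    lines = []
    if prefs.prompt:
        lines.append(f"Prompt: {prompt}")
    if prefs.generation_params:
        lines.append(f"Sampling params: {params}")
    # Metrics are logged regardless of the user's preferences.
    lines.append(f"Metrics: {metrics}")
    return lines
```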
Sometimes fastchat may not be able to detect the prompt template from
the model path. Therefore, add the ability to set it in config.yml or
via the request object itself.
Also send the provided prompt template on model info request.
Signed-off-by: kingbri <bdashore3@proton.me>
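
One way the template selection could work is sketched below. The
function name and the ordering between the request and config.yml are
assumptions; the commit only states that both sources are supported when
detection from the model path fails.

```python
from typing import Optional


def resolve_prompt_template(
    request_template: Optional[str],
    config_template: Optional[str],
    detected_template: Optional[str],
) -> Optional[str]:
    """Return the first template that is set, or None if none is."""
    # Assumed order: request value, then config.yml, then auto-detection.
    for candidate in (request_template, config_template, detected_template):
        if candidate:
            return candidate
    return None
```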
Draft wasn't being parsed correctly after the recent changes that removed
the draft_enabled bool. There's still some more work to be done with
returning exceptions.
Signed-off-by: kingbri <bdashore3@proton.me>
Models can be loaded with a child object called "draft" in the POST
request. Again, models need to be located within the draft model dir
to get loaded.
Signed-off-by: kingbri <bdashore3@proton.me>
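
A minimal sketch of a load request body with a nested "draft" object.
The key names `name` and `draft_model_name` and the helper function are
assumptions for illustration; check the request model for the actual
fields.

```python
import json
from typing import Optional


def build_load_request(model_name: str, draft_model_name: Optional[str] = None) -> str:
    """Serialize a model load request, optionally with a draft child object."""
    payload = {"name": model_name}

    # The draft model is described by a child object rather than a flag.
    if draft_model_name is not None:
        payload["draft"] = {"draft_model_name": draft_model_name}

    return json.dumps(payload)
```

Omitting the draft argument leaves the "draft" key out entirely, so no
separate enable flag is needed in this sketch.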