When fetching special tokens from the model, don't factor in the
add_bos_token and ban_eos_token parameters as switches.
In addition, change the internal handling of add_bos_token to an optional
boolean. This allows us to fall back to the model when selecting whether
or not to add the BOS token, especially for chat completions.
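A minimal sketch of the optional-boolean fallback described above. The function name and the `model_default` argument are hypothetical and are not taken from the codebase:

```python
from typing import Optional


def resolve_add_bos_token(requested: Optional[bool], model_default: bool) -> bool:
    """Return the request's choice if one was made, else the model's default."""
    # None means the caller did not specify a value, so defer to the model.
    if requested is None:
        return model_default
    return requested
```

An explicit `False` from the request still wins over the model default; only an unset value falls through.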
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Since jobs are tracked via request IDs now, each generation task should
be uniquely identified in the event of cancellation.
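As an illustration only, one way to give each generation task its own ID so that a cancellation targets a single job. The `JobRegistry` class is hypothetical and is not the project's job tracker:

```python
import uuid
from typing import Dict


class JobRegistry:
    """Hypothetical in-memory registry keyed by request ID."""

    def __init__(self) -> None:
        self._cancelled: Dict[str, bool] = {}

    def start(self) -> str:
        # A fresh UUID per task keeps jobs from colliding.
        request_id = uuid.uuid4().hex
        self._cancelled[request_id] = False
        return request_id

    def cancel(self, request_id: str) -> bool:
        """Mark one job as cancelled; returns False for unknown IDs."""
        if request_id not in self._cancelled:
            return False
        self._cancelled[request_id] = True
        return True

    def is_cancelled(self, request_id: str) -> bool:
        return self._cancelled.get(request_id, False)
```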
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Jobs should be started and immediately cleaned up when calling the
generation stream. Expose a stream_generate function and append
this to the base class since it's more idiomatic than generate_gen.
The exl2 container's generate_gen function is now internal.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
kwargs is pretty ugly when figuring out which arguments to use. The
base request falls back to defaults anyway, so pass in the params
object as is.
However, since Python's typing isn't like TypeScript where types
can be transformed, the type hinting has a possibility of None showing
up despite there always being a value for some params.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
* Add non-JSON version of `tools` and `functions` to `template_vars`.
Increase compatibility with vLLM templates that use a non-JSON tools object.
* Add list of tool template variables to the documentation
* Use Jinja templates to provide `tools_json` and `functions_json`
This should be functionally equivalent, but the JSON won't be produced
unless it's needed.
* Make message.tool_calls match the JSON from ToolCallProcessor
* Log something when generating tool calls
* Add template for Qwen QwQ 32b
* Only log if tool calls have been detected
* API: Fix tool call variable assignments
Jinja functions do not run when variables are called. Use json.dumps
instead. In addition, log the request ID when stating that a tool
call was fired.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
* Add `ToolCallProcessor.dump()` to get the list of processed dicts
* Remove qwen_qwq_32b.jinja
This will be added to the following repository at a later date:
https://github.com/theroyallab/llm-prompt-templates
---------
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Co-authored-by: kingbri <8082010+kingbri1@users.noreply.github.com>
* Fix Tool Call JSON Serialization Error
* Incorporate changes from PR 292
kingbri note: Adjusts the tool JSON formation and incorporates finish
reasons. Added both authors as co-authors due to edits on this commit
from the original PR.
Co-Authored-by: David Allada <dallada1@vt.edu>
Co-Authored-by: Benjamin Oldenburg <benjamin.oldenburg@ordis.co.th>
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
* API: Cleanup tool call JSON parsing
Split pre and post-processing of tool calls to its own class. This
cleans up the chat_completion utility module and also fixes the
JSON serialization bug.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
---------
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Co-authored-by: David Allada <dallada1@vt.edu>
Co-authored-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Infinity expects a list when embedding, so convert to a list if the
input is a string.
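The conversion above could look roughly like this. The helper name is an assumption and does not call Infinity itself:

```python
from typing import List, Union


def normalize_embedding_input(value: Union[str, List[str]]) -> List[str]:
    """Wrap a lone string in a list so the backend always receives a list."""
    if isinstance(value, str):
        return [value]
    # Copy so the caller's list is not mutated downstream.
    return list(value)
```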
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
The strings weren't being concatenated properly. Only add the combined
text if the chat completion type is a List.
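A sketch of the intended behavior, assuming list content follows the OAI shape of `{"type": "text", "text": ...}` parts. The function name and the empty-string join are assumptions:

```python
from typing import Dict, List, Union

ContentPart = Dict[str, str]


def flatten_content(content: Union[str, List[ContentPart]]) -> str:
    """Combine text parts only when the content is a list; pass strings through."""
    if isinstance(content, list):
        # Non-text parts (e.g. images) are skipped here.
        return "".join(
            part.get("text", "") for part in content if part.get("type") == "text"
        )
    return content
```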
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, the flow for parsing chat completion messages and rendering
from the prompt template was disconnected between endpoints. Now, create
a common function to render and handle everything appropriately afterwards.
Signed-off-by: kingbri <bdashore3@proton.me>
Migrate the add method into the class itself. Also, a BaseModel isn't
needed here since this isn't a serialized class.
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, the messages were a list of dicts. These are untyped
and don't provide strict hinting. Add types for chat completion
messages and reformat existing code.
Signed-off-by: kingbri <bdashore3@proton.me>
* More robust checks for OAI chat completion message lists on /v1/encode endpoint
* Added TODO to support other aspects of chat completions
* Fix oversight where embeddings was not defined in advance on /v1/chat/completions endpoint
* Support image_url inputs containing URLs or base64 strings following OAI vision spec
* Use async lru cache for image embeddings
* Add generic wrapper class for multimodal embeddings
If an API key sends a dummy model, it shouldn't error as the server
is catering to clients that expect specific OAI model names. This
is a problem with inline model loading since these names would error
by default. Therefore, add an exception if the provided name is in the
dummy model names (which also doubles as inline strict exceptions).
However, the dummy model names weren't configurable, so add a new
option to specify exception names, otherwise the default is gpt-3.5-turbo.
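The exception could be sketched as below. Only the `gpt-3.5-turbo` default comes from this message; the function and parameter names are hypothetical:

```python
from typing import Collection, Optional

# Default named in the commit message; the constant name is an assumption.
DEFAULT_DUMMY_MODEL_NAMES = ("gpt-3.5-turbo",)


def needs_inline_load(
    requested_name: Optional[str],
    loaded_name: Optional[str],
    dummy_names: Collection[str] = DEFAULT_DUMMY_MODEL_NAMES,
) -> bool:
    """Decide whether a request should trigger an inline model load."""
    if not requested_name:
        return False
    # Dummy names are placeholders sent by OAI clients, so never load them.
    if requested_name in dummy_names:
        return False
    return requested_name != loaded_name
```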
Signed-off-by: kingbri <bdashore3@proton.me>
The admin key check was running even if inline loading was disabled.
Fix this bug, but also preserve the existing permission system when
inline loading is enabled.
Signed-off-by: kingbri <bdashore3@proton.me>
* improve validation
* remove to_gen_params functions
* update changes for all endpoint types
* OAI: Fix calls to generation
Chat completion and completion need to have prompt split out before
pushing to the backend.
Signed-off-by: kingbri <bdashore3@proton.me>
* Sampling: Convert Top-K values of -1 to 0
Some OAI implementations use -1 as disabled instead of 0. Therefore,
add a coalesce case.
Signed-off-by: kingbri <bdashore3@proton.me>
* Sampling: Format and space out
Make the code more readable.
Signed-off-by: kingbri <bdashore3@proton.me>
* Sampling: Fix mirostat
Field items are nested in data within a Pydantic FieldInfo
Signed-off-by: kingbri <bdashore3@proton.me>
* Sampling: Format
Signed-off-by: kingbri <bdashore3@proton.me>
* Sampling: Fix banned_tokens and allowed_tokens conversion
If the provided string has whitespace, trim it before splitting.
Signed-off-by: kingbri <bdashore3@proton.me>
* Sampling: Add helpful log to dry_sequence_breakers
Let the user know if the sequence errors out.
Signed-off-by: kingbri <bdashore3@proton.me>
* Sampling: Apply validators in right order
Validators need to be applied in order from top to bottom, this is why
the after validator was not being applied properly.
Set the model to validate default params for sampler override purposes.
This can be turned off if there are unclear errors.
Signed-off-by: kingbri <bdashore3@proton.me>
* Endpoints: Format
Cleanup and semantically fix field validators
Signed-off-by: kingbri <bdashore3@proton.me>
* Kobold: Update validators and fix parameter application
Validators on parent fields cannot see child fields. Therefore,
validate using the child fields instead and alter the parent field
data from there.
Also fix badwordsids casting.
Signed-off-by: kingbri <bdashore3@proton.me>
* Sampling: Remove validate defaults and fix mirostat
If a user sets an override to a non-default value, that's their
own fault.
Run validator on the actual mirostat_mode parameter rather than
the alternate mirostat parameter.
Signed-off-by: kingbri <bdashore3@proton.me>
* Kobold: Rework badwordsids
Currently, this serves to ban the EOS token. All other functionality
was legacy, so remove it.
Signed-off-by: kingbri <bdashore3@proton.me>
* Model: Remove HuggingfaceConfig
This was only necessary for badwordsids. All other fields are handled
by exl2. Keep the class as a stub if it's needed again.
Signed-off-by: kingbri <bdashore3@proton.me>
* Kobold: Bump kcpp impersonation
TabbyAPI supports XTC now.
Signed-off-by: kingbri <bdashore3@proton.me>
* Sampling: Change alias to validation_alias
Reduces the probability for errors and makes the class consistent.
Signed-off-by: kingbri <bdashore3@proton.me>
* OAI: Use constraints for validation
Instead of adding a model_validator, use greater than or equal to
constraints provided by Pydantic.
Signed-off-by: kingbri <bdashore3@proton.me>
* Tree: Lint
Signed-off-by: kingbri <bdashore3@proton.me>
---------
Co-authored-by: SecretiveShell <84923604+SecretiveShell@users.noreply.github.com>
Co-authored-by: kingbri <bdashore3@proton.me>
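Two of the sampling fixes in the squash above, the Top-K coalescing and the trimming of token-list strings, can be sketched as plain functions. These are illustrations rather than the project's Pydantic validators; the function names and the comma separator are assumptions:

```python
from typing import List


def coalesce_top_k(top_k: int) -> int:
    """Treat -1 (used by some OAI implementations) as 0, i.e. disabled."""
    return 0 if top_k == -1 else top_k


def parse_token_ids(raw: str) -> List[int]:
    """Parse a comma-separated string of token IDs, tolerating whitespace."""
    stripped = raw.strip()
    if not stripped:
        return []
    # Skip empty pieces left by trailing or doubled commas.
    return [int(piece) for piece in stripped.split(",") if piece.strip()]
```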
* Model: Fix inline loading and draft key
There was a lack of foresight between the new config.yml and how
it was structured. The "draft" key became "draft_model" without updating
both the API request and inline loading keys.
For the API requests, still support "draft" as legacy, but the "draft_model"
key is preferred.
Signed-off-by: kingbri <bdashore3@proton.me>
* OAI: Add draft model dir to inline load
Was not pushed before and caused errors of the kwargs being None.
Signed-off-by: kingbri <bdashore3@proton.me>
* Model: Fix draft args application
Draft model args weren't applying since there was a reset due to how
the old override behavior worked.
Signed-off-by: kingbri <bdashore3@proton.me>
* OAI: Change embedding model load params
Use embedding_model_name to be in line with the config.
Signed-off-by: kingbri <bdashore3@proton.me>
* API: Fix parameter for draft model load
Alias name to draft_model_name.
Signed-off-by: kingbri <bdashore3@proton.me>
* API: Fix parameter for template switch
Add prompt_template_name to be more descriptive.
Signed-off-by: kingbri <bdashore3@proton.me>
* API: Fix parameter for model load
Alias name to model_name for config parity.
Signed-off-by: kingbri <bdashore3@proton.me>
* API: Add alias documentation
Signed-off-by: kingbri <bdashore3@proton.me>
---------
Signed-off-by: kingbri <bdashore3@proton.me>
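The legacy-key handling from the first item of the squash above might look like this. It is a sketch over a plain dict; the helper name is hypothetical and the real request model may resolve the keys differently:

```python
from typing import Any, Dict, Optional


def get_draft_args(payload: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Prefer the "draft_model" key, falling back to the legacy "draft" key."""
    if "draft_model" in payload:
        return payload["draft_model"]
    return payload.get("draft")
```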
Make it so any message role can be parsed from a list. Not really
sure why this is the case because system and assistant shouldn't be
sending data other than text, but it also doesn't make much sense
to be extremely strict with roles either.
Signed-off-by: kingbri <bdashore3@proton.me>
When the request is cancelled, cancel the load task. In addition,
when checking if a model container exists, also check if the model
is fully loaded.
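A minimal asyncio sketch of cancelling a load when the request goes away. `load_model` is a stand-in for the real loader, and the `disconnected` event is an assumed signal that the client has dropped:

```python
import asyncio
from typing import List


async def load_model(progress: List[str]) -> None:
    """Stand-in for a slow model load; records how it ended."""
    try:
        await asyncio.sleep(10)
        progress.append("loaded")
    except asyncio.CancelledError:
        progress.append("cancelled")
        raise


async def handle_load_request(disconnected: asyncio.Event, progress: List[str]) -> str:
    load_task = asyncio.create_task(load_model(progress))
    disconnect_task = asyncio.create_task(disconnected.wait())
    done, _ = await asyncio.wait(
        {load_task, disconnect_task}, return_when=asyncio.FIRST_COMPLETED
    )
    if load_task in done:
        disconnect_task.cancel()
        return "loaded"
    # The request ended first, so cancel the load and wait for it to unwind.
    load_task.cancel()
    try:
        await load_task
    except asyncio.CancelledError:
        pass
    return "cancelled"
```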
Signed-off-by: kingbri <bdashore3@proton.me>
- add models for config options
- add function to regenerate config.yml
- replace references to config with pydantic compatible references
- remove unnecessary unwrap() statements
TODO:
- auto generate env vars
- auto generate argparse
- test loading a model
The config categories can have defined separation, but preserve
the dynamic nature of adding new config options by making all the
internal class vars as dictionaries.
This was necessary since storing global callbacks stored a state
of the previous global_config var that wasn't populated.
Signed-off-by: kingbri <bdashore3@proton.me>
If a user requesting a model change isn't admin, error.
Better to place the load function before the generate functions.
Signed-off-by: kingbri <bdashore3@proton.me>
Storing a pathlib type makes it easier to manipulate the model
directory path in the long run without constantly fetching it
from the config.
Signed-off-by: kingbri <bdashore3@proton.me>
Metadata is generated via a template's module. This requires a single
iteration through the template. If a template tries to access a passed
variable that doesn't exist, it will error.
Therefore, generate the metadata at runtime to prevent these errors
from happening. To optimize further, cache the metadata after the
first generation to prevent the expensive call of making a template
module.
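The caching step could be sketched as follows. The `build` callable stands in for the expensive template-module call and the class name is hypothetical:

```python
from typing import Callable, Dict, Optional


class LazyTemplateMetadata:
    """Build metadata on first access, then reuse the cached result."""

    def __init__(self, build: Callable[[], Dict[str, str]]) -> None:
        self._build = build
        self._cached: Optional[Dict[str, str]] = None

    def get(self) -> Dict[str, str]:
        # Only the first call pays for the expensive build.
        if self._cached is None:
            self._cached = self._build()
        return self._cached
```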
Signed-off-by: kingbri <bdashore3@proton.me>
* returning stop str if exists from gen
* added chat template for firefunctionv2
* pulling tool vars from template
* adding parsing for tool inputs/outputs
* passing tool data from endpoint to chat template, adding tool_start to the stop list
* loosened typing on the response tool call, leaning more on the user supplying a quality schema if they want a particular format
* non streaming generation prototype
* cleaning template
* Continued work with type, ingestion into template, and chat template for fire func
* Correction - streaming toolcall comes back as delta obj not inside chatcomprespchoice per chat_completion_chunk.py inside OAI lib.
* Ruff Formatting
* Moved stop string and tool updates out of prompt creation func
Updated tool pydantic to match OAI
Support for streaming
Updated generate tool calls to use flag within chat_template and insert tool reminder
* Llama 3.1 chat templates
Updated fire func template
* renamed llama3.1 to chatml_with_headers..
* update name of template
* Support for calling a tool start token rather than the string.
Simplified tool_params
Warning when gen_settings are being overridden because user set temp to 0
Corrected schema and tools to correct types for function args. Str for some reason
* draft groq tool use model template
* changed headers to vars for readability (but mostly because some models are weird about newlines after headers, so this is an easier way to change globally)
* Clean up comments and code in chat comp
* Post processed tool call to meet OAI spec rather than forcing model to write json in a string in the middle of the call.
* changes example back to args as json rather than string of json
* Standardize chat templates to each other
* cleaning/rewording
* stop elements can also be ints (tokens)
* Cleaning/formatting
* added special tokens for tools and tool_response as specified in description
* Cleaning
* removing aux templates - going to live in llm-prompt-templates repo instead
* Tree: Format
Signed-off-by: kingbri <bdashore3@proton.me>
* Chat Completions: Don't include internal tool variables in OpenAPI
Use SkipJsonSchema to suppress inclusion with the OpenAPI JSON. The
location of these variables may need to be changed in the future.
Signed-off-by: kingbri <bdashore3@proton.me>
* Templates: Deserialize metadata on template load
Since we're only looking for specific template variables that are
static in the template, it makes more sense to render when the template
is initialized.
Signed-off-by: kingbri <bdashore3@proton.me>
* Tools: Fix comments
Adhere to the format style of comments in the rest of the project.
Signed-off-by: kingbri <bdashore3@proton.me>
---------
Co-authored-by: Ben Gitter <gitterbd@gmail.com>
Signed-off-by: kingbri <bdashore3@proton.me>
Use Infinity as a separate backend and handle the model within the
common module. This separates out the embeddings model from the endpoint
which allows for model loading/unloading in core.
Signed-off-by: kingbri <bdashore3@proton.me>