Metadata is generated via a template's module, which requires a single
iteration through the template. If a template tries to access a passed
variable that doesn't exist, it will error.
Therefore, generate the metadata at runtime to prevent these errors
from happening. To optimize further, cache the metadata after the
first generation to avoid the expensive step of creating a template
module again.
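The render-once-then-cache idea can be sketched roughly like this (the Template class, method names, and metadata fields are illustrative stand-ins, not the project's actual names):

```python
# Sketch, assuming hypothetical names: generate template metadata lazily
# at runtime and cache it after the first render.
class Template:
    def __init__(self, source: str):
        self.source = source
        self._metadata = None  # populated on first access
        self.render_count = 0  # only for demonstrating the caching

    def _render_metadata(self) -> dict:
        # Expensive step: building a template module and walking it once
        self.render_count += 1
        return {"stop_strings": [], "tool_start": None}

    @property
    def metadata(self) -> dict:
        # Generated at runtime, but only on the first request
        if self._metadata is None:
            self._metadata = self._render_metadata()
        return self._metadata
```

Because the property only renders when the cache is empty, repeated accesses never rebuild the template module.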
Signed-off-by: kingbri <bdashore3@proton.me>
* returning stop str if exists from gen
* added chat template for firefunctionv2
* pulling tool vars from template
* adding parsing for tool inputs/outputs
* passing tool data from endpoint to chat template, adding tool_start to the stop list
* loosened typing on the response tool call, leaning more on the user supplying a quality schema if they want a particular format
* non streaming generation prototype
* cleaning template
* Continued work with type, ingestion into template, and chat template for fire func
* Correction: a streaming tool call comes back as a delta object, not inside a chat completion response choice, per chat_completion_chunk.py in the OAI lib.
* Ruff formatting
* Moved stop string and tool updates out of prompt creation func
Updated tool pydantic to match OAI
Support for streaming
Updated generate tool calls to use flag within chat_template and insert tool reminder
* Llama 3.1 chat templates
Updated fire func template
* renamed llama3.1 to chatml_with_headers..
* update name of template
* Support for calling a tool start token rather than the string.
Simplified tool_params
Warning when gen_settings are being overridden because user set temp to 0
Corrected schema and tools to the correct types for function args (str, for some reason)
* draft groq tool use model template
* changed headers to vars for readability (but mostly because some models are weird about newlines after headers, so this is an easier way to change them globally)
* Clean up comments and code in chat comp
* Post-processed tool call to meet the OAI spec rather than forcing the model to write JSON in a string in the middle of the call.
* changed example back to args as JSON rather than a string of JSON
* Standardize chat templates to each other
* cleaning/rewording
* stop elements can also be ints (tokens)
* Cleaning/formatting
* added special tokens for tools and tool_response as specified in description
* Cleaning
* removing aux templates - going to live in llm-promp-templates repo instead
* Tree: Format
Signed-off-by: kingbri <bdashore3@proton.me>
* Chat Completions: Don't include internal tool variables in OpenAPI
Use SkipJsonSchema to suppress inclusion in the OpenAPI JSON. The
location of these variables may need to be changed in the future.
Signed-off-by: kingbri <bdashore3@proton.me>
* Templates: Deserialize metadata on template load
Since we're only looking for specific template variables that are
static in the template, it makes more sense to render when the template
is initialized.
Signed-off-by: kingbri <bdashore3@proton.me>
* Tools: Fix comments
Adhere to the format style of comments in the rest of the project.
Signed-off-by: kingbri <bdashore3@proton.me>
---------
Co-authored-by: Ben Gitter <gitterbd@gmail.com>
Signed-off-by: kingbri <bdashore3@proton.me>
ExllamaV2 filters don't allow rewinding, which is what banned
strings uses. Therefore, constrained generation via LMFE or outlines
is not compatible with banned strings for now.
Signed-off-by: kingbri <bdashore3@proton.me>
platform.system() was not called in some places, breaking the
ternary on Windows.
Pip's --upgrade flag does not actually update dependencies to their
latest versions. That's what the --upgrade-strategy eager flag is for.
Tell the user where their start preferences are coming from.
Signed-off-by: kingbri <bdashore3@proton.me>
Install a prebuilt fastparquet wheel for Windows and add setuptools
since torch may require it for some reason.
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, when a SIGINT was emitted while a model load was running,
the API didn't shut down until the load finished due to waiting for
the lock. However, when shutting down, the lock doesn't matter since
the process is being killed anyway.
Signed-off-by: kingbri <bdashore3@proton.me>
Start scripts now don't update dependencies by default due to mishandling
caches from pip. Also add dedicated update scripts and save options
to a JSON file instead of a text one.
Signed-off-by: kingbri <bdashore3@proton.me>
The async signal exit function should be the internal entry point for
exiting the program. In addition, prevent the handler from being called
twice by adding a boolean. This may become an asyncio event later on.
Also, make sure to skip_wait when running model.unload.
Signed-off-by: kingbri <bdashore3@proton.me>
A user's prompt and response can be large in the console. Therefore,
always log the smaller payloads (e.g. gen params and metrics) after
the large chunks.
However, it's recommended to keep prompt logging off anyway since
it'll result in console spam.
Signed-off-by: kingbri <bdashore3@proton.me>
Installing directly from GitHub causes pip's HTTP cache to not
recognize that the correct version of a package is already installed,
which causes a redownload.
The Start.bat script updates dependencies automatically to keep users
on the latest version of each package for security reasons.
A simple pip cache website helps alleviate this problem and allows pip
to find the cached wheels when invoked with an upgrade argument.
Signed-off-by: kingbri <bdashore3@proton.me>
The override wasn't being passed in before. Also, the default is now
None since Exl2 can automatically calculate the max batch size.
Signed-off-by: kingbri <bdashore3@proton.me>
Embedding models are managed on a separate backend, but are run
in parallel with the model itself. Therefore, manage this in a separate
container with separate routes.
Signed-off-by: kingbri <bdashore3@proton.me>
Use Infinity as a separate backend and handle the model within the
common module. This separates out the embeddings model from the endpoint
which allows for model loading/unloading in core.
Signed-off-by: kingbri <bdashore3@proton.me>
Infinity-emb is an async batching engine for embeddings. It is
preferable to sentence-transformers since it handles scalable use cases
without the need for external thread intervention.
Signed-off-by: kingbri <bdashore3@proton.me>