Previously, when a SIGINT was emitted while a model load was running,
the API didn't shut down until the load finished because it was waiting
for the lock. However, when shutting down, the lock doesn't matter since
the process is being killed anyway.
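A minimal sketch of the idea, not the project's actual code. The
ModelContainer class, its load_lock attribute, and the simulated load
are hypothetical; only the notion of unloading without waiting on the
lock comes from the message above.

```python
import asyncio


class ModelContainer:
    """Hypothetical stand-in for a container that guards loads with a lock."""

    def __init__(self):
        self.load_lock = asyncio.Lock()
        self.loaded = False

    async def load(self, seconds: float):
        async with self.load_lock:
            await asyncio.sleep(seconds)  # simulate a long model load
            self.loaded = True

    async def unload(self, skip_wait: bool = False):
        if skip_wait:
            # Shutdown path: the process is exiting, so don't wait on the lock
            self.loaded = False
            return
        async with self.load_lock:
            self.loaded = False


async def main() -> float:
    container = ModelContainer()
    load_task = asyncio.create_task(container.load(5.0))
    await asyncio.sleep(0.05)  # give the load time to acquire the lock

    loop = asyncio.get_running_loop()
    start = loop.time()
    await container.unload(skip_wait=True)  # returns without waiting for the load
    elapsed = loop.time() - start

    # Clean up the simulated load so the event loop can close
    load_task.cancel()
    try:
        await load_task
    except asyncio.CancelledError:
        pass
    return elapsed
```

Running asyncio.run(main()) should return well under the 5 second
simulated load time, since the unload never touches the lock.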
Signed-off-by: kingbri <bdashore3@proton.me>
Start scripts no longer update dependencies by default because pip
mishandles its caches. Also add dedicated update scripts and save
options to a JSON file instead of a text one.
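Saving options as JSON could look roughly like the sketch below. The
update_deps key, the defaults, and the function names are assumptions
made for illustration and are not taken from the start scripts.

```python
import json
from pathlib import Path

# Hypothetical default: dependency updates are off unless requested
DEFAULT_OPTIONS = {"update_deps": False}


def load_options(path: Path) -> dict:
    """Read saved options, falling back to defaults for missing keys."""
    if not path.exists():
        return dict(DEFAULT_OPTIONS)
    with path.open("r", encoding="utf-8") as f:
        saved = json.load(f)
    return {**DEFAULT_OPTIONS, **saved}


def save_options(path: Path, options: dict) -> None:
    """Write options back as indented JSON."""
    with path.open("w", encoding="utf-8") as f:
        json.dump(options, f, indent=2)
```

A structured file keeps each option under a named key, so new options
can be added later without changing how older files are parsed.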
Signed-off-by: kingbri <bdashore3@proton.me>
The async signal exit function should be the internal function for
exiting the program. In addition, prevent the handler from being called
twice by adding a boolean. May become an asyncio event later on.
Also, make sure to pass skip_wait when running model.unload.
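The boolean guard can be sketched as follows. ShutdownGuard and its
cleanup callback are hypothetical names; the real handler may be
structured differently.

```python
import asyncio


class ShutdownGuard:
    """Hypothetical wrapper that runs shutdown cleanup at most once."""

    def __init__(self, cleanup):
        self._cleanup = cleanup  # async callable, e.g. one that unloads the model
        self.shutting_down = False

    async def handle(self) -> bool:
        if self.shutting_down:
            return False  # a second signal arrived; ignore it
        # Set the flag before the first await so a concurrent call sees it
        self.shutting_down = True
        await self._cleanup()
        return True
```

Because the flag is set before any await, a second signal that arrives
while cleanup is still in progress returns immediately.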
Signed-off-by: kingbri <bdashore3@proton.me>
A user's prompt and response can be large in the console. Therefore,
always log the smaller payloads (e.g. gen params + metrics) after
the large chunks.
However, it's recommended to keep prompt logging off anyway since
it'll result in console spam.
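The ordering described above could be sketched like this. The function
name, the logger name, and the log_prompt flag are assumptions for
illustration only.

```python
import logging

logger = logging.getLogger("generation")  # hypothetical logger name


def log_generation(prompt, response, params, metrics, log_prompt=False):
    """Emit large chunks first so the small payloads end up last."""
    lines = []
    if log_prompt:
        # Large chunks: these can scroll for many screens
        lines.append(f"Prompt: {prompt}")
        lines.append(f"Response: {response}")
    # Small payloads go last so they stay visible at the bottom
    lines.append(f"Generation options: {params}")
    lines.append(f"Metrics: {metrics}")
    for line in lines:
        logger.info(line)
    return lines
```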
Signed-off-by: kingbri <bdashore3@proton.me>
Installing directly from GitHub prevents pip's HTTP cache from
recognizing that the correct version of a package is already installed.
This causes a redownload.
When using the Start.bat script, dependencies are updated automatically
to keep users on the latest package versions for security reasons.
A simple pip cache website helps alleviate this problem and allows pip
to find the cached wheels when invoked with an upgrade argument.
Signed-off-by: kingbri <bdashore3@proton.me>
The override wasn't being passed in before. Also, the default is now
none since Exl2 can automatically calculate the max batch size.
Signed-off-by: kingbri <bdashore3@proton.me>
Embedding models are managed on a separate backend, but are run
in parallel with the model itself. Therefore, manage this in a separate
container with separate routes.
Signed-off-by: kingbri <bdashore3@proton.me>
Use Infinity as a separate backend and handle the model within the
common module. This separates the embeddings model from the endpoint,
which allows for model loading/unloading in core.
Signed-off-by: kingbri <bdashore3@proton.me>
Infinity-emb is an async batching engine for embeddings. This is
preferable to sentence-transformers since it handles scalable use cases
without the need for external thread intervention.
Signed-off-by: kingbri <bdashore3@proton.me>
This is necessary for Kobold's API. Current models use bad_words_ids
in generation_config.json, but for some reason, they're also present
in the model's config.json.
Signed-off-by: kingbri <bdashore3@proton.me>
Some of the parameters the API provides are aliases for their OAI
equivalents. It makes more sense to move them to the common file.
Signed-off-by: kingbri <bdashore3@proton.me>
Reduces dependency size since the full fastapi package isn't required.
Add httptools since it makes requests faster and it was installed
with fastapi previously.
Signed-off-by: kingbri <bdashore3@proton.me>
Realtime process priority directs system resources toward tabby's
processes. Running as administrator grants realtime priority,
while running as a normal user sets high priority.
Signed-off-by: kingbri <bdashore3@proton.me>
These are faster event loops for asyncio which should improve overall
performance. Gate them behind an experimental flag for now so they can
be stress tested.
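One way to gate an alternative loop behind a flag is sketched below.
The message does not name the loop packages; uvloop (non-Windows) and
winloop (Windows), along with winloop exposing new_event_loop the way
uvloop does, are assumptions here. The function names are hypothetical.

```python
import asyncio
import importlib
import sys


def pick_loop_factory(experimental: bool):
    """Return a callable that creates an event loop."""
    if not experimental:
        return asyncio.new_event_loop
    # Assumed package names; neither is named in the commit message
    name = "winloop" if sys.platform == "win32" else "uvloop"
    try:
        module = importlib.import_module(name)
    except ImportError:
        # Fall back to the stock loop if the package isn't installed
        return asyncio.new_event_loop
    return module.new_event_loop


def run(coro, experimental: bool = False):
    """Run a coroutine to completion on the selected loop."""
    loop = pick_loop_factory(experimental)()
    try:
        return loop.run_until_complete(coro)
    finally:
        loop.close()
```

With the flag off, behavior is identical to the stock asyncio loop, so
the experimental path can be tested without affecting other users.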
Signed-off-by: kingbri <bdashore3@proton.me>
Add an API parameter to set the timeout in seconds. Keep it as None
by default for uninterrupted downloads.
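The semantics of an optional timeout can be illustrated with the
standard library alone. This is only a sketch: the real change
presumably configures the HTTP client's own timeout, and the download
function and fetch callback here are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def download(
    fetch: Callable[[], Awaitable[bytes]],
    timeout: Optional[float] = None,
) -> bytes:
    """Await fetch(), giving up after `timeout` seconds if one is set."""
    # asyncio.wait_for waits indefinitely when timeout is None
    return await asyncio.wait_for(fetch(), timeout=timeout)
```

Passing a number raises a timeout error once the limit is exceeded,
while the default of None lets a long download run to completion.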
Signed-off-by: kingbri <bdashore3@proton.me>
This prevents TimeoutErrors from showing up. However, a longer
timeout may be necessary since this is in the API. Turning it off
for now will help resolve immediate errors.
Signed-off-by: kingbri <bdashore3@proton.me>