A user's prompt and response can take up a lot of console space.
Therefore, always log the smaller payloads (e.g. gen params + metrics)
after the large chunks.
However, it's recommended to keep prompt logging off anyway since
it'll result in console spam.
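
A minimal sketch of the ordering described above, assuming a plain
`logging` logger. The function names and the field layout are
hypothetical and only illustrate "large chunks first, small payloads
last":

```python
import logging
from typing import Any, Dict, List

logger = logging.getLogger("request_log")  # hypothetical logger name


def build_log_lines(
    prompt: str,
    response: str,
    gen_params: Dict[str, Any],
    metrics: Dict[str, Any],
) -> List[str]:
    """Return log lines with the large chunks first and small payloads last."""
    return [
        f"Prompt:\n{prompt}",
        f"Response:\n{response}",
        # Small payloads go last so they stay visible at the bottom of the console
        f"Generation params: {gen_params}",
        f"Metrics: {metrics}",
    ]


def log_request(prompt, response, gen_params, metrics) -> None:
    """Emit the lines in order through the logger."""
    for line in build_log_lines(prompt, response, gen_params, metrics):
        logger.info(line)
```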
Signed-off-by: kingbri <bdashore3@proton.me>
Installing directly from GitHub causes pip's HTTP cache to not
recognize that the correct version of a package is already installed.
This causes a redownload.
The Start.bat script updates dependencies automatically to keep users
on the latest versions of a package for security reasons.
A simple pip cache website helps alleviate this problem and allows pip
to find the cached wheels when invoked with an upgrade argument.
Signed-off-by: kingbri <bdashore3@proton.me>
The override wasn't being passed in before. Also, the default is now
None since Exl2 can automatically calculate the max batch size.
Signed-off-by: kingbri <bdashore3@proton.me>
Embedding models are managed on a separate backend, but are run
in parallel with the model itself. Therefore, manage this in a separate
container with separate routes.
Signed-off-by: kingbri <bdashore3@proton.me>
Use Infinity as a separate backend and handle the model within the
common module. This separates out the embeddings model from the endpoint
which allows for model loading/unloading in core.
Signed-off-by: kingbri <bdashore3@proton.me>
Infinity-emb is an async batching engine for embeddings. This is
preferable to sentence-transformers since it handles scalable use cases
without the need for external thread intervention.
Signed-off-by: kingbri <bdashore3@proton.me>
This is necessary for Kobold's API. Current models use bad_words_ids
in generation_config.json, but for some reason, they're also present
in the model's config.json.
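
A rough sketch of reading `bad_words_ids` from either file, under the
assumption that both are plain JSON files in the model directory. The
helper name and the lookup order are illustrative, not the actual
implementation:

```python
import json
from pathlib import Path
from typing import List


def load_bad_words_ids(model_dir: Path) -> List[List[int]]:
    """Return bad_words_ids from the first config file that defines it."""
    # Assumed lookup order: generation_config.json first, then config.json
    for name in ("generation_config.json", "config.json"):
        path = model_dir / name
        if not path.is_file():
            continue
        with path.open("r", encoding="utf-8") as config_file:
            data = json.load(config_file)
        ids = data.get("bad_words_ids")
        if ids:
            return ids
    # Neither file defines the key
    return []
```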
Signed-off-by: kingbri <bdashore3@proton.me>
Some of the parameters the API provides are aliases for their OAI
equivalents. It makes more sense to move them to the common file.
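
As an illustration of alias handling in one shared place, here is a
small stdlib-only sketch. The alias table and the function name are
hypothetical, and the real code may rely on its validation library's
alias support instead:

```python
from typing import Any, Dict

# Hypothetical alias table: alias name -> canonical (OAI-style) name
PARAM_ALIASES: Dict[str, str] = {
    "max_length": "max_tokens",
    "stop_sequence": "stop",
}


def normalize_params(params: Dict[str, Any]) -> Dict[str, Any]:
    """Map aliased parameter names onto their canonical names."""
    normalized: Dict[str, Any] = {}
    for key, value in params.items():
        canonical = PARAM_ALIASES.get(key, key)
        # An explicitly provided canonical name wins over its alias
        if key != canonical and canonical in normalized:
            continue
        normalized[canonical] = value
    return normalized
```

With this table, `normalize_params({"max_length": 20})` yields
`{"max_tokens": 20}`, and a request that sends both names keeps the
canonical value.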
Signed-off-by: kingbri <bdashore3@proton.me>
Reduces dependency size since the full fastapi package isn't required.
Add httptools since it makes requests faster and it was installed
with fastapi previously.
Signed-off-by: kingbri <bdashore3@proton.me>
Realtime process priority directs system resources to tabby's
processes. Running as administrator will give realtime priority,
while running as a normal user will set high priority instead.
Signed-off-by: kingbri <bdashore3@proton.me>
These are faster event loops for asyncio which should improve overall
performance. Gate these under an experimental flag for now to stress
test these loops.
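
A sketch of gating a faster loop behind a flag. It assumes uvloop as
the alternative loop (the message above does not name the library) and
a hypothetical `use_experimental` flag, and it falls back to the
default asyncio loop when the package is missing:

```python
import asyncio


def select_event_loop(use_experimental: bool) -> str:
    """Install a faster event loop policy if requested and available."""
    if not use_experimental:
        return "asyncio"

    try:
        # Assumption: uvloop is the faster loop; it is not available on Windows
        import uvloop
    except ImportError:
        return "asyncio"

    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
    return "uvloop"
```

The function has to run before the event loop is created, for example
right before `asyncio.run(...)`.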
Signed-off-by: kingbri <bdashore3@proton.me>
Add an API parameter to set the timeout in seconds. Keep it at None
by default for uninterrupted downloads.
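
To illustrate the "None means no timeout" default, here is a
stdlib-only sketch that uses `asyncio.wait_for` and a simulated
download. The function names are hypothetical, and the real downloader
may pass the value to its HTTP client instead:

```python
import asyncio
from typing import Optional


async def fake_download(duration: float) -> str:
    """Stand-in for a real download that takes `duration` seconds."""
    await asyncio.sleep(duration)
    return "done"


async def download(duration: float, timeout: Optional[float] = None) -> str:
    """Run the download; a timeout of None waits indefinitely."""
    return await asyncio.wait_for(fake_download(duration), timeout=timeout)
```

Calling `asyncio.run(download(0.01))` completes without a limit, while
a timeout shorter than the download raises `asyncio.TimeoutError`.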
Signed-off-by: kingbri <bdashore3@proton.me>
This prevents TimeoutErrors from showing up. However, a longer
timeout may be necessary since this is in the API. Turning it off
for now will help resolve immediate errors.
Signed-off-by: kingbri <bdashore3@proton.me>
Always enable the core endpoints and allow servers to be selected
as needed. Use the OAI server by default.
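
One way the selection could look, as a sketch only. The server names
other than OAI and the function name are assumptions made for
illustration:

```python
from typing import Iterable, List, Optional

CORE = "core"
OPTIONAL_SERVERS = ("oai", "kobold")  # hypothetical set of selectable servers
DEFAULT_SERVERS = ("oai",)


def resolve_servers(requested: Optional[Iterable[str]] = None) -> List[str]:
    """Return the servers to mount: core always, the rest as selected."""
    names = [name.lower() for name in requested] if requested else list(DEFAULT_SERVERS)

    enabled = [CORE]  # core endpoints are always enabled
    for name in names:
        # Ignore unknown names and duplicates
        if name in OPTIONAL_SERVERS and name not in enabled:
            enabled.append(name)
    return enabled
```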
Signed-off-by: kingbri <bdashore3@proton.me>
Place OAI-specific routes in the appropriate folder. This is in
preparation for adding new API servers that can be optionally enabled.
Signed-off-by: kingbri <bdashore3@proton.me>
Uvicorn can log in both the request disconnect handler and the
CancelledError. However, these sometimes don't work, so both
need to be checked. Don't log twice if one of them works.
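
A small sketch of the "check both paths, log once" idea, assuming a
per-request helper object. The class and function names are
hypothetical and do not reflect the actual handler code:

```python
import asyncio
from typing import Awaitable, List


class DisconnectReporter:
    """Records a disconnect message at most once per request."""

    def __init__(self, request_id: str) -> None:
        self.request_id = request_id
        self.messages: List[str] = []
        self._reported = False

    def report(self, source: str) -> bool:
        """Record the disconnect unless another path already did."""
        if self._reported:
            return False
        self._reported = True
        self.messages.append(f"Request {self.request_id} disconnected ({source})")
        return True


async def run_generation(reporter: DisconnectReporter, work: Awaitable):
    """Await the work and report a cancellation if it happens."""
    try:
        return await work
    except asyncio.CancelledError:
        # Second path: only records if the disconnect handler did not fire
        reporter.report("CancelledError")
        raise
```

Both the disconnect handler and the `CancelledError` branch call
`report`, and whichever runs first wins.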
Signed-off-by: kingbri <bdashore3@proton.me>
Log all the parts of a request if the config flag is set. The logged
fields are all server side anyway, so nothing is being exposed to
clients.
Signed-off-by: kingbri <bdashore3@proton.me>
This reverts commit 21516bd7b5.
This skips EOS and implementing it the proper way seems more
costly than necessary.
Signed-off-by: kingbri <bdashore3@proton.me>