Commit graph

40 commits

Author SHA1 Message Date
DocShotgun
7380a3b79a Implement lora support (#24)
* Model: Implement basic lora support

* Add ability to load loras from config on launch
* Supports loading multiple loras and lora scaling
* Add function to unload loras

* Colab: Update for basic lora support

* Model: Test vram alloc after lora load, add docs

* Git: Add loras folder to .gitignore

* API: Add basic lora-related endpoints

* Add /loras/ endpoint for querying available loras
* Add /model/lora endpoint for querying currently loaded loras
* Add /model/lora/load endpoint for loading loras
* Add /model/lora/unload endpoint for unloading loras
* Move lora config-checking logic to main.py for better compat with API endpoints

* Revert bad CRLF line ending changes

* API: Add basic lora-related endpoints (fixed)

* Add /loras/ endpoint for querying available loras
* Add /model/lora endpoint for querying currently loaded loras
* Add /model/lora/load endpoint for loading loras
* Add /model/lora/unload endpoint for unloading loras
* Move lora config-checking logic to main.py for better compat with API endpoints

* Model: Unload loras first when unloading model

* API + Models: Cleanup lora endpoints and functions

Condenses down endpoint and model load code. Also makes the routes
behave the same way as model routes to help not confuse the end user.

Signed-off-by: kingbri <bdashore3@proton.me>

* Loras: Optimize load endpoint

Return successes and failures along with consolidating the request
to the rewritten load_loras function.

Signed-off-by: kingbri <bdashore3@proton.me>

---------

Co-authored-by: kingbri <bdashore3@proton.me>
Co-authored-by: DocShotgun <126566557+DocShotgun@users.noreply.github.com>
2023-12-08 23:38:08 -05:00
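The "return successes and failures" behaviour described in the commit above can be sketched roughly as follows. This is only an illustration, not the project's code: the `loader` callable, the `name` and `scaling` request fields, and the shape of the returned dict are all assumptions.

```python
def load_loras(requested, loader):
    """Try to load every requested lora and report which ones worked.

    `requested` is a list of dicts such as {"name": "foo", "scaling": 0.5}
    (hypothetical field names). `loader` is a hypothetical callable that
    raises an exception when a lora cannot be loaded.
    """
    success, failure = [], []
    for lora in requested:
        name = lora.get("name")
        scaling = lora.get("scaling", 1.0)
        try:
            loader(name, scaling)
            success.append(name)
        except Exception:
            # Keep going so one bad lora does not abort the whole request.
            failure.append(name)
    return {"success": success, "failure": failure}


def fake_loader(name, scaling):
    # Stand-in loader for the demo: rejects one specific name.
    if name == "missing":
        raise FileNotFoundError(name)
```

Collecting both lists in one pass lets a client see partial results instead of a single error for the whole batch.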
kingbri
f8e9e22c43 API: Fix model load endpoint with draft
Draft wasn't being parsed correctly with the new changes which removed
the draft_enabled bool. There's still some more work to be done with
returning exceptions.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-06 18:05:55 -05:00
kingbri
8ba3bfa6b3 API: Fix load exception handling
Models do not fully unload if an exception is caught in load. Therefore,
leave it to the client to unload on cancel.

Also add handlers in the event a SSE stream is cancelled. These packets
can't be sent back to the client since the client has severed the
connection, so print them in terminal.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-05 00:23:15 -05:00
kingbri
7c92968558 API: Fix mistaken debug statement
Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-04 18:07:12 -05:00
kingbri
5e54911cc8 API: Fix semaphore handling and chat completion errors
Chat completions previously always yielded a final packet to say that
a generation finished. However, this caused errors that a yield was
executed after GeneratorExit. This is correctly stated because python's
garbage collector can't clean up the generator after exiting due to the
finally block executing.

In addition, SSE endpoints close off the connection, so the finish packet
can only be yielded when the response has completed, so ignore yield on
exception.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-04 15:51:25 -05:00
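A minimal sketch of the pattern the commit above describes, assuming a plain synchronous generator for brevity: the finish packet is yielded only after the loop completes normally, and the `finally` block is used for cleanup alone. The names `stream_chunks` and `released` are hypothetical.

```python
def stream_chunks(chunks, released):
    """Yield chunks, then a finish marker only if the stream completed."""
    try:
        for chunk in chunks:
            yield chunk
        # Reached only when every chunk was sent; a cancelled stream
        # never gets here, so nothing is yielded after GeneratorExit.
        yield "[finish]"
    finally:
        # Cleanup only. Yielding here would raise
        # "RuntimeError: generator ignored GeneratorExit" on close().
        released.append(True)
```

Closing the generator early (as happens when a client disconnects) runs the cleanup without emitting the finish marker.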
kingbri
ed6c962aad API: Fix sequential requests
FastAPI is kinda weird with queueing. If an await is used within an
async def, requests aren't executed sequentially. Get the sequential
requests back by using a semaphore to limit concurrent execution from
generator functions.

Also scaffold the framework to move generator functions to their own
file.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-03 22:54:34 -05:00
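The semaphore approach from the commit above might look something like this. It is a sketch using only `asyncio`, not the project's code; `generate_with_semaphore`, `fake_generator`, and `run_demo` are hypothetical names.

```python
import asyncio


async def fake_generator(name, log):
    # Stand-in for a model generation call that awaits internally.
    log.append(f"{name} start")
    await asyncio.sleep(0.01)
    yield f"{name} chunk"
    log.append(f"{name} end")


async def generate_with_semaphore(semaphore, generator_factory, *args):
    # Hold the semaphore for the whole lifetime of the wrapped generator,
    # so a second request cannot start until the first one has finished.
    async with semaphore:
        async for item in generator_factory(*args):
            yield item


async def run_demo():
    semaphore = asyncio.Semaphore(1)  # one generation at a time
    log = []

    async def consume(name):
        async for _ in generate_with_semaphore(semaphore, fake_generator, name, log):
            pass

    await asyncio.gather(consume("a"), consume("b"))
    return log
```

With a limit of 1, the second request waits at `async with` instead of interleaving with the first at each `await`.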
kingbri
ae69b18583 API: Use FastAPI streaming instead of sse_starlette
sse_starlette kept firing a ping response if it was taking too long
to set an event. Rather than using a hacky workaround, switch to
FastAPI's inbuilt streaming response and construct SSE requests with
a utility function.

This helps the API become more robust and removes an extra requirement.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-01 01:54:35 -05:00
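As a rough illustration of the utility function mentioned in the commit above, an SSE packet is a `data:` line terminated by a blank line. The names `get_sse_packet` and `stream_packets` are hypothetical; in FastAPI, a generator like this could be passed to `StreamingResponse` with `media_type="text/event-stream"`.

```python
import json


def get_sse_packet(json_data: str) -> str:
    """Wrap an already-serialized JSON string in SSE framing."""
    return f"data: {json_data}\n\n"


def stream_packets(chunks):
    # Serialize each chunk and frame it as one SSE event.
    for chunk in chunks:
        yield get_sse_packet(json.dumps(chunk))
```

Because the framing is built by hand, no keep-alive pings are sent unless the application chooses to send them.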
kingbri
6493b1d2aa OAI: Add ability to send dummy models
Some APIs require an OAI model to be sent against the models endpoint.
Fix this by adding a GPT 3.5 turbo entry as first in the list to cover
as many APIs as possible.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-01 00:27:28 -05:00
kingbri
aef411bed5 OAI: Fix chat completion streaming
Chat completions require a finish reason to be provided in the OAI
spec once the streaming is completed. This is different from a non-
streaming chat completion response.

Also fix some errors that were raised from the endpoint.

References #15

Signed-off-by: kingbri <bdashore3@proton.me>
2023-12-01 00:14:24 -05:00
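To illustrate the finish reason requirement from the commit above: in the OAI streaming format, each chunk carries a `delta` and a `finish_reason` that stays null until the last chunk. The helper names below are hypothetical and the chunks are trimmed to the fields relevant here.

```python
def make_stream_chunk(text=None, finish_reason=None):
    """Build a trimmed-down chat completion stream chunk."""
    delta = {"content": text} if text is not None else {}
    return {
        "object": "chat.completion.chunk",
        "choices": [
            {"index": 0, "delta": delta, "finish_reason": finish_reason}
        ],
    }


def stream_chat(pieces):
    # Content chunks leave finish_reason as None...
    for piece in pieces:
        yield make_stream_chunk(text=piece)
    # ...and a final chunk with an empty delta reports why generation ended.
    yield make_stream_chunk(finish_reason="stop")
```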
kingbri
e703c716ee Merge branch 'main' of https://github.com/ziadloo/tabbyAPI into ziadloo-main 2023-11-30 01:01:48 -05:00
kingbri
56f9b1d1a8 API: Add generator error handling
If the generator errors, there's no proper handling to send an error
packet and close the connection.

This is especially important for unloading models if the load fails
at any stage to reclaim a user's VRAM. Raising an exception caused
the model_container object to lock and not get freed by the GC.

This made sense to propagate SSE errors across all generator functions
rather than relying on abort signals.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-30 00:37:48 -05:00
kingbri
2bc3da0155 YAML: Force all files to open with utf8
The default encoding method when opening files on Windows is cp1252
which doesn't support all unicode and can cause issues.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-29 22:04:29 -05:00
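The fix described in the commit above amounts to passing an explicit encoding to `open`. A small sketch, with hypothetical helper names and YAML parsing left out so it needs only the standard library:

```python
import os
import tempfile


def read_config_text(path):
    # An explicit encoding avoids the platform default (cp1252 on many
    # Windows setups), which cannot decode every Unicode character.
    with open(path, "r", encoding="utf8") as config_file:
        return config_file.read()


def write_sample_config(text):
    # Demo helper: write a temporary config file as UTF-8.
    handle, path = tempfile.mkstemp(suffix=".yml")
    with os.fdopen(handle, "w", encoding="utf8") as sample:
        sample.write(text)
    return path
```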
Mehran Ziadloo
b0c42d0f05 Leveraging local variables 2023-11-27 20:56:56 -08:00
Mehran Ziadloo
ead503c75b Adding token usage support 2023-11-27 20:05:05 -08:00
kingbri
d929e0c826 API: Fix error points and exceptions
On /v1/model/load, some internal server errors weren't being sent,
so migrate directory checking out and also add a check to make sure
the proposed model path exists.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-25 00:27:02 -05:00
kingbri
d47c39da54 API: Don't include draft directory in response
The draft directory should be returned for a draft model request (TBD).

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-23 00:07:56 -05:00
kingbri
f47919b1d3 API: Add draft model support
Models can be loaded with a child object called "draft" in the POST
request. Again, models need to be located within the draft model dir
to get loaded.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-19 00:32:25 -05:00
kingbri
27ebec3b35 Model: Add speculative decoding support via config
Speculative decoding makes use of draft models that ingest the prompt
before forwarding it to the main model.

Add options in the config to support this. API options will occur
in a different commit.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-18 01:42:20 -05:00
kingbri
4669e49ff0 API: Fix errors with token endpoint
Handle None cases if the provided text/token lists are empty.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-17 01:39:06 -05:00
kingbri
021981fce0 API: Re-add depends endpoints
Mistakenly removed API key authentication for the models endpoints in
testing.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-17 00:50:42 -05:00
kingbri
ac4e9c2277 API: Add CORS support
Tell CORS to go fly a kite.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-16 22:19:47 -05:00
kingbri
08a183540b Config: Add warning on exceptions and clarify parameters
Due to how YAML works, double quotes are bad. Specify a linter in
the top of the config_sample file.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-16 22:19:47 -05:00
kingbri
282b5b2931 API: Fix responses and some params
Responses were not being properly sent as JSON. Only run pydantic's
JSON function on stream responses. FastAPI does the rest with static
responses.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-16 17:11:55 -05:00
kingbri
d8d61fa19b API: Add fallback if model isn't loaded
Most endpoints require the model to be loaded, so add a depends.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-16 12:20:35 -05:00
kingbri
60eb076b43 Tree: Basic formatting and comments
Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-16 11:48:40 -05:00
kingbri
5defb1b0b4 Config: Fix errors when stuff doesn't exist
Add safe fallbacks if any part of the config tree doesn't exist. This
prevents random internal server errors from showing up.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-16 11:41:03 -05:00
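One way to get the safe fallbacks described in the commit above is to treat any missing or empty config section as an empty dict. This is a sketch under assumptions: `get_section` is a hypothetical helper, and the section and key names in the usage note are illustrative only.

```python
def get_section(config, key):
    """Return a config section as a dict, or {} if it is missing or empty.

    A YAML section with no entries parses to None, so both the top-level
    config and each section are checked before use.
    """
    section = config.get(key) if isinstance(config, dict) else None
    return section if isinstance(section, dict) else {}
```

Callers can then chain `.get` with a default, for example `get_section(config, "network").get("port", 5000)`, without risking an exception when the section is absent.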
kingbri
5e8419ec0c OAI: Add chat completions endpoint
Chat completions is the endpoint that will be used by OAI in the
future. Makes sense to support it even though the completions
endpoint will be used more often.

Also unify common parameters between the chat completion and completion
requests since they're very similar.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-16 01:06:07 -05:00
kingbri
1f444c8fb7 Requirements: Add fastchat and override pydantic
Use an older version of pydantic to stay compatible

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-15 01:00:08 -05:00
kingbri
d0b6b11068 OAI: Make freq and presence pen floats
Also rename the completions typing file.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-15 00:55:15 -05:00
kingbri
126afdfdc2 Model: Fix gpu split params
GPU split auto is a bool and GPU split is an array of integers for
GBs to allocate per GPU.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-15 00:55:15 -05:00
kingbri
8fea5391a8 Api: Add token endpoints
Support for encoding and decoding with various parameters.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-15 00:55:15 -05:00
kingbri
4670a77c26 API: Don't use response_class
This arg in routes caused many errors and isn't even needed for
responses.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-14 22:09:26 -05:00
kingbri
b625bface9 OAI: Add API-based model loading/unloading and auth routes
Models can be loaded and unloaded via the API. Also add authentication
to use the API and for administrator tasks.

Both types of authorization use different keys.

Also fix the unload function to properly free all used vram.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-14 01:17:19 -05:00
kingbri
47343e2f1a OAI: Add models support
The models endpoint fetches all the models that OAI has to offer.
However, since this is an OAI clone, just list the models inside
the user's configured model directory instead.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-13 21:38:34 -05:00
kingbri
eee8b642bd OAI: Implement completion API endpoint
Add support for /v1/completions with the option to use streaming
if needed. Also rewrite API endpoints to use async when possible
since that improves request performance.

Model container parameter names also needed rewrites, and fallback
cases are now set to their disabled values.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-13 18:31:26 -05:00
kingbri
a10c14d357 Config: Switch to YAML and add load progress
YAML is a more flexible format when it comes to configuration. Commandline
arguments are difficult to remember and configure especially for
an API with complicated commandline names. Rather than using half-baked
textfiles, implement a proper config solution.

Also add a progress bar when loading models in the commandline.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-12 00:21:16 -05:00
kingbri
5d32aa02cd Tree: Update to use ModelContainer and args
Use command-line arguments to load an initial model if necessary.
API routes are broken, but we should be using the container from
now on as a primary interface with the exllama2 library.

Also these args should be turned into a YAML configuration file in
the future.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-10 23:19:54 -05:00
Splice86
8e2671a265 Update to README and other minor changes 2023-11-10 01:37:24 -06:00
Splice86
ab84b01fdf Updated readme 2023-11-10 00:39:08 -06:00
david
b967e2e604 Initial 2023-11-09 21:27:45 -06:00