Forgot that the user can choose which cache mode to use when loading a model.
Also include the cache mode when fetching model info.
Signed-off-by: kingbri <bdashore3@proton.me>
Also change wheel_test to safely test imports rather than importing the
package directly, which can cause system issues.
Signed-off-by: kingbri <bdashore3@proton.me>
Generations can be logged in the console along with their sampling parameters
if the user enables it in the config.
Metrics are always logged at the end of each prompt. In addition,
the model endpoint tells the user whether their requests are being logged,
for transparency purposes.
Signed-off-by: kingbri <bdashore3@proton.me>
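A minimal sketch of how the console logging toggle might be checked; the config key names below are assumptions for illustration, not the project's actual keys:

```python
# Sketch only: gate generation/sampler logging behind config flags.
# "log_prompt" and "log_generation_params" are hypothetical key names.
def log_generation(prompt: str, response: str, params: dict, log_config: dict):
    if log_config.get("log_prompt"):
        print(f"Prompt: {prompt}")
        print(f"Response: {response}")

    if log_config.get("log_generation_params"):
        print(f"Sampling params: {params}")
```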
Sometimes fastchat may not be able to detect the prompt template from
the model path. Therefore, add the ability to set it in config.yml or
via the request object itself.
Also send the provided prompt template on model info request.
Signed-off-by: kingbri <bdashore3@proton.me>
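A rough sketch of the fallback order described above (request override, then config.yml, then fastchat detection); the function and parameter names are illustrative assumptions:

```python
from typing import Optional

# Assumed resolution order: request body > config.yml > fastchat auto-detection.
def resolve_prompt_template(
    request_template: Optional[str],
    config_template: Optional[str],
    model_path: str,
) -> Optional[str]:
    if request_template is not None:
        return request_template
    if config_template is not None:
        return config_template
    # Fall back to fastchat-based detection from the model path (may return None).
    return detect_template_from_path(model_path)

def detect_template_from_path(model_path: str) -> Optional[str]:
    # Placeholder for the fastchat lookup; the real implementation differs.
    return None
```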
Python doesn't have proper handling of optionals. The only ways to
handle them are checking with an if statement whether the value is None
or using the "or" keyword to unwrap them.
Previously, I used the "or" method to unwrap, but this caused issues
because falsy values fell back to the default. This is especially
the case with booleans, where "False" changed to "True".
Instead, add two new functions, unwrap and coalesce, which properly
implement None coalescing in a functional way.
Signed-off-by: kingbri <bdashore3@proton.me>
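As a rough illustration, assuming signatures along these lines (the actual implementations may differ):

```python
from typing import Any, Optional

def unwrap(value: Optional[Any], default: Any = None) -> Any:
    """Return value unless it is None, otherwise return the default.

    Unlike "value or default", falsy values such as False, 0, and ""
    are preserved instead of being replaced by the default.
    """
    return value if value is not None else default

def coalesce(*values: Optional[Any]) -> Any:
    """Return the first non-None value, or None if every value is None."""
    return next((v for v in values if v is not None), None)
```

For example, unwrap(False, True) keeps False, whereas "False or True" evaluates to True.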
* Model: Implement basic lora support
* Add ability to load loras from config on launch
* Supports loading multiple loras and lora scaling
* Add function to unload loras
* Colab: Update for basic lora support
* Model: Test vram alloc after lora load, add docs
* Git: Add loras folder to .gitignore
* API: Add basic lora-related endpoints
* Add /loras/ endpoint for querying available loras
* Add /model/lora endpoint for querying currently loaded loras
* Add /model/lora/load endpoint for loading loras
* Add /model/lora/unload endpoint for unloading loras
* Move lora config-checking logic to main.py for better compat with API endpoints
* Revert bad CRLF line ending changes
* API: Add basic lora-related endpoints (fixed)
* Add /loras/ endpoint for querying available loras
* Add /model/lora endpoint for querying currently loaded loras
* Add /model/lora/load endpoint for loading loras
* Add /model/lora/unload endpoint for unloading loras
* Move lora config-checking logic to main.py for better compat with API endpoints
* Model: Unload loras first when unloading model
* API + Models: Cleanup lora endpoints and functions
Condenses the endpoint and model load code. Also makes the lora routes
behave the same way as the model routes to avoid confusing the end user.
Signed-off-by: kingbri <bdashore3@proton.me>
* Loras: Optimize load endpoint
Return successes and failures and consolidate the request into the
rewritten load_loras function.
Signed-off-by: kingbri <bdashore3@proton.me>
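A hedged sketch of what the consolidated load route could look like; only the route path matches the list above, while the request/response shapes and the load_loras helper are assumptions:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class LoraLoadInfo(BaseModel):
    name: str
    scaling: float = 1.0

class LoraLoadRequest(BaseModel):
    loras: list[LoraLoadInfo]

def load_loras(loras: list[tuple[str, float]]) -> tuple[list[str], list[str]]:
    # Placeholder: the real function wraps the backend's lora loading logic
    # and reports which loras loaded and which failed.
    return [name for name, _ in loras], []

@app.post("/model/lora/load")
async def load_lora_endpoint(request: LoraLoadRequest):
    successes, failures = load_loras(
        [(lora.name, lora.scaling) for lora in request.loras]
    )
    return {"success": successes, "failure": failures}
```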
---------
Co-authored-by: kingbri <bdashore3@proton.me>
Co-authored-by: DocShotgun <126566557+DocShotgun@users.noreply.github.com>
Use --extra-index-url to install PyTorch. This should be secure enough,
since dependency confusion attacks aren't possible when only the torch
package is installed from the extra index.
Signed-off-by: kingbri <bdashore3@proton.me>
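For reference, the install command looks something like this (the exact index URL depends on the targeted CUDA version):

```
pip install torch --extra-index-url https://download.pytorch.org/whl/cu121
```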
The draft config wasn't being parsed correctly after the changes that
removed the draft_enabled bool. There's still some more work to be done
on returning exceptions.
Signed-off-by: kingbri <bdashore3@proton.me>
Lots of bugs were unearthed when switching to the new fallback changes.
Fix them and make sure samplers are being set properly.
Signed-off-by: kingbri <bdashore3@proton.me>
Use the standard "dict.get("key") or default" pattern to fetch values
from kwargs and fall back to a default value without possible errors.
Signed-off-by: kingbri <bdashore3@proton.me>
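A small illustration of the pattern (the sampler names and defaults here are assumptions):

```python
def build_sampler_settings(**kwargs):
    # dict.get() returns None for a missing key and "or" supplies the
    # fallback, so absent kwargs never raise a KeyError.
    return {
        "temperature": kwargs.get("temperature") or 1.0,
        "top_p": kwargs.get("top_p") or 1.0,
        "max_tokens": kwargs.get("max_tokens") or 150,
    }
```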
Models do not fully unload if an exception is caught during load. Therefore,
leave it to the client to unload on cancel.
Also add handlers for when an SSE stream is cancelled. These packets
can't be sent back to the client since the client has severed the
connection, so print them in the terminal.
Signed-off-by: kingbri <bdashore3@proton.me>
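A rough sketch of the disconnect handling, assuming the Starlette Request is available inside the streaming generator (names are illustrative):

```python
from fastapi import Request

async def stream_with_disconnect_check(request: Request, generate_chunks):
    """Yield SSE packets, but log locally once the client has disconnected."""
    async for packet in generate_chunks():
        if await request.is_disconnected():
            # The client severed the connection, so the packet can't be sent
            # back; print it in the terminal instead and stop generating.
            print(f"Client disconnected. Unsent packet: {packet}")
            break
        yield packet
```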
Chat completions previously always yielded a final packet to say that
a generation finished. However, this caused errors where a yield was
executed after GeneratorExit. The error is legitimate: Python can't
clean up the generator after it exits because the finally block yields
again.
In addition, SSE endpoints close off the connection, so the finish packet
can only be yielded when the response has completed normally; skip the
yield on exception.
Signed-off-by: kingbri <bdashore3@proton.me>
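A minimal sketch of the fix, assuming the finish packet lives in a finally block:

```python
async def stream_chat_completion(generate_chunks):
    """Stream chunks, but never yield the finish packet after GeneratorExit."""
    finished = False
    try:
        async for chunk in generate_chunks():
            yield chunk
        finished = True
    finally:
        # Yielding after GeneratorExit raises "generator ignored GeneratorExit",
        # so only send the finish packet if the stream actually completed
        # (i.e. the client is still connected).
        if finished:
            yield "data: [DONE]\n\n"
```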
FastAPI is kinda weird with queueing. If an await is used within an
async def, requests aren't executed sequentially. Restore sequential
handling by using a semaphore to limit concurrent execution of the
generator functions.
Also scaffold the framework to move generator functions into their own
file.
Signed-off-by: kingbri <bdashore3@proton.me>
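A sketch of the semaphore approach (a limit of one concurrent generation is assumed here; names are illustrative):

```python
import asyncio

# Allow only one generation at a time so requests are handled sequentially.
generate_semaphore = asyncio.Semaphore(1)

async def generate_with_semaphore(generator_func, *args, **kwargs):
    """Run a streaming generator while holding the semaphore."""
    async with generate_semaphore:
        async for chunk in generator_func(*args, **kwargs):
            yield chunk
```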
Low_mem doesn't work in exl2 and it was an experimental option to
begin with. Keep the loading code commented out in case it gets fixed
in the future.
A better alternative is the 8-bit cache, which works and helps save
VRAM.
Signed-off-by: kingbri <bdashore3@proton.me>
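For context, exllamav2 ships an 8-bit cache class alongside the FP16 one; a simplified sketch of choosing between them (cache setup details are elided):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Cache_8bit

def create_cache(model: ExLlamaV2, use_8bit_cache: bool, lazy: bool = True):
    """Use the 8-bit cache when requested to roughly halve KV cache VRAM."""
    cache_class = ExLlamaV2Cache_8bit if use_8bit_cache else ExLlamaV2Cache
    return cache_class(model, lazy=lazy)
```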
* Enable automatic calculation of NTK-aware alpha scaling for models when the rope_alpha arg is not passed in the config, using the same formula as draft models
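A hedged sketch of the idea: derive a ratio from the target and base context lengths and fit alpha from it. The coefficients below are an assumption modeled on common exllamav2-style implementations, not necessarily the exact values used here:

```python
def auto_rope_alpha(max_seq_len: int, base_seq_len: int) -> float:
    """Estimate NTK-aware rope alpha when rope_alpha isn't set in the config."""
    ratio = max_seq_len / base_seq_len
    # Assumed quadratic fit of alpha against the context ratio.
    return 1.0 if ratio <= 1.0 else -0.13436 + 0.80541 * ratio + 0.28833 * ratio**2
```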
sse_starlette kept firing a ping response if setting an event took too
long. Rather than using a hacky workaround, switch to FastAPI's built-in
streaming response and construct SSE packets with a utility function.
This makes the API more robust and removes an extra requirement.
Signed-off-by: kingbri <bdashore3@proton.me>
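A sketch of the approach: a small helper that serializes a payload into an SSE packet, plus FastAPI's StreamingResponse with the event-stream media type. The helper and route names are illustrative:

```python
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

def get_sse_packet(data) -> str:
    """Serialize a payload into a server-sent event packet."""
    payload = data if isinstance(data, str) else json.dumps(data)
    return f"data: {payload}\n\n"

@app.post("/v1/completions")
async def create_completion():
    async def generator():
        # In the real endpoint these chunks come from the model generator.
        for text in ("Hello", ", world"):
            yield get_sse_packet({"text": text})
        yield get_sse_packet("[DONE]")

    return StreamingResponse(generator(), media_type="text/event-stream")
```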
Some APIs require an OAI model to be returned from the models endpoint.
Fix this by adding a GPT 3.5 turbo entry as the first item in the list
to cover as many APIs as possible.
Signed-off-by: kingbri <bdashore3@proton.me>
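Illustrative only; the card fields follow the OAI models schema, but the exact structure used by the endpoint is an assumption:

```python
from time import time

def get_model_list(local_models: list[str]) -> dict:
    """Build an OAI-style model list with a gpt-3.5-turbo entry first."""
    names = ["gpt-3.5-turbo", *local_models]
    return {
        "object": "list",
        "data": [
            {"id": name, "object": "model", "created": int(time()), "owned_by": "local"}
            for name in names
        ],
    }
```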
The OAI spec requires chat completions to provide a finish reason once
streaming is completed. This is different from a non-streaming chat
completion response.
Also fix some errors that were raised from the endpoint.
References #15
Signed-off-by: kingbri <bdashore3@proton.me>
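For reference, the final streamed chunk carries an empty delta and a finish_reason (e.g. "stop") per the OAI chat spec; a sketch:

```python
import json

def finish_chunk(model_name: str, finish_reason: str = "stop") -> str:
    """Final SSE chunk for a streamed chat completion."""
    chunk = {
        "object": "chat.completion.chunk",
        "model": model_name,
        "choices": [
            {"index": 0, "delta": {}, "finish_reason": finish_reason},
        ],
    }
    return f"data: {json.dumps(chunk)}\n\n"
```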