jalr/tabbyAPI-ollama

Author	SHA1	Message	Date
kingbri	7cbc08fc72	Templates: Add auto-detection from path This replicates FastChat's model path detection. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-18 23:53:47 -05:00
kingbri	e895eaa4bd	OAI: Clarify types in docs Adding field descriptions show which parameters are used solely for OAI compliance and not actually parsed in the model code. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-18 23:53:47 -05:00
kingbri	51ca1ff396	Tree: Switch to Pydantic 2 Pydantic 2 has more modern methods and stability compared to Pydantic 1 Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-18 23:53:47 -05:00
kingbri	f631dd6ff7	Templates: Switch to Jinja2 Jinja2 is a lightweight template parser that's used in Transformers for parsing chat completions. It's much more efficient than Fastchat and can be imported as part of requirements. Also allows for unblocking Pydantic's version. Users now have to provide their own template if needed. A separate repo may be usable for common prompt template storage. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-18 23:53:47 -05:00
kingbri	95fd0f075e	Model: Fix no flash attention Was being called wrong from config. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-17 23:31:58 -05:00
kingbri	ad8807a830	Model: Add support for num_experts_by_token New parameter that's safe to edit in exllamav2 v0.0.11. Only recommended for people who know what they're doing. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-17 18:03:01 -05:00
kingbri	70fbee3edd	OAI: Fix model parameter placement Accidentally edited the Model Card parameters vs the model load request ones. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-17 14:36:28 -05:00
kingbri	1d0bdfa77c	Model + OAI: Fix parameter parsing Rope alpha changes don't require removing the 1.0 default from Rope scale. Keep defaults when possible to avoid errors. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-17 14:28:18 -05:00
Veden	3e57125025	OAI: adding optional draft model properties for draft_rope alpha and scale (#28 ) * OAI: adding optional draft model properties for draft_rope alpha and scale * Forgot to set the properties to None	2023-12-17 19:23:45 +00:00
kingbri	528d58f841	Requirements: Fix AMD Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-17 00:45:43 -05:00
kingbri	f196f1177d	Requirements: Update exllamav2 to 0.0.11 Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-16 19:33:42 -05:00
kingbri	1a331afe3a	OAI: Add cache_mode parameter to model Mistakenly forgot that the user can choose what cache mode to use when loading a model. Also add when fetching model info. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-16 02:47:50 -05:00
kingbri	ed868fd262	OAI: Remove unused parameters Seed and low_mem aren't used, so comment them out. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-15 14:56:43 -05:00
kingbri	59729e2a4a	Tests: Fix linting Also change how wheel_test works for safe import testing rather than trying to import the package which can cause system issues. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-13 23:05:50 -05:00
kingbri	036ba2669c	Auth: Migrate to Pydantic It's easier to work with Pydantic dataclasses rather than standard python classes. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-12 23:58:22 -05:00
kingbri	eb8ccb9783	Tree: Fix linter issues Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-12 23:58:19 -05:00
kingbri	083df7d585	Tree: Add generation logging support Generations can be logged in the console along with sampling parameters if the user enables it in config. Metrics are always logged at the end of each prompt. In addition, the model endpoint tells the user if they're being logged or not for transparency purposes. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-12 23:43:35 -05:00
kingbri	b364de1005	Update README Add alternatives if the user doesn't agree with the focus of TabbyAPI. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-10 16:05:46 -05:00
kingbri	db87efde4a	OAI: Add ability to specify fastchat prompt template Sometimes fastchat may not be able to detect the prompt template from the model path. Therefore, add the ability to set it in config.yml or via the request object itself. Also send the provided prompt template on model info request. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-10 15:43:58 -05:00
kingbri	9f195af5ad	Main: Fix function calls Some function names were declared twice. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-10 13:28:21 -05:00
kingbri	fd9f3eac87	Model: Add params to current model endpoint Grabs the current model rope params, max seq len, and the draft model if applicable. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-10 00:40:56 -05:00
kingbri	0f4290f05c	Model: Format Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-09 22:48:42 -05:00
kingbri	5ae2a91c04	Tree: Use unwrap and coalesce for optional handling Python doesn't have proper handling of optionals. The only way to handle them is checking via an if statement if the value is None or by using the "or" keyword to unwrap optionals. Previously, I used the "or" method to unwrap, but this caused issues due to falsy values falling back to the default. This is especially the case with booleans, where "False" changed to "True". Instead, add two new functions: unwrap and coalesce. Both function to properly implement a functional way of "None" coalescing. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-09 21:52:17 -05:00
DocShotgun	7380a3b79a	Implement lora support (#24 ) * Model: Implement basic lora support * Add ability to load loras from config on launch * Supports loading multiple loras and lora scaling * Add function to unload loras * Colab: Update for basic lora support * Model: Test vram alloc after lora load, add docs * Git: Add loras folder to .gitignore * API: Add basic lora-related endpoints * Add /loras/ endpoint for querying available loras * Add /model/lora endpoint for querying currently loaded loras * Add /model/lora/load endpoint for loading loras * Add /model/lora/unload endpoint for unloading loras * Move lora config-checking logic to main.py for better compat with API endpoints * Revert bad CRLF line ending changes * API: Add basic lora-related endpoints (fixed) * Add /loras/ endpoint for querying available loras * Add /model/lora endpoint for querying currently loaded loras * Add /model/lora/load endpoint for loading loras * Add /model/lora/unload endpoint for unloading loras * Move lora config-checking logic to main.py for better compat with API endpoints * Model: Unload loras first when unloading model * API + Models: Cleanup lora endpoints and functions Condenses down endpoint and model load code. Also makes the routes behave the same way as model routes to help not confuse the end user. Signed-off-by: kingbri <bdashore3@proton.me> * Loras: Optimize load endpoint Return successes and failures along with consolidating the request to the rewritten load_loras function. Signed-off-by: kingbri <bdashore3@proton.me> --------- Co-authored-by: kingbri <bdashore3@proton.me> Co-authored-by: DocShotgun <126566557+DocShotgun@users.noreply.github.com>	2023-12-08 23:38:08 -05:00
kingbri	161c9d2c19	Tests: Fix wheel test Fastchat is named fschat from the package's point of view. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-08 01:15:24 -05:00
kingbri	fa1e99daf6	Model: Remove unused print statement Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-07 21:13:52 -05:00
kingbri	47176a2a1e	Requirements: Fix torch install Use --extra-index-url to install pytorch. This should be secure enough since dependency confusion attacks aren't possible with just installing the torch package. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-06 19:04:35 -05:00
kingbri	f8e9e22c43	API: Fix model load endpoint with draft Draft wasn't being parsed correctly with the new changes which removed the draft_enabled bool. There's still some more work to be done with returning exceptions. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-06 18:05:55 -05:00
kingbri	6a71890d45	Model: Fix sampler bugs Lots of bugs were unearthed when switching to the new fallback changes. Fix them and make sure samplers are being set properly. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-06 17:29:58 -05:00
kingbri	9f34af4906	Tests: Create Add a few tests for the user to check if stuff works. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-06 00:53:42 -05:00
kingbri	21c25fd806	Update README Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-06 00:24:49 -05:00
kingbri	b83e1b704e	Requirements: Split for configurations Add self-contained requirements for cuda 11.8 and ROCm Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-06 00:00:30 -05:00
kingbri	4c0e686e7d	Model: Cleanup and fix fallbacks Use the standard "dict.get("key") or default" to handle fetching values from kwargs and get a fallback value without possible errors. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-05 23:28:16 -05:00
Brian Dashore	0ef2fe9b95	Merge pull request #23 from DocShotgun/main Expose draft_rope_scale	2023-12-05 22:24:53 -05:00
kingbri	d8f7b93c54	Model: Fix fetching of draft args Mistakenly fetched these from parent kwargs instead of the scoped draft_config var. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-05 22:24:27 -05:00
DocShotgun	3f2fcbcc45	Add fallback to draft_rope_scale to 1.0	2023-12-05 18:51:36 -08:00
DocShotgun	39f7a2aabd	Expose draft_rope_scale	2023-12-05 12:59:32 -08:00
Brian Dashore	e085b806e8	Merge pull request #22 from DocShotgun/main Update colab, expose additional args	2023-12-05 01:22:33 -05:00
DocShotgun	67507105d0	Update colab, expose additional args * Exposed draft model args for speculative decoding * Exposed int8 cache, dummy models, and no flash attention * Resolved CUDA 11.8 dependency issue	2023-12-04 22:20:46 -08:00
Brian Dashore	37f8f3ef8b	Merge pull request #20 from veryamazinglystupid/main make colab better, fix libcudart errors	2023-12-05 01:14:21 -05:00
kingbri	621e11b940	Update documentation Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-05 00:33:43 -05:00
kingbri	8ba3bfa6b3	API: Fix load exception handling Models do not fully unload if an exception is caught in load. Therefore, leave it to the client to unload on cancel. Also add handlers in the event an SSE stream is cancelled. These packets can't be sent back to the client since the client has severed the connection, so print them in terminal. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-05 00:23:15 -05:00
kingbri	7c92968558	API: Fix mistaken debug statement Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-04 18:07:12 -05:00
kingbri	5e54911cc8	API: Fix semaphore handling and chat completion errors Chat completions previously always yielded a final packet to say that a generation finished. However, this caused errors that a yield was executed after GeneratorExit. This is correctly stated because python's garbage collector can't clean up the generator after exiting due to the finally block executing. In addition, SSE endpoints close off the connection, so the finish packet can only be yielded when the response has completed, so ignore yield on exception. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-04 15:51:25 -05:00
kingbri	30fc5b3d29	Merge branch 'main' of github.com:theroyallab/tabbyAPI	2023-12-03 22:55:51 -05:00
kingbri	ed6c962aad	API: Fix sequential requests FastAPI is kinda weird with queueing. If an await is used within an async def, requests aren't executed sequentially. Get the sequential requests back by using a semaphore to limit concurrent execution from generator functions. Also scaffold the framework to move generator functions to their own file. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-03 22:54:34 -05:00
veryamazinglystupid	ad1a12a0f2	make colab better, fix libcudart errors :3	2023-12-03 14:07:52 +05:30
DocShotgun	2a9e4ca051	Add Colab example *note: this uses wheels for python 3.10 and torch 2.1.0+cu118 which is the current default in colab	2023-12-03 02:21:51 -05:00
kingbri	e740b53478	Requirements: Update Flash Attention 2 Bump to 2.3.6 Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-03 01:56:29 -05:00
kingbri	c67c9f6d66	Model + Config: Remove low_mem option Low_mem doesn't work in exl2 and it was an experimental option to begin with. Keep the loading code commented out in case it gets fixed in the future. A better alternative is to use 8bit cache which works and helps save VRAM. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-03 01:07:42 -05:00


1051 commits