jalr/tabbyAPI-ollama

Author	SHA1	Message	Date
kingbri	5c293499bd	OAI: Reorder functions Reordering routes changes the order of appearance on documentation. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-08 15:27:08 -04:00
kingbri	521d21b9f2	OAI: Add return types for docs Adding return types allows for responses to get included in the autogenerated docs. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-08 15:23:41 -04:00
kingbri	62e495fc13	Model: Grammar: Fix lru_cache clear function It's cache_clear not clear_cache. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-08 15:10:15 -04:00
Brian Dashore	17438288c7	Merge pull request #146 from theroyallab/tokenizer_data_fix Tokenizer data fix	2024-07-08 15:08:29 -04:00
kingbri	c7ce97f119	Tree: Ruff lint Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-08 15:06:28 -04:00
kingbri	8a81fe2eb4	Actions: Add Github Pages deploy Deploys OpenAPI documentation to pages. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-08 15:04:27 -04:00
kingbri	6613e38436	Main: Make openapi export store locally This runs faster than always making a syscall to check if the env var is set. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-08 14:54:06 -04:00
kingbri	ae66e8f9ba	Ruff: Lint Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-08 13:44:12 -04:00
kingbri	b907421285	Main: Fix launch if EXPORT_OPENAPI is unset A default needs to be provided with getenv. Fix that with an empty string. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-08 13:41:44 -04:00
kingbri	a59e8ef9e7	Main: Make EXPORT_OPENAPI only work if true or 1 Use truthy values instead of checking if the variable is set. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-08 12:51:24 -04:00
kingbri	e58e197f0b	Ruff: Remove deprecated rule E999 Syntax error is removed since they'll always be shown when linting anyways. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-08 12:36:15 -04:00
kingbri	933268f7e2	API: Integrate OpenAPI export script Move OpenAPI export as an env var within the main function. This allows for easy export by running main. In addition, an env variable provides global and explicit state to disable conditional wheel imports (ex. Exl2 and torch) which caused errors at first. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-08 12:34:32 -04:00
turboderp	e97ad9cb27	RUFF	2024-07-08 03:51:14 +02:00
turboderp	8bbce3455c	RUFF	2024-07-08 03:49:26 +02:00
kingbri	5e82b7eb69	API: Add standalone method to fetch OpenAPI docs Generates and stores an export of the openapi.json file for use in static websites. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-07 21:35:52 -04:00
turboderp	4cf79c5ae1	Clear tokenizer_data cache when unloading model	2024-07-08 03:31:05 +02:00
turboderp	b7e7df1220	Move tokenizer_data cache to global scope	2024-07-08 02:54:49 +02:00
turboderp	4d0bb1ffc3	Cache creation tokenizer_data in LMFE	2024-07-08 00:51:59 +02:00
turboderp	bb8b02a60a	Wrap arch_compat_overrides in try block Quick fix until exllamav2 0.1.7 releases, since the function isn't defined for 0.1.6.	2024-07-07 07:54:05 +02:00
kingbri	773639ea89	Model: Fix flash-attn checks If flash attention is already turned off by exllamaV2 itself, don't try creating a paged generator. Also condense all the redundant logic into one if statement. Also check arch_compat_overrides to see if flash attention should be disabled for a model arch (ex. Gemma 2) Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-06 20:58:24 -04:00
kingbri	27d2d5f3d2	Config + Model: Allow for default fallbacks from config for model loads Previously, the parameters under the "model" block in config.yml only handled the loading of a model on startup. This meant that any subsequent API request required each parameter to be filled out or use a sane default (usually defaults to the model's config.json). However, there are cases where admins may want an argument from the config to apply if the parameter isn't provided in the request body. To help alleviate this, add a mechanism that works like sampler overrides where users can specify a flag that acts as a fallback. Therefore, this change both preserves the source of truth of what parameters the admin is loading and adds some convenience for users that want customizable defaults for their requests. This behavior may change in the future, but I think it solves the issue for now. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-06 17:50:58 -04:00
kingbri	d03752e31b	Issues: Fix template Correct Discord invite link. Signed-off-by: kingbri <bdashore3@proton.me>	2024-06-23 21:52:01 -04:00
kingbri	45fae89af6	Update README Signed-off-by: kingbri <bdashore3@proton.me>	2024-06-23 21:50:17 -04:00
kingbri	c5ea2abe24	Dependencies: Update ExllamaV2 v0.1.6 Signed-off-by: kingbri <bdashore3@proton.me>	2024-06-23 21:45:04 -04:00
kingbri	d85b526644	Dependencies: Pin numpy v2.x breaks many upstream dependencies (torch). Pin until repos are fixed. Signed-off-by: kingbri <bdashore3@proton.me>	2024-06-23 21:40:09 -04:00
DocShotgun	107436f601	Dependencies: Fix AMD triton (#139)	2024-06-18 15:19:27 +02:00
Brian Dashore	06ee610a97	Update README Signed-off-by: kingbri <bdashore3@proton.me>	2024-06-17 03:56:47 +00:00
kingbri	c575105e41	ExllamaV2: Cleanup log placements Move the large import errors into the check functions themselves. This helps reduce the difficulty in interpreting where errors are coming from. Signed-off-by: kingbri <bdashore3@proton.me>	2024-06-16 00:16:03 -04:00
Glenn Maynard	8da7644571	Fix exception unloading models. (#138) self.generator is None if a model load fails or is cancelled.	2024-06-15 23:44:29 +02:00
DocShotgun	85387d97ad	Fix disabling flash attention in exl2 config (#136) * Model: Fix disabling flash attention in exl2 config * Model: Pass no_flash_attn to draft config * Model: Force torch flash SDP off in compatibility mode	2024-06-12 20:00:46 +02:00
DocShotgun	156b74f3f0	Revision to paged attention checks (#133) * Model: Clean up paged attention checks * Model: Move cache_size checks after paged attn checks Cache size is only relevant in paged mode * Model: Fix no_flash_attention * Model: Remove no_flash_attention Ability to use flash attention is auto-detected, so this flag is unneeded. Uninstall flash attention to disable it on supported hardware.	2024-06-09 17:28:11 +02:00
DocShotgun	55d979b7a5	Update dependencies, support Python 3.12, update for exl2 0.1.5 (#134) * Dependencies: Add wheels for Python 3.12 * Model: Switch fp8 cache to Q8 cache * Model: Add ability to set draft model cache mode * Dependencies: Bump exllamav2 to 0.1.5 * Model: Support Q6 cache * Config: Add Q6 cache and draft_cache_mode to config sample	2024-06-09 17:27:39 +02:00
DocShotgun	dcd9428325	Model: Warn if cache size is too small for CFG (#132)	2024-06-05 19:40:14 +02:00
DocShotgun	e391d84e40	More extensive checks for paged mode support (#121) * Model: More extensive checks for paged attention Previously, TabbyAPI only checked for whether the user's hardware supports flash attention before deciding whether to enable paged mode. This adds checks for whether no_flash_attention is set, whether flash-attn is installed, and whether the installed version supports paged attention. * Tree: Format * Tree: Lint * Model: Check GPU architecture first Check GPU arch prior to checking whether flash attention 2 is installed	2024-06-05 09:33:21 +02:00
turboderp	dbdcb38ad7	Allow either "[" or "{" prefix to support JSON grammar with top level arrays (#129)	2024-06-04 02:32:39 +02:00
turboderp	e889fa3efe	Bump exllamav2 to v0.1.4 (#128)	2024-06-04 02:32:08 +02:00
Orion	6cc3bd9752	feat: list support in message.content (#122)	2024-06-03 19:57:15 +02:00
turboderp	1951f7521c	Forward exceptions from _stream_collector to stream_generate_(chat)_completion (#126)	2024-06-03 19:42:45 +02:00
turboderp	0eb8fa5d1e	[fix] Bring draft progress and model progress in sync with model loader (#125) * Bring draft progress and model progress in sync with model loader * Fix formatting	2024-06-03 19:41:02 +02:00
turboderp	a011c17488	Revert "Forward exceptions from _stream_collector to stream_generate_chat_completion" This reverts commit `1bb8d1a312`.	2024-06-02 15:37:37 +02:00
turboderp	1bb8d1a312	Forward exceptions from _stream_collector to stream_generate_chat_completion	2024-06-02 15:13:30 +02:00
kingbri	e95e67a000	OAI: Add validation to "n" n must be greater than 1 to generate. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-28 00:52:30 -04:00
kingbri	e2a8b6e8ae	OAI: Add "n" support for streaming generations Use a queue-based system to get choices independently and send them in the overall streaming payload. This method allows for unordered streaming of generations. The system is a bit redundant, so maybe make the code more optimized in the future. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-28 00:52:30 -04:00
kingbri	c8371e0f50	OAI: Copy gen params for "n" For multiple generations in the same request, nested arrays kept their original reference, resulting in duplications. This will occur with any collection type. For optimization purposes, a deepcopy isn't run for the first iteration since original references are created. This is not the most elegant solution, but it works for the described cases. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-28 00:52:30 -04:00
kingbri	b944f8d756	OAI: Add "n" for non-streaming generations This adds the ability to add multiple choices to a generation. This is only available for non-streaming gens for now, it requires some more work to port over to streaming. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-28 00:52:30 -04:00
kingbri	8d31a5aed1	Dependencies: Update Flash Attention 2 v2.5.9.post1 Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-28 00:45:35 -04:00
Brian Dashore	516b52b341	Merge pull request #112 from DocShotgun/main Separate new prompt tokens from those reused from cache in metric logging	2024-05-27 18:04:43 -04:00
kingbri	19961f4126	Dependencies: Update ExllamaV2 v0.1.1 Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-27 13:38:07 -04:00
kingbri	04cbed16e8	Update README Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-27 13:37:57 -04:00
kingbri	4087586449	Start: Create config.yml if it doesn't exist While TabbyAPI doesn't need a config.yml to run, new users can get confused by the task of copying config_sample.yml to config.yml. Therefore, automatically do this in the start script to immediately expose options to the user. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-26 21:37:52 -04:00
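
The EXPORT_OPENAPI commits listed above (a59e8ef9e7, b907421285, 6613e38436) describe three related points: only "true" or "1" should enable the export, an unset variable needs an empty-string default, and the value should be read once and stored. The following is a minimal sketch of that pattern, not the repository's actual code. The helper name `env_flag`, its `environ` parameter, and the case-insensitive comparison are assumptions made for illustration.

```python
import os


def env_flag(name, environ=os.environ):
    """Return True only when the variable is set to "1" or "true".

    Hypothetical helper: the name, signature, and case-insensitive
    comparison are illustrative assumptions, not taken from the repo.
    """
    # Defaulting to "" means an unset variable is handled like any
    # other non-truthy value instead of returning None.
    value = environ.get(name, "")
    return value.strip().lower() in ("1", "true")


# Read the flag once at startup and keep the result in a module-level
# variable, rather than querying the environment on every check.
EXPORT_OPENAPI = env_flag("EXPORT_OPENAPI")
```

Passing a plain dict as `environ` makes the helper easy to exercise without touching the real process environment, for example `env_flag("EXPORT_OPENAPI", {"EXPORT_OPENAPI": "1"})`.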

641 commits