jalr/tabbyAPI-ollama

Author	SHA1	Message	Date
DocShotgun	7084081b1f	Tree: Lint	2024-05-26 18:27:30 -07:00
kingbri	116cf56c87	Model: Auto-round cache size on init Cache size must be a multiple of 256 to work properly in ExllamaV2. Take the config value and set the cache size to one multiple above the remainder of the cache size divided by 256. This is because cache size can never be lower than max_seq_len. If max_seq_len isn't a multiple of 256, this method will never yield a number that's lower than max_seq_len since it's no longer a source of truth. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-26 21:24:54 -04:00
DocShotgun	ce5e2ec8de	Logging: Clarify new vs cached tokens in prompt processing	2024-05-26 18:21:17 -07:00
Brian Dashore	3dcae8b023	Merge pull request #111 from DocShotgun/main Add support for specifying k/v cache size	2024-05-26 20:52:21 -04:00
kingbri	bec919e202	Config: Change cache_size description and location Makes more sense to place cache_size with the other cache options. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-26 20:50:56 -04:00
DocShotgun	7ab7ffd562	Tree: Format	2024-05-26 15:48:18 -07:00
DocShotgun	767e6a798a	API + Model: Add support for specifying k/v cache size	2024-05-26 14:17:01 -07:00
kingbri	d710a1b441	OAI: Switch to background task for disconnect checks Waiting for request disconnect takes some extra time and allows generation chunks to pile up, resulting in large payloads being sent at once not making up a smooth stream. Use the polling method in non-streaming requests by creating a background task and then check if the task is done, signifying that the request has been disconnected. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-26 13:52:20 -04:00
kingbri	660f9b8432	OAI: Fix request cancellation behavior Depending on the day of the week, Starlette can work with a CancelledError or using await request.is_disconnected(). Run the same behavior for both cases and allow cancellation. Streaming requests now set an event to cancel the batched job and break out of the generation loop. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-26 13:00:33 -04:00
kingbri	094c7b1734	Model: Fix paged and FA2 checks If a user is using GPU split, check compute capability on only those GPUs. Autosplit assumes that all GPUs will be used. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-26 11:29:31 -04:00
kingbri	9fbbc5afca	Tree: Swap from map to list comprehensions List comprehensions are the more "pythonic" way to approach mapping values to a list. They're also more flexible across different collection types rather than the inbuilt map method. It's best to keep one convention rather than splitting down two. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	46d0d13914	Model/Grammar: Fix filter append call No need to use extend if the array is length 1. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	a46ee62d03	Model: Clarify warning and device check on load FA2 v2.5.7 and up is not supported below ampere and on AMD GPUs. Clarify the error message and explain what happens as a result. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	47582c2440	Dependencies: Update ExllamaV2 v0.1.0 Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	43cd7f57e8	API + Model: Add blocks and checks for various load requests Add a sequential lock and wait until jobs are completed before executing any loading requests that directly alter the model. However, we also need to block any new requests that come in until the load is finished, so add a condition that triggers once the lock is free. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	408c66a1f2	Model: Change FA2 and paged attention checks The dynamic generator requires Flash attention 2.5.7 or higher to be installed. This is only supported on Nvidia's 30 series and higher. If a card is AMD or lower than the 30 series, switch to compatibility mode which functions the same way as the older generator, except without parallel batching and any features that depend on it, such as CFG. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	c2d3675408	Model: Add min_tokens support In the form of min_new_tokens. Stopping strings take priority. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	5f0fb9c4ff	Model: Add CFG support Dynamic generator needed multiple prompts to be tokenized and sent for them to be sampled in serial, but generated in parallel. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	06ff47e2b4	Model: Use true async jobs and add logprobs The new async dynamic job allows for native async support without the need of threading. Also add logprobs and metrics back to responses. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	32ae62feac	Model: Add filter support to dynamic gen Dynamic gen takes in filters differently. Adjust to set the filter list per class rather than in the generation function. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	8ccd8fe5f8	Model: Initial dynamic generator support Adds basic support for ExllamaV2's dynamic generator. Can generate a streaming and non-streaming completion. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	c474076b22	Concurrency: Remove release_semaphore method At any point for any request cancellation, the semaphore will be decremented. This is an issue since an arbitrary request can desync the semaphore, causing multiple tasks to be processed at once and break generation. Remove this from the networking handlers and therefore, remove the release_semaphore function itself. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-19 10:42:26 -04:00
kingbri	b9fd8555fe	Sampling: Copy over iterable overrides If an override was iterable, any modifications to the returned value would alter the reference to the global storage dict. Therefore, copy the structure if it's an iterable so any modification won't alter the original override. Also apply this for the function that checks for forced overrides. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-17 21:38:28 -04:00
kingbri	0e9385e023	API: Fix usage reporting for chat completions Resolves #106 Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-17 00:03:15 -04:00
kingbri	e4bb709305	Model: Fix usage stats in non-streaming gens The wrong key was being returned from the model to the API. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-12 22:44:50 -04:00
kingbri	213430a122	Model/Grammar: Remove lmfe checks lmfe is a required dependency, so checks are no longer needed. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-12 22:24:28 -04:00
Brian Dashore	b255847c2a	Merge pull request #105 from DocShotgun/main Add support for regex pattern constraints	2024-05-12 22:22:12 -04:00
DocShotgun	abe411c6fb	API + Model: Add support for regex pattern constraints Adds the ability to constrain generation via regex pattern using lm-format-enforcer.	2024-05-12 19:10:43 -07:00
Ycros	57525219d0	Fix: Properly handle banned_strings and decode_special tokens (#104) * Fix: Actually pass banned_strings to the generation call. * decode_special_tokens was missing as well. * syntax	2024-05-12 20:47:45 +00:00
Brian Dashore	611f00818b	Merge pull request #103 from DocShotgun/main Minor fixes for sampler override	2024-05-12 16:47:12 -04:00
DocShotgun	dad34237ba	Samplers: Add example override for generate_window	2024-05-12 00:39:01 -07:00
DocShotgun	9463ecfa40	Samplers: Minor fixes for sampler override * Add missing settings to sample_preset.yml * Fix override for skip_special_tokens	2024-05-12 00:31:31 -07:00
kingbri	c8ec742be9	Samplers: Expose skew sampling Skew is an extra unused sampler in ExllamaV2. Add it in for coverage. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-12 01:41:01 -04:00
kingbri	6f4012d20d	API: Add preset listing for sampler overrides Querying the overrides list endpoint now returns the selected preset and a list of presets to use. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-12 01:34:51 -04:00
kingbri	b4bc941cbe	Tree: Lint Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-11 22:42:39 -04:00
kingbri	2da3fb2caf	Start: Bump ROCm error version ROCm support is for 6.0 now. Update that. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-11 21:57:51 -04:00
kingbri	7bebc085ec	Model: Remove legacy checks v0.0.21 has these features implemented. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-11 19:26:23 -04:00
kingbri	cd78728a77	Dependencies: Update ExllamaV2 v0.0.21 Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-11 19:26:03 -04:00
Brian Dashore	5432f523cb	Merge pull request #102 from DocShotgun/main Add support for min_tokens and banned_strings	2024-05-10 21:21:57 -04:00
kingbri	366d57cf45	Tree: Format Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-10 21:20:41 -04:00
kingbri	7eee936a3f	Model: Remove old code and fix API handling skip_special_tokens is in stable exl2. Also default the parameters if they are not present in the function signature. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-10 21:20:00 -04:00
DocShotgun	c0b631ba92	API: Add banned_strings From exllamav2: List of strings that the generator will refuse to output. As soon as a partial match happens, a checkpoint is saved that the generator can rewind to if need be. Subsequent tokens are then held until the full string is resolved (match or no match) and either emitted or discarded, accordingly.	2024-05-10 13:53:55 -07:00
DocShotgun	a1df22668b	API: Add min_tokens Bans the EOS token until the generation reaches a minimum length. This will not prevent the model from otherwise ending the generation early by outputting other stop conditions.	2024-05-10 12:30:17 -07:00
Brian Dashore	643b53e347	Create FUNDING.yml Add ko-fi link. Signed-off-by: kingbri <bdashore3@gmail.com>	2024-05-09 19:00:41 +00:00
Brian Dashore	c4f7af160e	Merge pull request #101 from Bakharovsky/fix_exllamav2_cuda_version Fix: the link to the exllamav2 build for cuda 11.8	2024-05-08 16:32:22 -04:00
Arseniy Bakharovsky	33c86be45c	Update pyproject.toml	2024-05-08 03:31:15 +04:00
kingbri	ae879a623f	Main: Add await to an async function load_loras wasn't properly updated. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-02 21:24:43 -04:00
kingbri	ab526f7278	Revert "API: Remove unncessary Optional signatures" This reverts commit `7556dcf134`. The Optionals allowed requests to send "null" in the body for optional parameters which should be allowed. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-02 21:23:48 -04:00
kingbri	7556dcf134	API: Remove unncessary Optional signatures Optional isn't necessary if the function signature has a default value. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-01 00:04:52 -04:00
kingbri	ae75db1829	Downloader: Cleanup on exception Otherwise a file exists error will show up if any exception happens but cancel. Signed-off-by: kingbri <bdashore3@proton.me>	2024-04-30 23:26:22 -04:00
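
The cache size rounding described in commit 116cf56c87 can be illustrated with a small sketch. This is not the project's code: the function name `round_up_cache_size` and the `multiple` parameter are hypothetical, and the only behavior assumed is what the commit message states, namely that a configured cache size that is not a multiple of 256 is raised to the next multiple of 256.

```python
def round_up_cache_size(cache_size: int, multiple: int = 256) -> int:
    """Round cache_size up to the nearest multiple of `multiple`.

    Hypothetical helper for illustration only. Rounding up (never down)
    means a cache size that was >= max_seq_len stays >= max_seq_len.
    """
    if cache_size < 0 or multiple <= 0:
        raise ValueError("cache_size must be >= 0 and multiple must be > 0")

    remainder = cache_size % multiple
    if remainder == 0:
        # Already aligned, so the configured value is used as-is.
        return cache_size

    # Drop the remainder, then move one full multiple above it.
    return cache_size - remainder + multiple
```

Rounding up rather than down matches the reasoning in the commit message: since the cache size can never be lower than max_seq_len, the adjusted value must not fall below the configured one.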

641 commits