jalr/tabbyAPI-ollama

Author	SHA1	Message	Date
kingbri	5c94894a1a	Dependencies: Update Flash Attention v2.5.6 Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-30 16:58:24 -04:00
kingbri	b11aac51e2	Model: Add torch.inference_mode() to generator function Provides a speedup to model forward. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-30 10:45:28 -04:00
kingbri	e8b6a02aa8	API: Move prompt template construction to utils Best to move the inner workings within its inner function. Also fix an edge case where stop strings can be a string rather than an array. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-29 02:24:13 -04:00
kingbri	190a0b26c3	Model: Fix generation when stream = false References #91. Check if the length of the generation array is > 0 after popping the finish reason. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-29 02:15:56 -04:00
kingbri	d4280e1378	Dependencies: Add pytorch-triton-rocm Required for AMD installs. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-28 11:02:56 -04:00
kingbri	271f5ba7a4	Templates: Modify alpaca and chatml Add the stop_strings metadata parameter. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-27 22:28:41 -04:00
kingbri	dc456f4cc2	Templates: Add stop_strings meta param Adding the stop_strings var to chat templates will allow for the template creator to specify stopping strings to add onto chat completions. Thes get appended with existing stopping strings that are passed in the API request. However, a sampler override with force: true will override all stopping strings. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-27 22:22:07 -04:00
kingbri	277c540c98	Colab: Update Switch to pyproject Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-24 21:48:48 -04:00
kingbri	db62d1e649	OAI: Log request errors to console Previously, some request errors were only sent to the client, but some clients don't log the full error, so log it in console. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-23 20:29:17 -04:00
kingbri	26496c4db2	Dependencies: Require tokenizers This is used for some models and isn't too big in size (compared to other huggingface dependencies), so include it by default. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-23 01:12:21 -04:00
kingbri	1755f284cf	Model: Prompt users to install extras if dependencies don't exist Ex: tokenizers, lmfe, outlines. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-22 22:13:55 -04:00
kingbri	f952b81ccf	API: Remove uvicorn signal handler injection This causes spamming of warn statements on SIGINT. The message also gets printed on a normal shutdown (that isn't in the middle of a request). Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-21 23:23:45 -04:00
kingbri	6dfcbbd813	Common: Migrate request utils to networking Helps organize the project better. Utils is meant to be for simple functions like unwrap. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-21 23:21:57 -04:00
kingbri	2961c5f3f9	API: Handle request disconnect on non-streaming gens Works the same way as streaming gens. If the request is cancelled, it will log an error to the user and release the semaphore if it's holding anything. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-21 23:12:59 -04:00
kingbri	44b7319710	Start: Print pip install command Helps for debugging. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-21 18:14:48 -04:00
kingbri	5055a98e41	Model: Wrap load in inference_mode Some tensors were being taken out of inference mode during each iteration of exllama's load_autosplit_gen. This causes errors since autograd is off. Therefore, make the shared load_gen_sync function have an overarching inference_mode context to prevent forward issues. This should allow for the generator to iterate across each thread call. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-21 18:06:50 -04:00
kingbri	37a80334a8	Dependencies: Add packaging This is a required dependency. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-21 11:27:27 -04:00
kingbri	56fdfb5f8e	OAI: Add stream to gen params Good for logging. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-21 00:55:44 -04:00
kingbri	69e41e994c	Model: Fix generation with non-streaming and logprobs Finish_reason was giving an empty offset. Fix this by grabbing the finish reason first and then handling the static generation as normal. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-21 00:47:24 -04:00
kingbri	345bcc30c7	Dependencies: Add extras feature Installs all optional dependencies to the venv. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-21 00:09:38 -04:00
kingbri	51b289cab2	Actions: Fix workflows Adopt to new pyproject install method Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-20 15:21:37 -04:00
kingbri	1e7cf1e5a4	Start: Prompt user for GPU/lib There is no platform agnostic way to fetch CUDA/ROCm's versions since environment variables change and users don't necessarily need CUDA or ROCm installed to run pytorch (pytorch installs the necessary libs if they don't exist). Therefore, prompt the user for their GPU lib and store the result in a textfile so the user doesn't need to constantly enter a preference. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-20 15:21:37 -04:00
kingbri	7e669527ed	Model: Fix tokenizer bugs Some tokenizer variables don't get cleaned up on init, so these can persist. Clean these up manually before creating a new tokenizer for now. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-20 15:21:37 -04:00
kingbri	07d9b7cf7b	Model: Add abort on generation When the model is processing a prompt, add the ability to abort on request cancellation. This is also a catch for a SIGINT. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-20 15:21:37 -04:00
kingbri	7020a0a2d1	Dependencies: Update Exllamav2 v0.0.16 Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-20 15:21:37 -04:00
kingbri	061e1d94c2	Ruff: Migrate to pyproject Removes unnecessary ruff.toml. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-20 15:21:37 -04:00
kingbri	1059101b23	Dependencies: Remove requirements-*.txt files Pyproject.toml replaces these files. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-20 15:21:37 -04:00
kingbri	72b08624a3	Start: Update to use pyproject Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-20 15:21:37 -04:00
kingbri	b1ca435695	Tree: Add pyproject.toml This will manage dependencies from now on since it's a more flexible file that's similar to other packaging utilities like npm and cargo. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-20 15:21:37 -04:00
kingbri	b74603db59	Model: Log metrics before yielding a stop Yielding the finish reason before the logging causes the function to terminate early. Instead, log before yielding and breaking out of the generation loop. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-20 01:17:04 -04:00
kingbri	09a4c79847	Model: Auto-scale max_tokens by default If max_tokens is None, it automatically scales to fill up the context. This does not mean the generation will fill up that context since EOS stops also exist. Originally suggested by #86 Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-18 22:54:59 -04:00
kingbri	8cbb59d6e1	Model: Cleanup some comments Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-18 22:20:45 -04:00
kingbri	4f75fb5588	Model: Adjust max output len Max output len should be hardcoded to 16 since it's the amount of tokens to predict per forward pass. 16 is a good value for both normal inference and speculative decoding which also helps save vram compared to 2048 which was the previous default. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-18 22:16:53 -04:00
kingbri	2704ff8344	Tree: Format Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-18 16:02:29 -04:00
kingbri	5c7fc69ded	API: Fix finish_reason returns OAI expects finish_reason to be "stop" or "length" (there are others, but they're not in the current scope of this project). Make all completions and chat completions responses return this from the model generation itself rather than putting a placeholder. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-18 15:59:28 -04:00
kingbri	25f5d4a690	API: Cleanup permission endpoint Don't return an OAI specific type from a common file. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-18 15:13:26 -04:00
kingbri	3c08f46c51	Endpoints: Add key permission checker This is a definite way to check if an authorized key is API or admin. The endpoint only runs if the key is valid in the first place to keep inline with the API's security model. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-18 00:53:27 -04:00
kingbri	c9a6d9ae1f	Model: Switch to begin_stream_ex Allows for dynamically passing logprobs params instead of assuming on initialization of the generator. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-17 14:41:16 -04:00
kingbri	08bcc6307a	Config: Update description part 2 Forgot to change wording. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-17 01:07:23 -04:00
kingbri	7abbac098a	Config: Update Q4 in comments Wasn't present when the option was added. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-17 01:04:12 -04:00
kingbri	14d8ec2007	Signal: Fix signal handlers for uvicorn Add the ability to override uvicorn's signal handler in addition to using main's signal handler for any SIGINTs before the API server starts. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-16 23:23:31 -04:00
kingbri	95e44c20d6	Model: Fix load if model didn't load properly If the model didn't load properly, the container still exists until unload is called. However, the name check still registered as true. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-16 23:23:31 -04:00
kingbri	2755fd1af0	API: Fix blocking iterator execution Run these iterators on the background thread. On startup, the API spawns a background thread as needed to run sync code on without blocking the event loop. Use asyncio's run_thread function since it allows for errors to be propegated. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-16 23:23:31 -04:00
kingbri	7fded4f183	Tree: Switch to async generators Async generation helps remove many roadblocks to managing tasks using threads. It should allow for abortables and modern-day paradigms. NOTE: Exllamav2 itself is not an asynchronous library. It's just been added into tabby's async nature to allow for a fast and concurrent API server. It's still being debated to run stream_ex in a separate thread or manually manage it using asyncio.sleep(0) Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-16 23:23:31 -04:00
kingbri	33e2df50b7	API: Disable SSE ping chunks These are mainly used for some clients that ping to see if the request is alive. However, we don't need this. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-14 20:47:05 -04:00
kingbri	7006fa4cc8	Tree: Format Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-13 23:33:18 -04:00
kingbri	efc01d947b	API + Model: Add speculative ngram decoding Speculative ngram decoding is like speculative decoding without the draft model. It's not as useful because it only decodes on predictable sequences, but it depends on the usecase. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-13 23:32:11 -04:00
kingbri	2ebefe8258	Logging: Move metrics to gen logging This didn't have a place in the generation function. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-13 23:13:55 -04:00
kingbri	1ec8eb9620	Tree: Format Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-13 00:02:55 -04:00
kingbri	8e4745920c	Requirements: Update Ruff v0.3.2 Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-13 00:02:55 -04:00

... 4 5 6 7 8 ...

641 commits