jalr/tabbyAPI-ollama

Author	SHA1	Message	Date
kingbri	b944f8d756	OAI: Add "n" for non-streaming generations This adds the ability to add multiple choices to a generation. This is only available for non-streaming gens for now, it requires some more work to port over to streaming. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-28 00:52:30 -04:00
kingbri	d710a1b441	OAI: Switch to background task for disconnect checks Waiting for request disconnect takes some extra time and allows generation chunks to pile up, resulting in large payloads being sent at once, which does not make for a smooth stream. Use the polling method in non-streaming requests by creating a background task and then check if the task is done, signifying that the request has been disconnected. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-26 13:52:20 -04:00
kingbri	660f9b8432	OAI: Fix request cancellation behavior Depending on the day of the week, Starlette can work with a CancelledError or using await request.is_disconnected(). Run the same behavior for both cases and allow cancellation. Streaming requests now set an event to cancel the batched job and break out of the generation loop. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-26 13:00:33 -04:00
kingbri	06ff47e2b4	Model: Use true async jobs and add logprobs The new async dynamic job allows for native async support without the need of threading. Also add logprobs and metrics back to responses. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	0e9385e023	API: Fix usage reporting for chat completions Resolves #106 Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-17 00:03:15 -04:00
kingbri	fb1d2f34c1	OAI: Add response_prefix and fix BOS token issues in chat completions response_prefix is used to add a prefix before generating the next message. This is used in many cases such as continuing a prompt (see #96). Also if a template has BOS token specified, add_bos_token will append two BOS tokens. Add a check which strips a starting BOS token from the prompt if it exists. Signed-off-by: kingbri <bdashore3@proton.me>	2024-04-25 00:54:43 -04:00
kingbri	cab789e685	Templates: Migrate to class Having many utility functions for initialization doesn't make much sense. Instead, handle anything regarding template creation inside the class which reduces the amount of function imports. Signed-off-by: kingbri <bdashore3@proton.me>	2024-04-21 23:28:14 -04:00
kingbri	2a0aaa2e8a	OAI: Add ability to pass extra vars in jinja templates A chat completion can now declare extra template_vars to pass when a template is rendered, opening up the possibility of using state outside of huggingface's parameters. Signed-off-by: kingbri <bdashore3@proton.me>	2024-04-11 09:49:25 -04:00
kingbri	f9f8c97c6d	Templates: Fix stop_string parsing Template modules grab all set vars, including ones that use runtime vars. If a template var is set to a runtime var and a module is created, an UndefinedError fires. Use make_module instead to pass runtime vars when creating a template module. Resolves #92 Signed-off-by: kingbri <bdashore3@proton.me>	2024-04-02 00:44:04 -04:00
kingbri	e8b6a02aa8	API: Move prompt template construction to utils Best to move the inner workings within its inner function. Also fix an edge case where stop strings can be a string rather than an array. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-29 02:24:13 -04:00
kingbri	6dfcbbd813	Common: Migrate request utils to networking Helps organize the project better. Utils is meant to be for simple functions like unwrap. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-21 23:21:57 -04:00
kingbri	07d9b7cf7b	Model: Add abort on generation When the model is processing a prompt, add the ability to abort on request cancellation. This is also a catch for a SIGINT. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-20 15:21:37 -04:00
kingbri	2704ff8344	Tree: Format Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-18 16:02:29 -04:00
kingbri	5c7fc69ded	API: Fix finish_reason returns OAI expects finish_reason to be "stop" or "length" (there are others, but they're not in the current scope of this project). Make all completions and chat completions responses return this from the model generation itself rather than putting a placeholder. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-18 15:59:28 -04:00
kingbri	7fded4f183	Tree: Switch to async generators Async generation helps remove many roadblocks to managing tasks using threads. It should allow for abortables and modern-day paradigms. NOTE: Exllamav2 itself is not an asynchronous library. It's just been added into tabby's async nature to allow for a fast and concurrent API server. It's still being debated to run stream_ex in a separate thread or manually manage it using asyncio.sleep(0) Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-16 23:23:31 -04:00
kingbri	1ec8eb9620	Tree: Format Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-13 00:02:55 -04:00
kingbri	6f03be9523	API: Split functions into their own files Previously, generation functions were bundled with the request function, causing the overall code structure and API to look ugly and unreadable. Split these up and clean up a lot of the methods that were previously overlooked in the API itself. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-12 23:59:30 -04:00

17 commits