jalr/tabbyAPI-ollama

Author	SHA1	Message	Date
kingbri	a51acb9db4	Templates: Switch to async jinja engine This prevents any possible blocking of the event loop due to template rendering. Signed-off-by: kingbri <bdashore3@proton.me>	2024-08-17 12:03:41 -04:00
kingbri	b4752c1e62	Templates: Revert to load metadata on runtime Metadata is generated via a template's module. This requires a single iteration through the template. If a template tries to access a passed variable that doesn't exist, it will error. Therefore, generate the metadata at runtime to prevent these errors from happening. To optimize further, cache the metadata after the first generation to prevent the expensive call of making a template module. Signed-off-by: kingbri <bdashore3@proton.me>	2024-08-17 11:44:42 -04:00
Ben Gitter	70b9fc95de	[WIP] OpenAI Tools Support/Function calling (#154 ) * returning stop str if exists from gen * added chat template for firefunctionv2 * pulling tool vars from template * adding parsing for tool inputs/outputs * passing tool data from endpoint to chat template, adding tool_start to the stop list * loosened typing on the response tool call, leaning more on the user supplying a quality schema if they want a particular format * non streaming generation prototype * cleaning template * Continued work with type, ingestion into template, and chat template for fire func * Correction - streaming toolcall comes back as delta obj not inside chatcomprespchoice per chat_completion_chunk.py inside OAI lib. * Ruff Formating * Moved stop string and tool updates out of prompt creation func Updated tool pydantic to match OAI Support for streaming Updated generate tool calls to use flag within chat_template and insert tool reminder * Llama 3.1 chat templates Updated fire func template * renamed llama3.1 to chatml_with_headers.. * update name of template * Support for calling a tool start token rather than the string. Simplified tool_params Warning when gen_settings are being overidden becuase user set temp to 0 Corrected schema and tools to correct types for function args. Str for some reason * draft groq tool use model template * changed headers to vars for readablity (but mostly because some models are weird about newlines after headers, so this is an easier way to change globally) * Clean up comments and code in chat comp * Post processed tool call to meet OAI spec rather than forcing model to write json in a string in the middle of the call. * changes example back to args as json rather than string of json * Standardize chat templates to each other * cleaning/rewording * stop elements can also be ints (tokens) * Cleaning/formatting * added special tokens for tools and tool_response as specified in description * Cleaning * removing aux templates - going to live in llm-promp-templates repo instead * Tree: Format Signed-off-by: kingbri <bdashore3@proton.me> * Chat Completions: Don't include internal tool variables in OpenAPI Use SkipJsonSchema to supress inclusion with the OpenAPI JSON. The location of these variables may need to be changed in the future. Signed-off-by: kingbri <bdashore3@proton.me> * Templates: Deserialize metadata on template load Since we're only looking for specific template variables that are static in the template, it makes more sense to render when the template is initialized. Signed-off-by: kingbri <bdashore3@proton.me> * Tools: Fix comments Adhere to the format style of comments in the rest of the project. Signed-off-by: kingbri <bdashore3@proton.me> --------- Co-authored-by: Ben Gitter <gitterbd@gmail.com> Signed-off-by: kingbri <bdashore3@proton.me>	2024-08-17 00:16:25 -04:00
Brian Dashore	1bf062559d	Merge pull request #158 from AlpinDale/embeddings feat: add embeddings support via Infinity-emb	2024-07-31 20:33:12 -04:00
kingbri	f13d0fb8b3	Embeddings: Add model load checks Same as the normal model container. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-30 11:17:36 -04:00
kingbri	fbf1455db1	Embeddings: Migrate and organize Infinity Use Infinity as a separate backend and handle the model within the common module. This separates out the embeddings model from the endpoint which allows for model loading/unloading in core. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-30 11:00:23 -04:00
kingbri	ac1afcc588	Embeddings: Use response classes instead of dicts Follows the existing code style. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-29 14:15:40 -04:00
kingbri	3f21d9ef96	Embeddings: Switch to Infinity Infinity-emb is an async batching engine for embeddings. This is preferable to sentence-transformers since it handles scalable usecases without the need for external thread intervention. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-29 13:42:03 -04:00
kingbri	c9a5d2c363	OAI: Refactor embeddings Move files and rewrite routes to adhere to Tabby's code style. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-28 14:10:51 -04:00
kingbri	2773517a16	API: Add setup function to routers This helps prepare the router before exposing it to the parent app. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-26 22:24:33 -04:00
Brian Dashore	6365427d38	Merge pull request #155 from Vhallo/main Simple Typo Fix	2024-07-26 21:35:50 -04:00
kingbri	884b6f5ecd	API: Add log options for initialization Make each API log their respective URLs to help inform users. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-26 21:32:05 -04:00
AlpinDale	5adfab1cbd	ruff: formatting	2024-07-26 02:53:14 +00:00
AlpinDale	f20cd330ef	feat: add embeddings support via sentence-transformers	2024-07-26 02:45:07 +00:00
Vhallo	b2064bbfb4	Typo fix in completion.py	2024-07-23 23:49:43 +02:00
Vhallo	88e4b108b4	Typo fix in chat_completion.py	2024-07-23 23:48:50 +02:00
kingbri	9ad69e8ab6	API: Migrate universal routes to core Place OAI specific routes in the appropriate folder. This is in preperation for adding new API servers that can be optionally enabled. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-23 14:08:48 -04:00
kingbri	64c2cc85c9	OAI: Migrate model depends into proper file Use amongst multiple routers. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-23 13:59:56 -04:00
kingbri	d1706fb067	OAI: Remove double logging if request is cancelled Uvicorn can log in both the request disconnect handler and the CancelledError. However, these sometimes don't work and both need to be checked. But, don't log twice if one works. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-22 21:48:59 -04:00
kingbri	cae94b920c	API: Add ability to use request IDs Identify which request is being processed to help users disambiguate which logs correspond to which request. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-21 21:01:05 -04:00
kingbri	38185a1ff4	Auth: Fix key check coalesce Prefer the auth-specific headers before the generic authorization header. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-19 10:08:57 -04:00
kingbri	c1b61441f4	OAI: Fix usage chunk return Place the logic into their proper utility functions and cleanup the code with formatting. Also, OAI's docs specify that a [DONE] return is needed when everything is finished. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-12 14:37:20 -04:00
Volodymyr Kuznetsov	b149d3398d	OAI: support stream_options argument	2024-07-11 18:37:50 -07:00
kingbri	9fc3fc4c54	OAI: Amend comments Clarify what the user can and can't see. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-11 14:22:50 -04:00
kingbri	1f46a1130c	OAI: Restrict list permissions for API keys API keys are not allowed to view all the admin's models, templates, draft models, loras, etc. Basically anything that can be viewed on the filesystem outside of anything that's currently loaded is not allowed to be returned unless an admin key is present. This change helps preserve user privacy while not erroring out on list endpoints that the OAI spec requires. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-11 14:22:50 -04:00
kingbri	dfb4c51d5f	OAI: Fix function idioms Make functions mean the same thing to avoid confusion. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-11 14:22:50 -04:00
kingbri	b9a58ff01b	Auth: Make key permission check work on Requests Pass a request and internally unwrap the headers. In addition, allow X-admin-key to get checked in an API key request. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-11 14:22:49 -04:00
kingbri	5c293499bd	OAI: Reorder functions Reordering routes changes the order of appearance on documentation. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-08 15:27:08 -04:00
kingbri	521d21b9f2	OAI: Add return types for docs Adding return types allows for responses to get included in the autogenerated docs. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-08 15:23:41 -04:00
kingbri	27d2d5f3d2	Config + Model: Allow for default fallbacks from config for model loads Previously, the parameters under the "model" block in config.yml only handled the loading of a model on startup. This meant that any subsequent API request required each parameter to be filled out or use a sane default (usually defaults to the model's config.json). However, there are cases where admins may want an argument from the config to apply if the parameter isn't provided in the request body. To help alleviate this, add a mechanism that works like sampler overrides where users can specify a flag that acts as a fallback. Therefore, this change both preserves the source of truth of what parameters the admin is loading and adds some convenience for users that want customizable defaults for their requests. This behavior may change in the future, but I think it solves the issue for now. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-06 17:50:58 -04:00
DocShotgun	156b74f3f0	Revision to paged attention checks (#133 ) * Model: Clean up paged attention checks * Model: Move cache_size checks after paged attn checks Cache size is only relevant in paged mode * Model: Fix no_flash_attention * Model: Remove no_flash_attention Ability to use flash attention is auto-detected, so this flag is unneeded. Uninstall flash attention to disable it on supported hardware.	2024-06-09 17:28:11 +02:00
DocShotgun	55d979b7a5	Update dependencies, support Python 3.12, update for exl2 0.1.5 (#134 ) * Dependencies: Add wheels for Python 3.12 * Model: Switch fp8 cache to Q8 cache * Model: Add ability to set draft model cache mode * Dependencies: Bump exllamav2 to 0.1.5 * Model: Support Q6 cache * Config: Add Q6 cache and draft_cache_mode to config sample	2024-06-09 17:27:39 +02:00
Orion	6cc3bd9752	feat: list support in message.content (#122 )	2024-06-03 19:57:15 +02:00
turboderp	1951f7521c	Forward exceptions from _stream_collector to stream_generate_(chat)_completion (#126 )	2024-06-03 19:42:45 +02:00
turboderp	a011c17488	Revert "Forward exceptions from _stream_collector to stream_generate_chat_completion" This reverts commit `1bb8d1a312`.	2024-06-02 15:37:37 +02:00
turboderp	1bb8d1a312	Forward exceptions from _stream_collector to stream_generate_chat_completion	2024-06-02 15:13:30 +02:00
kingbri	e95e67a000	OAI: Add validation to "n" n must be greater than 1 to generate. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-28 00:52:30 -04:00
kingbri	e2a8b6e8ae	OAI: Add "n" support for streaming generations Use a queue-based system to get choices independently and send them in the overall streaming payload. This method allows for unordered streaming of generations. The system is a bit redundant, so maybe make the code more optimized in the future. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-28 00:52:30 -04:00
kingbri	c8371e0f50	OAI: Copy gen params for "n" For multiple generations in the same request, nested arrays kept their original reference, resulting in duplications. This will occur with any collection type. For optimization purposes, a deepcopy isn't run for the first iteration since original references are created. This is not the most elegant solution, but it works for the described cases. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-28 00:52:30 -04:00
kingbri	b944f8d756	OAI: Add "n" for non-streaming generations This adds the ability to add multiple choices to a generation. This is only available for non-streaming gens for now, it requires some more work to port over to streaming. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-28 00:52:30 -04:00
DocShotgun	7ab7ffd562	Tree: Format	2024-05-26 15:48:18 -07:00
DocShotgun	767e6a798a	API + Model: Add support for specifying k/v cache size	2024-05-26 14:17:01 -07:00
kingbri	d710a1b441	OAI: Switch to background task for disconnect checks Waiting for request disconnect takes some extra time and allows generation chunks to pile up, resulting in large payloads being sent at once not making up a smooth stream. Use the polling method in non-streaming requests by creating a background task and then check if the task is done, signifying that the request has been disconnected. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-26 13:52:20 -04:00
kingbri	660f9b8432	OAI: Fix request cancellation behavior Depending on the day of the week, Starlette can work with a CancelledError or using await request.is_disconnected(). Run the same behavior for both cases and allow cancellation. Streaming requests now set an event to cancel the batched job and break out of the generation loop. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-26 13:00:33 -04:00
kingbri	9fbbc5afca	Tree: Swap from map to list comprehensions List comprehensions are the more "pythonic" way to approach mapping values to a list. They're also more flexible across different collection types rather than the inbuilt map method. It's best to keep one convention rather than splitting down two. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	43cd7f57e8	API + Model: Add blocks and checks for various load requests Add a sequential lock and wait until jobs are completed before executing any loading requests that directly alter the model. However, we also need to block any new requests that come in until the load is finished, so add a condition that triggers once the lock is free. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	408c66a1f2	Model: Change FA2 and paged attention checks The dynamic generator requires Flash attention 2.5.7 or higher to be installed. This is only supported on Nvidia's 30 series and higher. If a card is AMD or lower than the 30 series, switch to compatability mode which functions the same way as the older generator, except without parallel batching and any features that depend on it, such as CFG. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	06ff47e2b4	Model: Use true async jobs and add logprobs The new async dynamic job allows for native async support without the need of threading. Also add logprobs and metrics back to responses. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	0e9385e023	API: Fix usage reporting for chat completions Resolves #106 Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-17 00:03:15 -04:00
kingbri	6f4012d20d	API: Add preset listing for sampler overrides Querying the overrides list endpoint now returns the selected preset and a list of presets to use. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-12 01:34:51 -04:00

1 2

82 commits