Commit graph

1050 commits

Author SHA1 Message Date
AUTOMATIC
056527ceb3 add logprobs support for exl3 2025-08-03 11:42:32 +03:00
Brian
03d72a37be
Merge pull request #371 from DocShotgun/main
Config: Remove developer arg cuda_malloc_backend
2025-08-01 14:02:57 -04:00
DocShotgun
102af306e5 Config: Remove developer arg cuda_malloc_backend
* cudaMallocAsync is now enabled by default on supported configurations
2025-08-01 10:59:13 -07:00
kingbri
113643c0df Main: Enable cudaMallocAsync backend by default
Works on CUDA 12.4 and up. If CUDA doesn't exist, then don't enable
the backend. This is an env var that needs to be set, so it's not really
possible to set it via config.yml.

This used to be experimental, but it's probably fine to keep it enabled
since it only provides a benefit.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-27 22:31:38 -04:00
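A minimal sketch of how such an opt-in might look (the helper name is hypothetical; `PYTORCH_CUDA_ALLOC_CONF` is PyTorch's real allocator env var):

```python
import os

def enable_cuda_malloc_async() -> None:
    """Hypothetical helper: opt into cudaMallocAsync on CUDA >= 12.4."""
    try:
        import torch
    except ImportError:
        return  # no torch, nothing to configure

    if not torch.cuda.is_available() or torch.version.cuda is None:
        return  # CUDA doesn't exist, so don't enable the backend

    major, minor, *_ = (int(p) for p in torch.version.cuda.split("."))
    if (major, minor) >= (12, 4):
        # Must be set before the allocator initializes, hence an env var
        # rather than a config.yml option.
        os.environ.setdefault(
            "PYTORCH_CUDA_ALLOC_CONF", "backend:cudaMallocAsync"
        )
```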
kingbri
0b4ca567f8 API: Persist request IDs and append full_text to finish chunk
Adding these to each generation chunk helps remove redundancy and
unnecessary request ID operations.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-25 12:27:44 -04:00
kingbri
e77fa0b7a8 Docs: Edit inline loading for breaking changes
Add the model key for the YAML examples.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-24 18:11:42 -04:00
kingbri
ab04a6ed60 Dependencies: Bump ExllamaV3
v0.0.5

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-18 22:56:35 -04:00
kingbri
bf936f5c39 Dependencies: Update exllamav2
v0.3.2

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-13 23:33:12 -04:00
Brian
2419d2d0a3
Merge pull request #364 from theroyallab/tool-calls
Streamline tool calling
2025-07-11 11:34:10 -04:00
kingbri
707d005aad API: Default tool call ID and type
Doing this helps reduce the model's burden of generating the tool
call ID and type (which is always "function"). Follow Mistral's spec
for tool call IDs by using a 9-character alphanumeric string.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-11 01:11:09 -04:00
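A sketch of generating such an ID (the function name is illustrative, not the actual tabbyAPI helper):

```python
import secrets
import string

_ALPHANUMERIC = string.ascii_letters + string.digits

def default_tool_call_id() -> str:
    # 9-character alphanumeric string, matching Mistral's tool call ID format
    return "".join(secrets.choice(_ALPHANUMERIC) for _ in range(9))
```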
kingbri
5b1db3ad83 API: Don't do a second re-render when tool calling
Re-rendering the template is an expensive operation when it's possible
to just concatenate the prompt and current generation text together.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-06 11:32:36 -04:00
kingbri
3dfa965019 API: Add tool_call_id for role = tool
If a message with role = tool is present, the tool_call_id should
also be given to the template.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-05 21:52:58 -04:00
kingbri
1c3f84151f Docs: Update tool calling
For new variables and format.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-05 21:43:04 -04:00
kingbri
871f71c4e7 Templates: Adjust tool call example
Use the new tool call variables and formatting. Also prettify the template.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-05 21:42:23 -04:00
kingbri
879f4cee7e API: Modify tool calling for wider compat
When revisiting tool calls, the formats have more or less become standard.
For greater compatibility with templates, primarily use the message.tools
parameter and remove the extra custom metadata that is no longer required.

However, unlike other backends, tabbyAPI still uses template metadata
to declare what the tool start string is. This allows for template-level
customization and gives more power to the user, while the server
consumes templates rather than handling models case by case.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-05 14:28:12 -04:00
kingbri
b6a26da50c API: Fix tool call serialization
To render in the template, tool call start tokens needed fewer
checks. Also remove the line that converts message.tool_calls to a dict,
since that breaks the rest of the chain by disconnecting the types;
model_dump on the message itself already accomplishes this.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-04 15:02:49 -04:00
kingbri
d23fefbecd API + Model: Fix application of defaults
use_as_default was not being properly applied to model overrides.
For compartmentalization's sake, apply all overrides in a single function
to avoid clutter.

In addition, fix where the traditional /v1/model/load endpoint checks
for draft options. These can be applied via an inline config, so let
any failures fall through.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-03 14:37:34 -04:00
kingbri
d339139fb6 Config: Deep merge model overrides
Anything below the first level of kwargs was not being merged properly.
A more bulletproof solution would be to refactor the loading code
to separate draft and normal model parameters.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-03 12:17:09 -04:00
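A generic deep merge along these lines (a sketch, not the project's actual implementation):

```python
def deep_merge(base: dict, overrides: dict) -> dict:
    """Recursively merge overrides into base.

    Nested dicts are merged key by key, so values below the first level
    of kwargs survive; anything else is replaced wholesale.
    """
    merged = dict(base)
    for key, value in overrides.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged
```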
kingbri
0152a1665b Downloader: Switch to use API sizes
Rather than relying on Content-Length which can be unreliable, ping
the API to get file sizes and work from there.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-30 12:49:53 -04:00
kingbri
03ff4c3128 Downloader: Handle if Content-Length is undefined
Usually, both the client and server are aware of the file size via
the Content-Length header. However, HuggingFace has changed its
headers and no longer always sends Content-Length.

In this case, show an indeterminate progress bar and mark the download
as complete once it finishes.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-30 11:43:22 -04:00
turboderp
0ae878712e Exl3: Clear image embedding cache on unload 2025-06-25 23:56:21 +02:00
Brian
e362319a4d
Merge pull request #358 from theroyallab/breaking
Breaking changes for configuration
2025-06-17 23:10:16 -04:00
kingbri
a02d39de31 Model: Remove rogue print
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-17 23:09:07 -04:00
kingbri
2913ce29fc API: Add timings to usage stats
It's useful for the client to know what the T/s and total generation
time are per request.

Works with both completions and chat completions.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-17 22:54:51 -04:00
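The per-request math is simple; a sketch with illustrative field names (the actual usage schema may differ):

```python
def generation_timings(token_count: int, elapsed_s: float) -> dict:
    """Per-request stats appended to the usage block.

    Field names here are illustrative, not the exact API schema.
    """
    tps = token_count / elapsed_s if elapsed_s > 0 else 0.0
    return {
        "total_time_s": round(elapsed_s, 3),
        "tokens_per_second": round(tps, 2),
    }
```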
kingbri
5d94d4d022 Merge branch 'main' into breaking 2025-06-17 22:24:32 -04:00
turboderp
122d87ac36 Tree: Format 2025-06-15 19:33:14 +02:00
turboderp
21c5af48e1 Tree: Format 2025-06-15 19:30:38 +02:00
turboderp
1c9891bf04 Exl3: Add vision capability 2025-06-15 19:22:51 +02:00
turboderp
4605c0f6bd Common: Refactor get_image to common functions 2025-06-15 19:20:36 +02:00
turboderp
d357f100d0 Dependencies: Bump ExllamaV3 2025-06-15 19:12:45 +02:00
turboderp
a0c16bba2a Exl2: Fix banned_strings (move outside of assign_gen_params) 2025-06-15 16:51:42 +02:00
kingbri
2096c9bad2 Model: Default max_seq_len to 4096
A common problem in TabbyAPI is that users who want to get up and
running with a model often hit OOMs caused by max_seq_len. This is
because model devs set max context values in the millions, which
requires a lot of VRAM.

To idiot-proof first time setup, make the fallback default 4096 so
users can run their models. If a user still wants to use the model's
max_seq_len, set it to -1.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-13 14:57:24 -04:00
kingbri
322f9b773a Model: Migrate inline config to new format
This matches config.yml and all model overrides should go under the
"model" block.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-13 14:57:24 -04:00
kingbri
a3c780ae58 API: Core: Remove load/template aliases
These added extra complexity and should be removed and replaced
with a single parameter.

Changes:
- /v1/model/load must use model_name and draft_model_name
- /v1/model/embedding/load must use embedding_model_name
- /v1/template/switch must use prompt_template_name

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-13 14:57:24 -04:00
kingbri
0ea56382f0 Dependencies: Fix unsupported dependency error
Log the package name provided to the check function.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-13 14:57:02 -04:00
kingbri
f4ee56ba13 Update README
Include ExllamaV3

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-13 14:57:01 -04:00
turboderp
691a080ac7 Dependencies: Bump ExllamaV3 and ExllamaV2 2025-05-31 23:55:04 +02:00
kingbri
2d89c96879 API: Re-add BOS token stripping in template render
Matching YALS, if the model has add_bos_token enabled, then remove
an extra BOS token at the start of the prompt. This usually happens
with misconfigured templates such as Llama 3.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-24 21:11:53 -04:00
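A sketch of the stripping step (string-level for clarity; the real code may operate on token IDs):

```python
def strip_duplicate_bos(prompt: str, bos_token: str, add_bos_token: bool) -> str:
    """If the tokenizer will prepend BOS anyway, drop one copy that a
    misconfigured template (e.g. Llama 3) already baked into the prompt."""
    if add_bos_token and bos_token and prompt.startswith(bos_token):
        return prompt[len(bos_token):]
    return prompt
```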
kingbri
10fbe043a4 API: Fix typing for chat templates in CC requests
Tools must be None by default. Chat completion message content can
be None, a string, or a list, so default to None. Exclude all None
values from a CC message since the template can say the variable
"exists" despite being None, causing an error.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-24 21:06:05 -04:00
kingbri
0c4cc1eba3 Model: Add prompt logging to ExllamaV3
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 22:05:18 -04:00
Brian
729caaeddc
Merge pull request #346 from gakada/main
Exl3: some models aren't functional without add_bos?
2025-05-17 22:05:15 -04:00
kingbri
0646d358a2 Main: Log auth and sampler overrides after model load
Like YALS, logging all pertinent information after model load makes
it easier to parse by the user.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 18:10:34 -04:00
kingbri
54b8a20a19 API: Fix types for chat completions
Messages were mistakenly being sent as Pydantic objects, but templates
expect dictionaries. Properly convert these before render.

In addition, initialize all Optional lists as empty lists, since
this causes the fewest problems when interacting with other parts
of API code, such as templates.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 18:10:34 -04:00
gakada
ba6248eec0
Exl3: fix add_bos in generator 2025-05-17 19:10:49 +09:00
Brian
81170eee00
Merge pull request #312 from davidallada/add-file-based-logging
Add file based logging in addition to the normal console logs
2025-05-17 01:24:19 -04:00
kingbri
17f3dca6fc Packaging: Add agnostic method to check version of packages
Some packages such as ExllamaV2 and V3 require specific versions for
the latest features. Rather than creating repetitive functions, create
an agnostic function that checks the installed package and tells
the user to upgrade.

This is also sent to requests for loading and unloading, so keep the
error short.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 01:04:24 -04:00
kingbri
084916c04f Model: Fix autosplit reserve crash with GPU split
ExllamaV3 does not accept autosplit_reserve and gpu_split at the same
time.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 00:51:14 -04:00
kingbri
0858b6d4b2 Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 00:46:40 -04:00
kingbri
fa534fe551 Dependencies: Update Ruff
v0.11.10

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 00:46:25 -04:00
kingbri
390daeb92f Model: Create universal HFModel class
The HFModel class serves to coalesce all config files that contain
random keys which are required for model usage.

Adding this base class allows us to expand as HuggingFace randomly
changes their JSON schemas over time, reducing the burden on backend
devs when the next model isn't supported.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-13 18:12:38 -04:00
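A minimal sketch of what such a coalescing class might look like (the file list and merge order are assumptions, not tabbyAPI's actual implementation):

```python
import json
from pathlib import Path

class HFModel:
    """Coalesce the assorted HuggingFace config files into one mapping,
    so backend code has a single place to look up required keys."""

    CONFIG_FILES = ("config.json", "generation_config.json", "tokenizer_config.json")

    def __init__(self, model_dir: str):
        self.config: dict = {}
        for name in self.CONFIG_FILES:
            path = Path(model_dir) / name
            if path.exists():
                # Later files override earlier ones on key collisions
                self.config.update(json.loads(path.read_text()))
```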