jalr/tabbyAPI-ollama

Author	SHA1	Message	Date
kingbri	4aebe8a2a5	Config: Use an explicit "auto" value for rope_alpha Using "auto" for rope alpha removes ambiguity on how to explicitly enable automatic rope calculation. The same behavior of None -> auto calculate still exists, but can be overwritten if a model's tabby_config.yml includes `rope_alpha`. Signed-off-by: kingbri <bdashore3@proton.me>	2024-08-31 22:59:56 -04:00
kingbri	364032e39e	Config: Remove development flag from tensor parallel Exists in stable ExllamaV2 version. Signed-off-by: kingbri <bdashore3@proton.me>	2024-08-22 14:15:19 -04:00
kingbri	871c89063d	Model: Add Tensor Parallel support Use the tensor parallel loader when the flag is enabled. The new loader has its own autosplit implementation, so gpu_split_auto isn't valid here. Also make it easier to determine which cache type to use rather than multiple if/else statements. Signed-off-by: kingbri <bdashore3@proton.me>	2024-08-22 14:15:19 -04:00
kingbri	3e42211c3e	Config: Embeddings: Make embeddings_device a default when API loading When loading from the API, the fallback for embeddings_device will be the same as the config. Signed-off-by: kingbri <bdashore3@proton.me>	2024-08-01 13:59:49 -04:00
Brian Dashore	1bf062559d	Merge pull request #158 from AlpinDale/embeddings feat: add embeddings support via Infinity-emb	2024-07-31 20:33:12 -04:00
kingbri	46304ce875	Model: Properly pass in max_batch_size from config The override wasn't being passed in before. Also, the default is now none since Exl2 can automatically calculate the max batch size. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-30 18:42:25 -04:00
kingbri	dc3dcc9c0d	Embeddings: Update config, args, and parameter names Use embeddings_device as the parameter for device to remove ambiguity. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-30 15:32:26 -04:00
kingbri	bfa011e0ce	Embeddings: Add model management Embedding models are managed on a separate backend, but are run in parallel with the model itself. Therefore, manage this in a separate container with separate routes. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-30 15:19:27 -04:00
kingbri	3f21d9ef96	Embeddings: Switch to Infinity Infinity-emb is an async batching engine for embeddings. This is preferable to sentence-transformers since it handles scalable usecases without the need for external thread intervention. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-29 13:42:03 -04:00
kingbri	42bc4adcfb	Config: Add option to set priority to realtime Realtime process priority assigns resources to point to tabby's processes. Running as administrator will give realtime priority while running as a normal user will set as high priority. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-24 21:50:06 -04:00
kingbri	5c082b7e8c	Async: Add option to use Uvloop/Winloop These are faster event loops for asyncio which should improve overall performance. Gate these under an experimental flag for now to stress test these loops. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-24 18:59:20 -04:00
kingbri	300f034233	API: Add config option to select servers Always enable the core endpoints and allow servers to be selected as needed. Use the OAI server by default. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-23 14:27:42 -04:00
kingbri	3826815edb	API: Add request logging Log all the parts of a request if the config flag is set. The logged fields are all server side anyways, so nothing is being exposed to clients. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-22 21:40:00 -04:00
kingbri	6019c93637	Networking: Gate sending tracebacks over the API It's possible that tracebacks can give too much info about a system when sent over the API. Gate this under a flag to send them only when debugging since this feature is still useful. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-14 10:30:11 -04:00
kingbri	27d2d5f3d2	Config + Model: Allow for default fallbacks from config for model loads Previously, the parameters under the "model" block in config.yml only handled the loading of a model on startup. This meant that any subsequent API request required each parameter to be filled out or use a sane default (usually defaults to the model's config.json). However, there are cases where admins may want an argument from the config to apply if the parameter isn't provided in the request body. To help alleviate this, add a mechanism that works like sampler overrides where users can specify a flag that acts as a fallback. Therefore, this change both preserves the source of truth of what parameters the admin is loading and adds some convenience for users that want customizable defaults for their requests. This behavior may change in the future, but I think it solves the issue for now. Signed-off-by: kingbri <bdashore3@proton.me>	2024-07-06 17:50:58 -04:00
DocShotgun	55d979b7a5	Update dependencies, support Python 3.12, update for exl2 0.1.5 (#134) * Dependencies: Add wheels for Python 3.12 * Model: Switch fp8 cache to Q8 cache * Model: Add ability to set draft model cache mode * Dependencies: Bump exllamav2 to 0.1.5 * Model: Support Q6 cache * Config: Add Q6 cache and draft_cache_mode to config sample	2024-06-09 17:27:39 +02:00
kingbri	bec919e202	Config: Change cache_size description and location Makes more sense to place cache_size with the other cache options. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-26 20:50:56 -04:00
DocShotgun	767e6a798a	API + Model: Add support for specifying k/v cache size	2024-05-26 14:17:01 -07:00
kingbri	408c66a1f2	Model: Change FA2 and paged attention checks The dynamic generator requires Flash attention 2.5.7 or higher to be installed. This is only supported on Nvidia's 30 series and higher. If a card is AMD or lower than the 30 series, switch to compatibility mode which functions the same way as the older generator, except without parallel batching and any features that depend on it, such as CFG. Signed-off-by: kingbri <bdashore3@proton.me>	2024-05-25 21:16:14 -04:00
kingbri	d759a15559	Model: Fix chunk size handling Wrong class attribute name used for max_attention_size and fixes declaration of the draft model's chunk_size. Also expose the parameter to the end user in both config and model load. Signed-off-by: kingbri <bdashore3@proton.me>	2024-04-07 18:39:19 -04:00
kingbri	46ac3beea9	Templates: Support list style chat_template keys HuggingFace updated transformers to provide templates in a list for tokenizers. Update to support this new format. Providing the name of a template for the "prompt_template" value in config.yml will also look inside the template list. In addition, log if there's a template exception, but continue model loading since it shouldn't shut down the application. Signed-off-by: kingbri <bdashore3@proton.me>	2024-04-07 11:20:25 -04:00
kingbri	08bcc6307a	Config: Update description part 2 Forgot to change wording. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-17 01:07:23 -04:00
kingbri	7abbac098a	Config: Update Q4 in comments Wasn't present when the option was added. Signed-off-by: kingbri <bdashore3@proton.me>	2024-03-17 01:04:12 -04:00
DocShotgun	8245488926	Additional clarification for override_base_seq_len	2024-03-02 09:29:50 -08:00
kingbri	949248fb94	Config: Add experimental torch cuda malloc backend This option saves some VRAM, but does have the chance to error out. Add this in the experimental config section. Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-14 21:45:56 -05:00
kingbri	2f568ff573	Config: Expose auto GPU split reserve config The GPU reserve is used as a VRAM buffer to prevent GPU overflow when automatically deciding how to load a model on multiple GPUs. Make this configurable. Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-08 22:09:50 -05:00
kingbri	58590a6c57	Config: Add option to force streaming off Many APIs automatically ask for request streaming without giving the user the option to turn it off. Therefore, give the user more freedom by giving a server-side kill switch. Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-07 21:09:59 -05:00
kingbri	c0ad647fa7	Model: Auto-detect a one GPU setup and fix gpu_split_auto It makes more sense to use gpu split parameters when the user has >1 GPUs. Otherwise, set split and split_auto to False and save the user some VRAM. Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-06 23:08:57 -05:00
kingbri	849179df17	Model: Make loading use less VRAM The model loader was using more VRAM on a single GPU compared to base exllamav2's loader. This was because single GPUs were running using the autosplit config which allocates an extra vram buffer for safe loading. Turn this off for single-GPU setups (and turn it off by default). This change should allow users to run models which require the entire card with hopefully faster T/s. For example, Mixtral with 3.75bpw increased from ~30T/s to 50T/s due to the extra vram headroom on Windows. Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-06 22:29:56 -05:00
kingbri	1919bf7705	Launch: Make exllamav2 requirement more friendly Add the ability to use an unsafe config flag if needed and migrate the exl2 check to a different file within the exl2 backend code. Signed-off-by: kingbri <bdashore3@proton.me>	2024-02-02 23:36:17 -05:00
kingbri	fc4570220c	API + Model: Add new parameters and clean up documentation The example JSON fields were changed because of the new sampler default strategy. Fix these by manually changing the values. Also add support for fasttensors and expose generate_window to the API. It's recommended to not adjust generate_window as it's dynamically scaled based on max_seq_len by default. Signed-off-by: kingbri <bdashore3@proton.me>	2024-01-25 00:15:40 -05:00
kingbri	6c30f24c83	Tree: Unify sampler parameters and add override support Unify API sampler params into a superclass which should make them easier to manage and inherit generic functions from. Not all frontends expose all sampling parameters due to connections with OAI (that handles sampling themselves with the exception of a few sliders). Add the ability for the user to customize fallback parameters from server-side. In addition, parameters can be forced to a certain value server-side in case the repo automatically sets other sampler values in the background that the user doesn't want. Signed-off-by: kingbri <bdashore3@proton.me>	2024-01-25 00:15:40 -05:00
kingbri	6b04463051	API: Fix CFG reporting The model endpoint wasn't reporting if CFG is on. Signed-off-by: kingbri <bdashore3@proton.me>	2024-01-02 13:54:16 -05:00
kingbri	b378773d0a	Model: Add CFG support CFG, or classifier-free guidance helps push a model in different directions based on what the user provides. Currently, CFG is ignored if the negative prompt is blank (it shouldn't be used in that way anyways). Signed-off-by: kingbri <bdashore3@proton.me>	2024-01-02 01:46:51 -05:00
kingbri	4136f19058	Config: Make the sample a drop-in solution With the new wiki, all parameters are fully documented along with comments in the YAML file itself. This should help new users who pull, copy the config, and can't start the API due to subsections being uncommented and read. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-29 01:36:21 -05:00
kingbri	ec929728d9	Model: Read scale_pos_emb from config In newer versions of exllamav2, this value is read from the model's config.json. This value will still default to 1.0 anyways. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-28 21:14:24 -05:00
kingbri	c72d30918c	Config: Default None -> Empty in comments Empty makes more sense when talking about empty fields. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-28 00:32:29 -05:00
kingbri	3622710582	API: Fix num_experts_per_token reporting This wasn't linked to the model config. This value can be 1 if a MoE model isn't loaded. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-28 00:31:14 -05:00
kingbri	8fa764bfbe	Auth: Add option to disable authentication This creates a massive security hole, but it's gated behind a flag for users who only use localhost. A warning will pop up when users disable authentication. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-21 23:40:16 -05:00
kingbri	72e19dbc12	Config: Change default dirs in sample Models and draft models default to the models directory while loras default to the loras directory. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-21 22:35:03 -05:00
kingbri	bee758dae9	Config: Clarify rope parameters Blank = automatic calculation of alpha value. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-20 21:15:06 -05:00
kingbri	ab10b263fd	Model: Add override base seq len Some models (such as mistral and mixtral) set their base sequence length to 32k due to assumptions of support for sliding window attention. Therefore, add this parameter to override the base sequence length of a model which helps with auto-calculation of rope alpha. If auto-calculation of rope alpha isn't being used, the max_seq_len parameter works fine as is. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-20 00:45:39 -05:00
kingbri	ce2602df9a	Model: Fix max seq len handling Previously, the max sequence length was overridden by the user's config and never took the model's config.json into account. Now, set the default to 4096, but include config.prepare when selecting the max sequence length. The yaml and API request now serve as overrides rather than parameters. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-19 23:37:52 -05:00
kingbri	de9a19b5d3	Templating: Add generation prompt appending Append generation prompts if given the flag on an OAI chat completion request. This appends the "assistant" message to the instruct prompt. Defaults to true since this is intended behavior. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-18 23:53:47 -05:00
kingbri	f631dd6ff7	Templates: Switch to Jinja2 Jinja2 is a lightweight template parser that's used in Transformers for parsing chat completions. It's much more efficient than Fastchat and can be imported as part of requirements. Also allows for unblocking Pydantic's version. Users now have to provide their own template if needed. A separate repo may be usable for common prompt template storage. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-18 23:53:47 -05:00
kingbri	ad8807a830	Model: Add support for num_experts_by_token New parameter that's safe to edit in exllamav2 v0.0.11. Only recommended for people who know what they're doing. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-17 18:03:01 -05:00
kingbri	083df7d585	Tree: Add generation logging support Generations can be logged in the console along with sampling parameters if the user enables it in config. Metrics are always logged at the end of each prompt. In addition, the model endpoint tells the user if they're being logged or not for transparency purposes. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-12 23:43:35 -05:00
kingbri	db87efde4a	OAI: Add ability to specify fastchat prompt template Sometimes fastchat may not be able to detect the prompt template from the model path. Therefore, add the ability to set it in config.yml or via the request object itself. Also send the provided prompt template on model info request. Signed-off-by: kingbri <bdashore3@proton.me>	2023-12-10 15:43:58 -05:00
DocShotgun	7380a3b79a	Implement lora support (#24) * Model: Implement basic lora support * Add ability to load loras from config on launch * Supports loading multiple loras and lora scaling * Add function to unload loras * Colab: Update for basic lora support * Model: Test vram alloc after lora load, add docs * Git: Add loras folder to .gitignore * API: Add basic lora-related endpoints * Add /loras/ endpoint for querying available loras * Add /model/lora endpoint for querying currently loaded loras * Add /model/lora/load endpoint for loading loras * Add /model/lora/unload endpoint for unloading loras * Move lora config-checking logic to main.py for better compat with API endpoints * Revert bad CRLF line ending changes * API: Add basic lora-related endpoints (fixed) * Add /loras/ endpoint for querying available loras * Add /model/lora endpoint for querying currently loaded loras * Add /model/lora/load endpoint for loading loras * Add /model/lora/unload endpoint for unloading loras * Move lora config-checking logic to main.py for better compat with API endpoints * Model: Unload loras first when unloading model * API + Models: Cleanup lora endpoints and functions Condenses down endpoint and model load code. Also makes the routes behave the same way as model routes to help not confuse the end user. Signed-off-by: kingbri <bdashore3@proton.me> * Loras: Optimize load endpoint Return successes and failures along with consolidating the request to the rewritten load_loras function. Signed-off-by: kingbri <bdashore3@proton.me> --------- Co-authored-by: kingbri <bdashore3@proton.me> Co-authored-by: DocShotgun <126566557+DocShotgun@users.noreply.github.com>	2023-12-08 23:38:08 -05:00
DocShotgun	39f7a2aabd	Expose draft_rope_scale	2023-12-05 12:59:32 -08:00

60 commits