From 698d8339cb0b24e6f63d8a4bfc24daac60e416ee Mon Sep 17 00:00:00 2001
From: kingbri <8082010+kingbri1@users.noreply.github.com>
Date: Wed, 19 Mar 2025 11:47:49 -0400
Subject: [PATCH] Config + Docs: Clarify YaRN rope scaling changes

In ExllamaV2, if a model has YaRN support, linear RoPE options are not
applied. Users can set max_seq_len and exl2 will take care of the rest.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
---
 config_sample.yml          | 3 +++
 docs/02.-Server-options.md | 4 ++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/config_sample.yml b/config_sample.yml
index 745433b..a13e64e 100644
--- a/config_sample.yml
+++ b/config_sample.yml
@@ -95,6 +95,9 @@ model:
   # Used with tensor parallelism.
   gpu_split: []
 
+  # NOTE: If a model has YaRN rope scaling, it will automatically be enabled by ExLlama.
+  # rope_scale and rope_alpha settings won't apply in this case.
+
   # Rope scale (default: 1.0).
   # Same as compress_pos_emb.
   # Use if the model was trained on long context with rope.
diff --git a/docs/02.-Server-options.md b/docs/02.-Server-options.md
index 6a4f1d9..b319f76 100644
--- a/docs/02.-Server-options.md
+++ b/docs/02.-Server-options.md
@@ -67,8 +67,8 @@ Note: Most of the options here will only apply on initial model load/startup (ep
 | gpu_split_auto | Bool (True) | Automatically split the model across multiple GPUs. Manual GPU split isn't used if this is enabled. |
 | autosplit_reserve | List[Int] ([96]) | Amount of empty VRAM to reserve when loading with autosplit.<br><br>Represented as an array of MB per GPU used. |
 | gpu_split | List[Float] ([]) | Float array of GBs to split a model between GPUs. |
-| rope_scale | Float (1.0) | Adjustment for rope scale (or compress_pos_emb) |
-| rope_alpha | Float (None) | Adjustment for rope alpha. Leave blank to automatically calculate based on the max_seq_len. |
+| rope_scale | Float (1.0) | Adjustment for rope scale (or compress_pos_emb)<br><br>Note: If the model has YaRN support, this option will not apply. |
+| rope_alpha | Float (None) | Adjustment for rope alpha. Leave blank to automatically calculate based on the max_seq_len.<br><br>Note: If the model has YaRN support, this option will not apply. |
 | cache_mode | String ("FP16") | Cache mode for the model.<br><br>Options: FP16, Q8, Q6, Q4 |
 | cache_size | Int (max_seq_len) | Size of the K/V cache<br><br>Note: If using CFG, the cache size should be 2 * max_seq_len. |
 | chunk_size | Int (2048) | Amount of tokens per chunk with ingestion. A lower value reduces VRAM usage at the cost of ingestion speed. |
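
For reference, a minimal sketch of what the affected `model:` block in `config_sample.yml` could look like for a model that ships YaRN rope scaling, once this patch is applied. The `max_seq_len` value is illustrative only and is not part of the patch:

```yaml
model:
  # For a model whose config ships YaRN rope scaling, ExLlama enables it
  # automatically; setting the desired context length is enough.
  max_seq_len: 32768   # illustrative value, pick what the model supports

  # Linear RoPE options are ignored when YaRN scaling is active
  # (see the NOTE added in the patch above).
  rope_scale: 1.0
  rope_alpha:
```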