From 698d8339cb0b24e6f63d8a4bfc24daac60e416ee Mon Sep 17 00:00:00 2001
From: kingbri <8082010+kingbri1@users.noreply.github.com>
Date: Wed, 19 Mar 2025 11:47:49 -0400
Subject: [PATCH] Config + Docs: Clarify YaRN rope scaling changes

In ExllamaV2, if a model has YaRN support, linear RoPE options are not
applied. Users can set max_seq_len and exl2 will take care of the rest.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
---
 config_sample.yml          | 3 +++
 docs/02.-Server-options.md | 4 ++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/config_sample.yml b/config_sample.yml
index 745433b..a13e64e 100644
--- a/config_sample.yml
+++ b/config_sample.yml
@@ -95,6 +95,9 @@ model:
   # Used with tensor parallelism.
   gpu_split: []
 
+  # NOTE: If a model has YaRN rope scaling, it will automatically be enabled by ExLlama.
+  # rope_scale and rope_alpha settings won't apply in this case.
+
   # Rope scale (default: 1.0).
   # Same as compress_pos_emb.
   # Use if the model was trained on long context with rope.
diff --git a/docs/02.-Server-options.md b/docs/02.-Server-options.md
index 6a4f1d9..b319f76 100644
--- a/docs/02.-Server-options.md
+++ b/docs/02.-Server-options.md
@@ -67,8 +67,8 @@ Note: Most of the options here will only apply on initial model load/startup (ep
 | gpu_split_auto | Bool (True) | Automatically split the model across multiple GPUs. Manual GPU split isn't used if this is enabled. |
 | autosplit_reserve | List[Int] ([96]) | Amount of empty VRAM to reserve when loading with autosplit.<br><br>Represented as an array of MB per GPU used. |
 | gpu_split | List[Float] ([]) | Float array of GBs to split a model between GPUs. |
-| rope_scale | Float (1.0) | Adjustment for rope scale (or compress_pos_emb) |
-| rope_alpha | Float (None) | Adjustment for rope alpha. Leave blank to automatically calculate based on the max_seq_len. |
+| rope_scale | Float (1.0) | Adjustment for rope scale (or compress_pos_emb)<br><br>Note: If the model has YaRN support, this option will not apply. |
+| rope_alpha | Float (None) | Adjustment for rope alpha. Leave blank to automatically calculate based on the max_seq_len.<br><br>Note: If the model has YaRN support, this option will not apply. |
 | cache_mode | String ("FP16") | Cache mode for the model.<br><br>Options: FP16, Q8, Q6, Q4 |
 | cache_size | Int (max_seq_len) | Size of the K/V cache<br><br>Note: If using CFG, the cache size should be 2 * max_seq_len. |
 | chunk_size | Int (2048) | Amount of tokens per chunk with ingestion. A lower value reduces VRAM usage at the cost of ingestion speed. |
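
For reference, a minimal sketch of what the affected `model:` block in `config_sample.yml` could look like for a model that ships YaRN rope scaling, once this patch is applied. The `max_seq_len` value is illustrative only and is not part of the patch:

```yaml
model:
  # For a model whose config ships YaRN rope scaling, ExLlama enables it
  # automatically; setting the desired context length is enough.
  max_seq_len: 32768   # illustrative value, pick what the model supports

  # Linear RoPE options are ignored when YaRN scaling is active
  # (see the NOTE added in the patch above).
  rope_scale: 1.0
  rope_alpha:
```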