Config + Docs: Clarify YaRN rope scaling changes
In ExLlamaV2, if a model has YaRN support, linear RoPE options are not applied. Users can set max_seq_len and exl2 will take care of the rest.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
parent a20abe2d33
commit 698d8339cb
2 changed files with 5 additions and 2 deletions
@@ -67,8 +67,8 @@ Note: Most of the options here will only apply on initial model load/startup (ep
 | gpu_split_auto | Bool (True) | Automatically split the model across multiple GPUs. Manual GPU split isn't used if this is enabled. |
 | autosplit_reserve | List[Int] ([96]) | Amount of empty VRAM to reserve when loading with autosplit.<br><br>Represented as an array of MB per GPU used. |
 | gpu_split | List[Float] ([]) | Float array of GBs to split a model between GPUs. |
-| rope_scale | Float (1.0) | Adjustment for rope scale (or compress_pos_emb) |
-| rope_alpha | Float (None) | Adjustment for rope alpha. Leave blank to automatically calculate based on the max_seq_len. |
+| rope_scale | Float (1.0) | Adjustment for rope scale (or compress_pos_emb)<br><br>Note: If the model has YaRN support, this option will not apply. |
+| rope_alpha | Float (None) | Adjustment for rope alpha. Leave blank to automatically calculate based on the max_seq_len.<br><br>Note: If the model has YaRN support, this option will not apply. |
 | cache_mode | String ("FP16") | Cache mode for the model.<br><br>Options: FP16, Q8, Q6, Q4 |
 | cache_size | Int (max_seq_len) | Size of the K/V cache<br><br>Note: If using CFG, the cache size should be 2 * max_seq_len. |
 | chunk_size | Int (2048) | Amount of tokens per chunk with ingestion. A lower value reduces VRAM usage at the cost of ingestion speed. |
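For reference, here is a hypothetical set of load options using the names documented in the table above. The dict structure and example values are illustrative only, not a confirmed API shape:

```python
# Hypothetical load options; key names come from the documented table,
# the surrounding structure and values are illustrative only.
load_options = {
    "max_seq_len": 32768,       # context length; the only length knob a YaRN model needs
    "rope_scale": 1.0,          # linear rope scale (compress_pos_emb); not applied if the model has YaRN support
    "rope_alpha": None,         # None -> auto-calculated from max_seq_len; also not applied with YaRN
    "cache_mode": "FP16",       # one of "FP16", "Q8", "Q6", "Q4"
    "cache_size": 32768,        # defaults to max_seq_len; use 2 * max_seq_len when running CFG
    "chunk_size": 2048,         # lower -> less VRAM during ingestion, slower prompt processing
    "gpu_split_auto": True,     # let the loader split the model across GPUs automatically
    "autosplit_reserve": [96],  # MB of VRAM to keep free per GPU when autosplitting
    "gpu_split": [],            # manual per-GPU split in GB; unused when autosplit is on
}
```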
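And a minimal sketch of the behavior the commit message describes, assuming YaRN support is detected via a rope_scaling entry in the model's config. The helper name and the detection heuristic are hypothetical, not ExLlamaV2's actual code:

```python
# Illustrative only -- not ExLlamaV2's real implementation.
def build_rope_settings(model_config, max_seq_len, rope_scale=None, rope_alpha=None):
    """Drop linear RoPE overrides when the model ships YaRN parameters."""
    settings = {"max_seq_len": max_seq_len}

    # Hypothetical YaRN check: many HF configs mark it as rope_scaling["type"] == "yarn".
    rope_scaling = model_config.get("rope_scaling") or {}
    if rope_scaling.get("type") == "yarn":
        # YaRN model: exl2 derives scaling from max_seq_len by itself,
        # so rope_scale / rope_alpha are intentionally not applied.
        return settings

    if rope_scale is not None:
        settings["rope_scale"] = rope_scale  # linear scaling (compress_pos_emb)
    if rope_alpha is not None:
        settings["rope_alpha"] = rope_alpha  # NTK-style alpha
    return settings


# Usage: the override is honored only for the non-YaRN model.
print(build_rope_settings({"rope_scaling": {"type": "yarn", "factor": 4.0}},
                          max_seq_len=32768, rope_scale=2.0))
# {'max_seq_len': 32768}
print(build_rope_settings({}, max_seq_len=8192, rope_scale=2.0))
# {'max_seq_len': 8192, 'rope_scale': 2.0}
```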