Config + Docs: Clarify YaRN rope scaling changes
In ExLlamaV2, if a model has YaRN support, linear RoPE options are not applied. Users can set max_seq_len and exl2 will take care of the rest.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
parent a20abe2d33
commit 698d8339cb
2 changed files with 5 additions and 2 deletions
@@ -67,8 +67,8 @@ Note: Most of the options here will only apply on initial model load/startup (ep
 | gpu_split_auto | Bool (True) | Automatically split the model across multiple GPUs. Manual GPU split isn't used if this is enabled. |
 | autosplit_reserve | List[Int] ([96]) | Amount of empty VRAM to reserve when loading with autosplit.<br><br>Represented as an array of MB per GPU used. |
 | gpu_split | List[Float] ([]) | Float array of GBs to split a model between GPUs. |
-| rope_scale | Float (1.0) | Adjustment for rope scale (or compress_pos_emb) |
-| rope_alpha | Float (None) | Adjustment for rope alpha. Leave blank to automatically calculate based on the max_seq_len. |
+| rope_scale | Float (1.0) | Adjustment for rope scale (or compress_pos_emb)<br><br>Note: If the model has YaRN support, this option will not apply. |
+| rope_alpha | Float (None) | Adjustment for rope alpha. Leave blank to automatically calculate based on the max_seq_len.<br><br>Note: If the model has YaRN support, this option will not apply. |
 | cache_mode | String ("FP16") | Cache mode for the model.<br><br>Options: FP16, Q8, Q6, Q4 |
 | cache_size | Int (max_seq_len) | Size of the K/V cache<br><br>Note: If using CFG, the cache size should be 2 * max_seq_len. |
 | chunk_size | Int (2048) | Amount of tokens per chunk with ingestion. A lower value reduces VRAM usage at the cost of ingestion speed. |
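For reference, here is a hypothetical set of load options using the names documented in the table above. The dict structure and example values are illustrative only, not a confirmed API shape:

```python
# Hypothetical load options; key names come from the documented table,
# the surrounding structure and values are illustrative only.
load_options = {
    "max_seq_len": 32768,       # context length; the only length knob a YaRN model needs
    "rope_scale": 1.0,          # linear rope scale (compress_pos_emb); not applied if the model has YaRN support
    "rope_alpha": None,         # None -> auto-calculated from max_seq_len; also not applied with YaRN
    "cache_mode": "FP16",       # one of "FP16", "Q8", "Q6", "Q4"
    "cache_size": 32768,        # defaults to max_seq_len; use 2 * max_seq_len when running CFG
    "chunk_size": 2048,         # lower -> less VRAM during ingestion, slower prompt processing
    "gpu_split_auto": True,     # let the loader split the model across GPUs automatically
    "autosplit_reserve": [96],  # MB of VRAM to keep free per GPU when autosplitting
    "gpu_split": [],            # manual per-GPU split in GB; unused when autosplit is on
}
```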
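And a minimal sketch of the behavior the commit message describes, assuming YaRN support is detected via a rope_scaling entry in the model's config. The helper name and the detection heuristic are hypothetical, not ExLlamaV2's actual code:

```python
# Illustrative only -- not ExLlamaV2's real implementation.
def build_rope_settings(model_config, max_seq_len, rope_scale=None, rope_alpha=None):
    """Drop linear RoPE overrides when the model ships YaRN parameters."""
    settings = {"max_seq_len": max_seq_len}

    # Hypothetical YaRN check: many HF configs mark it as rope_scaling["type"] == "yarn".
    rope_scaling = model_config.get("rope_scaling") or {}
    if rope_scaling.get("type") == "yarn":
        # YaRN model: exl2 derives scaling from max_seq_len by itself,
        # so rope_scale / rope_alpha are intentionally not applied.
        return settings

    if rope_scale is not None:
        settings["rope_scale"] = rope_scale  # linear scaling (compress_pos_emb)
    if rope_alpha is not None:
        settings["rope_alpha"] = rope_alpha  # NTK-style alpha
    return settings


# Usage: the override is honored only for the non-YaRN model.
print(build_rope_settings({"rope_scaling": {"type": "yarn", "factor": 4.0}},
                          max_seq_len=32768, rope_scale=2.0))
# {'max_seq_len': 32768}
print(build_rope_settings({}, max_seq_len=8192, rope_scale=2.0))
# {'max_seq_len': 8192, 'rope_scale': 2.0}
```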