diff --git a/config_sample.yml b/config_sample.yml
index 745433b..a13e64e 100644
--- a/config_sample.yml
+++ b/config_sample.yml
@@ -95,6 +95,9 @@ model:
   # Used with tensor parallelism.
   gpu_split: []
 
+  # NOTE: If a model has YaRN rope scaling, it will automatically be enabled by ExLlama.
+  # rope_scale and rope_alpha settings won't apply in this case.
+
   # Rope scale (default: 1.0).
   # Same as compress_pos_emb.
   # Use if the model was trained on long context with rope.
diff --git a/docs/02.-Server-options.md b/docs/02.-Server-options.md
index 6a4f1d9..b319f76 100644
--- a/docs/02.-Server-options.md
+++ b/docs/02.-Server-options.md
@@ -67,8 +67,8 @@ Note: Most of the options here will only apply on initial model load/startup (ep
 | gpu_split_auto | Bool (True) | Automatically split the model across multiple GPUs. Manual GPU split isn't used if this is enabled. |
 | autosplit_reserve | List[Int] ([96]) | Amount of empty VRAM to reserve when loading with autosplit.<br><br>Represented as an array of MB per GPU used. |
 | gpu_split | List[Float] ([]) | Float array of GBs to split a model between GPUs. |
-| rope_scale | Float (1.0) | Adjustment for rope scale (or compress_pos_emb) |
-| rope_alpha | Float (None) | Adjustment for rope alpha. Leave blank to automatically calculate based on the max_seq_len. |
+| rope_scale | Float (1.0) | Adjustment for rope scale (or compress_pos_emb)<br><br>Note: If the model has YaRN support, this option will not apply. |
+| rope_alpha | Float (None) | Adjustment for rope alpha. Leave blank to automatically calculate based on the max_seq_len.<br><br>Note: If the model has YaRN support, this option will not apply. |
 | cache_mode | String ("FP16") | Cache mode for the model.<br><br>Options: FP16, Q8, Q6, Q4 |
 | cache_size | Int (max_seq_len) | Size of the K/V cache<br><br>Note: If using CFG, the cache size should be 2 * max_seq_len. |
 | chunk_size | Int (2048) | Amount of tokens per chunk with ingestion. A lower value reduces VRAM usage at the cost of ingestion speed. |
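For reference, a minimal sketch of what the affected settings might look like in `config_sample.yml` after this change. The `rope_scale` and `rope_alpha` keys and their placement under the `model:` block are taken from the hunk above; the `cache_mode`/`cache_size` keys come from the docs table, and their placement under `model:` plus the concrete values are illustrative assumptions, not part of this diff:

```yaml
model:
  # Manual rope adjustments. Per the new NOTE, these are ignored when the
  # model ships with YaRN rope scaling, which ExLlama enables automatically.
  rope_scale: 1.0   # same as compress_pos_emb
  rope_alpha:       # leave blank to auto-calculate from max_seq_len

  # K/V cache settings from the docs table (placement assumed here).
  cache_mode: FP16  # options: FP16, Q8, Q6, Q4
  cache_size: 4096  # illustrative: 2 * max_seq_len for a 2048-token context with CFG
```

The takeaway from this diff is that users with YaRN-scaled models no longer need to touch `rope_scale` or `rope_alpha` at all; the loader picks up the model's own scaling, and any manual values are simply not applied.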