Model: Add Tensor Parallel support

Use the tensor parallel loader when the flag is enabled. The new loader
has its own autosplit implementation, so gpu_split_auto isn't valid
here.
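
Roughly, the loader dispatch could look like the following minimal sketch. The `config` object and its field names are illustrative, not TabbyAPI's actual internals; `load`, `load_autosplit`, and `load_tp` mirror exllamav2's loader methods:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache

def load_model(model: ExLlamaV2, cache: ExLlamaV2Cache, config) -> None:
    """Dispatch to the right exllamav2 loader (illustrative sketch)."""
    if config.tensor_parallel:
        # TP loader: runs its own autosplit when gpu_split is empty,
        # so gpu_split_auto / autosplit_reserve are never consulted.
        model.load_tp(gpu_split=config.gpu_split or None)
    elif config.gpu_split_auto:
        # Regular autosplit, keeping some VRAM in reserve per GPU.
        model.load_autosplit(cache, reserve_vram=config.autosplit_reserve)
    else:
        # Manual split: GBs of VRAM per device.
        model.load(gpu_split=config.gpu_split)
```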

Also simplify determining which cache type to use, replacing the
previous chain of if/else statements.
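
One way to flatten that logic is a lookup table keyed by the configured cache mode; this sketch assumes exllamav2's quantized cache classes and an illustrative `cache_mode` string, and is not necessarily the commit's exact approach:

```python
from exllamav2 import (
    ExLlamaV2Cache,
    ExLlamaV2Cache_Q4,
    ExLlamaV2Cache_Q6,
    ExLlamaV2Cache_Q8,
)

# A flat mapping replaces the old if/else chain; unknown modes fall
# back to the full-precision FP16 cache.
CACHE_CLASSES = {
    "FP16": ExLlamaV2Cache,
    "Q4": ExLlamaV2Cache_Q4,
    "Q6": ExLlamaV2Cache_Q6,
    "Q8": ExLlamaV2Cache_Q8,
}

def resolve_cache_class(cache_mode: str) -> type:
    return CACHE_CLASSES.get(cache_mode, ExLlamaV2Cache)
```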

Signed-off-by: kingbri <bdashore3@proton.me>
Author: kingbri <bdashore3@proton.me>
Date:   2024-08-16 16:35:19 -04:00
Committer: Brian Dashore
Parent: 5002617eac
Commit: 871c89063d

4 changed files with 109 additions and 53 deletions

@@ -109,6 +109,12 @@ model:
# Only use this if the model's base sequence length in config.json is incorrect (ex. Mistral 7B)
#override_base_seq_len:
# Load model with tensor parallelism
# If a GPU split isn't provided, the TP loader will fall back to autosplit
# Enabling this option ignores the gpu_split_auto and autosplit_reserve values
# NOTE: Requires a development build of exllamav2
#tensor_parallel: False
# Automatically allocate resources to GPUs (default: True)
# NOTE: Not parsed for single GPU users
#gpu_split_auto: True
@@ -118,6 +124,7 @@ model:
#autosplit_reserve: [96]
# An integer array of GBs of vram to split between GPUs (default: [])
# Used with tensor parallelism
# NOTE: Not parsed for single GPU users
#gpu_split: [20.6, 24]
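
Taken together, a multi-GPU setup that opts into tensor parallelism with a manual split would set something like the following (values are illustrative):

```yaml
model:
  # TP loader; gpu_split_auto and autosplit_reserve are ignored
  tensor_parallel: True
  gpu_split: [20.6, 24]
```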