Model: Add Tensor Parallel support
Use the tensor parallel loader when the flag is enabled. The new loader has its own autosplit implementation, so gpu_split_auto isn't valid here. Also make it easier to determine which cache type to use, rather than relying on multiple if/else statements.

Signed-off-by: kingbri <bdashore3@proton.me>
parent 5002617eac
commit 871c89063d
4 changed files with 109 additions and 53 deletions
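To illustrate the cache-type change described in the commit message, here is a minimal sketch of picking the cache class from a lookup table instead of chained if/else statements. The cache_mode values ("FP16", "Q4", "Q6", "Q8") and the resolve_cache_class() helper are assumptions for illustration, not the actual TabbyAPI code; the exllamav2 cache class names are taken from the library but should be checked against the installed build.

    # Illustrative sketch only, not the actual TabbyAPI implementation.
    # Verify that these cache classes exist in your exllamav2 build.
    from exllamav2 import (
        ExLlamaV2Cache,
        ExLlamaV2Cache_Q4,
        ExLlamaV2Cache_Q6,
        ExLlamaV2Cache_Q8,
    )

    # A single lookup table replaces a chain of if/else branches when mapping
    # the configured cache_mode string to a cache class.
    CACHE_CLASSES = {
        "FP16": ExLlamaV2Cache,
        "Q4": ExLlamaV2Cache_Q4,
        "Q6": ExLlamaV2Cache_Q6,
        "Q8": ExLlamaV2Cache_Q8,
    }

    def resolve_cache_class(cache_mode: str):
        """Return the cache class for cache_mode, defaulting to FP16."""
        return CACHE_CLASSES.get(cache_mode, ExLlamaV2Cache)

For example, resolve_cache_class("Q4") returns ExLlamaV2Cache_Q4, and an unknown value falls back to the FP16 cache.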
@@ -109,6 +109,12 @@ model:
   # Only use this if the model's base sequence length in config.json is incorrect (ex. Mistral 7B)
   #override_base_seq_len:
 
+  # Load model with tensor parallelism
+  # If a GPU split isn't provided, the TP loader will fallback to autosplit
+  # Enabling ignores the gpu_split_auto and autosplit_reserve values
+  # NOTE: Requires a development build of exllamav2
+  #tensor_parallel: False
+
   # Automatically allocate resources to GPUs (default: True)
   # NOTE: Not parsed for single GPU users
   #gpu_split_auto: True
@@ -118,6 +124,7 @@ model:
   #autosplit_reserve: [96]
 
   # An integer array of GBs of vram to split between GPUs (default: [])
+  # Used with tensor parallelism
   # NOTE: Not parsed for single GPU users
   #gpu_split: [20.6, 24]
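For context, the sketch below shows how the tensor_parallel and gpu_split options from the diff above might drive model loading. It assumes a development build of exllamav2 that exposes ExLlamaV2.load_tp() and ExLlamaV2Cache_TP; those names, and the load_model() helper itself, are illustrative rather than the actual TabbyAPI loader.

    # Hypothetical sketch, not the actual TabbyAPI loader. Assumes a dev
    # build of exllamav2 that provides load_tp() and ExLlamaV2Cache_TP.
    from typing import Optional

    from exllamav2 import (
        ExLlamaV2,
        ExLlamaV2Cache,
        ExLlamaV2Cache_TP,
        ExLlamaV2Config,
    )

    def load_model(
        model_dir: str,
        tensor_parallel: bool = False,
        gpu_split: Optional[list] = None,
    ):
        config = ExLlamaV2Config(model_dir)
        model = ExLlamaV2(config)

        if tensor_parallel:
            # The TP loader has its own autosplit, so gpu_split_auto and
            # autosplit_reserve are ignored; an explicit gpu_split is optional.
            model.load_tp(gpu_split)
            cache = ExLlamaV2Cache_TP(model)
        elif gpu_split:
            # Manual split: GBs of VRAM per GPU, e.g. [20.6, 24].
            model.load(gpu_split)
            cache = ExLlamaV2Cache(model)
        else:
            # Default autosplit path: allocate layers while building the cache.
            cache = ExLlamaV2Cache(model, lazy=True)
            model.load_autosplit(cache)

        return model, cache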