Model: Adjust draft_gpu_split and add to config

The previous code overrode the existing gpu split and device idx values. This now sets an independent draft_gpu_split value and adjusts the gpu_devices check only if the draft_gpu_split array is larger than the gpu_split array. Draft gpu split is not Tensor Parallel, and defaults to gpu_split_auto if a split is not provided.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-02-08 16:09:46 -05:00
commit beb6d8faa5
parent bd8256d168
3 changed files with 22 additions and 7 deletions
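
The behavior described in the commit message can be sketched roughly as follows. This is an illustrative reading of the message, not the code from this commit: the helper name `plan_draft_split` and its return shape are hypothetical, and only `gpu_split` and `draft_gpu_split` are names taken from the commit.

```python
from typing import List, Tuple


def plan_draft_split(
    gpu_split: List[float], draft_gpu_split: List[float]
) -> Tuple[bool, int]:
    """Hypothetical helper: return (draft_autosplit, devices_to_check)."""
    # An empty draft split means the draft model falls back to autosplit.
    draft_autosplit = len(draft_gpu_split) == 0

    # Start from the main model's split; the draft split never overrides it.
    devices_to_check = len(gpu_split)

    # Widen the device check only when the draft split spans more GPUs.
    if len(draft_gpu_split) > devices_to_check:
        devices_to_check = len(draft_gpu_split)

    return draft_autosplit, devices_to_check
```

Under these assumptions, `plan_draft_split([20, 24], [])` leaves the draft model on autosplit and checks two devices, while `plan_draft_split([20], [4, 4])` uses the manual draft split and widens the check to two devices.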
--- a/config_sample.yml
+++ b/config_sample.yml
@@ -20,7 +20,7 @@ network:
  # Turn on this option if you are ONLY connecting from localhost.
  disable_auth: false

-  # Disable fetching external content in response to requests, such as images from URLs.
+  # Disable fetching external content in response to requests,such as images from URLs.
  disable_fetch_requests: false

  # Send tracebacks over the API (default: False).
@@ -166,6 +166,10 @@ draft_model:
  # Possible values: 'FP16', 'Q8', 'Q6', 'Q4'.
  draft_cache_mode: FP16

+  # An integer array of GBs of VRAM to split between GPUs (default: []).
+  # If this isn't filled in, the draft model is autosplit.
+  draft_gpu_split: []
+
 # Options for Loras
 lora:
  # Directory to look for LoRAs (default: loras).