Model: Add Tensor Parallel support

Use the tensor parallel loader when the flag is enabled. The new loader has its own autosplit implementation, so gpu_split_auto isn't valid here.

Also simplify cache type selection by replacing the multiple if/else statements.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-16 16:35:19 -04:00
commit 871c89063d
parent 5002617eac
4 changed files with 109 additions and 53 deletions
--- a/endpoints/core/types/model.py
+++ b/endpoints/core/types/model.py
@@ -96,6 +96,9 @@ class ModelLoadRequest(BaseModel):
        default_factory=lambda: get_config_default("cache_size"),
        examples=[4096],
    )
+    tensor_parallel: Optional[bool] = Field(
+        default_factory=lambda: get_config_default("tensor_parallel", False)
+    )
    gpu_split_auto: Optional[bool] = Field(
        default_factory=lambda: get_config_default("gpu_split_auto", True)
    )
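
A minimal sketch of how the new flag might interact with gpu_split_auto, based only on the commit message. It uses a stdlib dataclass instead of the real ModelLoadRequest, and `_CONFIG`, `LoadOptions`, `choose_loader`, and the stand-in `get_config_default` are hypothetical names for illustration, not the project's actual loader code.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical stand-in for the server config; the real project reads
# these values from its own configuration.
_CONFIG = {"gpu_split_auto": True}


def get_config_default(key, fallback=None):
    """Stand-in lookup (assumption): return the configured value or the fallback."""
    return _CONFIG.get(key, fallback)


@dataclass
class LoadOptions:
    # Mirrors the two fields in the hunk above; not the real ModelLoadRequest.
    tensor_parallel: Optional[bool] = field(
        default_factory=lambda: get_config_default("tensor_parallel", False)
    )
    gpu_split_auto: Optional[bool] = field(
        default_factory=lambda: get_config_default("gpu_split_auto", True)
    )


def choose_loader(opts: LoadOptions) -> dict:
    """Pick a loader and report whether gpu_split_auto is honored (illustrative only)."""
    if opts.tensor_parallel:
        # Per the commit message, the tensor parallel loader has its own
        # autosplit, so gpu_split_auto is treated as not applicable here.
        return {"loader": "tensor_parallel", "gpu_split_auto": False}
    return {"loader": "default", "gpu_split_auto": bool(opts.gpu_split_auto)}
```

With these assumptions, `choose_loader(LoadOptions())` selects the default loader and honors gpu_split_auto, while `choose_loader(LoadOptions(tensor_parallel=True))` selects the tensor parallel path and ignores it.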