Use the tensor parallel loader when the flag is enabled. The new loader has its own autosplit implementation, so `gpu_split_auto` isn't valid here. Also, replace the chained if/else statements for picking a cache type with a simpler lookup.

Signed-off-by: kingbri <bdashore3@proton.me>
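The cache-type change described above can be sketched as a dict lookup replacing an if/elif chain. This is a minimal illustration only; the class and mode names below are placeholders, not the project's actual API.

```python
# Placeholder cache classes, standing in for the real cache implementations.
class CacheFP16: ...
class CacheQ4: ...
class CacheQ8: ...

# Hypothetical mode-to-class mapping; one lookup instead of several branches.
CACHE_CLASSES = {
    "FP16": CacheFP16,
    "Q4": CacheQ4,
    "Q8": CacheQ8,
}

def select_cache(mode: str):
    # Fall back to the FP16 cache when the mode string is unrecognized.
    return CACHE_CLASSES.get(mode.upper(), CacheFP16)
```

Adding a new cache type then only requires a new dictionary entry rather than another if/else branch.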
Changed paths:
- core
- Kobold
- OAI
- server.py
- utils.py