Revision to paged attention checks (#133)

* Model: Clean up paged attention checks

* Model: Move cache_size checks after paged attn checks
Cache size is only relevant in paged mode (see the ordering sketch below).

* Model: Fix no_flash_attention

* Model: Remove no_flash_attention
The ability to use flash attention is auto-detected, so this flag is unneeded. Uninstall flash attention to disable it on supported hardware (see the detection sketch after the diff below).
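
For the cache_size reordering described above, here is a minimal sketch of the intended check order: establish whether paged attention is usable first, and only then validate cache_size, since it has no effect outside paged mode. The function name, parameters, and the specific constraint shown are illustrative assumptions, not the project's actual API.

from typing import Optional


def resolve_cache_size(paged_attention_available: bool,
                       cache_size: Optional[int],
                       max_seq_len: int) -> Optional[int]:
    """Return the effective cache size, or None when it does not apply."""
    # Paged attention checks come first.
    if not paged_attention_available:
        if cache_size is not None:
            # cache_size has no effect outside paged mode, so it is ignored.
            print("Warning: cache_size ignored (paged attention unavailable)")
        return None

    # Only in paged mode is cache_size validated and applied.
    effective = cache_size if cache_size is not None else max_seq_len
    if effective < max_seq_len:
        raise ValueError("cache_size must be at least max_seq_len")
    return effective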
Author: DocShotgun, 2024-06-09 08:28:11 -07:00 (committed by GitHub)
Parent: 55d979b7a5
Commit: 156b74f3f0
3 changed files with 99 additions and 94 deletions

@@ -94,7 +94,6 @@ class ModelLoadRequest(BaseModel):
         default=None,
         examples=[1.0],
     )
-    no_flash_attention: Optional[bool] = False
     # low_mem: Optional[bool] = False
     cache_mode: Optional[str] = "FP16"
     chunk_size: Optional[int] = 2048
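
As context for the no_flash_attention field removed in the diff above, a rough sketch of import-based auto-detection, assuming the helper name and the compute-capability threshold as illustrative choices rather than the repository's actual detection code.

import importlib.util

import torch


def flash_attention_usable() -> bool:
    """True when flash_attn is installed and the GPU likely supports it."""
    # Uninstalling the flash_attn package is enough to disable this path.
    if importlib.util.find_spec("flash_attn") is None:
        return False
    if not torch.cuda.is_available():
        return False
    # FlashAttention 2 targets Ampere (compute capability 8.0) and newer.
    major, _ = torch.cuda.get_device_capability()
    return major >= 8

With detection along these lines, a per-request no_flash_attention flag is redundant, which is the rationale given in the commit message.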