Use the same algorithm for estimating and adjusting the cache size:
round it up to a multiple of 256 that is at least the max seq len.
The same applies to chunk size.
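A minimal sketch of that shared adjustment, assuming it clamps to at
least max_seq_len and then rounds up to a multiple of 256 (the helper
name is illustrative, not the actual function):

    import math

    def adjust_size(value: int, max_seq_len: int) -> int:
        """Clamp to at least max_seq_len, then round up to a multiple of 256."""
        value = max(value, max_seq_len)
        return math.ceil(value / 256) * 256

    cache_size = adjust_size(9000, max_seq_len=8192)  # -> 9216
    chunk_size = adjust_size(2048, max_seq_len=8192)  # -> 8192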
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Torch - 2.6.0
ExllamaV2 - 0.2.8
Flash-attn - 2.7.4.post1
CUDA wheels are now 12.4 instead of 12.1, so feature names need to be
migrated over.
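For example, assuming the wheel features are exposed as pyproject
extras named after the CUDA version, an install that previously used
pip install .[cu121] would now use pip install .[cu124].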
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
platform.system() was not called in some places, which broke the
ternary check on Windows.
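A minimal before/after sketch (the script names are illustrative):

    import platform

    # Broken: compares the function object itself, which never equals
    # "Windows", so the ternary always took the non-Windows branch
    script = "start.bat" if platform.system == "Windows" else "start.sh"

    # Fixed: call platform.system() so the ternary sees the actual OS name
    script = "start.bat" if platform.system() == "Windows" else "start.sh"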
Pip's --upgrade flag does not actually update already-installed
dependencies to their latest versions. That's what the
--upgrade-strategy eager flag is for.
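A sketch of what the install step might look like with the eager
strategy (the feature name is hypothetical):

    import subprocess
    import sys

    # "only-if-needed" (the default) leaves satisfied dependencies alone;
    # "eager" upgrades them to the latest versions the requirements allow.
    subprocess.run(
        [
            sys.executable, "-m", "pip", "install",
            "--upgrade", "--upgrade-strategy", "eager",
            ".[cu124]",  # hypothetical feature name
        ],
        check=True,
    )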
Tell the user where their start preferences are coming from.
Signed-off-by: kingbri <bdashore3@proton.me>
If flash attention is already turned off by ExllamaV2 itself, don't
try to create a paged generator. Also condense all the redundant
logic into one if statement.
Also check arch_compat_overrides to see if flash attention should
be disabled for a model arch (e.g. Gemma 2).
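A sketch of the condensed gate; the attribute and key names here are
assumptions, not the actual fields:

    def can_use_paged_attn(flash_attn_available: bool, config, overrides: dict) -> bool:
        """One condensed check instead of several scattered ones."""
        return (
            flash_attn_available
            and not getattr(config, "no_flash_attn", False)  # ExllamaV2 turned it off itself
            and not overrides.get("no_flash_attn", False)    # arch override, e.g. Gemma 2
        )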
Signed-off-by: kingbri <bdashore3@proton.me>
Move the large import errors into the check functions themselves.
This makes it easier to tell where errors are coming from.
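A sketch of the pattern, with a hypothetical check function:

    def check_exllama() -> None:
        """Import inside the check so a failed import is reported here,
        not as a confusing module-level traceback elsewhere."""
        try:
            import exllamav2  # noqa: F401
        except ImportError as exc:
            raise ImportError(
                "ExllamaV2 failed to import. Reinstall backend dependencies."
            ) from exc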
Signed-off-by: kingbri <bdashore3@proton.me>
* Model: Clean up paged attention checks
* Model: Move cache_size checks after paged attn checks
Cache size is only relevant in paged mode
* Model: Fix no_flash_attention
* Model: Remove no_flash_attention
The ability to use flash attention is auto-detected, so this flag is
unneeded. Uninstall flash attention to disable it on supported hardware.
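A sketch of what the auto-detection could look like; the capability
threshold is an assumption based on flash-attn's Ampere requirement:

    import torch

    def flash_attn_supported() -> bool:
        """Flash attention needs the package installed and an SM 8.0+ GPU."""
        try:
            import flash_attn  # noqa: F401
        except ImportError:
            return False
        if not torch.cuda.is_available():
            return False
        major, _minor = torch.cuda.get_device_capability()
        return major >= 8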
Loguru is a flexible logger that hooks into Rich easily and imports
with no problems. It also makes progress bars stick to the bottom of
the terminal window.
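A minimal sketch of the hookup, not TabbyAPI's exact setup: loguru
accepts a callable sink, so log output can be routed through a shared
Rich Console, which is also what lets Rich keep progress bars pinned
below the log stream.

    from loguru import logger
    from rich.console import Console

    console = Console()

    logger.remove()  # drop the default stderr handler
    # Loguru messages already end with a newline, so print with end=""
    logger.add(lambda msg: console.print(msg, end=""), colorize=False)

    logger.info("Log lines now render through Rich")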
Signed-off-by: kingbri <bdashore3@proton.me>
Add the ability to use an unsafe config flag if needed and migrate
the exl2 check to a different file within the exl2 backend code.
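A sketch of what the escape hatch might look like; the flag name,
known keys, and validation behavior are all assumptions:

    from loguru import logger

    def validate_config(config: dict, unsafe: bool = False) -> dict:
        """Reject unknown keys unless the unsafe flag is set."""
        known_keys = {"model", "network", "logging"}
        unknown = set(config) - known_keys
        if unknown:
            if not unsafe:
                raise ValueError(f"Unknown config options: {unknown}")
            logger.warning(f"Unsafe config: ignoring unknown options {unknown}")
        return config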
Signed-off-by: kingbri <bdashore3@proton.me>