Model: Change FA2 and paged attention checks

The dynamic generator requires Flash Attention 2.5.7 or higher to
be installed. This is only supported on Nvidia's 30 series (Ampere)
and newer GPUs.

If a card is AMD or older than the 30 series, switch to compatibility
mode, which functions the same way as the older generator, but without
parallel batching and any features that depend on it, such as CFG.

Signed-off-by: kingbri <bdashore3@proton.me>
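
Below is a minimal sketch of the check described above, assuming torch and
packaging are available; the function names and returned mode strings are
hypothetical and not the actual tabbyAPI implementation.

# Sketch only: hypothetical names, not tabbyAPI's real loader code.
from importlib.metadata import PackageNotFoundError, version

import torch
from packaging.version import parse


def supports_paged_attn() -> bool:
    """Paged attention needs flash-attn >= 2.5.7 on an Ampere+ Nvidia GPU."""

    # AMD (ROCm) builds never qualify
    if torch.version.hip is not None:
        return False

    # Pre-Ampere cards (compute capability < 8.0) never qualify
    major, _minor = torch.cuda.get_device_capability()
    if major < 8:
        return False

    # Finally, require flash-attn 2.5.7 or newer ("flash-attn" is the PyPI name)
    try:
        return parse(version("flash-attn")) >= parse("2.5.7")
    except PackageNotFoundError:
        return False


def pick_generator() -> str:
    if supports_paged_attn():
        return "dynamic"  # parallel batching, CFG, etc. available
    # Compatibility mode: behaves like the older generator, minus parallel
    # batching and anything that depends on it (such as CFG)
    return "compat"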
kingbri 2024-05-24 22:33:47 -04:00 committed by Brian Dashore
parent c2d3675408
commit 408c66a1f2
3 changed files with 31 additions and 35 deletions

@@ -100,9 +100,6 @@ model:
# Leave blank to automatically calculate alpha
#rope_alpha: 1.0
# Disable Flash-attention 2. Set to True for GPUs lower than Nvidia's 3000 series. (default: False)
#no_flash_attention: False
# Enable different cache modes for VRAM savings (slight performance hit).
# Possible values FP16, FP8, Q4. (default: FP16)
#cache_mode: FP16
@@ -111,6 +108,12 @@ model:
# NOTE: Effects vary depending on the model. An ideal value is between 512 and 4096
#chunk_size: 2048
# Set the maximum amount of prompts to process at one time (batch)
# This will be automatically adjusted depending on the cache size.
# A max batch size of 1 processes prompts one at a time.
# NOTE: Only available for Nvidia ampere (30 series) and above GPUs
#max_batch_size: 20
# Set the prompt template for this model. If empty, attempts to look for the model's chat template. (default: None)
# If a model contains multiple templates in its tokenizer_config.json, set prompt_template to the name
# of the template you want to use.
@@ -122,10 +125,6 @@ model:
# NOTE: For MoE models (ex. Mixtral) only!
#num_experts_per_token:
# Enables CFG support (default: False)
# WARNING: This flag disables Flash Attention! (a stopgap fix until it's fixed in upstream)
#use_cfg: False
# Enables fasttensors to possibly increase model loading speeds (default: False)
#fasttensors: true
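
For the new max_batch_size option, the note that it "will be automatically
adjusted depending on the cache size" could work roughly like the sketch
below; the formula and names are illustrative assumptions, not the loader's
actual logic.

# Sketch only: hypothetical clamp of the configured batch size to the cache.
def effective_max_batch_size(configured: int, cache_tokens: int, max_seq_len: int) -> int:
    # The cache holds at most cache_tokens // max_seq_len full-length prompts,
    # so a larger configured batch gets scaled down; 1 means one prompt at a time.
    return max(1, min(configured, cache_tokens // max_seq_len))


# Example: a 65536-token cache with max_seq_len 8192 caps a configured 20 at 8
print(effective_max_batch_size(20, 65536, 8192))  # -> 8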