TabbyAPI
Note
Need help? Join the Discord Server and get the Tabby role. Please be nice when asking questions.
A FastAPI-based application that allows for generating text with an LLM (large language model) using the ExLlamaV2 backend (https://github.com/turboderp/exllamav2).
Disclaimer
This API is still in the alpha phase. There may be bugs and changes down the line. Please be aware that you may need to reinstall dependencies as the project is updated.
Help Wanted
Please check the issues page for issues that contributors can help on. We appreciate all contributions. Please read the contributions section for more details about issues and pull requests.
If you want to add samplers, add them in the exllamav2 library and then link them to tabbyAPI.
Prerequisites
To get started, make sure you have the following installed on your system:
- Python 3.x (preferably 3.11) with pip
- CUDA 12.x (you can also use CUDA 11.8 or ROCm 5.6, but more work will be required to install dependencies such as Flash Attention 2)

NOTE: For Flash Attention 2 to work on Windows, CUDA 12.x must be installed!
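To sanity-check your environment before installing, a minimal Python sketch (torch is only available after the requirements step, so that part is wrapped in a try):

```python
import sys

# Python should be 3.x, preferably 3.11
print(f"Python: {sys.version.split()[0]}")

# Once the requirements are installed, torch can report the CUDA build it uses:
try:
    import torch
    print(f"Torch CUDA: {torch.version.cuda}")
except ImportError:
    print("torch not installed yet")
```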
Installing
- Clone this repository to your machine: git clone https://github.com/theroyallab/tabbyAPI
- Navigate to the project directory: cd tabbyAPI
- Create a python environment:
  - Through venv (recommended):
    - python -m venv venv
    - On Windows (using PowerShell or Windows Terminal): .\venv\Scripts\activate
    - On Linux: source venv/bin/activate
  - Through conda:
    - conda create -n tabbyAPI python=3.11
    - conda activate tabbyAPI
- Install the requirements file based on your system:
  - CUDA 12.x: pip install -r requirements.txt
  - CUDA 11.8: pip install -r requirements-cu118.txt
  - ROCm 5.6: pip install -r requirements-amd.txt
Configuration
A config.yml file is required for overriding project defaults. If you are okay with the defaults, you don't need a config file!
If you do want a config file, copy over config_sample.yml to config.yml. All the fields are commented, so make sure to read the descriptions and comment out or remove fields that you don't need.
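For example, from the project root (a trivial sketch using Python's standard library):

```python
import shutil

# Copy the commented sample config into place, then edit config.yml to taste
shutil.copy("config_sample.yml", "config.yml")
```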
Launching the Application
- Make sure you are in the project directory and have activated your venv or conda environment
- Run the tabbyAPI application: python main.py
Updating
To update tabbyAPI, just run pip install --upgrade -r requirements.txt using the requirements file for your configuration (e.g. requirements-cu118.txt for CUDA 11.8 or requirements-amd.txt for ROCm 5.6).
Updating Exllamav2
Warning
These instructions are meant for advanced users.
If the version of exllamav2 doesn't meet your specifications, you can install the dependency from various sources.
NOTE:
- TabbyAPI will print a warning if a sampler isn't found because the installed exllamav2 version is too low.
- Any upgrade using a requirements file will overwrite your installed wheel. To avoid this, change requirements.txt locally, create an issue or PR, or reinstall your version of exllamav2 after upgrading.
Here are ways to install exllamav2:
- From a wheel/release (recommended): Find the version that corresponds to your CUDA and Python version. For example, a wheel with cu121 and cp311 corresponds to CUDA 12.1 and Python 3.11.
- From pip: pip install exllamav2. This is a JIT compiled extension, which means that the initial launch of tabbyAPI will take some time. The build may also not work due to improper environment configuration.
- From source
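To help pick the right wheel, here is a small sketch that prints the cuXXX/cpXXX tags matching your environment; it assumes torch is already installed:

```python
import sys

import torch

# Python tag, e.g. cp311 for Python 3.11
py_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"

# CUDA tag, e.g. cu121 for CUDA 12.1 (torch.version.cuda is None on CPU-only builds)
cuda = torch.version.cuda
cuda_tag = f"cu{cuda.replace('.', '')}" if cuda else "n/a (CPU-only torch)"

print(f"Look for a wheel tagged {cuda_tag} and {py_tag}")
```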
API Documentation
Docs can be accessed once you launch the API at http://<your-IP>:<your-port>/docs
If you use the default YAML config, it's accessible at http://localhost:5000/docs
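Since TabbyAPI is built on FastAPI, the raw OpenAPI schema is also served at /openapi.json by default. A quick sketch to list the available routes, assuming the default host and port:

```python
import requests

# FastAPI serves the raw OpenAPI schema at /openapi.json by default
schema = requests.get("http://localhost:5000/openapi.json").json()
for path in sorted(schema["paths"]):
    print(path)
```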
Authentication
TabbyAPI uses an API key and admin key to authenticate a user's request. On first launch of the API, a file called api_tokens.yml will be generated with fields for the admin and API keys.
If you feel that the keys have been compromised, delete api_tokens.yml and the API will generate new keys for you.
API keys and admin keys can be provided via the following request headers:
- x-api-key and x-admin-key respectively
- Authorization with the Bearer prefix
DO NOT share your admin key unless you want someone else to load/unload a model from your system!
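As an illustration, a minimal sketch of attaching a key with Python's requests library; the key value comes from api_tokens.yml, and the /v1/model route below is only a placeholder (see /docs for the actual route list):

```python
import requests

API_KEY = "..."  # copied from your generated api_tokens.yml

# Option 1: the dedicated header (use x-admin-key for admin routes)
headers = {"x-api-key": API_KEY}

# Option 2: the Authorization header with the Bearer prefix
headers = {"Authorization": f"Bearer {API_KEY}"}

# Either form works on authenticated routes; the route below is illustrative.
response = requests.get("http://localhost:5000/v1/model", headers=headers)
print(response.status_code, response.text)
```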
Authentication Requirements
All routes require an API key, except for the following, which require an admin key:
- /v1/model/load
- /v1/model/unload
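A hedged sketch of calling an admin route; note that the request payload here is an assumption, so consult /docs for the exact schema:

```python
import requests

ADMIN_KEY = "..."  # copied from your generated api_tokens.yml

# Loading a model requires the admin key rather than the regular API key.
# NOTE: the "name" field below is an assumption; check /docs for the schema.
response = requests.post(
    "http://localhost:5000/v1/model/load",
    headers={"x-admin-key": ADMIN_KEY},
    json={"name": "my-exl2-model"},
)
print(response.status_code, response.text)
```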
Chat Completions
/v1/chat/completions now uses Jinja2 for templating. Please read Huggingface's documentation for more information on how chat templates work.
Also make sure to set the template name in config.yml to the template's filename.
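To see how Jinja2 templating works in principle, here is a toy template rendered with the jinja2 library. This is not TabbyAPI's actual template format, just an illustration of how a message list becomes a prompt string:

```python
from jinja2 import Template

# A toy chat template in the spirit of Huggingface chat templates;
# real templates live in the templates/ directory.
toy_template = Template(
    "{% for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }}\n"
    "{% endfor %}"
    "assistant:"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]

print(toy_template.render(messages=messages))
```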
Common Issues
- AMD cards will error out with flash attention installed, even if the config option is set to False. Run pip uninstall flash_attn to remove the wheel from your system. See #5.
- Exllamav2 may error with the following exception: ImportError: DLL load failed while importing exllamav2_ext: The specified module could not be found.
  - First, make sure the wheel matches your Python version and CUDA version. Also make sure you're in a venv or conda environment.
  - If those prerequisites are correct, the torch cache may need to be cleared. This is due to a mismatched exllamav2_ext:
    - On Windows: find the cache at C:\Users\<User>\AppData\Local\torch_extensions\torch_extensions\Cache, where <User> is your Windows username
    - On Linux: find the cache at ~/.cache/torch_extensions
    - Look for any folder named exllamav2_ext in the python subdirectories and delete it
    - Restart TabbyAPI and launching should work again
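If you'd rather script the cleanup, here is a minimal sketch that removes stale exllamav2_ext folders from the cache paths listed above (double-check the matches before deleting):

```python
import shutil
from pathlib import Path

# Cache locations from the steps above (Windows and Linux)
candidates = [
    Path.home() / "AppData/Local/torch_extensions/torch_extensions/Cache",
    Path.home() / ".cache/torch_extensions",
]

for cache_root in candidates:
    if not cache_root.is_dir():
        continue
    # Delete any exllamav2_ext folder inside the python subdirectories
    for ext_dir in cache_root.glob("**/exllamav2_ext"):
        if ext_dir.is_dir():
            print(f"Removing {ext_dir}")
            shutil.rmtree(ext_dir)
```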
Supported Model Types
TabbyAPI uses Exllamav2 as a powerful and fast backend for model inference, loading, etc. Therefore, the following types of models are supported:
- Exl2 (Highly recommended)
- GPTQ
- FP16 (using Exllamav2's loader)
Alternative Loaders/Backends
If you want to use a different model type than the ones listed above, here are some alternative backends with their own APIs:
- GGUF + GGML: KoboldCPP
- AWQ: Aphrodite Engine
Contributing
If you have issues with the project:
- Describe the issue in detail
- If you have a feature request, please indicate it as such
If you have a pull request:
- Describe the pull request in detail: what you are changing and why
Developers and Permissions
Creators/Developers: