Tree: Update documentation and configs

Signed-off-by: kingbri <bdashore3@proton.me>
kingbri 2023-11-16 02:30:33 -05:00
parent 2248705c4a
commit 03f45cb0a3
5 changed files with 88 additions and 95 deletions

.gitignore (vendored): 4 changes

@@ -178,3 +178,7 @@ pyrightconfig.json
# User configuration
config.yml
api_tokens.yml

# Models folder
models/*
!models/place_your_models_here.txt

README.md: 134 changes

@@ -1,133 +1,97 @@
# TabbyAPI

A FastAPI based application that allows for generating text with an LLM (large language model) using the [exllamav2 backend](https://github.com/turboderp/exllamav2).

## Disclaimer

This API is still in the alpha phase. There may be bugs and changes down the line. Be aware that you might need to reinstall dependencies as the project changes.

## Prerequisites

To get started, make sure you have the following installed on your system:

- Python 3.x (preferably 3.11) with pip
- CUDA 12.1 or 11.8

NOTE: For Flash Attention 2 to work on Windows, CUDA 12.1 **must** be installed!

## Installing

1. Clone this repository to your machine: `git clone https://github.com/theroyallab/tabbyAPI`
2. Navigate to the project directory: `cd tabbyAPI`
3. Create a virtual environment:
   1. `python -m venv venv`
   2. On Windows: `.\venv\Scripts\activate`. On Linux: `source venv/bin/activate`
4. Install torch using the instructions found [here](https://pytorch.org/get-started/locally/) (see the sanity check after this list)
5. Install an exllamav2 wheel from [here](https://github.com/turboderp/exllamav2/releases):
   1. Find the version that corresponds with your CUDA and Python versions. For example, a wheel with `cu121` and `cp311` corresponds to CUDA 12.1 and Python 3.11
6. Install the other requirements via: `pip install -r requirements.txt`
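After installing torch, a quick check that the CUDA-enabled build landed correctly can save debugging later. This is a minimal sketch; it only assumes that `torch` imports from the active venv:

```python
# Sanity-check the torch install before picking an exllamav2 wheel.
import torch

print(torch.__version__)          # e.g. "2.1.0+cu121" for a CUDA 12.1 build
print(torch.version.cuda)         # the CUDA version torch was built against
print(torch.cuda.is_available())  # True if a CUDA device is usable
```

The reported CUDA version should match the `cu` tag of the exllamav2 wheel you download (for example, `cu121` pairs with CUDA 12.1).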
## Configuration

Copy over `config_sample.yml` to `config.yml`. All the fields are commented, so make sure to read the descriptions and comment out or remove the fields that you don't need.
## Launching the Application

1. Make sure you are in the project directory and have entered the venv
2. Run the tabbyAPI application: `python main.py`
## API Documentation

Docs can be accessed once you launch the API at `http://<your-IP>:<your-port>/docs`.

If you use the default YAML config, they are accessible at `http://localhost:5000/docs`.
## Authentication

TabbyAPI uses an API key and an admin key to authenticate a user's request. On first launch of the API, a file called `api_tokens.yml` will be generated with fields for the admin and API keys.

If you feel that the keys have been compromised, delete `api_tokens.yml` and the API will generate new keys for you.

API keys and admin keys can be provided via:

- `x-api-key` and `x-admin-key` respectively
- `Authorization` with the `Bearer ` prefix (see the sketch below)

DO NOT share your admin key unless you want someone else to load/unload a model from your system!
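As an illustration, here is a minimal sketch of both header styles using Python's `requests` library. It assumes the default `localhost:5000` address and uses the `/v1/model` route for querying the currently loaded model; the key value is a placeholder:

```python
# Two equivalent ways to authenticate a request against TabbyAPI.
import requests

API_KEY = "paste-your-key-from-api_tokens.yml"  # placeholder

# Style 1: the x-api-key header
resp = requests.get(
    "http://localhost:5000/v1/model",
    headers={"x-api-key": API_KEY},
)

# Style 2: the Authorization header with the Bearer prefix
resp = requests.get(
    "http://localhost:5000/v1/model",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
print(resp.status_code)
```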
#### Authentication Requirements

All routes require an API key, except for the following, which require an **admin** key:

- `/v1/model/load`
- `/v1/model/unload`
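For admin routes, the same header pattern applies with the admin key. A hedged sketch follows; the HTTP method and request schema for these routes are assumptions here, so check the generated `/docs` page for the exact contract:

```python
# Unload the current model using the admin key (method/schema assumed; see /docs).
import requests

ADMIN_KEY = "paste-your-admin-key-from-api_tokens.yml"  # placeholder

resp = requests.post(  # POST is an assumption; verify against /docs
    "http://localhost:5000/v1/model/unload",
    headers={"x-admin-key": ADMIN_KEY},
)
print(resp.status_code)
```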
## Contributing

If you have issues with the project:

- Describe the issues in detail
- If you have a feature request, please indicate it as such

If you have a pull request:

- Describe the pull request in detail: what you are changing and why
"model": "airoboros-mistral2.2-7b-exl2",
"prompt": ["A tabby","is"],
"stream": true,
"top_p": 0.73,
"stop": "[",
"max_tokens": 360,
"temperature": 0.8,
"mirostat_mode": 2,
"mirostat_tau": 5,
"mirostat_eta": 0.1
}
Model: "airoboros-mistral2.2-7b-exl2" ## Developers and Permissions
This specifies the specific language model being used. It's essential for the API to know which model to employ for generating responses.
Prompt: ["Hello there! My name is", "Brian", "and I am", "an AI"] Creators/Developers:
The prompt *QUESTION* why is it a list of strings instead of a single string?
Stream: true
Whether the response should be streamed back or not.
Top_p: 0.73 - kingbri
cumulative probability threshold
Stop: "[" - Splice86
The stop parameter defines a string that stops the generation.
Max_tokens: 360 - Turboderp
This parameter determines the maximum number of tokens.
Temperature: 0.8
Temperature controls the randomness of the generated text.
Mirostat_mode: 2
?
Mirostat_tau: 5
?
Mirostat_eta: 0.1
?

config_sample.yml

@@ -1,14 +1,39 @@
# Options for networking
network:
  # The IP to host on (default: 127.0.0.1).
  # Use 0.0.0.0 to expose on all network adapters
  host: "127.0.0.1"

  # The port to host on (default: 5000)
  port: 5000

# Options for model overrides and loading
model:
  # Overrides the directory to look for models (default: "models")
  # Make sure to use forward slashes, even on Windows (or escape your backslashes).
  # model_dir: "your model directory path"

  # An initial model to load. Make sure the model is located in the model directory!
  # A model can be loaded later via the API. This does not have to be specified
  # model_name: "A model name"

  # The parameters below apply only if model_name is set

  # Maximum model context length (default: 4096)
  max_seq_len: 4096

  # Automatically allocate resources to GPUs (default: True)
  gpu_split_auto: True

  # A list of GBs of VRAM to split between GPUs (default: [])
  # gpu_split: [20.6, 24]

  # Rope scaling parameters (default: 1.0)
  rope_scale: 1.0
  rope_alpha: 1.0

  # Disable Flash Attention 2. Recommended for GPUs lower than Nvidia's 3000 series. (default: False)
  no_flash_attention: False

  # Enable low VRAM optimizations in exllamav2 (default: False)
  low_mem: False
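To confirm that an edited `config.yml` parses and to see the values TabbyAPI will pick up, a short sketch; it assumes PyYAML is installed (the API itself reads this file as YAML) and the key names mirror the sample above:

```python
# Parse config.yml and print the settings that matter most at startup.
import yaml

with open("config.yml") as f:
    config = yaml.safe_load(f) or {}

network = config.get("network", {})
model = config.get("model", {})

print(f"Serving on {network.get('host')}:{network.get('port')}")
print(f"max_seq_len: {model.get('max_seq_len')}, gpu_split_auto: {model.get('gpu_split_auto')}")
```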


Binary file not shown.