Tree: Migrate docs into repository
This will auto-publish to the Github wiki via an action.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>

parent 9f649647f0 · commit 5614b342a7
11 changed files with 732 additions and 0 deletions

docs/01.-Getting-Started.md (new file, 165 lines)

## Prerequisites

To get started, make sure you have the following installed on your system:

- [Python 3.x](https://www.python.org/downloads/release/python-3117/) (preferably 3.11) with pip
  - Do NOT install Python from the Microsoft Store! This will cause issues with pip.
  - Alternatively, you can use miniconda if it's present on your system.

> [!NOTE]
> Prefer a video guide? Watch the step-by-step tutorial on [YouTube](https://www.youtube.com/watch?v=03jYz0ijbUU)

> [!NOTE]
> You can install [miniconda3](https://docs.conda.io/projects/miniconda/en/latest/miniconda-other-installer-links.html) on your system, which gives you the benefit of having both python and conda!

> [!WARNING]
> CUDA and ROCm aren't prerequisites because torch can install them for you. However, if this doesn't work (ex. DLL load failed), install the CUDA toolkit or ROCm on your system.
>
> - [CUDA 12.x](https://developer.nvidia.com/cuda-downloads)
> - [ROCm 6.1](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.1.0/how-to/prerequisites.html)

> [!WARNING]
> Sometimes Windows may report an error that VS Build Tools needs to be installed. This means that a package isn't supported for your Python version.
> You can install [VS Build Tools 17.8](https://aka.ms/vs/17/release.ltsc.17.8/vs_buildtools.exe) and build the wheel locally. In addition, open an issue stating that a dependency is building a wheel.

## Installing

### For Beginners

1. Clone this repository to your machine: `git clone https://github.com/theroyallab/tabbyAPI`
2. Navigate to the project directory: `cd tabbyAPI`
3. Run the appropriate start script (`start.bat` for Windows and `start.sh` for Linux).
   1. Follow the on-screen instructions and select the correct GPU library.
   2. Assuming that the prerequisites are installed and can be located, a virtual environment will be created for you and dependencies will be installed.
4. The API should start with no model loaded.

### For Advanced Users

> [!NOTE]
> TabbyAPI has recently switched to pyproject.toml. These instructions may look different than before.

1. Follow steps 1-2 in the [For Beginners](#for-beginners) section.
2. Create a python environment through venv:
   1. `python -m venv venv`
   2. Activate the venv
      1. On Windows: `.\venv\Scripts\activate`
      2. On Linux: `source venv/bin/activate`
3. Install the pyproject features based on your system:
   1. CUDA 12.x: `pip install -U .[cu121]`
   2. ROCm 5.6: `pip install -U .[amd]`
4. Start the API by either:
   1. Running `start.bat`/`start.sh`. The script checks whether you're in a conda environment and skips venv checks.
   2. Running `python main.py`. This won't automatically upgrade your dependencies.

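Putting the advanced steps together, a minimal sketch for a Linux machine with a CUDA 12.x GPU (swap the extra and the activation command for your platform):

```bash
git clone https://github.com/theroyallab/tabbyAPI
cd tabbyAPI

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate      # Windows: .\venv\Scripts\activate

# Install TabbyAPI with the CUDA 12.x feature (use .[amd] for ROCm)
pip install -U .[cu121]

# Start the API
python main.py
```
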
## Configuration

Loading solely the API may not be your optimal use case. Therefore, a config.yml exists to tune initial launch parameters and other configuration options.

A config.yml file is only required for overriding project defaults. **If you are okay with the defaults, you don't need a config file!**

If you do want a config file, copy `config_sample.yml` to `config.yml`. All the fields are commented, so make sure to read the descriptions and comment out or remove fields that you don't need.

In addition, if you want to manually set the API keys, copy `api_keys_sample.yml` to `api_keys.yml` and fill in the fields. However, doing this is less secure, and autogenerated keys should be used instead.

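For reference, the copy step on Linux/macOS looks like this, assuming you're in the tabbyAPI directory (use `copy` or File Explorer on Windows):

```bash
# Start from the commented sample files and edit them to taste
cp config_sample.yml config.yml

# Optional: only if you want to manage API keys manually
cp api_keys_sample.yml api_keys.yml
```
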
You can also find descriptions of the configuration parameters under [02. Server options](https://github.com/theroyallab/tabbyAPI/wiki/02.-Server-options) in this wiki!

## Where next?

1. Take a look at the [usage docs](https://github.com/theroyallab/tabbyAPI/wiki/03.-Usage)
2. Get started with [community projects](https://github.com/theroyallab/tabbyAPI/wiki/09.-Community-Projects): Find loaders, UIs, and more created by the wider AI community. Any OAI-compatible client is also supported.

## Updating

There are a couple of ways to update TabbyAPI:

1. **Update scripts** - Inside the update_scripts folder, you can run the following scripts:
   1. `update_deps`: Updates dependencies to their latest versions.
   2. `update_deps_and_pull`: Updates dependencies and pulls the latest commit of the Github repository.

   These scripts exit after running their respective tasks. To start TabbyAPI, run `start.bat` or `start.sh`.

2. **Manual** - Install the pyproject features and update dependencies depending on your GPU:
   1. CUDA 12.x: `pip install -U .[cu121]`
   2. ROCm 6.0: `pip install -U .[amd]`

If you don't want to update dependencies that come from wheels (torch, exllamav2, and flash attention 2), use `pip install .` or pass the `--nowheel` flag when invoking the start scripts.

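For example, a wheel-preserving update might look like this sketch (the `--nowheel` flag is passed straight to the start script):

```bash
# Update only the dependencies that don't come from wheels
pip install .

# Or let the start script handle it while skipping wheel upgrades
./start.sh --nowheel
```
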
### Update Exllamav2

> [!WARNING]
> These instructions are meant for advanced users.

> [!IMPORTANT]
> If you're installing a custom Exllamav2 wheel, make sure to use `pip install .` when updating! Otherwise, each update will overwrite your custom exllamav2 version.

NOTE:

- TabbyAPI enforces the latest Exllamav2 version for compatibility purposes.
- Any upgrade using a pyproject GPU lib feature will overwrite your installed wheel.
  - To fix this, change the feature in `pyproject.toml` locally, create an issue or PR, or install your version of exllamav2 after upgrades.

Here are ways to install exllamav2:

1. From a [wheel/release](https://github.com/turboderp/exllamav2#method-2-install-from-release-with-prebuilt-extension) (Recommended)
   1. Find the version that corresponds to your CUDA and Python version. For example, a wheel with `cu121` and `cp311` corresponds to CUDA 12.1 and Python 3.11.
2. From [pip](https://github.com/turboderp/exllamav2#method-3-install-from-pypi): `pip install exllamav2`
   1. This is a JIT compiled extension, which means that the initial launch of tabbyAPI will take some time. The build may also not work due to improper environment configuration.
3. From [source](https://github.com/turboderp/exllamav2#method-1-install-from-source)

## Other installation methods

These are short-form instructions for other methods that users can use to install TabbyAPI.

> [!WARNING]
> Using methods other than venv may not play nice with startup scripts. Using these methods indicates that you're an advanced user and know what you're doing.

### Conda

1. Install [Miniconda3](https://docs.conda.io/projects/miniconda/en/latest/miniconda-other-installer-links.html) with python 3.11 as your base python
2. Create a new conda environment: `conda create -n tabbyAPI python=3.11`
3. Activate the conda environment: `conda activate tabbyAPI`
4. Install optional dependencies if they aren't present
   1. CUDA via
      1. CUDA 12 - `conda install -c "nvidia/label/cuda-12.2.2" cuda`
   2. Git via `conda install -k git`
5. Clone TabbyAPI via `git clone https://github.com/theroyallab/tabbyAPI`
6. Continue installation steps from:
   1. [For Beginners](#for-beginners) - Step 3. The start scripts detect if you're in a conda environment and skip the venv check.
   2. [For Advanced Users](#for-advanced-users) - Step 3

### Docker

1. Install Docker and docker compose from the [docs](https://docs.docker.com/compose/install/)
2. Install the Nvidia container compatibility layer
   1. For Linux: [Nvidia container toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
   2. For Windows: [Cuda Toolkit on WSL](https://docs.nvidia.com/cuda/wsl-user-guide/index.html)
3. Clone TabbyAPI via `git clone https://github.com/theroyallab/tabbyAPI`
4. Enter the tabbyAPI directory with `cd tabbyAPI`.
   1. Optional: Set up a config.yml or api_tokens.yml ([configuration](#configuration))
5. Update the volume mount section in the `docker/docker-compose.yml` file:

   ```yml
   volumes:
     # - /path/to/models:/app/models # Change me
     # - /path/to/config.yml:/app/config.yml # Change me
     # - /path/to/api_tokens.yml:/app/api_tokens.yml # Change me
   ```

6. Optional: If you'd like to build the dockerfile from source, follow the instructions below in `docker/docker-compose.yml`:

   ```yml
   # Uncomment this to build a docker image from source
   #build:
   #  context: ..
   #  dockerfile: ./docker/Dockerfile

   # Comment this to build a docker image from source
   image: ghcr.io/theroyallab/tabbyapi:latest
   ```

7. Run `docker compose -f docker/docker-compose.yml up` to build the dockerfile (if configured) and start the server.

docs/02.-Server-options.md (new file, 87 lines)

## Server Options

TabbyAPI primarily uses a config.yml file to adjust various options. This is the preferred way and can adjust all of TabbyAPI's options.

CLI arguments are also included, but those serve to *override* the options set in config.yml. Therefore, they act a bit differently compared to other programs, especially with booleans.

Example: A user sets `gpu_split_auto` to True in config.yml. The CLI arg would then be `--gpu-split-auto False` to override that config.yml setting.

In addition, some config.yml options are too complex to represent as command args, so those are not included with the argparser.

All of these options have descriptive comments above them in the sample config. You should not need to reference this documentation page unless absolutely necessary.

### Networking Options

| Config Option   | Type (Default)         | Description |
|-----------------|------------------------|-------------|
| host            | String (127.0.0.1)     | Set the IP address used for hosting TabbyAPI |
| port            | Int (5000)             | Set the TCP port used for TabbyAPI |
| disable_auth    | Bool (False)           | Disables API authentication |
| send_tracebacks | Bool (False)           | Send server tracebacks to the client.<br><br>Note: It's not recommended to enable this if sharing the instance with others. |
| api_servers     | List[String] (["OAI"]) | API servers to enable. Possible values: `"OAI"`, `"Kobold"` |

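For illustration, here's how these options might look in `config.yml`. This is a sketch; the `network:` block name and layout are assumptions based on `config_sample.yml`, whose comments are the authoritative reference:

```yaml
# Hypothetical excerpt of config.yml; values shown are the defaults
network:
  host: 127.0.0.1
  port: 5000
  disable_auth: false
  send_tracebacks: false
  api_servers: ["OAI"]
```
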
### Logging Options

Note: With CLI args, all logging parameters are prefixed by `log-`. For example, `prompt` becomes `--log-prompt true/false`.

| Config Option     | Type (Default) | Description |
|-------------------|----------------|-------------|
| prompt            | Bool (False)   | Logs prompts to the console |
| generation_params | Bool (False)   | Logs request generation options to the console |
| requests          | Bool (False)   | Logs a request's URL, body, and headers to the console |

### Sampling Options

Note: This block is for sampling overrides, not the samplers themselves.

| Config Option   | Type (Default) | Description |
|-----------------|----------------|-------------|
| override_preset | String (None)  | Start up with the given sampler override preset from the sampler_overrides folder |

### Developer Options

Note: These are experimental flags that may be removed at any point.

| Config Option             | Type (Default) | Description |
|---------------------------|----------------|-------------|
| unsafe_launch             | Bool (False)   | Skips dependency checks on startup. Only recommended for debugging. |
| disable_request_streaming | Bool (False)   | Forcefully disables streaming requests |
| cuda_malloc_backend       | Bool (False)   | Uses pytorch's CUDA malloc backend to load models. Helps save VRAM.<br><br>Safe to enable. |
| uvloop                    | Bool (False)   | Use a faster asyncio event loop. Can increase performance.<br><br>Safe to enable. |
| realtime_process_priority | Bool (False)   | Set the process priority to "Realtime". Administrator/sudo access required, otherwise the priority is set to the highest it can go in userland. |

### Model Options

Note: Most of the options here will only apply on initial model load/startup (ephemeral). They will not persist unless you add the option name to `use_as_default`.

| Config Option         | Type (Default)    | Description |
|-----------------------|-------------------|-------------|
| model_dir             | String ("models") | Directory to look for models.<br><br>Note: Persisted across subsequent load requests |
| use_dummy_models      | Bool (False)      | Send a dummy OAI model card when calling the `/v1/models` endpoint. Used for clients which enforce specific OAI models.<br><br>Note: Persisted across subsequent load requests |
| model_name            | String (None)     | Folder name of a model to load. The below parameters will not apply unless this is filled out. |
| use_as_default        | List[String] ([]) | Keys to use by default when loading models. For example, putting `cache_mode` in this array will make every model load with that value unless specified by the API request.<br><br>Note: Also applies to the `draft` sub-block |
| max_seq_len           | Float (None)      | Maximum sequence length of the model. Uses the value from config.json if not specified here. |
| override_base_seq_len | Float (None)      | Overrides the base sequence length of a model. You probably don't want to use this; max_seq_len is better.<br><br>Note: This is only required for automatic RoPE alpha calculation AND if the model has an incorrect base sequence length (ex. Mistral 7b) |
| tensor_parallel       | Bool (False)      | Use tensor parallelism to load the model. This ignores the value of gpu_split_auto. |
| gpu_split_auto        | Bool (True)       | Automatically split the model across multiple GPUs. Manual GPU split isn't used if this is enabled. |
| autosplit_reserve     | List[Int] ([96])  | Amount of empty VRAM to reserve when loading with autosplit.<br><br>Represented as an array of MB per GPU used. |
| gpu_split             | List[Float] ([])  | Float array of GBs to split a model between GPUs. |
| rope_scale            | Float (1.0)       | Adjustment for rope scale (or compress_pos_emb) |
| rope_alpha            | Float (None)      | Adjustment for rope alpha. Leave blank to automatically calculate based on the max_seq_len. |
| cache_mode            | String ("FP16")   | Cache mode for the model.<br><br>Options: FP16, Q8, Q6, Q4 |
| cache_size            | Int (max_seq_len) | Size of the cache in tokens.<br><br>Note: If using CFG, the cache size should be 2 * max_seq_len. |
| chunk_size            | Int (2048)        | Amount of tokens per chunk with ingestion. A lower value reduces VRAM usage at the cost of ingestion speed. |
| max_batch_size        | Int (None)        | The absolute maximum amount of prompts to process at one time. This value is automatically adjusted based on cache size. |
| prompt_template       | String (None)     | Name of a jinja2 chat template to apply for this model. Must be located in the `templates` directory. |
| num_experts_per_token | Int (None)        | Number of experts to use per-token for MoE models. Pulled from the config.json if not specified. |
| fasttensors           | Bool (False)      | Possibly increases model loading speeds. |

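And a minimal model block sketch (again assuming the `config_sample.yml` layout; the folder name is hypothetical):

```yaml
model:
  model_dir: models
  model_name: MyModel-exl2-4.0bpw   # hypothetical folder under model_dir
  max_seq_len: 8192
  cache_mode: Q4
  gpu_split_auto: true
  use_as_default: ["cache_mode"]    # persist cache_mode across subsequent loads
```
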
### Draft Model Options

Note: Sub-block of Model Options. Same rules apply.

| Config Option    | Type (Default)    | Description |
|------------------|-------------------|-------------|
| draft_model_dir  | String ("models") | Directory to look for draft models.<br><br>Note: Persisted across subsequent load requests |
| draft_model_name | String (None)     | Folder name of a draft model to load. |
| draft_rope_scale | Float (1.0)       | RoPE scale value for the draft model. |
| draft_rope_alpha | Float (1.0)       | RoPE alpha value for the draft model. Leave blank for auto-calculation. |
| draft_cache_mode | String ("FP16")   | Cache mode for the draft model.<br><br>Options: FP16, Q8, Q6, Q4 |

### Lora Options

Note: Sub-block of Model Options. Same rules apply.

| Config Option | Type (Default)   | Description |
|---------------|------------------|-------------|
| lora_dir      | String ("loras") | Directory to look for loras.<br><br>Note: Persisted across subsequent load requests |
| loras         | List[loras] ([]) | List of lora objects to apply to the model. Each object contains a name and scaling. |
| name          | String (None)    | Folder name of a lora to load.<br><br>Note: An element of the `loras` key |
| scaling       | Float (1.0)      | "Weight" to apply the lora on the parent model. For example, applying a lora with 0.9 scaling will lower the amount of application on the parent model.<br><br>Note: An element of the `loras` key |

### Embeddings Options

Note: Most of the options here will only apply on initial embedding model load/startup (ephemeral).

| Config Option        | Type (Default)    | Description |
|----------------------|-------------------|-------------|
| embedding_model_dir  | String ("models") | Directory to look for embedding models.<br><br>Note: Persisted across subsequent load requests |
| embeddings_device    | String ("cpu")    | Device to load an embedding model on.<br><br>Options: cpu, cuda, auto<br><br>Note: Persisted across subsequent load requests |
| embedding_model_name | String (None)     | Folder name of an embedding model to load using infinity-emb. |

docs/03.-Usage.md (new file, 46 lines)

## Usage

TabbyAPI's main use case is to be an API server for running ExllamaV2 models.

### API Server

Currently, TabbyAPI supports clients that use the [OpenAI](https://platform.openai.com/docs/api-reference) standard and [KoboldAI](https://lite.koboldai.net/koboldcpp_api)'s API.

In addition, there are expanded parameters on generation endpoints along with administrative endpoints for loading, unloading, loras, sampling overrides, etc.

> [!NOTE]
> If you are a developer and want to add full TabbyAPI support to your app, it's recommended to use the [autogenerated documentation](https://theroyallab.github.io/tabbyAPI).

Below is an example CURL request using the OpenAI completions endpoint:

```
curl http://localhost:5000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "prompt": "Once upon a time,",
    "max_tokens": 400,
    "stream": false,
    "min_p": 0.05,
    "repetition_penalty": 1.05
  }'
```

### Authentication

Every call to a TabbyAPI endpoint requires some form of authentication. Keys have two types of permissions:

- API: Accesses non-invasive endpoints (ex. generation, model list fetching)
- Admin: Allowed to access protected endpoints that deal with resources (ex. loading, unloading)

In addition, when calling list endpoints, API keys will only fetch the currently loaded object, while admin keys will list the entire directory. For example, calling `/v1/models` will return a list of the user-configured models directory only if an admin key is passed.

Therefore, it's recommended to keep the admin key for yourself and only share the API key with users.

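As a sketch, a keyed request might look like the following. The header name here is an assumption; check the [autogenerated documentation](https://theroyallab.github.io/tabbyAPI) for the exact authentication scheme your version expects:

```bash
# Hypothetical key value; real keys are generated in api_tokens.yml
curl http://localhost:5000/v1/models \
  -H "x-api-key: <your API key>"

# Admin-only endpoints expect the admin key in the same fashion
```
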
If these keys get compromised, shut down your server, delete the `api_tokens.yml` file, and restart. This will generate new keys which you can share with users.

To bypass authentication checks, set `disable_auth` to `True` in config.yml. However, turning off authentication without a third-party solution will make your instance open to the world.

### Difficult to get started?

Is the API difficult? Don't want to load models with `config.yml`? That's okay! Not everyone is a master user of AI products when starting out.

For newer users, it's recommended to use a UI that allows for managing TabbyAPI via API endpoints.

To find UI projects, take a look at [Community Projects](https://github.com/theroyallab/tabbyAPI/wiki/09.-Community-Projects) for more information.

The [Discord](https://discord.gg/sYQxnuD7Fj) is also a great place to ask for help. Please be nice when asking questions as all the developers are volunteers who have lives outside of TabbyAPI.

docs/04.-Chat-Completions.md (new file, 31 lines)

## Chat Completions

TabbyAPI builds on top of the HuggingFace "chat templates" standard for OAI-style chat completions (`/v1/chat/completions`).

If you'd like more detail, look at the [autogenerated documentation](https://theroyallab.github.io/tabbyAPI/#operation/chat_completion_request_v1_chat_completions_post).

### Custom Templates

By default, TabbyAPI will try to pull the chat template from the `chat_template` key within a model's `tokenizer_config.json`, but you can also make a custom jinja file. To learn how to create a HuggingFace compatible jinja2 template, please read [Huggingface's documentation](https://huggingface.co/docs/transformers/main/chat_templating).

If you create a custom template for a model, consider PRing it to the [templates repository](https://github.com/theroyallab/llm-prompt-templates).

In addition, there's also support for specifying stopping strings within the chat template. This can be achieved by adding `{%- set stop_strings = ["string1"] -%}` at the top of the jinja file. In this case, `string1` will be added to your completion request as a stopping string.

> [!WARNING]
> Make sure to use `{%- -%}` for any top-level metadata. If this is not provided, the top of the rendered prompt will have extra whitespace. This does not apply to comments `{# #}`

To use a custom template, place it in the templates folder, and make sure to set the `prompt_template` field in `config.yml` (see [model config](https://github.com/theroyallab/tabbyAPI/wiki/02.-Server-options#model-options)) to the template's filename.

### Template Variables

A chat completions request to TabbyAPI also supports custom template variables in the form of a key/value object in the JSON body. Here's an example:

```json
"template_vars": {
  "test_var": "hello!"
}
```

Now let's pass the custom variable to the following template:

```jinja2
I'm going to say {{ test_var }}
```

Rendering this template will now result in: `I'm going to say hello!`

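Putting it together, a full request that carries `template_vars` might look like this sketch (the message content and model name are illustrative):

```bash
curl http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "messages": [
      {"role": "user", "content": "Greet me."}
    ],
    "template_vars": {
      "test_var": "hello!"
    }
  }'
```
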
docs/05.-FAQ.md (new file, 23 lines)

## FAQ

- What OS is supported?
  - Windows and Linux

- I'm confused, how do I do anything with this API?
  - That's okay! Not everyone is an AI mastermind on their first try. The start scripts and `config.yml` aim to guide new users to the right configuration. The [Usage](https://github.com/theroyallab/tabbyAPI/wiki/03.-Usage) page explains how the API works. [Community Projects](https://github.com/theroyallab/tabbyAPI/wiki/09.-Community-Projects) contains UIs that help interact with TabbyAPI via API endpoints. The [Discord Server](https://discord.gg/sYQxnuD7Fj) is also a place to ask questions, but please be nice.

- How do I interface with the API?
  - The wiki is meant for user-facing documentation. Devs are recommended to use the autogenerated documentation for the [OpenAI](https://theroyallab.github.io/tabbyAPI) and [Kobold](https://theroyallab.github.io/tabbyAPI/kobold) servers.

- What does TabbyAPI run?
  - TabbyAPI uses Exllamav2 as a powerful and fast backend for model inference, loading, etc. Therefore, the following types of models are supported:
    - Exl2 (Highly recommended)
    - GPTQ
    - FP16 (using Exllamav2's loader)

- Exllamav2 may error with the following exception: `ImportError: DLL load failed while importing exllamav2_ext: The specified module could not be found.`
  - First, make sure the wheel matches your python version and CUDA version. Also make sure you're in a venv or conda environment.
  - If those prerequisites are correct, the torch cache may need to be cleared. This is due to a mismatched exllamav2_ext.
    - On Windows: Find the cache at `C:\Users\<User>\AppData\Local\torch_extensions\torch_extensions\Cache` where `<User>` is your Windows username
    - On Linux: Find the cache at `~/.cache/torch_extensions`
    - Look for any folder named `exllamav2_ext` in the python subdirectories and delete it.
  - Restart TabbyAPI and launching should work again.

docs/06.-Sharing.md (new file, 35 lines)

## Sharing

This page of the documentation serves to illustrate the in-between case of exposing a local instance without having to pay for a domain and set up a complicated reverse proxy.

### Ngrok

Ngrok is the recommended method for sharing a local instance over the internet. Throughput is faster than the alternatives, but there are limits on their free tier.

To get started:

1. Sign up and install ngrok using instructions from [their website](https://ngrok.com/docs/getting-started/)
   1. The ngrok install for Windows uses chocolatey, a package manager that isn't installed on Windows by default. Instead, use the winget package manager to install ngrok by running `winget install --id=Ngrok.Ngrok -e`
2. Start your TabbyAPI instance
3. In a separate terminal, open an ngrok instance using the command `ngrok http 5000` (or the port that TabbyAPI is running on)
4. Copy the public URL that shows up and send that to your users.

### Cloudflared

Cloudflared is a free local tunneling service provided by Cloudflare. Free URLs do not have restrictions, but clients will have slower throughput.

To get started:

1. Download the cloudflared binary from [their releases](https://github.com/cloudflare/cloudflared/releases)
2. Open a terminal to where you downloaded cloudflared.
3. Start your TabbyAPI instance
4. Run `./cloudflared tunnel --url 127.0.0.1:5000` to start a quick tunnel on that port
5. Copy the provided HTTPS URL and send that to your users.

### Tailscale

Tailscale is a product that uses the WireGuard protocol to provide a mesh network VPN for connecting your devices anywhere you go. Think of it as a private LAN that's accessible from anywhere.

> [!NOTE]
> This is not a method for exposing your TabbyAPI instance to the world. If you want that, use the other two services.

To get started:

1. Set your TabbyAPI host IP to `0.0.0.0`, otherwise you will not be able to access your instance outside your local machine.
2. Sign up and get started on [Tailscale's website](https://tailscale.com/), then install the client.
3. Connect to your tailscale account on both your host and client machine.
4. Select the Tailscale icon (usually in the system tray) and get the IP of your host device. This is usually identified by the hostname.
5. You can now access your TabbyAPI instance via `<tailscale IP>:5000` instead of localhost as long as you are connected to your tailnet.

docs/07.-AI-Horde.md (new file, 24 lines)

## Connecting to the AI Horde

The [AI Horde](https://aihorde.net/) is a FOSS crowdsourced compute pool that serves models to users who are not able to access them. TabbyAPI is currently the only Horde-compatible LLM server that runs on Windows with parallel batching.

To get started, Horde requires an API key, which you can acquire [here](https://stablehorde.net/register).

The LLM branch of AI Horde does not use the OpenAI standard, but uses KoboldAI's API. Here are the steps to configure your TabbyAPI instance for hosting:

1. In `config.yml`, set the `api_servers` value to include `"Kobold"`, which enables the KoboldAI API.
2. Horde doesn't support API key authentication. Therefore, you need to enable `disable_auth` in `config.yml`.
3. After those config changes, launch TabbyAPI as normal and your server should be Horde-ready.

Now, the Horde needs a "worker" to interface between its servers and your TabbyAPI backend. To accomplish that, here are the steps:

1. Clone the AI Horde worker repository
   1. `git clone https://github.com/Haidra-Org/AI-Horde-Worker`
2. Open the repository in your favorite editor and copy `bridgeData_template.yaml` to `bridgeData.yaml`. This will serve as your configuration file.

Afterwards, you'll need to adjust the options based on your preferences. Here are the recommended values to adjust (a filled-in example follows the list):

- `api_key`: The Horde API key you registered for above.
- `max_threads`: How many requests to run at once. A higher value should be paired with a higher batch size and cache size in Tabby.
- `scribe_name`: The name of your worker that's displayed in Horde.
- `kai_url`: The URL of your TabbyAPI backend. This will usually be `http://localhost:5000`
- `max_length`: The maximum number of tokens for every request. 512 is recommended.
- `max_context_length`: The maximum context length of the worker itself. It's recommended to set this to the model's `max_seq_len`.

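For example, a filled-in `bridgeData.yaml` might look like the sketch below (all values are illustrative placeholders):

```yaml
api_key: "0000000000"             # your Horde API key (placeholder)
max_threads: 1
scribe_name: "MyTabbyWorker"      # hypothetical worker name
kai_url: "http://localhost:5000"
max_length: 512
max_context_length: 8192          # match your model's max_seq_len
```
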
Finally, launch the worker by running `.\horde-scribe-bridge.bat` or `./horde-scribe-bridge.sh` depending on your system.

docs/08.-Sampling.md (new file, 150 lines)

# Supported Samplers

Samplers are used to alter raw token probabilities during response generation. Users can tune these to adjust what outputs they get.

> [!NOTE]
>
> Sampling is not a catch-all solution if your generations are behaving the wrong way! These factors can also fall to the prompt, frontend, model, etc. Please do not set arbitrary sampler values without understanding what they do first!

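All of the fields below are plain keys in a generation request body. As a reference point, a request using a few of them might look like this (values are illustrative, not recommendations):

```json
{
  "prompt": "Once upon a time,",
  "max_tokens": 200,
  "temperature": 0.8,
  "min_p": 0.05,
  "repetition_penalty": 1.05,
  "penalty_range": 2048
}
```
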
## Penalties

Repetition Penalty -

- API request field: `repetition_penalty`
- Default: `1.0` - Off
- Description: Multiplicative method of preventing repetition of previous tokens in the context.

Frequency Penalty -

- API request field: `frequency_penalty`
- Default: `0.0` - Off
- Description: A constant value added each time a token is sampled, reducing the probability for that specific token.

Presence Penalty -

- API request field: `presence_penalty`
- Default: `0.0` - Off
- Description: Additive method of preventing repetition of previous tokens in the context. Encourages new ideas to get generated. Unlike frequency penalty, this is a one-off application. tl;dr: repetition penalty, but additive.

Penalty Range -

> [!NOTE]
>
> Unlike other backends, `0` disables penalties entirely!

- API request field: `penalty_range` or `repetition_range` or `repetition_penalty_range`
- Default: `-1`
  - When frequency OR presence penalty is enabled, a penalty_range value of `-1` applies the penalty to only the output tokens. A lower range is advised.
  - Otherwise, a penalty range value of `-1` = max sequence length
- Description: Amount of tokens to look behind when applying penalties.
  - For frequency and presence penalty, this should be a low value to avoid "backing the model into a corner" when selecting similar tokens, resulting in large amounts of synonym repeats (aka "thesaurus mode").

## Alphabet Soup

Top-P -

- API request field: `top_p`
- Default: `1.0` - Off

Min-P -

- API request field: `min_p`
- Default: `0.0` - Off

Top-K -

- API request field: `top_k`
- Default: `0.0` - Off

Top-A -

- API request field: `top_a`
- Default: `0.0` - Off

## Miscellaneous

Temperature -

- API request field: `temperature`
- Default: `1.0` - Off
- Description: A constant value applied to the softmax calculation. A higher temperature = more randomness when choosing the next token.

Temp last -

- API request field: `temp_last`
- Default: `false` - Off
- Description: Places temperature application last in the sampling stack. Necessary for min-P sampling.

Typical -

- API request field: `typical`
- Default: `1.0` - Off

Tail-free Sampling -

- API request field: `tfs`
- Default: `1.0` - Off

Logit bias -

- API request field: `logit_bias`
- Default: `None` - Off
- Example: `[{"1": 50}, {"2": 75}]` - An array of bias objects
- Description: Adds a positive or negative value to change the occurrence of a specific token. Format: `{"token": bias}` where bias is from `-100` to `100`.

Mirostat mode -

- API request field: `mirostat_mode`
- Default: `0` - Off
- Note: Exllamav2 only applies mirostat when `mirostat_mode = 2`

Mirostat tau -

- API request field: `mirostat_tau`
- Default: `1.5` - Off unless mirostat_mode = 2

Mirostat eta -

- API request field: `mirostat_eta`
- Default: `0.1` - Off unless mirostat_mode = 2

docs/09.-Community-Projects.md (new file, 14 lines)

## Community Projects

Because TabbyAPI is solely an API server, the community has provided custom-tailored applications to communicate with a TabbyAPI backend.

> [!NOTE]
> If you would like to showcase a community project, feel free to contact a dev!

Below is a list of projects that integrate with TabbyAPI:

- [TabbyAPI Gradio Loader](https://github.com/theroyallab/tabbyAPI-gradio-loader) by Doctor Shotgun - Gradio WebUI for loading/unloading models and loras via TabbyAPI.
- [ST TabbyAPI Loader](https://github.com/theroyallab/ST-tabbyapi-loader) by kingbri - SillyTavern extension for loading and unloading models via TabbyAPI.
- [SillyTavern Launcher](https://github.com/SillyTavern/SillyTavern-Launcher) by deffcolony - CLI manager for AI tools which includes packaged setup and management of TabbyAPI.
- [SillyTavern](https://github.com/SillyTavern/SillyTavern) by Cohee and RossAscends - An AI frontend for power users that fully supports TabbyAPI's generation options.
- [Agnaistic](https://agnai.chat/) by Sceik - Multi-user AI role-play chat frontend that fully supports TabbyAPI's generation options.
- [DS-LLM-WebUI](https://github.com/DocShotgun/ds-llm-webui) by Doctor Shotgun - A tool use assistant for local LLMs which is fully designed around TabbyAPI.
- [PolyMind](https://github.com/itsme2417/PolyMind) by Itsme2417 - A multimodal function calling LLM webui that is designed to be used with TabbyAPI.

docs/10.-Tool-Calling.md (new file, 145 lines)

# Tool Calling in TabbyAPI

> [!NOTE]
> Before getting started here, please look at the [Custom templates](https://github.com/theroyallab/tabbyAPI/wiki/04.-Chat-Completions#custom-templates) page for foundational concepts.
>
> Thanks to [Storm](https://github.com/gittb) for creating this documentation page.

TabbyAPI's tool calling implementation aligns with the [OpenAI Standard](https://platform.openai.com/docs/api-reference), following the [OpenAI Tools Implementation](https://platform.openai.com/docs/guides/function-calling) closely.

## Features and Limitations

TabbyAPI's tool implementation supports:

- Tool calling when streaming
- Calling multiple tools per turn

Current limitations:

- No support for the `tool_choice` parameter (always assumed to be auto)
- `strict` parameter not yet supported (OAI format ensured, but dtype and argument name choices not yet enforced)

## Model Support

TabbyAPI exposes controls within the `prompt_template` to accommodate models specifically tuned for tool calling and those that aren't. By default, TabbyAPI includes `chatml_with_headers_tool_calling.jinja`, a generic template built to support the Llama 3.1 family and other models following the ChatML (with headers) format.

For more templates, check out [llm-prompt-templates](https://github.com/theroyallab/llm-prompt-templates).

## Usage

In order to use tool calling in TabbyAPI, you must select a `prompt_template` that supports tool calling when loading your model.

For example, if you are using a Llama 3.1 family model, you can simply modify your `config.yml`'s `prompt_template:` to use the default tool calling template like so:

```yaml
model:
  ...
  prompt_template: chatml_with_headers_tool_calling
```

If loading via `/v1/model/load`, you would also need to specify a tool-supporting `prompt_template`.

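With a tool-capable template loaded, requests follow the OpenAI tools format: a normal chat completion body plus a `tools` array. A sketch with a hypothetical `get_weather` function:

```json
{
  "model": "my-model",
  "messages": [
    {"role": "user", "content": "What's the weather in Paris?"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {
            "city": {"type": "string"}
          },
          "required": ["city"]
        }
      }
    }
  ]
}
```
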
## Creating a Tool Calling Prompt Template

Here's how to create a TabbyAPI tool calling prompt template:

1. Define proper metadata:

   Tool call supporting `prompt_templates` can have the following fields as metadata:

   - `tool_start`: This is a string that we expect the model to write when initiating a tool call. **(Required)**
   - `tool_end`: This is a string the model expects after completing a tool call.

   Here is an example of these being defined:

   ```jinja
   {# Metadata #}
   {% set stop_strings = ["<|im_start|>", "<|im_end|>"] %}
   {% set message_roles = ['system', 'user', 'assistant', 'tool'] %}
   {% set tool_start = "<|tool_start|>" %}
   {% set tool_end = "<|tool_end|>" %}
   ```

   `tool_start` and `tool_end` should be selected based on which model you decide to use. For example, [Groq's tool calling models](https://huggingface.co/Groq/Llama-3-Groq-70B-Tool-Use) expect `<tool_call>` and `</tool_call>`, while the [Llama3 FireFunctionV2](https://huggingface.co/fireworks-ai/llama-3-firefunction-v2) model expects only `functools` to start the call, without a `tool_end`.

2. Define an `initial_system_prompt`:

   While the name of your `initial_system_prompt` can vary, its purpose does not. This initial prompt is typically a simple instruction set followed by accessing the `tools_json` variable, which contains the function specification the user provided in the `tools` field of their chat completion request. Inside the template, we can reference it like so: `{{ tools_json }}`.

   Note: Depending on the model you are using, it's possible your model may expect a special set of tokens to surround the function specifications. Feel free to surround `tools_json` with these tokens.

   ```jinja
   {% set initial_system_prompt %}
   Your instructions here...
   Available functions:
   {{ tools_json }}
   {% endset %}
   ```

   You'll then want to make sure to provide this to the model in the first message it receives. Here is a simple example:

   ```jinja
   {%- if loop.first -%}
   {{ bos_token }}{{ start_header }}{{ role }}{{ end_header }}
   {{ initial_system_prompt }}

   {{ content }}{{ eos_token }}
   ```

3. Handle messages with the `tool` role:

   After a tool call is made, a *well behaved* client will respond to the model with a new message containing the role `tool`. This is a response to a tool call containing the results of its execution.

   The simplest implementation of this is to ensure the `message_roles` list within your prompt template contains `tool`. Further customization may be required for models that expect specific tokens surrounding tool responses. An example of this customization is the Groq family of models from above. They expect special tokens surrounding their tool responses, such as:

   ```jinja
   {% if role == 'tool' %}
   <tool_response>{{ content }}</tool_response>
   {% endif %}
   ```

4. Preserve tool calls from prior messages:

   When creating a tool calling `prompt_template`, ensure you handle previous tool calls from the model gracefully. Each `message` object within the `messages` exposed to the `prompt_template` could also contain `tool_calls_json`. This field contains tool calls made by the assistant in previous turns, and it must be handled appropriately so that the model understands what previous actions it has taken (and can properly identify which tool response ID belongs to which call).

   This requires using the `tool_start` (and possibly `tool_end`) from above to wrap the `tool_calls_json` like so:

   ```jinja
   {% if 'tool_calls_json' in message and message['tool_calls_json'] %}
   {{ tool_start }}{{ message['tool_calls_json'] }}{{ tool_end }}
   {% endif %}
   ```

5. Handle tool call generation:

   ```jinja
   {% set tool_reminder %}
   Available Tools:
   {{ tools_json }}

   Tool Call Format Example:
   {{ tool_start }}{{ example_tool_call }}

   Prefix & Suffix: Begin tool calls with {{ tool_start }} and end with {{ tool_end }}.
   Argument Types: Use correct data types for arguments (e.g., strings in quotes, numbers without).
   {% endset %}

   {% if tool_precursor %}
   {{ start_header }}system{{ end_header }}
   {{ tool_reminder }}{{ eos_token }}
   {{ start_header }}assistant{{ end_header }}
   {{ tool_precursor }}{{ tool_start }}
   {% else %}
   {{ start_header }}assistant{{ end_header }}
   {% endif %}
   ```

   This clever bit of temporal manipulation allows us to slip in a reminder as a system message right before the model generates a tool call, but after it writes the `tool_start` token. This is possible because TabbyAPI revisits the `prompt_template` after a `tool_start` token is detected. Here's how it works:

   - We detect `tool_precursor`, which signals the model is about to generate a tool call.
   - We then inject a system message with our `tool_reminder`.
   - Finally, we initialize an assistant message using `tool_precursor` as the content.

   This creates the illusion that the model just happened to remember the available tools and proper formatting right before generating the tool call. It's like giving the model a little nudge at exactly the right moment, enhancing its performance without altering what the user sees.

When creating your own tool calling `prompt_template`, it's best to reference the default `chatml_with_headers_tool_calling.jinja` template as a starting point.

## Support and Bug Reporting

For bugs, please create a detailed issue with the model, prompt template, and conversation that caused it. Alternatively, join our [Discord](https://discord.gg/sYQxnuD7Fj) and ask for Storm.

docs/Home.md (new file, 12 lines)

> [!IMPORTANT]
> This documentation is under construction. URLs may change at any time. Thanks!

Welcome to TabbyAPI!

This wiki aims to provide a place for documenting various aspects of this project.

Not sure where to start? Check out the [Getting Started](https://github.com/theroyallab/tabbyAPI/wiki/01.-Getting-Started) page.

Are you a developer? Take a look at the [Usage](https://github.com/theroyallab/tabbyAPI/wiki/03.-Usage) page and the [autogenerated documentation](https://theroyallab.github.io/tabbyAPI).

Have issues? Check out the [FAQ](https://github.com/theroyallab/tabbyAPI/wiki/05.-FAQ) page.