Previously, pre-sampling logprobs were computed from the raw logits,
but newer versions of exl2 can return token probs post-sampling.
Convert these to logprobs and send them to the user.
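A minimal sketch of the conversion, assuming the backend hands back a
token -> probability mapping (the function name is illustrative):

```python
import math

def probs_to_logprobs(token_probs: dict[str, float]) -> dict[str, float]:
    # Post-sampling probs are already normalized, so a plain log
    # yields the logprob the client expects.
    return {token: math.log(prob) for token, prob in token_probs.items()}
```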
Signed-off-by: kingbri <bdashore3@proton.me>
This option saves some VRAM but has a chance of erroring out.
Add this in the experimental config section.
Signed-off-by: kingbri <bdashore3@proton.me>
Injecting into Pydantic fields caused issues with serialization for
documentation rendering. Rather than reinvent the wheel again,
switch to a chain of if statements for now. This may change in the
future if subclasses from the base sampler request need to be
validated as well.
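A rough sketch of the if-chain approach, with hypothetical field names
standing in for the real sampler request:

```python
def validate_sampler_request(params) -> None:
    # Plain if statements instead of constraints injected into
    # Pydantic fields; params is the parsed sampler request.
    if params.temperature < 0:
        raise ValueError("temperature must be non-negative")
    if not 0 <= params.top_p <= 1:
        raise ValueError("top_p must be between 0 and 1")
    if params.top_k < 0:
        raise ValueError("top_k must be non-negative")
```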
Signed-off-by: kingbri <bdashore3@proton.me>
Rather than maintaining yet another function to validate sampler
ranges/values, embed the constraints in the fields themselves, which
allows for less maintenance in the future.
Also add validation for existing samplers that can corrupt
the sampling stack if set improperly.
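A minimal sketch of constraints embedded in fields, assuming Pydantic
(field names are illustrative):

```python
from pydantic import BaseModel, Field

class SamplerParams(BaseModel):
    # Ranges live on the fields themselves, so invalid values are
    # rejected before they can corrupt the sampling stack.
    temperature: float = Field(default=1.0, ge=0.0)
    top_p: float = Field(default=1.0, ge=0.0, le=1.0)
    top_k: int = Field(default=0, ge=0)
```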
Signed-off-by: kingbri <bdashore3@proton.me>
Creates templates for issues to help guide users in the right direction when making a bug report or request.
Signed-off-by: kingbri <bdashore3@proton.me>
The GPU reserve is used as a VRAM buffer to prevent GPU overflow
when automatically deciding how to load a model on multiple GPUs.
Make this configurable.
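A sketch of how a configurable reserve might look; the config key name
and the MB unit are assumptions:

```python
# Hypothetical config value: per-GPU reserve in megabytes.
autosplit_reserve_mb = [256, 96]

# Convert to bytes (one entry per GPU) before handing the list to
# the autosplit loader.
autosplit_reserve = [mb * 1024**2 for mb in autosplit_reserve_mb]
```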
Signed-off-by: kingbri <bdashore3@proton.me>
Take the log of the token probs since they're already normalized,
which reflects the proper value. Also, don't error out if a token prob
doesn't exist in the dict; return None from zip instead.
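A sketch of the pairing logic, assuming a token -> normalized
probability dict:

```python
import math

def pair_token_logprobs(tokens: list[str], probs: dict[str, float]):
    # Missing or zero probs map to None instead of raising a
    # KeyError or a math domain error.
    prob_values = (probs.get(token) for token in tokens)
    return [
        (token, math.log(prob) if prob else None)
        for token, prob in zip(tokens, prob_values)
    ]
```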
Signed-off-by: kingbri <bdashore3@proton.me>
Adds chat completion logprob support using OAI's spec. Tokens are
not converted with tiktoken here since that would add an extra
dependency for no real reason.
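Roughly the response shape OAI's chat completion spec uses (the values
here are made up):

```python
logprobs = {
    "content": [
        {
            "token": "Hello",
            "logprob": -0.31,
            "top_logprobs": [
                {"token": "Hello", "logprob": -0.31},
                {"token": "Hi", "logprob": -1.42},
            ],
        },
    ],
}
```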
Signed-off-by: kingbri <bdashore3@proton.me>
Returns token offsets, selected tokens, post-sampling token
probabilities, and the normalized pre-sampling probability of
selecting each token (kept pre-sampling for efficiency).
Only for text completions; chat completions come in a later commit.
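The text completion logprobs object roughly follows OAI's legacy spec
(values are made up):

```python
logprobs = {
    "text_offset": [0, 5],
    "tokens": ["Hello", " world"],
    "token_logprobs": [-0.31, -0.87],  # post-sampling, converted to log
    "top_logprobs": [
        {"Hello": -0.31, "Hi": -1.42},
        {" world": -0.87, " there": -1.95},
    ],
}
```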
Signed-off-by: kingbri <bdashore3@proton.me>
Split the get tokens function into separate encode and decode wrapper
functions for overall code cleanliness.
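A minimal sketch of the split, assuming the backend tokenizer exposes
encode/decode calls (the wrapper names are illustrative):

```python
def encode_tokens(tokenizer, text: str, **kwargs) -> list[int]:
    # Thin wrapper so callers never touch the tokenizer directly.
    return tokenizer.encode(text, **kwargs)

def decode_tokens(tokenizer, ids: list[int], **kwargs) -> str:
    # Mirror wrapper for turning ids back into text.
    return tokenizer.decode(ids, **kwargs)
```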
Signed-off-by: kingbri <bdashore3@proton.me>
Many APIs automatically ask for request streaming without giving
the user the option to turn it off. Therefore, give the user more
freedom by providing a server-side kill switch.
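A sketch of the override, with a hypothetical flag name:

```python
def resolve_stream_flag(client_stream: bool, disable_streaming: bool) -> bool:
    # Server-side kill switch: when streaming is disabled in the
    # config, the client's stream flag is ignored entirely.
    return False if disable_streaming else client_stream
```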
Signed-off-by: kingbri <bdashore3@proton.me>
It makes more sense to use GPU split parameters when the user has
more than one GPU. Otherwise, set split and split_auto to False and
save the user some VRAM.
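A sketch of the device-count check (the helper name is illustrative):

```python
import torch

def resolve_gpu_split(requested_auto: bool) -> tuple[bool, bool]:
    # With a single GPU, disable both manual and auto split so the
    # autosplit VRAM buffer is never allocated.
    multi_gpu = torch.cuda.device_count() > 1
    return multi_gpu, requested_auto and multi_gpu
```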
Signed-off-by: kingbri <bdashore3@proton.me>
The model loader was using more VRAM on a single GPU compared to
base exllamav2's loader. This was because single-GPU setups were
running with the autosplit config, which allocates an extra VRAM
buffer for safe loading. Turn this off for single-GPU setups (and
turn it off by default).
This change should allow users to run models that require the
entire card, hopefully with faster T/s. For example, Mixtral at
3.75 bpw increased from ~30 T/s to 50 T/s due to the extra VRAM
headroom on Windows.
Signed-off-by: kingbri <bdashore3@proton.me>
Now that the latest exllamav2 is required, don't add attribute
checks unless a feature isn't yet in the release build.
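A sketch of the guarded pattern; the attribute name here is a
placeholder for a dev-only feature:

```python
from exllamav2.generator import ExLlamaV2Sampler

settings = ExLlamaV2Sampler.Settings()

# Only guard features that haven't landed in a release build yet.
if hasattr(settings, "new_dev_only_sampler"):
    settings.new_dev_only_sampler = 0.5
```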
Signed-off-by: kingbri <bdashore3@proton.me>
Add the ability to use an unsafe config flag if needed and migrate
the exl2 check to a different file within the exl2 backend code.
Signed-off-by: kingbri <bdashore3@proton.me>
Clean up how overrides are handled, class naming, and adopt exllamav2's
model class to enforce latest stable version methods rather than
adding multiple backwards compatibility checks.
Signed-off-by: kingbri <bdashore3@proton.me>
Exllamav2 is currently supported on all GPUs and versions. Therefore,
it should be expected that users use the latest version of exllamav2 to
get the latest features.
Doing this helps remove checks that serve no real purpose.
Signed-off-by: kingbri <bdashore3@proton.me>
Dynamic temperature does not work if max_temp is less than or equal
to min_temp. Sampler validation will have to be refactored in the
future, so the dynamic temperature check will also be changed.
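One way the check could look under Pydantic v2 (the field names are
assumptions):

```python
from pydantic import BaseModel, model_validator

class DynatempParams(BaseModel):
    min_temp: float = 1.0
    max_temp: float = 1.0

    @model_validator(mode="after")
    def check_temp_range(self):
        # Dynamic temperature misbehaves unless max_temp is
        # strictly greater than min_temp.
        if self.max_temp <= self.min_temp:
            raise ValueError("max_temp must be greater than min_temp")
        return self
```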
Signed-off-by: kingbri <bdashore3@proton.me>
PyTorch released 2.2 without letting the community know first. Pin
the torch version to 2.1.2 until exllamav2 builds for torch 2.2.
Signed-off-by: kingbri <bdashore3@proton.me>
The example JSON fields were changed because of the new sampler
default strategy. Fix these by manually changing the values.
Also add support for fasttensors and expose generate_window to
the API. It's recommended not to adjust generate_window since it's
dynamically scaled based on max_seq_len by default.
Signed-off-by: kingbri <bdashore3@proton.me>
The previous commit iterated through multiple try conditions, which
forced the user to provide a dummy prompt template. Now, template
loading is fallback based: run through a loop of functions and return
when one of them succeeds.
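A minimal sketch of the fallback loop (the loader signatures are
assumptions):

```python
def find_prompt_template(loaders):
    # Try each loader in priority order; the first one that doesn't
    # raise wins. Returns None when every loader fails.
    for loader in loaders:
        try:
            return loader()
        except Exception:
            continue
    return None
```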
Signed-off-by: kingbri <bdashore3@proton.me>
Allows for adjustment of reservation space at the end of the context
before rolling it. This should be scaled as a model's max_seq_len
goes up.
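A sketch of the reservation check (names are illustrative):

```python
def should_roll_context(context_len: int, max_seq_len: int, reserve: int) -> bool:
    # Roll the context once fewer than `reserve` tokens remain
    # free at the end for generation.
    return context_len + reserve > max_seq_len
```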
Signed-off-by: kingbri <bdashore3@proton.me>
Allow users to switch the currently overridden samplers via the API
so a restart isn't required to switch the overrides.
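A rough FastAPI sketch; the route path and payload shape are
assumptions:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
active_overrides: dict = {}  # stand-in for the real override store

class OverrideSwitchRequest(BaseModel):
    preset: str

@app.post("/v1/sampling/override/switch")  # hypothetical route
async def switch_overrides(data: OverrideSwitchRequest):
    # A real implementation would load the preset file from disk
    # and swap the active override table without a restart.
    active_overrides["preset"] = data.preset
    return {"preset": data.preset}
```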
Signed-off-by: kingbri <bdashore3@proton.me>
Unify API sampler params into a superclass which should make them
easier to manage and inherit generic functions from.
Not all frontends expose every sampling parameter due to their ties
to OAI (which handles sampling itself, with the exception of a few
sliders).
Add the ability for the user to customize fallback parameters
server-side.
In addition, parameters can be forced to a certain value server-side
in case a frontend automatically sets other sampler values in the
background that the user doesn't want.
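A condensed sketch of the fallback/force flow; the override tables and
field names are assumptions:

```python
from typing import Optional
from pydantic import BaseModel

# Hypothetical server-side tables keyed by sampler field name.
FALLBACK_OVERRIDES = {"temperature": 0.8}
FORCED_OVERRIDES = {"repetition_penalty": 1.0}

class BaseSamplerRequest(BaseModel):
    temperature: Optional[float] = None
    repetition_penalty: Optional[float] = None

    def resolved(self) -> dict:
        # Fill unset fields from the fallbacks, then force any
        # server-mandated values over what the client sent.
        values = self.model_dump()
        for key, fallback in FALLBACK_OVERRIDES.items():
            if values.get(key) is None:
                values[key] = fallback
        values.update(FORCED_OVERRIDES)
        return values
```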
Signed-off-by: kingbri <bdashore3@proton.me>
Move common functions into their own folder and refactor the backends
to use their own folder as well.
Also clean up imports and alphabetize the import statements themselves.
Finally, move colab and docker into their own folders as well.
Signed-off-by: kingbri <bdashore3@proton.me>
Helps with understanding API aliases. These aliases should not be
used but are helpful for developers who want frontend compatibility.
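A sketch of how an alias might be declared under Pydantic v2 (the
alias pairing is illustrative):

```python
from pydantic import AliasChoices, BaseModel, Field

class SamplerParams(BaseModel):
    # "typical" is the canonical field; "typical_p" is accepted for
    # frontends that use the other spelling.
    typical: float = Field(
        default=1.0,
        validation_alias=AliasChoices("typical", "typical_p"),
    )
```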
Signed-off-by: kingbri <bdashore3@proton.me>