Jobs should be started and immediately cleaned up when calling the
generation stream. Expose a stream_generate function and add it to
the base class, since it's more idiomatic than generate_gen.
The exl2 container's generate_gen function is now internal.
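
A minimal sketch of one reading of this change, not the actual container code: the public stream_generate registers a job, delegates to the now-internal generator, and removes the job when the stream ends or is closed. SketchContainer, active_jobs, request_id, and the whitespace "tokenizer" are hypothetical names used only for illustration.

```python
import asyncio
from typing import AsyncIterator, Dict


class SketchContainer:
    """Hypothetical stand-in for a model container."""

    def __init__(self) -> None:
        # Assumed bookkeeping: request id -> prompt of the running job.
        self.active_jobs: Dict[str, str] = {}

    async def _generate_gen(self, prompt: str) -> AsyncIterator[str]:
        # Internal generator; splitting on whitespace is a placeholder
        # for real token generation.
        for token in prompt.split():
            await asyncio.sleep(0)
            yield token

    async def stream_generate(
        self, request_id: str, prompt: str
    ) -> AsyncIterator[str]:
        # Public entry point: start the job, stream chunks, always clean up.
        self.active_jobs[request_id] = prompt
        try:
            async for chunk in self._generate_gen(prompt):
                yield chunk
        finally:
            # Runs when the stream is exhausted, raises, or is closed
            # with aclose().
            self.active_jobs.pop(request_id, None)
```

A caller that stops reading early should call aclose() on the stream so the cleanup in the finally block runs right away instead of at garbage collection.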
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
If a model is being unloaded, that means it's being shut down and
no requests should be accepted from then on.
Also, remove model_is_loaded since we simply check if the container
is None now.
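
A rough sketch of the check described above, under assumptions: the `unloading` attribute, check_model_available, and the error messages are hypothetical and may not match the real code.

```python
import asyncio
from typing import Optional


class SketchContainer:
    """Hypothetical container that tracks whether it is shutting down."""

    def __init__(self) -> None:
        self.unloading = False  # assumed flag name

    async def unload(self) -> None:
        # Set the flag before releasing anything so new requests are refused.
        self.unloading = True
        await asyncio.sleep(0)  # placeholder for freeing model resources


def check_model_available(
    container: Optional[SketchContainer],
) -> SketchContainer:
    # A None container replaces a separate model_is_loaded helper.
    if container is None:
        raise RuntimeError("No model is loaded.")
    if container.unloading:
        raise RuntimeError("Model is unloading; requests are not accepted.")
    return container
```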
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Use Infinity as a separate backend and handle the model within the
common module. This separates the embeddings model from the endpoint,
which allows for model loading/unloading in core.
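
An illustrative sketch of the layout only, with a placeholder backend standing in for Infinity: the container lives at module level in a shared module, and endpoints would call into it rather than own the model. All names here (SketchEmbeddingsContainer, load_embedding_model, unload_embedding_model, embeddings_loaded) are hypothetical, and no Infinity API is used.

```python
import asyncio
from typing import List, Optional


class SketchEmbeddingsContainer:
    """Placeholder backend; a real one would wrap the embeddings engine."""

    def __init__(self, model_name: str) -> None:
        self.model_name = model_name
        self.loaded = False

    async def load(self) -> None:
        self.loaded = True

    async def unload(self) -> None:
        self.loaded = False

    async def embed(self, texts: List[str]) -> List[List[float]]:
        # Dummy one-dimensional vectors based on text length.
        return [[float(len(text))] for text in texts]


# Module-level state, so load/unload is handled outside the endpoint code.
embeddings_container: Optional[SketchEmbeddingsContainer] = None


async def unload_embedding_model() -> None:
    global embeddings_container
    if embeddings_container is not None:
        await embeddings_container.unload()
        embeddings_container = None


async def load_embedding_model(model_name: str) -> SketchEmbeddingsContainer:
    global embeddings_container
    # Replace any previously loaded model before loading the new one.
    await unload_embedding_model()
    new_container = SketchEmbeddingsContainer(model_name)
    await new_container.load()
    embeddings_container = new_container
    return new_container


def embeddings_loaded() -> bool:
    return embeddings_container is not None
```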
Signed-off-by: kingbri <bdashore3@proton.me>