Generative AI: LLM Locally

One of the most fascinating recent breakthroughs has been in the realm of generative AI, particularly in models specialized in text.

These models, like an artist with a blank canvas, craft sentences, paragraphs, and stories, stitching together words in ways that were once the exclusive domain of human intellect.

No longer just tools for querying databases or executing commands, these AI models are akin to novelists, poets, and playwrights. Furthermore, they can write code and even create full projects on their own.

  • GPT-3 and GPT-4 from OpenAI are two of the most well-known LLMs. They are both large language models with billions of parameters, and they can be used for a variety of tasks, such as generating text, translating languages, and writing different kinds of creative content.
  • PaLM (Pathways Language Model) is a 540 billion parameter LLM from Google AI. It is one of the largest LLMs ever created, and it can perform a wide range of tasks, including question answering, coding, and natural language inference.
  • Other Models
    • LaMDA (Language Model for Dialogue Applications) is a 137 billion parameter LLM from Google AI. It is designed specifically for dialogue applications, such as chatbots and virtual assistants.
    • Chinchilla is a 70 billion parameter LLM from DeepMind. It was designed to make efficient use of its training compute, and it can be used for a variety of tasks, such as machine translation and text summarization.
    • LLaMA (Large Language Model Meta AI) is a 65 billion parameter LLM from Meta AI. It is designed to be more accessible than other LLMs, and it is available in smaller sizes that require less computing power. LLaMA has also spawned a number of open-source derivatives:
      • Vicuna is a 33 billion parameter LLM that is based on LLaMA. It is fine-tuned on a dataset of human conversations, and it is designed for dialogue applications.
      • Orca is a 13 billion parameter LLM that is based on LLaMA. It is designed to be efficient and easy to use, and it can be used for a variety of tasks, such as text generation, translation, and question answering.
      • Guanaco is a family of LLMs that are based on LLaMA. They come in a variety of sizes, from 7 billion to 65 billion parameters. They are designed for a variety of tasks, such as machine translation, question answering, and natural language inference.
    • Looking for more open-source models? Have a look at Falcon as well.

While the promise of this technology sounds almost like science fiction and there’s considerable hype surrounding it, there’s truly no better way to understand its capabilities than to experience it firsthand. So, why merely read about it when you can delve into its intricate workings yourself?

Let’s demystify the buzz and see what these models are genuinely capable of. In this post, I’ll guide you through interacting with these state-of-the-art LLMs locally. The best part? You can do it for free, using just the CPU.

Installing LLMs locally: Vicuna

We need an interface to use our LLMs, and there is a project that fits perfectly: Oobabooga’s text-generation-webui, which provides a Gradio Web UI.

In general, the instructions of this project work and are fairly easy to replicate, but I decided to simplify the dependency setup with Docker. So what you will need is:

SelfHosting TextGenerationWebUI

I already built the image and pushed it to Docker Hub, so you can skip the fairly long wait while the dependencies install.

The docker-compose / Stack to use in Portainer is as simple as this:

version: '3'

services:
  genai_ooba: # service name assumed (missing in the original); it mirrors container_name
    image: fossengineer/oobabooga_cpu
    container_name: genai_ooba
    ports:
      - "7860:7860"
    working_dir: /app
    command: tail -f /dev/null #keep it running
    volumes: #Choose your way: uncomment one of the lines below
     # - C:/Path/to/Models/AI/Docker_Vol:/app/text-generation-webui/models
     # - /home/AI_Local:/app/text-generation-webui/models
     # - appdata_ooba:/app/text-generation-webui/models

# volumes:
#   appdata_ooba:
I want to build my own 🐳

For CPU, this is what you need:

pip3 install torch torchvision torchaudio --index-url
pip install -r requirements_cpu_only.txt
#python --listen

Which can be captured with:

# Use the specified Python base image
FROM python:3.11-slim

# Set the working directory in the container
WORKDIR /app

# Install necessary packages
RUN apt-get update && apt-get install -y git

# Install PyTorch, torchvision, and torchaudio
RUN pip3 install torch torchvision torchaudio --index-url

# Clone the private repository
RUN git clone

WORKDIR /app/text-generation-webui

# Install additional Python requirements from a requirements file
#COPY requirements_cpu_only.txt .
RUN pip install -r requirements_cpu_only.txt  

#podman build -t textgenwebui .

Then use this stack to deploy:

version: '3'

services:
  textgenwebui: # service name assumed (missing in the original); it mirrors container_name
    image: textgenwebui
    container_name: textgenwebui
    ports:
      - "7860:7860"
    working_dir: /app/text-generation-webui
    command: python --listen #tail -f /dev/null #keep it running
    volumes: #Choose your way
     # - C:\Users\user\Desktop\AI:/app/text-generation-webui/models
     # - /home/AI_Local:/app/text-generation-webui/models
      - appdata_ooba:/app/text-generation-webui/models

volumes:
  appdata_ooba:


This will spin up a docker container with Python and Oobabooga’s Web UI dependencies already installed.

Inside this container, we are missing just one thing: the LLM models. Download one to your PC and set up the proper bind volume in the Docker YAML file above, so that the container is able to see the .bin files.
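
Before deploying, it can help to confirm that the folder you plan to bind really contains model files. The snippet below is only a minimal Python sketch and not part of the Web UI: the helper name find_model_files and the list of extensions are assumptions made for illustration, and /home/AI_Local is simply one of the example paths from the compose file.

```python
from pathlib import Path

# Extensions treated as model weight files in this sketch (an assumption, adjust as needed)
MODEL_SUFFIXES = {".bin", ".gguf"}


def find_model_files(models_dir):
    """Return the names of model files found directly inside models_dir, sorted."""
    root = Path(models_dir)
    if not root.is_dir():
        # A missing folder simply means there is nothing to mount yet
        return []
    return sorted(
        p.name
        for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in MODEL_SUFFIXES
    )


if __name__ == "__main__":
    # Example host path taken from the compose file; replace it with your own folder
    for name in find_model_files("/home/AI_Local"):
        print(name)
```

If the script prints nothing, the folder is missing or holds no matching files, and the container would start with an empty models folder.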

Adding an LLM Model

  • You can try GGUF models: they are a single file and should be placed directly into the models folder.
    • GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML.
    • It is possible to convert from HF/GGML/LoRA to .gguf
  • The remaining model types (like 16-bit transformers models and GPTQ models) are made of several files and must be placed in a subfolder.
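
To double check that a downloaded file really is GGUF before placing it into models, one option is to read its first four bytes, which the GGUF format defines as the ASCII characters GGUF. The following is a minimal Python sketch under that assumption; the helper name is_gguf is hypothetical and is not part of llama.cpp or the Web UI.

```python
GGUF_MAGIC = b"GGUF"  # a GGUF file starts with these four bytes


def is_gguf(path):
    """Return True if the file at path begins with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC


if __name__ == "__main__":
    import sys

    # Usage: python check_gguf.py path/to/model-file
    for candidate in sys.argv[1:]:
        print(candidate, "looks like GGUF" if is_gguf(candidate) else "is not GGUF")
```

This only inspects the header, so it tells you the container format and nothing about whether the weights themselves are complete or loadable.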
We can try, for example, with Vicuna:
  • Go to HuggingFace and download one of the models:
    • I tried it with ggml-vic7b-uncensored-q5_1.bin
  • Deploy the YAML above, binding the folder on your system that contains the .bin file
  • Then execute: conda init bash
  • Restart the interactive terminal and execute the following:
conda activate textgen 
cd text-generation-webui
python --listen

With those commands we activated the conda textgen environment, navigated to the folder where all the action happens, and started the Python server (when running it inside a Docker container we need the --listen flag).
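
Once the server is running, a quick way to confirm that the published port answers is to attempt a plain TCP connection from the host. This is a minimal Python sketch: the helper name is_port_open is made up for illustration, and it assumes the 7860:7860 port mapping from the compose files above.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False


if __name__ == "__main__":
    # 7860 is the host port published in the compose files
    print("Web UI reachable:", is_port_open("localhost", 7860))
```

A successful connection only shows that something is listening on the port; opening the address in a browser is still the real test of the Web UI.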

TextGenWebUI Local UI


Ways to Evaluate LLMs

How to Safely Try LLMs with Docker?

You can use a Python container and install the dependencies in a fresh environment with:

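version: '3'

services:
  python-dev: # service name assumed (missing in the original); it mirrors container_name
    image: python:3.11-slim
    container_name: python-dev
    command: tail -f /dev/null
    volumes:
      - python_dev:/app
    working_dir: /app  # Set the working directory to /app
    ports:
      - "8501:8501"

volumes:
  python_dev: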


A Detailed Video to use TextGenWebUI with Docker