AI Apps are cool. Understanding RAGs and VectorDB

By the end of this post, you will have few ideas to build alternatives to: https://notebooklm.google/

F/OSS Vector DBs

Vector databases are specialized databases designed to store and efficiently query vector embeddings.

These embeddings are numerical representations of data (like text, images, audio, or even user behavior) that capture the semantic meaning or relationships between the data points.

Instead of storing data as raw text or structured tables, vector databases store them as these multi-dimensional vectors.

Why they’re useful?

Semantic Search: They allow you to search for data based on meaning rather than exact keywords. For example, you could search for “pictures of cats” and the database would return images of cats even if they weren’t explicitly tagged with those words, because the vector embeddings would capture the visual similarity.
Similarity Search: They excel at finding data points that are similar to each other. This is useful for recommendations (e.g., “users who bought this also bought…”), clustering, and anomaly detection.
Machine Learning Applications: They are essential for many machine learning tasks, as they provide an efficient way to store and retrieve the vector representations generated by models.

In short, vector databases allow you to search and analyze data based on its meaning and relationships rather than just its literal content.

This opens up a wide range of possibilities for applications that require understanding the underlying semantics of data.

Some popular (and OSS) vector DBs?

https://www.pinecone.io/learn/vector-database/

The faiss Site
- The faiss Source Code at Github
  - License: MIT ❤️

And we also have few more, like ChromaDB or QDrant!

Chroma

The ChromaDB Site
The ChromaDB Source Code at Github
- License: Apache v2 ✅

Ive already made a guide to setup Chroma DB with Docker

QDrant Vector Database with Docker

You can just do:

docker pull qdrant/qdrant
docker run -p 6333:6333 qdrant/qdrant

Or better, with the docker-compose.yml, to spin a qdrant service:

version: '3'
services:
  qdrant:
    container_name: my_qdrant_container
    image: qdrant/qdrant
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/path/to/qdrant_data

volumes:
  qdrant_data:

Check Qdrant UI at: http://localhost:6333/dashboard#

Vector Admin

Vector Admin is a project aimed at providing a user-friendly web interface for managing and visualizing data within vector databases.

It’s designed to simplify the interaction with these specialized databases, which can often be complex to query and administer directly.

Purpose: The primary goal is to make vector databases more accessible to users who may not be experts in database administration or complex query languages. It provides a visual way to explore and manipulate vector data.
Functionality: Typical features of a vector admin project like this would include:
- Data Visualization: Displaying vector data in a way that’s understandable to humans, often using techniques like dimensionality reduction (e.g., t-SNE, UMAP) to project high-dimensional vectors onto 2D or 3D space.
- Querying: Providing a way to search the vector database using both keyword search and similarity search (finding vectors close to a given vector). Ideally, this would be a more intuitive interface than writing raw database queries.
- Data Management: Tools for importing, exporting, and managing vector data, including potentially the ability to create, update, and delete vectors.
- Metadata Management: Allowing users to associate metadata (additional information) with the vectors, which can be useful for filtering, searching, and understanding the data.
- Configuration and Monitoring: Providing access to basic configuration settings for the vector database and tools to monitor its performance.
Vector Admin Setup

F/OSS RAGs

RAG (Retrieval-Augmented Generation) frameworks are a type of natural language processing system that combines information retrieval and language generation techniques.

These frameworks aim to improve the quality and relevance of generated text by leveraging external knowledge sources.

In a RAG framework, when a user poses a question or provides a prompt, the system first retrieves relevant information from a large corpus of text data.

The retrieved information is then used to augment the input prompt, providing additional context and knowledge to the language generation model.

LangChain - Probably the most popular RAG Framework
LlamaIndex - This is the default RAG framework of PrivateGPT
EmbedChain
PandasAI
MemGPT

LangChain

LangChain is a framework designed to simplify the development of applications powered by Large Language Models (LLMs).

Think of it as a toolkit that provides building blocks and tools to connect LLMs to other components, allowing you to create more sophisticated and useful LLM applications.

LangChain helps LLMs interact with everything.

It’s a quite broad in scope. Providing components for:

Chains: These are sequences of actions. A chain might involve prompting the LLM, then using the LLM’s output to query a database, and then using that result to prompt the LLM again. LangChain makes it easy to create these complex workflows. This is a core concept in LangChain.
Agents: Agents use LLMs to decide which tools to use to accomplish a task. A tool could be anything: a search engine, a calculator, a database connection, or even another LLM. LangChain provides tools and a framework for agents to use them.
Memory: LangChain helps manage the “memory” of a conversation or interaction. This is essential for building conversational applications where the LLM needs to remember past interactions.
Prompts: LangChain provides tools for managing and designing prompts. This includes prompt templates, which make it easier to create effective and consistent prompts.
Integrations: LangChain integrates with various LLMs (OpenAI, Hugging Face, etc.), vector databases (Chroma, Pinecone, etc.), and other tools.

Does LangChain have similar “packs” or a “hub”?

LangChain’s integrations are generally built directly into the core library.

There isn’t a separate “hub” for community-contributed integrations in the same way as LlamaHub.

How to setup LangChain with Python ⏬

Using LangChain ⏬

Other RAG Frameworks

LLamaIndex

I got to know LlamaIndex RAG thanks to the PrivateGPT project, which uses it as default RAG.

LlamaIndex is a framework that makes it easier to use Large Language Models (LLMs) with your own data.

It provides tools for:

Data Indexing: Structuring your data (documents, PDFs, etc.) into a format that LLMs can easily understand.
Querying: Asking questions about your data in natural language and getting relevant answers.
Building LLM Apps: Creating applications that can access and reason over your data.

It acts as a bridge between LLMs and your information, enabling you to build powerful data-driven applications.

What are LlamaPacks? ⏬

https://blog.llamaindex.ai/introducing-llama-packs-e14f453b913a

Imagine you want to build a LlamaIndex application that can answer questions about your company’s internal documents stored in a Google Drive. Instead of writing all the code from scratch to connect to Google Drive, fetch the documents, index them, and then query them, you could use a LlamaPack.

A LlamaPack is a pre-built, reusable component that handles the integration with a specific data source or tool.

It bundles together:

Code: The Python code necessary to connect to the data source (e.g., Google Drive API), load the data, and format it for LlamaIndex.
Configurations: Any necessary configuration settings or parameters.

LlamaHUB - Finding LlamaPacks ⏬

https://llamahub.ai/

Now, where do you find these LlamaPacks?

That’s where LlamaHub comes in.

LlamaHub is a central repository or directory of LlamaPacks.

It’s a collection of pre-built integrations for various data sources, APIs, and tools.

Think of it like a marketplace or a library where you can find and download LlamaPacks that suit your needs.

Can I use LLamaIndex with Open Source? Yes, Together with Ollama

MemGPT

Solving LLMs context Window limitation, with MemGPT.

The memGPT Code at Github
- License: MIT ❤️

Create LLM agents with long-term memory and custom tools 📚🦙

Mem0 - exEmbedChain

Mem0, ex-EmbedChain

PandasAI

PandasAI

More about Pandas AI ⏬

PandasAI lets you interact with Pandas DataFrames using natural language.

It leverages Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) to make data analysis conversational.

PandasAI adds a layer of functionality to Pandas DataFrames, enabling you to query and manipulate data using natural language prompts. It connects to various LLMs (OpenAI, Ollama, etc.) to understand your requests and translate them into Pandas operations.

How to Use PandasAI (Brief Example)

Install: pip install pandasai

Import:

import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI # Or other LLM class

Load Data:
```
df = pd.read_csv("your_data.csv")
```

Initialize:

llm = OpenAI(api_token="your_api_key") # Replace with your API key
pandas_ai = PandasAI(llm)

Interact:

pandas_ai.run(df, prompt="What is the average value of column X?")
pandas_ai.run(df, prompt="Plot a histogram of column Y.")

PandasAI translates your natural language prompts into Pandas code and executes it on your DataFrame, returning the results.

You can use it for data exploration, cleaning, transformation, and visualization.

Remember to configure your LLM API key correctly!

Conclusions

People are building a lot of cool things.

For example:

https://github.com/run-llama/chat-llamaindex

Create chat bots that know your data

Even F/OSS No Code RAGs

Like LangFlow.

It uses LangChain in the background.

But provides a cool UI interface and you can set it up with docker:

FAQ

How to Process Unstructured Data

https://github.com/iterative/datachain
- https://github.com/iterative/datachain?tab=Apache-2.0-1-ov-file#readme ✅
- https://pypi.org/project/datachain/

DataChain 🔗 Process and curate unstructured data using local ML models and LLM calls

How to run LLMs Locally

https://www.youtube.com/watch?v=5WCvGyPpWwg

Ollama
Oobabooga - Text Gen Web UI
GPT4All
PrivateGPT
GPT4All
KoboldCpp
LLamaCPP: you need to build it from source + use GGUF format

Confused with Python Dependencies

What are LangChains?

They allow to connect an LLM to our own sources of data. It will have referenced data. It can also take actions for us (like send email).

Document -> Document Chunks -> VectorStore

https://www.youtube.com/watch?v=aywZrzNaKjs

https://www.langchain.com/ You can use it from Python or JS.

Chat with Web - https://www.youtube.com/watch?v=bupx08ZgSFg

Get your LLM application from prototype to production.

A great YT List: https://www.youtube.com/playlist?list=PL4HikwTaYE0GEs7lvlYJQcvKhq0QZGRVn

Retrieval Chains

https://www.youtube.com/watch?v=-Ueh5XBpcoY&list=PL4HikwTaYE0GEs7lvlYJQcvKhq0QZGRVn&index=4
- https://python.langchain.com/docs/modules/data_connection/
  - https://python.langchain.com/docs/integrations/document_loaders/web_base
  - https://github.com/leonvanzyl/langchain-python-tutorial/tree/lesson-4

To explore vector DBs we have Vector Admin, but for regular DB’s we have WhoDB

A lightweight next-gen database explorer - Postgres, MySQL, SQLite, MongoDB, Redis, MariaDB & Elastic Search

Welcome to WhoDB – a powerful, lightweight (~20Mi), and user-friendly database management tool that combines the simplicity of Adminer with superior UX and performance. WhoDB is written in GoLang!

What are Embedding Models?

Embedding models are algorithms or neural networks that transform data into a numerical representation called an embedding.

This embedding is a vector (a list of numbers) that captures the semantic meaning or relationships within the data. The key idea is that similar data points will have embeddings that are close to each other in vector space.

Here’s a breakdown:

Input Data: Embedding models take various types of data as input, including:
- Text: Words, sentences, paragraphs, or entire documents.
- Images: Pixel data or features extracted from images.
- Audio: Sound waves or audio features.
- Video: Frames or video features.
- Other Data: Even things like user behavior, product ratings, or sensor readings can be converted into embeddings.
The Model: The embedding model processes this input data and generates a vector as output. This vector is the embedding. Different models use different techniques to create these embeddings, but the goal is always to capture the underlying meaning or relationships.
Vector Space: The resulting vectors exist in a multi-dimensional space called “vector space.” Each dimension of the vector represents a different feature or characteristic of the data. The position of the vector in this space encodes the semantic information.
Similarity: The crucial property of embeddings is that the distance between two vectors in vector space reflects the similarity between the corresponding data points. Similar items will have embeddings that are close together, while dissimilar items will have embeddings that are far apart. Distance is typically measured using metrics like cosine similarity or Euclidean distance.

Why are embedding models useful?

Semantic Search: They enable search based on meaning, not just keywords. You can search for “red sports car” and get results for images of red sports cars even if they aren’t labeled with those exact words.
Recommendation Systems: They power recommendation engines by finding items that are similar to what a user has liked or interacted with in the past.
Clustering: They allow you to group similar data points together based on their embeddings.
Natural Language Processing (NLP): They are fundamental to many NLP tasks, such as machine translation, sentiment analysis, and question answering.
Image Recognition: They are used in computer vision for tasks like image classification, object detection, and image retrieval.

Examples of embedding models:

Word2Vec (for text): A classic model that learns word embeddings by predicting words based on their context.
GloVe (for text): Another popular word embedding model that leverages global word-word co-occurrence statistics.
FastText (for text): An extension of Word2Vec that can handle out-of-vocabulary words and character-level information.
BERT (for text): A powerful transformer-based model that produces contextualized word embeddings, meaning the embedding of a word depends on its surrounding words.
ResNet (for images): A deep convolutional neural network used for image feature extraction, often used to generate image embeddings.
CLIP (for images and text): A model that learns joint embeddings for images and text, allowing for zero-shot image classification and text-based image search.

Embedding models are essential tools for representing data in a way that captures its semantic meaning, enabling a wide range of applications that rely on understanding relationships and similarities between data points. They are a core component of many modern AI systems.

F/OSS Vector DBs#

Chroma#

QDrant Vector Database with Docker#

Vector Admin#

F/OSS RAGs#

LangChain#

Other RAG Frameworks#

LLamaIndex#

MemGPT#

Mem0 - exEmbedChain#

PandasAI#

Conclusions#

FAQ#

How to Process Unstructured Data#

How to run LLMs Locally#

What are LangChains?#

Retrieval Chains#

What are Embedding Models?#