
AlexB23

Christian
CF Ambassadors
Site Supporter
Aug 11, 2023
11,388
7,698
25
WI
✟644,468.00
Country
United States
Faith
Christian
Marital Status
Single
Hello folks. Sometime tomorrow, I will be testing out a local AI on my computer that can retrieve information from documents on my laptop.

Retrieval-Augmented Generation (RAG) lets an AI use your own documents as a source for its output. After plugging in the Holy Bible, a Catholic catechism, and a book covering 2,000 years of Church history, I am hoping this new software can improve my devotionals. If not, it can still work as a regular AI with RAG mode turned off.

RAG info: What is retrieval-augmented generation (RAG)?

My computer is running GPT4All v3.0, along with Mistral 7B v0.3 (a step up from v0.2). All of this runs offline and does not need an internet connection.
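For anyone curious what RAG actually does under the hood, here is a toy sketch in Python. This is not GPT4All's implementation: real systems (including GPT4All's LocalDocs feature) retrieve passages by embedding similarity, while this sketch uses plain word overlap, and the library, query, and function names are all made up for illustration.

```python
# A toy version of the RAG loop: retrieve the most relevant passage from a
# local "library", then prepend it to the question before it reaches the model.
# Real systems retrieve by embedding similarity; this sketch uses plain word
# overlap purely for illustration.

def score(query: str, passage: str) -> int:
    """Count how many words the query and passage share."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def retrieve(query: str, library: list[str]) -> str:
    """Pick the passage that shares the most words with the query."""
    return max(library, key=lambda passage: score(query, passage))

def build_prompt(query: str, library: list[str]) -> str:
    """Assemble the augmented prompt a local model would receive."""
    context = retrieve(query, library)
    return f"Context: {context}\n\nQuestion: {query}\nAnswer using only the context above."

# A two-document "library" standing in for the Bible, catechism, etc.
library = [
    "Genesis 1:1 In the beginning, God created the heavens and the earth.",
    "John 1:1 In the beginning was the Word, and the Word was with God.",
]

print(build_prompt("Who created the heavens and the earth?", library))
```

The point is that the model itself never changes; RAG only changes what text gets stuffed into the prompt, which is why the feature can simply be switched off.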

 

AlexB23

Alright, the system works: it now references a catechism for the Catholics. Next, I am adding medical textbooks to the AI to see if it knows basic medical science. Maybe later on I could add some Orthodox and Protestant documents as well. Welcome to the future, folks, where artificial intelligence can run offline on an airplane or in a forest. Imagine a biologist in a remote area asking the AI to identify a rare species of plant* or animal from an uploaded image. The AI on my computer is not capable of image recognition, but other models are.


*Plant identification tool (offline): Pl@ntNet offline/embedded: identify plants anywhere without connection - Pl@ntNet


Catechism PDF added to AI. :)
 

elytron

Well-Known Member
Oct 5, 2008
480
208
California
✟349,467.00
Country
United States
Gender
Male
Faith
Christian
Marital Status
Single
Right now I am sort of wary of installing an AI on my main computer. Maybe if I had a spare desktop I would experiment.

In the meantime I will just stick with the online ones, which don't have access to my files.
 

AlexB23

elytron said:
Right now I am sort of wary of installing an AI on my main computer. Maybe if I had a spare desktop I would experiment.

In the meantime I will just stick with the online ones, which don't have access to my files.
That is understandable. :) For me, it is the opposite: I am wary of any online AI, since online services may collect telemetry. What type of desktop do you have? I have a Microsoft Surface Book 2, which my parents got me for college; I used it in 2018 and 2019, and it still works well in 2024. I can also prevent the AI from having access to my files, because the RAG mode can be turned off or simply not used.

This is the AI explaining the Non Expedit, a somewhat confusing piece of Catholic history; I am Catholic myself, and someone asked about it in the Catholic section of the forum.

 

AlexB23

Just installed appliance instruction manuals onto the AI, so if anything goes wrong, I can troubleshoot my appliances in the future. Luckily, most of my appliances are new, so there will be no issues for a while.
 

The Liturgist

Traditional Liturgical Christian
Site Supporter
Nov 26, 2019
15,304
8,050
50
The Wild West
✟743,698.00
Country
United States
Gender
Male
Faith
Generic Orthodox Christian
Marital Status
Celibate
AlexB23 said:
That is understandable. :) For me, it is the opposite: I am wary of any online AI, since online services may collect telemetry. What type of desktop do you have? I have a Microsoft Surface Book 2, which my parents got me for college; I used it in 2018 and 2019, and it still works well in 2024. I can also prevent the AI from having access to my files, because the RAG mode can be turned off or simply not used.

This is the AI explaining the Non Expedit, a somewhat confusing piece of Catholic history; I am Catholic myself, and someone asked about it in the Catholic section of the forum.

I have some vintage IBM and Sun Microsystems workstations with a mixture of POWER (like PowerPC Macs, but more powerful, with more instructions) and UltraSPARC RISC CPUs, plus Intel x64 machines: some newer workstations and laptops, mainly from Lenovo, a single x64 iMac from 2018, a few older Dell machines, and one HP laptop that does a very good job with cooling.
 

AlexB23

The Liturgist said:
I have some vintage IBM and Sun Microsystems workstations with a mixture of POWER (like PowerPC Macs, but more powerful, with more instructions) and UltraSPARC RISC CPUs, plus Intel x64 machines: some newer workstations and laptops, mainly from Lenovo, a single x64 iMac from 2018, a few older Dell machines, and one HP laptop that does a very good job with cooling.
Those vintage systems may struggle with large AI models; your newer computers should fare better. :) I recommend a computer made in the past five years with more than 16 GB of RAM for smaller AI models, and more RAM for larger ones. You would need a small mainframe to run Llama 3.1 405B.
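A rough way to sanity-check those RAM numbers: the model's weights alone need roughly (parameter count × bytes per weight), and quantization shrinks the bytes per weight. This back-of-envelope helper is my own sketch, not part of any library, and it ignores the KV cache and runtime overhead, so treat the results as lower bounds:

```python
# Back-of-envelope RAM estimate for local LLMs: the weights alone need roughly
# (parameter count x bytes per weight), and quantization shrinks bytes per
# weight. KV cache and runtime overhead are ignored, so these are lower bounds.

def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory needed for model weights, in gigabytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Mistral 7B at the 4-bit quantization typical of GPT4All downloads:
print(weight_gb(7, 4))     # 3.5 GB of weights: fits a 16 GB laptop
# Llama 3.1 405B at 16 bits per weight:
print(weight_gb(405, 16))  # 810.0 GB of weights: server-class territory
```

This is why a 4-bit 7B model runs comfortably on an ordinary laptop while a 405B model needs hundreds of gigabytes of memory before any overhead is counted.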
 

AlexB23

Update: Retrieval-Augmented Generation (RAG) has been a success. The AI struggles with RAG on political platform documents for some reason (it thinks the year 2024 is still in the future), but it does much better with religious documents (theology) and sports rule books.
 

The Liturgist

AlexB23 said:
Those vintage systems may struggle with large AI models; your newer computers should fare better. :) I recommend a computer made in the past five years with more than 16 GB of RAM for smaller AI models, and more RAM for larger ones. You would need a small mainframe to run Llama 3.1 405B.

Don't underestimate the power of my RISC systems: they are massively multithreaded, which is good for AI workloads. The same is true of the IBM Cell CPU. And the upshot is not having to hammer a GPU.

That being said, admittedly a modern CPU with AI-optimized cores could eat their lunch, which is why I'm planning either a new workstation with dedicated AI cores or refurbishing some newish rackmount servers for this purpose.

I prefer servers with remote management, whether rackmount or tower, since I really don't want to have to dig into the OS internals. I would probably use late-model Cisco UCS, HPE, or Lenovo servers.
 

AlexB23

The Liturgist said:
Don't underestimate the power of my RISC systems: they are massively multithreaded, which is good for AI workloads. The same is true of the IBM Cell CPU. And the upshot is not having to hammer a GPU.

That being said, admittedly a modern CPU with AI-optimized cores could eat their lunch, which is why I'm planning either a new workstation with dedicated AI cores or refurbishing some newish rackmount servers for this purpose.

I prefer servers with remote management, whether rackmount or tower, since I really don't want to have to dig into the OS internals. I would probably use late-model Cisco UCS, HPE, or Lenovo servers.
Interesting stuff. RISC-V is the newest RISC architecture, an open instruction set developed in the 21st century.
 

abe99

Member
May 4, 2025
5
0
68
Vancouver
✟405.00
Country
Canada
Gender
Male
Faith
Christian
Marital Status
Married
This is really exciting. I had a similar thought about using AI for God's glory: putting the Bible, the catechism, and the writings of the Church Fathers and saints into a RAG system, or using them as raw training data for an LLM, so it could help with writing Bible studies and so on. I just put together my rig, which includes an RTX 5090 video card with 32 GB of VRAM, and I have downloaded various LLM models; now I'm researching RAG implementation vs. fine-tuning an existing model. I would love to collaborate with you to explore the possibilities.
 

The Liturgist

abe99 said:
This is really exciting. I had a similar thought about using AI for God's glory: putting the Bible, the catechism, and the writings of the Church Fathers and saints into a RAG system, or using them as raw training data for an LLM, so it could help with writing Bible studies and so on. I just put together my rig, which includes an RTX 5090 video card with 32 GB of VRAM, and I have downloaded various LLM models; now I'm researching RAG implementation vs. fine-tuning an existing model. I would love to collaborate with you to explore the possibilities.

I'd be interested in working with you on this. I've been using AI to translate ancient hymns and liturgical texts from the early Church. That said, I'm getting the best results with ChatGPT-4o due to its superior training data, and I've been evaluating 4.5 with even better results.
 

abe99

That sounds good. I've used ChatGPT to summarize theological documents and to write Bible studies, and I am really impressed by how much it can do.

The twist is that I'm looking to build an LLM that is specifically trained on Christian content and that will run on a local server that OpenAI does not control or charge money for. That's why I like the Retrieval-Augmented Generation (RAG) approach AlexB23 is talking about. It has great upside.

We should set up a working group that includes AlexB23 to explore this further.
 

The Liturgist

abe99 said:
That sounds good. I've used ChatGPT to summarize theological documents and to write Bible studies, and I am really impressed by how much it can do.

The twist is that I'm looking to build an LLM that is specifically trained on Christian content and that will run on a local server that OpenAI does not control or charge money for. That's why I like the Retrieval-Augmented Generation (RAG) approach AlexB23 is talking about. It has great upside.

We should set up a working group that includes AlexB23 to explore this further.

Unfortunately he hasn't logged in since January, and I don't know how to reach him. By the way, ChatGPT will actually help you run your own servers as long as they aren't competing with ChatGPT. One area I am working on is research into AI behavior and personality, which I prefer to do on my own hardware, because ChatGPT tends to occasionally reset sessions, and the resulting context loss temporarily interferes with programmed idiosyncratic behavior. That said, it remains to be seen whether more limited hardware resources and training data will produce the same results.
 

abe99

The Liturgist said:
Unfortunately he hasn't logged in since January, and I don't know how to reach him. By the way, ChatGPT will actually help you run your own servers as long as they aren't competing with ChatGPT. One area I am working on is research into AI behavior and personality, which I prefer to do on my own hardware, because ChatGPT tends to occasionally reset sessions, and the resulting context loss temporarily interferes with programmed idiosyncratic behavior. That said, it remains to be seen whether more limited hardware resources and training data will produce the same results.
That's too bad. I was hoping he would share the steps he took to implement his RAG. Oh well, onwards and upwards. Apparently there are two ways you can go with this: 1) a RAG application, as he did, or 2) actually training an existing model on the new Christian data. I would like to train an existing model on a local server to become free of the ChatGPT limitations above (as well as others). Having said that, why not use ChatGPT to help free ourselves of ChatGPT? So here goes. My question to ChatGPT: "In which data format would a copy of the Christian Bible need to be in order to use it to train an LLM model?" Answer:
To train a language model (LLM) using a copy of the Christian Bible, the data should be in a clean, machine-readable format. The ideal format depends on the tools and frameworks you're using for training, but the following formats are commonly used:

### 1. **Plain Text (TXT)**

* **Description**: The simplest and most straightforward format. The Bible's text is usually divided into chapters and verses.

* **Structure**: You would typically format the text in a way that the model can recognize the structure of the Bible (books, chapters, and verses). For example:

```
Genesis 1:1 In the beginning, God created the heavens and the earth.
Genesis 1:2 Now the earth was formless and empty...
```

* **Pros**: Easy to process, minimal overhead, widely supported.

* **Cons**: May need preprocessing to clean up and format the text appropriately, especially if you want to train with specific verse references, cross-references, or footnotes.

### 2. **JSON**

* **Description**: A more structured format that can store hierarchical data (book > chapter > verse). This format makes it easy to add metadata (e.g., commentary, translations, etc.) in the future.

* **Example structure**:

```json
{
"Genesis": {
"1": {
"1": "In the beginning, God created the heavens and the earth.",
"2": "Now the earth was formless and empty..."
},
"2": {
"1": "Thus the heavens and the earth were completed..."
}
}
}
```

* **Pros**: Flexible, can easily extend to include additional metadata, good for complex models that might want to learn structure.

* **Cons**: Slightly more complex than plain text; requires parsing.

### 3. **CSV (Comma-Separated Values)**

* **Description**: A tabular format where each row might represent a book, chapter, and verse, with the actual verse text as the value.

* **Example**:

```
Book,Chapter,Verse,Text
Genesis,1,1,"In the beginning, God created the heavens and the earth."
Genesis,1,2,"Now the earth was formless and empty..."
```

* **Pros**: Simple to use, especially if you want a straightforward lookup for each verse, can be easily handled in spreadsheets.

* **Cons**: Not as flexible as JSON or plain text for hierarchical data, may require additional processing for more advanced use cases.

### 4. **XML**

* **Description**: Another structured format similar to JSON, but with a more verbose syntax. XML is useful if you need to capture additional metadata or want a highly detailed hierarchical structure.

* **Example**:

```xml
<Bible>
<Book name="Genesis">
<Chapter number="1">
<Verse number="1">In the beginning, God created the heavens and the earth.</Verse>
<Verse number="2">Now the earth was formless and empty...</Verse>
</Chapter>
</Book>
</Bible>
```

* **Pros**: Extensible, widely used in legacy systems, and great for structured data.

* **Cons**: Verbose and might require more processing power for parsing compared to JSON.

### 5. **TFRecord (for TensorFlow-based models)**

* **Description**: A binary format optimized for TensorFlow. It’s designed to handle large datasets efficiently, especially when working with large-scale training tasks.
* **Pros**: Very efficient for large datasets, works seamlessly with TensorFlow.
* **Cons**: Not as human-readable, requires specific tools to work with.

### Data Preprocessing

Regardless of the format, you'll need to preprocess the Bible's text for training:

1. **Tokenization**: Split the text into tokens (e.g., words or sub-words), which is crucial for feeding the text into the LLM.
2. **Cleaning**: Ensure that any special characters, punctuation, or formatting inconsistencies are resolved.
3. **Metadata**: Depending on your goal, you might want to incorporate metadata like book names, chapter numbers, or verse references.
4. **Text Normalization**: Convert the text to lowercase or perform other normalization steps to avoid training on noise or irrelevant variations in the text.
5. **Split by Sentence/Verse/Chunk**: For context-based training, chunk the data into meaningful segments (e.g., by verse or sentence) to allow the model to learn the relationships between words and concepts effectively.

### Summary

If you're training a general-purpose LLM, **plain text (TXT)** or **JSON** is probably the most straightforward and commonly used format. If you want a more structured approach with metadata, **JSON** or **XML** could be better options. For large-scale training, **TFRecord** (for TensorFlow) might be a good choice if performance is a concern.
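The preprocessing ChatGPT describes can be sketched quickly. Here is a minimal parser for the plain-text verse layout in option 1; it is my own illustration, and it assumes one verse per line and single-word book names (books like "1 Corinthians" would need a looser pattern):

```python
# A minimal parser for the "Book Chapter:Verse text" plain-text layout,
# turning each line into a structured record ready for chunking or indexing.
# Assumes one verse per line and single-word book names.
import re

VERSE = re.compile(r"^(\w+)\s+(\d+):(\d+)\s+(.+)$")

def parse_verses(raw: str) -> list[dict]:
    """Parse plain-text Bible lines into book/chapter/verse records."""
    records = []
    for line in raw.strip().splitlines():
        match = VERSE.match(line.strip())
        if match:
            book, chapter, verse, text = match.groups()
            records.append({"book": book, "chapter": int(chapter),
                            "verse": int(verse), "text": text})
    return records

sample = """
Genesis 1:1 In the beginning, God created the heavens and the earth.
Genesis 1:2 Now the earth was formless and empty...
"""

for record in parse_verses(sample):
    print(record)
```

From records like these it is straightforward to emit any of the formats above (JSON, CSV, XML) or to chunk by verse for training or retrieval.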
 

timewerx

the village i--o--t--
Aug 31, 2012
16,669
6,330
✟367,996.00
Gender
Male
Faith
Christian Seeker
Marital Status
Single
abe99 said:
That sounds good. I've used ChatGPT to summarize theological documents and to write Bible studies, and I am really impressed by how much it can do.

The twist is that I'm looking to build an LLM that is specifically trained on Christian content and that will run on a local server that OpenAI does not control or charge money for. That's why I like the Retrieval-Augmented Generation (RAG) approach AlexB23 is talking about. It has great upside.

We should set up a working group that includes AlexB23 to explore this further.

I'm using GPT4All with the latest Llama 3 8B model to run LLM analysis on local documents, completely offline and completely free.

Llama 3 8B comes pre-trained by Meta, and I've been testing it on scriptures with some surprisingly interesting results. A DeepSeek model is also available in GPT4All, but I could not test it comprehensively due to very poor performance on my laptop, and its initial results on simple test data were not superior to Llama 3 8B.

However, my intention is not to automate the generation of Bible studies but simply to seek the ultimate truth. Thus, I would actually stay away from LLM models trained only on Christian content, because that would give them strong biases in favor of Christian literature or the mainstream agenda and fail to provide valuable investigative analysis of whatever crucially important things are going on at the time.
 

timewerx

abe99 said:
That's too bad. I was hoping he would share the steps he took to implement his RAG. Oh well, onwards and upwards.

The easiest way is to just download GPT4All and choose from the LLM models available within the software. You can then use any of these models on local documents on your computer (RAG).

No coding is needed, and the models available seem to be of very good quality. I've only tested a couple of the most popular models that work with GPT4All. My favorite so far is Llama 3 8B, which is pre-trained by Meta. A DeepSeek model is also available in GPT4All, but I somehow found its responses inferior to Llama 3 8B, and it is very slow on my machine. I haven't tested DeepSeek extensively enough to give a truly accurate review, though.

Download here:
 

abe99

Thanks for the info on GPT4All, but I am already running multiple models locally using Ollama, llama.cpp, and Open WebUI, which are all really good. You mentioned DeepSeek, and I would caution that at least one user on Reddit said he was robbed of more than $80,000 from his crypto wallet after clicking on a link supplied by DeepSeek. (I don't have the link right now; just do a search on Reddit.) I don't want to denigrate them, because they are offering some great open-source models, but I personally won't use them, because I believe the communist government has an agenda that extends to its surrogates, and I expected that something like the crypto theft was possible.

For those who are worried about using AI on their local systems, there is a way to protect yourself: sandboxing, i.e., closing off your AI models from the rest of your system by using Docker or similar software to run them. That's what I do.
 