That's too bad. I was hoping that he would share what steps he took to implement his RAG. Oh well, onwards and upwards. Apparently there are two ways you can go with this: (1) a RAG application, as he did, or (2) actually train an existing model on the new Christian data. I would like to train an existing model on a local server to become free of the above ChatGPT limitations (as well as others). Having said that, why not use ChatGPT to help free ourselves of ChatGPT? So here goes. My question to ChatGPT: "which data format would a copy of the Christian Bible need to be in in order to use it to train an LLM model" Answer:
To train a large language model (LLM) using a copy of the Christian Bible, the data should be in a clean, machine-readable format. The ideal format depends on the tools and frameworks you're using for training, but the following formats are commonly used:
### 1. **Plain Text (TXT)**
* **Description**: The simplest and most straightforward format. The Bible's text is usually divided into chapters and verses.
* **Structure**: You would typically format the text in a way that the model can recognize the structure of the Bible (books, chapters, and verses). For example:
```
Genesis 1:1 In the beginning, God created the heavens and the earth.
Genesis 1:2 Now the earth was formless and empty...
```
* **Pros**: Easy to process, minimal overhead, widely supported.
* **Cons**: May need preprocessing to clean up and format the text appropriately, especially if you want to train with specific verse references, cross-references, or footnotes.
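As a rough illustration, here is a minimal Python sketch that parses the one-verse-per-line layout shown above into (book, chapter, verse, text) tuples. The file name `bible.txt` and the regex are assumptions based on that example layout, not any standard:
```python
import re

# Matches the "Book Chapter:Verse text" layout shown above. The non-greedy
# book group also handles numbered books such as "1 Kings".
VERSE_RE = re.compile(r"^(.+?)\s+(\d+):(\d+)\s+(.*)$")

def load_verses(path="bible.txt"):  # bible.txt is a hypothetical file name
    verses = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            m = VERSE_RE.match(line.strip())
            if m:
                book, chapter, verse, text = m.groups()
                verses.append((book, int(chapter), int(verse), text))
    return verses
```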
### 2. **JSON**
* **Description**: A more structured format that can store hierarchical data (book > chapter > verse). This format makes it easy to add metadata (e.g., commentary or alternate translations) in the future.
* **Example structure**:
```json
{
  "Genesis": {
    "1": {
      "1": "In the beginning, God created the heavens and the earth.",
      "2": "Now the earth was formless and empty..."
    },
    "2": {
      "1": "Thus the heavens and the earth were completed..."
    }
  }
}
```
* **Pros**: Flexible, can easily extend to include additional metadata, good for complex models that might want to learn structure.
* **Cons**: Slightly more complex than plain text; requires parsing.
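If you go the JSON route, flattening the nested structure back into training-ready lines is just a few loops. A minimal sketch, assuming the structure above and a hypothetical `bible.json` file:
```python
import json

# Flatten the book > chapter > verse nesting shown above into
# "Book Chapter:Verse text" lines ready for a training corpus.
with open("bible.json", encoding="utf-8") as f:  # hypothetical file name
    bible = json.load(f)

lines = []
for book, chapters in bible.items():
    for chapter, verses in chapters.items():
        for verse, text in verses.items():
            lines.append(f"{book} {chapter}:{verse} {text}")

print(lines[0])  # -> Genesis 1:1 In the beginning, God created ...
```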
### 3. **CSV (Comma-Separated Values)**
* **Description**: A tabular format where each row might represent a book, chapter, and verse, with the actual verse text as the value.
* **Example**:
```
Book,Chapter,Verse,Text
Genesis,1,1,"In the beginning, God created the heavens and the earth."
Genesis,1,2,"Now the earth was formless and empty..."
```
* **Pros**: Simple to use, especially if you want a straightforward lookup for each verse, can be easily handled in spreadsheets.
* **Cons**: Not as flexible as JSON or plain text for hierarchical data, may require additional processing for more advanced use cases.
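Reading that layout takes only the standard-library `csv` module. A minimal sketch, with `bible.csv` as a hypothetical file name:
```python
import csv

# DictReader uses the header row (Book,Chapter,Verse,Text) as keys and
# correctly handles quoted Text fields that contain commas.
with open("bible.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

texts = [row["Text"] for row in rows]
print(rows[0]["Book"], rows[0]["Chapter"], rows[0]["Verse"])  # Genesis 1 1
```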
### 4. **XML**
* **Description**: Another structured format similar to JSON, but with a more verbose syntax. XML is useful if you need to capture additional metadata or want a highly detailed hierarchical structure.
* **Example**:
```xml
<Bible>
  <Book name="Genesis">
    <Chapter number="1">
      <Verse number="1">In the beginning, God created the heavens and the earth.</Verse>
      <Verse number="2">Now the earth was formless and empty...</Verse>
    </Chapter>
  </Book>
</Bible>
```
* **Pros**: Extensible, widely used in legacy systems, and great for structured data.
* **Cons**: Verbose and might require more processing power for parsing compared to JSON.
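Parsing this structure needs nothing beyond the standard library either. A minimal sketch using `xml.etree.ElementTree`, with `bible.xml` as a hypothetical file name:
```python
import xml.etree.ElementTree as ET

# Walk the <Bible>/<Book>/<Chapter>/<Verse> hierarchy shown above.
tree = ET.parse("bible.xml")
for book in tree.getroot().iter("Book"):
    for chapter in book.iter("Chapter"):
        for verse in chapter.iter("Verse"):
            ref = f"{book.get('name')} {chapter.get('number')}:{verse.get('number')}"
            print(ref, verse.text)
```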
### 5. **TFRecord (for TensorFlow-based models)**
* **Description**: A binary format optimized for TensorFlow. It’s designed to handle large datasets efficiently, especially when working with large-scale training tasks.
* **Pros**: Very efficient for large datasets, works seamlessly with TensorFlow.
* **Cons**: Not as human-readable, requires specific tools to work with.
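As a rough sketch of what writing verses to a TFRecord looks like (assuming TensorFlow is installed; the file name and the inline sample data are placeholders, and in practice you'd feed in verses from one of the parsers above):
```python
import tensorflow as tf

def verse_to_example(ref, text):
    # Serialize one verse as a tf.train.Example with two byte features.
    feature = {
        "ref": tf.train.Feature(bytes_list=tf.train.BytesList(value=[ref.encode("utf-8")])),
        "text": tf.train.Feature(bytes_list=tf.train.BytesList(value=[text.encode("utf-8")])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# Placeholder sample; in practice this comes from your parsed Bible data.
verses = [("Genesis 1:1", "In the beginning, God created the heavens and the earth.")]

with tf.io.TFRecordWriter("bible.tfrecord") as writer:
    for ref, text in verses:
        writer.write(verse_to_example(ref, text).SerializeToString())
```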
### Data Preprocessing
Regardless of the format, you'll need to preprocess the Bible's text for training (a short code sketch follows this list):
1. **Tokenization**: Split the text into tokens (e.g., words or sub-words), which is crucial for feeding the text into the LLM.
2. **Cleaning**: Ensure that any special characters, punctuation, or formatting inconsistencies are resolved.
3. **Metadata**: Depending on your goal, you might want to incorporate metadata like book names, chapter numbers, or verse references.
4. **Text Normalization**: Convert the text to lowercase or perform other normalization steps to avoid training on noise or irrelevant variations in the text.
5. **Split by Sentence/Verse/Chunk**: For context-based training, chunk the data into meaningful segments (e.g., by verse or sentence) to allow the model to learn the relationships between words and concepts effectively.
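To make the list concrete, here is a minimal sketch of steps 2, 4, and 5. Step 1 (tokenization) is usually handled by the subword tokenizer that ships with your training framework, and the `max_words` cutoff below is an arbitrary illustration:
```python
import re
import unicodedata

def normalize(text):
    # Steps 2 and 4: normalize Unicode and collapse stray whitespace.
    # Lowercasing is shown because the list above mentions it, but for
    # modern subword-tokenized LLMs you may prefer to keep the original casing.
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()

def chunk(verses, max_words=200):
    # Step 5: pack consecutive verses into chunks of roughly max_words
    # words so each training example carries enough context.
    chunks, current, count = [], [], 0
    for v in verses:
        n = len(v.split())
        if count + n > max_words and current:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(v)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

# Tiny usage example with placeholder verses.
sample = [
    "In the beginning, God created the heavens and the earth.",
    "Now the earth was formless and empty...",
]
print(chunk([normalize(t) for t in sample], max_words=12))
```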
### Summary
If you're training a general-purpose LLM, **plain text (TXT)** or **JSON** is probably the most straightforward and commonly used format. If you want a more structured approach with metadata, **JSON** or **XML** could be better options. For large-scale training, **TFRecord** (for TensorFlow) might be a good choice if performance is a concern.