PatternX

Clean, Repair and Optimize Your Data with AI

Creating a Q&A Dataset from Stack Overflow for RAG

When building a RAG (Retrieval-Augmented Generation) system, a good dataset is crucial.

By Lidia on Thu Jan 09 2025

When building a RAG (Retrieval-Augmented Generation) system, a good dataset is crucial. While many tutorials rely on private or proprietary datasets, I'll show you how to create your own Q&A dataset from publicly available Stack Overflow data on Hugging Face.

The Goal

We want to create a dataset of high-quality question-answer pairs that we can use to train and test our RAG system. Specifically, we'll focus on questions related to Python and machine learning.

The Process

First, let's set up our imports and initialize our data structures:

from datasets import Dataset, DatasetDict, load_dataset
from tqdm import tqdm

questions = {}
answers = {}

We'll use the 'mikex86/stackoverflow-posts' dataset from Hugging Face. Since this is a large dataset, we'll stream it rather than download it all at once, and process it in batches:

dataset = load_dataset('mikex86/stackoverflow-posts',
split='train', streaming=True)

Filtering Criteria

We want to ensure we get high-quality Q&A pairs, so we'll apply several filters:

  • Questions must have an accepted answer
  • Questions must have a positive score
  • Questions must have at least 100 views
  • Questions must be tagged with either 'python' or 'machine-learning'
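These criteria can be captured in a small predicate function, which makes them easy to adjust later. This is just a sketch; it assumes the same field names the dataset uses in the processing loop below (PostTypeId, AcceptedAnswerId, Score, ViewCount, Tags):

```python
def is_good_question(post):
    """Return True if a post is a question passing all quality filters."""
    return (post['PostTypeId'] == 1
            and post['AcceptedAnswerId'] is not None
            and post['Score'] > 0
            and post['ViewCount'] is not None
            and post['ViewCount'] > 100
            and post['Tags'] is not None
            and ('python' in post['Tags'] or 'machine-learning' in post['Tags']))

# Quick sanity checks on made-up posts
good = {'PostTypeId': 1, 'AcceptedAnswerId': 42, 'Score': 5,
        'ViewCount': 500, 'Tags': ['python', 'pandas']}
bad = {'PostTypeId': 1, 'AcceptedAnswerId': None, 'Score': 5,
       'ViewCount': 500, 'Tags': ['python']}
```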

Here's how we process each post:

for batch in tqdm(dataset.iter(batch_size=1000), desc="Processing posts"):
    # Each batch is a dict of columns; rebuild one dict per post
    for i in range(len(batch['Id'])):
        post = {key: batch[key][i] for key in batch.keys()}

        # PostTypeId 1 = question: keep it if it passes all quality filters
        if (post['PostTypeId'] == 1 and 
            post['AcceptedAnswerId'] is not None and 
            post['Score'] > 0 and 
            post['ViewCount'] is not None and 
            post['ViewCount'] > 100 and 
            post['Tags'] is not None and 
            ('python' in post['Tags'] or 'machine-learning' in post['Tags'])):
            questions[post['Id']] = {'question': post['Title'] + '\n\n' + post['Body']}

        # PostTypeId 2 = answer: keep it only if its parent question was kept
        elif post['PostTypeId'] == 2:
            pid = post['ParentId']
            if pid in questions:
                answers.setdefault(pid, []).append(post['Body'])
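Note that the loop keeps every answer whose parent question survived the filters, not just the accepted one. If you want only accepted answers, you can store each question's AcceptedAnswerId and compare it against each answer's Id. A minimal sketch on made-up posts (the field names match the dataset; the posts themselves are invented):

```python
questions = {}   # question Id -> {'question': ..., 'accepted_id': ...}
answers = {}     # question Id -> accepted answer body

posts = [
    {'Id': 1, 'PostTypeId': 1, 'Title': 'How to reverse a list?',
     'Body': 'In Python?', 'AcceptedAnswerId': 3},
    {'Id': 2, 'PostTypeId': 2, 'ParentId': 1, 'Body': 'Use a loop.'},
    {'Id': 3, 'PostTypeId': 2, 'ParentId': 1, 'Body': 'Use lst[::-1].'},
]

for post in posts:
    if post['PostTypeId'] == 1:  # question: remember which answer was accepted
        questions[post['Id']] = {
            'question': post['Title'] + '\n\n' + post['Body'],
            'accepted_id': post['AcceptedAnswerId'],
        }
    elif post['PostTypeId'] == 2:  # answer: keep only the accepted one
        pid = post['ParentId']
        if pid in questions and post['Id'] == questions[pid]['accepted_id']:
            answers[pid] = post['Body']
```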

Creating the Final Dataset

Once we have our questions and answers, we combine them into Q&A pairs and split them into training and test sets:

qa_pairs = [
    {'question': questions[qid]['question'], 'answers': answers[qid]}
    for qid in questions if qid in answers
]

train_size = int(0.9*len(qa_pairs))
dataset_dict = DatasetDict({
    'train': Dataset.from_list(qa_pairs[:train_size]),
    'test': Dataset.from_list(qa_pairs[train_size:])
})
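One caveat: qa_pairs comes out in stream order, so the 90/10 split above is not random. Shuffling with a fixed seed before splitting gives a more representative test set; here's a sketch on dummy pairs:

```python
import random

# Dummy pairs standing in for the real qa_pairs list
qa_pairs = [{'question': f'q{i}', 'answers': [f'a{i}']} for i in range(10)]

random.seed(42)          # fixed seed so the split is reproducible
random.shuffle(qa_pairs)

train_size = int(0.9 * len(qa_pairs))
train, test = qa_pairs[:train_size], qa_pairs[train_size:]
```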

Finally, we save our dataset to disk:

dataset_dict.save_to_disk('stackoverflow_qa_dataset')

Key Features of the Dataset

  • Each question includes both the title and body for maximum context
  • All questions have at least one answer
  • Questions are pre-filtered for quality (positive score, minimum views)
  • Focus on Python and machine learning ensures technical relevance
  • 90/10 train/test split for model evaluation

Next Steps

With this dataset, you can:

  • Build embeddings for questions and answers
  • Train retrieval models
  • Evaluate different RAG architectures
  • Fine-tune language models for programming-specific Q&A
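To give a flavor of the retrieval step, here is a toy bag-of-words retriever over Q&A pairs, using cosine similarity between token-count vectors. It is purely illustrative; a real pipeline would use learned embeddings and a vector index:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two token-count Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, qa_pairs):
    """Return the Q&A pair whose question best matches the query."""
    q_vec = Counter(query.lower().split())
    return max(qa_pairs,
               key=lambda p: cosine(q_vec, Counter(p['question'].lower().split())))

qa_pairs = [
    {'question': 'how to reverse a list in python', 'answers': ['lst[::-1]']},
    {'question': 'how to train a neural network', 'answers': ['use gradient descent']},
]
best = retrieve('reverse a python list', qa_pairs)
```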

Remember that while this dataset focuses on Python and machine learning, you can modify the filtering criteria to create datasets for other programming languages or topics of interest.

If you'd like more detail on any part of this process, or on how to plug this dataset into a RAG pipeline, feel free to contact us at info@patternx.us



Want to talk to us?

Send us an email at info@patternx.us