Creating a Q&A Dataset from Stack Overflow for RAG
When building a Retrieval-Augmented Generation (RAG) system, having a good dataset is crucial.
By Lidia on Thu Jan 09 2025
The Goal
We want to create a dataset of high-quality question-answer pairs that we can use to train and test our RAG system. Specifically, we'll focus on Python and machine-learning-related questions.
The Process
First, let's set up our imports and initialize our data structures:
from datasets import Dataset, DatasetDict, load_dataset
from tqdm import tqdm

questions = {}  # question Id -> question text (title + body)
answers = {}    # question Id -> list of answer bodies
We'll use the 'mikex86/stackoverflow-posts' dataset from Hugging Face, which contains Stack Overflow posts. Since this is a large dataset, we'll stream it and process it in batches:
dataset = load_dataset('mikex86/stackoverflow-posts', split='train', streaming=True)
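Because the dataset is streamed, nothing is downloaded up front. As an optional sanity check before running the full loop, you can peek at a single record to confirm it exposes the fields we rely on below (Id, PostTypeId, AcceptedAnswerId, Score, ViewCount, Tags, Title, Body, ParentId):

# Optional: pull one record off the stream and list its fields
first_post = next(iter(dataset))
print(first_post.keys())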
Filtering Criteria
We want to ensure we get high-quality Q&A pairs, so we'll apply several filters:
- Questions must have an accepted answer
- Questions must have a positive score
- Questions must have more than 100 views
- Questions must be tagged with either 'python' or 'machine-learning'
Here's how we process each post:
for batch in tqdm(dataset.iter(batch_size=1000), desc="Processing posts"):
    for i in range(len(batch['Id'])):
        # Re-assemble the i-th post from the batch's column-oriented format
        post = {key: batch[key][i] for key in batch.keys()}

        # Questions (PostTypeId == 1) that pass all of our quality filters
        if (post['PostTypeId'] == 1
                and post['AcceptedAnswerId'] is not None
                and post['Score'] > 0
                and post['ViewCount'] is not None and post['ViewCount'] > 100
                and post['Tags'] is not None
                and ('python' in post['Tags'] or 'machine-learning' in post['Tags'])):
            questions[post['Id']] = {'question': post['Title'] + '\n\n' + post['Body']}

        # Answers (PostTypeId == 2) are kept only if their question passed the filters
        elif post['PostTypeId'] == 2:
            pid = post['ParentId']
            if pid in questions:
                if pid not in answers:
                    answers[pid] = []
                answers[pid].append(post['Body'])
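A full pass over the streamed dataset can take a while. Once the loop finishes, it's worth checking how much survived the filters before building the final dataset; a quick check might look like this:

print(f"Kept {len(questions)} questions and {sum(len(a) for a in answers.values())} answers")
print(f"Questions with at least one collected answer: {sum(1 for qid in questions if qid in answers)}")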
Creating the Final Dataset
Once we have our questions and answers, we combine them into Q&A pairs and split them into training and test sets:
qa_pairs = [
    {'question': questions[qid]['question'], 'answers': answers[qid]}
    for qid in questions if qid in answers
]

# 90/10 train/test split
train_size = int(0.9 * len(qa_pairs))
dataset_dict = DatasetDict({
    'train': Dataset.from_list(qa_pairs[:train_size]),
    'test': Dataset.from_list(qa_pairs[train_size:])
})
Finally, we save our dataset to disk:
dataset_dict.save_to_disk('stackoverflow_qa_dataset')
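To verify that the dataset round-trips correctly, you can reload it with load_from_disk and preview an example:

from datasets import load_from_disk

reloaded = load_from_disk('stackoverflow_qa_dataset')
print(reloaded)                                 # shows the train/test splits and their sizes
print(reloaded['train'][0]['question'][:200])   # preview of the first question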
Key Features of the Dataset
- Each question includes both the title and body for maximum context
- All questions have at least one answer
- Questions are pre-filtered for quality (positive score, minimum views)
- Focus on Python and machine learning ensures technical relevance
- 90/10 train/test split for model evaluation
Next Steps
With this dataset, you can:
- Build embeddings for questions and answers (see the sketch after this list)
- Train retrieval models
- Evaluate different RAG architectures
- Fine-tune language models for programming-specific Q&A
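For the first of these, here's a minimal sketch of embedding the questions with the sentence-transformers library; the model name 'all-MiniLM-L6-v2' is just an example choice, not something prescribed by this dataset:

from datasets import load_from_disk
from sentence_transformers import SentenceTransformer

dataset = load_from_disk('stackoverflow_qa_dataset')

# Any sentence-embedding model works here; this one is small and fast.
model = SentenceTransformer('all-MiniLM-L6-v2')

question_embeddings = model.encode(
    dataset['train']['question'],
    batch_size=64,
    show_progress_bar=True,
)
print(question_embeddings.shape)  # (number of questions, embedding dimension)

These embeddings can then be indexed (for example with FAISS) so that a RAG pipeline can retrieve the most similar Stack Overflow questions for an incoming query.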
Remember that while this dataset focuses on Python and machine learning, you can modify the filtering criteria to create datasets for other programming languages or topics of interest.
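Adapting the filter only requires changing the tag check in the processing loop. As a sketch, the TARGET_TAGS tuple below is hypothetical and can hold whatever topics you're interested in:

# Hypothetical set of tags to target instead of 'python' / 'machine-learning'
TARGET_TAGS = ('javascript', 'typescript')

def matches_target_tags(tags):
    # Mirrors the original check: keep the post if any target tag appears in its Tags field
    return tags is not None and any(tag in tags for tag in TARGET_TAGS)

Then use matches_target_tags(post['Tags']) in place of the hard-coded tag condition in the loop.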
If you'd like to dive deeper into any part of this process, or want help using this dataset in a RAG pipeline, feel free to contact us at info@patternx.us