EduEVAL-DB Dataset for AI Tutor Evaluation
💡 New dataset for benchmarking AI tutors on bias, facts, and teaching quality; fine-tune lightweight models now.
⚡ 30-Second TL;DR
What Changed
854 explanations for 139 curated ScienceQA questions spanning science, language, and social science.
Why It Matters
This dataset advances safe educational AI by enabling evaluation of pedagogical risks in LLM explanations. It supports training lightweight models for on-device use, democratizing AI tutor assessment. Researchers can now benchmark educational AI systems against real teaching standards.
What To Do Next
Download EduEVAL-DB from arXiv and fine-tune Llama 3.1 8B for pedagogical risk detection.
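As a concrete starting point, here is a minimal LoRA fine-tuning sketch using Hugging Face transformers and peft. This is not the paper's exact training recipe: the data file `edueval_db.jsonl` and its field names are hypothetical placeholders for however the released dataset is packaged, and the hyperparameters are generic defaults.

```python
# Minimal LoRA fine-tuning sketch for pedagogical risk detection.
# Assumes a local JSONL export of EduEVAL-DB; file and field names are hypothetical.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # gated; requires HF access approval

tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

# Wrap the base model with a small LoRA adapter so training fits consumer hardware.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
))

# Hypothetical JSONL rows: {"question": ..., "explanation": ..., "risk_labels": {...}}
raw = load_dataset("json", data_files="edueval_db.jsonl", split="train")

def to_features(row):
    # Frame risk detection as text generation: explanation in, risk labels out.
    text = (f"Question: {row['question']}\n"
            f"Explanation: {row['explanation']}\n"
            f"Risk labels: {row['risk_labels']}{tokenizer.eos_token}")
    return tokenizer(text, truncation=True, max_length=1024)

train = raw.map(to_features, remove_columns=raw.column_names)

Trainer(
    model=model,
    args=TrainingArguments("edueval-lora", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, num_train_epochs=3,
                           learning_rate=2e-4, logging_steps=10),
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```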
🧠 Deep Insight
Web-grounded analysis with 4 cited sources.
📌 Enhanced Key Takeaways
- EduEVAL-DB contains 854 explanations for 139 curated ScienceQA questions across K-12 science, language, and social science subjects[1][2].
- Includes one human-teacher explanation per question and six LLM-simulated teacher roles created via prompt engineering, inspired by real educational styles and shortcomings[1][2][3].
- Features a pedagogical risk rubric with five dimensions: factual correctness, explanatory depth/completeness, focus/relevance, student-level appropriateness, and ideological bias, using binary risk labels (a hypothetical record layout is sketched after this list)[1][2][3].
- Annotations were performed via a semi-automatic process with expert teacher review; the dataset is publicly released for training and evaluating LLM-based tutors and evaluators[1][2].
- Benchmarks show Gemini 2.5 Pro outperforming fine-tuned Llama 3.1 8B in risk detection, with fine-tuning improving calibration, sensitivity, and deployability on consumer hardware[1].
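The takeaways above imply a simple per-explanation record: each explanation is tied to a question, a source (human teacher or one of six simulated roles), and five binary rubric judgments. A hypothetical layout follows; the field names are illustrative, not the official released schema.

```python
# Hypothetical EduEVAL-DB record layout inferred from the description above;
# field names are illustrative, not the official released schema.
from dataclasses import dataclass, field

@dataclass
class ExplanationRecord:
    question_id: str   # one of the 139 curated ScienceQA questions
    subject: str       # "science" | "language" | "social science"
    source: str        # "human_teacher" or one of six LLM-simulated roles
    explanation: str   # the teaching explanation being evaluated
    # Five binary rubric dimensions; True means a pedagogical risk is present.
    risk_labels: dict = field(default_factory=lambda: {
        "factual_correctness": False,
        "explanatory_depth": False,
        "focus_relevance": False,
        "student_level_appropriateness": False,
        "ideological_bias": False,
    })
```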
📊 Competitor Analysis
| Feature | EduEVAL-DB | ScienceQA |
|---|---|---|
| Explanations per Question | 7 (1 human + 6 LLM) | Primarily QA pairs with images/text |
| Focus | Pedagogical risk evaluation | Visual question answering benchmarks |
| Rubric Dimensions | 5 (correctness, depth, focus, appropriateness, bias) | Accuracy on science questions |
| Benchmarks | Gemini 2.5 Pro vs. Llama 3.1 8B fine-tuned | Various LLMs on QA accuracy |
| Hardware | Consumer-deployable models | Not specified |
🛠️ Technical Deep Dive
- Dataset derived from a curated subset of the ScienceQA benchmark, covering K-12 levels[1][2].
- LLM-simulated roles are instantiated via prompt engineering to mimic instructional styles and common shortcomings (an illustrative role prompt follows this list)[1][2][3].
- Binary risk labels are annotated semi-automatically with expert teacher review for all five rubric dimensions[1][2].
- Fine-tuning Llama 3.1 8B on EduEVAL-DB improves MAE trends and confusion-matrix sensitivity to risk-present cases, and reduces majority-label bias despite class imbalance (a scoring sketch follows this list)[1].
- Gemini 2.5 Pro leverages broader factual knowledge for advantages in evaluation, while the fine-tuned model supports local deployment[1].
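To make the role-simulation idea concrete, here is a sketch of instantiating a teacher role via a system prompt. The role names and wording are invented for illustration; the paper's actual prompts are not reproduced here.

```python
# Illustrative prompt engineering for simulated teacher roles.
# Role descriptions below are hypothetical, not the paper's actual prompts.
ROLE_PROMPTS = {
    "rushed_teacher": (
        "You are a teacher who is short on time. Answer the student's "
        "question correctly but briefly, sometimes skipping supporting detail."
    ),
    "off_topic_teacher": (
        "You are an enthusiastic teacher who tends to drift into tangents "
        "loosely related to the student's question."
    ),
}

def build_teacher_prompt(role: str, question: str) -> list[dict]:
    """Compose a chat-style prompt for one simulated teacher role."""
    return [
        {"role": "system", "content": ROLE_PROMPTS[role]},
        {"role": "user", "content": f"Please explain: {question}"},
    ]

messages = build_teacher_prompt("rushed_teacher", "Why do seasons change?")
```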
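And here is a minimal sketch of how MAE and risk-present sensitivity could be scored over the five binary rubric dimensions. The prediction and label arrays are random placeholders, not results from the paper.

```python
# Sketch: scoring a risk evaluator on binary rubric labels.
# y_true / y_pred below are random placeholders, not paper results.
import numpy as np
from sklearn.metrics import confusion_matrix, mean_absolute_error, recall_score

DIMENSIONS = ["factual_correctness", "explanatory_depth", "focus_relevance",
              "student_level_appropriateness", "ideological_bias"]

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(200, len(DIMENSIONS)))  # 1 = risk present
y_pred = rng.integers(0, 2, size=(200, len(DIMENSIONS)))

for i, dim in enumerate(DIMENSIONS):
    # On {0,1} labels, MAE equals the per-dimension error rate.
    mae = mean_absolute_error(y_true[:, i], y_pred[:, i])
    # Sensitivity to risk-present cases = recall on the positive class,
    # the quantity the paper reports fine-tuning improves under class imbalance.
    sens = recall_score(y_true[:, i], y_pred[:, i], pos_label=1)
    tn, fp, fn, tp = confusion_matrix(y_true[:, i], y_pred[:, i]).ravel()
    print(f"{dim}: MAE={mae:.3f} sensitivity={sens:.3f} (TP={tp}, FN={fn})")
```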
🔮 Future Implications
AI analysis grounded in cited sources.
EduEVAL-DB enables training of locally deployable pedagogical evaluators, advancing safer AI tutors by assessing explanations beyond factual accuracy to include depth, focus, appropriateness, and bias; it could help standardize evaluation of K-12 AI education tools[1].
📚 Sources (4)
Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
Original source: arXiv AI