EduEVAL-DB introduces a dataset of 854 explanations for 139 ScienceQA questions across K-12 subjects, with explanations from one human teacher and six LLM-simulated teacher roles. It features a pedagogical risk rubric covering factual correctness, depth, focus, appropriateness, and bias, annotated via semi-automatic expert review. Preliminary benchmarks compare Gemini 2.5 Pro against a fine-tuned Llama 3.1 8B for risk detection on consumer hardware.
Key Points
- 854 explanations for 139 ScienceQA questions spanning science, language, and social science
- One human-teacher role and six LLM-simulated teacher roles created via prompt engineering
- Pedagogical risk rubric with five dimensions and binary labels (see the sketch after this list)
- Semi-automatic annotation pipeline with expert teacher review
- Benchmarks Gemini 2.5 Pro against a fine-tuned Llama 3.1 8B for deployable risk detection
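A minimal sketch of how a single annotation record might be represented, assuming one binary label per rubric dimension; the field names and the `any_risk` helper are illustrative, since the source only specifies the five dimensions and binary labels.

```python
from dataclasses import dataclass


@dataclass
class RiskAnnotation:
    """One explanation's pedagogical-risk labels (field names are illustrative)."""
    question_id: str            # ScienceQA question identifier
    teacher_role: str           # human teacher or one of the six simulated roles
    explanation: str            # the explanation text being judged
    factually_incorrect: bool   # factual correctness violated
    insufficient_depth: bool    # too shallow for the grade level
    off_focus: bool             # drifts away from the question asked
    inappropriate: bool         # not age- or audience-appropriate
    biased: bool                # contains social or cultural bias

    @property
    def any_risk(self) -> bool:
        # An explanation is flagged if any of the five dimensions is positive.
        return any([self.factually_incorrect, self.insufficient_depth,
                    self.off_focus, self.inappropriate, self.biased])
```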
Impact Analysis
This dataset advances safe educational AI by enabling evaluation of pedagogical risks in LLM-generated explanations. It supports training lightweight models for on-device risk detection, democratizing assessment of AI tutors. Researchers can now benchmark educational AI systems against real teaching standards.
Technical Details
The dataset is derived from a curated K-12 subset of ScienceQA. Teacher roles are inspired by real instructional styles and their shortcomings. Validation was performed via supervised fine-tuning experiments on Llama 3.1 8B.
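A minimal sketch of how teacher roles could be simulated via prompt engineering. The role names and wording below are hypothetical; the source only states that six roles inspired by real instructional styles and shortcomings were used.

```python
# Hypothetical role prompts; the actual six roles in EduEVAL-DB are not listed here.
TEACHER_ROLE_PROMPTS = {
    "rushed": "You are a teacher with very little time. Explain the answer in one or two hurried sentences.",
    "overly_technical": "You are a teacher who uses advanced jargon regardless of the student's grade level.",
    "patient": "You are a patient teacher. Explain the answer step by step at the student's grade level.",
}


def build_explanation_prompt(role: str, question: str, answer: str) -> str:
    """Compose the role instruction and the explanation task for one teacher role."""
    return (
        f"{TEACHER_ROLE_PROMPTS[role]}\n\n"
        f"Question: {question}\n"
        f"Correct answer: {answer}\n"
        "Write your explanation for the student."
    )
```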
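A minimal QLoRA fine-tuning sketch for the risk-detection setup, assuming the task is framed as multi-label classification over the five rubric dimensions. The model checkpoint name, hyperparameters, data file, and column names are assumptions, not the authors' exact recipe.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          BitsAndBytesConfig, Trainer, TrainingArguments)

MODEL_ID = "meta-llama/Llama-3.1-8B"   # assumed checkpoint name
RISK_DIMENSIONS = 5                    # factual, depth, focus, appropriateness, bias

# 4-bit quantization keeps the 8B model within consumer-GPU memory.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token   # Llama has no pad token by default

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=RISK_DIMENSIONS,
    problem_type="multi_label_classification",
    quantization_config=bnb, device_map="auto")
model.config.pad_token_id = tokenizer.pad_token_id
model = prepare_model_for_kbit_training(model)

# Low-rank adapters on the attention projections; only these weights are trained.
model = get_peft_model(model, LoraConfig(
    task_type=TaskType.SEQ_CLS, r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"]))


def tokenize(batch):
    # Expects an "explanation" text column and a "risk_labels" list of five 0/1 values;
    # labels are cast to float for the multi-label BCE loss.
    enc = tokenizer(batch["explanation"], truncation=True, max_length=512)
    enc["labels"] = [[float(x) for x in labels] for labels in batch["risk_labels"]]
    return enc


# Hypothetical local export of the dataset in JSONL form.
dataset = load_dataset("json", data_files="edueval_db.jsonl")["train"]
dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="risk-detector", per_device_train_batch_size=2,
                           gradient_accumulation_steps=8, num_train_epochs=3,
                           learning_rate=2e-4, bf16=True, logging_steps=10),
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
```

Pairing 4-bit quantization with low-rank adapters is what would make an 8B-parameter risk detector trainable and deployable on a single consumer GPU, which is the deployment setting the benchmark targets.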