This project bridges the gap between privacy protection and research utility by introducing the De-identified Educational Online Discussion (DEOD) corpus, the first publicly accessible, large-scale corpus of authentic online student discussions from a large U.S. public university. We address the technical challenge of removing personally identifiable information (PII) with a supervised fine-tuning pipeline that achieves a 0.92 F1 score across ten PII categories, coupled with a rigorous four-layer quality review process that flags or removes an additional 6,658 potential privacy risks.
Key Contributions
- STATE-OF-THE-ART PII REMOVAL — A supervised fine-tuned GPT-4.1 model achieving 0.92 precision, 0.91 recall, and 0.92 F1 score across ten PII categories, with 0.97 F1 on names, the most privacy-critical category.
- LARGE-SCALE, DIVERSE CORPUS — 192K de-identified discussion posts from over 16K unique students across 27 disciplines, 92 courses, and seven academic years (2018–2025), with rich metadata including thread structure, timestamps, demographics, and learning outcomes.
- OPEN, ETHICAL DISSEMINATION — The first publicly accessible student discussion corpus, to be released with IRB-approved protocols, usage agreements, and mechanisms for community-driven quality monitoring, enabling replicable research while maintaining strict privacy protections.
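For readers less familiar with the metrics quoted in the first contribution, the following is a minimal sketch of how precision, recall, and F1 are typically computed from per-category counts of true positives, false positives, and false negatives. The category names, the counts, and the choice of micro-averaging are illustrative assumptions; the summary does not state how the reported scores were aggregated across the ten PII categories.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Span-level counts for one PII category (hypothetical values below)."""
    tp: int  # PII spans correctly detected
    fp: int  # non-PII spans wrongly flagged
    fn: int  # PII spans missed


def prf(c: Counts) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for one set of counts."""
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def micro_prf(per_category: dict[str, Counts]) -> tuple[float, float, float]:
    """Pool counts over all categories, then score once (micro-averaging)."""
    pooled = Counts(
        tp=sum(c.tp for c in per_category.values()),
        fp=sum(c.fp for c in per_category.values()),
        fn=sum(c.fn for c in per_category.values()),
    )
    return prf(pooled)


# Hypothetical counts for two categories, for illustration only.
example_counts = {
    "NAME": Counts(tp=97, fp=3, fn=3),
    "EMAIL": Counts(tp=45, fp=5, fn=5),
}

for name, counts in example_counts.items():
    p, r, f = prf(counts)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")

p, r, f = micro_prf(example_counts)
print(f"micro: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Scores rounded to two decimals need not satisfy the harmonic-mean relation exactly, and a macro-averaged F1 is not in general the harmonic mean of macro-averaged precision and recall, so small differences between rounded figures are expected.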
In preparation.