PII Removal: De-identified Educational Online Discussion Corpus

This project bridges the gap between privacy protection and research utility by introducing the De-identified Educational Online Discussion (DEOD) corpus, the first publicly accessible, large-scale corpus of authentic online student discussions from a large U.S. public university. De-identification relies on a supervised fine-tuning pipeline that achieves a 0.92 F1 score across ten PII categories, followed by a rigorous four-layer quality review process that flags or removes an additional 6,658 potential privacy risks.

Key Contributions

STATE-OF-THE-ART PII REMOVAL — A supervised fine-tuned GPT-4.1 model achieving 0.92 precision, 0.91 recall, and 0.92 F1 score across ten PII categories, with 0.97 F1 on names, the most privacy-critical category.
LARGE-SCALE, DIVERSE CORPUS — 192K de-identified discussion posts from over 16K unique students across 27 disciplines, 92 courses, and seven academic years (2018–2025), with rich metadata including thread structure, timestamps, demographics, and learning outcomes.
OPEN, ETHICAL DISSEMINATION — The first publicly accessible student discussion corpus, to be released with IRB-approved protocols, usage agreements, and mechanisms for community-driven quality monitoring, enabling replicable research while maintaining strict privacy protections.
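
For readers who want to see how precision, recall, and F1 figures like those above are typically derived, the following is a minimal sketch and not the project's actual evaluation code. It assumes exact-match scoring over annotated spans, where each span is a hypothetical (post_id, start, end, category) tuple; the function names, category labels, and sample data are illustrative only.

```python
def prf1(tp, fp, fn):
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def score_by_category(gold, pred):
    """Score predicted PII spans against gold spans using exact matching.

    gold, pred: sets of (post_id, start, end, category) tuples (assumed format).
    Returns a dict mapping each category, plus "micro", to (precision, recall, F1).
    """
    results = {}
    for cat in sorted({span[3] for span in gold | pred}):
        g = {s for s in gold if s[3] == cat}
        p = {s for s in pred if s[3] == cat}
        results[cat] = prf1(len(g & p), len(p - g), len(g - p))
    # Micro-average: pool the counts over all categories before computing the metrics.
    results["micro"] = prf1(len(gold & pred), len(pred - gold), len(gold - pred))
    return results


# Illustrative data only, not drawn from the corpus.
gold = {("post1", 0, 5, "NAME"), ("post1", 20, 32, "EMAIL"), ("post2", 3, 9, "NAME")}
pred = {("post1", 0, 5, "NAME"), ("post2", 3, 9, "NAME"), ("post2", 15, 20, "EMAIL")}
results = score_by_category(gold, pred)
for label, (p, r, f) in results.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Exact-match scoring is the strictest common choice; a partial-overlap or token-level scheme would give different numbers, and the summary above does not say which scheme was used.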

In preparation.

Lead Researcher & Data Scientist · Arizona State University · Learning @ Scale Grant