TY - JOUR
T1 - Task-specific information outperforms surveillance-style big data in predictive analytics
AU - Bjerre-Nielsen, Andreas
AU - Kassarnig, Valentin
AU - Lassen, David Dreyer
AU - Lehmann, Sune
N1 - Funding Information:
We gratefully acknowledge financial support from the Villum Foundation’s Young Investigator Grant and Synergy Grant, the UCPH2016 initiative, and an Economic Policy Research Network (EPRN) grant. The Center for Economic Behavior and Inequality is supported by Danish National Research Foundation Grant DNRF134.
Publisher Copyright:
© 2021 National Academy of Sciences. All rights reserved.
PY - 2021/4/6
Y1 - 2021/4/6
N2 - Increasingly, human behavior can be monitored through the collection of data from digital devices revealing information on behaviors and locations. In the context of higher education, a growing number of schools and universities collect data on their students with the purpose of assessing or predicting behaviors and academic performance, and the COVID-19-induced move to online education dramatically increases what can be accumulated in this way, raising concerns about students' privacy. We focus on academic performance and ask whether predictive performance for a given dataset can be achieved with less privacy-invasive, but more task-specific, data. We draw on a unique dataset on a large student population containing both highly detailed measures of behavior and personality and high-quality third-party reported individual-level administrative data. We find that models estimated using the big behavioral data are indeed able to accurately predict academic performance out of sample. However, models using only low-dimensional and arguably less privacy-invasive administrative data perform considerably better and, importantly, do not improve when we add the high-resolution, privacy-invasive behavioral data. We argue that combining big behavioral data with "ground truth" administrative registry data can ideally allow the identification of privacy-preserving task-specific features that can be employed instead of current indiscriminate troves of behavioral data, with better privacy and better prediction resulting.
AB - Increasingly, human behavior can be monitored through the collection of data from digital devices revealing information on behaviors and locations. In the context of higher education, a growing number of schools and universities collect data on their students with the purpose of assessing or predicting behaviors and academic performance, and the COVID-19-induced move to online education dramatically increases what can be accumulated in this way, raising concerns about students' privacy. We focus on academic performance and ask whether predictive performance for a given dataset can be achieved with less privacy-invasive, but more task-specific, data. We draw on a unique dataset on a large student population containing both highly detailed measures of behavior and personality and high-quality third-party reported individual-level administrative data. We find that models estimated using the big behavioral data are indeed able to accurately predict academic performance out of sample. However, models using only low-dimensional and arguably less privacy-invasive administrative data perform considerably better and, importantly, do not improve when we add the high-resolution, privacy-invasive behavioral data. We argue that combining big behavioral data with "ground truth" administrative registry data can ideally allow the identification of privacy-preserving task-specific features that can be employed instead of current indiscriminate troves of behavioral data, with better privacy and better prediction resulting.
KW - Academic performance
KW - Big data
KW - Prediction
KW - Privacy
UR - http://www.scopus.com/inward/record.url?scp=85103745351&partnerID=8YFLogxK
U2 - 10.1073/pnas.2020258118
DO - 10.1073/pnas.2020258118
M3 - Article
C2 - 33790010
AN - SCOPUS:85103745351
SN - 0027-8424
VL - 118
JO - Proceedings of the National Academy of Sciences of the United States of America
JF - Proceedings of the National Academy of Sciences of the United States of America
IS - 14
M1 - e2020258118
ER -