Digging into the research methods of educational studies can be akin to looking under the hood of a car: most people see a lot of complex machinery without grasping how it all actually works. At least with cars, we can take them for a test drive. But with research, it’s not so straightforward.
How do educators and policy makers—and, for that matter, researchers themselves—know they can trust the findings generated by the tools and techniques of education research?
For help making sense of the field, we turned to assistant professor Wendy Chan, an applied education statistician who joined Penn GSE’s Human Development and Quantitative Methods Program in the fall of 2016. We started by asking Chan about her path to becoming an education researcher.
After graduating from Columbia University with a degree in mathematics and economics in 2008, Chan enrolled in Teach for America, taking an assignment in New York City. In addition to her regular teaching duties, she and some of her fellow corps members were asked to lead an after-school program for students they did not teach regularly.
“Several of my colleagues questioned the efficacy of this program,” she says, “and I was interested in investigating whether it was, indeed, effective. But the data we had were anecdotal and not very helpful for formally responding to our questions.”
They had long talks about this and other educational programs, Chan recalls, and she knew there had to be a way to get better answers. So when her Teach for America commitment came to an end, she decided to pursue a doctorate in applied statistics.
It was in graduate school, she says, that “I learned how statistics is used to answer questions similar to the ones we were asking in Teach for America, but this time using empirical evidence.”
According to Chan, if you strip away all the equations and jargon, the work of applied education statisticians comes down to answering two related questions: Do educational interventions actually work? And for whom do they work?
Applied education statisticians answer these questions by ensuring that researchers’ data-crunching tools and techniques work as advertised. They also explore ways to improve those tools and techniques, and sometimes they even develop new ones.
For example, think about how researchers determine whether a new mathematics curriculum improves students’ test scores.
They collect test score data for students who use the curriculum (the treatment group) as well as for students who do not use the curriculum (the control group). They then examine whether the differences in test scores between these two groups are statistically significant.
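As a minimal sketch of that comparison—using hypothetical scores, not data from any actual study—a researcher might compute a two-sample t statistic like this in Python:

```python
import statistics

def welch_t(treatment, control):
    """Welch's t statistic for comparing two independent group means."""
    m1, m2 = statistics.mean(treatment), statistics.mean(control)
    v1, v2 = statistics.variance(treatment), statistics.variance(control)
    n1, n2 = len(treatment), len(control)
    se = (v1 / n1 + v2 / n2) ** 0.5  # standard error of the mean difference
    return (m1 - m2) / se

# Hypothetical test scores: students using the new curriculum (treatment)
# versus students using the existing one (control).
treatment = [78, 85, 82, 90, 76, 88, 84]
control = [72, 80, 75, 79, 70, 77, 74]

t = welch_t(treatment, control)
print(round(t, 2))  # a large |t| suggests the gap is unlikely to be chance
```

A real analysis would also report a p-value and confidence interval, but the core idea is the same: scale the observed difference by its sampling uncertainty.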
According to Chan, the fingerprints of applied statisticians are all over this example.
They’re on the design of the experiment, in terms of assessing whether the study has enough statistical power to detect potential treatment effects. They appear in the methods researchers use to compare the scores of the two groups and to compare the groups themselves if the curriculum is not randomly assigned to the students. And they show up on considerations of whether the experimental results would hold with a broader population of students outside of the study.
Chan illustrates other ways that applied education statisticians leave their mark on education research and policy with an example from her own research portfolio.
Suppose that the math curriculum in the experiment described above is effective. In other words, students in the treatment group show higher achievement than their counterparts in the control group.
Should the state education agency mandate that the curriculum be implemented statewide?
The answer isn’t always straightforward, and this is where Chan’s research on generalizing from nonrandom samples comes in.
Studies with nonrandom samples are common in education research, but they pose a serious challenge for researchers and policy makers. “The problem is that inferences from such studies may be misleading if the subjects in the study are not representative of the subjects in the larger target population,” says Chan.
In our hypothetical, policy makers would want to know if the new curriculum would have similar results across all students in the state.
If the study sampled students from, say, a large urban district, policy makers could not reasonably infer that the curriculum will be effective for students in rural and suburban areas of the state. Too many potential influencing factors—background characteristics, family income levels, prior achievement levels—haven’t been controlled for.
One common workaround for this problem is called point identification, which, theoretically, controls for every possible characteristic that differentiates students in urban districts from students in suburban and rural areas. But it’s very hard—and maybe impossible—to account for every difference. So the results might be misleading.
In her research, Chan has considered the tradeoffs of using an alternative to point identification, called partial identification, for generalizing from nonrandom samples. Partial identification looks at the inferences that can be made from a range of values rather than a single value. “It’s less precise than point identification,” Chan says, “but it has the potential to be more credible.”
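To see how a partial-identification approach yields a range rather than a single number, consider a simplified worst-case bound on a statewide average—assumed numbers for illustration only, and a far cruder calculation than the methods Chan studies, which condition on student characteristics:

```python
def worst_case_bounds(sample_mean, coverage, y_min, y_max):
    """Worst-case bounds on a population mean when only a fraction
    `coverage` of the population resembles the study sample, and the
    unobserved groups could score anywhere between y_min and y_max."""
    lower = coverage * sample_mean + (1 - coverage) * y_min
    upper = coverage * sample_mean + (1 - coverage) * y_max
    return lower, upper

# Hypothetical: the sampled urban students (60% of the state) averaged
# 80 points on a test scored from 0 to 100.
lo, hi = worst_case_bounds(sample_mean=80, coverage=0.6, y_min=0, y_max=100)
print(lo, hi)  # -> 48.0 88.0
```

The interval is wide—that is the “less precise” part—but it holds no matter how different the unobserved students turn out to be, which is what makes it more credible.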
While there is no single right approach to generalizing from a nonrandom sample, Chan’s work is intended to help policy makers recognize the pros and cons of various methods. Armed with this information, she says, “they can make more informed decisions.”
So far, Chan’s generalizability work has focused on different types of sample groups in a single study. But, she wonders, could the same techniques improve how we understand studies that play out over time?
“Most of the current work on generalization focuses on target populations that are somewhat static,” Chan says. “But populations of students could potentially change, in terms of demographic composition or academic proficiency, in ways that will affect the extent to which experimental results will generalize.”
Consider a five-year longitudinal study of the effects of a mathematics computer aid. A desktop program at the start of the study might be an app by year five. How do you generalize the results to better understand how the aid affected students’ academic performance?
To support her research in this area, the Spencer Foundation recently awarded Chan a $36,000 grant. She has also received funding from Penn’s University Research Foundation.
“My research today,” Chan says, “is still based on the kinds of questions that came up during my time in the corps—namely, does a program work, and for whom does it work?”