Less is More: Low-Stakes Assessments and Student Success

Steve Volk, April 16, 2018
Contact at: svolk@oberlin.edu

There is a point in every semester when, it being too late to make significant changes in our current courses, we instead begin to think ahead to next semester. So, overwhelmed as you are by finishing up this term, getting in book orders, and planning research and writing for the summer, maybe you can keep this article in mind as you plan for the fall 2018 semester!

In this posting, I provide a few resources and summarize some of the reasoning supporting the argument that student learning is enhanced more by frequent, low-stakes assessments (“retrieval practices” in the technical lingo), with the opportunities they provide for continual feedback, than by infrequent, high-stakes assessments (e.g., a midterm and a final). While the research has suggested that this holds across all disciplines, the impact is noted especially in STEM fields and quantitatively based courses.

Some Research

First, a bit of the research. Where in the classroom cycle (class-studying-examination) does learning most take place? We generally think that students learn when they are studying (reading assigned texts or reviewing their notes), and that testing is the mechanism that tells us what has been learned. The same assumptions also buttress what many instructors think about classroom practice: students are assumed to learn in lectures (learning which is solidified when they study their notes), and tests help us know how much has been learned. The test, in that sense, is considered a “neutral event”: it measures without impacting learning. And yet, a considerable body of research has demonstrated that such an understanding is flawed; many cognitive psychologists now suggest that assessment practices themselves can produce large gains in long-term retention of information and concepts in comparison with studying for an exam. In other words, write Peter C. Brown, Henry L. Roediger III, and Mark A. McDaniel in Make It Stick: The Science of Successful Learning (Harvard, 2014), using “retrieval practices” (recalling facts, concepts, data, or events from memory) is “a more effective learning strategy than review by rereading.”

To summarize the conclusions that Roediger and Andrew C. Butler drew from one of Roediger’s many studies:

  1. Assessments (“retrieval practices”) often produce superior long-term retention relative to studying for the same amount of time;
  2. Repeated assessments are better than one-time testing;
  3. Providing feedback produces better results than assessment practices without feedback, although even without feedback, the learning results are better for multiple assessments;
  4. Some lag between study and test supports retention and learning;
  5. The benefits of “retrieval practices” can be transferred from one context (domain) to another.

Cynthia J. Brame and Rachel Biel, both at Vanderbilt, offer a useful chart in CBE Life Sciences Education that summarizes the research on the relationship between testing and learning in undergraduate science courses. Among the findings of previous studies that they consider are the following:

  • Testing improved retention significantly more than restudy in delayed tests.
  • Multiple tests provided greater benefit than a single test.
  • Students in the retrieval-practice condition had greater gains in meaningful learning compared with those who used elaborative concept mapping as a learning tool.
  • Both initially correct and incorrect answers were benefited by feedback, but low-confidence answers were most benefited by feedback.
  • Testing improved retention and increased transfer of information from one domain to another through test questions that required factual or conceptual recall and inferential questions that required transfer.
  • Interim testing improved recall on a final test for information taught before and after the interim test.

But first, a note to the doubters – myself included. One of the central concerns I have always had about testing as a method of assessment, particularly high-stakes mid-terms or finals, or even post-reading quizzes, is whether it only promotes rote memory learning (“When did Cortez first arrive in Mexico?” “Describe the wing structure of a bat,” etc.). I questioned whether the memorization of facts that (now) can be looked up instantaneously on a smartphone was necessary, or, in any case, whether what students learned would stick with them, or stick with them in a contextualized fashion rather than as a collection of random facts.

We all know that learning is more than the memorization and the repetition of facts, that higher education learning above all must shift students to higher cognitive levels and outcomes. So, while we know, through a variety of studies, particularly the work carried out by Andrew C. Butler, that “repeated testing produce[s] superior retention and transfer on the final test relative to repeated studying,” what do we know about the impact of frequent assessment such as tests and quizzes on helping to move students to higher cognitive levels? Can what was learned via testing be “transferred,” extended to other domains and contexts? And can attention to “retrieval practices” help students better cope with some of the negative consequences that accompany high-stakes testing? Stick around for some answers.

What to Keep in Mind

  1. Frequent, low-stakes testing is better than infrequent, high-stakes testing.

The research cited above suggests the many reasons why this is the case. To put it in practical terms:

a. We know enough about high-stakes testing to know that such assessment practices can have serious negative effects on identity-threatened groups in specific contexts: Blacks and Latinx in the sciences; women in math; etc. Further, low-stakes testing reduces the anxiety associated with high-stakes testing since one’s future isn’t riding on one or two tests. And we know that student anxiety in general has reached frightening levels. In 1985, 18% of incoming first-year students agreed that they “felt overwhelmed by all I had to do.” In 2010, that figure increased to 29%. Last year, it surged to 41%.

b. Frequent testing allows students to practice remembering, a useful capacity-building skill. But, to help students not just retain information but use retrieval practices to move to higher order thinking, the testing should be “effortful” (e.g., short answer rather than multiple choice), spaced (allowing enough time to elapse for some forgetting to occur) and “interleaved” (one should introduce different topics and problems rather than having students master one topic at a time and then move on to the next). (These are points that Brown, Roediger, and McDaniel stress in Make It Stick.)

c. Frequent testing helps students know where they stand in a course, making it more likely that they will seek help when needed, i.e., when it can actually make a difference. Particularly in STEM courses, if students only realize that they are in a very deep pit at the mid-term point, there is little that they can do to climb out.

  2. Feedback is better than no feedback.

While the research suggests that low-stakes testing without feedback is actually better than relying on studying alone in terms of student learning, it also indicates the key role that instructor feedback plays in boosting student learning.

a. Most obviously, feedback from the instructor will let students know where they stand in a course. As faculty, we often have our own ways of intuiting how students are doing in our classes without resorting to formal testing mechanisms. Are they coming to class? Do they look engaged? Are they falling asleep in class? Have they come to office hours when asked? Some of these indicators are fairly accurate; others can be not only wrong but seriously problematic, particularly if they feed off of implicit biases. But the point is that students don’t share these “insights” as to how they are performing in a course. Many will feel that they are doing just fine, when they are actually standing at the edge of a precipice. (A recent study reported on in the New York Times suggests that men, more than women, overestimate their abilities in science classes. Who would have thought!) Concrete feedback from instructors can help students obtain a more objective sense of how they are doing. They may not always “hear” it or respond appropriately to such information, but offering feedback most often is quite welcome and useful.

b. Providing feedback on correct responses as well as incorrect answers is highly useful. As Brame and Biel discuss in the above-cited article, feedback on both low-confidence correct answers and incorrect answers may further enhance the testing effect, allowing students to solidify their understanding of concepts about which they are unclear.

c. Feedback can come from instructors on exams or other assessments, or from in-class peers. One research study found that when students answer an in-class conceptual question individually using clickers, discuss it with their neighbors, and then re-vote on the same question, the percentage of correct answers typically increases.

  3. Low-stakes, frequent testing with feedback helps students think more intentionally about their own learning.

I have written before about ways to support student metacognition. One way to do this is to encourage students to reflect on their own process of learning by having them respond to feedback on exams, quizzes, or other assignments.

a. Students can “self-test” as part of their studying process, and then measure the outcome versus their expectation, but few will actually do this, and those who do are often not the ones who most need additional help. Providing concrete evidence of accomplishments can help students more accurately assess their level of comprehension and (hopefully) seek help to address those areas that need attention.

b. Frequent assignments tied to reflection exercises can help students think about what they did (to prepare for an assignment, for example), what the outcome was, and what they will do differently the next time.

c. No-stakes formative assessment (a short quiz at the end of the class or a simple “one-minute paper”) can help students think in a more focused fashion about what they know and what they need to focus on.

  4. Design questions that can move students to higher order thinking.

While not every exam must be “effortful,” with some thought, one can design multiple choice questions to lead to higher order thinking.

a. Cynthia Brame, from Vanderbilt’s Center for Teaching, provides the following advice: Presenting a problem that requires application of course principles, analysis of a problem, or evaluation of alternatives tests students’ ability to engage in higher-order thinking. Instructors can design problems that require multilogical thinking (“thinking that requires knowledge of more than one fact to logically and systematically apply concepts to a …problem”).

Here are two examples she provides:

[The two example questions appeared as images in the original post and are not reproduced here.]

b. In Dynamic Testing: The Nature and Measurement of Learning Potential, Robert Sternberg and Elena Grigorenko suggest, as the title indicates, a more dynamic approach to testing than what is typically provided in static, standardized tests. Essentially, this calls for determining the state of one’s knowledge; refocusing learning on areas that need more attention; and retesting to measure improvement. It is testing that helps the learner focus on new knowledge (i.e., on “learning”), rather than on what has already been accomplished. In this sense, well-designed testing can move the student to higher cognitive levels.

  5. Break large assignments/exams into smaller pieces.

Designing, for example, eight assessment tools that are effortful, spaced, and interleaved takes some work, particularly when you are accustomed to giving a midterm and a final. True enough, but think about breaking the high-stakes assignments you already give into smaller pieces, as suggested by Sara Jones at Michigan State University’s “Inside Teaching.”

a. Replace large, high-stakes tests with several quizzes.

b. Scaffold large projects (independent research projects, term papers, etc.) into a variety of related steps: topic, preliminary bibliography, outline, final bibliography, draft, etc.

In the end, if there’s one “take-away” from this posting, it is to become more intentional when thinking about the assessments we rely on, particularly in terms of exams or quizzes. The research is quite conclusive that, when well prepared and followed by feedback, “retrieval practices” are more effective in reinforcing student learning than studying in and of itself. Frequent, low-stakes assessments are an important way to promote student learning.

The Stereotype Threat

Steven Volk, February 29, 2016

I was recently reading a blog post by Sharon Salzberg, a meditation teacher. She wrote about a trip she took in the 1980s down the Zambezi River in Zimbabwe en route to a teaching post. The environs were beautiful and she asked the tour guide if they could stop and walk along the shore. No way, he replied. The banks of the river had been strewn with landmines from the civil war and they still remained. The likelihood was that she would be blown up. Nor was Zimbabwe the only place in the world where walks in the countryside can carry fatal consequences. There are an estimated 110 million landmines in place around the world, and many, if not most, will remain long after hostilities have ceased since it is much more expensive to remove a landmine than to put one in.

The experience led Salzberg to think about her own emotional landmines and the ways that we often think of ourselves as inadequate. And it led me to think about the hidden “landmines” that we, and the larger society, have placed in the path of many of our students. What I want to address here are those specific “landmines” which have been studied under the concept of “stereotype threats.”

Maya, a first-year student, has enrolled in a calculus course, the first of her college career. She studied hard for the mid-term exam but, reasonably, still feels nervous about how she’ll do on it. As she turns over the exam booklet, she is asked to fill in her name, class year, major (if any), and gender. She takes the exam. What the research shows is that she will do more poorly on the exam than would have been the case if the professor hadn’t asked her to fill in her gender.

What is impacting Maya’s exam performance has been called the “stereotype threat.” The term was coined by Claude Steele and Joshua Aronson in a 1995 article in the Journal of Personality and Social Psychology. In it, the authors argued that when a person’s identity has a negative stereotype attached to it and he or she engages in activity relevant to that stereotype, there is a likelihood of a negative impact. To clarify further: the stereotype must be salient to the person in question, and the domain or activity that person is engaging in must be important to them for the threat to have a (negative) consequence. In the case of “Maya,” girls and women are presumed to be worse at math than boys and men in western cultures. That’s the stereotype. In this case the student in question is taking an exam which is quite important to her. That’s the relevant activity. To the extent that Maya’s gender is brought in, the research has found that she will not do as well on the exam as she could have had her gender been omitted from the exam paper.

Much of the information for this article comes from Steele and Aronson’s work, Claude Steele’s Whistling Vivaldi: How Stereotypes Affect Us and What We Can Do (Norton, 2011), a recent “Teaching in Higher Ed” podcast on “The Potential Impact of Stereotype Threat,” which featured an interview with Robin Paige, a sociologist at Rice University who is Assistant Director of the Center for Teaching Excellence, and the website reducingstereotypethreat.org, run by Catherine Good and Steven Stroessner, two social psychologists (at Baruch and Barnard, respectively). The website, in particular, offers superb resources for faculty, providing succinct summaries of research on stereotype threat, raising unresolved issues and controversies in the research literature, and offering research-based suggestions for reducing the negative consequences of stereotyping, particularly in academic settings.


Steele and Aronson ran a number of experiments in the 1990s which found that Black first-years and sophomores would perform more poorly than White students on standardized tests when their race was emphasized as part of the test. When race was not emphasized, however, Black students performed better, on a par with White students. The results showed that performance in academic contexts can be harmed by the awareness that one’s behavior might be viewed through the lens of racial stereotypes. Since that time, over 300 experiments on stereotype threat have been published in peer-reviewed journals (see Nguyen & Ryan, 2008 and Walton & Cohen, 2003 for meta-analyses).

Researchers have found that stereotype threats can have a negative impact not only on activities involving assessments of learning (exams, papers, presentations), but they can also impact learning itself. And, to the extent that learning and assessment often merge, there is an obvious double vulnerability.

Besides Steele and Aronson’s original research findings, the reducingstereotypethreat.org website reports on five new directions that research on the stereotype threat has taken. I’ll summarize them briefly here:

  • Research has shown that the consequences of stereotype threat can also lead to “self-handicapping strategies” (for example, students will spend less practice time on a task) or a reduced sense of belonging to the stereotyped domain. To the degree that individuals value the domain in question, stereotype threats can lead students to choose not to pursue a relevant activity (math, science, etc.). Such choices can obviously limit the range of professions such students can hope to follow. In this context, we can see how some long-term effects of stereotype threat can contribute to greater educational and social inequality. Furthermore, stereotype threat has been shown to affect stereotyped individuals’ performance in a number of domains beyond academics, such as women in negotiations or gay men in providing childcare.
  • We now have a better understanding of who can be vulnerable to stereotype threat: basically anyone. Stereotype threat can harm the academic performance of any individual for whom the situation invokes a stereotype-based expectation of poor performance (e.g. ethnic groups, low-income or first-generation students, females in math, etc.). At the same time, research also demonstrates that within a stereotyped group, some members may be more vulnerable to its negative consequences than others, depending on the strength of one’s group identification or domain identification.
  • We know more about the situations that are most likely to lead to stereotype threat. In general, the conditions that produce stereotype threat are ones in which a highlighted stereotype implicates the self through association with a relevant social category. This is the case in the first example I used, women and math: performance can be undermined because of concerns about the possibility of confirming negative stereotypes about one’s group. Thus, situations that increase the salience of the stereotyped group identity can increase vulnerability to stereotype threat.
  • Although the research is not entirely clear on the mechanisms by which negative stereotypes lead to demonstrated outcomes, we are beginning to better understand some factors involved. Recent research has shown that stereotype threat can reduce working memory resources, ultimately undermining one’s ability to successfully complete complex intellectual tasks.
  • Finally, researchers have begun to examine methods of reducing the negative effects of stereotype threats. Methods range from in-depth interventions to teach students about the malleable nature of intelligence to simple changes in classroom practices that can be easily implemented by the instructor, such as ensuring gender-fair testing. Much of this work coincides with the “mindset” research of Carol Dweck discussed here previously. I’ll present more of these recommendations at the end of the article.

Group Identities

Stereotype threats are most likely to impact group identities. Where one’s stereotyped group status is made relevant or conspicuous by situational features, threat and performance inhibitors are more likely (e.g. Blacks or Hispanics in sciences, women in math, men compared with women on social sensitivity, whites compared with Asian men in mathematics, etc.). Teachers may inadvertently highlight social identities in a variety of ways: asking for gender on a math test or, as Steele and Aronson suggested in their initial research, asking students to identify their ethnicity on test booklets. While it is unlikely that students would be asked to identify their race or ethnicity on regular class exams, most high-stakes testing situations – SATs or GREs, for example – do ask for this information. And, of course, faculty know their students so, unless they have a “blind” system of grading, these factors can always come in.

Marx and Goff (2005) asked Black and White undergraduates to complete a difficult verbal test administered by a Black or White examiner. Black students performed as well as White students when the test administrator was Black but more poorly when he/she was White. There was no difference in the results for White students.

New research has also suggested that stereotypes are quite contextual and contingent: in some areas of Asia, girls are “expected” to do better at math than boys, thus exams which stress gender in those settings can produce a stereotype “boost” for girls. Young Asian-American women taking a math assessment will often do better when primed for race and ethnicity, and worse when primed for gender. Of course, all of this is much more complicated as other stereotypes (such as the “model minority stereotype”) become involved. Research has found that faculty are less likely to give Asian American students help in class, i.e. to approach them to see if they have questions.

Other research has shown that minority status can also contribute to worse performance in conditions that already produce a stereotype threat. Negative results have been demonstrated where one individual is (or even expects to be) the single representative of a stereotyped group. Thus, research has found that women’s performance on math exams declined as the number of men in the room taking the test increased.


As with any developing body of research, various issues remain unresolved and criticisms have emerged.

Some critics have challenged the research design and methodology of Steele and Aronson. Most of the early studies were conducted exclusively with college students and researchers charged that this was too narrow a base on which to put forward a broad hypothesis about human behavior. (This critique has been answered by a large number of studies using populations from young children to adults in workplace settings. The research on stereotype threat has proven to be highly consistent across populations and contexts.)

Stereotype threat research has been criticized for its inability to fully account for performance differences. In a 2004 paper, Steele and Aronson acknowledge that persistent racial differences in standardized testing have multiple causes and that stereotype threat is not a “silver-bullet cure for the race gap.” Certainly pre-existing differences in test scores must be considered, and current research suggests that stereotype threat may be one of many factors that contribute to performance differences on standardized tests.

Some researchers question whether stereotype threat effects occur in “real-world” settings. In other words, are stereotype threat effects more likely to occur in “laboratory” settings than they would in the non-academic world?

One of the most significant critiques (which was addressed by Steele in Whistling Vivaldi) is the notion that negative stereotype impacts can be easily corrected because “it’s all in your head” and we can remove stereotype threat by removing specific domain indicators (gender designation on math tests, for example). Not only does this put the onus of responsibility for discrimination back onto the individual student as opposed to the larger context (higher education and a society that fosters inequality), but it somehow suggests that racism or sexism or class bias aren’t based on material and historical practices. If people move away from a young Black man who walks down a street late at night, if police arrest a Black woman for not signaling a lane change, that indicates the real and pervasive impact of racism which will not be removed by changing how tests are administered; it is a reality that is impervious to any level of success which Blacks have been able to achieve.


Nonetheless, broad research indicates that stereotype threat is real and has a concrete impact on the ability of our students to learn. So it is important that we consider evidence-based recommendations to improve our pedagogical practices in this area. What follows are some recommendations culled from the “reducing stereotype” website, the podcast, and other sources:

  • Reframe the task: use different language to describe the task or test being used. Modifying task descriptions so that such stereotypes are not invoked or are disarmed can begin to eliminate stereotype threat.
  • Move from “proving” to “improving”: use more low-stakes testing that puts an emphasis on improving learning rather than on diagnostic tests that set out to “prove” a student’s “intelligence”.
  • Address test fairness: where one can’t remove the diagnostic (“proving”) nature of a test (e.g. in regular course examinations) or in standardized testing situations, stereotype threats can be reduced by directly addressing the specter of gender-based performance differences within the context of explicitly diagnostic examinations (Good, Aronson & Harder, 2008). Simply addressing the fairness of the test while retaining its diagnostic nature can alleviate stereotype threat in any testing situation. Concretely, testing procedures should include a brief statement that the test, although diagnostic of underlying mathematics ability, is gender-fair (or race-fair).
  • De-emphasize threatened social identities: modify procedures that heighten the salience of stereotyped group memberships. One study (Stricker and Ward, 2004) found that moving standard demographic inquiries about ethnicity and gender to the end of the test resulted in significantly higher performance for women taking the AP calculus test.
  • Encourage individuals to think of themselves in ways that reduce the salience of a threatened identity: women who were encouraged to think of themselves in terms of their valued and unique characteristics were less likely to experience stereotype threat in mathematics. Encouraging individuals to think of characteristics that are shared by in-group and out-group members, particularly characteristics in the threatened domain, appears to preclude the development of stereotype threat in conditions that normally produce it.
  • Encourage self-affirmation: Help students to think about their characteristics, skills, values, or roles that they value or view as important (Schimel, Arndt, Banko, & Cook, 2004). A study led by Oberlin’s Cindy Frantz (Frantz, Cuddy, Burnett, Ray, and Hart, 2004) showed that Whites who were given the opportunity to affirm their commitment to being nonracist were less likely to respond in a stereotypic fashion to an implicit measure of racial associations that had been described as indicative of racial bias. Another study (Cohen, Garcia, Apfel, and Master, 2006) described two field studies in which seventh grade students at racially-diverse schools were randomly assigned to self-affirm (indicating values that were important to them and then writing a brief essay indicating why those values were important) or not to self-affirm (indicating their least important values and writing an essay on why those values might be important to others) as a part of a regular classroom exercise. Although the intervention took only 15 minutes, the effects on academic performance during the semester were dramatic. African American students who had been led to self-affirm performed .3 grade points better during the semester than those who had not.
  • Emphasize high standards with assurances about capability for meeting them: The nature of the feedback provided regarding performance has been shown to affect perceived bias, student motivation, and domain identification. Constructive feedback appears most effective when it communicates high standards for performance but also assurances that the student is capable of meeting those high standards.
  • Provide role models: Role models who can demonstrate proficiency in a specific domain can reduce or even eliminate stereotype threat effects. Some research has found that even reading about successful role models can alleviate performance deficits under stereotype threat.
  • Provide external attributions for difficulty: One reason that stereotype threat harms performance is that anxiety and associated thoughts distract threatened individuals from focusing on the task at hand. Several studies have shown that providing individuals with effective strategies for regulating anxiety can disarm stereotype threat.
  • Emphasize an incremental (“growth”) view of intelligence. Carol Dweck’s research has suggested that students who think of intelligence as a quality that can be developed and that changes across contexts or over time (an “incremental theory”), will be better able to overcome obstacles than students with a “fixed” idea of intelligence. African American students who were encouraged to view intelligence as malleable, “like a muscle” that can grow with work and effort, were more likely to indicate greater enjoyment and valuing of education and did better in school. (The opposite is also true: attributing gender differences in mathematics to genetics reduced performance of women on a math test compared with conditions in which differences were explained in terms of experience.) In short, emphasize the importance of effort and motivation in performance and de-emphasize inherent “talent” or “genius.”
  • Finally, the nature of the campus environment and culture can also impact the ability of students to overcome racial stereotype threats. A campus that is afraid to bring up issues of race and racism and which fosters the idea that the campus is a “color-blind” environment can be quite damaging for students of color on campus. If we don’t find a way to talk about these issues, we are not helping all of our students learn. We have to think about this in our classes: What are our interactions with students like? Who is being represented in our classes and who is left out? Are we making our classes as inclusive as possible? Students of color are often left being the ones talking about these issues whereas race is something that defines us all – being White is also a racial construction. We all have to see ourselves as part of that ongoing conversation. As we know, this is not easy. As Robin Paige asked: How do we create a space of accountability without saying “you’re a bad person for saying that and I won’t deal with you”? How do we create a space for learning that understands both that we have to be accountable for our ideas, and that we (and our ideas) can change, that college is, above all, a place for re-thinking and re-examination? Ultimately, talking about inequality can be “priming” for students, encouraging them to overcome obstacles to their education as they prepare themselves for a life after college.