The Impact of Peer Assessment on Academic Performance: A Meta-analysis of Control Group Studies
- Open access
- Published: 10 December 2019
- Volume 32, pages 481–509 (2020)
- Kit S. Double (ORCID: orcid.org/0000-0001-8120-1573), Joshua A. McGrane & Therese N. Hopfenbeck
Peer assessment has been the subject of considerable research interest over the last three decades, with numerous educational researchers advocating for the integration of peer assessment into schools and instructional practice. Research synthesis in this area has, however, largely relied on narrative reviews to evaluate the efficacy of peer assessment. Here, we present a meta-analysis (54 studies, k = 141) of experimental and quasi-experimental studies that evaluated the effect of peer assessment on academic performance in primary, secondary, or tertiary students across subjects and domains. An overall small to medium effect of peer assessment on academic performance was found (g = 0.31, p < .001). The results suggest that peer assessment improves academic performance compared with no assessment (g = 0.31, p = .004) and teacher assessment (g = 0.28, p = .007), but was not significantly different in its effect from self-assessment (g = 0.23, p = .209). Additionally, meta-regressions examined the moderating effects of several feedback and educational characteristics (e.g., online vs offline, frequency, education level). Results suggested that the effectiveness of peer assessment was remarkably robust across a wide range of contexts. These findings provide support for peer assessment as a formative practice and suggest several implications for the implementation of peer assessment into the classroom.
Feedback is often regarded as a central component of educational practice and crucial to students’ learning and development (Fyfe and Rittle-Johnson 2016; Hattie and Timperley 2007; Hays et al. 2010; Paulus 1999). Peer assessment has been identified as one method for delivering feedback efficiently and effectively to learners (Topping 1998; van Zundert et al. 2010). The use of students to generate feedback about the performance of their peers is referred to in the literature using various terms, including peer assessment, peer feedback, peer evaluation, and peer grading. In this article, we adopt the term peer assessment, as it more generally refers to the method of peers assessing or being assessed by each other, whereas the term feedback is used when we refer to the actual content or quality of the information exchanged between peers. This feedback can be delivered in a variety of forms including written comments, grading, or verbal feedback (Topping 1998). Importantly, by performing both the role of assessor and being assessed themselves, students’ learning can potentially benefit more than if they are just assessed (Reinholz 2016).
Peer assessments tend to be highly correlated with teacher assessments of the same students (Falchikov and Goldfinch 2000 ; Li et al. 2016 ; Sanchez et al. 2017 ). However, in addition to establishing comparability between teacher and peer assessment scores, it is important to determine whether peer assessment also has a positive effect on future academic performance. Several narrative reviews have argued for the positive formative effects of peer assessment (e.g., Black and Wiliam 1998a ; Topping 1998 ; van Zundert et al. 2010 ) and have additionally identified a number of potentially important moderators for the effect of peer assessment. This meta-analysis will build upon these reviews and provide quantitative evaluations for some of the instructional features identified in these narrative reviews by utilising them as moderators within our analysis.
Evaluating the Evidence for Peer Assessment
Despite the optimism surrounding peer assessment as a formative practice, there are relatively few control group studies that evaluate the effect of peer assessment on academic performance (Flórez and Sammons 2013; Strijbos and Sluijsmans 2010). Most studies on peer assessment have tended to focus on either students’ or teachers’ subjective perceptions of the practice rather than its effect on academic performance (e.g., Brown et al. 2009; Young and Jackman 2014). Moreover, interventions involving peer assessment often confound the effect of peer assessment with other assessment practices that are theoretically related under the umbrella of formative assessment (Black and Wiliam 2009). For instance, Wiliam et al. (2004) reported a mean effect size of .32 in favour of a formative assessment intervention, but they were unable to determine the unique contribution of peer assessment to students’ achievement, as it was one of more than 15 assessment practices included in the intervention.
However, as shown in Fig. 1 , there has been a sharp increase in the number of studies related to peer assessment, with over 75% of relevant studies published in the last decade. Although it is still far from being the dominant outcome measure in research on formative practices, many of these recent studies have examined the effect of peer assessment on objective measures of academic performance (e.g., Gielen et al. 2010a ; Liu et al. 2016 ; Wang et al. 2014a ). The number of studies of peer assessment using control group designs also appears to be increasing in frequency (e.g., van Ginkel et al. 2017 ; Wang et al. 2017 ). These studies have typically compared the formative effect of peer assessment with either teacher assessment (e.g., Chaney and Ingraham 2009 ; Sippel and Jackson 2015 ; van Ginkel et al. 2017 ) or no assessment conditions (e.g., Kamp et al. 2014 ; L. Li and Steckelberg 2004 ; Schonrock-Adema et al. 2007 ). Given the increase in peer assessment research, and in particular experimental research, it seems pertinent to synthesise this new body of research, as it provides a basis for critically evaluating the overall effectiveness of peer assessment and its moderators.
Number of records returned by year. The following search terms were used: ‘peer assessment’ or ‘peer grading’ or ‘peer evaluation’ or ‘peer feedback’. Data were collated by searching Web of Science (www.webofknowledge.com) for these keywords and categorising the records by year
Efforts to synthesise peer assessment research have largely been limited to narrative reviews, which have made very strong claims regarding the efficacy of peer assessment. For example, in a review of peer assessment with tertiary students, Topping ( 1998 ) argued that the effects of peer assessment are, ‘as good as or better than the effects of teacher assessment’ (p. 249). Similarly, in a review on peer and self-assessment with tertiary students, Dochy et al. ( 1999 ) concluded that peer assessment can have a positive effect on learning but may be hampered by social factors such as friendships, collusion, and perceived fairness. Reviews into peer assessment have also tended to focus on determining the accuracy of peer assessments, which is typically established by the correlation between peer and teacher assessments for the same performances. High correlations have been observed between peer and teacher assessments in three meta-analyses to date ( r = .69, .63, and .68 respectively; Falchikov and Goldfinch 2000 ; H. Li et al. 2016 ; Sanchez et al. 2017 ). Given that peer assessment is often advocated as a formative practice (e.g., Black and Wiliam 1998a ; Topping 1998 ), it is important to expand on these correlational meta-analyses to examine the formative effect that peer assessment has on academic performance.
In addition to examining the correlation between peer and teacher grading, Sanchez et al. ( 2017 ) additionally performed a meta-analysis on the formative effect of peer grading (i.e., a numerical or letter grade was provided to a student by their peer) in intervention studies. They found that there was a significant positive effect of peer grading on academic performance for primary and secondary (grades 3 to 12) students ( g = .29). However, it is unclear whether their findings would generalise to other forms of peer feedback (e.g., written or verbal feedback) and to tertiary students, both of which we will evaluate in the current meta-analysis.
Moderators of the Effectiveness of Peer Assessment
Theoretical frameworks of peer assessment propose that it is beneficial in at least two respects. Firstly, peer assessment allows students to critically engage with the assessed material, to compare and contrast performance with their peers, and to identify gaps or errors in their own knowledge (Topping 1998 ). In addition, peer assessment may improve the communication of feedback, as peers may use similar and more accessible language, as well as reduce negative feelings of being evaluated by an authority figure (Liu et al. 2016 ). However, the efficacy of peer assessment, like traditional feedback, is likely to be contingent on a range of factors including characteristics of the learning environment, the student, and the assessment itself (Kluger and DeNisi 1996 ; Ossenberg et al. 2018 ). Some of the characteristics that have been proposed to moderate the efficacy of feedback include anonymity (e.g., Rotsaert et al. 2018 ; Yu and Liu 2009 ), scaffolding (e.g., Panadero and Jonsson 2013 ), quality and timing of the feedback (Diab 2011 ), and elaboration (e.g., Gielen et al. 2010b ). Drawing on the previously mentioned narrative reviews and empirical evidence, we now briefly outline the evidence for each of the included theoretical moderators.
It is somewhat surprising that most studies that examine the effect of peer assessment tend to only assess the impact on the assessee and not the assessor (van Popta et al. 2017 ). Assessing may confer several distinct advantages such as drawing comparisons with peers’ work and increased familiarity with evaluative criteria. Several studies have compared the effect of assessing with being assessed. Lundstrom and Baker ( 2009 ) found that assessing a peer’s written work was more beneficial for their own writing than being assessed by a peer. Meanwhile, Graner ( 1987 ) found that students who were receiving feedback from a peer and acted as an assessor did not perform better than students who acted as an assessor but did not receive peer feedback. Reviewing peers’ work is also likely to help students become better reviewers of their own work and to revise and improve their own work (Rollinson 2005 ). While, in practice, students will most often act as both assessor and assessee during peer assessment, it is useful to gain a greater insight into the relative impact of performing each of these roles for both practical reasons and to help determine the mechanisms by which peer assessment improves academic performance.
Peer Assessment Type
The characteristics of peer assessment vary greatly both in practice and within the research literature. Because meta-analysis is unable to capture all of the nuanced dimensions that determine the type, intensity, and quality of peer assessment, we focus on distinguishing between what we regard as the most prevalent types of peer assessment in the literature: grading, peer dialogs, and written assessment. Each of these peer assessment types is widely used in the classroom and often in various combinations (e.g., written qualitative feedback in combination with a numerical grade). While these assessment types differ substantially in terms of their cognitive complexity and comprehensiveness, each has shown at least some evidence of impacting academic performance (e.g., Sanchez et al. 2017; Smith et al. 2009; Topping 2009).
Peer assessment is often implemented in conjunction with some form of scaffolding, for example, rubrics and scoring scripts. Scaffolding has been shown to improve the quality of peer assessment and to increase the amount of feedback assessors provide (Peters, Körndle, & Narciss, 2018). Peer assessment has also been shown to be more accurate when rubrics are utilised; for example, Panadero, Romero, and Strijbos (2013) found that students were less likely to overscore their peers when a rubric was provided.
Increasingly, peer assessment has been performed online due in part to the growth in online learning activities as well as the ease by which peer assessment can be implemented online (van Popta et al. 2017 ). Conducting peer assessment online can significantly reduce the logistical burden of implementing peer assessment (e.g., Tannacito and Tuzi 2002 ). Several studies have shown that peer assessment can effectively be carried out online (e.g., Hsu 2016 ; Li and Gao 2016 ). Van Popta et al. ( 2017 ) argue that the cognitive processes involved in peer assessment, such as evaluating, explaining, and suggesting, similarly play out in online and offline environments. However, the social processes involved in peer assessment are likely to substantively differ between online and offline peer assessment (e.g., collaborating, discussing), and it is unclear whether this might limit the benefits of peer assessment through one or the other medium. To the authors’ knowledge, no prior studies have compared the effects of online and offline peer assessment on academic performance.
Because peer assessment is fundamentally a collaborative assessment practice, interpersonal variables play a substantial role in determining the type and quality of peer assessment (Strijbos and Wichmann 2018). Some researchers have argued that anonymous peer assessment is advantageous because assessors are more likely to be honest in their feedback, and interpersonal processes cannot influence how assessees receive the assessment feedback (Rotsaert et al. 2018). Qualitative evidence suggests that anonymous peer assessment results in improved feedback quality and more positive perceptions towards peer assessment (Rotsaert et al. 2018; Vanderhoven et al. 2015). A recent qualitative review by Panadero and Alqassab (2019) found that three studies had compared anonymous peer assessment to a control group (i.e., open peer assessment) and looked at academic performance as the outcome. Their review found mixed evidence regarding the benefit of anonymity in peer assessment, with one of the included studies finding an advantage of anonymity but the other two finding little benefit. Others have questioned whether anonymity impairs cognitive and interpersonal development by limiting the collaborative nature of peer assessment (Strijbos and Wichmann 2018).
Peers are often novices at providing constructive assessment, and inexperienced learners tend to provide limited feedback (Hattie and Timperley 2007). Several studies have therefore suggested that peer assessment becomes more effective as students’ experience with peer assessment increases. For example, with greater experience, peers tend to use scoring criteria to a greater extent (Sluijsmans et al. 2004). Similarly, training students in peer assessment over time can improve the quality of the feedback they provide, although the effects may be limited by the extent of a student’s relevant domain knowledge (Alqassab et al. 2018). Frequent peer assessment may also increase positive learner perceptions of peer assessment (e.g., Sluijsmans et al. 2004). However, other studies have found that learner perceptions of peer assessment are not necessarily positive (Alqassab et al. 2018). This may suggest that learner perceptions of peer assessment vary depending on its characteristics (e.g., quality, detail).
Given the previous reliance on narrative reviews and the increasing research and teacher interest in peer assessment, as well as the popularity of instructional theories advocating for peer assessment and formative assessment practices in the classroom, we present a quantitative meta-analytic review to develop and synthesise the evidence in relation to peer assessment. This meta-analysis evaluates the effect of peer assessment on academic performance when compared to no assessment as well as teacher assessment. To do this, the meta-analysis only evaluates intervention studies that utilised experimental or quasi-experimental designs, i.e., only studies with control groups, so that the effects of maturation and other confounding variables are mitigated. Control groups can be either passive (e.g., no feedback) or active (e.g., teacher feedback). We meta-analytically address two related research questions:
What effect do peer assessment interventions have on academic performance relative to the observed control groups?
What characteristics moderate the effectiveness of peer assessment?
The specific methods of peer assessment can vary considerably, but there are a number of shared characteristics across most methods. Peers are defined as individuals at similar (i.e., within 1–2 grades) or identical education levels. Peer assessment must involve assessing or being assessed by peers, or both. Peer assessment requires the communication (either written, verbal, or online) of task-relevant feedback, although the style of feedback can differ markedly, from elaborate written and verbal feedback to holistic ratings of performance.
We took a deliberately broad definition of academic performance for this meta-analysis including traditional outcomes (e.g., test performance or essay writing) and also practical skills (e.g., constructing a circuit in science class). Despite this broad interpretation of academic performance, we did not include any studies that were carried out in a professional/organisational setting other than professional skills (e.g., teacher training) that were being taught in a traditional educational setting (e.g., a university).
To be included in this meta-analysis, studies had to meet several criteria. Firstly, a study needed to examine the effect of peer assessment. Secondly, the assessment could be delivered in any form (e.g., written, verbal, online), but needed to be distinguishable from peer-coaching/peer-tutoring. Thirdly, a study needed to compare the effect of peer assessment with a control group. Pre-post designs that did not include a control/comparison group were excluded because we could not discount the effects of maturation or other confounding variables. Moreover, the comparison group could take the form of either a passive control (e.g., a no assessment condition) or an active control (e.g., teacher assessment). Fourthly, a study needed to examine the effect of peer assessment on a non-self-reported measure of academic performance.
In addition to these criteria, a study needed to be carried out in an educational context or be related to educational outcomes in some way. Any level of education (i.e., tertiary, secondary, primary) was acceptable. A study also needed to provide sufficient data to calculate an effect size. If insufficient data was available in the manuscript, the authors were contacted by email to request the necessary data (additional information was provided for a single study). Studies also needed to be written in English.
The literature search was carried out on 8 June 2018 using PsycInfo, Google Scholar, and ERIC. These electronic databases were selected due to their relevance to educational instruction and practice; Google Scholar was used only to check for additional references, as it does not allow entries to be exported. Results were not filtered based on publication date, but ERIC only holds records from 1966 to the present. A deliberately wide selection of search terms was used in the first instance to capture all relevant articles. The search terms included ‘peer grading’ or ‘peer assessment’ or ‘peer evaluation’ or ‘peer feedback’, which were paired with ‘learning’ or ‘performance’ or ‘academic achievement’ or ‘academic performance’ or ‘grades’. All peer assessment-related search terms were included with and without hyphenation. In addition, an ancestry search (i.e., back-search) was performed on the reference lists of the included articles. Conference programs for major educational conferences were searched. Finally, unpublished results were sourced by emailing prominent authors in the field and through social media. Although there is significant disagreement about the inclusion of unpublished data and conference abstracts, i.e., ‘grey literature’ (Cook et al. 1993), we opted to include it in the first instance because including only published studies can result in a meta-analysis over-estimating effect sizes due to publication bias (Hopewell et al. 2007). It should, however, be noted that none of the substantive conclusions changed when the analyses were re-run with the grey literature excluded.
The database search returned 4072 records. An ancestry search returned an additional 37 potentially relevant articles. No unpublished data could be found. After duplicates were removed, two reviewers independently screened titles and abstracts for relevance. A kappa statistic was calculated to assess inter-rater reliability between the two coders and was found to be .78 (89.06% overall agreement, CI .63 to .94), which is above the recommended minimum level of inter-rater reliability (Fleiss 1971). Subsequently, the full text of articles that were deemed relevant based on their abstracts was examined to ensure that they met the selection criteria described previously. Disagreements between the coders were discussed and, when necessary, resolved by a third coder. Ultimately, 55 articles with 143 effect sizes met the inclusion criteria and were included in the meta-analysis. The search process is depicted in Fig. 2.
Flow chart for the identification, screening protocol, and inclusion of publications in the meta-analyses
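The kappa statistic used for the screening step above can be illustrated with a minimal Cohen's kappa computation. This is a sketch only; the screening decisions below are hypothetical and not the study's actual coding data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal proportions
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum((counts_a[c] / n) * (counts_b[c] / n)
                   for c in set(rater_a) | set(rater_b))
    return (observed - expected) / (1 - expected)

# Hypothetical title/abstract screening decisions (1 = include, 0 = exclude)
coder_1 = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
coder_2 = [1, 0, 0, 0, 1, 0, 0, 1, 0, 1]
print(round(cohens_kappa(coder_1, coder_2), 2))  # → 0.58
```

Note that kappa (0.58 here) is lower than raw agreement (80% here) because it discounts the agreement expected by chance, which is why it is preferred for screening reliability.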
A research assistant and the first author extracted data from the included papers. We took an iterative approach to the coding procedure, whereby the coders refined the classification of each variable as they progressed through the included studies to ensure that the classifications best characterised the extant literature. Below, the coding strategy is reviewed along with the classifications utilised. Frequency statistics and inter-rater reliability for the extracted data for the different classifications are presented in Table 1. All extracted variables showed at least moderate agreement, except for whether the peer assessment was freeform or structured, which showed fair agreement (Landis and Koch 1977).
Publications were classified into journal articles, conference papers, dissertations, reports, or unpublished records.
Education level was coded as either graduate tertiary, undergraduate tertiary, secondary, or primary. Given the small number of studies that utilised graduate samples (N = 2), we subsequently combined this classification with undergraduate to form a general tertiary category. In addition, we recorded the grade level of the students. Generally speaking, primary education refers to ages 6–12, secondary education to ages 13–18, and tertiary education is undertaken after the age of 18.
Age and Sex
The percentage of students in a study that were female was recorded. In addition, we recorded the mean age from each study. Unfortunately, only 55.5% of studies recorded participants’ sex and only 18.5% of studies recorded mean age information.
The subject area associated with the academic performance measure was coded. We also recorded the nature of the academic performance variable for descriptive purposes.
Studies were coded as to whether the students acted as peer assessors, assessees, or both assessors and assessees.
Four types of comparison group were found in the included studies: no assessment, teacher assessment, self-assessment, and reader-control. In many instances, a no assessment condition could be characterised as typical instruction; that is, two versions of a course were run—one with peer assessment and one without peer assessment. As such, while no specific teacher assessment comparison condition is referenced in the article, participants would most likely have received some form of teacher feedback as is typical in standard instructional practice. Studies were classified as having teacher assessment on the basis of a specific reference to teacher feedback being provided.
Studies were classified as self-assessment controls if there was an explicit reference to a self-assessment activity, e.g., self-grading/rating. Studies that only included revision, e.g., working alone on revising an assignment, were classified as no assessment rather than self-assessment because they did not necessarily involve explicit self-assessment. Studies where both the comparison and intervention groups received teacher assessment (in addition to peer assessment in the case of the intervention group) were coded as no assessment to reflect the fact that the comparison group received no additional assessment compared to the peer assessment condition. In addition, Philippakos and MacArthur ( 2016 ) and Cho and MacArthur ( 2011 ) were notable in that they utilised a reader-control condition whereby students read, but did not assess peers’ work. Due to the small frequency of this control condition, we ultimately classified them as no assessment controls.
Peer assessment was characterised using coding we believed best captured the theoretical distinctions in the literature. Our typology of peer assessment used three distinct components, which were combined for classification:
Did the peer feedback include a dialog between peers?
Did the peer feedback include written comments?
Did the peer feedback include grading?
Each study was classified using a dichotomous present/absent scoring system for each of the three components.
Studies were dichotomously classified as to whether a specific rubric, assessment script, or scoring system was provided to students. Studies that only provided basic instructions to students to conduct the peer feedback were coded as freeform.
Was the Assessment Online?
Studies were classified based on whether the peer assessment was online or offline.
Studies were classified based on whether the peer assessment was anonymous or identified.
Frequency of Assessment
Studies were coded dichotomously as to whether they involved only a single peer assessment occasion or, alternatively, whether students provided/received peer feedback on multiple occasions.
The level of transfer between the peer assessment task and the academic performance measure was coded into three categories:
No transfer—the peer-assessed task was the same as the academic performance measure. For example, a student’s assignment was assessed by peers and this feedback was utilised to make revisions before it was graded by their teacher.
Near transfer—the peer-assessed task was in the same or very similar format as the academic performance measure, e.g., an essay on a different, but similar topic.
Far transfer—the peer-assessed task was in a different form to the academic performance task, although they may have overlapping content. For example, a student’s assignment was peer assessed, while the final course exam grade was the academic performance measure.
We recorded how participants were allocated to a condition. Three categories of allocation were found in the included studies: random allocation at the class level, at the student level, or at the year/semester level. As only two studies allocated students to conditions at the year/semester level, we combined these studies with the studies allocated at the classroom level (i.e., as quasi-experiments).
Statistical Analyses of Effect Sizes
Effect size estimation and heterogeneity.
A random effects, multi-level meta-analysis was carried out using R version 3.4.3 (R Core Team 2017). The primary outcome was the standardised mean difference between the peer assessment and comparison (i.e., control) conditions. A common effect size metric, Hedges’ g, was calculated. A positive Hedges’ g value indicates comparatively higher values in the dependent variable in the peer assessment group (i.e., higher academic performance). Heterogeneity in the effect sizes was estimated using the I² statistic. I² is equivalent to the percentage of variation between studies that is due to heterogeneity (Schwarzer et al. 2015). Large values of the I² statistic suggest higher heterogeneity between studies in the analysis.
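As a rough illustration of the two statistics named above (not the authors' R analysis, which used dedicated packages), the following sketch computes Hedges' g and I² from summary data; all numbers are hypothetical.

```python
import math

def hedges_g(m1, m2, sd1, sd2, n1, n2):
    """Standardised mean difference with Hedges' small-sample correction."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / pooled_sd
    correction = 1 - 3 / (4 * (n1 + n2) - 9)  # J, the small-sample correction
    return correction * d

def i_squared(effects, variances):
    """I-squared: percentage of between-study variation due to heterogeneity."""
    weights = [1 / v for v in variances]
    pooled = sum(w * g for w, g in zip(weights, effects)) / sum(weights)
    q = sum(w * (g - pooled) ** 2 for w, g in zip(weights, effects))  # Cochran's Q
    df = len(effects) - 1
    return max(0.0, (q - df) / q * 100) if q > 0 else 0.0

# Hypothetical study: peer assessment M=10, control M=8, SD=4, n=20 per group
print(round(hedges_g(10, 8, 4, 4, 20, 20), 2))  # → 0.49
```

The correction factor J shrinks the raw standardised difference slightly, which matters most for the small samples common in classroom studies.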
Meta-regressions were performed to examine the moderating effects of the various factors that differed across the studies. We report the results of these meta-regressions alongside sub-groups analyses. While it is possible to determine whether sub-groups differ significantly from each other by checking whether the confidence intervals around their effect sizes overlap, sub-groups analysis may produce biased estimates when heteroscedasticity or multicollinearity is present (Steel and Kammeyer-Mueller 2002). We therefore performed meta-regressions separately for each predictor to test the overall effect of each moderator.
Finally, as this meta-analysis included students from primary school to graduate school, which are highly varied participant and educational contexts, we opted to analyse the data both in complete form, as well as after controlling for each level of education. As such, we were able to look at the effect of each moderator across education levels and for each education level separately.
Robust Variance Estimation
Often meta-analyses include multiple effect sizes from the same sample (e.g., the effect of peer assessment on two different measures of academic performance). Including these dependent effect sizes in a meta-analysis can be problematic, as it can potentially bias the results of the analysis in favour of studies that have more effect sizes. Robust Variance Estimation (RVE) was recently developed as a technique to address such concerns (Hedges et al. 2010). RVE allows for the modelling of dependence between effect sizes even when the nature of the dependence is not specifically known. In such situations, RVE results in unbiased estimates of fixed effects when dependent effect sizes are included in the analysis (Moeyaert et al. 2017). A correlated effects structure was specified for the meta-analysis (i.e., the random error in the effects from a single paper was expected to be correlated due to similar participants and procedures). A rho value of .8 was specified for the correlated effects (i.e., effects from the same study), as is standard practice when the correlation is unknown (Hedges et al. 2010). A sensitivity analysis indicated that none of the results varied as a function of the chosen rho. We utilised the ‘robumeta’ package (Fisher et al. 2017) to perform the meta-analyses. Our approach was to use only summative dependent variables when they were provided (e.g., overall writing quality score rather than individual trait measures), but to utilise individual measures when overall indicators were not available. When a pre-post design was used in a study, we adjusted the effect size for pre-intervention differences in academic performance as long as there was sufficient data to do so (e.g., t tests for pre-post change).
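The intuition behind correlated-effects RVE can be caricatured in a few lines: each study contributes its mean effect, down-weighted by how many (correlated) effects it supplies. This is a heavily simplified sketch of the Hedges et al. (2010) approach, not the robumeta implementation; the tau-squared estimation step, which is where rho = .8 enters, is omitted, and all data are hypothetical.

```python
def rve_pooled(study_effects, study_variances):
    """Simplified correlated-effects RVE: study j contributes its mean effect
    with approximate weight 1 / (k_j * vbar_j), where k_j is the number of
    effects in study j and vbar_j their mean sampling variance."""
    weights, means = [], []
    for effects, variances in zip(study_effects, study_variances):
        k = len(effects)
        v_bar = sum(variances) / k
        weights.append(1 / (k * v_bar))  # more effects -> smaller weight each
        means.append(sum(effects) / k)
    beta = sum(w * m for w, m in zip(weights, means)) / sum(weights)
    # Robust (sandwich-type) variance of the pooled estimate
    var_r = sum(w**2 * (m - beta) ** 2
                for w, m in zip(weights, means)) / sum(weights) ** 2
    return beta, var_r**0.5

# Two hypothetical studies: one with two dependent effects, one with a single effect
beta, se = rve_pooled([[0.2, 0.4], [0.5]], [[0.05, 0.05], [0.1]])
```

The key point the sketch captures is that a study reporting many outcomes does not dominate the pooled estimate simply by contributing more rows.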
Overall Meta-analysis of the Effect of Peer Assessment
Prior to conducting the analysis, two effect sizes (g = 2.06 and 1.91) were identified as outliers and removed using the outlier labelling rule (Hoaglin and Iglewicz 1987). Descriptive characteristics of the included studies are presented in Table 2. The meta-analysis indicated that there was a significant positive effect of peer assessment on academic performance (g = 0.31, SE = .06, 95% CI = .18 to .44, p < .001). A density graph of the recorded effect sizes is provided in Fig. 3. A sensitivity analysis indicated that the effect size estimates did not differ with different values of rho. Heterogeneity between the studies’ effect sizes was large, I² = 81.08%, supporting the use of a meta-regression/sub-groups analysis in order to explain the observed heterogeneity in effect sizes.
A density plot of effect sizes
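The outlier screening step above can be sketched as follows, assuming the conventional multiplier of 2.2 from Hoaglin and Iglewicz (1987); note that this multiplier (often written g in that literature) is unrelated to Hedges’ g.

```python
import numpy as np

def outlier_bounds(values, g=2.2):
    """Outlier labelling rule (Hoaglin & Iglewicz, 1987): values outside
    [Q1 - g*IQR, Q3 + g*IQR], with the recommended multiplier g = 2.2,
    are labelled outliers. (This g is the rule's tuning constant, not
    Hedges' g.)"""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - g * iqr, q3 + g * iqr

def flag_outliers(values, g=2.2):
    """Return the values that fall outside the labelling bounds."""
    lo, hi = outlier_bounds(values, g)
    return [v for v in values if v < lo or v > hi]
```

Applied to a set of effect sizes, a value such as g = 2.06 sitting far above the upper bound would be flagged and removed before pooling.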
Meta-Regressions and Sub-Groups Analyses
Effect sizes for sub-groups are presented in Table 3 . The results of the meta-regressions are presented in Table 4 .
A meta-regression with tertiary students as the reference category indicated no significant difference in effect size as a function of education level. The effect of peer assessment was similar for secondary students ( g = .44, p < .001) and primary school students ( g = .41, p = .006), and smaller for tertiary students ( g = .21, p = .043). There is, however, a strong theoretical basis for examining effects separately at each education level (primary, secondary, tertiary), given the large degree of heterogeneity across such a wide span of learning contexts (e.g., pedagogical practices, intellectual and social development of the students). We therefore report the data both as a whole and separately for each education level for all of the moderators considered here. Education level was contrast coded such that tertiary was compared with the average of secondary and primary, and secondary was compared with primary.
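The contrast coding described above can be sketched as follows. The exact code values are an illustrative assumption; any zero-sum, orthogonal pair of contrasts comparing tertiary with the mean of primary and secondary, and primary with secondary, is equivalent up to scaling.

```python
# Illustrative contrast codes (an assumption of this sketch):
# c1 compares tertiary with the mean of primary and secondary;
# c2 compares primary with secondary.
# Each column sums to zero, and the two columns are orthogonal.
CONTRASTS = {
    "primary":   (-0.5,  1.0),
    "secondary": (-0.5, -1.0),
    "tertiary":  ( 1.0,  0.0),
}

def contrast_row(level):
    """Return the (c1, c2) codes entered for a study's education level."""
    return CONTRASTS[level]
```

In a meta-regression, the coefficient on c1 then estimates the tertiary vs. non-tertiary difference, and the coefficient on c2 the primary vs. secondary difference.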
A meta-regression indicated that the effect size when peer assessment was compared with teacher assessment did not differ significantly from the effect size when peer assessment was compared with no assessment ( b = .02, 95% CI − .26 to .31, p = .865). The difference between peer assessment vs. no assessment and peer assessment vs. self-assessment was also not significant ( b = − .03, CI − .44 to .38, p = .860); see Table 4 . An examination of sub-groups suggested that peer assessment had a moderate positive effect compared with no-assessment controls ( g = .31, p = .004) and teacher assessment ( g = .28, p = .007), and did not differ significantly from self-assessment ( g = .23, p = .209). The meta-regression was also re-run with education level as a covariate, but the results were unchanged.
Meta-regressions indicated that the participant’s role was not a significant moderator of the effect size; see Table 4 . However, given the extremely small number of studies in which participants did not act as assessees ( n = 2) or as assessors ( n = 4), we did not perform a sub-groups analysis, as such analyses are unreliable with small samples (Fisher et al. 2017 ).
Given that many subject areas had few studies (see Table 1 ) and writing made up the largest share of effect sizes (40.74%), we opted to perform a meta-regression comparing writing with the other subject areas combined. The effect of peer assessment did not differ between writing ( g = .30, p = .001) and other subject areas ( g = .31, p = .002); b = − .003, 95% CI − .25 to .25, p = .979. Similarly, the results did not substantially change when education level was entered into the model.
The effect of peer assessment did not differ significantly when peer assessment included a written component ( g = .35, p < .001) from when it did not ( g = .20, p = .015); b = .144, 95% CI − .10 to .39, p = .241. Including education as a variable in the model did not change the effect of written feedback. Similarly, studies with a dialogue component ( g = .21, p = .033) did not differ significantly from those without one ( g = .35, p < .001); b = − .137, 95% CI − .39 to .12, p = .279.
Studies where peer feedback included a grading component ( g = .37, p < .001) did not differ significantly from those where it did not ( g = .17, p = .138). However, when education level was included in the model, there was a significant interaction effect between grading in tertiary students and the average effect of grading in primary and secondary students ( b = .395, 95% CI .06 to .73, p = .022). A follow-up sub-groups analysis showed that grading was beneficial for academic performance in tertiary students ( g = .55, p = .009), but not in secondary school students ( g = .002, p = .991) or primary school students ( g = − .08, p = .762). When the three variables used to characterise peer assessment were entered simultaneously, the results were unchanged.
The average effect size was not significantly different for studies where assessment was freeform, i.e., where no specific script or rubric was given ( g = .42, p = .030), compared with those where a specific script or rubric was provided ( g = .29, p < .001); b = − .13, 95% CI − .51 to .25, p = .455. However, there were few studies where feedback was freeform ( n = 9, k = 29). The results were unchanged when education level was controlled for in the meta-regression.
Studies where peer assessment was online ( g = .38, p = .003) did not differ from studies where assessment was offline ( g = .24, p = .004); b = .16, 95% CI − .10 to .42, p = .215. This result was unchanged when education level was included in the meta-regression.
There was no significant difference in terms of effect size between studies where peer assessment was anonymised ( g = .27, p = .019) and those where it was not ( g = .25, p = .004); b = .03, 95% CI − .22 to .28, p = .811. Nor was the effect significant when education level was controlled for.
Studies where peer assessment was performed only a single time ( g = .19, p = .103) did not differ significantly from those where it was performed multiple times ( g = .37, p < .001); b = − .17, 95% CI − .45 to .11, p = .223. It is worth noting, however, that the sub-groups analysis suggests that the effect of peer assessment was not significant when considering only studies that applied it a single time. The result did not change when education was included in the model.
There was no significant difference in effect size between studies utilising far transfer ( g = .21, p = .124) and those with near transfer ( g = .42, p < .001) or no transfer ( g = .29, p = .017). It is worth noting, however, that the sub-groups analysis suggests that the effect of peer assessment was not significant when the criterion task involved far transfer. As shown in Table 4 , these contrasts were also not significant when analysed using meta-regressions, either with or without education in the model.
Studies that allocated participants to experimental conditions at the student level ( g = .21, p = .14) did not differ from those that allocated conditions at the classroom or semester level ( g = .31, p < .001 and g = .79, p = .223, respectively); see Table 4 for the meta-regressions.
Risk of publication bias was assessed by inspecting the funnel plots (see Fig. 4 ) of the relationship between observed effects and standard error for asymmetry (Schwarzer et al. 2015 ). Egger’s test was also run by including standard error as a predictor in a meta-regression. Based on the funnel plots and a non-significant Egger’s test of asymmetry ( b = .886, p = .226), the risk of publication bias was judged to be low.
A funnel plot showing the relationship between standard error and observed effect size for the academic performance meta-analysis
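The Egger’s test described above can be sketched as a weighted least-squares meta-regression of effect size on standard error, with weights equal to inverse variance. This is a simplified illustration, not the exact model specification used in the analyses.

```python
import numpy as np

def eggers_test(effects, ses):
    """Egger-style asymmetry check: weighted least-squares regression of
    effect size on standard error (weights 1/variance). A coefficient on
    SE far from zero indicates funnel-plot asymmetry and hence possible
    publication bias."""
    y = np.asarray(effects, dtype=float)
    se = np.asarray(ses, dtype=float)
    w = 1.0 / se ** 2
    X = np.column_stack([np.ones_like(se), se])  # intercept + SE predictor
    XtWX = X.T @ (w[:, None] * X)
    beta = np.linalg.solve(XtWX, X.T @ (w * y))
    resid = y - X @ beta
    sigma2 = np.sum(w * resid ** 2) / max(len(y) - 2, 1)
    cov = sigma2 * np.linalg.inv(XtWX)
    return beta[1], np.sqrt(cov[1, 1])  # slope on SE and its std. error
```

A slope coefficient on standard error that differed significantly from zero would indicate funnel-plot asymmetry; the reported test here was non-significant.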
Proponents of peer assessment argue that it is an effective classroom technique for improving academic performance (Topping 2009 ). While previous narrative reviews have argued for the benefits of peer assessment, the current meta-analysis quantifies the effect of peer assessment interventions on academic performance within educational contexts. Overall, the results suggest that there is a positive effect of peer assessment on academic performance in primary, secondary, and tertiary students. The magnitude of the overall effect size was within the small to medium range (Sawilowsky 2009 ). These findings also suggest that the benefits of peer assessment are robust across many contextual factors, including different feedback and educational characteristics.
Recently, researchers have increasingly advocated for the role of assessment in promoting learning in educational practice (Wiliam 2018 ). Peer assessment forms a core part of theories of formative assessment because it is seen as providing new information about the learning process to the teacher or student, which in turn facilitates later performance (Pellegrino et al. 2001 ). The current results support the position that peer assessment can be an effective classroom technique for improving academic performance. The results suggest that peer assessment is effective compared with both no assessment (which often involved ‘teaching as usual’) and teacher assessment, suggesting that peer assessment can play an important formative role in the classroom. The findings suggest that structuring classroom activities around peer assessment may be an effective way to promote learning and optimise the use of teaching resources by permitting the teacher to focus on students with greater difficulties or on more complex tasks. Importantly, the results indicate that peer assessment can be effective across a wide range of subject areas, education levels, and assessment types. Pragmatically, this suggests that classroom teachers can implement peer assessment in a variety of ways and tailor the peer assessment design to the particular characteristics and constraints of their classroom context.
Notably, the results of this quantitative meta-analysis align well with past narrative reviews (e.g., Black and Wiliam 1998a ; Topping 1998 ; van Zundert et al. 2010 ). The fact that both quantitative and qualitative syntheses of the literature suggest that peer assessment can be beneficial provides a stronger basis for recommending peer assessment as a practice. However, several of the moderators of the effectiveness of peer feedback that have been argued for in the available narrative reviews (e.g., rubrics; Panadero and Jonsson 2013 ) received little support from this quantitative meta-analysis. As detailed below, this may suggest that the prominence of such feedback characteristics in narrative reviews is driven more by theoretical considerations than by quantitative empirical evidence. However, many of these moderating variables are complex (for example, rubrics can take many forms) and, due to this complexity, may not lend themselves as well to quantitative synthesis and aggregation (for a detailed discussion on combining qualitative and quantitative evidence, see Gorard 2002 ).
Mechanisms and Moderators
Indeed, the current findings suggest that the feedback characteristics deemed important by current theories of peer assessment may not be as significant as first thought. Previously, individual studies have argued for the importance of characteristics such as rubrics (Panadero and Jonsson 2013 ), anonymity (Bloom & Hautaluoma, 1987 ), and allowing students to practice peer assessment (Smith, Cooper, & Lancaster, 2002 ). While these feedback characteristics have been shown to affect the efficacy of peer assessment in individual studies, we find little evidence that they moderate the effect of peer assessment when analysed across studies. Many of the current models of peer assessment rely on qualitative evidence, theoretical arguments, and pedagogical experience to formulate theories about what determines effective peer assessment. While such evidence should not be discounted, the current findings also point to the need for better quantitative and experimental studies to test some of the assumptions embedded in these models. We suggest that the null findings observed in this meta-analysis regarding the proposed moderators of peer assessment efficacy should be interpreted cautiously, as more studies that experimentally manipulate these variables are needed to provide more definitive insight into how to design better peer assessment procedures.
While the current findings are ambiguous regarding the mechanisms of peer assessment, it is worth noting that without a solid understanding of the mechanisms underlying peer assessment effects, it is difficult to identify important moderators or optimally use peer assessment in the classroom. Often the research literature makes somewhat broad claims about the possible benefits of peer assessment. For example, Topping ( 1998 , p.256) suggested that peer assessment may, ‘promote a sense of ownership, personal responsibility, and motivation… [and] might also increase variety and interest, activity and interactivity, identification and bonding, self-confidence, and empathy for others’. Others have argued that peer assessment is beneficial because it is less personally evaluative—with evidence suggesting that teacher assessment is often personally evaluative (e.g., ‘good boy, that is correct’) which may have little or even negative effects on performance particularly if the assessee has low self-efficacy (Birney, Beckmann, Beckmann & Double 2017 ; Double and Birney 2017 , 2018 ; Hattie and Timperley 2007 ). However, more research is needed to distinguish between the many proposed mechanisms for peer assessment’s formative effects made within the extant literature, particularly as claims about the mechanisms of the effectiveness of peer assessment are often evidenced by student self-reports about the aspects of peer assessment they rate as useful. While such self-reports may be informative, more experimental research that systematically manipulates aspects of the design of peer assessment is likely to provide greater clarity about what aspects of peer assessment drive the observed benefits.
Our findings did indicate an important role for grading in determining the effectiveness of peer feedback: peer grading was beneficial for tertiary students but not for primary or secondary school students, suggesting that grading adds little to the peer feedback process for non-tertiary students. A recent meta-analysis of peer grading by Sanchez et al. ( 2017 ), in contrast, found a benefit for non-tertiary students, albeit based on a relatively small number of studies compared with the current meta-analysis. The present findings instead suggest that there may be significant qualitative differences in how students perform peer grading as they develop. For example, the criteria students use to assess ability may change as they age (Stipek and Iver 1989 ). It is difficult to ascertain precisely why grading has positive additive effects only in tertiary students, but there are substantial differences in pedagogy, curriculum, motivation for learning, and grading systems that may account for these differences. One possibility is that tertiary students are more ‘grade orientated’ and therefore put more weight on peer assessment that includes a specific grade. Further research is needed to explore the effects of grading at different education levels.
One of the more unexpected findings of this meta-analysis was the positive effect of peer assessment compared with teacher assessment. This finding is somewhat counterintuitive given the greater qualifications and pedagogical experience of the teacher. In addition, in many of the studies, the teacher had privileged knowledge about, and often graded, the outcome assessment. Thus, it seems reasonable to expect that teacher feedback would better align with assessment objectives and therefore produce better outcomes. Despite all these advantages, teacher assessment appeared to be less efficacious than peer assessment for academic performance. It is possible that the pedagogical disadvantages of peer assessment are compensated for by its affective or motivational aspects, or by the substantial benefits of acting as an assessor. However, more experimental research is needed to rule out the effects of the potential methodological issues discussed in detail below.
A major limitation of the current results is that they cannot adequately distinguish between the effects of assessing and of being assessed. Most of the included studies confound giving and receiving peer assessment in their designs (i.e., students in the peer assessment group both provide and receive assessment), and therefore no substantive conclusions can be drawn about whether the benefits of peer assessment derive from giving feedback, receiving feedback, or both. It remains possible that the benefit of peer assessment comes more from assessing than from being assessed (Usher 2018 ). Consistent with this, Lundstrom and Baker ( 2009 ) directly compared the effects of giving and receiving assessment on students’ writing performance and found that assessing was more beneficial than being assessed. Similarly, Graner ( 1987 ) found that assessing papers without being assessed was as effective for improving writing performance as assessing papers and receiving feedback.
Furthermore, more true experiments are needed, as there is evidence from the present results that they produce more conservative estimates of the effect of peer assessment. The studies included in this meta-analysis not only predominantly allocated conditions at the classroom level (i.e., quasi-experiments), but in all but one case were not analysed using techniques appropriate for clustered data (e.g., multi-level modelling). This is problematic because it makes it difficult to disentangle classroom-level effects (e.g., teacher quality) from the intervention effect, which may lead to biased statistical inferences (Hox 1998 ). While experimental designs with individual allocation are often not pragmatic for classroom interventions, online peer assessment interventions appear to be obvious candidates for more true experiments. In particular, carefully controlled experimental designs that examine the effect of specific assessment characteristics, rather than ‘black-box’ studies of the effectiveness of peer assessment, are crucial for understanding when and how peer assessment is most likely to be effective. For example, peer assessment may be counterproductive when learning novel tasks due to students’ inadequate domain knowledge (Könings et al. 2019 ).
While the current results provide an overall estimate of the efficacy of peer assessment in improving academic performance when compared with teacher assessment and no assessment, it should be noted that these effects are averaged across a wide range of outcome measures, including science project grades, essay writing ratings, and end-of-semester exam scores. Aggregating across such disparate outcomes is always problematic in meta-analysis and is a particular concern for meta-analyses in educational research, as some outcome measures are likely to be more sensitive to interventions than others (Wiliam, 2010 ). A further issue is that the effect of moderators may differ between academic domains. For example, some assessment characteristics may be important when teaching writing but not mathematics. Because there were too few studies in the individual academic domains (with the exception of writing), we were unable to account for these differential effects. The effects of the moderators reported here therefore need to be considered as overall averages that provide information about the extent to which the effect of a moderator generalises across domains.
Finally, the findings of the current meta-analysis are also somewhat limited by the fact that few studies gave a complete profile of the participants and measures used. For example, the ability of the peer assessor relative to the assessee, and the age difference between the peers, were often not clear. Furthermore, it was not possible to classify the academic performance measures further, such as by novelty, or to code for the quality of the measures, including their reliability and validity, because very few studies provided comprehensive details about the outcome measure(s) they utilised. Moreover, other important variables such as fidelity of treatment were almost never reported in the included manuscripts. Indeed, many of the included variables needed to be coded based on inferences from the included studies’ text rather than explicit statements, even when one would reasonably expect that information to be made clear in a peer-reviewed manuscript. The observed effect sizes reported here should therefore be taken as an indicator of average efficacy based on the extant literature and not an indication of expected effects for specific implementations of peer assessment.
Overall, our findings provide support for the use of peer assessment as a formative practice for improving academic performance. The results indicate that peer assessment is more effective than no assessment and teacher assessment and not significantly different in its effect from self-assessment. These findings are consistent with current theories of formative assessment and instructional best practice and provide strong empirical support for the continued use of peer assessment in the classroom and other educational contexts. Further experimental work is needed to clarify the contextual and educational factors that moderate the effectiveness of peer assessment, but the present findings are encouraging for those looking to utilise peer assessment to enhance learning.
References marked with an * were included in the meta-analysis
* AbuSeileek, A. F., & Abualsha'r, A. (2014). Using peer computer-mediated corrective feedback to support EFL learners'. Language Learning & Technology, 18 (1), 76-95.
Alqassab, M., Strijbos, J. W., & Ufer, S. (2018). Training peer-feedback skills on geometric construction tasks: Role of domain knowledge and peer-feedback levels. European Journal of Psychology of Education, 33 (1), 11–30.
* Anderson, N. O., & Flash, P. (2014). The power of peer reviewing to enhance writing in horticulture: Greenhouse management. International Journal of Teaching and Learning in Higher Education, 26 (3), 310–334.
* Bangert, A. W. (1995). Peer assessment: an instructional strategy for effectively implementing performance-based assessments. (Unpublished doctoral dissertation). University of South Dakota.
* Benson, N. L. (1979). The effects of peer feedback during the writing process on writing performance, revision behavior, and attitude toward writing. (Unpublished doctoral dissertation). University of Colorado, Boulder.
* Bhullar, N., Rose, K. C., Utell, J. M., & Healey, K. N. (2014). The impact of peer review on writing in a psychology course: Lessons learned. Journal on Excellence in College Teaching, 25 (2), 91–106.
* Birjandi, P., & Hadidi Tamjid, N. (2012). The role of self-, peer and teacher assessment in promoting Iranian EFL learners’ writing performance. Assessment & Evaluation in Higher Education, 37 (5), 513–533.
Birney, D. P., Beckmann, J. F., Beckmann, N., & Double, K. S. (2017). Beyond the intellect: Complexity and learning trajectories in Raven’s Progressive Matrices depend on self-regulatory processes and conative dispositions. Intelligence, 61 , 63–77.
Black, P., & Wiliam, D. (1998a). Assessment and classroom learning. Assessment in Education: Principles, Policy & Practice, 5 (1), 7–74.
Black, P., & Wiliam, D. (2009). Developing the theory of formative assessment. Educational Assessment, Evaluation and Accountability (formerly: Journal of Personnel Evaluation in Education), 21 (1), 5.
Bloom, A. J., & Hautaluoma, J. E. (1987). Effects of message valence, communicator credibility, and source anonymity on reactions to peer feedback. The Journal of Social Psychology, 127 (4), 329–338.
Brown, G. T., Irving, S. E., Peterson, E. R., & Hirschfeld, G. H. (2009). Use of interactive–informal assessment practices: New Zealand secondary students' conceptions of assessment. Learning and Instruction, 19 (2), 97–111.
* Califano, L. Z. (1987). Teacher and peer editing: Their effects on students' writing as measured by t-unit length, holistic scoring, and the attitudes of fifth and sixth grade students (Unpublished doctoral dissertation), Northern Arizona University.
* Chaney, B. A., & Ingraham, L. R. (2009). Using peer grading and proofreading to ratchet student expectations in preparing accounting cases. American Journal of Business Education, 2 (3), 39-48.
* Chang, S. H., Wu, T. C., Kuo, Y. K., & You, L. C. (2012). Project-based learning with an online peer assessment system in a photonics instruction for enhancing led design skills. Turkish Online Journal of Educational Technology-TOJET, 11(4), 236–246.
* Cho, K., & MacArthur, C. (2011). Learning by reviewing. Journal of Educational Psychology, 103 (1), 73.
Cho, K., Schunn, C. D., & Charney, D. (2006). Commenting on writing: Typology and perceived helpfulness of comments from novice peer reviewers and subject matter experts. Written Communication, 23 (3), 260–294.
Cook, D. J., Guyatt, G. H., Ryan, G., Clifton, J., Buckingham, L., Willan, A., et al. (1993). Should unpublished data be included in meta-analyses?: Current convictions and controversies. JAMA, 269 (21), 2749–2753.
*Crowe, J. A., Silva, T., & Ceresola, R. (2015). The effect of peer review on student learning outcomes in a research methods course. Teaching Sociology, 43 (3), 201–213.
* Diab, N. M. (2011). Assessing the relationship between different types of student feedback and the quality of revised writing . Assessing Writing, 16(4), 274-292.
Demetriadis, S., Egerter, T., Hanisch, F., & Fischer, F. (2011). Peer review-based scripted collaboration to support domain-specific and domain-general knowledge acquisition in computer science. Computer Science Education, 21 (1), 29–56.
Dochy, F., Segers, M., & Sluijsmans, D. (1999). The use of self-, peer and co-assessment in higher education: A review. Studies in Higher Education, 24 (3), 331–350.
Double, K. S., & Birney, D. (2017). Are you sure about that? Eliciting confidence ratings may influence performance on Raven’s progressive matrices. Thinking & Reasoning, 23 (2), 190–206.
Double, K. S., & Birney, D. P. (2018). Reactivity to confidence ratings in older individuals performing the latin square task. Metacognition and Learning, 13(3), 309–326.
* Enders, F. B., Jenkins, S., & Hoverman, V. (2010). Calibrated peer review for interpreting linear regression parameters: Results from a graduate course. Journal of Statistics Education , 18 (2).
* English, R., Brookes, S. T., Avery, K., Blazeby, J. M., & Ben-Shlomo, Y. (2006). The effectiveness and reliability of peer-marking in first-year medical students. Medical Education, 40 (10), 965-972.
* Erfani, S. S., & Nikbin, S. (2015). The effect of peer-assisted mediation vs. tutor-intervention within dynamic assessment framework on writing development of Iranian Intermediate EFL Learners. English Language Teaching, 8 (4), 128–141.
Falchikov, N., & Goldfinch, J. (2000). Student peer assessment in higher education: A meta-analysis comparing peer and teacher marks. Review of Educational Research, 70 (3), 287–322.
* Farrell, K. J. (1977). A comparison of three instructional approaches for teaching written composition to high school juniors: teacher lecture, peer evaluation, and group tutoring (Unpublished doctoral dissertation), Boston University, Boston.
Fisher, Z., Tipton, E., & Zhipeng, Z. (2017). robumeta: Robust variance meta-regression (Version 2). Retrieved from https://CRAN.R-project.org/package=robumeta
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76 (5), 378.
Flórez, M. T., & Sammons, P. (2013). Assessment for learning: Effects and impact: CfBT Education Trust . England: Reading.
Fyfe, E. R., & Rittle-Johnson, B. (2016). Feedback both helps and hinders learning: The causal role of prior knowledge. Journal of Educational Psychology, 108 (1), 82.
Gielen, S., Peeters, E., Dochy, F., Onghena, P., & Struyven, K. (2010a). Improving the effectiveness of peer feedback for learning. Learning and Instruction, 20 (4), 304–315.
* Gielen, S., Tops, L., Dochy, F., Onghena, P., & Smeets, S. (2010b). A comparative study of peer and teacher feedback and of various peer feedback forms in a secondary school writing curriculum. British Educational Research Journal , 36 (1), 143-162.
Gorard, S. (2002). Can we overcome the methodological schism? Four models for combining qualitative and quantitative evidence. Research Papers in Education Policy and Practice, 17 (4), 345–361.
Graner, M. H. (1987). Revision workshops: An alternative to peer editing groups. The English Journal, 76 (3), 40–45.
Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77 (1), 81–112.
Hays, M. J., Kornell, N., & Bjork, R. A. (2010). The costs and benefits of providing feedback during learning. Psychonomic bulletin & review, 17 (6), 797–801.
Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6 (2), 107–128.
Hedges, L. V., Tipton, E., & Johnson, M. C. (2010). Robust variance estimation in meta-regression with dependent effect size estimates. Research Synthesis Methods, 1 (1), 39–65.
Higgins, J. P., & Green, S. (2011). Cochrane handbook for systematic reviews of interventions. The Cochrane Collaboration. Version 5.1.0, www.handbook.cochrane.org
Hoaglin, D. C., & Iglewicz, B. (1987). Fine-tuning some resistant rules for outlier labeling. Journal of the American Statistical Association, 82 (400), 1147–1149.
Hopewell, S., McDonald, S., Clarke, M. J., & Egger, M. (2007). Grey literature in meta-analyses of randomized trials of health care interventions. Cochrane Database of Systematic Reviews .
* Horn, G. C. (2009). Rubrics and revision: What are the effects of 3rd graders using rubrics to self-assess or peer-assess drafts of writing? (Unpublished doctoral thesis), Boise State University.
Hox, J. J. (1998). Multilevel modeling: When and why. In I. Balderjahn, R. Mathar, & M. Schader (Eds.), Classification, data analysis, and data highways (pp. 147–154). New York: Springer Verlag.
* Hsia, L. H., Huang, I., & Hwang, G. J. (2016). A web-based peer-assessment approach to improving junior high school students’ performance, self-efficacy and motivation in performing arts courses. British Journal of Educational Technology, 47 (4), 618–632.
* Hsu, T. C. (2016). Effects of a peer assessment system based on a grid-based knowledge classification approach on computer skills training. Journal of Educational Technology & Society , 19 (4), 100-111.
* Hussein, M. A. H., & Al Ashri, El Shirbini A. F. (2013). The effectiveness of writing conferences and peer response groups strategies on the EFL secondary students' writing performance and their self efficacy (A Comparative Study). Egypt: National Program Zero.
* Hwang, G. J., Hung, C. M., & Chen, N. S. (2014). Improving learning achievements, motivations and problem-solving skills through a peer assessment-based game development approach. Educational Technology Research and Development, 62 (2), 129–145.
* Hwang, G. J., Tu, N. T., & Wang, X. M. (2018). Creating interactive E-books through learning by design: The impacts of guided peer-feedback on students’ learning achievements and project outcomes in science courses. Journal of Educational Technology & Society, 21 (1), 25–36.
* Kamp, R. J., van Berkel, H. J., Popeijus, H. E., Leppink, J., Schmidt, H. G., & Dolmans, D. H. (2014). Midterm peer feedback in problem-based learning groups: The effect on individual contributions and achievement. Advances in Health Sciences Education, 19 (1), 53–69.
* Karegianes, M. J., Pascarella, E. T., & Pflaum, S. W. (1980). The effects of peer editing on the writing proficiency of low-achieving tenth grade students. The Journal of Educational Research , 73 (4), 203-207.
* Khonbi, Z. A., & Sadeghi, K. (2013). The effect of assessment type (self vs. peer) on Iranian university EFL students’ course achievement. Procedia-Social and Behavioral Sciences , 70 , 1552-1564.
Kluger, A. N., & DeNisi, A. (1996). The effects of feedback interventions on performance: A historical review, a meta-analysis, and a preliminary feedback intervention theory. Psychological Bulletin, 119 (2), 254.
Könings, K. D., van Zundert, M., & van Merriënboer, J. J. G. (2019). Scaffolding peer-assessment skills: Risk of interference with learning domain-specific skills? Learning and Instruction, 60 , 85–94.
* Kurihara, N. (2017). Do peer reviews help improve student writing abilities in an EFL high school classroom? TESOL Journal, 8 (2), 450–470.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33 (1), 159–174.
* Li, L., & Gao, F. (2016). The effect of peer assessment on project performance of students at different learning levels. Assessment & Evaluation in Higher Education, 41 (6), 885–900.
* Li, L., & Steckelberg, A. (2004). Using peer feedback to enhance student meaningful learning . Chicago: Association for Educational Communications and Technology.
Li, H., Xiong, Y., Zang, X., Kornhaber, M. L., Lyu, Y., Chung, K. S., & Suen, K. H. (2016). Peer assessment in the digital age: a meta-analysis comparing peer and teacher ratings. Assessment & Evaluation in Higher Education, 41 (2), 245–264.
* Lin, Y.-C. A. (2009). An examination of teacher feedback, face-to-face peer feedback, and google documents peer feedback in Taiwanese EFL college students’ writing. (Unpublished doctoral dissertation), Alliant International University, San Diego, United States
Lipsey, M. W., & Wilson, D. B. (2001). Practical Meta-analysis . Thousand Oaks: SAGE publications.
* Liu, C.-C., Lu, K.-H., Wu, L. Y., & Tsai, C.-C. (2016). The impact of peer review on creative self-efficacy and learning performance in Web 2.0 learning activities. Journal of Educational Technology & Society, 19 (2), 286–297.
Lundstrom, K., & Baker, W. (2009). To give is better than to receive: The benefits of peer review to the reviewer's own writing. Journal of Second Language Writing, 18 (1), 30–43.
* McCurdy, B. L., & Shapiro, E. S. (1992). A comparison of teacher-, peer-, and self-monitoring with curriculum-based measurement in reading among students with learning disabilities. The Journal of Special Education , 26 (2), 162-180.
Moeyaert, M., Ugille, M., Natasha Beretvas, S., Ferron, J., Bunuan, R., & Van den Noortgate, W. (2017). Methods for dealing with multiple outcomes in meta-analysis: a comparison between averaging effect sizes, robust variance estimation and multilevel meta-analysis. International Journal of Social Research Methodology, 20 (6), 559–572.
* Montanero, M., Lucero, M., & Fernandez, M.-J. (2014). Iterative co-evaluation with a rubric of narrative texts in primary education. Journal for the Study of Education and Development, 37 (1), 184-198.
Morris, S. B. (2008). Estimating effect sizes from pretest-posttest-control group designs. Organizational Research Methods, 11 (2), 364–386.
* Olson, V. L. B. (1990). The revising processes of sixth-grade writers with and without peer feedback. The Journal of Educational Research, 84(1), 22–29.
Ossenberg, C., Henderson, A., & Mitchell, M. (2018). What attributes guide best practice for effective feedback? A scoping review. Advances in Health Sciences Education , 1–19.
* Ozogul, G., Olina, Z., & Sullivan, H. (2008). Teacher, self and peer evaluation of lesson plans written by preservice teachers. Educational Technology Research and Development, 56 (2), 181.
Panadero, E., & Alqassab, M. (2019). An empirical review of anonymity effects in peer assessment, peer feedback, peer review, peer evaluation and peer grading. Assessment & Evaluation in Higher Education , 1–26.
Panadero, E., & Jonsson, A. (2013). The use of scoring rubrics for formative assessment purposes revisited: A review. Educational Research Review, 9 , 129–144.
Panadero, E., Romero, M., & Strijbos, J. W. (2013). The impact of a rubric and friendship on peer assessment: Effects on construct validity, performance, and perceptions of fairness and comfort. Studies in Educational Evaluation, 39 (4), 195–203.
* Papadopoulos, P. M., Lagkas, T. D., & Demetriadis, S. N. (2012). How to improve the peer review method: Free-selection vs assigned-pair protocol evaluated in a computer networking course. Computers & Education, 59 (2), 182–195.
Paulus, T. M. (1999). The effect of peer and teacher feedback on student writing. Journal of second language writing, 8 (3), 265–289.
Pellegrino, J. W., Chudowsky, N., & Glaser, R. (2001). Knowing what students know: the science and design of educational assessment . Washington: National Academy Press.
Peters, O., Körndle, H., & Narciss, S. (2018). Effects of a formative assessment script on how vocational students generate formative feedback to a peer’s or their own performance. European Journal of Psychology of Education, 33 (1), 117–143.
* Philippakos, Z. A., & MacArthur, C. A. (2016). The effects of giving feedback on the persuasive writing of fourth-and fifth-grade students. Reading Research Quarterly, 51 (4), 419-433.
* Pierson, H. (1967). Peer and teacher correction: A comparison of the effects of two methods of teaching composition in grade nine English classes. (Unpublished doctoral dissertation), New York University.
* Prater, D., & Bermudez, A. (1993). Using peer response groups with limited English proficient writers. Bilingual Research Journal , 17 (1-2), 99-116.
Reinholz, D. (2016). The assessment cycle: A model for learning through peer assessment. Assessment & Evaluation in Higher Education, 41 (2), 301–315.
* Rijlaarsdam, G., & Schoonen, R. (1988). Effects of a teaching program based on peer evaluation on written composition and some variables related to writing apprehension. (Unpublished doctoral dissertation), Amsterdam University, Amsterdam
Rollinson, P. (2005). Using peer feedback in the ESL writing class. ELT Journal, 59 (1), 23–30.
Rotsaert, T., Panadero, E., & Schellens, T. (2018). Anonymity as an instructional scaffold in peer assessment: its effects on peer feedback quality and evolution in students’ perceptions about peer assessment skills. European Journal of Psychology of Education, 33 (1), 75–99.
* Rudd II, J. A., Wang, V. Z., Cervato, C., & Ridky, R. W. (2009). Calibrated peer review assignments for the Earth Sciences. Journal of Geoscience Education , 57 (5), 328-334.
* Ruegg, R. (2015). The relative effects of peer and teacher feedback on improvement in EFL students' writing ability. Linguistics and Education, 29 , 73-82.
* Sadeghi, K., & Abolfazli Khonbi, Z. (2015). Iranian university students’ experiences of and attitudes towards alternatives in assessment. Assessment & Evaluation in Higher Education, 40 (5), 641–665.
* Sadler, P. M., & Good, E. (2006). The impact of self- and peer-grading on student learning. Educational Assessment , 11 (1), 1-31.
Sanchez, C. E., Atkinson, K. M., Koenka, A. C., Moshontz, H., & Cooper, H. (2017). Self-grading and peer-grading for formative and summative assessments in 3rd through 12th grade classrooms: A meta-analysis. Journal of Educational Psychology, 109 (8), 1049.
Sawilowsky, S. S. (2009). New effect size rules of thumb. Journal of Modern Applied Statistical Methods, 8 (2), 26.
* Schonrock-Adema, J., Heijne-Penninga, M., van Duijn, M. A., Geertsma, J., & Cohen-Schotanus, J. (2007). Assessment of professional behaviour in undergraduate medical education: Peer assessment enhances performance. Medical Education, 41 (9), 836-842.
Schwarzer, G., Carpenter, J. R., & Rücker, G. (2015). Meta-analysis with R . Cham: Springer.
* Sippel, L., & Jackson, C. N. (2015). Teacher vs. peer oral corrective feedback in the German language classroom. Foreign Language Annals , 48 (4), 688-705.
Sluijsmans, D. M., Brand-Gruwel, S., van Merriënboer, J. J., & Martens, R. L. (2004). Training teachers in peer-assessment skills: Effects on performance and perceptions. Innovations in Education and Teaching International, 41 (1), 59–78.
Smith, H., Cooper, A., & Lancaster, L. (2002). Improving the quality of undergraduate peer assessment: A case for student and staff development. Innovations in education and teaching international, 39 (1), 71–81.
Smith, M. K., Wood, W. B., Adams, W. K., Wieman, C., Knight, J. K., Guild, N., & Su, T. T. (2009). Why peer discussion improves student performance on in-class concept questions. Science, 323 (5910), 122–124.
Steel, P. D., & Kammeyer-Mueller, J. D. (2002). Comparing meta-analytic moderator estimation techniques under realistic conditions. Journal of Applied Psychology, 87 (1), 96.
Stipek, D., & Iver, D. M. (1989). Developmental change in children's assessment of intellectual competence. Child Development , 521–538.
Strijbos, J. W., & Wichmann, A. (2018). Promoting learning by leveraging the collaborative nature of formative peer assessment with instructional scaffolds. European Journal of Psychology of Education, 33 (1), 1–9.
Strijbos, J.-W., Narciss, S., & Dünnebier, K. (2010). Peer feedback content and sender's competence level in academic writing revision tasks: Are they critical for feedback perceptions and efficiency? Learning and Instruction, 20 (4), 291–303.
* Sun, D. L., Harris, N., Walther, G., & Baiocchi, M. (2015). Peer assessment enhances student learning: The results of a matched randomized crossover experiment in a college statistics class. PLoS One, 10 (12).
Tannacito, T., & Tuzi, F. (2002). A comparison of e-response: Two experiences, one conclusion. Kairos, 7 (3), 1–14.
R Core Team. (2017). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing.
Topping, K. (1998). Peer assessment between students in colleges and universities. Review of Educational Research, 68 (3), 249-276.
Topping, K. (2009). Peer assessment. Theory Into Practice, 48 (1), 20–27.
Usher, N. (2018). Learning about academic writing through holistic peer assessment. (Unpublished doctoral thesis), University of Oxford, Oxford, UK.
* van den Boom, G., Paas, F., & van Merriënboer, J. J. (2007). Effects of elicited reflections combined with tutor or peer feedback on self-regulated learning and learning outcomes. Learning and Instruction , 17 (5), 532-548.
* van Ginkel, S., Gulikers, J., Biemans, H., & Mulder, M. (2017). The impact of the feedback source on developing oral presentation competence. Studies in Higher Education, 42 (9), 1671-1685.
van Popta, E., Kral, M., Camp, G., Martens, R. L., & Simons, P. R. J. (2017). Exploring the value of peer feedback in online learning for the provider. Educational Research Review, 20 , 24–34.
van Zundert, M., Sluijsmans, D., & van Merriënboer, J. (2010). Effective peer assessment processes: Research findings and future directions. Learning and Instruction, 20 (4), 270–279.
Vanderhoven, E., Raes, A., Montrieux, H., Rotsaert, T., & Schellens, T. (2015). What if pupils can assess their peers anonymously? A quasi-experimental study. Computers & Education, 81 , 123–132.
Wang, J.-H., Hsu, S.-H., Chen, S. Y., Ko, H.-W., Ku, Y.-M., & Chan, T.-W. (2014a). Effects of a mixed-mode peer response on student response behavior and writing performance. Journal of Educational Computing Research, 51 (2), 233–256.
* Wang, J. H., Hsu, S. H., Chen, S. Y., Ko, H. W., Ku, Y. M., & Chan, T. W. (2014b). Effects of a mixed-mode peer response on student response behavior and writing performance. Journal of Educational Computing Research , 51 (2), 233-256.
* Wang, X.-M., Hwang, G.-J., Liang, Z.-Y., & Wang, H.-Y. (2017). Enhancing students’ computer programming performances, critical thinking awareness and attitudes towards programming: An online peer-assessment attempt. Journal of Educational Technology & Society, 20 (4), 58-68.
Wiliam, D. (2010). What counts as evidence of educational achievement? The role of constructs in the pursuit of equity in assessment. Review of Research in Education, 34 (1), 254–284.
Wiliam, D. (2018). How can assessment support learning? A response to Wilson and Shepard, Penuel, and Pellegrino. Educational Measurement: Issues and Practice, 37 (1), 42–44.
Wiliam, D., Lee, C., Harrison, C., & Black, P. (2004). Teachers developing assessment for learning: Impact on student achievement. Assessment in Education: Principles, Policy & Practice, 11 (1), 49–65.
* Wise, W. G. (1992). The effects of revision instruction on eighth graders' persuasive writing (Unpublished doctoral dissertation), University of Maryland, Maryland
* Wong, H. M. H., & Storey, P. (2006). Knowing and doing in the ESL writing class. Language Awareness , 15 (4), 283.
* Xie, Y., Ke, F., & Sharma, P. (2008). The effect of peer feedback for blogging on college students' reflective learning processes. The Internet and Higher Education , 11 (1), 18-25.
Young, J. E., & Jackman, M. G.-A. (2014). Formative assessment in the Grenadian lower secondary school: Teachers’ perceptions, attitudes and practices. Assessment in Education: Principles, Policy & Practice, 21 (4), 398–411.
Yu, F.-Y., & Liu, Y.-H. (2009). Creating a psychologically safe online space for a student-generated questions learning activity via different identity revelation modes. British Journal of Educational Technology, 40 (6), 1109–1123.
The authors would like to thank Kristine Gorgen and Jessica Chan for their help coding the studies included in the meta-analysis.
Authors and Affiliations
Department of Education, University of Oxford, Oxford, England
Kit S. Double, Joshua A. McGrane & Therese N. Hopfenbeck
Correspondence to Kit S. Double.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material (XLSX 40 kb)
Effect Size Calculation
Standardised mean differences were calculated as a measure of effect size. The standardised mean difference ( d ) was calculated using the following formula, which is typically used in meta-analyses (e.g., Lipsey and Wilson 2001 ):

d = (M_T − M_C) / SD_pooled

where M_T and M_C are the post-intervention means of the peer assessment and control groups, respectively, and SD_pooled is the pooled standard deviation of the two groups.
As the standardised mean difference ( d ) is known to have a slight positive bias (Hedges 1981 ), we applied a correction to the estimates (resulting in what is often referred to as Hedges' g ).
For studies where there was insufficient information to calculate Hedges' g using the above method, we used the online effect size calculator developed by Lipsey and Wilson ( 2001 ), available at http://www.campbellcollaboration.org/escalc . For pre-post design studies where adjusted means were not provided, we used the critical value relevant to the difference between the peer feedback and control groups from the reported pre-intervention adjusted analysis (e.g., analysis of covariance), as suggested by Higgins and Green ( 2011 ). For pre-post design studies where both pre- and post-intervention means and standard deviations were provided, we used an effect size estimate based on the mean pre-post change in the peer feedback group minus the mean pre-post change in the control group, divided by the pooled pre-intervention standard deviation, as this approach minimises bias and improves the precision of the estimate (Morris 2008 ).
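The effect size computations described above can be sketched in Python. This is a minimal illustration of the standard pooled-SD, bias-correction, and pre-post-control formulas attributed in the text to Lipsey and Wilson (2001), Hedges (1981), and Morris (2008); the function names and example values are my own:

```python
from math import sqrt

def pooled_sd(sd1, sd2, n1, n2):
    # Pooled standard deviation of two independent groups
    return sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))

def cohens_d(m_t, m_c, sd_t, sd_c, n_t, n_c):
    # Standardised mean difference: difference in group means
    # divided by the pooled standard deviation
    return (m_t - m_c) / pooled_sd(sd_t, sd_c, n_t, n_c)

def hedges_g(d, n_t, n_c):
    # Hedges' small-sample correction for the positive bias of d
    return d * (1 - 3 / (4 * (n_t + n_c) - 9))

def pre_post_control_d(pre_t, post_t, pre_c, post_c,
                       sd_pre_t, sd_pre_c, n_t, n_c):
    # Morris (2008)-style estimate for pre-post designs: the mean
    # pre-post change in the treatment group minus the mean change in
    # the control group, divided by the pooled pre-intervention SD
    change = (post_t - pre_t) - (post_c - pre_c)
    return change / pooled_sd(sd_pre_t, sd_pre_c, n_t, n_c)
```

For example, two groups of 20 with means of 10 and 8 and a common standard deviation of 2 yield d = 1.0, which the Hedges correction shrinks slightly toward zero.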
Variance estimates for each effect size were calculated using the following formula:

v = (n_T + n_C) / (n_T × n_C) + d² / (2(n_T + n_C))

where n_T and n_C are the sample sizes of the peer assessment and control groups, respectively.
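As a sketch, the standard large-sample sampling variance of a standardised mean difference for two independent groups (the usual formula given in meta-analysis texts such as Lipsey and Wilson (2001); the function name is my own) can be written as:

```python
def smd_variance(d, n_t, n_c):
    # Sampling variance of a standardised mean difference: a
    # group-size term plus a term reflecting uncertainty in d itself
    return (n_t + n_c) / (n_t * n_c) + d**2 / (2 * (n_t + n_c))
```

Note that the variance shrinks as group sizes grow, and larger effect sizes carry slightly more uncertainty than small ones at the same sample size.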
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Double, K.S., McGrane, J.A. & Hopfenbeck, T.N. The Impact of Peer Assessment on Academic Performance: A Meta-analysis of Control Group Studies. Educ Psychol Rev 32 , 481–509 (2020). https://doi.org/10.1007/s10648-019-09510-3
Published : 10 December 2019
Issue Date : June 2020
DOI : https://doi.org/10.1007/s10648-019-09510-3
- Peer assessment
- Experimental design
- Effect size
- Formative assessment
Systematic Review Article: A Critical Review of Research on Student Self-Assessment
- Educational Psychology and Methodology, University at Albany, Albany, NY, United States
This article is a review of research on student self-assessment conducted largely between 2013 and 2018. The purpose of the review is to provide an updated overview of theory and research. The treatment of theory involves articulating a refined definition and operationalization of self-assessment. The review of 76 empirical studies offers a critical perspective on what has been investigated, including the relationship between self-assessment and achievement, consistency of self-assessment and others' assessments, student perceptions of self-assessment, and the association between self-assessment and self-regulated learning. An argument is made for less research on consistency and summative self-assessment, and more on the cognitive and affective mechanisms of formative self-assessment.
This review of research on student self-assessment expands on a review published as a chapter in the Cambridge Handbook of Instructional Feedback ( Andrade, 2018 , reprinted with permission). The timespan for the original review was January 2013 to October 2016. A lot of research has been done on the subject since then, including at least two meta-analyses; hence this expanded review, in which I provide an updated overview of theory and research. The treatment of theory presented here involves articulating a refined definition and operationalization of self-assessment through a lens of feedback. My review of the growing body of empirical research offers a critical perspective, in the interest of provoking new investigations into neglected areas.
Defining and Operationalizing Student Self-Assessment
Without exception, reviews of self-assessment ( Sargeant, 2008 ; Brown and Harris, 2013 ; Panadero et al., 2016a ) call for clearer definitions: What is self-assessment, and what is not? This question is surprisingly difficult to answer, as the term self-assessment has been used to describe a diverse range of activities, such as assigning a happy or sad face to a story just told, estimating the number of correct answers on a math test, graphing scores for dart throwing, indicating understanding (or the lack thereof) of a science concept, using a rubric to identify strengths and weaknesses in one's persuasive essay, writing reflective journal entries, and so on. Each of those activities involves some kind of assessment of one's own functioning, but they are so different that distinctions among types of self-assessment are needed. I will draw those distinctions in terms of the purposes of self-assessment which, in turn, determine its features: a classic form-fits-function analysis.
What is Self-Assessment?
Brown and Harris (2013) defined self-assessment in the K-16 context as a “descriptive and evaluative act carried out by the student concerning his or her own work and academic abilities” (p. 368). Panadero et al. (2016a) defined it as a “wide variety of mechanisms and techniques through which students describe (i.e., assess) and possibly assign merit or worth to (i.e., evaluate) the qualities of their own learning processes and products” (p. 804). Referring to physicians, Epstein et al. (2008) defined “concurrent self-assessment” as “ongoing moment-to-moment self-monitoring” (p. 5). Self-monitoring “refers to the ability to notice our own actions, curiosity to examine the effects of those actions, and willingness to use those observations to improve behavior and thinking in the future” (p. 5). Taken together, these definitions include self-assessment of one's abilities, processes , and products —everything but the kitchen sink. This very broad conception might seem unwieldy, but it works because each object of assessment—competence, process, and product—is subject to the influence of feedback from oneself.
What is missing from each of these definitions, however, is the purpose of the act of self-assessment. Their authors might rightly point out that the purpose is implied, but a formal definition requires us to make it plain: Why do we ask students to self-assess? I have long held that self-assessment is feedback ( Andrade, 2010 ), and that the purpose of feedback is to inform adjustments to processes and products that deepen learning and enhance performance; hence the purpose of self-assessment is to generate feedback that promotes learning and improvements in performance. This learning-oriented purpose of self-assessment implies that it should be formative: if there is no opportunity for adjustment and correction, self-assessment is almost pointless.
Clarity about the purpose of self-assessment allows us to interpret what otherwise appear to be discordant findings from research, which has produced mixed results in terms of both the accuracy of students' self-assessments and their influence on learning and/or performance. I believe the source of the discord can be traced to the different ways in which self-assessment is carried out, such as whether it is summative or formative. This issue will be taken up again in the review of current research that follows this overview. For now, consider a study of the accuracy and validity of summative self-assessment in teacher education conducted by Tejeiro et al. (2012) , which showed that students' self-assigned marks tended to be higher than marks given by professors. All 122 students in the study assigned themselves a grade at the end of their course, but half of the students were told that their self-assigned grade would count toward 5% of their final grade. In both groups, students' self-assessments were higher than grades given by professors, especially for students with “poorer results” (p. 791) and those for whom self-assessment counted toward the final grade. In the group that was told their self-assessments would count toward their final grade, no relationship was found between the professors' and the students' assessments. Tejeiro et al. concluded that, although students' and professors' assessments tended to be highly similar when self-assessment did not count toward final grades, overestimation increased dramatically when it did count. Interviews of students who self-assigned highly discrepant grades revealed (as you might guess) that they were motivated by the desire to obtain the highest possible grades.
Studies like Tejeiro et al.'s (2012) are interesting in terms of the information they provide about the relationship between consistency and honesty, but the purpose of the self-assessment, beyond addressing interesting research questions, is unclear. There is no feedback purpose. This is also true for another study of summative self-assessment of competence, during which elementary-school children took the Test of Narrative Language and then were asked to self-evaluate “how you did in making up stories today” by pointing to one of five pictures, from a “very happy face” (rating of five) to a “very sad face” (rating of one) ( Kaderavek et al., 2004 , p. 37). The usual results were reported: Older children and good narrators were more accurate than younger children and poor narrators, and males tended to overestimate their ability more frequently.
Typical of clinical studies of accuracy in self-evaluation, this study rests on a definition and operationalization of self-assessment with no value in terms of instructional feedback. If those children were asked to rate their stories and then revise or, better yet, if they assessed their stories according to clear, developmentally appropriate criteria before revising, the value of their self-assessments in terms of instructional feedback would skyrocket. I speculate that their accuracy would too. In contrast, studies of formative self-assessment suggest that when the act of self-assessing is given a learning-oriented purpose, students' self-assessments are relatively consistent with those of external evaluators, including professors ( Lopez and Kossack, 2007 ; Barney et al., 2012 ; Leach, 2012 ), teachers ( Bol et al., 2012 ; Chang et al., 2012 , 2013 ), researchers ( Panadero and Romero, 2014 ; Fitzpatrick and Schulz, 2016 ), and expert medical assessors ( Hawkins et al., 2012 ).
My commitment to keeping self-assessment formative is firm. However, Gavin Brown (personal communication, April 2011) reminded me that summative self-assessment exists and we cannot ignore it; any definition of self-assessment must acknowledge and distinguish between formative and summative forms of it. Thus, the taxonomy in Table 1 , which depicts self-assessment as serving formative and/or summative purposes, and focuses on competence, processes, and/or products.
Table 1 . A taxonomy of self-assessment.
Fortunately, a formative view of self-assessment seems to be taking hold in various educational contexts. For instance, Sargeant (2008) noted that all seven authors in a special issue of the Journal of Continuing Education in the Health Professions “conceptualize self-assessment within a formative, educational perspective, and see it as an activity that draws upon both external and internal data, standards, and resources to inform and make decisions about one's performance” (p. 1). Sargeant also stresses the point that self-assessment should be guided by evaluative criteria: “Multiple external sources can and should inform self-assessment, perhaps most important among them performance standards” (p. 1). Now we are talking about the how of self-assessment, which demands an operationalization of self-assessment practice. Let us examine each object of self-assessment (competence, processes, and/or products) with an eye for what is assessed and why.
What is Self-Assessed?
Monitoring and self-assessing processes are practically synonymous with self-regulated learning (SRL), or at least with central components of it, such as goal-setting and monitoring, or metacognition. Research on SRL has clearly shown that self-generated feedback on one's approach to learning is associated with academic gains ( Zimmerman and Schunk, 2011 ). Self-assessments of products , such as papers and presentations, are the easiest to defend as feedback, especially when those self-assessments are grounded in explicit, relevant, evaluative criteria and followed by opportunities to relearn and/or revise ( Andrade, 2010 ).
Including the self-assessment of competence in this definition is a little trickier. I hesitated to include it because of the risk of sneaking in global assessments of one's overall ability, self-esteem, and self-concept (“I'm good enough, I'm smart enough, and doggone it, people like me,” Franken, 1992 ), which do not seem relevant to a discussion of feedback in the context of learning. Research on global self-assessment, or self-perception, is popular in the medical education literature, but even there, scholars have begun to question its usefulness in terms of influencing learning and professional growth (e.g., see Sargeant et al., 2008 ). Eva and Regehr (2008) seem to agree in the following passage, which states the case in a way that makes it worthy of a long quotation:
Self-assessment is often (implicitly or otherwise) conceptualized as a personal, unguided reflection on performance for the purposes of generating an individually derived summary of one's own level of knowledge, skill, and understanding in a particular area. For example, this conceptualization would appear to be the only reasonable basis for studies that fit into what Colliver et al. (2005) has described as the “guess your grade” model of self-assessment research, the results of which form the core foundation for the recurring conclusion that self-assessment is generally poor. This unguided, internally generated construction of self-assessment stands in stark contrast to the model put forward by Boud (1999) , who argued that the phrase self-assessment should not imply an isolated or individualistic activity; it should commonly involve peers, teachers, and other sources of information. The conceptualization of self-assessment as enunciated in Boud's description would appear to involve a process by which one takes personal responsibility for looking outward, explicitly seeking feedback, and information from external sources, then using these externally generated sources of assessment data to direct performance improvements. In this construction, self-assessment is more of a pedagogical strategy than an ability to judge for oneself; it is a habit that one needs to acquire and enact rather than an ability that one needs to master (p. 15).
As in the K-16 context, self-assessment is coming to be seen as having value as much or more so in terms of pedagogy as in assessment ( Silver et al., 2008 ; Brown and Harris, 2014 ). In the end, however, I decided that self-assessing one's competence to successfully learn a particular concept or complete a particular task (which sounds a lot like self-efficacy—more on that later) might be useful feedback because it can inform decisions about how to proceed, such as the amount of time to invest in learning how to play the flute, or whether or not to seek help learning the steps of the jitterbug. An important caveat, however, is that self-assessments of competence are only useful if students have opportunities to do something about their perceived low competence—that is, it serves the purpose of formative feedback for the learner.
How to Self-Assess?
Panadero et al. (2016a) summarized five very different taxonomies of self-assessment and called for the development of a comprehensive typology that considers, among other things, its purpose, the presence or absence of criteria, and the method. In response, I propose the taxonomy depicted in Table 1 , which focuses on the what (competence, process, or product), the why (formative or summative), and the how (methods, including whether or not they include standards, e.g., criteria) of self-assessment. The collection of example methods in the table is not exhaustive.
I put the methods in Table 1 where I think they belong, but many of them could be placed in more than one cell. Take self-efficacy , for instance, which is essentially a self-assessment of one's competence to successfully undertake a particular task ( Bandura, 1997 ). Summative judgments of self-efficacy are certainly possible but they seem like a silly thing to do—what is the point, from a learning perspective? Formative self-efficacy judgments, on the other hand, can inform next steps in learning and skill building. There is reason to believe that monitoring and making adjustments to one's self-efficacy (e.g., by setting goals or attributing success to effort) can be productive ( Zimmerman, 2000 ), so I placed self-efficacy in the formative row.
It is important to emphasize that self-efficacy is task-specific, more or less ( Bandura, 1997 ). This taxonomy does not include general, holistic evaluations of one's abilities, for example, “I am good at math.” Global assessment of competence does not provide the leverage, in terms of feedback, that is provided by task-specific assessments of competence, that is, self-efficacy. Eva and Regehr (2008) provided an illustrative example: “We suspect most people are prompted to open a dictionary as a result of encountering a word for which they are uncertain of the meaning rather than out of a broader assessment that their vocabulary could be improved” (p. 16). The exclusion of global evaluations of oneself resonates with research that clearly shows that feedback that focuses on aspects of a task (e.g., “I did not solve most of the algebra problems”) is more effective than feedback that focuses on the self (e.g., “I am bad at math”) ( Kluger and DeNisi, 1996 ; Dweck, 2006 ; Hattie and Timperley, 2007 ). Hence, global self-evaluations of ability or competence do not appear in Table 1 .
Another approach to student self-assessment that could be placed in more than one cell is traffic lights . The term traffic lights refers to asking students to use green, yellow, or red objects (or thumbs up, sideways, or down—anything will do) to indicate whether they think they have good, partial, or little understanding ( Black et al., 2003 ). It would be appropriate for traffic lights to appear in multiple places in Table 1 , depending on how they are used. Traffic lights seem to be most effective at supporting students' reflections on how well they understand a concept or have mastered a skill, which is in line with their creators' original intent, so they are categorized as formative self-assessments of one's learning—which sounds like metacognition.
In fact, several of the methods included in Table 1 come from research on metacognition, including self-monitoring , such as checking one's reading comprehension, and self-testing , e.g., checking one's performance on test items. These last two methods have been excluded from some taxonomies of self-assessment (e.g., Boud and Brew, 1995 ) because they do not engage students in explicitly considering relevant standards or criteria. However, new conceptions of self-assessment are grounded in theories of the self- and co-regulation of learning ( Andrade and Brookhart, 2016 ), which includes self-monitoring of learning processes with and without explicit standards.
However, my research favors self-assessment with regard to standards ( Andrade and Boulay, 2003 ; Andrade and Du, 2007 ; Andrade et al., 2008 , 2009 , 2010 ), as does related research by Panadero and his colleagues (see below). I have involved students in self-assessment of stories, essays, or mathematical word problems according to rubrics or checklists with criteria. For example, two studies investigated the relationship between elementary or middle school students' scores on a written assignment and a process that involved them in reading a model paper, co-creating criteria, self-assessing first drafts with a rubric, and revising ( Andrade et al., 2008 , 2010 ). The self-assessment was highly scaffolded: students were asked to underline key phrases in the rubric with colored pencils (e.g., underline “clearly states an opinion” in blue), then underline or circle in their drafts the evidence of having met the standard articulated by the phrase (e.g., his or her opinion) with the same blue pencil. If students found they had not met the standard, they were asked to write themselves a reminder to make improvements when they wrote their final drafts. This process was followed for each criterion on the rubric. There were main effects on scores for every self-assessed criterion on the rubric, suggesting that guided self-assessment according to the co-created criteria helped students produce more effective writing.
Panadero and his colleagues have also done quasi-experimental and experimental research on standards-referenced self-assessment, using rubrics or lists of assessment criteria that are presented in the form of questions ( Panadero et al., 2012 , 2013 , 2014 ; Panadero and Romero, 2014 ). Panadero calls the list of assessment criteria a script because his work is grounded in research on scaffolding (e.g., Kollar et al., 2006 ): I call it a checklist because that is the term used in classroom assessment contexts. Either way, the list provides standards for the task. Here is a script for a written summary that Panadero et al. (2014) used with college students in a psychology class:
• Does my summary transmit the main idea from the text? Is it at the beginning of my summary?
• Are the important ideas also in my summary?
• Have I selected the main ideas from the text to make them explicit in my summary?
• Have I thought about my purpose for the summary? What is my goal?
Taken together, the results of the studies cited above suggest that students who engaged in self-assessment using scripts or rubrics were more self-regulated, as measured by self-report questionnaires and/or think-aloud protocols, than were students in the comparison or control groups. Effect sizes were small to moderate (η² = 0.06–0.42) and statistically significant. Most interesting, perhaps, is one study ( Panadero and Romero, 2014 ) that demonstrated an association between rubric-referenced self-assessment activities and all three phases of SRL: forethought, performance, and reflection.
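For readers less familiar with the metric, the sketch below computes eta-squared, the proportion of total variance in an outcome that is attributable to condition (SS_between / SS_total). The scores are invented for illustration only; this is a demonstration of the statistic, not a reanalysis of the cited studies.

```python
# Minimal sketch of eta-squared (hypothetical data).
# eta^2 = SS_between / SS_total: the share of total variance in the
# outcome explained by group membership.

def eta_squared(groups):
    """groups: list of lists of scores, one inner list per condition."""
    all_scores = [s for g in groups for s in g]
    grand_mean = sum(all_scores) / len(all_scores)
    # total sum of squares: every score's deviation from the grand mean
    ss_total = sum((s - grand_mean) ** 2 for s in all_scores)
    # between-groups sum of squares: group means' deviations, weighted by n
    ss_between = sum(
        len(g) * ((sum(g) / len(g)) - grand_mean) ** 2 for g in groups
    )
    return ss_between / ss_total

# Hypothetical self-regulation scores: self-assessment group vs. control
treatment = [4.2, 3.9, 4.5, 4.1]
control = [3.5, 3.7, 3.4, 3.8]
eta2 = eta_squared([treatment, control])
```

With these invented scores the statistic comes out near 0.70; the values reported in the studies above (0.06–0.42) would correspond to smaller group differences relative to the within-group spread.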
There are surely many other methods of self-assessment to include in Table 1 , as well as interesting conversations to be had about which method goes where and why. In the meantime, I offer the taxonomy in Table 1 as a way to define and operationalize self-assessment in instructional contexts and as a framework for the following overview of current research on the subject.
An Overview of Current Research on Self-Assessment
Several recent reviews of self-assessment are available ( Brown and Harris, 2013 ; Brown et al., 2015 ; Panadero et al., 2017 ), so I will not summarize the entire body of research here. Instead, I chose to take a birds-eye view of the field, with the goal of reporting on what has been sufficiently researched and what remains to be done. I used the reference lists from those reviews, as well as other relevant sources, as a starting point. In order to update the list of sources, I conducted two new searches 1 , the first of the ERIC database, and the second of both ERIC and PsycINFO. Both searches included two search terms, “self-assessment” OR “self-evaluation.” Advanced search options had four delimiters: (1) peer-reviewed, (2) January 2013–October 2016 and then October 2016–March 2019, (3) English, and (4) full-text. Because the focus was on K-20 educational contexts, sources were excluded if they were about early childhood education or professional development.
The first search yielded 347 hits; the second 1,163. Research that was unrelated to instructional feedback was excluded, such as studies limited to self-estimates of performance before or after taking a test, guesses about whether a test item was answered correctly, and estimates of how many tasks could be completed in a certain amount of time. Although some of the excluded studies might be thought of as useful investigations of self-monitoring, as a group they seemed too unrelated to theories of self-generated feedback to be appropriate for this review. Seventy-six studies were selected for inclusion in Table S1 (Supplementary Material), which also contains a few studies published before 2013 that were not included in key reviews, as well as studies solicited directly from authors.
Table S1 in the Supplementary Material contains a complete list of studies included in this review, organized by the focus or topic of the study, as well as brief descriptions of each. The “type” column in Table S1 (Supplementary Material) indicates whether the study focused on formative or summative self-assessment. This distinction was often difficult to make due to a lack of information. For example, Memis and Seven (2015) frame their study in terms of formative assessment, and note that the purpose of the self-evaluation done by the sixth grade students is to “help students improve their [science] reports” (p. 39), but they do not indicate how the self-assessments were done, nor whether students were given time to revise their reports based on their judgments or supported in making revisions. A sentence or two of explanation about the process of self-assessment in the procedures sections of published studies would be most useful.
Figure 1 graphically represents the number of studies in the four most common topic categories found in the table—achievement, consistency, student perceptions, and SRL. The figure reveals that research on self-assessment is on the rise, with consistency the most popular topic. Of the 76 studies in Table S1 (Supplementary Material), 44 were inquiries into the consistency of students' self-assessments with other judgments (e.g., a test score or teacher's grade). Twenty-five studies investigated the relationship between self-assessment and achievement. Fifteen explored students' perceptions of self-assessment. Twelve studies focused on the association between self-assessment and self-regulated learning. One examined self-efficacy, and two qualitative studies documented the mental processes involved in self-assessment. The sum ( n = 99) of the list of research topics is more than 76 because several studies had multiple foci. In the remainder of this review I examine each topic in turn.
Figure 1 . Topics of self-assessment studies, 2013–2018.
Table S1 (Supplementary Material) reveals that much of the recent research on self-assessment has investigated the accuracy or, more accurately, consistency, of students' self-assessments. The term consistency is more appropriate in the classroom context because the quality of students' self-assessments is often determined by comparing them with their teachers' assessments and then generating correlations. Given the evidence of the unreliability of teachers' grades ( Falchikov, 2005 ), the assumption that teachers' assessments are accurate might not be well-founded ( Leach, 2012 ; Brown et al., 2015 ). Ratings of student work done by researchers are also suspect, unless evidence of the validity and reliability of the inferences made about student work by researchers is available. Consequently, much of the research on classroom-based self-assessment should use the term consistency , which refers to the degree of alignment between students' and expert raters' evaluations, avoiding the purer, more rigorous term accuracy unless it is fitting.
In their review, Brown and Harris (2013) reported that correlations between student self-ratings and other measures tended to be weakly to strongly positive, ranging from r ≈ 0.20 to 0.80, with few studies reporting correlations >0.60. But their review included results from studies of any self-appraisal of school work, including summative self-rating/grading, predictions about the correctness of answers on test items, and formative, criteria-based self-assessments, a combination of methods that makes the correlations they reported difficult to interpret. Qualitatively different forms of self-assessment, especially summative and formative types, cannot be lumped together without obfuscating important aspects of self-assessment as feedback.
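To make concrete what a consistency coefficient of this kind represents, the sketch below computes a Pearson correlation between student self-ratings and teacher ratings of the same work. The ratings are hypothetical, invented purely for illustration; they are not drawn from any study cited here.

```python
# Minimal sketch: "consistency" quantified as the Pearson correlation
# between students' self-ratings and a teacher's ratings of the same work.
# All data below are hypothetical.

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x ** 0.5 * var_y ** 0.5)

self_ratings = [3, 4, 2, 5, 4, 3, 2, 5]     # hypothetical self-scores, 1-5 rubric
teacher_ratings = [3, 3, 2, 4, 4, 2, 3, 5]  # hypothetical teacher scores

r = pearson_r(self_ratings, teacher_ratings)
```

Note that a high coefficient only indicates that students who rate themselves highly tend to receive high teacher ratings; it says nothing about whether the absolute levels of the two sets of scores agree, which is one reason consistency is a weaker claim than accuracy.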
Given my concern about combining studies of summative and formative assessment, you might anticipate a call for research on consistency that distinguishes between the two. I will make no such call for three reasons. One is that we have enough research on the subject, including the 22 studies in Table S1 (Supplementary Material) that were published after Brown and Harris's review (2013 ). Drawing only on studies included in Table S1 (Supplementary Material), we can say with confidence that summative self-assessment tends to be inconsistent with external judgements ( Baxter and Norman, 2011 ; De Grez et al., 2012 ; Admiraal et al., 2015 ), with males tending to overrate and females to underrate ( Nowell and Alston, 2007 ; Marks et al., 2018 ). There are exceptions ( Alaoutinen, 2012 ; Lopez-Pastor et al., 2012 ) as well as mixed results, with students being consistent regarding some aspects of their learning but not others ( Blanch-Hartigan, 2011 ; Harding and Hbaci, 2015 ; Nguyen and Foster, 2018 ). We can also say that older, more academically competent learners tend to be more consistent ( Hacker et al., 2000 ; Lew et al., 2010 ; Alaoutinen, 2012 ; Guillory and Blankson, 2017 ; Butler, 2018 ; Nagel and Lindsey, 2018 ). There is evidence that consistency can be improved through experience ( Lopez and Kossack, 2007 ; Yilmaz, 2017 ; Nagel and Lindsey, 2018 ), the use of guidelines ( Bol et al., 2012 ), feedback ( Thawabieh, 2017 ), and standards ( Baars et al., 2014 ), perhaps in the form of rubrics ( Panadero and Romero, 2014 ). Modeling and feedback also help ( Labuhn et al., 2010 ; Miller and Geraci, 2011 ; Hawkins et al., 2012 ; Kostons et al., 2012 ).
An outcome typical of research on the consistency of summative self-assessment can be found in row 59, which summarizes the study by Tejeiro et al. (2012) discussed earlier: Students' self-assessments were higher than marks given by professors, especially for students with poorer results, and no relationship was found between the professors' and the students' assessments in the group in which self-assessment counted toward the final mark. Students are not stupid: if they know that they can influence their final grade, and that their judgment is summative rather than intended to inform revision and improvement, they will be motivated to inflate their self-evaluation. I do not believe we need more research to demonstrate that phenomenon.
The second reason I am not calling for additional research on consistency is that much of it seems somewhat irrelevant. This might be because the interest in accuracy is rooted in clinical research on calibration, which has very different aims. Calibration accuracy is the “magnitude of consent between learners' true and self-evaluated task performance. Accurately calibrated learners' task performance equals their self-evaluated task performance” ( Wollenschläger et al., 2016 ). Calibration research often asks study participants to predict or postdict the correctness of their responses to test items. I caution about generalizing from clinical experiments to authentic classroom contexts because the dismal picture of our human potential to self-judge was painted by calibration researchers before study participants were effectively taught how to predict with accuracy, or provided with the tools they needed to be accurate, or motivated to do so. Calibration researchers know that, of course, and have conducted intervention studies that attempt to improve accuracy, with some success (e.g., Bol et al., 2012 ). Studies of formative self-assessment also suggest that consistency increases when it is taught and supported in many of the ways any other skill must be taught and supported ( Lopez and Kossack, 2007 ; Labuhn et al., 2010 ; Chang et al., 2012 , 2013 ; Hawkins et al., 2012 ; Panadero and Romero, 2014 ; Lin-Siegler et al., 2015 ; Fitzpatrick and Schulz, 2016 ).
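For readers unfamiliar with the calibration literature, the sketch below computes the two indices most often reported there: signed bias (the direction of over- or underestimation) and absolute accuracy (the magnitude of the discrepancy). The predicted and actual scores are hypothetical, chosen only to illustrate the arithmetic.

```python
# Minimal sketch of two standard calibration indices (hypothetical data):
# bias: mean signed difference; positive values indicate overestimation.
# absolute accuracy: mean unsigned difference; 0 = perfectly calibrated.

def calibration(predicted, actual):
    n = len(predicted)
    diffs = [p - a for p, a in zip(predicted, actual)]
    bias = sum(diffs) / n
    absolute_accuracy = sum(abs(d) for d in diffs) / n
    return bias, absolute_accuracy

predicted = [85, 70, 90, 60]  # hypothetical self-evaluated test scores (%)
actual = [75, 72, 78, 55]     # hypothetical true scores on the same test

bias, abs_acc = calibration(predicted, actual)
```

Because the two indices answer different questions, a student can have near-zero bias (overestimates cancel underestimates) while still being poorly calibrated item by item.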
Even clinical psychological studies that go beyond calibration to examine the associations between monitoring accuracy and subsequent study behaviors do not transfer well to classroom assessment research. After repeatedly encountering claims that, for example, low self-assessment accuracy leads to poor task-selection accuracy and “suboptimal learning outcomes” ( Raaijmakers et al., 2019 , p. 1), I dug into the cited studies and discovered two limitations. The first is that the tasks in which study participants engage are quite inauthentic. A typical task involves studying “word pairs (e.g., railroad—mother), followed by a delayed judgment of learning (JOL) in which the students predicted the chances of remembering the pair… After making a JOL, the entire pair was presented for restudy for 4 s [ sic ], and after all pairs had been restudied, a criterion test of paired-associate recall occurred” ( Dunlosky and Rawson, 2012 , p. 272). Although memory for word pairs might be important in some classroom contexts, it is not safe to assume that results from studies like that one can predict students' behaviors after criterion-referenced self-assessment of their comprehension of complex texts, lengthy compositions, or solutions to multi-step mathematical problems.
The second limitation of studies like the typical one described above is more serious: Participants in research like that are not permitted to regulate their own studying, which is experimentally manipulated by a computer program. This came as a surprise, since many of the claims were about students' poor study choices but they were rarely allowed to make actual choices. For example, Dunlosky and Rawson (2012) permitted participants to “use monitoring to effectively control learning” by programming the computer so that “a participant would need to have judged his or her recall of a definition entirely correct on three different trials, and once they judged it entirely correct on the third trial, that particular key term definition was dropped [by the computer program] from further practice” (p. 272). The authors note that this study design is an improvement on designs that did not require all participants to use the same regulation algorithm, but it does not reflect the kinds of decisions that learners make in class or while doing homework. In fact, a large body of research shows that students can make wise choices when they self-pace the study of to-be-learned materials and then allocate study time to each item ( Bjork et al., 2013 , p. 425):
In a typical experiment, the students first study all the items at an experimenter-paced rate (e.g., study 60 paired associates for 3 s each), which familiarizes the students with the items; after this familiarity phase, the students then either choose which items they want to restudy (e.g., all items are presented in an array, and the students select which ones to restudy) and/or pace their restudy of each item. Several dependent measures have been widely used, such as how long each item is studied, whether an item is selected for restudy, and in what order items are selected for restudy. The literature on these aspects of self-regulated study is massive (for a comprehensive overview, see both Dunlosky and Ariel, 2011 and Son and Metcalfe, 2000 ), but the evidence is largely consistent with a few basic conclusions. First, if students have a chance to practice retrieval prior to restudying items, they almost exclusively choose to restudy unrecalled items and drop the previously recalled items from restudy ( Metcalfe and Kornell, 2005 ). Second, when pacing their study of individual items that have been selected for restudy, students typically spend more time studying items that are more, rather than less, difficult to learn. Such a strategy is consistent with a discrepancy-reduction model of self-paced study (which states that people continue to study an item until they reach mastery), although some key revisions to this model are needed to account for all the data. For instance, students may not continue to study until they reach some static criterion of mastery, but instead, they may continue to study until they perceive that they are no longer making progress.
I propose that this research, which suggests that students' unscaffolded, unmeasured, informal self-assessments tend to lead to appropriate task selection, is better aligned with research on classroom-based self-assessment. Nonetheless, even this comparison is inadequate because the study participants were not taught to compare their performance to the criteria for mastery, as is often done in classroom-based self-assessment.
The third and final reason I do not believe we need additional research on consistency is that I think it is a distraction from the true purposes of self-assessment. Many if not most of the articles about the accuracy of self-assessment are grounded in the assumption that accuracy is necessary for self-assessment to be useful, particularly in terms of subsequent studying and revision behaviors. Although it seems obvious that accurate evaluations of their performance positively influence students' study strategy selection, which should produce improvements in achievement, I have not seen relevant research that tests those conjectures. Some claim that inaccurate estimates of learning lead to the selection of inappropriate learning tasks ( Kostons et al., 2012 ) but they cite research that does not support their claim. For example, Kostons et al. cite studies that focus on the effectiveness of SRL interventions but do not address the accuracy of participants' estimates of learning, nor the relationship of those estimates to the selection of next steps. Other studies produce findings that support my skepticism. Take, for instance, two relevant studies of calibration. One suggested that performance and judgments of performance had little influence on subsequent test preparation behavior ( Hacker et al., 2000 ), and the other showed that study participants followed their predictions of performance to the same degree, regardless of monitoring accuracy ( van Loon et al., 2014 ).
Eva and Regehr (2008) believe that:
Research questions that take the form of “How well do various practitioners self-assess?” “How can we improve self-assessment?” or “How can we measure self-assessment skill?” should be considered defunct and removed from the research agenda [because] there have been hundreds of studies into these questions and the answers are “Poorly,” “You can't,” and “Don't bother” (p. 18).
I almost agree. A study that could change my mind about the importance of accuracy of self-assessment would be an investigation that goes beyond attempting to improve accuracy just for the sake of accuracy by instead examining the relearning/revision behaviors of accurate and inaccurate self-assessors: Do students whose self-assessments match the valid and reliable judgments of expert raters (hence my use of the term accuracy ) make better decisions about what they need to do to deepen their learning and improve their work? Here, I admit, is a call for research related to consistency: I would love to see a high-quality investigation of the relationship between accuracy in formative self-assessment, and students' subsequent study and revision behaviors, and their learning. For example, a study that closely examines the revisions to writing made by accurate and inaccurate self-assessors, and the resulting outcomes in terms of the quality of their writing, would be most welcome.
Table S1 (Supplementary Material) indicates that by 2018 researchers began publishing studies that more directly address the hypothesized link between self-assessment and subsequent learning behaviors, as well as important questions about the processes learners engage in while self-assessing ( Yan and Brown, 2017 ). One, a study by Nugteren et al. (2018 ; row 19 in Table S1, Supplementary Material), asked “How do inaccurate [summative] self-assessments influence task selections?” (p. 368) and employed a clever exploratory research design. The results suggested that most of the 15 students in their sample over-estimated their performance and made inaccurate learning-task selections. Nugteren et al. recommended helping students make more accurate self-assessments, but I think the more interesting finding is related to why students made task selections that were too difficult or too easy, given their prior performance: They based most task selections on interest in the content of particular items (not the overarching content to be learned), and infrequently considered task difficulty and support level. For instance, while working on the genetics tasks, students reported selecting tasks because they were fun or interesting, not because they addressed self-identified weaknesses in their understanding of genetics. Nugteren et al. proposed that students would benefit from instruction on task selection. I second that proposal: Rather than directing our efforts on accuracy in the service of improving subsequent task selection, let us simply teach students to use the information at hand to select next best steps, among other things.
Butler (2018 ; row 76 in Table S1, Supplementary Material) has conducted at least two studies of learners' processes of responding to self-assessment items and how they arrived at their judgments. Comparing generic, decontextualized items to task-specific, contextualized items (which she calls after-task items ), she drew two unsurprising conclusions: the task-specific items “generally showed higher correlations with task performance,” and older students “appeared to be more conservative in their judgment compared with their younger counterparts” (p. 249). The contribution of the study is the detailed information it provides about how students generated their judgments. For example, Butler's qualitative data analyses revealed that when asked to self-assess in terms of vague or non-specific items, the children often “contextualized the descriptions based on their own experiences, goals, and expectations,” (p. 257) focused on the task at hand, and situated items in the specific task context. Perhaps as a result, the correlation between after-task self-assessment and task performance was generally higher than for generic self-assessment.
Butler (2018) notes that her study enriches our empirical understanding of the processes by which children respond to self-assessment. This is a very promising direction for the field. Similar studies of processing during formative self-assessment of a variety of task types in a classroom context would likely produce significant advances in our understanding of how and why self-assessment influences learning and performance.
Fifteen of the studies listed in Table S1 (Supplementary Material) focused on students' perceptions of self-assessment. The studies of children suggest that they tend to have unsophisticated understandings of its purposes ( Harris and Brown, 2013 ; Bourke, 2016 ) that might lead to shallow implementation of related processes. In contrast, results from the studies conducted in higher education settings suggested that college and university students understood the function of self-assessment ( Ratminingsih et al., 2018 ) and generally found it to be useful for guiding evaluation and revision ( Micán and Medina, 2017 ), understanding how to take responsibility for learning ( Lopez and Kossack, 2007 ; Bourke, 2014 ; Ndoye, 2017 ), prompting them to think more critically and deeply ( van Helvoort, 2012 ; Siow, 2015 ), applying newfound skills ( Murakami et al., 2012 ), and fostering self-regulated learning by guiding them to set goals, plan, self-monitor and reflect ( Wang, 2017 ).
Not surprisingly, positive perceptions of self-assessment were typically developed by students who actively engaged in the formative type by, for example, developing their own criteria for an effective self-assessment response ( Bourke, 2014 ), or using a rubric or checklist to guide their assessments and then revising their work ( Huang and Gui, 2015 ; Wang, 2017 ). Earlier research suggested that children's attitudes toward self-assessment can become negative if it is summative ( Ross et al., 1998 ). However, even summative self-assessment was reported by adult learners to be useful in helping them become more critical of their own and others' writing throughout the course and in subsequent courses ( van Helvoort, 2012 ).
Twenty-five of the studies in Table S1 (Supplementary Material) investigated the relation between self-assessment and achievement, including two meta-analyses. Twenty of the 25 clearly employed the formative type. Without exception, those 20 studies, plus the two meta-analyses ( Graham et al., 2015 ; Sanchez et al., 2017 ) demonstrated a positive association between self-assessment and learning. The meta-analysis conducted by Graham and his colleagues, which included 10 studies, yielded an average weighted effect size of 0.62 on writing quality. The Sanchez et al. meta-analysis revealed that, although 12 of the 44 effect sizes were negative, on average, “students who engaged in self-grading performed better ( g = 0.34) on subsequent tests than did students who did not” (p. 1,049).
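The sketch below illustrates how a standardized effect size such as the g = 0.34 reported by Sanchez et al. is computed: Hedges' g is the standardized mean difference (Cohen's d) multiplied by a small-sample bias correction. The summary statistics are invented for illustration and do not come from either meta-analysis.

```python
# Minimal sketch of Hedges' g (hypothetical summary statistics).

def hedges_g(m1, sd1, n1, m2, sd2, n2):
    # pooled standard deviation across the two groups
    sp = (((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)) ** 0.5
    d = (m1 - m2) / sp                        # Cohen's d
    correction = 1 - 3 / (4 * (n1 + n2) - 9)  # small-sample bias correction
    return d * correction

# Hypothetical: self-assessment group vs. control group on a subsequent test
g = hedges_g(m1=75.0, sd1=10.0, n1=30, m2=70.0, sd2=10.0, n2=30)
```

With these invented numbers a 5-point advantage against a pooled SD of 10 yields g just under 0.5; a meta-analysis averages such values across studies, weighting each by its precision.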
All but two of the non-meta-analytic studies of achievement in Table S1 (Supplementary Material) were quasi-experimental or experimental, providing relatively rigorous evidence that their treatment groups outperformed their comparison or control groups in terms of everything from writing to dart-throwing, map-making, speaking English, and exams in a wide variety of disciplines. One experiment on summative self-assessment ( Miller and Geraci, 2011 ), in contrast, resulted in no improvements in exam scores, while the other one did ( Raaijmakers et al., 2017 ).
It would be easy to overgeneralize and claim that the question about the effect of self-assessment on learning has been answered, but there are unanswered questions about the key components of effective self-assessment, especially social-emotional components related to power and trust ( Andrade and Brown, 2016 ). The trends are pretty clear, however: it appears that formative forms of self-assessment can promote knowledge and skill development. This is not surprising, given that it involves many of the processes known to support learning, including practice, feedback, revision, and especially the intellectually demanding work of making complex, criteria-referenced judgments ( Panadero et al., 2014 ). Boud (1995a , b) predicted this trend when he noted that many self-assessment processes undermine learning by rushing to judgment, thereby failing to engage students with the standards or criteria for their work.
The association between self-assessment and learning has also been explained in terms of self-regulation ( Andrade, 2010 ; Panadero and Alonso-Tapia, 2013 ; Andrade and Brookhart, 2016 , 2019 ; Panadero et al., 2016b ). Self-regulated learning (SRL) occurs when learners set goals and then monitor and manage their thoughts, feelings, and actions to reach those goals. SRL is moderately to highly correlated with achievement ( Zimmerman and Schunk, 2011 ). Research suggests that formative assessment is a potential influence on SRL ( Nicol and Macfarlane-Dick, 2006 ). The 12 studies in Table S1 (Supplementary Material) that focus on SRL demonstrate the recent increase in interest in the relationship between self-assessment and SRL.
Conceptual and practical overlaps between the two fields are abundant. In fact, Brown and Harris (2014) recommend that student self-assessment no longer be treated as an assessment, but as an essential competence for self-regulation. Butler and Winne (1995) introduced the role of self-generated feedback in self-regulation years ago:
[For] all self-regulated activities, feedback is an inherent catalyst. As learners monitor their engagement with tasks, internal feedback is generated by the monitoring process. That feedback describes the nature of outcomes and the qualities of the cognitive processes that led to those states (p. 245).
The outcomes and processes referred to by Butler and Winne are many of the same products and processes I referred to earlier in the definition of self-assessment and in Table 1 .
In general, research and practice related to self-assessment has tended to focus on judging the products of student learning, while scholarship on self-regulated learning encompasses both processes and products. The very practical focus of much of the research on self-assessment means it might be playing catch-up, in terms of theory development, with the SRL literature, which is grounded in experimental paradigms from cognitive psychology ( de Bruin and van Gog, 2012 ), while self-assessment research is ahead in terms of implementation (E. Panadero, personal communication, October 21, 2016). One major exception is the work done on Self-regulated Strategy Development ( Glaser and Brunstein, 2007 ; Harris et al., 2008 ), which has successfully integrated SRL research with classroom practices, including self-assessment, to teach writing to students with special needs.
Nicol and Macfarlane-Dick (2006) have been explicit about the potential for self-assessment practices to support self-regulated learning:
To develop systematically the learner's capacity for self-regulation, teachers need to create more structured opportunities for self-monitoring and the judging of progression to goals. Self-assessment tasks are an effective way of achieving this, as are activities that encourage reflection on learning progress (p. 207).
The studies of SRL in Table S1 (Supplementary Material) provide encouraging findings regarding the potential role of self-assessment in promoting achievement, self-regulated learning in general, and metacognition and study strategies related to task selection in particular. The studies also represent a solution to the “methodological and theoretical challenges involved in bringing metacognitive research to the real world, using meaningful learning materials” ( Koriat, 2012 , p. 296).
Future Directions for Research
I agree with Yan and Brown's (2017) statement that "from a pedagogical perspective, the benefits of self-assessment may come from active engagement in the learning process, rather than by being 'veridical' or coinciding with reality, because students' reflection and metacognitive monitoring lead to improved learning" (p. 1248). Future research should focus less on accuracy/consistency/veridicality, and more on the precise mechanisms of self-assessment (Butler, 2018).
An important aspect of research on self-assessment that is not explicitly represented in Table S1 (Supplementary Material) is practice, or pedagogy: Under what conditions does self-assessment work best, and how are those conditions influenced by context? Fortunately, the studies listed in the table, as well as others (see especially Andrade and Valtcheva, 2009 ; Nielsen, 2014 ; Panadero et al., 2016a ), point toward an answer. But we still have questions about how best to scaffold effective formative self-assessment. One area of inquiry is about the characteristics of the task being assessed, and the standards or criteria used by learners during self-assessment.
Influence of Types of Tasks and Standards or Criteria
The type of task or competency assessed seems to matter (e.g., Dolosic, 2018; Nguyen and Foster, 2018), as do the criteria (Yilmaz, 2017), but we do not yet have a comprehensive understanding of how or why. There is some evidence that it is important that the criteria used to self-assess are concrete, task-specific (Butler, 2018), and graduated. For example, Fastre et al. (2010) revealed an association between self-assessment according to task-specific criteria and task performance: In a quasi-experimental study of 39 novice vocational education students studying stoma care, they compared concrete, task-specific criteria ("performance-based criteria") such as "Introduces herself to the patient" and "Consults the care file for details concerning the stoma" with vaguer "competence-based criteria" such as "Shows interest, listens actively, shows empathy to the patient" and "Is discrete with sensitive topics." The performance-based criteria group outperformed the competence-based group on tests of task performance, presumably because "performance-based criteria make it easier to distinguish levels of performance, enabling a step-by-step process of performance improvement" (p. 530).
This finding echoes the results of a study of self-regulated learning by Kitsantas and Zimmerman (2006) , who argued that “fine-grained standards can have two key benefits: They can enable learners to be more sensitive to small changes in skill and make more appropriate adaptations in learning strategies” (p. 203). In their study, 70 college students were taught how to throw darts at a target. The purpose of the study was to examine the role of graphing of self-recorded outcomes and self-evaluative standards in learning a motor skill. Students who were provided with graduated self-evaluative standards surpassed “those who were provided with absolute standards or no standards (control) in both motor skill and in motivational beliefs (i.e., self-efficacy, attributions, and self-satisfaction)” (p. 201). Kitsantas and Zimmerman hypothesized that setting high absolute standards would limit a learner's sensitivity to small improvements in functioning. This hypothesis was supported by the finding that students who set absolute standards reported significantly less awareness of learning progress (and hit the bull's-eye less often) than students who set graduated standards. “The correlation between the self-evaluation and dart-throwing outcomes measures was extraordinarily high ( r = 0.94)” (p. 210). Classroom-based research on specific, graduated self-assessment criteria would be informative.
Cognitive and Affective Mechanisms of Self-Assessment
There are many additional questions about pedagogy, such as the investigation proposed above into the relationships among accuracy in formative self-assessment, students' subsequent study behaviors, and their learning. There is also a need for research on how to help teachers give students a central role in their learning by creating space for self-assessment (e.g., see Hawe and Parr, 2014), and the complex power dynamics involved in doing so (Tan, 2004, 2009; Taras, 2008; Leach, 2012). However, there is an even more pressing need for investigations into the internal mechanisms experienced by students engaged in assessing their own learning. Angela Lui and I call this the next black box (Lui, 2017).
Black and Wiliam (1998) used the term black box to emphasize the fact that what happened in most classrooms was largely unknown: all we knew was that some inputs (e.g., teachers, resources, standards, and requirements) were fed into the box, and that certain outputs (e.g., more knowledgeable and competent students, acceptable levels of achievement) would follow. But what, they asked, is happening inside, and what new inputs will produce better outputs? Black and Wiliam's review spawned a great deal of research on formative assessment, some but not all of which suggests a positive relationship with academic achievement (Bennett, 2011; Kingston and Nash, 2011). To better understand why and how the use of formative assessment in general, and self-assessment in particular, is associated with improvements in academic achievement in some instances but not others, we need research that looks into the next black box: the cognitive and affective mechanisms of students who are engaged in assessment processes (Lui, 2017).
The role of internal mechanisms has been discussed in theory but not yet fully tested. Crooks (1988) argued that the impact of assessment is influenced by students' interpretation of the tasks and results, and Butler and Winne (1995) theorized that both cognitive and affective processes play a role in determining how feedback is internalized and used to self-regulate learning. Other theoretical frameworks about the internal processes of receiving and responding to feedback have been developed (e.g., Nicol and Macfarlane-Dick, 2006 ; Draper, 2009 ; Andrade, 2013 ; Lipnevich et al., 2016 ). Yet, Shute (2008) noted in her review of the literature on formative feedback that “despite the plethora of research on the topic, the specific mechanisms relating feedback to learning are still mostly murky, with very few (if any) general conclusions” (p. 156). This area is ripe for research.
Conclusion
Self-assessment is the act of monitoring one's processes and products in order to make adjustments that deepen learning and enhance performance. Although it can be summative, the evidence presented in this review strongly suggests that self-assessment is most beneficial, in terms of both achievement and self-regulated learning, when it is used formatively and supported by training.
What is not yet clear is why and how self-assessment works. Those of you who like to investigate phenomena that are maddeningly difficult to measure will rejoice to hear that the cognitive and affective mechanisms of self-assessment are the next black box. Studies of the ways in which learners think and feel, the interactions between their thoughts and feelings and their context, and the implications for pedagogy will make major contributions to our field.
Author Contributions
The author confirms being the sole contributor of this work and has approved it for publication.
Conflict of Interest Statement
The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Supplementary Material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/feduc.2019.00087/full#supplementary-material
Footnotes
1. I am grateful to my graduate assistants, Joanna Weaver and Taja Young, for conducting the searches.
References
Admiraal, W., Huisman, B., and Pilli, O. (2015). Assessment in massive open online courses. Electron. J. e-Learning 13, 207–216.
Alaoutinen, S. (2012). Evaluating the effect of learning style and student background on self-assessment accuracy. Comput. Sci. Educ. 22, 175–198. doi: 10.1080/08993408.2012.692924
Al-Rawahi, N. M., and Al-Balushi, S. M. (2015). The effect of reflective science journal writing on students' self-regulated learning strategies. Int. J. Environ. Sci. Educ. 10, 367–379. doi: 10.12973/ijese.2015.250a
Andrade, H. (2010). "Students as the definitive source of formative assessment: academic self-assessment and the self-regulation of learning," in Handbook of Formative Assessment , eds H. Andrade and G. Cizek (New York, NY: Routledge), 90–105.
Andrade, H. (2013). “Classroom assessment in the context of learning theory and research,” in Sage Handbook of Research on Classroom Assessment , ed J. H. McMillan (New York, NY: Sage), 17–34. doi: 10.4135/9781452218649.n2
Andrade, H. (2018). “Feedback in the context of self-assessment,” in Cambridge Handbook of Instructional Feedback , eds A. Lipnevich and J. Smith (Cambridge: Cambridge University Press), 376–408.
Andrade, H., and Boulay, B. (2003). The role of rubric-referenced self-assessment in learning to write. J. Educ. Res. 97, 21–34. doi: 10.1080/00220670309596625
Andrade, H., and Brookhart, S. (2019). Classroom assessment as the co-regulation of learning. Assessm. Educ. Principles Policy Pract. doi: 10.1080/0969594X.2019.1571992
Andrade, H., and Brookhart, S. M. (2016). “The role of classroom assessment in supporting self-regulated learning,” in Assessment for Learning: Meeting the Challenge of Implementation , eds D. Laveault and L. Allal (Heidelberg: Springer), 293–309. doi: 10.1007/978-3-319-39211-0_17
Andrade, H., and Du, Y. (2007). Student responses to criteria-referenced self-assessment. Assess. Evalu. High. Educ. 32, 159–181. doi: 10.1080/02602930600801928
Andrade, H., Du, Y., and Mycek, K. (2010). Rubric-referenced self-assessment and middle school students' writing. Assess. Educ. 17, 199–214. doi: 10.1080/09695941003696172
Andrade, H., Du, Y., and Wang, X. (2008). Putting rubrics to the test: The effect of a model, criteria generation, and rubric-referenced self-assessment on elementary school students' writing. Educ. Meas. 27, 3–13. doi: 10.1111/j.1745-3992.2008.00118.x
Andrade, H., and Valtcheva, A. (2009). Promoting learning and achievement through self- assessment. Theory Pract. 48, 12–19. doi: 10.1080/00405840802577544
Andrade, H., Wang, X., Du, Y., and Akawi, R. (2009). Rubric-referenced self-assessment and self-efficacy for writing. J. Educ. Res. 102, 287–302. doi: 10.3200/JOER.102.4.287-302
Andrade, H. L., and Brown, G. T. L. (2016). “Student self-assessment in the classroom,” in Handbook of Human and Social Conditions in Assessment , eds G. T. L. Brown and L. R. Harris (New York, NY: Routledge), 319–334.
Baars, M., Vink, S., van Gog, T., de Bruin, A., and Paas, F. (2014). Effects of training self-assessment and using assessment standards on retrospective and prospective monitoring of problem solving. Learn. Instruc. 33, 92–107. doi: 10.1016/j.learninstruc.2014.04.004
Balderas, I., and Cuamatzi, P. M. (2018). Self and peer correction to improve college students' writing skills. Profile. 20, 179–194. doi: 10.15446/profile.v20n2.67095
Bandura, A. (1997). Self-efficacy: The Exercise of Control . New York, NY: Freeman.
Barney, S., Khurum, M., Petersen, K., Unterkalmsteiner, M., and Jabangwe, R. (2012). Improving students with rubric-based self-assessment and oral feedback. IEEE Transac. Educ. 55, 319–325. doi: 10.1109/TE.2011.2172981
Baxter, P., and Norman, G. (2011). Self-assessment or self deception? A lack of association between nursing students' self-assessment and performance. J. Adv. Nurs. 67, 2406–2413. doi: 10.1111/j.1365-2648.2011.05658.x
Bennett, R. E. (2011). Formative assessment: a critical review. Assess. Educ. 18, 5–25. doi: 10.1080/0969594X.2010.513678
Birjandi, P., and Hadidi Tamjid, N. (2012). The role of self-, peer and teacher assessment in promoting Iranian EFL learners' writing performance. Assess. Evalu. High. Educ. 37, 513–533. doi: 10.1080/02602938.2010.549204
Bjork, R. A., Dunlosky, J., and Kornell, N. (2013). Self-regulated learning: beliefs, techniques, and illusions. Annu. Rev. Psychol. 64, 417–444. doi: 10.1146/annurev-psych-113011-143823
Black, P., Harrison, C., Lee, C., Marshall, B., and Wiliam, D. (2003). Assessment for Learning: Putting it into Practice . Berkshire: Open University Press.
Black, P., and Wiliam, D. (1998). Inside the black box: raising standards through classroom assessment. Phi Delta Kappan 80, 139–144; 146–148.
Blanch-Hartigan, D. (2011). Medical students' self-assessment of performance: results from three meta-analyses. Patient Educ. Counsel. 84, 3–9. doi: 10.1016/j.pec.2010.06.037
Bol, L., Hacker, D. J., Walck, C. C., and Nunnery, J. A. (2012). The effects of individual or group guidelines on the calibration accuracy and achievement of high school biology students. Contemp. Educ. Psychol. 37, 280–287. doi: 10.1016/j.cedpsych.2012.02.004
Boud, D. (1995a). Implementing Student Self-Assessment, 2nd Edn. Australian Capital Territory: Higher Education Research and Development Society of Australasia.
Boud, D. (1995b). Enhancing Learning Through Self-Assessment. London: Kogan Page.
Boud, D. (1999). Avoiding the traps: Seeking good practice in the use of self-assessment and reflection in professional courses. Soc. Work Educ. 18, 121–132. doi: 10.1080/02615479911220131
Boud, D., and Brew, A. (1995). Developing a typology for learner self-assessment practices. Res. Dev. High. Educ. 18, 130–135.
Bourke, R. (2014). Self-assessment in professional programmes within tertiary institutions. Teach. High. Educ. 19, 908–918. doi: 10.1080/13562517.2014.934353
Bourke, R. (2016). Liberating the learner through self-assessment. Cambridge J. Educ. 46, 97–111. doi: 10.1080/0305764X.2015.1015963
Brown, G., Andrade, H., and Chen, F. (2015). Accuracy in student self-assessment: directions and cautions for research. Assess. Educ. 22, 444–457. doi: 10.1080/0969594X.2014.996523
Brown, G. T., and Harris, L. R. (2013). “Student self-assessment,” in Sage Handbook of Research on Classroom Assessment , ed J. H. McMillan (Los Angeles, CA: Sage), 367–393. doi: 10.4135/9781452218649.n21
Brown, G. T. L., and Harris, L. R. (2014). The future of self-assessment in classroom practice: reframing self-assessment as a core competency. Frontline Learn. Res. 2, 22–30. doi: 10.14786/flr.v2i1.24
Butler, D. L., and Winne, P. H. (1995). Feedback and self-regulated learning: a theoretical synthesis. Rev. Educ. Res. 65, 245–281. doi: 10.3102/00346543065003245
Butler, Y. G. (2018). “Young learners' processes and rationales for responding to self-assessment items: cases for generic can-do and five-point Likert-type formats,” in Useful Assessment and Evaluation in Language Education , eds J. Davis et al. (Washington, DC: Georgetown University Press), 21–39. doi: 10.2307/j.ctvvngrq.5
Chang, C.-C., Liang, C., and Chen, Y.-H. (2013). Is learner self-assessment reliable and valid in a Web-based portfolio environment for high school students? Comput. Educ. 60, 325–334. doi: 10.1016/j.compedu.2012.05.012
Chang, C.-C., Tseng, K.-H., and Lou, S.-J. (2012). A comparative analysis of the consistency and difference among teacher-assessment, student self-assessment and peer-assessment in a Web-based portfolio assessment environment for high school students. Comput. Educ. 58, 303–320. doi: 10.1016/j.compedu.2011.08.005
Colliver, J., Verhulst, S., and Barrows, H. (2005). Self-assessment in medical practice: a further concern about the conventional research paradigm. Teach. Learn. Med. 17, 200–201. doi: 10.1207/s15328015tlm1703_1
Crooks, T. J. (1988). The impact of classroom evaluation practices on students. Rev. Educ. Res. 58, 438–481. doi: 10.3102/00346543058004438
de Bruin, A. B. H., and van Gog, T. (2012). Improving self-monitoring and self-regulation: from cognitive psychology to the classroom. Learn. Instruc. 22, 245–252. doi: 10.1016/j.learninstruc.2012.01.003
De Grez, L., Valcke, M., and Roozen, I. (2012). How effective are self- and peer assessment of oral presentation skills compared with teachers' assessments? Active Learn. High. Educ. 13, 129–142. doi: 10.1177/1469787412441284
Dolosic, H. (2018). An examination of self-assessment and interconnected facets of second language reading. Read. Foreign Langu. 30, 189–208.
Draper, S. W. (2009). What are learners actually regulating when given feedback? Br. J. Educ. Technol. 40, 306–315. doi: 10.1111/j.1467-8535.2008.00930.x
Dunlosky, J., and Ariel, R. (2011). “Self-regulated learning and the allocation of study time,” in Psychology of Learning and Motivation , Vol. 54 ed B. Ross (Cambridge, MA: Academic Press), 103–140. doi: 10.1016/B978-0-12-385527-5.00004-8
Dunlosky, J., and Rawson, K. A. (2012). Overconfidence produces underachievement: inaccurate self evaluations undermine students' learning and retention. Learn. Instr. 22, 271–280. doi: 10.1016/j.learninstruc.2011.08.003
Dweck, C. (2006). Mindset: The New Psychology of Success. New York, NY: Random House.
Epstein, R. M., Siegel, D. J., and Silberman, J. (2008). Self-monitoring in clinical practice: a challenge for medical educators. J. Contin. Educ. Health Prof. 28, 5–13. doi: 10.1002/chp.149
Eva, K. W., and Regehr, G. (2008). “I'll never play professional football” and other fallacies of self-assessment. J. Contin. Educ. Health Prof. 28, 14–19. doi: 10.1002/chp.150
Falchikov, N. (2005). Improving Assessment Through Student Involvement: Practical Solutions for Aiding Learning in Higher and Further Education . London: Routledge Falmer.
Fastre, G. M. J., van der Klink, M. R., Sluijsmans, D., and van Merrienboer, J. J. G. (2012). Drawing students' attention to relevant assessment criteria: effects on self-assessment skills and performance. J. Voc. Educ. Train. 64, 185–198. doi: 10.1080/13636820.2011.630537
Fastre, G. M. J., van der Klink, M. R., and van Merrienboer, J. J. G. (2010). The effects of performance-based assessment criteria on student performance and self-assessment skills. Adv. Health Sci. Educ. 15, 517–532. doi: 10.1007/s10459-009-9215-x
Fitzpatrick, B., and Schulz, H. (2016). “Teaching young students to self-assess critically,” Paper presented at the Annual Meeting of the American Educational Research Association (Washington, DC).
Franken, A. S. (1992). I'm Good Enough, I'm Smart Enough, and Doggone it, People Like Me! Daily affirmations by Stuart Smalley. New York, NY: Dell.
Glaser, C., and Brunstein, J. C. (2007). Improving fourth-grade students' composition skills: effects of strategy instruction and self-regulation procedures. J. Educ. Psychol. 99, 297–310.
Gonida, E. N., and Leondari, A. (2011). Patterns of motivation among adolescents with biased and accurate self-efficacy beliefs. Int. J. Educ. Res. 50, 209–220. doi: 10.1016/j.ijer.2011.08.002
Graham, S., Hebert, M., and Harris, K. R. (2015). Formative assessment and writing. Elem. Sch. J. 115, 523–547. doi: 10.1086/681947
Guillory, J. J., and Blankson, A. N. (2017). Using recently acquired knowledge to self-assess understanding in the classroom. Sch. Teach. Learn. Psychol. 3, 77–89. doi: 10.1037/stl0000079
Hacker, D. J., Bol, L., Horgan, D. D., and Rakow, E. A. (2000). Test prediction and performance in a classroom context. J. Educ. Psychol. 92, 160–170.
Harding, J. L., and Hbaci, I. (2015). Evaluating pre-service teachers math teaching experience from different perspectives. Univ. J. Educ. Res. 3, 382–389. doi: 10.13189/ujer.2015.030605
Harris, K. R., Graham, S., Mason, L. H., and Friedlander, B. (2008). Powerful Writing Strategies for All Students . Baltimore, MD: Brookes.
Harris, L. R., and Brown, G. T. L. (2013). Opportunities and obstacles to consider when using peer- and self-assessment to improve student learning: case studies into teachers' implementation. Teach. Teach. Educ. 36, 101–111. doi: 10.1016/j.tate.2013.07.008
Hattie, J., and Timperley, H. (2007). The power of feedback. Rev. Educ. Res. 77, 81–112. doi: 10.3102/003465430298487
Hawe, E., and Parr, J. (2014). Assessment for learning in the writing classroom: an incomplete realization. Curr. J. 25, 210–237. doi: 10.1080/09585176.2013.862172
Hawkins, S. C., Osborne, A., Schofield, S. J., Pournaras, D. J., and Chester, J. F. (2012). Improving the accuracy of self-assessment of practical clinical skills using video feedback: the importance of including benchmarks. Med. Teach. 34, 279–284. doi: 10.3109/0142159X.2012.658897
Huang, Y., and Gui, M. (2015). Articulating teachers' expectations afore: Impact of rubrics on Chinese EFL learners' self-assessment and speaking ability. J. Educ. Train. Stud. 3, 126–132. doi: 10.11114/jets.v3i3.753
Kaderavek, J. N., Gillam, R. B., Ukrainetz, T. A., Justice, L. M., and Eisenberg, S. N. (2004). School-age children's self-assessment of oral narrative production. Commun. Disord. Q. 26, 37–48. doi: 10.1177/15257401040260010401
Karnilowicz, W. (2012). A comparison of self-assessment and tutor assessment of undergraduate psychology students. Soc. Behav. Person. 40, 591–604. doi: 10.2224/sbp.2012.40.4.591
Kevereski, L. (2017). (Self) evaluation of knowledge in students' population in higher education in Macedonia. Res. Pedag. 7, 69–75. doi: 10.17810/2015.49
Kingston, N. M., and Nash, B. (2011). Formative assessment: a meta-analysis and a call for research. Educ. Meas. 30, 28–37. doi: 10.1111/j.1745-3992.2011.00220.x
Kitsantas, A., and Zimmerman, B. J. (2006). Enhancing self-regulation of practice: the influence of graphing and self-evaluative standards. Metacogn. Learn. 1, 201–212. doi: 10.1007/s11409-006-9000-7
Kluger, A. N., and DeNisi, A. (1996). The effects of feedback interventions on performance: a historical review, a meta-analysis, and a preliminary feedback intervention theory. Psychol. Bull. 119, 254–284. doi: 10.1037/0033-2909.119.2.254
Kollar, I., Fischer, F., and Hesse, F. (2006). Collaboration scripts: a conceptual analysis. Educ. Psychol. Rev. 18, 159–185. doi: 10.1007/s10648-006-9007-2
Kolovelonis, A., Goudas, M., and Dermitzaki, I. (2012). Students' performance calibration in a basketball dribbling task in elementary physical education. Int. Electron. J. Elem. Educ. 4, 507–517.
Koriat, A. (2012). The relationships between monitoring, regulation and performance. Learn. Instru. 22, 296–298. doi: 10.1016/j.learninstruc.2012.01.002
Kostons, D., van Gog, T., and Paas, F. (2012). Training self-assessment and task-selection skills: a cognitive approach to improving self-regulated learning. Learn. Instruc. 22, 121–132. doi: 10.1016/j.learninstruc.2011.08.004
Labuhn, A. S., Zimmerman, B. J., and Hasselhorn, M. (2010). Enhancing students' self-regulation and mathematics performance: the influence of feedback and self-evaluative standards. Metacogn. Learn. 5, 173–194. doi: 10.1007/s11409-010-9056-2
Leach, L. (2012). Optional self-assessment: some tensions and dilemmas. Assess. Evalu. High. Educ. 37, 137–147. doi: 10.1080/02602938.2010.515013
Lew, M. D. N., Alwis, W. A. M., and Schmidt, H. G. (2010). Accuracy of students' self-assessment and their beliefs about its utility. Assess. Evalu. High. Educ. 35, 135–156. doi: 10.1080/02602930802687737
Lin-Siegler, X., Shaenfield, D., and Elder, A. D. (2015). Contrasting case instruction can improve self-assessment of writing. Educ. Technol. Res. Dev. 63, 517–537. doi: 10.1007/s11423-015-9390-9
Lipnevich, A. A., Berg, D. A. G., and Smith, J. K. (2016). “Toward a model of student response to feedback,” in The Handbook of Human and Social Conditions in Assessment , eds G. T. L. Brown and L. R. Harris (New York, NY: Routledge), 169–185.
Lopez, R., and Kossack, S. (2007). Effects of recurring use of self-assessment in university courses. Int. J. Learn. 14, 203–216. doi: 10.18848/1447-9494/CGP/v14i04/45277
Lopez-Pastor, V. M., Fernandez-Balboa, J.-M., Santos Pastor, M. L., and Aranda, A. F. (2012). Students' self-grading, professor's grading and negotiated final grading at three university programmes: analysis of reliability and grade difference ranges and tendencies. Assess. Evalu. High. Educ. 37, 453–464. doi: 10.1080/02602938.2010.545868
Lui, A. (2017). Validity of the responses to feedback survey: operationalizing and measuring students' cognitive and affective responses to teachers' feedback (Doctoral dissertation). University at Albany—SUNY: Albany NY.
Marks, M. B., Haug, J. C., and Hu, H. (2018). Investigating undergraduate business internships: do supervisor and self-evaluations differ? J. Educ. Bus. 93, 33–45. doi: 10.1080/08832323.2017.1414025
Memis, E. K., and Seven, S. (2015). Effects of an SWH approach and self-evaluation on sixth grade students' learning and retention of an electricity unit. Int. J. Prog. Educ. 11, 32–49.
Metcalfe, J., and Kornell, N. (2005). A region of proximal learning model of study time allocation. J. Mem. Langu. 52, 463–477. doi: 10.1016/j.jml.2004.12.001
Meusen-Beekman, K. D., Joosten-ten Brinke, D., and Boshuizen, H. P. A. (2016). Effects of formative assessments to develop self-regulation among sixth grade students: results from a randomized controlled intervention. Stud. Educ. Evalu. 51, 126–136. doi: 10.1016/j.stueduc.2016.10.008
Micán, D. A., and Medina, C. L. (2017). Boosting vocabulary learning through self-assessment in an English language teaching context. Assess. Evalu. High. Educ. 42, 398–414. doi: 10.1080/02602938.2015.1118433
Miller, T. M., and Geraci, L. (2011). Training metacognition in the classroom: the influence of incentives and feedback on exam predictions. Metacogn. Learn. 6, 303–314. doi: 10.1007/s11409-011-9083-7
Murakami, C., Valvona, C., and Broudy, D. (2012). Turning apathy into activeness in oral communication classes: regular self- and peer-assessment in a TBLT programme. System 40, 407–420. doi: 10.1016/j.system.2012.07.003
Nagel, M., and Lindsey, B. (2018). The use of classroom clickers to support improved self-assessment in introductory chemistry. J. College Sci. Teach. 47, 72–79.
Ndoye, A. (2017). Peer/self-assessment and student learning. Int. J. Teach. Learn. High. Educ. 29, 255–269.
Nguyen, T., and Foster, K. A. (2018). Research note—multiple time point course evaluation and student learning outcomes in an MSW course. J. Soc. Work Educ. 54, 715–723. doi: 10.1080/10437797.2018.1474151
Nicol, D., and Macfarlane-Dick, D. (2006). Formative assessment and self-regulated learning: a model and seven principles of good feedback practice. Stud. High. Educ. 31, 199–218. doi: 10.1080/03075070600572090
Nielsen, K. (2014). Self-assessment methods in writing instruction: a conceptual framework, successful practices and essential strategies. J. Res. Read. 37, 1–16. doi: 10.1111/j.1467-9817.2012.01533.x
Nowell, C., and Alston, R. M. (2007). I thought I got an A! Overconfidence across the economics curriculum. J. Econ. Educ. 38, 131–142. doi: 10.3200/JECE.38.2.131-142
Nugteren, M. L., Jarodzka, H., Kester, L., and Van Merriënboer, J. J. G. (2018). Self-regulation of secondary school students: self-assessments are inaccurate and insufficiently used for learning-task selection. Instruc. Sci. 46, 357–381. doi: 10.1007/s11251-018-9448-2
Panadero, E., and Alonso-Tapia, J. (2013). Self-assessment: theoretical and practical connotations. When it happens, how is it acquired and what to do to develop it in our students. Electron. J. Res. Educ. Psychol. 11, 551–576. doi: 10.14204/ejrep.30.12200
Panadero, E., Alonso-Tapia, J., and Huertas, J. A. (2012). Rubrics and self-assessment scripts effects on self-regulation, learning and self-efficacy in secondary education. Learn. Individ. Differ. 22, 806–813. doi: 10.1016/j.lindif.2012.04.007
Panadero, E., Alonso-Tapia, J., and Huertas, J. A. (2014). Rubrics vs. self-assessment scripts: effects on first year university students' self-regulation and performance. J. Study Educ. Dev. 3, 149–183. doi: 10.1080/02103702.2014.881655
Panadero, E., Alonso-Tapia, J., and Reche, E. (2013). Rubrics vs. self-assessment scripts effect on self-regulation, performance and self-efficacy in pre-service teachers. Stud. Educ. Evalu. 39, 125–132. doi: 10.1016/j.stueduc.2013.04.001
Panadero, E., Brown, G. L., and Strijbos, J.-W. (2016a). The future of student self-assessment: a review of known unknowns and potential directions. Educ. Psychol. Rev. 28, 803–830. doi: 10.1007/s10648-015-9350-2
Panadero, E., Jonsson, A., and Botella, J. (2017). Effects of self-assessment on self-regulated learning and self-efficacy: four meta-analyses. Educ. Res. Rev. 22, 74–98. doi: 10.1016/j.edurev.2017.08.004
Panadero, E., Jonsson, A., and Strijbos, J. W. (2016b). “Scaffolding self-regulated learning through self-assessment and peer assessment: guidelines for classroom implementation,” in Assessment for Learning: Meeting the Challenge of Implementation , eds D. Laveault and L. Allal (New York, NY: Springer), 311–326. doi: 10.1007/978-3-319-39211-0_18
Panadero, E., and Romero, M. (2014). To rubric or not to rubric? The effects of self-assessment on self-regulation, performance and self-efficacy. Assess. Educ. 21, 133–148. doi: 10.1080/0969594X.2013.877872
Assessment, evaluations, and definitions of research impact: A review
Teresa Penfield, Matthew J. Baker, Rosa Scoble, Michael C. Wykes, Assessment, evaluations, and definitions of research impact: A review, Research Evaluation , Volume 23, Issue 1, January 2014, Pages 21–32, https://doi.org/10.1093/reseval/rvt021
This article aims to explore what is understood by the term 'research impact' and to provide a comprehensive assimilation of the available literature and information, drawing on global experiences to understand the potential for methods and frameworks of impact assessment to be implemented for UK impact assessment. We then take a more focused look at the impact component of the UK Research Excellence Framework (REF) taking place in 2014, at some of the challenges of evaluating impact, and at the role that systems might play in the future in capturing the links between research and impact, along with the requirements we have for such systems.
When considering the impact that is generated as a result of research, a number of authors and government recommendations have advised that a clear definition of impact is required ( Duryea, Hochman, and Parfitt 2007 ; Grant et al. 2009 ; Russell Group 2009 ). From the outset, we note that the understanding of the term impact differs between users and audiences. There is a distinction between ‘academic impact’ understood as the intellectual contribution to one’s field of study within academia and ‘external socio-economic impact’ beyond academia. In the UK, evaluation of academic and broader socio-economic impact takes place separately. ‘Impact’ has become the term of choice in the UK for research influence beyond academia. This distinction is not so clear in impact assessments outside of the UK, where academic outputs and socio-economic impacts are often viewed as one, to give an overall assessment of value and change created through research.
For the purposes of the REF, impact is defined as 'an effect on, change or benefit to the economy, society, culture, public policy or services, health, the environment or quality of life, beyond academia'.
Impact is assessed alongside research outputs and environment to provide an evaluation of research taking place within an institution. As such, research outputs, for example, knowledge generated and publications, can be translated into outcomes, for example, new products and services, and impacts or added value ( Duryea et al. 2007 ). Although some might find the distinction somewhat marginal or even confusing, this differentiation between outputs, outcomes, and impacts is important, and has been highlighted not only for the impacts derived from university research ( Kelly and McNicoll 2011 ) but also for work done in the charitable sector ( Ebrahim and Rangan 2010 ; Berg and Månsson 2011 ; Kelly and McNicoll 2011 ). The Social Return on Investment (SROI) guide ( The SROI Network 2012 ) suggests that 'The language varies "impact", "returns", "benefits", "value" but the questions around what sort of difference and how much of a difference we are making are the same'. It is perhaps assumed here that a positive or beneficial effect will be considered as an impact, but what about changes that are perceived to be negative? Wooding et al. (2007) , in adapting the Payback Framework (developed for the health and biomedical sciences) to the social sciences, changed its terminology from 'benefit' to 'impact', arguing that whether a change is positive or negative is subjective and can shift over time. The drug thalidomide is a commonly cited example: introduced in the 1950s to treat, among other things, morning sickness, it was withdrawn in the early 1960s after its teratogenic effects resulted in birth defects, yet it has since been found to be beneficial in the treatment of certain types of cancer. Clearly, the impact of thalidomide would have been viewed very differently in the 1950s compared with the 1960s or today.
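The SROI logic referred to above reduces, at its simplest, to a ratio of monetised benefits to investment. The sketch below is a minimal illustration of that arithmetic only, not the SROI Network's actual methodology (which also adjusts for deadweight, displacement, and attribution); the figures and function name are hypothetical:

```python
def sroi_ratio(monetised_benefits, investment, discount_rate=0.0):
    """Crude SROI: present value of monetised benefits / investment.

    monetised_benefits: benefit values, one per year after the investment.
    discount_rate: annual rate used to discount future benefits.
    """
    present_value = sum(
        value / (1 + discount_rate) ** year
        for year, value in enumerate(monetised_benefits, start=1)
    )
    return present_value / investment

# A project costing 40,000 yielding 50,000 then 30,000 of benefit over
# two years, undiscounted, gives a "2:1" social return.
print(sroi_ratio([50_000, 30_000], 40_000))  # → 2.0
```

The hard part in practice is not this division but the monetisation step itself, which is exactly where the 'positive versus negative change' question raised above arises.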
In viewing impact evaluations it is important to consider not only who has evaluated the work but the purpose of the evaluation to determine the limits and relevance of an assessment exercise. In this article, we draw on a broad range of examples with a focus on methods of evaluation for research impact within Higher Education Institutions (HEIs). As part of this review, we aim to explore the following questions:
What are the reasons behind trying to understand and evaluate research impact?
What are the methodologies and frameworks that have been employed globally to assess research impact and how do these compare?
What are the challenges associated with understanding and evaluating research impact?
What indicators, evidence, and impacts need to be captured within developing systems?
What are the reasons behind trying to understand and evaluate research impact? Historically, the role of a university has been to provide both education and research, but its fundamental purpose was perhaps best described in the writings of the mathematician and philosopher Alfred North Whitehead (1929) .
‘The justification for a university is that it preserves the connection between knowledge and the zest of life, by uniting the young and the old in the imaginative consideration of learning. The university imparts information, but it imparts it imaginatively. At least, this is the function which it should perform for society. A university which fails in this respect has no reason for existence. This atmosphere of excitement, arising from imaginative consideration transforms knowledge.’
In undertaking excellent research, we anticipate that great things will follow; indeed, one of the fundamental reasons for undertaking research is to generate and transform knowledge that will benefit society as a whole.
One might consider that by funding excellent research, impacts (including those that are unforeseen) will follow, and traditionally, assessment of university research has focused on academic quality and productivity. Aspects of impact, such as the value of Intellectual Property, are currently recorded by UK universities through their Higher Education Business and Community Interaction Survey return to the Higher Education Statistics Agency; however, as with other public and charitable sector organizations, showcasing impact is an important part of attracting and retaining donors and support ( Kelly and McNicoll 2011 ).
The reasoning behind the move towards assessing research impact is undoubtedly complex, involving both political and socio-economic factors, but, nevertheless, we can differentiate between four primary purposes.
HEIs overview. To enable research organizations including HEIs to monitor and manage their performance and understand and disseminate the contribution that they are making to local, national, and international communities.
Accountability. To demonstrate to government, stakeholders, and the wider public the value of research. There has been a drive from the UK government, through the Higher Education Funding Council for England (HEFCE) and the Research Councils ( HM Treasury 2004 ), to account for the spending of public money by demonstrating the value of research to tax payers, voters, and the public in terms of socio-economic benefits ( European Science Foundation 2009 ), in effect, justifying this expenditure ( Davies, Nutley, and Walter 2005 ; Hanney and González-Block 2011 ).
Inform funding. To understand the socio-economic value of research and subsequently inform funding decisions. By evaluating the contribution that research makes to society and the economy, future funding can be allocated where it is perceived to bring about the desired impact. As Donovan (2011) comments, ‘Impact is a strong weapon for making an evidence based case to governments for enhanced research support’.
Understand. To understand the methods and routes by which research leads to impacts, in order to make the most of research findings and develop better ways of delivering impact.
The growing trend for accountability within the university system is not limited to research and is mirrored in assessments of teaching quality, which now feed into evaluation of universities to ensure fee-paying students’ satisfaction. In demonstrating research impact, we can provide accountability upwards to funders and downwards to users on a project and strategic basis ( Kelly and McNicoll 2011 ). Organizations may be interested in reviewing and assessing research impact for one or more of the aforementioned purposes and this will influence the way in which evaluation is approached.
It is important to emphasize that 'Not everyone within the higher education sector itself is convinced that evaluation of higher education activity is a worthwhile task' ( Kelly and McNicoll 2011 ). Once plans for the new assessment of university research were released, the University and College Union ( University and College Union 2011 ) organized a petition calling on the UK funding councils to withdraw the inclusion of impact assessment from the REF proposals. This petition was signed by 17,570 academics (52,409 academics were returned to the 2008 Research Assessment Exercise), including Nobel laureates and Fellows of the Royal Society ( University and College Union 2011 ). Impact assessments raise concerns that research will be steered towards disciplines and topics in which impact is more easily evidenced and which provide economic impacts, potentially devaluing 'blue skies' research. Johnston ( 1995 ) notes that by developing relationships between researchers and industry, new research strategies can be developed. This raises the questions of whether UK business and industry should not themselves invest in the research that will deliver them impacts, and of who will fund basic research if not the government. Donovan (2011) asserts that there should be no disincentive for conducting basic research. By asking academics to consider the impact of the research they undertake, and by reviewing and funding them accordingly, the result may be to compromise research by steering it away from the imaginative and creative quest for knowledge. Professor James Ladyman, of the University of Bristol, a vocal opponent of awarding funding based on the assessment of research impact, has been quoted as saying that '…inclusion of impact in the REF will create "selection pressure," promoting academic research that has "more direct economic impact" or which is easier to explain to the public' ( Corbyn 2009 ).
Despite the concerns raised, the broader socio-economic impacts of research will be included and will count for 20% of the overall research assessment as part of the REF in 2014. From an international perspective, this represents a step change in the comprehensiveness with which impact will be assessed within universities and research institutes, incorporating impact from across all research disciplines. Understanding what impact looks like across the various strands of research, and the variety of indicators and proxies used to evidence impact, will be important to developing a meaningful assessment.
What are the methodologies and frameworks that have been employed globally to evaluate research impact and how do these compare? The traditional form of evaluation of university research in the UK was based on measuring academic impact and quality through a process of peer review ( Grant 2006 ). Evidence of academic impact may be derived through various bibliometric methods, one example of which is the h-index, which combines the number of an author's publications with their citation counts. Such metrics may be used in the UK to understand the benefits of research within academia and are often incorporated into the broader perspective of impact seen internationally, for example, within Excellence in Research for Australia and using Star Metrics in the USA, in which quantitative measures are used to assess impact, for example, publications, citations, and research income. These 'traditional' bibliometric techniques can be regarded as giving only a partial picture of full impact ( Bornmann and Marx 2013 ), with no link to causality. Standard approaches actively used in programme evaluation, such as surveys, case studies, bibliometrics, econometrics and statistical analyses, content analysis, and expert judgment, are each considered by some ( Vonortas and Link 2012 ) to have shortcomings when used to measure impacts.
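To make the bibliometric example above concrete: an author's h-index is the largest number h such that h of their papers have each been cited at least h times. A minimal sketch of that computation (the citation counts are invented for illustration):

```python
def h_index(citation_counts):
    """Largest h such that h papers each have at least h citations."""
    h = 0
    # Rank papers from most to least cited; paper at rank r supports an
    # index of r only if it has at least r citations.
    for rank, cites in enumerate(sorted(citation_counts, reverse=True), start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Five papers cited 10, 8, 5, 4 and 3 times: four papers have >= 4
# citations each, but there are not five papers with >= 5, so h = 4.
print(h_index([10, 8, 5, 4, 3]))  # → 4
```

The simplicity of the calculation illustrates the article's point: it captures volume and citation together, but says nothing about who was affected or how, i.e. it carries no link to causality or socio-economic impact.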
Incorporating assessment of the wider socio-economic impact began using metrics-based indicators such as Intellectual Property registered and commercial income generated ( Australian Research Council 2008 ). In the UK, more sophisticated assessments of impact incorporating wider socio-economic benefits were first investigated within the fields of Biomedical and Health Sciences ( Grant 2006 ), an area of research that wanted to be able to justify the significant investment it received. Frameworks for assessing impact have been designed and are employed at an organizational level addressing the specific requirements of the organization and stakeholders. As a result, numerous and widely varying models and frameworks for assessing impact exist. Here we outline a few of the most notable models that demonstrate the contrast in approaches available.
The Payback Framework is possibly the most widely used and adapted model for impact assessment ( Wooding et al. 2007 ; Nason et al. 2008 ), developed during the mid-1990s by Buxton and Hanney, working at Brunel University. It incorporates both academic outputs and wider societal benefits ( Donovan and Hanney 2011 ) to assess outcomes of health sciences research. The Payback Framework systematically links research with the associated benefits ( Scoble et al. 2010 ; Hanney and González-Block 2011 ) and can be thought of in two parts: a model that allows the research and subsequent dissemination process to be broken into specific components within which the benefits of research can be studied, and second, a multi-dimensional classification scheme into which the various outputs, outcomes, and impacts can be placed ( Hanney and Gonzalez Block 2011 ). The Payback Framework has been adopted internationally, largely within the health sector, by organizations such as the Canadian Institute of Health Research, the Dutch Public Health Authority, the Australian National Health and Medical Research Council, and the Welfare Bureau in Hong Kong ( Bernstein et al. 2006 ; Nason et al. 2008 ; CAHS 2009; Spaapen et al. n.d. ). The Payback Framework enables health and medical research and impact to be linked and the process by which impact occurs to be traced. For more extensive reviews of the Payback Framework, see Davies et al. (2005) , Wooding et al. (2007) , Nason et al. (2008) , and Hanney and González-Block (2011) .
A very different approach, known as Social Impact Assessment Methods for research and funding instruments through the study of Productive Interactions (SIAMPI), was developed from the Dutch project Evaluating Research in Context and has as its central theme the capture of 'productive interactions' between researchers and stakeholders by analysing the networks that evolve during research programmes ( Spaapen and Drooge 2011 ; Spaapen et al. n.d. ). SIAMPI is based on the widely held assumption that interactions between researchers and stakeholders are an important pre-requisite to achieving impact ( Donovan 2011 ; Hughes and Martin 2012 ; Spaapen et al. n.d. ). This framework is intended to be used as a learning tool to develop a better understanding of how research interactions lead to social impact, rather than as an assessment tool for judging, showcasing, or even linking impact to a specific piece of research. SIAMPI has been used within the Netherlands Institute for Health Services Research ( SIAMPI n.d. ). 'Productive interactions', which can perhaps be viewed as instances of knowledge exchange, are widely valued and supported internationally as mechanisms for enabling impact; for example, Canada's Social Sciences and Humanities Research Council financially supports knowledge exchange with a view to enabling long-term impact. In the UK, the Department for Business, Innovation and Skills provided funding of £150 million for knowledge exchange in 2011–12 to 'help universities and colleges support the economic recovery and growth, and contribute to wider society' ( Department for Business, Innovation and Skills 2012 ). While valuing and supporting knowledge exchange is important, SIAMPI perhaps takes this a step further in enabling these exchange events to be captured and analysed. One of the advantages of this method is that less input is required compared with capturing the full route from research to impact.
A comprehensive assessment of impact itself is not undertaken with SIAMPI, which makes it a less suitable method where showcasing the benefits of research is desirable or where justification of funding based on impact is required.
The first attempt globally to comprehensively capture the socio-economic impact of research across all disciplines was undertaken for the Australian Research Quality Framework (RQF), using a case study approach. The RQF was developed to demonstrate and justify public expenditure on research, and as part of this framework, a pilot assessment was undertaken by the Australian Technology Network. Researchers were asked to evidence the economic, societal, environmental, and cultural impact of their research within broad categories, which were then verified by an expert panel ( Duryea et al. 2007 ) who concluded that the researchers and case studies could provide enough qualitative and quantitative evidence for reviewers to assess the impact arising from their research ( Duryea et al. 2007 ). To evaluate impact, case studies were interrogated and verifiable indicators assessed to determine whether research had led to reciprocal engagement, adoption of research findings, or public value. The RQF pioneered the case study approach to assessing research impact; however, with a change in government in 2007, this framework was never implemented in Australia, although it has since been taken up and adapted for the UK REF.
In developing the UK REF, HEFCE commissioned a report, in 2009, from RAND to review international practice for assessing research impact and provide recommendations to inform the development of the REF. RAND selected four frameworks to represent the international arena ( Grant et al. 2009 ). One of these, the RQF, they identified as providing a ‘promising basis for developing an impact approach for the REF’ using the case study approach. HEFCE developed an initial methodology that was then tested through a pilot exercise. The case study approach, recommended by the RQF, was combined with ‘significance’ and ‘reach’ as criteria for assessment. The criteria for assessment were also supported by a model developed by Brunel for ‘measurement’ of impact that used similar measures defined as depth and spread. In the Brunel model, depth refers to the degree to which the research has influenced or caused change, whereas spread refers to the extent to which the change has occurred and influenced end users. Evaluation of impact in terms of reach and significance allows all disciplines of research and types of impact to be assessed side-by-side ( Scoble et al. 2010 ).
The range and diversity of frameworks developed reflect the variation in purpose of evaluation including the stakeholders for whom the assessment takes place, along with the type of impact and evidence anticipated. The most appropriate type of evaluation will vary according to the stakeholder whom we are wishing to inform. Studies ( Buxton, Hanney and Jones 2004 ) into the economic gains from biomedical and health sciences determined that different methodologies provide different ways of considering economic benefits. A discussion on the benefits and drawbacks of a range of evaluation tools (bibliometrics, economic rate of return, peer review, case study, logic modelling, and benchmarking) can be found in the article by Grant (2006) .
Evaluation of impact is becoming increasingly important, both within the UK and internationally, and research and development into impact evaluation continues, for example, researchers at Brunel have developed the concept of depth and spread further into the Brunel Impact Device for Evaluation, which also assesses the degree of separation between research and impact ( Scoble et al. working paper ).
Although based on the RQF, the REF did not adopt all of the suggestions held within it, for example, the option of allowing research groups to opt out of impact assessment should the nature or stage of their research deem it unsuitable ( Donovan 2008 ). In 2009–10, the REF team conducted a pilot study involving 29 institutions, which submitted case studies to one of five units of assessment (clinical medicine; physics; earth systems and environmental sciences; social work and social policy; and English language and literature) ( REF2014 2010 ). These case studies were reviewed by expert panels and, as with the RQF, it was found possible to assess impact and develop 'impact profiles' using the case study approach ( REF2014 2010 ).
From 2014, research within UK universities and institutions will be assessed through the REF; this will replace the Research Assessment Exercise, which has been used to assess UK research since the 1980s. Differences between the two assessments include the removal of indicators of esteem and the addition of assessment of socio-economic research impact. The REF will therefore assess three aspects of research: outputs, impact, and environment.
Research impact is assessed in two formats: first, through an impact template that describes the approach to enabling impact within a unit of assessment, and second, through impact case studies that describe the impact taking place following excellent research within a unit of assessment ( REF2014 2011a ). HEFCE initially indicated that impact should merit a 25% weighting within the REF ( REF2014 2011b ); however, this was reduced to 20% for the 2014 REF, perhaps as a result of feedback and lobbying, for example, from the Russell Group and Million+ group of universities, who called for impact to count for 15% ( Russell Group 2009 ; Jump 2011 ), and following guidance from the expert panels undertaking the pilot exercise, who suggested that during the 2014 REF impact assessment would be in a developmental phase and that a lower weighting would therefore be appropriate, with the expectation that it would be increased in subsequent assessments ( REF2014 2010 ).
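Under such a weighting scheme, an institution's overall quality score is simply a weighted average of the scores for the three elements. The sketch below assumes a 65/20/15 split for outputs, impact, and environment (only the 20% impact weighting is stated in the text; the other two weights are an assumption) and a 0–4 star scale for each element; it is an illustration of the arithmetic, not HEFCE's official calculation, which works with profiles rather than single scores:

```python
def overall_ref_score(outputs, impact, environment,
                      weights=(0.65, 0.20, 0.15)):
    """Weighted average of the three REF elements on a 0-4 star scale."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return (weights[0] * outputs
            + weights[1] * impact
            + weights[2] * environment)

# With a 20% impact weighting, strong outputs largely offset a weaker
# impact sub-profile in the overall score.
print(round(overall_ref_score(outputs=3.2, impact=2.8, environment=3.0), 2))
```

The arithmetic makes visible what the lobbying described above was about: moving impact from 25% to 20% (or 15%) directly bounds how much a weak or strong impact submission can shift the overall result.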
The quality and reliability of impact indicators will vary according to the impact we are trying to describe and link to research. In the UK, evidence and research impacts will be assessed for the REF within research disciplines. Although it can be envisaged that the range of impacts derived from research of different disciplines are likely to vary, one might question whether it makes sense to compare impacts within disciplines when the range of impact can vary enormously, for example, from business development to cultural changes or saving lives? An alternative approach was suggested for the RQF in Australia, where it was proposed that types of impact be compared rather than impact from specific disciplines.
Providing advice and guidance within specific disciplines is undoubtedly helpful. It can be seen from the panel guidance produced by HEFCE to illustrate impacts and evidence that impact and evidence are expected to vary according to discipline ( REF2014 2012 ). Why should this be the case? Two areas of research impact, health and biomedical sciences and the social sciences, have received particular attention in the literature by comparison with, for example, the arts. Reviews and guidance on developing and evidencing impact in particular disciplines include the London School of Economics (LSE) Public Policy Group's impact handbook (LSE n.d.), a review of the social and economic impacts arising from the arts ( Reeves 2002 ), and a review by Kuruvilla et al. (2006) of the impact arising from health research. Perhaps it is time for a generic guide based on types of impact rather than research discipline?
What are the challenges associated with understanding and evaluating research impact? In endeavouring to assess or evaluate impact, a number of difficulties emerge and these may be specific to certain types of impact. Given that the type of impact we might expect varies according to research discipline, impact-specific challenges present us with the problem that an evaluation mechanism may not fairly compare impact between research disciplines.
5.1 Time lag
The time lag between research and impact varies enormously. For example, a spin-out can be developed in a very short period, whereas it took around 30 years from the discovery of the structure of DNA before the technology enabling DNA fingerprinting was developed. In developing the RQF, The Allen Consulting Group (2005) highlighted that defining a time lag between research and impact was difficult. In the UK, the Russell Group universities responded to the REF consultation by recommending that no time lag be imposed on the delivery of impact from a piece of research, citing examples such as the development of cardiovascular disease treatments, which take between 10 and 25 years from research to impact ( Russell Group 2009 ). To be considered for inclusion within the REF, impact must be underpinned by research that took place between 1 January 1993 and 31 December 2013, with impact occurring during an assessment window from 1 January 2008 to 31 July 2013. However, there has been recognition that this window may be insufficient in some instances, with architecture being granted an additional 5-year period ( REF2014 2012 ); why only architecture has been granted this dispensation is not clear, when similar cases could be made for medicine, physics, or even English literature. A recommendation from the REF pilot was that panels should be able to extend the time frame where appropriate; this, however, leaves those submitting a case study facing a difficult judgement about what view the panel will take, and whether, if the time frame is deemed inappropriate, the case study will be rendered 'unclassified'.
5.2 The developmental nature of impact
Impact is not static: it will develop and change over time, and this development may be an increase or decrease in the current degree of impact. Impact can be temporary or long-lasting, so the point at which assessment takes place will influence the degree and significance of that impact. For example, following the discovery of a new potential drug, preclinical work is required, followed by Phase 1, 2, and 3 trials, before regulatory approval is granted and the drug is used to deliver potential health benefits. The potential new drug may fail at any one of these phases, yet each phase can be classed as an interim impact of the original discovery work en route to the delivery of health benefits; the time at which an impact assessment takes place will therefore influence the degree of impact observed. If impact is short-lived and has come and gone within an assessment period, how will it be viewed and considered? Again, the objectives and perspectives of the individuals and organizations assessing impact will be key to understanding how temporary and dissipated impact is valued in comparison with longer-term impact.
5.3 Attribution

Impact is derived not only from targeted research but also from serendipitous findings, good fortune, and complex networks interacting and translating knowledge and research. The exploitation of research to produce impact occurs through a complex variety of processes, individuals, and organizations; attributing the contribution made by a specific individual, piece of research, funding stream, strategy, or organization to a given impact is therefore not straightforward. Husbands-Fealing suggests that, to assist the identification of causality for impact assessment, it is useful to develop a theoretical framework mapping the actors, activities, linkages, outputs, and impacts within the system under evaluation, showing how later phases result from earlier ones. Such a framework should be not linear but recursive, including elements from contextual environments that influence and/or interact with various aspects of the system. Impact is often the culmination of work spanning research communities (Duryea et al. 2007). Concerns over how to attribute impacts have been raised many times (The Allen Consulting Group 2005; Duryea et al. 2007; Grant et al. 2009), and differentiating between the various major and minor contributions that lead to impact remains a significant challenge.
Figure 1, replicated from Hughes and Martin (2012), illustrates how the ease with which impact can be attributed decreases over time, whereas the impact itself, and the effect of complementary assets, increases. This highlights the problem that it may take a considerable amount of time for the full impact of a piece of research to develop but that, because of this time and the increasing complexity of the networks involved in translating the research and its interim impacts, it becomes more difficult to attribute and link that impact back to a contributing piece of research.
Time, attribution, impact. Replicated from Hughes and Martin (2012).
This presents particular difficulties for disciplines conducting basic research, such as pure mathematics, where the impact of research is unlikely to be foreseen. Research findings will be taken up in other branches of research and developed further before socio-economic impact occurs, by which point attribution becomes a huge challenge. If such research is to be assessed alongside more applied research, it is important that we are at least able to determine the contribution of basic research. It has long been acknowledged that outstanding leaps forward in knowledge and understanding emerge from immersion in a background of intellectual thinking: one is able to see further 'by standing on the shoulders of giants'.
5.4 Knowledge creep
It is acknowledged that one of the outcomes of developing new knowledge through research can be 'knowledge creep', where new data or information becomes accepted and is absorbed over time. This is particularly recognized in the development of new government policy, where findings can influence policy debate and policy change without recognition of the contributing research (Davies et al. 2005; Wooding et al. 2007), and is seen as especially problematic within the social sciences, where informing policy is a likely impact of research. In putting together evidence for the REF, impact can be attributed to a specific piece of research if it made a 'distinctive contribution' (REF2014 2011a). The difficulty then is how to determine what that contribution has been in the absence of adequate evidence, and how to ensure that research resulting in impacts that cannot be evidenced is still valued and supported.
5.5 Gathering evidence
Gathering evidence of the links between research and impact is a challenge even where that evidence exists. The introduction of impact assessments with the requirement to collate evidence retrospectively poses difficulties because evidence, measurements, and baselines have, in many cases, not been collected and may no longer be available. Looking forward, we will be able to reduce this problem; nevertheless, identifying, capturing, and storing evidence in such a way that it can be used in the decades to come is a difficulty that we will still need to tackle.
Collating the evidence and indicators of impact is a significant task being undertaken within universities and institutions globally. Decker et al. (2007) surveyed researchers at top US research institutions during 2005; the survey of more than 6,000 researchers found that, on average, more than 40% of their time was spent on administrative tasks. It is desirable that the administrative burden placed on researchers be limited, and therefore, to assist the tracking and collating of impact data, systems are being developed through numerous projects internationally, including STAR Metrics in the USA, the ERC (European Research Council) Research Information System, and Lattes in Brazil (Lane 2010; Mugabushaka and Papazoglou 2012).
Ideally, systems within universities internationally would be able to share data, allowing direct comparisons, accurate storage of information developed in collaborations, and the transfer of comparable data as researchers move between institutions. To achieve compatible systems, a shared language is required. CERIF (Common European Research Information Format) was developed for this purpose and first released in 1991; a number of projects and systems across Europe, such as the ERC Research Information System (Mugabushaka and Papazoglou 2012), are being developed to be CERIF-compatible.
In the UK, there have been several Jisc-funded projects in recent years to develop systems capable of storing research information, for example, MICE (Measuring Impacts Under CERIF), the UK Research Information Shared Service, and the Integrated Research Input and Output System, all based on the CERIF standard. To allow comparisons between institutions, identifying a comprehensive taxonomy of impact, and of the evidence for it, that can be used universally would be very valuable. However, the Achilles heel of any such attempt, as critics suggest, is the creation of a system that rewards what it can measure and codify, with the knock-on effect of directing research projects to deliver within the measures and categories that are rewarded.
Attempts have been made to categorize impact evidence and data. For example, the aim of the MICE project was to develop a set of impact indicators enabling impact to be fed into a CERIF-based system. Indicators were identified from documents produced for the REF, by Research Councils UK, in unpublished draft case studies undertaken at King's College London, or outlined in relevant publications (MICE Project n.d.). A taxonomy of impact categories was then produced onto which impact could be mapped. What emerged on testing the MICE taxonomy (Cooke and Nadim 2011), by mapping impacts from case studies, was that detailed categorization of impact was too prescriptive: every piece of research results in a unique tapestry of impact, and despite the MICE taxonomy having more than 100 indicators, these did not suffice. It is perhaps worth noting that the expert panels who assessed the pilot exercise for the REF commented that the evidence provided by research institutes to demonstrate impact was 'a unique collection'. Where quantitative data were available, for example, audience numbers or book sales, these numbers rarely reflected the degree of impact, as no context or baseline was available. Cooke and Nadim (2011) also noted that using a linear-style taxonomy did not reflect the complex networks of impacts that are generally found. The Goldsmith report (Cooke and Nadim 2011) recommended making indicators 'value free', enabling the value or quality to be established in an impact descriptor that could be assessed by expert panels. It concluded that general categories of evidence would be more useful, such that indicators could encompass dissemination and circulation; re-use and influence; collaboration and boundary work; and innovation and invention.
While defining the terminology used to describe impact and its indicators will enable comparable data to be stored and shared between organizations, we would recommend that any categorization of impacts be flexible enough that impacts arising via non-standard routes can still be placed. It is worth considering the degree to which indicators are prescriptively defined, and favouring broader definitions that provide greater flexibility.
It is possible to incorporate both metrics and narratives within systems, for example, within the Research Outcomes System and Researchfish, currently used by several of the UK research councils to record impacts. Although recording narratives has the advantage of allowing some context to be documented, it may make the evidence less flexible for use by different stakeholder groups (including government, funding bodies, research assessment agencies, research providers, and user communities), for whom the purpose of analysis may vary (Davies et al. 2005). Any tool for impact evaluation needs to be flexible, such that it enables access to impact data for a variety of purposes (Scoble et al. n.d.). Systems need to capture links between, and evidence of, the full pathway from research to impact, including knowledge exchange, outputs, outcomes, and interim impacts, so that the route to impact can be traced. This database of evidence needs to establish both where impact can be directly attributed to a piece of research and the various contributions to impact made along the pathway.
Baselines and controls need to be captured alongside change to demonstrate the degree of impact. In many instances, controls are not feasible, as we cannot observe what would have occurred had a piece of research not taken place; however, indications of the picture before and after impact are valuable and worth collecting for impact that can be predicted.
It is now possible to use data-mining tools to extract specific data from narratives or unstructured data sources (Mugabushaka and Papazoglou 2012). This is being done for the collation of academic impact and outputs, for example, by Research Portfolio Online Reporting Tools, which uses PubMed and text mining to cluster research projects, and by STAR Metrics in the USA, which uses administrative records and research outputs and is also being implemented by the ERC using data in the public domain (Mugabushaka and Papazoglou 2012). These techniques have the potential to transform data capture and impact assessment (Jones and Grant 2013), although Mugabushaka and Papazoglou (2012) acknowledge that it will take years to fully incorporate the impacts of ERC funding. For systems to be able to capture the full range of impacts, definitions and categories of impact need to be determined that can be incorporated into system development. Tools that adequately capture the interactions taking place between researchers, institutions, and stakeholders would also be very valuable. If knowledge exchange events could be captured electronically as they occur, for example, automatically when flagged from an electronic calendar or diary, then far more of these events could be recorded with relative ease, greatly assisting the linking of research with impact.
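The clustering step performed by text-mining tools of the kind mentioned above can be illustrated with a deliberately simple sketch. The stop-word list, similarity threshold, greedy strategy, and sample abstracts below are illustrative assumptions only; production systems such as STAR Metrics use far richer natural-language pipelines.

```python
# Minimal sketch: grouping project abstracts by shared vocabulary.
# Illustrative only; real research-information systems use full NLP.
import re

STOP = {"the", "of", "and", "in", "to", "for", "a", "on", "with"}

def keywords(text):
    """Lower-case word set for an abstract, minus common stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP}

def jaccard(a, b):
    """Overlap between two keyword sets (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(abstracts, threshold=0.2):
    """Greedy single pass: join an abstract to the first cluster whose
    seed shares enough keywords, otherwise start a new cluster."""
    clusters = []  # list of (seed_keywords, member_indices)
    for i, text in enumerate(abstracts):
        kw = keywords(text)
        for seed, members in clusters:
            if jaccard(kw, seed) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((kw, [i]))
    return [members for _, members in clusters]

abstracts = [
    "gene expression in cancer cells",
    "cancer cell gene sequencing studies",
    "medieval manuscript digitisation",
]
print(cluster(abstracts))  # the two biomedical projects group together
```

Even this toy version shows the principle: projects sharing vocabulary fall into the same group, giving evaluators a first rough map of related research on which attribution work can build.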
The transition to routine capture of impact data requires not only the development of tools and systems to help with implementation but also a cultural change, so that practices currently undertaken by a few become standard behaviour among researchers and universities.
What indicators, evidence, and impacts need to be captured within these developing systems? There is a great deal of interest in collating terms for impact and indicators of impact. Consortia for Advancing Standards in Research Administration Information, for example, has put together a data dictionary with the aim of setting standards for the terminology used to describe impact and its indicators, which can be incorporated into systems internationally, and seems to be building momentum in this area. A variety of types of indicators can be captured within systems; however, it is important that these are universally understood. Here we address the types of evidence that need to be captured to enable an overview of impact to be developed. In the majority of cases, several types of evidence will be required.
7.1 Metrics

Metrics have commonly been used as a measure of impact, for example, profit made, number of jobs provided, number of trained personnel recruited, number of visitors to an exhibition, or number of items purchased. Metrics in themselves cannot convey full impact; however, they are often viewed as powerful and unequivocal forms of evidence. Where metrics are available as impact evidence, they should, where possible, be accompanied by any baseline or control data, as information on the context of the data is valuable for understanding the degree to which impact has taken place.
The use of a metric such as social return on investment (SROI) perhaps indicates the desire of some organizations to demonstrate the monetary value of investment and impact. SROI aims to provide a valuation of the broader social, environmental, and economic impacts, providing a metric that can be used to demonstrate worth. It has been used within the charitable sector (Berg and Månsson 2011) and also features as evidence in the REF guidance for panel D (REF2014 2012). More details on SROI can be found in 'A Guide to Social Return on Investment' produced by The SROI Network (2012).
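As a rough illustration of how an SROI ratio is typically expressed, the sketch below discounts a stream of valued benefits and divides by the investment. The figures, discount rate, and single benefit stream are hypothetical simplifications; a full SROI analysis, as described in the SROI Network guide, also involves stakeholder mapping, valuation of outcomes, and adjustments for deadweight and attribution.

```python
# Illustrative SROI ratio: value created per unit of currency invested.
# All numbers are hypothetical; this omits most of a real SROI analysis.

def present_value(annual_values, discount_rate):
    """Discount a list of annual benefit valuations back to year zero."""
    return sum(v / (1 + discount_rate) ** (year + 1)
               for year, v in enumerate(annual_values))

def sroi_ratio(annual_values, investment, discount_rate=0.035):
    """Discounted total benefit divided by the investment made."""
    return present_value(annual_values, discount_rate) / investment

# A project costing 100,000 whose valued social, environmental, and
# economic benefits are estimated at 40,000 per year for four years:
ratio = sroi_ratio([40000, 40000, 40000, 40000], 100000)
print(round(ratio, 2))
```

A ratio above 1 indicates that the valuation of benefits exceeds the investment; the sensitivity of the result to the chosen discount rate and valuations is one reason SROI figures need careful contextual interpretation.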
Although metrics can provide evidence of quantitative changes or impacts arising from research, they cannot adequately evidence the qualitative impacts that take place and hence are not suitable for all of the impact we will encounter. The main risks associated with the use of standardized metrics are that:
- the full impact will not be captured, as attention focuses on easily quantifiable indicators;
- effort is directed towards generating results that enable boxes to be ticked, rather than delivering real value for money and innovative research;
- they risk being monetized or converted into a lowest common denominator in an attempt to compare, say, the cost of a new theatre against that of a hospital.
7.2 Narratives

Narratives can be used to describe impact; they enable a story to be told, place the impact in context, and make good use of qualitative information. They are often written with a reader from a particular stakeholder group in mind and present a view of impact from a particular perspective. The risk of relying on narratives to assess impact is that they often lack the evidence required to judge whether the research and the impact are appropriately linked. Where narratives are used in conjunction with metrics, a more complete picture of impact can be developed, still from a particular perspective but with the evidence available to corroborate the claims made. Table 1 summarizes some of the advantages and disadvantages of the case study approach.
The advantages and disadvantages of the case study approach
By allowing impact to be placed in context, case studies answer the 'so what?' question that can remain after quantitative data analyses. But is there a risk that the full picture will not be presented, in order to show impact in a positive light? Case studies are ideal for showcasing impact, but should they be used to critically evaluate it?
7.3 Surveys and testimonies
One way in which changes in opinion and user perceptions can be evidenced is by gathering stakeholder and user testimonies or by undertaking surveys. These might describe support for and development of research with end users, public engagement and evidence of knowledge exchange, or a change in public opinion as a result of research. Collecting this type of evidence is time-consuming, and again, it can be difficult to gather retrospectively when, for example, the appropriate user group has dispersed.
The ability to record and log these types of data is important for establishing the path from research to impact, and the development of systems that can capture them would be very valuable.
7.4 Citations (outside of academia) and documentation
Citations outside of academia, and documentation, can be used as evidence to demonstrate the use of research findings in developing new ideas and products, for example. This might include the citation of a piece of research in policy documents or references to it within the media. A collation of several indicators of impact may together be enough to establish that an impact has taken place; even where we can evidence changes and benefits linked to our research, however, understanding the causal relationship may be difficult. Media coverage is a useful means of disseminating research and ideas and may be considered, alongside other evidence, as contributing to or indicating impact.
The fast-moving developments in the field of altmetrics (or alternative metrics) are providing a richer understanding of how research is being used, viewed, and moved. The transfer of information electronically can be traced and reviewed to provide data on where and to whom research findings are going.
The understanding of the term impact varies considerably and as such the objectives of an impact assessment need to be thoroughly understood before evidence is collated.
While aspects of impact can be adequately interpreted using metrics, narratives, and other evidence, the mixed-method case study approach is an excellent means of pulling all available information, data, and evidence together, allowing a comprehensive summary of the impact within context. While the case study is a useful way of showcasing impact, its limitations must be understood if it is to be used for evaluation purposes. A case study presents evidence from a particular perspective and may need to be adapted for use with different stakeholders. Case studies are time-intensive to both assemble and review, and we therefore need to ensure that the resources required for this type of evaluation are justified by the knowledge gained. The ability to write a persuasive, well-evidenced case study may itself influence the assessment of impact: over the past year, a number of new posts dedicated to writing impact case studies have been created within universities, and a number of companies now offer this as a contract service. A key concern here is that universities which can afford to employ consultants or impact 'administrators' may generate the best case studies.
The development of tools and systems for assisting with impact evaluation would be very valuable. We suggest that systems focusing on recording impact information alone will not provide all that is required to link research to ensuing events and impacts; systems require the capacity to capture the interactions between researchers, the institution, and external stakeholders, and to link these with research findings, outputs, and interim impacts to provide a network of data. In designing systems and tools for collating impact-related data, it is important to consider who will populate the database and to ensure that the time and capability required to capture the information are taken into account. Capturing data, interactions, and indicators as they emerge increases the chance of recording all relevant information, and tools enabling researchers to capture much of this themselves would be valuable. It must be remembered, however, that in the case of the UK REF, only impact based on research that took place within the institution submitting the case study is considered. It is therefore in an institution's interest to have a process by which all the necessary information is captured, so that a story can be developed even in the absence of a researcher who has left the institution's employment. Figure 2 illustrates the information that systems will need to capture and link:
- Research findings, including outputs (e.g., presentations and publications)
- Communications and interactions with stakeholders and the wider public (emails, visits, workshops, media publicity, etc.)
- Feedback from stakeholders and communication summaries (e.g., testimonials and altmetrics)
- Research developments (based on stakeholder input and discussions)
- Outcomes (e.g., commercial and cultural outcomes, citations)
- Impacts (changes, e.g., behavioural and economic)
Overview of the types of information that systems need to capture and link.
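As a hypothetical sketch of a minimal data model for the record types listed above, the following links each record back to the earlier records it builds on, so that an impact can be traced to its underpinning research. The field names and record kinds are illustrative, not a standard; CERIF defines a far fuller schema for real systems.

```python
# Hypothetical minimal data model for linking outputs, interactions,
# outcomes, and impacts back to underpinning research. Illustrative
# only; field names and kinds are assumptions, not a CERIF schema.
from dataclasses import dataclass, field

@dataclass
class Record:
    id: str
    kind: str          # e.g. "output", "interaction", "outcome", "impact"
    description: str
    links: list = field(default_factory=list)  # ids of earlier records

def trace_to_research(records, impact_id):
    """Walk links backwards from an impact to every contributing record."""
    by_id = {r.id: r for r in records}
    seen, stack = set(), [impact_id]
    while stack:
        r = by_id[stack.pop()]
        if r.id not in seen:
            seen.add(r.id)
            stack.extend(r.links)
    return seen

records = [
    Record("paper1", "output", "journal article"),
    Record("ws1", "interaction", "stakeholder workshop", ["paper1"]),
    Record("policy1", "impact", "policy change", ["ws1", "paper1"]),
]
print(sorted(trace_to_research(records, "policy1")))
```

Storing the links as records are created, rather than reconstructing them retrospectively, is what would allow an institution to assemble a case study even after the originating researcher has left.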
Attempting to evaluate impact in order to justify expenditure, showcase our work, and inform future funding decisions will only prove a valuable use of time and resources if we can ensure that assessment attempts do not ultimately have a negative influence on the impact of our research. There are areas of basic research where the impacts are so far removed from the research itself that they are impractical to demonstrate; in these cases, it might be prudent to accept the limitations of impact assessment and provide the potential for exclusion in appropriate circumstances.
This work was supported by Jisc [DIINN10].
- Online ISSN 1471-5449
- Print ISSN 0958-2029
- Copyright © 2023 Oxford University Press
Performance Assessment Research Paper
This sample education research paper on performance assessment features 6,100 words (approx. 20 pages) and a bibliography with 38 sources.
This research paper begins with an introduction to performance assessments. Performance assessments mirror the performance that is of interest, require students to construct or perform an original response, and use predetermined criteria to evaluate students’ work. The different uses of performance assessments will then be discussed, including the use of performance assessments in large-scale testing as a vehicle for educational reform and for making important decisions about individual students, schools, or systems and the use of performance assessments by classroom teachers as an instructional tool. Following this, there is a discussion on the nature of performance assessments as well as topics related to the design of performance assessments and associated scoring methods. The research paper ends with a discussion on how to ensure the appropriateness and validity of the inferences we draw from performance assessment results.
Introduction
Educational reform in the 1980s was based on research suggesting that too many students knew how to repeat facts and concepts, but were unable to apply those facts and concepts to solve meaningful problems. Because assessment plays an integral role in instruction, it was not only instruction that was the target of change but also assessment. Proponents of the educational reform argued that assessments needed to better reflect students’ competencies in applying their knowledge and skills to solve real tasks. Advances in the 1980s in the study of both student cognition and measurement also prompted individuals to think differently about how students process and reason with information and, as a result, how assessments can be designed to capture meaningful aspects of students’ thinking and learning. Additionally, advocates of curriculum reform considered performance assessments a valuable tool for educational reform in that they were considered to be useful vehicles to initiate changes in instruction and student learning. It was assumed that if large-scale assessments incorporated performance assessments it would signal important goals for educators and students to pursue.
Performance assessments are well-suited to measuring students’ problem-solving and reasoning skills and the ability to apply knowledge to solve meaningful problems. Performance assessments are intended to “emulate the context or conditions in which the intended knowledge or skills are actually applied” (American Educational Research Association, [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 1999, p. 137). The unique characteristic of a performance assessment is the close similarity between the performance on the assessment and the performance that is of interest (Kane, Crooks, & Cohen, 1999). Consequently, performance assessments provide more direct measures of student achievement and learning than multiple-choice tests (Frederiksen & Collins, 1989). Direct assessments of writing that require students to write persuasive letters to the local newspaper or the school board provide instances of the tasks we would like students to perform. Most performance assessments require a student to perform an activity such as conducting a laboratory experiment or constructing an original report based on the experiment. In the former, the process of solving the task is of interest, and in the latter, the product is of interest.
Typically, performance assessments assess higher-level thinking and problem-solving skills, afford multiple solutions or strategies, and require the application of knowledge or skills in relatively novel real-life situations or contexts (Baron, 1991; Stiggins, 1987). Performance assessments like conducting laboratory experiments, performing musical and dance routines, writing an informational article, and providing explanations for mathematical solutions may also provide opportunities for students to self-reflect, collaborate with peers, and have a choice in the task they are to complete (Baker, O’Neil, & Linn, 1993; Baron, 1991). Providing opportunities for self-reflection, such as asking students to explain in writing the thinking process they used to solve a task, allows students to evaluate their own thinking. Some believe that choice allows examinees to select a task that has a context they are familiar with, which may lead to better performance. Others argue that choice introduces an irrelevant feature into the assessment because choice not only measures a student’s proficiency in a given subject area but also their ability to make a smart choice (Wainer & Thissen, 1994).
Performance assessments require carefully designed scoring procedures that are tailored to the nature of the task and the skills and knowledge being assessed. The scoring procedure for evaluating the performance of students when conducting an experiment would necessarily differ from the scoring procedure for evaluating the quality of a report based on the experiment. Scoring procedures for performance assessments require some judgment because the student’s response is evaluated using predetermined criteria.
Uses of Performance Assessments
Performance Assessments for Use in Large-Scale Assessment and Accountability Systems
The use of performance assessments in large-scale assessment programs, such as state assessments, has been a valuable tool for standards-based educational reform (Resnick & Resnick, 1992). A few appealing assumptions underlie the use of performance assessments: (a) they serve as motivators in improving student achievement and learning, (b) they allow for better connections between assessment practices and curriculum, and (c) they encourage teachers to use instructional strategies that promote reasoning, problem solving, and communication (Frederiksen & Collins, 1989; Shepard, 2000). It is important, however, to ensure that these perceived benefits of performance assessments are realized.
Performance assessments are used for monitoring students’ progress toward meeting state and local content standards, promoting standards-based reform, and holding schools accountable for student learning (Linn, Baker, & Dunbar, 1991). Most state assessment and accountability programs include some performance assessments, although in recent years there has been a steady decline in the use of performance assessments in state assessments due, in part, to limited resources and the amount of testing that had to be implemented in a short period of time under the No Child Left Behind (NCLB) Act of 2001. As of the 2005–06 school year, NCLB requires states to test all students in reading and mathematics annually in Grades 3 through 8 and at least once in high school. By 2007–08, states must assess students in science annually in one grade in elementary, middle, and high school. While the intent of NCLB is admirable—to provide better, more demanding instruction to all students with challenging content standards, to provide the same educational opportunities to all students, and to strive for all students to reach the same levels of achievement—the burden put on the assessment system has resulted in less of an emphasis on performance assessments. This is partly due to the amount of time and resources it takes to develop performance tasks and scoring procedures, and the time it takes to administer and score performance assessments.
The most common state performance assessment is a writing assessment, and in some states performance tasks have been used in reading, mathematics, and science assessments. Typically, performance tasks are used in conjunction with multiple-choice items to ensure that the assessment represents the content domain and that the assessment results allow for inferences about student performance within the broader content domain. An exception was the Maryland School Performance Assessment Program (MSPAP), which was entirely performance-based and was implemented from 1993 to 2002. On MSPAP, students developed written responses to interdisciplinary performance tasks that required the application of skills and knowledge to real-life problems (Maryland State Board of Education, 1995). Students worked collaboratively on some of the tasks, and then submitted their own written responses to the tasks. MSPAP provided school-level scores, not individual student scores. Students received only a small sample of performance tasks, but all of the tasks were administered within a school. This allowed the school score to be based on a representative set of tasks, allowing one to make inferences about the performance of the school within the broader content domain. The goal of MSPAP was to promote performance-based instruction and classroom assessment. The National Assessment of Educational Progress (NAEP) uses performance tasks in conjunction with multiple-choice items. As an example, NAEP’s mathematics assessment is composed of multiple-choice items as well as constructed-response items that require students to explain their mathematical thinking.
Many of the uses of large-scale assessments require reporting scores that are comparable over time, which requires standardization of the content, administration, and scoring of the assessment. Features of performance assessments, such as extended time periods, collaborative work, and choice of task, pose challenges to their standardization and administration.
Performance Assessments for Classroom Use
Educators have argued that classroom instruction should, as often as possible, engage students in learning activities that promote the attainment of important and needed skills (Shepard et al., 2005). If the goal of instruction is to help students reason with and use scientific knowledge, then students should have the opportunity to conduct experiments and use scientific equipment so that they can explain how the process and outcomes of their investigations relate to theories they learn from textbooks (Shepard et al., 2005). The use of meaningful learning activities in the classroom requires that assessments be adapted to align to these instructional techniques. Further, students learn more when they receive feedback about particular qualities of their work, and are then able to improve their own learning (Black & Wiliam, 1998). Performance assessments allow for a direct alignment between important instructional and assessment activities, and for providing meaningful feedback to students regarding particular aspects of their work (Lane & Stone, 2006). Classroom performance assessments have the potential to better simulate the desired criterion performance as compared to large-scale assessments because the time students can spend on performance assessments for classroom use may range from several minutes to sustained work over a number of days or weeks.
Classroom performance assessments provide information about students’ knowledge and skills to guide instruction, provide feedback to students so as to monitor their learning, and can be used to evaluate instruction. In addition to eliciting complex thinking skills, classroom performance assessments can assess processes and products that are important across subject areas, such as those involved in writing a position paper on an environmental issue (Wiggins, 1989). The assessment of performances across disciplines allows teachers of different subjects to collaborate on projects that ask students to make connections across content. Students who are asked to use scientific procedures to investigate the effects of toxicity levels on fish populations in different rivers may also be asked to use their social studies knowledge to describe what these levels mean for different communities whose economy is dependent on the fishing industry (Taylor & Nolen, 2005). Teachers may ask students to write an article for the local newspaper informing the community of the effects of toxicity levels in rivers. The realistic nature of these connections and tasks themselves may help motivate students to learn.
Similarity between simulated tasks and real-world performances is not the only characteristic that makes performance assessments motivational for students. Having the freedom to choose how they will approach certain problems and engage in certain performances allows students to use their strengths while examining an assignment from a variety of standpoints. For example, students required to write a persuasive essay may be allowed to choose the topic they wish to address. Furthermore, if one of the tasks is to gather evidence that supports one’s claims, students could choose the means by which they obtain that information. Some students may feel more comfortable researching on the Internet, while others may choose to interview individuals whose comments support their position (Taylor & Nolen, 2005). Allowing choice ensures that most students will perceive at least some freedom and control over their own work, which is inherently motivational. When designing performance assessments, however, teachers need to ensure that the choices they provide will allow for a fair assessment of all students’ work (Taylor & Nolen, 2005). Tasks that students choose should allow them to demonstrate what they know and can do. Providing a clear explanation of each task, its requirements or directions, and the criteria by which students will be evaluated will help ensure a fair assessment of all students. Students’ understanding of the criteria and what constitutes successful performance will provide an equitable opportunity for students to demonstrate their knowledge. Providing appropriate feedback at different points in the assessment will also help ensure a fair assessment.
Although performance assessments can be beneficial when incorporated into classroom instruction, they are not always necessary. If a teacher is interested in assessing a student’s ability to recall factual information, a different form of assessment would be more appropriate. Before designing any assessment, teachers need to ask themselves two questions: What do I want my students to know and be able to do, and what is the most appropriate way for them to demonstrate what they know and can do?
Nature of Performance Tasks
Performance tasks may assess the process used by students to solve the task or a product, such as a sculpture. They may involve the use of hands-on activities, such as building a model or using scientific equipment, or they may require students to produce an original response to a constructed-response item, write an essay, or write a position paper. Constructed-response items and essays provide students with the opportunity to “give explanations, articulate reasoning, and express their own approaches toward solving a problem” (Nitko, 2004, p. 240). Constructed-response items can be used to assess the process that students employ to solve a problem by requiring them to explain how they solved it.
The National Assessment of Educational Progress (NAEP) includes hands-on performance tasks in its science assessment. These tasks require students to conduct experiments, and to record their observations and conclusions by answering constructed-response and multiple-choice items. As an example, a publicly released eighth-grade task provides students with a pencil with a thumbtack in the eraser, a red marker, paper towels, a plastic bowl, a graduated cylinder, and bottles of fresh water, salt water, and mystery water. Students are asked to perform a series of investigations into the properties of fresh water and salt water, and to determine whether a bottle of mystery water is fresh water or salt water. After each step, students respond to questions regarding the investigation. Some of the constructed-response items require students to provide their reasoning and to explain their answers (U.S. Department of Education, 1996).
NAEP also includes constructed-response items in its mathematics assessment. An eighth-grade sample NAEP constructed-response item requires students to use a ruler to determine how many boxes of tiles are needed to cover a diagram of a floor. The diagram of the floor is a 3¾- by 5½-inch rectangle drawn to a scale of 1 inch equals 4 feet. The instructions are:
The floor of a room shown in the figure above is to be covered with tiles. One box of floor tiles will cover 25 square feet. Use your ruler to determine how many whole boxes of these tiles must be bought to cover the entire floor. (U.S. Department of Education, 1997)
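The arithmetic behind this item can be sketched in a few lines. The measurements below (a drawing measured at 3.75 by 5.5 inches) are illustrative assumptions rather than the actual NAEP figure; the point is the scale conversion and the rounding up to whole boxes.

```python
import math

def boxes_needed(width_in, length_in, scale_ft_per_in=4, sq_ft_per_box=25):
    """Scale ruler measurements of a drawing up to real floor dimensions,
    compute the floor area, and round up to whole boxes of tiles."""
    width_ft = width_in * scale_ft_per_in
    length_ft = length_in * scale_ft_per_in
    area_sq_ft = width_ft * length_ft
    return math.ceil(area_sq_ft / sq_ft_per_box)

# Illustrative measurements: 3.75 in x 5.5 in -> 15 ft x 22 ft = 330 sq ft,
# and 330 / 25 = 13.2, so 14 whole boxes must be bought.
print(boxes_needed(3.75, 5.5))  # 14
```

Note the direction of rounding: a correct response requires the ceiling rather than the nearest integer, because partial boxes cannot be bought.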
Performance tasks may also require students to write an essay, story, poem, or other piece of writing. On the Maryland School Performance Assessment Program (MSPAP), the assessment of writing would have taken place over a number of days. First the student would engage in some prewriting activity; then the student would write a draft; and finally the student would revise the draft, perhaps considering feedback from a peer. Consider this example of a 1996 Grade 8 writing prompt from MSPAP:
Suppose your class has been asked to create a literary magazine to be placed in the library for your schoolmates to read. You are to write a story, play, poem, or any other piece of creative writing about any topic you choose. Your writing will be published in the literary magazine. (Maryland State Department of Education, 1996)
This task allowed students to choose both the form and topic of their writing. Only the revised writing was evaluated, using a scoring rubric.
Design and Scoring of Performance Assessments
The design of classroom performance assessments follows the same guidelines as the design of performance assessments for large-scale purposes (see, for example, Lane & Stone, 2006). Because classroom performance assessments carry lower stakes, however, they do not need to meet the same level of standardization and reliability as large-scale performance assessments. In large-scale assessments, performance is examined over time, so the content of the assessment and the procedures for administration and scoring need to be the same across students and time—they need to be standardized. This ensures that the interpretations of the results are appropriate and fair.
Design of Performance Assessments
Performance assessment design begins with a description of the purpose of the assessment; the construct or content domain (i.e., skills, knowledge, and their applications) to be measured; and the intended inferences to be drawn from the results. Delineating the domain to be measured ensures that the performance tasks and scoring procedures are aligned appropriately with the content standards. Whenever possible, assessments should be grounded in theories and models of student cognition and learning. Researchers have suggested that the assessment of students’ understanding of matter and atomic-molecular theory can draw on research on how students learn and develop understandings of the nature of matter and materials, how matter and materials change, and the atomic structure of matter (National Research Council, 2006). An assessment within this domain may be able to reflect students’ learning progressions—the path that students may follow from a novice understanding to a sophisticated understanding of atomic-molecular theory. Assessments that are designed to capture the developmental nature of student learning can provide valuable information about a student’s level of understanding.
To define the domain to be assessed, test specifications are developed that provide information regarding the assessment. For large-scale assessments, test specifications provide detailed information regarding the content and cognitive processes, format of tasks, time requirements for each task, materials needed to complete each task, scoring procedures, and desired statistical characteristics of the tasks (AERA, APA, & NCME, 1999). Explicit specifications of the content of the tasks are essential in designing performance assessments because such assessments include fewer tasks, and each task tends to be more distinctive than a multiple-choice item. Test specifications help ensure that the tasks being developed are representative of the intended domain. Specifications for classroom assessments may not be as comprehensive as those for large-scale assessments; however, they are a valuable tool for classroom teachers in aligning their assessments with instruction. There are excellent sources of step-by-step guidelines for designing performance tasks and scoring rubrics, including Nitko (2004), Taylor and Nolen (2005), and Welch (2006).
Scoring Procedures for Performance Assessments
The evaluation of student performance requires scoring procedures based on expert judgment. Clearly defined scoring procedures are essential for developing performance assessments from which valid interpretations can be made. The first step is to delineate the performance criteria, which reflect the important characteristics of successful performance. At the classroom level, teachers may use performance criteria developed by the state or develop their own. Defining the performance criteria deserves careful consideration, and informing students of the criteria helps ensure a fair evaluation of student performance.
For classroom performance assessments, scoring rubrics, rating scales, or checklists are used to evaluate students. Scoring rubrics are rating scales that consist of pre-established performance criteria at each score level. The criteria specified at each score level are linked to the construct being assessed and depend on a number of factors, including whether a product or process is being assessed, the demands of the tasks, the nature of the student population, and the purpose of the assessment and the intended score interpretations (Lane & Stone, 2006). The number of score levels used depends on the extent to which the criteria across the score levels can meaningfully distinguish among student work. The performance reflected at each score level should differ distinctly from the performance at other score levels.
Designing scoring rubrics is an iterative process. A general scoring rubric that serves as a conceptual outline of the skills and knowledge underlying the given performance may be developed first, with performance criteria specified for each score level. The general rubric can then be used to guide the design of each task-specific rubric. The performance criteria for the specific rubrics reflect the criteria of the general rubric but also include criteria representing unique aspects of the individual task. The assessment of a particular genre of writing (e.g., persuasive writing) typically has only a general rubric because the performance criteria at each score level are very similar across writing prompts within a genre. In mathematics, science, or history, a performance assessment may have a general scoring rubric in addition to specific rubrics for each task. Specific rubrics help ensure accuracy in applying the criteria to student work, and facilitate generalizing from performance on the assessment to the larger content domain of interest. Figure 1 shows a general and a specific rubric for a history performance task that assesses declarative knowledge. The specific rubric was designed for item 3 in the following performance task:
President Harry S. Truman has requested that you serve on a White House task force. The goal is to decide how to force the unconditional surrender of Japan, yet provide a secure postwar world. You are now a member of a committee of four and have reached the point at which you are trying to decide whether to drop the bomb.
- Identify the alternatives you are considering and the criteria you are using to make the decision.
- Explain the values that influence the selection of the criteria and the weights you placed on each.
- Explain how your decision has helped you better understand this statement: War forces people to confront inherent conflicts of values. (Marzano, Pickering, & McTighe, 1993, p. 28, note that numbers 1, 2, and 3 are added)
Figure 1 General and Specific Scoring Rubric for a History Standard
The specific rubric in Figure 1 could be expanded to provide examples of student misconceptions that are unique to question 3.
In addition to the distinction between general and specific rubrics, scoring rubrics may be either holistic or analytic. A holistic rubric requires the evaluation of the process or product as a whole. In other words, raters make a single, holistic judgment regarding the quality of the work, and they assign a single score using a rubric rather than evaluating the component parts separately. Figure 2 provides a holistic general scoring rubric for a mathematics performance assessment.
Analytic rubrics require the evaluation of component parts of the product or performance, and separate scores are assigned to those parts. If students are required to write a persuasive essay, rather than evaluate all aspects for their work as a whole, students may be evaluated separately on a few components such as strength of argument, organization, and writing mechanics. Each of these components would have a scoring rubric with distinct performance criteria at each score level. Analytic rubrics have the potential to provide meaningful information about students’ strengths and weaknesses and the effectiveness of instruction on each of these components. Individual scores from an analytic rubric can also be summed to obtain a total score for the student.
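As a minimal sketch of how analytic scores combine, consider the following; the component names and the 0-4 score ranges are hypothetical, not drawn from any published rubric.

```python
# Hypothetical analytic rubric: each component is scored on its own
# 0-4 scale, with distinct performance criteria at each score level.
RUBRIC_COMPONENTS = {
    "strength_of_argument": range(0, 5),
    "organization": range(0, 5),
    "writing_mechanics": range(0, 5),
}

def total_score(component_scores):
    """Check each component score against its allowed range, then sum
    the component scores to form an overall score for the essay."""
    for component, score in component_scores.items():
        allowed = RUBRIC_COMPONENTS[component]
        if score not in allowed:
            raise ValueError(f"{component}: score {score} is outside the rubric range")
    return sum(component_scores.values())

essay = {"strength_of_argument": 3, "organization": 4, "writing_mechanics": 2}
print(total_score(essay))  # 9
```

The component-level scores, not just the total, are what make the analytic approach diagnostic: they locate a student's strengths and weaknesses before being collapsed into one number.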
Figure 2 Holistic General Scoring Rubric for Mathematics Constructed-Response Items Performance Criteria
Although rubrics are most commonly associated with scoring performance assessments, checklists and rating scales are sometimes used by classroom teachers. Checklists are lists of specific performance criteria expected to be present in a student’s performance (Taylor & Nolen, 2005). Checklists only allow the option of indicating whether each criterion was present or absent in the student’s work. They are best used to provide students with a quick and general indication of their strengths and weaknesses. Checklists tend to be easy to construct and simple to use; however, they are not appropriate for many performance assessments because no judgment is made as to the quality of the observation. If a teacher indicates that the criterion “grammar is correct” is present in a student’s response, there is no indication of the degree to which the student’s grammar was correct (see Taylor & Nolen, 2005).
Rating scales allow teachers to indicate the degree to which a student exhibits a particular characteristic or skill. A rating scale can be used if the teacher wants to differentiate the degree to which students use grammar correctly or incorrectly. Figure 3 provides an example of a rating scale for a persuasive essay.
Figure 3 Rating Scale for the First Draft of a Persuasive Essay Circle a rating for each characteristic that is present in the student’s essay.
In this rating scale, the rating labels are not the same across the performance criteria (e.g., clear, understandable, and unfocused versus strong, adequate, and weak). This may pose a problem if teachers are hoping to combine the various ratings into a single, meaningful score for students. For some tasks, rating scales can be developed that have the same rating labels across items, which would allow for combining the ratings to produce a single score. When using rating scales, teachers must make explicit distinctions between the rating labels, such as few errors and many errors. If these distinctions are not made, scoring will be inconsistent and, as a result, accurate inferences about students’ learning will be jeopardized. Teachers should strive to use scoring procedures that have explicitly defined performance criteria. Often, this is best achieved by using scoring rubrics to evaluate student performance.
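The caution about mismatched labels can be made concrete. In this hedged sketch (the label set and criteria names are hypothetical), ratings are summed only when every criterion uses one shared, ordered label set; labels from a different scale are rejected rather than guessed at.

```python
# One shared, ordered label set applied to every criterion (hypothetical labels).
COMMON_SCALE = {"weak": 1, "adequate": 2, "strong": 3}

def combined_score(ratings):
    """Map each criterion's label to a number on the common scale and sum;
    refuse labels that are not on the shared scale."""
    total = 0
    for criterion, label in ratings.items():
        if label not in COMMON_SCALE:
            raise ValueError(f"'{label}' ({criterion}) is not on the common scale")
        total += COMMON_SCALE[label]
    return total

print(combined_score({"focus": "strong", "support": "adequate"}))  # 5
```

A rating of "unfocused" from a different label set would raise an error here, mirroring the point above: without a common scale, no single combined score is meaningful.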
Evaluating the Validity of Performance Assessments
Assessments are used in conjunction with other information to make important inferences and decisions about students, schools, and districts; and it is important to obtain evidence about the appropriateness of those inferences and decisions. Validity refers to the appropriateness, meaningfulness, and usefulness of the inferences drawn from assessment results (AERA, APA, & NCME, 1999). Evidence is needed therefore to help support the interpretation and use of assessment results. Validity criteria that are specific to performance assessments have been proposed, including, but not limited to, content representation, cognitive complexity, meaningfulness and transparency, transfer and generalizability, fairness, and consequences (Linn et al., 1991; Messick, 1995). A brief discussion of these validity criteria is presented in this section.
Content Representation
Because performance assessments consist of a small number of tasks, the ability to generalize from the student’s performance on the assessment to the broader domain of interest may be hindered. It is important to consider carefully which tasks will compose a performance assessment. Test specifications will contribute to the development of tasks that systematically represent the content domain. For large-scale assessments that have high stakes associated with individual student scores, multiple-choice items are typically used in conjunction with performance assessments to better represent the content domain and to help ensure accurate inferences regarding student performance. Classroom assessments can be tailored to the instructional material, allowing for a wide variety of assessment formats to capture the breadth of the content.
Cognitive Complexity
Some performance tasks may not require complex thinking skills. If the performance of interest is whether students can use a ruler accurately, then a performance task aligned to this skill requires the use of a ruler for measuring and does not require students to engage in complex thinking. Many performance assessments, however, are intended to assess students’ reasoning and problem-solving skills. In fact, one of the promises of performance assessments is that they can assess these higher-order, complex skills. One cannot assume that complex thinking skills are being used by students when working on performance assessments; validity evidence is needed to establish that performance assessments do indeed evoke complex thinking skills (Linn et al., 1991). First, in assessment design it is important to consider how the task’s format, content, and context may affect the cognitive processes used by students to solve the task. Second, the underlying cognitive demands of performance assessments, such as inference, integration, and reasoning, can be verified. Analyses of the strategies and processes that students use in solving performance tasks can provide such evidence. Students may be asked to think aloud as they respond to the task, or they may be asked to explain, in writing, reasons for their responses to tasks. If the purpose of a mathematics task is to determine whether students can solve it with a nonroutine solution strategy, but most students apply a strategy that they recently learned in instruction, then the task is not eliciting the intended cognitive skills and should be revised. The cognitive demands of performance tasks need to be considered in their design and then verified.
Meaningfulness And Transparency
If performance assessments are to improve instruction and student learning, both teachers and students should consider the tasks meaningful and of value to the instructional and assessment process (Frederiksen & Collins, 1989). Performance assessments also need to be transparent to both teachers and students—each party needs to know what is being assessed, by what methods, the criteria used to evaluate performances, and what constitutes successful performance. For large-scale assessments that have high stakes associated with them, students need to be familiar with the task format and the performance criteria of the general scoring rubric. This will help ensure that all students have the opportunity to demonstrate what they know and can do. Throughout the instructional year, teachers can use performance tasks with their students, and engage them in discussions about what the tasks are assessing and the nature of the criteria used for scoring student performances. Teachers can also engage students in using scoring rubrics to evaluate their own work and the work of their peers. These activities will provide opportunities for students to become familiar with the nature of the tasks and the criteria used in evaluating their work. It is best to embed these activities within the instructional process rather than treat them as test preparation activities. Research has shown that students who have had the opportunity to become familiar with the format of the performance assessment and the nature of the scoring criteria perform better than those who were not provided with such opportunities (Fuchs et al., 2000).
Transfer And Generalizability
An assessment is only a sample of tasks that measure some portion of the content domain of interest. Results from the assessment are then used to make inferences about student performance within that domain. The intent is to generalize from student performance on a sample of tasks to the broader domain of interest. The better the sample represents the domain, the more accurate the generalizations of student performance on the assessment to the broader domain of interest. It is necessary, then, to consider which tasks and how many tasks are needed within an assessment to give us confidence that we are making accurate inferences about student performance. For a writing assessment, generalizations about student performance across the writing domain may be of interest. Consequently, the sample of writing prompts should elicit various types of writing, such as narrative, persuasive, and expository. A student skilled in narrative writing may not be as skilled in persuasive writing. To generalize performance on an assessment to the entire writing domain, students would need to write essays across various types of writing. This is more easily accomplished in classroom performance assessments because samples of students’ writing can be collected over a period of weeks.
An investigation of the generalizability of a performance assessment also requires examining the extent to which one can generalize the results across trained raters. Raters affect the extent to which accurate generalizations can be made about the assessment results because raters may differ in their appraisal of the quality of a student’s response. Generalizability results across trained raters are typically much better than generalizability results across tasks in science, mathematics, and writing (e.g., Hieronymus & Hoover, 1987; Lane, Liu, Ankenmann, & Stone, 1996; Shavelson, Baxter, & Gao, 1993). To help achieve accuracy among raters in large-scale assessments, care is needed in designing precise scoring rubrics, selecting and training raters, and checking rater performance throughout the scoring process. Classroom performance assessments that are accompanied by scoring rubrics with clearly defined performance criteria will help ensure the accuracy of the scores teachers assign to student work.
Fairness
Some proponents of performance assessments in the 1980s believed that performance assessments would help reduce the performance differences between subgroups of students—students with different cultural, ethnic, and socioeconomic backgrounds. As Linn et al. (1991) cautioned, however, it would be unreasonable to assume that group differences that are exhibited on multiple-choice tests would be smaller or alleviated by using performance assessments. Differences among groups are not necessarily due to test format, but are due to differences in learning opportunities, differences with the familiarity of the assessment format, and differences in student motivation. Closing the achievement gap for subgroups of students regardless of the format of the assessment requires that all students have the opportunity to learn the subject matter covered by the assessment (AERA, APA, & NCME, 1999). An integrated instruction and assessment system, as well as highly qualified teachers, are requisite for closing the achievement gap. Performance assessments can play an important role in closing the performance gap only if all students have opportunities to be engaged in meaningful learning.
Consequences
An assumption underlying the use of performance assessments is that they serve as motivators in improving student achievement and learning, and that they encourage the use of instructional strategies that promote students’ reasoning, problem-solving, and communication skills. It is particularly important to obtain evidence about the consequences of performance assessments because particular intended consequences, such as fostering reasoning and thinking skills in instruction, are an essential part of the assessment system’s rationale (Linn, 1993; Messick, 1995). Performance assessments are intended to promote student engagement in reasoning and problem solving in the classroom, so evidence of this intended consequence would support the validity of their use. Evaluation of both the intended and unintended consequences of assessments is fundamental to the validation of test use and the interpretation of the assessment results (Messick, 1995). To identify potential negative consequences, one needs to examine whether the purpose of the assessment is being compromised, such as teaching only the content standards that are on state assessments rather than to the entire set of content standards deemed important.
When some states relied more heavily on performance assessments in the early 1990s, there was evidence that many teachers revised their own instruction and classroom assessment accordingly. Teachers used more performance tasks and constructed-response items for classroom purposes. In a study examining the consequences of Washington’s state assessment program, approximately two-thirds of teachers reported that the state’s content standards and extended-response items on the state assessment were influential in promoting better instruction and student learning (Stecher, Barron, Chun, & Ross, 2000). Further, observations of classroom instruction in exemplary schools in Washington revealed that teachers were using reform-oriented strategies in meaningful ways (Borko, Wolf, Simone, & Uchiyama, 2003). Teachers and students were using scoring rubrics similar to those on the state assessment in the classroom, and their use of these rubrics promoted meaningful learning. In a study examining the consequences of Maryland’s state performance assessment (MSPAP), teachers’ reported use of reform-oriented instructional strategies was associated with positive changes in school performance on MSPAP in reading and writing over time (Stone & Lane, 2003). Schools in which teachers indicated that they had used more reform-oriented instructional strategies in reading and writing showed greater rates of change in school performance on MSPAP over a 5-year period. Further, teachers’ perceived effect of MSPAP on math and science instructional practices was also found to explain differences in changes in MSPAP school performance in math and science. The more impact MSPAP had on science and math instruction, the greater the gains in MSPAP school performance in math and science over a 5-year period.
Recently, however, many states have relied more heavily on multiple-choice items and short-answer formats, and as a consequence, teachers in some of these states are using fewer constructed-response items for classroom purposes (Hamilton et al., 2007). If extended constructed-response items that require students to explain their reasoning are not on the high-stakes state assessments, instruction may focus more on computation and less on complex math problems and teachers may rely less on constructed-response items on their classroom assessments.
Performance assessments are useful tools for initiating changes in instruction and student learning. Large-scale assessments that incorporate performance tasks can signal important goals for educators and students to pursue, which can have a positive effect on instruction and student learning. A balanced, coordinated assessment and instructional system is needed to help foster student learning. Coherency among content standards, instruction, large-scale assessments, and classroom assessments is necessary if we are committed to the goal of enhancing student achievement and learning. Because an important role for both large-scale and classroom performance assessments is to serve as models of good instruction, performance assessments should be grounded in current theories of student cognition and learning, be capable of evoking meaningful reasoning and problem-solving skills, and provide results that help guide instruction.
- American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
- Baker, E. L., O’Neil, H. F., & Linn, R. L. (1993). Policy and validity prospects for performance-based assessment. American Psychologist, 48(12), 1210-1218.
- Baron, J. B. (1991). Strategies for the development of effective performance exercises. Applied Measurement in Education, 4(4), 305-318.
- Black, P., & Wiliam, D. (1998). Inside the black box: Raising standards through classroom assessment. Phi Delta Kappan, 80(2), 139-148.
- Borko, H., Wolf, S. A., Simone, G., & Uchiyama, K. (2003). Schools in transition: Reform efforts in exemplary schools of Washington. Educational Evaluation and Policy Analysis, 25(2), 171-202.
- Frederiksen, J. R., & Collins, A. (1989). A systems approach to educational testing. Educational Researcher, 18(9), 27-32.
- Fuchs, L. S., Fuchs, D., Karns, K., Hamlett, C. L., Dutka, S., & Katzaroff, M. (2000). The importance of providing background information on the structure and scoring of performance assessments. Applied Measurement in Education, 13(1), 1-34.
- Hamilton, L. S., Stecher, B. M., Marsh, J. A., McCombs, J. S., Robyn, A., Russell, J. L., et al. (2007). Standards-based accountability under No Child Left Behind. Pittsburgh, PA: RAND.
- Hieronymus, A. N., & Hoover, H. D. (1987). Iowa tests of basic skills: Writing supplement teacher’s guide. Chicago: Riverside.
- Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Educational Measurement: Issues and Practice, 18(2), 5-17.
- Lane, S. (1993). The conceptual framework for the development of a mathematics performance assessment instrument. Educational Measurement: Issues and Practice, 12(3), 16-23.
- Lane, S., Liu, M., Ankenmann, R. D., & Stone, C. A. (1996). Generalizability and validity of a mathematics performance assessment. Journal of Educational Measurement, 33(1), 71-92.
- Lane, S., & Stone, C. A. (2006). Performance assessments. In R. L. Brennan (Ed.), Educational Measurement (pp. 387-432). Westport, CT: Praeger.
- Linn, R. L. (1993). Educational assessment: Expanded expectations and challenges. Educational Evaluation and Policy Analysis, 15, 1-16.
- Linn, R. L., Baker, E. L., & Dunbar, S. B. (1991). Complex performance assessment: Expectations and validation criteria. Educational Researcher, 20(8), 15-21.
- Maryland State Board of Education. (1995). Maryland school performance report: State and school systems. Baltimore: Author.
- Maryland State Department of Education. (1996). 1996 MSPAP public release task: Choice in reading and writing. Baltimore: Author.
- Marzano, R. J., Pickering, D. J., & McTighe, J. (1993). Assessing student outcomes: Performance assessment using the dimensions of learning model. Alexandria, VA: Association for Supervision and Curriculum Development.
- Messick, S. (1995). Standards of validity and the validity of standards in performance assessment. Educational Measurement: Issues and Practice, 14(4), 5-8.
- National Research Council. (2006). Systems for state science assessment. In M. R. Wilson & M. W. Bertenthal (Eds.), Committee on test design for K-12 science achievement. Washington, DC: National Academies Press.
- Nitko, A. J. (2004). Educational assessment of students (4th ed.). Upper Saddle River, NJ: Pearson.
- Resnick, L. B., & Resnick, D. P. (1992). Assessing the thinking curriculum: New tools for educational reform. In B. R. Gifford & M. C. O’Connor (Eds.), Changing assessment: Alternative views of aptitude, achievement, and instruction (pp. 37-55). Boston: Kluwer Academic.
- Schmeiser, C. B., & Welch, C. J. (2007). Test development. In R. L. Brennan (Ed.), Educational Measurement (pp. 307-354). Westport, CT: Praeger.
- Shavelson, R. J., Baxter, G. P., & Gao, X. (1993). Sampling variability of performance assessments. Journal of Educational Measurement, 30(3), 215-232.
- Shepard, L. A. (2000). The role of assessment in a learning culture. Educational Researcher, 29(7), 4-14.
- Shepard, L., Hammerness, K., Darling-Hammond, L., Rust, F., with Snowdon, J. B., Gordon, E., Gutierrez, C., & Pacheco, A. (2005). Assessment. In L. Darling-Hammond & J. Bransford (Eds.), Preparing teachers for a changing world: What teachers should learn and be able to do. San Francisco: Jossey-Bass.
- Stecher, B., Barron, S., Chun, T., & Ross, K. (2000, August). The effects of the Washington state education reform on schools and classrooms (CSE Tech. Rep. No. 525). Los Angeles: University of California, National Center for Research on Evaluation, Standards, and Student Testing.
- Stiggins, R. J. (1987). Design and development of performance assessments. Educational Measurement: Issues and Practice, 6(1), 33-42.
- Stone, C. A., & Lane, S. (2003). Consequences of a state accountability program: Examining relationships between school performance gains and teacher, student and school variables. Applied Measurement in Education, 16(1), 1-26.
- Taylor, C. S., & Nolen, S. B. (2005). Classroom assessment: Supporting teaching and learning in real classrooms. Upper Saddle River, NJ: Pearson.
- U.S. Department of Education. (1996). NAEP 1996 Science Report Card for the Nation and the States. Washington, DC: Author. Retrieved from https://nces.ed.gov/nationsreportcard/pubs/main1996/
- U.S. Department of Education. (1997). NAEP 1996 Mathematics Report Card for the Nation and the States. Washington, DC: Author. Retrieved from https://nces.ed.gov/nationsreportcard//pdf/main1996/97488.pdf
- Wainer, H., & Thissen, D. (1994). On examinee choice in educational testing. Review of Educational Research, 64, 159-195.
- Welch, C. (2006). Item and prompt development in performance testing. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 303-329). Mahwah, NJ: Lawrence Erlbaum Associates.
- Wiggins, G. (1989, April). Teaching to the (authentic) test. Educational Leadership, 41-47.
‘You only assess what you care about’: a new report looks at how we assess research in Australia
Emeritus Professor, UNSW Sydney
Kevin McConkey has previously received funding from the Australian Research Council. He is the current chair of the Policy Committee of the Academy of the Social Sciences in Australia. He is the chair of the Expert Working Group of the Australian Council of Learned Academies, which prepared the report referred to in this article.
Research plays a pivotal role in society. Through research, we gain new understandings, test theories and make discoveries.
It also has a huge economic value. In 2021, the CSIRO found every A$1 of research and development investment in Australia creates an average of $3.50 in economy-wide benefits.
But how do we know if individual research projects being conducted in Australia are good quality? How is research recognised? The key way this happens is through “research assessment”.
What is research assessment?
Research assessment is not a centralised or necessarily formal process. It can involve various processes and measures to evaluate the performance of individual researchers and research institutions. This includes assessing the quality, excellence and impact of various outputs.
Research assessment can be qualitative or quantitative. It can include publications in journals and the number of people who cite the research, gaining grants to do further research, commercialisation, media engagement and impact on decision-making or public policy, prizes and invitations to speak at conferences.
If research assessment is working fairly and effectively, it should achieve several things. This includes: helping to develop researchers’ careers, making sure innovative research does not get avoided in favour of short-term gains and helping funders and the community have confidence research is providing value for money and adding to the public good.
Our new project aimed to provide a better understanding of how research assessment affects research in Australia.
In a report released today, we surveyed more than 1,000 Australian researchers and more than 50 research organisations.
This included universities, research institutes, industry bodies, government and not-for-profit organisations. The majority of researchers (74%) were in academic roles. Across those research sectors, we also conducted 11 roundtables involving around 120 people and 25 intensive interviews to understand the issues.
This work was commissioned by Chief Scientist Cathy Foley and conducted by the Australian Council of Learned Academies (involving the academies of science, medical science, engineering and technological sciences, social sciences and humanities).
It also comes as the Universities Accord review examines how research is funded and approached within higher education.
What we found
We found some difficulties with the current approach to research assessment.
We heard there is a tendency by some researchers to “play it safe” in terms of doing research they believe will score well. We also heard how the assessment process can unintentionally exclude or devalue particular forms of knowledge, particularly in the humanities and the social sciences, where outputs can be less easily quantified or less immediately seen.
As one interviewee said:
What is assessed and how it is assessed are an indication of what the organisation values. You only assess what you care about. Values and culture drive assessment.
Our roundtables told us senior staff and supervisors are often seen to reinforce the culture of “publish or perish”, with the number of articles being valued more highly than the quality.
We heard early and mid-career researchers and people from underrepresented backgrounds can have difficulties trying to “play the game” to advance their careers. For example, early-career researchers are often expected to produce work that benefits their larger team, at a cost to their own capacity for promotion.
As one interviewee noted:
Metrics are essential for defining value and comparative difference, but Australia requires a modern and fair framework for assessing our current and next generation of researchers.
Our survey found a high level of dissatisfaction with the state of research assessment. This included:
73% of respondents agreed assessment processes are not consistently or equitably applied across disciplines, in particular between the humanities and the sciences
67% said there are not enough opportunities to provide input into research assessment practices
70% said assessments took up unreasonable time and effort.
The way forward
In our survey, we asked “What is one specific change you would recommend to improve current research assessment processes?”.
Respondents wanted to see a shift towards quality over quantity. This means not just a focus on publishing as many papers as possible, but supporting research that may take longer for its value and benefits to emerge.
They wanted interdisciplinary research to be promoted and rewarded, because many of the complex problems of our world – from climate change to domestic violence to housing affordability – require multiple disciplines to be involved in finding solutions. In the same vein, they also wanted collaboration and team work to be rewarded more clearly and transparently.
They wanted less bias towards STEM (science, technology, engineering and maths) research and more promotion of diversity and of early-career researchers. This included better understanding of their personal and cultural situation, more focused career development and better managed teamwork.
To achieve all of this, and more, we will also need to understand that no single measure can assess all research or researchers. So, several tools will be needed, including quantitative indicators as well as qualitative measures and peer review.
Ana Deletic, Louisa Jorm, Duncan Ivison, Robyn Owens, Jill Blackmore, Adrian Barnett, Kate Thomann, Caroline Hughes, Andrew Peele, Guy Boggs and Raffaella Demichelis were all part of the expert working group supporting this work.
Project-Based Assessments In The Classroom Using Technology
The core objective of this study is to develop a framework for incorporating project-based assessments into the classroom using technology. The literature review highlighted several important issues, with a focus on the sustained use of technological applications in the classroom. Education, as both a product and a service, reflects a shift from a production-line approach to a service-provision structure, and educators can assess well-established curriculum evaluation structures using ITSMF and ITIL principles. The present section discusses suitable research design methods, including methods of data collection and data analysis. Particular care has been taken to employ research methods that are consistent with the aims and objectives of the study. Practical issues and procedures pertaining to the research methodology are discussed, along with the academic principles underlying the specifics of the selected method.
Technology has become ubiquitous; it is an inseparable part of human life, and the number of schools seeking to integrate technology into classroom learning continues to grow. When technology is properly used in the classroom, it can help students acquire the skills and competencies they need to succeed in the contemporary technology-based economy. Scholars in the field argue that integrating technology into the classroom means more than teaching basic computer skills and software programs in dedicated computer classes. Effective technology integration needs to take place across the curriculum in ways that enhance the learning process, attending to four essential aspects of learning: active and regular engagement, group participation in varied learning activities, ongoing interaction and feedback, and frequent consultation with experts in the field.

Technology has become an important part of the educational process, and educational technology experts agree that technology should not be taught as a separate subject but rather used as a powerful tool to promote student learning. Nevertheless, a substantial number of teachers lack sufficient knowledge of and experience with technology, which may impede the process of integration. To succeed in incorporating relevant technological solutions into the classroom, these teachers should extend their learning and training so that they acquire practical technological skills and expertise. The literature on the subject emphasizes constructivism when it comes to technology integration, and the constructivist approach is often presented as a way to resolve certain educational problems.
Constructivism is a theory of learning that focuses on the way the mind creates knowledge and thereby arrives at a conceptual understanding of particular ideas. Educational goals and objectives, however, are evolving as a result of new social needs that correspond to constant advances in technology. The earliest use of computers to aid instruction was based on the work of Skinner, who promoted computers as a means of strengthening practice and the acquisition of essential skills; on his account, the teacher plays a central role in modifying students’ behavior in a desirable direction through positive reinforcement. Computers have since been perceived as adequate resource tools for teaching problem-solving and critical thinking skills. Technology offers the advantage of providing a visual representation of higher-order concepts discussed in the classroom, and its integration is reflected in the use of graphics and simulations that spark students’ interest in exploring such technological means of learning and applying them in practice. One of the most effective uses of technology in education is to adjust instruction to students’ individual learning needs and expectations. For instance, technology can provide adequate learning opportunities for students with special needs, while gifted students are given the freedom and flexibility to pursue research projects at their own pace.
Researchers argue that teachers in the contemporary educational environment tend to integrate technology into the classroom in meaningful ways, in the sense of supporting the curriculum rather than dominating it. The regular use of technology in education aims at creating a collaborative learning environment marked by the facilitating and learning roles of educators. Scholars outline essential pedagogical principles for technology integration, such as active learning, collaboration, mediation, and an appropriate level of interactivity.
Active learning while implementing technology in the classroom means that students are adequately engaged in particular classroom activities. The use of technology for active learning makes students more focused and motivated to complete certain tasks. Furthermore, technology serves as a proper mechanism for increasing the amount of human interaction between educators and students in the classroom. For instance, information and communication technology provides learners with a viable opportunity to acquire transferable skills and use distinct learning styles in an ongoing, flexible educational process. Technology can change teaching practices to a great extent simply because the classroom has become quite student-centered, with teaching roles becoming more collaborative and facilitating.
From the perspective of students, technology can definitely make learning easier and more exciting. In this way, learners are given an opportunity to keep up with essential skills that are necessary for their professional future. It is important to point out that the future of learners is bound up with advanced technology: students are entering a world in which most jobs will require competence in technology. In order to succeed in the highly competitive knowledge-based economy, learners should constantly update their occupational and technological skills.
Implementing a Data Collection Plan
Whereas quantitative researchers typically rely on random probability sampling, there are no specific rules for establishing sample sizes when adopting the qualitative method. Sample size depends largely on the judgment of the researcher regarding the objective of the study, the suitability and reliability of the selected cases or events and, last but not least, on the available time and resources. Different theories linked with quantitative and qualitative research methods treat sample size as a trade-off between breadth and depth.
Quantitative tools restrict responses to prearranged categories by means of identical questions. Quantitative researchers are therefore able to record the responses of many participants, which advances the breadth of the data. By contrast, qualitative researchers usually examine only a few selected cases or events, but in great depth and with attention to detail and context, which improves the depth of the research. Whereas the quantitative method may not ensure adequate depth, the breadth of qualitative research, i.e., the number of individuals that can be selected and examined, is restricted.
The adoption of both primary and secondary data is believed to expand the feasibility of the research itself. The advantage of this method is that it makes it possible to establish a link between first-hand information from the real world and the theories and literature already available on the research topic. This makes the research study not only more evidence-based but also more critically oriented. The primary data would be collected through personal interviews with the identified sample of participants. The rationale for conducting personal interviews is that the interviewer has a chance to develop new questions on the basis of the interviewees' responses. This would give direct insight into the strategies used by the professors from the respective education community.
Purposeful sampling, as utilized in the present qualitative research, is a sampling strategy intended to diversify the perspectives of individuals in relation to the explored issue. Once an ITIL-based framework model had been established, the researcher decided to carry out 20 interviews with higher education experts (male and female, with a minimum of 15 years of professional academic experience) from various countries to include their judgment on the proposed structure. The researcher planned to target the following countries: Australia, Sweden, United Arab Emirates, France, and United States.
"Purposeful sampling" is a term used to describe the "tactical and determined collection of information-rich cases", with the objective of guaranteeing that the chosen sample provides the essential depth while also achieving a preferably high level of breadth. Still, the precise type and number of the chosen cases depend on the research objectives and the other aspects previously discussed.
The chosen cases or events will be reviewed according to the objectives of the study and their significance in responding to the research questions on the topic under examination. In the following, the sampling process as implemented in this particular research on higher education is explained in detail. This process is highly structured to correspond to the goals outlined for the present study. Five characteristics can be recognized as the most important in the sampling process: (1) time frame, (2) financial capability, (3) geographic location, (4) the number of interviewers, and (5) adequate access to university or academic information. The time frame for carrying out in-depth interviews was roughly identified as 1-2 months, with a sufficient budget available. Geographically, the research targeted the following countries: Australia, Sweden, UAE, France, and United States. Only one interviewer was required to carry out all expert interviews; however, since three participants represented non-English-speaking countries, it was preferable to include interpreters in order to accurately capture academic or university-related information.
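The selection logic described above can be sketched as a simple filter over a candidate pool. This is a hypothetical illustration only: the candidate records, field names, and helper function are invented for the example and are not part of the study.

```python
# Hypothetical sketch of the purposeful-sampling criteria described above.
# Candidate data and field names are illustrative, not taken from the study.

TARGET_COUNTRIES = {"Australia", "Sweden", "United Arab Emirates", "France", "United States"}
MIN_EXPERIENCE_YEARS = 15  # minimum professional academic experience
SAMPLE_SIZE = 20           # planned number of expert interviews

def select_experts(candidates):
    """Return up to SAMPLE_SIZE information-rich cases matching the criteria."""
    eligible = [
        c for c in candidates
        if c["experience_years"] >= MIN_EXPERIENCE_YEARS
        and c["country"] in TARGET_COUNTRIES
    ]
    # Prefer the most experienced (information-rich) candidates first.
    eligible.sort(key=lambda c: c["experience_years"], reverse=True)
    return eligible[:SAMPLE_SIZE]

candidates = [
    {"name": "A", "experience_years": 18, "country": "Sweden"},
    {"name": "B", "experience_years": 12, "country": "France"},    # too little experience
    {"name": "C", "experience_years": 25, "country": "Australia"},
    {"name": "D", "experience_years": 20, "country": "Brazil"},    # outside target countries
]
print([c["name"] for c in select_experts(candidates)])  # → ['C', 'A']
```

The point of the sketch is that purposeful sampling is criterion-driven rather than random: candidates are screened against explicit eligibility rules before any interviews take place.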
Since the higher education sector, e.g. in the context of a curriculum evaluation framework, is known for arranging diverse implementation measures, professors with 25 years of professional academic experience were contacted for the interviews. This strategy was also based on the assumption that ideas related to such a framework might be more advanced and complex where an extensive amount of research is typically involved. As a result, considering the objective of the study and the research questions mentioned before, the researcher decided to concentrate on the curriculum evaluation framework in higher education.
The first interview made it possible for the researcher to fully explore the experience under examination, particularly the higher education part of the curriculum evaluation framework as well as the issues related to technology incorporation in the classroom. In this context, in order to obtain a balanced view of the facts, higher education professors, particularly from the five mentioned countries, were incorporated in the sample. All the chosen participants have extensive experience with the adoption of a curriculum evaluation framework in the field of higher education.
The distinction between quantitative and qualitative research techniques can be traced back to their theoretical underpinnings, as the techniques are based on positivism and interpretivism, respectively. Positivism, the source of quantitative research, emphasizes objective validity, or aspects of the real world that exist independently of individual views or opinions. Supporting research in this approach means utilizing "theoretical-deductive" reasoning, progressing from specific hypotheses to facts with the aim of describing "real world views". Interpretivism, on the other hand, emphasizes that a hypothesis will always belong to the specific background and conditions where, and under which, it was established. This technique, the source of qualitative research, does not strive for universal laws; rather, participants' understanding and their experiences are what matter and what shape validity. With these different theoretical approaches in mind, the alternative norms for guaranteeing the quality of qualitative research are discussed below.
Alternative norms for evaluating data quality: There are two different sets of norms related to evaluating data quality that have significance for this study: (1) traditional scientific research standards, and (2) social constructionist and constructivist standards. Traditional scientific norms are based on the theoretical approach of positivism, the foundation of quantitative techniques. This means that when applying those norms to qualitative research, the researcher is required to be as unbiased as possible, to implement methodical data collection standards, and to strive for generalizability through hypothesis analysis. The objective is to describe facts as accurately and correctly as possible in order to present findings on how the "current setting" is. Traditionally, the standard of research studies has been assessed based on internal and external validity and reliability.
Internal validity refers to the degree to which a study's conclusions are soundly drawn from its premises; external validity describes the degree to which the research findings in one setting can be applied in a setting different from the first; reliability, finally, is the degree to which another researcher could derive the same conclusions from the same observations. These standards are important for researchers implementing traditional scientific standards. As to generalization, or the external validity of information, it is believed that induction (as implemented in the quantitative method) is never fully justified logically, since inductive conclusions are always based on particular premises. As a result, an increase in sample size might be helpful for the research, but the specific advantage lies in the enhanced reliability of the sampling process, not in enhanced generalization from the sample to the population.
For this reason, statistical generalizability hardly corresponds to a general model for all types of generalizability. Since the objective of this research was not to statistically measure educational data, but to enhance understanding of specific facts, traditional scientific research standards do not offer an adequate framework for judging the quality of this research. More appropriate for the nature of this research are social constructivist research standards (as described in the next part), which correspond to internal and external validity as well as to reliability. Social constructionist and constructivist standards, from an interpretivist perspective, recognize that the external environment is a construction – be it a social, political, or emotional one. Researchers supporting this approach are more concerned with expanding their understanding of specific incidents within a particular context than with the formation of hypotheses and general conclusions.
As explained earlier, this study recognizes the different constructions of reality and, as a result, the different views of participants regarding a certain fact. Social constructivist research standards therefore represent an appropriate framework for this research. The next section illustrates how quality was guaranteed in this study, according to social constructivist standards, with a special focus on integrity and reliability. The process of guaranteeing quality points to rigorous methods as the most essential factor on which the integrity of research depends. The utilization of rigorous techniques refers to the use of methodical data collection during fieldwork and concludes with methodical strategies for analyzing the information gathered, including integrity in data analysis in terms of producing and examining alternative explanations of the phenomenon studied.
Reporting Findings and Implications for Practice
NVivo is an effective software application used for the analysis of qualitative data obtained from interviews with participants. Data sources in NVivo are identified as research or project materials, such as flexible research settings and typed memos capturing the thoughts of the researcher on the explored problem of adopting a service-oriented framework in higher education. It is important to categorize the sources used for the study, such as internals, externals, and memos. The possibilities for teamwork are virtually limitless when the data are analyzed with NVivo. A distinct approach in this case is to assign unique user profiles upon the initial launch of the software. Team members are expected to work on various data sources, with the idea of bringing unique perspectives to the use of the same sources of information. The contributions of all team members are important to explore in order to expand insight into the research process. Another significant part of the data analysis process in NVivo is importing relevant documents: the creation of a new project in NVivo requires that the files be imported in order to be properly analyzed. There are specific requirements to follow when importing documents, and these should be observed in order to conduct all stages of the data analysis accordingly. Moreover, the data analysis process involves working with nodes and coding essential data. Nodes represent specific codes assigned to themes and ideas about the data included in the project. The other important step is to create node hierarchies, as the common idea is to move from general to more specific topics related to the service-oriented framework that will be implemented in higher education.
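NVivo itself is operated through its graphical interface, but the node-and-hierarchy idea described above can be sketched in code. The following is a minimal, hypothetical Python model of coding nodes that move from general to specific; the node names and the coded excerpt are invented examples, not the study's actual coding scheme.

```python
# Minimal sketch of a coding-node hierarchy like the one described for NVivo.
# Node names and the excerpt are hypothetical examples.

class Node:
    """A code assigned to a theme; children refine it from general to specific."""
    def __init__(self, name):
        self.name = name
        self.children = []
        self.references = []  # coded excerpts from imported sources

    def add_child(self, name):
        child = Node(name)
        self.children.append(child)
        return child

    def code(self, excerpt):
        """Attach a coded excerpt (a 'reference') to this node."""
        self.references.append(excerpt)

# General theme at the top of the hierarchy ...
framework = Node("Service-oriented framework in higher education")
# ... refined into more specific child nodes.
curriculum = framework.add_child("Curriculum evaluation")
technology = framework.add_child("Technology integration")

# Coding: linking an interview excerpt to a specific node.
technology.code("Interviewee 3: 'simulations trigger students' interest ...'")

def count_nodes(node):
    """Total nodes in the hierarchy rooted at `node`."""
    return 1 + sum(count_nodes(c) for c in node.children)

print(count_nodes(framework))  # → 3
```

The design mirrors the workflow in the paragraph above: sources are imported, excerpts are coded to nodes, and nodes are arranged so that general themes sit above their more specific refinements.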
A multi-taxon analysis of European Red Lists reveals major threats to biodiversity
- Axel Hochkirch,
- Melanie Bilz,
- Catarina C. Ferreira,
- Anja Danielczak,
- David Allen,
- Ana Nieto,
- Carlo Rondinini,
- Kate Harding,
- Craig Hilton-Taylor,
- Published: November 8, 2023
Biodiversity loss is a major global challenge and minimizing extinction rates is the goal of several multilateral environmental agreements. Policy decisions require comprehensive, spatially explicit information on species’ distributions and threats. We present an analysis of the conservation status of 14,669 European terrestrial, freshwater and marine species (ca. 10% of the continental fauna and flora), including all vertebrates and selected groups of invertebrates and plants. Our results reveal that 19% of European species are threatened with extinction, with higher extinction risks for plants (27%) and invertebrates (24%) compared to vertebrates (18%). These numbers exceed recent IPBES (Intergovernmental Platform on Biodiversity and Ecosystem Services) assumptions of extinction risk. Changes in agricultural practices and associated habitat loss, overharvesting, pollution and development are major threats to biodiversity. Maintaining and restoring sustainable land and water use practices is crucial to minimize future biodiversity declines.
Citation: Hochkirch A, Bilz M, Ferreira CC, Danielczak A, Allen D, Nieto A, et al. (2023) A multi-taxon analysis of European Red Lists reveals major threats to biodiversity. PLoS ONE 18(11): e0293083. https://doi.org/10.1371/journal.pone.0293083
Editor: Neelesh Dahanukar, Shiv Nadar University, INDIA
Received: January 16, 2023; Accepted: October 4, 2023; Published: November 8, 2023
This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.
Data Availability: All original data is available in the IUCN Red List database ( iucnredlist.org )
Funding: The European Commission (EC) has funded all European Red List projects. Co-funders of some of the assessments were National Parks and Wildlife Service, Republic of Ireland; Ministry of Economic Affairs, Department of Nature & Biodiversity (Ministerie van Economische Zaken, Directie Natuur & Biodiversiteit), the Netherlands; Council of Europe; Office fédéral de l’environnement, Switzerland; Swedish Environmental Protection Agency (Naturvardsverket), Sweden; British Entomological Society, United Kingdom; Ministry of Sustainable Development and Infrastructure, Government of the Grand-Duché of Luxembourg; Ministry of the Environment of the Czech Republic; and ArtDatabanken from the Swedish University of Agricultural Sciences. The funders had no role in data collection and analysis, decision to publish or preparation of the manuscript, but the funding decisions determined the taxa that have been assessed.
Competing interests: The authors have declared that no competing interests exist.
Biodiversity is declining globally at an unprecedented rate [ 1 – 3 ], with around 1 million animal, fungal and plant species potentially at risk of extinction within the next few decades [ 4 ]. Several international policies have been designed to tackle this crisis, namely by defining specific biodiversity recovery goals and targets (e.g., the United Nations Sustainable Development Goals (SDG 14, 15), the Convention on Biological Diversity (CBD) Aichi Targets and Kunming-Montreal Global Biodiversity Framework Targets) that have been transposed into national or regional policy by countries worldwide. To document progress towards these targets spatially explicit information on the distribution of species, their ecological requirements and major threats is needed [ 5 , 6 ]. Red List assessments that compile the best available evidence on species’ extinction risk are pivotal to measure progress towards international biodiversity conservation objectives by underpinning suitable biodiversity indicators [ 7 ]. The IUCN Red List of Threatened Species TM (hereafter, the IUCN Red List) is widely recognized as the most comprehensive and objective approach for evaluating the conservation status of species, and is considered a global ‘barometer of life’ [ 8 ]. More than 142,000 species have been assessed for the IUCN Red List thus far, but at the global scale there are strong taxonomic biases [ 6 ].
In Europe, taxonomic coverage of the IUCN Red List is more extensive than in other parts of the world, as the European Commission has funded European Red List assessments of thousands of species from a wide variety of taxonomic groups since 2006. These include all vertebrates (amphibians, birds, fishes, mammals and reptiles), functionally important invertebrate groups (all bees, butterflies, dragonflies, grasshoppers, crickets, bush-crickets, freshwater and terrestrial molluscs, and a selection of saproxylic beetles) and about 12% of the known plant species in Europe (including all ferns and lycopods, orchids, trees, aquatic plants and bryophytes, as well as selected shrubs, medicinal plants, priority crop wild relatives, and plants listed in policy instruments). This Herculean effort provides a wealth of information on the conservation status of 14,669 species, including spatial information on an exceptionally broad range of species that is derived using a standardized methodology and includes taxa that are usually underrepresented in conservation [ 6 ]. The assessed taxa have not been chosen to ensure representativeness but based upon funders’ priorities. However, they are by far more diverse than any dataset used for global analyses so far, such as the Living Planet Index [ 9 ]. These data will help to guide and monitor progress in achieving the targets of the EU Biodiversity Strategy for 2030 [ 10 ], i.e., to ensure that Europe’s biodiversity is on the path to recovery by 2030. Here, we synthesize the findings of all European Red List species assessments published up to the end of 2020 to analyze major biodiversity distribution patterns and threats to biodiversity in Europe. This analysis also provides a baseline against which to measure progress towards biodiversity targets to be achieved in the coming decade.
In Europe, approximately one-fifth (19.4%, 2,839 species) of the 14,669 species assessed are threatened with extinction ( Fig 1 ), with 50 species being Extinct, Regionally Extinct or Extinct in the Wild (EX, RE, EW) and a further 75 tagged as Possibly Extinct. The percentage of threatened species (those classified as Critically Endangered (CR), Endangered (EN) or Vulnerable (VU)) was higher among plants (27%) and invertebrates (24%) than among vertebrates (18%). This pattern is noteworthy considering that vertebrates receive substantially more conservation attention and that the latest IPBES (Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services) global assessment on biodiversity and ecosystem services used a conservative “tentative estimate” that 10% of all insects are threatened with extinction, while noting that “ the prevalence of extinction risk in high-diversity insect groups is a key unknown ” [ 4 ]. Using our value of 24% threatened invertebrates would roughly double the IPBES extrapolation (1.97 ± 0.23 million species threatened rather than 1 million). It is worth noting that IPBES also used the European Red Lists for bees, butterflies and saproxylic beetles to estimate the global extinction risk of insects. While the extrapolation of European data to a global estimate involves several uncertainties, evidence from some comprehensively assessed species groups suggests that global extinction risk does not deviate strongly from the European status (e.g. Odonata: European Red List [ 11 ]: 15.7% threatened, Global Red List [ 12 ]: 16.1% threatened; Birds: European Red List [ 13 ]: 13.2%, Global Red List [ 12 ]: 12.6%). Our higher estimate of the number of threatened insect species is mainly explained by the inclusion of more recent European Red Lists than were available for the IPBES assessment, and partly by the high number of Data Deficient (DD) species among insects ( S2 Fig ).
Indeed, the proportion of DD species is quite high even in Europe (18%), despite it being a very well-studied region. Data deficiency is notably higher among invertebrates (24%) than among plants (11%) or vertebrates (10%). Further, for nearly half of all species (49%), and for 60% of invertebrates, the population trend was classified as ‘unknown’ by the Red List assessors. This is in line with global estimates, illustrates a general lack of data on population status and demographics, and confirms the need for biodiversity monitoring programs [ 6 ].
Abbreviations: EX: Extinct, EW: Extinct in the Wild, RE: Regionally Extinct, CR: Critically Endangered, EN: Endangered, VU: Vulnerable, DD: Data Deficient, NT: Near Threatened, LC: Least Concern.
Nearly half (47%, n = 6,926 of the 14,669) of Europe’s assessed species are endemic, including 2,125 threatened species. Most (86%, n = 1,171) threatened invertebrates are endemic to Europe. Across all taxa, only 54% of threatened species have been documented in protected areas, a lower percentage than among Near Threatened (NT) or Least Concern (LC) species (61%), raising concerns about the suitability of the European protected area network as a means to protect all threatened species [ 14 , 15 ] and emphasizing the need to expand and improve it. Our spatial analysis of terrestrial species diversity in Europe ( Fig 2 ) further underlines the importance of mountain systems for biodiversity persistence in Europe. Mountains support a high number of endemic species and are also less transformed by humans than lowland plains and coasts. The highest species numbers by area were recorded in the southern Alps, the eastern Pyrenees and the Pirin Mountains in Bulgaria ( Fig 2 ), while threatened biodiversity peaks in the Alps and the Balkans ( S5 Fig ).
Spatial distribution of terrestrial and freshwater species richness in Europe based on an analysis of all European IUCN Red List assessments.
Our analyses confirm that multiple threats impact biodiversity, with agricultural land-use change (including tree plantations) being the most important threat to European species, followed by biological resource use (overexploitation), residential and commercial development, and pollution ( Fig 3 ). The strong impact of agricultural land-use is more prominent in invertebrates and plants, whereas vertebrates (particularly fishes) are more often threatened by overexploitation, as they may be directly hunted, caught or fished (including as incidental catch), which poses an extensive threat to marine fishes and other marine vertebrates. Residential and commercial development is an important cause of habitat loss and degradation affecting many invertebrate and plant species, whereas pollution is particularly threatening to freshwater species, such as fishes, molluscs and dragonflies. Climate change is also an important threat to many species and has been classified as the most important emerging future threat ( S3 Fig ). This is corroborated by the increasing number of droughts in Europe, which accelerate the risk of wildfires [ 16 ] and are aggravated by an increased off-take of water for agriculture and domestic supplies.
For all species, vertebrates, invertebrates and plants (CR: Critically Endangered, EN: Endangered, VU: Vulnerable, DD: Data Deficient, NT: Near Threatened, LC: Least Concern; N: All species = 14,669, Vertebrates = 2,494, Invertebrates = 7,600, Plants = 4,575).
The finding of agricultural land-use change as a major threat to biodiversity has often been reported [e.g. 17 , 18 ]. However, our analysis is the most comprehensive and unequivocal to date, reaffirming the magnitude of the impact of this threat at a continental scale. Many European species require or are adapted to traditional agricultural land-use but cannot cope with the magnitude of this change. Changes in agriculture are manifold: they include the conversion of natural habitats into farmland (partly as a consequence of detrimental subsidies under the EU Common Agricultural Policy (CAP)) and changing agricultural and forestry practices (particularly intensification and homogenization of land-use, with larger plots, larger and heavier machines, use of fertilizers and pesticides, decreasing crop diversity, higher livestock densities, earlier and more frequent mowing, drainage, irrigation, plowing, rolling, and the abandonment of historical management techniques), but also land abandonment coupled with rural exodus [ 19 ]. In Europe, habitat conversion into arable land mainly occurred in the past, while during the last decades abandonment has become more common. Intensification in the use of agricultural land had already started in the 19th century in northwestern Europe, with the replacement of traditional pastoral farming (mainly of sheep) by settled agriculture with cattle farming [ 20 ]. While pastoral systems are still abundant in the Mediterranean, they too are in decline due to the EU CAP funding systems [ 21 ]. While improvements to the CAP have constantly been proposed [ 22 ], the recent policy reform remained rather unambitious in this regard, despite the promising wind of change brought by the European Green Deal.
Most importantly, direct payments under the CAP have favored larger farms, while smallholder farming is in decline, leading to the abandonment of marginal lands, which are often particularly species-rich and reliant on extensive agricultural land-use [ 23 ]. While agricultural intensification is sometimes proposed as a means to increase the amount of natural habitats (“land sparing”), many threatened species in Europe are adapted to grassland habitats, which can only be retained by livestock grazing or mowing. Maintaining such habitat types will be challenging, as traditional agricultural management is often no longer profitable. Abandonment of traditional land use is also a threat to some forest species, which may depend on historical management such as coppicing or forest pastures.
Moreover, our analysis highlights some major knowledge gaps and research needs ( S4 Fig ). For a quarter of invertebrate species, the evidence available was not sufficient to determine their conservation status: most notably, 57% of European bees were assessed as Data Deficient [ 24 ]. Half of all species lack population trend data, which is a key requirement for assessing species extinction risk. This also means that for many species, Red List assessments are based on habitat trend information or other proxies. Unsurprisingly, the top research priorities identified for most species by the assessors include research on distribution, population sizes and trends, threats, life history and ecology as well as taxonomy ( S4 Fig ). Monitoring of population trends is also needed for many species, particularly for threatened taxa. In this context, it is important to highlight that general biodiversity monitoring schemes are usually not suitable for monitoring the status of highly threatened taxa (as these species are too rarely recorded to enable an analysis of trends). This means that targeted monitoring programs are required for species with a high extinction risk [ 25 ]. For vertebrate species, the need for research on the effectiveness of conservation actions has been identified more often than for plants or invertebrates. This could reflect a higher number of ongoing conservation projects for vertebrates compared to other taxa, which still require basic data to improve conservation assessments or compile conservation plans. While Europe probably has the most comprehensive Red List information in terms of species groups covered compared to other continents, the status of some key groups is still unexplored, such as freshwater quality indicators (e.g. mayflies, stoneflies, caddisflies), soil biota (e.g. fungi, springtails, earthworms, mites), decomposers (e.g. dung beetles, carrion beetles), marine invertebrates (e.g. marine crustaceans and molluscs), species-rich insect groups (e.g. weevils, rove beetles, leaf beetles, ground beetles) and many plant taxa. However, European Red List assessments have just been completed for hoverflies, are currently underway for moths, and a substantial portion of the taxa analyzed here are undergoing a reassessment which will lead to the development of Red List indices. Hence, the taxonomic and temporal coverage of the European Red Lists is constantly being increased.
Red Lists provide a valuable baseline for measuring progress towards biodiversity targets. Due to their wide taxonomic scope, the European Red Lists have revealed high extinction risks for some taxa, such as freshwater molluscs (59% threatened, [ 26 ]), trees (42%, [ 27 ]), freshwater fishes (40%, [ 28 ]) and Orthoptera (29%, [ 29 ]). As biodiversity recovery targets have become more refined under the Kunming-Montréal Global Biodiversity Framework, it will be important to continue to take snapshots of the biodiversity status not only in Europe, but at a global scale. To that end, metrics derived from the Red Lists, such as the Red List Index, have been adopted as indicators to track progress on meeting international conservation policy commitments and Sustainable Development Goals [ 7 , 30 ].
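As an aside on how such an indicator works, the Red List Index condenses repeated assessments into a single value between 0 and 1 (1 = all species Least Concern, 0 = all species Extinct) using the standard category weights. The sketch below is a minimal illustration, not the production IUCN code, and the species list in the example is hypothetical:

```python
# Standard Red List Index weights: LC = 0 up to EX (and RE/EW) = 5.
WEIGHTS = {"LC": 0, "NT": 1, "VU": 2, "EN": 3, "CR": 4, "RE": 5, "EW": 5, "EX": 5}

def red_list_index(categories):
    """Red List Index: 1.0 if all species are LC, 0.0 if all are Extinct.

    Data Deficient (DD) species are excluded, as is conventional.
    """
    usable = [c for c in categories if c in WEIGHTS]
    # Sum of weights, normalized by the worst case (every species Extinct).
    return 1 - sum(WEIGHTS[c] for c in usable) / (5 * len(usable))

print(red_list_index(["LC", "LC", "VU", "EN", "DD"]))  # 0.75
```

Tracking how this value changes between successive assessments of the same species set is what makes the index usable as a policy indicator.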
While the measurement and assessment of biodiversity trends is crucial to guide policy, it is even more important to implement necessary conservation action in a timely manner. We already have enough evidence at hand to act; what we are missing is action. This requires collaboration among multiple stakeholders to abate the major threats identified [ 31 ]. Indeed, conservation NGOs, conservation authorities, species experts and citizens in Europe have started numerous projects as a consequence of Red List publications, focusing on highly threatened species and even including threatened invertebrates [ 32 – 35 ]. Funding mechanisms for implementing conservation action exist at the European level (e.g. the EU LIFE program) as well as at international, national or even local scales. Member States now need to increase their capacity to conduct or support conservation projects and create optimal structures to plan and implement conservation action. Furthermore, biodiversity conservation needs to be better integrated or mainstreamed within other policies, so that the impact of major threats (such as agriculture, overfishing, forestry, pollution, and urban and rural development) is mitigated. So far, financial investment in activities detrimental to biodiversity far outstrips biodiversity-friendly investments [ 36 , 37 ]. Biodiversity is the foundation underpinning food security, human well-being and wealth generation. Securing a future for European life therefore requires greener agricultural and fisheries policies and a rapid phasing out of incentives detrimental to biodiversity in agriculture, forestry, fisheries and energy production.
Materials and methods
All European Red Lists published to date can be found at http://ec.europa.eu/environment/nature/conservation/species/redlist/ .
The following Red Lists were considered for the analyses: European Red List of amphibians [ 38 ], European Red List of birds [ 13 ], European Red List of freshwater fishes [ 28 ], European Red List of marine fishes [ 39 ], European Red List of mammals [ 40 ], European Red List of reptiles [ 41 ], European Red List of bees [ 24 ], European Red List of saproxylic beetles [ 42 , 43 ], European Red List of butterflies [ 44 ], European Red List of dragonflies [ 11 ], European Red List of non-marine molluscs [ 26 ], European Red List of terrestrial molluscs [ 45 ], European Red List of grasshoppers, crickets and bush-crickets [ 29 ], European Red List of vascular plants [ 46 ], European Red List of medicinal plants [ 47 ], European Red List of trees [ 27 ], European Red List of lycopods and ferns [ 48 ], European Red List of mosses, liverworts and hornworts [ 49 ].
The European Red List operates at the geographical scope of Europe extending to the Urals in the east, and from Franz Josef Land in the north to the Mediterranean in the south ( S1 Fig ). The Canary Islands, Madeira and the Azores are also included. In the southeast, the Caucasus region is not included in most assessments, except for the bird assessments, which also cover Turkey, the Caucasus region, and Greenland [ 13 ]. For the boundaries of marine assessments see S1 Fig . The European Red Lists were compiled using the IUCN Red List Categories and Criteria at regional level [ 50 ]. All species were assessed against the IUCN Red List Criteria to assess their extinction risk and categorized into nine categories [ 51 ] at the regional scale: Data Deficient (DD), Least Concern (LC), Near Threatened (NT), Vulnerable (VU), Endangered (EN), Critically Endangered (CR), Regionally Extinct (RE), Extinct in the Wild (EW), Extinct (EX). These categories are defined in the IUCN guidelines for application of IUCN Red List criteria at regional and national levels [ 50 ]. The terms RE and EW are sometimes referred to as “regionally extirpated” or “extirpated in the wild”, but we follow the IUCN definition here, which is widely used in the scientific literature. Species classified as CR, EN, or VU are considered threatened with extinction. Each assessment is supported, where available, by information on distribution (including a range map), population, ecology, threats, as well as necessary or existing conservation action and research. This information is provided as free text, but also collected in standardized classification schemes (following the standard system provided by [ 52 ]), which were analyzed here to obtain European distribution, threat and research information across taxa. Species presence in protected areas was also recorded (as presence in protected areas yes/no).
All analyses (Red List categories and totals by classification field) were carried out for the set of all species as well as for vertebrates, invertebrates and plants separately. To account for changes in the assessments since 2006, an updated dataset was created from the IUCN Red List version 2019–2. The percentage of threatened species was calculated as the “best estimate” as recommended by IUCN [ 53 ]: (EW + CR + EN + VU) / (total assessed - EX - DD). This method assumes that a similar relative percentage of the Data Deficient (DD) species is likely to be threatened. All subsequent analyses considered only species extant in the wild (i.e. excluding species categorized as EX, EW and RE). The ongoing and future threats recorded for extant species were analyzed based upon the classification schemes in the IUCN Red List. The highest threat level category was used [ 52 ], except for category 7 ‘natural system modifications’, where the second level was analyzed (i.e. ‘Fire & fire suppression’, ‘Dams & water management/use’ and ‘Other ecosystem modifications’).
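The “best estimate” calculation can be sketched in a few lines of Python. This is a minimal illustration of the IUCN formula as stated above; the 100-species category list in the example is hypothetical, not actual Red List data:

```python
from collections import Counter

def pct_threatened_best_estimate(categories):
    """IUCN "best estimate" of the percentage of threatened species.

    Extinct (EX) and Data Deficient (DD) species are excluded from the
    denominator, which assumes that DD species are threatened in the
    same relative proportion as data-sufficient species.
    """
    counts = Counter(categories)
    threatened = counts["EW"] + counts["CR"] + counts["EN"] + counts["VU"]
    denominator = len(categories) - counts["EX"] - counts["DD"]
    return 100 * threatened / denominator

# Hypothetical set of 100 assessments:
cats = (["LC"] * 70 + ["NT"] * 5 + ["VU"] * 10 + ["EN"] * 6
        + ["CR"] * 3 + ["DD"] * 5 + ["EX"] * 1)
print(round(pct_threatened_best_estimate(cats), 1))  # 20.2
```

Note that 19 of the 100 example species are threatened, but excluding the 1 EX and 5 DD species from the denominator yields 19/94 ≈ 20.2% rather than 19%.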
For each species, assessors were asked to produce the most accurate depiction of a taxon’s current and historical distribution based on their knowledge and the available data. Data sources informing the production of range maps have changed over the various European Red Lists as a result of the increasing availability of digitized georeferenced locality record data (e.g. the Global Biodiversity Information Facility (GBIF), frequently viewed through the Geospatial Conservation Assessment Tool (GeoCAT), which was launched in 2011 [ 54 ]). The general approach has been for assessors to compile and review available locality records for a taxon, and then produce polygons that encompass the known (locality records) and inferred (based on ecological requirements of the taxon) range of the taxon. Freshwater taxa (fishes, molluscs, Odonata, aquatic plants) were mapped to river sub-catchments (HydroBASINS or earlier iterations). All distribution maps were produced as polygon GIS shapefiles in WGS 1984 (World Geodetic System 1984 projection; see [ 55 ] for metadata requirements). For detailed mapping methodology, see the individual European Red List reports. The species richness maps presented in this publication were analyzed using a geodesic discrete global grid system, defined on an icosahedron and projected to the sphere using the inverse Icosahedral Snyder Equal Area (ISEA) projection (S39). This corresponds to a hexagonal grid composed of individual units (cells) that retain their shape and area (864 km²) throughout the globe. For the spatial analyses, only the extant (resident) and possibly extant (resident) distributions of each species were converted to the hexagonal grid; polygons coded as ‘possibly extinct’, ‘extinct’, ‘re-introduced’, ‘introduced’, ‘vagrant’ and/or ‘presence uncertain’ were not considered in the analyses. Coastal cells were clipped to the coastline.
Thus, patterns of species richness were mapped by counting the number of species in each cell (or cell section, for species with a coastal distribution). Data Deficient species and species that were only mapped to country-level were excluded from the analysis. Patterns of threatened species richness (Categories CR, EN, VU) were mapped by counting the number of threatened species in each cell or cell section.
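Under these conventions, the per-cell counting reduces to a simple tally once ranges have been overlaid on the grid. The sketch below assumes a flattened input of (species, category, cell) records, i.e. one record per grid cell that a species’ range polygon intersects; the cell identifiers stand in for the ISEA hexagon IDs, and the GIS overlay step itself is outside the scope of this illustration:

```python
from collections import defaultdict

THREATENED = frozenset({"CR", "EN", "VU"})

def richness_per_cell(occurrences):
    """Count total and threatened species per grid cell.

    `occurrences` is an iterable of (species, category, cell) tuples.
    Data Deficient species are excluded from the maps, as in the text.
    Returns two dicts: cell -> species richness, cell -> threatened richness.
    """
    total = defaultdict(set)
    threatened = defaultdict(set)
    for species, category, cell in occurrences:
        if category == "DD":
            continue  # excluded from the richness maps
        total[cell].add(species)
        if category in THREATENED:
            threatened[cell].add(species)
    return ({c: len(s) for c, s in total.items()},
            {c: len(s) for c, s in threatened.items()})

# Toy example: two cells, with one DD species excluded.
occ = [("sp1", "LC", "A"), ("sp2", "VU", "A"), ("sp2", "VU", "B"), ("sp3", "DD", "B")]
print(richness_per_cell(occ))  # ({'A': 2, 'B': 1}, {'A': 1, 'B': 1})
```

Using sets rather than raw counts guards against a species being tallied twice in one cell when several of its range polygons overlap the same hexagon.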
S1 Fig. Spatial extent of European Red List assessments for most terrestrial and freshwater taxa (orange), marine mammals (light blue) and marine fishes (dark blue).
S2 Fig. IUCN Red List Categories and number of species assessed for Europe by taxonomic group (groups marked with * have not been assessed comprehensively; black lines indicate the best estimate for the proportion of extant species considered to be threatened).
Seven mollusc species have been classed as both freshwater and terrestrial and are listed in both groups.
S3 Fig. Emerging future threats to biodiversity in Europe for all species, and for vertebrates, invertebrates and plants separately (CR: Critically Endangered, EN: Endangered, VU: Vulnerable, DD: Data Deficient, NT: Near Threatened, LC: Least Concern; N: All species = 14,669, Vertebrates = 2,494, Invertebrates = 7,600, Plants = 4,575).
S4 Fig. Major research needs in Europe as classified by the Red List assessors for all species, and for vertebrates, invertebrates and plants separately (CR: Critically Endangered, EN: Endangered, VU: Vulnerable, DD: Data Deficient, NT: Near Threatened, LC: Least Concern; N: All species = 14,669, Vertebrates = 2,494, Invertebrates = 7,600, Plants = 4,575).
S5 Fig. Number of threatened terrestrial and freshwater species across Europe (i.e. Red List categories CR, EN, VU).
The European Red List assessments have been compiled by numerous species experts, many of whom are affiliated with the IUCN Species Survival Commission and are listed as co-authors of the assessments on the IUCN Red List of Threatened Species. The views expressed in this publication do not necessarily reflect those of IUCN or those of the EC. The designation of geographical entities in this paper, and the presentation of the material, do not imply the expression of any opinion whatsoever on the part of IUCN or the EC concerning the legal status of any country, territory, or area, or of its authorities, or concerning the delimitation of its frontiers or boundaries.
- 4. IPBES. Global assessment report on biodiversity and ecosystem services of the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services. IPBES secretariat, Bonn, Germany; 2019.
- 12. IUCN. The IUCN Red List of Threatened Species. https://www.iucnredlist.org/ (Accessed 25.05.2023)
- 13. BirdLife International. European Red List of Birds. Publications Office of the European Union, Luxembourg; 2015.
- 24. Nieto A, Roberts SPM, Kemp J, Rasmont P, Kuhlmann M, García Criado M, et al. European Red List of bees. Publications Office of the European Union, Luxembourg; 2014.
- 25. Potts S, Dauber J, Hochkirch A, Oteman B, Roy D, Ahnre K, et al. Proposal for an EU Pollinator Monitoring Scheme. EUR 30416 EN, Publications Office of the European Union, Luxembourg; 2020.
- 26. Cuttelod A, Seddon MB, Neubert E. European Red List of non-marine molluscs. Publications Office of the European Union, Luxembourg; 2011.
- 27. Rivers M, Beech E, Bazos I, Bogunić F, Buira A, Caković D, et al. European Red List of trees. Publications Office of the European Union, Luxembourg; 2019.
- 28. Freyhof J, Brooks E. European Red List of freshwater fishes. Publications Office of the European Union, Luxembourg; 2011.
- 29. Hochkirch A, Nieto A, García Criado M, Cálix M, Braud Y, Buzzetti FM, et al. European Red List of grasshoppers, crickets and bush-crickets. Publications Office of the European Union, Luxembourg; 2016.
- 30. UN. Kunming-Montreal Global Biodiversity Framework. Conference of the Parties to the Convention on Biological Diversity CBD/COP/DEC/15/4. 2022. 15 pp.
- 33. Monasterio León Y, Ruiz Carreira C, Escobés Jiménez R, Almunia J, Wiemers M, Vujić A et al. Canarian Islands endemic pollinators of the Laurel Forest zone–Conservation plan 2023–2028. Publications Office of the European Union, Luxembourg; 2023.
- 34. Vujić A, Miličić M, Milosavljević MJ, van Steenis J, Macadam C, Raser J et al. Hoverflies specialised to veteran trees in Europe–Conservation Action Plan 2023–2030. Publications Office of the European Union, Luxembourg; 2023.
- 35. Michez D, Radchenko V, Macadam C, Wilkins V, Raser J, Hochkirch A. Teasel-plant specialised bees in Europe—Conservation action plan 2023–2030. Publications Office of the European Union, Luxembourg; 2023.
- 38. Temple HJ, Cox NA. European Red List of Amphibians. Publications Office of the European Union, Luxembourg; 2009.
- 39. Nieto A, Ralph GM, Comeros-Raynal MT, Kemp J, García Criado M, Allen DJ, et al. European Red List of marine fishes. Publications Office of the European Union, Luxembourg; 2015.
- 40. Temple HJ, Terry A. The status and distribution of European mammals. Publications Office of the European Union, Luxembourg; 2007.
- 41. Cox NA, Temple HJ. European Red List of reptiles. Publications Office of the European Union, Luxembourg; 2009.
- 42. Nieto A, Alexander K. European Red List of saproxylic beetles. Publications Office of the European Union, Luxembourg; 2010.
- 43. Cálix M, Alexander KNA, Nieto A, Dodelin B, Soldati F, Telnov D, et al. European Red List of saproxylic beetles. Publications Office of the European Union, Luxembourg; 2018.
- 44. van Swaay C, Cuttelod A, Collins S, Maes D, López Munguira M, Šašić M, et al. European Red List of Butterflies. Publications Office of the European Union, Luxembourg; 2010.
- 45. Neubert E, Seddon MB, Allen DJ, Arrébola J, Backeljau T, Balashov I, et al. European Red List of Terrestrial Molluscs. Publications Office of the European Union, Luxembourg; 2019.
- 46. Bilz M, Kell SP, Maxted N, Lansdown R. European Red List of vascular plants. Publications Office of the European Union, Luxembourg; 2011.
- 47. Allen DJ, Bilz M, Leaman DJ, Miller RM, Timoshyna A, Window J. European Red List of medicinal plants. Publications Office of the European Union, Luxembourg; 2014.
- 48. García Criado M, Väre H, Nieto A, Bento Elias R, Dyer RA, Ivanenko Y, et al. European Red List of lycopods and ferns. Publications Office of the European Union, Luxembourg; 2018.
- 49. Hodgetts N, Cálix M, Englefield E, Fettes N, García Criado M, Patin L, et al. A miniature world in decline: European Red List of Mosses, Liverworts and Hornworts. Publications Office of the European Union, Luxembourg; 2019.
- 50. IUCN. Guidelines for Application of IUCN Red List Criteria at Regional and National Levels, Version 4.0. IUCN, Gland; Switzerland and Cambridge, UK; 2012.
- 51. IUCN. IUCN Red List Categories and Criteria: Version 3.1. Second edition. IUCN, Gland, Switzerland and Cambridge, UK; 2012.
- 53. IUCN. Summary Statistics. https://www.iucnredlist.org/resources/summary-statistics (Accessed 20.02.2022).
- 55. IUCN. Mapping Standards and Data Quality for the IUCN Red List Spatial Data. Version 1.19 (May 2021). IUCN SSC Red List Technical Working Group, Gland; Switzerland and Cambridge; 2021.