Performance-Based Assessment: How to Implement It in the Classroom

Top Hat Staff

Performance assessment is an increasingly common assessment method that offers significant advantages over traditional high-stakes testing.

What is a performance assessment?

Performance assessment is a summative assessment tool that is used as a substitute for high-stakes testing. It’s intended to focus more on practical or applied skills—more “do you know how to use your knowledge?” versus “tell me what you know.” Other common terms include “authentic assessment” or “performance-based assessment.”

So, what is a performance-based assessment? It can be an individual or group project, a portfolio (with potentially one or more pieces foregrounded) or an open-ended response exercise. The creation process of the work is then graded according to a set of pre-agreed criteria or a checklist, shared with the student in advance.

This is the “performance” part of the “performance assessment”—and this accountability for the process is what sets it apart from grading a regular assignment.

Performance assessment: Why now, and why in higher education?

Standardized testing is becoming increasingly outdated in K–12 contexts, according to a report published jointly by the Massachusetts Consortium for Innovative Education Assessment and the Center for Collaborative Education. This kind of traditional testing exacerbates socioeconomic differences while failing to properly assess skills before students reach higher education. A new Quality Performance Assessment scheme is underway to “engage students in ways that standardized tests cannot, giving students more say in how they demonstrate their knowledge in culturally responsive ways.” 1

If this kind of shift in performance-based assessment is truly underway for the freshmen of the future, performance assessment is worth considering sooner rather than later.

Why use performance assessment?

Here are some benefits of performance assessment over standardized testing:

1. Performance assessment looks at higher-order thinking skills and problem-solving abilities. Other features like time management and clear communication are also tested in these kinds of assessments. This ultimately leads to a deeper and more meaningful learning process.

2. High-stakes standardized testing evaluates whether students know enough about a subject. Performance assessments, on the other hand, measure whether students can apply the knowledge appropriately in various contexts.

3. If interim goals are created and applied correctly, performance assessments allow students to monitor themselves. This type of metacognition, particularly in a test environment, is enormously beneficial to higher-level student learning.

4. Instructors who use performance assessments need to build into the curriculum the standards they expect and the steps students must take in applying their knowledge. This makes “teaching to the test” a positive teaching and learning strategy.

5. Performance assessments go hand-in-hand with modern teaching strategies like active learning and critical thinking. If a student undertakes collaboration and discussion in a classroom context (and in formative assessment), those learned skills will be more easily applied and evaluated in summative assessments, and eventually reflected in students’ performance. 2

How do performance assessments work?

The educator sets a task with more than one route to completion, or a complex problem that leaves considerable leeway for interpretation.

Students must reach an answer—but the answer is not the most important part. Rather, the journey is the destination. Students must demonstrate competencies in production, communication and applying their content knowledge.

The most effective way of measuring this is by assigning a list of performance tasks, along with an achievement level for each. This list should be reasonably comprehensive and scoring for each task should take place on a scale.

These tasks can reflect industry best practices. A performance assessment example for a computer science class could be answering “Did the candidate effectively document their code?” That task could be measured on a grade of “not achieved,” “partly achieved” or “fully achieved.” A performance assessment example for an art class could be “Did the student correctly gather requirements to complete the project?” Final scores can then be calculated from this list.
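To make the arithmetic behind “final scores can then be calculated from this list” concrete, here is a minimal sketch of one way to tally a task checklist. The task names, the three-level scale, and the equal weighting are illustrative assumptions, not part of any particular rubric.

```python
# Minimal sketch: scoring a performance assessment from a task checklist.
# Task names, the three-level scale, and equal weighting are assumptions.

ACHIEVEMENT_POINTS = {
    "not achieved": 0,
    "partly achieved": 1,
    "fully achieved": 2,
}

def final_score(ratings: dict) -> float:
    """Average the per-task points and rescale to a percentage."""
    points = [ACHIEVEMENT_POINTS[level] for level in ratings.values()]
    return 100 * sum(points) / (2 * len(points))

ratings = {
    "Effectively documented their code": "fully achieved",
    "Correctly gathered project requirements": "partly achieved",
    "Communicated results clearly": "fully achieved",
}
print(f"Final score: {final_score(ratings):.0f}%")  # -> Final score: 83%
```

A weighted version would simply attach a multiplier to each task; the point is that the scale for each task is agreed in advance and the calculation is transparent to students.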

The key to performance assessment is that students develop their own approach to the task while understanding the standards against which they will be evaluated.

What you need to make performance assessment successful

1. Templates and scoring rubrics

For students, performance assessments are a balancing act between the open-ended nature of the project and the competencies and mastery they need to demonstrate to meet learning objectives. You can share the full guidelines for how the project will be graded with your students in advance. Instructors can also build templates for intermittent assessments that explain what must happen at each stage and when—for instance, one for an abstract, one for a first draft, and one for the final presentation.

2. Examples/benchmarks: Good (or bad)

Students who have been set open-ended tasks for summative assessments will find previous examples crucial to success. These examples could be ‘ideal’ versions of work for them to follow. However, they could also be flawed or low-quality work used as part of a teaching activity: students can evaluate and discuss in class what they would improve, why and how.

3. Help your students prepare and practice

Although many of your students will have participated in performance assessments in the past, there will be others to whom the concept is completely new. Setting milestones, in the form of mini-performance assessments, in preparation for the final tally will help them get used to thinking in a new way. This may help reduce anxiety that might affect their overall performance.

4. Leverage your community

Performance assessments rarely touch on just a single course—they are almost always interdisciplinary. Rather than producing a performance assessment, and all the communications around it, on your own, get assistance from fellow instructors in your field. This can also be a form of professional development for instructors. After all, if performance assessment is meant to measure real-world application of knowledge rather than producing another version of your lessons, your tasks should reflect real-world situations. And reality is seldom based on a single subject area. 3

  • Famularo, J., French, D., Noonan, J., Schneider, J., & Sienkiewicz, E. (2018). Beyond Standardized Tests: A New Vision for Assessing Student Learning and School Quality. [White paper]. Retrieved from http://cce.org/files/MCIEA-White-Paper_Beyond-Standardized-Tests.pdf
  • Hibbard, K. M., et al. (1996). A Teacher’s Guide to Performance-Based Learning and Assessment. Alexandria, VA: Association for Supervision & Curriculum Development.
  • Performance Assessment. [White paper]. Retrieved from https://www.learner.org/workshops/socialstudies/pdf/session7/7.PerformanceAssessment.pdf


What Is Performance Assessment?


Project-based learning is nothing new. More than 100 years ago, progressive educator William Heard Kilpatrick published “The Project Method,” a monograph that took the first stab at defining alternatives to direct instruction. Predictably, the document sparked a squabble over definitions and methods—between Kilpatrick and his friend and colleague John Dewey.

Not much has changed. Today, despite major advances in ways to measure learning, we still don’t have common definitions for project-based learning or performance assessment.

Sometimes, for example, performance assessment is framed as the opposite of the dreaded year-end, state-required multiple-choice tests used to report on schools’ progress. But in fact, many performance assessments are standardized and can—and do—produce valid and reliable results.

Experts also emphasize the “authentic” nature of performance assessment and project-based learning, although “authentic” doesn’t always mean lifelike: A good performance assessment can use simulations, as long as they are faithful to real-world situations. (An example: In science class, technology can simulate plant growth or land erosion, processes that take too long for a hands-on experiment.)
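As a concrete illustration of the kind of classroom simulation just mentioned, the short sketch below compresses plant growth into a few lines using a logistic growth model. The growth rate, carrying capacity, and starting height are arbitrary assumptions chosen for demonstration, not values from any curriculum.

```python
# Illustrative sketch of a classroom-style simulation: logistic plant
# growth compressed from weeks of real time into instant computation.
# Rate, capacity, and initial height are arbitrary assumptions.

def simulate_growth(days, rate=0.35, capacity_cm=120.0, initial_cm=2.0):
    """Return the plant's simulated height (cm) for each day."""
    heights = [initial_cm]
    for _ in range(days):
        h = heights[-1]
        heights.append(h + rate * h * (1 - h / capacity_cm))
    return heights

for day, height in enumerate(simulate_growth(30)[::5]):
    print(f"day {day * 5:2d}: {height:6.1f} cm")
```

A student can change the inputs and immediately see the effect on the growth curve, the kind of faithful-to-reality manipulation the article describes.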

In the absence of agreed-upon definitions for this evolving field, Education Week reporters developed a glossary based on interviews with teachers, assessment experts, and policy analysts. They’ve organized the terms here generally from less specific to more specific. These terms aren’t mutually exclusive. (A performance assessment, for instance, may be one element of a competency-based education program.)

Proficiency-based or competency-based learning: These terms are interchangeable. They refer to the practice of allowing students to progress in their learning as they master a set of standards or competencies. Students can advance at different rates. Typically, there is an attempt to build students’ ownership and understanding of their learning goals and often a focus on “personalizing” students’ learning based on their needs and interests.

Project-based learning: Students learn through an extended project, which may have a number of checkpoints or assessments along the way. Key features are inquiry, exploration, the extended duration of the project, and iteration (requiring students to revise and reflect, for example). A subset of project-based learning is problem-based learning, which focuses on a specific challenge for which students must find a solution.

Standards-based grading: This refers to the practice of giving students nuanced and detailed descriptions of their performance against specific criteria or standards, not on a bell curve. It can stand alone or exist alongside traditional letter grading.

Performance assessment: This assessment measures how well students apply their knowledge, skills, and abilities to authentic problems. The key feature is that it requires the student to produce something, such as a report, experiment, or performance, which is scored against specific criteria.

Portfolio: This assessment consists of a body of student work collected over an extended period, from a few weeks to a year or more. This work can be produced in response to a test prompt or assignment but is often simply drawn from everyday classroom tasks. Frequently, portfolios also contain an element of student reflection.

Exhibition: A type of performance assessment that requires a public presentation, as in the sciences or performing arts. Other fields can also require an exhibition component. Students might be required, for instance, to justify their position in an oral presentation or debate.

Performance task: A piece of work students are asked to do to show how well they apply their knowledge, skills, or abilities—from writing an essay to diagnosing and fixing a broken circuit. A performance assessment typically consists of several performance tasks. Performance tasks also may be included in traditional multiple-choice tests.

With thanks to: Paul Leather, director for state and local partnerships at the Center for Innovation in Education; Mark Barnes, founder of Times 10 Publications; Peter Ross, principal at Education First; Scott Marion, executive director at the Center for Assessment; Sean P. “Jack” Buckley, president, Imbellus; Starr Sackstein, an educator and opinion blogger at edweek.org; and Steve Ferrara, senior adviser at Measured Progress.


A version of this article appeared in the February 06, 2019 edition of Education Week as Performance Assessment: A Guide to the Vocabulary


Encyclopedia of Language and Education, pp. 2251–2262

Task and Performance Based Assessment

Gillian Wigglesworth

Introduction

A performance test is “a test in which the ability of candidates to perform particular tasks, usually associated with job or study requirements, is assessed” (Davies et al., 1999, p. 144). In the assessment of second languages, tasks are designed to measure learners’ productive language skills through performances which allow candidates to demonstrate the kinds of language skills that may be required in a real-world context. For example, a test candidate whose language is being evaluated for the purposes of entry into an English-speaking university or college might be asked to write a short academic essay, or an overseas-qualified doctor might participate in a job-specific role play with a ‘patient’ interviewer. These kinds of assessments are increasingly used in specific workplace language evaluations, and in educational contexts to evaluate language gains during a period of teaching.

The relationship between task and performance testing is a complex one. In the context...


References

Bachman, L.: 1990, Fundamental Considerations in Language Testing, Oxford University Press, Oxford.

Bachman, L.: 2002, ‘Some reflections on task-based language performance assessment’, Language Testing 19(4), 453–476.

Bachman, L. and Palmer, A.: 1996, Language Testing in Practice, Oxford University Press, Oxford.

Brindley, G.: 2001, ‘Outcomes-based assessment in practice: Some examples and emerging insights’, Language Testing 18(4), 393–407.

Brindley, G. and Slatyer, H.: 2002, ‘Exploring task difficulty in ESL listening assessment’, Language Testing 19(4), 369–394.

Brown, A.: 1995, ‘The effect of rater variables in the development of an occupation-specific language performance test’, Language Testing 12(1), 1–15.

Brown, A.: 2003, ‘Interviewer variation and the co-construction of speaking proficiency’, Language Testing 20(1), 1–25.

Brown, J.D., Hudson, T., Norris, J., and Bonk, W.J.: 2002, An Investigation of Second Language Task-Based Performance Assessments, Technical report no. 24, University of Hawaii Press, Honolulu.

Chalhoub-Deville, M.: 2001, ‘Task based assessments: Characteristics and validity evidence’, in M. Bygate, P. Skehan, and M. Swain (eds.), Researching Pedagogic Tasks: Second Language Learning, Teaching and Testing, Longman, Harlow.

Cumming, A., Grant, L., Mulcahy-Ernt, P., and Powers, D.: 2004, ‘A teacher-verification study of speaking and writing prototype tasks for a new TOEFL’, Language Testing 21(2), 107–145.

Davies, A., Brown, A., Elder, C., Hill, K., Lumley, T., and McNamara, T.: 1999, Dictionary of Language Testing, Cambridge University Press, Cambridge.

Douglas, D.: 2000, Assessing Languages for Specific Purposes, Cambridge University Press, Cambridge.

Elder, C.: 1993, ‘How do subject specialists construe classroom language proficiency’, Language Testing 10(3), 235–254.

Elder, C. and Brown, A.: 1997, ‘Performance testing for the professions: Language proficiency or strategic competence?’, Melbourne Papers in Language Testing 6(1), 68–78.

Elder, C. and Iwashita, N.: 2005, ‘Planning for test performance: Does it make a difference?’, in R. Ellis (ed.), Planning and Task Performance in a Second Language, John Benjamins, Philadelphia, 219–238.

Elder, C., Iwashita, N., and McNamara, T.: 2002, ‘Estimating the difficulty of oral proficiency tasks: What does the test-taker have to offer?’, Language Testing 19(4), 347–368.

Elder, C. and Wigglesworth, G.: 2005, ‘An investigation of the effectiveness and validity of planning time in part 2 of the oral module’, Report for IELTS Australia.

Ellis, R. (ed.): 2005, Planning and Task Performance in a Second Language, John Benjamins, Philadelphia.

Ellis, R.: 2003, Task Based Language Learning, Oxford University Press, Oxford.

Ellis, R. and Yuan, F.: 2004, ‘The effects of planning on fluency, complexity, and accuracy in second language narrative writing’, Studies in Second Language Acquisition 26, 59–84.

Foster, P. and Skehan, P.: 1996, ‘The influence of planning and task type on second language performance’, Studies in Second Language Acquisition 18, 299–323.

Foster, P. and Skehan, P.: 1999, ‘The influence of source of planning and focus of planning on task-based performance’, Language Teaching Research 3, 299–324.

Iwashita, N., Elder, C., and McNamara, T.: 2001, ‘Can we predict task difficulty in an oral proficiency test? Exploring the potential of an information-processing approach to task design’, Language Learning 51(3), 401–436.

Lewkowicz, J.: 2000, ‘Authenticity in language testing’, Language Testing 17(1), 43–64.

Lumley, T.: 2002, ‘Assessment criteria in a large-scale writing test: What do they really mean to the raters?’, Language Testing 19(3), 246–276.

McNamara, T.: 1996, Measuring Second Language Performance, Longman, London.

McNamara, T.F. and Lumley, T.: 1997, ‘The effect of interlocutor and assessment mode variables in overseas assessments of speaking skills in occupational settings’, Language Testing 14(2), 140–156.

McNamara, T. and Roever, C.: Forthcoming, Language Testing: The Social Turn, Blackwell, London.

Mislevy, R.J., Steinberg, L.S., and Almond, R.G.: 2002, ‘Design and analysis in task-based language assessment’, Language Testing 19(4), 477–496.

Morton, J., Wigglesworth, G., and Williams, D.: 1997, ‘Approaches to validation: Evaluating interviewer performance in oral interaction tests’, in G. Brindley and G. Wigglesworth (eds.), Access: Issues in Language Test Design and Delivery, NCELTR, Sydney, 175–196.

Norris, J.: 2002, ‘Interpretations, intended uses and designs in task-based language assessment’, Language Testing 19(4), 337–346.

Norris, J.M., Brown, T.D., and Bonk, W.: 2002, ‘Examinee abilities and task difficulty in task-based second language performance assessment’, Language Testing 19(4), 395–418.

Norris, J.M., Brown, J.D., Hudson, T., and Yoshioka, J.: 1998, Designing Second Language Performance Assessments, University of Hawaii Press, Honolulu.

O'Sullivan, B.: 2002, ‘Learner acquaintanceship and oral proficiency test pair-task performance’, Language Testing 19(3), 277–295.

Skehan, P.: 1998, A Cognitive Approach to Language Learning, Oxford University Press, Oxford.

Skehan, P.: 2001, ‘Tasks and language performance’, in M. Bygate, P. Skehan, and M. Swain (eds.), Researching Pedagogic Tasks: Second Language Learning, Teaching and Testing, Longman, Harlow, 167–187.

Skehan, P. and Foster, P.: 1997, ‘Task type and task processing conditions as influence on foreign language performance’, Language Teaching Research 1, 185–211.

Skehan, P. and Foster, P.: 1999, ‘The influence of task structure and processing conditions on narrative retelling’, Language Learning 49(1), 93–120.

Spence-Brown, R.: 2001, ‘The eye of the beholder: Authenticity in an embedded assessment task’, Language Testing 18(4), 463–481.

Upshur, J. and Turner, C.: 1999, ‘Systematic effects in the rating of second language speaking ability: Test method and learner discourse’, Language Testing 16(1), 82–111.

Weigle, S.C.: 2002, Assessing Writing, Cambridge University Press, Cambridge.

Wigglesworth, G.: 1997, ‘An investigation of planning time and proficiency level on oral test discourse’, Language Testing 14(1), 85–106.

Wigglesworth, G.: 2001, ‘Influences on performance in task-based oral assessments’, in M. Bygate, P. Skehan, and M. Swain (eds.), Researching Pedagogic Tasks: Second Language Learning, Teaching and Testing, Longman, Harlow, 186–209.

Wu, W. and Stansfield, C.: 2001, ‘Toward authenticity of task in test development’, Language Testing 18(2), 187–206.

Yuan, F. and Ellis, R.: 2003, ‘The effects of pretask planning and on-line planning on fluency, complexity, and accuracy in L2 monologic oral production’, Applied Linguistics 24, 1–27.

Wigglesworth, G. (2008). Task and Performance Based Assessment. In: Hornberger, N.H. (ed.), Encyclopedia of Language and Education. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-30424-3_171
Performance-Based Assessment: A Comprehensive Overview

This article offers a comprehensive overview of performance-based assessment, including what it is, how it works, and its advantages and disadvantages.

Performance-based assessment (PBA) is an increasingly popular strategy for assessing student learning, offering a comprehensive and holistic approach that measures both knowledge and skills. PBA requires students to demonstrate their understanding of a concept or topic by applying their knowledge in a practical context. It can be used to assess a variety of subject areas, from science to language arts, and has been widely adopted by educators as a valuable tool to measure student growth and progress. In this article, we provide an overview of PBA, discuss its advantages and disadvantages, and explore strategies for implementing it successfully in the classroom.

PBA is a valuable method for evaluating student understanding and progress. This type of assessment requires students to demonstrate their knowledge in ways that go beyond traditional exams. PBA typically involves activities that assess a student's ability to apply, analyze, evaluate, and build on the knowledge they have acquired. These activities can include projects, simulations, role-playing, and hands-on activities. It is important to note that PBA does not replace traditional tests; rather, it provides an alternative method for assessing student learning. When designing PBA activities, it is important to keep in mind the goals and objectives of the assessment.

It is also important to consider the context in which the assessment will be conducted. For example, is it a summative or formative assessment? Is it used to measure mastery of a concept or skill or to assess a student's progress? Once these decisions are made, the assessment can be designed and implemented. Advantages of using performance-based assessment include the ability to assess higher-order thinking skills and provide more authentic evaluation of student learning. It also encourages students to be creative and take ownership of their learning. Disadvantages include the need for more time and resources for planning, implementation, and assessment.

It can also be more difficult for teachers to assess students' performance objectively. Implementing PBA in the classroom requires careful planning and designing. The first step is to identify the desired outcome of the assessment. Next, the teacher should select tasks that are appropriate for the students’ age and skill level. The tasks should be aligned with curriculum goals and be organized in a way that allows students to demonstrate their understanding.

The teacher should also provide clear instructions and criteria for success. There are many different types of PBA activities that can be used in the classroom. These include simulations, projects, portfolios, oral presentations, debates, role-playing, and hands-on activities. Each type of activity has its own benefits and challenges. For example, simulations allow students to apply their knowledge in a real-world context but may require more time and resources than other types of activities.

Projects allow students to explore topics in depth but may require more guidance from the teacher. The impact of performance-based assessment on student learning can be substantial. PBA encourages students to think critically and develop higher-order thinking skills such as analysis, synthesis, and evaluation. It also encourages students to take ownership of their learning and become more engaged in their studies. Finally, PBA provides students with an opportunity to demonstrate their mastery of concepts in an authentic way. When using performance-based assessment in the classroom, there are several things teachers should keep in mind.

First, it is important to provide clear instructions and criteria for success. Second, teachers should plan assessments carefully to ensure that they are appropriate for the students’ age and skill level. Third, teachers should provide feedback that is timely and constructive. Finally, teachers should differentiate instruction when necessary to ensure all students are able to participate fully in PBA activities. In conclusion, performance-based assessment is an effective way to evaluate student learning.

Impact of Performance-Based Assessment on Student Learning

This type of assessment also provides teachers with more detailed information about a student's comprehension of the material. One advantage of PBA is that it allows teachers to assess a student's knowledge in a more meaningful way. Unlike traditional exams that focus on memorization and recall, PBA requires students to show they understand the material by performing specific tasks. This type of assessment can be used to measure a student's problem-solving skills, critical thinking , and creativity.

Another benefit of PBA is that it allows for more individualized instruction. By examining each student's strengths and weaknesses, teachers can tailor the instruction to best meet the needs of each student. This type of assessment also allows for more effective feedback since teachers are able to provide more detailed guidance on how to improve. Finally, using PBA can be a motivating factor for students.

Recommendations for Teachers

This will help ensure that students are able to complete the assessment successfully.

2. Provide clear instructions and expectations – Teachers should provide clear instructions and expectations for the assessment. This will help ensure that students understand what is expected of them and are able to complete the assessment correctly.

3. Allow for collaboration – PBA can be used as an opportunity for students to collaborate with one another. This can help foster a sense of community amongst students and encourage them to work together to achieve success.

4. Monitor progress – Teachers should regularly monitor student progress and provide feedback when necessary.

How to Implement Performance-Based Assessment in the Classroom

One option is criterion-referenced assessment, which measures a student's performance against a pre-set standard. This type of assessment can be used to measure mastery of particular skills or knowledge in a subject. Another option is standardized testing, which measures a student's performance against the performance of other students. This type of assessment can be used to compare students on a larger scale.
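A minimal sketch of that distinction follows. The 80-percent mastery cutoff and the sample scores are hypothetical assumptions, used only to show how the two reference points differ.

```python
# Sketch contrasting the two reference points described above.
# The 80% mastery cutoff and the sample scores are hypothetical.

scores = {"Ana": 91, "Ben": 78, "Chi": 85, "Dev": 62}
CUTOFF = 80  # criterion-referenced: a fixed, pre-set standard

for name, score in scores.items():
    # Criterion-referenced: compared only against the fixed cutoff.
    mastered = score >= CUTOFF
    # Norm-referenced (standardized): rank within the group of test-takers.
    below = sum(1 for s in scores.values() if s < score)
    percentile = 100 * below / (len(scores) - 1)
    print(f"{name}: mastery={'yes' if mastered else 'no'}, "
          f"outperforms {percentile:.0f}% of peers")
```

The same raw score can look strong under one lens and weak under the other, which is why the choice of reference point should follow from the purpose of the assessment.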

Once the type of PBA has been decided, teachers must then decide how to assess student performance. This will depend on the type of assessment chosen, but may include tasks such as writing an essay, presenting a project, or completing a lab experiment. Each task should be designed to measure student understanding of the material and should be tailored to the specific objectives of the course. In addition, teachers must ensure that they provide clear instructions and expectations for each task.

What is Performance-Based Assessment?

PBA typically involves the use of projects, activities, and simulations that require students to demonstrate their understanding of the concepts they have learned. Projects may involve research, experiments, or other activities that require students to apply the knowledge they have acquired in order to complete the task. Activities and simulations may involve role-playing or game-like scenarios that require students to demonstrate their understanding of a particular concept or process. The advantages of using PBA are many.

It allows students to show their mastery of a subject in ways that go beyond simply memorizing facts and answering multiple choice questions. It also allows educators to assess the student's ability to think critically and apply the knowledge they have acquired in real-world situations. Additionally, it can be used as a form of formative assessment, providing feedback to both the student and the teacher on areas that need improvement. The disadvantages of using PBA include the fact that it can be time-consuming and resource-intensive, as it typically requires more planning and preparation than traditional exams.

Types of Performance-Based Assessment

  • Presentations
  • Performance tasks
  • Observations

PBA requires students to demonstrate their knowledge in ways that go beyond traditional exams, which can help them develop higher-order thinking skills and become more engaged in their learning. However, it can be challenging to implement PBA in the classroom, so it's important for teachers to have a plan for assessing student performance and addressing any challenges that may arise. Teachers should also be aware of the various types of PBA and the impact it can have on student learning and achievement. With careful planning and implementation, performance-based assessment can be a powerful tool for promoting student learning.

Shahid Lakha

Shahid Lakha is a seasoned educational consultant with a rich history in the independent education sector and EdTech. With a solid background in Physics, Shahid has cultivated a career that spans tutoring, consulting, and entrepreneurship. As an Educational Consultant at Spires Online Tutoring since October 2016, he has been instrumental in fostering educational excellence in the online tutoring space. Shahid is also the founder and director of Specialist Science Tutors, a tutoring agency based in West London, where he has successfully managed various facets of the business, including marketing, web design, and client relationships. His dedication to education is further evidenced by his role as a self-employed tutor, where he has been teaching Maths, Physics, and Engineering to students up to university level since September 2011. Shahid holds a Master of Science in Photon Science from the University of Manchester and a Bachelor of Science in Physics from the University of Bath.

The Ultimate Guide to Performance-Based Assessments

What is a performance-based assessment?

Performance-based assessments move beyond multiple-choice and written tests in order to determine not only what students know but how they apply their knowledge. Occasionally called authentic assessments, performance assessments emphasize the importance of “real-world” application.

The real-world emphasis of performance assessments occurs not only in the final assessment – are the skills being assessed translatable to skills students would need to use in the real world? –  but also in the ongoing instruction that occurs both before and during the performance assessment.


What are the two types of performance-based assessment?

Generally, performance-based assessments fall under two broad umbrellas: they can end either in some sort of product or in some sort of performance.

Typically, a product-oriented performance assessment ends with the students producing some sort of tangible element not only encapsulating the summation of the knowledge they’ve gained but demonstrating that they’ve learned how to apply gained knowledge. For example, a student may grow a garden, create a budget, write an argument, or build a model.

Performance-oriented performance assessments, on the other hand, allow students to interact with an audience to demonstrate their applied knowledge. For example, a student may participate in a debate, perform a piece of music they composed, engage in a mock interview, or teach their class how to cook a meal.


Regardless of which of the two a teacher chooses to end the performance assessment with, it is important that the final grading of either is process-based. True performance assessments value the learning process as much as – if not more than – the final result. The majority of the learning should occur not before the performance assessment begins but along the way.

Additionally, performance assessments highly value processes that move students beyond acquiring knowledge to thinking critically about what they are learning and how they can best apply what they are learning. Whether an assessment ends in a product or performance, students should be able to articulate both how and why they ended their assessment in the way that they did.

What are the key features of a performance assessment?

Proponents of performance assessments argue that traditional tests are passive and do not accurately reflect what a student truly knows and can apply. Thus, performance-based assessments require students to actively engage with the material they are learning in authentic, practical, real-life scenarios.

Although there are no set key features of performance assessments, most writers and educators agree that assessments claiming to be performance-based must include the following elements:

  • Performance assessments must be complex . Very rarely in the real world will a person need only one skill to complete a work or life task. Consequently, performance-based assessments must require students to draw from a variety of skills and wells of information in order to complete their assessment. Thus, performance-based assessments are ideal for educators wishing to collaborate across departments and subjects.
  • Performance assessments must be authentic to real-world situations. The underlying purpose of performance assessments – and of education itself, according to most educators – is to ensure students are prepared to leave school ready for any challenge. Thus, a solid performance-based assessment will reflect scenarios students may face outside the school setting.
  • Performance assessments must be open-ended. Just as we rarely find real-life situations with only one right solution, well-written performance assessments allow students to explore the topic in a way that they could potentially arrive at a number of “correct” solutions, or present a final product in a variety of different ways and still receive full credit.


  • Performance assessments must be process-oriented, and often have an end product to present. Similar to the open-ended point, performance assessments are most successful when students have multiple ways of accomplishing them. The best performance assessments provide students with opportunities for exploration, learning, analysis, and other higher-level thinking processes as they complete the task. The final score isn't entirely about the end result, but rather rests heavily on how students used what they know to get to the end result. Often – although not always – students will have some sort of product to present at the end of their assessment. Although this may be a creative work, it can also be something as simple as a decision or a recommendation. For example, students might compare several job offers, analyzing the salaries, health insurance plans, retirement benefits, and other options, balancing those against their budgets – rent, groceries, and other expenses – as well as their own talents and passions, and deciding which job they should take (a small worked sketch of this comparison appears after this list).
  • Performance assessments must require higher-order thinking. In performance assessments, the score does not rest on the final result, as mentioned above. Instead, a large portion of a student’s score comes from the students’ ability to demonstrate that they have the knowledge and skills to complete the assessment successfully. In performance assessments, students must demonstrate problem-solving, critical thinking, and analytical reasoning skills. Most performance assessments require students to synthesize, apply evidence, analyze, critique, judge, and more in order to pass. This ensures that teachers are able to see that student learning has truly transferred to their ability to apply what they have learned to real-life situations.
  • Performance assessments must be graded on a clear rubric. It is important before beginning that students understand exactly what is expected of them, especially as many will never have been given this type of open-ended, performance-based assessment before. Students will not be able to successfully explore and create if they don't clearly understand their boundaries. Additionally, the rubric should set a clear timeline. Performance-based assessments can range from taking a few hours to a few months. There is no single correct time frame; however, it must be clear at the outset. It's not wrong to set up a tight time frame either; in the real world, projects often have very quick turnarounds.
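The job-offer comparison mentioned above lends itself to a small worked example. All figures below (salaries, benefits values, monthly expenses) are invented for illustration; in a real assessment students would gather and justify their own numbers.

```python
# Worked sketch of the job-offer comparison task described above.
# All figures are invented for illustration.

monthly_expenses = 2600  # assumed rent, groceries, and other costs

offers = {
    "City startup": {"salary": 58000, "benefits_value": 4000},
    "Suburban firm": {"salary": 52000, "benefits_value": 9000},
    "Remote nonprofit": {"salary": 48000, "benefits_value": 7000},
}

for name, offer in offers.items():
    total = offer["salary"] + offer["benefits_value"]
    surplus = total / 12 - monthly_expenses  # rough monthly leftover
    print(f"{name}: total package ${total:,}, ~${surplus:,.0f}/month left over")
```

Note that the arithmetic is only one input: the assessment asks students to weigh these numbers against their own talents and passions, so the "answer" is a justified decision rather than a single correct figure.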

Why are performance-based assessments important?

Due to the constantly expanding reach of technology and its dramatic impact on almost every area of life, job expectations and even jobs themselves are constantly changing and evolving. It’s no longer enough to teach students knowledge; students can find that knowledge with a few clicks of a mouse. Instead, teachers must teach students how to use that knowledge in equally evolving, creative ways. 

Traditional assessments – while still helpful for some purposes – simply aren't enough anymore to ensure students are leaving school well prepared for real life and the constant changes it brings.

The major advantage of performance-based assessments, on the other hand, is that they do exactly that. Teachers can be confident that students who pass performance-based assessments are ready to face the challenges of the real world with creativity, analysis, and excellent problem-solving skills because those skills are exactly what performance assessments are designed to impart to students.

Performance assessments require students to take what they have learned and not simply recall it but apply it in a variety of ways to different situations. They provide students with the opportunity to not only find the knowledge they need to solve a problem but find ways to process and analyze that knowledge to make a good – not ‘correct’ – decision.

Moreover, another advantage of performance-based assessments is that students are not focused fully on the end product but on how to get there. This encourages a love of learning, as well as increased metacognition as they process where they are and what they need to learn to get where they want to be. As part of their scoring, they must be able to articulate not only their ‘final answer’ and product but explain how they got there. Performance assessments build communication with others as well as self-reflection, both of which are essential skills in today’s world.

Performance-based assessments encourage students to make good decisions rather than search for the correct answer.

Finally, one of performance assessments' greatest advantages is that they allow students to see how different subjects not only overlap but are useful and applicable in the real world. This usually motivates students to learn and invest effort as they see how the assessments – and the learning that comes with them – will benefit them in the long run.

What do teachers do in performance-based assessment?

It is important that teachers interested in moving toward performance-based assessments understand that the shift involves a significant amount of work at the beginning. It takes far more time to set up than more traditional approaches to instruction and assessment. However, the time teachers may feel is lost initially is more than made up for during the assessment process. Many teachers find themselves in more of a "guide" position, coming alongside students as the students take ownership of their learning exploration.

To begin a performance-based assessment, teachers must actually think first about the very end goal. What standards and learning objectives do the teachers want to measure? The entire assessment must be designed backward from this starting – or ending – goal.

Then teachers must consider what type of process or product students could complete demonstrating mastery of these standards and objectives. As they design the performance assessment, teachers must ensure the assessment is complex, allowing students the opportunity to apply a variety of skills and knowledge, ideally across subject areas; is authentic, allowing students to practice applying their skills to real-world situations; is open-ended, allowing students to explore a variety of solutions instead of reaching only one narrow “right choice;” is truly process-oriented, allowing students to explore and learn during the process; and requires higher-order thinking, challenging students to move beyond retention towards application.


It is important then for teachers to create very clear rubrics. These performance rubrics should highlight the standards and skills to be mastered and demonstrated but should leave room for the students to determine at some level how they will demonstrate mastery. There should be some element in the rubric that requires students to explain how they developed their final product, supporting it with evidence and analysis. Rubrics should clearly outline the timeline for the project.
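To make those rubric requirements concrete, here is a hypothetical rubric template expressed as a small data structure. Every field name and value is an invented assumption; it simply mirrors the elements just described: standards, room for student choice, a required explanation component, and an explicit timeline.

```python
# Hypothetical rubric template mirroring the elements described above.
# All field names and values are invented for illustration.

rubric = {
    "standards": [
        "Applies knowledge from at least two subject areas",
        "Supports the final product with evidence and analysis",
        "Explains how the final product was developed",
    ],
    "demonstration": "student's choice of product or performance",
    "levels": ["not achieved", "partly achieved", "fully achieved"],
    "timeline": {"proposal due": "week 1", "draft due": "week 3",
                 "final due": "week 6"},
}

for standard in rubric["standards"]:
    print(f"- {standard} (scored: {' / '.join(rubric['levels'])})")
for milestone, week in rubric["timeline"].items():
    print(f"{milestone}: {week}")
```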

Additionally, teachers may want to brainstorm a variety of formative assessments they can give along the way to ensure students are on the right track. Teachers should have intervention plans in place if students do not perform as expected on the formative assessments.

Once all of these elements are in place, teachers must determine the best way to clearly explain the performance assessment to the students. As many performance-based assessments are done in tandem with a teacher's instruction, even incorporating that instruction into the assessment, it is usually best to introduce the assessment along with the unit introduction. Often teachers will tie these together with one or more open-ended questions to guide the unit.

It is also important for teachers to take the time to help students understand the authenticity of the performance assessment – how developing the skills across disciplines and applying them to the performance assessment will help them tackle similar real-life situations outside of school. Only then can teachers expect the buy-in helpful in making performance-based assessments such valuable learning tools.

Designing Performance Assessment Tasks

  • Meaningful context. "Good performance assessments are more contextualized" than traditional tests, he says, "more like how people use knowledge and skills in the larger world." Unlike many multiple-choice tests, good tasks do not jump from one area of knowledge to another.
  • Thinking process. "Ask students to actually use knowledge," he says, "to thoughtfully apply knowledge and skills to a new situation. If you really understand something, you can work with it, analyze it, argue against it, and present it." Educators should ask of their assessments, "Could students accomplish this task and still not understand what we want to assess?"
  • Appropriate product or performance. Avoid "products or performances that don't relate to the content" of what is being assessed, even though they may seem like good activities on their own. "Sometimes students get so caught up in the product that they lose sight of what they're actually intending to show with the product." One common problem is an overemphasis on aesthetic elements of an assessment task.
  • Student choice. "Student choice has lots of benefits," McTighe says, "but you want to make sure that opportunities for choice don't get in the way of what you're trying to assess." Allowing students to choose subjects, resources, methods, and whether to work alone or in groups has instructional benefits, but complicates assessment. "From a measurement perspective, giving students choices is a terrible dilemma," Herman agrees. Some options or topics may yield easier projects than others, and "not all children are equally good choosers." On the other hand, assigning topics runs the risk of giving an advantage to students who are more inclined toward what the teacher selects.
  • Interdisciplinary tasks. Herman prefers these tasks because of their instructional value. But interdisciplinary assessment is most effective when a teacher is familiar with students' progress in several areas. A writing assessment on a history subject is hard to evaluate unless the teacher can distinguish the level of performance in writing versus that in history. These distinctions are harder to make when the people who rate the assessments don't work with the students every day.
  • Cooperative grouping. "Any kind of group activity confounds the measurement of individual ability," Herman says, although group work supports learning. Many educators include an individual component of the assessment in cooperative situations, but the performance of other students in the group can affect that component, Herman adds. And if teachers want to assess the ability to work as a team, that ability should be included in the criteria.

Back To Basics: What Is Performance Based Assessment (PBA)?

There are a variety of ways to test a student’s knowledge. For some, multiple choice exams and short response questions work well. Yet, these methods may cause test anxiety and fail to showcase how a student solves a problem. For this reason, performance-based assessments may be able to offer better insight as to how much a student understands. Here, we’ll answer “what is performance-based assessment (PBA)?” and break down how to implement performance-based testing in practice.

What Is Performance-Based Assessment?

Serving as an alternative to traditional testing methods, performance-based assessment includes the problem-solving process. These assessments require a student to create a product or answer a question that will demonstrate the student’s skills and understanding.

For this reason, there tends to be no single right or wrong answer. Instead, PBAs require students to actively participate in a task to assess their process. The questions or tasks are designed to be practical and interdisciplinary.

Not only do performance-based assessments provide deeper insight into how well students have learned, they also give students insight into their own understanding. With this knowledge, teachers are better able to understand where a student needs extra assistance and can modify their lessons accordingly.


What Essential Components Does PBA Include?

Depending on subject matter and goals, there are different ways to facilitate PBA. Yet, there are certain elements that make PBA what it is.

Performance-based assessments meet these criteria:

  • Process/product-oriented

In this manner, PBA can have several different right answers, because the tasks and tests are open-ended. Like most real-world situations, they are bound by time and involve a level of complexity such that problem-solving skills are really tested.

A Guide To Performance-Based Testing

Performance-based learning relies on the acquisition of skills and the development of work habits. Together, these are applied to real-world situations.

1. Balance In Literacy

In PBA, rather than asking solely whether a student knows something, you also ask how they can use their knowledge. This balances recall of knowledge with the ability to apply it practically.

2. Content Knowledge

It’s up to the teacher to pull subject matter directly from the curriculum or to pull ideas from the school or department itself.

3. Work Habits

For success in PBA and in life overall, students must master skills like time management, interpersonal communication, and individual responsibility.

4. Performance Tasks

Tasks are designed to pull everything together. These tasks draw on work habits, content knowledge, and balance in literacy. They become ingrained as a part of learning rather than an afterthought.

Examples Of PBA

In theory, PBA makes a lot of sense. But how can you incorporate it into your teaching?

Here are some examples of performance-based testing:

  • Elementary School: Pose a question like, “Should our school upgrade our water fountain systems?” Now, that’s a pretty open-ended question with no single correct answer. One way to make it practical is to ask students to record how many kids are using the water fountain per hour. In this way, they can determine need and learn about decision-making.
  • Middle School: Create a scenario in which someone commits a crime. Then, run a mock trial in your classroom. This can test a student’s communication skills and reasoning.

Advantages Of PBA

Performance-based assessment is advantageous for both teachers and students. For students, it helps to apply in-class learning to situations outside of the classroom. For teachers, it offers deeper insight into the learning needs of students.

At the same time, they offer a way for students to better measure their own understanding and success. While completing a task or project, a student can see where they are struggling. Then, they can ask specific questions or work harder on enhancing their knowledge.


How Teachers Can Create PBAs: 6 Tips

If you’re a teacher or facilitator designing performance-based assessments, here’s an easy step-by-step guide for doing so.

1. Identify Goals

The first step is to design a task that will challenge students’ problem-solving and critical-thinking abilities. Have students work without direct aid so that you can evaluate where each student’s strengths and weaknesses lie.

2. Course Standards

Most schools and districts have core standards that must be taught within the school year. Take the goal identified above and relate it back to a common core standard.

3. Review Assessments

Look at how students are currently understanding the core standard. This may be from previous test results.

4. Address Learning Gaps

By reviewing assessments, it becomes clear where a student is lacking understanding. So, you can design a performance-based assessment that addresses the learning gap in practice.

5. Design A Scenario

Design a situation that addresses core standards and main ideas that students may be struggling with. You can design a scenario by defining key characteristics, including: setting, role, time frame, product, and audience.

6. Develop A Plan

You’ll have to balance both content and task preparation. Depending on a student’s needs, you may have to be more or less hands-on in describing the problem at hand.

The Bottom Line

Education and learning are as diverse as the student population. The best type of education is one that applies to real-world situations. That’s why at the University of the People, we design our curriculum to prepare students to enter a career upon graduation. Students may have standard tests, but they also get to apply their knowledge to solving complex problems.

In the setting of primary to secondary education, performance-based assessments can play this same crucial role. Instead of relying solely on tests that tend to be multiple choice and fail to show how a student arrives at their answer, PBA offers deeper insight into their thinking process.

The best way to implement PBA is to first have a general understanding of your students’ abilities and the areas in which they need improvement. Then, you can creatively design a scenario that puts them to the test.


Performance-Based Assessment in Math

Instead of doing math problems with no context, students at this school role-play real jobs.


Through performance-based assessment, students demonstrate the knowledge, skills, and material that they have learned. This practice measures how well a student can apply or use what he or she knows, often in real-world situations. Research has shown that performance-based assessment provides a means to assess higher-order thinking skills and helps teachers and principals support students in developing a deeper understanding of content.

How It's Done

Performance-based assessment can work with the curriculum, instruction, or unit that you're teaching right now. How would you design a performance-based assessment for this content? Because PBA requires students to demonstrate their knowledge and skills with the concepts that they've learned, this assessment requires them to create a product or response, or to perform a specific set of tasks.

At Hampton High School, teachers calibrate their assessments against a rigor scale with the goal of high performance. They use the common Rigor, Relevance, and Relationships framework to demonstrate that the higher levels of rigor and relevance embody higher-level cognition and application. "What's the level of performance?" teachers will ask when designing assessments. "Is the performance that we want from kids short-term memory and fragmented applications, or should they demonstrate comprehensive understanding of big ideas?" This shifts the focus from content measures to student performance measures.

For example, a performance task in history would require students to produce a piece of writing rather than answering a series of multiple-choice questions about dates or events. The value of performance assessment is that it mimics the kind of work done in real-world contexts. So an authentic performance task in environmental science might require a student to investigate the impact of fertilizer on local groundwater and then report the results through a public service campaign (like a video, a radio announcement, or a presentation to a group).

Performance assessment draws on students’ higher-order thinking skills -- evaluating the reliability of information, synthesizing data to draw conclusions, or solving a problem with deductive or inductive reasoning. Performance tasks may require students to present supporting evidence in an argument, conduct a controlled experiment, solve a complex problem, or build a model. A performance task often has more than one acceptable solution, and teachers use rubrics as a key part of assessing student work.

Math: Disaster Relief Mission

Hampton High School's pre-calculus teachers aimed to create a performance-based assessment that asked students to demonstrate their knowledge of concepts, and apply it to circumstances unfamiliar to them. They came up with Disaster Relief Mission, a simulation where students play the role of air traffic controllers and pilots responding to crisis situations around the country. In these situations, students have to figure out what math to use in order to rescue those in need.

In the Resources tab, you'll find all the math materials that Hampton teachers created for the Disaster Relief Mission project. These materials include:

  • Project directions
  • Rubrics to assess the project

Disaster Relief Mission is a sophisticated example of performance assessment, developed and refined over the past three years by Hampton's teachers. The prep work involved in such a project does require some time, including coming up with the missions, setting up the gymnasium with the correct coordinates, and configuring all the technology (iPods, FaceTime, and a Compass App) used in this exam. Teachers also spend some time training students on how to use the technology so that it won't be an issue during the actual work. Students are also trained for the roles of both pilot and air traffic controller, in case teams need to be reconfigured on the day of the exam.

Disaster Relief Mission PBA

Students are split into teams of three (one air traffic controller and two pilots) and given four disaster missions to solve. Each team is distributed across two locations (air traffic controllers in one room, pilots in the gymnasium), and all communicate via FaceTime.

The teachers set up ten missions in the gymnasium, each with different coordinates. However, each team has only four problems to solve, allowing multiple teams to work in the gym at the same time without working on the same problem.

Each mission worked like this:

Air traffic controllers are responsible for determining the angle and distance that the pilots need to move to get them from one mission to another. They calculate these numbers and relay them to the pilots via FaceTime. If correct, the pilots in the gym reach the mission site and then have to figure out what math will help them complete the mission. For example, will their calculations require the Law of Sines, Law of Cosines, right triangle trigonometry, or bearings?

After students complete one mission, they restart the whole process for the next mission, until they complete all four. The whole PBA takes one class period to complete.
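To make the controllers' calculations concrete, here is a minimal sketch of the heading-and-distance computation they perform, assuming the gymnasium is mapped to a simple 2-D grid. The coordinates and function name are illustrative, not taken from the Hampton materials; in the actual exam, students work this out by hand with right-triangle trigonometry, bearings, or the Laws of Sines and Cosines rather than in code.

```python
import math

def heading_and_distance(current, target):
    """Compass-style heading (degrees clockwise from north) and
    straight-line distance from the current position to the target."""
    dx = target[0] - current[0]          # east-west displacement
    dy = target[1] - current[1]          # north-south displacement
    distance = math.hypot(dx, dy)
    # atan2(dx, dy) measures the angle clockwise from the +y (north) axis.
    heading = math.degrees(math.atan2(dx, dy)) % 360
    return heading, distance

# Example: direct a pilot from mission site (2, 3) to mission site (7, 9).
heading, distance = heading_and_distance((2, 3), (7, 9))
print(f"Fly heading {heading:.1f} degrees for {distance:.1f} grid units")
```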

Evaluation/Utilizing Rubrics

Teachers design a rubric to measure student performance. The rubric is given to students ahead of time, so that they're clear about what they will be assessed on. For Disaster Relief Mission, the rubric is designed so that each team member -- whether pilot or air traffic controller -- receives the same number of points on the exam. For a perfect score, a team receives 45 points for completing and solving all four missions. The rubric assesses how well students solve each mission, including:

  • The accuracy of the polar coordinate calculations
  • The accuracy of the math used in each mission, including all calculations (not just final answers)
  • Supporting work, including maps that showed how the air traffic controllers determined the angles at which the plane would travel
  • Neatness of the work
  • How students collaborated and communicated as a team

If a team doesn't submit its calculations, for example, but has the correct answer, fewer points are given. If a team has a correct answer but the units of measure are missing, it's also given fewer points. The rubric allows teachers to grade across a spectrum, taking into consideration how accurate and complete the students' work is.
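As a rough illustration of how such a spectrum-based rubric could be tallied, here is a minimal sketch in code. Only the 45-point maximum across four missions comes from the article; the criterion names and the one-point deductions are invented for the example.

```python
# Illustrative rubric tally; only the 45-point maximum across four
# missions comes from the article. Criterion names and the one-point
# deduction per unmet criterion are invented for this example.
CRITERIA = ("calculations_shown", "correct_units", "supporting_maps", "neat_work")

def score_mission(max_points, issues):
    """Deduct one point for each rubric criterion the team missed."""
    deductions = sum(1 for c in CRITERIA if c in issues)
    return max(max_points - deductions, 0)

missions = [
    set(),                   # mission 1: flawless
    {"correct_units"},       # mission 2: correct answer, units missing
    {"calculations_shown"},  # mission 3: correct answer, work not shown
    set(),                   # mission 4: flawless
]
per_mission = 45 / 4         # teams can earn up to 45 points in total
total = sum(score_mission(per_mission, m) for m in missions)
print(f"Team score: {total:.2f} / 45")
```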

  • What Is Performance-Based Assessment?
  • What Is Performance-Based Learning and Assessment, and Why Is It Important?
  • What Is a Performance Task?
  • Performance Tasks for Math
  • Performance-Based Assessment for Hampton H.S. Disaster Relief Mission


Performance Assessment Resource Bank


While assessment in the United States has focused largely on “bubbling in” answers on multiple choice tests, educators are searching for better, more authentic, and higher quality assessment tools that measure the complex thinking necessary for success in our modern world. Scaling up performance assessment requires high-quality tools and content.

That’s why we built the Performance Assessment Resource Bank (PARB), an online resource for K–12 teachers, administrators, and policy makers. The website is designed as a platform for sharing high-quality performance assessments and resources curated from educators and organizations nationwide, and for building community among the educators and leaders who use, develop, and share these important tools. It includes performance assessment tasks and support resources for designing a system of assessment and building educator assessment literacy and capabilities, all focused on more meaningful learning.

This approach provides an expanding resource base through community collaboration and sharing features that will support a growing network of performance assessment content developers. Assessment resources, initially created by experts, will grow in breadth and depth because the site enables educators to work together, adapt, pilot, and provide feedback on existing tools, contribute new content, and build a performance assessment community.

Initial contributors of tools and content to the resource base include:

  • Literacy Design Collaborative  
  • Educational Policy Improvement Center  
  • Center for Collaborative Education
  • Deeper Learning Student Assessment Initiative members (Envision Schools, New Tech Network Schools, ConnectEd for California, and Asia Society International Studies Schools Network)
  • New Hampshire Task Bank

The website, now in its nationwide rollout, features:

  • Free, open access.
  • A growing library of high quality performance tasks, portfolio frameworks, learning progressions, research, and other assessment resources, all developed by educators and organizations and vetted by experts trained by UL-SCALE at Stanford University.
  • Personalized dashboards with saved resources for each community member.
  • A user rating system to guide content selection and facilitate continuous improvement of content.
  • The ability for registered users to share resources via email, Facebook, Twitter, and Google+.

Educators can quickly and easily access and use content in classrooms, schools, and school systems or use the tools to develop their own performance assessment materials. Community members can also submit new content for review and vetting by experts.

After significant testing, the Performance Assessment Resource Bank is now available nationwide. Based on user feedback, we will continuously upgrade the site. 

PARB Collaborators:  

This new and innovative approach to assessment was created through a partnership between the Stanford Center for Opportunity Policy in Education (SCOPE) and Understanding Language-Stanford Center for Assessment, Learning and Equity (UL-SCALE), with the participation of member states of the Innovation Lab Network of the Council of Chief State School Officers (CCSSO).

PARB Directors:

  • Elizabeth Leisy Stosich, SCOPE
  • Laura Gutmann, UL-SCALE 

Reviews of PARB:

The Performance Assessment Resource Bank has been an empowering solution for my classroom for two years now. The tasks are interesting and engaging for students, and fit seamlessly into my curriculum. I love hearing students talk about how the tasks helped them become more critical thinkers through the reading and writing process and I find myself planning my own tasks now using the larger scope I've experienced in the performance assessments. This Performance Assessment Resource Bank is an important place where teachers have access to quality resources that can help us as we endeavor to produce competitive, 21st century students. —Carisa Barnes, High School English Teacher, California

The Performance Assessment Resource Bank puts developed resources into the hands of people who are more than ready to use them.  They've created a peer-reviewed archive, based in solid research and connected to 21st century standards, that lets districts, schools and teachers do what we need to do:  Let kids use what they've learned to show us what they can do. —Jonathan Doughty, High School Science Teacher, Maine

The Performance Assessment Resource Bank is very comprehensive and will provide outstanding support to educators in the development of high quality performance assessments. It is “one-stop shopping.” Teachers will have access to high quality assessments and rubrics that have been piloted and vetted. The professional development materials and resources that are available will assist Virginia’s teachers and schools as we continue our work with the development and implementation of performance assessments. One of the greatest resources available through the Performance Assessment Resource Bank is the opportunity for teachers to submit their locally developed performance assessments and have them vetted by the staff at SCALE. Having a professional staff to provide feedback and validate the quality of our work will be huge to Virginia as we move forward in our efforts to replace our multiple choice state assessments with valid and rigorous performance tasks. — Dr. John W. “Billy” Haun, Chief Academic Officer for the State of Virginia


Designing Performance-based Assessment Tasks
How do I design performance-based assessment tasks?


Well-designed performance tasks are:

  • Based on needs, as determined by your course objectives and available resources
  • Authentic to both real-world activities and your students’ life experiences
  • Contextualized, providing a thorough description of the background so students can imagine themselves in the situation
  • Appropriate for students’ proficiency levels
  • Built around clear standards, or criteria of a successful performance, that students can understand

Knowing these features, you can develop a performance task using the following steps:

1. Choose one performance objective or learning outcome and brainstorm ways you can assess student progress on or mastery of that objective. Note the proficiency level and communicative mode for the chosen objective.

2. Write what students will do to complete the task, what successful completion of the task looks like, and how students will demonstrate their performance of the objective being assessed.

3. Write what instructions, background context information, and materials students will receive to support their completion of the task. Consider your practical needs in terms of where and how the task will be administered, for example, in small groups, pairs, or individually.

4. Brainstorm how you will evaluate student performances using a checklist or rubric, and decide whether that is an existing tool, one that can be adapted, or one that needs to be created.

5. Write how you will share feedback with students and other stakeholders. Consider what you will do with the information gathered from this task, including how results will inform future instruction in your classroom or program.

6. Finally, consider whether there is a way you want to provide student choice during the task. (One way to record these design decisions in code is sketched below.)
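For readers who think in code, here is a minimal sketch of a planning record that mirrors the six steps above, assuming it is useful to capture each decision in one place. All field names and the sample task are illustrative and are not part of the CAL materials.

```python
from dataclasses import dataclass, field

@dataclass
class PerformanceTask:
    """Illustrative planning record mirroring the design steps above."""
    objective: str                    # step 1: one objective or outcome
    proficiency_level: str            # step 1: noted proficiency level
    communicative_mode: str           # step 1: noted communicative mode
    task_description: str             # step 2: what students do, what success looks like
    materials: list = field(default_factory=list)  # step 3: instructions/context/materials
    grouping: str = "individual"      # step 3: small groups, pairs, or individual
    scoring_tool: str = "rubric"      # step 4: checklist or rubric
    feedback_plan: str = ""           # step 5: sharing results with stakeholders
    student_choice: str = ""          # step 6: optional element of choice

# Hypothetical example task for a novice language class.
task = PerformanceTask(
    objective="Order a meal in the target language",
    proficiency_level="Novice High",
    communicative_mode="Interpersonal",
    task_description="Role-play ordering at a cafe; success = an intelligible exchange",
    materials=["menu handout", "role cards"],
    grouping="pairs",
    feedback_plan="Rubric returned with written comments within one week",
    student_choice="Students pick the cafe scenario",
)
print(task.objective, "-", task.grouping)
```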



Performance-Based Learning: 15 Examples, Pros and Cons


Performance-based learning involves students being able to do something, perform something, or demonstrate something. Students develop specific skills related to the subject under study, which helps them see the connection between academic concepts and real-life situations.  

Students apply knowledge they have learned in class to a practical scenario. This allows them to exercise skills and gain a different perspective on the subject under study.

As Hibbard et al. (1996) explained,

“Performance-based learning and assessment achieve a balanced approach by extending traditional fact-and-skill instruction” (p. 5).

Performance-based learning can completely transform the educational experience of students. It is far more interesting to students and matches the active learning style of those who have trouble sitting still and listening for 50 minutes at a time.

Through active learning , students are believed to absorb information more deeply and learn about subtle nuances of a subject that cannot be fully understood through a traditional classroom lecture.

Performance-Based Learning Examples

  • Demonstrating knowledge through creating a product: A high school English literature teacher asks students to choose a famous play by Shakespeare and transform it into a comic book.
  • Learning to drive: The process of learning to drive is a great example of performance-based learning because it can’t be merely theoretical. The student sits behind the wheel and performs the task. When assessed in a driving test, it’s also a performance-based test.
  • Developing a practical plan based on what has been learned: Students studying to be paramedics are instructed to develop an emergency management plan for an airplane accident.
  • Working on an outdoor project: An entire elementary school creates natural science lessons that directly engage children in gardening.
  • Physically building something: Every year, the students in a physics course participate in a competition on which group can construct the strongest bridge out of paper and tape.
  • Assessing students based on role-play scenarios: Professor Santos has his HR students conduct role-plays on how to tell a senior employee that their contract will not be renewed.
  • Applying math to real life: A team of math teachers put together an assignment called Mission Relief that involves students playing different roles to guide an airplane to safety using mathematical formulas.
  • Creating new technology in class: Advanced Design students choose to design a tech gadget. They are assessed on the product quality, fit for market, and how well it solves real-life problems for consumers.  
  • Creating a book: Dr. Flannigan lets her elementary education students create their own pop-up book on a theme of their choice.
  • Teaching practicum: Teacher education students spend a lot of time at university reading books and talking theory. But in the courses where they go into a classroom and practice teaching, they get the biggest benefit because they get to perform the craft they’ve been learning all about in class.

Benefits of Performance-Based Learning

Key benefits include:

  • Students are assessed on their practical application of knowledge rather than mere theoretical understanding.
  • Teaching and learning is skewed toward active learning rather than passive learning. Active learning is believed to be more effective for long-term intellectual, physical, and social development.
  • Students get to see and learn about the complexities of application of theory to real life, allowing for learning about the nuances of concepts.
  • Learning tends to lead to the creation of a product, which can be a motivational force for learners.
  • The connection between school and out-of-school life (e.g. life skill development and workplace skill development) tends to be emphasized.

Weaknesses of Performance-Based Learning

  • It is difficult to administer normative standardized tests for many performance-based assessment tasks. This makes it hard to generate quantifiable, comparable, and standardizable grades for students.
  • The term can be seen as overly vague. Other concepts, like active learning and project-based learning, have significant overlaps with this concept but much more scholarly research underpinning them.

Performance-Based Learning Case Studies

1. Oral Presentations

Oral presentations can be used in almost any course. They give students an opportunity to develop extremely valuable communication skills and build self-confidence.

For example, instead of writing a lengthy term paper on the industrial revolution or some other historical event, students can construct a PPT and make a class presentation.

Writing a term paper does build certain skills, but communicating with others and learning how to present information verbally is a common job task in many occupations.

Certain rigorous elements of conducting research can still apply, such as reading the relevant research, making a reference section, and using appropriate citation practices during the presentation. But the skills the students exercise are far more pragmatic.

2. Dramatic Performances

Putting on a dramatic performance is a collaborative activity that doesn’t just involve students acting on stage. There are many other roles for students to play in this kind of performance-based project.

The term dramatic performance can refer to dance, a recital, a reading of poetry, or the performance of a play. Although being on a stage in front of an audience is a great way to build social confidence, there are many other benefits.

For example, students must learn fundamental project management skills. They have to devise schedules, assign leadership and work teams, allocate resources, and learn about teamwork and conflict resolution.

These are all very valuable skills for students to practice in the safety of an academic context.

3. Science Fairs

Maybe one of the best examples of performance-based learning is the science fair. Displaying a project at an exhibition, describing it to others, and fielding questions are all skills students can carry with them long after graduation.

Of course, there is a lot of learning accomplished while the students work on their projects as well. For example, rather than reading about the plant life cycle, students can actually learn about it by planting their own seeds.

They can document the entire process by taking photos or making sketches of each stage. Their display can also contain text, images, and graphs. Students will still need to read, but they will also learn how to pin-point key information for their display. This sounds simple enough, but it is a key skill that has to be developed with practice.

4. Consumer Science Project

University students in a consumer science course can learn about the expectations of customers by conducting a focus group or a survey that solicits their opinions.

For example, the professor may ask students to design, administer, and analyze a customer satisfaction survey for a restaurant.

Working in groups, students will begin by brainstorming questions to include in the survey and narrowing down those that are most relevant to the type of restaurant their project targets.

The group could then partner with a local restaurant and obtain permission to collect data for a given period of time. Once data collection has been completed, the group will perform various analyses and create several graphs and charts that highlight the key findings.

5. Debate

Learning how to formulate a point of view and convey those points in a public setting can be a very daunting task for most middle school and high school students. It takes a keen understanding of argument potency and a high degree of self-confidence to go head-to-head against an opponent.

However, there may be no more important skill than being able to identify the flaws in a point of view and then counter with convincing statements that support a different position.

The skills needed in debate also include conducting research and practicing civility and diplomacy . Even if not in a formal debate context, the give-and-take of arguments is a common occurrence in the classroom and later in life at business meetings.

Performance-based learning gives students an opportunity to take what they have learned in the classroom and apply it to an activity or project.

It is a great way for students to see the connection between concepts in a textbook and their application to a practical situation.

Preschool teachers to university professors implement performance-based learning in their classrooms to help students develop skills that they will use throughout their lifetime.

These skills include learning how to manage projects, collaborate with others, search for and weigh evidence, as well as performing in a public setting in a professional manner.

Burguillo, J. C. (2010). Using game theory and competition-based learning to stimulate student motivation and performance. Computers and Education, 55 (2), 566-575. https://doi.org/10.1016/j.compedu.2010.02.018

Hibbard, M. K., Elia, E., & Wagenen, L. van. (1996). A teacher’s guide to performance-based learning and assessment . Alexandria, VA: Association for Supervision and Curriculum Development.

Street, L. A., Martin, P. H., White, A. R., & Stevens, A. E. (2022). Problem-based learning in social policy class: A semester-long project within organizational policy practice. Journal of Policy Practice and Research, 3, 118-131. https://doi.org/10.1007/s42972-022-00047-4

Wirkala, C., & Kuhn, D. (2011). Problem-based learning in K–12 education: Is it effective and how does it achieve its effects? American Educational Research Journal, 48 (5), 1157–1186. https://doi.org/10.3102/0002831211419491

Ernst, D., Hodge, A., & Yoshinobu, S. (2017). What is inquiry-based learning? Notices of the American Mathematical Society, 64, 570-574. https://doi.org/10.1090/noti1536

Beyrow, M., Godau, M., Heidmann, F., Langer, C., Wettach, R., & Mieg, H. (2019). Inquiry-Based Learning in Design. Inquiry-Based Learning – Undergraduate Research (pp. 239-247). https://doi.org/10.1007/978-3-030-14223-0_22

Lee, V. S., Greene, D. B., Odom, J., Schechter, E., & Slatta, R. W. (2004). What is inquiry guided learning. In V. S. Lee (Ed.), Teaching and learning through inquiry: A guidebook for institutions and instructors (pp. 3-15). Sterling, VA: Stylus Publishing.

Seltzer, E. (1977). A comparison between John Dewey’s theory of inquiry and Jean Piaget’s genetic analysis of intelligence. The Journal of Genetic Psychology , 130 (2d Half), 323–335. https://doi.org/10.1080/00221325.1977.10533264

  • Data Descriptor
  • Open access
  • Published: 15 February 2024

Assessment of Self-report, Palpation, and Surface Electromyography Dataset During Isometric Muscle Contraction

  • Jihoon Lim   ORCID: orcid.org/0000-0002-9596-0906 1 ,
  • Lei Lu 2 , 3 ,
  • Kusal Goonewardena 4 ,
  • Jefferson Zhe Liu   ORCID: orcid.org/0000-0002-5282-7945 1 &
  • Ying Tan 1  

Scientific Data volume 11, Article number: 208 (2024)


Measuring muscle fatigue involves assessing various components within the motor system. While subjective and sensor-based measures have been proposed, a comprehensive comparison of these assessment measures is currently lacking. This study aims to bridge this gap by utilizing three commonly used measures: participant self-reported perceived muscle fatigue scores, a sports physiotherapist’s manual palpation-based muscle tightness scores, and surface electromyography sensors. Compensatory muscle fatigue occurs when one muscle group becomes fatigued, leading to the involvement and subsequent fatigue of other muscles as they compensate for the workload. The evaluation of compensatory muscle fatigue focuses on nine different upper body muscles selected by the sports physiotherapist. With a cohort of 30 male subjects, this study provides a valuable dataset for researchers and healthcare practitioners in sports science, rehabilitation, and human performance. It enables the exploration and comparison of diverse methods for evaluating different muscles in isometric contraction.

Background & Summary

Muscle fatigue, characterized by a decline in muscle performance and accompanied by feelings of weakness, tiredness, or exhaustion in the affected muscles, is a prevalent and non-specific symptom experienced by many individuals 1 , 2 . It can be associated with a range of health conditions, including muscle strain, chronic fatigue syndrome (CFS), and overtraining syndrome, which can result from the accumulation of untreated muscle fatigue over time 2 , 3 . Therefore, monitoring muscle fatigue plays a crucial role in providing timely intervention for these conditions. However, detecting and measuring muscle fatigue poses significant challenges due to its complex nature. It involves multiple components of the motor system, including mechanisms within the brain and spinal cord, peripheral nerves, neuromuscular junction, excitation-contraction coupling, and force generation 4 . The intricate interplay of these components makes it difficult to isolate and quantify specific parameters related to muscle fatigue.

Despite these challenges, researchers and clinicians have proposed various measures to assess muscle fatigue, which can be categorized into two classes. The first class comprises subjective measures that rely on self-reported data from participants or subjective assessments conducted by clinicians or physiologists. Examples of subjective measures include self-administered questionnaires 5 , the Borg CR-10 scale 6 , and a palpation-based muscle tightness scale often used in clinical or research settings 7 , 8 . In the palpation-based scale, the physiotherapist manually assesses the muscle and provides a subjective rating based on the level of muscle tension and tightness. These measures provide valuable insights into the subjective experience of muscle fatigue and its impact on individuals. However, it is important to note that these subjective measures generally have limitations in terms of reliability. For instance, the Borg CR-10 scale may provide unstable estimations due to its subjective nature 9 .

The second class consists of objective measures that utilize various sensors and quantitative methods to assess muscle fatigue. These objective measures provide more quantitative and precise insights into muscle fatigue. They include blood tests, electromyography (EMG) or surface EMG (sEMG) to measure the electrical activities of muscles, force measurements using dynamometers or force plates, and other sensor-based technologies like accelerometers or wearable devices. For example, motion sensors have been employed to assess the effect of perceived muscle fatigue on coordination during endurance running 10 , and accelerometers were found effective for monitoring fatigue during intermittent exercise 11 . These objective measures contribute to a more comprehensive and objective evaluation of muscle fatigue. Among these sensors, sEMG sensors have been widely used due to their simplicity of use. The sEMG signal can provide valuable information: decomposing sEMG signals can extract neural activation information 12 , 13 , and changes in signal indicators can characterize muscle fatigue 14 , 15 , 16 , 17 , 18 . Some commonly used sEMG measures include the mean absolute value (MAV), root mean square (RMS), mean frequency (MNF), and median frequency (MDF), as well as nonlinear variables such as the percentage of recurrence and determinism from recurrence quantification analysis (RQA) 19 , 20 . For instance, the sEMG signal has been used to assess muscle fatigue during a forward head and rounded shoulder sitting posture 21 , to evaluate the effects of fatigue on muscle synergies in baseball players 22 , and to detect localized muscle fatigue in track and field athletes 23 .
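To ground these indicators, the sketch below shows how MAV, RMS, MNF, and MDF are commonly computed for one window of sEMG signal, assuming a NumPy array and a known sampling rate. This is a generic implementation of the standard definitions, not the custom Matlab code used in the study.

```python
import numpy as np

def semg_features(signal, fs):
    """Compute four standard sEMG fatigue indicators for one window.

    signal : 1-D array of raw sEMG samples
    fs     : sampling rate in Hz
    """
    mav = np.mean(np.abs(signal))                  # mean absolute value
    rms = np.sqrt(np.mean(signal ** 2))            # root mean square

    # Power spectrum of the (real-valued) signal via FFT.
    freqs = np.fft.rfftfreq(signal.size, d=1 / fs)
    power = np.abs(np.fft.rfft(signal)) ** 2

    mnf = np.sum(freqs * power) / np.sum(power)    # mean frequency
    cumulative = np.cumsum(power)
    mdf = freqs[np.searchsorted(cumulative, cumulative[-1] / 2)]  # median frequency
    return mav, rms, mnf, mdf

# Example on synthetic data: 2 s of noise sampled at 2000 Hz.
rng = np.random.default_rng(0)
sig = rng.normal(size=4000)
print(semg_features(sig, fs=2000))
```

A downward shift in MNF and MDF across successive windows is the classic spectral signature of developing muscle fatigue.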

Understanding muscle fatigue is a challenging task since it is influenced by psychological, physiological, and sociological factors 24 , 25 . A combination of subjective and objective measures can provide more robust insights into muscle fatigue analysis, and several studies have compared subjective and objective measures of muscle fatigue 26 , 27 , 28 , 29 , 30 , 31 , 32 . However, these studies have several shortcomings: (i) a limited number of sensors attached to a restricted set of muscles, for example, only one sEMG sensor on the right trapezius muscle 30 ; (ii) small sample sizes; and (iii) a lack of sEMG signals, which play an important role in understanding muscle activities. For example, pressure and oxygenated hemoglobin levels were measured using a pressure sensor 28 , and vertical jump levels were measured without employing an sEMG sensor 29 . Therefore, the goal of this paper is to enable a comprehensive understanding of muscle fatigue by presenting data collected through three measures: (1) participants’ self-reported perceived muscle fatigue ranks, (2) muscle tightness evaluations conducted by an experienced sports physiotherapist employing manual palpation-based techniques, and (3) features extracted from sEMG signal measurements. These measures have been selected for their widespread usage and ability to offer valuable insights into the assessment of muscle fatigue. Notably, the experiments were supervised by a sports physiotherapist with over thirty years of experience in sports physiotherapy, providing a high level of clinical expertise. To assess and measure compensatory muscle fatigue, the sports physiotherapist selected nine distinct upper body muscles for evaluation. These specific muscles were chosen based on their involvement in the task and their potential to contribute to muscle fatigue.

The study comprises three datasets aimed at understanding the occurrence and consequences of compensatory muscle fatigue. Firstly, participants were asked to report the top three most fatigued muscles (perceived muscle fatigue) at the conclusion of the experiments, providing valuable subjective insights into their individual experiences. It is known in clinical practice that tight muscles are often associated with inefficient function and a higher susceptibility to fatigue 33 , 34 . Thus, in the second step of the experiment, a sports physiotherapist assigned a muscle tightness score to each of the nine selected muscles throughout the experiment. Manual palpation by sports physiotherapists, as conducted in this study, presents a potential measure for assessing muscle fatigue. Additionally, at the conclusion of the experiments, the physiotherapist identified and scored the top three muscles exhibiting the highest tightness, utilizing his expertise and manual palpation-based techniques. Lastly, sEMG sensors were used to record muscle activity from the nine selected muscles, under the guidance of the experienced sports physiotherapist.

To the best of our knowledge, this work represents the first comprehensive set of data, shedding light on the occurrence and dynamics of compensatory muscle fatigue during specific movements. This dataset offers several advantages for further research and analysis. Firstly, it serves as a valuable resource for exploring and comparing diverse methods used to assess muscle fatigue. Researchers can utilize this dataset to gain insights into the strengths and limitations of different approaches, advancing the field of muscle fatigue assessment. Secondly, the dataset provides a unique opportunity to investigate compensatory muscle fatigue in a controlled and standardized manner. By analyzing the data, researchers can gain valuable insights into the mechanisms and dynamics of this phenomenon, deepening our understanding of how muscles interact and adapt during physical tasks. Thirdly, this dataset can serve as a benchmark for future studies in the field. It provides a reference point for replicating and validating findings, ensuring the reliability and reproducibility of research in the area of compensatory muscle fatigue. Furthermore, the dataset can inspire the development of new methodologies and approaches for studying and quantifying muscle fatigue.

Overall, this dataset holds immense potential for advancing our knowledge of muscle fatigue and its implications in various fields, including sports science, biomechanics, rehabilitation, and human performance.

Participants

In this study, we recruited thirty healthy male participants without any history of neurological or muscular pathology from Australia. The experiments were performed between August 2022 and December 2022. Among the participants, 28 had a dominant right hand, and 2 had a dominant left hand. Each participant held the dumbbell using their dominant hand. Of the thirty participants, 29 completed the entire experimental protocol, while one individual stopped partway through the process due to muscle fatigue.

Prior to the experiment, all subjects were instructed to disclose any medical conditions or medications to the sports physiotherapist. To minimize the impact of prior physical activities, participants were advised not to exercise or engage in heavy lifting for at least three hours before the start of the session. It was also emphasized that participants perform proper warm-up exercises to reduce the risk of injury.

Before commencing the experiment, each participant provided information regarding their dominant hand, height, weight, and age 35 (see Table  1 ). Detailed information about the experimental protocol was provided to all participants, and they were required to sign a consent form prior to their involvement in the study. The study protocol was approved by the Human Research Ethics Committee of the University of Melbourne (ID: 1954575).

Experimental setup

In this study, commercially available Delsys Trigno Avanti sEMG sensors (Delsys Incorporated, USA) were used, as shown in Fig. 1a . These sensors were attached to the skin using a customized double-sided adhesive interface without conductive paste or gels, and compressive pressure was applied to enhance the adhesive strength, as shown in Fig. 1b,c . The adhesive tape is single-use only and medical-grade approved for dermatological applications. The adhesive interface ensured an electrical connection between the sensor and the skin, minimizing noise from line interference.

Figure 1. Delsys Trigno Avanti surface EMG sensor system.

However, it is important to note that the planar, flat, and rigid surface of commercial sEMG sensors makes them vulnerable to motion artifacts caused by relative motion 36 . These motion artifacts can contaminate the sEMG signals 37 . Additionally, the sensors may fall off the curvilinear parts of the human skin during dynamic movements, even with the tailor-made adhesive interface 38 . To address these limitations, sEMG signal recording in this study was conducted with participants in a static posture with isometric muscle contraction. This approach minimized the influence of motion artifacts and ensured the stability of the sensor placement throughout the experiment.

Based on the selected static posture with isometric muscle contraction, the experienced sports physiotherapist identified nine different locations of muscle groups on the upper limb and body. These specific muscle locations were carefully chosen, considering the potential sequential fatigue that might be triggered by this particular movement, in order to analyze compensatory muscle fatigue.

When using sEMG sensors, the placement of sensors on the skin plays a critical role in ensuring high-quality signals and reliable measurements. The electrode orientation refers to aligning the line connecting the two bipolar electrodes with the direction of the muscle fibers 39 .

Aligning the electrodes in the direction of the muscle fiber is important as muscle activity signals depend on this orientation. In addition, the sensor locations were carefully selected to avoid innervation zones, origin or insertion locations, and the edge of the muscle belly. Therefore, the sensors were positioned following the recommended location and orientation of the muscle fibers to enable accurate estimation of spectral parameters 39 , 40 , 41 .

Moreover, each sensor placement was strategically isolated from other muscles to minimize muscle crosstalk and interference. Even small displacements of the electrodes, within a centimeter range, can substantially alter the signal and compromise the reliability of readings. Additionally, the selected sensor locations were chosen to have minimal hair, reducing crosstalk from surrounding muscles.

The placement of the nine wireless Delsys Trigno Avanti sEMG sensors on the upper body followed specific guidelines from the literature 39 , 40 , 42 , 43 , and was examined by a physiotherapist at our local sports center. Firstly, the six sensor placement locations (# 2 (BB), # 3 (TB), # 5 (UT), # 7 (MT), # 8 (LT), and # 9 (AD)) were marked with erasable markers based on the guidelines ( SENIAM; Surface Electromyography for the Non-Invasive Assessment of Muscles ) for the sEMG sensors. Additionally, three additional locations (# 1 (BR), # 4 (IS), and # 6 (PCS)) were selected based on recommendations from the sports physiotherapist. The muscle and sensor locations are described in Fig.  2 and detailed information about each sensor placement and orientation can be found in Table  2 . In Fig.  2 , muscle locations are indicated for right-handed participants.

Figure 2. Sensor locations on nine different muscles: Brachioradialis (# 1 (BR)), Biceps Brachii (# 2 (BB)), Triceps Brachii (# 3 (TB)), Infraspinatus (# 4 (IS)), Upper Trapezius (# 5 (UT)), Paraspinal Cervical Spine (# 6 (PCS)), Middle Trapezius (# 7 (MT)), Lower Trapezius (# 8 (LT)), Anterior Deltoid (# 9 (AD)).

The entire experimental procedure was conducted under the supervision of the experienced sports physiotherapist, ensuring the precise positioning of sensors and accurate measurement of muscle activities. The physiotherapist's presence guaranteed consistent sensor placement for each participant, mitigating the potential effects of crosstalk and other sources of interference.

Experimental protocol

Prior to commencing the experiments, several precautions were taken to ensure optimal sEMG signal quality. The sports physiotherapist provided a detailed explanation of the experimental protocol and addressed any questions or concerns from the participants, ensuring their understanding and cooperation throughout the process. To prepare for the attachment of sensors, participants were instructed to remove their upper garments, allowing for direct contact between the sensors and the skin. Then, in order to maintain hygiene and minimize contact impedance between the electrodes and the skin, the designated skin areas for sensor placement were thoroughly cleansed using antiseptic skin wash, effectively removing any surface residues. Subsequently, the skin was completely dried to ensure firm electrode-skin contact. Any electronic devices or accessories that had the potential to interfere with the signal quality were removed to ensure accurate and reliable measurements.

During the experiment, participants remained bare-bodied from the waist up, adopting an upright posture with their torso in a relaxed position. The attachment of sensors began by placing each sensor on the target muscles of the dominant upper limb and body, following the predetermined locations identified by the sports physiotherapist.

In this study, the participants underwent a single movement test, which consisted of two separate sessions: a preparatory session (Ses1) and a data collection session (Ses2) as shown in Table  3 .

Preparatory session (Ses1)

During the trial session, participants performed the dumbbell frontal raise for 10 seconds at a time while the sEMG signals of each muscle were assessed. The recording session immediately followed, with consecutive performances of the exercise. Commercial sEMG sensors recorded the electrophysiological signals from nine different muscles, which were displayed in real time on a computer screen. Participants were instructed to avoid any elbow or wrist flexion/extension during the exercise. After the trial session, participants had a 30-second rest period to ensure muscle readiness and minimize potential muscle fatigue before the data collection session.

To calibrate the sensors, we checked the display to confirm that the signal readings from all nine sensors were correct. The real-time Signal Quality Monitor panel was used to detect noise interference or poor adhesion of a sensor to the skin. If the signal quality was poor, that is, outside the green area on the gauge panel, or if the Bluetooth connection was unstable, the sensor was reattached.

In the experiments, when all factors showed green status, we initiated the data collection process, confirming the acquisition of a high-quality sEMG signal. If any of these factors fell within the yellow or red areas on the scale, as shown for example in Fig. 3 , calibration procedures such as checking for baseline noise on the display panel, adjusting the sensor's position, and repeating the skin preparation were carried out. These actions addressed any signal quality issues and ensured the reliability of the recorded data.

Figure 3. Delsys sensor signal quality check.
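The accept/recalibrate rule described above reduces to a simple check, sketched below under the assumption that each sensor reports a green, yellow, or red status read off the Signal Quality Monitor gauge. The function is purely illustrative; it is not part of the Delsys software.

```python
# Illustrative accept/recalibrate loop; the green/yellow/red statuses
# mirror the Signal Quality Monitor gauge described above, but this
# function is not part of any Delsys API.
def ready_to_record(sensor_status):
    """Proceed only when every sensor reports 'green'."""
    return all(status == "green" for status in sensor_status.values())

status = {f"sensor_{i}": "green" for i in range(1, 10)}
status["sensor_4"] = "yellow"        # e.g. poor adhesion on the infraspinatus

if ready_to_record(status):
    print("Start data collection")
else:
    flagged = [s for s, v in status.items() if v != "green"]
    print("Recalibrate:", ", ".join(flagged))   # adjust position / re-prep skin
```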

Data collection session (Ses2)

In the data collection session, the nine sensors were attached to the designated muscle locations on the subject's upper body according to the experimental protocol. The subject assumed a comfortable and stable standing position with feet shoulder-width apart. As a starting position, the subject held a 1 kg dumbbell in their dominant hand, allowing the arm to hang straight down by their side with a pronated grip (palm facing down). The dominant arm was then slowly raised forward until it reached a peak position parallel to the ground, while maintaining a straight arm and neutral wrist. The sports physiotherapist visually monitored the arm position; at the top of the lift, the dominant arm should be at or slightly below shoulder level, corresponding to 90 to 180 degrees of shoulder flexion, with 180 degrees indicating a fully extended arm. Participants were instructed to maintain a stationary posture throughout the experiment while holding the 1 kg dumbbell in their dominant hand, as shown in Fig. 4 .

Figure 4. Nine sensors attached on the upper body (front view, back view, side view).

The figures presented in this study show muscle placements and locations as indicated for a right-handed participant. For left-handed individuals, the sensors were attached in the opposite orientation to prevent any potential confusion in the interpretation of the results. This approach was taken to ensure consistency in the sensor placement and to cater to both right-handed and left-handed participants.

During the exercise, a sports physiotherapist conducted regular assessments of muscle tightness using a palpation-based approach. At intervals of every 30 seconds, the sports physiotherapist evaluated the participant’s muscle tightness using a muscle tightness 4-point ordinal scale ranging from 0 to 3 44 . This continuous assessment of muscle tightness was carried out throughout the entire exercise duration of 210 seconds, allowing for the observation of muscle tightness progression over time. The chosen exercise duration aimed to replicate sustained effort and enable the monitoring of muscle fatigue development 45 . At the end of the exercise, the participant slowly lowered their arm back to the starting position and rested for 60 seconds. Participants were advised to discontinue the exercise if they experienced muscle fatigue to the extent that they could no longer maintain the required posture. Following the experiment, all sensors were detached, and both the skin and sensors were gently cleansed using an antiseptic skin wash to ensure hygiene and prepare for subsequent sessions or analyses.
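To make the assessment schedule concrete, the sketch below lays out the tightness-score timeline this protocol produces, assuming one 0-3 ordinal score per muscle at each 30-second mark; whether a baseline score is taken at t = 0 is an assumption here. The muscle codes follow Fig. 2, and all scores shown are invented.

```python
# Illustrative tightness log: one 0-3 ordinal score per muscle at each
# 30-second mark of the 210-second hold. Muscle codes follow Fig. 2;
# the t = 0 baseline and all example scores are assumptions.
MUSCLES = ["BR", "BB", "TB", "IS", "UT", "PCS", "MT", "LT", "AD"]
TIMES = range(0, 211, 30)                # 0, 30, ..., 210 seconds

tightness = {t: dict.fromkeys(MUSCLES, 0) for t in TIMES}
tightness[120]["UT"] = 2                 # e.g. upper trapezius tightening mid-hold
tightness[210]["AD"] = 3                 # e.g. anterior deltoid maximally tight

for t in TIMES:
    print(f"t={t:3d}s  UT={tightness[t]['UT']}  AD={tightness[t]['AD']}")
```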

Following completion of the exercise, each participant was asked to rank the top three of the nine monitored muscles that they subjectively perceived as the most fatigued. The muscles located on the back, namely #4 (IS), #5 (UT), #6 (PCS), #7 (MT), and #8 (LT), were difficult for participants to distinguish individually; the sports physiotherapist therefore assisted by pointing to each muscle so that participants could identify and rank the muscles causing perceived fatigue. This self-reported ranking gave participants the opportunity to express their personal assessment of which muscles experienced the highest degree of fatigue during the exercise, and it served to collect subjective feedback on perceived muscle fatigue. Concurrently, the sports physiotherapist ranked the top three muscles exhibiting the most significant signs of tightness. The palpation-based muscle tightness score data in this study therefore comprise both the continuous assessment conducted by the physiotherapist and the ranking of the muscles at the end of the experiment based on that assessment.

The total duration of the experiment, including both preparation and data collection, was estimated to be approximately 20 minutes. The preparation phase accounted for approximately 10 minutes. The remaining time was dedicated to the actual data collection session, during which the participants performed the designated exercise while their muscle activity was recorded.

Data Collection and Processing

In the data collection and processing phase of the study, three categories of data were recorded: (1) self-reported perceived muscle fatigue data, (2) palpation-based muscle tightness score data, and (3) sEMG signal data.

All collected data underwent comprehensive processing and analysis. Custom scripts written in Matlab (The MathWorks Inc., US) were used for the data processing tasks, providing the tools needed to process the data systematically and to extract meaningful insights and outcomes.

Self-reported perceived muscle fatigue data

As part of the experimental protocol, participants were instructed to provide self-reports on the top three muscle sites, among a predefined selection of nine muscles, where they perceived muscle fatigue. This self-report assessment relied on participants’ subjective feelings and perceptions of muscle fatigue at these specific muscle sites. By capturing participants’ personal experiences and sensations, this self-reporting approach added a valuable subjective dimension to the evaluation of perceived muscle fatigue during the experiment. Table 4 presents two examples of self-reported data, demonstrating the format of the data. In this table, sf1, sf2, and sf3 represent the top three muscles that showed the most perceived fatigue for each subject, respectively.

Physiotherapist’s palpation-based muscle tightness score data

In addition, an experienced sports physiotherapist with expertise in palpation applied a categorical scoring system based on subjective examination of soft-tissue tightness during the experiment. This examination yielded two sets of data. The first set, referred to as Data 1, consists of the continuous assessments made by the sports physiotherapist throughout the experiment. The second set, referred to as Data 2, contains the physiotherapist’s ranking of muscle tightness at the end of the experiment for each participant.

The physiotherapist’s palpation assessed muscle tightness rather than directly evaluating fatigue. It is a common clinical understanding that tight muscles tend to function inefficiently and are more prone to fatigue 33 , 34 . The assessment therefore identified the top three muscles with the highest levels of tightness, as determined by the sports physiotherapist, and was used to indirectly gauge the potential impact of muscle tightness on muscle fatigue.

To illustrate the format of the collected data, Table 5 provides two examples of Data 1, showcasing the sports physiotherapist’s palpation-based scores of muscle tightness for the nine monitored muscles.

Table 6 provides two examples of Data 2, demonstrating the final assessment of the sports physiotherapist. In this table, pf1, pf2, and pf3 represent the top three muscles with the highest level of muscle tightness as ranked by the sports physiotherapist, respectively.

sEMG signal data

A commercially available Delsys Trigno Avanti sensor system was used to detect and measure muscle activity in nine specific upper-body muscles of each subject. All nine sEMG sensors were synchronized in time. During data collection, the sEMG signals were acquired with the EMGworks Acquisition 4.8.0 software at a sampling rate of 2148 Hz. The sensors transmitted the data wirelessly in real time to a Lenovo Thinkbook laptop (Intel(R) Core i7-1165G7 @ 2.80 GHz, 16 GB RAM) connected to the base station via a USB cable. All data collected in the software were exported to an Excel spreadsheet for further analysis. As an example, the sEMG signals from Subject 19 for the nine muscle locations, as displayed by the EMGworks Acquisition system, are shown in Fig. 5.

Figure 5. sEMG signal acquisition using the EMGworks Acquisition 4.8.0 software for nine muscles of the upper body (brachioradialis (#1 (BR)), biceps brachii (#2 (BB)), triceps brachii (#3 (TB)), infraspinatus (#4 (IS)), upper trapezius (#5 (UT)), paraspinal cervical spine (#6 (PCS)), mid trapezius (#7 (MT)), lower trapezius (#8 (LT)), anterior deltoid (#9 (AD))).

The sEMG signal from the commercial sensor refers to the signal obtained after amplification and bandpass filtering. This filtering is implemented with analog components: a 2nd-order Butterworth high-pass filter and a 4th-order Butterworth low-pass filter ensure that the entire frequency spectrum within the specified range is captured 46 . The 20 Hz high-pass corner also reduces motion artifact, which would otherwise distort the underlying physiological signal 47 . These integrated design features support the good signal quality of the sEMG recordings in this study.

Figure 6. Schematic representation of the provided dataset. The SciData dataset is organized into two file types, .xlsx and .mat; together these files allow a comprehensive representation of the data.

Data Records

The dataset accompanying this paper is available for download on figshare 48 . It is provided under an open license, allowing users to utilize the data freely for any purpose. Figure 6 gives a schematic representation of the dataset organization, illustrating the hierarchical arrangement of the files together with the variables and parameters of each file.

Figure 7. Technical validation of sEMG sensor signal quality.

The raw data from the subjective measurements (self-reported perceived muscle fatigue scores and the physiotherapist’s palpation-based muscle tightness scores) are stored in individual .xlsx files. Each file corresponds to a specific measurement and contains the data collected during the experiment, serving as a comprehensive record of the subjective assessments performed by the participants and the sports physiotherapist. The sEMG data recorded from each participant with the nine sensors on the corresponding muscles are likewise stored in separate .xlsx files. Each of these files contains a set of sEMG sensor variables that together constitute the participant’s data, with each variable representing a specific aspect of the recorded measurements.

Self-reported perceived muscle fatigue rank data

Subjective measurement data of participants’ self-reported perceived muscle fatigue rank were summarized in Excel spreadsheet format (e.g., SelfReported_Subject01.xlsx ).

Subject : Each data file is named according to the participant number, which is an integer ranging from 1 to 30.

Sensor : Sensor 01 - Sensor 09 correspond to the muscle parts described in Fig. 2.

Self-reported perceived muscle fatigue Rank 1, Rank 2, Rank 3 : The data records for self-reported perceived muscle fatigue rank 1, 2, and 3 include information on the participants’ subjective assessment of their muscle fatigue levels. Each record specifies the participant number, the rank of perceived muscle fatigue (1, 2, or 3), and the corresponding muscle site. These records provide insights into the participants’ individual perceptions of muscle fatigue and contribute to understanding the subjective experience of fatigue during the experimental sessions.
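
As an illustration of how these records can be inspected programmatically, the following Matlab sketch loads one participant’s self-reported ranking file. The column names used here (Sensor, Rank) are assumptions inferred from the description above, not a documented schema, so they may need to be adapted to the actual spreadsheet headers.

    % Minimal sketch: inspect one participant's self-reported fatigue ranks.
    % 'Sensor' and 'Rank' are assumed column names, not a documented schema.
    T = readtable('SelfReported_Subject01.xlsx');
    disp(T)                               % check the actual layout first
    top3 = T(ismember(T.Rank, 1:3), :);   % rows holding ranks 1-3
    disp(sortrows(top3, 'Rank'))          % muscle sites ordered by rank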

Physiotherapist’s palpation-based muscle tightness rank data

The sports physiotherapist’s palpation-based assessments of muscle tightness during the 210-second experiment, taken at 30-second intervals, together with the final assessment of muscle tightness, were summarized in Excel spreadsheet format (e.g., PhysioPalpation_Subject01.xlsx ).

Sensor : Sensor 01 - Sensor 09 correspond to the muscle parts described in Fig. 2.

Muscle tightness measurements for 210 seconds at 30-second intervals : The subjective data records for each participant include the physiotherapist’s palpation-based measurements taken at the start (0 s) and then every 30 seconds, for a total of eight time points across the nine muscle locations.

Sports physiotherapist’s palpation-based muscle tightness Rank 1, Rank 2, Rank 3 : Following the 30-second-interval tightness measurements, the data records for physiotherapist-assessed muscle tightness ranks 1, 2, and 3 contain the evaluations conducted by the physiotherapist. Each record includes the participant number, the rank of muscle tightness assigned by the physiotherapist (1, 2, or 3), and the associated muscle location. These records reflect the physiotherapist’s expert judgment regarding the severity and localization of muscle tightness, providing valuable assessments of muscle condition during the experimental sessions.

sEMG raw data

The raw data contain the sEMG recordings for all subjects across the nine muscles. The sEMG time and signal data were collected via a Bluetooth module and an in-house data acquisition (DAQ) system. The recorded data were stored in Excel spreadsheets in .xlsx format, with each participant’s data saved in a separate file (e.g., Subject01.xlsx ).

Time : The sEMG raw time data consist of the time-series measurements recorded from the sEMG sensors, which captured the electrical activity generated by the muscles during the experimental sessions. Each entry in the time series corresponds to a specific time point. The sEMG raw time data are stored in the Excel spreadsheet (.xlsx) under the column header Time [s] .

sEMG signal : The sEMG signal data contain the amplitudes of the electrical signals recorded by the sEMG sensors. These signals represent the muscular electrical activity and provide insight into muscle activation levels during the experimental sessions. Each entry in the signal data corresponds to a specific time point, reflecting the magnitude of the electrical activity at that moment. The sEMG signal data are stored in the Excel spreadsheet (.xlsx) under column headers of the form Avanti sensor 5: EMG.A 5 [V] .
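
For loading one participant’s raw recording, a minimal Matlab sketch might look as follows. The column headers match the formats quoted above, and 'VariableNamingRule','preserve' keeps the bracketed units intact; this is a plausible workflow, not one prescribed by the authors.

    % Sketch: read the raw sEMG time and one signal column for a subject.
    T = readtable('Subject01.xlsx', 'VariableNamingRule', 'preserve');
    t      = T.('Time [s]');                      % time stamps [s]
    emgRaw = T.('Avanti sensor 5: EMG.A 5 [V]');  % one sensor channel [V]
    plot(t, emgRaw); xlabel('Time [s]'); ylabel('EMG [V]');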

In the sensor configuration, each sensor consists of four electrodes: the upper two form the differential sEMG pair, and the lower two serve as stabilizing references. This arrangement allows the sensor to respond quickly to disturbances detected on the skin surface, reducing the impact of potential noise sources 49 . Additionally, bar electrodes composed of 99.9% silver with a 10 mm inter-electrode distance (IED) were positioned to effectively minimize crosstalk from surrounding muscles and ensure good signal quality 50 .

Technical Validation

This section covers the sEMG sensor configuration and acquisition. For sEMG sensor placement and signal assessment, the guideline in the appendix “Interpretation of Muscle and Signal Quality Assessments” was used 39 . This guideline offers criteria for assessing signal quality for sensor placements on the most superficial muscles during isometric contractions. To ensure the reliability of the recorded sEMG signals, each muscle was scored against four criteria: signal amplitude above the background noise level, electrode placement that avoids innervation zones, fidelity of the recorded signal to the natural propagation of muscle activity, and feasibility of motor unit identification. Following the guideline, seven muscles (#1 (BR), #2 (BB), #3 (TB), #5 (UT), #7 (MT), #8 (LT), #9 (AD)) achieved the full score of 6 points, while one muscle (#4 (IS)) received 5 points. Because the guideline covers only 43 muscles in the trunk, upper limb, and lower limb, the paraspinal muscle of the cervical spine (#6 (PCS)), situated in the neck region, was not included in the scoring system. This scoring therefore validated the experimental setup, particularly the careful selection of muscle locations, and supported the resulting high-quality sEMG recordings. Additionally, muscle tissue exhibits anisotropic properties, which makes it important to align the detection surfaces of the electrodes with the orientation of the muscle fibers. To ensure accurate electrode placement, we collaborated with a sports physiotherapist and were informed by the relevant literature 39 .

To guarantee the functionality of the integrated sEMG sensor and to confirm the high quality of the sEMG data collected, various metrics describing the sEMG acquisition process were taken into account. These metrics and their detailed descriptions can be found in the guideline 46 .

Input source impedance : Input impedance refers to the resistance to current flow into each input terminal of an amplifier, and it varies with frequency. When dealing with dry skin, the input impedance at the interface between the skin and the detection surface can range from thousands to millions of ohms. Maximizing the input impedance of the differential amplifier is important to avoid signal loss or distortion due to input loading. It allows the accurate capture of electrode voltages without disruption. Moreover, amplifiers with high input impedance help minimize contamination from unwanted power line interference. This consideration ensures reliable sEMG signal recording without causing issues in the differential amplifier.

Differential amplifier gain : The primary function of the amplifier is to take a weak electrical signal originating from the body and amplify it to a level suitable for recording and display on electronic devices. In this study, a commercial standalone sEMG sensor, the Delsys Trigno, with a gain of 1000, a value well within the accepted range, was used 51 . This high gain significantly improved the signal-to-noise ratio (SNR) of the sEMG signal and made it highly resilient to noise and interference.

Common-mode rejection ratio (CMRR) : The utilization of bipolar electrode arrangements is common with a differential amplifier, which effectively eliminates signals common to both electrodes. Typically, the common mode voltage, which is the signal common to both electrodes, is larger than the sEMG signal. The CMRR quantifies the differential amplifier’s accuracy in subtracting these common signals. Therefore, a high CMRR is essential to distinguish the sEMG signals from the background noise effectively. The Delsys commercial sensor utilized in this study shows a CMRR value exceeding 80 dB, which assures excellent signal quality 52 .

sEMG Bandwidth : The frequency band of sEMG lies between 20 and 450 Hz. A 4th-order Butterworth band-pass filter was used to achieve an effective sEMG frequency range of 20 Hz to 450 Hz, and a 2nd-order Butterworth band-stop filter with cut-off frequencies of 49 Hz and 51 Hz was used to remove power-line noise 47 .
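
In Matlab (Signal Processing Toolbox), filters with these specifications could be constructed as sketched below; zero-phase filtering with filtfilt is one common choice rather than the only valid one, and emgRaw refers to a raw channel such as the one loaded earlier.

    fs = 2148;                                 % sampling rate [Hz]
    % butter doubles the order for band-pass/band-stop designs, so
    % butter(2, ...) yields the 4th-order band-pass described above.
    [bBP, aBP] = butter(2, [20 450] / (fs/2), 'bandpass');
    [bBS, aBS] = butter(1, [49 51] / (fs/2), 'stop');  % 2nd-order notch
    emgFilt = filtfilt(bBP, aBP, emgRaw);      % band-limit to 20-450 Hz
    emgFilt = filtfilt(bBS, aBS, emgFilt);     % suppress 50 Hz line noise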

Inter-electrode distance (IED) : The size of the electrodes and the spacing between them strongly affect the sEMG signal. Larger detection areas and greater inter-electrode distances yield larger detected sEMG amplitudes; however, these dimensions must not be so large that they pick up crosstalk interference from neighboring muscles. Delsys sensors maintain a 10 mm IED, effectively reducing crosstalk while preserving sEMG signal amplitude 49 . This fixed IED ensures consistency and repeatability across experiments, maintaining data quality and the integrity of sEMG signal acquisition.

Motion artifact : Motion artifact is typically induced by movement of the sEMG sensor relative to the skin. It is an interference arising at the electrode-skin interface that contaminates signal quality, and it becomes especially challenging with dynamic muscle contractions or rapid body movements. To ensure reliable contact and stable recording in our experiment, participants performed isometric contractions without rapid body movements; this static-posture approach maintained reliable sensor-skin contact and minimized unwanted noise during data recording.

Moreover, the Delsys EMGworks Acquisition software incorporates a real-time Signal Quality Monitor tool that provides continuous feedback on each sensor’s signal quality, as shown in Fig. 7. In real time, it reports the estimated SNR on a 0-40 scale, baseline noise in the range 0-40 µVrms, and line or clipping interference on a 0-10 scale. Acceptable signal quality, indicated by the green area on the gauge panel, corresponds to an SNR greater than 1.2, baseline noise below 15 µVrms, and minimal line interference below 2. This real-time monitoring offers a dynamic way to ensure the quality of the recorded data throughout the experiment. Various factors related to signal quality and noise were also considered and verified, affirming that the experiment’s signal quality meets acceptable standards 53 .

Signal-to-noise ratio (SNR) : SNR is one of the most important quality measures of sEMG signal 54 . It quantifies the ratio between the sEMG signal recorded during muscle contraction and the baseline noise when the muscle is at rest. A higher SNR value indicates a more robust ability to reliably discriminate and extract sEMG data from unwanted noise.
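
Assuming a rest segment and a contraction segment have been identified, this ratio can be estimated as in the sketch below; the window indices are placeholders, and fs and emgFilt carry over from the filtering sketch above.

    % Sketch: SNR as RMS(contraction) / RMS(rest); indices are placeholders.
    restIdx  = 1:fs;                     % assumed 1 s of rest at the start
    contrIdx = 5*fs : 10*fs;             % assumed contraction window
    snrRatio = rms(emgFilt(contrIdx)) / rms(emgFilt(restIdx));
    snrdB    = 20 * log10(snrRatio);     % the same ratio expressed in dB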

sEMG baseline noise : To ensure the quality of an sEMG signal, it is important to establish the baseline noise of the system. According to the EMGworks software, Delsys sEMG systems typically exhibit baseline noise of less than 15 µV, within the acceptable range of 10-20 µV peak-to-peak reported in the literature 50 . The quality of the skin-electrode interface strongly influences the baseline noise level; therefore, before commencing data collection, we conducted a preparatory session to check that the baseline noise was within range and to ensure good contact between the electrodes and the skin.

Line interference noise : Noise at frequencies of 50 or 60 Hz, originating from power lines, fluorescent lights, and various electrical devices, is a common source of interference in sEMG recordings. The circuit design of the Delsys sEMG sensor effectively eliminates this issue.

Clipping : Signal saturation is a form of distortion that occurs when a signal surpasses a certain threshold, which can happen due to sensor detachment or excessively high sEMG signal amplitudes. To maintain signal integrity, it is crucial to monitor for clipping and to verify that the sEMG sensor and reference electrode are properly attached and connected. If required, the gain can be reduced or the sensor repositioned to lower the signal level, as recommended in the literature 55 . In our study, a gain of 1000 was used, an appropriate value for amplifying sEMG signals that stays within the acceptable range while avoiding clipping.
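
A basic saturation check can be implemented by counting samples near the input range limit, as in the sketch below; the full-scale value Vmax is a placeholder that would need to match the actual sensor range, not a specification from this study.

    % Sketch: flag clipping as samples at or near an assumed range limit.
    Vmax = 0.011;                             % placeholder full scale [V]
    clippedFrac = mean(abs(emgRaw) >= 0.99 * Vmax);
    if clippedFrac > 0
        warning('%.2f%% of samples appear clipped.', 100 * clippedFrac);
    end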

Furthermore, three different performance metrics were calculated to check signal quality. The drop in power density ratio (DPR) indicates whether the signal power spectrum is adequately peaked within the sEMG frequency range. The power spectrum deformation (PSD) measures the effect of disturbances on the spectrum of a signal with power above 20 Hz 56 . As shown in Table 7, both the DPR and PSD values indicate that the Delsys surface EMG sensors have adequate spectral peaking and are immune to high-frequency noise. Based on this analysis, a high level of signal quality was maintained throughout the experiment.

Subsequently, this section validates the sEMG measurements. One fundamental mathematical technique for analyzing signals is the Fourier transform, which decomposes a signal into a series of sine waves of varying frequencies. We first checked the frequency range of the sEMG signals, which is known to lie within [20 Hz, 450 Hz] 57 , 58 . Using a fast Fourier transform (FFT), we verified that all measured sEMG signals stayed within this range, validating the signals collected from the commercial sEMG sensors in this work. A visualization of the power distribution provides a comprehensive measure of how different frequencies contribute to the sEMG signal; an example frequency spectrum is shown in Fig. 8.
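
The following sketch illustrates this kind of FFT-based check on a single channel, reusing emgRaw and fs from the earlier sketches; it simply plots the single-sided spectrum so the 20-450 Hz concentration can be verified visually.

    % Sketch: single-sided amplitude spectrum of one sEMG channel.
    N = numel(emgRaw);
    Y = fft(emgRaw);
    f = (0:floor(N/2)) * fs / N;              % frequency axis [Hz]
    P = abs(Y(1:floor(N/2)+1)) / N;           % single-sided magnitude
    P(2:end-1) = 2 * P(2:end-1);
    plot(f, P); xlim([0 500]);
    xlabel('Frequency [Hz]'); ylabel('|EMG(f)|');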

Figure 8. Frequency spectrum of the sEMG signal detected during an isometric contraction (Subject 19).

Then, signal processing techniques were used to reduce noise contamination 55 and to extract features from the sEMG signals 59 . The signal processing was performed with custom-written Matlab scripts and included signal acquisition and pre-processing. The sEMG signal, sampled at 2148 Hz, was filtered with a digital bandpass filter with a passband of 20-450 Hz, chosen based on the FFT analysis to ensure that no critical information was lost during signal acquisition 54 . The filtered signal was then full-wave rectified and used for envelope analysis. The filtered signal is shown in Fig. 9.
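
The rectification and envelope steps could be sketched as follows; the 2 Hz low-pass cut-off for the linear envelope is an illustrative choice, not a value reported by the authors.

    % Sketch: full-wave rectification followed by a linear envelope.
    emgRect = abs(emgFilt);                       % full-wave rectification
    [bEnv, aEnv] = butter(2, 2 / (fs/2), 'low');  % 2 Hz low-pass (assumed)
    emgEnv = filtfilt(bEnv, aEnv, emgRect);       % smoothed envelope
    plot(t, emgFilt, 'b', t, emgEnv, 'r');        % cf. Fig. 9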

Figure 9. sEMG signal (blue) and pre-processed sEMG signal (red) for nine muscles of the upper body (Subject 19; brachioradialis (#1 (BR)), biceps brachii (#2 (BB)), triceps brachii (#3 (TB)), infraspinatus (#4 (IS)), upper trapezius (#5 (UT)), paraspinal cervical spine (#6 (PCS)), middle trapezius (#7 (MT)), lower trapezius (#8 (LT)), anterior deltoid (#9 (AD))).

Fatigue is a complex and widespread phenomenon that comes in various forms. It can be categorized as pathological or non-pathological, physical or mental, and can be evaluated subjectively or objectively. Various techniques have been employed to measure fatigue and energy levels. Some methods aim to gauge the impact of fatigue, such as reduced performance, while others aim to pinpoint the origins of fatigue, like muscle dysfunction.

The definitions of muscle fatigue are diverse, and they have not been definitively linked to concrete objective measures. This does not undermine the value of subjective and objective measures of fatigue but highlights the complexity of the phenomenon. While subjective measures of perceived muscle fatigue and objective measures using sEMG sensors are widely employed, the complexity of muscle fatigue persists. In addition to these conventional approaches, the palpation-based technique offers a possible measure linked to muscle fatigue. It introduces a tactile dimension, providing an alternative means to assess and understand muscle fatigue beyond the established subjective and objective measures.

Even though the integration of these measures remains unclear, both subjective and objective measurements are taken into account in the context of muscle fatigue, as they hold significance in assessing health and quality of life. In future research, it is crucial to bridge the gap between subjective and objective measures by considering multiple factors and conducting calibration studies. Additionally, there is a need for further investigations using hand-held dynamometers, experiments with heavier weights, and longer durations to enhance our understanding of compensatory muscle fatigue.

Usage Notes

To use the provided code, you need Matlab installed, preferably version R2021b or higher. Load the Matlab script file SciDataEMG.m , available from the link provided, for data processing and analysis. The dataset is categorized into three sub-groups: SubGroup1.mat comprises data from Subject 01 to Subject 10, SubGroup2.mat contains data for Subject 11 to Subject 20, and SubGroup3.mat includes data for Subject 21 to Subject 30. Select the relevant sub-group .mat file based on the subject and muscle of interest, and specify the desired subject_id and muscle_id . For instance, to analyze muscle #9 (AD) of Subject 09, load SubGroup1.mat and assign subject_id = 9 and muscle_id = 9. Executing these selections will generate the following plots: (1) the sEMG signal plot and (2) the sEMG signal and pre-processed sEMG signal plot.
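
Based on this description, a session might look like the sketch below. The script is assumed to read subject_id and muscle_id from the workspace; readme.pdf documents the exact interface, so treat this as illustrative.

    % Sketch of the documented workflow (illustrative, see readme.pdf).
    load('SubGroup1.mat');      % Subjects 01-10
    subject_id = 9;             % Subject 09
    muscle_id  = 9;             % muscle #9 (AD)
    run('SciDataEMG.m');        % produces the two plots described above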

Code availability

The custom-written code used for data acquisition and analysis in this paper can be downloaded from figshare 60 . The provided files contain the necessary scripts and functions for data acquisition and signal processing.

• readme.pdf with instructions for loading the dataset and running the code.

• SciDataEMG contains:

- Code ( SciDataEMG.m )

- The .mat files in the SciData dataset ( SubGroup1.mat , SubGroup2.mat , SubGroup3.mat ) contain summarized or processed data that can be loaded into Matlab for further analysis and visualization. To facilitate data management and analysis, the data from all thirty participants were consolidated into a summarized format using Matlab. The raw sEMG time and signal data for each subgroup of participants were saved in a .mat file (e.g., SciData/RawEMGData/MatlabData/SubGroup1.mat ) for computational efficiency, since the full 30-subject dataset is large. These files enable efficient processing and analysis with Matlab functions and tools.

- Results ( P_9_M_9 sEMG signal.png and P_9_M_9 pre-processed sEMG signal.png ), the plotted outputs of the code for a representative subject (Subject 09).

References

Wang, H. et al. Impaired static postural control correlates to the contraction ability of trunk muscle in young adults with chronic non-specific low back pain: A cross-sectional study. Gait & Posture 92, 44–50 (2022).

Wan, J.-J., Qin, Z., Wang, P.-Y., Sun, Y. & Liu, X. Muscle fatigue: general understanding and treatment. Experimental & Molecular Medicine 49 , e384–e384 (2017).

Garcia, M.-G., Läubli, T. & Martin, B. J. Long-term muscle fatigue after standing work. Human Factors 57 , 1162–1173 (2015).

Dugan, S. A. & Frontera, W. R. Muscle fatigue and muscle injury. Physical Medicine and Rehabilitation Clinics of North America 11 , 385–403 (2000).

Shankar, S., Kumar, N. & Hariharan, C. Ergonomic evaluation of ergonomically designed chalkboard erasers on shoulder and hand-arm muscle activity among college professors. International Journal of Industrial Ergonomics 84 , 103170 (2021).

Zhao, H., Seo, D. & Okada, J. Validity of using perceived exertion to assess muscle fatigue during back squat exercise. BMC Sports Science, Medicine and Rehabilitation 15 , 14 (2023).

Najm, W. I. et al . Content validity of manual spinal palpatory exams - a systematic review. BMC complementary and alternative medicine 3 , 1 (2003).

Nolet, P. S. et al . Reliability and validity of manual palpation for the assessment of patients with low back pain: a systematic and critical review. Chiropractic and manual therapies 29 , 33 (2021).

Thamsuwan, O., Galvin, K., Palmandez, P. & Johnson, P. W. Commonly used subjective effort scales may not predict directly measured physical workloads and fatigue in hispanic farmworkers. International Journal of Environmental Research and Public Health 20 , 2809 (2023).

Bailey, J. P., Dufek, J. S., Silvernail, J. F., Navalta, J. & Mercer, J. Understanding the influence of perceived fatigue on coordination during endurance running. Sports Biomechanics 19 , 618–632 (2020).

Beato, M., De Keijzer, K. L., Carty, B. & Connor, M. Monitoring fatigue during intermittent exercise with accelerometer-derived metrics. Frontiers in Physiology 10 , 780 (2019).

Farina, D., Merletti, R. & Enoka, R. M. The extraction of neural strategies from the surface emg: an update. Journal of Applied Physiology 117 , 1215–1230 (2014).

Enoka, R. M. Physiological validation of the decomposition of surface emg signals. Journal of Electromyography and Kinesiology 46 , 70–83 (2019).

Nugent, F. J. et al . The relationship between rowing-related low back pain and rowing biomechanics: a systematic review. British Journal of Sports Medicine 55 , 616–628 (2021).

Behm, D. G. et al . Non-local muscle fatigue effects on muscle strength, power, and endurance in healthy individuals: A systematic review with meta-analysis. Sports Med 51 , 1893–1907 (2021).

Kolind, M. et al . Effects of low load exercise with and without blood-flow restriction on microvascular oxygenation, muscle excitability and perceived pain. European Journal of Sport Science 23 , 542–551 (2022).

Lambert, B. et al . Blood flow restriction training for the shoulder: A case for proximal benefit. The American Journal of Sports Medicine 49 , 2716–2728 (2021).

Tabasi, A. et al . The effect of back muscle fatigue on emg and kinematics based estimation of low-back loads and active moments during manual lifting tasks. Journal of Electromyography and Kinesiology 73 , 102815 (2023).

Robinson, M., Lu, L., Tan, Y., Oetomo, D. & Manzie, C. Feature identification framework for back injury risk in repetitive work with application in sheep shearing. IEEE Transactions on Biomedical Engineering 70 , 616–627 (2023).

Webber, C. L. Jr., Schmidt, M. A. & Walsh, J. M. Influence of isometric loading on biceps EMG dynamics as assessed by linear and nonlinear tools. Journal of Applied Physiology (Bethesda, Md.: 1985) 78(3) (1995).

Guo, X. et al . A weak monotonicity based muscle fatigue detection algorithm for a short-duration poor posture using sEMG measurements. In 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC) , 2238–2241 (2021).

Thomas, S. J., Castillo, G. C., Topley, M. & Paul, R. W. The effects of fatigue on muscle synergies in the shoulders of baseball players. Sports Health 15 , 282–289 (2023).

Azman, M. Z. C., Mat Jusoh, M. A. & Khusaini, N. S. Detection of localized muscle fatigue by using wireless emg among track and field athletes. In Innovation and Technology in Sports: Proceedings of the International Conference on Innovation and Technology in Sports,(ICITS) 2022, Malaysia , 259–268 (Springer Nature Singapore, 2023).

Enoka, R. M. & Duchateau, J. Muscle fatigue: what, why and how it influences muscle function. The Journal of physiology 586 , 11–23 (2008).

Solomon, N. & Manea, V. Quantifying Energy and Fatigue: Classification and Assessment of Energy and Fatigue Using Subjective, Objective, and Mixed Methods towards Health and Quality of Life (Springer, Cham, 2022).

Völker, I., Kirchner, C. & Bock, O. L. On the relationship between subjective and objective measures of fatigue. Ergonomics 59 , 1259–1263 (2016).

Sarker, P., Norasi, H., Koenig, J., Hallbeck, M. S. & Mirka, G. Effects of break scheduling strategies on subjective and objective measures of neck and shoulder muscle fatigue in asymptomatic adults performing a standing task requiring static neck flexion. Applied Ergonomics 92 , 103311 (2021).

Holtzer, R. et al . Interactions of subjective and objective measures of fatigue defined in the context of brain control of locomotion. The Journals of Gerontology: Series A 72 , 417–423 (2017).

Lourenço, J. et al . Relationship between objective and subjective fatigue monitoring tests in professional soccer. International journal of environmental research and public health 20 , 1539 (2023).

Oberg, T., Sandsjö, L. & Kadefors, R. Subjective and objective evaluation of shoulder muscle fatigue. Ergonomics 37 , 1323–1333 (1994).

Rodrigues Armijo, P., Huang, C.-K., Carlson, T., Oleynikov, D. & Siu, K.-C. Ergonomics analysis for subjective and objective fatigue between laparoscopic and robotic surgical skills practice among surgeons. Surgical Innovation 27 , 81–87 (2019).

Morse, C. I., Onambele-Pearson, G., Edwards, B., Wong, S. C. & Jacques, M. F. Objective and subjective measures of sleep in men with muscular dystrophy. PLoS ONE 17 (2022).

Ge, H.-Y., Arendt-Nielsen, L. & Madeleine, P. Accelerated muscle fatigability of latent myofascial trigger points in humans. Pain Medicine 13 , 957–964 (2012).

Celik, D. & Yeldan, I. The relationship between latent trigger point and muscle strength in healthy subjects: a double-blind study. Journal of back and musculoskeletal rehabilitation 24 , 251–256 (2011).

Ptaszkowski, K., Wlodarczyk, P. & Paprocka-Borowicz, M. The relationship between the electromyographic activity of rectus and oblique abdominal muscles and bioimpedance body composition analysis - a pilot observational study. Diabetes, Metabolic Syndrome and Obesity: Targets and Therapy 12 , 2033–2040 (2019).

Roland, T. Motion artifact suppression for insulated EMG to control myoelectric prostheses. Sensors 20 , 1031 (2020).

Boyer, M., Bouyer, L., Roy, J.-S. & Campeau-Lecours, A. Reducing Noise, Artifacts and Interference in Single-Channel EMG Signals: A Review. Sensors 23 , 2927 (2023).

Wang, C. et al . Stretchable, Multifunctional Epidermal Sensor Patch for Surface Electromyography and Strain Measurements. Advanced Intelligent Systems 3 , 2100031 (2021).

Barbero, M., Merletti, R. & Rainoldi, A. Atlas of Muscle Innervation Zones. Understanding Surface Electromyography and Its Applications . (Springer Science and Business Media, Milan, Italy, 2012).

Hermens, H. J., Freriks, B., Disselhorst-Klug, C. & Rau, G. Development of recommendations for SEMG sensors and sensor placement procedures. Journal of Electromyography and Kinesiology 10 , 361–374 (2000).

Merletti, R. Standards for reporting emg data. International Society of Electrophysiology and Kinesiology (ISEK) 9 , 1 (1999).

Criswell, E. Cram’s introduction to surface electromyography (Jones and Bartlett Learning, Sudbury, 2011).

Merletti, R., Rainoldi, A. & Farina, D. Surface electromyography for noninvasive characterization of muscle. Exercise and Sport Sciences Reviews 29 , 20–25 (2001).

Bendtsen, L., Jensen, R., Jensen, N. K. & Olesen, J. Pressure-controlled palpation: a new technique which increases the reliability of manual palpation. Cephalalgia: an international journal of headache 15 (3), 205–210 (1995).

Clark, M., Lucett, S. & Sutton, B. G. NASM essentials of personal fitness training (Wolters Kluwer Health/Lippincott Williams & Wilkins, Philadelphia, 2012).

Delsys Incorporated. Trigno Wireless Biofeedback System User’s Guide (2021).

De Luca, C. J., Donald Gilmore, L., Kuznetsov, M. & Roy, S. H. Filtering the surface EMG signal: Movement artifact and baseline noise contamination. Journal of Biomechanics 43 , 1573–1579 (2010).

Lim, J., Lu, L., Goonewardena, K., Liu, Z. & Tan, Y. Assessment of self-reported, palpation, and surface electromyography dataset during isometric contraction - data records. figshare https://doi.org/10.6084/m9.figshare.24770868 (2023).

De Luca, C. J., Kuznetsov, M., Gilmore, L. D. & Roy, S. H. Inter-electrode spacing of surface emg sensors: reduction of crosstalk contamination during voluntary contractions. Journal of biomechanics 45 (3), 555–561 (2012).

Merletti, R. & Muceli, S. Tutorial. surface emg detection in space and time: Best practices. Journal of electromyography and kinesiology: official journal of the International Society of Electrophysiological Kinesiology 49 , 102363 (2019).

Zahak, M. Signal Acquisition Using Surface EMG and Circuit Design Considerations for Robotic Prosthesis (Intech, 2012).

Tankisi, H. et al . Standards of instrumentation of emg. Clinical neurophysiology: official journal of the International Federation of Clinical Neurophysiology 131 (1), 243–258 (2020).

Besomi, M. et al . Consensus for experimental design in electromyography (cede) project: Electrode selection matrix. Journal of electromyography and kinesiology: official journal of the International Society of Electrophysiological Kinesiology 48 , 128–144 (2019).

Adam, A. & De Luca, C. J. Firing rates of motor units in human vastus lateralis muscle during fatiguing isometric contractions. Journal of Applied Physiology 99 , 268–280 (2005).

Windhorst, U. & Johansson, H. Modern Techniques in Neuroscience Research (Springer Berlin Heidelberg, Berlin, Heidelberg, 1999).

Sinderby, C., Lindström, L. & Grassino, A. E. Automatic assessment of electromyogram quality. Journal of applied physiology (Bethesda, Md.: 1985) 79 (5), 1803–1815 (1995).

Date, S. et al . Brachialis muscle activity can be measured with surface electromyography: A comparative study using surface and fine-wire electrodes. Frontiers in Physiology 12 , 809422 (2021).

Shaw, L. & Bagha, S. Online emg signal analysis for diagnosis of neuromuscular diseases by using pca and pnn. International Journal of Engineering Science and Technology 4 , 4453–4459 (2012).

Lloyd, D. G. & Besier, T. F. An EMG-driven musculoskeletal model to estimate muscle forces and knee joint moments in vivo . Journal of Biomechanics 36 , 765–776 (2003).

Lim, J., Lu, L., Goonewardena, K., Liu, Z. & Tan, Y. Assessment of self-reported, palpation, and surface electromyography dataset during isometric contraction - code availability. figshare https://doi.org/10.6084/m9.figshare.24770883 (2023).

Acknowledgements

This work was supported in part by the Australian Research Council Linkage Project (LP220100417).

Author information

Authors and Affiliations

Department of Mechanical Engineering, The University of Melbourne, Parkville, 3010, Australia

Jihoon Lim, Jefferson Zhe Liu & Ying Tan

Department of Engineering Science, University of Oxford, Oxford, OX1 2JD, UK

Department of Population Health Sciences, King’s College London, London, UK

Elite Akademy Sports Medicine, Parkville, 3010, Australia

Kusal Goonewardena

Contributions

J.L., K.G. and Y.T. initiated, ideated, prepared, and led the experimental study and data collection. J.L., L.L. and Y.T. wrote the data analysis code, exported and analyzed the data, drafted all manuscript versions, and prepared the figures, tables, and data visualizations. J.Z.L. participated in designing and carrying out the study protocol and provided feedback on the manuscript.

Corresponding author

Correspondence to Ying Tan.

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

Cite this article

Lim, J., Lu, L., Goonewardena, K. et al. Assessment of Self-report, Palpation, and Surface Electromyography Dataset During Isometric Muscle Contraction. Sci Data 11, 208 (2024). https://doi.org/10.1038/s41597-024-03030-8

Received: 12 July 2023

Accepted: 31 January 2024

Published: 15 February 2024

DOI: https://doi.org/10.1038/s41597-024-03030-8

  • Open access
  • Published: 13 February 2024

Freezing of gait assessment with inertial measurement units and deep learning: effect of tasks, medication states, and stops

  • Po-Kai Yang 1,2,
  • Benjamin Filtjens 1,2,
  • Pieter Ginis 3,
  • Maaike Goris 3,
  • Alice Nieuwboer 3,
  • Moran Gilat 3,
  • Peter Slaets 2 &
  • Bart Vanrumste 1

Journal of NeuroEngineering and Rehabilitation volume 21, Article number: 24 (2024)

Freezing of gait (FOG) is an episodic and highly disabling symptom of Parkinson’s Disease (PD). Traditionally, FOG assessment relies on time-consuming visual inspection of camera footage. Therefore, previous studies have proposed portable and automated solutions to annotate FOG. However, automated FOG assessment is challenging due to gait variability caused by medication effects and varying FOG-provoking tasks. Moreover, whether automated approaches can differentiate FOG from typical everyday movements, such as volitional stops, remains to be determined. To address these questions, we evaluated an automated FOG assessment model with deep learning (DL) based on inertial measurement units (IMUs). We assessed its performance trained on all standardized FOG-provoking tasks and medication states, as well as on specific tasks and medication states. Furthermore, we examined the effect of adding stopping periods on FOG detection performance.

Twelve PD patients with self-reported FOG (mean age 69.33 ± 6.02 years) completed a FOG-provoking protocol, including timed-up-and-go and 360-degree turning-in-place tasks in On/Off dopaminergic medication states with/without volitional stopping. IMUs were attached to the pelvis and both sides of the tibia and talus. A temporal convolutional network (TCN) was used to detect FOG episodes. FOG severity was quantified by the percentage of time frozen (%TF) and the number of freezing episodes (#FOG). The agreement between the model-generated outcomes and the gold standard experts’ video annotation was assessed by the intra-class correlation coefficient (ICC).

For FOG assessment in trials without stopping, the agreement of our model was strong (ICC (%TF) = 0.92 [0.68, 0.98]; ICC(#FOG) = 0.95 [0.72, 0.99]). Models trained on a specific FOG-provoking task could not generalize to unseen tasks, while models trained on a specific medication state could generalize to unseen states. For assessment in trials with stopping, the agreement of our model was moderately strong (ICC (%TF) = 0.95 [0.73, 0.99]; ICC (#FOG) = 0.79 [0.46, 0.94]), but only when stopping was included in the training data.

A TCN trained on IMU signals allows valid FOG assessment in trials with/without stops containing different medication states and FOG-provoking tasks. These results are encouraging and enable future work investigating automated FOG assessment during everyday life.

Parkinson’s disease (PD) is a neurodegenerative disorder that affects over six million people worldwide [1]. One of the most debilitating symptoms associated with PD is freezing of gait (FOG), which develops in approximately 70% of PD patients over the course of their disease [2, 3]. Clinically, FOG is defined as a “brief, episodic absence or marked reduction of forward progression of the feet despite the intention to walk” and is often divided into three manifestations based on leg movement: (1) trembling: tremulous oscillations in the legs at 8–13 Hz; (2) shuffling: very short steps with poor clearance of the feet; and (3) complete akinesia: no visible movement in the lower limbs [1, 4]. While one patient can experience different FOG manifestations, their distribution varies widely among individuals, with trembling and shuffling being more common than akinetic freezing [5]. The unpredictable nature of FOG poses a significant risk of falls and injuries for PD patients [6, 7, 8], and it can also affect their mental health and self-esteem, leading to a lower quality of life [9]. To relieve the symptoms, dopaminergic medication such as Levodopa is mainly used [10]. FOG occurs more commonly during Off-medication states [11]; in contrast, FOG episodes in On-medication states are milder but may manifest differently, with more trembling [12].

To qualitatively assess FOG severity in PD patients and guide appropriate treatment, subjective questionnaires such as the Freezing of Gait Questionnaire (FOGQ) and the New Freezing of Gait Questionnaire (NFOGQ) are commonly used [13, 14]. Although these questionnaires may suffice to identify the presence of FOG, they are insufficient to objectively describe patients’ FOG severity and capture treatment effects, as they suffer from recall bias [15]: patients may not be completely aware of their freezing severity, frequency, or impact on daily life. These questionnaires are also poor instruments for intervention studies because their large test-retest variability results in extreme minimal detectable change values [15]. To objectively assess FOG severity, PD patients are asked to perform brief, standardized FOG-provoking tasks in clinical centers. Common tasks include the timed-up-and-go (TUG) [16], 180- or 360-degree turning while walking [17], and 360-degree turning-in-place (360Turn) [18]. The TUG is commonly used in clinical practice since it includes typical everyday motor tasks such as standing, walking, turning, and sitting, and, in combination with a cognitive dual-task, it has proven to provoke FOG reliably [19]. Recently, the 360Turn with a cognitive dual-task was also shown to be a practical and reliable way to provoke FOG for investigating therapeutic effects [20]. Adding a cognitive dual-task to both the TUG and the 360Turn test increases the cognitive load on individuals, which can result in more FOG events, making these tests more sensitive and perhaps more relevant measures of FOG severity in real-life situations [17, 19, 20].

The current gold standard to assess FOG severity during the standardized FOG-provoking tasks is via a post-hoc visual analysis of video footage [ 17 , 21 , 22 ]. This protocol requires experts to label FOG episodes and the corresponding FOG manifestations frame by frame [ 22 ]. Based on the frame-by-frame annotations, semi-objective FOG severity outcomes can be computed, such as the number of FOG episodes (#FOG) and the percentage time spent frozen (%TF), defined as the cumulative duration of all FOG episodes divided by the total duration of the walking task [ 23 ]. However, this procedure relies on time-consuming and labor-intensive manual annotation by trained clinical experts. Moreover, the inter-rater agreement between experts was not always strong [ 23 ], and the annotated #FOG between raters could also contain significant differences due to multiple short FOG episodes being inconsistently pooled into longer episodes [ 20 ].
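
For concreteness, the two severity outcomes can be computed from a frame-by-frame binary annotation as in the short Matlab sketch below (kept in the same language as the dataset tooling described earlier in this document; the FOG paper does not prescribe an implementation, so this is purely illustrative).

    % Sketch: FOG severity outcomes from a binary frame annotation
    % (1 = FOG, 0 = no FOG), sampled at the video frame rate.
    isFog  = [0 0 1 1 1 0 0 1 1 0];            % toy annotation
    pctTF  = 100 * sum(isFog) / numel(isFog);  % percentage time frozen
    d      = diff([0 isFog]);                  % episode onset markers
    numFog = sum(d == 1);                      % number of FOG episodes
    fprintf('%%TF = %.1f%%, #FOG = %d\n', pctTF, numFog);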

As a result, there is an interest in automated and objective approaches to assess FOG [5, 24–27]. Traditionally, automatic approaches detect FOG segments based on a predefined threshold for high-frequency spectra of the leg acceleration [28]. These techniques, however, are not specific to FOG, as they also produce positive detections for PD patients without FOG and even for healthy controls [29]. Additionally, since they rely on rapid leg movements, they may not detect episodes of akinetic FOG. As gait in PD is highly variable, there is increasing interest in deep learning (DL) techniques to model FOG [24, 27, 30–32]. Owing to their large parametric space, DL techniques can infer relevant features directly from the raw input data. As such, our group recently developed a new DL-based algorithm using marker-based 3D motion capture (MoCap) data [27]. However, marker-based MoCap is cumbersome to set up and is constrained to lab environments. Inertial measurement units (IMUs), which offer better portability, have therefore often been used to capture motion signals both in the lab and at home [33, 34] and are widely used for traditional sensor-based assessment of FOG [24, 31, 35, 36]. The multi-stage temporal convolutional neural network (MS-TCN) stands as one of the current state-of-the-art DL models, initially designed for frame-by-frame sequence mapping in computer vision tasks [37]. The MS-TCN architecture first generates an initial prediction using multiple temporal convolution layers and subsequently refines this prediction over multiple stages. In a recent study, a multi-stage graph convolutional neural network was developed specifically for 3D MoCap-based FOG detection; this work demonstrated that the refinement stages within the model effectively mitigate over-segmentation errors in FOG detection tasks [27]. These errors manifest as long FOG episodes being predicted as multiple short FOG episodes, degrading the FOG detection performance of DL models. Acknowledging the necessity of mitigating such errors, approaches like the post-processing step employed in [24] also smoothed and merged short FOG episodes in the predicted annotations generated by DL models. Consequently, a post-processing step on the FOG annotations produced by DL models emerges as an essential component.
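
As a hedged illustration of such a post-processing step, the sketch below merges predicted FOG episodes separated by very short non-FOG gaps; the 0.5 s gap threshold and the toy prediction are assumed values for demonstration, not parameters reported in the cited work.

    % Sketch: merge FOG episodes separated by very short non-FOG gaps.
    pred   = [0 1 1 0 1 1 1 0 0 0 1 1];   % toy binary model output
    fsAnn  = 4;                           % toy annotation rate [frames/s]
    maxGap = round(0.5 * fsAnn);          % merge gaps shorter than 0.5 s
    gapStart = 0;                         % index where a gap opened
    for i = 2:numel(pred)
        if pred(i) == 0 && pred(i-1) == 1
            gapStart = i;                 % a gap just opened
        elseif pred(i) == 1 && pred(i-1) == 0 && gapStart > 0
            if i - gapStart < maxGap
                pred(gapStart:i-1) = 1;   % close the short gap
            end
        end
    end
    disp(pred)                            % short gap at index 4 is merged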

Previous studies proposed automatic FOG detection models for FOG assessment in clinical settings by training and evaluating DL models on datasets that include multiple standardized FOG-provoking tasks measured during both On- and Off-medication states [24, 27, 31, 38]. However, despite the widespread clinical use of the 360Turn task, it remains uninvestigated whether DL models can adequately detect FOG in this task; this forms the first research gap. Additionally, whether training task-specific and medication-specific models yields better FOG detection performance than a single model trained on multiple tasks and both medication states has not been discussed in the literature; this forms the second gap.

Moreover, gait patterns and FOG severity can vary substantially among different FOG-provoking tasks [ 39 ] and medication states [ 40 , 41 ]. Prior studies have delved into the impact of medication states on FOG. For instance, researchers in [ 42 ] trained a model using a combined dataset of Off and On medication trials and then assessed the model’s performance on each medication state independently. This evaluation aimed to understand how the automatic detection of FOG outcomes derived from the model would respond to medication conditions known to influence FOG severity. Similarly, in [ 43 ], investigations were made to determine whether dopaminergic therapy affected the system’s ability to detect FOG. However, these studies have yet to explore the performance of DL models in detecting FOG in an unseen medication state compared to a model trained specifically on data collected from these medication states, which forms the third research gap. Here, “unseen” refers to conditions not included in the model’s training, such as training a model for 360Turn and evaluating its performance on TUG, or training exclusively on On medication data and testing on Off medication data. This gap is critical in evaluating the generalizability of DL models, probing whether their learned features can be robustly applied to new and unseen conditions, ultimately addressing the model’s adaptability beyond its original training context.

Additionally, although these standardized FOG-provoking tasks include walking and turning movements, similar to movements in real-life conditions, they do not include sudden volitional stops, which frequently occur during daily activities at home. Hence, it becomes crucial to be able to distinguish between FOG and volitional stops when transitioning toward at-home FOG assessment. These volitional stops usually do not include any lower limb movements and are often considered challenging to distinguish from akinetic freezing [ 44 ]. Although a previous study proposed using physiological signals, such as electrocardiography, to detect discriminative features for classifying FOG from voluntary stops [ 45 ], methods using motor signals to distinguish FOG from stops were seldom investigated. To the best of our knowledge, only limited studies proposed FOG detection or prediction on trials with stops using IMU signals [ 31 , 46 ]. However, while these studies developed models to detect FOG from data that contains voluntary stopping, they did not address the effect of including or excluding stopping instances during the model training phase on FOG detection performance, forming the fourth research gap.

To address the aforementioned gaps, this paper first introduced a FOG detection model to enable automatic FOG assessment on two standardized FOG-provoking tasks (i.e. the TUG task and the 360Turn task) based on IMUs. The model comprises an initial prediction block to generate preliminary FOG annotations and a subsequent prediction refinement block, designed to mitigate over-segmentation errors. Next, we evaluated whether a DL model trained for a specific task (TUG or 360Turn) or a specific medication state (Off or On) could better detect FOG than a DL model trained on all data. In essence, our aim was to ascertain whether DL models necessitate training on task-specific or medication state-specific data. Subsequently, we evaluated the FOG detection performance of DL models when applied to tasks or medication states that were not included during the model training phase. This analysis aims to assess the generalizability of DL models across unseen tasks or medication states. Finally, we investigated the effect of including or excluding stopping periods on detecting FOG by introducing self-generated and researcher-imposed stopping during standardized FOG-provoking tests. Both self-generated and researcher-imposed stops are hereinafter simply referred to as “stopping”. To this end, the contribution of the present manuscript is fourfold:

We proposed a FOG detection model for fine-grained FOG detection on IMU data, demonstrating its ability to effectively generalize across two distinct tasks and accommodate both medication states.

We show, for the first time, that FOG can be automatically assessed during the 360Turn task.

We show that the DL model cannot generalize to an unseen FOG-provoking task, thereby highlighting the importance of expressive training data in the development of FOG assessment models.

We show that the DL model can assess FOG severity with a strong agreement with experts across FOG-provoking tasks and medication states, even in the presence of stopping.

The study primarily focuses on evaluating the performance of a state-of-the-art model under different conditions, including different tasks, medication states, and stopping conditions, rather than introducing a novel FOG detection model. A comparison of various FOG detection models is provided in Appendix .

We recruited 12 PD patients in this study. Subjects were included if they subjectively reported on the NFOGQ having at least one FOG episode per day with a minimum duration of 5 s. The inclusion criterion was chosen to maximize the chance of capturing FOG in the lab-based assessment procedure. All subjects completed the Montreal Cognitive Assessment (MoCA) [ 47 ], Unified Parkinson’s Disease Rating Scale (UPDRS) [ 48 ], and Hoehn & Yahr (H&Y) Scale [ 49 ] for clinical assessments.

All subjects performed the TUG with 180-degree turns in both directions and a 1-min alternating 360Turn test during the assessments. In the TUG, participants were instructed to stand up from a chair, walk towards a mark placed 2.5 m from the chair, turn around the mark, walk back to the chair, and sit down. In the 360Turn, participants had to perform rapid alternating 360-degree turns in place for 1 min [ 20 ]. While measuring the standardized FOG-provoking tasks, we included a dual task to provoke more FOG episodes [ 19 , 20 ]. The dual task consisted of the auditory Stroop task [ 20 , 50 ], in which the words “high” and “low” were played from a computer in both a high- and low-pitched voice. Participants were instructed to name the pitch they heard and not repeat the word. The TUGs and 360Turn tests were grouped into one block (two TUG trials and one 360Turn trial). Each block of tests was measured with and without a dual task (6 trials). We also included measurements containing a self-generated or researcher-imposed stopping period to collect additional training data. Each block also contained stopping trials, in which the TUG was performed four times (twice with a stop in the straight-walking part and twice with a stop in the turning part), while the 360Turn was performed once. The block was repeated with self-generated and researcher-imposed stopping (10 trials). All aforementioned assessments were done first in the clinical Off-medication state (approximately 12 h after the last PD medication intake) and repeated in the same order in the On-medication state (at least 1 h after medication intake), resulting in 32 trials for each subject. The blocks in each session were performed in randomized order to counter potential fatigue or motor-learning effects that could lead to more or fewer FOG episodes in the later tests.

All participants were equipped with five Shimmer3 IMU sensors attached to the pelvis and both sides of the tibia and talus. All IMUs recorded at a sampling frequency of 64 Hz during the measurements. RGB videos were captured with an Azure Kinect camera at 30 frames per second for offline FOG annotation purposes. For synchronization purposes, trigger signals were sent at regular intervals of 1 s from the camera to an extra IMU that was connected by cable to the laptop and synced with the other five IMUs. FOG events were visually annotated at a frame-based resolution by a clinical expert, after which all FOG events were verified by another clinical expert using the Elan annotation software [ 22 ]. Annotators used the definition of FOG as a brief episode with the inability to produce effective steps [ 1 ]. Specifically, a FOG episode started only when the foot of the participant suddenly no longer produced an effective step forward and displayed FOG-related features [ 22 ]. The episode ended only when it was followed by at least two effective steps (these two steps are not part of the episode) [ 22 ]. Unlike previous studies that considered shuffling as one of the FOG manifestations [ 1 , 5 ], this study adopts a stricter definition of FOG that treats non-paroxysmal shuffling and festination as non-FOG events, although they are probably related to FOG given the increased cadence with small steps during walking. During model training and testing, these FOG-related events were considered non-FOG events.

FOG detection model architecture

The FOG detection model presented in this study consists of two components, as depicted in Fig. 1: (1) an initial prediction block responsible for generating FOG annotations from IMU signals, and (2) a prediction refinement block focused on reducing over-segmentation errors. We conducted comparisons among five FOG detection models for the initial prediction block. Two DL models, namely Long Short-Term Memory (LSTM) [ 51 ] and Temporal Convolutional Network (TCN) [ 52 ], along with three traditional machine learning models, i.e., Support Vector Machine, K-Nearest Neighbors, and eXtreme Gradient Boosting (XGBoost), were evaluated. The DL models were trained using the raw IMU signals of all five IMUs as input data, while the ML models were trained on 65 features [ 32 ] generated from the signals of the talus IMUs of both lower limbs. Ultimately, the TCN model outperformed the others and was chosen as the initial prediction block. The model comparison results are available in Appendix Table 9.

Fig. 1 Overview of the proposed FOG detection model architecture. Our proposed FOG detection model comprises two essential blocks: an initial prediction block and a prediction refinement block. The initial prediction block takes the six-dimensional signal of T samples from each of the five IMUs and generates initial predictions with the probabilities of positive (FOG) and negative (non-FOG) classifications for each sample within the input sequence. Consequently, the output sequence is structured as \(T \times 2\), representing the probabilities of the two classes. The prediction refinement block aims to refine the initial predictions. This block takes the initially predicted probabilities of the two classes as input and applies a smoothing process, removing over-segmentations and enhancing the overall prediction quality. The output of this refinement block is a refined prediction, also structured as \(T \times 2\), representing the probabilities of the two classes.

Similarly, we compared a pre-defined post-processing method [ 24 ] with a trained DL model [ 37 ] for prediction refinement. The pre-defined post-processing method merged FOG episodes that were fewer than 21 samples apart into a single FOG episode and relabeled FOG episodes shorter than 21 samples as non-FOG. The threshold of 21 samples was based on the observation that 95% of the FOG episodes in our dataset lasted longer than 0.33 s (21 samples at 64 Hz). The trained DL model outperformed the pre-defined post-processing method. Consequently, the trained DL model from [ 37 ] was chosen for prediction refinement. A comprehensive comparison between the pre-defined and learned refinement models, as well as between including and excluding a refinement model, is available in Appendix Table 10.
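For concreteness, such a pre-defined rule can be sketched as a small stand-alone routine. This is a minimal illustration under our own assumptions (the function and variable names are ours, and we assume gaps strictly shorter than 21 samples are merged); it is not the implementation from [ 24 ]:

```python
import numpy as np

def merge_and_prune(pred, min_len=21):
    """Sketch of the pre-defined post-processing: merge FOG episodes whose
    gap is shorter than `min_len` samples, then relabel FOG episodes shorter
    than `min_len` samples as non-FOG. `pred` is a 1-D binary array
    (1 = FOG, 0 = non-FOG) sampled at 64 Hz, so 21 samples ~ 0.33 s."""
    pred = np.asarray(pred).copy()

    def runs(x):
        # Start / end (exclusive) indices of contiguous 1-runs.
        padded = np.concatenate(([0], x, [0]))
        return (np.flatnonzero(np.diff(padded) == 1),
                np.flatnonzero(np.diff(padded) == -1))

    # 1) Merge episodes separated by a short gap.
    starts, ends = runs(pred)
    for prev_end, next_start in zip(ends[:-1], starts[1:]):
        if next_start - prev_end < min_len:
            pred[prev_end:next_start] = 1

    # 2) Relabel episodes that are too short to be FOG.
    starts, ends = runs(pred)
    for s, e in zip(starts, ends):
        if e - s < min_len:
            pred[s:e] = 0
    return pred
```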

Based on the conclusions drawn from the comparison presented in the Appendix, our proposed FOG detection model employs the TCN from [ 52 ] as the initial prediction block and the multi-stage TCN from [ 37 ] as the prediction refinement block. A comprehensive visualization of the detailed model architecture is provided in Appendix Fig. 6. Furthermore, specific hyperparameter settings for the two blocks can be found in Appendix Table 11.
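A minimal PyTorch sketch of how the two blocks compose is given below. The class and variable names are ours, and toy linear layers stand in for the actual TCN and multi-stage TCN (whose settings are given in Appendix Fig. 6 and Table 11); the sketch only illustrates the data flow and tensor shapes:

```python
import torch
import torch.nn as nn

class FOGDetector(nn.Module):
    """Composition sketch: an initial per-sample predictor followed by a
    refinement network; both produce (batch, T, 2) class probabilities."""
    def __init__(self, initial_block: nn.Module, refinement_block: nn.Module):
        super().__init__()
        self.initial_block = initial_block        # TCN-style network [ 52 ]
        self.refinement_block = refinement_block  # MS-TCN-style network [ 37 ]

    def forward(self, x):  # x: (batch, T, 30) raw IMU signals
        p = torch.softmax(self.initial_block(x), dim=-1)        # (batch, T, 2)
        return torch.softmax(self.refinement_block(p), dim=-1)  # (batch, T, 2)

# Toy stand-ins so the sketch runs end to end.
model = FOGDetector(nn.Linear(30, 2), nn.Linear(2, 2))
refined = model(torch.randn(1, 640, 30))
print(refined.shape)  # torch.Size([1, 640, 2])
```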

To evaluate the performance of the model, datasets were partitioned using a leave-one-subject-out (LOSO) cross-validation approach. LOSO cross-validation iteratively splits the data according to the number of subjects in the dataset: one subject is held out for evaluation while the others are used to train the model. This procedure was repeated until all subjects had been used for evaluation. This approach mirrors the clinically relevant scenario of FOG assessment in newly recruited subjects [ 53 ], where the model assesses FOG in unseen subjects. The results for all models shown in this study were averaged over all held-out subjects under the LOSO cross-validation approach.
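The LOSO loop itself is straightforward; the following toy sketch uses scikit-learn’s LeaveOneGroupOut with random stand-in data and a simple classifier purely to illustrate the splitting scheme (none of these stand-ins are the models evaluated in this study):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(346, 65))                 # toy stand-in for per-window features
y = rng.integers(0, 2, size=346)               # toy FOG / non-FOG labels
subjects = np.repeat(np.arange(12), 29)[:346]  # maps each row to one of 12 subjects

scores = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subjects):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(f1_score(y[test_idx], clf.predict(X[test_idx])))

# One score per held-out subject; report the average over unseen subjects.
print(f"LOSO mean F1: {np.mean(scores):.2f}")
```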

Experimental settings

Clinical setting To support FOG assessment in clinical settings, which typically do not include stopping, this study first investigated the overall and relative performance of a generic model trained across standardized FOG-provoking tasks that do not include stopping. Next, we assessed generalization across FOG-provoking tasks and medication states by studying the effect of including or excluding training data from a specific task or medication state on detecting FOG.

Towards the home setting To move towards FOG assessment in daily life where stopping frequently occurs, we trained and evaluated the performance of a generic model trained across trials with stopping. Next, we assessed the effect of including or excluding stopping periods on detecting FOG.

Naming convention The naming convention of all the DL models evaluated in this study, with their corresponding training data, is shown in Table 1. The generic model trained for clinical measurements (i.e., excluding stopping) was termed “Model_Clinical”. Models trained with less data variety (i.e., for a specific task or medication state) were termed “Model_TUG”, “Model_360Turn”, “Model_Off”, and “Model_On”. The generic model trained to work towards FOG assessment in daily life (i.e., including stopping) was termed “Model_Stop”. To compare the effect of stopping, we evaluated Model_Clinical and Model_Stop. To maintain a similar amount of FOG duration in the training data, Model_Stop was trained only on trials that included stopping.

From a clinical perspective, FOG severity is typically assessed in terms of percentage time frozen (%TF) and the number of detected FOG episodes (#FOG) [ 23 ]. This paper used %TF as the primary outcome and #FOG as a secondary outcome, based on previous studies [ 20 , 24 ]. To assess the agreement between the model predictions and the expert annotations for each of the two clinical metrics, we calculated the intra-class correlation coefficient (ICC) with a two-way random effects analysis (random trials, random raters), i.e., ICC(2,1), in which both the raters and the subjects are treated as random effects, meaning they are assumed to be a random sample from a larger population [ 54 ]. The ICCs between the model and the experts were calculated on a per-subject basis, with one %TF and one #FOG per subject; in other words, the %TF and #FOG were computed over all trials of each subject. The strength of the agreement was classified according to [ 55 ]: \(\ge 0.80\): strong; 0.6–0.79: moderately strong; 0.3–0.59: fair; and < 0.3: poor.
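For illustration, the subject-level ICC(2,1) can be computed with the pingouin package (listed later among the statistical software used); the per-subject values below are invented toy numbers, not study data:

```python
import pandas as pd
import pingouin as pg

# One %TF value per subject from each "rater" (model vs. experts); toy data.
df = pd.DataFrame({
    "subject": list(range(6)) * 2,
    "rater":   ["model"] * 6 + ["experts"] * 6,
    "tf":      [12.1, 3.4, 40.2, 0.5, 22.8, 9.9,
                10.0, 2.9, 38.5, 0.6, 20.1, 8.7],
})

icc = pg.intraclass_corr(data=df, targets="subject", raters="rater", ratings="tf")
# "ICC2" is the two-way random effects, absolute agreement, single-rater
# form, i.e., ICC(2,1).
print(icc.set_index("Type").loc["ICC2", ["ICC", "CI95%"]])
```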

From a technical perspective, the sample-wise F1 score (Sample-F1) is a metric commonly used in classification problems to evaluate the quality of a model’s predictions at the individual sample level. It provides a balanced measure of a model’s ability to identify positive and negative classes, which is especially relevant in FOG detection scenarios where the proportion of FOG samples is lower than that of non-FOG samples. Compared with metrics such as accuracy, specificity, and sensitivity, the F1 score is a more balanced measure for comparing models’ performances [ 56 ]. In binary classification, Sample-F1 is computed by comparing the predicted and true labels: each sample is classified as true positive (TP), false positive (FP), or false negative (FN) through a sample-wise comparison between the experts’ annotation and the model predictions. Sample-F1 is calculated with the formula: \(\text{Sample-F1} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}\)
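Equivalently, Sample-F1 can be computed directly from the per-sample label vectors, e.g., with scikit-learn (toy labels for illustration):

```python
from sklearn.metrics import f1_score

# Per-sample binary labels over a toy trial (1 = FOG, 0 = non-FOG).
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# TP = 3, FP = 1, FN = 1  ->  2*3 / (2*3 + 1 + 1) = 0.75
print(f1_score(y_true, y_pred))
```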

Additionally, the segment-wise F1-score at k (Segment-F1@k) proposed by Lea et al. [ 57 ] is a metric that penalizes over- and under-segmentation errors. It allows only minor temporal shifts of the predicted segment, resulting in a much stricter evaluation metric than sample-wise metrics such as Sample-F1 [ 27 ]. To compute Segment-F1@k, action segments are classified as TP, FP, or FN by comparing the intersection over union (IoU) to a pre-defined threshold k. The IoU is calculated as the length of the intersection of the predicted segment and the ground-truth segment divided by the length of their union. If the corresponding IoU of a predicted segment is larger than k, the predicted segment is a TP; otherwise, it is an FP. All unpaired ground-truth segments are considered FN. Based on previous studies [ 27 , 58 ], we set the IoU threshold k to 50% (Segment-F1@50). An example comparing %TF, #FOG, and Segment-F1@50 is shown in Fig. 2. The %TF and #FOG for both annotations are 40% and 2 for trial 1 and 10% and 1 for trial 2, resulting in a high ICC value of 1. However, the Segment-F1@50 is 0.67 for trial 1 and 0 for trial 2, resulting in an averaged Segment-F1@50 of 0.335. This example shows that although the ICC is widely used in previous studies to compare the inter-rater agreement of %TF and #FOG, it has the disadvantage of not penalizing shifted annotations, a problem that Segment-F1@50 overcomes. This study calculated one Sample-F1 and one Segment-F1@50 for each subject by averaging the Sample-F1 and Segment-F1@50 over all trials of that subject. The overall Sample-F1 and Segment-F1@50 under the LOSO cross-validation approach were calculated by averaging these metrics over all subjects.
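A sketch of Segment-F1@k is given below. It follows the definitions above, with one interpretive choice flagged explicitly: we count as FN only ground-truth segments that overlap no predicted segment at all, which is our reading of “unpaired” and is the convention that reproduces the worked example in Fig. 2 (FN = 0 in both trials); the reference implementation is that of Lea et al. [ 57 ]:

```python
import numpy as np

def segments(binary):
    """(start, end-exclusive) index pairs of contiguous 1-runs."""
    padded = np.concatenate(([0], np.asarray(binary), [0]))
    return list(zip(np.flatnonzero(np.diff(padded) == 1),
                    np.flatnonzero(np.diff(padded) == -1)))

def segment_f1_at_k(y_true, y_pred, k=0.5):
    gt, pred = segments(y_true), segments(y_pred)
    tp = fp = 0
    overlapped = set()  # ground-truth segments touched by any prediction
    for ps, pe in pred:
        best_iou = 0.0
        for i, (gs, ge) in enumerate(gt):
            inter = max(0, min(pe, ge) - max(ps, gs))
            union = (pe - ps) + (ge - gs) - inter
            if inter > 0:
                overlapped.add(i)
            best_iou = max(best_iou, inter / union)
        if best_iou > k:
            tp += 1  # prediction sufficiently overlaps some gt segment
        else:
            fp += 1
        # For clarity this sketch does not guard against two predictions
        # matching the same ground-truth segment.
    fn = len(gt) - len(overlapped)  # gt segments no prediction pairs with
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
```

Under this convention, the Fig. 2 example yields 0.67 for trial 1 and 0 for trial 2, matching the figure caption.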

Fig. 2 An example comparing ICC and the segment-wise F1 score. This toy example shows the annotations on two trials, with the ground-truth annotation in gray and the predicted annotation in yellow. The x-axis represents the timeline of the annotations. When calculating the agreement between the ground truth and the prediction, the %TF and #FOG are 40% and 2 for the first trial and 10% and 1 for the second trial, resulting in an ICC value of 1. On the other hand, for the Segment-F1@50 of the first trial, FP = 1 (the first FOG segment has an IoU below 50%), TP = 1 (the second FOG segment has an IoU over 50%), and FN = 0, resulting in a Segment-F1@50 of 0.67. For the second trial, FP = 1, TP = 0, and FN = 0, resulting in a Segment-F1@50 of 0. Thus, the mean Segment-F1@50 equals 0.335. This example shows the disadvantage of using the ICC of %TF and #FOG to measure the alignment between two annotations.

Based on the above discussion, when comparing the performance of different models, i.e., Model_TUG vs. Model_360Turn, Model_Off vs. Model_On, and Model_Clinical vs. Model_Stop, only Sample-F1 and Segment-F1@50 were used. In contrast, the agreement between the two generic models and the experts in terms of FOG severity outcomes was reported using the ICC values for %TF and #FOG.

Statistical analysis

The Bland-Altman plot [ 59 ] was applied to investigate the systematic bias of the %TF and #FOG between the prediction of Model_Clinical and the experts’ annotation. To investigate whether the difference in Sample-F1 and Segment-F1@50 for each subject between two DL models, i.e., Model_TUG vs. Model_360Turn, Model_On vs. Model_Off, and Model_Clinical vs. Model_Stop, was statistically significant, the paired Student’s t-test [ 60 ] was applied, with the number of pairs equal to the number of subjects evaluated with LOSO. The homogeneity of variances was verified in all metrics across subjects with Levene’s tests [ 61 ]. The Shapiro-Wilk test [ 62 ] was used to determine whether the variables were normally distributed across subjects. The significance level for all tests was set at 0.05. All analyses were performed using SciPy 1.7.11, bioinfokit 2.1.0, statsmodels 0.13.2, and pingouin 0.3.12, written in Python version 3.7.11.
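The tests above map directly onto SciPy calls; the per-subject scores below are invented toy numbers used only to show the calls (scipy.stats provides shapiro, levene, and ttest_rel, and the Bland–Altman bias and limits of agreement follow from the paired differences):

```python
import numpy as np
from scipy import stats

# Toy per-subject Sample-F1 for two models on the same 12 LOSO subjects.
f1_a = np.array([0.72, 0.65, 0.81, 0.58, 0.70, 0.66,
                 0.74, 0.61, 0.69, 0.77, 0.63, 0.68])
f1_b = np.array([0.55, 0.60, 0.70, 0.52, 0.66, 0.58,
                 0.64, 0.57, 0.60, 0.71, 0.59, 0.62])

print(stats.shapiro(f1_a - f1_b))   # normality of the paired differences
print(stats.levene(f1_a, f1_b))     # homogeneity of variances
print(stats.ttest_rel(f1_a, f1_b))  # paired Student's t-test, alpha = 0.05

# Bland-Altman summary for, e.g., per-subject %TF (toy values):
tf_model, tf_expert = np.array([12.1, 3.4, 40.2]), np.array([10.0, 2.9, 38.5])
d = tf_model - tf_expert
bias = d.mean()
loa = (bias - 1.96 * d.std(ddof=1), bias + 1.96 * d.std(ddof=1))
print(bias, loa)
```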

This section first describes the dataset characteristics. Next, we discuss the results of the automatic FOG assessments at two levels: (1) FOG detection for clinical measurements, with a discussion of the generalization of the FOG detection model and the effect of FOG-provoking tasks and medication states, and (2) FOG detection for moving towards daily life, with a discussion of the effect of stopping.

Dataset characteristics

Table 2 shows the clinical characteristics of the twelve PD patients. Participants varied in age and disease duration. According to Table 3, a total of 346 trials were collected. Freezing occurred in 38.43% of trials (133 out of 346), with an average %TF of 14.62% and a total of 530 observed FOG episodes. The mean duration of FOG episodes in the dataset was 3.01 s, with the shortest episode lasting 0.05 s and the longest 63.62 s. Based on the measurement protocol, 32 trials were collected for each subject; subjects with more than 32 trials had repeated measurements, and subjects with fewer than 32 trials were affected by technical difficulties.

The 346 trials in the dataset included 133 trials (81.11 min) collected within the clinical setting, i.e., trials without stopping, and 213 trials (100.60 min) with stopping included. According to Table  4 , all 133 trials without stopping were used to train Model_Clinical, while all 213 trials with stopping were used to train Model_Stop. Within the 133 trials without stopping, 89 TUG trials (35.99 min) were used to train Model_TUG, and 44 360Turn trials (45.11 min) were used to train Model_360Turn. Similarly, 67 Off-medication trials (45.75 min) were used to train Model_Off, and 66 On-medication trials (35.36 min) were used to train Model_On. These models were evaluated and discussed in the following sections.

Clinical setting: FOG detection

This study first trained and evaluated the proposed model for FOG detection in the standardized clinical setting (i.e., trials without stopping). The #FOG that Model_Clinical detected per subject varied from 3 to 80, amounting to 335 FOG episodes, while the %TF varied from 0.52 to 70.49%. When compared with the experts’ annotations, the model had a strong agreement in terms of %TF (ICC = 0.92, CI = [0.68, 0.98]) and #FOG (ICC = 0.95, CI = [0.72, 0.99]). The Bland–Altman plots shown in Fig. 3 revealed a systematic error in the model’s FOG severity estimates, with a mean bias of \(-4.06\) (CI = [ \(-7.41, -0.72\) ]) for %TF and \(-4.41\) (CI = [ \(-7.66, -1.17\) ]) for #FOG. For %TF, the limits of agreement (LOA) fell within the range of \(-14.40\) % (CI = [ \(-20.19, -8.59\) ]) to 6.26% (CI = [ \(-0.45, 12.05\) ]), indicating that the differences between the model and the experts can be expected to lie in the range of \(-14.40\) % to 6.26%. For #FOG, the LOA fell within the range of \(-14.43\) (CI = [ \(-20.04, -8.80\) ]) to 5.59 (CI = [ \(-0.02, 11.21\) ]), indicating that the differences between the model and the experts can be expected to lie in the range of \(-14.43\) to 5.59.

Fig. 3 Bland–Altman plots for the clinical metrics from Model_Clinical and the experts. The dots represent the difference in scores per patient on the y-axis (i.e., the model’s %TF or #FOG subtracted from the experts’ %TF or #FOG), plotted against the mean score per patient from the model and the experts on the x-axis. The orange shaded area represents the 95% CI for the mean bias, and the gray shaded area represents the 95% CI for the upper and lower limits of agreement. A negative mean error indicates that the model overestimates %TF and #FOG compared with the experts’ annotation.

Additionally, when evaluating all standardized trials (i.e., without stopping) within the dataset, results showed that 56.70% of the FP samples were annotated as FOG-related segments, i.e., shuffling and festination, meaning that the model tended to annotate FOG-related samples as FOG. According to the qualitative example of the model and experts’ annotations in Fig. 4, the model generally predicted broader FOG segments than the experts annotated, resulting in an apparent overestimation of %TF. The model also tended to split some expert-annotated FOG segments into two separate FOG segments, resulting in an apparent overestimation of #FOG.

Fig. 4 Overview of the annotations for four typical IMU trials from two patients. The four typical trials include annotations for IMU trials measured during four settings: a TUG in Off-medication (S3), b TUG in On-medication (S1), c 360Turn in Off-medication (S3), d 360Turn in On-medication (S1). The panels visualize the difference between the manual FOG segmentation by the clinical experts and the automated FOG segmentation by the DL model. The x-axis denotes the time of the trial in seconds. The gray regions indicate the experts’ annotated FOG, and the yellow regions indicate the model-annotated FOG. The color gradient visualizes the overlap or discrepancy between the model and experts’ annotations. The figure shows that the model generally annotated broader FOG events than the experts, resulting in the systematic error in %TF shown in Fig. 3.

Next, we assessed the relative performance of the generic model in detecting FOG for trials with a specific FOG-provoking task, medication state, or with and without stopping. As shown in Table 5, Model_Clinical had a strong agreement with the experts in terms of %TF (all ICCs > 0.92) and #FOG (all ICCs > 0.84). Results showed that it was more difficult for the model to detect FOG in 360Turn tests than in TUG in terms of the average Segment-F1@50 (360Turn: 0.45; TUG: 0.67) and Sample-F1 (360Turn: 0.58; TUG: 0.72). Similarly, it was more difficult for the model to detect FOG in Off trials than in On trials (Segment-F1@50: 0.55 vs. 0.64; Sample-F1: 0.65 vs. 0.69). However, these differences were not reflected in the ICC values for %TF and #FOG, which again shows the limited ability of such metrics to discriminate between models.

Generalization of the FOG detection model

We proceeded to evaluate how Model_Clinical performed in comparison to models specifically trained for particular FOG-provoking tasks and medication conditions, namely Model_TUG, Model_360Turn, Model_Off, and Model_On. As shown in Table 6, no differences in Segment-F1@50 or Sample-F1 were found between Model_Clinical and the corresponding condition-specific model: Model_TUG when testing on TUG trials, Model_360Turn on 360Turn trials, Model_Off on Off-medication trials, and Model_On on On-medication trials. While no significant differences emerged between Model_Clinical and the models specifically trained for distinct tasks and medication states, it is worth noting that the task-specific models exhibited higher F1 scores than the model trained on data with more variability.

Effect of FOG-provoking tasks and medication states

Next, we investigated the effect of including or excluding data from specific tasks or medication states. As shown in Table  7 , when testing on TUG trials, Model_TUG resulted in a statistically higher Segment-F1@50 (p < 0.005) and Sample-F1 (p < 0.005) than Model_360Turn. Similarly, when testing on 360Turn trials, Model_360Turn resulted in a higher Segment-F1@50 and Sample-F1 than Model_TUG, though the differences were not statistically significant. On the other hand, when testing on Off-medication trials, no difference was found between Model_Off and Model_On in terms of Segment-F1@50 (p = 0.952) and Sample-F1 (p = 0.957). Similarly, when testing on On-medication trials, no difference was found between Model_Off and Model_On in terms of Segment-F1@50 (p = 0.579) and Sample-F1 (p = 0.307). The results showed that DL models trained only on TUG trials could still detect FOG in 360Turn trials, while DL models trained only on 360Turn could not detect FOG in TUG trials. In contrast, DL models trained without trials for specific medication states could detect FOG on trials measured during unseen medication states. In other words, the data variance between different FOG-provoking tasks was more challenging to model than between different medication states.

Towards the home setting: FOG detection with stopping versus clinical ratings

To move towards FOG detection in daily life, we trained and evaluated the DL model Model_Stop on trials collected with stopping. As shown in Table 5, when compared with the experts’ annotations, Model_Stop had a strong agreement in terms of %TF (ICC = 0.95, CI = [0.73, 0.99]) and a moderately strong agreement in terms of #FOG (ICC = 0.79, CI = [0.46, 0.94]). Similar to FOG detection in the clinical setting, results showed that it was more difficult for the model to detect FOG in 360Turn tests than in TUG in terms of the average Segment-F1@50 (360Turn: 0.33; TUG: 0.65) and Sample-F1 (360Turn: 0.47; TUG: 0.67). Likewise, it was more difficult for the model to detect FOG in Off trials than in On trials (Segment-F1@50: 0.55 vs. 0.62; Sample-F1: 0.62 vs. 0.64).

Effect of stopping periods versus no stopping periods

Next, we investigated the effect of stopping periods on FOG detection by comparing the performance of DL models trained on trials with and without self-generated and researcher-imposed stopping, i.e., Model_Clinical and Model_Stop. According to the results shown in Table 8, when evaluating trials collected during standardized measurements, i.e., trials without stopping, no difference was found between Model_Clinical and Model_Stop in terms of Segment-F1@50 (p = 0.550) and Sample-F1 (p = 0.326).

When evaluating trials collected with stopping periods, the Segment-F1@50 for Model_Stop (mean = 0.60) was significantly higher than for Model_Clinical (mean = 0.39; p = 0.005). Similarly, the Sample-F1 for Model_Stop (mean = 0.65) was significantly higher than for Model_Clinical (mean = 0.44; p < 0.005). Additionally, among the 210 observed stops within the dataset, only 16 (7.61%) were mislabeled as FOG by Model_Stop, while 74 (35.23%) were annotated as FOG by Model_Clinical. These results indicate that the model trained on trials that include stopping could learn to differentiate stopping from FOG, resulting in a statistically higher Segment-F1@50 and Sample-F1 than the model trained without stopping.

This is the first study to show that a DL model using only five lower limb IMUs can automatically annotate FOG episodes frame by frame in a manner that matches how clinical experts annotate videos. Additionally, this study is the first to assess the FOG detection performance of a DL model during the dual-task 360Turn task, recently proposed as one of the most effective FOG-provoking tasks [ 20 ]. Two clinical measures were computed to evaluate the FOG severity predicted by the DL model trained for the clinical setting (Model_Clinical): the %TF and #FOG [ 23 ]. Model_Clinical showed a strong agreement with the experts’ observations for %TF (ICC = 0.92) and #FOG (ICC = 0.95). In previous studies, the ICC between independent raters on the TUG task was reported to be 0.87 [ 63 ] and 0.73 [ 23 ] for %TF and 0.63 [ 23 ] for #FOG, while for 360Turn, the ICC between raters was reported to be 0.99 for %TF and 0.86 for #FOG [ 20 ]. While the ICC values in previous studies varied depending on the specific tool and population being studied [ 64 ], our proposed model achieved comparable levels of agreement. This holds significant promise for future AI-assisted FOG annotation work, whereby the DL model annotates FOG episodes initially and the clinical expert verifies or adjusts only where required. Despite the high agreement with the experts, results showed that the model statistically overestimated FOG severity, with a higher %TF and #FOG than the experts when evaluating all trials. The overestimation of %TF and #FOG was partly due to FPs when predicting FOG-related movement, such as shuffling and festination, as FOG segments. This systematic overestimation resulted in relatively low F1 scores while maintaining a high ICC. Given that these FOG-related movements often lie on the boundary between freezing and non-freezing [ 45 ], it can be challenging for the model to annotate and categorize them in a manner consistent with nearby FOG episodes.

This study aimed to assess the generalization capabilities of DL models across various tasks and medication states by comparing the model trained on all tasks and medication states (Model_Clinical) against task-specific and medication-specific models (Model_TUG, Model_360Turn, Model_Off, and Model_On). Our results showed that task- and medication-specific models performed better than the general model, though these effects were not statistically significant. Moreover, when comparing the performance of the general model on different tasks and medication states, our results showed that it was more difficult for the model to detect FOG in 360Turn tests than in TUG in terms of the average Segment-F1@50 and Sample-F1. Our results also showed that it was more difficult for the model to detect FOG in Off-medication tests than in On-medication tests. Despite evaluating Model_Clinical on both tasks and medication states, our model exhibited relatively lower F1 scores than those reported in the FOG detection literature [ 32 , 51 , 65 ]. This discrepancy can be attributed to the challenging nature of our dataset, which notably contains a higher proportion of short FOG episodes, with 41.84% lasting less than 1 s. In comparison, the CuPiD dataset [ 66 ] has a proportion of 5.06%, while the dataset from [ 24 ] reports 0% of such short episodes. When comparing our FOG detection models with those proposed in the literature, detailed in the Appendix, we observed that these models struggled to properly detect FOG in our dataset, exhibiting lower Sample-F1 scores than our model. This disparity suggests that our dataset poses greater difficulty for automatic annotation, possibly due to the prevalence of numerous short episodes.

Our next evaluation focused on determining the extent to which a DL model trained exclusively on a single FOG-provoking task or medication state could generalize to unseen FOG-provoking tasks or medication states. Results showed that a model trained on one FOG-provoking task (i.e., TUG or 360Turn) detected FOG in that task better than a model not trained on it. Additionally, although previous studies have shown that gait patterns are altered after anti-Parkinsonian medication [ 40 , 41 ], our results showed that a model trained on one medication state could still detect FOG in the other medication state. As a result, we recommend caution when applying DL-based FOG assessment models to FOG-provoking tasks they were not explicitly trained on, whereas applying models trained on a different medication state does not show such discrepancies. This also has implications for future work toward daily-life FOG detection: training data needs to be diversified across the activities encountered during daily life. On the other hand, diversifying training data across medication states is unnecessary, making data collection more feasible, as future data can be collected in the On-medication state.

While existing approaches utilized DL models to detect FOG on standardized FOG-provoking tasks with IMUs [ 24 , 31 ], the models’ ability to distinguish FOG from stopping remained undetermined, which is critical for free-living assessment [ 45 ]. Therefore, voluntary and instructed stops were introduced into the standardized FOG-provoking tasks. When evaluating trials without stops, results showed no difference between the model trained without stops and the model trained with stops, indicating that adding stopping periods to the training data does not impair the DL model’s ability to detect FOG. Additionally, when evaluating trials with stops, results showed that, compared with the model trained without stops, the model trained with stops produced fewer false positives during stopping (16 compared to 74). While stops were considered difficult to distinguish from FOG using movement-related signals, especially for akinetic FOG [ 67 ], our model could still detect FOG in the presence of stops. Moreover, our results highlight the importance of including stopping in the training data.

Although this study has provided valuable insights, there are some limitations to acknowledge. The first limitation is that the videos in our dataset were annotated sequentially by two clinical experts: the first annotator’s work was verified and, if needed, corrected by the second annotator. As a result, we could not calculate an inter-rater agreement to compare our models’ annotations against. However, the literature shows that inter-rater agreement ranges from 0.39 to 0.99 [ 20 , 23 , 24 , 27 , 35 , 63 ] and that the differences between experts were sometimes due to minor differences between FOG and FOG-related movements. Our DL model’s agreement with the clinical experts exceeded those previously published inter-rater agreements, and, just as between experts, most of our model’s mispredicted FOG segments were marked as FOG-related segments by the experts. Future work could investigate the development of DL models that can better differentiate between FOG and FOG-related events. On the other hand, whether such differentiation is truly needed depends on the research or clinical question. The second limitation is that this study simulated free-living situations by asking patients to stop while performing standardized FOG-provoking tasks. Yet, free-living movement will contain substantially more variance (e.g., daily activities) than captured during our standardized tasks. Moreover, FOG severity during our tasks does not necessarily represent FOG severity in daily life [ 44 , 68 ]. Therefore, future work should establish the reliability of our approach on data measured in free-living situations. The third limitation is that this study showed that training DL models with trials that include stopping resulted in better performance in detecting FOG in trials that include stopping. However, whether DL models are able to distinguish between FOG and stopping for all manifestations of FOG (e.g., akinetic FOG) remains to be investigated. The fourth limitation is our choice of utilizing the complete sensor configuration, which includes all five IMUs. Previous research has compared various IMU positions and recommended an optimal technical setup comprising only three IMUs (specifically, the lumbar region and both ankles) [ 24 ]. We included the performance results of models trained with the 3-IMU configuration in the Appendix; the results demonstrate that there is no significant difference between the performance of models trained with five IMUs and with three IMUs. However, additional research is required to definitively establish the ideal sensor configuration for effective FOG detection in home environments. The fifth limitation is the small number of participants compared to other use cases in the DL literature. As this study evaluated the model with the LOSO cross-validation approach, the results still showed that the model could generalize learned features to unseen subjects. Moreover, despite the small number of subjects, the number of samples and FOG events in the dataset used in this study is comparable with the literature [ 27 , 31 ]. Future studies could evaluate automatic FOG assessment on larger datasets or across datasets. The sixth limitation is that the recruited PD patients subjectively reported having at least one FOG episode per day with a minimum duration of 5 s. While the proposed model works for these severe freezers, it remains to be verified whether the model also generalizes to mild freezers.

This paper introduced a DL model comprising an initial prediction block and a prediction refinement block for IMU-based FOG assessment, trained across two FOG-provoking tasks, both On- and Off-medication states, and trials containing stopping. We established that the proposed DL model achieved strong agreement with experts’ annotations on the percentage of time frozen and the number of FOG episodes. This highlights that a single DL model can be trained to generalize over FOG-provoking tasks and medication states for FOG assessment in a clinical setting. Additionally, our investigation revealed no significant difference between the model trained on all-encompassing data and the task- and medication-specific models. Moreover, we established that DL models should include specific FOG-provoking tasks in the training data in order to detect FOG in such tasks, while this is not necessary for different medication states. Finally, we established that the proposed model can still detect FOG in trials that contain stopping, though only when stopping is included in the training data. These findings are encouraging and enable future work to investigate FOG assessment during everyday life.

Availability of data and materials

The input set was imported and labeled using Python version 2.7.12 with the Biomechanical Toolkit (BTK) version 0.3 [ 71 ]. The model architecture was implemented in PyTorch version 1.2 [ 72 ] by adopting the public code repositories of MS-TCN [ 37 ] and VideoPose3D [ 52 ]. All models were trained on an NVIDIA GeForce RTX 2080 GPU using Python version 3.7.11. The datasets analyzed during the current study are not publicly available due to restrictions on sharing subject health information.

Abbreviations

  • PD: Parkinson’s disease
  • FOG: Freezing of gait
  • FOGQ: Freezing of Gait Questionnaire
  • NFOGQ: New Freezing of Gait Questionnaire
  • MoCA: Montreal Cognitive Assessment
  • UPDRS: Unified Parkinson’s Disease Rating Scale
  • H&Y: Hoehn & Yahr
  • ML: Machine learning
  • DL: Deep learning
  • TUG: Timed-up-and-go
  • 360Turn: 360-Degree turning-in-place
  • %TF: Percentage time spent frozen
  • #FOG: Number of FOG episodes
  • IMU: Inertial measurement unit
  • MoCap: Motion-captured
  • TCN: Temporal convolutional network
  • MS-TCN: Multi-stage temporal convolutional neural network
  • LOSO: Leave-one-subject-out
  • SD: Standard deviation
  • TP: True positive
  • TN: True negative
  • FP: False positive
  • FN: False negative
  • ICC: Intra-class correlation coefficient
  • CI: Confidence interval
  • BTK: Biomechanical Toolkit

Nutt JG, Bloem BR, Giladi N, Hallett M, Horak FB, Nieuwboer A. Freezing of gait: moving forward on a mysterious clinical phenomenon. Lancet Neurol. 2011;10:734–44. https://doi.org/10.1016/S1474-4422(11)70143-0 .

Perez-Lloret S, Negre-Pages L, Damier P, Delval A, Derkinderen P, Destèe A, Meissner WG, Schelosky L, Tison F, Rascol O. Prevalence, determinants, and effect on quality of life of freezing of gait in Parkinson disease. JAMA Neurol. 2014;71:884–90. https://doi.org/10.1001/JAMANEUROL.2014.753 .

Hely MA, Reid WGJ, Adena MA, Halliday GM, Morris JGL. The Sydney multicenter study of Parkinson’s disease: the inevitability of dementia at 20 years. Mov Disord. 2008;23:837–44. https://doi.org/10.1002/MDS.21956 .

Schaafsma JD, Balash Y, Gurevich T, Bartels AL, Hausdorff JM, Giladi N. Characterization of freezing of gait subtypes and the response of each to levodopa in Parkinson’s disease. Eur J Neurol. 2003;10:391–8. https://doi.org/10.1046/J.1468-1331.2003.00611.X .

Kondo Y, Mizuno K, Bando K, Suzuki I, Nakamura T, Hashide S, Kadone H, Suzuki K. Measurement accuracy of freezing of gait scoring based on videos. Front Hum Neurosci. 2022. https://doi.org/10.3389/FNHUM.2022.828355 .

Rudzińska M, Bukowczan S, Stożek J, Zajdel K, Mirek E, Chwała W, Wójcik-Pędziwiatr M, Banaszkiewicz K, Szczudlik A. Causes and consequences of falls in Parkinson disease patients in a prospective study. Neurol Neurochir Pol. 2013;47(5):423–30. https://doi.org/10.5114/ninp.2013.38222 .

Pelicioni PHS, Menant JC, Latt MD, Lord SR. Falls in Parkinson’s disease subtypes: risk factors, locations and circumstances. Int J Environ Res Public Health. 2019. https://doi.org/10.3390/IJERPH16122216 .

Paul SS, Canning CG, Sherrington C, Lord SR, Close JCT, Fung VSC. Three simple clinical tests to accurately predict falls in people with Parkinson’s disease. Mov Disord. 2013;28:655–62. https://doi.org/10.1002/MDS.25404 .

Moore O, Kreitler S, Ehrenfeld M, Giladi N. Quality of life and gender identity in Parkinson’s disease. J Neural Transm. 2005;112:1511–22. https://doi.org/10.1007/S00702-005-0285-5 .

Rizek P, Kumar N, Jog MS. An update on the diagnosis and treatment of Parkinson disease. CMAJ = journal de l’Association medicale canadienne. 2016;188:1157–65. https://doi.org/10.1503/CMAJ.151179 .

Barthel C, Mallia E, Debû B, Bloem BR, Ferraye MU. The practicalities of assessing freezing of gait. J Parkinson’s Dis. 2016;6:667. https://doi.org/10.3233/JPD-160927 .

Espay AJ, Fasano A, Nuenen BFLV, Payne MM, Snijders AH, Bloem BR. “On’’ state freezing of gait in Parkinson disease: a paradoxical levodopa-induced complication. Neurology. 2012;78:454. https://doi.org/10.1212/WNL.0B013E3182477EC0 .

Giladi N, Tal J, Azulay T, Rascol O, Brooks DJ, Melamed E, Oertel W, Poewe WH, Stocchi F, Tolosa E. Validation of the freezing of gait questionnaire in patients with Parkinson’s disease. Mov Disord. 2009;24:655–61. https://doi.org/10.1002/MDS.21745 .

Nieuwboer A, Rochester L, Herman T, Vandenberghe W, Emil GE, Thomaes T, Giladi N. Reliability of the new freezing of gait questionnaire: agreement between patients with Parkinson’s disease and their carers. Gait Posture. 2009;30:459–63. https://doi.org/10.1016/J.GAITPOST.2009.07.108 .

Hulzinga F, Nieuwboer A, Dijkstra BW, Mancini M, Strouwen C, Bloem BR, Ginis P. The new freezing of gait questionnaire: unsuitable as an outcome in clinical trials? Mov Disord Clin Pract. 2020;7:199–205. https://doi.org/10.1002/MDC3.12893 .

Mancini M, Priest KC, Nutt JG, Horak FB. Quantifying freezing of gait in Parkinson’s disease during the instrumented timed up and go test. In: Annual international conference of the IEEE engineering in medicine and biology society. IEEE Engineering in Medicine and Biology Society. Annual international conference 2012, 2012. p. 1198–201. https://doi.org/10.1109/EMBC.2012.6346151 .

Spildooren J, Vercruysse S, Desloovere K, Vandenberghe W, Kerckhofs E, Nieuwboer A. Freezing of gait in Parkinson’s disease: the impact of dual-tasking and turning. Mov Disord. 2010;25:2563–70. https://doi.org/10.1002/MDS.23327 .

Mancini M, Smulders K, Cohen RG, Horak FB, Giladi N, Nutt JG. The clinical significance of freezing while turning in Parkinson’s disease. Neuroscience. 2017;343:222. https://doi.org/10.1016/J.NEUROSCIENCE.2016.11.045 .

Çekok K, Kahraman T, Duran G, Çolakoğlu BD, Yener G, Yerlikaya D, Genç A. Timed up and go test with a cognitive task: correlations with neuropsychological measures in people with Parkinson’s disease. Cureus. 2020;12(9):e10604. https://doi.org/10.7759/cureus.10604 .

D’Cruz N, Seuthe J, Somer CD, Hulzinga F, Ginis P, Schlenstedt C, Nieuwboer A. Dual task turning in place: a reliable, valid, and responsive outcome measure of freezing of gait. Mov Disord. 2022;37:269–78. https://doi.org/10.1002/MDS.28887 .

Shine JM, Moore ST, Bolitho SJ, Morris TR, Dilda V, Naismith SL, Lewis SJG. Assessing the utility of freezing of gait questionnaires in Parkinson’s disease. Parkinsonism Related Disord. 2012;18:25–9. https://doi.org/10.1016/J.PARKRELDIS.2011.08.002 .

Gilat M. How to annotate freezing of gait from video: a standardized method using open-source software. J Parkinson’s Dis. 2019;9:821–4. https://doi.org/10.3233/JPD-191700 .

Morris TR, Cho C, Dilda V, Shine JM, Naismith SL, Lewis SJG, Moore ST. A comparison of clinical and objective measures of freezing of gait in Parkinson’s disease. Parkinsonism Related Disord. 2012;18:572–7. https://doi.org/10.1016/J.PARKRELDIS.2012.03.001 .

O’Day J, Lee M, Seagers K, Hoffman S, Jih-Schiff A, Kidziński Ł, Delp S, Bronte-Stewart H. Assessing inertial measurement unit locations for freezing of gait detection and patient preference. J NeuroEng Rehabil. 2022;19:1–15. https://doi.org/10.1186/S12984-022-00992-X/FIGURES/5 .

Hu K, Wang Z, Wang W, Martens KAE, Wang L, Tan T, Lewis SJG, Feng DD. Graph sequence recurrent neural network for vision-based freezing of gait detection. IEEE Trans Image Process Publ IEEE Signal Process Soc. 2019;29:1890–901. https://doi.org/10.1109/TIP.2019.2946469 .

Hu K, Wang Z, Mei S, Martens KAE, Yao T, Lewis SJG, Feng DD. Vision-based freezing of gait detection with anatomic directed graph representation. IEEE J Biomed Health Inform. 2020;24:1215–25. https://doi.org/10.1109/JBHI.2019.2923209 .

Filtjens B, Ginis P, Nieuwboer A, Slaets P, Vanrumste B. Automated freezing of gait assessment with marker-based motion capture and multi-stage spatial-temporal graph convolutional neural networks. J NeuroEng Rehabil. 2022;19:1–14. https://doi.org/10.1186/s12984-022-01025-3 .

Moore ST, MacDougall HG, Ondo WG. Ambulatory monitoring of freezing of gait in Parkinson’s disease. J Neurosci Methods. 2008;167:340–8. https://doi.org/10.1016/J.JNEUMETH.2007.08.023 .

Cockx H, Nonnekes J, Bloem BR, van Wezel R, Cameron I, Wang Y. Dealing with the heterogeneous presentations of freezing of gait: how reliable are the freezing index and heart rate for freezing detection? J Neuroeng Rehabil. 2023;20(1):53.

Filtjens B, Ginis P, Nieuwboer A, Afzal MR, Spildooren J, Vanrumste B, Slaets P. Modelling and identification of characteristic kinematic features preceding freezing of gait with convolutional neural networks and layer-wise relevance propagation. BMC Med Inform Decis Mak. 2021;21(1):341.

Bikias T, Iakovakis D, Hadjidimitriou S, Charisis V, Hadjileontiadis LJ. DeepFoG: an IMU-based detection of freezing of gait episodes in Parkinson’s disease patients via deep learning. Front Robot AI. 2021. https://doi.org/10.3389/FROBT.2021.537384 .

Shi B, Tay A, Au WL, Tan DML, Chia NSY, Yen SC. Detection of freezing of gait using convolutional neural networks and data from lower limb motion sensors. IEEE Trans Biomed Eng. 2022;69:2256–67. https://doi.org/10.1109/TBME.2022.3140258 .

Celik Y, Stuart S, Woo WL, Godfrey A. Wearable inertial gait algorithms: impact of wear location and environment in healthy and Parkinson’s populations. Sensors. 2021. https://doi.org/10.3390/s21196476 .

Komaris DS, Tarfali G, O’Flynn B, Tedesco S. Unsupervised IMU-based evaluation of at-home exercise programmes: a feasibility study. BMC Sports Sci Med Rehabil. 2022;14:1–12. https://doi.org/10.1186/s13102-022-00417-1 .

Mancini M, Shah VV, Stuart S, Curtze C, Horak FB, Safarpour D, Nutt JG. Measuring freezing of gait during daily-life: an open-source, wearable sensors approach. J NeuroEng Rehabil. 2021;18:1–13. https://doi.org/10.1186/s12984-020-00774-3 .

Pardoel S, Shalin G, Nantel J, Lemaire ED, Kofman J. Early detection of freezing of gait during walking using inertial measurement unit and plantar pressure distribution data. Sensors. 2021;21:2246. https://doi.org/10.3390/S21062246 .

Farha YA, Gall J. Ms-tcn: multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2019. https://doi.org/10.48550/arxiv.1903.01945 .

Marcante A, Marco RD, Gentile G, Pellicano C, Assogna F, Pontieri FE, Spalletta G, Macchiusi L, Gatsios D, Giannakis A, Chondrogiorgi M, Konitsiotis S, Fotiadis DI, Antonini A. Foot pressure wearable sensors for freezing of gait detection in Parkinson’s disease. Sensors. 2020;21:128. https://doi.org/10.3390/S21010128 .

Romijnders R, Warmerdam E, Hansen C, Welzel J, Schmidt G, Maetzler W. Validation of IMU-based gait event detection during curved walking and turning in older adults and Parkinson’s disease patients. J Neuroeng Rehabil. 2021. https://doi.org/10.1186/S12984-021-00828-0 .

Bryant MS, Rintala DH, Hou JG, Lai EC, Protas EJ. Effects of levodopa on forward and backward gait patterns in persons with Parkinson’s disease. Neurorehabilitation. 2011;29:247. https://doi.org/10.3233/NRE-2011-0700 .

Son M, Han SH, Lyoo CH, Lim JA, Jeon J, Hong KB, Park H. The effect of levodopa on bilateral coordination and gait asymmetry in Parkinson’s disease using inertial sensor. Npj Parkinson’s Dis. 2021;7:1. https://doi.org/10.1038/s41531-021-00186-7 .

Reches T, Dagan M, Herman T, Gazit E, Gouskova NA, Giladi N, Manor B, Hausdorff JM. Using wearable sensors and machine learning to automatically detect freezing of gait during a fog-provoking test. Sensors. 2020;20(16):4474. https://doi.org/10.3390/s20164474 .

Borzì L, Mazzetta I, Zampogna A, Suppa A, Olmo G, Irrera F. Prediction of freezing of gait in Parkinson’s disease using wearables and machine learning. Sensors. 2021;21(2):614. https://doi.org/10.3390/s21020614 .

Snijders AH, Nijkrake MJ, Bakker M, Munneke M, Wind C, Bloem BR. Clinimetrics of freezing of gait. Mov Disord. 2008;23:468–74. https://doi.org/10.1002/MDS.22144 .

John AR, Cao Z, Chen H-T, Martens KE, Georgiades M, Gilat M, Nguyen HT, Lewis SJG, Lin C-T. Predicting the onset of freezing of gait using EEG dynamics. Appl Sci. 2023;13(1):302. https://doi.org/10.3390/app13010302 .

Krasovsky T, Heimler B, Koren O, Galor N, Hassin-Baer S, Zeilig G, Plotnik M. Bilateral leg stepping coherence as a predictor of freezing of gait in patients with Parkinson’s disease walking with wearable sensors. IEEE Trans Neural Syst Rehabil Eng. 2023;31:798–805. https://doi.org/10.1109/TNSRE.2022.3231883 .

Nasreddine ZS, Phillips NA, Bédirian V, Charbonneau S, Whitehead V, Collin I, Cummings JL, Chertkow H. The Montreal cognitive assessment, MoCA: a brief screening tool for mild cognitive impairment. J Am Geriatr Soc. 2005;53(4):695–9. https://doi.org/10.1111/j.1532-5415.2005.53221.x .

Goetz CG, Tilley BC, Shaftman SR, Stebbins GT, Fahn S, Martinez-Martin P, Poewe W, Sampaio C, Stern MB, Dodel R, Dubois B, Holloway R, Jankovic J, Kulisevsky J, Lang AE, Lees A, Leurgans S, LeWitt PA, Nyenhuis D, Olanow CW, Rascol O, Schrag A, Teresi JA, van Hilten JJ, LaPelle N, Agarwal P, Athar S, Bordelan Y, Bronte-Stewart HM, Camicioli R, Chou K, Cole W, Dalvi A, Delgado H, Diamond A, Dick JP, Duda J, Elble RJ, Evans C, Evidente VG, Fernandez HH, Fox S, Friedman JH, Fross RD, Gallagher D, Goetz CG, Hall D, Hermanowicz N, Hinson V, Horn S, Hurtig H, Kang UJ, Kleiner-Fisman G, Klepitskaya O, Kompoliti K, Lai EC, Leehey ML, Leroi I, Lyons KE, McClain T, Metzer SW, Miyasaki J, Morgan JC, Nance M, Nemeth J, Pahwa R, Parashos SA, Schneider JS, Sethi K, Shulman LM, Siderowf A, Silverdale M, Simuni T, Stacy M, Stern MB, Stewart RM, Sullivan K, Swope DM, Wadia PM, Walker RW, Walker R, Weiner WJ, Wiener J, Wilkinson J, Wojcieszek JM, Wolfrath S, Wooten F, Wu A, Zesiewicz TA, Zweig RM. Movement disorder society-sponsored revision of the unified Parkinson’s disease rating scale (MDS-UPDRS): scale presentation and clinimetric testing results. Mov Disord. 2008;23:2129–70. https://doi.org/10.1002/MDS.22340 .

Hoehn MM, Yahr MD. Parkinsonism: onset, progression and mortality. Neurology. 1967;17:427–42. https://doi.org/10.1212/WNL.17.5.427 .

Kestens K, Degeest S, Miatton M, Keppler H. An auditory Stroop test to implement in cognitive hearing sciences: development and normative data. Int J Psychol Res. 2021;14:37. https://doi.org/10.21500/20112084.5118 .

Shalin G, Pardoel S, Lemaire ED, Nantel J, Kofman J. Prediction and detection of freezing of gait in Parkinson’s disease from plantar pressure data using long short-term memory neural-networks. J Neuroeng Rehabil. 2021;18(1):1–15.

Pavllo D, Feichtenhofer C, Grangier D, Auli M. 3D human pose estimation in video with temporal convolutions and semi-supervised training. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition 2019-June, 2018. p. 7745–54. https://doi.org/10.48550/arxiv.1811.11742 .

Saeb S, Lonini L, Jayaraman A, Mohr DC, Kording KP. The need to approximate the use-case in clinical machine learning. GigaScience. 2017;6:1–9. https://doi.org/10.1093/GIGASCIENCE/GIX019 .

McGraw KO, Wong SP. Forming inferences about some intraclass correlation coefficients. Psychol Methods. 1996;1(1):30–46. https://doi.org/10.1037/1082-989X.1.1.30 .

Chan YH. Biostatistics 104: correlational analysis. Singap Med J. 2003;44:614–9.

Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 2020;21:1–3. https://doi.org/10.1186/s12864-019-6413-7 .

Lea C, Flynn MD, Vidal R, Reiter A, Hager GD. Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR); 2017. https://doi.org/10.48550/arXiv.1611.05267 .

Filtjens B, Vanrumste B, Slaets P. Skeleton-based action segmentation with multi-stage spatial-temporal graph convolutional neural networks. IEEE Trans Emerg Top Comput. 2022. https://doi.org/10.1109/TETC.2022.3230912 .

Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;327:307–10. https://doi.org/10.1016/S0140-6736(86)90837-8 .

Gosset WS. The probable error of a mean. Biometrika. 1908;6:1–25. https://doi.org/10.1093/BIOMET/6.1.1 .

Brown MB, Forsythe AB. Robust tests for the equality of variances. J Am Stat Assoc. 1974;69:364–7. https://doi.org/10.1080/01621459.1974.10482955 .

Shapiro SS, Wilk MB. An analysis of variance test for normality (complete samples). Biometrika. 1965;52:591–611.

Walton CC, Mowszowski L, Gilat M, Hall JM, O’Callaghan C, Muller AJ, Georgiades M, Szeto JYY, Martens KAE, Shine JM, Naismith SL, Lewis SJG. Cognitive training for freezing of gait in Parkinson’s disease: a randomized controlled trial. NPJ Parkinson’s Dis. 2018. https://doi.org/10.1038/S41531-018-0052-6 .

van Hartskamp M, Consoli S, Verhaegh W, Petkovic M, van de Stolpe A. Artificial intelligence in clinical health care applications: viewpoint. Interact J Med Res. 2019;8(2): e12100. https://doi.org/10.2196/12100 .

Naghavi N, Miller A, Wade E. Towards real-time prediction of freezing of gait in patients with Parkinson’s disease: addressing the class imbalance problem. Sensors. 2019;19(18):3898.

Mazilu S, Hardegger M, Zhu Z, Roggen D, Tröster G, Plotnik M, Hausdorff JM. Online detection of freezing of gait with smartphones and machine learning techniques. In: 2012 6th international conference on pervasive computing technologies for healthcare (PervasiveHealth) and workshops. IEEE; 2012. p. 123–30.

Mancini M, Bloem BR, Horak FB, Lewis SJG, Nieuwboer A, Nonnekes J. Clinical and methodological challenges for assessing freezing of gait: future perspectives. Mov Disord. 2019;34:783–90. https://doi.org/10.1002/MDS.27709 .

Rahman S, Griffin HJ, Quinn NP, Jahanshahi M. The factors that induce or overcome freezing of gait in Parkinson’s disease. Behav Neurol. 2008;19:127–36. https://doi.org/10.1155/2008/456298 .

Reddi SJ, Kale S, Kumar S. On the convergence of adam and beyond. In: International conference on learning representations; 2018. https://openreview.net/forum?id=ryQu7f-RZ .

Li J. A two-step rejection procedure for testing multiple hypotheses. J Stat Plan Inference. 2008;138(6):1521–7.

Barre A, Armand S. Biomechanical toolkit: open-source framework to visualize and process biomechanical data. Comput Methods Programs Biomed. 2014;114:80–7. https://doi.org/10.1016/J.CMPB.2014.01.012 .

Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Köpf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S. PyTorch: an imperative style, high-performance deep learning library. Adv Neural Inf Process Syst. 2019;32:8026. https://doi.org/10.48550/arxiv.1912.01703 .

Acknowledgements

We thank the participants for their willingness to participate.

This study is funded by the KU Leuven Industrial Research Fund. PY was supported by the Ministry of Education (KU Leuven-Taiwan) scholarship. BF was supported by KU Leuven Internal Funds Postdoctoral Mandate PDMT2/22/046.

Author information

Authors and affiliations

eMedia Research Lab/STADIUS, Department of Electrical Engineering (ESAT), KU Leuven, Andreas Vesaliusstraat 13, 3000, Leuven, Belgium

Po-Kai Yang, Benjamin Filtjens & Bart Vanrumste

Intelligent Mobile Platforms Research Group, Department of Mechanical Engineering, KU Leuven, Andreas Vesaliusstraat 13, 3000, Leuven, Belgium

Po-Kai Yang, Benjamin Filtjens & Peter Slaets

Research Group for Neurorehabilitation (eNRGy), Department of Rehabilitation Sciences, KU Leuven, Tervuursevest 101, 3001, Heverlee, Belgium

Pieter Ginis, Maaike Goris, Alice Nieuwboer & Moran Gilat

Contributions

Study design by PY, BF, PG, AN, PS, and BV. Data analysis by PY. Design and implementation of the neural network architecture by PY, BF. Statistics by PY and BV. Subject recruitment, data collection, and data preparation by PG, MG1, MG2, and AN. The first draft of the manuscript was written by PY and all authors commented on subsequent revisions. The final manuscript was read and approved by all authors.

Corresponding author

Correspondence to Po-Kai Yang.

Ethics declarations

Ethics approval and consent to participate

The study was approved by the local ethics committee of the UZ/KU Leuven (S65059) and all subjects gave written informed consent.

Consent for publication

Not applicable.

Competing interests

The authors declare that there is no conflict of interest regarding the publication of this article.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

FOG detection model design

IMU-based FOG detection models typically adopt window-based methodologies, dividing an IMU trial into predefined windows and training a model to classify each window as non-FOG or FOG. Predicting FOG annotations frame by frame, as experts annotate, requires a sliding mechanism with a one-sample step size during evaluation. However, this sliding operation can lead to over-segmentation of FOG annotations [27, 58]. To mitigate these errors, researchers have proposed post-processing methods [24] that refine the initial annotations by eliminating predicted FOG episodes shorter than the smallest FOG episode in the dataset. Alternatively, employing a trained refinement model [27, 58] presents a more flexible approach that bypasses the need for such dataset-specific knowledge.

We propose a FOG detection model comprising an initial prediction block that generates initial FOG annotations and a prediction refinement block that smooths and refines them. We first compare five FOG detection models for the initial prediction block, and then compare two approaches for refining the initial predictions.

Problem definition

An IMU trial can be represented as \(X \in \mathbb{R}^{T \times C_{in}}\), where T is the number of samples and \(C_{in}\) is the input feature dimension (\(C_{in} = 30\) for 5 IMUs with triaxial acceleration and gyroscope signals). Each IMU trial X is associated with a ground-truth label vector \(Y \in \mathbb{R}^{T \times L}\), where L is the number of output classes, i.e., 2 for non-FOG and FOG. To generate predictions for each sample, the model learns a function \(f: X \rightarrow \hat{Y}\) that transforms a given input sequence \(X = x_0, \ldots, x_{T-1}\), where \(x_i \in \mathbb{R}^{1 \times C_{in}}\), into an output sequence \(\hat{Y} = \hat{y}_0, \ldots, \hat{y}_{T-1}\), where \(\hat{y}_i \in \mathbb{R}^{L}\), that closely resembles the manual annotations Y.
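To make these shapes concrete, the following minimal sketch (ours, not the authors' released code; all tensor names and the stand-in model are illustrative) checks that a mapping f takes a trial to sample-wise class probabilities:

```python
import torch

T, C_in, L = 6000, 30, 2               # samples, channels (5 IMUs x 6 axes), classes
X = torch.randn(T, C_in)               # one synthetic IMU trial
Y = torch.zeros(T, dtype=torch.long)   # per-sample ground-truth labels (0 = non-FOG)

def f(x: torch.Tensor) -> torch.Tensor:
    """Stand-in for the learned mapping f: X -> Y_hat."""
    logits = torch.randn(x.shape[0], L)   # a real model would compute these
    return torch.softmax(logits, dim=-1)  # Y_hat in R^{T x L}

Y_hat = f(X)
assert Y_hat.shape == (T, L)
```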

Initial prediction block

Our primary objective is to determine the state-of-the-art method for initial FOG prediction. FOG detection models are typically categorized into two main types: feature-based models, which extract predefined features from the IMU data within each window, and signal-based models, which operate directly on the raw data. Accordingly, we selected two signal-based models extensively employed in the FOG detection literature, an LSTM [51] and a TCN [27], both trained on raw IMU signals. Additionally, we evaluated three established traditional machine learning models commonly used for feature-based FOG detection: a support vector machine (SVM) with a radial basis function kernel, k-nearest neighbors (KNN), and XGBoost. These models were trained on 65 pre-defined features for FOG detection, as outlined in [32].

The comparisons were conducted on our dataset of 12 subjects, using the partition that excludes instances of stopping. This partition aimed to assess the FOG detection models specifically for clinical detection purposes. For both training and testing, each IMU trial was segmented into windows of length Q, generated with a step size of 1 sample. Every window was assigned the ground-truth label of the middle sample within that window. All window pairs generated from the dataset were used to train the models. During inference, we segmented each trial into T fixed-length sequences of length Q, which were processed by the model to generate T predictions per trial across the two classes L. In inference scenarios, the predicted output is thus a 2D matrix \(\hat{Y} \in \mathbb{R}^{T \times L}\). An example illustrating how windows extracted from different IMU trials are used during training and inference is shown in Fig. 5.

Fig. 5 Window generation during model training and inference. During training, windows generated from different trials were randomly chosen to train the DL model; the label assigned to each window corresponded to the experts' annotation of the middle sample within that window. During inference, windows were generated from the same IMU trial with a step size of 1 sample, and all generated windows were input into the DL model to produce sample-wise predictions
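A minimal sketch of this window generation (our illustration, not the released code; note that without edge padding the generator yields T − Q + 1 windows rather than T, so padding at the trial boundaries, not shown, would be needed to obtain exactly T predictions):

```python
import torch

def make_training_windows(x: torch.Tensor, y: torch.Tensor, Q: int):
    """Slide a window of length Q over one trial with a one-sample step.

    x: (T, C_in) IMU trial; y: (T,) integer labels (0 = non-FOG, 1 = FOG).
    Each window is labelled with the annotation of its middle sample.
    """
    windows, labels = [], []
    for start in range(x.shape[0] - Q + 1):
        windows.append(x[start:start + Q])
        labels.append(y[start + Q // 2])  # middle-sample label
    return torch.stack(windows), torch.stack(labels)

# Inference uses the same generator on a single trial; every window is fed to
# the model, yielding one prediction per sample.
```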

Neural network model design: signal-based models

For the DL models, we adopted typical architectures documented in the literature. The LSTM network configuration consisted of passing the input sequence through two bidirectional LSTM layers, each comprising 32 cells. This LSTM network transformed the input sequence of shape \(Q \times C_{in}\) into an internal representation of shape \(Q \times 32\) . Subsequently, an average pooling layer was employed for temporal pooling, resulting in an output of shape \(1 \times 32\) . The output was passed through a linear layer followed by a softmax layer, generating probabilities for the two classes \(1 \times L\) , where \(L=2\) .
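A sketch of this LSTM configuration in PyTorch (our reconstruction, not the authors' code). The text reports an internal representation of width 32; since a bidirectional nn.LSTM outputs twice its per-direction hidden size, the sketch folds the two directions back to 32 with a linear layer, which is one possible reading (using 16 cells per direction would be another):

```python
import torch
import torch.nn as nn

class LSTMDetector(nn.Module):
    """Bidirectional LSTM FOG detector (a sketch of the configuration above)."""

    def __init__(self, c_in: int = 30, hidden: int = 32, n_classes: int = 2):
        super().__init__()
        # two stacked bidirectional LSTM layers, 32 cells each
        self.lstm = nn.LSTM(c_in, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.fold = nn.Linear(2 * hidden, hidden)  # assumption: fold both directions to width 32
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                # x: (batch, Q, c_in)
        h, _ = self.lstm(x)              # (batch, Q, 2 * hidden)
        h = self.fold(h)                 # internal representation (batch, Q, 32)
        h = h.mean(dim=1)                # temporal average pooling -> (batch, 32)
        return torch.softmax(self.head(h), dim=-1)  # class probabilities (batch, n_classes)
```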

Regarding the TCN network, we used the architecture from [ 52 ]. The TCN architecture has a single TCN block comprising five temporal convolution layers. Employing a kernel size of 3, dimensionality of 32, and dilation rates designed to cover the sequence length Q , this TCN utilized valid convolutions, directly transforming the input sequence of shape \(Q \times C_{in}\) into an output of shape \(1 \times 32\) . The output was passed through a linear layer with a softmax activation function, generating probabilities for the two classes. The detailed model architecture, specifically elucidating how valid convolutions are executed within the TCN model, can be found in the original study [ 52 ] (Fig. 6 ).
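A minimal sketch of such a valid-convolution TCN (ours, not the code of [52]). The dilation rates are our assumption: powers of 3 over five kernel-3 layers give a receptive field of 243 samples, which is consistent with the 121-sample padding on both sides described in Fig. 6; the final temporal averaging down to a single feature vector is likewise an assumption:

```python
import torch
import torch.nn as nn

class TCNDetector(nn.Module):
    """Valid-convolution dilated TCN in the style of Pavllo et al. [52] (a sketch)."""

    def __init__(self, c_in: int = 30, width: int = 32, n_classes: int = 2):
        super().__init__()
        layers, in_ch = [], c_in
        for d in (1, 3, 9, 27, 81):      # assumed dilations; receptive field 243 samples
            layers += [nn.Conv1d(in_ch, width, kernel_size=3, dilation=d),  # no padding: 'valid'
                       nn.ReLU()]
            in_ch = width
        self.tcn = nn.Sequential(*layers)
        self.head = nn.Linear(width, n_classes)

    def forward(self, x):                # x: (batch, Q, c_in)
        h = self.tcn(x.transpose(1, 2))  # Conv1d expects (batch, channels, time)
        h = h.mean(dim=-1)               # collapse remaining time steps -> (batch, width)
        return torch.softmax(self.head(h), dim=-1)
```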

Fig. 6 Detailed architecture of the FOG detection model. The model comprises two blocks: an initial prediction block and a prediction refinement block. The initial prediction block uses the TCN proposed by Pavllo et al. [52], featuring five temporal convolution layers with valid convolutions; it transforms the input sequence (padded with 121 samples on both sides) of shape \((T+242) \times 30\) into an output of shape \(T \times 2\). The prediction refinement block, based on the multi-stage TCN architecture of Farha and Gall [37], refines the initial predictions. It comprises four stages of ResNet-style TCN, each containing eight temporal convolution layers with 'same' convolutions. Its output is a refined prediction, also of shape \(T \times 2\), representing the probabilities of the two classes

For both DL models, the experiments used the AMSGrad optimizer [69] with a learning rate of 0.0005, decayed by a factor of 0.95 after each epoch; the beta1 and beta2 parameters were set to 0.9 and 0.999, respectively. For consistency, the window size (Q) of both DL models was set to 256 samples, corresponding to a 4-s window. All DL models were trained for 50 epochs with a class-weighted categorical cross-entropy loss. Before training and testing, the six channels (triaxial accelerometer and gyroscope) of each IMU signal in every trial were centralized by subtracting each channel's mean value to remove constant bias.
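In PyTorch, this training recipe can be sketched as follows (the class weights and the stand-in model are placeholders; AMSGrad is exposed as Adam with the amsgrad flag):

```python
import torch

model = torch.nn.Linear(30, 2)  # stand-in for the LSTM/TCN detector sketched above

# AMSGrad [69] with lr 0.0005 and betas (0.9, 0.999)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005,
                             betas=(0.9, 0.999), amsgrad=True)
# multiply the learning rate by 0.95 after every epoch
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

# class-weighted categorical cross-entropy; these weights are placeholders --
# in practice they would be derived from the non-FOG/FOG sample counts
criterion = torch.nn.CrossEntropyLoss(weight=torch.tensor([1.0, 10.0]))

# per-trial centering: subtract each channel's mean to remove constant bias
x = torch.randn(6000, 30)               # hypothetical trial
x = x - x.mean(dim=0, keepdim=True)
```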

Machine learning model design: feature-based models

To compare the signal-based models with traditional FOG detection models that rely on pre-defined features, we selected three widely used ML models as representatives. All IMU trials were segmented into windows of length Q with a step size of 1, where Q equaled 64, 128, or 256 samples, corresponding to window sizes of 1, 2, and 4 s, respectively. From each IMU window we computed 65 features, following the methodology proposed in [32]. These features were derived from the IMU data of both lower limbs, yielding 130 features in total. Note that features generated from magnetometers were excluded from our study, as this sensor modality is absent from our dataset.
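The full 65-feature set of [32] is not reproduced here; the sketch below merely illustrates the general pattern of computing per-channel statistical and spectral features from one window (the specific features shown, and the 64 Hz sampling rate implied by 256 samples spanning 4 s, are our illustrative choices):

```python
import numpy as np

def window_features(w: np.ndarray, fs: float = 64.0) -> np.ndarray:
    """Illustrative per-channel features for one window w of shape (Q, channels).

    The study used the 65 features of [32]; only a generic pattern is shown here.
    """
    feats = []
    for ch in range(w.shape[1]):
        sig = w[:, ch]
        spectrum = np.abs(np.fft.rfft(sig)) ** 2
        freqs = np.fft.rfftfreq(len(sig), d=1.0 / fs)
        feats += [sig.mean(), sig.std(),
                  freqs[spectrum.argmax()]]  # dominant frequency
    return np.asarray(feats)
```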

For the SVM, we tuned the cost parameter (0.1, 1, 10, 100, 1000), gamma (1, 0.1, 0.01, 0.001, 0.0001), and window size Q (64, 128, 256). For KNN, tuning covered the number of neighbors (1 to 50), the distance metric (Manhattan or Euclidean), and window size Q (64, 128, 256). For XGBoost, we optimized the maximum depth (2 to 10), the number of estimators (60 to 220 in steps of 40), the learning rate (1, 0.1, 0.01, 0.001, 0.0001), and window size Q (64, 128, 256). The reported results cover only the best hyperparameter configuration of each ML model.
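A hedged sketch of how such a grid search could be wired up with scikit-learn and xgboost (our illustration: the study's evaluation itself used LOSO rather than GridSearchCV's default cross-validation, and the outer loop over Q is not shown):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

# the grids reported above; the window size Q was varied in an outer loop (not shown)
svm_grid = {"C": [0.1, 1, 10, 100, 1000],
            "gamma": [1, 0.1, 0.01, 0.001, 0.0001]}
knn_grid = {"n_neighbors": list(range(1, 51)),
            "metric": ["manhattan", "euclidean"]}
xgb_grid = {"max_depth": list(range(2, 11)),
            "n_estimators": list(range(60, 221, 40)),
            "learning_rate": [1, 0.1, 0.01, 0.001, 0.0001]}

searches = [GridSearchCV(SVC(kernel="rbf"), svm_grid, scoring="f1"),
            GridSearchCV(KNeighborsClassifier(), knn_grid, scoring="f1"),
            GridSearchCV(XGBClassifier(), xgb_grid, scoring="f1")]
# each search would be fitted as search.fit(features, labels) with
# features of shape (n_windows, 130) and binary window labels
```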

For evaluating and comparing model performance, we primarily reported the sample-wise F1 score. The Segment-F1@50 metric was omitted from the comparison of initial prediction models because of the extensive over-segmentation in their predictions: all models displayed uniformly low Segment-F1@50 scores, rendering a comparison uninformative. All models were evaluated with the LOSO cross-validation approach.
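For reference, one common formulation of Segment-F1@50 counts a predicted segment as a true positive when it overlaps an unmatched ground-truth segment with an IoU of at least 0.5. The sketch below is our rendering of that formulation, not the authors' evaluation code:

```python
import numpy as np

def segments(y):
    """(start, end) index pairs of contiguous FOG segments in a binary sequence."""
    segs, start = [], None
    for t, v in enumerate(y):
        if v and start is None:
            start = t
        elif not v and start is not None:
            segs.append((start, t)); start = None
    if start is not None:
        segs.append((start, len(y)))
    return segs

def segment_f1_at_50(y_true, y_pred):
    """Segment-wise F1 with a 50% IoU matching threshold (a sketch of the metric)."""
    gt, pr = segments(y_true), segments(y_pred)
    matched, tp = set(), 0
    for ps, pe in pr:
        best, best_iou = None, 0.0
        for i, (gs, ge) in enumerate(gt):
            inter = max(0, min(pe, ge) - max(ps, gs))
            union = max(pe, ge) - min(ps, gs)
            iou = inter / union if union else 0.0
            if iou > best_iou:
                best, best_iou = i, iou
        if best is not None and best_iou >= 0.5 and best not in matched:
            matched.add(best); tp += 1
    fp, fn = len(pr) - tp, len(gt) - tp
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 1.0
```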

To assess the models' performance on the sample F1 score, we conducted paired-sample t-tests comparing the best model against each of the other models, with the number of pairs equal to the number of subjects evaluated under LOSO. To control the type-I error rate, p-values were adjusted for multiple comparisons with the procedure of Li [70]. The significance level for all tests was set at 0.05.
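A sketch of this testing procedure (the per-subject F1 values below are made-up placeholders, and Holm's step-down method stands in for the two-step procedure of Li [70], which is not implemented here):

```python
import numpy as np
from scipy.stats import ttest_rel
from statsmodels.stats.multitest import multipletests

# hypothetical per-subject sample-F1 scores from LOSO (12 subjects; values are made up)
f1_tcn = np.array([.81, .77, .85, .74, .90, .68, .79, .83, .72, .88, .76, .80])
f1_svm = np.array([.70, .69, .78, .66, .84, .60, .71, .75, .65, .80, .68, .73])
f1_knn = np.array([.68, .66, .75, .64, .81, .58, .69, .73, .62, .78, .66, .70])

# paired t-test of the best model against each competitor
p_values = [ttest_rel(f1_tcn, other).pvalue for other in (f1_svm, f1_knn)]

# the study adjusted p-values with the two-step procedure of Li [70];
# Holm's method serves here purely as a stand-in
reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="holm")
```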

The results in Table 9 show that the TCN model achieves the highest F1 score. Statistical testing showed that the TCN's F1 score was significantly better than that of all feature-based ML models; no statistically significant difference was found between the TCN and the LSTM. The hyperparameter settings of the chosen model, the TCN [52], are detailed in Table 11.

Prediction refinement block

Previous studies have shown that ML models performing fine-grained FOG detection may split long freezing episodes into numerous smaller ones [27, 35], causing over-segmentation. After identifying the optimal initial prediction model, we compared the efficacy of two approaches for mitigating this issue: (1) a pre-defined smoothing approach outlined in [24], which does not involve training a model, and (2) a DL model trained without pre-defined information, as proposed in [27].

For our evaluation process, we chose the TCN model outlined in [ 52 ] as the initial prediction model.

Pre-defined post-processing method

We implemented the pre-defined post-processing method introduced in [24]. In its original form, this method merges model-identified FOG periods separated by a single sample into one FOG event and reclassifies FOG periods lasting only one sample as non-FOG. In our dataset, 95% of FOG episodes lasted longer than 0.33 s (21 samples). To retain at least 95% of FOG episodes after post-processing, the method therefore merged FOG episodes in the initial prediction that were 21 samples or fewer apart into a single episode, and relabeled FOG episodes shorter than 21 samples as non-FOG.
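A sketch of this rule applied to a binary per-sample prediction vector (our reconstruction of the logic of [24], with the gap threshold interpreted as "at most 21 samples"):

```python
import numpy as np

def runs(y, value):
    """(start, end) index pairs of maximal runs of `value` in a 1-D array."""
    out, start = [], None
    for t, v in enumerate(y):
        if v == value and start is None:
            start = t
        elif v != value and start is not None:
            out.append((start, t)); start = None
    if start is not None:
        out.append((start, len(y)))
    return out

def postprocess(y, min_len=21):
    """Pre-defined post-processing in the spirit of [24] (a sketch, not the exact code).

    First merge FOG episodes separated by at most `min_len` non-FOG samples,
    then relabel FOG episodes shorter than `min_len` samples as non-FOG.
    """
    y = np.asarray(y).copy()
    for s, e in runs(y, 0):              # short non-FOG gaps between episodes
        if e - s <= min_len and s > 0 and e < len(y):
            y[s:e] = 1
    for s, e in runs(y, 1):              # short FOG episodes
        if e - s < min_len:
            y[s:e] = 0
    return y
```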

Deep learning refinement method

The DL refinement method trains a DL model for prediction refinement. We applied the refinement model derived from the MS-TCN architecture originally proposed in [37]. In this model, the input sequence is processed by four ResNet-style TCN blocks, each consisting of eight layers with a kernel size of 3, a dimensionality of 32, and dilation rates doubling per layer (1, 2, 4, 8, 16, 32, 64, 128).
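A sketch of one such refinement stage in PyTorch (our reconstruction following the MS-TCN design of [37]; layer ordering and activation details may differ from the exact implementation):

```python
import torch
import torch.nn as nn

class DilatedResidualLayer(nn.Module):
    """One ResNet-style layer of the MS-TCN refinement stage [37] (a sketch)."""
    def __init__(self, width, dilation):
        super().__init__()
        self.conv = nn.Conv1d(width, width, kernel_size=3,
                              dilation=dilation, padding=dilation)  # 'same' convolution
        self.out = nn.Conv1d(width, width, kernel_size=1)

    def forward(self, x):
        return x + self.out(torch.relu(self.conv(x)))  # residual connection

class RefinementStage(nn.Module):
    """One of four stages; input and output are per-sample class probabilities."""
    def __init__(self, n_classes=2, width=32):
        super().__init__()
        self.inp = nn.Conv1d(n_classes, width, kernel_size=1)
        self.layers = nn.Sequential(*[DilatedResidualLayer(width, 2 ** i) for i in range(8)])
        self.head = nn.Conv1d(width, n_classes, kernel_size=1)

    def forward(self, p):                # p: (batch, n_classes, T)
        return torch.softmax(self.head(self.layers(self.inp(p))), dim=1)

# four stacked stages: each stage refines the previous stage's probabilities
refiner = nn.Sequential(*[RefinementStage() for _ in range(4)])
```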

The DL refinement model was trained with the AMSGrad optimizer [69] at a learning rate of 0.0005, decayed by a factor of 0.95 after each epoch; the beta1 and beta2 parameters were set to 0.9 and 0.999, respectively. The models were trained for 50 epochs with a combination of class-weighted categorical cross-entropy loss and the smoothing loss of [37], whose parameters \(\tau\) and \(\lambda\) were set to 4 and 0.15, respectively.
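The smoothing loss of [37] is a truncated mean squared error over adjacent frame-wise log-probabilities, with per-frame differences clamped at \(\tau\) so that isolated genuine transitions are not over-penalized. A sketch (our reconstruction):

```python
import torch
import torch.nn.functional as F

def smoothing_loss(logits: torch.Tensor, tau: float = 4.0) -> torch.Tensor:
    """Truncated MSE over frame-wise log-probabilities, after Farha and Gall [37].

    logits: (batch, n_classes, T). The previous frame is detached so the
    gradient only smooths the current frame toward its predecessor.
    """
    logp = F.log_softmax(logits, dim=1)
    delta = (logp[:, :, 1:] - logp[:, :, :-1].detach()).abs().clamp(max=tau)
    return (delta ** 2).mean()

# total loss: class-weighted cross-entropy plus lambda times the smoothing term
# loss = ce_loss + 0.15 * smoothing_loss(logits, tau=4.0)
```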

Model performance was evaluated with the sample F1 score and Segment-F1@50. To compare the models, paired-sample t-tests were conducted at a significance level of 0.05, assessing whether the two models differed on either F1 metric.

The results are summarized in Table 10. The trained DL model achieves a statistically higher Segment-F1@50 score than the pre-defined post-processing method; its Sample-F1 score was also higher, but that difference did not reach statistical significance. We further compared model predictions with and without a DL refinement approach as the prediction refinement block. As Table 10 shows, incorporating a prediction refinement block resulted in statistically higher Segment-F1@50 and Sample-F1 scores. These findings indicate that training a refinement model markedly improves the smoothness of the initial prediction and generalizes better than a pre-defined post-processing approach, which requires knowing the shortest FOG episode duration in a dataset to avoid over-smoothing and merging short predicted episodes. Based on these findings, this study adopted the trained DL model for post-processing instead of the pre-defined approach. The hyperparameter settings of the chosen model are detailed in Table 11.

Comparison of models trained with different IMU sensor positions

A prior study [24] proposed an optimal technical setup of three IMUs (lumbar and both ankles) after an extensive comparison of IMU configurations. We therefore compared our full 5-IMU sensor configuration with this recommended 3-IMU setup for FOG detection, training on trials without stopping. Our comparative study employed a model integrating the best-performing initial prediction model, the TCN from [52], with the refinement model from [37].

As shown in Table 12, the model trained with 5 IMUs achieves higher Segment-F1@50 and Sample-F1 scores than the model trained with 3 IMUs; however, neither difference was statistically significant.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article

Yang, PK., Filtjens, B., Ginis, P. et al. Freezing of gait assessment with inertial measurement units and deep learning: effect of tasks, medication states, and stops. J NeuroEngineering Rehabil 21 , 24 (2024). https://doi.org/10.1186/s12984-024-01320-1

Received: 09 May 2023

Accepted: 30 January 2024

Published: 13 February 2024

DOI: https://doi.org/10.1186/s12984-024-01320-1


Keywords: Temporal convolutional neural networks

Journal of NeuroEngineering and Rehabilitation

ISSN: 1743-0003
