Lori Lioce, DNP, FNP-BC, CHSE-A, CHSOS, FAANP, and Stephen Hetherman, EdD, describe the steps taken to develop a high-stakes OSCE that is reliable and quantifiable. This pilot was successful in advancing competency-based assessment of individual performance.
There are many variables in the execution of successful simulation, student learning, and retention. Implementation of objective structured clinical examinations (OSCEs) must begin with faculty and staff development and training. Only when faculty and staff collaborate in the design and operational implementation is the simulation fair to participants, the outcome measurable, and the simulation repeatable. The planning of high-stakes events is vital to a fair and reliable process.
This article describes use of Objectivity Plus’ Quantum software during an OSCE with graduate nursing students. This statistical simulation software was used to measure achievement of planned outcomes, calculate and account for instructor subjectivity, and measure attainment of individual participant competence.
This pilot included eight faculty members, four staff members, and 71 participants. All aspects of the OSCE event took place over seven hours. The experience is described in six steps: (1) Setting objectives; (2) OSCE clinical scenario development; (3) Faculty dry-run; (4) Refining performance measures; (5) Operational plan and participant implementation; and (6) Data analysis, debrief and quality improvement.
OSCE design began as a collaboration between the UAH College of Nursing faculty/staff and Objectivity Plus staff. Faculty chose Quantum, an objective, standardized, competency-based assessment software, for its ability to provide extensive validity evidence. Development began with an overview of, and training on, Objectivity Plus’ website to highlight features that validate participants’ competence.
In particular, training covered the administrative page for authoring a clinical scenario and the portal for managing user access and roles. Administrative tools allowed staff to create a schedule and to communicate with all participants.
The debrief features allowed immediate reporting of individual performance data to debrief the learner, and a live rating feature allowed staff to monitor test administration progress. The score report feature provides access to both aggregate and individual participant information.
Step 1: Set Objectives
In clinical nursing education, the competent demonstration of knowledge, skills, and attitudes is important as participants enter clinical practice today and throughout their academic and professional careers. As part of this OSCE, each participant should demonstrate his or her clinical competence throughout this standardized, competency-based assessment.
For this OSCE the participant should:
- Perform patient examination and formulate appropriate diagnosis and treatment plan
- Document and report findings
- Address the son’s questions about his mother’s status and concerns
- Address measures that can be taken to monitor patient’s improvement/decline
Step 2: OSCE Clinical Scenario Selection and Design
The Associate Dean for Graduate Programs created an OSCE task force with the FNP II course manager. The task force included the Associate Dean, the FNP course manager, one faculty member who had OSCEs in their educational preparation, two FNP clinical instructors, and a Certified Healthcare Simulation Educator-Advanced (CHSE-A). The purpose of the task force was to select and adapt a provider-level clinical scenario from MedEdPORTAL. Faculty from the task force refined the scenario, which increased expertise for implementation and expanded faculty development. Using Objectivity Plus’ administrative tool, the clinical scenario was parsed into forms for participant implementation, and performance measures were uploaded and tested.
Step 3: Faculty Dry-Run
Faculty serving as raters were trained to rate participants using the Quantum App on tablets provided by Objectivity Plus for the guided pilot. The App provided the OSCE scenario overview for the participant. Additional documents were developed using the authoring tool to standardize operations and decrease variability. These included a setup document for repeatability, directions to the participant, a patient chart, and clinical documentation sheets.
Step 4: Refining Performance Measures
A scheduled dry-run of the OSCE allowed raters to refine the performance measures. The raters were split into small groups to role-play the standardized patient or participate as the learners. The raters who were not part of the design checked off each performance measure (item) to test the design. This provided a fresh perspective to identify flaws in the scenario design, to divide broad items, and to draft additional specific items. A group debriefing of the dry-run allowed revision of, and consensus on, the items.
In the analysis and parsing of the original items, three additional items were written to assist with clarity and meaning. In total, 12 items were uploaded to the App forming a rating sheet for raters to use during the live OSCEs. This was a crucial step to allow for more accurate assessment and analysis of competencies.
Step 5: Operational Plan and Participant Implementation
The staff developed a rater rotation plan in which raters rotated and were paired with different raters three times. This design was chosen for this OSCE given the large number of participants, and the OSCE was scheduled for 6.5 hours with a break and a lunch (see Table 1). The schedule accommodated 71 participants in 20-minute rotations in four exam rooms. Each participant was assessed by two raters: their assigned clinical faculty and an independent second rater. Rater 1 was the facilitator/debriefer and was located near the OSCE exam room. Rater 2 was the independent rater and watched a live Zoom feed from another floor, which provided a location barrier to support independent rating.
This pilot’s rater rotation plan was less data intensive, but admittedly less precise, than a fully crossed research design. Nevertheless, there was enough overlap among raters to calculate the leniency/severity of all raters. Quantum’s software scaled the raters’ leniency/severity on the same interval scale (logit units) as items and participants. Objective measurement requires that rater leniency/severity levels be modelled and statistically controlled. As a result of this rater rotation plan, initial leniency/severity levels were calculated for each rater; these will serve as a baseline for all future OSCEs with these faculty and should increase objectivity. Subsequently, rating plans will be devised so that each participant is rated by one rater.
The OSCE was broken down into the following phases: 5 minutes to review the patient chart, 10 minutes for the patient encounter, and 5 minutes for direct feedback guided by the debrief report in the App.
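The timing described above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `rotation_plan` is hypothetical, and it assumes all four exam rooms run in parallel with no turnover time between participants, which the article does not state.

```python
import math

# Figures taken from the article.
PARTICIPANTS = 71
EXAM_ROOMS = 4
PHASES_MIN = {"chart review": 5, "patient encounter": 10, "feedback": 5}


def rotation_plan(participants, rooms, phases):
    """Return (slot length, number of rounds, total testing minutes).

    Assumption: rooms run in parallel and rounds follow one another
    with no gap for room turnover.
    """
    slot = sum(phases.values())               # minutes per participant
    rounds = math.ceil(participants / rooms)  # last round may be partly empty
    return slot, rounds, slot * rounds


slot, rounds, total = rotation_plan(PARTICIPANTS, EXAM_ROOMS, PHASES_MIN)
print(f"{slot}-minute slot, {rounds} rounds, {total / 60:.1f} hours of testing")
# prints: 20-minute slot, 18 rounds, 6.0 hours of testing
```

Under these assumptions, testing alone takes six hours, which leaves about 30 minutes of the 6.5-hour schedule for the break and lunch.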
Step 6: Data Analysis, Debrief and Quality Improvement
A virtual debriefing of score reports was completed. The rich results added vital data for program evaluation. Data included aggregate cohort and individual score reports with standard errors of measurement, which were automatically calculated by the software. The App significantly increased reporting ability while decreasing analysis time, improved immediate participant feedback, shortened the time needed to deliver scores to participants, and decreased the workload for data analysis. Additionally, Quantum placed participant assessment data on Benner’s novice-to-expert continuum (1984).
In Figure 1, Quantum’s Administrative Score Report aggregates the participants’ learning outcome data. Descriptive information is provided first: Who tested? How many tested? When did they test?
Next, two reliability estimates are calculated after each test administration, in compliance with measurement best practices (AERA, APA, & NCME, 2014). The KR(20), or Kuder-Richardson Formula 20, measures overall test reliability, with values typically between 0.0 and +1.0. The closer the KR(20) is to +1.0, the more reliable an exam is considered because its items do a good job of consistently distinguishing between higher- and lower-performing participants. In Figure 1, KR(20) = 0.97, indicating the items did an excellent job of distinguishing participants’ abilities.
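For readers who want to see how a KR(20) value is obtained, here is a minimal sketch of the standard formula, KR(20) = k/(k-1) × (1 - Σpq/σ²), where k is the number of items, p is the proportion of participants credited on an item, q = 1 - p, and σ² is the variance of total scores. It assumes each item is scored 0 or 1; the data are invented for illustration and are not the pilot’s data, and this is not Quantum’s code.

```python
def kr20(responses):
    """Kuder-Richardson Formula 20 for dichotomous (0/1) item scores.

    responses: one list per participant, one 0/1 entry per item.
    Uses the population variance of total scores.
    """
    n = len(responses)
    k = len(responses[0])
    totals = [sum(r) for r in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    # Sum of p*q across items, where p is the proportion scoring 1.
    pq_sum = 0.0
    for i in range(k):
        p = sum(r[i] for r in responses) / n
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / var_total)


# Hypothetical checklist results: 5 participants x 4 items.
sample = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(sample), 2))  # prints: 0.8
```

In this toy data set the items order the participants very consistently, so the coefficient is high even with only four items.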
We can also consider the reliability of the mean rating. The intraclass correlation ICC(1,2) measures agreement of ratings; it addresses the extent to which raters make essentially the same ratings. ICC(1,2) for raters in this OSCE is 0.83 with a 95% confidence interval of (0.646, 0.981), indicating that raters showed a very high level of agreement.
Now the program can have confidence in the OSCE test scores and can draw conclusions from the results, since the reliability of the data is known and the reliability estimates are greater than or equal to +0.70. As recognized by the Standards (AERA, APA, & NCME, 2014), the level of reliability of scores has implications for the validity of score interpretations.
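As a rough illustration of the ICC(1,k) calculation, the sketch below uses the usual one-way ANOVA form, (BMS - WMS) / BMS, where BMS is the between-participant mean square and WMS is the within-participant mean square. With k = 2 ratings per participant this corresponds to ICC(1,2). The scores are hypothetical, and the function is not taken from Quantum.

```python
def icc_1k(ratings):
    """ICC(1,k): reliability of the mean of k ratings, one-way random model.

    ratings: one list per participant, each holding k ratings.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(r) for r in ratings) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    # Between-participant mean square.
    bms = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    # Within-participant mean square (disagreement among raters).
    wms = sum(
        (x - m) ** 2 for r, m in zip(ratings, row_means) for x in r
    ) / (n * (k - 1))
    return (bms - wms) / bms


# Hypothetical total scores (out of 12) from Rater 1 and Rater 2.
scores = [[10, 9], [7, 8], [12, 11], [5, 6], [9, 9]]
print(round(icc_1k(scores), 2))  # prints: 0.96
```

The one-way form is a natural fit when participants are rated by different pairs of raters, as in a rotation plan, because rater identity is not treated as a separate factor.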
Figure 2 shows a sample roster from the Quantum administrative score report. These aggregate data may be used for numerous purposes (e.g., course review, faculty audit, accreditation documentation of participant evaluation). Data may be sorted by each heading label. The arrow in the blue box (in the far right column) is a dynamic link to the individual participant’s score report. (Note: These were not given to the participants; they were retained only as part of the participant evaluation and treated as protected test information so that the OSCE blueprint is not compromised.)
Individual participant score reports will be used to map performance over time and follow measurement best practices (see Figure 3). All participant score reports are:
- Personalized: they immediately address the participant’s question, “How did I do?”
- Readable: colors and shapes assist in interpreting the score report.
- Actionable: they answer the participant’s question, “What do I need to do next?” with customized remediation and specific debriefing.
When learning objectives are mapped accordingly, performance may be mapped over time. These score reports may be used to show participant competence, audit preceptors and program concepts, and provide faculty feedback.
Quantum’s quality control statistics monitored all variables in the competency-based assessment since overlap was maintained by the rater rotation plan throughout the test administration. A chi-square test for homogeneity was performed to determine whether the eight raters were equally lenient/severe. The eight raters were not equally lenient/severe, χ²(7) = 4.53, p < .05. As shown in Figure 4, Rater #2 was most severe and Rater #6 was most lenient.
Raters, even when highly trained, do not make equal assessments of equal performances, hence the need for an algorithm that accounts for rater severity/leniency. Quantum’s algorithm accounts for rater severity/leniency and item difficulty before calculating a participant’s test score, which levels the assessment for the participant regardless of which rater completes the evaluation.
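The article does not disclose Quantum’s algorithm, so the following is only a sketch of the general idea, assuming a many-facet Rasch-style model in which participant ability, item difficulty, and rater severity all sit on the same logit scale. The function names and every number below are hypothetical.

```python
import math


def p_success(ability, difficulty, severity):
    """Probability that an item is rated as met, with all inputs in logits.

    Assumed model: log-odds = ability - item difficulty - rater severity.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty - severity)))


def expected_raw_score(ability, difficulties, severity):
    """Expected number of items rated as met for one participant and rater."""
    return sum(p_success(ability, d, severity) for d in difficulties)


# Hypothetical values: 12 items, one participant, two raters.
difficulties = [-1.0 + 0.2 * i for i in range(12)]
ability = 1.0
severe_rater, lenient_rater = 0.5, -0.5

print(round(expected_raw_score(ability, difficulties, severe_rater), 1))
print(round(expected_raw_score(ability, difficulties, lenient_rater), 1))
```

The same participant is expected to earn a lower raw score from the severe rater than from the lenient one. Estimating ability with severity in the model, rather than reporting the raw count, is what keeps the result from depending on which rater happened to be assigned.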
Seven raters showed excellent intra-rater decision consistency, allowing for normal variability. Rater #4 showed evidence of inconsistency. In fact, Rater #4 had 30% more variability in her ratings than the other seven raters. This means that Rater #4 occasionally gave highly unexpected ratings and needs additional development for high-stakes assessment. Quantum’s software is unique in that it provides diagnostic data that were previously unattainable in everyday simulation evaluation and often unquantifiable without rigorous research studies.
The objectives, specific to the FNP II participant, were selected to assess knowledge, skills, and attitudes. An item analysis of the technical quality of the performance measures showed appropriate fit, allowing for normal variability. Four performance measures need to be rewritten for more specific assessment, as each was written too broadly in the original case. One example is, “Did the participant incorporate current guidelines in clinical practice to formulate an age/gender appropriate plan of care?”
Furthermore, Quantum’s software examined the spread of performance measures along the variable of FNP II. Six statistically distinct regions of item difficulty, as distinguished by the participants, were identified. All gaps in content representation will need to be investigated by looking at the unique ordering of item difficulties and their individual standard errors. Gathering validity evidence is an ongoing process, and all data gathered support test plans and test score interpretation.
Overall feedback from all participants served as a vital component for improving the process and procedures for full implementation in all clinical graduate courses:
- Liked having two raters per participant
- It would have worked out better if all faculty raters had been on the same floor
- Everyone is eager to see the statistics
- Faculty loved rating one participant at a time
- Rater 2 had some technical difficulties with camera angles, which prevented faculty from giving a thorough evaluation. Learning new equipment could have contributed
- Rotation schedule was managed very effectively
- The 10-minute patient encounter time period went well; it was beneficial for participants to learn that a 10-minute patient encounter time frame was not unrealistic
- Faculty feedback on App improvement was given to Objectivity Plus staff
The pilot implementation was successful. The software allowed for comprehensive competency measurement, statistical analysis, and program evaluation simply and efficiently. The pilot helped advance the state of competency-based assessment because competency was measured reliably. The program can now draw evidence-based conclusions about the participants: what they are doing well and what needs improvement in relation to the performance measures. Faculty were extremely satisfied with the pilot’s evidence-based outcome measurement and have already begun refining and scheduling OSCEs for FNP III & IV.
Regarding participants, focused feedback was overwhelmingly appreciated. Specific individualized feedback not only increased satisfaction but also allowed participants to recognize areas where competency needed improvement.
From a program administration standpoint, this pilot demonstrates how valid and reliable assessment using statistical software such as Quantum can reveal patterns in raters’ scoring. Notably, the analysis yielded data to handle the practical issue of moderating scores to address rater differences, making the OSCE more objective. Truly putting the “O” in OSCE!
With so much time and so many resources invested in simulation programs, the return on investment has to include data that are robust, rich, and reliable! Achieving that requires a standardized, psychometrically sound process, which includes validity evidence, rater analysis, and measurable outcomes to justify the investment.
About the Authors
Dr. Lori Lioce serves as the Executive Director of the Learning & Technology Resource Center and a Clinical Associate Professor at The University of Alabama in Huntsville College of Nursing. She earned her DNP from Samford University and is an AANP Fellow. Her clinical background is concentrated in emergency medicine, and her research interests include simulation education and operations, healthcare policy, ethics, and substance use disorders. She currently serves on the Board of Directors for SimGHOSTS.
Dr. Steve Hetherman is a psychometrician and managing partner of Objectivity Plus. He earned his EdD from Teachers College, Columbia University. His measurement interests include classical test theory, Rasch measurement theory, simulation testing, and competency-based assessment.
1. American Educational Research Association [AERA], American Psychological Association [APA], National Council on Measurement in Education [NCME], & Joint Committee on Standards for Educational and Psychological Testing. (2014). Standards for educational and psychological testing. Washington, DC: AERA.
2. Benner, P. (1984). From novice to expert: Excellence and power in clinical nursing practice. Menlo Park: Addison-Wesley, pp. 13-34.
3. Hetherman, S.C., Lioce, L., Gambardella, L., & Longo, B. (2017). Development of Quantum, an Instructor-Mediated Performance Assessment Test, and Student Measure Validation. Journal of Nursing & Healthcare, 3(1), 1-10.
Originally published in Issue 2, 2019 of MT Magazine.