Putting the “O” in OSCE: An Objective Structured Clinical Examination to Measure Outcomes

Lori Lioce, DNP, FNP-BC, CHSE-A, CHSOS, FAANP, and Stephen Hetherman, EdD, describe the steps taken to develop a high-stakes OSCE that is reliable and quantifiable. This pilot was successful in advancing competency-based assessment of individual performance.

There are many variables in the execution of successful simulation, student learning, and retention. Implementation of objective structured clinical examinations (OSCEs) must begin with faculty and staff development and training. Only when faculty and staff collaborate in the design and operational implementation is the simulation fair to participants, the outcome measurable, and the simulation repeatable. Careful planning of high-stakes events is vital to a fair and reliable process.

Graduate Nursing faculty chatting during rounds. Image credit: Monkey Business Images/Shutterstock, Inc.

This article describes the use of Objectivity Plus’ Quantum software during an OSCE with graduate nursing students. This statistical simulation software was used to measure achievement of planned outcomes, calculate and account for instructor subjectivity, and measure attainment of individual participant competence.

This pilot included eight faculty members, four staff members and 71 participants. All aspects of the OSCE event took place over seven hours. The experience is described in six steps: (1) Setting objectives; (2) OSCE clinical scenario development; (3) Faculty dry-run; (4) Refining performance measures; (5) Operational plan and participant implementation; and (6) Data analysis, debrief and quality improvement.

OSCE design began as a collaboration between the UAH College of Nursing faculty/staff and Objectivity Plus staff. Faculty chose Quantum because this objective, standardized, competency-based assessment software can provide extensive validity evidence. Development began with an overview of, and training on, Objectivity Plus’ website to highlight features that validate participants’ competence.

In particular, the training covered the administrative page for authoring a clinical scenario and the portal for managing user access and roles. Administrative tools allowed staff to create a schedule and to communicate with all participants.

The debrief features allowed immediate reporting of individual performance data to debrief the learner, and a live rating feature made it possible to monitor test administration progress. The score report feature provides access to both aggregate and individual participant information.

Step 1: Set Objectives

In clinical nursing education, the competent demonstration of knowledge, skills, and attitudes is important as participants enter clinical practice today and throughout their academic and professional careers. As part of this OSCE, each participant should demonstrate his or her clinical competence throughout this standardized, competency-based assessment.

For this OSCE the participant should:


  • Perform patient examination and formulate appropriate diagnosis and treatment plan
  • Document and report findings
  • Address the son’s questions about his mother’s status and concerns
  • Address measures that can be taken to monitor patient’s improvement/decline

Step 2: OSCE Clinical Scenario Selection and Design

The Associate Dean for Graduate Programs created an OSCE task force with the FNP II course manager. The task force included the Associate Dean, the FNP course manager, one faculty member who had OSCEs in their educational preparation, two FNP clinical instructors, and a Certified Healthcare Simulation Educator-Advanced (CHSE-A). The purpose of the task force was to select and adapt a provider-level clinical scenario from MedEdPORTAL. Faculty from the task force refined the scenario, which increased expertise for implementation and expanded faculty development. Using Objectivity Plus’ administrative tool, the clinical scenario was parsed into forms for participant implementation, and performance measures were uploaded and tested.

Step 3: Faculty Dry-Run

Faculty serving as raters were trained to rate participants using the Quantum App on tablets provided by Objectivity Plus for the guided pilot. The App provided the OSCE scenario overview for the participant. Additional documents were developed using the authoring tool to standardize operations and decrease variability. These included a setup document for repeatability, directions to the participant, a patient chart, and clinical documentation sheets.

Step 4: Refining Performance Measures

A scheduled dry-run of the OSCE allowed raters to refine the performance measures. The raters were split into small groups to role-play the standardized patient or participate as the learners. The raters who were not part of the design checked off each performance measure (item) to test the design. This provided a fresh perspective to identify flaws in the scenario design, to divide broad items, and to draft additional specific items. A group debriefing of the dry-run allowed revision of, and consensus on, the items.

In the analysis and parsing of the original items, three additional items were written to improve clarity and meaning. In total, 12 items were uploaded to the App, forming a rating sheet for raters to use during the live OSCEs. This was a crucial step to allow for more accurate assessment and analysis of competencies.

Paired Rater Rotation Schedule. Image credit: Objectivity Plus.

Step 5: Operational Plan and Participant Implementation

The staff developed a rater rotation plan in which raters rotated and were paired with different raters three times. This design was chosen given the large number of participants, and the OSCE was scheduled for 6.5 hours with a break and a lunch (see Table 1). The schedule accommodated 71 participants in 20-minute rotations in four exam rooms. Each participant was assessed by two raters: their assigned clinical faculty and an independent second rater. Rater 1 was the facilitator/debriefer and was located near the OSCE exam room. Rater 2 was the independent rater and watched a live Zoom feed from another floor, which provided a location barrier to support independent rating.
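The timing of such a schedule can be checked with simple arithmetic. The sketch below assumes that all four exam rooms run in parallel and that every rotation starts on a fixed 20-minute grid; the function names are illustrative and are not part of Quantum.

```python
import math


def rotation_rounds(participants: int, rooms: int) -> int:
    """Rounds needed when each room examines one participant per round."""
    return math.ceil(participants / rooms)


def exam_minutes(participants: int, rooms: int, minutes_per_rotation: int) -> int:
    """Total examination time in minutes, excluding breaks and lunch."""
    return rotation_rounds(participants, rooms) * minutes_per_rotation


# Figures reported for this pilot: 71 participants, 4 rooms, 20-minute rotations.
rounds = rotation_rounds(71, 4)
minutes = exam_minutes(71, 4, 20)
print(f"{rounds} rounds, {minutes} minutes of examination time")
```

Under these assumptions the pilot needs 18 rounds, or 360 minutes (6 hours) of examination time, which fits within the 6.5-hour schedule once the break and lunch are added.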

This pilot’s rater rotation plan was less data intensive, but admittedly less precise, than a fully crossed research design. Nevertheless, the overlap was sufficient to calculate the leniency/severity of all raters. Quantum’s software scaled the raters’ leniency/severity in the same interval units (logits) as the items and participants. Objective measurement requires that rater leniency/severity levels be modelled and statistically controlled. As a result of this rater rotation plan, initial leniency/severity levels were calculated for each rater; these will serve as a baseline for all future OSCEs with these faculty and increase objectivity. Subsequently, rating plans will be devised so that each participant is rated by one rater.
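To illustrate what it means to place raters, items and participants on one logit scale, here is a minimal sketch of a dichotomous many-facet Rasch model. This is an assumption made for illustration: the article does not specify Quantum's algorithm, and the function name and numeric values are hypothetical.

```python
import math


def success_probability(ability: float, item_difficulty: float, rater_severity: float) -> float:
    """Probability that a rater credits an item, with all inputs in logits.

    Assumed model: log-odds = ability - item difficulty - rater severity.
    """
    logit = ability - item_difficulty - rater_severity
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical values: the same participant and item, seen by two different raters.
p_lenient = success_probability(ability=1.0, item_difficulty=0.0, rater_severity=-0.5)
p_severe = success_probability(ability=1.0, item_difficulty=0.0, rater_severity=0.5)
print(f"lenient rater: {p_lenient:.2f}, severe rater: {p_severe:.2f}")
```

Because severity enters the model as its own term, a participant's ability estimate can be separated from the luck of which rater happened to score the encounter.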

The OSCE was broken down into the following phases: 5 minutes to review the patient chart, 10 minutes for the patient encounter, and 5 minutes for direct feedback guided by the debrief report in the App.

Step 6: Data Analysis, Debrief and Quality Improvement

A virtual debriefing of score reports was completed. The rich results added vital data for program evaluation. Data included aggregate cohort and individual score reports with standard errors of measurement, which were calculated automatically by the software. The App significantly increased reporting ability while decreasing analysis time, improved immediate participant feedback, shortened the time from score to participant, and decreased the workload for data analysis. Additionally, Quantum placed participant assessment data on Benner’s novice-to-expert continuum (1984).

In Figure 1, Quantum’s Administrative Score Report aggregates the participants’ learning outcome data. Descriptive information is provided first: Who tested? How many tested? When did they test?

Next, two reliability estimates are calculated after each test administration, in compliance with measurement best practices (AERA, APA, & NCME, 2014). The KR(20), or Kuder-Richardson Formula 20, measures overall test reliability with values typically between 0.0 and +1.0. The closer the KR(20) is to +1.0, the more reliable an exam is considered, because its items consistently distinguish between higher- and lower-performing participants. In Figure 1, KR(20) = 0.97, indicating that the items did an excellent job of distinguishing participants’ abilities.
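For readers who want to see how KR(20) is obtained, the sketch below applies the standard formula, KR(20) = k/(k-1) × (1 - Σpq / σ²), to a small made-up matrix of 0/1 item scores. The data are hypothetical and are not the pilot's results; using the population variance of total scores is one common convention and is an assumption here.

```python
from statistics import pvariance


def kr20(scores):
    """Kuder-Richardson Formula 20 for rows of dichotomous (0/1) item scores."""
    n_people = len(scores)
    n_items = len(scores[0])
    totals = [sum(row) for row in scores]
    total_variance = pvariance(totals)  # population variance of total scores
    pq_sum = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in scores) / n_people  # proportion credited on item i
        pq_sum += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - pq_sum / total_variance)


# Hypothetical data: 4 participants (rows) scored on 3 checklist items (columns).
example = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(round(kr20(example), 2))  # 0.75
```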

We can also consider the reliability of the mean rating. The intraclass correlation ICC(1,2) measures agreement of ratings; it addresses the extent to which raters make essentially the same ratings. ICC(1,2) for raters in this OSCE is 0.83, with a 95% confidence interval of (0.646, 0.981), indicating that raters showed a very high level of agreement.
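A one-way intraclass correlation can be computed from the between-participant and within-participant mean squares of a one-way ANOVA. The sketch below uses hypothetical paired ratings, not the pilot's data, and returns both the single-rating form ICC(1,1) and the mean-of-k-ratings form ICC(1,k), which is ICC(1,2) when each participant has two ratings.

```python
def icc_one_way(ratings):
    """Return (ICC(1,1), ICC(1,k)) for rows of k ratings per participant."""
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    ss_between = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_within = sum((x - m) ** 2 for row, m in zip(ratings, row_means) for x in row)
    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    average = (ms_between - ms_within) / ms_between
    return single, average


# Hypothetical checklist totals: (Rater 1, Rater 2) for five participants.
pairs = [(9, 10), (6, 7), (8, 8), (4, 5), (7, 6)]
icc_single, icc_mean = icc_one_way(pairs)
print(f"ICC(1,1) = {icc_single:.2f}, ICC(1,2) = {icc_mean:.2f}")
```

The one-way form suits a design like this one, where the two raters differ from participant to participant rather than the same pair rating everyone.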

Now the program can have confidence in the OSCE test scores and can draw conclusions from the results, since the reliability of the data is known and the reliability estimates are greater than or equal to +0.70. As recognized by the Standards (AERA, APA, & NCME, 2014), the level of reliability of scores has implications for the validity of score interpretations.

Figure 2 shows a sample roster from the Quantum administrative score report. The aggregate data may be used for numerous purposes (e.g., course review, faculty audit, accreditation documentation of participant evaluation). Data may be sorted by each heading label. The arrow in the blue box (in the far right column) is a dynamic link to the individual participant’s score report. (Note: These were not given to the participants, only retained as part of the participant evaluation and treated as protected test information so the OSCE blueprint is not compromised.)

Individual participant score reports will be used to map performance over time and follow measurement best practices (see Figure 3). All participant score reports are:

  • Personalized: immediately addresses the participant’s question, “How did I do?”
  • Readable: colors and shapes assist in the score report’s interpretation.
  • Actionable: answers the participant’s question, “What do I need to do next?” Addressed with customized remediation; adds specific debriefing.

When learning objectives are mapped accordingly, performance may be mapped over time. These score reports may be used to show participant competence, audit preceptors and program concepts, and provide faculty feedback.


Quantum’s quality control statistics monitored all variables in the competency-based assessment, since overlap was maintained by the rater rotation plan throughout the test administration. A chi-square test for homogeneity was performed to determine whether the eight raters were equally lenient/severe. The eight raters were not equally lenient/severe, χ²(7) = 4.53, p < .05. As shown in Figure 4, Rater #2 was most severe and Rater #6 was most lenient.

Raters, even when highly trained, do not make equal assessments of equal performances. Hence the need for an algorithm that accounts for rater severity/leniency. Quantum’s algorithm accounts for rater severity/leniency and item difficulty before calculating a participant’s test score, which levels the assessment for the participant regardless of which rater completes the evaluation.

Seven raters showed excellent intra-rater decision consistency, allowing for normal variability. Rater #4 showed evidence of inconsistency. In fact, Rater #4 had 30% more variability in her ratings than the other seven raters. This means that Rater #4 occasionally gave highly unexpected ratings and needs additional development for high-stakes assessment. Quantum’s software is unique in that it provides diagnostic data that were previously unattainable in everyday simulation evaluation and often unquantifiable without rigorous research studies.
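A statement such as "30% more variability" is how a mean-square fit statistic is commonly read in Rasch measurement: 1.0 is the amount of variation the model expects, and 1.3 is about 30% more. The article does not say which statistic Quantum reports, so the following is only a sketch of one common choice, the information-weighted (infit) mean square for dichotomous ratings, with hypothetical values.

```python
def infit_mean_square(observed, expected):
    """Infit mean square: squared residuals divided by model variance.

    observed: 0/1 ratings a rater actually gave.
    expected: model probabilities of a 1 for those same ratings.
    """
    squared_residuals = sum((x - p) ** 2 for x, p in zip(observed, expected))
    model_variance = sum(p * (1 - p) for p in expected)
    return squared_residuals / model_variance


# Hypothetical ratings and model expectations for one rater.
observed = [1, 0, 1, 1]
expected = [0.8, 0.4, 0.6, 0.5]
print(round(infit_mean_square(observed, expected), 2))
```

Values well above 1.0 flag a rater whose decisions are harder to predict from participant ability and item difficulty than the model expects.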

The objectives, specific to the FNP II participant, were selected to assess knowledge, skills, and attitudes. An item analysis of the technical quality of the performance measures showed appropriate fit, allowing for normal variability. Four performance measures need to be rewritten for more specific assessment, as each was written too broadly in the original case. For example, “Did the participant incorporate current guidelines in clinical practice to formulate an age/gender appropriate plan of care?”

Furthermore, Quantum’s software examined the spread of performance measures along the FNP II variable. Six statistically distinct regions of item difficulty, as distinguished by the participants, were identified. All gaps in content representation will need to be investigated by looking at the unique ordering of item difficulties and their individual standard errors. Gathering validity evidence is an ongoing process, and all data gathered support test plans and test score interpretation.

Overall feedback from all participants served as a vital component for improvement of the process and procedures for full implementation in all clinical graduate courses.


  • Liked having two raters per participant
  • Would have worked out better if all faculty raters were on the same floor
  • Everyone is eager to see the statistics
  • Faculty loved rating one participant at a time
  • Rater 2 had some technical difficulties with camera angles, which prevented faculty from giving a thorough evaluation. Learning new equipment could have contributed
  • Rotation schedule was managed very effectively
  • The 10-minute patient encounter time period went well; it was beneficial for participants to learn that a 10-minute patient encounter time frame was not unrealistic
  • Faculty feedback on App improvement was given to Objectivity Plus staff


The pilot implementation was successful. The software allowed for comprehensive competency measurement, statistical analysis and program evaluation simply and efficiently. The pilot helped advance the state of competency-based assessment because we measured it reliably. The program can now draw evidence-based conclusions about the participants: what they are doing well and what needs improvement in relation to the performance measures. Faculty were extremely satisfied with the pilot’s evidence-based outcome measurement and have already begun refining and scheduling OSCEs for FNP III & IV.

Regarding participants, focused feedback was overwhelmingly appreciated. Specific individualized feedback not only increased satisfaction but also allowed participants to recognize areas where competency needed improvement.

From a program administration standpoint, this pilot demonstrates how valid and reliable assessment using statistical software such as Quantum can reveal patterns in raters' scoring. Groundbreakingly, the analysis yielded data to handle the practical issue of moderating scores to address rater differences, making the OSCE more objective. Truly putting the “O” in OSCE!

With so much time and resource invested in simulation programs, the return on that investment has to include data that are robust, rich, and reliable! That requires a standardized, psychometrically sound process, which includes validity evidence, rater analysis, and measurable outcomes to justify the investment.

About the Authors

Dr. Lori Lioce serves as the Executive Director of the Learning & Technology Resource Center and a Clinical Associate Professor at The University of Alabama in Huntsville College of Nursing. She earned her DNP from Samford University and is an AANP Fellow. Her clinical background is concentrated in emergency medicine and her research interests include simulation education and operations, healthcare policy, ethics, and substance use disorders. She currently serves on the Board of Directors for SimGHOSTS.

Dr. Steve Hetherman is a psychometrician and managing partner of Objectivity Plus. He earned his EdD from Teachers College, Columbia University. His measurement interests include classical test theory, Rasch measurement theory, simulation testing, and competency-based assessment.

1. American Educational Research Association [AERA], American Psychological Association [APA], National Council on Measurement in Education [NCME], & Joint Committee on Standards for Educational and Psychological Testing. (2014). Standards for educational and psychological testing. Washington, DC: AERA.
2. Benner, P. (1984). From novice to expert: Excellence and power in clinical nursing practice. Menlo Park: Addison-Wesley, pp. 13-34.
3. Hetherman, S.C., Lioce, L., Gambardella, L., & Longo, B. (2017). Development of Quantum, an Instructor-Mediated Performance Assessment Test, and Student Measure Validation. Journal of Nursing & Healthcare, 3(1), 1-10.

Originally published in Issue 2, 2019 of MT Magazine.

