Saturday, February 16, 2019
Takeaways from “Best Practices in High-Stakes Testing” Conference
On February 7–8, 2019, the National Conference of Bar Examiners and the Law School Admission Council co-hosted a conference entitled “Best Practices in High Stakes Testing: What Legal Educators Need to Know.” The conference consisted of a welcome reception on Thursday evening, three plenary presentations on Friday morning, and a choice of breakout sessions on Friday afternoon.
After Friday’s opening remarks—which included a nod to academic support pioneer Paula Lustbader—psychometrician Gage Kingsbury explained the difference between admissions testing and licensure testing. Admissions testing is characterized by its lack of a minimum “passing” score and its need for a high level of reliability across a wide range of performances. (Here, the LSAT.) Meanwhile, a licensure exam typically follows a program of instruction and includes a pre-identified passing score. In addition, a licensure test must provide high confidence that the pass/no-pass line is properly drawn. (Here, the bar exam.)
Both types of exams should rest not only on a strong case for the validity of their use, but also on the fairness of their use. Validity, in this context, means that the test is being used for an appropriate purpose. (Validity is not the same thing as reliability; reliability is the likelihood that the examination will produce consistent scores over time for all takers in the normal range of performance.) Fairness, however, is more complicated. Most psychometricians agree that a test need not ensure that all subgroups (e.g., Caucasians and Latinos) score the same, but it should be equally reliable and equally valid for all subgroups.
Dr. Kingsbury highlighted how the term “high stakes” is frequently misused by the testing community. Almost every test is high stakes for someone; the key is to recognize for whom. Is the test high stakes for the test taker, the testing agency, the educators, the community, or some combination of stakeholders? The more stakeholders who are impacted by the results of the examination, the higher the stakes. Moreover, a real issue arises when the examination is high stakes for one population, but not for another. For example, a statistical “terminal disaster” occurred with the “No Child Left Behind” testing model because the examination was very high stakes for the educators, but a no-stakes examination for the test takers.
He concluded with a quick tip: When crafting high-stakes examinations, drafters should pay attention to how much information is being tested per minute of examination; the answer will help the drafter determine which testing format (e.g., essay, multiple-choice) is best equipped to achieve the drafter’s goal.
Next, Professor James Wollack of the University of Wisconsin-Madison discussed the “Implications of Standardization: Test Security for LSAT and Bar Exam.” More specifically, his talk focused on exam cheating in the academic setting.
Shockingly, 68% of undergraduate students nationwide admitted to cheating at least once during the academic year. Of those who admitted to cheating, 85% believed cheating was essential to their academic success, 90% didn’t believe they would get caught, and, in fact, 95% did not get caught. To put all these numbers in perspective, a website that enables students to purchase original pre-written term papers and projects receives 8,000 unique hits each day.
In the admissions test context, over the last decade virtually all the major testing programs (e.g., SAT, ACT, MCAT, GMAT, TOEFL) have experienced documented incidents of cheating. For example, one student foolishly placed a Craigslist ad seeking a surrogate to take the LSAT for him. Cheating is also widespread across the licensure-credentialing industry (e.g., radiologists, school bus drivers, sommeliers, and the list goes on). Regarding the bar exam, a member of the NCBE disclosed that six different jurisdictions reported “testing irregularities” involving 30 different candidates on last year’s bar examination.
Professor Wollack identified five factors that influence whether a person will cheat on an examination: (1) the stakes of the examination, (2) the examinee’s predisposition, (3) the perceived need to cheat, (4) the opportunity to cheat, and (5) the perceived punishment if caught. First, potential cheaters weigh the pros and cons of cheating, asking themselves questions like: “Will an important decision be made with this particular test score? Will I get a benefit for a good score? Will I likely be punished for a bad score?” Second, the examinee weighs those answers against his or her own personal moral compass; generally, society is becoming more and more tolerant of cheating. Third, the student will weigh his or her perception of how well he or she will perform on the exam relative to the target score without cheating, essentially asking, “Can I pass without cheating?” Professor Wollack stressed that this step is tied to the student’s own self-assessment and is not based on objective indicators of success. Fourth, the student will assess whether there is an opportunity to cheat: Do the test conditions allow for fraud? Anything proctors allow in the testing room is an area of vulnerability; in fact, there is an entire industry devoted to cheating-enabled clothing (e.g., earbuds designed as earrings, cameras in glasses and shirt buttons). Cell phones, however, are the single biggest threat, especially when students use a two-phone system: a decoy phone voluntarily surrendered to the proctor at the start of the test and a secreted “cheater” phone. Fifth, the student will weigh the perceived consequences of getting caught against the potential for reward or gain.
The definitive source for test-administration guidance, including security standards, is the “Joint Standards for Educational and Psychological Testing” manual. The book explains that the examinee’s behavior is influenced by many environmental and administrative factors; therefore, examiners are wise to manage these conditions thoughtfully. The manual explains that, when designing a test, developers should adopt a comprehensive approach to test security at all phases of the test, beginning at test development and continuing through messaging with candidates, delivery, post-exam web monitoring, and statistical detection of indicia of test fraud. When a departure from any protocol occurs, the departure must be meticulously documented to guard against fraud.
In response to an attendee’s question, Professor Wollack advised that it is worthwhile to share large-scale grading rubrics and testing outcomes with students (especially for purposes of formative assessment) without fear of jeopardizing test security. Meanwhile, granular-level rubrics should be safeguarded if the test is going to be reused in the future. He concluded his presentation by quoting Julie Andrews: amateurs [read: test proctors] practice until they get it right; professionals practice until they can’t get it wrong.
Immediately before lunch, panelists William Adams (ABA Deputy Managing Director), Gage Kingsbury (psychometrician consultant), and Mark Raymond (psychometrician at the NCBE) discussed ABA Standard 314 and explained “How Formative and Summative Assessments Differ from a Standardized Admission Test (LSAT) and a Licensure Exam (Bar Exam).” The panelists outlined the need for various types of assessment and explained the pros and cons of each. Much of what was mentioned in this hour is very familiar to academic support professionals but was undoubtedly helpful to the other folks in the audience. The most popular piece of advice during this hour came from Dr. Raymond. He suggested that to draft quality “distractor” answers to a multiple-choice question, the professor should first ask the question as an open-ended response question. Then, on a later test administration, take the most popular wrong answers supplied by the students and use those responses in the newly converted multiple-choice question.
After the lunch break, attendees could choose from four breakout sessions: formative assessment, LSAT scoring, bar exam scoring, or fairness issues in standardized testing.
In the bar exam scoring session, Kellie Early began by describing the three major components of the exam. She then explained how each multiple-choice question undergoes a three-year vetting process, which includes soliciting feedback from recently barred attorneys. (As an aside, the LSAC also uses a three-year timeline for launching new questions on the LSAT.) Meanwhile, essay questions are not vetted in advance of a testing administration because they are too memorable and thus subject to security breaches. All MBE and written questions are selected for inclusion on the examination in the 8 to 18 months before the exam administration. The examination booklets are formally printed about two months prior to the administration and shipped to the local jurisdictions in the final month.
Next, Douglas Ripkey explained the difference between the terms raw score, scaled score, equating (MBE), and scaling (written component). Equating refers to the statistical process of determining comparable scores on different exam forms, while scaling is the process of placing written scores on the same scale as the MBE. To equate the MBE, the NCBE relies upon a subset of previously used questions (a.k.a. equators) to establish a baseline of proficiency over time for each testing group. To scale the written component, the NCBE relies upon the belief that there is a strong correlation between performance on the MBE and performance on the written components. First, one must determine the mean and standard deviation of MBE scores in the testing group. Next, one must determine the mean and standard deviation of the written scores. Lastly, using z-score statistics, the NCBE “rescales the essay scores so that they have a mean and standard deviation that is the same as the MBE mean and standard deviation.” He likened the process to converting degrees Fahrenheit to degrees Celsius, with each measurement system sharing anchor points such as the temperatures at which water boils and freezes. Mr. Ripkey concluded by stating that, statistically speaking, all applicants could “pass” the bar examination in any testing administration.
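For readers who want to see the arithmetic, the z-score rescaling Mr. Ripkey described can be sketched in a few lines of Python. (The function name and sample numbers below are my own illustration, not NCBE code; real scoring also involves grading, equating, and jurisdiction-specific weighting omitted here.)

```python
import statistics

def scale_written_to_mbe(written_scores, mbe_scores):
    """Rescale raw written scores so that their mean and standard
    deviation match those of the MBE scores for the same testing group."""
    w_mean = statistics.mean(written_scores)
    w_sd = statistics.pstdev(written_scores)
    m_mean = statistics.mean(mbe_scores)
    m_sd = statistics.pstdev(mbe_scores)
    # Convert each written score to a z-score (distance from the written
    # mean in standard-deviation units), then re-express that distance
    # on the MBE scale.
    return [m_mean + m_sd * (w - w_mean) / w_sd for w in written_scores]
```

For example, with hypothetical written scores of 60, 70, and 80 and MBE scores of 130, 140, and 150, the written scores land at 130, 140, and 150 on the MBE scale: each examinee keeps the same relative standing, but the numbers now share the MBE’s mean and spread, much like the boiling and freezing anchor points in the temperature analogy.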
In the last hour of the conference, participants could choose from workshops exploring formative assessment techniques, the future of the LSAT, the future of the bar exam, or fairness issues in standardized testing.
In the fairness workshop, Dr. Mark Raymond (psychometrician at the NCBE) and Ben Theis (test developer at LSAC) explained how high-stakes test makers handle fairness concerns. Typically, test developers look for bias in three places: (i) at the test-question level, (ii) at the test-score level, and (iii) at the decision level.
LSAC employs two separate, dedicated “fairness” reviews, one by an external committee and one in-house. Once questions are included in the pre-test section of the LSAT, LSAC tracks each question’s statistical performance; the developers use the pre-test section to vet these questions with a diverse testing group. Then they look for “residual” differences, or differential item functioning (“DIF”), on a scatter plot. Plainly stated, individuals with the same objective criteria (e.g., same gender, age, and degree) should perform equally well on the exam. If they don’t, the developers look to see whether race or gender could account for the statistical difference. LSAC employs a “presume unfairness” mentality for any questionable item: if an item exhibits any unfairness qualities, the item is discarded permanently without further evaluation.
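The core idea behind a DIF screen can be illustrated with a deliberately simplified sketch: match examinees on overall ability, then check whether matched examinees from different subgroups answer a given item correctly at the same rate. (This toy function, its name, and its 10-point threshold are my own illustration of the concept; LSAC’s actual procedures use formal statistical tests, not this heuristic.)

```python
from collections import defaultdict

def flag_dif(responses, threshold=0.10):
    """Simplified differential item functioning (DIF) screen for one item.

    `responses` is a list of (group, score_band, correct) tuples, where
    `score_band` matches examinees of similar overall ability and
    `correct` is 1 or 0. Within each band, compare the proportion of each
    group answering correctly; flag the item if any band's gap exceeds
    `threshold`.
    """
    bands = defaultdict(lambda: defaultdict(list))
    for group, band, correct in responses:
        bands[band][group].append(correct)
    for groups in bands.values():
        if len(groups) < 2:
            continue  # need both groups represented in the band to compare
        rates = [sum(v) / len(v) for v in groups.values()]
        if max(rates) - min(rates) > threshold:
            return True  # matched examinees perform unequally on this item
    return False
```

An item where equally able examinees from group “A” answer correctly far more often than those from group “B” would be flagged, mirroring the “presume unfairness” posture described above.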
LSAC revealed that minority performance is overpredicted by the LSAT. In other words, more frequently than is true for other populations, a minority student’s LSAT score suggests that the student will perform better in law school than the student actually does. While more research is needed, the findings suggest that, for some reason, minority students are underperforming in law school despite solid LSAT scores.
To summarize, this one-day conference exposed the numerous attendees (including law school deans, faculty, academic support and bar exam specialists, and admissions personnel) to a solid foundation in the hot topics associated with standardized testing in legal education.
(Kirsha Trychta, Guest Blogger)