Does the interplay of grading policies and teacher-course evaluations undermine our education system?
Teacher Course Evaluations and Student Grades: An Academic Tango
Valen E. Johnson
During the past decade or so, many colleges have placed an increasing emphasis on “teaching effectiveness” in faculty promotion, tenure, and salary reviews. In most cases, the mechanism used to measure teaching effectiveness is a locally developed evaluation form that is completed by an instructor’s students toward the end of the course, usually before students have received their final course grades. The practice of using student evaluations of teaching (SETs) to evaluate faculty teaching effectiveness raises a number of concerns, including the basic validity of these forms and their sensitivity to external biases. The question of validity involves the extent to which SETs (or items on these forms) accurately predict student learning. Questions of bias involve the possibility that student responses are influenced by factors unrelated to the faculty member’s instructional effectiveness. The topic of this article is the biasing effect that faculty grading practices have on SETs. A broader discussion of this and related issues may be found in my book The GPA Myth, from which most of the following analyses are drawn. Both the validity of SETs and potential biases to SETs have been discussed extensively in the educational literature.
A simple search of the ERIC database produces thousands of articles concerning various aspects of SETs, and Greenwald summarizes more than 170 studies that examined the specific issue of whether SETs represent valid measures of student learning. Clearly, a comprehensive review of this literature is not possible here, and so I will simply summarize the current state of knowledge concerning the relation between grading practices and student evaluations of teaching by asserting that it is rather confused. Much of this confusion arises from the fact that it is difficult to conduct experiments to investigate this relationship because of constraints involving human subjects. Those experiments that have been conducted tend to confirm the notion that higher student grades cause higher teacher-course evaluations, but over the past two or three decades several other explanations for the correlation between grades and SETs have been proposed. Among these are the teacher-effectiveness theory, under which observed correlations between student grades and favorable SETs are attributed to quality teaching (i.e., better teaching leads to both higher grades and higher SETs), and theories based on the intervention of unobserved student or classroom characteristics. For example, one might expect that highly motivated students enrolled in small, upper-level classes learn more, leading to both higher grades and more appreciation of a teacher’s efforts. Understanding the relationship between grades and SETs is made still more difficult by the fact that many educational researchers have been hesitant to acknowledge the biasing effects of grades on SETs because much of their research relies heavily on the use of student evaluations of teaching. Clearly, biases to teacher-course evaluation forms caused by student grades undermine the validity of research based on these forms. Readers interested in a more thorough discussion of these issues and an entry into this literature may refer to the references provided at the end of this article.

In this article, the causal effect of grades on student evaluations of teaching is investigated by examining new evidence collected in an online course evaluation experiment conducted at Duke University. In this experiment, changes to a SET were recorded as Duke University freshmen completed a teacher-course evaluation form both before and after they had received their final course grades. By accounting for the grades this cohort of students expected to receive in their initial survey response, it is possible to separate the effects of grades on student evaluations of teaching from other intervening variables.
Survey Instrument
Listed below is a subset of the items that appeared on the DUET Web site.

The instructor’s concern for the progress of individual students was: 1) Very Poor 2) Poor 3) Fair 4) Good 5) Very Good 6) Excellent 7) Not Applicable

How effective was the instructor in encouraging students to ask questions and express their viewpoints? 1) Very Poor 2) Poor 3) Fair 4) Good 5) Very Good 6) Excellent 7) Not Applicable

How would you rate this instructor’s enthusiasm in teaching this course? 1) Very Bad 2) Bad 3) Fair 4) Good 5) Very Good 6) Excellent 7) Not Applicable

How easy was it to meet with the instructor outside of class? 1) Very Difficult 2) Difficult 3) Not Hard 4) Easy 5) Very Easy 6) Not Applicable

How does this instructor(s) compare to all instructors that you have had at Duke? 1) Very Bad 2) Bad 3) Fair 4) Good 5) Very Good 6) Excellent 7) Not Applicable

How good was the instructor(s) at communicating course material? 1) Very Bad 2) Bad 3) Fair 4) Good 5) Very Good 6) Excellent 7) Not Applicable

To what extent did this instructor demand critical or original thinking? 1) Never 2) Seldom 3) Sometimes 4) Often 5) Always 6) Not Applicable

How valuable was feedback on examinations and graded materials? 1) Very Poor 2) Poor 3) Fair 4) Good 5) Very Good 6) Excellent 7) Not Applicable

How good was the instructor at relating course material to current research in the field? 1) Very Bad 2) Bad 3) Fair 4) Good 5) Very Good 6) Excellent 7) Not Applicable
Experimental Design
The data used in this investigation were collected during an experiment conducted at Duke University during the fall semester of 1998 and the spring semester of 1999. The experiment, called DUET (Duke Undergraduates Evaluate Teaching), utilized a commercially operated Web site to collect student opinions on the courses they either had taken or were taking. The DUET Web site was activated for two 3-week periods, the first beginning the week prior to fall registration in 1998 and the second a week prior to spring registration in 1999. Both periods fell approximately 10 weeks into their respective semesters. The survey was conducted immediately before and during registration to provide an incentive for students to participate: by completing the DUET survey prior to registration, students could view course evaluation data and grade data for courses they planned to take the following semester. An important aspect of the experimental design involved the manner in which data were collected for first-year students. Participating first-year students completed the survey for their fall courses twice: once before completing their fall courses and receiving their final grades, and once after. Because one DUET survey item asked students what grade they received or expected to receive in their courses, the responses collected from freshmen at the two time points provide an ideal mechanism for investigating the influence that expected and received student grades have on student evaluations of teaching. Upon entering the DUET Web site, students were initially confronted with text informing them that course
evaluation data collected on the site would be used as part of a study investigating the feasibility of collecting course evaluations on the Web. They were also told that their responses would not be accessible to their instructors, to faculty not associated with the experiment, to other students, or to university administrators. After students indicated their consent to participate in the study, survey items were presented to them in groups of five to seven questions. With only one or two exceptions, all items included a “Not Applicable” response. Thirty-eight items were included on the survey, but, because of space constraints, attention here is restricted to nine questions that relate to student-instructor interaction. These items and the possible student responses appear in the accompanying sidebar. During the two semesters that the DUET Web site was active, 11,521 complete course evaluations were recorded. Each evaluation consisted of 38 item responses for a single course. Of the 6,471 eligible full-time, degree-seeking students who matriculated at Duke University after 1995, 1,894 students (29 percent) participated in the experiment. Among first-year students, approximately one-half participated in both the fall and the spring surveys. A detailed investigation of the relationship between various demographic variables and response rates is provided in The GPA Myth. That analysis suggests that response patterns were not significantly affected by the variables most associated with differential response rates: although response rates did vary according to students’ gender, ethnicity, academic year, academic major, and GPA, student response patterns did not vary substantially across these categories.
Figure 1. Latent variable distributions. The three curves in this figure represent hypothetical density functions (shifted vertically so as to facilitate display on the same horizontal axis) from which latent variables underlying the generation of ordinal data are assumed to be drawn. The horizontal axis represents the latent variable; dashed vertical cutoffs separate the response categories strongly disagree, disagree, neutral, agree, and strongly agree.
Statistical Analysis
Because student responses to the DUET survey take the form of ordinal data (that is, responses fall into ordered categories), it is natural to estimate the effects of received and expected grades on SETs using ordinal regression models. To this end, I assume the existence of an unobserved, or latent, variable that underlies each student’s perception of an item. As the value of this latent variable increases, so too does the probability that the person rates the item in one of the higher categories. This concept is illustrated in Figure 1. The curve at the top of the figure represents the distribution of the latent variable for an individual who is likely to strongly disagree or to disagree with a statement. Individuals whose latent variables are drawn from the distribution in the middle of the figure are more likely to respond in a neutral way, while individuals whose latent distributions are depicted at the bottom of the figure are likely to agree or strongly agree with the question.

Statistical issues that arise in analyzing ordinal data using latent variable models involve establishing a scale of measurement (i.e., establishing the location of 0 on the horizontal axis in Figure 1), estimating the locations of the category cutoffs (the vertical, dashed lines in Figure 1 that separate the response categories), and, most important, modeling the effects that explanatory variables have in shifting the distribution of the latent variables.
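To make this latent-variable mechanism concrete, here is a minimal simulation sketch. It is an illustration only, not code from the study: the cutoff locations, the curve centers, and the five-category agree/disagree scale are invented for the example, and the lowest cutoff is pinned at 0 in anticipation of the identification constraint described below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical cutoffs separating five ordered response categories
# (strongly disagree ... strongly agree); the lowest cutoff is pinned
# at 0, mirroring the identification constraint used in the article.
cutoffs = np.array([0.0, 1.5, 3.0, 4.5])
labels = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

def simulate_responses(center, n=5):
    """Draw latent values from a logistic density centered at `center`
    and map each draw to the category whose cutoffs bracket it."""
    latent = rng.logistic(loc=center, scale=1.0, size=n)
    # The number of cutoffs below a latent value is the index of its category.
    return [labels[k] for k in np.searchsorted(cutoffs, latent)]

# Low, middle, and high centers correspond to the three curves of Figure 1.
for center in (-1.0, 2.0, 5.0):
    print(f"center = {center:+.1f}:", simulate_responses(center))
```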
CHANCE
11
chance15-3
7/16/02
1:38 PM
Page 12
Table 1—Estimated Effects of Expected Grades on DUET Items.
Columns give the estimated effect Eg for each expected grade g (posterior standard deviations in parentheses); the D column pools expected grades of D+, D, D-, and F.

Item                         D            C-           C            C+           B-           B            B+           A-           A            A+
Instructor concern           1.53 (1.03)  0.07 (0.55)  0.93 (0.28)  2.62 (0.42)  2.85 (0.20)  3.02 (0.16)  2.70 (0.14)  3.13 (0.13)  3.57 (0.14)  4.14 (0.21)
Encouraged questions         2.98 (0.71)  1.07 (0.44)  1.61 (0.29)  2.37 (0.42)  2.44 (0.23)  2.90 (0.18)  3.09 (0.17)  3.26 (0.17)  3.53 (0.16)  3.54 (0.23)
Instructor enthusiasm        2.99 (0.72)  0.67 (0.52)  2.12 (0.28)  2.70 (0.42)  3.19 (0.23)  3.50 (0.19)  3.28 (0.18)  3.44 (0.17)  3.96 (0.18)  4.61 (0.25)
Instructor availability      2.24 (0.82)  2.67 (0.58)  2.10 (0.29)  2.85 (0.39)  3.27 (0.22)  3.25 (0.16)  3.06 (0.15)  3.35 (0.13)  3.60 (0.13)  3.88 (0.22)
Instructor rating            2.11 (0.79)  -0.43 (0.72) 1.99 (0.31)  2.34 (0.40)  2.92 (0.22)  2.71 (0.16)  2.77 (0.14)  2.96 (0.13)  3.36 (0.12)  4.10 (0.22)
Instructor communication     2.78 (0.70)  0.52 (0.51)  1.97 (0.29)  1.54 (0.39)  2.88 (0.24)  3.30 (0.21)  3.29 (0.21)  3.52 (0.20)  4.13 (0.21)  4.44 (0.26)
Critical thinking            1.26 (0.73)  2.10 (0.66)  2.69 (0.29)  1.51 (0.41)  2.63 (0.22)  3.14 (0.17)  3.39 (0.16)  3.13 (0.14)  3.04 (0.13)  3.03 (0.22)
Usefulness of exams          1.70 (0.80)  0.33 (0.53)  0.88 (0.31)  2.40 (0.43)  2.69 (0.25)  3.04 (0.21)  3.20 (0.20)  3.49 (0.20)  3.53 (0.19)  4.35 (0.28)
Related course to research   0.60 (0.98)  -0.29 (0.69) 1.10 (0.28)  2.14 (0.39)  2.30 (0.22)  3.01 (0.19)  3.11 (0.16)  3.42 (0.16)  3.63 (0.15)  4.15 (0.24)
The first problem, establishing a scale of measurement, can be solved by setting the shape of the curves used to describe the distributions of the latent variables and by pinning down the value of one of the category cutoffs. For this purpose, in the analyses that follow, the shape of these distributions is assumed to be a standard logistic density function, and the value of the lowest category cutoff is fixed at 0. To model the location of the logistic curve that determines the probability that a particular student responds in a certain way on a given DUET item (i.e., to model the location of the center of the curves depicted in Figure 1), assume that for each DUET item and student there is a value, say ci, that represents the ith student’s perception of the given DUET item for the course in question, excluding influences of expected or received grade. That is, ci is defined to represent the central value of student i’s latent perception distribution for that DUET item when no consideration is
given to expected or received grade. For brevity, call this value the course effect. For DUET data collected in the fall of 1998, the effects of expected grade on students’ perception of items are modeled by assuming that the influence of expecting a grade of, say, g, on the ith student’s perception of a given DUET item is to shift the latent perception distribution of that student by an amount equal to Eg. Student expectations of their grades were defined using their responses to one of the items on the DUET survey. Available responses for this item were A+, A, ..., F, but, because of the paucity of responses in categories lower than C-, values of D+, D, D-, and F were pooled together. The central value of the curve describing the ith student’s perception of a DUET item, including possible influences of expected grade, was thus assumed to be ci + Eg. Assumptions made for data collected in the spring of 1999 are similar. As in the fall, ci denotes the central value of the ith student’s latent perception distribution for a given DUET item, where, as before, influences of expected or
received grades are excluded. The relative values of ci are assumed to be constant across semesters for the same student/DUET item combinations. An important difference between data collected in the fall and spring is that students who completed the survey in the spring had, by that time, received their actual course grades for their fall courses. For this reason, modifications to students’ perceptions of their fall courses during the spring survey period are attributed to received grade rather than expected grade. Let Rg denote the influence that receiving a grade of g has on a student’s perception of a course attribute in the spring, where g is coded as before but now represents the grade actually received in the course. With this notation in hand, the center of each student’s latent distribution in the spring was assumed to be ci + Rg. Finally, distinct category cutoffs were defined separately for the fall and spring data. By so doing, systematic changes in student perceptions of classes over time were naturally integrated into the probability model for the student responses.
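As a rough sketch of what this location-shift assumption implies, the snippet below computes the response-category probabilities for a standard logistic latent variable centered at ci + Eg, using the fact that the probability of category k is the difference of the logistic distribution function evaluated at adjacent cutoffs. It is not the estimation code used for the study (posterior means were obtained with the procedures described in Ordinal Data Modeling); the cutoffs, course effect, and grade shifts are made-up numbers chosen only to show how a larger expected grade pushes probability toward the higher response categories.

```python
import numpy as np

def logistic_cdf(x):
    """Distribution function of the standard logistic density."""
    return 1.0 / (1.0 + np.exp(-x))

def category_probabilities(cutoffs, center):
    """Probability of each ordered response category when the latent
    perception variable is standard logistic centered at `center`:
    P(category k) = F(gamma_k - center) - F(gamma_{k-1} - center)."""
    bounds = np.concatenate(([-np.inf], cutoffs, [np.inf]))
    return np.diff(logistic_cdf(bounds - center))

# Made-up values for illustration: five cutoffs give six response
# categories (e.g., Very Poor ... Excellent); the lowest is fixed at 0.
cutoffs = np.array([0.0, 1.0, 2.5, 4.0, 5.5])
c_i = 1.2                                      # hypothetical course effect
grade_shifts = {"C": 1.0, "B": 2.0, "A": 3.0}  # hypothetical E_g values

for grade, E_g in grade_shifts.items():
    probs = category_probabilities(cutoffs, c_i + E_g)
    print(f"expected {grade}: {np.round(probs, 2)}")
```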
Table 2—Estimated Effects of Received Grades on DUET Items.
Columns give the estimated effect Rg for each received grade g (posterior standard deviations in parentheses); the D column pools received grades of D+, D, D-, and F.

Item                         D            C-           C            C+           B-           B            B+           A-           A            A+
Instructor concern           0.53 (0.31)  0.74 (0.31)  1.40 (0.21)  1.80 (0.23)  2.18 (0.24)  2.11 (0.13)  2.72 (0.14)  2.98 (0.14)  3.38 (0.13)  3.16 (0.20)
Encouraged questions         1.08 (0.32)  0.98 (0.31)  1.99 (0.21)  1.89 (0.24)  2.02 (0.22)  2.52 (0.12)  2.86 (0.11)  3.25 (0.10)  3.45 (0.10)  3.00 (0.17)
Instructor enthusiasm        1.62 (0.33)  1.87 (0.35)  2.23 (0.24)  1.64 (0.24)  2.63 (0.23)  2.61 (0.16)  2.69 (0.16)  2.91 (0.17)  3.13 (0.16)  3.15 (0.23)
Instructor availability      1.27 (0.30)  0.79 (0.37)  2.09 (0.21)  1.99 (0.24)  2.45 (0.22)  2.57 (0.13)  3.11 (0.11)  3.13 (0.10)  3.53 (0.11)  3.48 (0.19)
Instructor rating            1.18 (0.34)  0.70 (0.34)  2.09 (0.21)  1.33 (0.24)  2.24 (0.22)  2.09 (0.12)  2.21 (0.12)  2.82 (0.11)  3.06 (0.10)  2.86 (0.18)
Instructor communication     0.79 (0.28)  1.83 (0.31)  2.34 (0.22)  1.61 (0.23)  2.32 (0.20)  2.67 (0.12)  2.67 (0.11)  3.27 (0.11)  3.31 (0.10)  3.38 (0.18)
Critical thinking            1.69 (0.35)  1.33 (0.33)  2.36 (0.23)  2.21 (0.24)  2.53 (0.22)  3.04 (0.14)  3.03 (0.14)  3.40 (0.13)  3.05 (0.12)  3.34 (0.21)
Usefulness of exams          0.64 (0.28)  0.93 (0.33)  1.81 (0.22)  2.04 (0.25)  1.63 (0.22)  1.96 (0.13)  2.78 (0.14)  3.14 (0.14)  3.54 (0.14)  3.31 (0.21)
Related course to research   2.37 (0.38)  0.29 (0.53)  2.29 (0.26)  2.00 (0.27)  2.78 (0.24)  2.87 (0.14)  2.85 (0.14)  3.28 (0.11)  3.07 (0.10)  3.46 (0.22)
To summarize this model for the paired student responses, we have assumed that student responses to each DUET survey item depend first on an underlying perceptual variable that determines each student’s general impression of the course attribute in question and second on the student’s expected or received grade in that course. A possible drawback of this formulation is that, at first glance, there appear to be almost as many parameters in the model as there are paired student responses to the survey. Indeed, if only the fall or the spring data were examined separately, there would be more parameters in the model than observations in the data, since each student response to each DUET item for each course invokes its own value of ci. However, with two semesters of data, there were two observations of each student’s response to each DUET item for each course taken, and only one value of ci to estimate for each such pair. For purposes of estimating the remaining model parameters, this results in a surplus of observations equal in number to the
response pairs. Because more than 2,500 paired responses were obtained for each item, this leaves ample data available for estimating the fixed effects of grades. An advantage of estimating ci for each paired student response is that it permits the effects of grades on student evaluations of teaching to be separated from course attributes and other student background variables. Presumably, prior student interest, student motivation, academic prowess, and the intellectual level at which an instructor pitched his or her lectures are all absorbed in this variable. The consequence of fitting a random course effect for each paired response is that estimates of grade effects are based only on changes to survey responses that accompany expected and received grades. Posterior means of the expected grade and received grade parameters estimated from this model, obtained using estimation procedures described in Ordinal Data Modeling, are provided in Tables 1 and 2. Posterior standard deviations are indicated in parentheses
beside each estimate. The coefficients in Tables 1 and 2 can most easily be interpreted in terms of odds and odds ratios. Because the latent student perception variables were assumed to have standard logistic distributions, the ratio of the probability that a student rated an item above a given category, say m, to the probability that he or she rated it less than or equal to category m in the fall 1998 survey can be expressed as
Pr(response > m) / Pr(response ≤ m) = exp(−γm + ci + Eg).
In this expression, γm denotes the category cutoff for category m (illustrated in stylized form by a vertical dashed line in Figure 1), and Eg denotes the coefficient listed for an expected grade of g in Table 1. This ratio of probabilities is called the odds. The corresponding odds that a student responded above category m during the spring 1999 survey are given by
Pr(response > m) / Pr(response ≤ m) = exp(−γm + ci + Rg),

where Rg denotes the coefficient listed in Table 2 for a received grade of g.
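For a fixed student, item, and category, the course effect ci and the cutoff γm cancel when the odds under two different grades are compared, so differences between the coefficients in Tables 1 and 2 translate directly into multiplicative changes in the odds. Taking the Instructor rating row of Table 1, for example, the estimated odds that a student rates the instructor above any fixed category are about exp(3.36 − 1.99) ≈ 3.9 times greater when the student expects an A than when the student expects a C. The sketch below simply automates this arithmetic; the coefficient values are copied from Table 1, and the small helper function is introduced here only for illustration.

```python
import math

# Posterior means of expected-grade effects for the "Instructor rating"
# item, copied from Table 1.
E = {"C": 1.99, "B": 2.71, "A": 3.36}

def odds_ratio(e_high, e_low):
    """Multiplicative change in the odds of rating the item above any
    fixed category m; the cutoff gamma_m and course effect c_i cancel."""
    return math.exp(e_high - e_low)

print(round(odds_ratio(E["A"], E["C"]), 1))   # 3.9: expecting an A vs. a C
print(round(odds_ratio(E["A"], E["B"]), 1))   # 1.9: expecting an A vs. a B
```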
[Figure: plots of estimated grade effects for the DUET items “Instructor concern,” “Encouraged questions,” and “Instructor enthusiasm.”]