By Jacqueline Dewar, Loyola Marymount University
What happens to the data from your teaching evaluations? Who sees the data? Are your numbers compared with other data? What interpretations or conclusions result? How well informed is everyone, including you, about the limitations of these data and the conditions that should be satisfied before they are used in evaluating teaching?
Despite many shortcomings of student ratings of teaching (SRT), some of which I mention below, their use is likely to continue indefinitely because the data are easy to collect, and gathering them requires little time on the part of students or faculty. I refer to them as student ratings, not evaluations, because “evaluation” indicates that a judgment of value or worth has been made (by the students), while “ratings” denote data that need interpretation (by the faculty member, colleagues, or administrators) (Benton & Cashin, 2011).
Readers may be asked to interpret the data from their SRT on their annual reviews or in their applications for tenure or promotion. They may even find themselves on committees charged with reviewing the overall teaching evaluation process or the particular form that students use at their institutions, as I did. For these reasons, I thought it might be helpful to discuss some general issues concerning SRT and then present a few practical guidelines for using and interpreting SRT data.
My career as a mathematics professor spanned four decades (1973-2013) at Loyola Marymount University, a comprehensive private institution in Los Angeles. During that time, my teaching was assessed each semester by student “evaluations.” For nearly all of those 40 years this was the only method used on a regular basis. If there were student complaints, a classroom observation by a senior faculty member might take place, which happened to me once as an untenured faculty member. Later on, as a senior faculty member, I myself was called upon to perform a few classroom observations.
During 2006–2011, I also directed a number of faculty development programs on campus, including the Center for Teaching Excellence. In that role, I served as a resource person to a Faculty Senate committee appointed in 2010 to develop a comprehensive system for evaluating teaching. Prior to that, I had participated in a successful faculty-led effort to revise the form students used to rate our teaching, and I worked to develop and disseminate guidelines about how that data should be interpreted. During that two-year process (2007-2009), I discovered that my colleagues and I, and even faculty developers on other campuses, had a lot to learn about the limitations of this data (Dewar, 2011).
Because teaching is such a complex and multi-faceted task, its evaluation requires the use of multiple measures. Classroom observations, peer review of teaching materials (syllabus, exams, assignments, etc.), course portfolios, student interviews (group or individual), and alumni surveys are other measures that could be employed (Arreola, 2007; Chism, 2007; Seldin, 2004). In practice, SRT are the most commonly used measure (Seldin, 1999) and, frequently, the primary measure (Ellis, Deshler, & Speer, 2016; Loeher, 2006). Even worse, “many institutions reduce their assessment of the complex task of teaching to data from one or two questions” (Fink, 2008, p. 4).
The reliability and validity of SRT have attracted both critics (e.g., Stark & Freishtat, 2014) and defenders (e.g., Benton & Cashin, 2011; Benton & Ryalls, 2016). Back-and-forth discussions about SRT occur frequently on the listserv maintained by the POD (Professional and Organizational Development) Network, the professional society for faculty developers (see http://podnetwork.org). Earlier this month, in just one 24-hour period, there were 18 postings by 12 individuals on the topic (see https://groups.google.com/a/podnetwork.org/forum/#!topic/discussion/pBpkkck_xEk).
The advent of online courses has provided new opportunities to investigate gender bias in SRT, leading to new calls for banishing them from use in personnel decisions (MacNell, Driscoll, & Hunt, 2015; Boring, Ottoboni, & Stark, 2016). Still, as noted above, experts continue to argue their merits.
Setting aside questions of bias, readers should be aware of the many factors that can affect the reliability and validity of SRT. These include the content and wording of the items on the form and how the data are reported.
Some issues related to the items on the form are:
- They must address qualities that students are capable of rating (e.g., students would not be qualified to judge an instructor’s knowledge of the subject matter).
- The students’ interpretation of the wording should be the same as the intended meaning (e.g., students and instructors may have very different understandings of words like “fair” and “challenging”).
- The wording of items should not privilege or be more applicable to certain types of instruction than others (e.g., references to the instructor’s “presentations” or the “classroom” may inadvertently favor traditional lecture over pedagogies such as inquiry-based learning (IBL), cooperative learning in small groups, flipped classrooms, or community-based learning).
- The items should follow the principles of good survey design (e.g., no item should be “double-barreled,” that is, ask for a rating of two distinct factors, such as The instructor provided timely and useful feedback. See Berk (2006) for more practical and entertaining advice.)
- Inclusion of global items, such as Rate this course as a learning experience, may be efficient for personnel committees, but data obtained from such items provide no insight into specific aspects of teaching and can be misleading (Stark & Freishtat, 2014).
Regarding how the data are reported:
#1. Sufficient Response Ratio
There must be an appropriately high response ratio. If the response rate is low, the data cannot be considered representative of the class as a whole. For classes with 5 to 20 students enrolled, 80% is recommended; for classes with between 21 and 50 students, 75% is recommended. For still larger classes, 50% is acceptable. Data should not be considered in personnel decisions if the response rate falls below these levels (Stark & Freishtat, 2014; Theall & Franklin, 1991, p. 89). (NOTE: Items left blank or marked Not Applicable should not be included in the count of the number of responses. Therefore, the response ratio for an individual instructor may vary from item to item.)
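These thresholds are easy to apply mechanically. The following is a minimal sketch in Python; the function names are my own, and the threshold tiers simply encode the recommendations cited above (Stark & Freishtat, 2014; Theall & Franklin, 1991).

```python
def required_response_rate(enrolled: int) -> float:
    """Minimum recommended response rate before SRT data are used
    in personnel decisions (thresholds per Stark & Freishtat, 2014;
    Theall & Franklin, 1991)."""
    if enrolled <= 20:
        return 0.80   # classes of 5 to 20 students
    if enrolled <= 50:
        return 0.75   # classes of 21 to 50 students
    return 0.50       # still larger classes


def item_is_usable(enrolled: int, responses: int) -> bool:
    """Blank or Not Applicable marks are excluded from `responses`,
    so the response ratio (and usability) can vary item by item."""
    return responses / enrolled >= required_response_rate(enrolled)
```

For example, 20 responses out of 30 enrolled (67%) falls short of the 75% threshold, so `item_is_usable(30, 20)` is False, while 10 of 12 (83%) clears the 80% threshold for small classes.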
#2. Appropriate Comparisons
Because students tend to give higher ratings to courses in their majors or to electives than they do to courses required for graduation, the most appropriate comparisons are made between courses of a similar nature (Pallett, 2006). For example, the average across all courses in a College of Arts and Sciences or even across all mathematics department courses would not be a valid comparison for a quantitative literacy course.
#3. When Good Teaching is the Average
When interpreting an instructor’s rating on a particular item, it is more appropriate to look at the descriptor corresponding to the rating, or the rating’s location along the scale, instead of comparing it to an average of ratings (Pallett, 2006). In other words, a good rating is still good, even when the numerical value falls below the average (for example, getting a 4.0 on a scale of 5, when the average is 4.2). Stark and Freishtat (2014) go even further, recommending reporting the distribution of scores, the number of responders, and the response rate, but not averages.
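The kind of report Stark and Freishtat recommend can be sketched in a few lines. This is an illustration only (the function name and dictionary layout are my own, assuming a 1-to-5 scale); the point is that the full distribution and response counts are reported rather than a single average.

```python
from collections import Counter


def summarize_item(ratings, enrolled):
    """Summarize one SRT item as a distribution plus response counts,
    rather than an average (per Stark & Freishtat, 2014).
    Assumes a 1-5 rating scale; blanks/N.A. are already excluded."""
    counts = Counter(ratings)
    return {
        "distribution": {score: counts.get(score, 0) for score in range(1, 6)},
        "n_responders": len(ratings),
        "response_rate": len(ratings) / enrolled,
    }
```

For instance, the ratings [5, 5, 4, 4, 4, 3] in a class of 8 yield the distribution {1: 0, 2: 0, 3: 1, 4: 3, 5: 2} with 6 responders, which shows at a glance where the ratings cluster in a way a lone mean of 4.17 would not.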
#4. Written Comments
Narrative comments are often given great consideration by administrators, but this practice is problematic. Only about 10% of students write comments (unless there is an extreme situation), while the first guideline sets 50% as the lowest acceptable response rate. Thus, decisions should not rest on a 10% sample just because the comments were written rather than given in numerical form! Student comments can be valuable for the insights they provide into classroom practice, and they can guide further investigation or be used along with other data, but they should not be used by themselves to make decisions (Theall & Franklin, 1991, pp. 87-88).
#5. Other Considerations
- Class size can affect ratings. Students tend to rate instructors of small classes (fewer than 10 or 15 students) most highly, followed by classes of 16 to 35 students and then classes with over 100 students; classes of 35 to 100 students receive the least favorable ratings (Theall & Franklin, 1991, p. 91).
- There are disciplinary differences in ratings. Humanities courses tend to be rated more highly than those in the physical sciences (Theall & Franklin, 1991, p. 91).
Many basic, and difficult, issues related to the use of SRT for evaluating teaching effectiveness have not been addressed here, such as how to define “teaching effectiveness.” I hope even this limited discussion has helped make readers more aware of issues surrounding the use of SRT, and that they will sample the resources and links provided.
Arreola, R. (2007). Developing a comprehensive faculty evaluation system: A handbook for college faculty and administrators on designing and operating a comprehensive faculty evaluation system (3rd ed.). San Francisco: Anker Publishing.
Berk, R. A. (2006). Thirteen strategies to measure college teaching. Sterling, VA: Stylus.
Benton, S. L., & Cashin, W. E. (2011). IDEA Paper No. 50: Student ratings of teaching: A summary of research and literature. Manhattan, KS: The IDEA Center. Retrieved from: http://ideaedu.org/wp-content/uploads/2014/11/idea-paper_50.pdf
Benton, S. L., & Ryalls, K. R. (2016). IDEA Paper No. 58: Challenging misconceptions about student ratings of instruction. Manhattan, KS: The IDEA Center. Retrieved from http://www.ideaedu.org/Portals/0/Uploads/Documents/IDEA%20Papers/IDEA%20Papers/PaperIDEA_58.pdf
Boring, A., Ottoboni, K., & Stark, P. B. (2016). Student evaluations of teaching (mostly) do not measure teaching effectiveness. ScienceOpen Research. DOI: 10.14293/S2199-1006.1.SOR-EDU.AETBZC.v1
Chism, N. (2007). Peer review of teaching: A sourcebook (2nd ed). Bolton, MA: Anker.
Dewar, J. (2011). Helping stakeholders understand the limitations of SRT data: Are we doing enough? Journal of Faculty Development, 25(3), 40-44.
Ellis, J., Deshler, J., & Speer, N. (2016, August). How do mathematics departments evaluate their graduate teaching assistant professional development programs? Paper presented at the 40th Conference of the International Group for the Psychology of Mathematics Education, Szeged, Hungary.
Fink, L. D. (2008). Evaluating teaching: A new approach to an old problem. In D. Robertson & L. Nilson (Eds.), To improve the academy: Vol. 26 (pp. 3-21). San Francisco, CA: Jossey-Bass.
Loeher, L. (2006, October). An examination of research university faculty evaluations policies and practices. Paper presented at the 31st annual meeting of the Professional and Organizational Development Network in Higher Education, Portland, OR.
MacNell, L., Driscoll, A. & Hunt, A.N. (2015). What’s in a name: Exposing gender bias in student ratings of teaching. Innovative Higher Education, 40(4), 291-303. DOI:10.1007/s10755-014-9313-4
McKeachie, W. J. (2007). Good teaching makes a difference—and we know what it is. In R. P. Perry and J.C. Smart, (Eds.), The scholarship of teaching and learning in higher education: An evidence-based approach (pp. 457-474). New York, NY: Springer.
Pallett, W. (2006). Uses and abuses of student ratings. In P. Seldin (Ed.), Evaluating faculty performance: A practical guide to assessing teaching, research, and service. Bolton, MA: Anker Publishing.
Seldin, P. (Ed.). (1999). Changing practices in evaluating teaching. Bolton, MA: Anker Publishing.
Seldin, P. (2004). The teaching portfolio: A practical guide to improved performance and promotion/tenure decisions (3rd ed.). Bolton, MA: Anker Publishing.
Stark, P. B., & Freishtat, R. (2014). An evaluation of course evaluations. ScienceOpen Research. DOI: 10.14293/S2199-1006.1.SOR-EDU.AOFRQA.v1
Theall, M., & Franklin, J. (Eds.). (1991). New directions for teaching and learning: No. 48. Effective practices for improving teaching. San Francisco, CA: Jossey-Bass.