*By Spencer Bagley, University of Northern Colorado; Jim Gleason, University of Alabama; Lisa Rice, Arkansas State University; Matt Thomas, Ithaca College; and Diana White, Contributing Editor, University of Colorado Denver*

(Note: Authors are listed alphabetically; all authors contributed equally to the preparation of this blog entry.)

Concept inventories have emerged over the past two decades as one way to measure conceptual understanding in STEM disciplines. The Calculus Concept Inventory (CCI), developed by Epstein and colleagues (Epstein, 2007, 2013), is one of the primary instruments in the area of differential calculus. The CCI is a criterion-referenced instrument that measures classroom normalized gain, that is, the change in the class average divided by the possible change in the class average. Its goal was to evaluate the impact of teaching techniques on conceptual learning of differential calculus.
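As a rough illustration of the normalized gain described above, the following Python sketch computes it for a single class. The function name `class_normalized_gain`, the score lists, and the 20-point maximum are all hypothetical and chosen only for the example; they are not taken from the CCI or from any published data.

```python
from statistics import mean


def class_normalized_gain(pre_scores, post_scores, max_score):
    """Change in the class average divided by the possible change.

    pre_scores and post_scores are lists of raw scores for one class;
    max_score is the highest score attainable on the instrument.
    """
    pre_avg = mean(pre_scores)
    post_avg = mean(post_scores)
    possible_gain = max_score - pre_avg
    if possible_gain <= 0:
        raise ValueError("pre-test average is already at the maximum score")
    return (post_avg - pre_avg) / possible_gain


# Hypothetical class of four students on a 20-point instrument.
pre = [6, 8, 10, 8]      # class average 8
post = [10, 12, 14, 12]  # class average 12
gain = class_normalized_gain(pre, post, max_score=20)
print(f"normalized gain: {gain:.2f}")  # (12 - 8) / (20 - 8)
```

Because the gain is computed from class averages rather than from individual students, it describes the class as a whole and says nothing about which students improved.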

While the CCI represents a good start toward measuring calculus understanding, recent studies point out some significant issues with the instrument. This is concerning, given that there seems to be an increased use of the instrument in formal and informal studies and assessment. For example, in a recent special issue of PRIMUS (Maxson & Szaniszlo, 2015a, 2015b) related to flipped classrooms in mathematics, three of the five papers dealing with calculus cited and used the CCI. In this blog we provide an overview of concept inventories, discuss the CCI, outline some problems we found, and suggest future needs for high-quality conceptual measures of calculus understanding.

Before proceeding, however, we would like to acknowledge and thank the designers of the CCI for starting the process of developing a measure for students’ understanding of calculus. We regret that with the passing of Jerome Epstein, he is unable to respond directly to our findings or contribute to future work. His efforts, and those of his collaborators, have undoubtedly had tremendous impact on the awareness of the mathematics community of concept inventories and the associated need to teach and learn conceptually, and we believe they contributed positively to the teaching and learning of calculus. We hope that the mathematical community will continue the work that he started.

The first concept inventory to make a significant impact in the undergraduate education community was the Force Concept Inventory (FCI), written by Hestenes, Wells, and Swackhamer (1992). Despite the fact that most physics professors deemed the questions on the inventory “too trivial to be informative” (Hestenes et al., 1992, p. 2), students did poorly on the test and, in both high-school and university physics classes, only made modest gains. Of the 1,500 high-school students and over 500 university students who took the test, high school students were learning 20%-23% of the previously unknown concepts, and college students at most 32% (Hestenes et al., 1992, p. 6). Through a well-documented process of development and refinement, the test has become an accepted and widely used tool in the physics community, and has led to changes in the way introductory physics has been taught (e.g., Hake, 1998; Mazur, 1997). The FCI paved the way for the broad application of analyzing student conceptual understanding of the basic ideas in various STEM disciplines (Hake, 1998, 2007; Hestenes et al., 1992), including physics, chemistry, astronomy, biology, and geoscience.

Recently, the authors of this blog post conducted a thorough analysis of the CCI (Gleason et al., 2015a, 2015b), with the primary objective of assessing the degree to which the CCI conforms to certain standards for psychometric properties, including content validity, internal structure validity, and internal reliability. One can think of validity as determining whether an instrument measures what it is intended to measure, and of reliability as determining how well the instrument measures whatever it is measuring. Thus, for educational instruments, validity addresses whether a person’s score on an instrument is meaningful with regard to measuring the desired constructs and helps researchers make inferences. The goal of establishing content validity is to determine the degree to which the instrument measures what it intends, and internal structure validity investigates “beneath” the item responses of participants, including how subscales may relate to each other.

Subscales are usually created when there is a desire to understand different components of a knowledge state of the individual or group, and when there is an expectation of different levels of knowledge in the different categories. For example, a high school geometry test might consist of subscales measuring student understanding of triangles, circles, and arcs. Another example would be the different components of an ACT or SAT test where scores are given for English, Math and Reading, as well as a composite score. The items within the subscales should be highly correlated and items between the different subscales may also be correlated, though likely to a much lesser extent. The goal of a validity study regarding the internal structure of an instrument is to determine if the items are measuring distinct constructs or just a single underlying construct in order to justify the usage of subscores.

Since one of the goals of concept inventories is to measure conceptual understanding before exposure to the content, students are required to use their prior knowledge to respond to assessment items at initial enrollment in the course. After completing the course, the concept inventory can then be used to measure gains in conceptual understanding. To ensure validity in this process of measuring gains, items must be carefully written to avoid using terminology taught in the course to which students have no prior exposure.

However, several researchers noticed that released CCI items contained terminology and notation introduced only in a calculus course, such as the word “derivative” and the notation \(f'(x)\). This is problematic because the CCI is meant to assess only students’ understanding of concepts in calculus, rather than the specific vocabulary of calculus. While a majority (67%) of calculus students at Ph.D.-granting institutions have had previous exposure to calculus, 41% of all post-secondary calculus students did not take calculus in high school, and high school calculus students have had no previous exposure (Bressoud, Mesa, & Rasmussen, 2015). Items that contain unfamiliar terminology and notation would confuse students and generate responses around random chance for those items. However, Epstein states that the developers intended the instrument to measure above random chance at the pre-test and to avoid “confusing wording” (Epstein, 2013, p. 7). Though he may have meant “confusing” to mean “convoluted”, a student with no background in calculus would be in a poor position to answer a question in which the notation \(f'(x)\) appears. Because of the use of calculus-specific terminology, the validity of pre-test scores is questionable for populations with large numbers of students lacking previous exposure to calculus, such as those in high schools, at community colleges, and at regional universities.

With regard to internal structure validity, issues emerged with the CCI when conducting a factor analysis. (For technical details, see pp. 1291-1297 of the proceedings cited as Gleason et al., 2015a.) A factor analysis uses the pattern of correlations among student responses to determine the number of underlying factors of the instrument and how those factors relate to one another. Epstein and colleagues suggested that the CCI measures conceptual understanding of calculus through three factors (functions, derivatives, and limits/ratios/the continuum). In other words, they claim that a three-factor model captures all of the correlations among the items on the instrument. However, we showed that the item responses are so closely correlated that a single factor, which appears to be an overall knowledge of calculus content, “can adequately account for the pattern of correlations among the items” (DeVellis, 2003, p. 109), and that there are no subscales. This finding means the CCI cannot generate valuable information about conceptual understanding of different components of calculus, such as limits or rates of change, but instead measures an overall calculus knowledge.

One method of measuring the reliability of an instrument is to measure the extent to which the individual items of an instrument fall within the same general construct. In this regard, Epstein (2013) reported, and the authors of this blog post confirmed, that the CCI has an internal reliability (Cronbach’s alpha) of around 0.7, meaning that the instrument has a 51% error variance and a standard error of 10% on each individual student score (Cohen & Swerdlik, 2010; Tavakol & Dennick, 2011). In particular, this does not meet the established standard of an alpha of 0.80 or higher necessary for use in research for any type of educational assessment (Lance, Butts, & Michels, 2006, p. 206).
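For readers who want to see how such a reliability figure is obtained, here is a minimal Python sketch of the standard Cronbach's alpha formula, \(\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i \sigma_i^2}{\sigma_T^2}\right)\), where \(k\) is the number of items, \(\sigma_i^2\) is the variance of item \(i\), and \(\sigma_T^2\) is the variance of the total scores. The response matrix is invented for illustration and is not CCI data, and the function names are hypothetical. The helper `error_variance_index` assumes that the 51% figure quoted above comes from squaring alpha and subtracting from one, which appears to be the interpretation in the cited sources.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of student rows of item scores."""
    k = len(responses[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item (column) across students.
    item_vars = [pvariance([row[i] for row in responses]) for i in range(k)]
    # Variance of each student's total score.
    total_var = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def error_variance_index(alpha):
    """Assumed interpretation: error variance taken as 1 - alpha squared."""
    return 1 - alpha ** 2


# Made-up right/wrong (1/0) responses: 5 students, 4 items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(responses)
print(f"alpha for the toy data: {alpha:.2f}")
print(f"error variance index at alpha = 0.7: {error_variance_index(0.7):.2f}")
```

The sketch uses population variances throughout; using sample variances for both the items and the totals would give the same alpha, since the scaling factor cancels in the ratio.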

**Conclusion**

With the centrality of calculus to undergraduate mathematics programs and a variety of mathematically intensive partner disciplines, such as economics, physics, and engineering, there is a need to look at the course’s learning outcomes. Recent efforts through the MAA’s National Studies of College Calculus have helped the mathematical community better understand the current state of calculus programs around the country. Data and research on student outcomes in calculus, especially with regard to conceptual knowledge, lag somewhat behind. Part of this is attributable to a lack of appropriate, well-validated instruments to measure outcomes. As most faculty are not trained in rigorous assessment development, they often depend on others for instruments to measure student learning in courses and programs.

Because of the aforementioned concerns, though, we conclude that the existing CCI does not conform to accepted standards for educational testing (American Educational Research Association, 2014; DeVellis, 2012). As such, users of the CCI should be very aware of its limitations. In particular, it may underestimate the conceptual understanding at the beginning of a calculus course for students who have never taken a calculus class before but understand the ideas underlying calculus. We recommend careful consideration in using the CCI and urge users to keep in mind the kind of information being sought. In addition, we suggest exercising extreme caution in using it for any type of formal assessment processes.

Given the shortcomings of the CCI, as well as the inherent limitations of a static instrument with set questions, we argue that there is a need to create an item bank of rigorously developed and validated questions, with solid psychometric properties, that measure students’ conceptual understanding of differential calculus. Such an item bank would significantly impact teaching and learning during the first two years of undergraduate STEM. Instructors could use it for formative and summative assessment during their calculus courses to improve student learning. Researchers and evaluators could also use it to measure growth of student conceptual understanding during a first-semester calculus course and to compare gains of students in classrooms implementing differing instructional techniques.

If permission were granted for the CCI to be used as a launching point, then perhaps some of those questions could be used or modified. Prior work on developing conceptually-focused instruments in mathematics, such as the Precalculus Concept Assessment (Carlson, Oehrtman, & Engelke, 2010) and the Calculus Concept Readiness Instrument (Carlson, Madison, & West, 2010), could serve as models for the item-development process.

**References**

American Educational Research Association., American Psychological Association., National Council on Measurement in Education., & Joint Committee on Standards for Educational and Psychological Testing (U.S.). (2014). *Standards for educational and psychological testing*. Washington, DC: Author.

Bressoud, D., Mesa, V., & Rasmussen, C. (2015). *Insights and recommendations from the MAA national study of college calculus*. MAA Press.

Carlson, M., Madison, B., & West, R. (2010). The Calculus Concept Readiness (CCR) Instrument: Assessing student readiness for calculus. *arXiv preprint arXiv:1010.2719*.

Carlson, M., Oehrtman, M., & Engelke, N. (2010). The precalculus concept assessment: A tool for assessing students’ reasoning abilities and understandings. *Cognition and Instruction, 28*(2), 113-145.

Cohen, R. & Swerdlik, M. (2010). *Psychological testing and assessment*. Burr Ridge, IL: McGraw-Hill.

DeVellis, R. F. (2003). *Scale development: Theory and applications* (2nd ed.). Thousand Oaks, CA: SAGE Publications.

DeVellis, R. F. (2012). *Scale development: Theory and applications* (3rd ed.). Thousand Oaks, CA: SAGE Publications.

Epstein, J. (2007). Development and validation of the Calculus Concept Inventory. In *Proceedings of the Ninth International Conference on Mathematics Education in a Global Community* (pp. 165–170).

Epstein, J. (2013). The calculus concept inventory – Measurement of the effect of teaching methodology in mathematics. *Notices of the American Mathematical Society, 60*(8), 2-10.

Gleason, J., Thomas, M., Bagley, S., Rice, L., White, D., & Clements, N. (2015a). Analyzing the Calculus Concept Inventory: Content validity, internal structure validity, and reliability analysis. *Proceedings of the 37th International Conference of the North American Chapter of the Psychology of Mathematics Education*, 1291-1297.

Gleason, J., White, D., Thomas, M., Bagley, S., & Rice, L. (2015b). The Calculus Concept Inventory: A psychometric analysis and framework for a new instrument. *Proceedings of the 18th Annual Conference on Research in Undergraduate Mathematics Education*, 135-149.

Hake, R. R. (1998). Interactive-engagement versus traditional methods: A six-thousand-student survey of mechanics test data for introductory physics courses. *American Journal of Physics, 66*, 64-74.

Hestenes, D., Wells, M., & Swackhamer, G. (1992). Force concept inventory. *The Physics Teacher*, *30*(3), 141–158. doi:10.1119/1.2343497

Lance, C. E., Butts, M. M., & Michels, L. C. (2006). The sources of four commonly reported cutoff criteria: What did they really say? *Organizational Research Methods, 9*(2), 202-220.

Maxson, K., & Szaniszlo, Z. (Eds.). (2015a). Special issue on the flipped classroom: Reflections on implementation [Special issue]. *PRIMUS, 25*(8).

Maxson, K., & Szaniszlo, Z. (Eds.). (2015b). Special issue on the flipped classroom: Effectiveness as an instructional model [Special issue]. *PRIMUS, 25*(9-10).

Mazur, E. (1997). *Peer instruction: A user’s manual*. Upper Saddle River, NJ: Prentice Hall.

Tavakol, M., & Dennick, R. (2011). Making sense of Cronbach’s alpha. *International Journal of Medical Education*, *2*, 53–55. http://doi.org/10.5116/ijme.4dfb.8dfd

Do you happen to know if the CCI is currently still being developed further and if so by whom?

I don’t know of anyone who is currently developing it. I think it would be a great thing for someone to be working on though!