Comparison of vertical scaling methods in the context of NCLB

  • Author / Creator
    Gotzmann, Andrea Julie
  • Vertical scaling is the process of establishing a numerical test score scale across several age or grade levels. Because the current literature does not indicate which vertical scaling procedure works “best” in all situations, this study evaluated the performance of four vertical scaling procedures (concurrent calibration, fixed common item parameters, test characteristic curve, and hybrid characteristic curve) across two content areas (Reading and Mathematics), two score distribution types (normal and negatively skewed), and two sample sizes (1,500 and 3,000). Five outcome measures were used to evaluate the results: decision accuracy, decision consistency, conditional standard errors at each of two cut-scores, root-mean-squared differences of the scale scores between scaling procedures, and correlations between the scaling procedures’ final item parameters. The data came from a large-scale U.S. testing program in Reading and Mathematics for grades 3 through 8. These data were used to simulate each combination of score distribution type and sample size, with 100 replicates per combination.

    The largest differences among the four vertical scaling procedures for Reading were found at the lower and upper grade levels, particularly for decision accuracy. Decision accuracy also showed a different pattern of results for the normal and skewed distributions: for the skewed distribution, accuracy decreased markedly as grade level increased. For Mathematics, the largest differences across all outcome measures occurred across grade levels rather than across vertical scaling procedures. Sample size did not appear to have an effect for either Reading or Mathematics.

    Practitioners should ensure that decision accuracy and consistency values are high across all grade levels and that the chosen scaling procedure does not produce undesirable results. If a state program allows different procedures for different content areas, then the hybrid characteristic curve procedure would be most appropriate for Reading and the test characteristic curve procedure most appropriate for Mathematics. However, if the procedure must be the same for both content areas, then the hybrid characteristic curve procedure could be used for both Reading and Mathematics. Measurement specialists can use these results to guide their implementation of vertical scaling in their state assessment programs.
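
    As a rough illustration of two of the outcome measures named in the abstract, the sketch below computes a root-mean-squared difference between paired scale scores and a simple classification-agreement proportion against two cut-scores. It is not the thesis's implementation. The function names, the cut-score values, and the example scores are all hypothetical, and the thesis may define decision accuracy and consistency differently (for example, through model-based estimates). The sketch assumes that decision accuracy can be read as agreement between classifications from estimated and true (simulated) scores, and decision consistency as agreement between classifications from two parallel sets of estimates.

```python
import math
from bisect import bisect_right


def rmsd(scores_a, scores_b):
    """Root-mean-squared difference between two paired lists of scale scores."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((a - b) ** 2 for a, b in zip(scores_a, scores_b))
    return math.sqrt(squared / len(scores_a))


def classify(score, cuts):
    """Return the category index for a score, given ascending cut-scores.

    Assumption: a score exactly equal to a cut-score falls in the higher category.
    """
    return bisect_right(cuts, score)


def agreement(scores_x, scores_y, cuts):
    """Proportion of examinees placed in the same category by both score sets."""
    if len(scores_x) != len(scores_y) or not scores_x:
        raise ValueError("inputs must be non-empty and of equal length")
    same = sum(classify(x, cuts) == classify(y, cuts)
               for x, y in zip(scores_x, scores_y))
    return same / len(scores_x)


# Hypothetical values, for illustration only.
cuts = [400, 450]                      # two cut-scores, three categories
true_scores = [380, 410, 455, 449]     # simulated "true" scale scores
estimated = [390, 405, 445, 452]       # scale scores from one scaling procedure

accuracy = agreement(true_scores, estimated, cuts)
difference = rmsd(true_scores, estimated)
```

    Calling `agreement` on two sets of estimates from parallel replications, rather than on true and estimated scores, would give the consistency-style reading of the same statistic under the assumptions stated above.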

  • Graduation date
    Fall 2011
  • Degree
    Doctor of Philosophy
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.