首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 468 毫秒
1.
The value‐added method of Haberman is arguably one of the most popular methods to evaluate the quality of subscores. The method is based on the classical test theory and deems a subscore to be of added value if the subscore predicts the corresponding true subscore better than does the total score. Sinharay provided an interpretation of the added value of subscores in terms of scores and subscores on parallel forms. This article extends the results of Sinharay and considers the prediction of a subscore on a parallel form from both the subscore and the total raw score on the original form. The resulting predictor essentially becomes the augmented subscore suggested by Haberman. The proportional reduction in mean squared error of the resulting predictor is interpreted as a squared multiple correlation coefficient. The practical usefulness of the derived results is demonstrated using an operational data set.  相似文献   

2.
Brennan ( 2012 ) noted that users of test scores often want (indeed, demand) that subscores be reported, along with total test scores, for diagnostic purposes. Haberman ( 2008 ) suggested a method based on classical test theory (CTT) to determine if subscores have added value over the total score. According to this method, a subscore has added value if the corresponding true subscore is predicted better by the subscore than by the total score. In this note, parallel‐forms scores are considered. It is proved that another way to interpret the method of Haberman is that a subscore has added value if it is in better agreement than the total score with the corresponding subscore on a parallel form. The suggested interpretation promises to make the method of Haberman more accessible because several practitioners find the concept of parallel forms more acceptable or easier to understand than that of a true score. Results are shown for data from two operational tests.  相似文献   

3.
Recently, there has been an increasing level of interest in subscores for their potential diagnostic value. Haberman suggested a method based on classical test theory to determine whether subscores have added value over total scores. In this article I first provide a rich collection of results regarding when subscores were found to have added value for several operational data sets. Following that I provide results from a detailed simulation study that examines what properties subscores should possess in order to have added value. The results indicate that subscores have to satisfy strict standards of reliability and correlation to have added value. A weighted average of the subscore and the total score was found to have added value more often.  相似文献   

4.
Brennan noted that users of test scores often want (indeed, demand) that subscores be reported, along with total test scores, for diagnostic purposes. Haberman suggested a method based on classical test theory (CTT) to determine if subscores have added value over the total score. One way to interpret the method is that a subscore has added value only if it has a better agreement than the total score with the corresponding subscore on a parallel form. The focus of this article is on classification of the examinees into “pass” and “fail” (or master and nonmaster) categories based on subscores. A new CTT‐based method is suggested to assess whether classification based on a subscore is in better agreement, than classification based on the total score, with classification based on the corresponding subscore on a parallel form. The method can be considered as an assessment of the added value of subscores with respect to classification. The suggested method is applied to data from several operational tests. The added value of subscores with respect to classification is found to be very similar, except at extreme cutscores, to their added value from a value‐added analysis of Haberman.  相似文献   

5.
6.
The study examined two approaches for equating subscores. They are (1) equating subscores using internal common items as the anchor to conduct the equating, and (2) equating subscores using equated and scaled total scores as the anchor to conduct the equating. Since equated total scores are comparable across the new and old forms, they can be used as an anchor to equate the subscores. Both chained linear and chained equipercentile methods were used. Data from two tests were used to conduct the study and results showed that when more internal common items were available (i.e., 10–12 items), then using common items to equate the subscores is preferable. However, when the number of common items is very small (i.e., five to six items), then using total scaled scores to equate the subscores is preferable. For both tests, not equating (i.e., using raw subscores) is not reasonable as it resulted in a considerable amount of bias.  相似文献   

7.
Recent research has proposed a criterion to evaluate the reportability of subscores. This criterion is a value‐added ratio (VAR), where values greater than 1 suggest that the true subscore is better approximated by the observed subscore than by the total score. This research extends the existing literature by quantifying statistical significance and effect size for using VAR to provide practical guidelines for subscore interpretation and reporting. Findings indicate that subscores with VAR ≥ 1.1 are a minimum requirement for a meaningful contribution to a user's score interpretation; subscores with .9 < VAR < 1.1 are redundant with the total score and subscores with VAR ≤ .9 would be misleading to report. Additionally, we discuss what to do when subscores do not add value, yet must be reported, as well as when VAR ≥ 1.1 may be undesirable.  相似文献   

8.
In common-item equating the anchor block is generally built to represent a miniature form of the total test in terms of content and statistical specifications. The statistical properties frequently reflect equal mean and spread of item difficulty. Sinharay and Holland (2007) suggested that the requirement for equal spread of difficulty may be too restrictive. They suggested that an anchor test with representative content coverage and equal mean item difficulty but a smaller spread of item difficulty (miditest) may provide the same or better results for equating while decreasing the pressure to find very hard and very easy items to include in the anchor. Analyses to date have concentrated on the results of equating the scores from one form to another with findings that are supportive of the Sinharay and Holland concept (Sinharay &; Holland, 2006a, 2006b, 2007; Liu, Sinharay, Holland, Feigenbaum, &; Curley, 2009). These studies do not address longer chains of equating. It is important to monitor the possibility of scale drift over forms. The current research begins to address this issue.  相似文献   

9.
Will subscores provide additional information than what is provided by the total score? Is there a method that can estimate more trustworthy subscores than observed subscores? To answer the first question, this study evaluated whether the true subscore was more accurately predicted by the observed subscore or total score. To answer the second question, three subscore estimation methods (i.e., subscore estimated from the observed subscore, total score, or a combination of both the subscore and total score) were compared. Analyses were conducted using data from six licensure tests. Results indicated that reporting subscores at the examinee level may not be necessary as they did not provide much additional information over what is provided by the total score. However, at the institutional level (for institution size ≥ 30), reporting subscores may not be harmful, although they may be redundant because the subscores were predicted equally well by the observed subscores or total scores. Finally, results indicated that estimating the subscore using a combination of observed subscore and total score resulted in the highest reliability.  相似文献   

10.
This study explores an anchor that is different from the traditional miniature anchor in test score equating. In contrast to a traditional “mini” anchor that has the same spread of item difficulties as the tests to be equated, the studied anchor, referred to as a “midi” anchor (Sinharay & Holland), has a smaller spread of item difficulties than the tests to be equated. Both anchors were administered in an operational SAT administration and the impact of anchor type on equating was evaluated with respect to systematic error or equating bias. Contradicting the popular belief that the mini anchor is best, the results showed that the mini anchor does not always produce more accurate equating functions than the midi anchor; the midi anchor was found to perform as well as or even better than the mini anchor. Because testing programs usually have more middle difficulty items and few very hard or very easy items, midi external anchors are operationally easier to build. Therefore, the results of our study provide evidence in favor of the midi anchor, the use of which will lead to cost saving with no reduction in equating quality.  相似文献   

11.
The choice of anchor tests is crucial in applications of the nonequivalent groups with anchor test design of equating. Sinharay and Holland (2006, 2007) suggested “miditests,” which are anchor tests that are content‐representative and have the same mean item difficulty as the total test but have a smaller spread of item difficulties. Sinharay and Holland (2006, 2007), Cho, Wall, Lee, and Harris (2010), Fitzpatrick and Skorupski (2016), Liu, Sinharay, Holland, Curley, and Feigenbaum (2011a), Liu, Sinharay, Holland, Feigenbaum, and Curley (2011b), and Yi (2009) found the miditests to lead to better equating than minitests, which are representative of the total test with respect to content and difficulty. However, these findings recently came into question as Trierweiler, Lewis, and Smith (2016) concluded, based on a comparison of correlation coefficients of miditests and minitests with the total test, that making an anchor test a miditest does not generally increase the anchor to total score correlation and recommended the continuation of the practice of using minitests over miditests. Their recommendation raises the question, “Should miditests continue to be considered in practice?” This note defends the miditests by citing literature that favors miditests and then by showing that miditests perform as well as the minitests in most realistic situations considered in Trierweiler et al. (2016), which implies that miditests should continue to be seriously considered by equating practitioners.  相似文献   

12.
The purpose of this ITEMS module is to provide an introduction to subscores. First, examples of subscores from an operational test are provided. Then, a review of methods that can be used to examine if subscores have adequate psychometric quality is provided. It is demonstrated, using results from operational and simulated data, that subscores have to be based on a sufficient number of items and have to be sufficiently distinct from each other to have adequate psychometric quality. It is also demonstrated that several operationally reported subscores do not have adequate psychometric quality. Recommendations are made for those interested in reporting subscores for educational tests.  相似文献   

13.
Criterion‐related profile analysis (CPA) can be used to assess whether subscores of a test or test battery account for more criterion variance than does a single total score. Application of CPA to subscore evaluation is described, compared to alternative procedures, and illustrated using SAT data. Considerations other than validity and reliability are discussed, including broad societal goals (e.g., affirmative action), fairness, and ties in expected criterion predictions. In simulation data, CPA results were sensitive to subscore correlations, sample size, and the proportion of criterion‐related variance accounted for by the subscores. CPA can be a useful component in a thorough subscore evaluation encompassing subscore reliability, validity, distinctiveness, fairness, and broader societal goals.  相似文献   

14.
Subscores can be of diagnostic value for tests that cover multiple underlying traits. Some items require knowledge or ability that spans more than a single trait. It is thus natural for such items to be included on more than a single subscore. Subscores only have value if they are reliable enough to justify conclusions drawn from them and if they contain information about the examinee that is distinct from what is in the total test score. In this study we show, for a broad range of conditions of item overlap on subscores, that the value of the subscore is always improved through the removal of such items.  相似文献   

15.
This study investigates the relationships among factor correlations, inter-item correlations, and the reliability estimates of subscores, providing a guideline with respect to psychometric properties of useful subscores. In addition, it compares subscore estimation methods with respect to reliability and distinctness. The subscore estimation methods explored in the current study include augmentation based on classical test theory and multidimensional item response theory (MIRT). The study shows that there is no estimation method that is optimal according to both criteria. Augmented subscores show the most improvement in reliability compared to observed subscores but are the least distinct.  相似文献   

16.
Recently, interest in test subscore reporting for diagnosis purposes has been growing rapidly. The two simulation studies here examined factors (sample size, number of subscales, correlation between subscales, and three factors affecting subscore reliability: number of items per subscale, item parameter distribution, and data generating model) that affected the value of reporting subscores within the classical test theory framework. Results showed that a higher proportion of subscores of added value was related to lower correlation between subscales, more items per subscale, no guessing in responses, smaller variability in difficulty parameters, and matched average item difficulty and average examinee ability.  相似文献   

17.
The equating performance of two internal anchor test structures—miditests and minitests—is studied for four IRT equating methods using simulated data. Originally proposed by Sinharay and Holland, miditests are anchors that have the same mean difficulty as the overall test but less variance in item difficulties. Four popular IRT equating methods were tested, and both the means and SDs of the true ability of the group to be equated were varied. We evaluate equating accuracy marginally and conditional on true ability. Our results suggest miditests perform about as well as traditional minitests for most conditions. Findings are discussed in terms of comparability to the typical minitest design and the trade‐off between accuracy and flexibility in test construction.  相似文献   

18.
Subscores Based on Classical Test Theory: To Report or Not to Report   总被引:1,自引:0,他引:1  
There is an increasing interest in reporting subscores, both at examinee level and at aggregate levels. However, it is important to ensure reasonable subscore performance in terms of high reliability and validity to minimize incorrect instructional and remediation decisions. This article employs a statistical measure based on classical test theory that is conceptually similar to the test reliability measure and can be used to determine when subscores have any added value over total scores. The usefulness of subscores is examined both at the level of the examinees and at the level of the institutions that the examinees belong to. The suggested approach is applied to two data sets from a basic skills test. The results provide little support in favor of reporting subscores for either examinees or institutions for the tests studied here.  相似文献   

19.
This study focused on the development of a new classroom environment instrument for late-elementary students. The development of the survey of contemporary learning environments (SoCLE) followed a content analysis of three similar instruments on constructivist learning environments and the literature on characteristics of contemporary learning environments. Through a bottom-up development process, grounded in the epistemological foundations of constructivism, we note the difficulty in operationalizing student perceptions of contemporary learning environments and point to challenges in developing well-grounded construct validity. In a field study of 453 fifth–sixth grade students, the SoCLE was defined with three subscales: Reflective Processes (α?=?0.68), Ownership (α?=?0.65), and Multiple Perspectives (α?=?0.53). The Ownership and Multiple Perspectives subscales indicated relationships with overall academic achievement assessed by a nation-wide testing program, the Iowa Tests of Basic Skills. In a multiple linear regression model, ownership predicted the overall composite achievement score, as well as reading comprehension, social studies, and science subscores. Multiple perspectives predicted the language subscore. These findings highlight that characteristics of contemporary learning environments that are epistemologically based may support achievement in late-elementary students.  相似文献   

20.
The nonequivalent groups with anchor test (NEAT) design involves missing data that are missing by design. Three equating methods that can be used with a NEAT design are the frequency estimation equipercentile equating method, the chain equipercentile equating method, and the item-response-theory observed-score-equating method. We suggest an approach to perform a fair comparison of the three methods. The approach is then applied to compare the three equating methods using three data sets from operational tests. For each data set, we examine how the three equating methods perform when the missing data satisfy the assumptions made by only one of these equating methods. The chain equipercentile equating method is somewhat more satisfactory overall than the other methods.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号