The reliability of an assessment tool is the extent to which it consistently and accurately measures learning.

The validity of an assessment tool is the extent to which it measures what it was designed to measure.

Reliable assessment results will give you confidence that repeated or equivalent assessments will provide consistent results. This puts you in a better position to make generalised statements about a student’s level of achievement, especially when you are using the results of an assessment to make decisions about teaching and learning, or reporting back to students and their parents or caregivers. No results, however, can be completely reliable. There is always some random variation that may affect the assessment, so you should always be prepared to question results.

Factors that can affect reliability:

  • The length of the assessment – a longer assessment generally produces more reliable results.
  • The suitability of the questions or tasks for the students being assessed.
  • The phrasing and terminology of the questions.
  • The consistency in test administration – for example, the length of time given for the assessment, instructions given to students before the test.
  • The design of the marking schedule and moderation of marking procedures.
  • The readiness of students for the assessment – for example, a hot afternoon or straight after physical activity might not be the best time for students to be assessed.

How to be sure that a formal assessment tool is reliable

Check the user manual for evidence of the reliability coefficient. Reliability coefficients range from 0 to 1; a coefficient of 0.9 or more indicates a high degree of reliability.

Assessment tool manuals contain comprehensive administration guidelines. It is essential to read the manual thoroughly before conducting the assessment.

Validity

Educational assessment should always have a clear purpose, making validity the most important attribute of a good test.

The validity of an assessment tool is the extent to which it measures what it was designed to measure, without contamination from other characteristics. For example, a test of reading comprehension should not require mathematical ability.

There are several different types of validity:

  • Face validity - do the assessment items appear to be appropriate?
  • Content validity - does the assessment content cover what you want to assess?
  • Criterion-related validity - how well do the results correspond with an external criterion, such as another established measure of the same skill or a later outcome?
  • Construct validity - are you measuring what you think you're measuring?

A valid assessment should have good coverage of the criteria (concepts, skills and knowledge) relevant to the purpose of the examination.

Examples:

  • The PROBE test is a form of reading running record which measures reading behaviours and includes some comprehension questions. It allows teachers to see the reading strategies that students are using, and potential problems with decoding. The test would not, however, provide in-depth information about a student’s comprehension strategies across a range of texts.
  • STAR (Supplementary Test of Achievement in Reading) is not designed as a comprehensive test of reading ability. It focuses on assessing students’ vocabulary understanding, basic sentence comprehension and paragraph comprehension. It is most appropriately used for students who don’t score well on more general testing (such as PAT or e-asTTle) as it provides a more fine-grained analysis of basic comprehension strategies.

There is an important relationship between reliability and validity. An assessment that has very low reliability will also have low validity. A measurement with very poor accuracy or consistency is unlikely to be fit for its purpose. However, the things required to achieve a very high degree of reliability can impact negatively on validity. For example, consistency in assessment conditions leads to greater reliability because it reduces 'noise' (variability) in the results. On the other hand, one of the things that can improve validity is flexibility in assessment tasks and conditions. Such flexibility allows assessment to be set appropriate to the learning context and to be made relevant to particular groups of students. Insisting on highly consistent assessment conditions to attain high reliability will result in little flexibility, and might therefore limit validity.

The Overall Teacher Judgment balances these ideas, weighing the reliability of a formal assessment tool against the flexibility to use other evidence to make a judgment.

Articles from NZCER SET magazine - Set 2, 2005 and Set 3, 2005 - written by Charles Darr. Used with permission.

Reliability and Validity

All researchers strive to deliver accurate results. Accurate results are both reliable and valid. Reliability means that the results obtained are consistent. Validity is the degree to which the researcher actually measures what he or she is trying to measure.

Reliability and validity are often compared to a marksman's target. Imagine three targets. Target B represents measurement with poor validity and poor reliability: the shots are neither consistent nor accurate. Target A shows a measurement that has good reliability but poor validity: the shots are consistent, but they are off the center of the target. Target C shows a measure with good validity and good reliability, because all of the shots are concentrated at the center of the target.

Random Errors: Random error is a term used to describe all chance or random factors that confound—undermine—the measurement of any phenomenon. Random errors in measurement are inconsistent errors that happen by chance. They are inherently unpredictable and transitory. Random errors include sampling errors, unpredictable fluctuations in the measurement apparatus, or a change in a respondent's mood, which may cause a person to offer an answer to a question that differs from the one he or she would normally provide. The amount of random error is inversely related to the reliability of a measurement instrument.[1] As the number of random errors decreases, reliability rises, and vice versa.

Systematic Errors: Systematic or non-random errors are a constant or systematic bias in measurement. Here are two everyday examples of systematic error: 1) your bathroom scale always registers your weight as five pounds lighter than it actually is, and 2) the thermostat in your home says that the room temperature is 72°, when it is actually 75°. The amount of systematic error is inversely related to the validity of a measurement instrument.[2] As systematic errors increase, validity falls, and vice versa.
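
The difference between the two kinds of error can be made concrete with a short simulation. The sketch below (Python with numpy; the weights, bias, and noise level are invented for illustration) repeatedly "weighs" a person on a bathroom scale that is both noisy and miscalibrated: the scatter of the readings reflects random error, which limits reliability, while the gap between their average and the true weight reflects systematic error, which limits validity.

```python
import numpy as np

rng = np.random.default_rng(0)

true_weight = 150.0   # the quantity we are trying to measure (hypothetical)
random_sd = 2.0       # random error: unpredictable scatter on each reading
systematic = -5.0     # systematic error: the scale always reads 5 lb light

# Ten repeated readings of the same person on the same scale.
readings = true_weight + systematic + rng.normal(0.0, random_sd, size=10)

print("readings:", np.round(readings, 1))
print(f"spread of readings (random error, hurts reliability): {readings.std(ddof=1):.2f}")
print(f"average bias (systematic error, hurts validity): {readings.mean() - true_weight:.2f}")
```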

Reliability:

As stated above, reliability is concerned with the extent to which an experiment, test, or measurement procedure yields consistent results on repeated trials. Reliability is the degree to which a measure is free from random errors. But, because of the ever-present chance of random errors, we can never achieve a completely error-free, 100% reliable measure. The risk of unreliability is always present to some extent.

Here are the basic methods for estimating the reliability of empirical measurements: 1) Test-Retest Method, 2) Equivalent Form Method, and 3) Internal Consistency Method.[3]

Test-Retest Method: The test-retest method repeats the measurement—repeats the survey—under similar conditions. The second test is typically conducted among the same respondents as the first test after a short period of time has elapsed. The goal of the test-retest method is to uncover random errors, which will be shown by different results in the two tests. If the results of the two tests are highly consistent, we can conclude that the measurements are stable and reliability is deemed high. Reliability is equal to the correlation of the two test scores taken among the same respondents at different times.
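
As a minimal illustration, test-retest reliability can be estimated as the Pearson correlation between the two administrations. The sketch below uses numpy and invented scores for ten respondents; the numbers are hypothetical and serve only to show the calculation.

```python
import numpy as np

# Hypothetical scores for the same ten respondents on two administrations
# of the same survey, taken a short time apart.
test_1 = np.array([12, 15, 9, 20, 14, 17, 11, 18, 13, 16], dtype=float)
test_2 = np.array([13, 14, 10, 19, 15, 16, 12, 18, 12, 17], dtype=float)

# Test-retest reliability is the correlation between the two sets of scores.
reliability = np.corrcoef(test_1, test_2)[0, 1]
print(f"test-retest reliability: {reliability:.2f}")
```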

There are some problems with the test-retest method. First, it may be difficult to get all the respondents to take the test—complete the survey or experiment—a second time. Second, the first and second tests may not be truly independent. The mere fact that the respondent participated in the first measurement might affect their responses on the second measurement. And, third, environmental or personal factors could cause the second measurement to change.

Equivalent Form Method: The equivalent form method is used to avoid the problems mentioned above with the test-retest method. The equivalent form method measures the ability of similar instruments to produce results that have a strong correlation. With this method, the researcher creates a large set of questions that address the same construct and then randomly divides the questions into two sets. Both instruments are given to the same sample of people. If there is a strong correlation between the instruments, we have high reliability.

The equivalent form method is also not without problems. First, it can be very difficult—some would say nearly impossible—to create two totally equivalent forms. Second, even when equivalency can be achieved, it may not be worth the investment of time, energy, and funds.

Internal Consistency and the Split-Half Method: These methods for establishing reliability rely on the internal consistency of an instrument: different samples of items drawn from the same instrument should produce similar results during the same time period. Internal consistency is concerned with equivalence. It addresses the question: Is there an equal amount of random error introduced by using two different sets of items to measure the same phenomenon?

The split-half method measures the reliability of an instrument by dividing the set of measurement items into two halves and then correlating the results. For example, if we are interested in the perceived practicality of electric cars and gasoline-powered cars, we could use a split-half method and ask the "same" question two different ways.

To be reliable, the answers to the two versions of the question should be consistent. The problem with this method is that different "splits" can result in different coefficients of reliability. To overcome this problem researchers use the Cronbach alpha (α) technique, which is named for educational psychologist Lee Cronbach. Cronbach alpha (α) calculates the average reliability for all possible ways of splitting a set of questions in half. A lack of correlation of an item with other items suggests low reliability and that this item does not belong in the scale. Cronbach's alpha technique requires that all items in the scale have equal intervals. If this condition cannot be met, other statistical analysis should be considered. Cronbach's alpha is also called the coefficient of reliability.
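
As a rough sketch of these two estimates (invented item scores, numpy only): the split-half estimate below correlates totals from the odd- and even-numbered items and then applies the Spearman-Brown correction, a standard adjustment for halving the test length that is not discussed above; Cronbach's alpha is computed from the usual item-variance formula, which under standard assumptions is equivalent to averaging over all possible splits.

```python
import numpy as np

def split_half_reliability(items: np.ndarray) -> float:
    """Correlate odd-item and even-item totals, then apply the
    Spearman-Brown correction to estimate full-length reliability."""
    odd = items[:, 0::2].sum(axis=1)
    even = items[:, 1::2].sum(axis=1)
    r = np.corrcoef(odd, even)[0, 1]
    return 2 * r / (1 + r)

def cronbach_alpha(items: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical responses: 8 respondents x 4 items intended to tap one construct.
scores = np.array([
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [4, 4, 4, 3],
    [1, 2, 1, 2],
    [5, 5, 4, 4],
    [3, 2, 3, 3],
], dtype=float)

print(f"split-half reliability: {split_half_reliability(scores):.2f}")
print(f"Cronbach's alpha: {cronbach_alpha(scores):.2f}")
```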

Validity:

Validity is defined as the ability of an instrument to measure what the researcher intends to measure. There are several different types of validity in social science research. Each takes a different approach to assessing the extent to which a measure actually measures what the researcher intends to measure. Each type of validity has different meaning, uses, and limitations.[4]

Face Validity: Face validity is the degree to which a measure is subjectively viewed as measuring what it purports to measure. It is based on the researcher's judgment or the collective judgment of a wide group of researchers. As such, it is considered the weakest form of validity. With face validity, a measure "looks like it measures what we hope to measure," but it has not been proven to do so.

Content Validity: Content validity is frequently considered equivalent to face validity. Content or logical validity is the extent to which experts agree that the measure covers all facets of the construct. To establish content validity all aspects or dimensions of a construct must be included. If we are constructing a test of arithmetic and we only focus on addition skills, we would clearly lack content validity as we have ignored subtraction, multiplication, and division. To establish content validity, we must review the literature on the construct to make certain that each dimension of the construct is being measured.

Criterion Validity: Criterion validity measures how well a measurement predicts an outcome based on information from other variables. It measures the match between the test and the criterion (the external outcome or established measure) it purports to predict. The SAT test, for instance, is said to have criterion validity because high scores on this test are correlated with students' freshman grade point averages.

There are two types of criterion validity: Predictive Validity and Concurrent Validity. Predictive Validity refers to the usefulness of a measure to predict future behavior or attitudes. Concurrent Validity refers to the extent to which another variable measured at the same time as the variable of interest can be predicted by the instrument. Concurrent validity is evidenced when a measure strongly correlates with a previously validated measure.
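
A brief numerical sketch of a criterion (predictive) validity check, in the spirit of the SAT example above: correlate test scores with the later criterion. The admission-test scores and freshman GPAs below are invented for illustration.

```python
import numpy as np

# Hypothetical admission-test scores and the same students' later freshman GPAs.
test_scores = np.array([1100, 1250, 980, 1400, 1180, 1320, 1050, 1500], dtype=float)
freshman_gpa = np.array([2.9, 3.2, 2.5, 3.7, 3.0, 3.4, 2.8, 3.8])

# Predictive validity: correlation between the test and the future criterion.
validity_coefficient = np.corrcoef(test_scores, freshman_gpa)[0, 1]
print(f"predictive validity coefficient: {validity_coefficient:.2f}")
```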

Construct Validity: Construct validity is the degree to which an instrument represents the construct it purports to represent. It involves understanding the theoretical foundations of the construct. A measure has construct validity when it conforms to the theory underlying the construct.

Construct validity has two components: Convergent Validity and Discriminant Validity. Convergent validity is the correlation among measures that claim to measure the same construct. Discriminant validity measures the lack of correlation among measures that do not measure the same construct. For there to be high levels of construct validity you need high levels of correlation among measures that cover the same construct, and low levels of correlation among measures that cover different constructs.
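
A minimal simulated sketch of checking convergent and discriminant validity with a correlation matrix (all constructs and scale names below are hypothetical): two scales built to measure the same construct should correlate highly with each other and only weakly with a scale for an unrelated construct.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Two hypothetical scales intended to measure the SAME construct (anxiety),
# plus one scale for a DIFFERENT construct (vocabulary).
anxiety = rng.normal(size=n)
anxiety_scale_a = anxiety + rng.normal(scale=0.4, size=n)
anxiety_scale_b = anxiety + rng.normal(scale=0.4, size=n)
vocabulary = rng.normal(size=n)

corr = np.corrcoef([anxiety_scale_a, anxiety_scale_b, vocabulary])
print(np.round(corr, 2))
# Convergent validity: scale A and scale B should correlate strongly.
# Discriminant validity: neither should correlate much with vocabulary.
```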

 

[1] Carmines, Edward G. and Richard A. Zeller, Reliability and Validity Assessment. Thousand Oaks, CA: Sage Publications Inc., 1979. pp. 14-15.

[2] Carmines, Edward G. and Richard A. Zeller, Reliability and Validity Assessment. Thousand Oaks, CA: Sage Publications Inc., 1979. pp. 13-14.

[3] Carmines, Edward G. and Richard A. Zeller, Reliability and Validity Assessment. Thousand Oaks, CA: Sage Publications Inc., 1979. pp. 37-51.

[4] Carmines, Edward G. and Richard A. Zeller, Reliability and Validity Assessment. Thousand Oaks, CA: Sage Publications Inc., 1979. p. 17. 

