Which two types of validity in an experiment can observer bias threaten?

In quantitative research, you have to consider the reliability and validity of your methods and measurements.

Table of contents

Construct validity
What is a construct?
What is construct validity?
Content validity
Face validity
Criterion validity
What is a criterion variable?
What is criterion validity?
Ambiguous temporal precedence
Confounding
Selection bias
Repeated testing (also referred to as testing effects)
Instrument change (instrumentality)
Regression toward the mean
Mortality/differential attrition
Selection-maturation interaction
Compensatory rivalry/resentful demoralization
Experimenter bias
Mutual-internal-validity problem

Validity tells you how accurately a method measures something. If a method measures what it claims to measure, and the results closely correspond to real-world values, then it can be considered valid. There are four main types of validity:

Construct validity: Does the test measure the concept that it’s intended to measure?
Content validity: Is the test fully representative of what it aims to measure?
Face validity: Does the content of the test appear to be suitable to its aims?
Criterion validity: Do the results accurately measure the concrete outcome they are designed to measure?

Note that this article deals with types of test validity, which determine the accuracy of the actual components of a measure. If you are doing experimental research, you also need to consider internal and external validity, which deal with the experimental design and the generalizability of results.

Construct validity

Construct validity evaluates whether a measurement tool really represents the thing we are interested in measuring. It’s central to establishing the overall validity of a method.

What is a construct?

A construct refers to a concept or characteristic that can’t be directly observed, but can be measured by observing other indicators that are associated with it.

Constructs can be characteristics of individuals, such as intelligence, obesity, job satisfaction, or depression; they can also be broader concepts applied to organizations or social groups, such as gender equality, corporate social responsibility, or freedom of speech.

There is no objective, observable entity called “depression” that we can measure directly. But based on existing psychological research and theory, we can measure depression based on a collection of symptoms and indicators, such as low self-confidence and low energy levels.

What is construct validity?

Construct validity is about ensuring that the method of measurement matches the construct you want to measure. If you develop a questionnaire to diagnose depression, you need to know: does the questionnaire really measure the construct of depression? Or is it actually measuring the respondent’s mood, self-esteem, or some other construct?

To achieve construct validity, you have to ensure that your indicators and measurements are carefully developed based on relevant existing knowledge. The questionnaire must include only relevant questions that measure known indicators of depression.

The other types of validity described below can all be considered as forms of evidence for construct validity.

Content validity

Content validity assesses whether a test is representative of all aspects of the construct.

To produce valid results, the content of a test, survey or measurement method must cover all relevant parts of the subject it aims to measure. If some aspects are missing from the measurement (or if irrelevant aspects are included), the validity is threatened.

A mathematics teacher develops an end-of-semester algebra test for her class. The test should cover every form of algebra that was taught in the class. If some types of algebra are left out, then the results may not be an accurate indication of students’ understanding of the subject. Similarly, if she includes questions that are not related to algebra, the results are no longer a valid measure of algebra knowledge.

Face validity

Face validity considers how suitable the content of a test seems to be on the surface. It’s similar to content validity, but face validity is a more informal and subjective assessment.

You create a survey to measure the regularity of people’s dietary habits. You review the survey items, which ask questions about every meal of the day and snacks eaten in between for every day of the week. On its surface, the survey seems like a good representation of what you want to test, so you consider it to have high face validity.

As face validity is a subjective measure, it’s often considered the weakest form of validity. However, it can be useful in the initial stages of developing a method.

Criterion validity

Criterion validity evaluates how well a test can predict a concrete outcome, or how well the results of your test approximate the results of another test.

What is a criterion variable?

A criterion variable is an established and effective measurement that is widely considered valid, sometimes referred to as a “gold standard” measurement. Criterion variables can be very difficult to find.

What is criterion validity?

To evaluate criterion validity, you calculate the correlation between the results of your measurement and the results of the criterion measurement. If there is a high correlation, this gives a good indication that your test is measuring what it intends to measure.

A university professor creates a new test to measure applicants’ English writing ability. To assess how well the test really does measure students’ writing ability, she finds an existing test that is considered a valid measurement of English writing ability, and compares the results when the same group of students take both tests. If the outcomes are very similar, the new test has high criterion validity.
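The comparison in this example can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the scores are invented, the helper name `pearson_r` is hypothetical, and the Pearson coefficient is assumed as the correlation measure, although other coefficients may suit other data better.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for the same eight students on both tests.
new_test = [62, 70, 75, 81, 58, 90, 67, 85]
criterion_test = [60, 72, 78, 80, 55, 93, 65, 88]

r = pearson_r(new_test, criterion_test)
print(f"Correlation between new test and criterion test: r = {r:.2f}")
```

A value of r close to 1 would be read as evidence of high criterion validity. How high is high enough depends on the field and on the quality of the criterion measure.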

Internal validity is the extent to which a piece of evidence supports a claim about cause and effect, within the context of a particular study. It is one of the most important properties of scientific studies and is an important concept in reasoning about evidence more generally. Internal validity is determined by how well a study can rule out alternative explanations for its findings (usually, sources of systematic error or 'bias'). It contrasts with external validity, the extent to which results can justify conclusions about other contexts (that is, the extent to which results can be generalized).

Inferences are said to possess internal validity if a causal relationship between two variables is properly demonstrated.[1][2] A valid causal inference may be made when three criteria are satisfied:

the "cause" precedes the "effect" in time (temporal precedence),
the "cause" and the "effect" tend to occur together (covariation), and
there are no plausible alternative explanations for the observed covariation (nonspuriousness).[2]

In scientific experimental settings, researchers often change the state of one variable (the independent variable) to see what effect it has on a second variable (the dependent variable).[3] For example, a researcher might manipulate the dosage of a particular drug between different groups of people to see what effect it has on health. In this example, the researcher wants to make a causal inference, namely, that different doses of the drug may be held responsible for observed changes or differences. When the researcher may confidently attribute the observed changes or differences in the dependent variable to the independent variable (that is, when the researcher observes an association between these variables and can rule out other explanations or rival hypotheses), then the causal inference is said to be internally valid.[4]

In many cases, however, the size of effects found in the dependent variable may not just depend on

variations in the independent variable,
the power of the instruments and statistical procedures used to measure and detect the effects, and
the choice of statistical methods (see: Statistical conclusion validity).

Rather, a number of variables or circumstances uncontrolled for (or uncontrollable) may lead to additional or alternative explanations (a) for the effects found and/or (b) for the magnitude of the effects found. Internal validity, therefore, is more a matter of degree than of either-or, and that is exactly why research designs other than true experiments may also yield results with a high degree of internal validity.

In order to allow for inferences with a high degree of internal validity, precautions may be taken during the design of the study. As a rule of thumb, conclusions based on direct manipulation of the independent variable allow for greater internal validity than conclusions based on an association observed without manipulation.

When considering only internal validity, highly controlled true experimental designs (i.e., designs with random selection, random assignment to either the control or experimental group, reliable instruments, reliable manipulation processes, and safeguards against confounding factors) may be the "gold standard" of scientific research. However, the very methods used to increase internal validity may also limit the generalizability, or external validity, of the findings. For example, studying the behavior of animals in a zoo may make it easier to draw valid causal inferences within that context, but these inferences may not generalize to the behavior of animals in the wild. In general, a typical laboratory experiment studying a particular process may leave out many variables that strongly affect that process in nature.
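Random assignment, one of the safeguards listed above, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `randomly_assign`, the participant IDs, and the even two-group split are all hypothetical choices, not part of any particular study protocol.

```python
import random


def randomly_assign(participants, seed=None):
    """Shuffle participants and split them into two groups of (near-)equal size."""
    rng = random.Random(seed)      # a fixed seed makes the allocation reproducible
    shuffled = list(participants)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"experimental": shuffled[:half], "control": shuffled[half:]}


# Hypothetical participant IDs P01 to P20.
participant_ids = [f"P{i:02d}" for i in range(1, 21)]
groups = randomly_assign(participant_ids, seed=42)

for name, members in groups.items():
    print(name, sorted(members))
```

Because chance alone decides who lands in which group, pre-existing characteristics of the participants are expected to balance out across groups on average, which is what lets group differences be attributed to the manipulation.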

Eight common threats to internal validity can be recalled with the mnemonic acronym THIS MESS,[5] which stands for:

Testing,
History,
Instrument change,
Statistical regression toward the mean,
Maturation,
Experimental mortality,
Selection, and
Selection Interaction.

Ambiguous temporal precedence

When it is not known which variable changed first, it can be difficult to determine which variable is the cause and which is the effect.

Confounding

A major threat to the validity of causal inferences is confounding: changes in the dependent variable may instead be attributable to variations in a third variable that is related to the manipulated variable. Where spurious relationships cannot be ruled out, rival hypotheses to the original causal inference may be developed.
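A small simulation can make the idea concrete. The sketch below assumes a binary third variable `z` that raises both `x` and `y`, while `x` has no effect on `y` at all. All names, effect sizes, and sample sizes are invented for illustration.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def simulate_confounding(n=10_000, seed=1):
    """Return the overall x-y correlation and the correlation within each level of z."""
    rng = random.Random(seed)
    z = [rng.randint(0, 1) for _ in range(n)]    # the confounder
    x = [2 * zi + rng.gauss(0, 1) for zi in z]   # x depends on z only
    y = [2 * zi + rng.gauss(0, 1) for zi in z]   # y depends on z only, not on x
    overall = pearson_r(x, y)
    within = {}
    for level in (0, 1):
        idx = [i for i in range(n) if z[i] == level]
        within[level] = pearson_r([x[i] for i in idx], [y[i] for i in idx])
    return overall, within


overall, within = simulate_confounding()
print(f"overall r = {overall:.2f}")
print(f"r within z=0: {within[0]:.2f}, r within z=1: {within[1]:.2f}")
```

In this setup the overall correlation between `x` and `y` is clearly positive, yet it largely vanishes once the comparison is made within each level of `z`. That is the pattern a spurious relationship produces.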

Selection bias

Selection bias refers to the problem that, at pre-test, differences between groups exist that may interact with the independent variable and thus be 'responsible' for the observed outcome. Researchers and participants bring to the experiment a myriad of characteristics, some learned and others inherent: for example, sex, weight, hair, eye, and skin color, personality, mental capabilities, and physical abilities, but also attitudes like motivation or willingness to participate.

During the selection step of the research study, if the groups contain unequal numbers of test subjects with similar subject-related variables, there is a threat to internal validity. For example, a researcher creates two test groups, the experimental group and the control group. The groups should differ only with regard to the independent variable; if the subjects within one group also resemble each other on one or more subject-related variables, those variables offer a rival explanation for any difference between the groups.

Self-selection also has a negative effect on the interpretive power of the dependent variable. This occurs often in online surveys where individuals of specific demographics opt into the test at higher rates than other demographics.

History

Events outside of the study/experiment or between repeated measures of the dependent variable may affect participants' responses to experimental procedures. Often, these are large-scale events (natural disaster, political change, etc.) that affect participants' attitudes and behaviors such that it becomes impossible to determine whether any change on the dependent measures is due to the independent variable, or the historical event.

Maturation

Subjects change during the course of the experiment or even between measurements. For example, young children might mature, and their ability to concentrate may change as they grow up. Both permanent changes, such as physical growth, and temporary ones, such as fatigue, provide "natural" alternative explanations; they may change the way a subject would react to the independent variable. So upon completion of the study, the researcher may not be able to determine whether a discrepancy is due to the passage of time or to the independent variable.

Repeated testing (also referred to as testing effects)

Repeatedly measuring the participants may lead to bias. Participants may remember the correct answers or may be conditioned to know that they are being tested. Repeatedly taking the same or similar intelligence tests usually leads to score gains, but rather than showing that the underlying skills have changed for good, these gains may simply reflect practice; this threat to internal validity thus provides a good rival hypothesis.

Instrument change (instrumentality)

The instrument used during the testing process can change the experiment. This also refers to observers being more concentrated or primed, or having unconsciously changed the criteria they use to make judgments. This can also be an issue with self-report measures given at different times. In this case, the impact may be mitigated through the use of retrospective pretesting. If any instrumentation changes occur, the internal validity of the main conclusion is affected, as alternative explanations are readily available.

Regression toward the mean

This type of error occurs when subjects are selected on the basis of extreme scores (scores far from the mean) on a test. For example, when children with the worst reading scores are selected to participate in a reading course, improvements at the end of the course might be due to regression toward the mean and not the course's effectiveness. If the children had been tested again before the course started, they would likely have obtained better scores anyway. Likewise, extreme scores are more likely to appear in a single round of testing and tend to move toward a more typical distribution with repeated testing.
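The effect can be reproduced in a simulation with no intervention at all. The sketch below assumes each observed score is a stable true skill plus independent measurement noise. The function name, the score scale (mean 100, standard deviation 10 for both skill and noise), and the 10% cutoff are illustrative choices.

```python
import random


def simulate_regression_to_mean(n=10_000, bottom_fraction=0.1, seed=0):
    """Select the lowest scorers on test 1 and compare their means on test 1 and test 2."""
    rng = random.Random(seed)
    true_skill = [rng.gauss(100, 10) for _ in range(n)]
    # Two testing occasions: same true skill, fresh random noise each time.
    test1 = [t + rng.gauss(0, 10) for t in true_skill]
    test2 = [t + rng.gauss(0, 10) for t in true_skill]
    k = int(n * bottom_fraction)
    # Indices of the k participants with the lowest test 1 scores.
    lowest = sorted(range(n), key=lambda i: test1[i])[:k]
    mean1 = sum(test1[i] for i in lowest) / k
    mean2 = sum(test2[i] for i in lowest) / k
    return mean1, mean2


first_mean, retest_mean = simulate_regression_to_mean()
print(f"selected group, test 1 mean: {first_mean:.1f}")
print(f"selected group, test 2 mean: {retest_mean:.1f}")
```

The selected group scores higher on the retest even though nothing happened between the two tests. Part of what made their first scores extreme was bad luck, and that luck does not repeat.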

Mortality/differential attrition

This error occurs if inferences are made on the basis of only those participants who took part from start to finish. Participants may have dropped out of the study before completion, possibly because of the study, programme, or experiment itself. For example, suppose the percentage of group members who had quit smoking at post-test was much higher in a group that received a quit-smoking training program than in the control group, but only 60% of the experimental group completed the program. If this attrition is systematically related to any feature of the study, the administration of the independent variable, or the instrumentation, or if dropping out leads to relevant bias between groups, a whole class of alternative explanations for the observed differences becomes possible.
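A short calculation shows how much the choice of denominator can matter in the smoking example. Only the 60% completion rate comes from the text. The group sizes and quit counts below are invented, and counting every dropout as "did not quit" is a deliberately conservative assumption for illustration, not the only way to handle missing outcomes.

```python
def quit_rates(enrolled, completed, quit_among_completers):
    """Return the quit rate among completers and among everyone enrolled.

    The second rate treats every dropout as not having quit, which is a
    conservative assumption made for this illustration.
    """
    completers_only = quit_among_completers / completed
    all_enrolled = quit_among_completers / enrolled
    return completers_only, all_enrolled


# Hypothetical counts: 100 people per group; 60% of the program group finish.
program_completers, program_all = quit_rates(
    enrolled=100, completed=60, quit_among_completers=30
)
control_completers, control_all = quit_rates(
    enrolled=100, completed=100, quit_among_completers=20
)

print(f"program: {program_completers:.0%} of completers, {program_all:.0%} of enrolled")
print(f"control: {control_completers:.0%} of completers, {control_all:.0%} of enrolled")
```

With these made-up numbers, the program's apparent advantage over the control group shrinks considerably once the dropouts are counted. This is why an analysis restricted to completers can overstate an effect.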

Selection-maturation interaction

This occurs when subject-related variables (hair color, skin color, etc.) and time-related variables (age, physical size, etc.) interact. If a discrepancy between the two groups emerges between test sessions, it may be due to age differences between the groups rather than to the independent variable.

Diffusion

If treatment effects spread from treatment groups to control groups, a lack of differences between experimental and control groups may be observed. This does not mean, however, that the independent variable has no effect or that there is no relationship between the dependent and independent variables.

Compensatory rivalry/resentful demoralization

Behavior in the control groups may change as a result of the study. For example, control group members may work extra hard to see that the expected superiority of the experimental group is not demonstrated. Again, this does not mean that the independent variable produced no effect or that there is no relationship between the dependent and independent variables. Conversely, a difference in the dependent variable may arise only because a demoralized control group works less hard or is less motivated, not because of the independent variable.

Experimenter bias

Experimenter bias occurs when the individuals conducting an experiment inadvertently affect the outcome by unconsciously behaving differently toward members of the control and experimental groups. The possibility of experimenter bias can be eliminated through the use of double-blind study designs, in which the experimenter is not aware of the condition to which a participant belongs.

Mutual-internal-validity problem

Experiments that have high internal validity can produce phenomena and results that have no relevance in real life, resulting in the mutual-internal-validity problem.[6][7] It arises when researchers use experimental results to develop theories and then use those theories to design theory-testing experiments. This mutual feedback between experiments and theories can lead to theories that explain only phenomena and results in artificial laboratory settings but not in real life.

[1] Brewer, M. (2000). Research Design and Issues of Validity. In Reis, H. and Judd, C. (eds.) Handbook of Research Methods in Social and Personality Psychology. Cambridge: Cambridge University Press.
[2] Shadish, W., Cook, T., and Campbell, D. (2002). Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston: Houghton Mifflin.
[3] Levine, G. and Parkinson, S. (1994). Experimental Methods in Psychology. Hillsdale, NJ: Lawrence Erlbaum.
[4] Liebert, R. M. and Liebert, L. L. (1995). Science and Behavior: An Introduction to Methods of Psychological Research. Englewood Cliffs, NJ: Prentice Hall.
[5] Wortman, P. M. (1983). "Evaluation research – A methodological perspective". Annual Review of Psychology. 34: 223–260. doi:10.1146/annurev.ps.34.020183.001255.
[6] Schram, Arthur (2005-06-01). "Artificiality: The tension between internal and external validity in economic experiments". Journal of Economic Methodology. 12 (2): 225–237. doi:10.1080/13501780500086081. ISSN 1350-178X.
[7] Lin, Hause; Werner, Kaitlyn M.; Inzlicht, Michael (2021-02-16). "Promises and Perils of Experimentation: The Mutual-Internal-Validity Problem". Perspectives on Psychological Science: 1745691620974773. doi:10.1177/1745691620974773. ISSN 1745-6916.