For the statistical consultant working with social science researchers, estimating
reliability and validity is a frequently encountered task. Measurement issues in the social
sciences are distinctive in that they concern the quantification of abstract, intangible,
and unobservable constructs. In many instances, then, the meaning of quantities is only inferred.
Let us begin with a general description of the paradigm that we are dealing with. Most concepts in the behavioral sciences have
meaning within the context of the theory that they are a part of. Each concept thus has an operational
definition that is governed by the overarching theory. If a concept is involved in testing
hypotheses that support the theory, it has to be measured. So the first decision the researcher
faces is “how shall the concept be measured?”, that is, what type of measure to use. At a very broad level, the type of measure can
be observational, self-report, interview, etc.
These types ultimately take a more specific form: observation of ongoing activity or of
videotaped events; self-report measures such as open-ended or closed-ended questionnaires
and Likert-type scales; and interviews that are structured, semi-structured, or unstructured,
with open-ended or closed-ended questions. Needless to say, each type of measure raises
specific issues that need to be addressed to make the measurement meaningful, accurate, and efficient.
Another important feature is the population for which the measure is intended. This decision
depends less on the theoretical paradigm than on the immediate research question at hand.
A third point that
needs mentioning is the purpose of the
scale or measure. What is it that
the researcher wants to do with the measure?
Is it developed for a specific study or is it developed with the
anticipation of extensive use with similar populations?
Once some of these decisions are made and a measure is developed, which is a careful and tedious
process, the relevant questions to raise are “how do we know that we are indeed
measuring what we want to measure?”, since the construct that we are measuring
is abstract, and “can we be sure that if we repeated the measurement we would
get the same result?” The first question is related to validity and the second to reliability.
Validity and reliability are two important characteristics of behavioral measures and are
referred to as psychometric properties.
It is important to bear in mind that validity and reliability are not all-or-none
properties but matters of degree.
Validity:
Very simply, validity is the extent to which a test measures what it is supposed to
measure. The question of validity is raised in the context of the three points made above:
the form of the test, the purpose of the test, and the population for whom it is intended.
Therefore, we cannot ask the general question “Is this a valid test?” The question to
ask is “how valid is this test for the decision that I need to make?” or “how
valid is the interpretation I propose for the test?” We can divide the types of validity into
logical and empirical.
Content Validity:
When we want to find out whether the entire content of the behavior/construct/area is represented in
the test, we compare the test tasks with the content of the behavior. This is a logical method, not an empirical
one. For example, if we want to test knowledge of American geography, it is not fair to limit
most questions to the geography of New England.
Face Validity:
Face validity refers to the degree to which a test appears to measure what it
purports to measure.
Criterion-Oriented or Predictive Validity:
When future performance is to be predicted from the scores currently obtained on the
measure, correlate those scores with the later performance. The later performance is called the criterion
and the current score is the predictor. This is an empirical check on the value of the test:
a criterion-oriented or predictive validation.
Concurrent Validity:
Concurrent validity
is the degree to which the scores on a test are related to the scores on
another, already established, test administered at the same time, or to some
other valid criterion available at the same time. For example, when a new, simple test is to be used in place
of an old, cumbersome one that is considered useful, measurements are obtained
on both at the same time. Logically, predictive and concurrent validation are the same;
the term concurrent validation is used to indicate that no time elapsed between measures.
Construct Validity:
Construct validity
is the degree to which a test measures an intended hypothetical construct. Psychologists often assess or measure
abstract attributes or constructs. The
process of validating the interpretations about that construct as indicated by the
test score is construct validation. This
can be done experimentally. For example, suppose we want to validate a measure of
anxiety and hypothesize that anxiety increases when subjects are under the threat of
an electric shock; then the threat of an electric shock should increase scores on the
measure (note: not all construct validation is this dramatic!).
A correlation
coefficient is a statistical summary of the relation between two
variables. It is the most common way of
reporting the answer to such questions as the following: Does this test predict performance on the
job? Do these two tests measure the same
thing? Do the ranks of these people
today agree with their ranks a year ago?
Two forms to know are the rank correlation and the product-moment correlation.
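As a minimal sketch of the two coefficients just named, the following computes a product-moment (Pearson) correlation and a rank (Spearman) correlation using only the Python standard library. The function names and the score lists are hypothetical and chosen for illustration; the rank correlation is computed here as the product-moment correlation of the ranks, with tied scores sharing their average rank.

```python
from math import sqrt

def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank scores from 1 (lowest); tied scores share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with the score at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Rank correlation: the product-moment correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: test scores and later job-performance ratings.
test_scores = [52, 61, 47, 70, 65, 58]
job_ratings = [3.1, 3.8, 2.9, 4.0, 4.4, 3.0]

print(round(pearson(test_scores, job_ratings), 2))
print(round(spearman(test_scores, job_ratings), 2))
```

The rank version answers a question such as "do the ranks agree?" and is less sensitive to extreme scores; the product-moment version uses the score values themselves.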
According to Cronbach,
to the question “what is a good validity coefficient?” the only sensible answer
is “the best you can get”, and it is unusual for a validity coefficient to rise
above 0.60, though that is far from perfect prediction.
All in all, we always need
to keep in mind the contextual questions: What is the test going to be used for? How
expensive is it in terms of time, energy, and money? What implications do we
intend to draw from test scores?
Reliability:
Research requires
dependable measurement. (Nunnally)
Measurements are reliable to the extent that they are repeatable; any
random influence that tends to make measurements differ from occasion to
occasion or circumstance to circumstance is a source of measurement error. (Gay) Reliability is the degree to which a
test consistently measures whatever it measures. Errors of measurement that affect reliability
are random errors, whereas errors of measurement that affect validity are systematic
or constant errors.
Test-retest,
equivalent forms and split-half reliability are all determined through
correlation.
Test-retest Reliability:
Test-retest
reliability is the degree to which scores are consistent over time. It indicates score variation that occurs from
testing session to testing session as a result of errors of measurement. Problems:
memory, maturation, and learning between sessions.
Equivalent-Forms or Alternate-Forms Reliability:
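Equivalent forms are two tests that are
identical in every way except for the actual items included. They are used when it is likely that test takers will
recall responses made during the first session and when alternate forms are available. Correlate the two sets of scores.
The obtained coefficient is called the coefficient of equivalence when both forms are given in the same
session, or the coefficient of stability and equivalence when they are given at different times
(the coefficient of stability proper is the test-retest correlation). Problem:
the difficulty of constructing two forms that are essentially equivalent.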
Both of the above
require two administrations.
Split-Half Reliability:
Split-half reliability requires only one
administration and is especially appropriate
when the test is very long. The most
commonly used method of splitting the test in two is the odd-even strategy. Since longer tests tend to be more reliable,
and since split-half reliability represents the reliability of a test only half
as long as the actual test, a correction formula, the Spearman-Brown prophecy formula,
must be applied to the coefficient.
Split-half
reliability is a form of internal consistency reliability.
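The odd-even split and the correction can be sketched as follows. This is an illustration under stated assumptions rather than a definitive procedure: the response matrix is invented, the function names are hypothetical, and the correction uses the usual general form of the Spearman-Brown formula, r_new = n * r / (1 + (n - 1) * r), which the text names but does not write out. With n = 2 it projects the half-test correlation to the full-length test.

```python
from math import sqrt

def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_brown(r, factor=2.0):
    """Projected reliability of a test `factor` times as long as the one that gave r."""
    return factor * r / (1 + (factor - 1) * r)

def split_half_reliability(item_scores):
    """item_scores: one row per test taker, one column per item.

    Returns (half-test correlation, Spearman-Brown corrected reliability).
    """
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson(odd, even)
    return r_half, spearman_brown(r_half)

# Hypothetical responses: 6 test takers, 6 items scored 1 (right) or 0 (wrong).
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
]

r_half, r_full = split_half_reliability(responses)
print(round(r_half, 3), round(r_full, 3))
```

For any positive half-test correlation below 1, the corrected value is larger than the uncorrected one, which reflects the point that the full-length test is expected to be more reliable than either half.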
Rationale Equivalence Reliability:
Rationale
equivalence reliability is not established through correlation but rather
estimates internal consistency by determining how all items on a test relate to
all other items and to the total test.
Internal Consistency Reliability:
Internal consistency reliability is determined by how
all items on the test relate to all other items. The Kuder-Richardson formula gives an estimate of
reliability that is essentially equivalent to the average of the split-half
reliabilities computed for all possible halves.
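As a hedged illustration, the sketch below computes Kuder-Richardson formula 20 (KR-20) for items scored 0 or 1. Choosing formula 20 is an assumption, since the text names only "Kuder-Richardson"; the function name and data are hypothetical. The formula used is KR-20 = (k / (k - 1)) * (1 - sum(p * q) / variance of total scores), where k is the number of items, p is the proportion passing an item, and q = 1 - p. The total-score variance is taken with n in the denominator so that it matches the p * q item variances.

```python
def kr20(item_scores):
    """KR-20 for dichotomous items.

    item_scores: one row per test taker, one column per item (values 0 or 1).
    """
    n = len(item_scores)      # number of test takers
    k = len(item_scores[0])   # number of items

    # Variance of total scores (population form, dividing by n).
    totals = [sum(row) for row in item_scores]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n

    # Sum of item variances p * q.
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_scores) / n
        sum_pq += p * (1 - p)

    return (k / (k - 1)) * (1 - sum_pq / var_total)

# Hypothetical responses: 6 test takers, 6 items scored 1 (right) or 0 (wrong).
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
]

print(round(kr20(responses), 3))
```

Unlike the split-half approach, this needs no choice of how to divide the test, since it draws on every item at once.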
Standard Error of Measurement:
Reliability can
also be expressed in terms of the standard error of measurement. It is an estimate of how often you can expect
errors of a given size.
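A small sketch of the calculation, assuming the conventional formula SEM = SD * sqrt(1 - reliability), which the text does not state. The function names and the numbers are hypothetical. The band around an observed score assumes normally distributed errors, under which roughly two thirds of measurements are expected to fall within one SEM of the true score.

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """SEM from the standard deviation of test scores and a reliability coefficient."""
    return sd * sqrt(1 - reliability)

def score_band(observed, sd, reliability, z=1.0):
    """Band of z SEMs on either side of an observed score."""
    sem = standard_error_of_measurement(sd, reliability)
    return observed - z * sem, observed + z * sem

# Hypothetical test: standard deviation 15, reliability 0.91.
sem = standard_error_of_measurement(15, 0.91)
low, high = score_band(100, 15, 0.91)
print(round(sem, 2), round(low, 2), round(high, 2))
```

A perfectly reliable test (reliability of 1) gives an SEM of zero, and the SEM grows toward the full standard deviation as reliability falls toward zero.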
REFERENCES
Berk, R., 1979. Generalizability of Behavioral Observations: A Clarification of
Interobserver Agreement and Interobserver Reliability. American
Journal of Mental Deficiency, Vol. 83, No. 5, p. 460-472.
Cronbach, L., 1990. Essentials of Psychological Testing. Harper & Row, New York.
Carmines, E., and Zeller, R., 1979. Reliability and Validity Assessment.
Sage Publications, Beverly Hills, California.