Statistical Analysis of Data
By: Siddiq ullah
Slide 1
What is statistics?
From the Latin “status” (political state): information useful to the state (size of population, armed forces, etc.)
A branch of mathematics concerned with understanding and summarizing collections of numbers
A collection of numerical facts systematically arranged
Slide 2
Descriptive Statistics
Statistics which describe attributes of a sample or population.
Includes measures of central tendency (e.g., mean, median, mode), frequencies, percentages, minimum, maximum, range, variance, etc.
Organize and summarize a set of data
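These measures can be computed directly with Python's standard library; a minimal sketch (the data here are hypothetical):

```python
import statistics

data = [2, 5, 2, 1, 5, 8, 10]  # hypothetical sample

print("mean:", statistics.mean(data))
print("median:", statistics.median(data))
print("mode:", statistics.mode(data))
print("range:", max(data) - min(data))
print("variance:", statistics.variance(data))  # sample variance, divides by n - 1
```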
Slide 3
Inferential Statistics
Used to make inferences or judgments about a larger population based on the data collected from a small sample drawn from the population.
A key component of inferential statistics is the calculation of statistical significance of a research finding.
1. Involves
· Estimation
· Hypothesis Testing
2. Purpose
· Make Decisions About Population Characteristics
Slide 5
Key Terms
1. Population (Universe)
All Items of Interest
2. Sample
Portion of Population
3. Parameter
Summary Measure about Population
4. Statistic
Summary Measure about Sample
Slide 6
Key Terms
Parameter:
A characteristic of the population. Denoted with Greek letters such as μ or σ.
Statistic:
A characteristic of a sample. Denoted with Roman letters such as x̄ or s.
Sampling Error:
Describes the amount of error that exists between a sample statistic and the corresponding population parameter
Slide 8
Some Notations…
Population
All items under consideration by researcher
μ = population mean
σ = population standard deviation
N = population size
p = population percentage
Sample
A portion of the population selected for study
x̄ = sample mean
s = sample standard deviation
n = sample size
p̂ = sample percentage
Slide 9
Descriptive & Inferential Statistics (DS & IS)
DS gathers information about a population characteristic (e.g. income) and describes it with a parameter of interest (e.g. mean)
IS uses the parameter to test a hypothesis pertaining to that characteristic, e.g.
Ho: mean income = USD 4,000
H1: mean income < USD 4,000
The result of the hypothesis test is used to make an inference about the characteristic of interest (e.g. Americans → upper-middle income)
Slide 10
Examples of Descriptive and Inferential Statistics
Descriptive Statistics
· Graphical
- Arrange data in tables
- Bar graphs and pie charts
· Numerical
- Percentages
- Averages
- Range
· Relationships
- Correlation coefficient
- Regression analysis
Inferential Statistics
* Confidence interval
* Margin of error
* Compare means of two samples
- Pre/post scores
- t Test
* Compare means from three samples
- Pre/post and follow-up
- ANOVA = analysis of variance
Slide 11
Levels of Measurement
Another characteristic of data, which determines which statistical calculations are meaningful
Nominal: Qualitative data only; categories of names, labels, or qualities; can’t be ordered (e.g., best to worst). Ex: survey responses of Yes/No
Ordinal: Qualitative or quantitative; can be ordered, but subtractions are not meaningful. Ex: grades A, B, C, D, F
Interval: Quantitative only; subtractions are meaningful but ratios are not; zero is only a position (not “none”). Ex: temperatures
Ratio: Quantitative only; subtractions and ratios are meaningful; zero represents “none”. Ex: weights of babies
Slide 12
Measures of Central Tendency
“Say you were standing with one foot in the oven and one foot in an ice bucket. According to the average, you should be perfectly comfortable.”
The mode – applies to ratio, interval, ordinal or nominal scales.
The median – applies to ratio, interval and ordinal scales
The mean – applies to ratio and interval scales
Slide 13
Measuring Variability
Range: lowest to highest score
Average Deviation: average distance from the mean
Variance: average squared distance from the mean; used in later inferential statistics
Standard Deviation: square root of the variance; expressed on the same scale as the mean
Slide 14
Parametric statistics
Statistical analysis that attempts to explain the population parameter using a sample
Examples of statistical parameters: mean, variance, std. dev., R², t-value, F-ratio, rxy, etc.
It assumes that the distributions of the variables being assessed belong to known parameterized families of probability distributions
Slide 15
Frequencies and Distributions
Frequency – A frequency is the number of times a value is observed in a distribution or the number of times a particular event occurs.
Distribution – When the observed values are arranged in order, they are called a rank order distribution or an array. Distributions demonstrate how the frequencies of observations are distributed across a range of values.
The Mode
Defined as the most frequent value (the peak)
· Applies to ratio, interval, ordinal and nominal scales
· Sensitive to sampling error (noise)
· Distributions may be referred to as unimodal, bimodal or multimodal, depending upon the number of peaks
The Median
- Defined as the 50th percentile
- Applies to ratio, interval and ordinal scales
- Can be used for open-ended distributions
The Mean
Applies only to ratio or interval scales
Sensitive to outliers
How to find?
Mean – the average of a group of numbers.
2, 5, 2, 1, 5
Mean = 3
Mean is found by evening out the numbers:
2, 5, 2, 1, 5 → 3, 3, 3, 3, 3
mean = 3
How to Find the Mean of a Group of Numbers
Step 1 – Add all the numbers.
8, 10, 12, 18, 22, 26
8 + 10 + 12 + 18 + 22 + 26 = 96
Step 2 – Divide the sum by the number of addends.
How many addends are there? There are 6, so 96 ÷ 6 = 16.
The mean or average of these numbers is 16.
What is the mean of these numbers?
7, 10, 16 → mean = 11
26, 33, 41, 52 → mean = 38
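A quick check of these computations in Python:

```python
def mean(numbers):
    """Sum the values, then divide by the number of addends."""
    return sum(numbers) / len(numbers)

print(mean([8, 10, 12, 18, 22, 26]))  # 16.0
print(mean([7, 10, 16]))              # 11.0
print(mean([26, 33, 41, 52]))         # 38.0
```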
Median is in the Middle
Median – the middle number in a set of ordered numbers.
1, 3, 7, 10, 13
Median = 7
How to Find the Median in a Group of Numbers
Step 1 – Arrange the numbers in order from least to greatest.
21, 18, 24, 19, 27
18, 19, 21, 24, 27
Step 2 – Find the middle number. This is your median number.
18, 19, 21, 24, 27 → median = 21
Step 3 – If there are two middle numbers, find the mean of these two numbers.
18, 19, 21, 25, 27, 28
21 + 25 = 46; 46 ÷ 2 = 23, so the median = 23
When to use this measure?
With a non-normal distribution, the median is appropriate
What is the median of these numbers?
16, 10, 7
7, 10, 16
10
29, 8, 4, 11, 19
4, 8, 11, 19, 29
11
31, 7, 2, 12, 14, 19
2, 7, 12, 14, 19, 31
(12 + 14) ÷ 2 = 13
13
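The same steps in Python, including the even-count case from Step 3:

```python
def median(numbers):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(numbers)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([29, 8, 4, 11, 19]))      # 11
print(median([31, 7, 2, 12, 14, 19]))  # 13.0
```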
Mode is the most Popular
Mode – the number that appears most frequently in a set of numbers.
1, 1, 3, 7, 10, 13
Mode = 1
How to Find the Mode in a Group of Numbers
Step 1 – Arrange the numbers in order from least to greatest.
21, 18, 24, 19, 18
18, 18, 19, 21, 24
Step 2 – Find the number that is repeated the most: here, 18.
Which number is the mode?
1, 2, 2, 9, 9, 4, 9, 10
1, 2, 2, 4, 9, 9, 9, 10
9
When to use this measure?
If your data is nominal, you may use the mode and range
Using all three measures provides a more complete picture of the characteristics of your sample set.
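A small sketch for the mode using collections.Counter; it returns every value tied for the highest count, since a distribution may be bimodal or multimodal:

```python
from collections import Counter

def modes(numbers):
    """Return all values that occur most frequently."""
    counts = Counter(numbers)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

print(modes([1, 2, 2, 4, 9, 9, 9, 10]))  # [9]
print(modes([18, 18, 19, 21, 24]))       # [18]
```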
Measures of Variability (Dispersion)
Range – applies to ratio, interval, ordinal scales
Semi-interquartile range – applies to ratio, interval, ordinal scales
Variance (standard deviation) – applies to ratio, interval scales
Understanding the variation
The more the data is spread out, the larger the range, variance, SD and SE (Low precision and accuracy)
The more concentrated the data (precise or homogenous), the smaller the range, variance, and standard deviation (high precision and accuracy)
If all the observations are the same, the range, variance, and standard deviation = 0
None of these measures can be negative
Two distant means with little variation are more likely to be significantly different, and vice versa
Range
Interval between lowest and highest values
Generally unreliable – changing one value (highest or lowest) can cause large change in range.
Range is the distance Between
Range – the difference between the greatest and the least value in a set of numbers.
1, 1, 3, 7, 10, 13
Range = 12
What is the range?
22, 21, 27, 31, 21, 32
21, 21, 22, 27, 31, 32
32 – 21 = 11
How to Find the Range in a Group of Numbers
Step 1 – Arrange the numbers in order from least to greatest.
21, 18, 24, 19, 27
18, 19, 21, 24, 27
Step 2 – Find the lowest and highest numbers.
18 and 27
Step 3 – Find the difference between these 2 numbers.
27 – 18 = 9
The range is 9
Mid-range: Average of the smallest and largest observations
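Both measures in Python:

```python
def value_range(numbers):
    """Difference between the greatest and the least value."""
    return max(numbers) - min(numbers)

def midrange(numbers):
    """Average of the smallest and largest observations."""
    return (min(numbers) + max(numbers)) / 2

data = [21, 18, 24, 19, 27]
print(value_range(data))  # 9
print(midrange(data))     # 22.5
```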
Measure of relative position
Percentiles and Percentile Ranks
Percentile: The score at or below which a given % of scores lie.
Percentile Rank: The percentage of scores at or below a given score
Mid-hinge: The average of the first and third quartiles.
Quartiles:
Observations that divide data into four equal parts.
First Quartile (Q1) = 25th percentile; Second Quartile (Q2) = 50th percentile (the median); Third Quartile (Q3) = 75th percentile
Semi-Interquartile Range
The interquartile range is the interval between the first and third quartile, i.e. between the 25th and 75th percentile.
The semi-interquartile range is half the interquartile range.
Can be used with open-ended distributions
Unaffected by extreme scores
Example 1: the third quartile of a Biometry class of 36 students = (3/4) × 36 = the 27th item
Example 2: the 60th percentile of the class would be (60/100) × 36 = 21.6 ≈ the 22nd item (rounded off)
Inter-quartile range/deviation
(Mid-spread): Difference between the Third and the First Quartiles; it therefore considers the central half of the data and ignores the extreme values
Inter-quartile Range = Q3 - Q1
Quartile deviation = (Q3 - Q1)/2
Quartile Deviation
Measures the dispersion of the middle 50% of the distribution
-- rank the data
-- calculate upper and lower quartiles (UQ & LQ)
Number  Sample Values  Ranked Values
1       25             14   ← LL (lower limit)
2       27             16
3       20             16
4       23             18
5       26             19
6       24             20   ← LQ or Q1
7       19             20
8       16             21
9       25             23
10      18             24
11      30             24   ← Md or Q2
12      29             25
13      32             25
14      26             26
15      24             26
16      21             27   ← UQ or Q3
17      28             27
18      27             28
19      20             29
20      16             30
21      14             32   ← UL (upper limit)
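A sketch reproducing the quartile positions marked above (using the slide's simple 1-based rank positions; other quartile conventions give slightly different values):

```python
data = [25, 27, 20, 23, 26, 24, 19, 16, 25, 18, 30,
        29, 32, 26, 24, 21, 28, 27, 20, 16, 14]

ranked = sorted(data)  # n = 21

q1 = ranked[6 - 1]    # LQ at rank position 6  -> 20
md = ranked[11 - 1]   # Md at rank position 11 -> 24
q3 = ranked[16 - 1]   # UQ at rank position 16 -> 27

print("Q1 =", q1, "Md =", md, "Q3 =", q3)
print("Inter-quartile range =", q3 - q1)       # 7
print("Quartile deviation =", (q3 - q1) / 2)   # 3.5
```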
Variance
Variance is the average of the squared deviations
Closely related to the standard deviation
In order to eliminate negative signs, deviations are squared (giving squared units, e.g. m²)
v = s²
Variance (for a sample)
Steps:
Compute each deviation
Square each deviation
Sum all the squares
Divide by the data size (sample size) minus one: n-1
Example of Variance
Variance = 54/9 = 6 (here a sample of n = 10 values has squared deviations summing to 54)
It is a measure of “spread”.
Notice that the larger the deviations (positive or negative) the larger the variance
Population Variance and Standard Deviation
σ² = Σ(x − μ)² / N,  σ = √σ²
Sample Variance and Standard Deviation
s² = Σ(x − x̄)² / (n − 1),  s = √s²
The standard deviation
It is defined as the square root of the variance
Standard deviation (SD):
Positive square root of the variance
SD = +√( Σ(y − ȳ)² / n )
Variance and standard deviation are useful for probability and hypothesis testing, and are therefore widely used, unlike the mean deviation
Population parameters and sample statistics
If we are working with samples, dividing by n under-estimates the variance and SD (the estimate is biased)
Therefore, instead of n, n − 1 (degrees of freedom) is used for samples
Standard Deviation: example
Example: {4, 7, 6, 3, 8, 6, 7, 4, 5, 3}
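The worked computation was shown as a figure; a minimal sketch for this example, using the n − 1 (sample) formula above:

```python
data = [4, 7, 6, 3, 8, 6, 7, 4, 5, 3]

n = len(data)
mean = sum(data) / n                      # 5.3
ss = sum((x - mean) ** 2 for x in data)   # 28.1, sum of squared deviations
variance = ss / (n - 1)                   # ~3.12
sd = variance ** 0.5                      # ~1.77

print(round(variance, 2), round(sd, 2))
```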
Measure of relationship
correlation
Definitions
Correlation is a statistical technique that is used to measure a relationship between two variables.
Correlation requires two scores from each individual (one score from each of the two variables)
Correlation Coefficients
A correlation coefficient is a statistic that indicates the strength & direction of the relationship b/w 2 variables (or more) for 1 group of participants
Another definition – specifically for Spearman’s rho:
Spearman’s correlation coefficient is a standardized measure of the strength of relationship b/w 2 variables that does not rely on the assumptions of a parametric test (nonparametric data)
Uses Pearson’s correlation coefficient performed on data that have been converted into ranked scores
Distinguishing Characteristics of Correlation
Correlation procedures involve one sample containing all pairs of X and Y scores
Neither variable is called the independent or dependent variable
Use the individual pairs of scores to create a scatter plot
The scatter plot
Correlation and causality
The fact that there is a relationship between two variables does not mean that changes in one variable cause the changes in the other variable.
A statistical relationship can exist even though one variable does not cause or influence the other.
Correlation research cannot be used to infer causal relationships between two variables
In the following examples:
· example 1 - correlation coefficient = 1
· example 2 - correlation coefficient = -1
· example 3 - correlation coefficient = 0
The correlation coefficient for the parametric case is called the Pearson product moment correlation coefficient (r)
example 1
paired values
A: 3 6 9 12 15
B: 1 2 3 4 5
variable A (income of family, 1000 pounds)
variable B (# of cars owned)
Here is a perfect and positive correlation, as one variate increases in precisely the same proportion as the other variate increases
example 2
paired values
A: 3 6 9 12 15
B: 5 4 3 2 1
variable A (income of family, 100 pounds)
variable B (# of children)
Here is a perfect and negative correlation, as one variate decreases in precisely the same proportion as the other variate increases
example 3
paired values
A: 3 6 9 12 15
B: 4 1 3 5 2
variable A (income of family)
variable B (last number of postal code)
Here there is almost no correlation, because one variate does not systematically change with the other. Any association is caused by A and B being randomly distributed
Correlation coefficients provide a single numerical value to represent the relationship b/w the 2 variables
Correlation coefficients range from -1 to +1
-1.00 (negative one) a perfect, inverse relationship
+1.00 (positive one) a perfect, direct relationship
0.00 indicates no relationship
Graphic Representations of Correlation
The form of the relationship
In a linear relationship as the X scores increase the Y scores tend to change in one direction only and can be summarised by a straight line
In a non-linear or curvilinear relationship as the X scores change the Y scores do not tend to only increase or only decrease: the Y scores change their direction of change
2 Types of Correlation Coefficient Tests
1) Pearson r
Full name is “Pearson product-moment correlation coefficient”
r (lower case r & italicized) is the statistic (a fact/piece of data obtained from a study of a large quantity of numerical data) for this test
2) Spearman’s rho
Full name is “Spearman’s rank-order correlation coefficient”
rho (lower case rho & italicized) is the statistic for this test
Correlation Coefficients & Strength
Strength of relationship is one thing a correlation coefficient test can tell us
Rule of Thumb for strength size (generally)
A correlation coefficient (r or rho)
Value of 0.00 indicates “no relationship”
Values b/w .01 & .24 may be called “weak”
Values b/w .25 & .49 may be called “moderate”
Values b/w .50 & .74 may be called “moderately strong”
Values b/w .75 and .99 may be called “strong”
A value of 1.00 is called “perfect”
Describing strength of relationships with positive or negative values
What is true in the positive is true in the negative
Ex: values b/w .75 & .99 are “strong” & values b/w -.75 & -.99 are equally “strong”, though the latter is an inverse relationship
Correlation Coefficients & Scatterplots
Scatterplots used to visually show trend of data
Tells us
If relationship indicated
Kind of relationship
Outliers – cases differing from general trend
Graph may indicate direction, strength, and/or relationship of two variables
NOTE
It is ESSENTIAL to plot a scatter plot before conducting correlation analysis
If no relationship found in scatter plot,
No need to conduct correlation
When to Use Pearson r
Use Pearson r when:
Looking at relationship b/w 2 scale variables
Interval or ratio measurements
Data not highly skewed
Distribution of scores is approximately symmetrical
Relationship b/w variables is linear
When to Use Spearman’s rho
Use Spearman’s rho when:
One or both variables are ordinal
Ex: college degree, weight, or height given ranking order (i.e. 1 = lightest, 2 = middle, 3 = heaviest)
One or both sets of data are highly skewed
Distributions are not symmetrical
Relationship is not curvilinear
As determined in examination of scatter plot
Spearman Rank Order Correlation
This correlation coefficient is simply the Pearson r calculated on the rankings of the X and Y variables.
Because ranks of N objects are the integers from 1 to N, the sums and sums of squares are known (provided there are no ties).
Since we know the sum of the scores (Σi = N(N + 1)/2) and the sum of their squares (Σi² = N(N + 1)(2N + 1)/6), we automatically know the variance of the integers from 1 to N.
Suppose we compute it with N in the denominator instead of N − 1: the variance of the ranks is then (N² − 1)/12.
Different Scales, Different Measures of Association
Pearson’s r is used to describe the linear relationship between two variables that are both interval or ratio variables
The symbol for Pearson’s correlation coefficient is r
The underlying principle of r is that it compares how consistently each Y value is paired with each X value in a linear fashion
The Pearson Correlation formula
r = (degree to which X and Y vary together) / (degree to which X and Y vary separately)
  = (co-variability of X and Y) / (variability of X and Y separately)
r = [ΣXY − (ΣX)(ΣY)/N] / √[(ΣX² − (ΣX)²/N)(ΣY² − (ΣY)²/N)]
Degrees of freedom = N − 2
Sum of Product Deviations
We have used the sum of squares, or SS, to measure the amount of variation or variability for a single variable
The sum of products, or SP, provides a parallel procedure for measuring the amount of covariation or covariability between two variables
Definitional Formula
SS = Σ(X − x̄)(X − x̄)
or = Σ(Y − ȳ)(Y − ȳ)
Note:
SP = Σ(X − x̄)(Y − ȳ)
example
X  Y  XY
1  3   3
2  6  12
4  4  16
5  7  35
ΣX = 12, ΣY = 20, ΣXY = 66
Substituting:
SP = 66 - 12(20)/4
= 66 - 60
= 6
Calculation of Pearson’s Correlation Coefficient
Correlation Coefficient
Pearson’s correlation coefficient is a ratio comparing the covariability of X and Y (the numerator) with the variability of X and Y separately (the denominator)
SP measures the covariability of X and Y; the variability of X and Y is measured by calculating the SS for the X and Y scores separately
Pearson correlation coefficient
r = SP / √(SSx · SSy)
example
X    Y    X−x̄   Y−ȳ   (X−x̄)(Y−ȳ)   (X−x̄)²   (Y−ȳ)²
0    1    −6    −1       +6          36        1
10   3    +4    +1       +4          16        1
4    1    −2    −1       +2           4        1
8    2    +2     0        0           4        0
8    3    +2    +1       +2           4        1
SP = 6+4+2+0+2 = 14
SSx = 36+16+4+4+4 = 64
SSy = 1+1+1+0+1 = 4
r = SP / √(SSx · SSy)
r = 14 / √(64 × 4)
  = 14 ÷ 16
  = +0.875
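A short sketch verifying this example:

```python
def pearson_r(xs, ys):
    """r = SP / sqrt(SSx * SSy), via the definitional formulas."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return sp / (ssx * ssy) ** 0.5

print(pearson_r([0, 10, 4, 8, 8], [1, 3, 1, 2, 3]))  # 0.875
```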
Inferential statistics
Regression
Regression: the best-fit line of prediction.
Using a correlation (relationship between variables) to predict one variable from knowing the score on the other variable
Usually a linear regression (finding the best fitting straight line for the data)
Best illustrated in a scatter plot with the regression line also plotted
The scatter plot
In correlation data, it is sometimes useful to regard one variable as an independent variable and the other as a dependent variable.
In these circumstances, a linear relationship between two variables X and Y can be expressed by the equation Y = bX + a, where Y is the dependent variable, X the independent variable, and b and a are constants
In the general linear equation the value of b is called the slope
The slope determines how much the Y variable will change when X is increased by one point
The value of a in the general equation is called the Y-intercept (where the line cuts the Y-axis)
It determines the value of Y when X = 0
A regression is a statistical method for studying the relationship between a single dependent variable and one or more independent variables.
In its simplest form a regression specifies a linear relationship between the dependent and independent variables.
Yi = b0 + b1 X1i + b2 X2i + ei
for a given set of observations
In the social sciences, a regression is generally used to represent a causal process.
Y represents the dependent variable
b0 is the intercept (it represents the predicted value of Y if X1 and X2 equal zero)
X1 and X2 are the independent variables (also called predictors or regressors)
b1 and b2 are called the regression coefficients and provide a measure of the effect of the independent variables on Y (they measure the slope of the line)
e is the error term: the part of Y not explained by the causal model.
Why use regression?
Regression is used as a way of testing hypotheses about causal relationships.
Specifically, we have hypotheses about whether the independent variables have a positive or a negative effect on the dependent variable.
Just like in our hypothesis tests about variable means, we also would like to be able to judge how confident we are in our inferences.
Standard Error of Estimate
A regression equation, by itself, allows you to make predictions, but it does not provide any information about the accuracy of the predictions
The standard error of estimate gives a measure of the standard distance between a regression line and the actual data points
To calculate the standard error of estimate:
Find a sum of squared deviations (SS)
Each deviation will measure the distance between the actual Y value (data) and the predicted Ŷ value (regression line)
This sum of squares is commonly called SSerror
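A minimal sketch tying the two ideas together: fit the least-squares line Y = bX + a (slope b = SP/SSx, intercept a = ȳ − b·x̄), then measure SSerror around it. The data are hypothetical:

```python
def fit_line(xs, ys):
    """Least-squares slope b = SP/SSx and intercept a = ybar - b*xbar."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    b = sp / ssx
    return b, my - b * mx

xs, ys = [1, 2, 4, 5], [3, 6, 4, 7]  # hypothetical data
b, a = fit_line(xs, ys)
print(f"Y = {b:.2f}X + {a:.2f}")  # Y = 0.60X + 3.20

# SSerror: squared distances between actual Y and predicted Y-hat values
ss_error = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))
see = (ss_error / (len(xs) - 2)) ** 0.5  # standard error of estimate
print(round(ss_error, 2), round(see, 3))  # 6.4 1.789
```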
Definition of Standard Error
The standard deviation of the sampling distribution is the standard error. For the mean, it indicates the average distance of the statistic from the parameter.
Example of Height: raw data vs. the sampling distribution of the mean
Formula: Standard Error of the Mean
To compute the SEM, use: SEM = s / √n
Standard Error (SE)
It has become popular recently
Researchers often misunderstand and misuse SE
The variability of individual observations is described by the SD, while the variability of two or more sample means is described by the SE
Hence SE is often called the “standard error of the mean”, while SD describes a set of observations or a population
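A tiny sketch of the SEM computation (hypothetical sample):

```python
def sem(sample):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    n = len(sample)
    mean = sum(sample) / n
    sd = (sum((x - mean) ** 2 for x in sample) / (n - 1)) ** 0.5
    return sd / n ** 0.5

print(round(sem([4, 7, 6, 3, 8, 6, 7, 4, 5, 3]), 3))  # ~0.559
```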
Covariance
When two variables covary in opposite directions, as smoking and lung capacity do, values tend to be on opposite sides of the group mean. That is, when smoking is above its group mean, lung capacity tends to be below its group mean.
Consequently, by averaging the product of deviation scores, we can obtain a measure of how the variables vary together.
The Sample Covariance
Instead of averaging by dividing by N, we divide by N − 1. The resulting formula is:
cov(X, Y) = Σ(X − x̄)(Y − ȳ) / (N − 1)
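A minimal sketch with hypothetical data in the spirit of the smoking/lung-capacity example (the two variables move in opposite directions, so the covariance comes out negative):

```python
def sample_cov(xs, ys):
    """Average the products of deviation scores, dividing by N - 1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

cigarettes = [0, 5, 10, 15, 20]    # hypothetical
capacity = [45, 42, 33, 31, 29]    # hypothetical
print(sample_cov(cigarettes, capacity))  # -53.75
```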
What is Analysis of Variance?
ANOVA is an inferential test designed for use with 3 or more data sets
t-tests are just a form of ANOVA for 2 groups
ANOVA is only interested in establishing the existence of statistical differences, not their direction.
Based upon an F value (R. A. Fisher) which reflects the ratio between systematic and random/error variance…
Procedure for computing 1-way ANOVA for independent samples
Step 1: Complete the table, i.e.
- square each raw score
- total the raw scores for each group
- total the squared scores for each group
Step 2: Calculate the Grand Total correction factor
GT = (ΣX)² / N
Step 3: Compute total Sum of Squares
SStotal = ΣX² − GT = (ΣXA² + ΣXB² + ΣXC²) − GT
Step 4: Compute between groups Sum of Squares
SSbet = Σ(Tg² / ng) − GT = TA²/nA + TB²/nB + TC²/nC − GT (where Tg is the total of group g)
Step 5: Compute within groups Sum of Squares
SSwit = SStotal − SSbet
Step 6: Determine the d.f. for each sum of squares
dftotal = (N − 1)
dfbet = (k − 1)
dfwit = (N − k)
Step 7/8: Estimate the Variances & Compute F
MSbet = SSbet / dfbet, MSwit = SSwit / dfwit
F = MSbet / MSwit
Step 9: Consult F distribution table
-d1 is your df for the numerator (i.e. systematic variance)
-d2 is your df for the denominator (i.e. error variance)
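The whole procedure as a sketch for three hypothetical independent groups:

```python
groups = {  # hypothetical raw scores
    "A": [4, 5, 6, 5],
    "B": [7, 8, 6, 7],
    "C": [9, 8, 10, 9],
}

scores = [x for g in groups.values() for x in g]
N, k = len(scores), len(groups)

gt = sum(scores) ** 2 / N                                          # Step 2
ss_total = sum(x ** 2 for x in scores) - gt                        # Step 3
ss_bet = sum(sum(g) ** 2 / len(g) for g in groups.values()) - gt   # Step 4
ss_wit = ss_total - ss_bet                                         # Step 5

df_bet, df_wit = k - 1, N - k                                      # Step 6
F = (ss_bet / df_bet) / (ss_wit / df_wit)                          # Steps 7/8
print(f"F({df_bet}, {df_wit}) = {F:.2f}")  # compare with the F table (Step 9)
```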
Statistical Decision Process
Type I error – rejecting a true null hypothesis. (treatment has an effect when in fact the treatment has no effect)
Alpha level for a hypothesis test is the probability that the test will lead to a Type I error
Alpha and Probability Values
The level of significance that is selected prior to data collection for accepting or rejecting a null hypothesis is called alpha. The level of significance actually obtained after the data have been collected and analyzed is called the probability value, and is indicated by the symbol p.
Inferential Statistics
Level of significance. The second determinant of statistical power is the p value at which the null hypothesis is to be rejected. Statistical power can be increased by lowering the level of significance needed to reject the null hypothesis.
Error Types
Example - Efficacy Test for New drug
Type I error - Concluding that the new drug is better than the standard (HA) when in fact it is no better (H0). Ineffective drug is deemed better.
Type II error - Failing to conclude that the new drug is better (HA) when in fact it is. Effective drug is deemed to be no better.
Non- parametric statistics
Non-parametric methods
So far, we have assumed that our samples were drawn from normally distributed populations.
Techniques that do not make that assumption are called distribution-free or nonparametric tests.
In situations where the normal assumption is appropriate, nonparametric tests are less efficient than traditional parametric methods.
Nonparametric tests frequently make use only of the order of the observations and not the actual values.
Usually do not state hypotheses in terms of a specific parameter
They make very few assumptions about the population distribution - distribution-free tests.
Suited for data measured in ordinal and nominal scales
Not as sensitive as parametric tests; more likely to fail in detecting a real difference between two treatments
Statistical analysis that attempts to explain the population parameter using a sample, without making assumptions about the frequency distribution of the assessed variable
In other words, the variable being assessed is distribution-free
Types of nonparametric tests
Chi-square statistic tests for Goodness of Fit (how well the obtained sample proportions fit the population proportions specified by the null hypothesis)
Test for independence – tests whether or not there is a relationship between two variables
Non-Parametric Methods
Spearman Rho Rank Order Correlation Coefficient
To calculate the Spearman rho:
Rank the observations on each variable from lowest to highest.
Tied observations are assigned the average of the ranks.
The differences between the ranks on the X and Y variables are squared and summed:
rrho = 1 − [(6ΣD²) / (n(n² − 1))]
Is there a relationship between the number of cigarettes smoked and severity of illness?
The null and alternative hypotheses are:
HO: There is no relationship between the number of cigarettes smoked and severity of illness
HA: There is a relationship between the number of cigarettes smoked and severity of illness
α = .05
rrho = 1 − [(6ΣD²) / (n(n² − 1))]
     = 1 − [6(24) / 8(64 − 1)]
     = 1 − (144 / 504)
     = .71
tcalc = 2.49
tcrit = 2.447, df = 6, p = .05
Since the calculated t is > the critical value of t, we reject the null hypothesis and conclude that there is a statistically significant positive relationship between the number of cigarettes smoked and severity of illness
Use: A non-parametric procedure that we can use to assess the relationship between variables is the Spearman rho.
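A sketch of the full calculation. The raw scores behind the slide's ΣD² = 24 are not shown, so the data below are hypothetical values chosen to reproduce ΣD² = 24 and rho ≈ .71 for n = 8:

```python
def tied_ranks(values):
    """1-based ranks, lowest to highest; tied values share the average rank."""
    ordered = sorted(values)
    return [(2 * ordered.index(v) + 1 + ordered.count(v)) / 2 for v in values]

def spearman_rho(xs, ys):
    """rho = 1 - (6 * sum(D^2)) / (n * (n^2 - 1)), D = difference in paired ranks."""
    rx, ry = tied_ranks(xs), tied_ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    n = len(xs)
    return 1 - (6 * d2) / (n * (n ** 2 - 1))

# hypothetical data for 8 smokers: cigarettes/day and illness-severity score
cigs = [0, 2, 5, 8, 12, 15, 20, 25]
severity = [2, 5, 1, 4, 3, 7, 8, 6]
print(round(spearman_rho(cigs, severity), 2))  # 0.71 (sum of D^2 = 24)
```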
Goodness of Fit
The chi-square test is a “goodness of fit” test
it answers the question of how well do experimental data fit expectations.
Example
As an example, you count F2 offspring, and get 290 purple and 110 white flowers. This is a total of 400 (290 + 110) offspring.
We expect a 3/4 : 1/4 ratio. We need to calculate the expected numbers (you MUST use the numbers of offspring, NOT the proportions!); this is done by multiplying the total offspring by the expected proportions. Thus we expect 400 × 3/4 = 300 purple, and 400 × 1/4 = 100 white.
Thus, for purple, obs = 290 and exp = 300. For white, obs = 110 and exp = 100.
Chi square formula
Now it's just a matter of plugging into the formula:
χ² = (290 − 300)² / 300 + (110 − 100)² / 100
   = (−10)² / 300 + (10)² / 100
   = 100 / 300 + 100 / 100
   = 0.333 + 1.000
   = 1.333
This is our chi-square value
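The same goodness-of-fit computation as a sketch:

```python
def chi_square_gof(observed, expected):
    """Goodness of fit: sum of (obs - exp)^2 / exp over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

total = 290 + 110                          # 400 F2 offspring
expected = [total * 3 / 4, total * 1 / 4]  # 300 purple, 100 white
print(round(chi_square_gof([290, 110], expected), 3))  # 1.333
```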
Steps in Test of Hypothesis
State H0: μ = 120
State H1: μ ≠ 120
Choose α: α = 0.05
Choose n: n = 100
Choose Test: Z, t, or χ² (or p value)
Compute the test statistic (or compute the p value)
Search for the critical value
Make the statistical decision
Express the decision
Determine the appropriate test
Establish the level of significance: α
Formulate the statistical hypothesis
Calculate the test statistic
Determine the degree of freedom
Compare computed test statistic against a tabled/critical value
1. Determine Appropriate Test
Chi Square is used when both variables are measured on a nominal scale.
It can be applied to interval or ratio data that have been categorized into a small number of groups.
It assumes that the observations are randomly sampled from the population.
All observations are independent (an individual can appear only once in a table and there are no overlapping categories).
It does not make any assumptions about the shape of the distribution nor about the homogeneity of variances.
2. Establish Level of Significance
α is a predetermined value
The convention
α = .05
α = .01
α = .001
3. Determine The Hypothesis: Whether There is an Association or Not
Ho : The two variables are independent
Ha : The two variables are associated
4. Calculating Test Statistics
χ² = Σ (O − E)² / E, where the expected count E = (row total × column total) / grand total
5. Determine Degrees of Freedom
df = (R-1)(C-1)
6. Compare computed test statistic against a tabled/critical value
The computed value of the Pearson chi- square statistic is compared with the critical value to determine if the computed value is improbable
The critical tabled values are based on sampling distributions of the Pearson chi-square statistic
If the calculated χ² is greater than the χ² table value, reject Ho
Example
Suppose a researcher is interested in voting preferences on gun control issues.
A questionnaire was developed and sent to a random sample of 90 voters.
The researcher also collects information about the political party membership of the sample of 90 respondents.
Bivariate Frequency Table or Contingency Table
1. Determine Appropriate Test
Party Membership ( 2 levels) and Nominal
Voting Preference ( 3 levels) and Nominal
2. Establish Level of Significance
Alpha of .05
3. Determine The Hypothesis
Ho : There is no difference between D & R in their opinion on gun control issue.
Ha : There is an association between responses to the gun control survey and the party membership in the population.
4. Calculating Test Statistics
5. Determine Degrees of Freedom
df = (R-1)(C-1) = (2-1)(3-1) = 2
Critical Chi-Square
Critical values for chi-square are found on tables, sorted by degrees of freedom and probability levels. Be sure to use p = 0.05.
If your calculated chi-square value is greater than the critical value from the table, you “reject the null hypothesis”.
If your chi-square value is less than the critical value, you “fail to reject” the null hypothesis (that is, you accept that your genetic theory about the expected ratio is correct).
Chi-Square Table
6. Compare computed test statistic against a tabled/critical value
α = 0.05
df = 2
Critical tabled value = 5.991
Test statistic, 11.03, exceeds critical value
Null hypothesis is rejected
Democrats & Republicans differ significantly in their opinions on gun control issues
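The slide's observed contingency table was shown as a figure and is not reproduced here; the counts below are a hypothetical reconstruction chosen so that the Pearson χ² comes out to the reported 11.03 with df = 2 and n = 90:

```python
def chi_square_independence(table):
    """Pearson chi-square for an R x C table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            chi2 += (obs - expected) ** 2 / expected
    return chi2

# rows: Democrat, Republican; columns: favor, neutral, oppose (hypothetical)
observed = [[10, 10, 30],
            [15, 15, 10]]
print(round(chi_square_independence(observed), 2))  # 11.03 > 5.991 -> reject Ho
```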