Statistical Analysis of Data

By: Siddiq ullah

**Slide 1**

**What is statistics?**

From the Latin “status” (political state): information useful to the state (size of population, armed forces, etc.)

A branch of mathematics concerned with understanding and summarizing collections of numbers

A collection of numerical facts systematically arranged

**Slide 2**

**Descriptive Statistics**

Statistics which describe attributes of a sample or population.

Includes measures of central tendency (e.g., mean, median, mode), frequencies, and percentages, as well as the minimum, maximum, range, and variance of a data set.

Organize and summarize a set of data

**Slide 3**

**Inferential Statistics**

Used to make inferences or judgments about a larger population based on the data collected from a small sample drawn from the population.

A key component of inferential statistics is the calculation of statistical significance of a research finding.

1. Involves

· Estimation

· Hypothesis Testing

2. Purpose

· Make Decisions About Population Characteristics

**Slide 5**

**Key Terms**

1. Population (Universe)

All Items of Interest

2. Sample

Portion of Population

3. Parameter

Summary Measure about Population

4. Statistic

Summary Measure about Sample

**Slide 6**

**Key Terms**

**Parameter:**

A characteristic of the population. Denoted with Greek letters such as μ or σ.

**Statistic:**

A characteristic of a sample. Denoted with English letters such as X or S.

**Sampling Error:**

Describes the amount of error that exists between a sample statistic and corresponding population parameter

**Slide 7**

**Slide 8**

Some Notations…

Population

All items under consideration by researcher

μ = population mean

σ = population standard deviation

N = population size

π = population percentage

Sample

A portion of the population selected for study

x̄ = sample mean

s = sample standard deviation

n = sample size

p = sample percentage

**Slide 9**

**Descriptive & Inferential Statistics (DS & IS)**

DS gathers information about a population characteristic (e.g. income) and describes it with a parameter of interest (e.g. mean)

IS uses the parameter to test a hypothesis pertaining to that characteristic, e.g.

Ho: mean income = USD 4,000

H1: mean income < USD 4,000

The result of the hypothesis test is used to make an inference about the characteristic of interest (e.g. Americans → upper middle income)

**Slide 10**

**Examples of Descriptive and Inferential Statistics**

**Descriptive Statistics**

· Graphical

-Arrange data in tables

-Bar graphs and pie charts

· Numerical

-Percentages

-Averages

-Range

· Relationships

-Correlation coefficient

-Regression analysis

**Inferential Statistics**

· Confidence interval

· Margin of error

· Compare means of two samples

-Pre/post scores

-t Test

· Compare means from three samples

-Pre/post and follow-up

-ANOVA = analysis of variance

**Slide 11**

**Levels of Measurement**

Another characteristic of data, which determines which statistical calculations are meaningful

**Nominal:** Qualitative data only; categories of names, labels, or qualities; can’t be ordered (i.e., best to worst). Ex: survey responses of Yes/No

**Ordinal:** Qualitative/quantitative; can be ordered, but no meaningful subtractions. Ex: grades A, B, C, D, F

**Interval:** Quantitative only; meaningful subtractions but not ratios; zero is only a position (not “none”). Ex: temperatures

**Ratio:** Quantitative only; meaningful subtractions and ratios; zero represents “none”. Ex: weights of babies

**Slide 12**

**Measures of Central Tendency**

“Say you were standing with one foot in the oven and one foot in an ice bucket. According to the average, you should be perfectly comfortable.”

The mode – applies to ratio, interval, ordinal or nominal scales.

The median – applies to ratio, interval and ordinal scales

The mean – applies to ratio and interval scales

**Slide 13**

Measuring Variability

__Range:__ lowest to highest score

__Average Deviation:__ average distance from the mean

__Variance:__ average squared distance from the mean. Used in later inferential statistics

__Standard Deviation:__ square root of variance; expressed on the same scale as the mean

**Slide 13**

**Parametric statistics**

Statistical analysis that attempts to explain the population parameter using a sample

E.g. of statistical parameters: mean, variance, std. dev., R², t-value, F-ratio, r_xy, etc.

It assumes that the distributions of the variables being assessed belong to known parameterized families of probability distributions

**Slide 14**

**Frequencies and Distributions**

Frequency: A frequency is the number of times a value is observed in a distribution or the number of times a particular event occurs.

Distribution: When the observed values are arranged in order they are called a rank order distribution or an array. Distributions demonstrate how the frequencies of observations are distributed across a range of values.

**The Mode**

Defined as the most frequent value (the peak)

· Applies to ratio, interval, ordinal and nominal scales

· Sensitive to sampling error (noise)

· Distributions may be referred to as unimodal, bimodal or multimodal, depending upon the number of peaks

**The Median**

- Defined as the 50th percentile
- Applies to ratio, interval and ordinal scales
- Can be used for open-ended distributions

**The Mean**

Applies only to ratio or interval scales

Sensitive to outliers

How to find?

Mean – the average of a group of numbers.

2, 5, 2, 1, 5

Mean = 3

The mean is found by evening out the numbers: 2 + 5 + 2 + 1 + 5 = 15, and 15 shared equally among 5 values gives 3 each.

**How to Find the Mean of a Group of Numbers**

Step 1 – Add all the numbers.

8, 10, 12, 18, 22, 26

8+10+12+18+22+26 = 96

Step 2 – Divide the sum by the number of addends.

How many addends are there? 6

96 ÷ 6 = 16

The mean or average of these numbers is 16.

What is the mean of these numbers?

7, 10, 16

Mean = 11

26, 33, 41, 52

Mean = 38
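The two steps above (add the numbers, then divide by how many there are) can be sketched in a few lines of Python. The function name `mean` is only an illustrative choice; the inputs are the number sets used on the slides.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by the number of addends."""
    return sum(values) / len(values)


# Number sets taken from the slides above
print(mean([8, 10, 12, 18, 22, 26]))  # 96 / 6
print(mean([7, 10, 16]))
print(mean([26, 33, 41, 52]))
```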

Median is in the Middle

Median – the middle number in a set of ordered numbers.

1, 3, 7, 10, 13

Median = 7

How to Find the Median in a Group of Numbers

Step 1 – Arrange the numbers in order from least to greatest.

21, 18, 24, 19, 27

18, 19, 21, 24, 27

Step 2 – Find the middle number.

18, 19, 21, 24, 27

The middle number, 21, is your median.

Step 3 – If there are two middle numbers, find the mean of these two numbers.

18, 19, 21, 25, 27, 28

21 + 25 = 46

46 ÷ 2 = 23

The median is 23.

When to use this measure?

With a non-normal distribution, the median is appropriate

What is the median of these numbers?

16, 10, 7

7, 10, 16

Median = 10

29, 8, 4, 11, 19

4, 8, 11, 19, 29

Median = 11

31, 7, 2, 12, 14, 19

2, 7, 12, 14, 19, 31

12 + 14 = 26

26 ÷ 2 = 13

Median = 13
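The median procedure (sort, take the middle value, and average the two middle values when the count is even) might be written as follows. This is a minimal sketch; the function name is illustrative, and Python's standard `statistics.median` does the same job.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if the count is even."""
    ordered = sorted(values)          # Step 1: arrange from least to greatest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # Step 2: odd count, one middle number
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # Step 3: two middle numbers


print(median([21, 18, 24, 19, 27]))
print(median([18, 19, 21, 25, 27, 28]))
print(median([31, 7, 2, 12, 14, 19]))
```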

Mode is the most Popular

Mode – the number that appears most frequently in a set of numbers.

1, 1, 3, 7, 10, 13

Mode = 1

How to Find the Mode in a Group of Numbers

Step 1 – Arrange the numbers in order from least to greatest.

21, 18, 24, 19, 18

18, 18, 19, 21, 24

Step 2 – Find the number that is repeated the most.

18, 18, 19, 21, 24

The mode is 18.

Which number is the mode?

1, 2, 2, 9, 9, 4, 9, 10

1, 2, 2, 4, 9, 9, 9, 10

Mode = 9
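Counting how often each value occurs gives the mode directly. The sketch below returns a list, so that a bimodal or multimodal data set reports every peak; that design choice, and the name `modes`, are assumptions made for illustration.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([21, 18, 24, 19, 18]))
print(modes([1, 2, 2, 9, 9, 4, 9, 10]))
print(modes([1, 1, 2, 2, 3]))  # two peaks: a bimodal set
```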

**When to use this measure?**

If your data are nominal, the mode is the appropriate measure (the range requires at least ordinal data)

Using all three measures provides a more complete picture of the characteristics of your sample set.

**Measures of Variability (Dispersion)**

Range – applies to ratio, interval, ordinal scales

Semi-interquartile range – applies to ratio, interval, ordinal scales

Variance (standard deviation) – applies to ratio, interval scales

Understanding the variation

The more the data are spread out, the larger the range, variance, SD and SE (low precision)

The more concentrated the data (precise or homogeneous), the smaller the range, variance, and standard deviation (high precision)

If all the observations are the same, the range, variance, and standard deviation = 0

None of these measures can be negative

Two distant means with little variations are more likely to be significantly different and vice versa

**Range**

Interval between lowest and highest values

Generally unreliable – changing one value (highest or lowest) can cause large change in range.

Range is the distance Between

Range – the difference between the greatest and the least value in a set of numbers.

1, 1, 3, 7, 10, 13

Range = 12

**What is the range?**

22, 21, 27, 31, 21, 32

21, 21, 22, 27, 31, 32

32 – 21 = 11

**How to Find the Range in a Group of Numbers**

Step 1 – Arrange the numbers in order from least to greatest.

21, 18, 24, 19, 27

18, 19, 21, 24, 27

Step 2 – Find the lowest and highest numbers.

18, 19, 21, 24, 27

The lowest is 18 and the highest is 27.

Step 3 – Find the difference between these 2 numbers.

27 – 18 = 9

The range is 9
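As a small sketch of the same steps, the range is simply the largest value minus the smallest, so no explicit sorting is needed in code. The function name `value_range` is illustrative.

```python
def value_range(values):
    """Difference between the greatest and the least value."""
    return max(values) - min(values)


print(value_range([21, 18, 24, 19, 27]))
print(value_range([22, 21, 27, 31, 21, 32]))
```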

Mid-range: Average of the smallest and largest observations

Measure of relative position

Percentiles and Percentile Ranks

Percentile: The score at or below which a given % of scores lie.

Percentile Rank: The percentage of scores at or below a given score

Mid-hinge: The average of the first and third quartiles.

**Quartiles:**

Observations that divide data into four equal parts.

**First Quartile (Q1)**

Semi-Interquartile Range

The interquartile range is the interval between the first and third quartile, i.e. between the 25th and 75th percentile.

The semi-inter quartile range is half the interquartile range.

Can be used with open-ended distributions

Unaffected by extreme scores

Example 1: the third quartile for a Biometry class of 36 students is the ¾ × 36 = 27th item

Example 2: the 60th percentile of the class is the 60/100 × 36 = 21.6th item, i.e. the 22nd item (rounded)

Inter-quartile range/deviation (Mid-spread): Difference between the third and the first quartiles; it therefore considers the data of the central half and ignores the extreme values

Inter-quartile Range = Q3 - Q1

Quartile deviation = (Q3 - Q1)/2

**Quartile Deviation**

Measures the dispersion of the middle 50% of the distribution

-- rank the data

-- calculate upper and lower quartiles (UQ & LQ)

| Number | Sample values | Ranked values | Position |
|---|---|---|---|
| 1 | 25 | 14 | LL |
| 2 | 27 | 16 | |
| 3 | 20 | 16 | |
| 4 | 23 | 18 | |
| 5 | 26 | 19 | |
| 6 | 24 | 20 | LQ or Q1 |
| 7 | 19 | 20 | |
| 8 | 16 | 21 | |
| 9 | 25 | 23 | |
| 10 | 18 | 24 | |
| 11 | 30 | 24 | Md or Q2 |
| 12 | 29 | 25 | |
| 13 | 32 | 25 | |
| 14 | 26 | 26 | |
| 15 | 24 | 26 | |
| 16 | 21 | 27 | UQ or Q3 |
| 17 | 28 | 27 | |
| 18 | 27 | 28 | |
| 19 | 20 | 29 | |
| 20 | 16 | 30 | |
| 21 | 14 | 32 | UL |
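The quartiles marked in the table (ranked positions 6, 11 and 16) can be checked with Python's standard library. This is a sketch under one assumption: `statistics.quantiles` is called with `method="inclusive"`, which happens to land on those same positions for 21 values. Other quartile conventions can give slightly different results.

```python
import statistics

# The 21 sample values from the table above
sample = [25, 27, 20, 23, 26, 24, 19, 16, 25, 18, 30,
          29, 32, 26, 24, 21, 28, 27, 20, 16, 14]

# n=4 splits the data into four parts and returns the three cut points
q1, q2, q3 = statistics.quantiles(sample, n=4, method="inclusive")

iqr = q3 - q1                 # inter-quartile range = Q3 - Q1
quartile_deviation = iqr / 2  # semi-interquartile range = (Q3 - Q1) / 2

print(q1, q2, q3)
print(iqr, quartile_deviation)
```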

**Variance**

Variance is the average of the squared deviations

Closely related to the standard deviation

In order to eliminate the negative signs, deviations are squared (squared units, e.g. m²)

v = s²

Variance (for a sample)

Steps:

Compute each deviation

Square each deviation

Sum all the squares

Divide by the data size (sample size) minus one: n-1

Example of Variance

Variance = 54/9 = 6

It is a measure of “spread”.

Notice that the larger the deviations (positive or negative) the larger the variance

Population Variance and Standard Deviation

Sample Variance and Standard Deviation

The standard deviation

It is defined as the square root of the variance

Standard deviation (SD):

Positive square root of the variance

$$SD = +\sqrt{\frac{\sum (y - \bar{y})^2}{n}}$$

Variance and standard deviation are useful for probability and hypothesis testing and are therefore widely used, unlike the mean deviation

Population parameters and sample statistics

If we are working with a sample, dividing by n under-estimates the population variance and SD, so the estimate is biased

Therefore, n − 1 (the degrees of freedom) is used instead of n for a sample, e.g.

Standard Deviation

example

Example: {4, 7, 6, 3, 8, 6, 7, 4, 5, 3}
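The variance steps listed earlier (compute each deviation, square it, sum the squares, divide by n − 1) can be applied to this example set as a rough sketch. The slide's own worked result is not preserved in the text, so the numbers printed here are simply what the steps produce; function names are illustrative.

```python
import math

data = [4, 7, 6, 3, 8, 6, 7, 4, 5, 3]


def sum_of_squares(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)


def sample_variance(values):
    """Divide by n - 1 (degrees of freedom) because the data are a sample."""
    return sum_of_squares(values) / (len(values) - 1)


def population_variance(values):
    """Divide by n when the data are the whole population."""
    return sum_of_squares(values) / len(values)


print(sample_variance(data))             # about 3.12
print(math.sqrt(sample_variance(data)))  # sample SD, about 1.77
print(population_variance(data))         # about 2.81
```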

Measure of relationship

correlation

Definitions

Correlation is a statistical technique that is used to measure a relationship between two variables.

Correlation requires two scores from each individual (one score from each of the two variables)

Correlation Coefficients

A correlation coefficient is a statistic that indicates the strength & direction of the relationship b/w 2 variables (or more) for 1 group of participants

Another definition – specifically for Spearman’s rho:

Spearman’s correlation coefficient is a standardized measure of the strength of relationship b/w 2 variables that does not rely on the assumptions of a parametric test (nonparametric data)

Uses Pearson’s correlation coefficient performed on data that have been converted into ranked scores

Distinguishing Characteristics of Correlation

Correlation procedures involve one sample containing all pairs of X and Y scores

Neither variable is called the independent or dependent variable

Use the individual pairs of scores to create a scatter plot

The scatter plot

Correlation and causality

The fact that there is a relationship between two variables does not mean that changes in one variable cause the changes in the other variable.

A statistical relationship can exist even though one variable does not cause or influence the other.

Correlation research cannot be used to infer causal relationships between two variables

In the following examples:

· example 1 - correlation coefficient = 1

· example 2 - correlation coefficient = -1

· example 3 - correlation coefficient = 0

· the correlation coefficient for the parametric case is called the Pearson product moment correlation coefficient (r)

example 1

paired values

A 3 6 9 12 15

B 1 2 3 4 5

· variable A (income of family) (1000 pounds)

· variable B (# of cars owned)

· here is a perfect and positive correlation as one variate increases in precisely the same proportion as the other variate increases

example 2

paired values

A 3 6 9 12 15

B 5 4 3 2 1

· variable A (income of family) (100 pounds)

· variable B (# of children)

· here is a perfect and negative correlation as one variate decreases in precisely the same proportion as the other variate increases

example 3

paired values

A 3 6 9 12 15

B 4 1 3 5 2

· variable A (income of family)

· variable B (last number of postal code)

· here there is almost no correlation because one variate does not systematically change with the other. Any association is caused by A and B being randomly distributed
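The three paired-value examples can be checked numerically. The sketch below computes Pearson's r from deviation scores; the function name `pearson_r` is an illustrative choice, and the standard deviation-score formula is used (sum of products divided by the square root of the product of the two sums of squares).

```python
import math


def pearson_r(xs, ys):
    """Pearson r: sum of products over the root of the two sums of squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp / math.sqrt(ss_x * ss_y)


a = [3, 6, 9, 12, 15]
r1 = pearson_r(a, [1, 2, 3, 4, 5])  # example 1: perfect positive
r2 = pearson_r(a, [5, 4, 3, 2, 1])  # example 2: perfect negative
r3 = pearson_r(a, [4, 1, 3, 5, 2])  # example 3: no linear relationship
print(r1, r2, r3)
```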

Correlation coefficients provide a single numerical value to represent the relationship b/w the 2 variables

Correlation coefficients range from -1 to +1

-1.00 (negative one) a perfect, inverse relationship

+1.00 (positive one) a perfect, direct relationship

0.00 indicates no relationship

Graphic Representations of Correlation

The form of the relationship

In a linear relationship as the X scores increase the Y scores tend to change in one direction only and can be summarised by a straight line

In a non-linear or curvilinear relationship as the X scores change the Y scores do not tend to only increase or only decrease: the Y scores change their direction of change

Computing a correlation

Alternative Formula for the Correlation Coefficient

Computing a Correlation

Non-linearity

2 Types of Correlation Coefficient Tests

1)Pearson r

Full name is “Pearson product-moment correlation coefficient”

r (lower case r & italicized) is the statistic (fact/piece of data obtained from a study of a large quantity of num. data) for this test

2)Spearman’s rho

Full name is “Spearman’s rank-order correlation coefficient”

rho (lower case rho & italicized) is the statistic for this test

Correlation Coefficients & Strength

Strength of relationship is one thing a correlation coefficient test can tell us

Rule of Thumb for strength size (generally)

A correlation coefficient (r or rho)

Value of 0.00 indicates “no relationship”

Values b/w .01 & .24 may be called “weak”

Values b/w .25 & .49 may be called “moderate”

Values b/w .50 & .74 may be called “moderately strong”

Values b/w .75 and .99 may be called “strong”

A value of 1.00 is called “perfect”

Describing strength of relationships with positive or negative values

What is true in the positive is true in the negative

Ex: values b/w .75 & .99 are “strong” & values b/w -.75 & -.99 are also “strong”, though the relationship is inverse

Correlation Coefficients &Scatterplots

Scatterplots used to visually show trend of data

Tells us

If relationship indicated

Kind of relationship

Outliers – cases differing from general trend

Graph may indicate direction, strength, and/or relationship of two variables

NOTE

It is ESSENTIAL to plot a scatter plot before conducting correlation analysis

If no relationship is found in the scatter plot, there is no need to conduct a correlation

When to Use Pearson r

Use Pearson r when:

Looking at relationship b/w 2 scale variables

Interval or ratio measurements

Data not highly skewed

Distribution of scores is approximately symmetrical

Relationship b/w variables is linear

When to Use Spearman’s rho

Use Spearman’s rho when:

One or both variables are ordinal

Ex: college degree, weight, or height given ranking order (i.e. 1 = lightest, 2 = middle, 3 = heaviest)

One or both sets of data are highly skewed

Distributions are not symmetrical

Relationship is not curvilinear

As determined in examination of scatter plot

Spearman Rank Order Correlation

This correlation coefficient is simply the Pearson r calculated on the rankings of the X and Y variables.

Because ranks of N objects are the integers from 1 to N, the sums and sums of squares are known (provided there are no ties).


Since we know the sum of the scores and the sum of their squares, we automatically know the variance of the integers from 1 to N.


Suppose we compute it with N in the denominator instead of


Example

Different Scales, Different Measures of Association

Pearson’s correlation is used to describe the linear relationship between two variables that are both interval or ratio variables

The symbol for Pearson’s correlation coefficient is r

The underlying principle of r is that it compares how consistently each Y value is paired with each X value in a linear fashion

The Pearson Correlation formula

$$r = \frac{\text{degree to which X and Y vary together}}{\text{degree to which X and Y vary separately}} = \frac{\text{co-variability of X and Y}}{\text{variability of X and Y separately}}$$

$$r = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{N}}{\sqrt{\left(\sum X^2 - \frac{(\sum X)^2}{N}\right)\left(\sum Y^2 - \frac{(\sum Y)^2}{N}\right)}}$$

Degrees of freedom = N - 2

Sum of Product Deviations

We have used the sum of squares or SS to measure the amount of variation or variability for a single variable

The sum of products or SP provides a parallel procedure for measuring the amount of covariation or covariability between two variables

Definitional Formula

$$SS_X = \sum (X - \bar{X})(X - \bar{X}) \quad \text{or} \quad SS_Y = \sum (Y - \bar{Y})(Y - \bar{Y})$$

Note:

$$SP = \sum (X - \bar{X})(Y - \bar{Y})$$

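example

| X | Y | XY |
|---|---|---|
| 1 | 3 | 3 |
| 2 | 6 | 12 |
| 4 | 4 | 16 |
| 5 | 7 | 35 |

ΣX = 12

ΣY = 20

ΣXY = 66

Substituting into the computational formula $SP = \sum XY - \frac{(\sum X)(\sum Y)}{N}$:

SP = 66 - 12(20)/4

= 66 - 60

= 6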
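A quick check that the definitional and computational routes to SP agree for this example. The function names are illustrative; the data are the four (X, Y) pairs from the example.

```python
xs = [1, 2, 4, 5]
ys = [3, 6, 4, 7]


def sp_definitional(xs, ys):
    """SP as the sum of products of deviations from the two means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))


def sp_computational(xs, ys):
    """SP from raw totals: sum(XY) - sum(X) * sum(Y) / N."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n


print(sp_definitional(xs, ys))
print(sp_computational(xs, ys))  # 66 - 12 * 20 / 4
```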

Calculation of Pearson’s Correlation Coefficient

Pearson’s correlation coefficient is a ratio comparing the covariability of X and Y (the numerator) with the variability of X and Y separately (the denominator)

SP measures the covariability of X and Y. The variability of X and Y is measured by calculating the SS for the X and Y scores separately

Pearson correlation coefficient

$$r = \frac{SP}{\sqrt{SS_X \, SS_Y}}$$

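example

| X | Y | X − X̄ | Y − Ȳ | (X − X̄)(Y − Ȳ) | (X − X̄)² | (Y − Ȳ)² |
|---|---|---|---|---|---|---|
| 0 | 1 | -6 | -1 | +6 | 36 | 1 |
| 10 | 3 | +4 | +1 | +4 | 16 | 1 |
| 4 | 1 | -2 | -1 | +2 | 4 | 1 |
| 8 | 2 | +2 | 0 | 0 | 4 | 0 |
| 8 | 3 | +2 | +1 | +2 | 4 | 1 |

SP = 6+4+2+0+2 = 14

SSX = 36+16+4+4+4 = 64

SSY = 1+1+1+0+1 = 4

$$r = \frac{SP}{\sqrt{SS_X \, SS_Y}} = \frac{14}{\sqrt{64 \times 4}} = \frac{14}{16} = +0.875$$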
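The worked example can be reproduced step by step. This sketch keeps SP, SSX and SSY as separate variables so each intermediate value can be compared with the figures above; the variable names are illustrative.

```python
import math

xs = [0, 10, 4, 8, 8]
ys = [1, 3, 1, 2, 3]

mean_x = sum(xs) / len(xs)  # 6
mean_y = sum(ys) / len(ys)  # 2

# Sum of products of deviations, and the two sums of squares
sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
ss_x = sum((x - mean_x) ** 2 for x in xs)
ss_y = sum((y - mean_y) ** 2 for y in ys)

r = sp / math.sqrt(ss_x * ss_y)
print(sp, ss_x, ss_y, r)
```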

Inferential statistics

Regression

Regression. The best fit line of prediction.

Using a correlation (relationship between variables) to predict one variable from knowing the score on the other variable

Usually a linear regression (finding the best fitting straight line for the data)

Best illustrated in a scatter plot with the regression line also plotted

The scatter plot

In correlation data, it is sometimes useful to regard one variable as an independent variable and the other as a dependent variable.

In these circumstances, a linear relationship between two variables X and Y can be expressed by the equation Y = bX + a

Where Y is the dependent variable, X the independent variable, and b and a are constants

In the general linear equation the value of b is called the slope

The slope determines how much the Y variable will change when X is increased by one point

The value of a in the general equation is called the Y-intercept (where the line cuts the Y axis)

It determines the value of Y when X = 0
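To illustrate Y = bX + a, the sketch below fits a least-squares line to the five (X, Y) pairs from the Pearson example earlier. The formulas b = SP / SSX and a = Ȳ − bX̄ are the standard least-squares solution; they are not stated on the slides, so treat them as an assumption of this example.

```python
xs = [0, 10, 4, 8, 8]
ys = [1, 3, 1, 2, 3]


def fit_line(xs, ys):
    """Least-squares slope b and intercept a for Y = b*X + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    b = sp / ss_x              # change in Y per one-point increase in X
    a = mean_y - b * mean_x    # predicted Y when X = 0
    return b, a


b, a = fit_line(xs, ys)
print(b, a)
print(b * 5 + a)  # predicted Y for a hypothetical X of 5
```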

A regression is a statistical method for studying the relationship between a single dependent variable and one or more independent variables.

In its simplest form a regression specifies a linear relationship between the dependent and independent variables.

$$Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + e_i$$

for a given set of observations

In the social sciences, a regression is generally used to represent a causal process.

Y represents the dependent variable

b0 is the intercept (it represents the predicted value of Y if X1 and X2 equal zero)

X1 and X2 are the independent variables (also called predictors or regressors)

b1 and b2 are called the regression coefficients and provide a measure of the effect of the independent variables on Y (they measure the slope of the line)

e is the error term: the variation in Y that is not explained by the model

Why use regression?

Regression is used as a way of testing hypotheses about causal relationships.

Specifically, we have hypotheses about whether the independent variables have a positive or a negative effect on the dependent variable.

Just like in our hypothesis tests about variable means, we also would like to be able to judge how confident we are in our inferences.

Standard Error of Estimate

A regression equation, by itself, allows you to make predictions, but it does not provide any information about the accuracy of the predictions

The standard error of estimate gives a measure of the standard distance between a regression line and the actual data points

To calculate the standard error of estimate

Find a sum of squared deviations (SS)

Each deviation will measure the distance between the actual Y value (data) and the predicted Ŷ value (regression line)

This sum of squares is commonly called SSerror
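A rough sketch of SSerror and the standard error of estimate, again using the five (X, Y) pairs from the Pearson example. Two things here are assumptions rather than content from the slides: the line is fitted by least squares, and SSerror is divided by n − 2 degrees of freedom before taking the square root, which is the usual convention for a one-predictor regression.

```python
import math

xs = [0, 10, 4, 8, 8]
ys = [1, 3, 1, 2, 3]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares line Y-hat = b*X + a (assumed fitting method)
sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
ss_x = sum((x - mean_x) ** 2 for x in xs)
b = sp / ss_x
a = mean_y - b * mean_x

# Each deviation is the actual Y minus the predicted Y-hat
ss_error = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))

# Assumed divisor: n - 2 degrees of freedom
standard_error_of_estimate = math.sqrt(ss_error / (n - 2))
print(ss_error, standard_error_of_estimate)
```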

Definition of Standard Error

The standard deviation of the sampling distribution is the standard error. For the mean, it indicates the average distance of the statistic from the parameter.

Example of Height

Raw Data vs. Sampling Distribution

Formula: Standard Error of Mean

To compute the SEM, use: $SEM = s / \sqrt{n}$, where s is the sample standard deviation and n is the sample size

For our Example:

Standard Error (SE)

It has become popular recently

Researchers often misunderstand and misuse SE

The variability of individual observations is the SD, while the variability of sample means is the SE

Therefore, SE is often called the “standard error of the mean”, whereas SD describes a set of observations or a population

Covariance

When two variables covary in opposite directions, as smoking and lung capacity do, values tend to be on opposite sides of the group mean. That is, when smoking is above its group mean, lung capacity tends to be below its group mean.

Consequently, by averaging the product of deviation scores, we can obtain a measure of how the variables vary together.

The Sample Covariance

Instead of averaging by dividing by N, we divide by N − 1. The resulting formula is

$$s_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N - 1}$$

Calculating Covariance

So we obtain

What is Analysis of Variance?

ANOVA is an inferential test designed for use with 3 or more data sets

A t-test is equivalent to an ANOVA for 2 groups

ANOVA is only interested in establishing the existence of a statistical difference, not its direction.

Based upon an F value (R. A. Fisher) which reflects the ratio between systematic and random/error variance…

Procedure for computing 1-way ANOVA for independent samples

Step 1: Complete the table

i.e.

-square each raw score

-total the raw scores for each group

-total the squared scores for each group.

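Step 2: Calculate the Grand Total correction factor

$$GT = \frac{(\sum X)^2}{N}$$

Step 3: Compute total Sum of Squares

$$SS_{total} = \sum X^2 - GT = \left(\sum X_A^2 + \sum X_B^2 + \sum X_C^2\right) - GT$$

Step 4: Compute between groups Sum of Squares

$$SS_{bet} = \sum \frac{(\sum X_{group})^2}{n_{group}} - GT = \frac{(\sum X_A)^2}{n_A} + \frac{(\sum X_B)^2}{n_B} + \frac{(\sum X_C)^2}{n_C} - GT$$

Step 5: Compute within groups Sum of Squares

$$SS_{wit} = SS_{total} - SS_{bet}$$

Step 6: Determine the d.f. for each sum of squares

$$df_{total} = N - 1, \quad df_{bet} = k - 1, \quad df_{wit} = N - k$$

Step 7/8: Estimate the Variances & Compute F

$$MS_{bet} = \frac{SS_{bet}}{df_{bet}}, \quad MS_{wit} = \frac{SS_{wit}}{df_{wit}}$$

$$F = \frac{MS_{bet}}{MS_{wit}}$$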

Step 9: Consult F distribution table

-d1 is your df for the numerator (i.e. systematic variance)

-d2 is your df for the denominator (i.e. error variance)
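The procedure above can be sketched as a short function. The three groups below are hypothetical scores invented purely for illustration (the slides' own data table is not preserved), and the function name `one_way_anova` is likewise an assumption. The resulting F would still need to be compared with a tabled critical value for the two degrees of freedom.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for independent groups of raw scores."""
    scores = [x for g in groups for x in g]
    n_total = len(scores)
    k = len(groups)

    gt = sum(scores) ** 2 / n_total                         # Step 2: correction factor
    ss_total = sum(x * x for x in scores) - gt              # Step 3
    ss_bet = sum(sum(g) ** 2 / len(g) for g in groups) - gt # Step 4
    ss_wit = ss_total - ss_bet                              # Step 5

    df_bet, df_wit = k - 1, n_total - k                     # Step 6
    f_ratio = (ss_bet / df_bet) / (ss_wit / df_wit)         # Steps 7/8
    return f_ratio, df_bet, df_wit


# Hypothetical data: three groups of three scores each
group_a = [1, 2, 3]
group_b = [2, 3, 4]
group_c = [5, 6, 7]

f_ratio, df_bet, df_wit = one_way_anova([group_a, group_b, group_c])
print(f_ratio, df_bet, df_wit)
```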

Statistical Decision Process

Type I error – rejecting a true null hypothesis (concluding that the treatment has an effect when in fact the treatment has no effect)

Alpha level for a hypothesis test is the probability that the test will lead to a Type I error

Alpha and Probability Values

The level of significance that is selected prior to data collection for accepting or rejecting a null hypothesis is called alpha. The level of significance actually obtained after the data have been collected and analyzed is called the probability value, and is indicated by the symbol p.

Inferential Statistics

Level of significance. The second determinant of statistical power is the alpha level at which the null hypothesis is to be rejected. Statistical power can be increased by relaxing the level of significance needed to reject the null hypothesis (for example, using α = .05 rather than α = .01), at the cost of a higher risk of a Type I error.

Error Types

Example - Efficacy Test for New drug

Type I error - Concluding that the new drug is better than the standard (HA) when in fact it is no better (H0). Ineffective drug is deemed better.

Type II error - Failing to conclude that the new drug is better (HA) when in fact it is. Effective drug is deemed to be no better.

Non- parametric statistics

Non-parametric methods

So far we assumed that our samples were drawn from normally distributed populations.

Techniques that do not make that assumption are called distribution-free or nonparametric tests.

In situations where the normal assumption is appropriate, nonparametric tests are less efficient than traditional parametric methods.

Nonparametric tests frequently make use only of the order of the observations and not the actual values.

Usually do not state hypotheses in terms of a specific parameter

They make very few assumptions about the population distribution (distribution-free tests).

Suited for data measured in ordinal and nominal scales

Not as sensitive as parametric tests; more likely to fail in detecting a real difference between two treatments

Statistical analysis that attempts to explain the population parameter using a sample without making assumption about the frequency distribution of the assessed variable

In other words, the variable being assessed is distribution-free

Types of nonparametric tests

Chi-square statistic tests for Goodness of Fit (how well the obtained sample proportions fit the population proportions specified by the null hypothesis)

Test for independence – tests whether or not there is a relationship between two variables

Non-Parametric Methods

Spearman Rho Rank Order Correlation Coefficient

To calculate the Spearman rho:

Rank the observations on each variable from lowest to highest.

Tied observations are assigned the average of the ranks.

The differences (D) between the ranks on the X and Y variables are squared and then summed:

$$r_{rho} = 1 - \frac{6\sum D^2}{n(n^2 - 1)}$$

Is there a relationship between the number of cigarettes smoked and severity of illness?

The null and alternative hypotheses are:

HO: There is no relationship between the number of cigarettes smoked and severity of illness

HA: There is a relationship between the number of cigarettes smoked and severity of illness

α = .05

$$r_{rho} = 1 - \frac{6\sum D^2}{n(n^2 - 1)} = 1 - \frac{6(24)}{8(64 - 1)} = .71$$

tcalc = 2.49

tcrit = 2.447, df = 6, p = .05

Since the calculated t is > the critical value of t, we reject the null hypothesis and conclude that there is a statistically significant positive relationship between the number of cigarettes smoked and severity of illness
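The arithmetic in this example can be reproduced from the two quantities the slides give, ΣD² = 24 and n = 8. The t conversion used below, t = r·√((n − 2)/(1 − r²)), is the usual approximation for testing a correlation; the slides report only the resulting value, so the formula is an assumption of this sketch. With unrounded r it gives 2.5, close to the 2.49 quoted above.

```python
import math


def spearman_rho(sum_d_squared, n):
    """Spearman rho from the sum of squared rank differences (no-ties formula)."""
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))


def t_for_correlation(r, n):
    """Assumed conversion of a correlation to t with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


rho = spearman_rho(24, 8)
t_calc = t_for_correlation(rho, 8)
t_crit = 2.447  # two-tailed critical t for df = 6, alpha = .05 (from the slide)

print(round(rho, 2), round(t_calc, 2))
print(t_calc > t_crit)  # True means the null hypothesis is rejected
```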

Use:A non-parametric procedure that we can use to assess the relationship between variables is the Spearman rho.

Goodness of Fit

The chi-square test is a “goodness of fit” test

it answers the question of how well do experimental data fit expectations.

Example

As an example, you count F2 offspring, and get 290 purple and 110 white flowers. This is a total of 400 (290 + 110) offspring.

We expect a 3/4 : 1/4 ratio. We need to calculate the expected numbers (you MUST use the numbers of offspring, NOT the proportions!); this is done by multiplying the total offspring by the expected proportions. Thus we expect 400 * 3/4 = 300 purple, and 400 * 1/4 = 100 white.

Thus, for purple, obs = 290 and exp = 300. For white, obs = 110 and exp = 100.

Chi square formula

Now it's just a matter of plugging into the formula:

$$\chi^2 = \frac{(290 - 300)^2}{300} + \frac{(110 - 100)^2}{100} = \frac{100}{300} + \frac{100}{100} = 0.333 + 1.000 = 1.333$$

This is our chi-square value
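The same goodness-of-fit calculation as a short sketch. The function name is illustrative. The critical value 3.841 is the standard tabled chi-square value for df = 1 at the 0.05 level; df = 1 because there are two categories.

```python
def chi_square_gof(observed, expected_proportions):
    """Chi-square goodness of fit: sum of (obs - exp)^2 / exp over all categories."""
    total = sum(observed)
    expected = [total * p for p in expected_proportions]  # expected counts, not proportions
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# 290 purple and 110 white offspring against an expected 3/4 : 1/4 ratio
chi2 = chi_square_gof([290, 110], [3 / 4, 1 / 4])
critical = 3.841  # tabled value for df = 1, alpha = 0.05

print(round(chi2, 3))
print(chi2 > critical)  # False: fail to reject the null hypothesis
```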

State H0: μ = 120

State H1: μ ≠ 120

Choose α: α = 0.05

Choose n: n = 100

Choose test: Z, t, or χ² test (or p value)

Compute test statistic (or compute p value)

Search for critical value

Make statistical decision rule

Express decision

Steps in Test of Hypothesis

Determine the appropriate test

Establish the level of significance:α

Formulate the statistical hypothesis

Calculate the test statistic

Determine the degree of freedom

Compare computed test statistic against a tabled/critical value

1. Determine Appropriate Test

Chi Square is used when both variables are measured on a nominal scale.

It can be applied to interval or ratio data that have been categorized into a small number of groups.

It assumes that the observations are randomly sampled from the population.

All observations are independent (an individual can appear only once in a table and there are no overlapping categories).

It does not make any assumptions about the shape of the distribution nor about the homogeneity of variances.

2. Establish Level of Significance

α is a predetermined value

The convention

α = .05

α = .01

α = .001

3. Determine The Hypothesis:

Whether There is an Association or Not


Ho : The two variables are independent

Ha : The two variables are associated

4. Calculating Test Statistics

5. Determine Degrees of Freedom

df = (R-1)(C-1)

6. Compare computed test statistic against a tabled/critical value

The computed value of the Pearson chi-square statistic is compared with the critical value to determine if the computed value is improbable

The critical tabled values are based on sampling distributions of the Pearson chi-square statistic

If the calculated χ² is greater than the χ² table value, reject Ho
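Steps 4 to 6 can be sketched for a contingency table as follows. The 2 × 3 table of counts below is hypothetical, made up only to show the mechanics. The expected count for each cell is taken as (row total × column total) / N, which is the standard formula but is not spelled out on the slides. The critical value 5.991 is the tabled value for df = 2 at α = 0.05.

```python
def chi_square_independence(table):
    """Return (chi-square, df) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # assumed standard formula
            chi2 += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)  # df = (R-1)(C-1)
    return chi2, df


# Hypothetical counts: 2 rows (groups) by 3 columns (response categories)
hypothetical_table = [
    [10, 20, 30],
    [20, 20, 20],
]

chi2, df = chi_square_independence(hypothetical_table)
critical = 5.991  # tabled value for df = 2, alpha = 0.05
print(round(chi2, 3), df, chi2 > critical)
```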

Example

Suppose a researcher is interested in voting preferences on gun control issues.

A questionnaire was developed and sent to a random sample of 90 voters.

The researcher also collects information about the political party membership of the sample of 90 respondents.

Bivariate Frequency Table or Contingency Table

1. Determine Appropriate Test

Party Membership ( 2 levels) and Nominal

Voting Preference ( 3 levels) and Nominal

2. Establish Level of Significance

Alpha of .05

3. Determine The Hypothesis

Ho : There is no difference between D & R in their opinion on gun control issue.

Ha : There is an association between responses to the gun control survey and the party membership in the population.

4. Calculating Test Statistics

5. Determine Degrees of Freedom

df = (R-1)(C-1) = (2-1)(3-1) = 2

Critical Chi-Square

Critical values for chi-square are found on tables, sorted by degrees of freedom and probability levels. Be sure to use p = 0.05.

If your calculated chi-square value is greater than the critical value from the table, you “reject the null hypothesis”.

If your chi-square value is less than the critical value, you “fail to reject” the null hypothesis (that is, the data are consistent with your genetic theory about the expected ratio).

Chi-Square Table

6. Compare computed test statistic against a tabled/critical value

α = 0.05

df = 2

Critical tabled value = 5.991

Test statistic, 11.03, exceeds critical value

Null hypothesis is rejected

Democrats & Republicans differ significantly in their opinions on gun control issues