Chi-Square test in R


Published: 2019-06-06


Chi-Square

Chi-Square distribution test

The chi-square goodness-of-fit test assesses how well observed frequencies match the frequencies expected under a hypothesized distribution.

The chi-squared value is:

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$

$O_i$: the observed value of class i

$E_i$: the expected value of class i

If every $O_i$ is close to its $E_i$, then $\chi^2$ is close to 0, so $\chi^2$ indicates how close the observed distribution is to the expected distribution. Fitting a normal distribution is one special case.

Comparing this statistic with the chi-square distribution gives a p-value for the goodness of fit.

Example:

> O <- c(21,42,24,8,4,1) # observed counts in each of the 6 classes

> N <- sum(O) # the sample size

> N

[1] 100

> c1 <- pbinom(0,5,.25) # hypothesize a Binomial(5, 0.25) distribution and compute the expected probability of each class

> c2 <- pbinom(1,5,.25)-pbinom(0,5,.25)

> c3 <- pbinom(2,5,.25)-pbinom(1,5,.25)

> c4 <- pbinom(3,5,.25)-pbinom(2,5,.25)

> c5 <- pbinom(4,5,.25)-pbinom(3,5,.25)

> c6 <- pbinom(5,5,.25)-pbinom(4,5,.25)

> P <- c(c1,c2,c3,c4,c5,c6)

> P

[1] 0.2373046875 0.3955078125 0.2636718750

[4] 0.0878906250 0.0146484375 0.0009765625

> sum(P)

[1] 1

> E <- P*N # expected counts for a sample of size N = 100

> E

[1] 23.73046875 39.55078125 26.36718750

[4] 8.78906250 1.46484375 0.09765625

> sum((O-E)^2/E) # calculate the chi-square value

[1] 13.47437

> 1-pchisq(13.47437,5) # calculate the p-value

[1] 0.01931663

The p-value (about 0.019) is below 0.05.

Rules of thumb for rating goodness of fit (you can set your own thresholds for your data):

p-value >= 0.25 Excellent fit

0.15 <= p-value < 0.25 Good fit

0.05 <= p-value < 0.15 Moderately good fit

0.01 <= p-value < 0.05 Poor fit

We reject the null hypothesis at the 0.05 level: the data give significant evidence that O does not follow the Binomial(5, 0.25) distribution.
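The manual calculation above can be cross-checked with R's built-in `chisq.test()`, which accepts the hypothesized class probabilities through its `p` argument. The following is a minimal sketch rather than part of the original walkthrough: it assumes the six classes correspond to 0 to 5 successes, uses `dbinom()` in place of the differences of `pbinom()` values, and suppresses the warning R issues because some expected counts are below 5.

```r
# Observed counts for the six classes (same data as the example above)
O <- c(21, 42, 24, 8, 4, 1)

# Binomial(5, 0.25) probabilities for 0, 1, ..., 5 successes
P <- dbinom(0:5, size = 5, prob = 0.25)

# Goodness-of-fit test against the given probabilities.
# suppressWarnings() hides the "approximation may be incorrect" warning
# that appears because some expected counts are smaller than 5.
fit <- suppressWarnings(chisq.test(O, p = P))

fit$statistic  # chi-square statistic
fit$parameter  # degrees of freedom (number of classes - 1)
fit$p.value    # p-value
```

The statistic, degrees of freedom, and p-value should agree with the manual values above (13.47437, 5, and 0.01931663).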

Chi-Square Test for Independence

This section explains how to conduct a chi-square test for independence. The test is applied when you have two categorical variables from a single population. It is used to determine whether there is a significant association between the two variables.

For example, in an election survey, voters might be classified by gender (male or female) and voting preference (Democrat, Republican, or Independent). We could use a chi-square test for independence to determine whether gender is related to voting preference.

The test procedure described here is appropriate when the following conditions are met:

The sampling method is simple random sampling.

Each population is at least 10 times as large as its respective sample.

The variables under study are each categorical.

If sample data are displayed in a contingency table, the expected frequency count for each cell of the table is at least 5.
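The last condition can be checked directly in R, because the object returned by `chisq.test()` carries the expected counts in its `expected` component. The sketch below re-enters the smoking-by-exercise counts from the example later in this article as a plain matrix (the names `counts`, `res`, and `n_small` are only illustrative) and counts how many cells fall below 5.

```r
# Smoking habit (rows) by exercise level (columns), typed in by hand
counts <- matrix(
  c( 7,  1,  3,
    87, 18, 84,
    12,  3,  4,
     9,  1,  7),
  nrow = 4, byrow = TRUE,
  dimnames = list(Smoke = c("Heavy", "Never", "Occas", "Regul"),
                  Exer  = c("Freq", "None", "Some"))
)

# chisq.test() warns when expected counts are small; the warning is
# suppressed here because the counts are inspected explicitly below.
res <- suppressWarnings(chisq.test(counts))

# Expected count for every cell under independence
expected <- res$expected
round(expected, 2)

# Number of cells that violate the "at least 5" condition
n_small <- sum(expected < 5)
n_small
```

When some expected counts are below 5, the chi-square approximation may be unreliable. One option `chisq.test()` offers for that situation is a simulated p-value via `simulate.p.value = TRUE`.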

This approach consists of four steps: (1) state the hypotheses, (2) formulate an analysis plan, (3) analyze sample data, and (4) interpret results.

Let x and y be two categorical variables whose independence we want to test. Put x and y in one contingency table, with the categories of y as rows and the categories of x as columns.

	X1	X2	X3
Y1	O11	O12	O13
Y2	O21	O22	O23
Y3	O31	O32	O33

Compute the total of each row and each column, as in the table below:

	X1	X2	X3	Total in row
Y1	O11	O12	O13	Oy1=O11+ O12+ O13
Y2	O21	O22	O23	Oy2=O21+ O22+ O23
Y3	O31	O32	O33	Oy3=O31+ O32+ O33
Total in column	Ox1=O11+ O21+ O31	Ox2=O12+ O22+ O32	Ox3=O13+ O23+ O33	sample size N

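Formula:

$$\chi^2 = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where $O_{ij}$ is the observed frequency in row i and column j, and $E_{ij}$ is the expected frequency under the null hypothesis, computed from the row total $O_{y_i}$, the column total $O_{x_j}$, and the sample size N by

$$E_{ij} = \frac{O_{y_i} \times O_{x_j}}{N}$$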

Example:

> library(MASS)

> tbl = table(survey$Smoke, survey$Exer)

> tbl

        Freq None Some
  Heavy    7    1    3
  Never   87   18   84
  Occas   12    3    4
  Regul    9    1    7

The Smoke column records the students' smoking habits, while the Exer column records their exercise level. The allowed values in Smoke are "Heavy", "Regul" (regularly), "Occas" (occasionally) and "Never". For Exer, they are "Freq" (frequently), "Some" and "None".

Test whether Exer and Smoke are independent:

> chisq.test(tbl)

Result:

Pearson's Chi-squared test

data: tbl

X-squared = 5.4885, df = 6, p-value = 0.4828

Null hypothesis: the variables are independent.

Alternative hypothesis: the variables are not independent.

At a significance level of 0.05, the p-value (0.4828) is greater than 0.05, so we do not reject the null hypothesis that the students' smoking habit is independent of their exercise level.
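To connect the formula to the `chisq.test()` output, the statistic can be rebuilt by hand from the row and column totals. This is a minimal sketch, not the original author's code: it re-enters the printed table as a matrix so that it runs without the MASS package, and the variable names (`E`, `X2`, `dof`, `p_value`) are illustrative.

```r
# Observed counts, copied from the printed contingency table
tbl <- matrix(
  c( 7,  1,  3,
    87, 18, 84,
    12,  3,  4,
     9,  1,  7),
  nrow = 4, byrow = TRUE,
  dimnames = list(Smoke = c("Heavy", "Never", "Occas", "Regul"),
                  Exer  = c("Freq", "None", "Some"))
)

N <- sum(tbl)  # sample size

# Expected counts: (row total * column total) / N for every cell
E <- outer(rowSums(tbl), colSums(tbl)) / N

# Chi-square statistic: sum over all cells of (O - E)^2 / E
X2 <- sum((tbl - E)^2 / E)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof <- (nrow(tbl) - 1) * (ncol(tbl) - 1)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(X2, dof, lower.tail = FALSE)

c(X2 = X2, dof = dof, p_value = p_value)
```

The three values should match the X-squared, df, and p-value lines reported by `chisq.test(tbl)` above.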


Reposted from: https://www.cnblogs.com/chaseskyline/p/3786048.html
