χ² Distribution

 

The χ² distribution is one of the distributions we will use during the course. Unlike the normal and t-distributions, which are symmetric, the χ² distribution is skewed to the right. Like the t-distribution, the χ² distribution is really a whole family of distributions, distinguished by a single whole-number parameter, ν, called the number of degrees of freedom. The value of ν determines the skewness of the graph: the larger ν is, the less skewed the curve.

We will use the χ² distribution in three applications:

        (a) Estimating a Population Variance
        (b) Performing a Goodness-of-fit Test
        (c) Contingency Tables

In all three applications, we will be looking for the value of the test statistic χ². What is this χ² statistic?

Think about this experiment: you toss a coin 100 times. We can simulate this experiment on the TI calculator with the function randInt(1,2,100)->L1, then sort the data and count the number of ones (tails) or twos (heads). I did the experiment and got 54 ones and 46 twos. If we repeat the experiment several times, we may get different values, or some of them may repeat. We will call these the observed (O) values.

Before performing the experiment, we expected to get 50 and 50 if the coin is fair. We will call these the expected (E) values. We call them expected because the probability of getting a tail or a head is 50% (if the coin is fair!), so if we perform this experiment many, many times, we will not be surprised to get an average of 50 tails (or 50 heads!).

Now look at the number defined as [(O_heads - E)² + (O_tails - E)²] / E. With the values we got before (46 heads, 54 tails), this number equals [(46 - 50)² + (54 - 50)²] / 50 = 0.64. This number is what we call the χ² statistic. Observe that it can never be negative. The amazing thing is that if we perform this experiment many, many times, the distribution of the χ² statistic values is not arbitrary, but follows a distribution called the χ² distribution. The expression of the χ² function is:

$$f(x;\nu) = \frac{x^{\nu/2-1}\, e^{-x/2}}{2^{\nu/2}\,\Gamma(\nu/2)}, \qquad x > 0$$

As you can see, the function depends on one additional parameter, ν, the number of degrees of freedom.
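For readers who prefer a computer to the TI calculator, here is a minimal Python sketch of the standard χ² density and of the probability of an interval under it. The names chi2_pdf and interval_probability are my own choices for illustration, and the integration uses a simple composite Simpson's rule; it is a sketch under those assumptions, not the program used for the tables.

```python
import math

def chi2_pdf(x, nu):
    """Standard chi-square density with nu degrees of freedom."""
    if x <= 0:
        return 0.0
    half = nu / 2
    return x ** (half - 1) * math.exp(-x / 2) / (2 ** half * math.gamma(half))

def interval_probability(lo, hi, nu, steps=200):
    """P(lo < X < hi) by composite Simpson's rule (steps must be even).

    Accurate when the density is smooth on [lo, hi]. For nu <= 2 the
    density does not vanish at 0, so intervals starting at 0 need more care.
    """
    h = (hi - lo) / steps
    total = chi2_pdf(lo, nu) + chi2_pdf(hi, nu)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * chi2_pdf(lo + i * h, nu)
    return total * h / 3

# Expected number of chi-square values per unit interval in 500 repetitions,
# here for 5 degrees of freedom (an illustrative choice).
for lo in range(0, 5):
    print(f"{lo}-{lo + 1}: {500 * interval_probability(lo, lo + 1, 5):.1f}")
```

Multiplying the interval probability by the number of repetitions gives an expected count per interval, which is the kind of number the experiments compare against.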
To check how accurate the predictions of this formula are, I performed three experiments.

In the first one, I simulated tossing a coin 500 times, and then I repeated the experiment 500 times. I compared the values obtained in the experiments (O) with the values calculated (E) using the above formula for the intervals 0-1, 1-2, 2-3, ..., 12-13. In this case we have 1 degree of freedom (ν = 1). (At the end, you can find the program I wrote to make the simulation.)

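| Interval | Expected (E) | Observed (O) |
|----------|--------------|--------------|
| 0-1      | 343          | 344          |
| 1-2      | 80           | 68           |
| 2-3      | 37           | 45           |
| 3-4      | 18           | 24           |
| 4-5      | 10           | 10           |
| 5-6      | 6            | 5            |
| 6-7      | 3            | 1            |
| 7-8      | 2            | 2            |
| 8-9      | 1            | 0            |
| 9-10     | <1           | 0            |
| 10-11    | <1           | 0            |
| 11-12    | <1           | 0            |
| 12-13    | <1           | 1            |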
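The first experiment can be sketched in Python as follows. This is not the TI program mentioned above; the function names and the fixed random seed are assumptions made for illustration, and the expected counts here come from the exact χ² probabilities for 1 degree of freedom (via the error function), so they may differ slightly from the table.

```python
import math
import random

def coin_chi_square(n_tosses, rng):
    """Toss a fair coin n_tosses times (1 = tail, 2 = head) and return chi-square."""
    tosses = [rng.randint(1, 2) for _ in range(n_tosses)]
    tails = tosses.count(1)
    heads = n_tosses - tails
    e = n_tosses / 2
    return ((heads - e) ** 2 + (tails - e) ** 2) / e

def chi2_cdf_1df(x):
    """P(X <= x) for a chi-square variable with 1 degree of freedom."""
    return math.erf(math.sqrt(x / 2))

rng = random.Random(1)  # arbitrary seed, only so that runs are repeatable
n_repeats = 500
stats = [coin_chi_square(500, rng) for _ in range(n_repeats)]

print("Interval  Expected  Observed")
for lo in range(0, 13):
    expected = n_repeats * (chi2_cdf_1df(lo + 1) - chi2_cdf_1df(lo))
    observed = sum(lo <= s < lo + 1 for s in stats)
    print(f"{lo}-{lo + 1:<6} {expected:8.1f} {observed:9d}")
```

Each run with a different seed gives different observed counts, just as repeating the experiment on the calculator does.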

 

In the second experiment, I simulated tossing a die 500 times, and then I repeated the experiment 500 times. In this case the number of degrees of freedom is ν = 5, and the value of χ² is given by the expression:

$$\chi^2 = \frac{(n_1-m)^2 + (n_2-m)^2 + (n_3-m)^2 + (n_4-m)^2 + (n_5-m)^2 + (n_6-m)^2}{m}$$

where n_i is the number of times we observed the number i, and m is the expected value, which is the total number of trials divided by 6. The results are in the next table:

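| Interval | Expected (E) | Observed (O) |
|----------|--------------|--------------|
| 0-1      | 19           | 10           |
| 1-2      | 57           | 64           |
| 2-3      | 75           | 78           |
| 3-4      | 75           | 65           |
| 4-5      | 67           | 59           |
| 5-6      | 55           | 60           |
| 6-7      | 43           | 45           |
| 7-8      | 32           | 33           |
| 8-9      | 24           | 30           |
| 9-10     | 17           | 22           |
| 10-11    | 12           | 11           |
| 11-12    | 8            | 12           |
| 12-13    | 6            | 4            |
| 13-14    | 4            | 2            |
| 14-15    | 3            | 1            |
| 15-16    | 2            | 1            |
| 16-17    | 1            | 1            |
| 17-18    | <1           | 0            |
| 18-19    | <1           | 2            |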
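The die experiment follows the same pattern. Below is a minimal Python sketch, assuming a fair die; the function name die_chi_square, its n_faces parameter, and the seed are hypothetical choices for illustration, not part of the original TI program.

```python
import random
from collections import Counter

def die_chi_square(n_faces, n_rolls, rng):
    """Roll a fair die n_rolls times and return sum((n_i - m)^2) / m."""
    counts = Counter(rng.randint(1, n_faces) for _ in range(n_rolls))
    m = n_rolls / n_faces  # expected count for every face
    return sum((counts[face] - m) ** 2 for face in range(1, n_faces + 1)) / m

rng = random.Random(2)  # arbitrary seed, only so that runs are repeatable
stats = [die_chi_square(6, 500, rng) for _ in range(500)]

# The mean of a chi-square distribution equals its degrees of freedom,
# so the average of the 500 values should land near 5.
print(round(sum(stats) / len(stats), 2))
```

Because the number of faces is a parameter, the same function can be reused for dice with a different number of faces.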

 

 

In the third experiment, I simulated tossing a soccerball-like die (12 faces!) 500 times, and then I repeated the experiment 500 times. In this case the number of degrees of freedom is ν = 11, and the value of χ² is given by the expression:

$$\chi^2 = \frac{(m_1-m)^2 + (m_2-m)^2 + \cdots + (m_{12}-m)^2}{m}$$

where m_i is the number of times we observed the number i, and m is the expected value, which is the total number of trials divided by 12. The results are in the next table:

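| Interval | Expected (E) | Observed (O) |
|----------|--------------|--------------|
| 0-1      | <1           | 0            |
| 1-2      | <1           | 1            |
| 2-3      | 4            | 1            |
| 3-4      | 10           | 12           |
| 4-5      | 19           | 19           |
| 5-6      | 29           | 22           |
| 6-7      | 37           | 43           |
| 7-8      | 43           | 36           |
| 8-9      | 46           | 58           |
| 9-10     | 46           | 49           |
| 10-11    | 44           | 41           |
| 11-12    | 40           | 38           |
| 12-13    | 35           | 47           |
| 13-14    | 30           | 27           |
| 14-15    | 25           | 27           |
| 15-16    | 21           | 24           |
| 16-17    | 17           | 8            |
| 17-18    | 13           | 7            |
| 18-19    | 10           | 13           |
| 19-20    | 8            | 9            |
| 20-21    | 6            | 3            |
| 21-22    | 4            | 3            |
| 22-23    | 3            | 3            |
| 23-24    | 2            | 1            |
| 24-25    | 2            | 0            |
| 25-26    | 1            | 0            |
| 26-27    | 1            | 2            |
| 27-28    | 1            | 0            |
| 28-29    | <1           | 1            |
| 29-30    | <1           | 1            |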


I find all of this really amazing! You see that there is some order, some logic, behind all of these statistical fluctuations! Why? I don't know. If you got a result like this in Physics, you would say: there is some conservation law behind these numbers! But what do we have here? We are talking about coins and dice! But wait a minute! There is more!

Next I decided to make a different kind of simulation. What if, instead of using dice and coins, we use some process that follows a normal (continuous) distribution?

Using the program randnorm(), I simulated selecting random samples of 100 individuals and asking them about their IQ. We know that IQ follows a normal distribution with mean µ = 100 and standard deviation σ = 15.

I made two simulations. In one, I divided the data into ten classes, each with the same probability (10%); in the other, I divided the data into five classes, each with the same probability (20%). To find the limits of each class I used the function invNorm(). So, in the case of ten classes I found the 10th, 20th, 30th, ... percentiles; in the case of five classes I used the 20th, 40th, ... percentiles. Each simulation was repeated 500 times.
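The class limits and the χ² value for one sample can be sketched in Python, where statistics.NormalDist.inv_cdf plays the role of invNorm() and random.gauss plays the role of randnorm(). The helper names class_limits and sample_chi_square, and the seed, are assumptions made for this illustration.

```python
import random
from bisect import bisect_right
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)

def class_limits(n_classes):
    """Interior cut points splitting the IQ distribution into equal-probability classes."""
    return [iq.inv_cdf(k / n_classes) for k in range(1, n_classes)]

def sample_chi_square(n_classes, sample_size, rng):
    """Draw one sample of IQ values, count them per class, and return chi-square."""
    limits = class_limits(n_classes)
    counts = [0] * n_classes
    for _ in range(sample_size):
        # bisect_right gives the index of the class the value falls into
        counts[bisect_right(limits, rng.gauss(100, 15))] += 1
    e = sample_size / n_classes  # same expected count for every class
    return sum((c - e) ** 2 for c in counts) / e

print([round(x, 1) for x in class_limits(5)])  # 20th, 40th, 60th, 80th percentiles
rng = random.Random(3)  # arbitrary seed, only so that runs are repeatable
print(sample_chi_square(5, 100, rng))
```

Calling sample_chi_square 500 times and grouping the results into unit intervals gives the kind of "Observed" column shown in the tables that follow.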

The results of the simulation for 5 groups are shown in the following table. The first column gives the intervals used for the χ² values. The second column gives the midpoint of each interval, which is the value used for the calculation. The last two columns give the observed values and the values calculated using 4 degrees of freedom. The mean value of χ² over the 500 simulations was 4.27, which is consistent with what we would expect from a χ² distribution with 4 degrees of freedom, since the mean of a χ² distribution equals its number of degrees of freedom. So the idea that the number of degrees of freedom could be less than four does not seem reasonable.

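| χ² Interval | χ² Value | Observed | Expected using 4 d.f. |
|-------------|----------|----------|-----------------------|
| [0-1]       | 0.5      | 36       | 49                    |
| [1-2]       | 1.5      | 73       | 88                    |
| [2-3]       | 2.5      | 88       | 89                    |
| [3-4]       | 3.5      | 77       | 76                    |
| [4-5]       | 4.5      | 57       | 59                    |
| [5-6]       | 5.5      | 61       | 44                    |
| [6-7]       | 6.5      | 37       | 31                    |
| [7-8]       | 7.5      | 16       | 22                    |
| [8-9]       | 8.5      | 21       | 15                    |
| [9-10]      | 9.5      | 12       | 10                    |
| [10-11]     | 10.5     | 8        | 7                     |
| [11-12]     | 11.5     | 3        | 4                     |
| [12-13]     | 12.5     | 4        | 3                     |
| [13-14]     | 13.5     | 2        | 2                     |
| [14-15]     | 14.5     | 1        | 1                     |
| [15-16]     | 15.5     | 2        | 0                     |
| [16-17]     | 16.5     | 0        | 0                     |
| [17-18]     | 17.5     | 2        | 0                     |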
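The "Expected" column above looks consistent with multiplying the 500 simulations by the χ² density evaluated at the midpoint of each interval (the intervals have width 1). That is my reading of the table, so treat the following Python sketch as an assumption about the method rather than the original calculation; it uses the closed form of the density for 4 degrees of freedom, x·e^(-x/2)/4.

```python
import math

def chi2_pdf_4df(x):
    """Chi-square density for 4 degrees of freedom: x * exp(-x/2) / 4."""
    return x * math.exp(-x / 2) / 4 if x > 0 else 0.0

n_simulations = 500
print("Interval  Midpoint  Expected")
for lo in range(0, 6):
    mid = lo + 0.5
    expected = n_simulations * chi2_pdf_4df(mid)  # interval width is 1
    print(f"[{lo}-{lo + 1}]     {mid:4.1f}    {expected:6.1f}")
```

Small differences of about one unit from the table are possible, depending on how the original values were rounded.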

 

The results of the simulation for 10 groups are shown in the following table. The first column gives the intervals used for the χ² values. The second column gives the midpoint of each interval, which is the value used for the calculation. The last two columns give the observed values and the values calculated using 9 degrees of freedom. The mean value of χ² over the 500 simulations was 9.64, which is consistent with what we would expect from a χ² distribution with 9 degrees of freedom. So the idea that the number of degrees of freedom could be less than nine does not seem reasonable.

 

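| χ² Interval | χ² Value | Observed | Expected using 9 d.f. |
|-------------|----------|----------|-----------------------|
| [0-1]       | 0.5      | 0        | 0                     |
| [1-2]       | 1.5      | 3        | 4                     |
| [2-3]       | 2.5      | 9        | 13                    |
| [3-4]       | 3.5      | 27       | 26                    |
| [4-5]       | 4.5      | 23       | 39                    |
| [5-6]       | 5.5      | 40       | 47                    |
| [6-7]       | 6.5      | 48       | 52                    |
| [7-8]       | 7.5      | 41       | 52                    |
| [8-9]       | 8.5      | 48       | 49                    |
| [9-10]      | 9.5      | 56       | 43                    |
| [10-11]     | 10.5     | 37       | 37                    |
| [11-12]     | 11.5     | 34       | 31                    |
| [12-13]     | 12.5     | 33       | 25                    |
| [13-14]     | 13.5     | 17       | 20                    |
| [14-15]     | 14.5     | 25       | 16                    |
| [15-16]     | 15.5     | 11       | 12                    |
| [16-17]     | 16.5     | 17       | 9                     |
| [17-18]     | 17.5     | 8        | 7                     |
| [18-19]     | 18.5     | 6        | 5                     |
| [19-20]     | 19.5     | 6        | 4                     |
| [20-21]     | 20.5     | 2        | 3                     |
| [21-22]     | 21.5     | 2        | 2                     |
| [22-23]     | 22.5     | 2        | 1                     |
| [23-24]     | 23.5     | 1        | 1                     |
| [24-25]     | 24.5     | 2        | 1                     |
| [25-26]     | 25.5     | 2        | 1                     |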

But wait a minute! There is more!

Because we were making simulations with a normal distribution, it makes sense to continue in this direction. If we select random samples of size N from a population whose characteristic follows a normal distribution, then the sampling distribution of the sample variance (the distribution of Sx²!) follows a χ² distribution with (N-1) degrees of freedom if we standardize the variable this way:

$$\chi^2 = \frac{(N-1)\,S_x^2}{\sigma^2}$$

where Sx is the standard deviation of the sample, and σ is the standard deviation of the population.
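A minimal Python sketch of this standardization, assuming random.gauss as a stand-in for the calculator's normal random generator; the function name scaled_sample_variance and the seed are my own choices for illustration.

```python
import random
import statistics

def scaled_sample_variance(n, mu, sigma, rng):
    """Draw a sample of size n and return (n - 1) * Sx^2 / sigma^2."""
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    # statistics.variance is the sample variance (divides by n - 1)
    return (n - 1) * statistics.variance(sample) / sigma ** 2

rng = random.Random(4)  # arbitrary seed, only so that runs are repeatable
for n in (5, 10):
    values = [scaled_sample_variance(n, 100, 15, rng) for _ in range(500)]
    # The mean of the 500 values should land near n - 1 degrees of freedom.
    print(n, round(statistics.mean(values), 2))
```

Grouping the 500 values into unit intervals, as in the earlier experiments, gives the observed counts to compare with the χ² distribution with N-1 degrees of freedom.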

I made simulations for samples of size N = 5 and N = 10 individuals, asking them about their IQ. As before, IQ follows a normal distribution with mean µ = 100 and standard deviation σ = 15. Here are the results:

For N=5

The results of the simulation for samples of size 5 are shown in the following table. The first column gives the intervals used for the χ² values. The second column gives the midpoint of each interval, which is the value used for the calculation. The last two columns give the observed values and the values calculated using 4 degrees of freedom. The mean value of χ² over the 500 simulations was 4.08, which is consistent with what we would expect from a χ² distribution with 4 degrees of freedom.

 

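| χ² Interval | χ² Value | Observed | Expected using 4 d.f. |
|-------------|----------|----------|-----------------------|
| [0-1]       | 0.5      | 39       | 49                    |
| [1-2]       | 1.5      | 89       | 88                    |
| [2-3]       | 2.5      | 81       | 89                    |
| [3-4]       | 3.5      | 78       | 76                    |
| [4-5]       | 4.5      | 63       | 59                    |
| [5-6]       | 5.5      | 49       | 44                    |
| [6-7]       | 6.5      | 30       | 31                    |
| [7-8]       | 7.5      | 25       | 22                    |
| [8-9]       | 8.5      | 15       | 15                    |
| [9-10]      | 9.5      | 10       | 10                    |
| [10-11]     | 10.5     | 5        | 7                     |
| [11-12]     | 11.5     | 7        | 4                     |
| [12-13]     | 12.5     | 2        | 3                     |
| [13-14]     | 13.5     | 2        | 2                     |
| [14-15]     | 14.5     | 4        | 1                     |
| [15-16]     | 15.5     | 1        | 0                     |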

 

 

For N=10

The results of the simulation for samples of size 10 are shown in the following table. The first column gives the intervals used for the χ² values. The second column gives the midpoint of each interval, which is the value used for the calculation. The last two columns give the observed values and the values calculated using 9 degrees of freedom. The mean value of χ² over the 500 simulations was 8.96, which is consistent with what we would expect from a χ² distribution with 9 degrees of freedom.

 

 

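| χ² Interval | χ² Value | Observed | Expected using 9 d.f. |
|-------------|----------|----------|-----------------------|
| [0-1]       | 0.5      | 2        | 1                     |
| [1-2]       | 1.5      | 5        | 4                     |
| [2-3]       | 2.5      | 10       | 13                    |
| [3-4]       | 3.5      | 27       | 26                    |
| [4-5]       | 4.5      | 36       | 39                    |
| [5-6]       | 5.5      | 48       | 47                    |
| [6-7]       | 6.5      | 48       | 52                    |
| [7-8]       | 7.5      | 56       | 52                    |
| [8-9]       | 8.5      | 55       | 49                    |
| [9-10]      | 9.5      | 43       | 43                    |
| [10-11]     | 10.5     | 37       | 37                    |
| [11-12]     | 11.5     | 26       | 31                    |
| [12-13]     | 12.5     | 25       | 25                    |
| [13-14]     | 13.5     | 26       | 20                    |
| [14-15]     | 14.5     | 14       | 16                    |
| [15-16]     | 15.5     | 12       | 12                    |
| [16-17]     | 16.5     | 3        | 9                     |
| [17-18]     | 17.5     | 6        | 7                     |
| [18-19]     | 18.5     | 6        | 5                     |
| [19-20]     | 19.5     | 4        | 4                     |
| [20-21]     | 20.5     | 1        | 3                     |
| [21-22]     | 21.5     | 3        | 2                     |
| [22-23]     | 22.5     | 1        | 1                     |
| [23-24]     | 23.5     | 3        | 1                     |
| [24-25]     | 24.5     | 3        | 1                     |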

 

Just Amazing!!
