Notes on Topic 8:
Hypothesis Testing

     

  1. Overview

    Definition of Hypothesis Testing
    Hypothesis testing is an inferential procedure that uses sample data to evaluate the credibility of a hypothesis about a population.
     
    Put simply, the logic underlying the statistical hypothesis testing procedure is:
    1. State the Hypothesis: We state a hypothesis (guess) about a population. Usually the hypothesis concerns the value of a population parameter.
    2. Define the Decision Method: We define a method to make a decision about the hypothesis. The method involves sample data.
    3. Gather Data: We obtain a random sample from the population.
    4. Make a Decision: We compare the sample data with the hypothesis about the population. Usually we compare the value of a statistic computed from the sample data with the hypothesized value of the population parameter.
      • If the data are consistent with the hypothesis, we conclude that the hypothesis is reasonable.
      • If there is a big discrepancy between the data and the hypothesis, we conclude that the hypothesis was wrong.
    We expand on the logic of these four steps in the next section.
     
    The purpose of hypothesis testing is to make a decision in the face of uncertainty. We do not have a fool-proof method for doing this: Errors can be made. Specifically, two kinds of errors can be made:
    1. Type I Error: We decide to reject the null hypothesis when it is true.
    2. Type II Error: We decide not to reject the null hypothesis when it is false.

    We present the technical aspects of the steps later in these notes. This part covers non-directional (two-tailed) techniques, which are appropriate when the experimenter predicts an effect but doesn't predict the direction of the effect.
     
    Directional (One-Tailed) tests are used when the experimenter predicts a direction of the effect.
    The power of a hypothesis test is discussed in the last section of these notes.

     

  2. The Logic of Hypothesis Testing

    As just stated, the logic of hypothesis testing in statistics involves four steps. We expand on those steps in this section:

    1. First Step: State the Hypothesis
    Stating the hypothesis actually involves stating two opposing hypotheses about the value of a population parameter.

    Example: Suppose we are interested in the effect of prenatal exposure to alcohol on the birth weight of rats. Also, suppose that we know that the mean birth weight of the population of untreated lab rats is 18 grams.

    Here are the two opposing hypotheses:

    • The Null Hypothesis (Ho). This hypothesis states that the treatment has no effect. For our example, we formally state:

      The null hypothesis (Ho) is that prenatal exposure to alcohol has no effect on the birth weight for the population of lab rats. The birth weight will be equal to 18 grams. This is denoted:

        Ho: μ = 18

    • The Alternative Hypothesis (H1). This hypothesis states that the treatment does have an effect. For our example, we formally state:

      The alternative hypothesis (H1) is that prenatal exposure to alcohol has an effect on the birth weight for the population of lab rats. The birth weight will be different from 18 grams. This is denoted:

        H1: μ ≠ 18

    2. Second Step: Set the Criteria for a Decision.
      The researcher will be gathering data from a sample taken from the population to evaluate the credibility of the null hypothesis.

      A criterion must be set to decide whether the kind of data we get is different from what we would expect under the null hypothesis.

      Specifically, we must set a criterion about whether the sample mean is different from the hypothesized population mean. The criterion will let us conclude whether (reject the null hypothesis) or not (accept the null hypothesis) the treatment (prenatal alcohol) has an effect (on birth weight).

      We will go into details later.

    3. Third Step: Collect Sample Data.
      Now we gather data. We do this by obtaining a random sample from the population.

      Example: A random sample of rats receives daily doses of alcohol during pregnancy. At birth, we measure the weight of the sample of newborn rats. We calculate the mean birth weight.

    4. Fourth Step: Evaluate the Null Hypothesis
      We compare the sample mean with the hypothesis about the population mean.
      • If the data are consistent with the hypothesis, we conclude that the hypothesis is reasonable.
      • If there is a big discrepancy between the data and the hypothesis, we conclude that the hypothesis was wrong.

      Example: We compare the observed mean birth weight with the hypothesized value of 18 grams.

      • If the sample of rat pups that were exposed to prenatal alcohol has a mean birth weight very near 18 grams, we conclude that the treatment does not have an effect. Formally, we do not reject the null hypothesis.
      • If our sample of rat pups has a mean birth weight very different from 18 grams, we conclude that the treatment does have an effect. Formally, we reject the null hypothesis.

  3. Errors in Hypothesis Testing

    The central reason we do hypothesis testing is to decide whether or not the sample data are consistent with the null hypothesis.

    In the second step of the procedure we identify the kind of data that is expected if the null hypothesis is true. Specifically, we identify the mean we expect if the null hypothesis is true.

    If the outcome of the experiment is consistent with the null hypothesis, we believe it is true (we "accept the null hypothesis"). And, if the outcome is inconsistent with the null hypothesis, we decide it is not true (we "reject the null hypothesis").

    We can be wrong in either decision we reach. Since there are two decisions, there are two ways to be wrong.

    Errors in Hypothesis Testing

                               Actual Situation
                               No Effect           Effect Exists
                               (Ho True)           (Ho False)

    Decision: Reject Ho        Type I Error        Correct Decision
    Decision: Retain Ho        Correct Decision    Type II Error

    1. Type I Error: A Type I error consists of rejecting the null hypothesis when it is actually true. This is a serious error that we want to make only seldom: we don't want to be likely to conclude the experiment had an effect when it didn't.

      The experimental results look very different from what we expect under the null hypothesis, but they could have come out that way simply because, by chance, we got an unusual sample.

      Example: We observe that the rat pups are unusually heavy and conclude that prenatal exposure to alcohol has an effect, even though it really doesn't. (We conclude, erroneously, that the alcohol causes heavier pups!) The weights could be due to another reason entirely; perhaps the mothers have unusual genes.

    2. Type II Error: A Type II error consists of failing to reject the null hypothesis when it is actually false. This error has less grievous implications, so we are willing to err in this direction (not concluding the experiment had an effect when, in fact, it did).

      The experimental results don't look different from what we expect under the null hypothesis, but they really are different, perhaps because the effect isn't very big.

      Example: The rat pups weigh 17.9 grams on average and we conclude there is no effect. But "really" (if only we knew!) alcohol does reduce weight; the effect just isn't big enough for us to see it.
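
    To make the two error types concrete, here is a minimal Python simulation sketch (an illustration added to these notes, using the .05 significance criterion that is developed later in these notes). We repeatedly draw samples of n = 16 pups from a normal population with σ = 4 and count how often a two-tailed Z test rejects Ho: μ = 18. When the true mean really is 18, every rejection is a Type I error; when the true mean is 16 (an assumed effect size, chosen only for illustration), every failure to reject is a Type II error.

      import numpy as np

      rng = np.random.default_rng(0)

      def reject_rate(true_mean, pop_mean=18, sigma=4, n=16, z_crit=1.96, reps=100_000):
          # Draw `reps` samples of size n and return the fraction of two-tailed
          # Z tests (|Z| > z_crit, i.e. alpha = .05) that reject Ho: mu = pop_mean.
          samples = rng.normal(true_mean, sigma, size=(reps, n))
          z = (samples.mean(axis=1) - pop_mean) / (sigma / np.sqrt(n))
          return np.mean(np.abs(z) > z_crit)

      # Ho true (mu really is 18): any rejection is a Type I error; rate ~ .05
      print("Type I error rate: ", reject_rate(true_mean=18))

      # Ho false (mu really is 16, the assumed effect size): any failure to
      # reject is a Type II error
      print("Type II error rate:", 1 - reject_rate(true_mean=16))

    With these settings the Type I error rate comes out near .05 (exactly what the significance level promises), while the Type II error rate is roughly .48, showing that a real effect can easily go undetected.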


  4. Hypothesis Testing Techniques

    There is always the possibility of making an inference error --- of making the wrong decision about the null hypothesis. We never know for certain if we've made the right decision. However:

    The techniques of hypothesis testing allow us to know the probability of making a Type I error.

    We do this by comparing the sample mean with the population mean hypothesized under the null hypothesis and deciding whether they are "significantly different". If we decide that they are significantly different, we reject the null hypothesis.

    To do this we must determine what data would be expected if Ho were true, and what data would be unlikely if Ho were true. This is done by looking at the distribution of all possible outcomes, if Ho were true. Since we usually are concerned about the mean, we usually look at the distribution of sample means for samples of size n that we would obtain if Ho were true.

    Thus, if we are concerned about means we:

    • Assume that Ho is true
    • Divide the distribution of sample means into two parts:
      1. Those sample means that are likely to be obtained if Ho is true.
      2. Those sample means that are unlikely to be obtained if Ho is true.
    To divide the distribution into these two parts -- likely and unlikely -- we define a cutoff point. This cutoff is defined on the basis of the probability of obtaining specific sample means. This (semi-arbitrary) cutoff point is called the alpha level or the level of significance. The alpha level specifies the probability of making a Type I error. It is denoted α. Thus:

    α = the probability of a Type I error.

    By convention, we usually adopt a cutoff point of either:

    α = .05 or α = .01 or, occasionally, α = .001.

    If we adopt a cutoff point of α = .05, then we know that a sample of data as deviant as the one obtained would occur in fewer than 5 of 100 samples, if the data were sampled from the population in which Ho is true.

    We decide: "The data (and its sample mean) are significantly different from the value of the mean hypothesized under the null hypothesis, at the .05 level of significance."

    This decision is likely to be wrong (a Type I error) 5 times out of 100. Thus, the probability of a Type I error is .05.

    If we adopt a cutoff point of α = .01, then a sample of data as deviant as the one obtained would occur in fewer than 1 of 100 samples, if the data were sampled from the population in which Ho is true.

    We decide: "The data (and its sample mean) are significantly different from the value of the mean hypothesized under the null hypothesis, at the .01 level of significance."

    This decision is likely to be wrong (a Type I error) 1 time out of 100. Thus, the probability of a Type I error is .01.

    If we adopt a cutoff point of α = .001, then a sample of data as deviant as the one obtained would occur in fewer than 1 of 1000 samples, if the data were sampled from the population in which Ho is true.

    We decide: "The data (and its sample mean) are significantly different from the value of the mean hypothesized under the null hypothesis, at the .001 level of significance."

    This decision is likely to be wrong (a Type I error) 1 time out of 1000. Thus, the probability of a Type I error is .001.
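
    In practice, each α level translates into a critical value of Z that marks the boundary between "likely" and "unlikely" sample means. As a quick illustration (a sketch added here, not part of the original notes), scipy can compute the two-tailed critical values for the three conventional α levels:

      from scipy.stats import norm

      # Two-tailed critical values: the cutoff |Z| beyond which a sample mean
      # falls in the "unlikely" region when Ho is true.
      for alpha in (.05, .01, .001):
          z_crit = norm.ppf(1 - alpha / 2)   # upper alpha/2 quantile of N(0, 1)
          print(f"alpha = {alpha:<5} -> reject Ho when |Z| > {z_crit:.3f}")

    These are the familiar cutoffs: |Z| > 1.96, 2.58, and 3.29.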

    Example: We return to the example concerning the effect of prenatal exposure to alcohol on birth weight in rats. Let's assume that the researcher's sample has n=16 rat pups. We continue to assume that the population of untreated lab rats has a mean birth weight of 18 grams with a standard deviation of 4 grams.

    There are four steps involved in hypothesis testing:

    1. State the Hypotheses:
      • Null hypothesis: Alcohol consumption has no effect on birth weight. The mean weight will be 18 grams. In symbols: Ho: μ = 18
      • Alternative Hypothesis: Alcohol will affect birth weight. The mean weight will not be 18 grams. In symbols: H1: μ ≠ 18
    2. Set the decision criteria:
      • Specify the significance level. We specify: α = .05
      • Determine the standard error of the mean (the standard deviation of the distribution of sample means) for samples of size 16. The standard error is calculated by the formula:

        σM = σ/sqrt(n)

        The value is 4/sqrt(16) = 1.
      • To determine how unusual the mean of the sample we will get is, we will use the Z formula to calculate Z for our sample mean under the assumption that the null hypothesis is true. The Z formula is:

        Z = (M - μ)/σM

        Note that the population mean is 18 under the null hypothesis, and the standard error is 1, as we just calculated. All we need to calculate Z is a sample mean. When we get the data we will calculate Z and then look it up in the Z table to see how unusual the obtained sample's mean is, if the null hypothesis Ho is true.
    3. Gather Data:
      Let's say that two experimenters carry out the experiment, and they get these data:

                        Experiment 1    Experiment 2
        Sample Mean     13              16.5

    4. Evaluate Null Hypothesis:
      We calculate Z for each experiment, look up the p value for the obtained Z, and make a decision (a code sketch reproducing these calculations follows below). Here's what happens for each experiment:

                        Experiment 1          Experiment 2
        Sample Mean     13                    16.5
        Z               (13-18)/1 = -5.0      (16.5-18)/1 = -1.5
        p               p < .0001             p = .1336
        Decision        Reject Ho             Do Not Reject Ho
        Applet          ViSta Applet          ViSta Applet

      ViSta's Report for Univariate Analysis of Experiment 1 Data.

      ViSta's Report for Univariate Analysis of Experiment 2 Data.
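
      The numbers in the table above are easy to reproduce. Here is a minimal Python sketch of the same two-tailed Z test (an illustration added to these notes; ViSta itself is not required):

        from scipy.stats import norm

        def z_test_two_tailed(sample_mean, pop_mean=18, sigma=4, n=16, alpha=.05):
            # Two-tailed one-sample Z test for the rat birth-weight example.
            se = sigma / n ** 0.5                  # standard error: 4/sqrt(16) = 1
            z = (sample_mean - pop_mean) / se      # Z = (M - mu)/sigma_M
            p = 2 * norm.sf(abs(z))                # two-tailed p value
            decision = "Reject Ho" if p < alpha else "Do Not Reject Ho"
            print(f"M = {sample_mean}: Z = {z:.2f}, p = {p:.4f}, {decision}")

        z_test_two_tailed(13)    # Experiment 1: Z = -5.00, p < .0001 -> Reject Ho
        z_test_two_tailed(16.5)  # Experiment 2: Z = -1.50, p = .1336 -> Do Not Reject Ho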


  5. Directional (One-Tailed) Hypothesis Testing

    What we have seen so far is called non-directional, or "two-tailed", hypothesis testing. It's called this because the critical region is in both tails of the distribution. It is used when the experimenter expects a change, but doesn't know which direction it will be in.

    Non-directional (Two-Tailed) Hypothesis
    The statistical hypotheses (Ho and H1) specify a change in the population mean score.

    In this section we consider directional, "one-tailed", hypothesis testing. This is what is used when the experimenter expects a change in a specified direction.

    Directional (One-Tailed) Hypothesis
    The statistical hypotheses (Ho and H1) specify either an increase or a decrease in the population mean score.

    Example: We return to the survey data that we obtained on the first day of class. Recall that our sample has n=41 students.

    Sample Statistics and Population Parameters for SAT Math

      Sample Statistics (n = 41)
        Sample Mean = 589.39
        Sample Standard Deviation = 94.35

      Population Parameters
        Population Mean = 460
        Population Standard Deviation = 100

    [Sample frequency distribution of SAT Math scores: red is for males, blue for females.]

    The same four steps are involved in both directional and non-directional hypothesis testing. However, some details are different. Here is what we do for directional hypothesis testing:

    1. State the Hypotheses:
      • Alternative Hypothesis: Students in this class are sampled from a restricted selection population whose SAT Math Scores are above the unrestricted population's mean of 460. There is a restrictive selection process for admitting students to UNC that results in SAT Math scores above the mean: Their mean SAT score is greater than 460.
      • Null hypothesis: Students in this class are not sampled from a restricted selection population whose SAT Math Scores are above the unrestricted population's mean of 460. There is an unrestrictive selection process for admitting students to UNC: Their mean SAT score is not greater than 460.
      • In symbols: H1: μ > 460; Ho: μ ≤ 460
    2. Set the decision criteria:
      • Specify the significance level. We specify: α = .05
      • Determine the standard error of the mean (the standard deviation of the distribution of sample means) for samples of size 41. The standard error is calculated by the formula:

        σM = σ/sqrt(n)

        The value is 100/sqrt(41) = 15.6.

      • To determine how unusual the mean of the sample we will get is, we will use the Z formula to calculate Z for our sample mean under the assumption that the null hypothesis is true. The Z formula is:

        Z = (M - μ)/σM

        Note that the population mean is 460 under the null hypothesis, and the standard error is 15.6, as we just calculated. All we need to calculate Z is a sample mean. When we get the data we will calculate Z and then look it up in the Z table to see how unusual the obtained sample's mean is, if the null hypothesis Ho is true.
    3. Gather Data:
      We gathered the data on the first day of class and observed that the class's mean on SAT Math was 589.39.

    4. Evaluate Null Hypothesis:
      We calculate Z, look up the p value for the obtained Z, and make a decision (a code sketch of this calculation follows below). Here's what happens:

        Z = (589.39 - 460)/15.6 = 8.29

      The p value is far below .00001, so we reject the null hypothesis that there is an unrestrictive selection process for admitting students to UNC. We conclude that the selection process results in SAT Math scores for UNC students that are higher than those of the population as a whole.

      Try the ViSta Applet for carrying out this analysis. You should get the following report.

      ViSta's Report for Univariate Analysis of SAT Math Scores.
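
      The same calculation can be reproduced directly. A minimal one-tailed sketch (again, an illustration added to these notes) using the class's numbers:

        from scipy.stats import norm

        # One-tailed (upper) Z test: Ho: mu <= 460 vs H1: mu > 460
        sample_mean, pop_mean, sigma, n = 589.39, 460, 100, 41

        se = sigma / n ** 0.5                 # 100/sqrt(41) ~= 15.6
        z = (sample_mean - pop_mean) / se     # ~= 8.29
        p = norm.sf(z)                        # one-tailed p: area above z
        print(f"Z = {z:.2f}, p = {p:.2e}")    # p is far below .00001 -> reject Ho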


  6. Statistical Power

    As we have seen, hypothesis testing is about seeing if a particular treatment has an effect. Hypothesis testing uses a framework based on testing the null hypothesis that there is no effect. The test leads us to decide whether or not to reject the null hypothesis.

    We have examined the potential for making an incorrect decision, looking at Type I and Type II errors and at the associated significance level (the probability of making a Type I error).

    We now reverse our focus and look at the potential for making a correct decision. This is referred to as the power of a statistical test.

    Statistical Power
    The power of a statistical test is the probability that the test will correctly reject a false null hypothesis. The more powerful the test is, the more likely it is to detect a treatment effect when one really exists.

    Power and Type II errors:
    When a treatment effect really exists, the hypothesis test:
    1. can fail to discover the treatment effect (making a Type II error). The probability of this happening is denoted:
      β = P[Type II error]
    2. can correctly detect the treatment effect (rejecting a false null hypothesis). The probability of this happening, which is the power of the test, is denoted:
      1 - β = power = P[rejecting a false Ho].

    Here is a table summarizing the power and significance of a test and their relationship to Type I and Type II errors, and to "alpha" and "beta", the probabilities of a Type I and a Type II error, respectively:

    Decisions in Hypothesis Testing

                               Actual Situation
                               No Effect              Effect Exists
                               (Ho True)              (Ho False)

    Decision: Reject Ho        Type I Error           Correct Decision
                               (Test Significance)    (Test Power)
    Decision: Retain Ho        Correct Decision       Type II Error

    How do we determine power?

    Unfortunately, we don't know the exact value of β, so we don't know the exact power of a test. We do know, however, that the power of a test is affected by:
    1. Alpha Level: Reducing the value of alpha also reduces the power. So if we wish to be less likely to make a Type I error (concluding there is an effect when there isn't), we are also less likely to see an effect when there is one.
    2. One-Tailed Tests: One-tailed tests are more powerful than two-tailed tests. They make it easier to reject the null hypothesis when the effect is in the predicted direction.
    3. Sample Size: Larger samples are better, period. With the Type I error rate fixed at α, a larger sample makes the test more powerful and a Type II error less likely. (A worked sketch of all three influences follows below.)
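
    These three influences are easy to see numerically. Here is a minimal sketch (added for illustration) that computes the power of the Z test for the earlier rat example, under an assumed true mean of 16 grams; the 16 is a hypothetical effect size chosen only to make the comparison concrete:

      from scipy.stats import norm

      def z_test_power(true_mean, pop_mean=18, sigma=4, n=16, alpha=.05, tails=2):
          # Power of the one-sample Z test of Ho: mu = pop_mean when the true
          # population mean is `true_mean`.
          se = sigma / n ** 0.5
          shift = (true_mean - pop_mean) / se   # where the Z statistic is centered
          if tails == 2:
              z_crit = norm.ppf(1 - alpha / 2)
              return norm.cdf(-z_crit - shift) + norm.sf(z_crit - shift)
          # one-tailed test in the predicted (lower) direction
          return norm.cdf(norm.ppf(alpha) - shift)

      print(f"baseline (alpha=.05, 2-tailed, n=16): {z_test_power(16):.3f}")
      print(f"smaller alpha (.01):                  {z_test_power(16, alpha=.01):.3f}")
      print(f"one-tailed:                           {z_test_power(16, tails=1):.3f}")
      print(f"larger sample (n=64):                 {z_test_power(16, n=64):.3f}")

    Under these assumptions, lowering α from .05 to .01 drops the power from about .52 to about .28, a one-tailed test in the predicted direction raises it to about .64, and quadrupling the sample size raises it to about .98.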