Welcome students,
It's my pleasure to introduce myself. My name is Brian Powers, and I'll be your online instructor for this course in Applied Statistics. I've taught statistics in Chicago at the University of Illinois at Chicago as well as privately. In 2016 I moved to Scottsdale, Arizona. This is my Nth term teaching MAT240 (I've lost count - how embarrassing!), and I'm very excited to have a new group of students.
This statistics course will expose you to some of the most important topics and techniques of statistics. We'll be looking at probability distributions and the Central Limit Theorem. These will provide a basis for the estimation and decision-making techniques that follow: construction of confidence intervals and hypothesis testing, to name a few. We'll also look at correlation and linear regression. A few things I want to bring to your attention right away:
1. Please be sure to read the course syllabus.
It's in the Course Information section on the left. In particular, you should all be aware of how your grade for the course is calculated:
2. Have Access to Excel
In this course we'll be using Excel to do our analysis. It's important that you make sure you have access to Excel (Office 365 is available for free from SNHU). This is a big change from previous terms. I will be sharing resources every week on how to use statistical functions and tools in Excel to help you do your weekly work.
3. Register your account in zyBooks
The textbook resource we'll be using is zyBooks - please be sure to carefully read the announcement about registering for this course section in zyBooks. Please note: You must TYPE IN our course section exactly when registering. Please be careful on this step; it will save us all some headaches!
4. Ask Questions!!
Please ask questions if you're not sure about something. The General Questions discussion board is a great place to put your questions. I am subscribed to it, and if I don't answer your question right away another student may answer before me! Also other students will benefit from the answers you get. You can also email me - My contact info is under the My Instructor announcement.
I hope you all get a lot out of this course in Statistics. I will do whatever I can to help the course material be as relevant to you as possible, and to help you get a better understanding of these fundamental concepts in gathering, analyzing and interpreting data, and using it to make sense of the world.
Dr. Powers
Module 1 is concerned with descriptive statistics, data visualizations and summaries. A statistic is any measurement or calculation resulting from some data. This definition is quite vague, of course. There are some useful statistics and some not-so-useful (depending on the data and context). Descriptive statistics describe the sample data, while inferential statistics attempt to infer something about the larger population. When we say "population" in this context, we do not mean the population of human inhabitants of the U.S. (for example). In statistics, "population" refers to the realm of possibilities from which we sample data (which, of course COULD be the U.S. population).
Descriptive statistics of "central tendency" (mean, median and mode) describe the "average", "middle" or "most common observation" in our data. Statistics of "spread" or "dispersion" (range, variance, standard deviation) describe how much variability there is in our data. Visualizations are charts and graphs that are useful visual summaries of the data. Histograms and box plots will be useful for visualizing numerical (quantitative) data, while bar plots (and once in a rare while pie charts) will be useful for categorical (qualitative) data.
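If you'd like a head start in Excel, here is a quick sketch of the functions for these summaries, assuming (just for illustration) that your data values are in cells A2:A51 - adjust the range to match your own spreadsheet:

Mean: =AVERAGE(A2:A51)
Median: =MEDIAN(A2:A51)
Mode: =MODE.SNGL(A2:A51)
Range: =MAX(A2:A51)-MIN(A2:A51)
Sample variance: =VAR.S(A2:A51)
Sample standard deviation: =STDEV.S(A2:A51)

In recent versions of Excel (including Office 365) you can also insert a histogram or box-and-whisker chart directly from the Insert > Charts group.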
Grade Level | Number of classrooms | Total Students | Males | Females
9th Grade | 10 | 200 | 100 | 100
10th Grade | 12 | 250 | 150 | 100
11th Grade | 13 | 250 | 100 | 150
12th Grade | 15 | 300 | 150 | 150
Totals | 50 | 1000 | 500 | 500
Grade | Male | Female
9th grade | 8 | 8
10th grade | 12 | 8
11th grade | 8 | 12
12th grade | 12 | 12
This week we attempt to estimate how related two variables are. When we talk about "related" in this context, we are talking about a linear relationship. A linear relationship means that if X increases by 1 unit, then Y should increase (or decrease) by some fixed amount, on average. You may have at some point seen some of these ideas - scatterplots, linear regression, line of best fit. We'll be looking at all of these, and of course we will use Excel tools to make the work a bit easier (so we don't have to stress out about math formulas).
A linear relationship between variables X and Y follows a linear equation Y = β0 + β1X + ε, where β0 is the intercept (sometimes called the intercept coefficient), β1 is the slope coefficient, and ε represents some random "noise". We are not interested in estimating the noise; what we are interested in estimating is the intercept and the slope. We're also going to be interested in estimating the expected value of Y (E(Y), the average value of Y) for a given X value: E(Y) = β0 + β1X
Notice that the epsilon (the noise) is gone - because its average value is zero.
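When we get to estimating the intercept and slope from data, Excel has built-in functions that do the heavy lifting. A minimal sketch, assuming (hypothetically) that your X values are in B2:B51 and your Y values are in C2:C51:

Estimated slope (b1): =SLOPE(C2:C51, B2:B51)
Estimated intercept (b0): =INTERCEPT(C2:C51, B2:B51)

Notice that both functions take the Y range first and the X range second.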
The strength of the linear relationship between two variables is called correlation. The term is used loosely in our vernacular, but for a statistician it is important that we only talk about correlation when there is a linear relationship. Whether two variables are positively correlated or negatively correlated, and whether the correlation is strong, weak or nonexistent - these are all important questions that we're interested in. But you should always remember - correlation does not mean causation. Just because a linear relationship exists, it does not mean that one variable CAUSES the response in the other. There are a number of reasons for this - and determining causation is another challenge altogether.
Hello everybody. This week you have your first assignment due. When you are writing it up, please bear in mind these tips which will help you get the best grade possible:
These should help you do really well!
This week we'll be looking at the assumptions of linear regression and the conditions you need to look for in the data. We have a few tools we can use to check those assumptions to see if they are violated by the data. One tool is the residual plot. If we take all of the residuals and plot them (as 'y' variables) against the X variables in a scatterplot, we've created a residual plot. To see how we use this residual plot, let's take a quick look at the simple linear regression model:
Y = β0+β1X+ε
The first assumption, of course, is that the variables Y and X have a linear relationship. You can check this assumption by looking at the shape of the residual plot. Here are two examples:
The residual plot on the left looks fine - this is the kind of shape you'd see when X and Y have a nice linear relationship. The residual plot on the right is indicative of a CURVED relationship between X and Y. When X and Y are related in this way, then simple linear regression is not appropriate.
Another assumption is that the variation around the regression line doesn't depend on the value of X. Take a look at these three residual plots:
The first one looks fine - the spread around the 0 line is about the same all across the residual plot. In the second residual plot the spread is noticeably less in the middle (maybe 1/3 the size of the spread on the ends). The residual plot on the right also shows changing variation of residuals - the variation decreases as the X values increase. Residual plots 2 and 3 exhibit heteroscedasticity, which means non-constant variation. That's a violation of this assumption, and it means that simple linear regression is not appropriate for this data.
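If you'd like to build a residual plot yourself in Excel, one approach (a sketch, assuming your X values are in B2:B51, your Y values in C2:C51, and the estimated intercept and slope in cells F1 and F2 - all hypothetical cell locations) is:

Predicted Y for row 2: =$F$1 + $F$2*B2
Residual for row 2: =C2 - ($F$1 + $F$2*B2)

Copy those formulas down the column, then make a scatterplot of X against the residuals. The Regression tool in the Data Analysis ToolPak can also generate residuals and residual plots for you if you check the appropriate boxes.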
We'll look at two new statistics used to measure how well the linear model fits the data. The first is the correlation coefficient, which we denote r, and it ranges from -1 to 1. The closer to 0 it is, the weaker the linear relationship. As you may guess, the sign of this number will also tell you whether the relationship is positive or negative.
The second is the coefficient of determination, (aka R squared) and it's denoted R2. It is literally r*r, and as such ranges from 0 to 1. This statistic tells you what proportion of the variation of Y is explained by variation in X. So if R2=.423, you can say that 42.3% of the variation of Y is explained by the variation in the X variable.
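Both of these come straight out of Excel. A quick sketch, again assuming (just for illustration) X values in B2:B51 and Y values in C2:C51:

Correlation coefficient r: =CORREL(B2:B51, C2:C51)
Coefficient of determination R2: =RSQ(C2:C51, B2:B51)

You can check for yourself that the second result is just the first one squared.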
As you work on Module 3 please keep the following in mind:
Please use the General Questions forum if you have questions on this assignment! I will answer your question and everybody else in the class who might have the same question will benefit from you asking!
Here we will develop our understandings of probability. We first need to understand the concept of a "probability distribution". Essentially, if we consider some numerical variable that has an uncertain value (tomorrow's temperature, how many emails I get today, how long it takes to drive to work, how much I will win on a scratch ticket) and we somehow assign probability to each of the possible values, we have just created a probability distribution. If there are only a few possible values (a scratch ticket could be $0, $5, $50 or $1000, or I could get 0,1,2,3,4,... emails) then we are talking about a discrete random variable. If the value could be any number in a range (the temperature, the length of commute time) it is known as a continuous random variable, and has a continuous probability distribution.
As a rule of thumb: discrete random variables are counted, continuous random variables are measured.
This week we will be looking at a very important probability distribution: The (so-called) Normal Distribution. This is an important continuous probability distribution for a few reasons. First of all, many natural phenomena have normal (or close to normal) distributions:
For instance, take a look at this histogram (source: https://help.plot.ly/excel/histogram/). You will see the bars form a bell curve shape.
Secondly, the normal distribution is important in statistics because of the Central Limit Theorem, which says essentially that when we try to learn something about the mean of a population by sampling, we can use the normal distribution even when the population is NOT normally distributed. This is part of the magic of random sampling!
This week you will be working a lot with the normal distribution. In your problem set, two of the problems involve calculating normal probabilities and calculating normal percentiles - that is, finding the probability that a value falls below (or above) some cutoff, and finding the value that corresponds to a given percentile.
Something I'd like you all to know ahead of time: The "Normal distribution" is called normal through an accident of history. It is not "normal" in the sense of ordinary or typical. A normal distribution means bell-shaped and symmetric. You could also call it "Gaussian" or "Bell-Curve" - please don't get confused on this point.
We'll also look at the T distribution, which looks like the normal distribution but with fatter tails (and is always centered at zero). The T distribution is important for statistical inference, but we'll get into that next week. For now you'll need to get some practice calculating probabilities with these distributions.
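Here is a quick sketch of the Excel functions you'll be leaning on this week; the means, standard deviations and cutoff values are just placeholders:

P(X < 70) for a normal with mean 65 and standard deviation 5: =NORM.DIST(70, 65, 5, TRUE)
The 90th percentile of that same normal: =NORM.INV(0.90, 65, 5)
P(T < 1.5) for a t distribution with 20 degrees of freedom: =T.DIST(1.5, 20, TRUE)

The TRUE argument asks for the cumulative (less-than) probability; subtract the result from 1 if you need a greater-than probability.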
Hi everybody. On Sunday your Project One is due. It combines the stuff you've been doing in modules 2 and 3. Here are some tips to make sure you get the best score possible
So far in this course we've covered some of the fundamental ideas about probability and uncertainty and studied the normal distribution, and now we have two of the most important applications of this distribution (due to the Central Limit Theorem) - statistical estimation and hypothesis testing. When estimating, we spend our time on interval estimation - confidence intervals. A confidence interval is a good way of saying "Based on the data, I'm pretty confident that the population mean is between X and Y". The primary objective of testing is answering a question such as "Everybody says the population mean is X. I think it is greater than X. What conclusion should I reach based on the data?"
When we wish to make some decision about the world by looking at only a small part of it, we have to make a conclusion knowing that we could be wrong. In statistical inference this is a theme we will never stop seeing - living in harmony with uncertainty.
A hypothesis test essentially comes down to these components:
For example, suppose a marketing company wants to decide if the average price a customer would pay for a product is $14.50 or if the average price they'd pay is less than $14.50. The null hypothesis is the premise where we specify exactly what the average is: "The average acceptable price is $14.50" (it could even be more, but in this case we simplify the null to be equal) and the alternative is "the average acceptable price is LESS THAN $14.50." Formally, using typical notation, we write
H0: μ = 14.50 (some textbooks use μ ≥ 14.50)
H1: μ < 14.50
Suppose they collect good quality survey data from 100 people, and in the sample the average price people say they'd pay is $14.37 with a standard deviation of $0.72. We calculate a test statistic, which we call "t" because it (theoretically) has a t distribution as its sampling distribution:
t = (x̄ - μ0)/(s/√n) = (14.37 - 14.50)/(0.72/√100) ≈ -1.8056
We use μ0 = 14.50 because in the null hypothesis we assume that μ = 14.50.
Finally we determine if this test statistic would be unusual if the null hypothesis WERE true. We can determine this by either
a) calculating a rejection region and rejecting the null hypothesis if t is in this region or
b) calculating a "p-value" for this test statistic and rejecting the null hypothesis if this "p-value" is smaller than the "significance level" for this test.
In this case, because the alternative hypothesis is "μ < 14.50", we call this a lower-tailed test. The p-value is P(T < t) for a t distribution with n-1 = 99 degrees of freedom, i.e.
P(T < -1.8056) = 0.037. The p-value is the probability of observing a test statistic like we did IF the null hypothesis is true. We usually specify some significance level like .05 or so. This .037 p-value means the data collected would not be plausible if the null hypothesis were true.
A rejection region for a significance level of .05 would be any value t less than t*, a critical value such that P(T<t*)=.05 exactly. For this distribution (a t distribution with 99 degrees of freedom) this is -1.6604. This picture shows the rejection region as well as the test statistic. You can see that our test statistic is IN the rejection region:
Our conclusion: We reject the null hypothesis. We would end our test with a concluding sentence, something like this: "From the data collected, we have substantial evidence that the true average price that people would pay for this product is less than $14.50."
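If you want to reproduce this example in Excel, here is a sketch with the summary numbers typed in directly (with raw data you would use AVERAGE, STDEV.S and COUNT instead of typing 14.37, 0.72 and 100):

Test statistic t: =(14.37-14.50)/(0.72/SQRT(100)) gives about -1.8056
p-value (lower tail): =T.DIST(-1.8056, 99, TRUE) gives about 0.037
Critical value at the .05 level: =T.INV(0.05, 99) gives about -1.6604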
The assignment this week is worded very strangely. So I want to try to clarify the premise.
One of the company’s Pacific region salespeople just returned to the office with a newly designed advertisement. It states that the average cost per square foot of his home sales is above the average cost per square foot in the Pacific region. He wants you to make sure he can make that statement before approving the use of the advertisement. The average cost per square foot of his home sales is $275.
So let's take it as a given that this particular agent's average cost per square foot (CPSF) is $275 exactly. What's unknown to us is the (population) average CPSF for the Pacific region, μ. His claim is that $275 is above the average (i.e. $275 > μ).
We are used to putting μ on the left side. So this claim can be stated as μ < $275. Because this claim is the inequality, it is the ALTERNATIVE hypothesis. Thus the null hypothesis is that μ = $275. So I'll restate:
H0: μ = 275
Ha: μ < 275
I hope this helps clarify!
Other tips:
Last week you got some practice in statistical inference, doing hypothesis tests. The tasks of statistical inference (inferring something about a population from just a sample) broadly fall into two categories: estimation and decision making. Right now we are concerned with estimation. If you are given the task of estimating the mean number of recharge cycles of laptop batteries, how would you go about doing this? Certainly taking a large sample of laptop batteries is a good start, and calculating the sample mean is probably a good estimate.
Most likely the mean in your sample is not EXACTLY the same as the population mean. If it's not exact, then how far off is it likely to be? That's what margins of error and confidence intervals are all about. A confidence interval may be reported something like this:
A 95% confidence interval for the mean number of recharge cycles of these batteries is 324.2 to 376.7.
We are saying that there is a 95% chance that the range from 324.2 to 376.7 contains the true population mean. If you'd like a tighter interval you have two options: 1) Take a bigger sample or 2) be more willing to be wrong (i.e. lower the confidence level from 95%).
We'll typically use the t distribution for the sampling distribution of x-bar, as we did with hypothesis testing. There are some very specific situations where it's OK to use a standard normal instead (but honestly, they're completely unrealistic).
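In Excel, a minimal sketch of a t-based confidence interval, assuming (hypothetically) your sample is in cells A2:A101 and you want 95% confidence (so alpha = 0.05):

Sample mean: =AVERAGE(A2:A101)
Margin of error: =CONFIDENCE.T(0.05, STDEV.S(A2:A101), COUNT(A2:A101))
Lower limit: =AVERAGE(A2:A101) - CONFIDENCE.T(0.05, STDEV.S(A2:A101), COUNT(A2:A101))
Upper limit: =AVERAGE(A2:A101) + CONFIDENCE.T(0.05, STDEV.S(A2:A101), COUNT(A2:A101))

Note that the first argument of CONFIDENCE.T is alpha, which is 1 minus the confidence level.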
This week you have a discussion post due. You're asked to consider what sample size and resulting margin of error would be best for the situation. Specifically, you're imagining that you are an analyst at a data science company. The B&K real estate company wants to get an estimate for the median listing price for the Northeast region. You have different options for sampling. The discussion prompt gives only two options: n=100 for $2000 or n=1000 for $10000.
I would like you to consider this expansion. To really get you thinking about the cost of sampling, suppose your company offers three tiered data collection/analysis packages, something like this:
Bronze Package: $2000 for 100 sample size + $1000 for each additional 100
Silver Package: $10000 for 1000 sample size + $1000 for each additional 200
Gold Package: $25000 for 4000 sample size + $1000 for each additional 250
Suppose the population standard deviation is approximately $125,000. You can calculate the approximate margin of error for any sample size using ME = 1.96 × 125000/sqrt(n). For instance, the margin of error for n=100 would be about $24,500, and the margin of error for n=1000 would be about $7,750 (these numbers are more realistic than the ones in the prompt; I'd like you to use these numbers instead).
You are a consultant trying to explain the pros and cons of a larger sample size. Come up with a recommended sample size for the B&K company and explain why this is the right choice for them.
THERE IS NOT JUST ONE RIGHT ANSWER. Try to act in an impartial way, even though in this scenario you're acting as a salesperson as well as an analyst.
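One way to organize your thinking is to build a small table of margin of error versus sample size in Excel. A sketch, assuming the sample sizes you're pricing out are listed in column A starting in cell A2:

Margin of error for the n in A2: =1.96*125000/SQRT(A2)

Copy that formula down for each sample size, then weigh the drop in margin of error against the extra cost of each package.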
Hi everybody. When making your discussion post, I want you to think about the two concepts of accuracy vs precision.
If I am predicting with 90% accuracy, then there's a 90% chance that I am correct. If I guess a person's age to be between 34 and 37, and I have 90% accuracy, then there's a 90% chance that I'm right. The precision of my estimate is how wide of a margin of error I have. If I guess someone else's age to be between 55 and 65, that's a far less precise estimate, though it may be just as accurate, if there's a 90% chance that it's correct.
A 95% confidence interval has 95% accuracy regardless of how wide it is - it's going to correctly guess the population parameter 95% of the time; but its precision depends on the sample size. The larger the sample size, the more precise of an estimate is possible.
Hope this helps!
Dr. Powers
So far we've looked at the two main branches of statistical inference - decision making and estimation - as they relate to the population mean. Now we will turn our attention to a different population parameter, the population proportion. For example, suppose a marketing company wants to decide if more than 65% of the public would purchase a new product. The null hypothesis is "no, only 65% of the public would purchase it" (it could even be less, but we simplify the null to be equal) and the alternative is "yes, more than 65% would purchase it." Formally, using typical notation, we write
H0: p = .65
H1: p > .65
Suppose they collect good quality survey data from 100 people, and in the sample 68 people say they would purchase it. The sample proportion then would be denoted p̂ = 0.68. We calculate a test statistic, which we call "z" because it (theoretically) has a standard normal sampling distribution:
z = (p̂ - p0)/√(p0(1-p0)/n) = (0.68 - 0.65)/√(0.65(0.35)/100) ≈ 0.629
We use p0 = .65 because in the null hypothesis we assume that p = .65.
Finally we determine if this test statistic would be unusual if the null hypothesis WERE true. We can determine this by either
a) calculating a rejection region and rejecting the null hypothesis if z is in this region or
b) calculating a "p-value" for this test statistic and rejecting the null hypothesis if this "p-value" is smaller than the "significance level" for this test.
In this case, because the alternative hypothesis is "p > .65", we call this an upper-tailed test. The p-value is P(Z > z), i.e. P(Z > 0.629) = 0.2647.
The p-value is the probability of observing a test statistic like we did IF the null hypothesis is true. We usually specify some significance level like .05 or so. This .26 p-value means the data collected is quite plausible if the null hypothesis were true (if the null hypothesis were true, about 1 out of 4 samples would give data like we see, that's not unusual at all).
Our conclusion: We do not reject the null hypothesis. Even though we observed .68 (which is more than the .65 from the null hypothesis) from a sample size of 100, the difference is simply not big enough to conclude that the population proportion is more than .65!
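If you'd like to reproduce this calculation in Excel, a sketch with the numbers typed in directly:

Test statistic z: =(0.68-0.65)/SQRT(0.65*(1-0.65)/100) gives about 0.629
p-value (upper tail): =1-NORM.S.DIST(0.629, TRUE) gives about 0.2647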
For confidence intervals, because there's no hypothesis, we use the sample proportion for calculating the standard error (and thus the margin of error). The formula for the confidence interval is just like the one for the population mean: p̂ ± z*·√(p̂(1-p̂)/n).
The z* is the critical z value for the confidence level that we are using. For example, a 95% confidence interval will use z*=1.96. As you can see, the larger the sample size, or the more extreme the sample proportion, the smaller the margin of error will be. But if you increase your confidence level, the interval will widen.
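In Excel, a sketch of a 95% confidence interval using the numbers from the example above (p̂ = 0.68, n = 100):

Critical value z*: =NORM.S.INV(0.975) gives about 1.96
Margin of error: =1.96*SQRT(0.68*(1-0.68)/100) gives about 0.091
Lower limit: =0.68 - 1.96*SQRT(0.68*(1-0.68)/100)
Upper limit: =0.68 + 1.96*SQRT(0.68*(1-0.68)/100)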
Here are some tips to help you get the best grade on this project:
This week's module continues with the ideas of hypothesis testing, now extending it to comparing two populations or groups. Remember that we use the term "population" very broadly. In the statistical sense, a population could be anything you want to make inferences about. Two sample hypothesis tests are used all the time, especially in clinical testing. Say we want to decide whether a new allergy medicine is better than the leading brand (whatever that happens to be) or a placebo. Specifically, we want to decide between two hypotheses:
Null Hypothesis: The new drug is essentially the same as the leading brand (or placebo)
Alternative Hypothesis: The new drug is better
We get two groups of people, Group A (takes the new drug) and Group B (takes the leading brand or placebo). (You may see 1 and 2 used instead of A and B, by the way). We want to conclude whether the new drug results in fewer allergy attacks (on average) than the leading brand. Let μA represent the average number of allergy attacks for people on the new drug, and μB the average number of allergy attacks for people using the leading brand. Now we can state these hypotheses more specifically:
H0: μA = μB
H1: μA < μB
Although you may equivalently state it like this (you will see both forms):
H0: μA - μB = 0
H1: μA - μB < 0
As with hypothesis testing in general, we collect some data, calculate a test statistic, and decide (based on the sampling distribution of the test statistic) if it is significant enough to reject the null hypothesis. In this case the test statistic will be based on the difference between the two sample means, what we might call x̄A and x̄B.
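When we get to the calculations, Excel's T.TEST function can give you the p-value directly. A sketch, assuming (hypothetically) Group A's counts are in A2:A31 and Group B's counts are in B2:B31:

p-value: =T.TEST(A2:A31, B2:B31, 1, 3)

The third argument is the number of tails (1 here, since our alternative is one-sided) and the fourth chooses the type of test (1 = paired, 2 = two samples with equal variances, 3 = two samples with unequal variances).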
There are essentially THREE types of two sample inferences we will be looking at: