Why is regression superior to correlation?

Pearson's r measures the linear relationship between two variables, say X and Y. A correlation of 1 means the data points lie perfectly on a line along which Y increases as X increases. A value of -1 also means the points lie perfectly on a line, but one along which Y decreases as X increases.

The formula for r is the covariance of X and Y divided by the product of their standard deviations: r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² ). To calculate the Pearson correlation in R, we can use the cor function; the default method for cor is the Pearson correlation. Getting a correlation is generally only half the story, and you may want to know whether the relationship is statistically significantly different from 0. To assess statistical significance, you can use cor.test, whose output is headed "Pearson's product-moment correlation". In the body-fat data used later in this tutorial, the correlation between age and Brozek percent body fat is positive: as age increases, so does percent body fat.
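
A minimal sketch of these calls (the data frame fat and the column names age and pctfat.brozek are assumptions based on the dataset described later in this tutorial; adjust them to your data):

    # Pearson correlation between age and Brozek percent body fat.
    # "fat", "age", and "pctfat.brozek" are assumed names; adjust to your data.
    cor(fat$age, fat$pctfat.brozek)        # default method is "pearson"

    # Test whether the correlation differs significantly from zero
    cor.test(fat$age, fat$pctfat.brozek)   # prints "Pearson's product-moment correlation"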

Spearman's rank correlation is a nonparametric measure of correlation that uses the ranks of the observations in its calculation rather than their original numeric values. It measures the monotonic relationship between two variables X and Y: if Y tends to increase as X increases, the Spearman correlation coefficient is positive; if Y tends to decrease as X increases, it is negative. A value of zero indicates that there is no tendency for Y to either increase or decrease as X increases.

The Spearman correlation makes no assumptions about the distribution of the data. When there are no ties, it can be written as ρ = 1 − 6 Σ dᵢ² / (n(n² − 1)), where dᵢ is the difference between the ranks of the i-th pair of observations, but there is no need to memorize this formula: in practice the software computes it from the ranks for you.

Running cor.test with method = "spearman" produces output headed "Spearman's rank correlation rho", and the conclusion is the same as before: as age increases, so does percent body fat.
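
A minimal sketch of these calls (column names assumed as before):

    # Spearman rank correlation: same functions, method = "spearman".
    cor(fat$age, fat$pctfat.brozek, method = "spearman")
    cor.test(fat$age, fat$pctfat.brozek, method = "spearman")  # "Spearman's rank correlation rho"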

Correlation, useful though it is, is one of the most misused statistics in all of science. People always seem to want a single number describing a relationship, yet data very rarely obey this imperative. It is clear what a Pearson correlation of 1 or -1 means, but how do we interpret a correlation of 0? It is not so clear. A relationship between x and y can be far more than "correlated" in the colloquial sense, it can be totally deterministic, and yet, if the relationship is not linear, data generated from it can still have a Pearson correlation of 0.
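
As an illustration (the specific relationship here, y = x², is an assumption, since the original figure is not reproduced), y is completely determined by x, yet the Pearson correlation is essentially zero:

    # A deterministic but nonlinear relationship with Pearson correlation ~ 0
    x <- seq(-1, 1, by = 0.01)   # values placed symmetrically around zero
    y <- x^2                     # y is completely determined by x
    cor(x, y)                    # essentially zero (up to floating-point error)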

The third measure of correlation that the cor command can take as an argument is Kendall's tau (τ). Some people have argued that τ is in some ways superior to the other two measures, but the fact remains that everyone still uses either Pearson or Spearman. Regression analysis is commonly used for modeling the relationship between a single dependent variable Y and one or more predictors. When we have one predictor, we call this "simple" linear regression: E[Y] = β₀ + β₁X.

That is, the expected value of Y is a straight-line function of X. The betas are selected by choosing the line that minimizes the squared vertical distance between each observed Y value and the line of best fit; that is, they are chosen to minimize the expression Σᵢ (yᵢ − β₀ − β₁xᵢ)². When we have more than one predictor, we call it multiple linear regression: E[Y] = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ.

The fitted values, i.e. the values of Y that the model predicts for each observed X, and the residuals, the observed values of Y minus the fitted values, can both be extracted from the fitted model. With more than one predictor, instead of a "line of best fit" there is a "plane of best fit" (see James et al., An Introduction to Statistical Learning, Springer). Let's start with simple regression. In R, models are typically fitted by calling a model-fitting function, in our case lm, with a "formula" object describing the model and a "data.frame" containing the variables used in the formula.

A typical call, and the functions used to work with the resulting fitted model, are sketched below. The fitted model can be printed, summarized, or visualized; the fitted values and residuals can be extracted; and predictions for new values of X can be computed, using functions such as summary, fitted, residuals, and predict.
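
A generic sketch of that workflow (dat, y, and x are placeholder names, not objects from this tutorial):

    # Generic lm workflow; "dat", "y", and "x" are placeholders
    fit <- lm(y ~ x, data = dat)                        # formula + data frame
    summary(fit)                                        # coefficients, R-squared, tests
    fitted(fit)                                         # fitted values
    residuals(fit)                                      # residuals (observed minus fitted)
    predict(fit, newdata = data.frame(x = c(1, 2, 3)))  # predictions for new X values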

Next, we will fit a simple linear regression. The fat data frame contains observations (individuals) on 19 variables; we don't need all of them here, so let's create a smaller dataset to use. Suppose we are interested in the relationship between percent body fat and neck circumference. The summary of the fitted model lists two coefficients, the intercept and the slope for neck. Plotting percent body fat against neck circumference produces a scatterplot, and the abline function extracts the coefficients of the fitted model and adds the corresponding regression line to the plot, as sketched below.
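
A sketch of the fit and the plot described above (the column names pctfat.brozek and neck are assumptions; substitute the names used in your copy of the data):

    # Fit percent body fat on neck circumference, then plot with the fitted line.
    fit <- lm(pctfat.brozek ~ neck, data = fat)
    summary(fit)                      # shows the (Intercept) and neck coefficients

    plot(pctfat.brozek ~ neck, data = fat,
         xlab = "Neck circumference", ylab = "Percent body fat")
    abline(fit)                       # add the fitted regression line to the scatterplot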

When two variables move in the same direction, so that an increase in one is accompanied by an increase in the other, the situation is known as positive correlation; when they move in different directions, so that an increase in one variable results in a decrease in the other and vice versa, the situation is known as negative correlation. For instance, the price and demand of a product are negatively correlated: an increase in price leads to a decrease in the quantity demanded. Regression, by contrast, is a statistical technique for estimating the change in a metric dependent variable due to a change in one or more independent variables, based on the average mathematical relationship between two or more variables. It plays a significant role in many human activities, as it is a powerful and flexible tool used to forecast past, present, or future events on the basis of past or present events.

In a simple linear regression there are two variables, x and y, where y depends on x, or is influenced by x. Here y is called the dependent (or criterion) variable and x the independent (or predictor) variable. The regression line of y on x is expressed as y = a + bx, where a is the intercept and b is the slope. With the above discussion it is evident that there is a big difference between these two mathematical concepts, even though they are usually studied together.

Correlation is used when the researcher wants to know whether the variables under study are correlated and, if they are, how strong their association is. In regression analysis, a functional relationship between two variables is established so that future values of one can be projected from the other.


Key Differences Between Correlation and Regression

The points given below explain the difference between correlation and regression in detail. A statistical measure that determines the co-relationship or association between two quantities is known as correlation.

Correlation is used to represent the linear relationship between two variables. One goal of correlation and regression, discussed more fully below, is to test whether there is any association at all; you summarize that test with a P value. For the amphipod data, you'd want to know whether bigger females had more eggs or fewer eggs than smaller amphipods, which is neither biologically obvious nor obvious from the graph. The second goal is to describe how tightly the two variables are associated.

For the exercise data, there's a very tight relationship, as shown by a high r²; the r² for the amphipod data is a lot lower. The final goal is to determine the equation of a line that goes through the cloud of points.

This is probably the most useful part of the analysis for the exercise data; if I wanted to exercise with a particular level of effort, as measured by pulse rate, I could use the equation to predict the speed I should use.

For most purposes, just knowing that bigger amphipods have significantly more eggs (the hypothesis test) would be more interesting than knowing the equation of the line, but it depends on the goals of your experiment. There's also one nominal variable that keeps the two measurements together in pairs, such as the name of an individual organism, experimental trial, or location. I'm not aware that anyone else considers this nominal variable to be part of correlation and regression, and it's not something you need to know the value of; you could indicate that a food intake measurement and weight measurement came from the same rat by putting both numbers on the same line, without ever giving the rat a name.

For that reason, I'll call it a "hidden" nominal variable. The main value of the hidden nominal variable is that it lets me make the blanket statement that any time you have two or more measurements from a single individual (organism, experimental trial, location, etc.), that individual's identity is one of the nominal variables in your analysis.

I think this rule helps clarify the difference between one-way, two-way, and nested anova. If the idea of hidden nominal variables in regression confuses you, you can ignore it. There are three main goals for correlation and regression in biology. One is to see whether two measurement variables are associated with each other; whether as one variable increases, the other tends to increase or decrease.

You summarize this test of association with the P value. In some cases, this addresses a biological question about cause-and-effect relationships; a significant association means that different values of the independent variable cause different values of the dependent.

An example would be giving people different amounts of a drug and measuring their blood pressure. The null hypothesis would be that there was no relationship between the amount of drug and the blood pressure. If you reject the null hypothesis, you would conclude that the amount of drug causes the changes in blood pressure. In this kind of experiment, you determine the values of the independent variable; for example, you decide what dose of the drug each person gets.

The exercise and pulse data are an example of this, as I determined the speed on the elliptical machine, then measured the effect on pulse rate. In other cases, you want to know whether two variables are associated, without necessarily inferring a cause-and-effect relationship.

In this case, you don't determine either variable ahead of time; both are naturally variable and you measure both of them. If you find an association, you infer that variation in X may cause variation in Y , or variation in Y may cause variation in X , or variation in some other factor may affect both Y and X. An example would be measuring the amount of a particular protein on the surface of some cells and the pH of the cytoplasm of those cells.

If the protein amount and pH are correlated, it may be that the amount of protein affects the internal pH; or the internal pH affects the amount of protein; or some other factor, such as oxygen concentration, affects both protein concentration and pH.

Often, a significant correlation suggests further experiments to test for a cause and effect relationship; if protein concentration and pH were correlated, you might want to manipulate protein concentration and see what happens to pH, or manipulate pH and measure protein, or manipulate oxygen and see what happens to both.

The amphipod data are another example of this; it could be that being bigger causes amphipods to have more eggs, or that having more eggs makes the mothers bigger (maybe they eat more when they're carrying more eggs?). The second goal of correlation and regression is estimating the strength of the relationship between two variables; in other words, how close the points on the graph are to the regression line.

You summarize this with the r² value. You would also want to know whether there's a tight relationship (high r²), which would tell you that air temperature is the main factor affecting running speed; if the r² is low, it would tell you that other factors besides air temperature are also important, and you might want to do more experiments to look for them.

You might also want to know how the r² for Agama savignyi compared to that for other lizard species, or for Agama savignyi under different conditions. The third goal of correlation and regression is finding the equation of a line that fits the cloud of points.

You can then use this equation for prediction. For example, if you have given volunteers diets with varying amounts of salt per day, and then measured their blood pressure, you could use the regression line to estimate how much a person's blood pressure would go down if they ate a given amount less salt per day.

The statistical tools used for hypothesis testing, describing the closeness of the association, and drawing a line through the points, are correlation and linear regression. Unfortunately, I find the descriptions of correlation and regression in most textbooks to be unnecessarily confusing. Some statistics textbooks have correlation and linear regression in separate chapters, and make it seem as if it is always important to pick one technique or the other.

I think this overemphasizes the differences between them. Other books muddle correlation and regression together without really explaining what the difference is. There are real differences between correlation and linear regression, but fortunately, they usually don't matter. Correlation and linear regression give the exact same P value for the hypothesis test, and for most biological experiments, that's the only really important result.

So if you're mainly interested in the P value, you don't need to worry about the difference between correlation and regression. Be aware that my approach is probably different from what you'll see elsewhere. The main difference between correlation and regression is that in correlation, you sample both measurement variables randomly from a population, while in regression you choose the values of the independent X variable.

For example, let's say you're a forensic anthropologist, interested in the relationship between foot length and body height in humans.

If you find a severed foot at a crime scene, you'd like to be able to estimate the height of the person it was severed from. You measure the foot length and body height of a random sample of humans, get a significant P value, and calculate the r². This is a correlation, because you took measurements of both variables on a random sample of people. The r² is therefore a meaningful estimate of the strength of the association between foot length and body height in humans, and you can compare it to other r² values.

You might want to see if the r² for feet and height is larger or smaller than the r² for hands and height, for example. As an example of regression, let's say you've decided forensic anthropology is too disgusting, so now you're interested in the effect of air temperature on running speed in lizards. This is a regression, because you decided which temperatures to use. You'll probably still want to calculate r², just because high values are more impressive.

But it's not a very meaningful estimate of anything about lizards. This is because the r² depends on the values of the independent variable that you chose. For the exact same relationship between temperature and running speed, a narrower range of temperatures would give a smaller r². Consider simulated data sets with the same scatter (standard deviation of Y values at each value of X) but different ranges of X.

With a narrower range of X values, the r² gets smaller; the simulation sketched below illustrates this. If you did another experiment on humidity and running speed in your lizards and got a lower r², you couldn't say that running speed is more strongly associated with temperature than with humidity; if you had chosen a narrower range of temperatures and a broader range of humidities, humidity might have had a larger r² than temperature.
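
A small simulation along these lines (the slope, scatter, and X ranges are arbitrary choices, not values from the lizard study):

    # How r-squared shrinks with a narrower range of X, holding the
    # relationship and the scatter fixed
    set.seed(1)
    r2 <- function(xmin, xmax, n = 100) {
      x <- runif(n, xmin, xmax)        # chosen range of the independent variable
      y <- x + rnorm(n, sd = 1)        # same relationship and scatter throughout
      summary(lm(y ~ x))$r.squared
    }
    r2(0, 10)   # wide range of X: r-squared near 1
    r2(0, 3)    # narrower range: smaller r-squared
    r2(0, 1)    # very narrow range: smaller still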

If you try to classify every experiment as either regression or correlation, you'll quickly find that there are many experiments that don't clearly fall into one category.

For example, let's say that you study air temperature and running speed in lizards. You go out to the desert every Saturday for the eight months of the year that your lizards are active, measure the air temperature, then chase lizards and measure their speed. You haven't deliberately chosen the air temperature, just taken a sample of the natural variation in air temperature, so is it a correlation? But you didn't take a sample of the entire year, just those eight months, and you didn't pick days at random, just Saturdays, so is it a regression?

If you are mainly interested in using the P value for hypothesis testing, to see whether there is a relationship between the two variables, it doesn't matter whether you call the statistical test a regression or correlation. If you are interested in comparing the strength of the relationship (r²) to the strength of other relationships, you are doing a correlation and should design your experiment so that you measure X and Y on a random sample of individuals.

If you determine the X values before you do the experiment, you are doing a regression and shouldn't interpret the r² as an estimate of something general about the population you've observed. You have probably heard people warn you, "Correlation does not imply causation."

So if you see a significant association between A and B, it doesn't necessarily mean that variation in A causes variation in B; there may be some other variable, C, that affects both of them. For example, let's say you went to an elementary school, found random students, measured how long it took them to tie their shoes, and measured the length of their thumbs.

I'm pretty sure you'd find a strong association between the two variables, with longer thumbs associated with shorter shoe-tying times. I'm sure you could come up with a clever, sophisticated biomechanical explanation for why having longer thumbs causes children to tie their shoes faster, complete with force vectors and moment angles and equations and 3-D modeling. However, that would be silly; your sample of random students has natural variation in another variable, age, and older students have bigger thumbs and take less time to tie their shoes.

So what if you make sure all your student volunteers are the same age, and you still see a significant association between shoe-tying time and thumb length; would that correlation imply causation? No, because think of why different children have different length thumbs.

Some people are genetically larger than others; could the genes that affect overall size also affect fine motor skills? Nutrition affects size, and family economics affects nutrition; could poor children have smaller thumbs due to poor nutrition, and also have slower shoe-tying times because their parents were too overworked to teach them to tie their shoes, or because they were so poor that they didn't get their first shoes until they reached school age?

I don't know, maybe some kids spend so much time sucking their thumb that the thumb actually gets longer, and having a slimy spit-covered thumb makes it harder to grip a shoelace. But there would be multiple plausible explanations for the association between thumb length and shoe-tying time, and it would be incorrect to conclude "Longer thumbs make you tie your shoes faster." Since it's possible to think of multiple explanations for an association between two variables, does that mean you should cynically sneer "Correlation does not imply causation!" and dismiss the association entirely? No.

For one thing, observing a correlation between two variables suggests that there's something interesting going on, something you may want to investigate further. For example, studies have shown a correlation between eating more fresh fruits and vegetables and lower blood pressure.

It's possible that the correlation is because people with more money, who can afford fresh fruits and vegetables, have less stressful lives than poor people, and it's the difference in stress that affects blood pressure; it's also possible that people who are concerned about their health eat more fruits and vegetables and exercise more, and it's the exercise that affects blood pressure. But the correlation suggests that eating fruits and vegetables may reduce blood pressure.

You'd want to test this hypothesis further, by looking for the correlation in samples of people with similar socioeconomic status and levels of exercise; by statistically controlling for possible confounding variables using techniques such as multiple regression; by doing animal studies; or by giving human volunteers controlled diets with different amounts of fruits and vegetables.

If your initial correlation study hadn't found an association of blood pressure with fruits and vegetables, you wouldn't have a reason to do these further studies. Correlation may not imply causation, but it tells you that something interesting is going on.

In a regression study, you set the values of the independent variable, and you control or randomize all of the possible confounding variables. For example, if you are investigating the relationship between blood pressure and fruit and vegetable consumption, you might think that it's the potassium in the fruits and vegetables that lowers blood pressure.

You could investigate this by getting a bunch of volunteers of the same sex, age, and socioeconomic status. You randomly choose the potassium intake for each person, give them the appropriate pills, have them take the pills for a month, then measure their blood pressure.

All of the possible confounding variables are either controlled (age, sex, income) or randomized (occupation, psychological stress, exercise, diet), so if you see an association between potassium intake and blood pressure, the only possible cause would be that potassium affects blood pressure.

So if you've designed your experiment correctly, regression does imply causation. It is also possible to test the null hypothesis that the Y value predicted by the regression equation for a given value of X is equal to some theoretical expectation; the most common would be testing the null hypothesis that the Y intercept is 0.

This is rarely necessary in biological experiments, so I won't cover it here, but be aware that it is possible. When you are testing a cause-and-effect relationship, the variable that causes the relationship is called the independent variable and you plot it on the X axis, while the effect is called the dependent variable and you plot it on the Y axis.

In other cases, both variables exhibit natural variation, but any cause-and-effect relationship would be in one way; if you measure the air temperature and frog calling rate at a pond on several different nights, both the air temperature and the calling rate would display natural variation, but if there's a cause-and-effect relationship, it's temperature affecting calling rate; the rate at which frogs call does not affect the air temperature. Sometimes it's not clear which is the independent variable and which is the dependent, even if you think there may be a cause-and-effect relationship.

For example, if you are testing whether salt content in food affects blood pressure, you might measure the salt content of people's diets and their blood pressure, and treat salt content as the independent variable.

But if you were testing the idea that high blood pressure causes people to crave high-salt foods, you'd make blood pressure the independent variable and salt intake the dependent variable. Sometimes, you're not looking for a cause-and-effect relationship at all; you just want to see if two variables are related. For example, if you measure the range-of-motion of the hip and the shoulder, you're not trying to see whether more flexible hips cause more flexible shoulders, or more flexible shoulders cause more flexible hips; instead, you're just trying to see if people with more flexible hips also tend to have more flexible shoulders, presumably due to some factor (age, diet, exercise, genetics) that affects overall flexibility.

In this case, it would be completely arbitrary which variable you put on the X axis and which you put on the Y axis. Fortunately, the P value and the r² are not affected by which variable you call the X and which you call the Y; you'll get mathematically identical values either way.

The least-squares regression line does depend on which variable is the X and which is the Y; the two lines can be quite different if the r² is low. If you're truly interested only in whether the two variables covary, and you are not trying to infer a cause-and-effect relationship, you may want to avoid using the linear regression line as decoration on your graph.
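
A quick way to see both points in R (simulated data with an arbitrary, deliberately noisy relationship, so the r² is low):

    # Same P value either way round, but different least-squares lines
    set.seed(2)
    x <- rnorm(50)
    y <- 0.5 * x + rnorm(50)          # noisy on purpose, so r-squared is low

    cor.test(x, y)$p.value            # identical...
    cor.test(y, x)$p.value            # ...whichever variable you call X

    b_yx <- coef(lm(y ~ x))[2]        # slope of the Y-on-X regression line
    b_xy <- coef(lm(x ~ y))[2]        # slope of the X-on-Y regression
    c(b_yx, 1 / b_xy)                 # on the same scale, the two lines' slopes differ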

Researchers in a few fields traditionally put the independent variable on the Y axis. Oceanographers, for example, often plot depth on the Y axis with 0 at the top and a variable that is directly or indirectly affected by depth, such as chlorophyll concentration, on the X axis.

I wouldn't recommend this unless it's a really strong tradition in your field, as it could lead to confusion about which variable you're considering the independent variable in a linear regression. Linear regression finds the line that best fits the data points. There are actually a number of different definitions of "best fit," and therefore a number of different methods of linear regression that fit somewhat different lines.

By far the most common is "ordinary least-squares regression"; when someone just says "least-squares regression" or "linear regression" or "regression," they mean ordinary least-squares regression. In ordinary least-squares regression, the "best" fit is defined as the line that minimizes the squared vertical distances between the data points and the line. This squared deviate is calculated for each data point, and the sum of these squared deviates measures how well a line fits the data.

The regression line is the one for which this sum of squared deviates is smallest. I'll mostly leave out the math used to find the slope and intercept of the best-fit line (a brief sketch is given below for the curious); you're a biologist and have more important things to think about. Once you know a (the intercept) and b (the slope), you can use the equation Ŷ = a + bX to predict the value of Y for a given value of X. For the exercise data, I could use this to predict my heart rate at a speed of 10 kph. You should do this kind of prediction only within the range of X values found in the original data set (interpolation).
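
For the curious, a sketch of that math in R, checked against lm (the speed and pulse vectors here are made-up placeholders, not the author's exercise data):

    # Least-squares slope and intercept from their closed-form expressions.
    speed <- c(2, 4, 6, 8, 10, 12)          # hypothetical speeds (kph)
    pulse <- c(90, 105, 116, 129, 142, 155) # hypothetical pulse rates

    b <- cov(speed, pulse) / var(speed)     # slope
    a <- mean(pulse) - b * mean(speed)      # intercept
    c(a, b)
    coef(lm(pulse ~ speed))                 # same numbers from lm

    a + b * 10                              # predicted pulse at 10 kph (interpolation)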

Predicting Y values outside the range of observed values (extrapolation) is sometimes interesting, but it can easily yield ridiculous results if you go far outside the observed range of X. In the frog-calling example, for instance, extrapolating far below the observed temperatures is meaningless: the inter-calling interval would actually be infinite at a sufficiently low temperature, because all the frogs would be frozen solid. Sometimes you want to predict X from Y; the most common use of this is constructing a standard curve.


