In statistics, a P–P plot (probability–probability plot, percent–percent plot, or P value plot) is a probability plot for assessing how closely two data sets agree; it plots the two cumulative distribution functions against each other. P–P plots are widely used to evaluate the skewness of a distribution.
The Q–Q plot is more widely used than the P–P plot, but both are referred to as 'the' probability plot and are easily confused.
Definition
A P–P plot plots two cumulative distribution functions (cdfs) against each other:[1] given two probability distributions with cdfs F and G, it plots (F(z), G(z)) as z ranges from −∞ to ∞. As a cdf has range [0,1], the domain of this parametric graph is (−∞, ∞) and the range is the unit square [0,1] × [0,1].
Thus for input z the output is the pair of numbers giving what percentage of F and what percentage of G fall at or below z.
The comparison line is the 45° line from (0,0) to (1,1) – the distributions are equal if and only if the plot falls on this line – any deviation indicates a difference between the distributions.[2]
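The definition can be illustrated with a minimal sketch in Python, using only the standard library. The helper name `pp_points`, the normal distributions, and the finite grid of z values are assumptions chosen for illustration; they are not part of the cited definition.

```python
from statistics import NormalDist

def pp_points(F, G, zs):
    """Return the P-P curve: one (F(z), G(z)) pair per grid value z."""
    return [(F(z), G(z)) for z in zs]

# Hypothetical example distributions (not from the article).
same_a = NormalDist(mu=0.0, sigma=1.0)
same_b = NormalDist(mu=0.0, sigma=1.0)
shifted = NormalDist(mu=0.5, sigma=1.0)

# A finite grid standing in for "z ranges from -infinity to infinity".
zs = [i / 10 for i in range(-50, 51)]

identical = pp_points(same_a.cdf, same_b.cdf, zs)
different = pp_points(same_a.cdf, shifted.cdf, zs)

# Largest vertical distance from the 45-degree comparison line.
max_dev_identical = max(abs(g - f) for f, g in identical)
max_dev_different = max(abs(g - f) for f, g in different)

print(max_dev_identical)
print(max_dev_different)
```

With equal distributions every point lies on the comparison line, so the first deviation is zero; the shifted distribution produces a curve that bows away from the line.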
Example
As an example, if the two distributions do not overlap, say F is below G, then the P–P plot will move from left to right along the bottom of the square – as z moves through the support of F, the cdf of F goes from 0 to 1, while the cdf of G stays at 0 – and then moves up the right side of the square – the cdf of F is now 1, as all points of F lie below all points of G, and now the cdf of G moves from 0 to 1 as z moves through the support of G.
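In place of a graph, the path described in this example can be traced numerically. The following is a sketch under assumed inputs: two uniform distributions on [0, 1] and [2, 3] stand in for F and G, and `uniform_cdf` is a hypothetical helper.

```python
def uniform_cdf(a, b):
    """Return the cdf of a uniform distribution on [a, b]."""
    def cdf(z):
        if z <= a:
            return 0.0
        if z >= b:
            return 1.0
        return (z - a) / (b - a)
    return cdf

F = uniform_cdf(0.0, 1.0)  # hypothetical: F lies entirely below G
G = uniform_cdf(2.0, 3.0)

zs = [i / 4 for i in range(-4, 17)]  # z from -1.0 to 4.0 in steps of 0.25
path = [(F(z), G(z)) for z in zs]

# Every point sits on the bottom edge (G = 0) or on the right edge (F = 1).
on_bottom_or_right = all(g == 0.0 or f == 1.0 for f, g in path)
print(on_bottom_or_right)
```

The path starts at (0, 0), runs along the bottom edge to (1, 0), and then climbs the right edge to (1, 1), never entering the interior of the square.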
Use
As the above example illustrates, if two distributions are separated in space, the P–P plot will give very little information – it is only useful for comparing probability distributions that have nearby or equal location. Notably, it will pass through the point (1/2, 1/2) if and only if the two distributions have the same median.
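The median property can be checked with a short sketch. The normal distributions and the helper `g_where_f_is_half` below are assumptions for illustration: the helper finds the z at which F(z) = 1/2 and reports G at that z.

```python
from statistics import NormalDist

def g_where_f_is_half(F_dist, G_dist):
    """Height of the P-P curve at the point where F(z) = 1/2."""
    z_med = F_dist.inv_cdf(0.5)  # median of F
    return G_dist.cdf(z_med)

F = NormalDist(0.0, 1.0)
G_same_median = NormalDist(0.0, 3.0)   # different spread, same median
G_other_median = NormalDist(1.0, 1.0)  # shifted median

height_same = g_where_f_is_half(F, G_same_median)
height_other = g_where_f_is_half(F, G_other_median)
print(height_same, height_other)
```

When the medians agree the curve passes through (1/2, 1/2) even though the spreads differ; when the median of G is larger, the curve lies below 1/2 at that point.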
P–P plots are sometimes limited to comparisons between two samples, rather than comparison of a sample to a theoretical model distribution.[3] However, they are of general use, particularly where observations are not all modelled with the same distribution.
However, the P–P plot has found some use in comparing a sample distribution with a known theoretical distribution: given n samples, plotting the continuous theoretical cdf against the empirical cdf would yield a stairstep (a step as z hits a sample), and would hit the top of the square when the last data point was hit. Instead one plots only points: the theoretical cdf evaluated at the kth observed point (in order: formally, the observed kth order statistic) against the plotting position k/(n + 1).[3] This choice of 'plotting position' (the probability level assigned to the kth ordered observation) has occasioned less controversy than the choice for Q–Q plots. The resulting goodness of fit of the 45° line gives a measure of the difference between a sample set and the theoretical distribution.
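One reading of this construction is sketched below: each sorted observation is passed through the theoretical cdf and paired with the plotting position k/(n + 1). The helper name `pp_sample_vs_model` and the artificial sample are assumptions; the sample is built from exact normal quantiles so that the fit is perfect by design.

```python
from statistics import NormalDist

def pp_sample_vs_model(sample, model_cdf):
    """Pair k/(n+1) with the model cdf evaluated at the k-th order statistic."""
    ordered = sorted(sample)
    n = len(ordered)
    return [(k / (n + 1), model_cdf(x)) for k, x in enumerate(ordered, start=1)]

# Hypothetical sample: the exact k/(n+1) quantiles of N(0, 1).
model = NormalDist(0.0, 1.0)
n = 9
sample = [model.inv_cdf(k / (n + 1)) for k in range(1, n + 1)]

points = pp_sample_vs_model(sample, model.cdf)
max_dev = max(abs(p_model - p_emp) for p_emp, p_model in points)

# The same sample against a badly located model, for contrast.
wrong_model = NormalDist(2.0, 1.0)
wrong_points = pp_sample_vs_model(sample, wrong_model.cdf)
max_dev_wrong = max(abs(p_model - p_emp) for p_emp, p_model in wrong_points)

print(max_dev, max_dev_wrong)
```

Because the positions k/(n + 1) never reach 0 or 1, the plotted points stay strictly inside the unit square, avoiding the stairstep that touches the top edge.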
A P–P plot can be used as a graphical adjunct to tests of the fit of probability distributions,[4][5] with additional lines included on the plot to indicate either specific acceptance regions or the range of expected departure from the 1:1 line. An improved version of the P–P plot, called the SP or S–P plot, is available;[4][5] it uses a variance-stabilizing transformation to create a plot on which the variations about the 1:1 line should be the same at all locations.
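The idea of a variance-stabilizing transformation can be sketched as follows. This assumes the arcsine square-root transform, (2/π)·arcsin(√u), applied to both coordinates, and the plotting position (k − 1/2)/n; both are stated here as assumptions about the stabilized plot, and the cited references should be consulted for the exact definition. The sample and the helper names are hypothetical.

```python
import math
from statistics import NormalDist

def stabilize(u):
    """Arcsine square-root transform mapping [0, 1] onto [0, 1]."""
    return (2.0 / math.pi) * math.asin(math.sqrt(u))

def sp_points(sample, model_cdf):
    """Stabilized plotting position paired with the stabilized model cdf."""
    ordered = sorted(sample)
    n = len(ordered)
    return [
        (stabilize((k - 0.5) / n), stabilize(model_cdf(x)))
        for k, x in enumerate(ordered, start=1)
    ]

# Hypothetical sample compared with a fully specified N(0, 1) model.
sample = [0.3, -1.2, 0.8, -0.1, 1.9, -0.7, 0.05, 1.1]
points = sp_points(sample, NormalDist(0.0, 1.0).cdf)
for r, s in points:
    print(round(r, 3), round(s, 3))
```

The transform stretches the regions near 0 and 1, where P–P points vary least, which is what makes the scatter about the 1:1 line more even.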
References
Citations
- [1] Nonparametric Statistical Inference, by Jean Dickinson Gibbons and Subhabrata Chakraborti, 4th edition, CRC Press, 2003, ISBN 978-0-8247-4052-8, p. 145
- [2] Derrick, B; Toher, D; White, P (2016). 'Why Welch's test is Type I error robust'. The Quantitative Methods for Psychology. 12 (1): 30–38. doi:10.20982/tqmp.12.1.p030.
- [3] Testing for Normality, by Henry C. Thode, CRC Press, 2002, ISBN 978-0-8247-9613-6, Section 2.2.3, Percent–percent plots, p. 23
- [4] Michael, J.R. (1983). 'The stabilized probability plot'. Biometrika, 70(1), 11–17. JSTOR 2335939
- [5] Shorack, G.R.; Wellner, J.A. (1986). Empirical Processes with Applications to Statistics, Wiley. ISBN 0-471-86725-X, pp. 248–250
Sources
- Davidson, Russell; MacKinnon, James (January 1998). 'Graphical Methods for Investigating the Size and Power of Hypothesis Tests'. The Manchester School. 66 (1): 1–26. CiteSeerX 10.1.1.57.4335. doi:10.1111/1467-9957.00086.
You ran a linear regression analysis and the stats software spit out a bunch of numbers. The results were significant (or not). You might think that you’re done with the analysis. No, not yet. After running a regression analysis, you should check if the model works well for the data.
We can check if a model works well for data in many different ways. We pay great attention to regression results, such as slope coefficients, p-values, or R², that tell us how well a model represents given data. That’s not the whole picture though. Residuals could show how poorly a model represents data. Residuals are what is left over of the outcome variable after fitting a model (predictors) to data, and they could reveal patterns in the data that the fitted model leaves unexplained. Using this information, not only could you check if linear regression assumptions are met, but you could improve your model in an exploratory way.
In this post, I’ll walk you through built-in diagnostic plots for linear regression analysis in R (there are many other ways to explore data and diagnose linear models other than the built-in base R function though!). It’s very easy to run: just call plot() on an lm object after running an analysis. Then R will show you four diagnostic plots one by one. For example, if fit is the object returned by lm(), run plot(fit).
Tip: It’s always a good idea to check the Help page, which has hidden tips not mentioned here! ?plot.lm
By the way, if you want to look at the four plots at once rather than one by one, set a 2 × 2 panel layout with par(mfrow = c(2, 2)) before calling plot().
You will often see numbers next to some points in each plot. They are extreme values based on each criterion and identified by the row numbers in the data set. I’ll talk about this again later.
The diagnostic plots show residuals in four different ways. Let’s take a look at the first type of plot:
1. Residuals vs Fitted
This plot shows if residuals have non-linear patterns. There could be a non-linear relationship between predictor variables and an outcome variable and the pattern could show up in this plot if the model doesn’t capture the non-linear relationship. If you find equally spread residuals around a horizontal line without distinct patterns, that is a good indication you don’t have non-linear relationships.
Let’s look at residual plots from a ‘good’ model and a ‘bad’ model. The good model data are simulated in a way that meets the regression assumptions very well, while the bad model data are not.
What do you think? Do you see differences between the two cases? I don’t see any distinctive pattern in Case 1, but I see a parabola in Case 2, where the non-linear relationship was not explained by the model and was left out in the residuals.
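If you want to see where a parabola like this comes from, here is a minimal sketch in plain Python (not R, and not the simulated data behind the figures in this post). It fits a straight line to made-up data that follow y = x² and lists the residuals against the fitted values; the helper `fit_line` and the data are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical noise-free data with a curved relationship, y = x^2.
xs = [float(x) for x in range(11)]
ys = [x ** 2 for x in xs]

a, b = fit_line(xs, ys)
fitted = [a + b * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

# Points for a residuals-vs-fitted plot: fitted value, then residual.
for f, r in zip(fitted, residuals):
    print(round(f, 2), round(r, 2))
```

The residuals are positive at both ends and negative in the middle: the curvature the straight line couldn’t capture is exactly what is left over.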
2. Normal Q-Q
This plot shows if residuals are normally distributed. Do the residuals follow a straight line well or do they deviate severely? It’s good if the residuals line up well along the straight dashed line.
What do you think? Of course they wouldn’t be a perfect straight line and this will be your call. Case 2 definitely concerns me. I would not be concerned by Case 1 too much, although an observation numbered as 38 looks a little off. Let’s look at the next plot while keeping in mind that #38 might be a potential problem.
For more detailed information, see Understanding Q-Q plots.
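For the curious, here is roughly what goes into a normal Q-Q plot, sketched in plain Python rather than R. The sorted residuals are paired with standard normal quantiles at the plotting positions (k − 0.5)/n; that choice of positions, the helper `normal_qq_points`, and the residual values are all assumptions for illustration, and software may use slightly different positions.

```python
from statistics import NormalDist

def normal_qq_points(residuals):
    """Pair standard-normal quantiles with the sorted residuals."""
    ordered = sorted(residuals)
    n = len(ordered)
    std_normal = NormalDist()
    # Plotting positions (k - 0.5)/n: an assumption for this sketch.
    theoretical = [std_normal.inv_cdf((k - 0.5) / n) for k in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Hypothetical residuals, made up for illustration; 4.0 is deliberately extreme.
residuals = [-1.8, -0.9, -0.4, -0.1, 0.0, 0.2, 0.5, 1.0, 1.5, 4.0, -0.6, 0.3]
points = normal_qq_points(residuals)
for q, r in points:
    print(round(q, 2), r)
```

If the residuals were normal, these pairs would fall close to a straight line; here the largest residual sits well above where the line through the other points would put it.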
3. Scale-Location
This is also called a Spread-Location plot. It shows if residuals are spread equally along the ranges of predictors. This is how you can check the assumption of equal variance (homoscedasticity). It’s good if you see a horizontal line with equally (randomly) spread points.
What do you think? In Case 1, the residuals appear randomly spread. In Case 2, by contrast, the residuals begin to spread wider along the x-axis as it passes around 5. Because the residuals spread wider and wider, the red smooth line is not horizontal and shows a steep angle in Case 2.
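Here is a rough sketch, in plain Python rather than R, of the quantity on the vertical axis of a plot like this: the square root of the absolute standardized residuals, paired with the fitted values. The helper `simple_ols_diagnostics` and the data are assumptions for illustration (the data are made up so that the scatter grows with x); this is not the code or data behind the figures above.

```python
import math

def simple_ols_diagnostics(xs, ys):
    """Fit y = a + b*x; return fitted values and standardized residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    fitted = [a + b * x for x in xs]
    resid = [y - f for y, f in zip(ys, fitted)]
    s = math.sqrt(sum(e * e for e in resid) / (n - 2))  # residual standard error
    lev = [1 / n + (x - x_bar) ** 2 / sxx for x in xs]  # leverage h_i
    std_resid = [e / (s * math.sqrt(1 - h)) for e, h in zip(resid, lev)]
    return fitted, std_resid

# Hypothetical data whose scatter grows with x (heteroscedastic by construction).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.1, 1.9, 3.2, 3.7, 5.6, 5.2, 8.4, 6.3]

fitted, std_resid = simple_ols_diagnostics(xs, ys)
scale_location = [(f, math.sqrt(abs(r))) for f, r in zip(fitted, std_resid)]

low = sum(v for _, v in scale_location[:4]) / 4
high = sum(v for _, v in scale_location[4:]) / 4
print(round(low, 3), round(high, 3))
```

The average height is larger for the higher fitted values than for the lower ones, which is the kind of upward drift a non-horizontal smooth line is picking up.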
4. Residuals vs Leverage
This plot helps us find influential cases (i.e., subjects), if any. Not all outliers are influential in linear regression analysis (whatever outliers mean). Even though data have extreme values, they might not be influential in determining a regression line. That means the results wouldn’t be much different if we either included or excluded them from the analysis. They follow the trend in the majority of cases and they don’t really matter; they are not influential. On the other hand, some cases could be very influential even if they look to be within a reasonable range of values. They could be extreme cases against a regression line and can alter the results if we exclude them from the analysis. Another way to put it is that they don’t get along with the trend in the majority of the cases.
Unlike the other plots, this time patterns are not relevant. We watch out for outlying values at the upper right corner or at the lower right corner. Those spots are the places where cases can be influential against a regression line. Look for cases outside of a dashed line, Cook’s distance. When cases are outside of the Cook’s distance (meaning they have high Cook’s distance scores), the cases are influential to the regression results. The regression results will be altered if we exclude those cases.
Case 1 is the typical look when there are no influential cases. You can barely see the Cook’s distance lines (red dashed lines) because all cases are well inside them. In Case 2, one case is far beyond the Cook’s distance lines (the other residuals appear clustered on the left because the second plot is scaled to show a larger area than the first plot). The plot identified the influential observation as #49. If I exclude the 49th case from the analysis, the slope coefficient changes from 2.14 to 2.68 and R² from .757 to .851. Pretty big impact!
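To make Cook’s distance a little less mysterious, here is a minimal sketch in plain Python (not R) for a regression with a single predictor. It uses the textbook formula D = e²·h / (p·s²·(1 − h)²), where e is the residual, h the leverage, s² the residual variance and p the number of coefficients. The helper `cooks_distance` and the data are assumptions for illustration; they are not the data behind Case 2.

```python
def cooks_distance(xs, ys):
    """Leverage and Cook's distance for each case in y = a + b*x."""
    n, p = len(xs), 2  # p = number of coefficients (intercept, slope)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    s2 = sum(e * e for e in resid) / (n - p)            # residual variance
    lev = [1 / n + (x - x_bar) ** 2 / sxx for x in xs]  # leverage h_i
    cook = [e * e * h / (p * s2 * (1 - h) ** 2) for e, h in zip(resid, lev)]
    return lev, cook

# Hypothetical data: seven points near y = 2x plus one far-out, off-trend case.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 15.0]

lev, cook = cooks_distance(xs, ys)
worst = max(range(len(xs)), key=lambda i: cook[i])
print(worst, round(lev[worst], 3), cook[worst] > 1.0)
```

The last case has both the highest leverage and by far the largest Cook’s distance, so it would land in the far right of a residuals vs leverage plot, beyond a contour drawn at 1.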
The four plots show potentially problematic cases with the row numbers of the data in the dataset. If some cases are identified across all four plots, you might want to take a close look at them individually. Is there anything special about the subject? Or could it simply be an error in data entry?
So, what does having patterns in residuals mean to your research? It’s not just a go-or-stop sign. It tells you about your model and data. Your current model might not be the best way to understand your data if there’s so much good stuff left in the data.
In that case, you may want to go back to your theory and hypotheses. Is it really a linear relationship between the predictors and the outcome? You may want to include a quadratic term, for example. A log transformation may better represent the phenomena that you’d like to model. Or, is there any important variable that you left out from your model? Other variables you didn’t include (e.g., age or gender) may play an important role in your model and data. Or, maybe, your data were systematically biased during collection. You may want to redesign your data collection methods.
Checking residuals is a way to discover new insights in your model and data!
For questions or clarifications regarding this article, contact the UVA Library StatLab: statlab@virginia.edu
Bommae Kim
Statistical Consulting Associate
University of Virginia Library
September 21, 2015