Moved from Google Code:
https://code.google.com/p/httparchive/issues/detail?id=192:
Guypo suggests ways to improve the correlation coefficient to isolate other variables:
On 6/25/11 11:18 AM, Guy Podjarny wrote:
The correlation coeff is accurate in determining the variable that correlates the most to load time. The problem is with showing the numbers of the next items without neutralizing that top var. For those, it's simply inaccurate. I don't think it's like the median vs average conversation, where they're indeed all right.
I'm not sure there's an easy way to calculate it quickly. Maybe use a stats tool instead of the PHP calculation?
Maybe only calculate the top variable correlating to load time each run, and occasionally do the full analysis using SPSS? Note that SPSS calculates it quite quickly.
Guy Podjarny | CTO, Blaze | 613-800-0413 x202
On 2011-06-25, at 13:46, Steve Souders wrote:
Then what's the purpose of the correlation coeff I use? It seems like you're saying the current formula should never be used (when there are multiple variables). This reminds me of the avg vs median vs 90% debate - they're all valid formulas, it just depends on what you're after and clearly stating what function you're using.
Because the next question is: how do you implement this? Right now the correlation coeff charts take 10x longer than any other chart, and my guess is it's based on the order of N, and N is about to increase several orders of magnitude. So there's the practical aspects, too.
-Steve
On 6/25/11 10:37 AM, Guy Podjarny wrote:
I thought of a simpler example:
Let's say you check how reading skills and age correlate to height.
If you look at the raw numbers, you'll see reading skills correlate quite well to height. But logically, that's clearly only because reading skill correlates to age, which correlates to height. That's why you have to use partial correlation to see correlations above and beyond variables that may explain the correlation.
Cheers,
Guypo
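The height example can be checked with a small simulation. This is a minimal sketch on made-up data: the coefficients, noise levels, sample size, and the helper names `pearson` and `partial_corr` are all assumptions for illustration, not anything from the HTTP Archive code.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(r_ab, r_ac, r_bc):
    """Correlation of a and b with c neutralized (first-order partial)."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

rng = random.Random(0)  # fixed seed so the run is repeatable
ages = [rng.uniform(5, 15) for _ in range(2000)]
# Hypothetical model: height and reading skill each depend on age only.
heights = [80 + 6 * a + rng.gauss(0, 5) for a in ages]
reading = [10 * a + rng.gauss(0, 10) for a in ages]

r_rh = pearson(reading, heights)   # raw reading vs. height
r_ra = pearson(reading, ages)
r_ha = pearson(heights, ages)
r_rh_given_age = partial_corr(r_rh, r_ra, r_ha)

print(f"reading vs height, raw:            {r_rh:.2f}")
print(f"reading vs height, age neutralized: {r_rh_given_age:.2f}")
```

Because reading skill and height are generated from age alone, the raw coefficient comes out high while the partial coefficient stays close to zero.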
On Sat, Jun 25, 2011 at 12:58 PM, Guy Podjarny wrote:
The main issue is that when you calculate the correlation between A & C and B & C, you don't take into account the correlation between A & B.
For example, if these are the correlations:
1. A & B = 0.8
2. A & C = 0.7
3. B & C = 0.9
Then correlation #2 can be explained by #1 and #3, and you can't really say that A & C correlate.
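Plugging these three numbers into the standard first-order partial correlation formula (assumed here to be the same formula as in the attached doc, which is not reproduced in this thread) gives the correlation of A & C with B neutralized:

```latex
r_{AC \cdot B}
  = \frac{r_{AC} - r_{AB}\, r_{BC}}{\sqrt{(1 - r_{AB}^{2})(1 - r_{BC}^{2})}}
  = \frac{0.7 - 0.8 \times 0.9}{\sqrt{(1 - 0.64)(1 - 0.81)}}
  = \frac{-0.02}{\sqrt{0.0684}}
  \approx -0.08
```

So once B is accounted for, almost nothing is left of the 0.7 between A and C.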
I feel that's not a good enough explanation, so let me try a real world example.
Let's take these four variables:
W - Load Time
X - Total # Requests
Y - Total # JS Requests
Z - Total # CSS Requests
And that these are the correlations between them:
WX = 0.7
WY = 0.6
WZ = 0.5
XY = 0.8
XZ = 0.3
Logically, total # of requests is the top correlation to load time, followed by # of JS reqs and # CSS reqs.
Also, pages with many requests often have many JS requests too, but often don't have many CSS requests (this is a made up example, just to show stats).
At first pass, you'd say that the # of JS requests is a more significant factor in the load time than the # of CSS requests. However, when you calculate the correlation of # of JS reqs and # of CSS reqs with load time above and beyond total # of requests, you get these values (this is called "partial correlation", the formula is in the attached doc):
WY.X = 0.09 - Excel Formula: =(0.6 - 0.7*0.8)/(SQRT(1-0.7*0.7)*SQRT(1-0.8*0.8))
WZ.X = 0.43 - Excel Formula: =(0.5 - 0.7*0.3)/(SQRT(1-0.7*0.7)*SQRT(1-0.3*0.3))
So you see that the number of JS requests really doesn't matter much beyond the total number of requests, it only looked that way because pages that had many requests happened to have many JS requests too. On the other hand, the number of CSS requests shows to be quite significant to load time, much more than number of JS requests.
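The two Excel formulas translate directly to code. A minimal sketch, where `partial_corr` is a hypothetical helper name and the inputs are the made-up correlations from the example:

```python
import math

def partial_corr(r_ab, r_ac, r_bc):
    """Correlation of a and b with c neutralized (first-order partial)."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

# Made-up pairwise correlations from the example:
# W = load time, X = total reqs, Y = JS reqs, Z = CSS reqs.
r_wx, r_wy, r_wz, r_xy, r_xz = 0.7, 0.6, 0.5, 0.8, 0.3

wy_x = partial_corr(r_wy, r_wx, r_xy)  # JS reqs vs load time, total reqs neutralized
wz_x = partial_corr(r_wz, r_wx, r_xz)  # CSS reqs vs load time, total reqs neutralized

print(round(wy_x, 2), round(wz_x, 2))  # 0.09 0.43
```

Before rounding, the values are roughly 0.0933 and 0.4257.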
So once you establish the single variable that correlates the most with load time, you have to neutralize it before you say which is second, then neutralize both to say which is third... etc.
The attached formula shows how to neutralize one variable, but to neutralize many, I used SPSS, as it became too complicated. SPSS also gives you an indication of when the correlation becomes statistically significant, taking into account the total number of samples.
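For a handful of variables, neutralizing more than one can also be done without SPSS by applying the one-variable formula recursively. The sketch below is an alternative to the SPSS route, not what was used for the study. The helper `partial_corr` and the dictionary layout are hypothetical, and since the example gives no YZ correlation, a value of 0.2 is assumed purely for illustration.

```python
import math

def partial_corr(r, a, b, controls):
    """Partial correlation of a and b with every variable in `controls`
    neutralized. `r` maps frozenset({u, v}) to the pairwise correlation."""
    if not controls:
        return r[frozenset((a, b))]
    c, rest = controls[0], controls[1:]
    # Neutralize `rest` first, then remove c with the one-variable formula.
    r_ab = partial_corr(r, a, b, rest)
    r_ac = partial_corr(r, a, c, rest)
    r_bc = partial_corr(r, b, c, rest)
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

pairs = {
    ("W", "X"): 0.7, ("W", "Y"): 0.6, ("W", "Z"): 0.5,
    ("X", "Y"): 0.8, ("X", "Z"): 0.3,
    ("Y", "Z"): 0.2,  # ASSUMED: not given in the example
}
r = {frozenset(k): v for k, v in pairs.items()}

wz_x = partial_corr(r, "W", "Z", ["X"])        # one variable neutralized
wz_xy = partial_corr(r, "W", "Z", ["X", "Y"])  # two variables neutralized
print(f"WZ.X = {wz_x:.3f}, WZ.XY = {wz_xy:.3f}")
```

The recursion makes three calls per control variable, so its cost grows exponentially with the number of neutralized variables. It is fine for two or three, and a stats package is the better choice beyond that. It also reports no significance level.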
I hope this makes sense, I had to learn this to prepare my mobile study presentation, so I'm not well rehearsed in explaining it... but I'm confident it's accurate.
Did that help clarify my point?
Cheers,
Guypo
On Sat, Jun 25, 2011 at 2:47 AM, Steve Souders wrote:
Hi, Guy.
Sorry for the late reply - still unburying.
Yes, I'd like to fix this if it's wrong. I guess I don't understand - there's a formula for calculating the correlation coeff for two variables. That calculation ignores all the other variables. It generates a number. So far it seems like it's correct to say:
- the correlation coeff of A & D is 0.9
- the correlation coeff of B & D is 0.8
- the correlation coeff of C & D is 0.7
Given that you would say A has the highest correlation, B is 2nd highest, and C is 3rd highest.
Where does that analysis break?
-Steve
On 6/19/11 5:19 PM, Guy Podjarny wrote:
Hey,
First off, congrats on wrapping up Velocity - I heard it was a huge hit!
I downloaded many of the presentations, and am looking forward to watching some of the videos too. Hopefully you'll get more sleep now ;)
As you may have seen, as a part of the mobile analysis we did, I reused your HTTP Archive schema (with some minor modifications), and did some statistical analysis.
Doing so made me think that the analysis you currently have around correlation to speed is wrong. If I'm not mistaken, you measure the correlation of each variable to the load time, but you don't do so while neutralizing the other variables. What you should be doing is correlating the effect of each variable above and beyond the others.
Doing this for one variable (correlate A and B above and beyond C) is a simple formula. As you add more variables, it becomes more complicated.
I did it using SPSS, and I have the SPSS Syntax (sort of a script) that calculates it given a set of data extracted from the HTTP Archive Mobile DB Schema.
So bottom line:
1) I think the "correlation to speed" chart you have is misleading, since you can't say what's the 2nd top correlation to speed the way you did (only the top one)
2) It might be interesting to calculate the correlation to speed of the specific variables above and beyond the others.
Let me know if you're interested in getting into the details of this or not, just figured I'd put the offer out there.
Cheers,
Guypo
https://code.google.com/p/httparchive/issues/detail?id=239
Guypo reports that the HTTP Archive Mobile correlation coefficient stats seem wrong since "Flash Reqs" has the highest correlation. I spent an hour debugging and the math seems right, so either this is accurate or the formula is flawed.
Here's some output from calculating CC for "Oct 1 2011", "All", "iphone" comparing reqFlash to reqImg for correlation to onLoad and renderStart:
=== onLoad
reqFlash: 0.99561730861636, n=3, sumX=4, sumXX=6, sumY=49375, sumYY=1408318281, sumXY=85674
0.99561730861636 = ((257022) - (197500)) / sqrt( ((18) - (16)) * ((4224954843) - (2437890625)) )
0.99561730861636 = (59522) / sqrt( 2 * 1787064218 )
0.99561730861636 = (59522) / sqrt( 3574128436 )
0.99561730861636 = (59522) / 59784.014886924
reqImg: 0.73076506791759, n=974, sumX=31093, sumXX=2392141, sumY=9773060, sumYY=189577950110, sumXY=573515096
0.73076506791759 = ((558603703504) - (303873754580)) / sqrt( ((2329945334) - (966774649)) * ((184648923407140) - (95512701763600)) )
0.73076506791759 = (254729948924) / sqrt( 1363170685 * 89136221643540 )
0.73076506791759 = (254729948924) / sqrt( 1.2150788431614E+23 )
0.73076506791759 = (254729948924) / 348579810540.05
=== renderStart
reqFlash: 0.97504497486968, n=3, sumX=4, sumXX=6, sumY=8087, sumYY=33501411, sumXY=13506
0.97504497486968 = ((40518) - (32348)) / sqrt( ((18) - (16)) * ((100504233) - (65399569)) )
0.97504497486968 = (8170) / sqrt( 2 * 35104664 )
0.97504497486968 = (8170) / sqrt( 70209328 )
0.97504497486968 = (8170) / 8379.1006677328
reqImg: 0.34036415420895, n=974, sumX=31093, sumXX=2392141, sumY=2677584, sumYY=11729831570, sumXY=112091735
0.34036415420895 = ((109177349890) - (83254119312)) / sqrt( ((2329945334) - (966774649)) * ((11424855949180) - (7169456077056)) )
0.34036415420895 = (25923230578) / sqrt( 1363170685 * 4255399872124 )
0.34036415420895 = (25923230578) / sqrt( 5.8008363586322E+21 )
0.34036415420895 = (25923230578) / 76163221824.134
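The arithmetic in the output above matches the running-sums form of the Pearson coefficient, r = (n·sumXY − sumX·sumY) / sqrt((n·sumXX − sumX²)·(n·sumYY − sumY²)). A minimal sketch that recomputes the onLoad figures from the printed sums; `pearson_from_sums` is a hypothetical name, not the function in the HTTP Archive source:

```python
import math

def pearson_from_sums(n, sum_x, sum_xx, sum_y, sum_yy, sum_xy):
    """Pearson r from running sums, as in the debug output above."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator

# onLoad sums copied from the output above.
r_flash = pearson_from_sums(3, 4, 6, 49375, 1408318281, 85674)
r_img = pearson_from_sums(974, 31093, 2392141, 9773060, 189577950110, 573515096)

print(f"reqFlash: {r_flash:.6f} (n=3)")
print(f"reqImg:   {r_img:.6f} (n=974)")
```

Both values agree with the output, so the formula is applied correctly. Note that the reqFlash coefficient rests on n=3 pages. Three points that happen to lie close to a line give a coefficient near 1, so that figure says little on its own.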