ES What the heck?

Presenting the average of averages can easily make sense if the data sets compiled in the comparison set are independent but related.

Let’s say I want to know if Hornady brass is as good as Lapua brass. So I buy .30-06, .284 Win, 7mm-08, and 6mm Creedmoor brass.

I shoot 2 or 3 different brands of bullets from each cartridge in each set, over a couple of different powders. Now I have hundreds if not thousands of shots in this multifactorial matrix.

I can simplify presentation of the data set by presenting average SDs and average ESs. If a guy only compared the singular ES for one cartridge loaded in the 2 types of brass, with one bullet, the data set would not be very conclusive. But presenting the full matrices of different bullet and powder combinations in each cartridge would be exhausting. Instead, a comparison of average ESs and average SDs might be very meaningful. Even then, taking the ES and SD of the ESs, and the ES and SD of the SDs, might be very meaningful - in a very concise and palatable results set.
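To make that kind of summary concrete, here is a minimal Python sketch, with hypothetical velocity strings keyed by the factors in the matrix (the brass, cartridge, bullet, and powder labels are all made up):

```python
import statistics

# Hypothetical velocity strings (fps), keyed by (brass, cartridge, bullet, powder).
strings = {
    ("Lapua",   "6 Creedmoor", "Berger 105", "H4350"): [2945, 2951, 2948, 2957, 2943],
    ("Hornady", "6 Creedmoor", "Berger 105", "H4350"): [2950, 2962, 2941, 2955, 2968],
    # ... hundreds more combinations in the real matrix
}

def summarize(brass):
    """Average the per-string SDs and ESs for one brand of brass."""
    sds = [statistics.stdev(v) for k, v in strings.items() if k[0] == brass]
    ess = [max(v) - min(v) for k, v in strings.items() if k[0] == brass]
    return statistics.mean(sds), statistics.mean(ess)

for brass in ("Lapua", "Hornady"):
    avg_sd, avg_es = summarize(brass)
    print(f"{brass}: average SD = {avg_sd:.1f} fps, average ES = {avg_es:.1f} fps")
```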

For the comparison shown above, you could perform a 2-sample t-test to confirm whether the mean values of the two distributions are significantly different or not (e.g. if two bullets in one case brand produce different velocities, or if the same bullet in different cases produces different velocities). Furthermore, you could also perform a 2-variance test to confirm whether the SDs are significantly different or not.
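For anyone who wants to run these, here is a minimal sketch of both tests in Python with scipy, on hypothetical 10-shot strings (same bullet and powder, two brands of brass):

```python
import statistics
from scipy import stats

# Hypothetical 10-shot velocity strings (fps): same bullet/powder, two brass brands.
lapua   = [2948, 2951, 2946, 2953, 2949, 2950, 2947, 2952, 2948, 2951]
hornady = [2956, 2944, 2961, 2939, 2958, 2947, 2963, 2941, 2955, 2949]

# 2-sample t-test on the means (Welch's version: no equal-variance assumption).
t, p_mean = stats.ttest_ind(lapua, hornady, equal_var=False)
print(f"t = {t:.2f}, p = {p_mean:.3f}  (means differ if p is small)")

# 2-variance test: ratio of the larger sample variance to the smaller (F-test).
v_hi = max(statistics.variance(lapua), statistics.variance(hornady))
v_lo = min(statistics.variance(lapua), statistics.variance(hornady))
F = v_hi / v_lo
df = len(lapua) - 1
p_var = 2 * stats.f.sf(F, df, df)   # two-sided p-value
print(f"F = {F:.2f}, p = {p_var:.3f}  (SDs differ if p is small)")
```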

However, if you want to see how well a load copes with different conditions, it would serve you well to compare the min, median* and max ES/SD/mean velocity observed over time. If your min and median* are very similar but the max is much higher, then you can figure out what conditions caused that dataset to be different. It could be that all the preceding data were collected over winter and this is the first sample added over summer, or it could be related to a new batch of powder compared to that old batch that's been sitting on your shelf for years.

*When your data is skewed and non-normal in distribution, the median value is more reliable to use than the mean. If you had an n=10 population with nine data points of 1 and one data point of 10, your average value is 1.9 but your median is 1. The value of 1.9 does not appear in your dataset and doesn't really tell you anything other than that your data is skewed, whereas the median of 1 at least tells you that your median is equal to your min, so your data could be mostly grouped around this value.
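Checking that arithmetic takes a few lines of Python:

```python
import statistics

data = [1] * 9 + [10]           # nine data points of 1 and one of 10
print(statistics.mean(data))    # 1.9 - pulled up by the lone outlier
print(statistics.median(data))  # 1   - matches where the data actually sit
```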
 
if two bullets in one case brand produce different velocities, or if the same bullet in different cases produces different velocities

Bad stats. We KNOW these distributions should be different, so a test to determine if they were the same or not would largely be coincidence. The velocities achieved by a 95 VLD and a 105 Hybrid would certainly be different, and if they weren’t, it’s not a correlation, it’s coincidence.
 
But let's not forget the important thing: no matter what the numbers show, does it shoot well on paper?

I think this is where things begin to get interesting. If we are using a front/rear rest and the stock has drop in the toe, a faster bullet leaves the muzzle before the rifle has recoiled as far, hitting the target lower, but it has less time to fall toward the ground. A slower bullet lets the rifle recoil more before exit, so the barrel points higher at exit, positively compensating for the extra time the bullet has to fall to earth.
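As a rough illustration of the two competing effects, here is a vacuum-approximation sketch in Python (hypothetical 100-yard numbers; real interior and exterior ballistics are messier, so treat this as a cartoon of the compensation idea, not a model):

```python
G = 32.17           # gravity, ft/s^2
RANGE_FT = 300.0    # 100-yard target

def drop_inches(v_fps):
    """Bullet drop over the range for a level launch (vacuum approximation)."""
    t = RANGE_FT / v_fps
    return 0.5 * G * t * t * 12.0

for v in (2900, 3000):
    print(f"{v} fps: time of flight {RANGE_FT / v * 1000:.1f} ms, "
          f"drop ~ {drop_inches(v):.2f} in")
# The slower shot is in the air longer and drops more, but it also exits later,
# after the muzzle has risen further in recoil. If the extra muzzle angle matches
# the extra drop, the two shots print together: that's positive compensation.
```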

:)
 
Bad stats. We KNOW these distributions should be different, so a test to determine if they were the same or not would largely be coincidence. The velocities achieved by a 95 VLD and a 105 Hybrid would certainly be different, and if they weren’t, it’s not a correlation, it’s coincidence.
I should have made the "two bullets" statement clear. More like comparing a 123 gr Scenar against a 123 gr A-Max. It's obvious that a 95 gr and a 105 gr bullet will have different velocities; no need to do any stats to figure that out.
 
I should have made the "two bullets" statement clear. More like comparing a 123 gr Scenar against a 123 gr A-Max. It's obvious that a 95 gr and a 105 gr bullet will have different velocities; no need to do any stats to figure that out.

Agreed, which is why the t test you proposed comparing two bullets within a given brass type does not fit the spirit of the experiment I proposed, aimed at comparing 2 brands of brass with factors of multiple bullets, multiple cartridges, and multiple powders within each brand of brass. Proving different minor factors to be coincidentally the same is wasted effort in that analysis.

So that’s a good example of what likely happened in the OP’s article. Someone had stats knowledge and knew how to run the math, but had a lesser grasp of experimental design and analytics. In a dataset which is truly one dataset, like the OP’s article, an average of ranges and an average of deviations doesn’t make sense. Why break a dataset into subsets if there isn’t any reason to do so? Maybe he WANTED to present the total dataset stats contrasted against the individual string stats, but it was cut in editing? Equally, it doesn’t make sense to run a t test to prove unity between known dissimilar factors, especially when the objective of the experiment is NOT that, but rather to compare a spectrum of factors in a more concise manner, minimizing multi-factor influences on the actual comparison being made.

Hammers can build houses. Or they can crush thumbs. A tool is only as useful as the understanding of how it is used.
 
You can take an average of averages of data. In fact, that's an essential step with some tools.

You cannot "legally" average standard deviations, at least not in the usual sense.

You can legally average ranges.
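To see what "legal" looks like in practice, here is a minimal Python sketch contrasting the naive average of SDs with the usual pooling through variances (hypothetical per-string SDs, and assuming equal string sizes):

```python
import math

# Per-string SDs (fps) from five strings of equal shot count - hypothetical numbers.
sds = [8.0, 12.0, 9.5, 11.0, 7.5]

naive  = sum(sds) / len(sds)                            # NOT a valid pooled SD
pooled = math.sqrt(sum(s * s for s in sds) / len(sds))  # sqrt of the mean variance
print(f"naive average of SDs: {naive:.2f} fps")
print(f"pooled SD (equal n):  {pooled:.2f} fps")        # always >= the naive figure
```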

Estimates of means of data, such as the average weight of powder charges, converge pretty quickly. Quite often, with a few dozen samples, you have a very good estimate of central tendency.

Estimates of dispersion typically require more data. It takes much more data to get a good estimate of standard deviation, for example, than to get a good estimate of a mean. Group size is a measure of dispersion. You will not get a good estimate of it with 5 shots.
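A small simulation makes the point. This Python sketch draws many hypothetical 5-shot strings from one population (true mean 2950 fps, true SD 10 fps, both made up) and compares how the two estimates scatter:

```python
import random
import statistics

random.seed(1)
TRUE_MEAN, TRUE_SD = 2950.0, 10.0   # hypothetical population values, fps

means, sds = [], []
for _ in range(2000):               # 2000 independent 5-shot strings
    shots = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(5)]
    means.append(statistics.mean(shots))
    sds.append(statistics.stdev(shots))

print(f"5-shot means span {min(means):.0f}-{max(means):.0f} fps (true mean 2950)")
print(f"5-shot SDs span   {min(sds):.1f}-{max(sds):.1f} fps (true SD 10)")
# Typical run: the means stay within a fraction of a percent of the truth,
# while individual 5-shot SDs can read anywhere from roughly 3 to 20 fps.
```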

To test whether the means of two samples represent different populations, we can use the T Test or we can use ANOVA, or any one of a number of other tools. To test whether two standard deviations are different, we square them and take their ratio. This is the F Test. We don't do T Tests on standard deviations.
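The t-test and F-test are sketched earlier in the thread; here is the ANOVA variant in Python with scipy, which tests several means in one pass (hypothetical 5-shot strings from three powder charges):

```python
from scipy import stats

# Hypothetical 5-shot velocity strings (fps) from three different powder charges.
a = [2940, 2948, 2945, 2951, 2943]
b = [2955, 2949, 2958, 2952, 2960]
c = [2947, 2950, 2944, 2953, 2949]

# One-way ANOVA: are any of the three means different from the others?
F, p = stats.f_oneway(a, b, c)
print(f"ANOVA: F = {F:.2f}, p = {p:.3f}")
```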

There is no reason to use the term "extreme spread". Range is well defined and understood, so we don't need to invent a new term for it.

If you want to characterize a firearm's ability to put projectiles close to target, you can shoot a large enough number of samples and calculate each shot's distance from the center of the group. Take the standard deviation of those distances, multiply by 1.96, and you have the radius of a circle that will contain about 95% of your shots. Or you can take the absolute value of each distance and find the median. That gives you the Median Absolute Deviation, which defines a circle that will contain half of your shots. Those are standard methodologies.
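Here is a minimal sketch of both recipes in Python, using hypothetical impact coordinates in inches (the 1.96 figure follows the recipe in the post above):

```python
import math
import statistics

# Hypothetical (x, y) impact coordinates in inches, relative to point of aim.
shots = [(0.3, -0.1), (-0.4, 0.5), (0.1, 0.2), (0.6, -0.3), (-0.2, -0.4),
         (0.0, 0.6), (-0.5, 0.1), (0.2, -0.6), (0.4, 0.4), (-0.1, 0.0)]

# Center of the group, then each shot's radial distance from that center.
cx = statistics.mean(x for x, _ in shots)
cy = statistics.mean(y for _, y in shots)
radii = [math.hypot(x - cx, y - cy) for x, y in shots]

print(f"1.96 x SD of radii: {1.96 * statistics.stdev(radii):.2f} in")  # ~95% circle per the post
print(f"median radius:      {statistics.median(radii):.2f} in")        # circle holding half the shots
```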
 
Agreed, which is why the t test you proposed comparing two bullets within a given brass type does not fit the spirit of the experiment I proposed, aimed at comparing 2 brands of brass with factors of multiple bullets, multiple cartridges, and multiple powders within each brand of brass. Proving different minor factors to be coincidentally the same is wasted effort in that analysis.

So that’s a good example of what likely happened in the OP’s article. Someone had stats knowledge and knew how to run the math, but had a lesser grasp of experimental design and analytics. In a dataset which is truly one dataset, like the OP’s article, an average of ranges and an average of deviations doesn’t make sense. Why break a dataset into subsets if there isn’t any reason to do so? Maybe he WANTED to present the total dataset stats contrasted against the individual string stats, but it was cut in editing? Equally, it doesn’t make sense to run a t test to prove unity between known dissimilar factors, especially when the objective of the experiment is NOT that, but rather to compare a spectrum of factors in a more concise manner, minimizing multi-factor influences on the actual comparison being made.

Hammers can build houses. Or they can crush thumbs. A tool is only as useful as the understanding of how it is used.

I disagree. Your original post stated that you proposed testing multiple brands of bullets and powders between Hornady and Lapua brass for given calibres. Let's say you used the same mass of bullet for each calibre and you tested two different powders. It makes perfect sense to make the following comparisons:

Powder 1
Bullet Brand 1
Velocity (Lapua brass) vs Velocity (Hornady brass)

Powder 1
Bullet Brand 2
Velocity (Lapua brass) vs Velocity (Hornady brass)

Powder 2
Bullet Brand 1
Velocity (Lapua brass) vs Velocity (Hornady brass)

Powder 2
Bullet Brand 2
Velocity (Lapua brass) vs Velocity (Hornady brass)

Etc.

Once you've completed this type of analysis for all combinations of bullet and powder brands you will see the bigger picture. Is Lapua brass always giving significantly higher velocities than Hornady brass whenever you test them head to head with the same powder and bullet combination? That's what you can figure out by performing the two-sample t-test between each subset. If you have ten combinations of bullets and powder then you do these tests ten times. You could, for example, conclude that in all ten instances Lapua brass produces significantly higher velocities than Hornady brass, or perhaps that velocity is only equivalent between the two brass brands for one specific combination of bullet and powder brand. In my mind such a comparison is perfectly valid as you're not pooling data from different datasets.
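A minimal sketch of that head-to-head loop, assuming the strings are stored by (powder, bullet, brass) - hypothetical data, with scipy's Welch t-test doing each comparison:

```python
from scipy import stats

# Hypothetical velocity strings (fps), keyed by (powder, bullet, brass).
data = {
    ("Powder1", "Bullet1", "Lapua"):   [2951, 2948, 2953, 2949, 2952],
    ("Powder1", "Bullet1", "Hornady"): [2940, 2946, 2938, 2944, 2941],
    ("Powder1", "Bullet2", "Lapua"):   [2902, 2899, 2905, 2901, 2903],
    ("Powder1", "Bullet2", "Hornady"): [2898, 2903, 2896, 2901, 2899],
    # ... every other powder/bullet combination tested in both brands of brass
}

# One t-test per powder/bullet combination, always Lapua vs Hornady.
for powder, bullet in sorted({(p, b) for p, b, _ in data}):
    t, p_val = stats.ttest_ind(data[(powder, bullet, "Lapua")],
                               data[(powder, bullet, "Hornady")],
                               equal_var=False)
    print(f"{powder} / {bullet}: t = {t:.2f}, p = {p_val:.3f}")
```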
 
As others have implied/said, you can quickly run into misrepresentations of the results if you average averages.

The classic problem is two laps on a 1-mile race track. The first lap averages 30 MPH, the second lap averages 90 MPH. The naive average of the lap averages says 60 MPH, but the true average speed for the entire 2 laps (2 miles) is 45 MPH: total distance divided by total time.
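Spelled out in a few lines of Python:

```python
# Two 1-mile laps at 30 MPH and 90 MPH.
t1 = 1.0 / 30.0                 # hours for lap 1 (2 minutes)
t2 = 1.0 / 90.0                 # hours for lap 2 (40 seconds)

naive = (30.0 + 90.0) / 2.0     # average of the lap averages: 60 MPH - wrong
true  = 2.0 / (t1 + t2)         # total distance / total time: 45 MPH
print(naive, true)
```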
 
@WelshShooter - here’s the basis for experimentation (in the hypothetical comparison I proposed as justification for using an “average ES” in results presentation), quoting myself:

Let’s say I want to know if Hornady brass is as good as Lapua brass.

Enjoy your T test to prove correlation between the velocities of dissimilar bullet weights.
 
WelshShooter has made some important comments about analyzing multiple input variables.

There is an elegant way to test multiple variables that is much more efficient than the T Test. In fact, you can test 5 different input variables and all of their possible interactions for about the same number of samples as a T Test. With extensions of the same basic idea, you can easily do up to 31 different input variables at a time, though there is seldom reason to go beyond 7.

The magic is in using a balanced design. One example is like this....

BRASS     POWDER    PRIMER    MUZZLE VELOCITY
Hornady   AA5       CCI
Lapua     AA5       CCI
Hornady   PPistol   CCI
Lapua     PPistol   CCI
Hornady   AA5       Fed
Lapua     AA5       Fed
Hornady   PPistol   Fed
Lapua     PPistol   Fed

For the three input variables, this block contains all possible combinations, and exactly the same number of each combination - in this case, one of each. This block might be repeated 2-5 times. Basically, you load up a Lapua case with AA5 and a CCI primer and measure its MUZZLE VELOCITY. Then you move on through the list, recording the MUZZLE VELOCITY for each.

The magic is that because of the structure, none of the input variables interfere with the others. It's a slick way to crack multivariate problems. We used this to fix the M855 problems at Lake City.
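As a sketch of how the balanced block pays off in the analysis, here is some Python that generates the design with itertools and reads off each factor's level means; the muzzle velocities are hypothetical placeholders:

```python
from itertools import product
import statistics

brass   = ["Hornady", "Lapua"]
powders = ["AA5", "PPistol"]
primers = ["CCI", "Fed"]

# Full factorial: every combination appears exactly once per replicate block.
runs = list(product(brass, powders, primers))

# Hypothetical muzzle velocities (fps), one per run, in the order of `runs`.
mv = [1030, 995, 1028, 992, 1042, 1004, 1045, 1008]
results = dict(zip(runs, mv))

# Because the design is balanced, comparing the mean of each factor's levels
# isolates that factor: the other factors appear equally often on both sides.
for i, name in enumerate(["BRASS", "POWDER", "PRIMER"]):
    means = {level: statistics.mean(v for r, v in results.items() if r[i] == level)
             for level in sorted({r[i] for r in runs})}
    print(name, {level: round(m, 1) for level, m in means.items()})
```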
 
And when you throw two or three bullet weights into each of those powder and primer combinations, knowing you’ll have disparate results for MV (the key output for comparison) across those different weights, and then add to the experimental set the same multivariate combinations with different cartridges, a statistical test for muzzle velocity congruence among a set really isn’t the right tool.

Either you’ll prove that different cartridges with different bullet weights coincidentally have the same MV, or you’ll prove the trivial observation that different cartridges with different bullet weights have different MVs, and you’ll have done a bunch of math without taking any step toward determining - on average - whether Hornady brass is as good as Lapua brass (with MV consistency as a key metric).
 