So what I gather you are saying is that the data is crap--it is erroneous. If that is what you are saying, why don't you simply say it?
No, anything but. All data has errors. Only understanding those errors can allow us to draw any meaningful conclusions about it. And only statistics allows us to understand those errors.
Whether or not they are real? They are either real or not real. They can't be both.
All data has errors. Perhaps one morning Joe the temperature reader went to the station to record the temp; he was distracted and wrote a 5 where a 3 should be. Or the machine had a hiccup (instruments do): the mercury separated and gave an erroneous signal.
Then on some other day there actually was a cold front moving through and there really was a difference of several degrees between the two stations.
I will flat out state that I think they are NOT real. Do you think they are REAL? That is what I can't get from you.
I have now clearly explained this at least 2 or 3 times. Some may be real, but many may be pure error.
And if they are not real, how can you pretend that we are measuring the temperature properly?
Well, again, I must ask you to address the question I have asked you repeatedly: in the data you work with, do you have absolutely no errors? None whatsoever?
Agreed that 20 deg deltas are rare. But why can't you say they are simply WRONG?
Perhaps I am not making myself clear, but when I looked back over my posts I noted that I am constantly referring to the word error. Error is just that: error. Wrong measurements.
Here's a quote I found discussing error in statistics from Bland and Altman in the British Medical Journal (BMJ):
Several measurements of the same quantity on the same subject will not in general be the same. This may be because of natural variation in the subject, variation in the measurement process, or both.
Bland and Altman, "Statistics Notes: Measurement error," BMJ 313(7059): 744.
It is irrational to assume that any system measured and recorded by machines and humans will ever have a perfect score and never measure with error.
You have a real problem saying that this data set is crap. Why is that?
Because it has yet to be proven to be bad. That is why statistics is so vitally important. It can help us differentiate bad from good data.
According to my calculation the SD is 4 deg.
In the present case that is likely because of the heavy tails. If this were a normal distribution, about 68% of the observations would fall within 1 standard deviation of the mean. Here, the standard deviation is much more heavily influenced by the outliers in those heavy tails.
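Here is a rough sketch of that point, using simulated numbers rather than the actual station data: with a roughly normal sample about 68% of values fall within 1 SD, but with a heavy-tailed sample the outliers inflate the SD itself, so a figure like "the SD is 4 deg" tells you less than it appears to.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical samples of station-to-station temperature differences (deg F):
# one normal, one heavy-tailed (Student's t with 2 degrees of freedom), same scale.
normal_diffs = rng.normal(loc=0.0, scale=2.0, size=10_000)
heavy_diffs = 2.0 * rng.standard_t(df=2, size=10_000)

for name, diffs in (("normal", normal_diffs), ("heavy-tailed", heavy_diffs)):
    sd = diffs.std()
    within_one_sd = np.mean(np.abs(diffs - diffs.mean()) < sd)
    print(f"{name:12s} SD = {sd:6.2f}   fraction within 1 SD = {within_one_sd:.2f}")
```

In the heavy-tailed case the SD is dragged up by a handful of extreme values, which is exactly why the raw SD alone says little about how good the typical measurement is.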
I work with lots of different kinds of data sets. Seismic data, which is converted into porosity data via statistical procedures. I work with pressure data, production data, and well log data, so I do lots and lots with data. For 5 years I was in charge of reservoir modeling (fluid flow) for Kerr-McGee. Co-kriging properties derived from seismic and well data for input to models is our stock in trade. If I tell you that I was a director of Technology for Kerr-McGee Oil and Gas Corp, Thaumaturgy will scream that I am bragging. But it is a fact that I held that position for 2 years before moving to China as Exploration director--a much better position.
And that is all very impressive. So I am even more curious as to what kind of data sets you use that you don't have any error or that you cannot allow for any error.
I understand that in the oil field you often have people (geologists) sit beside the well and write down what kind of material is coming up from the well to tell them where they are in the drilling process (which formation?). Do you think they are able to inerrantly determine the exact (down to the inch) point where the formation changes from a shale to a siltstone and record it perfectly in their "log"?
If you send a geophysical probe down the well does it always work perfectly to tell you exactly what you are looking at?
That isn't the issue. I think you have entirely missed the reason I analyze two closely spaced towns. Science is supposed to be repeatable.
All data has errors.
But you can't go back and repeat the measurement of temperature in Alice, Texas on July 8th, 1952. We don't have time machines. But we can measure the temperature in Corpus Christi on the same day and figure that it ought to be close to that of Alice. I view my closely spaced temperature measurements as a check on how accurate the system is. And I conclude it is highly inaccurate.
In that case the example of the two towns in Iowa is very good. It says that there is a high likelihood that the two towns agree to within about 1 degree F. BUT your graph has also shown that one town is consistently higher by that 1 degree or so, so the difference can be corrected and it really isn't all that problematic.
In a sense the two towns are pretty good replicates. Not perfect, but then, all data has errors.
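A minimal sketch of that correction, with made-up daily highs standing in for the two Iowa towns: subtracting the mean day-to-day offset removes the consistent bias, and what remains is the residual scatter that actually bears on measurement error.

```python
import numpy as np

# Hypothetical daily highs (deg F) for two nearby towns; real station data would go here.
town_a = np.array([71.0, 73.5, 68.0, 75.0, 72.5, 70.0])
town_b = np.array([70.2, 72.3, 67.1, 73.8, 71.4, 69.1])

daily_diff = town_a - town_b          # day-by-day differences between the sites
offset = daily_diff.mean()            # the consistent ~1 deg F bias of one site over the other
residual = daily_diff - offset        # what is left after removing that bias

print(f"mean offset: {offset:.2f} deg F")
print(f"residual spread (SD): {residual.std(ddof=1):.2f} deg F")
```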
Every day someone reads the temperature in a town. How do you propose verifying that he/she read it correctly if you don't compare him/her to a nearby town?
You can't. People are human. In fact, people are the weak link in the chain and are probably responsible for much of the real error. But we are also using machines, and they have problems at times.
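One rough way to do that nearby-town comparison in practice, sketched here with invented readings: flag the days where the two stations disagree far more than they usually do, using a robust spread measure so the flagging threshold is not dragged around by the very blunders being hunted.

```python
import numpy as np

def flag_suspect_days(station_a, station_b, k=3.0):
    """Flag days where two nearby stations disagree far more than usual.

    Assumes the inputs are aligned daily readings (deg F). Uses the median
    and MAD rather than mean/SD so the threshold itself is not inflated
    by the outliers we are trying to catch.
    """
    diff = np.asarray(station_a, dtype=float) - np.asarray(station_b, dtype=float)
    med = np.median(diff)
    mad = np.median(np.abs(diff - med))
    robust_sd = 1.4826 * mad            # MAD scaled to be comparable to an SD
    return np.abs(diff - med) > k * robust_sd

# Hypothetical readings: one day contains a gross error (a mis-read or mis-written value).
a = [71.0, 73.5, 68.0, 75.0, 92.5, 70.0]
b = [70.2, 72.3, 67.1, 73.8, 71.4, 69.1]
print(flag_suspect_days(a, b))          # only the grossly discrepant day is flagged
```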
Being explicit isn't the issue in my mind. The issue is understanding that the world is not merely statistical. It is also physical, and certain situations are ruled out by physical reality.
But you cannot draw physical conclusions based on erroneous data. My point is not dissimilar from yours. But my larger point is that you cannot understand how far off your physical interpretations are unless you understand the statistical nature of the data.
Physical interpretations only make sense when you are sure you are looking at the real signal. And in the case of a pile of data this size, the real signal can only be understood statistically.
I appreciate the education on this site, but the issue isn't an inability to write equations.
If statistics plays a role in parsing this much data then I simply must have a statistical reason why the statistics have failed (or I have failed to run the statistics correctly).
If the data says something, good or bad, then the statistics will bear that out. It won't hide it. Statistics is precisely how we make data work.
Ignoring the statistics is precisely how we go with "gut feelings" about what we think is or isn't there.
It is an inability to get you to understand that if two towns have a 20,000 deg F difference in temperature, it isn't realistic. It isn't governed by normal statistical laws.
If there's error in the data then statistics plays a role. In fact it plays the central role. You cannot draw any conclusions about anything until you understand how bad the error is. And since all data has error, you are stuck with the error terms.
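The two positions are not actually in conflict, and a short sketch (with invented bounds and readings, not any official QC rule) shows how they combine: physically impossible values are screened out first, and statistics is then run on what survives to quantify the ordinary error.

```python
import numpy as np

# Illustrative plausibility screen: surface air temperatures (deg F) far outside
# the physically possible range are rejected before any statistics are computed.
LOW_F, HIGH_F = -130.0, 135.0   # rough bounds bracketing recorded surface extremes

def physically_plausible(readings_f):
    """Return a boolean mask of readings that pass the gross physical check."""
    t = np.asarray(readings_f, dtype=float)
    return (t >= LOW_F) & (t <= HIGH_F)

readings = [72.0, 68.5, 20_071.0, 75.2]   # the absurd value mimics a 20,000 deg F blunder
mask = physically_plausible(readings)
print(np.asarray(readings)[mask])          # the error statistics are then run on what survives
```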
This is why the saying "Liars can figure, but figures can lie" became popular.
I don't perceive calculating the average to be a "lie". It was certainly not my intention to lie. I was merely posting a comparison.