I haven't read much of Philip K. Janert's Data Analysis with Open Source Tools yet, but I think I'll get farther into it as I sit in on a statistics class. Even though he says "open source" instead of "free software", I like his attitudes in other areas. He reminds me of an endearing comment by Hadley Wickham about caring less about Big Data and more about somebody trying to learn about a swamp fly or something. It's all about exploring a "sense of wonder", I guess.
One of these quotes I've been trying to find again for years:
“Works are of value only if they give rise to better ones.” (Alexander von Humboldt, writing to Charles Darwin, 18 September 1839)
I thought it was from Charles K. Ogden of the famed 850-word Basic English system.
And here is where the old interview with Hadley Wickham came to mind, with this appreciation of curiosity, of exploring a sense of wonder...
Data analysis, as I understand it, is not a fixed set of techniques. It is a way of life, and it has a name: curiosity. There is always something else to find out and something more to learn. This book is not the last word on the matter; it is merely a snapshot in time: things I knew about and found useful today...
And something I have to keep in mind while thinking about the statistics class:
More data analysis efforts seem to go bad because of an excess of sophistication rather than a lack of it....
I strongly recommend that you make it a habit to avoid all statistical language. Keep it simple and stick to what you know for sure. There is absolutely nothing wrong with speaking of the “range over which points spread,” because this phrase means exactly what it says... Once we start talking about “standard deviations,” this clarity is gone. Are we still talking about the observed width of the distribution? Or are we talking about one specific measure for this width? (The standard deviation is only one of several that are available.) Are we already making an implicit assumption about the nature of the distribution? (The standard deviation is only suitable under certain conditions, which are often not fulfilled in practice.)...
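To see for myself what "only one of several" means, here is a minimal sketch that computes four different measures of width for the same small sample. It is my own illustration, not something from the book: the function name `spread_measures` and the sample numbers are made up, and it only uses Python's standard `statistics` module.

```python
import statistics


def spread_measures(values):
    """Return several different answers to 'how widely do the points spread?'"""
    ordered = sorted(values)
    # Quartile cut points (default "exclusive" method of statistics.quantiles).
    q1, _, q3 = statistics.quantiles(ordered, n=4)
    median = statistics.median(ordered)
    # Absolute distance of each point from the median.
    abs_deviations = [abs(x - median) for x in ordered]
    return {
        "range": ordered[-1] - ordered[0],            # largest minus smallest
        "iqr": q3 - q1,                               # width of the middle half
        "stdev": statistics.stdev(ordered),           # sample standard deviation
        "mad": statistics.median(abs_deviations),     # median absolute deviation
    }


if __name__ == "__main__":
    # Hypothetical sample: nine ordinary points and one stray value.
    sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 40]
    for name, value in spread_measures(sample).items():
        print(f"{name:>5}: {value:.2f}")

    # The same sample without the stray value, for comparison.
    for name, value in spread_measures(sample[:-1]).items():
        print(f"{name:>5} (no outlier): {value:.2f}")
```

With one stray value in the sample, the range and the standard deviation jump while the interquartile range and the median absolute deviation barely move. So "the width" depends on which measure you quietly picked, which seems to be the kind of hidden assumption the passage warns about.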
And these lines made me think of the recent AI (SALAMI) bubbles. It's all based on probability math, right?
Lancelot Hogben of Mathematics for the Million fame is funny on "the theory of so-called probability": "... its unsavoury origin is on record. The first impetus came from a situation in which the dissolute nobility of France were competing in a race to ruin at the gaming tables." Hogben's sentence got me thinking of how water- and energy-guzzling AI (SALAMI like ChatGPT, etc.) is rushing our race to ruin with the climate, ruining our earth's ecology in so many ways... And why the pattern seems to scale, 'the complexities of the “solution” overwhelm the original problem, and nothing gets accomplished':
I think the primary reason for this tendency to make data analysis projects more complicated than they are is discomfort: discomfort with an unfamiliar problem space and uncertainty about how to proceed. This discomfort and uncertainty creates a desire to bring in the “big guns”: fancy terminology, heavy machinery, large projects. In reality, of course, the opposite is true: the complexities of the “solution” overwhelm the original problem, and nothing gets accomplished.
I like this attitude too:
In my view, a specific, well-argued position is of greater use than a sterile laundry list of possible algorithms—even if you later decide to disagree with me. The value is not in the opinion but rather in the arguments leading up to it. If your arguments are better than mine, or even just more agreeable to you, then I will have achieved my purpose!
And the rules seem reasonable as guidance in a variety of fields and endeavors:
- Simple is better than complex.
- Cheap is better than expensive.
- Explicit is better than opaque.
- Purpose is more important than process.
- Insight is more important than precision.
- Understanding is more important than technique.
- Think more, work less. --- Philip K. Janert
I've only read the preface so far, but it has reminded me of one of the "classic articles on statistics" on Edward Tufte's site, Austin Bradford Hill's Association or Causation?:
... the glitter of the t table diverts attention from the inadequacies of the fare. Only a tithe, and an unknown tithe, of the factory personnel volunteer for some procedure or interview, 20% of patients treated in some particular way are lost to sight, 30% of a randomly-drawn sample are never contacted. The sample may, indeed, be akin to that of the man who, according to Swift, ‘had a mind to sell his house and carried a piece of brick in his pocket, which he showed as a pattern to encourage purchasers.’...
... I suspect we waste a deal of time, we grasp the shadow and lose the substance, we weaken our capacity to interpret the data and to take reasonable decisions whatever the value of P. And far too often we deduce ‘no difference’ from ‘no significant difference.’ Like fire, the chi-squared test is an excellent servant and a bad master. --- Austin Bradford Hill
#DataAnalysis #StatisticsClass #StatisticsAdvice #DataAnalysisRules #PhilipKJanert #AustinBradfordHill #BradfordHill