- Autocorrelation, serial correlation, bang-bang duplicates, and pseudo-replication all describe data points that follow one another without being independent new measurements
- There is a difference between bringing questions to a database to be answered and poking around a database looking for interesting answers to questions you had not asked
- Early screening boosts “survival time” in useless ways: an earlier diagnosis starts the clock sooner without necessarily delaying death
- Adjust measures to avoid errors rather than modeling them away
- “Guide to Bad Data” by Chris Groskopf lists warning signs such as:
- Values missing
- Zeros replace missing values
- Data missing that you know should be there
- Rows or values duplicated
- Totals differ from aggregates
- Suspicious values present
- Spreadsheet has exactly 65,536 rows or 255 columns, which suggests an export hit a tool's limit and was truncated
- Margin of error too large or unknown
- Benford’s Law fails
- Too good to be true
- Fix bad names immediately
- Survivor bias: “Most medieval castles were made of wood,” but the stone ones are what survived to be seen
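
Several of the checks above are easy to automate. The following is a minimal Python sketch, not code from Groskopf's guide: the function names (`lag1_autocorrelation`, `leading_digit`, `benford_max_deviation`, `quick_checks`) and the sample rows are hypothetical, and it assumes the data arrives as a list of dicts that share the same keys. It covers serial correlation, Benford's Law, duplicated rows, missing values, zeros that may stand in for missing values, and the spreadsheet row and column limits.

```python
import math
from collections import Counter

# Sizes that match old spreadsheet limits; hitting one exactly hints at truncation.
SUSPICIOUS_ROW_COUNTS = {65536}
SUSPICIOUS_COLUMN_COUNTS = {255}


def lag1_autocorrelation(values):
    """Correlation of each value with the next one (near 0 for independent data)."""
    mean = sum(values) / len(values)
    denom = sum((v - mean) ** 2 for v in values)
    if denom == 0:
        return 0.0  # constant series: no variation to correlate
    num = sum((a - mean) * (b - mean) for a, b in zip(values, values[1:]))
    return num / denom


def leading_digit(x):
    """First significant digit of a nonzero number, read from scientific notation."""
    return int(f"{abs(x):.15e}"[0])


def benford_max_deviation(values):
    """Largest gap between observed and Benford-expected leading-digit shares."""
    digits = [leading_digit(v) for v in values if v]  # zeros have no leading digit
    if not digits:
        return None
    counts = Counter(digits)
    return max(
        abs(counts[d] / len(digits) - math.log10(1 + 1 / d)) for d in range(1, 10)
    )


def quick_checks(rows):
    """Return a list of human-readable warnings for a list of same-keyed dicts."""
    warnings = []
    if len(rows) in SUSPICIOUS_ROW_COUNTS:
        warnings.append("row count matches a spreadsheet limit")
    if rows and len(rows[0]) in SUSPICIOUS_COLUMN_COUNTS:
        warnings.append("column count matches a spreadsheet limit")

    # Count rows that repeat an earlier row exactly.
    seen = Counter(tuple(sorted(r.items())) for r in rows)
    duplicates = sum(c - 1 for c in seen.values() if c > 1)
    if duplicates:
        warnings.append(f"{duplicates} duplicated row(s)")

    for column in (rows[0] if rows else []):
        values = [r[column] for r in rows]
        missing = sum(v is None or v == "" for v in values)
        zeros = sum(v == 0 for v in values)
        if missing:
            warnings.append(f"column {column!r}: {missing} missing value(s)")
        if zeros:
            warnings.append(
                f"column {column!r}: {zeros} zero(s), possibly replacing missing values"
            )
    return warnings


# Hypothetical sample data: one missing amount, two zeros, one duplicated row.
sample_rows = [
    {"id": 1, "amount": 120.0},
    {"id": 2, "amount": None},
    {"id": 3, "amount": 0},
    {"id": 3, "amount": 0},
]
for warning in quick_checks(sample_rows):
    print(warning)
```

A large Benford deviation or a high lag-1 autocorrelation is only a prompt to look closer, not proof of bad data: many legitimate datasets (bounded measurements, assigned IDs, time series) fail both tests for innocent reasons.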