quartz / bad-data-guide
An exhaustive reference to problems seen in real-world data along with suggestions on how to resolve them.
bad-data-guide is a success (congratulations!!) and there are many translations... How about adding a "translations" folder here?
Technically the yearly total divided by 12 is the monthly average. Explain why this is a terrible idea.
Explanation from an email I wrote:
With regard to your specific question about "average per month," I think I do need to clarify that section of the guide. Of course the yearly total divided by 12 is technically the monthly average. My point was that if you have no idea of the variation from month to month creating an average can be extremely misleading. Without knowing the distribution of the data it's entirely possible the entire total happened in the first day of the year, which would make expressing it as a monthly average absolutely incorrect.
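The point above can be shown with a small sketch. The numbers are hypothetical: two series with the same yearly total, one spread evenly and one concentrated in a single month, yield the identical "monthly average" while describing completely different situations.

```python
# Hypothetical data: both series total 120,000 over the year, so the
# naive "monthly average" is 10,000 for each -- but the distributions
# are nothing alike.
even_year = [10_000] * 12            # steady activity all year
spiky_year = [120_000] + [0] * 11    # the entire total happened in January

for series in (even_year, spiky_year):
    avg = sum(series) / 12
    # the min/max spread is what the average hides
    print(f"avg={avg:,.0f}  min={min(series):,}  max={max(series):,}")
```

Reporting the spread (or at least the min and max) alongside any average is a cheap way to surface this problem.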
Excel can interpret a text value as scientific notation. For example, if something had the serial code 100E10, Excel will convert it to the number 1E+12.
Via Chris Wright: "I wanted to add that the date 1960-01-01 can also be a suspicious date as this is the 0 date for a data manipulation program called SAS. It looks like your guide is aimed at journalists and amateur data investigators, so probably less common for people in those situations."
Excel has the "feature" that anything remotely resembling a date is forcibly converted to a date. It is usually not possible to figure out which string was originally entered, and the data type of the cell is wrong afterwards as well.
My first encounter is with CAS registry numbers (see https://en.wikipedia.org/wiki/CAS_Registry_Number)
The problem is also in genetics: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7
By the way, OpenOffice fully reimplemented this feature, likewise without any way to turn it off.
A nicety is that this conversion is applied not only to manually entered data, but also to pasted data and, to make things fun, to data written by apps using Excel's COM interface.
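One common defense against silent coercion is to keep the data out of spreadsheet software entirely and read every field as plain text, converting individual columns deliberately afterwards. A minimal sketch with the standard-library `csv` module, using made-up gene-symbol data of the kind Excel would mangle:

```python
import csv
import io

# Hypothetical CSV: MARCH1 and SEPT9 are gene symbols that Excel
# would silently convert to dates on open.
raw = io.StringIO("gene,count\nMARCH1,12\nSEPT9,7\n")

# csv.DictReader returns every value as a plain string -- nothing is
# guessed or coerced.
rows = list(csv.DictReader(raw))
print(rows[0]["gene"])            # still the literal text "MARCH1"

# Convert only the columns you intend to, explicitly.
counts = [int(r["count"]) for r in rows]
```

The same principle applies to richer tools: import everything as text first, then opt in to type conversion column by column.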
I have seen multiple situations where data is logged chronologically over time. A current example is song play counts, or YouTube views. If the data is logged day after day these numbers should stay the same or, more likely, increase. So when you see them decrease, the data is suspect.
Sometimes the data can be "repaired" by removing the wild swings, but that is not always simple, depending on the bandwidth of "normal" data.
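Flagging the suspect points is straightforward: a cumulative series should never decrease, so any index where the value drops below its predecessor deserves a look. A sketch with hypothetical view counts:

```python
# Hypothetical cumulative view counts logged day after day; the drop
# to 180 should be impossible for a cumulative metric.
views = [100, 150, 150, 210, 180, 260]

# Indices where the series decreases relative to the previous day.
drops = [i for i in range(1, len(views)) if views[i] < views[i - 1]]
print(drops)   # -> [4]
```

Repairing the data is a separate, harder decision; this check only tells you where to look.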
Is the margin of error in the document at the 95% confidence level?
Copying over from emailed note
I did want to add to your ‘Zeros replace missing values’; in my time as an ETL developer (before going back into academia) I have come across the following as common entries for missing values:
-1
99
...
-99999
...
-99
00000
Basically, stuff the field with as many 9s as it will hold or, if possible, make it negative, as that is 'clearly' a nonsensical value. Never bother documenting any of these decisions. I've actually never come across 0 for missing, but that might be the type of data I work with. Point being, I would suggest emphasising that distributions and summaries, as well as specific tests on each row for 'reasonableness' (e.g. 'break if {value} < 0'), are a key part of most processing and analysis regardless of the quality of the data itself.
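A per-row reasonableness test of the kind described above can be sketched as follows. The sentinel set and the `v < 0` rule are assumptions for illustration; the right values depend entirely on your data (a negative balance may be perfectly valid elsewhere).

```python
# Common undocumented missing-value sentinels seen in the wild --
# this set is an assumption, not a standard; tune it per dataset.
SENTINELS = {-1, 99, -99, 9999, -9999, 99999, -99999}

def suspicious(values):
    """Return (index, value) pairs that look like sentinel 'missing' codes.

    The v < 0 rule assumes the column cannot legitimately be negative.
    """
    flagged = []
    for i, v in enumerate(values):
        if v in SENTINELS or v < 0:
            flagged.append((i, v))
    return flagged

print(suspicious([12, -1, 37, 99999, 41]))   # -> [(1, -1), (3, 99999)]
```

Plotting the distribution first usually reveals which sentinels a given dataset actually uses; they show up as isolated spikes far from the rest of the values.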
Is the 24 in 1969-12-31T24:59:59Z correct? (This timestamp appears twice.) I suppose that 23 may be correct, since ISO 8601 only allows hour 24 in the form 24:00:00 exactly.
Many geo data issues are already covered in other sections - especially the entered-by-humans part - but there are some common quirks that might be worth mentioning?
If a sample "fails to cover the entire population", it's biased. In my opinion "Sample is not random" and "Sample is biased" should be the same point.
I think the Chinese version page (http://djchina.org/2016/07/12/bad_data_guide/) cannot be opened. So maybe you could check the corresponding URL or remove this row?
Aren't they actually December 31st, 1899, or something similar slightly off, due to the leap-year bug (Excel treats 1900 as a leap year for Lotus 1-2-3 compatibility)?
Excel import will convert text that looks somewhat like a date to a date; this has been noted as a common problem with human genetic data, as several gene symbols resemble dates (DEC1, SEPT9, MARCH1). Best defense: be suspicious of columns which have a mix of numeric (or date) and text data, particularly if the fraction of numerics (or dates) is very low.
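The "mixed column" heuristic above can be automated: in a column that should be pure text, count how many values parse as dates, and treat a small non-zero fraction as a red flag. The data and the single date format checked are assumptions for illustration; real checks should try the handful of formats Excel emits in your locale.

```python
from datetime import datetime

# Hypothetical gene-symbol column where one entry has been silently
# converted to a date by a spreadsheet round-trip.
column = ["TP53", "BRCA1", "2019-09-01", "MYC"]

def looks_converted(value):
    """True if the value parses as an ISO date (one format, for brevity)."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False

fraction = sum(looks_converted(v) for v in column) / len(column)
print(fraction)   # 0.25 -- low but non-zero: exactly the suspicious case
```

A fraction near 1.0 usually means the column really is dates; a fraction near zero but above it is the pattern that indicates silent conversion.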
Via Chris Wright: "Also, I'm always suspicious of row counts that are multiples of 1000 or 100 as these are often sample data sets and therefore missing records. I generally check with the provider when that happens."
For a university project I will be making some edits to your documentation, aiming to make the writing more technical.
Please add edits if you would like :)
In addition to China and Pakistan, Rwanda also mandates that its parliament be 30% women, though it tends to do much better than this figure.
Via Chris Wright: "Finally, record counts in general should be checked when receiving and loading data. Excel gives people the option to add new lines within cells; this is stored as a Line Feed (LF) character (at least under Windows, where I work). Some applications reading this in will take everything after that as a new record, potentially resulting in data being loaded into the wrong columns if you're loading into a database. Another fun trick is when you end up with an end-of-file character embedded in a text string. I've yet to work out how on earth these end up in the files (it's happened to me maybe 3 times over the last 5 years), but these essentially tell the process reading the data that it has reached the end of the file and to stop reading it there. The ASCII code for it resolves to CTRL+Z, so my current working theory is that the source system is capturing people undoing a typo. I've never been able to replicate this, though. In both cases, knowing up front how many records you are expecting, and counting the number of records you've loaded into your working system, captures these problems."
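The embedded-LF failure mode above is easy to demonstrate: a naive newline count disagrees with a quote-aware record count whenever a cell contains a line feed. A sketch with hypothetical data, using the standard-library `csv` module as the quote-aware reader:

```python
import csv
import io

# Hypothetical file: the second record has an embedded line feed
# inside a quoted cell, as Excel produces with Alt+Enter.
raw = 'id,comment\n1,"fine"\n2,"line one\nline two"\n'

# A naive reader counts physical lines: 3 "records" after the header.
naive_lines = raw.count("\n") - 1

# A CSV-aware reader respects quoting: 2 actual records.
records = len(list(csv.reader(io.StringIO(raw)))) - 1

print(naive_lines, records)   # -> 3 2
if naive_lines != records:
    print("embedded newlines present -- verify the expected record count")
```

Comparing both counts against the number of records the provider says to expect catches the wrong-column loading problem before it reaches the database.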