quartz / bad-data-guide
An exhaustive reference to problems seen in real-world data along with suggestions on how to resolve them.
bad-data-guide is a success (congratulations!!) and there are many translations... How about adding a "translations" folder here?
Technically the yearly total divided by 12 is the monthly average. Explain why this is a terrible idea.
Explanation from an email I wrote:
With regard to your specific question about "average per month," I think I do need to clarify that section of the guide. Of course the yearly total divided by 12 is technically the monthly average. My point was that if you have no idea of the variation from month to month creating an average can be extremely misleading. Without knowing the distribution of the data it's entirely possible the entire total happened in the first day of the year, which would make expressing it as a monthly average absolutely incorrect.
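The point above can be shown with a small sketch. The numbers are hypothetical: two series with the same yearly total, one spread evenly and one concentrated in a single month, yield the identical "monthly average" while describing completely different situations.

```python
# Hypothetical data: both series total 120,000 over the year, so the
# naive "monthly average" is 10,000 for each -- but the distributions
# are nothing alike.
even_year = [10_000] * 12            # steady activity all year
spiky_year = [120_000] + [0] * 11    # the entire total happened in January

for series in (even_year, spiky_year):
    avg = sum(series) / 12
    # the min/max spread is what the average hides
    print(f"avg={avg:,.0f}  min={min(series):,}  max={max(series):,}")
```

Reporting the spread (or at least the min and max) alongside any average is a cheap way to surface this problem.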
Excel can interpret a text value as scientific notation. For example, if something had the serial code 100E10, Excel will convert it to the number 1E+12.
Via Chris Wright: "I wanted to add that the date 1960-01-01 can also be a suspicious date as this is the 0 date for a data manipulation program called SAS. It looks like your guide is aimed at journalists and amateur data investigators, so probably less common for people in those situations."
Excel has the "feature" that anything remotely resembling a date is forcibly converted to a date. It is usually not possible to figure out which string was originally entered, and the data type of the cell is wrong afterwards as well.
My first encounter is with CAS registry numbers (see https://en.wikipedia.org/wiki/CAS_Registry_Number)
The problem is also in genetics: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7
By the way, OpenOffice fully reimplemented this feature, likewise without any way to turn it off.
A nicety is that this conversion is applied not only to manually entered data, but also to pasted data and, to make things fun, to data written by apps using Excel's COM interface.
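One common defense against silent coercion is to keep the data out of spreadsheet software entirely and read every field as plain text, converting individual columns deliberately afterwards. A minimal sketch with the standard-library `csv` module, using made-up gene-symbol data of the kind Excel would mangle:

```python
import csv
import io

# Hypothetical CSV: MARCH1 and SEPT9 are gene symbols that Excel
# would silently convert to dates on open.
raw = io.StringIO("gene,count\nMARCH1,12\nSEPT9,7\n")

# csv.DictReader returns every value as a plain string -- nothing is
# guessed or coerced.
rows = list(csv.DictReader(raw))
print(rows[0]["gene"])            # still the literal text "MARCH1"

# Convert only the columns you intend to, explicitly.
counts = [int(r["count"]) for r in rows]
```

The same principle applies to richer tools: import everything as text first, then opt in to type conversion column by column.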
I have seen multiple situations where data is logged chronologically over time. A current example is song play counts, or YouTube views. If the data is logged day after day these numbers should stay the same or, more likely, increase. So when you see them decrease, the data is suspect.
Sometimes the data can be "repaired" by removing the wild swings, but that is not always simple, depending on the bandwidth of "normal" data.
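Flagging the suspect points is straightforward: a cumulative series should never decrease, so any index where the value drops below its predecessor deserves a look. A sketch with hypothetical view counts:

```python
# Hypothetical cumulative view counts logged day after day; the drop
# to 180 should be impossible for a cumulative metric.
views = [100, 150, 150, 210, 180, 260]

# Indices where the series decreases relative to the previous day.
drops = [i for i in range(1, len(views)) if views[i] < views[i - 1]]
print(drops)   # -> [4]
```

Repairing the data is a separate, harder decision; this check only tells you where to look.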
Is the margin of error in the document at the 95% confidence level?
Copying over from emailed note
I did want to add to your ‘Zeros replace missing values’; in my time as an ETL developer (before going back into academia) I have come across the following as common entries for missing values:
-1
99
...
-99999
...
-99
00000
Basically, stuff the field with as many 9s as it will hold or, if possible, make it negative, as that is 'clearly' a nonsensical value. Never bother documenting any of these decisions. I've actually never come across 0 for missing, but that might be the type of data I work with. Point being, I would suggest emphasising that distributions and summaries, as well as specific tests on each row for 'reasonableness' (e.g. 'break if {value} < 0'), are a key part of most processing and analysis regardless of the quality of the data itself.
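A per-row reasonableness test of the kind described above can be sketched as follows. The sentinel set and the `v < 0` rule are assumptions for illustration; the right values depend entirely on your data (a negative balance may be perfectly valid elsewhere).

```python
# Common undocumented missing-value sentinels seen in the wild --
# this set is an assumption, not a standard; tune it per dataset.
SENTINELS = {-1, 99, -99, 9999, -9999, 99999, -99999}

def suspicious(values):
    """Return (index, value) pairs that look like sentinel 'missing' codes.

    The v < 0 rule assumes the column cannot legitimately be negative.
    """
    flagged = []
    for i, v in enumerate(values):
        if v in SENTINELS or v < 0:
            flagged.append((i, v))
    return flagged

print(suspicious([12, -1, 37, 99999, 41]))   # -> [(1, -1), (3, 99999)]
```

Plotting the distribution first usually reveals which sentinels a given dataset actually uses; they show up as isolated spikes far from the rest of the values.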
Is the 24 in 1969-12-31T24:59:59Z correct? (This timestamp appears twice.) I suppose that 23 may be correct, since ISO 8601 only allows hour 24 in the form 24:00:00 exactly.
Many geo data issues are already covered in other sections - especially the entered-by-humans part - but there are some common quirks that might be worth mentioning?
If a sample "fails to cover the entire population", it's biased. In my opinion "Sample is not random" and "Sample is biased" should be the same point.
I think the Chinese version page (http://djchina.org/2016/07/12/bad_data_guide/) cannot be opened. So maybe you could check the corresponding URL or remove this row?
Aren't they actually December 31st, 1899, or something similar slightly off, due to the leap-year bug (Excel treats 1900 as a leap year for Lotus 1-2-3 compatibility)?
Excel import will convert text that looks somewhat like a date to a date; this has been noted as a common problem with human genetic data, as several gene symbols resemble dates (DEC1, SEPT9, MARCH1). Best defense: be suspicious of columns which have a mix of numeric (or date) and text data, particularly if the fraction of numerics (or dates) is very low.
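The "mixed column" heuristic above can be automated: in a column that should be pure text, count how many values parse as dates, and treat a small non-zero fraction as a red flag. The data and the single date format checked are assumptions for illustration; real checks should try the handful of formats Excel emits in your locale.

```python
from datetime import datetime

# Hypothetical gene-symbol column where one entry has been silently
# converted to a date by a spreadsheet round-trip.
column = ["TP53", "BRCA1", "2019-09-01", "MYC"]

def looks_converted(value):
    """True if the value parses as an ISO date (one format, for brevity)."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False

fraction = sum(looks_converted(v) for v in column) / len(column)
print(fraction)   # 0.25 -- low but non-zero: exactly the suspicious case
```

A fraction near 1.0 usually means the column really is dates; a fraction near zero but above it is the pattern that indicates silent conversion.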
Via Chris Wright: "Also, I'm always suspicious of row counts that are multiples of 1000 or 100 as these are often sample data sets and therefore missing records. I generally check with the provider when that happens."
For a university project I will be making some edits to your documentation, aiming to make the writing more technical.
Please add edits if you would like :)
In addition to China and Pakistan, Rwanda also mandates that its parliament be 30% women, though it tends to do much better than this figure.
Via Chris Wright: "Finally, record counts in general should be checked when receiving and loading data. Excel gives people the option to add new lines within cells; this is stored as a Line Feed (LF) character (at least under Windows, where I work). Some applications reading this in will take everything after that as a new record, potentially resulting in data being loaded into the wrong columns if you're loading into a database. Another fun trick is when you end up with an end-of-file character embedded in a text string. I've yet to work out how on earth these end up in the files (it's happened to me maybe 3 times over the last 5 years), but these essentially tell the process reading the data that it has reached the end of the file and to stop reading it there. The ASCII code for it resolves to CTRL+Z, so my current working theory is that the source system is capturing people undoing a typo. I've never been able to replicate this, though. In both cases, knowing up front how many records you are expecting, and counting the number of records you've loaded into your working system, captures these problems."
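The embedded-LF failure mode above is easy to demonstrate: a naive newline count disagrees with a quote-aware record count whenever a cell contains a line feed. A sketch with hypothetical data, using the standard-library `csv` module as the quote-aware reader:

```python
import csv
import io

# Hypothetical file: the second record has an embedded line feed
# inside a quoted cell, as Excel produces with Alt+Enter.
raw = 'id,comment\n1,"fine"\n2,"line one\nline two"\n'

# A naive reader counts physical lines: 3 "records" after the header.
naive_lines = raw.count("\n") - 1

# A CSV-aware reader respects quoting: 2 actual records.
records = len(list(csv.reader(io.StringIO(raw)))) - 1

print(naive_lines, records)   # -> 3 2
if naive_lines != records:
    print("embedded newlines present -- verify the expected record count")
```

Comparing both counts against the number of records the provider says to expect catches the wrong-column loading problem before it reaches the database.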