
bad-data-guide's People

Contributors

adamantnz, aechase, agrueneberg, bryant1410, bycoffe, chentsulin, denis-sokolov, herdingbats, jellily, jeremybmerrill, joegermuska, laurence001, liwenyip, mdlincoln, o-i, onyxfish, philipashlock, smnorris, tkb, yanofsky, zhaoy


bad-data-guide's Issues

Translations here?

bad-data-guide is a success (congratulations!!) and there are many translations... How about adding a "translations" folder here?

Clarify "average per month" in `Data are too coarse` section

Technically the yearly total divided by 12 is the monthly average. Explain why this is a terrible idea.

Explanation from an email I wrote:

With regard to your specific question about "average per month," I think I do need to clarify that section of the guide. Of course the yearly total divided by 12 is technically the monthly average. My point was that if you have no idea of the variation from month to month, creating an average can be extremely misleading. Without knowing the distribution of the data, it's entirely possible the entire total happened on the first day of the year, which would make expressing it as a monthly average absolutely incorrect.
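A minimal sketch of that point, using two made-up years with the same total: the monthly mean is identical, and only the spread shows that one of them is concentrated in a single month. The data and the `monthly_summary` helper are hypothetical.

```python
from statistics import mean, pstdev

# Two hypothetical years with the same yearly total (1200).
even_year = [100] * 12            # spread evenly across the months
spiky_year = [1200] + [0] * 11    # everything happened in January


def monthly_summary(monthly_values):
    """Return the monthly mean alongside the spread that the mean hides."""
    return {
        "mean": mean(monthly_values),
        "min": min(monthly_values),
        "max": max(monthly_values),
        "stdev": pstdev(monthly_values),
    }


print(monthly_summary(even_year))
print(monthly_summary(spiky_year))
```

If only the yearly total is available, none of the spread figures can be computed at all, which is the reason to be wary of quoting "per month" numbers derived from it.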

Suspicious date: 1960-01-01

Via Chris Wright: "I wanted to add that the date 1960-01-01 can also be a suspicious date as this is the 0 date for a data manipulation program called SAS. It looks like your guide is aimed at journalists and amateur data investigators, so probably less common for people in those situations."
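As a rough illustration: SAS stores a date as a count of days from 1960-01-01, so a zero or defaulted value surfaces as exactly that date. The sketch below converts a SAS day count and flags dates that land on the epoch. The function names are hypothetical.

```python
from datetime import date, timedelta

# SAS date values count days from this date.
SAS_EPOCH = date(1960, 1, 1)


def sas_days_to_date(days):
    """Convert a SAS date value (days since 1960-01-01) to a date."""
    return SAS_EPOCH + timedelta(days=days)


def suspicious_dates(dates):
    """Return the indices of dates that fall exactly on the SAS epoch."""
    return [i for i, d in enumerate(dates) if d == SAS_EPOCH]


sample = [date(1987, 6, 3), sas_days_to_date(0), date(2001, 9, 9)]
print(suspicious_dates(sample))
```

A single 1960-01-01 may be a genuine date. A cluster of them in a column such as date of birth is a stronger hint that missing values were stored as zero.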

Excel conversions

Excel has the "feature" that anything remotely resembling a date is forcibly converted to a date. It is usually not possible to figure out which string was originally entered. The data type of the cell is then wrong as well.

My first encounter was with CAS registry numbers (see https://en.wikipedia.org/wiki/CAS_Registry_Number).

The problem is also in genetics: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7

By the way, OpenOffice fully reimplemented this feature, also without any way to turn it off.

A nicety is that this conversion is applied not only to manually entered data but also to pasted data and, to make things fun, to data written by apps using Excel's COM interface.

Data should be chronologically increasing (or decreasing)

I have seen multiple situations where data is logged chronologically, over time. Current examples are song play counts or YouTube views. If the data is logged day after day, these numbers should stay the same or, more likely, INCREASE. So when you see them...

  • decrease
  • go to zero after being positive
  • GO WAY up then return to a more normal linearly increasing value

The data is suspect.

Sometimes the data can be "repaired" by removing the wild swings. But that is not always simple, depending on the bandwidth of "normal" data.
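The three warning signs above can be sketched as a single pass over a series of cumulative counts. This is only an illustration: the `spike_factor` threshold is an arbitrary assumption, and the function name is hypothetical.

```python
def find_suspect_points(counts, spike_factor=10):
    """Return (index, reason) pairs for readings that break the
    expectation that a cumulative count never goes down.

    spike_factor is an assumed threshold: a reading more than
    spike_factor times the previous one is reported as a spike.
    """
    suspects = []
    for i in range(1, len(counts)):
        prev, curr = counts[i - 1], counts[i]
        if curr == 0 and prev > 0:
            suspects.append((i, "reset to zero"))
        elif curr < prev:
            suspects.append((i, "decrease"))
        elif prev > 0 and curr > prev * spike_factor:
            suspects.append((i, "spike"))
    return suspects


daily_views = [100, 110, 105, 120, 0, 130, 5000, 140]
print(find_suspect_points(daily_views))
```

Note that the reading after a spike is reported as a decrease, so a spike usually produces two adjacent flags. Choosing a sensible `spike_factor` is the hard part, since it depends on how much the "normal" data varies.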

Margin Of Error

Is the margin of error in the document at the 95% confidence level?

More "bad nulls" (from email)

Copying over from emailed note

I did want to add to your ‘Zeros replace missing values’; in my time as an ETL developer (before going back into academia) I have come across the following as common entries for missing values:

-1
99
...
-99999
...
-99
00000

Basically, stuff the field with as many 9s as it will hold or, if possible, make it negative as that is ‘clearly’ a nonsensical value. Never bother documenting any of these decisions. I’ve actually never come across 0 for missing, but that might be the type of data I work with. Point being, I would suggest emphasising that distributions and summaries, as well as specific tests on each row for ‘reasonableness’ (e.g. ‘break if {value} < 0’), are a key part of most processing and analysis regardless of the quality of the data itself.
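A minimal sketch of both suggestions, assuming the column arrives as raw strings (for example from a CSV): count how often the placeholder values listed above occur, and run a simple per-row reasonableness test for negatives. The helper names are hypothetical, and the sentinel set contains only the values from the note.

```python
from collections import Counter

# Placeholder values taken from the list above; extend for your own data.
SENTINELS = {"-1", "99", "-99", "-99999", "00000"}


def sentinel_counts(values, sentinels=SENTINELS):
    """Count how often each known placeholder appears in a raw column."""
    stripped = (str(v).strip() for v in values)
    return Counter(v for v in stripped if v in sentinels)


def negative_rows(values):
    """Return the indices of values that parse as numbers below zero."""
    rows = []
    for i, v in enumerate(values):
        try:
            number = float(v)
        except (TypeError, ValueError):
            continue  # not numeric, so this test does not apply
        if number < 0:
            rows.append(i)
    return rows


ages = ["12", "-1", "99", "40", "-99999", "n/a", "00000", "99"]
print(sentinel_counts(ages))
print(negative_rows(ages))
```

A value such as 99 can also be legitimate, so the counts are a prompt to ask the data's source rather than proof that values are missing.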

geographic data?

Many geo data issues are already covered in other sections, especially the "entered by humans" part, but there are some common quirks that might be worth mentioning?

  • lon/lat vs lat/lon
  • inconsistent or incorrect CRS
  • inconsistent values used to indicate NULLs
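For the first quirk, one cheap check is the valid range of each axis: latitude must lie within [-90, 90] and longitude within [-180, 180]. The sketch below assumes coordinates in decimal degrees (it does not apply to projected CRSs), and the function name is hypothetical.

```python
def possible_orders(first, second):
    """Return the axis orders consistent with a coordinate pair,
    assuming decimal degrees."""
    orders = []
    if abs(first) <= 90 and abs(second) <= 180:
        orders.append("lat/lon")
    if abs(first) <= 180 and abs(second) <= 90:
        orders.append("lon/lat")
    return orders


print(possible_orders(40.7, -74.0))    # both orders are possible
print(possible_orders(-122.4, 37.8))   # first value cannot be a latitude
print(possible_orders(200.0, 95.0))    # not valid degrees in either order
```

Many pairs are valid in both orders, so the check is most useful over a whole column: if any row rules out one order, the others probably follow the same convention. An empty result suggests a different CRS or a sentinel standing in for NULL.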

Excel dates

Aren't they actually December 31st, 1899, or something similarly slightly off? Due to a leap year bug or something.
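For reference, a sketch of how the leap year quirk plays out, assuming Excel's default 1900 date system (the 1904 system used by some older Mac workbooks is not covered). Serial 1 is 1900-01-01, and serial 60 is a 1900-02-29 that never existed, so every later serial is one day further along than a plain day count would give. That is why conversion code commonly uses 1899-12-30 as the base date. The function name is hypothetical.

```python
from datetime import date, timedelta


def excel_serial_to_date(serial):
    """Convert a whole-number Excel serial (1900 date system) to a date."""
    if serial < 1:
        raise ValueError("serials below 1 have no calendar date")
    if serial == 60:
        raise ValueError("serial 60 is Excel's nonexistent 1900-02-29")
    if serial < 60:
        # Before the phantom leap day, serial 1 maps to 1900-01-01.
        return date(1899, 12, 31) + timedelta(days=serial)
    # After it, every serial is shifted by one day.
    return date(1899, 12, 30) + timedelta(days=serial)


print(excel_serial_to_date(1))
print(excel_serial_to_date(61))
print(excel_serial_to_date(25569))
```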

Text incorrectly converted to date

Excel import will convert text that looks somewhat like a date to a date; this has been noted as a common problem with human genetic data, as several gene symbols resemble dates (DEC1, SEPT9, MARCH1). Best defense: be suspicious of columns which have a mix of numeric (or date) and text data, particularly if the fraction of numerics (or dates) is very low.
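A rough sketch of that defense for a column read back as text. It assumes the converted cells were exported in a day-month form such as `1-Dec`, which is an assumption about the export format. The regular expression, the 5% threshold, and the function names are all illustrative.

```python
import re

# Assumed export format for cells Excel converted to dates, e.g. "1-Dec".
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)


def date_like_fraction(values):
    """Fraction of non-empty cells that look like a converted date."""
    cells = [v.strip() for v in values if v and v.strip()]
    if not cells:
        return 0.0
    hits = sum(1 for v in cells if DATE_LIKE.match(v))
    return hits / len(cells)


def is_suspicious_column(values, max_fraction=0.05):
    """Flag columns where a small, nonzero share of cells look like dates."""
    fraction = date_like_fraction(values)
    return 0.0 < fraction <= max_fraction


gene_column = ["TP53"] * 30 + ["1-Dec"]
print(date_like_fraction(gene_column))
print(is_suspicious_column(gene_column))
```

A column that consists entirely of dates is not flagged, since the warning sign is the mixture rather than the presence of dates.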

Row numbers that are multiples of 10, 100, or 1000

Via Chris Wright: "Also, I'm always suspicious of row counts that are multiples of 1000 or 100 as these are often sample data sets and therefore missing records. I generally check with the provider when that happens."
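The check itself is trivial to automate. A minimal sketch, with a hypothetical function name and the bases taken from the quote:

```python
def round_count_base(n_rows, bases=(1000, 100)):
    """Return the largest base that divides n_rows evenly, or None.

    A match is only a hint that the file may be a sample or a
    truncated export, not proof of it.
    """
    for base in sorted(bases, reverse=True):
        if n_rows > 0 and n_rows % base == 0:
            return base
    return None


print(round_count_base(5000))
print(round_count_base(4300))
print(round_count_base(4321))
```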

GII countries

In addition to China and Pakistan, Rwanda also mandates that its parliament be 30% women, though it tends to do much better than this figure.

Records combined around end-of-line characters

Via Chris Wright: "Finally, record counts in general should be checked when receiving and loading data. Excel gives people the option to add in new lines within cells, this is stored as a Line Feed (LF) character (at least under Windows where I work), some applications reading this in will take everything after that as a new record, potentially resulting in data being loaded into wrong columns if you're loading into a database. Another fun trick is when you end up with an end of file character embedded in a text string. I've yet to work out how on earth these end up in the files (it's happened to me maybe 3 times over the last 5 years), but these essentially tell the process reading the data that it has reached the end of the file and to stop reading it there. The ASCII code for it resolves to CTRL+Z, so my current working theory is that the source system is capturing people undoing a typo. I've never been able to replicate this though. In both cases, knowing up front how many records you are expecting, and counting the number of records you've loaded into your working system captures these problems."
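A sketch of the record-count check for CSV text, using Python's standard `csv` module, which keeps a quoted in-cell line feed inside one record. It compares physical lines with parsed records and looks for an embedded CTRL+Z (0x1A) character. The `audit_csv_text` name and the returned keys are hypothetical.

```python
import csv
import io


def audit_csv_text(text, expected_records):
    """Compare physical lines and parsed records, and look for CTRL+Z."""
    physical_lines = len(text.rstrip("\n").split("\n")) if text else 0
    # newline="" leaves line endings untouched so csv can interpret them.
    parsed_records = sum(1 for _ in csv.reader(io.StringIO(text, newline="")))
    return {
        "physical_lines": physical_lines,
        "parsed_records": parsed_records,
        "matches_expected": parsed_records == expected_records,
        "contains_ctrl_z": "\x1a" in text,
    }


# A header plus two records; the first record has a line feed inside a cell.
sample = 'id,note\n1,"line one\nline two"\n2,plain\n'
print(audit_csv_text(sample, expected_records=3))
```

Here a naive line count reports four lines while the parser finds three records, which is exactly the discrepancy to look for. The expected count has to come from the data provider.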
