Comments (10)
I like this! I originally had a short little .csv to go with this package, to illustrate dirty features and cleaning them. It wasn't truly nasty though.
I agree that the dirty_iris.csv file is not real-world dirty. The dirtiness in that set amounts to rows that violate rules like these:
– The petal length of an iris is at least 2 times its petal width.
– The sepal length of an iris cannot exceed 30 cm.
– The sepals of an iris are longer than its petals
That might be called semantic dirtiness? Or something. I'm more interested in, say, blank rows and columns used as delimiters in an Excel file. Or the bad NA value in a numeric column that you mention. Dates are often screwy like this, too.
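For illustration only, here is a minimal Python sketch of how the three rules above could be checked row by row. Nothing in it comes from janitor or from the dirty_iris file itself: the field names (`sepal_length`, `petal_length`, `petal_width`), the rule labels, and the `dirt_reasons` helper are all hypothetical, and measurements are assumed to be in cm.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, Optional[float]]

# Hypothetical rule table: (label for a violation, fields the rule needs, check).
# Each check returns True when the row SATISFIES the rule listed above.
RULES: List[Tuple[str, Tuple[str, ...], Callable[[Row], bool]]] = [
    ("petal_length_under_2x_petal_width",
     ("petal_length", "petal_width"),
     lambda r: r["petal_length"] >= 2 * r["petal_width"]),
    ("sepal_length_over_30cm",
     ("sepal_length",),
     lambda r: r["sepal_length"] <= 30),
    ("sepal_not_longer_than_petal",
     ("sepal_length", "petal_length"),
     lambda r: r["sepal_length"] > r["petal_length"]),
]


def dirt_reasons(row: Row) -> List[str]:
    """Return labels describing why a row is dirty (empty list if clean)."""
    # Flag missing values first; a rule cannot be evaluated without its inputs.
    reasons = [f"missing_{name}" for name, value in row.items() if value is None]
    for label, fields, passes in RULES:
        if any(row.get(field) is None for field in fields):
            continue  # skip rules whose inputs are missing
        if not passes(row):
            reasons.append(label)
    return reasons


# Made-up example rows, not taken from dirty_iris.csv.
clean = {"sepal_length": 5.1, "petal_length": 1.4, "petal_width": 0.2}
broken = {"sepal_length": 73.0, "petal_length": 29.0, "petal_width": 46.0}
gappy = {"sepal_length": 4.9, "petal_length": None, "petal_width": 0.2}

for row in (clean, broken, gappy):
    print(dirt_reasons(row))
```

This only covers the "semantic" kind of dirtiness; structural mess such as blank delimiter rows in a spreadsheet would need a different approach.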
I will assign this to you, thanks for offering!
from janitor.
It won't let me assign this to you. Not sure why, after some Googling. Maybe you need to be a collaborator on the repo? I just sent you an invite for that.
I think you're right - just assigned myself (a KIPP-ism 😄 )
Been thinking about this - it would be nice for each row to identify how/why it was dirty, to help surface test success/failure.
I was thinking more about this yesterday. It would make for a cool vignette to have a dirty Excel file that gets cleaned one step at a time. We could make a .csv too but then is that really as dirty? 😈
whoops I accidentally closed this
I am working on this right now and it is cathartic to create some ugly data - I think you will like it. If it's not in a spreadsheet made by a novice, it's not really dirty data.
there's this gnarly spreadsheet of spending on journals.
@jennybc's jailbreakr project points to the Enron corpus - maybe grab a subset of those?
The spreadsheets that underpin the gapminder package have some lovely features 😂 and, since I use it to teach data cleaning, the whole glorious mess is laid out here:
https://github.com/jennybc/gapminder/tree/master/data-raw#readme
Filing this under "I would love to see this happen, but realistically I am not going to get to it", so closing as I tidy up issues.