librarycarpentry / lc-spreadsheets Goto Github PK
View Code? Open in Web Editor NEWTidy data for librarians
Home Page: https://librarycarpentry.github.io/lc-spreadsheets/
License: Other
Tidy data for librarians
Home Page: https://librarycarpentry.github.io/lc-spreadsheets/
License: Other
There is an issue on this page - the formatting is broken
The flag link for the "Date formatting" section in the Formatting Problems episode is broken (probably because that section has been removed).
Not sure if it's meant to be deleted, or re-populated?
@shlake has made these changes on her branch. When issue is agreed to be "fixed", then will ask for PR.
====================================
This lesson isn't about data "wrangling". It is really about data "organization" and best practices. Propose Title change from
Using spreadsheet programs for data wrangling
TO
Using spreadsheet programs for data organization
Under "What this lesson will not teach you" the lesson points to a link to an O'Reilly Media book on Amazon. If there's an OA alternative we should link to that instead, but it's also okay to just keep it as-is...
Here: https://librarycarpentry.org/lc-spreadsheets/00-intro/index.html#what-this-lesson-will-not-teach-you
At least on a Mac - The "Error Alert" tab contains the text that is shown when an incorrect value is entered.
The "Input Message" is a popup message that appears in the cell that has "Data Validation" set - notifying what the correct values for entry are.
"If Excel was to be believed, this person had been collecting bugs IN THE FUTURE. Now, we have no doubt this person is highly capable, but I believe time travel was beyond even his grasp."
--The most recent example in the table is 2017. Include a new example table, maybe with a date of Jul-31 - the "How it was interpreted" column will be 1-Jul-31 so it'll be a few years before this is outdated again. :)
The Ecology lesson has a great exercise on what happens when Excel files are saved as CSV and then re-opened in Excel (when the "viewable" part of the date column does not have year explicit).
See last exercise on this episode: https://datacarpentry.org/spreadsheet-ecology-lesson/03-dates-as-data/index.html
At least, it doesn't look like it's being rendered correctly (and there's an unresolved FIXME) at: https://librarycarpentry.github.io/lc-spreadsheets/reference/
#3 Does not appear to have fixed the links. They may need to be absolute paths instead of relative paths and should probably not include the .md file extension. See https://github.com/LibraryCarpentry/lc-spreadsheets/blob/gh-pages/_episodes/00-intro.md
Things to consider:
1 - keep as "episode" and update episode page that this is "Reference"
2 - Move out to Discussion or the Reference section and include brief paragraph on Episode 5
In the Caveats of popular data and file formats part of the lesson (https://librarycarpentry.org/lc-spreadsheets/06-data-formats-caveats/index.html) under the heading Dealing with commas as part of data values in .csv files there is a link to an example data file (https://ndownloader.figshare.com/files/3299483). This linked file is different from the excerpt described in this section.
I would suggest removing the link. The lesson could just work with the data given in text form in this section.
In data as dates (https://librarycarpentry.org/lc-spreadsheets/03-dates-as-data/index.html) I would suggest to change the order of sections.
The section Preferred date format is more logical further down after the explanation of how Data formats in spreadsheets work and before going into the Advantages of Alternative Date Formatting
I noticed that Dates as data includes this line: "Let’s try with a simple challenge."
Can we eliminate the word "simple"?
Please delete the text below before submitting your contribution.
Thanks for contributing! If this contribution is for instructor training, please send an email to [email protected] with a link to this contribution so we can record your progress. You’ve completed your contribution step for instructor checkout just by submitting this contribution.
Please keep in mind that lesson maintainers are volunteers and it may be some time before they can respond to your contribution. Although not all contributions can be incorporated into the lesson materials, we appreciate your time and effort to improve the curriculum. If you have any questions about the lesson maintenance process or would like to volunteer your time as a contribution reviewer, please contact Kate Hertweck ([email protected]).
New image within an Excel Window.
Check all sections in this lesson, but "More than one piece of information in a cell" and "Field name problems" needs to reflect the training_attendance sample file.
In reviewing some of the Library Carpentry lessons as part of the check-out process for the instructor training I noticed an issue.
In the following lesson:
https://librarycarpentry.org/lc-spreadsheets/05-exporting-data/index.html
and this may be the case in others, it would be useful to include screen snapshots of Windows version of software. I think it would be especially useful in these lessons where Excel is being used, the Mac and Windows interface can be different (depending on what you're doing). Also, my experience is that a good number of people taking Library Carpentry are developing their computer skills and it is a stretch for some to figure out the difference between the Mac and Windows interfaces.
Understanding that the page is already loaded with information, I'm not sure where this would fit on the page. It may not be necessary for all screen snapshots, just for the ones where the Windows and Mac versions differ.
Another alternative is that when teaching the lessons live, suggest that the instructors have access to and are able to display different software versions (Mac v. Windows).
I hope this makes sense. It's based off of my experience working with people who are novices.
Hi, we're planning to teach this on Tuesday, but there's an exercise I'm not sure if I understand myself under Sorting :O
The question is "When you do this sort, do you notice anything strange?", but when I sort the date column, it seems to sort perfectly, I don't see anything strange. Is it perhaps just supposed to be strange that it sorts perfectly without showing the year..?
Or could it be that the exercise does not match the current version of the spreadsheet? The screenshot does not match the current spreadsheet at all.
The Quality Control section contains a couple points of confusion. Under the Sorting exercise, it says to sort the date by smallest to largest and also the image shows a different variable (wgt). Additionally, for the Conditional Formatting section it would be helpful to say which tab format is under and to clarify the rule for the conditional formatting. However, this may have been a difference between MS Excel on Mac OS which is what I was using for the tutorial.
In the "Caveats of popular data and file formats" episode there are 3 broken links:
The recommendations pertaining to text documentation about the data are inconsistent.
About "Keeping track of your analyses," module 2, Formatting data tables in Spreadsheets states:
[K]eep track of the steps you took in your clean up or analysis. You should track these steps as a scientist would each step in an experiment. You can do this in another text file, or a good option is to create a new tab in your spreadsheet with your notes. This way the notes and data stay together.
About "Inclusion of metadata," module 3, Formatting problems states:
Unlike a table in a paper or a supplemental file, metadata (in the form of legends) should not be included in a data file since this information is not data, and including it can disrupt how computer programs interpret your data file. Rather, metadata should be stored as a separate file in the same directory as your data file, preferably in plain text format with a name that clearly associates it with your data file.
While not referring to exactly the same documentation, both pieces of advice do refer to text documentation about the dataset rather than the data themselves. It seems like best practice would be to store documentation about the dataset in a single location.
There are broken links in 'Preferred date format' section under "Notes" subsection:
I will create pull request fixing this.
In the second sheet, "2017", column J includes a formula which looks to be used to generate data. It fills the value of the cell with a random number between 15 and the number of registrants (column I). I couldn't find a place in the lesson where this formula is used, and in copy/pasting, it seemed to just cause confusion in the workshop. Perhaps the spreadsheet could be updated so numerical values, rather than a formula, are present in cells I4:I12 of sheet "2017"?
Here are set of broken links on this page:
The formatting color is "red" which may not be seen by some. Changing the color to "grey". I noticed this is what the Ecology data file now has as a formatting color.
This exercise is challenging.
The whole episode is challenging. Propose to just to talk about the problem "dates as data" within excel without talking about "time".
This is a confusing section.
Suggesting to just "NOTE" that there might be problems, but don't go into detail as it is here.
Or change the character endings "\r, \n" to CR LF?
Like the DC Ecology Lesson did: https://datacarpentry.org/spreadsheet-ecology-lesson/05-exporting-data/index.html
Then add links to where/how to fix this.
In the "storing dates as a single string" section, I would suggest modifying the last sentence as follows:
"You could also use the 'YYYYMMDD' format. YYYYMMDD is an international standard (ISO 8601) and can eliminate ambiguity when sharing spreadsheets. Such strings will be correctly sorted in ascending or descending order, and by knowing the format they can then be correctly processed by the receiving software."
The word ascending is also currently misspelled (missing the "i").
Setup file refers to 4.4.2 of LibreOffice but the latest version is 6.0.4. Is it worth listing software version at all if LibreOffice is updated regularly?
When trying to use the Text-to-Columns feature to separate the PGR/PDRA/Other column into three new columns, Excel get confused and thinks the "/" is part of a date (and we know how dates are not easy or totally understood in Excel).
Suggesting that a new delimiter, possibly "|" is used.
Using spreadsheet programs for data wrangling (https://librarycarpentry.org/lc-spreadsheets/00-intro/index.html)
Suggested Objectives:
i. Discuss the variety of ways that spreadsheets function as tools in our daily work
ii. Distinguish machine readable tidy data from data that is easy to read for humans
iii. Articulate the purpose and importance of data cleaning as a primary and substantial step in working with data
If your Maintainer team has decided not to participate in the June 2019 lesson release, please close this issue.
To have this lesson included in the 18 June 2019 release, please confirm that the following items are true:
When all checkboxes above are completed, this lesson will be added to the 18 June lesson release. Please leave a comment on carpentries/lesson-infrastructure#26 or contact Erin Becker with questions ([email protected]).
Typo in Key Points section of "Basic Quality Assurance and Control, and Data Manipulation in Spreadsheets" episode:
As we look to move Library Carpentry to becoming a formal part of the Carpentries community, we need to make sure that each of our lessons are properly supported, including 3 maintainers for each one. It will also be obvious to anyone paying attention that I haven't had much time to work on this lesson recently, and that will be the case for a while, so I'm looking to step down from that responsibility for now.
I'm particularly pinging the following people who have contributed issues and pull requests to this repository but anyone with an understanding of the content and the library world would be welcome.
@duchessanne @cmacdonell @niamhwallace @shlake @yvonnemery @libcce @fmichonneau @stephlopezq @StephanJanosch @fionajones @jcoliver
The links to the episodes in the Instructor's Notes, on this page:
https://librarycarpentry.github.io/lc-spreadsheets/guide/
Are broken.
Hi,
did you ever consider using KNIME as tool for data manipulation? I think the approach via Knime is way more reproducible and also visually more appealing.
Just my 2c,
Stephan
Hi there! Thank you for all the work you do to make this resource available to everyone!
This is my first issue submission as part of the Carpentries trainer check-out process.
For the episode "Basic Quality Assurance and Control, and Data Manipulation in Spreadsheets", I have a few suggestions. (Edit: At the guidance of the Carpentries onboarding team I'm breaking this up into separate issues 1/2/2020.)
Episode title:
For the episode "Basic Quality Assurance and Control, and Data Manipulation in Spreadsheets", I would suggest these edits to the Conditional Formatting section:
Thanks for your consideration!
In the sorting section, the the exercise is:
Let’s try this with the Date tab in our messy spreadsheet. Go to that tab. Select Data then select Sort
Sort by
date
in the order Smallest to Largest
The screenshot though, shows sorting on Column "wgt". Ideally the screenshot would be updated with the "Date" column.
_This activity is confusing - and maybe distracting from the main learning objective of formatting dates correctly with month, day, and year in separate columns, especially because Excel functions aren't addressed in the earlier lessons.
I would put this text above the exercise:_
Preferred date format
Instead it is much safer to store dates with MONTH, DAY and YEAR in separate columns or as YEAR and DAY-OF-YEAR in separate columns.
And maybe add a faded example, or slightly more detailed example syntax:
Current time and date are best retrieved using the functions NOW(), which returns the current date and time, and TODAY(), which returns the current date. The results will be formatted according to your computer’s settings.
In an empty cell, try to extract the year, month and day from the current date and time string returned by the NOW() function.
--=YEAR(NOW())
--=MONTH( ____ ())
--= ____ ( ____ ())
Calculate the current time using NOW()-TODAY(). Note that the default output will be a decimal, not in the hh:mm:ss format.
Try to extract the hour, minute and second from the current time using functions HOUR(), MINUTE() and SECOND().
press fn + F9 to force the spreadsheet to recalculate the NOW() function, and check that it has been updated.
And then keep the note about Excel and pre-1900 dates where it is.
Used the example of type of workshop attendees in place of multiple instructors.
Will now say this:
Example:
One table recorded attendance by the different types of attendees. This table recorded number of attendees of different types: post-graduate researcher (PGR), post-doctoral research associate (PDRA), and other.
Solution:
Never include more than one piece of information in a cell. Design your data sheet to include a column for each type of attendee, if this information is important to collect, rather than just a total number.
The DC ecology lesson (that this LC lesson is based on) has the following key points, which should be added to this LC version.
Never modify your raw data. Always make a copy before making any changes.
Keep all of the steps you take to clean your data in a plain text file.
Since the tutorial mentiones LibreOffice as one of the tools to use in the workshop, I added the link to the documentation of Libre Office on how to freeze the tables, so if anyone is using LibreOffice or OpenOffice they would be able to keep up.
I'll open a pull request for this
Propose to not include the last part of the current title:
From: Basic quality assurance and control, and data manipulation in spreadsheets
To: Basic quality assurance and control
Merging my changes and Jeff's, episode 02 layout got messed up.
Submission for instructor training checkout:
Every time I talk to a student or a class about dates in Excel and explain how dates are stored as a number, it is inevitable that someone will ask, "What do you do if you want to get something that looks like a date [mm/dd/yyyy] and no matter what you do with formatting you still have a serial number?"
Librarians often see spreadsheets that have suffered many transformations whether it is from a person’s formatting and manipulation or when data is downloaded from older database programs or data sources.
In the “dates as data section”, I would suggest adding section of formatting dates in the other direction.
After the section, Dates stored as integers, I would add a small section on “Transforming serial numbers to dates”
Options to transform serial numbers to dates:
This lesson uses screen shots from the Data Carpentry sample file.
For this Library Carpentry lesson, the sample file is "training_attendance", so this lesson needs equivalent screen shots from the training_attendance file.
Section "Structuring data in spreadsheets" still refers to "weight" or "temperature". Change to "length" or "attendance" for Library Carpentry dataset.
The image does not show the correct values for "Month" "Day" Year", data is off by a row.
Check all sections in this lesson, but "Structuring data in spreadsheets" needs to reflect the training_attendance sample file.
Also need to update the example images to match the training_attendance sample file.
For the episode "Basic Quality Assurance and Control, and Data Manipulation in Spreadsheets", I would suggest these edits to the Quality Control section:
Thank you for your consideration!
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.