datacarpentry / spreadsheets-socialsci Goto Github PK
View Code? Open in Web Editor NEWData Organization in Spreadsheets for Social Scientists
Home Page: http://datacarpentry.github.io/spreadsheets-socialsci/
License: Other
Data Organization in Spreadsheets for Social Scientists
Home Page: http://datacarpentry.github.io/spreadsheets-socialsci/
License: Other
Just to help bridge the curriculum to social sciences, its worth mentioning familiar forms of metadata in the social sciences. Survey researchers often write codebooks that explain their questions. Still, a codebook does not have other essential metadata info, like when/where a dataset was collected from.
Learners who are using LibreOffice for the workshop will have problems with the dataset as the default for LibreOffice is to treat tabs, commas, AND semicolons as delimiters. Need to add this to the instructor notes or a callout. Learners can select to use just commas as delimiters.
If learners are using a non-English language version of Excel, the =MONTH()
, =DAY()
, etc functions won't work for them. They will need to type in their language's equivalent of that word in the formula.
Add to instructor notes?
Suggestions for edits/additional content
Data Organization in Spreadsheets for Social Scientists
Formatting data tables in Spreadsheets
Metadata
Some of this information may be familiar to learners who collect or analyze survey data or data sets accompanied with additional data documentation, such as codebooks. Codebooks will often describe the original survey or interview questions associated with particular variables, the way variables have been constructed, response categories and their associated values, and the notations for missing values throughout the data. For example, the General Social Survey maintains their entire codebook online. Looking at an entry for a particular variable, such as the variable SEX, provides valuable information about the original question wording, scales or response categories, the years covered for that variable, the sample or sub-samples surveyed, and the meaning of particular values. Descriptions of missing values are important in cleaning survey data because they describe the various reasons why respondents did not answer a question (i.e., not applicable, didn't know, refused to answer, etc.), which leaves blank cells in the data. For example, in the General Social Survey missing values are numbered as 8, 9, 0 and sometimes other numbers that might be interpreted later on as integers that could interfere with accurate queries and analyses.
The beginning of episode 02-reformatting discusses two report_data Excel spreadsheets. There should be screenshots here to illustrate the tables.
Episode 02-reformatting includes an Excel spreadsheet which is intended to be an example of good formatting. However, it includes special formatting like colored celled and text coloration that should be removed.
The CSV file specification makes it clear that fields with coma should be enclosed with double-quotes, and are therefore supported. A better recommendation might be to not change the default used by the spreadsheet program in how the text fields are handled.
This episode has two sections that aren't explicitly tied together. It would be good to add an exercise to the metadata portion of this episode that asks the learners to discuss what type of metadata would be needed to support the data that is shown in the spreadsheet from the first exercise in that episode.
References in this instruction: "Select the cell b4 and insert the formula '=IF($F4="Year",$G4,$B3)" need to be adjusted to reflect the column added in a previous step. Should reference G4... H4.
I'm working through the lesson and noticed that there are multiple "toy" datasets that are used for different parts of the lesson. Generally, in Data Carpentry lessons, we try to build all examples around a single dataset (as much as possible) and to have that dataset be somewhat realistic in terms of what the learners would be working with in their everyday workflow.
This lesson currently uses:
A toy dataset about "readings" in two different regions ("North" and "South") over nine days in January and February 2015. This is fictional data that doesn't represent anything in particular and is just being used as an example of bad formatting. (see conversation in #32). This data is included as the data files report_data_o.xlsx and report_data_c.xlsx.
A metadata example from a UK audit of political engagement (http://www.datacarpentry.org/spreadsheets-socialsci/03-formatting-problems/). This is included as this file: audit_of_political_engagement_11_ukda_data_dictionary.rtf.
A SAFI dataset about house types and livestock types in Mozambique and Tanzania. This dataset is included as the data file SAFI_results_2. No information about the source of this data appears to be included in the lesson.
A small toy dataset consisting of the words "Hello world, how are you" in various permutations. (http://www.datacarpentry.org/spreadsheets-socialsci/06-exporting-data/).
Thus use of multiple different datasets is confusing and potentially distracting to the learners. I propose that this lesson is reorganized such that all of the exercises use the SAFI dataset. This reorganization involves the following:
they are exactly repeated for the 3 OSes
depending on where the data end up, we'll need to update the download link
Please, consider adding a lesson
topic to the repository. To do so you can follow the help about how to add topics to the repository. Check out the topics that the Genomics R intro lesson has gotten to add others that may be relevant to this lesson.
This will help people to know which repositories are lessons and also could be used to automate analysis of the repositories.
Episode 4 has the text:
This regional variation is handled automatically by the spreadsheet program so that when you are typing in dates they appear as you would expect. If you try to type in a US format date into a UK version of Excel, it may or may not be treated as a date.
The phrase "they appear as you would expect" is somewhat misleading. They may or may not appear as you would expect! If, for example, you are a researcher from the US working in the UK and using your new lab's computer and software, you may type in a US date and have it render according to UK rules. Alternatively, if that same person is using their personal computer and sharing files back and forth with a collaborator who uses a UK version of Excel, there will also be problems!
I think it is key here to stress that because Excel is manipulating the values that you enter (and storing them in some non-transparent way) you can't be sure exactly what you're getting (ie it's not reproducible). That's why it's always best to store dates not as dates but as numbers using multiple columns to separate the components (e.g. month, day, year).
In Episode 4 it would be good to mention somewhere in the section "dates Excel doesn't recognize" or "dates Excel gets wrong" that different versions of Excel will perform differently. For example, different versions of Excel treat different dates as "day 1" for the purpose of calculating the numerical value of a date.
The current quality assurance section states that "data validation rules are not applied retroactively to data that is already present in the cell".
However, in Excel, it's possible to highlight sections of your data that don't meet the data validation rules by clicking the drop-down next to "Data Validation" in Excel and selecting "Circle Invalid Data".
Showing learners how to find and locate incorrectly-validated data would be a good addition to the curriculum.
@sheraaron suggests the following change:
On this page: https://datacarpentry.org/spreadsheets-socialsci/, I would like to add: (Under - "In this lesson, you will learn:" To enable us to use data successfully)
@ErinBecker and/or @fmichonneau - do changes to the outcomes need to be routed through the CAC?
This was originally submitted as issue #71
For each lesson release, copy this checklist to an issue and check off
during preparation for release
Scheduled Freeze Date: 2018-04-27
Scheduled Release Date: 2018-04-30
Checklist of tasks to complete before release:
In the dates as data episode, the spreadsheet has two tabs. The first tab is data stored as DD-MM-YYYY, the second is MM-DD-YYYY. If learners use the wrong tab for their location, they will get a #VALUE
error. This should be added to Instructor notes.
The organization of the beginning of this lesson as written feels either repetitive or mysterious. I taught it for the second time today and I switched the order some that made it flow a little more smoothly I think. I'd like to recommend a fairly significant reorganization. of the content in the first 3 episodes.
Introduction [new ep 1]
( material that's there already)
Data Description
overview of the SAFI data (from ep 2)
Formatting Data in Spreadsheets [new ep 2]
intro with goals and basics of tidy data (from ep 2)
exercise to find errors in messy data (from ep 2)(common formatting errors sections from ep 3)
MetaData [new ep 3]
current material on metadata (from ep 2)
new material on how to setup meta data// links to resources,
the suggested new material came up because the host site I taught at today the librarians in the room (hosts) mentioned that some repositories have standards for metadata and standards for it, we don't need a deep coverage of that, but links or points to some of that and noting that format of meta data is based on disciplinary standards I think is valuable add to make that discussion more concrete.
Describe data being used for report_data_c.xlsx and report_data_o.xlsx
What is the "Reading" variable and what units is it in?
Where are "North" and "South" regions of?
Is this modified from real data that is available somewhere?
These questions should be answered when the data is first introduced.
Hi I'm planning to teach this fantastic workshop in a couple of weeks at the University of Bath. As I've been going through the lessons a couple of things struck me. The first was the need for a readme file for the data files so that we can model good research data management practice to students (I'm a research data librarian). The second is that SAFI_clean.csv has a couple of variables (items_owned and months_lack_food) that would be problematic to analyse in most statistical analysis programmes because they have multiple values that are all in a different order. My concern is that by calling this file 'clean' we are then saying that it is ready to go. I have attached a readme file for the data files that you are very welcome to use (you will need to remove references to the University of Bath). I have also attached a version of SAFI_clean where there is a worksheet where all text variables have been coded to numerical variables and items_owned and months_lack_food have also been coded into yes / no variables using and IF statement that will search for text variables. There is also a worksheet in the workbook with documentation to accompany the changes I've made to SAFI_clean. This could be provided as a resource for tutors if you think it would be useful.
All the best
Alison Nightingale (University of Bath, UK)
SAFI_clean_processed.xlsx
SAFI_readme.txt
There is no glossary in http://www.datacarpentry.org/socialsci-workshop/reference/
It's a well-written, comprehensive, concise paper that covers many of the same points made in the lesson: https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989
Can be added in lesson 1, along with Hadley Wickham's tidy data paper.
the Quality assurance section suggest to open the clean CSV and apply Validation Rules.
Excel will refuse to do that until the CSV is re-saved as xlsx.
As we don't really use CSV until later, could/should(?) the default file be an XLSX ?
In the Quality assurance episode, the example input message given ("Invalid number: Number of household members must be a whole number between 1 and 30.") would be more appropriate as an error alert message.
Episode 02-reformatting includes an Excel spreadsheet titled "report_data_o.xlsx". This spreadsheet has a number of individual tables, with each having another header bar off to the right side of that table. It wasn't clear to me at first, but that second header bar is meant to apply to the table to it's left (as extra metadata) and not to be a header for another independent table. Those headers can be moved to immediately below the header line for each table to clarify that they belong with that data.
Note that this is an example of "bad formatted" data so it's ok to have two header lines per table.
In episode one there is this text:
consistency when representing boolean values
The term "boolean" is probably not familiar to many people. Could remove this jargon and/or add a parenthesis saying something like "True" vs "TRUE".
The first episode has some text like:
Spreadsheets outline
In this lesson, we’re going to talk about:
Using best practice to create your own data tables in spreadsheets
Reformatting existing spreadsheets (using Excel)
Recognising and reformatting dates in spreadsheets
Basic quality control; using data validation and Data entry forms (in Excel)
Exporting data from spreadsheets
This can be moved to the index page and used in the lesson introduction.
The text for Episode 2's key points is a bit unclear:
Heading lines may contain information relevant to the rows below it and needs to added to each of them.
There may be multiple heading lines with different information in each.
I saw this when I was reviewing #34
For the first point, we are referring to plural and singular heading lines in the same sentence.
The file for this activity appears to be present in the repo but is not linked anywhere in the episode:
https://github.com/data-lessons/spreadsheets-socialsci/blob/gh-pages/data/dates_1.csv
http://www.datacarpentry.org/spreadsheets-socialsci/setup/ is out of date. Needs link to SAFI data and for links to old datasets to be removed.
metadata is more useful if it follows a standard. A quick search indicates that DDI (http://www.ddialliance.org/training/why-use-ddi) is used in the social sciences. Also indicate that librarians might be able to provide help and guidance with this.
Index is missing a title, a intro description paragraph, and prerequisites.
The first episode has:
Free spreadsheet programs such as LibreOffice and OpenOffice are available.
This should include links to LibreOffice and OpenOffice download pages.
... Looking at an entry for a particular variable, such as the variable SEX, provides valuable information...
Link lead to a 404.
The Social Sciences CAC ([email protected]) met June 15th and 19th to discuss the full Social Sciences curriculum and provide recommendations to the Maintainers about work for these lessons between now and their next publication (December 2018). Their specific action items for this lesson are as follows:
Please see the meeting minutes for more details.
Learners working on Numbers will need to install LibreOffice. This should be added to Instructor Notes.
Currently, the data for this lesson lives in this repo ((here)[https://github.com/datacarpentry/spreadsheets-socialsci/tree/gh-pages/data]), but it should be hosted on Figshare.
The first episode has a section with a lot of one sentence paragraphs. These look very strange to me. It would be better to collapse this to one or two paragraphs.
Running through this lesson on a Mac running Excel v 15.40 (2017), I've encountered several problems. Some minor, some major. Will continue to list these here as I find them. So far:
Reading over the lesson, I found conflicting a advice about whether it's advisable to use multiple tabs in Excel. In "Formatting data tables in Spreadsheets"/"Keeping track of your analyses", it's suggested to keep notes about your experiment in a new tab to keep the information together with your data. However, when you export an Excel spreadsheet to CSV for preservation/interoperability as recommended in the next section, each tab/sheet will be saved as a separate file. Also, the section about Metadata specifies that metadata should not be contained in the data file itself (experiment notes could be considered a form of descriptive metadata).
I like the suggestion to keep track of experiment notes along with your data, but there could be some clarification about how to manage this information once the dataset/analyses are finalized (e.g. save your notes as a separate CSV or paste them into a .txt file; preserve the link with the dataset by using a strategic file name and/or including information about the dataset at the beginning of the file).
Episode 2 has a good exercise for fixing formatting issues in a badly formatted spreadsheet, however this may lead the learner to infer that manually formatting a spreadsheet is good practice (despite the fact that it's not reproducible). It would be good to note here the following (with a little exposition of each):
If your Maintainer team has decided not to participate in the June 2019 lesson release, please close this issue.
To have this lesson included in the 18 June 2019 release, please confirm that the following items are true:
When all checkboxes above are completed, this lesson will be added to the 18 June lesson release. Please leave a comment on carpentries/lesson-infrastructure#26 or contact Erin Becker with questions ([email protected]).
When DDI (Data Documentation Initiative) is introduced, I think it would be beneficial to state that this standard is applicable across disciplines because it generally applies to being used for data from surveys and observational research methods commonly used in the social, behavioral, economic, and health sciences.
It can also be added that metadata files are commonly created and saved as XML or JSON files, which are machine-readable but also easily read by humans.
The lesson infrastructure committee unanimously approved the proposal of using the same set of labels across all our repositories during its last meeting on May 23rd, 2018.
This repository has now been converted to use the standard set of labels.
If this repository used the previous set of recommended labels by Software Carpentry, they have been converted to the new one using the following rules:
SWC legacy labels | New 'The Carpentries' labels |
---|---|
bug | type:bug |
discussion | type:discussion |
enhancement | type:enhancement |
help-wanted | help wanted |
newcomer-friendly | good first issue |
template-and-tools | type:template and tools |
work-in-progress | status:in progress |
The label instructor-training
was removed as it is not used in the workflow of certifying new instructors anymore. The label question
was left as is when it was in use, and removed otherwise. If your repository used custom labels (and issues were flagged with these labels), they were left as is.
The lesson infrastructure committee hopes the standard set of labels will make it easier for you to manage the issues you receive on the repositories you manage.
The lesson infrastructure committee will evaluate how the labels are being used in the next few months and we will solicit your feedback at this stage. In the meantime, if you have any questions or concerns, please leave a comment on this issue.
-- The Lesson Infrastructure subcommittee
PS: we will close this issue in 30 days if there is no activity.
The Introduction section refers to the Numbers spreadsheet program as an example of what students could use during the workshop (under "Things You'll Need To Complete This Tutorial"), but in the Setup instructions, learners are told not to use Apple's Numbers because it does not contain some of the necessary features.
@sheraaron suggests the following change to the intro page:
I also suggest to add to https://datacarpentry.org/spreadsheets-socialsci/00-intro/index.html
Under Things you'll need to complete this tutorial: Open a folder on your computer and save all your spreadsheets necessary for this exercise.
This makes sense to me - any objections to this change @ErinBecker and/or @fmichonneau?
This was originally submitted as issue #71
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.