datacarpentry / spreadsheets-socialsci Goto Github PK

Data Organization in Spreadsheets for Social Scientists

Home Page: http://datacarpentry.github.io/spreadsheets-socialsci/

License: Other

carpentries data-carpentry lesson spreadsheet spreadsheets data-management english social-sciences stable

spreadsheets-socialsci's Issues

Codebooks as social science metadata

Just to help bridge the curriculum to social sciences, its worth mentioning familiar forms of metadata in the social sciences. Survey researchers often write codebooks that explain their questions. Still, a codebook does not have other essential metadata info, like when/where a dataset was collected from.

libreoffice automatically uses semicolons as separators

Learners who are using LibreOffice for the workshop will have problems with the dataset as the default for LibreOffice is to treat tabs, commas, AND semicolons as delimiters. Need to add this to the instructor notes or a callout. Learners can select to use just commas as delimiters.

non-english versions of Excel

If learners are using a non-English language version of Excel, the =MONTH(), =DAY(), etc functions won't work for them. They will need to type in their language's equivalent of that word in the formula.

Add to instructor notes?

suggestion for metadata section on missing values

Suggestions for edits/additional content
Data Organization in Spreadsheets for Social Scientists
Formatting data tables in Spreadsheets
Metadata

Some of this information may be familiar to learners who collect or analyze survey data or data sets accompanied with additional data documentation, such as codebooks. Codebooks will often describe the original survey or interview questions associated with particular variables, the way variables have been constructed, response categories and their associated values, and the notations for missing values throughout the data. For example, the General Social Survey maintains their entire codebook online. Looking at an entry for a particular variable, such as the variable SEX, provides valuable information about the original question wording, scales or response categories, the years covered for that variable, the sample or sub-samples surveyed, and the meaning of particular values. Descriptions of missing values are important in cleaning survey data because they describe the various reasons why respondents did not answer a question (i.e., not applicable, didn't know, refused to answer, etc.), which leaves blank cells in the data. For example, in the General Social Survey missing values are numbered as 8, 9, 0 and sometimes other numbers that might be interpreted later on as integers that could interfere with accurate queries and analyses.

add screenshots for report_data spreadsheets

The beginning of episode 02-reformatting discusses two report_data Excel spreadsheets. There should be screenshots here to illustrate the tables.

remove coloration and special formatting from report_data_c.xlsx

Episode 02-reformatting includes an Excel spreadsheet which is intended to be an example of good formatting. However, it includes special formatting like colored celled and text coloration that should be removed.

reword advice about dealing with data fields that contains commas

The CSV file specification makes it clear that fields with coma should be enclosed with double-quotes, and are therefore supported. A better recommendation might be to not change the default used by the spreadsheet program in how the text fields are handled.

add exercise to integrate metadata and formatting parts of episode 3 (formatting problems)

This episode has two sections that aren't explicitly tied together. It would be good to add an exercise to the metadata portion of this episode that asks the learners to discuss what type of metadata would be needed to support the data that is shown in the spreadsheet from the first exercise in that episode.

Cell references off after inserting column

References in this instruction: "Select the cell b4 and insert the formula '=IF($F4="Year",$G4,$B3)" need to be adjusted to reflect the column added in a previous step. Should reference G4... H4.

reorganization of this lesson

I'm working through the lesson and noticed that there are multiple "toy" datasets that are used for different parts of the lesson. Generally, in Data Carpentry lessons, we try to build all examples around a single dataset (as much as possible) and to have that dataset be somewhat realistic in terms of what the learners would be working with in their everyday workflow.

This lesson currently uses:

A toy dataset about "readings" in two different regions ("North" and "South") over nine days in January and February 2015. This is fictional data that doesn't represent anything in particular and is just being used as an example of bad formatting. (see conversation in #32). This data is included as the data files report_data_o.xlsx and report_data_c.xlsx.
A metadata example from a UK audit of political engagement (http://www.datacarpentry.org/spreadsheets-socialsci/03-formatting-problems/). This is included as this file: audit_of_political_engagement_11_ukda_data_dictionary.rtf.
A SAFI dataset about house types and livestock types in Mozambique and Tanzania. This dataset is included as the data file SAFI_results_2. No information about the source of this data appears to be included in the lesson.
A small toy dataset consisting of the words "Hello world, how are you" in various permutations. (http://www.datacarpentry.org/spreadsheets-socialsci/06-exporting-data/).

Thus use of multiple different datasets is confusing and potentially distracting to the learners. I propose that this lesson is reorganized such that all of the exercises use the SAFI dataset. This reorganization involves the following:

Addition of a description of the SAFI data and a link to the source data
Deletion of other data files
Removal of image of "Hello world" dataset from Episode 1
Moving Episode 3 to immediately after Episode 1.
Replacing the metadata section with an example of metadata from the SAFI data.
Adjustment of the Quality control episode so that the examples refer to the SAFI dataset.
Adjustment of the Dates episode so that the examples refer to the SAFI dataset (if possible - this data may not include dates).

Simplify installation instructions for libreoffice

they are exactly repeated for the 3 OSes

download links in setup are broken

depending on where the data end up, we'll need to update the download link

topic for the repository

Please, consider adding a lesson topic to the repository. To do so you can follow the help about how to add topics to the repository. Check out the topics that the Genomics R intro lesson has gotten to add others that may be relevant to this lesson.

This will help people to know which repositories are lessons and also could be used to automate analysis of the repositories.

caveat for date handling

Episode 4 has the text:

This regional variation is handled automatically by the spreadsheet program so that when you are typing in dates they appear as you would expect. If you try to type in a US format date into a UK version of Excel, it may or may not be treated as a date.

The phrase "they appear as you would expect" is somewhat misleading. They may or may not appear as you would expect! If, for example, you are a researcher from the US working in the UK and using your new lab's computer and software, you may type in a US date and have it render according to UK rules. Alternatively, if that same person is using their personal computer and sharing files back and forth with a collaborator who uses a UK version of Excel, there will also be problems!

I think it is key here to stress that because Excel is manipulating the values that you enter (and storing them in some non-transparent way) you can't be sure exactly what you're getting (ie it's not reproducible). That's why it's always best to store dates not as dates but as numbers using multiple columns to separate the components (e.g. month, day, year).

remove mention of openoffice in setup

more caveats for date handling

In Episode 4 it would be good to mention somewhere in the section "dates Excel doesn't recognize" or "dates Excel gets wrong" that different versions of Excel will perform differently. For example, different versions of Excel treat different dates as "day 1" for the purpose of calculating the numerical value of a date.

Identifying cells that break data validation rules

The current quality assurance section states that "data validation rules are not applied retroactively to data that is already present in the cell".

However, in Excel, it's possible to highlight sections of your data that don't meet the data validation rules by clicking the drop-down next to "Data Validation" in Excel and selecting "Circle Invalid Data".

Showing learners how to find and locate incorrectly-validated data would be a good addition to the curriculum.

Updating learning outcomes

@sheraaron suggests the following change:

On this page: https://datacarpentry.org/spreadsheets-socialsci/, I would like to add: (Under - "In this lesson, you will learn:" To enable us to use data successfully)

@ErinBecker and/or @fmichonneau - do changes to the outcomes need to be routed through the CAC?

This was originally submitted as issue #71

Lesson release checklist

Lesson Release checklist

For each lesson release, copy this checklist to an issue and check off
during preparation for release

Scheduled Freeze Date: 2018-04-27
Scheduled Release Date: 2018-04-30

Checklist of tasks to complete before release:

Dates spreadsheet value errors

In the dates as data episode, the spreadsheet has two tabs. The first tab is data stored as DD-MM-YYYY, the second is MM-DD-YYYY. If learners use the wrong tab for their location, they will get a #VALUE error. This should be added to Instructor notes.

Flow of formatting errors/metadata section (episdoes 2-3)

The organization of the beginning of this lesson as written feels either repetitive or mysterious. I taught it for the second time today and I switched the order some that made it flow a little more smoothly I think. I'd like to recommend a fairly significant reorganization. of the content in the first 3 episodes.

Introduction [new ep 1]

( material that's there already)

Data Description

overview of the SAFI data (from ep 2)

Formatting Data in Spreadsheets [new ep 2]

intro with goals and basics of tidy data (from ep 2)
exercise to find errors in messy data (from ep 2)

(common formatting errors sections from ep 3)

MetaData [new ep 3]

current material on metadata (from ep 2)
new material on how to setup meta data// links to resources,

the suggested new material came up because the host site I taught at today the librarians in the room (hosts) mentioned that some repositories have standards for metadata and standards for it, we don't need a deep coverage of that, but links or points to some of that and noting that format of meta data is based on disciplinary standards I think is valuable add to make that discussion more concrete.

introduce report_data spreadsheets

Describe data being used for report_data_c.xlsx and report_data_o.xlsx

What is the "Reading" variable and what units is it in?
Where are "North" and "South" regions of?
Is this modified from real data that is available somewhere?

These questions should be answered when the data is first introduced.

readme file and SAFI_clean.csv

Hi I'm planning to teach this fantastic workshop in a couple of weeks at the University of Bath. As I've been going through the lessons a couple of things struck me. The first was the need for a readme file for the data files so that we can model good research data management practice to students (I'm a research data librarian). The second is that SAFI_clean.csv has a couple of variables (items_owned and months_lack_food) that would be problematic to analyse in most statistical analysis programmes because they have multiple values that are all in a different order. My concern is that by calling this file 'clean' we are then saying that it is ready to go. I have attached a readme file for the data files that you are very welcome to use (you will need to remove references to the University of Bath). I have also attached a version of SAFI_clean where there is a worksheet where all text variables have been coded to numerical variables and items_owned and months_lack_food have also been coded into yes / no variables using and IF statement that will search for text variables. There is also a worksheet in the workbook with documentation to accompany the changes I've made to SAFI_clean. This could be provided as a resource for tutors if you think it would be useful.

All the best
Alison Nightingale (University of Bath, UK)
SAFI_clean_processed.xlsx
SAFI_readme.txt

A glossary is needed

There is no glossary in http://www.datacarpentry.org/socialsci-workshop/reference/

Add mention of "data organization in spreadsheets"

It's a well-written, comprehensive, concise paper that covers many of the same points made in the lesson: https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989

Can be added in lesson 1, along with Hadley Wickham's tidy data paper.

Quality assurance: Excel refuse to validate data on CSV file.

the Quality assurance section suggest to open the clean CSV and apply Validation Rules.
Excel will refuse to do that until the CSV is re-saved as xlsx.

As we don't really use CSV until later, could/should(?) the default file be an XLSX ?

A description of the dataset is needed

The goal of the study needs be articulated
All the variables included in the dataset need to be described, and why they are included in the dataset needs to be explained
The type of questions that can be asked from the data needs to be articulated
The full citation information for the dataset needs to be included (as well as possible links to published studies using the dataset).

Error alert vs. input message

In the Quality assurance episode, the example input message given ("Invalid number: Number of household members must be a whole number between 1 and 30.") would be more appropriate as an error alert message.

clarify report_data_o.xlsx

Episode 02-reformatting includes an Excel spreadsheet titled "report_data_o.xlsx". This spreadsheet has a number of individual tables, with each having another header bar off to the right side of that table. It wasn't clear to me at first, but that second header bar is meant to apply to the table to it's left (as extra metadata) and not to be a header for another independent table. Those headers can be moved to immediately below the header line for each table to clarify that they belong with that data.

Note that this is an example of "bad formatted" data so it's ok to have two header lines per table.

remove unnecessary jargon

In episode one there is this text:

consistency when representing boolean values

The term "boolean" is probably not familiar to many people. Could remove this jargon and/or add a parenthesis saying something like "True" vs "TRUE".

move "spreadsheets outline" to index page

The first episode has some text like:

Spreadsheets outline
In this lesson, we’re going to talk about:
Using best practice to create your own data tables in spreadsheets
Reformatting existing spreadsheets (using Excel)
Recognising and reformatting dates in spreadsheets
Basic quality control; using data validation and Data entry forms (in Excel)
Exporting data from spreadsheets

This can be moved to the index page and used in the lesson introduction.

They key points for Episode 2 are unclear

The text for Episode 2's key points is a bit unclear:

Heading lines may contain information relevant to the rows below it and needs to added to each of them.
There may be multiple heading lines with different information in each.

I saw this when I was reviewing #34

For the first point, we are referring to plural and singular heading lines in the same sentence.

Dates and Date Formatting needs a file download step/link

The file for this activity appears to be present in the repo but is not linked anywhere in the episode:
https://github.com/data-lessons/spreadsheets-socialsci/blob/gh-pages/data/dates_1.csv

fix data section in setup

http://www.datacarpentry.org/spreadsheets-socialsci/setup/ is out of date. Needs link to SAFI data and for links to old datasets to be removed.

Section about metadata should mention metadata standards

metadata is more useful if it follows a standard. A quick search indicates that DDI (http://www.ddialliance.org/training/why-use-ddi) is used in the social sciences. Also indicate that librarians might be able to provide help and guidance with this.

Add items to index page

Index is missing a title, a intro description paragraph, and prerequisites.

link to download pages for openoffice and libreoffice

The first episode has:

Free spreadsheet programs such as LibreOffice and OpenOffice are available.

This should include links to LibreOffice and OpenOffice download pages.

01-format-data/index.html, one link is 404

... Looking at an entry for a particular variable, such as the variable SEX, provides valuable information...

Link lead to a 404.

action items from Curriculum Advisors

The Social Sciences CAC ([email protected]) met June 15th and 19th to discuss the full Social Sciences curriculum and provide recommendations to the Maintainers about work for these lessons between now and their next publication (December 2018). Their specific action items for this lesson are as follows:

Adding list of metadata resources to Spreadsheets lesson
Moving discussion of historical data to main text of Spreadsheets lesson (from callout box)

Please see the meeting minutes for more details.

Numbers does not have data validation

Learners working on Numbers will need to install LibreOffice. This should be added to Instructor Notes.

host data on Figshare

Currently, the data for this lesson lives in this repo ((here)[https://github.com/datacarpentry/spreadsheets-socialsci/tree/gh-pages/data]), but it should be hosted on Figshare.

very short paragraphs look odd

The first episode has a section with a lot of one sentence paragraphs. These look very strange to me. It would be better to collapse this to one or two paragraphs.

PC/Mac issues

Running through this lesson on a Mac running Excel v 15.40 (2017), I've encountered several problems. Some minor, some major. Will continue to list these here as I find them. So far:

"paste values" is instead "paste special -> values and numbers" on a mac
dates on a mac default to January 1 1904 as the date for value 1. This is an additional problem with date formatting compatibility.
Major problem: importing a csv file free of date formatting cannot be done using the Open function on my version. It must instead be accomplished by using File -> Import, followed by selecting csv -> delimited -> comma -> data format. In the final selection there is no option to avoid classifying type so "text" must be selected. (Oddly, selecting "general" does not prevent Excel from applying date formatting.)

Using multiple tabs

Reading over the lesson, I found conflicting a advice about whether it's advisable to use multiple tabs in Excel. In "Formatting data tables in Spreadsheets"/"Keeping track of your analyses", it's suggested to keep notes about your experiment in a new tab to keep the information together with your data. However, when you export an Excel spreadsheet to CSV for preservation/interoperability as recommended in the next section, each tab/sheet will be saved as a separate file. Also, the section about Metadata specifies that metadata should not be contained in the data file itself (experiment notes could be considered a form of descriptive metadata).

I like the suggestion to keep track of experiment notes along with your data, but there could be some clarification about how to manage this information once the dataset/analyses are finalized (e.g. save your notes as a separate CSV or paste them into a .txt file; preserve the link with the dataset by using a strategic file name and/or including information about the dataset at the beginning of the file).

add caveat about manually reformatting spreadsheets

Episode 2 has a good exercise for fixing formatting issues in a badly formatted spreadsheet, however this may lead the learner to infer that manually formatting a spreadsheet is good practice (despite the fact that it's not reproducible). It would be good to note here the following (with a little exposition of each):

This exercise is provided as an example of what good formatting looks like and why that formatting is needed.
A note about reproducibility.
It's best practice to format your data sheets like this from the beginning, but if you can't, at least save a raw version of your data before making any manual edits.
Some note about how we'll be learning about OpenRefine for making more reproducible changes to data sheets later in the workshop.

June 2019 Lesson Release checklist

If your Maintainer team has decided not to participate in the June 2019 lesson release, please close this issue.

To have this lesson included in the 18 June 2019 release, please confirm that the following items are true:

When all checkboxes above are completed, this lesson will be added to the 18 June lesson release. Please leave a comment on carpentries/lesson-infrastructure#26 or contact Erin Becker with questions ([email protected]).

Suggestion for Metadata Section

When DDI (Data Documentation Initiative) is introduced, I think it would be beneficial to state that this standard is applicable across disciplines because it generally applies to being used for data from surveys and observational research methods commonly used in the social, behavioral, economic, and health sciences.

It can also be added that metadata files are commonly created and saved as XML or JSON files, which are machine-readable but also easily read by humans.

Transition to standardized GitHub labels

The lesson infrastructure committee unanimously approved the proposal of using the same set of labels across all our repositories during its last meeting on May 23rd, 2018.

This repository has now been converted to use the standard set of labels.

If this repository used the previous set of recommended labels by Software Carpentry, they have been converted to the new one using the following rules:

SWC legacy labels	New 'The Carpentries' labels
bug	type:bug
discussion	type:discussion
enhancement	type:enhancement
help-wanted	help wanted
newcomer-friendly	good first issue
template-and-tools	type:template and tools
work-in-progress	status:in progress

The label instructor-training was removed as it is not used in the workflow of certifying new instructors anymore. The label question was left as is when it was in use, and removed otherwise. If your repository used custom labels (and issues were flagged with these labels), they were left as is.

The lesson infrastructure committee hopes the standard set of labels will make it easier for you to manage the issues you receive on the repositories you manage.

The lesson infrastructure committee will evaluate how the labels are being used in the next few months and we will solicit your feedback at this stage. In the meantime, if you have any questions or concerns, please leave a comment on this issue.

-- The Lesson Infrastructure subcommittee

PS: we will close this issue in 30 days if there is no activity.

Numbers spreadsheet program

The Introduction section refers to the Numbers spreadsheet program as an example of what students could use during the workshop (under "Things You'll Need To Complete This Tutorial"), but in the Setup instructions, learners are told not to use Apple's Numbers because it does not contain some of the necessary features.

Clarifying data download in "Things You'll Need"

@sheraaron suggests the following change to the intro page:

I also suggest to add to https://datacarpentry.org/spreadsheets-socialsci/00-intro/index.html
Under Things you'll need to complete this tutorial: Open a folder on your computer and save all your spreadsheets necessary for this exercise.

This makes sense to me - any objections to this change @ErinBecker and/or @fmichonneau?

This was originally submitted as issue #71

datacarpentry / spreadsheets-socialsci Goto Github PK

spreadsheets-socialsci's Issues

Lesson Release checklist

Introduction [new ep 1]

Data Description

Formatting Data in Spreadsheets [new ep 2]

(common formatting errors sections from ep 3)

MetaData [new ep 3]

Recommend Projects

Recommend Topics

Recommend Org