Giter VIP home page Giter VIP logo

spreadsheets-socialsci's Issues

Codebooks as social science metadata

Just to help bridge the curriculum to social sciences, its worth mentioning familiar forms of metadata in the social sciences. Survey researchers often write codebooks that explain their questions. Still, a codebook does not have other essential metadata info, like when/where a dataset was collected from.

libreoffice automatically uses semicolons as separators

Learners who are using LibreOffice for the workshop will have problems with the dataset as the default for LibreOffice is to treat tabs, commas, AND semicolons as delimiters. Need to add this to the instructor notes or a callout. Learners can select to use just commas as delimiters.

non-english versions of Excel

If learners are using a non-English language version of Excel, the =MONTH(), =DAY(), etc functions won't work for them. They will need to type in their language's equivalent of that word in the formula.

Add to instructor notes?

suggestion for metadata section on missing values

Suggestions for edits/additional content
Data Organization in Spreadsheets for Social Scientists
Formatting data tables in Spreadsheets
Metadata

Some of this information may be familiar to learners who collect or analyze survey data or data sets accompanied with additional data documentation, such as codebooks. Codebooks will often describe the original survey or interview questions associated with particular variables, the way variables have been constructed, response categories and their associated values, and the notations for missing values throughout the data. For example, the General Social Survey maintains their entire codebook online. Looking at an entry for a particular variable, such as the variable SEX, provides valuable information about the original question wording, scales or response categories, the years covered for that variable, the sample or sub-samples surveyed, and the meaning of particular values. Descriptions of missing values are important in cleaning survey data because they describe the various reasons why respondents did not answer a question (i.e., not applicable, didn't know, refused to answer, etc.), which leaves blank cells in the data. For example, in the General Social Survey missing values are numbered as 8, 9, 0 and sometimes other numbers that might be interpreted later on as integers that could interfere with accurate queries and analyses.

Cell references off after inserting column

References in this instruction: "Select the cell b4 and insert the formula '=IF($F4="Year",$G4,$B3)" need to be adjusted to reflect the column added in a previous step. Should reference G4... H4.

reorganization of this lesson

I'm working through the lesson and noticed that there are multiple "toy" datasets that are used for different parts of the lesson. Generally, in Data Carpentry lessons, we try to build all examples around a single dataset (as much as possible) and to have that dataset be somewhat realistic in terms of what the learners would be working with in their everyday workflow.

This lesson currently uses:

  • A toy dataset about "readings" in two different regions ("North" and "South") over nine days in January and February 2015. This is fictional data that doesn't represent anything in particular and is just being used as an example of bad formatting. (see conversation in #32). This data is included as the data files report_data_o.xlsx and report_data_c.xlsx.

  • A metadata example from a UK audit of political engagement (http://www.datacarpentry.org/spreadsheets-socialsci/03-formatting-problems/). This is included as this file: audit_of_political_engagement_11_ukda_data_dictionary.rtf.

  • A SAFI dataset about house types and livestock types in Mozambique and Tanzania. This dataset is included as the data file SAFI_results_2. No information about the source of this data appears to be included in the lesson.

  • A small toy dataset consisting of the words "Hello world, how are you" in various permutations. (http://www.datacarpentry.org/spreadsheets-socialsci/06-exporting-data/).

Thus use of multiple different datasets is confusing and potentially distracting to the learners. I propose that this lesson is reorganized such that all of the exercises use the SAFI dataset. This reorganization involves the following:

  • Addition of a description of the SAFI data and a link to the source data
  • Deletion of other data files
  • Removal of image of "Hello world" dataset from Episode 1
  • Moving Episode 3 to immediately after Episode 1.
  • Replacing the metadata section with an example of metadata from the SAFI data.
  • Adjustment of the Quality control episode so that the examples refer to the SAFI dataset.
  • Adjustment of the Dates episode so that the examples refer to the SAFI dataset (if possible - this data may not include dates).

caveat for date handling

Episode 4 has the text:

This regional variation is handled automatically by the spreadsheet program so that when you are typing in dates they appear as you would expect. If you try to type in a US format date into a UK version of Excel, it may or may not be treated as a date.

The phrase "they appear as you would expect" is somewhat misleading. They may or may not appear as you would expect! If, for example, you are a researcher from the US working in the UK and using your new lab's computer and software, you may type in a US date and have it render according to UK rules. Alternatively, if that same person is using their personal computer and sharing files back and forth with a collaborator who uses a UK version of Excel, there will also be problems!

I think it is key here to stress that because Excel is manipulating the values that you enter (and storing them in some non-transparent way) you can't be sure exactly what you're getting (ie it's not reproducible). That's why it's always best to store dates not as dates but as numbers using multiple columns to separate the components (e.g. month, day, year).

more caveats for date handling

In Episode 4 it would be good to mention somewhere in the section "dates Excel doesn't recognize" or "dates Excel gets wrong" that different versions of Excel will perform differently. For example, different versions of Excel treat different dates as "day 1" for the purpose of calculating the numerical value of a date.

Identifying cells that break data validation rules

The current quality assurance section states that "data validation rules are not applied retroactively to data that is already present in the cell".

However, in Excel, it's possible to highlight sections of your data that don't meet the data validation rules by clicking the drop-down next to "Data Validation" in Excel and selecting "Circle Invalid Data".

Screen Shot 2019-07-23 at 3 02 50 PM

Showing learners how to find and locate incorrectly-validated data would be a good addition to the curriculum.

Lesson release checklist

Lesson Release checklist

For each lesson release, copy this checklist to an issue and check off
during preparation for release

Scheduled Freeze Date: 2018-04-27
Scheduled Release Date: 2018-04-30

Checklist of tasks to complete before release:

  • check that the learning objectives reflect the content of the lessons
  • check that learning objectives are phrased as statements using action words
  • check for typos
  • check that the live coding examples work as expected
  • if example code generates warnings, explain in narrative and instructor notes
  • check that challenges and their solutions work as expected
  • check that the challenges test skills that have been seen
  • check that the setup instructions are up to date (e.g., update version numbers)
  • check that data is available and mentions of the data in the lessons are accurate
  • check that the instructor guide is up to date with the content of the lessons
  • check that all the links within the lessons work (this should be automated)
  • check that the cheat sheets included in lessons are up to date (e.g., RStudio updates them regularly)
  • check that languge is clear and free of idioms and colloquialisms
  • make sure formatting of the code in the lesson looks good (e.g. line breaks)
  • check for clarity and flow of narrative
  • update README as needed
  • fill out “overview” for each module - minutes needed for teaching and exercises, questions and learning objectives
  • check that contributor guidelines are clear and consistent
  • clean up files (e.g. delete deprecated files, insure filenames are consistent)
  • update the release notes (NEWS)
  • tag release on GitHub

Dates spreadsheet value errors

In the dates as data episode, the spreadsheet has two tabs. The first tab is data stored as DD-MM-YYYY, the second is MM-DD-YYYY. If learners use the wrong tab for their location, they will get a #VALUE error. This should be added to Instructor notes.

Flow of formatting errors/metadata section (episdoes 2-3)

The organization of the beginning of this lesson as written feels either repetitive or mysterious. I taught it for the second time today and I switched the order some that made it flow a little more smoothly I think. I'd like to recommend a fairly significant reorganization. of the content in the first 3 episodes.

Introduction [new ep 1]

( material that's there already)

Data Description

overview of the SAFI data (from ep 2)

Formatting Data in Spreadsheets [new ep 2]

intro with goals and basics of tidy data (from ep 2)
exercise to find errors in messy data (from ep 2)

(common formatting errors sections from ep 3)

MetaData [new ep 3]

current material on metadata (from ep 2)
new material on how to setup meta data// links to resources,

the suggested new material came up because the host site I taught at today the librarians in the room (hosts) mentioned that some repositories have standards for metadata and standards for it, we don't need a deep coverage of that, but links or points to some of that and noting that format of meta data is based on disciplinary standards I think is valuable add to make that discussion more concrete.

introduce report_data spreadsheets

Describe data being used for report_data_c.xlsx and report_data_o.xlsx

What is the "Reading" variable and what units is it in?
Where are "North" and "South" regions of?
Is this modified from real data that is available somewhere?

These questions should be answered when the data is first introduced.

readme file and SAFI_clean.csv

Hi I'm planning to teach this fantastic workshop in a couple of weeks at the University of Bath. As I've been going through the lessons a couple of things struck me. The first was the need for a readme file for the data files so that we can model good research data management practice to students (I'm a research data librarian). The second is that SAFI_clean.csv has a couple of variables (items_owned and months_lack_food) that would be problematic to analyse in most statistical analysis programmes because they have multiple values that are all in a different order. My concern is that by calling this file 'clean' we are then saying that it is ready to go. I have attached a readme file for the data files that you are very welcome to use (you will need to remove references to the University of Bath). I have also attached a version of SAFI_clean where there is a worksheet where all text variables have been coded to numerical variables and items_owned and months_lack_food have also been coded into yes / no variables using and IF statement that will search for text variables. There is also a worksheet in the workbook with documentation to accompany the changes I've made to SAFI_clean. This could be provided as a resource for tutors if you think it would be useful.

All the best
Alison Nightingale (University of Bath, UK)
SAFI_clean_processed.xlsx
SAFI_readme.txt

A description of the dataset is needed

  • The goal of the study needs be articulated
  • All the variables included in the dataset need to be described, and why they are included in the dataset needs to be explained
  • The type of questions that can be asked from the data needs to be articulated
  • The full citation information for the dataset needs to be included (as well as possible links to published studies using the dataset).

Error alert vs. input message

In the Quality assurance episode, the example input message given ("Invalid number: Number of household members must be a whole number between 1 and 30.") would be more appropriate as an error alert message.

clarify report_data_o.xlsx

Episode 02-reformatting includes an Excel spreadsheet titled "report_data_o.xlsx". This spreadsheet has a number of individual tables, with each having another header bar off to the right side of that table. It wasn't clear to me at first, but that second header bar is meant to apply to the table to it's left (as extra metadata) and not to be a header for another independent table. Those headers can be moved to immediately below the header line for each table to clarify that they belong with that data.

Note that this is an example of "bad formatted" data so it's ok to have two header lines per table.

remove unnecessary jargon

In episode one there is this text:

consistency when representing boolean values

The term "boolean" is probably not familiar to many people. Could remove this jargon and/or add a parenthesis saying something like "True" vs "TRUE".

move "spreadsheets outline" to index page

The first episode has some text like:

Spreadsheets outline
In this lesson, we’re going to talk about:
Using best practice to create your own data tables in spreadsheets
Reformatting existing spreadsheets (using Excel)
Recognising and reformatting dates in spreadsheets
Basic quality control; using data validation and Data entry forms (in Excel)
Exporting data from spreadsheets

This can be moved to the index page and used in the lesson introduction.

They key points for Episode 2 are unclear

The text for Episode 2's key points is a bit unclear:

Heading lines may contain information relevant to the rows below it and needs to added to each of them.
There may be multiple heading lines with different information in each.

I saw this when I was reviewing #34

For the first point, we are referring to plural and singular heading lines in the same sentence.

action items from Curriculum Advisors

The Social Sciences CAC ([email protected]) met June 15th and 19th to discuss the full Social Sciences curriculum and provide recommendations to the Maintainers about work for these lessons between now and their next publication (December 2018). Their specific action items for this lesson are as follows:

  • Adding list of metadata resources to Spreadsheets lesson
  • Moving discussion of historical data to main text of Spreadsheets lesson (from callout box)

Please see the meeting minutes for more details.

host data on Figshare

Currently, the data for this lesson lives in this repo ((here)[https://github.com/datacarpentry/spreadsheets-socialsci/tree/gh-pages/data]), but it should be hosted on Figshare.

PC/Mac issues

Running through this lesson on a Mac running Excel v 15.40 (2017), I've encountered several problems. Some minor, some major. Will continue to list these here as I find them. So far:

  1. "paste values" is instead "paste special -> values and numbers" on a mac
  2. dates on a mac default to January 1 1904 as the date for value 1. This is an additional problem with date formatting compatibility.
  3. Major problem: importing a csv file free of date formatting cannot be done using the Open function on my version. It must instead be accomplished by using File -> Import, followed by selecting csv -> delimited -> comma -> data format. In the final selection there is no option to avoid classifying type so "text" must be selected. (Oddly, selecting "general" does not prevent Excel from applying date formatting.)

Using multiple tabs

Reading over the lesson, I found conflicting a advice about whether it's advisable to use multiple tabs in Excel. In "Formatting data tables in Spreadsheets"/"Keeping track of your analyses", it's suggested to keep notes about your experiment in a new tab to keep the information together with your data. However, when you export an Excel spreadsheet to CSV for preservation/interoperability as recommended in the next section, each tab/sheet will be saved as a separate file. Also, the section about Metadata specifies that metadata should not be contained in the data file itself (experiment notes could be considered a form of descriptive metadata).

I like the suggestion to keep track of experiment notes along with your data, but there could be some clarification about how to manage this information once the dataset/analyses are finalized (e.g. save your notes as a separate CSV or paste them into a .txt file; preserve the link with the dataset by using a strategic file name and/or including information about the dataset at the beginning of the file).

add caveat about manually reformatting spreadsheets

Episode 2 has a good exercise for fixing formatting issues in a badly formatted spreadsheet, however this may lead the learner to infer that manually formatting a spreadsheet is good practice (despite the fact that it's not reproducible). It would be good to note here the following (with a little exposition of each):

  • This exercise is provided as an example of what good formatting looks like and why that formatting is needed.
  • A note about reproducibility.
  • It's best practice to format your data sheets like this from the beginning, but if you can't, at least save a raw version of your data before making any manual edits.
  • Some note about how we'll be learning about OpenRefine for making more reproducible changes to data sheets later in the workshop.

June 2019 Lesson Release checklist

If your Maintainer team has decided not to participate in the June 2019 lesson release, please close this issue.

To have this lesson included in the 18 June 2019 release, please confirm that the following items are true:

  • Example code chunks run as expected
  • Challenges / exercises run as expected
  • Challenge / exercise solutions are correct
  • Call out boxes (exercises, discussions, tips, etc) render correctly
  • A schedule appears on the lesson homepage (e.g. not “00:00”)
  • Each episode includes learning objectives
  • Each episode includes questions
  • Each episode includes key points
  • Setup instructions are up-to-date, correct, clear, and complete
  • File structure is clean (e.g. delete deprecated files, insure filenames are consistent)
  • Some Instructor notes are provided
  • Lesson links work as expected

When all checkboxes above are completed, this lesson will be added to the 18 June lesson release. Please leave a comment on carpentries/lesson-infrastructure#26 or contact Erin Becker with questions ([email protected]).

Suggestion for Metadata Section

When DDI (Data Documentation Initiative) is introduced, I think it would be beneficial to state that this standard is applicable across disciplines because it generally applies to being used for data from surveys and observational research methods commonly used in the social, behavioral, economic, and health sciences.

It can also be added that metadata files are commonly created and saved as XML or JSON files, which are machine-readable but also easily read by humans.

Transition to standardized GitHub labels

The lesson infrastructure committee unanimously approved the proposal of using the same set of labels across all our repositories during its last meeting on May 23rd, 2018.

This repository has now been converted to use the standard set of labels.

If this repository used the previous set of recommended labels by Software Carpentry, they have been converted to the new one using the following rules:

SWC legacy labels New 'The Carpentries' labels
bug type:bug
discussion type:discussion
enhancement type:enhancement
help-wanted help wanted
newcomer-friendly good first issue
template-and-tools type:template and tools
work-in-progress status:in progress

The label instructor-training was removed as it is not used in the workflow of certifying new instructors anymore. The label question was left as is when it was in use, and removed otherwise. If your repository used custom labels (and issues were flagged with these labels), they were left as is.

The lesson infrastructure committee hopes the standard set of labels will make it easier for you to manage the issues you receive on the repositories you manage.

The lesson infrastructure committee will evaluate how the labels are being used in the next few months and we will solicit your feedback at this stage. In the meantime, if you have any questions or concerns, please leave a comment on this issue.

-- The Lesson Infrastructure subcommittee

PS: we will close this issue in 30 days if there is no activity.

Numbers spreadsheet program

The Introduction section refers to the Numbers spreadsheet program as an example of what students could use during the workshop (under "Things You'll Need To Complete This Tutorial"), but in the Setup instructions, learners are told not to use Apple's Numbers because it does not contain some of the necessary features.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.