
sophistication's Introduction


Code for use in measuring the sophistication of political text, accompanying "Measuring and Explaining Political Sophistication Through Textual Complexity" by Kenneth Benoit, Kevin Munger, and Arthur Spirling. This package is built on quanteda.

How to install

Using the devtools package:

devtools::install_github("kbenoit/sophistication")

If you have trouble installing sophistication using devtools, check that you have pre-installed conda or miniconda and that you are using the correct version of spacyr. Then try installing sophistication with the following steps:

devtools::install_github("quanteda/spacyr", build_vignettes = FALSE)
library("spacyr")
spacy_install()
spacy_initialize()
devtools::install_github("kbenoit/sophistication")

For more information, see the spacyr documentation: https://cran.r-project.org/web/packages/spacyr/readme/README.html

Included Data

new name                     original name      description
data_corpus_fifthgrade       fifthCorpus        Fifth-grade reading texts
data_corpus_crimson          crimsonCorpus      Editorials from the Harvard Crimson
data_corpus_partybroadcast   partybcastCorpus   UK political party broadcasts
data_corpus_presdebates     presDebateCorpus   US presidential debates 2016
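
Once the package is attached, these corpora are available directly by name. For example (a minimal sketch; the objects are quanteda corpora, so quanteda must be installed):

library("sophistication")
# count the documents in the fifth-grade reading corpus
quanteda::ndoc(data_corpus_fifthgrade)
# summarize the first three documents
summary(data_corpus_fifthgrade, n = 3)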

How to use

library("sophistication")
## Loading required package: quanteda
## Package version: 2.1.9000
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
## spacy python option is already set, spacyr will use:
##  condaenv = "spacy_condaenv"
## successfully initialized (spaCy Version: 2.3.2, language model: en_core_web_sm)
## (python options: type = "condaenv", value = "spacy_condaenv")

# make the snippets of one sentence each, 150-250 chars in length
data(data_corpus_sotu, package = "quanteda.corpora")
snippetData <- snippets_make(data_corpus_sotu, nsentence = 1, minchar = 150, maxchar = 250)
# clean up the snippets
snippetData <- snippets_clean(snippetData)
## Cleaning 20,662 snippets...
##    removed 1,166 snippets containing numbers of at least 1,000
##    removed 273 snippets containing ALL CAPS titles
##    ...finished.

# randomly sample five snippets
set.seed(10)
testData <- snippetData[sample(1:nrow(snippetData), 5), ]

# generate pairs for a minimum spanning tree
(snippetPairsMST <- pairs_regular_make(testData))
##           docID1 snippetID1
## 1   Madison-1813    2500042
## 2 Roosevelt-1938   14900134
## 3     Grant-1872    8400141
## 4   Johnson-1966   18100222
##                                                                                                                                                                                                                                                    text1
## 1                                                                         The minister plenipotentiary of the United States at Paris had not been enabled by proper opportunities to press the objects of his mission as prescribed by his instructions.
## 2 We have but to talk with hundreds of small bankers throughout the United States to realize that irrespective of local conditions, they are compelled in practice to accept the policies laid down by a small number of the larger banks in the Nation.
## 3                        Ten additional stations have been established in the United States, and arrangements have been made for an exchange of reports with Canada, and a similar exchange of observations is contemplated with the West India Islands.
## 4                                                                                    We will respond if others reduce their use of force, and we will withdraw our soldiers once South Vietnam is securely guaranteed the right to shape its own future.
##           docID2 snippetID2
## 1 Roosevelt-1938   14900134
## 2     Grant-1872    8400141
## 3   Johnson-1966   18100222
## 4   Clinton-1998   21900127
##                                                                                                                                                                                                                                                    text2
## 1 We have but to talk with hundreds of small bankers throughout the United States to realize that irrespective of local conditions, they are compelled in practice to accept the policies laid down by a small number of the larger banks in the Nation.
## 2                        Ten additional stations have been established in the United States, and arrangements have been made for an exchange of reports with Canada, and a similar exchange of observations is contemplated with the West India Islands.
## 3                                                                                    We will respond if others reduce their use of force, and we will withdraw our soldiers once South Vietnam is securely guaranteed the right to shape its own future.
## 4                                      And I think we should say to all the people we're trying to represent here that preparing for a far-off storm that may reach our shores is far wiser than ignoring the thunder till the clouds are just overhead.

We can also use pairs_gold_make() to generate "gold" questions based on readability differences:

# make a lot of candidate pairs
snippetPairsAll <- pairs_regular_make(snippetData[sample(1:nrow(snippetData), 1000), ])
# make 10 gold pairs from these
pairs_gold_make(snippetPairsAll, n.pairs = 10)
## Starting the creation of gold questions...
##    computing Flesch readability measure
##    selecting top different 10 pairs
##    applying min.diff.quantile thresholds of 2.89, 34.57
##    creating gold_reason text
##    ...finished.
##              docID1 snippetID1
## 1         Taft-1910   12200029
## 2        Grant-1872    8400202
## 3         Polk-1846    5800321
## 4        Obama-2010   23100392
## 5       Monroe-1818    3000043
## 6       Hoover-1929   14100290
## 7  Eisenhower-1953b   16600327
## 8      Carter-1979b   19800517
## 9       Arthur-1884    9600167
## 10       Nixon-1971   18600034
##                                                                                                                                                                                                                                              text1
## 1                                                            In completion of this work, the regulations agreed upon require congressional legislation to make them effective and for their enforcement in fulfillment of the treaty stipulations.
## 2                                                                          The work which in some of them for some years has been in arrears has been brought down to a recent date, and in all the current business is being promptly dispatched.
## 3                                                        The reasons which induced me to recommend the measure at that time still exist, and I again submit the subject for your consideration and suggest the importance of early action upon it.
## 4                                                                 And it lives on in all the Americans who've dropped everything to go some place they've never been and pull people they've never known from rubble, prompting chants of "U.S.A.!
## 5                                                                 Even if the territory had been exclusively that of Spain and her power complete over it, we had a right by the law of nations to follow the enemy on it and to subdue him there.
## 6                                                                           Any other attitude by the Federal Government will undermine one of the most precious possessions of the American people; that is, local and individual responsibility.
## 7  I shall shortly send you specific recommendations for establishing such an appropriate commission, together with a reorganization plan defining new administrative status for all Federal activities in health, education, and social security.
## 8                         I recently announced my intention to submit legislation to Congress protecting the rights of the press, and others preparing materials for publication, from searches and seizures undertaken without judicial approval.
## 9                       The Secretary of War submits the report of the Chief of Engineers as to the practicability of protecting our important cities on the seaboard by fortifications and other defenses able to repel modern methods of attack.
## 10                                                                                     Over the next 2 weeks, I will call upon Congress to take action on more than 35 pieces of proposed legislation on which action was not completed last year.
##             docID2 snippetID2
## 1   Cleveland-1888   10000309
## 2    Coolidge-1927   13900428
## 3  Eisenhower-1960   17400074
## 4        Taft-1912   12400227
## 5     Carter-1978b   19600273
## 6       Obama-2016   23700349
## 7       Grant-1870    8200062
## 8   Roosevelt-1936   14700068
## 9     Lincoln-1861    7300176
## 10    Carter-1980b   20000194
##                                                                                                                                                                                                                                               text2
## 1                                                                                         It remains to make the most of it, and when that shall be done the curse will be lifted, the Indian race saved, and the sin of their oppression redeemed.
## 2                                                                                 Stimson, former Secretary of War, was sent there to cooperate with our diplomatic and military officers in effecting a settlement between the contending parties.
## 3                                                              These qualities of determination are particularly essential because of the fact that the process of improvement will necessarily be gradual and laborious rather than revolutionary.
## 4  The good offices which the commissioners were able to exercise were instrumental in bringing the contending parties together and in furnishing a basis of adjustment which it is hoped will result in permanent benefit to the Dominican people.
## 5                                                                                         This year we will continue our deregulatory efforts in the legislative and administrative areas in order to reduce anti-competitive practices and abuses.
## 6                               I see it in the elderly woman who will wait in line to cast her vote as long as she has to, the new citizen who casts his vote for the first time, the volunteers at the polls who believe every vote should count.
## 7                                                                                   Its possession by us will in a few years build up a coastwise commerce of immense magnitude, which will go far toward restoring to us our lost merchant marine.
## 8                                                                  In March, 1933, I appealed to the Congress of the United States and to the people of the United States in a new effort to restore power to those to whom it rightfully belonged.
## 9                                                             In a storm at sea no one on board can wish the ship to sink, and yet not unfrequently all go down together because too many will direct and no single mind can be allowed to control.
## 10                                                              If unemployment should dramatically increase, I will be prepared to consider actions to counter that increase, consistent with our overriding concern about accelerating inflation.
##        read1     read2  readdiff _golden easier_gold
## 1   14.49885 72.045000 -57.54615    TRUE           2
## 2   71.24875  9.750000  61.49875    TRUE           1
## 3   39.52375 -8.044000  47.56775    TRUE           1
## 4   60.76500 14.649211  46.11579    TRUE           1
## 5   50.44500 -4.101304  54.54630    TRUE           1
## 6    5.49200 58.347727 -52.85573    TRUE           2
## 7  -18.63875 51.958621 -70.59737    TRUE           2
## 8    4.36500 55.377941 -51.01294    TRUE           2
## 9   12.84500 59.528649 -46.68365    TRUE           2
## 10  57.79310  2.700000  55.09310    TRUE           1
##                                                                                                                                                                                                      easier_gold_reason
## 1  Text B is "easier" to read because it contains some combination of shorter sentences, more commonly used and more easily understood terms, and is generally less complicated and easier to read and grasp its point.
## 2  Text A is "easier" to read because it contains some combination of shorter sentences, more commonly used and more easily understood terms, and is generally less complicated and easier to read and grasp its point.
## 3  Text A is "easier" to read because it contains some combination of shorter sentences, more commonly used and more easily understood terms, and is generally less complicated and easier to read and grasp its point.
## 4  Text A is "easier" to read because it contains some combination of shorter sentences, more commonly used and more easily understood terms, and is generally less complicated and easier to read and grasp its point.
## 5  Text A is "easier" to read because it contains some combination of shorter sentences, more commonly used and more easily understood terms, and is generally less complicated and easier to read and grasp its point.
## 6  Text B is "easier" to read because it contains some combination of shorter sentences, more commonly used and more easily understood terms, and is generally less complicated and easier to read and grasp its point.
## 7  Text B is "easier" to read because it contains some combination of shorter sentences, more commonly used and more easily understood terms, and is generally less complicated and easier to read and grasp its point.
## 8  Text B is "easier" to read because it contains some combination of shorter sentences, more commonly used and more easily understood terms, and is generally less complicated and easier to read and grasp its point.
## 9  Text B is "easier" to read because it contains some combination of shorter sentences, more commonly used and more easily understood terms, and is generally less complicated and easier to read and grasp its point.
## 10 Text A is "easier" to read because it contains some combination of shorter sentences, more commonly used and more easily understood terms, and is generally less complicated and easier to read and grasp its point.

There is a lot more to the package than this, of course. Our documentation will improve as we develop the package, with an aim toward eventual CRAN release.

sophistication's People

Contributors

kbenoit, kmunger


sophistication's Issues

Additional documentation for predict_readability

Within the predict_readability documentation, it could be useful to add more information or reference material on:

  • what lambda represents and its interpretation
  • what the default parameter values reference_top = -2.1763368548 and reference_bottom = -3.865467 mean.

Note: the documentation already notes where these parameter values come from (fifth-grade texts for the bottom, SOTU for the top), but a description of how increasing or decreasing these values affects the analysis would also be useful.
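
For example, the documentation could anchor this discussion with an explicit call (a sketch only; BT_best stands in for a fitted Bradley-Terry model and data_corpus_sotu for the texts to score, neither defined here):

# sketch: passing the documented defaults explicitly; the docs should
# explain how moving reference_top or reference_bottom (and lambda)
# shifts the resulting readability predictions
preds <- predict_readability(BT_best, newdata = data_corpus_sotu,
                             reference_top = -2.1763368548,
                             reference_bottom = -3.865467)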

Documentation Examples for Functions

It could be useful to add examples to the following functions:

  • snippets_make() & snippets_clean() (could be from README)
  • pairs_gold_make() & pairs_regular_make()
  • pairs_gold_browse() & pairs_regular_browse()
  • covars_make()

Note: I tried to go through the documentation for all the functions, but this may not be an exhaustive list!

check dependencies and imports

  • move quanteda from Depends to Imports, use quanteda:: before any quanteda functions
  • check status of Suggests packages
  • check that all Imports are used (reshape2 ?)

Data Corpora Exceeding CRAN Size Limit

As far as I can tell, the data corpora currently included in sophistication (e.g., data_corpus_fifthgrade.RData, data_corpus_partybroadcasts.RData, etc.) are larger than the CRAN package size limit of 5 MB.

I experimented with using a drat repository as a possible solution (following the approach in this article, shared with me by @ArthurSpirling: https://journal.r-project.org/archive/2017/RJ-2017-026/index.html). The basic idea is that a code package submitted to CRAN can interact with a larger data package hosted in a drat repository on GitHub.

I implemented the first two steps outlined in the article: I created a drat repository and posted a data package containing the sophistication corpora (here: https://github.com/gmhurtado/drat/tree/gh-pages), and everything worked as expected. This implementation is by no means a 'formal' effort and is only temporary, as I forked the original drat repository (https://github.com/eddelbuettel/drat) to my own account and did not include documentation.

The next step would be to update the source package's DESCRIPTION file to reflect the new dependency: list the data package under Suggests and add the drat repository address to the file. Instructions for installing the data package might be useful as well, but are not strictly necessary.

Finally, the source code should be modified to account for the data package: it should check on load whether the data package is installed, and any functions, tests, or examples that depend on data from the data package should condition on that check.
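
A minimal sketch of these changes (the data package name sophistication.data and the repository address are illustrative, the latter inferred from the gh-pages setup above):

# in DESCRIPTION:
#   Suggests: sophistication.data
#   Additional_repositories: https://gmhurtado.github.io/drat

# run-time guard in any function that needs the corpora:
if (!requireNamespace("sophistication.data", quietly = TRUE)) {
    stop("The 'sophistication.data' package is required; install it with:\n",
         "install.packages('sophistication.data', ",
         "repos = 'https://gmhurtado.github.io/drat')")
}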

If a drat repository is the best way to solve the data size issue, I can replicate the drat implementation formally and write out the explicit changes needed for the source package.

covars_make_all returns NAs for baselines

When I run the function covars_make_all on hansard speeches, 29 of the 33 measures are returned correctly, but not the 4 measures related to word rarity.

However, when I run covars_make_baselines, these 4 measures work on the same corpus.

setwd("C:/Users/kevin/Dropbox/Benoit_Spirling_Readability/hansard_data/")
files<-list.files()
  
 
##initialize
all_files<-read.csv(paste0(files[2]), stringsAsFactors = F)
restricted<-filter(all_files, party == "Conservative" | party == "Labour")
speakers<-all_files$speaker
tab<-table(speakers)
speakers_morethan10 <- names(tab[tab > 10])
restricted <- filter(restricted, speaker %in% speakers_morethan10)


restricted<-restricted[which(ntoken(restricted$text)>10),]

data_corpus_speeches66 <- corpus(restricted)

pos<-covars_make_all(data_corpus_speeches66, dependency=F)`

> pos$google_min_2000[100]
[1] NA

> pos$brown_mean[1000]
[1] NA

Package installation instructions need to describe spacyr in more detail

I received the following error message when attempting to install the package for the first time:

Finding a python executable with spaCy installed...
Error: package or namespace load failed for 'sophistication':
 .onAttach failed in attachNamespace() for 'sophistication', details:
  call: set_spacy_python_option(python_executable, virtualenv, condaenv, 
  error: spaCy or language model en_core_web_sm is not installed in any of python executables.
Error: loading failed
Execution halted

After troubleshooting, I found the following solution to my problem:

# install and load spacyr
devtools::install_github("quanteda/spacyr", build_vignettes = FALSE)
library("spacyr")
spacy_install()
spacy_initialize()
# source: https://cran.r-project.org/web/packages/spacyr/readme/README.html

After this, I was able to install sophistication using devtools::install_github("kbenoit/sophistication"). However, I also needed to install the following dependency first: install.packages('quanteda.textmodels').

Additionally, other team members had new issues on installation:

  • With both 32-bit and 64-bit versions of R installed, we needed to run the following to install successfully: devtools::install_github("kbenoit/sophistication", INSTALL_opts = c("--no-multiarch")).
  • There is also a potential compatibility issue with RcppArmadillo, which requires answering "yes" to the compilation question when installing sophistication.

predict needs to calculate the year a given text was created

In the R package, the predict function first needs to calculate the covariates on the new text, and currently the Google year is hardcoded:

compute_google_min_2000 <- function(toks) {
    baseline_word <- "the"
    baseline_year <- 2000   # hardcoded
    # ... (excerpt)

It needs to be able to take as an argument the year (or decade) in which a given text was written.
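
A minimal sketch of the requested change (illustrative, not the package's actual implementation; the body is elided because it belongs to the package):

# let callers pass the year (or decade) of composition
compute_google_min <- function(toks, baseline_year = 2000) {
    baseline_word <- "the"
    # ...look up Google Books frequencies relative to baseline_year...
}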

quanteda.textmodels dependency for data function call

I needed to run install.packages('quanteda.textmodels') prior to running data(data_corpus_irishbudget2010, package = "quanteda.textmodels").

The error message is shown below:

> library(sophistication)
> data(data_corpus_irishbudget2010, package = "quanteda.textmodels")
Error in find.package(package, lib.loc, verbose = verbose) : 
  there is no package called ‘quanteda.textmodels’

Resolved by installing quanteda.textmodels:

> install.packages('quanteda.textmodels')
trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/quanteda.textmodels_0.9.1.tgz'
Content type 'application/x-gzip' length 5867161 bytes (5.6 MB)
==================================================
downloaded 5.6 MB


The downloaded binary packages are in
	/var/folders/w5/p8sm469n19n_gtsshmv6_lym0000gn/T//RtmpDiSmbN/downloaded_packages
> data(data_corpus_irishbudget2010, package = "quanteda.textmodels")

Session info is below:

> sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 
 
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] spacyr_1.2.1        sophistication_0.70

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6        lattice_0.20-41     prettyunits_1.1.1   ps_1.3.3           
 [5] gtools_3.8.2        assertthat_0.2.1    rprojroot_1.3-2     digest_0.6.25      
 [9] R6_2.4.1            BradleyTerry2_1.1-2 plyr_1.8.6          backports_1.1.8    
[13] quanteda_2.0.1      ggplot2_3.3.2       pillar_1.4.4        qvcalc_1.0.2       
[17] rlang_0.4.6         curl_4.3            rstudioapi_0.11     minqa_1.2.4        
[21] data.table_1.12.8   callr_3.4.3         nloptr_1.2.2.1      Matrix_1.2-18      
[25] reticulate_1.16     desc_1.2.0          devtools_2.3.0      splines_4.0.0      
[29] lme4_1.1-23         statmod_1.4.34      stringr_1.4.0       munsell_0.5.0      
[33] compiler_4.0.0      pkgconfig_2.0.3     pkgbuild_1.0.8      tidyselect_1.1.0   
[37] tibble_3.0.1        brglm_0.6.2         fansi_0.4.1         crayon_1.3.4       
[41] dplyr_1.0.0         withr_2.2.0         MASS_7.3-51.5       grid_4.0.0         
[45] jsonlite_1.7.0      xtable_1.8-4        nlme_3.1-147        gtable_0.3.0       
[49] lifecycle_0.2.0     magrittr_1.5        scales_1.1.1        RcppParallel_5.0.2 
[53] cli_2.0.2           stringi_1.4.6       reshape2_1.4.4      fs_1.4.1           
[57] remotes_2.1.1       testthat_2.3.2      ellipsis_0.3.1      stopwords_2.0      
[61] vctrs_0.3.1         generics_0.0.2      boot_1.3-24         fastmatch_1.1-0    
[65] tools_4.0.0         glue_1.4.1          purrr_0.3.4         processx_3.4.2     
[69] profileModel_0.6.0  pkgload_1.1.0       colorspace_1.4-1    sessioninfo_1.1.1  
[73] memoise_1.1.0       usethis_1.6.1   

Bug in covars_make()

The tests below work as expected until we get to pr_noun:

test_that("pr_noun computed the same in predict v component function", {
    txt <- c(test1 = "One two cat.  One two cat.  Always eat apples.")
    frompredict <- as.data.frame(sophistication:::get_covars_from_newdata.character(txt))

    # should be: (1 + 1 + 1) / 9 = 0.33333
    # doc_id sentence_id token_id  token  lemma   pos     entity
    # 1   test1           1        1    One    one   NUM CARDINAL_B
    # 2   test1           1        2    two    two   NUM CARDINAL_B
    # 3   test1           1        3    cat    cat  NOUN           
    # 4   test1           1        4      .      . PUNCT           
    # 5   test1           1        5               SPACE           
    # 6   test1           2        1    One    one   NUM CARDINAL_B
    # 7   test1           2        2    two    two   NUM CARDINAL_B
    # 8   test1           2        3    cat    cat  NOUN           
    # 9   test1           2        4      .      . PUNCT           
    # 10  test1           2        5               SPACE           
    # 11  test1           3        1 Always always   ADV           
    # 12  test1           3        2    eat    eat  VERB           
    # 13  test1           3        3 apples  apple  NOUN           
    # 14  test1           3        4      .      . PUNCT  
    
    expect_equal(covars_make_pos(txt)[, c("pr_noun")], 0.333, tol = .001)
    expect_equal(frompredict[, "pr_noun"], 0.333, tol = .001)    # 0.214
})

# Error: covars_make_pos(txt)[, c("pr_noun")] not equal to 0.333.
# 1/1 mismatches
# [1] 0.214 - 0.333 == -0.119

What is happening is that covars_make_pos() uses the total token count, including SPACE and PUNCT tokens, as the denominator. In get_covars_from_newdata() (used by predict_readability()) this is computed correctly. In the example above, the difference is a denominator of 9 versus 14.
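
A minimal sketch of the correct computation, independent of the package internals (assumes spacyr and a working spaCy installation):

library(spacyr)
spacy_initialize()

txt <- c(test1 = "One two cat.  One two cat.  Always eat apples.")
parsed <- spacy_parse(txt, lemma = FALSE, entity = FALSE)

# drop non-word tokens before computing the proportion of nouns
words <- parsed[!parsed$pos %in% c("PUNCT", "SPACE"), ]
sum(words$pos == "NOUN") / nrow(words)   # 3 / 9 = 0.333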

Overall Structure Visualization

For the documentation, it could be helpful to have an overview of the package's structure in addition to the "How to use" portion of the README.

It looks like there are a couple of workstreams whose functions go together (e.g., regular vs. gold). A simple tree visualization showing which functions belong together, and where the user needs to make choices in the workflow, could be nice.

For example:

R
│  - data()
│  - snippets_make()
│  - snippets_clean()
│
└─── pairs_make_x()
     │  - pairs_regular_make()
     │  - pairs_gold_make()

Warning message in bootstrap_readability

I just reinstalled the latest version and ran the code below, which produced the following warning messages. Things appear to be working apart from the warnings.

library(sophistication)
library(quanteda)

## compute readability and bootstrap the sentences
set.seed(20170308)
results <- bootstrap_readability(data_corpus_SOTU, n = 2, measure = "Flesch", verbose = TRUE)

Bootstrapping readability statistics for 230 documents:
...computing values from original texts
Argument drop not used. ...segmenting the texts into sentences for resampling
...computing values from bootstrapped replicates
1 2
...computing summary statistics
argument is not numeric or logical: returning NA

Dynamic Model not accepting year argument

The results are the same regardless of how the year argument is specified.

Example:

## plot results for SOTU data -- figures 2 and 3
library(quanteda.corpora)
library(sophistication)
library(dplyr)
library(quanteda)
library(stringr)
library(data.table)
library(ggplot2)

setwd("C:/Users/kevin/Documents/GitHub/sophistication-papers/")

## load BT model
load("analysis_article/AJPS_replication/data/fitted_BT_model.Rdata")

## calculate year of speeches
year <- lubridate::year(docvars(data_corpus_sotu, "Date"))

## generate continuous scores for the SOTU model -- choose dynamic or static

### static
results_static <- predict_readability(BT_best, newdata = data_corpus_sotu,
                                      bootstrap_n = 10, verbose = TRUE,
                                      baseline_year = 2000)
static <- results_static$prob

### dynamic
results_dynamic <- predict_readability(BT_best, newdata = data_corpus_sotu,
                                       bootstrap_n = 10, verbose = TRUE,
                                       baseline_year = year)
dynamic <- results_dynamic$prob

# identical results indicate the year argument has no effect
static - dynamic

a problem for short texts

Hi,

Thanks for making and maintaining your important package. I've encountered the following error when I applied your package to some short texts.

Error in `.rowNamesDF<-`(x, value = value) : invalid 'row.names' length
Calls: %>% ... row.names<- -> row.names<-.data.frame -> .rowNamesDF<-

I found that the issue is caused by the removal of punctuation in the following code in the get_covars_new.corpus() function:

    # remove punctuation
    result <- result[pos != "PUNCT" & pos != "SPACE"]

Basically, in some cases this removes documents that contain no meaningful text, so there is a mismatch between the final output and the input texts. I think it would be better to assign NA to these cases instead of removing them, or at the very least to make the function work for short texts.

Thanks again for making this important package.

get_covars_new.corpus <- function(x, baseline_year = 2000, verbose = FALSE) {
    google_min <- pos <- `:=` <- nchars <- token <- sentence_id <- years <- NULL
    doc_id <- .N <- NULL

    if (verbose) message("   ...tagging parts of speech")
    suppressMessages(
        spacyr::spacy_initialize()
    )
    result <- data.table(spacyr::spacy_parse(texts(x), tag = FALSE, lemma = FALSE, entity = FALSE, dependency = FALSE))
    # remove punctuation
    result <- result[pos != "PUNCT" & pos != "SPACE"]
    
    # if years is a vector, repeat for each token
    if (length(baseline_year) > 1)
        baseline_year <- rep(baseline_year,
                             result[, list(years = length(sentence_id)), by = doc_id][, years])

    if (verbose) message("   ...computing word lengths in characters")
    result[, nchars := stringi::stri_length(token)]

    if (verbose) message("   ...computing baselines from Google frequencies")
    bl_google <- 
        suppressWarnings(make_baselines_google(result$token, baseline_word = "the",
                                               baseline_year = baseline_year)[, 2])
    result[, google_min := bl_google]

    if (verbose) message("   ...aggregating to sentence level")
    result[,
           list(doc_id = doc_id[1],
                n_noun = sum(pos == "NOUN", na.rm = TRUE),
                n_chars = sum(nchars, na.rm = TRUE),
                google_min = min(google_min, na.rm = TRUE),
                n_token = .N),
           by = c("sentence_id", "doc_id")]
}
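
A minimal sketch of the NA-padding idea suggested above, to be inserted after the PUNCT/SPACE filter (names and structure illustrative; x is the input corpus):

# keep a placeholder row for documents that became empty after the
# filter, so output rows stay aligned with input documents
empty_docs <- setdiff(quanteda::docnames(x), unique(result$doc_id))
if (length(empty_docs) > 0) {
    result <- rbind(result,
                    data.table::data.table(doc_id = empty_docs),
                    fill = TRUE)
}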

Quanteda v2.0 and snippets_make()

I believe I have found an incompatibility between the snippets_make() function and quanteda versions 2.0.0 and later. With such a version installed, I get the following error when I call snippets_make():

> snippets_make(data_corpus_fifthgrade)
Error in select_docvars(attr(x, "docvars"), field, user = TRUE, system = FALSE,  : 
  field(s) docname not found

Here is the session info:

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.5

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] quanteda_2.1.1      sophistication_0.70

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5          lattice_0.20-41     prettyunits_1.1.1  
 [4] ps_1.3.3            gtools_3.8.2        assertthat_0.2.1   
 [7] rprojroot_1.3-2     digest_0.6.25       R6_2.4.1           
[10] BradleyTerry2_1.1-2 plyr_1.8.6          backports_1.1.8    
[13] ggplot2_3.3.2       pillar_1.4.6        qvcalc_1.0.2       
[16] rlang_0.4.7         curl_4.3            rstudioapi_0.11    
[19] minqa_1.2.4         data.table_1.13.0   callr_3.4.3        
[22] nloptr_1.2.2.2      Matrix_1.2-18       reticulate_1.16    
[25] desc_1.2.0          devtools_2.3.1      splines_4.0.2      
[28] lme4_1.1-23         statmod_1.4.34      stringr_1.4.0      
[31] munsell_0.5.0       compiler_4.0.2      spacyr_1.2.1       
[34] pkgconfig_2.0.3     pkgbuild_1.1.0      tibble_3.0.3       
[37] brglm_0.6.2         fansi_0.4.1         crayon_1.3.4       
[40] withr_2.2.0         MASS_7.3-51.6       grid_4.0.2         
[43] nlme_3.1-148        jsonlite_1.7.0      xtable_1.8-4       
[46] gtable_0.3.0        lifecycle_0.2.0     magrittr_1.5       
[49] scales_1.1.1        RcppParallel_5.0.2  cli_2.0.2          
[52] stringi_1.4.6       reshape2_1.4.4      fs_1.5.0           
[55] remotes_2.2.0       testthat_2.3.2      ellipsis_0.3.1     
[58] stopwords_2.0       vctrs_0.3.2         boot_1.3-25        
[61] fastmatch_1.1-0     tools_4.0.2         glue_1.4.1         
[64] processx_3.4.3      profileModel_0.6.0  pkgload_1.1.0      
[67] yaml_2.2.1          colorspace_1.4-1    sessioninfo_1.1.1  
[70] memoise_1.1.0       usethis_1.6.1  

When I downloaded earlier quanteda versions (I tried with 1.0.0, 1.1.0, 1.4.0, 1.4.1), snippets_make() worked as expected.

Using quanteda version 1.5.0 also worked, but gave the following warning message:

Warning message:
'[[<-.corpus' is deprecated.
Use 'docvars' instead.
See help("Deprecated")

After looking through the changes associated with the release of quanteda v2.0, I believe the issue lies in the changes made to index operators for core objects, in particular the changes outlined here: https://github.com/quanteda/quanteda/wiki/indexing_core_objects. However, I am not entirely confident that this is the issue.
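
If that diagnosis is right, the fix inside snippets_make() would be along these lines (a sketch only, not the package's actual code; corp stands for the corpus being processed):

# pre-v2 index assignment, now removed:
#   corp[["docname"]] <- docnames(corp)
# v2-compatible equivalent:
quanteda::docvars(corp, "docname") <- quanteda::docnames(corp)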

Computer crashed when computing covars

On a Windows 10 machine, when I tried to run

require(sophistication)
require(spacyr)
spacy_initialize()

# SOTU addresses
data(data_corpus_sotu, package = "quanteda.corpora")

cbind(covars_make(data_corpus_sotu),
      covars_make_pos(data_corpus_sotu),
      covars_make_baselines(data_corpus_sotu,
                            baseline_year = lubridate::year(docvars(data_corpus_sotu, "Date"))))


My computer froze up completely (to the point where I had to restart it manually), and upon rebooting I saw an error message:

[screenshot of the R error message]

However, when I split the call up as follows:

x1 <- covars_make(data_corpus_sotu)
x2 <- covars_make_pos(data_corpus_sotu)
x3 <- covars_make_baselines(data_corpus_sotu,
                            baseline_year = lubridate::year(docvars(data_corpus_sotu, "Date")))
sotu_covars <- cbind(x1, x2, x3)

everything worked fine. I'm not sure whether this is just a memory issue on my machine, but I thought I'd report it.

sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] spacyr_0.9.9 sophistication_0.65 quanteda_1.2.0

loaded via a namespace (and not attached):
[1] Rcpp_0.12.16 pillar_1.2.2 compiler_3.5.0 nloptr_1.0.4 plyr_1.8.4
[6] tools_3.5.0 stopwords_0.9.0 lme4_1.1-17 jsonlite_1.5 lubridate_1.7.4
[11] tibble_1.4.2 gtable_0.2.0 nlme_3.1-137 lattice_0.20-35 rlang_0.2.0
[16] Matrix_1.2-14 fastmatch_1.1-0 brglm_0.6.1 BradleyTerry2_1.0-8 stringr_1.3.1
[21] gtools_3.5.0 qvcalc_0.9-1 grid_3.5.0 reticulate_1.7 data.table_1.11.2
[26] profileModel_0.5-9 minqa_1.2.4 ggplot2_2.2.1 reshape2_1.4.3 magrittr_1.5
[31] scales_0.5.0 splines_3.5.0 MASS_7.3-49 colorspace_1.3-2 xtable_1.8-2
[36] stringi_1.1.7 RcppParallel_4.4.0 lazyeval_0.2.1 munsell_0.4.3
