
titlebot's Introduction

Titlebot

I string words together from the titles of scientific papers using Markov chains. Each word is sampled based on the probability that it follows the preceding word (i.e. I am a bigram model).
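
For illustration, here is a minimal sketch of that idea. The followers table and generate() function below are invented for this example; the package's own load_bigram() and generate_title() appear in the examples further down.

# "followers" maps each word to every word observed after it in the training
# titles; sampling from these vectors reproduces the bigram frequencies.
followers = list(
  "START"    = c("deep", "random", "random"),
  "deep"     = c("forests"),
  "random"   = c("forests", "networks"),
  "forests"  = c("END"),
  "networks" = c("END")
)

generate = function(followers) {
  word = "START"
  title = character(0)
  repeat {
    word = sample(followers[[word]], 1)  # next word, weighted by frequency
    if (word == "END") break
    title = c(title, word)
  }
  paste(title, collapse = " ")
}

generate(followers)  # e.g. "random forests" or "deep forests"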

So far, I tweet about three kinds of titles: machine learning papers from arXiv, ecology papers from PLOS journals, and articles from the Answers Research Journal.

Additionally, @noamross thought it would be funny to create @HarrisBot, which tweets about whatever @davidjayharris tweets about. This repository contains a model based on @kara_woo's tweets as well.

In general, the machine learning titles are harder to distinguish from real titles, but the ecology titles can be much funnier (see below). Real "creation science" is, of course, indistinguishable from Markov chain output.

Praise for Titlebot:

Examples

Machine learning:

ML_bigram = load_bigram("data/StatMLTitles")
replicate(5, generate_title(bigram = ML_bigram))
## [1] "structured signal processing with missing data"                             
## [2] "determining full conditional sparse gradients with distributional estimates"
## [3] "a new york workshop on grouse and their contextual bandits"                 
## [4] "randomized kaczmarz algorithm and response data"                            
## [5] "learning with applications to colombian conflict analysis"

Ecology:

ecology_bigram = load_bigram("data/plos_ecology")
replicate(5, generate_title(bigram = ecology_bigram))
## [1] "nowhere to predict the composition on population persistence of smart urban environment"                     
## [2] "two constraints are not infection"                                                                           
## [3] "randomization modeling of the short-lived annual forb dominated forests"                                     
## [4] "climate change in the himalaya: water and indigenous burning or increase with the high-throughput sequencing"
## [5] "radiographs reveal unexpected fine-scale analysis of biodiversity"

Answers Research Journal:

answers_bigram = load_bigram("data/Answers_Research_Journal")
replicate(5, generate_title(bigram = answers_bigram))
## [1] "numerical simulation of peer review of any kind exist before the dodwell hypothesis" 
## [2] "adam, free choice, and unification theory for studies"                               
## [3] "numerical simulations of retroviruses"                                               
## [4] "numerical simulation of precipitation in yellowstone national park with a warm ocean"
## [5] "more abundant than stars"

davidjayharris

harris_bigram = load_bigram("data/davidjayharris")
replicate(5, generate_title(bigram = harris_bigram))
## [1] "@srsupp you meant to anything today:"                                                                 
## [2] "@johnmyleswhite does #rstats will take."                                                              
## [3] "@algaebarnacle @rstudioapp is there a typical to the word in #rstats' matrix multiplies"              
## [4] "apple could still valuable. we live in daphnia magna. delightful work, documentation, popularity...)."
## [5] "@kara_woo @algaebarnacle"

kara_woo

woo_bigram = load_bigram("data/kara_woo")
replicate(5, generate_title(bigram = woo_bigram))
## [1] "@alexhanna less of trying to get any reason i get a long, multi-state road trip to was going up on an unrelated note, i'm going to shame."
## [2] "@polesasunder talk on reaching quadruple-digit tweets."                                                                                   
## [3] "@queerscientist oh but no sticker to recruit me a lovelier day of negging *shudder*"                                                      
## [4] "@ansonmackay definitely should!"                                                                                                          
## [5] "@bashir9ist @markcc @rachelapaul @dr24hours @mbeisen not for an account in ca."

Licensing

The code is available under The Artistic License 2.0 (see LICENSE).

The machine learning titles in the "data" folder were scraped by Philippe (@PhDP) from ArXiv and are available under a Creative Commons Share Alike license (some of them are CC-BY).

The ecology titles were scraped from PLOS journals using rplos. These titles are all CC-BY.

The Answers titles are copyrighted by Answers In Genesis. Their inclusion and transformation is not an infringement of copyright in the United States, however, as they are covered by the fair use doctrine.

The HarrisBot data are @davidjayharris's tweets, minus retweets. These are hereby released as CC-BY.

Kara Woo's tweets are used with her permission.

titlebot's Issues

Make bigram creation more efficient

No reason to have so many indexing operations. Just figure out what should go in each spot and add it in once.

There may also be a more fundamental change to the algorithm that could speed things up even more.
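
A sketch of one such change, assuming the titles have already been flattened into a single character vector of tokens (called tokens here, with "END" markers separating titles), so the whole index is built in one vectorized pass:

# Pair every token with the one that follows it, all at once.
pairs = data.frame(
  from = tokens[-length(tokens)],
  to   = tokens[-1],
  stringsAsFactors = FALSE
)

# Drop pairs that span two titles: nothing should "follow" an END marker.
pairs = pairs[pairs$from != "END", ]

# A named list mapping each word to everything observed after it.
followers = split(pairs$to, pairs$from)

# Sampling the next word is then just sample(followers[[current_word]], 1).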

Shiny App?

Enter your Twitter handle and automatically create a bot! The main thing to do would be to write a function to retrieve as many tweets as possible, given that you can only get 200 with the standard twitteR functions. greptweet just runs repeated requests with a 1-second delay between them. It should be easy to port this script to R:

https://github.com/kaihendry/greptweet/blob/master/fetch-tweets.sh
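
A rough sketch of what that port might look like, assuming OAuth has already been set up with setup_twitter_oauth() and that userTimeline() accepts a maxID argument; the function name is hypothetical:

library(twitteR)

# Page backwards through a user's timeline, 200 tweets per request, with a
# short pause between requests as greptweet does.
fetch_all_tweets = function(handle, pause = 1) {
  tweets = userTimeline(handle, n = 200, includeRts = FALSE)
  repeat {
    if (length(tweets) == 0) break
    Sys.sleep(pause)
    oldest = tweets[[length(tweets)]]$id
    batch = userTimeline(handle, n = 200, maxID = oldest, includeRts = FALSE)
    # maxID is inclusive, so drop the tweet we already have; stop once
    # nothing new comes back.
    batch = Filter(function(x) x$id != oldest, batch)
    if (length(batch) == 0) break
    tweets = c(tweets, batch)
  }
  sapply(tweets, function(x) x$text)
}

woo_tweets = fetch_all_tweets("kara_woo")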

Heuristics

[Gonna start dropping a few ideas in as issues. Let me know if what I'm thinking isn't your cuppa tea, and I'll just do this on my fork.]

I was thinking of implementing some heuristics.

  • Treating certain punctuation marks (.,:;) as words: capturing them with a regex and then removing the extra spaces at the end (see the sketch after this list).
  • Stripping parentheses, or some smarter way of handling them.
  • Titleization/Capitalization?

In addition, I had a couple of thoughts on Twitter-specific heuristics:

  • Smart handling of @NAMEs. If all previous words in the title are @NAMEs, then the next word could be sampled from words following ANY of those names.
  • Treating periods at the end of a tweet as equivalent to END. Probably can be done by just stripping final periods.
  • Treating "&", "+", and "and" as the same word, though this weakens the connection with the style of the training set.

It may then make sense to give bigrams built from Twitter data a "twitter" class, and possibly an attribute for the text category ("title", "twitter", etc.).
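
A rough sketch of the punctuation heuristic; the function names here are hypothetical, not part of the current code:

# Pad .,:; with spaces so each mark becomes its own "word" when tokenizing...
tokenize_title = function(x) {
  x = gsub("([.,:;])", " \\1 ", x)
  strsplit(trimws(x), "\\s+")[[1]]
}

# ...and glue the marks back onto the preceding word when a title is rebuilt.
detokenize_title = function(words) {
  gsub(" ([.,:;])", "\\1", paste(words, collapse = " "))
}

detokenize_title(tokenize_title("climate change in the himalaya: water"))
## [1] "climate change in the himalaya: water"

# The category idea from the last paragraph could be as simple as an
# attribute on the bigram, e.g. attr(bigram, "category") = "twitter".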

Keep bigram as object?

Now that there aren't explicit transition matrices in the bigram model, why not just return the model as an object rather than saving it as plain-text files? The index and word list are by definition going to be smaller than the original data, and you have to load the original data into memory anyway.

If you're dealing with very big training sets, there will be a lot of other speed/memory issues to deal with, which probably won't be solved by having the model saved as plain text files.
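
A minimal illustration of the alternative being proposed, assuming the fitted bigram is an ordinary R object; the .rds path is just an example:

# Build the model once from the current plain-text data...
bigram = load_bigram("data/kara_woo")

# ...then keep it as a serialized R object instead of plain-text files.
saveRDS(bigram, "data/kara_woo_bigram.rds")

# Later sessions can skip the original data entirely.
bigram = readRDS("data/kara_woo_bigram.rds")
replicate(5, generate_title(bigram = bigram))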
