ai_research's Introduction

The Privatization of AI Research: Causes and Consequences

We analyze the causes and consequences of the ongoing privatization of AI research, particularly the transition of AI researchers from academia to industry. This is a collaboration between Aalborg University and Nesta.

How to use the data / repo

  • Clone the repository with

git clone https://github.com/nestauk/ai_research

  • Run cd ai_research to change your working directory to the project's repo, then run make create_environment. This creates a new Anaconda environment and installs all the project dependencies.
  • Run conda activate ai_research to activate the newly created Anaconda environment.
  • In a Jupyter notebook, you can do the following to read a data table:
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from ai_research.mag.mag_orm import FieldOfStudy

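# Build the connection string (replace the placeholders with your credentials) and create a session.
# Note: SQLAlchemy 1.4+ only accepts the "postgresql" dialect name, not "postgres".
db_config = 'postgresql+psycopg2://USER:PASSWORD@HOST:5432/DBNAME'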
engine = create_engine(db_config)
Session = sessionmaker(engine)
s = Session()

fields_of_study = pd.read_sql(s.query(FieldOfStudy).statement, s.bind)

Note: mag_orm.py contains the SQLAlchemy mappings (ORMs) used in the database. The example above imports the FieldOfStudy ORM, which corresponds to the mag_fields_of_study table in the database. You have to import the ORM for each table you want to read.

Data

Sources:

Data decisions

  • Timeframe: 2000-2020
  • We collect MAG papers containing one of the following Fields of Study:
    • deep learning
    • machine learning
    • reinforcement learning
  • We keep documents with and without a DOI
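
The data decisions above can be read as a simple inclusion rule. The following is a minimal sketch of that rule, not the project's collection code: the record layout (year, doi, fields_of_study keys), the keep_paper helper and the sample papers are all hypothetical and only illustrate the logic.

```python
# Hypothetical record layout: the real MAG tables may use different column names.
AI_FIELDS = {"deep learning", "machine learning", "reinforcement learning"}
START_YEAR, END_YEAR = 2000, 2020


def keep_paper(paper):
    """Return True if a paper falls inside the collection scope."""
    in_timeframe = START_YEAR <= paper["year"] <= END_YEAR
    # Case-insensitive match against the three Fields of Study.
    fields = {f.lower() for f in paper["fields_of_study"]}
    has_ai_field = bool(fields & AI_FIELDS)
    # The DOI is deliberately not checked: papers without one are kept.
    return in_timeframe and has_ai_field


# Made-up sample records, including a made-up DOI.
papers = [
    {"id": 1, "year": 2015, "doi": "10.1000/example", "fields_of_study": ["Deep learning"]},
    {"id": 2, "year": 1998, "doi": None, "fields_of_study": ["Machine learning"]},
    {"id": 3, "year": 2019, "doi": None, "fields_of_study": ["Reinforcement learning"]},
    {"id": 4, "year": 2010, "doi": None, "fields_of_study": ["Computer vision"]},
]
selected_ids = [p["id"] for p in papers if keep_paper(p)]
print(selected_ids)
```

Paper 2 is dropped because it falls outside the timeframe and paper 4 because it has none of the three fields; paper 3 is kept even though it has no DOI.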

Project based on the Nesta cookiecutter data science project template.

ai_research's People

Contributors

daniel-hain, kstathou, rjuro

ai_research's Issues

Descriptive analysis

Includes statistics about:

  • Level of switching between academia and industry
  • Switching patterns (building on @RJuro's analysis)
  • Evolution of switching behaviours by year
  • Geography of switching
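
One way to compute the first and third statistics is sketched below. It assumes each researcher has a list of (year, affiliation_type) records with affiliation_type being either "academia" or "industry"; that layout, the find_switches helper and the sample histories are illustrative assumptions, not the structure of the enriched MAG data.

```python
from collections import Counter


def find_switches(history):
    """Return (year, from_type, to_type) for each change of affiliation type.

    `history` is a list of (year, affiliation_type) tuples for one researcher.
    """
    ordered = sorted(history)  # chronological order
    switches = []
    for (_, prev_type), (year, curr_type) in zip(ordered, ordered[1:]):
        if curr_type != prev_type:
            switches.append((year, prev_type, curr_type))
    return switches


# Made-up affiliation histories keyed by a hypothetical researcher ID.
histories = {
    "r1": [(2012, "academia"), (2013, "academia"), (2014, "industry")],
    "r2": [(2012, "industry"), (2013, "industry")],
    "r3": [(2013, "academia"), (2014, "industry"), (2016, "academia")],
}

# Evolution of switching: count switches per (year, from_type, to_type).
switches_by_year = Counter()
for history in histories.values():
    for year, prev_type, curr_type in find_switches(history):
        switches_by_year[(year, prev_type, curr_type)] += 1

# Level of switching: share of researchers with at least one switch.
switchers = sum(1 for h in histories.values() if find_switches(h))
share_switching = switchers / len(histories)
```

This sketch treats a researcher as having one affiliation type per year. Researchers with dual affiliations in the same year would need an explicit rule before this counting makes sense.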

Topic modelling of AI abstracts

Semantic analysis of AI abstracts. This generates topic mixes by paper and a topic map (based on, e.g., topic mix co-occurrence in papers) that we can use to identify the links between regions of AI research and switching outcomes.

  • Text pre-processing
  • Train topic model
  • Topic map

Are we training one topic model on the whole corpus, one on the corpus before a cut-off year, or a separate topic model for each year?
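
As an illustration of the topic map idea, the sketch below counts how often two topics appear together in the same paper, which gives weighted edges for a topic co-occurrence network. It assumes a trained topic model has already produced a topic mix (topic to weight) per paper. The 0.2 presence threshold, the topic_cooccurrence helper and the sample mixes are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

THRESHOLD = 0.2  # assumed cut-off for a topic to count as present in a paper


def topic_cooccurrence(topic_mixes, threshold=THRESHOLD):
    """Count how often each pair of topics is present in the same paper."""
    pair_counts = Counter()
    for mix in topic_mixes.values():
        # Sort so that each pair is always stored in the same order.
        present = sorted(t for t, w in mix.items() if w >= threshold)
        pair_counts.update(combinations(present, 2))
    return pair_counts


# Made-up topic mixes keyed by a hypothetical paper ID.
topic_mixes = {
    "paper_a": {"topic_0": 0.6, "topic_1": 0.3, "topic_2": 0.1},
    "paper_b": {"topic_0": 0.5, "topic_1": 0.5, "topic_2": 0.0},
    "paper_c": {"topic_0": 0.1, "topic_1": 0.4, "topic_2": 0.5},
}
edges = topic_cooccurrence(topic_mixes)
for (topic_a, topic_b), weight in sorted(edges.items()):
    print(topic_a, topic_b, weight)
```

The resulting pair counts can be passed to a network library as edge weights; communities in that network would then correspond to regions of AI research.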

EDA

  • Install packages for visualisation with altair
  • Script to get relevant datasets
  • EDA

Refactor the data collection and enrichment

The existing scripts are a stripped-down version of my work here. I will update them to reflect the improvements I made in the past few months. This is not a high-priority task as we already have fresh data; however, it should be done so that our work can be executed end-to-end.

Setup slide deck

The setup slide deck is a Google Slides deck linked from the repo. It will include:

  • List of literature streams relevant for the project
  • List of research questions
  • List of methods

This issue covers creating the slide deck and adding initial content. The slide deck will be populated by all team members.

We could also include emerging findings here.

Mini-GitHub tutorial

This is to introduce the git workflow (creating issues, branches and Pull Requests). Would you be happy to convene, @kstathou? I would attend too.

Exploratory data analysis

EDA of the enriched MAG data. Some goals:

  1. Spot and report data collection bugs so that we can fix them.
  2. Identify data gaps (and bonus points for proposing what else needs to be collected!).
  3. Understand what's in the data and which of the research questions we have discussed can be tackled.

Especially for 1. and 2., we should ideally create new GitHub issues to discuss them.

Collect and parse MAG data

Collect and parse MAG papers from 2000 to 2020 that contain one of the following fields of study: machine learning, reinforcement learning, deep learning.

Update the Microsoft Academic Graph data

I updated the data last week and, accounting for MAG's lag in data collection, the database should have papers published up to early April. It's straightforward to schedule bi-weekly updates; however, we can work with this set for now and update it again in the summer.

Some tasks:

  • Update the data schema. I shared this one a few months ago, but in the meantime I expanded the scope of data collection and improved the data processing pipeline.
  • Document the data collection decisions.
  • Add a short SQLAlchemy recipe on how to load the tables in Pandas.

Note: This covers Extraction:[Update, Get data into SQL, Geocoding] from @RJuro's Trello board.

Generate predictors of switching behaviour

These will include things such as:

  • Researcher interdisciplinarity (based on topic mixes)
  • Researcher specialisation in particular (hot?) topics
    • This may require additional analysis to characterise hot topics
  • Researcher track record
  • Researcher social capital (networks)
    ...
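
For the interdisciplinarity predictor, one possible operationalisation is the Shannon entropy of a researcher's average topic mix: a researcher who always publishes on one topic scores 0, and the score grows as their work spreads evenly across topics. This is only a sketch of one candidate measure, not a decision the project has made. The function names and sample topic mixes are hypothetical.

```python
import math


def average_topic_mix(paper_mixes):
    """Average the topic weights over all of a researcher's papers."""
    n = len(paper_mixes)
    topics = set().union(*paper_mixes)  # every topic seen in any paper
    return {t: sum(m.get(t, 0.0) for m in paper_mixes) / n for t in topics}


def interdisciplinarity(paper_mixes):
    """Shannon entropy (in nats) of the researcher's average topic mix."""
    avg = average_topic_mix(paper_mixes)
    return -sum(w * math.log(w) for w in avg.values() if w > 0)


# Made-up topic mixes for two hypothetical researchers.
specialist = [{"t0": 1.0}, {"t0": 1.0}]
generalist = [{"t0": 0.5, "t1": 0.5}, {"t0": 0.5, "t1": 0.5}]

specialist_score = interdisciplinarity(specialist)
generalist_score = interdisciplinarity(generalist)
```

Entropy depends on the number of topics in the model, so scores are only comparable between researchers when they come from the same topic model.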
