AI research work.

License: MIT License

Makefile 23.80% Python 74.82% Shell 0.29% JavaScript 0.70% CSS 0.40%

ai_research's Introduction

The Privatization of AI Research: Causes and Consequences

We analyze the causes and consequences of the ongoing privatization of AI research, particularly the transition of AI researchers from academia to industry. This is a collaboration between Aalborg University and Nesta.

How to use the data / repo

Clone the repository with

git clone https://github.com/nestauk/ai_research

Run cd ai_research to change your working directory to the project's repo, then run make create_environment. This creates a new Anaconda environment and installs all the project dependencies.
Run conda activate ai_research to activate the newly created Anaconda environment.
In a Jupyter notebook, you can read a data table as follows:

import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from ai_research.mag.mag_orm import FieldOfStudy

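# Create an engine and a session from the database URL (replace the placeholders).
# Recent SQLAlchemy versions expect the "postgresql" dialect name, not "postgres".
db_config = 'postgresql+psycopg2://USER:PASSWORD@HOST:5432/DBNAME'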
engine = create_engine(db_config)
Session = sessionmaker(engine)
s = Session()

fields_of_study = pd.read_sql(s.query(FieldOfStudy).statement, s.bind)

Note: mag_orm.py contains the SQLAlchemy mappings (ORMs) for the database tables. The example above imports the FieldOfStudy ORM, which corresponds to the mag_fields_of_study table in the database. You have to import the ORM for each table you want to read!
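The snippet above hard-codes the database URL. A minimal sketch of building it from a configuration file instead is shown below. The section name `postgresdb`, the option names and the helper `build_db_url` are assumptions made for illustration; they are not taken from the repository.

```python
import configparser
from urllib.parse import quote_plus


def build_db_url(config_text, section="postgresdb"):
    """Build a PostgreSQL URL for create_engine from INI-formatted text.

    The section and option names are hypothetical, not the project's own.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    cfg = parser[section]
    # Quote the credentials so characters such as '@' or '/' do not break the URL.
    user = quote_plus(cfg["user"])
    password = quote_plus(cfg["password"])
    port = cfg.get("port", fallback="5432")
    return f"postgresql+psycopg2://{user}:{password}@{cfg['host']}:{port}/{cfg['dbname']}"


# Placeholder values only; replace them with your own credentials.
example_config = """
[postgresdb]
user = USER
password = p@ss/word
host = HOST
dbname = DBNAME
"""

db_config = build_db_url(example_config)
```

The resulting `db_config` string can be passed to `create_engine` in the same way as the hard-coded URL.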

Data

Sources:

Data decisions

Timeframe: 2000-2020
We collect MAG papers containing one of the following Fields of Study:
- deep learning
- machine learning
- reinforcement learning
We keep documents with and without a DOI.
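The data decisions above can be sketched as a simple filter. The record layout (`year`, `doi`, `fields_of_study`) and the function name `keep_paper` are assumptions for illustration and do not reflect the actual table schema; the DOI value is a placeholder.

```python
TARGET_FIELDS = {"deep learning", "machine learning", "reinforcement learning"}
FIRST_YEAR, LAST_YEAR = 2000, 2020


def keep_paper(paper):
    """Return True if a paper record meets the data decisions listed above."""
    in_timeframe = FIRST_YEAR <= paper["year"] <= LAST_YEAR
    fields = {name.lower() for name in paper["fields_of_study"]}
    # The DOI is deliberately not checked: papers are kept with or without one.
    return in_timeframe and bool(fields & TARGET_FIELDS)


# Hypothetical records, not real data.
papers = [
    {"id": 1, "year": 2015, "doi": None, "fields_of_study": ["Deep learning"]},
    {"id": 2, "year": 1998, "doi": "10.0000/example", "fields_of_study": ["Machine learning"]},
    {"id": 3, "year": 2019, "doi": "10.0000/example", "fields_of_study": ["Biology"]},
]
selected = [p["id"] for p in papers if keep_paper(p)]
```

Only the first record passes: the second falls outside the timeframe and the third has none of the three Fields of Study.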

Project based on the Nesta cookiecutter data science project template.

ai_research's Issues

Descriptive analysis

Includes statistics about:

Level of switching between academia and industry
Switching patterns (building on @RJuro's analysis)
Evolution of switching behaviours by year
Geography of switching
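As a rough illustration of how the level of switching could be measured, the sketch below counts transitions in one author's sequence of affiliation types. The labels `academia` and `industry` and the function name `count_switches` are assumptions for this example, not the project's implementation.

```python
from collections import Counter


def count_switches(affiliation_types):
    """Count transitions between consecutive, differing affiliation types."""
    switches = Counter()
    for previous, current in zip(affiliation_types, affiliation_types[1:]):
        if previous != current:
            switches[(previous, current)] += 1
    return switches


# Hypothetical author history, ordered by publication year.
history = ["academia", "academia", "industry", "industry", "academia"]
switches = count_switches(history)
```

Aggregating these counts by year or by country would give the yearly and geographic breakdowns listed above.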

EDA of drivers of switching behaviour

Which of the drivers identified in #18 are more strongly associated with switching?

Develop a hypothesis to test formally in #20

Disambiguate author and institution names

MAG contains duplicated author names and possibly duplicated institution names. Microsoft has done an incredible job on entity disambiguation, but it is not a solved problem.

Given time constraints and in the spirit of being more agile, I would recommend working on this at a later stage and only if we notice that it greatly distorts the results.

This covers Munging and shaping:[Disambiguate people and institutions] from Roman's Trello board.

Write up non-technical parts of paper

Includes:

Introduction
Implications
Conclusions

Topic modelling of AI abstracts

Semantic analysis of AI abstracts. This generates a topic mix for each paper and a topic map (based on, e.g., topic co-occurrence within papers) that we can use to identify links between regions of AI research and switching outcomes.

Text pre-processing
Train topic model
Topic map

Are we training one topic model on the whole corpus, one on the corpus before a cut-off year, or a separate topic model for each year?
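One possible way to derive a topic map from topic mixes is to count how often two topics are both prominent in the same paper. The sketch below assumes each paper's topic mix is a dictionary of topic weights; the topic names, the threshold of 0.2 and the function name `topic_cooccurrence` are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def topic_cooccurrence(topic_mixes, threshold=0.2):
    """Count how often two topics are both prominent in the same paper."""
    edges = Counter()
    for mix in topic_mixes:
        # Keep topics whose weight reaches the threshold, in a stable order.
        prominent = sorted(t for t, w in mix.items() if w >= threshold)
        for pair in combinations(prominent, 2):
            edges[pair] += 1
    return edges


# Hypothetical topic mixes for three papers.
mixes = [
    {"vision": 0.6, "optimisation": 0.3, "robotics": 0.1},
    {"vision": 0.5, "optimisation": 0.5},
    {"robotics": 0.7, "optimisation": 0.3},
]
edges = topic_cooccurrence(mixes)
```

The resulting counts can serve as edge weights in a topic network, whichever topic model produces the mixes.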

Scale up semantic analysis

Scale up prototype semantic analysis using more advanced methods

EDA

Install packages for visualisation with altair
Script to get relevant datasets
EDA

Predict the affiliation type

Predict if an affiliation is non-profit / university or corporation.

I'm currently doing this using a hand-crafted set of words, but we could possibly do better. @RJuro mentioned that a student had created a classifier for this, so let's see if we can use it!

Note: This covers Munging and shaping:[Predict if uni or industry] from @RJuro's Trello board.
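A keyword-based approach of the kind described above could look like the following sketch. The keyword lists are hypothetical (the hand-crafted set used in the project is not shown in this issue), and `affiliation_type` is an illustrative name.

```python
import re

# Hypothetical keyword lists; the project's hand-crafted set is not shown here.
ACADEMIC_KEYWORDS = {"university", "college", "institute", "school", "academy"}
CORPORATE_KEYWORDS = {"inc", "ltd", "llc", "corp", "corporation", "gmbh", "labs"}


def affiliation_type(name):
    """Label an affiliation name using simple keyword matching."""
    tokens = set(re.findall(r"[a-z]+", name.lower()))
    if tokens & ACADEMIC_KEYWORDS:
        return "non-profit / university"
    if tokens & CORPORATE_KEYWORDS:
        return "corporation"
    # No keyword matched: leave the affiliation for manual review or a classifier.
    return "unknown"


labels = [affiliation_type(n) for n in ["Aalborg University", "Example Robotics Ltd.", "Nesta"]]
```

Names that match no keyword come back as `unknown`, which is where a trained classifier would be expected to help most.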

Refactor the data collection and enrichment

The existing scripts are a stripped-down version of my work here. I will update them to reflect the improvements I have made over the past few months. This is not a high-priority task because we already have fresh data, but it should be done so that our work can be executed end-to-end.

Setup slide deck

The setup slide deck is a Google Slides deck linked from the repo. It will include:

List of literature streams relevant for the project
List of research questions
List of methods

This issue covers creating the slide deck and adding initial content. All team members will then populate it.

We could also include emerging findings here.

Model switching behaviours

Formal model of drivers of switching behaviour

Mini-GitHub tutorial

This is to introduce the git workflow (creating issues, branches and pull requests). Would you be happy to convene it, @kstathou? I would attend too.

Exploratory data analysis

EDA of the enriched MAG data. Some goals:

1. Spot and report data collection bugs so that we can fix them.
2. Identify data gaps (and bonus points for proposing what else needs to be collected!).
3. Understand what's in the data and which of the research questions we have discussed can be tackled.

For 1. and 2. especially, we should ideally create new GitHub issues to discuss them.

Devise a data management plan

This covers Extraction:[Data management plan, Joint server access] from @RJuro's Trello board.

@RJuro, @daniel-hain, let us know about your university's guidelines and whether there is anything we should do on our side.

Geocode MAG affiliations

Geocode MAG affiliations using the Google Places API.
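The issue does not say which Places endpoint is used. The sketch below assumes the Find Place from Text endpoint and only builds the request URL and parses a response of the documented shape; it makes no network call. The function names, the `YOUR_API_KEY` placeholder and the sample coordinates are illustrative assumptions.

```python
from urllib.parse import urlencode

# Assumed endpoint: Places API "Find Place from Text".
FIND_PLACE_URL = "https://maps.googleapis.com/maps/api/place/findplacefromtext/json"


def build_find_place_url(affiliation_name, api_key):
    """Build a Find Place request URL for an affiliation name."""
    params = {
        "input": affiliation_name,
        "inputtype": "textquery",
        "fields": "name,geometry",
        "key": api_key,
    }
    return f"{FIND_PLACE_URL}?{urlencode(params)}"


def extract_coordinates(response):
    """Return (lat, lng) of the first candidate, or None if there is none."""
    candidates = response.get("candidates", [])
    if not candidates:
        return None
    location = candidates[0]["geometry"]["location"]
    return location["lat"], location["lng"]


url = build_find_place_url("Aalborg University", "YOUR_API_KEY")

# Dummy response with made-up coordinates, shaped like a Find Place result.
sample = {"candidates": [{"name": "Example", "geometry": {"location": {"lat": 1.5, "lng": 2.5}}}], "status": "OK"}
coordinates = extract_coordinates(sample)
```

Affiliations with no candidates return `None`, so they can be collected and reviewed separately.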

Collect and parse MAG data

Collect and parse MAG papers from 2000 to 2020 that contain one of the following fields of study: machine learning, reinforcement learning, deep learning.

Create draft timeline for project

Update the Microsoft Academic Graph data

I updated the data last week. Accounting for MAG's lag in data collection, the database should have papers published up to early April. It's straightforward to schedule bi-weekly updates, but we can work with this set for now and update it again in the summer.

Some tasks:

Update the data schema. I shared this one a few months ago, but in the meantime I have expanded the scope of data collection and improved the data processing pipeline.
Document the data collection decisions.
Add a short SQLAlchemy recipe on how to load the tables in Pandas.

Note: This covers Extraction:[Update, Get data into SQL, Geocoding] from @RJuro's Trello board.

Generate predictors of switching behaviour

These will include things such as:

Researcher interdisciplinarity (based on topic mixes)
Researcher specialisation in particular (hot?) topics
- This may require additional analysis to characterise hot topics
Researcher track record
Researcher social capital (networks)
...
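As one possible operationalisation of researcher interdisciplinarity based on topic mixes, the sketch below uses the Shannon entropy of a researcher's topic weights. The choice of entropy is an assumption for illustration (the issue does not specify a measure), and `topic_entropy` is a hypothetical name.

```python
import math


def topic_entropy(topic_mix):
    """Shannon entropy (in nats) of a topic mix; higher means more spread out."""
    total = sum(topic_mix.values())
    # Normalise the weights to shares and skip topics with zero weight.
    shares = [w / total for w in topic_mix.values() if w > 0]
    return sum(-p * math.log(p) for p in shares)


# Hypothetical researchers: one specialised, one evenly split across two topics.
specialist = topic_entropy({"vision": 1.0})
generalist = topic_entropy({"vision": 0.5, "robotics": 0.5})
```

A researcher who works in a single topic scores zero, and the score grows as their work spreads more evenly across topics.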
