
news-threads

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Introduction

The News Threads project analyzes news articles to find similarities between them and to trace news provenance across time.

Getting Started

Prerequisites

  • Python 3.6 or newer
  • Data in the input format described in the Data Format section below.
  • The input data path in news_threads.conf changed to point at your data.
  • The news_threads.conf file documents the config format; a hypothetical example of reading it follows this list.
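
The exact key names in news_threads.conf are defined by the file itself, so treat the following as a purely hypothetical sketch: it assumes an INI-style file with a "data" section and an "input_path" key, which may not match the real format.

    # Hypothetical sketch only: the authoritative key names are documented in
    # news_threads.conf itself. This assumes an INI-style file with a [data]
    # section and an input_path key, which may not match the real layout.
    import configparser

    config = configparser.ConfigParser()
    config.read("news_threads.conf")
    input_path = config["data"]["input_path"]  # hypothetical section/key names
    print(f"Input documents will be read from: {input_path}")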

Data Format

  • Input data is a single CSV named documents.csv containing news articles, located in the folder specified in news_threads.conf.
  • The expected columns are: (canonical_url, title, doc_id, domain, text, first_scrape_date). A loading sketch follows this list.
  • You may change the expected format by modifying the "build_minhash" method in dbscan.py.
  • Additional fields are ignored by default.
  • Field description:
    • canonical_url: URL of the news article
    • title: Title of the news article
    • doc_id: A unique ID for the article. This may be text or numeric.
    • domain: Domain on which the article was published (e.g., "apnews.com", "cnn.com").
    • text: Article text
    • first_scrape_date: Date the web crawler first downloaded the article.
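
As a quick sanity check of the input format (this is not the project's own loader, which lives in dbscan.py), the sketch below reads documents.csv and verifies that the expected columns are present; the file path is an assumption and should point at the folder configured in news_threads.conf.

    # Minimal sketch: confirm documents.csv has the expected columns.
    # The path "data/documents.csv" is an assumption; use the folder
    # configured in news_threads.conf.
    import csv

    EXPECTED = {"canonical_url", "title", "doc_id", "domain",
                "text", "first_scrape_date"}

    with open("data/documents.csv", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        found = {name.strip() for name in (reader.fieldnames or [])}
        missing = EXPECTED - found
        if missing:
            raise ValueError(f"documents.csv is missing columns: {sorted(missing)}")
        for row in reader:
            # Fields beyond EXPECTED are ignored by the pipeline by default.
            print(row["doc_id"], row["domain"], row["first_scrape_date"])
            break  # show only the first row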

Linux

  1. run "make init" from the root of the folder to build a Python virtual environment with all dependencies
  2. Type "source .env/bin/activate" to launch the virtual environment.
  3. Type "python -m newsthreads" to run the news threads pipeline

Windows

Note: These instructions were verified using Python 3.8 installed from the Microsoft Store.

  1. "python -m venv .env"
  2. ".env\Scripts\activate.bat"
  3. "pip3 install -r requirements.txt"
  4. Type "python -m newsthreads" to run the news threads pipeline

Output files

  • document_metadata.csv: "documentId, firstScrapedDate, title" for every document. This normalizes document metadata into a standard format that downstream code can read.
  • hierarchy_graph_edges.csv: This file intermingles two row formats. Its purpose is to determine the hierarchy of clusters by epsilon value, i.e., which clusters are parents of other clusters (a parsing sketch follows the Output files list).
    • Leaf rows: These rows contain documents that no longer fall in a cluster at a specific epsilon value. The format is: "previousClusterId_previousEpsilon, documentId, 1"
    • Non-leaf rows: Clusters in these rows are not leaves in the hierarchy. The format is: "previousClusterId_previousEpsilon, clusterId_epsilon, count"
      • previousClusterId: ID of the cluster that is the parent of the current cluster in the hierarchy.
      • previousEpsilon: Epsilon at which the previous cluster was generated.
      • clusterId: ID of the current cluster. This cluster is a subset of the parent cluster.
      • epsilon: Epsilon at which the current cluster was generated. This will be previousEpsilon - 1.
      • count: Weight of the link between previousClusterId and clusterId at their respective epsilons. Higher weight means more documents are in the child cluster.
  • Clusters (DBSCAN)
    • epsilon-cluster.csv: "epsilon, clusterCount" for each epsilon, giving the number of unique clusters at that epsilon.
    • clusters_%(epsilon)s.csv: "clusterId, document id" for all documents that are in clusters at the epsilon specified by the file name.
    • joined_cluster_labels.csv: "clusterid,docid,epsilon" for each document at every epsilon level at which the document has a cluster.
  • Sentences
    • fragment_summaries.csv: "docid,sent_id,sentnum" for every sentence, where sentnum is the zero-based index of the sentence within the document (0 is the first sentence, 1 is the second, and so on).
    • fragment_metadata.csv: "docid,date,domain" for every document referenced by fragment_summaries.csv. Date is the web crawler scrape date.
    • sentence_id_lookup.csv: "text, sentence_id" for every unique sentence.
  • News/Domain Graph outputs
    • domain_graph_sentence_count.csv: "Source,Target,Weight"
      • "Source" is a domain that has news articles.
      • "Target" is a domain that has news articles that were copied by the "Source" domain.
      • "Weight" is the number of sentences that "Source" copied from "Target".
    • sentence_origin.csv: "sent_id, domain" where the domain is the internet domain where the sentence was first seen by the crawler.
    • domain_graph.csv: "Source,Target,Weight"
      • "Source" is a domain that has news articles.
      • "Target" is a domain that has news articles that were copied by the "Source" domain.
      • "Weight" is the number of documents that "Source" copied from "Target".
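
Because hierarchy_graph_edges.csv intermingles leaf and non-leaf rows, a little parsing logic is needed to separate them. The sketch below is one possible reading, not the project's own tooling: it assumes the file has no header row and that the second field of a non-leaf row always has the "clusterId_epsilon" shape (text, an underscore, then a number). If your document IDs can take that same shape, use a different rule to tell the two row types apart.

    # Sketch: split hierarchy_graph_edges.csv into leaf and non-leaf edges.
    # Assumptions: no header row, and a non-leaf row's second field looks like
    # "clusterId_epsilon" (underscore-separated, ending in a number).
    import csv

    def looks_like_cluster_ref(field):
        # True for values shaped like "clusterId_epsilon".
        _, sep, tail = field.rpartition("_")
        return bool(sep) and tail.isdigit()

    leaf_edges, cluster_edges = [], []
    with open("hierarchy_graph_edges.csv", newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            parent, child, count = (field.strip() for field in row)
            if looks_like_cluster_ref(child):
                # Non-leaf row: parent cluster -> child cluster, weighted by count.
                cluster_edges.append((parent, child, int(count)))
            else:
                # Leaf row: this document no longer falls in a cluster at the
                # next epsilon; the count field is always 1 here.
                leaf_edges.append((parent, child))

    print(len(cluster_edges), "cluster-to-cluster edges")
    print(len(leaf_edges), "cluster-to-document leaf edges")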
