Giter VIP home page Giter VIP logo

big-data-content-similarity's Introduction

Clustering a Big Dataset using Tika-Similarity

Objective: Using a big dataset - UFO[Unidentified Flying Objects] sightings in the US, we aim to merge it with 4 other relevant datasets such that we add more features to the resulting dataset (v1 UFO dataset).

The goal is to understand the unintended consequences of big data and make inferences about UFO sightings and its causes.

Through this project, we also aim to understand similarity between different UFO sighting instances This project uses a UFO sightings dataset.

Four more datasets (air pollution, drug poisoning, us cities, ufo dataset with latitude and longitude, cancer incedences) are used and a comprehensive dataset is formed.

The final dataset has the following columns:-

  • Description
  • Location
  • Sighted At (time)
  • Reported At (time)
  • Shape
  • Duration
  • Cancer incidences count in all races
  • Cancer incidences count in all white people
  • Cancer incidences count in all hispanics
  • SO2 Mean
  • CO Mean
  • O3 Mean
  • County
  • Population per county
  • Age adjusted death rate (due to drug poisoning)
  • Latitude
  • Longitude
  • Airport name (closest airport)
  • Airport distance (distance between current location and closest airport)

File Structure


  • datasets

    • original_datasets : contains original dataset copies for all datasets used

    • intermediate_datasets: contains any intermediate datasets used by us to reach final merged dataset

    • modified_datasets: final dataset

  • code: code files including python scripts and tika-similarity modified files

  • documents: readme and doc file with answers to questions

  • results: contains images and screenshots, graphs and maps to show inferences and conclusions from the analysis on the final dataset

Scripts to run different files:-


  • Files that contribute to data generation and merging of datasets are:-

    • aiprots_pollution_feature.py
    • Helper.py
    • Integrate_cancer.py
    • integrate_drug.py
    • ufo_add_lat_long.py
    • ufo_airport_pollution_merge.py
  • Files that contribute to visualization:-

    • visualization_air_quality.py
    • visualization_cancer.py
    • visualization_demographics.py
    • visualization_ufo.py

Due to large size of dataset and high computation requirement, we divided the initial dataset into 8 parts and creates an instance on Azure cloud. We ran the location based merge of datasets on this instance to achieve max efficiency and relatively quick results

  • The command to run integrate_cancer.py

    python integrate_cancer.py /path/to/cancer_dataset /path/to/ufo_dataset year outputfile

  • The command to run integrate_drug.py

    python integrate_drug.py drug_dataset /path/to/uscities_dataset path/to/ufo_dataset start_year end_year outputfile

  • The command to run visualiztion_cancer.py

    python visualization_cancer.py /path/to/cancer_dataset

  • The command to run visualization_drug.py

    python visualization_drug.py /path/to/merged_dataset year

  • The command to run visualization_ufo.py

    python visualization_ufo.py /path/to/merged_dataset year

  • The command to run visualization_demographics.py

    python visualization_demographics.py path/to/merged_dataset year

  • Tika-Similarity changes:

Jaccard changes - Existing infrastructure ran the algorithm on the metadata, that too on the "key" and not on the "value".

Now, one more functionality has been included which compares all the features of a single dataset using jaccard algorithm.

Basically every entry has a "key" and "value" pair and the jaccard similarity algorithm is applied to the "value" as it makes more sense.

Corresponding changes were made in circle_packing.py and cluster_scores.py so that correct the json object is passed on to circlepacking.html and cluster-D3.html

We have tried this algorithm only on a subset of our final dataset with 100 entries for better clarity and understanding.

Ideal threshold :- 0.0001 t 0.0003

  • running feature_jaccard.py python feature_jaccard.py -f <filename>

  • running cluster_scores.py (ideal threshold value would be 0.0001 to 0.0004 if it is run on reduced_100.json dataset) python cluster-scores.py [-t threshold_value]

For now, we have implemented this as a separate file, but later on we're planning to merge with the existing code and send a pull request to tika-similarity repository

big-data-content-similarity's People

Contributors

srinidhinandakumar avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.