
COMP479-Project4: Concordia Spider

Description

The concordia_spider.py script is a web scraping tool built with Scrapy, designed to extract and analyze text data from pages on the Concordia University website (concordia.ca). It collects text content, performs sentiment analysis, and clusters the pages based on their content. Two supporting modules, sentiment.py and clusters.py, provide the sentiment analysis and text clustering functionality, respectively. A driver script, main.py, demonstrates how to use these modules to analyze and cluster the scraped data.

Prerequisites

Before running the code, make sure you have the following Python libraries installed:

Scrapy (version 2.11.0)

BeautifulSoup (version 4.12.2)

Afinn (version 0.1)

NLTK (version 3.8.1)

Langdetect (version 1.0.9)

Scikit-learn (version 1.3.2)

Numpy (1.26)

You can install these libraries using pip if you haven't already:

pip install scrapy beautifulsoup4 afinn langdetect nltk scikit-learn numpy

Usage

Scraping Data

The spider script, concordia_spider.py, is responsible for scraping data from the Concordia University website. You can adjust the maximum number of files to be downloaded by changing the max_files parameter in the ConcordiaSpider class constructor.
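The max_files cap can be implemented as a counter checked in the parse callback. The sketch below mimics that pattern with the standard library only; the class and method names are assumptions, and the real ConcordiaSpider subclasses scrapy.Spider.

```python
# Sketch of a max_files limiting pattern: stop saving pages once the
# counter reaches the cap. Illustrative only; in Scrapy the parse()
# callback receives a Response and the spider would raise CloseSpider.
class ConcordiaSpiderSketch:
    def __init__(self, max_files=10):
        self.max_files = max_files
        self.saved = 0

    def parse(self, page_text):
        """Save a page unless the download cap has been reached."""
        if self.saved >= self.max_files:
            return None  # cap reached: drop the page
        self.saved += 1
        return {"text": page_text}

spider = ConcordiaSpiderSketch(max_files=2)
results = [spider.parse(p) for p in ["a", "b", "c"]]
print(results)  # third page is dropped
```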

Sentiment Analysis

The sentiment.py module contains the SentimentAnalyzer class, which performs sentiment analysis on text data. It uses the AFINN lexicon for sentiment scoring and can also cluster documents based on sentiment.
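AFINN-style scoring sums a per-word valence over the text. The sketch below uses a tiny hand-rolled lexicon for illustration; the real SentimentAnalyzer presumably wraps the afinn package, whose lexicon contains a few thousand scored English words.

```python
# Minimal AFINN-style sentiment scoring: sum per-word valence scores.
# The tiny lexicon below is illustrative, not the real AFINN word list.
LEXICON = {"good": 3, "great": 3, "excellent": 3, "bad": -3, "terrible": -3}

def sentiment_score(text):
    """Return the summed valence of all lexicon words in the text."""
    return sum(LEXICON.get(word, 0) for word in text.lower().split())

print(sentiment_score("great campus with excellent labs"))  # 6
print(sentiment_score("terrible parking"))                  # -3
```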

Text Clustering

The clusters.py module includes the TextCluster class, which clusters text documents using K-Means clustering. It also calculates the average sentiment for each cluster using the sentiment analyzer. I am currently trying to create a meaningful GUI.
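A minimal sketch of K-Means over TF-IDF vectors, the standard way to cluster text with scikit-learn; the document list and parameter choices here are illustrative, not the actual TextCluster API.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "admissions tuition fees undergraduate",
    "tuition fees payment deadline",
    "hockey stadium varsity team",
    "varsity hockey game schedule",
]

# Vectorize documents with TF-IDF, then group them with K-Means.
vectors = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Documents sharing vocabulary should land in the same cluster.
print(labels)
```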

Running the Main Script

main.py demonstrates how to use the SentimentAnalyzer and TextCluster classes to analyze and cluster data from the Concordia University website. It loads data from a JSON file (scraped.json), performs clustering with different cluster counts, and saves cluster information to text files (cluster_3.txt and cluster_6.txt). To run the main script, execute the following command:

python main.py
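The overall flow of main.py (load scraped JSON, cluster, write one summary file per cluster count) can be sketched as follows. The record format and the cluster_texts stand-in are assumptions; the real script uses TextCluster.

```python
import json

# Stand-in for the real TextCluster: assigns documents round-robin to
# k groups, purely to illustrate the load -> cluster -> save flow.
def cluster_texts(texts, k):
    return [i % k for i in range(len(texts))]

# Illustrative stand-in for the scraped.json produced by the spider.
scraped = json.loads(
    '[{"url": "page0", "text": "first doc"},'
    ' {"url": "page1", "text": "second doc"},'
    ' {"url": "page2", "text": "third doc"}]'
)

for k in (3,):  # main.py uses k = 3 and k = 6
    labels = cluster_texts([r["text"] for r in scraped], k)
    with open(f"cluster_{k}.txt", "w", encoding="utf-8") as out:
        for rec, label in zip(scraped, labels):
            out.write(f"{label}\t{rec['url']}\n")
```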

Folder Structure

concordia_spider.py: The main web scraping script.

sentiment.py: Module for sentiment analysis and clustering of documents.

clusters.py: Module for text clustering using K-Means.

main.py: The main script demonstrating the usage of the other modules.

COMP479-Project4: A directory for saving scraped data and cluster information.

Important Notes

The code in the concordia_spider.py script is designed to work specifically with the Concordia University website. Make sure to adapt it for different websites if needed. Data is scraped to a JSON file (scraped.json) in the COMP479-Project4 directory. You may need to create this directory manually. The code may require adjustments or additional error handling for different websites or data sources.
