digital-monad / ttds

Group coursework for the Text Technologies for Data Science course.

Python 12.78% HTML 0.40% Jupyter Notebook 86.74% Shell 0.08%

search-engine sentimental-analysis lyrics-search

ttds's Issues

Deploy Web Application onto Google Cloud

Rewrite performance critical functions in Julia

Functions to be rewritten:

Phrase Search
Proximity Search
BM25
Parts of the query parser? (More performant NOT and query evaluation maybe?)

Build the search query parser

Parse the input search query. Determine which search is required (boolean, BM25 ranked, proximity etc.) and pass it to the relevant search function.
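
A minimal sketch of the dispatch step, assuming a hypothetical query syntax (quoted text for phrases, `#k(a, b)` for proximity, upper-case `AND`/`OR`/`NOT` for boolean). The syntax and the name `classify_query` are illustrative assumptions, not something decided in this issue.

```python
import re

# Hypothetical query syntax, assumed for illustration only:
#   "exact phrase"      -> phrase search
#   #5(term1, term2)    -> proximity search within 5 tokens
#   a AND b / OR / NOT  -> boolean search
#   anything else       -> BM25 ranked search
PROXIMITY_RE = re.compile(r'^#(\d+)\(\s*(\w+)\s*,\s*(\w+)\s*\)$')
PHRASE_RE = re.compile(r'^"([^"]+)"$')
BOOLEAN_RE = re.compile(r'\b(AND|OR|NOT)\b')


def classify_query(query):
    """Return (search_type, payload) for a raw query string."""
    query = query.strip()
    match = PROXIMITY_RE.match(query)
    if match:
        distance, first, second = match.groups()
        return "proximity", (int(distance), first, second)
    match = PHRASE_RE.match(query)
    if match:
        return "phrase", match.group(1).split()
    if BOOLEAN_RE.search(query):
        # Left unparsed here; a boolean evaluator would take over.
        return "boolean", query
    return "bm25", query.split()
```

The caller would then pass the payload to whichever search function matches the returned type. A quoted phrase combined with boolean operators falls through to the boolean branch, so the boolean evaluator would need to handle nested phrases itself.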

Indexing Compression?

Compress the information in the CSV file into a binary format?

Preprocessing

Create function to apply preprocessing to arbitrary input text. Should include some subset of the following:

Case folding
Stemming
Stopping
Tokenisation
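
A rough sketch of such a function, with each step switchable. The stop list is a tiny placeholder and `crude_stem` is a stand-in suffix stripper, not the Porter algorithm; a real stemmer and a full stop list would presumably replace both. All names are illustrative assumptions.

```python
import re

# Tiny illustrative stop list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def crude_stem(token):
    """Stand-in stemmer: strips a few common suffixes. Not the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, stop=True, stem=True):
    """Case fold, tokenise, then optionally remove stop words and stem."""
    # Case folding and tokenisation: lowercase, keep alphanumeric runs.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if stop:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [crude_stem(t) for t in tokens]
    return tokens
```

The same function should be applied to both the lyrics at index time and the query at search time, otherwise terms will not match. Note that this tokeniser splits on apostrophes, so "don't" becomes two tokens.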

Query Expansion

Use BERT libraries to write a script that expands search terms.

Transferring CSV Data

Now that the data collection script is finally complete, we need to obtain as much data as possible.
This means collecting the artists' initials from the JSON files and then translating them into CSV.

The CSV data would then be used to create the index files.

Code BM25 ranked search algorithm

The base search algorithm - given a query Q, return a ranked list of documents D ordered by their relevance to the query using BM25.
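
A small in-memory sketch of BM25 ranking, assuming documents are already preprocessed into token lists. It uses the common smoothed IDF variant `ln((N - n + 0.5) / (n + 0.5) + 1)` and the conventional defaults `k1=1.5`, `b=0.75`; these choices and the function name are assumptions, and a real version would read term frequencies from the index rather than recount them.

```python
import math
from collections import Counter


def bm25_rank(query_terms, docs, k1=1.5, b=0.75):
    """Rank documents by BM25. `docs` maps doc id -> list of tokens."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    term_freqs = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    scores = Counter()
    for term in set(query_terms):
        # Document frequency: number of documents containing the term.
        doc_freq = sum(1 for tf in term_freqs.values() if term in tf)
        if doc_freq == 0:
            continue
        idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
        for doc_id, tf in term_freqs.items():
            freq = tf[term]
            if freq == 0:
                continue
            # Longer-than-average documents are penalised, scaled by b.
            length_norm = 1 - b + b * len(docs[doc_id]) / avg_len
            scores[doc_id] += idf * freq * (k1 + 1) / (freq + k1 * length_norm)
    # Highest score first; documents matching no query term are omitted.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```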

Create the script to build the index file. The script should generate a positional inverted index for the song lyric corpus, following a hierarchical format like that of the movie quote search group. This should allow us to display the actual matching lyric lines as results, rather than just the song title or the entire lyrics.

Proposed structure (up for discussion):

term1 : {
    song1 : [(line0, pos2), (line0, pos13), (line12, pos0)],
    song2 : [(line3, pos5), (line10, pos3)]
}
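
One way the proposed structure could be built, sketched under the assumption that each song's lyrics arrive as a single string with one lyric line per text line. Tokenisation is a plain lowercase whitespace split here, standing in for the real preprocessing; `build_index` is a hypothetical name.

```python
from collections import defaultdict


def build_index(songs):
    """Build term -> song id -> [(line_no, pos), ...].

    `songs` maps a song id to its lyrics as one string, one lyric line per
    text line. Positions restart at 0 on every line.
    """
    index = defaultdict(lambda: defaultdict(list))
    for song_id, lyrics in songs.items():
        for line_no, line in enumerate(lyrics.splitlines()):
            for pos, term in enumerate(line.lower().split()):
                index[term][song_id].append((line_no, pos))
    # Convert to plain dicts so the result can be pickled.
    return {term: dict(postings) for term, postings in index.items()}
```

Because lines and tokens are visited in order, each postings list comes out sorted by (line, position), which is what a linear merge over postings would rely on.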

Boolean search

Build the Front-End UI Basics

This should be responsive for both mobile and desktop usage - this feature will be graded for usability

Determine what index format should be stored in the MongoDB

Three collections should be inserted into MongoDB: LyricsMetaData, InvertedIndex and SongsMetaData

SongsMetaData - csv (display frontend)
LyricsMetaData - csv (display frontend)
InvertedIndex - pickle

Missing: index_writer.py, which inserts the pickle index file into MongoDB via PyMongo
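
As one possible format for the InvertedIndex collection, the sketch below turns a term -> song -> positions mapping into one document per term. The field names (`_id`, `postings`, `song`, `positions`) and the function name are assumptions for discussion, not the layout used by index_writer.py.

```python
def index_to_documents(index):
    """Turn term -> song -> [(line, pos), ...] into one document per term."""
    documents = []
    for term, postings in index.items():
        documents.append({
            # Using the term as _id gives a unique key lookup per term.
            "_id": term,
            "postings": [
                # Positions become lists, since BSON has arrays but no tuples.
                {"song": song_id, "positions": [list(p) for p in positions]}
                for song_id, positions in postings.items()
            ],
        })
    return documents
```

The resulting list could then be passed to PyMongo's `insert_many` on the target collection. MongoDB limits a single document to 16 MB, so very frequent terms might need to be split across several documents.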

Obtain song lyric data

Use the Spotify API and web scrape Genius Lyrics to get song lyric data as per this guide. For now I guess just put it into the format from the guide, i.e. a pandas dataframe (not sure how big this will have to be).

Specialised search

Code the specialised search tools using the linear merge algorithm

#11
Phrase search
Proximity search

These are grouped into 1 issue so that there is consistency between the 3 algorithms
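
A sketch of how phrase and proximity search could share one merge routine, assuming an index of the form term -> song -> sorted list of (line, pos) tuples and assuming matches must fall on the same lyric line. Both assumptions and all function names are illustrative.

```python
def positional_merge(postings_a, postings_b, min_offset, max_offset):
    """Pairs (a, b) on the same line with min_offset <= pos_b - pos_a <= max_offset.

    Both inputs must be sorted lists of (line, pos) tuples from one song.
    """
    matches = []
    start = 0
    for line_a, pos_a in postings_a:
        # Skip b postings that lie before the window of this a posting.
        # The lower bound never decreases, so `start` only moves forward.
        while start < len(postings_b) and postings_b[start] < (line_a, pos_a + min_offset):
            start += 1
        k = start
        while k < len(postings_b) and postings_b[k] <= (line_a, pos_a + max_offset):
            matches.append(((line_a, pos_a), postings_b[k]))
            k += 1
    return matches


def _search(index, first, second, min_offset, max_offset):
    results = {}
    songs_a = index.get(first, {})
    songs_b = index.get(second, {})
    # Songs containing both terms; a set intersection is used here for brevity.
    for song_id in songs_a.keys() & songs_b.keys():
        hits = positional_merge(songs_a[song_id], songs_b[song_id], min_offset, max_offset)
        if hits:
            results[song_id] = hits
    return results


def phrase_search(index, first, second):
    """Songs where `second` directly follows `first` on the same line."""
    return _search(index, first, second, 1, 1)


def proximity_search(index, first, second, max_distance):
    """Songs where the two terms are within `max_distance` tokens, either order."""
    return _search(index, first, second, -max_distance, max_distance)
```

Treating a phrase as a proximity window of exactly +1 is what keeps the two searches consistent: only the offsets differ, the merge is the same.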
