azlyrics-scraper's Introduction

Hi there, I'm Albert 👋

  • 🔭 I'm currently working as VP of Engineering at @restbai
  • 👯 I'm looking to collaborate on Open Source projects.
  • 💬 Ask me about hackathons, I'm fully passionate about them.
  • 😄 Pronouns: he/him
  • ⚡ Life goal: I would love to fill in this map, along with this one as well, someday.
  • 📌 Location: Barcelona.

azlyrics-scraper's People

Contributors

albertsuarez

azlyrics-scraper's Issues

csv data contains malformed rows for song 6'1

The row in azlyrics_lyrics_l.csv looks like:

"liz phair","https://www.azlyrics.com/p/phair.html","6'1"","https://www.azlyrics.com/lyrics/lizphair/61.html","i bet you fall in bed[....]"

There's an extra, unescaped double-quote in the song title field, which confuses the parser in Python's csv library (and probably most others). Per the CSV RFC (RFC 4180):

If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"

(btw, thank you for publishing this dataset! It's sorely needed.)
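
For illustration, the escaping rule quoted above can be sketched with Python's csv module. This is only a minimal sketch, not the scraper's actual code: it assumes the title is meant to contain a literal double-quote, and it reuses the row from this issue with the lyrics left truncated as shown.

```python
import csv
import io

# Field values copied from the row in this issue; the lyrics stay truncated.
row = [
    "liz phair",
    "https://www.azlyrics.com/p/phair.html",
    "6'1\"",  # assumption: the title really ends in a double-quote
    "https://www.azlyrics.com/lyrics/lizphair/61.html",
    "i bet you fall in bed[....]",
]

# csv.writer doubles embedded quotes by default (doublequote=True),
# which is the escaping that RFC 4180 describes.
buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerow(row)
line = buffer.getvalue()

# Reading the line back recovers the original five fields.
parsed = next(csv.reader(io.StringIO(line)))
```

Written this way, the title field is serialized as "6'1""" and the row parses back into five fields.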

Some songs have duplicate rows (due to artist aliases?)

In the latest release of the dataset, there are 74 rows corresponding to Liz Phair songs. 61 of those rows are in azlyrics_lyrics_l.csv under the artist name "Liz Phair". 13 are in azlyrics_lyrics_p.csv under "Phair, Liz".

There are 11 songs which appear in both files. As far as I can tell, the lyrics, song url, and song title are identical between the two files - the only field that differs is the artist name.

I guess this is ultimately an issue of jank on the Azlyrics side, since the site has separate listings for 'Liz Phair' and 'Phair, Liz' in its artist directory (which both lead to the same url, https://www.azlyrics.com/p/phair.html). But it would be nice if the scraping pipeline handled deduplication.

I did a quick analysis and found 6,513 total rows with duplicate song urls.
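
One possible way to deduplicate is to keep the first row seen for each song url. The sketch below is hypothetical and not part of the scraper: it assumes the song url is the fourth field (as in the Liz Phair row quoted in the other issue), and the song titles, song urls, and lyrics in the sample are made up.

```python
import csv
import io

# Assumption: the song url is the fourth field of every row.
SONG_URL_INDEX = 3


def dedupe_by_song_url(rows):
    """Yield the first row seen for each song url and skip later repeats."""
    seen = set()
    for row in rows:
        url = row[SONG_URL_INDEX]
        if url in seen:
            continue
        seen.add(url)
        yield row


# Made-up sample: the same song listed under two artist aliases, plus one other song.
sample = io.StringIO(
    '"Liz Phair","https://www.azlyrics.com/p/phair.html","song a","https://example.com/lyrics/a.html","la la"\n'
    '"Phair, Liz","https://www.azlyrics.com/p/phair.html","song a","https://example.com/lyrics/a.html","la la"\n'
    '"Liz Phair","https://www.azlyrics.com/p/phair.html","song b","https://example.com/lyrics/b.html","na na"\n'
)

unique_rows = list(dedupe_by_song_url(csv.reader(sample)))
```

To cover rows split across azlyrics_lyrics_l.csv and azlyrics_lyrics_p.csv, the readers for both files could be chained (for example with itertools.chain) before being passed to the function. Which alias to keep is a separate choice; this sketch simply keeps whichever row comes first.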
