Create a list of each Artist/Band and its number of plays on each day.
Normalize the Artist/Band name by replacing every character that is not a letter or digit ([A-Za-z0-9]).
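The normalization step above could look roughly like the following. This is a minimal sketch, not the project's code: the class name `ArtistNameNormalizer` is hypothetical, and it assumes that rejected characters are replaced with a single space (so that words stay separated for stop-word filtering) and that leading and trailing spaces are trimmed.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the name and the replacement rule are assumptions.
class ArtistNameNormalizer {
    // One or more characters outside A-Z, a-z, 0-9.
    private static final Pattern NON_ALPHANUMERIC = Pattern.compile("[^A-Za-z0-9]+");

    // Replaces each run of non-alphanumeric characters with one space,
    // then trims the result. Null input yields an empty string.
    static String normalize(String rawName) {
        if (rawName == null) {
            return "";
        }
        return NON_ALPHANUMERIC.matcher(rawName).replaceAll(" ").trim();
    }
}
```

Under these assumptions, `ArtistNameNormalizer.normalize("AC/DC")` returns `AC DC`.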
For this type of analysis, we want to focus on words that carry meaning: names, nouns, and verbs. Words like "the", "of", and "and" occur more often than any other words in English.
Use a list of stop words to filter out those most-frequent tokens. Anything on the list is removed from the stream of input Artist/Band names before further processing. The list is currently maintained in the code but can be read as a cache file supplied on the command line.
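A stop-word filter along these lines might be sketched as follows. The class name `StopWordFilter` and the five-word list are hypothetical placeholders (the project's actual list is not shown here), and the case-insensitive comparison is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; the stop-word list below is a placeholder.
class StopWordFilter {
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "of", "and", "a", "an"));

    // Splits the name on whitespace and drops every token found in
    // STOP_WORDS (compared case-insensitively, which is an assumption).
    static String removeStopWords(String name) {
        List<String> kept = new ArrayList<>();
        for (String token : name.trim().split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }
}
```

With this placeholder list, `StopWordFilter.removeStopWords("Florence and the Machine")` returns `Florence Machine`.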
The input records store time as a Unix timestamp. Convert it to YYYY-MM-DD format.
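One way to do this conversion with `java.time` (Java 8 or later) is sketched below. It assumes the timestamp is in seconds rather than milliseconds and that dates are taken in UTC; neither is stated in this document, and the class name `PlayDateFormatter` is hypothetical.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper; seconds resolution and UTC are assumptions.
class PlayDateFormatter {
    // ISO_LOCAL_DATE renders a date as YYYY-MM-DD.
    private static final DateTimeFormatter DAY = DateTimeFormatter.ISO_LOCAL_DATE;

    // Converts a Unix timestamp in seconds to a YYYY-MM-DD string in UTC.
    static String toDay(long epochSeconds) {
        return Instant.ofEpochSecond(epochSeconds)
                .atZone(ZoneOffset.UTC)
                .toLocalDate()
                .format(DAY);
    }
}
```

For example, `PlayDateFormatter.toDay(1400000000L)` returns `2014-05-13`.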
In the Mapper phase, the Mapper processes records one at a time:
- Normalize Artist/Band name
- Convert output date to YYYY-MM-DD format
- Mapper output - Key: Artist tuple (Artist name, date), Value: number of plays per artist
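The Mapper steps above can be sketched in plain Java, without the Hadoop classes, as follows. Several details are assumptions made only for illustration: the input layout (`<unixSeconds><TAB><artist name>`), the value of 1 per record (one record equals one play), the comma-joined key, and the class name `ArtistPlayMapperSketch`. Stop-word filtering is omitted to keep the sketch short.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Plain-Java stand-in for the map step; not the project's actual Mapper.
class ArtistPlayMapperSketch {
    // Assumed record layout (hypothetical): "<unixSeconds>\t<artist name>".
    static Map.Entry<String, Integer> map(String record) {
        String[] fields = record.split("\t", 2);
        if (fields.length < 2) {
            throw new IllegalArgumentException("Malformed record: " + record);
        }
        long epochSeconds = Long.parseLong(fields[0].trim());

        // Normalize the artist name: keep only A-Z, a-z, 0-9.
        String artist = fields[1].replaceAll("[^A-Za-z0-9]+", " ").trim();

        // Convert the timestamp to YYYY-MM-DD (UTC is an assumption).
        String day = Instant.ofEpochSecond(epochSeconds)
                .atZone(ZoneOffset.UTC)
                .toLocalDate()
                .toString();

        // Composite key (artist, day); a comma is a safe separator because
        // normalization has already removed commas from the name.
        return new SimpleEntry<>(artist + "," + day, 1);
    }
}
```

In an actual Hadoop job, the same logic would sit inside the `map` method of a `Mapper` subclass and emit the pair through `context.write(...)` instead of returning it.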
In the Reducer phase:
- For each Artist/Band tuple (Artist name, output date (YYYY-MM-DD)), sum up the number of plays.
- Using the tuple (Artist/Band, output date (YYYY-MM-DD)) as the key ensures that all plays for a given Artist/Band on a given day go to the same reducer.
- Reducer output - Key: Text(Artist name, Date), Value: Number of plays per artist per day
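The Reducer steps above amount to a group-and-sum over the composite key. The sketch below imitates that in plain Java: `sumByKey` stands in for the shuffle plus reduce, and the class name `ArtistPlayReducerSketch` and the comma-joined key format are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for shuffle + reduce; not the project's actual Reducer.
class ArtistPlayReducerSketch {
    // Sums all counts that arrived for a single (artist, day) key.
    static int reduce(Iterable<Integer> counts) {
        int total = 0;
        for (int count : counts) {
            total += count;
        }
        return total;
    }

    // Groups mapper output by key and sums the values per key,
    // which is what the framework and reducer do together.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> mapperOutput) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }
}
```

In Hadoop itself the grouping is done by the framework, so the `reduce` method of a `Reducer` subclass only needs the summing loop before writing the total for its key.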
git clone [email protected]:swathibhat28/artistplays-mapreduce.git
Import it as an Eclipse Java project.
Open the Project Properties window and, in the "Java Build Path" section, click "Add External JARs".
In the JAR Selection dialog, select the following jars from the extracted Hadoop tar.gz file:
- share/hadoop/common/hadoop-common-*.jar
- share/hadoop/mapreduce/hadoop-mapreduce-client-core-*.jar
- share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*.jar
Additional jars required are:
- org-apache-commons-logging.jar
- google-collections-0.9.jar
- org.apache.common.collections.jar
- commons-cli-2.0.jar
- com.google.guava_1.6.0 2.jar
hadoop jar ArtistPlaysMapreduce.jar <input> <output> -artistNames <artistNamesFile>
All the fields above are required.
- Assumes 1 reducer, which is the default. The number of reducers can be set through an optional parameter when required. The suggested ratio is 1 reducer per 1K records.
- Stop words can be dropped, using a stop word list to filter irrelevant words out of artist names.
- Normalization keeps only the characters A-Z, a-z, and 0-9.
- This version looks for exact matches on Artist names only; it does not perform confidence-based matching.
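The reducer sizing suggestion in the list above (1 reducer per 1K records) can be written as a small calculation. This is only a sketch of that rule of thumb: the class name `ReducerCountSketch`, the rounding up, and the minimum of 1 are assumptions.

```java
// Hypothetical helper implementing the "1 reducer per 1K records" suggestion.
class ReducerCountSketch {
    static final long RECORDS_PER_REDUCER = 1_000L;

    // Rounds up so that a partial batch of records still gets a reducer,
    // and never returns fewer than 1.
    static int suggestedReducers(long recordCount) {
        if (recordCount <= 0) {
            return 1;
        }
        long reducers = (recordCount + RECORDS_PER_REDUCER - 1) / RECORDS_PER_REDUCER;
        return (int) Math.min(reducers, Integer.MAX_VALUE);
    }
}
```

In a Hadoop driver, a value like this would typically be passed to `Job.setNumReduceTasks(int)`; how this project exposes it as an optional parameter is not shown here.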