
maprecude_team_adriano_barbosa_and_andre_matsuda's Introduction

MapRecude

Team

Adriano de Souza Barbosa

André Matsuda

Activities:

  1. Find the common vocabulary of 1,500 words shared by 2 books;
  2. Find the vocabulary unique to each of the 2 books by removing the words found in both;

Prerequisites

1. Run the steps below from the root of this project

2. Extract the content of the books into .txt files

Run the command: java -jar HtmlToText.jar path_livro (where path_livro is the path to the books)

$ java -jar HtmlToText.jar ./HtmlToText/lit2go.ok

3. Extract the content of the series subtitles into .txt files

Run the command: java -jar SrtToText.jar path_legendas_series (where path_legendas_series is the path to the series subtitles)

$ java -jar SrtToText.jar ./SrtToText/series

4. Copy the results (.txt files) to Hadoop

$ hadoop fs -mkdir lit2go.ok # create the directory for the books
$ hadoop fs -mkdir series # create the directory for the series

$ hadoop fs -put ./result_books/ ./lit2go.ok/ # copy the books to Hadoop
$ hadoop fs -put ./result_series/ ./series/ # copy the series to Hadoop

5. Configure Hive to read subdirectories recursively

Start Hive and run:

SET hive.mapred.supports.subdirectories=TRUE;
SET mapred.input.dir.recursive=TRUE;

Running the activities in Hive:

1. Find the common vocabulary of 1,500 words shared by 2 books;

Create the table for the book 'A little princess':

CREATE EXTERNAL TABLE a_little_princess
(text STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/root/lit2go.ok/result_books/A_Little_Princess/';

Replace '/user/root/' with the path where you created the result_books folder in Hadoop.

Testing the 'A little princess' table:

SELECT * FROM a_little_princess limit 5;

[Screenshot: teste_a_little_princess]

Create the table for the book 'Dracula':

CREATE EXTERNAL TABLE dracula
(text STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/root/lit2go.ok/result_books/Dracula/';

Testing the 'Dracula' table:

SELECT * FROM dracula limit 5;

[Screenshot: teste_dracula]

Create the word count table for the book 'A little princess':

CREATE TABLE a_little_princess_word_count AS
SELECT word, (count(*)) wordcount
FROM a_little_princess LATERAL VIEW explode(split(lower(text), '\\W+')) t1 AS word
GROUP BY word
order by word;

Testing the 'A little princess' word count table:

SELECT * FROM a_little_princess_word_count limit 5;

[Screenshot: teste_a_little_princess_wordcount]

Create the word count table for the book 'Dracula':

CREATE TABLE dracula_word_count AS
SELECT word, (count(*)) wordcount
FROM dracula LATERAL VIEW explode(split(lower(text), '\\W+')) t1 AS word
GROUP BY word
order by word;

Testing the 'Dracula' word count table:

SELECT * FROM dracula_word_count limit 5;

[Screenshot: teste_dracula_word_count]
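To make the word count step easier to check by hand, here is a minimal Python sketch of the same idea as the Hive expression split(lower(text), '\\W+') followed by GROUP BY word. It is not part of the project: the function name word_count and the sample lines are assumptions made for illustration. Unlike the Hive tables, this sketch drops empty tokens straight away instead of filtering them later.

```python
import re
from collections import Counter


def word_count(lines):
    """Count lowercase word tokens across an iterable of text lines.

    Hypothetical helper, approximating Hive's split(lower(text), '\\W+').
    re.ASCII keeps \\W ASCII-only, which is Java's default behaviour.
    """
    counts = Counter()
    for line in lines:
        tokens = re.split(r"\W+", line.lower(), flags=re.ASCII)
        # Splitting leaves empty strings at the edges of a line; skip them.
        counts.update(token for token in tokens if token)
    return counts


if __name__ == "__main__":
    sample = [
        "Sara Crewe was a little princess.",
        "A little girl, a LITTLE doll.",
    ]
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```

Running it on a few lines of one of the .txt files and comparing against SELECT * FROM a_little_princess_word_count is a quick sanity check on the tokenization.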

Find the common vocabulary of 1,500 words shared by the 2 books (intersection):

select a.word, 
       a.wordcount AS princess_word_count, 
       d.wordcount AS dracula_word_count, 
       (a.wordcount + d.wordcount) AS total
from a_little_princess_word_count a  
join dracula_word_count d on( trim(a.word) = trim(d.word) ) 
where trim(a.word) <> '' 
and trim(d.word) <> ''
order by total desc
limit 1500;

Result: interseccao.csv

[Screenshot: interseccao]
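The intersection query can be sketched in Python as an inner join of two word count dictionaries, sorted by combined count. This is an illustration under assumptions, not the project's code: common_vocabulary and the sample counts are hypothetical, and ties are broken alphabetically here, whereas Hive makes no promise about the order of rows with equal totals.

```python
def common_vocabulary(counts_a, counts_b, limit=1500):
    """Return (word, count_a, count_b, total) rows for words in both inputs.

    Hypothetical helper mirroring the JOIN ... ORDER BY total DESC LIMIT query.
    """
    rows = [
        (word, counts_a[word], counts_b[word], counts_a[word] + counts_b[word])
        for word in counts_a.keys() & counts_b.keys()
        if word.strip()  # same role as: trim(word) <> ''
    ]
    # Highest combined count first; alphabetical tie-break is an assumption.
    rows.sort(key=lambda row: (-row[3], row[0]))
    return rows[:limit]


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    princess = {"the": 10, "sara": 7, "night": 2, "": 4}
    dracula = {"the": 12, "count": 9, "night": 5, "": 3}
    for row in common_vocabulary(princess, dracula):
        print(row)
```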

2. Find the vocabulary unique to each of the 2 books by removing the words found in both (disjunction)

select a.word, 
       a.wordcount, 
       d.word, 
       d.wordcount
from a_little_princess_word_count a  
FULL join dracula_word_count d on( trim(a.word) = trim(d.word) ) 
where a.word IS NULL OR d.word IS NULL;

Result: disjuncao.csv

[Screenshot: disjuncao]
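The FULL JOIN with the IS NULL filter keeps only the rows that found no match on the other side, that is, the words exclusive to each book. A minimal Python sketch of that logic follows; exclusive_vocabulary and the sample counts are hypothetical, and it returns two dictionaries rather than the single four-column result the Hive query produces.

```python
def exclusive_vocabulary(counts_a, counts_b):
    """Split two word count mappings into the words unique to each one.

    Hypothetical helper mirroring:
    FULL JOIN ... WHERE a.word IS NULL OR d.word IS NULL
    """
    only_a = {word: n for word, n in counts_a.items() if word not in counts_b}
    only_b = {word: n for word, n in counts_b.items() if word not in counts_a}
    return only_a, only_b


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    princess = {"the": 10, "sara": 7, "night": 2}
    dracula = {"the": 12, "count": 9, "night": 5}
    only_princess, only_dracula = exclusive_vocabulary(princess, dracula)
    print("only in A little princess:", only_princess)
    print("only in Dracula:", only_dracula)
```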

Final result for the books 'A little princess' and 'Dracula':

  1. Found the common vocabulary (intersection) of 1,500 words shared by the 2 books (CSV)
  2. Found the vocabulary unique to each of the 2 books (disjunction), with the words found in both removed (CSV)
