HeadlineTextProcessor

Problem-Statement:

Given the file headlines.zip, which contains a CSV file of headlines and the date each one appeared:

Sample Input:

These are valid examples:

taiwan hit hard by sars outbreak
united states to play davis cup in bratislava
russia ponders space tourism deal
zabaleta advances in austria

Task:

  • Find headlines with mentions of countries and cities.

  • The following should be capitalised:

    1. First word of the headline
    2. Countries
    3. Cities

Expected Output:

Examples of capitalised headlines

Taiwan hit hard by sars outbreak.
United States to play davis cup in Bratislava.
Russia ponders space tourism deal.
Zabaleta advances in Austria

  • Return a file of the results in CSV format, and the program used for generating it.

Approach:

  • Generate a dictionary of countries and cities from a comprehensive list of country and city data, downloaded from the web as JSON

  • The text transformation is implemented with the Chain of Responsibility pattern

  • Stages:

    • Capitalize first word
    • Search for mentions of Countries and capitalize the mentions
    • Search for mentions of Cities and capitalize the mentions
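
The three stages above can be sketched as a small Chain of Responsibility. This is a minimal illustration under stated assumptions, not the project's actual code: `HeadlineStage`, `FirstWordStage`, `DictionaryStage`, and `HeadlineChainDemo` are hypothetical names, the in-line dictionaries are tiny stand-ins for the JSON data, and a plain regular expression is used in place of the QGramShiftOr search.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stage contract: each link transforms the headline and hands it on.
abstract class HeadlineStage {
    private HeadlineStage next;

    // Links this stage to the next one and returns that next stage for fluent chaining.
    HeadlineStage linkWith(HeadlineStage next) {
        this.next = next;
        return next;
    }

    // Applies this stage, then delegates to the rest of the chain (if any).
    final String handle(String headline) {
        String result = transform(headline);
        return next == null ? result : next.handle(result);
    }

    protected abstract String transform(String headline);
}

// Stage 1: upper-case the first character of the headline.
class FirstWordStage extends HeadlineStage {
    @Override
    protected String transform(String headline) {
        if (headline.isEmpty()) {
            return headline;
        }
        return Character.toUpperCase(headline.charAt(0)) + headline.substring(1);
    }
}

// Stages 2 and 3: replace whole-word mentions of known names with their capitalised form.
class DictionaryStage extends HeadlineStage {
    private final Map<String, String> capitalisedByLowerCase;

    DictionaryStage(Map<String, String> capitalisedByLowerCase) {
        this.capitalisedByLowerCase = capitalisedByLowerCase;
    }

    @Override
    protected String transform(String headline) {
        String result = headline;
        for (Map.Entry<String, String> e : capitalisedByLowerCase.entrySet()) {
            // \b keeps "austria" from matching inside a longer word; (?i) ignores case.
            String regex = "(?i)\\b" + Pattern.quote(e.getKey()) + "\\b";
            result = result.replaceAll(regex, Matcher.quoteReplacement(e.getValue()));
        }
        return result;
    }
}

class HeadlineChainDemo {
    // Builds first word -> countries -> cities and runs one headline through it.
    static String process(String headline) {
        Map<String, String> countries = new LinkedHashMap<>();
        countries.put("united states", "United States");
        countries.put("austria", "Austria");
        Map<String, String> cities = new LinkedHashMap<>();
        cities.put("bratislava", "Bratislava");

        HeadlineStage first = new FirstWordStage();
        first.linkWith(new DictionaryStage(countries)).linkWith(new DictionaryStage(cities));
        return first.handle(headline);
    }

    public static void main(String[] args) {
        System.out.println(process("united states to play davis cup in bratislava"));
    }
}
```

Because each stage only knows its successor, reordering or dropping a stage is a change to the `linkWith` calls and nothing else.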

Maintainability:

  • New stages can be added to the chain, and existing stages can be reordered

Open-Closed Principle:

  • The algorithms for searching mentions and for transforming them can be enhanced in subclasses
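
One way to picture this is a base class whose search step is an overridable hook. The sketch below is illustrative only: `MentionCapitaliser` and `WholeWordMentionCapitaliser` are hypothetical names, and for brevity only the first letter of each mention is upper-cased.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base class: closed for modification, open for extension via findMentions().
class MentionCapitaliser {
    protected final String name; // lower-case name to look for, e.g. "india"

    MentionCapitaliser(String name) {
        if (name.isEmpty()) {
            throw new IllegalArgumentException("name must not be empty");
        }
        this.name = name;
    }

    // Default search: start index of every substring occurrence. Subclasses may refine this.
    protected List<Integer> findMentions(String text) {
        List<Integer> starts = new ArrayList<>();
        for (int i = text.indexOf(name); i >= 0; i = text.indexOf(name, i + 1)) {
            starts.add(i);
        }
        return starts;
    }

    // Template method: upper-cases the first letter of each mention that was found.
    final String apply(String text) {
        StringBuilder sb = new StringBuilder(text);
        for (int start : findMentions(text)) {
            sb.setCharAt(start, Character.toUpperCase(sb.charAt(start)));
        }
        return sb.toString();
    }
}

// Extension: only accept mentions that stand alone as whole words.
class WholeWordMentionCapitaliser extends MentionCapitaliser {
    WholeWordMentionCapitaliser(String name) {
        super(name);
    }

    @Override
    protected List<Integer> findMentions(String text) {
        List<Integer> wholeWords = new ArrayList<>();
        for (int start : super.findMentions(text)) {
            int end = start + name.length();
            boolean leftOk = start == 0 || !Character.isLetter(text.charAt(start - 1));
            boolean rightOk = end == text.length() || !Character.isLetter(text.charAt(end));
            if (leftOk && rightOk) {
                wholeWords.add(start);
            }
        }
        return wholeWords;
    }
}
```

With the name `"india"` and the text `"indiana jones visits india"`, the base class also capitalises the start of "indiana", while the subclass only touches the final word. The base class itself is left unchanged.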

Algorithm:

  • String-Search Algorithm

    • QGramShiftOr -- A fast algorithm for multi-pattern searching
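
QGramShiftOr builds on the bit-parallel Shift-Or idea. The sketch below shows only the basic single-pattern Shift-Or search, as an aid to understanding. It is not the library's implementation and does not cover the q-gram or multi-pattern extensions. `ShiftOrSketch` is a hypothetical name, and the pattern is assumed to be at most 64 characters so that the state fits in one `long`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ShiftOrSketch {
    // Returns the start index of every occurrence of pattern in text.
    static List<Integer> findAll(String text, String pattern) {
        int m = pattern.length();
        if (m == 0 || m > 64) {
            throw new IllegalArgumentException("pattern length must be 1..64");
        }
        // For each character c, bit i of its mask is 0 exactly when pattern.charAt(i) == c.
        Map<Character, Long> mask = new HashMap<>();
        for (int i = 0; i < m; i++) {
            char c = pattern.charAt(i);
            mask.put(c, mask.getOrDefault(c, ~0L) & ~(1L << i));
        }

        List<Integer> hits = new ArrayList<>();
        long state = ~0L;              // all ones: no prefix of the pattern matched yet
        long matchBit = 1L << (m - 1); // bit m-1 becomes 0 when the whole pattern matched
        for (int j = 0; j < text.length(); j++) {
            // Shift in a 0 (the empty prefix always matches), then OR in the character mask:
            // bit i stays 0 only if bit i-1 was 0 and pattern.charAt(i) equals text.charAt(j).
            state = (state << 1) | mask.getOrDefault(text.charAt(j), ~0L);
            if ((state & matchBit) == 0) {
                hits.add(j - m + 1);
            }
        }
        return hits;
    }
}
```

For example, `ShiftOrSketch.findAll("banana", "ana")` reports the overlapping matches at indices 1 and 3. Each text character costs one shift, one OR, and one lookup, regardless of pattern length.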

Design:

  • City and country data are stored in JSON files -> Cities.json , Countries.json
  • JSON data is loaded into a Map on application startup -> JsonDataDictionaryService.java
  • TransformerChain can be built from pre-programmed chain units:
    • FirstWordTransformer

    • CityTransformer

    • CountryTransformer

    • Configurability & Maintainability:

      • These transformer units can be joined in any order, which offers flexibility
      • New transformer units can be plugged in as well (Open-Closed Principle)
      • Existing units in the chain can be removed easily (a change in configuration)
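
The in-memory dictionary mentioned in the first two Design bullets might look roughly like the sketch below. It assumes the JSON files have already been parsed into a list of names (the parsing itself is omitted). `NameDictionarySketch` is a hypothetical name, and the README does not show the real structure of JsonDataDictionaryService.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class NameDictionarySketch {
    // Key: lower-case name as it appears in headlines; value: properly capitalised name.
    private final Map<String, String> displayNameByKey;

    NameDictionarySketch(List<String> names) {
        Map<String, String> map = new HashMap<>();
        for (String name : names) {
            map.put(name.toLowerCase(Locale.ROOT), name);
        }
        this.displayNameByKey = Collections.unmodifiableMap(map);
    }

    // Returns the capitalised form for a mention, or null if the name is unknown.
    String capitalisedFormOf(String mention) {
        return displayNameByKey.get(mention.toLowerCase(Locale.ROOT));
    }

    int size() {
        return displayNameByKey.size();
    }

    public static void main(String[] args) {
        // Stand-in for the contents of Countries.json.
        NameDictionarySketch countries = new NameDictionarySketch(
                Arrays.asList("Taiwan", "United States", "Russia", "Austria"));
        System.out.println(countries.capitalisedFormOf("united states"));
    }
}
```

Keying the map by the lower-case form gives constant-time lookup of a candidate mention, including multi-word names such as "united states".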

Workflow:

  1. Parse json Data (Static)

  2. Load and parse input news-headlines-csv file (Filename provided in command line)

  3. Initiate transformation workflow

    1. Capitalize first word
    2. Capitalize mentions of cities
    3. Capitalize mentions of Countries
  4. Render the transformed string to a new output file (same as in input directory)

    1. Transformation and rendering are repeated for every input line
  5. Output is rendered as records in a CSV file in:

    • Directory: /tmp/transformed/
    • File name format: "abcnews-date-text"+"_"+System.currentTimeMillis()+".csv"
      • sample: abcnews-date-text_1552931515616.csv
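
The naming scheme above can be sketched with `java.nio.file`. `OutputFileSketch` and its method names are hypothetical, and the project's own file handling (FileReaderUtil in the log) may differ.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class OutputFileSketch {
    // Mirrors the naming scheme quoted above: <base>_<epoch millis>.csv
    static String outputFileName(String baseName, long epochMillis) {
        return baseName + "_" + epochMillis + ".csv";
    }

    // Writes the lines to <directory>/<base>_<millis>.csv, creating the directory if needed.
    static Path write(Path directory, String baseName, List<String> lines) throws IOException {
        Files.createDirectories(directory);
        Path target = directory.resolve(outputFileName(baseName, System.currentTimeMillis()));
        return Files.write(target, lines, StandardCharsets.UTF_8);
    }
}
```

A call such as `OutputFileSketch.write(java.nio.file.Paths.get("/tmp/transformed"), "abcnews-date-text", lines)` would then produce a file named like the sample above. The timestamp suffix keeps repeated runs from overwriting each other.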

Usage:

Maven-based package and start:

  1. Prerequisites: Maven 3.3.9+ and JDK 8 / JRE 8

  2. Run command:

```sh
./headline_textProcessor_console.sh
```

OR

```sh
headline_textProcessor_console.sh
```
  3. Running the command above should initiate the Maven clean, compile, test-compile, test, and install phases

  4. After a successful build, Spring Boot starts up (on a random port) and initiates a command-line runner.

  5. The console should then display a prompt for entering the data file name

    Please enter the file name in command prompt / console

/Library/Java/JavaVirtualMachines/jdk1.8.0_25.jdk/Contents/Home/bin/java -Dmaven.multiModuleProjectDirectory=/Users/lakshmikanth/Desktop/Propine/headlines -Dmaven.home=/usr/local/Cellar/maven/3.6.0/libexec -Dclassworlds.conf=/usr/local/Cellar/maven/3.6.0/libexec/bin/m2.conf "-javaagent:/Applications/IntelliJ IDEA CE.app/Contents/lib/idea_rt.jar=64665:/Applications/IntelliJ IDEA CE.app/Contents/bin" -Dfile.encoding=UTF-8 -classpath /usr/local/Cellar/maven/3.6.0/libexec/boot/plexus-classworlds-2.5.2.jar org.codehaus.classworlds.Launcher -Didea.version=2018.2.7 spring-boot:run
objc[45658]: Class JavaLaunchHelper is implemented in both /Library/Java/JavaVirtualMachines/jdk1.8.0_25.jdk/Contents/Home/bin/java (0x102d514c0) and /Library/Java/JavaVirtualMachines/jdk1.8.0_25.jdk/Contents/Home/jre/lib/libinstrument.dylib (0x102dc64e0). One of the two will be used. Which one is undefined.
[INFO] Scanning for projects...
[WARNING]
[WARNING] Some problems were encountered while building the effective model for io.headlines:headlines:jar:1.0
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found duplicate declaration of plugin org.jacoco:jacoco-maven-plugin @ line 150, column 21
[WARNING]
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING]
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING]
[INFO]
[INFO] -----------------------< io.headlines:headlines >-----------------------
[INFO] Building headlines 1.0
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] >>> spring-boot-maven-plugin:2.0.4.RELEASE:run (default-cli) > test-compile @ headlines >>>
[INFO]
[INFO] --- jacoco-maven-plugin:0.7.5.201505241946:prepare-agent (jacoco-initialize) @ headlines ---
[INFO] argLine set to -javaagent:/Users/lakshmikanth/.m2/repository/org/jacoco/org.jacoco.agent/0.7.5.201505241946/org.jacoco.agent-0.7.5.201505241946-runtime.jar=destfile=/Users/lakshmikanth/Desktop/Propine/headlines/target/coverage-reports/jacoco-unit.exec
[INFO]
[INFO] --- maven-resources-plugin:3.0.2:resources (default-resources) @ headlines ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 1 resource
[INFO] Copying 0 resource
[INFO]
[INFO] --- maven-compiler-plugin:3.6.1:compile (default-compile) @ headlines ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- maven-resources-plugin:3.0.2:testResources (default-testResources) @ headlines ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 2 resources
[INFO]
[INFO] --- maven-compiler-plugin:3.6.1:testCompile (default-testCompile) @ headlines ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] <<< spring-boot-maven-plugin:2.0.4.RELEASE:run (default-cli) < test-compile @ headlines <<<
[INFO]
[INFO]
[INFO] --- spring-boot-maven-plugin:2.0.4.RELEASE:run (default-cli) @ headlines ---
[INFO] Attaching agents: []
16:07:52.431 [main] DEBUG org.springframework.boot.devtools.settings.DevToolsSettings - Included patterns for restart : []
16:07:52.435 [main] DEBUG org.springframework.boot.devtools.settings.DevToolsSettings - Excluded patterns for restart : [/spring-boot-actuator/target/classes/, /spring-boot-devtools/target/classes/, /spring-boot/target/classes/, /spring-boot-starter-[\w-]+/, /spring-boot-autoconfigure/target/classes/, /spring-boot-starter/target/classes/]
16:07:52.435 [main] DEBUG org.springframework.boot.devtools.restart.ChangeableUrls - Matching URLs for reloading : [file:/Users/lakshmikanth/Desktop/Propine/headlines/target/classes/]

  .   ____          _            __ _ _
 /\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
 \\/  ___)| |_)| | | | | || (_| |  ) ) ) )
  '  |____| .__|_| |_|_| |_\__, | / / / /
 =========|_|==============|___/=/_/_/_/
 :: Spring Boot ::        (v2.0.4.RELEASE)

2019-03-21 16:07:52.782  INFO 45659 --- [  restartedMain] i.h.HeadlineTextProcessingApplication    : Starting HeadlineTextProcessingApplication on lakshmikanthMBP with PID 45659 (/Users/lakshmikanth/Desktop/Propine/headlines/target/classes started by lakshmikanth in /Users/lakshmikanth/Desktop/Propine/headlines)
2019-03-21 16:07:52.784  INFO 45659 --- [  restartedMain] i.h.HeadlineTextProcessingApplication    : No active profile set, falling back to default profiles: default
2019-03-21 16:07:52.846  INFO 45659 --- [  restartedMain] s.c.a.AnnotationConfigApplicationContext : Refreshing org.springframework.context.annotation.AnnotationConfigApplicationContext@4f7de7: startup date [Thu Mar 21 16:07:52 IST 2019]; root of context hierarchy
2019-03-21 16:07:53.266  INFO 45659 --- [  restartedMain] o.s.b.f.s.DefaultListableBeanFactory     : Overriding bean definition for bean 'jsonDataDictionaryService' with a different definition: replacing [Generic bean: class [io.headlines.service.JsonDataDictionaryService]; scope=singleton; abstract=false; lazyInit=false; autowireMode=0; dependencyCheck=0; autowireCandidate=true; primary=false; factoryBeanName=null; factoryMethodName=null; initMethodName=null; destroyMethodName=null; defined in file [/Users/lakshmikanth/Desktop/Propine/headlines/target/classes/io/headlines/service/JsonDataDictionaryService.class]] with [Root bean: class [null]; scope=; abstract=false; lazyInit=false; autowireMode=3; dependencyCheck=0; autowireCandidate=true; primary=false; factoryBeanName=headlineTextProcessorConfiguration; factoryMethodName=jsonDataDictionaryService; initMethodName=null; destroyMethodName=(inferred); defined in class path resource [io/headlines/configuration/HeadlineTextProcessorConfiguration.class]]
2019-03-21 16:07:54.477  INFO 45659 --- [  restartedMain] o.s.b.d.a.OptionalLiveReloadServer       : LiveReload server is running on port 35729
2019-03-21 16:07:54.513  INFO 45659 --- [  restartedMain] o.s.j.e.a.AnnotationMBeanExporter        : Registering beans for JMX exposure on startup
2019-03-21 16:07:54.540  INFO 45659 --- [  restartedMain] i.h.HeadlineTextProcessingApplication    : Started HeadlineTextProcessingApplication in 2.091 seconds (JVM running for 2.604)
2019-03-21 16:07:54.543  INFO 45659 --- [  restartedMain] i.headlines.CommandLineAppStartupRunner  : Application started with command-line arguments: [] .
 To kill this application, press Ctrl + C.
abcnews-date-text.csv
2019-03-21 16:07:56.620  INFO 45659 --- [  restartedMain] i.headlines.CommandLineAppStartupRunner  : parsing fileName - abcnews-date-text.csv
2019-03-21 16:11:40.967  INFO 45659 --- [  restartedMain] i.headlines.CommandLineAppStartupRunner  : Transformed-Headlines with count: 1103665
2019-03-21 16:11:43.911  INFO 45659 --- [  restartedMain] io.headlines.common.FileReaderUtil       : about to create new file: /tmp/transformed/abcnews-date-text_1553164903910.csv at directory: true
2019-03-21 16:11:43.913  INFO 45659 --- [  restartedMain] io.headlines.common.FileReaderUtil       : created new File Object: /tmp/transformed/abcnews-date-text_1553164903910.csv
2019-03-21 16:11:44.240  INFO 45659 --- [  restartedMain] i.h.service.HeadlineRendererServiceImpl  : HeadlineTextProcessing Completed Successfully
2019-03-21 16:11:44.242  INFO 45659 --- [       Thread-8] s.c.a.AnnotationConfigApplicationContext : Closing org.springframework.context.annotation.AnnotationConfigApplicationContext@4f7de7: startup date [Thu Mar 21 16:07:52 IST 2019]; root of context hierarchy
2019-03-21 16:11:44.246  INFO 45659 --- [       Thread-8] o.s.j.e.a.AnnotationMBeanExporter        : Unregistering JMX-exposed beans on shutdown
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  03:55 min
[INFO] Finished at: 2019-03-21T16:11:44+05:30
[INFO] ------------------------------------------------------------------------

Process finished with exit code 0

Performance:

- Input Dataset: 1103665 records (1.1 million)
- Start-Time: 16:07:56.620
- End-Time:   16:11:44.240
- Duration: 03:47.620

- Average: about 03:48 minutes for the 1.1 million records (roughly 4,850 records per second)
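
The duration and throughput can be recomputed directly from the start and end timestamps listed above. `ThroughputSketch` is a hypothetical helper written for this check, not part of the project.

```java
import java.time.Duration;
import java.time.LocalTime;

final class ThroughputSketch {
    // Elapsed time between two same-day timestamps in HH:mm:ss.SSS form.
    static Duration elapsed(String start, String end) {
        return Duration.between(LocalTime.parse(start), LocalTime.parse(end));
    }

    // Records processed per second over the given elapsed time.
    static double recordsPerSecond(long records, Duration elapsed) {
        return records / (elapsed.toMillis() / 1000.0);
    }

    public static void main(String[] args) {
        Duration d = elapsed("16:07:56.620", "16:11:44.240");
        System.out.println(d.toMillis() + " ms"); // 227620 ms, i.e. 3 min 47.620 s
        System.out.printf("%.0f records/s%n", recordsPerSecond(1_103_665L, d));
    }
}
```

At about 4,850 records per second, one million headlines would take roughly 3 minutes 26 seconds on the same machine.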

References:

StringSearchAlgorithm

High-performance search Library:
