Given the file headlines.zip which contains a CSV file of the headlines and the date it appeared for:
These are valid examples:
taiwan hit hard by sars outbreak
united states to play davis cup in bratislava
russia ponders space tourism deal
zabaleta advances in austria
-
Find headlines with mentions of countries and cities.
-
These should be capitalised
- First word of the headline
- Countries
- Cities
Examples of capitalised headlines
Taiwan hit hard by sars outbreak.
United States to play davis cup in Bratislava.
Russia ponders space tourism deal.
Zabaleta advances in Austria
- Return a file of the results in CSV format, and the program used for generating it.
-
Generate a dictionary of Countries and Cities from comprehensive list of Countries and Cities data is downloaded from web as json
-
Implementation of text transformation uses Chain of Responsibility pattern
-
Stages:
- Capitalize first word
- Search for mentions of Countries and capitalize the mentions
- Search for mentions of Cities and capitalize the mentions
- New stages can be added to chain or existing nodes can be reordered
- Algorithm to search for mentions and transformation can be enhanced in sub-classes
-
String-Search Algorithm
- QGramShiftOr -- A fast algorithm for multi-pattern searching
- Cities and Country data are stored in json files -> Cities.json , Countries.json
- Json data is loaded in to a Map on startup of Application -> JsonDataDictionaryService.java
- TransformerChain can be built with pre-programmed unit of chain:
-
FirstWordTransformer
-
CityTransformer
-
CountryTransformer
-
Configurability & Maintainability:
- These transformer units can be joined in any order , which offers flexibility
- New Transformer units can be plugged-in as well (Open-Close Principle)
- Existing units in chain can be removed easily (Change in configuration)
-
-
Parse json Data (Static)
-
Load and parse input news-headlines-csv file (Filename provided in command line)
-
Initiate transformation workflow
- Capitalize first word
- Capitalize mentions of cities
- Capitalize mentions of Countries
-
render the transformed string to a new output file (same as in input directory)
- Workflow and render are repeated for all input new lines
-
Output is rendered as records in csv file in:
- Directory: /tmp/transformed/
- fileName formatting: abcnews-date-text"+"_"+System.currentTimeMillis()+".csv"
- sample: abcnews-date-text_1552931515616.csv
-
Essentials: maven 3.3.9 + and jdk-8, jre-8
-
Run command:
```sh
./headline_textProcessor_console.sh
```
OR
```sh
headline_textProcessor_console.sh
```
-
Command execution from step-4 should initiate maven clean, compile, test-compile, test, install phases
-
Post Successful build, Spring-boot will startup (on random port) and initiate a command line runner.
-
Console should display a command line argument entry for datafile
Please enter the file name in command prompt / console
/Library/Java/JavaVirtualMachines/jdk1.8.0_25.jdk/Contents/Home/bin/java -Dmaven.multiModuleProjectDirectory=/Users/lakshmikanth/Desktop/Propine/headlines -Dmaven.home=/usr/local/Cellar/maven/3.6.0/libexec -Dclassworlds.conf=/usr/local/Cellar/maven/3.6.0/libexec/bin/m2.conf "-javaagent:/Applications/IntelliJ IDEA CE.app/Contents/lib/idea_rt.jar=64665:/Applications/IntelliJ IDEA CE.app/Contents/bin" -Dfile.encoding=UTF-8 -classpath /usr/local/Cellar/maven/3.6.0/libexec/boot/plexus-classworlds-2.5.2.jar org.codehaus.classworlds.Launcher -Didea.version=2018.2.7 spring-boot:run
objc[45658]: Class JavaLaunchHelper is implemented in both /Library/Java/JavaVirtualMachines/jdk1.8.0_25.jdk/Contents/Home/bin/java (0x102d514c0) and /Library/Java/JavaVirtualMachines/jdk1.8.0_25.jdk/Contents/Home/jre/lib/libinstrument.dylib (0x102dc64e0). One of the two will be used. Which one is undefined.
[INFO] Scanning for projects...
[WARNING]
[WARNING] Some problems were encountered while building the effective model for io.headlines:headlines:jar:1.0
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found duplicate declaration of plugin org.jacoco:jacoco-maven-plugin @ line 150, column 21
[WARNING]
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING]
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING]
[INFO]
[INFO] -----------------------< io.headlines:headlines >-----------------------
[INFO] Building headlines 1.0
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] >>> spring-boot-maven-plugin:2.0.4.RELEASE:run (default-cli) > test-compile @ headlines >>>
[INFO]
[INFO] --- jacoco-maven-plugin:0.7.5.201505241946:prepare-agent (jacoco-initialize) @ headlines ---
[INFO] argLine set to -javaagent:/Users/lakshmikanth/.m2/repository/org/jacoco/org.jacoco.agent/0.7.5.201505241946/org.jacoco.agent-0.7.5.201505241946-runtime.jar=destfile=/Users/lakshmikanth/Desktop/Propine/headlines/target/coverage-reports/jacoco-unit.exec
[INFO]
[INFO] --- maven-resources-plugin:3.0.2:resources (default-resources) @ headlines ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 1 resource
[INFO] Copying 0 resource
[INFO]
[INFO] --- maven-compiler-plugin:3.6.1:compile (default-compile) @ headlines ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- maven-resources-plugin:3.0.2:testResources (default-testResources) @ headlines ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 2 resources
[INFO]
[INFO] --- maven-compiler-plugin:3.6.1:testCompile (default-testCompile) @ headlines ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] <<< spring-boot-maven-plugin:2.0.4.RELEASE:run (default-cli) < test-compile @ headlines <<<
[INFO]
[INFO]
[INFO] --- spring-boot-maven-plugin:2.0.4.RELEASE:run (default-cli) @ headlines ---
[INFO] Attaching agents: []
16:07:52.431 [main] DEBUG org.springframework.boot.devtools.settings.DevToolsSettings - Included patterns for restart : []
16:07:52.435 [main] DEBUG org.springframework.boot.devtools.settings.DevToolsSettings - Excluded patterns for restart : [/spring-boot-actuator/target/classes/, /spring-boot-devtools/target/classes/, /spring-boot/target/classes/, /spring-boot-starter-[\w-]+/, /spring-boot-autoconfigure/target/classes/, /spring-boot-starter/target/classes/]
16:07:52.435 [main] DEBUG org.springframework.boot.devtools.restart.ChangeableUrls - Matching URLs for reloading : [file:/Users/lakshmikanth/Desktop/Propine/headlines/target/classes/]
. ____ _ __ _ _
/\\ / ___'_ __ _ _(_)_ __ __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
\\/ ___)| |_)| | | | | || (_| | ) ) ) )
' |____| .__|_| |_|_| |_\__, | / / / /
=========|_|==============|___/=/_/_/_/
:: Spring Boot :: (v2.0.4.RELEASE)
2019-03-21 16:07:52.782 INFO 45659 --- [ restartedMain] i.h.HeadlineTextProcessingApplication : Starting HeadlineTextProcessingApplication on lakshmikanthMBP with PID 45659 (/Users/lakshmikanth/Desktop/Propine/headlines/target/classes started by lakshmikanth in /Users/lakshmikanth/Desktop/Propine/headlines)
2019-03-21 16:07:52.784 INFO 45659 --- [ restartedMain] i.h.HeadlineTextProcessingApplication : No active profile set, falling back to default profiles: default
2019-03-21 16:07:52.846 INFO 45659 --- [ restartedMain] s.c.a.AnnotationConfigApplicationContext : Refreshing org.springframework.context.annotation.AnnotationConfigApplicationContext@4f7de7: startup date [Thu Mar 21 16:07:52 IST 2019]; root of context hierarchy
2019-03-21 16:07:53.266 INFO 45659 --- [ restartedMain] o.s.b.f.s.DefaultListableBeanFactory : Overriding bean definition for bean 'jsonDataDictionaryService' with a different definition: replacing [Generic bean: class [io.headlines.service.JsonDataDictionaryService]; scope=singleton; abstract=false; lazyInit=false; autowireMode=0; dependencyCheck=0; autowireCandidate=true; primary=false; factoryBeanName=null; factoryMethodName=null; initMethodName=null; destroyMethodName=null; defined in file [/Users/lakshmikanth/Desktop/Propine/headlines/target/classes/io/headlines/service/JsonDataDictionaryService.class]] with [Root bean: class [null]; scope=; abstract=false; lazyInit=false; autowireMode=3; dependencyCheck=0; autowireCandidate=true; primary=false; factoryBeanName=headlineTextProcessorConfiguration; factoryMethodName=jsonDataDictionaryService; initMethodName=null; destroyMethodName=(inferred); defined in class path resource [io/headlines/configuration/HeadlineTextProcessorConfiguration.class]]
2019-03-21 16:07:54.477 INFO 45659 --- [ restartedMain] o.s.b.d.a.OptionalLiveReloadServer : LiveReload server is running on port 35729
2019-03-21 16:07:54.513 INFO 45659 --- [ restartedMain] o.s.j.e.a.AnnotationMBeanExporter : Registering beans for JMX exposure on startup
2019-03-21 16:07:54.540 INFO 45659 --- [ restartedMain] i.h.HeadlineTextProcessingApplication : Started HeadlineTextProcessingApplication in 2.091 seconds (JVM running for 2.604)
2019-03-21 16:07:54.543 INFO 45659 --- [ restartedMain] i.headlines.CommandLineAppStartupRunner : Application started with command-line arguments: [] .
To kill this application, press Ctrl + C.
abcnews-date-text.csv
2019-03-21 16:07:56.620 INFO 45659 --- [ restartedMain] i.headlines.CommandLineAppStartupRunner : parsing fileName - abcnews-date-text.csv
2019-03-21 16:11:40.967 INFO 45659 --- [ restartedMain] i.headlines.CommandLineAppStartupRunner : Transformed-Headlines with count: 1103665
2019-03-21 16:11:43.911 INFO 45659 --- [ restartedMain] io.headlines.common.FileReaderUtil : about to create new file: /tmp/transformed/abcnews-date-text_1553164903910.csv at directory: true
2019-03-21 16:11:43.913 INFO 45659 --- [ restartedMain] io.headlines.common.FileReaderUtil : created new File Object: /tmp/transformed/abcnews-date-text_1553164903910.csv
2019-03-21 16:11:44.240 INFO 45659 --- [ restartedMain] i.h.service.HeadlineRendererServiceImpl : HeadlineTextProcessing Completed Successfully
2019-03-21 16:11:44.242 INFO 45659 --- [ Thread-8] s.c.a.AnnotationConfigApplicationContext : Closing org.springframework.context.annotation.AnnotationConfigApplicationContext@4f7de7: startup date [Thu Mar 21 16:07:52 IST 2019]; root of context hierarchy
2019-03-21 16:11:44.246 INFO 45659 --- [ Thread-8] o.s.j.e.a.AnnotationMBeanExporter : Unregistering JMX-exposed beans on shutdown
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 03:55 min
[INFO] Finished at: 2019-03-21T16:11:44+05:30
[INFO] ------------------------------------------------------------------------
Process finished with exit code 0
- Input Dataset: 1103665 records (1.1 million)
- Start-Time: 16:07:56.620
- End-Time: 16:11:44.240
- Duration: 03:48.420
- Average: 03:48 Minutes for 1 Million records