webindex's Introduction

Webindex

Webindex is an example Apache Fluo application that incrementally indexes links to web pages in multiple ways. If you are new to Fluo, you may want to start with the Fluo tour, as the WebIndex application is more complicated. For more information on how the WebIndex application works, view the tables and code documentation.

Webindex utilizes multiple projects. Common Crawl web crawl data is used as the input. Apache Spark is used to initialize Fluo and incrementally load data into Fluo. Apache Accumulo is used to hold the indexes and Fluo's data. Fluo is used to continuously combine new and historical information about web pages and update an external index when changes occur. Webindex has a simple UI, built using Spark Java, that allows querying the indexes.

Below is a video showing repeated queries of stackoverflow.com while Webindex was running for three days on EC2. The video was made by periodically querying the Webindex instance and taking a screenshot. More details about this video are available in this blog post.

Querying stackoverflow.com

Running WebIndex

If you are new to WebIndex, the simplest way to run the application is to run the development server. First, clone the WebIndex repo:

git clone https://github.com/astralway/webindex.git

Next, on a machine where Java and Maven are installed, run the development server using the webindex command:

cd webindex/
./bin/webindex dev

This will build and start the development server, which logs to the console. The dev command has several command-line options, which can be viewed by running it with -h. When you want to terminate the server, press CTRL-c.

The development server starts a MiniAccumuloCluster and runs MiniFluo on top of it. It parses a CommonCrawl data file and creates a file at data/1000-pages.txt with 1000 pages that are loaded into MiniFluo. The number of pages loaded can be changed to 5000 by using the command below:

./bin/webindex dev --pages 5000

The pages are processed by Fluo which exports indexes to Accumulo. The development server also starts a web application at http://localhost:4567 that queries indexes in Accumulo.

If you would like to run WebIndex on a cluster, follow the install instructions.

Viewing metrics

Metrics can be sent from the development server to InfluxDB and viewed in Grafana. You can either set up InfluxDB and Grafana on your own or use the Uno command uno setup metrics. After a metrics server is started, run the development server with the --metrics option to start sending metrics:

./bin/webindex dev --metrics

Fluo metrics can be viewed in Grafana. To view application-specific metrics for Webindex, import the WebIndex Grafana dashboard located at contrib/webindex-dashboard.json.

webindex's People

Contributors

ctubbsii, keith-turner, kennethmcfarland, mikewalch

webindex's Issues

Create unit test for Spark indexing code

The unit test should verify the output of Spark indexing to Fluo and Accumulo. It should also verify code that reads from Fluo and rebuilds indexes in Accumulo.

Load no longer running with Accumulo 1.8.0-SNAPSHOT

The load Spark jobs are failing with Accumulo 1.8.0-SNAPSHOT. The problem seems to be that Accumulo upgraded to Thrift 0.9.3 while Spark includes an older version of Thrift. The older version of Thrift in Spark gets picked up first, which causes Accumulo client code to fail.

Investigate memory issue

While running a 3 day experiment with Webindex on EC2, I noticed that workers were constantly being killed for using too much memory. I started off with 128 threads and 4G per worker. I increased to 5G and then 6G, and workers were still being killed or dying.

When the workers had 4G and were being killed, I looked in the YARN logs to determine why and found it was memory usage. After increasing the memory, I noticed they were still dying (based on the attempt count in the task ID). However, I did not inspect the YARN logs again to verify it was memory related; I should have.

Big rows in search table

I was seeing the following in Accumulo during a long run of webindex. It would be nice to reorganize the index so there are no huge rows.

2016-05-04 15:23:52,429 [tablet.Tablet] WARN : tserver:worker6 Cannot split tablet 2;d:com.blogger;d:com.blogg it contains a big row : d:com.blogger
2016-05-04 15:25:42,570 [tablet.Tablet] WARN : tserver:worker2 Cannot split tablet 2;d:com.twitter;d:com.tu it contains a big row : d:com.twitter

Add ability to jump to count

The page of top links could offer the option to jump to a count. For example, jump to pages with an inbound link count of 3000.

Create Spark job to load pages into Fluo

While init.sh initializes an empty Fluo and Accumulo, load.sh will add pages to Fluo and combine them with previous data. This should be written in Spark to parallelize the loading across several nodes.

Refactor how webindex obtains Accumulo config

Currently webindex does some odd things because it obtains Accumulo config from Fluo config for its external Accumulo index. I think it would be cleaner if the Accumulo info were explicit webindex config. Then webindex could use its Accumulo config to configure both Fluo and its external index table. With this approach I think all initialization could be done in Java, avoiding editing the app properties file and then calling fluo init.
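
Below is a rough sketch of that direction, assuming Fluo's FluoConfiguration Accumulo setters; the helper method and its parameters are illustrative only, not existing webindex code.

import org.apache.fluo.api.config.FluoConfiguration;

public class FluoConfigBuilder {

  // Hypothetical helper: webindex keeps its own Accumulo settings and derives
  // the Fluo configuration from them, instead of the other way around.
  public static FluoConfiguration build(String instance, String zookeepers,
      String user, String password) {
    FluoConfiguration fc = new FluoConfiguration();
    fc.setApplicationName("webindex");
    fc.setAccumuloInstance(instance);
    fc.setAccumuloZookeepers(zookeepers);
    fc.setAccumuloUser(user);
    fc.setAccumuloPassword(password);
    return fc;
  }
}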

Centralize code for dealing with query table

I think it would make the code easier to follow if all of the code dealing with the query table were centralized. This would be the code for creating mutations and for querying. The Spark, Fluo, and web code would all make calls to this code.

@mikewalch and I had a conversation about this when talking about #16. If this change is made, UriCountExport would be much shorter and could possibly be moved into UriMap.

webindex init is not working

The init script command is failing with the following error:

cp: missing destination file operand after `/home/kturner/fluo-dev/install/fluo-1.0.0-beta-2-SNAPSHOT/apps/webindex/lib'

I think it's because $WI_DATA_JAR is not being set, and that is causing the following command to fail in bin/impl/init.sh.

cp $WI_DATA_JAR $FLUO_APP_LIB

Semaphore in commit manager blocks large transactions

The new async commit manager uses a semaphore to limit, based on memory, the amount of transactions committing asynchronously. I ran into a problem running webindex locally where a single transaction exceeded this limit and blocked forever. I wrote a Limit class for this situation; it needs to be used here.
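
A rough sketch of the idea behind such a Limit class is below; it only illustrates the behavior that a request larger than the limit is admitted once nothing else is in flight, and the class name is hypothetical.

// Illustrative only: a memory limit that never blocks an oversized request
// forever. A request larger than the limit is admitted when no other memory
// is currently in use.
public class MemoryLimit {

  private final long limit;
  private long inUse = 0;

  public MemoryLimit(long limit) {
    this.limit = limit;
  }

  public synchronized void acquire(long amount) throws InterruptedException {
    // Wait only while other acquisitions hold memory and admitting this one
    // would exceed the limit; a lone oversized request proceeds immediately.
    while (inUse > 0 && inUse + amount > limit) {
      wait();
    }
    inUse += amount;
  }

  public synchronized void release(long amount) {
    inUse -= amount;
    notifyAll();
  }
}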

Failed when running on EC2

I tried running webindex copy on EC2 and saw the following failure. I suspect the remote task is trying to access the paths file, which is on the machine that started the job but not on any of the workers.

Caused by: java.io.FileNotFoundException: File file:/home/ec2-user/webindex/paths/2015-18.wat.paths does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)

Seeing error in getDomain

Repeatedly seeing the following error in getDomain.

16:46:24.087 [pool-10-thread-38] WARN  io.fluo.core.worker.WorkTask - Failed to execute observer CollisionFreeMapObserver notification : um:u:4 fluoRecipes cfm:um  153397
java.lang.RuntimeException: java.text.ParseException: Invalid host: whattoexpect.co.au
        at io.fluo.webindex.data.fluo.UriMap$UriUpdateObserver.getDomain(UriMap.java:148) ~[webindex-data-0.0.1-SNAPSHOT.jar:na]
        at io.fluo.webindex.data.fluo.UriMap$UriUpdateObserver.updatingValues(UriMap.java:135) ~[webindex-data-0.0.1-SNAPSHOT.jar:na]
        at io.fluo.recipes.map.CollisionFreeMap.process(CollisionFreeMap.java:203) ~[fluo-recipes-core-1.0.0-beta-1-SNAPSHOT.jar:1.0.0-beta-1-SNAPSHOT]
        at io.fluo.recipes.map.CollisionFreeMapObserver.process(CollisionFreeMapObserver.java:44) ~[fluo-recipes-core-1.0.0-beta-1-SNAPSHOT.jar:1.0.0-beta-1-SNAPSHOT]
        at io.fluo.core.worker.WorkTask.run(WorkTask.java:69) ~[fluo-core-1.0.0-beta-2-SNAPSHOT.jar:1.0.0-beta-2-SNAPSHOT]
        at io.fluo.core.worker.NotificationProcessor$2.run(NotificationProcessor.java:131) [fluo-core-1.0.0-beta-2-SNAPSHOT.jar:1.0.0-beta-2-SNAPSHOT]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_51]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_51]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_51]
Caused by: java.text.ParseException: Invalid host: whattoexpect.co.au
        at io.fluo.webindex.data.util.LinkUtil.createURL(LinkUtil.java:42) ~[webindex-data-0.0.1-SNAPSHOT.jar:na]
        at io.fluo.webindex.data.util.LinkUtil.getHost(LinkUtil.java:78) ~[webindex-data-0.0.1-SNAPSHOT.jar:na]
        at io.fluo.webindex.data.util.LinkUtil.hasIP(LinkUtil.java:87) ~[webindex-data-0.0.1-SNAPSHOT.jar:na]
        at io.fluo.webindex.data.util.LinkUtil.getReverseTopPrivate(LinkUtil.java:98) ~[webindex-data-0.0.1-SNAPSHOT.jar:na]
        at io.fluo.webindex.data.fluo.UriMap$UriUpdateObserver.getDomain(UriMap.java:146) ~[webindex-data-0.0.1-SNAPSHOT.jar:na]
        ... 8 common frames omitted
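
For reference, the host itself is well formed. Below is a small standalone check, assuming Guava's public-suffix-aware InternetDomainName is on the classpath (this is not necessarily what LinkUtil uses):

import com.google.common.net.InternetDomainName;

public class DomainCheck {
  public static void main(String[] args) {
    // co.au is a public suffix, so the top private (registered) domain of
    // this host is whattoexpect.co.au itself.
    InternetDomainName idn = InternetDomainName.from("whattoexpect.co.au");
    System.out.println(idn.isUnderPublicSuffix()); // true
    System.out.println(idn.topPrivateDomain());    // whattoexpect.co.au
  }
}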

Encapsulate index table

It would be nice to create an internal API for the external webindex search table. Fluo and Spark code write to this table. Web app code reads from it. The code for interacting with this external table is spread far and wide. It would be nice to bring that code together into one place and have a simple API for it. This would have made a change like #71 easier to make and, more importantly, to test. I manually tested the changes for #71. If the webindex search table had its own internal API, then that could have tests.
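
A possible shape for such an API; the interface name and methods below are purely illustrative:

import java.util.Collection;
import java.util.List;
import org.apache.accumulo.core.data.Mutation;

// Hypothetical internal API for the external search table. Spark and Fluo
// code would go through the write side, the web app through the read side,
// so the table layout lives in one place and can be tested directly.
public interface IndexClient {

  // write side: used by Spark initialization and Fluo export code
  Collection<Mutation> createPageMutations(String uri, long inboundLinkCount);

  // read side: used by the web application
  List<String> getTopLinkedPages(String domain, int max);

  long getInboundLinkCount(String uri);
}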

Link parsing is CPU intensive

While running webindex on EC2 I have noticed the link parsing done by the load task is very CPU intensive. This is usually the bottleneck for loading data when running one load task per node.

For example, on a 20 node m3.xlarge EC2 cluster with 20 load tasks running, the maximum load rate is around 1000 pages/sec. As load on the system increases from having more data (caused by compactions, etc.), this takes more CPU and causes the load rate to drop.

Create cluster verification test

The test should perform the following steps:

  1. Using Spark, initialize Fluo & Accumulo table 1 with dataset A
  2. Using Fluo, incrementally add dataset B to Fluo table 1 which will export results to Accumulo table 1
  3. Using Spark, initialize Fluo & Accumulo table 2 with dataset A+B
  4. Verify that the Accumulo and Fluo tables 1 & 2 match

Use versions of software on system

For the shaded jar, we need to ensure that the versions of Fluo, Accumulo, Hadoop, and Spark installed on the system are used. For Spark and Hadoop this can be accomplished by excluding them from the shaded jar. For Fluo and Accumulo, the fluo version and accumulo version commands can be used when building the shaded jar.

Just made changes like this for stresso:

astralway/stresso#45
astralway/stresso#46

Create more tasks for Spark load job

While running a very long webindex run on EC2, I noticed the load job created 20 tasks, each with hundreds of files to load. Each file takes a while to process. The finish times of the tasks were highly skewed. Many more tasks with fewer files per task would be better.

Need more buckets but not more tablets

While working on apache/fluo#593, I was testing on a 10 node EC2 cluster. I did not want lots of tablets, so I lowered that config; however, that meant the collision-free map and export queue had a few large buckets. Processing these buckets resulted in very large transactions. It would be better to have the option of a few tablets and lots of buckets.

Use maven 'copy-dependencies' to copy jars to Fluo application lib

Currently webindex init copies needed jars to the Fluo app lib using the Maven command below:

mvn dependency:get -Dartifact=org.apache.fluo:fluo-recipes-core:1.0.0-incubating-SNAPSHOT:jar -Ddest=$FLUO_APP_LIB

It would be better if the version did not have to be included, as in the command below (copied from the change @keith-turner made to stresso):

mvn dependency:copy-dependencies -DincludeArtifactIds=fluo-recipes-core  -DoutputDirectory=$FLUO_APP_LIB

Rewrite spark Job to use POJOs

The Spark code deals with RowColumns and Values. I think it would be simpler if it dealt with POJOs, especially after #16. What the Spark job is doing made a lot more sense before #16.

If the code were reworked to use POJOs, the RDDs of POJOs could then be converted into the Fluo and query tables at the end of processing.
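
As a rough illustration, here is the kind of POJO the Spark job might carry in its RDDs before building Fluo and query table data; the class and its fields are made up, not existing webindex classes:

import java.io.Serializable;

// Hypothetical page POJO. Instances would live in RDDs during processing and
// only be converted to RowColumn/Value pairs for the Fluo and query tables at
// the very end of the job.
public class PageInfo implements Serializable {

  private static final long serialVersionUID = 1L;

  private final String uri;
  private final String domain;
  private long inboundLinks;
  private long outboundLinks;

  public PageInfo(String uri, String domain) {
    this.uri = uri;
    this.domain = domain;
  }

  public String getUri() { return uri; }
  public String getDomain() { return domain; }
  public long getInboundLinks() { return inboundLinks; }
  public long getOutboundLinks() { return outboundLinks; }

  public void setInboundLinks(long n) { inboundLinks = n; }
  public void setOutboundLinks(long n) { outboundLinks = n; }
}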

Remove 'test-id' and 'test' commands

These were created to simplify running WebIndex. With the creation of the dev server (run via webindex dev), they are no longer needed for several reasons:

  1. For simple testing, users will probably just use webindex dev
  2. For complicated tests, developers can just create a script to replace the functionality of test and test-id.
  3. They add a lot of complexity to the webindex script.

Reintroduce reindex command

This command was removed in #49. The command did not work when it was removed; it was probably broken by #16.

The command would rebuild the index table using the Fluo table. Need to consider the implications of outstanding notifications when implementing the command. Can the command use the url and domain maps?

Monitor WebIndex application metrics in InfluxDB/Grafana

Code should be added to WebIndex to send metrics to InfluxDB (using dropwizard) and a dashboard set up in Grafana to view them.

Below are some possible metrics to send:

  • Pages ingested
  • Unique pages (ingested + linked to)
  • Links found
  • Inbound links
  • Outbound links
  • Unique domains

These metrics could be sent using both dropwizard Meters (for rates) and Counters (for totals).
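
A minimal sketch of what registering these with dropwizard might look like; the metric names and wrapper class are hypothetical, and wiring a reporter to InfluxDB would be done separately:

import com.codahale.metrics.Counter;
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;

// Hypothetical metrics wrapper: a Meter for the page ingest rate and Counters
// for running totals of links and domains.
public class WebIndexMetrics {

  private final Meter pagesIngested;
  private final Counter linksFound;
  private final Counter uniqueDomains;

  public WebIndexMetrics(MetricRegistry registry) {
    pagesIngested = registry.meter("webindex.pagesIngested");
    linksFound = registry.counter("webindex.linksFound");
    uniqueDomains = registry.counter("webindex.uniqueDomains");
  }

  public void pageIngested(int linkCount) {
    pagesIngested.mark();       // rate of pages ingested
    linksFound.inc(linkCount);  // running total of links found
  }

  public void newDomainSeen() {
    uniqueDomains.inc();
  }
}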

Fluo code incorrectly computing domain counts

While working on #49, I realized the Fluo code may be computing domain counts incorrectly. I think it's computing the number of links to a domain instead of the number of unique URIs seen in a domain.
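
Purely to illustrate the distinction (this is toy code, not webindex code): counting every inbound link to a domain versus counting distinct URIs seen in that domain.

import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy example of the two computations the issue describes.
public class DomainCounts {

  // grows by one for every link, even repeated links to the same URI
  private final Map<String, Long> linksToDomain = new HashMap<>();

  // intended semantics: number of distinct URIs observed in the domain
  private final Map<String, Set<String>> urisInDomain = new HashMap<>();

  public void observeLink(String domain, String uri) {
    linksToDomain.merge(domain, 1L, Long::sum);
    urisInDomain.computeIfAbsent(domain, d -> new HashSet<>()).add(uri);
  }

  public long domainCount(String domain) {
    return urisInDomain.getOrDefault(domain, Collections.emptySet()).size();
  }
}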
