
skrusche63 / spark-fsm


This project provides sequential pattern mining for Apache Spark. The algorithms are based on the work of Philippe Fournier-Viger and comprise his SPADE and TSR implementations. This makes it possible to mine both sequential patterns and sequential rules.

Home Page: http://predictiveworks.eu

ApacheConf 2.08% Scala 97.92%

spark-fsm's Introduction

Elasticworks.

Predictiveworks. is an open ensemble of predictive engines and has been made to cover a wide range of today's analytics requirements. Predictiveworks. brings the power of predictive analytics to Elasticsearch.

Reactive Series Analysis Engine

The Series Analysis Engine is one of the nine members of the open ensemble and is built to support sequential pattern mining with a new and redefined mining algorithm. The approach overcomes the well-known "threshold problem" and makes it a lot easier to directly leverage the resulting patterns and rules.

Sequential pattern mining is an important mining technique with a wide range of real-life applications. It has been found very useful in domains such as

  • market basket analysis
  • marketing strategy
  • medical treatment
  • natural disaster
  • user behavior analysis

and more.

It extends the concept of association rule mining and solves the problem of discovering statistically relevant patterns in big datasets that contain time-ordered sequences of data.

Market Basket Analysis

In market basket analysis, a sequence is built from the transactions of the customers, ordered by the transaction time. The most common interpretation of a transaction is that of a collection of the items a particular customer ordered (itemset).

Sequential patterns are particularly interesting in market basket analysis because they capture inter-transaction correlations. For example, they reveal which items are frequently bought one after another.
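
As a minimal sketch of how such sequences can be built, the following Scala snippet groups transactions by customer and orders them by transaction time. The Transaction type and its field names are assumptions made for this illustration and are not taken from the project's code.

```scala
object BasketSequences {
  // One purchase event: who bought which items, and when (epoch millis).
  // Hypothetical schema, used only for this illustration.
  final case class Transaction(customer: String, timestamp: Long, items: Set[Int])

  // Group the transactions per customer and order them by time,
  // yielding one sequence of itemsets per customer.
  def toSequences(txs: Seq[Transaction]): Map[String, Vector[Set[Int]]] =
    txs.groupBy(_.customer).map { case (customer, ts) =>
      customer -> ts.sortBy(_.timestamp).map(_.items).toVector
    }
}
```

Each resulting value is a time-ordered list of itemsets, which is the input shape a sequential pattern miner expects.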

Product Recommendations

Recommendation engines are often built from rating data, provided by customers who were asked to rate a certain product or service. From such data, the customer engagement for all products or services of a company can be derived. Items with the highest engagement are then used for a recommendation, ideally after filtering out those that are already in the customer's cart or were purchased in recent transactions.

An alternative is to discover those items that were frequently bought together in the past. The respective relations between these products are derived from association rule mining and result in recommendations such as

Customers who looked at or bought these items also looked at or bought those items.

Customer purchase behavior, and in particular the sequence of purchases, is an excellent indicator of the customer's (hidden) intent. Recommendations that also take the relations between purchase sequences into account therefore reflect customer behavior much better than other techniques.

The Sequence Mining Engine discovers the top sequential rules for item sequences that are often found together and provides product recommendations from these rules.
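
One way to turn mined rules into recommendations is sketched below. This is not the engine's actual retrieval logic: the Rule type, the ranking by confidence and then support, and the recommend function are assumptions chosen for illustration.

```scala
object RuleRecommender {
  // A sequential rule X => Y with its support and confidence.
  // Hypothetical type, not the engine's own data model.
  final case class Rule(antecedent: Set[Int], consequent: Set[Int], support: Int, confidence: Double)

  // Recommend items from rules whose antecedent is fully covered by the
  // items the customer already bought, and skip items the customer owns.
  def recommend(rules: Seq[Rule], bought: Set[Int], limit: Int): Seq[Int] =
    rules
      .filter(r => r.antecedent.subsetOf(bought))
      // Rank by confidence first, then by support (assumed ranking).
      .sortWith((a, b) =>
        a.confidence > b.confidence ||
          (a.confidence == b.confidence && a.support > b.support))
      .flatMap(_.consequent.toSeq.sorted)
      .filterNot(bought.contains)
      .distinct
      .take(limit)
}
```

Filtering out items the customer already owns mirrors the cart filtering mentioned above for rating-based recommenders.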

Web Usage Mining

In the context of web mining, especially web usage mining, companies need to understand what motivates their customers to purchase and how to influence the buying process to develop successful promotional activities.

Evaluating web sessions as time-ordered sequences of page visits (within a certain time period) helps, for example, to understand similarities between click-streams much better than treating sessions as sets of page visits. As a result, visitors can be clustered or segmented not only by visited content, but also by their temporal behavior and signatures.
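
A simple way to obtain such time-ordered sessions is to split one visitor's page visits whenever the gap between two consecutive visits exceeds a cut-off. The sketch below assumes timestamps in milliseconds; the PageVisit type and the maxGapMillis parameter are hypothetical names introduced for this example.

```scala
object Sessionizer {
  // One page visit of a visitor. Hypothetical schema for illustration.
  final case class PageVisit(visitor: String, timestamp: Long, page: String)

  // Split the visits of a single visitor into sessions: a new session starts
  // whenever the gap to the previous visit exceeds maxGapMillis.
  def sessions(visits: Seq[PageVisit], maxGapMillis: Long): Vector[Vector[String]] = {
    val ordered = visits.sortBy(_.timestamp)
    ordered.foldLeft((Vector.empty[Vector[String]], Option.empty[Long])) {
      case ((acc, last), v) =>
        // The very first visit always opens a session.
        val startNew = last.forall(t => v.timestamp - t > maxGapMillis)
        val next =
          if (startNew) acc :+ Vector(v.page)
          else acc.init :+ (acc.last :+ v.page)
        (next, Some(v.timestamp))
    }._1
  }
}
```

Each session keeps the order of the visited pages, so it can be mined as a sequence rather than as an unordered set.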


Sequential Pattern Discovery using Equivalence Classes (SPADE)

SPADE is a fast and efficient algorithm to discover frequent sequential patterns from large databases. It utilizes combinatorial properties to decompose the mining task into smaller sub-tasks that can be independently solved in memory using efficient lattice search techniques, and using simple join operations.

We adapted the implementation of Philippe Fournier-Viger and made the SPADE algorithm available for Apache Spark.
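
The join operations mentioned above work on a vertical database layout, in which every item carries a list of (sequence id, event id) pairs. The following is a minimal, single-machine sketch of that idea and not the adapted Spark implementation; the object and function names are assumptions for this illustration.

```scala
object IdLists {
  // (sequence id, event id): where a pattern ends within the database.
  type IdList = Set[(Int, Int)]

  // Vertical format: map each item to the (sid, eid) pairs where it occurs.
  def vertical(db: Map[Int, Vector[Set[String]]]): Map[String, IdList] = {
    val entries = for {
      (sid, events) <- db.toSeq
      (itemset, eid) <- events.zipWithIndex
      item <- itemset.toSeq
    } yield (item, (sid, eid))
    entries.groupBy(_._1).map { case (item, es) => item -> es.map(_._2).toSet }
  }

  // Temporal join: occurrences of "a, followed later by b" in the same sequence.
  def temporalJoin(a: IdList, b: IdList): IdList =
    for {
      (sidB, eidB) <- b
      if a.exists { case (sidA, eidA) => sidA == sidB && eidA < eidB }
    } yield (sidB, eidB)

  // Support is the number of distinct sequences that contain the pattern.
  def support(ids: IdList): Int = ids.map(_._1).size
}
```

Because the support of a longer pattern is computed from the id-lists of its parts, no additional pass over the original sequences is needed for the join.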

Top K Sequential Rules (TSR)

In 2011 Philippe Fournier-Viger proposed a new algorithm to discover the Top-K Sequential Rules from a sequence database, similar to Top-K Association Rules from transaction databases.

We adapted Fournier-Viger's original implementation and made his Top-K Sequential Rules algorithm available for Apache Spark.
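
To illustrate what a top-k selection over sequential rules means, the sketch below computes support and confidence of a rule X => Y by brute force and keeps the k candidates with the highest support. It assumes the usual definition in which a rule occurs in a sequence if all items of X appear before all items of Y. It is not the TSR search procedure itself, and all names are hypothetical.

```scala
object SequentialRules {
  type Sequence = Vector[Set[Int]]

  // X => Y occurs in s if some split point k puts all items of X into the
  // first k itemsets and all items of Y into the remaining ones.
  def occurs(x: Set[Int], y: Set[Int], s: Sequence): Boolean =
    (1 until s.length).exists { k =>
      val (head, tail) = s.splitAt(k)
      x.subsetOf(head.flatten.toSet) && y.subsetOf(tail.flatten.toSet)
    }

  // Number of sequences in which the rule occurs.
  def support(x: Set[Int], y: Set[Int], db: Seq[Sequence]): Int =
    db.count(s => occurs(x, y, s))

  // Share of the sequences containing X in which the rule occurs.
  def confidence(x: Set[Int], y: Set[Int], db: Seq[Sequence]): Double = {
    val withX = db.count(s => x.subsetOf(s.flatten.toSet))
    if (withX == 0) 0.0 else support(x, y, db).toDouble / withX
  }

  // Keep the k candidates with the highest support among those reaching minConf.
  def topK(candidates: Seq[(Set[Int], Set[Int])], db: Seq[Sequence],
           k: Int, minConf: Double): Seq[(Set[Int], Set[Int])] =
    candidates
      .filter { case (x, y) => confidence(x, y, db) >= minConf }
      .map { case (x, y) => ((x, y), support(x, y, db)) }
      .sortWith(_._2 > _._2)
      .map(_._1)
      .take(k)
}
```

Asking for the k best rules replaces the minimum support threshold that a user would otherwise have to guess.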


Akka

Akka is a toolkit for building concurrent, scalable applications using the Actor Model. Akka comes with a feature called Akka Remoting, which makes it easy to set up peer-to-peer communication between software components.

Akka is leveraged in this software project to enable external software projects to interact with the Series Analysis Engine. Besides external communication, Akka is also used to implement the internal interaction between the different functional building blocks of the engine:

  • Administration
  • Indexing & Tracking
  • Training
  • Retrieval

Data Sources

The Reactive Series Analysis Engine supports a growing list of data sources. The following data sources are already supported:

  • Cassandra
  • Elasticsearch
  • HBase
  • MongoDB
  • Parquet
  • JDBC databases


spark-fsm's Issues

Errors Installing ElasticInsight with Spark Core

We ran mvn clean install and are getting the following error.

[WARNING] org.scalatest:scalatest_2.10:2.0.M6-SNAP8 requires scala version: 2.10.0
[WARNING] com.twitter:chill_2.10:0.5.0 requires scala version: 2.10.4
[WARNING] Multiple versions of scala libraries detected!
[INFO] /home/development/spark-core/src/main/scala:-1: info: compiling
[INFO] Compiling 66 source files to /home/development/spark-core/target/classes at 1432829938878
[INFO] No known dependencies. Compiling everything
[ERROR] error: error while loading CharSequence, class file '/usr/local/java/jdk1.8.0_45/jre/lib/rt.jar(java/lang/CharSequence.class)' is broken
[INFO](class java.lang.RuntimeException/bad constant pool tag 18 at byte 10)
[ERROR] error: error while loading AnnotatedElement, class file '/usr/local/java/jdk1.8.0_45/jre/lib/rt.jar(java/lang/reflect/AnnotatedElement.class)' is broken
[INFO](class java.lang.RuntimeException/bad constant pool tag 18 at byte 76)
[ERROR] error: error while loading Comparator, class file '/usr/local/java/jdk1.8.0_45/jre/lib/rt.jar(java/util/Comparator.class)' is broken
[INFO](class java.lang.RuntimeException/bad constant pool tag 18 at byte 20)
[ERROR] error: error while loading ConcurrentMap, class file '/usr/local/java/jdk1.8.0_45/jre/lib/rt.jar(java/util/concurrent/ConcurrentMap.class)' is broken
[INFO](class java.lang.RuntimeException/bad constant pool tag 18 at byte 61)
[ERROR] /home/development/spark-core/src/main/scala/de/kp/spark/core/math/CosineSimilarity.scala:45: error: not found: value ClassTag
[ERROR] private def dotProduct(x:Array[Int],y:Array[Int]):Int = x.zip(y).map(e => e._1 * e._2).sum
[ERROR] ^
[ERROR] /home/development/spark-core/src/main/scala/de/kp/spark/core/math/CosineSimilarity.scala:57: error: could not find implicit value for parameter num: Numeric[Int]
[ERROR] math.sqrt(sqrt.sum)
[ERROR] ^
[ERROR] /home/development/spark-core/src/main/scala/de/kp/spark/core/source/handler/SequenceHandler.scala:68: error: overloaded method value replace with alternatives:
ERRORString
ERRORString
[ERROR] cannot be applied to (String, String)
[ERROR] val itemsets = seq.replace("-2", "").split(" -1 ").map(v => v.split(" ").map(_.toInt))
[ERROR] ^
[ERROR] 7 errors found
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 9.956 s
[INFO] Finished at: 2015-05-28T09:19:04-07:00
[INFO] Final Memory: 39M/1008M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.1.3:compile (default) on project spark-core: wrap: org.apache.commons.exec.ExecuteException: Process exited with an error: 1(Exit value: 1) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[development@ip-104-238-99-27 spark-core]$
