stratosphere / stratosphere.github.io

This repository hosts the stratosphere.eu website.

Home Page: stratosphere.eu

HTML 32.89% Python 0.45% TeX 1.75% CSS 53.85% JavaScript 10.98% Shell 0.09%

stratosphere.github.io's Introduction

Repository Has Moved

Stratosphere is now an Apache incubator project and has been renamed to Apache Flink.

This repository will not be maintained anymore.

Please move to the following GitHub repository:

git clone https://github.com/apache/incubator-flink.git

Thanks!

If you have an existing clone of the old Stratosphere repository, you can update your remote to point to the new repository:

git remote set-url origin https://github.com/apache/incubator-flink.git

stratosphere.github.io's People

rmetzger twalthr moewex physikerwelt aheise qmlmoon markus-h aalexandrov tillrohrmann mariemayadi ktzoumas parshimers codewithashu

stratosphere.github.io's Issues

Make it clearer that Stratosphere is not built upon Hadoop

The website does not emphasize enough that Stratosphere has its own MapReduce runtime and does not use Hadoop's MapReduce.

At the beginning, I also thought that Stratosphere uses Hadoop.

Some statements like

"It combines the strengths of MapReduce/Hadoop with powerful programming abstractions in Java" on the first page.

and

"Stratosphere for Hadoop 1/Hadoop 2" in the download section are very misleading.

In my opinion, the download section should drop the term "Hadoop" entirely and use "HDFS" instead.

Maybe we can prevent questions like the most recent one:
https://groups.google.com/forum/#!topic/stratosphere-dev/-WSxxtsdCSo

Write migration guide from record API to JAPI

The program skeleton section should list the package where the core API classes are contained.

Please mention that the environments and data sets are in eu.stratosphere.api.java.

Move CeBIT and Bitkom events to "past" events.

Add description how to use SNAPSHOT quickstarts

The snapshot quickstarts are only available from Sonatype, not from Maven Central. To use the 0.5-SNAPSHOT quickstarts directly, one needs to add the Sonatype repository to the known repositories.

The website currently does not describe how to do so.

Text for Introduction

I cannot make a pull request for this section, because I am on mobile internet and cannot afford to clone the stratosphere.github.io repository. I have pasted my text below. Sorry Ufuk, for causing additional work.

Introduction

Analysis programs in Stratosphere are regular Java programs that implement transformations on data sets (e.g., filtering, mapping, joining, grouping). The data sets are initially created from certain sources (e.g., by reading files, or from collections). The results are returned by sinks, which may for example write the data to (distributed) files, or print it to the command line. The sections on the program skeleton and transformations show the general template of a program and describe the available transformations.

Stratosphere programs can run in a variety of contexts, for example locally as standalone programs, locally embedded in other programs, or on clusters of many machines (see [program skeleton] for how to define different environments). All programs are executed lazily: When the program is run and a transformation method on a data set is invoked, the call only creates a specific transformation operation. That transformation operation is executed only once program execution is triggered on the environment. Whether the program is executed locally or on a cluster depends on the environment of the program.
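The lazy behaviour described above can be illustrated with an analogy. The sketch below uses Java 8's java.util.stream, not the Stratosphere API: declaring the pipeline does no work, and the terminal operation stands in for triggering execution on the environment. The class and method names are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Analogy only: java.util.stream stands in for a lazily executed data flow.
// None of these names are part of the Stratosphere API.
class LazyExecutionSketch {

    // Records which elements the filter has actually looked at.
    static final List<String> trace = new ArrayList<>();

    // Declares a pipeline (source -> filter -> map) without running it.
    static Stream<String> definePipeline() {
        return Arrays.asList("to", "be", "or", "not").stream()
                .filter(w -> {
                    trace.add(w);
                    return w.length() == 2;
                })
                .map(String::toUpperCase);
    }

    public static void main(String[] args) {
        Stream<String> pipeline = definePipeline();
        System.out.println("after definition: " + trace.size() + " elements processed");

        // The terminal operation plays the role of triggering execution.
        List<String> result = pipeline.collect(Collectors.toList());
        System.out.println("after execution: " + trace.size() + " elements processed");
        System.out.println(result);
    }
}
```

The first line printed reports 0 processed elements and the second reports 4, which shows that the filter runs only when the result is requested.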

In contrast to Stratosphere's Record API, the Java API is strongly typed: All data sets and transformations accept typed elements rather than generic records. This allows typing errors to be caught very early, at compile time, and supports safe refactoring of programs.
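To make the difference concrete, here is a small plain-Java sketch. It does not use Stratosphere's record or tuple classes; WordCount and the Object[] records are hypothetical stand-ins that only contrast positional, cast-based field access with typed fields.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustration only: neither representation below is a Stratosphere class.
class TypedVersusUntypedSketch {

    // Hypothetical typed element: field types are known to the compiler.
    static final class WordCount {
        final String word;
        final int count;

        WordCount(String word, int count) {
            this.word = word;
            this.count = count;
        }
    }

    // Untyped style: fields are positional Objects and every read needs a cast.
    static int sumUntyped(List<Object[]> records) {
        int sum = 0;
        for (Object[] r : records) {
            sum += (Integer) r[1]; // fails only at run time if field 1 is not an Integer
        }
        return sum;
    }

    // Typed style: storing a String in 'count' would not compile.
    static int sumTyped(List<WordCount> records) {
        int sum = 0;
        for (WordCount r : records) {
            sum += r.count;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumTyped(Arrays.asList(new WordCount("to", 2), new WordCount("be", 1))));

        // The count was accidentally stored as a String.
        List<Object[]> bad = Collections.singletonList(new Object[] {"to", "2"});
        try {
            sumUntyped(bad);
        } catch (ClassCastException e) {
            System.out.println("untyped record failed at run time");
        }
    }
}
```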

Cross-reference inside the Java/Scala guide how to use latest snapshot versions

Debian package for release-0.5

Can someone please confirm that the Debian package has been updated for release-0.5?

"Edit this page" still points to the old repository

CoGroup example in new JAPI documentation outdated?

In: http://stratosphere.eu/docs/0.5/programming_guides/java.html

remove "public" from MyCoGrouper or insert "static" and wrap plan construction code in main() method
"reduceGroup" should be "with"

Wrong Quickstart script on website linked

http://stratosphere.eu/quickstart/java.html
the curl ... .sh url is wrong

Add information on how to add SNAPSHOT version as a maven dependency to download page

Inconveniences with navbar

I think more people will have noticed the following issue with the navbar on the left:

It still kinda works, but I find it inconvenient. The problem also occurred before, but it wasn't so bad. I suspect that now almost everybody working on a laptop will experience it... especially since people will likely look into the Java API.

The questions are the following:

Did you experience the problem? How inconvenient is this for you?
Should we move the subnavigation to the top of each article (as seen here)? This would also allow more fine-grained sub-bullets when needed (e.g. data transformations => map, reduce, etc.). But we would lose the option to easily jump around in an article, as you would have to go back to the top every time.

GitHub page improvements

Just got this response from GitHub after pushing a small update:

The page build completed successfully, but returned the following warning:

GitHub Pages recently underwent some improvements (https://github.com/blog/1715-faster-more-awesome-github-pages) to make your site faster and more awesome, but we've noticed that stratosphere.eu isn't properly configured to take advantage of these new features. While your site will continue to work just fine, updating your domain's configuration offers some additional speed and performance benefits. Instructions on updating your site's IP address can be found at https://help.github.com/articles/setting-up-a-custom-domain-with-github-pages#step-2-configure-dns-records, and of course, you can always get in touch with a human at [email protected]. For the more technical minded folks who want to skip the help docs: your site's DNS records are pointed to a deprecated IP address.

For information on troubleshooting Jekyll see:

https://help.github.com/articles/using-jekyll-with-pages#troubleshooting

If you have any questions please contact us at https://github.com/contact.

Hadoop compat docs are missing

docs/0.5/programming_guides/hadoop_compatability.html is not finished yet.

Answer how Stratosphere compares to Apache Spark

This message from our mailing list, posted by @fhueske might be a good skeleton:

Similar to Spark, Stratosphere is a complete data processing system, i.e., it has a programming API, a program compiler (optimizer), and its own execution runtime.
It is also an alternative to Hadoop MapReduce and, in several design points, quite similar to Spark:

Programs are executed as DAGs
Higher-level programming primitives (compared to Hadoop MR)
APIs in Scala and Java
Reads data from external data stores (has no own data storage), e.g., HDFS, S3, RDBMS, ...

However, Stratosphere is also different in some aspects:

Database-inspired processing using pipelining, gradually going to disk if memory is not sufficient (Hybridhash Joins, external sorts)
Sophisticated cost-based optimizer choosing execution strategies (broadcasting vs. partitioning, sort vs. hash joins, ...)
Implemented in Java (in contrast to Spark which uses Scala)
No intermediate result materialization in memory (this is on the roadmap)

Stratosphere and Spark are best seen as alternatives.
We do not build on any of Spark's components, as we have our own programming API and execution engine.

Cluster execution docs are missing

docs/0.5/program_execution/cluster_execution.html is not finished yet

Add link to stratosphere-javadocs.github.io

Documentation on cli frontend not up-to-date

I think you don't have to specify the "run" command anymore:

http://stratosphere.eu/docs/0.5/program_execution/cli_client.html

Update frontpage for 0.5 release

Highlight Java API section "Powerful Programming Interfaces"
Address #34
Check for correctness

Add download link to source code of a release.

Update Java Quickstart for new Java API

Update Scala API - Delta Iteration

The Delta Iteration documentation in the Scala API should be updated.

It only says:
"This is tad bit prototypical right now. Please contact us through one of the channels here if you are interested in working with it."

We should at least remove the link to the contact page and add links to the general iteration documentation and the Delta Iteration Scala Example.

Update example documentation on website

The example section needs to be updated for the new Java API and refactored Java examples.

We need to update the example documentation on the website for this.
We should also link from the API documentation to examples that show the API features in action.

Links in the new Java API accumulators section point to the 0.4 release

Add "recent news" / "recent blog posts" section to front page

I can help with the required Jekyll code, but I have no clue how to nicely integrate it.

Text for Packaging a Program Section

## Program Packaging & Distributed Execution

As described in the program skeleton section, Stratosphere programs can be executed on clusters (or local mini clusters) by using the RemoteEnvironment. Alternatively, programs can be packaged into JAR files (Java Archives) for execution. Packaging a program is a prerequisite for executing it through the [command line interface](link to CLI docs) or the [web client](link to web client docs).

Packaging Programs

To support execution from a packaged JAR file via the command line interface or the web client, a program must use the environment obtained by ExecutionEnvironment.getExecutionEnvironment(). This environment will act as the cluster's environment when the JAR is submitted to the command line interface or the web client. If the Stratosphere program is invoked differently than through these interfaces, the environment will act like a local environment.

To package the program, simply export all involved classes as a JAR file. The JAR file's manifest must point to the class that contains the program's entry point (the class with the public static void main(String[]) method). The simplest way to do this is by putting the main-class entry into the manifest (such as main-class: eu.stratosphere.example.MyProgram). The main-class attribute is the same one that the Java Virtual Machine uses to find the main method when executing a JAR file through the command java -jar pathToTheJarFile. Most IDEs offer to include that attribute automatically when exporting JAR files.
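As a minimal sketch of what the manifest entry looks like programmatically, the following uses only the JDK's java.util.jar classes to build a manifest with a Main-Class attribute and read it back. The class name ManifestMainClassSketch and its helper methods are made up for this illustration; the entry point name is the example from the text above. Manifest attribute names are case-insensitive, so main-class and Main-Class refer to the same attribute.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

class ManifestMainClassSketch {

    // Builds a manifest whose Main-Class attribute names the entry point class.
    static Manifest manifestFor(String mainClass) {
        Manifest manifest = new Manifest();
        Attributes attrs = manifest.getMainAttributes();
        // Manifest.write requires Manifest-Version to be set in the main attributes.
        attrs.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        attrs.put(Attributes.Name.MAIN_CLASS, mainClass);
        return manifest;
    }

    // Serializes and re-parses the manifest, then returns the entry point it names.
    static String roundTripMainClass(Manifest manifest) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            manifest.write(out);
            Manifest parsed = new Manifest(new ByteArrayInputStream(out.toByteArray()));
            return parsed.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Manifest manifest = manifestFor("eu.stratosphere.example.MyProgram");
        manifest.write(System.out); // the text that would end up in META-INF/MANIFEST.MF
        System.out.println(roundTripMainClass(manifest));
    }
}
```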

Packaging Programs through Plans

The Java API additionally supports packaging programs as Plans. This method resembles the way that the Record API and Scala API package programs. Instead of defining a program in the main method and calling execute() on the environment, plan packaging returns the Program Plan, which is a description of the program's data flow. To do that, the program must implement the eu.stratosphere.api.common.Program interface, defining the getPlan(String...) method. The strings passed to that method are the command line arguments. The program's plan can be created from the environment via the ExecutionEnvironment#createProgramPlan() method. When packaging the program's plan, the JAR manifest must point to the class implementing the eu.stratosphere.api.common.Program interface, instead of the class with the main method.

Summary

The overall procedure to invoke a packaged program is as follows:

The JAR's manifest is searched for a main-class or program-class attribute. If both attributes are found, the program-class attribute takes precedence over the main-class attribute. Both the command line client and the web client support a parameter to pass the entry point class name manually for cases where the JAR manifest contains neither attribute.
If the entry point class implements the eu.stratosphere.api.common.Program interface, then the system calls the getPlan(String...) method to obtain the program plan and executes that plan. The getPlan(String...) method was the only possible way of defining a program in the Record API and is also supported in the new Java API.
If the entry point class does not implement the eu.stratosphere.api.common.Program interface, the system will invoke the class' main method.
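The procedure above can be sketched in plain Java. This is not the client's actual implementation: the Program interface below is a local stand-in for eu.stratosphere.api.common.Program (returning a String instead of a plan), the class and method names are hypothetical, and treating the manually passed class name purely as a fallback is an assumption.

```java
import java.util.jar.Attributes;
import java.util.jar.Manifest;

class EntryPointResolutionSketch {

    // Hypothetical stand-in for eu.stratosphere.api.common.Program; the real
    // interface returns a plan object, which is replaced by a String here.
    interface Program {
        String getPlan(String... args);
    }

    // Example entry point that describes its data flow as a plan.
    static class PlanProgram implements Program {
        public String getPlan(String... args) {
            return "plan with " + args.length + " argument(s)";
        }
    }

    // Example entry point that only has a main method.
    static class MainProgram {
        public static void main(String[] args) {
            System.out.println("running main");
        }
    }

    // Step 1: program-class takes precedence over main-class. Assumption: the
    // manually supplied name is used only when the manifest has neither.
    static String entryPointName(Manifest manifest, String manualClassName) {
        Attributes attrs = manifest.getMainAttributes();
        String programClass = attrs.getValue("program-class");
        if (programClass != null) {
            return programClass;
        }
        String mainClass = attrs.getValue("main-class");
        return mainClass != null ? mainClass : manualClassName;
    }

    // Steps 2 and 3: decide which method of the entry point class is invoked.
    static String invocationMode(Class<?> entryPoint) {
        return Program.class.isAssignableFrom(entryPoint) ? "getPlan" : "main";
    }

    public static void main(String[] args) {
        Manifest manifest = new Manifest();
        manifest.getMainAttributes().putValue("main-class", "example.MainProgram");
        manifest.getMainAttributes().putValue("program-class", "example.PlanProgram");

        System.out.println(entryPointName(manifest, null));
        System.out.println(invocationMode(PlanProgram.class));
        System.out.println(invocationMode(MainProgram.class));
    }
}
```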

Quickstart broken?

$ bin/stratosphere run \
    --jarfile ./examples/stratosphere-java-examples-0.5-WordCount.jar \
    --arguments 1 file://`pwd`/hamlet.txt file://`pwd`/wordcount-result.txt

should be

$ bin/stratosphere run \
    --jarfile ./examples/stratosphere-java-examples-0.5-WordCount.jar \
    file://`pwd`/hamlet.txt file://`pwd`/wordcount-result.txt