Weld

Weld is a language and runtime for improving the performance of data-intensive applications. It optimizes across libraries and functions by expressing the core computations in libraries using a common intermediate representation, and optimizing across each framework.

Modern analytics applications combine multiple functions from different libraries and frameworks to build complex workflows. Even though individual functions can achieve high performance in isolation, the performance of the combined workflow is often an order of magnitude below hardware limits due to extensive data movement across the functions. Weld's approach to this problem is to lazily build up a computation for the entire workflow, and then optimize and evaluate it only when a result is needed.
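The lazy approach described above can be illustrated with a toy sketch in plain Python. This is not Weld's API: the `Lazy` class and its `map` and `evaluate` methods are hypothetical names, used only to show how operations can be recorded first and then run in a single pass once a result is requested.

```python
class Lazy:
    """Toy lazy vector: records element-wise operations instead of running them."""

    def __init__(self, data, ops=()):
        self.data = data
        self.ops = tuple(ops)  # pending element-wise functions, in order

    def map(self, fn):
        # Record the operation; no data is touched yet.
        return Lazy(self.data, self.ops + (fn,))

    def evaluate(self):
        # Apply all recorded operations in one pass over the data,
        # without building an intermediate list per operation.
        out = []
        for x in self.data:
            for fn in self.ops:
                x = fn(x)
            out.append(x)
        return out


pipeline = Lazy([1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 2)
result = pipeline.evaluate()
```

Because nothing runs until `evaluate()` is called, the two `map` steps are applied together for each element, which is the kind of data-movement saving the paragraph above refers to.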

You can join the discussion on Weld on our Google Group or post on the Weld mailing list at [email protected].

Building

To build Weld, you need the latest stable version of Rust and LLVM/Clang++ 6.0.

To install Rust, follow the steps here. You can verify that Rust was installed correctly on your system by typing rustc into your shell. If you already have Rust and rustup installed, you can upgrade to the latest stable version with:

rustup update stable

MacOS LLVM Installation

To install LLVM on macOS, first install Homebrew. Then:

brew install llvm@6

Weld's dependencies require llvm-config on $PATH, so you may need to create a symbolic link so the correct llvm-config is picked up (note that you might need to add sudo at the start of this command):

ln -sf `brew --prefix llvm@6`/bin/llvm-config /usr/local/bin/llvm-config

To make sure this worked correctly, run llvm-config --version. You should see 6.0.x.

Ubuntu LLVM Installation

To install LLVM on Ubuntu, get the LLVM 6.0 sources and then apt-get:

On Ubuntu 16.04 (Xenial):

wget -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add -
sudo apt-add-repository "deb http://apt.llvm.org/xenial/ llvm-toolchain-xenial-6.0 main"
sudo apt-get update
sudo apt-get install llvm-6.0-dev clang-6.0

On Ubuntu 14.04 (Trusty):

wget -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add -
sudo apt-add-repository "deb http://apt.llvm.org/trusty/ llvm-toolchain-trusty-6.0 main"

# gcc backport is required on 14.04, for libstdc++. See https://apt.llvm.org/
sudo apt-add-repository "deb http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu trusty main"
sudo apt-get update
sudo apt-get install llvm-6.0-dev clang-6.0

Weld's dependencies require llvm-config, so you may need to create a symbolic link so the correct llvm-config is picked up. sudo may be required:

ln -s /usr/bin/llvm-config-6.0 /usr/local/bin/llvm-config

To make sure this worked correctly, run llvm-config --version. You should see 6.0.x or newer.

You will also need zlib:

sudo apt-get install zlib1g-dev

Building Weld

With LLVM and Rust installed, you can build Weld. Clone this repository, set the WELD_HOME environment variable, and build using cargo:

git clone https://www.github.com/weld-project/weld
cd weld/
export WELD_HOME=`pwd`
cargo build --release

Weld builds two dynamically linked libraries (.so files on Linux and .dylib files on Mac): libweld and libweldrt.

Finally, run the unit and integration tests:

cargo test

Documentation

The Rust Weld crate is documented here.

The docs/ directory contains documentation for the different components of Weld.

  • language.md describes the syntax of the Weld IR.
  • api.md describes the low-level C API for interfacing with Weld.
  • python.md gives an overview of the Python API.
  • tutorial.md contains a tutorial for how to build a small vector library using Weld.

Python Bindings

Weld's Python bindings are in python, with examples in examples/python.

Grizzly

Grizzly is a subset of Pandas integrated with Weld. Details on how to use Grizzly are in python/grizzly. Some example workloads that make use of Grizzly are in examples/python/grizzly. To run Grizzly, you will also need the WELD_HOME environment variable to be set, because Grizzly needs to find its own native library through this variable.

Testing

cargo test runs unit and integration tests. A test name substring filter can be used to run a subset of the tests:

cargo test <substring to match in test name>

Tools

This repository contains a number of useful command line tools which are built automatically with the main Weld repository, including an interactive REPL for inspecting and debugging programs. More information on those tools can be found under docs/tools.md.

weld's People

Contributors

bathtor, cgmossa, cirla, deepakn94, dobachi, harumichi, hustnn, hvanhovell, jialinding, jjthomas, kaz7, kumagi, mateiz, max-meldrum, mihai-varga, nikhilsimha, paddyhoran, parimarjan, pattern, radujica, rahulpalamuttam, renato2099, rgankema, sarutak, smacke, snakescott, sppalkia, viirya, willcrichton, winding-lines

weld's Issues

Weld APIs for Java

The API should hide the complexity of setting up JNI, etc. from most users.

Composite builder example is broken

When I execute the composite builder example in the repl:

let b1 = appender[i32];
let b2 = appender[i32];
let data = [1, 2, 3];
let bs = for(
  data,
  {b1, b2},
  |bs: {appender[i32], appender[i32]}, i: i64, n: i32| {merge(bs.$0, n), merge(bs.$1, 2 * n)}
);
result(bs)

It fails with the following error:

thread 'main' has overflowed its stack
fatal runtime error: stack overflow
Abort trap: 6

Also note that the example in the documentation is not correct.

I am on Mac OS X 10.11.6.
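For reference, the result this expression appears intended to produce can be emulated in plain Python. This is only a sketch of the expected semantics, assuming each appender collects the values merged into it in order. It does not use Weld, and the function name is made up for illustration.

```python
def composite_builder_reference(data):
    """Emulate two appenders filled by a single loop over `data`."""
    b1, b2 = [], []
    for n in data:
        b1.append(n)       # corresponds to merge(bs.$0, n)
        b2.append(2 * n)   # corresponds to merge(bs.$1, 2 * n)
    return b1, b2          # corresponds to result(bs)


expected = composite_builder_reference([1, 2, 3])
```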

Grizzly is Python 2.7 only

It will be important to run on Python 3, preferably both 2.7 and 3.5/3.6 with a single codebase (the six module helps with this)

Parlib breaks if run from a different directory

The parlib library is not found by the runtime if a file using Weld is run from anywhere but the root cargo directory. For example, the example C API programs fail unless they are run from the topmost directory.

Code to replicate performance metrics

Hi,
This is a compelling library. Do you have any of the code used to generate the reported performance increases over various frameworks mentioned here? I'm particularly curious about the TensorFlow benchmark.

WeldValue destructor

@sppalkia @deepakn94
Should we have this destructor for WeldValue in bindings.py?
It's in the pandas code as well but commented out.

Once the WeldValue object goes out of scope in the Python runtime, Python decides to clean it up.
As a result, it starts interfering with our return value, since we call weld_value_free.

Load libweld.so from $WELD_HOME/target/debug dir in binding.py

Hi,
When I followed the tutorial to have a try, an error was triggered as below:

>>> import numpy
>>> from hello_weld import *
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "hello_weld.py", line 2, in <module>
    from weld.weldobject import *
  File "build/bdist.linux-x86_64/egg/weld/weldobject.py", line 12, in <module>
  File "build/bdist.linux-x86_64/egg/weld/bindings.py", line 30, in <module>
  File "/root/SkyDiscovery/lib/python2.7/ctypes/__init__.py", line 365, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /data/weld/weld/target/debug/libweld.so: cannot open shared object file: No such file or directory
>>> 

I built Weld with the cargo build --release command, and there is no debug directory under the target directory.
I checked the contents of python/weld/bindings.py, which are as follows:

home = os.environ["WELD_HOME"]
if home[-1] != "/":
    home += "/"

path = home + "target/debug/" + path

# Load the Weld Dynamic Library.
weld = CDLL(path)

Is bindings.py meant to be used only in debug mode? Finding libweld.so automatically might be better.
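One way the automatic lookup suggested here could work is sketched below. This is an assumption-laden illustration, not the project's actual fix: `find_weld_library` is a hypothetical helper, and the only layout it relies on is the `target/release` and `target/debug` directories that cargo produces.

```python
import os
import sys


def find_weld_library(home, name="libweld"):
    """Return the path of the first existing build of `name` under `home`.

    Checks target/release before target/debug and raises OSError if
    neither directory contains the library.
    """
    # The shared-library suffix differs between macOS and Linux.
    suffix = ".dylib" if sys.platform == "darwin" else ".so"
    candidates = [
        os.path.join(home, "target", build, name + suffix)
        for build in ("release", "debug")
    ]
    for path in candidates:
        if os.path.isfile(path):
            return path
    raise OSError("could not find %s; tried: %s" % (name, ", ".join(candidates)))
```

With such a helper, the load step could become `CDLL(find_weld_library(os.environ["WELD_HOME"]))`, so either build profile is picked up without editing the bindings.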

Perform the lazy encoding conversion

I found that the memory usage of Grizzly is much larger than that of pandas. Looking into it, I found that this may be caused by the change of encoding type when calling raw_column = np.array(self.df[key], dtype=str).

Could this be optimized by keeping the original encoding type in dataframe[key].values and performing the conversion at runtime (lazy encoding conversion)?

If the approach I proposed is correct, I can take this issue.
Thanks.

parallel optimizations

  1. parallelize result call for vecbuilder and dictbuilder (need to create tasks for "continuation" of result calls)
  2. use a register merger for inner loops (instead of writing to global thread-local pointer)

Memory usage increase continuously

Hi @deepakn94 ,
When I used Grizzly in my Python program, I found that the process was killed automatically. Through debugging, I found that the cause is continuously increasing memory usage.
I extracted the main logic, which just loads data from a CSV file and then runs query operations. Details are below:

import pandas as pd
import grizzly.grizzly as gr
import grizzly.numpy_weld as gn

df = pd.read_csv("total_price_completed.csv")
weld_df = gr.DataFrameWeld(df)
price_df = weld_df[weld_df['name'] == '000001.SZ']
price_list = price_df['open']
result_list = price_list.evaluate(verbose=False)

The above code was executed many times in a for loop, so memory usage reached the limit and the process was killed by the system.
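To quantify growth like this, one option is to record the process's peak resident set size after each iteration. The sketch below uses only the standard `resource` module (Unix only). `peak_rss_per_iteration` is a hypothetical helper, and the lambda is a placeholder workload standing in for the Grizzly query above.

```python
import resource


def peak_rss_per_iteration(fn, iterations):
    """Call `fn` repeatedly and record the process's peak RSS after each call.

    `ru_maxrss` is reported in kilobytes on Linux and in bytes on macOS.
    """
    samples = []
    for _ in range(iterations):
        fn()
        samples.append(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
    return samples


# Placeholder workload; replace it with the query from the report.
samples = peak_rss_per_iteration(lambda: sum(range(1000)), 5)
```

Since `ru_maxrss` is a high-water mark, a series that keeps climbing across many iterations suggests memory that is not being released between runs.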

Separate source files for weld op templates

In the likely case that there are performance-sensitive implementations of common operations (for example, a matrix decomposition), it would be great to have the Weld templates in their own files, so that any bindings to other languages or libraries could link to the preferred implementation without needing Weld-specific knowledge about how sensitive implementation choices are handled internally.

Use flake8 to enforce Python style conventions

$ flake8 python/
python/grizzly/encoders.py:6:1: F401 'subprocess' imported but unused
python/grizzly/encoders.py:8:1: F403 'from weld.weldobject import *' used; unable to detect undefined names
python/grizzly/encoders.py:21:1: E302 expected 2 blank lines, found 1
python/grizzly/encoders.py:49:13: E128 continuation line under-indented for visual indent
python/grizzly/encoders.py:138:13: E128 continuation line under-indented for visual indent
python/grizzly/grizzly.py:6:1: F403 'from weld.weldobject import *' used; unable to detect undefined names
python/grizzly/grizzly.py:65:80: E501 line too long (84 > 79 characters)
python/grizzly/grizzly.py:67:80: E501 line too long (84 > 79 characters)
python/grizzly/grizzlyImpl.py:7:1: F403 'from encoders import *' used; unable to detect undefined names
python/grizzly/grizzlyImpl.py:8:1: F403 'from weld.weldobject import *' used; unable to detect undefined names
python/grizzly/grizzlyImpl.py:81:80: E501 line too long (85 > 79 characters)
python/grizzly/grizzlyImpl.py:201:80: E501 line too long (81 > 79 characters)
python/grizzly/grizzlyImpl.py:202:35: E128 continuation line under-indented for visual indent
python/grizzly/grizzlyImpl.py:208:80: E501 line too long (92 > 79 characters)
python/grizzly/grizzlyImpl.py:241:80: E501 line too long (81 > 79 characters)
python/grizzly/grizzlyImpl.py:242:35: E128 continuation line under-indented for visual indent
python/grizzly/grizzlyImpl.py:274:35: E128 continuation line under-indented for visual indent
python/grizzly/grizzlyImpl.py:357:80: E501 line too long (81 > 79 characters)
python/grizzly/grizzlyImpl.py:358:35: E128 continuation line under-indented for visual indent
python/grizzly/grizzlyImpl.py:359:35: E128 continuation line under-indented for visual indent
python/grizzly/grizzlyImpl.py:388:80: E501 line too long (87 > 79 characters)
python/grizzly/lazyOp.py:3:1: F403 'from weld.weldobject import *' used; unable to detect undefined names
python/grizzly/numpyImplWeld.py:8:1: F403 'from encoders import *' used; unable to detect undefined names
python/grizzly/numpyImplWeld.py:9:1: F403 'from weld.weldobject import *' used; unable to detect undefined names
python/grizzly/numpyImplWeld.py:52:80: E501 line too long (81 > 79 characters)
python/grizzly/numpyImplWeld.py:53:35: E128 continuation line under-indented for visual indent
python/grizzly/numpyImplWeld.py:86:80: E501 line too long (85 > 79 characters)
python/grizzly/numpyImplWeld.py:120:80: E501 line too long (96 > 79 characters)
python/grizzly/numpyImplWeld.py:128:80: E501 line too long (85 > 79 characters)
python/grizzly/numpyImplWeld.py:129:35: E128 continuation line under-indented for visual indent
python/grizzly/numpyImplWeld.py:129:80: E501 line too long (89 > 79 characters)
python/grizzly/numpyWeld.py:5:1: F403 'from weld.weldobject import *' used; unable to detect undefined names
python/grizzly/numpyWeld.py:83:80: E501 line too long (86 > 79 characters)
python/grizzly/numpyWeld.py:89:80: E501 line too long (86 > 79 characters)
python/weld/__init__.py:2:1: F401 'bindings' imported but unused
python/weld/__init__.py:3:1: F401 'encoders' imported but unused
python/weld/__init__.py:4:1: F401 'types' imported but unused
python/weld/__init__.py:5:1: F401 'weldobject' imported but unused
python/weld/bindings.py:5:1: F403 'from ctypes import *' used; unable to detect undefined names
python/weld/bindings.py:8:1: F401 'os' imported but unused
python/weld/bindings.py:25:1: E302 expected 2 blank lines, found 1
python/weld/bindings.py:25:30: E701 multiple statements on one line (colon)
python/weld/bindings.py:26:1: E302 expected 2 blank lines, found 0
python/weld/bindings.py:26:28: E701 multiple statements on one line (colon)
python/weld/bindings.py:27:1: E302 expected 2 blank lines, found 0
python/weld/bindings.py:27:29: E701 multiple statements on one line (colon)
python/weld/bindings.py:29:1: E302 expected 2 blank lines, found 1
python/weld/bindings.py:32:80: E501 line too long (82 > 79 characters)
python/weld/bindings.py:41:80: E501 line too long (97 > 79 characters)
python/weld/encoders.py:6:1: F403 'from types import *' used; unable to detect undefined names
python/weld/encoders.py:13:1: E302 expected 2 blank lines, found 1
python/weld/encoders.py:25:1: E302 expected 2 blank lines, found 1
python/weld/encoders.py:57:1: E302 expected 2 blank lines, found 1
python/weld/encoders.py:58:6: E111 indentation is not a multiple of four
python/weld/encoders.py:59:10: E111 indentation is not a multiple of four
python/weld/encoders.py:60:10: E111 indentation is not a multiple of four
python/weld/encoders.py:61:10: E111 indentation is not a multiple of four
python/weld/types.py:8:1: F403 'from ctypes import *' used; unable to detect undefined names
python/weld/types.py:17:1: W293 blank line contains whitespace
python/weld/types.py:34:1: W293 blank line contains whitespace
python/weld/types.py:60:1: W293 blank line contains whitespace
python/weld/types.py:79:1: W293 blank line contains whitespace
python/weld/types.py:123:1: W293 blank line contains whitespace
python/weld/types.py:145:1: W293 blank line contains whitespace
python/weld/types.py:162:1: W293 blank line contains whitespace
python/weld/weldobject.py:8:1: F401 'sys' imported but unused
python/weld/weldobject.py:9:1: F401 'os' imported but unused
python/weld/weldobject.py:10:1: F401 'np' imported but unused
python/weld/weldobject.py:15:1: F403 'from types import *' used; unable to detect undefined names
python/weld/weldobject.py:17:1: E302 expected 2 blank lines, found 1
python/weld/weldobject.py:35:1: E302 expected 2 blank lines, found 1
python/weld/weldobject.py:47:1: E302 expected 2 blank lines, found 1
python/weld/weldobject.py:61:80: E501 line too long (82 > 79 characters)
python/weld/weldobject.py:66:80: E501 line too long (84 > 79 characters)
python/weld/weldobject.py:69:80: E501 line too long (84 > 79 characters)
python/weld/weldobject.py:124:80: E501 line too long (100 > 79 characters)
python/weld/weldobject.py:125:17: E128 continuation line under-indented for visual indent
python/weld/weldobject.py:127:80: E501 line too long (90 > 79 characters)
python/weld/weldobject.py:139:80: E501 line too long (88 > 79 characters)
python/weld/weldobject.py:170:80: E501 line too long (97 > 79 characters)
python/weld/weldobject.py:178:80: E501 line too long (93 > 79 characters)
python/weld/weldobject.py:198:1: W391 blank line at end of file

Memory layouts for string processing

hi folks,

I'm excited about the Weld project. I have been looking at the Weld data structures and the way the runtime interacts with memory, and I have some questions, particularly about non-numeric data.

I see here https://github.com/weld-project/weld/blob/master/python/grizzly/numpy_weld_convertor.cpp#L155 that a Weld string vector is semantically a vector of pointers. While this is one possible way to deal with arrays of variable-length types, I am wondering what it would take to expand to other kinds of non-pointer-based memory layouts, which can yield better processing efficiency for the CPU.

In pandas for example, our likely long term plan is to move toward a "packed" columnar memory model (as specified in Apache Arrow) for strings that is like:

length: 4
validity_bits [0 0 0 0 1 1 1 1] + padding for alignment
offsets: [0, 3, 6, 9, 12] + padding
data: 'foobarbazqux' + padding
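As a rough illustration of the offsets-plus-data part of this layout, here is a minimal plain-Python sketch. It leaves out the validity bitmap and alignment padding, and the function names are hypothetical rather than taken from Arrow or Weld.

```python
def pack_strings(values):
    """Pack a list of str into (offsets, data) with one contiguous byte buffer."""
    offsets = [0]
    data = bytearray()
    for value in values:
        data.extend(value.encode("utf-8"))
        offsets.append(len(data))  # end of this string is the start of the next
    return offsets, bytes(data)


def get_string(offsets, data, i):
    """Read string `i` back by slicing between two consecutive offsets."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


offsets, data = pack_strings(["foo", "bar", "baz", "qux"])
```

Reading element `i` needs only two offset lookups and one slice of the shared buffer, with no per-string pointer to chase.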

Beyond "packing" the strings in a contiguous buffer, you can also dictionary-encode them for better efficiency. I am curious what your plans are along these lines, and whether there are any opportunities for standardizing different string memory layouts (Weld may need to support more than one memory layout) to make it easier for other systems to integrate with Weld.

cc @julienledem
