basho / riak_kv Goto Github PK

Riak Key/Value Store

Makefile 0.01% Erlang 99.86% JavaScript 0.13%

riak_kv's Introduction

riak_kv

Overview

Riak KV is an open source Erlang application that is distributed using the riak_core Erlang library. Riak KV provides a key/value datastore and features MapReduce, lightweight data relations, and several different client APIs.

Quick Start

You must have Erlang/OTP 20 or 22 or later and a GNU-style build system to compile and run riak_kv. The easiest way to utilize riak_kv is by installing the full Riak application available on Github.

Contributing

We encourage contributions to riak_kv from the community.

Fork the riak_kv repository on Github.
Clone your fork or add the remote if you already have a clone of the repository.

git clone [email protected]:yourusername/riak_kv.git
# or
git remote add mine [email protected]:yourusername/riak_kv.git

Create a topic branch for your change.

git checkout -b some-topic-branch

Make your change and commit. Use a clear and descriptive commit message, spanning multiple lines if detailed explanation is needed.
Push to your fork of the repository and then send a pull-request through Github.

git push mine some-topic-branch

A Basho engineer or community maintainer will review your patch and merge it into the main repository or send you feedback.

Testing

# standard tests
./rebar3 do xref, dialyzer, eunit
# property-based tests
./rebar3 as test eqc --testing_budget 600

For a more complete set of tests, update riak_kv in the full Riak application and run any appropriate Riak riak_test groups

riak_kv's People

Contributors

Stargazers

Watchers

Forkers

jonmeredith videlalvaro sbalea seth sonian nicom sgonyea lemenkov websterclay krestenkrab jtuple russelldb footyfish argv0 beerriot taichi dreverri b rampage rzezeski grourk jrecursive trifork carlosbrando dizzyd chids kellymclaughlin truesef-dev automattic fdumontmd sa2ajj erlang-communitivity alinpopa 54chen fygrave ricardobcl mdediana dianandonov rmeritz michajlo s-ben bsparrow435 mborzi nimishzynga jerith booo ericmoritz glickbot gordyt caox semmons99 deadzen hyc macintux clr alepharchives wjossey gktomar jametong chenbk85 shaharbenda d2fn dreyk cmeiklejohn broach nivertech hscic windeye neeraj9 stefan-mees rafaltrojniak arcusfelis titanous lumost chinnurtb uilxy vincenzovitale jcrabtree wudeng dmitrizagidulin waptang nathanaschbacher licenser tothlac chienhsiungyi jplock ryannutley ronmeredithorg kisom mesutcan tiensonqin betashepherd angbrav lixen loucash paulperegud flycodepl tigerzhang enewhuis bmcgavin

riak_kv's Issues

RpbErrorResp errcode should be set more proper value

Once error occurs on PBC API, RpbErrorResp is returned but it doesn't make sense because RIAKC_ERR_GENERAL is enable. It should be set the same value to REST API.

It doesn't matter for client if server returns not found on GET operation. So heavy performance cost exception should not be triggered. If modified is returned, application should choose to trigger to repair data or just to retry get ops. It's not server-side issue.

All human readable fields e.g. errmsg field should be understandable for user. The contents should be tuned in long development cycle, so instability in early release is acceptable. If we requires detailed implementation on error handling, it should be done by reading errcode field.
I suggest we improve current version of PBC API.

HTTP socket isn't closed if 2i or list-keys FSM dies

This problem could be more widespread, but at the least:

If you make an HTTP request for either list-keys or 2i, and the coverage FSM (one of riak_kv_index_fsm or riak_kv_keys_fsm) dies, the HTTP socket remains open. To illustrate the error, try this patch:


diff --git src/riak_kv_keys_fsm.erl src/riak_kv_keys_fsm.erl
index 95608e1..faedaa2 100644
--- src/riak_kv_keys_fsm.erl
+++ src/riak_kv_keys_fsm.erl
@@ -96,6 +96,7 @@ process_results({From, Bucket, Keys},
                 StateData=#state{client_type=ClientType,
                                  from={raw, ReqId, ClientPid}}) ->
     process_keys(ClientType, Bucket, Keys, ReqId, ClientPid),
+    exit(self(), error),
     riak_kv_vnode:ack_keys(From), % tell that vnode we're ready for more
     {ok, StateData};
 process_results({Bucket, Keys},

Provide a more user-friendly CLI

Moved from https://issues.basho.com/show_bug.cgi?id=984, reported by @jpartogi

Currently we need to use the erlang client to interface to riak from command line. But however this is not really user friendly and most of the time enforce the user to understand erlang. Because of this sometimes we need to open the manual.

For example (taking redis CLI as an example): This:

del "groceries" "mine"

is much more user friendly and productive than this:

riakc_pb_socket:delete(Pid, <<"groceries">>, <<"mine">>).

Building deps/erlang_js fails when in 64bit Linux

rebar fails to build nsprpub at riak_kv/deps/erlang_js because of updating CC="$CC -m32" at configure script.

make[3]: Entering directory `/home/kuenishi/src/riak_kv/deps/erlang_js/c_src/nsprpub'
cd config; make -j1 export
make[4]: Entering directory `/home/kuenishi/src/riak_kv/deps/erlang_js/c_src/nsprpub/config'
ccache gcc -m32 -o now.o -c      -Wall -O2 -fPIC  -UDEBUG  -DNDEBUG=1 -DHAVE_VISIBILITY_HIDDEN_ATTRIBUTE=1 -DHAVE_VISIBILITY_PRAGMA=1 -DXP_UNIX=1 -D_GNU_SOURCE=1 -DHAVE_FCNTL_FILE_LOCKING=1 -DLINUX=1 -Di386=1 -D_REENTRANT=1  -DFORCE_PR_LOG -D_PR_PTHREADS -UHAVE_CVAR_BUILT_ON_SEM   now.c
In file included from /usr/include/stdio.h:28:0,
                 from now.c:38:
/usr/include/features.h:323:26: fatal error: bits/predefs.h: No such file or directory
compilation terminated.

Bypassing rebar and directly building nsprpub also fails:

kuenishi@kushana> $ make          # in ~/src/erlang_js/c_src
gunzip -c nsprpub-4.8.tar.gz | tar xf -
patching file nsprpub/lib/tests/Makefile.in
(snip)
checking for pthread_create in -lc... no
checking whether ccache gcc -m32 accepts -pthread... no
checking whether ccache gcc -m32 accepts -pthreads... no
creating ./config.status
creating Makefile
(snip)
creating pr/src/pthreads/Makefile
make[1]: Entering directory `/home/kuenishi/src/erlang_js/c_src/nsprpub'
cd config; make -j1 export
make[2]: Entering directory `/home/kuenishi/src/erlang_js/c_src/nsprpub/config'
ccache gcc -m32 -o now.o -c      -Wall -O2 -fPIC  -UDEBUG  -DNDEBUG=1 -DHAVE_VISIBILITY_HIDDEN_ATTRIBUTE=1 -DHAVE_VISIBILITY_PRAGMA=1 -DXP_UNIX=1 -D_GNU_SOURCE=1 -DHAVE_FCNTL_FILE_LOCKING=1 -DLINUX=1 -Di386=1 -D_REENTRANT=1  -DFORCE_PR_LOG -D_PR_PTHREADS -UHAVE_CVAR_BUILT_ON_SEM   now.c
In file included from /usr/include/stdio.h:28:0,
                 from now.c:38:
/usr/include/features.h:323:26: fatal error: bits/predefs.h: No such file or directory

But building erlang_js itself is always successfully done. So I think rebar is adding --disable-64bit or lacking --enable-64bit ( or both ) when building dependent repositories.

This may be a rebar bug or misconfiguration of rebar.config but it's a unique case for riak_kv so I made a issue here. And I'm not sure whether it's specific in my Linux (Debian wheezy, amd64)

The n_val violation error message is confusing when R > N

See https://github.com/basho/riak_kv/blob/master/src/riak_kv_wm_object.erl#L1023

In this case the user is always told that there w values are wrong, even on a fetch. Try

curl localhost:8098/riak/test/foo?r=10

on a new cluster for response of

Specified w/dw/pw values invalid for bucket n value of 3

Thanks Ingo, for report http://riak-users.197444.n3.nabble.com/Data-loads-td4025122.html#a4025385

Nested keyfilters are not resolved until execution time

The riak_kv_mapred_filters:build_filter/2 function processes the filter expression as if it were a flat list. For nesting expressions, and, or, and not, this means that their arguments go unbuilt until execution time. Unfortunately, this means that they also go unvalidated until execution time.

This delay becomes a problem when a query is submitted with an invalid filter nested inside an and, or, or not. Instead of being detected before execution, the bad query makes its way through to KV vnode workers, which fail attempting to execute them. Those KV vnode worker failures go undetected by the Pipe vnode workers that are waiting on their results, and the MapReduce job that that Pipe is implementing hangs around dormant until its timeout.

Minimal reproduction (at Erlang console):

riak_kv_mrc_pipe:mapred({<<"foo">>,[[<<"not">>,[[[<<"breaking things">>]]]]]},[]).

That query will appear to hang, until the default 60 second timeout, at which time it will return {error,{timeout,[]}}. In the mean time, log messages like the following will print:

08:32:51.006 [error] gen_fsm <0.655.0> in state active terminated with reason: no match of right hand value {error,{bad_filter,[<<"breaking things">>]}} in riak_kv_mapred_filters:logical_not/1 line 189
08:32:51.016 [error] CRASH REPORT Process <0.655.0> with 1 neighbours exited with reason: no match of right hand value {error,{bad_filter,[<<"breaking things">>]}} in riak_kv_mapred_filters:logical_not/1 line 189 in gen_fsm:terminate/7 line 611
08:32:51.022 [error] gen_fsm <0.565.0> in state active terminated with reason: no match of right hand value {error,{bad_filter,[<<"breaking things">>]}} in riak_kv_mapred_filters:logical_not/1 line 189
08:32:51.023 [error] CRASH REPORT Process <0.565.0> with 1 neighbours exited with reason: no match of right hand value {error,{bad_filter,[<<"breaking things">>]}} in riak_kv_mapred_filters:logical_not/1 line 189 in gen_fsm:terminate/7 line 611
08:32:51.024 [error] gen_fsm <0.658.0> in state ready terminated with reason: no match of right hand value {error,{bad_filter,[<<"breaking things">>]}} in riak_kv_mapred_filters:logical_not/1 line 189
08:32:51.025 [error] CRASH REPORT Process <0.658.0> with 10 neighbours exited with reason: no match of right hand value {error,{bad_filter,[<<"breaking things">>]}} in riak_kv_mapred_filters:logical_not/1 line 189 in gen_fsm:terminate/7 line 611
08:32:51.026 [error] Supervisor riak_core_vnode_sup had child undefined started with {riak_core_vnode,start_link,undefined} at <0.655.0> exit with reason no match of right hand value {error,{bad_filter,[<<"breaking things">>]}} in riak_kv_mapred_filters:logical_not/1 line 189 in context child_terminated
08:32:51.028 [error] Supervisor riak_core_vnode_sup had child undefined started with {riak_core_vnode,start_link,undefined} at <0.565.0> exit with reason no match of right hand value {error,{bad_filter,[<<"breaking things">>]}} in riak_kv_mapred_filters:logical_not/1 line 189 in context child_terminated
08:32:51.029 [error] gen_fsm <0.1150.0> in state active terminated with reason: no match of right hand value {error,{bad_filter,[<<"breaking things">>]}} in riak_kv_mapred_filters:logical_not/1 line 189

The behavior should instead be like that of the same query without the nesting under not:

riak_kv_mrc_pipe:mapred({<<"foo">>,[[<<"breaking things">>]]},[]).

That command immediately returns the following, with no other log output:

{error,{error,{bad_filter,<<"breaking things">>}},{ok,[]}}

Consider exposing vector clocks in an easier-to-manipulate format.

Right now we use erlangs term_to_binary to encode the vector clock, which means that without using the erlang client, most people can't do anything with that information even if they wanted to, unless they're willing to write and support a version of binary to term in their target language.

I have a partially written python version, but it isn't small and wasn't fun to write.

Rather than duplicating that effort for all clients, it might be better to decide on an standard output format that we can produce and optionally consume, that would allow the user to actually use the data that is encoded there.

Add request stages timing to get FSM

Add equivalent functionality to the get FSM to track the request stages like the put FSM is able to do.

([email protected])2> C:put(riak_object:new(<<"b">>,<<"k">>,<<"v">>),[details]).
{ok,[{response_usecs,113057},
{stages,[{prepare,45894},
{validate,12677},
{precommit,6448},
{execute_local,64},
{waiting_local_vnode,14082},
{execute_remote,47},
{waiting_remote_vnode,33845}]}]}

It may even be valuable to try and break down wait time by vnode.

Typo in riak_client

riak_client.erl has a reference to a non-existent module riak_kv_mapred_filter.
(without "s")

YAMAMOTO Takashi

http://lists.basho.com/pipermail/riak-users_lists.basho.com/2012-May/008374.html

riak-admin cluster clear should not cause a new node to stop

If there's a way to differentiate it from a leave, we should do so and leave the node up, it's irritating to have to restart it every time after you cancel your cluster operation.

Like ring_members in /stats but with PB and HTTP

It would be really useful for us automation and monitoring types to be able to connect to any running node of a riak cluster and get the /stats page, then extract out a list of all the nodes in the cluster, with their bound ip and port.

e.g. we currently have in /stats

ring_members: [
"[email protected]",
"[email protected]",
"[email protected]",
"[email protected]",
"[email protected]"
],

I'd really like to see:

ring_members_http: [
"127.0.0.1:11098",
"127.0.0.1:12098",
"127.0.0.1:13098",
"127.0.0.1:14098",
"127.0.0.1:15098"
],

and

ring_members_pb: [
"10.0.2.11:11099",
"10.0.2.12:12099",
"10.0.2.13:13099",
"10.0.2.14:14099",
"10.0.2.15:15099",],

Riak tries to decode erlang binary when gzipped

When using Content-Type: application/x-erlang-binary, Riak calls binary_to_term on a PUT body, even if you've set Content-Encoding: gzip.

Steps to repro:

In an erlang shell:

T = {some, <<"term">>}.
file:write_file("erlang-binary", term_to_binary(T)).

gzip erlang-binary

curl -X PUT -H "Content-Encoding: gzip" -H "Content-Type: application/x-erlang-binary" localhost:8091/riak/gzip/test --data-binary @erlang-binary.gz

Gives a 500 error, the problem appears to be here.

when n_val=1 a call to rand_uniform(1, 1) results in badarg and causes kv vnode to exit

https://github.com/basho/riak_kv/blob/master/src/riak_kv_put_fsm.erl#L211
ssumes that length(Preflist2) > 1 which it isn't when n_val = 1 for that bucket. Calling rand_uniform(1, 1) is a badarg and causes this to fail. The simple fix is to check the length of Prelist2 before using it in rand_uniform/2. A quick audit of other uses of rand_uniform in Riak might find similar mistakes.

'riak-admin vnode-status' formatted incorrectly [JIRA: RIAK-2051]

Right now, the output from 'riak-admin vnode-status' is mostly unusable, due to incorrect column formatting.

There is a workaround using the Erlang console right now:

1> Bin = <<"                               Compactions\nLevel  Files Size(MB) Time(sec) Read(MB) Write(MB)\n--------------------------------------------------\n  0        1        5         0        0        17\n  1        0        0         1       24        23\n  2       74       99         2       82        82\n  3      654      999         4      110       110\n  4     6320    10000         0        0         0\n  5    51123    99998         0        0         0\n  6    21816    44506         0        0         0\n">>.

3> io:put_chars(Bin).
                               Compactions
Level  Files Size(MB) Time(sec) Read(MB) Write(MB)
--------------------------------------------------
  0        1        5         0        0        17
  1        0        0         1       24        23
  2       74       99         2       82        82
  3      654      999         4      110       110
  4     6320    10000         0        0         0
  5    51123    99998         0        0         0
  6    21816    44506         0        0         0
ok

Batch MapReduce output before sending to client

Streamed MapReduce output from riak_pipe is sent to the HTTP or PB client as one message per result. It may be more efficient to batch up results and send several results per message, as the legacy MapReduce implementation did.

Deleted keys stay in index queries indefinitely

Deleted keys remain in indexes indefinitely if the tombstone reap process fails. The key will finally be removed once a GET request is made on the key.

add latency leveling

Purpose/High-Level

In short, "latency leveling" (1) is about bounding the amount of time
a request can take. On the surface this sounds no different than a
timeout but the difference is that you attempt to do all work but
accept some work after a given time bound. It's the same idea as
trading yield for harvest. The purpose is to make a predictable
system out of unpredictable parts. This is essential in a webapp
that has a hard ceiling on response time but requires the coordination
of many services. If a single service's response time can grow
unbounded then that ceiling is breached. A timeout will prevent this
but will also throw away any work that may have occurred.

E.g. Given a read request with N=3, R=2 and a timeout of 50ms
two of the replicas must respond to the coordinator in less than
50ms for the client to get a response. If only one replica replies
in time then the client receives a timeout. The read request could be
changed to use R=1 but now all requests are giving up some
consistency for latency. It would be nice if there was a way to
sometimes tradeoff the desired consistency level for latency. This
is what latency leveling achieves. Returning to the first case, if
one replica returns it's value in 29ms but then nothing else is seen
for the next 31ms then the coordinator, if told to, will return the
value it did see with an indicator that the value returned was read
with less than desired consistency.

HTTP mapred yields 500 on missing Content-Type header

The known_content_type WM callback crashes if no content-type is given.

The error happens around these lines: https://github.com/basho/riak_kv/blob/master/src/riak_kv_wm_mapred.erl#L98 because the call to wrq:get_req_header("content-type", RD) returns undefined, and then soon thereafter there is a no-match inside base_ctype.

Statistics for node local backend performance

I think it would be good to add statistics for node local backend performance to "riak-admin status".

This would make it possible to measure effects of tuning a single node.

Something like:

node_backend_gettime_mean: <>
node_backend_gettime_99: <<99% percentile for for node local gets>>

etc.

The other time measure measurements include network latencies and make it impossible to identify node local problems.

all puts fail if precommit hook is specified in default bucket props

Tested as below, issue seems to be that indexdoc buckets are not excluded from the precommit hook set in default bucket props. When a value is put, precommit is invoked on <><> and then again on <<_rsid_bucket>><>, resulting in 500 internal server error for every put.

https://gist.github.com/feb8bada4e6e4ed6c8a6

Race Condition causes Javascript VMs to never be returned to VM Pool

There is a race condition in the riak_kv_js_manager wherein:

A process (e.g. a pipe fitter) calls riak_kv_js_manager:blocking_dispatch/4
riak_kv_js_manager:blocking_dispatch function calls reserve_vm, a gen_server:call to the riak_kv_js_manager, which reserves a VM
The calling process (e.g. a pipe fitting) is killed before calling riak_kv_js_vm:blocking_dispatch/3, a gen_server:call to the riak_kv_js_manager, which normally would return the JS VM to the VM pool after successfully running the provided JS

In this situation, the reserved JS VM will never be returned to the JS VM pool.

The offending code is as follows:

blocking_dispatch(Name, JSCall, MaxCount, Count) ->
    case reserve_vm(Name) of %% reserve_vm is a gen_server:call to the riak_kv_js_manager
        {ok, VM} ->
            JobId = {VM, make_ref()},
            riak_kv_js_vm:blocking_dispatch(VM, JobId, JSCall); %% riak_kv_js_vm:blocking_dispatch is a gen_server:call to the riak_kv_js_vm
        {error, no_vms} ->
            back_off(MaxCount, Count),
            blocking_dispatch(Name, JSCall, MaxCount, Count - 1)
    end.

The simple solution would appear to be linking the calling process to the JS VM before allowing it to be checkout of the pool.

PR/PW can be violated during netsplit conditions

PR/PW are checked by looking at the preflist for the partition and checking
that the number of primaries in the preflists satisifies PR/PW. However, in the
case of a yanked network cable, kernel panic or VM pause, where no FIN packet
is sent, in the interval before the other nodes notice a node has become
unavailable, the preference list is not recalculated and so PR/PW checks will
continue to pass. Once erlang notices the node is down, this goes away, but
there is a significant window before this happens.

A possible fix, at least for reads, is to check which nodes actually serviced
the request and check that against the preflist, if there's a discrepancy that
would make PR invalid, throw an error at that point. I'm not sure how to
properly fix writes so they would be 'all or nothing'.

JS Built-Ins don't handle tombstones

Migrating issue from Buzilla. See the original thread for existing comments and code.

(Moving here from basho/riak_pipe#47, because it's a KV issue.)

Riak KV vnodes can block in certain scenarios when using Bitcask

Background. Prior to Riak 0.14.2, all fold operations would block the relevant vnode and prevent the vnode from servicing requests. This was changed in Riak 1.0, with the introduction of asynchronous folds that used an async worker pool, as well as additions to the various backends to support async folds.

To support async folds, Bitcask freezes it's in-memory keydir and has async folds iterate over the frozen keydir, with new concurrent writes going to a pending keydir. Since the keydir is in-memory, Bitcask only allows a single frozen keydir. Multiple folds can reuse the same keydir, but only if there has not been writes since the keydir was frozen. If a fold is started, a write occurs, and then a new fold is started, the second fold will block until the first fold finishes, and then re-freeze the keydir.

The Problem. Blocking async folders is expected and not a big deal. However, when determining if a vnode should handoff data, Riak will end up calling riak_kv_vnode:is_empty which will call, for a Bitcask vnode, bitcask:is_empty. In Bitcask, the is_empty check is implemented through as a fold (start a fold and exit as soon as any key is found) to deal with tombstones, expired keys, etc. This fold is executed directly in the vnode pid, not an async worker, and will block the vnode in scenarios such as above.

There are two scenarios:

An existing fold is running (list keys, one of the folds used in mDC replication, etc) and handoff is triggered. The vnode will then block until the first fold finishes, servicing no requests and leading to an ever growing message queue.
Handoff is triggered (which starts a fold), and then handoff is re-triggered in the future. The vnode manage retriggers handoff periodically as a fault-tolerance mechanism. The handoff manager ensures that a handoff won't be started if already running. However, the is_empty check occurs before calling the handoff manager. So, handoff A to B, write, handoff A to B will cause the vnode to block on the second handoff request, again servicing no requests and leading to a growing message queue.

Keylisting continues after pipe death

If MapReduce on a bucket is cancelled (timeout or dropped client connection) the listkeys feeding the objects from the bucket is not cancelled, instead it generates a large number of error messages

Pipe worker startup failed:fitting was gone before startup

Additionally, these running listkey jobs continue to tie up resources that would otherwise be capable of running other MapReduce queries. If enough listkey jobs end up in this state, subsequent MapReduce queries will be unable to complete and will timeout. In the worst case, all MapReduce queries submitted will timeout until the running listkeys eventually terminate and the cluster recovers.

riak_kv_js_manager:blocking_dispatch may exit

The function riak_kv_js_manager:blocking_dispatch/3 may exit with a noproc error if either the manager process has died, or the VM that is attempts to call riak_kv_js_vm:checkout_to/2 on has died. We may want to catch this exit in riak_kv_mrc_map:map_js/3 to avoid bringing down the whole MapReduce pipeline with a cryptic error, as was seen in this mailing list post: http://lists.basho.com/pipermail/riak-users_lists.basho.com/2012-October/009798.html

badarg error in binary_to_term/1

Riak KV handoff can be stymied indefinitely by a badarg error in erlang:binary_to_term/1, e.g.:

2012-06-12 01:25:10.263 [error] <0.89.0> Supervisor riak_core_vnode_sup had child undefined started with {riak_core_vnode,start_link,undefined} at <0.9312.1> exit with reason bad argument in call to erlang:binary_to_term(<<131,104,7,100,0,8,114,95,111,98,106,101,99,116,109,0,0,0,10,114,101,97,100,95,105,110,100,101,...>>) in riak_kv_vnode:do_diffobj_put/3 in context child_terminated

Similarly, GET operations can fail with a similar error:

2012-06-12 17:47:00.750 [error] <0.89.0> Supervisor riak_core_vnode_sup had child undefined started with {riak_core_vnode,start_link,undefined} at <0.27729.58> exit with reason bad argument in call to erlang:binary_to_term(<<131,104,7,100,0,8,114,95,111,98,106,101,99,116,109,0,0,0,10,114,101,97,100,95,105,110,100,101,...>>) in riak_kv_vnode:do_get_term/3 in context child_terminated

In both of these cases, the KV backend is leveldb. Neither the LevelDB sanity checks (if the bad SST is copied to my local box for testing) nor the Snappy compression library are detecting any error, which is more than a little strange and needs further investigaion.

mapred on key-range should be faster for leveldb

Since some backends provides efficient range fold (in particular eleveldb and hanoidb), we should add an API that provides fast M/R for objects in a given key range.

Right now, this is done by first getting the keys in a given range (which is actually reasonably fast in eleveldb and hanoidb), and then getting each object individually.

Since leveldb and hanoi actually has the objects stored close to each other, it would be much faster to simply provide all the values in a given key range to a reduce phase.

Much faster. Grr.. :-)

system_info(global_heaps_size) not supported in R15B02

The riak_kv_stat:system_stats/0 function calls erlang:system_info(global_heaps_size) but that call results in a badarg exception under Erlang R15B02. The OTP team removed support for global_heaps_size in commit 9cfeeb9b in the erlang/otp repo.

JS MapReduce returns partial results when some VMs are busy

To reproduce this issue do the following:

make devrel with Riak 1.0.3
Set map_js_vm_count to 0 on one node
Issue the following puts:

curl -XPUT http://localhost:8091/riak/docs/foo \
-H 'Content-Type: text/plain' -d 'demo data goes here' -v
curl -XPUT http://localhost:8091/riak/docs/bar \
-H 'Content-Type: text/plain' -d 'demo demo demo demo' -v
curl -XPUT http://localhost:8091/riak/docs/baz \
-H 'Content-Type: text/plain' -d 'nothing to see here' -v
curl -XPUT http://localhost:8091/riak/docs/qux \
-H 'Content-Type: text/plain' -d 'demo demo' -v

Issue the following MapReduce job:

curl -XPOST http://localhost:8091/mapred \
     -H 'Content-Type: application/json' \
     -d '{"inputs":"docs",
          "query":[{"map":{"language":"javascript",
            "source":"Riak.mapValues"}}]}'

Only a partial result set will be returned. For example:

["demo demo demo demo","demo data goes here","demo demo"]

You should get:

["demo demo demo demo","demo data goes here","demo demo","nothing to see here"]

In the logs on the node with 0 map_js_vms you will see notifications like this:

2012-02-07 12:38:36.700 [notice] <0.26723.1821>@riak_kv_js_manager:blocking_dispatch:246 JS call failed: All VMs are busy.

HTTP API can return very long Link header when listing keys/buckets

GET /buckets/bucket/keys?keys=true and GET /buckets?buckets=true return Link headers for each item returned. This can result in a very long header when listing a large number of items, which makes certain HTTP clients (such as txriak) very unhappy.

Tombstones are not reaped if reaping occurs before tombstones reach all replicas [JIRA: RIAK-2803]

Tombstones may not be reaped if reaping occurs before tombstones are written to all replicas.

Scenario

Tombstone is written

riak_kv/src/riak_kv_delete.erl

Line 80 in cada9c7

Reply = C:put(Tombstone, [{w,W},{pw,PW},{dw, DW},{timeout,Timeout}]),
Tombstone is reaped

riak_kv/src/riak_kv_delete.erl

Line 86 in cada9c7

C2:get(Bucket, Key, all, AsyncTimeout);
riak_kv_get_fsm checks for final action ( delete | read_repair | noop )

riak_kv/src/riak_kv_get_fsm.erl

Line 282 in cada9c7

{Action, UpdGetCore} = riak_kv_get_core:final_action(GetCore),
riak_kv_get_core checks if any replicas of the object (RObj) don't match the merged object (MObj)

If one of the replicas returns an older version during tombstone reaping, the riak_kv_get_core check returns read_repair rather than delete.

The riak_kv_get_fsm should be able to delete the object rather than read repairing if one of the replicas returns an object older than the tombstone.

Below is an example set of replicas that will result in a read repair rather than a delete:

[{274031556999544297163190906134303066185487351808,
  {ok,{r_object,<<"foo">>,<<"9698">>,
                [{r_content,{dict,6,16,16,8,80,48,
                                  {[],[],[],[],[],[],[],[],[],...},
                                  {{[],[],[[<<"Links">>]],[],[],[],[],...}}},
                            <<"hello world">>}],
                [{<<"úZ»/O{=\t">>,{1,63500707125}}],
                {dict,1,16,16,8,80,48,
                      {[],[],[],[],[],[],[],[],[],[],...},
                      {{[],[],[],[],[],[],[],[],...}}},
                undefined}}},
 {296867520082839655260123481645494988367611297792,
  {ok,{r_object,<<"foo">>,<<"9698">>,
                [{r_content,{dict,4,16,16,8,80,48,
                                  {[],[],[],[],[],[],[],[],...},
                                  {{[],[],[],[],[],[],...}}},
                            <<>>}],
                [{<<"úZ»/O{=\t">>,{2,63500707280}}],
                {dict,1,16,16,8,80,48,
                      {[],[],[],[],[],[],[],[],[],...},
                      {{[],[],[],[],[],[],[],...}}},
                undefined}}},
 {251195593916248939066258330623111144003363405824,
  {ok,{r_object,<<"foo">>,<<"9698">>,
                [{r_content,{dict,4,16,16,8,80,48,
                                  {[],[],[],[],[],[],[],...},
                                  {{[],[],[],[],[],...}}},
                            <<>>}],
                [{<<"úZ»/O{=\t">>,{2,63500707280}}],
                {dict,1,16,16,8,80,48,
                      {[],[],[],[],[],[],[],[],...},
                      {{[],[],[],[],[],[],...}}},
                undefined}}}]

transfer status - percentage complete

This is in reference to the riak-admin transfers command and written in response to Matt Ranney's comment on a recent blog post.

IFF the information is easy to obtain then it would be very helpful to display the percentage complete for each transfer.

We still expose "riak" with a custom setting for an access URL

This code in question adds "riak": even if we have a custom "raw_name":

[riak_kv_web.erl]

raw_dispatch() ->
case app_helper:get_env(riak_kv, raw_name) of
undefined -> raw_dispatch("riak");
Name -> lists:append(raw_dispatch(Name), raw_dispatch("riak"))
end.

POST returns "old format" riak/bucket/key combination

For consistency, should POST return the new format i.e. "buckets/bucket/keys/key"

Eliminate use of zlib:zip() and zlib:unzip()

In a many-core environment, use of the zlib:zip() and zlib:unzip() function isn't a good idea. Each appears to open an close an Erlang port (for the "zlib_drv" driver) for each invocation. Inside the R15B01 virtual machine, all allocations of a new port # are serialized by the 'get_free_port' mutex.

Suggestion: eliminate zlib's use in favor of term_to_binary(VClock, [compressed]). Or something else that doesn't allocate & free an Erlang port on each call.

NOTE: zlib:zip() and unzip() are also used to compress handoff payload. On extremely busy systems (e.g. 8K Riak ops/sec/node or more), this same port allocation serialization probably interferes quite a bit with handoff.

What is riak_kv_vnode:local_get/2 for, seems unused. Can delete?

https://github.com/basho/riak_kv/blob/master/src/riak_kv_vnode.erl#L196

Same question for riak_kv_vnode:local_put too, I guess.

Object WM Resource Mishandles Some Precommit Hook Failures

When a precommit hook fails, the put operation might return an {error, {precommit_fail, Message}} tuple. The riak_kv_wm_object:handle_common_error/3 function immediately passes Message to riak_kv_wm_object:send_precommit_error/2.

Unfortunately, send_precommit_error/2 only handles the cases where Message is an iolist, or the atom undefined. For all other cases, a badarg is raised, causing an ugly error message to be sent to the client. For example, if the precommit hook exits (as described in basho/riak_search#134), the client sees:

{error,
    {error,badarg,
        [{erlang,iolist_to_binary,
             [{hook_crashed,{riak_search_kv_hook,precommit,error,badarg}}],
             []},
         {wrq,append_to_response_body,2,[{file,"src/wrq.erl"},{line,205}]},
         {riak_kv_wm_object,handle_common_error,3,
             [{file,"src/riak_kv_wm_object.erl"},{line,998}]},
         {webmachine_resource,resource_call,3,
             [{file,"src/webmachine_resource.erl"},{line,171}]},
         {webmachine_resource,do,3,
             [{file,"src/webmachine_resource.erl"},{line,130}]},
         {webmachine_decision_core,resource_call,1,
             [{file,"src/webmachine_decision_core.erl"},{line,48}]},
         {webmachine_decision_core,accept_helper,0,
             [{file,"src/webmachine_decision_core.erl"},{line,606}]},
         {webmachine_decision_core,decision,1,
             [{file,"src/webmachine_decision_core.erl"},{line,512}]}]}}

The hook_crashed tuple is generated here: https://github.com/basho/riak_kv/blob/master/src/riak_kv_put_fsm.erl#L715

It looks like a similar problem could happen just below that, if a precommit hook returns something that is not a failure message, but also not a riak object: https://github.com/basho/riak_kv/blob/master/src/riak_kv_put_fsm.erl#L726

The client should see the returned failure message, not a stack trace of webmachine.

Spurious 400 after HTTP MapReduce

History:

The riak_kv_wm_mapred module, which provided the MapReduce HTTP interface, implements its timeout mechanism by scheduling a message to be delivered to its process after the timeout has occurred:

https://github.com/basho/riak_kv/blob/master/src/riak_kv_wm_mapred.erl#L148

When a MapReduce query finishes before that message is received, however, mochiweb is likely to receive it instead:

https://github.com/mochi/mochiweb/blob/master/src/mochiweb_http.erl#L69-70

The mochiweb_http module expects only tcp socket messages, so if this unknown pipe_timeout message is received, mochiweb immediately sends a 400 error message to the client. The message is safe to ignore, but it can be confusing to prove that it came from this bit of code, and not somewhere else.

What riak_kv_wm_mapred should do instead is keep the timer reference around, and cancel the timer before returning control to mochiweb. Alternatively, all pipe and timeout messages should be handled by a process other than the mochiweb/webmachine/socket-holding process, which would also make it possible to remove the use of the mochiweb_request_force_close hack:

https://github.com/basho/riak_kv/blob/master/src/riak_kv_wm_mapred.erl#L299-300

/stats crashes on Mountain Lion

When trying to access /stats it crashes hard on me, here's the output from the response and the error that pops up in the logs:

https://gist.github.com/3496819

riak-admin status succeeds, but only after a timeout. It seems to have trouble fetching the number of processors, as this line comes up in the output: cpu_nprocs : {error,timeout}

Using Riak 1.2.0, binary distribution, with the included Erlang. Anything I could do to provide more info or to get this working temporarily? Thanks!

Negative clock deskew can cause stats to fail

Looking at riak_kv_put_fsm:add_timing, it uses os:timestamp() to provide time. If NTP decides to set the clock backward to deskew, then timer:now_diff(StageEnd, StageStart) could return a negative. Which will cause /stats to return 500 - Internal Server Errors until the negative falls out of the histogram window.

Suggestion is to not capture negative timing statistics

Steps to reproduce:

From riak attach run:

folsom_metrics:notify_existing_metric({riak_kv,node_put_fsm_time},-9999,histogram).`

From a browser make a call to /stats

Result:

Calls to /stats fail with a 500
The following output is logged on a call to /stats

15:44:13.936 [error] Error in process <0.5758.0> on node '[email protected]' with exit value: {badarith,[{bear,scan_values,2,[]},{bear,get_statistics,1,[]},{lists,keysort,2,[]}]}
15:44:13.941 [error] webmachine error: path="/stats"
{error,{error,function_clause,[{proplists,delete,[disk,{error,{badarith,[{bear,scan_values,2,[]},{bear,get_statistics,1,[]},{lists,keysort,2,[]}]}}],[{file,"proplists.erl"},{line,362}]},{riak_kv_wm_stats,get_stats,0,[{file,"src/riak_kv_wm_stats.erl"},{line,88}]},{riak_kv_wm_stats,produce_body,2,[{file,"src/riak_kv_wm_stats.erl"},{line,77}]},{webmachine_resource,resource_call,3,[{file,"src/webmachine_resource.erl"},{line,169}]},{webmachine_resource,do,3,[{file,"src/webmachine_resource.erl"},{line,128}]},{webmachine_decision_core,resource_call,1,[{file,"src/webmachine_decision_core.erl"},{line,48}]},{webmachine_decision_core,decision,1,[{file,"src/webmachine_decision_core.erl"},{line,532}]},{webmachine_decision_core,handle_request,2,[{file,"src/webmachine_decision_core.erl"},{line,33}]}]}}

add returnHead request parameter to store requests

riak should have a returnHead parameter quite similar to returnBody request parameter but without returning the body of the k/v just the head informations so the client is able to retrieve the updated vector clock after a store operation and can update the local domain object. This would be quite useful to work with POJOs in the riak-java-client. Without the updated vector clock a refetch of the data would always be necessary before storing a POJO again.

Failed JS MR includes Erlang terms

When performing a JS MR job against master (commit 8fe7e1ec9434a616d656c400bd51423f69590f37) that results in an error, strings containing Erlang terms are returned inside the object. I produced this error manually running the badMap test case found in the riak-javascript-client (1).

Stand up an instance of Riak
Checkout riak-javascript-client
cd riak-javascript-client/tests
./setup -i (you may need to modify the setup script to point to correct port/host)

Finally, run the following curl

curl -XPOST http://10.0.1.4:8098/mapred -d '{"inputs":"mr_test", "query":[{"map":{"language":"javascript","source":"function(value) { return [JSON.parse(value)]; }", "keep":true}}], "timeout":60000}'

Below is a copy of my output

1.0.3

$ curl -XPOST http://10.0.1.4:8098/mapred -d '{"inputs":"mr_test", "query":[{"map":{"language":"javascript","source":"function(value) { return [JSON.parse(value)]; }", "keep":true}}], "timeout":60000}' | jsonpp
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   210  100    56  100   154   2365   6506 --:--:-- --:--:-- --:--:--  8105
{
  "lineno": 477,
  "message": "JSON.parse",
  "source": "unknown"
}

Master

$ curl -XPOST http://10.0.1.4:8098/mapred -d '{"inputs":"mr_test", "query":[{"map":{"language":"javascript","source":"function(value) { return [JSON.parse(value)]; }", "keep":true}}], "timeout":60000}' | jsonpp
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   808  100   654  100   154  63513  14955 --:--:-- --:--:-- --:--:--  106k
{
  "phase": 0,
  "error": "[{<<\"lineno\">>,477},{<<\"message\">>,<<\"JSON.parse\">>},{<<\"source\">>,<<\"unknown\">>}]",
  "input": "{ok,{r_object,<<\"mr_test\">>,<<\"third\">>,[{r_content,{dict,6,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[[<<\"Links\">>]],[],[],[],[],[],[],[],[[<<\"content-type\">>,116,101,120,116,47,112,108,97,105,110],[<<\"X-Riak-VTag\">>,49,117,106,81,100,85,75,113,105,88,119,84,49,86,99,55,121,121,67,55,85,53]],[[<<\"index\">>]],[],[[<<\"X-Riak-Last-Modified\">>|{1329,256793,914740}]],[],[[<<\"X-Riak-Meta\">>]]}}},<<\"1\">>}],[{<<35,9,254,249,79,57,154,63>>,{3,63496475993}}],{dict,1,16,16,...},...},...}"
}

(1) https://github.com/basho/riak-javascript-client/blob/slf-sigsegv-bug/tests/unit_tests.js#L313

Put FSM crashes if forwarded node goes down.

In 1.0 and above with vnode_vclocks enabled, nodes must be a member of the preference list to coordinate a put.
Currently nodes forward put requests to the first member of the preference list by talking to riak_kv_up_fsm_sup on the remote node - code here https://github.com/basho/riak_kv/blob/master/src/riak_kv_put_fsm.erl#L193

If the node goes down while starting the remote process instead of getting an {error, Reason} response from the remote supervisor, the start_put_fsm call throws an uncaught exception which crashes the local put FSM.

2012-03-09 06:11:00 =SUPERVISOR REPORT====
     Supervisor: {local,riak_kv_put_fsm_sup}
     Context:    child_terminated
     Reason:     {{nodedown,'riak@host1'},{gen_server,call,[{riak_kv_put_fsm_sup,'riak@host1'},{start_child,[{raw,12339479,<0.4491.12>},{r_object,<<"bucket">>,<<"key">>,[{r_content,{dict,6,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[[<<"Links">>]],[],[],[],[],[],[],[],[[<<"content-type">>,97,112,112,108,105,99,97,116,105,111,110,47,106,115,111,110],[<<"X-Riak-VTag">>,52,113,74,65,119,52,87,90,51,116,119,86,107,73,112,113,101,65,48,52,73,53]],[[<<"index">>]],[],[[<<"X-Riak-Last-Modified">>|{1331,269199,636663}]],[],[[<<"X-Riak-Meta">>]]}}},<<"body">>}],[{<<54,74,223,68,79,76,172,101>>,{2,63498487778}}],{dict,4,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[[<<"Links">>]],[],[],[],[],[],[],[],[[<<"content-type">>,97,112,112,108,105,99,97,116,105,111,110,47,106,115,111,110]],[[<<"index">>]],[],[],[],[[<<"X-Riak-Meta">>]]}}},<<"{"started":1331268781747}">>},[{w,default},{dw,default},{pw,default},{timeout,60000}]]},infinity]}}
     Offender:   [{pid,<0.4817.12>},{name,undefined},{mfargs,{riak_kv_put_fsm,start_link,undefined}},{restart_type,temporary},{shutdown,5000},{child_type,worker}]

If the node is under heavy load you'll get a crash dump for however many puts ran in the last 4*net_tick_timeout seconds (typically 60s) which can be in the thousands and overload the logging subsystem.

The easiest path seems to be to replace the remote supervisor call with an rpc:call to the remote node which will then start the supervisor, that way we can supply a timeout that honors the requested timeout for the FSM (which I suppose could worse-case be 2*timeout if the forward requests then the node uses the same timeout again - but distributed time is hard).

To prove the issue is resolved, please do a comparative run of the current code vs the fixed code
I'd suggest using basho_bench against a 4 node cluster, setting N=1 for the test bucket (so that 75% of requests are forwarded) with the memory backend over protobuffs.

If the times are within 95% of the current solution we'll accept the performance trade off for robustness. This failure can be enough to take out a node due to running out of processes/memory.

Please make sure the fix can be applied against the 1.0 branch as well as 1.1 and master.

riak_kv_pipe_get's failover has trouble with fast-input+slow-output rates

A problem was uncovered with riak_kv_pipe_get's failover strategy.

Description

When the KV vnode that the pipe worker talks to returns an error, "not
found" for example, the worker attempts to have another worker for
this fitting try on a different vnode by returning the
forward_preflist result. The last worker in the fallback list is
then supposed to forward the error to the next phase if it also fails.

This strategy does not work when the kvget workers are unable to keep
up with the rate of incoming inputs. The forward_preflist strategy
has to use a non-blocking enqueue to prevent workers in a fitting from
mutually blocking on one another. If the workers' queues are already
full of other inputs, forwarding retry inputs to them will never
succeed. When forwarding retry inputs fails, the worker does not get
a chance to send the error to the next phase; Pipe's internal code
sends an error trace message and drops the input instead.

This bug manifests itself as MapReduce queries failing with a
timeout-like error before their configured timeout, since both the
HTTP and PB mapreduce endpoints treat error trace messages as reasons
to terminate the query. The issue was raised on the mailing list:
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2012-February/007590.html

Possible fixes

One quick and incomplete fix may be to simply set the q_limit much
higher on the riak_kv_pipe_get fittings in the riak_kv_mrc_pipe
module. This would raise the level/rate of inputs necessary to
trigger the bug. Raising the following riak_kv_mrc_map fitting's
q_limit may help as well.

One long-term fix is likely to change riak_kv_pipe_get such that it
does not use the forward_preflist return value. Either it will need
to talk directly to the fallback KV vnodes without forwarding the
input, or it will need to be able to call
riak_pipe_vnode_worker:recurse_input/5 directly (which will also
require passing the used-preflist to the worker) to be able to handle
with full-queue errors.

Another long-term fix might be to alter the behavior of
forward_preflist such that instead of producing a trace error, it
forwards some standard message to the downstream fitting.

Reproduction

Below is a basic method for reproducing the problem. The method is to
follow a riak_kv_pipe_get fitting with a fitting that takes forever
to process its first input. This in turn causes the
riak_kv_pipe_get fitting's queues to back up. The nval, q_limit
and chashfun parameters have been tuned to provide a minimal case.
The sleeps are not needed, but they make the trace easier to read.
Each call to riak_pipe:queue_work/3 is preceeded by a description of
how the worker (W), input queue (Q), and blocking queue (B) look
after each input settles in its final position.

KVGet = #fitting_spec{name = kvget,
                      module = riak_kv_pipe_get,
                      chashfun = chash:key_of(now()),
                      nval = 2,
                      q_limit = 1},
SlowMap = #fitting_spec{name = slowmap,
                        module = riak_pipe_w_xform,
                        chashfun = chash:key_of(now()),
                        arg = fun(I,P,F) ->
                                 receive never -> ok end
                              end,
                        q_limit = 1},
{ok, Pipe} = riak_pipe:exec([KVGet, SlowMap],
                            [{log, sink}, {trace, all}]),

%% 1: makes it all the way to the slowmap worker
%% K1: W=idle, Q=[], B=[]
%% K2: W=idle, Q=[], B=[]
%% M1: W={notfound,1}, Q=[], B=[]
riak_pipe:queue_work(Pipe, {<<"notthere">>, <<"1">>}, noblock),
timer:sleep(1000),

%% 2: makes it into the slowmap worker's queue
%% K1: W=idle, Q=[], B=[]
%% K2: W=idle, Q=[], B=[]
%% M1: W={notfound,1}, Q=[{notfound,2}], B=[]
riak_pipe:queue_work(Pipe, {<<"notthere">>, <<"2">>}, noblock),
timer:sleep(1000),

%% 3: makes it to the fallback kvget worker,
%%    blocking that worker on the slowmap worker's queue
%% K1: W=idle, Q=[], B=[]
%% K2: W={notthere 3}, Q=[], B=[]
%% M1: W={notfound,1}, Q=[{notfound,2}], B=[{notthere 3}]
riak_pipe:queue_work(Pipe, {<<"notthere">>, <<"3">>}, noblock),
timer:sleep(1000),

%% 4: makes it to the fallback worker's queue
%% K1: W=idle, Q=[], B=[]
%% K2: W={notthere 3}, Q=[{notthere, 4}], B=[]
%% M1: W={notfound,1}, Q=[{notfound,2}], B=[{notfound, 3}]
riak_pipe:queue_work(Pipe, {<<"notthere">>, <<"4">>}, noblock),
timer:sleep(1000),

%% 5: fails because preflist forwarding would block on the
%%    fallback worker's queue
%% K1: W={notthere 5}, Q=[], B=[]
%% K2: W={notthere 3}, Q=[{notthere, 4}], B=[]
%% M1: W={notfound,1}, Q=[{notfound,2}], B=[{notfound, 3}]
riak_pipe:queue_work(Pipe, {<<"notthere">>, <<"5">>}, noblock),
timer:sleep(1000),

{timeout, [], Trace} = riak_pipe:collect_results(Pipe),
rp(Trace).

Below is the value captured in the Trace variable, with additional
comments about what is happening at each step.

%% K1: W=idle, Q=[], B=[]
%% K2: W=idle, Q=[], B=[]
%% M1: W=idle, Q=[], B=[]

[{slowmap,{trace,all,{fitting,init_started}}},
 {slowmap,{trace,all,{fitting,init_finished}}},
 {kvget,{trace,all,{fitting,init_started}}},
 {kvget,{trace,all,{fitting,init_finished}}},

 %% 1: send first input to primary kvget vnode
 %% K1: W=idle, Q=[{notthere,1}], B=[]
 %% K2: W=idle, Q=[], B=[]
 %% M1: W=idle, Q=[], B=[]
 {kvget,
     {trace,all,
         {fitting,
             {get_details,#Ref<0.0.0.34313>,
                 1438665674247607560106752257205091097473808596992,
                 <11776.302.0>}}}},
 {kvget,
     {trace,all,
         {vnode,
             {start,1438665674247607560106752257205091097473808596992}}}},
 {kvget,
     {trace,all,
         {vnode,
             {queued,1438665674247607560106752257205091097473808596992,
                 {<<"notthere">>,<<"1">>}}}}},
 %% 1: primary kvget worker processes first input
 %% K1: W={notthere,1}, Q=[], B=[]
 %% K2: W=idle, Q=[], B=[]
 %% M1: W=idle, Q=[], B=[]
 {kvget,
     {trace,all,
         {vnode,
             {dequeue,
                 1438665674247607560106752257205091097473808596992}}}},
 %% 1: primary kvget worker forwards first input to fallback vnode
 %% K1: W=idle, Q=[], B=[]
 %% K2: W=idle, Q=[{notthere,1}], B=[]
 %% M1: W=idle, Q=[], B=[]
 {kvget,
     {trace,all,
         {fitting,{get_details,#Ref<0.0.0.34313>,0,<0.282.0>}}}},
 {kvget,{trace,all,{vnode,{start,0}}}},
 {kvget,
     {trace,all,{vnode,{queued,0,{<<"notthere">>,<<"1">>}}}}},
 %% 1: fallback kvget vnode processes first input
 %% K1: W=idle, Q=[], B=[]
 %% K2: W={notthere,1}, Q=[], B=[]
 %% M1: W=idle, Q=[], B=[]
 {kvget,{trace,all,{vnode,{dequeue,0}}}},
 {kvget,
     {trace,all,
         {vnode,
             {waiting,
                 1438665674247607560106752257205091097473808596992}}}},
 %% 1: fallback kvget worker sends notfound output to slowmap vnode
 %% K1: W=idle, Q=[], B=[]
 %% K2: W=idle, Q=[], B=[]
 %% M1: W=idle, Q=[{notfound,1}], B=[]
 {slowmap,
     {trace,all,
         {fitting,
             {get_details,#Ref<0.0.0.34313>,
                 593735040165679310520246963290989976735222595584,
                 <11775.290.0>}}}},
 {slowmap,
     {trace,all,
         {vnode,
             {start,593735040165679310520246963290989976735222595584}}}},
 {slowmap,
     {trace,all,
         {vnode,
             {queued,593735040165679310520246963290989976735222595584,
                 {{error,notfound},{<<"notthere">>,<<"1">>},undefined}}}}},
 {kvget,{trace,all,{vnode,{waiting,0}}}},
 %% 1: slowmap vnode picks up first input
 %% K1: W=idle, Q=[], B=[]
 %% K2: W=idle, Q=[], B=[]
 %% M1: W={notfound,1}, Q=[], B=[]
 {slowmap,
     {trace,all,
         {vnode,
             {dequeue,
                 593735040165679310520246963290989976735222595584}}}},
 %% 2: send second input to primary kvget vnode
 %% K1: W=idle, Q=[{notthere,2}], B=[]
 %% K2: W=idle, Q=[], B=[]
 %% M1: W={notfound,1}, Q=[], B=[]
 {kvget,
     {trace,all,
         {vnode,
             {queued,1438665674247607560106752257205091097473808596992,
                 {<<"notthere">>,<<"2">>}}}}},
 %% 2: primary kvget worker forwards second input to fallback vnode
 %% K1: W=idle, Q=[], B=[]
 %% K2: W=idle, Q=[{notthere,2}], B=[]
 %% M1: W={notfound,1}, Q=[], B=[]
 {kvget,
     {trace,all,{vnode,{queued,0,{<<"notthere">>,<<"2">>}}}}},
 {kvget,
     {trace,all,
         {vnode,
             {waiting,
                 1438665674247607560106752257205091097473808596992}}}},
 %% 2: fallback worker sends notfound output to slowmap vnode
 %% K1: W=idle, Q=[], B=[]
 %% K2: W=idle, Q=[], B=[]
 %% M1: W={notfound,1}, Q=[{notfound,2}], B=[]
 {slowmap,
     {trace,all,
         {vnode,
             {queued,593735040165679310520246963290989976735222595584,
                 {{error,notfound},{<<"notthere">>,<<"2">>},undefined}}}}},
 {kvget,{trace,all,{vnode,{waiting,0}}}},
 %% 3: send third input to primary kvget vnode
 %% K1: W=idle, Q=[{notthere 3}], B=[]
 %% K2: W=idle, Q=[], B=[]
 %% M1: W={notfound,1}, Q=[{notfound,2}], B=[]
 {kvget,
     {trace,all,
         {vnode,
             {queued,1438665674247607560106752257205091097473808596992,
                 {<<"notthere">>,<<"3">>}}}}},
 %% 3: primary kvget worker forwards third input to fallback vnode
 %% K1: W=idle, Q=[], B=[]
 %% K2: W=idle, Q=[{notthere 3}], B=[]
 %% M1: W={notfound,1}, Q=[{notfound,2}], B=[]
 {kvget,
     {trace,all,{vnode,{queued,0,{<<"notthere">>,<<"3">>}}}}},
 {kvget,
     {trace,all,
         {vnode,
             {waiting,
                 1438665674247607560106752257205091097473808596992}}}},
 %% 3: falback worker blocks on sending third notfound to slowmap vnode
 %% K1: W=idle, Q=[], B=[]
 %% K2: W={notthere 3}, Q=[], B=[]
 %% M1: W={notfound,1}, Q=[{notfound,2}], B=[{notthere 3}]
 {slowmap,
     {trace,all,
         {vnode,
             {queue_full,
                 593735040165679310520246963290989976735222595584,
                 {{error,notfound},{<<"notthere">>,<<"3">>},undefined}}}}},
 %% 4: send fourth input to primary kvget vnode
 %% K1: W=idle, Q=[{notthere, 4}], B=[]
 %% K2: W={notthere 3}, Q=[], B=[]
 %% M1: W={notfound,1}, Q=[{notfound,2}], B=[{notfound, 3}]
 {kvget,
     {trace,all,
         {vnode,
             {queued,1438665674247607560106752257205091097473808596992,
                 {<<"notthere">>,<<"4">>}}}}},
 %% 4: primary kvget worker forwards forth input to fallback vnode
 %% K1: W=idle, Q=[], B=[]
 %% K2: W={notthere 3}, Q=[{notthere, 4}], B=[]
 %% M1: W={notfound,1}, Q=[{notfound,2}], B=[{notfound, 3}]
 {kvget,
     {trace,all,{vnode,{queued,0,{<<"notthere">>,<<"4">>}}}}},
 {kvget,
     {trace,all,
         {vnode,
             {waiting,
                 1438665674247607560106752257205091097473808596992}}}},
 %% 5: send fifth input to primary kvget vnode
 %% K1: W=idle, Q=[{notthere 5}], B=[]
 %% K2: W={notthere 3}, Q=[{notthere, 4}], B=[]
 %% M1: W={notfound,1}, Q=[{notfound,2}], B=[{notfound, 3}]
 {kvget,
     {trace,all,
         {vnode,
             {queued,1438665674247607560106752257205091097473808596992,
                 {<<"notthere">>,<<"5">>}}}}},
 %% 5: primary kvget worker fails to forward input to fallback
 %%    because fallback's queue is full
 %% K1: W={notthere 5}, Q=[], B=[]
 %% K2: W={notthere 3}, Q=[{notthere, 4}], B=[]
 %% M1: W={notfound,1}, Q=[{notfound,2}], B=[{notfound, 3}]
 {kvget,
     {trace,all,
         {error,
             [{module,riak_kv_pipe_get},
              {partition,
                  1438665674247607560106752257205091097473808596992},
              {details,
                  [{fitting,
                       #fitting{
                           pid = <0.1434.0>,ref = #Ref<0.0.0.34313>,
                           chashfun = 
                               <<249,252,247,9,220,75,44,145,58,167,246,118,61,
                                 183,57,33,176,79,180,32>>,
                           nval = 2}},
                   {name,kvget},
                   {module,riak_kv_pipe_get},
                   {arg,undefined},
                   {output,
                       #fitting{
                           pid = <0.1433.0>,ref = #Ref<0.0.0.34313>,
                           chashfun = 
                               <<102,4,176,137,137,246,153,214,136,172,94,97,
                                 152,118,215,52,233,8,52,4>>,
                           nval = 1}},
                   {options,
                       [{sink,
                            #fitting{
                                pid = <0.1223.0>,ref = #Ref<0.0.0.34313>,
                                chashfun = sink,nval = undefined}},
                        {log,sink},
                        {trace,all}]},
                   {q_limit,1}]},
              {type,forward_preflist},
              {error,[timeout]},
              {input,{<<"notthere">>,<<"5">>}},
              {modstate,
                  {state,1438665674247607560106752257205091097473808596992,
                      #fitting_details{
                          fitting = 
                              #fitting{
                                  pid = <0.1434.0>,ref = #Ref<0.0.0.34313>,
                                  chashfun = 
                                      <<249,252,247,9,220,75,44,145,58,167,246,
                                        118,61,183,57,33,176,79,180,32>>,
                                  nval = 2},
                          name = kvget,module = riak_kv_pipe_get,
                          arg = undefined,
                          output = 
                              #fitting{
                                  pid = <0.1433.0>,ref = #Ref<0.0.0.34313>,
                                  chashfun = 
                                      <<102,4,176,137,137,246,153,214,136,172,
                                        94,97,152,118,215,52,233,8,52,4>>,
                                  nval = 1},
                          options = 
                              [{sink,
                                   #fitting{
                                       pid = <0.1223.0>,
                                       ref = #Ref<0.0.0.34313>,
                                       chashfun = sink,nval = undefined}},
                               {log,sink},
                               {trace,all}],
                          q_limit = 1}}},
              {stack,[]}]}}},
 {kvget,
     {trace,all,
         {vnode,
             {waiting,
                 1438665674247607560106752257205091097473808596992}}}}]

Moar stats for riak

Add the following from https://github.com/basho/cabal/issues/15 for 1.3 release:

PUT FSM timings
GET FSM timings
Per Vnode request operation and duration
Read Repair Stats (count, sent to dest node, repair reason, dest type (fallback/primary)
Riak_api pb_connects / pb_active should use supervisor:count_children() since a crashing connection will not call the terminate fun so active does not shrink appropriately.

Some useful REST API features should be implemented on PBC API

I believe all users who needs more performance would like to process all operations via PBC API.
However, user cannot do so due to lack of some APIs.

Backet Properrties of PBC API should have more fields
PBC API is strictly limited to access to Backet Properties.
That is, we have to use REST API for Backet management.
Some applications creates per-user Backets.
This indicates that we can write much simpler and faster application by creating/deleting a large amount of Backets.
PBC API should support Link Walking
It would be great if we could use Link Walking, which is really useful feature.
PBC API RpbGetServerInfoResp should return the same information to REST API /stats
RpbGetServerInfoResp returns so poor information that we can't use this API at all.
/stats includes many useful information which is required for client to coordinate with Riak.
PBC API user who wants fast client needs this information.

Add utility to delete per bucket stats [JIRA: RIAK-2808]

See review comment here for details and context #414 (comment)

MapReduce query fails if only a single JavaScript reduce phase is specified

https://issues.basho.com/show_bug.cgi?id=1034

To reproduce:

Create an object

curl http://127.0.0.1:8098/riak/test/key -XPUT -H 'content-type: text/plain' -d 
''

Run a MapReduce query with a keyfilter and a single reduce phase

curl http://127.0.0.1:8098/mapred -XPOST -d '{"inputs": {"key_filters": 
[["matches", ".*"]], "bucket":"test"}, "query": [{"reduce": {"source": 
"function() { return [1] }", "language": "javascript", "keep": true}}]}'

Run a MapReduce query with a full bucket key listing and a single reduce phase

curl http://127.0.0.1:8098/mapred -XPOST -d '{"inputs": "test", "query": 
[{"reduce": {"source": "function() { return [1] }", "language": "javascript", 
"keep": true}}]}'

Run a MapReduce query with explicit bucket/key pairs and a single reduce phase

curl http://127.0.0.1:8098/mapred -XPOST -d '{"inputs": [["test","key"]], 
"query": [{"reduce": {"source": "function() { return [1] }", "language": 
"javascript", "keep": true}}]}'

Expected:
Output from reduce phase

Actual:

{"error":"bad_json"}

Error report from log file:

=ERROR REPORT==== 3-Mar-2011::14:30:07 ===
** State machine <0.17958.0> terminating 
** Last event in was inputs_done
** When State == executing
**      Data  == {state,0,riak_kv_reduce_phase,
                     {state,
                         {javascript,
                             {reduce,
                                 {jsanon,<<"function() { return [1] }">>},
                                 none,true}},
                         [],
                         [{<<"test">>,<<"key">>},1]},
                     true,true,undefined,
                     [<0.17957.0>],
                     undefined,0,0,<0.17956.0>,66000}
** Reason for termination = 
** {error,bad_json}

stats should not fail

It'd be nice if we tracked down all of the lingering timeouts that can potentially cause a stats call to fail. We often error out with a 500 with all of our stats contained in the body of it. It would be better if some stats cannot be fetched in a timely manner that we just drop or say 'couldn't fetch' for those stats, instead of failing the entire call. Partial information is better than none, and this sounds easier than teaching monitoring solutions to parse our error messages for whatever data they may contain.