
riak_repl's Introduction

riak_repl

Riak MDC Replication


Pull Request template

Testing

  • manual verification of code
  • eunit (w/ gist of output)
  • EQC (w/ gist of output)
  • riak_test (w/ gist of output)
  • Dialyzer
  • XRef
  • Coverage reports

Documentation

  • internal docs (design docs)
  • external docs (docs.basho.com)
  • man pages

New Feature Deliverables

  • design documentation + diagrams
    • nothing formal
    • to help out during support, "this is how xyz works"
  • eunit tests
  • riak_tests
  • EQC + Pulse tests
  • tests at scale and under heavy load
    • Boston Cluster or AWS
  • notes for public documentation
    • for the docs team

BEAM release process

  1. git tag the specific commit(s) that will be released
  2. run all eunit tests and EQC tests; store the output in a gist
  3. if possible, run all riak_tests for replication
  4. record specific commit(s) that the beam targets in a README.txt file
  5. create a tar file.
  • Note that tar on OS X will include hidden AppleDouble (._*) metadata files in the archive by default. Either set COPYFILE_DISABLE=1 (or build/test the beams on Linux), or pipe the exact files you want into the tar with 'find', as in https://github.com/basho/node_package/blob/develop/priv/templates/fbsd/Makefile#L27 (an example of using -rf with a pipe).
  • include the README.txt file from the step above
  6. once the .tar is built, calculate an MD5 checksum to pass along with the file (see the snippet after this list)
  7. create an entry on the https://github.com/basho/internal_wiki/wiki/Releases page
    • include:
      • link to the gist output
      • version of Erlang the beams were built with
      • MD5 of the file
      • link to compiled beams
  8. notify client services + TAMs
  9. port the PR to the develop branch, if applicable
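
For step 6, the MD5 can be produced with the usual md5/md5sum command-line tools; if it's handier, the same checksum can also be computed from an Erlang shell (the tar file name below is illustrative):

{ok, Bin} = file:read_file("riak_repl_beams.tar"),
io:format("~32.16.0b~n", [binary:decode_unsigned(erlang:md5(Bin))]).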

riak_repl's Issues

Zombie server processes

The send_timeout_close option seems to generate zombie server processes when exposed to bad network conditions.

BNW Bootstrap mode

REPL to an empty cluster is incredibly inefficient. Provide a bootstrap command that blasts all data to a second cluster.

realtime replication does not automatically connect to a sink when a node is added

When a new node is joined to an existing cluster, the new node does not automatically connect to a node in the sink cluster when RT replication is enabled. This makes it impossible to add nodes to a running cluster without having to stop/start RT replication, which means objects written to the new node will not be RT replicated. This is bad. This should be fixed.

BNW real-time postcommit hook called on "old" style cluster

@kellymclaughlin found an issue while running a replication riak_test against CS, using old 1.2 repo configuration (what is now called "default" - not 1.3 BNW).

A bunch of repeated errors showed up in the logs like this:

2013-02-06 16:34:05.681 [debug] <0.5494.0>@riak_kv_put_fsm:decode_postcommit:781 Problem invoking post-commit hook riak_repl2_rt:postcommit -> error:badarg

Looking into it, the only way I can see this getting hit is when BNW postcommit has been installed into the bucket properties, because that is a riak_repl2_rt:postcommit() call. The badarg is a mystery to me, though.

When I look at the code path, it includes a call here:
get_modes(Ring) ->
    RC = get_repl_config(Ring),
    case dict:find(repl_modes, RC) of
        {ok, ReplModes} -> ReplModes;
        error ->
            %% default to mixed modes
            [mode_repl12, mode_repl13]
    end.

Is it possible that we aren't finding repl_modes in the ring? Makes me wonder if the default should really be mixed modes until we officially support BNW in 1.4.

This doesn't seem to be causing any tests to fail, so it's not data loss or corruption and thus isn't being marked as a MUST for 1.3, but I think we should investigate for 1.4.

All RT source nodes connect to same sink when bouncing the source cluster

from ZenDesk https://help.basho.com/tickets/3441

Apple had a 48-node cluster (the source) and bounced each node in 5 minute intervals. After each bounced node came back up, the realtime sync source connected to the same realtime sync sink on the remote cluster (the sink). This is bad because it could have overloaded the one sink node. After all nodes were bounced, every source was connected to the same single sink node.

The reason is that the cluster manager was giving out the same list of IP addresses to each source node as it came back up, as the locator for realtime sync. Even though the round-robin balancer was working, it was getting reset every 5 seconds by the remote-node polling update (the locator polls the remote cluster's IP addresses every 5 seconds). The current cluster manager assumes the list returned by the remote cluster is ordered from least busy to most busy, but that is not actually implemented yet. So the remote just returns the same list every time, and hence it's not actually balanced. Bah.

Two solutions come to mind:

  1. have the remote send over a list that correctly reflects busy-ness
  2. have the local cluster not re-order the list on updates. Just store the IP addresses

Solution 2 is easier I think. But we'll see.
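
A minimal sketch of solution 2, assuming a hypothetical state record in the cluster manager (the module, record, and function names here are illustrative, not the actual riak_core_cluster_mgr code): only reset the round-robin rotation when the sink membership actually changes, so the 5-second poll can't keep snapping every source back to the head of the list.

-module(sink_rotation_sketch).
-export([update_sink_addrs/2]).

-record(state, {sink_addrs = [],   %% last membership list received from the sink
                rotation   = []}). %% current round-robin order

update_sink_addrs(NewAddrs, #state{sink_addrs = Old} = State) ->
    case lists:usort(NewAddrs) =:= lists:usort(Old) of
        true ->
            %% same membership as before: keep the current rotation position
            State;
        false ->
            %% membership actually changed: adopt the new list, restart rotation
            State#state{sink_addrs = NewAddrs, rotation = NewAddrs}
    end.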

Connection Manager

Connection Manager is a "brave new world" component for replication that we believe will also benefit "core".

Its job is to manage connections and sub-protocols to a remote cluster.

The following is an out-of-date documentation page:
https://github.com/basho/internal_wiki/wiki/Replication-Brave-New-World

Connection Manager will move to core after the Apple delivery, which is why all the modules are named "riak_core_*.erl".
The relevant files are as follows. I know it's a lot.

riak_repl/src/riak_core_cluster_conn.erl
riak_repl/src/riak_core_cluster_mgr_sup.erl
riak_repl/src/riak_core_service_mgr.erl
riak_repl/src/riak_core_cluster_conn_sup.erl
riak_repl/src/riak_core_connection.erl
riak_repl/src/riak_core_cluster_mgr.erl
riak_repl/src/riak_core_connection_mgr.erl

Slow connections can cause the creation of many repl server processes

Description

It appears that slow WAN connections can cause communication between two data centers to fail but the server process is never killed. The client data center opens a new connection which creates a new server process. The server processes consume more and more memory over time until the node runs out of memory.

Steps to reproduce

  • Start two servers
  • Install Riak EE
  • Set up MDC replication between the servers
  • Add several large objects (1MB)
  • Use netem to add artificial latency to the connection (6000ms)
  • Wait for processes to build up

Vagrant example

The following Vagrantfile and provisioning script will set up the above scenario. You have to replace RIAK_EE_URL in the provisioning script with the appropriate URL for the amd64 deb package.

Vagrantfile

# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant::Config.run do |config|
        [{:name => :one, :ip => "192.168.33.11", :other_ip => "192.168.33.12"},
         {:name => :two, :ip => "192.168.33.12", :other_ip => "192.168.33.11"}]
        .each do |m|
                config.vm.define m[:name] do |c|
                        c.vm.box = "lucid64"
                        c.vm.box_url = "http://files.vagrantup.com/lucid64.box"
                        c.vm.network :hostonly, m[:ip]
                        c.vm.provision :shell do |s|
                                s.path = "setup-riak-ee.sh"
                                s.args = "#{m[:ip]} #{m[:other_ip]} &"
                        end
                end
         end
end

setup-riak-ee.sh

#!/bin/bash

IP=$1
OTHER_IP=$2

sudo apt-get -y install curl

echo "Download riak-ee"
curl --silent RIAK_EE_URL -o riak-ee_1.1.1-1_amd64.deb

echo "Install riak-ee"
sudo dpkg -i riak-ee_1.1.1-1_amd64.deb

echo "Start riak-ee"
riak start

echo "Wait for riak-ee"
riak-admin wait-for-service riak_kv [email protected]

echo "Add data"
dd if=/dev/urandom count=1024 bs=1024 of=1MB_file
for i in {1..10}
do
        curl --silent http://127.0.0.1:8098/riak/b/$i -XPUT -d @1MB_file \
                -H 'content-type:text/plain'
done

echo "Add delay"
tc qdisc add dev eth1 root netem delay 6000ms

echo "Add listener"
riak-repl add-listener [email protected] $IP 9010

echo "Add site"
riak-repl add-site $OTHER_IP 9010 $OTHER_IP

Cancel fullsync hit undefined function

I tried cancelling a fullsync when it got stuck and got this error report:

([email protected])19> riak_repl_console:cancel_fullsync([]).
ok
([email protected])20> 21:13:53.820 [error] gen_fsm <0.4187.0> in state diff_bloom terminated with reason: call to undefined function riak_repl_keylist_server:stop/1 from riak_repl_keylist_server:diff_bloom/2
21:13:53.822 [error] CRASH REPORT Process <0.4187.0> with 3 neighbours exited with reason: call to undefined function riak_repl_keylist_server:stop(<0.4225.0>) in gen_fsm:terminate/7
21:13:53.824 [error] Supervisor ranch_conns_sup had child ranch_conns_sup started with {ranch_conns_sup,start_protocol,undefined} at <0.4186.0> exit with reason call to undefined function riak_repl_keylist_server:stop(<0.4225.0>) in context child_terminated
21:13:54.014 [info] Using fullsync strategy riak_repl_keylist_server.

I think I hit this...
https://github.com/basho/riak_repl/blob/master/src/riak_repl_keylist_server.erl#L368
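
A hedged sketch of the missing piece, assuming stop/1 should be a synchronous all-state event on the keylist-server FSM (illustrative only, not the fix that actually shipped):

%% exported from riak_repl_keylist_server:
stop(Pid) ->
    gen_fsm:sync_send_all_state_event(Pid, stop, infinity).

%% ...with a matching clause in handle_sync_event/4 so any state,
%% including diff_bloom, can shut down cleanly:
handle_sync_event(stop, _From, _StateName, State) ->
    {stop, normal, ok, State}.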

add support to repl for new small riak-object format

riak object smallification is officially part of the riak core cabal, but I'm putting an issue here under replication so that we can track our part of it.

Replication needs to work with mixed cluster versions where the riak object format may be different on the two clusters.

EQC test that runs random commands

There are so many cases now where we need to test replication in the face of other kinds of riak cluster operations. The latest that comes to mind is binary object downgrades. That's just one example.

So, how about we write a riak test that fires random, but legal, operations at the cluster while we do random kinds of replication: mixtures of modes, etc. This is an idea that Andrew had some time ago, and it's looking really useful now.
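
A hedged skeleton of what such an eqc_statem model might look like; the command bodies and state fields here are placeholders, not real cluster or replication calls:

-module(repl_random_ops_eqc).

-include_lib("eqc/include/eqc.hrl").
-include_lib("eqc/include/eqc_statem.hrl").

-compile(export_all).

-record(model, {mode = mode_repl13}).

initial_state() -> #model{}.

%% interleave ordinary cluster operations with replication-mode changes
command(_S) ->
    oneof([{call, ?MODULE, write_object, [binary(), binary()]},
           {call, ?MODULE, switch_mode, [elements([mode_repl12, mode_repl13])]},
           {call, ?MODULE, downgrade_object_format, []}]).

precondition(_S, _Call) -> true.

next_state(S, _Res, {call, _, switch_mode, [Mode]}) -> S#model{mode = Mode};
next_state(S, _Res, _Call) -> S.

postcondition(_S, _Call, _Res) -> true.

prop_random_repl_ops() ->
    ?FORALL(Cmds, eqc_statem:commands(?MODULE),
            begin
                {_History, _Final, Result} = eqc_statem:run_commands(?MODULE, Cmds),
                Result =:= ok
            end).

%% placeholder implementations; a real test would drive riak_test nodes instead
write_object(_Bucket, _Key) -> ok.
switch_mode(_Mode) -> ok.
downgrade_object_format() -> ok.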

Add mechanism to indicate if objects were dropped and a fullsync is needed

Currently, if we drop objects from realtime replication (BNW), we may in some cases be able to observe the drop. Relevant error messages:

  • No nodes available to migrate replication data (on shutdown)
  • No available nodes to proxy objects to (on shutdown)
  • rtq proxy target is down (on shutdown)

However, if a node crashes during realtime replication, we can't write to the logfile and there is no indication of dropped objects on restart.

At least one customer (Anya) wants to be able to detect the need for a full sync by monitoring repl_stats (json) in their operational environment, since they have already built mechanisms to trigger alarms based off polling each node.

We already have a dropped object count in the status. This monitors drops due to connection loss, connection errors, overloaded nodes, and anything else that happens while running.

We may need to add an entry to stats for dropped objects during controlled or uncontrolled shutdowns. A completely crashed node that never comes back obviously can't indicate drops. But perhaps the other end of the connection can mark the loss of connection.

`riak-repl start-fullsync` does nothing on non-leader node

Running riak-repl start-fullsync on a node in the cluster that is not the replication leader results in no messages and no actions. Preferable behavior would be to either communicate with the leader to start a fullsync, or return a message stating that no action is being taken and that the command should be run on the leader (and/or log the message).
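
A rough sketch of the suggested behavior. This assumes a riak_repl_leader:leader_node/0 style call is available to find the leader; the exact leader API may differ, so treat this as illustrative only:

start_fullsync() ->
    case riak_repl_leader:leader_node() of
        undefined ->
            io:format("No replication leader elected; cannot start fullsync~n");
        Node when Node =:= node() ->
            riak_repl_console:start_fullsync([]);
        Leader ->
            io:format("This node is not the replication leader; "
                      "run 'riak-repl start-fullsync' on ~p~n", [Leader])
    end.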

Add more debug info for proxy_get client and server modules

The proxy_get implementation could benefit from having more debug messages that can be traced at run time, mostly for tracing requests to ensure that we are hitting all three phases: the protocol buffer "coordinator" for the get requests, the TCP client, and the TCP server.
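
For example, debug lines like the ones below at each of the three phases could then be switched on at runtime with lager:trace_console([{module, Mod}], debug), without restarting the node (the variable names are illustrative):

%% request side (PB "coordinator" / TCP client):
lager:debug("proxy_get request for ~p/~p from cluster ~p", [Bucket, Key, ClusterId])

%% response side (TCP server):
lager:debug("proxy_get response for ~p/~p: ~p", [Bucket, Key, Result])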

Call to riak_repl_wm_stats:jsonify_stats fails during fullsync

See ZD ticket 3634 for client detail. When a fullsync is running, calls to riak_repl_wm_stats:jsonify_stats will fail with e.g.

{error,{error,function_clause,[{riak_repl_wm_stats,jsonify_stats,[[{fullsync,63,left},...

This is called during webmachine calls to '/riak-repl/stats'.
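
A hedged sketch of the kind of defensive clause the real fix needs; the function below is a simplified stand-in for riak_repl_wm_stats:jsonify_stats, and the output key name is made up:

jsonify_stats([], Acc) ->
    lists:reverse(Acc);
jsonify_stats([{fullsync, Left, left} | Rest], Acc) ->
    %% the 3-tuple that appears in the stats list while a fullsync is running
    jsonify_stats(Rest, [{fullsync_partitions_left, Left} | Acc]);
jsonify_stats([{Key, Value} | Rest], Acc) ->
    jsonify_stats(Rest, [{Key, Value} | Acc]).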

1.2.1 CRASH in Tcp error handler... called as tcp ++ "_error"

In the 1.2.1 code base, we had a bug where a TCP failure could result in bogus error-message formatting that would crash the replication service. Here is what the log file showed:

2013-03-06 02:39:20.145 UTC [error] <0.1806.0> Supervisor poolboy_sup had child riak_core_vnode_worker started with {riak_core_vnode_worker,start_link,undefined} at <0.1812.0> exit with reason {{badarg,[{erlang,'++',[tcp,"_error"],[]},{riak_repl_tcp_server,send,3,[{file,"src/riak_repl_tcp_server.erl"},{line,417}]},{riak_repl_keylist_server,diff_bloom,3,[{file,"src/riak_repl_keylist_server.erl"},{line,490}]},{gen_fsm,handle_msg,7,[{file,"gen_fsm.erl"},{line,494}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]},{gen_fsm,sync_send_event,[<16329.11597.0>,{diff_obj,{r_object,<<"dev/faketest16_22">>,<<"test-blobitory-management-collection/1c60744e-033d-4c69-2350-693...">>,...}},...]}} in context child_terminated
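
The badarg is '++' being applied to the atom tcp; the left operand of '++' must be a list. A minimal sketch of that kind of fix, with a hypothetical helper name (not necessarily the patch that was actually applied):

%% builds e.g. tcp_error / ssl_error from the transport atom
transport_error_atom(Transport) when is_atom(Transport) ->
    list_to_atom(atom_to_list(Transport) ++ "_error").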

Andrew already had a patch for that, which I applied as a cherry pick against the 1.2.1 code base, on a branch of repl: cet-apply-error-msg-fix-to-1.2. This resolved the bug. Beams were sent to the customer.

We decided not to merge this fix into the 1.2 code base because it would be confusing to us later in trying to decide which fixes were sent to which customers. Maybe the right thing to do is to create a 1.2.x release and merge it against that.

ZenDesk ticket...
https://basho.zendesk.com/agent/#/tickets/3998

The FIX:
https://github.com/basho/riak_repl/tree/cet-apply-error-msg-fix-to-1.2

The TEST: (Thanks to @metadave Dave Parfitt)
https://gist.github.com/metadave/5cb961f5ea68054e5b5d

Investigate the following REPL error message

To reproduce:

  • Compile with R15B01
  • make devrel
  • dev1/bin/riak-repl add-listener dev1@127.0.0.1 127.0.0.1 8080
  • dev4/bin/riak-repl add-site 127.0.0.1 8080 foo
  • Load basho_bench data into dev1.

2012-05-31 10:58:54.994 [info] <0.1412.0>@riak_repl_keylist_server:build_keylist:153 Full-sync with site "foo"; built keylist for 182687704666362864775460604089535377456991567872 (built in 0.07 secs)
2012-05-31 10:58:55.001 [error] <0.1412.0> gen_fsm <0.1412.0> in state wait_keylist terminated with reason: no function clause matching riak_repl_keylist_server:wait_keylist({#Ref<0.0.0.6404>,keylist_built}, {state,"foo",{sslsocket,new_ssl,<0.1411.0>},ranch_ssl,"./data/riak_repl/work/7224057/foo-127.0.0...",...}) line 172
2012-05-31 10:58:55.003 [error] <0.1412.0> CRASH REPORT Process <0.1412.0> with 1 neighbours exited with reason: no function clause matching riak_repl_keylist_server:wait_keylist({#Ref<0.0.0.6404>,keylist_built}, {state,"foo",{sslsocket,new_ssl,<0.1411.0>},ranch_ssl,"./data/riak_repl/work/7224057/foo-127.0.0...",...}) line 172 in gen_fsm:terminate/7 line 611
2012-05-31 10:58:55.004 [error] <0.980.0> Supervisor ranch_conns_sup had child ranch_conns_sup started with {ranch_conns_sup,start_protocol,undefined} at <0.1410.0> exit with reason no function clause matching riak_repl_keylist_server:wait_keylist({#Ref<0.0.0.6404>,keylist_built}, {state,"foo",{sslsocket,new_ssl,<0.1411.0>},ranch_ssl,"./data/riak_repl/work/7224057/foo-127.0.0...",...}) line 172 in context child_terminated
2012-05-31 10:58:55.040 [info] <0.1533.0>@riak_repl_tcp_server:handle_msg:249 Using fullsync strategy riak_repl_keylist_server.
2012-05-31 10:58:55.041 [info] <0.1533.0>@riak_repl_tcp_server:handle_msg:279 Full-sync on connect
2012-05-31 10:58:55.059 [info] <0.1535.0>@riak_repl_keylist_server:wait_for_partition:120 Full-sync with site "foo"; doing fullsync for 182687704666362864775460604089535377456991567872
2012-05-31 10:58:55.059 [info] <0.1535.0>@riak_repl_keylist_server:build_keylist:139 Full-sync with site "foo"; building keylist for 182687704666362864775460604089535377456991567872
2012-05-31 10:58:55.104 [info] <0.1543.0>@riak_repl_fullsync_helper:handle_cast:284 Sorting keylist "./data/riak_repl/work/7224057/foo-127.0.0.1:8080-127.0.0.1:57843/182687704666362864775460604089535377456991567872.ours.sterm"
2012-05-31 10:58:55.106 [info] <0.1543.0>@riak_repl_fullsync_helper:handle_cast:287 Sorted ./data/riak_repl/work/7224057/foo-127.0.0.1:8080-127.0.0.1:57843/182687704666362864775460604089535377456991567872.ours.sterm in 0.00 seconds

Stale kl_exchange message during fullsync


/home/andrew/riak_test/rt/dev/dev6/log/console.log:2012-08-10 14:23:38.152 [warning] <0.2058.0>@riak_repl_keylist_client:request_partition:153 Full-sync with site "site1"; skipping partition 548063113999088594326381812268606132370974703616 because of error node_not_available
/home/andrew/riak_test/rt/dev/dev6/log/console.log:2012-08-10 14:23:38.186 [error] <0.2058.0> gen_fsm <0.2058.0> in state request_partition terminated with reason: no function clause matching riak_repl_keylist_client:request_partition({kl_exchange,548063113999088594326381812268606132370974703616}, {state,"site1",#Port<0.11954>,ranch_tcp,"./data/riak_repl/work/112824079/site1-127.0.0.1:52588-1...",...})
/home/andrew/riak_test/rt/dev/dev6/log/console.log:2012-08-10 14:23:38.199 [error] <0.2058.0> CRASH REPORT Process <0.2058.0> with 1 neighbours exited with reason: no function clause matching riak_repl_keylist_client:request_partition({kl_exchange,548063113999088594326381812268606132370974703616}, {state,"site1",#Port<0.11954>,ranch_tcp,"./data/riak_repl/work/112824079/site1-127.0.0.1:52588-1...",...}) in gen_fsm:terminate/7
/home/andrew/riak_test/rt/dev/dev6/log/console.log:2012-08-10 14:23:38.207 [error] <0.1582.0> Supervisor riak_repl_client_sup had child "site1" started with riak_repl_tcp_client:start_link("site1") at <0.2048.0> exit with reason no function clause matching riak_repl_keylist_client:request_partition({kl_exchange,548063113999088594326381812268606132370974703616}, {state,"site1",#Port<0.11954>,ranch_tcp,"./data/riak_repl/work/112824079/site1-127.0.0.1:52588-1...",...}) in context child_terminated

The request_partition state should ignore stale kl_exchange messages.
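
Something along these lines in riak_repl_keylist_client would do it; a sketch only, assuming the state record tracks the partition currently being exchanged (the field name here is hypothetical):

%% ignore a keylist exchange for anything but the partition in progress
request_partition({kl_exchange, Partition},
                  #state{partition = Current} = State) when Partition =/= Current ->
    {next_state, request_partition, State}.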

Clients do not handle gen_tcp:send/2 errors

The riak_repl_tcp_client does not handle the return from gen_tcp:send/2 which may return {error, Reason}. This can cause replication clients to hang around indefinitely without replication actually occurring.

A module has been provided below to help observe current client behavior. To observe the issue you will need to set up 2 Riak nodes and connect them via riak-repl. Once that is set up, you can use the below module as follows:

riak_repl_client_test:test_client_ports().                                 
[{<0.8179.0>,ok},{<0.8182.0>,ok}]

riak_repl_client_test:close_client_ports().                                  
[{<0.8179.0>,true},{<0.8182.0>,true}]

riak_repl_client_test:test_client_ports(). 
[{<0.8179.0>,{error,closed}},{<0.8182.0>,{error,closed}}]

riak_repl_client_test:test_client_ports_through_client().
[{<0.8179.0>,
  {status,[{node,'[email protected]'},
           {site,"node2-to-node1"},
           {strategy,riak_repl_keylist_client},
           {fullsync_worker,<0.8184.0>},
           {put_pool_size,5},
           {connected,"127.0.0.1",9012},
           {state,wait_for_fullsync}]}},
 {<0.8182.0>,
  {status,[{node,'[email protected]'},
           {site,"node3-to-node1"},
           {strategy,riak_repl_keylist_client},
           {fullsync_worker,<0.8186.0>},
           {put_pool_size,5},
           {connected,"127.0.0.1",9013},
           {state,wait_for_fullsync}]}}]

The last command, riak_repl_client_test:test_client_ports_through_client/0, forces the client to send a keepalive_ack over the closed port. The expectation is that this would cause the client to fail and reconnect. The client does not fail; it ignores the {error, closed} returned from gen_tcp:send/2. (A sketch of a possible fix follows the module below.)

-module(riak_repl_client_test).

%% Helpers to observe riak_repl_tcp_client behavior: find the client
%% processes and their TCP ports, close the ports out from under them,
%% then poke the clients to see whether they notice.

-compile(export_all).

test_client_ports_through_client() ->
    Pids = get_client_pids(),
    [{Pid, send_keepalive_through_client(Pid)} || Pid <- Pids].

test_client_ports() ->
    Ports = get_client_ports(),
    [{Pid, send_keepalive(Port)} || {Pid, Port} <- Ports].

close_client_ports() ->
    Ports = get_client_ports(),
    [{Pid, close_port(Port)} || {Pid, Port} <- Ports].

get_client_ports() ->
    Pids = get_client_pids(),
    [{Pid, get_port_from_pid(Pid)} || Pid <- Pids].

get_client_pids() ->
   [ Pid2 || {riak_repl_client_sup,P,_,_} <- supervisor:which_children(riak_repl_sup),
           {_,Pid2,_,_} <- supervisor:which_children(P)].

get_port_from_pid(Pid) ->
    get_port_from_status(sys:get_status(Pid)).

get_port_from_status(Status) ->
    {_, _, _, [_,running,_,_,[_,_,{data,[{_,State}]}]]} = Status,
    element(5, State).

close_port(Port) ->
    erlang:port_close(Port).

send_keepalive(Port) ->
    gen_tcp:send(Port, term_to_binary(keepalive)).

send_keepalive_through_client(Pid) ->
    Port = get_port_from_pid(Pid),
    Pid ! {tcp, Port, term_to_binary(keepalive)},
    riak_repl_tcp_client:status(Pid).
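
A hedged sketch of the kind of check the client could wrap around its sends so that an {error, Reason} from gen_tcp:send/2 triggers a disconnect/reconnect instead of being silently ignored (the message sent to self() is illustrative, not the client's real protocol):

send_or_bail(Socket, Data) ->
    case gen_tcp:send(Socket, Data) of
        ok ->
            ok;
        {error, Reason} = Error ->
            %% surface the failure so the client can tear down and reconnect
            self() ! {send_failed, Socket, Reason},
            Error
    end.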

Typo in lager message

Saw this manifested in some repl logs today:

In src/riak_repl_keylist_client.erl, line 225: "exhanging differences for"

Server Stats and Start/Stop/Pause FullSync Do Not Work

Previously, riak_repl_server_sup handled riak_repl_tcp_server. When the fullsync and stats commands were run from riak_repl_console, the following was run:

server_pids() ->
    [P || {_,P,_,_} <- supervisor:which_children(riak_repl_server_sup), P /= undefined].

This listed the riak_repl_tcp_server children, and then the commands were sent like so:

start_fullsync([]) ->
    [riak_repl_tcp_server:start_fullsync(Pid) || Pid <- server_pids()],
    ok.

However, this has changed. Instead of riak_repl_server_sup:start_server/1, the listener is now started via riak_repl_listener_sup:ensure_listeners/1, which calls riak_repl_listener_sup:start_listener/1:

start_listener(Listener = #repl_listener{listen_addr={IP, Port}}) ->
    case riak_repl_util:valid_host_ip(IP) of
        true ->
            lager:info("Starting replication listener on ~s:~p",
                [IP, Port]),
            {ok, RawAddress} = inet_parse:address(IP),
            ranch:start_listener(Listener, 10, ranch_tcp,
                [{ip, RawAddress}, {port, Port}], riak_repl_tcp_server, []);
        _ ->
            lager:error("Cannot start replication listener "
                "on ~s:~p - invalid address.",
                [IP, Port])
    end.

You'll note that it now calls ranch:start_listener/6. ranch is a new dependency that manages socket acceptor pools. The source is located at https://github.com/extend/ranch.

With this, calling supervisor:which_children(riak_repl_server_sup). results in nothing:

[]

This is because children are no longer spawned under riak_repl_server_sup.

However, unlike riak_repl_server_sup, which had riak_repl_tcp_server as a direct child, ranch structures its supervision tree differently.

Note the following:

supervisor:which_children(ranch_sup).               
[{{ranch_listener_sup,#repl_listener{nodename = '[email protected]',
                                     listen_addr = {"127.0.0.1",9010}}},
  <0.31129.0>,supervisor,
  [ranch_listener_sup]}]

This has a ranch_listener_sup as a child. Digging further in:

supervisor:which_children(c:pid(0,31129,0)).
[{ranch_acceptors_sup,<0.31132.0>,supervisor,
                      [ranch_acceptors_sup]},
 {ranch_conns_sup,<0.31131.0>,supervisor,[ranch_conns_sup]},
 {ranch_listener,<0.31130.0>,worker,[ranch_listener]}]

It turns out that riak_repl_tcp_server is spawned from ranch_conns_sup:

supervisor:which_children(c:pid(0,31131,0)).
[{undefined,<0.31145.0>,worker,[ranch_conns_sup]}]

We can then query riak_repl_tcp_server with this pid:

riak_repl_tcp_server:status(c:pid(0,31145,0)).
{status,[{node,'[email protected]'},
         {site,"dev1-dev2-dev3"},
         {strategy,riak_repl_keylist_server},
         {fullsync_worker,<0.31146.0>},
         {dropped_count,0},
         {queue_length,0},
         {queue_byte_size,0},
         {state,wait_for_partition}]}

So the interface to riak_repl_tcp_server still works; however, the migration to ranch has made locating these processes troublesome.
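
Given the supervision tree above, a server_pids/0 replacement could walk ranch's tree instead of riak_repl_server_sup. A rough sketch, based only on the child specs shown above, so treat it as illustrative:

server_pids() ->
    [Pid || {{ranch_listener_sup, _Listener}, ListenerSup, supervisor, _}
                <- supervisor:which_children(ranch_sup),
            {ranch_conns_sup, ConnsSup, supervisor, _}
                <- supervisor:which_children(ListenerSup),
            {_, Pid, worker, _}
                <- supervisor:which_children(ConnsSup),
            is_pid(Pid)].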

riak_test for binary object downgrade while repl running

It's possible to downgrade a cluster from "new" binary riak object format back to the old t2b form. There is already a test for that in riak_kv. However, a downgrade should also work while replication is running, so let's write a test for that.
