
arakoon's Introduction

Open vStorage Framework

The Framework is a set of components and tools that provides an interface (GUI / API) to manage the Open vStorage platform. The main framework pieces are written in Python and JavaScript. It builds on other open source projects such as Arakoon for distributed configuration management and Celery as a distributed task queue. The intuitive GUI allows easy setup and management of an Open vStorage Backend and lets you link it to a distributed vPool. By interfacing with hypervisors like KVM and VMware, the framework turns the platform into a VM-centric storage platform.

The Open vStorage Framework is licensed under the GNU AFFERO GENERAL PUBLIC LICENSE Version 3.

Get started

Check our gitbook on how to get started with an installation.

Releases

You can find an overview of the release history on our Releases page.

Documentation

The Framework specific documentation (components, concepts, logs, troubleshooting, ...) can be found here.

Contribution & Packaging

We welcome contributions; general contribution info/guidelines can be found here. From a Framework perspective, packaging your own changes for testing can be done using the packager module.

File a bug/enhancement

Open vStorage is quality checked to the highest level. Unfortunately we might have overlooked some tiny topics here or there. The Open vStorage Project maintains a public issue tracker where you can report bugs and request enhancements. This issue tracker is not a customer support forum but a place to report an error, flaw, failure, or fault in the Open vStorage software.

arakoon's People

Contributors

314eter, andersfugmann, blackmath86, delamonpansie, domsj, harukama, jeffreydevloo, jeroenmaelbrancke, khenderick, nicolast, tdeconin, toolslive


arakoon's Issues

Error during collapsing cluster

version: 1.9.8

2016-08-22 10:12:06 06900 +0200 - ovs-01.be-gen8-2 - 9182/140488436619072 - lib/scheduled tasks - 334276 - ERROR - Error during collapsing cluster ovsdb node jNkeV9q1hhm8IGRk
Traceback (most recent call last):
  File "/opt/OpenvStorage/ovs/lib/scheduledtask.py", line 411, in collapse_arakoon
    client.collapse(str(node), 2)
  File "<update_argspec>", line 5, in collapse
  File "/opt/OpenvStorage/ovs/extensions/db/arakoon/pyrakoon/pyrakoon/compat.py", line 152, in wrapped
    raise new_exception
ArakoonException: unknown failure
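
For anyone hitting this from the framework side, a minimal, hypothetical sketch of guarding the client.collapse call shown in the traceback (it assumes the pyrakoon compat module exports ArakoonException and that client is an already-connected client; the retry logic and names are illustration only, not the framework's actual fix):

from ovs.extensions.db.arakoon.pyrakoon.pyrakoon.compat import ArakoonException

def collapse_with_retry(client, node, tlogs_to_keep=2, attempts=3):
    # Retry the collapse a few times, since the failure above is the
    # generic "unknown failure" re-raised by the compat wrapper.
    for attempt in range(1, attempts + 1):
        try:
            client.collapse(str(node), tlogs_to_keep)
            return True
        except ArakoonException as exc:
            print('collapse attempt %d/%d failed: %s' % (attempt, attempts, exc))
    return False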

Tlog_map.TLCNotProperlyClosed(_) during local collapse

During the local collapse, I got this unexpected error:

arakoon: tlog_map: _init ~check_marker:true
arakoon: tlog_map: tlog_number:498
arakoon: tlog_map: _validate_one /mnt/hdd1/ovh/arakoon/global_abm/tlogs/498.tlog
Uncaught exception:

  Tlog_map.TLCNotProperlyClosed(_)

Raised at file "src/core/lwt.ml", line 789, characters 22-23
Called from file "src/unix/lwt_main.ml", line 34, characters 8-18
Called from file "src/main/arakoon.ml", line 465, characters 7-185
Called from file "src/main/arakoon.ml", line 569, characters 7-23
Called from file "src/main/arakoon.ml", line 583, characters 9-16

Compile Arakoon 1.9.3 on Ubuntu 16.04

Hi,

Here you can find a script which makes Arakoon 1.9.3 compile on a fresh (at least OpenvCloud) Ubuntu 16.04:
https://gist.github.com/maxux/5d8cd7c6ce819ae8aa03d4457a7088cb

Note: this gist contains a patch needed to compile against "Core 113.33".
With that patch, Arakoon compiles fine against the latest versions of the libraries used.

All tests pass (except etcd, which was not installed).

Versions installed:

bisect               1.3
camlbz2              0.6.0
camltc               0.9.3
core                 113.33.03
depext               0.9.1
lwt                  2.5.2
ocamlfind            1.6.2
ocplib-endian        0.8
quickcheck           1.0.2
redis                0.2.3
snappy               0.1.0
ssl                  0.5.2
uri                  1.9.2

(Failure "file not found") during local collapse

During a local collapse of an arakoon node I got this error:

Uncaught exception:

  (Failure "file not found")

Raised at file "src/core/lwt.ml", line 789, characters 22-23
Called from file "src/unix/lwt_main.ml", line 34, characters 8-18
Called from file "src/main/arakoon.ml", line 465, characters 7-185
Called from file "src/main/arakoon.ml", line 569, characters 7-23
Called from file "src/main/arakoon.ml", line 583, characters 9-16

High memory usage on ABM

The memory usage of an ABM on the Nuvolat setup had grown to 57 GB.

arakoon version: 1.9.18
alba version: ee-1.5.2

22036 ovs       20   0 57.173g 0.054t   9092 S  15.6 29.5   2322:12 /usr/bin/arakoon --node N0RCG4vXgbP0w8LQ -config arakoon://config/ovs/arakoon/ny1-ssdbackend01-abm/config?ini=%2Fmnt%2Fssd1%2Farakoon%2Fexternal_arakoon_cacc.ini -autof+

We see a lot of failed asserts on this particular node/ABM (not sure whether this is caused by the already fixed bug on https://github.com/openvstorage/alba_ee/issues/52, which affected the same ABM).

Albamgr: Assert failed /alba/ns/list/c220cf34-f904-4fff-82bc-16271219e457_bc_000000086 (attempt 4)

Change copy-db-to-head CLI signature to use -config

Currently most arakoon CLI commands accept a -config parameter to point to the config of the cluster, except for copy-db-to-head, which accepts a cluster name, IP, port and the number of tlog files to leave untouched. Can we make the CLI arguments uniform, so that --collapse-local and --copy-db-to-head accept the same arguments (as sketched below)? The result of both actions is the same.

--copy-db-to-head <cluster_id> <tlogs_to_keep> requests the node to copy its database to the head location, after which it will remove old tlogs
--collapse-local <node_id> <tlogs_to_keep> collapses all but <tlogs_to_keep> tlogs into the head database
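
To make that concrete, a hypothetical sketch of what the uniform invocations could look like (the -config spelling follows the other commands; this is a proposal, not the current CLI):

arakoon --copy-db-to-head <node_id> <tlogs_to_keep> -config <cluster_config>
arakoon --collapse-local <node_id> <tlogs_to_keep> -config <cluster_config>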

Clean up old tlog files even when collapsing is not necessary

Setup:
1. Stop the service.
2. Copy the main.db to the head.db.
3. Start the service.
4. Run a collapse-local.

Problem:
A local collapse will not clean up old tlogs.
The cleanup process is only triggered when Arakoon actually needs to collapse tlogs.

i:14816766; n:3119; is_archive: true}; { i:14811766; n:3118; is_archive: true}; { i:14806766; n:3117; is_archive: true}; { i:14801766; n:3116; is_archive: true}; { i:14796766; n:3115; is_archive: true}; { i:14791766; n:3114; is_archive: true}; { i:14786766; n:3113; is_archive: true}; { i:14781766; n:3112; is_archive: true}; { i:14776766; n:3111; is_archive: true}; { i:14771766; n:3110; is_archive: true}; { i:14766766; n:3109; is_archive: true}; { i:14761766; n:3108; is_archive: true}; { i:14756766; n:3107; is_archive: true}; { i:14751766; n:3106; is_archive: true}; { i:14746766; n:3105; is_archive: true}; { i:14741766; n:3104; is_archive: true}; { i:14736766; n:3103; is_archive: true}; { i:14731766; n:3102; is_archive: true}; { i:14726767; n:3101; is_archive: true}; { i:14721767; n:3100; is_archive: true}; { i:14716769; n:3099; is_archive: true}; { i:14711771; n:3098; is_archive: true}; { i:14706772; n:3097; is_archive: true}; { i:14701772; n:3096; is_archive: true}; { i:14696772; n:3095; is_archive: true}; { i:14691776; n:3094; is_archive: true}; { i:14686781; n:3093; is_archive: true}; { i:14684163; n:3092; is_archive: true}; { i:14680607; n:3091; is_archive: true}; { i:14675607; n:3090; is_archive: true}; { i:14670608; n:3089; is_archive: true}; { i:14665608; n:3088; is_archive: true}; { i:14660609; n:3087; is_archive: true}; { i:14655609; n:3086; is_archive: true}; { i:14650609; n:3085; is_archive: true}; { i:14645609; n:3084; is_archive: true}; { i:14642049; n:3083; is_archive: true}; { i:14637315; n:3082; is_archive: true}; { i:14632316; n:3081; is_archive: true}; { i:14627322; n:3080; is_archive: true}; { i:14622322; n:3079; is_archive: true}; { i:14617323; n:3078; is_archive: true}; { i:14612323; n:3077; is_archive: true}]
arakoon: tlog_map: tlogs_to_collapse 15348832 15353260 20 => None
arakoon: main: Nothing to collapse...
arakoon: main: sn' = 0

Extending a cluster to an extra node never seems to catch up

The scenario:

We had a 5 node arakoon cluster and used the framework tools to move one of the members to a different node. The plan was to extend the cluster with a 6th node and afterwards shrink it back to a 5 node cluster. We noticed that the new member was never able to catch up with the cluster, resulting in a more or less endless loop. We let the new member try to catch up for 12+ hours, but to no avail. It must also be noted that one of the existing members of the cluster was down before we triggered the extend to a 6th node, hence the log file contains connection refused messages for node 172.17.16.8.

Some log file snippets:

2017-07-25 05:47:04 802887 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626081 - info - CATCHUP start: I'm @ 22280560 and N0RCG4vXgbP0w8LQ is more recent
2017-07-25 05:47:04 802906 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626082 - info - catchup_tlog 22280560 from N0RCG4vXgbP0w8LQ
2017-07-25 05:47:04 854183 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626083 - info - loop_entries
2017-07-25 05:47:04 946255 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626084 - info - remote_nodestream :: last_seen = Some ("22280579")
2017-07-25 05:47:04 946299 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626085 - info - loop_parts done
2017-07-25 05:47:04 946337 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626086 - info - catchup_tlog completed
2017-07-25 05:47:04 946358 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626087 - info - CATCHUP phase 1 done (until_i = 22280579); now the store
2017-07-25 05:47:04 946372 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626088 - info - replaying log to store
2017-07-25 05:47:04 951188 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626089 - info - Replaying tlog file: 439.tlog [22280281,...] (195/195)
2017-07-25 05:47:05 921475 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626090 - info - establishing connection to (172.17.16.8,26455) (timeout = 5.000000)
2017-07-25 05:47:05 921638 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626091 - info - __open_connection to (172.17.16.8,26455) failed: Unix.Unix_error(Unix.ECONNREFUSED, "connect", "")
2017-07-25 05:47:06 608391 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626092 - info - Completed replay of 439.tlog, took 1.657176 seconds, 0 to go
2017-07-25 05:47:06 650158 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626093 - info - catchup_store completed, store is @ Some ("22280579")
2017-07-25 05:47:06 650182 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626094 - info - CATCHUP end
2017-07-25 05:47:06 650279 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626095 - info - CATCHUP start: I'm @ 22280579 and wZ4GAcmR2MZl5x7S is more recent
2017-07-25 05:47:06 650286 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626096 - info - catchup_tlog 22280579 from wZ4GAcmR2MZl5x7S
2017-07-25 05:47:06 654715 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626097 - info - loop_entries
2017-07-25 05:47:06 806162 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626098 - info - remote_nodestream :: last_seen = Some ("22280586")
2017-07-25 05:47:06 806184 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626099 - info - loop_parts done
2017-07-25 05:47:06 806205 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626100 - info - catchup_tlog completed
2017-07-25 05:47:06 806228 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626101 - info - CATCHUP phase 1 done (until_i = 22280586); now the store
2017-07-25 05:47:06 806235 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626102 - info - replaying log to store
2017-07-25 05:47:06 809906 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626103 - info - Replaying tlog file: 439.tlog [22280281,...] (195/195)
2017-07-25 05:47:07 037526 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626104 - info - Completed replay of 439.tlog, took 0.227597 seconds, 0 to go
2017-07-25 05:47:07 081012 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626105 - info - catchup_store completed, store is @ Some ("22280586")
2017-07-25 05:47:07 081033 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626106 - info - CATCHUP end
2017-07-25 05:47:07 185386 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626107 - info - establishing connection to (172.17.16.8,26455) (timeout = 5.000000)
2017-07-25 05:47:07 185647 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626108 - info - __open_connection to (172.17.16.8,26455) failed: Unix.Unix_error(Unix.ECONNREFUSED, "connect", "")
2017-07-25 05:47:07 227959 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626109 - info - CATCHUP start: I'm @ 22280586 and N0RCG4vXgbP0w8LQ is more recent
2017-07-25 05:47:07 227983 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626110 - info - catchup_tlog 22280586 from N0RCG4vXgbP0w8LQ
2017-07-25 05:47:07 239730 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626111 - info - loop_entries
2017-07-25 05:47:07 296488 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626112 - info - remote_nodestream :: last_seen = Some ("22280587")
2017-07-25 05:47:07 296506 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626113 - info - loop_parts done
2017-07-25 05:47:07 296522 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626114 - info - catchup_tlog completed
2017-07-25 05:47:07 296529 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626115 - info - CATCHUP phase 1 done (until_i = 22280587); now the store
2017-07-25 05:47:07 296551 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626116 - info - replaying log to store
2017-07-25 05:47:07 299951 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626117 - info - Replaying tlog file: 439.tlog [22280281,...] (195/195)
2017-07-25 05:47:07 359810 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626118 - info - Completed replay of 439.tlog, took 0.059834 seconds, 0 to go
2017-07-25 05:47:07 360198 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626119 - info - catchup_store completed, store is @ Some ("22280587")
2017-07-25 05:47:07 360210 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626120 - info - CATCHUP end
2017-07-25 05:47:07 371226 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626121 - info - CATCHUP start: I'm @ 22280587 and wZ4GAcmR2MZl5x7S is more recent
2017-07-25 05:47:07 371253 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626122 - info - catchup_tlog 22280587 from wZ4GAcmR2MZl5x7S
2017-07-25 05:47:07 373095 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626123 - info - loop_entries
2017-07-25 05:47:07 479117 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626124 - info - remote_nodestream :: last_seen = Some ("22280588")
2017-07-25 05:47:07 479141 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626125 - info - loop_parts done
2017-07-25 05:47:07 479159 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626126 - info - catchup_tlog completed
2017-07-25 05:47:07 479168 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626127 - info - CATCHUP phase 1 done (until_i = 22280588); now the store
2017-07-25 05:47:07 479175 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626128 - info - replaying log to store
2017-07-25 05:47:07 483116 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626129 - info - Replaying tlog file: 439.tlog [22280281,...] (195/195)
2017-07-25 05:47:07 539037 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626130 - info - Completed replay of 439.tlog, took 0.055894 seconds, 0 to go
2017-07-25 05:47:07 587151 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626131 - info - catchup_store completed, store is @ Some ("22280588")
2017-07-25 05:47:07 587172 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626132 - info - CATCHUP end

To give a clearer view of the number of updates the new member was running behind:

2017-07-25 05:48:13 161195 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626653 - info - CATCHUP start: I'm @ 22280757 and wZ4GAcmR2MZl5x7S is more recent
2017-07-25 05:48:13 169220 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626659 - info - CATCHUP phase 1 done (until_i = 22280759); now the store
2017-07-25 05:48:13 268636 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626664 - info - CATCHUP end
2017-07-25 05:48:13 268955 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626665 - info - CATCHUP start: I'm @ 22280759 and wZ4GAcmR2MZl5x7S is more recent
2017-07-25 05:48:13 285894 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626671 - info - CATCHUP phase 1 done (until_i = 22280761); now the store
2017-07-25 05:48:13 412463 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626676 - info - CATCHUP end
2017-07-25 05:48:13 412893 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626677 - info - CATCHUP start: I'm @ 22280761 and wZ4GAcmR2MZl5x7S is more recent
2017-07-25 05:48:13 435778 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626683 - info - CATCHUP phase 1 done (until_i = 22280762); now the store
2017-07-25 05:48:13 623228 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626690 - info - CATCHUP end
2017-07-25 05:48:13 623528 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626691 - info - CATCHUP start: I'm @ 22280762 and N0RCG4vXgbP0w8LQ is more recent
2017-07-25 05:48:13 736298 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626697 - info - CATCHUP phase 1 done (until_i = 22280764); now the store
2017-07-25 05:48:13 839142 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626702 - info - CATCHUP end
2017-07-25 05:48:13 839181 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626703 - info - CATCHUP start: I'm @ 22280764 and wZ4GAcmR2MZl5x7S is more recent
2017-07-25 05:48:13 847774 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626709 - info - CATCHUP phase 1 done (until_i = 22280766); now the store
2017-07-25 05:48:14 027051 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626714 - info - CATCHUP end
2017-07-25 05:48:25 909635 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626740 - info - CATCHUP start: I'm @ 22280766 and N0RCG4vXgbP0w8LQ is more recent
2017-07-25 05:48:26 053015 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626746 - info - CATCHUP phase 1 done (until_i = 22280829); now the store
2017-07-25 05:48:27 210702 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626753 - info - CATCHUP end
2017-07-25 05:48:27 210738 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626754 - info - CATCHUP start: I'm @ 22280829 and wZ4GAcmR2MZl5x7S is more recent
2017-07-25 05:48:27 227744 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626760 - info - CATCHUP phase 1 done (until_i = 22280837); now the store
2017-07-25 05:48:27 449894 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626765 - info - CATCHUP end
2017-07-25 05:48:27 860400 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626768 - info - CATCHUP start: I'm @ 22280837 and wZ4GAcmR2MZl5x7S is more recent
2017-07-25 05:48:27 891243 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626774 - info - CATCHUP phase 1 done (until_i = 22280838); now the store
2017-07-25 05:48:28 134927 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626779 - info - CATCHUP end
2017-07-25 05:48:39 371910 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626803 - info - CATCHUP start: I'm @ 22280838 and hVEIul3TXTpCAWwg is more recent
2017-07-25 05:48:39 674368 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626809 - info - CATCHUP phase 1 done (until_i = 22280861); now the store
2017-07-25 05:48:41 339351 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626818 - info - CATCHUP end
2017-07-25 05:48:41 339401 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626819 - info - CATCHUP start: I'm @ 22280861 and wZ4GAcmR2MZl5x7S is more recent
2017-07-25 05:48:41 388302 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626825 - info - CATCHUP phase 1 done (until_i = 22280862); now the store
2017-07-25 05:48:42 229435 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626830 - info - CATCHUP end
2017-07-25 05:48:42 229831 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626832 - info - CATCHUP start: I'm @ 22280862 and hVEIul3TXTpCAWwg is more recent
2017-07-25 05:48:42 350230 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626839 - info - CATCHUP phase 1 done (until_i = 22280863); now the store
2017-07-25 05:48:42 473137 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626844 - info - CATCHUP end
2017-07-25 05:48:42 716423 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626845 - info - CATCHUP start: I'm @ 22280863 and wZ4GAcmR2MZl5x7S is more recent
2017-07-25 05:48:42 745541 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626851 - info - CATCHUP phase 1 done (until_i = 22280864); now the store
2017-07-25 05:48:43 230890 -0400 - NY1SRV0012 - 11327/0000 - arakoon - 1626856 - info - CATCHUP end
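
A small, hypothetical helper to quantify the loop (the file name is an assumption; the regexes match the log lines above). It pairs each "CATCHUP start: I'm @ <i>" with the next "until_i = <i>" and prints the delta, which in the snippets above is only a handful of entries per round:

import re

START = re.compile(r"CATCHUP start: I'm @ (\d+)")
DONE = re.compile(r"until_i = (\d+)")

def catchup_lag(path):
    # Pair each catch-up start position with the target it reached.
    current = None
    with open(path) as log:
        for line in log:
            started = START.search(line)
            if started:
                current = int(started.group(1))
                continue
            done = DONE.search(line)
            if done and current is not None:
                print('lag: %d entries' % (int(done.group(1)) - current))
                current = None

catchup_lag('ovs-arakoon-nsm.log')  # hypothetical log file name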

arakoon 1.9.3 node refuses to start

The node had a tlog file (x.tlx) that was filled with just one entry, repeated thousands of times.
The next startup of that node failed because the tlog_map refuses the next tlog (x+1.tlx),
as that one has a starting i that is not higher than the starting i of x.tlx.
The environment had networking issues and the node had been left isolated for days.

The workaround was simple in this specific case: delete all tlog files [x, _).
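
A minimal sketch of that workaround (assuming tlog files are named <n>.tlog / <n>.tlx as elsewhere on this page; it is destructive, so stop the node first and only apply it in a situation like the one described):

import os, re

TLOG_NAME = re.compile(r'^(\d+)\.(tlog|tlx)$')

def drop_tlogs_from(tlog_dir, x):
    # Remove every tlog file numbered x or higher, i.e. the range [x, _).
    for name in sorted(os.listdir(tlog_dir)):
        match = TLOG_NAME.match(name)
        if match and int(match.group(1)) >= x:
            os.remove(os.path.join(tlog_dir, name))
            print('removed', name)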

arakoon 1.9.3 range issue

In [38]: client = Arakoon.ArakoonClient(cfg)
In [39]: client.range(None, False, None, False, -1)
Out[39]: ['xey001']
In [40]: client.range(None, False, None, False, 0)
Out[40]: ['xey001']

Probably impacts older versions too.
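
Presumably -1 means "no limit" while 0 should mean "return nothing"; under that assumption the expected behaviour would be:

keys_all = client.range(None, False, None, False, -1)   # every key in the range
keys_none = client.range(None, False, None, False, 0)   # expected: []
assert keys_none == [], 'bug: max_elements=0 returned %r' % keys_none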

Corrupt head.db & verify-db marked OK

root@pocops-ovs01:~# arakoon --version
version: 1.9.12

It looks like the verify-db isn't working as it should.

2016-10-10 14:10:03 200099 +0200 - pocops-ovs03 - 30532/0 - arakoon - 283286 - info - returning assuming no I "/opt/OpenvStorage/db/arakoon/ovsdb/tlogs/head.db": (Failure success); backtrace:; Raised by primitive operation at file "hotc.ml", line 63, characters 12-24; Called from file "src/node/local_store.ml", line 49, characters 2-62; Called from file "src/node/local_store.ml", line 277, characters 2-41; Called from file "src/node/batched_store.ml", line 57, characters 4-34; Called from file "src/node/store.ml", line 159, characters 4-48; Called from file "src/node/collapser.ml", line 183, characters 7-131; Called from file "src/core/lwt.ml", line 686, characters 20-24
root@pocops-ovs01:~# arakoon --verify-db /opt/OpenvStorage/db/arakoon/ovsdb/tlogs/head.db
         0: "*i"        ok
         1: "*j"        ok
         2: "*master"   ok
         3: "@__ovs_config"     ok
         4: "@__ovs_metadata"   ok
         5: "@ovs_celery_beat_entries"  ok
         6: "@ovs_celery_beat_lock"     ok
         7: "@ovs_data_abmservice_097dcb2e-aed7-42ac-ba73-0e51453d3c5a" ok
         8: "@ovs_data_abmservice_517e87b5-454a-4dac-881b-786ce8a6c1a0" ok
         9: "@ovs_data_abmservice_e1dc8145-39e2-4226-80c6-82034f228737" ok

Verify-db related ticket: #122

Arakoon fails to start with error: starts with 17571602, which ain't higher than 19229207

An arakoon on our multi-DC setup will not start anymore because of:

2016-06-30 15:29:41 469132 +0200 - perf-roub-01 - 117006/0 - arakoon - 14 - info - other: bKz9fJNRoh8ZGfe7
2016-06-30 15:29:41 469151 +0200 - perf-roub-01 - 117006/0 - arakoon - 15 - info - other: nhfE8ZP77taJwMBG
2016-06-30 15:29:41 469165 +0200 - perf-roub-01 - 117006/0 - arakoon - 16 - info - quorum_function gives 2 for 3
2016-06-30 15:29:41 469177 +0200 - perf-roub-01 - 117006/0 - arakoon - 17 - info - DAEMONIZATION=false
2016-06-30 15:29:41 469194 +0200 - perf-roub-01 - 117006/0 - arakoon - 18 - info - open_tlog_collection_and_store ~autofix:true
2016-06-30 15:29:41 469222 +0200 - perf-roub-01 - 117006/0 - arakoon - 19 - info - Unlinking "/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/db/touch-117006"
2016-06-30 15:29:41 469296 +0200 - perf-roub-01 - 117006/0 - arakoon - 20 - info - Unlink of "/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/db/touch-117006" failed with ENOENT
2016-06-30 15:29:41 470170 +0200 - perf-roub-01 - 117006/0 - arakoon - 21 - info - Unlinking "/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/db/touch-117006"
2016-06-30 15:29:41 470278 +0200 - perf-roub-01 - 117006/0 - arakoon - 22 - info - Unlinking "/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs/touch-117006"
2016-06-30 15:29:41 470333 +0200 - perf-roub-01 - 117006/0 - arakoon - 23 - info - Unlink of "/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs/touch-117006" failed with ENOENT
2016-06-30 15:29:41 471065 +0200 - perf-roub-01 - 117006/0 - arakoon - 24 - info - Unlinking "/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs/touch-117006"
2016-06-30 15:29:41 471170 +0200 - perf-roub-01 - 117006/0 - arakoon - 25 - info - Unlinking "/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs/touch-117006"
2016-06-30 15:29:41 471209 +0200 - perf-roub-01 - 117006/0 - arakoon - 26 - info - Unlink of "/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs/touch-117006" failed with ENOENT
2016-06-30 15:29:41 471796 +0200 - perf-roub-01 - 117006/0 - arakoon - 27 - info - Unlinking "/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs/touch-117006"
2016-06-30 15:29:41 471873 +0200 - perf-roub-01 - 117006/0 - arakoon - 28 - info - Unlinking "/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs/touch-117006"
2016-06-30 15:29:41 471904 +0200 - perf-roub-01 - 117006/0 - arakoon - 29 - info - Unlink of "/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs/touch-117006" failed with ENOENT
2016-06-30 15:29:41 472517 +0200 - perf-roub-01 - 117006/0 - arakoon - 30 - info - Unlinking "/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs/touch-117006"
2016-06-30 15:29:41 472609 +0200 - perf-roub-01 - 117006/0 - arakoon - 31 - info - File at /mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs/head.db does not exist
2016-06-30 15:32:57 751364 +0200 - perf-roub-01 - 117006/0 - arakoon - 32 - fatal - AUTOFIX: can't autofix this: (Failure;   "\"/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs/48713.tlx\" starts with 17571602, which ain't higher than 19229207"); backtrace:; Raised at file "map.ml", line 117, characters 16-25; Called from file "src/core/lwt.ml", line 161, characters 4-40
2016-06-30 15:32:57 751399 +0200 - perf-roub-01 - 117006/0 - arakoon - 33 - fatal - going down: (Failure;   "\"/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs/48713.tlx\" starts with 17571602, which ain't higher than 19229207"); backtrace:; Raised at file "format.ml", line 185, characters 41-52; Called from file "format.ml", line 427, characters 6-24
2016-06-30 15:32:57 751408 +0200 - perf-roub-01 - 117006/0 - arakoon - 34 - fatal - after pick
2016-06-30 15:32:57 753391 +0200 - perf-roub-01 - 117006/0 - arakoon - 35 - info - Crash log dumped
2016-06-30 15:32:57 895195 +0200 - perf-roub-01 - 127313/0 - arakoon - 0 - info - --- NODE STARTED ---
2016-06-30 15:32:57 895236 +0200 - perf-roub-01 - 127313/0 - arakoon - 1 - info - git_revision: tags/1.9.6-0-gea32319 
2016-06-30 15:32:57 895251 +0200 - perf-roub-01 - 127313/0 - arakoon - 2 - info - compile_time: 10/06/2016 13:21:09 UTC 
2016-06-30 15:32:57 895263 +0200 - perf-roub-01 - 127313/0 - arakoon - 3 - info - version: 1.9.6
2016-06-30 15:32:57 895277 +0200 - perf-roub-01 - 127313/0 - arakoon - 4 - info - NOFILE: 1024

Change severity level of ECONNRESET error

As discussed, please change the severity for ECONNRESET to info or debug.

Jun 15 08:00:10 NY1SRV0002 arakoon[20716]: 2017-06-15 08:00:10 485672 -0400 - NY1SRV0002 - 20716/0000 - arakoon - 7980596 - error - Exception during client request (Unix.Unix_error(Unix.ECONNRESET, "read", "")) => rc:ff msg:unknown failure

[2:43:16 PM] Jeroen Maelbrancke: Jun 15 08:00:10 NY1SRV0002 arakoon[20716]: 2017-06-15 08:00:10 481349 -0400 - NY1SRV0002 - 20716/0000 - arakoon - 7980591 - info - 172.17.16.5:client_service_2566117: closing
Jun 15 08:00:10 NY1SRV0002 arakoon[20716]: 2017-06-15 08:00:10 485568 -0400 - NY1SRV0002 - 20716/0000 - arakoon - 7980592 - info - Exception during client request (End_of_file) => rc:ff msg:end of file because the client disconnected
Jun 15 08:00:10 NY1SRV0002 arakoon[20716]: 2017-06-15 08:00:10 485601 -0400 - NY1SRV0002 - 20716/0000 - arakoon - 7980593 - info - 172.17.16.5:client_service_2566120: closing
Jun 15 08:00:10 NY1SRV0002 arakoon[20716]: 2017-06-15 08:00:10 485625 -0400 - NY1SRV0002 - 20716/0000 - arakoon - 7980594 - info - Exception during client request (End_of_file) => rc:ff msg:end of file because the client disconnected
Jun 15 08:00:10 NY1SRV0002 arakoon[20716]: 2017-06-15 08:00:10 485641 -0400 - NY1SRV0002 - 20716/0000 - arakoon - 7980595 - info - 172.17.16.5:client_service_2566121: closing
Jun 15 08:00:10 NY1SRV0002 arakoon[20716]: 2017-06-15 08:00:10 485672 -0400 - NY1SRV0002 - 20716/0000 - arakoon - 7980596 - error - Exception during client request (Unix.Unix_error(Unix.ECONNRESET, "read", "")) => rc:ff msg:unknown failure
Jun 15 08:00:10 NY1SRV0002 arakoon[20716]: 2017-06-15 08:00:10 485696 -0400 - NY1SRV0002 - 20716/0000 - arakoon - 7980597 - info - exiting session (941) connection=172.17.16.5:client_service_2566119: Unix.Unix_error(Unix.EPIPE, "write", ""); backtrace:; Raised by primitive operation at file "src/unix/lwt_bytes.ml", line 148, characters 43-86; Called from file "src/unix/lwt_unix.cppo.ml", line 555, characters 17-28
Jun 15 08:00:10 NY1SRV0002 arakoon[20716]: 2017-06-15 08:00:10 485703 -0400 - NY1SRV0002 - 20716/0000 - arakoon - 7980598 - info - 172.17.16.5:client_service_2566119: closing
[2:45:02 PM] Romain Slootmaekers: indeed, we log that at info even though it's harmless.
[2:45:27 PM] Romain Slootmaekers: that EPIPE is something different though
[2:45:42 PM] Jan Doms: is EPIPE equally harmless?
[2:45:55 PM] Romain Slootmaekers: ECONNRESET is a client that quits in between requests
EPIPE is a client that doesn't wait for the answer
[2:46:17 PM] Romain Slootmaekers: for arakoon that is harmless, but it is an ominous sign
[2:46:39 PM] Jan Doms: right, indeed
[2:46:53 PM] Jan Doms: and how many of those EPIPEs vs ECONNRESETs do you see?
[2:47:04 PM] Jeroen Maelbrancke: so actually EPIPE should be error and ECONNRESET info?
[2:47:04 PM] Romain Slootmaekers: probably a client dying with NoneType has no attribute ;)
[2:47:07 PM] Jeroen Maelbrancke: let me quickly check for one arakoon
[2:47:28 PM] Romain Slootmaekers: ECONNRESET is info.
EPIPE is also info.
[2:47:41 PM] Romain Slootmaekers: there is no filtering....
[2:48:40 PM] Jan Doms: @romain that ECONNRESET is apparently being logged at error at the moment, though
[2:48:46 PM] Jeroen Maelbrancke: took a quick look at one arakoon: ECONNRESET occurs 204 times in one hour and EPIPE 200 times
[2:49:02 PM] Jeroen Maelbrancke: indeed, it is logged at error
[2:49:33 PM] Jan Doms: maybe we could even change that ECONNRESET to debug?
[2:49:44 PM] Romain Slootmaekers: yes, fine by me.
[2:50:16 PM] Jeroen Maelbrancke: I'll create a ticket for it
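
Summarising the conclusion above as a hypothetical severity mapping (illustration only; the actual fix lives in arakoon's OCaml logging code):

# Benign client-side disconnects should not be logged at error level.
SEVERITY_OVERRIDES = {
    'ECONNRESET': 'debug',  # client stopped between requests
    'EPIPE': 'info',        # client did not wait for the answer
}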

Move/run away from tokyo cabinet

See #110 and #122 for the motivation. The ops team runs into this too often, and Tokyo Cabinet is unmaintained.
Currently it's an accident waiting to happen.
In the short term we still want to implement #122 so ops can monitor this more closely.

Possible replacement candidates:

  • rocksdb
    • could remove the need for head.db + collapsing could become cheap
    • db consists of multiple files, which would require some changes for catchup to handle this
  • lmdb
    • ocaml binding doesn't expose iterators / range queries
    • rest of the flow/code would remain the same

For both of these it should be possible to provide a way forward for existing clusters.
Either:

  • allow catchup of head.db in a database-agnostic fashion (send key value pairs over the wire; see the sketch below)
    • for each node: wipe it, let it catchup from the others
  • provide a method to export an old (head) db to the new database format
    • for each node: stop node, transform head db, remove main db, start node
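
A database-agnostic sketch of that first option, where read_pairs and write_pair are hypothetical adapters for the old and new stores (not real Tokyo Cabinet, rocksdb or lmdb APIs):

def migrate(read_pairs, write_pair):
    # Stream every key/value pair from the old (head) database into the
    # new database format; the same flow works for either candidate.
    count = 0
    for key, value in read_pairs():
        write_pair(key, value)
        count += 1
    return count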

Arakoon fails to start due to new tlog entry being lower in sequence 1405207 < 1408230

Input from Jan D.:
Some findings:

2016-06-20 13:28:18 472487 +0530 - Storagerouter3 - 13717/0 - arakoon - 32 - info - Not copying /mnt/ssd1/arakoon/backend1-nsm_0/tlogs/head.db to /mnt/ssd1/arakoon/backend1-nsm_0/db/5yoL2lzwheSwwpkw.db because target already exists
2016-06-20 13:28:24 715039 +0530 - Storagerouter3 - 13717/0 - arakoon - 33 - fatal - AUTOFIX: can't autofix this: (Failure;   "\"/mnt/ssd1/arakoon/backend1-nsm_0/tlogs/6020.tlx\" starts with 1405207, which ain't higher than 1408230"); backtrace:; Raised at file "map.ml", line 117, characters 16-25; Called from file "src/core/lwt.ml", line 161, characters 4-40
2016-06-20 13:28:24 715081 +0530 - Storagerouter3 - 13717/0 - arakoon - 34 - fatal - going down: (Failure;   "\"/mnt/ssd1/arakoon/backend1-nsm_0/tlogs/6020.tlx\" starts with 1405207, which ain't higher than 1408230"); backtrace:; Raised by primitive operation at file "src/zone.ml", line 690, characters 31-54
2016-06-20 13:28:24 715091 +0530 - Storagerouter3 - 13717/0 - arakoon - 35 - fatal - after pick
2016-06-20 13:28:24 715547 +0530 - Storagerouter3 - 13717/0 - arakoon - 36 - info - Crash log dumped
2016-06-20 13:28:24 795407 +0530 - Storagerouter3 - 13746/0 - arakoon - 0 - info - --- NODE STARTED ---
2016-06-20 13:28:24 795452 +0530 - Storagerouter3 - 13746/0 - arakoon - 1 - info - git_revision: tags/1.9.6-0-gea32319 
2016-06-20 13:28:24 795461 +0530 - Storagerouter3 - 13746/0 - arakoon - 2 - info - compile_time: 10/06/2016 13:21:09 UTC 
2016-06-20 13:28:24 795471 +0530 - Storagerouter3 - 13746/0 - arakoon - 3 - info - version: 1.9.6
2016-06-20 13:28:24 795481 +0530 - Storagerouter3 - 13746/0 - arakoon - 4 - info - NOFILE: 1024

Start i of .tlx files:

6000 => 1404602
6001 => 1404810
6010 => 1406431
6011 => 1406634
6012 => 1406827
6013 => 1407031
6014 => 1407226
6015 => 1407427
6016 => 1407629
6017 => 1407823
6018 => 1408027
6019 => 1408230
6020 => 1405207
6021 => 1405407
6022 => 1405609

The disk was full, probably because collapsing failed due to the mess in the tlogs (6020.tlx has a lower start i).

2016-06-17 17:58:10 625935 +0530 - Storagerouter3 - 25930/0 - arakoon - 84922 - info - save_tlog_file: first_i:1405207 extension:.tlx in /mnt/ssd1/arakoon/backend1-nsm_0/tlogs/6004.tlx.tmp
2016-06-17 17:58:10 688960 +0530 - Storagerouter3 - 25930/0 - arakoon - 84929 - info - save_tlog_file: first_i:1405407 extension:.tlx in /mnt/ssd1/arakoon/backend1-nsm_0/tlogs/6005.tlx.tmp
2016-06-17 17:58:10 772957 +0530 - Storagerouter3 - 25930/0 - arakoon - 84934 - info - save_tlog_file: first_i:1405609 extension:.tlx in /mnt/ssd1/arakoon/backend1-nsm_0/tlogs/6006.tlx.tmp
2016-06-17 17:58:10 865643 +0530 - Storagerouter3 - 25930/0 - arakoon - 84941 - info - save_tlog_file: first_i:1405812 extension:.tlx in /mnt/ssd1/arakoon/backend1-nsm_0/tlogs/6007.tlx.tmp
2016-06-17 17:58:10 947088 +0530 - Storagerouter3 - 25930/0 - arakoon - 84946 - info - save_tlog_file: first_i:1406034 extension:.tlx in /mnt/ssd1/arakoon/backend1-nsm_0/tlogs/6008.tlx.tmp
2016-06-17 17:58:11 34121 +0530 - Storagerouter3 - 25930/0 - arakoon - 84955 - info - save_tlog_file: first_i:1406235 extension:.tlx in /mnt/ssd1/arakoon/backend1-nsm_0/tlogs/6009.tlx.tmp
2016-06-17 17:58:11 145414 +0530 - Storagerouter3 - 25930/0 - arakoon - 84963 - info - save_tlog_file: first_i:1406431 extension:.tlx in /mnt/ssd1/arakoon/backend1-nsm_0/tlogs/6010.tlx.tmp
2016-06-17 17:58:11 225089 +0530 - Storagerouter3 - 25930/0 - arakoon - 84970 - info - save_tlog_file: first_i:1406634 extension:.tlx in /mnt/ssd1/arakoon/backend1-nsm_0/tlogs/6011.tlx.tmp
2016-06-17 17:58:11 334110 +0530 - Storagerouter3 - 25930/0 - arakoon - 84977 - info - save_tlog_file: first_i:1406827 extension:.tlx in /mnt/ssd1/arakoon/backend1-nsm_0/tlogs/6012.tlx.tmp
2016-06-17 17:58:11 451056 +0530 - Storagerouter3 - 25930/0 - arakoon - 84984 - info - save_tlog_file: first_i:1407031 extension:.tlx in /mnt/ssd1/arakoon/backend1-nsm_0/tlogs/6013.tlx.tmp
2016-06-17 17:58:11 537734 +0530 - Storagerouter3 - 25930/0 - arakoon - 84991 - info - save_tlog_file: first_i:1407226 extension:.tlx in /mnt/ssd1/arakoon/backend1-nsm_0/tlogs/6014.tlx.tmp
2016-06-17 17:58:11 643316 +0530 - Storagerouter3 - 25930/0 - arakoon - 84996 - info - save_tlog_file: first_i:1407427 extension:.tlx in /mnt/ssd1/arakoon/backend1-nsm_0/tlogs/6015.tlx.tmp
2016-06-17 17:58:11 720091 +0530 - Storagerouter3 - 25930/0 - arakoon - 85003 - info - save_tlog_file: first_i:1407629 extension:.tlx in /mnt/ssd1/arakoon/backend1-nsm_0/tlogs/6016.tlx.tmp
2016-06-17 17:58:11 802635 +0530 - Storagerouter3 - 25930/0 - arakoon - 85010 - info - save_tlog_file: first_i:1407823 extension:.tlx in /mnt/ssd1/arakoon/backend1-nsm_0/tlogs/6017.tlx.tmp
2016-06-17 17:58:11 908004 +0530 - Storagerouter3 - 25930/0 - arakoon - 85015 - info - save_tlog_file: first_i:1408027 extension:.tlx in /mnt/ssd1/arakoon/backend1-nsm_0/tlogs/6018.tlx.tmp
2016-06-17 17:58:11 984177 +0530 - Storagerouter3 - 25930/0 - arakoon - 85022 - info - save_tlog_file: first_i:1408230 extension:.tlx in /mnt/ssd1/arakoon/backend1-nsm_0/tlogs/6019.tlx.tmp
2016-06-17 17:58:16 260279 +0530 - Storagerouter3 - 25930/0 - arakoon - 85198 - info - save_tlog_file: first_i:1405207 extension:.tlx in /mnt/ssd1/arakoon/backend1-nsm_0/tlogs/6020.tlx.tmp
2016-06-17 17:58:16 324018 +0530 - Storagerouter3 - 25930/0 - arakoon - 85205 - info - save_tlog_file: first_i:1405407 extension:.tlx in /mnt/ssd1/arakoon/backend1-nsm_0/tlogs/6021.tlx.tmp
2016-06-17 17:58:16 436891 +0530 - Storagerouter3 - 25930/0 - arakoon - 85210 - info - save_tlog_file: first_i:1405609 extension:.tlx in /mnt/ssd1/arakoon/backend1-nsm_0/tlogs/6022.tlx.tmp
2016-06-17 17:58:16 518537 +0530 - Storagerouter3 - 25930/0 - arakoon - 85217 - info - save_tlog_file: first_i:1405812 extension:.tlx in /mnt/ssd1/arakoon/backend1-nsm_0/tlogs/6023.tlx.tmp
2016-06-17 17:58:16 636027 +0530 - Storagerouter3 - 25930/0 - arakoon - 85224 - info - save_tlog_file: first_i:1406034 extension:.tlx in /mnt/ssd1/arakoon/backend1-nsm_0/tlogs/6024.tlx.tmp
2016-06-17 17:58:16 761284 +0530 - Storagerouter3 - 25930/0 - arakoon - 85231 - info - save_tlog_file: first_i:1406235 extension:.tlx in /mnt/ssd1/arakoon/backend1-nsm_0/tlogs/6025.tlx.tmp
2016-06-17 17:58:16 846275 +0530 - Storagerouter3 - 25930/0 - arakoon - 85236 - info - save_tlog_file: first_i:1406431 extension:.tlx in /mnt/ssd1/arakoon/backend1-nsm_0/tlogs/6026.tlx.tmp
2016-06-17 17:58:16 925307 +0530 - Storagerouter3 - 25930/0 - arakoon - 85243 - info - save_tlog_file: first_i:1406634 extension:.tlx in /mnt/ssd1/arakoon/backend1-nsm_0/tlogs/6027.tlx.tmp

So 6020.tlx was the result of a catchup and something must've gone wrong there.

Failure success head.db

Arakoon version: 1.9.8
git_revision: "tags/1.9.8-0-g2372b70"

The GIG environment had the following error message in the nsm arakoon logging:

2016-08-09 08:12:00 617304 +0200 - ovs-03.be-gen8-2 - 11646/0 - arakoon - 3336987 - info - returning assuming no I "/mnt/ssd1/arakoon/mybackend-nsm_0/tlogs/head.db": (Failure success); backtrace:; Raised by primitive operation at file "hotc.ml", line 63, characters 12-24; Called from file "src/node/local_store.ml", line 49, characters 2-62; Called from file "src/node/local_store.ml", line 277, characters 2-41; Called from file "src/node/batched_store.ml", line 57, characters 4-34; Called from file "src/node/store.ml", line 159, characters 4-48; Called from file "src/node/collapser.ml", line 183, characters 7-131; Called from file "src/core/lwt.ml", line 686, characters 20-24

copy-db-to-head not working as expected

I'm creating this ticket for tracking purposes, as Romain should already be aware of this. We noticed that the --copy-db-to-head functionality of arakoon isn't working. We can work around this manually, but a working feature would be nice to have, as regular collapsing is taking too long on OVH and we're trying to keep up manually so the disk doesn't run full.

As far as I remember, the running arakoon process doesn't seem to start the copy process. Romain should have more details.

arakoon checksum error

Hi all.
I installed Open vStorage in 2 datacenters, and all servers were shut down.
After starting the servers again, I cannot start the ovs-arakoon-voldrv service.
These are the logs from running ovs-arakoon-voldrv manually on the 3 master nodes.

2017-08-14 14:33:47 351025 +0700 - ovs-r1-controller - 22729/0000 - arakoon - 29 - info - Not copying /mnt/hdd2/arakoon/voldrv/tlogs/head.db to /mnt/hdd2/arakoon/voldrv/db/MgilN7oBbksHsEkE.db because target already exists
2017-08-14 14:33:47 351049 +0700 - ovs-r1-controller - 22729/0000 - arakoon - 30 - info - init ~check_marker:true ~check_sabotage:true
2017-08-14 14:33:47 351373 +0700 - ovs-r1-controller - 22729/0000 - arakoon - 31 - info - tlog_number:37
2017-08-14 14:33:47 356056 +0700 - ovs-r1-controller - 22729/0000 - arakoon - 32 - fatal - AUTOFIX: can't autofix this: Tlogcommon.TLogCheckSumError(); backtrace:; Raised at file "map.ml", line 122, characters 16-25; Called from file "src/core/lwt.ml", line 161, characters 4-40
2017-08-14 14:33:47 356105 +0700 - ovs-r1-controller - 22729/0000 - arakoon - 33 - fatal - [rc=49] Tlog has a checksum error, last valid pos = 130497

a53.db (overwrite=false,throttling=0.000000) buffer_size:1048576
2017-08-14 14:27:44 629300 +0700 - ovs-r2-controller - 23779/0000 - arakoon - 29 - info - Not copying /mnt/hdd1/arakoon/voldrv/tlogs/head.db to /mnt/hdd1/arakoon/voldrv/db/0vsJSdtnXhwwaa53.db because target already exists
2017-08-14 14:27:44 629338 +0700 - ovs-r2-controller - 23779/0000 - arakoon - 30 - info - init ~check_marker:true ~check_sabotage:true
2017-08-14 14:27:44 629727 +0700 - ovs-r2-controller - 23779/0000 - arakoon - 31 - info - tlog_number:37
2017-08-14 14:27:44 634753 +0700 - ovs-r2-controller - 23779/0000 - arakoon - 32 - fatal - AUTOFIX: can't autofix this: Tlogcommon.TLogCheckSumError(); backtrace:; Raised at file "map.ml", line 122, characters 16-25; Called from file "src/core/lwt.ml", line 161, characters 4-40
2017-08-14 14:27:44 634813 +0700 - ovs-r2-controller - 23779/0000 - arakoon - 33 - fatal - [rc=49] Tlog has a checksum error, last valid pos = 130906
root@ovs-r2-controller:/opt/OpenvStorage# /usr/bin/arakoon --node 0vsJSdtnXhwwaa53 -config arakoon://config/ovs/arakoon/voldrv/config?ini=%2Fopt%2FOpenvStorage%2Fconfig%2Farakoon_cacc.ini -autofix -start

2017-08-14 14:29:05 623044 +0700 - ovs-r1-performance-1 - 5698/0000 - arakoon - 28 - info - copy_file /mnt/hdd1/arakoon/voldrv/tlogs/head.db /mnt/hdd1/arakoon/voldrv/db/oKRzbz2bGb3OaGQr.db (overwrite=false,throttling=0.000000) buffer_size:1048576
2017-08-14 14:29:05 623097 +0700 - ovs-r1-performance-1 - 5698/0000 - arakoon - 29 - info - Not copying /mnt/hdd1/arakoon/voldrv/tlogs/head.db to /mnt/hdd1/arakoon/voldrv/db/oKRzbz2bGb3OaGQr.db because target already exists
2017-08-14 14:29:05 623128 +0700 - ovs-r1-performance-1 - 5698/0000 - arakoon - 30 - info - init ~check_marker:true ~check_sabotage:true
2017-08-14 14:29:05 623512 +0700 - ovs-r1-performance-1 - 5698/0000 - arakoon - 31 - info - tlog_number:37
2017-08-14 14:29:05 628377 +0700 - ovs-r1-performance-1 - 5698/0000 - arakoon - 32 - fatal - AUTOFIX: can't autofix this: Tlogcommon.TLogCheckSumError(); backtrace:; Raised at file "map.ml", line 122, characters 16-25; Called from file "src/core/lwt.ml", line 161, characters 4-40
2017-08-14 14:29:05 628414 +0700 - ovs-r1-performance-1 - 5698/0000 - arakoon - 33 - fatal - [rc=49] Tlog has a checksum error, last valid pos = 133881

How can I fix it? Thanks.

arakoon crashed with exit status 40/n/a

On OVH I tried to restore an Arakoon from another node.
I (re)moved all the tlogs, head.db and main.db and started the Arakoon again.
What I saw was that the head.db was copied from another Arakoon, but not the tlx files, which leads to the following error:

2017-04-10 17:44:40 938215 +0200 - ovs01 - 49898/0 - arakoon - 316 - info - rename /mnt/hdd2/arakoon/voldrv/db/HdC8QGW6PDxBNgIZ.db.tmp -> /mnt/hdd2/arakoon/voldrv/db/HdC8QGW6PDxBNgIZ.db
2017-04-10 17:44:41 657588 +0200 - ovs01 - 49898/0 - arakoon - 317 - info - local_store::reopen Hotc::reopen succeeded
2017-04-10 17:44:41 657859 +0200 - ovs01 - 49898/0 - arakoon - 318 - warning - catchup_tlog failed: End_of_file; backtrace:; Raised at file "map.ml", line 122, characters 16-25; Called from file "src/core/lwt.ml", line 161, characters 4-40
ovs-arakoon-voldrv.service: Main process exited, code=exited, status=40/n/a

My solution was to manually copy the head.db + tlx files to this server.

Arakoon dies during collapse, refuses to start afterwards

2016-08-24 13:56:01 534167 +0200 - perf-roub-02 - 14556/0 - arakoon - 2713656 - info - _init: tlog_entries:0 should_roll:true
2016-08-24 13:56:01 534190 +0200 - perf-roub-02 - 14556/0 - arakoon - 2713657 - info - CATCHUP phase 1 done (until_i = 0); now the store
2016-08-24 13:56:01 534202 +0200 - perf-roub-02 - 14556/0 - arakoon - 2713658 - info - replaying log to store
2016-08-24 13:56:01 534213 +0200 - perf-roub-02 - 14556/0 - arakoon - 2713659 - error - Store counter (1857683) is ahead of tlog counter (1). Aborting.
2016-08-24 13:56:01 534245 +0200 - perf-roub-02 - 14556/0 - arakoon - 2713660 - fatal - Exception in fsm thread: Catchup.StoreAheadOfTlogs(_); backtrace:; Raised at file "bytes.ml", line 219, characters 25-34; Called from file "string.ml", line 106, characters 2-26
2016-08-24 13:56:01 534287 +0200 - perf-roub-02 - 14556/0 - arakoon - 2713661 - fatal - [rc=40] Store ahead of tlogs: s_i=1857683, tlog_i=1
2016-08-24 13:56:01 793144 +0200 - perf-roub-02 - 50238/0 - arakoon - 33 - info - _init ~check_marker:true
2016-08-24 13:56:01 796627 +0200 - perf-roub-02 - 50238/0 - arakoon - 34 - info - tlog_number:555
2016-08-24 13:56:02 382143 +0200 - perf-roub-02 - 50238/0 - arakoon - 35 - info - _init: tlog_entries:0 should_roll:true
2016-08-24 13:56:02 382174 +0200 - perf-roub-02 - 50238/0 - arakoon - 36 - info - Taking job from compression queue...
2016-08-24 13:56:02 382184 +0200 - perf-roub-02 - 50238/0 - arakoon - 37 - info - Creating local store at /mnt/hdd1/ovh/arakoon/hdd_roub_nsm_2/db/ZVe2krRqMuP6ZQ1x.db (lcnum=16384) (ncnum=8192)
2016-08-24 13:56:02 382481 +0200 - perf-roub-02 - 50238/0 - arakoon - 38 - info - verify_n_catchup_store; too_far_i=0 current_i=0 si_o:Some ("1857681")
2016-08-24 13:56:02 382492 +0200 - perf-roub-02 - 50238/0 - arakoon - 39 - fatal - too_far_i:0, si_o:Some ("1857681") should not happen: tlogs have been removed?
2016-08-24 13:56:02 382499 +0200 - perf-roub-02 - 50238/0 - arakoon - 40 - fatal - [rc=40] Store ahead of tlogs: s_i=1857681, tlog_i=0

so it looks like #101 is not yet properly solved...

version 1.9.9

Arakoon not starting, store ahead of tlogs...

env: OVH
node: perf-roub-03
path: /mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/voor_ons_jantjeuh

2016-07-19 11:24:42 813318 +0200 - perf-roub-03 - 8782/0 - arakoon - 30 - info - Unlinking "/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs/touch-8782"
2016-07-19 11:24:42 813435 +0200 - perf-roub-03 - 8782/0 - arakoon - 31 - info - copy_file /mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs/head.db /mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/db/nhfE8ZP77taJwMBG.db (overwrite=false,throttling=0.000000)
2016-07-19 11:24:42 813493 +0200 - perf-roub-03 - 8782/0 - arakoon - 32 - info - Not copying /mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs/head.db to /mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/db/nhfE8ZP77taJwMBG.db because target already exists
2016-07-19 11:24:48 756202 +0200 - perf-roub-03 - 8782/0 - arakoon - 33 - info - Taking job from compression queue...
2016-07-19 11:24:48 756272 +0200 - perf-roub-03 - 8782/0 - arakoon - 34 - info - Creating local store at /mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/db/nhfE8ZP77taJwMBG.db (lcnum=16384) (ncnum=8192)
2016-07-19 11:24:48 756513 +0200 - perf-roub-03 - 8782/0 - arakoon - 35 - info - verify_n_catchup_store; too_far_i=0 current_i=0 si_o:Some ("51186823")
2016-07-19 11:24:48 756525 +0200 - perf-roub-03 - 8782/0 - arakoon - 36 - fatal - too_far_i:0, si_o:Some ("51186823") should not happen: tlogs have been removed?
2016-07-19 11:24:48 756532 +0200 - perf-roub-03 - 8782/0 - arakoon - 37 - fatal - [rc=40] Store ahead of tlogs: s_i=51186823, tlog_i=0

Arakoon cluster repeatedly loses master with preferred master.

We have a global nsm cluster spanning 3 datacenters and added a preferred master in datacenter 1, but the cluster seems to constantly lose its master, after which another server is elected as master.

2016-07-04 14:26:33 518354 +0200 - perf-roub-01 - 86211/0 - arakoon - 319 - info - cV858Fahh7iQ4iMJ is master
2016-07-04 14:27:46 811932 +0200 - perf-roub-01 - 86211/0 - arakoon - 8776 - info - cV858Fahh7iQ4iMJ is master
2016-07-04 14:29:07 569843 +0200 - perf-roub-01 - 86211/0 - arakoon - 16920 - info - cV858Fahh7iQ4iMJ is master
2016-07-04 14:30:27 124541 +0200 - perf-roub-01 - 86211/0 - arakoon - 22065 - info - cV858Fahh7iQ4iMJ is master

While it should be 09FydVMSuGL4Fp3k.
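
For reference, a preferred master is configured in the cluster ini roughly like this (a sketch; the key names should be double-checked against the arakoon documentation, and the node list here is abbreviated to the two names visible in the log):

[global]
cluster = 09FydVMSuGL4Fp3k, cV858Fahh7iQ4iMJ
master = 09FydVMSuGL4Fp3k
preferred_master = true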

local collapse not working, plugin not loaded?

version 1.9.11

config file:

root@perf-stras-06:~/arakoon_local_collapse_test# cat test.cfg 
[global]
cluster = kftA4tdCypRvFJMn
cluster_id = hdd_stras_nsm_0
tlog_max_entries = 5000
plugins = nsm_host_plugin

[kftA4tdCypRvFJMn]
crash_log_sinks = console:
ip = 172.20.20.231
tlog_compression = snappy
client_port = 26402
home = /root/arakoon_local_collapse_test/
log_level = info
name = kftA4tdCypRvFJMn
fsync = true
log_sinks = console:
messaging_port = 26403
tlog_dir = /root/arakoon_local_collapse_test/tlogs

error message:

arakoon: tlog_map: tlogs_to_collapse 13329482 14950219 2 => Some (326,14942059)
arakoon: tlog_map: infimum=13329482
arakoon: tlog_map: fold_read extension=.tlx => index':None
arakoon: tlog_map: filename:/root/arakoon_local_collapse_test/tlogs/3428.tlx: Arakoon_exc.Exception(14, "Registry.UserFunctionNotFound(\"arith64\")")
arakoon: tlog_map: backtrace:
arakoon: tlog_map: Raised at file "map.ml", line 122, characters 16-25
arakoon: tlog_map: Called from file "src/core/lwt.ml", line 161, characters 4-40
arakoon: main: collapse_local failed: Arakoon_exc.Exception(14, "Registry.UserFunctionNotFound(\"arith64\")")
arakoon: main: backtrace:
arakoon: main: Raised at file "map.ml", line 122, characters 16-25
arakoon: main: Called from file "src/core/lwt.ml", line 161, characters 4-40

Error while collapsing - corrupt head.db

arakoon version: 1.9.8
git_revision: "tags/1.9.8-0-g2372b70"

root@ovs-01:/mnt/ssd1/arakoon/mybackend-abm/tlogs# arakoon --collapse-remote mybackend-abm 10.103.2.21 26402 2

arakoon: main: remote_collapsing_failed: (Failure "unknown failure (EC: 255)")
arakoon: main: backtrace:
arakoon: main: Raised at file "map.ml", line 117, characters 16-25
arakoon: main: Called from file "src/core/lwt.ml", line 161, characters 4-40
root@ovs-01:/mnt/ssd1/arakoon/mybackend-abm/tlogs# less /var/log/upstart/ovs-arakoon-mybackend-abm.log
2016-08-09 17:07:23 43513 +0200 - ovs-01.be-gen8-2 - 3691/0 - arakoon - 2568675 - info - connection=10.103.2.21:client_service_483879 COLLAPSE_TLOGS: n=2
2016-08-09 17:07:23 43533 +0200 - ovs-01.be-gen8-2 - 3691/0 - arakoon - 2568676 - info - ... Start collapsing ... (n=2)
2016-08-09 17:07:23 43549 +0200 - ovs-01.be-gen8-2 - 3691/0 - arakoon - 2568677 - info - Starting collapse
2016-08-09 17:07:23 43571 +0200 - ovs-01.be-gen8-2 - 3691/0 - arakoon - 2568678 - info - Creating local store at /mnt/ssd1/arakoon/mybackend-abm/tlogs/head.db (lcnum=16384) (ncnum=8192)
2016-08-09 17:07:23 43701 +0200 - ovs-01.be-gen8-2 - 3691/0 - arakoon - 2568679 - info - returning assuming no I "/mnt/ssd1/arakoon/mybackend-abm/tlogs/head.db": (Failure success); backtrace:; Raised by primitive operation at file "hotc.ml", line 63, characters 12-24; Called from file "src/node/local_store.ml", line 49, characters 2-62; Called from file "src/node/local_store.ml", line 277, characters 2-41; Called from file "src/node/batched_store.ml", line 57, characters 4-34; Called from file "src/node/store.ml", line 159, characters 4-48; Called from file "src/node/collapser.ml", line 183, characters 7-131; Called from file "src/core/lwt.ml", line 686, characters 20-24
2016-08-09 17:07:23 43790 +0200 - ovs-01.be-gen8-2 - 3691/0 - arakoon - 2568680 - info - Going to collapse 485 tlogs
2016-08-09 17:07:23 43849 +0200 - ovs-01.be-gen8-2 - 3691/0 - arakoon - 2568681 - info - Creating db clone at /mnt/ssd1/arakoon/mybackend-abm/tlogs/head.db.clone (throttling:0.000000)
2016-08-09 17:07:23 43948 +0200 - ovs-01.be-gen8-2 - 3691/0 - arakoon - 2568682 - info - copy_file /mnt/ssd1/arakoon/mybackend-abm/tlogs/head.db /mnt/ssd1/arakoon/mybackend-abm/tlogs/head.db.clone (overwrite=true,throttling=0.000000)
2016-08-09 17:07:23 44029 +0200 - ovs-01.be-gen8-2 - 3691/0 - arakoon - 2568683 - info - Unlinking "/mnt/ssd1/arakoon/mybackend-abm/tlogs/head.db.clone.tmp"
2016-08-09 17:07:23 44104 +0200 - ovs-01.be-gen8-2 - 3691/0 - arakoon - 2568684 - info - Unlink of "/mnt/ssd1/arakoon/mybackend-abm/tlogs/head.db.clone.tmp" failed with ENOENT
2016-08-09 17:07:23 75650 +0200 - ovs-01.be-gen8-2 - 3691/0 - arakoon - 2568685 - info - done: copy_file
2016-08-09 17:07:23 124780 +0200 - ovs-01.be-gen8-2 - 3691/0 - arakoon - 2568686 - info - rename /mnt/ssd1/arakoon/mybackend-abm/tlogs/head.db.clone.tmp -> /mnt/ssd1/arakoon/mybackend-abm/tlogs/head.db.clone
2016-08-09 17:07:23 138782 +0200 - ovs-01.be-gen8-2 - 3691/0 - arakoon - 2568687 - info - Creating local store at /mnt/ssd1/arakoon/mybackend-abm/tlogs/head.db.clone (lcnum=16384) (ncnum=8192)
2016-08-09 17:07:23 138916 +0200 - ovs-01.be-gen8-2 - 3691/0 - arakoon - 2568688 - error - Exception during client request ((Failure success)) => rc:ff msg:unknown failure
2016-08-09 17:07:23 138995 +0200 - ovs-01.be-gen8-2 - 3691/0 - arakoon - 2568689 - info - 10.103.2.21:client_service_483879: closing
2016-08-09 17:07:23 313978 +0200 - ovs-01.be-gen8-2 - 3691/0 - arakoon - 2568690 - info - Exception during client request (End_of_file) => rc:ff msg:end of file because the client disconnected
2016-08-09 17:07:23 314034 +0200 - ovs-01.be-gen8-2 - 3691/0 - arakoon - 2568691 - info - 10.103.2.21:client_service_483878: closing

Arakoon went down, Unix.Unix_error(Unix.ENOENT, "open", "/mnt/hdd1/ovh/arakoon/global_nsm_2/tlogs/243.tlx")

We tried to remotely collapse the arakoon node and found that the collapse exited with an error.

arakoon: main: backtrace:
arakoon: main: Raised at file "map.ml", line 122, characters 16-25
arakoon: main: Called from file "src/core/lwt.ml", line 161, characters 4-40
arakoon: main: remote_collapsing_failed: End_of_file

arakoon version: 1.9.9

Error from the arakoon log file:

2016-08-30 17:13:49 398254 +0200 - perf-roub-01 - 3318/0 - arakoon - 78744 - warning - Exception during first_i_of /mnt/hdd1/ovh/arakoon/global_nsm_2/tlogs/243.tlx: Unix.Unix_error(Unix.ENOENT, "open", "/mnt/hdd1/ovh/arakoon/global_nsm_2/tlogs/243.tlx"); backtrace:; Raised by primitive operation at file "src/unix/lwt_unix.ml", line 241, characters 19-36
2016-08-30 17:13:49 398306 +0200 - perf-roub-01 - 3318/0 - arakoon - 78745 - fatal - Exception in fsm thread: Unix.Unix_error(Unix.ENOENT, "open", "/mnt/hdd1/ovh/arakoon/global_nsm_2/tlogs/243.tlx"); backtrace:; Raised by primitive operation at file "src/unix/lwt_unix.ml", line 241, characters 19-36
2016-08-30 17:13:49 398382 +0200 - perf-roub-01 - 3318/0 - arakoon - 78746 - fatal - going down: Unix.Unix_error(Unix.ENOENT, "open", "/mnt/hdd1/ovh/arakoon/global_nsm_2/tlogs/243.tlx"); backtrace:; Raised at file "map.ml", line 122, characters 16-25; Called from file "src/core/lwt.ml", line 161, characters 4-40
2016-08-30 17:13:49 398398 +0200 - perf-roub-01 - 3318/0 - arakoon - 78747 - fatal - after pick

Node was unable to recover or boot up due to store ahead of tlogs:

2016-08-31 06:51:22 521276 +0200 - perf-roub-01 - 69945/0 - arakoon - 36 - info - Taking job from compression queue...
2016-08-31 06:51:22 521289 +0200 - perf-roub-01 - 69945/0 - arakoon - 37 - info - Creating local store at /mnt/hdd1/ovh/arakoon/global_nsm_2/db/Tqj1zMtIxTlijc5c.db (lcnum=16384) (ncnum=8192)
2016-08-31 06:51:22 521551 +0200 - perf-roub-01 - 69945/0 - arakoon - 38 - info - verify_n_catchup_store; too_far_i=0 current_i=0 si_o:Some ("4603368")
2016-08-31 06:51:22 521564 +0200 - perf-roub-01 - 69945/0 - arakoon - 39 - fatal - too_far_i:0, si_o:Some ("4603368") should not happen: tlogs have been removed?
2016-08-31 06:51:22 521573 +0200 - perf-roub-01 - 69945/0 - arakoon - 40 - fatal - [rc=40] Store ahead of tlogs: s_i=4603368, tlog_i=0

Complete log file:
ovs-arakoon-global_nsm_2.log.1.gz

Adding Peer-to-Peer (P2P) support

Connectivity between nodes could also support a P2P-based mode of communication to provide automatic NAT traversal. This would be an alternative to Arakoon's current layer 3 connectivity requirement and would open a completely new window of opportunities for Arakoon clusters.

Node dies with "store counter too low" during catchup

2016-09-01 11:18:57 571875 +0200 - perf-roub-03 - 34126/0 - arakoon - 9903 - info - remote_nodestream :: last_seen = Some ("8411362")
2016-09-01 11:18:57 571905 +0200 - perf-roub-03 - 34126/0 - arakoon - 9904 - info - loop_parts done
2016-09-01 11:18:57 571919 +0200 - perf-roub-03 - 34126/0 - arakoon - 9905 - info - catchup_tlog completed
2016-09-01 11:18:57 571929 +0200 - perf-roub-03 - 34126/0 - arakoon - 9906 - info - CATCHUP phase 1 done (until_i = 8411362); now the store
2016-09-01 11:18:57 571933 +0200 - perf-roub-03 - 34126/0 - arakoon - 9907 - info - replaying log to store
2016-09-01 11:25:34 297698 +0200 - perf-roub-03 - 34126/0 - arakoon - 9950 - info - catchup_store completed, store is @ Some ("8408409")
2016-09-01 11:25:34 297759 +0200 - perf-roub-03 - 34126/0 - arakoon - 9951 - fatal - Exception in fsm thread: Catchup.StoreCounterTooLow("Catchup store failed. Store counter is too low: 8408409 < 8411362"); backtrace:; Raised at file "bytes.ml", line 219, characters 25-34; Called from file "string.ml", line 106, characters 2-26
2016-09-01 11:25:34 297794 +0200 - perf-roub-03 - 34126/0 - arakoon - 9952 - fatal - [rc=41] Store counter too low: Catchup store failed. Store counter is too low: 8408409 < 8411362
2016-09-01 11:25:34 297892 +0200 - perf-roub-03 - 34126/0 - arakoon - 9953 - info - waiting for 1 client_threads

At the same time, the node was also doing a collapse:

2016-09-01 09:16:22 804865 +0200 - perf-roub-03 - 34126/0 - arakoon - 8562 - info - ... Start collapsing ... (n=2)
2016-09-01 09:16:22 804879 +0200 - perf-roub-03 - 34126/0 - arakoon - 8563 - info - Starting collapse
2016-09-01 09:16:22 805625 +0200 - perf-roub-03 - 34126/0 - arakoon - 8566 - info - Going to collapse 371 tlogs
...
2016-09-01 11:13:03 478336 +0200 - perf-roub-03 - 34126/0 - arakoon - 9852 - info - Completed replay of 3886.tlx, took 33.101315 seconds, 84 to go
2016-09-01 11:13:41 398694 +0200 - perf-roub-03 - 34126/0 - arakoon - 9856 - info - Completed replay of 3887.tlx, took 37.920212 seconds, 83 to go
2016-09-01 11:14:18 528197 +0200 - perf-roub-03 - 34126/0 - arakoon - 9859 - info - Completed replay of 3888.tlx, took 37.129387 seconds, 82 to go
2016-09-01 11:14:51 625402 +0200 - perf-roub-03 - 34126/0 - arakoon - 9863 - info - Completed replay of 3889.tlx, took 33.097096 seconds, 81 to go
2016-09-01 11:15:25 910810 +0200 - perf-roub-03 - 34126/0 - arakoon - 9865 - info - Completed replay of 3890.tlx, took 34.285292 seconds, 80 to go
2016-09-01 11:15:57 624861 +0200 - perf-roub-03 - 34126/0 - arakoon - 9875 - info - Completed replay of 3891.tlx, took 31.713918 seconds, 79 to go
2016-09-01 11:16:31 939269 +0200 - perf-roub-03 - 34126/0 - arakoon - 9879 - info - Completed replay of 3892.tlx, took 34.314290 seconds, 78 to go
2016-09-01 11:17:07 843194 +0200 - perf-roub-03 - 34126/0 - arakoon - 9881 - info - Completed replay of 3893.tlx, took 35.903797 seconds, 77 to go
2016-09-01 11:17:40 516819 +0200 - perf-roub-03 - 34126/0 - arakoon - 9885 - info - Completed replay of 3894.tlx, took 32.673508 seconds, 76 to go
2016-09-01 11:18:16 689074 +0200 - perf-roub-03 - 34126/0 - arakoon - 9897 - info - Completed replay of 3895.tlx, took 36.172111 seconds, 75 to go
2016-09-01 11:18:52 52456 +0200 - perf-roub-03 - 34126/0 - arakoon - 9901 - info - Completed replay of 3896.tlx, took 35.363241 seconds, 74 to go
2016-09-01 11:18:57 571933 +0200 - perf-roub-03 - 34126/0 - arakoon - 9907 - info - replaying log to store
2016-09-01 11:19:25 773770 +0200 - perf-roub-03 - 34126/0 - arakoon - 9909 - info - Completed replay of 3897.tlx, took 33.721211 seconds, 73 to go
2016-09-01 11:20:01 682825 +0200 - perf-roub-03 - 34126/0 - arakoon - 9919 - info - Completed replay of 3898.tlx, took 35.908912 seconds, 72 to go
2016-09-01 11:20:36 179315 +0200 - perf-roub-03 - 34126/0 - arakoon - 9923 - info - Completed replay of 3899.tlx, took 34.496399 seconds, 71 to go
2016-09-01 11:21:10 893929 +0200 - perf-roub-03 - 34126/0 - arakoon - 9925 - info - Completed replay of 3900.tlx, took 34.714481 seconds, 70 to go
2016-09-01 11:21:43 904766 +0200 - perf-roub-03 - 34126/0 - arakoon - 9929 - info - Completed replay of 3901.tlx, took 33.010727 seconds, 69 to go
2016-09-01 11:22:18 955609 +0200 - perf-roub-03 - 34126/0 - arakoon - 9931 - info - Completed replay of 3902.tlx, took 35.050726 seconds, 68 to go
2016-09-01 11:22:53 467596 +0200 - perf-roub-03 - 34126/0 - arakoon - 9935 - info - Completed replay of 3903.tlx, took 34.511855 seconds, 67 to go
2016-09-01 11:23:26 803493 +0200 - perf-roub-03 - 34126/0 - arakoon - 9937 - info - Completed replay of 3904.tlx, took 33.335801 seconds, 66 to go
2016-09-01 11:24:00 434317 +0200 - perf-roub-03 - 34126/0 - arakoon - 9941 - info - Completed replay of 3905.tlx, took 33.630692 seconds, 65 to go
2016-09-01 11:24:35 546330 +0200 - perf-roub-03 - 34126/0 - arakoon - 9944 - info - Completed replay of 3906.tlx, took 35.111882 seconds, 64 to go
2016-09-01 11:25:09 159430 +0200 - perf-roub-03 - 34126/0 - arakoon - 9948 - info - Completed replay of 3907.tlx, took 33.612957 seconds, 63 to go

After upstart restarted the process, it ran fine again.
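
For context, the invariant that failed: phase 1 of catchup fetches tlogs up to until_i, phase 2 replays them into the store, and the store counter must then have reached until_i. A hedged Python sketch of the post-replay check (the exception name shows the real check lives in Arakoon's Catchup module; names here are illustrative):

    class StoreCounterTooLow(Exception):
        pass

    def check_catchup_result(store_i, until_i):
        # After replaying the fetched tlogs, the store must be at until_i.
        if store_i < until_i:
            raise StoreCounterTooLow(
                "Catchup store failed. Store counter is too low: %d < %d"
                % (store_i, until_i))

With the values from the log (store_i=8408409, until_i=8411362) this raises exactly the reported error; the open question is why the concurrent collapse left the store roughly 3000 entries short.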

Exception during collapse: "Head db has invalid counter: 105838"

During an active Arakoon collapse we experienced the following on our 10 ms latency multi-datacenter environment:

2016/05/20 10:13:12 684721 +0200 - ctl03 - 19683/0 - arakoon - 33779 - debug - tlogs_to_collapse 102425 952109 2 => Some (1138,949193)
2016/05/20 10:13:12 684734 +0200 - ctl03 - 19683/0 - arakoon - 33780 - info - Going to collapse 1138 tlogs
2016/05/20 10:13:12 685096 +0200 - ctl03 - 19683/0 - arakoon - 33781 - info - Creating db clone at /mnt/ssd1/arakoon/mybackend-a-nsm_0/tlogs/head.db.clone (throttling:0.000000)
2016/05/20 10:13:12 685214 +0200 - ctl03 - 19683/0 - arakoon - 33782 - info - copy_file /mnt/ssd1/arakoon/mybackend-a-nsm_0/tlogs/head.db /mnt/ssd1/arakoon/mybackend-a-nsm_0/tlogs/head.db.clone (overwrite=true,throttling=0.000000)
2016/05/20 10:13:12 685322 +0200 - ctl03 - 19683/0 - arakoon - 33783 - info - Unlinking "/mnt/ssd1/arakoon/mybackend-a-nsm_0/tlogs/head.db.clone.tmp"
2016/05/20 10:13:12 685447 +0200 - ctl03 - 19683/0 - arakoon - 33784 - info - Unlink of "/mnt/ssd1/arakoon/mybackend-a-nsm_0/tlogs/head.db.clone.tmp" failed with ENOENT
2016/05/20 10:13:13 375508 +0200 - ctl03 - 19683/0 - arakoon - 33785 - info - done: copy_file
2016/05/20 10:13:13 616942 +0200 - ctl03 - 19683/0 - arakoon - 33786 - info - rename /mnt/ssd1/arakoon/mybackend-a-nsm_0/tlogs/head.db.clone.tmp -> /mnt/ssd1/arakoon/mybackend-a-nsm_0/tlogs/head.db.clone
2016/05/20 10:13:13 617955 +0200 - ctl03 - 19683/0 - arakoon - 33787 - info - Creating local store at /mnt/ssd1/arakoon/mybackend-a-nsm_0/tlogs/head.db.clone (lcnum=16384) (ncnum=8192)
2016/05/20 10:13:13 656021 +0200 - ctl03 - 19683/0 - arakoon - 33788 - info - infimum=102425
2016/05/20 10:13:13 688919 +0200 - ctl03 - 19683/0 - arakoon - 33789 - info - Replaying tlog file: 133.tlx (1/1144)
2016/05/20 10:13:13 689507 +0200 - ctl03 - 19683/0 - arakoon - 33790 - debug - fold_read extension=.tlx => index':None
2016/05/20 10:13:15 108387 +0200 - ctl03 - 19683/0 - arakoon - 33791 - info - Completed replay of 133.tlx, took 1.418846 seconds, 1143 to go
2016/05/20 10:13:15 108834 +0200 - ctl03 - 19683/0 - arakoon - 33792 - info - Replaying tlog file: 134.tlx (2/1144)
2016/05/20 10:13:15 109065 +0200 - ctl03 - 19683/0 - arakoon - 33793 - debug - fold_read extension=.tlx => index':None
2016/05/20 10:13:16 706671 +0200 - ctl03 - 19683/0 - arakoon - 33794 - info - Completed replay of 134.tlx, took 1.597610 seconds, 1142 to go
2016/05/20 10:13:16 706949 +0200 - ctl03 - 19683/0 - arakoon - 33795 - info - Replaying tlog file: 135.tlx (3/1144)
2016/05/20 10:13:16 707222 +0200 - ctl03 - 19683/0 - arakoon - 33796 - debug - fold_read extension=.tlx => index':None
2016/05/20 10:13:18 513171 +0200 - ctl03 - 19683/0 - arakoon - 33797 - info - Completed replay of 135.tlx, took 1.805939 seconds, 1141 to go
2016/05/20 10:13:18 513744 +0200 - ctl03 - 19683/0 - arakoon - 33798 - info - Replaying tlog file: 136.tlx (4/1144)
2016/05/20 10:13:18 513909 +0200 - ctl03 - 19683/0 - arakoon - 33799 - debug - fold_read extension=.tlx => index':None
2016/05/20 10:13:20 267673 +0200 - ctl03 - 19683/0 - arakoon - 33800 - info - Completed replay of 136.tlx, took 1.753726 seconds, 1140 to go
2016/05/20 10:13:20 268058 +0200 - ctl03 - 19683/0 - arakoon - 33801 - info - Replaying tlog file: 137.tlx (5/1144)
2016/05/20 10:13:20 268245 +0200 - ctl03 - 19683/0 - arakoon - 33802 - debug - fold_read extension=.tlx => index':None
2016/05/20 10:13:21 877769 +0200 - ctl03 - 19683/0 - arakoon - 33803 - info - stats: {start: 1463731701.821941, last: 1463731701.821943,; avg_set_size: 0.000000,; avg_get_size: 0.000000,; avg_range_size: 0.000000,; avg_range_entries_size: 0.000000,; avg_rev_range_entries_size: 0.000000,; avg_prefix_size: 0.000000,; avg_del_prefix_size: 0.000000,; harvest_stats: (n:279 min: 1., max: 9., avg: 1.3476702509, dev: 1.47311150372),; set_info: (n:0 min: n/a, max: n/a, avg: n/a, dev: n/a),; get_info: (n:0 min: n/a, max: n/a, avg: n/a, dev: n/a),; del_info: (n:0 min: n/a, max: n/a, avg: n/a, dev: n/a),; mget_info: (n:0 min: n/a, max: n/a, avg: n/a, dev: n/a),; mget_option_info: (n:0 min: n/a, max: n/a, avg: n/a, dev: n/a); seq_info: (n:0 min: n/a, max: n/a, avg: n/a, dev: n/a),; tas_info: (n:0 min: n/a, max: n/a, avg: n/a, dev: n/a),; replace_info: (n:0 min: n/a, max: n/a, avg: n/a, dev: n/a),; range_info: (n:0 min: n/a, max: n/a, avg: n/a, dev: n/a),; range_entries_info: (n:0 min: n/a, max: n/a, avg: n/a, dev: n/a),; rev_range_entries_info: (n:0 min: n/a, max: n/a, avg: n/a, dev: n/a),; prefix_info: (n:0 min: n/a, max: n/a, avg: n/a, dev: n/a),; delete_prefix_info: (n:0 min: n/a, max: n/a, avg: n/a, dev: n/a),; ops_info: (n:0 min: n/a, max: n/a, avg: n/a, dev: n/a),; mem_allocated: 275121871.906250 KB,; mem_maxrss: 439628 KB,; mem_minor_collections: 101344,; mem_major_collections: 28514,; mem_compactions: 5441,; node_is: (SywfRodETohwFTW4,952116)(oJ9guhHLBBJWzDnT,657081)(9995UaXuBIjq721q,952118)}
2016/05/20 10:13:22 155802 +0200 - ctl03 - 19683/0 - arakoon - 33804 - info - Completed replay of 137.tlx, took 1.887547 seconds, 1139 to go
2016/05/20 10:13:22 156157 +0200 - ctl03 - 19683/0 - arakoon - 33805 - info - Replaying tlog file: 138.tlog (6/1144)
2016/05/20 10:13:22 156253 +0200 - ctl03 - 19683/0 - arakoon - 33806 - debug - fold_read extension=.tlog => index':Some {filename="/mnt/ssd1/arakoon/mybackend-a-nsm_0/tlogs/1276.tlog";mapping=(952000,3573298)::(951000,3137088)}
2016/05/20 10:13:22 170317 +0200 - ctl03 - 19683/0 - arakoon - 33807 - info - Completed replay of 138.tlog, took 0.014062 seconds, 1138 to go
2016/05/20 10:13:22 170599 +0200 - ctl03 - 19683/0 - arakoon - 33808 - info - Replaying tlog file: 139.tlx (7/1144)
2016/05/20 10:13:22 170681 +0200 - ctl03 - 19683/0 - arakoon - 33809 - debug - fold_read extension=.tlx => index':None
2016/05/20 10:13:23 924851 +0200 - ctl03 - 19683/0 - arakoon - 33810 - info - Completed replay of 139.tlx, took 1.754160 seconds, 1137 to go
2016/05/20 10:13:23 925149 +0200 - ctl03 - 19683/0 - arakoon - 33811 - info - Replaying tlog file: 140.tlx (8/1144)
2016/05/20 10:13:23 925234 +0200 - ctl03 - 19683/0 - arakoon - 33812 - debug - fold_read extension=.tlx => index':None
2016/05/20 10:13:25 610909 +0200 - ctl03 - 19683/0 - arakoon - 33813 - info - Completed replay of 140.tlx, took 1.685666 seconds, 1136 to go
2016/05/20 10:13:26 199980 +0200 - ctl03 - 19683/0 - arakoon - 33814 - info - local_store "/mnt/ssd1/arakoon/mybackend-a-nsm_0/tlogs/head.db.clone" :: closed  () 
2016/05/20 10:13:26 200977 +0200 - ctl03 - 19683/0 - arakoon - 33815 - info - unlinking /mnt/ssd1/arakoon/mybackend-a-nsm_0/tlogs/head.db.clone
2016/05/20 10:13:26 201150 +0200 - ctl03 - 19683/0 - arakoon - 33816 - info - Unlinking "/mnt/ssd1/arakoon/mybackend-a-nsm_0/tlogs/head.db.clone"
2016/05/20 10:13:26 248670 +0200 - ctl03 - 19683/0 - arakoon - 33817 - error - Exception during client request ((Failure "Head db has invalid counter: 105838")) => rc:ff msg:unknown failure
2016/05/20 10:13:26 249054 +0200 - ctl03 - 19683/0 - arakoon - 33818 - info - 172.19.10.32:client_service_632: closing

Also seen this on the first controller node:

2016/05/19 20:10:27 671283 +0200 - ctl01 - 26562/0 - arakoon - 30828 - error - Exception during client request ((Failure "Head db has invalid counter: 657082")) => rc:ff msg:unknown failure
2016/05/19 20:20:23 755491 +0200 - ctl01 - 26562/0 - arakoon - 31428 - error - Exception during client request ((Failure "Head db has invalid counter: 657082")) => rc:ff msg:unknown failure
2016/05/19 20:21:15 528745 +0200 - ctl01 - 26562/0 - arakoon - 31466 - error - Exception during client request ((Failure "Head db has invalid counter: 657082")) => rc:ff msg:unknown failure
2016/05/19 22:10:29 303992 +0200 - ctl01 - 26562/0 - arakoon - 33183 - error - Exception during client request ((Failure "Head db has invalid counter: 657082")) => rc:ff msg:unknown failure
2016/05/19 22:14:39 466363 +0200 - ctl01 - 26562/0 - arakoon - 33369 - error - Exception during client request ((Failure "Head db has invalid counter: 657082")) => rc:ff msg:unknown failure
2016/05/19 22:15:30 401192 +0200 - ctl01 - 26562/0 - arakoon - 33407 - error - Exception during client request ((Failure "Head db has invalid counter: 657082")) => rc:ff msg:unknown failure
2016/05/20 00:10:27 704010 +0200 - ctl01 - 26562/0 - arakoon - 34638 - error - Exception during client request ((Failure "Head db has invalid counter: 657082")) => rc:ff msg:unknown failure
2016/05/20 00:14:15 176851 +0200 - ctl01 - 26562/0 - arakoon - 34726 - error - Exception during client request ((Failure "Head db has invalid counter: 657082")) => rc:ff msg:unknown failure
2016/05/20 00:15:01 923924 +0200 - ctl01 - 26562/0 - arakoon - 34776 - error - Exception during client request ((Failure "Head db has invalid counter: 657082")) => rc:ff msg:unknown failure
2016/05/20 02:10:31 657418 +0200 - ctl01 - 26562/0 - arakoon - 36203 - error - Exception during client request ((Failure "Head db has invalid counter: 657082")) => rc:ff msg:unknown failure
2016/05/20 02:15:52 321904 +0200 - ctl01 - 26562/0 - arakoon - 36443 - error - Exception during client request ((Failure "Head db has invalid counter: 657082")) => rc:ff msg:unknown failure
2016/05/20 02:16:43 176822 +0200 - ctl01 - 26562/0 - arakoon - 36503 - error - Exception during client request ((Failure "Head db has invalid counter: 657082")) => rc:ff msg:unknown failure
2016/05/20 04:10:27 805505 +0200 - ctl01 - 26562/0 - arakoon - 37490 - error - Exception during client request ((Failure "Head db has invalid counter: 657082")) => rc:ff msg:unknown failure
2016/05/20 04:14:00 714565 +0200 - ctl01 - 26562/0 - arakoon - 37548 - error - Exception during client request ((Failure "Head db has invalid counter: 657082")) => rc:ff msg:unknown failure
2016/05/20 04:14:46 783472 +0200 - ctl01 - 26562/0 - arakoon - 37597 - error - Exception during client request ((Failure "Head db has invalid counter: 657082")) => rc:ff msg:unknown failure
2016/05/20 06:10:27 307234 +0200 - ctl01 - 26562/0 - arakoon - 38182 - error - Exception during client request ((Failure "Head db has invalid counter: 657082")) => rc:ff msg:unknown failure
2016/05/20 06:12:39 258997 +0200 - ctl01 - 26562/0 - arakoon - 38230 - error - Exception during client request ((Failure "Head db has invalid counter: 657082")) => rc:ff msg:unknown failure
2016/05/20 06:13:23 762371 +0200 - ctl01 - 26562/0 - arakoon - 38278 - error - Exception during client request ((Failure "Head db has invalid counter: 657082")) => rc:ff msg:unknown failure
2016/05/20 08:10:28 711716 +0200 - ctl01 - 26562/0 - arakoon - 38920 - error - Exception during client request ((Failure "Head db has invalid counter: 657082")) => rc:ff msg:unknown failure
2016/05/20 08:12:50 406347 +0200 - ctl01 - 26562/0 - arakoon - 38979 - error - Exception during client request ((Failure "Head db has invalid counter: 657082")) => rc:ff msg:unknown failure
2016/05/20 08:13:41 517715 +0200 - ctl01 - 26562/0 - arakoon - 39018 - error - Exception during client request ((Failure "Head db has invalid counter: 657082")) => rc:ff msg:unknown failure
2016/05/20 10:10:27 854089 +0200 - ctl01 - 26562/0 - arakoon - 39636 - error - Exception during client request ((Failure "Head db has invalid counter: 657082")) => rc:ff msg:unknown failure
2016/05/20 10:12:51 886109 +0200 - ctl01 - 26562/0 - arakoon - 39685 - error - Exception during client request ((Failure "Head db has invalid counter: 657082")) => rc:ff msg:unknown failure
2016/05/20 10:13:39 547445 +0200 - ctl01 - 26562/0 - arakoon - 39723 - error - Exception during client request ((Failure "Head db has invalid counter: 657082")) => rc:ff msg:unknown failure

But not on controller 2, although the collapse had succeeded there the day before.
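
For orientation, the collapse flow visible in the logs: clone head.db, replay the selected tlogs into the clone, validate the clone's counter, then promote the clone. A minimal Python sketch under those assumptions (replay and read_counter are hypothetical helpers; the failure above happens at the validation step):

    import os
    import shutil

    def collapse(head_db, tlogs, expected_counter, replay, read_counter):
        clone = head_db + ".clone"
        shutil.copyfile(head_db, clone)    # "Creating db clone at ..."
        for tlog in tlogs:                 # "Replaying tlog file: ..."
            replay(clone, tlog)
        counter = read_counter(clone)
        if counter != expected_counter:    # the validation that fails above
            os.unlink(clone)               # "unlinking .../head.db.clone"
            raise RuntimeError("Head db has invalid counter: %d" % counter)
        shutil.move(clone, head_db)        # promote the collapsed clone

Note that in the failing run only 8 of the 1144 tlogs were replayed before the clone was closed and unlinked, so the validation never saw a fully replayed clone.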

Arakoon version:

arakoon:
  Installed: 1.9.3
  Candidate: 1.9.3
  Version table:
 *** 1.9.3 0
        500 http://apt.openvstorage.org/ unstable/main amd64 Packages
        100 /var/lib/dpkg/status

Open vStorage version:

ii  openvstorage                          2.7.0-rev.3451.9fa0267-1                                 amd64        openvStorage
ii  openvstorage-backend                  1.7.0-rev.588.2e534bc-1                                  amd64        openvStorage Backend plugin
ii  openvstorage-backend-core             1.7.0-rev.588.2e534bc-1                                  amd64        openvStorage Backend plugin core
ii  openvstorage-backend-webapps          1.7.0-rev.588.2e534bc-1                                  amd64        openvStorage Backend plugin Web Applications
ii  openvstorage-cinder-plugin            1.2.0-rev.26.fd48d23-1                                   amd64        OpenvStorage Cinder plugin for OpenStack
ii  openvstorage-core                     2.7.0-rev.3451.9fa0267-1                                 amd64        openvStorage core
ii  openvstorage-webapps                  2.7.0-rev.3451.9fa0267-1                                 amd64        openvStorage Web Applications

Use the standard ovs log format (timestamp)

Can you please use the standard OVS log format described in the gitbook?

Only the timestamp is incorrect:

2016/06/06 06:57:27 588953 +0200 - cmp01 - 12197/0 - arakoon - 15593 - info - iuMMCJanEt6V2jUJ is master

should be

2016-06-06 06:57:27 588953 +0200 - cmp01 - 12197/0 - arakoon - 15593 - info - iuMMCJanEt6V2jUJ is master
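
The difference is only the date separator; in Python strftime terms:

    from datetime import datetime

    ts = datetime(2016, 6, 6, 6, 57, 27)
    print(ts.strftime("%Y/%m/%d %H:%M:%S"))  # current arakoon prefix: 2016/06/06 06:57:27
    print(ts.strftime("%Y-%m-%d %H:%M:%S"))  # expected OVS prefix:    2016-06-06 06:57:27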

Also Alba needs this modification.

Client_helper.find_master doesn't handle problem nodes well

If a node for whatever reason (e.g. a bug; in my case it was a misconfigured cluster_id) does not respond correctly to a who_master call, then find_master does not try querying the other nodes of the cluster. This means one bad node can make a cluster unreachable for (OCaml) clients.

The faulty code is at https://github.com/openvstorage/arakoon/blob/1.9/src/client/client_helper.ml#L266-L270. It should call loop there instead of returning.
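
A minimal sketch of the intended behaviour, in Python for brevity (who_master stands in for the per-node call; the actual fix belongs in the OCaml loop linked above):

    def find_master(nodes, who_master):
        # Try every node; one misbehaving node must not abort the search.
        errors = {}
        for node in nodes:
            try:
                master = who_master(node)  # may raise, e.g. on a bad cluster_id
            except Exception as exc:
                errors[node] = exc         # record the failure, keep looping
                continue
            if master is not None:
                return master
        raise RuntimeError("no master found; per-node errors: %r" % errors)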

What if the head.db is not the correct one?

What happens if we copy a main.db from another cluster into a node's tlog directory?
If this cluster then needs to copy the head.db over the main.db, will that trigger an exception in Arakoon or not?
I know there is a function in Arakoon (--copy-db-to-head) to prevent this.

Please make sure Arakoon cannot be started with a head.db from another cluster.
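
A hedged sketch of the requested safeguard: before using a head.db, compare the cluster id recorded in it with the node's configured cluster id (read_cluster_id is a hypothetical helper; the real check would belong in Arakoon's OCaml startup path):

    def guard_head_db(head_db_path, configured_cluster_id, read_cluster_id):
        head_cluster_id = read_cluster_id(head_db_path)  # hypothetical helper
        if head_cluster_id != configured_cluster_id:
            raise RuntimeError(
                "refusing to use %s: it belongs to cluster %r, "
                "but this node is configured for %r"
                % (head_db_path, head_cluster_id, configured_cluster_id))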

Arakoon cluster is losing its master in local datacenter

We have a cluster of arakoons in datacenter A.

We've checked the arakoons, and they are running normally.

But it seems like the cluster is losing its master over and over again:

2016-07-05 07:22:56 52076 +0200 - perf-roub-02 - 42745/0 - arakoon - 579836 - info - nhfE8ZP77taJwMBG is master
2016-07-05 07:32:33 451858 +0200 - perf-roub-02 - 42745/0 - arakoon - 586572 - info - nhfE8ZP77taJwMBG is master
2016-07-05 07:33:33 520020 +0200 - perf-roub-02 - 42745/0 - arakoon - 587923 - info - nhfE8ZP77taJwMBG is master
2016-07-05 07:34:33 554092 +0200 - perf-roub-02 - 42745/0 - arakoon - 587927 - info - nhfE8ZP77taJwMBG is master
2016-07-05 07:35:33 618590 +0200 - perf-roub-02 - 42745/0 - arakoon - 587931 - info - nhfE8ZP77taJwMBG is master
2016-07-05 07:36:33 660008 +0200 - perf-roub-02 - 42745/0 - arakoon - 587935 - info - nhfE8ZP77taJwMBG is master
2016-07-05 08:28:42 843502 +0200 - perf-roub-02 - 42745/0 - arakoon - 632403 - info - 09FydVMSuGL4Fp3k is master
2016-07-05 08:29:42 915662 +0200 - perf-roub-02 - 42745/0 - arakoon - 633952 - info - 09FydVMSuGL4Fp3k is master
2016-07-05 08:30:42 967519 +0200 - perf-roub-02 - 42745/0 - arakoon - 633959 - info - 09FydVMSuGL4Fp3k is master
2016-07-05 08:31:43 24771 +0200 - perf-roub-02 - 42745/0 - arakoon - 633963 - info - 09FydVMSuGL4Fp3k is master
2016-07-05 08:55:30 406956 +0200 - perf-roub-02 - 42745/0 - arakoon - 657281 - info - 09FydVMSuGL4Fp3k is master
2016-07-05 08:56:30 481026 +0200 - perf-roub-02 - 42745/0 - arakoon - 658980 - info - 09FydVMSuGL4Fp3k is master
2016-07-05 08:57:30 528125 +0200 - perf-roub-02 - 42745/0 - arakoon - 659001 - info - 09FydVMSuGL4Fp3k is master
2016-07-05 08:58:30 569325 +0200 - perf-roub-02 - 42745/0 - arakoon - 659008 - info - 09FydVMSuGL4Fp3k is master
2016-07-05 08:59:30 619962 +0200 - perf-roub-02 - 42745/0 - arakoon - 659012 - info - 09FydVMSuGL4Fp3k is master
2016-07-05 09:13:23 40990 +0200 - perf-roub-02 - 42745/0 - arakoon - 675263 - info - 09FydVMSuGL4Fp3k is master

...

2016-07-05 09:32:56 126344 +0200 - perf-roub-02 - 42745/0 - arakoon - 687609 - info - 09FydVMSuGL4Fp3k is master
2016-07-05 09:33:56 176095 +0200 - perf-roub-02 - 42745/0 - arakoon - 688461 - info - 09FydVMSuGL4Fp3k is master
2016-07-05 09:34:56 227279 +0200 - perf-roub-02 - 42745/0 - arakoon - 688469 - info - 09FydVMSuGL4Fp3k is master
2016-07-05 09:35:56 283077 +0200 - perf-roub-02 - 42745/0 - arakoon - 688473 - info - 09FydVMSuGL4Fp3k is master
2016-07-05 09:36:56 347654 +0200 - perf-roub-02 - 42745/0 - arakoon - 688477 - info - 09FydVMSuGL4Fp3k is master

...

2016-07-05 09:50:46 54271 +0200 - perf-roub-02 - 42745/0 - arakoon - 699961 - info - 09FydVMSuGL4Fp3k is master

...

2016-07-05 10:00:07 316757 +0200 - perf-roub-02 - 42745/0 - arakoon - 708913 - info - bKz9fJNRoh8ZGfe7 is master
2016-07-05 10:01:07 372101 +0200 - perf-roub-02 - 42745/0 - arakoon - 709511 - info - bKz9fJNRoh8ZGfe7 is master
2016-07-05 10:02:07 431673 +0200 - perf-roub-02 - 42745/0 - arakoon - 710160 - info - bKz9fJNRoh8ZGfe7 is master
2016-07-05 10:03:07 472199 +0200 - perf-roub-02 - 42745/0 - arakoon - 710412 - info - bKz9fJNRoh8ZGfe7 is master
2016-07-05 10:04:07 522862 +0200 - perf-roub-02 - 42745/0 - arakoon - 710629 - info - bKz9fJNRoh8ZGfe7 is master
2016-07-05 10:05:07 576812 +0200 - perf-roub-02 - 42745/0 - arakoon - 710846 - info - bKz9fJNRoh8ZGfe7 is master

Arakoon restarted without reason

Looking at the log file:

2016-06-30 14:20:52 968514 +0200 - perf-roub-03 - 8538/0 - arakoon - 16235966 - info - establishing connection to (188.165.13.24,26413) (timeout = 5.000000)
2016-06-30 14:20:52 968765 +0200 - perf-roub-03 - 8538/0 - arakoon - 16235967 - info - __open_connection to (188.165.13.24,26413) failed: Unix.Unix_error(Unix.ECONNREFUSED, "connect", ""); backtrace:; Raised at file "src/unix/lwt_unix.ml", line 1419, characters 20-57; Called from file "src/unix/lwt_unix.ml", line 487, characters 13-24
2016-06-30 14:20:53 969780 +0200 - perf-roub-03 - 8538/0 - arakoon - 16235968 - info - establishing connection to (188.165.13.24,26413) (timeout = 5.000000)
2016-06-30 14:20:53 969999 +0200 - perf-roub-03 - 8538/0 - arakoon - 16235969 - info - __open_connection to (188.165.13.24,26413) failed: Unix.Unix_error(Unix.ECONNREFUSED, "connect", ""); backtrace:; Raised at file "src/unix/lwt_unix.ml", line 1419, characters 20-57; Called from file "src/unix/lwt_unix.ml", line 487, characters 13-24
2016-06-30 14:20:54 970793 +0200 - perf-roub-03 - 8538/0 - arakoon - 16235970 - info - establishing connection to (188.165.13.24,26413) (timeout = 5.000000)
2016-06-30 14:20:54 971025 +0200 - perf-roub-03 - 8538/0 - arakoon - 16235971 - info - __open_connection to (188.165.13.24,26413) failed: Unix.Unix_error(Unix.ECONNREFUSED, "connect", ""); backtrace:; Raised at file "src/unix/lwt_unix.ml", line 1419, characters 20-57; Called from file "src/unix/lwt_unix.ml", line 487, characters 13-24
2016-06-30 14:20:55 971856 +0200 - perf-roub-03 - 8538/0 - arakoon - 16235972 - info - establishing connection to (188.165.13.24,26413) (timeout = 5.000000)
2016-06-30 14:20:55 972099 +0200 - perf-roub-03 - 8538/0 - arakoon - 16235973 - info - __open_connection to (188.165.13.24,26413) failed: Unix.Unix_error(Unix.ECONNREFUSED, "connect", ""); backtrace:; Raised at file "src/unix/lwt_unix.ml", line 1419, characters 20-57; Called from file "src/unix/lwt_unix.ml", line 487, characters 13-24
2016-06-30 14:20:56 972218 +0200 - perf-roub-03 - 8538/0 - arakoon - 16235974 - info - establishing connection to (188.165.13.24,26413) (timeout = 5.000000)
2016-06-30 14:20:56 972482 +0200 - perf-roub-03 - 8538/0 - arakoon - 16235975 - info - __open_connection to (188.165.13.24,26413) failed: Unix.Unix_error(Unix.ECONNREFUSED, "connect", ""); backtrace:; Raised at file "src/unix/lwt_unix.ml", line 1419, characters 20-57; Called from file "src/unix/lwt_unix.ml", line 487, characters 13-24
2016-06-30 14:20:57 972949 +0200 - perf-roub-03 - 8538/0 - arakoon - 16235976 - info - establishing connection to (188.165.13.24,26413) (timeout = 5.000000)
2016-06-30 14:20:57 973176 +0200 - perf-roub-03 - 8538/0 - arakoon - 16235977 - info - __open_connection to (188.165.13.24,26413) failed: Unix.Unix_error(Unix.ECONNREFUSED, "connect", ""); backtrace:; Raised at file "src/unix/lwt_unix.ml", line 1419, characters 20-57; Called from file "src/unix/lwt_unix.ml", line 487, characters 13-24
2016-06-30 14:20:58 956135 +0200 - perf-roub-03 - 8538/0 - arakoon - 16235978 - info - should roll over: tlog_entries=5000
2016-06-30 14:20:58 963411 +0200 - perf-roub-03 - 8538/0 - arakoon - 16235979 - info - should roll: i= 45623944 old_outer = 110194 => jump to 110195
2016-06-30 14:20:58 963511 +0200 - perf-roub-03 - 8538/0 - arakoon - 16235980 - info - closed 110194.tlog
2016-06-30 14:20:58 964457 +0200 - perf-roub-03 - 8538/0 - arakoon - 16235981 - info - adding new task to compression queue /mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs/110194.tlog,/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs/110194.tlx
2016-06-30 14:20:58 964570 +0200 - perf-roub-03 - 8538/0 - arakoon - 16235982 - info - Compressing: /mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs/110194.tlog into /mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs/110194.tlx
2016-06-30 14:20:58 964648 +0200 - perf-roub-03 - 8538/0 - arakoon - 16235983 - info - Compressing "/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs/110194.tlog" to "/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs/110194.tlx" via "/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs/110194.tlx.part"
2016-06-30 14:20:58 964672 +0200 - perf-roub-03 - 8538/0 - arakoon - 16235984 - info - Unlinking "/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs/110194.tlx.part"
2016-06-30 14:20:58 974143 +0200 - perf-roub-03 - 8538/0 - arakoon - 16235985 - info - establishing connection to (188.165.13.24,26413) (timeout = 5.000000)
2016-06-30 14:20:59 651237 +0200 - perf-roub-03 - 24678/0 - arakoon - 0 - info - --- NODE STARTED ---
2016-06-30 14:20:59 651267 +0200 - perf-roub-03 - 24678/0 - arakoon - 1 - info - git_revision: tags/1.9.6-0-gea32319 
2016-06-30 14:20:59 651276 +0200 - perf-roub-03 - 24678/0 - arakoon - 2 - info - compile_time: 10/06/2016 13:21:09 UTC 
2016-06-30 14:20:59 651283 +0200 - perf-roub-03 - 24678/0 - arakoon - 3 - info - version: 1.9.6
2016-06-30 14:20:59 651293 +0200 - perf-roub-03 - 24678/0 - arakoon - 4 - info - NOFILE: 1024
2016-06-30 14:20:59 651379 +0200 - perf-roub-03 - 24678/0 - arakoon - 5 - info - cluster_cfg={ cfgs = [{ node_name = "bKz9fJNRoh8ZGfe7"; ips = ["178.33.60.66"]; client_port = 26412; messaging_port = 26413; home = "/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/db"; tlog_dir = "/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs"; log_sinks = [Console]; crash_log_sinks = [File(console:)]; tlx_dir = "/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs"; head_dir = "/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs"; log_level = "info"; log_config = None; batched_transaction_config = None; lease_period = 10; master = "Elected"; is_laggy = false; is_learner = false; is_witness = false; targets = []; compressor = Snappy; fsync = true; is_test = false; reporting = 300; _fsync_tlog_dir = true; node_tls = None; collapse_slowdown = None; head_copy_throttling = 0. }; { node_name = "nhfE8ZP77taJwMBG"; ips = ["178.33.62.163"]; client_port = 26412; messaging_port = 26413; home = "/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/db"; tlog_dir = "/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs"; log_sinks = [Console]; crash_log_sinks = [File(console:)]; tlx_dir = "/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs"; head_dir = "/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs"; log_level = "info"; log_config = None; batched_transaction_config = None; lease_period = 10; master = "Elected"; is_laggy = false; is_learner = false; is_witness = false; targets = []; compressor = Snappy; fsync = true; is_test = false; reporting = 300; _fsync_tlog_dir = true; node_tls = None; collapse_slowdown = None; head_copy_throttling = 0. }; { node_name = "09FydVMSuGL4Fp3k"; ips = ["188.165.13.24"]; client_port = 26412; messaging_port = 26413; home = "/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/db"; tlog_dir = "/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs"; log_sinks = [Console]; crash_log_sinks = [File(console:)]; tlx_dir = "/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs"; head_dir = "/mnt/hdd1/ovh/arakoon/hdd_roub_nsm_1/tlogs"; log_level = "info"; log_config = None; batched_transaction_config = None; lease_period = 10; master = "Elected"; is_laggy = false; is_learner = false; is_witness = false; targets = []; compressor = Snappy; fsync = true; is_test = false; reporting = 300; _fsync_tlog_dir = true; node_tls = None; collapse_slowdown = None; head_copy_throttling = 0. }]; log_cfgs = []; batched_transaction_cfgs = []; _master = Elected; _lease_period = 10; cluster_id = "hdd_roub_nsm_1"; plugins = ["nsm_host_plugin"]; nursery_cfg = None; tlog_max_entries = 5000; tlog_max_size = 5000000; max_value_size = 8388608; max_buffer_size = 33554432; client_buffer_capacity = 32; lcnum = 16384; ncnum = 8192; tls = None }
2016-06-30 14:20:59 651391 +0200 - perf-roub-03 - 24678/0 - arakoon - 6 - info - Batched_store.max_entries = 100
2016-06-30 14:20:59 651397 +0200 - perf-roub-03 - 24678/0 - arakoon - 7 - info - Batched_store.max_size = 100000
2016-06-30 14:20:59 651405 +0200 - perf-roub-03 - 24678/0 - arakoon - 8 - info - autofix:true
2016-06-30 14:20:59 651412 +0200 - perf-roub-03 - 24678/0 - arakoon - 9 - info - loading plugin nsm_host_plugin

Looking at syslog:

Jun 30 10:12:50 perf-roub-03 kernel: [865141.531602] do_general_protection: 10 callbacks suppressed
Jun 30 10:12:50 perf-roub-03 kernel: [865141.531606] traps: arakoon[37790] general protection ip:ca4469 sp:7ffe7cc4e9d0 error:0 in arakoon[400000+a18000]
Jun 30 10:12:50 perf-roub-03 kernel: [865141.669490] init: ovs-arakoon-hdd_roub_nsm_0 main process (37790) killed by SEGV signal
Jun 30 10:12:50 perf-roub-03 kernel: [865141.669506] init: ovs-arakoon-hdd_roub_nsm_0 main process ended, respawning

Looking at /var/crash, we see that the crash file kept there is older than the timestamp above:

root@perf-roub-03:~# ls -alsh /var/crash/_usr_bin_arakoon.1000.crash
16M -rw-r----- 1 ovs ovs 16M Jun 24 09:17 /var/crash/_usr_bin_arakoon.1000.crash

Catch up arakoon from a specific node

Can you add a feature to Arakoon to catch up a node from a specific source node?
This would be a nice feature when 2 of the 3 Arakoons are unable to collapse due to a Failure("success") error. Using the collapsed node as the catch-up source would help us; right now we need to try multiple catch-ups to hit the correct node.

thx!

Arakoon services in start/fail/start loop and dumping many debug logs to /opt/OpenvStorage

After updating pocops to the latest Fargo release, the arakoon services get stuck in a loop where they constantly restart. The OVS home folder gets populated with files called console:.debug.TIMESTAMP.xxxxxx that contain these kinds of messages:

1502284724: main debug: 7679026 => store
1502284724: main debug: Store.incr_i old_i:Some ("7679025") -> new_i:7679026
1502284724: main debug: 7679027 => store
1502284724: main debug: Store.incr_i old_i:Some ("7679026") -> new_i:7679027
1502284724: main debug: 7679028 => store
1502284724: main debug: Store.incr_i old_i:Some ("7679027") -> new_i:7679028
1502284724: main debug: after_block
1502284724: main debug: _fold_blocks
1502284724: main info: Completed replay of 1535.tlx, took 0.502017 seconds, 1 to go
1502284724: main info: Replaying tlog file: 1536.tlog [7679030,...] (2/2)
1502284724: tlog_map debug: fold_read extension=.tlog => index':Some {filename="/mnt/ssd1/arakoon/flash-10-nsm_15/tlogs/1536.tlog";mapping=}
1502284724: main debug: U.fold 7679029 Some ("7679092") ~index:Some {filename="/mnt/ssd1/arakoon/flash-10-nsm_15/tlogs/1536.tlog";mapping=}
1502284724: main debug: maybe_fast_forward 7679029 with Some {filename="/mnt/ssd1/arakoon/flash-10-nsm_15/tlogs/1536.tlog";mapping=}
1502284724: main debug: 7679029 => store
1502284724: main debug: Store.incr_i old_i:Some ("7679028") -> new_i:7679029
1502284724: main debug: 7679030 => skip
1502284724: main debug: 7679029 => store
1502284724: tlog_map debug: filename:/mnt/ssd1/arakoon/flash-10-nsm_15/tlogs/1536.tlog(Failure "update 7679029, store @ 7679029 don't fit")
1502284724: main fatal: going down(Failure "update 7679029, store @ 7679029 don't fit")
1502284724: main fatal: after pick

This eventually fills the disk, causing more trouble.
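
For reference, the replay invariant behind the fatal: every entry written to the store must be exactly one ahead of the store counter, so an entry with counter 7679029 arriving while the store is already at 7679029 is fatal. A hedged Python sketch (illustrative names):

    def store_entry(store_i, entry_i):
        # Apply tlog entry entry_i to a store whose counter is store_i.
        if entry_i != store_i + 1:
            raise RuntimeError("update %d, store @ %d don't fit"
                               % (entry_i, store_i))
        return entry_i  # the new store counter

store_entry(7679029, 7679029) raises exactly the message seen in the debug logs.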

System info
OS: Ubuntu 16.04.3 LTS

OVS components

ii  alba                          1.3.14             amd64        the ALternative BAckend
ii  arakoon                       1.9.17             amd64        Simple consistent distributed key/value store
ii  openvstorage                  2.8.2-1            amd64        openvStorage
ii  openvstorage-backend          1.8.1-1            amd64        openvStorage Backend plugin
ii  openvstorage-backend-core    1.8.1-1            amd64        openvStorage Backend plugin core
ii  openvstorage-backend-webapps  1.8.1-1            amd64        openvStorage Backend plugin Web Applications
ii  openvstorage-core             2.8.2-1            amd64        openvStorage core
ii  openvstorage-hc               1.8.1-1            amd64        openvStorage Backend plugin HyperConverged
ii  openvstorage-health-check     3.2.0-fargo.3-1    amd64        Open vStorage HealthCheck
ii  openvstorage-sdm              1.7.1-1            amd64        Open vStorage Backend ASD Manager
ii  openvstorage-webapps          2.8.2-1            amd64        openvStorage Web Applications

arakoon segfault

2016-08-18 11:54:35 201035 +0200 - ctl02 - 12813/0 - arakoon - 44 - info - Unlink of "/mnt/ssd1/arakoon/mybackend-abm/tlogs/touch-12813" failed with ENOENT
2016-08-18 11:54:35 201522 +0200 - ctl02 - 12813/0 - arakoon - 45 - info - Unlinking "/mnt/ssd1/arakoon/mybackend-abm/tlogs/touch-12813"
2016-08-18 11:54:35 201612 +0200 - ctl02 - 12813/0 - arakoon - 46 - info - File at /mnt/ssd1/arakoon/mybackend-abm/tlogs/head.db does not exist
2016-08-18 11:54:35 201627 +0200 - ctl02 - 12813/0 - arakoon - 47 - info - _init ~check_marker:true
2016-08-18 11:54:35 201798 +0200 - ctl02 - 12813/0 - arakoon - 48 - info - tlog_number:0
2016-08-18 11:54:35 211928 +0200 - ctl02 - 12813/0 - arakoon - 49 - info - _init: tlog_entries:2084 should_roll:false
2016-08-18 11:54:35 212448 +0200 - ctl02 - 12813/0 - arakoon - 50 - info - wrote "Some (\"opened:e7sVBYVD1lAddwjC\")" marker @i=2084 for "e7sVBYVD1lAddwjC"
2016-08-18 11:54:35 212472 +0200 - ctl02 - 12813/0 - arakoon - 51 - info - Taking job from compression queue...
2016-08-18 11:54:35 212489 +0200 - ctl02 - 12813/0 - arakoon - 52 - info - Creating local store at /mnt/ssd1/arakoon/mybackend-abm/db/e7sVBYVD1lAddwjC.db (lcnum=16384) (ncnum=8192)
2016-08-18 11:54:35 213187 +0200 - ctl02 - 12813/0 - arakoon - 53 - info - verify_n_catchup_store; too_far_i=2085 current_i=2084 si_o:None
2016-08-18 11:54:35 213211 +0200 - ctl02 - 12813/0 - arakoon - 54 - info - replaying log to store
[New Thread 0x7fffeffff700 (LWP 12842)]
2016-08-18 11:54:35 216097 +0200 - ctl02 - 12813/0 - arakoon - 55 - info - Replaying tlog file: 000.tlog (1/1)

Program received signal SIGSEGV, Segmentation fault.
0x0000000000c5e75b in caml_darken_all_roots_slice ()
Thread 1 (Thread 0x7ffff7fe2780 (LWP 12813)):
#0  0x0000000000c5e75b in caml_darken_all_roots_slice ()
#1  0x0000000000c6110f in mark_slice ()
#2  0x0000000000c61a8a in caml_major_collection_slice ()
#3  0x0000000000c5f63b in caml_garbage_collection ()
#4  0x0000000000c752ac in caml_call_gc ()
#5  0x000000000079ad1f in camlTlogreader2___fold_2599 () at src/tlog/tlogreader2.ml:144
#6  0x0000000000b9d326 in camlLwt__fun_32562 () at src/core/lwt.ml:653
#7  0x0000000000b9edf7 in camlLwt__run_waiters_rec_1372 () at src/core/lwt.ml:201
#8  0x0000000000b9edf7 in camlLwt__run_waiters_rec_1372 () at src/core/lwt.ml:201
#9  0x0000000000b9edf7 in camlLwt__run_waiters_rec_1372 () at src/core/lwt.ml:201
#10 0x0000000000b9edf7 in camlLwt__run_waiters_rec_1372 () at src/core/lwt.ml:201
#11 0x0000000000b9edf7 in camlLwt__run_waiters_rec_1372 () at src/core/lwt.ml:201
#12 0x0000000000b9edf7 in camlLwt__run_waiters_rec_1372 () at src/core/lwt.ml:201
#13 0x0000000000b9edf7 in camlLwt__run_waiters_rec_1372 () at src/core/lwt.ml:201
#14 0x0000000000b9edf7 in camlLwt__run_waiters_rec_1372 () at src/core/lwt.ml:201
#15 0x0000000000b9edf7 in camlLwt__run_waiters_rec_1372 () at src/core/lwt.ml:201
#16 0x0000000000b9edf7 in camlLwt__run_waiters_rec_1372 () at src/core/lwt.ml:201
#17 0x0000000000b9edc9 in camlLwt__run_waiters_rec_1372 () at src/core/lwt.ml:206
#18 0x0000000000b9f0dc in camlLwt__safe_run_waiters_1432 () at src/core/lwt.ml:299
#19 0x0000000000bc03ec in camlArray__iter_1247 () at array.ml:80
#20 0x0000000000c7548e in caml_start_program ()
#21 0x0000000000c70d09 in caml_callback ()
#22 0x00007ffff6e03e45 in ev_invoke_pending () from /usr/lib/x86_64-linux-gnu/libev.so.4
#23 0x0000000000c48b54 in lwt_libev_loop (val_loop=<optimized out>, val_block=<optimized out>) at src/unix/lwt_libev_stubs.c:98
#24 0x000000000082d679 in camlLwt_engine__fun_2546 () at src/unix/lwt_engine.ml:151
#25 0x000000000082fa38 in camlLwt_main__run_1324 () at src/unix/lwt_main.ml:41
#26 0x00000000006936b2 in camlArakoon__do_server_1869 () at src/main/arakoon.ml:529
#27 0x00000000006979b5 in camlArakoon__main_1530 () at src/main/arakoon.ml:574
#28 0x00000000006987f2 in camlArakoon__entry () at src/main/arakoon.ml:587
#29 0x000000000068b2a9 in caml_program ()
#30 0x0000000000c7548e in caml_start_program ()
#31 0x0000000000c5db35 in caml_main ()
#32 0x0000000000688b0c in main ()

arakoon version 1.9.9

This issue is reproducible: it occurs every time we try to start the node. So it looks like this is a different segfault from the one that was fixed earlier.

Experiencing this issue on environment: 172.19.10.31
