Comments (7)
arakoon version ?
from arakoon.
root@perf-roub-03:~# arakoon --version
version: 1.9.7
from arakoon.
same issue with flash_roub_nsm_2 on the same node:
tlogs + head.db:
root@perf-roub-03:/mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs# ls -alth | head
total 4.5G
-rw-r--r-- 1 ovs ovs 333M Jul 19 14:12 head.db
drwxr-xr-x 2 ovs ovs 20K Jul 19 14:11 .
-rw-r--r-- 1 ovs ovs 17M Jul 13 16:39 1139.tlx
-rw-r--r-- 1 ovs ovs 17M Jul 13 16:39 1142.tlx
-rw-r--r-- 1 ovs ovs 17M Jul 13 16:39 1141.tlx
-rw-r--r-- 1 ovs ovs 133K Jul 13 16:39 1140.tlog
-rw-r--r-- 1 ovs ovs 17M Jul 13 16:38 1138.tlx
-rw-r--r-- 1 ovs ovs 17M Jul 13 16:37 1137.tlx
-rw-r--r-- 1 ovs ovs 17M Jul 13 16:36 1134.tlx
root@perf-roub-03:/mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs# arakoon --dump-tlog 1142.tlog | tail -n 1
5300332:(Vc ([Sequence([Assert ;"/nsm_host_version/";4; Replace ;"/nsm/\177D\000\000osds/objects/\000\000\000\"/\235\209\153t\178\018\168\016r\138\220@c\t\015t\246\195\205\247\144\133n\018tf\196\253D(-\134";0; Replace ;"/nsm/\177D\000\000osds/objects/\000\000\000\"/\235\210\024\159x\131\174?$\025\203\198\217\229\170\156x\224h?\198\210v\2383K\141\197\027\220`L";0; Replace ;"/nsm/\177D\000\000osds/objects/\000\000\000\"/\235\210d\r\139H\231\028\213\183\175\176\220|\191\218\011B\129\137Q\251\248\170\231\139A\2269\254:[";0; Replace ;"/nsm/\177D\000\000osds/objects/\000\000\000\"/\235\210q\231\015\022f\011\017\175\1465Qof\244\182l\192\200'9\022 \b\018\031\np\224^\146";0; Replace
root@perf-roub-03:/mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs# arakoon --dump-store ../db/nhfE8ZP77taJwMBG.db
i: Some ("5299231")
master: Some(09FydVMSuGL4Fp3k,1468930544.044213)
routing : --
interval: {(_,_),(_,_)}
logfile (first fatal error):
K\158^\235\221\157\217\160\024\028\1356\232\031GZF_\132u#K\152\222";0; Replace ;"/nsm/\177D\000\000osds/objects/\000\000\000\028/\148X/J\203\154\195o\147<Er\239K.\179G(\186Pf\025P\174\176\218=\160\251wg0";0; Replace
;"/nsm/\177D\000\000osds/objects/\000\000\000\028/\148X\156\157)<\151\173\135\004\182\004rI\248$\169\172\208BZ\188\143<T\132\n\027\157\211NY";0]);],false)
2016-07-13 16:39:23 664736 +0200 - perf-roub-03 - 2947/0 - arakoon - 812400 - info - CATCHUP start: I'm @ 5299233 and 09FydVMSuGL4Fp3k is more recent
2016-07-13 16:39:23 664767 +0200 - perf-roub-03 - 2947/0 - arakoon - 812401 - info - catchup_tlog 5299233 from 09FydVMSuGL4Fp3k
2016-07-13 16:39:23 695677 +0200 - perf-roub-03 - 2947/0 - arakoon - 812402 - info - loop_entries
2016-07-13 16:39:24 187953 +0200 - perf-roub-03 - 2947/0 - arakoon - 812403 - info - should roll over: tlog_entries=489
2016-07-13 16:39:24 188180 +0200 - perf-roub-03 - 2947/0 - arakoon - 812404 - info - should roll: i= 5299355 old_outer = 1139 => jump to 1140
2016-07-13 16:39:24 188256 +0200 - perf-roub-03 - 2947/0 - arakoon - 812405 - info - closed 1139.tlog
2016-07-13 16:39:24 244026 +0200 - perf-roub-03 - 2947/0 - arakoon - 812406 - info - adding new task to compression queue /mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/1139.tlog,/mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/1139.tlx
2016-07-13 16:39:24 244272 +0200 - perf-roub-03 - 2947/0 - arakoon - 812407 - info - Compressing: /mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/1139.tlog into /mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/1139.tlx
2016-07-13 16:39:24 244338 +0200 - perf-roub-03 - 2947/0 - arakoon - 812408 - info - Compressing "/mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/1139.tlog" to "/mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/1139.tlx" via "/mnt/hdd1/ovh/arakoo
n/flash_roub_nsm_2/tlogs/1139.tlx.part"
2016-07-13 16:39:24 244352 +0200 - perf-roub-03 - 2947/0 - arakoon - 812409 - info - Unlinking "/mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/1139.tlx.part"
2016-07-13 16:39:24 399005 +0200 - perf-roub-03 - 2947/0 - arakoon - 812410 - info - remote_nodestream :: last_seen = Some ("5299356")
2016-07-13 16:39:24 399091 +0200 - perf-roub-03 - 2947/0 - arakoon - 812411 - info - retrieving start_i:5299357 (extension:.tlx, 17059486 bytes)
2016-07-13 16:39:24 399125 +0200 - perf-roub-03 - 2947/0 - arakoon - 812412 - info - closed 1140.tlog
2016-07-13 16:39:24 399163 +0200 - perf-roub-03 - 2947/0 - arakoon - 812413 - info - save_tlog_file: first_i:5299357 extension:.tlx in /mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/1141.tlx.tmp
2016-07-13 16:39:24 930392 +0200 - perf-roub-03 - 2947/0 - arakoon - 812414 - info - rename /mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/1141.tlx.tmp -> /mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/1141.tlx
2016-07-13 16:39:25 26561 +0200 - perf-roub-03 - 2947/0 - arakoon - 812415 - info - retrieving start_i:5299847 (extension:.tlx, 17048145 bytes)
2016-07-13 16:39:25 26592 +0200 - perf-roub-03 - 2947/0 - arakoon - 812416 - info - save_tlog_file: first_i:5299847 extension:.tlx in /mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/1142.tlx.tmp
2016-07-13 16:39:25 470746 +0200 - perf-roub-03 - 2947/0 - arakoon - 812417 - info - rename /mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/1142.tlx.tmp -> /mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/1142.tlx
2016-07-13 16:39:25 495634 +0200 - perf-roub-03 - 2947/0 - arakoon - 812418 - info - loop_entries
2016-07-13 16:39:25 495663 +0200 - perf-roub-03 - 2947/0 - arakoon - 812419 - info - remote_nodestream :: last_seen = Some ("5299356")
2016-07-13 16:39:25 495671 +0200 - perf-roub-03 - 2947/0 - arakoon - 812420 - info - loop_parts done
2016-07-13 16:39:25 495687 +0200 - perf-roub-03 - 2947/0 - arakoon - 812421 - info - catchup_tlog completed
2016-07-13 16:39:25 619213 +0200 - perf-roub-03 - 2947/0 - arakoon - 812422 - info - rename /mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/1139.tlx.part -> /mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/1139.tlx
2016-07-13 16:39:26 41013 +0200 - perf-roub-03 - 2947/0 - arakoon - 812423 - info - Unlinking "/mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/1139.tlog"
2016-07-13 16:39:26 70450 +0200 - perf-roub-03 - 2947/0 - arakoon - 812424 - info - end of compress : /mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/1139.tlog -> /mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/1139.tlx
2016-07-13 16:39:26 70477 +0200 - perf-roub-03 - 2947/0 - arakoon - 812425 - info - Finished compression task, lets loop
2016-07-13 16:39:26 70487 +0200 - perf-roub-03 - 2947/0 - arakoon - 812426 - info - Taking job from compression queue...
2016-07-13 16:39:34 790833 +0200 - perf-roub-03 - 2947/0 - arakoon - 812427 - fatal - Exception in fsm thread: Unix.Unix_error(Unix.ENOENT, "open", "/mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/1139.tlog"); backtrace:; Raised by primitive operation at file "src/unix/lwt_unix.ml", line 241, characters 19-36
2016-07-13 16:39:34 790914 +0200 - perf-roub-03 - 2947/0 - arakoon - 812428 - fatal - going down: Unix.Unix_error(Unix.ENOENT, "open", "/mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/1139.tlog"); backtrace:; Raised at file "map.ml", line 117, characters 16-25; Called from file "src/core/lwt.ml", line 161, characters 4-40
2016-07-13 16:39:34 790925 +0200 - perf-roub-03 - 2947/0 - arakoon - 812429 - fatal - after pick
2016-07-13 16:39:34 790991 +0200 - perf-roub-03 - 2947/0 - arakoon - 812430 - info - going to drop outgoing connection as well: Lwt.Canceled; backtrace:; Raised at file "map.ml", line 117, characters 16-25; Called from file "src/core/lwt.ml", line 161, characters 4-40
log file (last error):
2016-07-19 14:11:14 731688 +0200 - perf-roub-03 - 12925/0 - arakoon - 23 - info - Unlink of "/mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/touch-12925" failed with ENOENT
2016-07-19 14:11:14 732477 +0200 - perf-roub-03 - 12925/0 - arakoon - 24 - info - Unlinking "/mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/touch-12925"
2016-07-19 14:11:14 732584 +0200 - perf-roub-03 - 12925/0 - arakoon - 25 - info - Unlinking "/mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/touch-12925"
2016-07-19 14:11:14 732624 +0200 - perf-roub-03 - 12925/0 - arakoon - 26 - info - Unlink of "/mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/touch-12925" failed with ENOENT
2016-07-19 14:11:14 733314 +0200 - perf-roub-03 - 12925/0 - arakoon - 27 - info - Unlinking "/mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/touch-12925"
2016-07-19 14:11:14 733413 +0200 - perf-roub-03 - 12925/0 - arakoon - 28 - info - Unlinking "/mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/touch-12925"
2016-07-19 14:11:14 733452 +0200 - perf-roub-03 - 12925/0 - arakoon - 29 - info - Unlink of "/mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/touch-12925" failed with ENOENT
2016-07-19 14:11:14 734107 +0200 - perf-roub-03 - 12925/0 - arakoon - 30 - info - Unlinking "/mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/touch-12925"
2016-07-19 14:11:14 734224 +0200 - perf-roub-03 - 12925/0 - arakoon - 31 - info - copy_file /mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/head.db /mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/db/nhfE8ZP77taJwMBG.db (overwrite=false,throttling=0.000000)
2016-07-19 14:11:14 734271 +0200 - perf-roub-03 - 12925/0 - arakoon - 32 - info - Not copying /mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/head.db to /mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/db/nhfE8ZP77taJwMBG.db because target already exists
2016-07-19 14:11:19 915199 +0200 - perf-roub-03 - 12925/0 - arakoon - 33 - info - Taking job from compression queue...
2016-07-19 14:11:19 915229 +0200 - perf-roub-03 - 12925/0 - arakoon - 34 - info - Creating local store at /mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/db/nhfE8ZP77taJwMBG.db (lcnum=16384) (ncnum=8192)
2016-07-19 14:11:19 915445 +0200 - perf-roub-03 - 12925/0 - arakoon - 35 - info - verify_n_catchup_store; too_far_i=0 current_i=0 si_o:Some ("5299231")
2016-07-19 14:11:19 915457 +0200 - perf-roub-03 - 12925/0 - arakoon - 36 - fatal - too_far_i:0, si_o:Some ("5299231") should not happen: tlogs have been removed?
2016-07-19 14:11:19 915465 +0200 - perf-roub-03 - 12925/0 - arakoon - 37 - fatal - [rc=40] Store ahead of tlogs: s_i=5299231, tlog_i=0
2016-07-19 14:11:20 123579 +0200 - perf-roub-03 - 12970/0 - arakoon - 0 - info - --- NODE STARTED ---
2016-07-19 14:11:20 123626 +0200 - perf-roub-03 - 12970/0 - arakoon - 1 - info - git_revision: tags/1.9.7-0-g43f2999
2016-07-19 14:11:20 123642 +0200 - perf-roub-03 - 12970/0 - arakoon - 2 - info - compile_time: 24/06/2016 09:53:13 UTC
2016-07-19 14:11:20 123657 +0200 - perf-roub-03 - 12970/0 - arakoon - 3 - info - version: 1.9.7
2016-07-19 14:11:20 123677 +0200 - perf-roub-03 - 12970/0 - arakoon - 4 - info - NOFILE: 1024
from arakoon.
Saw this issue again on the OVH env.
My current guess is that it has been introduced by #88, as we've only started seeing this since version 1.9.7.
from arakoon.
2016-07-20 16:12:12 767818 +0200 - perf-roub-02 - 26974/0 - arakoon - 2587455 - info - CATCHUP start: I'm @ 8989143 and nhfE8ZP77taJwMBG is more recent
2016-07-20 16:12:12 767859 +0200 - perf-roub-02 - 26974/0 - arakoon - 2587456 - info - catchup_tlog 8989143 from nhfE8ZP77taJwMBG
2016-07-20 16:27:37 53781 +0200 - perf-roub-02 - 26974/0 - arakoon - 2587576 - info - remote_nodestream :: last_seen = Some ("8993085")
2016-07-20 16:27:37 53796 +0200 - perf-roub-02 - 26974/0 - arakoon - 2587577 - info - loop_parts done
2016-07-20 16:27:37 53831 +0200 - perf-roub-02 - 26974/0 - arakoon - 2587578 - info - catchup_tlog completed
2016-07-20 17:26:06 420398 +0200 - perf-roub-02 - 26974/0 - arakoon - 2588096 - info - CATCHUP phase 1 done (until_i = 0); now the store
2016-07-20 17:26:06 420449 +0200 - perf-roub-02 - 26974/0 - arakoon - 2588097 - info - replaying log to store
2016-07-20 17:26:06 420464 +0200 - perf-roub-02 - 26974/0 - arakoon - 2588098 - error - Store counter (8989143) is ahead of tlog counter (1). Aborting.
from arakoon.
I've been investigating this issue.
It looks like
2016-07-19 11:24:48 756513 +0200 - perf-roub-03 - 8782/0 - arakoon - 35 - info - verify_n_catchup_store; too_far_i=0 current_i=0 si_o:Some ("51186823")
2016-07-19 11:24:48 756525 +0200 - perf-roub-03 - 8782/0 - arakoon - 36 - fatal - too_far_i:0, si_o:Some ("51186823") should not happen: tlogs have been removed?
2016-07-19 11:24:48 756532 +0200 - perf-roub-03 - 8782/0 - arakoon - 37 - fatal - [rc=40] Store ahead of tlogs: s_i=51186823, tlog_i=0
Is due to the fact that the last tlog/tlx is a tlx.
root@perf-roub-03:/mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs# ls -alth | head
total 4.5G
-rw-r--r-- 1 ovs ovs 333M Jul 19 14:12 head.db
drwxr-xr-x 2 ovs ovs 20K Jul 19 14:11 .
-rw-r--r-- 1 ovs ovs 17M Jul 13 16:39 1139.tlx
-rw-r--r-- 1 ovs ovs 17M Jul 13 16:39 1142.tlx
-rw-r--r-- 1 ovs ovs 17M Jul 13 16:39 1141.tlx
-rw-r--r-- 1 ovs ovs 133K Jul 13 16:39 1140.tlog
-rw-r--r-- 1 ovs ovs 17M Jul 13 16:38 1138.tlx
-rw-r--r-- 1 ovs ovs 17M Jul 13 16:37 1137.tlx
-rw-r--r-- 1 ovs ovs 17M Jul 13 16:36 1134.tlx
This means it should be easy to work around in case we see this again: just uncompress the last tlx to a tlog, and start the node again.
In theory we could also end up in such a situation when a node crashes unexpectedly due to a power outage (so not just due to a bug such as what happened now), so we may want to be able to handle this.
Still investigating how the node got into this, now zooming in on
2016-07-13 16:39:34 790833 +0200 - perf-roub-03 - 2947/0 - arakoon - 812427 - fatal - Exception in fsm thread: Unix.Unix_error(Unix.ENOENT, "open", "/mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/1139.tlog"); backtrace:; Raised by primitive operation at file "src/unix/lwt_unix.ml", line 241, characters 19-36
2016-07-13 16:39:34 790914 +0200 - perf-roub-03 - 2947/0 - arakoon - 812428 - fatal - going down: Unix.Unix_error(Unix.ENOENT, "open", "/mnt/hdd1/ovh/arakoon/flash_roub_nsm_2/tlogs/1139.tlog"); backtrace:; Raised at file "map.ml", line 117, characters 16-25; Called from file "src/core/lwt.ml", line 161, characters 4-40
2016-07-13 16:39:34 790925 +0200 - perf-roub-03 - 2947/0 - arakoon - 812429 - fatal - after pick
from arakoon.
Based on the logs and some code inspection it seems to me as if the initial crash of arakoon was due to a race between reiniting the tlog_map and tlog compression.
A fix should be coming soon.
from arakoon.
Related Issues (20)
- arakoon crashed with exit status 40/n/a HOT 3
- copy-db-to-head not working as expected HOT 3
- Arakoon cluster status HOT 1
- Change copy-head-to-db CLI signature to use -config HOT 3
- Output when using collapse-local or copy-db-to-head is stderr instead of stdout HOT 3
- Change severity level of ECONNRESET error HOT 1
- High memory usage on ABM HOT 4
- Reduce Arakoon logging by removing coming and going of clients
- copy-db-to-head results in removal of 1 tlog too many and catchup problems HOT 1
- master can generate tlog entries that slaves refuse to read from the network.
- Extending a cluster to an extra node never seems to catch up HOT 3
- Arakoon services in start/fail/start loop and dumping many debug logs to /opt/OpenvStorage HOT 4
- arakoon checksum error HOT 7
- Arakoon dies when --copy-db-to-head while --optimize-db was already running HOT 1
- cluster took a long time to re elect a master HOT 3
- Allow keys to be excluded in prefix_entries and range_entries HOT 3
- diverging history in 5 node cluster HOT 2
- Arakoon_exc.Exception(4, "Not_found") HOT 1
- Add delete_prefix to sequence HOT 1
- Support terse version HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from arakoon.