
super-speedy-syslog-searcher's Introduction

Hello

I am a software engineer. Most of my experience is within testing and tools.



My Open-Source Work Samples

The following lists are publicly available examples of my work:

github Projects


Contribution Stats


Continuous Integration and Code Coverage

I've used several Continuous Integration (CI) services for the sake of learning about them. Here are example runs of each.

The archived links are provided because most CI Service providers expire detailed records.

Azure Pipelines

CircleCI

codecov.io

Github Actions

Travis CI

StackExchange Questions and Answers

Some of my favorite StackExchange posts:

Bug Reports and Feature Requests

Some public bug reports and feature requests I have made:






Pull Requests

My other github commits:

github Forum Posts

My favorite github forum posts:

Other Links

Contributing to Open Source

Among many other software projects and organizations to which I have voluntarily donated!

Recommended podcasts

Software-oriented podcasts that I listen to irregularly.


profile for @JamesThomasMoon on Stack Exchange, a network of free, community-driven Q&A sites

super-speedy-syslog-searcher's People

Contributors

dependabot[bot], jtmoon79


Forkers

baoyachi

super-speedy-syslog-searcher's Issues

support processing bzip compressed files (`.bz`, `.bz2`)

Support processing .bz and .bz2 files; both compressed by bzip2.

Such compression is rarely used for log files; it is most common on Mac OS and other BSDs.

The good news: IIUC, bzip only compresses one file to one compressed file.
From Wikipedia

It is not an archiver like tar or ZIP; the program itself has no facilities for multiple files, encryption or archive-splitting

Hopefully this "one file to one file" compression means bzip support is simpler than gz and xz support, in that there is no need for workarounds to handle "multiple streams" (see Issue #11, Issue #8).


Rust crates available:

  • bzip2_rs
    bzip2_rs is a pure Rust bzip2 decoder.
  • bzip2
    This library contains bindings to libbz2 to support bzip compression and decompression for Rust.
  • bzip2_sys
    Does not look ready for consumption.
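Since bzip2 streams are self-identifying, file-type detection could start with a simple magic-number check. A minimal std-only sketch (a hypothetical helper, not s4's actual detection code); bzip2 streams begin with the bytes "BZh":

```rust
/// Recognize a bzip2 stream by its magic bytes "BZh" (0x42 0x5A 0x68).
/// Hypothetical helper; not s4's actual file-type detection.
fn is_bzip2(header: &[u8]) -> bool {
    header.starts_with(b"BZh")
}
```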

allow HMS only in -a -b values

Currently, an end-user can pass only a date, e.g. -a "2020-01-02". The Hours, Minutes, Seconds (HMS) portion is then set to 00:00:00 for that DateTime instance, e.g. the -a "2020-01-02" arguments become Local offset DateTime value "2020-01-02T00:00:00".

Allow passing only the HMS portion. If passed, the Date portion is set to chrono::Local::today(). For example, passing CLI arguments -a 01:02:03 would become DateTime instance 2022-09-21T01:02:03 (today) with Local timezone offset.

Similar area as #35
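Validating the HMS-only value can be sketched with std alone. This is a hypothetical helper (the real implementation would combine the result with chrono::Local::today() and the local timezone offset):

```rust
/// Validate an HMS-only filter value like "01:02:03".
/// Hypothetical std-only helper; the real implementation would combine the
/// result with chrono::Local::today() and the local timezone offset.
fn parse_hms(value: &str) -> Option<(u32, u32, u32)> {
    let parts: Vec<&str> = value.split(':').collect();
    if parts.len() != 3 {
        return None;
    }
    let h: u32 = parts[0].parse().ok()?;
    let m: u32 = parts[1].parse().ok()?;
    let s: u32 = parts[2].parse().ok()?;
    // reject out-of-range fields
    if h > 23 || m > 59 || s > 59 {
        return None;
    }
    Some((h, m, s))
}
```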

lines with '\r' but no '\n' will overwrite prepended options

Syslog lines with '\r' but no '\n' may overwrite prepended options like -u or -p.

For example, from /var/log/apt/term.log

$ s4 -p /var/log/apt/term.log
...
/var/log/apt/term.log:
/var/log/apt/term.log:Log started: 2022-09-01  06:21:45
(Reading database ... 111770 files and directories currently installed.)
/var/log/apt/term.log:Removing linux-modules-5.15.0-43-generic (5.15.0-43.46) ...

The line (Reading database ... 111770 files and directories currently installed.) is several visually overlapping Reading database ... statements.
As seen in hexyl

$ grep -Fe '(Reading database ... 111770 files' /var/log/apt/term.log | hexyl
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ 28 52 65 61 64 69 6e 67 ┊ 20 64 61 74 61 62 61 73 │(Reading┊ databas│
│00000010│ 65 20 2e 2e 2e 20 0d 28 ┊ 52 65 61 64 69 6e 67 20 │e ... _(┊Reading │
│00000020│ 64 61 74 61 62 61 73 65 ┊ 20 2e 2e 2e 20 35 25 0d │database┊ ... 5%_│
│00000030│ 28 52 65 61 64 69 6e 67 ┊ 20 64 61 74 61 62 61 73 │(Reading┊ databas│
│00000040│ 65 20 2e 2e 2e 20 31 30 ┊ 25 0d 28 52 65 61 64 69 │e ... 10┊%_(Readi│
│00000050│ 6e 67 20 64 61 74 61 62 ┊ 61 73 65 20 2e 2e 2e 20 │ng datab┊ase ... │
│00000060│ 31 35 25 0d 28 52 65 61 ┊ 64 69 6e 67 20 64 61 74 │15%_(Rea┊ding dat│
│00000070│ 61 62 61 73 65 20 2e 2e ┊ 2e 20 32 30 25 0d 28 52 │abase ..┊. 20%_(R│
│00000080│ 65 61 64 69 6e 67 20 64 ┊ 61 74 61 62 61 73 65 20 │eading d┊atabase │
│00000090│ 2e 2e 2e 20 32 35 25 0d ┊ 28 52 65 61 64 69 6e 67 │... 25%_┊(Reading│
│000000a0│ 20 64 61 74 61 62 61 73 ┊ 65 20 2e 2e 2e 20 33 30 │ databas┊e ... 30│
│000000b0│ 25 0d 28 52 65 61 64 69 ┊ 6e 67 20 64 61 74 61 62 │%_(Readi┊ng datab│
│000000c0│ 61 73 65 20 2e 2e 2e 20 ┊ 33 35 25 0d 28 52 65 61 │ase ... ┊35%_(Rea│
│000000d0│ 64 69 6e 67 20 64 61 74 ┊ 61 62 61 73 65 20 2e 2e │ding dat┊abase ..│
│000000e0│ 2e 20 34 30 25 0d 28 52 ┊ 65 61 64 69 6e 67 20 64 │. 40%_(R┊eading d│
│000000f0│ 61 74 61 62 61 73 65 20 ┊ 2e 2e 2e 20 34 35 25 0d │atabase ┊... 45%_│
│00000100│ 28 52 65 61 64 69 6e 67 ┊ 20 64 61 74 61 62 61 73 │(Reading┊ databas│
│00000110│ 65 20 2e 2e 2e 20 35 30 ┊ 25 0d 28 52 65 61 64 69 │e ... 50┊%_(Readi│
│00000120│ 6e 67 20 64 61 74 61 62 ┊ 61 73 65 20 2e 2e 2e 20 │ng datab┊ase ... │
│00000130│ 35 35 25 0d 28 52 65 61 ┊ 64 69 6e 67 20 64 61 74 │55%_(Rea┊ding dat│
│00000140│ 61 62 61 73 65 20 2e 2e ┊ 2e 20 36 30 25 0d 28 52 │abase ..┊. 60%_(R│
│00000150│ 65 61 64 69 6e 67 20 64 ┊ 61 74 61 62 61 73 65 20 │eading d┊atabase │
│00000160│ 2e 2e 2e 20 36 35 25 0d ┊ 28 52 65 61 64 69 6e 67 │... 65%_┊(Reading│
│00000170│ 20 64 61 74 61 62 61 73 ┊ 65 20 2e 2e 2e 20 37 30 │ databas┊e ... 70│
│00000180│ 25 0d 28 52 65 61 64 69 ┊ 6e 67 20 64 61 74 61 62 │%_(Readi┊ng datab│
│00000190│ 61 73 65 20 2e 2e 2e 20 ┊ 37 35 25 0d 28 52 65 61 │ase ... ┊75%_(Rea│
│000001a0│ 64 69 6e 67 20 64 61 74 ┊ 61 62 61 73 65 20 2e 2e │ding dat┊abase ..│
│000001b0│ 2e 20 38 30 25 0d 28 52 ┊ 65 61 64 69 6e 67 20 64 │. 80%_(R┊eading d│
│000001c0│ 61 74 61 62 61 73 65 20 ┊ 2e 2e 2e 20 38 35 25 0d │atabase ┊... 85%_│
│000001d0│ 28 52 65 61 64 69 6e 67 ┊ 20 64 61 74 61 62 61 73 │(Reading┊ databas│
│000001e0│ 65 20 2e 2e 2e 20 39 30 ┊ 25 0d 28 52 65 61 64 69 │e ... 90┊%_(Readi│
│000001f0│ 6e 67 20 64 61 74 61 62 ┊ 61 73 65 20 2e 2e 2e 20 │ng datab┊ase ... │
│00000200│ 39 35 25 0d 28 52 65 61 ┊ 64 69 6e 67 20 64 61 74 │95%_(Rea┊ding dat│
│00000210│ 61 62 61 73 65 20 2e 2e ┊ 2e 20 31 30 30 25 0d 28 │abase ..┊. 100%_(│
│00000220│ 52 65 61 64 69 6e 67 20 ┊ 64 61 74 61 62 61 73 65 │Reading ┊database│
│00000230│ 20 2e 2e 2e 20 31 31 31 ┊ 37 37 30 20 66 69 6c 65 │ ... 111┊770 file│
│00000240│ 73 20 61 6e 64 20 64 69 ┊ 72 65 63 74 6f 72 69 65 │s and di┊rectorie│
│00000250│ 73 20 63 75 72 72 65 6e ┊ 74 6c 79 20 69 6e 73 74 │s curren┊tly inst│
│00000260│ 61 6c 6c 65 64 2e 29 0d ┊ 0a                      │alled.)_┊_       │
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘

allow any leading timezone

Currently

CLI options -u and -l will prepend datetimes in the UTC or Local timezone. However, a user may want to review syslogs in a different timezone.

Feature Request

Allow a CLI option for any timezone to be prepended, e.g.

   -pz, --prepend-timezone    Prepend DateTime in the passed Timezone for every line.
                              Accepts all strftime timezone formats.

extra TODOs

Update other CLI options.

   -u , --prepend-utc   Prepend DateTime in the UTC Timezone for every line. This is the same as passing "-pz UTC".
   -l , --prepend-local Prepend DateTime in the Local Timezone for every line. This is the same as passing "-pz Local".

Add -pz to the same CLI option grouping.

Add example to README.md

            Print only the syslog lines that occurred two days ago during the noon hour in
            Bengaluru, India (timezone offset +05:30), displayed in the Bengaluru timezone.

            ```lang-text
            s4 /var/log -pz "+05:30" -a "$(date -d "2 days ago 12:00" '+%Y-%m-%dT%H:%M:%S') +05:30" -b "$(date -d "2 days ago 13:00" '+%Y-%m-%dT%H:%M:%S') +05:30"
            ```

printing prepended data may cause errant �

tl;dr: printing with prepended data may insert an errant Unicode replacement character �.

Reproduction

Given file ./logs/other/tests/dtf9c-12-baddate.log, with sample of contents

$ head -n2 ./logs/other/tests/dtf9c-12-baddate.log
Jan 1 01:00:00 0 😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃
Feb 2 02:00:00 1 😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃😀

run s4 as-is (it will print with color, no prepended data).

$ ./target/release/s4 ./logs/other/tests/dtf9c-12-baddate.log | head -n1
Jan 1 01:00:00 0 😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃

run s4 with a prepended filename

$ ./target/release/s4 -p ./logs/other/tests/dtf9c-12-baddate.log | head -n1
./logs/other/tests/dtf9c-12-baddate.log:Jan 1 01:00:00 0 😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻�😽😾😿🙀🙁🙂🙃�

ERROR: two errant � characters are in the output.

This errant output occurs for any mode of prepended data, but at different rates:

  • with -l, the errant � occurs 4 times
  • with -p, the errant � occurs 20 times
  • with -l -p, the errant � occurs 4 times

Other observations


dpo!("print_color_line_highlight_dt! slice_a.len() {} slice_b_dt.len() {} slice_c.len() {}", slice_a.len(), slice_b_dt.len(), slice_c.len());

Running the debug version of the build, with and without -p, showed the selected slices were the same, e.g.

without -p, debug output

     print_color_line_highlight_dt! slice_a.len() 0 slice_b_dt.len() 15 slice_c.len() 277
Dec 12 12:00:00 11 😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃😀😁😂😃😄😅😆😇😈😉😊

with -p, debug output

     print_color_line_highlight_dt! slice_a.len() 0 slice_b_dt.len() 15 slice_c.len() 277
./logs/other/tests/dtf9c-12-baddate.log:Dec 12 12:00:00 11 😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃😀😁�😃😄😅😆😇😈😉😊�

  • Comparing the first line of output when using -p, colorbindiff.pl shows no errant difference
$ ./target/release/s4 -p ./logs/other/tests/dtf9c-12-baddate.log | head -n1 > out-p1.txt

$ ./target/release/s4 ./logs/other/tests/dtf9c-12-baddate.log | head -n1 > out1.txt

$ perl ./colorbindiff.pl out1.txt out-p1.txt
OFFSET   00  01  02  03  04  05  06  07  08  09  0A  0B  0C  0D  0E  0F                   OFFSET   00  01  02  03  04  05  06  07  08  09  0A  0B  0C  0D  0E  0F
0x0000*  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  .. ................  0x0000* +2e +2f +6c +6f +67 +73 +2f +6f +74 +68 +65 +72 +2f +74 +65 +73 ./logs/other/tes
0x0000*  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  .. ................  0x0010* +74 +73 +2f +64 +74 +66 +39 +63 +2d +31 +32 +2d +62 +61 +64 +64 ts/dtf9c-12-badd
0x0000*  ..  ..  ..  ..  ..  ..  ..  ..                                 ........          0x0020* +61 +74 +65 +2e +6c +6f +67 +3a                                 ate.log:
0x0000   1b  5b  30  6d  1b  5b  33  36  6d  1b  5b  30  6d  1b  5b  34 .[0m.[36m.[0m.[4  0x0028   1b  5b  30  6d  1b  5b  33  36  6d  1b  5b  30  6d  1b  5b  34 .[0m.[36m.[0m.[4
0x0010   6d  1b  5b  33  36  6d  4a  61  6e  20  31  20  30  31  3a  30 m.[36mJan 1 01:0  0x0038   6d  1b  5b  33  36  6d  4a  61  6e  20  31  20  30  31  3a  30 m.[36mJan 1 01:0
0x0020   30  3a  30  30  1b  5b  30  6d  1b  5b  33  36  6d  20  30  20 0:00.[0m.[36m 0   0x0048   30  3a  30  30  1b  5b  30  6d  1b  5b  33  36  6d  20  30  20 0:00.[0m.[36m 0
0x0030   f0  9f  98  80  f0  9f  98  81  f0  9f  98  82  f0  9f  98  83 ................  0x0058   f0  9f  98  80  f0  9f  98  81  f0  9f  98  82  f0  9f  98  83 ................
0x0040   f0  9f  98  84  f0  9f  98  85  f0  9f  98  86  f0  9f  98  87 ................  0x0068   f0  9f  98  84  f0  9f  98  85  f0  9f  98  86  f0  9f  98  87 ................
0x0050   f0  9f  98  88  f0  9f  98  89  f0  9f  98  8a  f0  9f  98  8b ................  0x0078   f0  9f  98  88  f0  9f  98  89  f0  9f  98  8a  f0  9f  98  8b ................
0x0060   f0  9f  98  8c  f0  9f  98  8d  f0  9f  98  8e  f0  9f  98  8f ................  0x0088   f0  9f  98  8c  f0  9f  98  8d  f0  9f  98  8e  f0  9f  98  8f ................
0x0070   f0  9f  98  90  f0  9f  98  91  f0  9f  98  92  f0  9f  98  93 ................  0x0098   f0  9f  98  90  f0  9f  98  91  f0  9f  98  92  f0  9f  98  93 ................
0x0080   f0  9f  98  94  f0  9f  98  95  f0  9f  98  96  f0  9f  98  97 ................  0x00A8   f0  9f  98  94  f0  9f  98  95  f0  9f  98  96  f0  9f  98  97 ................
0x0090   f0  9f  98  98  f0  9f  98  99  f0  9f  98  9a  f0  9f  98  9b ................  0x00B8   f0  9f  98  98  f0  9f  98  99  f0  9f  98  9a  f0  9f  98  9b ................
0x00A0   f0  9f  98  9c  f0  9f  98  9d  f0  9f  98  9e  f0  9f  98  9f ................  0x00C8   f0  9f  98  9c  f0  9f  98  9d  f0  9f  98  9e  f0  9f  98  9f ................
0x00B0   f0  9f  98  a0  f0  9f  98  a1  f0  9f  98  a2  f0  9f  98  a3 ................  0x00D8   f0  9f  98  a0  f0  9f  98  a1  f0  9f  98  a2  f0  9f  98  a3 ................
0x00C0   f0  9f  98  a4  f0  9f  98  a5  f0  9f  98  a6  f0  9f  98  a7 ................  0x00E8   f0  9f  98  a4  f0  9f  98  a5  f0  9f  98  a6  f0  9f  98  a7 ................
0x00D0   f0  9f  98  a8  f0  9f  98  a9  f0  9f  98  aa  f0  9f  98  ab ................  0x00F8   f0  9f  98  a8  f0  9f  98  a9  f0  9f  98  aa  f0  9f  98  ab ................
0x00E0   f0  9f  98  ac  f0  9f  98  ad  f0  9f  98  ae  f0  9f  98  af ................  0x0108   f0  9f  98  ac  f0  9f  98  ad  f0  9f  98  ae  f0  9f  98  af ................
0x00F0   f0  9f  98  b0  f0  9f  98  b1  f0  9f  98  b2  f0  9f  98  b3 ................  0x0118   f0  9f  98  b0  f0  9f  98  b1  f0  9f  98  b2  f0  9f  98  b3 ................
0x0100   f0  9f  98  b4  f0  9f  98  b5  f0  9f  98  b6  f0  9f  98  b7 ................  0x0128   f0  9f  98  b4  f0  9f  98  b5  f0  9f  98  b6  f0  9f  98  b7 ................
0x0110   f0  9f  98  b8  f0  9f  98  b9  f0  9f  98  ba  f0  9f  98  bb ................  0x0138   f0  9f  98  b8  f0  9f  98  b9  f0  9f  98  ba  f0  9f  98  bb ................
0x0120   f0  9f  98  bc  f0  9f  98  bd  f0  9f  98  be  f0  9f  98  bf ................  0x0148   f0  9f  98  bc  f0  9f  98  bd  f0  9f  98  be  f0  9f  98  bf ................
0x0130   f0  9f  99  80  f0  9f  99  81  f0  9f  99  82  f0  9f  99  83 ................  0x0158   f0  9f  99  80  f0  9f  99  81  f0  9f  99  82  f0  9f  99  83 ................
0x0140   0a                                                             .                 0x0168   0a

The difference is as-expected: the output with -p has extra ASCII bytes to print the prepended filename. However, there are no errant control codes or partial unicode characters.


Using Windows Terminal 1.15, Ubuntu 22.

printed errors while piping data will errantly colorize error messages

Errors printed are the same color as the previously printed sysline.

Reproduction

Given some syslog file with multiple syslines, run

$ s4 /var/log/syslog | head -n1

Prints

Aug  7 00:00:00 ubuntu22-s4b systemd[1]: logrotate.service: Deactivated successfully.
ERROR: failed to print Broken pipe (os error 32)

The first line is colorized, which is expected. However, the line ERROR: failed to print Broken pipe (os error 32) is also colorized. It should be printed with the default text settings.

This was run in bash shell within Windows Terminal.

More...

The reason: the escape code that returns to default text settings occurs after the newline. That escape code is written to the pipe | and then head drops it, so the terminal never sees the default text escape code. Thus, subsequent lines retain the same colorization settings as the prior sysline.

This relates to a larger problem: user-facing errors are currently handled generically via eprintln!. However, user-facing errors should be aware of TermColor settings and handle their own colorization. This Issue may require significant refactoring, depending upon the completeness of the solution.
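One possible shape of the fix (an assumption, not the project's plan): make the error printer emit an ANSI reset sequence before the message, so a reset code swallowed by the pipe cannot leave the error colorized.

```rust
/// ANSI escape sequence that resets all text attributes.
const ANSI_RESET: &str = "\x1b[0m";

/// Build a user-facing error message that first resets terminal colors.
/// Hypothetical helper; the real fix would more likely go through termcolor.
fn plain_error(msg: &str) -> String {
    format!("{}ERROR: {}", ANSI_RESET, msg)
}
```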

partial hour timezone offset not correctly correlated

Passing a "partial hour" timezone offset, e.g. +05:30, is not interpreted to UTC correctly.

$ s4 /var/log -u -a "$(date -d "2 days ago 12:00" '+%Y-%m-%dT%H:%M:%S') +05:30" -b "$(date -d "2 days ago 12:05" '+%Y-%m-%dT%H:%M:%S') +05:30"
20220915T063000.000000 +0000:Sep 14 22:30:00 ubuntu22-s4b sshd[1426912]: Received disconnect from 187.35.147.87 port 56531:11: Bye Bye [preauth]

Notice the first -a datetime should be a :30 minute mark. Yet it appears the time Sep 14 22:30:00 is associated with -u datetime 20220915T063000.000000 +0000, which has the same :30 minute offset. These times should be 30 minutes different.
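A worked check of the expected arithmetic: converting a wall-clock time at offset +05:30 to UTC means subtracting 330 minutes, so 12:00 +05:30 is 06:30 UTC. A small sketch of the conversion (a hypothetical helper, only for illustrating the expected math):

```rust
/// Convert a wall-clock (hour, minute) at the given UTC offset (in minutes)
/// to minutes-past-midnight UTC, wrapping across midnight.
fn to_utc_minutes(hour: u32, minute: u32, offset_minutes: i32) -> i32 {
    ((hour * 60 + minute) as i32 - offset_minutes).rem_euclid(24 * 60)
}
```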

lines ending in only '\r' will overwrite prepended data

Lines that end in \r and not \n or \r\n will overwrite prepended data (-l, -p, -u).

Reproduction

Given a file with lines ending in \r, prepended output will be overwritten.

Using file ./logs/other/tests/dtf5-3-LF-CR.log, run

$ ./target/release/s4 -p ./logs/other/tests/dtf5-3-LF-CR.log
./logs/other/tests/dtf5-3-LF-CR.log:2000-01-01 00:00:00 [dtf5-6a] LF
./logs/other/tests/dtf5-3-LF-CR.log:second line, first sysline LF
2000-01-01 00:00:02 [dtf5-6a] LFlog:2000-01-01 00:00:01 [dtf5-6a] CR
./logs/other/tests/dtf5-3-LF-CR.log:sixth line, third sysline LF

Hoping for

$ ./target/release/s4 -p ./logs/other/tests/dtf5-3-LF-CR.log
./logs/other/tests/dtf5-3-LF-CR.log:2000-01-01 00:00:00 [dtf5-6a] LF
./logs/other/tests/dtf5-3-LF-CR.log:second line, first sysline LF
./logs/other/tests/dtf5-3-LF-CR.log:2000-01-01 00:00:01 [dtf5-6a] CR
./logs/other/tests/dtf5-3-LF-CR.log:fourth line, second sysline CR
./logs/other/tests/dtf5-3-LF-CR.log:2000-01-01 00:00:02 [dtf5-6a] LF
./logs/other/tests/dtf5-3-LF-CR.log:sixth line, third sysline LF

This is a contrived example. However, log files that record updating output will have many lines ending in \r, e.g. a percent indicator that is repeatedly overwritten. The s4 output will look strange.
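One hedged workaround (an assumption, not the project's chosen fix): normalize a lone '\r' terminator to '\n' before printing, so the carriage return cannot move the cursor back over the prepended data.

```rust
/// Replace lone '\r' line endings with '\n', treating "\r\n" as a single '\n'.
/// Hypothetical pre-processing step before prepending -l/-p/-u data.
fn normalize_cr(text: &str) -> String {
    text.replace("\r\n", "\n").replace('\r', "\n")
}
```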

preprinted datetime dropping fractional information

Problem

Prepended datetimes (-l, -u) do not set the fractional seconds (always zero).

For example, from an Ubuntu host

$ ./target/release/s4 -l -n -w -a 20220813T120000 -b 20220813T200000 /var/log/
...
ubuntu-advantage-timer.log:20220813T180826.000000 +0000:2022-08-13 18:08:26,175 - timer.py:(46) [DEBUG]: Executed job: update_messaging
...

In the terminal, the datetime is correctly underlined, including the fractional part, 2022-08-13 18:08:26,175 (notice the 175).
But the prepended datetime has fractional value 0: 20220813T180826.000000 +0000.

Solution

Set the fractional value in the preprinted datetime.
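The parsed sysline datetime already carries the fractional part (the 175 in the example above), so the fix is plumbing it into the prepended format. A sketch of the digit-padding involved (a hypothetical helper, not the project's code), e.g. converting a ",175" fragment to microseconds:

```rust
/// Convert a fractional-seconds fragment like ",175" to microseconds.
/// Hypothetical helper; pads or truncates the digits to 6 places.
fn frac_to_micros(frac: &str) -> u32 {
    let mut digits: String = frac.chars().filter(|c| c.is_ascii_digit()).collect();
    while digits.len() < 6 {
        digits.push('0');
    }
    digits[..6].parse().unwrap_or(0)
}
```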

parse CLI -a -b dates as written expressions, e.g. "two days ago"

Currently, dates must be given exactly and in full, e.g. -a "2000-01-01T03:04:05" or -a "2000-01-01". But end-users often think in terms of relative past dates, e.g. "12 hours ago". Converting a relative verbal expression like "two days ago" into an exact numeric date takes some mental effort. And often, users are Test Engineers elbows-deep in investigating a difficult bug, where sparing those brain cycles is a burden.

Add parsing for written expressions of relative dates, e.g. "two days ago at noon", "10 hours ago", "1 year ago", etc., that can be passed to CLI options -a and -b.

This need not handle every written form and abbreviation with super-duper AI ("2 d noonish", "two days in the past!", "seek the past two days!", etc.). It only needs a few simple written patterns, similar to GNU date: "2 days ago", "2 days ago at 12:00", "2 hours ago", "12:00" (today at 12:00), etc.
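Matching the simplest of those phrases can be sketched with std pattern matching. This is a hypothetical helper (a real implementation would cover more forms, or use an existing crate such as chrono-english):

```rust
/// Match a few simple relative-date phrases and return the offset in seconds
/// before "now". Hypothetical sketch; GNU date accepts far more forms.
fn parse_relative(expr: &str) -> Option<i64> {
    let words: Vec<&str> = expr.split_whitespace().collect();
    match words.as_slice() {
        [n, "day", "ago"] | [n, "days", "ago"] => Some(n.parse::<i64>().ok()? * 86_400),
        [n, "hour", "ago"] | [n, "hours", "ago"] => Some(n.parse::<i64>().ok()? * 3_600),
        _ => None,
    }
}
```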

support journal files

Problem

systemd .journal files are common and will become more common. Not being able to process a .journal file loses very important information (which degrades the utility of this project).

Solution

Support processing .journal files. They are remarkably different from ad-hoc syslog files, so the strategies within super-speedy-syslog-searcher will not apply. However, they are designed well and should be fast to process.

A "journal processor class" would be adjacent to a SyslogProcessor. In other words, all functionality of a SyslogProcessor (and its contained SyslineReader, LineReader, BlockReader) would be created anew in some JournalProcessor "class".

Here is a technical description of .journal file format (archived).

handle text encodings besides ASCII and UTF-8

Problem

Only ASCII encoded and UTF-8 encoded files can be processed. Log files with different encodings are processed but no text is found by s4.

For example, on a Windows 11 host, among .log files under C:\Windows, 28 of 128 files were not printed. A spot check of a few of those non-printed files showed they were UTF-16 encoded.

PS> Get-ChildItem -Filter '*.log' -File -Path "C:\Windows" -Recurse -ErrorAction SilentlyContinue `
   | Select-Object -ExpandProperty FullName `
   | s4.exe - --summary

Solution

Handle other text encodings.

Handling UTF-16 and UTF-32 would be satisfactory.
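Decoding UTF-16LE is sketchable with std alone; detecting the byte-order mark is the usual first step. A hypothetical helper (a real implementation would stream blocks and also handle UTF-16BE and UTF-32):

```rust
/// Decode a UTF-16LE buffer that starts with a byte-order mark.
/// Std-only sketch; silently ignores a trailing odd byte via chunks_exact.
fn decode_utf16le(bytes: &[u8]) -> Option<String> {
    let body = bytes.strip_prefix(&[0xFF, 0xFE])?; // UTF-16LE BOM
    let units: Vec<u16> = body
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).ok()
}
```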

support parsing more written languages

Problem

Currently, only English language logging phrases are matched for regular expressions, e.g

    // from file `logs/Windows10Pro/debug/mrt.log`
    // example with offset:
    //
    //               1         2         3         4          5         6         7         8         9
    //     01234567890123456789012345678901234567890012345678901234567890123456789012345678901234567890
    //     ---------------------------------------------------------------------------------------
    //     Microsoft Windows Malicious Software Removal Tool v5.83, (build 5.83.13532.1)
    //     Started On Thu Sep 10 10:08:35 2020
    //     ...
    //     Results Summary:
    //     ----------------
    //     No infection found.
    //     Successfully Submitted Heartbeat Report
    //     Microsoft Windows Malicious Software Removal Tool Finished On Tue Nov 10 18:54:47 2020
    //
    DTPD!(
        concatcp!("(Started On|Started on|started on|STARTED|Started|started|Finished On|Finished on|finished on|FINISHED|Finished|finished)[:]?", RP_BLANK, CGP_DAYa, RP_BLANK, CGP_MONTHb, RP_BLANK, CGP_DAYde, RP_BLANK, CGP_HOUR, D_T, CGP_MINUTE, D_T, CGP_SECOND, RP_BLANK, CGP_YEAR, RP_NOALNUM),
        DTFSS_YbdHMS, 0, 140, CGN_DAYa, CGN_YEAR,

https://github.com/jtmoon79/super-speedy-syslog-searcher/blob/0.6.62/src/data/datetime.rs#L2715-L2740

In a non-English version of Windows, the phrases will be different; e.g. the phrase Started on in Windows set to locale es_MX might be comenzo en (just a guess, I haven't checked). I know for a fact that Windows does a fair amount of work to use localized strings for all user-facing messages (since I worked on that project at Microsoft 😁). This might not affect all the obscure text logs under C:\Windows but I bet it will affect some.

The only way to know the phrasing used in the non-English version of mrt.log is to prop up an instance of Windows using a different locale/language. Then do whatever is necessary to generate the file (save it as part of ./logs), and then update the code to match that non-English phrase. It's a bit of work for just one locale instance. There are many possible locales for Windows to use.

I would skip an approach that merely switched the language of the running Windows system. IME on every OS, only some applications correctly switch locale when the user or system changes locale. Most applications permanently stick to the locale that was present when the application first ran.

The same work would be necessary for Linux and Linux-based things.

allow setting prepended datetime format

Currently

CLI options -u and -l are hardcoded to strftime format "%Y%m%dT%H%M%S%.6f %z:".

Feature Request

Allow the user to pass a strftime format string. A format with a timezone is required.

When passed, do a sanity check that a DateTime can be created from it.
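Part of that sanity check can be sketched as a simple scan for a timezone specifier (a hypothetical helper; the full check would also attempt to format an actual DateTime with the string):

```rust
/// Check that a strftime-style format string contains a timezone specifier.
/// Hypothetical sanity check covering the common chrono timezone specifiers.
fn has_tz_specifier(fmt: &str) -> bool {
    ["%z", "%Z", "%:z", "%#z"].iter().any(|spec| fmt.contains(spec))
}
```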

support for datetime format for Synology custom log synocrond-execute.log

Problem

Synology DiskStation OS has an unsupported syslog datetime format in file synocrond-execute.log

12-06 00:59:44 running job: builtin-synosharing-default with command: /usr/syno/bin/synosharingcron as user root
12-06 05:33:44 running job: pkg-DownloadStation-DownloadStationUpdateJob with command: /var/packages/DownloadStation/target/bin/synodlupdate --update as user DownloadStation
12-06 09:00:44 running job: builtin-synosharesnapshot-default with command: /usr/syno/sbin/synosharesnapshot misc subvol-clean as user root

Solution

Either:

  • Support the ad-hoc message datetime format. Add test cases in datetime.rs and comment where they originate.
  • Do not support this poor ad-hoc format; mark as wontfix. The MM-DD format is ambiguous, as it could also mean DD-MM. While it could be determined by checking values (i.e. are the first two chars in 1-12?), that gets troublesome, and may not be possible to determine (e.g. a syslog file has one message like 01-02 00:00:00 hello; what then?).

refactor chained paths

Problem

Currently, a file within a .tar file is represented as file.tar:log.txt using the : separator. Only one "depth" is supported, e.g. cannot have file.tar:log.tar:syslog.

Solution

  • refactor path passing to use something that understands "chained depth". It should not rely on an arbitrary separator character like :.
  • use : for user-facing printed separator of such paths

read `.xz` file by requested block

Problem

An .xz file is entirely read during BlockReader::new.
This may cause problems for very large compressed files (the s4 program will hold the entire uncompressed file in memory; it would use too much memory).

The crate lzma-rs does not provide an API like xz_decompress_with_options that would allow limiting the bytes returned per call. It only provides xz_decompress, which decompresses the entire file in one call. See gendx/lzma-rs#110

Solution

Read an .xz file per block request, as done for normal files.


Update: see Issue #283

Meta-Issue #182

CLI passed %Z fails to parse

Problem

The following command-line fails:

$ ./target/debug/s4 /var/log/syslog -a '2022-08-08 12:00:00 PST'
...
     process_dt: datetime_parse_from_str("2022-08-08 12:00:00 PST", "%Y-%m-%d %H:%M:%S %Z", true, +00:00)
        →datetime_parse_from_str: datetime_parse_from_str(pattern "%Y-%m-%d %H:%M:%S %Z", tz_offset +00:00, data "2022-08-08 12:00:00 PST")
        ←datetime_parse_from_str: DateTime::parse_from_str("2022-08-08 12:00:00 PST", "%Y-%m-%d %H:%M:%S %Z") failed ParseError: input is not enough for unique date and time
...
ERROR: Unable to parse a datetime from "2022-08-08 12:00:00 PST"

The --help message suggests possible datetimes:

$ ./target/release/s4 --help
super_speedy_syslog_searcher 0.0.25
....
DateTime Filter patterns may be:
    "%Y%m%dT%H%M%S"
    "%Y%m%dT%H%M%S%z"
    "%Y%m%dT%H%M%S%Z"
    "%Y-%m-%d %H:%M:%S"
    "%Y-%m-%d %H:%M:%S %z"
    "%Y-%m-%d %H:%M:%S %Z"
    "%Y-%m-%dT%H:%M:%S"
    "%Y-%m-%dT%H:%M:%S %z"
    "%Y-%m-%dT%H:%M:%S %Z"
    "%Y/%m/%d %H:%M:%S"
    "%Y/%m/%d %H:%M:%S %z"
    "%Y/%m/%d %H:%M:%S %Z"
    "%Y%m%d"
    "%Y%m%d %z"
    "%Y%m%d %Z"
    "+%s"
...

Solution

Support %Z parsing for command-line passed -a or -b.

process return code is confusing

The process return code for success 0 or failure 1 is confusing.


A few successful return codes:

$ ./target/release/s4 /var/log/wtmp
WARNING: no syslines found "/var/log/wtmp"

$ echo $?
0

$ ./target/release/s4 /var/log/ 1> /dev/null
WARNING: not a parseable type "/var/log/journal/c3a57680c1d26ca313b9c7ec36a5beaa/system.journal"
WARNING: no syslines found "/var/log/lastlog"
WARNING: no syslines found "/var/log/wtmp"

$ echo $?
0

Failure return code for some unknown files.

$ ./target/release/s4 /var/log/journal/c3a57680c1d26ca313b9c7ec36a5beaa/system.journal /var/log/wtmp /var/log/lastlog
WARNING: no syslines found "/var/log/lastlog"
WARNING: no syslines found "/var/log/wtmp"

$ echo $?
1

Yet this succeeds:

$ ./target/release/s4 /var/log/wtmp /var/log/lastlog
WARNING: no syslines found "/var/log/lastlog"
WARNING: no syslines found "/var/log/wtmp"

$ echo $?
0

Missing permissions cause a failure return code (notice file noaccess)

$ ls -a /tmp/a
total 500
drwxr-xr-x  2 root root   4096 Aug  8 21:26 .
drwxrwxrwt 18 root root 479232 Aug  8 21:26 ..
-rw-r--r--  1 root root     32 Aug  6 03:50 dtf5-1.log
-rw-r--r--  1 root root      0 Aug  6 06:30 emptry
----------  1 root root     32 Aug  6 06:09 noaccess
-rw-r--r--  1 root root   4345 Aug  8 21:26 syslog.gz
-rw-r--r--  1 root root    200 Aug  6 05:50 wtmp200
-rw-r--r--  1 root root   1319 Aug  8 21:26 wtmp.gz

$ ./target/release/s4 /tmp/a 1>/dev/null
WARNING: no syslines found "/tmp/a/wtmp.gz"
WARNING: no syslines found "/tmp/a/wtmp200"

$ echo $?
0

$ sudo -u nobody -- ./target/release/s4 /tmp/a 1>/dev/null
WARNING: no syslines found "/tmp/a/wtmp200"
WARNING: no syslines found "/tmp/a/wtmp.gz"

$ echo $?
1

Dead symlink does not cause an error

$ ln -vs /tmp/DOES_NOT_EXIST /tmp/DEAD_LINK
'/tmp/DEAD_LINK' -> '/tmp/DOES_NOT_EXIST'

$ ./target/release/s4 /tmp/DEAD_LINK

$ echo $?
0

Parsing a symlink to an unparsable file does cause an error

$ ls -l /lib/libdmmp.so
lrwxrwxrwx 1 root root 16 Feb 21 21:18 /lib/libdmmp.so -> libdmmp.so.0.2.0

$ ./target/release/s4 /lib/libdmmp.so

$ echo $?
1


Overall, when and why a return code is success or failure requires further thought. Additionally, some unparsed files print an error while others do not (somewhat relates to Issue #3).

What does similar program GNU grep do?

support for datetime format for nvidia-installer

Problem

The NVidia program nvidia-installer writes a log file with an unsupported datetime format

nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Fri May 31 13:49:08 2019
installer version: 340.107

PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/root/bin

This program runs in Ubuntu and so should be supported.

Solution

Add support for the datetime format. Add built-in tests to datetime.rs and comment which test cases are that format.

support ISO 8601 format; Unicode "minus sign"

According to Wikipedia

To represent a negative offset, ISO 8601 specifies using a minus sign, (−). If the interchange character set is limited and does not have a minus sign character, then the hyphen-minus should be used, (-). ASCII does not have a minus sign, so its hyphen-minus character (code 45 decimal or 2D hexadecimal) would be used. If the character set has a minus sign, then that character should be used. Unicode has a minus sign, and its character code is U+2212 (2212 hexadecimal); the HTML character entity invocation is &minus;.

Problem

chrono parse_from_str fails to parse a Unicode "minus sign" character (U+2212).

Solution

To more fully support the ISO 8601 standard, s4lib should parse Unicode "minus sign" character (U+2212) in numeric timezone offsets, e.g. −07:00.


Follows from chronotope/chrono#835
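Until chrono itself accepts U+2212, one hedged workaround (an assumption, not the project's plan) is normalizing the Unicode minus sign to ASCII hyphen-minus before parsing:

```rust
/// Replace the Unicode minus sign (U+2212) with ASCII hyphen-minus so a
/// downstream parser that only understands '-' can consume the offset.
/// Hypothetical pre-pass; the proper fix belongs upstream in chrono.
fn normalize_minus(datetime: &str) -> String {
    datetime.replace('\u{2212}', "-")
}
```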

prepend -w includes unprinted files

Problem

Passing -p -w will create a prepended file width length for all files that might be printed.

Given files

  • syslog with two syslines
  • fileWithLongNameNoSyslines with no syslines

The prepended filename syslog will have many extra spaces:

$ s4 -p -w syslog fileWithLongNameNoSyslines
syslog                    : 20220101T000001 sysline 1
syslog                    : 20220101T000002 sysline 2

Solution

Compute the prepended filename width only from files that will actually be printed.
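The width computation itself is simple once restricted to the printed set. A sketch with hypothetical helper names (they do not match the project's actual types):

```rust
/// Compute the prepend column width from only the paths that will print.
/// Hypothetical helper; not the project's actual implementation.
fn prepend_width(printed_paths: &[&str]) -> usize {
    printed_paths.iter().map(|path| path.len()).max().unwrap_or(0)
}

/// Left-align a path to the computed width, as -p -w output would.
fn prepend(path: &str, width: usize) -> String {
    format!("{:<width$}:", path)
}
```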

dmesg file errantly parsed

File dmesg (and dmesg.1, dmesg.2.gz, etc.) is errantly parsed because of a datetime within the file.

Problem

dmesg contents

...
[    0.868675] kernel: rtc_cmos 00:03: setting system clock to 2022-07-31T20:39:27 UTC (1659299967)
[    0.869775] kernel: rtc_cmos 00:03: alarms up to one day, y3k, 242 bytes nvram, hpet irqs
[    0.870734] kernel: i2c_dev: i2c /dev entries driver
...

The datetime 2022-07-31T20:39:27 UTC is successfully parsed. The remaining lines in the file are wrongly presumed to be part of one very large sysline, when they are actually individual syslines.

$ ./target/release/s4 -l /var/log/dmesg
20220804T052008.000000 +0000:[    1.203365] kernel: rtc_cmos 00:03: setting system clock to 2022-08-04T05:20:08 UTC (1659590408)
20220804T052008.000000 +0000:[    1.204496] kernel: rtc_cmos 00:03: alarms up to one day, y3k, 242 bytes nvram, hpet irqs
20220804T052008.000000 +0000:[    1.205674] kernel: i2c_dev: i2c /dev entries driver
...

Solution

The dmesg file format needs special handling; supporting it is a fair amount of refactoring.
or
Add special handling to reject parsing of dmesg files.
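The rejection option can be sketched as a heuristic pre-check on leading lines. The function below is hypothetical, not s4's implementation; it only recognizes the kernel-timestamp shape `[    0.868675] ...`:

```rust
/// Heuristic rejection of dmesg-style lines such as "[    0.868675] ...".
/// A file whose leading lines all match this shape could be refused as a
/// syslog even if a datetime appears later in a message body.
fn looks_like_dmesg_line(line: &str) -> bool {
    let Some(rest) = line.strip_prefix('[') else { return false };
    let Some(end) = rest.find(']') else { return false };
    // the bracketed stamp must be "<digits>.<digits>" after trimming padding
    let stamp = rest[..end].trim();
    let mut parts = stamp.splitn(2, '.');
    match (parts.next(), parts.next()) {
        (Some(secs), Some(frac)) => {
            !secs.is_empty()
                && !frac.is_empty()
                && secs.chars().all(|c| c.is_ascii_digit())
                && frac.chars().all(|c| c.is_ascii_digit())
        }
        _ => false,
    }
}
```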

read tar file by requested block

Problem

A .tar file is entirely read during BlockReader::read_block_FileTar.
This may cause problems for very large compressed files: the s4 program will hold the entire unarchived file in memory, using too much memory.

This is due to design of the tar crate. The crate does not provide a method to store tar::Archive<File> instance and tar::Entry<'a, R: 'a + Read> instance due to inter-instance references and explicit lifetimes. (or is prohibitively complex; I made many attempts using various strategies involving references, lifetimes, pointers, etc.)

A tar::Entry holds a reference to data within the tar::Archive<File>. I found it impossible to store both related instances during new() or read_block_FileTar() and then later, during another call to read_block_FileTar(), utilize the same tar::Entry.
A new tar::Entry could be created per call to read_block_FileTar(). But then, to read the requested BlockOffset, the entire .tar file entry would have to be re-read. This makes reading an entire file entry within a .tar file an O(n^2) algorithm.

Solution

Read a .tar file per block request, as is done for normal files.


Meta Issue #182
Similar problem as Issue #12.
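The O(n^2) trade-off can be made concrete with a sketch: each block request re-creates a sequential reader (standing in for a fresh tar::Entry, which only implements Read) and skips forward to the requested offset. Names and signatures here are hypothetical:

```rust
use std::io::Read;

/// Read one block at `offset` from a reader that only supports sequential
/// reads (like a tar entry): discard `offset` bytes, then fill `buf`.
/// Re-doing this for every block request costs O(n) per read, hence
/// O(n^2) over a whole entry -- the trade-off versus holding the entire
/// unarchived entry in memory. Note a single `read` may return fewer
/// bytes than `buf.len()`.
fn read_block_sequential<R: Read>(mut r: R, offset: u64, buf: &mut [u8]) -> std::io::Result<usize> {
    std::io::copy(&mut r.by_ref().take(offset), &mut std::io::sink())?;
    r.read(buf)
}
```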

chained block reads

Problem

Currently, only one "depth" of compressed or archived file is supported.
e.g. can read syslog stored in logs.tar. Cannot read syslog.gz in logs.tar, nor logs.tar stored in logs.tar.xz.
e.g. can read syslog stored in syslog.gz. Cannot read syslog.gz stored in syslog.gz.xz. Cannot read the special gzip+tar file logs.tgz.

Relatedly, only plain text files are extractable from compressed or archived files. EVTX, Journal, and utmp files stored compressed or archived are not readable. See FileType.

Solution

Refactor BlockReader reading to handle arbitrary "chains" of reads for text files and UTMPX files.

Currently, JournalReader reads Journal files using libsystemd calls for reading. BlockReader is not used by the JournalReader. Processing Journal files that are compressed or archived are outside the scope of this issue.

Currently, EvtxReader reads EVTX files using EvtxParser. BlockReader is not used by the EvtxReader. Processing EVTX files that are compressed or archived are outside the scope of this issue.

Relates to Issue #7.
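One way to plan such chained reads is to derive the decode chain from the file name's extensions before constructing readers, outermost decoder first. A hypothetical std-only sketch (a real implementation would also inspect magic bytes rather than trusting the name):

```rust
/// Derive a read "chain" from a file name's extensions, outermost first:
/// "logs.tar.xz" -> ["xz", "tar"], meaning xz-decompress, then un-tar.
fn read_chain(name: &str) -> Vec<&str> {
    let mut chain: Vec<&str> = Vec::new();
    let mut rest = name;
    loop {
        match rest.rsplit_once('.') {
            Some((prefix, ext)) if matches!(ext, "gz" | "xz" | "bz2" | "tar" | "tgz") => {
                chain.push(ext);
                rest = prefix;
            }
            _ => break,
        }
    }
    chain
}
```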

user-passed `--prepend-dt-format` with variable width format string will be vertically unaligned

Problem

The recently added feature --prepend-dt-format (Issue #28) allows users to pass a variable width strftime format string, e.g. "%A".
The following command will print vertically unaligned output:

$ cat /tmp/a.log
2000-01-01T00:00:00 [dtf2-2]

2000-01-02T00:00:01 [dtf2-2]a

$ s4 -d '%A:' -u /tmp/a.log
Saturday:2000-01-01T00:00:00 [dtf2-2]
Saturday:
Sunday:2000-01-02T00:00:01 [dtf2-2]a

Previously, the prepended datetime used the fixed-width strftime format %Y%m%dT%H%M%S%.3f%z:.

This vertical misalignment can be visually difficult for end-users to grok.

Solution

Allow the user to pass a variable width --prepend-dt-format format string, e.g. "%A", but maintain vertical alignment.

This could be mandatory behavior, or, another CLI option, like --prepend-dt-align (similar to current option --prepend-file-align).
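Alignment can be sketched by formatting the prepend strings first, then padding each to the widest. This assumes the formatted strings are available up front; in practice s4 streams output, so the width would have to be derived ahead of time (e.g. from each file's earliest and latest datetimes):

```rust
/// Pad each prepended datetime string to a common column width so that
/// variable-width strftime formats (e.g. "%A") stay vertically aligned.
fn align_prepends(prepends: &[&str]) -> Vec<String> {
    let width = prepends.iter().map(|p| p.len()).max().unwrap_or(0);
    prepends.iter().map(|p| format!("{:<width$}", p)).collect()
}
```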

support parsing datetime without seconds

Problem

Arch package manager pacman uses the following syslog message format:

[2019-03-01 16:56] [PACMAN] synchronizing package lists

(there are no seconds)

Solution

Support parsing datetime without seconds. The seconds value would be 0.

Add built-in test cases to datetime.rs and comment which cases cover the format.
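A minimal sketch of the buffer fix-up, assuming the captured time-of-day is available as a string (s4's captures_to_buffer_bytes actually works on bytes): append ":00" so the value satisfies an existing "%H:%M:%S" strftime pattern with seconds defaulting to 0.

```rust
/// If a captured time-of-day has no seconds field (e.g. "16:56"), append
/// ":00" so it parses under an "%H:%M:%S" pattern with seconds = 0.
fn fill_missing_seconds(hhmm: &str) -> String {
    if hhmm.matches(':').count() == 1 {
        format!("{}:00", hhmm)
    } else {
        hhmm.to_string()
    }
}
```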

refactor regex parsing of datetime, build strftime pattern during regex capture processing

tl;dr: removing the const variables CGP_* that represent strftime patterns, and instead building the strftime pattern dynamically during regex capture processing, would simplify processing and simplify the DTPD! declarations.

Currently

Parsing a datetime from a string consists of several tedious steps:

  1. call fn bytes_to_regex_to_datetime https://github.com/jtmoon79/super-speedy-syslog-searcher/blob/0.0.32/src/data/datetime.rs#L2910
  2. does a regex capture to named capture groups
  3. passes named capture groups to fn captures_to_buffer_bytes https://github.com/jtmoon79/super-speedy-syslog-searcher/blob/0.0.32/src/data/datetime.rs#L2690
  4. fn captures_to_buffer_bytes rearranges captured data into a buffer
  5. the buffer and a strftime pattern is passed to fn datetime_parse_from_str https://github.com/jtmoon79/super-speedy-syslog-searcher/blob/0.0.32/src/data/datetime.rs#L2992-L2999

The strftime pattern is unnecessarily duplicated information. The patterns are defined at compile time within the DATETIME_PARSE_DATAS
https://github.com/jtmoon79/super-speedy-syslog-searcher/blob/0.0.32/src/data/datetime.rs#L1535
The DTP_* variables complicate the transformation of bytes to regex captures to a datetime string.
https://github.com/jtmoon79/super-speedy-syslog-searcher/blob/0.0.32/src/data/datetime.rs#L446-L494

The complicating part is that the regular expression data must be transformed by fn captures_to_buffer_bytes to fit the predefined pattern (declared as const CGP_*).

Improvement Approach 1

Instead, the strftime pattern could be created dynamically alongside the extracted data within captures_to_buffer_bytes.
https://github.com/jtmoon79/super-speedy-syslog-searcher/blob/0.0.32/src/data/datetime.rs#L2689-L2696
The predefined global patterns (DTP_* variables) could be removed.
The DTPD! listings would be significantly simpler, and captures_to_buffer_bytes would not have to transform captured data to match a predefined DTP_* pattern.

Example

For example, currently fn captures_to_buffer_bytes must look at the length of the named capture group fractional. From there it transforms the captured data to fit strftime specifier %f. This is tedious.

Instead fn captures_to_buffer_bytes can look at the captured fractional and determine if the associated strftime specifier should be %3f, %6f, or %9f. Then the function need not modify the data, it only needs to set the correct strftime specifier, say %3f for captured data "123". The fn captures_to_buffer_bytes can dynamically create the strftime pattern to pass to fn datetime_parse_from_str.

There would be no need for the tedious predeclared CGP_* variables within DTPD! declarations.
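Approach 1's fractional handling can be sketched as a lookup from captured digit count to specifier; the function name is hypothetical, and real captures could also be 1-9 digits wide (here unhandled widths return None):

```rust
/// Choose the strftime fractional-second specifier from the captured
/// digit count, instead of padding the captured data to fit a predeclared
/// fixed-width pattern.
fn fractional_specifier(captured: &str) -> Option<&'static str> {
    match captured.len() {
        3 => Some("%3f"),
        6 => Some("%6f"),
        9 => Some("%9f"),
        _ => None,
    }
}
```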

Improvement Approach 2

Instead, the strftime pattern could be determined from the extracted capture group names found in the capture groups.

Currently, there are capture group names "<month>", "<day>", etc.

These capture group names could be modified to represent the format of data they have, e.g. "<month_b>" would signify data for strftime month specifier %b, "<month_m>" would signify data for strftime month specifier %m, etc.

In captures_to_buffer_bytes, a sequence of name queries would be made to the captures variable, e.g. if captures.name("month_b") { copy to buffer the "Jan"-like data; month = DTFS_Month::b; } else if captures.name("month_m") { copy to buffer the "01"-like data; month = DTFS_Month::m; } ....

Then in captures_to_buffer_bytes, a long match statement would choose the correct strftime specifier string. e.g.

   let strftime_format: &str = match (year, month, day, ...) {
       (DTFS_Year::Y, DTFS_Month::m, DTFS_Day::d, ...) => {
           "%Y%m%d..."
       }
       ...
   };

Later, the filled buffer and variable strftime_format would be passed to local fn datetime_parse_from_str.
