
node-byline's Introduction

Hi there 👋

  • 🔭 I’m currently working at a Stealth Startup
  • ⚛️ I like to use React
  • 💬 Ask me about TypeScript

node-byline's People

Contributors

ainthek, arikon, jahewson, rstacruz


node-byline's Issues

Issue when writing EOL in stream mode

I have an issue when writing a file in stream mode. The EOL seems not to be respected when writing a new line.

This is the sync version, which works as expected:

var i = fs.openSync(self._options.trainFile, 'r');
var o = fs.openSync(tmpFilePath, 'w');

var buf = new Buffer(1024 * 1024), len, prev = '';

while (len = fs.readSync(i, buf, 0, buf.length)) {

    var a = (prev + buf.toString('utf-8', 0, len)).split('\n');
    prev = len === buf.length ? '\n' + a.splice(a.length - 1)[0] : '';
    var out = '';
    a.forEach(function(text) {
        if (!text) return;
        text = text.toLowerCase()
            .replace(/^/gm, '__label__')
            .replace(/'/g, " ' ")
            .replace(/"/g, '')
            .replace(/\./g, ' . ')
            .replace(/,/g, ' , ')
            .replace(/\(/g, ' ( ')
            .replace(/\)/g, ' ) ')
            .replace(/!/g, ' ! ')
            .replace(/\?/g, ' ! ')
            .replace(/;/g, ' ')
            .replace(/:/g, ' ');
        out += text + '\n';
    });
    var bout = new Buffer(out, 'utf-8');
    fs.writeSync(o, bout, 0, bout.length);
}

fs.closeSync(o);
fs.closeSync(i);

while this is the stream version with byline:

var os = require('os');
var Transform = require('stream').Transform;
var writeStream = fs.createWriteStream(tmpFilePath, { flags: 'w', encoding: 'utf-8' });
var stream = byline(fs.createReadStream(self._options.trainFile, { flags: 'r', encoding: 'utf8' }));
//stream.pipe(writeStream);
stream.on('end', function() {
    return resolve({
        trainFile: tmpFilePath
    });
});
stream.on('data', function(text) { // read line by line
    text = text.toLowerCase()
        .replace(/^/gm, '__label__')
        .replace(/'/g, " ' ")
        .replace(/"/g, '')
        .replace(/\./g, ' . ')
        .replace(/,/g, ' , ')
        .replace(/\(/g, ' ( ')
        .replace(/\)/g, ' ) ')
        .replace(/!/g, ' ! ')
        .replace(/\?/g, ' ! ')
        .replace(/;/g, ' ')
        .replace(/:/g, ' ');
    writeStream.write(text + os.EOL);
});
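
One possible explanation, not from the original thread: byline strips the original line terminators from each emitted line, and os.EOL is platform-dependent (\r\n on Windows, \n elsewhere), so the rewritten file can end up with different line endings than the source, whereas the sync version hard-codes '\n'. A minimal sketch that pins the terminator so both versions produce identical output:

stream.on('data', function(text) {
    // ... same replace() chain as above ...
    writeStream.write(text + '\n'); // '\n' explicitly, not the platform's os.EOL
});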

LICENSE

... says node-deflate, but should say node-byline

\x85 is an important part of international characters

\x85 is, unfortunately, a hexadecimal escape for a byte that occurs inside the encoded forms of many international characters. It all comes down to the encoding, but since I'm digging into some legacy code, I could not prevent ISO-8859-1 strings from ending up UTF-8-ised.

The example below illustrates that:

// UTF-8-ized Latin-1/ISO-8859-1 strings
var str0 = 'Nguy�n Thái Ng�c Duy',
    str1 = 'Adam Pi�tyszek',
    str2 = '��',
    str3 = '�彦',
    str4 = '����',
    str5 = '���',
    str6 = 'QQ�� ��K� QQ空� QQ',
    str7 = '�亨財���';

// decode UTF-8-ized Latin-1/ISO-8859-1 to UTF-8
var decode = function(str) {
  var s;
  try {
    // if the string is UTF-8, this will work and not throw an error.
    s = decodeURIComponent(escape(str));
  } catch(e) {
    // if it isn't, an error will be thrown, and we can assume that we have an ISO string.
    s = str;
  }
  return s;
};

console.log('str0: ' + decode(str0)); // str0: Nguyễn Thái Ngọc Duy
console.log('str1: ' + decode(str1)); // str1: Adam Piątyszek
console.log('str2: ' + decode(str2)); // str2: 즈눅
console.log('str3: ' + decode(str3)); // str3: 元彦
console.log('str4: ' + decode(str4)); // str4: 入门教程
console.log('str5: ' + decode(str5)); // str5: 陈光远
console.log('str6: ' + decode(str6)); // str6: QQ音乐 全民K歌 QQ空间 QQ
console.log('str7: ' + decode(str7)); // str7: 鉅亨財經新聞

PS: It seems that the (\x85) character is omitted while I'm entering text in GitHub's editor... so I don't know if the code above will run correctly.

This refers to the change introduced by this line. I'm sticking to v4.2.2 for now, great stuff! 👍

linebreak at wrong character

I got a file containing

"Hello a�" � nein, das sparen w

According to Google, the two characters (after "Hello a" and after the closing quote) are "REPLACEMENT CHARACTER" (U+FFFD).

Why is byline doing a line break here? I would like to have line breaks only on \r?\n.

line-reader has no problems with it.

question please (insert parsed data to a DB)

Hi,
I am new here. I have done some Java work. We are looking for a script to read a huge comma-separated file and then insert it into a database. One of my coworkers talked about using Python; I found your Node.js project called node-byline.
Please tell me, would this code work better than Python?
How do I insert what was parsed into a database?
Thanks a lot.
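
Not from the thread, but as a rough answer to the second question: read lines with byline, split each one, and batch the inserts. insertBatch below is a hypothetical stand-in for your database driver's bulk-insert call, and 'huge.csv' is a placeholder:

var fs = require('fs');
var byline = require('byline');

var stream = byline(fs.createReadStream('huge.csv', { encoding: 'utf8' }));
var batch = [];

stream.on('data', function(line) {
  batch.push(line.split(',')); // naive split; real CSV may need a proper parser
  if (batch.length >= 1000) {
    insertBatch(batch); // hypothetical bulk insert via your DB driver
    batch = [];
  }
});

stream.on('end', function() {
  if (batch.length) insertBatch(batch); // flush the remainder
});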

byline w/ Node 0.10.16 emits buffers instead of strings

I'm not sure if you consider this a bug or not, but it broke one of my programs so I thought I'd submit it:

test.js

'use strict';

var
    fs = require('fs'),
    byline = require('byline');

var stream = byline(fs.createReadStream('test.txt'));

stream.on('data', function(line) {
    console.log(Buffer.isBuffer(line));
});

test.txt

one
two
a three
four
> node test
true
true
true
true

I had previously been using Node 0.10.3, and after upgrading to 0.10.16 I found that a function calling trim() (from String) on line in the data event handler suddenly blew up. It turns out that under Node 0.10.3 the event emitted strings, while under 0.10.16 it emits Buffers.
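
A common workaround, assuming UTF-8 input, is to set an encoding on the source stream; byline mirrors the source stream's encoding, so it emits strings again:

var stream = byline(fs.createReadStream('test.txt', { encoding: 'utf8' }));

stream.on('data', function(line) {
  console.log(Buffer.isBuffer(line)); // now prints false; line is a string
});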

Frequent pause/resume on large file causes data loss

When processing a large file (e.g. 20000 rows), with frequent pause/resume calls, lines are dropped from the end of the file (e.g. about 4000).

I'm using pause/resume to control the number of lines being processed. Each line goes into a queue; when the queue reaches a certain size the input is paused, and when it drops to 0 the input is resumed. This works well when the number of pause/resume calls is very small. As they get more frequent (i.e. as I shrink the queue), more data loss occurs.

A code sample (ugly) without the input file ...

var Asynchronous = require('async');
var ByLine = require('byline');
var FileSystem = require('fs');
var Readline = require('readline');
var Utilities = require('util');

var countOfTasks = 0;
var countOfProcessed = 0;

var output = Readline.createInterface({
  input: process.stdin,
  output: process.stdout
});

output.setPrompt('');
output.write(Utilities.format('tasks:%d processed:%d ', countOfTasks, countOfProcessed));

var interval = setInterval(function() {

  output.write(null, {
    ctrl: true,
    name: 'u'
  });

  output.write(Utilities.format('tasks:%d processed:%d ', countOfTasks, countOfProcessed));

}, 250);

Asynchronous.waterfall([
  function(callback) {

    var queue = Asynchronous.queue(function(task, callback) {

      FileSystem.appendFileSync('debug-processing-preTimeout', Utilities.format('%s\n', task.line));

      setTimeout(function() {
        FileSystem.appendFileSync('debug-processing-inTimeout', Utilities.format('%s\n', task.line));
        callback();
      }, 0);

    }, 1);

    // queue.drain = function() {
    //   callback();
    // };

    var lines = ByLine.createStream(FileSystem.createReadStream('lines'));

    lines
      .on('data', function(line) {

        countOfTasks ++;

        var task = {
          line: line
        };

        FileSystem.appendFileSync('debug-pushing', Utilities.format('%s\n', line));

        queue.push(task, function(error) {

          FileSystem.appendFileSync('debug-processed', Utilities.format('%s\n', line));

          countOfProcessed ++;
          countOfTasks --;

          if (countOfTasks == 0)
            lines.resume();

        });

        if (countOfTasks >= 5)
          lines.pause();

      })
      .on('error', callback);

  }
], function(error) {

  clearInterval(interval);

  if (error) {

    output.write(null, {
      ctrl: true,
      name: 'u'
    });

    output.write(Utilities.format('Error ... %s\n\n', error.message));

  }

  output.write(null, {
    ctrl: true,
    name: 'u'
  });

  output.write(Utilities.format('Done ... processed:%d\n\n', countOfProcessed));
  output.close();

});

Notice that the size of the input file (lines) differs from the size of an output file (e.g. debug-pushing). It seems as if lines are missing from the end.
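
Not from the thread, but on newer Node versions one way to sidestep manual pause/resume entirely is to pipe byline into a Writable with a small highWaterMark and let the stream machinery apply backpressure; a minimal sketch:

var FileSystem = require('fs');
var Stream = require('stream');
var ByLine = require('byline');

var lines = ByLine.createStream(FileSystem.createReadStream('lines'));

var worker = new Stream.Writable({
  objectMode: true,
  highWaterMark: 5, // buffer at most ~5 lines before throttling the source
  write: function(line, encoding, done) {
    setTimeout(function() { // simulate async work per line
      done(); // calling done() is what releases the backpressure
    }, 0);
  }
});

lines.pipe(worker).on('finish', function() {
  console.log('all lines processed');
});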

pause causes subsequent stream data to be chunked together

Hello, I encountered this behavior while using byline on node v0.10. I'm not really sure whether it's a feature or a bug. The script reads a file and prints the content line by line. Without stream.pause, the lines are printed one by one. However, if I use pause, the lines are chunked together.

fs = require 'fs'
byline = require 'byline'

readfile = (filename) ->
  stream = byline fs.createReadStream(filename)

  stream.on 'data', (line) ->
    console.log line.toString()
    stream.pause()
    setTimeout((-> stream.resume()), 1000)

readfile 'line.coffee'

It prints:

fs = require 'fs'
byline = require 'byline'readfile = (filename) ->  stream = byline fs.createReadStream(filename)  stream.on 'data', (line) ->    console.log line.toString()    stream.pause()    setTimeout((-> stream.resume()), 1000)readfile 'line.coffee'

The first line is printed properly, but the rest are all chunked together.

pause() in data callback

If I write

stream.on('data', function(line) {
  stream.pause();
});

then the pause doesn't really work as expected -- the callback might still be called again.

This is because byline only pauses the underlying stream, but sends data events all at once, so if the stream receives a single block that contains multiple lines, byline will still fire events even though it is supposedly paused.

(I'm not 100% sure what the expected behavior is for the Stream API in this case; i.e. whether data events are synchronous or asynchronous wrt pause.)

Breaks with node 0.10

Node 0.10 has a new Readable API, and while trying the example from byline's README

var fs = require('fs'), 
    byline = require('byline');

var stream = fs.createReadStream('sample.txt');
stream = byline.createLineStream(stream);
stream.pipe(fs.createWriteStream('nolines.txt'));

one gets

_stream_readable.js:720
    throw new Error('Cannot switch to old mode now.');
          ^
Error: Cannot switch to old mode now.
    at emitDataEvents (_stream_readable.js:720:11)
    at ReadStream.Readable.pause (_stream_readable.js:711:3)
    at LineStream.pause (/srv/home/stephane/node_modules/byline/lib/byline.js:95:12)
    at LineStream.ondata (stream.js:52:16)
    at LineStream.EventEmitter.emit (events.js:95:17)
    at LineStream.write (/srv/home/stephane/node_modules/byline/lib/byline.js:74:12)
    at write (_stream_readable.js:573:24)
    at flow (_stream_readable.js:582:7)
    at ReadStream.pipeOnReadable (_stream_readable.js:614:5)
    at ReadStream.EventEmitter.emit (events.js:92:17)

(Tested with node v0.10.11 and byline 2.0.3.)

A workaround is to do

var stream = fs.createReadStream('sample.txt');
stream.resume();

to force the "old mode" compatibility to kick in early enough.

Variable "encoding" is not defined

In LineStream.prototype._reencode, (encoding) should be (this.encoding).

Broken:

// see Readable::push
LineStream.prototype._reencode = function(line, chunkEncoding) {
  if (this.encoding && this.encoding != chunkEncoding) {
    return new Buffer(line, chunkEncoding).toString(encoding);
  }

Fixed:

// see Readable::push
LineStream.prototype._reencode = function(line, chunkEncoding) {
  if (this.encoding && this.encoding != chunkEncoding) {
    return new Buffer(line, chunkEncoding).toString(this.encoding);
  }

Calling toString() on potentially incomplete buffers

It seems byline is just calling toString() on buffers without correctly respecting Unicode encoding rules. Since the buffers going into transform can start at any arbitrary position inside the original byte stream, a chunk boundary can fall in the middle of a character.

This would potentially affect all characters encoded with more than one byte.

if (encoding == 'buffer') {
  chunk = chunk.toString(); // utf8
  encoding = 'utf8';
}

Is this handled somewhere else?
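
Node's string_decoder module exists for exactly this case: it holds back the bytes of an incomplete multi-byte character until the next chunk arrives. A minimal illustration (not byline code):

var StringDecoder = require('string_decoder').StringDecoder;
var decoder = new StringDecoder('utf8');

var euro = Buffer.from([0xe2, 0x82, 0xac]); // '€' as three UTF-8 bytes

console.log(decoder.write(euro.slice(0, 2))); // '' – incomplete, held back
console.log(decoder.write(euro.slice(2)));    // '€' – completed by the next chunk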

Stop output after one line.

Hello, I have a text file with a few lines. I am trying to read them in one by one via node.js. The idea is to randomly read a line from the list when a command is executed (discord bot), and then send that single, randomly selected line.

I've gotten the code working, but as of now, byline is reading the entire text file. Is there any way to make it read only one line at a time, selected by line number?

Thanks!
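
byline itself reads the whole stream; selecting one line is something you do in the handler. A minimal sketch for a small file ('quotes.txt' is a placeholder): collect the lines, then pick one at random on end:

var fs = require('fs');
var byline = require('byline');

var lines = [];
var stream = byline(fs.createReadStream('quotes.txt', { encoding: 'utf8' }));

stream.on('data', function(line) {
  lines.push(line); // collect every line; fine for small files
});

stream.on('end', function() {
  var pick = lines[Math.floor(Math.random() * lines.length)];
  console.log(pick); // one randomly selected line
});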

nextTick is causing errors

I am getting problems with "recursive nextTicks". I would suggest looking into setImmediate as a replacement.
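
For context: process.nextTick callbacks run before the event loop is allowed to continue, so scheduling the next emit from inside a nextTick callback builds an unbounded chain, while setImmediate yields to the loop between callbacks. A two-line illustration (emitNextLine is a placeholder):

// process.nextTick(emitNextLine); // can recurse without ever yielding
setImmediate(emitNextLine);        // yields to the event loop on each iteration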

[Feature request] Ability to skip first line

Hey @jahewson,
First of all thanks for all the effort you put in this amazing module.

When processing CSV or TSV files it would be very handy to have an option to skip the first line.

E.g.

import { createReadStream } from 'fs';
import byline from 'byline';

const sourceFile = 'somefile.csv';

byline.createStream(createReadStream(sourceFile), {
  skipFirstLine: true
});

This functionality might help to keep the user code cleaner and shorter, but I understand it's a quite specific extra feature you might not want to include and leave to the users.

If it makes sense to you to include it I can try to submit a PR as soon as I have some free time.
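
In the meantime this is easy to approximate in userland. A minimal sketch (processRow is a placeholder for the per-row handler):

import { createReadStream } from 'fs';
import byline from 'byline';

const stream = byline.createStream(createReadStream('somefile.csv', { encoding: 'utf8' }));

let first = true;
stream.on('data', (line) => {
  if (first) { first = false; return; } // drop the header row
  processRow(line); // hypothetical per-row handler
});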

this is almost a `through stream`

what about making this into a readable/writable stream, aka a through stream?

basically, supporting this interface:

var a = fs.createReadStream(file);
var b = byline.createLineStream();

a.pipe(b);

var c = anotherStreamProcessor();

b.pipe(c);

// and so on...

this would be really handy, because then it would be another module that follows the proper stream interface, and it would be compatible and consistent with all the other modules.

Read previous/next line?

I was wondering if it was at all possible to read the very last line of a text file. And then, read the one before that. I can see all the data in the console, but I have no idea how to just display one line.
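
With a streaming reader, the usual trick is to keep a sliding window of the most recent lines. A minimal sketch ('log.txt' is a placeholder):

var fs = require('fs');
var byline = require('byline');

var last = null, secondToLast = null;
var stream = byline(fs.createReadStream('log.txt', { encoding: 'utf8' }));

stream.on('data', function(line) {
  secondToLast = last; // shift the two-line window
  last = line;
});

stream.on('end', function() {
  console.log('last line:', last);
  console.log('line before it:', secondToLast);
});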

Alternate encodings?

I'm wondering whether LineStream.write should support the encoding parameter?

LineStream.prototype.write = function(data, encoding) {
  if (Buffer.isBuffer(data)) {
    data = data.toString(encoding);
  }
  // ...
};

(If no encoding is present Buffer.toString() defaults to using "utf8".)

Better not to `console.log` while running

I'm writing a CLI program using byline and noticed something like "LineStream encoding buffer" pop up in the console; it's caused by the following line in lib/byline.js:

console.log('LineStream encoding %s', encoding);

I believe that, as a library, it's better not to console.log while running, isn't it? 😉

"race condition" for \r\n split on buffer length

Pretty academic, but it seems to me that if you read a \r\n-encoded block which ends with \r (so the next block starts with \n), you will insert an extra empty line.
I suppose special-casing it by checking the last character of one buffer and the first character of the next is the easiest fix, but it's a little ugly.
Nicer would be to determine the line ending style from the first line ending and then split on only that, in which case there would not be a conflation of \r and \r\n endings. It might be faster, too.
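
For concreteness, the special case could look roughly like this inside a splitter that keeps state between chunks (pendingCR is hypothetical state, not byline code):

if (pendingCR && chunk[0] === 0x0a) {
  chunk = chunk.slice(1); // this \n completes the previous chunk's \r\n; skip it
}
pendingCR = chunk.length > 0 && chunk[chunk.length - 1] === 0x0d;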

Duplex Stream

When passing a stream.Duplex, it would be really convenient to also have a by-line interface for writing.

server.on("connect", function(socket) {
  var lines = byline(socket);
  lines.write("HELLO"); // -> "HELLO\n"
  lines.on("data", handleLine.bind(this, socket)); // Handle every line.
});
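
Until byline supports this, a thin userland wrapper gets the write side (a sketch, not byline API):

function lineWriter(socket) {
  return function(text) {
    socket.write(text + '\n'); // append the terminator on every write
  };
}

var writeLine = lineWriter(socket);
writeLine('HELLO'); // sends "HELLO\n"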

supporting files with linefeeds only

We have some Excel-exported CSV files that we import and process. The exported files are from the latest Mac version of Excel (Excel for Mac 2011, version 14.2.5), which unfortunately still uses the old Mac EOL character for its default CSV export:

  • 0Dh 0Ah for DOS
  • 0Dh for older Macs
  • 0Ah for Unix/Linux

So we updated your regex to include that and tests with all three formats are working well. Thanks for the software, John!

var parts = data.split(/\r\n|\n|\r/g)

`pipe` assumes `_readableState` is present

The following fails:

var request = require('request');
var byline = require('byline');

var stream = byline(request.get('http://www.google.com'));
stream.on('end',function(){console.log('end')});
stream.on('data',function(line){console.log(line+'\n')});

(using byline 4.1.1, request 2.27.0), with the following error:

.../check-byline/node_modules/byline/lib/byline.js:75
      this.encoding = src._readableState.encoding;
                                        ^
TypeError: Cannot read property 'encoding' of undefined
    at LineStream.<anonymous> (.../check-byline/node_modules/byline/lib/byline.js:75:41)
    at LineStream.EventEmitter.emit (events.js:95:17)
    at Request.Stream.pipe (stream.js:123:8)

Running with previous versions, byline 3.1.2 is the last one to work with this code.
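
The failing line dereferences src._readableState unconditionally, which only core Readable streams are guaranteed to have. A sketched defensive variant of that line in byline.js:

// not every piped source is a core Readable; fall back to null
this.encoding = src._readableState ? src._readableState.encoding : null;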

README issues!

Hi there!

There are two main README issues:

  1. byline.createStream is undefined, should be byline.createLineStream
  2. byline(stream...) does not work anymore, it's "not a function", as the error says

support for npm test

Why not have mocha in scripts/test in package.json, and mocha in devDependencies?
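
Concretely, the suggestion amounts to something like this in package.json (the version is illustrative):

{
  "scripts": {
    "test": "mocha"
  },
  "devDependencies": {
    "mocha": "~1.21.0"
  }
}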

Trailing EOL treated as an additional empty line

byline with keepEmptyLines: true behaves slightly differently than most line readers I've encountered, including node's readline module. I created sample.txt by running:

echo foo> sample.txt
echo bar>> sample.txt
echo>> sample.txt
echo and baz>> sample.txt

byline will return five chunks for this file (or three with keepEmptyLines: false). The readline sample returns four lines:

Line from file: foo
Line from file: bar
Line from file:
Line from file: and baz

As does this C# program:

using System;
using System.IO;

class Program
{
    static void Main(string[] args)
    {
        using (var stream = new FileStream("sample.txt", FileMode.Open))
        using (var reader = new StreamReader(stream))
        {
            string line;
            while (null != (line = reader.ReadLine()))
            {
                Console.WriteLine($"Line from file: {line}");
            }
        }
    }
}

(perl -ne 'print "Line from file: $_";' sample.txt also prints 4 lines, but with newlines still attached.)

If byline's current behavior is considered by design, it would be nice if there was a new option to ignore the trailing EOL in a stream. This would make byline more of a drop-in replacement for readline.
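
Pending such an option, the trailing empty line can be filtered in userland by holding each line back until the next one arrives (a sketch assuming string lines; handle is a placeholder):

var pending = null, sawAny = false;

stream.on('data', function(line) {
  if (sawAny) handle(pending); // a following line exists, so pending is a real line
  pending = line;
  sawAny = true;
});

stream.on('end', function() {
  if (sawAny && pending !== '') handle(pending); // drop only a trailing empty line
});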

Undoing byline(streamX)

It's not clear how to dispose of streams created by the byline wrapper invocation, like the following to wrap bytestreams coming out of an invoked shell...

var outStream = byline(shell.stdout);
var errStream = byline(shell.stderr);

Currently there is a point in my program where my on 'data' handler is unsubscribed and I was expecting to be able to dispose of the line-oriented wrapper at that point, but can't find a destroy() method or anything equivalent.
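
Assuming byline(stream) wires the source into the LineStream with pipe (which matches the stack traces elsewhere on this page), tearing it down is a matter of unpiping and dropping listeners; a sketch:

outStream.removeAllListeners('data'); // drop any remaining line handlers
shell.stdout.unpipe(outStream);       // stop feeding the wrapper
shell.stdout.resume();                // let the source drain, or destroy() it instead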
