unrtf's Introduction

unrtf

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

Extract Text from Rich Text Format (rtf) Documents

Wraps the unrtf utility to extract text from rtf files.

Installation

install.packages("unrtf")

Hello World

The package has a single function, unrtf(). It takes either a local file path or a URL to an RTF document:

library(unrtf)
text <- unrtf("https://jeroen.github.io/files/sample.rtf", format = "text")
html <- unrtf("https://jeroen.github.io/files/sample.rtf", format = "html")
cat(text)
###  Translation from RTF performed by UnRTF, version 0.21.9 
### font table contains 11 fonts total

TITLE: It is an example test rtf-file to RTF2XML bean for testing

AUTHOR: kissj
### creation date: 17 April 2000 15:34 
### revision date: 19 April 2000 09:34 
### total pages: 2
### total words: 217
### total chars: 1240

-----------------
It is an example test rtf-file to RTF2XML bean for testing

Font size 10, plain text;
Font size 12, bold text. Underline,bold text.
 Underline,italic,bold text. 
Font size 22, plain text.
 Bold text.

unrtf's Issues

memory leak in malloc.c

Hi, I have recently been studying fuzzing. After some experiments, AFL found some crashes for which ASan reports a memory leak.
The output is below:
➜ unrtf2 ../unrtf-0.21.9/unrtf ./output/crashes/id:000005,sig:11,src:000552,op:havoc,rep:32

e Jans ;Opa e9ansi ;1e _paOpa e9Opa e9ansi ;Je _paOpa e9ansi ;Je _pame Jbns ;umme Jans ;Opa e9ansi ;Je _paOpapa. s&cdaa. um e9ansi ;Je _paJans ;Smpca me Jans ;um ccaa$pa. s&ccaa. um ccasbccacpca a a. e9ansi ;Je _paOpa e9ansi ^Je _pame Jans ;umme Jans ;Opa e9ansi ;Je _paOpapa. s&cdaa. um e9ansi ;Je _paJans ;Smme Jans ;um ccaa$pa. s&ccaa. um cca

=================================================================
==121801==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 5722 byte(s) in 571 object(s) allocated from:
#0 0x7fbf9dfbf602 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
#1 0x433bfc in rpl_malloc /home/greydog/fuzz/unrtf-0.21.9/src/malloc.c:166
#2 0x433bfc in my_malloc /home/greydog/fuzz/unrtf-0.21.9/src/malloc.c:73

SUMMARY: AddressSanitizer: 5722 byte(s) leaked in 571 allocation(s).

After that, I looked at malloc.c and found that malloc is called without a memset. It's really unsafe.

Text encoding unknown or inconsistent

Example 1:

> txt <- unrtf("http://kenbenoit.net/files/Hungarian.rtf")
trying URL 'http://kenbenoit.net/files/Hungarian.rtf'
Content type 'application/rtf' length 1003 bytes
==================================================
downloaded 1003 bytes

> cat(iconv(txt, from = "ISO-8859-2", "UTF-8"))
###  Translation from RTF performed by UnRTF, version 0.21.9 
### font table contains 0 fonts total
### invalid font number 0

-----------------
Nem tudjuk, mikor kezddhetnek meg Nagy-Britanniával a kilépési tárgyalások, csak azt, mikor kell befejezdniük - közölte Donald Tusk, az Európai Tanács elnöke. Jean-Claude Juncker, az Európai Bizottság elnöke bízik abban, hogy a brit parlamenti választások eredménye nem lesz hatással a Brexitrl szóló tárgyalásokra, így azok minél hamarabb megkezddnek Nagy-Britannia az Európai Unió között. A német külügyminiszter szerint a brit választás eredménye a Brexit elutasítását tükrözi. 

The first line should be:

Nem tudjuk, mikor kezdődhetnek 

(the long "ő" is obliterated in the conversion)
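
As a sanity check on the expected result, here is a minimal base R sketch. It assumes the source text is ISO-8859-2, where "ő" is the single byte 0xF5; whether Hungarian.rtf really uses that code page is not confirmed here. The byte sequence is made up for illustration. If iconv() converts the byte correctly on its own, that would suggest the character is dropped before the iconv() call rather than by it.

```r
# Assumption: the source text is ISO-8859-2, where byte 0xF5 is "ő".
# The bytes below spell "kezdőd" and are illustrative, not read from the file.
latin2_bytes <- as.raw(c(0x6B, 0x65, 0x7A, 0x64, 0xF5, 0x64))
latin2_text <- rawToChar(latin2_bytes)

# Convert the raw ISO-8859-2 bytes to UTF-8.
utf8_text <- iconv(latin2_text, from = "ISO-8859-2", to = "UTF-8")
print(utf8_text)

# The fifth character should be U+0151
# (LATIN SMALL LETTER O WITH DOUBLE ACUTE).
print(utf8ToInt(utf8_text)[5])
```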

unrtf Version: 1.0 is not compiling/installing

I just tried to install the package and got the following errors.

Warning messages:
1: running command '"W:/R-3.4._/App/R-Portable/bin/x64/R" CMD INSTALL -l "W:\R-3.4._\R_LIBS_USER_3.4._" W:\R-3.4._\R_USER_3.4.__R_STUDIO\AppData\Local\Temp\Rtmp4q6eEV/downloaded_packages/unrtf_1.0.tar.gz' had status 1
2: In install.packages("unrtf") :
  installation of package 'unrtf' had non-zero exit status
>
> install.packages("unrtf")
Installing package into 'W:/R-3.4._/R_LIBS_USER_3.4._'
(as 'lib' is unspecified)
Package which is only available in source form, and may need
  compilation of C/C++/Fortran: 'unrtf'
Do you want to attempt to install these from sources?
y/n:

The end of the output is:

W:/Rtools34/mingw_64/bin/gcc  -O2 -Wall  -std=gnu99 -mtune=core2 -o unrtf64 libunrtf/attr.o libunrtf/convert.o libunrtf/error.o libunrtf/hash.o libunrtf/main.o libunrtf/malloc.o     libunrtf/my_iconv.o libunrtf/output.o libunrtf/parse.o libunrtf/path.o libunrtf/unicode.o libunrtf/user.o libunrtf/util.o libunrtf/word.o  -lws2_32 -L"W:/R-34~1._/App/R-PORT~1/bin/x64" -lR -lRiconv
mkdir -p ../inst/bin
cp -f unrtf64 ../inst/bin/
cp -Rf share ../inst/
cp: cannot create regular file '../inst/share/html.conf': Permission denied
cp: cannot create regular file '../inst/share/latex.conf': Permission denied
cp: cannot create regular file '../inst/share/SYMBOL.charmap': Permission denied
cp: cannot create regular file '../inst/share/troff_mm.conf': Permission denied
cp: cannot create regular file '../inst/share/vt.conf': Permission denied
cp: cannot create regular file '../inst/share/rtf.conf': Permission denied
cp: cannot create regular file '../inst/share/text.conf': Permission denied
make: *** [unrtf64] Error 1
Warning: running command 'make -f "Makevars" -f "W:/R-34~1._/App/R-PORT~1/etc/x64/Makeconf" -f "W:/R-34~1._/App/R-PORT~1/share/make/winshlib.mk" SHLIB="unrtf.dll" WIN=64 TCLBIN=64 OBJECTS="register.o"' had status 2
ERROR: compilation failed for package 'unrtf'
* removing 'W:/R-3.4._/R_LIBS_USER_3.4._/unrtf'
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.4.0 tools_3.4.0

Should suppress header info

I see no need to include the header info in the output, so either disable it by default or make it an argument, similar to:

unrtf --text --quiet thefile.rtf

Ideally the output would not include any meta-information.

Note: These can be set in the .conf files.
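
One possible workaround is to strip the header after conversion. This is a minimal base R sketch, assuming the text output always separates the header from the body with a line of dashes, as in the Hello World output. `strip_unrtf_header()` is a hypothetical helper, not part of the package, and the sample string is copied from that output so the example runs without the package or a network connection.

```r
# Hypothetical helper (not part of the unrtf package): drop the metadata
# header that precedes the dashed separator line in text output.
strip_unrtf_header <- function(text) {
  lines <- unlist(strsplit(text, "\r?\n"))
  # Assumption: the first line consisting only of dashes ends the header.
  sep <- grep("^-{5,}[[:space:]]*$", lines)
  if (length(sep) == 0) {
    return(text)  # no separator found, leave the input untouched
  }
  body <- lines[-seq_len(sep[1])]
  paste(body, collapse = "\n")
}

# Shortened copy of the Hello World output, used as stand-in input.
sample_output <- paste(
  "###  Translation from RTF performed by UnRTF, version 0.21.9 ",
  "### font table contains 11 fonts total",
  "",
  "TITLE: It is an example test rtf-file to RTF2XML bean for testing",
  "",
  "-----------------",
  "It is an example test rtf-file to RTF2XML bean for testing",
  "",
  "Font size 10, plain text;",
  sep = "\n"
)
cat(strip_unrtf_header(sample_output))
```

In real use the input would be the string returned by unrtf(..., format = "text"). If the document body itself contains a line of dashes, only the first one is treated as the separator.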
