orange-opensource / documentare-simdoc

New developments are now done on Gitlab.com: https://gitlab.com/Orange-OpenSource/documentare/documentare-simdoc . Library and tools for similarity measurement, classification and clustering of digital content, and segmentation of images from digitized documents.

License: GNU General Public License v2.0

similarity-measurement classification clustering image-segmentation

documentare-simdoc's Introduction


Documentare, SimDoc library & tools

Library and tools for similarity measurement, classification and clustering of digital content, and segmentation of images from digitized documents.

Build

Prerequisites:

More information about Graphviz is provided in doc/graphviz/README.md

To build the core library: cd simdoc/core/java && ./mvnw clean install

To build the tools (LineDetection, NCD, etc.): cd simdoc/apps/ && ./mvnw clean install

Introduction

This software bundle is aimed at computer-aided transcription of digitised documents produced with scanners or digital cameras, although NCD and SimClustering can also be used for other purposes. The following tools are implemented:

  1. LineDetection: picture segmentation and extraction of candidate glyphs. OpenCV is needed.

  2. NCD: computes a similarity distance matrix over content or parts of content (i.e. glyphs), in an unsupervised and content-agnostic way.

  3. SimClustering: unsupervised and agnostic clustering of similar patterns. For visualisation and post-processing, GraphViz (including gvmap) is needed.
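As background, the NCD measure itself can be sketched in a few lines (an illustrative sketch using gzip as the compressor; the compressor and data preparation used by the actual NCD tool may differ):

```python
import gzip

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: close to 0 for very similar
    content, close to 1 for unrelated content."""
    cx = len(gzip.compress(x))
    cy = len(gzip.compress(y))
    cxy = len(gzip.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

text = b"the quick brown fox jumps over the lazy dog" * 20
noise = bytes(range(256)) * 4
print(ncd(text, text))   # small: a file is close to itself
print(ncd(text, noise))  # much larger: unrelated byte sequences
```

Pairwise values of this kind fill the distance matrix consumed by the clustering tools below.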

General syntax

used notation

$: shell prompt.
{fic}: file, directory or program name.
[param n]: optional computing parameter;
if present, n is the value of the parameter.

general syntax

Commands are launched in the usual way from a shell:

$ java -jar /{jar_directory}/{prog_name} {fic1} {fic2}… [param1 [value]] [param2 [value]]…
prog_name: {LineDetection-1.0.0.jar|Ncd-1.0.0.jar|Simclustering-1.0.0.jar}

All results are stored in JSON format (NCD also produces CSV files for quick analysis in Excel); their names begin with the name of the program that produced them.

The following sequences of processes must be run to obtain a visualization graph:

- For images of documents: LineDetection, NCD, PrepClustering, SimClustering, Graph,
and your GraphViz script

- For content in directories: NCD, PrepClustering, SimClustering, Graph,
and your GraphViz script

Programs functionalities

LineDetection

Options for LineDetection

{file_name}

Goal

This tool is used for digitised document segmentation and glyph (character or symbol) extraction. Segmentation is done in a very classical way, using adaptive thresholding and connected-component extraction in a digital picture with OpenCV algorithms. Identification and extraction of glyphs use a statistical approach on pattern size and topological neighbourhood to compute:

  • Text lines in the document,
  • Glyph and character sizes in a text line.

Results are stored in a JSON file named ld_page_geom_ready_for_ncd.json.gz; the patterns representing glyphs and characters are stored in a directory named ld_out; and a layered picture showing the obtained segmentation is produced, named bin.png. All output is written to the current directory.

Computing a matrix distance of similarity and preparing clustering

Since version 1.13, this function offers two modes, in order to allow computing on large data sets and to reduce the size of results: a "Regular" mode applied to file directories, and a "SimDoc" mode applied to digitized pictures of documents.

Global goal

The goal of this set of tools is to compute a lattice of distances between pairs of files in a directory, or between pairs of segmented pictures containing glyphs, obtained with LineDetection from a digitized document.

The regular or SimDoc mode is selected through a parameter on the NCD command line (-simDocJsonGz): with this parameter, NCD works in SimDoc mode (it needs a JSON file generated by LineDetection); otherwise, NCD works in regular mode.

NCD

SimDoc mode

This mode is dedicated to segmented images of documents produced with LineDetection, with the aim of applying our experimental OCR through the Zenobie interface.

It is NOT available for general content analysis.

To produce data ready for clustering, one just has to run the NCD program before clustering with SimClustering.

Regular mode

In this mode we use two programs: NCD and PrepClustering.

NCD computes a distance matrix and stores the results, in a generic array representation, in a JSON file named ncd_regular_files_model.json.gz.

PrepClustering builds an index of the files and computes a kind of topology based on distance triangles derived from the matrix generated by NCD. This matrix can be filtered by specifying a neighbourhood size for each element. This parameter samples a set of relevant distances: we keep each element with its k nearest neighbours.

The JSON file generated by PrepClustering is named prep_clustering_ready.json.gz.
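The k-nearest-neighbour sampling described above can be illustrated as follows (a hypothetical sketch over an in-memory distance matrix; the real PrepClustering works on NCD's gzipped JSON output):

```python
def knn_filter(matrix, k):
    """Keep, for each element, only its k nearest neighbours.

    `matrix` is a square list of lists of distances; returns, per row,
    the indices of the k closest other elements.
    """
    nearest = []
    for i, row in enumerate(matrix):
        # sort candidate neighbours by distance, excluding the element itself
        candidates = sorted((d, j) for j, d in enumerate(row) if j != i)
        nearest.append([j for _, j in candidates[:k]])
    return nearest

m = [[0.0, 0.2, 0.9],
     [0.2, 0.0, 0.8],
     [0.9, 0.8, 0.0]]
print(knn_filter(m, 1))  # each element keeps its single nearest neighbour
```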

Options for NCD

-file1 <file path>          first file (must be a directory)
-file2 <file path>          second file (must be a directory), not
                            mandatory (we will assume file2 = file1)
-help                       print this message
-simDocJsonGz <file path>   SimDoc model json gzip file

Options for PrepClustering

-f <file path>   NCD input file
-help            print this message
-k <knn>         k nearest neighbours
-writeCSV        write WRITE_CSV files (matrix, nearest)

SimClustering

Option for SimClustering

-ccut                                              enable clusters scalpel cut post treatments
-h                                                 print this message
-i <distances json gzip file path>                 path to json gzip file containing items distances
-scut                                              enable subgraphs scalpel cut post treatments
-sdArea <graph scissor Area SD factor>             graph scissor area's standard deviation factor
-sdQ <graph scissor Q SD factor>                   graph scissor equilaterality's standard deviation factor
-sdScut <subgraph scalpel SD factor>               subgraph scalpel standard deviation factor
-simDocJsonGz <SimDoc json gzip file path>         path to json gzip file containing SimDoc model ready for clustering
-tileCcut <cluster scalpel percentile threshold>   cluster scalpel percentile threshold
-wcut                                              enable subgraphs wonder cut post treatments

Goal

Cluster computing of patterns whose similarities are evaluated with NCD. This tool includes several steps corresponding to a statistical and progressive refinement of parameters. At each step, patterns are excluded from a subgraph or cluster, and all excluded patterns are coded as “singletons”, which acts as a rejection mechanism in cluster computing. We plan a recall strategy for these “singletons” in a future version, based on using NCD in classifier mode on multisets assembled from consistent clusters indexed by values.

  • Step 1: this step uses a global view of the distance matrix produced by NCD. A triangulation is processed on the distance matrix; each triangle is built from the current element: the second vertex is the first neighbour of the current element, and the third vertex is the first neighbour of the second vertex. Consequently, the edges of a triangle are given by: the distance between the first and second vertex, between the second and third vertex, and between the first and third vertex. For each triangle, the area and an equilaterality factor are computed and stored in a histogram. This allows computing averages and standard deviations that are used to prune triangles and build subgraphs representing a rough segmentation of similar patterns. The sensitivity of the pruning can be adjusted by a factor applied to the standard deviation of triangle areas and equilaterality (default value 2). This algorithm induces redundancy in edges, corresponding to adjacent triangles. This redundancy is used to evaluate the best representative pattern in a cluster.

  • Step 2: this step is focused on the edges of the subgraphs obtained in step 1. The computing method no longer deals with triangles but with the edges of subgraphs. The goal is to obtain homogeneous and isotropic clusters of similar patterns. The method consists of pruning edges by length, using statistical factors based on averages and standard deviations. We have implemented two pruning modes: a fast, but less accurate, mode, and an iterative mode.

    • The fast mode preserves only minimal-distance edges, connected to centroids in subgraphs. Centroids are determined from the redundant edges corresponding to adjacent triangles computed in step 1.
    • The iterative mode cuts edges until all edge lengths in a cluster fall under a threshold computed from the average and standard deviation of each remaining subgraph. The sensitivity of pruning can be adjusted with a factor applied to the standard deviation (default value 1). Obtaining clusters from subgraphs uses an algorithm close to Voronoï networks, as implemented in GraphViz (gvmap package). This method computes cluster boundaries inside a subgraph, depending on local minimal distances around centroids.
  • Step 3: pruning clusters. This step is optional. It consists of edge pruning using a quantile parameter to eliminate isolated extrema (default value: the third quartile).
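The area and equilaterality statistics used in step 1 can be sketched like this (an illustrative sketch: the exact equilaterality formula used by SimClustering is not specified above, so a perimeter-normalised ratio is assumed here):

```python
import math
import statistics

def heron_area(a, b, c):
    """Triangle area from its three edge lengths (Heron's formula)."""
    s = (a + b + c) / 2
    return math.sqrt(max(s * (s - a) * (s - b) * (s - c), 0.0))

def equilaterality(a, b, c):
    """1.0 for an equilateral triangle, tending to 0 as it degenerates
    (assumed form: area relative to the equilateral triangle of the
    same perimeter)."""
    p = a + b + c
    max_area = math.sqrt(3) / 36 * p * p
    return heron_area(a, b, c) / max_area

def prune(triangles, sd_factor=2.0):
    """Keep triangles whose area lies within mean + sd_factor * stdev."""
    areas = [heron_area(*t) for t in triangles]
    threshold = statistics.mean(areas) + sd_factor * statistics.pstdev(areas)
    return [t for t, ar in zip(triangles, areas) if ar <= threshold]
```

The `sd_factor` plays the role of the sensitivity factor on the standard deviation mentioned above (default 2).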

Results are stored as JSON in sc_page_geom.json.gz for use with Zenobie. For use with GraphViz, the file is sc_graph_input.json.gz.

Sequences of processes

Depending on whether regular or SimDoc mode is used, these sequences of programs must be applied:

regular mode : NCD-PrepClustering-SimClustering

SimDoc mode : NCD-SimClustering

About thumbnails

By default, NCD computes thumbnails using the "convert" program from the ImageMagick package (ensure that this distribution is installed on your computer), for documents in ".png", ".jpg", ".jpeg", ".tif", ".tiff" and ".pdf" formats. The pictures are stored in the "./thumbnails" subdirectory of the directory from which NCD was launched.

Other utilities

Exporting graphs to GraphViz

GraphViz uses a dedicated file format named “dot”. It is a text format whose specification can be found on the GraphViz website. To build a dot file from documentare, use the “Graph” utility (Graph-1.0.0.jar). It produces a file named graph.dot which is fully manageable with GraphViz. The Graph utility syntax allows you to add thumbnails of pictures (the content in ld_out), or reduced pictures from ld_out, from a specific directory. Options of the Graph utility:

-d <image directory>             directory containing images of vertices
-h                               print this message
-i <graph json gzip file path>   path to json gzip file containing graph
-simDocJsonGz                    simDocJson mode, will add .png extension automatically

Viewing graphs with GraphViz

Complete documentation is available on the GraphViz website. Here is an example of a command to render a graph in SVG format.

$ sfdp -Goverlap=prism -Gcharset=latin1 graph.dot | gvmap -e | neato -n2 -Tsvg > g.svg

Helpful java options

-Xmx<n>                Specifies the maximum size, in bytes, of the memory
                       allocation  pool.  This value must be a multiple of
                       1024 greater than 2 MB.  Append the letter k  or  K
                       to  indicate  kilobytes, the letter m or M to indi-
                       cate megabytes, the letter g or G to indicate giga-
                       bytes,  or the letter t or T to indicate terabytes.
                       The default value is 64MB. (useful on regular mode 
                       with a big set of data)

-XX:+HeapDumpOnOutOfMemoryError
                       As the name says: dumps the memory content to a
                       ".hprof" file when the program crashes

License

Copyright 2016 Orange

This software is distributed under the terms of the GPL v2 license; please see the license file.

Authors

Joël Gardes and Christophe Maldivi

documentare-simdoc's People

Contributors

boisset, christophemaldivi, joelgardes


Forkers

isabella232

documentare-simdoc's Issues

About thumbnails and data preparation

For increasing graph legibility, here are some enhancements concerning the thumbnails program:

  1. Non-graphical files
    Write the file name in an empty bitmap.
    General syntax:
    "convert -pointsize 36 -fill red -background white label:<file_name> ../thumbnails/<index_name>.png"
    including the subdirectory names of the main directory,
    but with discrete and tasteful colors.

  2. Graphical files
    Add a file name label to the thumbnail.
    General syntax to add to the present convert command:
    "label:'<initial_filename>' -gravity Center -append"
    <initial_filename> does not contain the directory path; other options stay unchanged.

Information on closed issue #17

  1. Using extracted data from documents (like PDF files)
    Such data must be considered as metadata of the source document. We may need to be able to compute clusters from such metadata.
    A first example exists in MyLib: the metadata are vectors of words extracted from PDFs, which are the content.
    The need is to preserve the relation between document name, metadata name and thumbnail name.

Currently, this relation is preserved through the filenames themselves. Vectors and thumbnails are produced by external tools. This implies either ensuring that the graph program always allows working with thumbnail directories (that means with real filenames), or adapting the thumbnails program to include "convert" scripts.

For "prep-data" that means the ability to store the source document, vector name and thumbnail presentation (see the attached document, in French, on the pivot format definition).

This issue replaces existing issues #17 #38 #39 (which are closed)

roadmap LT documentare.pdf

file conversion : about content and metadata

We should be able to compute clusters from metadata extracted from content.
A first example exists in MyLib: the metadata are vectors of words extracted from PDFs, which are the content.
In the NCD process we no longer have any relation between document names in PDFs and vector files, except the filenames (without extension).
The only way to display a readable graph is to have thumbnails of the PDFs inside.
Because these are "external" relations, the only way to create thumbnails is to execute an external program, and the only way to add these thumbnails to the graph is to use option -d of the graph program.
So, maintain the directory options in coming evolutions to allow external thumbnails.

SimdocServer: new API to load data from files

Current version of the API supports requests which provide a directory as a parameter to load data: we load all files from the provided directory.

BUT, we would like to provide a new way to load data, with a request which will look like this:

[
 { "id": "34838434",
   // Here the data will be loaded from a file resource
   "file": "dat/pdf/bill.pdf"
 },
 ...
 { "id": "45435345",
   // Here the data are already present as a Base64 string
   "bytes": "Sm9UT3Bo..."
 },
 ...
]

So compared to current version (input directory only), it will give us the following possibilities:

  • work only on a subpart of a directory
  • do not work with files, but provide bytes directly when data are small

Subgraph build: refactoring

ClusteringGraphBuilder class needs to be refactored, since it seems to be difficult to understand the main steps.

Add a 'sloop' boolean parameter in the ClusteringParameter class.

  • if scut != 0 => scutStart = scut
  • else scutStart = 3

The test ClusteringParametersTest needs to be updated.

NCD "JoTophe" tag prefix added to all input files

We want to generalize the use of a tag prefix to improve NCD results (suggested tag is "JoTophe").

So we suggest to:

  • remove the tag from raw files generated by the LineDetection segmentation tool
  • in NCD, add this tag as prefix for all input files

This single change will provide a new version which can be tested alone.

API Documentare : example query shown in swagger

I suggest displaying a functional example on our server with:

  • default values for clustering parameters,
  • true directories (to avoid "mismatched" directories, create a kind of junk area under /data-mylib, for example /data-mylib/example),
  • when launching the server with SimdocServer: an optional parameter indicating the root directory for shared data (i.e. "/data-mylib" or another name, depending on the client application); then, by testing the API with swagger, one can better understand the directory tree used.

To be discussed.

Memory management with huge data

There is perhaps a problem with memory management:

  1. even with the largest memory setting (managed with the -Xmx java option), a java heap error occurs on large directories containing raw pictures (1500 jpeg files of 1.8 MB each).
  2. the ncd process can work at 100% memory in some cases, without a java heap error: why?

(directory "/Claudia/NewYork" on both servers z620 and z820 under home directory of jyig5563)

[enhancement] suggestion for options in command lines

Need of harmonization in naming options in command lines

to be discussed.

LineDetection

no option -> can be maintained
[possible upcoming enhancement allowing input of a directory of pictures or a multipage picture]

NCD

-file1 and -file2 -> change to -d1 and -d2
[possible upcoming enhancement allowing to compare a file with a directory or two files]
-simDocJsonGz -> change to -simdoc
-help -> change to -h or -help indifferently

PrepClustering

-f -> change to -json (what does this parameter take? a compressed json file!)
-help -> change to -h or -help indifferently

SimClustering

-h -> change to -h or -help indifferently
-i -> change to -json
-simDocJsonGz -> change to -simdoc
-scut and -sdScut -> change -scut as optional parameter with optional standard deviation value (default 2)
-ccut and -tileCcut -> change to -ccut as optional parameter with optional quantile value (default 75)
-sdArea -> change to -sdarea as optional parameter with optional value (default 2)
-sdQ -> change to -sdq as optional parameter with optional value (default 2) <- the previously mistyped value was 3 (sorry!)
[explanation: using KNN makes it possible to avoid area and equilaterality computing with a small value of k]

Graph

-h -> change to -h or -help indifferently
-i -> change to -json

Using image raw format (transcoded or not in Base64) for computing NCD

A very strong improvement has been noticed during tests using the raw image format. The reasons are:

  1. a true bitmap is used for measurement (no dependence on file coding specifications)
  2. the length of byte sequences is bounded by the use of a fixed separation word (the "JoTophe" word)

One can also foresee that a simple Base64 coding of this raw sequence of bytes will bring the same improvement (to be discussed).

Therefore, using the raw or Base64 format represents a good way forward to improve NCD.

The general format of the measurement sequence should be "JoTophe"& or "JoTophe"&Base64()

This evolution should be made in two steps:

  1. raw format application for all kinds of pictures
  2. encapsulation in a Base64 coding, with consequences for the convergence of models (filesystem and JSON modes) (a new document model story to define).

No change in NCD, PrepClustering and SimClustering. Generalization of the dictionary principle to keep the relation between measured content and its data representation for measurement: changes in thumbnails (preparing an import module) and graph.

[All subjects] Voronoï network computing enhancement necessary

Despite correct results, we need to straighten things out with the Voronoï process in SimClustering.
Technical target: becoming progressively independent of GraphViz.
Functional: Voronoï networks are useful to refine clustering from triangulation: if a cluster candidate has many areas with short distances, one can suspect several clusters inside (important for OCR and other applications needing a fine and accurate segmentation).

Bug in ncd-remote

See message below.

jyig5563@g-z820-cm:~/NY_ref$ java -jar ../Bastet/ncd-remote-1.40.1.jar -j1 bytes-data.json

[NCD REMOTE - Start]
Assumes j2 = j1
J - Mem 6% 0.5/7.0GBNo suitable constructor found for type [simple type, class com.orange.documentare.core.prepdata.PreppedBytesData]: can not instantiate from JSON object (missing default constructor or creator, or perhaps need to add/enable type information?)
at [Source: java.io.FileInputStream@149494d8; line: 2, column: 3]

distances >= 1

Just exclude these distances in the similarity program, as with -scut, but with the sole criterion of distance >= 1.

  1. mode of correction: same decision test as for the -scut parameter (exclude the triangle) and same actions (if no more edges are connected to a node, flag it as a singleton); normally, we should have no changes to make in other programs.
  2. observations: a triangle containing an edge of length > 1 is in all cases a source of ambiguity for clustering, particularly with noisy data (images). In theory, a distance equal to 1 appears when byte strings are completely different, which explains the choice to exclude all distances >= 1.
  3. functional usage: no new parameter to define; the cause of this problem comes from the length of byte strings after run-length computing, and corresponds to an exception due to limit conditions of the distance measurement (equation right, but wrong data).
    This is a kind of "palliative surgery": when this run-length problem is corrected, this test will become "dead code".
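The proposed correction amounts to a simple filter over triangles (a hypothetical sketch; triangles are represented here as tuples of their three edge lengths):

```python
def exclude_ambiguous(triangles):
    """Exclude every triangle containing an edge of distance >= 1,
    mirroring the -scut decision test described above."""
    return [t for t in triangles if all(d < 1.0 for d in t)]

tris = [(0.2, 0.3, 0.4), (0.5, 1.0, 0.6), (0.1, 0.2, 0.25)]
print(exclude_ambiguous(tris))  # the triangle with the 1.0 edge is dropped
```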

suggested evolution of clustering (use of KNN) to be discussed

Using a "KNN" approach in SimClustering

Currently, SimClustering computes clusters on the whole distance matrix computed by NCD. This gives good results on coherent alignments in byte sequences, yielding a large interval of distance values between 0 and 1. Typical coherent alignments are found with text files or DNA sequences.

The problem with misaligned sequences, like pictures, is that it is difficult to evaluate the common parts between sequences, with the consequence that computed distances fall within a short interval, near 1.

One can hypothesise that we have very short common parts between two pictures, and that this property generates clusters including too much non-representative data, corresponding to noise in the measurement.

The idea consists of using just the first nearest distances in SimClustering, thereby eliminating the noise generated by non-representative data: this will not widen the intervals of distances, but will compute clusters only on really relevant distances.

The other advantage of this KNN use in cluster computing is that SimClustering will work on a reduced set of data.

The sequence of algorithms

NCD

The distances matrix computing program will not be modified.

PrepClustering

The other parameters will not be changed.

On the command line, a parameter ("-k") with a value (the number of retained neighbours) will filter the distance matrix, retaining the first k+1 neighbours. The +1 relates to cluster boundaries, which lie between the k-th and (k+1)-th neighbourhood of patterns.

PrepClustering will compute triangles where we find:

  • vertex1: the current pattern
  • vertex2: the first neighbour of vertex1
  • vertex3: the first neighbour of vertex2

And three segments, whose lengths are distances, joining:

  • vertex1-vertex2
  • vertex2-vertex3
  • vertex3-vertex1
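The triangle construction above can be sketched as follows (an illustrative sketch over an in-memory distance matrix; note that in real data the first neighbour of vertex2 may be vertex1 itself, a degenerate case whose handling is not specified here):

```python
def first_neighbour(matrix, i):
    """Index of the nearest other element of i in the distance matrix."""
    return min((d, j) for j, d in enumerate(matrix[i]) if j != i)[1]

def triangles(matrix):
    """One triangle per element: vertex1 is the current pattern,
    vertex2 its first neighbour, vertex3 the first neighbour of vertex2."""
    result = []
    for v1 in range(len(matrix)):
        v2 = first_neighbour(matrix, v1)
        v3 = first_neighbour(matrix, v2)
        edges = (matrix[v1][v2], matrix[v2][v3], matrix[v3][v1])
        result.append(((v1, v2, v3), edges))
    return result

m = [[0.0, 0.10, 0.50],
     [0.10, 0.0, 0.05],
     [0.50, 0.05, 0.0]]
print(triangles(m)[0])  # ((0, 1, 2), (0.1, 0.05, 0.5))
```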

By retaining only k+1 neighbours, we will just write the triangles that exist within the knn intervals. In fact, if vertex3 is not found in the filtered matrix, that means that vertex3 and vertex2 (which is, in that case, the (k+1)-th neighbour of vertex1) will not belong to the current cluster, because the distance represented by vertex3-vertex1 is outside the knn limit.

SimClustering

There are, normally, no modifications to make here. SimClustering will compute Heron's and equilaterality formulas, but on a reduced set of triangles. Patterns eliminated by this filter will be considered singletons.

Questions and remarks to discuss:

  • Are Heron's and equilaterality formulas computed in SimClustering, or am I wrong?
  • This evolution should be developed in a separate branch, because it needs validation through experiments with a large set of data.

ncd-remote : abnormal message

[POST REQUEST] http://g-z440-cm:8080
[POST REQUEST] http://g-z440-cm:8080
[POST REQUEST] http://g-z440-cm:8080
Service http://g-z440-cm:8080 can not handle more tasks
Request to http://g-z440-cm:8080 failed: Unexpected character ('<' (code 60)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: java.io.BufferedReader@714c002a; line: 1, column: 2]
Request to http://g-z440-cm:8080 failed: Unexpected character ('<' (code 60)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: java.io.BufferedReader@6d86613c; line: 1, column: 2]
Request to http://g-z440-cm:8080 failed: Unexpected character ('<' (code 60)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: java.io.BufferedReader@5232be6d; line: 1, column: 2]
Request to http://g-z440-cm:8080 failed: Unexpected character ('<' (code 60)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: java.io.BufferedReader@658e1d75; line: 1, column: 2]
Request to http://g-z440-cm:8080 failed: Unexpected character ('<' (code 60)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: java.io.BufferedReader@2e04eaed; line: 1, column: 2]
Request to http://g-z440-cm:8080 failed: Unexpected character ('<' (code 60)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: java.io.BufferedReader@301e6f87; line: 1, column: 2]
[SUCCESS 55%] from http://g-z440-cm:8080 took 229s (1.41s/elem)
[POST REQUEST] http://g-z440-cm:8080
[SUCCESS 55%] from http://g-z440-cm:8080 took 229s (1.42s/elem)
[POST REQUEST] http://g-z440-cm:8080
[POST REQUEST] http://g-z440-cm:8080
[POST REQUEST] http://g-z440-cm:8080
[POST REQUEST] http://g-z440-cm:8080
[POST REQUEST] http://g-z440-cm:8080
[POST REQUEST] http://g-z440-cm:8080
[POST REQUEST] http://g-z440-cm:8080

Tooling: tool to convert images to our internal RAW image format

The tool will be used this way: java -jar ImageToRaw.jar -d path/to/input/image/directory

=> it will generate a raw output directory containing the initial images converted to our raw image file format, i.e. the image matrix as a byte array, with a '\n' character at the end of each row.
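The described raw format can be sketched as follows (a minimal sketch serializing an in-memory greyscale matrix; the actual tool decodes real image files):

```python
def to_raw(matrix):
    """Serialize an image matrix (rows of 0-255 pixel values) to the raw
    format: pixel bytes row by row, with a newline byte (0x0A) at the
    end of each row."""
    return b"".join(bytes(row) + b"\n" for row in matrix)

image = [[0, 128, 255],
         [40, 20, 30]]
print(to_raw(image))  # 8 bytes: two 3-pixel rows, each newline-terminated
```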

[enhancement] Naming non graphical thumbnails

A useful enhancement: use ImageMagick's ability to transform ASCII files into bitmaps.

General syntax :
"convert -pointsize 36 -fill red -background white label:<file_name> ../thumbnails/<index_name>.png"
including subdirectories names of main directory

but with discrete and tasteful colors

thumbnails computing

When using simple text files (without extension), the sfdp script produces messages like this one:
Warning: No or improper image="./platypus.png" for node "c18_platypus_92".

That means that Graph systematically builds a list of thumbnails from filenames before exporting the .dot file. (conditional test missing?)

Naming nodes in similarity program

Most applications name duplicate files with [filename (n)].

This causes a problem in the similarity program when computing the temporary graph for the Voronoï process, because the "(" and ")" characters are incompatible (reserved markup) in dot files.

Consequence: the similarity program stops at the Voronoï computing stage.

Should we apply the same rules as in the final graph building for temporary graphs?

[Enhancement] Thumbnail presentation

For increasing graph legibility, here are two enhancements using imagemagick's convert properties:

  1. Non-graphical files
    Add the file name in an empty bitmap.
    General syntax:
    "convert -pointsize 36 -fill red -background white label:<file_name> ../thumbnails/<index_name>.png"
    including the subdirectory names of the main directory,
    but with discrete and tasteful colors.

  2. Graphical files
    Add a file name label to the thumbnail.
    General syntax to add to the present convert command:
    "label:'<initial_filename>' -gravity Center -append"

<initial_filename> does not contain the directory path; other options stay unchanged.

Attached: produced examples (non-graphical and graphical).

ncd-remote question

Am I right to suppose that we cannot launch more than one ncd-remote program at a time?

Simdoc platform dockerization

To ease deployment and running, we could package the simdoc servers (UI & backends) in one docker image, or several (compose).

[evolution] recapturing singletons in the similarity program

The aim of this functionality is to find new clusters inside the set of singletons produced by the similarity program. A further evolution will consist of studying fusion methods for clusters, which will be the recall strategy of our clustering method, using multisets (one cluster = one multiset).

  • First step: copying singletons into a dedicated directory. This function should be optional, for debugging and testing the new ncd.
  • Second step: including an "-enroll" parameter in the similarity program, which will compute new clusters from all singletons. In this case, similarity should work on the existing distance matrix. This relaunch of similarity will use the same parameters as the first launch.
  • Third step: complete integration of this functionality (docker).

troubleshooting with thumbnails with data produced in docker container

Need to review the processing:

  • if the mounted directory and the local subdirectory are different, symbolic links contain the path of the docker container.
  • detected in source-symlinks: one symbolic link is always missing
  • it is impossible to get a coherent graph (wrong link between node and thumbnail). Below are the messages displayed when executing the graphviz script (concerning all thumbnails):

Warning: No such file or directory while opening /home/jyig5563/data/outpdf/./safe-working-dir/0.png
Warning: No or improper image="/home/jyig5563/data/outpdf/./safe-working-dir/0.png" for node "c1_0_99"

including the "text block" level of the hierarchical document model in the interface

This functionality will allow users to select parts of a document before processing.

  • blocks will be derived from data computed by the linedetection program. Blocks could be paragraphs, columns or non-textual information. These blocks are pointable by a user click, and the sequence of pointing will be retained. Pointed text blocks will be sent to the clustering process.
  • blocks could also be pointed at by the user tracing a bounding box (for example, for non-textual blocks like pictures). In this case, the user should have the ability to indicate a "block type":
    - if image or graphical: erase all segmentation inside the block to produce a non-textual node in the JSON description,
    - if text: store this new text block in the model before sending it to the clustering process.

ncd-remote

Problem if the access rights of the directory containing the data to be clustered are not suitable: no processing is done, but ncd-remote keeps running.

Base 64 experiment

Here are some explanations for needed tools to evaluate similarity measurement using base 64 information transcoding.

The purpose

Base 64 is a generic information coding based on an alphabet of 64 byte values, which frees 64 other byte values to improve the similarity distance by replacing run-length with a more efficient algorithm. Another way forward will consist of applying a "combinatorial pattern matching" algorithm to evaluate direct byte-sequence segmentation (delimiters will be coded with the 64 released byte values after alphabet compression).

This purpose implies operating a reversible alphabet compression for data coding.

A first step of this experiment consists of evaluating, with the current algorithm, the measurement of similarities of information coded in base64, which would become a pivot format if good results are obtained.

Globally, we just have to add a base 64 transcoding option inside data preparation.

For pictures

  • We need to preserve the linearity of the bitmap, so before base64 coding a raw-format transcoding could be necessary. If possible, maintain the -raw option.

For other content

  • Generic (raw PDF, for example): add base64 coding facility.

  • Work on metadata (a vector of extracted words, for example): transcode the extracted data in base 64 and preserve the alignment with source files through a canonical filename (i.e. filename without extension), to permit thumbnail computation in the graph program with the -src option.
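The data-preparation change amounts to an optional, reversible transcoding step (a trivial sketch; the option name `to_base64` is hypothetical):

```python
import base64

def prepare(content: bytes, to_base64: bool = False) -> bytes:
    """Optionally transcode content to base64 before NCD measurement;
    the coding is reversible (base64.b64decode recovers the bytes)."""
    return base64.b64encode(content) if to_base64 else content

raw = b"some glyph bytes"
assert base64.b64decode(prepare(raw, to_base64=True)) == raw  # reversible
```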

Is that OK ?

[Bug] computing thumbnails from lfw

Thumbnail 0.png not produced. Message :

Exception in thread "main" com.orange.documentare.core.comp.nativeinterface.NativeException: Command line 'convert /home/jyig5563/lfw/lfw-deepfunneled/Aaron_Peirsol/._Aaron_Peirsol_0004.jpg[0] -thumbnail x300 -background white -alpha remove -polaroid -0 /home/jyig5563/lfw/thumbnails/0.png > /home/jyig5563/lfw/thumbnails/0.png.log' returned error code 1
at com.orange.documentare.core.comp.nativeinterface.NativeInterface.launchCmd(NativeInterface.java:59)
at com.orange.documentare.core.comp.nativeinterface.NativeInterface.launch(NativeInterface.java:48), etc...

All other thumbnails are OK; NCD traces attached:
trace.txt
