I spoke with Jack Dabrowski today about a problem with processing a large input file on Windows. The input and output files can be found below. The following sample command reproduces the issue.
It was run on a Windows 10 machine and yields this debug output:
C:\Path\To\Repo\cleora>C:\Path\To\Repo\cleora\target\debug\cleora.exe --input edges.tsv --dimension 100 --number-of-iterations 10 --columns="media complex::tropes" --output-dir Output
[2022-06-24T16:47:38Z INFO cleora] Reading args...
[src\main.rs:202] &config = Configuration {
    produce_entity_occurrence_count: true,
    embeddings_dimension: 100,
    max_number_of_iteration: 10,
    seed: None,
    prepend_field: false,
    log_every_n: 10000,
    in_memory_embedding_calculation: true,
    input: "edges.tsv",
    file_type: Tsv,
    output_dir: Some(
        "Output",
    ),
    output_format: TextFile,
    relation_name: "emb",
    columns: [
        Column {
            name: "media",
            transient: false,
            complex: false,
            reflexive: false,
            ignored: false,
        },
        Column {
            name: "tropes",
            transient: false,
            complex: true,
            reflexive: false,
            ignored: false,
        },
    ],
}
[2022-06-24T16:47:38Z INFO cleora] Starting calculation...
[src\pipeline.rs:25] &sparse_matrices = [
    SparseMatrix {
        col_a_id: 0,
        col_a_name: "media",
        col_b_id: 1,
        col_b_name: "tropes",
        edge_count: 0,
        hash_2_id: {},
        id_2_hash: [],
        row_sum: [],
        pair_index: {},
        entries: [],
    },
]
[2022-06-24T16:47:38Z INFO cleora::sparse_matrix] Number of entities: 6629
[2022-06-24T16:47:38Z INFO cleora::sparse_matrix] Number of edges: 13985
[2022-06-24T16:47:38Z INFO cleora::sparse_matrix] Number of entries: 27970
[2022-06-24T16:47:38Z INFO cleora::sparse_matrix] Total memory usage by the struct ~ 0 MB
[2022-06-24T16:47:40Z INFO cleora::pipeline] Number of lines processed: 10000
[2022-06-24T16:47:41Z INFO cleora::pipeline] Number of lines processed: 20000
[2022-06-24T16:47:43Z INFO cleora::pipeline] Number of lines processed: 30000
[2022-06-24T16:47:44Z INFO cleora::pipeline] Number of lines processed: 40000
[2022-06-24T16:47:46Z INFO cleora::pipeline] Number of lines processed: 50000
[2022-06-24T16:47:49Z INFO cleora::pipeline] Number of lines processed: 60000
[2022-06-24T16:47:53Z INFO cleora::pipeline] Number of lines processed: 70000
[2022-06-24T16:47:56Z INFO cleora] Finished Sparse Matrices calculation in 18 sec
[2022-06-24T16:47:56Z INFO cleora::embedding] Start initialization. Dims: 100, entities: 6629.
[2022-06-24T16:47:56Z INFO cleora::embedding] Done initializing. Dims: 100, entities: 6629.
[2022-06-24T16:47:56Z INFO cleora::embedding] Start propagating. Number of iterations: 10.
[2022-06-24T16:47:56Z INFO cleora::embedding] Done iter: 0. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:57Z INFO cleora::embedding] Done iter: 1. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:57Z INFO cleora::embedding] Done iter: 2. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:57Z INFO cleora::embedding] Done iter: 3. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:57Z INFO cleora::embedding] Done iter: 4. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:57Z INFO cleora::embedding] Done iter: 5. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:57Z INFO cleora::embedding] Done iter: 6. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:58Z INFO cleora::embedding] Done iter: 7. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:58Z INFO cleora::embedding] Done iter: 8. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:58Z INFO cleora::embedding] Done iter: 9. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:58Z INFO cleora::embedding] Done propagating.
[2022-06-24T16:47:58Z INFO cleora::embedding] Start saving embeddings.
[2022-06-24T16:47:58Z INFO cleora::embedding] Done saving embeddings.
[2022-06-24T16:47:58Z INFO cleora::embedding] Finalizing embeddings calculations!
[2022-06-24T16:47:58Z INFO cleora] Finished in 20 sec
C:\Path\To\Repo\cleora>
1. My binary (gnu-linux) panics on your input file during the data loading phase and fails immediately:
thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1', src/entity.rs:227:29
2. This is caused by lines containing only a single entity, without any other corresponding entities (e.g. line number 101 in your input file).
Such inputs are meaningless (because they do not represent an edge in the graph), and we do not handle them currently.
3. The code should raise an error and abort, but on Windows the failure apparently happens silently, and the code proceeds to the next phase even though not all inputs were loaded into memory.
We will introduce a proper workaround: handle such lines without errors and display a warning that "such lines are meaningless and will be skipped".
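The skip-and-warn behaviour described above could look roughly like the sketch below. This is not Cleora's actual loading code: `parse_edge_line` and `load_edges` are hypothetical names, and it assumes tab-separated input where every line must supply a non-empty value for each configured column.

```rust
/// Hypothetical helper (not part of Cleora): split one TSV line into columns.
/// Returns `None` when the line has fewer columns than expected or when any
/// column is empty, i.e. when the line cannot describe an edge.
fn parse_edge_line(line: &str, expected_columns: usize) -> Option<Vec<&str>> {
    let fields: Vec<&str> = line.trim_end_matches('\r').split('\t').collect();
    if fields.len() < expected_columns || fields.iter().any(|f| f.trim().is_empty()) {
        return None;
    }
    Some(fields)
}

/// Hypothetical loader: keep well-formed lines, warn about and skip the rest.
/// Returns the parsed lines together with the number of skipped lines.
fn load_edges<'a>(input: &'a str, expected_columns: usize) -> (Vec<Vec<&'a str>>, usize) {
    let mut edges = Vec::new();
    let mut skipped = 0;
    for (idx, line) in input.lines().enumerate() {
        match parse_edge_line(line, expected_columns) {
            Some(fields) => edges.push(fields),
            None => {
                skipped += 1;
                // Report the 1-based line number instead of indexing out of bounds.
                eprintln!(
                    "warning: line {} does not describe an edge and will be skipped",
                    idx + 1
                );
            }
        }
    }
    (edges, skipped)
}
```

Calling `load_edges(contents, 2)` on a file with the two columns `media` and `tropes` would then skip a line such as line 101 with a warning, rather than panicking on Linux or continuing silently on Windows.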
If there are any other materials necessary for addressing this issue, please reach out. Thank you very much!