alfresco-bulk-import's Introduction

Alfresco Bulk Import Tool

What Is It?

A high performance bulk import tool for the open source Alfresco Document Management System.

"'High Performance', you say?"

Why yes. Alfresco's built-in mechanisms for moving large amounts of content into the repository (the various file-server protocols, the venerable ACP mechanism, the mind-bogglingly inefficient CMIS standard etc.) all suffer from a variety of limitations that make them a lot slower than the core Alfresco repository. This tool cuts out virtually all of that nonsense, attempts to maximise "mechanical sympathy" (which, for Alfresco, basically means treating your database nicely), and makes one or two large and opinionated assumptions that allow it to be a lot faster than anything else out there.

In terms of benchmarks, the old v1.x versions of the tool have regularly demonstrated sustained ingestion rates of over 500 documents per second in production environments, and in testing, the v2.x version has been shown to be up to 4X faster than 1.x (in specific circumstances, notably for streaming imports).

Documentation

Resources

Older resources (less relevant for v2.0+):

What's New?

Contributing

Please see Contributing.

Attributions

Commercial Support

This extension is not supported by Alfresco Software Inc., although a fork of an early, pre-release version of this tool has been included in Alfresco Enterprise since v4.0, and has (at times) been supported by Alfresco support.

Please note that the embedded fork has never been rebased against upstream, meaning that it is ancient - equivalent to v1.0-RC1 (circa mid-2010). It also introduced a number of serious bugs (e.g. incorrect "source striping" algorithm, no support for Alfresco clusters) that the original edition never had. The embedded fork has also been independently measured to be around 25% slower than the original edition available here.

tl;dr: use of the embedded fork is STRONGLY discouraged!

License

Copyright Peter Monks 2007. Licensed under the Apache 2.0 License.

alfresco-bulk-import's People

Contributors

jeremie-lesage, pmonks, zhihailiu

alfresco-bulk-import's Issues

Hybrid Inplace bulk import

I am generating large sets of data for BFSIT to do testing within Alfresco. I am using in-place ingestion as I am creating well-balanced folder hierarchies. While doing this, it makes sense for a current client to perform in-place ingestions, but their source content is not in a well-balanced structure (too many Foster's the night before).

What (I think) would be an interesting approach is a hybrid model: in other words, separating out the metadata.properties.xml files from the content. This would allow time to move the content files into a well-balanced folder hierarchy and drop this under the contentstore, and have a metadata.properties.xml folder hierarchy representative of how this would look in Share.

I guess to achieve this, we would need to specify the cm:content property

contentUrl=store://2014/2/16/9/10/eb7368ab-adc2-413e-a275-066cb2d72417.bin|mimetype=application/vnd.visio|size=107008|encoding=UTF-8|locale=en_US_|id=9507372

Being able to define the contentUrl=store://my/new/ingestion/folder/is/well/balanced/1234.vsd

to point to where the content would reside within the contentstore may solve this but I assume we would still need to also define the rest of the values like in the example?

Cheers,
Colin.
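
The cm:content value quoted in this issue is a pipe-delimited list of key=value pairs. As a rough illustration of what a metadata generator would have to produce or read, here is a minimal sketch that splits such a string into its fields. The class name ContentDataParser is hypothetical; it is not part of the bulk import tool or of Alfresco's API, and whether the tool would accept every field is an open question in the issue itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the tool or Alfresco.
final class ContentDataParser {
    private ContentDataParser() {}

    // Splits "key=value|key=value|..." into a map that preserves field order.
    // Only the first '=' in each field separates the key from the value.
    static Map<String, String> parse(String contentData) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : contentData.split("\\|")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Malformed field: " + pair);
            }
            fields.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> fields = parse(
            "contentUrl=store://2014/2/16/9/10/eb7368ab-adc2-413e-a275-066cb2d72417.bin"
            + "|mimetype=application/vnd.visio|size=107008|encoding=UTF-8|locale=en_US_|id=9507372");
        // Print each field on its own line.
        fields.forEach((key, value) -> System.out.println(key + " -> " + value));
    }
}
```

This sketch assumes that neither keys nor values contain a pipe character.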

Understanding the directory structure

Hey Peters!
I managed to successfully import this project into Eclipse, but it was then further divided into 4 separate projects:

  1. alfresco-bulk-import
  2. alfresco-bulk-import-api
  3. alfresco-bulk-import-parent
  4. alfresco-bulk-import-source

Basically, I want to make some changes to the UI and the procedure, such as:

  1. showing the progress of files uploading
  2. showing the files failed during operation

It would be great to understand which project(s) I need to work on. Also, after editing the code, in what order do I need to build the projects to get correct .amp and .war files, and which goals do I need to assign to each project?

Thanks in advance.

Add ability to set the content URL in the metadata properties file

This would be useful in some instances where the files are coming from another repository where the internal folder structure does not match the repository folder structure (e.g. migrating from another Alfresco repository). This would make it possible to just move the content store files over without re-arranging them.

Residual property support

Currently the tool ignores properties it can't find the definition of in the running data dictionary, which precludes importing residual properties.

This restriction should be removed, though a log entry (probably at "info" level) should still be emitted if a property can't be found in the data dictionary.
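
The proposed behaviour could be sketched roughly as follows. This is only an illustration under stated assumptions: a plain Set of property names stands in for the running data dictionary, java.util.logging stands in for the tool's logging, and the class and method names (ResidualPropertyFilter, keepAll) are hypothetical rather than taken from the tool's code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.logging.Logger;

// Hypothetical sketch; the Set stands in for Alfresco's data dictionary.
final class ResidualPropertyFilter {
    private static final Logger LOG =
        Logger.getLogger(ResidualPropertyFilter.class.getName());

    private ResidualPropertyFilter() {}

    // Returns every property from the metadata, including those with no
    // definition in the dictionary, and logs the undefined ones at info level.
    static Map<String, String> keepAll(Map<String, String> metadata,
                                       Set<String> knownProperties) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            if (!knownProperties.contains(entry.getKey())) {
                LOG.info("Property '" + entry.getKey()
                    + "' not found in the data dictionary; importing it as a residual property.");
            }
            result.put(entry.getKey(), entry.getValue());
        }
        return result;
    }
}
```

The difference from the current behaviour described above is that the unknown property is still placed in the result instead of being dropped.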

Migrations can be long and dull, BFSIT could be more entertaining...

Migrated from https://code.google.com/p/alfresco-bulk-filesystem-import/issues/detail?id=130

It would be ideal if the bulk filesystem import tool had a humour option to provide the user, performing the ingestion, something to help pass the time.

Maybe after each batch is successfully ingested, a witty story could be displayed? Maybe this could connect to an external feed server where other users of BFSIT could provide words of wisdom and encouragement (or witty anecdotes of migrations gone wrong).

Or we can continue to watch the graphs ;)

This could involve something as simple as periodically showing a fortune, e.g. from http://iheartquotes.com/api

In thinking about this more, I think clippy.js [1] would provide a more "Enterprise grade" humour experience.

[1] https://www.smore.com/clippy-js

That would be fun and nostalgic. Of course, you could just link into your twitter feed ....

Folders with custom sub-types of cm:folder are not recursed into

I'm fiddling with bulk importer here and I ran into trouble importing custom folders (Alfresco 5.0d and v2 RC2 bulk importer BTW).

I am importing a hierarchy of folders with a custom folder type defined in my model. The root folder gets imported but its subfolders aren't.

The root folder has its metadata file:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <entry key="separator"> # </entry>  <!-- 3 character delimiter: space, hash, space -->
    <entry key="namespace">http://www.neoburg.com/model/application/1.0</entry>
    <entry key="parentAssociation">cm:contains</entry>
    <entry key="type">nbp:folder</entry>
</properties>

where nbp:folder is my custom folder, which is nothing more than a relabeled cm:folder

<!-- base folder for all NBP folders -->
        <type name="nbp:folder">
          <title>Neoburger base folder</title>
          <parent>cm:folder</parent>
        </type>

The import directory is

source/
|------->mycustomfolder_1/
       |-> subfolder_11/
       |-> subfolder_12/
|------->mycustomfolder_2/
       |-> subfolder_21/

where the subfolders are just simple folders.

What I get in the destination is just:

|------->mycustomfolder_1/
|------->mycustomfolder_2/

(No errors in the logs.)

Just to try things out, I also created metadata files for the subfolders, but to no avail. However, if I bulk import the subfolders separately into mycustomfolder_1, they all get inserted into the repository.

Importing without the custom folder type works well too and I can manually subclass the folders into nbp:folder without breaking anything.

Allow imports to be paused and resumed

It should be possible for the user to pause an import and resume it later on. This would complete all in-progress batches but block (awaiting resumption) before any new batches are processed.
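
One way to get this behaviour is a gate that each worker checks before it starts a new batch: batches already running finish normally, while new ones wait until the import is resumed. The following is a minimal sketch of that idea using java.util.concurrent.locks. The class name PauseGate and its methods are hypothetical, and the sketch says nothing about how the tool's own scanner and thread pool would need to call it.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of a pause/resume gate; not taken from the tool's code.
final class PauseGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition resumed = lock.newCondition();
    private boolean paused = false;

    // Requests a pause; batches already in progress are not affected.
    void pause() {
        lock.lock();
        try {
            paused = true;
        } finally {
            lock.unlock();
        }
    }

    // Lifts the pause and wakes every thread waiting at the gate.
    void resume() {
        lock.lock();
        try {
            paused = false;
            resumed.signalAll();
        } finally {
            lock.unlock();
        }
    }

    boolean isPaused() {
        lock.lock();
        try {
            return paused;
        } finally {
            lock.unlock();
        }
    }

    // Call before starting a new batch; blocks while the import is paused.
    void awaitIfPaused() throws InterruptedException {
        lock.lock();
        try {
            while (paused) {
                resumed.await();
            }
        } finally {
            lock.unlock();
        }
    }
}
```

A worker would call awaitIfPaused() at the top of its batch loop. Because the wait is interruptible, stopping the import while it is paused still works.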

Provide a custom repository action for initiating the bulk import tool

Migrated from https://code.google.com/p/alfresco-bulk-filesystem-import/issues/detail?id=80

Provide a custom repository action [1] for initiating the bulk import tool.

Related to this, there should also be a way to directly monitor the status of an in-progress import, although this can't be done via a custom repository action because actions only support commands, not queries (in the "Command Query Separation Principle" sense of those words).

[1] http://wiki.alfresco.com/wiki/Custom_Actions

Need browsers for both Files Path and Target Fields in Import UI

Even with the magic dropdown functionality it can still be difficult to get the Files Path and Target fields populated especially if you are doing a series of imports.

Would like to see both fields have browsers so we can browse and pick Folders and Files as well as Target spaces.

Add travis CI build

Even though the unit tests can't be run by Travis, it would be ideal to have basic compile-checking via Travis.

Update JQuery

Update JQuery to whatever the latest is and fix any fallout.

Version numbers as they appear on disk should be used for versions in the repository

Migrated from https://code.google.com/p/alfresco-bulk-filesystem-import/issues/detail?id=85

Currently the tool ignores the version numbers that appear on disk (with one exception - see below) and instead starts the version history at 1.0 and increments by 0.1 or 1.0 for each additional version. Instead the tool should use the exact version numbers that appear on disk, even when a version series is gappy.

Note: this is currently not possible - Alfresco does not allow version labels to be explicitly set. Technically speaking, the "cm:versionLabel" property is a protected property that cannot be set directly. See http://issues.alfresco.com/jira/browse/ALF-10155

Implement a sample pluggable source

Implement a sample pluggable source in the config folder, to both demonstrate how it's done and to prove that the pluggable source APIs are at an appropriate level of abstraction.

Relates to issue #2.

Incorrect exception handling around out-of-order retries

I am trying to use this tool with Alfresco version 4.2.x and I got the following error during one of my imports:

2015-08-12 10:57:11,021  INFO  [bulkimport.impl.Scanner] [BulkImport-Scanner] BULKIMPORT: Import (in-place) started from Default.
 2015-08-12 10:57:32,290  ERROR [bulkimport.impl.Scanner] [BulkImport-Importer-0001] BULKIMPORT: Bulk import from 'Default' failed.
 org.alfresco.extension.bulkimport.impl.ItemImportException: Unexpected exception:
 class org.alfresco.extension.bulkimport.impl.OutOfOrderBatchException: null
While importing item: May (1 version):
    HEAD: <no content> <metadata>
    at org.alfresco.extension.bulkimport.impl.BatchImporterImpl.importItem(BatchImporterImpl.java:230)
    at org.alfresco.extension.bulkimport.impl.BatchImporterImpl.importBatchImpl(BatchImporterImpl.java:193)
    at org.alfresco.extension.bulkimport.impl.BatchImporterImpl.access$200(BatchImporterImpl.java:67)
    at org.alfresco.extension.bulkimport.impl.BatchImporterImpl$2.execute(BatchImporterImpl.java:161)
    at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:454)
    at org.alfresco.extension.bulkimport.impl.BatchImporterImpl.importBatchInTxn(BatchImporterImpl.java:152)
    at org.alfresco.extension.bulkimport.impl.BatchImporterImpl.access$000(BatchImporterImpl.java:67)
    at org.alfresco.extension.bulkimport.impl.BatchImporterImpl$1.doWork(BatchImporterImpl.java:128)
    at org.alfresco.repo.security.authentication.AuthenticationUtil.runAs(AuthenticationUtil.java:548)
    at org.alfresco.extension.bulkimport.impl.BatchImporterImpl.importBatch(BatchImporterImpl.java:122)
    at org.alfresco.extension.bulkimport.impl.Scanner$BatchImportJob.run(Scanner.java:478)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: org.alfresco.extension.bulkimport.impl.OutOfOrderBatchException
    at org.alfresco.extension.bulkimport.impl.BatchImporterImpl.getParent(BatchImporterImpl.java:334)
    at org.alfresco.extension.bulkimport.impl.BatchImporterImpl.findOrCreateNode(BatchImporterImpl.java:252)
    at org.alfresco.extension.bulkimport.impl.BatchImporterImpl.importItem(BatchImporterImpl.java:209)
    ... 13 more
2015-08-12 10:57:34,446  ERROR [bulkimport.impl.Scanner] [BulkImport-Scanner] BULKIMPORT: Bulk import from 'Default' failed.
 java.util.concurrent.RejectedExecutionException: Task org.alfresco.extension.bulkimport.impl.Scanner$BatchImportJob@3b1acf89 rejected from org.alfresco.extension.bulkimport.impl.BulkImportThreadPoolExecutor@5d2f49af[Shutting down, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 4]
    at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
    at org.alfresco.extension.bulkimport.impl.Scanner.submitBatch(Scanner.java:359)
    at org.alfresco.extension.bulkimport.impl.Scanner.submitCurrentBatch(Scanner.java:331)
    at org.alfresco.extension.bulkimport.impl.Scanner.submit(Scanner.java:300)
    at org.alfresco.extension.bulkimport.source.fs.FilesystemBulkImportSource.scanDirectory(FilesystemBulkImportSource.java:205)
    at org.alfresco.extension.bulkimport.source.fs.FilesystemBulkImportSource.scanDirectory(FilesystemBulkImportSource.java:236)
    at org.alfresco.extension.bulkimport.source.fs.FilesystemBulkImportSource.scanDirectory(FilesystemBulkImportSource.java:236)
    at org.alfresco.extension.bulkimport.source.fs.FilesystemBulkImportSource.scanDirectory(FilesystemBulkImportSource.java:236)
    at org.alfresco.extension.bulkimport.source.fs.FilesystemBulkImportSource.scanDirectory(FilesystemBulkImportSource.java:236)
    at org.alfresco.extension.bulkimport.source.fs.FilesystemBulkImportSource.scanDirectory(FilesystemBulkImportSource.java:236)
    at org.alfresco.extension.bulkimport.source.fs.FilesystemBulkImportSource.scanFolders(FilesystemBulkImportSource.java:158)
    at org.alfresco.extension.bulkimport.impl.Scanner.run(Scanner.java:178)
    at java.lang.Thread.run(Thread.java:724)
2015-08-12 10:57:37,877  INFO  [bulkimport.impl.LoggingBulkImportCompletionHandler] [BulkImport-Scanner] BULKIMPORT: In place bulk import completed (Failed) in 26s 855.785ms.
    Batches:        7 imported of 9 submitted (0.261 / sec)
    Nodes:          700 (26.065 / sec)
    Bytes:          0 (0.000 / sec)
    Versions:       0
    Metadata properties:    4900
    Files:          0 in-place, 0 streamed, 0 skipped
    Out-of-order batches:   0

Looking at the code, I think there is a bug in the class org.alfresco.extension.bulkimport.impl.BatchImporterImpl at line 227: you are catching all exceptions and not rethrowing the OutOfOrderBatchException, which is what causes the batch to be re-queued at line 170. As a result, none of the batches that hit an OutOfOrderBatchException are re-queued, and I get an error instead.

I have fixed this issue using the following code:

private final void importItem(final NodeRef                               target,
                              final BulkImportItem<BulkImportItemVersion> item,
                              final boolean                               replaceExisting,
                              final boolean                               dryRun)
    throws InterruptedException, OutOfOrderBatchException
{
    try
    {
        if (trace(log)) trace(log, "Importing " + (item.isDirectory() ? "directory " : "file ") + String.valueOf(item) + ".");

        NodeRef nodeRef     = findOrCreateNode(target, item, replaceExisting, dryRun);
        boolean isDirectory = item.isDirectory();

        if (nodeRef != null)
        {
            // We're creating or replacing the item, so import it
            if (isDirectory)
            {
                importDirectory(nodeRef, item, dryRun);
            }
            else
            {
                importFile(nodeRef, item, dryRun);
            }
        }

        if (trace(log)) trace(log, "Finished importing " + String.valueOf(item));
    }
    catch (final OutOfOrderBatchException oobe)
    {
        throw oobe; // gmalanga: catch and throw the exception
    }
    catch (final Exception e)
    {
        // Capture the item that failed, along with the exception
        throw new ItemImportException(item, e);
    }
}

cannot import into an RM site

When I try to import into an RM site, I get an access denied exception.

2015-07-17 15:25:51,448 INFO [org.alfresco.extension.bulkimport.impl.Scanner] [BulkImport-Scanner] BULKIMPORT: Import (streaming) started from Default.
2015-07-17 15:25:51,550 DEBUG [org.alfresco.extension.bulkimport.impl.BatchImporterImpl] [BulkImport-Scanner] BULKIMPORT: Importing Batch #1, 1 items, 0 bytes.
2015-07-17 15:25:51,566 DEBUG [org.alfresco.extension.bulkimport.impl.BatchImporterImpl] [BulkImport-Scanner] BULKIMPORT: Finding parent folder ''.
2015-07-17 15:25:51,806 ERROR [org.alfresco.extension.bulkimport.impl.Scanner] [BulkImport-Scanner] BULKIMPORT: Bulk import from 'Default' failed.
org.alfresco.extension.bulkimport.impl.ItemImportException: Unexpected exception:
class org.alfresco.repo.security.permissions.AccessDeniedException: 06170051 Access Denied. You do not have the appropriate permissions to perform this operation.
While importing item: Military Assignment Documents (1 version):
HEAD:
at org.alfresco.extension.bulkimport.impl.BatchImporterImpl.importItem(BatchImporterImpl.java:230)
at org.alfresco.extension.bulkimport.impl.BatchImporterImpl.importBatchImpl(BatchImporterImpl.java:193)
at org.alfresco.extension.bulkimport.impl.BatchImporterImpl.access$200(BatchImporterImpl.java:67)
at org.alfresco.extension.bulkimport.impl.BatchImporterImpl$2.execute(BatchImporterImpl.java:161)
at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:454)
at org.alfresco.extension.bulkimport.impl.BatchImporterImpl.importBatchInTxn(BatchImporterImpl.java:152)
at org.alfresco.extension.bulkimport.impl.BatchImporterImpl.access$000(BatchImporterImpl.java:67)
at org.alfresco.extension.bulkimport.impl.BatchImporterImpl$1.doWork(BatchImporterImpl.java:128)
at org.alfresco.repo.security.authentication.AuthenticationUtil.runAs(AuthenticationUtil.java:548)
at org.alfresco.extension.bulkimport.impl.BatchImporterImpl.importBatch(BatchImporterImpl.java:122)
at org.alfresco.extension.bulkimport.impl.Scanner.submitCurrentBatch(Scanner.java:336)
at org.alfresco.extension.bulkimport.impl.Scanner.run(Scanner.java:199)
at java.lang.Thread.run(Thread.java:745)
2015-07-17 15:25:51,810 INFO [org.alfresco.extension.bulkimport.impl.LoggingBulkImportCompletionHandler] [BulkImport-Scanner] BULKIMPORT: Streaming bulk import completed (Failed) in 00s 352.001ms.
Batches: 0 imported of 1 submitted (0.000 / sec)
Nodes: 0 (0.000 / sec)
Bytes: 0 (0.000 / sec)
Versions: 0
Metadata properties: 0
Files: 0 in-place, 0 streamed, 0 skipped
Out-of-order batches: 0
Regards

Brian

Add support for d:content properties

Migrated from https://code.google.com/p/alfresco-bulk-filesystem-import/issues/detail?id=62

Need to add support for metadata properties of type d:content.

Note to self: this isn't as trivial as it sounds, at least if "reads during write transactions" are to be avoided. The problem is in figuring out which metadata properties are of type d:content - this requires a call to the DictionaryService, which in turn reads (i.e. SELECTs from) the database (or may, depending on a whole slew of factors).

Further note to self: we're already hitting DictionaryService to determine whether a property is multi-valued or not. Given that we're already taking the hit we might as well go ahead and implement this too.

Additional note to self: there may be some complexity in figuring out how to stream large content into a d:content field. I don't think we'd want it inline in the XML properties file, for example...

Source APIs need to be separated into their own API JAR

The various interfaces & classes that support the development of pluggable sources need to be separated out of the core into their own API JAR. Amongst other benefits this will help to prevent custom sources from accidentally becoming dependent on internal (private) classes.

Changes to default contentstore wiring in 5.x prevent in-place from working

There have been two separate reports that the logic that detects whether an in-place import is possible or not is broken on Alfresco 5.0 in the 1.x versions of the tool. Given that that logic hasn't changed in 2.0, we need to look into enhancing that logic to handle the new default Alfresco 5.0 contentstore bean configuration.

Issue importing files in linux where the filenames contain portuguese characters

This has only been reproduced in a Linux environment with the attached dataset. To reproduce the issue, install an Alfresco 5 instance, apply the latest bulk import tool AMP and follow the steps below:

  • Unzip the provided tar on the target server into a directory of your choice. Take note of this directory, including the tony/ subdirectory, which contains the files for the test.
  • Log in to Alfresco as admin and create a folder for the import test
  • Open the bulk import tool interface (http://localhost:8080/alfresco/s/bulk/import)
  • Enter the location of the "tony" folder (i.e. /home/alfowner/testdirs/tony)
  • Enter the location of the folder in the repo (i.e. /mytest1)

In the Alfresco logs there are warnings about files being unreadable:

WARN [source.fs.DirectoryAnalyser] [BulkImport-Scanner] BULKIMPORT: Skipping '/home/alfowner/testdirs/tony/135 Carbon���13 NMR spectroscopy_DS_NS_final_cau2.txt' as Alfresco does not have permission to read it.

Even though the permissions are right on the server side, I believe the issue is with the File.canRead() method. In my tests I found that code using java.nio.file.DirectoryStream and java.nio.file.Files.isReadable() is able to read these files properly.

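import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// The class name is only a placeholder so that the snippet compiles on its own.
public class ReadableCheck {
    public static void main(String[] args) throws IOException {
        System.out.println("Folder provided was: " + args[0]);
        Path folder = Paths.get(args[0]);
        int unreadableCount = 0;

        // try-with-resources closes the directory stream when the listing is done
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(folder)) {
            for (Path p : ds) {
                if (!Files.isReadable(p)) {
                    unreadableCount++;
                    System.out.println("File not readable: " + p);
                } else {
                    System.out.println("Read file:     " + p);
                }
            }
        }
        System.out.println("Unreadable files: " + unreadableCount);
    }
}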

testdirsforbi.tar.gz

Status graphs don't work on SSL connections

Migrated from https://code.google.com/p/alfresco-bulk-filesystem-import/issues/detail?id=134

What steps will reproduce the problem?

  1. Initiate an import on an Alfresco server configured with HTTPS access only
  2. View the status screen

What is the expected output? What do you see instead?
Expected: The live graph showing the number of files read/written.
Actual: A screen with graphs missing and an error on the console saying:

[blocked] The page at https:///alfresco/service/bulk/import/filesystem/status ran insecure content from http://yui.yahooapis.com/3.8.0/build/simpleyui/simpleyui-min.js.
status:1
Uncaught ReferenceError: Y is not defined

What version of the product are you using? On what operating system?
alfresco-bulk-filesystem-import-33-1.2.1.amp
OSX Mavericks and Ubuntu 10.0.4 LTS

systematic failure (IllegalStateException) importing a directory from the file system

Hi there,

I am experiencing a systematic failure trying to import a directory from the file system.
The import fails with the following message:


2015-09-14 01:57:16,142 INFO [bulkimport.impl.LoggingBulkImportCompletionHandler] [BulkImport-Scanner] BULKIMPORT: In place bulk import completed (Failed) in 11m 48s 131.246ms.
Batches: 103 imported of 130 submitted (0.145 / sec)
Nodes: 10300 (14.545 / sec)
Bytes: 0 (0.000 / sec)
Versions: 0
Metadata properties: 72080
Files: 0 in-place, 0 streamed, 0 skipped
Out-of-order batches: 65409


The problem seems to be related to the high number of out-of-order batches. The following exception is printed in the log files before the bulk import failure message:


2015-09-14 01:57:13,032 ERROR [bulkimport.impl.Scanner] [BulkImport-Scanner] BULKIMPORT: Bulk import from 'Default' failed.
java.lang.IllegalStateException: Attempt to register more than 65535 parties for java.util.concurrent.Phaser@173710b6[phase = 0 parties = 65535 arrived = 65508]
at java.util.concurrent.Phaser.doRegister(Phaser.java:429)
at java.util.concurrent.Phaser.register(Phaser.java:579)
at org.alfresco.extension.bulkimport.impl.Scanner.submitBatch(Scanner.java:358)
at org.alfresco.extension.bulkimport.impl.Scanner.submitCurrentBatch(Scanner.java:331)
at org.alfresco.extension.bulkimport.impl.Scanner.submit(Scanner.java:300)


Reading the log files, we noticed many WARN lines saying that a node cannot be created in Alfresco because a parent folder is missing, and for that reason the whole batch is re-submitted. It seems that when too many batches are submitted, the threshold of 65535 parties is reached and the IllegalStateException is raised.
We don't know why the parent folder was not created in advance; there is no message in the log about an error creating a folder.

The root folder we are trying to import has:

  • 17,331 folders (it's a complex folder structure)
  • 206,526 files (including the metadata files)
  • about 91 GB of data

We have tried to import it with the following configurations:

  • 5 threads, batch of 50
  • 5 threads, batch of 1
  • 1 thread, batch of 50

Many thanks in advance,
Alessio
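
The 65535 limit in the stack trace above is a property of java.util.concurrent.Phaser itself: a single Phaser cannot have more than 65535 registered parties, and its documentation recommends tiered phasers for larger sets of participants. The sketch below reproduces the exception in isolation and shows the tiered arrangement. The class and method names are hypothetical, and whether tiering is the right fix for the tool, as opposed to registering fewer parties, is not something this issue establishes.

```java
import java.util.concurrent.Phaser;

// Hypothetical demo class; only java.util.concurrent.Phaser is a real API here.
final class PhaserLimitDemo {
    // Maximum number of parties a single Phaser accepts.
    static final int MAX_PARTIES = 65535;

    private PhaserLimitDemo() {}

    // Returns true if registering one party beyond the limit is rejected.
    static boolean registrationFailsAtLimit() {
        Phaser phaser = new Phaser(MAX_PARTIES);
        try {
            phaser.register();
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }

    // Two child phasers under one root can hold more parties in total
    // than a single Phaser allows; returns the combined party count.
    static int tieredCapacity() {
        Phaser root = new Phaser();
        Phaser first = new Phaser(root, MAX_PARTIES);
        Phaser second = new Phaser(root, MAX_PARTIES);
        return first.getRegisteredParties() + second.getRegisteredParties();
    }

    public static void main(String[] args) {
        System.out.println("Registration rejected at limit: " + registrationFailsAtLimit());
        System.out.println("Parties across two child phasers: " + tieredCapacity());
    }
}
```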

Regression test against Alfresco v5.1

Since v2.0.1 of the tool was released, Alfresco released v5.1. This task is to regression test the code against Alfresco v5.1 (specifically the Enterprise trial).

Name of offending BulkImportItem should be included in all fatal exceptions

In v1.x of the bulk import tool, the name of the offending ImportableItem is not available when a fatal exception occurs, making it harder than necessary to troubleshoot the issue.

v2+ of the tool should include identifying information for the offending BulkImportItem (which superseded ImportableItem) for all fatal (unexpected) exceptions.

Metadata-only folders get imported as files

For example:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
  <entry key="type">cm:folder</entry>
  <entry key="cm:description">This should be imported as a directory.</entry>
</properties>
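
The metadata file above uses Java's XML properties format, so its declared type can be read with java.util.Properties.loadFromXML. The sketch below shows that step only. The class MetadataTypeReader and its methods are hypothetical, and the folder check compares against the literal cm:folder as an assumption; a real implementation would have to consult the data dictionary so that sub-types of cm:folder are recognised too.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper; only java.util.Properties is a real API here.
final class MetadataTypeReader {
    private MetadataTypeReader() {}

    // Returns the value of the "type" entry, or null if there is none.
    static String readType(InputStream xml) throws IOException {
        Properties props = new Properties();
        props.loadFromXML(xml);
        return props.getProperty("type");
    }

    // Assumption: only the literal cm:folder counts; sub-types are ignored.
    static boolean declaresPlainFolder(InputStream xml) throws IOException {
        return "cm:folder".equals(readType(xml));
    }

    public static void main(String[] args) throws IOException {
        String xml =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
            + "<properties>\n"
            + "  <entry key=\"type\">cm:folder</entry>\n"
            + "  <entry key=\"cm:description\">This should be imported as a directory.</entry>\n"
            + "</properties>\n";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println("Declares a folder: " + declaresPlainFolder(in));
    }
}
```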

Bulk import from 'Default' failed - java.util.concurrent.RejectedExecutionException

Running bulk import tool 2.0.1 on alfresco community edition 5.1.0 (r127059-b7) with postgres on ubuntu 16.04 (8GB ram, 8 vcpu)

I am attempting to import 565921 pdf files that had been exported from alfresco Community - v4.2.0 (r63893-b12) using version 0.0.6 of this bulk export tool https://github.com/gsdenys/alfresco-bulk-export

The export generated a total of 195614 folders, 565921 pdf files and 772691 properties.xml files (about 40GB). Export results


    Performed Export with the following Parameters :
       export folder   : /export
       node to export  : workspace://SpacesStore/8ac68ee8-31b2-47ed-8c3f-eac8020a5935
       ignore exported : false
       export versions : false
       bulk import revision scheme: true
    Export elapsed time: minutes:568 , seconds: 34131

During import, approximately 221599 pdf files get imported or skipped (initial import, and also re-importing with replace = false) and then the import process fails with an exception.

The import process was run with these options

source default
source directory /import/Company Home/User Homes/
target space /User Homes
replace = not checked

sample tree output of root of source directory (/import)

Most folders have 1 pdf file, a few have 2 pdf files. No folders have more than 3 pdf files

+-- 8ac68ee8-31b2-47ed-8c3f-eac8020a5935.cache
+-- Company Home
|   +-- User Homes
|       +-- general-user
|       |   +-- 000000
|       |   |   +-- 1308
|       |   |   |   +-- entry
|       |   |   |   |   +-- 12112005200321
|       |   |   |   |   |   +-- 12112005200321_cdc_20130827162012.pdf
|       |   |   |   |   |   +-- 12112005200321_cdc_20130827162012.pdf.metadata.properties.xml
|       |   |   |   |   +-- 12112005200321.metadata.properties.xml
|       |   |   |   |   +-- 12112005200478
|       |   |   |   |   |   +-- 12112005200478_cdc_20130827155542.pdf
|       |   |   |   |   |   +-- 12112005200478_cdc_20130827155542.pdf.metadata.properties.xml
|       |   |   |   |   +-- 12112005200478.metadata.properties.xml
|       |   |   |   |   +-- 12112005200537
|       |   |   |   |   |   +-- 12112005200537_cdc_20130826164512.pdf
|       |   |   |   |   |   +-- 12112005200537_cdc_20130826164512.pdf.metadata.properties.xml
|       |   |   |   |   +-- 12112005200537.metadata.properties.xml
|       |   |   |   |   +-- 12112005226138
|       |   |   |   |   |   +-- 12112005226138_cdc_20130827155609.pdf
|       |   |   |   |   |   +-- 12112005226138_cdc_20130827155609.pdf.metadata.properties.xml
|       |   |   |   |   +-- 12112005226138.metadata.properties.xml
|       |   |   |   |   +-- 12112005226241
|       |   |   |   |   |   +-- 12112005226241_cdc_20130830082522.pdf
|       |   |   |   |   |   +-- 12112005226241_cdc_20130830082522.pdf.metadata.properties.xml
|       |   |   |   |   +-- 12112005226241.metadata.properties.xml
|       |   |   |   |   +-- 12112005226285
|       |   |   |   |   |   +-- 12112005226285_cdc_20130827155619.pdf
|       |   |   |   |   |   +-- 12112005226285_cdc_20130827155619.pdf.metadata.properties.xml
|       |   |   |   |   +-- 12112005226285.metadata.properties.xml
|       |   |   |   |   +-- 12112005226398
|       |   |   |   |   |   +-- 12112005226398_cdc_20130830090013.pdf
|       |   |   |   |   |   +-- 12112005226398_cdc_20130830090013.pdf.metadata.properties.xml
|       |   |   |   |   +-- 12112005226398.metadata.properties.xml
|       |   |   |   |   +-- 12112005226489
|       |   |   |   |   |   +-- 12112005226489_cdc_20130830154047.pdf
|       |   |   |   |   |   +-- 12112005226489_cdc_20130830154047.pdf.metadata.properties.xml
|       |   |   |   |   +-- 12112005226489.metadata.properties.xml

The majority of pdf files were created by scanning documents on a Dell scanner.

Sorry I cannot provide any sample content.

I've re-run the import process with TRACE enabled and sifted through the 40GB catalina.out file for this data:

     2016-06-27 00:42:42,559 User:admin INFO  [bulkimport.impl.BatchImporterImpl] [BulkImport-Importer-0011] BULKIMPORT: Skipping '12112005832290-invoice.pdf' as it already exists in the repository and 'replace existing' is false.
     2016-06-27 00:42:42,559 User:admin TRACE [util.transaction.SpringAwareUserTransaction] [BulkImport-Importer-0011] Completing transaction for [UserTransaction]
     2016-06-27 00:42:42,559 User:admin DEBUG [util.transaction.TransactionSupportUtil] [BulkImport-Importer-0011] Before commit TransactionSychronizationImpl[ txnId=fd1fce7f-b675-4d89-8c0f-07343c031713]
     2016-06-27 00:42:42,559 User:admin TRACE [util.transaction.TransactionSupportUtil] [BulkImport-Importer-0011] Before Prepare - level 0
     2016-06-27 00:42:42,559 User:admin DEBUG [util.transaction.TransactionSupportUtil] [BulkImport-Importer-0011] Before Prepare priorities:[4]
     2016-06-27 00:42:42,559 User:admin TRACE [util.transaction.TransactionSupportUtil] [BulkImport-Importer-0011] Fetched resource:
     2016-06-27 00:42:42,560 User:admin TRACE [util.transaction.TransactionSupportUtil] [BulkImport-Importer-0011] Fetched resource:
     2016-06-27 00:42:42,560 User:admin TRACE [util.transaction.TransactionSupportUtil] [BulkImport-Importer-0011] Fetched resource:
     2016-06-27 00:42:42,560 User:admin TRACE [util.transaction.TransactionSupportUtil] [BulkImport-Importer-0011] Fetched resource:
     2016-06-27 00:42:42,560 User:admin TRACE [util.transaction.TransactionSupportUtil] [BulkImport-Importer-0011] Fetched resource:
     2016-06-27 00:42:42,560 User:admin TRACE [util.transaction.TransactionSupportUtil] [BulkImport-Importer-0011] Fetched resource:
     2016-06-27 00:42:42,560 User:admin TRACE [util.transaction.TransactionSupportUtil] [BulkImport-Importer-0011] Fetched resource:
     2016-06-27 00:42:42,560 User:admin TRACE [util.transaction.TransactionSupportUtil] [BulkImport-Importer-0011] Fetched resource:
     2016-06-27 00:42:42,560 User:admin DEBUG [util.transaction.TransactionSupportUtil] [BulkImport-Importer-0011] Prepared
     2016-06-27 00:42:42,560 User:admin DEBUG [mybatis.spring.SqlSessionUtils] [BulkImport-Importer-0011] Transaction synchronization committing SqlSession [org.apache.ibatis.session.defaults.DefaultSqlSession@4057dea7]
     2016-06-27 00:42:42,560 User:admin DEBUG [util.transaction.TransactionSupportUtil] [BulkImport-Importer-0011] Before completion: TransactionSychronizationImpl[ txnId=fd1fce7f-b675-4d89-8c0f-07343c031713]
     2016-06-27 00:42:42,560 User:admin DEBUG [mybatis.spring.SqlSessionUtils] [BulkImport-Importer-0011] Transaction synchronization deregistering SqlSession [org.apache.ibatis.session.defaults.DefaultSqlSession@4057dea7]
     2016-06-27 00:42:42,560 User:admin DEBUG [mybatis.spring.SqlSessionUtils] [BulkImport-Importer-0011] Transaction synchronization closing SqlSession [org.apache.ibatis.session.defaults.DefaultSqlSession@4057dea7]
     2016-06-27 00:42:42,560 User:admin DEBUG [util.transaction.TransactionSupportUtil] [BulkImport-Importer-0011] After completion (committed): TransactionSychronizationImpl[ txnId=fd1fce7f-b675-4d89-8c0f-07343c031713]
     2016-06-27 00:42:42,560 User:admin DEBUG [util.transaction.TransactionSupportUtil] [BulkImport-Importer-0011] Bound resource:
     2016-06-27 00:42:42,560 User:admin TRACE [util.transaction.TransactionSupportUtil] [BulkImport-Importer-0011] Fetched resource:
     2016-06-27 00:42:42,560 User:admin TRACE [util.transaction.TransactionSupportUtil] [BulkImport-Importer-0011] Fetched resource:
     2016-06-27 00:42:42,560 User:admin TRACE [util.transaction.TransactionSupportUtil] [BulkImport-Importer-0011] Fetched resource:
     2016-06-27 00:42:42,561 User:admin TRACE [util.transaction.TransactionSupportUtil] [BulkImport-Importer-0011] Fetched resource:
     2016-06-27 00:42:42,562 User:admin TRACE [util.transaction.TransactionSupportUtil] [BulkImport-Importer-0011] Fetched resource:
     2016-06-27 00:42:42,562 User:admin TRACE [util.transaction.TransactionSupportUtil] [BulkImport-Importer-0011] Fetched resource:
     2016-06-27 00:42:42,562 User:admin TRACE [util.transaction.TransactionSupportUtil] [BulkImport-Importer-0011] Fetched resource:
     2016-06-27 00:42:42,563 User:admin TRACE [util.transaction.TransactionSupportUtil] [BulkImport-Importer-0011] Fetched resource:
     2016-06-27 00:42:42,563 User:admin DEBUG [util.transaction.TransactionSupportUtil] [BulkImport-Importer-0011] After Completion: DONE
     2016-06-27 00:42:42,563 User:admin DEBUG [util.transaction.TransactionSupportUtil] [BulkImport-Importer-0011] Unbound txn synch:TransactionSychronizationImpl[ txnId=fd1fce7f-b675-4d89-8c0f-07343c031713]
     2016-06-27 00:42:42,563 User:admin DEBUG [util.transaction.SpringAwareUserTransaction] [BulkImport-Importer-0011] Committed user transaction: UserTransaction[object=org.alfresco.util.transaction.SpringAwareUserTransaction@1fcb8253, status=3]
     2016-06-27 00:42:42,564 User:admin DEBUG [security.authentication.AuthenticationUtil] [BulkImport-Importer-0011] Removing the current security information.
     2016-06-27 00:42:42,667  TRACE [util.cache.AbstractAsynchronouslyRefreshedCache] [BulkImport-Scanner] get() from cache for key  on AbstractAsynchronouslyRefreshedCache [cacheId=compiledModelsCache]
     2016-06-27 00:42:42,667  TRACE [util.cache.AbstractAsynchronouslyRefreshedCache] [BulkImport-Scanner] get() from cache for key  on AbstractAsynchronouslyRefreshedCache [cacheId=compiledModelsCache]

(at least 10,000 more of the following lines)


     2016-06-27 00:45:33,871  TRACE [util.cache.AbstractAsynchronouslyRefreshedCache] [BulkImport-Scanner] get() from cache for key  on AbstractAsynchronouslyRefreshedCache [cacheId=compiledModelsCache]
     2016-06-27 00:45:33,871  TRACE [util.cache.AbstractAsynchronouslyRefreshedCache] [BulkImport-Scanner] get() from cache for key  on AbstractAsynchronouslyRefreshedCache [cacheId=compiledModelsCache]
     2016-06-27 00:45:33,871  TRACE [util.cache.AbstractAsynchronouslyRefreshedCache] [BulkImport-Scanner] get() from cache for key  on AbstractAsynchronouslyRefreshedCache [cacheId=compiledModelsCache]
     2016-06-27 00:45:33,871  TRACE [util.cache.AbstractAsynchronouslyRefreshedCache] [BulkImport-Scanner] get() from cache for key  on AbstractAsynchronouslyRefreshedCache [cacheId=compiledModelsCache]
     2016-06-27 00:45:33,871  TRACE [util.cache.AbstractAsynchronouslyRefreshedCache] [BulkImport-Scanner] get() from cache for key  on AbstractAsynchronouslyRefreshedCache [cacheId=compiledModelsCache]

then

     2016-06-27 00:45:33,968  ERROR [bulkimport.impl.Scanner] [BulkImport-Scanner] BULKIMPORT: Bulk import from 'Default' failed.
     java.util.concurrent.RejectedExecutionException: Task org.alfresco.extension.bulkimport.impl.Scanner$BatchImportJob@36b0e96c rejected from
     org.alfresco.extension.bulkimport.impl.BulkImportThreadPoolExecutor@4615da6f[Running, pool size = 16, active threads = 16, queued tasks = 100, completed tasks = 269]
            at org.alfresco.extension.bulkimport.impl.Scanner.submitBatch(Scanner.java:370)
            at org.alfresco.extension.bulkimport.impl.Scanner.submitCurrentBatch(Scanner.java:329)
            at org.alfresco.extension.bulkimport.impl.Scanner.submit(Scanner.java:298)
            at org.alfresco.extension.bulkimport.source.fs.FilesystemBulkImportSource.scanDirectory(FilesystemBulkImportSource.java:223)
            at org.alfresco.extension.bulkimport.source.fs.FilesystemBulkImportSource.scanDirectory(FilesystemBulkImportSource.java:245)
            at org.alfresco.extension.bulkimport.source.fs.FilesystemBulkImportSource.scanDirectory(FilesystemBulkImportSource.java:245)
            at org.alfresco.extension.bulkimport.source.fs.FilesystemBulkImportSource.scanDirectory(FilesystemBulkImportSource.java:245)
            at org.alfresco.extension.bulkimport.source.fs.FilesystemBulkImportSource.scanDirectory(FilesystemBulkImportSource.java:245)
            at org.alfresco.extension.bulkimport.source.fs.FilesystemBulkImportSource.scanDirectory(FilesystemBulkImportSource.java:245)
            at org.alfresco.extension.bulkimport.source.fs.FilesystemBulkImportSource.scanFiles(FilesystemBulkImportSource.java:172)
            at org.alfresco.extension.bulkimport.impl.Scanner.run(Scanner.java:188)
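
For anyone reading the trace: the exception message says the pool had 16 threads, all of them active, and 100 queued tasks when the next batch was rejected. The sketch below reproduces that failure mode with a plain `java.util.concurrent.ThreadPoolExecutor` and a bounded queue. It is not the tool's `BulkImportThreadPoolExecutor`, whose configuration isn't shown here; `RejectionDemo` and `countRejected` are made-up names, and the sizes are simply copied from the exception message.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class RejectionDemo {
    // Submits `tasks` jobs that all block until released, and counts rejections.
    static int countRejected(int threads, int queueCapacity, int tasks) {
        CountDownLatch release = new CountDownLatch(1);
        // Fixed-size pool, bounded queue, default AbortPolicy rejection handler.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity));
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await();   // stand-in for a slow batch
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++;   // every thread busy and the queue full
                }
            }
        } finally {
            release.countDown();   // let the blocked jobs finish
            pool.shutdown();
            try {
                pool.awaitTermination(10, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        // 16 running + 100 queued = 116 accepted, so the 117th submission is rejected.
        System.out.println(countRejected(16, 100, 117));
    }
}
```

In this sketch the producer outruns the consumers, which looks like what the scanner did here once the importer threads stalled. Whether the real executor is meant to block the scanner instead of rejecting is something only the tool's source can confirm.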

Example export status results (I clicked stop an hour after the import process failed):

    Status:    Stopped
    Initiating User:    admin
    Source Name:    Default
    Source directory    /import/Company Home/User Homes
    Target Space:    /Company Home/User Homes
    Import Type:    Streaming
    Dry run:    No
    Batch Weight:    100
    Threads:    0 active of 0 total
    Start Date:    2016-06-27T02:20:19Z
    Scan End Date:    n/a
    End Date:    2016-06-27T09:51:11Z
    Scan Duration:    07h 30m 52s 457.031ms
    Duration:    07h 30m 51s 990.186ms
    Currently Importing:   
    Source (read) Statistics
    Directories scanned:    217557    8.042 / sec
    Files scanned:    1449099    53.566 / sec

    Target (write) Statistics
    Aspects associated:    1043712    38.582 / sec
    Batches completed:    2226    0.082 / sec
    Batches submitted:    2343    0.087 / sec
    Bytes imported:    2530418309    93539.094 / sec
    Content streamed:    0    0 / sec
    In place content linked:    0    0 / sec
    Metadata properties imported:    2915017    107.756 / sec
    Nodes imported:    222599    8.229 / sec
    Nodes skipped:    222599    8.229 / sec
    Out-of-order retries:    0    0 / sec
    Versions imported:    0    0 / sec
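
As a sanity check on these figures, the per-second columns appear to be plain averages (count divided by elapsed time). The sketch below redoes that arithmetic under the assumption that the read statistics use the scan duration and the write statistics use the overall duration; that split is inferred from the numbers, not taken from the tool's source. `StatusRates` and its methods are illustrative names.

```java
class StatusRates {
    // Converts an "HHh MMm SSs mmm.mmmms" duration into seconds.
    static double toSeconds(int hours, int minutes, int seconds, double millis) {
        return hours * 3600 + minutes * 60 + seconds + millis / 1000.0;
    }

    // Average rate over the whole run.
    static double perSecond(long count, double elapsedSeconds) {
        return count / elapsedSeconds;
    }

    public static void main(String[] args) {
        double scanSeconds = toSeconds(7, 30, 52, 457.031);    // Scan Duration
        double totalSeconds = toSeconds(7, 30, 51, 990.186);   // Duration

        System.out.println("files scanned / sec:  " + perSecond(1449099L, scanSeconds));
        System.out.println("dirs scanned / sec:   " + perSecond(217557L, scanSeconds));
        System.out.println("nodes imported / sec: " + perSecond(222599L, totalSeconds));

        // Batches that were submitted but never completed.
        System.out.println("batches outstanding:  " + (2343 - 2226));
    }
}
```

The rates come out close to the reported values. The 117 outstanding batches equal 16 + 100 + 1, i.e. the thread count and queue size from the exception message plus one; that may or may not be a coincidence.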

Measure performance of alternative batching strategies

Migrated from https://code.google.com/p/alfresco-bulk-filesystem-import/issues/detail?id=96

Need to measure the performance of alternative batching strategies. For example, now that multi-threaded imports are supported (issue #8), it may be more performant to load multiple batches within the same directory in parallel (i.e. on separate threads).

Testing in preparation for the DevCon 2011 talk showed that the impact of timestamp propagation is no longer a factor in 3.x, so there's a good chance this will markedly improve performance for some imports (i.e. those that have a lot of child nodes in a single folder).

It would also be ideal to come up with a batching strategy that allows a transaction to span directories. Right now the tool is not optimal when confronted with a large directory hierarchy where each directory only contains a small number of files.
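
One possible shape for a directory-spanning strategy, as a rough sketch only: keep filling a batch in scan order until a weight threshold is reached, regardless of which directory each item came from. `CrossDirectoryBatcher`, `Item` and the per-item weights are hypothetical and do not correspond to the tool's actual classes. The sketch also ignores the requirement that a parent folder exist before its children are imported, which any real implementation would have to handle.

```java
import java.util.ArrayList;
import java.util.List;

class CrossDirectoryBatcher {
    // Hypothetical importable item: a source path plus a relative cost.
    static final class Item {
        final String path;
        final int weight;

        Item(String path, int weight) {
            this.path = path;
            this.weight = weight;
        }
    }

    // Groups items in scan order; a batch is closed only when adding the next
    // item would exceed maxWeight, never because the directory changed.
    static List<List<Item>> batch(List<Item> items, int maxWeight) {
        List<List<Item>> batches = new ArrayList<>();
        List<Item> current = new ArrayList<>();
        int currentWeight = 0;
        for (Item item : items) {
            if (!current.isEmpty() && currentWeight + item.weight > maxWeight) {
                batches.add(current);
                current = new ArrayList<>();
                currentWeight = 0;
            }
            current.add(item);   // an oversized item still gets a batch of its own
            currentWeight += item.weight;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Item> items = new ArrayList<>();
        items.add(new Item("a/1.pdf", 1));
        items.add(new Item("a/2.pdf", 1));
        items.add(new Item("b/1.pdf", 1));
        items.add(new Item("c/1.pdf", 1));
        items.add(new Item("c/2.pdf", 1));
        // With maxWeight 2, the second batch holds b/1.pdf and c/1.pdf together.
        for (List<Item> b : batch(items, 2)) {
            StringBuilder line = new StringBuilder();
            for (Item i : b) {
                line.append(i.path).append(' ');
            }
            System.out.println(line.toString().trim());
        }
    }
}
```

With many small directories, this would produce far fewer transactions than one batch per directory, which is the case the paragraph above describes as not optimal. Whether it is actually faster is exactly what would need measuring.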
