
Stream Sync

Building

Stream Sync uses [sbt](https://www.scala-sbt.org/) as its build tool. After the project has been checked out, you can cd into the project directory and enter the sbt console by typing sbt. Then the following commands are of interest:

  • test compiles everything and executes the unit tests

  • ITest / test compiles everything and executes the integration tests

  • assembly: since the build makes use of the sbt-assembly plugin, this command builds a so-called fat jar that is executable and contains all the dependencies of Stream Sync. This makes it easy to run the tool from the command line without having to bother with a large class path.
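For example, a typical build-and-run session could look as follows (the exact path of the generated jar depends on the Scala version and the current project version, so treat it as a placeholder):

sbt assembly
java -jar target/scala-<scalaVersion>/stream-sync-assembly-<version>.jar --help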

Operation modes

When running Stream Sync, the tool compares two folder structures and updates them to achieve a desired target state. Depending on the concrete use case, different target states are possible. The exact way in which Stream Sync updates the structures it processes is controlled by its operation mode. The following subsections describe the operation modes supported by the tool.

Mirror mode

In mirror mode, a destination directory structure becomes an exact copy of a source structure. Both structures are compared: files or directories that exist in the source structure but not in the destination structure are created there; files that have been modified in the source structure override their counterparts in the destination structure; files or directories that exist in the destination structure but not in the source structure are removed. So, changes are applied only to the destination structure; the source structure is not modified.

This mode is suitable to create an exact (non-incremental) copy of data stored in a folder structure. For instance, you have a local working state and want to back up this state to an external storage medium or upload it to a cloud storage. After modifying the local working state, you can run the tool again, and it will adjust the destination structure according to your latest changes.

Warning
Note that such a "mirror" is typically not a replacement for a full backup, since files deleted in the source structure are deleted in the destination structure as well and cannot be restored later.

Sync mode

The typical use case for sync mode is a shared set of data that is used from multiple devices. A central copy of the data is available on a storage accessible from all devices, and each device has a local copy. When data on one device is changed, the changes are synced to the central copy. From there, they can be downloaded to all other devices.

In contrast to mirror mode, there is not a source and a destination structure, but a local and a remote structure. During a run of Stream Sync, both structures may be modified: changes done locally are applied to the remote copy, changes on the remote data are synced to the local files. This mode is more complex than mirror mode. Single sync operations can fail if a conflict is detected; for instance if a file was changed both locally and on the remote side. In this case, the conflicting operation is skipped, an error is reported, and the user has to resolve the conflict manually.

To be able to detect changes on the local copy and potential conflicts in a reliable manner, Stream Sync stores information about the managed data in a local file. Each sync run compares the current state of the local data with the information stored in the file and can thus determine what has actually changed. The file is then updated with the new local state resulting from the current sync run.

Caution
The current implementation assumes that the data is modified on different devices by the same user at different times. It cannot handle parallel sync processes run at the same time from multiple devices against the central copy of data.

Usage

The tool offers a command line interface (CLI) that is invoked using the following general syntax:

Sync <sourceStructure> <destinationStructure> [--mirror] [options]

for Mirror mode, where sourceStructure points to the structure serving as the source of the mirror process, and destinationStructure refers to the destination structure; or

Sync <localStructure> <remoteStructure> --sync [options]

to enable Sync mode, where localStructure references the local copy of the data, and remoteStructure is a URL pointing to the central copy. The --mirror switch is optional; mirror mode is the default operation mode.

The following subsections provide further details about the command line options supported by the tool. Most of these options work in all operation modes; so no explicit distinction is needed. Therefore, this documentation is a bit lax with the terms it uses:

  • The term sync process is used to refer to a run of the Stream Sync tool in all operation modes; it can especially mean a run in mirror mode as well.

  • The structures processed by the tool are usually referred to as source and destination structure. This wording also covers the local and remote structure of runs in sync mode.

Options that are operation mode-specific are marked as such in their description.

Note: Being written in Scala, Stream Sync requires a Java Virtual Machine to run. So the full command to be executed has to launch Java and specify the full class path and the main class (which is com.github.sync.cli.Sync). When a fat jar has been built as described in the Building section, the command can be abbreviated to

java -jar stream-sync-assembly-<version>.jar [options] <source> <destination>

In all examples in this document the short form Sync is used as a placeholder for the complete command.

Structures

The generic term structure has been used to refer to the source and the destination of a sync process. The reason for this is that Stream Sync can handle different types of structures. In the most basic case, the structures are paths on a local file system (or a network share that can be accessed in the same way as a local directory). In this case, the paths can be specified directly.

To reference a different type of structure, specific URIs need to be used. These URIs typically start with a prefix followed by a part specific to a dedicated structure type. To give a concrete example, one prefix that is currently supported is dav:. This prefix indicates that the structure is hosted on a WebDav server. The root URL to the directory on the server to be synced must be specified after the prefix. The following snippet shows how to sync a path on the local file system with a directory on a WebDav server:

Sync [options] /data/local/music dav:https://my.cloud-space.com/data/music

Some structures need additional parameters to be accessed correctly. For instance, a WebDav server typically requires correct user credentials. Such parameters are passed as additional options in the command line; they are allowed only if a corresponding structure takes part in the sync process. A structure requiring additional parameters can be both the source and the destination of the sync process; therefore, when providing additional options it must be clear to which structure they apply. This is achieved by using special prefixes: src- for options to be applied to the source structure, and dst- for options referring to the destination structure. In the example above the WebDav structure is the destination; therefore, the username and password options must be specified using the dst- prefix:

Sync --dst-user myWebDavUserName --dst-password myWebDavPwd \
   /data/local/music dav:https://my.cloud-space.com/data/music

If both structures were WebDav directories, one would also have to specify the corresponding options with the src- prefix, as in

Sync dav:https://server1.online.com/source \
  --src-user usrIDSrc --src-password pwdSrc \
  dav:https://server2.online.com/dest \
  --dst-user usrIDDst --dst-password pwdDst

This convention makes it clear which option applies to which structure. The structure types supported are described in more detail in the Structure types section later in this document. This section also lists, for each structure type, which additional options it supports.

Option syntax

A number of options are supported to customize a sync process. Options are distinguished from the source and destination URIs by the fact that they have to start with the prefix --. Most options have a value that is obtained from the parameter that follows the option key. So a sequence of command line options looks like

--option1 option1_value --option2 option2_value

There are also a few options acting like switches: these options do not have a value, but their presence or absence on the command line determines their value - true or false.
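For instance, the following command line combines a valued option with a switch (--dry-run is a switch described later in this document; the paths and the log location are placeholders):

Sync /path/source /path/dest --log /tmp/sync.log --dry-run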

For some options the application defines short alias names consisting of only a single letter. Such aliases use only a single - as prefix. So for instance, the following parameter lists are equivalent:

Sync --log path/to/log

and:

Sync -l path/to/log

The order of options typically does not matter. It also makes no difference if options are placed before or after the URIs for the structures to be synced. Unrecognized option keys cause the program to fail with a corresponding error message. In case of an error, the application shows a help screen describing all the parameters it supports. The user can also request help explicitly by specifying the --help flag or its short alias -h, such as:

Sync srcUri destUri --help

Note that the help printed by the application is partly context-sensitive; it depends on the parameters already provided on the command line. If the help switch is passed without other arguments, such as

Sync -h

the application shows a generic help screen listing the top-level options available. If the command line already contains URIs for the structures to be processed, e.g.

Sync /data/local/music dav:https://my.cloud-space.com/data/music --help

the help screen would include descriptions of options supported by the structure types in use - the local file system and WebDav in this example. This makes it possible to complete the command line step by step, by requesting help for the parts that are currently defined.

The options supported are described in detail below. There is one special option, --file, that expects as its value a path to a local file. This file is read line by line, and the lines are added to the sequence of command line arguments as if they had been provided by the user on program execution. For instance, given a file sync_params.txt with the following content:

--actions
actionCreate,actionOverride

--filter-create
exclude:*.tmp

Then an invocation of

Sync --file sync_params.txt /path/source /path/dest

would be equivalent to the following call

Sync --actions actionCreate,actionOverride --filter-create exclude:*.tmp /path/source /path/dest

An arbitrary number of command line files can be specified, and they can be nested to an arbitrary depth. Note, however, that the order in which such files are processed is not defined. This is normally irrelevant, but can be an issue if the source and destination URIs are specified in different files. It could then be the case that the URIs swap their position, and the sync process is done in the opposite direction!

Option keys are not case-sensitive; so --actions has the same meaning as --ACTIONS or --Actions. However, for short alias names case matters.

Filtering options

With this group of options specific files or directories can be included or excluded from a sync process. It is possible to define such filters globally, and also for different sync actions. A sync process is basically a sequence of the following actions, where each action is associated with a file or folder:

  • Action Create: An element is created in the destination structure.

  • Action Override: An element from the source structure replaces a corresponding element in the destination structure.

  • Action Remove: An element is removed from the destination structure.

To define such action filters, a special option keyword is used whose value is a filter expression. As option keywords can be repeated, an arbitrary number of expressions can be set for each action. A specific action on an element is executed only if the element is matched by all filter expressions defined for this action. The following option keywords exist (filter expressions are discussed a bit later):

Table 1. Command line options to filter for action types

  • --filter-create: Defines a filter expression for actions of type Create.

  • --filter-override: Defines a filter expression for actions of type Override.

  • --filter-remove: Defines a filter expression for actions of type Remove.

  • --filter: Defines a filter expression that is applied for all action types.
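Since filter options can be repeated, multiple expressions can be combined for one action type. As a sketch (paths and patterns are placeholders), the following command executes remove actions only for elements below the top level that are not log files:

Sync /path/source /path/dest --filter-remove min-level:1 --filter-remove exclude:*.log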

In addition, it is possible to enable or disable specific action types for the whole sync process. Per default, all action types are active. With the --actions option the action types to enable can be specified. The option accepts a comma-separated list of action names; alternatively, the option can be repeated to enable multiple action types. Valid names for action types are actionCreate, actionOverride, and actionRemove (case is again ignored).

So the following option enables only create and override actions: --actions actionCreate,actionOverride

With the following command line only create and remove actions are enabled: --actions actionCreate --actions actionRemove

Filter expressions

During a sync process, it is first checked for each action whether its type is enabled. If this is the case, the filter expressions (if any) assigned to this action type are evaluated on the element that is subject to this action. Only if all expressions accept the element is the action actually performed on it.

Thus, filter expressions refer to attributes of elements. The general syntax of an expression is as follows:

<criterion>:<value>

Here criterion is one of the predefined filter criteria for attributes of elements to be synced. The value is compared to a specific attribute of the element to find out whether the criterion is fulfilled.

The following table gives an overview of the filter criteria supported:

Table 2. Filter criteria on element attributes

  • min-level (Int): Each element (file or folder) is assigned a level, which is the distance to the root folder of the source structure. Files or folders located directly in the source folder have level 0, the ones in direct sub folders have level 1, and so on. With this filter the minimum level can be defined; only elements with a level greater than or equal to this value are taken into account. Example: min-level:1

  • max-level (Int): Analogous to min-level, but defines the maximum level; only elements with a level less than or equal to this value are processed. Example: max-level:5

  • exclude (Glob): Defines a file glob expression for files or folders to be excluded from the sync process. File paths can contain the well-known wildcard characters '?' (matching a single character) and '*' (matching an arbitrary number of characters). Examples: exclude:*.tmp excludes temporary files; exclude:*/build/* excludes all folders named build on arbitrary levels.

  • include (Glob): Analogous to exclude, but defines a pattern for files to be included. Example: include:project1/* only processes elements below project1.

  • date-after (date or date-time): Selects only files whose last-modified date is equal to or after a given reference date. The reference date is specified in ISO format with an optional time portion. If no time is defined, it is replaced by 00:00:00. Example: date-after:2018-09-01T22:00:00 ignores all files with a modified date before this reference date.

  • date-before (date or date-time): Analogous to date-after, but selects only files whose last-modified time is before a given reference date. Example: date-before:2018-01-01 only deals with files that have been modified before 2018.
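To give a concrete example combining two criteria, the following command (with placeholder paths) only processes elements modified in 2023 or later and skips temporary files for all action types:

Sync /path/source /path/dest --filter date-after:2023-01-01 --filter exclude:*.tmp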

Log files

The sync operations executed during a sync process can also be written in a textual representation to a log file. This is achieved by adding the --log option whose value is the path to the log file to be written. With this option, a protocol of the operations that have been executed can be generated.

If only failed operations are of interest, the error log file is the right choice. This file contains all sync operations that could not be applied due to some exception, followed by this exception. This gives an overview of what went wrong and which files may not be up-to-date. To enable this error log, use the --error-log option and provide the path to the error log file.
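For instance, a run that records both the executed and the failed operations could be started as follows (the log locations are arbitrary examples):

Sync /path/source /path/dest --log /tmp/sync.log --error-log /tmp/sync-errors.log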

Adjust granularity of timestamps

In order to decide whether a file needs to be copied to the destination structure, Stream Sync compares the last-modified timestamps of the files involved. After a file has been copied, the timestamp in the destination structure is updated to match the one in the source structure; so if there are no changes on the file in the source structure, another sync process will ignore this file - at least in theory.

In practice there can be some surprises when syncing between different types of file systems or structures. The differences can also impact the comparison of last-modified timestamps. For instance, some structures may store such timestamps with a granularity of nanoseconds, while others only use seconds. This may lead to false positives when Stream Sync decides which files to copy.

To deal with problems like that, the --ignore-time-delta option can be specified. The option expects a numeric value which is interpreted as a threshold in seconds for an acceptable time difference. So if the difference between the timestamps of two files is below this threshold, the timestamps will be considered to be equal. Setting this option to a value of 1 or 2 should solve all issues related to the granularity of file timestamps. An example using this option can be found in the Examples and use cases section.

Encryption

One use case for Stream Sync is creating a backup of a local folder structure on a cloud server; the data is then duplicated to another machine that is reachable from everywhere. However, if your data is sensitive, you probably do not want it lying around on a public server without additional protection.

Stream Sync offers such protection by supporting multiple options for encrypting the data that is synced:

  • The content of files can be encrypted.

  • The names of files and folders can be encrypted.

Whether encryption is used and what is encrypted is controlled by the so-called encryption mode. This is an enumeration that can have the following values:

  • none: No encryption is used.

  • files: The content of files is encrypted.

  • filesAndNames: Both the content of files and their names are encrypted. (This includes directories as well.)

In all cases, encryption is based on AES using key sizes of 128 bits. The keys are derived from password strings that are transformed accordingly (password strings shorter than 128 bits are padded, longer strings are cut). In addition, a random initialization vector is used; so an encrypted text will always be different, even if the same input is passed.

The source and the destination of a sync process can be encrypted independently. If an encryption mode other than none is set for the destination, but not for the source, files transferred to the destination are encrypted. If such an encryption mode is set for the source, but not for the destination, files are decrypted. If active encryption modes are specified for both sides, files are decrypted first and then encrypted again with the destination password.

The following table lists the command line options that affect encryption (all of them are optional):

Table 3. Command line options controlling encryption

  • src-crypt-mode (default: none): The encryption mode for the source structure (see above). This option controls whether encryption is applied to files in the source structure.

  • dst-crypt-mode (default: none): The encryption mode for the destination structure; controls how encryption is applied to the destination structure.

  • src-encrypt-password (default: undefined): Defines a password for the encryption of files in the source structure. This password is needed when the source crypt mode indicates that encryption should be used.

  • dst-encrypt-password (default: undefined): Analogous to src-encrypt-password, but defines a password for the destination structure. It is evaluated for a corresponding encryption mode.

  • crypt-cache-size (default: 128): During a sync operation with encrypted file names, it may be necessary to encrypt or decrypt file names multiple times; for instance if parent folders are accessed multiple times to process their sub folders. As an optimization, a cache is maintained storing the names that have already been encrypted or decrypted; that way the number of crypt operations can be reduced. For sync operations of very complex structures (with deeply nested folders), it can make sense to set a higher cache size. Note that the minimum allowed size is 32.

Note that folder structures that are only partly encrypted are not supported; when specifying an encryption password, the password is applied to all files.
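As a sketch of the decryption direction, the following command would restore an encrypted backup from a WebDav server to a local folder; since an active crypt mode is set only for the source, files are decrypted during the transfer (URL, paths, and credentials are placeholders):

Sync dav:https://my.cloud-space.com/backup /data/restore \
  --src-user myWebDavUserName --src-password myWebDavPwd \
  --src-crypt-mode filesAndNames --src-encrypt-password s3cr3t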

Structure types

This section lists the different types of structures that are supported for sync processes. If not mentioned otherwise, all types can act as source and as destination structure of a sync process. The additional parameters supported by a structure type are described as well.

Local directories

This is the most basic and "natural" structure type. It can be used for instance to mirror a directory structure on the local hard disk to an external hard disk or a network share.

To specify such a structure, just pass the (OS-specific) path to the root directory without any prefix. The table below lists the additional options that are supported. (Remember that these options need to be prefixed with either src- or dst- to assign them to the source or destination structure.)

Table 4. Command line options for local directories

  • time-zone (optional): There are file systems that store last-modified timestamps for files in the system’s local time without proper time zone information. This causes the last-modified time to change together with the local time zone, e.g. when daylight saving time starts or ends. In such cases, Stream Sync would consider the files on this file system as changed because their last-modified time is now different. One prominent example of such a file system is FAT32, which is still frequently used, for instance on external hard disks, because of its broad support by different operating systems. To work around this problem, the time-zone option makes it possible to define a time zone in which the timestamps of files in a specific structure have to be interpreted. The last-modified time reported by the file system is then calculated according to this time zone before comparison. Analogously, when setting the last-modified time of a synced file, the timestamp is adjusted. As the value of the option, any string can be provided that is accepted by the ZoneId.of() method of the JDK class ZoneId.
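For instance, a mirror run to a FAT32-formatted drive whose timestamps are written in local time could be started as follows (the zone ID is just an example of a string accepted by ZoneId.of(); the paths are placeholders):

Sync /data/work /mnt/usb/backup --dst-time-zone UTC+02:00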

WebDav directories

It is possible to sync from or to a directory hosted on a WebDav server. To do this, the full URL to the root directory on the server has to be specified with the prefix dav: defining the structure type. The following table lists the additional options supported for WebDav structures. (Remember that these options need to be prefixed with either src- or dst- to assign them to the source or destination structure.)

Table 5. Command line options for WebDav directories

  • modified-property (optional): The name of the property that holds the last-modified time of files on the server (see below).

  • modified-namespace (optional): Defines a namespace to be used together with the last-modified property (see below).

  • delete-before-override (optional): Determines whether a file to be overridden on the WebDav server is deleted first. Experiments have shown that for some WebDav servers override operations are not reliable; in some cases, the old file stays on the server although a success status is returned. For such servers this property can be set to true; Stream Sync will then send a DELETE request for this file before it is uploaded again. All other values disable this mode.

In addition to these options, the mechanism to authenticate with the server has to be defined. Refer to the Authentication section for more information.

Notes

Using WebDav in sync operations can be problematic as the standard does not define an official way to update a file’s last-modified time. Files have a getlastmodified property, but this is typically set by the server to the time when the file has been uploaded. For sync processes it is, however, crucial to have a correct modification time; otherwise, the file on the server would be considered as changed in the next sync process because its timestamp does not match the one of the file it is compared against.

Concrete WebDav servers provide different options to work around this problem. Stream Sync supports servers that store the modification time of files in a custom property that can be updated. The name of this property can be defined using the modified-property option. As WebDav requests and responses are based on XML, the custom property may use a different namespace than the namespace used for the core WebDav properties. In this case, the modified-namespace option can be set.

When using a WebDav directory as the source structure, Stream Sync reads the modification times of files from the property configured via modified-property; if this is undefined, the standard property getlastmodified is used instead.

When a WebDav directory acts as destination structure, after each file upload another request is sent to update the file’s modification time to match the one of the source structure. Here again the configured property (with the optional namespace) is used or the standard property if unspecified.

Microsoft OneDrive

Most Windows users will have a Microsoft account and thus access to a free cloud storage area referred to as OneDrive. For Windows there is an integrated OneDrive client that automatically syncs this storage area to the local machine. For Linux, however, no official client exists.

Stream Sync supports a OneDrive storage as either the source or the destination structure of a sync process. The storage is identified by a URL of the form onedrive:<driveID>, where driveID is a string referencing a specific Microsoft OneDrive account. In addition, the following special command line options are supported:

Table 6. Command line options for OneDrive

  • path (mandatory): Defines the relative sub path of the storage that should be synced.

  • upload-chunk-size (optional, defaults to 10): File uploads to the OneDrive server have to be split into multiple chunks if the file size exceeds a certain limit (about 60 MB). With this parameter the chunk size in MB to be used by Stream Sync can be configured.

OneDrive uses OAuth 2 as its authentication mechanism, with a special identity provider from Microsoft. Therefore, the corresponding credentials have to be set up (refer to the OAuth 2 section for further information). This requires a number of preparation steps before sync processes can be run successfully. The example Sync from a local directory to Microsoft OneDrive contains a full description of the necessary steps.
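Once an IDP has been initialized as described in the OAuth 2 section, a mirror command could look as follows (a sketch; the drive ID, paths, and IDP name are placeholders):

Sync /data/music onedrive:<driveID> \
  --dst-path /data/music \
  --dst-idp-storage-path ~/tokens --dst-idp-name microsoft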

Google Drive

Another popular cloud storage offering is available from Google: on a Google Drive account, users can store data up to a certain limit. Most users of Android will have such an account. As is true for Microsoft OneDrive, official sync clients are not available for all operating systems.

Stream Sync can handle a Google Drive account as both source and destination of a sync process. To access such an account, use a URL of the form googledrive:<path>, where path is the optional root path of the sync process. If it is missing, the special root folder of the Google Drive account is used; otherwise, only the path specified here is taken into account by sync operations. Note that there is no such thing as an account ID in the URL; the account to be accessed is encoded in the OAuth 2 access token, which is used for authentication (the OAuth 2 section contains more information about this topic).

One speciality of Google Drive is that this file system is not strictly hierarchical. A single file or folder can have multiple parents, and it is possible that a folder can have multiple children with the same name. Thus, a path like documents/private/MyText.doc does not necessarily uniquely identify a single element. Even cycles in folder structures are possible. Stream Sync does not handle such scenarios. It treats Google Drive like any other folder structure and assumes the same properties. So when using Stream Sync together with Google Drive, you should make sure that at least the sub path to be synced follows the conventions of a strictly hierarchical file system.

Other than the root path to be synced in the target Google Drive account - which is part of the structure URL - you typically do not have to specify any further configuration options.
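A minimal command could therefore look like this (a sketch, assuming an IDP named google has already been initialized as described in the OAuth 2 section; paths and names are placeholders):

Sync /data/documents googledrive:/backup/documents \
  --dst-idp-storage-path ~/tokens --dst-idp-name google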

Note
There is one additional command line option, --server-url, which can be used to specify an alternative server URL; but this is only needed for very special scenarios, e.g. for testing. Per default, the standard Google Drive API endpoint is addressed.

You can find a complete example of how to set up Stream Sync for accessing a Google Drive account in the section Sync from a local directory to Google Drive.

Authentication

Structure types that involve a server typically require an authentication mechanism. Stream Sync supports multiple ways to authenticate with the server.

Basic Auth

The easiest authentication mechanism is Basic Auth, which requires that a user name and password are provided. This information is then passed to the server in the Authorization header. (Therefore, this mechanism only makes sense when HTTPS is used for the server communication.)

To make use of Basic Auth, just define the command line options user and password. Note that these options have to be prefixed with src- or dst- to assign them to either the source or destination structure. Examples of how to use these options can be found in the Examples section, for instance under Sync from a local directory to a WebDav directory.

OAuth 2

OAuth 2 is another popular way for authentication. Stream Sync supports the Authorization code flow. In this flow the authentication is done by an external server, a so-called identity provider (IDP). In a first step, an authorization code is retrieved. In this step, the user basically grants Stream Sync the permission to access her account with a set of pre-defined rights. This is done by opening a Web page at a URL specific to the IDP in the user’s Web browser. The user then authenticates against the IDP, e.g. by filling out a login form or using another means. If login is successful, the IDP invokes a so-called redirect URL and passes the authorization code as a query parameter.

In a second step, the authorization code has to be exchanged for an access token. This is done by calling another endpoint provided by the IDP and passing the authorization code as a form parameter. If everything goes well, the IDP replies with a document that contains both an access token and a refresh token. The access token must be passed in the Authorization header for all requests sent to the target server. Its validity period is limited; when it expires, the refresh token can be used to obtain a new access token. The refresh token is typically valid for a longer time; so the user has to do the login (i.e. the first step) only once, and then Stream Sync can access the target server as long as the refresh token stays valid.

The authorization code flow is interactive; it requires that the user executes some actions in a Web browser. This is not a great fit for a command line tool like Stream Sync. To close this gap, in addition to the main class of Stream Sync, there is a second CLI class responsible for the configuration and management of OAuth identity providers: com.github.sync.cli.oauth.OAuth.

What this class basically does is update a storage with information about known IDPs: first, an IDP has to be added to the system. In this step, a number of properties for this IDP have to be provided, such as the URLs of specific endpoints or the client ID and secret to be used for the interaction with the IDP. For this purpose, the init command is used. An example invocation could look as follows:

$ java -cp stream-sync-assembly-<version>.jar com.github.sync.cli.oauth.OAuth init \
  --idp-storage-path ~/tokens/ \
  --idp-name microsoft \
  --auth-url https://login.live.com/oauth20_authorize.srf \
  --token-url https://login.live.com/oauth20_token.srf \
  --scope "files.readwrite offline_access" \
  --redirect-url http://localhost:8080 \
  --client-id <client-id> \
  --client-secret <secret>

The command supports the following options:

Table 7. Command line options to initialize an OAuth IDP

  • idp-name (mandatory): Assigns a logical name to the IDP. This name is then used by other commands or within Stream Sync to reference this IDP. An arbitrary name can be chosen.

  • idp-storage-path (mandatory): Defines a path on the local file system where information about the IDP affected is stored. In this path a couple of files are created whose names are derived from the name of the IDP.

  • auth-url (mandatory): The URL of the authorization endpoint of the IDP. This URL is needed to obtain an authorization code; a GET request is sent to it with some specific properties added as query parameters.

  • token-url (mandatory): The URL of the token endpoint of the IDP. This URL is used to obtain an access and refresh token pair for the authorization code, and later also for refresh token requests.

  • scope (mandatory): Defines a list of values that are passed in the scope parameter to the IDP. The values are specific to a concrete IDP; they determine the access rights that are granted to a client that has a valid access token.

  • redirect-url (mandatory): Defines the redirect URL, which plays an important role in the authorization code flow. This URL is invoked by the IDP after a successful login of the user. The URLs to be used depend on the concrete use case; URLs referencing localhost are typically possible as well.

  • client-id (mandatory): An ID identifying the client. This ID is provided by the IDP as part of some kind of on-boarding process.

  • client-secret (optional; if missing, the secret is read from the console): A secret assigned to the client. Like the client ID, the secret is provided by the IDP.

  • store-unencrypted (optional switch): Determines whether sensitive information related to the IDP is stored unencrypted. Affected are the client secret and the token information obtained from the IDP. With an access token - as long as it is valid - an attacker can access the target server on behalf of the user; therefore, it makes sense to protect this data, and encryption is active per default. It can be explicitly disabled by specifying this switch.

  • idp-password (optional; read from the console if necessary): The password to be used to encrypt sensitive information related to the IDP. This property is relevant unless the store-unencrypted switch is set.

After the execution of this command, the IDP-related information is stored under the path specified, but no access token is retrieved yet. This is done using the login command as follows:

$ java -cp stream-sync-assembly-<version>.jar com.github.sync.cli.oauth.OAuth login \
  --idp-storage-path ~/tokens/ \
  --idp-name microsoft

The parameters correspond to the ones of the init command; encryption is supported in the same way. (If an encryption password has been specified to the init command, the same password must be entered here as well.)

The login command does the actual interaction with the IDP as required by the authorization code flow. It tries to open the standard Web browser at the authorization URL configured for the IDP in question. If this fails for some reason, a message is printed asking the user to open the browser manually and navigate to this URL. The Web page served at this URL is under the control of the IDP; it should give the relevant instructions to do a successful authentication, e.g. by filling out a login form. If this is the first login attempt, the user is typically asked whether she wants to grant the access rights defined by the scope parameter to this client application. If authentication is successful, the IDP then redirects the user’s browser to the redirect URL. Depending on the configured redirect URL, there are two options:

  • If the redirect URL is of the form http://localhost:<port>, the command opens a small HTTP server at the configured port and waits for the redirect. It can then obtain the authorization code automatically without any further user interaction.

  • For other types of redirect URLs, the user is responsible for extracting the code, for instance from the URL displayed in the browser’s address bar. The command opens a prompt on the console where the code can be entered.

If everything goes well, the command creates a new file in the specified storage path with the access and refresh tokens obtained from the IDP; the file is optionally encrypted.

With this information in place, Stream Sync can now be directed to use this IDP for authentication. To do this, the user and password options used for basic auth have to be replaced by ones pointing to the desired IDP:

Sync C:\data\work dav:https://target.dav.io/backup/work \
--log C:\Temp\sync.log \
--dst-idp-storage-path /home/hacker/temp/tokens --dst-idp-name microsoft

Note how, analogous to the OAuth commands, the IDP is referenced by its name and the path where its data is stored; the store-unencrypted and idp-password options are supported as well.

With one final OAuth command the data of a specific IDP can be removed again:

$ java -cp stream-sync-assembly-<version>.jar com.github.sync.cli.oauth.OAuth remove \
  --idp-storage-path ~/tokens/ \
  --idp-name microsoft

This command deletes all files for the selected IDP in the path specified. As the files are just deleted, no encryption password is required here.

As is true for the main Sync application, the OAuth application offers the switch --help (or its short form -h) to explicitly request usage information. To get a general help screen, just enter:

$ java -cp stream-sync-assembly-<version>.jar com.github.sync.cli.oauth.OAuth --help

To request help information specific to a concrete command, also provide this command, for instance:

$ java -cp stream-sync-assembly-<version>.jar com.github.sync.cli.oauth.OAuth init --help

Throttling sync streams

In some situations it may be necessary to restrict the number of sync operations that are executed in a given time unit. For instance, there are public servers that react with an error status of 429 Too many requests when many small files are uploaded over a fast internet connection.

Stream Sync supports two command line options to deal with such cases:

Table 8. Command line options for throttling sync operations

  • throttle (default: none): The option is passed a numeric value that limits the number of sync operations (file uploads, deletions of files, creations of folders, etc.) executed per time unit.

  • throttle-unit (default: Second): Defines the time unit in which the throttle option is applied. It can take one of the values Second, Minute, or Hour, or one of the abbreviations S, M, or H (case does not matter).

For instance, using a command like

Sync --throttle 1 ...

only a single operation per second is executed. This is a good solution for the problem with overloaded servers because it mainly impacts small files and operations that complete very fast. The upload of larger files that takes significantly longer than a second will not be delayed by this option. By specifying greater time units, throttling can even be configured on a finer level, e.g.:

Sync --throttle 45 --throttle-unit minute ...

would limit the throughput of the sync stream to 45 operations per minute.

Another option to influence the speed of sync processes that have an HTTP server as source or destination is to override certain configuration settings. Stream Sync uses the Akka HTTP library for communication via the HTTP protocol. The library can be configured in many ways, and system properties can be used to override its default settings. Options you may want to modify in order to customize sync streams are the size of the pool for HTTP connections (which determines the possible parallelism and is set to 4 per default) or the number of requests that can be open concurrently (32 by default). To achieve this, pass the following arguments to the Java VM that executes Stream Sync:

-Dakka.http.host-connection-pool.max-connections=1 -Dakka.http.host-connection-pool.max-open-requests=2

As you can see in this example, the names of the system properties are derived from the hierarchical structure of the configuration options for Akka HTTP as described in the Akka HTTP documentation.
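Putting this together, a complete invocation of the fat jar with a reduced connection pool might look like this (a sketch; the jar name and the structures are placeholders):

java -Dakka.http.host-connection-pool.max-connections=1 \
  -Dakka.http.host-connection-pool.max-open-requests=2 \
  -jar stream-sync-assembly-<version>.jar /path/source dav:https://server/dest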

Timeouts

To prevent sync processes from hanging when involved servers respond very slowly, a timeout is applied to all operations. The timeout in seconds can be configured via the --timeout command line option; the default value is one minute.

If a sync process needs to upload large files to a server via a not so fast internet connection, the timeout probably has to be increased; otherwise, operations will fail because they take too long. The following example shows how to set the timeout to 10 minutes to deal with larger uploads:

Sync C:\data\work dav:https://sd2dav.1und1.de/backup/work --timeout 600

Reading passwords from the console

For some use cases, e.g. connecting to a WebDav server or encrypting files, Stream Sync needs passwords. Per default, such passwords can be specified as command line arguments, like any other arguments processed by the program. This can, however, be problematic when it comes to secret data: if the program is invoked from a command shell, the passwords are directly visible. They are typically stored in the command line history as well. So they can easily be compromised.

To reduce this risk, passwords can also be read from the console. This happens automatically without any additional action required by the caller. If a password is required for a concrete sync scenario, but the corresponding command line argument is missing, the user is prompted to enter it. The name of the command line argument representing the password is used as the prompt. While the password is typed, no echo is displayed.

It is quite possible that multiple passwords are needed for a single sync process. An example could be a process that syncs from the local file system to an encrypted WebDav server. Then one password is needed to connect to the server, and another one for the encryption. Either of them can be omitted from the command line; the user is prompted for all missing passwords.
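For instance, if the following command is used to sync to a WebDav server with content encryption enabled, the user is prompted for both dst-password and dst-encrypt-password, since neither is given on the command line (server URL and user name are placeholders):

Sync /data/local dav:https://my.cloud-space.com/backup \
  --dst-user myWebDavUserName --dst-crypt-mode files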

Dry-run mode

Before actually modifying data on the destination structure, it is sometimes useful to check which actions will be performed, so that unexpected manipulations or even data loss can be avoided. This is possible by adding the --dry-run switch (or its short alias -d) to the command line. The sync process then still determines the differences between the source and the destination structure, and a sync log file can be specified, in which the sync operations are written. It will, however, not apply any actual changes to the destination structure.

Options specific to Mirror mode

This section describes the options that are only allowed in Mirror mode.

Sync log files

Section Log files described the usage of the --log option to produce a protocol of the operations executed during a sync/mirror process. While such a log file is meaningful on its own, for mirror processes it can serve an additional purpose:

It is possible to use such a log file as input for another mirror process. Then the sync operations to be executed are not calculated as the delta between two structures, but are directly read from the log file. This is achieved by specifying the --sync-log option, whose value is the path to the log file to be read. Note that in this mode the URIs for both the source and destination structure still need to be specified; log files contain only relative URIs, and in order to resolve them correctly the root URIs of the original structures must be provided.

If the structures to be synced are pretty complex and/or large files need to be transferred over a slow network connection, sync processes can take a while. With the support for log files this problem can be dealt with by running multiple incremental mirror processes. This works as follows:

  1. An initial mirror process is run for the structures in question that has the --log option set and enables Dry-run mode. This does not execute any actions, but creates a log file with the operations that need to be done.

  2. Now further mirror processes can be started to process the sync log written in the first step. For such operations the following options must be set:

    • --sync-log is set to the path of the log file written in the first step.

    • --log is set to a file keeping track of the progress of the overall operation. This file is continuously updated with the sync operations that have been executed.

The mirror processes can now be interrupted at any time and resumed again later. When restarted with these options the process ignores all sync operations listed in the progress log and only executes those that are still pending. This is further outlined in the Examples and use cases section.

In the incremental mode, as described above, the error log file has no function beyond reporting errors. Sync operations that appear in the error log are not written to the normal log and are not considered completed. So when running another mirror process from the sync log, these operations are retried (and if they fail again, they are written anew to the error log).

Switching source and destination structures

The typical use case for Stream Sync in mirror mode is transferring data from one system - the leading system - to another data structure; the destination structure gets modified to become a clone of the original system. From time to time you may need to run a mirror process in the inverse direction.

Consider for example that you use Stream Sync as a backup tool. If you mess up your original data, you will probably want to restore it from the backup storage. This is of course easily possible: you just have to rewrite the sync command you use for your backup to work in the opposite direction. This can be done rather mechanically; the source and destination URIs have to be exchanged, as well as the src- and dst- prefixes of the parameters that configure your data structures.

Sync commands tend to become complex; you often need a bunch of parameters to configure authentication and fine-tune the transfer process. Maybe you have therefore written shell scripts that contain your sync commands. In the backup scenario, you would have a shell script that triggers your backup. To restore your data from the backup structure, you could create a restore script using the replacements outlined above. This solution is not ideal, however, because you now have to maintain two scripts that need to be kept in sync.

For such use cases, Stream Sync offers an easier solution: it supports the --switch parameter, which swaps the source and destination structures, effectively reversing the sync direction. This means, you do not have to duplicate your commands or scripts, but simply add a parameter to switch the sync direction.

If you use shell scripts to store your sync commands, you should write them in a way that they support additional parameters. For instance, if your backup script looks as follows:

backup.sh
#!/bin/sh
./stream-sync.sh /data/documents dav:https://webdav.my-storage.com/backup/ \
  --dst-user backup-user --timeout 600 --dst-crypt-mode filesAndNames \
  --log ~/logs/backup.log

Add the special parameter "$@" at the end, which represents all the parameters entered by the user:

backup.sh supporting additional parameters
#!/bin/sh
./stream-sync.sh /data/documents dav:https://webdav.my-storage.com/backup/ \
  --dst-user backup-user --timeout 600 --dst-crypt-mode filesAndNames \
  --log ~/logs/backup.log "$@"

You can now transform your backup script to a restore script by simply adding the --switch parameter:

./backup.sh --switch
Note
The --switch option is available only in mirror mode, since in sync mode the structures have different semantics attached to them. The local structure is the one that is backed by a local state file. Therefore, it is not easily possible to switch the direction of the process.

Options specific to Sync mode

As is the case for mirror mode, a number of command line options are available only if Sync mode is active. In most cases, these are related to the local state managed by Stream Sync for sync processes. The following subsections deal with these options.

Managing local state files

As briefly mentioned in the Sync mode section, Stream Sync manages a file with information about the local state for each sync stream. Based on this file, it can detect local updates and compute the changes to be applied to the remote structure (or recognize conflicting changes). A sync stream is identified by its local and remote sources; for each combination of a local and a remote source, a separate state file is created.

Per default Stream Sync can manage these state files transparently without user interaction. They are stored in a subfolder named .stream-sync in the current user’s home directory and have a (non-readable) name derived from the URIs of the local and remote structures. (Actually, the name is computed by concatenating the local and the remote URI, calculating a SHA-1 hash on this string, and applying a Base64-encoding on the result; but this is merely an implementation detail.)

While these defaults should work well in most cases, they can be overridden with some command line options:

  • --state-path allows specifying the path in which the state file is created. Here the user can provide an arbitrary directory. The path will be created if it does not exist.

  • --stream-name can be used to set a name for the sync stream. The state file is then given this name instead of the cryptic auto-generated one.

The following fragment shows a usage example of these options:

Specifying options for the local state file
Sync /data/documents dav:https://webdav.my-storage.com/backup/ --sync \
  --state-path /data/sync/state \
  --stream-name 'documents-backup' \
  --dst-user backup-user

Importing local state

Every run of a sync process updates the local state file associated with the stream. For the initial execution of the stream, a state file does not exist yet. This is no problem if one of the structures taking part in the sync process is empty and will be initialized from the other side. Then the initial sync run actually becomes a mirror: Stream Sync copies all the files found in the existing structure to the empty one and writes an up-to-date local state file automatically.

If there is already data on both sides, however, a valid local state file should be available before running a first sync process. Otherwise, Stream Sync considers all local files as newly created and will treat changes on remote files as conflicts. To avoid this, you should create a clean local state file that reflects the current state of the local structure. This is achieved by adding the --import-state switch to the command line. The switch enables a special mode in which only the local structure is iterated over, and all files encountered are recorded in the state file. Afterwards, a fresh and up-to-date state file exists. A (re-)import of the local state can also be done if the state file got corrupted for whatever reason.

For the example sync stream from the previous section, an import command could look as follows:

Importing local state
Sync /data/documents dav:https://webdav.my-storage.com/backup/ --sync --import-state \
  --state-path /data/sync/state \
  --stream-name 'documents-backup' \
  --dst-user backup-user
Note
You could of course drop the options that configure the local state file. Then the file would be created and initialized at its default location in the user’s home directory.
Note
The remote side of the sync process must be fully specified, even if it will not be accessed by this sync run. This is because the default name of the state file is derived from the URIs for the local and remote structures; so it must be present.

Examples and use cases

Sync a local directory to an external USB hard disk

This should be a frequent use case, in which some local work is saved on an external hard disk. The command line is pretty straightforward, as the target drive can be accessed like a local drive; e.g. under Windows it is assigned a drive letter. The only problem is that if the file system on the external drive is FAT32, it may be necessary to explicitly specify a time zone in which last-modified timestamps are interpreted (refer to the description of local directories for more information). For this purpose, the time-zone option needs to be provided. In addition, the ignore-time-delta option is set to a value of 2 seconds to make sure that small differences in timestamps with a granularity below seconds do not cause unnecessary copy operations.

Sync C:\data\work D:\backup\work --dst-time-zone UTC+02:00 --ignore-time-delta 2

Do not remove archived data

Consider the case that a directory structure stores the data of different projects: the top-level folder contains a sub folder for each project; all files of this project are then stored in this sub folder and in further nested sub folders.

On your local hard-disk you only have a subset of all existing projects, the ones you are currently working on. On a backup medium all project folders should be saved.

Default sync processes are not suitable for this scenario because they would remove all project folders from the backup medium that are not present in the source structure. This can be avoided by using the min-level filter as follows:

Sync /path/to/projects /path/to/backup --filter-remove min-level:1

This filter statement says that on the top-level of the destination structure no remove operations are executed. For the example at hand the effect is that folders for projects not available in the source structure will not be removed. In the existing folders, however, (which are on level 1 and greater) full sync operations are applied; so all changes done on a specific project folder are transferred to the backup medium.

Interrupt and resume long-running sync processes

As described under Sync log files, with the correct options mirror processes can be stopped at any time and resumed at a later point in time. The first step is to generate a so-called sync log, i.e. a file containing the operations to be executed to sync the structures in question:

Sync /path/to/source /path/to/dest --dry-run --log /data/sync.log

This command does not change anything in the destination structure, but only creates a file /data/sync.log with a textual description of the operations to execute. (Such files have a pretty straightforward structure: each line represents an operation, including an action and the element affected.)

Now another mirror process can be started that takes this log file as input. To keep track of the progress that is made, a second log file has to be written - the progress log:

Sync /path/to/source /path/to/dest --sync-log /data/sync.log --log /data/progress.log

This process can be interrupted and later started again with the same command line. It will execute the operations listed in the sync log, but ignore the ones contained in the progress log. Therefore, the whole sync process can be split into a number of incremental sync processes.

Sync from a local directory to a WebDav directory

The following command can be used to mirror a local directory structure to an online storage:

Sync C:\data\work dav:https://sd2dav.1und1.de/backup/work \
--log C:\Temp\sync.log \
--dst-user my.account --dst-password s3cr3t_PASsword \
--dst-modified-property Win32LastModifiedTime \
--dst-modified-namespace urn:schemas-microsoft-com: \
--filter exclude:*.bak

Here all options supported by the WebDav structure type are configured. The server (which really exists) does not allow modifications of the standard WebDav getlastmodified property, but uses a custom property named Win32LastModifiedTime with the namespace urn:schemas-microsoft-com: to hold a modified time different from the upload time. This property is set correctly for each file that is uploaded during a sync process.

Note that the --dst-password parameter could have been omitted; the user would then be prompted for the password.

Sync from a local directory to a WebDav server with encryption

Building upon the previous example, with some additional options it is possible to protect the data on the WebDav server using encryption:

Sync C:\data\work dav:https://sd2dav.1und1.de/backup/work \
--log C:\Temp\sync.log \
--dst-user my.account --dst-password s3cr3t_PASsword \
--dst-modified-property Win32LastModifiedTime \
--dst-modified-namespace urn:schemas-microsoft-com: \
--filter exclude:*.bak \
--dst-encrypt-password s3cr3t \
--dst-crypt-mode filesAndNames \
--crypt-cache-size 1024 \
--ops-per-second 2 \
--timeout 600

This command specifies that both the content and the names of files are encrypted with the password "s3cr3t" when they are copied onto the WebDav server. With an encryption mode of files, only the files' content would be encrypted, while the file names would remain in plain text. The size of the cache for encrypted names is increased to avoid unnecessary crypt operations. The number of sync operations per second is limited to 2 in this example, to prevent the server from rejecting requests because of too high load. Also, a larger timeout has been set (600 seconds = 10 minutes), so that uploads of larger files do not cause operations to fail.
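If the file names should remain readable and only the content should be encrypted, the crypt mode line of the command above would be changed as follows (a hypothetical variant; it assumes that files is the literal option value for this mode, in line with the filesAndNames value shown above):

--dst-crypt-mode files \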

Sync from a local directory to Microsoft OneDrive

As described in the Microsoft OneDrive section, some preparations are necessary before OneDrive can be used as source or destination structure of a sync process. These are mainly related to authentication: an OAuth client for the Microsoft Identity Provider (IDP) has to be registered and integrated with Stream Sync.

As a first step, the OAuth client application has to be created in the Azure Portal. The application is assigned a client ID and a client secret and is then able to interact with the Microsoft IDP to obtain valid access tokens. Note that if Stream Sync were a closed-source application, it could have been registered as a client application and shipped with its client secret. But because the full source is available in a public repository, such a registration is not an option; the client secret would not be very secret, would it?

The steps necessary to create a client application are described in detail in the official Microsoft documentation under OneDrive authentication and sign-in. Here we will give a short outline.

Log into the Microsoft Azure Portal and navigate to the page for App registrations. Here you can create a new application. You are then presented with a form where you can enter some data about the new application: choose a name and select the type of accounts to be supported. You also have to enter a redirect URI, which is invoked by the Microsoft IDP as part of the authorization code flow. Which redirect URI you choose is up to you; if you intend to run sync processes on your personal machine, it is recommended to use a URI pointing to localhost with a port number that is not in use on your computer, such as http://localhost:8080. This simplifies the integration with Stream Sync, as described below.

After all information has been entered, the app can be registered. It is then assigned an ID that is displayed on the overview page. On the certificates and secrets page, you can request a new client secret. Copy this secret; it is required later on.

Next, you have to add the information about your OAuth client application to Stream Sync. This is done with a few command line operations. For the following steps we assume that you have defined some environment variables that are referenced in the commands below:

  • SYNC_JAR: Points to the assembly jar of Stream Sync; this is used to set the classpath for Java invocations.

  • CLIENT_ID: Contains the client ID of the app you have just registered at the Azure Portal.

  • CLIENT_SECRET: Contains the secret of this app.

  • TOKEN_STORE: Points to the directory where Stream Sync should store information about OAuth client applications, e.g. ~/token-store.
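In a Unix-like shell, the variables could be set as follows (all values are placeholders to be replaced with your own data):

$ export SYNC_JAR=/path/to/stream-sync-assembly.jar
$ export CLIENT_ID=<client ID from the Azure Portal>
$ export CLIENT_SECRET=<client secret from the Azure Portal>
$ export TOKEN_STORE=~/token-store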

With a first command, basic properties of the client application are specified:

$ java -cp $SYNC_JAR com.github.sync.cli.oauth.OAuth init \
  --idp-storage-path $TOKEN_STORE \
  --idp-name microsoft \
  --auth-url https://login.live.com/oauth20_authorize.srf \
  --token-url https://login.live.com/oauth20_token.srf \
  --scope "files.readwrite offline_access" \
  --redirect-url http://localhost:8080 \
  --client-id $CLIENT_ID \
  --client-secret $CLIENT_SECRET

Here we use the name microsoft to reference this IDP and a localhost redirect URI. The other options, the URLs and the scope values, are defined by the OneDrive API and must have exactly these values. This command will prompt you for a password for the IDP; sensitive data in the token directory is encrypted with this password. (If you do not want the files to be encrypted, add the option --encrypt-idp-data false.)

Now we can do a login against the Microsoft IDP and obtain an initial pair of an access and refresh token:

$ java -cp $SYNC_JAR com.github.sync.cli.oauth.OAuth login \
  --idp-storage-path $TOKEN_STORE \
  --idp-name microsoft

This command will open your standard Web browser and point it to the authorization URL of the Microsoft IDP. You are presented with a form to enter the credentials of your Microsoft account. You are then asked whether you want to grant access to your client application. Confirm this.

Because we have used a redirect URI of the form http://localhost:<port>, the authorization code can be obtained automatically, and the command should finish with a message that the login was successful. (For other redirect URIs you have to determine the code yourself and enter it at the prompt in the console.)

After completing these steps, Stream Sync has all the information it needs to authenticate against your OneDrive account, so you can run a sync process. One piece of information you still need is the ID of your OneDrive account. It can be obtained by signing in to the OneDrive Web application: the browser’s address bar then shows a URL of the form https://onedrive.live.com/?id=root&cid=xxxxxx. The ID in question is the alphanumeric string after the cid parameter. We assume that you create an environment variable DRIVE_ID with this value.
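For instance (the value is a placeholder for your own cid):

$ export DRIVE_ID=<the cid value from the URL>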

The following command shows how the local work directory can be synced against the data folder of your OneDrive account:

Sync ~/work onedrive:$DRIVE_ID \
--dst-path /data \
--dst-idp-storage-path $TOKEN_STORE \
--dst-idp-name microsoft

Of course, you can use other standard options as well, for instance for setting timeouts, configuring encryption, or defining filters. The following example uses the same options as the one in the section about WebDav and encryption:

Sync ~/work onedrive:$DRIVE_ID \
--dst-path /data \
--dst-idp-storage-path $TOKEN_STORE \
--dst-idp-name microsoft \
--log /tmp/sync.log \
--filter exclude:*.bak \
--dst-encrypt-password s3cr3t \
--dst-crypt-mode filesAndNames \
--crypt-cache-size 1024 \
--ops-per-second 2 \
--timeout 600

Sync from a local directory to Google Drive

The steps to set up Stream Sync for an integration with Google Drive are very similar to the ones described in the OneDrive example. Specifically, an application needs to be created in the Google Cloud Platform Console in order to obtain the credentials (the OAuth client ID and secret) required for authenticating against Google’s OAuth identity provider. As the OneDrive example covers the basics in detail, this section focuses mainly on the differences between these cloud storage providers.

Documentation about the process can be found in the official Google documentation. Here is a short summary:

First, a new project has to be created in the Google Cloud Platform Console. With this new project selected, under Credentials click CREATE CREDENTIALS and select OAuth Client ID. Set the Application type to Desktop app and enter a name for the new client. After the successful creation of the OAuth client, the web application presents its client ID and secret. In contrast to Microsoft’s OAuth implementation, no redirect URL needs to be specified when selecting Desktop app as the client type; we can use a local redirect URL later when interacting with the identity provider.

With the OAuth client ID and secret available, Stream Sync can now be configured with the details of this client. This can be done using the following command:

$ java -cp $SYNC_JAR com.github.sync.cli.oauth.OAuth init \
  --idp-storage-path $TOKEN_STORE \
  --idp-name google \
  --auth-url https://accounts.google.com/o/oauth2/v2/auth \
  --token-url https://oauth2.googleapis.com/token \
  --scope "https://www.googleapis.com/auth/drive https://www.googleapis.com/auth/drive.file https://www.googleapis.com/auth/drive.metadata" \
  --redirect-url http://localhost:8080 \
  --client-id $CLIENT_ID \
  --client-secret $CLIENT_SECRET

Note
Here again, some environment variables are referenced that are expected to have been initialized with the corresponding information; they are explained in the OneDrive example. Of course, you can use a different name for this configuration than google.

The next step is a login against the Google identity provider. It can be triggered with the command below:

$ java -cp $SYNC_JAR com.github.sync.cli.oauth.OAuth login \
  --idp-storage-path $TOKEN_STORE \
  --idp-name google

The command opens a web browser and navigates to a login page served by the Google OAuth identity provider. The account you select for the login will be the one that is later accessed by Stream Sync. You have to confirm that you grant access to the application you have created before. After a successful login, Stream Sync should be able to obtain the OAuth tokens and store them locally in the configured path.

You can now run sync processes using your Google Drive account as source or destination structure. For instance, the following command syncs the folder /data/google against the full content stored in your Google Drive:

Sync /data/google googledrive: \
--dst-idp-storage-path $TOKEN_STORE \
--dst-idp-name google

The destination URI googledrive: refers to the root folder of your Google Drive. It is possible to specify a path after the googledrive: prefix; so you could sync only the subfolder music as follows:

Sync /data/google/music googledrive:music \
--dst-idp-storage-path $TOKEN_STORE \
--dst-idp-name google

Of course, all other options provided by Stream Sync, like encryption or filters, are available as well.

Setting up a sync process for existing data

This section discusses the initialization of a sync process over already existing data using a concrete example. It assumes that the mirror mode of Stream Sync has already been used to keep a backup of a local folder with music files on a Google Drive account. (For simplicity, we reuse the example from the previous section about Google Drive.) Now another device comes into play that should have read and write access to the music collection. The challenge here lies in the correct setup of the local state file.

The first step is to make sure that the local folder contains the most recent data and is in sync with the content of the Google Drive folder. The straightforward way to achieve this is to run a mirror process again that applies all local changes to the cloud folder:

Mirror run to apply all local changes to the Google Drive folder
Sync /data/google/music googledrive:music \
  --dst-idp-storage-path $TOKEN_STORE \
  --dst-idp-name google

This assumes that modifications were made only locally, since all changes in the Google Drive folder are overridden. If this is not the case, you have to manually ensure that both structures contain the same, up-to-date data.

Now that the local folder has the correct content, the local state can be imported using the command below. We use the standard name and location for the local state file:

Importing local state
Sync /data/google/music googledrive:music \
  --sync \
  --import-state \
  --dst-idp-storage-path $TOKEN_STORE \
  --dst-idp-name google

This should finish rather quickly, since only the local file system is processed. The command yields a file with local state information in the .stream-sync subfolder of the user’s home directory.
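To verify the import, you can list that folder; the exact file name is derived from the structure URIs, as mentioned earlier:

$ ls ~/.stream-sync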

The second device that should have access to the music collection can be initialized in a similar way. You will probably want to run a mirror process first, but this time using the Google Drive folder as source and the local folder as destination structure; after the steps performed on the first computer, Google Drive contains the most recent data. After this is done, the local state can be imported as described before: execute an equivalent command, in which only the path to the local folder may have to be adapted.

From now on, the data can be modified on both devices. Start a sync process when appropriate, using a command like this:

Regular sync run
Sync /data/google/music googledrive:music \
  --sync \
  --dst-idp-storage-path $TOKEN_STORE \
  --dst-idp-name google

(This is basically the same command as for importing the local state, just without the --import-state flag.) Stream Sync will sync the changes from both devices or issue warnings if it detects conflicting changes.

Architecture

The Stream Sync tool makes use of Reactive Streams, in the implementation provided by [Akka](https://akka.io/), to perform sync operations. Both the source and the destination structure are represented by a stream source emitting objects that represent the contents of the structure (files and folders). A special graph stage implementation contains the actual sync algorithm: it compares two elements from the sources (which are expected to arrive in a defined order) and decides which action needs to be performed (if any) to keep the structures in sync. This stage produces a stream of SyncOperation objects.

At this point, only a description of the actions to be performed has been created. In a second step, the SyncOperation objects are interpreted and applied to the destination structure.
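To illustrate the idea behind the comparison stage, here is a much-simplified, self-contained Scala sketch that compares two ordered element sequences. It is not the actual Stream Sync implementation: the element and operation types are made up for this example, and the real algorithm runs incrementally inside an Akka stream stage rather than over in-memory lists.

object SyncAlgoSketch {
  // An element of a folder structure, identified by its relative path.
  final case class FsElement(path: String, lastModified: Long)

  // Illustrative operation types: the actions needed to keep the structures in sync.
  sealed trait SyncOperation
  final case class Create(element: FsElement) extends SyncOperation    // only in source
  final case class Overwrite(element: FsElement) extends SyncOperation // changed in source
  final case class Remove(element: FsElement) extends SyncOperation    // only in destination

  // Both lists are expected to be sorted by path, mirroring the defined
  // order in which the stream sources emit their elements.
  def compare(src: List[FsElement], dst: List[FsElement]): List[SyncOperation] =
    (src, dst) match {
      case (Nil, Nil)     => Nil
      case (s :: st, Nil) => Create(s) :: compare(st, Nil)
      case (Nil, d :: dt) => Remove(d) :: compare(Nil, dt)
      case (s :: st, d :: dt) =>
        if (s.path < d.path) Create(s) :: compare(st, dst)      // missing in destination
        else if (s.path > d.path) Remove(d) :: compare(src, dt) // missing in source
        else if (s.lastModified != d.lastModified)
          Overwrite(s) :: compare(st, dt)                       // same path, different state
        else compare(st, dt)                                    // identical, nothing to do
    }
}

For instance, comparing List(FsElement("a", 1), FsElement("b", 2)) with List(FsElement("b", 1), FsElement("c", 1)) yields Create(a), Overwrite(b), and Remove(c).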

License

Stream Sync is available under the [Apache 2.0 License](http://www.apache.org/licenses/LICENSE-2.0.html).
