Giter VIP home page Giter VIP logo

azure-searchindex-data-exporter's Introduction

page_type languages name description products urlFragment
sample
csharp
Export data from an Azure Cognitive Search index
Export data from an Azure Cognitive Search service. This example builds a C# Console Application using the Azure Cognitive Search .NET SDK.
azure
azure-cognitive-search
export-data

Export Azure Cognitive Search service index data

Flask sample MIT license badge

Export data from an Azure Cognitive Search service. This .NET application runs on the command line.

Prerequisites

Setup

  1. Clone or download this sample repository.

  2. Extract contents if the download is a zip file. Make sure the files are read-write.

Run the sample

  1. Run the app locally using Visual Studio or dotnet run

  2. There are 4 commands in the app

    1. get-bounds
    2. partition-index
    3. export-partitions
    4. export-continuous

These commands support two different strategies for exporting data from the index

  1. Partitioned export. Documents in the index are split into smaller partitions that can be concurrently exported into JSON files.
  2. Continuous export. An additional field is added to your index to track export progress, and is continually updated as more documents are exported.

These strategies have different tradeoffs. You should use partitioned export when:

  • You have a sortable and filterable field that can be used to partition the documents in the index.
  • You are not updating any documents in the index, or you are not updating the documents in the index you want to export.
  • You have a large number of documents. Partitioned export supports exporting more than 1000 documents concurrently. Export speed depends on how your search service is provisioned.

You should use continuous export when:

Partitioned export commands

get-bounds

The get-bounds command is used to find the smallest and largest values of a sortable and filterable field in the index. This is used to determine how to split up the documents in the index into smaller partitions

dotnet run get-bounds

Description:
  Find and display the largest and lowest value for the specified field. Used to determine how to partition index data for export

Usage:
  export-data get-bounds [options]

Options:
  --endpoint <endpoint> (REQUIRED)      Endpoint of the search service to export data from
  --admin-key <admin-key> (REQUIRED)    Admin key to the search service to export data from
  --index-name <index-name> (REQUIRED)  Name of the index to export data from
  --field-name <field-name> (REQUIRED)  Name of field used to partition the index data. This field must be filterable and sortable.
  -?, -h, --help                        Show help and usage information

Sample usage:

 dotnet run get-bounds --endpoint https://example.search.windows.net --admin-key AAAAAAA --index-name my-index --field-name date

Lower Bound 1969-12-31T16:11:38.0000000+00:00
Upper Bound 2022-11-06T12:14:21.0000000+00:00

In this example, date is a Edm.DateTimeOffset with the sortable and filterable attributes applied. The lowest possible value in the index for this field is 1969/12/31 and the highest possible value in the index for this field is 2011/11/06.

partition-index

The partition-index command is used to divide the index into smaller partitions.

Description:
  Partitions the data in the index between the upper and lower bound values into partitions with at most 100,000 documents.

Usage:
  export-data partition-index [options]

Options:
  --endpoint <endpoint> (REQUIRED)      Endpoint of the search service to export data from. Example:
                                        https://example.search.windows.net
  --admin-key <admin-key> (REQUIRED)    Admin key to the search service to export data from
  --index-name <index-name> (REQUIRED)  Name of the index to export data from
  --field-name <field-name> (REQUIRED)  Name of field used to partition the index data. This field must be filterable and sortable.
  --lower-bound <lower-bound>           Smallest value to use to partition the index data. Defaults to the smallest value in the
                                        index. []
  --upper-bound <upper-bound>           Largest value to use to partition the index data. Defaults to the largest value in the
                                        index. []
  --partition-path <partition-path>     Path of the file with JSON description of partitions. Should end in .json. Default is <index
                                        name>-partitions.json []
  -?, -h, --help                        Show help and usage information

Sample usage:

dotnet run partition-index --endpoint https://example.search.windows.net --admin-key AAAAAAA --index-name my-index --field-name date

Wrote partitions to my-index-partitions.json

In this case, my-index-partitions.json has a JSON description of the partitions inside the index

{
  "endpoint": "https://example.search.windows.net",
  "indexName": "my-index",
  "fieldName": "date",
  "totalDocumentCount": 500000,
  "partitions": [
    {
      "upperBound": "1976-08-09T12:41:58.375+00:00",
      "lowerBound": "1969-12-31T16:11:38+00:00",
      "documentCount": 62382,
      "filter": "date ge 1969-12-31T16:11:38.0000000+00:00 and date le 1976-08-09T12:41:58.3750000+00:00"
    },
    // more partitions in the same format as above
  ]

The JSON file contains metadata about the index and the partitions it created, such as total document count and partition field name. The partitions field lists all the filters used to retrieve the partitions using pagination.

export-partitions

The export-partitions command is used to export the partitions created by partition-index into JSON files.

Description:
  Exports data from a search index using a pre-generated partition file from partition-index

Usage:
  export-data export-partitions [options]

Options:
  --partition-path <partition-path> (REQUIRED)     Path of the file with JSON description of partitions. Should end in .json.
  --admin-key <admin-key> (REQUIRED)               Admin key to the search service to export data from
  --export-path <export-path>                      Directory to write JSON Lines partition files to. Every line in the partition
                                                   file contains a JSON object with the contents of the Search document. Format of
                                                   file names is <index name>-<partition id>-documents.json [default: .]
  --concurrent-partitions <concurrent-partitions>  Number of partitions to concurrently export. Default is 2 [default: 2]
  --page-size <page-size>                          Page size to use when running export queries. Default is 1000 [default: 1000]
  --include-partition <include-partition>          List of partitions by index to include in the export. Example:
                                                   --include-partition 0 --include-partition 1 only runs the export on first 2
                                                   partitions []
  --exclude-partition <exclude-partition>          List of partitions by index to exclude from the export. Example:
                                                   --exclude-partition 0 --exclude-partition 1 runs the export on every partition
                                                   except the first 2 []
  --include-field <include-field>                  List of fields to include in the export. Example: --include-field field1
                                                   --include-field field2. []
  --exclude-field <exclude-field>                  List of fields to exclude in the export. Example: --exclude-field field1
                                                   --exclude-field field2. []
  -?, -h, --help                                   Show help and usage information

Sample usage:

dotnet run export-partitions --partition-path my-index-partitions.json --admin-key AAAAAAA --export-path C:\Users\MyAccount\output --concurrent-partitions 8
Starting partition 2
Starting partition 1
Starting partition 0
Starting partition 3
Starting partition 7
Starting partition 4
Starting partition 5
Starting partition 6
Ended partition 4
Ended partition 6
Ended partition 3
Ended partition 0
Ended partition 7
Ended partition 2
Ended partition 1
Ended partition 5

The export-partitions command was run on partitions in the my-index-partitions.json file, which was output by the previous partition-index command. --concurrent-partitions was set to 8, so 8 partitions in this file were loaded into JSON files concurrently. This number can be changed to customize parallelization. Higher numbers increase load on the search service but complete the export more quickly. Lower numbers use less resources, but take a longer time to complete the export.

1 JSON file per partition is output, with the file name formatted as index-partition_index-documents.json. The output JSONL files have 1 JSON object per line, corresponding to a single search document. All fields marked as retrievable are exported by default. Fields can be either explicitly included using --include-field, or explicitly excluded using --exclude-field.

Example output in index-0-documents.json:

{"id":"document-1", "text": "first document", "date":"1969-12-31T16:11:38Z"},
{"id":"document-2","text": "second document", "date":"1969-12-31T17:05:39Z"},
...

Continuous export commands

export-continuous

The export-continuous command starts finding documents that have not been exported and writes them into a JSON file

Description:
  Exports data from a search service by adding a column to track which documents have been exported and continually updating it

Usage:
  export-data export-continuous [options]

Options:
  --endpoint <endpoint> (REQUIRED)         Endpoint of the search service to export data from. Example:
                                           https://example.search.windows.net
  --admin-key <admin-key> (REQUIRED)       Admin key to the search service to export data from
  --index-name <index-name> (REQUIRED)     Name of the index to export data from
  --export-field-name <export-field-name>  Name of the Edm.Boolean field the continuous export process will update to track which
                                           documents have been exported. Default is 'exported' [default: exported]
  --page-size <page-size>                  Page size to use when running export queries. Default is 1000 [default: 1000]
  --export-path <export-path>              Path to write JSON Lines file to. Every line in the file contains a JSON object with the
                                           contents of the Search document. Format of file is <index name>-documents.json []
  --include-field <include-field>          List of fields to include in the export. Example: --include-field field1 --include-field
                                           field2. []
  --exclude-field <exclude-field>          List of fields to exclude in the export. Example: --exclude-field field1 --exclude-field
                                           field2. []
  -?, -h, --help                           Show help and usage information

Sample usage:

dotnet run export-continuous --endpoint https://example.search.windows.net --admin-key AAAA --index-name my-index   

1 JSON file is output, with the file name formatted as my-index-documents.json. The output JSONL file has 1 JSON object per line, corresponding to a single search document. All fields marked as retrievable are exported by default, except the field used to track if the document was exported or not. Fields can be either explicitly included using --include-field, or explicitly excluded using --exclude-field. If the export is cancelled, it is resumed where it left off.

Duplicate documents may be included in the exported data. If the search service has multiple replicas, a best-effort attempt is made to use the same replica to ensure consistent export results. There may also be a delay in updating already exported documents, so documents may be exported more than once. Storage usage also increases as additional data is added to the index. If duplicate documents or storage limits are an issue, partitioned export is recommended.

Example output in my-index-documents.json:

{"id":"document-1", "text": "first document"},
{"id":"document-2","text": "second document"},

Next steps

You can learn more about Azure Cognitive Search on the official documentation site.

azure-searchindex-data-exporter's People

Contributors

farzammohammadi avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.