
schranz-search / schranz-search


A search abstraction over different search engines, written in PHP. Currently implemented: Elasticsearch, Opensearch, Algolia, Meilisearch, RediSearch, Solr, Typesense. Documentation: https://schranz-search.github.io/schranz-search/

Home Page: https://schranz-search.github.io/schranz-search/

License: MIT License

algolia apache-solr elasticsearch help-wanted meilisearch opensearch php phpseal redisearch schranz-search search search-abstraction search-client typesense

schranz-search's Introduction

Schranz Search Logo with a Seal on it with a magnifying glass

Schranz Search

Monorepository for SEAL, a Search Engine Abstraction Layer with support for different search engines
Documentation | Packages

Elasticsearch | Opensearch | Meilisearch | Algolia | Loupe | Solr | Redisearch | Typesense
PHP | Symfony | Laravel | Spiral | Mezzio | Yii



👋 Introduction

The SEAL project is a PHP library designed to simplify the process of interacting with different search engines. It provides a straightforward interface that enables users to communicate with various search engines, including Elasticsearch, Opensearch, Meilisearch, Algolia, Loupe, Solr, RediSearch and Typesense.

It also provides integration packages for the Symfony, Laravel, Spiral, Mezzio and Yii PHP frameworks.

It is worth noting that the project draws inspiration from the Doctrine and Flysystem projects. These two projects have been a great inspiration in the development of SEAL, as they provide excellent examples of how to create consistent and user-friendly APIs for complex systems.

Note: This project is heavily under development and any feedback is greatly appreciated.

🏗️ Structure

SEAL Structure overview

SEAL provides a basic abstraction layer to add, remove, search and filter documents. The main class and service handling this is called Engine, which is responsible for all of these things. The required Schema defines the different Indexes and their Fields.

The project provides different Adapters which the Engine uses to communicate with the different search engine software and services. This makes it easy to switch between different search engines.

Glossary

Term Definition
Engine The main class and service responsible for providing the basic interface to add, remove, search and filter documents.
Schema Defines the different Indexes and their Fields; for every field a specific type needs to be defined, along with what you want to do with it via flags like searchable, filterable and sortable.
Adapter Provides the communication between the Engine and the search engine software or service.
Documents A structure of data that you want to index; it needs to follow the structure of the fields of the index schema.
Search Engine The search engine software or service where the data will actually be stored. Currently Meilisearch, Opensearch, Elasticsearch, Algolia, RediSearch, Solr and Typesense are supported.
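
To make the glossary concrete, here is a minimal sketch of how the pieces fit together. The MemoryAdapter class name and the saveDocument()/createSearchBuilder() methods are assumptions for illustration (only Engine, Schema, Index and Field are taken from the examples in this repository); see the documentation for the exact API.

<?php

use Schranz\Search\SEAL\Adapter\Memory\MemoryAdapter; // assumed namespace of the in-memory adapter
use Schranz\Search\SEAL\Engine;
use Schranz\Search\SEAL\Schema\Field;
use Schranz\Search\SEAL\Schema\Index;
use Schranz\Search\SEAL\Schema\Schema;

// The Schema defines the "news" index and its fields.
$schema = new Schema([
    'news' => new Index('news', [
        'id' => new Field\IdentifierField('id'),
        'title' => new Field\TextField('title'),
        'tags' => new Field\TextField('tags', multiple: true, filterable: true),
    ]),
]);

// The Adapter connects the Engine to a concrete search engine (in-memory here).
$engine = new Engine(new MemoryAdapter(), $schema);

// Add a document, then search it back (method names are illustrative).
$engine->saveDocument('news', ['id' => '1', 'title' => 'New Blog', 'tags' => ['tech']]);

$documents = $engine->createSearchBuilder()
    ->addIndex('news')
    ->getResult();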

📖 Installation and Documentation

The documentation is available at https://schranz-search.github.io/schranz-search/. It is the recommended way to start using the library and guides you step by step through all of its features.

📦 Packages

Full list of packages provided by the SEAL project:

Also have a look at the following tags:

🦑 Similar Projects

The following projects targeted a similar problem in the past:

📩 Authors

schranz-search's People

Contributors

alexander-schranz, butschster, dependabot[bot], dhirtzbruch, jmsche, keichinger, ker0x, mdevster, toshy


schranz-search's Issues

[Algolia] Multi Index Search support

Currently the AlgoliaAdapter implementation can search only in one index.

This should be changed if possible:

if (count($search->indexes) !== 1) {
    throw new \RuntimeException('Algolia does not support multiple indexes in one query.');
}

$index = $this->client->initIndex($search->indexes[\array_key_first($search->indexes)]->name);

It seems like Algolia has the possibility to search against multiple indexes, but it only returns a separate result per index. This makes it not usable like Elasticsearch or Opensearch, where a mixed result is possible.

Docs Multi Search: https://www.algolia.com/doc/api-reference/api-methods/multiple-queries/

Responses from @algolia to this issue: https://twitter.com/chuckm/status/1611372282525667334
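
If multi-index support were still wanted, one option (a rough sketch, not the adapter's current implementation) would be to fire one query per index via Algolia's multiple-queries endpoint and flatten the per-index hits on the client side. The $searchTerm variable is assumed here, and a combined relevance ranking across indexes is still not possible this way:

// Sketch only: one query per index, results merged without cross-index ranking.
$queries = [];
foreach ($search->indexes as $index) {
    $queries[] = [
        'indexName' => $index->name,
        'query' => $searchTerm, // assumed variable holding the search term
    ];
}

$response = $this->client->multipleQueries($queries);

$hits = [];
foreach ($response['results'] as $result) {
    foreach ($result['hits'] as $hit) {
        $hits[] = $hit;
    }
}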

[Algolia] Multiple Sortby fields not supported

Currently we do not support sorting documents by multiple sortBy fields in the Algolia adapter. The reason is that in Algolia we need to create an own replica index with a custom ranking for each sortable field.

$replicas = [];
foreach ($index->sortableFields as $field) {
    foreach (['asc', 'desc'] as $direction) {
        $replicas[] = $index->name . '__' . \str_replace('.', '_', $field) . '_' . $direction;
    }
}

$attributes = [
    'searchableAttributes' => $index->searchableFields,
    'attributesForFaceting' => $index->filterableFields,
    'replicas' => $replicas,
];

$indexResponse = $searchIndex->setSettings($attributes);

foreach ($index->sortableFields as $field) {
    foreach (['asc', 'desc'] as $direction) {
        $searchIndex = $this->client->initIndex(
            $index->name . '__' . \str_replace('.', '_', $field) . '_' . $direction
        );

        $searchIndex->setSettings([
            'ranking' => [
                $direction . '(' . $field . ')',
            ],
        ]);
    }
}

And so we can only search by one specific sorted index:

$sortByField = \array_key_first($search->sortBys);

if ($sortByField) {
    $indexName .= '__' . \str_replace('.', '_', $sortByField) . '_' . $search->sortBys[$sortByField];
}

[Core] Make non-text fields searchable or not

At the current state, all kinds of types can be searchable (text, int, float, bool, date, ...).

Some search engines like Typesense do not support this (#96). So the general question is whether the abstraction should support it. That would require extending the Marshaller so it can, via a flag, create an additional searchable text field for int, float, bool and date fields.
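
A rough sketch of what that Marshaller step could do for a non-text field flagged as searchable (the `.searchable` suffix is made up for illustration):

// Sketch: when marshalling, mirror a non-text searchable field into an
// additional text field so engines like Typesense can still search it.
$document = ['commentsCount' => 42];

$marshalled = $document;
$marshalled['commentsCount.searchable'] = (string) $document['commentsCount'];

// When unmarshalling a hit, the extra '.searchable' field would simply be dropped again.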

Related changes:

[Core] Provide Schema Loaders for supporting different Formats

The Schema is currently created via PHP, but maybe, like in Doctrine, it makes sense to allow creating a Schema via other formats like XML, YAML or JSON?

Current PHP Schema:

<?php

use Schranz\Search\SEAL\Schema\Field;
use Schranz\Search\SEAL\Schema\Index;
use Schranz\Search\SEAL\Schema\Schema;

return new Schema([
    'news' => new Index('news', [
        'id' => new Field\IdentifierField('id'),
        'title' => new Field\TextField('title'),
        'header' => new Field\TypedField('header', 'type', [
            'image' => [
                'media' => new Field\IntegerField('media'),
            ],
            'video' => [
                'media' => new Field\TextField('media', searchable: false),
            ],
        ]),
        'article' => new Field\TextField('article'),
        'blocks' => new Field\TypedField('blocks', 'type', [
            'text' => [
                'title' => new Field\TextField('title'),
                'description' => new Field\TextField('description'),
                'media' => new Field\IntegerField('media', multiple: true),
            ],
            'embed' => [
                'title' => new Field\TextField('title'),
                'media' => new Field\TextField('media', searchable: false),
            ],
        ], multiple: true),
        'footer' => new Field\ObjectField('footer', [
            'title' => new Field\TextField('title'),
        ]),
        'created' => new Field\DateTimeField('created'),
        'commentsCount' => new Field\IntegerField('commentsCount'),
        'rating' => new Field\FloatField('rating', sortable: true),
        'comments' => new Field\ObjectField('comments', [
            'email' => new Field\TextField('email', searchable: false),
            'text' => new Field\TextField('text'),
        ], multiple: true),
        'tags' => new Field\TextField('tags', multiple: true, filterable: true),
        'categoryIds' => new Field\IntegerField('categoryIds', multiple: true, filterable: true),
    ]),
]);

Which other formats should be supported:

  • YAML?
  • JSON?
  • XML?

Is the Core responsible for these formats, or should the Core not care about additional formats? For example, the Symfony bundle could use the Symfony configuration tree instead, which supports YAML, XML, PHP array and PHP builder configuration out of the box.

But that would mean the definition would be framework specific, so own schema loaders which are not framework specific might make more sense.

Maybe other formats than the current PHP representation are not even required?

In Doctrine, DBAL is also not responsible for these formats; the ORM provides something for it. So maybe the SEAL core should also not handle different formats or schema loaders, and a future ODM would handle that instead, but that is another ticket (#81).

It should also be kept in mind whether providing different formats improves the developer experience or even hurts it, as some answers on Stack Overflow, GitHub discussions, ... are written in just one format and not in another.
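
As a point of discussion, a hypothetical YAML representation of a part of the PHP schema above could look like this (this format does not exist yet; it is only a sketch):

# Hypothetical YAML schema loader format (not implemented)
news:
    fields:
        id:
            type: identifier
        title:
            type: text
        rating:
            type: float
            sortable: true
        tags:
            type: text
            multiple: true
            filterable: true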

[Algolia] Cleanup Test Indices

Currently the indices in our tests do not seem to be cleaned up correctly, which ends up with more and more indexes in Algolia for our own account.

Currently I'm running manually:

foreach ($this->client->listIndices()['items'] as $value) {
    var_dump($value['name']);

    try {
        $this->client->initIndex($value['name'])
            ->delete();
    } catch (\Exception $e) {
        var_dump('Errored ... ' . $value['name']);
        // ignore
    }
}

exit;

from time to time.

This cleanup should maybe be automated.

Algolia Support Ticket: https://support.algolia.com/hc/en-us/requests/540200?page=1
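
One way to automate this (a sketch only, reusing the client calls from the snippet above) would be to give test indices a common prefix and delete everything carrying that prefix in a tearDown step; the 'test_' prefix below is an assumption:

// Sketch: delete all leftover indices created by the test suite.
// Assumes test indices share a common prefix such as 'test_'.
$prefix = 'test_';

foreach ($this->client->listIndices()['items'] as $item) {
    if (!\str_starts_with($item['name'], $prefix)) {
        continue;
    }

    try {
        $this->client->initIndex($item['name'])->delete();
    } catch (\Exception $e) {
        // ignore indices that are already gone
    }
}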

[Core] Aliases fields to reduce fields count

If we have a document like this:

{
    "id": "1",
    "title": "New Blog",
    "article": "<article><h2>Some Subtitle<\/h2><p>A html field with some content<\/p><\/article>",
    "blocks": [
        {
            "type": "text",
            "title": "Title",
            "description": "<p>Description<\/p>"
        },
        {
            "type": "text",
            "title": "Title 2",
            "description": "<p>Description 2<\/p>"
        },
        {
            "type": "embed",
            "title": "Video"
        },
        {
            "type": "quote",
            "title": "Quote",
            "description": "<p>Some quote and more<\/p>"
        }
    ],
    "footer": {
        "title": "New Footer"
    }
}

Currently we are mapping each field into its own field.

[
    'id' => new Field\IdentifierField('id'),
    'title' => new Field\TextField('title'),
    'article' => new Field\TextField('article'),
    'blocks' => new Field\TypedField('blocks', 'type', [
        'text' => [
            'title' => new Field\TextField('title'),
            'description' => new Field\TextField('description'),
        ],
        'embed' => [
            'title' => new Field\TextField('title'),
        ],
        'quote' => [
            'title' => new Field\TextField('title'),
            'description' => new Field\TextField('description'),
        ],
    ], multiple: true),
    'footer' => new Field\ObjectField('footer', [
        'title' => new Field\TextField('title'),
    ]),
];

This means that in complex systems like a CMS a lot of fields are created, or a lot of conversions are required to map all fields into a few collected fields. I think it would be a great addition to make this possible out of the box via some alias field type. Let's say we map all content data into an own content field, e.g.:

[
    'id' => new Field\IdentifierField('id'),
    'title' => new Field\TextField('title'),
    'block_titles' => new Field\TextField('block_titles', multiple: true),
    'content' => new Field\TextField('content', multiple: true),
    'article' => new Field\AliasField('content'),
    'blocks' => new Field\TypedField('blocks', 'type', [
        'text' => [
            'title' => new Field\AliasField('block_titles'),
            'description' => new Field\AliasField('content'),
        ],
        'embed' => [
            'title' => new Field\AliasField('block_titles'),
        ],
        'quote' => [
            'title' => new Field\AliasField('block_titles'),
            'description' => new Field\AliasField('content'),
        ],
    ], multiple: true),
    'footer' => new Field\ObjectField('footer', [
        'title' => new Field\AliasField('content'),
    ]),
];

This way the Marshaller would be required to create a document like this:

{
    "id": "1",
    "title": "New Blog",
    "block_titles": [
        "Title",
        "Title 2",
        "Video",
        "Quote"
    ],
    "content": [
        "<article><h2>Some Subtitle<\/h2><p>A html field with some content<\/p><\/article>",
        "<p>Description<\/p>",
        "<p>Description 2<\/p>",
        "<p>Some quote and more<\/p>",
        "New Footer"
    ]
}

The question which comes up here is whether we should always save the original JSON in a _raw field, so we get it back in the same way we set it. This would also mean that the documents inside the search indexes get a little bigger: Elasticsearch, for example, already saves the whole doc in _source, and with an additional _raw field the _source would be about twice as big as without it. Example:

Current JSON size: 494
Squashed JSON size: 265
Squashed with raw JSON size: 767

/cc @chirimoya @wachterjohannes

[Elasticsearch] Support Async HTTP Client

Currently only the sync HTTP client is supported:

throw new \RuntimeException('Currently only synchronous client is supported.');

It should be possible to also use the async client. For this, the Indexer, SchemaManager and Searcher need to be updated, as currently only the sync Elasticsearch response object is supported and not the async Promise.
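
A rough idea (only a sketch; SEAL's actual task classes may look different) would be to return a task object that wraps the async client's promise and only blocks when the caller explicitly waits for the result:

use Http\Promise\Promise;

// Sketch: a task wrapping the async client's promise; it does not block
// inside the adapter, only when wait() is called.
final class AsyncTask
{
    public function __construct(
        private Promise $promise,
    ) {
    }

    public function wait(): mixed
    {
        return $this->promise->wait();
    }
}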

[Solr] Set uniqueKey to configured value instead of map id

Currently we need to map the uuid to id in our tests. This should not be required; we should set the uniqueKey correctly instead.

Sadly this is currently not possible via Solr Schema API: see https://issues.apache.org/jira/browse/SOLR-7242

[Core] Add test case when adding fields after first document

The first document in the TestingHelper should not contain all fields existing in the index. This way we make sure that we don't get a regression for complex documents which do not provide all fields inside the first document.

The first document in the TestingHelper should just be a document with a uuid and nothing else.

[Docs] The "Getting Started" section does not clearly indicate if an adapter is optional or required

Adapters are introduced by the following sentence:

The project provides adapters to different search engines. Choice the one which fits your needs best:

https://github.com/schranz-search/schranz-search/blob/0.1/docs/getting_started/index.rst#installation

The section does not include the fact that an adapter is a requirement. But the following section ("Configure Engine") requires an adapter:

It requires an instance of the Adapter which we did install before.

https://github.com/schranz-search/schranz-search/blob/0.1/docs/getting_started/index.rst#configure-engine

[Elasticsearch / Opensearch] Move Normalize and Denormalize of Elasticsearch and Opensearch to a general place

Currently duplicated code exists in #21 and #20:

/**
 * @param AbstractField[] $fields
 * @param array<string, mixed> $document
 *
 * @return array<string, mixed>
 */
private function normalizeDocument(array $fields, array $document): array
{
    $normalizedDocument = [];

    foreach ($fields as $name => $field) {
        if (!\array_key_exists($field->name, $document)) {
            continue;
        }

        if ($field->multiple && !\is_array($document[$field->name])) {
            throw new \RuntimeException('Field "' . $field->name . '" is multiple but value is not an array.');
        }

        match (true) {
            $field instanceof Field\ObjectField => $normalizedDocument[$name] = $this->normalizeObjectFields($document[$field->name], $field),
            $field instanceof Field\TypedField => $normalizedDocument = \array_replace($normalizedDocument, $this->normalizeTypedFields($name, $document[$field->name], $field)),
            default => $normalizedDocument[$name] = $document[$field->name],
        };
    }

    return $normalizedDocument;
}

/**
 * @param array<string, mixed> $document
 *
 * @return array<string, mixed>
 */
private function normalizeObjectFields(array $document, Field\ObjectField $field): array
{
    if (!$field->multiple) {
        return $this->normalizeDocument($field->fields, $document);
    }

    $documents = [];
    foreach ($document as $data) {
        $documents[] = $this->normalizeDocument($field->fields, $data);
    }

    return $documents;
}

/**
 * @param array<string, mixed> $document
 *
 * @return array<string, mixed>
 */
private function normalizeTypedFields(string $name, array $document, Field\TypedField $field): array
{
    $normalizedFields = [];

    if (!$field->multiple) {
        $document = [$document];
    }

    foreach ($document as $originalIndex => $data) {
        /** @var string|null $type */
        $type = $data[$field->typeField] ?? null;
        if ($type === null || !\array_key_exists($type, $field->types)) {
            throw new \RuntimeException('Expected type field "' . $field->typeField . '" not found in document.');
        }

        $typedFields = $field->types[$type];

        $normalizedData = \array_replace([
            '_type' => $type,
            '_originalIndex' => $originalIndex,
        ], $this->normalizeDocument($typedFields, $data));

        if ($field->multiple) {
            $normalizedFields[$name . '.' . $type][] = $normalizedData;

            continue;
        }

        $normalizedFields[$name . '.' . $type] = $normalizedData;
    }

    return $normalizedFields;
}

/**
 * @param Index[] $indexes
 * @param array<string, mixed> $searchResult
 *
 * @return array<string, mixed>
 */
private function hitsToDocuments(array $indexes, array $hits): \Generator
{
    $indexesByInternalName = [];
    foreach ($indexes as $index) {
        $indexesByInternalName[$index->name] = $index;
    }

    foreach ($hits as $hit) {
        $index = $indexesByInternalName[$hit['_index']] ?? null;
        if ($index === null) {
            throw new \RuntimeException('SchemaMetadata for Index "' . $hit['_index'] . '" not found.');
        }

        $denormalizedDocument = $this->denormalizeDocument($index->fields, $hit['_source']);

        yield $denormalizedDocument;
    }
}

/**
 * @param AbstractField[] $fields
 * @param array<string, mixed> $normalizedDocument
 *
 * @return array<string, mixed>
 */
private function denormalizeDocument(array $fields, array $normalizedDocument): array
{
    $denormalizedDocument = [];

    foreach ($fields as $name => $field) {
        if (!\array_key_exists($name, $normalizedDocument) && !$field instanceof Field\TypedField) {
            continue;
        }

        match (true) {
            $field instanceof Field\ObjectField => $denormalizedDocument[$field->name] = $this->denormalizeObjectFields($normalizedDocument[$name], $field),
            $field instanceof Field\TypedField => $denormalizedDocument = \array_replace($denormalizedDocument, $this->denormalizeTypedFields($name, $normalizedDocument, $field)),
            default => $denormalizedDocument[$field->name] = $normalizedDocument[$name] ?? ($field->multiple ? [] : null),
        };
    }

    return $denormalizedDocument;
}

/**
 * @param array<string, mixed> $document
 *
 * @return array<string, mixed>
 */
private function denormalizeTypedFields(string $name, array $document, Field\TypedField $field): array
{
    $denormalizedFields = [];

    foreach ($field->types as $type => $typedFields) {
        if (!isset($document[$name . '.' . $type])) {
            continue;
        }

        $dataList = $field->multiple ? $document[$name . '.' . $type] : [$document[$name . '.' . $type]];

        foreach ($dataList as $data) {
            $denormalizedData = \array_replace([$field->typeField => $type], $this->denormalizeDocument($typedFields, $data));

            if ($field->multiple) {
                /** @var string|int|null $originalIndex */
                $originalIndex = $data['_originalIndex'] ?? null;
                if ($originalIndex === null) {
                    throw new \RuntimeException('Expected "_originalIndex" field not found in document.');
                }

                $denormalizedFields[$name][$originalIndex] = $denormalizedData;

                continue;
            }

            $denormalizedFields[$name] = $denormalizedData;
        }
    }

    return $denormalizedFields;
}

/**
 * @param array<string, mixed> $document
 *
 * @return array<string, mixed>
 */
private function denormalizeObjectFields(array $document, Field\ObjectField $field): array
{
    if (!$field->multiple) {
        return $this->denormalizeDocument($field->fields, $document);
    }

    $documents = [];
    foreach ($document as $data) {
        $documents[] = $this->denormalizeDocument($field->fields, $data);
    }

    return $documents;
}
(The second copy of these methods, in the other adapter, is identical to the code above, line for line.)

This logic should be moved somewhere into the Seal package.

I think I would move it into a Marshaller (https://en.wikipedia.org/wiki/Marshalling_(computer_science)); maybe we can even abstract it so that there is a TypedMarshaller and an ObjectMarshaller wrapped by a basic Marshaller.
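
A possible shape for such a shared service could be the following (just a sketch of the idea, not a final API):

use Schranz\Search\SEAL\Schema\Field;

// Sketch: a shared Marshaller both adapters could depend on instead of
// duplicating the normalize/denormalize methods shown above.
interface MarshallerInterface
{
    /**
     * @param Field\AbstractField[] $fields
     * @param array<string, mixed> $document
     *
     * @return array<string, mixed>
     */
    public function marshall(array $fields, array $document): array;

    /**
     * @param Field\AbstractField[] $fields
     * @param array<string, mixed> $raw
     *
     * @return array<string, mixed>
     */
    public function unmarshall(array $fields, array $raw): array;
}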

[RediSearch] Why is it just experimental?

I was very hyped when I began this project to see that Redis has a search module, because I think it could be a great lightweight alternative to Elasticsearch / Opensearch and so on.

While working on the implementation of the RediSearch adapter we went with JSON documents, as this is the most common usage on our side. So we are using the JSON and Search modules to store our documents.

So where are the problems? Let's say we want to index a document like this:

{
    "uuid": "3d3586fe-0416-4572-8ce1-c7b6b9e424ff",
    "title": "Hello World - And other Foo Bar things",
    "tag": "some-tag",
    "description": "A simple description about the article which is searchable."
}

For this document we create the following schema:

FT.CREATE news ON JSON PREFIX 1 news SCHEMA $.uuid AS uuid TEXT SORTABLE $.title AS title TEXT $.description AS description TEXT $.tag AS tag TEXT SORTABLE

To write this JSON into our Redis we are using the following code:

JSON.SET news:3d3586fe-0416-4572-8ce1-c7b6b9e424ff $ '{"uuid": "3d3586fe-0416-4572-8ce1-c7b6b9e424ff","title": "Hello World - And other Foo Bar things","tag": "some-tag","description": "A simple description about the article which is searchable."}'

Now, as an example, we want to filter on tag or on another text field like uuid. Here it gets complicated. RediSearch currently has different DIALECTs; reading over them, they mostly created new dialects to provide backward compatibility while still fixing some issues. We decided to go with DIALECT 3, so we are now querying the data:

FT.SEARCH news @uuid:(3d3586fe-0416-4572-8ce1-c7b6b9e424ff) DIALECT 3
FT.SEARCH news @tag:(some-tag) DIALECT 3

No result. Strange. Okay, - is a special symbol, so we need to escape it in the query:

FT.SEARCH news @uuid:(3d3586fe\-0416\-4572\-8ce1\-c7b6b9e424ff) DIALECT 3
FT.SEARCH news @tag:(some\-tag) DIALECT 3

Still no result. So now we remove the document and index it without the - in the tag:

JSON.SET news:3d3586fe-0416-4572-8ce1-c7b6b9e424ff $ '{"uuid": "3d3586fe-0416-4572-8ce1-c7b6b9e424ff","title": "Hello World - And other Foo Bar things","tag": "sometag","description": "A simple description about the article which is searchable."}'

And then search again:

FT.SEARCH news @tag:(sometag) DIALECT 3

And now we get a result. But we manipulated our indexed data, which we didn't want to do, as it should contain the - and should be returned with it.

Some research later: Redis' solution for this problem, which is not usable in our case, is to escape the value when creating the JSON document. Yes, you read correctly: we would need to manipulate our JSON document, which is valid and can be read and written correctly, so that it works with the -. So let's write a document with:

JSON.SET news:3d3586fe-0416-4572-8ce1-c7b6b9e424ff $ '{"uuid": "3d3586fe\\-0416\\-4572\\-8ce1\\-c7b6b9e424ff","title": "Hello World - And other Foo Bar things","tag": "some\\-tag","description": "A simple description about the article which is searchable."}'
FT.SEARCH news @uuid:(3d3586fe\-0416\-4572\-8ce1\-c7b6b9e424ff) DIALECT 3
FT.SEARCH news @tag:(some\-tag) DIALECT 3

This works, but it is not an acceptable solution for us, because reading the document this way:

JSON.GET news:3d3586fe-0416-4572-8ce1-c7b6b9e424ff

will return the JSON with \ before the -.

While RediSearch would have had a lot of potential, this kind of little issue sadly makes it not very usable. As filters are a common use case for searches, these kinds of queries should work without getting strangely manipulated JSON back. The problem with \\ is also that it cannot easily be added to the JSON, as a key could also contain a -, which ends in endless escaping and converting of data on all ends. We also tried to use the TAG type instead of TEXT for the field, which was mentioned in some comment, but it did not work there as expected either and ended in the same issue. This is why, at the current state, I can only mark RediSearch as @experimental, which may end up with us even removing the adapter in the future.

Here are the open related issues on the RediSearch repository:

If somebody knows a solution on the schema side, share it with us so we can fix this issue.

[Docs] Clarification of terminology in the "Getting Started" section of the documentation is missing

The "Getting Started" section of the documentation uses terms that are not described, but the entire section is based on this terminology:

  • Schema
  • Engine
  • Search Engine
  • Indexes
  • Documents

The terms are used in the headlines and content:

Additionally, the use of "engine" and "search engine" without explanation is very confusing or even unclear.


One description can be found:

The Engine is the main class which will be used to communicate with the search engine.

Therefore, I suggest adding a list of terms with descriptions at the top of the page.

[Core] Implementing of Reindex Providers

An index can have one or multiple reindex providers. A general interface and a reindex service would be great, so a general reindex command can be provided which allows reindexing, or dropping/creating and reindexing, all or specific indexes.
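
A sketch of what such an interface could look like (method names and return types are only a proposal here):

// Sketch: each provider announces which index it feeds and yields the
// documents to (re)index, so a generic reindex command can iterate them.
interface ReindexProviderInterface
{
    public function getIndex(): string;

    /**
     * @return \Generator<array<string, mixed>>
     */
    public function provide(): \Generator;

    public function total(): ?int;
}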

[Laravel] Integrate support for Laravel Scout

Add support for Laravel Scout by providing a bridge to it.

This would maybe allow Laravel Scout to use any search engine supported by SEAL, without having to implement the search engines on its own.

Differences between Laravel Scout and SEAL

The main difference between Laravel Scout and SEAL is that in Laravel Scout the mapping, field configuration, ... is optional and there is no abstraction around it. So if you have searchable, filterable, ... fields, those definitions are not defined in an abstract way; instead they have to be defined inside a specific Meilisearch configuration: https://laravel.com/docs/10.x/scout#configuring-filterable-data-for-meilisearch.

So the most difficult part of a seamless integration of a SEAL adapter into Laravel Scout would be that a project using the SEAL adapter still needs to define the mapping.

There are two options for where the mapping could be defined.

Solution A

It is configured inside a configuration file which entity is mapped to which SEAL schema, e.g.:

'schranz_search' => [
    'laravel_scout' => [
        User::class => 'user_index_name',
        Flight::class => 'flight_index_name',
    ],
],

Solution B

We introduce a new kind of metadata loader which allows defining the mapping on the Laravel Eloquent model itself, e.g.:

class User extends Model implements SealModelInterface
{
    use Searchable;
    
    public static function getSealIndex(): Index
    {
        return new Index('blog', [
            'title' => new Field\TextField('title', sortable: true),
            'tags' => new Field\TextField('tags', multiple: true, filterable: true),
        ]);
    }
}

[Core] Bulk Action on Connection and Engine

It should be possible to provide bulk actions on the Connection and the Engine.

$engine->bulk(string $index, \Generator $saveDocuments, \Generator $deleteDocuments, bulkSize: 100);

// and

$indexer->bulk(Index $index, \Generator $saveDocuments, \Generator $deleteDocuments, bulkSize: 100);

Engines not supporting bulk actions can fall back to the basic save and delete methods.

The documents should be able to be an array or a \Generator for performance.

A BulkableIndexerInterface should be added so the fallback to save/delete can be part of the Engine and not every adapter needs to implement it itself.
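
The fallback inside the Engine could roughly look like this (just a sketch; the real Indexer method names may differ):

// Sketch: use the adapter's bulk support when available, otherwise fall
// back to single save/delete calls.
if ($indexer instanceof BulkableIndexerInterface) {
    $indexer->bulk($index, $saveDocuments, $deleteDocuments, bulkSize: 100);
} else {
    foreach ($saveDocuments as $document) {
        $indexer->save($index, $document);
    }

    foreach ($deleteDocuments as $identifier) {
        $indexer->delete($index, $identifier);
    }
}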

Bulk actions maybe make sense in the case of:

Implementing of Reindex Providers #16

[Core] Search over multiple Indexes

At the current state only Elasticsearch and Opensearch support searching over multiple indexes.

The question is whether we should still support it in the abstraction. At the current state it doesn't hurt a lot, but it could feel strange if most of the adapters don't support it.

Framework Integrations

Symfony

  • Init Bundle #112
  • Adapter Integration #112
  • Schema Integration #135
  • Engine Integration #135
  • EngineRegistry Integration #140
  • Prefix config #142
  • Create & Drop CLI Commands #143
  • Move config under seal
  • Publish Package

Laravel

  • Init Module #136
  • Adapter Integration #136
  • Schema Integration #136
  • Engine Integration #136
  • EngineRegistry Integration #136
  • Prefix config #142
  • Create & Drop CLI Commands #143
  • Move config under seal
  • Publish Package
  • Scout Integration? moved to an own issue: #148

Spiral

  • Init Package #141
  • Adapter Integration #141
  • Schema Integration #141
  • Engine Integration #141
  • EngineRegistry Integration #141
  • Prefix config #142
  • Create & Drop CLI Commands #143
  • Move config under seal
  • Publish Package

Not tackling for the release:

  • Mezzio
  • Laminas

[Core] Support for Aggregations

Typical shops, when you search for something, show categories you can use to add additional filters. In that case, mostly only categories which match the current search condition of the user are shown. This kind of thing is called aggregations.

There are different kinds of aggregations which would be nice to support, but it requires a lot of research whether all currently supported search engines support such aggregations:

  • category filter aggregations (matching categories with search term)
  • price range filter aggregations (lowest and highest price)

Questions:

Which search engines support these kinds of aggregations?

Supported:

Maybe:

Should aggregations be part of the search result or a separate call?

Some search engines support aggregations in combination with the search query, while others require the aggregation to be a separate query. Not sure, from the DX point of view, whether the SearchBuilder should allow querying aggregations or whether a separate AggregationBuilder would be better and provide an easier DX.

Which type of aggregations are required?

  • Term Aggregation? (e.g.: category filter)
  • Min/Max Aggregation? (e.g.: price range filter)
  • Average Aggregation (show average price value)
  • ...? What other aggregations exist?
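
To make the question above more concrete, a separate AggregationBuilder could hypothetically look like this (pure sketch, no such API exists yet):

// Hypothetical AggregationBuilder usage (not implemented):
$aggregations = $engine->createAggregationBuilder()
    ->addIndex('products')
    ->addQuery('shoes')
    ->addTermAggregation('category')   // matching categories with counts
    ->addMinMaxAggregation('price')    // lowest and highest price
    ->getResult();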

[Core] Split Connection Read and Write processes into separate services

Currently the connection is responsible for saving, deleting and even searching documents. It maybe makes sense to split this up more. For example, laravel/scout uses the terms Seeker and Indexer to split the read and write processes into separate services.

The Schema should be kept in its own service, as it makes sense to divide schema management from indexing.

This splitting would also make the ReadWrite adapter simpler:

$this->connection = new ReadWriteConnection(
    $this->readAdapter->getConnection(),
    $this->writeAdapter->getConnection(),
);

as it no longer requires a ReadWriteConnection; instead the ReadWriteAdapter just returns on getReader/getSeeker the reader of the read adapter and on getWriter/getIndexer the writer of the write adapter.
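
With that split, the ReadWrite adapter could roughly be reduced to something like this (a sketch; the interface and method names follow the wording above and are not final):

// Sketch: the ReadWriteAdapter forwards reads to the read adapter and
// writes to the write adapter, without a dedicated ReadWriteConnection.
final class ReadWriteAdapter implements AdapterInterface
{
    public function __construct(
        private AdapterInterface $readAdapter,
        private AdapterInterface $writeAdapter,
    ) {
    }

    public function getSeeker(): SeekerInterface // read side
    {
        return $this->readAdapter->getSeeker();
    }

    public function getIndexer(): IndexerInterface // write side
    {
        return $this->writeAdapter->getIndexer();
    }
}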

@wachterjohannes what do you think?

[Core] Add StartsWithCondition and EndsWithCondition

Sometimes we need the possibility to check whether a field starts with a specific value.

This can be interesting to filter documents which are part of a URL tree, e.g. new StartsWithCondition('url', '/parent/');.

Or it can be interesting for a lexicon-based overview which has a query per letter A-Z, e.g.: new StartsWithCondition('url', 'A');.
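
A sketch of what the condition class itself could look like, following the style of the existing conditions (the constructor shape is an assumption):

// Sketch: a filter condition matching documents whose field value starts
// with the given string, e.g. new StartsWithCondition('url', '/parent/').
final class StartsWithCondition
{
    public function __construct(
        public readonly string $field,
        public readonly string $value,
    ) {
    }
}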

Research links

[MemoryAdapter] Support nested Fields in Condition filters

Currently the Memory adapter does not support Conditions on nested fields.

if (\str_contains($filter->field, '.')) {
    throw new \RuntimeException('Nested fields are not supported yet.');
}

A test should be added for nested filters, and the MemoryAdapter / MemoryConnection should be adapted to support this as well.
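
For the MemoryConnection, resolving a nested field could be as simple as walking the dotted path through the document array; a sketch (the helper name is made up):

/**
 * Sketch: resolve a dotted field path like 'footer.title' inside a document array.
 *
 * @param array<string, mixed> $document
 */
function resolveNestedField(array $document, string $field): mixed
{
    $value = $document;

    foreach (\explode('.', $field) as $part) {
        if (!\is_array($value) || !\array_key_exists($part, $value)) {
            return null;
        }

        $value = $value[$part];
    }

    return $value;
}

// resolveNestedField(['footer' => ['title' => 'New Footer']], 'footer.title') === 'New Footer'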

  • Support Nested Fields in:
    • EqualCondition
    • NotEqualCondition
    • ...
  • Add abstract TestCase

Validate the IndexName of ReindexProviders

Got this idea from @kbond: inside a CompilerPass we could maybe already validate whether that index exists and otherwise throw an error. This way we avoid that a wrong name is used for a ReindexProvider.

How this can be implemented outside of Symfony still needs to be checked, as not all frameworks have container compilation.

[Elasticsearch / Opensearch] Use AsyncTask from Apis for save and delete of documents

Currently we are using 'refresh' => 'true' to make sure our tests run as expected. We should refactor this so that the async APIs also work as expected and as fast as possible.

// TODO refresh should be refactored with async tasks
'refresh' => $options['return_slow_promise_result'] ?? false, // update document immediately, so it is available in the `/_search` api directly

  • The AbstractTests should be refactored to support async APIs.
  • Refactor the implementation to AsyncTask pooling, see for example Algolia's internal handling.

This is done: the same is also required for Meilisearch, where the createIndex and dropIndex actions are always async and there is no other way around it.

[Yii] Integration Module

As requested on Twitter, it would also be good to have an integration into the Yii ecosystem. Like for Mezzio, I think the most difficult part could be to create multiple instances of Engine based on configuration, like we are doing in the other frameworks:

Based on the following configuration, two instances of Engine, each with a separate instance of an adapter, should be created:

[
    'schranz_search' => [
        'engines' => [
            'meilisearch_1' => [
                'adapter' => 'meilisearch://127.0.0.1:7700',
            ],
            'meilisearch_2' => [
                'adapter' => 'meilisearch://127.0.0.1:7701',
            ],
        ],
    ],
];

[New package] ODM/ORM based Datamapping to Document classes

The current abstraction requires JSON-like arrays to be indexed and returns those arrays. In the future there should be another package built on top of the seal package which maps PHP classes (document objects) to these arrays and back.

So the seal package will be something like what doctrine/dbal is in Doctrine, and the ODM, or whatever we will call it, will be what doctrine/orm is.

An ODM implementation could look like this:

<?php

use Schranz\Search\SEAL\Schema\Field;
use Schranz\Search\SEAL\Schema\Index;

#[Index(name: 'news')]
class News {
    #[Field\IdentifierField('id')]
    private string $id;
    
    #[Field\TextField('title')]
    private string $title;
    
    #[Field\TypedField('header', 'type', [
        'image' => [
            'media' => new Field\IntegerField('media'),
        ],
        'video' => [
            'media' => new Field\TextField('media', searchable: false),
        ],
    ])]
    private array $header;
    
    // ...
}

[Elasticsearch / Opensearch] Recheck the usage of doc_values for filtering

Currently we are setting doc_values to true for filtering non-indexed fields. This is documented as not being as performant as querying indexed fields: https://www.elastic.co/guide/en/elasticsearch/reference/current/doc-values.html. We maybe need to recheck this.

Interesting is that Opensearch and Elasticsearch have different default values, which made tests fail in Opensearch for #64 (commit) and required setting doc_values to true there:

Opensearch: doc_values: false doc here
Elasticsearch: doc_values: true doc here

Update: setting doc_values: true on Opensearch did not work as expected: opensearch-project/OpenSearch#5770

Currently in Opensearch we work around this by making all filterable fields also indexed. It should be rechecked whether this should also be the behaviour for Elasticsearch. This may require us to switch from a query_string search to a match search where we only put searchable fields into it.

[Mezzio] Integration Module

It would be nice to also support the Mezzio framework. Currently I only had a short look at how this could be integrated, and failed to understand how I can create services based on configuration. I know how to get config inside a factory, but not how the config can control whether one or multiple services are created.

Every service seems to require a factory to be configured at a point where we don't have the config yet (e.g.: https://github.com/schranz-templating/templating/blob/eb28f45eabec42a5738b999030ab8e9e049d4fd7/src/Integration/Mezzio/Latte/ConfigProvider.php#L31-L35). So we can do Engine::class -> MezzioEngineFactory::class, but I'm not sure how to handle this if we want to create, based on configuration, multiple instances of the Engine like we are doing in the other frameworks:

I also had a look at the Doctrine integration but did not understand how it is handled there that multiple instances of the entity manager can exist 🤔.

Index fields configuration duplication

Discussed in #199 @kbond

Originally posted by kbond June 13, 2023
This is a very tiny papercut I had - not a big deal.

Consider this:

use Schranz\Search\SEAL\Schema\Field;
use Schranz\Search\SEAL\Schema\Index;

return new Index('page', [
    'id' => new Field\IdentifierField('id'),
    'title' => new Field\TextField('title'),
    'subtitle1' => new Field\TextField('subtitle1'),
    'subtitle2' => new Field\TextField('subtitle2'),
    'description' => new Field\TextField('description'),
]);

I think it would be nice if we didn't have to set the array key:

use Schranz\Search\SEAL\Schema\Field;
use Schranz\Search\SEAL\Schema\Index;

return new Index('page', [
    new Field\IdentifierField('id'),
    new Field\TextField('title'),
    new Field\TextField('subtitle1'),
    new Field\TextField('subtitle2'),
    new Field\TextField('description'),
]);

I could be missing something, but if interested, feel free to turn this into an issue and I can create a PR.

Originally posted by alexander-schranz June 13, 2023
This is also related to whether we will use the same classes in a future Object Data Mapper package or not. As it could also be:

use Schranz\Search\SEAL\Schema\Field;
use Schranz\Search\SEAL\Schema\Index;

return new Index('page', [
    'id' => new Field\IdentifierField(),
    'title' => new Field\TextField(),
    'subtitle1' => new Field\TextField(),
    'subtitle2' => new Field\TextField(),
    'description' => new Field\TextField(),
]);

The key is currently required inside Index.php class as the items are currently accessed that way.

In a ODM the above example could maybe look like this:

#[Index('page')]
class Test {
    #[Field\IdentifierField()]
    public string $id;
}

What I'm not yet sure about is whether I should support something like Doctrine does with fieldName vs. columnName, which would be something like this:

return new Index('page', [
    'fieldName' => new Field\TextField('columnName'),
]);

At the current state that is not supported, and I'm currently avoiding it to keep complexity out of the mapping of stored data vs. presented data, but the initial interface was created with that in mind and that is why it currently looks like this.

I think it should not hurt performance too much if we add support for int field keys now, but make sure that $index->fields always uses the name, as that is required to access the field. So we could adopt the __construct of Index to do something like this:

$normalizedFields = [];
foreach ($fields as $key => $field) {
    if (\is_int($key)) {
        $key = $field->name;
    }

    $normalizedFields[$key] = $field;
}

$this->fields = $normalizedFields; // set the readonly public property

This also needs to be kept in mind for TypedField and ObjectField, which have a fields configuration as well. If you want to give it a try, feel free to create a pull request for it.

[Elasticsearch / Opensearch / Solr] Optimize Schema for some data by using enabled false on objects

Stumbled over the following thing today in my Twitter stream, via @frankdejonge:

We should make sure to optimize the schema we are using:

https://www.elastic.co/guide/en/elasticsearch/reference/8.6/enabled.html

if there are object fields which definitely don't need to be analyzed in any case.

This should especially also be checked for Solr and our custom _source field there.

Also, where possible, we should force a strict schema by default for all search engines to avoid these kinds of problems.
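
In the Elasticsearch/Opensearch schema managers this would roughly mean setting enabled to false on object properties that only need to be stored, not queried. A sketch of the mapping array (the 'raw' field name is just an example):

// Sketch: an object field that is kept in _source but not indexed/analyzed.
$mappings = [
    'properties' => [
        'raw' => [
            'type' => 'object',
            'enabled' => false, // see the Elasticsearch `enabled` mapping parameter
        ],
    ],
];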

[Symfony] Integration configuration

We currently have two things which we need to configure in the integration bundle for Symfony. In PHP the Engine is created like this:

$engine = new Engine(
    $adapter, // coming from DSN
    $schema // schema comes from different configured directories
);

As an application can have different engines, like Doctrine can have different connections, we also need to keep that configuration in mind.

schranz_search:
    seal:
        prefix: '' # TODO
        schemas:
            app:
                dir: '%kernel.project_dir%/config/schemas/app'
                # added to the first engine (the default engine)
                # a `type: ` key could exist if we maybe support additional schema providers in the future (attributes, xml, json, ..)
            other:
                dir: '%kernel.project_dir%/config/schemas/other'
                engine: special # a bundle may not define the engine itself, so in my project I can point a bundle's config path to a specific engine
        engines:
            default:
                adapter: 'elasticsearch://127.0.0.1:9200'
            other:
                adapter: 'algolia://%env(ALGOLIA_APPLICATION_ID)%:%env(ALGOLIA_ADMIN_API_KEY)%'

@chirimoya @Toflar @bmack what do you think about this configuration?

As a reference, here is the `doctrine.yaml` configuration:
doctrine:
    dbal:
        url: '%env(resolve:DATABASE_URL)%'

        # IMPORTANT: You MUST configure your server version,
        # either here or in the DATABASE_URL env var (see .env file)
        #server_version: '15'
    orm:
        auto_generate_proxy_classes: true
        enable_lazy_ghost_objects: true
        naming_strategy: doctrine.orm.naming_strategy.underscore_number_aware
        auto_mapping: true
        mappings:
            App:
                is_bundle: false
                dir: '%kernel.project_dir%/src/Entity'
                prefix: 'App\Entity'
                alias: App
The `flysystem.yaml` configuration:
flysystem:
    storages:
        default.storage:
            adapter: 'local'
            options:
                directory: '%kernel.project_dir%/%VAR_DIR%/storage/default'
And the `messenger.yaml` configuration:
framework:
    messenger:
        transports:
            my_transport:
                dsn: "%env(MESSENGER_TRANSPORT_DSN)%"
                options:
                    auto_setup: false
            # https://symfony.com/doc/current/messenger.html#transport-configuration
            # async: '%env(MESSENGER_TRANSPORT_DSN)%'
            # failed: 'doctrine://default?queue_name=failed'
            # sync: 'sync://'

        routing:
            # Route your messages to the transports
            # 'App\Message\YourMessage': async
