Comments (4)
Hi @Zloka , thanks for the kind words :)
In principle, there is nothing preventing you from indexing and searching HTML content with MiniSearch. In practice, you need to come up with suitable pre-processing of the content, because MiniSearch by default tokenizes simply by splitting on spaces or punctuation, and only normalizes the case of the text.
Should one be able to search for tags like <div>, or should only the textual content be searchable? If it's the latter, then as you mentioned the best thing to do is to first strip the tags away (or, if you can access the DOM of the pages to index, get the textContent property), then index the textual content.
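To illustrate why raw HTML needs pre-processing, here is a minimal TypeScript sketch. It does not use MiniSearch's actual tokenizer: naiveTokenize and stripTags are hypothetical helpers that only approximate "split on spaces or punctuation, then lowercase" for ASCII text, and the regex-based tag stripping is a simplification (a dedicated HTML-to-text converter, or textContent where a DOM is available, is more robust).

```typescript
// Hypothetical helper: approximates "split on spaces or punctuation, lowercase"
// for ASCII text. This is NOT MiniSearch's real tokenizer, only an illustration.
const naiveTokenize = (text: string): string[] =>
  text
    .split(/[^A-Za-z0-9]+/)
    .filter((token) => token.length > 0)
    .map((token) => token.toLowerCase());

// Hypothetical helper: naive tag stripping with a regex.
const stripTags = (markup: string): string => markup.replace(/<[^>]*>/g, ' ');

// Made-up sample input.
const sampleHtml = '<div class="note"><p>Hello, World!</p></div>';

// Tokenizing raw HTML: tag and attribute names leak into the tokens.
const rawTokens = naiveTokenize(sampleHtml);
// Tokenizing stripped text: only the textual content remains.
const textTokens = naiveTokenize(stripTags(sampleHtml));

console.log(rawTokens); // [ 'div', 'class', 'note', 'p', 'hello', 'world', 'p', 'div' ]
console.log(textTokens); // [ 'hello', 'world' ]
```

With the raw input, a search for "div" or "class" would match nearly every page, which is rarely what is wanted when only the textual content should be searchable.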
Thank you for your prompt reply! I went with your suggestion, and it seems to work well!
A follow-up question came to mind: in my case, I can pre-process the files in a build step ahead of time. I figured this would be desirable, as I can build the MiniSearch index ahead of time and just load it directly on the client side, without having to add all the documents.
While this works fine, the client side of my application is a React application, and I was planning to use your react-minisearch React integration. The useMiniSearch hook seems to accept an array of documents and options, but unless I'm mistaken, I can't find a way to provide a prebuilt index, which would be preferable in my case. I couldn't find an example; do you happen to have any pointers? I can also open an issue in react-minisearch, if you believe that is more appropriate 🙂
To give back, in case someone else has a similar use case, here is a TypeScript script that outlines the rough approach I took to create the MiniSearch index, for future reference:
import { promises as fs } from 'fs';
import { join } from 'path';
import { convert } from 'html-to-text';
import MiniSearch from 'minisearch';

const miniSearchOptions = {
  fields: ['title', 'content'], // fields to index
  storeFields: ['title', 'content'], // fields to return with search results
};

// initialize MiniSearch
let miniSearch = new MiniSearch(miniSearchOptions);

// html-to-text strips all tags. If your tags contain attributes you wish to include, you can do so e.g. using regex
// the following regex captures an attribute called "data-content", so that the value can be included in the text
const regex = /<span[^>]*data-content="([^"]*)"[^>]*>[^<]*<\/span>/g;

// async function to preprocess an HTML file and convert it to plain text
const convertHtmlToText = async (filePath: string): Promise<string> => {
  let html = await fs.readFile(filePath, 'utf-8');
  // replace span tags with data-content attribute value
  html = html.replace(regex, '$1');
  const text = convert(html);
  return text;
};

// async function to process all HTML files in a directory
const processFilesInDirectory = async (dirPath: string) => {
  // read the directory
  const files = await fs.readdir(dirPath);
  // iterate over each file
  for (const file of files) {
    // join the dirPath and file to get the full file path
    const filePath = join(dirPath, file);
    // use the file name as the title, and convert the HTML to plain text
    const title = file;
    const content = await convertHtmlToText(filePath);
    // add document to MiniSearch index
    const document = {
      id: title,
      title,
      content,
    };
    miniSearch.add(document);
  }
};

// async function to load or create the index
const loadOrCreateIndex = async () => {
  const indexPath = 'minisearch-index.json';
  try {
    // check if index file exists
    await fs.access(indexPath);
    // if it exists, load the index from the file
    const json = await fs.readFile(indexPath, 'utf-8');
    miniSearch = MiniSearch.loadJSON(json, miniSearchOptions);
    console.log('Index has been loaded successfully!');
  } catch (err) {
    // if it does not exist (or cannot be loaded), process the HTML files and create the index
    await processFilesInDirectory('./pages');
    // serialize MiniSearch instance
    const json = JSON.stringify(miniSearch);
    // write to a local file
    await fs.writeFile(indexPath, json);
    console.log('Index has been created and stored successfully!');
  }
  const searchOptions = {
    prefix: true, // Allows for prefix search, i.e. searching "polyn" will match "polynomial" and "polynomial expression"
    fuzzy: 2, // Apply fuzzy search (typo tolerance). Can be either a boolean, or a number specifying the accepted Levenshtein distance. E.g. "polymonial" will match "polynomial", even though there is a slight typo.
    boost: { title: 2 }, // Boost the value of certain fields. For example, if the match is in the title of the document, boost the search result to make it appear higher up.
  };
  const searchResult = miniSearch.search('polyn', searchOptions);
  console.log(searchResult);
};

loadOrCreateIndex().catch(console.error);
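The data-content replacement used in the script can be checked in isolation. The sketch below reuses the same pattern; the sample fragment is made up for illustration.

```typescript
// Same pattern as in the script above.
const dataContentPattern = /<span[^>]*data-content="([^"]*)"[^>]*>[^<]*<\/span>/g;

// Hypothetical sample input, made up for illustration.
const sampleFragment =
  '<p>Solve <span class="math" data-content="x^2 + 1">rendered formula</span> for x.</p>';

// Each matching span is replaced by the value of its data-content attribute.
const replacedFragment = sampleFragment.replace(dataContentPattern, '$1');

console.log(replacedFragment); // <p>Solve x^2 + 1 for x.</p>
```

Note that the pattern only matches spans whose attribute value is double-quoted and whose body contains no nested elements; other spans are left untouched and are then stripped by html-to-text as usual.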
from minisearch.
You are right @Zloka , at the moment there is no utility function in react-minisearch to load a pre-built index, so you have to do that directly with MiniSearch. I will have to think about how best to introduce that possibility, as it is definitely feasible. If you want, you can open an issue on react-minisearch, and I will try to spend some time on it when possible.
In general, whether it makes sense to pre-build the index very much depends on the kind of pre-processing needed, and on how frequently the index needs to change. In many cases I have seen, indexing "just in time" is the best solution, as it is much simpler to implement and maintain. That said, since you need pre-processing, if your documents don't change too often it is probably more efficient to pre-build.
Thank you for sharing your code, it is always useful for other people landing on an issue!
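On the point about how frequently the index needs to change: a pre-built index can silently go stale if it is only created when the file is missing. Below is a minimal sketch of a staleness check a build step could use, assuming file modification times are available (for example via fs.stat). needsRebuild is a hypothetical helper, not part of MiniSearch.

```typescript
// Hypothetical helper, not part of MiniSearch: decides whether a pre-built
// index should be regenerated, given modification times in milliseconds.
const needsRebuild = (
  indexMtimeMs: number | null, // null means the index file does not exist yet
  sourceMtimesMs: number[],
): boolean => {
  if (indexMtimeMs === null) return true;
  // Rebuild if any source document is newer than the index.
  return sourceMtimesMs.some((mtime) => mtime > indexMtimeMs);
};

console.log(needsRebuild(null, [100, 200])); // true: no index yet
console.log(needsRebuild(300, [100, 200])); // false: index newer than all sources
console.log(needsRebuild(150, [100, 200])); // true: one source changed after indexing
```

This check does not notice deleted source files; if documents can be removed, comparing the list of file names stored in the index against the directory listing would be needed as well.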
from minisearch.
I will open an issue there! I can work around it myself using only MiniSearch, so there is no hurry in that sense, but perhaps it will help others and make my code cleaner in the future 😇
For potential future readers, here's a draft of how one could implement it using React and MiniSearch:
I decided to create a useMiniSearchInstance hook that is responsible for loading the prebuilt index and setting up the search engine. In my case, I load it from the public directory, but naturally, many other approaches exist:
import { useEffect, useState } from 'react';
import MiniSearch from 'minisearch';
import {
  MiniSearchDocument,
  indexPath,
  miniSearchOptions,
} from '../client-side-search/miniSearchConfig';

const useMiniSearchInstance = () => {
  const [miniSearch, setMiniSearch] =
    useState<MiniSearch<MiniSearchDocument> | null>(null);

  useEffect(() => {
    fetch(`/${indexPath}`)
      .then((response) => response.text())
      .then((data: string) => {
        // You'll probably want some validation here.
        const searchIndex = MiniSearch.loadJSON(data, miniSearchOptions);
        setMiniSearch(searchIndex);
      })
      .catch(console.error);
  }, []);

  return miniSearch;
};

export default useMiniSearchInstance;
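As for the "validation" mentioned in the comment inside the hook, one possible minimal sketch is shown below. validateIndexPayload is a hypothetical helper; it only checks that the fetched text is valid JSON describing an object, and makes no assumptions about MiniSearch's internal serialization format.

```typescript
// Hypothetical helper: basic sanity check before handing the payload to
// MiniSearch.loadJSON. Returns the text unchanged if it looks usable,
// otherwise null.
const validateIndexPayload = (text: string): string | null => {
  try {
    const parsed: unknown = JSON.parse(text);
    const isPlainObject =
      typeof parsed === 'object' && parsed !== null && !Array.isArray(parsed);
    return isPlainObject ? text : null;
  } catch {
    return null; // not valid JSON, e.g. an HTML error page served by mistake
  }
};

console.log(validateIndexPayload('{"some":"index"}') !== null); // true
console.log(validateIndexPayload('<!doctype html>') !== null); // false
```

In the hook, it would also make sense to check response.ok before reading the body, and to call MiniSearch.loadJSON only when the helper returns a non-null value.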
I then created a useMiniSearch hook that interacts with the search engine and exposes some variables. Of course, it is trivial to expose other methods or variables if your use case requires it:
import { SearchResult, Suggestion } from 'minisearch';
import { useEffect, useState } from 'react';
import useMiniSearchInstance from './useMiniSearchInstance';

const searchOptions = {
  prefix: true, // Allows for prefix search, i.e. searching "polyn" will match "Polynomi" and "Polynomin"
  fuzzy: 2, // Apply fuzzy search (typo tolerance). Can be either a boolean, or a number specifying the accepted Levenshtein distance. E.g. "realiluvut" will match "reaaliluvut", even though there is a slight typo.
  boost: { title: 2 }, // Boost the value of certain fields. For example, if the match is in the title of the document, boost the search result to make it appear higher up.
};

const useMiniSearch = () => {
  const [searchInput, setSearchInput] = useState('');
  const [results, setResults] = useState<SearchResult[] | null>(null);
  const [suggestions, setSuggestions] = useState<Suggestion[] | null>(null);
  const miniSearch = useMiniSearchInstance();

  useEffect(() => {
    if (searchInput.length > 0) {
      // Only search once the prebuilt index has finished loading.
      if (miniSearch !== null) {
        const searchResults = miniSearch.search(searchInput, searchOptions);
        const autoSuggestions = miniSearch.autoSuggest(
          searchInput,
          searchOptions,
        );
        setResults(searchResults);
        setSuggestions(autoSuggestions);
      }
    } else {
      // Empty input: clear any previous results.
      setResults(null);
      setSuggestions(null);
    }
  }, [miniSearch, searchInput]);

  return {
    searchInput,
    setSearchInput,
    results,
    suggestions,
  };
};

export default useMiniSearch;
It should then be straightforward to hook searchInput and setSearchInput to an input field, and display results using suggestions and results ✌️
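For the display side, a rough sketch of turning the hook's output into something renderable might look like this. The types below are minimal stand-ins that mirror only the fields used here (search results carry id and score plus any stored fields such as title; suggestions carry suggestion and score). toResultLabels and topSuggestion are hypothetical helpers, and the sample data is made up.

```typescript
// Minimal structural stand-ins for illustration; they cover only the fields
// used below. `title` is present because the index stores it (storeFields).
type ResultLike = { id: string; score: number; title?: string };
type SuggestionLike = { suggestion: string; score: number };

// Hypothetical helper: picks the labels to render in a result list.
const toResultLabels = (results: ResultLike[] | null, limit = 5): string[] =>
  (results ?? []).slice(0, limit).map((r) => r.title ?? r.id);

// Hypothetical helper: picks the best-scoring suggestion, if any.
const topSuggestion = (suggestions: SuggestionLike[] | null): string | null => {
  if (suggestions === null || suggestions.length === 0) return null;
  return suggestions.reduce((best, s) => (s.score > best.score ? s : best))
    .suggestion;
};

// Made-up sample data.
const sampleResults: ResultLike[] = [
  { id: 'a.html', score: 3.2, title: 'a.html' },
  { id: 'b.html', score: 1.1 },
];
const sampleSuggestions: SuggestionLike[] = [
  { suggestion: 'polynomial', score: 2 },
  { suggestion: 'polynomial expression', score: 5 },
];

console.log(toResultLabels(sampleResults)); // [ 'a.html', 'b.html' ]
console.log(topSuggestion(sampleSuggestions)); // polynomial expression
console.log(toResultLabels(null)); // []
```

In a component, the input's value would be bound to searchInput and its change handler would call setSearchInput with the new text; the labels and the top suggestion can then be rendered from results and suggestions.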
from minisearch.