olivernn / lunr.js

A bit like Solr, but much smaller and not as bright

Home Page: http://lunrjs.com

License: MIT License

lunr.js's Introduction

Lunr.js

A bit like Solr, but much smaller and not as bright.

Example

A very simple search index can be created using the following:

var idx = lunr(function () {
  this.field('title')
  this.field('body')

  this.add({
    "title": "Twelfth-Night",
    "body": "If music be the food of love, play on: Give me excess of it…",
    "author": "William Shakespeare",
    "id": "1"
  })
})

Then searching is as simple as:

idx.search("love")

This returns a list of matching documents with a score of how closely they match the search query as well as any associated metadata about the match:

[
  {
    "ref": "1",
    "score": 0.3535533905932737,
    "matchData": {
      "metadata": {
        "love": {
          "body": {}
        }
      }
    }
  }
]

API documentation is available, as well as a full working example.

Description

Lunr.js is a small, full-text search library for use in the browser. It indexes JSON documents and provides a simple search interface for retrieving documents that best match text queries.

Why

For web applications with all their data already sitting in the client, it makes sense to be able to search that data on the client too. It saves adding extra, compacted services on the server. A local search index will be quicker, since there is no network overhead, and it will remain available and usable even without a network connection.

Installation

Simply include the lunr.js source file in the page where you want to use it. Lunr.js is supported in all modern browsers.

Alternatively, an npm package is available: npm install lunr.

Browsers that do not support ES5 will require a JavaScript shim for Lunr to work. You can either use Augment.js, ES5-Shim or any library that patches old browsers to provide an ES5 compatible JavaScript environment.

Features

Full text search support for 14 languages
Boost terms at query time or boost entire documents at index time
Scope searches to specific fields
Fuzzy term matching with wildcards or edit distance

Contributing

See the CONTRIBUTING.md file.

lunr.js's Issues

Implement merge of two indexes

http://garysieling.com/blog/building-a-full-text-index-in-javascript

On this page a way of merging two indexes into one is described. It's a great way to split work over several processors/machines and then merge the result. My guess is that using this merge function, it will be easier to implement WebWorkers for background indexing once this merge is supported.

My use case is to implement indexing in a NoSQL environment with map/reduce functionality. Maybe this could solve the absence of a full text index in CouchDB.
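
As a rough illustration of the idea, merging two partial indexes can be as simple as combining their posting lists. This is only a sketch: the index shape used here (term mapped to document refs and term frequencies) and the name mergeIndexes are assumptions made for the example, not lunr's internal format or API.

```javascript
// Hypothetical inverted-index shape: { term: { docRef: termFrequency } }.
// This is not lunr's internal format, only a stand-in for illustration.
function mergeIndexes(a, b) {
  const merged = Object.create(null);
  for (const part of [a, b]) {
    for (const term of Object.keys(part)) {
      if (!merged[term]) merged[term] = Object.create(null);
      for (const ref of Object.keys(part[term])) {
        // Sum frequencies in case both parts saw the same document.
        merged[term][ref] = (merged[term][ref] || 0) + part[term][ref];
      }
    }
  }
  return merged;
}

// Two parts that could have been built on different workers or machines.
const partA = { love: { '1': 2 }, music: { '1': 1 } };
const partB = { love: { '2': 1 }, food: { '2': 3 } };
const combined = mergeIndexes(partA, partB);
console.log(Object.keys(combined).sort()); // [ 'food', 'love', 'music' ]
```

Any corpus-wide statistics (document counts, idf values) would still need to be recomputed after such a merge.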

Make tokenizer a property of the index (or of a field)

The tokenizer is a static function of the lunr instance, which is itself a singleton.
As a result, it is not easily possible to maintain different tokenizers within a single application.

I would be happy to see the tokenizer become a member of lunr.Index, or preferably of a field (as it is with Solr).

Thanks for lunr.js and regards, Martin

Bug removing a non-existent document

When trying to remove a document whose ID doesn't exist in the index (perhaps because of an automatic clean-up), an error is raised because this case is not checked:

lunr.Index.prototype.remove = function (doc) {
  var docRef = doc[this._ref],
      docTokens = this.documentStore.get(docRef)

  this.documentStore.remove(docRef)

  docTokens.forEach(function (token) {
    this.tokenStore.remove(token, docRef)
  }, this)
}

If the document doesn't exist, docTokens is null, so the forEach call fails. I think this case should be checked and, if it occurs, ignored.
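
The guard being asked for might look something like the following. This is a minimal sketch on a stand-in index: TinyIndex and its Map-based stores are hypothetical and only mirror the property names in the snippet above, they are not lunr's real DocumentStore or TokenStore.

```javascript
// Minimal stand-in for the index; names mirror the snippet above, but the
// stores here are plain Maps, not lunr's real DocumentStore/TokenStore.
function TinyIndex() {
  this._ref = 'id';
  this.documentStore = new Map(); // docRef -> array of tokens
  this.tokenStore = new Map();    // token  -> Set of docRefs
}

TinyIndex.prototype.add = function (doc, tokens) {
  const docRef = doc[this._ref];
  this.documentStore.set(docRef, tokens);
  tokens.forEach(function (token) {
    if (!this.tokenStore.has(token)) this.tokenStore.set(token, new Set());
    this.tokenStore.get(token).add(docRef);
  }, this);
};

TinyIndex.prototype.remove = function (doc) {
  const docRef = doc[this._ref];
  const docTokens = this.documentStore.get(docRef);

  // Guard: an unknown document is ignored instead of throwing.
  if (!docTokens) return false;

  this.documentStore.delete(docRef);
  docTokens.forEach(function (token) {
    this.tokenStore.get(token).delete(docRef);
  }, this);
  return true;
};

const tiny = new TinyIndex();
tiny.add({ id: 'a' }, ['hello', 'world']);
console.log(tiny.remove({ id: 'missing' })); // false
console.log(tiny.remove({ id: 'a' }));       // true
```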

lunr.js + fuzzyset.js?

Hey Oliver,

I was thinking of combining lunr with fuzzyset.js to allow for fuzzy searching. The idea is to modify the token store and replace the expand() method so that it uses the fuzzy matching results. The only problem is "similarityBoost" doesn't seem to provide sufficient penalty on matches that have a low fuzzy matching score (also between 0 and 1). Would you have any ideas on how to improve this?

run lunr.js on the server side in nodejs?

@olivernn, help me with these questions:

How is it possible to run lunr.js on the server side in nodejs?
How can I save and retrieve the index on disk?

thank you

lunr.Index.prototype.idf() seems to fail when the term is an all-lowercase property of Object.prototype.

For example, an index with "constructor" or "UNDERUNDERprotoUNDERUNDER" terms in it will cause idf() to return the Object.prototype property of the same name.

(UNDER is _. Used here to avoid GFM stripping the _s and turning the property bold)

Because the returned property of Object.prototype is not a number, the subsequent calculations return NaN.

Pull request with tests to follow.
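
The underlying JavaScript behaviour can be reproduced without lunr at all. The sketch below uses hypothetical names (plainCache, lookupOwn, bareCache) and a made-up arithmetic expression; it shows why a plain object used as a lookup table misbehaves for such terms, plus two common ways to avoid it.

```javascript
// A plain object inherits from Object.prototype, so some "terms" appear to
// exist even when nothing was stored under them.
const plainCache = {};
console.log(typeof plainCache['constructor']); // 'function'

// Arithmetic on that inherited function silently produces NaN.
console.log(1 + Math.log(10 / plainCache['constructor'])); // NaN

// Option 1: check for own properties only.
function lookupOwn(cache, term) {
  return Object.prototype.hasOwnProperty.call(cache, term) ? cache[term] : undefined;
}

// Option 2: use an object with no prototype at all.
const bareCache = Object.create(null);

console.log(lookupOwn(plainCache, 'constructor')); // undefined
console.log(bareCache['constructor']);             // undefined
```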

Split Indexes

(If this is a dup of #29, my apologies, but I think it's a different goal...)

Is it possible in theory to break up a lunr.js index into separate .js files loaded on demand rather than having to load the entire index to do the search? I have some very large static websites (Wikipedia For Schools, distributed in the developing world) that could benefit from lunr.js, but which would need an inordinately large index, and thus an unacceptable delay, when loading the search page. I was thinking that if the index was split into parts -- maybe by first letter -- the parts could be selectively loaded depending on the search term, speeding up individual searches. Does this sound feasible? If so I may poke around and see if I can figure it out.

Also, much thanks for lunr.js, I'm already using it on our static Khan Academy distribution: http://rachel.worldpossible.org/ka/ - it's wonderful!
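
To make the "split by first letter" idea concrete, here is a small sketch. It assumes a simplified index shape (term mapped to a list of document refs) and hypothetical helper names; lunr's serialized index is structured differently, so this only illustrates the partitioning, not a drop-in solution.

```javascript
// Hypothetical index shape: { term: [docRef, ...] }. Not lunr's own format.
function splitByFirstLetter(index) {
  const shards = Object.create(null);
  Object.keys(index).forEach(function (term) {
    const key = term.charAt(0);
    if (!shards[key]) shards[key] = Object.create(null);
    shards[key][term] = index[term];
  });
  return shards;
}

// Only the shard for the query's first letter needs to be fetched.
function lookup(shards, term) {
  const shard = shards[term.charAt(0)];
  return (shard && shard[term]) || [];
}

const fullIndex = { apple: ['1'], avocado: ['2'], banana: ['1', '3'] };
const shards = splitByFirstLetter(fullIndex);
console.log(Object.keys(shards));      // [ 'a', 'b' ]
console.log(lookup(shards, 'banana')); // [ '1', '3' ]
console.log(lookup(shards, 'cherry')); // []
```

In a real site each shard would be written to its own file and loaded on demand. Ranking that depends on corpus-wide statistics would need those statistics stored separately.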

Maybe use semicolons to complete function definitions

Hi there,

I encountered a problem where after concat/minification, TokenStore.prototype.toJSON was an object, not a function. Weird, huh?

I did some digging and found that lunr was concatted into my lib file directly before another script that was using an anonymous function to avoid polluting global. Looked something like this:

lunr.TokenStore.prototype.toJSON = function () {
  return {
    root: this.root,
    length: this.length
  };
}(...other library code)();

Should have looked like this:

lunr.TokenStore.prototype.toJSON = function () {
  return {
    root: this.root,
    length: this.length
  };
}; 
(...other library code)();

Anyway, I got around it by changing my grunt config to use semicolon as a file separator, but it may be worthwhile to add semicolons directly in lunr.js to save some others some very weird debugging.
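
The effect is easy to reproduce in isolation. In this sketch, store and fixed are made-up objects standing in for the prototype, and the parenthesised function stands in for the next file in the concatenated bundle.

```javascript
const store = {};

// No semicolon after the function expression...
store.toJSON = function () {
  return { root: 'r', length: 0 }
}
// ...so this parenthesised IIFE from the "next file" is parsed as an
// argument list, and toJSON is assigned the function's return value.
(function () { /* other library code */ }());

console.log(typeof store.toJSON); // 'object'

// With the terminating semicolon the two statements stay separate.
const fixed = {};
fixed.toJSON = function () {
  return { root: 'r', length: 0 };
};
(function () { /* other library code */ }());

console.log(typeof fixed.toJSON); // 'function'
```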

Lower level query interface

A more advanced query interface is something that lunr definitely needs. I'm not entirely sure what the interface would look like for this, perhaps a different method on the index and a query object itself:

idx.query(function () {
    this.all('foo bar*') // these are AND query terms
    this.some('*baz') // these are OR query terms

    this.limit(10) // maximum number of results to return
    this.threshold(0.7) // minimum score for returned documents

    this.facet('herp', ['derp', 'burp']) // facets and values
})

This is just a very rough idea, but I think it is the right direction. It hopefully exposes a more powerful interface to the search. The idx.search method would still exist as a quick way to perform searches, but it would probably be built on top of this query method.

Any input or feedback is much appreciated.

Support for AMD

I wonder if it's worth thinking about a build process that creates an AMD-compliant version of Lunr and the adapters. The best course of action would probably be to change the build process. I can work on this in my fork and then submit a pull request; it would be interesting to integrate some extra tools.

No results for certain query lengths

> var idx = lunr(function () { this.field('title') })
> var doc = {"title": "Excavator", id:1}
{ title: 'Excavator', id: 1 }
> idx.add(doc)
undefined
> idx.search('exc')
[ { ref: '1', score: 1 } ]
> idx.search('exca')
[ { ref: '1', score: 1 } ]
> idx.search('excav')
[ { ref: '1', score: 1 } ]
> idx.search('excava')
[]
> idx.search('excavat')
[]
> idx.search('excavato')
[]
> idx.search('excavator')
[ { ref: '1', score: 1 } ]

Storage Events

Adding storage events to a lunr.Index would make it easy to snapshot an index to some storage location, whether that be localStorage in the browser or to file or some other database on the server.

I think lunr would have to emit three events, add, update and remove, this should give users enough hooks to maintain a persisted copy of their index.

It might work like this:

index.on('add', function (doc, index) {
    // doc is the newly added document
    // index is the instance of `lunr.Index` that has been added to

    // 'asdf' is a placeholder storage key
    localStorage.setItem('asdf', JSON.stringify(index))
})

The callback signature would be the same for the add, update and remove events.

Because update is implemented by first removing a document and then re-adding it, some special care would be needed to make sure that only an update event is fired, rather than a remove event followed by an add event, but this should be simple enough.

To support this, all three methods (add, update and remove) could take an argument that prevents any events from being emitted. This might also be useful when doing a bulk load.

I don't think event handlers should be serialised, so when loading an index, event handlers would have to be re-added.
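
A stripped-down sketch of how the event flow described above could fit together follows. EventedIndex, its emitEvent argument and its Map of documents are hypothetical stand-ins invented for this example; nothing here is existing lunr API.

```javascript
// Hypothetical sketch of the proposal; `EventedIndex` is not lunr's API.
function EventedIndex() {
  this.docs = new Map();
  this.handlers = { add: [], update: [], remove: [] };
}

EventedIndex.prototype.on = function (name, fn) {
  this.handlers[name].push(fn);
};

EventedIndex.prototype.emit = function (name, doc) {
  this.handlers[name].forEach(function (fn) { fn(doc, this); }, this);
};

// Passing `false` as emitEvent suppresses the event, e.g. during a bulk load.
EventedIndex.prototype.add = function (doc, emitEvent) {
  this.docs.set(doc.id, doc);
  if (emitEvent !== false) this.emit('add', doc);
};

EventedIndex.prototype.remove = function (doc, emitEvent) {
  this.docs.delete(doc.id);
  if (emitEvent !== false) this.emit('remove', doc);
};

// update = remove + add internally, but only a single 'update' event fires.
EventedIndex.prototype.update = function (doc, emitEvent) {
  this.remove(doc, false);
  this.add(doc, false);
  if (emitEvent !== false) this.emit('update', doc);
};

const seen = [];
const evented = new EventedIndex();
['add', 'update', 'remove'].forEach(function (name) {
  evented.on(name, function (doc) { seen.push(name + ':' + doc.id); });
});
evented.add({ id: 1 });
evented.update({ id: 1 });
evented.add({ id: 2 }, false); // silent, as in a bulk load
console.log(seen); // [ 'add:1', 'update:1' ]
```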

Capitalised tags not being detected

Here's a jsFiddle demonstrating the problem.
http://jsfiddle.net/bt5yq/5/

If I give an item capitalized tags in the json, an exact search for the tag will still not return the item in the results.

If the tags are not capitalized, there is no problem at all.

bower.json/package.json not updated during 0.4.3 build

Bower install pulls in lot of unnecessary files

If I install lunr.js using Bower, I get too many files in bower_components/lunr.js. Typically, an app using lunr.js won't need all these files. This is a problem because Bower recommends checking bower_components into version control, which essentially means I am checking the whole lunr.js repo into my project.

What I should ideally get from bower install is:

$ tree bower_components/lunr.js
bower_components/lunr.js
|-- CHANGELOG.mdown
|-- README.mdown
|-- VERSION
|-- bower.json
|-- component.json
|-- lunr.js
|-- lunr.min.js
`-- package.json

What I currently get is:

$ tree bower_components/lunr.js
bower_components/lunr.js
|-- CHANGELOG.mdown
|-- CNAME
|-- LICENSE
|-- Makefile
|-- README.mdown
|-- VERSION
|-- bower.json
|-- component.json
|-- example
|   |-- example_data.json
|   |-- example_index.json
|   |-- index.html
|   |-- index_builder.js
|   |-- jquery.js
|   `-- mustache.js
|-- index.html
|-- lib
|   |-- document_store.js
|   |-- event_emitter.js
|   |-- index.js
|   |-- lunr.js
|   |-- pipeline.js
|   |-- sorted_set.js
|   |-- stemmer.js
|   |-- stop_word_filter.js
|   |-- token_store.js
|   |-- tokenizer.js
|   |-- utils.js
|   `-- vector.js
|-- lunr.js
|-- lunr.min.js
|-- notes
|-- package.json
|-- server.js
|-- styles.css
`-- test
    |-- env
    |   |-- augment.min.js
    |   |-- jquery.js
    |   |-- qunit.css
    |   |-- qunit.js
    |   `-- runner.js
    |-- event_emitter_test.js
    |-- fixtures
    |   `-- stemming_vocab.json
    |-- index.html
    |-- index_test.js
    |-- lunr_test.js
    |-- pipeline_test.js
    |-- search_test.js
    |-- serialisation_test.js
    |-- sorted_set_test.js
    |-- stemmer_test.js
    |-- stop_word_filter_test.js
    |-- store_node_test.js
    |-- store_test.js
    |-- test_helper.js
    |-- token_store_test.js
    |-- tokenizer_test.js
    |-- utils_test.js
    `-- vector_test.js

Test error

I just forked the repository (everything is up to date), but I'm getting a test failure. Is this a known test failure or do I have something misconfigured? Was there supposed to be a build step before running the tests?

~/code/lunr.js (master) $ make test
Test failed: lunr.tokenizer: calling to string on passed val
Failed assertion: expected: tue,jan,01,2013, but was: mon,dec,31,2012
at http://localhost:32423/test/env/qunit.js:472
at http://localhost:32423/test/tokenizer_test.js:50
at http://localhost:32423/test/env/qunit.js:136
at http://localhost:32423/test/env/qunit.js:279
at process (http://localhost:32423/test/env/qunit.js:1277)
at http://localhost:32423/test/env/qunit.js:383
Took 58ms to run 297 tests. 296 passed, 1 failed.
make: *** [test] Error 1

Exact phrase matching?

Hi!

Does lunr support exact phrase matching (i.e. using quotation marks in a search)? It doesn't seem like it from my initial research. I'd like to try to add this feature to the project. Could someone please give me some pointers on how to implement this?

Lunr.js as a search for static websites?

It would be super-amazing to have a Lunr.js run with statically generated websites. I can imagine it running effectively in two modes:

Server-side mode during static site generation - when it does indexing, similar to how SASS precompiler works. That would generate a compressed index.
Client side which loads the index and does the search.

At the moment, statically generated sites work without search or use Google embedded search. This would open up a lot more options for them, including in presentation and faceting, avoiding generic template elements, etc.

Sorting

Hi!

I'm still learning so be gentle...

Is there a way to re-order or sort the results after a search?

I have a working implementation with some numeric data and I want to bring the largest values for a specific term to the top.

Any help would be really appreciated.

Thanks
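
Since search results carry only a ref and a score, one approach is to look each hit up in your own data and sort the combined list yourself. The sketch below uses stand-in data: results has the ref and score shape that idx.search() returns, while docsById and the calories field are made up for the example.

```javascript
// Stand-in data: `results` has the shape idx.search() returns (ref + score);
// `docsById` and the `calories` field are hypothetical.
const docsById = {
  '1': { id: '1', name: 'white bread', calories: 265 },
  '2': { id: '2', name: 'rye bread', calories: 259 },
  '3': { id: '3', name: 'banana bread', calories: 326 }
};
const results = [
  { ref: '1', score: 0.9 },
  { ref: '2', score: 0.8 },
  { ref: '3', score: 0.7 }
];

// Look each hit up by ref, then sort by the numeric field, largest first.
const sorted = results
  .map(function (r) { return { doc: docsById[r.ref], score: r.score }; })
  .sort(function (a, b) { return b.doc.calories - a.doc.calories; });

console.log(sorted.map(function (s) { return s.doc.name; }));
// [ 'banana bread', 'white bread', 'rye bread' ]
```

Because map returns a new array, the original results keep their relevance order.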

Make lunr.js compatible with IE8

Right now, the reference to the undeclared variable console causes script errors in IE8. The use of the ES5 method forEach makes lunr.js incompatible with IE8 too. lunr.js is perfectly capable of running in IE8 with reasonable speed, so it'd be nice if these tiny issues could be fixed. ~~If not, at least put up a notice somewhere that says which browser version is supported.~~

Exact matches should have a (slightly) higher score

Thanks for this great project!

Currently searching 'hand' in a set of 'hand' and 'handsome' returns both with the same score. Obviously 'hand' should have a higher score than 'handsome'

To test:

testIndex = lunr(function(){this.field('name'); this.ref('id');});
testIndex.add({id:'hs',name:'handsome'});
testIndex.add({id:'hd',name:'hand'});
testIndex.search('hand');

results in

[   Object
    ref: "hd"
    score: 0.7071067811865476
, 
    Object
    ref: "hs"
    score: 0.7071067811865476
]

Partial word matching

Hi Oliver,

Do you know if it's possible to add partial word matching support to lunr?
Currently it seems that it adds a wildcard to the end of each token but not to the beginning. So for example, if I search for "create" and the document has the word "recreate" it wouldn't find it. I know stemming may be a solution to this, but I prefer not to use it. Also, on a similar subject, how difficult would it be to add fuzzy matching (e.g. if I made a typo and searched for "recreete" instead of "recreate"). Can you give me some pointers as to where in the code I should be looking to address these?

Thank you!!

Problem with serializing/deserializing index to/from JSON

After making an index like so:

index = lunr(function(){
      this.ref('id');
      this.field('title', {boost: 10});
      this.field('text');
});

And calling index.toJSON() to create a storable representation, and then reloading the index from it:

newIndex = lunr.Index.load(index.toJSON());

The new index seems to be "broken" in some way. Attempting a search, for instance, gives the error TypeError: Cannot read property 'tf' of undefined

Maybe I'm making some simple error here, but it seems a little strange.

Document for Node.JS

This works fine on node.js on the server side too. If there was a little documentation explaining how, I think it could be useful to some people.

TypeError: Cannot call method 'replace' of null

I got TypeError: Cannot call method 'replace' of null for line 74 in lunr.js.

Just inserting a check for null argument fixes it:

lunr.tokenizer = function (str) {
  if (str == null) return new Array // check for null argument
  if (Array.isArray(str)) return str

  var str = str.replace(/^\s+/, '')
  ...

Additional Features Plan

Do you plan to incorporate the following additional features as part of the current or a future plan?

Allow phrase searching as well as word matches
Support use of wildcards
Support for case-sensitive and case-insensitive Search
Standard Boolean Search using AND / OR/ NOT
Support for related terms or synonyms
Support for auto-complete for search terms
Enable proximity search (terms located near each other; e.g., within 2 words, not just exact matches)
Provide fuzzy AND (for ranking)

Highlighted matched terms

It would be nice if Lunr.js could highlight matched terms.

404 on node package install

I get a 404 when trying to install the node package:

npm install --save-dev lunr

Error in some search queries

I get this error on certain search queries:

TypeError: Cannot read property 'tf' of undefined
    at lunr.Index.documentVector (/usr/local/lib/node_modules/grunt-translate/node_modules/lunr/lunr.js:1104:53)
    at lunr.Index.search (/usr/local/lib/node_modules/grunt-translate/node_modules/lunr/lunr.js:1076:61)
    at Array.map (native)
    at lunr.SortedSet.map (/usr/local/lib/node_modules/grunt-translate/node_modules/lunr/lunr.js:574:24)
    at lunr.Index.search (/usr/local/lib/node_modules/grunt-translate/node_modules/lunr/lunr.js:1075:6)
    at Search.query (/usr/local/lib/node_modules/grunt-translate/src/modules/search.js:93:24)
    at module.exports (/usr/local/lib/node_modules/grunt-translate/app/modules/search/searchApi.js:9:21)
    at callbacks (/usr/local/lib/node_modules/grunt-translate/node_modules/express/lib/router/index.js:161:37)
    at param (/usr/local/lib/node_modules/grunt-translate/node_modules/express/lib/router/index.js:135:11)
    at pass (/usr/local/lib/node_modules/grunt-translate/node_modules/express/lib/router/index.js:142:5)

lunr.index.search throws TypeError when searching for "javascript foo".

Here's a jsfiddle that shows this exception. http://jsfiddle.net/bmY5L/6/

Is this a bug?

Support for require.js (AMD) and avoiding global scope

So far, Lunr supports module.exports for node.js but not AMD style loading like with require.js.

It also by default creates a global variable lunr even when loaded as a module.

This can be resolved by wrapping the lib in a function(){}, detecting the presence of define or exports, and only using a global variable as a fallback.
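
The detection order could be sketched roughly as follows. A real wrapper would read define and module from the enclosing scope; in this example they are passed in through a hypothetical env object (and exportLibrary and makeLibrary are made-up names) so that each branch can be exercised in one script.

```javascript
// Hypothetical helper. A real wrapper would read `define` and `module` from
// the enclosing scope; they are passed in here so each branch can be exercised.
function exportLibrary(env, name, factory) {
  if (typeof env.define === 'function' && env.define.amd) {
    env.define(factory);             // AMD loaders such as require.js
    return 'amd';
  }
  if (env.module && typeof env.module.exports === 'object') {
    env.module.exports = factory();  // CommonJS, e.g. node.js
    return 'commonjs';
  }
  env.root[name] = factory();        // fallback: one global variable
  return 'global';
}

// Stand-in for the library body.
function makeLibrary() {
  return { version: 'sketch' };
}

const browserEnv = { root: {} };
const nodeEnv = { module: { exports: {} }, root: {} };
console.log(exportLibrary(browserEnv, 'lunr', makeLibrary)); // 'global'
console.log(exportLibrary(nodeEnv, 'lunr', makeLibrary));    // 'commonjs'
console.log('lunr' in nodeEnv.root);                         // false
```

The point of the ordering is that the global is only created when no module system is present.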

different pipelines for indexing and searching

Hi,

I'm using lunr.js in a project where autocomplete is also a requirement, in addition to client-side search. I was able to make the autocomplete work by introducing an additional processing step in the pipeline, just before the stemmer does its work.
But it's good to see that now there is a possibility to extend the package with the new use() api on the index. So I started to extract my solution into a lunr extension, you can check my approach here.

I'm using a Radix tree to store n-grams of the tokens (right before stemming), later you can use this tree for efficient autocomplete over the stored n-grams.

Unfortunately, there is a problem with this solution - lunr.js calls the indexing pipeline on each search - I can easily understand why you chose this approach: because usually you need the same processing steps (stemming, stop-word filtering, etc.) for the query string which you've used for indexing.

But in this case, I would need an "indexing pipeline" which is different than the "search/query string pipeline".

Previously I solved this issue in my application in a way that right after indexing, I set a flag to true indicating that indexing has finished, but IMO it's a very hackish solution - if I would like to extract this solution to a lunr extension, we would need a way for configuring different pipelines for search and indexing.

What is your opinion?
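
To illustrate what separate pipelines could mean in practice, here is a small sketch that does not use lunr.Pipeline. Every name in it (runPipeline, recordPrefixes, indexPipeline, searchPipeline) is hypothetical, and a simple prefix set stands in for the radix tree of n-grams.

```javascript
// A pipeline here is an array of functions; each takes a token and returns
// the processed token, or undefined to drop it. All names are hypothetical.
function runPipeline(pipeline, tokens) {
  return tokens
    .map(function (token) {
      return pipeline.reduce(function (t, fn) {
        return t === undefined ? undefined : fn(t);
      }, token);
    })
    .filter(function (t) { return t !== undefined; });
}

const lowercase = function (t) { return t.toLowerCase(); };
const dropStopWords = function (t) { return t === 'the' ? undefined : t; };

// Indexing-only step: record prefixes for autocomplete as a side effect.
const prefixes = new Set();
const recordPrefixes = function (t) {
  for (let i = 1; i <= t.length; i++) prefixes.add(t.slice(0, i));
  return t;
};

const indexPipeline = [lowercase, dropStopWords, recordPrefixes];
const searchPipeline = [lowercase, dropStopWords]; // no side effects at query time

console.log(runPipeline(indexPipeline, ['The', 'Moon'])); // [ 'moon' ]
console.log(runPipeline(searchPipeline, ['MOO']));         // [ 'moo' ]
console.log(prefixes.has('moo'), prefixes.has('MOO'));     // true false
```

With two pipelines, the shared steps stay identical for indexing and querying, and the flag that marks "indexing has finished" is no longer needed.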

Support more languages

I know the web is English-centric, but it would be nice to support more languages through a plugin system.
Fortunately there are several tools out there for tokenization and stemming that can be used:

Natural has tokenizers, stemmers, stop words and even inflectors for several languages
jssnowball also has a nice collection of stemmers
Detecting the language may also come in handy: language-identifier and cld-js can help

I don't really have the time to work on this feature, but I'd like to see lunr.js going this way.

This looks like I cannot find my Russian static content.

I have a feed like this:

var docs = [
    {
        "id"      : "http://dreamand.me/ru/emerald/developer/s",
        "title"   : "Новое действие",
        "content" : "Дествие это то что сработает после успешной активации...."
    },
    {
        "id"      : "http://dreamand.me/ru/cobalt/optimize-links",
        "title"   : "Оптимизация ссылок",
        "content" : "Очень часто получается так, что URLы...."
    }
]

and then

var idx = lunr(function () {
    this.field('title', 10);
    this.field('content');
})

for(var index in docs) {
    idx.add(docs[index]);
}

It searches well for posts in English but not in Russian. Is there anything I can do to fix it?

Safari (iOS) "TypeError"

With lunr.js 0.4.1, Safari (iOS) shows the following JavaScript error in line 765:

"TypeError: Result of expression '(function () {this._idfCache = {}..."

As the index is not generated, another JavaScript error is shown when my JavaScript is executed:

"TypeError: Result of Expression 'index'[undefined] is not an object".

TypeError When Searching a Previously Serialized Index

I found the bug. You can read my explanation in the first comment below.

When I search a previously serialized index, having loaded it using lunr.Index.load(), I receive a TypeError (shown in detail below). The following are steps to reproduce this error:

First, I create an index, add a document to the index, and log the index to ensure it is working:

var index = lunr(function () { //create index
    this.field('title')
    this.ref('id')
})
index.add({id: 1, title: 'apple' }) //add a document to index
console.log(index) //log index to demonstrate proper functionality

The following is the result of the console.log(index). As you see, everything is working correctly:

{ _fields: [ { name: 'title', boost: 1 } ],
  _ref: 'id',
  pipeline: { _stack: [ [Object], [Object] ] },
  documentStore: { store: { '1': [Object] }, length: 1 },
  tokenStore: { root: { docs: {}, a: [Object] }, length: 1 },
  corpusTokens: { length: 1, elements: [ 'appl' ] } }

I then serialize the index and store it in my database:

user.index = index.toJSON() //serialise index and store it in the database
user.save() //save changes to database

Later, I load the previously serialised index and log the index to ensure it is working:

index = lunr.Index.load(user.index) //load previously serialised index from database
console.log(index) //log index to demonstrate proper functionality

The following is the result of the console.log(index). As you see, everything is working correctly and is identical to before:

{ _fields: [ { name: 'title', boost: 1 } ],
  _ref: 'id',
  pipeline: { _stack: [ [Object], [Object] ] },
  documentStore: { store: { '1': [Object] }, length: 1 },
  tokenStore: { root: { docs: {}, a: [Object] }, length: 1 },
  corpusTokens: { length: 1, elements: [ 'appl' ] } }

Again, I will add a document to the index and log the index to ensure it is working after having loaded the previously serialized index:

index.add({id: 2, title: 'banana' }) //add a second document to index
console.log(index) //log index to demonstrate proper functionality

The following is the result of the console.log(index). As you see, everything is still working correctly:

{ _fields: [ { name: 'title', boost: 1 } ],
  _ref: 'id',
  pipeline: { _stack: [ [Object], [Object] ] },
  documentStore: { store: { '1': [Object], '2': [Object] }, length: 2 },
  tokenStore: { root: { docs: {}, a: [Object], b: [Object] }, length: 2 },
  corpusTokens: { length: 2, elements: [ 'appl', 'banana' ] } }

Here is the problem. I will now attempt to search the index for a document that was added before serializing the index ({id: 1, title: 'apple' }):

index.search('ap') //attempt to search index for document added before serialising index

This is the error I receive:

TypeError: Cannot read property '0' of undefined
    at lunr.TokenStore.getNode (/home/danny/node_modules/lunr/lunr.js:1467:18)
    at lunr.TokenStore.get (/home/danny/node_modules/lunr/lunr.js:1491:15)
    at lunr.Index.documentVector (/home/danny/node_modules/lunr/lunr.js:908:30)
    at null.<anonymous> (/home/danny/node_modules/lunr/lunr.js:880:61)
    at Array.map (native)
    at lunr.SortedSet.map (/home/danny/node_modules/lunr/lunr.js:453:24)
    at lunr.Index.search (/home/danny/node_modules/lunr/lunr.js:879:6)

This is the code in its entirety:

var index = lunr(function () { //create index
    this.field('title')
    this.ref('id')
})
index.add({id: 1, title: 'apple' }) //add a document to index
console.log(index) //log index to demonstrate proper functionality

user.index = index.toJSON() //serialise index and store it to the database
user.save() //save changes to database

index = lunr.Index.load(user.index) //load previously serialised index from database
console.log(index) //log index to demonstrate proper functionality

index.add({id: 2, title: 'banana' }) //add a second document to index
console.log(index) //log index to demonstrate proper functionality

index.search('ap') //attempt to search index for document added before serialising index

Note: If, at the final line, I searched the index for 'ba', it would correctly return {id: 2, title: 'banana' }, which was added after loading the index. Also, if I pass a query that normally would not return any result, such as 'da', it would correctly not return anything, without error.

Thank you very much.

TypeError: Cannot call method 'replace' of INTEGER

I did not see an issue filed about numbers. I'm not sure it is even a valid use case, but the tokenizer fails when indexing an integer (like 10). I do not have much control over what goes into update, so if a string gets changed to an integer, the following code fails with a TypeError.

lunr.tokenizer = function (str) {
  if (!str) return []
  if (Array.isArray(str)) return str.map(function (t) { return t.toLowerCase() })

  var str = str.replace(/^\s+/, '')

Solved by simply appending an empty string to the str variable before calling replace().

lunr.tokenizer = function (str) {
  if (!str) return []
  if (Array.isArray(str)) return str.map(function (t) { return t.toLowerCase() })

  var str = str+'';
  str = str.replace(/^\s+/, '')
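
For comparison, a tokenizer that tolerates null, numbers and arrays might be guarded roughly like this. It is a hypothetical stand-in (safeTokenize is a made-up name and the whitespace splitting is a simplification), not lunr's tokenizer.

```javascript
// Hypothetical stand-in, not lunr's tokenizer: it only shows the input guards.
function safeTokenize(value) {
  // null and undefined produce no tokens.
  if (value === null || value === undefined) return [];

  // Arrays are treated as pre-split tokens; each element is coerced to a string.
  if (Array.isArray(value)) {
    return value.map(function (t) { return String(t).toLowerCase(); });
  }

  // Everything else (numbers, dates, ...) is coerced to a string first.
  const str = String(value).trim();
  if (str === '') return [];
  return str.split(/\s+/).map(function (t) { return t.toLowerCase(); });
}

console.log(safeTokenize(null));          // []
console.log(safeTokenize(10));            // [ '10' ]
console.log(safeTokenize('  Foo  Bar ')); // [ 'foo', 'bar' ]
```

Checking for null explicitly, rather than with `!str`, means the number 0 is still indexed as the token '0'.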

Serialize big indexes (get rid of TokenStore?)

Hello there,

I'm building a browser based application (without a webserver). I'll have multiple documents to index (in my case, about 700) and the cumulated weight of those documents is about 5 MB.

For now, my application rebuilds the index every time the webpage is displayed, but it takes about 40 seconds to build the full index, which is very long.

I want to store the index in the browser but I can't even serialize it. Here is what I get when I try to JSON.stringify() my index:

Uncaught RangeError: Maximum call stack size exceeded

After a short investigation, the problem seems to come from JSON.stringify(), which can't handle such a big object. In fact, a heap snapshot shows that my TokenStore is about 70 MB!

I will not even try to store such a big object in a browser, but maybe I'm looking in the wrong way here. Maybe there is a way to store the index without the TokenStore and rebuild it somehow?

Thank you!

Repository field missing in package.json

Would be great if you could add the repository field in package.json so there won't be a warning when installed with npm. Thanks!!

typeerror 'undefined' is not an object on ipad

lunr (including the demo site) fails to work on iPad. After enabling Safari's debug console, the error above is shown (no line number).

It's a first-generation iPad, model MB294LL, running iOS 5.1.1 (9B206).
Safari version (found using www.watismijnip.nl): 5.1, WebKit/534.46, Mobile/9B206 Safari/7534.48.3

my project uses lunr 0.4.1 (just as the demo page does now)

index_builder.js not working with UTF8 - unicode

My static data contains text in many languages.

How can I create index.json?

It works only with Latin characters (a-z).

data:

index:

using: (cmd=>nodejs)

Uncaught TypeError: Cannot call method 'toString' of null

I assigned field using: lunr.Index.prototype.field()

Then I call this function: lunr.Index.prototype.add()

When one of the fields is null, I get this error. Shouldn't it detect that the field is null and exclude it from the index?

Thanks!

Implement serialize

Building the index for 7000 1-3 word strings is quite slow on a mobile device. One solution to speed this up would be to store and retrieve a generated index on the device.
What's needed to make that happen is the ability to serialize the data inside lunr, plus a matching deserialize. JSON seems a nice fit.
My understanding of the algorithm is not deep enough to see how data is stored and which part takes the most computing, but I'm willing to help.

Search Score Decreases with More Accurate Query

The following are steps to reproduce an issue I am experiencing. First, create an index and add a two-word String:

var index = lunr(function () {
  this.ref('ref')
  this.field('text')
})
index.add({
  ref: 1,
  text: 'yes funny'
})

This first query is a portion of the first word followed by the complete second word:

console.log(index.search("ye funny"))
// => [ { ref: '1', score: 0.773957299203321 } ]

This second query also begins with a portion of the first word (can be identical to the term in the previous query, or not) followed by only a portion of the second word:

console.log(index.search("ye fun"))
// => [ { ref: '1', score: 1 } ]

Issue: Why does the second, less accurate query return a higher score than the first, more accurate query?

Note: For this particular example, the issue occurs only when the stemmer is disabled. See my comment below for better examples.

Thank you very much.

Odd bug in results sorting

Hi,
I've got about 9000 food items in an array. I'm wanting to use lunr to match results and order them. So far so good.

Having tried it in Node, I'm getting an error. I thought I'd try it using the front end and I get the same error. Namely, searching for "bread" brings back "seafood breader" first, then "breadfruit" and then finally "bread". I'd expect "bread" to be first...

I've uploaded my test case including the data here: http://03sq.net/lunr-test/ as I'm not sure if I've done something obviously wrong or if this is a bug or how to debug it :)

Index a composite set of fields

For context, an example document:

var document = {"id": 1,"first_name":"Paul", "last_name":"Jensen"};

And an example of the index I would like to construct and consume:

var index = lunr(function () {
  this.field(['first_name', 'last_name'])
  this.ref('id')
})

index.add(document);

index.search("Paul Jensen");

A question that comes to mind is how to make that composite index pad the fields with a space between them. Alternatively, I imagined that a function may be a better option:

this.field(function(doc){return doc.first_name + " " + doc.last_name});

Great work BTW :)
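
Until something like this exists, one possible workaround is to build the combined field on the document before adding it. In this sketch, withCompositeField and the full_name field are hypothetical names chosen for the example.

```javascript
// Hypothetical helper: derive one searchable field from several source fields.
function withCompositeField(doc, name, sourceFields) {
  const copy = Object.assign({}, doc);
  copy[name] = sourceFields
    .map(function (field) { return doc[field]; })
    .filter(function (v) { return v !== null && v !== undefined; })
    .join(' '); // a space keeps "Paul" and "Jensen" as separate tokens
  return copy;
}

const person = { id: 1, first_name: 'Paul', last_name: 'Jensen' };
const indexable = withCompositeField(person, 'full_name', ['first_name', 'last_name']);
console.log(indexable.full_name); // 'Paul Jensen'
```

The resulting object could then be passed to index.add(), with the index declaring this.field('full_name') instead of the two separate fields.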

component.json

I think bower recommends to use bower.json rather than component.json, can you rename it and republish?