Pacharapol Withayasakpunt
Tue, March 16, 2021

A tutorial on building a search engine that supports advanced search syntaxes

  • It should support AND, OR, NOT; and perhaps brackets ().
  • Another part of it, though, is about optimization and fuzzy searches.
    • Fast even for a large body of text.
    • Recognizes plural forms.
    • Forgiving of minor typos.
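
To make the last two requirements concrete, here is a minimal sketch of plural handling and typo tolerance. It is only an illustration under simple assumptions: `singularize`, `edit_distance` and `fuzzy_match` are hypothetical helper names, the plural rules are deliberately naive (a real engine would use a proper stemmer), and typos are measured with plain Levenshtein distance.

```python
def singularize(word: str) -> str:
    """Naive English plural stripping; a stand-in for a real stemmer."""
    w = word.lower()
    if len(w) > 4 and w.endswith("ies"):
        return w[:-3] + "y"            # queries -> query
    if w.endswith(("ches", "shes", "xes", "zes", "sses")):
        return w[:-2]                  # searches -> search, boxes -> box
    if len(w) > 2 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]                  # typos -> typo
    return w


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]


def fuzzy_match(term: str, word: str, max_typos: int = 1) -> bool:
    """True if the two words are within max_typos edits after normalization."""
    return edit_distance(singularize(term), singularize(word)) <= max_typos
```

With these definitions, `fuzzy_match("seach", "searches")` is true (one missing letter after the plural is stripped), while `fuzzy_match("cat", "dog")` is false. Comparing every query term against every word this way is slow on a large corpus, which is why the indexed approaches discussed later matter.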

Advanced search syntaxes

I have thought about this a lot in the past.

The easiest way is to model it on lunr.js's syntax.

  • Default connector is AND.
  • To make an OR, use ?expression.
  • Search is normally case-insensitive, i.e. a and A mean the same thing.
  • +expression means an exact, case-sensitive match.
  • -expression means negation.
  • Not only :, but also > and <, are used to specify comparisons. For example, +foo:bar, count>1.
  • Date comparison is enabled.
    • Special keyword: NOW.
    • +1h means 1 hour from now. -1h means 1 hour ago.
    • Available units are y (year), M (month), w (week), d (day), h (hour), m (minute).
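
The rules above can be sketched as a small tokenizer. This is not the code from the qsearch repository, just a minimal illustration. The names `Term`, `parse_query` and `resolve_date` are hypothetical, +expression is assumed to join with AND, relative offsets are assumed to appear as standalone values such as -1h, and years and months are approximated as 365 and 30 days.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Term:
    connector: str         # "and" (default), "or" (?expr) or "not" (-expr)
    exact: bool            # True for +expr (exact, case-sensitive)
    field: Optional[str]   # e.g. "foo" in foo:bar
    op: Optional[str]      # ":", ">" or "<"
    value: str


# Optional prefix, optional field plus comparison operator, then the value.
TERM_RE = re.compile(
    r"^(?P<prefix>[?+-]?)(?:(?P<field>\w+)(?P<op>[:<>]))?(?P<value>.+)$"
)


def parse_query(q: str) -> List[Term]:
    """Split on whitespace and classify each token."""
    terms = []
    for token in q.split():
        m = TERM_RE.match(token)
        if not m:
            continue
        connector = {"?": "or", "-": "not"}.get(m["prefix"], "and")
        terms.append(
            Term(connector, m["prefix"] == "+", m["field"], m["op"], m["value"])
        )
    return terms


UNITS = {
    "y": timedelta(days=365),  # approximation: ignores leap years
    "M": timedelta(days=30),   # approximation: ignores month lengths
    "w": timedelta(weeks=1),
    "d": timedelta(days=1),
    "h": timedelta(hours=1),
    "m": timedelta(minutes=1),
}
OFFSET_RE = re.compile(r"^([+-])(\d+)([yMwdhm])$")


def resolve_date(value: str, now: datetime) -> Optional[datetime]:
    """Turn NOW, +1h, -2d, ... into a datetime; None if not a date value."""
    if value == "NOW":
        return now
    m = OFFSET_RE.match(value)
    if not m:
        return None
    sign = 1 if m[1] == "+" else -1
    return now + sign * int(m[2]) * UNITS[m[3]]
```

For example, `parse_query("hello ?world +foo:bar -draft count>1")` yields five terms whose connectors are and, or, and, not, and. The third term is exact with field foo and value bar. A comparison value such as the -1h in `updated>-1h` can then be passed to `resolve_date`.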

You can see my experiment and playground here.

https://github.com/patarapolw/qsearch

Full text search and fuzzy search

I made a list here.

https://dev.to/patarapolw/what-s-your-favorite-full-text-search-implementation-4659

  • Algolia
  • Elasticsearch, Lucene, Solr
  • Google custom search

How does it compare to search engines with web crawlers?

  • Yahoo
  • Bing
  • DuckDuckGo
  • Yandex
  • Baidu

What about pure JavaScript implementations?

  • js-search
  • lunr, elasticlunr

Built-in features of RDBMS and NoSQL databases?

  • SQLite FTS4, FTS5
  • PostgreSQL plugin
  • MongoDB

Or some other implementation, such as Python's Whoosh?

Implementing both together

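It is easier if you use the features built into an RDBMS or NoSQL database. PostgreSQL, MySQL and MongoDB allow you to create a full-text index directly on a TEXT column. SQLite is the exception: its full-text search lives in separate FTS4/FTS5 virtual tables rather than in an index on an existing column.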
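
For comparison, SQLite's FTS5 route looks roughly like the sketch below, using Python's standard sqlite3 module. It assumes the SQLite build bundled with your Python has FTS5 compiled in (most do, but it is not guaranteed). The table name `docs`, the sample rows and the helper `search` are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Full-text search needs its own virtual table, not an index on a column.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany(
    "INSERT INTO docs (title, body) VALUES (?, ?)",
    [
        ("Search syntax", "Queries support AND, OR and NOT operators"),
        ("Fuzzy search", "Forgiving of minor typos and plural forms"),
        ("Databases", "PostgreSQL and MongoDB ship with full-text indexes"),
    ],
)


def search(query: str) -> list:
    """Run an FTS5 MATCH query and return matching titles, best match first."""
    rows = conn.execute(
        "SELECT title FROM docs WHERE docs MATCH ? ORDER BY rank", (query,)
    )
    return [title for (title,) in rows]
```

Here `search("typos")` returns `["Fuzzy search"]`, and `search("search NOT fuzzy")` returns `["Search syntax"]`. FTS5 has its own AND, OR and NOT operators, but the custom syntax described earlier would still need to be translated into it.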

Furthermore, PostgreSQL also has the PGroonga extension, which not only supports more languages than the native tsvector, but can also index almost anything, including JSONB.

https://pgroonga.github.io/

Now comes the algorithm for the syntax. I made it for PostgreSQL in another project.