blog2.shibu.jp

Python JavaScript Qt

Full Text Search Engine on JS

I am creating a full text search engine in JavaScript. My source code is uploaded here.

Goal of my project

My goal is creating an alternative search engine for Sphinx‘s HTML outputs.

Sphinx is a document generator written in Python. It reads simple formatted (reStructuredText) text and converts int into several formats like HTML, PDF and so on. It is a very popular tool for writing documents for programming language reference manual, software instruction, library, API etc.

Sphinx’s HTML generator has a simple built-in search engine. An HTML document written by Sphinx has a search window and the user can search whole documents. It is written in JavaScript and it downloads an index file and searches on the client browser. A document provider doesn’t have to set up search engines on his/her server. Just deploying html folder on his/her hosting server, then it is available.

I liked this concept, but originally it supported only English. I sent a patch to support other languages (it provides API to support other languages) and a Japanese plug-in for Sphinx. I heard it was merged into the latest stable version 1.1.

When I visited Japan for a business trip (now I live BayArea), I bought the book “World of high-speed string analysis”. In this book, an amazing algorithm, “FM-index”, is described. It is good for a client side search engine. I think this improves Sphinx’s search engine even more.

And I am a founder of sphinx-users.jp. Now I have gave the leader position to @tk0miya and @shimizukawa but I still like Sphinx.

In addition to that, there are many simple HTML hosting service now, like bitbucket.org, github.com, PyPI, Google Drive and more. Also, HTML5 provides an application cache feature. I am thinking static HTML documents (like Sphinx’s outputs) without a search engine server will increase more and more. Providing a pure JavaScript search engine helps many users.

FM-index

This engine uses the FM-index algorithm. FM-index was invented Paolo Ferragina and Giovanni Manzini. Compared to “inverted index”, FM-index has the following advantages:

  • It doesn’t need word splitting before a search

    Some eastern Asian languages don’t have space between words. To split words, it needs language information (it is called a “dictionary”) For example, the famous Japanese word splitter, MeCab, needs a big dictionary file (> 10MB). I don’t know Chinese and Korean well. Supporting many languages by “inverted index” is difficult.

  • It can recreate original document from index file

    “Inverted index” only has word positions in documents. To show document samples on the search result page, it needs the original documents in addition to the index file. FM-index needs only an index file to search words and show document samples. Sphinx uses the original document source file for this purpose. But it is not good with translation feature because source files are written in different languages. And a user should enable the source exporting option.

  • It is the fastest full text search algorithm using a compressed index file

    The book “World of high-speed string analysis” said this. I don’t have any other evidence because I don’t have any other experience implementing a compressed index for inverted index algorithms, but smaller and faster is good for my goal.

I ported the FM-index library Shellinford that was written in C++ by @echizen_tm.

Other tools

The search algorithm was already ported and it passed the unit test. But it needs other parts to create a useful search engine.

At first, I ported the HTML parsing library. It analyses HTML and picks out text from existing HTML files. It is used from the index generator tool.

Now I am porting stemming algorithms. Stemming converts words into base/root form. It provides flexibility to the search engine.

I am changing Snowball‘s source code generator. Snowball a tool that has a long history in this area. It has stemming algorithms for more than 15 natural languages and generates C/C++/Java source code for each algorithm. I added a JSX generator and am adding Python (Python is needed for supporting Sphinx).

I selected JSX as the programming language. It is created by my DeNA Co.,Ltd. coworkers. It can generate faster and safer JavaScript code. It has a powerful static type system and detect many errors before running.

First Milestone

I will provide the search engine and an index file for the JSX website later this month.