For the last several months, I have been deploying Sphinx search to handle search indexes for all the websites Martino Flynn hosts in-house. My intention here is to explain the following topics as thoroughly as I possibly can in a blog post:
- What is Sphinx search?
- How to install and configure
- Configuring data sources and indexes
- Configuring the searchd daemon
- Indexer and searchd commands
- Web scraping options
- Scheduled tasks
What is Sphinx search?
Sphinx is an open source full-text search server designed with performance, search quality, and simplicity in mind. Some of the key features I enjoy include, but are not limited to:
- Non-SQL storage indexing.
- Advanced full-text searching syntax.
- Rich database-like querying features.
- Better relevance ranking.
- Flexible text processing.
- Easy application integration.
Sphinx is licensed under GPLv2, which allows for commercial and non-commercial usage. Sphinx has outstanding search performance and scales well, which is important for the growth of any Web application or website.
How to install and configure
There was a time when I used the SphinxSE engine for MySQL, but I no longer think it is necessary, at least for my current needs. This is the install script I use for compiling and installing Sphinx.
tar xvf sphinx-2.0.5-release.tar.gz
cd sphinx-2.0.5-release
./configure --prefix=/usr/local/sphinx --with-libstemmer --without-mysql --with-xmlpipe2
make && make install
After the installation process has finished, you will want to create an unprivileged user for Sphinx to run as.
useradd -r sphinx
You may need to specify the full path to the command for this to work. On Debian, you can delete the password after creating the user, like this:
passwd --delete sphinx
After you have created this user, change the ownership of the binary files under /usr/local/sphinx to the Sphinx user. You will also want to adjust the ownership and permissions of your log directories and socket file. I usually create a sphinx folder in both /var/log and /var/run for this.
Configuring data sources and indexes
You will need to edit your sphinx.conf file, located at /usr/local/sphinx/etc/sphinx.conf.
The first step is to create your source. In this case, we will rely on an XMLPipe2 source, whose XML format is described in the Sphinx documentation. Once you have generated the XML file, reference it in a source block in sphinx.conf (named srcwebsitename in this example) with these settings:
type = xmlpipe2
xmlpipe_command = /bin/cat /path/to/websitename.xml
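To make the XML file more concrete, here is a minimal Python sketch of one way to generate it. It is not the program we run in production. The function names and the two field names (title and content) are assumptions chosen for illustration, and the element layout follows the xmlpipe2 format as I understand it from the Sphinx documentation.

```python
from xml.sax.saxutils import escape, quoteattr


def build_docset(pages):
    """Return xmlpipe2 XML for an iterable of (doc_id, title, content) tuples."""
    lines = [
        '<?xml version="1.0" encoding="utf-8"?>',
        "<sphinx:docset>",
        "<sphinx:schema>",
        # Full-text fields; these names are this sketch's choice, not a requirement.
        '<sphinx:field name="title"/>',
        '<sphinx:field name="content"/>',
        "</sphinx:schema>",
    ]
    for doc_id, title, content in pages:
        # Sphinx expects unique, nonzero integer IDs; this sketch only
        # checks that the ID is positive, not that it is unique.
        if int(doc_id) <= 0:
            raise ValueError("document IDs must be positive integers")
        lines.append("<sphinx:document id=%s>" % quoteattr(str(int(doc_id))))
        # escape() protects &, < and > so the XML stays well-formed.
        lines.append("<title>%s</title>" % escape(title))
        lines.append("<content>%s</content>" % escape(content))
        lines.append("</sphinx:document>")
    lines.append("</sphinx:docset>")
    return "\n".join(lines) + "\n"


def write_docset(path, pages):
    """Write the docset to the file that xmlpipe_command prints with cat."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(build_docset(pages))
```

Calling write_docset("/path/to/websitename.xml", pages) with your own list of pages would produce the file that the xmlpipe_command above reads.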
Next, you will need to configure the indexing options for this data source inside an index block, as follows:
source = srcwebsitename
path = /path/to/indexes
morphology = stem_en
charset_type = utf-8
min_prefix_len = 3
enable_star = 1
Make sure that the directory holding your indexes is owned by the Sphinx user. You may also want to look up these options, or add your own, in the Sphinx documentation.
Configuring the searchd daemon
Lastly, you will need to set up the configuration for the searchd daemon. Add a searchd block with the following settings to the end of your sphinx.conf file:
port = 9312
log = /var/log/sphinx/searchd.log
query_log = /var/log/sphinx/query.log
read_timeout = 5
max_children = 30
pid_file = /var/run/sphinx/searchd.pid
max_matches = 1000
seamless_rotate = 1
compat_sphinxql_magics = 0
preopen_indexes = 1
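Again, refer to the Sphinx documentation as needed for your own configuration. Note that the documentation marks port as deprecated in favor of listen (for example, listen = 9312), although port still works in this release.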
Indexer and searchd commands
You must run these commands under the Sphinx user account.
indexer --rotate --all
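You should see output similar to this. The --rotate option applies when the index already exists and searchd is serving it: indexer builds the new index files alongside the old ones and then signals searchd to switch over to them.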
total 79 docs, 34354 bytes
total 0.025 sec, 1340591 bytes/sec, 3082.80 docs/sec
total 6 reads, 0.000 sec, 31.5 kb/call avg, 0.0 msec/call avg
total 18 writes, 0.002 sec, 22.1 kb/call avg, 0.1 msec/call avg
rotating indices: succesfully sent SIGHUP to searchd (pid=8724).
Now start the search daemon by running searchd, then type exit.
The exit command simply logs you out of the Sphinx user account. Data sources will vary here, but you should be able to search your index by using a command similar to this:
search --index (INDEXNAME) -a (SEARCH TERM)
(Ignore the parentheses)
You can simply type “search” and hit enter to view all options for this command and search through your indexes.
Web scraping options
Currently, at Martino Flynn we use a program called Scrapy to crawl a website and generate a JSON data source. We then wrote a program that parses the JSON into the Sphinx XML format and writes the file for Sphinx to index. This program also adds records to a search results table, keyed by the same document IDs that Sphinx returns in its search results. This way we ensure we are pulling the correct results when generating the search page. There are other Web scraping tools out there, such as BeautifulSoup, and I would love to hear what everyone else is using.
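As a rough illustration of the ID bookkeeping described above, the following Python sketch assigns a document ID to each crawled page and stores an ID-to-URL lookup table. It is a sketch under stated assumptions, not our actual program: the item keys (url, title, body), the table name search_results, and the use of SQLite as a stand-in database are all hypothetical.

```python
import json
import sqlite3


def load_items(json_path):
    """Load a JSON array of crawled items (the shape of a Scrapy JSON feed)."""
    with open(json_path, encoding="utf-8") as fh:
        return json.load(fh)


def assign_ids(items):
    """Return (doc_id, url, title, body) tuples, skipping duplicate URLs.

    IDs start at 1 and follow crawl order. The keys "url", "title" and
    "body" are assumptions about what the spider emits.
    """
    seen = set()
    docs = []
    for item in items:
        url = item["url"]
        if url in seen:
            continue
        seen.add(url)
        docs.append((len(docs) + 1, url, item.get("title", ""), item.get("body", "")))
    return docs


def store_lookup(conn, docs):
    """Rebuild the (hypothetical) search_results table from the docs."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS search_results ("
        "doc_id INTEGER PRIMARY KEY, url TEXT NOT NULL, title TEXT)"
    )
    # Replace the previous crawl's rows so IDs always match the new XML file.
    conn.execute("DELETE FROM search_results")
    conn.executemany(
        "INSERT INTO search_results (doc_id, url, title) VALUES (?, ?, ?)",
        [(doc_id, url, title) for doc_id, url, title, _body in docs],
    )
    conn.commit()
```

The search page can then take the document IDs that Sphinx returns and look up the matching URL and title in this table. The same IDs must be used when writing the XML file, or the results will point at the wrong pages.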
Scheduled tasks
Cron jobs run the Web crawls at night and update the XML files and the search results table in the database. If a crawl fails or is interrupted, an email is sent to me; otherwise a success email is sent. These cron jobs simply run the program mentioned under “Web scraping options” above.
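The success-or-failure email can be sketched as a small Python wrapper that cron calls. This is only an outline of the idea, not our actual job: the command path, the email addresses, and the assumption of a mail server listening on localhost are all hypothetical.

```python
import smtplib
import subprocess
from email.message import EmailMessage


def run_crawl(command):
    """Run the crawl command; return (succeeded, combined output)."""
    try:
        result = subprocess.run(
            command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
        )
    except OSError as exc:
        # The command could not be started at all (missing file, permissions).
        return False, str(exc)
    return result.returncode == 0, result.stdout


def build_report(succeeded, output, sender, recipient):
    """Build the status email; the subject says whether the crawl worked."""
    msg = EmailMessage()
    msg["Subject"] = "Nightly crawl " + ("succeeded" if succeeded else "FAILED")
    msg["From"] = sender
    msg["To"] = recipient
    # Keep only the tail of long output so the email stays readable.
    msg.set_content(output[-2000:] or "(no output)")
    return msg


def main():
    # Hypothetical command and addresses; replace with your own.
    ok, output = run_crawl(["/path/to/crawl_and_index.sh"])
    msg = build_report(ok, output, "sphinx@example.com", "admin@example.com")
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)
```

A script that ends by calling main() could then be scheduled with a crontab entry such as 0 2 * * * /usr/bin/python3 /path/to/nightly_crawl.py, where the time and path are again only examples.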
I would love to hear your thoughts on this technology, especially other ways of configuring it. In my next post I plan to discuss relevance algorithms that improve Sphinx search results even further. Search is an important tool and needs to perform well, and we plan to continue using Sphinx search at Martino Flynn to help our clients.