Sphinx & MySQL: facts and misconceptions

Sphinx search is a full text search engine, commonly used with MySQL.

There are some misconceptions about Sphinx and its usage. Following is a list of some of Sphinx’ properties, hoping to answer some common questions.

  • Sphinx is not part of MySQL/Oracle.
  • It is a standalone server; an external application to MySQL.
  • Actually, it is not MySQL specific. It can work with other RDBMS: PostgreSQL, MS SQL Server.
  • And, although described as “free open-source SQL full-text search engine”, it is not SQL-specific: Sphinx can read documents from XML.
  • It is often described as “full text search for InnoDB”. This description is misleading. Sphinx indexes text; be it from any storage engine or external source. It solves, in a way, the issue of “FULLTEXT is only supported by MyISAM”. Essentially, it provided full-text indexing for InnoDB tables, but in a very different way than the way MyISAM’s FULLTEXT index works.

Sphinx works by reading documents, usually from databases. Considering the case of MySQL, Sphinx issues a SQL query which retrieves relevant data (mostly the text you want to index, but other properties allowed).

  • Being an external module, it does not update its indexes on the fly. So if 10 new rows are INSERTed, it has no knowledge of this. It must be called externally to re-read the data (or just read the new data), and re-index.
    • This is perhaps the greatest difference, functionality-wise, between Sphinx and MyISAM’s FULLTEXT. The latter is always updated, for every row INSERTed, DELETEd or UPDATEd. The latter also suffers by this property, as this makes for serious overhead with large volumes.
    • There’s more than one way to make that less of an issue. I’ll write some more in future posts.
  • Sphinx does not keep the text to itself; just the index. Sphinx cannot be asked “Give me the blog post content for those posts containing ‘open source'”.
    • Sphinx will only tell you the ID (i.e. Primary Key) for the row that matches your search.
    • It is up to you to then get the content from the table.
    • With SphinxSE (Sphinx Storage Engine for MySQL) this becomes easier, all-in-one query.
  • It can keep other numeric data. Such data can be used to filter results.
  • It provides with GROUP BY-like, as well as ORDER BY-like mechanism.
  • It allows for ordering results by relevance.
  • It allows for exact match search, boolean search, and more.
  • It has an API & implementation for popular programming languages: PHP, Python, Perl, Ruby, Java.

The above describes Sphinx as a general fulltext search engine for databases. It does, however, have special treatment for MySQL:

  • First and foremost, it knows how to query MySQL for data (duh!)
  • If you don’t mind compiling from source, you can rebuild MySQL with SphinxSE: a storage engine implementation. This storage engine does not actually hold any data, but rather provides an SQL-like interface to the search daemon.
    • Thus, you can query for search results using SELECT statements, JOINing to document tables, retrieving results, all in one step.
    • If you do mind compiling MySQL, be aware that MariaDB comes with SphinxSE built in in newer versions.
  • It implements the MySQL protocol. You can connect to the sphinx server using a MySQL client, and actually issue SQL statements to retrieve data. Not all SQL is supported. The valid subset is called SphinxQL.
  • Snippets (excerpts) are supported via MySQL UDF.

12 thoughts on “Sphinx & MySQL: facts and misconceptions

  1. * Being an external module, it does not update its indexes on the fly. So if 10 new rows are INSERTed, it has no knowledge of this. It must be called externally to re-read the data (or just read the new data), and re-index.

    ** There’s more than one way to make that less of an issue. I’ll write some more in future posts.

    Did you end up writing about how to make that less of an issue?

  2. I did not… And it’s been a while since I last used Sphinx, so I can only assume a lot has gone by that I’m unaware of.

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.