Sphinx search is a full text search engine, commonly used with MySQL.
There are some misconceptions about Sphinx and its usage. Following is a list of some of Sphinx’ properties, hoping to answer some common questions.
- Sphinx is not part of MySQL/Oracle.
- It is a standalone server; an external application to MySQL.
- Actually, it is not MySQL specific. It can work with other RDBMS: PostgreSQL, MS SQL Server.
- And, although described as “free open-source SQL full-text search engine”, it is not SQL-specific: Sphinx can read documents from XML.
- It is often described as “full text search for InnoDB”. This description is misleading. Sphinx indexes text; be it from any storage engine or external source. It solves, in a way, the issue of “FULLTEXT is only supported by MyISAM”. Essentially, it provided full-text indexing for InnoDB tables, but in a very different way than the way MyISAM’s FULLTEXT index works.
Sphinx works by reading documents, usually from databases. Considering the case of MySQL, Sphinx issues a SQL query which retrieves relevant data (mostly the text you want to index, but other properties allowed).
- Being an external module, it does not update its indexes on the fly. So if 10 new rows are INSERTed, it has no knowledge of this. It must be called externally to re-read the data (or just read the new data), and re-index.
- This is perhaps the greatest difference, functionality-wise, between Sphinx and MyISAM’s FULLTEXT. The latter is always updated, for every row INSERTed, DELETEd or UPDATEd. The latter also suffers by this property, as this makes for serious overhead with large volumes.
- There’s more than one way to make that less of an issue. I’ll write some more in future posts.
- Sphinx does not keep the text to itself; just the index. Sphinx cannot be asked “Give me the blog post content for those posts containing ‘open source'”.
- Sphinx will only tell you the ID (i.e. Primary Key) for the row that matches your search.
- It is up to you to then get the content from the table.
- With SphinxSE (Sphinx Storage Engine for MySQL) this becomes easier, all-in-one query.
- It can keep other numeric data. Such data can be used to filter results.
- It provides with GROUP BY-like, as well as ORDER BY-like mechanism.
- It allows for ordering results by relevance.
- It allows for exact match search, boolean search, and more.
- It has an API & implementation for popular programming languages: PHP, Python, Perl, Ruby, Java.
The above describes Sphinx as a general fulltext search engine for databases. It does, however, have special treatment for MySQL:
- First and foremost, it knows how to query MySQL for data (duh!)
- If you don’t mind compiling from source, you can rebuild MySQL with SphinxSE: a storage engine implementation. This storage engine does not actually hold any data, but rather provides an SQL-like interface to the search daemon.
- Thus, you can query for search results using SELECT statements, JOINing to document tables, retrieving results, all in one step.
- If you do mind compiling MySQL, be aware that MariaDB comes with SphinxSE built in in newer versions.
- It implements the MySQL protocol. You can connect to the sphinx server using a MySQL client, and actually issue SQL statements to retrieve data. Not all SQL is supported. The valid subset is called SphinxQL.
- Snippets (excerpts) are supported via MySQL UDF.
@Nams,
same with other indexes: you need to keep track of the PK where you’ve got the string in the first place. Meaning you’ll need a “real” table to hold the content.
Update: It is not available on a shared hosting account.
HI Shlomi,
Its a grate overview about Sphinx.
Currently I’m using Apahe Lucene’s Solr as “search engine” in my ajax based application.
Is there any difference between these two?
Thanks
http://twitter.com/sachinpethani
@Sachin,
I have no experience with Lucene. From what I gather around, I see many users to both, and I hear they both compete closely in terms of performance and feature set. I can find quite a few links on Google which discuss this; sorry, but I can’t offer any additional information here, other than the fact that Sphinx has a good integration with MySQL, which Lucene does not. This in itself is probably not a sufficient reason to choose one over the other.