SQL Server Full Text Search

From Seo Wiki - Search Engine Optimization and Programming Languages

Jump to: navigation, search
File:SQLServer2008Logo.png
Full-Text Search (FTS) is a feature of the Microsoft SQL Server DBMS. FTS has been a part of SQL Server since SQL Server 7.0.

SQL Server Full Text Search (SQL FTS) is a fuzzy search technology for Microsoft SQL Server. It provides features for indexing and searching the contents of character-based, XML, and binary data stored in SQL Server databases. FTS is included as a feature of SQL Server 7.0, SQL Server 2000, SQL Server 2005, and SQL Server 2008 DBMSs.

Contents

Full-Text Search Architecture

File:MS SQL Server FTS Crawl.png
In the MS SQL Full-Text Search architecture, the full-text crawler (shown above) streams tokenized words from documents and data stored in the database. The full-text crawler is responsible for eliminating stopwords from the token stream, and for adding tokenized words to the full-text index.

The SQL Server FTS architecture consists of two distinct parts, the FTS crawler and the FTS query engine[1]. The FTS crawler is a service that streams tokenized words from full-text indexed documents and data stored in the database. The FTS crawler manages external word breaker components that use language-based rules to determine word boundaries during the tokenization process. This component is also responsible for comparing retrieved tokens to the stoplists stored in the database (called "noiseword lists" and stored in the file system prior to SQL Server 2008) and removing all matching stopwords from the token stream. Finally the crawler updates the full-text index with the tokens captured during the crawl process[2]. The crawler can be scheduled to start the process automatically as needed, at regularly scheduled intervals, or manually.

The FTS query engine is activated by a client full-text search request. The FTS query engine parses the incoming client full-text search predicate, tokenizing it and removing noisewords. The query engine then loads a language-specific XML thesaurus from the file system and uses it to perform search token expansions and replacements on the client's full-text search predicate. The FTS query engine then locates token matches in the full-text search index and passes off the IDs of matching documents to the SQL Server query engine so it can return relevant documents and data from the database. In SQL Server 2005 and prior, the FTS query engine was a separate service that existed outside of the SQL Server process space. Beginning with SQL Server 2008 the FTS query engine is closely integrated directly with the SQL Server query engine and is not a separate service.

File:MS SQL Server FTS Query.png
The full-text query engine is responsible for parsing client FTS search predicates, performing thesaurus expansions and replacements on search phrases, performing thesaurus expansions and replacements, locating matches in the full-text index, and finally passing the IDs of matching documents and data to the SQL Server query engine for data retrieval.

Differences in FTS Architecture Across Versions

In SQL Server 7.0, 2000, and 2005 FTS required installation of a Windows service called MSSearch, which operated outside of the SQL Server memory space. This service controlled all indexing and provided FTS search functionality. The MSSearch service communicated with the SQL Server query engine via the Component Object Model (COM) interface. Full-text indexes (FTIs) held stored search tokens in a compressed inverted index in files in the file system. Full-text catalogs (FTCs), which held FTIs, were also stored in the file system. Additional files, like language-specific thesaurus files that provide search synonym and keyword replacement functionality, and noiseword lists that provide word exclusion functionality, were also located in the file system.

Beginning with SQL Server 2008 the FTS service is tightly integrated with the SQL Server query engine. The newest version of FTS has been rebranded Integrated Full-Text Search or iFTS. FTS integration with the SQL Server query engine has eliminated the overhead associated with COM communication between services, and also provided additional opportunities for the SQL Server query engine to further optimize FTS queries. In SQL Server 2008 full-text indexes, catalogs, and stoplists (previously called noiseword lists) are all stored within the database. Thesaurus files are still stored in the file system.

Although the MSSearch service is no longer necessary, SQL Server 2008 iFTS uses two additional Windows services: the fdhost service, which loads and manages word breakers and stemmers, and the fdlauncher service which loads the fdhost service on an as-needed basis. The fdhost service is required to keep third-party iFTS components (generally written in standard C++) out of the SQL Server process space, where a poorly written unmanaged component could otherwise cause server-wide issues.[3]

Full Text Query Syntax

SQL Server provides an FTS query language which can be used within FTS-specific predicates and functions to perform full-text queries. SQL Server provides the predicates and functions listed in the following table, each of which can accept FTS query strings:

Keyword Meaning Returns
CONTAINS Supports complex syntax to search for a word or phrase using Boolean operators, prefix terms, weighted terms, phrase search, thesaurus expansions/replacements and proximity search Boolean
CONTAINSTABLE Supports the full CONTAINS syntax and returns document IDs with rank scores for matches Table
FREETEXT Accepts a simplified FTS query syntax and automatically performs thesaurus expansions and replacements Boolean
FREETEXTTABLE Supports simplified FREETEXT syntax and returns document IDs and rank scores for matches Table

The FTS CONTAINS and FREETEXT predicates can be used in the WHERE clause of a SQL statement to limit results of a query based on the results of a full-text search. The CONTAINSTABLE and FREETEXTTABLE functions are table-valued functions that each return a table with two columns, one containing a document ID and another with a relevance ranking score. The tables returned by these functions can be joined to the source table to return the documents that match the FTS query results.

The rank score returned by the FTS functions is a number from 0 to 1000 that indicates the relevance of the search result. The higher the value the more relevant the match. Although rank is always calculated internally when a full-text search is performed, it is only exposed through the use of the full-text search functions. A variety of factors are considered in the various ranking algorithms used, including:

  • Document length - all things equal, shorter documents are considered more relevant
  • Number of occurrences of search words/phrases - the more times the search words/phrases occur in the document, the higher the ranking
  • Proximity of search words/phrases in proximity search - proximity is factored into rank when the NEAR proximity search operator is used
  • User-defined weights - user-defined weights are factored into rank when the ISABOUT weighted search operator is used

The algorithms used to calculate rank include Jaccard for weighted searches and Okapi BM25 for FREETEXT searches.

Proximity Searches

FTS has the ability to find words and phrases that are located in close proximity to one another. This is achieved by using the keyword "NEAR" (or the operator "~") between two given search words or phrases. The number of intervening words between phrases is factored into the rank of documents containing both of the search words or phrases. The more intervening words between the two search terms, the lower the total rank. There is currently no method for specifying a user-defined proximity (i.e., "10 words apart", "2 paragraphs apart", etc.) Paragraph breaks and other non-space whitespace characters are known to affect the calculation of proximity, though Microsoft has not published the specific details of the calculations. Additionally, the calculation of proximity is subject to change with each SQL Server release, potentially affecting proximity search results.

SQL Server uses a vector-space search algorithm to perform proximity searches. The order in which the phrases appear within the search document are irrelevant to the vector-space algorithm. In a SQL Server 2008 CONTAINS or FREETEXT proximity search predicate the NEAR operator is mapped internally to the AND operator, so that proximity is ignored. The net results are the same as if you simply use the AND operator with CONTAINS or FREETEXT. To take advantage of proximity search you must use the CONTAINSTABLE or FREETEXTTABLE functions, both of which allow access to the internally-generated relevance ranking.

The following sample query uses the NEAR keyword to perform an FTS proximity search:

SELECT 
  e.name, 
  e.surname
FROM CONTAINSTABLE
  (
    Athletes, 
    *, 
    '("Gold" NEAR "Finalists")'
  ) AS ct
INNER JOIN Athletes e
  ON ct.[KEY] = e.id
WHERE ct.[RANK] > 0
  ORDER BY ct.[RANK];

Proximity search can be performed using either the NEAR keyword or the ~ operator in FTS queries.

Inflectional Searches

Full-text searching has the ability to search for word inflectional forms, or different forms of the same word. Inflectional forms of words are generated by a language-specific components known as "stemmers" at query time. The U.S. English language stemmer (Locale ID [LCID] 1033) is a different component from the Spanish language stemmer (LCID 3082), for instance.

Stemmer components rely on language-specific rules and dictionary lookups to generate inflectional word forms which include gender and neutral verb conjugations, plural and singular nouns, adjective forms, and verb conjugations. As an example, when given the verb swim, SQL Server generates the inflectional forms swim, swims, swimming, swam, and swum when using the U.S. English stemmer.

The FREETEXT predicate and FREETEXTTABLE function automatically generate inflectional forms for search words. The CONTAINS predicate and CONTAINSTABLE functions, however, require you to use the FORMSOF(INFLECTIONAL, ...) operator in your full-text query string to generate inflectional forms.

Prefix Term Searches

FTS provides prefix search capability, which allows you to search for words beginning with a certain term, or prefix. This is achieved by following your search token with a wild-card asterisk (*) symbol. In order to indicate to SQL Server that your search term is in fact a prefix search term (and that the * character is not to be matched exactly), you must enclose the search token in double quotes.

Prefix Term Simple Example #1

-- This sample matches all documents containing a word that begins with the prefix "al" 
SELECT 
  name, 
  surname
FROM Athletes a
WHERE CONTAINS
  (
    *, 
    '"al*"'
  );

Binary Document Search

FTS has the ability to search through binary data, which can consist of a wide variety of documents stored in the database. FTS uses document type-specific iFilter components which understand the structure, formatting and can implement the rules of the document type it is designed for. Some of the supported document types which can be stored and searched in the database include:

  • HTML (*.htm or *.html)
  • XML (*.xml)
  • MS Word (*.doc)
  • MS Excel (*.xls)
  • MS PowerPoint (*.ppt)
  • Adobe Acrobat (*.pdf)

Check the SELECT * FROM sys.fulltext_document_types catalog view for a list of the stock SQL Server 2008 supported document types. You will notice that the DOCX and XLSX are not represented but are probably covered under XML.


This list is by no means exhaustive. In fact, any type of document can be stored in the database and searched with FTS as long as an appropriate document-specific iFilter component is available and installed. Some formats, such as the Adobe Acrobat .pdf format listed above, require installation of a third-party iFilter for FTS to properly index the document contents.

Weighted Searches

FTS has the ability to associate a higher preference for certain terms over others. This is achieved by using the function ISABOUT with the WEIGHT keyword. The keyword is associated with any decimal value between 0.0 and 1.0. The higher the weight the greater preference the match will be given.

Weighted Searches Simple Example #1

SELECT 
  a.name, 
  a.surname
FROM Athletes a
JOIN CONTAINSTABLE
  (
    Athletes, 
    notes, 
    'ISABOUT(Bronze WEIGHT(.2), Silver WEIGHT(.4), Gold WEIGHT(.8))'
  ) ct
  ON a.id = ct.[KEY]
  ORDER BY Rank DESC;

Stopwords

These are common tokens that are considered useless, or nearly useless, in search applications. These tokens generally fall into one, or both, of the following categories:

  • Tokens that occur too frequently to add value to a search (e.g., "himself", "would", "through")
  • Tokens that are too short to add value to a search (e.g., "B", "C", "1")

Many words fall into both of these categories. Words such as "I", "the", "and", and "of" are all too short -- and occur too frequently -- to add considerable value to the search process. By default FTS uses a language-specific system stoplist, which consists of a list of stopwords, to eliminate these types of words from searches. Eliminating these stopwords will, in turn, reduce the overall size of the full-text index.[4] Administrators can create custom user-defined stoplists, or turn off stoplists, as required.

Hardware Optimization Considerations

By default, MS SQL Server reserves all the virtual memory in its process space, minus 1GB. This leaves little for MSSearch. Consider readjusting to a ratio of 8(SQL Server):2(MSSearch). By default the Windows 2000/NT virtual memory is 1.5 times physical memory. Consider increasing this to 3 times physical memory.

References

  1. [1]. MSDN: Books Online - Full-Text Search Architecture
  2. [2]. MSDN: Books Online - Full-Text Search Architecture
  3. [3]. MSDN: Books Online - Stopwords and Stoplists
  4. [4]. MSDN: Books Online - Stopwords and Stoplists
Personal tools

Served in 0.164 secs.