Relevant Search
by Doug Turnbull and John Berryman
The book provides a solid introduction to search relevance. It offers a well-structured overview, explaining problem definitions, core concepts and techniques. It also emphasizes non-technical aspects crucial to building high-quality search systems, such as having a fast iteration process, having content curators on the team, and having an organizational structure that supports a healthy relevance feedback loop.
Chapter 1 introduces the problem space, defining relevance as "the art of ranking content for a search based on how much that content satisfies the needs of a user and the business". The authors show that relevant search results look different depending on the type of search or domain -- for example, web search, e-commerce and expert search each have different expectations and goals.
The chapter also introduces several key concepts:
- signals - ranking factors that measure what users care about, i.e. how relevant an item is for a given query
- information need - a description of the ideal content that would satisfy the user's search
- judgement lists - predefined lists of search results for a set of queries, often used in validation
- feature - an attribute of the content or query
- ranking function - a function that combines multiple signals to produce a ranked list of results
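To make the relationship between signals and a ranking function concrete, here is a minimal sketch of my own (not code from the book). The signal functions, the weights and the sample documents are all hypothetical; a real search engine computes far richer signals and combines them in more sophisticated ways.

```python
from typing import Callable, Dict, List, Tuple

# A signal takes a query and a document and returns a number (hypothetical type alias).
Signal = Callable[[str, Dict], float]


def title_match(query: str, doc: Dict) -> float:
    """Fraction of query terms that appear verbatim in the title."""
    query_terms = set(query.lower().split())
    title_terms = set(doc["title"].lower().split())
    return len(query_terms & title_terms) / len(query_terms) if query_terms else 0.0


def popularity(query: str, doc: Dict) -> float:
    """A query-independent signal, e.g. something the business cares about."""
    return doc.get("popularity", 0.0)


def rank(query: str, docs: List[Dict], weighted_signals: List[Tuple[Signal, float]]) -> List[Dict]:
    """Ranking function: combine signals as a weighted sum and sort by score."""
    def score(doc: Dict) -> float:
        return sum(weight * signal(query, doc) for signal, weight in weighted_signals)
    return sorted(docs, key=score, reverse=True)


# Made-up documents, for illustration only.
docs = [
    {"id": 1, "title": "Star Trek", "popularity": 0.9},
    {"id": 2, "title": "Star Wars", "popularity": 0.5},
    {"id": 3, "title": "Dancing with the Stars", "popularity": 0.8},
]

ranked = rank("star wars", docs, [(title_match, 1.0), (popularity, 0.3)])
print([doc["id"] for doc in ranked])
```

Changing the weights changes the ordering, which is the kind of tuning that relevance work keeps coming back to.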
One detail I appreciated was how the authors define tools like Solr and Elasticsearch as search programming frameworks. I particularly liked it because at some point I was talking to a friend of mine who assumed that having Elasticsearch meant search was a "solved problem". This perspective highlights that search frameworks are just the starting point.
The definition of relevance gets refined in the chapter, arriving at the comprehensive form:
Relevance is the practice of improving search results for users by satisfying their information needs in the context of a particular user experience, while balancing how ranking impacts our business's needs
Chapter 2
- inverted index
- analysis - converting raw documents into tokens, which serve as the features describing the document. An important point to remember is that, in classical search, the query tokens and the document tokens have to be identical byte-for-byte to be considered a match. The chapter also expands on the analysis steps, namely character filtering, tokenization and token filtering.
- indexing and storing - saving the analyzed data. The authors articulate the difference between indexing (updating the inverted index with the extracted tokens so that they are searchable) and storing (retaining the original, unaltered document content in the stored field's data structure). In short, indexed data is used for search operations, while stored data is retrieved and displayed to the user. Depending on the technology and configuration, storing may also be taken care of when indexing a document (e.g. Elasticsearch, unless _source is disabled).
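The pieces above fit together in a few lines of code. The following is a toy sketch of my own, not how Solr or Elasticsearch are implemented: the `analyze` function, the `TinyIndex` class, the stop word list and the `store` flag are all hypothetical names chosen for illustration. It runs the three analysis steps, keeps an inverted index for searching, and keeps the original text separately for display.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "of"}  # illustrative stop word list, not a real default


def analyze(text):
    """Turn raw text into tokens using the three analysis steps."""
    # 1. Character filtering: strip HTML-like tags.
    text = re.sub(r"<[^>]+>", " ", text)
    # 2. Tokenization: split into word-like chunks.
    tokens = re.findall(r"\w+", text)
    # 3. Token filtering: lowercase and drop stop words.
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]


class TinyIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # inverted index: token -> document ids
        self.stored = {}                  # stored fields: document id -> original text

    def add(self, doc_id, text, store=True):
        # Indexing: make the document searchable through its tokens.
        for token in analyze(text):
            self.postings[token].add(doc_id)
        # Storing: keep the unaltered content so it can be shown to the user.
        if store:
            self.stored[doc_id] = text

    def search(self, query):
        # The query goes through the same analysis; tokens must match exactly.
        tokens = analyze(query)
        if not tokens:
            return []
        matches = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        return sorted(matches)


index = TinyIndex()
index.add(1, "<b>The Matrix</b>")
index.add(2, "Matrix Reloaded", store=False)

print(index.search("MATRIX"))    # matches because both sides are lowercased
print(index.search("matrices"))  # no match: the tokens are not identical
print(index.stored)              # only document 1 kept its original text
```

The "matrices" query is the interesting one: without a stemming or synonym step in the analysis, nothing tells the index that it is related to "matrix".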