How to Develop Search Engine From Scratch – Everything You Need to Know

By Christopher Zack

Have you ever googled “what is the meaning of life?”

If you have, we probably share a lot. Search engines are enormously complex. Fleets of servers scrape websites, store and consolidate the data, and run extensive matrix calculations, pushing us ever closer to the singularity.

Still, the basics of a search engine are simple. You ask a question in the form of a query, and the search engine looks for relevant pages. The most critical factor in establishing a page’s relevance to a query is:

How similar is the data on the page to the query?

In this post, we’ll go one step further and construct a working full-text search module in Python that mimics a subset of Google’s capabilities.

How does full-text search work?

Full-text search is the capacity to search for terms across many documents; the phrase usually describes a database feature. Let’s define some terminology.

Document: a blob of text. For example:

Make it so. Just do it. Don’t let your dreams die. Yesterday you said tomorrow, so go for it. Realize your dreams. Just do it. Some only dream of prosperity; you’ll work hard every day to attain it. Nothing is impossible. Keep going where others would give up. So, what are you waiting for? Make it so! Just do it! Yes, you can. Just do it. Restarting is not for the faint-hearted.

Term: a word that occurs in a document. For example:

just

Term frequency: the number of times a term occurs in a document. In the snippet below, the term “dreams” occurs twice:

Don’t let your dreams die. Realize your dreams.

Query: a question written in a query language. For example:

just do

which asks for documents containing “just” or “do”.

A conventional database index cannot answer such questions efficiently. A database index maps row values to row positions, most commonly with an ordered map (for example, implemented as a B-Tree). It speeds up exact and range lookups, answering questions like:

Which user has the username ‘bob’?

How many users are aged 5 to 15?

For documents, those lookups are mostly irrelevant. You will never google an entire Wikipedia page; documents containing certain words or phrases are what catch your eye, and a regular index can’t find them.

This is where full-text search comes in, helping you quickly answer questions like:

just do

Which documents contain “just” or “do”?

just AND do

Which documents contain both “just” and “do”?

“just do”

Which documents contain the exact phrase “just do”?

-do

Which documents do not contain “do”?

“you can” AND do -tomorrow

Which documents contain “you can” and “do”, but not “tomorrow”?

To get a feeling for full-text search, type the above queries into Google right now; it understands them all. Millions of websites use Elasticsearch, the most widely used open-source search engine, for the same kinds of queries.

Google will also employ several of the above query types implicitly. For example, if you google San Francisco, it will implicitly run San AND Francisco even though you did not spell that out.


But wait, it gets better: each document also gets a score.

Search engines assign every document a score for each query, for example:

9000.001

Lucene, a comprehensive full-text search library, uses the tf-idf algorithm: a document scores higher the more often a term occurs in it, and lower the more common that term is across all documents.
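To make that concrete, here is a minimal sketch of a tf-idf score. This is not Lucene’s exact formula (Lucene’s scoring has many more refinements); the function and its log-based idf are illustrative assumptions:

```python
import math

def tf_idf(term_count_in_doc: int, doc_length: int,
           total_docs: int, docs_containing_term: int) -> float:
    # term frequency: how prominent the term is within this document
    tf = term_count_in_doc / doc_length
    # inverse document frequency: rarer terms carry more weight
    idf = math.log(total_docs / (1 + docs_containing_term))
    return tf * idf
```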

How does all of this work in practice?

Develop Search Engine

TL;DR: we tokenize the documents, build an inverted index from the tokens, and use that inverted index to execute queries quickly.

Making a search engine is complicated. Google, Bing, and Elasticsearch employ thousands of engineers.

To fit the whole process into a blog post, I’ve simplified it:

Our library only supports the query types specified above.

We use the simplest data structures possible.

All data is indexed before any querying. Real full-text search libraries support incremental indexing, allowing you to add documents and query the index as it builds.

Query planning, ASTs, and compression are absent.

I intend to write separate posts about each of these. The implementation should still show how an optimized library works: we take the same high-level approach as Lucene and Tantivy, just at a much slower rate.

Read More: How to Build a Website from Scratch with HTML – Step by Step Guide

Constructing a pipeline

Indexing is a funnel, with querying as the final phase; a condensed code sketch follows the list.

Documents: first, we’ll need some documents. Where they come from is not covered in this post; I used three motivational speeches from the local file system.

Analyze: as indicated previously, we must tokenize each document. The most common normalization is converting all words to lowercase, so that’s what we’ll do.

Index: the tokens are placed into the data structure.

Query: we use the query language’s logic to find the matching documents quickly.
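As a preview, here is the whole funnel condensed into a sketch. The file names are made up, and build_index uses dict.setdefault as a shortcut; the rest of the post develops each stage step by step:

```python
import re
from pathlib import Path
from typing import Dict, List, Tuple

def tokenize(document: str) -> List[str]:
    # Analyze: lowercase, then split into words
    return re.findall(r"\w+", document.lower())

def build_index(tokenized_documents: List[List[str]]) -> Dict[str, List[Tuple[int, int]]]:
    # Index: map every token to its (document_id, token_index) positions
    index: Dict[str, List[Tuple[int, int]]] = {}
    for document_id, tokens in enumerate(tokenized_documents):
        for token_index, token in enumerate(tokens):
            index.setdefault(token, []).append((document_id, token_index))
    return index

# Documents: the file names are illustrative
paths = ["inch_by_inch.txt", "pursuit_of_happiness.txt", "just_do_it.txt"]
documents = [Path(path).read_text() for path in paths]

inverted_index = build_index([tokenize(d) for d in documents])
# Query: querying this index is the subject of the rest of the post
```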

Analyze

As stated above, this step extracts lowercase words from the document strings. The implementation below is crude and for demonstration purposes, but it does the job.

```python
import re
from typing import List

def tokenize(document: str) -> List[str]:
    # normalize the document's case
    lowercase_document = document.lower()
    # use a regex to find all of the words
    words = re.findall(r"\w+", lowercase_document)
    return words

tokenized_documents = [tokenize(document) for document in documents]

for tokenized_document in tokenized_documents:
    print(tokenized_document[:3], "...", tokenized_document[-3:])
```

Output:

```
['inch', 'by', 'inch'] ... ['you', 'gotta', 'do']
['the', 'pursuit', 'of'] ... ['get', 'it', 'period']
['just', 'do', 'it'] ... ['stop', 'giving', 'up']
```

As a result, each document’s tokens are now ready.

Indexing

The next step is to load the tokens into an inverted index, a data structure that can be queried efficiently.

Given a single word, an inverted index must tell you where that word appears in every document. A hash map is what many people immediately think of, and it’s what we’ll use here; real search engines build their indexes out of a variety of data structures.

```python
inverted_index = {}

for document_id, tokenized_document in enumerate(tokenized_documents):
    for token_index, token in enumerate(tokenized_document):
        token_position = (document_id, token_index)
        if token in inverted_index:
            inverted_index[token].append(token_position)
        else:
            inverted_index[token] = [token_position]
```

```python
print("the ->", inverted_index["the"])
print("do ->", inverted_index["do"])
```

Output (each entry is a (document_id, token_index) pair):

```
the -> [(1, 0), (1, 5), (1, 20), (1, 28), (1, 76), (1, 82)]
do -> [(0, 94), (0, 399), (0, 455), (0, 493), (1, 1), (1, 3), (1, 6), (1, 21), (1, 29), (1, 74), (1, 77), (1, 83), (2, 15), (2, 33), (2, 44)]
```

Instead of storing only the frequency of each term, we keep its positions, because positions allow efficient searches for exact phrase matches.
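For instance, with positions in hand, checking whether “just” is immediately followed by “do” becomes a couple of set operations. A quick sketch against the index above, using document 2 for illustration:

```python
# positions of "just" and "do" inside document 2
just_positions = {index for doc_id, index in inverted_index.get("just", []) if doc_id == 2}
do_positions = {index for doc_id, index in inverted_index.get("do", []) if doc_id == 2}

# every position where the phrase "just do" starts in document 2
phrase_starts = {index for index in just_positions if index + 1 in do_positions}
```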

Querying

The operations we want to support are as follows (examples in code after the list):

a b is interpreted as a OR b: when there is no operator between two terms, documents matching either term are returned.

a AND b: when two terms are joined by the AND operator, only documents containing both terms are returned.

+a: any document that does not contain the prefixed term is omitted from consideration.

-a: any document that contains the prefixed term is excluded.

“a b”: exact match. Only documents containing the exact phrase enclosed in quotation marks (‘"’) are returned.
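Assuming a query_index(index, query) entry point like the one we build over the rest of this post, the operators behave like this:

```python
query_index(index, 'just do')         # OR: documents containing "just" or "do"
query_index(index, 'just AND do')     # AND: documents containing both terms
query_index(index, '+just tomorrow')  # only documents containing "just" are considered
query_index(index, 'just -tomorrow')  # documents containing "tomorrow" are dropped
query_index(index, '"just do"')       # only documents with the exact phrase "just do"
```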

Here is the algorithm I’m employing:

Parse the query, splitting it into its operators and terms.

Walk through the query, keeping track of each document’s current score and of which documents have been included or excluded so far.

At the end, return a list of document identifiers sorted by score, dropping entries according to the inclusion and exclusion sets.

Note that this algorithm does not currently support parentheses; supporting them would add too much complexity to an already tricky explanation.

Algorithm Architecture

To illustrate, let’s walk through this example query:

do AND "you can" -tomorrow

The first step is to split the query into its terms and operators:

```python
query_expressions = query.split(" ")
# ['do', 'AND', '"you', 'can"', '-tomorrow']
```

Next, we’ll set up the global state to update as we process the query:

```python
document_scores: DocumentScores = {}
excluded_document_ids: Set[DocumentId] = set()
included_document_ids: Set[DocumentId] = set()
```
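The snippets use a few type aliases whose definitions aren’t shown in this post; something like the following is assumed:

```python
from typing import Dict, Set, Tuple

DocumentId = int
TokenIndex = int
TokenPosition = Tuple[DocumentId, TokenIndex]

# maps a document to its accumulated score for the current query
DocumentScores = Dict[DocumentId, int]
```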

The algorithm evaluates the expressions using a mode and a pointer. There are two modes, OR and AND, and a term is handled differently depending on the current mode. On top of that, computing a score for an exact match takes more complicated logic than for a single term.

```python
mode = Mode.OR
pointer = 0

while pointer < len(query_expressions):
    query_expression = query_expressions[pointer]

    if query_expression in MODES:
        # change the mode
        mode = ...
    else:
        if query_expression.startswith('"'):
            # the EXACT case
            new_document_scores = ...
            # (the pointer is moved past the closing quotation mark)
        else:
            # the TERM case
            new_document_scores = ...

        # merge new_document_scores into the global state,
        # depending on the mode: OR, AND, INCLUDE or EXCLUDE

    pointer += 1

# compute the final results:
# filter based on inclusion and exclusion, then sort by score
```

The TERM case

To score a single term, we look up its token positions in the inverted index and turn them into per-document scores:

```python
token_positions = inverted_index.get(term, [])
new_document_scores = term_scores(token_positions)
# e.g. {1: 10, 5: 3}
```

The EXACT case

Parsing an exact match is more complicated. The fundamental approach is:

Start with the first term of the phrase.

Find all the positions where it occurs.

For each occurrence, look at the following position: does it hold the next expected term of the phrase?

While the phrase has additional terms, repeat the previous step.

The actual code is more involved because it must also handle the unhappy paths, but the happy path looks like this:

```python
# seed the matches with the positions of the first phrase term
first_term = query_expressions[pointer].lstrip('"')
matches = set(inverted_index.get(first_term, []))

# consume phrase terms until the closing quotation mark
while not query_expressions[pointer].endswith('"'):
    pointer += 1
    term = query_expressions[pointer].rstrip('"')
    # keep only the occurrences that directly continue a previous match
    matches = next_token_positions(matches) & set(inverted_index.get(term, []))

new_document_scores = term_scores(matches)
```
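The snippet leans on a next_token_positions helper that shifts every match one token to the right, so it can be intersected with the next phrase term’s positions. A minimal version, assuming the (document_id, token_index) tuples used throughout:

```python
def next_token_positions(matches):
    # move each match one token forward in its document
    return {(document_id, token_index + 1) for document_id, token_index in matches}
```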

Also Read: How to Set Up Your First Shopify Store – Beginner’s Guide

Calculating term scores

As you can see, a term’s score in a document is simply the number of times the term appears there. As a function:

```python
def term_scores(token_positions) -> DocumentScores:
    # count how many times the term occurs in each document
    document_term_scores: DocumentScores = {}
    for document_id, token_position in token_positions:
        if document_id in document_term_scores:
            document_term_scores[document_id] += 1
        else:
            document_term_scores[document_id] = 1
    return document_term_scores
```
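For example, feeding it the positions of “do” from the toy index might return something like this (the counts are illustrative):

```python
>>> term_scores(inverted_index.get("do", []))
{0: 4, 1: 3, 2: 8}  # hypothetical: "do" appears 8 times in document 2
```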

Updating the global state according to the mode

Finally, we need to merge the new document term scores into the global state. The approach differs depending on the mode.

OR

For the OR mode, simple addition is all that is required:

```python
def merge_or(current: DocumentScores, new: DocumentScores) -> None:
    for document_id, score in new.items():
        if document_id in current:
            current[document_id] += score
        else:
            current[document_id] = score
```

AND

For the AND mode, only documents that appear in both score tables may remain:

```python
def merge_and(current: DocumentScores, new: DocumentScores) -> None:
    # find the keys that are not present in both tables
    filtered_out = set(current.keys()) ^ set(new.keys())
    for document_id in list(current.keys()):
        if document_id in filtered_out:
            del current[document_id]
    for document_id, score in new.items():
        if document_id not in filtered_out:
            current[document_id] += score
```

I’m using a bit of set magic here: the symmetric difference operator (^) collects the ids that appear in exactly one of the two tables.
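Concretely, the symmetric difference keeps only the ids that fail the AND:

```python
>>> {0, 2} ^ {2, 5}
{0, 5}  # documents 0 and 5 appear in only one table and are filtered out
```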

INCLUDE

```python
included_document_ids.update(document_term_scores.keys())
```

EXCLUDE

```python
excluded_document_ids.update(document_term_scores.keys())
```

Computing the results

In the end, we drop the documents that were explicitly excluded or that missed a required inclusion, and return the rest sorted by score:

```python
# when no '+' terms were given, every document counts as included
if not included_document_ids:
    included_document_ids = set(document_scores.keys())

return list(
    sorted(
        (
            (document_id, score)
            for document_id, score in document_scores.items()
            if document_id in included_document_ids
            and document_id not in excluded_document_ids
        ),
        key=lambda item: item[1],
        reverse=True,
    )
)
```

Running queries against our three speeches (document 0: ‘INCH BY INCH’, document 1: ‘THE PURSUIT OF HAPPINESS’, document 2: ‘JUST DO IT’) produces the following. Here are a few examples.

just do

[(2, 'JUST DO IT', 14), (0, 'INCH BY INCH', 4), (1, 'THE PURSUIT OF HAPPINESS', 3)]

just AND do

[(2, 'JUST DO IT', 14)]

"just do" AND it

[(2, 'JUST DO IT', 15)]

"just do" AND tomorrow

[(2, 'JUST DO IT', 7)]

just do -tomorrow

[(0, 'INCH BY INCH', 4), (1, 'THE PURSUIT OF HAPPINESS', 3)]

With only about 100 lines of code, we’ve mimicked the core behavior of Google search.

Benchmarks

When we run the same queries against a larger dataset, the complete works of Shakespeare, we get the following results.

just do

Top results: OTHELLO, HAMLET, TROILUS AND CRESSIDA

query_index took 1.389 milliseconds

just AND do

Top results: OTHELLO, HAMLET, TROILUS AND CRESSIDA

query_index took 1.336 milliseconds

"just do" AND it

query_index took 3.134 milliseconds

"just do" AND tomorrow

query_index took 0.737 milliseconds

just do -tomorrow

Top results: CORIOLANUS (130), A MIDSUMMER NIGHT'S DREAM (107), ALL'S WELL THAT ENDS WELL (95)

query_index took 0.901 milliseconds

do AND "you can" -tomorrow

Top results: ANTONY AND CLEOPATRA, HAMLET, TROILUS AND CRESSIDA

query_index took 12.303 milliseconds

For comparison, a naive iteration over the documents takes 101 ms for the ‘just do’ query, so our index is roughly 100 times faster.
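A minimal harness for producing timings like the ones above might look like this (query_index is the entry point sketched throughout this post):

```python
import time

def timed_query(index, query):
    # run the query while measuring wall-clock time
    start = time.perf_counter()
    results = query_index(index, query)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"query_index took {elapsed_ms:.3f} milliseconds")
    return results
```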

Conclusion

This piece was meant to help readers better understand how search engines work “under the hood.” In future posts, I hope to expand on the following themes:

How to improve token generation with NLP and other techniques, and how different indexing data structures compare.

How to make queries more efficient, and how to score documents better.

How to index documents incrementally and in real time.

The most typical scaling patterns for distributed search.
