using grep to increase index quality

by benjamin hollon

I wrote in a recent article about how something as simple as reversing the sort order of a list of URLs could drastically improve my search index’s quality, because I limit the crawler to 100 pages per domain.

Today I decided to prune those starting URLs further, with the help of grep.

the problem

The biggest thing hurting my search index was the presence of tag listings: pages that list all the articles matching a certain tag.

Because of my reverse sort order, these pages (whose paths start with t) were getting crawled before most blog posts and articles (whose paths usually start with b or a), which is not ideal. I don’t have any desire to serve these tag listings as search results, and they were keeping me from crawling more articles, which I do want, so something had to be done.
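A quick sketch of the effect, using made-up URLs on example.com and a limit of 2 pages standing in for the real limit of 100. It assumes the queue is sorted as plain text in reverse order; the variable names are only for illustration.

```shell
# Hypothetical queue for a single domain; these URLs are invented for illustration.
queue='https://example.com/articles/first-article
https://example.com/blog/first-post
https://example.com/tags/linux
https://example.com/tags/web'

# Reverse-sort the URLs, then keep only the first 2
# (a stand-in for the 100-page-per-domain limit).
# LC_ALL=C forces byte-order comparison so the result does not depend on locale.
crawled=$(printf '%s\n' "$queue" | LC_ALL=C sort -r | head -n 2)

printf '%s\n' "$crawled"
```

Both slots go to the tag listings, and neither the blog post nor the article is crawled at all.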

the solution

I have a function in my crawler that’s regularly run to remove duplicates in the queue of pages to be crawled. This felt like the ideal place to prune unwanted URLs.
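The crawler's actual function isn't shown here. As a rough sketch, assuming the queue is a stream of lines with one URL per line, a deduplication step could look something like this. The name dedupe_queue and the sample URLs are hypothetical.

```shell
# Hypothetical helper: read URLs on stdin and print each distinct URL once,
# in reverse byte order (matching the reverse sort order described earlier).
dedupe_queue() {
    LC_ALL=C sort -ru
}

# Example with an invented queue that contains one duplicate.
deduped=$(printf '%s\n' \
    'https://example.com/blog/first-post' \
    'https://example.com/tags/linux' \
    'https://example.com/blog/first-post' | dedupe_queue)

printf '%s\n' "$deduped"
```

Because a step like this works on a stream of lines, any further filter can be attached to its output with a pipe.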

So I passed the result of it to this grep command:

grep -vP 'http(s)?://.*/(tag(s)?|categor(ies|y))/*'

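It looks a little complex, so let’s break it down. The -v flag inverts the match, so grep prints only the lines that do not match, and -P enables Perl-compatible regular expressions. In the pattern itself, http(s)?:// matches the scheme with or without the s, .*/ matches the domain and any leading path up to a slash, (tag(s)?|categor(ies|y)) matches tag, tags, category, or categories, and the trailing /* matches zero or more slashes.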
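To check what the command keeps and drops, here is a small test run. The sample URLs are invented for illustration, and it assumes a grep built with PCRE support for -P (GNU grep usually is); only the grep command itself comes from above.

```shell
# Invented sample URLs, one per line.
urls='https://example.com/blog/first-post
https://example.com/tags/linux
https://example.com/tag/web/
https://example.com/category/news
https://example.com/categories/
http://example.org/articles/grep-tips
https://example.com/tagline-generator'

# -v prints only the lines that do NOT match the pattern;
# -P enables Perl-compatible regular expressions.
kept=$(printf '%s\n' "$urls" | grep -vP 'http(s)?://.*/(tag(s)?|categor(ies|y))/*')

printf '%s\n' "$kept"
```

Only the blog post and the article survive. The last URL is dropped as well: because the trailing /* can match zero slashes, any path segment that merely begins with tag or category is filtered out, which may or may not be what you want.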

This problem was a bit more complex, but still pretty simple in concept.


#100DaysToOffload 4/100

