Outreachy 2020

The Struggle: Part 2

I’m writing a Part 2 for this topic because I have made a lot of progress since the last blog post and I wanted to share it with all of you. When I last blogged, I was looking into how to create an Algolia search index from scratch. I tried searching for a lot of things on our website, and started realizing that a bunch of pages just don’t show up, even if I searched for their exact titles. On the other hand, pages which did appear in search results appeared for any related keyword. This prompted me to think that maybe the problem was not with the “search” portion, but rather that some of our pages were not getting crawled at all, especially because I could see patterns in the missing pages: for example, all pages listed under “components” were missing.

So I set out to configure Algolia locally on my machine. The code for their crawler, DocSearch, is open source, but since it’s a paid service, it’s meant to be used as a black box for the sake of the user’s ease. Even my mentors had so far used it as a black box. It took me a couple of days to get it up and running on my machine, since there wasn’t a lot of documentation available (I’ve realized the importance of documentation so much in the past week). Then, to get proof for my suspicions, I ran a crawl job and noted down all the pages which were being crawled. I was right! A big bunch of pages were missing, including the pages under “Components”.
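For anyone who wants to run the same kind of check, here is a minimal sketch of how the crawled pages could be compared against the pages the site is expected to have, grouped by top-level section so that patterns like a missing “components” section stand out. This is not the exact script I used: the function names and the example.org URLs are hypothetical, and it assumes you already have both lists of URLs (for instance from a sitemap and from the crawler's log).

```python
from collections import Counter
from urllib.parse import urlparse


def normalize(url: str) -> str:
    """Strip trailing slashes so /page and /page/ compare as equal."""
    return url.rstrip("/")


def section_of(url: str) -> str:
    """Return the first path segment of a URL, or "/" for the site root."""
    segments = [part for part in urlparse(url).path.split("/") if part]
    return segments[0] if segments else "/"


def missing_by_section(expected_urls, crawled_urls) -> Counter:
    """Count expected pages the crawler never visited, per top-level section."""
    crawled = {normalize(url) for url in crawled_urls}
    missing = [url for url in expected_urls if normalize(url) not in crawled]
    return Counter(section_of(url) for url in missing)


if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    expected = [
        "https://example.org/docs/intro/",
        "https://example.org/components/button/",
        "https://example.org/components/card",
    ]
    crawled = ["https://example.org/docs/intro"]

    for section, count in missing_by_section(expected, crawled).most_common():
        print(f"{section}: {count} page(s) not crawled")
```

Grouping by the first path segment is just one convenient choice; any grouping that matches how the site is organized would work.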

I began to read about how DocSearch is configured and what each section of the configuration file does. I made some local changes and tested them, and we went from crawling <50 pages to >800 pages. I noted some more observations and made some more changes. Bam! Crawling >1300 pages now. The number of records indexed by Algolia went from <600 to >200,000. I went ahead and configured the “search” locally as well, so that I could test whether I was getting relevant search results. It worked!
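To give a rough idea of what such a configuration file contains, here is a small sketch that builds one in Python and prints it as JSON. The key names (index_name, start_urls, sitemap_urls, stop_urls, selectors) follow the DocSearch scraper's configuration format as I understand it, but every value is a placeholder: the index name, URLs and CSS selectors are hypothetical and are not the actual configuration or the actual changes from this project.

```python
import json

# Hypothetical values: the index name, URLs and CSS selectors below are
# placeholders for illustration, not the configuration used in this project.
CONFIG = {
    "index_name": "example-docs",
    # Pages where the crawler starts following links.
    "start_urls": ["https://example.org/docs/"],
    # A sitemap lets the crawler discover pages that nothing links to.
    "sitemap_urls": ["https://example.org/sitemap.xml"],
    # URLs that should never be crawled (empty in this sketch).
    "stop_urls": [],
    # CSS selectors that map page elements to the record hierarchy.
    "selectors": {
        "lvl0": "main h1",
        "lvl1": "main h2",
        "lvl2": "main h3",
        "text": "main p, main li",
    },
}


def render_config(config: dict) -> str:
    """Serialize the configuration as indented JSON."""
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    print(render_config(CONFIG))
```

Which of these sections matters most depends on the site: the start and sitemap URLs decide which pages get visited at all, while the selectors decide what ends up in each record.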

So ultimately, the solution to a problem which was expected to be a long, tedious piece of code written from scratch turned out to be a 4-line addition to a piece of code which was not even part of our repository. I never knew whether my intuition would be correct, or whether the track I was on would actually lead anywhere. I was quite hesitant to go through so much effort of setting things up and experimenting with them, knowing that it might just be a dead end. What really helped was that even though we didn’t know what might or might not work, my mentors were very encouraging about me spending the entire week exploring this possibility. We are down to 8% of searches ending in “no results found”, from the earlier >20%, which is a great way to quantitatively evaluate our improvement.

While we have definitely made progress, it’s not perfect yet. The next step is to ensure that search results are displayed in a way which is more informative. For example, they show versions and high-level headings. Looks like there’s a lot more investigation coming up!
