One of our goals as a company is to educate hackers so they can find more critical bugs in the targets they hack on. To this end, we created Hacktivity over 4 years ago. On Hacktivity, hackers can read publicly disclosed reports that were submitted to public programs on HackerOne. In 2018, we also added the ability for hackers to publish any report they submitted to a company outside of HackerOne, making Hacktivity one of the biggest resources of vulnerability information for hackers to learn about specific types of vulnerabilities.
To make reports on Hacktivity easier to find, we introduced search functionality for Hacktivity. This post covers what we learned from the project.
Initial strategy: 3 search vectors (one per disclosure level)
To keep things simple and consistent, our initial strategy was to reuse pg_search, a gem that is already widely used on our platform for search functionality. The disclosure level of a vulnerability report determines which attributes a user is allowed to search on, so we opted to define a separate search vector for each disclosure level, each containing the searchable content for that level. We would then take a search query and run it against each search vector, restricted to the matching subgroup of reports.
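As a minimal illustration of this idea (not the production code), the sketch below uses plain Ruby with in-memory data standing in for pg_search and ActiveRecord. The disclosure level names, the attribute lists, and the `build_vectors` and `search_reports` helpers are all assumptions made for the example.

```ruby
# Hypothetical stand-in for a report record; the real model is an
# ActiveRecord class searched through pg_search.
Report = Struct.new(:id, :disclosure_level, :title, :summary, :details,
                    keyword_init: true)

# Assumed mapping of disclosure level to searchable attributes.
SEARCHABLE = {
  full:        %i[title summary details],
  limited:     %i[title summary],
  undisclosed: %i[title]
}.freeze

# Build all three vectors for one report. A "vector" here is just a list
# of downcased words, which is far simpler than a Postgres tsvector.
def build_vectors(report)
  SEARCHABLE.transform_values do |attrs|
    attrs.flat_map { |attr| report[attr].to_s.downcase.split }
  end
end

# For each disclosure level, narrow down to the reports at that level,
# search only the vector for that level, then combine the results.
def search_reports(reports, vectors, query)
  term = query.downcase
  SEARCHABLE.keys.flat_map do |level|
    reports
      .select { |report| report.disclosure_level == level }
      .select { |report| vectors.fetch(report.id).fetch(level).include?(term) }
  end
end
```

Every report carries three vectors in this sketch, but a search only ever consults the one that matches the report's current disclosure level.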
There were a couple of problems with this approach. First, the resulting scope, which chained a scope for each report subset with a search on the corresponding vector, wasn't very legible, and that was before it even covered undisclosed reports. As the conditions for the report subsets became more complex and the scope names grew longer, it became pretty hard to read. Second, the scope wasn't very performant: we ended up constructing a SQL query instead and performing that query on Report.
The most difficult problem with this approach, however, was not performance or legibility: it was the backfills. With more than 300,000 reports on HackerOne, we would need to backfill the search vectors for every single report. We tried to mitigate some of the issues by creating background jobs that would create those search vectors, but with three separate search vectors for each report, this would mean we’d have to schedule around 1,000,000 jobs! We ran test backfills on a small number of reports using Datadog to measure the performance of the backfill. Extrapolating the performance of the initial test set to our full set indicated that the backfill would likely take one week.
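As a back-of-the-envelope check on those numbers, the sketch below redoes the arithmetic. It treats 300,000 reports as a lower bound and takes the one-week estimate at face value, so the throughput figure is only what those rounded numbers imply, not a measured value. The helper names are hypothetical.

```ruby
# Number of backfill jobs when every report needs one job per search vector.
def backfill_jobs(reports, vectors_per_report)
  reports * vectors_per_report
end

# Average jobs per second needed to finish within the given duration.
def implied_throughput(jobs, duration_in_seconds)
  jobs.to_f / duration_in_seconds
end

SECONDS_PER_WEEK = 7 * 24 * 60 * 60 # 604_800

jobs = backfill_jobs(300_000, 3)                       # 900_000 jobs
rate = implied_throughput(jobs, SECONDS_PER_WEEK)      # roughly 1.5 jobs/second
puts format("%d jobs, about %.2f jobs per second", jobs, rate)
```

With more than 300,000 reports the real job count lands somewhat above 900,000, which is where the "around 1,000,000" figure comes from.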
However, because we would be creating so many new ReportSearch records, we chose to run the backfill in batches. That way, we could monitor the impact that the growth of the ReportSearch table was having on the performance of the backfill. So, we would have to run a batch, wait for the queue to be empty, evaluate the performance, make changes if needed, and repeat. We used both a Datadog counter for keeping track of the size of the queue and a histogram to track the duration of the jobs.
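The batch-and-evaluate loop can be sketched roughly as follows. This is an illustration under stated assumptions: `FakeStats` is a tiny in-memory stand-in and not the Datadog client API, and the metric names and the `backfill_in_batches` helper are hypothetical.

```ruby
# In-memory stand-in for a metrics client (NOT the real Datadog API).
class FakeStats
  attr_reader :counters, :histograms

  def initialize
    @counters = Hash.new(0)
    @histograms = Hash.new { |hash, key| hash[key] = [] }
  end

  # Adjust a running counter by delta (positive or negative).
  def count(name, delta)
    @counters[name] += delta
  end

  # Record one observation, e.g. the duration of a single job.
  def histogram(name, value)
    @histograms[name] << value
  end
end

# Run the backfill one batch at a time. The block does the per-report work
# (building and storing the search vector in the real system).
def backfill_in_batches(report_ids, batch_size:, stats:)
  report_ids.each_slice(batch_size) do |batch|
    stats.count("backfill.queue_size", batch.size) # batch scheduled

    batch.each do |id|
      started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      yield id
      elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

      stats.histogram("backfill.job_duration", elapsed)
      stats.count("backfill.queue_size", -1)       # job finished
    end
    # The queue is empty again here: the point at which to evaluate the
    # metrics and decide whether to adjust anything before the next batch.
  end
end
```

In this sketch the jobs run inline, whereas in a real setup they would run in background workers and the queue would drain asynchronously. The two metrics play the same roles either way: one tracks how much work is still queued, the other how long each job takes.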
In short, the initial approach led us down a rabbit hole of over-engineering a solution to a problem we didn’t yet have. It also had a negative impact on our velocity and we had to push our launch date back considerably as we were figuring out how to backfill the search vectors, working closely with our infrastructure team to make sure we didn’t jeopardize the overall availability and performance of the platform.
Our main reasons for reconsidering this initial approach were:
- We’d have to wait until the full backfill was done before launching the feature; otherwise the search results would be incomplete or incorrect (see the next point).
- While the backfill is running, events may happen that warrant an update to the search vectors. That update would be placed at the end of the backfill queue, leaving the search vectors out of date until the backfill has completely finished. We had to consider the security implications of this: if an update redacts sensitive information from a report, we want that change reflected in the search vectors as soon as possible so that we don’t match reports on information that is no longer there.
- Having three search vectors means we would need to run three background jobs to update the search vectors, consuming more resources than necessary, and possibly delaying other, unrelated background jobs.
- Having three search vectors causes the report table to become bigger than it needs to be, since we’ll only be using one search vector based on the disclosure level of the report. This would cause us to consume more resources than necessary.
New strategy: 1 search vector (contents determined by disclosure level of report)
So we needed to come up with another solution. We decided to go with a single search vector strategy, where the background job that performs the backfill determines which attributes should be included in the search vector. We implemented this strategy using pg_search as well, since switching to a new gem would have introduced too much scope creep.
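A minimal sketch of that decision, in plain Ruby rather than pg_search: the job looks at the report's disclosure level and only includes the attributes allowed at that level. The level names, the attribute lists, and the `search_vector_content` helper are assumptions for illustration, not the actual schema.

```ruby
# Assumed attribute lists per disclosure level (illustrative only).
SEARCHABLE_ATTRIBUTES = {
  full:        %i[title summary details],
  limited:     %i[title summary],
  undisclosed: %i[title]
}.freeze

# Return the text that would be indexed into the single search vector.
# `report` is a plain Hash here; an unknown disclosure level raises
# KeyError instead of silently indexing too much.
def search_vector_content(report)
  attributes = SEARCHABLE_ATTRIBUTES.fetch(report.fetch(:disclosure_level))
  attributes.map { |attribute| report[attribute].to_s }
            .reject(&:empty?)
            .join(" ")
end

report = {
  disclosure_level: :limited,
  title: "SSRF in importer",
  summary: "Internal metadata reachable",
  details: "Request to internal host"
}
puts search_vector_content(report)
```

Because there is only one vector, a change in disclosure level means recomputing this content and storing it again, which is the trade-off discussed next.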
The implication of this change of strategy is that we have to perform a new backfill whenever something changes on a report that requires the search vector to be updated (e.g. when a report is publicly disclosed). This causes some delay between the report being updated and those changes being reflected in Hacktivity Search, because the background jobs are placed in a separate, rate-limited queue to ensure that critical background processes keep running in a timely manner. In an ideal scenario, this delay is less than a minute, but depending on the size of other queues, it can be several minutes. In the initial strategy, the search vector would have already existed and the change would have been reflected in Hacktivity Search instantly.
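To illustrate why the delay depends on how much work is already queued, here is a toy rate-limited queue in plain Ruby. It only models the behavior. The `RateLimitedQueue` class and its `max_per_tick` parameter are hypothetical and say nothing about how the real queue or its rate limiting is implemented.

```ruby
# Toy model: at most `max_per_tick` search vector updates run per tick.
class RateLimitedQueue
  def initialize(max_per_tick:)
    @max_per_tick = max_per_tick
    @pending = []
  end

  # Called wherever a report change requires a search vector update.
  def enqueue(report_id)
    @pending << report_id
  end

  def size
    @pending.size
  end

  # Process one tick's worth of jobs, oldest first, and return their ids.
  def tick
    @pending.shift(@max_per_tick).each { |report_id| yield report_id }
  end
end

queue = RateLimitedQueue.new(max_per_tick: 2)
[101, 102, 103].each { |id| queue.enqueue(id) }

updated = []
queue.tick { |id| updated << id } # 101 and 102 are refreshed now
queue.tick { |id| updated << id } # 103 has to wait for the next tick
```

A report that changes while other updates are pending waits its turn, so its search vector stays stale a little longer than it would on an empty queue.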
We are happy with the strategy we ended up with. The scope is legible and easy to reason about, and the logic for determining the contents of the search vector (still using pg_search) is contained in the backfill job. We do have to be mindful of all the places where we need to trigger an update to the search vector, but that is a relatively small price to pay. The backfill was still a long process because we did it in batches, but as we gained more insight into its performance, we were able to confidently increase the batch size from 25,000 to 150,000 records. All in all, the backfill took a few days; we would start a batch in the morning so that we could monitor the process as the day progressed. We released the feature in late November 2018, and it has been received very positively. We are thrilled that hackers are now better able to surface reports on Hacktivity.
Lessons learned
- Chaining a scope and a search was not a good idea
- Backfilling the search vectors in a separate queue was a good choice; we could confidently kill the queue if CPU usage was becoming too high without hurting any other processes
- When your strategy requires multiple backfills for one record, see if you can do it another way
- Datadog was great for monitoring the backfill jobs and their performance
- We should have gone with a scrappier solution instead of over-engineering our initial implementation
Karen Sijbrandij is a Software Engineer at HackerOne. She is part of a squad focused on growth. She enjoys experimenting and iterating on squad processes, mob and pair programming, and constantly improving her squad’s skills and her own technical skills. Fun fact: she brings her cat Suzy to work at least once a week.