Protecting Your APIs from Web Scrapers

Tim Hysniu
9 min read · May 30, 2020

Web scraping is one of the big annoyances that companies with content run into. Let’s say your business has a large database of documents of some kind: news articles, photos, properties for sale, or any other content that creates value for your business. Naturally, you want to protect that data and prevent someone from leeching it and, most likely, profiting from it. Yes, there are copyright laws that protect your data, but when bad actors are at play, they have usually already figured out a workaround for how they will use that data.

I’ve done plenty of research on this topic and have come across articles saying it’s nearly impossible to prevent bots from scraping your data. If you, as a user, are able to navigate through pages and see content, then a bot can basically do the same thing. Yes, there are various methods, like requiring certain HTTP headers and perhaps even some tokens and cookies. All of this, however, can easily be spoofed or simulated by a bot, which can essentially do whatever you as a user are doing.

Back in the day, most rendering happened on the server side, where we would spit out HTML to the browser. That was a little problematic for scrapers because they had to parse the HTML to get at the data. With single-page apps and client-side rendering (e.g. React) becoming more prevalent, the data is already available as JSON, and that is a problem if you want to protect it from malicious users. Whenever I touch a React app, it’s one of the first things that comes to mind: how do I make sure I don’t let bots download my entire database?

While it is pretty tough to completely eliminate scraping, there are things you can do to discourage it. You want to find a balance between good usability for legitimate users and preventing bad bots from hitting your APIs. You also want to consider the low-hanging fruit before getting too fancy. Remember, there are a zillion things you can do to prevent hackers from stealing your data, but some of the options are much cheaper than others. When I tried collecting data… (shhhh, don’t tell anyone, this was experimental and for personal use) here is a list of the things I found most annoying. And by annoying I mean I had to decide whether I wanted to keep going or give up, because the time spent retrieving the data was just not worth it anymore.

Limit Number of Requests from an IP

This is probably the first thing you should be looking at. If an IP is hitting your site too many times per minute, you want to set a limit. You can do this at the web server level; there are ways to do it in Nginx and Apache, for example. The limit should be slightly higher than what a typical user would generate, to avoid false positives. If there are routes a scraper needs but a typical user rarely hits (e.g. token refresh, user registration and login), then protecting those routes with stricter limits is a good idea.
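If you want to enforce this in the application layer as well, here is a minimal sketch of a fixed-window, per-IP limiter for an Express-style Node app. The window size, request ceiling, and route are placeholder values you would tune to your own traffic.

```typescript
import express, { Request, Response, NextFunction } from 'express';

// Assumed numbers: tune the window and ceiling to what a real user would generate.
const WINDOW_MS = 60_000;   // 1-minute window
const MAX_REQUESTS = 120;   // slightly above normal human browsing

const hits = new Map<string, { count: number; windowStart: number }>();

function rateLimitByIp(req: Request, res: Response, next: NextFunction) {
  const ip = req.ip ?? 'unknown';
  const now = Date.now();
  const entry = hits.get(ip);

  // Start a fresh window for new or expired entries.
  if (!entry || now - entry.windowStart > WINDOW_MS) {
    hits.set(ip, { count: 1, windowStart: now });
    return next();
  }

  entry.count += 1;
  if (entry.count > MAX_REQUESTS) {
    res.status(429).send('Too many requests');
    return;
  }
  next();
}

const app = express();
app.use(rateLimitByIp);
app.get('/api/listings', (_req, res) => res.json({ items: [] })); // placeholder route
app.listen(3000);
```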

Most scrapers will be able to change IPs via a VPN, so I wouldn’t bet on this method solving all your problems. It is definitely one of the more annoying ones, though. It shows the attacker that you know what you are doing and that they should go to your competitors first. Unless the data is super valuable, the bot is probably not fancy enough to rotate IPs automatically, so they would need to sit and babysit the process every time it gets blocked. It will frustrate them.

Make Use of Captcha on Anomalies

Let’s say there are too many requests coming from a certain IP, but you don’t want to prevent real users from accessing the data. You can still bring up a captcha that users have to pass before they can access more pages. This is pretty standard on a lot of websites. Google, for example, detects location changes from your IP and flags them as anomalies. Login attempts from multiple locations are typically a red flag, even though a small percentage might still be legitimate users (e.g. using a VPN to connect to different offices). But this is why you have the captcha, and when that fails you block access. While it takes a bit of work to set up initially, this is one of my favourite methods for blocking scrapers.
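As a sketch of the idea, the middleware below asks flagged IPs to present a captcha token before serving anything else. The flaggedIps set, the x-captcha-token header, and the verifyCaptchaToken helper are all placeholders; in practice the flag would come from your rate limiter, and the verification call would go to your captcha provider.

```typescript
import { Request, Response, NextFunction } from 'express';

// Populated elsewhere, e.g. by the rate limiter when an IP exceeds its quota.
const flaggedIps = new Set<string>();

// Placeholder: call your captcha provider's verification endpoint here and
// return whether the token checks out.
async function verifyCaptchaToken(token: string): Promise<boolean> {
  return token.length > 0;
}

export async function captchaGate(req: Request, res: Response, next: NextFunction) {
  const ip = req.ip ?? 'unknown';
  if (!flaggedIps.has(ip)) {
    next(); // normal traffic passes straight through
    return;
  }

  const token = req.header('x-captcha-token');
  if (token && (await verifyCaptchaToken(token))) {
    flaggedIps.delete(ip); // a human solved the captcha, clear the flag
    next();
    return;
  }

  // Ask the client to solve a captcha before it can see more pages.
  res.status(403).json({ captchaRequired: true });
}
```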

Create Honey Pots and Obfuscate Data

You can occasionally serve results that are disguised as real results but aren’t. If something requests these, you know it was a bot and not a real user. You will need to get a bit creative with how you bring these results in; the idea is to keep bots from detecting that the content is honey pot content. For example, it could be a particular phrase, or a list of marker codes that you use to filter those results out later, but that the bot doesn’t know about.

Alternatively, or in conjunction with honey pots, you can occasionally spit out garbage data when you detect a bot. This means that if they have a process running unattended, they might be under the impression that everything is going fine. A few days later they will find that their data is garbage and they can’t use it. Now they have to figure out when you’re introducing honey pots or obfuscating data. It will be frustrating, because they now need to validate their data; some of it might be valid, but some might contain garbage.
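Here is a rough sketch of both ideas, using a made-up Listing shape and arbitrary ratios: honeypot records are seeded into some result pages so that any client requesting them reveals itself, and clients already flagged as bots get quietly degraded data instead of a hard block.

```typescript
// A made-up record shape for illustration.
interface Listing {
  id: string;
  title: string;
  price: number;
}

// IDs in this set never belong to real records; a request for one exposes a bot.
const honeypotIds = new Set<string>();

function makeHoneypot(): Listing {
  const id = `hp-${Math.random().toString(36).slice(2, 10)}`;
  honeypotIds.add(id);
  return { id, title: 'Charming 2-bedroom near the park', price: 1234 };
}

// Slip a fake record into roughly 1 in 20 result pages, at a random position.
export function withHoneypots(results: Listing[]): Listing[] {
  if (Math.random() >= 0.05) return results;
  const pos = Math.floor(Math.random() * (results.length + 1));
  return [...results.slice(0, pos), makeHoneypot(), ...results.slice(pos)];
}

// For clients already flagged as bots, quietly degrade the data instead of blocking.
export function degradeForBots(results: Listing[], isSuspectedBot: boolean): Listing[] {
  if (!isSuspectedBot) return results;
  return results.map(r => ({ ...r, price: Math.round(r.price * (0.5 + Math.random())) }));
}
```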

Security by Obscurity Still Works

Developers have mixed feelings about introducing ambiguity into their product, and I completely understand that. However, I find that security by obscurity works brilliantly in some cases. If you throw a random error code instead of a “429 Too Many Requests” response, the bot will not know how to deal with it. You don’t need to share much information in your public API responses about why a request was denied. A similar example is the implementation of a “Forgot your password” feature, where you don’t tell the user whether the username exists in the database or not. If you say a password reset has been initiated regardless of whether the user exists, then the attacker cannot tell whether a username exists in the system.
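The forgot-password example might look something like the sketch below, where the response is identical whether or not the account exists. The findUserByEmail and sendResetEmail helpers are stand-ins for whatever user store and mailer you actually use.

```typescript
import express, { Request, Response } from 'express';

// Stand-ins for your real user store and mailer; the names are assumptions.
async function findUserByEmail(email: string): Promise<{ id: string } | null> {
  return email === 'known@example.com' ? { id: 'u1' } : null;
}
async function sendResetEmail(_userId: string): Promise<void> {
  // enqueue an email with a reset link
}

const app = express();
app.use(express.json());

app.post('/forgot-password', async (req: Request, res: Response) => {
  const user = await findUserByEmail(req.body.email ?? '');
  if (user) await sendResetEmail(user.id); // only real accounts get an email

  // Identical response either way, so callers can't probe which usernames exist.
  res.json({ message: 'If that account exists, a reset link is on its way.' });
});

app.listen(3000);
```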

Introduce Watermarks and Prevent Hotlinking

If images are part of the content being collected, then watermarking can really help. It creates a barrier for malicious users because they know they can’t massage the content and pass it off as their own. If you place several watermarks in different spots, or closer to the centre, it becomes practically impossible to crop them out. There are legal implications to using copyrighted material commercially, and this helps protect the images. The bad actor might not care about that, but whoever plans on using the data probably should.
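If your images go through a Node pipeline, a library like sharp can composite the watermark in several positions at upload time. A sketch, with placeholder file names:

```typescript
import sharp from 'sharp';

// Composite the watermark in more than one spot, including near the centre,
// so it can't simply be cropped away. File names here are placeholders.
async function watermark(input: string, output: string): Promise<void> {
  await sharp(input)
    .composite([
      { input: 'watermark.png', gravity: 'centre' },
      { input: 'watermark.png', gravity: 'northwest' },
      { input: 'watermark.png', gravity: 'southeast' },
    ])
    .toFile(output);
}

watermark('listing-photo.jpg', 'listing-photo-marked.jpg').catch(console.error);
```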

Blocking hotlinking is another thing you can do to prevent apps other than your own from embedding your images. This can be configured at the web server level and has been around for a while. I don’t see a useful case where you would want to allow hotlinking, so it’s good practice to block it.
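Nginx has the valid_referers directive for exactly this. If you would rather do it in the application, a simple Referer check works; a sketch, with a placeholder list of allowed hosts:

```typescript
import { Request, Response, NextFunction } from 'express';

// Hosts allowed to embed your images; placeholder values.
const ALLOWED_HOSTS = ['example.com', 'www.example.com'];

export function blockHotlinking(req: Request, res: Response, next: NextFunction) {
  const referer = req.header('referer');
  if (!referer) {
    next(); // direct requests with no Referer are allowed
    return;
  }

  try {
    if (ALLOWED_HOSTS.includes(new URL(referer).hostname)) {
      next();
      return;
    }
  } catch {
    // malformed Referer header falls through to the block below
  }
  res.status(403).send('Hotlinking not allowed');
}

// Applied only to static image routes, e.g. app.use('/images', blockHotlinking);
```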

Change Your DOM Regularly

This is only applicable if your content is server-rendered HTML, in which case attackers only have the HTML to work with. By changing the structure, classes and IDs in the HTML often, you will keep breaking the scrapers. I don’t think this is the most effective way to discourage attackers, since they can quickly update their script, but it can help. It is especially effective if you don’t leave too many unique attributes in your DOM that let bots query specific content. For example, if you have valuable dynamic content in <div> and <p> tags, but those tags look the same as other useless static content, then it’s not that easy to scrape only the valuable part. Again, I don’t think you should lose sleep over how your HTML looks, since there are always workarounds to get at the content. But if you avoid exposing attributes like IDs that make the valuable data super easy to query in the HTML, you make it a bit more difficult for bots.
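One low-effort way to do this is to derive class names from a salt that changes on every deploy, so selectors stop working each time you ship. A sketch, assuming the build id is available as an environment variable:

```typescript
import { createHash } from 'crypto';

// A salt that changes on every deploy; here it's assumed to be the build id.
const BUILD_SALT = process.env.BUILD_ID ?? 'dev';

// Map a stable internal name to a class name that changes on every release,
// so selectors like ".listing-price" stop working for scrapers after each deploy.
export function cssClass(name: string): string {
  return 'c' + createHash('sha256').update(name + BUILD_SALT).digest('hex').slice(0, 8);
}

// The server-side render and the generated stylesheet both use the same mapping.
const priceClass = cssClass('listing-price');
console.log(`<span class="${priceClass}">$450,000</span>`);
```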

Minify and Obfuscate Your JavaScript

Bundling your JavaScript code properly not only makes your front-end code smaller but also much harder to read and hack. This is clearly not hacker-proof, but remember that a person will have to trace the execution and intercept some routine if they want to figure out how your code works. By doing this and introducing some measure of content hiding, you’ve made it more difficult for users to read your content. Clearly you don’t want to use this approach to secure sensitive data, but you can use it to scramble public data. For example, in one of my apps there was some more valuable content that I didn’t want Google to index, nor did I want it scraped. In that case I was happy to use a scrambler, which is not much fun to figure out. By the time the attacker works out how to unscramble the content, they’ll be hitting a bunch of the other obstacles I’ve prepared for them.
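To be clear, a scrambler like this is an obstacle, not encryption; the key ships with the bundle. A sketch of the kind of thing I mean, with an assumed key:

```typescript
// Not encryption, just an obstacle: XOR with a key that ships in the bundle.
// The key value is an assumption; rotate it per build to keep things annoying.
const KEY = Buffer.from('rotate-this-per-build');

export function scramble(text: string): string {
  const data = Buffer.from(text, 'utf8');
  const out = Buffer.alloc(data.length);
  for (let i = 0; i < data.length; i++) out[i] = data[i] ^ KEY[i % KEY.length];
  return out.toString('base64');
}

export function unscramble(blob: string): string {
  const data = Buffer.from(blob, 'base64');
  const out = Buffer.alloc(data.length);
  for (let i = 0; i < data.length; i++) out[i] = data[i] ^ KEY[i % KEY.length];
  return out.toString('utf8');
}

// The API returns scramble(content); the minified front end calls unscramble()
// right before rendering, so the raw payload never contains readable text.
```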

Hide the XML Sitemaps?

Hiding XML sitemaps might be a bit controversial, since it goes against SEO practices. But do you really need your sitemap to sit in the root directory of your site and be named sitemap.xml or sitemap.xml.gz? Some even suggest it should be listed in robots.txt. That is good for making your content discoverable, but it also hands out links to every page on your website. I already submit my sitemaps in Google Webmaster Tools, though, and may have more than one. I just don’t see the point of making them available for everyone to see. If a sitemap is named something other than the default, I am at least sure it isn’t sitting there for scrapers to use. If you don’t have much content I don’t see this as a problem, but I have yet to find an app or website with a lot of data that makes its sitemap super easy to guess.

Make Content Available for Members Only

If you are tired of bots scraping your data and you see value in hiding it from the public, then by all means do it. Of course, you can also do this selectively if you find there is value in exposing some of it (e.g. for SEO). If your content is available to members only, then at the application level it becomes much easier to detect which users are abusing your APIs. You can impose rate limits on the user, not just the IP. So if a certain user has been making a lot of requests, you know something is suspicious about that user. If their IP keeps changing or they are logged in from multiple places, they are perhaps trying to game your rate limiting, and that’s a yellow flag. If members-only pages are an option for you, implementing them is probably a no-brainer. Depending on how strict the verification process after registration is, attackers may be less keen to abuse their accounts. For example, if you use email verification and ban a user, they need to register with a bunch of other email addresses. If you do phone verification too, it gets extra annoying, because you’d be blocking both their email and their phone number.
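A sketch of per-user tracking that catches both signals mentioned above, too many requests and too many distinct IPs, with made-up thresholds and an in-memory store standing in for something like Redis:

```typescript
// In-memory store for the sketch; something like Redis is the realistic choice.
const userActivity = new Map<string, { count: number; ips: Set<string>; since: number }>();

// Assumed thresholds: a 1-hour window, a ceiling for a very active human,
// and a cap on how many distinct IPs one account should reasonably use.
const WINDOW_MS = 60 * 60 * 1000;
const MAX_REQUESTS_PER_USER = 2000;
const MAX_IPS_PER_USER = 5;

export function recordUserRequest(userId: string, ip: string): 'ok' | 'suspicious' | 'blocked' {
  const now = Date.now();
  let entry = userActivity.get(userId);
  if (!entry || now - entry.since > WINDOW_MS) {
    entry = { count: 0, ips: new Set<string>(), since: now };
    userActivity.set(userId, entry);
  }

  entry.count += 1;
  entry.ips.add(ip);

  if (entry.count > MAX_REQUESTS_PER_USER) return 'blocked';
  if (entry.ips.size > MAX_IPS_PER_USER) return 'suspicious'; // e.g. force a captcha
  return 'ok';
}
```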

Use Trackbacks

Introduce URLs pointing back to your website, or other content you can trace later. This mostly helps identify who the scrapers are after the damage is done, but being able to identify the offender is always a good thing. You might want to be a bit clever about it and introduce these randomly, to avoid having the bots do a search-and-replace before they persist the collected data.
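A minimal sketch of the idea: append a unique, traceable link to a small random fraction of responses and log the token server-side. The record shape, domain, path and percentage are all placeholders.

```typescript
import { randomUUID } from 'crypto';

// A made-up record shape for illustration.
interface Article {
  id: string;
  body: string;
}

// Append a unique, traceable link to a small random fraction of responses.
// Log the token -> article mapping server-side so you can identify the source
// if the link resurfaces somewhere else.
export function withTrackback(article: Article): Article {
  if (Math.random() >= 0.02) return article; // ~2% of responses get a trackback
  const token = randomUUID();
  const link = `https://example.com/ref/${token}`;
  return { ...article, body: `${article.body}\n\nSource: ${link}` };
}
```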

Some Final Thoughts

Blocking scrapers completely is not possible. If a human can find and read your data, so can a bot. This is why I’m suggesting you use a captcha to verify that a real human is accessing your data. In many cases, bots can simulate a human so well that it’s not easy to tell it’s a bot. Even with rate limiting, a bot can add random delays to each request to seem human and pass the rate-limiting test. These jobs could run for days in the background; it’s all machine time, and the scraper can afford to let them run. If you tighten security too much, you start getting false positives and actual users start to feel it, so that’s usually not the best option either. The key is to find a balance and discourage attacks to the point where the data being collected is just not worth the effort.


Tim Hysniu

Software Engineer, Technology Evangelist, Entrepreneur