Web scraping is the collection of data from websites or web pages. Even if you've never done it consciously, you've scraped the web in a small way every time you've copied information from a site.
Web scraping, or screen scraping, is usually more purposeful than that, and professionals automate the process to collect data at scale. They do so by copying text from a website manually, using dedicated tools, or writing web scraping scripts. Automated scrapers can put a heavy load on a website by making many requests at once.
But while many businesses now leverage web scraping to drive competitive advantage, is it actually legal?
Which Websites Should and Shouldn't You Scrape?
The internet is a pool of information, giving people access to old and real-time data. Web scraping or screen scraping has been around for a while now. But how much should you use it, and which websites can you scrape?
Some websites are stringent with web crawlers or screen scrapers and block them out completely. So it's glaringly obvious that you shouldn't scrape such websites. But people still do so.
Unfortunately, there's hardly anything else such sites can do to stop it besides patching their loopholes.
Before you scrape a website, you should ideally check whether it allows crawling. Usually, you can find that out from the site's robots.txt file, which you can view by typing "[website URL]/robots.txt" into your browser's address bar.
A robots.txt file typically sets rules for various crawlers or user agents. However, these rules vary from one website to another. Some sites permit crawling on all pages, some specify the pages that a bot can crawl, and some block crawlers outright.
A website that blocks all user agents from crawling all pages typically sets the following rules:
user-agent: *
Disallow: /
A robots.txt file that blocks all bots from crawling certain directories or pages typically looks like this:
user-agent: *
Disallow: /URL to page 1
Disallow: /URL to page 2
If robots.txt doesn't disallow the page you want to crawl, then you can probably scrape it. Otherwise, you should back off or seek the admin's consent. They may grant you access.
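The check described above can also be automated. The following is a minimal sketch using Python's standard urllib.robotparser module. The rules, the "my-bot" user agent, and the example.com URLs are hypothetical placeholders, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, shaped like the second robots.txt example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/page.html
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A page that no rule mentions is allowed; a page under /private/ is not.
print(is_allowed(ROBOTS_TXT, "my-bot", "https://example.com/blog/post"))
print(is_allowed(ROBOTS_TXT, "my-bot", "https://example.com/private/data"))
```

For a live site, you could instead call the parser's set_url() method with the robots.txt address and then read() to download the rules before calling can_fetch(). Note that this only tells you what robots.txt says; it doesn't cover the site's terms of use.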
Additionally, some websites explicitly state whether they allow crawling in their terms of use. Some even state this at the top of their robots.txt file. Always check both to be sure you're doing the right thing.
How Web Scraping Is Being Abused
If you've received spam emails or SMS messages from websites or people you never gave your personal information to, then you've probably been scraped somewhere, somehow. Most likely, it happened via one of your social media accounts.
Web scraping sometimes goes beyond collecting the data a site renders on its front end. If used maliciously, it can result in the leakage of personal and classified information.
While most social media platforms frown on it, crawling bots still access people's profiles, and users' contact information ends up scraped and leaked.
Facebook, for instance, has been reported to have had vulnerabilities that exposed users' contact information, even when users had kept it private.
Similarly, in 2021, data associated with around 500 million LinkedIn accounts was reportedly put up for sale. LinkedIn said the data had been scraped and aggregated rather than stolen in a breach, but many email addresses and phone numbers were still shared without the consent of the profile owners.
Is It Illegal to Scrape a Website?
There has never been a definitive conclusion on the legality of web scraping. Instead, courts look at each case individually, focusing on how the crawler works and what the collected data is used for.
So there's no blanket answer. Scraping done maliciously can be illegal, while scraping done judiciously generally isn't.
But as expected, there seems to be a more stringent policy on the scraping and use of social media data since users' privacy is so important. However, it all still boils down to how people scrape the data.
The Internet & Social Media Law Blog analyzed the case of hiQ Labs, a data scraping company that won a court ruling against LinkedIn in 2019 after LinkedIn tried to block it from scraping publicly available data on LinkedIn users.
hiQ Labs argued that the Computer Fraud and Abuse Act (CFAA) only prohibits unauthorized access. The court agreed that the LinkedIn data in question was publicly available, so scraping it was unlikely to count as unauthorized access.
Besides, hiQ Labs only used the scraped data to provide analytics solutions to companies so they could make better recruitment decisions.
By contrast, Facebook has sued Chrome extension developers who scraped Facebook users' profiles without the users' consent.
Similarly, Facebook sued a copycat site for scraping several Instagram users' profile information and then using it to create clones. Facebook reportedly went on to obtain a permanent court injunction against the offender.
These are a few cases where people might have used web scraping illegally. The defendants collected users' data deceitfully, without the users' consent, and so violated the platforms' privacy policies.
So, while web scraping might frustrate the site it gets data from, no general rule currently stops people from collecting what they want, as long as they don't violate any laws outright.
Is Web Scraping Synonymous With Hacking?
There are a few myths surrounding web scraping. One of these is the belief that scraping a website means you've hacked it. Although hacking can eventually lead to scraping data, the claim that the term itself means hacking a website isn't true.
Web scraping can involve the use of dedicated crawling or scraping tools, Application Programming Interfaces (APIs), or web scraping scripts to get rendered data from a website. Unlike hacking, scraping done responsibly neither compromises the website it scrapes nor disrupts the experience of its users.
So while hacking involves unauthorized access, usually into a website's database, web scraping only targets data that's already visible on the front end. Although people can use web scraping maliciously, it's still not synonymous with hacking.
In addition to that, unlike web scraping, deliberate and unethical hacking is illegal.
What Are the Positives of Web Scraping?
Web scraping has many positives. Some tech companies now offer their data for free through APIs, but that information is usually not enough to assess business trends and make decisions.
So companies now get more data by scraping the web to improve practices and drive sales. Additionally, data scientists feed machine learning algorithms with data collected via screen scraping.
Such data can be pictures used in image recognition, plain texts for sentiment analysis, or direct product data for market intelligence and consumer behavior analysis.
Web scraping can also give you a competitive edge. If you have access to information your competitors don't, you can make better decisions than they can.
While some sites frown on web scrapers, others, including some e-commerce services, don't mind if you scrape their data. Web giants like eBay and Salesforce kicked off their APIs in 2000, offering programmers access to public data for the first time.
Should You Actually Scrape the Web?
We've established that web scraping isn't illegal when done the right way. But what you do with the data you scrape is also a concern. So rather than abuse this, use it to draw more insights that help you and others make informed decisions.
As a skill, web scraping gives you access to large amounts of internet data, which can help you or your company stay ahead in your niche. If you're a data scientist, it also broadens your scope and improves your coding and technical skills.
For instance, Python is one of the programming languages that lets you scrape a website easily, using the Beautiful Soup library or the Scrapy framework.
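To give a rough idea of what a scraping script does, here is a minimal sketch that pulls links out of a page. To stay dependency-free, it uses the html.parser module from Python's standard library rather than Beautiful Soup or Scrapy, which would typically need less code for the same job. The sample HTML and the LinkCollector and extract_links names are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this first,
# after checking that the site permits it.
SAMPLE_HTML = """
<html><body>
  <h1>Product list</h1>
  <a href="/items/1">First item</a>
  <a href="/items/2">Second item</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, link text) tuples found in the HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


print(extract_links(SAMPLE_HTML))
```

The same pattern (parse the page, pick out the elements you care about, store them in a structured form) applies whichever library you choose.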