Web scraping with Python – A beginner’s guide

  • March 5, 2019

Imagine you have to pull out a huge amount of data from a particular website. Is it possible to do so, without manually going to each webpage and getting the data? Well yes, it is definitely possible using a technique called “Web Scraping”.

Web Scraping is an automated technique for extracting large amounts of data from websites and saving it to a local file on your computer. It is becoming increasingly popular because the extracted data can serve many different purposes, such as:

  • Price Comparison: Web Scraping can be used to track product and service prices in different markets over time.
  • Social Media Scraping: Data from Social Media websites like Twitter can be collected by Web Scraping and used to find out what’s trending.
  • Research and Development: Web Scraping is used to collect large data sets like statistics, temperature, etc. from different websites, which are then used to carry out surveys.
  • Recruitment Web Scraping: Data from career-focused websites can be extracted by Web Scraping, in order to find the right people for certain job vacancies.
  • Extraction of Contact Information: Web Scraping can be used to scrape contact information such as emails, URLs, and phone numbers from websites.

Web Scraping has a lot of applications, but implementing it can be slightly intimidating, so in this article I will break the process down into detailed steps to help you understand it better.

But before we get into that, here are some important points to remember about Web Scraping:

  1. Always read through the website’s Terms and Conditions to understand how you can legally use its data, since many websites prohibit using their data for commercial purposes.
  2. Make sure you do not download data too rapidly, because a flood of requests can overload the website and may get you blocked.
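
One simple way to keep the download rate low is to pause between requests. The sketch below is illustrative only: the Throttle class and the one-second interval are assumptions made for this example, not part of any library, and the right delay depends on the website.

```python
import time


class Throttle:
    """Enforce a minimum pause between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval_seconds):
        self.min_interval = min_interval_seconds
        self._last_call = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to respect the minimum interval."""
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


# Assumed interval of one second; call throttle.wait() before each page load.
throttle = Throttle(min_interval_seconds=1.0)
```

The first call to wait() returns immediately; every later call sleeps only for whatever part of the interval has not yet elapsed.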

So, how does Web Scraping work?

To extract data using Web Scraping with Python, you need to follow the steps below:

  • Find the URL you want to scrape
  • Inspect the Page
  • Find the data you need to extract
  • Write the code
  • Run the code and extract the required data
  • Store the data in a specific format

Now, let us implement these steps in an example and see how to extract data from the Flipkart website using Python.

Here are some libraries used for Web Scraping:

  • Selenium: A web testing library used to automate browser activities.
  • BeautifulSoup: A Python package for parsing HTML and XML documents. It creates parse trees that make it easy to extract data.
  • Pandas: A library for data analysis and data manipulation. Here it is used to store the extracted data in the desired format.

Now, let’s get started with the demonstration.


Scraping the Flipkart Website

Pre-requisites: Python 2.x or Python 3.x with the Selenium, BeautifulSoup, and pandas libraries installed; Google Chrome browser; Ubuntu operating system

Step 1: Find the URL you want to scrape

We are going to scrape the Flipkart website to extract the Price, Name, and Rating of laptops. The URL of the listing page is given in the code in Step 4.

Step 2: Inspect the Page

The data on the website is nested in tags. So, we need to inspect the page to see under which tag the data we want to scrape is nested. To inspect, just right-click on the element and click on “Inspect”.

When you click on “Inspect”, a “Browser Inspector Box” will open on your screen.

Step 3: Find the data you need to extract

For this example, let us extract the Price, Name, and Rating, each of which is nested in a “div” tag.
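
To make “nested in a div tag” concrete, here is a small sketch that parses a made-up HTML snippet with BeautifulSoup. The markup and the class names (product, name, price, rating) are hypothetical and much simpler than Flipkart’s real page; the import assumes the BeautifulSoup 4 package (bs4) is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class names are made up for illustration
# and do not match Flipkart's real page.
html = """
<a class="product" href="/laptop-1">
  <div class="name">Example Laptop</div>
  <div class="price">45,990</div>
  <div class="rating">4.3</div>
</a>
"""

soup = BeautifulSoup(html, "html.parser")
link = soup.find("a", attrs={"class": "product"})

# Each value sits in its own <div>, identified by its class attribute.
name = link.find("div", attrs={"class": "name"}).text
price = link.find("div", attrs={"class": "price"}).text
rating = link.find("div", attrs={"class": "rating"}).text

print(name, price, rating)
```

On a real page, the Browser Inspector tells you which tag and class names to pass to find().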

Step 4: Write the code

First, create a Python file. To do this, open a terminal in Ubuntu and type gedit followed by a file name with the .py extension.

Let the file name be “web-s”. The command is then:

gedit web-s.py

Now, let’s write our code in this file.
Before that, you need to import all the necessary libraries:

from selenium import webdriver  # drives the Chrome browser
from bs4 import BeautifulSoup  # parses the page HTML (BeautifulSoup 4 package)
import pandas as pd  # stores the results as a table

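We have to set the path to chromedriver in order to configure webdriver to use the Chrome browser. With Selenium 4, the path is wrapped in a Service object (older Selenium 3 releases accepted the path directly as the first argument):

from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service("/usr/lib/chromium-browser/chromedriver"))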

Refer to the code below to open the URL:

products=[] #List to store name of the product
prices=[] #List to store price of the product
ratings=[] #List to store rating of the product
driver.get("https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniq")

Now that we have written the code to open the URL, let’s extract the data from the website. As mentioned earlier, the data we want to extract is nested in “div” tags. So, we have to find the div tags with the respective class names, extract the data, and store it in variables. Refer to the code below:

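content = driver.page_source  # HTML of the page as rendered by the browser
soup = BeautifulSoup(content)
# Each product is wrapped in a link; the class names come from Flipkart's markup at the time of writing
for a in soup.findAll('a', href=True, attrs={'class':'_31qSD5'}):
    name = a.find('div', attrs={'class':'_3wU53n'})
    price = a.find('div', attrs={'class':'_1vC4OE _2rQ-NK'})
    rating = a.find('div', attrs={'class':'hGSR34 _2beYZw'})
    # find() returns None when a tag is missing (for example, an unrated product), so skip those
    if name is None or price is None or rating is None:
        continue
    products.append(name.text)
    prices.append(price.text)
    ratings.append(rating.text)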
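
The values collected this way are plain strings. If you want to sort or compare prices later, it can help to convert them to numbers first. Here is a minimal sketch, assuming prices look like “₹45,990” (a currency symbol plus thousands separators, with no decimal part); the helper name parse_price is hypothetical.

```python
import re


def parse_price(text):
    """Return the digits in a price string as an int, or None if there are none."""
    digits = re.sub(r"[^0-9]", "", text)  # drop currency symbol, commas, spaces
    return int(digits) if digits else None
```

For example, parse_price("₹45,990") gives 45990. Because every non-digit character is dropped, this sketch is not suitable for prices with a decimal part.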

Step 5: Run the code to extract the data

Use the command below to run the code:

python web-s.py

Step 6: Store the data in the desired format

After extracting the data, you might want to store it in the desired format. For this example, we will store it in CSV (Comma Separated Values) format. To do this, add the following lines to your code:

df = pd.DataFrame({'Product Name':products,'Price':prices,'Rating':ratings})
df.to_csv('products.csv', index=False, encoding='utf-8')

Now, run the whole code again and you will get a file named “products.csv” which will contain your extracted data.
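
If pandas is not available, a similar file can be produced with Python’s built-in csv module. This is an alternative sketch, not the approach used in this article; the function name rows_to_csv and the sample row are invented for illustration.

```python
import csv
import io


def rows_to_csv(products, prices, ratings):
    """Return CSV text with one row per product, mirroring the DataFrame columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Product Name", "Price", "Rating"])
    writer.writerows(zip(products, prices, ratings))
    return buffer.getvalue()


# Invented sample data, for illustration only.
csv_text = rows_to_csv(["Example Laptop"], ["45,990"], ["4.3"])
print(csv_text)
```

Note that the csv module puts quotes around any value that contains a comma, such as the price here, so the columns stay intact when the file is read back.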

Python makes Web Scraping easy thanks to its readable syntax and its large collection of libraries.

I hope this article was informative and helped you guys get familiar with the concept of Web Scraping using Python. Now, you can go ahead and try Web Scraping by experimenting with different modules and applications of Python. If you don’t already know this language, why not learn Python this year?

The post Web scraping with Python – A beginner’s guide appeared first on JAXenter.
