Web scraping is one of the most popular uses of the Python programming language. The requests library makes it easy to get data from the web, and then Beautiful Soup helps pull only the useful data out of the markup. A few days back, when I had to scrape some tables, that is exactly what I did: fetch the data with requests, parse it with Beautiful Soup, and after a lot of for loops, if statements, and hardcoded list indices, I finally started getting my data. Yes, my code was working fine, and all the data I wanted was landing in my database bit by bit, but was it the best approach? Actually not. There was a much easier approach that needed only a few lines of code instead of hundreds.
One of the best data-processing libraries for Python, pandas, makes reading data from the web child's play: just a few lines of code and you have all the data.
Without wasting any more precious time, let's dive into how you can use pandas to speed up your work and save yourself from writing hundreds of lines of code for a task that can be done in ten.
To get started we need a couple of libraries installed on our system; actually, you just have to install pandas. I will also suggest using a Jupyter notebook, as it makes the data much easier to look at. If you don't have Jupyter installed, you can download it from the official website, and if you don't want to install the notebook at all, that's OK too: you can use pandas with anything you want, even IDLE will work.
So by now you should have pandas installed on your system; if you haven't, you can install it with pip by running the command
pip install pandas
Now, after installing the library, open your favourite editor or notebook and import the library like this:
import pandas as pd
Many people ask why it is imported as pd. The answer is really simple: it's the convention everyone follows, and it's better to write code that others can understand at a glance. You can also import the library without assigning it to pd, but I suggest you go with the convention.
Now, as an example, we will be scraping the List of countries by number of Internet users from Wikipedia. If you want to follow along with the article, you can access the page here (https://en.wikipedia.org/wiki/List_of_countries_by_number_of_Internet_users). To fetch all the tables from the webpage, we just have to write a single line.
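A minimal sketch of that single line (the variable name data is my choice, and the tiny inline table below stands in for the page so the snippet runs offline; on the real page you would pass the Wikipedia URL directly, as shown in the comment):

```python
import io
import pandas as pd

# pd.read_html returns a list of DataFrames, one per <table> it finds.
# For the article's example you would fetch the live page with:
#   data = pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_number_of_Internet_users")
html = """
<table>
  <tr><th>Country</th><th>Internet users</th></tr>
  <tr><td>Exampleland</td><td>42</td></tr>
</table>
"""
data = pd.read_html(io.StringIO(html))

print(len(data))  # number of tables found on the page
print(data[0])    # the first table, already a DataFrame
```

Note that read_html needs an HTML parser (lxml, or html5lib with Beautiful Soup) installed alongside pandas.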
Now we have all the tables we need. The page has multiple tables on it, so when you type data it will show up as a list, and you can look at the individual tables by index: for example, data[0] for the first table, and so on. In this tutorial, the list we are interested in is at index 5, so you can type data[5] to bring it up; in a Jupyter notebook the table will look something like this.
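To make the indexing concrete, here is a small sketch with two inline tables standing in for the Wikipedia page (the table contents are made up); on the real page you would reach data[5] the same way:

```python
import io
import pandas as pd

# Two <table> elements on one page -> read_html returns a list of two DataFrames.
html = """
<table><tr><th>x</th></tr><tr><td>1</td></tr></table>
<table><tr><th>Country</th><th>Users</th></tr>
       <tr><td>Exampleland</td><td>42</td></tr></table>
"""
data = pd.read_html(io.StringIO(html))

print(len(data))       # 2 tables found
print(data[0])         # first table, by index
print(data[1].head())  # .head() previews the first rows of a DataFrame
```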
Here we can see that the table is well arranged and much easier to read. Now we have all the data, but the goal of grabbing data is not just to look at it; you will want to store and manipulate it. If you want to manipulate the data further, like removing a column or anything else, pandas has all the functions built in. A single table in pandas is called a DataFrame, and you can perform almost any operation you would like on one. Manipulating data is a vast topic, so we will be skipping it for now; instead, we will focus on storing the data. Like fetching the data, storing it is easy with pandas. Here are some examples of how you can save the data from a pandas DataFrame in any format you want.
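As one tiny example of such manipulation, dropping a column is a single call. This is a sketch with made-up column names, not the actual Wikipedia table:

```python
import pandas as pd

# A small made-up DataFrame standing in for a scraped table.
df = pd.DataFrame({
    "Country": ["Exampleland", "Sampletopia"],
    "Users": [42, 7],
    "Notes": ["-", "-"],
})

# drop() returns a new DataFrame without the given column.
df = df.drop(columns=["Notes"])
print(df.columns.tolist())  # ['Country', 'Users']
```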
If you want to store the data in a CSV file, you can use the to_csv method.
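A minimal sketch of saving to CSV (the DataFrame contents and file name here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Country": ["Exampleland"], "Users": [42]})

# to_csv writes the DataFrame to a file; index=False skips the row index column.
df.to_csv("internet_users.csv", index=False)

with open("internet_users.csv") as f:
    print(f.read())
```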
Or, if you want to save the data in JSON format, you can do so with the to_json method. You can find the whole list of the different saving methods here (https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#serialization-io-conversion).
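A sketch of the JSON route, again with a made-up DataFrame; when no file path is given, to_json returns the JSON as a string:

```python
import pandas as pd

df = pd.DataFrame({"Country": ["Exampleland"], "Users": [42]})

# orient="records" produces one JSON object per row; pass a file path
# as the first argument to write to a file instead of returning a string.
print(df.to_json(orient="records"))  # [{"Country":"Exampleland","Users":42}]
```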
I hope you enjoyed this journey of web scraping with pandas. If you have any doubts about it, please let me know in the comments.