News Scraping


August 07, 2022

As always, this project will start with a definition. What is data scraping, or data mining, from a website? In simple terms, web scraping means extracting data from a website with a set of tools and code. Well, what about the copy-paste method? That can also be considered data scraping, but it is slower, more manual, and more time-consuming… Here the focus is on collecting large amounts of data from a website and consolidating it into a document, or wherever you want it.

So why do we need this? It means writing a bunch of code, letting it run, fixing mistakes and errors, waiting for results… Is it really worth it? To find the answer, we have to look a little at the wants and requirements of the modern world. For example, what is the most important thing to everyone today? No, the answer is not money, because money can be earned, and it can also be lost. The point is that there is something people cannot earn back. Bingo: time! Let me remind you of that wonderful saying again: time is money, money is time! Well, since money cannot buy time, it is rather generous to equate the two.

Anyway, back to our topic. Is it really worth it? Data scraping comes into play when you want to gather information scattered across the world called the internet into one place in the easiest way possible. As I just said, isn't it great to do this in a short amount of time? You write the code, hit run, and the information you want appears in front of you, filtered the way you want it and ready to use. In the blink of an eye, you will have done as much work as building the pyramids! At the same time, you will be able to compare all the data you have collected and steer your business decisions according to it. One last example: you can set your pricing strategy by looking at the sales prices of your product in other stores. Then, enjoy the best of both worlds!

There are several ways to do this: the first is to create a bot that acts like a real user and scrapes data from the website, and the other is to download the HTML content of the website and parse the desired parts. In the project described in this article, the second method, downloading and parsing the HTML content, will be used.

In this project, one of Python's best, and most beautiful, libraries for data scraping will be used. I said the most beautiful because it is literally named that way: BeautifulSoup! So what is this library, named after the Mock Turtle's song in Alice in Wonderland? It is a powerful and fast library built for parsing and manipulating XML or HTML files. In other words, BeautifulSoup parses the HTML of a web page and lets us pull out the parts that are useful to us.
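To see what that means in practice, here is a minimal sketch, with a made-up HTML snippet and class name just for illustration, showing BeautifulSoup pulling a piece of text out of an HTML string:

from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet used only for this example
html = '<html><body><h1 class="headline">Hello, soup!</h1></body></html>'

soup = BeautifulSoup(html, 'html.parser')
# find() returns the first matching tag; .text gives its inner text
print(soup.find('h1', class_='headline').text)   # -> Hello, soup!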

To begin with, I need to define a strategy. For this, I need to decide what to look for on the website the data will be pulled from. Since it is a news site, I will grab the headlines and the descriptions of the news items. In short, I need to decide how, and through what kind of pipeline, the data will be scraped.

First, import the libraries I will use:

import requests
from bs4 import BeautifulSoup
from csv import writer

Next, variables such as url and page have to be set. The url variable holds the link of the website we will get the data from, and page holds the response we receive when we request that url.

url = 'website url'
page = requests.get(url)

When the code runs, Response [200] is returned, which shows that the request succeeded. Then the soup object has to be created by passing the parameters to the BeautifulSoup class. The first argument is the page content, and the second argument is the parser for that type of content, which is HTML.

soup = BeautifulSoup(page.content, 'html.parser')
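If you want to check that Response [200] in code instead of reading it off the console, a small guard like the following sketch works; status_code and raise_for_status() come from the requests library:

# Stop early if the request did not succeed
page = requests.get(url)
page.raise_for_status()          # raises an exception on 4xx/5xx responses
print(page.status_code)          # prints 200 when everything is fine

soup = BeautifulSoup(page.content, 'html.parser')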

Now the whole HTML document lives inside the soup object. Next, the sections we care about have to be filtered out, or selected.

selects = soup.find_all('article', class_='teaser')

This says: find all article tags with the class name teaser; all of those articles are then stored in selects. Now, loop through selects with a for loop and find the title and the text.

for item in selects:
      title = item.find('span', class_='teaser-title').text

Since some news headlines have no text under them, I will wrap the text in a try … except block. If there is no description under the title, the result will be set to None. document_label describes what kind of material the news item contains, such as audio, video, etc… text.replace('\n', '') is used to get rid of the '\n' newline characters.

      try:
            text = item.find('p', class_='teaser-text').text
            document_label = item.find('span', class_='teaser-document_label').text.replace('\n', '')
      except Exception:
            text = None
            document_label = None

Then all the scraped data is collected in a list, which I call collection.

      collection = [title, text, document_label]

After the code runs successfully, I want to save every collection into a csv file.

      savetheinfo.writerow(collection)

This is where from csv import writer comes into play. (My recommendation is to place the following code before the for loop, but I will continue from here to preserve the flow of this post.)

with open('dailynews.csv', 'w', encoding='utf-8', newline='') as n:
      savetheinfo = writer(n)

The first row of the csv file, the header, is created with the following code.

      header = ['Title', 'New Text', 'Additional Label']
      savetheinfo.writerow(header)
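Since the snippets above appear in the order they were explained rather than the order they run in, here is one way the whole script could be assembled, following the recommendation to open the csv file and write the header before the for loop. The url is a placeholder and the class names are the ones used in the examples above:

import requests
from bs4 import BeautifulSoup
from csv import writer

url = 'website url'  # placeholder: put the news site url here
page = requests.get(url)

soup = BeautifulSoup(page.content, 'html.parser')
selects = soup.find_all('article', class_='teaser')

with open('dailynews.csv', 'w', encoding='utf-8', newline='') as n:
    savetheinfo = writer(n)
    savetheinfo.writerow(['Title', 'New Text', 'Additional Label'])  # header row

    for item in selects:
        title = item.find('span', class_='teaser-title').text
        try:
            text = item.find('p', class_='teaser-text').text
            document_label = item.find('span', class_='teaser-document_label').text.replace('\n', '')
        except Exception:
            # some items have no description or label
            text = None
            document_label = None

        collection = [title, text, document_label]
        savetheinfo.writerow(collection)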

Before finishing the post, I would like to discuss the line that starts with open and carries quite a few arguments.

As is obvious, the first quoted argument contains the name of the file. The second one is a little harder to read at first: it defines the mode. If we want to write something to a file, we use w(rite) mode; if we only want to read it, r(ead); and if we want to add something to an existing file, a(ppend). Usually the encoding argument is not required, but when I tried without it, it did not work for me.
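As a quick illustration of those modes (the file name here is made up), the same open() call behaves quite differently depending on that second argument:

# 'w' creates the file, or overwrites it if it already exists
with open('example.csv', 'w', encoding='utf-8', newline='') as f:
    writer(f).writerow(['first', 'row'])

# 'a' keeps the existing contents and adds new rows at the end
with open('example.csv', 'a', encoding='utf-8', newline='') as f:
    writer(f).writerow(['appended', 'row'])

# 'r' only reads; writing to a file opened this way raises an error
with open('example.csv', 'r', encoding='utf-8') as f:
    print(f.read())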

You can find the project code here, and the result here.