Detecting Fake News

August 04, 2022

First, the term “fake news” must be defined. It is simply false or misleading information presented as news. It is produced and spread for many purposes. For example, fake news can be one of the most effective ways to manipulate people or impose certain ideas on them. For those who seek control, it may be the easiest way to dominate a society. With it, politics can be shaped, governments can be changed, democracies can be turned into autocracies, conflicts between the rich and the poor can be stoked, and a new rich class can be created.

Today, such news is mostly used to advance political agendas or to obtain political gain. It may contain false and/or exaggerated claims, may be pushed to go viral by algorithms and trolls (fake accounts), and may trap normal, real users in a filter bubble.

A model can be developed to counter all of this, minimize its effects, and reduce the spread of false information or news. To that end, the main purpose of this project was to classify news as true or false. Before starting, note that the project's GitHub file can be accessed at this link.

Two important components are used in this model. The first is TfidfVectorizer and the second is PassiveAggressiveClassifier. Let’s take a quick look at what they do.

1. TfidfVectorizer: 

It comes with sklearn (scikit-learn), a machine learning library built on Python's numerical and scientific libraries.

The term tf-idf stands for term frequency-inverse document frequency. Put simply, the tf-idf value of a word increases in proportion to the number of times the word appears in a document (or text), and is offset by the number of documents in the corpus that contain it. Let’s look at both terms separately:

Term Frequency (TF): the number of times a word occurs in a document. It increases with every repetition of that word. A high value therefore indicates that the term appears more often than others, so the document is a good match when that term is searched for.

Inverse Document Frequency (IDF): a word may appear many times in one document but also appear many times in many other documents, in which case it says little about any particular document. Words that appear rarely in the corpus, by contrast, are valuable and have a high IDF score. In short, IDF measures how significant a term is across the entire corpus.
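To make the two terms concrete, here is a minimal sketch that computes tf-idf by hand with only the standard library. The toy corpus and the helper names (`tf`, `idf`, `tf_idf`) are invented for illustration, and the textbook formula log(N / df) is assumed; scikit-learn's TfidfVectorizer uses a smoothed and normalized variant, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only
docs = [
    "the election results were announced today",
    "the election was rigged claims viral post",
    "the weather today is sunny",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each word
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tf(word, tokens):
    """Term frequency: share of the document's tokens equal to `word`."""
    return tokens.count(word) / len(tokens)

def idf(word):
    """Inverse document frequency, textbook form log(N / df)."""
    return math.log(n_docs / doc_freq[word])

def tf_idf(word, tokens):
    """Product of the two scores for one word in one document."""
    return tf(word, tokens) * idf(word)

# Score three words of the second document
for word in ("the", "election", "rigged"):
    print(word, round(tf_idf(word, tokenized[1]), 4))
```

In this toy corpus, "the" occurs in every document and scores zero, while "rigged", which occurs in only one document, scores highest of the three.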

2. PassiveAggressiveClassifier:

It is an online learning algorithm. If the classification outcome is correct, the algorithm remains passive and leaves the model unchanged. If there is a misclassification, it turns aggressive, updating and adjusting the model to correct it.
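The passive and aggressive behaviour can be sketched in a few lines. The code below assumes the basic textbook update rule for binary labels in {-1, +1} with hinge loss and no regularization parameter; the helper names `dot` and `pa_update` are hypothetical, and this is not scikit-learn's actual implementation.

```python
def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def pa_update(weights, x, y):
    """One Passive-Aggressive step for a sample x with label y in {-1, +1}."""
    loss = max(0.0, 1.0 - y * dot(weights, x))  # hinge loss on this sample
    if loss == 0.0:
        return weights  # passive: margin already satisfied, no change
    tau = loss / dot(x, x)  # aggressive: smallest step that restores the margin
    return [w + tau * y * xi for w, xi in zip(weights, x)]

w0 = [0.0, 0.0]
w1 = pa_update(w0, [1.0, 1.0], +1)   # margin violated, weights move
w2 = pa_update(w1, [1.0, 1.0], +1)   # margin now satisfied, weights stay
w3 = pa_update(w2, [2.0, 0.0], -1)   # new sample on the wrong side, weights move
print(w1, w2, w3)
```

Each aggressive step changes the weights just enough for the current sample to reach a margin of 1, which is why the second call on the same sample leaves the weights untouched.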

For this project, the dataset has a shape of 7796x4. The first column identifies the news item, the second and third columns are the title and the text, and the fourth holds a label denoting whether the news is REAL or FAKE. A Jupyter notebook is used, and the libraries are installed with pip.

pip install numpy pandas scikit-learn

Next, the required imports are made.

import numpy as np
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

The data file is read, and the shape of the data and its first rows are inspected.

#read the data
df = pd.read_csv('news.csv')

#get shape and head
df.shape
df.head()

Get the labels from the dataset.

#Get the labels
labels = df.label
labels.head()

Divide the dataset into training and test sets.

#Split the dataset
x_train, x_test, y_train, y_test = train_test_split(df['text'], labels, test_size=0.2, random_state=7)

Here, a TfidfVectorizer is initialized with stop words (the most common words in a language, which are filtered out before processing) and a maximum document frequency of 0.7, so terms that appear in more than 70% of the documents are discarded.

TfidfVectorizer converts a collection of raw documents into a matrix of TF-IDF features.

#Initialize a TfidfVectorizer ('english' is the only built-in stop word list)
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
#Fit and transform train set, transform test set
tfidf_train = tfidf_vectorizer.fit_transform(x_train)
tfidf_test = tfidf_vectorizer.transform(x_test)
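To see what these two parameters do, the same calls can be tried on a tiny corpus. The sketch below uses invented documents rather than the project's data; the word "news" is placed in every document on purpose so that it exceeds the max_df threshold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, invented for illustration only
train_docs = [
    "news the senate passed the budget",
    "news the president visited the senate",
    "news aliens endorse the budget",
    "news the markets rallied",
]
test_docs = ["news aliens visited the markets"]

vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
train_matrix = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF
test_matrix = vectorizer.transform(test_docs)        # reuses what was learned

vocab = vectorizer.vocabulary_
print(sorted(vocab))
print(train_matrix.shape, test_matrix.shape)
```

The stop word "the" and the ubiquitous "news" should both be missing from the printed vocabulary, and the test matrix has the same number of columns as the training matrix because it is only transformed, never fitted.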

Initialize a PassiveAggressiveClassifier, fit it on the TF-IDF features of the training set, and calculate the accuracy on the test set.

#initialize a PassiveAggressiveClassifier and fit it on the training set
pclassifier = PassiveAggressiveClassifier(max_iter=50)
pclassifier.fit(tfidf_train, y_train)
#predict on the test set and calculate accuracy
y_predict = pclassifier.predict(tfidf_test)
score = accuracy_score(y_test, y_predict)
print(f'Accuracy: {round(score*100,2)}%')

Print out a confusion matrix to gain insight into the number of true and false positives and negatives.

#build confusion matrix
confusion_matrix(y_test, y_predict, labels =['FAKE', 'REAL'])

The result:

array([[589,  49], [ 41, 588]], dtype=int64)

With this model, treating FAKE as the positive class, there are 589 true positives, 588 true negatives, 41 false positives, and 49 false negatives.
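These four counts are enough to derive the usual summary metrics. The following is a small sketch that assumes FAKE is the positive class and copies the counts from the confusion matrix above; the variable names are illustrative.

```python
# Counts taken from the confusion matrix, with FAKE as the positive class
tp, fn = 589, 49   # actual FAKE: predicted FAKE, predicted REAL
fp, tn = 41, 588   # actual REAL: predicted FAKE, predicted REAL

total = tp + fn + fp + tn
accuracy = (tp + tn) / total          # share of all predictions that are correct
precision = tp / (tp + fp)            # share of predicted FAKE that is really FAKE
recall = tp / (tp + fn)               # share of real FAKE that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Test samples: {total}")
print(f"Accuracy:  {accuracy:.2%}")
print(f"Precision: {precision:.2%}")
print(f"Recall:    {recall:.2%}")
print(f"F1 score:  {f1:.2%}")
```

The accuracy computed this way comes out at roughly 92.9%, and precision and recall show how the errors are divided between the two classes.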