Natural Language Processing - Spam Detection

Using the techniques from Jose Portilla's Data Science course on Udemy, we are going to build a spam classifier for a corpus of text messages using a Naive Bayes model.

In [1]:
import nltk
nltk.download('stopwords') #Downloads the NLTK stopwords list
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib as mpl
import pandas as pd
import numpy as np
%matplotlib inline
mpl.rcParams['patch.force_edgecolor'] = True #Forces black outline on histogram bins
import string
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report,confusion_matrix

Loading and exploring the data

In [2]:
messages = pd.read_csv('smsspamcollection/SMSSpamCollection',sep='\t',names=['label','message']) #Forward slash keeps the path portable across operating systems
In [3]:
messages.head()
Out[3]:
label message
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
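
Before going any further, it's worth checking the class balance, since it affects how we read the per-class metrics later. A quick sketch using the DataFrame loaded above:

messages['label'].value_counts() #Count of ham vs. spam messages; the corpus skews heavily toward ham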

Let's add another feature and see if it could be useful to the model:

In [4]:
messages['characters'] = messages['message'].apply(len)
In [5]:
messages.head()
Out[5]:
label message characters
0 ham Go until jurong point, crazy.. Available only ... 111
1 ham Ok lar... Joking wif u oni... 29
2 spam Free entry in 2 a wkly comp to win FA Cup fina... 155
3 ham U dun say so early hor... U c already then say... 49
4 ham Nah I don't think he goes to usf, he lives aro... 61
In [6]:
messages.hist(column='characters', by='label', figsize=(14,5), bins=30)
plt.show()

The histograms show that spam messages tend to be longer than ham messages, so the character length of a text message could be a helpful feature in determining whether or not it is spam.
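
To put numbers on that, one could summarize the character counts per label, reusing the characters column created above:

messages.groupby('label')['characters'].describe() #Summary statistics of message length for ham vs. spam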

Cleaning up the text

In [7]:
def text_processing(mess):
    """Lower-case a message, strip punctuation, and remove English stopwords."""
    no_punc = ''.join(char.lower() for char in mess if char not in string.punctuation)
    stop_words = set(stopwords.words('english')) #Build the set once for fast membership checks
    return [word for word in no_punc.split() if word not in stop_words]

Testing the text_processing() function: notice that the text has been converted to all lower-case, and the punctuation and stopwords have been removed.

In [8]:
test = text_processing("Hello, this is: Colin. I am 26 years old - and am six feet tall!")
In [9]:
test
Out[9]:
['hello', 'colin', '26', 'years', 'old', 'six', 'feet', 'tall']

It is customary in natural language processing to 'stem' your text, though the lack of proper spelling and use of slang in these text messages may make this step unnecessary.
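
For reference, here is a minimal sketch of how stemming could be layered on top of text_processing() using NLTK's SnowballStemmer; it is not applied in the rest of this notebook:

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')

def text_processing_stemmed(mess):
    #Same cleanup as text_processing(), then reduce each word to its stem
    return [stemmer.stem(word) for word in text_processing(mess)]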

Creating the bag of words transformer using CountVectorizer and the text_processing() function

In [10]:
bag_of_words = CountVectorizer(analyzer=text_processing).fit(messages['message'])

The vocabulary_ attribute shows there are over 9500 unique words in the corpus

In [11]:
len(bag_of_words.vocabulary_)
Out[11]:
9532

Bag of words representation of the 10th message in the dataset:

In [12]:
mess_10 = messages['message'][9]
print(mess_10)
Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030
In [13]:
bow_10 = bag_of_words.transform([mess_10])
In [14]:
print(bow_10)
  (0, 57)	1
  (0, 313)	1
  (0, 1927)	1
  (0, 1963)	1
  (0, 2256)	1
  (0, 2290)	1
  (0, 3149)	1
  (0, 3571)	2
  (0, 4914)	1
  (0, 5531)	2
  (0, 5532)	1
  (0, 5576)	1
  (0, 6753)	1
  (0, 8628)	1
  (0, 8717)	2

Using the get_feature_names() method, we can find the words that were used twice in the 10th message:

In [15]:
feature_names = bag_of_words.get_feature_names() #Build the index-to-word list once, not on every loop iteration
bow_list = [3571,5531,8717] #Indices with a count of 2 in bow_10
for i in bow_list:
    print(feature_names[i])
free
mobile
update
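
Reusing feature_names from above, we can pair every nonzero column index with its count instead of hard-coding the indices; this assumes bow_10 is in the usual CSR format returned by transform():

for idx, count in zip(bow_10.indices, bow_10.data):
    print(feature_names[idx], count)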

Transforming the entire corpus with the fitted bag of words transformer:

In [16]:
messages_bow = bag_of_words.transform(messages['message'])
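
As a sanity check on the result, we can look at the shape of the sparse matrix and what fraction of its entries are nonzero:

print('Shape of sparse matrix:', messages_bow.shape)
print('Nonzero entries: {:.3f}%'.format(100.0 * messages_bow.nnz / (messages_bow.shape[0] * messages_bow.shape[1])))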

Creating a Term Frequency-Inverse Document Frequency (TF-IDF) transformer on the bag of words matrix

In [17]:
tfidf = TfidfTransformer().fit(messages_bow)
In [18]:
tfidf_10 = tfidf.transform(bow_10)
In [19]:
messages_tfidf = tfidf.transform(messages_bow)

TF-IDF for the 10th message:

In [20]:
print(tfidf_10)
  (0, 8717)	0.461371020883
  (0, 8628)	0.10124664752
  (0, 6753)	0.162475382781
  (0, 5576)	0.24655192062
  (0, 5532)	0.243788262112
  (0, 5531)	0.331280806432
  (0, 4914)	0.206753010424
  (0, 3571)	0.290691434469
  (0, 3149)	0.256484805579
  (0, 2290)	0.22562368844
  (0, 2256)	0.260551535679
  (0, 1963)	0.211624073381
  (0, 1927)	0.115310499433
  (0, 313)	0.252846991542
  (0, 57)	0.294416920656
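
Note that 'update' (index 8717) receives a noticeably higher weight than 'free' (index 3571) even though both appear twice: TF-IDF down-weights words that are common across the corpus. The learned inverse document frequencies can be looked up directly via the transformer's idf_ array and the vectorizer's vocabulary_ mapping:

print(tfidf.idf_[bag_of_words.vocabulary_['free']])   #Common word, lower IDF
print(tfidf.idf_[bag_of_words.vocabulary_['update']]) #Rarer word, higher IDF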

Creating the Naive Bayes Model

Using a Pipeline object reduces the amount of code to write and makes it much simpler to read:

In [21]:
X = messages['message']
y = messages['label']
In [22]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=.33,random_state=42)
In [23]:
pipeline = Pipeline([
        ('bag_of_words',CountVectorizer(analyzer=text_processing)),
        ('tfidf',TfidfTransformer()),
        ('classifier',MultinomialNB())
    ])
In [24]:
pipeline.fit(X_train,y_train)
pred = pipeline.predict(X_test)
In [25]:
print(classification_report(y_test,pred))
             precision    recall  f1-score   support

        ham       0.96      1.00      0.98      1593
       spam       1.00      0.73      0.85       246

avg / total       0.97      0.96      0.96      1839

In [26]:
print(confusion_matrix(y_test,pred))
[[1593    0]
 [  66  180]]

The confusion matrix confirms that no ham messages were flagged as spam, but 66 spam messages slipped through as ham. Testing the Naive Bayes model against 3 new texts:

In [27]:
X_test_2 = pd.Series(['You won a free trip to Disney World. Hurry to claim your prize.','Hey, what are you doing later?','What time is the party?'])
for i in X_test_2:
    print(i)
You won a free trip to Disney World. Hurry to claim your prize.
Hey, what are you doing later?
What time is the party?
In [28]:
pipeline.predict(X_test_2)
Out[28]:
array(['spam', 'ham', 'ham'], 
      dtype='<U4')

The Naive Bayes model correctly classified each text message.
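
Since MultinomialNB is a probabilistic classifier, the pipeline also exposes predict_proba; a short sketch to see how confident those predictions are (column order follows pipeline.classes_, which is alphabetical here, so column 1 is spam):

probs = pipeline.predict_proba(X_test_2)
for text, p in zip(X_test_2, probs):
    print('spam probability {:.3f}: {}'.format(p[1], text))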

Compare against a Random Forest model. Note the decrease in misclassified spam messages (from 66 down to 47):

In [29]:
pipeline = Pipeline([
        ('bag_of_words',CountVectorizer(analyzer=text_processing)),
        ('tfidf',TfidfTransformer()),
        ('classifier',RandomForestClassifier(n_estimators=500, random_state=42))
    ])
In [30]:
pipeline.fit(X_train,y_train)
pred = pipeline.predict(X_test)
In [31]:
print(classification_report(y_test,pred))
             precision    recall  f1-score   support

        ham       0.97      1.00      0.99      1593
       spam       1.00      0.81      0.89       246

avg / total       0.98      0.97      0.97      1839

In [32]:
print(confusion_matrix(y_test,pred))
[[1593    0]
 [  47  199]]

From this initial analysis, both the Naive Bayes and Random Forest models identify spam with perfect precision (no ham was misclassified as spam), though the Random Forest achieves higher spam recall (0.81 vs. 0.73).