### Natural Language Processing - Spam Detection¶

Using the techniques found in Jose Portilla's Data Science course on Udemy, we are going to attempt to create a spam classifier on a corpus of text messages using a Naive Bayes model.

In [1]:
import nltk
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib as mpl
import pandas as pd
import numpy as np
%matplotlib inline
mpl.rcParams['patch.force_edgecolor'] = True #Forces black outline on histogram bins
import string
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report,confusion_matrix



In [2]:
messages = pd.read_csv('smsspamcollection/SMSSpamCollection', sep='\t', names=['label','message'])

In [3]:
messages.head()

Out[3]:
label message
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...

Let's add another feature and see if it could be useful to the model:

In [4]:
messages['characters'] = messages['message'].apply(len)

In [5]:
messages.head()

Out[5]:
label message characters
0 ham Go until jurong point, crazy.. Available only ... 111
1 ham Ok lar... Joking wif u oni... 29
2 spam Free entry in 2 a wkly comp to win FA Cup fina... 155
3 ham U dun say so early hor... U c already then say... 49
4 ham Nah I don't think he goes to usf, he lives aro... 61
In [6]:
messages.hist(column='characters', by='label', figsize=(14,5), bins=30)
plt.show()


The histogram shows that the character length of text messages could be a helpful feature in determining whether or not a message is spam.
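To quantify that difference, the mean length per label can be pulled out with a `groupby`. A minimal sketch on a handful of made-up rows (invented messages standing in for the real corpus):

```python
import pandas as pd

# Toy stand-in for the SMS DataFrame (invented rows, not the real corpus)
messages = pd.DataFrame({
    'label': ['ham', 'ham', 'spam', 'ham', 'spam'],
    'message': ['Ok lar', 'See you soon',
                'Win a FREE prize now, call 08001234567',
                'On my way',
                'URGENT! You have won a guaranteed cash award'],
})
messages['characters'] = messages['message'].apply(len)

# Mean character count per label -- spam messages tend to run longer
print(messages.groupby('label')['characters'].mean())
```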

### Cleaning up the text¶

In [7]:
def text_processing(mess):
    # Lower-case the message and strip out punctuation characters
    no_punc = [char.lower() for char in mess if char not in string.punctuation]
    no_punc = ''.join(no_punc)
    no_punc = no_punc.split(' ')
    # Drop English stopwords and the empty tokens left by repeated spaces
    return [word for word in no_punc if word not in stopwords.words('english') and word != '']


Testing the text_processing() function: notice that the text has been converted to lower case and that the punctuation and stopwords have been removed.

In [8]:
test = text_processing("Hello, this is: Colin. I am 26 years old - and am six feet tall!")

In [9]:
test

Out[9]:
['hello', 'colin', '26', 'years', 'old', 'six', 'feet', 'tall']

It is customary in natural language processing to 'stem' your text, though the lack of proper spelling and use of slang in these text messages may make this step unnecessary.
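For reference, this is roughly what stemming would look like with NLTK's PorterStemmer (a sketch only; the words are illustrative and this step is not applied in the pipeline below):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Related inflections collapse toward a common stem
print([stemmer.stem(w) for w in ['winning', 'wins', 'winner']])
```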

### Creating the bag of words transformer using CountVectorizer and the text_processing() function¶

In [10]:
bag_of_words = CountVectorizer(analyzer=text_processing).fit(messages['message'])


The vocabulary_ attribute shows there are over 9,500 unique words in the corpus:

In [11]:
len(bag_of_words.vocabulary_)

Out[11]:
9532

Bag of words using 10th message in the dataset:

In [12]:
mess_10 = messages['message'][9]
print(mess_10)

Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030

In [13]:
bow_10 = bag_of_words.transform([mess_10])

In [14]:
print(bow_10)

  (0, 57)	1
  (0, 313)	1
  (0, 1927)	1
  (0, 1963)	1
  (0, 2256)	1
  (0, 2290)	1
  (0, 3149)	1
  (0, 3571)	2
  (0, 4914)	1
  (0, 5531)	2
  (0, 5532)	1
  (0, 5576)	1
  (0, 6753)	1
  (0, 8628)	1
  (0, 8717)	2


Using the get_feature_names() method (renamed get_feature_names_out() in newer scikit-learn releases), we can look up the three words that were used twice in the 10th message:

In [15]:
bow_list = [3571, 5531, 8717]
for i in bow_list:
    print(bag_of_words.get_feature_names()[i])

free
mobile
update


Fitting the bag of words transformer to the entire corpus:

In [16]:
messages_bow = bag_of_words.transform(messages['message'])
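The result is a large, mostly-zero sparse matrix. A self-contained sketch on a three-message toy corpus (using the default tokenizer rather than text_processing, for brevity) shows the shape and sparsity checks that apply equally to messages_bow:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['free prize call now', 'see you at lunch', 'free lunch today']
bow = CountVectorizer().fit(corpus)
X = bow.transform(corpus)

print(X.shape)  # (number of messages, vocabulary size)
print(X.nnz)    # count of non-zero entries
density = 100.0 * X.nnz / (X.shape[0] * X.shape[1])
print('non-zero: {:.1f}% of entries'.format(density))
```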


### Creating a Term Frequency - Inverse Document Frequency transformer on the bag of words matrix¶

In [17]:
tfidf = TfidfTransformer().fit(messages_bow)

In [18]:
tfidf_10 = tfidf.transform(bow_10)

In [19]:
messages_tfidf = tfidf.transform(messages_bow)


TF-IDF for the 10th message:

In [20]:
print(tfidf_10)

  (0, 8717)	0.461371020883
  (0, 8628)	0.10124664752
  (0, 6753)	0.162475382781
  (0, 5576)	0.24655192062
  (0, 5532)	0.243788262112
  (0, 5531)	0.331280806432
  (0, 4914)	0.206753010424
  (0, 3571)	0.290691434469
  (0, 3149)	0.256484805579
  (0, 2290)	0.22562368844
  (0, 2256)	0.260551535679
  (0, 1963)	0.211624073381
  (0, 1927)	0.115310499433
  (0, 313)	0.252846991542
  (0, 57)	0.294416920656
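The weights above come from the per-word IDF values the transformer learned during fit, exposed via its idf_ attribute. A minimal sketch (toy corpus, not the SMS data) showing that a word confined to fewer documents receives a larger IDF weight:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = ['free prize free', 'free lunch', 'lunch today']
cv = CountVectorizer().fit(corpus)
tfidf = TfidfTransformer().fit(cv.transform(corpus))

vocab = cv.vocabulary_  # word -> column index
# 'free' occurs in 2 of 3 documents, 'prize' in only 1,
# so 'prize' is weighted more heavily
print(tfidf.idf_[vocab['free']], tfidf.idf_[vocab['prize']])
```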


### Creating the Naive Bayes Model¶

Using a Pipeline object streamlines the amount of code to write and makes it much simpler to read:

In [21]:
X = messages['message']
y = messages['label']

In [22]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=.33,random_state=42)

In [23]:
pipeline = Pipeline([
    ('bag_of_words', CountVectorizer(analyzer=text_processing)),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB())
])

In [24]:
pipeline.fit(X_train,y_train)
pred = pipeline.predict(X_test)

In [25]:
print(classification_report(y_test,pred))

             precision    recall  f1-score   support

        ham       0.96      1.00      0.98      1593
       spam       1.00      0.73      0.85       246

avg / total       0.97      0.96      0.96      1839


In [26]:
print(confusion_matrix(y_test,pred))

[[1593    0]
 [  66  180]]


Testing Naive Bayes model against 3 new texts:

In [27]:
X_test_2 = pd.Series(['You won a free trip to Disney World. Hurry to claim your prize.',
                      'Hey, what are you doing later?',
                      'What time is the party?'])
for i in X_test_2:
    print(i)

You won a free trip to Disney World. Hurry to claim your prize.
Hey, what are you doing later?
What time is the party?

In [28]:
pipeline.predict(X_test_2)

Out[28]:
array(['spam', 'ham', 'ham'],
      dtype='<U4')

The Naive Bayes model correctly classified each text message.

#### Comparing against a Random Forest model. Note the decrease in misclassified spam messages, from 66 down to 47¶

In [29]:
pipeline = Pipeline([
    ('bag_of_words', CountVectorizer(analyzer=text_processing)),
    ('tfidf', TfidfTransformer()),
    ('classifier', RandomForestClassifier(n_estimators=500, random_state=42))
])

In [30]:
pipeline.fit(X_train,y_train)
pred = pipeline.predict(X_test)

In [31]:
print(classification_report(y_test,pred))

             precision    recall  f1-score   support

        ham       0.97      1.00      0.99      1593
       spam       1.00      0.81      0.89       246

avg / total       0.98      0.97      0.97      1839


In [32]:
print(confusion_matrix(y_test,pred))

[[1593    0]
 [  47  199]]


From this initial analysis, both the Naive Bayes and Random Forest models flag spam with perfect precision, though each misses some spam messages (recall of 0.73 and 0.81, respectively). On this test set the Random Forest catches noticeably more spam.
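A single train/test split can be noisy, so before preferring one model it would be worth cross-validating both. A hedged sketch of that comparison (the toy messages below are invented stand-ins for the SMS corpus; cross_val_score and the f1_macro scorer are standard scikit-learn):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Invented toy messages standing in for the real corpus
X = ['win a free prize now', 'call to claim your cash', 'see you at lunch',
     'on my way home', 'free entry win cash', 'meeting at noon'] * 10
y = ['spam', 'spam', 'ham', 'ham', 'spam', 'ham'] * 10

results = {}
for name, clf in [('naive_bayes', MultinomialNB()),
                  ('random_forest', RandomForestClassifier(n_estimators=100,
                                                           random_state=42))]:
    pipe = Pipeline([('bag_of_words', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('classifier', clf)])
    scores = cross_val_score(pipe, X, y, cv=5, scoring='f1_macro')
    results[name] = scores.mean()
    print(name, results[name])
```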