Sebastian Raschka, 2015

https://github.com/rasbt/python-machine-learning-book

Python Machine Learning - Code Examples

Chapter 8 - Applying Machine Learning To Sentiment Analysis

Note that the optional watermark extension is a small IPython notebook plugin that I developed to make the code reproducible. You can just skip the following line(s).

In [1]:

%load_ext watermark
%watermark -a 'Sebastian Raschka' -u -d -v -p numpy,pandas,matplotlib,scikit-learn,nltk

Sebastian Raschka 
Last updated: 09/10/2015 

CPython 3.4.3
IPython 4.0.0

numpy 1.9.2
pandas 0.16.2
matplotlib 1.4.3
scikit-learn 0.16.1
nltk 3.0.4

In [ ]:

# to install watermark just uncomment the following line:
#%install_ext https://raw.githubusercontent.com/rasbt/watermark/master/watermark.py

Overview

Obtaining the IMDb movie review dataset
Introducing the bag-of-words model
Training a logistic regression model for document classification
Working with bigger data – online algorithms and out-of-core learning
Summary

Obtaining the IMDb movie review dataset

The IMDb movie review dataset can be downloaded from http://ai.stanford.edu/~amaas/data/sentiment/. After downloading the archive, decompress it.

A) If you are working with Linux or Mac OS X, open a new terminal window, cd into the download directory, and execute

tar -zxf aclImdb_v1.tar.gz

B) If you are working with Windows, download an archiver such as 7-Zip to extract the files from the downloaded archive.

In [77]:

import pyprind
import pandas as pd
import os

labels = {'pos':1, 'neg':0}
pbar = pyprind.ProgBar(50000)  # 50,000 reviews in total
df = pd.DataFrame()
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path ='./aclImdb/%s/%s' % (s, l)
        for file in os.listdir(path):
            # the review files are UTF-8 encoded; be explicit so that the
            # platform's default encoding is not used by accident
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            df = df.append([[txt, labels[l]]], ignore_index=True)
            pbar.update()
df.columns = ['review', 'sentiment']

0%                          100%
[##############################] | ETA[sec]: 0.000 
Total time elapsed: 725.001 sec
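Appending to a DataFrame one row at a time copies the whole frame on every iteration, which probably accounts for a good part of the running time reported above. A minimal sketch of an alternative, assuming the same `aclImdb/<split>/<label>` directory layout: collect the rows in a plain Python list and build the DataFrame once at the end. The helper name `read_reviews` and its `base_dir` parameter are hypothetical and not part of the original notebook.

```python
import os

def read_reviews(base_dir='./aclImdb'):
    """Return a list of (review_text, sentiment) tuples (hypothetical helper)."""
    labels = {'pos': 1, 'neg': 0}
    rows = []
    for split in ('test', 'train'):
        for label in ('pos', 'neg'):
            path = os.path.join(base_dir, split, label)
            # sorted() makes the file order reproducible across platforms
            for fname in sorted(os.listdir(path)):
                with open(os.path.join(path, fname), 'r', encoding='utf-8') as infile:
                    rows.append((infile.read(), labels[label]))
    return rows
```

The result can then be turned into a DataFrame in a single step, for example with `pd.DataFrame(read_reviews(), columns=['review', 'sentiment'])`.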

Shuffling the DataFrame:

In [78]:

import numpy as np
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))

Optional: Saving the assembled data as a CSV file:

In [ ]:

df.to_csv('./movie_data.csv', index=False)

In [1]:

import pandas as pd
df = pd.read_csv('./movie_data.csv')
df.head(3)

Out[1]:

	review	sentiment
0	In 1974, the teenager Martha Moxley (Maggie Gr...	1
1	OK... so... I really like Kris Kristofferson a...	0
2	*SPOILER* Do not read this, if you think a...	0

Introducing the bag-of-words model

...

Transforming documents into feature vectors

In [2]:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining and the weather is sweet'])
bag = count.fit_transform(docs)

In [3]:

print(count.vocabulary_)

{'sweet': 4, 'is': 1, 'shining': 2, 'weather': 6, 'sun': 3, 'the': 5, 'and': 0}

In [4]:

print(bag.toarray())

[[0 1 1 1 0 1 0]
 [0 1 0 0 1 1 1]
 [1 2 1 1 1 2 1]]
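To make the vocabulary and the count matrix above less of a black box, here is a pure-Python sketch of the same idea. It assumes the `CountVectorizer` defaults (lowercasing, tokens of at least two word characters, column indices assigned in alphabetical order); the names `tokenize`, `vocabulary`, and `bag` are only illustrative.

```python
import re
from collections import Counter

docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining and the weather is sweet']

def tokenize(doc):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r'\b\w\w+\b', doc.lower())

# Vocabulary: every distinct token, indexed in alphabetical order.
vocabulary = {term: idx for idx, term in
              enumerate(sorted({t for d in docs for t in tokenize(d)}))}

# One row per document, one column per vocabulary term.
bag = []
for d in docs:
    counts = Counter(tokenize(d))  # missing terms count as 0
    bag.append([counts[term] for term in sorted(vocabulary)])
```

Each row of `bag` is the raw term-frequency vector of one document; for instance, "is" and "the" each occur twice in the third document.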

Assessing word relevancy via term frequency-inverse document frequency

In [5]:

np.set_printoptions(precision=2)

In [6]:

from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[ 0.    0.43  0.56  0.56  0.    0.43  0.  ]
 [ 0.    0.43  0.    0.    0.56  0.43  0.56]
 [ 0.4   0.48  0.31  0.31  0.31  0.48  0.31]]

In [7]:

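tf_is = 2   # "is" occurs twice in the third document
n_docs = 3  # number of documents in the corpus
# smoothed idf; the 3 in the denominator is the number of documents containing "is"
idf_is = np.log((n_docs+1) / (3+1))
tfidf_is = tf_is * (idf_is + 1)
print('tf-idf of term "is" = %.2f' % tfidf_is)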

tf-idf of term "is" = 2.00
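Written out, the manual calculation above corresponds to the smoothed definition that `TfidfTransformer` applies with `smooth_idf=True`. Here $n_d$ is the number of documents and $\text{df}(t)$ the number of documents containing term $t$ (the symbol names are chosen for this note):

```latex
$$
\text{idf}(t) = \ln\frac{1 + n_d}{1 + \text{df}(t)},
\qquad
\text{tf-idf}(t, d) = \text{tf}(t, d)\,\bigl(\text{idf}(t) + 1\bigr)
$$
```

For the term "is" in the third document, $\text{df} = 3$ and $n_d = 3$, so $\text{idf} = \ln(4/4) = 0$ and the tf-idf is $2 \times (0 + 1) = 2$.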

In [8]:

tfidf = TfidfTransformer(use_idf=True, norm=None, smooth_idf=True)
raw_tfidf = tfidf.fit_transform(count.fit_transform(docs)).toarray()[-1]
raw_tfidf 

Out[8]:

array([ 1.69,  2.  ,  1.29,  1.29,  1.29,  2.  ,  1.29])

In [9]:

l2_tfidf = raw_tfidf / np.sqrt(np.sum(raw_tfidf**2))
l2_tfidf

Out[9]:

array([ 0.4 ,  0.48,  0.31,  0.31,  0.31,  0.48,  0.31])
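The same steps can be applied to all three documents at once. The following NumPy sketch starts from the count matrix printed earlier and reproduces the normalized tf-idf values without scikit-learn; the variable names are illustrative.

```python
import numpy as np

# Term counts for the three example documents; columns follow the vocabulary
# order: and, is, shining, sun, sweet, the, weather.
counts = np.array([[0, 1, 1, 1, 0, 1, 0],
                   [0, 1, 0, 0, 1, 1, 1],
                   [1, 2, 1, 1, 1, 2, 1]], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)            # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq))    # smoothed idf
raw = counts * (idf + 1)                       # unnormalized tf-idf
l2 = raw / np.sqrt((raw ** 2).sum(axis=1, keepdims=True))  # row-wise L2 norm
```

Rounded to two decimals, the last row of `raw` and of `l2` should agree with the `raw_tfidf` and `l2_tfidf` arrays shown above.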

Cleaning text data

In [10]:

df.loc[0, 'review'][-50:]

Out[10]:

'is seven.<br /><br />Title (Brazil): Not Available'

In [2]:

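import re
def preprocessor(text):
    # remove HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # collect emoticons such as :) ;-( =D before punctuation is stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # lowercase, replace non-word characters with spaces, then re-append
    # the emoticons with the "nose" character removed
    text = re.sub(r'[\W]+', ' ', text.lower()) + \
           ' '.join(emoticons).replace('-', '')
    return text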

In [12]:

preprocessor(df.loc[0, 'review'][-50:])

Out[12]:

'is seven title brazil not available'

In [13]:

preprocessor("</a>This :) is :( a test :-)!")

Out[13]:

'this is a test :) :( :)'
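The emoticon pattern is easier to read when it is split into its three parts. The sketch below uses the same regular expression as `preprocessor`, written as a raw string; the name `EMOTICON` and the sample sentence are made up for illustration.

```python
import re

# Same pattern as in preprocessor, split into its three parts.
EMOTICON = re.compile(r'(?::|;|=)'        # eyes: ':' or ';' or '='
                      r'(?:-)?'           # optional nose: '-'
                      r'(?:\)|\(|D|P)')   # mouth: ')' or '(' or 'D' or 'P'

sample = 'Great movie :-) but ;( the end =D :P'
found = EMOTICON.findall(sample)
# Drop the nose so that ':-)' and ':)' end up as the same token.
normalized = [e.replace('-', '') for e in found]
```

Because all groups are non-capturing, `findall` returns the complete emoticons rather than their parts.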

In [14]:

df['review'] = df['review'].apply(preprocessor)

Processing documents into tokens

In [36]:

from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    return text.split()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [37]:

tokenizer('runners like running and thus they run')

Out[37]:

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [38]:

tokenizer_porter('runners like running and thus they run')

Out[38]:

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

In [39]:

import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sebastian/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Out[39]:

True

In [41]:

from nltk.corpus import stopwords
stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:] if w not in stop]

Out[41]:

['runner', 'like', 'run', 'run', 'lot']

Training a logistic regression model for document classification

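The reviews were already stripped of HTML and punctuation above, which speeds up the grid search later. Split the shuffled data into a training half and a test half: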

In [19]:

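# .loc slices by label and includes both endpoints, so stop at 24999
# to keep row 25000 out of the training set
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values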
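One detail worth keeping in mind for the split above: `.loc` slices by index label and includes both endpoints, whereas `.iloc` slices by position and excludes the stop value. A small sketch on a toy frame (a stand-in for the 50,000 reviews, assuming the default integer index) shows the difference:

```python
import pandas as pd

# Toy frame with the default integer index 0..9.
toy = pd.DataFrame({'review': ['r%d' % i for i in range(10)]})

# Label-based slicing with .loc includes BOTH endpoints.
inclusive = toy.loc[:5, 'review'].values     # labels 0..5 -> 6 rows
# Position-based slicing with .iloc excludes the stop position.
exclusive = toy.iloc[:5]['review'].values    # positions 0..4 -> 5 rows
# The remainder, starting at label 5.
rest = toy.loc[5:, 'review'].values          # labels 5..9 -> 5 rows
```

Here `inclusive` and `rest` share the row with label 5, while `exclusive` and `rest` do not overlap. By the same logic, `df.loc[:25000]` selects 25,001 rows.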

In [21]:

# scikit-learn 0.16; from version 0.18 on, GridSearchCV is in sklearn.model_selection
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(strip_accents=None, 
                        lowercase=False, 
                        preprocessor=None)

param_grid = [{'vect__ngram_range': [(1,1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1,1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid, 
                           scoring='accuracy',
                           cv=5, verbose=1,
                           n_jobs=-1)

In [25]:

gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits

[Parallel(n_jobs=-1)]: Done   1 jobs       | elapsed:   28.9s
[Parallel(n_jobs=-1)]: Done  50 jobs       | elapsed:  8.9min
[Parallel(n_jobs=-1)]: Done 200 jobs       | elapsed: 34.1min
[Parallel(n_jobs=-1)]: Done 226 out of 240 | elapsed: 38.9min remaining:  2.4min
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 40.7min finished

Out[25]:

GridSearchCV(cv=5,
       estimator=Pipeline(steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, charset=None,
        charset_error=None, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='...alse, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=0, tol=0.0001))]),
       fit_params={}, iid=True, loss_func=None, n_jobs=-1,
       param_grid=[{'clf__C': [1.0, 10.0, 100.0], 'vect__stop_words': [['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 't...okenizer': [<function tokenizer at 0x7f6c704948c8>, <function tokenizer_porter at 0x7f6c70494950>]}],
       pre_dispatch='2*n_jobs', refit=True, score_func=None,
       scoring='accuracy', verbose=1)
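The "48 candidates" and "240 fits" in the log above follow directly from the parameter grid: each of the two dictionaries is expanded into the Cartesian product of its value lists, and every candidate is fitted once per fold. A minimal sketch of that arithmetic, with placeholder strings standing in for the stop-word list and the tokenizer functions (the helper `n_candidates` is hypothetical):

```python
from itertools import product

def n_candidates(grid):
    """Number of parameter combinations in one grid (a dict of lists)."""
    return len(list(product(*grid.values())))

# Placeholder strings stand in for the stop-word list and tokenizer functions.
grid_tfidf = {'vect__ngram_range': [(1, 1)],
              'vect__stop_words': ['stop', None],
              'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
              'clf__penalty': ['l1', 'l2'],
              'clf__C': [1.0, 10.0, 100.0]}
# The second grid adds two single-valued settings, so its size is unchanged.
grid_counts = dict(grid_tfidf, vect__use_idf=[False], vect__norm=[None])

total = n_candidates(grid_tfidf) + n_candidates(grid_counts)
fits = total * 5  # cv=5 folds
```

Each grid contributes 1 x 2 x 2 x 2 x 3 = 24 combinations, which is why adding further values to any list multiplies the running time accordingly.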

In [26]:

print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)

Best parameter set: {'clf__C': 10.0, 'vect__stop_words': None, 'clf__penalty': 'l2', 'vect__tokenizer': <function tokenizer at 0x7f6c704948c8>, 'vect__ngram_range': (1, 1)} 
CV Accuracy: 0.897

In [27]:

clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

Test Accuracy: 0.899

Working with bigger data - online algorithms and out-of-core learning