Figure 9.2

Naive Bayes

In [3]:
%pylab inline
Populating the interactive namespace from numpy and matplotlib

WARNING: pylab import has clobbered these variables: ['dist', 'norm']
`%matplotlib` prevents importing * from pylab and numpy

Here is the original code:

In [27]:
%load http://www.astroml.org/_downloads/fig_simple_naivebayes.py
In [44]:
"""
Simple Gaussian Naive Bayes Classification
------------------------------------------
Figure 9.2

A decision boundary computed for a simple data set using Gaussian naive Bayes
classification. The line shows the decision boundary, which corresponds to the
curve where a new point has equal posterior probability of being part of each
class. In such a simple case, it is possible to find a classification with
perfect completeness and contamination. This is rarely the case in the real
world.
"""
# Author: Jake VanderPlas
# License: BSD
#   The figure produced by this code is published in the textbook
#   "Statistics, Data Mining, and Machine Learning in Astronomy" (2013)
#   For more information, see http://astroML.github.com
#   To report a bug or issue, use the following forum:
#    https://groups.google.com/forum/#!forum/astroml-general
import numpy as np
from matplotlib import pyplot as plt
from matplotlib import colors

from sklearn.naive_bayes import GaussianNB

#----------------------------------------------------------------------
# This function adjusts matplotlib settings for a uniform feel in the textbook.
# Note that with usetex=True, fonts are rendered with LaTeX.  This may
# result in an error if LaTeX is not installed on your system.  In that case,
# you can set usetex to False.
from astroML.plotting import setup_text_plots
setup_text_plots(fontsize=8, usetex=True)

#------------------------------------------------------------
# Simulate some data
np.random.seed(0)
mu1 = [1, 1]
cov1 = 0.3 * np.eye(2)

mu2 = [5, 3]
cov2 = np.eye(2) * np.array([0.4, 0.1])

X = np.concatenate([np.random.multivariate_normal(mu1, cov1, 100),
                    np.random.multivariate_normal(mu2, cov2, 100)])
y = np.zeros(200)
y[100:] = 1

#------------------------------------------------------------
# Fit the Naive Bayes classifier
clf = GaussianNB()
clf.fit(X, y)

# predict the classification probabilities on a grid
xlim = (-1, 8)
ylim = (-1, 5)
xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 71),
                     np.linspace(ylim[0], ylim[1], 81))
Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])
Z = Z[:, 1].reshape(xx.shape)

#------------------------------------------------------------
# Plot the results
fig = plt.figure(figsize=(5, 3.75))
ax = fig.add_subplot(111)
ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.binary, zorder=2)

ax.contour(xx, yy, Z, [0.5], colors='k')

ax.set_xlim(xlim)
ax.set_ylim(ylim)

ax.set_xlabel('$x$')
ax.set_ylabel('$y$')

plt.show()

Modified figure:

Preamble: import the modules

In [45]:
import numpy as np
from matplotlib import pyplot as plt
from matplotlib import colors

from sklearn.naive_bayes import GaussianNB

from astroML.plotting import setup_text_plots
setup_text_plots(fontsize=16, usetex=True)

Make synthetic, noisy data

First, I'm going to experiment with some of these array manipulations to show that there is no magic here, just convenient ways of building the synthetic data.

  • np.eye(d) is the identity matrix of dimension $d$
In [46]:
print np.eye(4)
[[ 1.  0.  0.  0.]
 [ 0.  1.  0.  0.]
 [ 0.  0.  1.  0.]
 [ 0.  0.  0.  1.]]

  • The covariance matrices have no off-diagonal terms
  • Keep in mind that a $2 \times 2$ array times a length-2 array is not matrix multiplication: NumPy broadcasts the length-2 array across the rows, so each column is scaled by the corresponding entry. With the identity matrix this produces a diagonal matrix (see the quick check after this list): $$\left( \begin{array}{cc} 1 & 0 \\ 0 & 1 \\ \end{array} \right) * \left( \begin{array}{cc} a & b \\ \end{array} \right) = \left( \begin{array}{cc} a & 0 \\ 0 & b \\ \end{array} \right) $$
  • We have two multivariate normal distributions, in two dimensions (2D). $$\mathcal{N_0}(\mu_0, \Sigma_0)$$ $$\mathcal{N_1}(\mu_1, \Sigma_1)$$
  • We pick $\mu_0=(1,1)$ and $\mu_1=(5,3)$.
  • The covariances are: $$\boldsymbol{\Sigma_0} = \left( \begin{array}{cc} 0.3 & 0 \\ 0 & 0.3 \\ \end{array} \right)$$

$$\boldsymbol{\Sigma_1} = \left( \begin{array}{cc} 0.4 & 0 \\ 0 & 0.1 \\ \end{array} \right)$$
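
Here is a quick check of that broadcasting behavior (a small sketch I'm adding; it reuses the same 0.4 and 0.1 that go into cov2 below):

In []:
# Elementwise multiplication with broadcasting versus true matrix multiplication.
# The length-2 array scales the columns of the identity, giving diag(0.4, 0.1);
# np.dot instead returns the vector [0.4, 0.1].
print np.eye(2) * np.array([0.4, 0.1])
print np.dot(np.eye(2), np.array([0.4, 0.1]))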

In [47]:
#------------------------------------------------------------
# Simulate some data
np.random.seed(0)
mu1 = [1, 1]
cov1 = 0.3 * np.eye(2)

mu2 = [5, 3]
cov2 = np.eye(2) * np.array([0.4, 0.1])
  • We seek a synthetic data set composed of these two populations
In [48]:
Npts=100
mvn0 = np.random.multivariate_normal(mu1, cov1, Npts)
mvn1 = np.random.multivariate_normal(mu2, cov2, Npts)
  • An aside: np.concatenate just sticks the arrays end-to-end, so that they form one data set composed of two populations which were originally distinct but are now distinguishable only by their aggregate properties (and by the labels we assign next).
In [49]:
#np.concatenate?
print shape(mvn0)
print shape([mvn0])
print shape([mvn0, mvn1])
print shape(np.concatenate([mvn0, mvn1]))
X = np.concatenate([mvn0,mvn1], axis=0)
(100, 2)
(1, 100, 2)
(2, 100, 2)
(200, 2)

  • We assign the points drawn from $\mathcal{N_1}$ the class $C = 1$. We can only do this because we synthesized the data and therefore know which population each point came from.
In [50]:
y = np.zeros(200)
y[100:] = 1
In [51]:
print y
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.]

Here is the actual machine learning:

In [52]:
'''
class GaussianNB(BaseNB)
 |  Gaussian Naive Bayes (GaussianNB)
 |  
 |  Parameters
 |  ----------
 |  X : array-like, shape = [n_samples, n_features]
 |      Training vector, where n_samples is the number of samples and
 |      n_features is the number of features.
 |  
 |  y : array, shape = [n_samples]
 |      Target vector relative to X
'''
clf1 = GaussianNB()
clf1.fit(X, y)
print shape(X)
print shape(y)
(200, 2)
(200,)

With help(clf1), you can see all the Parameters (X, y); Attributes (class_prior_, theta_, sigma_); and Methods (fit, predict, predict_log_proba, predict_proba, get_params, set_params, score).
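
As a quick sanity check (a sketch; in the scikit-learn release used here, theta_ holds the per-class feature means and sigma_ the per-class feature variances, which recent releases call var_), we can compare the fitted parameters with the inputs used to simulate the data:

In []:
# Inspect the fitted per-class parameters; rows of theta_ should be close to
# mu1 and mu2, and sigma_ close to the diagonals of cov1 and cov2.
print clf1.class_prior_   # fraction of training points in each class
print clf1.theta_         # per-class feature means
print clf1.sigma_         # per-class feature variances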

The next section of code is all a clever way of building a grid of $(x, y)$ points and evaluating $P$, the posterior probability of class 1, at each one. Ultimately we simply use this grid to make a contour plot with a single contour demarcating $P=0.5$, the break-even point between the two classes, i.e., the naive Bayes decision boundary.

In [53]:
help(clf1.predict_proba)
Help on method predict_proba in module sklearn.naive_bayes:

predict_proba(self, X) method of sklearn.naive_bayes.GaussianNB instance
    Return probability estimates for the test vector X.
    
    Parameters
    ----------
    X : array-like, shape = [n_samples, n_features]
    
    Returns
    -------
    C : array-like, shape = [n_samples, n_classes]
        Returns the probability of the sample for each class in
        the model, where classes are ordered arithmetically.


In [54]:
# predict the classification probabilities on a grid
xlim = (-1, 8)
ylim = (-1, 5)

#help(np.meshgrid):
#    Return coordinate matrices from two or more coordinate vectors.

xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 71),
                     np.linspace(ylim[0], ylim[1], 81))

#help(np.c_):
#    Translates slice objects to concatenation along the second axis.
#help(ravel):
#    Return a flattened array.
Z = clf1.predict_proba(np.c_[xx.ravel(), yy.ravel()])
Zorig = Z
Z = Z[:, 1].reshape(xx.shape)
In [55]:
print shape(xx)
print shape(xx)[0]*shape(xx)[1]
print shape(xx.ravel())
print shape(np.c_[xx.ravel(),  yy.ravel()])
print shape(Zorig)
print shape(Z)
(81, 71)
5751
(5751,)
(5751, 2)
(5751, 2)
(81, 71)
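
As a cross-check (a sketch I'm adding): the $P = 0.5$ contour drawn below is the naive Bayes decision boundary. predict() returns whichever class has the larger posterior, so thresholding the class-1 probability at 0.5 should reproduce its output at every grid point (barring exact ties).

In []:
# Thresholding predict_proba at 0.5 should match predict() on the grid points.
pred_from_proba = (Zorig[:, 1] > 0.5).astype(int)
pred_direct = clf1.predict(np.c_[xx.ravel(), yy.ravel()]).astype(int)
print 'fraction of grid points where they agree:', np.mean(pred_from_proba == pred_direct)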

In [59]:
#------------------------------------------------------------
# Plot the results
fig = plt.figure(figsize=(10, 7.5))
ax = fig.add_subplot(111)
ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.binary, zorder=2)

ax.contour(xx, yy, Z, [0.5], colors='k')

ax.set_xlim(xlim)
ax.set_ylim(ylim)

ax.set_xlabel('$x$')
ax.set_ylabel('$y$')

plt.show()

Modified figure, modeled after Figure 3.2

Marginal and conditional probabilities

In [66]:
xbins = xx[0, :]
ybins = yy[:, 0]
In [70]:
from matplotlib.ticker import NullFormatter  # used below; also provided by %pylab

# plot the result
fig = plt.figure(figsize=(12, 6))

# define axes
ax_Pxy = plt.axes((0.2, 0.34, 0.27, 0.52))
ax_Px = plt.axes((0.2, 0.14, 0.27, 0.2))
ax_Py = plt.axes((0.1, 0.34, 0.1, 0.52))
ax_cb = plt.axes((0.48, 0.34, 0.01, 0.52))
ax_Px_y = [plt.axes((0.65, 0.62, 0.32, 0.23)),
           plt.axes((0.65, 0.38, 0.32, 0.23)),
           plt.axes((0.65, 0.14, 0.32, 0.23))]

# set axis label formatters
ax_Px_y[0].xaxis.set_major_formatter(NullFormatter())
ax_Px_y[1].xaxis.set_major_formatter(NullFormatter())

ax_Pxy.xaxis.set_major_formatter(NullFormatter())
ax_Pxy.yaxis.set_major_formatter(NullFormatter())

ax_Px.yaxis.set_major_formatter(NullFormatter())
ax_Py.xaxis.set_major_formatter(NullFormatter())

# draw the joint probability
plt.axes(ax_Pxy)
Z *= 1000   # rescale for the colorbar label below (note: re-running this cell scales Z again)
plt.imshow(Z, interpolation='nearest', origin='lower', aspect='auto',
           extent=[-1, 8, -1, 5], cmap=plt.cm.binary)  # extent matches the (xlim, ylim) grid

cb = plt.colorbar(cax=ax_cb)
cb.set_label('$p(x, y)$')
plt.text(0, 1.02, r'$\times 10^{-3}$',
         transform=ax_cb.transAxes) 
''' This transform argument is sort of sophisticated...
    -gully'''

# draw p(x) distribution
ax_Px.plot(xbins, Z.sum(0), '-k', drawstyle='steps')

# draw p(y) distribution
ax_Py.plot(Z.sum(1), ybins, '-k', drawstyle='steps')

''' Summing along an axis is as easy as Z.sum(axis).
    Z has shape (len(ybins), len(xbins)), so axis=0 sums over the
    y bins (leaving a function of x) and axis=1 sums over the
    x bins (leaving a function of y).
    -gully'''

# define axis limits
ax_Pxy.set_xlim(-1, 8)
ax_Pxy.set_ylim(-1, 5)
ax_Px.set_xlim(-1, 8)
ax_Py.set_ylim(-1, 5)

# label axes
ax_Pxy.set_xlabel('$x$')
ax_Pxy.set_ylabel('$y$')
ax_Px.set_xlabel('$x$')
ax_Px.set_ylabel('$p(x)$')
ax_Px.yaxis.set_label_position('right')
ax_Py.set_ylabel('$y$')
ax_Py.set_xlabel('$p(y)$')
ax_Py.xaxis.set_label_position('top')

# draw marginal probabilities
Ngrid = len(ybins)    # number of y bins (81 for this grid)
iy = [3 * Ngrid / 4, Ngrid / 2, Ngrid / 4]
''' which rows (fixed-y slices) of p(x,y) to take: roughly 3/4, 1/2,
    and 1/4 of the way up the y grid; the exact y value of each slice
    appears in the panel labels below.
    -gully '''

colors = 'rgc'
''' a string can be indexed like a list of characters
    (note this shadows the matplotlib colors module imported above,
    which is not used again).
    -gully'''

axis = ax_Pxy.axis()
for i in range(3):
    # overplot range on joint probability
    ax_Pxy.plot([-1, 9, -1, 9],
                [ybins[iy[i] + 1], ybins[iy[i] + 1],
                 ybins[iy[i]], ybins[iy[i]]], c=colors[i], lw=1)
    Px_y = Z[iy[i]] / Z[iy[i]].sum() 
    ''' Px_y is the same as P( x | y = something )
        -gully '''
    ax_Px_y[i].plot(xbins, Px_y, drawstyle='steps', c=colors[i])
    ax_Px_y[i].yaxis.set_major_formatter(NullFormatter())
    ''' a-ha, this is why the intermediate plots showed labels on
        the y-axis for the right-hand-side plot.
        -gully'''
    ax_Px_y[i].set_ylabel('$p(x | %.1f)$' % ybins[iy[i]])
ax_Pxy.axis(axis)
''' ax_Pxy.axis(axis) restores the x and y limits saved above,
    undoing the autoscaling triggered by the overplotted bands.
    -gully'''

ax_Px_y[2].set_xlabel('$x$')

ax_Pxy.set_title('Joint Probability')
ax_Px_y[0].set_title('Conditional Probability')

plt.show()
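
A one-line check of the axis convention described in the comment above (a sketch):

In []:
# Z has shape (len(ybins), len(xbins)) = (81, 71), so Z.sum(0) collapses the
# y direction (one value per x bin) and Z.sum(1) collapses the x direction.
print Z.shape, Z.sum(0).shape, Z.sum(1).shape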

Hmm... this can't be quite right, since the figure will change depending on which $x$- and $y$-limits I select. The conditional probability plots are fine, since each slice is normalized by its own sum, but the "marginal" curves are not normalizable: the quantity we are plotting is not a joint probability density $p(x, y)$ at all, it is the class probability $P(C=1 \mid x, y)$, which does not integrate to one over the plane. Summing it over the grid therefore gives something that keeps growing as the limits are widened, so it cannot be normalized. The end.
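
A quick numerical illustration of that point (a sketch reusing clf1; the grid sizes here are arbitrary):

In []:
# The "integral" of the plotted quantity keeps growing as the grid is widened,
# because P(C=1 | x, y) stays near 1 over an unbounded region of the plane.
for half_width in (5, 10, 20):
    gx, gy = np.meshgrid(np.linspace(-half_width, half_width, 201),
                         np.linspace(-half_width, half_width, 201))
    P = clf1.predict_proba(np.c_[gx.ravel(), gy.ravel()])[:, 1].reshape(gx.shape)
    dA = (gx[0, 1] - gx[0, 0]) * (gy[1, 0] - gy[0, 0])
    print 'half width %4.1f -> sum(P) * dA = %8.1f' % (half_width, P.sum() * dA)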
