Some practical pydata tips

and Python for Astrophysics

Michael Gully-Santiago, PhD
Postdoctoral Researcher
Kavli Institute for Astronomy & Astrophysics
at Peking University, Beijing, China

Practical pydata tips

  1. Conda/Anaconda is the easiest way to manage Python dependencies.
  2. Jupyter notebooks are great for interactive data analysis and reproducibility.
  3. The Jupyter notebook shift-tab feature is the best way to explore APIs through docstrings and experimentation.
  4. This presentation was made from a Jupyter Notebook, converted to reveal.js.
  5. Pandas pd.read_csv() is probably the most useful function for a data scientist.
  6. HackerRank is a great way to learn Python and discover "coding accents": the wide variety of ways people code the same thing.
  7. Bokeh is a cool tool for making interactive, D3-style plots from Python.
  8. TensorFlow has a Python API.

Conda/Anaconda is the easiest way to manage Python dependencies.

In [4]:
! wget https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
--2016-08-21 18:22:53--  https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
Resolving repo.continuum.io... 107.22.253.242, 107.22.243.130, 107.22.242.170, ...
Connecting to repo.continuum.io|107.22.253.242|:443... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
  Accept-Ranges: bytes
  Content-Type: application/octet-stream
  Date: Sun, 21 Aug 2016 10:22:54 GMT
  ETag: "579a7a38-17ac367"
  Last-Modified: Thu, 28 Jul 2016 21:33:44 GMT
  Server: nginx/1.8.1
  Content-Length: 24822631
  Connection: keep-alive
Length: 24822631 (24M) [application/octet-stream]
Saving to: 'Miniconda3-latest-MacOSX-x86_64.sh'

Miniconda3-latest-M 100%[===================>]  23.67M   566KB/s    in 44s     

2016-08-21 18:23:39 (549 KB/s) - 'Miniconda3-latest-MacOSX-x86_64.sh' saved [24822631/24822631]

In [6]:
! ls -1 Miniconda*
Miniconda2-latest-MacOSX-x86_64.sh
Miniconda3-latest-MacOSX-x86_64.sh
In [ ]:
! conda update conda
In [8]:
! conda install numpy scipy pandas matplotlib jupyter seaborn bokeh mkl
Using Anaconda Cloud api site https://api.anaconda.org
Fetching package metadata .........
Solving package specifications: ..........

Package plan for installation in environment //anaconda:

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    bokeh-0.12.1               |           py34_0         3.3 MB
    seaborn-0.7.1              |           py34_0         283 KB
    ------------------------------------------------------------
                                           Total:         3.5 MB

The following packages will be UPDATED:

    bokeh:   0.11.1-py34_0 --> 0.12.1-py34_0
    seaborn: 0.7.0-py34_0  --> 0.7.1-py34_0 

Proceed ([y]/n)? ^C
Operation aborted.  Exiting.

In [6]:
! conda list
# packages in environment at //anaconda:
#
Using Anaconda Cloud api site https://api.anaconda.org
_license                  1.1                      py34_0    <unknown>
appscript                 1.0.1                    py34_0    <unknown>
beautiful-soup            4.3.2                    py34_0    <unknown>
binstar                   0.11.0                   py34_0    <unknown>
bitarray                  0.8.1                    py34_0    <unknown>
conda-build               1.14.1                   py34_0    <unknown>
configobj                 5.0.6                    py34_0    <unknown>
docutils                  0.12                     py34_0    <unknown>
fastcache                 1.0.2                    py34_0    <unknown>
idna                      2.0                      py34_0    <unknown>
itsdangerous              0.24                     py34_0    <unknown>
jpeg                      8d                            1    <unknown>
jsonschema                2.4.0                    py34_0    <unknown>
launcher                  1.0.0                         3    <unknown>
libdynd                   0.6.5                         0    <unknown>
libpng                    1.6.17                        0    <unknown>
libsodium                 0.4.5                         0    <unknown>
libxml2                   2.9.2                         0    <unknown>
markupsafe                0.23                     py34_0    <unknown>
mock                      1.0.1                    py34_0    <unknown>
node-webkit               0.10.1                        0    <unknown>
nose                      1.3.7                    py34_0    <unknown>
pycosat                   0.6.1                    py34_0    <unknown>
pycparser                 2.14                     py34_0    <unknown>
pycrypto                  2.6.1                    py34_0    <unknown>
pyparsing                 2.0.3                    py34_0    <unknown>
python.app                1.2                      py34_4    <unknown>
pyyaml                    3.11                     py34_1    <unknown>
redis                     2.6.9                         0    <unknown>
redis-py                  2.10.3                   py34_0    <unknown>
rope                      0.9.4                    py34_1    <unknown>
runipy                    0.1.3                    py34_0    <unknown>
ujson                     1.33                     py34_0    <unknown>
xlwt                      1.0.0                    py34_0    <unknown>
yaml                      0.1.6                         0    <unknown>
abstract-rendering        0.5.1               np110py34_0  
acor                      1.1.1                     <pip>
alabaster                 0.7.7                    py34_0  
anaconda                  4.0.0               np110py34_0  
anaconda-client           1.4.0                    py34_0  
APLpy                     2.0.dev857                <pip>
appnope                   0.1.0                    py34_0  
argcomplete               1.0.0                    py34_1  
astroML                   0.3                       <pip>
astropy                   1.1.2               np110py34_0  
astroquery                0.3.0                     <pip>
babel                     2.2.0                    py34_0  
backports                 1.0                      py34_0  
backports_abc             0.4                      py34_0  
bcolz                     0.11.0                   py34_0  
beautifulsoup4            4.4.1                    py34_0  
blaze                     0.9.1                    py34_0  
blaze-core                0.8.3                    py34_0  
blz                       removed                       0  
bokeh                     0.11.1                   py34_0  
boto                      2.39.0                   py34_0  
bottleneck                1.0.0               np110py34_0  
certifi                   14.05.14                 py34_0  
cffi                      1.5.2                    py34_0  
chest                     0.2.3                    py34_0  
cloudpickle               0.1.1                    py34_0  
clyent                    1.2.1                    py34_0  
colorama                  0.3.7                    py34_0  
conda                     4.1.11                   py34_0  
conda-env                 2.5.2                    py34_0  
conda-manager             0.3.1                    py34_0  
corner                    2.0.1                     <pip>
cryptography              1.3                      py34_0  
curl                      7.45.0                        0  
cycler                    0.10.0                   py34_0  
cython                    0.23.4                   py34_1  
cytoolz                   0.7.5                    py34_0  
dask                      0.8.1                    py34_0  
datashape                 0.5.1                    py34_0  
decorator                 4.0.9                    py34_0  
dill                      0.2.4                    py34_0  
dynd-python               removed                       0  
emcee                     2.1.0                     <pip>
entrypoints               0.2                      py34_1  
et_xmlfile                1.0.1                    py34_0  
flask                     0.10.1                   py34_1  
flask-cors                2.1.2                    py34_0  
freetype                  2.5.5                         0  
gatspy                    0.4.dev0                  <pip>
get_terminal_size         1.0.0                    py34_0  
gevent                    1.1.0                    py34_0  
greenlet                  0.4.9                    py34_0  
h5py                      2.5.0               np110py34_4  
hdf5                      1.8.15.1                      2  
heapdict                  1.0.0                    py34_0  
html5lib                  0.9999999                 <pip>
icu                       54.1                          0  
image_registration        0.2.1                     <pip>
ipykernel                 4.3.1                    py34_0  
ipython                   4.1.2                    py34_1  
ipython-notebook          4.0.4                    py34_0  
ipython-qtconsole         4.0.1                    py34_0  
ipython_genutils          0.1.0                    py34_0  
ipywidgets                4.1.1                    py34_0  
isochrones                0.9.0                     <pip>
jbig                      2.1                           0  
jdcal                     1.2                      py34_0  
jedi                      0.9.0                    py34_0  
jinja2                    2.8                      py34_0  
jupyter                   1.0.0                    py34_3  
jupyter_client            4.2.2                    py34_0  
jupyter_console           4.1.1                    py34_0  
jupyter_core              4.1.0                    py34_0  
K2fov                     5.0.0                     <pip>
keyring                   5.7.1                     <pip>
libnetcdf                 4.3.3.1                       3  
libtiff                   4.0.6                         2  
libxslt                   1.1.28                        2  
llvmlite                  0.9.0                    py34_0  
locket                    0.2.0                    py34_0  
lockfile                  0.10.2                   py34_0  
lxml                      3.6.0                    py34_0  
matplotlib                1.5.1               np111py34_0  
mistune                   0.7.2                    py34_1  
mkl                       11.3.3                        0  
mkl-rt                    11.1                         p0  
mkl-service               1.1.2                    py34_2  
mpi4py                    2.0.0                     <pip>
mpmath                    0.19                     py34_0  
multipledispatch          0.4.8                    py34_0  
nbconvert                 4.1.0                    py34_0  
nbformat                  4.0.1                    py34_0  
networkx                  1.11                     py34_0  
nltk                      3.2                      py34_0  
notebook                  4.1.0                    py34_2  
numba                     0.24.0              np110py34_0  
numexpr                   2.6.1               np111py34_0  
numpy                     1.11.1                    <pip>
numpy                     1.11.1                   py34_0  
odo                       0.4.2                    py34_0  
openpyxl                  2.3.2                    py34_0  
openssl                   1.0.2h                        1  
pandas                    0.18.1              np111py34_0  
partd                     0.3.2                    py34_1  
path.py                   8.1.2                    py34_1  
patsy                     0.4.0               np110py34_0  
pcre                      8.39                          0  
pep8                      1.7.0                    py34_0  
pexpect                   4.0.1                    py34_0  
pickleshare               0.5                      py34_0  
pillow                    3.1.1                    py34_0  
pip                       8.1.2                     <pip>
pip                       8.1.2                    py34_0  
plotutils                 0.3.2                     <pip>
ply                       3.8                      py34_0  
protobuf                  3.0.0b2                   <pip>
psutil                    4.1.0                    py34_1  
ptyprocess                0.5                      py34_0  
py                        1.4.31                   py34_0  
pyasn1                    0.1.9                    py34_0  
pycurl                    7.19.5.3                 py34_0  
pyfits                    3.3                       <pip>
pyflakes                  1.1.0                    py34_0  
pygments                  2.1.1                    py34_0  
pymc                      2.3.5               np19py34_p0  [mkl]
pyopenssl                 0.15.1                   py34_2  
pyparsing                 2.1.5                     <pip>
pyqt                      4.11.4                   py34_1  
pytables                  3.2.2               np110py34_1  
pytest                    2.8.5                    py34_0  
python                    3.4.5                         0  
python-dateutil           2.5.1                    py34_0  
python-dateutil           2.5.3                     <pip>
pytz                      2016.2                   py34_0  
pytz                      2016.4                    <pip>
pyzmq                     15.2.0                   py34_0  
qt                        4.8.7                         1  
qtawesome                 0.3.2                    py34_0  
qtconsole                 4.2.0                    py34_0  
qtpy                      1.0                      py34_0  
gsl                       1.16                          2    r
libgcc                    4.8.5                         1    r
ncurses                   5.9                           8    r
r                         3.3.1                  r3.3.1_0    r
r-base                    3.3.1                         0    r
r-boot                    1.3_18                 r3.3.1_0    r
r-class                   7.3_14                 r3.3.1_0    r
r-cluster                 2.0.4                  r3.3.1_0    r
r-codetools               0.2_14                 r3.3.1_0    r
r-foreign                 0.8_66                 r3.3.1_0    r
r-kernsmooth              2.23_15                r3.3.1_0    r
r-lattice                 0.20_33                r3.3.1_0    r
r-mass                    7.3_45                 r3.3.1_0    r
r-matrix                  1.2_6                  r3.3.1_0    r
r-mgcv                    1.8_12                 r3.3.1_0    r
r-nlme                    3.1_128                r3.3.1_0    r
r-nnet                    7.3_12                 r3.3.1_0    r
r-recommended             3.3.1                  r3.3.1_0    r
r-rpart                   4.1_10                 r3.3.1_0    r
r-spatial                 7.3_11                 r3.3.1_0    r
r-survival                2.39_4                 r3.3.1_0    r
readline                  6.2                           2  
requests                  2.9.1                    py34_0  
ruamel_yaml               0.11.7                   py34_0  
scikit-image              0.12.3              np110py34_0  
scikit-learn              0.17.1              np111py34_2  
scipy                     0.18.0              np111py34_0  
seaborn                   0.7.0                    py34_0  
setuptools                23.0.0                   py34_0  
setuptools                23.1.0                    <pip>
simplegeneric             0.8.1                    py34_0  
singledispatch            3.4.0.3                  py34_0  
sip                       4.16.9                   py34_0  
six                       1.10.0                    <pip>
six                       1.10.0                   py34_0  
snowballstemmer           1.2.1                    py34_0  
sockjs-tornado            1.0.1                    py34_0  
sphinx                    1.3.5                    py34_0  
sphinx_rtd_theme          0.1.9                    py34_0  
spyder                    2.3.8                    py34_1  
spyder-app                2.3.8                    py34_0  
sqlalchemy                1.0.12                   py34_0  
sqlite                    3.13.0                        0  
statsmodels               0.6.1               np110py34_0  
supersmoother             0.3.2                     <pip>
sympy                     1.0                      py34_0  
tensorflow                0.9.0                     <pip>
terminado                 0.5                      py34_1  
tk                        8.5.18                        0  
toolz                     0.7.4                    py34_0  
tornado                   4.3                      py34_0  
traitlets                 4.2.1                    py34_0  
triangle-plot             0.3.0                     <pip>
unicodecsv                0.14.1                   py34_0  
wcsaxes                   0.9                       <pip>
werkzeug                  0.11.4                   py34_0  
wheel                     0.29.0                    <pip>
wheel                     0.29.0                   py34_0  
xlrd                      0.9.4                    py34_0  
xlsxwriter                0.8.4                    py34_0  
xlwings                   0.7.0                    py34_0  
xz                        5.2.2                         0  
zeromq                    4.1.3                         0  
zlib                      1.2.8                         3  
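
The same kind of inventory can be taken from inside Python. The sketch below is a minimal example, assuming Python 3.8 or later (importlib.metadata is not available in the Python 3.4 environment listed above). The helper name installed_versions is hypothetical.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Packages requested in the conda install cell above
wanted = ["numpy", "scipy", "pandas", "matplotlib", "jupyter", "seaborn", "bokeh"]
for name, version in installed_versions(wanted).items():
    print(f"{name:12s} {version or 'not installed'}")
```

Packages installed with either conda or pip should show up here, as long as they ship standard distribution metadata.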

Jupyter notebooks are great for interactive data analysis and reproducibility.

The function is $y = \frac{\sin^2(\pi x)}{(\pi x)^2}$, since NumPy's np.sinc is the normalized sinc function.

In [20]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
x = np.linspace(-2.0*np.pi, 2.0*np.pi, 10000)
y = np.sinc(x)**2
plt.plot(x, np.log10(y))
plt.ylim(-3, 0)
Out[20]:
(-3, 0)
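
One detail worth checking when plotting with np.sinc is its normalization: NumPy defines it as sin(pi x)/(pi x), not sin(x)/x. A quick sanity check, using a few arbitrary sample points chosen only for illustration:

```python
import numpy as np

x = np.array([0.25, 0.5, 1.5])  # arbitrary nonzero sample points

# np.sinc is the normalized sinc: sin(pi * x) / (pi * x)
normalized = np.sin(np.pi * x) / (np.pi * x)
assert np.allclose(np.sinc(x), normalized)

# The unnormalized sin(x) / x is recovered by rescaling the argument
unnormalized = np.sin(x) / x
assert np.allclose(np.sinc(x / np.pi), unnormalized)
```

So over the range plotted above, from -2 pi to 2 pi, np.sinc(x)**2 has its zeros at the nonzero integers rather than at multiples of pi.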

The Jupyter notebook shift-tab feature is the best way to explore APIs through docstrings and experimentation.

In [22]:
import pandas as pd
In [ ]:
pd.read_csv( #hit shift-tab and see what happens!
In [12]:
from IPython.display import Image
Image('../figs/shift_tab_example.png')
Out[12]:

This presentation was made from a Jupyter Notebook and converted to reveal.js slides.

In [25]:
! jupyter-nbconvert BJpy_01-01_First_trial.ipynb --to slides --post serve
[NbConvertApp] Converting notebook BJpy_01-01_First_trial.ipynb to slides
[NbConvertApp] Writing 561235 bytes to BJpy_01-01_First_trial.slides.html
[NbConvertApp] Redirecting reveal.js requests to https://cdnjs.cloudflare.com/ajax/libs/reveal.js/3.1.0
Serving your slides at http://127.0.0.1:8000/BJpy_01-01_First_trial.slides.html
Use Control-C to stop this server
WARNING:tornado.access:404 GET /custom.css (127.0.0.1) 0.51ms
WARNING:tornado.access:404 GET /custom.css (127.0.0.1) 0.55ms
^C
Interrupted

Pandas pd.read_csv() is probably the most useful thing for a data scientist.

In [27]:
Image(filename='../figs/shift_tab_example.png') 
Out[27]:
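
As a small illustration of why pd.read_csv() earns that praise, here is a minimal sketch. The table contents and column names are made up for the example, and io.StringIO stands in for what would normally be a file path or URL.

```python
import io

import pandas as pd

# Hypothetical catalog; in practice this would be a path or URL to a CSV file
csv_text = """name,ra_deg,dec_deg,vmag
star_a,279.2,38.8,0.03
star_b,101.3,-16.7,-1.46
star_c,38.0,89.3,1.98
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)  # column types are inferred from the text

# Boolean indexing: keep only rows brighter than magnitude 1
bright = df[df["vmag"] < 1.0]
print(bright["name"].tolist())
```

One call parses the header, infers the numeric columns, and returns a DataFrame that is ready for filtering and plotting.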

HackerRank is a great way to learn Python and discover "coding accents": the variety of ways people code the same problem in the same language (Python).

In [29]:
Image('../figs/hackerrank_python.png', width=300)
Out[29]:
In [30]:
Image('../figs/hackerrank_problem_setup.png')
Out[30]:
In [32]:
# Python 2!
n = int(raw_input())
uniq_vals = list(set(map(int, raw_input().split())))
uniq_vals.sort()
print uniq_vals[-2]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-32-057cc2d70294> in <module>()
      1 # Enter your code here. Read input from STDIN. Print output to STDOUT
----> 2 n = int(raw_input())
      3 uniq_vals = list(set(map(int, raw_input().split())))
      4 uniq_vals.sort()
      5 print(uniq_vals[-2])

NameError: name 'raw_input' is not defined
In [34]:
# Python 3
n = int(input())
uniq_vals = list(set(map(int, input().split())))
uniq_vals.sort()
print(uniq_vals[-2])
5
2 3 6 6 5
5
In [ ]:
# From user alexander_zhou
def findSM(l):
    f, s = l[0], l[0]
    for i in range(len(l)):
        if l[i] > f:
            s, f = f, l[i]
        elif l[i] < f:
            if f == s:
                s = l[i]
            elif l[i] > s:
                s = l[i]
    return s            

n = int(input())
l = input().split()
for i in range(n):
    l[i] = int(l[i])

print(findSM(l))
In [ ]:
# From user richmond
import random
import sys
i = 1
for line in sys.stdin:
    if i == 1:
        if 2>int(line) and int(line)>100:
            break
    else:
        r = line
        s = []
        sss = r.split(' ')
        for ss in sss:
            s.append(int(ss))
        __s = -100
        for i in s:
            if i > __s:
                __s = i
        _s = -100
        for i in s:
            if i != __s:
                if i > _s:
                    _s = i
                    
        print _s
        break
    i += 1
In [ ]:
# From user dragonfury
a = int(raw_input())

b = [int(x) for x in raw_input().split(' ')]

m=-101
for i in range(0,a):
    if m<b[i]:
        m=b[i]

for i in range(0,a):
    if b[i] == m:
      b[i]=-101  

m=-101
for i in range(0,a):
    if m<b[i]:
        m=b[i]

print(m)
In [ ]:
x=input()
a=raw_input()
a=a.split(' ')
a=map(int,a)
a=set(a)
a=list(a)
a.sort()
print a[len(a)-2]
In [ ]:
# From user Matiel
N = int(input())
liste = input().split()
i = 0
for nombre in liste:
    liste[i] = int(nombre)
    i += 1

maximum = max(liste)
preMax = min(liste)
for nombre in liste:
    if (nombre < maximum and nombre > preMax):
        preMax = nombre

print(preMax)
In [ ]:
# From user Kabashka
import subprocess
import os
import sys

sizeOfData = int(sys.stdin.readline())
a = map(int,sys.stdin.readline().replace('\n','').split(' '))

lastMax = a[0]
max = a[0]

for i in a:
  if i > max:
    lastMax=max
    max = i
  elif i==max:
    continue
  elif i > lastMax:
    lastMax = i
  elif lastMax==max:
    lastMax = i

sys.stdout.write(str(lastMax))
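
For comparison, here are two more "accents" for the same second-largest problem, written as Python 3 functions so they can be tested without reading from STDIN. This is only a sketch: the function names are hypothetical, and neither version is taken from HackerRank.

```python
def second_largest_sorted(values):
    """Sort the distinct values and take the second from the end."""
    distinct = sorted(set(values))
    if len(distinct) < 2:
        raise ValueError("need at least two distinct values")
    return distinct[-2]


def second_largest_single_pass(values):
    """Track the two largest distinct values in one pass, without sorting."""
    largest = runner_up = None
    for v in values:
        if largest is None or v > largest:
            # New maximum: the old maximum becomes the runner-up
            largest, runner_up = v, largest
        elif v != largest and (runner_up is None or v > runner_up):
            runner_up = v
    if runner_up is None:
        raise ValueError("need at least two distinct values")
    return runner_up


scores = [2, 3, 6, 6, 5]  # the sample input used in the Python 3 cell above
print(second_largest_sorted(scores), second_largest_single_pass(scores))
```

The sorted version is shorter. The single-pass version avoids the sort and does not depend on a sentinel such as -100 or -101, so it makes no assumption about the range of the inputs.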

Bokeh makes interactive, D3-style plots from Python.

In [38]:
import numpy as np

from bokeh.plotting import figure, show, output_file
from bokeh.io import output_notebook
output_notebook()

N = 4000
x = np.random.random(size=N) * 100
y = np.random.random(size=N) * 100
radii = np.random.random(size=N) * 1.5
colors = [
    "#%02x%02x%02x" % (int(r), int(g), 150) for r, g in zip(50+2*x, 30+2*y)
]

TOOLS="crosshair,pan,wheel_zoom,box_zoom,undo,redo,reset,tap,save,box_select,poly_select,lasso_select"

p = figure(tools=TOOLS)

p.scatter(x, y, radius=radii,
          fill_color=colors, fill_alpha=0.6,
          line_color=None)

output_file("color_scatter.html", title="color_scatter.py example")

show(p)  # open a browser
Loading BokehJS ...
Out[38]:

<Bokeh Notebook handle for In[38]>

TensorFlow has a Python API.

interviewer: OK, so are you familiar with "fizz buzz"?

me: ...

interviewer: Is that a yes or a no?

me: It's more of an "I can't believe you're asking me that."

interviewer: OK, so I need you to print the numbers from 1 to 100, except that if the number is divisible by 3 print "fizz", if it's divisible by 5 print "buzz", and if it's divisible by 15 print "fizzbuzz".

me: I'm familiar with it.

interviewer: Great, we find that candidates who can't get this right don't do well here.

me: ...

interviewer: Here's a marker and an eraser.

interviewer: Do you need help getting started?

me: No, no, I'm good. So let's start with some standard imports:

In [ ]:
import numpy as np
import tensorflow as tf

interviewer: Um, you understand the problem is fizzbuzz, right?

me: Do I ever. So, now let's talk models. I'm thinking a simple multi-layer-perceptron with one hidden layer.

interviewer: OK, that's probably enough.

me: That's enough setup, you're exactly right. Now we need to generate some training data. It would be cheating to use the numbers 1 to 100 in our training data, so let's train it on all the remaining numbers up to 1024:

In [ ]:
def fizz_buzz_encode(i):
    if   i % 15 == 0: return np.array([0, 0, 0, 1])
    elif i % 5  == 0: return np.array([0, 0, 1, 0])
    elif i % 3  == 0: return np.array([0, 1, 0, 0])
    else:             return np.array([1, 0, 0, 0])

def fizz_buzz(i, prediction):
    return [str(i), "fizz", "buzz", "fizzbuzz"][prediction]

def model(X, w_h, w_o):
    h = tf.nn.relu(tf.matmul(X, w_h))
    return tf.matmul(h, w_o)
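
The cells here do not show how the integers are presented to the network or how the training set is assembled. One plausible choice, assumed for this sketch rather than taken from the notebook, is to feed in the binary digits of each number. The name binary_encode and the constant NUM_DIGITS are hypothetical; fizz_buzz_encode is repeated from the cell above so the block runs on its own.

```python
import numpy as np

NUM_DIGITS = 10  # assumed: 10 bits cover the integers below 2**10 = 1024


def binary_encode(i, num_digits):
    """Hypothetical helper: the binary digits of i, least significant first."""
    return np.array([i >> d & 1 for d in range(num_digits)])


def fizz_buzz_encode(i):
    """One-hot label, as in the cell above: [number, fizz, buzz, fizzbuzz]."""
    if   i % 15 == 0: return np.array([0, 0, 0, 1])
    elif i % 5  == 0: return np.array([0, 0, 1, 0])
    elif i % 3  == 0: return np.array([0, 1, 0, 0])
    else:             return np.array([1, 0, 0, 0])


# Train on 101..1023 so that the numbers 1..100 stay unseen
train_range = range(101, 2 ** NUM_DIGITS)
trX = np.array([binary_encode(i, NUM_DIGITS) for i in train_range])
trY = np.array([fizz_buzz_encode(i) for i in train_range])
print(trX.shape, trY.shape)
```

With this encoding the input matrix has one row per training integer and one column per bit, which would match a hidden-layer weight matrix w_h with NUM_DIGITS rows.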
In [185]: output
Out[185]:
array(['1', '2', 'fizz', '4', 'buzz', 'fizz', '7', '8', 'fizz', 'buzz',
       '11', 'fizz', '13', '14', 'fizzbuzz', '16', '17', 'fizz', '19',
       'buzz', '21', '22', '23', 'fizz', 'buzz', '26', 'fizz', '28', '29',
       'fizzbuzz', '31', 'fizz', 'fizz', '34', 'buzz', 'fizz', '37', '38',
       'fizz', 'buzz', '41', '42', '43', '44', 'fizzbuzz', '46', '47',
       'fizz', '49', 'buzz', 'fizz', '52', 'fizz', 'fizz', 'buzz', '56',
       'fizz', '58', '59', 'fizzbuzz', '61', '62', 'fizz', '64', 'buzz',
       'fizz', '67', '68', '69', 'buzz', '71', 'fizz', '73', '74',
       'fizzbuzz', '76', '77', 'fizz', '79', 'buzz', '81', '82', '83',
       '84', 'buzz', '86', '87', '88', '89', 'fizzbuzz', '91', '92', '93',
       '94', 'buzz', 'fizz', '97', '98', 'fizz', 'fizz'],
      dtype='<U8')
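
The network's answers are not all correct: 21, for example, is printed as '21' rather than 'fizz'. A plain rule-based reference makes it easy to score a list of predictions. This is a sketch, and the names fizz_buzz_reference and accuracy are hypothetical.

```python
def fizz_buzz_reference(i):
    """Plain rule-based answer for a single integer."""
    if i % 15 == 0:
        return "fizzbuzz"
    if i % 5 == 0:
        return "buzz"
    if i % 3 == 0:
        return "fizz"
    return str(i)


def accuracy(predicted, start=1):
    """Fraction of entries in `predicted` that match the rule-based answer."""
    expected = [fizz_buzz_reference(i) for i in range(start, start + len(predicted))]
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(predicted)


# First fifteen entries of the printed output above
head = ['1', '2', 'fizz', '4', 'buzz', 'fizz', '7', '8', 'fizz', 'buzz',
        '11', 'fizz', '13', '14', 'fizzbuzz']
print(accuracy(head))
```

Passing the full output array, converted to a list, gives the overall score for the numbers 1 to 100.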

The end!