Python, Jupyter, Pandas, Numpy

Jupyter Notebooks

Jupyter notebooks are a way of writing code, running code, and displaying output in one convenient place. You can write code in code blocks, or write markdown/text in blocks like this. It’s often useful to explain what you’re doing and finding so when you or someone else picks up the notebook in the future they’ll know what’s going on.

You can execute code chunks by clicking the cell you want to run and hitting the “Run” button on the top bar, or by typing “shift-enter”. You can always return to a previous code chunk, modify it, and re-run it.

You can write math in notebooks, just like in Rmarkdown: $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$

Jupyter is great for prototyping, but for more heavy-duty use (replication, running on a server, etc.) I recommend re-writing the code into a .py file that can be called from the command line with python mycode.py. This process also gives you a chance to refactor your code, making it more efficient, readable, and dependable. You can convert a notebook to (messy) Python code by going to File > Download as > Python (.py).
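You can also do this conversion from the command line with Jupyter's nbconvert tool (replace mynotebook.ipynb with your notebook's filename):

jupyter nbconvert --to script mynotebook.ipynb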

Take notes!

Python vs. R

Python and R are fairly similar. This is a quick overview of the differences to help you get up to speed.

Installing and importing packages

  • In Python, packages are installed from the command line, NOT from inside Python itself. From the terminal, run pip install mypackage.
  • Importing packages is similar: instead of [R] library(mypackage), do [Python] import mypackage.
  • Python also lets you import specific functions from a package: from mypackage import cool_function
  • You can also rename packages if they’re too long: import numpy as np
# practice!
import pandas as pd
import numpy as np
from collections import Counter
import re

What are all those dots for? (Or, methods, attributes, and namespaces)

  • Dots have special meaning in Python. It’s not like R, where people put dots in all sorts of names.
  • Python is much more careful than R about keeping functions attached to their packages (their “namespace”). If the requests library has a function called get, you call it like this: requests.get(). This reminds you where the get function came from and prevents you from overwriting some other package’s get function.
  • Python is also more “object oriented” than R. Objects often have built-in or attached functions, called methods.
  • These methods are called with a dot notation. Compare:
[R] strsplit("Andy Halterman", " ")

and

    [Python] "Andy Halterman".split(" ")
  • Objects can also have attributes, which are pieces of data attached to an object. Example: andy.subfields = ['methods', 'security']
## Practice! Can you figure out how to make a string all upper case?
print("andy".title())
print("Andy".lower())
print("andy".upper())
Andy
andy
ANDY
my_list = ["x", "y", "z"]
my_list.append("asdf") 

for i in my_list:
    print(i)
x
y
z
asdf

Data Structures

  • Like R’s vectors, Python uses a lot of lists. These are ordered arrays. Note that Python indexing starts at 0!
my_list = ["x", "y", "z"]
> my_list[0] 
"x"
  • Python also has a data structure called a dictionary, which is like a list whose elements you access by key name instead of by position (think of R’s named lists). Example:
article = {"title": "Rivalry and Revenge",
           "author" : "Balcells",
           "year" : "2017"}

> article.keys()
dict_keys(['title', 'author', 'year'])

> article['author']
"Balcells"

Loops and functions

  • Functions are only slightly different from R:
def my_function(x):
    z = (x + 2) ** 2  # note: ** is exponentiation in Python; ^ is XOR
    return z
  • Loops are nicer in Python than in R, and are very easy to write:
output = []
for i in my_list:
    o = my_function(i)
    output.append(o)
  • Pro move: list comprehensions:
output = [my_function(i) for i in my_list]
## Practice! Can you make a list of words, write a loop that
## goes over the list, and print a capitalized version of each one?
my_list = ["x", "y", "z"]
my_list.append("asdf") 

for i in my_list:
    print(i.upper())
    
[i.upper() for i in my_list]
X
Y
Z
ASDF
['X', 'Y', 'Z', 'ASDF']

Whitespace

  • As you can tell, Python makes heavy use of whitespace to set apart different levels of functions, for loops, etc. Use four spaces per level (Jupyter converts tabs to four spaces automatically).
  • No need for curly braces!
def my_function(big_list):
    print(len(big_list))
    for l in big_list:
        for i in l:
            ...
    return stuff
# tqdm is one of my favorite libraries in Python.
# It makes very nice progress bars without any real effort:
from tqdm.autonotebook import tqdm
from time import sleep

_ = [sleep(0.01) for i in tqdm(range(0, 500))]

(A tqdm progress bar renders here; static renderings of the notebook may not display the widget.)

pandas and exploring data

We’ll be working with data from Muchlinski et al. on forecasting civil war onset with random forests.

@article{muchlinski2015comparing,
	Author = {Muchlinski, David and Siroky, David and He, Jingrui and Kocher, Matthew},
	Journal = {Political Analysis},
	Number = {1},
	Pages = {87--103},
	Title = {Comparing random forest with logistic regression for predicting class-imbalanced civil war onset data},
	Volume = {24},
	Year = {2015}}

Skills to learn:

  • reading CSVs (skipping lines and headers)
  • looking at columns and shapes
  • renaming columns
  • pandas unique/describe
  • pandas string operations
df = pd.read_csv("Muchlinski_et_al/SambnisImp.csv",
                   header = 0) # 0th row has variable names
df.shape
(7140, 286)
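read_csv has arguments for messier files too; for example, to skip junk lines at the top of a (hypothetical) file and supply column names yourself:

# pd.read_csv("messy.csv", skiprows=2, header=None, names=["country", "year", "value"])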
df.columns[0:30]
Index(['Unnamed: 0', 'atwards', 'X', 'id', 'cid', 'cowcode', 'year', 'warstds',
       'ptime', 'yrint', 'autonomy', 'rf', 'popdense', 'auto98', 'dem98',
       'pol98', 'army85', 'milgnp92', 'milper', 'milex', 'trade', 'logpop',
       'ienergy', 'ienercap', 'lnienc', 'gdpmkt', 'gdpcap', 'irgdp', 'decade',
       'regy'],
      dtype='object')
# accessing rows by their number
df.iloc[44,:]
Unnamed: 0         4.500000e+01
atwards            1.000000e+00
X                  4.500000e+01
id                 1.000000e+00
cid                1.000000e+00
cowcode            7.000000e+02
year               1.989000e+03
warstds            0.000000e+00
ptime              3.990000e+02
yrint              4.400000e+01
autonomy           0.000000e+00
rf                 1.000000e+00
popdense           3.059774e+01
auto98             8.000000e+00
dem98              0.000000e+00
pol98             -8.000000e+00
army85             4.700000e+04
milgnp92           4.006720e+00
milper             1.308318e+02
milex              1.375957e+06
trade              4.752469e+01
logpop             1.680886e+01
ienergy            4.430821e+01
ienercap           1.560000e-06
lnienc            -1.466382e+01
gdpmkt             5.764021e+10
gdpcap             1.473724e+03
irgdp              3.122953e+03
decade             3.000000e+00
regy               3.809845e+03
                       ...     
ln_gdpen_region    9.617411e-01
regd4             -1.000000e+00
regd4_alt         -5.000000e-01
regd4m            -5.000000e-01
nmdp4             -6.000000e+00
nmdp4_alt         -6.000000e+00
nmdp4m            -6.000000e+00
nat_war            0.000000e+00
tnatwar            0.000000e+00
nmdgdp             9.621006e-01
indepyear          1.929000e+03
ef2                5.636961e-01
mirps0             0.000000e+00
mirps1             0.000000e+00
mirps2             1.000000e+00
mirps3             0.000000e+00
sxpsq              2.209059e-02
sxpsq100           2.209059e+00
pol4sq             4.900000e+01
ln_popns           1.657641e+01
decade1            0.000000e+00
decade2            0.000000e+00
decade3            1.000000e+00
decade4            0.000000e+00
independ           1.000000e+00
tip                1.777491e+01
anocracy           0.000000e+00
proxregc           3.810000e-06
sxpnew.2          -4.568928e-01
sxpsq.2           -4.568928e-01
Name: 44, Length: 286, dtype: float64
### Using only the variables specified in Sambanis (2006) Appendix ###
subset_list = ["cowcode", "warstds", "ager", "agexp", "anoc", "army85", "autch98", "auto4",
        "autonomy", "avgnabo", "centpol3", "coldwar", "decade1", "decade2",
        "decade3", "decade4", "dem", "dem4", "demch98", "dlang", "drel",
        "durable", "ef", "ef2", "ehet", "elfo", "elfo2", "etdo4590",
        "expgdp", "exrec", "fedpol3", "fuelexp", "gdpgrowth", "geo1", "geo2",
        "geo34", "geo57", "geo69", "geo8", "illiteracy", "incumb", "infant",
        "inst", "inst3", "life", "lmtnest", "ln_gdpen", "lpopns", "major", "manuexp", "milper",
        "mirps0", "mirps1", "mirps2", "mirps3", "nat_war", "ncontig",
        "nmgdp", "nmdp4_alt", "numlang", "nwstate", "oil", "p4mchg",
        "parcomp", "parreg", "part", "partfree", "plural", "plurrel",
        "pol4", "pol4m", "pol4sq", "polch98", "polcomp", "popdense",
        "presi", "pri", "proxregc", "ptime", "reg", "regd4_alt", "relfrac", "seceduc",
        "second", "semipol3", "sip2", "sxpnew", "sxpsq", "tnatwar", "trade",
        "warhist", "xconst"]
df = df[subset_list]
df.shape
(7140, 92)
# rename a column
df2 = df.rename(columns={"warhist": "war_hist"})
df.head(10)

cowcode warstds ager agexp anoc army85 autch98 auto4 autonomy avgnabo ... seceduc second semipol3 sip2 sxpnew sxpsq tnatwar trade warhist xconst
0 700 0 34.461765 8.510845 0 129472.9042 0 3.925812 0.005151 0.432955 ... 43.770298 0.253 0.058441 0.46176 0.158275 0.052989 0.443259 72.881375 0 3.995912
1 700 0 34.346348 8.478997 0 129413.0225 0 10.000000 0.000000 0.045052 ... 43.588363 0.253 0.000000 0.00000 0.158321 0.052663 1.000000 72.900089 0 1.000000
2 700 0 77.000000 8.481015 0 130431.0145 0 10.000000 0.000000 0.030034 ... 43.538181 0.253 0.000000 0.00000 0.158425 0.052891 2.000000 72.962880 0 1.000000
3 700 0 78.000000 8.451628 0 126781.6866 0 10.000000 0.000000 0.022526 ... 43.490005 0.253 0.000000 0.00000 0.159006 0.052902 2.000000 73.102449 0 1.000000
4 700 0 79.000000 8.500172 0 130979.2470 0 10.000000 0.000000 0.022526 ... 43.602238 0.253 0.000000 0.00000 0.158074 0.052706 2.000000 72.850389 0 1.000000
5 700 0 80.000000 8.528873 0 130616.5315 0 10.000000 0.000000 0.022526 ... 43.816399 0.253 0.000000 0.00000 0.157900 0.052757 2.000000 72.768380 0 1.000000
6 700 0 81.000000 8.546965 0 129142.7125 0 10.000000 0.000000 0.022526 ... 43.775367 0.253 0.000000 0.00000 0.158282 0.052932 2.000000 72.861897 0 1.000000
7 700 0 82.000000 8.550921 0 129067.3178 0 10.000000 0.000000 0.022526 ... 43.753646 0.253 0.000000 0.00000 0.158117 0.052820 1.000000 72.825320 0 1.000000
8 700 0 83.000000 8.530715 0 130133.2989 0 10.000000 0.000000 0.022526 ... 43.842575 0.253 0.000000 0.00000 0.157872 0.052776 0.000000 72.844025 0 1.000000
9 700 0 84.000000 8.544636 0 131839.4309 0 10.000000 0.000000 0.000000 ... 43.823504 0.253 0.000000 0.00000 0.157677 0.052666 0.000000 72.696400 0 1.000000

10 rows × 92 columns

# Counter is a useful way of counting how often each unique value appears
from collections import Counter
Counter(df['cowcode']).most_common(10)
[(700, 56),
 (339, 56),
 (160, 56),
 (900, 56),
 (305, 56),
 (211, 56),
 (145, 56),
 (140, 56),
 (355, 56),
 (20, 56)]
# Look at the unique values
df['cowcode'].unique()
array([700, 339, 615, 540, 160, 371, 900, 305, 373,  31, 692, 771,  53,
       370, 211,  80, 434, 760, 145, 346, 571, 140, 355, 439, 516, 811,
       471,  20, 402, 482, 483, 155, 710, 100, 581, 484,  94, 344,  40,
       352, 315, 316, 390, 522,  42, 130, 651,  92, 531, 366, 530, 529,
       950, 375, 220, 481, 420, 372, 260, 255, 452, 350,  55,  90, 438,
       404, 110,  41,  91, 310, 395, 750, 850, 630, 645, 205, 666, 325,
       437,  51, 740, 663, 705, 501, 732, 690, 703, 812, 367, 660, 570,
       450, 620, 368, 212, 343, 511, 553, 820, 432, 338, 435, 590,  70,
       359, 712, 600, 541, 775, 565, 790, 210, 920,  93, 436, 475, 385,
       698, 770,  95, 910, 150, 135, 840, 290, 235, 694, 360, 517, 670,
       433, 591, 451, 830, 317, 349, 940, 520, 560, 230, 780, 625, 115,
       572, 380, 225, 652, 713, 702, 510, 800, 461,  52, 616, 640, 701,
       200,   2, 364, 365, 500, 369, 696, 165, 704, 935, 101, 817, 818,
       990, 679, 678, 680, 345, 347, 490, 551, 552])
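If you just want the count of unique values rather than the values themselves, pandas has a method for that:

df['cowcode'].nunique()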
df['numlang'].describe()
count    7140.000000
mean        6.869385
std         6.917369
min         1.000000
25%         2.000000
50%         4.000000
75%         9.000000
max        46.000000
Name: numlang, dtype: float64
# Boolean indexing works much like it does in R
df[df['numlang'] > 6].shape
(2889, 92)
# Create new variables. For instance, does a country have two or fewer languages?
df["low_lang"] = df["numlang"] <= 2
df["low_lang"].head()
0    False
1    False
2    False
3    False
4    False
Name: low_lang, dtype: bool
# Read in the Correlates of War country codes so we can merge them onto the dataframe
cow = pd.read_csv("COW country codes.csv")
cow.head()
cow.shape
(243, 3)

# can also do left_on, right_on if columns are named differently.
df = df.join(cow.set_index(['CCode']), on=['cowcode'],
            how = "left")

df.iloc[44]
cowcode              700
warstds                0
ager                 119
agexp            13.8707
anoc                   0
army85             47000
autch98                0
auto4                  7
autonomy               0
avgnabo         0.376937
centpol3               1
coldwar                1
decade1                0
decade2                0
decade3                1
decade4                0
dem                    0
dem4                   0
demch98                0
dlang                 70
drel                  20
durable                9
ef              0.750797
ef2             0.563696
ehet                  90
elfo                  66
elfo2               4356
etdo4590               1
expgdp           20.9773
exrec                  3
                ...     
part                   1
partfree               0
plural              0.38
plurrel               84
pol4                  -7
pol4m                 -7
pol4sq                49
polch98                0
polcomp                2
popdense         30.5977
presi                  0
pri                   20
proxregc        3.81e-06
ptime                399
reg                    1
regd4_alt           -0.5
relfrac           0.2718
seceduc                8
second             0.253
semipol3               0
sip2           0.0555556
sxpnew          0.102386
sxpsq          0.0220906
tnatwar                0
trade            47.5247
warhist                0
xconst                 2
low_lang           False
StateAbb             AFG
StateNme     Afghanistan
Name: 44, Length: 95, dtype: object
def transform_text(x):
    """
    A simple function to transform text.
    
    A longer description here about what the function does.
    
    Parameters
    ----------
    x: str
      "StateNme" in the merged data
    
    Returns
    -------
    name_mod: str,
      A transformed version of StateNme.
    """
    name_mod = x.lower()
    name_mod = re.sub("a", "Z", name_mod)
    return name_mod
    
# let's check out our cool new docstring!
?transform_text
#df['StateNme'].apply(transform_text)
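For simple transformations, pandas’ vectorized string methods under the .str accessor avoid writing an apply at all, e.g.:

df['StateNme'].str.lower().head()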
# using groupby to get the number of civil war onsets
df.groupby("StateNme")["warstds"].sum().head()
StateNme
Afghanistan    1
Albania        0
Algeria        2
Angola         1
Argentina      1
Name: warstds, dtype: int64
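groupby can also compute several summaries at once with .agg, e.g. the total number and the rate of onsets per country:

df.groupby("StateNme")["warstds"].agg(['sum', 'mean']).head()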

scikit-learn

scikit-learn or sklearn is the most common and useful (non-neural network) machine learning library in Python.

import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# subset to the variables used in the study
X_vars = ["ager", "agexp", "anoc", "army85", "autch98", "auto4",
        "autonomy", "avgnabo", "centpol3", "coldwar", "decade1", "decade2",
        "decade3", "decade4", "dem", "dem4", "demch98", "dlang", "drel",
        "durable", "ef", "ef2", "ehet", "elfo", "elfo2", "etdo4590",
        "expgdp", "exrec", "fedpol3", "fuelexp", "gdpgrowth", "geo1", "geo2",
        "geo34", "geo57", "geo69", "geo8", "illiteracy", "incumb", "infant",
        "inst", "inst3", "life", "lmtnest", "ln_gdpen", "lpopns", "major", "manuexp", "milper",
        "mirps0", "mirps1", "mirps2", "mirps3", "nat_war", "ncontig",
        "nmgdp", "nmdp4_alt", "numlang", "nwstate", "oil", "p4mchg",
        "parcomp", "parreg", "part", "partfree", "plural", "plurrel",
        "pol4", "pol4m", "pol4sq", "polch98", "polcomp", "popdense",
        "presi", "pri", "proxregc", "ptime", "reg", "regd4_alt", "relfrac", "seceduc",
        "second", "semipol3", "sip2", "sxpnew", "sxpsq", "tnatwar", "trade",
        "warhist", "xconst"]

X = df[X_vars]
y = df['warstds']
np.mean(y)  # only about 1.5% of observations are onsets
0.014584578601315004
# doing a train-test split by hand
np.random.seed(73071)

msk = np.random.rand(len(X)) < 0.8
X_train = X[msk]
y_train = y[msk]

X_test = X[~msk]
y_test = y[~msk]
# doing a train-test split the sklearn way
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                        test_size=0.2, 
                                        random_state=42,
                                        shuffle=True # the default
                                                   )
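With an outcome this rare, the stratify argument is worth knowing about: it keeps the share of onsets the same in the train and test sets (a sketch; everything below uses the unstratified split above):

# X_train, X_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.2, random_state=42, stratify=y)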
model = LogisticRegression() 
model
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
model.fit(X_train, y_train)
/Users/ahalterman/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
y_pred = model.predict(X_test)
# raw accuracy
np.mean(y_pred == y_test)
0.9844590555887627
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred))  
[[1647    0]
 [  26    0]]
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1647
           1       0.00      0.00      0.00        26

    accuracy                           0.98      1673
   macro avg       0.49      0.50      0.50      1673
weighted avg       0.97      0.98      0.98      1673

/Users/ahalterman/anaconda3/lib/python3.6/site-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
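Note that the 98.4% accuracy above is exactly what always predicting “no onset” would score: the confusion matrix shows the model never predicts a 1, and only 26 of the 1,673 test observations are onsets. With classes this imbalanced, raw accuracy is misleading. Weighting the rare class more heavily helps: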
model = LogisticRegression(class_weight='balanced', max_iter=1000)  # weight the rare class; raise max_iter to address the convergence warning
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(np.mean(y_pred == y_test))
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred))  
0.7071129707112971
[[1166  481]
 [   9   17]]
              precision    recall  f1-score   support

           0       0.99      0.71      0.83      1647
           1       0.03      0.65      0.06        26

    accuracy                           0.71      1673
   macro avg       0.51      0.68      0.45      1673
weighted avg       0.98      0.71      0.81      1673
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42) 
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(np.mean(y_pred == y_test))
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred))  
0.9856545128511656
[[1646    1]
 [  23    3]]
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      1647
           1       0.75      0.12      0.20        26

    accuracy                           0.99      1673
   macro avg       0.87      0.56      0.60      1673
weighted avg       0.98      0.99      0.98      1673
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, make_scorer

parameters = {'n_estimators':(10, 50, 100),
             'max_features': [5, 10, "auto"], 
             'max_depth': [5, 10, None]}
rf = RandomForestClassifier(class_weight="balanced")
clf = GridSearchCV(rf, 
                   parameters, 
                   cv = 5,
                   scoring=make_scorer(f1_score, average='macro'))
clf.fit(X_train, y_train)
GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight='balanced',
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'max_depth': [5, 10, None],
                         'max_features': [5, 10, 'auto'],
                         'n_estimators': (10, 50, 100)},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=make_scorer(f1_score, average=macro), verbose=0)
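Before refitting, you can check which hyperparameters won and their cross-validated score:

print(clf.best_params_)
print(clf.best_score_)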
cv_model = clf.best_estimator_
cv_model.random_state = 42
cv_model.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight='balanced',
                       criterion='gini', max_depth=10, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)
y_pred = cv_model.predict(X_test)
print(np.mean(y_pred == y_test))
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred))  
0.9856545128511656
[[1646    1]
 [  23    3]]
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      1647
           1       0.75      0.12      0.20        26

    accuracy                           0.99      1673
   macro avg       0.87      0.56      0.60      1673
weighted avg       0.98      0.99      0.98      1673

Numpy

Let’s implement our favorite matrix operation in numpy:

$$\hat{\beta} = (X^TX)^{-1}X^Ty$$

# pandas stores data as a numpy array internally. 
# Extract that array from our dataframe:
X_arr = X.values
X_arr.shape
#type(X_arr)
(8365, 90)
# (X'X)^{-1} part
first = np.linalg.inv(np.matmul(X_arr.transpose(), X_arr))
first.shape
(90, 90)
# (X'X)^{-1} X'
second = np.matmul(first, X_arr.transpose())
second.shape
(90, 8365)
# (X'X)^{-1} X'y
y_arr = y.values
beta_hat = np.matmul(second, y_arr)
beta_hat
array([ 9.15557359e-04, -1.98683161e-02,  4.97339059e-01, -1.13954863e-08,
        3.49932939e-01, -5.26447221e+01,  3.48090394e+00, -1.92783887e-01,
       -1.14251018e-01,  7.40985952e-01, -1.50706556e-02, -7.01910439e-01,
        3.90333808e-02,  5.18722318e-01,  4.10661105e-01,  5.23789156e+01,
        4.99131771e-01,  2.46628694e-03,  1.59457077e-03,  1.11187963e-03,
        7.47700665e-01, -6.89842833e-01, -2.01735066e-03,  9.82908740e-04,
       -2.61683065e-05, -1.23909341e-01,  1.57504735e-04,  9.55090092e-02,
       -4.75722092e-02,  8.91325454e-04,  4.24612222e-01, -9.62906927e+02,
       -9.62879433e+02, -9.62839911e+02, -9.62868615e+02, -9.62867136e+02,
       -9.62893711e+02,  1.08267776e-04,  2.17711652e-02,  1.93703260e-05,
       -2.18254131e-03,  3.50089066e-02, -6.15902755e-04,  2.92768141e-03,
       -1.68783809e-03, -1.94270516e-03, -6.70619664e-03, -2.16807462e-04,
       -3.32073604e-06,  9.63071366e+02,  9.63048471e+02,  9.63053055e+02,
        9.63061466e+02,  1.95730736e-02,  1.00675911e-02, -5.71985512e-07,
        3.97404334e-04, -4.31343678e-04,  6.64880298e-02, -2.31396221e-03,
       -4.32906358e-04,  5.24082468e-03, -8.09099212e-03,  3.72918194e-03,
        2.01363791e-03, -8.06686537e-02,  3.07452918e-04, -5.25755735e+01,
        2.85048249e-02, -2.70691235e-04,  4.42181992e-03, -6.44334071e-04,
        2.15744877e-05, -1.97725042e-02,  8.65914094e-06, -1.75195003e-02,
        7.89896542e-06, -9.23597488e-03, -7.83440984e-04,  4.62118722e-02,
        1.10501630e-04, -8.89377306e-02, -4.77380819e-02, -4.39172741e-02,
       -1.11120434e-01,  5.05154923e-02, -8.02167160e-03, -6.00367357e-05,
       -5.86679825e-03, -3.81315670e-03])
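Explicitly inverting X'X is fine for illustration, but a least-squares solver is more numerically stable; the huge offsetting coefficients above (around ±963) hint that X is nearly collinear, so the solutions may differ there. A one-line alternative (a sketch, not run here):

# beta_hat_ls, residuals, rank, sv = np.linalg.lstsq(X_arr, y_arr, rcond=None)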
# The resulting array has several useful methods
beta_hat.max()
963.0713664382838
beta_hat.min()
-962.9069274995783
beta_hat.argmax()
49
# which variable is the max?
X_vars[beta_hat.argmax()]
# the codebook tells us this variable is a "semi-democracy" dummy
'mirps0'
X_vars[beta_hat.argmin()]
# this is the "Region: Western Europe and the US" variable
'geo1'
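As a speed bonus: numba can compile numerical Python functions to machine code with its @jit decorator. First, a plain-Python haversine distance function as a baseline, then a jit-compiled copy for comparison: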
from numba import jit
import math

def haversine(origin, destination):
    lat1, lon1 = origin
    lat2, lon2 = destination
    radius = 6371 # km

    dlat = math.radians(lat2-lat1)
    dlon = math.radians(lon2-lon1)
    a = math.sin(dlat/2) * math.sin(dlat/2) + math.cos(math.radians(lat1)) \
        * math.cos(math.radians(lat2)) * math.sin(dlon/2) * math.sin(dlon/2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    d = radius * c

    return d
%timeit haversine((36.36,35.34), (37.33, 36.44))
2.58 µs ± 188 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
#haversine((36.36,35.34), [(37.33, 36.44), (38.33, 36.44)])
@jit()
def haversine_jit(origin, destination):
    lat1, lon1 = origin
    lat2, lon2 = destination
    radius = 6371 # km

    dlat = math.radians(lat2-lat1)
    dlon = math.radians(lon2-lon1)
    a = math.sin(dlat/2) * math.sin(dlat/2) + math.cos(math.radians(lat1)) \
        * math.cos(math.radians(lat2)) * math.sin(dlon/2) * math.sin(dlon/2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    d = radius * c

    return d
haversine_jit((36.96,35.34), (57.33, -36.44))
5574.624467973912
%timeit haversine_jit((36.36,35.34), (37.33, 36.44))
531 ns ± 20.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Code profiling

The line_profiler extension reports how much time is spent on each line of a function:

%load_ext line_profiler
%lprun -f haversine haversine((36.36,35.34), (37.33, 36.44))