Crash course in Python for R users
This is Part 1 of a two-part series. Stay tuned for Part 2, which will cover numpy, pandas, and scikit-learn.
R is an extremely powerful language for data analysis, and probably the best language for working with tabular data, running regressions, and making visualizations. However, most of the cutting edge work in machine learning, neural networks, and natural language processing is being done in Python. Python is also a great language for web scraping, and has a lot of great tools for working with text data.
This tutorial is a quick introduction to Python for R users. It is not meant to be comprehensive. Instead, it covers the main differences between R and Python and gives R users enough to get started.
Note on the code: code that you can run begins with >>>, and output does not. When you paste code into a Python interpreter, you don’t need to include the >>>.
Python is a calculator, just like R
>>> 2 * 402
804
>>> (139 / 5) * 2
55.6
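A few arithmetic operators are spelled differently than in R. The short sketch below (written as a plain script, so without the >>> prompts) shows the ones that most often trip up R users. The variable names are just illustrative. Be careful with ^: in Python it is bitwise XOR, not exponentiation.

```python
# Exponentiation uses ** (R: ^)
squared = 7 ** 2

# Floor (integer) division uses // (R: %/%)
whole = 17 // 5

# Remainder uses % (R: %%)
remainder = 17 % 5

# Ordinary division with / always returns a float, even for a whole result
ratio = 10 / 5

print(squared, whole, remainder, ratio)  # 49 3 2 2.0
```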
To assign objects, we get to use = instead of R’s <-:
>>> gdp = 592993954831
>>> pop = 100000000
>>> gdp_per_capita = gdp / pop
>>> print(gdp_per_capita)
5929.93954831
Python has similar basic data types to R, but they have different names:
- int instead of integer
- float instead of numeric
- str instead of character
- bool instead of logical (and Python uses True and False, not R’s TRUE and FALSE)
>>> type("Hello World")
str
>>> type(42)
int
>>> type(45.1)
float
>>> type(False)
bool
Data Structures
Lists
Like R’s vectors, Python uses a lot of lists. These are ordered arrays.
Lists are created with square brackets, and can contain any type of data (including other lists!).
Note that Python indexing starts at 0, not at 1 as in R!
>>> my_list = ["a", "b", "c"]
>>> type(my_list)
list
>>> my_list[0]
'a'
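Two other indexing habits differ from R and are worth seeing once. In this small sketch (the list letters is made up for illustration), a negative index counts from the end of the list rather than dropping an element as it would in R, and a slice start:stop leaves out the element at stop.

```python
letters = ["a", "b", "c", "d", "e"]

# Negative indices count backwards from the end of the list
last = letters[-1]

# Slices include the start index but exclude the stop index
first_two = letters[0:2]

# Leaving out the stop index runs to the end of the list
from_third = letters[2:]

print(last)        # e
print(first_two)   # ['a', 'b']
print(from_third)  # ['c', 'd', 'e']
```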
Exercise
- We can use the len() function to get the length of a list. How long is my_list?
- What happens when we run my_list[len(my_list)]? Why?
Dictionaries
One of Python’s most useful data structures is a dictionary. A dictionary has a key-value structure, where you access elements of a dictionary by name, rather than by position (think a more general form of R’s dataframes). Each dictionary has keys, and associated with each key is a value. The values can be any kind of data structure, including simple ints, strs, and floats, but also lists, other dictionaries, dataframes, model objects, etc.
Dictionaries are defined using curly braces {}. Each key-value pair is separated by a comma, and the key and value are separated by a colon.
Example:
>>> article = {"title": "Rivalry and Revenge",
               "author": "Balcells",
               "year": "2017"}
>>> article['author']
'Balcells'
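Beyond looking up a single key, a few other dictionary operations come up constantly. The sketch below reuses the article dictionary. The "read" key and the "publisher" lookup are hypothetical additions for illustration only.

```python
article = {"title": "Rivalry and Revenge",
           "author": "Balcells",
           "year": "2017"}

# Add (or overwrite) a key-value pair by assignment
article["read"] = True  # hypothetical key, for illustration

# List the keys; dictionaries keep insertion order (Python 3.7+)
keys = list(article.keys())

# .get() returns a fallback instead of raising KeyError for a missing key
publisher = article.get("publisher", "unknown")

# Loop over keys and values together
for key, value in article.items():
    print(key, "->", value)
```

Using square brackets on a missing key (article["publisher"]) raises a KeyError, so .get() is handy when a key may not be present.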
Exercise:
Create a dictionary called andy_facts with the following keys and values:
- “name” : “Andy”
- “age” : 34
- “cats” : [“Archie”, “Ellie”]
Then, access the value associated with the key “cats”. What data type is it?
What are all those dots for? (Or, methods, attributes, and namespaces)
When you read Python code, you’ll see a lot of dots. For example, my_list.append(5) or my_string.upper(). What’s going on here?
Dots have special meaning in Python. It’s not like R, where people put dots in all sorts of names. Objects can have built in or attached functions, called methods. These methods are called with a dot notation. (These also come up in Python packages, which we’ll talk about later).
Compare:
[R] strsplit("Andy Halterman", " ")
and
[Python] "Andy Halterman".split(" ")
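To make the dot notation concrete, here is a small sketch of a few more built-in string methods. The padded name string is made up for illustration. Each method returns a new string (or list) and leaves the original unchanged, because Python strings are immutable.

```python
name = "  Andy Halterman  "

# strip() removes leading and trailing whitespace (like R's trimws)
cleaned = name.strip()

# split() breaks a string into a list of pieces
parts = cleaned.split(" ")

# join() is called on the separator and glues a list back together
rejoined = "_".join(parts)

# replace() swaps one substring for another
swapped = cleaned.replace("Andy", "A.")

print(cleaned)   # Andy Halterman
print(parts)     # ['Andy', 'Halterman']
print(rejoined)  # Andy_Halterman
print(swapped)   # A. Halterman
```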
Exercise
Take a look at these built in methods for strings:
>>> print("andy".title())
>>> print("Andy".lower())
Can you figure out how to make a string all upper case?
Loops
As R programmers, we’re often told to avoid using loops because they are slow. In Python, loops are idiomatic and are often the clearest way to do things, although vectorized operations are still faster for large numeric arrays.
Python has two main types of loops: for loops and while loops.
for loops are used when you want to iterate over a list, or some other iterable object. while loops are used when you want to keep doing something until some condition is met.
In practice, for loops are much more common than while loops.
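For completeness, here is a minimal sketch of a while loop. The variable names and the threshold of 20 are arbitrary choices for illustration. The loop keeps adding successive integers until the running total reaches the threshold.

```python
count = 0
total = 0

# Keep looping as long as the condition is True
while total < 20:
    count += 1      # shorthand for count = count + 1
    total += count

print(count, total)  # 6 21
```

If the condition never becomes False, the loop runs forever, which is one reason for loops are usually the safer choice.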
For loops
For loops have the following structure:
for {variable} in {list}:
    {do something}
- The {list} can be any iterable object, including a list, dataframe, string, etc.
- The {variable} can be any variable name you want. By convention, it is usually the letter i, but as your loops become more complicated, it can be helpful to use more descriptive variable names.
- The {do something} can be any code you want. Note the colon and indentation! Python uses indentation to determine what is inside the loop and what is outside the loop. This is in contrast to R, which uses curly braces {} to determine what is inside and outside of a loop.
>>> numbers = [1, 2, 3, 4, 5]
>>> for number in numbers:
        print(number)
1
2
3
4
5
Compare this to how it would look in R:
numbers <- list(1, 2, 3, 4, 5)
for (i in 1:length(numbers)) {
    number = numbers[[i]]
    print(number)
}
In Python, we get the value of the list element directly, rather than using the index to get the value.
Here’s another example, this time using i instead of the more descriptive number:
>>> for i in numbers:
        print(i * 10)
10
20
30
40
50
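Sometimes you do want the position as well as the value, as in the R version. A common way to get both is the built-in enumerate(), and range() generates a sequence of integers when you only need positions. This is a small sketch, and the names position and pairs are illustrative.

```python
numbers = [1, 2, 3, 4, 5]

# enumerate() yields (position, value) pairs, with positions starting at 0
for position, number in enumerate(numbers):
    print(position, number)

# Collect the pairs into a list to inspect them
pairs = list(enumerate(numbers))

# range(5) produces 0, 1, 2, 3, 4 (the stop value is excluded)
zero_to_four = list(range(5))
print(zero_to_four)  # [0, 1, 2, 3, 4]
```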
Appending to lists
One common thing we want to do in a loop is append to a list. In R, we can do this using the c() function. In Python, we can use the append() method. For example, if we want to create a list of the squares of the numbers 1 through 10, we could do the following:
>>> numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> squared_numbers = []
>>> for i in numbers:
        sq = i ** 2
        squared_numbers.append(sq)
>>> squared_numbers
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
In the code above, we created a list of numbers to iterate over. Then, we created an empty list called squared_numbers. As we iterated over numbers, we squared each one and appended it to the squared_numbers list.
append vs extend
Note that we used the append() method to add to the list. There is also an extend() method. What’s the difference? append() adds a single element to the list. extend() adds all the elements of a list to the list.
For example:
>>> my_list = [1, 2, 3]
>>> my_list.append([4, 5, 6])
>>> print(my_list)
[1, 2, 3, [4, 5, 6]]
>>> my_list = [1, 2, 3]
>>> my_list.extend([4, 5, 6])
>>> print(my_list)
[1, 2, 3, 4, 5, 6]
Notice that the first block of code adds the second list as a single element. The length of the resulting list is 4, with the fourth element being the list [4, 5, 6]. The second block of code adds the elements of the second list to the first list. The length of the resulting list is 6, with the last three elements being 4, 5, and 6.
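One related gotcha for R users: append() and extend() change the list in place and return None, so assigning their result back to a variable (as you would with c() in R) throws the list away. The + operator, by contrast, builds a new list. This is a short sketch, and the names result and combined are illustrative.

```python
my_list = [1, 2, 3]

# append() modifies my_list in place and returns None
result = my_list.append(4)
print(result)    # None
print(my_list)   # [1, 2, 3, 4]

# + creates a new list and leaves my_list untouched
combined = my_list + [5, 6]
print(combined)  # [1, 2, 3, 4, 5, 6]
print(my_list)   # [1, 2, 3, 4]
```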
List comprehensions
Instead of writing out a for loop across multiple lines, we can also use a list comprehension. List comprehensions are a very common way to write for loops in Python and you’ll often see them in answers on Stack Overflow. They are more compact, but can be harder to read when you’re first starting out.
Recall our example above, where we multiplied each number in a list by 10 (numbers now holds 1 through 10):
>>> for i in numbers:
        print(i * 10)
10
20
30
40
50
60
70
80
90
100
Here’s the same loop as above, but using a list comprehension:
>>> [print(i * 10) for i in numbers]
10
20
30
40
50
60
70
80
90
100
[None, None, None, None, None, None, None, None, None, None]
List comprehensions return a list, so if you want to do something else with the values, you’ll need to assign the list comprehension to a variable. For example, instead of printing out the values as we did above, we can save them to a variable:
>>> times_ten = [i * 10 for i in numbers]
>>> print(times_ten)
[10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
You read a list comprehension backwards: The second part of the list comprehension is the for loop, and the first part is what you want to do with each element of the list.
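List comprehensions can also filter as they go by adding an if clause at the end. The sketch below keeps only the even numbers before multiplying, and shows the equivalent for loop for comparison. The variable names are illustrative.

```python
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Comprehension with a filter: only even numbers are kept
even_times_ten = [i * 10 for i in numbers if i % 2 == 0]

# The same thing written as an ordinary for loop
even_times_ten_loop = []
for i in numbers:
    if i % 2 == 0:
        even_times_ten_loop.append(i * 10)

print(even_times_ten)  # [20, 40, 60, 80, 100]
```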
Exercise
- Write a for-loop that iterates over the string “Hello, world!”. How does Python treat iterating over a string?
- Create a list of lists that looks like the following:
  [[1, 12, 3, 19], [4, 53, 6], [8, 9]]
  Print the length of each list in the list of lists.
- Now do the same thing, but using a list comprehension.
Functions
Functions in Python are defined using the def keyword. The function name is followed by the arguments in parentheses, and then a colon. The function body is indented, just like a for loop. R implicitly returns the value of the last expression evaluated in a function, but in Python you need to explicitly use the return keyword.
For example, we can define a function that squares a number (note that Python uses ** for exponentiation, not ^ like R):
>>> def square_number(x):
        z = (x)**2
        return z
>>> square_number(4)
16
With functions, we can also specify default values for arguments. For example, we can define a function that takes two arguments, but if the second argument is not specified, it defaults to 2:
>>> def exponent_number(x, exp=2):
        z = (x)**exp
        return z
>>> exponent_number(4, exp=3)
64
# exp defaults to 2 if not specified
>>> exponent_number(4)
16
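To see why the explicit return matters, compare two versions of the same function. In this sketch (both function names are hypothetical), the version without return computes the value but hands back None, which is a common surprise for people coming from R.

```python
def square_no_return(x):
    z = x ** 2  # computed, but never returned


def square_with_return(x):
    z = x ** 2
    return z


missing = square_no_return(4)
present = square_with_return(4)

print(missing)  # None
print(present)  # 16
```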
Exercise
- Write a function that takes a list of numbers and returns the sum of the squares of the numbers.
- Now iterate over the list of lists below, apply your function, and save the results to a list:
  [[1, 12, 3, 19], [4, 53, 6], [8, 9]]
Writing docstrings
When you write a function in Python, it’s often very helpful to write a docstring: short documentation on what the function does, what arguments it takes, and what it returns. See the example below for how to write one, and what it looks like when you look it up.
>>> import re
>>> def transform_text(x):
        """
        A simple function to transform text.

        A longer description here about what the function does.

        Parameters
        ----------
        x: str
            "StateNme" in the merged data

        Returns
        -------
        name_mod: str
            A transformed version of StateNme.
        """
        name_mod = x.lower()
        name_mod = re.sub("a", "Z", name_mod)
        # return the result so it matches the "Returns" section above
        return name_mod
# let's check out our cool new docstring!
# (the ? syntax works in IPython and Jupyter; in plain Python use help(transform_text))
?transform_text
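Outside of IPython or Jupyter, the docstring is still easy to reach: the built-in help() prints it, and it is stored on the function as the __doc__ attribute. This is a minimal sketch using a hypothetical one-line function, add_two.

```python
def add_two(x):
    """Return x plus 2."""
    return x + 2


# The raw docstring is stored as an attribute on the function
print(add_two.__doc__)

# help() prints the function signature along with the docstring
help(add_two)
```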
Nesting and whitespace
As you can tell, Python makes heavy use of whitespace to set apart different levels of functions, for loops, etc. The default indentation is 4 spaces (which I recommend), but you can also use tabs. The important thing is to be consistent! If you mix tabs and spaces, you’ll get an error.
If you need to create a nested for loop, or a for loop in a function, you need to indent the code inside the for loop or function.
def my_function(big_list):
    # everything inside the function is indented to this level
    print(len(big_list))
    for ll in big_list:
        # everything inside the for loop is indented to this level
        for i in ll:
            # everything inside the nested for loop is indented to this level
            ...
    # back to the top level of the function
    return stuff
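Here is a runnable version of the same nesting pattern, as a sketch. The function count_large_values and its threshold argument are hypothetical, chosen only to give each indentation level something to do: a function containing a loop, containing another loop, containing an if.

```python
def count_large_values(big_list, threshold=5):
    # function level
    n_large = 0
    for inner_list in big_list:
        # outer loop level: one pass per inner list
        for value in inner_list:
            # inner loop level: one pass per number
            if value > threshold:
                # if level
                n_large += 1
    # back at function level, after both loops have finished
    return n_large


result = count_large_values([[1, 12, 3, 19], [4, 53, 6], [8, 9]])
print(result)  # 6
```

Moving the return line one level deeper would end the function after the first inner list, which is an easy mistake to make when indentation carries meaning.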
Installing and importing packages
- In Python, libraries are installed from the command line, NOT from inside Python itself.
- From the terminal, run pip install mypackage.
- If you’re using Anaconda, you can also use conda install mypackage.
- If you’re using Jupyter notebooks, you can also run !pip install mypackage from inside a notebook.
Importing libraries in Python is similar to importing packages in R: instead of [R] library(mypackage), do [Python] import mypackage.
For example, we can import the re library, which is Python’s library for regular expressions:
>>> import re
Python also lets you import specific functions from a library: from mypackage import cool_function
For example, we can import the Counter class from the collections library:
>>> from collections import Counter
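As a quick illustration of what Counter does, the sketch below tallies the items in a made-up list of words, roughly what table() does in R. A Counter behaves like a dictionary whose values are counts.

```python
from collections import Counter

words = ["cat", "dog", "cat", "bird", "cat", "dog"]  # made-up data

# Count how many times each item appears
counts = Counter(words)

print(counts["cat"])           # 3
print(counts.most_common(1))   # [('cat', 3)]

# Items that never appeared have a count of 0 rather than raising an error
print(counts["fish"])          # 0
```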
You can also rename libraries if their names are too long. For example: import numpy as np. You’ll see this a lot in code you read with the numpy and pandas packages.
>>> import pandas as pd
>>> import numpy as np
Python is much more careful about keeping libraries’ functions attached to the library. For example, if we want to use the string substitution function sub from the re library, we call it as re.sub().
>>> my_string = "Hello, world!"
>>> re.sub("world", "friend", my_string)
'Hello, friend!'
By keeping functions attached to the library, we can avoid conflicts between libraries. For example, the numpy library has a function called sum, which is also a built-in function in Python. By calling numpy.sum(), we can avoid conflicts between the two functions. This also makes it easier to read code, since you can tell where a function came from.
Exercise
Iterate over the list of numbers below and apply the log function from the numpy library to each number. Save the results to a list.
[1, 2, 3, 4, 5]
Extra fun stuff: progress bars
tqdm is one of my favorite libraries in Python. It makes very nice progress bars without any real effort. This can be really helpful for long-running loops, or when you’re running a script and want to know how long it will take to finish.
>>> from tqdm.autonotebook import tqdm
>>> from time import sleep
>>> _ = [sleep(0.01) for i in tqdm(range(0, 500))]
100%|██████████| 500/500 [00:06<00:00, 80.04it/s]