Dive into Python—Part 2

Lesson Notes

Tuesday, June 15

11:15 am–1:00 pm
Connection Details: Lesson Room Link

Our training JupyterHub

For this session and for all sessions tomorrow, you are welcome to use our training JupyterHub, on which all the Python packages are installed and the data is already downloaded. It will make getting set up a lot easier and faster.

To access the JupyterHub, you will need a username and password, which I will give you during this course. You can find instructions on our help page.

The username and password will remain valid for the entire week, so write them down somewhere and keep them at hand.

Writing functions

So far, you have used functions that were built into Python (e.g. print).
In addition to those, countless packages allow you to load additional functions into your session.

But you can also build your own functions to do exactly what you want.

General syntax

Here is how you can define a new function in Python:

def function_name(arguments):
    Body of the function

Our first functions

Let's give this a try by creating some greeting functions.

Function without argument

Let's start with the simple case in which our function does not accept any argument:

# We define our first function
def hello():
    print('Hello')

# Then we can call it
hello()

This was great, but…

# ... it does NOT accept arguments, so this raises a TypeError
hello('Marie')
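
To see what "does not work" means in practice, here is a minimal sketch which calls hello with an argument and catches the resulting error instead of stopping. The variable name error_message is only there for illustration:

```python
def hello():
    print('Hello')

# Calling hello with an argument raises a TypeError,
# because the function was defined without any parameter
try:
    hello('Marie')
except TypeError as error:
    error_message = str(error)
    print('TypeError:', error_message)
```

The printed message tells you how many arguments the function expected and how many it received.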

Function which accepts an argument

Let's step this up with a function which can accept an argument:

# We define a second function
def greetings(name):
    print('Hello ' + name)

# This time, passing a name works
greetings('Marie')

However, now this does not work anymore:

greetings()

Function with an optional argument

Let's make this fancier: let's create a function with an optional argument. That is, a function which accepts an argument, but also has a default value for when we do not provide one:

# Our third function
def howdy(name='you'):
    print('Hello ' + name)

# We can call it without argument (making use of the default value)
howdy()

# And we can call it with an argument
howdy('Marie')
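
An argument can also be passed by name rather than by position. The sketch below redefines howdy so that it runs on its own, then calls it in both ways:

```python
def howdy(name='you'):
    print('Hello ' + name)

# Passing the argument by position
howdy('Marie')

# Passing the same argument by name (a keyword argument)
howdy(name='Marie')
```

Both calls print the same greeting. Naming the argument mostly helps readability once a function has several parameters.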

This was fun, but…

# ...this does not work
howdy('Marie', 'Paul')

Function with two arguments

We could create a function which takes two arguments:

# Our two-argument function
def hey(name1, name2):
    print('Hello ' + name1 + ', ' + name2)

# Which solves our problem
hey('Marie', 'Paul')

But it is terribly limiting:

# This doesn't work
hey()

# And neither does this
hey('Marie')

# Not to mention this...
hey('Marie', 'Paul', 'Alex')

Function which accepts any number of arguments

Let's create a truly great function which handles all of our cases:

# Now, this one is the real deal
def hi(name='you', *args):
    result = ''
    for i in args:
        result += (', ' + i)
    print('Hello ' + name + result)

# Everything works!
hi()
hi('Marie')
hi('Marie', 'Paul')
hi('Marie', 'Paul', 'Alex')
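
Inside the function, args is a tuple which holds all the extra arguments. As an alternative to the loop, here is a sketch of the same greeting built with the string method join. The function names show_args and hi_joined are made up for this example:

```python
def show_args(*args):
    # args is a tuple containing every positional argument
    print(type(args), args)

show_args('Marie', 'Paul')

def hi_joined(name='you', *args):
    # Put the first name and the extra names in a single tuple
    names = (name,) + args
    # join inserts ', ' between the names
    print('Hello ' + ', '.join(names))

hi_joined()
hi_joined('Marie')
hi_joined('Marie', 'Paul', 'Alex')
```

This is only one possible way to write it. The loop in hi and the join in hi_joined print the same greetings.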

Introduction to Pandas

pandas is a Python library built to manipulate data frames and time series.

For this section, we will use the Covid-19 data from the Johns Hopkins University CSSE repository. I last accessed this database on Saturday, so the last day of data we will use is June 12, 2021.

You can visualize this data in a dashboard created by the Johns Hopkins University Center for Systems Science and Engineering.

I already uploaded the two files we will use onto our JupyterHub. They are at: /project/def-sponsor00/dhsi/covid_confirmed.csv and /project/def-sponsor00/dhsi/covid_dead.csv.

For those who want to work on their own machine, those files are also in our Google Drive. You can download them from there, but you will have to change the paths in the code to match the location of the files on your machine when you read them into Python.

For reference, here is a User Guide to pandas and here is the full documentation.

Setup

First, we need to load the pandas library and read in the data:

# Load the pandas library and create a shorter name for it
import pandas as pd

# Read in data from a csv file
cases = pd.read_csv('/project/def-sponsor00/dhsi/covid_confirmed.csv')

First look at the data

What does our data look like?

# Print the data
cases   # When not using Jupyter, you must use: print(cases)

# Look at the first rows
cases.head()
cases.head(25)

# And the last rows
cases.tail()

# Names of rows and columns
cases.index
cases.columns

# Quick summary of the data
cases.describe()

# Data types of the various columns
cases.dtypes
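
If you do not have the data files at hand, you can try the same methods on a small hand-made data frame. The sketch below uses made-up numbers (not real Covid-19 counts) and assumes a column layout similar to the file: a Province/State column, the Country/Region, Lat, and Long columns, then one column per date. The name toy is hypothetical:

```python
import pandas as pd

# Made-up numbers for illustration only (NOT real Covid-19 data)
toy = pd.DataFrame({
    'Province/State': ['New South Wales', 'Victoria', None, None],
    'Country/Region': ['Australia', 'Australia', 'Albania', 'Venezuela'],
    'Lat': [-33.9, -37.8, 41.2, 6.4],
    'Long': [151.2, 144.9, 20.2, -66.6],
    '6/11/21': [10, 20, 5, 7],
    '6/12/21': [12, 25, 6, 9],
})

print(toy.head(2))   # First two rows
print(toy.columns)   # Names of the columns
print(toy.dtypes)    # Data type of each column
print(toy.shape)     # (number of rows, number of columns)
```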

Number of cases per country in decreasing order for any date

Let's see what the numbers were for each country on June 12, 2021. To make it easier to read, let's also sort those numbers in decreasing order:

# Let's get rid of the latitude and longitude to simplify our data
simple = cases.drop(columns=['Lat', 'Long'])
simple

# Some countries are split between several provinces or states (e.g. Australia)
# Let's select only the data for Australia to show this
# → We want all the rows for which the Country/Region column is equal to Australia

# First, we want to select the Country/Region column
# There are several ways to index in pandas. This is one option
simple['Country/Region']

# This is another way to index a column in pandas
simple.loc[:, 'Country/Region']

# Then we need a condition (the value in that column is equal to Australia)
simple.loc[:, 'Country/Region'] == 'Australia'

# Finally, we index, out of our entire data frame,
# the rows for which that condition is True
simple[simple.loc[:, 'Country/Region'] == 'Australia']
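
The three steps above (select a column, build a condition, index the rows) can be tried on a tiny example. The data frame below holds made-up numbers, and the names toy, mask, and australia are hypothetical:

```python
import pandas as pd

# Made-up numbers for illustration only
toy = pd.DataFrame({
    'Country/Region': ['Australia', 'Australia', 'Albania'],
    '6/12/21': [12, 25, 6],
})

# The comparison returns a Series of True/False values, one per row
mask = toy['Country/Region'] == 'Australia'
print(mask)

# Indexing with the mask keeps only the rows where it is True
australia = toy[mask]
print(australia)

# .loc accepts the same mask and can select a column at the same time
print(toy.loc[mask, '6/12/21'])
```

Storing the condition in a variable is optional, but it can make longer selections easier to read.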

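# Let's compute the sum per country for all the dates
# We need to group the rows by country, then compute the sum
# numeric_only=True leaves out text columns (such as the province or state names)
totals = simple.groupby('Country/Region').sum(numeric_only=True)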
totals

# Now, we can look at the totals for any date
totals['6/12/21']

# And with an alternative indexing method
totals.loc[:, '6/12/21']

# And we can order them by decreasing values
totals.loc[:, '6/12/21'].sort_values(ascending=False)
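
To make the grouping step more concrete, here is a minimal sketch on made-up numbers in which one country is split over two rows. The names toy, toy_totals, and ranked are hypothetical:

```python
import pandas as pd

# Made-up numbers for illustration only
toy = pd.DataFrame({
    'Country/Region': ['Australia', 'Australia', 'Albania', 'Venezuela'],
    '6/11/21': [10, 20, 5, 7],
    '6/12/21': [12, 25, 6, 9],
})

# The two rows for Australia are added together;
# the country names become the row labels (the index)
toy_totals = toy.groupby('Country/Region').sum()
print(toy_totals)

# Select one date, then sort from largest to smallest
ranked = toy_totals['6/12/21'].sort_values(ascending=False)
print(ranked)
```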

Number of cases for one country at all the dates

Now, let's focus on a single country and extract the time series of confirmed cases for that country.

# Our variable totals can be used directly for this too
# Here, we index a row instead of a column
totals.loc['Albania']

Global totals

Now, what if we want to have the world totals for each day?

# Calculate the column totals
totals.sum()
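
Selecting a row and summing the columns both return a Series with one value per date. Here is a short sketch on made-up numbers which shows the two side by side (the names toy, toy_totals, albania, and world are hypothetical):

```python
import pandas as pd

# Made-up numbers for illustration only
toy = pd.DataFrame({
    'Country/Region': ['Australia', 'Australia', 'Albania', 'Venezuela'],
    '6/11/21': [10, 20, 5, 7],
    '6/12/21': [12, 25, 6, 9],
})
toy_totals = toy.groupby('Country/Region').sum()

# Indexing a row label returns that country's values for every date
albania = toy_totals.loc['Albania']
print(albania)

# sum() adds up each column, giving one total per date
world = toy_totals.sum()
print(world)
```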

Your turn

Try to work together to answer the following question.
Help each other out and make it a team effort!

How many people had died from Covid-19 in Venezuela by March 10, 2021?
(according to the available data)

Solution

# First, you need to load the proper data file
dead = pd.read_csv('/project/def-sponsor00/dhsi/covid_dead.csv')
dead

# Then, you need to select the data for Venezuela
# You can index with either of these two methods
venez = dead[dead['Country/Region'] == 'Venezuela']
venez = dead[dead.loc[:, 'Country/Region'] == 'Venezuela']
venez

# If you knew the index for Venezuela,
# you could also index this way
dead.loc[270]

# Finally, you need to select the proper date
answer = venez['3/10/21']
answer = venez.loc[:, '3/10/21']
answer

The answer is: 1399.
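
Note that answer is a Series with a single element rather than a plain number. If you want the number itself, one option is to take the first element with .iloc[0]. Here is a minimal sketch on a made-up data frame (the numbers and the names toy_dead, selected, and value are hypothetical):

```python
import pandas as pd

# Made-up numbers for illustration only (NOT the real data)
toy_dead = pd.DataFrame({
    'Country/Region': ['Albania', 'Venezuela'],
    '3/10/21': [3, 4],
})

# Select the row and the column in a single step with .loc
selected = toy_dead.loc[toy_dead['Country/Region'] == 'Venezuela', '3/10/21']
print(selected)   # A Series with a single element

# Extract the number itself from the Series
value = selected.iloc[0]
print(value)
```

For a country split over several rows, you would add the rows together first, for instance by grouping as we did with the confirmed cases.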