Matplotlib data and code

By Martin McBride, 2022-06-17

Tags: data code numeric python matplotlib
Categories: matplotlib numpy

In this article, we will discuss the data we will be using in this series, and the method we will use to read the data into a Python list.

All the data and code are available on GitHub.

Data sets

Matplotlib can be used with many types of data, but for these articles, we will be using UK temperature and rainfall data for the years 2009 and 2010. The data sets are derived from public sector information licensed under the Open Government Licence v3.0. The data has been organised to make it easy to use in the examples.

The data is stored in comma-separated value (CSV) format. This is a text format containing lines of numerical values separated by commas.

The years 2009 and 2010 are not leap years. This is a deliberate choice to simplify the data handling, so we can concentrate on the graph plotting code.

Temperature data

The temperature data is based on the maximum temperature each day, in degrees Celsius. The following files are used:

File	Type
2009-temp-daily.csv	Daily temperatures (1)
2009-temp-monthly.csv	Monthly average temperatures (2)
2009-temp-monthly-list.csv	Daily temperatures, one line per month (3)
2010-temp-daily.csv	Daily temperatures (1)
2010-temp-monthly.csv	Monthly average temperatures (2)

Type (1) files contain 365 entries, one per line, giving the maximum temperature for each day of the year, like this:

1.2
3.8
2.4
etc...

Type (2) files contain 12 entries, indicating the average of the maximum daily temperature of each month of the year, like this:

5.980645161290322
6.732142857142856
11.022580645161291
etc...
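To make the relationship between type (1) and type (2) data concrete, here is a minimal sketch of how monthly averages could be derived from a list of 365 daily values. The names DAYS_IN_MONTH and monthly_averages are made up for this illustration, and the daily values are invented rather than taken from the real files. This is not necessarily how the supplied files were produced.

```python
# Days in each month of a non-leap year such as 2009 or 2010.
DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]


def monthly_averages(daily):
    """Split 365 daily values into months and return the 12 monthly means."""
    if len(daily) != sum(DAYS_IN_MONTH):
        raise ValueError("expected 365 daily values")
    averages = []
    start = 0
    for days in DAYS_IN_MONTH:
        month = daily[start:start + days]   # the values for one month
        averages.append(sum(month) / days)
        start += days
    return averages


# Invented data: every day in month n has the value n.
daily = []
for n, days in enumerate(DAYS_IN_MONTH, start=1):
    daily.extend([float(n)] * days)

averages = monthly_averages(daily)
print(len(averages))   # 12
print(averages[:3])    # [1.0, 2.0, 3.0]
```

Because neither year is a leap year, the same fixed list of month lengths works for both 2009 and 2010.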

Type (3) files contain the same 365 daily maximum temperatures as type (1), but all the data for a given month is on a single line. So there are 12 lines, each with a month's worth of daily data, like this:

1.2, 3.8, 2.4, 1.7, ...
2.7, 0.6, 2.6, ...
10.1, 10.3, 9.3, ...
etc...

The first line has 31 entries (for January), the second line has 28 entries (for February), and so on. This data is only provided for 2009 because we only use it for box plots.
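As a rough sketch, a type (3) file could be read into a list of 12 lists using the same csv reader that is covered in the "Reading the data in Python" section. The example below uses an in-memory string as a stand-in for the real file, shortened to three lines of three values, so it runs without any data files. The skipinitialspace option is an assumption based on the spaces after the commas in the sample above.

```python
import csv
import io

# Stand-in for 2009-temp-monthly-list.csv, shortened to three values per line.
sample = io.StringIO("1.2, 3.8, 2.4\n2.7, 0.6, 2.6\n10.1, 10.3, 9.3\n")

# QUOTE_NONNUMERIC converts unquoted fields to floats.
# skipinitialspace ignores the space that follows each comma.
csv_reader = csv.reader(sample, quoting=csv.QUOTE_NONNUMERIC,
                        skipinitialspace=True)

# Each row is already a list of numbers, so we keep the rows as they are.
monthly_lists = list(csv_reader)

print(len(monthly_lists))   # 3
print(monthly_lists[0])     # [1.2, 3.8, 2.4]
```

With the real file, open("2009-temp-monthly-list.csv") would take the place of the StringIO object, and the result would have 12 sublists of differing lengths.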

Rainfall data

The rainfall data is based on the total rainfall each day, in millimetres. The following files are used:

File	Type
2009-rain-daily.csv	Daily rainfall (1)
2009-rain-monthly.csv	Monthly average rainfall (2)
2010-rain-daily.csv	Daily rainfall (1)
2010-rain-monthly.csv	Monthly average rainfall (2)

Type (1) files contain 365 entries, indicating the total rainfall of each day of the year, like this:

1.2
3.8
2.4
etc...

Type (2) files contain 12 entries, indicating the total rainfall of each month of the year, like this:

5.980645161290322
6.732142857142856
11.022580645161291
etc...

Reading the data in Python

Here is the code to read the daily temperature data from a CSV file:

import csv

with open("2009-temp-daily.csv") as csv_file:
    csv_reader = csv.reader(csv_file, quoting=csv.QUOTE_NONNUMERIC)
    temperature = [x[0] for x in csv_reader]

First, we must import the csv module that we will use to parse the CSV file. The csv module is part of the Python standard library, so we don't need to install anything extra to use it.

Next, we open the 2009 daily temperature CSV file using a with statement, which means that the file will be closed automatically when we have finished with it. We name the opened file object csv_file.

We then create a CSV reader object based on the CSV file. The reader function takes an optional parameter called quoting. We set this parameter to QUOTE_NONNUMERIC. This may be slightly counter-intuitive, but it tells the CSV reader to convert every value to a floating-point number unless it is in quote marks.

Since our data file contains unquoted data, the reader will convert all the values to numbers. This means we will get a list of numbers rather than a list of strings, which is exactly what we want.
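The effect of the quoting parameter can be seen by reading the same data twice, once with the default setting and once with QUOTE_NONNUMERIC. This small sketch uses an in-memory string in place of a real file, purely so that it can be run on its own.

```python
import csv
import io

data = "1.2\n3.8\n2.4\n"

# Default quoting: every field comes back as a string.
as_strings = list(csv.reader(io.StringIO(data)))

# QUOTE_NONNUMERIC: unquoted fields are converted to floats.
as_floats = list(csv.reader(io.StringIO(data), quoting=csv.QUOTE_NONNUMERIC))

print(as_strings)   # [['1.2'], ['3.8'], ['2.4']]
print(as_floats)    # [[1.2], [3.8], [2.4]]
```

Note that in both cases each line of the file becomes its own small list.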

The reader produces a list for each line in the CSV file. In other words, reading the whole file gives us a list of lists. So if our file contained the following data:

1.2
3.8
2.4
1.7

The reader would return a list of lists like this:

[ [1.2], [3.8], [2.4], [1.7] ]

What we need is a normal list like this:

[ 1.2, 3.8, 2.4, 1.7 ]

We create the required list using a list comprehension:

temperature = [x[0] for x in csv_reader]

For each sublist in the original data, the list comprehension reads the first element and adds it to the output list. If you are not familiar with list comprehensions, the next section gives a quick description.

List comprehensions

The code above will read data from a CSV file into a Python list. The way the code works isn't particularly important, because we are mainly here to learn about plotting graphs with Matplotlib.

But if you haven't used list comprehensions before, and would like to understand how they work, here is a short description.

A list comprehension creates a new list based on the contents of an existing sequence. The original sequence can be a list, tuple, string, range object, or any other iterable object that provides a sequence of values.

So for example, suppose we had a list a:

a = [1, 2, 3, 4]

And we wished to create a new list b where each element is double the equivalent element in a:

b = [2, 4, 6, 8]

We could do this with a loop:

b = []
for x in a:
    b.append(x*2)

A list comprehension is just a shorter way to do the same thing:

b = [2*x for x in a]

It takes the form:

[expression for x in sequence]

And creates a new list by evaluating expression for every value of x in sequence.
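A few more short examples may help to show that the same form works with different kinds of sequence. The variable names here are chosen only for illustration.

```python
# A range object as the source sequence.
squares = [n * n for n in range(5)]
print(squares)   # [0, 1, 4, 9, 16]

# A string provides one character at a time.
upper = [ch.upper() for ch in "abc"]
print(upper)     # ['A', 'B', 'C']

# A tuple of lists, taking the first element of each list.
firsts = [item[0] for item in ([1.2, 9], [3.8, 7])]
print(firsts)    # [1.2, 3.8]
```

The last example has the same shape as the comprehension used to read the CSV data.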

In our specific case:

[x[0] for x in csv_reader]

csv_reader provides a sequence of lists, one for each line of the file:

[ [1.2], [3.8], [2.4], [1.7] ]

The values of x will be:

[1.2]
[3.8]
[2.4]
[1.7]

The values of x[0] will be:

1.2
3.8
2.4
1.7

Which will create a final list of:

[ 1.2, 3.8, 2.4, 1.7 ]