Often when you're working with data, you have it stored in files. These may be text files, images, audio, code! Just about anything you can imagine!
As a case in point, suppose that a local dentist has a small database of clients stored in a text file. For simplicity, suppose they only keep each client's first and last name along with their age and phone number. A typical entry looks something like:
Manny Hernandez 31 217-321-1234
Suppose that the dentist has quite a few clients and wants to get a sense of the age distribution of those clients in order to decide whether it's worth making special offers to, say, youth or elderly clients.
How could we start to do something like this?
First, we'd need to read the input and represent it in a useful way. Suppose that we've called the data file 'clients.txt'. Let's go ahead and open it and read the data into Python.
f = open('clients.txt')
data = f.read()
f.close()
data
This is an ok start, but notice that all the entries are read as one big string. There are a couple options here. One way to handle this is by splitting the string as follows:
data.split('\n')
A different way which bypasses this intermediate step is to use readlines instead of read.
lines = open('clients.txt').readlines() # now using short version to quickly get contents out!
lines
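One detail worth noticing is that the two approaches don't give exactly the same list. Here is a minimal sketch of the difference. It writes a small stand-in file to a scratch directory so it can run on its own; the file name demo_clients.txt is made up for illustration, and the two entries are the examples used in this notebook.

```python
import os
import tempfile

# Build a tiny stand-in file (hypothetical name, sample entries from this notebook).
sample = ('Manny Hernandez 31 217-321-1234\n'
          'Stella Do 39 217-325-1432\n')
path = os.path.join(tempfile.mkdtemp(), 'demo_clients.txt')
with open(path, 'w') as f:
    f.write(sample)

# 'with' closes the file for us, even if something goes wrong while reading.
with open(path) as f:
    data = f.read()
with open(path) as f:
    lines = f.readlines()

by_split = data.split('\n')         # ends with '' because the file ends with '\n'
by_splitlines = data.splitlines()   # no trailing '' and no newline characters
stripped = [line.rstrip('\n') for line in lines]  # readlines keeps each '\n'

print(by_split)
print(lines)
print(stripped)
```

So with split('\n') we may need to drop an empty last entry, and with readlines we may want to strip the trailing newline from each line.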
In either case, we now have our data represented as a list of individual entries ready for further processing. To break each entry into its fields, we'll use the string split method. First, let's start by looking at a simple case.
s = "this is a string"
s.split()
s = "this is a multiline string\nthis is the second line"
s.split()
Split also allows you to specify which character to break the string at. A moment ago, we used the following to break at newline characters.
s.split('\n')
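To make the difference concrete, here is a small sketch of what these calls return. The variable names words, rows, loose and strict are just for illustration.

```python
s = "this is a multiline string\nthis is the second line"

words = s.split()      # no argument: split on any run of whitespace, newlines included
rows = s.split('\n')   # explicit separator: split only at newline characters

print(words)  # ['this', 'is', 'a', 'multiline', 'string', 'this', 'is', 'the', 'second', 'line']
print(rows)   # ['this is a multiline string', 'this is the second line']

# With an explicit separator, repeated separators produce empty strings.
loose = 'a  b'.split()      # ['a', 'b']
strict = 'a  b'.split(' ')  # ['a', '', 'b']
print(loose, strict)
```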
Read the lines from 'clients.txt', then convert each line into a tuple of ('first', 'last', age, 'phone number') fields. In particular, make sure that the age field is converted to an integer type, not a string!
For example, the line
'Stella Do 39 217-325-1432'
should become
('Stella', 'Do', 39, '217-325-1432')
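One possible way to approach this exercise is sketched below. The helper name parse_client is made up for illustration, and the sketch assumes every non-blank line has exactly four whitespace-separated fields (so no middle names and no spaces inside a phone number).

```python
def parse_client(line):
    """Turn 'First Last Age Phone' into ('First', 'Last', age_as_int, 'Phone')."""
    first, last, age, phone = line.split()  # split() also discards the trailing newline
    return (first, last, int(age), phone)

# Stand-in for the list returned by readlines(), using entries from this notebook.
lines = [
    'Manny Hernandez 31 217-321-1234\n',
    'Stella Do 39 217-325-1432\n',
    '\n',  # a blank line, which we skip below
]

clients = [parse_client(line) for line in lines if line.strip()]
print(clients)
```

If a line has more or fewer than four fields, the unpacking raises a ValueError, which is a reasonable way to find out that the file isn't in the format we assumed.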
Suppose we want to sort our clients by age. How would we do this?
By default Python's list sort works on tuples entry-by-entry. Hence, running a sort would sort by first name -> last name -> age.
One possible solution hinted at much earlier is to reorder the tuples and then sort. That's fine - but there's a better way!
Many commonly used functions support optional arguments which make them much more flexible. Let's take a look at how to make sorted sort by age.
entries = [
(1, 5),
(2, 4),
(6, 9),
(3, 7),
]
sorted(entries)
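def first(entry):
    # key function: choose the first field of the tuple
    return entry[0]
def second(entry):
    # key function: choose the second field of the tuple
    return entry[1]
print(sorted(entries, key=first))
print(sorted(entries, key=second))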
We can do this even more compactly and not even give a name to our "chooser" function by using Python's lambda notation.
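# Unpacking a tuple in the lambda's parameter list, as in lambda (f, s): f,
# only works in Python 2, so we index into the entry instead.
print(sorted(entries, key=lambda entry: entry[0]))
print(sorted(entries, key=lambda entry: entry[1]))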
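Applied to the client tuples, the same idea might look like the sketch below. It assumes the ('first', 'last', age, 'phone number') layout from the exercise above, so the age sits at index 2; the three clients are the sample entries used in this notebook.

```python
clients = [
    ('Manny', 'Hernandez', 31, '217-321-1234'),
    ('Stella', 'Do', 39, '217-325-1432'),
    ('John', 'Putnam', 63, '217-321-1542'),
]

by_name = sorted(clients)                        # default: first name, then last name, ...
by_age = sorted(clients, key=lambda c: c[2])     # key picks out the age field
oldest_first = sorted(clients, key=lambda c: c[2], reverse=True)

print([c[0] for c in by_name])       # ['John', 'Manny', 'Stella']
print([c[0] for c in by_age])        # ['Manny', 'Stella', 'John']
print([c[0] for c in oldest_first])  # ['John', 'Stella', 'Manny']
```

Note that sorted returns a new list and leaves clients untouched.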
Now that we've done a little processing, let's save our data. It turns out that writing data to a file isn't much harder than reading data. For example:
f = open('somewords.txt', 'w')
f.write('hello\n')
f.write('meow\n')
f.write('surfer\n')
f.close()
Now, we just need to do this with our sorted table and we'll be done. The only problem is, we need to format our data into a string to write it. How would we do this? One way would be:
entry = ('John', 'Putnam', 63, '217-321-1542')
print(str(entry))
But that's not quite right... We want to format this so that it's flat. To do this, we'll use Python's format function.
print('{0} {1} {2} {3}\n'.format(entry[0], entry[1], entry[2], entry[3]))
print('{} {} {} {}\n'.format(entry[0], entry[1], entry[2], entry[3]))
Even more compactly, we can ask Python to "fill in" the arguments from a tuple using the star-prefix notation.
print('{} {} {} {}\n'.format(*entry))
Although you can always work around this, the star notation is good to be aware of. It works anytime you want to expand a tuple into the arguments of a function!
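Here is a small sketch of the star notation with an ordinary function rather than format. The function summarize is a made-up helper, just for illustration.

```python
def summarize(first, last, age, phone):
    # Hypothetical helper: takes the four client fields as separate arguments.
    return '{} {} is {} years old'.format(first, last, age)

entry = ('John', 'Putnam', 63, '217-321-1542')

long_way = summarize(entry[0], entry[1], entry[2], entry[3])
short_way = summarize(*entry)   # the tuple is expanded into four arguments

print(short_way)  # John Putnam is 63 years old
```

The tuple has to have exactly as many items as the function expects, otherwise Python raises a TypeError.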
We won't go in-depth with string formatting here, but the format function has a rich array of ways it can format strings. Take a look at the range of examples if you're interested!
Complete our database processing by writing the table sorted by age to 'people.age.txt'.
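One way this exercise could go is sketched below. To keep the sketch self-contained, it uses the three sample entries from this notebook rather than the real clients.txt, and it writes 'people.age.txt' into a scratch directory (an assumption made here so that running it doesn't overwrite anything).

```python
import os
import tempfile

clients = [
    ('Stella', 'Do', 39, '217-325-1432'),
    ('John', 'Putnam', 63, '217-321-1542'),
    ('Manny', 'Hernandez', 31, '217-321-1234'),
]

# Scratch location for the output file; in the notebook you would likely
# just use 'people.age.txt' directly.
out_path = os.path.join(tempfile.mkdtemp(), 'people.age.txt')

with open(out_path, 'w') as f:
    for entry in sorted(clients, key=lambda c: c[2]):   # sort by the age field
        f.write('{} {} {} {}\n'.format(*entry))         # one flat line per client

# Read the file back to check what was written.
with open(out_path) as f:
    written = f.read()
print(written)
```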
Now, we've done some good work in terms of structuring the client data. But, what if we wanted to do some more analysis on it?
For example, we may want a histogram of the clients' ages or a description of the mean and standard deviation of the clients' ages.
Further, we don't just want a list of numbers but some kind of a visual representation of the dataset.
We'll start to address these problems using some basic functionality from two very powerful Python modules: SciPy and Matplotlib.
Let's start out with visualizing our data, since pictures are fun to make.
I'll go ahead and process the client data again, but this time only extracting their ages.
def parse(line):
    # each line looks like 'First Last Age Phone'; we only keep the age
    first, last, age, phone = line.split()
    return int(age)
ages = [parse(line) for line in open('clients.txt')]
Now, let's see how to create a basic histogram - nothing fancy.
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure()
plt.hist(ages)
plt.show()
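To get a feel for what a histogram is counting, here is a rough text-only sketch that groups ages by decade. This is not what plt.hist does internally: by default hist uses 10 equal-width bins between the smallest and largest value, whereas the decade-wide bins here are an assumption chosen for readability. The ages are those of the three sample entries from this notebook.

```python
from collections import Counter

ages = [31, 39, 63]

# Map each age to the start of its decade (31 -> 30, 63 -> 60) and count.
decades = Counter((age // 10) * 10 for age in ages)

for start in sorted(decades):
    print('{}s: {}'.format(start, '#' * decades[start]))
# 30s: ##
# 60s: #
```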
Now, since we created this plot, we understand what we're looking at, but to someone who comes across this randomly, it's not very useful. Let's see if we can add some context to it.
plt.figure()
plt.title('Age Distribution of Clients')
plt.xlabel('Age')
plt.ylabel('Count')
plt.hist(ages)
plt.show()
If we're happy with the plot, we can save it for later use using the savefig command instead of show.
plt.figure()
plt.title('Age Distribution of Clients')
plt.xlabel('Age')
plt.ylabel('Count')
plt.hist(ages)
plt.savefig('age-histogram.png')
Now we've gotten some visual feedback of what our distribution looks like. It looks like it has a mean around 35 and a large variance. To quantify this, let's use SciPy's describe function.
from scipy.stats import describe
info = describe(ages)
info
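If you'd like to double-check the mean and variance without SciPy, the standard library's statistics module offers a quick cross-check. This is a sketch using the ages of the three sample entries from this notebook, not the real client data. Note that describe reports the sample variance by default (dividing by n - 1), which is also what statistics.variance computes.

```python
import statistics

ages = [31, 39, 63]

mu = statistics.mean(ages)        # arithmetic mean
var = statistics.variance(ages)   # sample variance (divides by n - 1)
sigma = statistics.stdev(ages)    # sample standard deviation, sqrt of the above

print(round(mu, 2), round(var, 2), round(sigma, 2))
```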
Let's try to add this newfound information to our plot!
mu = round(info.mean, 2)
sigma = round(info.variance**.5, 2)
plt.figure()
plt.title('Age Distribution of Clients ($\\mu={}$, $\\sigma={}$)'.format(mu, sigma))
plt.xlabel('Age')
plt.ylabel('Count')
plt.hist(ages)
plt.savefig('age-histogram.png')
# The \\ is needed in the title string as \ performs special "escape" commands. For example,
# we've seen that \n produces a newline. \\ just leaves us with a good-ole \
Matplotlib is quite flexible in how a plot can be adjusted and tweaked! We could probably spend days discussing these things, but for now, we'll just cover the basics.
A couple of questions came up during the day. The first was about how to format strings to be a certain length. The second asked for clarification of the lambda statement.
In order to format strings to be a certain length or have a certain justification, we can add to the replacement marker we saw in format strings:
'{:<12} {:>14} {:^16}'.format('left', 'right', 'center')
The above left-aligns, right-aligns, and centers the three strings in fields that are 12, 14, and 16 characters wide respectively, padding with spaces. You can find out more about these things in the example gallery.
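Here is a small sketch that makes the padding visible by putting a | between the fields, followed by the same idea applied to a client entry. The column widths in the second part are arbitrary choices for illustration.

```python
row = '{:<12}|{:>14}|{:^16}'.format('left', 'right', 'center')
print(row)

# Each field has exactly the requested width.
widths = [len(part) for part in row.split('|')]
print(widths)  # [12, 14, 16]

# Lining up client fields in columns (widths chosen arbitrarily).
entry = ('John', 'Putnam', 63, '217-321-1542')
line = '{:<10}{:<12}{:>3}  {}'.format(*entry)
print(line)
```

If a value is longer than the requested width, it is not cut off; the field simply grows to fit it.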
The original intent was to just use lambda as a short utility for sorting, but due to some questions, I thought I'd expand on it here. The idea behind lambda is to provide a way to construct "unnamed" functions. For example, the following are equivalent:
def square(x):
    return x**2
square = lambda x: x**2
What's crucial is that lambda doesn't need a name to work. Perhaps the key example to see this is:
(lambda x: x**2)(4)
This is how we used it as a key to the sorted function. One last example which demonstrates this is:
def bind(f, x):
    return lambda y: f(x, y)
def f(x, y):
    return x + 2*y
g = bind(f, 2)
g(3)
The function bind takes a function f of two arguments and a value x, and returns a new one-argument function with the first slot bound to x. In this example, $g(y) = f(2, y) = 2 + 2y$, so g(3) evaluates to 8.
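As an aside, the standard library already provides this kind of binding through functools.partial. The sketch below compares it with the bind function from above; the name h is just for illustration.

```python
from functools import partial

def bind(f, x):
    # Return a new function of y with the first argument of f fixed to x.
    return lambda y: f(x, y)

def f(x, y):
    return x + 2*y

g = bind(f, 2)     # our hand-written version
h = partial(f, 2)  # standard-library version: also fixes the first argument to 2

print(g(3), h(3))  # 8 8
```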