Processing Files and Strings

Often when you're working with data, it's stored in files. These may be text files, images, audio or code: just about anything you can imagine!

As a case in point, suppose that a local dentist has a small database of clients stored in a text file. For simplicity, suppose they only keep each client's first and last name along with their age and phone number. A typical entry looks something like:

Manny Hernandez 31 217-321-1234

Suppose that the dentist has quite a few clients and wants a sense of their age distribution in order to decide whether it's worth making special offers to, say, youth or elderly clients.

How could we start to do something like this?

First, we'd need to read the input and represent it in a useful way. Suppose that we've called the data file 'clients.txt'. Let's go ahead and open it and read the data into Python.

In [89]:
f = open('clients.txt')
data = f.read()
f.close()
data
Out[89]:
'Michelle       Knauss         21  217-326-2757\nStella         Do             39  217-202-3672\nLoree          Whetstone      56  217-364-3180\nAlecia         Brister        58  217-369-2187\nPaulette       Royce          25  217-277-7354\nAlton          Oneal          20  217-360-7421\nJae            Rosecrans      11  217-229-7251\nStaci          Crabill        55  217-395-1576\nAntonietta     Olguin         43  217-315-7118\nRicky          Tinch          21  217-204-5472\nTom            Ogletree       25  217-354-9250\nDeane          Beegle         60  217-357-5927\nFiliberto      Massey         38  217-200-9626\nSteve          Hadsell        52  217-221-8413\nShawnee        Bibeau         31  217-203-8649\nTianna         Siddall        52  217-268-5767\nEmmanuel       Mefford        10  217-276-1283\nYesenia        Moon           24  217-341-2026\nBryanna        Albano         60  217-393-9458\nRikki          Helle          30  217-262-7414\nYuriko         Norby          20  217-372-8624\nJone           Yelle          26  217-275-5638\nYukiko         Elizondo       13  217-275-1878\nCarman         Shoup          36  217-296-2503\nMaricruz       Furlong        26  217-392-9723\nSusan          Look           27  217-392-6052\nMaura          Lemond         43  217-202-4303\nCarlos         Beauregard     56  217-333-9292\nEric           Digregorio     21  217-238-8827\nShanelle       Pariseau       32  217-256-2110\nBarton         Reams          22  217-227-3143\nMegan          Deck           53  217-318-4692\nMaurice        Minder         49  217-355-5013\nMariann        Hetzel         46  217-272-4041\nLouanne        Ettinger       24  217-245-3958\nFrida          Ceasar         14  217-256-6017\nCherilyn       Forkey         24  217-368-9606\nErwin          Boring         47  217-209-1653\nDanuta         Cahall         21  217-387-9678\nVinita         Karls          40  217-201-6987\nLinn           Overholt       38  217-374-1029\nDaniella       Trumble        
50  217-325-4734\nZonia          Womack         48  217-312-9764\nMina           Lovitt         53  217-246-5270\nRaymond        Weisbrod       46  217-338-7509\nTai            Pflum          20  217-231-4415\nChau           Iwamoto        29  217-333-6766\nCherish        Holoman        49  217-340-7786\nAngelena       Wolverton      13  217-207-5937\nPaulita        Fries          33  217-244-3229\nNatasha        Helman         48  217-296-1375\nKimber         Detty          60  217-268-7289\nTiny           Schubert       43  217-354-4352\nAgripina       Clendenin      43  217-230-4714\nLogan          Portman        35  217-233-4231\nCliff          Dively         31  217-361-9700\nClorinda       Landey         20  217-390-9604\nDenae          Islas          38  217-294-9915\nCaitlyn        Shive          35  217-394-6834\nRolf           Monn           11  217-255-1562\nTasha          Goodrum        21  217-232-4939\nMozella        Demo           52  217-254-1651\nLorina         Prather        11  217-367-6849\nTheodore       Bunn           19  217-227-4766\nVirgie         Deangelis      48  217-260-4432\nCarlena        Vela           19  217-230-6289\nKraig          Edgin          48  217-323-8871\nYulanda        Ashlock        22  217-390-4545\nHarriette      Tomlin         39  217-284-6363\nEusebio        Grate          33  217-270-5382\nHannah         Goodell        44  217-233-6955\nJennefer       Loso           42  217-206-8116\nMarlene        Poland         24  217-372-5587\nSonny          Reimer         51  217-350-4634\nCarlita        Hedgecock      47  217-370-7816\nNell           Stallcup       16  217-310-6646\nSang           Kujawski       32  217-211-9846\nKatina         Rennie         54  217-296-2983\nJessie         Yoho           46  217-235-5017\nDina           Brakefield     42  217-292-5933\nDiana          Granger        15  217-313-1718\nChong          Robinett       41  217-228-8321\nHouston        Shibata        46  217-358-6334\nNakita        
 Heeren         19  217-341-4032\nRonny          Fackler        60  217-379-7852\nLeontine       Zoeller        45  217-202-5340\nHassie         Lush           57  217-312-7202\nSigrid         Grimmer        49  217-227-4144\nRosaria        Hayslip        13  217-271-5984\nLawana         Dion           15  217-382-8019\nRonna          Meneely        17  217-290-2456\nRosario        Iannotti       48  217-329-3522\nTeena          Koen           47  217-387-1715\nSantos         Morais         11  217-376-8328\nClement        Mangione       10  217-217-4424\nLavinia        Vidrine        27  217-301-8274\nBridgette      Gerke          52  217-314-9239\nOralee         Longstreet     17  217-397-5060\nProvidencia    Stonecipher    30  217-284-5594\nGrisel         Hanway         21  217-247-8875'

This is an OK start, but notice that all the entries are read as one big string. There are a couple of options here. One way to handle this is by splitting the string as follows:

In [70]:
data.split('\n')
Out[70]:
['Michelle       Knauss         21  217-326-2757',
 'Stella         Do             39  217-202-3672',
 'Loree          Whetstone      56  217-364-3180',
 'Alecia         Brister        58  217-369-2187',
 'Paulette       Royce          25  217-277-7354',
 'Alton          Oneal          20  217-360-7421',
 'Jae            Rosecrans      11  217-229-7251',
 'Staci          Crabill        55  217-395-1576',
 'Antonietta     Olguin         43  217-315-7118',
 'Ricky          Tinch          21  217-204-5472',
 'Tom            Ogletree       25  217-354-9250',
 'Deane          Beegle         60  217-357-5927',
 'Filiberto      Massey         38  217-200-9626',
 'Steve          Hadsell        52  217-221-8413',
 'Shawnee        Bibeau         31  217-203-8649',
 'Tianna         Siddall        52  217-268-5767',
 'Emmanuel       Mefford        10  217-276-1283',
 'Yesenia        Moon           24  217-341-2026',
 'Bryanna        Albano         60  217-393-9458',
 'Rikki          Helle          30  217-262-7414',
 'Yuriko         Norby          20  217-372-8624',
 'Jone           Yelle          26  217-275-5638',
 'Yukiko         Elizondo       13  217-275-1878',
 'Carman         Shoup          36  217-296-2503',
 'Maricruz       Furlong        26  217-392-9723',
 'Susan          Look           27  217-392-6052',
 'Maura          Lemond         43  217-202-4303',
 'Carlos         Beauregard     56  217-333-9292',
 'Eric           Digregorio     21  217-238-8827',
 'Shanelle       Pariseau       32  217-256-2110',
 'Barton         Reams          22  217-227-3143',
 'Megan          Deck           53  217-318-4692',
 'Maurice        Minder         49  217-355-5013',
 'Mariann        Hetzel         46  217-272-4041',
 'Louanne        Ettinger       24  217-245-3958',
 'Frida          Ceasar         14  217-256-6017',
 'Cherilyn       Forkey         24  217-368-9606',
 'Erwin          Boring         47  217-209-1653',
 'Danuta         Cahall         21  217-387-9678',
 'Vinita         Karls          40  217-201-6987',
 'Linn           Overholt       38  217-374-1029',
 'Daniella       Trumble        50  217-325-4734',
 'Zonia          Womack         48  217-312-9764',
 'Mina           Lovitt         53  217-246-5270',
 'Raymond        Weisbrod       46  217-338-7509',
 'Tai            Pflum          20  217-231-4415',
 'Chau           Iwamoto        29  217-333-6766',
 'Cherish        Holoman        49  217-340-7786',
 'Angelena       Wolverton      13  217-207-5937',
 'Paulita        Fries          33  217-244-3229',
 'Natasha        Helman         48  217-296-1375',
 'Kimber         Detty          60  217-268-7289',
 'Tiny           Schubert       43  217-354-4352',
 'Agripina       Clendenin      43  217-230-4714',
 'Logan          Portman        35  217-233-4231',
 'Cliff          Dively         31  217-361-9700',
 'Clorinda       Landey         20  217-390-9604',
 'Denae          Islas          38  217-294-9915',
 'Caitlyn        Shive          35  217-394-6834',
 'Rolf           Monn           11  217-255-1562',
 'Tasha          Goodrum        21  217-232-4939',
 'Mozella        Demo           52  217-254-1651',
 'Lorina         Prather        11  217-367-6849',
 'Theodore       Bunn           19  217-227-4766',
 'Virgie         Deangelis      48  217-260-4432',
 'Carlena        Vela           19  217-230-6289',
 'Kraig          Edgin          48  217-323-8871',
 'Yulanda        Ashlock        22  217-390-4545',
 'Harriette      Tomlin         39  217-284-6363',
 'Eusebio        Grate          33  217-270-5382',
 'Hannah         Goodell        44  217-233-6955',
 'Jennefer       Loso           42  217-206-8116',
 'Marlene        Poland         24  217-372-5587',
 'Sonny          Reimer         51  217-350-4634',
 'Carlita        Hedgecock      47  217-370-7816',
 'Nell           Stallcup       16  217-310-6646',
 'Sang           Kujawski       32  217-211-9846',
 'Katina         Rennie         54  217-296-2983',
 'Jessie         Yoho           46  217-235-5017',
 'Dina           Brakefield     42  217-292-5933',
 'Diana          Granger        15  217-313-1718',
 'Chong          Robinett       41  217-228-8321',
 'Houston        Shibata        46  217-358-6334',
 'Nakita         Heeren         19  217-341-4032',
 'Ronny          Fackler        60  217-379-7852',
 'Leontine       Zoeller        45  217-202-5340',
 'Hassie         Lush           57  217-312-7202',
 'Sigrid         Grimmer        49  217-227-4144',
 'Rosaria        Hayslip        13  217-271-5984',
 'Lawana         Dion           15  217-382-8019',
 'Ronna          Meneely        17  217-290-2456',
 'Rosario        Iannotti       48  217-329-3522',
 'Teena          Koen           47  217-387-1715',
 'Santos         Morais         11  217-376-8328',
 'Clement        Mangione       10  217-217-4424',
 'Lavinia        Vidrine        27  217-301-8274',
 'Bridgette      Gerke          52  217-314-9239',
 'Oralee         Longstreet     17  217-397-5060',
 'Providencia    Stonecipher    30  217-284-5594',
 'Grisel         Hanway         21  217-247-8875']

A different way which bypasses this intermediate step is to use readlines instead of read.

In [71]:
lines = open('clients.txt').readlines()  # short version: open and read in one step (note the file is never explicitly closed)
lines
Out[71]:
['Michelle       Knauss         21  217-326-2757\n',
 'Stella         Do             39  217-202-3672\n',
 'Loree          Whetstone      56  217-364-3180\n',
 'Alecia         Brister        58  217-369-2187\n',
 'Paulette       Royce          25  217-277-7354\n',
 'Alton          Oneal          20  217-360-7421\n',
 'Jae            Rosecrans      11  217-229-7251\n',
 'Staci          Crabill        55  217-395-1576\n',
 'Antonietta     Olguin         43  217-315-7118\n',
 'Ricky          Tinch          21  217-204-5472\n',
 'Tom            Ogletree       25  217-354-9250\n',
 'Deane          Beegle         60  217-357-5927\n',
 'Filiberto      Massey         38  217-200-9626\n',
 'Steve          Hadsell        52  217-221-8413\n',
 'Shawnee        Bibeau         31  217-203-8649\n',
 'Tianna         Siddall        52  217-268-5767\n',
 'Emmanuel       Mefford        10  217-276-1283\n',
 'Yesenia        Moon           24  217-341-2026\n',
 'Bryanna        Albano         60  217-393-9458\n',
 'Rikki          Helle          30  217-262-7414\n',
 'Yuriko         Norby          20  217-372-8624\n',
 'Jone           Yelle          26  217-275-5638\n',
 'Yukiko         Elizondo       13  217-275-1878\n',
 'Carman         Shoup          36  217-296-2503\n',
 'Maricruz       Furlong        26  217-392-9723\n',
 'Susan          Look           27  217-392-6052\n',
 'Maura          Lemond         43  217-202-4303\n',
 'Carlos         Beauregard     56  217-333-9292\n',
 'Eric           Digregorio     21  217-238-8827\n',
 'Shanelle       Pariseau       32  217-256-2110\n',
 'Barton         Reams          22  217-227-3143\n',
 'Megan          Deck           53  217-318-4692\n',
 'Maurice        Minder         49  217-355-5013\n',
 'Mariann        Hetzel         46  217-272-4041\n',
 'Louanne        Ettinger       24  217-245-3958\n',
 'Frida          Ceasar         14  217-256-6017\n',
 'Cherilyn       Forkey         24  217-368-9606\n',
 'Erwin          Boring         47  217-209-1653\n',
 'Danuta         Cahall         21  217-387-9678\n',
 'Vinita         Karls          40  217-201-6987\n',
 'Linn           Overholt       38  217-374-1029\n',
 'Daniella       Trumble        50  217-325-4734\n',
 'Zonia          Womack         48  217-312-9764\n',
 'Mina           Lovitt         53  217-246-5270\n',
 'Raymond        Weisbrod       46  217-338-7509\n',
 'Tai            Pflum          20  217-231-4415\n',
 'Chau           Iwamoto        29  217-333-6766\n',
 'Cherish        Holoman        49  217-340-7786\n',
 'Angelena       Wolverton      13  217-207-5937\n',
 'Paulita        Fries          33  217-244-3229\n',
 'Natasha        Helman         48  217-296-1375\n',
 'Kimber         Detty          60  217-268-7289\n',
 'Tiny           Schubert       43  217-354-4352\n',
 'Agripina       Clendenin      43  217-230-4714\n',
 'Logan          Portman        35  217-233-4231\n',
 'Cliff          Dively         31  217-361-9700\n',
 'Clorinda       Landey         20  217-390-9604\n',
 'Denae          Islas          38  217-294-9915\n',
 'Caitlyn        Shive          35  217-394-6834\n',
 'Rolf           Monn           11  217-255-1562\n',
 'Tasha          Goodrum        21  217-232-4939\n',
 'Mozella        Demo           52  217-254-1651\n',
 'Lorina         Prather        11  217-367-6849\n',
 'Theodore       Bunn           19  217-227-4766\n',
 'Virgie         Deangelis      48  217-260-4432\n',
 'Carlena        Vela           19  217-230-6289\n',
 'Kraig          Edgin          48  217-323-8871\n',
 'Yulanda        Ashlock        22  217-390-4545\n',
 'Harriette      Tomlin         39  217-284-6363\n',
 'Eusebio        Grate          33  217-270-5382\n',
 'Hannah         Goodell        44  217-233-6955\n',
 'Jennefer       Loso           42  217-206-8116\n',
 'Marlene        Poland         24  217-372-5587\n',
 'Sonny          Reimer         51  217-350-4634\n',
 'Carlita        Hedgecock      47  217-370-7816\n',
 'Nell           Stallcup       16  217-310-6646\n',
 'Sang           Kujawski       32  217-211-9846\n',
 'Katina         Rennie         54  217-296-2983\n',
 'Jessie         Yoho           46  217-235-5017\n',
 'Dina           Brakefield     42  217-292-5933\n',
 'Diana          Granger        15  217-313-1718\n',
 'Chong          Robinett       41  217-228-8321\n',
 'Houston        Shibata        46  217-358-6334\n',
 'Nakita         Heeren         19  217-341-4032\n',
 'Ronny          Fackler        60  217-379-7852\n',
 'Leontine       Zoeller        45  217-202-5340\n',
 'Hassie         Lush           57  217-312-7202\n',
 'Sigrid         Grimmer        49  217-227-4144\n',
 'Rosaria        Hayslip        13  217-271-5984\n',
 'Lawana         Dion           15  217-382-8019\n',
 'Ronna          Meneely        17  217-290-2456\n',
 'Rosario        Iannotti       48  217-329-3522\n',
 'Teena          Koen           47  217-387-1715\n',
 'Santos         Morais         11  217-376-8328\n',
 'Clement        Mangione       10  217-217-4424\n',
 'Lavinia        Vidrine        27  217-301-8274\n',
 'Bridgette      Gerke          52  217-314-9239\n',
 'Oralee         Longstreet     17  217-397-5060\n',
 'Providencia    Stonecipher    30  217-284-5594\n',
 'Grisel         Hanway         21  217-247-8875']
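Both versions above leave closing the file up to us, and readlines keeps the trailing '\n' on each entry. Here's a minimal sketch of one way to handle both at once using a with-block. The file name 'clients_sample.txt' and its two-line contents are stand-ins made up for this example so that it runs on its own.

```python
# Hypothetical two-line sample standing in for the real 'clients.txt'.
sample = (
    'Michelle       Knauss         21  217-326-2757\n'
    'Stella         Do             39  217-202-3672\n'
)
with open('clients_sample.txt', 'w') as f:
    f.write(sample)

# The with-block closes the file for us, even if an error occurs inside it.
with open('clients_sample.txt') as f:
    # Iterating over a file yields one line at a time;
    # rstrip('\n') drops the trailing newline from each line.
    lines = [line.rstrip('\n') for line in f]

print(lines)
print(f.closed)  # the file is already closed at this point
```

The result has the same shape as data.split('\n'), just without the intermediate big string.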

In either case, we now have our data represented as a list of individual entries, ready for further processing. The next step is to break each entry into its fields, and we'll use the split method to do this. First, let's look at a simple case.

In [72]:
s = "this is a string"
s.split()
Out[72]:
['this', 'is', 'a', 'string']
In [73]:
s = "this is a multiline string\nthis is the second line"
s.split()
Out[73]:
['this',
 'is',
 'a',
 'multiline',
 'string',
 'this',
 'is',
 'the',
 'second',
 'line']

Split also lets you specify which character to break the string at. A moment ago, we used the following to break at newline characters.

In [74]:
s.split('\n')
Out[74]:
['this is a multiline string', 'this is the second line']
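The difference between split() and split with an explicit separator matters for our client data, since the columns are padded with runs of spaces. A small sketch on one line of the file:

```python
line = 'Stella         Do             39  217-202-3672\n'

# With no argument, split breaks on any run of whitespace
# (spaces, tabs, newlines) and never produces empty strings.
fields = line.split()
print(fields)

# With an explicit separator, every single occurrence is a break point,
# so consecutive spaces produce empty strings in the result.
by_space = line.split(' ')
print(by_space)
```

For whitespace-padded columns like ours, the no-argument form is usually the one we want.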

Exercise 1

Read the lines from 'clients.txt', then convert each line into a tuple of ('first', 'last', age, 'phone number') fields. In particular, make sure that the age field is converted to an integer type, not a string!

For example, the line

'Stella Do 39 217-325-1432'
should become
('Stella', 'Do', 39, '217-325-1432')

Aside: Optional Arguments in Common Functions

Suppose we want to sort our clients by age. How would we do this?

By default, Python's sort compares tuples entry by entry. Hence, sorting our client tuples would order them by first name, then last name, then age, then phone number.
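To see the entry-by-entry comparison in action, here's a small sketch. The three client tuples are made up for illustration (the phone number is left out to keep things short).

```python
# Tuples compare on their first entry; later entries only break ties.
print((1, 5) < (2, 4))  # True: 1 < 2, the second entries are never looked at
print((1, 5) < (1, 4))  # False: first entries tie, and 5 < 4 is False

# Made-up (first, last, age) tuples for illustration.
clients = [
    ('Stella', 'Do', 39),
    ('Alton', 'Oneal', 20),
    ('Alton', 'Beegle', 60),
]

# Sorted by first name; the two Altons are then ordered by last name.
ordered = sorted(clients)
print(ordered)
```

Notice that age plays no role here, because the names already decide the order.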

One possible solution hinted at much earlier is to reorder the tuples and then sort. That's fine - but there's a better way!

Many commonly used functions support optional arguments which make them much more flexible. Let's take a look at how to make sorted sort by age.

In [75]:
entries = [
    (1, 5),
    (2, 4),
    (6, 9),
    (3, 7),
]

sorted(entries)
Out[75]:
[(1, 5), (2, 4), (3, 7), (6, 9)]
In [76]:
def first(entry):
    return entry[0]

def second(entry):
    return entry[1]

print(sorted(entries, key=first))   # order by the first field of each tuple
print(sorted(entries, key=second))  # order by the second field of each tuple
[(1, 5), (2, 4), (3, 7), (6, 9)]
[(2, 4), (1, 5), (3, 7), (6, 9)]

We can do this even more compactly and not even give a name to our "chooser" function by using Python's lambda notation.

In [90]:
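# Index into the tuple inside the lambda; unpacking a tuple in the
# parameter list, as in lambda (f, s): f, only works in Python 2.
print(sorted(entries, key=lambda e: e[0]))
print(sorted(entries, key=lambda e: e[1]))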
[(1, 5), (2, 4), (3, 7), (6, 9)]
[(2, 4), (1, 5), (3, 7), (6, 9)]
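As an aside, picking out one field is common enough that the standard library's operator module provides itemgetter, which builds the same kind of "chooser" function for us. A short sketch, which also shows the reverse optional argument of sorted:

```python
from operator import itemgetter

entries = [(1, 5), (2, 4), (6, 9), (3, 7)]

# itemgetter(1) returns a function that behaves like: lambda e: e[1]
by_second = sorted(entries, key=itemgetter(1))
print(by_second)

# reverse=True is another optional argument; it flips the order.
by_second_desc = sorted(entries, key=itemgetter(1), reverse=True)
print(by_second_desc)
```

Whether to use lambda or itemgetter here is mostly a matter of taste.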

Exercise 2

  1. Write a function which takes the client data and a column number and sorts the entries according to that column.
  2. Extend this to support "named" columns. For example, I still want to be able to sort using column 0, 1, 2 or 3 as before, but I should also be able to sort by "first name", "last name", "age" or "phone number" too.

Formatting and Writing Data

Now that we've done a little processing, let's save our data. It turns out that writing data to a file isn't much harder than reading data. For example:

In [78]:
f = open('somewords.txt', 'w')
f.write('hello\n')
f.write('meow\n')
f.write('surfer\n')
f.close()
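If we forget f.close(), some of the written data may still be sitting in a buffer rather than in the file. One way to avoid worrying about this is a with-block, sketched below. It writes the same three words and then reads them back as a check.

```python
words = ['hello', 'meow', 'surfer']

# The with-block closes (and therefore flushes) the file when it ends.
with open('somewords.txt', 'w') as f:
    for word in words:
        # write does not add a newline for us, so we add one ourselves.
        f.write(word + '\n')

# Read the file back to confirm what was written.
with open('somewords.txt') as f:
    contents = f.read()

print(repr(contents))
```

Keep in mind that opening with 'w' replaces whatever the file held before.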

Now, we just need to do this with our sorted table and we'll be done. The only problem is, we need to format our data into a string to write it. How would we do this? One way would be:

In [79]:
entry = ('John', 'Putnam', 63, '217-321-1542')
print(str(entry))
('John', 'Putnam', 63, '217-321-1542')

But that's not quite right... We want a flat line of text rather than the tuple's own representation. To do this, we'll use the format method of Python strings.

In [80]:
print('{0} {1} {2} {3}\n'.format(entry[0], entry[1], entry[2], entry[3]))
John Putnam 63 217-321-1542

In [81]:
print('{} {} {} {}\n'.format(entry[0], entry[1], entry[2], entry[3]))
John Putnam 63 217-321-1542

Even more compactly, we can ask Python to "fill in" the arguments from a tuple using the star-prefix notation.

In [82]:
print('{} {} {} {}\n'.format(*entry))
John Putnam 63 217-321-1542

Although you can always work around this, the star notation is good to be aware of. It works anytime you want to expand a tuple into the arguments of a function!
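Here's a quick sketch of the star notation with a function of our own rather than format. The function describe_client is made up for this example.

```python
# Hypothetical helper taking the four fields as separate arguments.
def describe_client(first, last, age, phone):
    return '{} {} is {} years old'.format(first, last, age)

entry = ('John', 'Putnam', 63, '217-321-1542')

# *entry expands the tuple, so this is the same as calling
# describe_client('John', 'Putnam', 63, '217-321-1542').
summary = describe_client(*entry)
print(summary)
```

The tuple needs to have exactly as many items as the function expects, otherwise Python raises a TypeError.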

We won't go in depth with string formatting here, but the format method has a rich array of ways it can format strings. Take a look at the range of examples if you're interested!

Exercise 3

Complete our database processing by writing the table sorted by age to 'people.age.txt'.

Data Analysis: First Steps Towards the Scipy Stack

Now, we've done some good work in terms of structuring the client data. But, what if we wanted to do some more analysis on it?

For example, we may want a histogram of the clients' ages or a description of the mean and standard deviation of the clients' ages.

Further, we don't just want a list of numbers but some kind of a visual representation of the dataset.

We'll start to address these problems using some basic functionality from two very powerful Python modules: Scipy and Matplotlib.

Let's start out with visualizing our data, since pictures are fun to make.

I'll go ahead and process the client data again, but this time only extracting their ages.

In [83]:
def parse(line):
    first, last, age, phone = line.split()
    return int(age)

ages = [parse(line) for line in open('clients.txt')]

Now, let's see how to create a basic histogram - nothing fancy.

In [84]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.figure()
plt.hist(ages)
plt.show()
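To get a feel for what a histogram is summarizing, here's a rough sketch that counts ages per decade using only the standard library. The bin width of 10 years is our own choice for this sketch, and the ages are just the first ten from the listing above. By default, plt.hist instead splits the range from the smallest to the largest value into 10 equal-width bins, so its bars won't line up exactly with these.

```python
from collections import Counter

# The first ten ages from the client listing.
ages = [21, 39, 56, 58, 25, 20, 11, 55, 43, 21]

# Map each age to the start of its decade (e.g. 39 -> 30) and count.
counts = Counter((age // 10) * 10 for age in ages)

# Print a crude text histogram, one row per decade.
for start in sorted(counts):
    print('{:>2}-{:<2} {}'.format(start, start + 9, '#' * counts[start]))
```

Each row of '#' marks plays the role of one bar in the plot.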

Now, since we created this plot, we understand what we're looking at, but to someone who comes across this randomly, it's not very useful. Let's see if we can add some context to it.

In [85]:
plt.figure()
plt.title('Age Distribution of Clients')
plt.xlabel('Age')
plt.ylabel('Count')
plt.hist(ages)
plt.show()

If we're happy with the plot, we can save it for later use using the savefig command instead of show.

In [86]:
plt.figure()
plt.title('Age Distribution of Clients')
plt.xlabel('Age')
plt.ylabel('Count')
plt.hist(ages)
plt.savefig('age-histogram.png')

Now we've gotten some visual feedback of what our distribution looks like. It looks like it has a mean around 35 and a large variance. To quantify this, let's use Scipy's describe function.

In [87]:
from scipy.stats import describe
info = describe(ages)
info
Out[87]:
DescribeResult(nobs=100, minmax=(10, 60), mean=34.640000000000001, variance=222.11151515151516, skewness=-0.01047377103097642, kurtosis=-1.28981716368536)
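As a sanity check on what these numbers mean, here's a small sketch using the standard library's statistics module (this assumes Python 3.4 or later) on just the first five ages. As far as we're aware, describe reports the sample variance by default (dividing by n - 1 rather than n), which corresponds to statistics.variance rather than statistics.pvariance.

```python
import statistics

# The first five ages from the client listing.
ages = [21, 39, 56, 58, 25]

mean = statistics.mean(ages)

# Sample variance: sum of squared deviations divided by n - 1.
sample_var = statistics.variance(ages)

# Population variance: the same sum divided by n.
pop_var = statistics.pvariance(ages)

print(mean)
print(sample_var)
print(pop_var)
```

With 100 clients the two variances differ by only about 1%, but it's good to know which one you're looking at.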

Let's try to add this newfound information to our plot!

In [91]:
mu = round(info.mean, 2)
sigma = round(info.variance**.5, 2)

plt.figure()
plt.title('Age Distribution of Clients ($\\mu={}$, $\\sigma={}$)'.format(mu, sigma))
plt.xlabel('Age')
plt.ylabel('Count')
plt.hist(ages)
plt.savefig('age-histogram.png')

# The \\ is needed in the title string as \ performs special "escape" commands. For example,
# we've seen that \n produces a newline. \\ just leaves us with a good-ole \
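An alternative to doubling the backslashes is a raw string, written with an r prefix, in which backslashes are left alone. A short sketch comparing the two (the value 34.64 is just the rounded mean from above):

```python
# In a normal string, '\\' is one backslash character.
escaped = '$\\mu={}$'.format(34.64)

# In a raw string (r prefix), a backslash is just a backslash.
raw = r'$\mu={}$'.format(34.64)

print(escaped)
print(raw)
print(escaped == raw)

# Both '\\' and '\n' are single characters once the string is built.
print(len('\\'))
print(len('\n'))
```

Raw strings are handy whenever a title or label contains several LaTeX commands.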

Matplotlib is quite flexible in how a plot can be adjusted and tweaked! We could probably spend days discussing these things, but for now, we'll just cover the basics.

Additional Remarks

A couple of questions came up during the day. The first was regarding how to format strings to be a certain length. The second was about clarification of the lambda expression.

More Formatting

In order to format strings to be a certain length or have a certain justification, we can add to the replacement marker we saw in format strings:

In [1]:
'{:<12} {:>14} {:^16}'.format('left', 'right', 'center')
Out[1]:
'left                  right      center     '

The above left-aligns, right-aligns and centers the strings, padding them to widths of 12, 14 and 16 characters respectively. You can find out more about these things in the example gallery.
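This comes in handy for writing our client table in aligned columns. The sketch below is one possible layout. The widths 15, 15 and 4 are assumptions picked to resemble the spacing in 'clients.txt', and format_row is a name made up for this example.

```python
# Hypothetical helper: lay out one client tuple in fixed-width columns.
# '<' left-aligns each field; the number is the minimum field width.
def format_row(entry):
    return '{:<15}{:<15}{:<4}{}'.format(*entry)

entry = ('Stella', 'Do', 39, '217-202-3672')
row = format_row(entry)
print(row)
print(len(row))  # 15 + 15 + 4 + 12 characters
```

A field longer than its width is not cut off; it simply pushes the later columns to the right.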

Lambda

The original intent was to just use lambda as a short utility for sorting, but due to some questions, I thought I'd expand on it here. The idea behind lambda is to provide a way to construct "unnamed" functions. For example, the following are equivalent:

In [2]:
def square(x):
    return x**2

square = lambda x: x**2

What's crucial is that lambda doesn't need a name to work. Perhaps the clearest example of this is:

In [3]:
(lambda x: x**2)(4)
Out[3]:
16

This is how we used it as a key to the sorted function. One last example which demonstrates this is:

In [4]:
def bind(f, x):
    return lambda y: f(x, y)

def f(x, y):
    return x + 2*y

g = bind(f, 2)
g(3)
Out[4]:
8

The function bind takes a function f of two arguments and returns a new function with the first slot bound to x. In this example, x = 2, so $g(y) = f(2, y) = 2 + 2y$, and g(3) gives 8.
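For what it's worth, the standard library already ships a more general version of this idea as functools.partial. A brief sketch using the same f as above:

```python
from functools import partial

def f(x, y):
    return x + 2*y

# partial(f, 2) fixes the first argument, much like bind(f, 2).
g = partial(f, 2)
print(g(3))  # f(2, 3)

# partial can also fix arguments by keyword, here the second slot.
h = partial(f, y=3)
print(h(2))  # f(2, y=3)
```

Writing bind by hand is still a nice exercise, since it shows there's nothing magic going on.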