So far, we've thought a bit about doing computations. Let's change things up and look at how we might try to represent data instead.
For example, let's say I wanted to keep track of all my friends and their birthdays. How could I do that, so that I could work with it in Python?
Or perhaps, I'm writing a game server where everyone has a unique username and an account associated to it. How could I efficiently look up an account from the username?
Finally, what if I wanted to quickly find all the unique numbers which occur in a list. Is there a nice way to do that?
We'll see that each of these problems can be solved efficiently using a few fundamental data structures: lists, dictionaries and sets.
Lists are, perhaps, the fundamental data structure. They are...well...lists. Most of the time when dealing with data, we don't just have a single item or a couple items, but lots of them. Lists help us organize these into a coherent, ordered collection. For example:
numbers = [1, 4, 2, 3, 9]
print numbers
Now, suppose I want to talk about particular things in the list. How can I do that?
print numbers[0]
print numbers[1]
print numbers[-1]
I can also take entire "slices" from the list. Note that like range, slicing follows the [,) half open interval convention.
print numbers[1:4]
print numbers[2:5]
print numbers[1:]
print numbers[:-3]
I can do something even fancier by "striding" the items!
print numbers[::2]
print numbers[1::3]
print numbers[1:3:2]
That's fine and all, but we want to do more than just look at a list. We want to be able to make changes to them! For example, we make a new friend and want to add them to our list of birthdays.
numbers = [1, 4, 2, 3, 9]
numbers.append(10)
print numbers
numbers = [1, 4, 2, 3, 9]
numbers.insert(0, 10)
print numbers
numbers = [1, 4, 2, 3, 9]
numbers.insert(2, 10)
print numbers
I can also remove items from a list.
people = ["Hip", "Cindy", "Ken", "Bruce Wayne", "Judge Judy"]
print people.pop(0)
print people
people = ["Hip", "Cindy", "Ken", "Bruce Wayne", "Judge Judy"]
print people.pop(-1) # note: same as just people.pop()
print people
I can also join two existing lists together to form a new list by "adding" or concatenating them.
items1 = [1, 2, 3, 4, 5]
items2 = ['A', 'B', 'C']
print items1 + items2
print items2 + items1
We can even have more interesting cases. For example, we may represent a simple "tree" as a list with other lists inside.
data = ["house key", 7, 3.1, [3, 4, 1], ["lighter", [1, True], "pocket knife"]]
print data
Finally, we can iterate the elements of a list in a similar way to how we iterated a range of numbers. Recall that we had:
things = [4, 1, 2, 9, 2, 'a', 2, 'thing', 4, 3.2]
for x in things:
print x
Create a function which takes a list of integers and prints whether the entry is positive, negative or zero.
Modify this function so that instead of printing "positive", "negative" or "zero", it returns the list of corresponding strings. (Can you make this more general? Instead of taking a number to the string "postive", "negative" or "zero", what if I want the squares of the numbers? Or each number incremented by one? How would you allow for this?)
Write a function which takes a list of numbers and returns the sum of the cubes of all the elements.
Write a function which finds the largest item in a list of numbers.
Write a function which returns a list but in reverse order.
Now that we've implemented a couple of these kinds of things, it turns out that Python has some built-in functionality that already does this for us. For example, if we want to find the largest element in a list, we can just use:
numbers = [2, 4, -1, 9, 10, 3, 20, 6, 2, 9, 9, 2, 3]
print max(numbers)
Similarly, if we want to add the elements, we can just use:
print sum(numbers)
One last, very useful and common operation is being able to sort a list of items:
numbers = [2, 4, -1, 9, 10, 3, 20, 6, 2, 9, 9, 2, 3]
print sorted(numbers)
A number of these functions become very useful when coupled with the concept of a generator. Although we won't go into depth with these, they become useful for performing small constructions or computations in a very compact way. For example, I can construct a list of squares using the "list-builder" notation:
[n**2 for n in range(1, 10)]
I can utilize functions like min or sum through a generator, too! For example, let me reimplement the sum of cubes function using generator notation:
numbers = [1, 5, 3, 9, 8]
sum(n**3 for n in numbers)
This (sometimes) have the benefit of improved readibility along with compactness. It's easy to see immediately what the line above this is doing in this notation.
Another common task is pairing up pieces of data. For example, in my list of friends' birthdays, I want a way of keeping a coherent grouping of each friend with their birthday. So something like, a list of pairs of the form (friend, birthday). Tuples are one possible solution to this:
birthdays = [
('James', 'Aug 25'),
('Tommy', 'Sept 12'),
('Jason', 'Jun 13'),
('Lexi', 'Feb 1'),
]
Tuples are similar to lists but are "immutable". In other words, once we create a tuple, we can't change it's content. Otherwise, we can still access elements in the same way. For example,
entry = birthdays[1]
print entry
print entry[1]
Python also has a nice "pattern-matching" method of assignment using tuples. For example, if I want to print out just the birthdays in my birthday list by iterating the pairs, I can just use:
for (name, bday) in birthdays:
print bday
Further, we can make this even "cleaner" by just using:
for name, bday in birthdays:
print bday
Another common example of this is an iterator in Python called enumerate, which pairs up the index of and element with the element:
numbers = [1, 5, 3, 6, 1]
for n, xn in enumerate(numbers):
print str(n) + ' -> ' + str(xn)
Suppose that I'm playing a card game like poker. What are some possible ways I could represent my current hand in the game?
Create a short list of friends' birthdays like we had above. Print a sorted version of this list. Notice the way Python sorts these? Now, write a function which creates a new birthday list but with the order of the name and the birthday swapped. Print out a sorted version of this list. Notice the difference?
Create a list of the numbers 1 to 10. Create a new list which consists of the consecutive pairs of numbers in the list. For example, [(1, 2), (2, 3), (3, 4), ..., (9, 10)]
Write a function which takes two tuples of (x1, y1) and (x2, y2) pairs and returns the "vector sum" (x1+x2, y1+y2). This suggests that we may try to implement an entire collection of support functions for vector algebra. Indeed, this kind of data abstraction is another crucial approach to building programs!
Consider something like a reverse telephone book we're building for something like a caller ID system. We want to be able to identify a person by telephone number, so we need a data structure which takes a phone number and gives us a name. Second, we want to be able to update the our data by adding, removing or modifying entries as they change.
One way we may do this is to keep a big list of (key, value) pairs. But...to perform all the operations we want efficiently, we'd need to impose some kind of structure on how the elements are organized. If we're not planning on really using this structure aside from a short problem, this is perhaps too much effort to use.
Alternatively, we could try to implement such a data structure as a big list with possibly empty entries for every possible phone number. But, you can imagine, this is likely to be to dense of a representation for what we want...
What we really want is to use a dictionary, instead! These are a built-in data structure in Python and provide exactly the sort of key -> value behavior we would like.
phonebook = {}
phonebook['2173218822'] = 'Tommy Cool-guy'
phonebook['2173211122'] = 'Michael Jordan'
phonebook['2183321232'] = 'Bobby Bigmoney Patterson'
Alternatively, when initializing a dictionary, we can use the shorthand:
phonebook = {
'2173218822': 'Tommy Cool-guy',
'2173211122': 'Michael Jordan',
'2183321232': 'Bobby Bigmoney Patterson',
}
To look up an entry, we can simply use the same index notation we did for lists.
print phonebook['2173211122']
At this point we should reflect on a major difference between lists and dictionaries. Remember that lists only take integer indicies, whereas dictionaries can use many other types as keys. Related to this, general types of keys don't have a natural ordering, so we should expect any kind of ordering on how the elements in a dictionary are stored.
If we want to check if a key is in the dictionary, we can use the in and not in statements:
print '2173211122' in phonebook
print '2173211122' not in phonebook
We can remove entries using the del statement.
phonebook['2173211122'] = 'Michael Jordan'
print '2173211122' in phonebook
print phonebook['2173211122'] + ' is retiring...'
del phonebook['2173211122']
print '2173211122' in phonebook
Finally, as with lists, we can enumerate the elements.
for num, name in phonebook.iteritems(): # <- we discovered this should be just .items() in Python 3
print str(num) + ' -> ' + str(name)
Mon -> 1, Tues -> 2, Wed -> 3, Thur -> 4, ...See if you can use this to build a "reverse" dictionary taking days numbers to day names.
['dog', 'pencil', 'fence', 'dog', 'apple', 'dog', 'dog', 'dog', 'pear', 'pencil', 'pear', 'pear']Using a dictionary, compute the numbers of times each word occurs.
The last scenario we'll consider here is something like the following: Suppose that I'm fetching text from webpages and keeping track of the 20 most frequently occuring words on a particular webpage. Now, suppose that I want to check if there are any commonly occuring instances of such words from two different webpages.
How might I do this?
Well, if I stored both the elements in a list I could do something like, for each element in the first list, check if it's in the second.
The problem with this is that, say in the worst case, the two lists are totally disjoint, for each element in the first list, I'm performing a number of steps proportional to the length of the second list. In this case, 20. Then, I do this for each element in my first list. In total, I've now performed 400 steps!
We roughly say that this method scales as $O(n^2)$ of the input size. For our purposes, that's not a deal-breaker. But what if I were to double the data size? Or increase it an order of magnitude? Then we'd start to get into trouble... This is not something we're going to get into much here, but it's something worth keeping in mind when designing solutions to problems.
So, what else can we do? Well, we can use a smarter data structure! This time, a set. Python provides a data structure for working with sets in the same way we think of. For example, items will only occur in a set once. We can efficiently compute the union, intersection or difference of two sets. And so on...
A = {1, 2, 5, 7, 8, 9}
B = {2, 4, 5, 8, 10}
print A
print B
print len(A)
We can check set membership as follows.
print 2 in A
print 3 in A
Similar to lists, we can add and remove elements from sets:
A = {1, 4, 5}
A.add(3)
A
A.remove(5)
A
We can also compute some standard set operations as we would write them mathematically.
print A & B
print A | B
print A - B
Finally, sets support the standard subset comparisons we're used to:
A = {1, 2, 3}
B = {1, 3}
print B < A
print B <= A
One kind of subtle difference between sets and either lists or dictionaries is that sets don't have any natural notion of indexing. In other words, we can't access an element by a key. However, we can still use an iterator approach to getting the elements from a set:
S = {1, 2, 3}
for x in S:
print x
Finally, in light of the generator discussion earlier, we can actually use set-builder notation as follows:
dice = range(1, 7)
pairs = {(a, b) for a in dice
for b in dice
if a == 2 or b == 2}
[1, 4, 5, 3, 6, 7, 5, 3, 1, 9, 3, 3, 4, 2]How could you use a set data structure to remove all the duplicates from the list?
users = ['tom', 'pam', 'sasha', 'jj', 'que', 'jeff'] bad = set(['jeff', 'sasha'])It should return an "allowed" list:
['tom', 'pam', 'jj', 'que']