Some Python reminders

I assume that you all know Python. If not, you can review the course notes of <a href = 'http://ecourse.uoi.gr/course/view.php?id=489'>Introduction to Programming</a>.

There is also a nice introductory <a href = "https://github.com/mcrovella/CS506-Computational-Tools-for-Data-Science/blob/master/01-Intro-to-Python.ipynb">tutorial</a> by Evimaria Terzi and Mark Crovella, from whom we will borrow a lot of material.

Here we will just give a few reminders.

Strings

String manipulation will be very important for many of the tasks we will do. Therefore let us play around a bit with strings.

In [15]:
#Concatenating strings

a = "Hello"  # String
b = " World" # Another string
print (a + b)  # Concatenation
Hello World
In [1]:
# Slicing strings

a = "World"

print (a[0])
print (a[-1])
print ("World"[0:4])
print (a[::-1])
print(a[1:-1])
W
d
Worl
dlroW
orl
In [2]:
# Popular string functions
a = "Hello World"
l = list(a)
print(l)
print ("-".join(a))
print ("-".join(l))
print (a.startswith("Wo"))
print (a.endswith("rld"))
print (a.replace("o","0").replace("d","[)").replace("l","1"))
print (a.split())
print (a.split('o'))
['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd']
H-e-l-l-o- -W-o-r-l-d
H-e-l-l-o- -W-o-r-l-d
False
True
He110 W0r1[)
['Hello', 'World']
['Hell', ' W', 'rld']
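
As a small illustration of how these methods combine, the sketch below cleans up a comma-separated record. The sample line and the variable names are made up for this example, and strip() and lower() are standard string methods not shown above.

```python
# Hypothetical record; the content is invented for illustration.
line = "  Alice , 23,  Ioannina \n"

# split on commas, then strip surrounding whitespace from every field
fields = [field.strip() for field in line.split(",")]
print(fields)

# join the cleaned fields back into a tab-separated line
clean = "\t".join(fields)
print(clean)

# string comparison is case-sensitive, so normalize the case before comparing
print(fields[0].lower() == "alice")
```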

Strings are an example of an immutable data type. Once you instantiate a string, you cannot change any of its characters.

In [18]:
string = "string"
string[-1] = "y"  #Here we attempt to assign the last character in the string to "y"
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-b377f6c05723> in <module>()
      1 string = "string"
----> 2 string[-1] = "y"  #Here we attempt to assign the last character in the string to "y"

TypeError: 'str' object does not support item assignment
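
Since a string cannot be modified in place, the usual workaround is to build a new string. The following is a minimal sketch of two ways to do this; the variable names are chosen only for illustration.

```python
s = "string"

# Option 1: slice off the last character and concatenate a new one
t = s[:-1] + "y"
print(t)

# Option 2: go through a (mutable) list of characters and join it back
chars = list(s)
chars[-1] = "y"
u = "".join(chars)
print(u)

# The original string is unchanged in both cases
print(s)
```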

Lists, Tuples, Sets and Dictionaries

Numbers and strings alone are not enough! We need data types that can hold multiple values.

Lists:

Lists are mutable, that is, they can be altered after they are created. A list is a collection of data items, and the items can be of different types.

In [20]:
groceries = []

# Add to list
groceries.append("oranges")  
groceries.append("meat")
groceries.append("asparagus")

# Access by index
print (groceries[2])
print (groceries[0])

# Find number of things in list
print (len(groceries))

# Sort the items in the list
groceries.sort()
print (groceries)

# List Comprehension
veggie = [x for x in groceries if x != "meat"]  # compare values with !=, not identity with "is not"
print (veggie)

# Remove from list
groceries.remove("asparagus")
#groceries.pop()
print (groceries)

#The list is mutable
groceries[0] = 2
print (groceries)
asparagus
oranges
3
['asparagus', 'meat', 'oranges']
['asparagus', 'oranges']
['meat', 'oranges']
[2, 'oranges']
In [21]:
groceries.sort()
print (groceries)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-21-61508be098f7> in <module>()
----> 1 groceries.sort()
      2 print (groceries)

TypeError: unorderable types: str() < int()
In [1]:
L = ['x',1,3,'y']
print(L.pop())
print(L.pop(0))
y
x
In [26]:
# lists are objects
L = [2,5,1,4]
X = L
L.sort()
print (X)
L.append(3)
print(X)
L = sorted(L)
print(L)
print(X) 
[1, 2, 4, 5]
[1, 2, 4, 5, 3]
[1, 2, 3, 4, 5]
[1, 2, 4, 5, 3]
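
The behavior above follows from the fact that assignment binds a second name to the same list object rather than copying it. A minimal sketch of the difference between an alias and a copy, using illustrative names that do not interfere with the variables of the other cells:

```python
original = [2, 5, 1, 4]
alias = original        # a second name for the same list object
clone = list(original)  # a new list with the same elements (a shallow copy)

original.sort()         # sorts in place, so the alias sees the change
print(alias)
print(clone)            # the copy keeps the old order
print(alias is original, clone is original)
```

If you need an independent list, make the copy explicitly, for example with list(...) or with a full slice such as original[:].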
In [27]:
#slicing works for lists as for strings
print (L[1:-1])
print (L[2:])
print(L[:-2])
print(L[1:-1:2])
[2, 3, 4]
[3, 4, 5]
[1, 2, 3]
[2, 4]

List Comprehension

Recall the mathematical notation:

$$L_1 = \left\{x^2 : x \in \{0\ldots 9\}\right\}$$

$$L_2 = \left(1, 2, 4, 8,\ldots, 2^{12}\right)$$

$$L_3 = \left\{x \mid x \in L_1 \text{ and } x \text{ is even}\right\}$$

In [1]:
L1 = [x**2 for x in range(10)] # range(n): an iterable over the numbers 0,...,n-1
L2 = [2**i for i in range(13)]
L3 = [x for x in L1 if x % 2 == 0]
print (L1)
print (L2) 
print (L3)
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
[1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]
[0, 4, 16, 36, 64]
In [29]:
[x for x in [x**2 for x in range(10)] if x % 2 == 0]
Out[29]:
[0, 4, 16, 36, 64]
In [4]:
words = 'The quick brown fox jumps over the lazy dog'.split()
print(words) 
upper = [w.upper() for w in words]
print(upper)
stuff = [[w.upper(), w.lower(), len(w)] for w in words]
print(stuff)
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
['THE', 'QUICK', 'BROWN', 'FOX', 'JUMPS', 'OVER', 'THE', 'LAZY', 'DOG']
[['THE', 'the', 3], ['QUICK', 'quick', 5], ['BROWN', 'brown', 5], ['FOX', 'fox', 3], ['JUMPS', 'jumps', 5], ['OVER', 'over', 4], ['THE', 'the', 3], ['LAZY', 'lazy', 4], ['DOG', 'dog', 3]]
In [2]:
s = input('Give numbers separated by comma: ')
x = [int(n) for n in s.split(',')]
print(x)
Give numbers separated by comma: 1,2,3,4
[1, 2, 3, 4]
In [3]:
y = s.split(',')
print(y)
print(y[0]+y[1])
print(x[0]+x[1])
['1', '2', '3', '4']
12
3
In [10]:
#create a vector of 10 zeros
z = [0 for i in range(10)]
print(z)
#create a 10x10 matrix with all 0s
M = [[0 for i in range(10)] for j in range(10)]
#set the diagonal to 1
for i in range(10): M[i][i] = 1
print(M)
#create a list of random integers in [0,99]
import random
R = [random.choice(range(100)) for i in range(10)]
print(R)
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[[1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]]
[69, 93, 69, 2, 26, 93, 90, 89, 5, 4]
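
The nested comprehension used for the matrix above matters: a tempting shortcut with the * operator repeats a reference to the same inner list. The sketch below, with illustrative names and a smaller 3x3 size, shows the difference.

```python
# Pitfall: the outer * repeats a reference to the SAME inner list
bad = [[0] * 3] * 3
bad[0][0] = 1
print(bad)    # every row appears to change

# The comprehension builds a fresh inner list for each row
good = [[0] * 3 for _ in range(3)]
good[0][0] = 1
print(good)   # only the first row changes
```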
In [4]:
# Removing elements from a list while you iterate it can lead to problems
L = [1,2,4,5,6,8]
for x in L:
    if x%2 == 0:
        L.remove(x)
print(L)
[1, 4, 5, 8]
In [54]:
#Another way to do this:
L = [1,2,4,5,6,8]
L = [x for x in L if x%2 == 1] # creates a new list and rebinds the name L to it
L[:] = [x for x in L if x%2 == 1] # alternative: replaces the contents of the existing list in place
print(L)
[1, 5]
In [55]:
L = [1,2,4,5,6,8]
R =[y for y in L if y%2 == 0]
for x in R: L.remove(x)
print(L)
[1, 5]
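
A further option, sketched here under the same setup, is to iterate over a copy of the list while removing from the original. The loop then visits every element, because the removals do not shift the sequence being iterated.

```python
L = [1, 2, 4, 5, 6, 8]

# L[:] is a copy, so removing from L does not disturb the iteration
for x in L[:]:
    if x % 2 == 0:
        L.remove(x)
print(L)
```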

Tuples:

Tuples are an immutable type. Like strings, once you create them, you cannot change them. Because they are immutable, tuples can be used as keys in dictionaries (provided all of their elements are themselves immutable). Like lists, they are collections of data items, and the items can be of different types.

In [56]:
# Tuple grocery list

groceries = ('orange', 'meat', 'asparagus', 2.5, True)

print (groceries)

#print(groceries[2])

#groceries[2] = 'milk'

L = [1,2,3]
t = tuple(L)
print(t)
L[1] = 5
print(t)
t[1] = 4
('orange', 'meat', 'asparagus', 2.5, True)
(1, 2, 3)
(1, 2, 3)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-56-ead4db6fadf4> in <module>()
     14 L[1] = 5
     15 print(t)
---> 16 t[1] = 4

TypeError: 'tuple' object does not support item assignment
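
To illustrate the remark about dictionary keys, here is a minimal sketch that uses a tuple as a key and shows that a list is rejected. The city pair and the distance value are invented for the example.

```python
# Tuples of immutable elements are hashable, so they can be dictionary keys
distance = {}
distance[("Ioannina", "Athens")] = 410   # made-up value for illustration
print(distance[("Ioannina", "Athens")])

# A list cannot be used as a key
try:
    distance[["Ioannina", "Athens"]] = 410
except TypeError as err:
    print("TypeError:", err)

# Tuple unpacking assigns the components to separate names
origin, destination = ("Ioannina", "Athens")
print(origin, destination)
```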

Sets:

A set is an unordered collection of items that cannot contain duplicates. Sets support the usual operations on sets in mathematics.

In [ ]:
numbers = range(10)
evens = [2, 4, 6, 8]

evens = set(evens)
numbers = set(numbers)

# Use difference to find the odds
odds = numbers - evens

print (odds)

# Note: Set also allows for use of union (|), and intersection (&)
In [57]:
a = [2,1,2,1]
print (a)
a = set(a)
print(a)
[2, 1, 2, 1]
{1, 2}
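
The set operators mentioned in the note above can be sketched as follows; the two sets are chosen only for illustration.

```python
evens = {0, 2, 4, 6, 8}
small = {0, 1, 2, 3, 4}

print(evens | small)   # union: elements in either set
print(evens & small)   # intersection: elements in both sets
print(evens - small)   # difference: elements of evens not in small
print(evens ^ small)   # symmetric difference: elements in exactly one set
print(3 in small)      # membership test
```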

Dictionaries:

A dictionary is a map from keys to values. This is one of the most useful structures. Keys must be unique and hashable, which in practice means immutable types such as strings, numbers, and tuples.

In [58]:
# A simple dictionary

simple_dict = {'cse012': 'data mining'}

# Access by key
print (simple_dict['cse012'])
data mining
In [59]:
# A longer dictionary
classes = {
    'cse012': 'data mining',
    'cse205': 'object oriented programming'
}

# Check if item is in dictionary
print ('cse012' in classes)

# Add new item
classes['L14'] = 'social networks'
print (classes['L14'])

# Print just the keys
print (list(classes.keys()))

# Print just the values
print (list(classes.values()))

# Print the items in the dictionary
print (list(classes.items()))

# Print dictionary pairs another way
for key, value in classes.items():
    print (key, value)
True
social networks
['cse205', 'L14', 'cse012']
['object oriented programming', 'social networks', 'data mining']
[('cse205', 'object oriented programming'), ('L14', 'social networks'), ('cse012', 'data mining')]
cse205 object oriented programming
L14 social networks
cse012 data mining
In [60]:
for key in classes:
    print (key, classes[key])
cse205 object oriented programming
L14 social networks
cse012 data mining
In [61]:
#change values in a dictionary
classes['L14'] = 'graduate social networks'
print (classes['L14'])
graduate social networks
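
A common use of adding and updating dictionary entries is counting. The sketch below counts word occurrences with the get method, which returns a default value when the key is missing; the sentence is just sample input.

```python
words = "the quick brown fox jumps over the lazy dog the end".split()

counts = {}
for w in words:
    # get returns 0 the first time a word is seen, instead of raising KeyError
    counts[w] = counts.get(w, 0) + 1

print(counts["the"])
print(counts.get("cat", 0))   # a word that never appeared
```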
In [62]:
# Complex Data structures
# Dictionaries inside a dictionary!

professors = {
    "prof1": {
        "name": "Panayiotis Tsaparas",
        "department": "Computer Science",
        "interests": ["algorithms", "data mining", "machine learning",]
    },
    "prof2": {
        "name": "Yanis Varoufakis",
        "department": "Economics",
        "interests": ["debt", "game theory", "parallel currency",],
    }
}

for prof in professors:
    print (professors[prof]["name"])
Yanis Varoufakis
Panayiotis Tsaparas
In [64]:
professors['prof2']['interests'][1]
Out[64]:
'game theory'

Depending on the task that we want to perform, the structure we use makes a big difference in efficiency. When we need to search over a structure, it is important to use a set or a dictionary, since a lookup takes constant time in expectation (and O(n) in the worst case), whereas searching a list takes time linear in its length. This makes a huge difference when dealing with large datasets.

In [5]:
# The importance of using the right structure:
import random
import time
L = [random.choice(range(1000000)) for i in range(1000)]
t = time.perf_counter()  # time.clock() was removed in Python 3.8
count = 0
for x in range(1000000):
    if x in L:
        count += 1
print (time.perf_counter() - t)
26.21975488869904
In [6]:
S = set(L)
t = time.perf_counter()  # time.clock() was removed in Python 3.8
count = 0
for x in range(1000000):
    if x in S:
        count += 1
print (time.perf_counter() - t)
0.14673384039315351
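
The same comparison can also be sketched with the standard timeit module, which repeats a call several times and reports the total time. The sizes below are deliberately small so that the example runs quickly, and the helper name count_hits is made up for this sketch; the actual timings depend on the machine.

```python
import random
import timeit

random.seed(0)  # fixed seed so the run is repeatable
data = [random.randrange(100000) for _ in range(1000)]
as_list = data
as_set = set(data)

def count_hits(container):
    # count how many of the first 1000 integers appear in the container
    return sum(1 for x in range(1000) if x in container)

# time three repetitions of the same membership tests on each structure
t_list = timeit.timeit(lambda: count_hits(as_list), number=3)
t_set = timeit.timeit(lambda: count_hits(as_set), number=3)

print(count_hits(as_list) == count_hits(as_set))   # same answer either way
print("list: %.4f s, set: %.4f s" % (t_list, t_set))
```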
In [72]:
#What structure should we use for storing the edges of a graph with millions of edges?
L = [[1,2],[1,4],[2,5],[3,7],[5,7],[6,8]] # etc...
S = set([tuple(x) for x in L])
print(S)

#What if we have also a weight on the edges, and we want to get the weight?
L = [[1,2,0.2],[1,4,0.5],[2,5,0.8],[3,7,0.1],[5,7,0.4],[6,8,0.9]] 
D = {}
for x in L:
    D[tuple(x[0:2])] = float(x[2])
print(D)

#What if we want to be able to get the neighbors of each node and the weight of the edge?
D = {}
for x in L:
    v1,v2,w = x
    if v1 in D:
        D[v1][v2] = w
    else:
        D[v1] = {}
        D[v1][v2] = w
print(D)
print(D[1][2])
{(1, 2), (6, 8), (5, 7), (1, 4), (3, 7), (2, 5)}
{(1, 2): 0.2, (6, 8): 0.9, (3, 7): 0.1, (2, 5): 0.8, (5, 7): 0.4, (1, 4): 0.5}
{1: {2: 0.2, 4: 0.5}, 2: {5: 0.8}, 3: {7: 0.1}, 5: {7: 0.4}, 6: {8: 0.9}}
0.2
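
The last structure can be written more compactly with collections.defaultdict, which creates the inner dictionary automatically the first time a node is seen. The sketch below also stores each edge in both directions, which is an assumption (undirected graph) not made in the code above.

```python
from collections import defaultdict

edges = [[1, 2, 0.2], [1, 4, 0.5], [2, 5, 0.8], [3, 7, 0.1], [5, 7, 0.4], [6, 8, 0.9]]

# defaultdict(dict) supplies an empty dict for any node not seen before
graph = defaultdict(dict)
for v1, v2, w in edges:
    graph[v1][v2] = w
    graph[v2][v1] = w   # assumption: edges are undirected, so store both directions

print(sorted(graph[1]))   # neighbors of node 1
print(graph[5][7])        # weight of the edge between 5 and 7
print(sorted(graph[7]))   # neighbors of node 7
```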

Lambda functions

Python supports the creation of anonymous functions (i.e. functions that are not bound to a name) at runtime, using a construct called "lambda".

In [24]:
def f (x): return x**2
print (f(8))
64
In [25]:
g = lambda x: x**2
print (g(8))
64

The above pieces of code are equivalent to each other! Note that there is no "return" statement in the lambda function. A lambda function does not need to be assigned to a variable; it can be used anywhere in the code where a function is expected.

In [26]:
f = lambda x, y : x + y
print (f(2,3))
5
In [27]:
def multiply (n): return lambda x: x*n
 
multiply_by_2 = multiply(2)
g = multiply(6)
print (multiply_by_2)
print (multiply_by_2(10), g(10))
<function multiply.<locals>.<lambda> at 0x0000000004E25840>
20 60
In [28]:
multiply(3)(30)
Out[28]:
90
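
One common place where a function is expected is the key argument of sorted() and max(). A minimal sketch, using the same sample sentence that appears elsewhere in these notes:

```python
words = 'The quick brown fox jumps over the lazy dog'.split()

# sort by length; sorted() is stable, so ties keep their original order
by_length = sorted(words, key=lambda w: len(w))
print(by_length)

# sort alphabetically, ignoring case
alphabetical = sorted(words, key=lambda w: w.lower())
print(alphabetical)

# the longest word (the first one wins on ties)
longest = max(words, key=lambda w: len(w))
print(longest)
```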

The map() function

The advantage of the lambda operator can be seen when it is used in combination with the map() function. map() is a function with two arguments:

r = map(func,s)

func is a function and s is a sequence (e.g., a list). map returns an iterator that applies func to every element of s. You need to convert it into a list if you want to index it or use it more than once.

map() and lambda give functionality very similar to that of list comprehension.

In [29]:
def dollar2euro(x):
    return 0.89*x
def euro2dollar(x):
    return 1.12*x

amounts= (100, 200, 300, 400)
euros = map(dollar2euro, amounts)
print (euros)
print (list(euros))
<map object at 0x0000000004E188D0>
[89.0, 178.0, 267.0, 356.0]
In [30]:
amounts= (100, 200, 300, 400)
dollars = map(euro2dollar, amounts)
print (list(dollars))
[112.00000000000001, 224.00000000000003, 336.00000000000006, 448.00000000000006]
In [31]:
list(map(lambda x: 0.89*x, amounts))
Out[31]:
[89.0, 178.0, 267.0, 356.0]

map can also be applied to more than one sequence. The function must then take as many arguments as there are sequences, and map stops as soon as the shortest sequence is exhausted.

In [32]:
a = [1,2,3,4,5]
b = [-1,-2,-3, -4, -5] 
c = [10, 20 , 30, 40, 50]

l1 = list(map(lambda x,y: x+y, a,b))
print (l1)
l2 = list(map (lambda x,y,z: x-y+z, a,b,c))
print (l2)
[0, 0, 0, 0, 0]
[12, 24, 36, 48, 60]
In [33]:
words = 'The quick brown fox jumps over the lazy dog'.split()
uwords = list(map(lambda w: [w.upper(), w.lower(), len(w)], words))
for t in uwords:
    print (t)
['THE', 'the', 3]
['QUICK', 'quick', 5]
['BROWN', 'brown', 5]
['FOX', 'fox', 3]
['JUMPS', 'jumps', 5]
['OVER', 'over', 4]
['THE', 'the', 3]
['LAZY', 'lazy', 4]
['DOG', 'dog', 3]
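
To make the relation between map() and list comprehension concrete, the sketch below writes the same computation both ways, and also shows two properties of map that are easy to overlook. The variable names are illustrative.

```python
a = [1, 2, 3, 4, 5]
b = [-1, -2, -3, -4, -5]

# map over two sequences, and the equivalent comprehension over zip
with_map = list(map(lambda x, y: x + y, a, b))
with_comp = [x + y for x, y in zip(a, b)]
print(with_map == with_comp)

# with sequences of different lengths, map stops at the shortest one
short = list(map(lambda x, y: x * y, [1, 2, 3], [10, 20]))
print(short)

# a map object is an iterator: it can be consumed only once
m = map(abs, b)
first = list(m)
second = list(m)   # already exhausted, so this is empty
print(first, second)
```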

The filter() function

The function filter(function, list) keeps only the elements of the list for which function returns True and filters out the rest. Like map, it returns an iterator.

In [34]:
nums = [i for i in range(100)]
print (nums)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
In [35]:
even = list(filter(lambda x: x%2==0 and x!=0, nums))
print (even)
[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98]
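
As with map(), a filter() call can usually be rewritten as a list comprehension with a condition. A minimal sketch with an illustrative predicate:

```python
nums = list(range(20))

# filter keeps the elements for which the function returns True
with_filter = list(filter(lambda x: x % 3 == 0, nums))
# the same selection written as a list comprehension
with_comp = [x for x in nums if x % 3 == 0]
print(with_filter)
print(with_filter == with_comp)

# passing None as the function keeps the elements that are truthy
truthy = list(filter(None, [0, 1, "", "a", None, [], [0]]))
print(truthy)
```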

Command line arguments

To read the command line arguments, use the sys.argv list. The first element of this list is the program name. More sophisticated processing can be done using the getopt.getopt function.

In [83]:
import sys

print ('Number of arguments:', len(sys.argv), 'arguments.')
print ('Argument List:', str(sys.argv))
Number of arguments: 5 arguments.
Argument List: ['C:\\Anaconda3\\lib\\site-packages\\IPython\\kernel\\__main__.py', '-f', 'C:\\Users\\Panayiotis\\.ipython\\profile_default\\security\\kernel-a3b88187-1a4c-44b6-9924-aa3dfaec2c63.json', '--profile-dir', 'C:\\Users\\Panayiotis\\.ipython\\profile_default']
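
A minimal sketch of option processing with getopt.getopt follows. The option names (-i/--input, -o/--output, -v/--verbose), the file names, and the helper parse_args are all hypothetical, chosen only for illustration. In a script you would pass sys.argv[1:]; here a list is passed directly so that the example also runs inside a notebook.

```python
import getopt

def parse_args(argv):
    """Parse a list such as sys.argv[1:] (the program name already removed)."""
    # "i:o:v" means that -i and -o take a value and -v is a flag
    opts, rest = getopt.getopt(argv, "i:o:v", ["input=", "output=", "verbose"])
    settings = {"input": None, "output": None, "verbose": False}
    for name, value in opts:
        if name in ("-i", "--input"):
            settings["input"] = value
        elif name in ("-o", "--output"):
            settings["output"] = value
        elif name in ("-v", "--verbose"):
            settings["verbose"] = True
    # rest holds the remaining positional arguments
    return settings, rest

settings, rest = parse_args(["-i", "edges.txt", "--output", "out.txt", "-v", "extra"])
print(settings)
print(rest)
```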

Libraries

Python is a high-level open-source language. The Python world is also inhabited by many packages or libraries that provide useful things like array operations, plotting functions, and much more. We can (and we should) import libraries of functions to expand the capabilities of Python in our programs.

There are many Python libraries for data mining, and we will use many of these libraries in this course.

Example: The random library

In [85]:
import random
print (random.choice(range(10))) # generates a random number in the range [0,9]
myList = [2, 109, False, 10, "data", 482, "mining"]
print(random.choice(myList))
1
mining
In [86]:
from random import shuffle # imports a specific function of random that can be used without the random. prefix
x = [i for i in range(10)]
print (x)
shuffle(x)
print (x)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[4, 7, 5, 6, 0, 3, 9, 8, 2, 1]
In [87]:
# some more methods of random
print(random.sample([1,3,4,7,8,9,20],3)) #samples 3 elements uniformly at random
print(random.random()) # a random number in [0,1)
print(random.uniform(-1,1)) #a random number in [-1,1]
print(random.gauss(0,1)) #sample from a gaussian distribution with mean 0 and std 1
[9, 20, 4]
0.8665630576491753
-0.6164878646949343
-0.9080984966255757
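
The outputs above change on every run. When an experiment needs to be repeatable, the generator can be seeded with random.seed; the sketch below uses an arbitrary seed value to show that the same seed reproduces the same sequence.

```python
import random

random.seed(42)   # fix the state of the generator (42 is an arbitrary choice)
first = [random.randrange(100) for _ in range(5)]

random.seed(42)   # same seed, so the same sequence is produced again
second = [random.randrange(100) for _ in range(5)]

print(first == second)
```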

Including images

In [88]:
from IPython.display import Image
Image(filename = "CSE-UOI-LOGO-EN.jpg",width = 231.4, height = 272.9)
#Image(response.content)
Out[88]: