Some Python reminders

I assume that you all know Python. If not, you can look back at the course notes of <a href = 'http://ecourse.uoi.gr/course/view.php?id=489'>Introduction to Programming</a>.

There is also a nice introductory <a href = "https://github.com/mcrovella/CS506-Computational-Tools-for-Data-Science/blob/master/01-Intro-to-Python.ipynb">tutorial</a> by Evimaria Terzi and Mark Crovella, from whom we will borrow a lot of material.

Here we will just give a few reminders.

Strings

String manipulation will be very important for many of the tasks we will do. Therefore let us play around a bit with strings.

#Concatenating strings

a = "Hello"  # String
b = " World" # Another string
print (a + b)  # Concatenation

Hello World

# Slicing strings

a = "World"

print (a[0])
print (a[-1])
print ("World"[0:4])
print (a[::-1])
print(a[1:-1])

W
d
Worl
dlroW
orl

# Popular string functions
a = "Hello World"
l = list(a)
print(l)
print ("-".join(a))
print ("-".join(l))
print (a.startswith("Wo"))
print (a.endswith("rld"))
print (a.replace("o","0").replace("d","[)").replace("l","1"))
print (a.split())
print (a.split('o'))

['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd']
H-e-l-l-o- -W-o-r-l-d
H-e-l-l-o- -W-o-r-l-d
False
True
He110 W0r1[)
['Hello', 'World']
['Hell', ' W', 'rld']

Strings are an example of an immutable data type. Once you instantiate a string, you cannot change any of its characters.

string = "string"
string[-1] = "y"  #Here we attempt to assign the last character in the string to "y"

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-b377f6c05723> in <module>()
      1 string = "string"
----> 2 string[-1] = "y"  #Here we attempt to assign the last character in the string to "y"

TypeError: 'str' object does not support item assignment
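Since a string cannot be modified in place, the usual workaround is to build a new string. The following is a minimal sketch of two common ways to do this; the variable names are only for illustration.

```python
s = "string"

# Option 1: slice off the last character and concatenate the replacement
t = s[:-1] + "y"

# Option 2: convert to a list (which is mutable), edit it, and join it back
chars = list(s)
chars[-1] = "y"
u = "".join(chars)

print(t, u)   # both are new string objects
print(s)      # the original string is unchanged
```

In both cases s itself is never altered; the names t and u simply refer to newly created strings.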

Lists, Tuples, Sets and Dictionaries

Numbers and strings alone are not enough! We need data types that can hold multiple values.

Lists:

Lists are mutable, that is, they can be altered after they are created. A list is a collection of data, and that data can be of differing types.

groceries = []

# Add to list
groceries.append("oranges")
groceries.append("meat")
groceries.append("asparagus")

# Access by index
print (groceries[2])
print (groceries[0])

# Find number of things in list
print (len(groceries))

# Sort the items in the list
groceries.sort()
print (groceries)

# List Comprehension
veggie = [x for x in groceries if x != "meat"]  # compare values with !=, not identity with "is not"
print (veggie)

# Remove from list
groceries.remove("asparagus")
#groceries.pop()
print (groceries)

#The list is mutable
groceries[0] = 2
print (groceries)

asparagus
oranges
3
['asparagus', 'meat', 'oranges']
['asparagus', 'oranges']
['meat', 'oranges']
[2, 'oranges']

groceries.sort()
print (groceries)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-21-61508be098f7> in <module>()
----> 1 groceries.sort()
      2 print (groceries)

TypeError: unorderable types: str() < int()
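The error above arises because Python 3 does not define an order between str and int. If a mixed list really must be sorted, one option is to pass a key function that maps every element to a comparable type. The sketch below assumes that ordering by string representation is acceptable; the list mixed is made up for illustration.

```python
mixed = [2, "oranges", 10, "meat"]

# key=str compares the elements by their string form, so "10" < "2" < "meat"
by_text = sorted(mixed, key=str)
print(by_text)
```

Note that this is a textual order, not a numeric one, which is why 10 comes before 2.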

L = ['x',1,3,'y']
print(L.pop())
print(L.pop(0))

y
x

# lists are objects
L = [2,5,1,4]
X = L
L.sort()
print (X)
L.append(3)
print(X)
L = sorted(L)
print(L)
print(X)

[1, 2, 4, 5]
[1, 2, 4, 5, 3]
[1, 2, 3, 4, 5]
[1, 2, 4, 5, 3]
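In the cell above, X and L start out as two names for the same list object, so changes made through L are visible through X; sorted(L) then creates a new list and only L is rebound to it. When an independent list is needed, it has to be copied explicitly. A minimal sketch (the names alias and snapshot are only for illustration):

```python
L = [2, 5, 1, 4]
alias = L             # a second name for the same list object
snapshot = L.copy()   # a new list with the same elements (list(L) or L[:] also work)

L.sort()
print(alias is L, alias)         # same object, so it appears sorted too
print(snapshot is L, snapshot)   # independent copy, still in the original order
```

The copy is shallow: the new list is independent, but any mutable elements inside it would still be shared.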

#slicing works for lists as for strings
print (L[1:-1])
print (L[2:])
print(L[:-2])
print(L[1:-1:2])

[2, 3, 4]
[3, 4, 5]
[1, 2, 3]
[2, 4]

List Comprehension

Recall the mathematical notation:

$$L_1 = \left\{x^2 : x \in \{0\ldots 9\}\right\}$$

$$L_2 = \left(1, 2, 4, 8,\ldots, 2^{12}\right)$$

$$M = \left\{x \mid x \in L_1 \text{ and } x \text{ is even}\right\}$$

L1 = [x**2 for x in range(10)] # range(n) produces the numbers 0,...,n-1
L2 = [2**i for i in range(13)]
L3 = [x for x in L1 if x % 2 == 0]
print (L1)
print (L2) 
print (L3)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
[1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]
[0, 4, 16, 36, 64]

[x for x in [x**2 for x in range(10)] if x % 2 == 0]

[0, 4, 16, 36, 64]

words = 'The quick brown fox jumps over the lazy dog'.split()
print(words) 
upper = [w.upper() for w in words]
print(upper)
stuff = [[w.upper(), w.lower(), len(w)] for w in words]
print(stuff)

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
['THE', 'QUICK', 'BROWN', 'FOX', 'JUMPS', 'OVER', 'THE', 'LAZY', 'DOG']
[['THE', 'the', 3], ['QUICK', 'quick', 5], ['BROWN', 'brown', 5], ['FOX', 'fox', 3], ['JUMPS', 'jumps', 5], ['OVER', 'over', 4], ['THE', 'the', 3], ['LAZY', 'lazy', 4], ['DOG', 'dog', 3]]

s = input('Give numbers separated by comma: ')
x = [int(n) for n in s.split(',')]
print(x)

Give numbers separated by comma: 1,2,3,4
[1, 2, 3, 4]

y = s.split(',')
print(y)
print(y[0]+y[1])
print(x[0]+x[1])

['1', '2', '3', '4']
12
3
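The comprehension int(n) for n in s.split(',') raises a ValueError if a field is empty, for example for the input '1,,2'. A slightly more forgiving sketch is shown below; it uses a fixed string instead of input() so that it can be run non-interactively, and the helper name parse_ints is hypothetical.

```python
def parse_ints(text):
    """Split a comma-separated string and convert the non-empty fields to int."""
    fields = [field.strip() for field in text.split(',')]
    return [int(field) for field in fields if field]

print(parse_ints(" 1, 2,3 ,,4 "))
```

Fields that are not numbers at all (such as 'abc') would still raise a ValueError, which is usually the desired behavior.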

#create a vector of 10 zeros
z = [0 for i in range(10)]
print(z)
#create a 10x10 matrix with all 0s
M = [[0 for i in range(10)] for j in range(10)]
#set the diagonal to 1
for i in range(10): M[i][i] = 1
print(M)
#create a list of random integers in [0,99]
import random
R = [random.choice(range(100)) for i in range(10)]
print(R)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[[1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]]
[69, 93, 69, 2, 26, 93, 90, 89, 5, 4]
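The nested comprehension used for the matrix above matters, because lists are objects. A tempting shortcut with the * operator repeats a reference to the same inner list rather than creating new rows. A small sketch of the difference, using a 3x3 matrix to keep the output short:

```python
# Pitfall: the outer * repeats a reference to the SAME inner list
bad = [[0] * 3] * 3
bad[0][0] = 1
print(bad)    # every row shows the change

# The comprehension builds a fresh inner list for each row
good = [[0] * 3 for _ in range(3)]
good[0][0] = 1
print(good)   # only the first row changes
```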

# Removing elements from a list while you iterate it can lead to problems
L = [1,2,4,5,6,8]
for x in L:
    if x%2 == 0:
        L.remove(x)
print(L)

[1, 4, 5, 8]

#Another way to do this:
L = [1,2,4,5,6,8]
L = [x for x in L if x%2 == 1] #creates a new list and rebinds the name L to it
L[:] = [x for x in L if x%2 == 1] #alternative: replaces the contents of the existing list in place
print(L)

[1, 5]

L = [1,2,4,5,6,8]
R =[y for y in L if y%2 == 0]
for x in R: L.remove(x)
print(L)

[1, 5]
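A further option, if the explicit loop is preferred, is to iterate over a copy of the list while removing from the original. This is only a sketch of the idea; for long lists the comprehension is usually the better choice, since each remove() call scans the list.

```python
L = [1, 2, 4, 5, 6, 8]
for x in L[:]:        # L[:] is a shallow copy, so the loop is unaffected by removals
    if x % 2 == 0:
        L.remove(x)
print(L)
```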

Tuples:

Tuples are an immutable type. Like strings, once you create them, you cannot change them. Because they are immutable (and hashable, as long as all of their elements are hashable), you can use them as keys in dictionaries. They are similar to lists in that they are a collection of data, and that data can be of differing types.

# Tuple grocery list

groceries = ('orange', 'meat', 'asparagus', 2.5, True)

print (groceries)

#print(groceries[2])

#groceries[2] = 'milk'

L = [1,2,3]
t = tuple(L)
print(t)
L[1] = 5
print(t)
t[1] = 4

('orange', 'meat', 'asparagus', 2.5, True)
(1, 2, 3)
(1, 2, 3)
(1, 2, 3)
(1, 2, 3)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-56-ead4db6fadf4> in <module>()
     14 L[1] = 5
     15 print(t)
---> 16 t[1] = 4

TypeError: 'tuple' object does not support item assignment
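To illustrate the remark about dictionary keys, here is a small sketch in which tuples of coordinates are used as keys, while an equivalent list is rejected. The dictionary grid and its contents are made up for this example.

```python
# Tuples of hashable values can be dictionary keys
grid = {(0, 0): "start", (2, 3): "goal"}
print(grid[(2, 3)])

# Lists cannot, because they are mutable and therefore unhashable
try:
    grid[[1, 1]] = "wall"
    list_key_failed = False
except TypeError as err:
    list_key_failed = True
    print("TypeError:", err)

# Tuple unpacking assigns the components to separate names
row, col = (2, 3)
print(row, col)
```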

Sets:

A set is an unordered collection of items that cannot contain duplicates. Sets support the same operations as sets in mathematics.

numbers = range(10)
evens = [2, 4, 6, 8]

evens = set(evens)
numbers = set(numbers)

# Use difference to find the odds
odds = numbers - evens

print (odds)

# Note: Set also allows for use of union (|), and intersection (&)

a = [2,1,2,1]
print (a)
a = set(a)
print(a)

[2, 1, 2, 1]
{1, 2}
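The union and intersection operators mentioned in the note above work in the same way as the difference. A short sketch with two made-up sets; sorted() is used only to print the elements in a predictable order, since sets themselves are unordered.

```python
evens = {0, 2, 4, 6, 8}
small = {0, 1, 2, 3, 4}

print(sorted(evens | small))    # union
print(sorted(evens & small))    # intersection
print(sorted(evens - small))    # difference
print(3 in small, 3 in evens)   # membership test
```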

Dictionaries:

A dictionary is a map of keys to values. This is one of the most useful structures. Keys must be unique and immutable (more precisely, hashable).

# A simple dictionary

simple_dict = {'cse012': 'data mining'}

# Access by key
print (simple_dict['cse012'])

data mining

# A longer dictionary
classes = {
    'cse012': 'data mining',
    'cse205': 'object oriented programming'
}

# Check if item is in dictionary
print ('cse012' in classes)

# Add new item
classes['L14'] = 'social networks'
print (classes['L14'])

# Print just the keys
print (list(classes.keys()))

# Print just the values
print (list(classes.values()))

# Print the items in the dictionary
print (list(classes.items()))

# Print dictionary pairs another way
for key, value in classes.items():
    print (key, value)

True
social networks
['cse205', 'L14', 'cse012']
['object oriented programming', 'social networks', 'data mining']
[('cse205', 'object oriented programming'), ('L14', 'social networks'), ('cse012', 'data mining')]
cse205 object oriented programming
L14 social networks
cse012 data mining

for key in classes:
    print (key, classes[key])

cse205 object oriented programming
L14 social networks
cse012 data mining

#change values in a dictionary
classes['L14'] = 'graduate social networks'
print (classes['L14'])

graduate social networks
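Accessing a key that is not in the dictionary with square brackets raises a KeyError. The get() method returns a default value instead, which makes it convenient for counting. A minimal sketch, with a made-up sentence as input:

```python
words = 'the quick brown fox jumps over the lazy dog the end'.split()

counts = {}
for w in words:
    # get(w, 0) returns 0 the first time a word is seen
    counts[w] = counts.get(w, 0) + 1

print(counts['the'])
print(counts.get('cat', 0))   # 'cat' never appears, so the default is returned
```

Calling get() does not insert the missing key, so the dictionary contains only the words that actually occurred.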

# Complex Data structures
# Dictionaries inside a dictionary!

professors = {
    "prof1": {
        "name": "Panayiotis Tsaparas",
        "department": "Computer Science",
        "interests": ["algorithms", "data mining", "machine learning",]
    },
    "prof2": {
        "name": "Yanis Varoufakis",
        "department": "Economics",
        "interests": ["debt", "game theory", "parallel currency",],
    }
}

for prof in professors:
    print (professors[prof]["name"])

Yanis Varoufakis
Panayiotis Tsaparas

professors['prof2']['interests'][1]

'game theory'

Depending on the task that we want to perform, the choice of structure makes a big difference in efficiency. When we repeatedly search a collection for an element, it is important to use a set or a dictionary: a membership test takes constant time in expectation, whereas searching a list takes time linear in its length. (The worst case for a hash-based structure is O(n), but it is rare in practice.) This makes a huge difference when dealing with large datasets.

# The importance of using the right structure:
import random
import time
L = [random.choice(range(1000000)) for i in range(1000)]
t = time.perf_counter()  # time.clock() was removed in Python 3.8
count = 0
for x in range(1000000):
    if x in L:  # linear scan of the list for every x
        count += 1
print (time.perf_counter() - t)

26.21975488869904

S = set(L)
t = time.perf_counter()  # time.clock() was removed in Python 3.8
count = 0
for x in range(1000000):
    if x in S:  # hash lookup, constant time in expectation
        count += 1
print (time.perf_counter() - t)

0.14673384039315351
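The same comparison can be packaged with the timeit module, which is part of the standard library. The sketch below uses far fewer queries than the cells above so that it finishes quickly; the helper count_hits and the chosen sizes are assumptions made for this example, and the measured times will differ from machine to machine.

```python
import random
import timeit

random.seed(0)  # fixed seed only so that repeated runs use the same data
values = [random.randrange(1000000) for _ in range(1000)]
as_list = values
as_set = set(values)
queries = range(5000)

def count_hits(container):
    """Count how many of the queries are present in the container."""
    return sum(1 for q in queries if q in container)

# number=1 runs each measurement once and returns the elapsed seconds
list_time = timeit.timeit(lambda: count_hits(as_list), number=1)
set_time = timeit.timeit(lambda: count_hits(as_set), number=1)

print(count_hits(as_list) == count_hits(as_set))   # both structures give the same answer
print('list:', list_time, 'set:', set_time)
```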

#What structure should we use for storing the edges of a graph with millions of edges?
L = [[1,2],[1,4],[2,5],[3,7],[5,7],[6,8]] # etc...
S = set([tuple(x) for x in L])
print(S)

#What if we have also a weight on the edges, and we want to get the weight?
L = [[1,2,0.2],[1,4,0.5],[2,5,0.8],[3,7,0.1],[5,7,0.4],[6,8,0.9]] 
D = {}
for x in L:
    D[tuple(x[0:2])] = float(x[2])
print(D)

#What if we want to be able to get the neighbors of each node and the weight of the edge?
D = {}
for x in L:
    v1,v2,w = x
    if v1 in D:
        D[v1][v2] = w
    else:
        D[v1] = {}
        D[v1][v2] = w
print(D)
print(D[1][2])

{(1, 2), (6, 8), (5, 7), (1, 4), (3, 7), (2, 5)}
{(1, 2): 0.2, (6, 8): 0.9, (3, 7): 0.1, (2, 5): 0.8, (5, 7): 0.4, (1, 4): 0.5}
{1: {2: 0.2, 4: 0.5}, 2: {5: 0.8}, 3: {7: 0.1}, 5: {7: 0.4}, 6: {8: 0.9}}
0.2
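The if/else that creates the inner dictionary can be avoided with collections.defaultdict from the standard library. The sketch below also stores each edge in both directions, which assumes that the graph is undirected; the original cell stores only the direction v1 to v2.

```python
from collections import defaultdict

edges = [[1, 2, 0.2], [1, 4, 0.5], [2, 5, 0.8], [3, 7, 0.1], [5, 7, 0.4], [6, 8, 0.9]]

adj = defaultdict(dict)   # a missing node gets an empty dict automatically
for v1, v2, w in edges:
    adj[v1][v2] = w
    adj[v2][v1] = w       # assumption: the graph is undirected

print(sorted(adj[1]))     # neighbors of node 1
print(adj[7])             # neighbors of node 7 with the edge weights
print(adj[2][1])          # weight of the edge between 2 and 1
```

With this layout, both the neighbors of a node and the weight of a given edge can be looked up in expected constant time.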

Lambda functions

Python supports the creation of anonymous functions (i.e. functions that are not bound to a name) at runtime, using a construct called "lambda".

def f (x): return x**2
print (f(8))

64

g = lambda x: x**2
print (g(8))

64

The above pieces of code are equivalent to each other! Note that there is no "return" statement in the lambda function. A lambda function does not need to be assigned to a variable; it can be used within the code wherever a function is expected.

f = lambda x, y : x + y
print (f(2,3))

5

def multiply (n): return lambda x: x*n
 
multiply_by_2 = multiply(2)
g = multiply(6)
print (multiply_by_2)
print (multiply_by_2(10), g(10))

<function multiply.<locals>.<lambda> at 0x0000000004E25840>
20 60

multiply(3)(30)

90
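A common place where a function is expected is the key argument of sorted(), min() and max(). The following sketch passes lambda functions directly, without binding them to a name; the list pairs is made up for illustration.

```python
words = 'The quick brown fox jumps over the lazy dog'.split()

# Sort by word length; ties keep their original order because sorted() is stable
by_length = sorted(words, key=lambda w: len(w))
print(by_length)

# Pick the (node, weight) pair with the largest weight
pairs = [(1, 0.2), (4, 0.5), (5, 0.8)]
heaviest = max(pairs, key=lambda p: p[1])
print(heaviest)
```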

The map() function

The advantage of the lambda operator can be seen when it is used in combination with the map() function. map() is a function with two arguments:

r = map(func,s)

func is a function and s is a sequence (e.g., a list). map returns an iterator that applies func to each element of s in turn. The iterator can be traversed only once, so convert it into a list if you want to index it or reuse it.

map() and lambda give functionality very similar to that of list comprehensions.

def dollar2euro(x):
    return 0.89*x
def euro2dollar(x):
    return 1.12*x

amounts= (100, 200, 300, 400)   # amounts in dollars
euros = map(dollar2euro, amounts)
print (euros)
print (list(euros))

<map object at 0x0000000004E188D0>
[89.0, 178.0, 267.0, 356.0]

amounts= (100, 200, 300, 400)   # amounts in euros
dollars = map(euro2dollar, amounts)
print (list(dollars))

[112.00000000000001, 224.00000000000003, 336.00000000000006, 448.00000000000006]

list(map(lambda x: 0.89*x, amounts))

[89.0, 178.0, 267.0, 356.0]
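Because map returns an iterator, its results can be consumed only once. The sketch below shows this, together with the equivalent list comprehension; the doubling function is just a stand-in for any conversion.

```python
amounts = (100, 200, 300, 400)

doubled = map(lambda x: 2 * x, amounts)
first_pass = list(doubled)    # consumes the iterator
second_pass = list(doubled)   # nothing is left to yield
print(first_pass, second_pass)

# The same result as a list comprehension
print([2 * x for x in amounts])
```

If the values are needed more than once, store the list rather than the map object.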

map can also be applied to more than one list. The function must then take one argument per list, and map stops as soon as the shortest list is exhausted.

a = [1,2,3,4,5]
b = [-1,-2,-3, -4, -5] 
c = [10, 20 , 30, 40, 50]

l1 = list(map(lambda x,y: x+y, a,b))
print (l1)
l2 = list(map (lambda x,y,z: x-y+z, a,b,c))
print (l2)

[0, 0, 0, 0, 0]
[12, 24, 36, 48, 60]
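When the lists passed to map have different lengths, no error is raised: the result simply has as many elements as the shortest input. A small sketch (the list short is made up for illustration), with the equivalent formulation using zip:

```python
a = [1, 2, 3, 4, 5]
short = [10, 20]

sums = list(map(lambda x, y: x + y, a, short))
print(sums)

# zip pairs up the elements in the same way
print([x + y for x, y in zip(a, short)])
```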

words = 'The quick brown fox jumps over the lazy dog'.split()
uwords = list(map(lambda w: [w.upper(), w.lower(), len(w)], words))
for t in uwords:
    print (t)

['THE', 'the', 3]
['QUICK', 'quick', 5]
['BROWN', 'brown', 5]
['FOX', 'fox', 3]
['JUMPS', 'jumps', 5]
['OVER', 'over', 4]
['THE', 'the', 3]
['LAZY', 'lazy', 4]
['DOG', 'dog', 3]

The filter() function

The function filter(function, list) keeps only the elements of a list for which the function function returns True, and discards the rest. It returns an iterator.

nums = [i for i in range(100)]
print (nums)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]

even = list(filter(lambda x: x%2==0 and x!=0, nums))
print (even)

[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98]
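As with map, a filter call can usually be written as a list comprehension with a condition. The sketch below checks that the two forms agree on a shorter range, and also shows the special case of passing None as the function, which keeps the elements that are truthy.

```python
nums = list(range(20))

even = list(filter(lambda x: x % 2 == 0 and x != 0, nums))
same = [x for x in nums if x % 2 == 0 and x != 0]
print(even == same)

# With None as the function, filter keeps the truthy elements
truthy = list(filter(None, [0, 1, '', 'a', [], [2]]))
print(truthy)
```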

Command line arguments

To read the command line arguments, use the sys.argv list. The first element of this list is the program name. More sophisticated processing can be done using the getopt.getopt function.

import sys

print ('Number of arguments:', len(sys.argv), 'arguments.')
print ('Argument List:', str(sys.argv))

Number of arguments: 5 arguments.
Argument List: ['C:\\Anaconda3\\lib\\site-packages\\IPython\\kernel\\__main__.py', '-f', 'C:\\Users\\Panayiotis\\.ipython\\profile_default\\security\\kernel-a3b88187-1a4c-44b6-9924-aa3dfaec2c63.json', '--profile-dir', 'C:\\Users\\Panayiotis\\.ipython\\profile_default']
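A minimal sketch of getopt.getopt is given below. Inside a notebook sys.argv contains the kernel's own arguments, as the output above shows, so the example parses a hand-written list that stands in for sys.argv[1:]. The option names -n and --verbose and the file name are made up for this example.

```python
import getopt

# Stand-in for sys.argv[1:]; the options here are hypothetical
argv = ['-n', '5', '--verbose', 'input.txt']

# 'n:' means that -n takes a value; 'verbose' is a long flag without a value
opts, rest = getopt.getopt(argv, 'n:', ['verbose'])
print(opts)   # list of (option, value) pairs
print(rest)   # the remaining positional arguments
```

In a script run from the command line, the first argument of getopt.getopt would be sys.argv[1:].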

Libraries

Python is a high-level open-source language, and the Python world is inhabited by many packages or libraries that provide useful things like array operations, plotting functions, and much more. We can (and we should) import libraries of functions to expand the capabilities of Python in our programs.

There are many Python libraries for data mining, and we will use many of them in this course.

Example: The random library

import random
print (random.choice(range(10))) # generates a random number in the range [0,9]
myList = [2, 109, False, 10, "data", 482, "mining"]
print(random.choice(myList))

1
mining

from random import shuffle # imports a specific function of random that can be used without the random. prefix
x = [i for i in range(10)]
print (x)
shuffle(x)
print (x)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[4, 7, 5, 6, 0, 3, 9, 8, 2, 1]

# some more methods of random
print(random.sample([1,3,4,7,8,9,20],3)) #samples 3 elements uniformly at random
print(random.random()) # a random number in [0,1)
print(random.uniform(-1,1)) #a random number in [-1,1]
print(random.gauss(0,1)) #sample from a gaussian distribution with mean 0 and std 1

[9, 20, 4]
0.8665630576491753
-0.6164878646949343
-0.9080984966255757
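The outputs above change on every run. When an experiment needs to be repeatable, the generator can be initialized with random.seed(). A minimal sketch; the seed value 42 is an arbitrary choice.

```python
import random

random.seed(42)
first = [random.randint(0, 99) for _ in range(5)]    # randint includes both endpoints

random.seed(42)                                      # same seed, same sequence
second = [random.randint(0, 99) for _ in range(5)]

print(first == second)
```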

Including images

from IPython.display import Image
Image(filename = "CSE-UOI-LOGO-EN.jpg",width = 231.4, height = 272.9)
#Image(response.content)