Topics in Database Systems: Data Management in Peer-to-Peer Systems

INTRODUCTION

(Week 1 - 22/2/2005) 


NOTES 
=========================================================================

about the content of the course
--------------------------------
The course will cover issues related to data management in peer-to-peer systems. 
This includes: 
	- topologies, structured and unstructured p2p systems 
	- caching and replication 
	- gossip-based dissemination 
	- distributed query processing

1. What is a p2p system

	nodes at the edge of the network
	connect to share resources (files, cpu, etc)
	popularity started from music file sharing systems

   motivation
	make systems ultra-scalable and ultra-available
	break information monopolies
	replace admininstrative-intensive server-centric 
		systems by self-organization


2. The notion of the "overlay" network

	each node "knows" about (maintains connections with)
	 a small number of other nodes (its neigbors)
	the overlay is not necessarily the same as the
	 physical network (thus, topology-aware overlays)

3. The look-up problem:
	given a data item X stored at some dynamic sets of
	nodes in the system, find it.

4. Issues

	churn (nodes are leaving and entering the system)
	trust
	selfish-behavior (provide incentives)

5. Overview

	centralized (e.g., Napster) problems of scalability & resilience
        hierarchical (as in DNS) problems related to load-balancing and
			failure/removal of nodes high-up in the hierarchy
	symmetric solutions 
	 unstructured
		with and without routing indexes (e.g, Gnutella)
		flooding and variations (e.g., random walks)
		TTL tag & ping-pong 
	 structured 
		Based on DHTs (e.g., CHORD, CAN)
		the name of an item - converted to a numeric key
		the key is mapped to a node
                the DHT provides a single operation: lookup(key)
		either store the actual data items or pointers
			to whether the data items are currently stored
		issues:
		   - map keys to nodes in a load-balanced way
		   - forward a lookup for a key to an appropriate node 
			(to do so, each node must know about some other
			nodes, this information is maintained in routing 
			tables (per node))
		   - handle churn
		Guarantee O(log N) lookup performance 
	 content-based clustering

        pure peer-to-peer systems vs super-peers (e.g., KaZaa)


6. Replication and Caching

	caching (e.g., Freenet) built routing tables using caching

7. Advanced Query Processing

	range queries
	XML/RDF queries

8. Schema Matching