Topics in Database Systems: Data Management in Peer-to-Peer Systems INTRODUCTION (Week 1 - 22/2/2005) NOTES ========================================================================= about the content of the course -------------------------------- The course will cover issues related to data management in peer-to-peer systems. This includes: - topologies, structured and unstructured p2p systems - caching and replication - gossip-based dissemination - distributed query processing 1. What is a p2p system nodes at the edge of the network connect to share resources (files, cpu, etc) popularity started from music file sharing systems motivation make systems ultra-scalable and ultra-available break information monopolies replace admininstrative-intensive server-centric systems by self-organization 2. The notion of the "overlay" network each node "knows" about (maintains connections with) a small number of other nodes (its neigbors) the overlay is not necessarily the same as the physical network (thus, topology-aware overlays) 3. The look-up problem: given a data item X stored at some dynamic sets of nodes in the system, find it. 4. Issues churn (nodes are leaving and entering the system) trust selfish-behavior (provide incentives) 5. Overview centralized (e.g., Napster) problems of scalability & resilience hierarchical (as in DNS) problems related to load-balancing and failure/removal of nodes high-up in the hierarchy symmetric solutions unstructured with and without routing indexes (e.g, Gnutella) flooding and variations (e.g., random walks) TTL tag & ping-pong structured Based on DHTs (e.g., CHORD, CAN) the name of an item - converted to a numeric key the key is mapped to a node the DHT provides a single operation: lookup(key) either store the actual data items or pointers to whether the data items are currently stored issues: - map keys to nodes in a load-balanced way - forward a lookup for a key to an appropriate node (to do so, each node must know about some other nodes, this information is maintained in routing tables (per node)) - handle churn Guarantee O(log N) lookup performance content-based clustering pure peer-to-peer systems vs super-peers (e.g., KaZaa) 6. Replication and Caching caching (e.g., Freenet) built routing tables using caching 7. Advanced Query Processing range queries XML/RDF queries 8. Schema Matching