Topics in Database Systems: Data Management in Peer-to-Peer Systems INTRODUCTION (Week 1 - 19/10/2006) NOTES ========================================================================= about the content of the course -------------------------------- The course will cover issues related to data management in peer-to-peer systems. This includes: - topologies, structured and unstructured p2p systems - caching and replication [- gossip-based dissemination] probably - distributed query processing 1. What is a p2p system nodes at the edge of the network connect to share resources (files, cpu, etc) popularity started from music file sharing systems motivation make systems ultra-scalable and ultra-available break information monopolies replace admininstrative-intensive server-centric systems by self-organization started as file sharing system but also internet-scale storage, DNS servers, search engines, you name it! 2. The notion of the "overlay" network each node "knows" about (maintains connections with) a small number of other nodes (its neigbors) the overlay is not necessarily the same as the physical network (thus, topology-aware overlays) * a network "on top" of the network 3. The look-up problem: given a data item X stored at some "dynamic" sets of nodes in the system, find it. 4. Issues churn (nodes are leaving and entering the system) trust selfish-behavior (provide incentives) performance quality faul-tolerance load balancing 5. Overview centralized (e.g., Napster) problems of scalability & resilience hierarchical (as in DNS) problems related to load-balancing and failure/removal of nodes high-up in the hierarchy symmetric solutions unstructured with and without routing indexes (e.g, Gnutella) flooding and variations (e.g., random walks) TTL tag & ping-pong structured Based on DHTs (e.g., CHORD, CAN) the name of an item - converted to a numeric key the key is mapped to a node the DHT provides a single operation: lookup(key) either store the actual data items or pointers to whether the data items are currently stored issues: - map keys to nodes in a load-balanced way - forward a lookup for a key to an appropriate node (to do so, each node must know about some other nodes, this information is maintained in routing tables (per node)) - handle churn Guarantee O(log N) lookup performance Tree-like structured overlays (based on search trees and variation) issues: - maintaining balance - avoid high loads on top nodes content-based clustering pure peer-to-peer systems vs super-peers (e.g., KaZaa) 6. Replication and Caching Many issues (some still open) how many copies where to place them how to update caching (e.g., Freenet) built routing tables using caching unstructured and structured 7. Advanced Query Processing range queries XML/RDF queries 8. Trust and Reputation