Lectures

Tentative schedule:

| Week | Topic | Notes | Instructor |
|---|---|---|---|
| Week 1 | Introduction | Why not a single machine, big-data challenges, datacenter structure, typical use cases and their requirements, course overview. | All |
| Week 2 | Batch Processing | MapReduce: architecture, components, programming model. | Sahu |
| Week 3 | Batch Processing (cont.) | Storage side: HDFS, HBase. | Sahu |
| Week 4 | Distributed Systems Primer (notes) | Challenges and principles, failure modes, inherent tradeoffs. | Geambasu |
| Week 5 | Communication and Synchronization Building Blocks (RPC notes, clocks notes, mutual exclusion example (slides by Dave Andersen)) | Remote procedure calls, clock synchronization, and logical clocks, all building blocks for distributed algorithms. | Geambasu |
| Week 6 | Hard Problems in Distributed Systems (consensus problem notes, 2PC notes, Paxos notes (we skipped 3PC)) | Consistency, consensus, known impossibility results, approaches to navigate the challenges. | Geambasu |
| Week 7 | Google's Storage Stack | How core problems in distributed systems are solved in the real world. Design of Chubby and Bigtable, two fundamental components of a Google cluster. High-level architecture of a Google cluster. | Geambasu |
| Week 8 | Data Models and Cleaning | Why the relational data model? Why schemas? The ins and outs. Reading: | Wu |
| Week 9 | Cleaning and Integration | Readings | Wu |
| Week 10 | Iterative Processing | Spark, RDD abstractions for in-memory computation; Spark Tachyon, a memory-centric distributed file system. | Sahu |
| Week 11 | Iterative (continued) + Stream Processing | | Sahu |
| Week 12 | Machine Learning Systems & Examples | Introduction to MLlib and/or other ML processing systems. Industry guest. | |
| Week 13 | Classic Query Processing and Fast Query Processing | Classic, analytic, and transaction-oriented query execution. Readings | Wu |
| Week 14 | Potpourri | Mixture of ideas: graph analysis, scalable visualization, distributed transactions. | Wu |