Introduction to Cloud & Big Data Systems
I learnt the most from the Intro to Cloud and Big Data course (16:137:602OC). The class introduced me to massive data sets, data management, and analytics in different environments. It also introduced tools widely used in industry, such as Spark, Hadoop, and Amazon Web Services, along with their Python and Scala APIs. (May 2019)
Course Description:
This course provides a comprehensive study of the system architecture, software environment, enabling technologies, and innovative applications of Cloud and Big Data systems. Special emphasis is placed on giving students the background and hands-on experience needed to make engineering decisions for business and science applications.
This course introduces fundamental concepts and key topics of Cloud and Big Data Systems such as Cloud Computing models and platforms, virtualization, distributed file systems, the MapReduce programming model, Big Data processing frameworks (Apache Hadoop and Spark), new database models, and Big Data analytics platforms and enabling technologies.
In addition, this course will discuss recent technological solutions and research in cloud computing and big data with a focus on bridging the gap between data analytics and data-driven platforms.
This course will require students to work on homework assignments, quizzes, and a final project related to cloud computing and/or big data.
Course Materials:
Course slides and homework serve as the primary class content. Readings and links on the open Web may also be used. This course does not follow a specific textbook because it covers a range of topics and technologies. However, the following (optional) books may be useful.
- Tom White. “Hadoop: The Definitive Guide, Fourth Edition”, O’Reilly, 2015.
- Edward Capriolo, Dean Wampler, Jason Rutherglen. “Programming Hive: Data Warehouse and Query Language for Hadoop”, O’Reilly, 2012.
- Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia. “Learning Spark: Lightning-Fast Big Data Analysis”, O’Reilly, 2015.
- Neha Narkhede, Gwen Shapira, Todd Palino. “Kafka: The Definitive Guide: Real-Time Data and Stream Processing at Scale”, O’Reilly, 2017.
- Martin Kleppmann. “Designing Data-Intensive Applications”, O’Reilly, 2017.
- Jules J. Berman. “Principles of Big Data”, Morgan Kaufmann, 2013.
Course Format:
This is a hybrid course, with virtual meetings for lectures and demonstrations and physical meetings for laboratory sessions and assignment support. Students will be encouraged to participate in online forums and group work over the course of the semester.
With a focus on hands-on cloud computing and big data systems, students will work in small groups on homework assignments, laboratories, quizzes, and a final project.
Outline and Tentative Schedule:
This course has two types of classes:
Monday sessions will focus on online lectures and demonstrations (i.e., teleconference calls via GoToMeeting).
Tuesday sessions will focus on laboratories and assignment support (ARC IML 118/119).
- Lectures: Module contents and demos during teleconference calls. Sessions will be recorded and made available online. The weekly lecture outline is posted in Sakai.
- Labs: Physical meetings with technical support. A lab report will be required for 4 laboratory assignments.
- Quizzes: Simple follow-up questions on the lectures and materials (to be completed within 48 hours of the lesson).
Homework Assignments:
- HW1: Comparative analysis of cloud computing platforms
- HW2: MapReduce/Apache Hadoop
- HW3: In-memory processing/Apache Spark
- HW4: Stream processing
Projects:
Pre-defined group projects will be offered; however, student groups can propose a specific project of interest (arrangements with the instructor are required).
Lectures:
Week 1: Foundations of Distributed Systems
Week 2: Grid and Cloud Computing
Week 3: Public Clouds
Week 4: Private Clouds
Week 5: Introduction to Big Data
Week 6: Big Data Storage/HDFS
Week 7: MapReduce, Apache Hadoop
Week 8: Apache Pig and Hive
Week 9: In-memory Processing, Apache Spark
Week 10: Data streaming (1), Spark Streaming
Week 11: Data streaming (2), Apache Kafka
Week 12: Kafka Demo and Data Services
Week 13: Introduction to Machine Learning
Week 14: Introduction to Apache Cassandra
Grading Policy:
- HW assignments: 44%
- Project: 20%
- Labs: 12%
- Quizzes: 24%
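As a rough illustration of how these weights combine, the sketch below computes a weighted final grade in Python. The four weights come from the policy above; the function name, the 0-100 score scale, and the example scores are assumptions made for illustration, not part of the official grading procedure.

```python
# Weights copied from the grading policy above (percent of the final grade).
WEIGHTS = {"hw": 44, "project": 20, "labs": 12, "quizzes": 24}


def final_grade(scores):
    """Combine per-component scores (each assumed to be 0-100) into a weighted grade."""
    # Sanity check: the published weights should cover the whole grade.
    assert sum(WEIGHTS.values()) == 100
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores, for illustration only.
example = {"hw": 90, "project": 85, "labs": 100, "quizzes": 80}
print(final_grade(example))
```

Because homework carries the largest weight, a ten-point change in the homework average moves the final grade by 4.4 points under this sketch, compared with 1.2 points for the same change in the lab average.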