Politecnico di Torino
Academic Year 2015/16
Big data: architectures and data analytics
Master of science-level of the Bologna process in Electronic Engineering - Torino
Master of science-level of the Bologna process in ICT for Smart Societies - Torino
Master of science-level of the Bologna process in Mathematical Engineering - Torino
Teacher Status SSD Les Ex Lab Tut Years teaching
Garza Paolo A2 ING-INF/05 40 5 15 0 5
SSD CFU Activities Area context
ING-INF/05 6 D - Student's free choice Student's free choice
Subject fundamentals
In the big data era, traditional data management and analytics systems are no longer adequate. Hence, to manage and fruitfully exploit the huge amount of available heterogeneous data, novel data models, programming paradigms, information systems, and network architectures are needed.
The course addresses the challenges arising in the Big Data era, mainly from a data perspective. Specifically, the course covers how to collect, store, retrieve, and analyze big data to mine useful knowledge and insights. It covers not only data models and data analytics but also novel programming paradigms (e.g., Map Reduce, Spark RDDs), distributed systems (e.g., Hadoop), and cloud computing and network infrastructures, and discusses how they can be exploited to help big data scientists extract insights from data.
Laboratory sessions provide hands-on experience with the most widely used open-source products.
Expected learning outcomes
The course aims at providing:
• Knowledge of the main problems and opportunities arising in the big data context, and of the technological characteristics of the infrastructures and distributed systems used to deal with big data (e.g., Hadoop).
• Ability to write distributed programs to process and analyze data by means of novel programming paradigms (Map Reduce and Spark).
• Knowledge of non-relational database systems (e.g., Hive and HBase) and ability to design databases based on non-relational data models.
• Knowledge of the main characteristics of cloud computing platforms and network infrastructures for Big Data applications.
Prerequisites / Assumed knowledge
Basic programming skills (Java language) and basic knowledge of traditional database concepts (i.e., the relational model and the SQL language).
Lectures (51 hours)
• Introduction to Big data: characteristics, problems, opportunities (3 hours)
• Hadoop and its ecosystem: infrastructure and basic components (3 hours)
• Map Reduce programming paradigm (12 hours)
• Spark: Spark Architecture and RDD-based programming paradigm (13.5 hours)
• NoSQL databases: data models, design, and querying (Hive and HBase) (9 hours)
• Data acquisition: Sqoop, Flume, ... (3 hours)
• Data mining and Machine learning libraries: MLlib and Mahout (3 hours)
• Cloud computing platforms and Network infrastructures for Big Data applications (4.5 hours)
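
As an informal illustration of the Map Reduce programming paradigm listed above, the following is a minimal sketch in plain Java (the prerequisite language of the course). It only emulates the map, shuffle, and reduce phases in memory on a word-count task; it does not use the Hadoop API, and the class and method names (WordCountSketch, map, shuffle, reduce, wordCount) are hypothetical names chosen for this example, not course material.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// In-memory emulation of the three Map Reduce phases (assumption: a real
// Hadoop job would distribute these phases across the nodes of a cluster).
public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word of one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(word -> !word.isEmpty())
                .map(word -> Map.entry(word, 1))
                .collect(Collectors.toList());
    }

    // Shuffle phase: group all emitted values by key (sorted by key here).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                TreeMap::new,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    // Reduce phase: combine the list of values of one key into a single count.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Driver: run map on every line, shuffle the pairs, then reduce each group.
    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(pairs).forEach((word, counts) -> result.put(word, reduce(word, counts)));
        return result;
    }

    public static void main(String[] args) {
        // Prints {big=2, clusters=1, data=1}
        System.out.println(wordCount(List.of("big data", "big clusters")));
    }
}
```

The sketch keeps each phase in its own method so that the role of the key-value pairs is visible; in a distributed setting the map and reduce functions would stay conceptually the same, while the framework would take care of the shuffle.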

Laboratory activities (9 hours)
• Development of applications by means of Hadoop, Spark and NoSQL databases (9 hours)
Delivery modes
The course includes 9 hours of laboratory sessions on its main topics (Map Reduce, Spark, HBase, Hive, Sqoop, and MLlib).
Texts, readings, handouts and other learning resources
Reference books:
• Tom White. Hadoop: The Definitive Guide. Third edition. O’Reilly, Yahoo Press, 2012.
• Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia. Learning Spark: Lightning-Fast Big Data Analytics. O’Reilly, 2015.
• Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills. Advanced Analytics with Spark. O’Reilly, 2014.

Copies of the slides used during the lectures, examples of written exams and exercises, and manuals for the activities in the laboratory will be made available. All teaching material is downloadable from the course website or the Portal.
Assessment and grading criteria
The exam consists of a written part and the evaluation of the report on the individual practical assignments given during the course. The written part includes programming exercises (Map Reduce- and RDD-based programming) and questions on the main course topics (technological characteristics of Hadoop and Spark, NoSQL databases and data models, cloud computing and network infrastructures for Big data).

Final syllabus for academic year 2015/16

© Politecnico di Torino
Corso Duca degli Abruzzi, 24 - 10129 Torino, ITALY