Distributed architectures for big data processing and analytics

01TUYSM

Academic year 2019/20

Traditional data analytics and distributed systems are no longer adequate in the big data era. To extract relevant knowledge efficiently from the large volume of heterogeneous data now available, novel data models, programming paradigms, and distributed frameworks are needed. The course addresses the data analytics challenges that arise in this setting. Specifically, it covers the entire big data processing pipeline, introducing state-of-the-art distributed frameworks for big data (e.g., Hadoop and Spark) and the current programming paradigms (e.g., MapReduce and Spark RDDs) used to analyze big data and extract knowledge from it, including by means of distributed machine learning algorithms. The course also covers streaming data analytics using state-of-the-art frameworks (e.g., Kafka, Spark Streaming, and Storm).

The course aims at providing:
• Knowledge of the main problems and opportunities arising in the big data context, and of the technological characteristics of the infrastructures and distributed systems used to deal with big data (e.g., Hadoop and Spark)
• Ability to write distributed programs to process and analyze big data by means of novel programming paradigms: the MapReduce and Spark programming paradigms
• Ability to write distributed programs to process and analyze streaming data
• Knowledge of state-of-the-art machine learning libraries for big data (e.g., MLlib)
• Ability to design a big data pipeline

Prerequisites:
• Object-oriented programming skills
• Knowledge of standard centralized machine learning algorithms

Lectures and classroom exercises (62 hours)
• Introduction to big data: characteristics, problems, opportunities (3 hours)
• Design of a big data pipeline (3 hours)
• Hadoop and its ecosystem: infrastructure and basic components (4.5 hours)
• MapReduce programming paradigm (9 hours)
• Spark architecture and RDD-based programming paradigm (21.5 hours)
• Data streaming analytics (Kafka, Spark Streaming, Storm) (12 hours)
• Data mining and machine learning libraries: MLlib (9 hours)

Laboratory activities (18 hours)
• Development of big data applications by means of Hadoop and Spark (18 hours)
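
As an illustration of the MapReduce programming paradigm listed among the topics above, the following is a minimal single-machine sketch in plain Python. It is not course material and does not use the Hadoop or Spark APIs. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are hypothetical, chosen only to show the three logical steps of a word count: map, shuffle (group by key), and reduce.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key.

    In a real framework this grouping is done by the runtime across
    machines; here it is simulated with an in-memory dictionary.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, for illustration only.
    sample = ["big data needs big clusters", "Spark and Hadoop process big data"]
    print(word_count(sample))
```

In a distributed framework, the map and reduce functions would run in parallel on different partitions of the data, while the sketch above runs them sequentially in one process.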

The course consists of lectures and classroom exercises (62 hours) and laboratory sessions (18 hours). The laboratory sessions focus on the main topics of the course (MapReduce, Spark, and MLlib) and provide hands-on experience with the most widespread open-source, state-of-the-art big data frameworks.

Copies of the slides used during the lectures, examples of written exams and exercises, and manuals for the laboratory activities will be made available. All teaching material is downloadable from the course website and the teaching portal.

Reference books:
• Matei Zaharia, Bill Chambers. Spark: The Definitive Guide (Big Data Processing Made Simple). O'Reilly Media, 2018.
• Advanced Analytics and Real-Time Data Processing in Apache Spark. Packt Publishing, 2018.
• Tom White. Hadoop, The Definitive Guide. (Third edition). O'Reilly Media, 2015.
• Matei Zaharia, Holden Karau, Andy Konwinski, Patrick Wendell. Learning Spark (Lightning-Fast Big Data Analytics). O'Reilly Media, 2015.

Exam: written test (in the classroom).

The exam assesses (i) the students' ability to write distributed programs that process and analyze big data by means of novel programming paradigms (the MapReduce and Spark RDD-based programming paradigms) and (ii) the students' knowledge of the main issues related to big data and of the technological infrastructures and distributed systems used to deal with it.

The exam is a written test that lasts 2 hours and is composed of two parts:
- 2 programming exercises (MapReduce- and RDD-based programming) (27 points)
- 2 multiple choice questions on all the topics addressed during the course (4 points)

The programming exercises evaluate the students' ability to write distributed programs that analyze big data by means of the programming paradigms introduced in the course. The multiple choice questions evaluate knowledge of the theoretical concepts of the course, in particular the characteristics of the main technological infrastructures and distributed systems (Hadoop and Spark) used to deal with big data.

The programming exercises are evaluated on the correctness and efficiency of the proposed solutions. Each multiple choice question is worth two points if the answer is correct and zero points if the answer is wrong or missing.

The exam is open book (notes and books can be used during the exam). The exam is passed if the mark of the written test is greater than or equal to 18 points.

© Politecnico di Torino
Corso Duca degli Abruzzi, 24 - 10129 Torino, ITALY