Master of science-level of the Bologna process in Ingegneria Informatica (Computer Engineering) - Torino
Master of science-level of the Bologna process in Ingegneria Matematica - Torino
Master of science-level of the Bologna process in Ingegneria Elettronica (Electronic Engineering) - Torino
Master of science-level of the Bologna process in Nanotechnologies for ICTs (Nanotecnologie per le ICT) - Torino/Grenoble/Losanna
In the big data era, traditional data management and analytics systems are no longer adequate. Hence, to manage and fruitfully exploit the huge amount of available heterogeneous data, novel data models, programming paradigms, information systems, and network architectures are needed.
The course addresses the challenges arising in the big data era. Specifically, it covers how to collect, store, retrieve, and analyze big data to extract useful knowledge and insights. The course covers not only data models and data analytics but also novel programming paradigms (e.g., MapReduce, Spark RDDs) and discusses how they can be exploited to help data scientists extract insights from big data.
The course aims at providing:
• Knowledge of the main problems and opportunities arising in the big data context and technological characteristics of the infrastructures and distributed systems used to deal with big data (e.g., Hadoop and Spark).
• Ability to write distributed programs to process and analyze data by means of novel programming paradigms: the MapReduce and Spark programming paradigms.
• Knowledge of the (relational and non-relational) database systems that are used to store big data.
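As an illustration of the MapReduce paradigm named in the aims above, the following is a minimal in-memory sketch in plain Java. It is not the Hadoop API and does not run on a cluster: the class `MapReduceSketch` and its methods `map`, `shuffle`, `reduce`, and `wordCount` are hypothetical names chosen for this example, and word counting is assumed as the sample task. It only shows the three logical phases: map emits (key, value) pairs, shuffle groups values by key, and reduce aggregates each group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, JDK-only illustration of the map / shuffle / reduce phases.
class MapReduceSketch {

    // Map phase: emit one (word, 1) pair for every word of an input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: aggregate the list of values of a single key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Driver: apply map to every line, shuffle, then reduce each group.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(emitted).forEach((key, values) -> result.put(key, reduce(key, values)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("big data era", "big data systems")));
    }
}
```

In a real framework the map and reduce calls run in parallel on different nodes and the shuffle moves data over the network; here everything runs sequentially in one JVM.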
Object-oriented programming skills, Java language, and basic knowledge of traditional database concepts (relational model and SQL language).
Lectures (45 hours)
• Introduction to Big data: characteristics, problems, opportunities (3 hours)
• Hadoop and its ecosystem: infrastructure and basic components (3 hours)
• MapReduce programming paradigm (10.5 hours)
• Spark: Spark Architecture and RDD-based programming paradigm (14.5 hours)
• Streaming data analysis: Spark Streaming (6 hours)
• Data mining and Machine learning libraries: Spark MLlib (6 hours)
• Databases for Big data: data models, design, and querying (e.g., HBase) (1.5 hours)
Laboratory activities (15 hours)
• Development of applications by means of Hadoop and Spark (15 hours)
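The RDD-based programming paradigm listed above builds a program as a chain of lazy transformations that is executed only when an action requests a result. The sketch below is an analogy written with `java.util.stream` from the JDK, not with the Spark Java API: intermediate stream operations play the role of transformations and the terminal `collect` plays the role of an action. The class `LazyPipelineSketch`, its methods, and the "LEVEL message" log format are assumptions made for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical, JDK-only analogy of lazy transformations followed by an action.
class LazyPipelineSketch {

    // Counts how many input records the pipeline has actually read.
    static final AtomicInteger recordsRead = new AtomicInteger();

    // "Transformations": only describe the computation, nothing is executed yet.
    // Assumes each line has the form "LEVEL message".
    static Stream<String[]> errorRecords(List<String> logLines) {
        return logLines.stream()
                .peek(line -> recordsRead.incrementAndGet())
                .map(line -> line.split(" ", 2))            // [level, message]
                .filter(fields -> fields[0].equals("ERROR"));
    }

    // "Action": the terminal operation triggers the whole chain and
    // returns the number of occurrences of each error message.
    static Map<String, Long> countByMessage(Stream<String[]> records) {
        return records.collect(
                Collectors.groupingBy(fields -> fields[1], TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> log = List.of(
                "ERROR disk full", "INFO started", "ERROR disk full", "ERROR timeout");

        Stream<String[]> pending = errorRecords(log);
        System.out.println("records read before the action: " + recordsRead.get());

        Map<String, Long> counts = countByMessage(pending);
        System.out.println("records read after the action: " + recordsRead.get());
        System.out.println(counts);
    }
}
```

The counter stays unchanged until `countByMessage` is called, which mirrors the fact that Spark transformations do not process any data until an action is invoked. Unlike an RDD, a Java stream can be consumed only once and is not distributed.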
The course consists of lectures (45 hours) and laboratory sessions (15 hours). The laboratory sessions focus on the main topics of the course (MapReduce, Spark, and MLlib) and allow hands-on activities with the most widespread open-source products.
Copies of the slides used during the lectures, examples of written exams and exercises, and manuals for the activities in the laboratory will be made available. All teaching material is downloadable from the course website or the Portal.
Reference books:
• Matei Zaharia, Bill Chambers. Spark: The Definitive Guide (Big Data Processing Made Simple). O'Reilly Media, 2018.
• Advanced Analytics and Real-Time Data Processing in Apache Spark. Packt Publishing, 2018.
• Tom White. Hadoop, The Definitive Guide. (Third edition). O'Reilly Media, 2015.
• Matei Zaharia, Holden Karau, Andy Konwinski, Patrick Wendell. Learning Spark (Lightning-Fast Big Data Analytics). O’Reilly, 2015.
Exam: Computer-based written test using the PoliTo platform.
The exam aims at assessing (i) the students' ability to write distributed programs to process and analyze big data by means of novel programming paradigms and frameworks (the MapReduce programming paradigm and the Spark RDD-based programming paradigm) and (ii) the students' knowledge of the main issues related to big data and of the technological infrastructures and distributed systems, including scalable relational and non-relational database systems, that are used to deal with big data.
The exam consists of a written online test (online version with Exam+Respondus systems) that lasts 2 hours.
Specifically, the written online test is composed of two parts:
- 2 programming exercises (structured as open questions) based on MapReduce- and Spark-based programming to be solved using the Java language (27 points)
- 2 multiple choice questions on all the topics addressed during the course (4 points).
The programming exercises aim at evaluating the ability of the students to write distributed programs to analyze big data by means of the novel programming paradigms that are introduced in the course.
The multiple choice questions are used to evaluate the knowledge of the theoretical concepts of the course and in particular the knowledge of the characteristics of the main technological infrastructures and distributed systems (Hadoop and Spark), including scalable relational and non-relational database systems, that are used to deal with big data.
The evaluation of the programming exercises is based on the correctness and efficiency of the proposed solutions.
For each multiple choice question, students earn two points if the answer is correct and zero points if the answer is wrong or missing.
The exam is closed book.
- Books, notes, blank sheets, and any other paper material are not allowed.
- Electronic devices of any kind (PC, laptop, mobile phone, calculators, etc.), apart from the PC used to take the test, are not allowed.
The exam is passed if the mark of the test is greater than or equal to 18 points.
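The scoring rules above can be summarized with a short worked example. This is only a sketch: it assumes that the final mark is the plain sum of the points of the two parts, which the text implies but does not state explicitly, and the class `ExamScoreSketch` and its methods are hypothetical names. How the 27 programming points are split between the two exercises is not specified, so they are treated as a single value.

```java
// Hypothetical helper that applies the scoring rules described above.
class ExamScoreSketch {

    static final int MAX_PROGRAMMING_POINTS = 27;   // two programming exercises in total
    static final int MULTIPLE_CHOICE_QUESTIONS = 2; // number of multiple choice questions
    static final int POINTS_PER_CORRECT_ANSWER = 2; // wrong or missing answers give 0
    static final int PASS_THRESHOLD = 18;

    // Assumption: the mark is the sum of the points of the two parts.
    static int totalMark(int programmingPoints, int correctAnswers) {
        if (programmingPoints < 0 || programmingPoints > MAX_PROGRAMMING_POINTS) {
            throw new IllegalArgumentException("programming points must be in [0, 27]");
        }
        if (correctAnswers < 0 || correctAnswers > MULTIPLE_CHOICE_QUESTIONS) {
            throw new IllegalArgumentException("correct answers must be in [0, 2]");
        }
        return programmingPoints + correctAnswers * POINTS_PER_CORRECT_ANSWER;
    }

    // The exam is passed with a mark greater than or equal to 18.
    static boolean isPassed(int mark) {
        return mark >= PASS_THRESHOLD;
    }

    public static void main(String[] args) {
        int mark = totalMark(14, 2); // 14 programming points, both questions correct
        System.out.println(mark + " points, passed: " + isPassed(mark));
    }
}
```

Under this assumption the maximum achievable mark is 27 + 4 = 31 points, and a student with 14 programming points and both multiple choice answers correct reaches exactly the 18-point threshold.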
Exam: Computer lab-based test; computer-based written test using the PoliTo platform.
The exam aims at assessing (i) the students' ability to write distributed programs to process and analyze big data by means of novel programming paradigms and frameworks (the MapReduce programming paradigm and the Spark RDD-based programming paradigm) and (ii) the students' knowledge of the main issues related to big data and of the technological infrastructures and distributed systems, including scalable relational and non-relational database systems, that are used to deal with big data.
The exam consists of a written online test (online version with Exam+Respondus systems) or a written in-lab test (onsite version with the Exam system) that lasts 2 hours.
Specifically, the written online/in-lab test is composed of two parts:
- 2 programming exercises (structured as open questions) based on MapReduce- and Spark-based programming to be solved using the Java language (27 points)
- 2 multiple choice questions on all the topics addressed during the course (4 points).
The programming exercises aim at evaluating the ability of the students to write distributed programs to analyze big data by means of the novel programming paradigms that are introduced in the course.
The multiple choice questions are used to evaluate the knowledge of the theoretical concepts of the course and in particular the knowledge of the characteristics of the main technological infrastructures and distributed systems (Hadoop and Spark), including scalable relational and non-relational database systems, that are used to deal with big data.
The evaluation of the programming exercises is based on the correctness and efficiency of the proposed solutions.
For each multiple choice question, students earn two points if the answer is correct and zero points if the answer is wrong or missing.
The exam is closed book.
- Books, notes, blank sheets, and any other paper material are not allowed.
- Electronic devices of any kind (PC, laptop, mobile phone, calculators, etc.), apart from the PC used to take the test, are not allowed.
The exam is passed if the mark of the test is greater than or equal to 18 points.