01TXGSM

A.A. 2023/24

Course Language

English

Course degree

Master of science-level of the Bologna process in Data Science And Engineering - Torino

Course structure

Teaching | Hours |
---|

Teachers

Teacher | Status | SSD | h.Les | h.Ex | h.Lab | h.Tut | Years teaching |
---|

Teaching assistant

Context

SSD | CFU | Activities | Area context |
---|---|---|---|
MAT/03 | 4 | C - Related or supplementary | Related or supplementary educational activities |
SECS-S/01 | 4 | F - Other activities (art. 10) | Other knowledge useful for entering the job market |

2022/23

The aim of this course is to give students a solid mathematical foundation in Machine Learning (ML) by blending learning theory, geometry, topology and statistics. Starting with the algebraic and geometric structures used to represent and manipulate data, we will move to the more geometrical aspects of ML, with particular attention to the various concepts of dimension and learnability. Linear algebra-based methods will be thoroughly presented. At the same time, (generalized) linear models, their selection, regularization, validation and hyperparameter tuning will be presented in full detail from a rigorous statistical point of view, together with some Bayesian methods. These two aspects of the theory, the geometrical and the statistical one, will be merged via case studies on real and in silico data.

This course aims to give students a solid mathematical foundation in Machine Learning (ML) by blending learning theory, geometry, topology, and statistics. Starting with the algebraic and geometric structures used to represent and manipulate data, we will move to the more geometrical aspects of ML, with particular attention to the various concepts of dimension and learnability. Linear algebra-based methods will be thoroughly presented. At the same time, (generalized) linear models, their selection, regularization, and validation will be presented in full detail from a rigorous statistical point of view, together with some Bayesian methods. These two aspects of the theory, the geometrical and the statistical one, will be merged via case studies on data.

The student will learn the basic concepts of machine and statistical learning from both the frequentist and the Bayesian viewpoint, the main techniques for multivariate data and the critical use of specialised software (R, SAS, BUGS, Stan, MATLAB, Orange, Python, RapidMiner and the like), so as to be able to assess the pros and cons of each.

The student will learn the basic concepts of machine and statistical learning from both the frequentist and the Bayesian viewpoint, the main techniques for multivariate data, and the critical use of specialized software (R, SAS, BUGS, Stan, MATLAB, Orange, Python, RapidMiner and the like), so as to be able to assess the pros and cons of each.

The prerequisites for this course are knowledge of basic probability theory and statistics; linear algebra, in particular the SVD; and the basics of metric geometry and calculus.

The prerequisites for this course are knowledge of basic probability theory and statistics; linear algebra, in particular the SVD; basic metric geometry; and calculus.

• Mathematical representations of data: spaces (including Hilbert spaces), metrics, distances, dissimilarities and kernels. Geometry of very high dimensional spaces and the curse of dimensionality.
• Learning theory: PAC learning, Rademacher complexity and VC dimension. Bias-variance trade-off and model complexity.
• Cross validation, bootstrap and applications.
• Linear algebra-based methods: Principal Component Analysis, Linear Discriminant Analysis, Independent Component Analysis and stochastic projections (Johnson-Lindenstrauss transform).
• Linear Models (regression, ANOVA, DOE).
• Generalized linear models (categorical data, logistic and multinomial regression).
• Model and feature selection, hyperparameter tuning (e.g. lasso, AIC, BIC, ridge).
• Bayesian networks (basic concepts, exact and MCMC-based computations).

• Mathematical representations of data: spaces (including Hilbert spaces), metrics, distances, dissimilarities, and kernels. The geometry of very high dimensional spaces and the curse of dimensionality.
• Learning theory: PAC learning and VC dimension. Bias-variance trade-off and model complexity.
• Cross-validation, bootstrap, and applications. Ensemble methods: bagging, random forest, and boosting.
• Linear algebra-based methods: Principal Component Analysis; Linear Discriminant Analysis; stochastic projections (Johnson-Lindenstrauss transform); support vector machines and kernel methods.
• Linear Models (regression, ANOVA, DOE).
• Generalized linear models (categorical data, logistic and multinomial regression).
• Model and feature selection (e.g. lasso, AIC, BIC, ridge).
• Bayesian networks (basic concepts, exact and MCMC-based computations).
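
As an informal illustration of the "curse of dimensionality" item in the list above, the following minimal Python sketch (not part of the official course material) measures how the relative gap between the farthest and the nearest neighbour of a random query point shrinks as the dimension grows. The function name `relative_contrast`, the uniform sampling in the unit cube, and the sample sizes are assumptions chosen only for this example.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (max - min) / min over the Euclidean distances from a random
    query point to n_points points drawn uniformly in the unit cube [0, 1]^dim.

    Hypothetical helper for illustration; uniform sampling is an assumption.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    # The contrast is expected to decrease as the dimension increases.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

In low dimension the nearest point is much closer than the farthest one, while in high dimension all distances tend to concentrate around a common value, which is one reason distance-based methods need care on high-dimensional data.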

In the first part of the course, lectures are given with the support of slides, and exercises are presented and solved in class. In the final part of the course, lessons will mainly consist of activities carried out in the computer lab under the guidance of the teacher. Technical discussions during lectures will also help to assess the level of knowledge and ability acquired at the different stages of the course.

There will be 60 hours of lessons and 20 hours of practice. Exercises are presented and solved in the class.

Slides of the lectures, examples of R and Python scripts, and exercises with solutions will be available on the course website. A list of suggested books will also be provided by the teacher during the first lecture.

Slides of the lectures, examples of R and Python scripts, and exercises with solutions will be available on the course website.
A list of suggested books:
- An Introduction to Statistical Learning with Applications in R.
James, G., Witten, D., Hastie, T., Tibshirani, R.
Springer.
- Understanding Machine Learning: From Theory to Algorithms.
Shalev-Shwartz, S., Ben-David, S.
Cambridge University Press, 2014.
- Data Science and Machine Learning: Mathematical and Statistical Methods.
Kroese, D. P., Botev, Z. I., Taimre, T., Vaisman, R.
CRC Press, 2019, 510 pages.

...
The goal of the exam is to test the knowledge of the candidate about the topics included in the official program and to test their skills in analyzing data using the methods explained in the course.
The exam consists of a written examination and an optional oral examination.
The written examination consists of 3 exercises. Two of the exercises will be similar to those presented during the lectures and will consist of modeling some practical problems or providing a suitable model for a given dataset.
The third will be conceptual.
The written exam lasts two hours; during the test, students may use textbooks, their own notes, and formularies provided by the teacher during the year.
The maximum possible score will be 27/30.
The oral exam is available on request to students who obtain a passing mark in the written exam (greater than or equal to 18/30).
It consists of the discussion of a short technical presentation on the analysis of a data set carried out with the methods taught in the course. The analysis is software independent, i.e. students may use Orange, R, MATLAB, RapidMiner, Python, C++, etc., according to their knowledge and preference. The students will be asked methodological and theoretical questions related to the methods used in their analysis.
After the oral test, the mark obtained in the first part of the exam can be increased or decreased by no more than 6 points.

In addition to reporting through the online procedure, students with disabilities or Specific Learning Disorders (SLD) are invited to also inform the professor in charge of the course directly, at least one week before the start of the exam session, of the compensatory tools agreed with the Special Needs Unit, so that the professor can apply them in the way best suited to the specific type of exam.

The goal of the exam is to test the knowledge of the candidate about the topics included in the official program and to test their skills in analyzing data using the methods explained in the course.
The exam consists of a written examination and an optional oral examination.
The written examination consists of 3 exercises. Two of the exercises will be similar to those presented during the lectures and will consist of modeling some practical problems or providing a suitable model for a given dataset.
The third will be conceptual.
The written exam lasts two hours; during the test, students may use textbooks, their own notes, and formularies provided by the teacher during the year.
The maximum possible score will be 28/30.
The oral exam can be requested by the student (or by the professor) for students who obtain a passing mark in the written exam (greater than or equal to 18/30).
The students will be asked methodological and theoretical questions related to the course’s contents.
After the oral test, the mark obtained in the first part of the exam can be increased or decreased by no more than 6 points.

In addition to the message sent by the online system, students with disabilities or Specific Learning Disorders (SLD) are invited to directly inform the professor in charge of the course about the special arrangements for the exam that have been agreed with the Special Needs Unit. The professor has to be informed at least one week before the beginning of the examination session in order to provide students with the most suitable arrangements for each specific type of exam.

© Politecnico di Torino

Corso Duca degli Abruzzi, 24 - 10129 Torino, ITALY
