
Data spaces/Statistical models

04RLONG

Academic year 2022/23

Course Language

English

The main objective of this course is to give students a solid mathematical foundation in the major techniques used in supervised and unsupervised statistical (machine) learning, with a special focus on their geometrical aspects.
- Knowledge and understanding of the main learning techniques: detailed knowledge of the mathematics behind the most popular learning techniques; familiarity with the limitations of the various techniques; awareness of structural problems such as the curse of dimensionality.
- Practical application of the acquired knowledge: ability to identify the domain of applicability of the various techniques with respect to the nature of the data; ability to extract information from real and simulated data by applying the learned techniques, either with existing software or through software development.
Students are assumed to know the topics covered by the standard mathematics courses of the B.Sc. in Engineering. Knowledge of basic probability and statistics is also required: probability density functions, the normal distribution, expectation, mean, variance and covariance. SVD will be explained during the course.
GENERALITIES ON DATA REPRESENTATION: Metric and topological spaces. Coordinatization. Distances, dissimilarities, and kernels. The curse of dimensionality: the Law of Large Numbers; the Geometry of High Dimensions: properties of the Unit Ball; Generating Points Uniformly at Random from a Ball; Gaussians in High Dimension; Random Projection and Johnson-Lindenstrauss Lemma.
LINEAR ALGEBRA BASED METHODS: SVD. Principal Components Analysis. Independent Component Analysis.
STATISTICAL LEARNING: What Is Statistical Learning? Why Estimate f? How Do We Estimate f? The Trade-Off Between Prediction Accuracy and Model Interpretability. The Bias-Variance Trade-Off. PAC, Rademacher and VC dimension. Supervised Versus Unsupervised Learning. Regression Versus Classification Problems. Assessing Model Accuracy. Measuring the Quality of Fit.
LINEAR REGRESSION: Simple Linear Regression. Multiple Linear Regression.
CLASSIFICATION: An Overview of Classification. Logistic Regression. Multiple Logistic Regression. Logistic Regression for >2 Response Classes. Linear Discriminant Analysis. Quadratic Discriminant Analysis. Comparison of Classification Methods. K-Nearest Neighbours.
RESAMPLING METHODS: Cross-Validation. Leave-One-Out Cross-Validation. k-Fold Cross-Validation. Cross-Validation on Classification Problems. The Bootstrap. Mathematical justification of these methods.
TREE-BASED METHODS: The Basics of Decision Trees and Regression Trees as optimization and combinatorial problems. Bagging, Random Forests, Boosting.
SUPPORT VECTOR MACHINES: Classification Using a Separating Hyperplane. The Maximal Margin Classifier. Construction of the Maximal Margin Classifier. The Non-separable Case. Support Vector Classifiers. Support Vector Machines. SVMs with More than Two Classes. OVO and OVA. Relationship to Logistic Regression. Kernel Methods.
Lessons, exercise classes and laboratory sessions will be given. There will be three hours of lessons per week plus one and a half hours of exercises or further lessons. The latter are split into two groups: one for mathematical engineering and the other for software engineering.
Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, An Introduction to Statistical Learning with Applications in R. https://www.amazon.it/Introduction-Statistical-Learning-Applications/dp/1461471370/ref=sr_1_1?ie=UTF8&qid=1474898531&sr=8-1&keywords=An+Introduction+to+Statistical+Learning+with+Applications+in+R Freely available at http://www-bcf.usc.edu/~gareth/ISL/
Exam: Compulsory oral exam; Individual essay;
The goal of the exam is to test the candidate's knowledge of the topics included in the official program and their skill in analysing data using the methods explained in the course. The exam consists of two parts. First, the candidate writes a technical report ("tesina") on the analysis of a data set carried out with the methods taught in the course. The choice of software is free: one can use Orange, R, Matlab, Rapidminer, Python, C++, etc., according to one's knowledge or preference. Once the "tesina" has been approved by the professor, the student is allowed to present it in an oral exam (about 20 minutes), during which the professor will also ask questions on the theoretical aspects of the methods used in the tesina. Sample work from previous years will be provided on the website. CAVEAT: students following this course as a submodule of Statistical Models will take the exam according to the rules set for that course.
In addition to the message sent by the online system, students with disabilities or Specific Learning Disorders (SLD) are invited to directly inform the professor in charge of the course about the special arrangements for the exam that have been agreed with the Special Needs Unit. The professor has to be informed at least one week before the beginning of the examination session in order to provide students with the most suitable arrangements for each specific type of exam.