Empirical investigation of the usage of large pre-trained models in open source projects
keywords ARTIFICIAL INTELLIGENCE, BIG DATA ANALYSIS, BIG TECH, DATA MINING, MACHINE LEARNING, NATURAL LANGUAGE PROCESSING, OPEN SOURCE
Reference persons ANTONIO VETRÒ
Research Groups DAUIN - GR-16 - SOFTWARE ENGINEERING GROUP - SOFTENG, DAUIN - GR-22 - Nexa Center for Internet & Society - NEXA
Thesis type DATA ANALYSIS, DATA MINING, OPEN SOURCE SOFTWARE, RESEARCH / EXPERIMENTAL, SOFTWARE DEVELOPMENT
Description Large-scale pre-trained models (such as BERT and GPT) have brought relevant achievements to the field of natural language processing (NLP). However, they have also created a strong dependence on the few actors in the world that have the resources to build them, widening the divide between big tech corporations and elite universities on one side, and companies and universities of medium and smaller size on the other. The main goal of the thesis is to mine open source repositories to understand the level of diffusion of (and thus dependence on) the most common large-scale pre-trained models.
Initial readings include: https://arxiv.org/pdf/2003.08271.pdf, https://doi.org/10.1016/j.aiopen.2021.08.002, https://www.youtube.com/watch?v=Fmi3fq3Q3Bo
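As a first idea of what the mining task could look like (a minimal sketch only; the detection patterns, the library names, and the assumption that repositories are already cloned locally are illustrative, not part of the proposal), one could scan a repository's source files for imports and identifiers associated with common pre-trained models:

```python
import re
from collections import Counter
from pathlib import Path

# Illustrative detection patterns: libraries and model families commonly
# used to load large pre-trained models. A real study would need a far
# more careful, validated pattern set.
MODEL_PATTERNS = {
    "huggingface-transformers": re.compile(r"\b(?:from|import)\s+transformers\b"),
    "bert": re.compile(r"\bbert", re.IGNORECASE),
    "gpt": re.compile(r"\bgpt", re.IGNORECASE),
}

def scan_repository(repo_path):
    """Count, for each pattern, how many Python files in the repo mention it."""
    counts = Counter({name: 0 for name in MODEL_PATTERNS})
    for py_file in Path(repo_path).rglob("*.py"):
        try:
            text = py_file.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # skip unreadable files
        for name, pattern in MODEL_PATTERNS.items():
            if pattern.search(text):
                counts[name] += 1
    return counts
```

Aggregating such per-repository counts over a large sample of projects (for example, cloned via git from a platform like GitHub) would give a first rough estimate of how widely each model family has diffused.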
Required skills The thesis requires very good development skills, knowledge of fundamental NLP and ML techniques, and knowledge of how to interact with repositories (e.g., git). A grade point average of 26 or higher can play a relevant role in the selection.
Notes When sending your application, we kindly ask you to attach the following information:
- the list of exams taken in your master's degree, with grades and grade point average
- a résumé or equivalent (e.g., LinkedIn profile), if you already have one
- the date by which you aim to graduate and an estimate of the time you can devote to the thesis in a typical week
Deadline 30/11/2023