Clustering webpages for realistic experiments on the Internet
Keywords CLUSTERING, INTERNET, MACHINE LEARNING, WEB
Reference persons MARTINO TREVISAN, LUCA VASSIO
Research Groups SmartData@PoliTO, Telecommunication Networks Group
Thesis type EXPERIMENTAL THESIS
Description Experimenting with networked systems is fundamental for developing novel techniques, assessing the impact of design choices, and improving users' Quality of Experience. Web testing typically relies on lists of popular websites, e.g., the Alexa rank (https://www.alexa.com/topsites), which however offer only the homepages of the target websites. This is a strong limitation, as websites are known to have a diverse webpage structure depending, for example, on the subsections in which content is organized. The goal of this thesis is to develop a system able to select a subset of the pages of a website so that they are representative of the diversity of its internal structure. To this end, it is necessary to leverage Data Science and Machine Learning techniques, clustering above all, to group similar pages together and choose the right representatives, and the right number of them. Using open datasets, and collecting additional data if needed, the student will apply Machine Learning tools to achieve this goal, using Big Data approaches if the dataset becomes large.
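As a rough illustration of the idea described above, the following minimal sketch groups pages by structure and picks one representative per group. The proposal does not prescribe a method, so everything here is an assumption: pages are described by the relative frequency of a few HTML tags, grouping uses plain k-means with a fixed k, and the names TagCounter, tag_profile, kmeans and representatives are hypothetical. Choosing the number of clusters automatically, which is part of the thesis goal, is left open.

```python
import math
import random
from collections import Counter
from html.parser import HTMLParser

# Assumed feature set: a handful of common tags (illustrative choice only).
VOCABULARY = ("a", "p", "img", "table", "tr", "td", "form", "input")


class TagCounter(HTMLParser):
    """Counts opening tags, a crude proxy for page structure."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def tag_profile(html, vocabulary=VOCABULARY):
    """Return the page's relative tag frequencies over a fixed vocabulary."""
    parser = TagCounter()
    parser.feed(html)
    total = sum(parser.counts.values()) or 1  # avoid division by zero
    return [parser.counts[tag] / total for tag in vocabulary]


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def representatives(pages, k):
    """Cluster pages (url -> html) and pick the page nearest each centroid."""
    urls = list(pages)
    points = [tag_profile(pages[u]) for u in urls]
    labels, centroids = kmeans(points, k)
    reps = {}
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            best = min(members, key=lambda i: math.dist(points[i], centroids[c]))
            reps[c] = urls[best]
    return dict(zip(urls, labels)), reps


# Toy input: two text-heavy pages and two table-heavy pages (made-up HTML).
pages = {
    "/news/1": "<p>x</p>" * 5 + "<a>l</a>",
    "/news/2": "<p>x</p>" * 4 + "<a>l</a><img>",
    "/stats/1": "<table>" + "<tr><td>1</td><td>2</td></tr>" * 3 + "</table>",
    "/stats/2": "<table>" + "<tr><td>1</td></tr>" * 4 + "</table>",
}
labels, reps = representatives(pages, k=2)
print(labels)
print(sorted(reps.values()))
```

On this toy input the two text-heavy pages and the two table-heavy pages end up in separate clusters, with one page from each returned as representative. A real system would likely need richer features (for example DOM depth, page size, or embedded resources) and a criterion for selecting k.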
Deadline 13/09/2023