TEACHING PORTAL

Clustering webpages for realistic experiments

keywords CLUSTERING, MACHINE LEARNING, WEB

Reference persons DANILO GIORDANO, MARTINO TREVISAN, LUCA VASSIO

Research Groups SmartData@PoliTO, Telecommunication Networks Group

Thesis type EXPERIMENTAL THESIS

Description Experimenting with networked systems is fundamental for developing novel techniques, assessing the impact of design choices, and improving users' Quality of Experience. Web testing is typically done using lists of popular websites, e.g., the Alexa rank (https://www.alexa.com/topsites), which, however, only offer the homepages of the target websites. This is a strong limitation, as websites are known to have a diverse webpage structure depending, for example, on the subsections in which content is organized. The goal of this thesis is to develop a system able to select a subset of the pages of a website so that they are representative of the diversity of its internal structure. To this end, it is necessary to leverage Data Science and Machine Learning techniques, clustering above all, to group together similar pages and choose the right representatives, and the right number of them. Using open datasets, and collecting additional data if needed, the student will apply Machine Learning tools to achieve this goal, using Big Data approaches if the dataset becomes large.
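
As a rough illustration of the idea described above (group structurally similar pages, then keep one representative per group), here is a minimal sketch. It is not the method the thesis prescribes. It assumes that a page's structure can be summarized by its normalized HTML tag frequencies, that a simple k-medoids procedure is an acceptable clustering technique, and that the number of clusters k is given by hand. All names (TagCounter, tag_profile, distance, cluster_pages) and the toy pages are hypothetical. Only the Python standard library is used so the sketch runs anywhere; a real study would more likely rely on scikit-learn and pandas, and would also have to choose k.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags; a crude stand-in for a page's structure."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def tag_profile(html):
    """Return the page's tag frequencies, normalized to sum to 1."""
    parser = TagCounter()
    parser.feed(html)
    total = sum(parser.counts.values()) or 1  # avoid dividing by zero
    return {tag: n / total for tag, n in parser.counts.items()}


def distance(p, q):
    """Euclidean distance between two sparse tag profiles."""
    tags = sorted(set(p) | set(q))  # sorted for a reproducible sum order
    return math.sqrt(sum((p.get(t, 0.0) - q.get(t, 0.0)) ** 2 for t in tags))


def assign(urls, profiles, medoids):
    """Map each medoid to the pages that are closest to it."""
    clusters = {m: [] for m in medoids}
    for u in urls:
        nearest = min(medoids, key=lambda m: distance(profiles[u], profiles[m]))
        clusters[nearest].append(u)
    return clusters


def cluster_pages(pages, k, max_iter=20):
    """Simple k-medoids: returns {representative URL: [member URLs]}."""
    urls = sorted(pages)
    if not 1 <= k <= len(urls):
        raise ValueError("k must be between 1 and the number of pages")
    profiles = {u: tag_profile(pages[u]) for u in urls}

    # Farthest-first initialization: deterministic and spreads the medoids.
    medoids = [urls[0]]
    while len(medoids) < k:
        candidates = [u for u in urls if u not in medoids]
        medoids.append(max(
            candidates,
            key=lambda u: min(distance(profiles[u], profiles[m]) for m in medoids),
        ))

    clusters = assign(urls, profiles, medoids)
    for _ in range(max_iter):
        # New medoid = member with the smallest total distance to its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(distance(profiles[c], profiles[o])
                                           for o in members))
            for members in clusters.values() if members
        ]
        if set(new_medoids) == set(clusters):
            break  # representatives stopped changing
        clusters = assign(urls, profiles, new_medoids)
    return clusters


# Toy website (made-up pages): two text-heavy articles, two tabular listings.
pages = {
    "/news/1": "<html><body><h1>A</h1><p>x</p><p>y</p><p>z</p></body></html>",
    "/news/2": "<html><body><h1>B</h1><p>x</p><p>y</p><p>z</p><p>w</p></body></html>",
    "/shop/a": "<html><body><table><tr><td>1</td><td>2</td></tr>"
               "<tr><td>3</td><td>4</td></tr></table></body></html>",
    "/shop/b": "<html><body><table><tr><td>1</td><td>2</td></tr></table></body></html>",
}

clusters = cluster_pages(pages, k=2)
for representative, members in clusters.items():
    print(representative, "represents", members)
```

The keys of the returned dictionary are the selected representative pages. Medoids are used here instead of centroids because a medoid is always an actual page, which fits the goal of picking real pages to test against.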

Required skills Machine Learning and Data Science
Python programming, using the scikit-learn and pandas libraries
Networking fundamentals: HTTP, HTML, TCP


Deadline 29/07/2021




© Politecnico di Torino
Corso Duca degli Abruzzi, 24 - 10129 Torino, ITALY