Enhancing Feature Space through Scraping for Gemeente Amsterdam
Jan 2022 ~ MSc Course "Data Systems Project"
						Length:    1mo (at 1.0 FTE)
							Programming language:    Python (pandas, requests, Beautiful Soup, re, NLTK, math,
							datetime, GeoPandas, SciPy, scikit-learn)
							Software:    Tableau 
							Data:  
							   - BAG dataset (Basisregistratie Adressen en Gebouwen = basic registration of
							addresses and buildings), containing information about every building in Amsterdam,
							such as the address, the neighborhood, and the function of the building
							   - Unstructured data on the Internet
							Problem description: 
							Enhance the feature space for fire effect modeling through web scraping and design a
							dashboard to visualize the results
							Approach & Results: 
							The municipality of Amsterdam is responsible for setting up fire safety inspections for the
							buildings in Amsterdam. Because there are over 500,000 buildings in the city, the
							municipality created a ranking that sorts all properties in descending order of their
							risk score, defined as chance × effect. During one of the meetings with Gemeente Amsterdam,
							a gap in the effect score was noticed. The proposed solution therefore builds on the idea
							that unstructured information publicly available online can contribute to a more accurate
							effect score.
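
							The ranking described above can be sketched as follows. This is a minimal illustration
							only: the building IDs and the chance and effect values are made up, and the function
							name rank_by_risk is hypothetical, not taken from the municipality's system.

```python
# Hypothetical buildings with made-up chance and effect values.
buildings = [
    {"id": "A", "chance": 0.02, "effect": 40.0},
    {"id": "B", "chance": 0.10, "effect": 5.0},
    {"id": "C", "chance": 0.05, "effect": 30.0},
]


def rank_by_risk(buildings):
    """Return the buildings sorted by risk = chance * effect, highest first."""
    scored = [{**b, "risk": b["chance"] * b["effect"]} for b in buildings]
    return sorted(scored, key=lambda b: b["risk"], reverse=True)


ranking = rank_by_risk(buildings)
for b in ranking:
    print(b["id"], round(b["risk"], 2))
```

							A more accurate effect score changes the product and can therefore move a building up or
							down in this ordering.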
							
							The infrastructure of the proposed system can be seen above. The system is composed of two
							main parts: scraping and visualization. The scraping part starts by extracting the names of
							public assets within Amsterdam from Wikipedia; these form the rows of the first dataset.
							Since these assets are public, it was assumed that their online popularity reflects their
							real-world popularity. Thus, extra features from Wikipedia, Tripadvisor, Google, and
							Flickr were scraped using Beautiful Soup to represent the interest in the respective objects.
							The described process is depicted in the upper branch of the diagram, following the black
							arrows into the Monuments (POI) dataset.
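
							The name-extraction step can be illustrated with a small sketch. The project used
							Beautiful Soup; the example below swaps in Python's standard-library html.parser so that
							it runs without extra dependencies. The sample HTML, the class AssetNameParser, and the
							rule "keep links whose href starts with /wiki/" are assumptions for illustration, not
							the project's actual scraping logic.

```python
from html.parser import HTMLParser

# Hand-written sample standing in for a downloaded list page (hypothetical).
SAMPLE_HTML = """
<ul>
  <li><a href="/wiki/Rijksmuseum" title="Rijksmuseum">Rijksmuseum</a></li>
  <li><a href="/wiki/Anne_Frank_House" title="Anne Frank House">Anne Frank House</a></li>
  <li><a href="#cite_note-1">[1]</a></li>
</ul>
"""


class AssetNameParser(HTMLParser):
    """Collect the link text of anchors that point to article pages."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_article_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            # Assumption: article links start with /wiki/; footnotes do not.
            self._in_article_link = href.startswith("/wiki/")

    def handle_data(self, data):
        if self._in_article_link and data.strip():
            self.names.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_article_link = False


def extract_asset_names(html):
    """Return the names of all article links found in the given HTML."""
    parser = AssetNameParser()
    parser.feed(html)
    return parser.names


names = extract_asset_names(SAMPLE_HTML)
print(names)
```

							Each extracted name would then become one row of the POI dataset, to which the
							popularity features from the other sources are attached.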
							
							Since non-public buildings are also important, their addresses were taken from the BAG
							dataset, and the value of each feature was computed by aggregating the corresponding
							variables of the public assets in their vicinity. This is shown in the diagram by the blue
							arrows. Finally, the two derived datasets are visualized in a Tableau dashboard that allows
							the user to apply various filters and to give specific attributes more weight if desired.
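
							The vicinity aggregation for BAG buildings can be sketched as below. This is an
							illustration under stated assumptions: the 500 m radius, the use of a plain sum, the
							haversine distance, the feature name "reviews", and all coordinates and values are
							hypothetical; the source does not specify how the aggregation was defined.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def aggregate_nearby(building, pois, feature, radius_m=500.0):
    """Sum one POI feature over all POIs within radius_m of the building."""
    return sum(
        poi[feature]
        for poi in pois
        if haversine_m(building["lat"], building["lon"], poi["lat"], poi["lon"]) <= radius_m
    )


# Made-up building and POIs; "reviews" is a hypothetical popularity feature.
building = {"lat": 52.3700, "lon": 4.8900}
pois = [
    {"name": "poi_1", "lat": 52.3710, "lon": 4.8900, "reviews": 100},   # about 110 m away
    {"name": "poi_2", "lat": 52.3700, "lon": 4.8950, "reviews": 50},    # about 340 m away
    {"name": "poi_3", "lat": 52.4000, "lon": 4.8900, "reviews": 1000},  # over 3 km away
]

nearby_reviews = aggregate_nearby(building, pois, "reviews")
print(nearby_reviews)
```

							Other choices, such as a distance-weighted sum or a maximum, would fit the same
							structure by changing only the aggregation inside aggregate_nearby.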
							
							The final dashboard can be accessed at:
							https://public.tableau.com/app/profile/fabian4248/viz/GroupD1_16439031396800/DOCUMENTATION?publish=yes
						
