The Battle of the Neighborhoods — Applied Data Science Capstone by IBM & Coursera

Charlotte Chang
May 17, 2021
Photo Credit — londoneye.com

1. Introduction

1.1. Background

London is one of the biggest cities in the world, with a population of over 8 million people as of the 2011 Census. The city has attracted people from all over the world and has become one of the most ethnically diverse cities anywhere. It is a melting pot of ethnicities, cultures and backgrounds. Asians are one of its largest minority ethnic groups: as per the 2011 Census, they make up 18.4% of London's total population. This includes many Chinese residents, both immigrants and those born and raised in the UK. As such, it is no surprise that Chinese cuisine has been increasing in popularity. London even has its very own Chinatown, located in Westminster.

1.2. Business Problem

Our business problem is to capitalize on this increasing demand for and interest in Chinese cuisine by opening a Chinese restaurant. The first thing to think about when opening a new restaurant is location, so the purpose of this project is to determine which neighborhood would be best suited for a new Chinese restaurant. To do so, we will analyze demographic data for the London boroughs, explore the venues in each neighborhood, and cluster the neighborhoods.

2. Data Acquisition

2.1. Data Sources

We will be using 3 main data sources:

  • Firstly, we want to figure out which boroughs have the highest Chinese population, since one of the assumptions we make in this project is that the demand for Chinese food comes predominantly from the Chinese community. Thus, we need demographic data for London. We can obtain this information from the Wikipedia page "Ethnic Groups in London".
  • Next, we choose the five boroughs with the highest Chinese population and obtain a list of neighborhoods and their postal codes for those boroughs. The postal codes are then used to get the coordinates (latitude and longitude) of each neighborhood, which we will need later on. We can obtain this information from the Wikipedia page "List of Areas of London".
  • Lastly, we explore each neighborhood to analyze which venues it contains, how many Chinese restaurants there are, and which venue categories are the most common. This will guide our final recommendation on the neighborhood that is the most suitable for a new Chinese restaurant. We can obtain this information from the Foursquare Developer API.

3. Methodology

The main dataframe we will be using contains the following data:

  • Neighborhood, Borough, PostCode, Latitude and Longitude

3.1. Selecting the boroughs to analyze

There are over 300 neighborhoods in London. To reduce the amount of data processing, we focus our analysis on the 5 boroughs with the highest Chinese population. We assume that the Chinese population makes up the majority of the market for Chinese cuisine. In reality, people of other ethnicities also frequent Chinese restaurants. However, since market data for Chinese cuisine is not freely available on the internet, we make this assumption for the purpose of this project.

We determine the top 5 boroughs by sorting the Chinese Population column in descending order and keeping the first five rows. These five boroughs contain 54 neighborhoods.
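The selection step can be sketched in pandas as follows. This is a minimal illustration, not the original notebook code: the borough names and population figures are placeholders (they are not census values), and the column names are assumed.

```python
import pandas as pd

# Placeholder figures for illustration only; these are NOT census values.
demographics = pd.DataFrame({
    "Borough": ["Borough A", "Borough B", "Borough C",
                "Borough D", "Borough E", "Borough F"],
    "Chinese Population": [1200, 4500, 300, 2800, 3900, 1700],
})

# Sort by Chinese population, largest first, and keep the first five rows.
top5 = (
    demographics.sort_values("Chinese Population", ascending=False)
    .head(5)
    .reset_index(drop=True)
)
top_boroughs = top5["Borough"].tolist()
print(top_boroughs)
```

The resulting list can then be used to keep only the neighborhoods in those boroughs, for example with `neighborhoods[neighborhoods["Borough"].isin(top_boroughs)]`, where `neighborhoods` is a hypothetical dataframe holding the Wikipedia list of areas.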

3.2. Exploring a Sample Neighborhood

We start with an initial exploration of a sample neighborhood, Arkley, Barnet, to check how workable the Foursquare API data is.

We perform a GET request to the Foursquare API, using a limit of 100 venues and a radius of 2000 meters, and examine the initial results. The request returns 42 venues for Arkley, Barnet.
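A request like this could be built as sketched below. The article does not name the endpoint, so this assumes the Foursquare v2 `venues/explore` endpoint with its `ll`, `radius`, `limit` and `v` parameters. The function name, the credentials and the coordinates are placeholders (the coordinates are not Arkley's).

```python
from urllib.parse import urlencode

# Assumed endpoint: the article does not state which one was called.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20210517", radius=2000, limit=100):
    """Build the GET URL for venues within `radius` meters of (lat, lng)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,             # API version date in YYYYMMDD form
        "ll": f"{lat},{lng}",     # latitude,longitude of the search center
        "radius": radius,         # search radius in meters
        "limit": limit,           # maximum number of venues to return
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


# Placeholder credentials and coordinates, for illustration only.
url = build_explore_url(51.5, -0.1, "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
print(url)
```

With the `requests` library installed, the call itself would be along the lines of `results = requests.get(url).json()`.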

We drill into the data by generating a list of all 42 venues and their respective categories, i.e., pub, Italian restaurant, etc. We end up with the below dataframe.
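Flattening the response into a dataframe might look like the sketch below. It assumes the JSON layout of the v2 `venues/explore` response (`response` → `groups` → `items` → `venue`). The sample payload is invented for illustration and `venues_to_frame` is a hypothetical helper name.

```python
import pandas as pd


def venues_to_frame(results):
    """Turn an explore-style JSON response into one row per venue."""
    items = results["response"]["groups"][0]["items"]
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        rows.append({
            "name": venue["name"],
            # Use the first listed category as the venue's category.
            "categories": categories[0]["name"] if categories else None,
            "lat": venue["location"]["lat"],
            "lng": venue["location"]["lng"],
        })
    return pd.DataFrame(rows, columns=["name", "categories", "lat", "lng"])


# Invented sample payload in the assumed response shape.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Pub",
               "categories": [{"name": "Pub"}],
               "location": {"lat": 51.50, "lng": -0.10}}},
    {"venue": {"name": "Example Trattoria",
               "categories": [{"name": "Italian Restaurant"}],
               "location": {"lat": 51.51, "lng": -0.11}}},
]}]}}

nearby_venues = venues_to_frame(sample)
print(nearby_venues)
```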

Now we want to check how many Chinese restaurants, if any, there are in this neighborhood. To do this, we loop through the above dataframe and print every row whose categories column is 'Chinese Restaurant'. Interestingly, after running the loop, the output is empty. This means that Foursquare returned no Chinese restaurants for Arkley, Barnet, even though it is in the borough with the highest Chinese population.
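The same check can be written without an explicit loop by using a boolean mask. The venues below are invented stand-ins for the Arkley results, and the column names are assumed.

```python
import pandas as pd

# Invented stand-in for the sample neighborhood's venue list.
nearby_venues = pd.DataFrame({
    "name": ["Example Pub", "Example Trattoria", "Example Cafe"],
    "categories": ["Pub", "Italian Restaurant", "Cafe"],
})

# Keep only the rows whose category is exactly "Chinese Restaurant".
chinese = nearby_venues[nearby_venues["categories"] == "Chinese Restaurant"]
print(len(chinese))  # 0 for this invented sample
```

An exact match like this ignores any more specific category labels the API may return for Chinese cuisine, so the count is best read as a lower bound.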

3.3. Exploring all the Neighborhoods in the 5 Boroughs

Now let us explore all the neighborhoods within the 5 boroughs, Barnet, Tower Hamlets, Southwark, Camden and Westminster, and determine which neighborhoods have the most Chinese restaurants. By analyzing which neighborhoods have the most Chinese restaurants, we can indirectly gauge where the demand for Chinese cuisine is.

Firstly, as above, we extract the venues and their categories for all the selected neighborhoods. We then one-hot encode the categories, group the rows by neighborhood and take the mean of each category column, which gives the frequency of each venue category in each neighborhood. We end up with the following table.
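As a minimal sketch of this step, assuming a long-format dataframe with one row per venue and the column names shown (the neighborhoods and venues are invented):

```python
import pandas as pd

# One row per venue; invented data for illustration.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Pub", "Chinese Restaurant", "Cafe", "Pub", "Park", "Pub"],
})

# One-hot encode the category: one 0/1 column per venue category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of a 0/1 column is the share of venues in that category.
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```

Here neighborhood A has four venues, one of which is a Chinese restaurant, so its Chinese Restaurant frequency is 0.25.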

We are only interested in the frequency of Chinese restaurants, so we filter out those neighborhoods where the frequency of Chinese restaurants is 0. Then, we drop all columns other than ‘Neighborhood’ and ‘Chinese Restaurant’. Finally, we sort the result in descending order. Our new dataframe is as follows.
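In pandas, the filter, column selection and sort can be chained as sketched here. The frequencies are invented and the column names are assumed.

```python
import pandas as pd

# Invented frequency table in the shape produced by the grouping step.
grouped = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D"],
    "Chinese Restaurant": [0.02, 0.0, 0.05, 0.0],
    "Pub": [0.10, 0.20, 0.05, 0.30],
})

chinese_freq = (
    # Keep rows with a non-zero frequency and only the two columns of interest.
    grouped.loc[grouped["Chinese Restaurant"] > 0,
                ["Neighborhood", "Chinese Restaurant"]]
    # Highest frequency first.
    .sort_values("Chinese Restaurant", ascending=False)
    .reset_index(drop=True)
)
print(chinese_freq)
```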

We have narrowed down our options to 11 neighborhoods. But we want more insight into these neighborhoods before we draw conclusions and make recommendations. One thing we can do is check the 10 most common venue categories in each neighborhood, to see whether Chinese restaurants are among them and, if so, where they rank. Using the grouped one-hot encoding dataframe, we create a function that returns the most common venues and set the number of venues to 10. We generate a new dataframe with the results.
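One way to write such a function is sketched below. The helper names, the column labels and the frequencies are assumptions for illustration, and the example keeps only the top 2 categories so the output stays small (the article uses 10).

```python
import pandas as pd


def most_common_venues(row, num_top_venues):
    """Return the category names with the highest frequencies in one row."""
    frequencies = row.iloc[1:].astype(float)  # skip the Neighborhood column
    ranked = frequencies.sort_values(ascending=False)
    return ranked.index.values[:num_top_venues]


def ordinal(n):
    """1 -> '1st', 2 -> '2nd', 3 -> '3rd', 4 -> '4th', 11 -> '11th', ..."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


# Invented frequency table: one row per neighborhood.
grouped = pd.DataFrame({
    "Neighborhood": ["A", "B"],
    "Chinese Restaurant": [0.2, 0.5],
    "Pub": [0.5, 0.1],
    "Park": [0.3, 0.4],
})

num_top_venues = 2  # the article uses 10
columns = ["Neighborhood"] + [
    f"{ordinal(i)} Most Common Venue" for i in range(1, num_top_venues + 1)
]

rows = []
for _, row in grouped.iterrows():
    rows.append([row["Neighborhood"], *most_common_venues(row, num_top_venues)])

venues_sorted = pd.DataFrame(rows, columns=columns)
print(venues_sorted)
```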

3.4. Clustering the Neighborhoods

We want to cluster the neighborhoods to see whether there is any relationship between a neighborhood's mix of venues and its concentration of Chinese restaurants. We chose k-means clustering because it is one of the most common clustering methods and is well suited to our purposes. Using the Elbow Method, we determined that the best number of clusters is 5.
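The Elbow Method can be sketched with scikit-learn as follows. The article does not say which features were clustered, so synthetic points stand in here for the per-neighborhood venue frequencies. The range of k values and the random seeds are also assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the neighborhood feature matrix (60 rows, 2 columns).
rng = np.random.default_rng(0)
features = np.vstack([
    rng.normal(loc=center, scale=0.05, size=(20, 2))
    for center in (0.1, 0.5, 0.9)
])

# Fit k-means for a range of k and record the inertia
# (the within-cluster sum of squared distances) for each.
k_values = list(range(1, 7))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 4))

# Once a k is chosen (the article settles on 5), assign a label to each row.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(features)
```

Plotting `inertias` against `k_values` gives the elbow curve. The chosen k is the point where adding more clusters stops reducing the inertia by much, which is a judgment call rather than an exact rule.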

Next, we do a left join on our two main dataframes: the most common venues dataframe and our original dataframe containing the neighborhood, borough, postal code, latitude and longitude. We also visualize the clustering results by using folium to superimpose the clusters onto a map of London. Lastly, we examine each cluster by slicing the dataframe by cluster label.
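The join and the per-cluster slicing could look like the sketch below. It assumes the original dataframe is the left side of the join and that the cluster labels have already been added to the most common venues dataframe. All values are invented.

```python
import pandas as pd

# Invented stand-in for the original dataframe (placeholder coordinates).
neighborhoods = pd.DataFrame({
    "Neighborhood": ["A", "B", "C"],
    "Borough": ["X", "X", "Y"],
    "Latitude": [51.50, 51.52, 51.54],
    "Longitude": [-0.10, -0.12, -0.14],
})

# Invented stand-in for the most common venues dataframe with cluster labels.
venues_sorted = pd.DataFrame({
    "Neighborhood": ["A", "B"],
    "Cluster Labels": [0, 2],
    "1st Most Common Venue": ["Pub", "Chinese Restaurant"],
})

# Left join: every neighborhood is kept, and those without venue data get NaN.
merged = neighborhoods.merge(venues_sorted, how="left", on="Neighborhood")

# Drop neighborhoods that were never clustered, then restore integer labels.
merged = merged.dropna(subset=["Cluster Labels"])
merged["Cluster Labels"] = merged["Cluster Labels"].astype(int)

# Slice out a single cluster for inspection.
cluster_0 = merged[merged["Cluster Labels"] == 0]
print(cluster_0)
```

The Latitude, Longitude and Cluster Labels columns of `merged` are what a map layer needs in order to draw one colored marker per neighborhood.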

4. Results

The final result of our analysis is the following dataframe, containing the columns Neighborhood, Borough, PostCode, Latitude, Longitude, Cluster Labels, and 1st Most Common Venue through 10th Most Common Venue.

Fig 1

Drilling down, we see here the 10 most common venues for each neighborhood. The green boxes highlight the neighborhoods where Chinese restaurants are among them.

Fig 2

Our folium map with the markers for each cluster was rendered as shown below.

Fig 3

The resulting slicing of the dataframe into each cluster is shown below.

Cluster 1

Fig 4

Cluster 2

Fig 5

Cluster 3

Fig 6

Cluster 4

Fig 7

Cluster 5

Fig 8

5. Discussion and Recommendation

Based on an initial look at Fig 2, we can see that only 2 neighborhoods (Brent Cross, Barnet and Bow, Tower Hamlets) have Chinese restaurants listed in the top 5 most common venues. Some of the other neighborhoods do not have Chinese restaurants listed at all. We now have a clearer picture of which neighborhoods seem to be popular for Chinese restaurants.

From the map in Fig 3, we can see that the areas we want to focus on are central London and north London.

Clusters 2 and 5 (in Fig 5 and Fig 8) are removed from our consideration, as they do not have Chinese restaurants listed in the 10 most common venues. We are looking for areas where demand for Chinese restaurants is high, that is, areas where Chinese restaurants are among the most common venues.

The most viable options here are Clusters 1 and 3. The clustering results support Brent Cross and Bow as our best options. However, we must keep in mind that there are other contributing factors we have not considered because of the lack of publicly available information, such as a breakdown of ethnicities in each neighborhood and market data on Chinese restaurants (that is, who is most likely to frequent them, broken down by age, income, education, ethnicity and so on). Also, the Census information we used dates from 2011, 10 years ago, so the demographics of London may have changed since then.

It is also unlikely that, in a city as diverse as London, the 5 boroughs with the highest Chinese population contain only 11 neighborhoods with Chinese restaurants. This suggests that the venue data we retrieved may be incomplete.

Given these caveats, our recommendation for where to open a Chinese restaurant is either Brent Cross, Barnet or Bow, Tower Hamlets.

6. Conclusion

In this project, our aim was to identify which neighborhood would be best suited for a new Chinese restaurant. Our main approach was to understand where the demand for Chinese restaurants is. We did this by determining which boroughs have the largest Chinese communities, then analyzing each neighborhood in those boroughs. The neighborhoods with the highest frequency of Chinese restaurants were deemed the most viable options, as we take this to indicate where demand is high.

Our recommendation, ultimately, was either Brent Cross, Barnet or Bow, Tower Hamlets. We do acknowledge, however, that the missing data, especially market data on Chinese restaurants, would be needed to give a more accurate recommendation.

This information would be useful for those looking to open a new Chinese restaurant and wanting to know where the demand for them is high.
