IBM Applied Data Science Capstone Project

The Battle of Neighborhoods

K. S. S Abhinav Kaushal

March, 2021

>> Introduction

➢ Background:

Silicon Valley is a region in the southern part of the San Francisco Bay Area in Northern California that serves as a global center for high technology and innovation. It corresponds roughly to the geographical Santa Clara Valley. San Jose is Silicon Valley’s largest city, the third-largest in California, and the tenth-largest in the United States; other major Silicon Valley cities include Sunnyvale, Santa Clara, Redwood City, Mountain View, Palo Alto, Menlo Park, and Cupertino. Other nearby cities, such as Milpitas, Fremont, Pleasanton, Saratoga, Campbell, and Livermore, are highly rated as convenient places to live.

➢ Problem Statement:

Consider a scenario where someone wants to find a house in Silicon Valley. We narrow the search to five cities in the South Bay and East Bay areas: Cupertino, Sunnyvale, Milpitas, Fremont and Pleasanton. First, we want to know where they are, along with some basic information such as population, median household income, median house value, top employers and local schools. Other deciding factors could be the demographic profile, population change in recent years, and education ratings.

➢ Interests:

This analysis is aimed at people who are moving into the Silicon Valley residential areas, or who have vested interests or obligations to make a home in the technology suburbs of San Jose. The real estate background analysis may also reveal opportunities for long-term investments or assets.

>> Data Framing & Methodology

— Data Sources:

1. Demographic Information & Zipcode: zipcode.org

2. Demographics & Economic Data: wikipedia.org

3. Living & Educational Data: niche.com

4. Zipcode Geographics: opendatasoft.com

5. Neighborhoods Information Interface: foursquare.com

❖ Data Description:

Because we may not be able to obtain all the data from a single source, we collect it from several sources and, where the same figure is reported by more than one of them, compare the values. Furthermore, we want to explore the neighborhoods (represented by zipcode) and find out whether neighborhoods in the more affordable areas are similar to those in the expensive areas.

Demographic information and zipcodes are obtained from Zipcode.org. The demographic profile, the 2010 population from the United States Census, the estimated 2019 population and the top local employers are obtained from Wikipedia. From the Niche website we take the Niche grades for Public Schools, Housing, Good for Families and Cost of Living, together with median house value, median household income and high school names. From Opendatasoft, we retrieve the latitude and longitude of each zipcode. With these coordinates, we query Foursquare for location data.
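The Foursquare query described above can be sketched as follows. This is a minimal sketch under stated assumptions: the report does not say which endpoint or parameters were used, so the legacy Foursquare v2 `venues/explore` endpoint and its usual response layout are assumed here, and `build_explore_url`, `parse_venues`, the radius, the limit and the credentials are hypothetical names and placeholder values.

```python
from urllib.parse import urlencode

# Assumption: the legacy Foursquare v2 "explore" endpoint; the report does not name it.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      radius=1000, limit=100, version="20210301"):
    """Build the request URL for venues around one zipcode's coordinates."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,            # API version date, YYYYMMDD
        "ll": f"{lat},{lng}",    # latitude,longitude of the zipcode
        "radius": radius,        # search radius in metres (placeholder value)
        "limit": limit,          # maximum venues to return (placeholder value)
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


def parse_venues(zipcode, payload):
    """Flatten an explore-style JSON payload into one row per venue."""
    rows = []
    for group in payload.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item.get("venue", {})
            categories = venue.get("categories") or [{}]
            rows.append({
                "zipcode": zipcode,
                "venue": venue.get("name"),
                "category": categories[0].get("name"),
            })
    return rows
```

The URL would be fetched with any HTTP client, and the decoded JSON passed to `parse_venues`; the resulting rows can be loaded straight into a pandas DataFrame.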

❖ Data Cleaning, Feature Selection & Methodology:

Using Folium, the regional map is displayed.

A pandas DataFrame is then created from the information collected from each of Zipcode.org, Wikipedia and Niche.
For visualization, Matplotlib and Seaborn are used to plot:
1. Population and Demographic Profile
2. Population Change from 2010 to 2019
3. Median House Value and Household Income
4. Population Data Comparison across Zipcode.org, Wikipedia and Niche
We then combine the three dataframes into one DataFrame that summarizes all the information we want to know, except for the location data.
For instance: city, population, public school ratings, median house value and household income, rent, high schools, top local employers, zipcodes (neighborhoods).
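The combination of the three per-source dataframes, and the comparison of the population figures they report, could look something like this. It is a sketch only: the city names, numbers and column names are dummy placeholders rather than figures from the report, and joining on a shared `city` column is an assumption.

```python
import pandas as pd

# Placeholder rows: dummy values for illustration, not data from the report.
zipcode_df = pd.DataFrame({"city": ["City A", "City B"],
                           "population_zipcode": [60000, 150000]})
wiki_df = pd.DataFrame({"city": ["City A", "City B"],
                        "population_wiki": [61000, 152000],
                        "top_employer": ["Employer X", "Employer Y"]})
niche_df = pd.DataFrame({"city": ["City A", "City B"],
                         "population_niche": [60500, 151000],
                         "public_schools_grade": ["A+", "A"]})

# Join the three per-source tables on the shared city column.
summary = (zipcode_df
           .merge(wiki_df, on="city", how="inner")
           .merge(niche_df, on="city", how="inner"))

# Compare the population reported by each source: the largest gap,
# as a percentage of the smallest reported value.
pop_cols = ["population_zipcode", "population_wiki", "population_niche"]
lowest = summary[pop_cols].min(axis=1)
spread = summary[pop_cols].max(axis=1) - lowest
summary["population_spread_pct"] = (100 * spread / lowest).round(2)
```

An inner join keeps only cities present in all three sources; a small spread suggests the sources agree well enough to use any one of them.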

→ A lot of additional information is available from the selected sources, and it needs to be scraped and tabulated for analysis. This is done by using web-scraping tools to extract the data into dataframes, and then cleaning out the irrelevant data.
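The cleaning step after scraping might look roughly like the following. This is a hedged sketch, not the notebook's actual code: the rows are dummy values, and `clean_scraped`, the column names and the `page_footer` junk column are hypothetical.

```python
import pandas as pd

# Dummy scraped rows: values arrive as text with symbols, separators,
# stray whitespace and missing markers. Not figures from the report.
raw = pd.DataFrame({
    "city": [" City A ", "City B", "City C"],
    "median_house_value": ["$1,200,000", "$950,000", "N/A"],
    "median_household_income": ["$150,000", "130,000 ", "$120,000"],
    "page_footer": ["edit", "edit", "edit"],   # irrelevant scraped column
})


def clean_scraped(df, money_cols, drop_cols=()):
    """Return a tidy copy with numeric money columns and no junk columns."""
    out = df.drop(columns=list(drop_cols)).copy()
    out["city"] = out["city"].str.strip()
    for col in money_cols:
        # Remove currency symbols, thousands separators and whitespace.
        digits = out[col].str.replace(r"[$,\s]", "", regex=True)
        # Anything that is still not a number (e.g. "N/A") becomes NaN.
        out[col] = pd.to_numeric(digits, errors="coerce")
    return out


clean = clean_scraped(
    raw,
    money_cols=["median_house_value", "median_household_income"],
    drop_cols=["page_footer"],
)
```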

The plots listed above focus on the features that identify and determine the statistics most relevant to choosing a neighborhood to live in.

— Area Mapping: Cities of Cupertino, Sunnyvale, Milpitas, Fremont and Pleasanton

→ Demographic Information from the sources:

→ These combined with the Latitude & Longitude values by zipcodes complete the data set from the sources:

>> Problem Modelling & Visualization

The following plots show the statistics needed to understand the behaviour of the data. The relationships are plotted to show how the attributes in the dataframe depend on one another.
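As an illustration of the kind of plot meant here, a grouped bar chart of population in 2010 and 2019 could be produced along these lines. The numbers are dummy values, not figures from the report, and only Matplotlib is used in this sketch; the styling choices are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Dummy values for illustration only, not figures from the report.
pop = pd.DataFrame({
    "city": ["City A", "City B", "City C"],
    "population_2010": [50000, 100000, 80000],
    "population_2019": [55000, 105000, 100000],
})

# Percentage change between the two years.
pop["change_pct"] = (
    100 * (pop["population_2019"] - pop["population_2010"]) / pop["population_2010"]
).round(1)

# Two bars per city, side by side.
fig, ax = plt.subplots(figsize=(6, 4))
positions = list(range(len(pop)))
width = 0.4
ax.bar([p - width / 2 for p in positions], pop["population_2010"], width, label="2010")
ax.bar([p + width / 2 for p in positions], pop["population_2019"], width, label="2019")
ax.set_xticks(positions)
ax.set_xticklabels(pop["city"])
ax.set_ylabel("Population")
ax.set_title("Population change, 2010 to 2019")
ax.legend()
fig.tight_layout()
# fig.savefig("population_change.png")
```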

>> Results

All five cities are attractive places to live, thanks to their highly rated schools, renowned employers and active neighborhoods.
However, the cost of living and house prices are high in all of them.

• The population of the region has increased slightly over ten years, and all five cities show the same trend.

• The median household incomes of the five cities are in the same range. Milpitas, Fremont and Pleasanton have relatively more affordable house prices.

• One neighborhood each in Sunnyvale, Pleasanton and Fremont falls in the same cluster as Cupertino (Cluster 2: blue, red and orange dots), characterized by restaurants, food venues and basic amenities.

• Many Cupertino neighborhoods share a cluster with neighborhoods from all the other cities: Fremont, Pleasanton, Sunnyvale and Milpitas (Cluster 4: green dots). This cluster offers a wide range of international cuisines, entertainment venues and parks.

• However, the greatest overlap is between the Sunnyvale neighborhoods and the Cupertino cluster (blue on green dots).
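The clustering behind these observations can be sketched as follows. The report does not state which algorithm or how many clusters were used, so k-means with a hand-picked k is an assumption here, and the venue rows, zipcode labels and variable names are dummy placeholders.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Dummy venue rows (zipcode, venue category) for illustration only.
venues = pd.DataFrame({
    "zipcode": ["Z1", "Z1", "Z2", "Z2", "Z3", "Z3", "Z4", "Z4"],
    "category": ["Restaurant", "Cafe", "Restaurant", "Cafe",
                 "Park", "Trail", "Park", "Trail"],
})

# One-hot encode the categories, then average per zipcode so each
# neighborhood becomes a vector of venue-category frequencies.
onehot = pd.get_dummies(venues["category"]).astype(float)
onehot.insert(0, "zipcode", venues["zipcode"])
profile = onehot.groupby("zipcode").mean()

# Assumption: k-means with k chosen by hand; the report states neither.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profile)
clusters = pd.Series(kmeans.labels_, index=profile.index, name="cluster")

# Most frequent venue category per neighborhood, to help describe each cluster.
top_category = profile.idxmax(axis=1)
```

Neighborhoods with similar venue mixes receive the same cluster label, which is what allows a zipcode in one city to be matched with zipcodes in another.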

→ To summarize the results, the defining clusters are shown below:

// Cluster 2:

// Cluster 4:

>> Discussion

Folium is really cool and fun to use. Seaborn and Matplotlib felt somewhat restrictive for representing the data, mainly because of our limited experience with them, but both were quite useful throughout the process.
Other plot types, such as bar, box, pie, scatter and bubble charts, could also be used extensively for further analysis, with hue and style to display multiple variables. Still, the bar and line charts shown here provide the visualization needed to understand the demographic analysis.

>> Conclusion

In this capstone project, business understanding (analytic approach, data requirements, data collection, data understanding, data preparation, modeling, evaluation, deployment and feedback), analysis, visualization and machine learning with Python are all applied in a Jupyter notebook to solve the problem statement.
As this is a real case study, anyone who is moving to the Bay Area can get a rough idea of these cities by looking at the data collected from Wikipedia, Niche and Foursquare.
// Sunnyvale seems to be an alternative to Pleasanton and Fremont, with a shorter commute to central Silicon Valley but a living cost around the median.