Recently Published
Beam Me Up!!
This is a project about a ride-sharing business operating in several big cities in Turkey. The company provides a motorcycle ride-sharing service for Turkish citizens and really values efficiency in getting through traffic; the app even nods to Star Trek with a "beam me up" reference on its order buttons. In this project, we are going to help them forecast driver demand for the next 5 working days.
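A minimal sketch of what such a forecast could look like, assuming a vector of historical daily order counts named `daily_orders` (a hypothetical name) and the `forecast` package:

```r
# Minimal sketch: forecast daily driver demand 5 working days ahead.
# `daily_orders` is a hypothetical numeric vector of past daily order counts.
library(forecast)

demand_ts <- ts(daily_orders, frequency = 5)  # 5-day working week
fit <- auto.arima(demand_ts)                  # let auto.arima pick the model
forecast(fit, h = 5)                          # demand for the next 5 working days
```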
Policy Planning in Argentina - Using PCA
With almost 40 million inhabitants and a diverse geography that encompasses the Andes mountains, glacial lakes, and the Pampas grasslands, Argentina is the second-largest country (by area) in South America and has one of the continent's largest economies. It is politically organized as a federation of 23 provinces and an autonomous city, Buenos Aires.
We will analyze ten economic and social indicators collected for each province. Because these indicators are highly correlated, we will use Principal Component Analysis (PCA) to reduce redundancies and highlight patterns that are not apparent in the raw data. After visualizing the patterns, we will use k-means clustering to partition the provinces into groups with similar development levels.
These results can be used to plan public policy by helping allocate resources to develop infrastructure, education, and welfare programs.
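A minimal sketch of this pipeline, assuming a data frame `argentina` (hypothetical name) whose first column is the province name and whose remaining columns are the ten numeric indicators:

```r
# PCA on the standardized indicators, then k-means on the leading components.
pca <- prcomp(argentina[, -1], scale. = TRUE)  # drop the province-name column
summary(pca)                                   # variance explained per component

scores <- pca$x[, 1:2]                         # keep the first two components
set.seed(1234)
clusters <- kmeans(scores, centers = 4)        # 4 groups is an assumption
plot(scores, col = clusters$cluster, pch = 19)
```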
Predicting Food Price in Rwanda with Time Series
I love food, I love to eat, and I love to cook. But every time I go to the supermarket, my wallet weeps a little, even though I try my best to buy locally. So really, how expensive is food around the world? In this notebook, I want to explore the time series of food prices in Rwanda from the United Nations Humanitarian Data Exchange Global Food Price Database. Agriculture makes up over 30% of Rwanda's economy and over 60% of its export earnings, so the price of food is central to the livelihood of many Rwandans.
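As a sketch of the kind of exploration involved, here is how one commodity's price series might be loaded and plotted; the file path and column names are assumptions, not taken from the notebook:

```r
library(readr)
library(dplyr)
library(ggplot2)

# Hypothetical path and columns: date, commodity, price.
prices <- read_csv("datasets/rwanda_food_prices.csv") %>%
  filter(commodity == "Potatoes (Irish)") %>%
  group_by(date) %>%
  summarize(avg_price = mean(price, na.rm = TRUE))

ggplot(prices, aes(date, avg_price)) +
  geom_line() +
  labs(title = "Potato prices in Rwanda over time", y = "Average price")
```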
Which Language Are You Going to Use as a Noob Data Scientist?
Throughout the world of data science, there are many languages and tools that can be used to complete a given task. While you are often able to use whichever tool you prefer, it is important for analysts to work with similar platforms so that they can share their code with one another. Learning what professionals in the data science industry use at work can help you gain a better understanding of what you may be asked to do in the future.
The Good, The Bad, and The Ugly (Password)
Today, many of us are forced to come up with new passwords all the time when signing into sites and apps. As a password inventor, it is your responsibility to come up with good, hard-to-crack passwords. But it is also in the interest of sites and apps to make sure that you use good passwords. The problem is that it's really hard to define what makes a good password. However, the National Institute of Standards and Technology (NIST) knows the second-best thing: making sure you're at least not using a bad password.
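A minimal sketch of such a screen, assuming a plain-text list of commonly used passwords (the file path and function name here are hypothetical):

```r
# NIST-style screen: reject passwords that are too short or too common.
common <- readLines("datasets/common_passwords.txt")  # hypothetical list

check_password <- function(pw) {
  long_enough <- nchar(pw) >= 8    # NIST requires at least 8 characters
  not_common  <- !(pw %in% common) # NIST: screen against known bad passwords
  long_enough && not_common
}

check_password("password123")  # almost certainly FALSE on any common list
```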
Happy Planet Index 2016
The Happy Planet Index (HPI) is an index of human well-being and environmental impact created by Nic Marks that tells us how well nations are doing at achieving long, happy, sustainable lives. The index combines four elements to show how efficiently residents of different countries are using environmental resources to lead long, happy lives. I downloaded the 2016 dataset from the Happy Planet Index website.
My goal is to find correlations between several variables, then use clustering techniques to separate these 140 countries into different clusters according to well-being, wealth (GDP), life expectancy, and carbon emissions.
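A minimal sketch of that workflow, assuming a data frame `hpi` whose column names (invented here) hold the four variables:

```r
# Standardize the four variables and cluster the 140 countries.
vars   <- hpi[, c("wellbeing", "gdp", "life_expectancy", "carbon_emissions")]
scaled <- scale(vars)              # put the variables on a comparable scale

round(cor(scaled), 2)              # inspect the pairwise correlations first
set.seed(42)
km <- kmeans(scaled, centers = 3)  # 3 clusters is an assumption
table(km$cluster)                  # cluster sizes
```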
Heart Disease in Cleveland
Millions of people develop some form of heart disease every year, and heart disease is the biggest killer of both men and women in the United States and around the world. Statistical analysis has identified many risk factors associated with heart disease, such as age, blood pressure, total cholesterol, diabetes, hypertension, family history of heart disease, obesity, and lack of physical exercise. In this notebook, we're going to run statistical tests and regression models on the Cleveland heart disease dataset to assess one particular factor: the maximum heart rate one can achieve during exercise, and how it is associated with a higher likelihood of having heart disease.
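A sketch of both analyses, assuming a data frame `cleveland` with a 0/1 `disease` outcome and a `thalach` column for maximum heart rate (common names for this dataset, but assumptions here):

```r
# Does maximum heart rate differ between the disease and no-disease groups?
t.test(thalach ~ disease, data = cleveland)

# Logistic regression: association of max heart rate with disease,
# adjusting for age and sex (the adjustment set is an assumption).
model <- glm(disease ~ thalach + age + sex,
             data = cleveland, family = binomial)
summary(model)   # sign and significance of the thalach coefficient
```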
Candy Crush Saga Difficulty Level
Candy Crush Saga is a hit mobile game developed by King (part of Activision|Blizzard) that is played by millions of people all around the world. The game is structured as a series of levels where players need to match similar candy together to (hopefully) clear the level and keep progressing on the level map.
Candy Crush has more than 3000 levels, and new ones are added every week. That is a lot of levels! And with that many levels, it's important to get level difficulty just right. Too easy and the game gets boring, too hard and players become frustrated and quit playing.
In this project, we will see how we can use data collected from players to estimate level difficulty.
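One common way to do this, sketched below under assumed column names (`level`, `num_attempts`, `num_success`), is to estimate each level's probability of winning a single attempt:

```r
library(dplyr)

# p_win = wins / attempts per level; a low p_win marks a hard level.
difficulty <- data %>%                 # `data` is a hypothetical player log
  group_by(level) %>%
  summarize(attempts = sum(num_attempts),
            wins     = sum(num_success),
            p_win    = wins / attempts)

filter(difficulty, p_win < 0.10)       # the 10% threshold is arbitrary
```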
Fashion Neural Network Modelling
Applying what I learned, I present a simple R Markdown document that demonstrates neural network modeling to identify fashion items in photos.
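A minimal sketch of such a model using the `keras` package and the built-in Fashion-MNIST data; this is an assumed architecture, not necessarily the one in the document:

```r
library(keras)

fashion <- dataset_fashion_mnist()
x_train <- fashion$train$x / 255                 # scale pixels to [0, 1]
y_train <- fashion$train$y                       # integer labels 0-9

model <- keras_model_sequential() %>%
  layer_flatten(input_shape = c(28, 28)) %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dense(units = 10, activation = "softmax")

model %>% compile(optimizer = "adam",
                  loss = "sparse_categorical_crossentropy",
                  metrics = "accuracy")
model %>% fit(x_train, y_train, epochs = 5, validation_split = 0.2)
```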
Property Sales in NYC PCA Analysis
Using one of the two unsupervised learning algorithms we've learned, we will produce a simple R Markdown document demonstrating an exercise in either clustering or dimensionality reduction on the wholesale.csv dataset, the nyc.csv dataset, or a dataset of our own.
We will explain our choice of parameters (how we choose k for k-means clustering, or how many dimensions to retain for PCA) from the original data, and we will describe some business utility of the unsupervised model we've developed. (The R Markdown document should contain one or two visualizations.)
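A sketch of both parameter choices, assuming a fully numeric data frame `nyc` built from nyc.csv:

```r
# How many PCs to retain: cumulative proportion of variance explained.
pca <- prcomp(nyc, scale. = TRUE)
cumsum(pca$sdev^2) / sum(pca$sdev^2)   # e.g. keep enough PCs for ~80%

# How to choose k: look for an elbow in the within-cluster sum of squares.
wss <- sapply(1:10, function(k)
  kmeans(scale(nyc), centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")
```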
Credit Risk Analysis with Decision Tree and Random Forest
This analysis is done to demonstrate the use of one of the three classification algorithms we've learned in the Classification in Machine Learning module to predict the risk status of a bank loan, show an understanding of holding out a test/cross-validation set to estimate the model's performance on unseen data, sufficiently explain the model's performance (accuracy, recall/sensitivity, and specificity), and demonstrate extra effort to improve on the accuracy obtained from the initial model.
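A sketch of that workflow with a random forest (the data frame `loans` and its `risk` factor column are assumptions):

```r
library(randomForest)
library(caret)

set.seed(100)
idx   <- sample(nrow(loans), size = 0.8 * nrow(loans))
train <- loans[idx, ]   # 80% for training
test  <- loans[-idx, ]  # 20% held out to estimate performance on unseen data

rf   <- randomForest(risk ~ ., data = train)
pred <- predict(rf, newdata = test)
confusionMatrix(pred, test$risk)  # accuracy, sensitivity, specificity
```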
Credit Risk Analysis
This analysis is done to demonstrate the use of logistic regression on the lbb_loans.csv dataset, correctly interpret the negative coefficients obtained from the logistic regression model, understand which of the variables are more statistically significant as predictors, and demonstrate some strategies to improve the model that has been built.
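A sketch of the fit and the coefficient interpretation; the outcome column name `not_paid` is an assumption:

```r
loans <- read.csv("lbb_loans.csv")

model <- glm(not_paid ~ ., data = loans, family = binomial)
summary(model)    # p-values show which predictors are significant

# Odds ratios: exp() of a negative coefficient is below 1, meaning the
# predictor is associated with lower odds of the loan not being paid.
exp(coef(model))
```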