projects

python

Kpop Data Analysis

02/2021 - 04/2021

Code

Part.1 Kpop Explained by Data

Analyzed data on all K-pop idols, from the industry's beginnings to 2021, covering the K-pop industry, its artists, and its companies



Part.2 Kpop Companies Explained by Data

Visualized the business performance of public K-pop companies and analyzed their artist management and international marketing strategies


Here are interactive visualizations of the revenue and net income of Kpop agencies from 2016 to 2020. If you hover your pointer over a year on the chart, a hover box shows the revenue or the net income of all companies for that year.


Part.3 International Kpop Artists

In my last Kpop data analysis project, I realized that the dataset contained some mistakes in the nationalities of Kpop artists. I corrected the data and made a clearer visualization of international Kpop artists using Python and Plotly. This is an interactive choropleth of Kpop stars' nationalities other than South Korean. If you hover your pointer over a country on the map, an information box shows how many Kpop stars are from that country.


Part.4 Kpop On YouTube Explained by Data

As Kpop becomes increasingly international, YouTube plays a pivotal role as the digital platform where Kpop idols share their music videos with audiences all over the world. The view count is a key metric reflecting a music video's international popularity. I extracted the data on all Kpop music videos from Kpop Database and scraped the view counts of all 4262 music videos from YouTube as of 04/05/2021.



Part.5 Why Do Kpop Groups Have So Many Members?

On average, Kpop groups have 5.5 members, and the 5-member group is the most common form. But why do some Kpop groups become so big? The largest Kpop group, NCT, has 23 members. I did an exploratory data analysis of how Kpop group sizes have changed over time.


python

Machine Learning in Python

Sentiment Analysis of Movie Reviews

01/2021

Code

When you have a large number of movie reviews, how can you know whether they are compliments or criticisms? In this project, I used natural language processing tools to classify the sentiment of text with both shallow learning and deep learning, and ran a sentiment analysis on a dataset of IMDb reviews.


Dimension Reduction with PCA

01/2021

Code

In this article, I use Principal Component Analysis to showcase dimension reduction on the 'banknote authentication' dataset.
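To make the idea concrete, here is a minimal pure-Python sketch of PCA that finds the first principal component by power iteration on the covariance matrix and projects the data onto it. It is an illustration under stated assumptions, not the code from the article: the function names are hypothetical, the five 2-D points are made-up toy data rather than the banknote dataset, and a real analysis would more likely rely on a numerical library.

```python
import math


def center(rows):
    """Subtract the per-column mean so every feature has mean zero."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in rows]


def covariance(centered):
    """Sample covariance matrix (divides by n - 1) of centered data."""
    n, d = len(centered), len(centered[0])
    return [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_component(cov, iterations=200):
    """Approximate the leading eigenvector of cov by power iteration.

    Assumes cov is non-zero and the start vector is not orthogonal
    to the leading eigenvector.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(centered, component):
    """Reduce each row to one number: its coordinate along the component."""
    return [sum(a * b for a, b in zip(row, component)) for row in centered]


# Toy data (hypothetical): points scattered around the line y = 2x.
points = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
centered_points = center(points)
component = top_component(covariance(centered_points))
reduced = project(centered_points, component)  # 2-D data reduced to 1-D
print(component)
print(reduced)
```

Because the toy points lie close to a line, a single component captures almost all of their variance, which is the situation where dimension reduction loses the least information.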


Social Media Analytics

11/2020

Code

Analyzed Trump's and Biden's recent tweets by scraping the tweets, investigating people's responses, and inspecting the tweets' content


Calculating π by Monte-Carlo Simulation

07/2021

Code

As we learned more and more math, we found more and more ways to calculate π. In computational statistics, there is a way to estimate π by brute force: Monte-Carlo simulation. In this article, I run a simple Monte-Carlo simulation to estimate π through the area of a circle. The same method can also be applied to estimate the area of other geometric shapes.
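A minimal sketch of the approach, assuming the common quarter-circle setup: draw random points in the unit square and count the fraction that land inside the quarter circle of radius 1. That fraction approximates π/4. The function name `estimate_pi` and its parameters are illustrative and not taken from the project's code.

```python
import random


def estimate_pi(num_samples, seed=0):
    """Estimate pi by sampling points uniformly in the unit square."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        # The point is inside the quarter circle of radius 1 if x^2 + y^2 <= 1.
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter circle has area pi/4 and the square has area 1,
    # so the fraction of hits approximates pi/4.
    return 4.0 * inside / num_samples


if __name__ == "__main__":
    for n in (1_000, 100_000):
        print(n, estimate_pi(n))
```

The error shrinks roughly in proportion to 1/√n, so each extra digit of accuracy costs about 100 times more samples. To estimate the area of another shape, replace the inside test with a test for that shape and scale the hit fraction by the area of the bounding box.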


Bayesian Spam Filter

05/2020

Code

In this project, I use a Naive Bayes classifier and a bag-of-words model to implement a Bayesian spam filter. The article walks you through implementation, training, and testing.
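The sketch below shows one way such a filter can be put together, assuming whitespace tokenization, Laplace (add-one) smoothing, and log-probabilities to avoid numeric underflow. The class name, method names, and the four training messages are hypothetical and exist only to illustrate the technique; they are not the project's code or data.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace: a deliberately crude bag-of-words."""
    return text.lower().split()


class NaiveBayesSpamFilter:
    LABELS = ("spam", "ham")

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = {label: 0 for label in self.LABELS}
        self.vocab = set()

    def train(self, text, label):
        """Add one labeled message to the word and document counts."""
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def _log_score(self, words, label):
        """log P(label) + sum of log P(word | label) over known words.

        Assumes each label has at least one training message.
        """
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        denom = total_words + self.alpha * len(self.vocab)
        for word in words:
            if word not in self.vocab:
                continue  # words never seen in training carry no evidence
            score += math.log((self.word_counts[label][word] + self.alpha) / denom)
        return score

    def predict(self, text):
        """Return the label with the higher posterior log-score."""
        words = tokenize(text)
        return max(self.LABELS, key=lambda label: self._log_score(words, label))


# Tiny made-up training set, for illustration only.
spam_filter = NaiveBayesSpamFilter()
spam_filter.train("win cash prize now", "spam")
spam_filter.train("free prize claim now", "spam")
spam_filter.train("lunch at noon tomorrow", "ham")
spam_filter.train("meeting notes attached", "ham")

print(spam_filter.predict("claim your free cash"))
print(spam_filter.predict("notes from the meeting"))
```

The "naive" part is the assumption that words are independent given the label, which is what lets the score be a simple sum of per-word terms. Testing on messages held out from training, rather than on the training set itself, gives an honest measure of accuracy.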

R

Data Analytics in R

Analysis on Tropical Atmosphere Ocean Data

01/2020 - 04/2020

Code
  • Analyzed and manipulated a database of 96k+ data points measuring the El Niño effect in the equatorial Pacific, using R and Trifacta
  • Clustered the data into groups, applied logistic regression and hypothesis testing to find the measures relevant to the El Niño effect, and classified the measures by buoy for further study of the El Niño effect


Analysis of Secondary Education and Teen Fertility

04/2020

Code
  • Analyzed the World Bank's dataset of countries' secondary school enrollment rates and teen fertility rates using OLS regression and difference-in-differences estimation
  • Found that improving a country’s secondary school enrollment rate lowers its teen fertility rate.


Visualization of World Indicator Data

11/2019

  • Explored the World Indicator data and visualized it with graphs in R
  • Compared and visualized trends in Internet usage, CO2 emissions, and health expenditure as a percentage of GDP for the US, China, Brazil, Russia, and India from 2000 to 2012
  • Compared and visualized the distribution of world population in 1998 and 2018

tableau

Visualization in Tableau

Visualization of Delayed Domestic Flights

11/2019

  • Analyzed and manipulated a database of 13k+ delayed domestic flights in the US on a single day, using SQL
  • Visualized the results of the analysis in Tableau as an interactive map that draws delayed flights as lines between airports
  • Showcased the distribution and scale of delayed flights in the US, along with each flight’s information and insights from mapping out delayed flights, for travelers’ reference

The application showcases domestic flights in the United States on 1 January 2015. The data includes each airport’s longitude and latitude and the distance between origin and destination airports. Each line changes color from the origin airport to the destination airport. From the visualization, we can observe that flights within the contiguous 48 states are more frequent than flights to and from the outlying states and territories. Within the 48 states, major hubs like DFW, JFK, and LAX have large numbers of flights going in and out. Among the outlying states and territories, flights to and from Hawaii and Puerto Rico are more frequent than those to and from Alaska and Guam, probably because of their popularity as tourist destinations.