A Deep Dive into Real-World Data

Visualizations help us interpret data quickly and make informed business decisions. For the past few days, I have been exploring the visualization features of R through the ggplot2 package, and I thought I would share some of the interesting things I learned.

Installation:

# Install and load the ggplot2 package in R
> install.packages("ggplot2")
> library(ggplot2)

Visualizing the Quantiles:

Let's plot the quantiles of city mileage, grouped by vehicle class. NOTE: This data set (mpg) is included with ggplot2.

> ggplot(data = mpg, mapping = aes(sample = cty, color = class)) + stat_qq(geom = 'point', distribution = qunif) + labs(x = 'Quantiles', y = 'City Mileage', color = 'Class', title = 'Quantile Plot, City Mileage (Grouped by Class)')

Clearly, 2seater and compact have the highest mpg among all the classes, with suv, minivan and pickup making up the bottom three. No surprises there, but it's interesting to see that the differences in mpg are smaller at the lower quantiles and grow toward the higher quantiles. In other words, the maximum mpg is significantly higher for 2seater than for pickup, whereas the difference between their lowest mpg values is not that significant. This becomes clearer in the line graph below, which shows the trend.

facet_wrap:

We used color coding in the plot above to differentiate the "class", but let's say you don't want all the lines on the same chart and would prefer to see them on separate charts. That is where the facet_wrap function helps: it creates one plot per class category.

> ggplot() + facet_wrap('class', nrow = 2)
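The call above is shorthand. A fuller sketch, assuming the same mpg data and the city-mileage quantile layer from the earlier plot (the line geom and the name p_facet are my choices for illustration, not necessarily what produced the original chart):

```r
library(ggplot2)

# One quantile panel per vehicle class, laid out in two rows.
# geom = "line" and the name p_facet are illustrative assumptions.
p_facet <- ggplot(data = mpg, mapping = aes(sample = cty)) +
  stat_qq(geom = "line", distribution = qunif) +
  facet_wrap(~ class, nrow = 2) +
  labs(x = "Quantiles", y = "City Mileage")

print(p_facet)
```

Writing the facet as ~ class or as 'class' should give the same panels; nrow only controls the layout.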

facet_wrap + Combined average:

It's difficult to compare individual charts and gather insights, so let's compare each of them with the combined average to see where they stand.
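One way to sketch this, assuming the "combined average" means the quantile curve of all classes pooled together: ggplot2 repeats a layer in every panel when that layer's data has no faceting column, so a copy of mpg without class draws the pooled curve everywhere. The dashed grey styling and the names pooled and p_avg are my own choices.

```r
library(ggplot2)

# Copy of mpg without the faceting column; a layer built on it
# is drawn in every facet panel (assumption: "combined average"
# here means the pooled quantile curve across all classes).
pooled <- mpg[setdiff(names(mpg), "class")]

p_avg <- ggplot(data = mpg, mapping = aes(sample = cty)) +
  # pooled reference curve, repeated in each panel
  stat_qq(data = pooled, geom = "line", distribution = qunif,
          colour = "grey40", linetype = "dashed") +
  # per-class curve, one per panel
  stat_qq(aes(colour = class), geom = "line", distribution = qunif) +
  facet_wrap(~ class, nrow = 2) +
  labs(x = "Quantiles", y = "City Mileage", colour = "Class")

print(p_avg)
```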

Things to call out? Interestingly, minivans sit above the combined average at the lower quantiles, but that advantage shrinks as we move to the higher quantiles. The 2seater graph is exactly the opposite: higher than average at the lower quantiles and decreasing as we move to the higher quantiles.

We can also plot error bars based on 95% confidence intervals (CIs) using a normal approximation.
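A minimal sketch of such error bars, assuming they show mean city mileage per class (the original chart may have grouped the data differently). The interval is the usual normal approximation, mean plus or minus about 1.96 standard errors; groups, ci and p_ci are illustrative names.

```r
library(ggplot2)

# Mean and standard error of city mileage for each class
groups <- split(mpg$cty, mpg$class)
ci <- data.frame(
  class = names(groups),
  mean  = vapply(groups, mean, numeric(1)),
  se    = vapply(groups, function(x) sd(x) / sqrt(length(x)), numeric(1))
)

# Normal-approximation 95% interval: mean +/- z * se, with z about 1.96
z <- qnorm(0.975)
ci$lower <- ci$mean - z * ci$se
ci$upper <- ci$mean + z * ci$se

p_ci <- ggplot(ci, aes(x = class, y = mean)) +
  geom_point() +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2) +
  labs(x = "Class", y = "Mean city mileage (mpg)")

print(p_ci)
```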

Companies have been able to increase their cars' fuel efficiency with the roll-out of more efficient engines every year. The exceptions are the midsize and subcompact cars, which showed a decreasing trend year over year. Interesting!

Engine size does have an impact on both city and highway mileage, as is evident from the plot below.
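A sketch of how such a plot could be built, assuming engine size is the displ column of mpg and that a straight-line fit is a fair summary (the linear trend and the names long and p_engine are my assumptions, not necessarily the original chart):

```r
library(ggplot2)

# Stack city and highway mileage into one long data frame
# so both can share a single plot.
long <- rbind(
  data.frame(displ = mpg$displ, mileage = mpg$cty, type = "City"),
  data.frame(displ = mpg$displ, mileage = mpg$hwy, type = "Highway")
)

p_engine <- ggplot(long, aes(x = displ, y = mileage, colour = type)) +
  geom_point(alpha = 0.4) +
  # linear trend per mileage type (an illustrative choice)
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Engine displacement (litres)", y = "Mileage (mpg)",
       colour = "Mileage type")

print(p_engine)
```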

I definitely recommend checking out this package, as it's a powerful tool for visualization, and I love the fact that it does both the analysis and the plotting at the same time. More to come on this topic.

As marketers, being able to capture the sentiments and perceptions related to our brand helps us get valuable insights. It allows us to determine if our campaigns are getting the response that we planned for, while also giving us an opportunity to improve our strategy in case our ideas are not resonating well with our customers. In a 2014 Marketing Trends Survey, marketers ranked sentiment as the third-most valuable element to be extracted within data-driven marketing strategies, after web behavior and browsing behavior.

Today I thought I would do some digging into what Twitterites were thinking before and after the Super Bowl... and let's not forget the halftime show.

Pre-game: A sentiment algorithm was implemented prior to the game to scan all the tweets, remove the non-ASCII characters, and provide a score based on the number of positive and negative words present in each tweet. To count the positive and negative words, I used the English opinion lexicon provided by Hu and Liu: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

The Twitter API and an R program were used to extract tweets hashtagged #superbowl. The sentiment algorithm gave each tweet a score on a scale from -5 to 5. Negative scores indicated that a tweet was associated with unhappiness, zero meant neutral feelings, and positive scores meant viewers were happy.

For example: "Awesome Party!" receives a score of +1 (Awesome); "I hate this game." receives -1; "commercials were bad and useless. #unhappy" receives -3; and so on.
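The scoring described above can be sketched in a few lines of base R. This is a simplified stand-in, not the exact program I ran: the two word lists are tiny hypothetical samples in place of the full Hu and Liu lexicon, and score_tweet is an illustrative name.

```r
# Tiny stand-in word lists (hypothetical); the real analysis
# uses the full Hu and Liu opinion lexicon.
pos_words <- c("awesome", "super", "delicious", "fun")
neg_words <- c("hate", "bad", "useless", "unhappy")

score_tweet <- function(text, pos = pos_words, neg = neg_words) {
  # drop non-ASCII characters
  text <- iconv(text, from = "UTF-8", to = "ASCII", sub = "")
  # replace punctuation with spaces and lower-case everything
  text <- tolower(gsub("[^[:alnum:][:space:]]", " ", text))
  words <- unlist(strsplit(text, "[[:space:]]+"))
  words <- words[nzchar(words)]
  # score = number of positive words minus number of negative words
  sum(words %in% pos) - sum(words %in% neg)
}

tweets <- c("Awesome Party!",
            "I hate this game.",
            "commercials were bad and useless. #unhappy")
scores <- vapply(tweets, score_tweet, integer(1), USE.NAMES = FALSE)
print(scores)
```

Every occurrence of a lexicon word counts, so a repeated word moves the score more than once.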

Game on!

People were feeling positive before the start of the game, and this tweet received a +5.

Super*3 + Delicious + Fun = 5

Overall score distribution at 2:16 pm PST: 198 of the 3,000 tweets got a negative score, and the rest were positive or neutral. As you can see in the chart below, about 93.4% of tweets were positive/neutral and the remaining 6.6% were negative.
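The split comes down to counting the scores below zero. A small sketch with a made-up score vector (illustrative only, not the game data; label and share are hypothetical names):

```r
# Made-up scores for illustration; not the actual tweet scores
scores <- c(-3, -1, 0, 0, 1, 1, 2, 5)

# Anything below zero is negative; zero and above is positive/neutral
label <- ifelse(scores < 0, "negative", "positive/neutral")

# Percentage of tweets in each bucket, rounded to one decimal
share <- round(100 * prop.table(table(label)), 1)
print(share)
```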

Sorry, no points for sarcasm. It's difficult to create an algorithm for that.

Halftime: That was a pretty spectacular halftime show by Katy Perry, according to Twitter.

The share of positive/neutral tweets jumped to 99.3% just after her show. Out of a total of 3,000 tweets, only 22 had a negative score.

Two-minute warning: The tense final moments... and we still receive a score of +1! Hope the Victoria's Secret ad reduced some of the tension.

Final scores were 90% positive/neutral and 10% negative. You can clearly see happy Patriots fans and some dejected Seahawks fans. Also, with the Super Bowl taking a violent turn at the end of the game, the number of negative tweets shot up.

Here is a box plot of the overall sentiment of the game. In the figure you can see the middle of the distribution (the box) sitting between 0 and 1, with 2 as the upper whisker and scores of 3 and 4 showing up as positive outliers.
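To read those numbers off programmatically, base R's boxplot.stats() returns the whisker ends, hinges, median and outliers. The score vector below is made up to mimic the shape described (it is not the game data):

```r
# Made-up scores that mimic the shape described; not the actual data
scores <- c(rep(0, 50), rep(1, 40), rep(2, 5), 3, 4)

bs <- boxplot.stats(scores)
print(bs$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$out)    # points more than 1.5 * IQR beyond the hinges

boxplot(scores, horizontal = TRUE, xlab = "Sentiment score")
```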

There you go, folks... SUPER BOWL XLIX goes to the New England Patriots!

Integrating R with Tableau allows me to create more interactive charts. Below is an analysis on tweets mentioning "Franklin Templeton" with the positive tweets for the past week being displayed on a tree map/packed bubble.

Reference:

1. Jeffrey Breen's approach to Twitter text mining.

2. Opinion lexicon in English, provided by Hu and Liu: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA.

Bing Liu, Minqing Hu and Junsheng Cheng. "Opinion Observer: Analyzing and Comparing Opinions on the Web." Proceedings of the 14th International World Wide Web conference (WWW-2005), May 10-14, 2005, Chiba, Japan.

Monday, June 27, 2016

Visualizing in R through ggplot2

Sunday, February 1, 2015

Analyzing SuperBowl Sentiment with Twitter API and R