Showing posts with label R. Show all posts

April 4, 2016

Snowzilla - Blizzard 2016!

    From January 22–24, 2016, a major blizzard produced up to 3 ft (91 cm) of snow in parts of the Mid-Atlantic and Northeast United States. I live in Raleigh, NC, which also saw its fair share of snowfall. The internet was down for a day, but when it came back I started collecting tweets referring to the blizzard.

    I wanted to see where the tweets were coming from, so I searched for tweets mentioning stormjonas, blizzard2016, jonas, blizzard, or snowzilla. I ran the script for around 10-12 hours and managed to collect around 4k tweets from all over the world. The count is low because geotagged tweets are far rarer than regular tweets.

    The data required some cleaning and sorting by location. There were tweets from all over the world, but most of them were from the United States, naturally.
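As a rough illustration of the cleaning step, here is a minimal base R sketch. The data frame and its column names (lat, lon) are hypothetical stand-ins, not the schema of the data actually collected, and the real cleaning may well have involved more than this.

```r
# Toy stand-in for the collected tweets; the column names (lat, lon)
# are hypothetical, not the author's actual schema.
tweets <- data.frame(
  lat = c(35.78, 35.78, 40.71, NA, 51.51),
  lon = c(-78.64, -78.64, -74.01, -0.13, NA)
)

# Keep only tweets that carry both coordinates.
geo <- tweets[!is.na(tweets$lat) & !is.na(tweets$lon), ]

# Collapse repeated coordinates so each location appears once.
locations <- unique(geo[, c("lat", "lon")])

# Sort by longitude (west to east), then by latitude.
locations <- locations[order(locations$lon, locations$lat), ]
print(locations)
```

Dropping incomplete rows before deduplicating keeps the later map free of points with a missing coordinate.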

    Let's start with tweets from the United States. After sorting the data I saw that there were tweets from every state, including Hawaii and Alaska. I wish I had collected the text of the tweets to get a general idea of what each state was talking about. Anyway, I plotted each unique location on the map, and this is how it looks.
Tweets from USA
    There are lots of tweets around the northeastern states, which were battered by the snow. The density decreases as we move west, though there is some activity on the West Coast. I wonder what they were talking about: either the excellent sunny weather they have throughout the year, or how they could easily have used an extra holiday.

    New York received record snowfall too. Let's compare the average snowfall in each New York City borough against the number of tweets from that borough.


    Manhattan produced the most tweets even though it received the least snowfall on average. The average snowfall was almost the same in every borough, but Staten Island was hit the hardest while producing hardly any tweets. (Source).

    According to this article, the six states that received the most snowfall are shown below, followed by the number of tweets from each state in descending order.

Avg Snowfall in inches
Number of tweets from each state
 
West Virginia, which received the highest snowfall, does not even make the tweet list, while California, which received no snow, had a considerable number of tweets.
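A tally like the one above can be sketched in base R as follows. The state labels are made-up placeholders rather than the collected data, and the check at the end only illustrates how one might test whether a given state appears in the counts.

```r
# Hypothetical state labels for a handful of geotagged tweets
# (placeholder values, not the collected data).
state <- c("New York", "California", "New York", "Virginia",
           "New York", "California", "Maryland")

# Count tweets per state and sort from most to fewest.
tweets_by_state <- sort(table(state), decreasing = TRUE)
print(tweets_by_state)

# Check whether a state of interest shows up in the tweet counts at all.
has_wv <- "West Virginia" %in% names(tweets_by_state)
print(has_wv)
```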

    About 100 tweets came from the United Kingdom. This was mostly due to reports that the blizzard was heading to the UK from the USA. (Source).
Here you can see tweets coming from all over the world, the majority from the US and Europe.


I wish I had collected the text of the tweets to see the sentiment in different states. I have presented some of the ideas I had and will add more as I think of them. Comments and suggestions are welcome!

All the code can be found over here


October 18, 2014

125 Years of English Football!



    So the Barclays Premier League has been back for quite some time, and all the other leagues are back in action too. And for the first time, we have an Indian football league as well: the ISL! (I hope it changes Indian football forever!)

    The Football League (as it was known back then) was created in 1888 by Aston Villa director William McGregor. Since then, English football has grown to thousands of teams playing in hundreds of leagues. The BPL is just the tip of the iceberg. This blog gives complete information on the hierarchy of English football.

    James Curley, assistant professor of psychology at Columbia University, gathered data from lots of sources in his free time and compiled it into what's probably the best collection of English football scores. Sitting quietly on this Github page are the scores of nearly 200,000 games played in the top 4 leagues since 1888. These 14 megabytes can tell us remarkable stories about 125 years of English football!

    I used R for all the data manipulation. The code below shows how to load the data into R.
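A minimal sketch of a loading step in base R, assuming the results are available as a CSV file. The column names (Season, home, visitor, hgoal, vgoal) and the three made-up rows are assumptions for illustration only; in practice, point read.csv at the file downloaded from the Github page.

```r
# Write a three-row sample file so the sketch runs without network access.
# The column names and the rows are assumed for illustration.
sample_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "Season,home,visitor,hgoal,vgoal",
  "1888,Team A,Team B,2,1",
  "1888,Team C,Team D,0,0",
  "1889,Team B,Team A,1,1"
), sample_csv)

# Read the file; keep team names as plain character strings.
matches <- read.csv(sample_csv, stringsAsFactors = FALSE)

# Inspect the column types and the first few rows.
str(matches)
print(head(matches))
```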

    Take scorelines, for example: in 188,060 games there were 13,475 0-0 draws. The most common scoreline is 1-1, accounting for roughly 21,000 games (11%).
Top Five Full Time scores
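One way to produce such a ranking in base R is sketched below. The goal vectors are toy values, and the names hgoal and vgoal are assumed column names, so this is not necessarily how the chart above was made.

```r
# Toy home and away goals for eight matches (assumed column names).
hgoal <- c(1, 0, 1, 2, 1, 0, 3, 1)
vgoal <- c(1, 0, 1, 1, 0, 0, 1, 1)

# Build a "home-away" label for each match, e.g. "2-1".
scoreline <- paste(hgoal, vgoal, sep = "-")

# Count each scoreline and keep the five most frequent.
top_scores <- head(sort(table(scoreline), decreasing = TRUE), 5)
print(top_scores)

# Express the same counts as a percentage of all matches.
top_share <- round(100 * top_scores / length(scoreline), 1)
print(top_share)
```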
Now, let's talk about goals!

In 188,060 matches played in 125 years, a total of 542,288 goals were scored!
About 330,000 goals were scored by the home team and the rest by the visiting team.




We see that average home goals have fallen significantly, while away goals keep oscillating.
This drop in average home goals and rise in away goals over the past twenty years explains the graphs below.
Home wins have fallen to about 44%, while away wins are on a gradual rise. This means that playing at home matters less than it used to, and home dominance is slowly fading.
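The share of home wins, draws, and away wins per season can be computed along these lines. This is a sketch on toy data with assumed column names (Season, hgoal, vgoal), not the exact code behind the graphs.

```r
# Toy results for two seasons (assumed column names, made-up scores).
matches <- data.frame(
  Season = c(1900, 1900, 1900, 1900, 2000, 2000, 2000, 2000),
  hgoal  = c(3, 2, 1, 0, 1, 0, 2, 1),
  vgoal  = c(1, 0, 1, 2, 1, 2, 0, 3)
)

# Label each match as a home win (H), away win (A), or draw (D).
matches$result <- ifelse(matches$hgoal > matches$vgoal, "H",
                  ifelse(matches$hgoal < matches$vgoal, "A", "D"))

# Row-wise proportions: each season's results sum to 100%.
counts <- table(matches$Season, matches$result)
shares <- round(100 * prop.table(counts, margin = 1), 1)
print(shares)
```

Using margin = 1 normalises within each season, so seasons with different numbers of matches stay comparable.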

Average goals per game have also fallen.
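For reference, the overall totals quoted earlier work out to roughly 2.88 goals per game. The per-season averages can be sketched in base R as below, again on toy data with assumed column names.

```r
# Overall average from the totals quoted in the post.
overall_avg <- 542288 / 188060
print(round(overall_avg, 2))

# Toy results for two seasons (assumed column names, made-up scores).
matches <- data.frame(
  Season = c(1900, 1900, 1900, 1900, 2000, 2000, 2000, 2000),
  hgoal  = c(3, 2, 2, 1, 1, 0, 2, 1),
  vgoal  = c(1, 0, 1, 2, 1, 2, 0, 1)
)

# Mean home and away goals per match within each season.
avg_goals <- aggregate(cbind(hgoal, vgoal) ~ Season, data = matches, FUN = mean)

# Total goals per game is the sum of the two averages.
avg_goals$total <- avg_goals$hgoal + avg_goals$vgoal
print(avg_goals)
```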


We see huge shifts in average goals around the years 1925 and 1965. The reason is rule changes.


1958 - Substitutions were allowed for the first time
This roughly corresponds with the beginning of a steep decline in scoring in the 1960s. This could make for a plausible causal explanation: Perhaps playing with an injured player left teams extremely vulnerable on defense, leading to many goals. The addition of the substitute may have mitigated these effects.

The reduction in goals in the late 1920s isn't well explained, but it is believed to be mainly due to tactical changes. (Teams used to play many forwards; later, the number of defensive and midfield players increased.)


All the code used for plotting the above charts and manipulating the data can be found here.


Hope that it was a good read! 
Suggestions and feedback are always welcome! 

Happy Coding :D