One type of data that’s easy to find on the net is weather data. Many sites provide historical data on meteorological parameters such as pressure, temperature, humidity, wind speed, visibility, etc.
Today we are performing data analysis on one such dataset. You can find the data here.
This dataset has hourly temperature recorded for 10 years, from 2006-04-01 00:00:00.000 +0200 to 2016-09-09 23:00:00.000 +0200. It corresponds to Finland, a country in Northern Europe.
We need to analyze the data to test the null hypothesis (H0), which states that the apparent temperature and humidity, compared monthly across the 10 years of data, have not increased due to global warming.
Testing H0 means finding whether the average apparent temperature for a given month, say April, from 2006 to 2016, and the average humidity for the same period, have increased or not.
I have performed the analysis using Python in Google Colab. You can use Jupyter Notebook too.
So, let’s begin:
Step 1: Import the necessary libraries for the analysis.
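The article does not list the exact imports; a minimal set for a pandas/matplotlib workflow like this would be:

```python
# Libraries commonly used for this kind of analysis
import pandas as pd               # dataframes and CSV loading
import matplotlib.pyplot as plt   # line plots for the visualization step
```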
Step 2: Load the dataset. Since the data is in the form of csv file, we use read_csv() of Pandas.
Step 3: Data cleaning. Let’s start by checking the number of rows and columns in the data.
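The row/column check is a one-liner with `DataFrame.shape` (shown here on a tiny stand-in frame):

```python
import pandas as pd

# Tiny stand-in frame; on the real data this prints (96453, 11)
df = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, 5.0, 6.0]})
print(df.shape)  # (rows, columns)
```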
We see that there are 96453 rows and 11 columns.
Next, remove duplicate rows (if any):
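One way to do this is `drop_duplicates()`, comparing the row count before and after (sketched on a small frame with one deliberate duplicate):

```python
import pandas as pd

# Small frame with one deliberate duplicate row
df = pd.DataFrame({"x": [1, 1, 2], "y": ["a", "a", "b"]})
before = len(df)
df = df.drop_duplicates()
print(f"Removed {before - len(df)} duplicate row(s)")
```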
Looks like there are no duplicate rows. Now, let’s check the datatype of each column in the dataframe.
Everything looks fine except the ‘Formatted Date’ column. It is shown as object, which corresponds to the string datatype, so we need to convert it to a datetime type.
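The conversion can be done with `pd.to_datetime()`. A sketch, assuming the timestamps carry a +0200 offset as in this dataset; `utc=True` folds that offset into a single timezone-aware dtype:

```python
import pandas as pd

df = pd.DataFrame({"Formatted Date": ["2006-04-01 00:00:00.000 +0200",
                                      "2006-04-01 01:00:00.000 +0200"]})
print(df.dtypes)  # 'Formatted Date' shows as object (i.e. string)

# utc=True converts the +0200 offset into one timezone-aware dtype
df["Formatted Date"] = pd.to_datetime(df["Formatted Date"], utc=True)
print(df.dtypes)  # now datetime64[ns, UTC]
```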
Next step would be to check for null values.
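A common way to count missing values per column is `isnull().sum()` (demonstrated on a small frame with one missing ‘Precip Type’ entry):

```python
import pandas as pd

# Small frame with one missing value in 'Precip Type'
df = pd.DataFrame({"Precip Type": ["rain", None, "snow"],
                   "Humidity": [0.89, 0.86, 0.90]})
print(df.isnull().sum())  # per-column count of missing values
```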
Seems like there are 517 null values in one particular column, ‘Precip Type’.
Now, ideally we would want the dataset to be free of missing values for visualization and analysis. But in cases like this, we need to answer one question before making any decision: are the columns with missing values actually necessary for the analysis we are doing?
In our case, the hypothesis asks whether the apparent temperature and humidity are increasing due to global warming, so only those two columns matter for the analysis. The other columns do not look significant here, so it is safe to say that taking no action on the null values in the ‘Precip Type’ column will not affect the quality of our analysis.
Step 4: The final step is data visualization and analysis. For this, let’s create a new dataframe containing the columns ‘Formatted Date’, ‘Apparent Temperature (C)’ and ‘Humidity’ from data_w.
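Selecting those columns is a plain column-list indexing operation. A sketch, with a one-row stand-in for data_w (the exact column spellings are assumptions based on the dataset):

```python
import pandas as pd

# One-row stand-in for data_w; column names assumed from the dataset
data_w = pd.DataFrame({
    "Formatted Date": pd.to_datetime(["2006-04-01 00:00:00 +0200"], utc=True),
    "Apparent Temperature (C)": [7.39],
    "Humidity": [0.89],
    "Summary": ["Partly Cloudy"],
})

# Keep only the columns the hypothesis needs
df_new = data_w[["Formatted Date", "Apparent Temperature (C)", "Humidity"]]
print(df_new)
```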
The above dataframe contains apparent temperature and humidity at an hourly level. It’s difficult to decipher anything by plotting it directly, so we resample the dataframe to display apparent temperature and humidity at a monthly level.
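The resampling step can be sketched as follows, on a synthetic hourly frame covering April and May 2006 (the real frame would use the converted ‘Formatted Date’ as its index):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: hourly readings for April and May 2006
idx = pd.date_range("2006-04-01", periods=24 * 61, freq="h")
df = pd.DataFrame(
    {"Apparent Temperature (C)": np.arange(len(idx), dtype=float),
     "Humidity": 0.8},
    index=idx,
)

# 'MS' (Month Start) groups every hourly row of the same month into
# one row stamped at the first day of that month; mean() averages them
monthly = df.resample("MS").mean()
print(monthly)
```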
In the above code we resample the data using the resample() function. The parameter passed, ‘MS’ (which stands for Month Start), aggregates all rows belonging to the same month into a single row, and mean() computes the average of the aggregated data.
Now, let’s make a line plot based on the dataframe, showing the fluctuations (if any) in apparent temperature and humidity.
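A minimal plotting sketch with matplotlib, using a synthetic monthly frame in place of the resampled one (in a notebook, drop the `Agg` backend line):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the resampled monthly frame
idx = pd.date_range("2006-01-01", periods=24, freq="MS")
monthly = pd.DataFrame(
    {"Apparent Temperature (C)": 10 + 10 * np.sin(np.arange(24) * np.pi / 6),
     "Humidity": 0.8},
    index=idx,
)

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(monthly.index, monthly["Apparent Temperature (C)"],
        label="Apparent Temperature (C)")
ax.plot(monthly.index, monthly["Humidity"], label="Humidity")
ax.set_title("Monthly average apparent temperature and humidity")
ax.legend()
plt.show()
```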
The plot above shows that humidity as well as apparent temperature remain almost the same over the years. We say apparent temperature is almost the same because its peaks and lows fall roughly along the same line.
Let’s look at the variation for a specific month across the years, say April.
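Filtering the monthly frame down to April of each year can be done with the datetime index’s `month` attribute. A sketch on a stand-in monthly frame:

```python
import pandas as pd

# Stand-in monthly frame covering 2006-2016 (one row per month start)
idx = pd.date_range("2006-01-01", "2016-12-01", freq="MS")
monthly = pd.DataFrame({"Apparent Temperature (C)": 10.0, "Humidity": 0.8},
                       index=idx)

# Keep only April of each year via the index's month attribute
april = monthly[monthly.index.month == 4]
print(april.index.year.tolist())  # one row per April, 2006 through 2016
```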
For the month of April, the humidity has remained the same throughout the period 2006-2016. In the case of apparent temperature, we can see a moderate increase around 2009 and a drop in 2010, with roughly the same temperature for the rest of the period until 2015, when the temperature drops again, followed by an increase in 2016.
The full code can be obtained here.