Analysing time-series of Quandl

Quandl & Rapporter

2013/11/15 04:17:13 PM

Metadata

Analysing Oil, Gold, and Stocks between 2012-01-01 and 2012-06-31, downloaded from Quandl in 0.724 seconds, with the following original description:

Comparison of oil, gold and stock markets. USD.

Variables

This daily dataset contains 483 rows and 4 columns, 1932 records overall, of which 942 records will be analysed for the Oil variable.

Overview

The descriptive statistics of Oil can be used to provide a quick and dirty overview of the data:

min    mean   median   max   sd     IQR
77.7   96     95.5     111   7.18   11
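
The six statistics in the table can be reproduced with a few lines of code. The following is a minimal sketch using only the Python standard library; the `describe` helper and the `prices` list are hypothetical illustrations, not the Quandl data or the code behind this report.

```python
import statistics


def describe(values):
    """Return min, mean, median, max, sample sd and IQR of a numeric sequence."""
    # Quartiles with linear interpolation between the closest ranks
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "max": max(values),
        "sd": statistics.stdev(values),  # sample standard deviation (n - 1)
        "IQR": q3 - q1,
    }


# Hypothetical prices for illustration only, NOT the Oil series
prices = [77.7, 85.0, 92.0, 95.5, 99.0, 104.0, 111.0]
summary = describe(prices)
for name, value in summary.items():
    print(f"{name}: {value:.3g}")
```

Note that different quantile definitions give slightly different IQR values, so a small deviation from another tool's output does not necessarily indicate an error.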

Histogram

The histogram below also shows that the values lie between 77.7 and 111 (range: 32.9), with a mean of 96:

Histogram of Oil

If the data is not normally distributed, it is worth checking the median (95.5) and the interquartile range (11) rather than relying on the standard deviation (7.18) alone.

Observed values

The histogram above does not show much about a time series, does it? Let us check out other options.

Line plot

The daily data between 2012-01-03 and 2013-11-12 on a line-plot:

Heatmap

The same data looks much better on a calendar heatmap:

Autocorrelation

Computing the cross-correlation of a signal with itself, known as autocorrelation, is a mathematical tool for finding repeating patterns in a time series. Basically, we compute the correlation coefficient between the raw data and its lagged version for several lags, where high (> 0.5) or low (< -0.5) values indicate a repeating pattern.

The autocorrelation estimate reaches its maximum at lag 1, where it is 0.98.
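
To make the idea of correlating a series with its lagged copy concrete, here is a minimal sketch of the usual sample autocorrelation estimator in plain Python. It is an assumption that the report uses this particular estimator (it normalises by the variance of the whole series, which differs slightly from a Pearson correlation of the two overlapping slices); the `autocorrelation` function and the `trend` series are hypothetical.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a non-negative integer `lag`."""
    n = len(series)
    mean = sum(series) / n
    # Total sum of squared deviations (lag-0 autocovariance times n)
    denom = sum((x - mean) ** 2 for x in series)
    # Sum of products of deviations that are `lag` steps apart
    num = sum(
        (series[t] - mean) * (series[t + lag] - mean)
        for t in range(n - lag)
    )
    return num / denom


# Hypothetical, strongly trending series: neighbouring values are very similar
trend = [float(i) for i in range(100)]
r0 = autocorrelation(trend, 0)
r1 = autocorrelation(trend, 1)
print(r0, r1)
```

A slowly drifting series such as a daily price tends to give a lag-1 value close to 1, simply because each day resembles the previous one; this alone does not prove a seasonal pattern.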

Seasonal effects

There is not enough data in the time series (fewer than 2 periods), so seasonal decomposition is not attempted.

Linear model

And now we build a really simple linear model based on the year, the month, the day of the month and also the day of the week to predict Oil, Gold, and Stocks. The model that can be built automatically is: value ~ year + month + mday + wday.
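
The four predictors in the formula are all derived from the observation date. As a hedged sketch, they could be extracted as follows; the `calendar_features` helper is hypothetical, and the coding of `wday` (0 = Sunday to 6 = Saturday) and of `month` (1 to 12) are assumptions, since the report does not state which conventions it uses.

```python
from datetime import date


def calendar_features(day):
    """Split a date into the predictors of value ~ year + month + mday + wday."""
    return {
        "year": day.year,
        "month": day.month,  # assumption: 1 = January ... 12 = December
        "mday": day.day,     # day of the month
        # assumption: 0 = Sunday ... 6 = Saturday
        "wday": day.isoweekday() % 7,
    }


# First observation date mentioned in the report (a Tuesday)
features = calendar_features(date(2012, 1, 3))
print(features)
```

Treating these calendar fields as plain numbers means the model can only express a linear drift per year, per month, and so on, which is why it is described as really simple.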

Assumptions

In order to have reliable results, we have to check whether the assumptions of linear regression are met by the data we used:

                     Value   p-value    Decision
Global Stat          180     0          Assumptions NOT satisfied!
Skewness             2       0.157      Assumptions acceptable.
Kurtosis             16.5    4.82e-05   Assumptions NOT satisfied!
Link Function        160     0          Assumptions NOT satisfied!
Heteroscedasticity   1.87    0.171      Assumptions acceptable.

To check these assumptions, we use the Global Validation of Linear Model Assumptions R package. Its results are shown in the table above.

The GVLMA performs a thorough check of the linear model, including tests of the overall fit, the shape of the distribution of the residuals (skewness and kurtosis), the linearity and the homoskedasticity. The table shows whether our model meets the assumptions. As a generally accepted rule of thumb, we use the critical p-value of 0.05.
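
The decision column follows directly from comparing each p-value with the critical value. The sketch below applies that rule to the p-values printed in the table; the `decide` helper is hypothetical and is not part of the GVLMA package, and the p-values shown as 0 are rounded in the report.

```python
ALPHA = 0.05  # critical p-value used in the report

# p-values copied from the assumptions table (0 means "rounded to zero")
gvlma_p_values = {
    "Global Stat": 0.0,
    "Skewness": 0.157,
    "Kurtosis": 4.82e-05,
    "Link Function": 0.0,
    "Heteroscedasticity": 0.171,
}


def decide(p_value, alpha=ALPHA):
    """Keep the assumption unless the test rejects it at level alpha."""
    if p_value < alpha:
        return "Assumptions NOT satisfied!"
    return "Assumptions acceptable."


decisions = {name: decide(p) for name, p in gvlma_p_values.items()}
for name, decision in decisions.items():
    print(f"{name}: {decision}")
```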

So let's see the results the test gave us:

In summary: We can't be sure that the linear model used here fits the data.

Linearity

As we want to fit a linear regression model, it is advisable to check whether the relationship between the variables used is indeed linear. Besides the test statistic of the GVLMA, it is advisable to use a graphical device as well to check linearity. Here we will use the so-called crPlots function, whose name is an abbreviation of Component and Residual Plot.

First, we can see two lines and several circles. The red dashed line is the best-fitting straight line, that is, the one that minimises the sum of squared residuals. The green curved line is the best-fitting line overall, which does not have to be straight. The circles are the observations we investigate. We can talk about linearity if the green line does not lie too far from the red one.

Parameters

A linear model: value ~ year + month + mday + wday
              Estimate   Std. Error   t value   Pr(>|t|)
d$year        3.86       0.627        6.15      1.63e-09
d$month       -0.449     0.0962       -4.66     4.06e-06
d$wday        -0.0127    0.222        -0.0573   0.954
d$mday        -0.0128    0.0355       -0.36     0.719
(Intercept)   -7672      1263         -6.08     2.56e-09
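
As a worked example of how these estimates turn a date into a fitted value, the sketch below plugs the rounded coefficients into the model formula. This is only an illustration: the `predict` helper is hypothetical, the weekday and month codings are assumptions, and because the year coefficient and the intercept are rounded in the table, the result can be off by several units from what the full-precision model would return.

```python
# Rounded estimates copied from the table above
COEF = {
    "(Intercept)": -7672.0,
    "year": 3.86,
    "month": -0.449,
    "mday": -0.0128,
    "wday": -0.0127,
}


def predict(year, month, mday, wday):
    """Evaluate value ~ year + month + mday + wday with the rounded estimates."""
    return (
        COEF["(Intercept)"]
        + COEF["year"] * year
        + COEF["month"] * month
        + COEF["mday"] * mday
        + COEF["wday"] * wday
    )


# 2012-01-03, a Tuesday; assumption: wday is coded 0 = Sunday ... 6 = Saturday
fitted = predict(year=2012, month=1, mday=3, wday=2)
print(round(fitted, 1))
```

The year and month terms dominate the result, which matches the table: only those two predictors have small p-values.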

Most model parameters can be read from the table above, but it says nothing about the goodness of fit. The R-squared turned out to be 0.126, while the adjusted version is 0.119.
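
The adjusted version penalises the R-squared for the number of predictors. The sketch below uses the standard formula; taking the number of observations to be 483 (the row count quoted in the Variables section) is an assumption, since the report does not state how many rows entered the fit.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Assumption: n_obs = 483; four predictors (year, month, mday, wday)
adj = adjusted_r_squared(0.126, n_obs=483, n_predictors=4)
print(round(adj, 3))
```

Under that assumption the result is about 0.119, in line with the reported value. Either way, the model explains only around an eighth of the variance in the data.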

Residuals

Let us also check out the residuals of the above linear model:

Predicted values

At last, let us compare the original data with the predicted values:

ARIMA

Here we try to identify the best ARIMA model to better understand the data or to predict future points in the series. The model chosen according to the AIC, AICc or BIC value is:

Damn, we could not fit a model:


We are terribly sorry, but this computational intensive process
is not allowed to be run on a time-series with more then 365 values.
Please sign up for an account at rapporter.net for extra resources
or filter your data by date.
