Plotting - Statistical Significance

The main library for plotting is matplotlib, whose interface is modeled on the plotting commands of MATLAB.

We can also use the seaborn library, which is built on top of matplotlib, to produce visually nicer plots.

Simple plots

The documentation for the plot function for data frames can be found here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html

Histograms

Plotting columns against each other

Plot columns B,C,D against A

The plt.figure() command creates a new figure for each plot
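As a minimal sketch of this pattern, we assume a hypothetical data frame `df` in which B and C are power laws of A and D is exponential in A. The data and the column definitions are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data (an assumption): B and C are power laws of A, D is exponential in A.
A = np.arange(1, 51)
df = pd.DataFrame({"A": A, "B": 1.0 / A, "C": 1.0 / A**2, "D": np.exp(-0.2 * A)})

figures = []
for col in ["B", "C", "D"]:
    fig = plt.figure()          # open a new figure so the plots do not overlap
    plt.plot(df["A"], df[col])  # draws into the figure that was just created
    plt.xlabel("A")
    plt.ylabel(col)
    figures.append(fig)
```

Without the call to plt.figure(), the three curves would all be drawn into the same axes.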

Grid of plots

Use a grid to put all the plots together

Plot all columns together against A.

Clearly they are different functions

Plot all columns against A in log scale

We observe straight lines for B and C, and a steeper drop for D. A straight line in log-log scale indicates a power-law dependence on A.

Plot with log scale only on y-axis.

The plot of D becomes a straight line, indicating that D is an exponential function of A.
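The two scalings can be sketched with the loglog and logy arguments of the data frame plot function. The data frame below is hypothetical (B and C as power laws, D as an exponential), and the line fit at the end is our own check, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical data (an assumption), chosen to mimic the behaviour described above.
A = np.arange(1, 51)
df = pd.DataFrame({"A": A, "B": 1.0 / A, "C": 1.0 / A**2, "D": np.exp(-0.2 * A)})

ax_loglog = df.plot(x="A", y=["B", "C", "D"], loglog=True)  # both axes logarithmic
ax_semilog = df.plot(x="A", y=["B", "C", "D"], logy=True)   # only the y-axis logarithmic

# A straight line on the semi-log plot means that log(D) is linear in A.
slope, intercept = np.polyfit(df["A"], np.log(df["D"]), 1)
print(slope)  # the decay rate of the exponential
```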

Plotting using matplotlib

Also how to put two figures in a 1x2 grid
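A minimal sketch of a 1x2 grid with plt.subplots, using hypothetical curves in place of the real columns:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(1, 51)  # hypothetical data (an assumption)

# One row, two columns: plt.subplots returns the figure and one Axes per cell.
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))

ax_left.plot(x, 1.0 / x, label="B")
ax_left.set_xscale("log")
ax_left.set_yscale("log")
ax_left.set_title("log-log scale")
ax_left.legend()

ax_right.plot(x, np.exp(-0.2 * x), label="D")
ax_right.set_yscale("log")
ax_right.set_title("log scale on y-axis only")
ax_right.legend()

fig.tight_layout()  # avoid overlapping labels between the two panels
```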

Using seaborn

Scatter plots

Scatter plots take as input two series X and Y and plot the points (x,y).

We will do the same plots as before as scatter plots using the dataframe functions

Putting many scatter plots into the same plot

Using seaborn

In log-log scale (for some reason it seems to throw away small values)

Statistical Significance

Recall the dataframe we obtained when grouping by gain

We see that there are differences in the volume of trading depending on the gain. But are these differences statistically significant? We can test that using the Student t-test. The t-test gives us the difference between the means in units of standard error, and a p-value: the probability of observing a difference at least this large if the two groups actually had the same mean. Usually we require the p-value to be less than 0.05 (or 0.01 if we want to be more strict) to call the difference significant. Note that for the test we need all the values in each group, not just the group means.

To compute the t-test we will use the SciPy library, a Python library for scientific computing.

The Student t-test

The t-test value is:

$$t = \frac{\bar{x}_1-\bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}} $$

where $\bar x_i$ is the mean value of the $i$-th dataset, $\sigma_i^2$ is its variance, and $n_i$ is its size.
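The following sketch evaluates this statistic by hand and compares it with scipy.stats.ttest_ind. The two samples are synthetic stand-ins for the volumes of two gain groups (an assumption), and we pass equal_var=False because the formula above uses a separate variance for each group.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical samples (an assumption) standing in for two groups of volumes.
x1 = rng.normal(loc=10.0, scale=2.0, size=200)
x2 = rng.normal(loc=10.5, scale=3.0, size=150)

# The statistic from the formula, using sample variances (ddof=1).
std_err = np.sqrt(x1.var(ddof=1) / len(x1) + x2.var(ddof=1) / len(x2))
t_manual = (x1.mean() - x2.mean()) / std_err

# equal_var=False selects the unequal-variance form of the test.
t_scipy, p_value = stats.ttest_ind(x1, x2, equal_var=False)
print(t_manual, t_scipy, p_value)
```

With the default equal_var=True, SciPy pools the two variances instead, so the statistic can differ slightly from the formula.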

Kolmogorov-Smirnov Test

Test if the data for small and large gain come from the same distribution. The p-value > 0.1 indicates that we cannot reject the null hypothesis that they do come from the same distribution.

Use scipy.stats.ks_2samp to test whether two samples come from the same distribution.

If you want to test a single sample against a fixed distribution (e.g., normal), use scipy.stats.kstest.
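A short sketch of both calls, using synthetic samples in place of the small-gain and large-gain groups (the data are an assumption):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical samples (an assumption), both drawn from a standard normal.
small_gain = rng.normal(loc=0.0, scale=1.0, size=300)
large_gain = rng.normal(loc=0.0, scale=1.0, size=300)

# Two-sample test: do the two samples come from the same distribution?
two_sample = stats.ks_2samp(small_gain, large_gain)

# One-sample test: does the sample follow a standard normal distribution?
one_sample = stats.kstest(small_gain, "norm")

print(two_sample.statistic, two_sample.pvalue)
print(one_sample.statistic, one_sample.pvalue)
```

The statistic is the largest distance between the two cumulative distribution functions, so it always lies between 0 and 1.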

$\chi^2$-test

We use the $\chi^2$-test to test if two random variables are independent. The larger the test statistic, the further the observed counts are from what independence would predict. The p-value tells us whether this deviation is statistically significant.
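As a sketch, scipy.stats.chi2_contingency runs the test on a table of counts. The 2x2 table below is made up for illustration; in practice such a table could be built from two categorical columns with pandas.crosstab.

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table (an assumption): rows and columns are the
# categories of the two variables, entries are observed counts.
observed = np.array([[30, 10],
                     [20, 40]])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print(chi2, p_value, dof)
print(expected)  # counts we would expect if the two variables were independent
```

Note that for 2x2 tables SciPy applies a continuity correction by default.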

Error bars

We can compute the standard error of the mean using the stats.sem function of SciPy. The same quantity is also available as the sem method of a data frame.
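The sketch below compares the two routes on a synthetic data frame. The column names gain and Volume, and the values, are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(2)
# Hypothetical data (an assumption): a gain category and a trading volume.
df = pd.DataFrame({
    "gain": rng.choice(["small", "large"], size=200),
    "Volume": rng.normal(loc=100.0, scale=15.0, size=200),
})

sem_scipy = stats.sem(df["Volume"])    # SciPy function
sem_pandas = df["Volume"].sem()        # pandas method, same default (ddof=1)
sem_manual = df["Volume"].std(ddof=1) / np.sqrt(len(df))

# The pandas method also works per group.
sem_by_group = df.groupby("gain")["Volume"].sem()
print(sem_scipy, sem_pandas, sem_manual)
print(sem_by_group)
```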

Computing confidence intervals
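One common way to obtain a confidence interval for the mean is to use the t-distribution with the standard error as scale. The sketch below does this for a synthetic sample; the data and the 95% level are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
values = rng.normal(loc=100.0, scale=15.0, size=200)  # hypothetical sample

mean = values.mean()
sem = stats.sem(values)

# 95% confidence interval for the mean, with n - 1 degrees of freedom.
low, high = stats.t.interval(0.95, len(values) - 1, loc=mean, scale=sem)
print(mean, low, high)
```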

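We can also visualize the mean and its uncertainty in a bar plot, using the barplot function of seaborn. Note that we need to apply this to the original data. The averaging is done automatically, and the error bars are computed by seaborn (by default a bootstrapped 95% confidence interval rather than the standard error).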

Visualizing distributions

We can also visualize the distribution using a box plot. In the box plot, the box spans the first to the third quartile of the dataset (the middle 50% of the values), while the whiskers extend to show the rest of the distribution, except for points that are determined to be “outliers”. The line inside the box shows the median.

We can also use a violin plot to visualize the distributions.

Seaborn lineplot

Plot the average volume over the different months

Comparing multiple stocks

As a last task, we will use the experience we obtained so far -- and learn some new things -- in order to compare the performance of different stocks, using data obtained from Yahoo Finance.

Next, we will calculate returns over a period of length $T$, defined as:

$$r(t) = \frac{f(t)-f(t-T)}{f(t-T)} $$

The returns can be computed with a simple DataFrame method pct_change(). Note that for the first $T$ timesteps, this value is not defined (of course):
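A small sketch with made-up prices (the stock names S1, S2 and the values are assumptions). pct_change(periods=T) divides the change over $T$ steps by the earlier value $f(t-T)$, which we check against an explicit computation with shift.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices (an assumption), one column per stock.
prices = pd.DataFrame({
    "S1": [100.0, 102.0, 101.0, 105.0, 110.0],
    "S2": [50.0, 49.0, 51.0, 52.0, 50.0],
})

T = 2
returns = prices.pct_change(periods=T)

# The same quantity written out: (f(t) - f(t-T)) / f(t-T).
manual = (prices - prices.shift(T)) / prices.shift(T)

print(returns)  # the first T rows are NaN
```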

Now we'll plot the timeseries of the returns of the different stocks.

Notice that the NaN values are gracefully dropped by the plotting function.

We can also use the seaborn library for doing the scatterplot. Note that this method returns an object which we can use to set different parameters of the plot. In the example below we use it to set the x and y labels of the plot. Read online for more options.

Get all pairwise correlations in a single plot

There appears to be some (fairly strong) correlation between the movement of TSLA and YELP stocks. Let's measure this.

Correlation Coefficients

The correlation coefficient between variables $X$ and $Y$ is defined as follows:

$$\text{Corr}(X,Y) = \frac{E\left[(X-\mu_X)(Y-\mu_Y)\right]}{\sigma_X\sigma_Y}$$

Pandas provides a DataFrame method to compute the correlation coefficient of all pairs of columns: corr().
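The sketch below applies corr() to synthetic returns and checks one entry against the definition. The stock names and the data are assumptions; S1 and S2 share a common component, so they should be correlated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
common = rng.normal(size=250)
# Hypothetical returns (an assumption): S1 and S2 share a common component.
returns = pd.DataFrame({
    "S1": common + 0.5 * rng.normal(size=250),
    "S2": common + 0.5 * rng.normal(size=250),
    "S3": rng.normal(size=250),
})

corr = returns.corr()  # Pearson correlation for every pair of columns

# One entry computed directly from the definition.
x, y = returns["S1"], returns["S2"]
manual = ((x - x.mean()) * (y - y.mean())).mean() / (x.std(ddof=0) * y.std(ddof=0))

print(corr)
print(manual)
```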

It takes a bit of time to examine that table and draw conclusions.

To speed that process up it helps to visualize the table using a heatmap.
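A minimal sketch with the heatmap function of seaborn, again on synthetic returns (the data, the stock names, and the choice of colormap are assumptions):

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(3)
common = rng.normal(size=250)
# Hypothetical returns (an assumption), as in a table of daily returns per stock.
returns = pd.DataFrame({
    "S1": common + 0.5 * rng.normal(size=250),
    "S2": common + 0.5 * rng.normal(size=250),
    "S3": rng.normal(size=250),
})
corr = returns.corr()

# annot=True writes the coefficient into each cell; fixing vmin and vmax
# keeps the color scale symmetric around zero.
ax = sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_title("Pairwise correlation of returns")
```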

Computing p-values

Use the scipy.stats library to obtain the p-values for the Pearson and Spearman rank correlations.
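A sketch using scipy.stats.pearsonr and scipy.stats.spearmanr on two synthetic return series (the data are an assumption). Each function returns the coefficient together with its p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
common = rng.normal(size=250)
# Hypothetical returns of two stocks (an assumption) with a shared component.
x = common + 0.5 * rng.normal(size=250)
y = common + 0.5 * rng.normal(size=250)

pearson_r, pearson_p = stats.pearsonr(x, y)       # linear correlation
spearman_rho, spearman_p = stats.spearmanr(x, y)  # rank correlation

print(pearson_r, pearson_p)
print(spearman_rho, spearman_p)
```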

Matplotlib

Finally, it is important to know that the plotting performed by Pandas is just a layer on top of matplotlib (i.e., the plt package).

So Pandas plots can (and should) be replaced or improved by using additional functions from matplotlib.

For example, suppose we want to know both the returns as well as the standard deviation of the returns of a stock (i.e., its risk).

Here is a visualization of the result of such an analysis, where we construct the plot using only functions from matplotlib.

To understand what these functions are doing (especially the annotate function), you will need to consult the online documentation for matplotlib. Just use Google to find it.
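As a rough sketch of such a plot, the code below places each stock at its mean return and standard deviation and labels the points with annotate. The returns are synthetic and the stock names S1, S2, S3 are placeholders (both are assumptions); the original figure may be styled differently.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
# Hypothetical daily returns (an assumption) with different means and spreads.
returns = pd.DataFrame({
    "S1": rng.normal(loc=0.001, scale=0.010, size=250),
    "S2": rng.normal(loc=0.002, scale=0.025, size=250),
    "S3": rng.normal(loc=0.000, scale=0.015, size=250),
})

expected_return = returns.mean()
risk = returns.std()

fig, ax = plt.subplots()
ax.scatter(expected_return, risk)
for name in returns.columns:
    # xy is the point being labeled; xytext offsets the label by a few points.
    ax.annotate(name, xy=(expected_return[name], risk[name]),
                xytext=(5, 5), textcoords="offset points")
ax.set_xlabel("Expected return")
ax.set_ylabel("Risk (standard deviation of returns)")
```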