
Probability And Statistics Introduction Using R
#1

Probability is the logic of randomness and uncertainty. If you are interested in algorithmic trading or want to become a professional quant, you must master probability theory. I find probability theory fascinating. It started its journey many centuries back when Gerolamo Cardano wrote The Book on Games of Chance. He was trying to calculate the probabilities of dice throws, which had become very important for the professional gamblers of those days. You can read the history of probability theory online; just Google it. First we discuss probability theory and then we discuss statistics.

Let's start with sampling. I hope you are familiar with the R language. R is a very powerful language developed specifically for statistical data analysis. If you are not familiar with R, you can refer to the thread Introduction to R in the Algorithmic Trading with R forum, where I have introduced the basic R commands for those who are new to R.

Sampling and Simulation
The heart of probabilistic analysis is sampling and simulation. Most of the time we would want to draw samples from a distribution. R can help a lot in drawing samples from all sorts of probability distributions.

> sample(10,5)
[1]  8  2 10  6  4
> sample(2:8,12, replace=TRUE)
 [1] 3 6 8 3 8 7 8 5 2 3 4 8

In R we use the sample() command to sample randomly from a set of numbers with equal probability, also known as the discrete uniform distribution. In the first sample command, we told R to draw 5 numbers between 1 and 10 with equal probability and without replacement. In the second sample command, we told R to draw 12 numbers between 2 and 8 with replacement by using replace=TRUE. We can also sample from the letters of the English alphabet:

> sample(letters, 8)

[1] "g" "r" "m" "k" "p" "y" "q" "x"

> sample(1:5,12, replace=TRUE, prob=c(0.2,0.1,0.3,0.2,0.2))

 [1] 3 2 3 1 2 2 5 2 5 1 5 1

In the above sample command, we sampled the numbers 1,2,3,4,5 with unequal probabilities. Many books on probability discuss birthday matching problems. Suppose there are 23 people in a room. What is the probability that at least 2 of them have the same birthday? R has built-in functions for these types of birthday problems:

> pbirthday(23)

[1] 0.5072972

As you can see, a group of 23 people has slightly more than a 50% probability of containing at least 2 people with matching birthdays. If you invite a lot of people to your birthday party, say 43 friends, there is a more than 92% chance that at least two of them will have matching birthdays:

> pbirthday(43)

[1] 0.9239229
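
The pbirthday() numbers above can be cross-checked with sample(). This is only a rough sketch under my own assumptions: 365 equally likely birthdays, no leap years, an arbitrary seed and 20000 simulated rooms. The helper has_match() is a name I made up for this illustration.

```r
set.seed(1)  # arbitrary seed so the run is repeatable

# One simulated room: TRUE if at least two of the n people share a birthday.
# Assumes 365 equally likely birthdays.
has_match <- function(n) {
  birthdays <- sample(365, n, replace = TRUE)
  any(duplicated(birthdays))
}

# Fraction of simulated rooms that contain a shared birthday
sim_23 <- mean(replicate(20000, has_match(23)))
sim_43 <- mean(replicate(20000, has_match(43)))

print(c(simulated = sim_23, pbirthday = pbirthday(23)))
print(c(simulated = sim_43, pbirthday = pbirthday(43)))
```

The simulated fractions should land close to the pbirthday() values and get closer as the number of simulated rooms grows.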

Subscribe My YouTube Channel:
https://www.youtube.com/channel/UCUE7VPo...F_BCoxFXIw

Join Our Million Dollar Trading Challenge:
https://www.doubledoji.com/million-dolla...challenge/
#2

Binomial Distribution
Binomial distribution is widely used in binary classification problems. You should become thoroughly familiar with it. We can model a random variable X having a Bernoulli distribution with the values X=1 and X=0. The probability of X=1 is p and the probability of X=0 is 1-p. X=1 is usually called a success in the experiment and X=0 a failure. The binomial distribution tells you the probability of n successes in N trials. The trials are independent and the probability of success stays the same for each trial. Binomial distributions are used when we have a yes/no or success/failure type of situation. Just keep this in mind: the individual trial is a Bernoulli trial with X=1 or X=0. A binomial random variable is the sum of these individual Bernoulli random variables.
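
Here is a small sketch of the binomial as a sum of Bernoulli trials. The numbers (10 trials, p = 0.3, exactly 3 successes, 50000 simulated experiments) are my own choices for illustration only.

```r
set.seed(2)      # arbitrary seed
n_trials <- 10   # N in the text
p <- 0.3         # probability of success on each trial

# Exact probability of exactly 3 successes in 10 trials
exact <- dbinom(3, size = n_trials, prob = p)

# The same number from the binomial formula
by_formula <- choose(n_trials, 3) * p^3 * (1 - p)^(n_trials - 3)

# Simulation: each row holds N Bernoulli (0/1) trials,
# and the row sum is one binomial draw
bern <- matrix(rbinom(50000 * n_trials, size = 1, prob = p), ncol = n_trials)
successes <- rowSums(bern)
simulated <- mean(successes == 3)

print(c(exact = exact, formula = by_formula, simulated = simulated))
```

All three numbers should agree closely, the first two exactly and the third up to simulation noise.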

Hypergeometric Distribution
The hypergeometric distribution is a bit different. It is usually explained with this example. Suppose we have a total of N balls, of which n are black and N-n are white, and we draw balls without replacement. What is the probability of getting a black ball on a given draw? This is important for you to understand: we are sampling without replacement. The probability of drawing a black ball on any particular draw, before we know the results of the other draws, is n/N, and it is the same for every draw. This confused me at first. How can the probability stay the same when each draw reduces the number of balls left by 1? The answer is that n/N is the unconditional probability of each draw; the conditional probability, given what has already been drawn, does change. You need to ponder over this simple fact. Most of the books that provide the proof for the hypergeometric distribution simply say that the probability on each draw is the same n/N.

Let's do a simple example. Suppose we have 3 red balls and 7 yellow balls, so 10 balls in total. We draw 5 balls from these 10 without replacement. What is the probability of drawing a red ball on a given draw? For the first draw things are simple:

P(red ball)=3/10

But for the second draw things are a bit different:

P(red ball on 2nd draw)=P(red ball on 2nd draw | red ball on 1st draw)P(red ball on 1st draw)+
P(red ball on 2nd draw | yellow ball on 1st draw)P(yellow ball on 1st draw)
P(red ball on 2nd draw)=(2/9)(3/10)+(3/9)(7/10)=(9/9)(3/10)=3/10

We used the Law of Total Probability, and the probability of a red ball on the second draw has indeed not changed from 3/10. Let's consider the third draw now:

P(red ball on 3rd draw)=P(red ball on 3rd draw | red ball on 1st draw, red ball on 2nd draw)P(red ball on 1st draw & red ball on 2nd draw)+
P(red ball on 3rd draw | yellow ball on 1st draw, red ball on 2nd draw)P(yellow ball on 1st draw & red ball on 2nd draw)+
P(red ball on 3rd draw | red ball on 1st draw, yellow ball on 2nd draw)P(red ball on 1st draw & yellow ball on 2nd draw)+
P(red ball on 3rd draw | yellow ball on 1st draw, yellow ball on 2nd draw)P(yellow ball on 1st draw & yellow ball on 2nd draw)
P(red ball on 3rd draw)=(1/8)(3/10)(2/9)+(2/8)(7/10)(3/9)+(2/8)(3/10)(7/9)+(3/8)(7/10)(6/9)=(6+42+42+126)/720=216/720=3/10

So things become complicated pretty fast, but the probability of drawing a red ball again comes out to 3/10. Understanding these concepts is important if you want to master probability.
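
If the algebra above feels slippery, a simulation can back it up. This is a minimal sketch with the same urn (3 red, 7 yellow, 5 draws without replacement); the seed and the 40000 repetitions are arbitrary choices of mine. I also compare the count of red balls with dhyper(), which is R's hypergeometric density.

```r
set.seed(3)  # arbitrary seed
urn <- c(rep("red", 3), rep("yellow", 7))

# Each column is one experiment: 5 draws without replacement
draws <- replicate(40000, sample(urn, 5))

# Fraction of experiments in which draw number 1, 2, ..., 5 was red
p_red_by_draw <- rowMeans(draws == "red")
print(round(p_red_by_draw, 3))

# The number of red balls among the 5 draws is hypergeometric.
# In dhyper(), m = red balls, n = yellow balls, k = balls drawn.
reds <- colSums(draws == "red")
sim_two_red <- mean(reds == 2)
exact_two_red <- dhyper(2, m = 3, n = 7, k = 5)
print(c(simulated = sim_two_red, exact = exact_two_red))
```

Every entry of p_red_by_draw should sit near 3/10, which is the point of the calculation above: the position of the draw does not matter as long as we do not condition on the earlier draws.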

#3

Monty Hall Problem
In most textbooks, you will come across the Monty Hall Problem. This is an interesting problem in probability that generated a lot of discussion and debate. Even some professional mathematicians got it wrong and defended the wrong answer. Monty Hall in his show used to present three doors 1, 2 and 3 to the selected participant. There is a sports car behind one of the three doors and a goat behind each of the 2 remaining doors. Suppose you are the participant selected by Monty Hall. You choose door 1, behind which you think there is a sports car. Monty Hall knows what is behind each door. Now he opens one of the two remaining doors, always one that has a goat behind it. Then he offers you a chance to switch your door. Should you switch? Will it help you win the sports car?

When you choose one of the three doors, the probability of the sports car being behind any one door is 1/3. The situation changes when Monty Hall opens one of the remaining two doors. Monty will always open a door that has a goat. After you choose your door, the remaining 2 doors together have probability 1/3+1/3=2/3 of hiding the sports car. When Monty opens the goat door among the remaining 2 doors, that whole 2/3 shifts to the other unopened door. So you should switch: the other door now has probability 2/3 of hiding the sports car, up from 1/3, while the door that you had chosen still has probability 1/3.

Every author has written pages explaining the Monty Hall problem. I think the explanation is simple. One door has probability 1/3 and the two other doors together have probability 2/3 of hiding the sports car. When Monty eliminates one of those two doors, the 2/3 shifts to the door that has not been opened. So you should switch your choice to the other door. This is a prime example of how probability changes as new information arrives. I hope this explains the Monty Hall problem in an easy manner. We can use R to simulate the Monty Hall problem.
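
One way such a simulation could look is sketched below. It assumes the rules as described above: the car is placed at random, Monty always opens a goat door that is not your pick, and he picks at random when he has two goat doors to choose from. The function play_once(), the seed and the 20000 games are my own choices for illustration.

```r
set.seed(4)  # arbitrary seed

# Play one game and report whether staying and whether switching wins.
play_once <- function() {
  doors <- 1:3
  car <- sample(doors, 1)    # door hiding the sports car
  pick <- sample(doors, 1)   # contestant's first choice
  # Monty may only open a goat door that is not the contestant's pick
  goat_doors <- setdiff(doors, c(car, pick))
  # sample() on a single number would sample from 1:that number, so guard it
  opened <- if (length(goat_doors) == 1) goat_doors else sample(goat_doors, 1)
  switched <- setdiff(doors, c(pick, opened))
  c(stay = pick == car, switching = switched == car)
}

results <- replicate(20000, play_once())
win_rates <- rowMeans(results)
print(win_rates)
```

The win rate for staying should come out near 1/3 and the win rate for switching near 2/3.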

#4

Normal Distribution
Normal distribution is pretty ubiquitous in statistics. You must have seen the bell shaped curve in most statistics books. The bell shaped curve is the hallmark of a normal distribution. A normal distribution is characterized by two parameters. The first is the mean and the second is the standard deviation. Some books use the variance instead. But R uses mean and standard deviation, so we stick with what R uses.

The important question that comes to mind is why the normal distribution is so ubiquitous. If you take a course in probability you will come across the Central Limit Theorem. It is considered to be one of the most fundamental theorems in probability and statistics. The normal distribution arises when we add many random variables that are I.I.D (independent and identically distributed) and have finite variance. So when we add many independent random variables that have the same distribution, the Central Limit Theorem tells us that the sum will be approximately normal when the number of random variables is very large.

For example, suppose we draw a sample of N values from a distribution and take their mean. If we repeat this many times, the sample means behave like a normal random variable when N is large, even if the original distribution is not normal. This is the basis of sampling and simulation. So if you want to master the art of simulation, you should understand the derivation of the Central Limit Theorem.
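
A quick experiment shows the Central Limit Theorem at work. This is a sketch under assumptions I picked myself: the underlying distribution is exponential with rate 1 (mean 1, standard deviation 1, strongly skewed), each sample has 50 values, and we repeat 20000 times.

```r
set.seed(5)   # arbitrary seed
n <- 50       # values per sample
reps <- 20000 # number of samples

# Mean of each sample drawn from a skewed, non-normal distribution
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

# The CLT says the means are roughly normal with mean 1 and sd 1/sqrt(n).
# Standardize them so they can be compared with a standard normal.
z <- (sample_means - 1) / (1 / sqrt(n))

# Share of standardized means within 1, 2 and 3 standard deviations
coverage <- sapply(1:3, function(k) mean(abs(z) <= k))
normal <- pnorm(1:3) - pnorm(-(1:3))
print(round(rbind(simulated = coverage, normal = normal), 4))
```

The two rows should be close. The match is not perfect at n = 50 because the exponential distribution is skewed; it improves as n grows. hist(z) gives a visual check of the bell shape.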

When an effect is caused by the addition of many small contributions, we get a normal distribution in practice. For example, in humans, height is the effect of many factors that each make a small contribution to the height of a man or a woman. So you will find height to be approximately distributed as a normal random variable. In finance, however, the normal distribution will give erroneous results most of the time, as financial returns usually have a distribution that is heavy tailed. Why? Let's calculate the probability of a normal random variable being within 1, 2 and 3 standard deviations of the mean.

> pnorm(c(1,2,3)) - pnorm(c(-1,-2,-3))

[1] 0.6826895 0.9544997 0.9973002

We calculated the probability that a normal random variable lies within 1, 2 and 3 standard deviations of its mean. It shows that about 99.7% of the probability lies within 3 standard deviations. So it is very unlikely for a normal random variable to move further than 3 standard deviations from the mean. But financial markets can easily move 10-20 standard deviations away from the mean, especially during market crashes and flash crashes. So we cannot rely on a normal distribution when it comes to building financial models. Another problem with a normal random variable is that it can take negative values, while a price can never be negative.

Log Normal Distribution
A log normal random variable arises when we take the exponential of a normal random variable. In other words, if we take the logarithm of a log normal random variable we get a normal random variable. A log normal random variable is never negative and is skewed to the right, which makes it attractive for financial models. For this reason it is used a lot in financial modelling. Most of the time we use the log normal random variable to model price. Price can never be negative. The lowest it can go is zero. The log normal random variable respects that. Even so, the log normal has been found not to be a good model for price.
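
The link between the normal and the log normal can be checked in a few lines. This is a sketch with parameters I chose for illustration (meanlog = 0, sdlog = 0.5, 10000 draws); qlnorm() and qnorm() are R's quantile functions for the two distributions.

```r
set.seed(6)  # arbitrary seed

# Exponentiating normal draws gives log normal draws
z <- rnorm(10000, mean = 0, sd = 0.5)
x <- exp(z)

# Never negative: even the smallest draw is above zero
print(min(x))

# Quantiles agree: a log normal quantile is exp() of the normal quantile
p <- c(0.1, 0.5, 0.9)
print(rbind(qlnorm = qlnorm(p, meanlog = 0, sdlog = 0.5),
            exp_qnorm = exp(qnorm(p, mean = 0, sd = 0.5))))

# Right skew: the mean is pulled above the median by the long right tail
print(c(mean = mean(x), median = median(x)))
```

Taking log(x) gets back the normal draws z, which is the "other words" statement in the paragraph above.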

Student t Distribution
When it comes to heavy tails, the Student t random variable is often used. When the degrees of freedom of the Student t distribution are small, say anywhere from 1 up to 10 or 20, it has heavier tails than the normal distribution, and the smaller the degrees of freedom the heavier the tails. As the degrees of freedom become large, the Student t distribution approaches the normal distribution. This is useful when we are trying to model outliers in a financial model. When the degrees of freedom n=1, we have the Cauchy distribution. The Cauchy distribution has interesting properties. It has an undefined mean and an undefined variance.
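
To put numbers on "heavy tails", here is a small sketch comparing two-sided tail probabilities. The cutoffs 3, 5 and 10 and the choice of 4 degrees of freedom are mine. Note the distances are measured in units of each distribution's scale parameter, which is not the same as a standard deviation for the Student t (and the Cauchy has no standard deviation at all).

```r
k <- c(3, 5, 10)  # cutoffs, chosen for illustration

# P(|X| > k) for each distribution, all centered at 0 with scale 1
tails <- rbind(
  normal = 2 * pnorm(-k),
  t_df4  = 2 * pt(-k, df = 4),
  cauchy = 2 * pt(-k, df = 1)   # Student t with 1 degree of freedom
)
colnames(tails) <- paste0("|x|>", k)
print(signif(tails, 3))

# As the degrees of freedom grow, the t tail approaches the normal tail
print(rbind(df = c(1, 5, 30, 1000),
            t_tail = pt(-3, df = c(1, 5, 30, 1000))))
print(pnorm(-3))
```

The normal row collapses toward zero very quickly, while the t and Cauchy rows still leave visible probability far from the center.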

#5

Q-Q Plots
Let's compare the Student t Distribution with the Cauchy Distribution. The Cauchy distribution is a special case of the Student t distribution with degrees of freedom n=1. A Q-Q plot is the method that we use to compare the tails of two distributions. In a Q-Q plot we plot the quantiles of one distribution against the quantiles of the second distribution. Quantile is an important concept. Consider the 10th percentile, also called the 0.1 quantile. It is the value of the random variable at which the cumulative distribution function equals 0.1. Similarly the 90th percentile is the value of the random variable at which the cumulative distribution function equals 0.9.
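
In R the quantile function is the inverse of the cumulative distribution function, which a short sketch can confirm. I use the standard normal here purely as an example; the seed and the sample size of 100000 are arbitrary.

```r
p <- c(0.1, 0.9)

# Theoretical quantiles of the standard normal
q <- qnorm(p)
print(q)

# Plugging the quantiles back into the CDF recovers the probabilities
print(pnorm(q))

# Empirical quantiles from simulated data approximate the theoretical ones
set.seed(7)  # arbitrary seed
x <- rnorm(100000)
print(quantile(x, probs = p))
```

So q-functions (qnorm, qt, qcauchy) answer "which value has this much probability to its left", while p-functions (pnorm, pt, pcauchy) answer the reverse question.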

First let's compare a normal distribution with a Cauchy distribution. As said above, a Cauchy distribution has an undefined mean and an undefined variance. On the other hand, the mean and variance of a normal distribution are always well defined. Let's see how these two distributions differ:

> qnorm(c(0.8,0.85,0.9,0.95,0.99), mean=0, sd=1)
[1] 0.8416212 1.0364334 1.2815516 1.6448536 2.3263479
> qcauchy(c(0.8,0.85,0.9,0.95,0.99), location=0, scale=1)
[1]  1.376382  1.962611  3.077684  6.313752 31.820516

Above we calculated the 80th percentile first. The standard normal random variable has 80% of its cumulative probability to the left of 0.84. For the Cauchy random variable to have 80% of the cumulative probability on the left, its value has to be 1.38. For 99% of the cumulative probability, we find the standard normal value to be 2.33 while the Cauchy value is 31.8, which is a long way from the location 0. So you can see the Cauchy has heavy tails, a lot heavier than the normal distribution.

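> # Probability grid; stops short of 0 and 1, where the quantiles are infinite
> p <- seq(from=0.001, to=0.999, length.out=1200)
> # Student t with 1 degree of freedom against Cauchy
> plot(qt(p,1), qcauchy(p, location=0, scale=1), type="l",
+ xlab="Student t Distribution Quantiles",
+ ylab="Cauchy Distribution Quantiles")
> # Student t with 2 degrees of freedom against Cauchy
> plot(qt(p,2), qcauchy(p, location=0, scale=1), type="l",
+ xlab="Student t Distribution Quantiles",
+ ylab="Cauchy Distribution Quantiles")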

Let's return to the Student t distribution and the Cauchy distribution once again. As said above, the Student t distribution with one degree of freedom is the Cauchy distribution, so the first Q-Q plot shows a nice straight diagonal line.
[Image: qq1.png]
As you can see above, we have a straight diagonal line for the Student t distribution with one degree of freedom against the Cauchy distribution, meaning the two are the same. Now let's check the Student t distribution with two degrees of freedom against the Cauchy distribution.
[Image: qq2.png]
Voilà! The Cauchy distribution has much heavier tails than the Student t distribution with two degrees of freedom, as shown by the above Q-Q plot.
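
If the images do not load, the same conclusions can be read off the numbers. This is a minimal sketch; the probability grid and the three upper-tail probabilities are my own choices.

```r
# Student t with 1 degree of freedom has the same quantiles as the Cauchy
p <- seq(0.01, 0.99, by = 0.01)
print(all.equal(qt(p, df = 1), qcauchy(p)))

# Upper-tail quantiles: Cauchy vs Student t (2 df) vs standard normal
upper <- c(0.9, 0.95, 0.99)
tail_q <- rbind(
  cauchy = qcauchy(upper),
  t_df2  = qt(upper, df = 2),
  normal = qnorm(upper)
)
colnames(tail_q) <- paste0("p=", upper)
print(round(tail_q, 3))
```

In each column the Cauchy quantile is the largest and the normal quantile the smallest, which is what the bend away from the diagonal in the second Q-Q plot is telling us.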
