
R Introduction
#1

R is a programming language for data science and statistical analysis. It was developed in academia, and today it is very popular in the finance sector, where many quants use it to analyze financial market data. Excel used to be the popular choice, but it has limitations, and R is far more powerful. If you are a trader, you should learn R. I am starting this thread to introduce R to those who don't know it. I will try to write a few posts every week and show you how to learn R. Once you have learned R, I will show you how to use it to build algorithmic trading models. So stay tuned by subscribing to this thread if you want to learn R.

Subscribe to My YouTube Channel:
https://www.youtube.com/channel/UCUE7VPo...F_BCoxFXIw

Join Our Million Dollar Trading Challenge:
https://www.doubledoji.com/million-dolla...challenge/
#2

R is now a standard language for data science and statistical analysis. R is open source software that you can download from the CRAN website. It doesn't matter whether you are using Windows, Mac OS X or Linux; R works on all these operating systems. R was developed in 1993 by two professors in New Zealand, Ross Ihaka and Robert Gentleman, so that they could teach data analysis courses. Both of their first names start with R, so they chose to call their new language R. R is based on the S language that was developed at Bell Labs. R is a GNU project. Since R was easy to download and use, its popularity increased, and over the last two decades it has been adopted by many universities, companies and even government agencies for statistical analysis.

So if you are using Excel for your financial analysis, it is time to start learning R. Just download R from CRAN (the Comprehensive R Archive Network), install it on your computer and start using it. R's design is old: it goes back to the S language, which dates from the 1970s. The R interpreter is single-threaded, which means that by default it does not split a job into several threads and run them concurrently. You can still do parallel computing with R through packages such as parallel. R is currently developed by the R Core Team.

Microsoft liked R so much that it developed its own distribution, called Microsoft R Open. It is also an open source project, and you can download Microsoft R Open from their website. I personally use Microsoft R Open. It uses exactly the same language and packages as standard R, and it ships with multi-threaded math libraries. I recommend you download Microsoft R Open. It works on Windows, OS X, Ubuntu and other operating systems.

When you develop programs, you need an Integrated Development Environment (IDE). In an IDE you can write code clearly and debug errors easily. RStudio is a wonderful IDE that you will love just as I do. Download RStudio; it is also open source. First install Microsoft R Open. After that, install RStudio, and it will automatically detect the R installation. I think this is enough for today.

#3

I hope you have installed Microsoft R Open and RStudio by now. If you haven't, please do so before you proceed. There is practically no difference between R and Microsoft R Open in how you write code. Then why choose R Open? Standard R is single-threaded, while R Open comes with multithreaded math libraries, which can make it much faster than R for heavy numerical work. More on this later when you become advanced in R programming. For now, just remember that this is the reason I use R Open. RStudio is an IDE (Integrated Development Environment) that works with both R and R Open. It is a great IDE, and we will use it for data science and for developing algorithmic trading strategies.

Let's now start learning R. Before we start, note that the > sign you see in the R console is the command prompt. It tells you that the interpreter is ready to accept an R command, so don't get confused by it. Now keep this in mind: almost everything in R is a vector. Look at the very simple example below:

> 1+2

[1] 3

The first thing you see is the > sign. As I said, this is just the command prompt. I want to add 1 to 2, so I tell R to do it for me. The result is 3, which is correct. But did you notice the [1]? This [1] is just the index of the first item displayed on that line of output. Here the result is a vector with a single element, 3.
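
To check for yourself that a single number is a vector of length one, and to see where the bracketed index comes from, here is a small sketch you can paste into the console. The variable names result and long_vec are only for illustration.

```r
# A single number is a vector of length one
result <- 1 + 2
is.vector(result)   # TRUE
length(result)      # 1

# A longer vector: when the printed output wraps onto several lines,
# each line starts with the index of its first element
long_vec <- 101:130
length(long_vec)    # 30
print(long_vec)
```

How many values fit on one line depends on the width of your console, so the indices you see at the start of each line may differ from someone else's.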

R is a full-fledged programming language with object-oriented features. Just as in other programming languages such as Python, Java, C and C++, you can create variables in R and assign them values. This is one of the most basic operations that we do repeatedly in programming. Variables are just boxes that hold values. You can use the = sign to assign a value to a variable, but the standard R assignment operator is <- and I always use it when assigning values to variables.

> x <- 1

> 1 -> x

What is this 1 -> x? It is another way to assign a value to a variable, but we don't use it much. It was just meant to show you that you can do it like this as well. The standard assignment operator is <- and that is the one we mostly use.
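
As a quick sketch, the three assignment forms mentioned above can be tried side by side. The variable names x, y, z and total are just placeholders.

```r
x <- 1      # standard assignment operator
y = 2       # the = sign also works for assignment
3 -> z      # rightward assignment, rarely used

total <- x + y + z
total       # 6
```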

#4

R Data Types
R has three basic data types: numbers, text (also known as character), and the Boolean values TRUE and FALSE. Numbers in R behave the way you would expect from a calculator, unlike in some other programming languages. For example:

> 2/3

[1] 0.6666667

If you do this division in some other languages, you might get 0 as the output, because they treat 2 and 3 as integers. In R, plain numbers are stored as doubles by default, so you mostly don't have to worry about mixing different number types such as integers and doubles.

> x <- 2
> y <- 3
> z <- 4
> (x*y+z)/5
[1] 2
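
If you are curious about the number types behind this, the following sketch shows how R classifies numbers and what the division operators return. It uses only base R; the L suffix is R's way of writing an integer literal.

```r
class(2)        # "numeric" (stored as a double)
class(2L)       # "integer" (the L suffix asks for an integer)

2L / 3L         # 0.6666667, the / operator returns a double even for integers
7 %/% 2         # 3, integer division
7 %% 2          # 1, the remainder
```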

You will always find this [1] with the result. I have already explained why we see it, so there is no need to get confused. If the result is too big for R to represent, it will show Inf. Similarly, dividing a positive number by zero will also result in Inf.

> 3/0

[1] Inf

Missing values are shown as NA (not available). If the result of a computation does not make sense to R, it will show NaN, which means not a number. Then we have the NULL object, which stands for an empty object.

> 0/0

[1] NaN
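
The special values above can be explored with a few base R functions. This is only a sketch; the prices vector holds made-up numbers with one missing entry.

```r
3 / 0                 # Inf
-3 / 0                # -Inf
0 / 0                 # NaN
is.nan(0 / 0)         # TRUE

prices <- c(1.5530, NA, 1.5500)   # made-up values with one missing entry
is.na(prices)                     # FALSE TRUE FALSE
sum(prices)                       # NA, the missing value propagates
sum(prices, na.rm = TRUE)         # 3.103, NA is dropped before summing

is.null(NULL)         # TRUE
length(NULL)          # 0, NULL is an empty object
```

Many summary functions such as sum(), mean(), max() and min() accept na.rm = TRUE, which is handy when market data has gaps.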

Most computations involving NA or NaN will also produce NA or NaN. We create character values (text) by enclosing them in double quotes "".

> x <- "Hello Dear"
> class(x)
[1] "character"
> str(x)
 chr "Hello Dear"

If you want to combine text with a number, you will need to use the paste() function.

> y <- 10
> paste(y, "is a number")
[1] "10 is a number"
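
A few more variations of paste() are sketched below. The sep and collapse arguments and the paste0() function are part of base R; the strings themselves are just examples.

```r
y <- 10
paste(y, "is a number")                     # "10 is a number"
paste("GBP", "USD", sep = "/")              # "GBP/USD"
paste0("Close_", 1:2)                       # "Close_1" "Close_2"
paste(c("Open", "High"), collapse = ", ")   # "Open, High"
```

By default paste() puts a space between the pieces, while paste0() joins them with nothing in between.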

#5

In R we can create sequences of numbers with the seq() and rep() commands:

> seq(1,10,2)

[1] 1 3 5 7 9

In the above seq() command, 1 is the start number, 10 is the upper limit and 2 is the step between consecutive numbers. In this case the sequence started at 1 and ended at 9, because the next value, 11, would exceed 10.

> rep(2,10)

 [1] 2 2 2 2 2 2 2 2 2 2

In the above rep() command, we told R to repeat the number 2 ten times. Both these commands are used very frequently.
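
Both commands take named arguments that make the intent clearer. The sketch below shows a few common variations; the variable names are only for illustration.

```r
odd   <- seq(from = 1, to = 10, by = 2)     # 1 3 5 7 9
grid  <- seq(0, 1, length.out = 5)          # 0.00 0.25 0.50 0.75 1.00
twos  <- rep(2, times = 10)                 # ten copies of 2
cycle <- rep(c(1, 2), times = 3)            # 1 2 1 2 1 2
pairs <- rep(c(1, 2), each = 2)             # 1 1 2 2
```

With times the whole vector is repeated, while with each every element is repeated before moving on to the next one.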

Basic R Data Structures
R has a number of data structures, such as vectors, matrices, arrays, lists and dataframes. These data structures can contain numeric values, logical values and characters, also known as strings. A matrix is a two-dimensional array. We can create a matrix with the following command.

> x <- matrix(nrow=5, ncol=5, data=NA, byrow=TRUE)
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]   NA   NA   NA   NA   NA
[2,]   NA   NA   NA   NA   NA
[3,]   NA   NA   NA   NA   NA
[4,]   NA   NA   NA   NA   NA
[5,]   NA   NA   NA   NA   NA

Here we have filled the matrix with NA values, passing byrow=TRUE to the matrix command. By default, matrix() fills the values by column, so if you want to fill the values by row you need byrow=TRUE. With a single repeated value like NA the result looks the same either way, but it matters when the values differ.
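
To see what byrow actually changes, here is a small sketch that fills two matrices with the numbers 1 to 6. The names m_col and m_row are just for illustration.

```r
# Default: values go down the first column, then the second, and so on
m_col <- matrix(1:6, nrow = 2, ncol = 3)
m_col[1, ]   # 1 3 5

# byrow = TRUE: values go across the first row, then the second
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
m_row[1, ]   # 1 2 3

dim(m_row)   # 2 3 (rows, columns)
```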

R Dataframe

The limitation of a matrix is simple. All of its values must be of the same type, so you cannot mix numbers with strings or logicals. If we want columns of different types, we need a dataframe, which is just like a matrix but can hold a different type of value in each column. The dataframe is the most versatile R data structure and it is used extensively for reading data. Most of the time we will be dealing with csv files that we download from the MT4 History Center. We will use the following command to read the csv into R as a dataframe:

> # Import the csv file
> data1 <- read.csv("D:/Shared/MarketData/GBPUSD10080.csv",
+                   header=FALSE)
> colnames(data1) <- c("Date", "Time", "Open", "High", "Low",
+                      "Close", "Volume")
> tail(data1)
           Date  Time    Open    High     Low   Close Volume
1259 2018.07.29 00:00 1.31100 1.31717 1.29744 1.30018 188773
1260 2018.08.05 00:00 1.29995 1.30041 1.27221 1.27570 196647
1261 2018.08.12 00:00 1.27579 1.28257 1.26607 1.27454 230963
1262 2018.08.19 00:00 1.27444 1.29352 1.27286 1.28591 198856
1263 2018.08.26 00:00 1.28561 1.30422 1.28278 1.29213 212624
1264 2018.09.02 00:00 1.29207 1.29329 1.27847 1.28170  97167

> head(data1)
        Date  Time   Open   High    Low  Close Volume
1 1994.06.19 00:00 1.5342 1.5569 1.5270 1.5530   1206
2 1994.06.26 00:00 1.5537 1.5595 1.5296 1.5380   1284
3 1994.07.03 00:00 1.5348 1.5533 1.5322 1.5500   1020
4 1994.07.10 00:00 1.5550 1.5770 1.5530 1.5620   1045
5 1994.07.17 00:00 1.5616 1.5665 1.5200 1.5275   1346
6 1994.07.24 00:00 1.5307 1.5453 1.5205 1.5440   1056
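
If you don't have the csv file at hand, you can practice with the sketch below. It feeds the first two rows shown above to read.csv() through its text argument instead of a file path. The name bars is just a placeholder, and stringsAsFactors = FALSE is set so that the text columns stay as character in older R versions too.

```r
# Two rows in the same layout as the csv above (no header line)
csv_text <- "1994.06.19,00:00,1.5342,1.5569,1.5270,1.5530,1206
1994.06.26,00:00,1.5537,1.5595,1.5296,1.5380,1284"

# col.names sets the column names while reading
bars <- read.csv(text = csv_text, header = FALSE,
                 col.names = c("Date", "Time", "Open", "High", "Low",
                               "Close", "Volume"),
                 stringsAsFactors = FALSE)

str(bars)             # one line per column with its type
sapply(bars, class)   # Date and Time are character, the prices are numeric
nrow(bars)            # 2
```

This also shows the point about dataframes: the Date and Time columns hold text while the price columns hold numbers, all in one object.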

Now that we have read the OHLCV (Open, High, Low, Close and Volume) csv file into R as a dataframe, we can easily manipulate it. For example, we can use the seq() command that I introduced above to take every 30th row of the dataframe:

> x <- nrow(data1) - 1
> n <- 30
> data2 <- data1[seq(from = x - floor(x / n) * n, to = x, by = n), ]
> tail(data2)
           Date  Time    Open    High     Low   Close Volume
1113 2015.10.11 00:00 1.53185 1.55069 1.51986 1.54327 486747
1143 2016.05.08 00:00 1.44232 1.45286 1.43385 1.43490 235787
1173 2016.12.04 00:00 1.26674 1.27729 1.25475 1.25922 296905
1203 2017.07.02 00:00 1.30140 1.30205 1.28650 1.28889 225683
1233 2018.01.28 00:00 1.41496 1.42772 1.39787 1.41094 277901
1263 2018.08.26 00:00 1.28561 1.30422 1.28278 1.29213 212624

As you can see above, we have sliced the data with an interval of 30 between the rows. You can check this with the first column of the output, which shows the row names (the original row numbers). If you want to add rows to the dataframe, use the following command:

> # add three empty rows: assigning to row nrow+3 also creates the rows in between
> data2[nrow(data2) + 3, ] <- NA
> tail(data2)
           Date  Time    Open    High     Low   Close Volume
1203 2017.07.02 00:00 1.30140 1.30205 1.28650 1.28889 225683
1233 2018.01.28 00:00 1.41496 1.42772 1.39787 1.41094 277901
1263 2018.08.26 00:00 1.28561 1.30422 1.28278 1.29213 212624
44         <NA>  <NA>      NA      NA      NA      NA     NA
45         <NA>  <NA>      NA      NA      NA      NA     NA
46         <NA>  <NA>      NA      NA      NA      NA     NA

So you can see it is very easy to deal with a dataframe in R. You should practice the above commands in your R console and become familiar with them.
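
Here is the same slicing and row-adding idea on a small made-up dataframe, so you can follow every step without the csv file. The names df, idx and sub and the numbers in them are invented for this sketch.

```r
# A toy dataframe with 12 rows
df <- data.frame(id = 1:12, value = (1:12) * 1.5)

x <- nrow(df) - 1   # 11, same pattern as above
n <- 4              # keep every 4th row, counting back from row x

# The start is chosen so that the sequence lands exactly on row x
idx <- seq(from = x - floor(x / n) * n, to = x, by = n)
idx                 # 3 7 11

sub <- df[idx, ]
nrow(sub)           # 3

# Assigning to row nrow + 3 appends three rows filled with NA
sub[nrow(sub) + 3, ] <- NA
nrow(sub)           # 6
```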

#6

The dataframe is a powerful data structure. Sometimes we want to shift (lag) a column a few steps up or down. We can use the following function to do that:

> # Import the csv file
> data1 <- read.csv("D:/Shared/MarketData/GBPUSD10080.csv",
+                   header=FALSE)
> colnames(data1) <- c("Date", "Time", "Open", "High", "Low",
+                      "Close", "Volume")
>
> lagk <- function(x, k) {
+   if (k > 0) {
+     # Shift down: pad k NAs at the front and drop the last k values
+     return(c(rep(NA, k), x)[1:length(x)])
+   } else {
+     # Shift up: drop the first -k values and pad -k NAs at the end
+     return(c(x[(-k + 1):length(x)], rep(NA, -k)))
+   }
+ }
>
> data1$Close_1 <- lagk(data1$Close, k = -1)
> data1$Close_2 <- lagk(data1$Close, k = 1)
> head(data1)
        Date  Time   Open   High    Low  Close Volume Close_1 Close_2
1 1994.06.19 00:00 1.5342 1.5569 1.5270 1.5530   1206  1.5380      NA
2 1994.06.26 00:00 1.5537 1.5595 1.5296 1.5380   1284  1.5500  1.5530
3 1994.07.03 00:00 1.5348 1.5533 1.5322 1.5500   1020  1.5620  1.5380
4 1994.07.10 00:00 1.5550 1.5770 1.5530 1.5620   1045  1.5275  1.5500
5 1994.07.17 00:00 1.5616 1.5665 1.5200 1.5275   1346  1.5440  1.5620
6 1994.07.24 00:00 1.5307 1.5453 1.5205 1.5440   1056  1.5420  1.5275
> tail(data1)
           Date  Time    Open    High     Low   Close Volume Close_1 Close_2
1259 2018.07.29 00:00 1.31100 1.31717 1.29744 1.30018 188773 1.27570 1.31095
1260 2018.08.05 00:00 1.29995 1.30041 1.27221 1.27570 196647 1.27454 1.30018
1261 2018.08.12 00:00 1.27579 1.28257 1.26607 1.27454 230963 1.28591 1.27570
1262 2018.08.19 00:00 1.27444 1.29352 1.27286 1.28591 198856 1.29213 1.27454
1263 2018.08.26 00:00 1.28561 1.30422 1.28278 1.29213 212624 1.28170 1.28591
1264 2018.09.02 00:00 1.29207 1.29329 1.27847 1.28170  97167      NA 1.29213

As you can see above, after we have read the csv file and given names to its columns, we can refer to each dataframe column by name. In data1$Close, data1 tells R which dataframe we want, and the $ sign followed by Close selects the Close column. We defined a function called lagk. It shifts a column k steps down (positive k) or up (negative k), filling the gaps with NA, as shown above. But I have jumped ahead. Let's first discuss vectors in R.
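
To see the shifting on a small scale, here is a sketch of a similar helper written with head() and tail(). It is not the lagk function above but a hypothetical variant, called shift_vec here, that also behaves sensibly when k is larger than the length of the vector. The input values are made up.

```r
# Hypothetical helper: shift x by k positions, filling the gaps with NA.
# Positive k shifts values down (a lag), negative k shifts them up (a lead).
shift_vec <- function(x, k) {
  n <- length(x)
  if (k >= 0) {
    c(rep(NA, min(k, n)), head(x, max(n - k, 0)))
  } else {
    c(tail(x, max(n + k, 0)), rep(NA, min(-k, n)))
  }
}

x <- c(10, 20, 30, 40, 50)
shift_vec(x, 1)    # NA 10 20 30 40
shift_vec(x, -2)   # 30 40 50 NA NA
shift_vec(x, -7)   # NA NA NA NA NA
```

The result always has the same length as the input, which is what lets you store it as a new column in the same dataframe.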

Vectors in R

Before we continue, I should emphasize that vectors are the most basic R data structure. Almost every basic value in R is a vector. We can easily create vectors in R and do many operations on them:

> v <- c(2, 9, 3, 1, 5, 20, 10)
> length(v)
[1] 7
> sum(v)
[1] 50
> max(v)
[1] 20
> min(v)
[1] 1

First we defined a vector. The length() function gives the length of this vector, which has 7 elements. We can easily sum the elements of the vector as well as find the maximum and the minimum. Here are some more operations with vectors:

> v[c(1,3,5)]
[1] 2 3 5
> v[c(-2,-5)]
[1]  2  3  1 20 10
> v[-(2:4)]
[1]  2  5 20 10
> v^2
[1]   4  81   9   1  25 400 100

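These were some more easy vector operations. Positive indices select elements, negative indices drop them, and ^ is applied to every element of the vector.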
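
A few more operations on the same vector are sketched below. They use only base R functions, and the variable names big and doubled are just for illustration.

```r
v <- c(2, 9, 3, 1, 5, 20, 10)

v > 5               # FALSE TRUE FALSE FALSE FALSE TRUE TRUE
big <- v[v > 5]     # keep only the elements greater than 5
big                 # 9 20 10

doubled <- v * 2    # arithmetic is applied element by element
doubled             # 4 18 6 2 10 40 20

which.max(v)        # 6, the position of the largest value
sort(v)             # 1 2 3 5 9 10 20
```

Selecting with a condition like v[v > 5] is the same pattern you will later use to filter rows of a dataframe.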
