Wednesday, June 29, 2016

time series - Is R being replaced by Python at quant desks?


I know the title sounds a little extreme, but I wonder whether R is being phased out by a lot of quant desks at sell-side banks as well as hedge funds in favor of Python. I get the impression that, with improvements in pandas, NumPy, and other Python packages, functionality in Python is improving drastically for meaningfully mining data and modeling time series. I have also seen quite impressive implementations in Python that parallelize code and fan out computations to several servers/machines. I know some packages in R are capable of that too, but I just sense that the current momentum favors Python.


I need to make a decision regarding the architecture of a subset of my modeling framework myself and need some input on what the current sentiment among other quants is.


I also have to admit that my initial reservations regarding Python's performance are mostly outdated, because some of the packages make heavy use of C implementations under the hood, and I have seen implementations that clearly outperform even efficiently written, compiled OOP-language code.


Can you please comment on what you are using? I am not asking for opinions on whether you think one is better or worse for the tasks below, but specifically why you use R or Python and whether you even place them in the same category for accomplishing, among others, the following tasks:




  • acquire, store, maintain, read, clean time series

  • perform basic statistics on time series, advanced statistical models such as multivariate regression analyses,...

  • perform mathematical computations (Fourier transforms, PDE solvers, PCA, ...)

  • visualization of data (static and dynamic)

  • pricing derivatives (application of pricing models such as interest rate models)

  • interconnectivity (with Excel, servers, UI, ...)

  • (Added Jan 2016): Ability to design, implement, and train deep learning networks.


EDIT: I thought the following link might add more value, though it is slightly dated [2013] (for some obscure reason that discussion was also closed...): https://softwareengineering.stackexchange.com/questions/181342/r-vs-python-for-data-analysis



You can also search for several posts on the r-bloggers website that address computational efficiency between R and Python packages. As was addressed in some of the answers, one aspect is data pruning, i.e., the preparation and setup of the input data. Another part of the equation is computational efficiency when actually performing the statistical and mathematical computations.


Update (Jan 2016)


I wanted to provide an update to this question now that AI/deep-learning networks are being very actively pursued at banks and hedge funds. I have spent a good amount of time delving into deep learning, running experiments, and working with libraries such as Theano, Torch, and Caffe. What stood out from my own work and from conversations with others was that a lot of those libraries are used via Python and that most researchers in this space do not use R. This still constitutes a small part of the quant work being performed in financial services, but I wanted to point it out as it directly touches on the question I asked, and I added this aspect of quant research to reflect current trends.



Answer



My deal is HFT, so what I care about is:



  1. read/load data from file or DB quickly into memory

  2. perform very efficient data-munging operations (group, transform)

  3. visualize the data easily



I think it is pretty clear that 3. goes to R: base graphics, ggplot2, and others allow you to plot anything from scratch with little effort.
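
For instance, a minimal sketch with simulated data (the data frame and column names below are purely illustrative) shows how little ggplot2 code a basic time-series plot takes:

R) library(ggplot2)
R) px = data.frame(date  = seq(as.Date("2016-01-01"), by = "day", length.out = 250),
                   price = 100 * cumprod(1 + rnorm(250, 0, 0.01)))
R) ggplot(px, aes(date, price)) + geom_line() + ggtitle("Simulated price path")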


About 1. and 2., I am amazed, reading the previous posts, to see that people are advocating for Python based on pandas and that no one cites data.table. data.table is a fantastic package that allows blazing-fast grouping/transforming of tables with tens of millions of rows. From this benchmark you can see that data.table is multiple times faster than pandas and much more stable (pandas tends to crash on massive tables).


Example


R) library(data.table)
R) DT = data.table(x=rnorm(2e7), y=rnorm(2e7), z=sample(letters, 2e7, replace=TRUE))
R) tables()
     NAME       NROW NCOL  MB COLS  KEY
[1,] DT   20,000,000    3 458 x,y,z
Total: 458MB
R) system.time(DT[, .(sum(x), mean(y)), .(z)])
   user  system elapsed
  0.226   0.037   0.264
R) setkey(DT, z)
R) system.time(DT[, .(sum(x), mean(y)), .(z)])
   user  system elapsed
  0.118   0.022   0.140

Then there is speed: as I work in HFT, neither R nor Python can be used in production. But the Rcpp package allows you to write efficient C++ code and integrate it into R trivially (literally by adding two lines). I doubt R is fading, given the number of new packages created every day and the momentum the language has...
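
A minimal sketch of that kind of integration (the function below is purely illustrative and assumes only that Rcpp is installed):

R) library(Rcpp)
R) cppFunction('
     double sumSquares(NumericVector x) {
       double total = 0.0;                  // accumulate the sum of squares in C++
       for (int i = 0; i < x.size(); ++i)
         total += x[i] * x[i];
       return total;
     }')
R) sumSquares(rnorm(1e6))   # compiled C++ called like any other R function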


EDIT 2018-07



A few years later I am amazed by how the R ecosystem has evolved. For in-memory computation you get unmatched tools: fst for blazing-fast binary read/write, fork- or cluster-based parallelism in one-liners, and C++ integration that is incredibly easy with Rcpp. You get interactive graphics with classics like plotly, and crazy features like ggplotly (which simply makes your ggplot2 plots interactive). Having tried Python with pandas, I honestly do not understand how there could even be a match: the syntax is clunky and the performance is poor, but I must be too used to R, I guess. Another thing that is really missing in Python is literate programming; nothing comes close to rmarkdown (the best I could find in Python was Jupyter, but that does not even come close). With all the fuss surrounding the R vs. Python language war, I realize that the vast majority of people are simply uninformed: they do not know what data.table is, that it has nothing to do with a data.frame, or that R fully supports TensorFlow and Keras... To conclude, I think both tools can do everything, and it seems that the Python language has very good PR...
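
As a minimal sketch of the fst and one-liner parallelism points (the file name and core count are purely illustrative; assumes the fst and parallel packages are installed):

R) library(fst); library(parallel)
R) df = data.frame(x = rnorm(1e7), g = sample(letters, 1e7, replace = TRUE))
R) write_fst(df, "df.fst")                                  # fast binary write to disk
R) df2 = read_fst("df.fst")                                 # fast binary read back into memory
R) res = mclapply(split(df$x, df$g), mean, mc.cores = 4)    # fork-based parallelism in one line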

