Every few days, I get a message or an email from a newbie who wants to learn about machine learning. The dilemma they cannot get past is not about data sets or the problems they want to solve. It is of a far more rudimentary kind, one that plagues almost everyone who wants to join the machine learning universe.
“Should I use R or should I use Python?”
That’s the dilemma. The answer you get depends on whom you ask. A statistician at heart will probably swear by R (it was created for them), but there are always outliers who would not leave Python’s side, irrespective of their statistical beliefs. The same holds for those who have tamed Python and would work in R only under duress.
The RedMonk Programming Language Rankings: January 2020 place Python and R as the two most popular programming languages for statistics and advanced forms of data analysis. It’s a comparison that has spawned several groups in support of each language, each questioning the other’s processing prowess. As with most comparisons where matters of the heart supersede those of the mind, this battle, too, always ends inconclusively. The simple reason is the desire for a one-stop solution to all our problems, or at least our data-analysis problems. Unfortunately, there is no one-stop solution.
Both languages have evolved over the last two decades, are well known and have good user and contributor bases. However, if we must answer which of them is the “best”, there is only one answer: “it depends”.
We have had many capability-wise discussions around R and Python, but I largely believe that both are capable programming languages and that their usage differs based on the user’s comfort or the scenario.
Let me explain this “it depends” from a user’s perspective and give my view:
- For a first-time user:
  - If you are a researcher, a statistician or somebody from a non-software field, you might find R interesting and easy to learn. R makes it quite easy to perform analysis, and packages are easy to find and install, especially in RStudio. Though the language has a steep learning curve for complex functionality, it is easy to begin with. Let us just say that the entry barrier for R is low.
  - If you are a developer, a tester or a person with experience in software, you might find Python easier to work with. Python evolved as a scripting language and can handle both back-end and front-end programs, especially with frameworks such as Django and BeeWare.
- For someone who wants to build a product:
  - If you are a product manager, a technology lead or a person planning to build a product using the power of machine learning and AI, you might want to consider Python. I know that I am moving into controversial territory with these hypotheses, but they are just that: hypotheses. The reason is that Python is robust for deployment and has optimized packages and methodologies for computing large-scale mathematical algorithms. Python has been widely used to build different applications, which gives it a proven track record of large-scale deployability compared to R.
- For someone who is into business consulting (data driven):
  - Consulting as a craft evaluates all the available aspects of the problem and cleverly chooses the right one. A consultative approach dictates that we can’t tie our options down to one of the two platforms. In my opinion, consultants could use:
    - Python when:
      - They have a huge data table that takes up more than 70% of the system’s RAM
      - They use non-parametric and more black-box models
      - They have to implement deep learning
    - R when:
      - They have to perform a first-pass exploratory data analysis (EDA) of comparatively smaller data sets
      - They have to work with state-of-the-art algorithms, as researchers usually publish their packages on CRAN
      - They need great visualizations
- For a data scientist:
  - The work of a data scientist, and the value they bring, is not limited to a handful of businesses, sectors or even technologies. They work across many technologies and fields, from NLP to computer vision to deep learning. Without treating each of these as a separate skill set (though I believe they should be), we can evaluate which of the two languages should be used:
    - Python for:
      - Deep learning
      - NLP
      - Building products
      - Deploying large-scale and complex algorithms (which can also be done in R, but is more complex to learn)
      - Computer vision
    - R for:
      - Deploying algorithms with statistical inferences (time series and parametric forms, for example)
      - Doing statistical research
      - Building great visualizations
      - Conducting market research analysis
      - Empirical research
This is not an exhaustive list. The key point of this comparison is that the choice of the language should not be fixed but should be altered based on the purpose and the persona of the user.
Let us see some differences based on examples:
Statistics in base languages
- Base R ships with a set of functions that come in handy for a statistician, for example:
  - Quantiles (quantile)
  - t-test (t.test)
  - ANOVA (aov)
  - Linear models (lm)
- These functions are not built into Python itself; we must import packages such as pandas, NumPy, etc.
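To make the contrast concrete, here is a minimal sketch of how far Python's standard library goes on its own. It assumes Python 3.8 or later for `statistics.quantiles`; the two groups are made-up numbers and `welch_t_statistic` is a hypothetical helper written for this illustration. It yields only the t statistic: the p-value that R's t.test reports, as well as ANOVA, would typically come from a third-party package such as SciPy or statsmodels.

```python
import math
import statistics


def welch_t_statistic(a, b):
    """Welch's two-sample t statistic (hypothetical helper, no p-value)."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # statistics.variance is the sample variance (n - 1 in the denominator).
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))


# Made-up measurements for two groups.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4]
group_b = [4.4, 4.7, 5.0, 4.6, 4.9, 4.5]

# Three cut points that split group_a into four equal-probability bins.
quartiles = statistics.quantiles(group_a, n=4, method="inclusive")
t_stat = welch_t_statistic(group_a, group_b)

print("quartiles:", quartiles)
print("t statistic:", round(t_stat, 3))
```

Even in this small case there is an explicit import and a hand-written helper where R needs neither.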
Interpretation of ML algorithms
Interpretation can be statistically easier in R, while Python can offer a wider choice of solvers/optimizers. For example:
- Logistic regression
  - In R, if we use glm, we get these interpretations:
    - Deviance residuals
    - Coefficients and significance values (P and Z values for 90%, 95%, 99% significance)
    - Null/residual deviance
    - AIC (Akaike information criterion)
  - In Python, if we use scikit-learn for logistic regression:
    - We do not get significance values, which are key to the analysis; we must derive them using an external function
    - On the other hand, it offers solvers such as liblinear, saga, sag and newton-cg, along with regularization
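To show what "deriving the significance values" involves, here is a rough sketch in plain Python. It is not scikit-learn's API and not how R's glm is implemented internally. It fits a one-feature logistic regression by Newton-Raphson and computes Wald standard errors, z values and two-sided p-values from the inverse of the Fisher information matrix. The function names and the data (hours studied versus pass/fail) are made up for illustration.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def information(b0, b1, x):
    """Entries (h00, h01, h11) of the 2x2 Fisher information matrix."""
    weights = []
    for xi in x:
        p = sigmoid(b0 + b1 * xi)
        weights.append(p * (1.0 - p))
    h00 = sum(weights)
    h01 = sum(w * xi for w, xi in zip(weights, x))
    h11 = sum(w * xi * xi for w, xi in zip(weights, x))
    return h00, h01, h11


def fit_logistic_with_wald(x, y, iterations=25):
    """Fit P(y=1) = sigmoid(b0 + b1*x) and return Wald statistics."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        # Gradient of the log-likelihood at the current coefficients.
        residuals = [yi - sigmoid(b0 + b1 * xi) for xi, yi in zip(x, y)]
        g0 = sum(residuals)
        g1 = sum(r * xi for r, xi in zip(residuals, x))
        # Newton step: coefficients += inverse(information) * gradient.
        h00, h01, h11 = information(b0, b1, x)
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det

    # Coefficient variances are the diagonal of the inverse information.
    h00, h01, h11 = information(b0, b1, x)
    det = h00 * h11 - h01 * h01
    results = {}
    for name, estimate, variance in (
        ("intercept", b0, h11 / det),
        ("slope", b1, h00 / det),
    ):
        std_error = math.sqrt(variance)
        z_value = estimate / std_error
        # Two-sided p-value from the standard normal distribution.
        p_value = math.erfc(abs(z_value) / math.sqrt(2.0))
        results[name] = {
            "estimate": estimate,
            "std_error": std_error,
            "z": z_value,
            "p": p_value,
        }
    return results


# Made-up data: hours studied and whether the exam was passed.
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

results = fit_logistic_with_wald(hours, passed)
for name, stats in results.items():
    print(name, {key: round(value, 4) for key, value in stats.items()})
```

The sketch has no regularization and no convergence check, and it assumes the classes are not perfectly separable. In practice one would more likely use a package that reports these values directly.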
Reading large data sets
- Reading large data sets is usually faster in Python than in R if we use the regular data manipulation tools. R needs a sequential read most of the time.
- Let’s check by importing a large CSV from this dataset:
  - In R: time taken is 4.15 minutes
  - In Python: time taken is 77.15 seconds
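If you want to run this kind of timing yourself, a minimal harness could look like the sketch below. It does not reproduce the measurement above: it writes a small made-up CSV to a temporary folder and times a read with the standard library's csv module. The file name, the row count and both helper functions are assumptions for illustration. For a real comparison, point the timer at your own file and swap the reader for the tool you actually use, for example `pandas.read_csv`.

```python
import csv
import os
import tempfile
import time


def write_sample_csv(path, rows):
    """Write a small made-up CSV so the example is self-contained."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "value"])
        for i in range(rows):
            writer.writerow([i, i * 0.5])


def time_csv_read(path):
    """Return (number of data rows, seconds taken to read them)."""
    start = time.perf_counter()
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the header row
        row_count = sum(1 for _ in reader)
    return row_count, time.perf_counter() - start


with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "sample.csv")
    write_sample_csv(sample_path, rows=50_000)
    row_count, seconds = time_csv_read(sample_path)

print(f"Read {row_count} rows in {seconds:.3f} s")
```

Timings like these depend heavily on the machine, the file and the reader, so treat any single number as indicative only.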
We can’t always say that one language is better, but we can make a language work better if we have deeper knowledge of it. Take the merge function in R versus the merge function in pandas: the pandas version was written with the drawbacks of R’s merge function in mind and thus has a better algorithm in place.
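To show how much the choice of algorithm matters for a merge, here is a small sketch of a hash join, a common technique for joining two tables in roughly linear time. It is an illustration of the general idea only, not the actual implementation in pandas or R. The function name and the sample rows are made up.

```python
from collections import defaultdict


def hash_inner_join(left, right, key):
    """Inner-join two lists of dict rows on a shared key column."""
    # Build phase: index the right-hand rows by their key value.
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)

    # Probe phase: look up each left-hand row's key in the index.
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            # On shared column names, the left-hand row's values win.
            joined.append({**match, **row})
    return joined


# Made-up sample tables.
customers = [
    {"id": 1, "name": "Asha"},
    {"id": 2, "name": "Ben"},
    {"id": 3, "name": "Chen"},
]
orders = [
    {"id": 1, "amount": 250},
    {"id": 1, "amount": 100},
    {"id": 3, "amount": 75},
]

merged = hash_inner_join(customers, orders, key="id")
for row in merged:
    print(row)
```

Each row is touched once in the build phase and once in the probe phase. Comparing every row of one table against every row of the other would instead grow with the product of the two table sizes.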
Deep learning
While we can perform deep learning in both programming languages, Python easily wins with its TensorFlow, Keras and Theano packages. The same thing can be done in R, but it is difficult to find a direct way to implement it. The number of commits in the Python projects is far higher than in their R counterparts (for example, TensorFlow, Keras and scikit-learn in Python versus H2O and mlr in R). N.B.: Though a TensorFlow package for R is also available, it has only been around since October 2019 and doesn’t have as many commits.
Speed
Let us take an example. Suppose I want to implement a dCNN (deep convolutional neural network). With Python’s TensorFlow, the solution is just a search away, while in R you will probably spend half a day finding an implementation and getting rid of the errors that follow.
Visualization
Visualization in R, especially with Hadley Wickham’s ggplot2 and its more than 50 visualization types, is very handy and is powerful compared with matplotlib. The visualization capabilities of R and its supporting packages are quite evolved and mature, while Python’s visualization packages are still going through massive commits to catch up with R’s capabilities.
Some points came up from an analysis of the Stack Overflow data available on Kaggle, where prediction models were used to ask: based on different features, can we predict which kind of user you are? Some of the insights that came out are:
- If you are looking to move towards Linux next year, you are more likely a Python user
- If you studied statistics, you are more likely to go with R, and if you studied computer science, you might lean towards Python
- If you are young (18-24 years old), you are more likely a Python user
- If you participate in coding competitions, you are more likely to be a Python user
- If you want an Android next year, you are more likely a Python user
- If you want to learn SQL next year, you are more likely an R user
- If you use MS office, you are more likely an R user
- If you want a Raspberry Pi next year, you are more likely a Python user
- If you are a full-time student, you are more likely to be a Python user
- If you are using Agile methodology, you are more likely to be a Python user
- If you are more worried than excited about AI, then you are more likely to be an R user
- People who have been coding for 3-11 years are more likely to be Python users, while R seems to be the flavour of choice for those with over 12 years of experience
- People using Python and R are more often moderately happy than people using only one of the languages
CONCLUSION
Everything points back to the same answer, “it depends”, which does not help much but proves one thing: there is no need for the comparison. The decision should be based on the time in hand, the purpose of the usage or the stage of learning you are in. Keeping ourselves updated would help us make this decision faster and better.
Today, Python’s user base has shot up way above R’s. More and more Python packages are being released to catch up with R’s 12,000 packages. Popularity-wise, the number of questions asked about Python on Quora/Stack Overflow has increased substantially over the last couple of years. The job market is also inclined towards Python, though a fairly large user base still prefers R.
Personally, I believe both are equally capable languages. How easily a job gets done depends entirely on the user and the job at hand.
Shivakumar is a keen follower of scientific trends and an Asimov fan. He believes solid execution is the key to the success of any strategy and is focused on building a world-class data science team at Prescience. He has a B.Tech from IIT Delhi and an MBA from IIM Lucknow, with 20+ years of experience in the technology space.