R vs Python Speed Comparison for Bootstrapping

I’m interested in Python a lot, mostly because it appears to be wickedly fast. The downside is that I don’t know it nearly as well as R, so any speed gain in computation time is more than offset by Google time trying to figure out how to do what I want. However, I’m curious as to know how much faster Python is for things like simulating data (because I have a particular interest in neutral models and data simulation). So I faked a linear regression and bootstrapped the confidence interval on the slope. Since I’m only interested in the comparison in speed for bootstrapping, that’s all I’m going to show.

First, with R:

set.seed(101)

# generate data

x <- 0:100

y <- 2*x + rnorm(101, 0, 10)

# plot data

plot(x, y)

# run the regression

mod1 <- lm(y ~ x)

yHat <- fitted(mod1)

# get the residuals

errors <- resid(mod1)

# make a bootstrapping function

boot <- function(n = 10000){

b1 <- numeric(n)

b1[1] <- coef(mod1)[2]

for(i in 2:n){

residBoot <- sample(errors, replace=F)

yBoot <- yHat + residBoot

modBoot <- lm(yBoot ~ x)

b1[i] <- coef(modBoot)[2]

}

return(b1)

}

# Run the bootstrapping function

system.time( bootB1 <- boot() )

mean(bootB1)

On my system, it takes about 10.5 seconds to run this bootstrap.

In Python:

import numpy as np

import pylab as py

import statsmodels.api as sm

import time

# Create the data

x = np.arange(0, 101)

y = 2*x + np.random.normal(0, 10, 101)

# Add the column of ones for the intercept

X = sm.add_constant(x, prepend=False)

# Plot the data

py.clf()

py.plot(x, y, 'bo', markersize=10)

# Define the OLS models

mod1 = sm.OLS(y, X)

# Fit the OLS model

results = mod1.fit()

# Get the fitted values

yHat = results.fittedvalues

# Get the residuals

resid = results.resid

# Set bootstrap size

n = 10000

t0 = time.time()

# Save the slope

b1 = np.zeros( (n) )

b1[0] = results.params[0]

for i in np.arange(1, 10000):

residBoot = np.random.permutation(resid)

yBoot = yHat + residBoot

modBoot = sm.OLS(yBoot, X)

resultsBoot = modBoot.fit()

b1[i] = resultsBoot.params[0]

t1 = time.time()

elapsedTime = t1 - t0

print elapsedTime

np.mean(b1)

On my system, the elapsed time is about 4.7 seconds. That’s a substantial increase in speed. I can only assume that for larger simulations, the time difference would be greater. Thus, it seems like Python may be the way to go for permutational analyses, especially when data are simulated.

So what I have learned about Python so far?

PROS:

Super fast

Language is ‘relatively easy’ to learn

Has some built in stats capabilities

Has great symbolic math capabilities

CONS:

Language is different enough that learning is going to be slow

Statistical capabilities are sparse, and R is an easy statistical language (so far)

Overall, if Python had good stats capabilities, I’d probably switch all together. As it is, I’m considering dropping R for things like modeling and simulations just because Python is so much faster. I’m also reading Numerical Ecology right now, and I’m considering sitting down and putting together the beginnings of a multivariate ecology stats module, but I bet that’s going to be more work than I have time for. It’d be pretty cool though.

PHOENIX USER

R vs Python Speed Comparison for Bootstrapping

R vs Python Speed Comparison for Bootstrapping

referal

Subscribe me

Space for adds

Blog Archive

Popular Posts

Labels

Banner2

Blog Traffic

Total Pageviews