<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.cybaea.net/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0"><title>CYBAEA Data and Analysis</title><rights>Copyright by the author(s). All rights reserved.</rights><logo>http://static.cybaea.net/logo/cybaea-data-200.png</logo><subtitle type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>Read the CYBAEA Data and Analysis blog for in-depth coverage of selected topics in data analysis, data mining, statistics, causal inference, and related topics.</p><p>This is the blog for practising data analysts and theoretical statisticians.  The business conclusions of any analysis would normally be discussed in the CYBAEA Journal while this blog may contain the details of the analysis.</p></div></subtitle><updated>2010-08-05T08:22:08Z</updated><id>urn:uuid:259dced6-9721-5b16-a8aa-d91dc8e40f56</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/" /><link rel="alternate" type="text/html" href="http://www.cybaea.net/Blogs/Data/" /><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><generator uri="http://www.cybaea.net/atom/feed.pl?short_name=Data" version="$Revision: 97 $">feed.pl</generator><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/atom+xml" href="http://feeds.cybaea.net/CybaeaData" /><feedburner:info uri="cybaeadata" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><feedburner:emailServiceId>CybaeaData</feedburner:emailServiceId><feedburner:feedburnerHostname>http://feedburner.google.com</feedburner:feedburnerHostname><entry><title type="text">Big data for R</title><id>urn:uuid:04001d8b-1947-56b3-86a5-265707a84aa9</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Big-data-for-R.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/pegHIMxElX0/Big-data-for-R.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
Revolutions Analytics recently <a href="http://blog.revolutionanalytics.com/2010/08/announcing-big-data-for-revolution-r.html">announced</a> their "big data" solution for R.  This is great news and a lovely piece of work by the team at Revolutions.
</p>
<p>
However, if you want to replicate their analysis in standard <a href="http://www.r-project.org/">R</a>, then you can absolutely do so and we show you how.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
Revolutions Analytics recently &lt;a href="http://blog.revolutionanalytics.com/2010/08/announcing-big-data-for-revolution-r.html"&gt;announced&lt;/a&gt; their "big data" solution for R.  This is great news and a lovely piece of work by the team at Revolutions.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
However, if you want to replicate their analysis in standard &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;, then you can absolutely do so and we show you how.&#xD;
&lt;/p&gt;&#xD;
&lt;h2&gt;Data preparation&lt;/h2&gt;&#xD;
&lt;p&gt;&#xD;
First you need to prepare the rather large data set that they use in the Revolutions white paper.  The preparation script shown  below does two passes over alal the files which is not needed: changing it to a single pass is left as an exercise for the reader....  Note that the following script will take a while to run and will need some 30-odd gig of free disk space (another exercise: get rid of the airlines.csv file), but once it is done the analysis is fast.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="big.R"&gt;&#xD;
#!/usr/bin/Rscript&#xD;
## big.R - Preprocess the airline data&#xD;
## Copyright © 2010 Allan Engelhardt (&lt;a href="http://www.cybaea.net/"&gt;http://www.cybaea.net/&lt;/a&gt;)&#xD;
&#xD;
## Install the packages we will use&#xD;
install.packages("bigmemory",&#xD;
                 dependencies = c("Depends", "Suggests", "Enhances"))&#xD;
&#xD;
## Data sets are downloaded from the Data Expo '09 web site at&#xD;
## http://stat-computing.org/dataexpo/2009/the-data.html&#xD;
for (year in 1987:2008) {&#xD;
    file.name &amp;lt;- paste(year, "csv.bz2", sep = ".")&#xD;
    if ( !file.exists(file.name) ) {&#xD;
        url.text &amp;lt;- paste("http://stat-computing.org/dataexpo/2009/",&#xD;
                          year, ".csv.bz2", sep = "")&#xD;
        cat("Downloading missing data file ", file.name, "\n", sep = "")&#xD;
        download.file(url.text, file.name)&#xD;
    }&#xD;
}&#xD;
&#xD;
## Read sample file to get column names and types&#xD;
d &amp;lt;- read.csv("2008.csv.bz2")&#xD;
integer.columns &amp;lt;- sapply(d, is.integer)&#xD;
factor.columns  &amp;lt;- sapply(d, is.factor)&#xD;
factor.levels   &amp;lt;- lapply(d[, factor.columns], levels)&#xD;
n.rows &amp;lt;- 0L&#xD;
&#xD;
## Process each file determining the factor levels&#xD;
## TODO: Combine with next loop&#xD;
for (year in 1987:2008) {&#xD;
    file.name &amp;lt;- paste(year, "csv.bz2", sep = ".")&#xD;
    cat("Processing ", file.name, "\n", sep = "")&#xD;
    d &amp;lt;- read.csv(file.name)&#xD;
    n.rows &amp;lt;- n.rows + NROWS(d)&#xD;
    new.levels &amp;lt;- lapply(d[, factor.columns], levels)&#xD;
    for ( i in seq(1, length(factor.levels)) ) {&#xD;
        factor.levels[[i]] &amp;lt;- c(factor.levels[[i]], new.levels[[i]])&#xD;
    }&#xD;
    rm(d)&#xD;
}&#xD;
save(integer.columns, factor.columns, factor.levels, file = "factors.RData")&#xD;
&#xD;
## Now convert all factors to integers so we can create a bigmatrix of the data&#xD;
col.classes &amp;lt;- rep("integer", length(integer.columns))&#xD;
col.classes[factor.columns] &amp;lt;- "character"&#xD;
cols  &amp;lt;- which(factor.columns)&#xD;
first &amp;lt;- TRUE&#xD;
csv.file &amp;lt;- "airlines.csv"   # Write combined integer-only data to this file&#xD;
csv.con  &amp;lt;- file(csv.file, open = "w")&#xD;
&#xD;
for (year in 1987:2008) {&#xD;
    file.name &amp;lt;- paste(year, "csv.bz2", sep = ".")&#xD;
    cat("Processing ", file.name, "\n", sep = "")&#xD;
    d &amp;lt;- read.csv(file.name, colClasses = col.classes)&#xD;
    ## Convert the strings to integers&#xD;
    for ( i in seq(1, length(factor.levels)) ) {&#xD;
        col &amp;lt;- cols[i]&#xD;
        d[, col] &amp;lt;- match(d[, col], factor.levels[[i]])&#xD;
    }&#xD;
    write.table(d, file = csv.con, sep = ",", &#xD;
                row.names = FALSE, col.names = first)&#xD;
    first &amp;lt;- FALSE&#xD;
}&#xD;
close(csv.con)&#xD;
&#xD;
## Now convert to a big.matrix&#xD;
library("bigmemory")&#xD;
backing.file    &amp;lt;- "airlines.bin"&#xD;
descriptor.file &amp;lt;- "airlines.des"&#xD;
data &amp;lt;- read.big.matrix(csv.file, header = TRUE,&#xD;
                        type = "integer",&#xD;
                        backingfile = backing.file,&#xD;
                        descriptorfile = descriptor.file,&#xD;
                        extraCols = c("age"))&#xD;
&lt;/pre&gt;&#xD;
&lt;h2&gt;Sample analysis&lt;/h2&gt;&#xD;
&lt;p&gt;&#xD;
All done now.  Sample analysis:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&#xD;
## bigScale.R - Replicate the analysis from &lt;a href="http://bit.ly/aTFXeN"&gt;http://bit.ly/aTFXeN&lt;/a&gt; with normal R&#xD;
##   http://info.revolutionanalytics.com/bigdata.html&#xD;
## See big.R for the preprocessing of the data&#xD;
&#xD;
## Load required libraries&#xD;
library("biglm")&#xD;
library("bigmemory")&#xD;
library("biganalytics")&#xD;
library("bigtabulate")&#xD;
&#xD;
## Use parallel processing if available&#xD;
## (Multicore is for "anything-but-Windows" platforms)&#xD;
if ( require("multicore") ) {&#xD;
    library("doMC")&#xD;
    registerDoMC()&#xD;
} else {&#xD;
    warning("Consider registering a multi-core 'foreach' processor.")&#xD;
}&#xD;
&#xD;
day.names &amp;lt;- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday",&#xD;
               "Saturday", "Sunday")&#xD;
&#xD;
## Attach to the data&#xD;
descriptor.file &amp;lt;- "airlines.des"&#xD;
data &amp;lt;- attach.big.matrix(dget(descriptor.file))&#xD;
&#xD;
## Replicate Table 5 in the Revolutions document:&#xD;
## Table 5&#xD;
t.5 &amp;lt;- bigtabulate(data,&#xD;
                   ccols = "DayOfWeek",&#xD;
                   summary.cols = "ArrDelay", summary.na.rm = TRUE)&#xD;
## Pretty-fy the outout&#xD;
stat.names &amp;lt;- dimnames(t.5.2$summary[[1]])[2][[1]]&#xD;
t.5.p &amp;lt;- cbind(matrix(unlist(t.5$summary), byrow = TRUE,&#xD;
                      nrow = length(t.5$summary),&#xD;
                      ncol = length(stat.names),&#xD;
                      dimnames = list(day.names, stat.names)),&#xD;
               ValidObs = t.5$table)&#xD;
print(t.5.p)&#xD;
#             min  max     mean       sd    NAs ValidObs&#xD;
# Monday    -1410 1879 6.669515 30.17812 385262 18136111&#xD;
# Tuesday   -1426 2137 5.960421 29.06076 417965 18061938&#xD;
# Wednesday -1405 2598 7.091502 30.37856 405286 18103222&#xD;
# Thursday  -1395 2453 8.945047 32.30101 400077 18083800&#xD;
# Friday    -1437 1808 9.606953 33.07271 384009 18091338&#xD;
# Saturday  -1280 1942 4.187419 28.29972 298328 15915382&#xD;
# Sunday    -1295 2461 6.525040 31.11353 296602 17143178&#xD;
&#xD;
## Figure 1&#xD;
plot(t.5.p[, "mean"], type = "l", ylab="Average arrival delay")&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Just like the Revolutions paper.  You can now use &lt;code&gt;biglm.big.matrix&lt;/code&gt; and &lt;code&gt;bigglm.big.matrix&lt;/code&gt; for basic regression and there are also k-means clustering and other functions.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
I must admit here that I do not understand the Revolutions regression example, so I have not attempted to replicate it here.  It seems kind of sad if they change the syntax to be incompatible with standard R formulas, which is what appears to be happening.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Credit to Michael Kane and Jay Emerson of Yale who showed much of this in their poster &lt;a href="http://stat-computing.org/dataexpo/2009/posters/kane-emerson.pdf"&gt;The Airline Data Set... What's the big deal?&lt;/a&gt;.&#xD;
&lt;/p&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=pegHIMxElX0:jg-xZLG8yVc:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=pegHIMxElX0:jg-xZLG8yVc:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=pegHIMxElX0:jg-xZLG8yVc:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=pegHIMxElX0:jg-xZLG8yVc:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=pegHIMxElX0:jg-xZLG8yVc:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=pegHIMxElX0:jg-xZLG8yVc:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=pegHIMxElX0:jg-xZLG8yVc:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=pegHIMxElX0:jg-xZLG8yVc:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=pegHIMxElX0:jg-xZLG8yVc:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/pegHIMxElX0" height="1" width="1"/&gt;</content><published>2010-08-05T08:22:00Z</published><updated>2010-08-05T08:22:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Big-data-for-R.html</feedburner:origLink></entry><entry><title type="text">Area Plots with Intensity Coloring</title><id>urn:uuid:6b83e364-13a9-58b5-9f83-ec94683bf592</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Area-Plots-with-Intensity-Coloring.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/6pq1Dbge-y0/Area-Plots-with-Intensity-Coloring.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><div class="floatRight">
  <a href="http://www.cybaea.net/Blogs/Data/Area-Plots-with-Intensity-Coloring.html" title="Click to read full article">
    <img src="http://static.cybaea.net/images/nino-150.png" width="150" height="150" alt="[Graphics output]" />
  </a>
</div>
<p>I am not sure apeescape’s <a href="http://probabilitynotes.wordpress.com/2010/07/10/area-plots-with-intensity-coloring-el-nino-sst-anomalies-w-ggplot2/">ggplot2 area plot with intensity colouring</a> is really the best way of presenting the information, but it had me intrigued enough to replicate it using base <a href="http://www.r-project.org/">R</a> graphics.</p>

<p>The key technique is to draw a gradient line which R does not support natively so we have to roll our own code for that.  Unfortunately, <code>lines(..., type="l")</code> does not recycle the colour <code>col=</code> argument, so we end up with rather more loops than I thought would be necessary.</p>

<p>We also get a nice opportunity to use the under-appreciated <code>read.fwf</code> function.</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;I am not sure apeescape’s &lt;a href="http://probabilitynotes.wordpress.com/2010/07/10/area-plots-with-intensity-coloring-el-nino-sst-anomalies-w-ggplot2/"&gt;ggplot2 area plot with intensity colouring&lt;/a&gt; is really the best way of presenting the information, but it had me intrigued enough to replicate it using base &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt; graphics.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;The key technique is to draw a gradient line which R does not support natively so we have to roll our own code for that.  Unfortunately, &lt;code&gt;lines(..., type="l")&lt;/code&gt; does not recycle the colour &lt;code&gt;col=&lt;/code&gt; argument, so we end up with rather more loops than I thought would be necessary.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;(The answer is not to use &lt;code&gt;lines(..., type="h")&lt;/code&gt; which, confusingly, &lt;em&gt;does&lt;/em&gt; recycle the colour &lt;code&gt;col=&lt;/code&gt; argument.  This one had me for a while, but the &lt;code&gt;type=h&lt;/code&gt; lines always start from zero so you do not get the gradient feature.)&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;We also get a nice opportunity to use the under-appreciated &lt;code&gt;read.fwf&lt;/code&gt; function.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;##!/usr/bin/Rscript&#xD;
## nino.R - another version of &lt;a href="http://bit.ly/9P9Gh1"&gt;http://bit.ly/9P9Gh1&lt;/a&gt;&#xD;
## Copyright © 2010 Allan Engelhardt (&lt;a href="http://www.cybaea.net/"&gt;http://www.cybaea.net/&lt;/a&gt;)&#xD;
&#xD;
## Get the data from the NOAA server&#xD;
nino &amp;lt;- read.fwf("&lt;a href="http://www.cpc.noaa.gov/data/indices/wksst.for"&gt;http://www.cpc.noaa.gov/data/indices/wksst.for&lt;/a&gt;",&#xD;
                 widths=c(-1, 9, rep(c(-5, 4, 4), 4)),&#xD;
                 skip=4,&#xD;
                 col.names=c("Week",&#xD;
                     paste(rep(c("Nino12","Nino3","Nino34","Nino4"), rep(2, 4)),&#xD;
                           c("SST", "SSTA"), sep=".")))&#xD;
&#xD;
## Make the date column something useful&#xD;
nino$Week &amp;lt;- as.Date(nino$Week, format="%d%b%Y")&#xD;
&#xD;
## Make colour gradients&#xD;
ncol &amp;lt;- 50&#xD;
grad.neg &amp;lt;- hsv(4/6, seq(0, 1, length.out=ncol), 1) # Blue gradient&#xD;
grad.pos &amp;lt;- hsv(  0, seq(0, 1, length.out=ncol), 1) # Red gradient&#xD;
&#xD;
## Make plot&#xD;
plot(Nino34.SSTA ~ Week, data=nino, type="n",&#xD;
     main="Nino34", xlab="Date", ylab="SSTA", axes=FALSE)&#xD;
do.call(function (...) rect(..., col="gray85", border=NA),&#xD;
        as.list(par("usr")[c(1, 3, 2, 4)]))&#xD;
&#xD;
y &amp;lt;- nino$Nino34.SSTA                   # The values we will plot&#xD;
x &amp;lt;- nino$Week&#xD;
&#xD;
axis.Date(1, x=x, tck=1, col="white")&#xD;
axis(2, tck=1, col="white")&#xD;
box()&#xD;
&#xD;
idx &amp;lt;- integer(NROW(nino))&#xD;
idx[y &amp;gt;= 0] &amp;lt;- 1 + round( y[y &amp;gt;= 0] * (ncol - 1) / max( y[y &amp;gt;= 0]), 0)&#xD;
idx[y &amp;lt;  0] &amp;lt;- 1 + round(-y[y &amp;lt;  0] * (ncol - 1) / max(-y[y &amp;lt;  0]), 0)&#xD;
&#xD;
draw.gradient &amp;lt;- function(x, ys, cols) {&#xD;
    xs &amp;lt;- rep(x, 2)&#xD;
    for (i in seq(1, length(ys)-1))&#xD;
        plot.xy(list(x=xs, y=c(ys[i], ys[i+1])), type="l", col=cols[i])&#xD;
}&#xD;
&#xD;
for (i in 1:length(x)) {&#xD;
    ys &amp;lt;- seq(0, y[i], length.out=idx[i]+1)&#xD;
    cols &amp;lt;- (if (y[i] &amp;gt;=0) grad.pos else grad.neg)&#xD;
    draw.gradient(x[i], ys, cols)&#xD;
}&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;The result is a decent gradient:&lt;/p&gt;&#xD;
&lt;div class="floatCenter" style="width: 400px"&gt;&#xD;
&lt;a href="http://static.cybaea.net/images/nino-800.png" title="Click for larger version"&gt;&lt;img src="http://static.cybaea.net/images/nino-400.png" width="400" height="400" alt="[Graphics output]"&gt;&lt;/img&gt;&lt;/a&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;p&gt;&#xD;
I deliberately omitted the scale legend on the right hand side following Allan’s First Law of Happy Graphics: Thou shall not present the same information twice.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
For less dense information, you should increase the line width.  That is left to the reader. (Hint: it is hard to get just right in base graphics, but &lt;code&gt;lwd &amp;lt;- ceiling(par("pin")[1] / dev.size("in")[1] * dev.size("px")[1] / length(x))&lt;/code&gt; could be a starting point for an approximation. We really need gradient-filled polygons in base R.)&#xD;
&lt;/p&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=6pq1Dbge-y0:43TWymNOdd4:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=6pq1Dbge-y0:43TWymNOdd4:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=6pq1Dbge-y0:43TWymNOdd4:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=6pq1Dbge-y0:43TWymNOdd4:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=6pq1Dbge-y0:43TWymNOdd4:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=6pq1Dbge-y0:43TWymNOdd4:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=6pq1Dbge-y0:43TWymNOdd4:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=6pq1Dbge-y0:43TWymNOdd4:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=6pq1Dbge-y0:43TWymNOdd4:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/6pq1Dbge-y0" height="1" width="1"/&gt;</content><published>2010-07-13T07:47:00Z</published><updated>2010-07-13T07:47:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Area-Plots-with-Intensity-Coloring.html</feedburner:origLink></entry><entry><title type="text">Employee productivity as function of number of workers revisited</title><id>urn:uuid:cee42e41-ea6c-5ee6-a0b5-4e4644168052</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Employee-productivity-as-function-of-number-of-workers-revisited.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/mWb7tx4LS94/Employee-productivity-as-function-of-number-of-workers-revisited.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
<div class="floatRight"><a href="http://www.cybaea.net/Blogs/Data/Employee-productivity-as-function-of-number-of-workers-revisited.html" title="Click for read full article"><img width="150" height="150" src="http://static.cybaea.net/images/ftse100-150.png" alt="[Results of analysis shown in graph]" /></a></div>We have a mild obsession with employee productivity and how that declines as companies get bigger.  We have previously found that <a href="http://www.cybaea.net/Blogs/Journal/employee_productivity.html">when you treble the number of workers, you halve their individual productivity</a> which is mildly scary.
</p>
<p>
We revisit the analysis for the FTSE-100 constituent companies and find that the relation still holds four years later and across a continent.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;We have a mild obsession with employee productivity and how that declines as companies get bigger.  We have previously found that &lt;a href="http://www.cybaea.net/Blogs/Journal/employee_productivity.html"&gt;when you treble the number of workers, you halve their individual productivity&lt;/a&gt; which is mildly scary.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;Let’s try the FTSE-100 index of leading UK companies to see if they are significantly different from the S&amp;amp;P 500 leading American companies that &lt;a href="http://www.cybaea.net/Blogs/Journal/employee_productivity.html"&gt;we analyzed four years ago&lt;/a&gt;.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;We will of course use the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt; for our analysis, and once again we are grateful to &lt;a href="http://uk.finance.yahoo.com/"&gt;Yahoo Finance&lt;/a&gt; for providing the data.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;The analysis script is available as &lt;a href="http://static.cybaea.net/files/ftse100.R"&gt;ftse100.R&lt;/a&gt; and is really simple:&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;## ftse100.R - Display employee productivity for FTSE-100 consitituents&#xD;
## Copyright © 2010 Allan Engelhardt &amp;lt;http://www.cybaea.net/&amp;gt;&#xD;
## All Rights Reserved.&#xD;
&#xD;
## Get the index constituents.&#xD;
ftse.100 &amp;lt;- read.csv(file = "http://uk.old.finance.yahoo.com/d/quotes.csv?s=@%5EFTSE&amp;amp;f=s&amp;amp;e=.csv", header = FALSE)&#xD;
names(ftse.100) &amp;lt;- c("symbol")&#xD;
data &amp;lt;- data.frame(symbol=NULL, employees=NULL, profit=NULL, sector=NULL)&#xD;
&#xD;
## For each stock symbol, get employees, profit, and sector&#xD;
for (symbol in ftse.100$symbol) {&#xD;
    profile.url &amp;lt;- paste("http://uk.finance.yahoo.com/q/pr?s=", symbol, sep="")&#xD;
    con &amp;lt;- url(profile.url, open = "r")&#xD;
    text &amp;lt;- readChar(con, 2^24)     # enough bytes&#xD;
    close(con)&#xD;
    x &amp;lt;- sub('.*Number of employees:&amp;lt;/td&amp;gt;&amp;lt;td.*?&amp;gt;[[:space:]]*([[:digit:],]+).*', "\\1", text, ignore.case = TRUE)&#xD;
    x &amp;lt;- gsub(',', '', x)&#xD;
    empl &amp;lt;- tryCatch(as.integer(x), warning = function(x) NA)&#xD;
    x &amp;lt;- sub('.*Net Profit.*?&amp;lt;/td&amp;gt;&amp;lt;td.*?&amp;gt;[[:space:]]*([+-]?[[:digit:],]+).*', '\\1', text)&#xD;
    x &amp;lt;- gsub(',', '', x)&#xD;
    profit &amp;lt;- tryCatch(as.integer(x)*1e6, warning = function(x) NA)&#xD;
    sector &amp;lt;- sub('.*Sector:&amp;lt;/td&amp;gt;&amp;lt;td.*?&amp;gt;(.*?)&amp;lt;/td&amp;gt;.*', '\\1', text)&#xD;
    if (any(c(empl, profit) &amp;lt;= 0, is.na(c(empl, profit)))) {&#xD;
        cat("Error parsing symbol", symbol, "see", profile.url, "\n")&#xD;
    } else {&#xD;
        data &amp;lt;- rbind(data, data.frame(symbol=symbol, employees=empl, profit=profit, sector=sector))&#xD;
    }&#xD;
    Sys.sleep(1)&#xD;
}&#xD;
&#xD;
## Save the data so we don't have to hit Yahoo all the time.&#xD;
save(data, file = "data.RData")&#xD;
&#xD;
## Save plot to file:&#xD;
#png(filename="ftse100.png", width=800, height=800, pointsize=14, bg="white", res=100)&#xD;
&#xD;
opar &amp;lt;- par(cex.sub = sqrt(sqrt(2)), font.sub = 3, font.lab = 2)&#xD;
&#xD;
## x and y coordinates of plot and plot limits&#xD;
x &amp;lt;- with(data, employees)&#xD;
y &amp;lt;- with(data, profit/employees)&#xD;
xlim &amp;lt;- c(10^floor(log10(min(x))), 10^ceiling(log10(max(x))))&#xD;
ylim &amp;lt;- c(10^floor(log10(min(y))), 10^ceiling(log10(max(y))))&#xD;
&#xD;
## Set up to display different color and symbols&#xD;
plot_col &amp;lt;- 1&#xD;
plot_pch &amp;lt;- 1&#xD;
markers &amp;lt;- 21:25&#xD;
pchs &amp;lt;- rep(markers, ceiling(length(levels(data$sector))/length(markers)))&#xD;
palette(rainbow(length(levels(data$sector)), start=3/6, end=6/6))&#xD;
&#xD;
# Make empty plot:&#xD;
plot.new()&#xD;
plot(profit/employees ~ employees, data = data[FALSE, ], &#xD;
     type = "p", pch = pchs[plot_pch], col = plot_col,&#xD;
     log="xy", xaxp = c(xlim, 1), yaxp = c(ylim, 1), xlim = xlim, ylim = ylim,&#xD;
     main = "Profit per employee (FTSE 100)", xlab = "Employees", ylab = "Profit per employees (GBP)")&#xD;
&#xD;
## Plot each sector&#xD;
for (sector in levels(data$sector)) {&#xD;
    plot.xy(xy.coords(with(data[data$sector == sector,], employees),&#xD;
                      with(data[data$sector == sector,], profit/employees),&#xD;
                      log = "xy", xlab = "", ylab = ""),&#xD;
            type = "p", pch = pchs[plot_pch], col = plot_col, bg = plot_col)&#xD;
    plot_pch &amp;lt;- plot_pch + 1&#xD;
    plot_col &amp;lt;- plot_col + 1&#xD;
}&#xD;
legend(x = "bottomleft", legend = levels(data$sector), title = "Industry Sectors", &#xD;
       col = palette(), pt.bg = palette(), pch = pchs, cex = 2/3, pt.cex = 1, ncol = 2)&#xD;
&#xD;
## Fit a linear model to the log-log data:&#xD;
m &amp;lt;- lm(log10(y) ~ log10(x))&#xD;
xl &amp;lt;- c(xlim[1]*5, xlim[2]/5)&#xD;
yl &amp;lt;- 10^predict(m, data.frame(x = xl))&#xD;
lines(xl, yl, col = "darkred", lty = "dashed", lwd = 2)&#xD;
t &amp;lt;- sprintf("Power = %0.3g", m$coefficients[2])&#xD;
text(xl[2], yl[2], t, adj = c(0.25, -1.5), col = "darkred", font = 2)&#xD;
&#xD;
## All done.&#xD;
par(opar)&#xD;
dev.off()&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;Leave it to run and this is what you get:&lt;/p&gt;&#xD;
&lt;div class="floatCenter" style="width: 400px"&gt;&#xD;
  &lt;a href="http://static.cybaea.net/images/ftse100.png"&gt;&#xD;
    &lt;img src="http://static.cybaea.net/images/ftse100-400.png" width="400" height="400" alt="[Analysis output]"&gt;&lt;/img&gt;&#xD;
  &lt;/a&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;p&gt;The power law still broadly holds.  In a large company, the productivity of the individual employee is only ¼ of the productivity in a company with one-tenth of the number of workers.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;The analysis for the FTSE All-Share index is easy (&lt;a href="http://static.cybaea.net/files/ftse-all.R" title="Click for full size"&gt;ftse-all.R&lt;/a&gt;) and gives a slope of -0.7605541 for the 301 companies with the required information.  This is worse, much worse (you only need about 6 times as many workers to reduce the productivity to ¼).&lt;/p&gt;&#xD;
&#xD;
&lt;div class="floatCenter" style="width: 400px"&gt;&#xD;
  &lt;a href="http://static.cybaea.net/images/ftse-all.png" title="Click for full size"&gt;&#xD;
    &lt;img src="http://static.cybaea.net/images/ftse-all-400.png" width="400" height="400" alt="[Analysis output]"&gt;&lt;/img&gt;&#xD;
  &lt;/a&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mWb7tx4LS94:yeSkNsef7zA:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mWb7tx4LS94:yeSkNsef7zA:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=mWb7tx4LS94:yeSkNsef7zA:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mWb7tx4LS94:yeSkNsef7zA:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=mWb7tx4LS94:yeSkNsef7zA:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mWb7tx4LS94:yeSkNsef7zA:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mWb7tx4LS94:yeSkNsef7zA:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=mWb7tx4LS94:yeSkNsef7zA:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mWb7tx4LS94:yeSkNsef7zA:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/mWb7tx4LS94" height="1" width="1"/&gt;</content><published>2010-06-22T11:20:00Z</published><updated>2010-06-22T11:20:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Employee-productivity-as-function-of-number-of-workers-revisited.html</feedburner:origLink></entry><entry><title type="text">Comparing standard R with Revoutions for performance</title><id>urn:uuid:3293adea-fce4-57ac-844d-8c40497745e3</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Comparing-standard-R-with-Revoutions-for-performance.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/SNu7nI9K28g/Comparing-standard-R-with-Revoutions-for-performance.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>Following on from my previous post about <a href="http://www.cybaea.net/Blogs/Data/Faster-R-through-better-BLAS.html">improving performance of R by linking with optimized linear algebra libraries</a>, I thought it would be useful to try out the five benchmarks Revolutions Analytics have on their <a href="http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php">Revolutionary Performance</a> pages.</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;Following on from my previous post about &lt;a href="http://www.cybaea.net/Blogs/Data/Faster-R-through-better-BLAS.html"&gt;improving performance of R by linking with optimized linear algebra libraries&lt;/a&gt;, I thought it would be useful to try out the five benchmarks Revolutions Analytics have on their &lt;a href="http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php"&gt;Revolutionary Performance&lt;/a&gt; pages.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;For convenience I collected their tests into a single script &lt;a href="http://static.cybaea.net/files/revolution_benchmark.R"&gt;revolution_benchmark.R&lt;/a&gt; that I can simply run with &lt;code&gt;Rscript --vanilla revolution_benchmark.R&lt;/code&gt;.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;The results, compared with the speed-up factors Revolution claims for their version:&lt;/p&gt;&#xD;
&#xD;
&lt;table border="1" class="border"&gt;&#xD;
&lt;caption&gt;Revolutions benchmarks compared with R on x86_64 system&lt;/caption&gt;&#xD;
&lt;thead&gt;&#xD;
&lt;tr&gt;&#xD;
&lt;th&gt;&lt;/th&gt;&lt;th&gt;R&lt;/th&gt;&lt;th&gt;R + ATLAS&lt;/th&gt;&lt;th&gt;Speed-up&lt;/th&gt;&lt;th&gt;Revolution’s&lt;br&gt;claimed speed-up&lt;/th&gt;&#xD;
&lt;/tr&gt;&#xD;
&lt;/thead&gt;&#xD;
&lt;tbody&gt;&#xD;
&lt;tr&gt;&lt;td&gt;Matrix Multiply&lt;/td&gt;&lt;td&gt;360.96&lt;/td&gt;&lt;td&gt;9.30&lt;/td&gt;&lt;td&gt;37.8&lt;/td&gt;&lt;td&gt;41.0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;Cholesky Factorization&lt;/td&gt;&lt;td&gt;27.28&lt;/td&gt;&lt;td&gt;5.65&lt;/td&gt;&lt;td&gt;3.8&lt;/td&gt;&lt;td&gt;21.0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;Singular Value Decomposition&lt;/td&gt;&lt;td&gt;98.73&lt;/td&gt;&lt;td&gt;23.57&lt;/td&gt;&lt;td&gt;3.2&lt;/td&gt;&lt;td&gt;12.6&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;Principal Components Analysis&lt;/td&gt;&lt;td&gt;454.55&lt;/td&gt;&lt;td&gt;40.92&lt;/td&gt;&lt;td&gt;10.1&lt;/td&gt;&lt;td&gt;15.2&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;Linear Discriminant Analysis&lt;/td&gt;&lt;td&gt;271.44&lt;/td&gt;&lt;td&gt;79.61&lt;/td&gt;&lt;td&gt;2.4&lt;/td&gt;&lt;td&gt;4.4&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;/tbody&gt;&#xD;
&lt;/table&gt;&#xD;
&#xD;
&lt;p&gt;In all instances Revolution’s claimed speed-up is greater, though probably not significantly so for the Matrix Multiply test and hardly so for the Principal Components Analysis.  (Of course, I do not have a copy of Revolution Analytics’ product, so I can’t verify their claims or make a comparable test.)&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;Whether saving 48 seconds on a linear discriminant analysis is enough to justify buying the product is a decision I leave to you: you know what analysis you do.  For me, there are (many) orders of magnitudes to be gained by better algorithms and better variable selections so I am not too worried about factors of 2 or even 10.  For extra raw power, I run R on a cloud service like AWS which scales well for many problems and is easy to do with stock R while I guess there are some sort of license implications if you wanted to do the same with Revolution’s product.  (But I &lt;em&gt;like&lt;/em&gt; Revolution and am still trying to find an excuse to use their product.)&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;Your mileage may vary.&lt;/p&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=SNu7nI9K28g:jpcoNrrR4Rk:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=SNu7nI9K28g:jpcoNrrR4Rk:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=SNu7nI9K28g:jpcoNrrR4Rk:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=SNu7nI9K28g:jpcoNrrR4Rk:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=SNu7nI9K28g:jpcoNrrR4Rk:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=SNu7nI9K28g:jpcoNrrR4Rk:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=SNu7nI9K28g:jpcoNrrR4Rk:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=SNu7nI9K28g:jpcoNrrR4Rk:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=SNu7nI9K28g:jpcoNrrR4Rk:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/SNu7nI9K28g" height="1" width="1"/&gt;</content><published>2010-06-17T09:05:00Z</published><updated>2010-06-17T09:05:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Comparing-standard-R-with-Revoutions-for-performance.html</feedburner:origLink></entry><entry><title type="text">Faster R through better BLAS</title><id>urn:uuid:428f009b-a07d-59dc-a643-50cc9a2b86ca</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Faster-R-through-better-BLAS.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/txxBGby0z-I/Faster-R-through-better-BLAS.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>Can we make our analysis using the <a href="http://www.r-project.org/">R statistical computing and analysis platform</a> run faster?  Usually the answer is yes, and the best way is to improve your algorithm and variable selection.</p>
<p>But recently David Smith was <a href="http://blog.revolutionanalytics.com/2010/06/performance-benefits-of-multithreaded-r.html" title="Performance benefits of linking R to multithreaded math libraries">suggesting</a> that a big benefit of their (commercial) version of R was that it was linked to a to a better linear algebra library.  So I decided to investigate.</p>
<p>The quick summary is that it only really makes a difference for fairly artificial benchmark tests.  For “normal” work you are unlikely to see a difference most of the time.</p>
</div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;Can we make our analysis using the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt; run faster?  Usually the answer is yes, and the best way is to improve your algorithm and variable selection.&lt;/p&gt;&#xD;
&lt;p&gt;But recently David Smith was &lt;a href="http://blog.revolutionanalytics.com/2010/06/performance-benefits-of-multithreaded-r.html" title="Performance benefits of linking R to multithreaded math libraries"&gt;suggesting&lt;/a&gt; that a big benefit of their (commercial) version of R was that it was linked to a to a better linear algebra library.  So I decided to investigate.&lt;/p&gt;&#xD;
&lt;p&gt;The quick summary is that it only really makes a difference for fairly artificial benchmark tests.  For “normal” work you are unlikely to see a difference most of the time.&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;The environment&lt;/h2&gt;&#xD;
&lt;p&gt;I use R on a 64-bit &lt;a href="http://fedoraproject.org/"&gt;Fedora&lt;/a&gt; 12 Linux system.  Fortunately, it is very easy to rebuild R using different libraries on this platform.  For the following, I will assume that you have a working &lt;a href="http://www.rpm.org/max-rpm-snapshot/rpmbuild.8.html"&gt;rpmbuild&lt;/a&gt; environment.  The test system has a quad core Intel Xeon E5420 CPU with each core running at 2.50 GHz.&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;Benchmarks&lt;/h2&gt;&#xD;
&lt;p&gt;Benchmarking R is complex.  Very complex.  But for this simple test we use two tests from the &lt;a href="http://r.research.att.com/benchmarks/"&gt;R Benchmarks&lt;/a&gt; page: &lt;a href="http://r.research.att.com/benchmarks/MASS-ex.R"&gt;MASS-ex.R&lt;/a&gt; and &lt;a href="http://r.research.att.com/benchmarks/R-benchmark-25.R"&gt;R-benchmark-25.R&lt;/a&gt;.  The first is a simple benchmark using the examples from the MASS package, and has the advantage that it reflects real-world problems and real-world analysis, albeit small problems and short analysis.  The second is a much more artificial example and primarily test matrix operations.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;We run the MASS benchmark as:&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;/usr/bin/time -p R --vanilla CMD BATCH MASS-ex.R /dev/null&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;While the R-benchmark-25 is simply:&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;Rscript --vanilla R-benchmark-25.R&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;For the MASS benchmark we simply capture the real elapsed time while the R benchmark 2.5 provides more detailed output for the three classes of tests (matrix calculation, -functions, and program execution) as well as overall summaries.  They are all shown in the table below.&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;Compiler-optimized R&lt;/h2&gt;&#xD;
&lt;p&gt;For the experiments that follow the first thing to do is to grab copies of the source RPMs for R and for ATLAS:&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;cd ~/rpmbuild/SRPMS&#xD;
yumdownloader --source atlas R&#xD;
cd ..&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;At the time I did this, I got &lt;code&gt;R-2.11.0-1.fc12.src.rpm&lt;/code&gt; and &lt;code&gt;atlas-3.8.3-12.fc12.src.rpm&lt;/code&gt;.  I crank up the level of optimization that I do when building from source so the first thing is to edit &lt;code&gt;&lt;a href="http://static.cybaea.net/files/.rpmrc"&gt;~/.rpmrc&lt;/a&gt;&lt;/code&gt; to include the line &lt;code&gt;optflags: x86_64 -O3 -march=native -m64 -g&lt;/code&gt;.  With that in place we can simply do:&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="screen"&gt;rpmbuild --rebuild SRPMS/R-2.11.0-1.fc12.src.rpm  #  Change version numbers as needed&#xD;
su -c 'rpm -Uhv --force RPMS/x86_64/R*2.11.0-1*.rpm RPMS/x86_64/libRmath*2.11.0-1*.rpm'&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;We now have a compiler-optimized version of R and we can re-run our tests.  It doesn't make much difference, but that is also good to know.&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;ATLAS BLAS libraries&lt;/h2&gt;&#xD;
&lt;p&gt;Now let's try linking to the ATLAS BLAS libraries instead.  I assume you have them installed (&lt;code&gt;yum install atlas&lt;/code&gt; if not) so you can just grab a copy of &lt;a href="http://static.cybaea.net/files/R-atlas.diff"&gt;R-atlas.diff&lt;/a&gt; to change the spec file like this:&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="screen"&gt;rpm -ihv SRPMS/R-2.11.0-1.fc12.src.rpm   # Install to your rpmbuild environment&#xD;
cd SPECS&#xD;
wget &lt;a href="http://static.cybaea.net/files/R-atlas.diff"&gt;http://static.cybaea.net/files/R-atlas.diff&lt;/a&gt;&#xD;
patch -o R-atlas.spec R.spec R-atlas.diff&#xD;
cd ..&#xD;
rpmbuild -bb SPECS/R-atlas.spec&#xD;
su -c 'rpm -Uhv --force RPMS/x86_64/R*2.11.0-1*.rpm RPMS/x86_64/libRmath*2.11.0-1*.rpm'&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;You now have a version of R that uses the ATLAS BLAS libraries, so you can re-run the tests.  The results are in the table below in the “Optimized R + Standard ATLAS” row.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;As expected, the matrix operations from the &lt;code&gt;R-benchmark-25.R&lt;/code&gt; runs a lot faster: they complete in about 30-40% of the time, much of which comes from the multi-threading so all four CPU cores are used.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;However, for the analysis-heavy code is &lt;code&gt;MASS-ex.R&lt;/code&gt; there is little difference.  If anything, we see a tiny increase in running time.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
  &lt;em&gt;Multi-threaded BLAS libraries make no significant difference to real-world analysis problems using R.&lt;/em&gt;&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;Other BLAS libraries&lt;/h2&gt;&#xD;
&lt;p&gt;For good measure we also try an optimized version of ATLAS, but it does not make much difference on the x86_64 architecture:&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="screen"&gt;rpmbuild -D "enable_native_atlas 1" --rebuild SRPMS/atlas-3.8.3-12.fc12.src.rpm&#xD;
su -c 'rpm -Uhv --force RPMS/x86_64/atlas*3.8.3-12*.rpm'&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;And (only) for completeness, we also try the standard Netlib BLAS and LAPACK libraries (&lt;code&gt;yum install blas lapack&lt;/code&gt;) by the same method as the ATLAS library above but with a slightly different change to the SPEC file: &lt;code&gt;&lt;a href="http://static.cybaea.net/files/R-blas.diff"&gt;R-blas.diff&lt;/a&gt;&lt;/code&gt;.  It performs a little better than vanilla R.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;For more information about rebuilding R with different BLAS libraries, see the &lt;a href="http://cran.r-project.org/doc/manuals/R-admin.html#Linear-algebra"&gt;linear algebra section in the R Installation and Administration manual&lt;/a&gt;.&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;Benchmark results&lt;/h2&gt;&#xD;
&lt;table border="1" class="border"&gt;&#xD;
&lt;caption&gt;Benchmark results for various optimizations of R and the BLAS library&lt;/caption&gt;&#xD;
&lt;thead&gt;&#xD;
&lt;tr&gt;&#xD;
&lt;th rowspan="3"&gt;R version&lt;/th&gt;&#xD;
&lt;th colspan="2"&gt;&lt;a href="http://r.research.att.com/benchmarks/MASS-ex.R"&gt;MASS-ex.R&lt;/a&gt;&lt;/th&gt;&#xD;
&lt;th colspan="10"&gt;&lt;a href="http://r.research.att.com/benchmarks/R-benchmark-25.R"&gt;R benchmark 2.5&lt;/a&gt;&lt;/th&gt;&#xD;
&lt;/tr&gt;&#xD;
&lt;tr&gt;&#xD;
&lt;th colspan="2"&gt;Real&lt;/th&gt;&#xD;
&lt;th colspan="2"&gt;Total time&lt;/th&gt;&#xD;
&lt;th colspan="2"&gt;Overall mean&lt;/th&gt;&#xD;
&lt;th colspan="2"&gt;Ⅰ. Matrix calc.&lt;/th&gt;&#xD;
&lt;th colspan="2"&gt;Ⅱ. Matrix functions&lt;/th&gt;&#xD;
&lt;th colspan="2"&gt;Ⅲ. Program.&lt;/th&gt;&#xD;
&lt;/tr&gt;&#xD;
&lt;tr&gt;&#xD;
&lt;th&gt;seconds&lt;/th&gt;&lt;th&gt;index&lt;/th&gt;&lt;th&gt;seconds&lt;/th&gt;&lt;th&gt;index&lt;/th&gt;&lt;th&gt;seconds&lt;/th&gt;&lt;th&gt;index&lt;/th&gt;&#xD;
&lt;th&gt;seconds&lt;/th&gt;&lt;th&gt;index&lt;/th&gt;&lt;th&gt;seconds&lt;/th&gt;&lt;th&gt;index&lt;/th&gt;&lt;th&gt;seconds&lt;/th&gt;&lt;th&gt;index&lt;/th&gt;&#xD;
&lt;/tr&gt;&#xD;
&lt;/thead&gt;&#xD;
&lt;tbody&gt;&#xD;
&lt;tr&gt;&#xD;
&lt;td&gt;Base install&lt;/td&gt;&#xD;
&lt;td&gt;19.00&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;td&gt;78.49&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;td&gt;2.11&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;td&gt;2.32&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;td&gt;3.86&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;td&gt;1.05&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&#xD;
&lt;/tr&gt;&#xD;
&lt;tr&gt;&#xD;
&lt;td&gt;Optimized R&lt;/td&gt;&#xD;
&lt;td&gt;18.98&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;td&gt;76.11&lt;/td&gt;&lt;td&gt;0.97&lt;/td&gt;&lt;td&gt;2.02&lt;/td&gt;&lt;td&gt;0.96&lt;/td&gt;&lt;td&gt;2.36&lt;/td&gt;&lt;td&gt;1.02&lt;/td&gt;&lt;td&gt;3.46&lt;/td&gt;&lt;td&gt;0.90&lt;/td&gt;&lt;td&gt;1.02&lt;/td&gt;&lt;td&gt;0.97&lt;/td&gt;&#xD;
&lt;/tr&gt;&#xD;
&lt;tr&gt;&#xD;
&lt;td&gt;Optimized R + Netlib BLAS&lt;/td&gt;&#xD;
&lt;td&gt;18.56&lt;/td&gt;&lt;td&gt;0.98&lt;/td&gt;&lt;td&gt;73.22&lt;/td&gt;&lt;td&gt;0.93&lt;/td&gt;&lt;td&gt;1.81&lt;/td&gt;&lt;td&gt;0.86&lt;/td&gt;&lt;td&gt;2.36&lt;/td&gt;&lt;td&gt;1.02&lt;/td&gt;&lt;td&gt;2.41&lt;/td&gt;&lt;td&gt;0.62&lt;/td&gt;&lt;td&gt;1.04&lt;/td&gt;&lt;td&gt;0.99&lt;/td&gt;&#xD;
&lt;/tr&gt;&#xD;
&lt;tr&gt;&#xD;
&lt;td&gt;Optimized R + Standard ATLAS&lt;/td&gt;&#xD;
&lt;td&gt;19.43&lt;/td&gt;&lt;td&gt;1.02&lt;/td&gt;&lt;td&gt;16.74&lt;/td&gt;&lt;td&gt;0.21&lt;/td&gt;&lt;td&gt;0.97&lt;/td&gt;&lt;td&gt;0.46&lt;/td&gt;&lt;td&gt;0.90&lt;/td&gt;&lt;td&gt;0.39&lt;/td&gt;&lt;td&gt;1.04&lt;/td&gt;&lt;td&gt;0.27&lt;/td&gt;&lt;td&gt;0.99&lt;/td&gt;&lt;td&gt;0.95&lt;/td&gt;&#xD;
&lt;/tr&gt;&#xD;
&lt;tr&gt;&#xD;
&lt;td&gt;Optimized R + Optimized ATLAS&lt;/td&gt;&#xD;
&lt;td&gt;19.31&lt;/td&gt;&lt;td&gt;1.02&lt;/td&gt;&lt;td&gt;16.36&lt;/td&gt;&lt;td&gt;0.21&lt;/td&gt;&lt;td&gt;0.95&lt;/td&gt;&lt;td&gt;0.45&lt;/td&gt;&lt;td&gt;0.84&lt;/td&gt;&lt;td&gt;0.36&lt;/td&gt;&lt;td&gt;1.02&lt;/td&gt;&lt;td&gt;0.26&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;td&gt;0.95&lt;/td&gt;&#xD;
&lt;/tr&gt;&#xD;
&lt;/tbody&gt;&#xD;
&lt;/table&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=txxBGby0z-I:XtMdQd8RATE:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=txxBGby0z-I:XtMdQd8RATE:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=txxBGby0z-I:XtMdQd8RATE:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=txxBGby0z-I:XtMdQd8RATE:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=txxBGby0z-I:XtMdQd8RATE:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=txxBGby0z-I:XtMdQd8RATE:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=txxBGby0z-I:XtMdQd8RATE:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=txxBGby0z-I:XtMdQd8RATE:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=txxBGby0z-I:XtMdQd8RATE:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/txxBGby0z-I" height="1" width="1"/&gt;</content><published>2010-06-15T10:21:00Z</published><updated>2010-06-15T10:21:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Faster-R-through-better-BLAS.html</feedburner:origLink></entry><entry><title type="text">R: Eliminating observed values with zero variance</title><id>urn:uuid:5394cf3c-2009-5225-955d-1b6c90ae4445</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-Eliminating-observed-values-with-zero-variance.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/tQRUNQKxFng/R-Eliminating-observed-values-with-zero-variance.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
I needed a fast way of eliminating observed values with zero variance from large data sets using the <a href="http://www.r-project.org/">R statistical computing and analysis platform</a>.  In other words, I want to find the columns in a data frame that has zero variance.  And as fast as possible, because my data sets are large, many, and changing fast.  The final result surprised me a little.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
I needed a fast way of eliminating observed values with zero variance from large data sets using the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt;.  In other words, I want to find the columns in a data frame that has zero variance.  And as fast as possible, because my data sets are large, many, and changing fast.  The final result surprised me a little.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
I use the &lt;a href="http://www.kddcup-orange.com/data.php"&gt;KDD Cup 2009 data sets&lt;/a&gt; as my reference for this experiment.  (You will need to register to download the data.)  It is a realistic example of the type of customer data that I usually work with.  It has 50,000 observations of 15,000 variables.  To load it into R you'll need a reasonably beefy machine.  My workstation has 16GB of memory; if yours have less then use a sample of the data.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
We load the data into R and propose a few ways in which we may identify the columns we need:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;#!/usr/bin/Rscript&#xD;
## zero-var.R - find the fastest way of eliminating observations with zero variance&#xD;
## © 2010 Allan Engelhardt, http://www.cybaea.net&#xD;
&#xD;
## Read the data file.&#xD;
## We have already converted it to R format and saved it, so we can do&#xD;
load("train.RData")&#xD;
## instead of something like&#xD;
# train &amp;lt;- read.delim(file="../orange_large_train.data.bz2")&#xD;
&#xD;
## Some suggestions for zero variance functions:&#xD;
zv.1 &amp;lt;- function(x) {&#xD;
    ## The literal approach&#xD;
    y &amp;lt;- var(x, na.rm = TRUE)&#xD;
    return(is.na(y) || y == 0)&#xD;
}&#xD;
zv.2 &amp;lt;- function(x) {&#xD;
    ## As before, but avoiding direct comparison with zero&#xD;
    y &amp;lt;- var(x, na.rm = TRUE)&#xD;
    return(is.na(y) || y &amp;lt; .Machine$double.eps ^ 0.5)&#xD;
}&#xD;
zv.3 &amp;lt;- function(x) {&#xD;
    ## Maybe it is faster to check for equality than to compute?&#xD;
    y &amp;lt;- x[!is.na(x)]&#xD;
    return(all(y == y[1]))&#xD;
}&#xD;
zv.4 &amp;lt;- function(x) {&#xD;
    ## Taking out the special case may speed things up?&#xD;
    ## (At least for this data set where this case is common.)&#xD;
    z &amp;lt;- is.na(x)&#xD;
    if ( all(z) ) return(TRUE);&#xD;
    y &amp;lt;- x[!z]&#xD;
    return(all(y == y[1]))&#xD;
}&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Now we just have to load the very useful &lt;a href="http://cran.r-project.org/web/packages/rbenchmark/index.html"&gt;rbenchmark&lt;/a&gt; package and let the machine figure it out:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;library("rbenchmark")&#xD;
&#xD;
cat("Running benchmarks:\n")&#xD;
benchmark(&#xD;
          zv1 = { sapply(train, zv.1) },&#xD;
          zv2 = { sapply(train, zv.2) },&#xD;
          zv3 = { sapply(train, zv.3) },&#xD;
          zv4 = { sapply(train, zv.4) },&#xD;
          replications = 5,&#xD;
          columns = c("test", "elapsed", "relative", "sys.self"),&#xD;
          order = "elapsed"&#xD;
          )&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The answer (on my machine) is that it is faster to calculate than to check for equality:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;Running benchmarks:&#xD;
  test elapsed relative sys.self&#xD;
1  zv1  78.619 1.000000    6.395&#xD;
2  zv2  79.276 1.008357    6.586&#xD;
3  zv3 113.024 1.437617    1.735&#xD;
4  zv4 118.579 1.508274    1.716&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The two functions based on the core variance function are easily the fastest (despite having to do arithmetic) while taking out the special case in the equality functions is a Bad Idea.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Can you think of an even faster way to do it?&#xD;
&lt;/p&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tQRUNQKxFng:XLrBfXK16uw:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tQRUNQKxFng:XLrBfXK16uw:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=tQRUNQKxFng:XLrBfXK16uw:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tQRUNQKxFng:XLrBfXK16uw:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=tQRUNQKxFng:XLrBfXK16uw:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tQRUNQKxFng:XLrBfXK16uw:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tQRUNQKxFng:XLrBfXK16uw:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=tQRUNQKxFng:XLrBfXK16uw:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tQRUNQKxFng:XLrBfXK16uw:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/tQRUNQKxFng" height="1" width="1"/&gt;</content><published>2010-03-08T14:46:00Z</published><updated>2010-03-08T14:46:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-Eliminating-observed-values-with-zero-variance.html</feedburner:origLink></entry><entry><title type="text">Beautiful Data</title><id>urn:uuid:770cda82-5757-5a26-827a-2aeff8a8a098</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Beautiful-Data.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/nsxeLGyxKLM/Beautiful-Data.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><div class="floatRight">
  <a href="http://www.cybaea.net/Blogs/Data/Beautiful-Data.html" title="Click for full article">
    <img src="http://static.cybaea.net/images/beautiful-data-small.png" width="100" height="131" alt="[book cover]" />
  </a>
</div>
<p>
O'Reilly's recent publication <a href="http://oreilly.com/catalog/9780596157111/">Beautiful Data</a> has a chapter by <a href="http://jeffjonas.typepad.com/jeff_jonas/">Jeff Jonas</a> which is enough reason in itself for me to recommend it.  The chapter, <a href="http://jeffjonas.typepad.com/DataFindsDataFinal.pdf">Data Finds Data</a>, is also available as a PDF download.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
O'Reilly's recent publication &lt;a href="http://oreilly.com/catalog/9780596157111/"&gt;Beautiful Data&lt;/a&gt; has a chapter by &lt;a href="http://jeffjonas.typepad.com/jeff_jonas/"&gt;Jeff Jonas&lt;/a&gt; which is enough reason in itself for me to recommend it.  The chapter, &lt;a href="http://jeffjonas.typepad.com/DataFindsDataFinal.pdf"&gt;Data Finds Data&lt;/a&gt;, is also available as a PDF download.&#xD;
&lt;/p&gt;&#xD;
&lt;div class="floatRight"&gt;&#xD;
  &lt;a href="http://oreilly.com/catalog/9780596157111/" title="Click for book details"&gt;&#xD;
    &lt;img src="http://static.cybaea.net/images/beautiful-data-small.png" width="100" height="131" alt="[book cover]"&gt;&lt;/img&gt;&#xD;
  &lt;/a&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;p&gt;&#xD;
I met Jeff a couple of year ago at an ETech conference, and he is easily one of the smartest people I have ever met who is thinking about data.&#xD;
&lt;/p&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=nsxeLGyxKLM:Wgxpc7L0eVI:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=nsxeLGyxKLM:Wgxpc7L0eVI:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=nsxeLGyxKLM:Wgxpc7L0eVI:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=nsxeLGyxKLM:Wgxpc7L0eVI:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=nsxeLGyxKLM:Wgxpc7L0eVI:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=nsxeLGyxKLM:Wgxpc7L0eVI:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=nsxeLGyxKLM:Wgxpc7L0eVI:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=nsxeLGyxKLM:Wgxpc7L0eVI:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=nsxeLGyxKLM:Wgxpc7L0eVI:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/nsxeLGyxKLM" height="1" width="1"/&gt;</content><published>2009-07-27T19:38:00Z</published><updated>2009-07-27T19:38:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Beautiful-Data.html</feedburner:origLink></entry><entry><title type="text">Massively parallel database for analytics</title><id>urn:uuid:a8ba9e43-b837-551c-bd02-a1a7b4506c41</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Massively-parallel-database-for-analytics.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/Apso1Get0Yk/Massively-parallel-database-for-analytics.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
This is by far the best description of why traditional parallel databases (like Teradata, Greenplum et al.) is a evolutionary dead end.  But much more than a theoretical discussion, they have built a solution which they call HadoopDB.  It is based on Hadoop, PostgreSQL, and Hive and is completely Open Source.  Alternative, column-based, backends to PostgreSQL are being implemented now.  Read: <a href="http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html">Announcing release of HadoopDB</a>.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
This is by far the best description of why traditional parallel databases (like Teradata, Greenplum et al.) is a evolutionary dead end.  But much more than a theoretical discussion, they have built a solution which they call HadoopDB.  It is based on Hadoop, PostgreSQL, and Hive and is completely Open Source.  Alternative, column-based, backends to PostgreSQL are being implemented now.  Read: &lt;a href="http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html"&gt;Announcing release of HadoopDB&lt;/a&gt;.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;See also:&lt;/p&gt;&#xD;
&lt;ul&gt;&#xD;
&lt;li&gt;&lt;a href="http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-shorter.html"&gt;Short version: key bullet points&lt;/a&gt;&lt;/li&gt;&#xD;
&lt;li&gt;&lt;a href="http://db.cs.yale.edu/hadoopdb/hadoopdb.pdf"&gt;Long version (12 pages, PDF)&lt;/a&gt;&lt;/li&gt;&#xD;
&lt;li&gt;&lt;a href="http://tech.slashdot.org/story/09/07/21/1747241/Researchers-Create-Database-Hadoop-Hybrid?from=rss"&gt;Slashdot discussion&lt;/a&gt;&lt;/li&gt;&#xD;
&lt;li&gt;&lt;a href="http://www.stats.bris.ac.uk/R/web/packages/HadoopStreaming/index.html"&gt;R package HadoopStreaming&lt;/a&gt;&lt;/li&gt;&#xD;
&lt;/ul&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Apso1Get0Yk:udpBUFdH-J4:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Apso1Get0Yk:udpBUFdH-J4:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=Apso1Get0Yk:udpBUFdH-J4:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Apso1Get0Yk:udpBUFdH-J4:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=Apso1Get0Yk:udpBUFdH-J4:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Apso1Get0Yk:udpBUFdH-J4:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Apso1Get0Yk:udpBUFdH-J4:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=Apso1Get0Yk:udpBUFdH-J4:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Apso1Get0Yk:udpBUFdH-J4:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/Apso1Get0Yk" height="1" width="1"/&gt;</content><published>2009-07-22T13:37:00Z</published><updated>2009-07-22T13:37:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Massively-parallel-database-for-analytics.html</feedburner:origLink></entry><entry><title type="text">The Knapsack Problem</title><id>urn:uuid:6efce6d8-6489-5275-aa88-1ddce86d4e65</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/The-Knapsack-Problem.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/JCEN5oEfIRM/The-Knapsack-Problem.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
<a href="http://blog.revolution-computing.com/2009/07/because-its-friday-the-knapsack-problem.html">David posts a question</a> about how to solve <a href="http://xkcd.com/287/">this</a> <a href="http://en.wikipedia.org/wiki/Knapsack_problem">knapsack problem </a> using the <a href="http://www.r-project.org/">R statistical computing and analysis platform</a>.  My reply in the comments seems to have disappeared for a while so here is my proposed solution:
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;div class="floatCenter" style="width: 640px;"&gt;&#xD;
  &lt;a href="http://xkcd.com/287/"&gt;&#xD;
    &lt;img src="http://imgs.xkcd.com/comics/np_complete.png" width="640" height="414" alt="[Cartoon from XKCD]" title="NP-Complete"&gt;&lt;/img&gt;&#xD;
  &lt;/a&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;p&gt;&#xD;
&lt;a href="http://blog.revolution-computing.com/2009/07/because-its-friday-the-knapsack-problem.html"&gt;David posts a question&lt;/a&gt; about how to solve this &lt;a href="http://en.wikipedia.org/wiki/Knapsack_problem"&gt;knapsack problem &lt;/a&gt; using the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt;.  My reply in the comments seems to have disappeared for a while so here is my proposed solution.  See David’s blog for my earlier proposed solution with a very common error.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&#xD;
## http://blog.revolution-computing.com/2009/07/because-its-friday-the-knapsack-problem.html&#xD;
appetizer.solution &amp;lt;- local (&#xD;
function (target) {&#xD;
  app &amp;lt;- c(2.15, 2.75, 3.35, 3.55, 4.20, 5.80)&#xD;
  r &amp;lt;- 2L&#xD;
  repeat {&#xD;
	c &amp;lt;- gtools::combinations(length(app), r=r, v=app, repeats.allowed=TRUE)&#xD;
	s &amp;lt;- rowSums(c)&#xD;
	if ( all(s &amp;gt; target) ) {&#xD;
	  print("No solution found")&#xD;
	  break&#xD;
	}&#xD;
	x &amp;lt;- which( abs(s-target) &amp;lt; 1e-4 )&#xD;
	if ( length(x) &amp;gt; 0L ) {&#xD;
	  cat("Solution found: ", c[x,], "\n")&#xD;
	  break&#xD;
	}&#xD;
	r &amp;lt;- r + 1L&#xD;
  }&#xD;
})&#xD;
&#xD;
appetizer.solution(15.05)&#xD;
# Solution found:  2.15 3.55 3.55 5.8 &#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Brute force works, it just doesn’t scale well.  (Note that 7×2.15 is another solution.)&#xD;
&lt;/p&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=JCEN5oEfIRM:esvQ6McvdMU:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=JCEN5oEfIRM:esvQ6McvdMU:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=JCEN5oEfIRM:esvQ6McvdMU:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=JCEN5oEfIRM:esvQ6McvdMU:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=JCEN5oEfIRM:esvQ6McvdMU:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=JCEN5oEfIRM:esvQ6McvdMU:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=JCEN5oEfIRM:esvQ6McvdMU:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=JCEN5oEfIRM:esvQ6McvdMU:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=JCEN5oEfIRM:esvQ6McvdMU:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/JCEN5oEfIRM" height="1" width="1"/&gt;</content><published>2009-07-10T20:30:00Z</published><updated>2009-07-10T20:30:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/The-Knapsack-Problem.html</feedburner:origLink></entry><entry><title type="text">OECD Statistics</title><id>urn:uuid:43e585f9-9c60-505d-b349-b65d1a20c969</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/OECD-Statistics.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/5fn__mpTK8o/OECD-Statistics.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
I am a sucker for good quality data.  I <a href="http://www.cybaea.net/Blogs/Data/Data_gov.html">wrote about data.gov</a>, the US Government data site before, and now I find <a href="http://stats.oecd.org/">OECD Statistics</a> which has some 300 data sets, many of which seems to be readily accessible (though some may require subscription)
</p>
</div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
I am a sucker for good quality data.  I &lt;a href="http://www.cybaea.net/Blogs/Data/Data_gov.html"&gt;wrote about data.gov&lt;/a&gt;, the US Government data site before, and now I find &lt;a href="http://stats.oecd.org/"&gt;OECD Statistics&lt;/a&gt; which has some 300 data sets, many of which seems to be readily accessible (though some may require subscription)&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Exports in multiple formats, including Excel, CSV, and &lt;a href="http://sdmx.org/"&gt;SDMX&lt;/a&gt;.&#xD;
&lt;/p&gt;&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=5fn__mpTK8o:hTjwnwI_7oI:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=5fn__mpTK8o:hTjwnwI_7oI:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=5fn__mpTK8o:hTjwnwI_7oI:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=5fn__mpTK8o:hTjwnwI_7oI:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=5fn__mpTK8o:hTjwnwI_7oI:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=5fn__mpTK8o:hTjwnwI_7oI:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=5fn__mpTK8o:hTjwnwI_7oI:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=5fn__mpTK8o:hTjwnwI_7oI:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=5fn__mpTK8o:hTjwnwI_7oI:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/5fn__mpTK8o" height="1" width="1"/&gt;</content><published>2009-07-02T20:33:00Z</published><updated>2009-07-02T20:33:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/OECD-Statistics.html</feedburner:origLink></entry><entry><title type="text">R tips: Determine if function is called from specific package</title><id>urn:uuid:08988ce4-0f96-564e-9575-7c6f2ff16147</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-tips-Determine-if-function-is-called-from-specific-package.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/mqFcMzo8FLQ/R-tips-Determine-if-function-is-called-from-specific-package.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
I like the "multicore" library for a particular task.  I can easily write a combination of<code> if(require("multicore",...))</code> that means that my function will automatically use the parallel <code>mclapply()</code> instead of <code>lapply()</code> where it is available.  Which is grand 99% of the time, except when my function is called from <code>mclapply()</code> (or one of the lower level functions) in which case much CPU trashing and grinding of teeth will result.
</p>
<p>
So, I needed a function to determine if my function was called from any function in the "multicore" library.  Here it is.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
I like the "multicore" library for a particular task.  I can easily write a combination of&lt;code&gt; if(require("multicore",...))&lt;/code&gt; that means that my function will automatically use the parallel &lt;code&gt;mclapply()&lt;/code&gt; instead of &lt;code&gt;lapply()&lt;/code&gt; where it is available.  Which is grand 99% of the time, except when my function is called from &lt;code&gt;mclapply()&lt;/code&gt; (or one of the lower level functions) in which case much CPU trashing and grinding of teeth will result.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
So, I needed a function to determine if my function was called from any function in the "multicore" library.  Here it is.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
First define a generally useful function:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="code" title="is.in.namespace()"&gt;&#xD;
is.in.namespace &amp;lt;-&#xD;
function (ns) {&#xD;
  for ( frame in seq(1, sys.nframe(), 1) ) {&#xD;
	fun &amp;lt;- sys.function(frame);&#xD;
	env &amp;lt;- environment(fun)&#xD;
	n   &amp;lt;- environmentName(env)&#xD;
	if ( n == ns ) return(TRUE);&#xD;
  }&#xD;
  return(FALSE);&#xD;
}&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Then we use it for our purpose:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&#xD;
is.in.multicore &amp;lt;- function (...) { return(is.in.namespace("multicore")) }&#xD;
library("multicore")&#xD;
stopifnot( mclapply(as.list(1), is.in.multicore)[[1]] == TRUE )&#xD;
stopifnot(   lapply(as.list(1), is.in.multicore)[[1]] == FALSE )&#xD;
stopifnot( local( {mclapply &amp;lt;- function(x) return(x); mclapply(is.in.multicore())} ) == FALSE )&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Easy when you know how.&#xD;
&lt;/p&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mqFcMzo8FLQ:I9l-WH946Nc:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mqFcMzo8FLQ:I9l-WH946Nc:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=mqFcMzo8FLQ:I9l-WH946Nc:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mqFcMzo8FLQ:I9l-WH946Nc:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=mqFcMzo8FLQ:I9l-WH946Nc:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mqFcMzo8FLQ:I9l-WH946Nc:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mqFcMzo8FLQ:I9l-WH946Nc:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=mqFcMzo8FLQ:I9l-WH946Nc:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mqFcMzo8FLQ:I9l-WH946Nc:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/mqFcMzo8FLQ" height="1" width="1"/&gt;</content><published>2009-06-16T10:27:00Z</published><updated>2009-06-16T10:27:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-tips-Determine-if-function-is-called-from-specific-package.html</feedburner:origLink></entry><entry><title type="text">R tips: Installing Rmpi on Fedora Linux</title><id>urn:uuid:57259815-f049-5226-bda6-95b15ae0f4f2</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-tips-Installing-Rmpi-on-Fedora-Linux.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/z6AZUNX1s3Y/R-tips-Installing-Rmpi-on-Fedora-Linux.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>Somebody on the R-help mailing list asked how to get <a href="http://cran.r-project.org/web/packages/Rmpi/index.html">Rmpi</a> working on his Fedora Linux machine so he could do high-performance computing on a cluster of machines (or a single multicore machine) using the <a href="http://www.r-project.org/">R statistical computing and analysis platform</a>.  Since it is unusually painful to get working, I might as well copy the instructions here.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;Somebody on the R-help mailing list asked how to get &lt;a href="http://cran.r-project.org/web/packages/Rmpi/index.html"&gt;Rmpi&lt;/a&gt; working on his Fedora Linux machine so he could do high-performance computing on a cluster of machines (or a single multicore machine) using the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt;.  Since it is unusually painful to get working, I might as well copy the instructions here.&#xD;
&lt;/p&gt;&#xD;
&lt;h2&gt;1. Install Open MPI on Fedora Core&lt;/h2&gt;&#xD;
&lt;p&gt;&#xD;
First install the &lt;a href="http://www.open-mpi.org/"&gt;openmpi&lt;/a&gt; libraries using:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;yum install openmpi openmpi-devel openmpi-libs&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The default installation on Fedora still doesn’t &lt;i&gt;quite&lt;/i&gt; work, so you need to execute the following command as root (only once is required, after installation of the package):&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;ldconfig /usr/lib64/openmpi/lib/&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
You are not quite done: for R to work right with the libraries, you need to modify the &lt;code&gt;LD_LIBRARY_PATH&lt;/code&gt; environment variable to include the path to the Open MPI libraries.  I have the following in my &lt;code&gt;~/.bash_profile&lt;/code&gt;:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title=".bash_profile"&gt;export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}${LD_LIBRARY_PATH:+:}/usr/lib64/openmpi/lib/"&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Edit your file to contain the same, and execute that line at the command prompt and you are ready to continue.&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;2. Install the &lt;code&gt;Rmpi&lt;/code&gt; package for &lt;code&gt;R&lt;/code&gt;&lt;/h2&gt;&#xD;
&lt;p&gt;&#xD;
Now that your Open MPI libraries are set up, and what you do next depends on what version of &lt;code&gt;Rmpi&lt;/code&gt; you are installing.  Most likely you are installing the latest version in which case the following section applies.  The instructions for older versions are retained in a later section for reference.&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;h3&gt;2.1. Current versions of the &lt;code&gt;Rmpi&lt;/code&gt; package&lt;/h3&gt;&#xD;
&lt;p&gt;&#xD;
Make sure you have executed the &lt;code&gt;ldconfig&lt;/code&gt; command and set the &lt;code&gt;LD_LIBRARY_PATH&lt;/code&gt; environment variables as described in the previous section before you continue.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Since at least version 0.5-8 of the &lt;code&gt;Rmpi&lt;/code&gt; library you can install it from the &lt;code&gt;R&lt;/code&gt; command line after you have fixed the Open MPI install.  At the &lt;code&gt;R&lt;/code&gt; prompt do:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;install.packages("Rmpi",&#xD;
                 configure.args =&#xD;
                 c("--with-Rmpi-include=/usr/include/openmpi-x86_64/",&#xD;
                   "--with-Rmpi-libpath=/usr/lib64/openmpi/lib/",&#xD;
                   "--with-Rmpi-type=OPENMPI"))&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
It should work and install OK.  This is obviously quite a mouthful to remember, but help is at hand through the &lt;code&gt;options()&lt;/code&gt; mechanism in R.  In your &lt;code&gt;~/.Rprofile&lt;/code&gt; you can add something like:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title=".Rprofile"&gt;local({&#xD;
    my.configure.args &amp;lt;-&#xD;
        list("Rmpi" =&#xD;
             c("--with-Rmpi-include=/usr/include/openmpi-x86_64/",&#xD;
               "--with-Rmpi-libpath=/usr/lib64/openmpi/lib/",&#xD;
               "--with-Rmpi-type=OPENMPI"),&#xD;
             ## Not needed for Rmpi but shown to illustrate the format&#xD;
             "ncdf" =&#xD;
             c("-with-netcdf_incdir=/usr/include/netcdf",&#xD;
               "-with-netcdf_libdir=/usr/lib64/")&#xD;
             );&#xD;
    options("configure.args" = my.configure.args)&#xD;
})&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;Then you can just type &lt;code&gt;install.packages("Rmpi")&lt;/code&gt; at the R command prompt to install the package.&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;h3&gt;2.2. Older versions of the &lt;code&gt;Rmpi&lt;/code&gt; package&lt;/h3&gt;&#xD;
&lt;p&gt;&#xD;
The problem is the configuration file &lt;code&gt;configure.ac&lt;/code&gt; which is, unfortunately, completely brain-damaged with hard-coded assumptions about which subdirectories should contain header and library files and no way of overriding it.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Download the latest &lt;a href="http://cran.r-project.org/web/packages/Rmpi/index.html"&gt;Rmpi&lt;/a&gt; package from CRAN and unpack it using &lt;code&gt;tar zxvf Rmpi_0.5-7.tar.gz&lt;/code&gt;.  Go to the new &lt;code&gt;Rmpi&lt;/code&gt; directory and replace the file &lt;code&gt;configure.ac&lt;/code&gt; with the one below (for a x86_64 system; for 32 bit you probably need to change &lt;code&gt;-64&lt;/code&gt; to &lt;code&gt;-32&lt;/code&gt;):&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="configure.ac"&gt; Process this file with autoconf to produce a configure script.&#xD;
&#xD;
AC_INIT(DESCRIPTION)&#xD;
&#xD;
AC_PROG_CC&#xD;
&#xD;
MPI_LIBS=`pkg-config --libs openmpi-1.3.1-gcc-64`&#xD;
MPI_INCLUDE=`pkg-config --cflags openmpi-1.3.1-gcc-64`&#xD;
MPITYPE="OPENMPI"&#xD;
MPI_DEPS="-DMPI2"&#xD;
&#xD;
AC_CHECK_LIB(util, openpty, [ MPI_LIBS="$MPI_LIBS -lutil" ])&#xD;
AC_CHECK_LIB(pthread, main, [ MPI_LIBS="$MPI_LIBS -lpthread" ])&#xD;
&#xD;
PKG_LIBS="${MPI_LIBS} -fPIC"&#xD;
PKG_CPPFLAGS="${MPI_INCLUDE} ${MPI_DEPS} -D${MPITYPE} -fPIC"&#xD;
&#xD;
AC_SUBST(PKG_LIBS)&#xD;
AC_SUBST(PKG_CPPFLAGS)&#xD;
AC_SUBST(DEFS)&#xD;
&#xD;
AC_OUTPUT(src/Makevars) &#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The number 1.3.1 may change in future releases of Fedora: see &lt;code&gt;/usr/lib64/pkgconfig/openmpi-*.pc&lt;/code&gt; for the current value.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Still in the &lt;code&gt;Rmpi&lt;/code&gt; directory do the following in your shell:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;autoconf&#xD;
cd ..&#xD;
tar zcvf Rmpi_0.5-7-F11.tar.gz Rmpi&#xD;
R CMD INSTALL Rmpi_0.5-7-F11.tar.gz &#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;h2&gt;3. Test it&lt;/h2&gt;&#xD;
&#xD;
&lt;p&gt;Now &lt;code&gt;Rmpi&lt;/code&gt; should be working in R:&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; library("Rmpi")&#xD;
&amp;gt; mpi.spawn.Rslaves(nslaves=2)&#xD;
    2 slaves are spawned successfully. 0 failed.&#xD;
master (rank 0, comm 1) of size 3 is running on: server&#xD;
slave1 (rank 1, comm 1) of size 3 is running on: server&#xD;
slave2 (rank 2, comm 1) of size 3 is running on: server&#xD;
&amp;gt; x &amp;lt;- c(10,20)&#xD;
&amp;gt; mpi.apply(x,runif)&#xD;
[[1]]&#xD;
 [1] 0.25142616 0.93505554 0.03162852 0.71783194 0.35916139 0.85082154&#xD;
 [7] 0.35404191 0.14221315 0.60063773 0.71805190&#xD;
&#xD;
[[2]]&#xD;
 [1] 0.84157864 0.63481773 0.38217188 0.67839089 0.27827728 0.35429266&#xD;
 [7] 0.04898744 0.96601584 0.25687905 0.77381186 0.69011927 0.37391028&#xD;
[13] 0.19017369 0.51196594 0.51970563 0.15791524 0.21358237 0.69642478&#xD;
[19] 0.12690207 0.44177656&#xD;
&lt;/pre&gt;&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=z6AZUNX1s3Y:pAonRiKpcZg:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=z6AZUNX1s3Y:pAonRiKpcZg:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=z6AZUNX1s3Y:pAonRiKpcZg:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=z6AZUNX1s3Y:pAonRiKpcZg:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=z6AZUNX1s3Y:pAonRiKpcZg:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=z6AZUNX1s3Y:pAonRiKpcZg:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=z6AZUNX1s3Y:pAonRiKpcZg:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=z6AZUNX1s3Y:pAonRiKpcZg:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=z6AZUNX1s3Y:pAonRiKpcZg:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/z6AZUNX1s3Y" height="1" width="1"/&gt;</content><published>2009-06-12T10:23:00Z</published><updated>2009-06-12T10:23:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-tips-Installing-Rmpi-on-Fedora-Linux.html</feedburner:origLink></entry><entry><title type="text">Data Mashups in R from O'Reilly</title><id>urn:uuid:edb63dc9-21f9-5664-8b35-afb01d7d6472</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Data-Mashups-in-R-from-O_Reilly.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/QV_4TAhfmFU/Data-Mashups-in-R-from-O_Reilly.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><div class="floatRight">
<a href="http://www.cybaea.net/Blogs/Data/Data-Mashups-in-R-from-O_Reilly.html" title="Click for full article"><img src="http://static.cybaea.net/images/fc_heat_small.png" width="150" height="150" alt="[Philadelphia County July 2009 Foreclosure Heat Map]" /></a>
</div>
<p>
O’Reilly has published <a href="http://oreilly.com/catalog/9780596804770/" title="Data Mashups in R ">Data Mashups in R</a> as a $4.99 PDF download in their Short Cut series.  In 27 pages it takes you through an example of how to combine foreclosure information with maps and geographical information to produce plots like the one here.  This is all done with the <a href="http://www.r-project.org/">R statistical computing and analysis platform</a>.
</p>
</div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
O’Reilly has published &lt;a href="http://oreilly.com/catalog/9780596804770/" title="Data Mashups in R "&gt;Data Mashups in R&lt;/a&gt; as a $4.99 PDF download in their Short Cut series.  In 27 pages it takes you through an example of how to combine foreclosure information with maps and geographical information to produce plots like the one below.  This is all done with the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt;.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
&lt;a href="http://static.cybaea.net/images/fc_heat.png" title="Larger version of Philadelphia County July 2009 Foreclosure Heat Map"&gt;&lt;img src="http://static.cybaea.net/images/fc_heat_medium.png" width="400" height="400" alt="[Philadelphia County July 2009 Foreclosure Heat Map]"&gt;&lt;/img&gt;&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
They show how to:&#xD;
&lt;/p&gt;&#xD;
&lt;ul&gt;&#xD;
&lt;li&gt;Use regular expressions to parse HTML files&lt;/li&gt;&#xD;
&lt;li&gt;Use the &lt;a href="http://cran.r-project.org/web/packages/XML/index.html"&gt;XML&lt;/a&gt; package to parse XML data from a web service (&lt;a href="http://developer.yahoo.com/maps/rest/V1/geocode.html"&gt;Yahoo! Geocode&lt;/a&gt;)&lt;/li&gt;&#xD;
&lt;li&gt;Find ERSI shape files for your maps&lt;/li&gt;&#xD;
&lt;li&gt;Use &lt;a href="http://cran.r-project.org/web/packages/PBSmapping/index.html"&gt;PBSmapping&lt;/a&gt; to process and display geographical data (GIS)&lt;/li&gt;&#xD;
&lt;li&gt;Importing and using US Census data with your maps&lt;/li&gt;&#xD;
&lt;/ul&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=QV_4TAhfmFU:sCLEIdyPB64:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=QV_4TAhfmFU:sCLEIdyPB64:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=QV_4TAhfmFU:sCLEIdyPB64:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=QV_4TAhfmFU:sCLEIdyPB64:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=QV_4TAhfmFU:sCLEIdyPB64:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=QV_4TAhfmFU:sCLEIdyPB64:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=QV_4TAhfmFU:sCLEIdyPB64:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=QV_4TAhfmFU:sCLEIdyPB64:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=QV_4TAhfmFU:sCLEIdyPB64:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/QV_4TAhfmFU" height="1" width="1"/&gt;</content><published>2009-06-09T11:23:00Z</published><updated>2009-06-09T11:23:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Data-Mashups-in-R-from-O_Reilly.html</feedburner:origLink></entry><entry><title type="text">How to win the KDD Cup Challenge with R and gbm</title><id>urn:uuid:3fb3545e-ea30-5e3b-8f8b-8902f107b81d</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/How-to-win-the-KDD-Cup-Challenge-with-R-and-gbm.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/iWBVzSGe3Aw/How-to-win-the-KDD-Cup-Challenge-with-R-and-gbm.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
Hugh Miller, the team leader of the winner of the KDD Cup 2009 Slow Challenge (which we wrote about <a href="http://www.cybaea.net/Blogs/Data/R-used-by-KDD-2009-cup-winner-of-slow-challenge.html">recently</a>) kindly provides more information about how to win this public challenge using the <a href="http://www.r-project.org/">R statistical computing and analysis platform</a> on a laptop (!).
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
Hugh Miller, the team leader of the winner of the KDD Cup 2009 Slow Challenge (which we wrote about &lt;a href="http://www.cybaea.net/Blogs/Data/R-used-by-KDD-2009-cup-winner-of-slow-challenge.html"&gt;recently&lt;/a&gt;) kindly provides more information about how to win this public challenge using the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt; on a laptop (!).&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
As a reminder of &lt;a href="http://www.cybaea.net/Blogs/Journal/KDD-Cup-2009.html"&gt;what we wrote before&lt;/a&gt;, the challenge provided two anonymized data set each of 50,000 mobile teleco customers and each entry having 15,000 variables.  The task was to find the best churn, up-, and cross-sell models.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Hugh summarizes his team’s approach:&#xD;
&lt;/p&gt;&#xD;
&lt;blockquote&gt;&#xD;
&lt;p&gt;&#xD;
Feature selection was an important first step [we &lt;a href="http://www.cybaea.net/Blogs/Data/R-used-by-KDD-2009-cup-winner-of-slow-challenge.html"&gt;mentioned before&lt;/a&gt; that this is key for all successful data mining projects – AE]. We looked at how effective each individual variable was as a predictor, which also allowed us to reading parts of the data only, &lt;em&gt;as the whole dataset didn’t fit in memory&lt;/em&gt; [my emphasis – AE]. The assessment here was homebrew, making a simple predictor on half the data and measuring performance (by the AUC measure) on the other half:&#xD;
&lt;/p&gt;&#xD;
&lt;ul&gt;&#xD;
&lt;li&gt;For categorical variables we just took the average number of 1's in the response for each category and used this as a predictor&lt;/li&gt;&#xD;
&lt;li&gt;For continuous variables we split the variable up into "bins", as you would a histogram, and again took the average number of 1's in the response for each bin as the predictor.&lt;/li&gt;&#xD;
&lt;/ul&gt;&#xD;
&lt;p&gt;&#xD;
From this we came up with a set of about 200 variables for each model, which we continued to tinker with. The main model was a gradient boosted machine which used the "&lt;a href="http://www.stats.bris.ac.uk/R/web/packages/gbm/index.html"&gt;gbm&lt;/a&gt;" package in &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;. This basically fits a series of small decision trees, up-weighting the observations that are predicted poorly at each iteration. We used Bernoulli loss and also up-weighted the "1" response class. A fair amount of time was spent optimising the number of trees, how big they should be etc, but a fit of 5,000 trees only took a bit over an hour to fit. The package itself is quite powerful as it gives some useful diagnostics such as relative variable importance, allowing us to exclude some and include others.&#xD;
&lt;/p&gt;&lt;p&gt;&#xD;
We used trees to avoid doing much data cleaning – they automatically allow for extreme results, non-linearity, missing values and handle both categorical and continuous variables. The main adjustment we had to make was to aggregate the smaller categories in the categorical variables, as they tended to distort the fits.&#xD;
&lt;/p&gt;&#xD;
&lt;/blockquote&gt;&#xD;
&lt;p&gt;&#xD;
They did this on standard Windows laptops (Intel Core 2 Duo 2.66GHz processor, 2GB RAM, 120Gb hard drive) against a competition that had more computing clusters available than Imelda Marcos had shoes.  It is not what you’ve got, it’s how you use it &lt;tt&gt;:-)&lt;/tt&gt;.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Congratulations to Hugh and his team!&#xD;
&lt;/p&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=iWBVzSGe3Aw:qxEUGcIYUEk:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=iWBVzSGe3Aw:qxEUGcIYUEk:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=iWBVzSGe3Aw:qxEUGcIYUEk:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=iWBVzSGe3Aw:qxEUGcIYUEk:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=iWBVzSGe3Aw:qxEUGcIYUEk:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=iWBVzSGe3Aw:qxEUGcIYUEk:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=iWBVzSGe3Aw:qxEUGcIYUEk:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=iWBVzSGe3Aw:qxEUGcIYUEk:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=iWBVzSGe3Aw:qxEUGcIYUEk:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/iWBVzSGe3Aw" height="1" width="1"/&gt;</content><published>2009-06-01T07:07:00Z</published><updated>2009-06-01T07:07:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/How-to-win-the-KDD-Cup-Challenge-with-R-and-gbm.html</feedburner:origLink></entry><entry><title type="text">R used by KDD 2009 cup winner of slow challenge</title><id>urn:uuid:23be031b-ddb6-5244-ab24-77042c61951c</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-used-by-KDD-2009-cup-winner-of-slow-challenge.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/OqKxuXq79pQ/R-used-by-KDD-2009-cup-winner-of-slow-challenge.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
The results from the <a href="http://www.cybaea.net/Blogs/Journal/KDD-Cup-2009.html">KDD Cup 2009 challenge</a> (which we wrote about before) are in, and the winner of the slow challenge used the <a href="http://www.r-project.org">R statistical computing and analysis platform</a> for their winning submission.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
The results from the &lt;a href="http://www.cybaea.net/Blogs/Journal/KDD-Cup-2009.html"&gt;KDD Cup 2009 challenge&lt;/a&gt; (which we wrote about before) are in, and the winner of the slow challenge used the &lt;a href="http://www.r-project.org"&gt;R statistical computing and analysis platform&lt;/a&gt; for their winning submission.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
The &lt;a href="http://www.kddcup-orange.com/factsheet.php?id=21"&gt;write up&lt;/a&gt; (username/password may be required) from &lt;a href="http://www.ms.unimelb.edu.au/Personnel/profile.php?PC_id=590"&gt;Hugh Miller&lt;/a&gt; and team at the University of Melbourne includes these points:&#xD;
&lt;/p&gt;&#xD;
&lt;ul&gt;&#xD;
&lt;li&gt;Decision tree, stub, or Random Forest as base classifiers with Logistic loss or cross-entropy loss function&lt;/li&gt;&#xD;
&lt;li&gt;Models fit in an hour or so&lt;/li&gt;&#xD;
&lt;li&gt;Used the &lt;a href="http://www.r-project.org"&gt;R statistical package&lt;/a&gt;&lt;/li&gt;&#xD;
&lt;li&gt;Most of models run on Windows laptop with Intel Core 2 Duo 2.66GHz processor, 2GB RAM, 120Gb hard drive.&lt;/li&gt;&#xD;
&lt;/ul&gt;&#xD;
&lt;p&gt;&#xD;
Impressive hardware selection!  Well done R.  Weka was another popular tool among the top entrants.  Key for all of them were clever data preparation and variable substitution.  The fast track winners from IBM document this in some detail:&#xD;
&lt;/p&gt;&#xD;
&lt;blockquote&gt;&#xD;
&lt;p&gt;&#xD;
We normalized the numerical variables by range, keeping the sparsity. For the categorical variables, we coded them using at most 11 binary columns for each variable. For each categorical variable, we generated a binary feature for each of the ten most common values, encoding whether the instance had this value or not. The eleventh column encoded whether the instance had a value that was not among the top ten most common values. We removed constant attributes, as well as duplicate attributes.&#xD;
&lt;/p&gt;&lt;p&gt;&#xD;
We replaced the missing values by mean for numerical attributes, and coded them as a separate value for discrete attributes. We also added a separate column for each numeric attribute with missing values, indicating wether the value was missing or not. We also tried another approach for imputing missing values based on KNN.&#xD;
&lt;/p&gt;&lt;p&gt;&#xD;
On the large data set we discretized the 100 numerical variables that had the highest mutual information with the target into 10 bins, and added them as extra features.&#xD;
&lt;/p&gt;&lt;p&gt;&#xD;
We tried PCA on the large data set, but it did not seem to help.&#xD;
&lt;/p&gt;&lt;p&gt;&#xD;
Because we noticed that some of the most predictive attributes were not linearly correlated with the targets, we build shallow decision trees (2-4 levels deep) using single numerical attributes and used their predictions as extra features. We also build shallow decision trees using two features at a time and used their prediction as an extra feature in the hope of capturing some non-additive interactions among features.&#xD;
&lt;/p&gt;&#xD;
&lt;/blockquote&gt;&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=OqKxuXq79pQ:MYgOwyfum9M:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=OqKxuXq79pQ:MYgOwyfum9M:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=OqKxuXq79pQ:MYgOwyfum9M:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=OqKxuXq79pQ:MYgOwyfum9M:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=OqKxuXq79pQ:MYgOwyfum9M:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=OqKxuXq79pQ:MYgOwyfum9M:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=OqKxuXq79pQ:MYgOwyfum9M:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=OqKxuXq79pQ:MYgOwyfum9M:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=OqKxuXq79pQ:MYgOwyfum9M:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/OqKxuXq79pQ" height="1" width="1"/&gt;</content><published>2009-05-31T13:17:00Z</published><updated>2009-05-31T13:17:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-used-by-KDD-2009-cup-winner-of-slow-challenge.html</feedburner:origLink></entry><entry><title type="text">R tips: Use read.table instead of strsplit to split a text column into multiple columns</title><id>urn:uuid:60775fac-6d0b-5d55-9e76-eb21bdde97c1</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-tips-Use-read_table-instead-of-strsplit-to-split-a-text-column-into-multiple-columns.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/uowubtD_s_4/R-tips-Use-read_table-instead-of-strsplit-to-split-a-text-column-into-multiple-columns.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
Someone on the R-help mailing list had a data frame with a column containing IP addresses in quad-dot format (e.g. 1.10.100.200).  He wanted to sort by this column and I proposed a solution involving <code>strsplit</code>.  But <a href="http://staff.pubhealth.ku.dk/~pd/">Peter Dalgaard</a> comes up with a much nicer method using <code>read.table</code> on a <code>textConnection</code> object:
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
Someone on the R-help mailing list had a data frame with a column containing IP addresses in quad-dot format (e.g. 1.10.100.200).  He wanted to sort by this column and I proposed a solution involving &lt;code&gt;strsplit&lt;/code&gt;.  But &lt;a href="http://staff.pubhealth.ku.dk/~pd/"&gt;Peter Dalgaard&lt;/a&gt; comes up with a much nicer method using &lt;code&gt;read.table&lt;/code&gt; on a &lt;code&gt;textConnection&lt;/code&gt; object:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; a &amp;lt;- data.frame(cbind(color=c("yellow","red","blue","red"),&#xD;
                        status=c("no","yes","yes","no"),&#xD;
                        ip=c("162.131.58.26","2.131.58.16","2.2.58.10","162.131.58.17")))&#xD;
&amp;gt; con &amp;lt;- textConnection(as.character(a$ip))&#xD;
&amp;gt; o &amp;lt;- do.call(order,read.table(con, sep="."))&#xD;
&amp;gt; close(con)&#xD;
&amp;gt; a[o,]&#xD;
   color status            ip&#xD;
3   blue    yes     2.2.58.10&#xD;
2    red    yes   2.131.58.16&#xD;
4    red     no 162.131.58.17&#xD;
1 yellow     no 162.131.58.26&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
That is very, very neat!  Thank you Peter.&#xD;
&lt;/p&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=uowubtD_s_4:JUmeZOR5FJ8:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=uowubtD_s_4:JUmeZOR5FJ8:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=uowubtD_s_4:JUmeZOR5FJ8:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=uowubtD_s_4:JUmeZOR5FJ8:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=uowubtD_s_4:JUmeZOR5FJ8:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=uowubtD_s_4:JUmeZOR5FJ8:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=uowubtD_s_4:JUmeZOR5FJ8:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=uowubtD_s_4:JUmeZOR5FJ8:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=uowubtD_s_4:JUmeZOR5FJ8:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/uowubtD_s_4" height="1" width="1"/&gt;</content><published>2009-05-29T10:53:00Z</published><updated>2009-05-29T10:53:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-tips-Use-read_table-instead-of-strsplit-to-split-a-text-column-into-multiple-columns.html</feedburner:origLink></entry><entry><title type="text">Data.gov</title><id>urn:uuid:a914e8e4-59f0-5054-a5ce-d0aa76d47247</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Data_gov.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/CaDId-zLq1A/Data_gov.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
I am always on the lookout for useful data sources for training in statistics, so I am excited that <a href="http://www.data.gov/">Data.gov</a> has opened for business.  The purpose of Data.gov is to increase public access to high value, machine readable datasets generated by the US Government. 
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
I am always on the lookout for useful data sources for training in statistics, so I am excited that &lt;a href="http://www.data.gov/"&gt;Data.gov&lt;/a&gt; has opened for business.  The purpose of Data.gov is to increase public access to high value, machine readable datasets generated by the US Government. &#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
This is a great initiative which I look forward to explore when I am not in a tiny airport at 3 am (but hey: they have free wifi) and which I hope other countries will take up.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Are there other catalogues of data sets that you use?&#xD;
&lt;/p&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=CaDId-zLq1A:70r8fJ5coR4:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=CaDId-zLq1A:70r8fJ5coR4:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=CaDId-zLq1A:70r8fJ5coR4:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=CaDId-zLq1A:70r8fJ5coR4:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=CaDId-zLq1A:70r8fJ5coR4:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=CaDId-zLq1A:70r8fJ5coR4:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=CaDId-zLq1A:70r8fJ5coR4:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=CaDId-zLq1A:70r8fJ5coR4:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=CaDId-zLq1A:70r8fJ5coR4:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/CaDId-zLq1A" height="1" width="1"/&gt;</content><published>2009-05-22T02:23:00Z</published><updated>2009-05-22T02:23:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Data_gov.html</feedburner:origLink></entry><entry><title type="text">SNA with R: Loading large networks using the igraph library</title><id>urn:uuid:8764d0b0-00b6-5d9b-9c45-5d3373bc97a8</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/SNA-with-R-Loading-large-networks-using-the-igraph-library.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/UafsWYtoE_U/SNA-with-R-Loading-large-networks-using-the-igraph-library.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
We are interested in Social Network Analysis using the statistical analysis and computing platform <a href="http://www.r-project.org/">R</a>.  The documentation for R is voluminous but typically not very good, so this entry is part of a series where we document what we learn as we explore the tool and the packages.
</p>
<p>
In <a href="http://www.cybaea.net/Blogs/Data/SNA-with-R-Loading-your-network-data.html">our previous post on SNA</a> we gave up on using the <code>statnet</code> package because it was not able to handle our data volumes.  In this entry we have better success with the <code>igraph</code> package.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
We are interested in Social Network Analysis using the statistical analysis and computing platform &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;.  The documentation for R is voluminous but typically not very good, so this entry is part of a series where we document what we learn as we explore the tool and the packages.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
In &lt;a href="http://www.cybaea.net/Blogs/Data/SNA-with-R-Loading-your-network-data.html"&gt;our previous post on SNA&lt;/a&gt; we gave up on using the &lt;code&gt;statnet&lt;/code&gt; package because it was not able to handle our data volumes.  In this entry we have better success with the &lt;code&gt;igraph&lt;/code&gt; package.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
The task we are considering is still how to load the network data into the R package’s internal representation.  We will assume that the raw data for our analysis is in a transactional format that is typical at least in the Telecommunications and Finance industries.  In the former the terminology is Call Detail Record (CDR) and an extract may look a little like the following:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="Sample Call Detail Records"&gt;&#xD;
&lt;b&gt;          src,         dest,     start,  duration,type,...&lt;/b&gt;&#xD;
+447000000005,+447000000006,1238510028,        52,call,...&#xD;
+447000000006,+447000000009,1238510627,       154,call,...&#xD;
+447000000009,+447000000007,1238511103,        48,call,...&#xD;
+447000000006,+447000000005,1238511145,        49,call,...&#xD;
+447000000006,+447000000005,1238511678,        12,call,...&#xD;
+447000000001,+447000000006,1238511735,       147,call,...&#xD;
+447000000007,+447000000009,1238511806,        26,call,...&#xD;
+447000000000,+447000000008,1238511825,        19,call,...&#xD;
+447000000009,+447000000008,1238511900,        28,call,...&#xD;
...&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Here a record indicates that the customer identified as &lt;var&gt;src&lt;/var&gt; called (&lt;var&gt;type&lt;/var&gt;=call) the customer &lt;var&gt;dest&lt;/var&gt; at the given time &lt;var&gt;start&lt;/var&gt; and the call lasted &lt;var&gt;duration&lt;/var&gt; seconds.  In general, there will be (many) more attributes describing the transaction which are represented by the &lt;var&gt;...&lt;/var&gt;.  In a Financial Services example, the records may be money transfers between accounts.&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;Loading the data in the &lt;code&gt;igraph&lt;/code&gt; package&lt;/h2&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
We are able to load the previous test data with 51 million records easily:&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="screen"&gt;&amp;gt; library("igraph")&#xD;
&amp;gt; m &amp;lt;- matrix(scan(bzfile("cdr.51M.csv.bz2", open="r"), &#xD;
+                  what=integer(0), skip=1, sep=','), &#xD;
+             ncol=4, byrow=TRUE)&#xD;
Read 205266564 items&#xD;
&amp;gt; ### Vertices are numbered from zero in the igraph library&#xD;
&amp;gt; m[,1] &amp;lt;- m[,1]-1; m[,2] &amp;lt;- m[,2]-1&#xD;
&amp;gt; g &amp;lt;- graph.edgelist(m[,c(2,1)])&#xD;
&amp;gt; E(g)$start    &amp;lt;- as.POSIXct(m[,3], origin="1970-01-01", tz="UTC")&#xD;
&amp;gt; E(g)$duration &amp;lt;- m[,4]&#xD;
&amp;gt; ns &amp;lt;- neighborhood.size(g, 1)&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
Time to up the ante!  We have a file with simulated call data records containing over 700 million entries where we suspect the algorithm used is under-estimating nodes with small connections.  Let’s check on the first ½ billion records (which seems to more-or-less fit in our available memory on this workstation):&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="screen"&gt;&amp;gt; library("igraph")&#xD;
### Note that R can only handle 2^31-1 elements in a vector (on any&#xD;
### platform, including 64-bit), so we need to read this file as a&#xD;
### list.&#xD;
&amp;gt; s &amp;lt;- scan("cdr.1e6x1e1.csv", what=rep(list(integer(0)),4), skip=1, sep=',', multi.line=FALSE)&#xD;
Read 700466826 records&#xD;
&amp;gt; m &amp;lt;- as.vector(rbind(s[[2]], s[[1]]))&#xD;
&amp;gt; print(length(m))&#xD;
[1] 1400933652&#xD;
&amp;gt; length(m) &amp;lt;- 1e9&#xD;
&amp;gt; g &amp;lt;- graph(m, directed=TRUE)&#xD;
&amp;gt; ns &amp;lt;- neighborhood.size(g, 1)&#xD;
&amp;gt; summary(ns)&#xD;
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. &#xD;
   1.00   35.00   40.00   42.92   47.00  101.00 &#xD;
&amp;gt; hist(ns, xlab="Neighborhood size", main="Distribution of neighborhood size", &#xD;
       sub="From cdr.1e6x1e1.1e9")&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;div class="floatRight"&gt;&#xD;
&lt;a href="http://static.cybaea.net/images/neighborhood_hist.png"&gt;&lt;img src="http://static.cybaea.net/images/neighborhood_hist_small.png" width="400" height="400" title="Distribution of neighborhood size" alt="[Distribution of neighborhood size plot]"&gt;&lt;/img&gt;&lt;/a&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
As we suspected, the Monte Carlo algorithm does not provide enough customers with low calling circle sizes.  Fortunately it is very easy to add these separately: the hard part is modelling the larger calling circles.  A mix of these two algorithms provide a reasonably good fit to actual customer behaviour.  (The cut-off at 100 is a parameter to our Monte Carlo simulation program which indeed was 100 for this run.)&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;Problems&lt;/h2&gt;&#xD;
&lt;p&gt;However, it is not all perfect.  When we attempt to add the edge parameters in the obvious way it fails:&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="screen"&gt;&amp;gt; length(s[[3]]) &amp;lt;- 0.5e9&#xD;
&amp;gt; length(s[[4]]) &amp;lt;- 0.5e9&#xD;
&amp;gt; E(g)$start     &amp;lt;- s[[3]]&#xD;
Error: cannot allocate vector of size 3.7 Gb&#xD;
Execution halted&#xD;
&amp;gt; E(g)$duration  &amp;lt;- s[[4]]&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
So we are just at the limit.  Probably 100 million records is OK in this environment.  But &lt;a href="http://igraph.sourceforge.net/"&gt;the core igraph library&lt;/a&gt; is accessible from C so better performance can probably be achieved this way and certainly pointers are 8 byte structures on this machine so we should not have the silly limits that R imposes on us.&#xD;
&lt;/p&gt;&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=UafsWYtoE_U:VuX19eZpOZo:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=UafsWYtoE_U:VuX19eZpOZo:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=UafsWYtoE_U:VuX19eZpOZo:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=UafsWYtoE_U:VuX19eZpOZo:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=UafsWYtoE_U:VuX19eZpOZo:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=UafsWYtoE_U:VuX19eZpOZo:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=UafsWYtoE_U:VuX19eZpOZo:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=UafsWYtoE_U:VuX19eZpOZo:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=UafsWYtoE_U:VuX19eZpOZo:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/UafsWYtoE_U" height="1" width="1"/&gt;</content><published>2009-05-06T15:33:00Z</published><updated>2009-05-06T15:33:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/SNA-with-R-Loading-large-networks-using-the-igraph-library.html</feedburner:origLink></entry><entry><title type="text">SNA with R: Loading your network data</title><id>urn:uuid:048bbc8f-cad3-5c39-8dee-7c05fb4204ca</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/SNA-with-R-Loading-your-network-data.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/tZM7HWtF50M/SNA-with-R-Loading-your-network-data.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
We are interested in Social Network Analysis using the statistical analysis and computing platform <a href="http://www.r-project.org/">R</a>.  As usual with R, the documentation is pretty bad, so this series collects our notes as we learn more about the available packages and how they work.  We use here the <a href="http://csde.washington.edu/statnet/index.shtml">statnet</a> group of packages, which seems to be the most comprehensive and most actively maintained network analysis packages.
</p>
<p>
The first task which we consider in this post is to load our data into a <code>network</code> object, which is how all the <code>statnet</code> packages represent a network.  Typically for R, the documentation is voluminous but not always as helpful as one could want.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
We are interested in Social Network Analysis using the statistical analysis and computing platform &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;.  As usual with R, the documentation is pretty bad, so this series collects our notes as we learn more about the available packages and how they work.  We use here the &lt;a href="http://csde.washington.edu/statnet/index.shtml"&gt;statnet&lt;/a&gt; group of packages, which seems to be the most comprehensive and most actively maintained network analysis packages.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
The first task which we consider in this post is to load our data into a &lt;code&gt;network&lt;/code&gt; object, which is how all the &lt;code&gt;statnet&lt;/code&gt; packages represent a network.  Typically for R, the documentation is voluminous but not always as helpful as one could want.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
We will assume that the raw data for our analysis is in a transactional format that is typical at least in the Telecommunications and Finance industries.  In the former the terminology is Call Detail Record (&lt;dfn title="Call Detail Record"&gt;CDR&lt;/dfn&gt;) and an extract may look a little like the following:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="Sample Call Detail Records"&gt;&#xD;
&lt;b&gt;          src,         dest,     start,  duration,type,...&lt;/b&gt;&#xD;
+447000000005,+447000000006,1238510028,        52,call,...&#xD;
+447000000006,+447000000009,1238510627,       154,call,...&#xD;
+447000000009,+447000000007,1238511103,        48,call,...&#xD;
+447000000006,+447000000005,1238511145,        49,call,...&#xD;
+447000000006,+447000000005,1238511678,        12,call,...&#xD;
+447000000001,+447000000006,1238511735,       147,call,...&#xD;
+447000000007,+447000000009,1238511806,        26,call,...&#xD;
+447000000000,+447000000008,1238511825,        19,call,...&#xD;
+447000000009,+447000000008,1238511900,        28,call,...&#xD;
...&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Here a record indicates that the customer identified as &lt;var&gt;src&lt;/var&gt; called (&lt;var&gt;type&lt;/var&gt;=call) the customer &lt;var&gt;dest&lt;/var&gt; at the given time &lt;var&gt;start&lt;/var&gt; and the call lasted &lt;var&gt;duration&lt;/var&gt; seconds.  In general, there will be (many) more attributes describing the transaction which are represented by the &lt;var&gt;...&lt;/var&gt;.  In a Financial Services example, the records may be money transfers between accounts.&#xD;
&lt;/p&gt;&#xD;
&lt;h2&gt;Implementation in the &lt;code&gt;network&lt;/code&gt; class&lt;/h2&gt;&#xD;
&lt;p&gt;&#xD;
In the naive implementation of this data as a network, we would have the sources and destinations (broadly speaking: people) as vertices and the calls as edges.  That broadly seems to make sense: people are connected by the calls they make, and that is the social relationship we wish to model.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
In the terminology of the &lt;code&gt;network&lt;/code&gt; class, that means that our network will be &lt;b&gt;directed&lt;/b&gt; (calls and money transfers have a direction &lt;em&gt;from&lt;/em&gt; one person &lt;em&gt;to&lt;/em&gt; another) and will need to allow &lt;b&gt;multiple&lt;/b&gt; edges between the same endpoints (because any one person can, and indeed usually will, make several calls to the same other person).&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
We could consider dropping the &lt;b&gt;multiple&lt;/b&gt; attribute of the network and instead represent the fact that A has called B with a single edge and perhaps have the number of calls and their total duration as an edge attribute.  We will investigate this another time, but it is surely a less faithful representation of the data that we have (and we would need to drop information like the time of call).&#xD;
&lt;/p&gt;&#xD;
&lt;h2&gt;Mapping customer identifiers to network vertex numbers&lt;/h2&gt;&#xD;
&lt;p&gt;&#xD;
One thing they seem to forget to tell you in the documentation is that when you import your data your vertex identifiers (which in our case is customer or account numbers) must be changed to number the vertices &lt;em&gt;and&lt;/em&gt; that this numbering must be sequential and start from 1.  Being used to an environment where the vertex identifiers are arbitrary (and arrays usually start from 0), this one had me puzzled for a while.  The error message that tells you your vertex numbering is not what the package expected is spectacularly unhelpful:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; n &amp;lt;- network(m, matrix.type="edgelist", directed=TRUE, multiple=TRUE)&#xD;
Error in add.edges(g, as.list(x[, 1]), as.list(x[, 2]), edge.check = edge.check) : &#xD;
  (edge check) Illegal vertex reference in addEdges_R.  Exiting.&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
For the discussion that follows, we will assume that you have changed your identifies externally to R.&#xD;
&lt;/p&gt;&#xD;
&lt;h2&gt;Loading the data&lt;/h2&gt;&#xD;
&lt;p&gt;&#xD;
The good news is that our data is essentially in a format that the &lt;code&gt;network&lt;/code&gt; package calls &lt;b&gt;edge list&lt;/b&gt; and which it can import directly.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
I say “essentially” because for some strange reason the package expects the destination to come before the source which seems ass-backwards to me.  But assume we have our data in a file &lt;code&gt;cdr.csv&lt;/code&gt; like this (we only have calls here):&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="cdr.csv"&gt;       src,      dest,     start,  duration&#xD;
         5,         6,1238510028,        52&#xD;
         6,         9,1238510627,       154&#xD;
         9,         7,1238511103,        48&#xD;
         6,         5,1238511145,        49&#xD;
...&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Then we can load the data into R easily:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; library("network")&#xD;
&amp;gt; m &amp;lt;- matrix(scan(file="cdr.csv", what=integer(0), skip=1, sep=','), ncol=4, byrow=TRUE)&#xD;
Read 1896 items&#xD;
&amp;gt; # &lt;a href="http://www.cybaea.net/Blogs/Data/R-tips-Swapping-columns-in-a-matrix.html"&gt;Swap columns&lt;/a&gt; for ass-backward network package&#xD;
&amp;gt; m[,c(1,2)] &amp;lt;- m[,c(2,1)]&#xD;
&#xD;
&amp;gt; # Create network&#xD;
&amp;gt; net &amp;lt;- network(m, matrix.type="edgelist", directed=TRUE, multiple=TRUE)&#xD;
&#xD;
&amp;gt; summary(net)&#xD;
Network attributes:&#xD;
 vertices = 10&#xD;
 directed = TRUE&#xD;
 hyper = FALSE&#xD;
 loops = FALSE&#xD;
 multiple = TRUE&#xD;
 bipartite = FALSE&#xD;
 total edges = 474 &#xD;
   missing edges = 0 &#xD;
   non-missing edges = 474 &#xD;
 density = 5.266667 &#xD;
&#xD;
Vertex attributes:&#xD;
 vertex.names:&#xD;
   character valued attribute&#xD;
   10 valid vertex names&#xD;
&#xD;
No edge attributes&#xD;
&#xD;
Network adjacency matrix:&#xD;
Error in as.matrix.network.adjacency(x = x, attrname = attrname, ...) : &#xD;
  Multigraphs not currently supported in as.matrix.network.adjacency.  Exiting.&#xD;
In addition: Warning message:&#xD;
In network.density(x) :&#xD;
  Network is multiplex - no general way to define density.  Returning value for a non-multiplex network (hope that's what you wanted).&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
OK, that's a lot of warnings, but it basically worked.  We have figured out how to load our network data into the &lt;a href="http://www.jstatsoft.org/v24/i02/"&gt;network&lt;/a&gt; package in R.&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;Performance&lt;/h2&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
We can’t do an exhaustive performance review now, but let us at least make sure we can load medium-sized networks.  We change our CDR simulator to emit the desitnation before the source just like &lt;code&gt;network&lt;/code&gt; likes it and let it run.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
The first file has 2,645,288 (simulated) CDR lines from 100k customers and it loads OK on our small development workstation even with the default settings:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; library("network")&#xD;
&amp;gt; n &amp;lt;- network(matrix(scan(file="&lt;a href="http://static.cybaea.net/files/cdr.1e5x1e0.csv.bz2"&gt;cdr.1e5x1e0.csv&lt;/a&gt;", &#xD;
                           what=integer(0), skip=1, sep=','), &#xD;
                      ncol=4, byrow=TRUE), &#xD;
               matrix.type="edgelist", directed=TRUE, multiple=TRUE)&#xD;
Read 10581152 items&#xD;
&amp;gt; proc.time()&#xD;
   user  system elapsed &#xD;
138.304   1.597 140.878 &#xD;
&amp;gt; save(n, file="n.RData", ascii=FALSE, compress=FALSE)&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The size of the saved network object is 373MB (only 27MB compressed).&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
We can potentially save some time and memory by not explicitly not performing the edge check (again: the documentation frustrates us and is silent on what the defaults are for the &lt;code&gt;network&lt;/code&gt; call we used above) so we try this for our next file with 51,316,641 lines of CDR data (again for 100k customers) which also saves us some column swapping:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; library("network")&#xD;
&amp;gt; m &amp;lt;- matrix(scan(file="cdr.51M.csv", &#xD;
                   what=integer(0), skip=1, sep=','),&#xD;
              ncol=4, byrow=TRUE)&#xD;
Read 205266564 items&#xD;
&amp;gt; num_vert &amp;lt;- max(m[,1], m[,2])&#xD;
&amp;gt; num_vert&#xD;
[1] 100000&#xD;
&amp;gt; n &amp;lt;- network.initialize(n=num_vert, directed=TRUE, multiple=TRUE)&#xD;
&amp;gt; add.edges(n, tail=m[,2], head=m[,1], edge.check=FALSE)&#xD;
&amp;gt; proc.time()&#xD;
&lt;i&gt;(several hours: I’ll let you know when it is done)&lt;/i&gt;&#xD;
&amp;gt; rm(m)&#xD;
&amp;gt; save(n, file="n.RData", ascii=FALSE, compress=TRUE)&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Our attempted optimization did not seem to matter and this network is too big for the machine and the &lt;code&gt;network&lt;/code&gt; package.  Building the network was painful as I was working on the workstation at the same time.  The machine has 16GB installed RAM, but it was clearly running out and swapping extensively.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
51 million might be a reasonable size data set for some Financial Services applications but it is clearly a trivial number of records for Telecommunications.  I’ll need to do some more digging around.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Does anybody have any SNA benchmarks?  I like the &lt;a href="http://www.kxen.com/"&gt;KXEN&lt;/a&gt; implementation for its simplicity and speed so I might get a copy and try it out.  Any R performance experts who could make suggestions in the comments?  How big are &lt;em&gt;your&lt;/em&gt; networks?&#xD;
&lt;/p&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tZM7HWtF50M:xgW7YjifVz4:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tZM7HWtF50M:xgW7YjifVz4:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=tZM7HWtF50M:xgW7YjifVz4:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tZM7HWtF50M:xgW7YjifVz4:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=tZM7HWtF50M:xgW7YjifVz4:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tZM7HWtF50M:xgW7YjifVz4:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tZM7HWtF50M:xgW7YjifVz4:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=tZM7HWtF50M:xgW7YjifVz4:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tZM7HWtF50M:xgW7YjifVz4:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/tZM7HWtF50M" height="1" width="1"/&gt;</content><published>2009-04-01T16:08:00Z</published><updated>2009-04-01T16:08:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/SNA-with-R-Loading-your-network-data.html</feedburner:origLink></entry><entry><title type="text">R tips: Swapping columns in a matrix</title><id>urn:uuid:64357922-73f7-5e9a-bb13-1c34f4f98256</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-tips-Swapping-columns-in-a-matrix.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/ez3q0foTWdw/R-tips-Swapping-columns-in-a-matrix.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
Swapping two columns in a matrix is really easy: <code>m[ , c(1,2)]  &lt;- m[ , c(2,1)]</code>.
</p>
</div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
Using &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;, the statistical analysis and computing platform, swapping two columns in a matrix is really easy: &lt;code&gt;m[ , c(1,2)]  &amp;lt;- m[ , c(2,1)]&lt;/code&gt;.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Note, however, that this does not swap the column names (if you have any) but only the values.  You could do something like &lt;code&gt;colnames(m)[c(1,2)] &amp;lt;- colnames(m)[c(2,1)]&lt;/code&gt; if you need the names changed as well, but better is perhaps just to assign:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;m &amp;lt;- m[ , c(2, 1, 3:ncol(m)) ]&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=ez3q0foTWdw:DJjK1El5AbU:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=ez3q0foTWdw:DJjK1El5AbU:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=ez3q0foTWdw:DJjK1El5AbU:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=ez3q0foTWdw:DJjK1El5AbU:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=ez3q0foTWdw:DJjK1El5AbU:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=ez3q0foTWdw:DJjK1El5AbU:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=ez3q0foTWdw:DJjK1El5AbU:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=ez3q0foTWdw:DJjK1El5AbU:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=ez3q0foTWdw:DJjK1El5AbU:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/ez3q0foTWdw" height="1" width="1"/&gt;</content><published>2009-03-31T15:59:00Z</published><updated>2009-03-31T15:59:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-tips-Swapping-columns-in-a-matrix.html</feedburner:origLink></entry><entry><title type="text">R tips: Eliminating the “save workspace image” prompt on exit</title><id>urn:uuid:9db2db2b-1dc0-55ff-b83d-db44ca9cb2b6</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-tips-Eliminating-the-save-workspace-image-prompt-on-exit.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/mZpStaDMWdA/R-tips-Eliminating-the-save-workspace-image-prompt-on-exit.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
When using <a href="http://www.r-project.org/">R</a>, the statistical analysis and computing platform, I find it really annoying that it always prompts to save the workspace when I exit.  This is how I turn it off.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
When using &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;, the statistical analysis and computing platform, I find it really annoying that it always prompts to save the workspace when I exit.  This is how I turn it off.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
I wish there was an option to change the default of the &lt;code&gt;q&lt;/code&gt;/&lt;code&gt;quit&lt;/code&gt; functions.  I start and stop R frequently and so the exit question which I &lt;em&gt;have&lt;/em&gt; to answer &lt;em&gt;every time&lt;/em&gt; is really annoying:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;Save workspace image? [y/n/c]:&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Why is there no R option to disable this prompt?  If I want to save the image, I have already saved it.  And I don’t like the default name anyhow, preferring to give my own with &lt;code&gt;save.image(file=...)&lt;/code&gt;.  For a while, I had a function defined in my &lt;code&gt;~/.Rprofile&lt;/code&gt; that terminated the session without prompting.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="~/.Rprofile"&gt;exit &amp;lt;- function() { q("no") }&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
While this means I can type &lt;code&gt;exit()&lt;/code&gt; and avoid the annoying prompt, in practice I normally type Control-D to end the session which still calls the normal &lt;code&gt;q&lt;/code&gt; function with its annoying prompt.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
So instead I use the &lt;code&gt;alias&lt;/code&gt; functionality of my (bash) shell to change the default.  In my &lt;code&gt;~/.bashrc&lt;/code&gt; I now have&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="~/.bashrc"&gt;alias R="$(/usr/bin/which R) --no-save"&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
And finally I am happy.  But I still think R should have an option (accessible through &lt;code&gt;options&lt;/code&gt;) to change the default behavior.&#xD;
&lt;/p&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mZpStaDMWdA:kSaP-_aoHhI:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mZpStaDMWdA:kSaP-_aoHhI:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=mZpStaDMWdA:kSaP-_aoHhI:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mZpStaDMWdA:kSaP-_aoHhI:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=mZpStaDMWdA:kSaP-_aoHhI:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mZpStaDMWdA:kSaP-_aoHhI:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mZpStaDMWdA:kSaP-_aoHhI:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=mZpStaDMWdA:kSaP-_aoHhI:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mZpStaDMWdA:kSaP-_aoHhI:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/mZpStaDMWdA" height="1" width="1"/&gt;</content><published>2009-03-26T08:14:00Z</published><updated>2009-03-26T08:14:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-tips-Eliminating-the-save-workspace-image-prompt-on-exit.html</feedburner:origLink></entry><entry><title type="text">R tips: Keep your packages up-to-date</title><id>urn:uuid:925677fd-a8f7-5fbd-ac11-4bd8ef410537</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-tips-Keep-your-packages-up_to_date.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/Sm6R7iW5IEw/R-tips-Keep-your-packages-up_to_date.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
In this entry in a small series of tips for the use of the <a href="http://www.r-project.org/" title="The R Project for Statistical Computing">R</a> statistical analysis and computing tool, we look at how to keep your addon packages up-to-date.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
In this entry in a small series of tips for the use of the &lt;a href="http://www.r-project.org/" title="The R Project for Statistical Computing"&gt;R&lt;/a&gt; statistical analysis and computing tool, we look at how to keep your addon packages up-to-date.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
One of the great strengths of R is the many packages available.  All the new approaches, as well as some of the best implementations of your old favorites are there.  But it can also be a little daunting, and so the &lt;a href="http://www.stats.bris.ac.uk/R/web/views/"&gt;CRAN task views&lt;/a&gt; are often the best way to get started and download a reasonable “bundle” of packages for your analysis.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
First we need a place to store the packages.  On Linux (and other Unix-like systems) I use the file &lt;code&gt;~/.Renviron&lt;/code&gt; to set the &lt;code&gt;R_LIBS&lt;/code&gt; variable to where I want the files:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="~/.Renviron"&gt;## R environment&#xD;
R_LIBS="~/R"&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
On Windows, I set the same variable for the user account.  Don’t forget to create the directory.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Now your can start R and install the CRAN task view package:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; install.packages("ctv")&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Then I have a few things in my &lt;code&gt;~/.Rprofile&lt;/code&gt; startup file.  The previous command probably prompted you for a download mirror which is annoying, so let’s exit R and edit the startup file to contain:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="~/.Rprofile"&gt;## Default CRAN mirror&#xD;
local({r &amp;lt;- getOption("repos"); r["CRAN"] &amp;lt;- "http://cran.uk.r-project.org"; options(repos=r)})&#xD;
## Libraries&#xD;
require("utils", quietly=TRUE)&#xD;
require("ctv", quietly=TRUE)&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Then I define three functions.  The first is to install the views I need.  I like to try new things, so my list is long.  Edit it to suit your needs:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="~/.Rprofile"&gt;install.myviews &amp;lt;- function() {&#xD;
  require("ctv", quietly=TRUE)&#xD;
  my.views = c("Bayesian", "Cluster", "Graphics", "gR", "HighPerformanceComputing", "MachineLearning", "Multivariate", "NaturalLanguageProcessing", "Robust", "SocialSciences", "Spatial", "Survival", "TimeSeries")&#xD;
  install.views(views=my.views, lib=Sys.getenv("R_LIBS"), dependencies=c("Depends","Suggests"))&#xD;
}&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Try it out!  Save the file, start R, and type &lt;code&gt;install.myviews()&lt;/code&gt; at the prompt.  If your list is as long as mine, then this may take some time and you may get some warnings and errors.  We might add a tip on these later, but the main reason for the errors is probably that you are missing the development files for external libraries (or that R just can’t find it).&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Now that we have finally got them, we need to make sure they are up-to-date.  I add two functions to &lt;code&gt;~/.Rprofile&lt;/code&gt;:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="~/.Rprofile"&gt;update.local &amp;lt;- function() {&#xD;
  update.packages(lib.loc=Sys.getenv("R_LIBS"), ask=FALSE)&#xD;
}&#xD;
&#xD;
update.myviews &amp;lt;- function() {&#xD;
  require("ctv", quietly=TRUE)&#xD;
  my.views = c("Bayesian", "Cluster", "Graphics", "gR", "HighPerformanceComputing", "MachineLearning", "Multivariate", "NaturalLanguageProcessing", "Robust", "SocialSciences", "Spatial", "Survival", "TimeSeries")&#xD;
  update.views(views=my.views, lib.loc=Sys.getenv("R_LIBS"))&#xD;
}&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The first allows me to easily update all my locally installed libraries (not just these installed from views).  The second updates my views which is useful when the view definitions change (rarely, but it happens as the recommended packages evolve).&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Now I can of course update from the R command prompt using &lt;code&gt;update.local()&lt;/code&gt; or &lt;code&gt;update.myviews()&lt;/code&gt;.  But that is not the main benefit.  I can now update directly from the shell command line using commands like:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;echo "update.local()" &amp;gt; /tmp/r.cmd&#xD;
R CMD BATCH /tmp/r.cmd /tmp/r.out&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The beauty of this is that I can add it to my &lt;code&gt;crontab(5)&lt;/code&gt; and have it run automatically every night or every week as I feel I need it.  This way I always have the latest versions installed.&#xD;
&lt;/p&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Sm6R7iW5IEw:f89JGmHk9P8:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Sm6R7iW5IEw:f89JGmHk9P8:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=Sm6R7iW5IEw:f89JGmHk9P8:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Sm6R7iW5IEw:f89JGmHk9P8:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=Sm6R7iW5IEw:f89JGmHk9P8:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Sm6R7iW5IEw:f89JGmHk9P8:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Sm6R7iW5IEw:f89JGmHk9P8:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=Sm6R7iW5IEw:f89JGmHk9P8:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Sm6R7iW5IEw:f89JGmHk9P8:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/Sm6R7iW5IEw" height="1" width="1"/&gt;</content><published>2009-03-25T20:59:00Z</published><updated>2009-03-25T20:59:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-tips-Keep-your-packages-up_to_date.html</feedburner:origLink></entry></feed>
