<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.cybaea.net/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0"><title>CYBAEA Data and Analysis</title><rights>Copyright by the author(s). All rights reserved.</rights><logo>http://static.cybaea.net/logo/cybaea-data-200.png</logo><subtitle type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>Read the CYBAEA Data and Analysis blog for in-depth coverage of selected topics in data analysis, data mining, statistics, causal inference, and related topics.</p><p>This is the blog for practising data analysts and theoretical statisticians.  The business conclusions of any analysis would normally be discussed in the CYBAEA Journal while this blog may contain the details of the analysis.</p></div></subtitle><updated>2010-03-08T14:46:14Z</updated><id>urn:uuid:259dced6-9721-5b16-a8aa-d91dc8e40f56</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/" /><link rel="alternate" type="text/html" href="http://www.cybaea.net/Blogs/Data/" /><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><generator uri="http://www.cybaea.net/atom/feed.pl?short_name=Data" version="$Revision: 97 $">feed.pl</generator><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/atom+xml" href="http://feeds.cybaea.net/CybaeaData" /><feedburner:info uri="cybaeadata" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><feedburner:emailServiceId>CybaeaData</feedburner:emailServiceId><feedburner:feedburnerHostname>http://feedburner.google.com</feedburner:feedburnerHostname><entry><title type="text">R: Eliminating observed values with zero variance</title><id>urn:uuid:5394cf3c-2009-5225-955d-1b6c90ae4445</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-Eliminating-observed-values-with-zero-variance.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/tQRUNQKxFng/R-Eliminating-observed-values-with-zero-variance.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
I needed a fast way of eliminating observed values with zero variance from large data sets using the <a href="http://www.r-project.org/">R statistical computing and analysis platform</a>.  In other words, I want to find the columns in a data frame that has zero variance.  And as fast as possible, because my data sets are large, many, and changing fast.  The final result surprised me a little.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
I needed a fast way of eliminating observed values with zero variance from large data sets using the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt;.  In other words, I want to find the columns in a data frame that has zero variance.  And as fast as possible, because my data sets are large, many, and changing fast.  The final result surprised me a little.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
I use the &lt;a href="http://www.kddcup-orange.com/data.php"&gt;KDD Cup 2009 data sets&lt;/a&gt; as my reference for this experiment.  (You will need to register to download the data.)  It is a realistic example of the type of customer data that I usually work with.  It has 50,000 observations of 15,000 variables.  To load it into R you'll need a reasonably beefy machine.  My workstation has 16GB of memory; if yours have less then use a sample of the data.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
We load the data into R and propose a few ways in which we may identify the columns we need:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;#!/usr/bin/Rscript&#xD;
## zero-var.R - find the fastest way of eliminating observations with zero variance&#xD;
## © 2010 Allan Engelhardt, http://www.cybaea.net&#xD;
&#xD;
## Read the data file.&#xD;
## We have already converted it to R format and saved it, so we can do&#xD;
load("train.RData")&#xD;
## instead of something like&#xD;
# train &amp;lt;- read.delim(file="../orange_large_train.data.bz2")&#xD;
&#xD;
## Some suggestions for zero variance functions:&#xD;
zv.1 &amp;lt;- function(x) {&#xD;
    ## The literal approach&#xD;
    y &amp;lt;- var(x, na.rm = TRUE)&#xD;
    return(is.na(y) || y == 0)&#xD;
}&#xD;
zv.2 &amp;lt;- function(x) {&#xD;
    ## As before, but avoiding direct comparison with zero&#xD;
    y &amp;lt;- var(x, na.rm = TRUE)&#xD;
    return(is.na(y) || y &amp;lt; .Machine$double.eps ^ 0.5)&#xD;
}&#xD;
zv.3 &amp;lt;- function(x) {&#xD;
    ## Maybe it is faster to check for equality than to compute?&#xD;
    y &amp;lt;- x[!is.na(x)]&#xD;
    return(all(y == y[1]))&#xD;
}&#xD;
zv.4 &amp;lt;- function(x) {&#xD;
    ## Taking out the special case may speed things up?&#xD;
    ## (At least for this data set where this case is common.)&#xD;
    z &amp;lt;- is.na(x)&#xD;
    if ( all(z) ) return(TRUE);&#xD;
    y &amp;lt;- x[!z]&#xD;
    return(all(y == y[1]))&#xD;
}&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Now we just have to load the very useful &lt;a href="http://cran.r-project.org/web/packages/rbenchmark/index.html"&gt;rbenchmark&lt;/a&gt; package and let the machine figure it out:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;library("rbenchmark")&#xD;
&#xD;
cat("Running benchmarks:\n")&#xD;
benchmark(&#xD;
          zv1 = { sapply(train, zv.1) },&#xD;
          zv2 = { sapply(train, zv.2) },&#xD;
          zv3 = { sapply(train, zv.3) },&#xD;
          zv4 = { sapply(train, zv.4) },&#xD;
          replications = 5,&#xD;
          columns = c("test", "elapsed", "relative", "sys.self"),&#xD;
          order = "elapsed"&#xD;
          )&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The answer (on my machine) is that it is faster to calculate than to check for equality:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;Running benchmarks:&#xD;
  test elapsed relative sys.self&#xD;
1  zv1  78.619 1.000000    6.395&#xD;
2  zv2  79.276 1.008357    6.586&#xD;
3  zv3 113.024 1.437617    1.735&#xD;
4  zv4 118.579 1.508274    1.716&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The two functions based on the core variance function are easily the fastest (despite having to do arithmetic) while taking out the special case in the equality functions is a Bad Idea.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Can you think of an even faster way to do it?&#xD;
&lt;/p&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tQRUNQKxFng:XLrBfXK16uw:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tQRUNQKxFng:XLrBfXK16uw:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=tQRUNQKxFng:XLrBfXK16uw:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tQRUNQKxFng:XLrBfXK16uw:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=tQRUNQKxFng:XLrBfXK16uw:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tQRUNQKxFng:XLrBfXK16uw:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tQRUNQKxFng:XLrBfXK16uw:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=tQRUNQKxFng:XLrBfXK16uw:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tQRUNQKxFng:XLrBfXK16uw:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/tQRUNQKxFng" height="1" width="1"/&gt;</content><published>2010-03-08T14:46:00Z</published><updated>2010-03-08T14:46:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-Eliminating-observed-values-with-zero-variance.html</feedburner:origLink></entry><entry><title type="text">Beautiful Data</title><id>urn:uuid:770cda82-5757-5a26-827a-2aeff8a8a098</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Beautiful-Data.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/nsxeLGyxKLM/Beautiful-Data.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><div class="floatRight">
  <a href="http://www.cybaea.net/Blogs/Data/Beautiful-Data.html" title="Click for full article">
    <img src="http://static.cybaea.net/images/beautiful-data-small.png" width="100" height="131" alt="[book cover]" />
  </a>
</div>
<p>
O'Reilly's recent publication <a href="http://oreilly.com/catalog/9780596157111/">Beautiful Data</a> has a chapter by <a href="http://jeffjonas.typepad.com/jeff_jonas/">Jeff Jonas</a> which is enough reason in itself for me to recommend it.  The chapter, <a href="http://jeffjonas.typepad.com/DataFindsDataFinal.pdf">Data Finds Data</a>, is also available as a PDF download.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
O'Reilly's recent publication &lt;a href="http://oreilly.com/catalog/9780596157111/"&gt;Beautiful Data&lt;/a&gt; has a chapter by &lt;a href="http://jeffjonas.typepad.com/jeff_jonas/"&gt;Jeff Jonas&lt;/a&gt; which is enough reason in itself for me to recommend it.  The chapter, &lt;a href="http://jeffjonas.typepad.com/DataFindsDataFinal.pdf"&gt;Data Finds Data&lt;/a&gt;, is also available as a PDF download.&#xD;
&lt;/p&gt;&#xD;
&lt;div class="floatRight"&gt;&#xD;
  &lt;a href="http://oreilly.com/catalog/9780596157111/" title="Click for book details"&gt;&#xD;
    &lt;img src="http://static.cybaea.net/images/beautiful-data-small.png" width="100" height="131" alt="[book cover]"&gt;&lt;/img&gt;&#xD;
  &lt;/a&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;p&gt;&#xD;
I met Jeff a couple of year ago at an ETech conference, and he is easily one of the smartest people I have ever met who is thinking about data.&#xD;
&lt;/p&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=nsxeLGyxKLM:Wgxpc7L0eVI:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=nsxeLGyxKLM:Wgxpc7L0eVI:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=nsxeLGyxKLM:Wgxpc7L0eVI:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=nsxeLGyxKLM:Wgxpc7L0eVI:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=nsxeLGyxKLM:Wgxpc7L0eVI:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=nsxeLGyxKLM:Wgxpc7L0eVI:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=nsxeLGyxKLM:Wgxpc7L0eVI:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=nsxeLGyxKLM:Wgxpc7L0eVI:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=nsxeLGyxKLM:Wgxpc7L0eVI:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/nsxeLGyxKLM" height="1" width="1"/&gt;</content><published>2009-07-27T19:38:00Z</published><updated>2009-07-27T19:38:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Beautiful-Data.html</feedburner:origLink></entry><entry><title type="text">Massively parallel database for analytics</title><id>urn:uuid:a8ba9e43-b837-551c-bd02-a1a7b4506c41</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Massively-parallel-database-for-analytics.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/Apso1Get0Yk/Massively-parallel-database-for-analytics.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
This is by far the best description of why traditional parallel databases (like Teradata, Greenplum et al.) is a evolutionary dead end.  But much more than a theoretical discussion, they have built a solution which they call HadoopDB.  It is based on Hadoop, PostgreSQL, and Hive and is completely Open Source.  Alternative, column-based, backends to PostgreSQL are being implemented now.  Read: <a href="http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html">Announcing release of HadoopDB</a>.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
This is by far the best description of why traditional parallel databases (like Teradata, Greenplum et al.) is a evolutionary dead end.  But much more than a theoretical discussion, they have built a solution which they call HadoopDB.  It is based on Hadoop, PostgreSQL, and Hive and is completely Open Source.  Alternative, column-based, backends to PostgreSQL are being implemented now.  Read: &lt;a href="http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html"&gt;Announcing release of HadoopDB&lt;/a&gt;.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;See also:&lt;/p&gt;&#xD;
&lt;ul&gt;&#xD;
&lt;li&gt;&lt;a href="http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-shorter.html"&gt;Short version: key bullet points&lt;/a&gt;&lt;/li&gt;&#xD;
&lt;li&gt;&lt;a href="http://db.cs.yale.edu/hadoopdb/hadoopdb.pdf"&gt;Long version (12 pages, PDF)&lt;/a&gt;&lt;/li&gt;&#xD;
&lt;li&gt;&lt;a href="http://tech.slashdot.org/story/09/07/21/1747241/Researchers-Create-Database-Hadoop-Hybrid?from=rss"&gt;Slashdot discussion&lt;/a&gt;&lt;/li&gt;&#xD;
&lt;li&gt;&lt;a href="http://www.stats.bris.ac.uk/R/web/packages/HadoopStreaming/index.html"&gt;R package HadoopStreaming&lt;/a&gt;&lt;/li&gt;&#xD;
&lt;/ul&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Apso1Get0Yk:udpBUFdH-J4:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Apso1Get0Yk:udpBUFdH-J4:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=Apso1Get0Yk:udpBUFdH-J4:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Apso1Get0Yk:udpBUFdH-J4:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=Apso1Get0Yk:udpBUFdH-J4:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Apso1Get0Yk:udpBUFdH-J4:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Apso1Get0Yk:udpBUFdH-J4:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=Apso1Get0Yk:udpBUFdH-J4:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Apso1Get0Yk:udpBUFdH-J4:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/Apso1Get0Yk" height="1" width="1"/&gt;</content><published>2009-07-22T13:37:00Z</published><updated>2009-07-22T13:37:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Massively-parallel-database-for-analytics.html</feedburner:origLink></entry><entry><title type="text">The Knapsack Problem</title><id>urn:uuid:6efce6d8-6489-5275-aa88-1ddce86d4e65</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/The-Knapsack-Problem.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/JCEN5oEfIRM/The-Knapsack-Problem.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
<a href="http://blog.revolution-computing.com/2009/07/because-its-friday-the-knapsack-problem.html">David posts a question</a> about how to solve <a href="http://xkcd.com/287/">this</a> <a href="http://en.wikipedia.org/wiki/Knapsack_problem">knapsack problem </a> using the <a href="http://www.r-project.org/">R statistical computing and analysis platform</a>.  My reply in the comments seems to have disappeared for a while so here is my proposed solution:
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;div class="floatCenter" style="width: 640px;"&gt;&#xD;
  &lt;a href="http://xkcd.com/287/"&gt;&#xD;
    &lt;img src="http://imgs.xkcd.com/comics/np_complete.png" width="640" height="414" alt="[Cartoon from XKCD]" title="NP-Complete"&gt;&lt;/img&gt;&#xD;
  &lt;/a&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;p&gt;&#xD;
&lt;a href="http://blog.revolution-computing.com/2009/07/because-its-friday-the-knapsack-problem.html"&gt;David posts a question&lt;/a&gt; about how to solve this &lt;a href="http://en.wikipedia.org/wiki/Knapsack_problem"&gt;knapsack problem &lt;/a&gt; using the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt;.  My reply in the comments seems to have disappeared for a while so here is my proposed solution.  See David’s blog for my earlier proposed solution with a very common error.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&#xD;
## http://blog.revolution-computing.com/2009/07/because-its-friday-the-knapsack-problem.html&#xD;
appetizer.solution &amp;lt;- local (&#xD;
function (target) {&#xD;
  app &amp;lt;- c(2.15, 2.75, 3.35, 3.55, 4.20, 5.80)&#xD;
  r &amp;lt;- 2L&#xD;
  repeat {&#xD;
	c &amp;lt;- gtools::combinations(length(app), r=r, v=app, repeats.allowed=TRUE)&#xD;
	s &amp;lt;- rowSums(c)&#xD;
	if ( all(s &amp;gt; target) ) {&#xD;
	  print("No solution found")&#xD;
	  break&#xD;
	}&#xD;
	x &amp;lt;- which( abs(s-target) &amp;lt; 1e-4 )&#xD;
	if ( length(x) &amp;gt; 0L ) {&#xD;
	  print("Solution found")&#xD;
	  print(c[x,])&#xD;
	  break&#xD;
	}&#xD;
	r &amp;lt;- r + 1L&#xD;
  }&#xD;
})&#xD;
&#xD;
appetizer.solution(15.05)&#xD;
# [1] "Solution found"&#xD;
# [1] 3.55 5.80 5.80&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Brute force works, it just doesn’t scale well.&#xD;
&lt;/p&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=JCEN5oEfIRM:esvQ6McvdMU:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=JCEN5oEfIRM:esvQ6McvdMU:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=JCEN5oEfIRM:esvQ6McvdMU:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=JCEN5oEfIRM:esvQ6McvdMU:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=JCEN5oEfIRM:esvQ6McvdMU:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=JCEN5oEfIRM:esvQ6McvdMU:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=JCEN5oEfIRM:esvQ6McvdMU:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=JCEN5oEfIRM:esvQ6McvdMU:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=JCEN5oEfIRM:esvQ6McvdMU:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/JCEN5oEfIRM" height="1" width="1"/&gt;</content><published>2009-07-10T20:30:00Z</published><updated>2009-07-10T20:30:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/The-Knapsack-Problem.html</feedburner:origLink></entry><entry><title type="text">OECD Statistics</title><id>urn:uuid:43e585f9-9c60-505d-b349-b65d1a20c969</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/OECD-Statistics.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/5fn__mpTK8o/OECD-Statistics.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
I am a sucker for good quality data.  I <a href="http://www.cybaea.net/Blogs/Data/Data_gov.html">wrote about data.gov</a>, the US Government data site before, and now I find <a href="http://stats.oecd.org/">OECD Statistics</a> which has some 300 data sets, many of which seems to be readily accessible (though some may require subscription)
</p>
</div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
I am a sucker for good quality data.  I &lt;a href="http://www.cybaea.net/Blogs/Data/Data_gov.html"&gt;wrote about data.gov&lt;/a&gt;, the US Government data site before, and now I find &lt;a href="http://stats.oecd.org/"&gt;OECD Statistics&lt;/a&gt; which has some 300 data sets, many of which seems to be readily accessible (though some may require subscription)&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Exports in multiple formats, including Excel, CSV, and &lt;a href="http://sdmx.org/"&gt;SDMX&lt;/a&gt;.&#xD;
&lt;/p&gt;&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=5fn__mpTK8o:hTjwnwI_7oI:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=5fn__mpTK8o:hTjwnwI_7oI:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=5fn__mpTK8o:hTjwnwI_7oI:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=5fn__mpTK8o:hTjwnwI_7oI:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=5fn__mpTK8o:hTjwnwI_7oI:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=5fn__mpTK8o:hTjwnwI_7oI:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=5fn__mpTK8o:hTjwnwI_7oI:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=5fn__mpTK8o:hTjwnwI_7oI:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=5fn__mpTK8o:hTjwnwI_7oI:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/5fn__mpTK8o" height="1" width="1"/&gt;</content><published>2009-07-02T20:33:00Z</published><updated>2009-07-02T20:33:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/OECD-Statistics.html</feedburner:origLink></entry><entry><title type="text">R tips: Determine if function is called from specific package</title><id>urn:uuid:08988ce4-0f96-564e-9575-7c6f2ff16147</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-tips-Determine-if-function-is-called-from-specific-package.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/mqFcMzo8FLQ/R-tips-Determine-if-function-is-called-from-specific-package.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
I like the "multicore" library for a particular task.  I can easily write a combination of<code> if(require("multicore",...))</code> that means that my function will automatically use the parallel <code>mclapply()</code> instead of <code>lapply()</code> where it is available.  Which is grand 99% of the time, except when my function is called from <code>mclapply()</code> (or one of the lower level functions) in which case much CPU trashing and grinding of teeth will result.
</p>
<p>
So, I needed a function to determine if my function was called from any function in the "multicore" library.  Here it is.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
I like the "multicore" library for a particular task.  I can easily write a combination of&lt;code&gt; if(require("multicore",...))&lt;/code&gt; that means that my function will automatically use the parallel &lt;code&gt;mclapply()&lt;/code&gt; instead of &lt;code&gt;lapply()&lt;/code&gt; where it is available.  Which is grand 99% of the time, except when my function is called from &lt;code&gt;mclapply()&lt;/code&gt; (or one of the lower level functions) in which case much CPU trashing and grinding of teeth will result.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
So, I needed a function to determine if my function was called from any function in the "multicore" library.  Here it is.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
First define a generally useful function:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="code" title="is.in.namespace()"&gt;&#xD;
is.in.namespace &amp;lt;-&#xD;
function (ns) {&#xD;
  for ( frame in seq(1, sys.nframe(), 1) ) {&#xD;
	fun &amp;lt;- sys.function(frame);&#xD;
	env &amp;lt;- environment(fun)&#xD;
	n   &amp;lt;- environmentName(env)&#xD;
	if ( n == ns ) return(TRUE);&#xD;
  }&#xD;
  return(FALSE);&#xD;
}&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Then we use it for our purpose:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&#xD;
is.in.multicore &amp;lt;- function (...) { return(is.in.namespace("multicore")) }&#xD;
library("multicore")&#xD;
stopifnot( mclapply(as.list(1), is.in.multicore)[[1]] == TRUE )&#xD;
stopifnot(   lapply(as.list(1), is.in.multicore)[[1]] == FALSE )&#xD;
stopifnot( local( {mclapply &amp;lt;- function(x) return(x); mclapply(is.in.multicore())} ) == FALSE )&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Easy when you know how.&#xD;
&lt;/p&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mqFcMzo8FLQ:I9l-WH946Nc:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mqFcMzo8FLQ:I9l-WH946Nc:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=mqFcMzo8FLQ:I9l-WH946Nc:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mqFcMzo8FLQ:I9l-WH946Nc:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=mqFcMzo8FLQ:I9l-WH946Nc:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mqFcMzo8FLQ:I9l-WH946Nc:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mqFcMzo8FLQ:I9l-WH946Nc:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=mqFcMzo8FLQ:I9l-WH946Nc:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mqFcMzo8FLQ:I9l-WH946Nc:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/mqFcMzo8FLQ" height="1" width="1"/&gt;</content><published>2009-06-16T10:27:00Z</published><updated>2009-06-16T10:27:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-tips-Determine-if-function-is-called-from-specific-package.html</feedburner:origLink></entry><entry><title type="text">R tips: Installing Rmpi on Fedora Linux</title><id>urn:uuid:57259815-f049-5226-bda6-95b15ae0f4f2</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-tips-Installing-Rmpi-on-Fedora-Linux.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/z6AZUNX1s3Y/R-tips-Installing-Rmpi-on-Fedora-Linux.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>Somebody on the R-help mailing list asked how to get <a href="http://cran.r-project.org/web/packages/Rmpi/index.html">Rmpi</a> working on his Fedora Linux machine so he could do high-performance computing on a cluster of machines (or a single multicore machine) using the <a href="http://www.r-project.org/">R statistical computing and analysis platform</a>.  Since it is unusually painful to get working, I might as well copy the instructions here.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;Somebody on the R-help mailing list asked how to get &lt;a href="http://cran.r-project.org/web/packages/Rmpi/index.html"&gt;Rmpi&lt;/a&gt; working on his Fedora Linux machine so he could do high-performance computing on a cluster of machines (or a single multicore machine) using the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt;.  Since it is unusually painful to get working, I might as well copy the instructions here.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
The problem is the configuration file &lt;code&gt;configure.ac&lt;/code&gt; which is, unfortunately, completely brain-damaged with hard-coded assumptions about which subdirectories should contain header and library files and no way of overriding it.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
First install the &lt;a href="http://www.open-mpi.org/"&gt;openmpi&lt;/a&gt; libraries using:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;yum install openmpi openmpi-devel openmpi-libs&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Then download the latest &lt;a href="http://cran.r-project.org/web/packages/Rmpi/index.html"&gt;Rmpi&lt;/a&gt; package from CRAN and unpack it using &lt;code&gt;tar zxvf Rmpi_0.5-7.tar.gz&lt;/code&gt;.  Go to the new &lt;code&gt;Rmpi&lt;/code&gt; directory and replace the file &lt;code&gt;configure.ac&lt;/code&gt; with the one below (for a x86_64 system; for 32 bit you probably need to change &lt;code&gt;-64&lt;/code&gt; to &lt;code&gt;-32&lt;/code&gt;):&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="configure.ac"&gt; Process this file with autoconf to produce a configure script.&#xD;
&#xD;
AC_INIT(DESCRIPTION)&#xD;
&#xD;
AC_PROG_CC&#xD;
&#xD;
MPI_LIBS=`pkg-config --libs openmpi-1.3.1-gcc-64`&#xD;
MPI_INCLUDE=`pkg-config --cflags openmpi-1.3.1-gcc-64`&#xD;
MPITYPE="OPENMPI"&#xD;
MPI_DEPS="-DMPI2"&#xD;
&#xD;
AC_CHECK_LIB(util, openpty, [ MPI_LIBS="$MPI_LIBS -lutil" ])&#xD;
AC_CHECK_LIB(pthread, main, [ MPI_LIBS="$MPI_LIBS -lpthread" ])&#xD;
&#xD;
PKG_LIBS="${MPI_LIBS} -fPIC"&#xD;
PKG_CPPFLAGS="${MPI_INCLUDE} ${MPI_DEPS} -D${MPITYPE} -fPIC"&#xD;
&#xD;
AC_SUBST(PKG_LIBS)&#xD;
AC_SUBST(PKG_CPPFLAGS)&#xD;
AC_SUBST(DEFS)&#xD;
&#xD;
AC_OUTPUT(src/Makevars) &#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The number 1.3.1 may change in future releases of Fedora: see &lt;code&gt;/usr/lib64/pkgconfig/openmpi-*.pc&lt;/code&gt; for the current value.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Still in the &lt;code&gt;Rmpi&lt;/code&gt; directory do the following in your shell:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;autoconf&#xD;
cd ..&#xD;
tar zcvf Rmpi_0.5-7-F11.tar.gz Rmpi&#xD;
R CMD INSTALL Rmpi_0.5-7-F11.tar.gz &#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;Now it should be working in R:&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; library("Rmpi")&#xD;
&amp;gt; mpi.spawn.Rslaves(nslaves=2)&#xD;
    2 slaves are spawned successfully. 0 failed.&#xD;
master (rank 0, comm 1) of size 3 is running on: server&#xD;
slave1 (rank 1, comm 1) of size 3 is running on: server&#xD;
slave2 (rank 2, comm 1) of size 3 is running on: server&#xD;
&amp;gt; x &amp;lt;- c(10,20)&#xD;
&amp;gt; mpi.apply(x,runif)&#xD;
[[1]]&#xD;
 [1] 0.25142616 0.93505554 0.03162852 0.71783194 0.35916139 0.85082154&#xD;
 [7] 0.35404191 0.14221315 0.60063773 0.71805190&#xD;
&#xD;
[[2]]&#xD;
 [1] 0.84157864 0.63481773 0.38217188 0.67839089 0.27827728 0.35429266&#xD;
 [7] 0.04898744 0.96601584 0.25687905 0.77381186 0.69011927 0.37391028&#xD;
[13] 0.19017369 0.51196594 0.51970563 0.15791524 0.21358237 0.69642478&#xD;
[19] 0.12690207 0.44177656&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Painful.&#xD;
&lt;/p&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=z6AZUNX1s3Y:pAonRiKpcZg:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=z6AZUNX1s3Y:pAonRiKpcZg:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=z6AZUNX1s3Y:pAonRiKpcZg:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=z6AZUNX1s3Y:pAonRiKpcZg:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=z6AZUNX1s3Y:pAonRiKpcZg:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=z6AZUNX1s3Y:pAonRiKpcZg:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=z6AZUNX1s3Y:pAonRiKpcZg:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=z6AZUNX1s3Y:pAonRiKpcZg:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=z6AZUNX1s3Y:pAonRiKpcZg:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/z6AZUNX1s3Y" height="1" width="1"/&gt;</content><published>2009-06-12T10:23:00Z</published><updated>2009-06-12T10:23:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-tips-Installing-Rmpi-on-Fedora-Linux.html</feedburner:origLink></entry><entry><title type="text">Data Mashups in R from O'Reilly</title><id>urn:uuid:edb63dc9-21f9-5664-8b35-afb01d7d6472</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Data-Mashups-in-R-from-O_Reilly.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/QV_4TAhfmFU/Data-Mashups-in-R-from-O_Reilly.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><div class="floatRight">
<a href="http://www.cybaea.net/Blogs/Data/Data-Mashups-in-R-from-O_Reilly.html" title="Click for full article"><img src="http://static.cybaea.net/images/fc_heat_small.png" width="150" height="150" alt="[Philadelphia County July 2009 Foreclosure Heat Map]" /></a>
</div>
<p>
O’Reilly has published <a href="http://oreilly.com/catalog/9780596804770/" title="Data Mashups in R ">Data Mashups in R</a> as a $4.99 PDF download in their Short Cut series.  In 27 pages it takes you through an example of how to combine foreclosure information with maps and geographical information to produce plots like the one here.  This is all done with the <a href="http://www.r-project.org/">R statistical computing and analysis platform</a>.
</p>
</div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
O’Reilly has published &lt;a href="http://oreilly.com/catalog/9780596804770/" title="Data Mashups in R "&gt;Data Mashups in R&lt;/a&gt; as a $4.99 PDF download in their Short Cut series.  In 27 pages it takes you through an example of how to combine foreclosure information with maps and geographical information to produce plots like the one below.  This is all done with the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt;.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
&lt;a href="http://static.cybaea.net/images/fc_heat.png" title="Larger version of Philadelphia County July 2009 Foreclosure Heat Map"&gt;&lt;img src="http://static.cybaea.net/images/fc_heat_medium.png" width="400" height="400" alt="[Philadelphia County July 2009 Foreclosure Heat Map]"&gt;&lt;/img&gt;&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
They show how to:&#xD;
&lt;/p&gt;&#xD;
&lt;ul&gt;&#xD;
&lt;li&gt;Use regular expressions to parse HTML files&lt;/li&gt;&#xD;
&lt;li&gt;Use the &lt;a href="http://cran.r-project.org/web/packages/XML/index.html"&gt;XML&lt;/a&gt; package to parse XML data from a web service (&lt;a href="http://developer.yahoo.com/maps/rest/V1/geocode.html"&gt;Yahoo! Geocode&lt;/a&gt;)&lt;/li&gt;&#xD;
&lt;li&gt;Find ERSI shape files for your maps&lt;/li&gt;&#xD;
&lt;li&gt;Use &lt;a href="http://cran.r-project.org/web/packages/PBSmapping/index.html"&gt;PBSmapping&lt;/a&gt; to process and display geographical data (GIS)&lt;/li&gt;&#xD;
&lt;li&gt;Importing and using US Census data with your maps&lt;/li&gt;&#xD;
&lt;/ul&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=QV_4TAhfmFU:sCLEIdyPB64:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=QV_4TAhfmFU:sCLEIdyPB64:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=QV_4TAhfmFU:sCLEIdyPB64:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=QV_4TAhfmFU:sCLEIdyPB64:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=QV_4TAhfmFU:sCLEIdyPB64:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=QV_4TAhfmFU:sCLEIdyPB64:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=QV_4TAhfmFU:sCLEIdyPB64:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=QV_4TAhfmFU:sCLEIdyPB64:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=QV_4TAhfmFU:sCLEIdyPB64:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/QV_4TAhfmFU" height="1" width="1"/&gt;</content><published>2009-06-09T11:23:00Z</published><updated>2009-06-09T11:23:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Data-Mashups-in-R-from-O_Reilly.html</feedburner:origLink></entry><entry><title type="text">How to win the KDD Cup Challenge with R and gbm</title><id>urn:uuid:3fb3545e-ea30-5e3b-8f8b-8902f107b81d</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/How-to-win-the-KDD-Cup-Challenge-with-R-and-gbm.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/iWBVzSGe3Aw/How-to-win-the-KDD-Cup-Challenge-with-R-and-gbm.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
Hugh Miller, the team leader of the winner of the KDD Cup 2009 Slow Challenge (which we wrote about <a href="http://www.cybaea.net/Blogs/Data/R-used-by-KDD-2009-cup-winner-of-slow-challenge.html">recently</a>) kindly provides more information about how to win this public challenge using the <a href="http://www.r-project.org/">R statistical computing and analysis platform</a> on a laptop (!).
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
Hugh Miller, the team leader of the winner of the KDD Cup 2009 Slow Challenge (which we wrote about &lt;a href="http://www.cybaea.net/Blogs/Data/R-used-by-KDD-2009-cup-winner-of-slow-challenge.html"&gt;recently&lt;/a&gt;) kindly provides more information about how to win this public challenge using the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt; on a laptop (!).&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
As a reminder of &lt;a href="http://www.cybaea.net/Blogs/Journal/KDD-Cup-2009.html"&gt;what we wrote before&lt;/a&gt;, the challenge provided two anonymized data set each of 50,000 mobile teleco customers and each entry having 15,000 variables.  The task was to find the best churn, up-, and cross-sell models.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Hugh summarizes his team’s approach:&#xD;
&lt;/p&gt;&#xD;
&lt;blockquote&gt;&#xD;
&lt;p&gt;&#xD;
Feature selection was an important first step [we &lt;a href="http://www.cybaea.net/Blogs/Data/R-used-by-KDD-2009-cup-winner-of-slow-challenge.html"&gt;mentioned before&lt;/a&gt; that this is key for all successful data mining projects – AE]. We looked at how effective each individual variable was as a predictor, which also allowed us to reading parts of the data only, &lt;em&gt;as the whole dataset didn’t fit in memory&lt;/em&gt; [my emphasis – AE]. The assessment here was homebrew, making a simple predictor on half the data and measuring performance (by the AUC measure) on the other half:&#xD;
&lt;/p&gt;&#xD;
&lt;ul&gt;&#xD;
&lt;li&gt;For categorical variables we just took the average number of 1's in the response for each category and used this as a predictor&lt;/li&gt;&#xD;
&lt;li&gt;For continuous variables we split the variable up into "bins", as you would a histogram, and again took the average number of 1's in the response for each bin as the predictor.&lt;/li&gt;&#xD;
&lt;/ul&gt;&#xD;
&lt;p&gt;&#xD;
From this we came up with a set of about 200 variables for each model, which we continued to tinker with. The main model was a gradient boosted machine which used the "&lt;a href="http://www.stats.bris.ac.uk/R/web/packages/gbm/index.html"&gt;gbm&lt;/a&gt;" package in &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;. This basically fits a series of small decision trees, up-weighting the observations that are predicted poorly at each iteration. We used Bernoulli loss and also up-weighted the "1" response class. A fair amount of time was spent optimising the number of trees, how big they should be etc, but a fit of 5,000 trees only took a bit over an hour to fit. The package itself is quite powerful as it gives some useful diagnostics such as relative variable importance, allowing us to exclude some and include others.&#xD;
&lt;/p&gt;&lt;p&gt;&#xD;
We used trees to avoid doing much data cleaning – they automatically allow for extreme results, non-linearity, missing values and handle both categorical and continuous variables. The main adjustment we had to make was to aggregate the smaller categories in the categorical variables, as they tended to distort the fits.&#xD;
&lt;/p&gt;&#xD;
&lt;/blockquote&gt;&#xD;
&lt;p&gt;&#xD;
They did this on standard Windows laptops (Intel Core 2 Duo 2.66GHz processor, 2GB RAM, 120Gb hard drive) against a competition that had more computing clusters available than Imelda Marcos had shoes.  It is not what you’ve got, it’s how you use it &lt;tt&gt;:-)&lt;/tt&gt;.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Congratulations to Hugh and his team!&#xD;
&lt;/p&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=iWBVzSGe3Aw:qxEUGcIYUEk:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=iWBVzSGe3Aw:qxEUGcIYUEk:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=iWBVzSGe3Aw:qxEUGcIYUEk:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=iWBVzSGe3Aw:qxEUGcIYUEk:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=iWBVzSGe3Aw:qxEUGcIYUEk:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=iWBVzSGe3Aw:qxEUGcIYUEk:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=iWBVzSGe3Aw:qxEUGcIYUEk:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=iWBVzSGe3Aw:qxEUGcIYUEk:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=iWBVzSGe3Aw:qxEUGcIYUEk:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/iWBVzSGe3Aw" height="1" width="1"/&gt;</content><published>2009-06-01T07:07:00Z</published><updated>2009-06-01T07:07:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/How-to-win-the-KDD-Cup-Challenge-with-R-and-gbm.html</feedburner:origLink></entry><entry><title type="text">R used by KDD 2009 cup winner of slow challenge</title><id>urn:uuid:23be031b-ddb6-5244-ab24-77042c61951c</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-used-by-KDD-2009-cup-winner-of-slow-challenge.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/OqKxuXq79pQ/R-used-by-KDD-2009-cup-winner-of-slow-challenge.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
The results from the <a href="http://www.cybaea.net/Blogs/Journal/KDD-Cup-2009.html">KDD Cup 2009 challenge</a> (which we wrote about before) are in, and the winner of the slow challenge used the <a href="http://www.r-project.org">R statistical computing and analysis platform</a> for their winning submission.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
The results from the &lt;a href="http://www.cybaea.net/Blogs/Journal/KDD-Cup-2009.html"&gt;KDD Cup 2009 challenge&lt;/a&gt; (which we wrote about before) are in, and the winner of the slow challenge used the &lt;a href="http://www.r-project.org"&gt;R statistical computing and analysis platform&lt;/a&gt; for their winning submission.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
The &lt;a href="http://www.kddcup-orange.com/factsheet.php?id=21"&gt;write up&lt;/a&gt; (username/password may be required) from &lt;a href="http://www.ms.unimelb.edu.au/Personnel/profile.php?PC_id=590"&gt;Hugh Miller&lt;/a&gt; and team at the University of Melbourne includes these points:&#xD;
&lt;/p&gt;&#xD;
&lt;ul&gt;&#xD;
&lt;li&gt;Decision tree, stub, or Random Forest as base classifiers with Logistic loss or cross-entropy loss function&lt;/li&gt;&#xD;
&lt;li&gt;Models fit in an hour or so&lt;/li&gt;&#xD;
&lt;li&gt;Used the &lt;a href="http://www.r-project.org"&gt;R statistical package&lt;/a&gt;&lt;/li&gt;&#xD;
&lt;li&gt;Most of models run on Windows laptop with Intel Core 2 Duo 2.66GHz processor, 2GB RAM, 120Gb hard drive.&lt;/li&gt;&#xD;
&lt;/ul&gt;&#xD;
&lt;p&gt;&#xD;
Impressive hardware selection!  Well done R.  Weka was another popular tool among the top entrants.  Key for all of them were clever data preparation and variable substitution.  The fast track winners from IBM document this in some detail:&#xD;
&lt;/p&gt;&#xD;
&lt;blockquote&gt;&#xD;
&lt;p&gt;&#xD;
We normalized the numerical variables by range, keeping the sparsity. For the categorical variables, we coded them using at most 11 binary columns for each variable. For each categorical variable, we generated a binary feature for each of the ten most common values, encoding whether the instance had this value or not. The eleventh column encoded whether the instance had a value that was not among the top ten most common values. We removed constant attributes, as well as duplicate attributes.&#xD;
&lt;/p&gt;&lt;p&gt;&#xD;
We replaced the missing values by mean for numerical attributes, and coded them as a separate value for discrete attributes. We also added a separate column for each numeric attribute with missing values, indicating wether the value was missing or not. We also tried another approach for imputing missing values based on KNN.&#xD;
&lt;/p&gt;&lt;p&gt;&#xD;
On the large data set we discretized the 100 numerical variables that had the highest mutual information with the target into 10 bins, and added them as extra features.&#xD;
&lt;/p&gt;&lt;p&gt;&#xD;
We tried PCA on the large data set, but it did not seem to help.&#xD;
&lt;/p&gt;&lt;p&gt;&#xD;
Because we noticed that some of the most predictive attributes were not linearly correlated with the targets, we build shallow decision trees (2-4 levels deep) using single numerical attributes and used their predictions as extra features. We also build shallow decision trees using two features at a time and used their prediction as an extra feature in the hope of capturing some non-additive interactions among features.&#xD;
&lt;/p&gt;&#xD;
&lt;/blockquote&gt;&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=OqKxuXq79pQ:MYgOwyfum9M:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=OqKxuXq79pQ:MYgOwyfum9M:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=OqKxuXq79pQ:MYgOwyfum9M:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=OqKxuXq79pQ:MYgOwyfum9M:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=OqKxuXq79pQ:MYgOwyfum9M:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=OqKxuXq79pQ:MYgOwyfum9M:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=OqKxuXq79pQ:MYgOwyfum9M:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=OqKxuXq79pQ:MYgOwyfum9M:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=OqKxuXq79pQ:MYgOwyfum9M:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/OqKxuXq79pQ" height="1" width="1"/&gt;</content><published>2009-05-31T13:17:00Z</published><updated>2009-05-31T13:17:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-used-by-KDD-2009-cup-winner-of-slow-challenge.html</feedburner:origLink></entry><entry><title type="text">R tips: Use read.table instead of strsplit to split a text column into multiple columns</title><id>urn:uuid:60775fac-6d0b-5d55-9e76-eb21bdde97c1</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-tips-Use-read_table-instead-of-strsplit-to-split-a-text-column-into-multiple-columns.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/uowubtD_s_4/R-tips-Use-read_table-instead-of-strsplit-to-split-a-text-column-into-multiple-columns.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
Someone on the R-help mailing list had a data frame with a column containing IP addresses in quad-dot format (e.g. 1.10.100.200).  He wanted to sort by this column and I proposed a solution involving <code>strsplit</code>.  But <a href="http://staff.pubhealth.ku.dk/~pd/">Peter Dalgaard</a> comes up with a much nicer method using <code>read.table</code> on a <code>textConnection</code> object:
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
Someone on the R-help mailing list had a data frame with a column containing IP addresses in quad-dot format (e.g. 1.10.100.200).  He wanted to sort by this column and I proposed a solution involving &lt;code&gt;strsplit&lt;/code&gt;.  But &lt;a href="http://staff.pubhealth.ku.dk/~pd/"&gt;Peter Dalgaard&lt;/a&gt; comes up with a much nicer method using &lt;code&gt;read.table&lt;/code&gt; on a &lt;code&gt;textConnection&lt;/code&gt; object:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; a &amp;lt;- data.frame(cbind(color=c("yellow","red","blue","red"),&#xD;
                        status=c("no","yes","yes","no"),&#xD;
                        ip=c("162.131.58.26","2.131.58.16","2.2.58.10","162.131.58.17")))&#xD;
&amp;gt; con &amp;lt;- textConnection(as.character(a$ip))&#xD;
&amp;gt; o &amp;lt;- do.call(order,read.table(con, sep="."))&#xD;
&amp;gt; close(con)&#xD;
&amp;gt; a[o,]&#xD;
   color status            ip&#xD;
3   blue    yes     2.2.58.10&#xD;
2    red    yes   2.131.58.16&#xD;
4    red     no 162.131.58.17&#xD;
1 yellow     no 162.131.58.26&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
That is very, very neat!  Thank you Peter.&#xD;
&lt;/p&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=uowubtD_s_4:JUmeZOR5FJ8:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=uowubtD_s_4:JUmeZOR5FJ8:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=uowubtD_s_4:JUmeZOR5FJ8:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=uowubtD_s_4:JUmeZOR5FJ8:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=uowubtD_s_4:JUmeZOR5FJ8:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=uowubtD_s_4:JUmeZOR5FJ8:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=uowubtD_s_4:JUmeZOR5FJ8:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=uowubtD_s_4:JUmeZOR5FJ8:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=uowubtD_s_4:JUmeZOR5FJ8:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/uowubtD_s_4" height="1" width="1"/&gt;</content><published>2009-05-29T10:53:00Z</published><updated>2009-05-29T10:53:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-tips-Use-read_table-instead-of-strsplit-to-split-a-text-column-into-multiple-columns.html</feedburner:origLink></entry><entry><title type="text">Data.gov</title><id>urn:uuid:a914e8e4-59f0-5054-a5ce-d0aa76d47247</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Data_gov.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/CaDId-zLq1A/Data_gov.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
I am always on the lookout for useful data sources for training in statistics, so I am excited that <a href="http://www.data.gov/">Data.gov</a> has opened for business.  The purpose of Data.gov is to increase public access to high value, machine readable datasets generated by the US Government. 
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
I am always on the lookout for useful data sources for training in statistics, so I am excited that &lt;a href="http://www.data.gov/"&gt;Data.gov&lt;/a&gt; has opened for business.  The purpose of Data.gov is to increase public access to high value, machine readable datasets generated by the US Government. &#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
This is a great initiative which I look forward to explore when I am not in a tiny airport at 3 am (but hey: they have free wifi) and which I hope other countries will take up.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Are there other catalogues of data sets that you use?&#xD;
&lt;/p&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=CaDId-zLq1A:70r8fJ5coR4:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=CaDId-zLq1A:70r8fJ5coR4:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=CaDId-zLq1A:70r8fJ5coR4:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=CaDId-zLq1A:70r8fJ5coR4:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=CaDId-zLq1A:70r8fJ5coR4:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=CaDId-zLq1A:70r8fJ5coR4:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=CaDId-zLq1A:70r8fJ5coR4:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=CaDId-zLq1A:70r8fJ5coR4:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=CaDId-zLq1A:70r8fJ5coR4:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/CaDId-zLq1A" height="1" width="1"/&gt;</content><published>2009-05-22T02:23:00Z</published><updated>2009-05-22T02:23:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Data_gov.html</feedburner:origLink></entry><entry><title type="text">SNA with R: Loading large networks using the igraph library</title><id>urn:uuid:8764d0b0-00b6-5d9b-9c45-5d3373bc97a8</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/SNA-with-R-Loading-large-networks-using-the-igraph-library.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/UafsWYtoE_U/SNA-with-R-Loading-large-networks-using-the-igraph-library.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
We are interested in Social Network Analysis using the statistical analysis and computing platform <a href="http://www.r-project.org/">R</a>.  The documentation for R is voluminous but typically not very good, so this entry is part of a series where we document what we learn as we explore the tool and the packages.
</p>
<p>
In <a href="http://www.cybaea.net/Blogs/Data/SNA-with-R-Loading-your-network-data.html">our previous post on SNA</a> we gave up on using the <code>statnet</code> package because it was not able to handle our data volumes.  In this entry we have better success with the <code>igraph</code> package.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
We are interested in Social Network Analysis using the statistical analysis and computing platform &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;.  The documentation for R is voluminous but typically not very good, so this entry is part of a series where we document what we learn as we explore the tool and the packages.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
In &lt;a href="http://www.cybaea.net/Blogs/Data/SNA-with-R-Loading-your-network-data.html"&gt;our previous post on SNA&lt;/a&gt; we gave up on using the &lt;code&gt;statnet&lt;/code&gt; package because it was not able to handle our data volumes.  In this entry we have better success with the &lt;code&gt;igraph&lt;/code&gt; package.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
The task we are considering is still how to load the network data into the R package’s internal representation.  We will assume that the raw data for our analysis is in a transactional format that is typical at least in the Telecommunications and Finance industries.  In the former the terminology is Call Detail Record (CDR) and an extract may look a little like the following:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="Sample Call Detail Records"&gt;&#xD;
&lt;b&gt;          src,         dest,     start,  duration,type,...&lt;/b&gt;&#xD;
+447000000005,+447000000006,1238510028,        52,call,...&#xD;
+447000000006,+447000000009,1238510627,       154,call,...&#xD;
+447000000009,+447000000007,1238511103,        48,call,...&#xD;
+447000000006,+447000000005,1238511145,        49,call,...&#xD;
+447000000006,+447000000005,1238511678,        12,call,...&#xD;
+447000000001,+447000000006,1238511735,       147,call,...&#xD;
+447000000007,+447000000009,1238511806,        26,call,...&#xD;
+447000000000,+447000000008,1238511825,        19,call,...&#xD;
+447000000009,+447000000008,1238511900,        28,call,...&#xD;
...&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Here a record indicates that the customer identified as &lt;var&gt;src&lt;/var&gt; called (&lt;var&gt;type&lt;/var&gt;=call) the customer &lt;var&gt;dest&lt;/var&gt; at the given time &lt;var&gt;start&lt;/var&gt; and the call lasted &lt;var&gt;duration&lt;/var&gt; seconds.  In general, there will be (many) more attributes describing the transaction which are represented by the &lt;var&gt;...&lt;/var&gt;.  In a Financial Services example, the records may be money transfers between accounts.&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;Loading the data in the &lt;code&gt;igraph&lt;/code&gt; package&lt;/h2&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
We are able to load the previous test data with 51 million records easily:&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="screen"&gt;&amp;gt; library("igraph")&#xD;
&amp;gt; m &amp;lt;- matrix(scan(bzfile("cdr.51M.csv.bz2", open="r"), &#xD;
+                  what=integer(0), skip=1, sep=','), &#xD;
+             ncol=4, byrow=TRUE)&#xD;
Read 205266564 items&#xD;
&amp;gt; ### Vertices are numbered from zero in the igraph library&#xD;
&amp;gt; m[,1] &amp;lt;- m[,1]-1; m[,2] &amp;lt;- m[,2]-1&#xD;
&amp;gt; g &amp;lt;- graph.edgelist(m[,c(2,1)])&#xD;
&amp;gt; E(g)$start    &amp;lt;- as.POSIXct(m[,3], origin="1970-01-01", tz="UTC")&#xD;
&amp;gt; E(g)$duration &amp;lt;- m[,4]&#xD;
&amp;gt; ns &amp;lt;- neighborhood.size(g, 1)&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
Time to up the ante!  We have a file with simulated call data records containing over 700 million entries where we suspect the algorithm used is under-estimating nodes with small connections.  Let’s check on the first ½ billion records (which seems to more-or-less fit in our available memory on this workstation):&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="screen"&gt;&amp;gt; library("igraph")&#xD;
### Note that R can only handle 2^31-1 elements in a vector (on any&#xD;
### platform, including 64-bit), so we need to read this file as a&#xD;
### list.&#xD;
&amp;gt; s &amp;lt;- scan("cdr.1e6x1e1.csv", what=rep(list(integer(0)),4), skip=1, sep=',', multi.line=FALSE)&#xD;
Read 700466826 records&#xD;
&amp;gt; m &amp;lt;- as.vector(rbind(s[[2]], s[[1]]))&#xD;
&amp;gt; print(length(m))&#xD;
[1] 1400933652&#xD;
&amp;gt; length(m) &amp;lt;- 1e9&#xD;
&amp;gt; g &amp;lt;- graph(m, directed=TRUE)&#xD;
&amp;gt; ns &amp;lt;- neighborhood.size(g, 1)&#xD;
&amp;gt; summary(ns)&#xD;
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. &#xD;
   1.00   35.00   40.00   42.92   47.00  101.00 &#xD;
&amp;gt; hist(ns, xlab="Neighborhood size", main="Distribution of neighborhood size", &#xD;
       sub="From cdr.1e6x1e1.1e9")&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;div class="floatRight"&gt;&#xD;
&lt;a href="http://static.cybaea.net/images/neighborhood_hist.png"&gt;&lt;img src="http://static.cybaea.net/images/neighborhood_hist_small.png" width="400" height="400" title="Distribution of neighborhood size" alt="[Distribution of neighborhood size plot]"&gt;&lt;/img&gt;&lt;/a&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
As we suspected, the Monte Carlo algorithm does not provide enough customers with low calling circle sizes.  Fortunately it is very easy to add these separately: the hard part is modelling the larger calling circles.  A mix of these two algorithms provide a reasonably good fit to actual customer behaviour.  (The cut-off at 100 is a parameter to our Monte Carlo simulation program which indeed was 100 for this run.)&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;Problems&lt;/h2&gt;&#xD;
&lt;p&gt;However, it is not all perfect.  When we attempt to add the edge parameters in the obvious way it fails:&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="screen"&gt;&amp;gt; length(s[[3]]) &amp;lt;- 0.5e9&#xD;
&amp;gt; length(s[[4]]) &amp;lt;- 0.5e9&#xD;
&amp;gt; E(g)$start     &amp;lt;- s[[3]]&#xD;
Error: cannot allocate vector of size 3.7 Gb&#xD;
Execution halted&#xD;
&amp;gt; E(g)$duration  &amp;lt;- s[[4]]&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
So we are just at the limit.  Probably 100 million records is OK in this environment.  But &lt;a href="http://igraph.sourceforge.net/"&gt;the core igraph library&lt;/a&gt; is accessible from C so better performance can probably be achieved this way and certainly pointers are 8 byte structures on this machine so we should not have the silly limits that R imposes on us.&#xD;
&lt;/p&gt;&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=UafsWYtoE_U:VuX19eZpOZo:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=UafsWYtoE_U:VuX19eZpOZo:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=UafsWYtoE_U:VuX19eZpOZo:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=UafsWYtoE_U:VuX19eZpOZo:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=UafsWYtoE_U:VuX19eZpOZo:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=UafsWYtoE_U:VuX19eZpOZo:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=UafsWYtoE_U:VuX19eZpOZo:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=UafsWYtoE_U:VuX19eZpOZo:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=UafsWYtoE_U:VuX19eZpOZo:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/UafsWYtoE_U" height="1" width="1"/&gt;</content><published>2009-05-06T15:33:00Z</published><updated>2009-05-06T15:33:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/SNA-with-R-Loading-large-networks-using-the-igraph-library.html</feedburner:origLink></entry><entry><title type="text">SNA with R: Loading your network data</title><id>urn:uuid:048bbc8f-cad3-5c39-8dee-7c05fb4204ca</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/SNA-with-R-Loading-your-network-data.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/tZM7HWtF50M/SNA-with-R-Loading-your-network-data.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
We are interested in Social Network Analysis using the statistical analysis and computing platform <a href="http://www.r-project.org/">R</a>.  As usual with R, the documentation is pretty bad, so this series collects our notes as we learn more about the available packages and how they work.  We use here the <a href="http://csde.washington.edu/statnet/index.shtml">statnet</a> group of packages, which seems to be the most comprehensive and most actively maintained network analysis packages.
</p>
<p>
The first task which we consider in this post is to load our data into a <code>network</code> object, which is how all the <code>statnet</code> packages represent a network.  Typically for R, the documentation is voluminous but not always as helpful as one could want.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
We are interested in Social Network Analysis using the statistical analysis and computing platform &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;.  As usual with R, the documentation is pretty bad, so this series collects our notes as we learn more about the available packages and how they work.  We use here the &lt;a href="http://csde.washington.edu/statnet/index.shtml"&gt;statnet&lt;/a&gt; group of packages, which seems to be the most comprehensive and most actively maintained network analysis packages.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
The first task which we consider in this post is to load our data into a &lt;code&gt;network&lt;/code&gt; object, which is how all the &lt;code&gt;statnet&lt;/code&gt; packages represent a network.  Typically for R, the documentation is voluminous but not always as helpful as one could want.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
We will assume that the raw data for our analysis is in a transactional format that is typical at least in the Telecommunications and Finance industries.  In the former the terminology is Call Detail Record (&lt;dfn title="Call Detail Record"&gt;CDR&lt;/dfn&gt;) and an extract may look a little like the following:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="Sample Call Detail Records"&gt;&#xD;
&lt;b&gt;          src,         dest,     start,  duration,type,...&lt;/b&gt;&#xD;
+447000000005,+447000000006,1238510028,        52,call,...&#xD;
+447000000006,+447000000009,1238510627,       154,call,...&#xD;
+447000000009,+447000000007,1238511103,        48,call,...&#xD;
+447000000006,+447000000005,1238511145,        49,call,...&#xD;
+447000000006,+447000000005,1238511678,        12,call,...&#xD;
+447000000001,+447000000006,1238511735,       147,call,...&#xD;
+447000000007,+447000000009,1238511806,        26,call,...&#xD;
+447000000000,+447000000008,1238511825,        19,call,...&#xD;
+447000000009,+447000000008,1238511900,        28,call,...&#xD;
...&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Here a record indicates that the customer identified as &lt;var&gt;src&lt;/var&gt; called (&lt;var&gt;type&lt;/var&gt;=call) the customer &lt;var&gt;dest&lt;/var&gt; at the given time &lt;var&gt;start&lt;/var&gt; and the call lasted &lt;var&gt;duration&lt;/var&gt; seconds.  In general, there will be (many) more attributes describing the transaction which are represented by the &lt;var&gt;...&lt;/var&gt;.  In a Financial Services example, the records may be money transfers between accounts.&#xD;
&lt;/p&gt;&#xD;
&lt;h2&gt;Implementation in the &lt;code&gt;network&lt;/code&gt; class&lt;/h2&gt;&#xD;
&lt;p&gt;&#xD;
In the naive implementation of this data as a network, we would have the sources and destinations (broadly speaking: people) as vertices and the calls as edges.  That broadly seems to make sense: people are connected by the calls they make, and that is the social relationship we wish to model.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
In the terminology of the &lt;code&gt;network&lt;/code&gt; class, that means that our network will be &lt;b&gt;directed&lt;/b&gt; (calls and money transfers have a direction &lt;em&gt;from&lt;/em&gt; one person &lt;em&gt;to&lt;/em&gt; another) and will need to allow &lt;b&gt;multiple&lt;/b&gt; edges between the same endpoints (because any one person can, and indeed usually will, make several calls to the same other person).&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
We could consider dropping the &lt;b&gt;multiple&lt;/b&gt; attribute of the network and instead represent the fact that A has called B with a single edge and perhaps have the number of calls and their total duration as an edge attribute.  We will investigate this another time, but it is surely a less faithful representation of the data that we have (and we would need to drop information like the time of call).&#xD;
&lt;/p&gt;&#xD;
&lt;h2&gt;Mapping customer identifiers to network vertex numbers&lt;/h2&gt;&#xD;
&lt;p&gt;&#xD;
One thing they seem to forget to tell you in the documentation is that when you import your data your vertex identifiers (which in our case is customer or account numbers) must be changed to number the vertices &lt;em&gt;and&lt;/em&gt; that this numbering must be sequential and start from 1.  Being used to an environment where the vertex identifiers are arbitrary (and arrays usually start from 0), this one had me puzzled for a while.  The error message that tells you your vertex numbering is not what the package expected is spectacularly unhelpful:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; n &amp;lt;- network(m, matrix.type="edgelist", directed=TRUE, multiple=TRUE)&#xD;
Error in add.edges(g, as.list(x[, 1]), as.list(x[, 2]), edge.check = edge.check) : &#xD;
  (edge check) Illegal vertex reference in addEdges_R.  Exiting.&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
For the discussion that follows, we will assume that you have changed your identifies externally to R.&#xD;
&lt;/p&gt;&#xD;
&lt;h2&gt;Loading the data&lt;/h2&gt;&#xD;
&lt;p&gt;&#xD;
The good news is that our data is essentially in a format that the &lt;code&gt;network&lt;/code&gt; package calls &lt;b&gt;edge list&lt;/b&gt; and which it can import directly.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
I say “essentially” because for some strange reason the package expects the destination to come before the source which seems ass-backwards to me.  But assume we have our data in a file &lt;code&gt;cdr.csv&lt;/code&gt; like this (we only have calls here):&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="cdr.csv"&gt;       src,      dest,     start,  duration&#xD;
         5,         6,1238510028,        52&#xD;
         6,         9,1238510627,       154&#xD;
         9,         7,1238511103,        48&#xD;
         6,         5,1238511145,        49&#xD;
...&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Then we can load the data into R easily:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; library("network")&#xD;
&amp;gt; m &amp;lt;- matrix(scan(file="cdr.csv", what=integer(0), skip=1, sep=','), ncol=4, byrow=TRUE)&#xD;
Read 1896 items&#xD;
&amp;gt; # &lt;a href="http://www.cybaea.net/Blogs/Data/R-tips-Swapping-columns-in-a-matrix.html"&gt;Swap columns&lt;/a&gt; for ass-backward network package&#xD;
&amp;gt; m[,c(1,2)] &amp;lt;- m[,c(2,1)]&#xD;
&#xD;
&amp;gt; # Create network&#xD;
&amp;gt; net &amp;lt;- network(m, matrix.type="edgelist", directed=TRUE, multiple=TRUE)&#xD;
&#xD;
&amp;gt; summary(net)&#xD;
Network attributes:&#xD;
 vertices = 10&#xD;
 directed = TRUE&#xD;
 hyper = FALSE&#xD;
 loops = FALSE&#xD;
 multiple = TRUE&#xD;
 bipartite = FALSE&#xD;
 total edges = 474 &#xD;
   missing edges = 0 &#xD;
   non-missing edges = 474 &#xD;
 density = 5.266667 &#xD;
&#xD;
Vertex attributes:&#xD;
 vertex.names:&#xD;
   character valued attribute&#xD;
   10 valid vertex names&#xD;
&#xD;
No edge attributes&#xD;
&#xD;
Network adjacency matrix:&#xD;
Error in as.matrix.network.adjacency(x = x, attrname = attrname, ...) : &#xD;
  Multigraphs not currently supported in as.matrix.network.adjacency.  Exiting.&#xD;
In addition: Warning message:&#xD;
In network.density(x) :&#xD;
  Network is multiplex - no general way to define density.  Returning value for a non-multiplex network (hope that's what you wanted).&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
OK, that's a lot of warnings, but it basically worked.  We have figured out how to load our network data into the &lt;a href="http://www.jstatsoft.org/v24/i02/"&gt;network&lt;/a&gt; package in R.&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;Performance&lt;/h2&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
We can’t do an exhaustive performance review now, but let us at least make sure we can load medium-sized networks.  We change our CDR simulator to emit the desitnation before the source just like &lt;code&gt;network&lt;/code&gt; likes it and let it run.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
The first file has 2,645,288 (simulated) CDR lines from 100k customers and it loads OK on our small development workstation even with the default settings:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; library("network")&#xD;
&amp;gt; n &amp;lt;- network(matrix(scan(file="&lt;a href="http://static.cybaea.net/files/cdr.1e5x1e0.csv.bz2"&gt;cdr.1e5x1e0.csv&lt;/a&gt;", &#xD;
                           what=integer(0), skip=1, sep=','), &#xD;
                      ncol=4, byrow=TRUE), &#xD;
               matrix.type="edgelist", directed=TRUE, multiple=TRUE)&#xD;
Read 10581152 items&#xD;
&amp;gt; proc.time()&#xD;
   user  system elapsed &#xD;
138.304   1.597 140.878 &#xD;
&amp;gt; save(n, file="n.RData", ascii=FALSE, compress=FALSE)&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The size of the saved network object is 373MB (only 27MB compressed).&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
We can potentially save some time and memory by not explicitly not performing the edge check (again: the documentation frustrates us and is silent on what the defaults are for the &lt;code&gt;network&lt;/code&gt; call we used above) so we try this for our next file with 51,316,641 lines of CDR data (again for 100k customers) which also saves us some column swapping:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; library("network")&#xD;
&amp;gt; m &amp;lt;- matrix(scan(file="cdr.51M.csv", &#xD;
                   what=integer(0), skip=1, sep=','),&#xD;
              ncol=4, byrow=TRUE)&#xD;
Read 205266564 items&#xD;
&amp;gt; num_vert &amp;lt;- max(m[,1], m[,2])&#xD;
&amp;gt; num_vert&#xD;
[1] 100000&#xD;
&amp;gt; n &amp;lt;- network.initialize(n=num_vert, directed=TRUE, multiple=TRUE)&#xD;
&amp;gt; add.edges(n, tail=m[,2], head=m[,1], edge.check=FALSE)&#xD;
&amp;gt; proc.time()&#xD;
&lt;i&gt;(several hours: I’ll let you know when it is done)&lt;/i&gt;&#xD;
&amp;gt; rm(m)&#xD;
&amp;gt; save(n, file="n.RData", ascii=FALSE, compress=TRUE)&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Our attempted optimization did not seem to matter and this network is too big for the machine and the &lt;code&gt;network&lt;/code&gt; package.  Building the network was painful as I was working on the workstation at the same time.  The machine has 16GB installed RAM, but it was clearly running out and swapping extensively.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
51 million might be a reasonable size data set for some Financial Services applications but it is clearly a trivial number of records for Telecommunications.  I’ll need to do some more digging around.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Does anybody have any SNA benchmarks?  I like the &lt;a href="http://www.kxen.com/"&gt;KXEN&lt;/a&gt; implementation for its simplicity and speed so I might get a copy and try it out.  Any R performance experts who could make suggestions in the comments?  How big are &lt;em&gt;your&lt;/em&gt; networks?&#xD;
&lt;/p&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tZM7HWtF50M:xgW7YjifVz4:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tZM7HWtF50M:xgW7YjifVz4:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=tZM7HWtF50M:xgW7YjifVz4:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tZM7HWtF50M:xgW7YjifVz4:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=tZM7HWtF50M:xgW7YjifVz4:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tZM7HWtF50M:xgW7YjifVz4:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tZM7HWtF50M:xgW7YjifVz4:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=tZM7HWtF50M:xgW7YjifVz4:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tZM7HWtF50M:xgW7YjifVz4:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/tZM7HWtF50M" height="1" width="1"/&gt;</content><published>2009-04-01T16:08:00Z</published><updated>2009-04-01T16:08:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/SNA-with-R-Loading-your-network-data.html</feedburner:origLink></entry><entry><title type="text">R tips: Swapping columns in a matrix</title><id>urn:uuid:64357922-73f7-5e9a-bb13-1c34f4f98256</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-tips-Swapping-columns-in-a-matrix.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/ez3q0foTWdw/R-tips-Swapping-columns-in-a-matrix.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
Swapping two columns in a matrix is really easy: <code>m[ , c(1,2)]  &lt;- m[ , c(2,1)]</code>.
</p>
</div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
Using &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;, the statistical analysis and computing platform, swapping two columns in a matrix is really easy: &lt;code&gt;m[ , c(1,2)]  &amp;lt;- m[ , c(2,1)]&lt;/code&gt;.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Note, however, that this does not swap the column names (if you have any) but only the values.  You could do something like &lt;code&gt;colnames(m)[c(1,2)] &amp;lt;- colnames(m)[c(2,1)]&lt;/code&gt; if you need the names changed as well, but better is perhaps just to assign:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;m &amp;lt;- m[ , c(2, 1, 3:ncol(m)) ]&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=ez3q0foTWdw:DJjK1El5AbU:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=ez3q0foTWdw:DJjK1El5AbU:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=ez3q0foTWdw:DJjK1El5AbU:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=ez3q0foTWdw:DJjK1El5AbU:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=ez3q0foTWdw:DJjK1El5AbU:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=ez3q0foTWdw:DJjK1El5AbU:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=ez3q0foTWdw:DJjK1El5AbU:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=ez3q0foTWdw:DJjK1El5AbU:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=ez3q0foTWdw:DJjK1El5AbU:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/ez3q0foTWdw" height="1" width="1"/&gt;</content><published>2009-03-31T15:59:00Z</published><updated>2009-03-31T15:59:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-tips-Swapping-columns-in-a-matrix.html</feedburner:origLink></entry><entry><title type="text">R tips: Eliminating the “save workspace image” prompt on exit</title><id>urn:uuid:9db2db2b-1dc0-55ff-b83d-db44ca9cb2b6</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-tips-Eliminating-the-save-workspace-image-prompt-on-exit.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/mZpStaDMWdA/R-tips-Eliminating-the-save-workspace-image-prompt-on-exit.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
When using <a href="http://www.r-project.org/">R</a>, the statistical analysis and computing platform, I find it really annoying that it always prompts to save the workspace when I exit.  This is how I turn it off.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
When using &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;, the statistical analysis and computing platform, I find it really annoying that it always prompts to save the workspace when I exit.  This is how I turn it off.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
I wish there was an option to change the default of the &lt;code&gt;q&lt;/code&gt;/&lt;code&gt;quit&lt;/code&gt; functions.  I start and stop R frequently and so the exit question which I &lt;em&gt;have&lt;/em&gt; to answer &lt;em&gt;every time&lt;/em&gt; is really annoying:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;Save workspace image? [y/n/c]:&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Why is there no R option to disable this prompt?  If I want to save the image, I have already saved it.  And I don’t like the default name anyhow, preferring to give my own with &lt;code&gt;save.image(file=...)&lt;/code&gt;.  For a while, I had a function defined in my &lt;code&gt;~/.Rprofile&lt;/code&gt; that terminated the session without prompting.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="~/.Rprofile"&gt;exit &amp;lt;- function() { q("no") }&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
While this means I can type &lt;code&gt;exit()&lt;/code&gt; and avoid the annoying prompt, in practice I normally type Control-D to end the session which still calls the normal &lt;code&gt;q&lt;/code&gt; function with its annoying prompt.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
So instead I use the &lt;code&gt;alias&lt;/code&gt; functionality of my (bash) shell to change the default.  In my &lt;code&gt;~/.bashrc&lt;/code&gt; I now have&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="~/.bashrc"&gt;alias R="$(/usr/bin/which R) --no-save"&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
And finally I am happy.  But I still think R should have an option (accessible through &lt;code&gt;options&lt;/code&gt;) to change the default behavior.&#xD;
&lt;/p&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mZpStaDMWdA:kSaP-_aoHhI:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mZpStaDMWdA:kSaP-_aoHhI:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=mZpStaDMWdA:kSaP-_aoHhI:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mZpStaDMWdA:kSaP-_aoHhI:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=mZpStaDMWdA:kSaP-_aoHhI:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mZpStaDMWdA:kSaP-_aoHhI:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mZpStaDMWdA:kSaP-_aoHhI:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=mZpStaDMWdA:kSaP-_aoHhI:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mZpStaDMWdA:kSaP-_aoHhI:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/mZpStaDMWdA" height="1" width="1"/&gt;</content><published>2009-03-26T08:14:00Z</published><updated>2009-03-26T08:14:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-tips-Eliminating-the-save-workspace-image-prompt-on-exit.html</feedburner:origLink></entry><entry><title type="text">R tips: Keep your packages up-to-date</title><id>urn:uuid:925677fd-a8f7-5fbd-ac11-4bd8ef410537</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-tips-Keep-your-packages-up_to_date.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/Sm6R7iW5IEw/R-tips-Keep-your-packages-up_to_date.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
In this entry in a small series of tips for the use of the <a href="http://www.r-project.org/" title="The R Project for Statistical Computing">R</a> statistical analysis and computing tool, we look at how to keep your addon packages up-to-date.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
In this entry in a small series of tips for the use of the &lt;a href="http://www.r-project.org/" title="The R Project for Statistical Computing"&gt;R&lt;/a&gt; statistical analysis and computing tool, we look at how to keep your addon packages up-to-date.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
One of the great strengths of R is the many packages available.  All the new approaches, as well as some of the best implementations of your old favorites are there.  But it can also be a little daunting, and so the &lt;a href="http://www.stats.bris.ac.uk/R/web/views/"&gt;CRAN task views&lt;/a&gt; are often the best way to get started and download a reasonable “bundle” of packages for your analysis.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
First we need a place to store the packages.  On Linux (and other Unix-like systems) I use the file &lt;code&gt;~/.Renviron&lt;/code&gt; to set the &lt;code&gt;R_LIBS&lt;/code&gt; variable to where I want the files:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="~/.Renviron"&gt;## R environment&#xD;
R_LIBS="~/R"&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
On Windows, I set the same variable for the user account.  Don’t forget to create the directory.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Now your can start R and install the CRAN task view package:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; install.packages("ctv")&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Then I have a few things in my &lt;code&gt;~/.Rprofile&lt;/code&gt; startup file.  The previous command probably prompted you for a download mirror which is annoying, so let’s exit R and edit the startup file to contain:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="~/.Rprofile"&gt;## Default CRAN mirror&#xD;
local({r &amp;lt;- getOption("repos"); r["CRAN"] &amp;lt;- "http://cran.uk.r-project.org"; options(repos=r)})&#xD;
## Libraries&#xD;
require("utils", quietly=TRUE)&#xD;
require("ctv", quietly=TRUE)&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Then I define three functions.  The first is to install the views I need.  I like to try new things, so my list is long.  Edit it to suit your needs:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="~/.Rprofile"&gt;install.myviews &amp;lt;- function() {&#xD;
  require("ctv", quietly=TRUE)&#xD;
  my.views = c("Bayesian", "Cluster", "Graphics", "gR", "HighPerformanceComputing", "MachineLearning", "Multivariate", "NaturalLanguageProcessing", "Robust", "SocialSciences", "Spatial", "Survival", "TimeSeries")&#xD;
  install.views(views=my.views, lib=Sys.getenv("R_LIBS"), dependencies=c("Depends","Suggests"))&#xD;
}&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Try it out!  Save the file, start R, and type &lt;code&gt;install.myviews()&lt;/code&gt; at the prompt.  If your list is as long as mine, then this may take some time and you may get some warnings and errors.  We might add a tip on these later, but the main reason for the errors is probably that you are missing the development files for external libraries (or that R just can’t find it).&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Now that we have finally got them, we need to make sure they are up-to-date.  I add two functions to &lt;code&gt;~/.Rprofile&lt;/code&gt;:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="~/.Rprofile"&gt;update.local &amp;lt;- function() {&#xD;
  update.packages(lib.loc=Sys.getenv("R_LIBS"), ask=FALSE)&#xD;
}&#xD;
&#xD;
update.myviews &amp;lt;- function() {&#xD;
  require("ctv", quietly=TRUE)&#xD;
  my.views = c("Bayesian", "Cluster", "Graphics", "gR", "HighPerformanceComputing", "MachineLearning", "Multivariate", "NaturalLanguageProcessing", "Robust", "SocialSciences", "Spatial", "Survival", "TimeSeries")&#xD;
  update.views(views=my.views, lib.loc=Sys.getenv("R_LIBS"))&#xD;
}&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The first allows me to easily update all my locally installed libraries (not just these installed from views).  The second updates my views which is useful when the view definitions change (rarely, but it happens as the recommended packages evolve).&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Now I can of course update from the R command prompt using &lt;code&gt;update.local()&lt;/code&gt; or &lt;code&gt;update.myviews()&lt;/code&gt;.  But that is not the main benefit.  I can now update directly from the shell command line using commands like:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;echo "update.local()" &amp;gt; /tmp/r.cmd&#xD;
R CMD BATCH /tmp/r.cmd /tmp/r.out&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The beauty of this is that I can add it to my &lt;code&gt;crontab(5)&lt;/code&gt; and have it run automatically every night or every week as I feel I need it.  This way I always have the latest versions installed.&#xD;
&lt;/p&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Sm6R7iW5IEw:f89JGmHk9P8:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Sm6R7iW5IEw:f89JGmHk9P8:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=Sm6R7iW5IEw:f89JGmHk9P8:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Sm6R7iW5IEw:f89JGmHk9P8:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=Sm6R7iW5IEw:f89JGmHk9P8:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Sm6R7iW5IEw:f89JGmHk9P8:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Sm6R7iW5IEw:f89JGmHk9P8:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=Sm6R7iW5IEw:f89JGmHk9P8:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Sm6R7iW5IEw:f89JGmHk9P8:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/Sm6R7iW5IEw" height="1" width="1"/&gt;</content><published>2009-03-25T20:59:00Z</published><updated>2009-03-25T20:59:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-tips-Keep-your-packages-up_to_date.html</feedburner:origLink></entry></feed>
