<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.cybaea.net/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0"><title>CYBAEA Data and Analysis</title><rights>Copyright by the author(s). All rights reserved.</rights><logo>http://static.cybaea.net/logo2011/cybaea-data-200.png</logo><subtitle type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>Read the CYBAEA Data and Analysis blog for in-depth coverage of selected topics in data analysis, data mining, statistics, causal inference, and related topics.</p><p>This is the blog for practising data analysts and theoretical statisticians.  The business conclusions of any analysis would normally be discussed in the CYBAEA Journal while this blog may contain the details of the analysis.</p></div></subtitle><updated>2011-10-28T11:10:24Z</updated><id>urn:uuid:259dced6-9721-5b16-a8aa-d91dc8e40f56</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/" /><link rel="alternate" type="text/html" href="http://www.cybaea.net/Blogs/Data/" /><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><generator uri="http://www.cybaea.net/atom/feed.pl?short_name=Data" version="$Revision: 97 $">feed.pl</generator><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/atom+xml" href="http://feeds.cybaea.net/CybaeaData" /><feedburner:info uri="cybaeadata" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><feedburner:emailServiceId>CybaeaData</feedburner:emailServiceId><feedburner:feedburnerHostname>http://feedburner.google.com</feedburner:feedburnerHostname><entry><title type="text">R versus SAS/SPSS in corporations</title><id>urn:uuid:96da848e-e1c0-526a-b214-213b613df848</id><link rel="alternate" 
type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-versus-SAS_SPSS-in-corporations.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/Jo8p0HAP-iI/R-versus-SAS_SPSS-in-corporations.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><div class="floatRight">
  <a href="http://www.cybaea.net/Blogs/Data/R-versus-SAS_SPSS-in-corporations.html" title="Click for full article">
    <img src="http://static.cybaea.net/images/graph_151-150.png" width="150" height="150" alt="[graph]" title="Graph from http://addictedtor.free.fr/graphiques/RGraphGallery.php?graph=151" />
  </a>
</div>
<p>A recent question on one of the LinkedIn groups about the advantages of using <a href="http://www.r-project.org/">R</a> over commercial tools like SAS or IBM SPSS Modeller drew lots of comments for R.  We like R a lot and we use it extensively, but I also wanted to balance the discussion.  R is great, but looking at commercial organizations near the end of 2011 it is not necessarily the right choice to make.</p>
</div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;A recent question on one of the LinkedIn groups about the advantages of using &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt; over commercial tools like SAS or IBM SPSS Modeller drew lots of comments for R.  We like R a lot and we use it extensively, but I also wanted to balance the discussion.  R is great, but looking at commercial organizations near the end of 2011 it is not necessarily the right choice to make.&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;Background&lt;/h2&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
We have created and managed analytics teams in commercial organizations (mainly telecommunications) across Europe.  The teams were using SAS or SPSS.  Our company now has a commercial analytics as a service offering and we mainly use R.&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
The benefit of R is productivity.  We want to spend time on the actions from the analytical insights, not the coding, and we choose our tool accordingly.  Being a consulting-type organization, it is easier for us to attract and retain talent.&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;The advantages of SAS/SPSS in a commercial environment&lt;/h2&gt;&#xD;
&#xD;
&#xD;
&lt;h3&gt;1. You can buy the tool for money.&lt;/h3&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
Big corporations have procurement departments who do not have a process for free software.  Also, software spend goes on the balance sheet in a way that the CFO prefers to spend on people, whereas something like R will take a little talent to set up initially.  (And yes, we know the Revolutions guys well, but they are not really credible in Europe yet.)&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
This will change as (a) companies become more mature in their procurement and as (b) commercial support for R improves.  (On the latter point, &lt;a href="http://www.oracle.com/us/corporate/features/features-oracle-r-enterprise-498732.html"&gt;Oracle’s R integration&lt;/a&gt; to the database is great news.)&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;h3&gt;2. You can recruit for the commercial tool&lt;/h3&gt;&#xD;
&#xD;
&lt;ol&gt;&#xD;
&#xD;
&lt;li&gt;Recruiters are familiar with SAS and SPSS but not with R so it is easier to brief them and to get good quality CVs.  This will change as R becomes ever more popular and prevalent.  [And yes, we could in theory change recruiters to someone clued in, but again in large corporations there are procurement processes to be followed and existing agreements to be honoured so it will all take months or years.]&#xD;
&lt;/li&gt;&#xD;
&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;There are recognised training programmes for SPSS and (especially) SAS which makes it easier to recruit the technical skills.  How do you know what somebody knows when they say they “know R”?  How do you even &lt;em&gt;begin&lt;/em&gt; to quantify it from a CV?  How do you separate the guy who downloaded the tool and just read “&lt;a href="http://cran.r-project.org/doc/manuals/R-intro.html"&gt;An Introduction to R&lt;/a&gt;” from the Frank Harrells of this world?&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Yes,  I would argue (and in fact have argued in &lt;a href="http://www.cybaea.net/Blogs/Journal/Commercial-Analytics-The-Capabilities.html"&gt;Commercial Analytics: The Capabilities&lt;/a&gt;) that technical skill is not the most important in an analyst (and can be learned anyhow) but it does help filter the CVs and, you guessed it, fits well with the corporation’s processes.&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
(One reason we use R internally is that we find that it is, on average, a more interesting type of analyst who is proficient in that tool.  It seems to encourage curiosity and love of learning in a way that menu-based tools do not.)&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
I think the commercial R companies are really missing a trick here to provide recognised certification.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
You can’t search for R.  Seriously: try searching for R on LinkedIn (tip: there is &lt;a href="http://www.linkedin.com/skills/skill/R"&gt;another way&lt;/a&gt;).  Much easier to find SAS / SPSS skills in a large CV database (like LinkedIn where this discussion started).&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&#xD;
&lt;h3&gt;3. You can recruit for the commercial tool.&lt;/h3&gt;&#xD;
&lt;p&gt;&#xD;
Yes, I know I already said that, but there is another reason why this is critical.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
R takes talent to use.  (That is kind of why we like it.)  It takes talent to maintain.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
My problem as the manager of a commercial analytical insights team is that it is very hard for me to retain that talent.  Think about it: what can I offer in terms of career progression?  If you are an analyst you might become a senior analyst but you will always be an analyst.  There are no examples of a way up the organization (except perhaps out through IT and then up to CIO).    [This too will change with time.]  And new challenges: yes, some, but we are not a research university and it tends to be the same few problem types that we are always working on.  So if you are an analyst looking for new challenges and more pay, the best thing – the logical and rational thing to do – is to get a new job.  And your time with Big Corporation will look good on your CV and you will probably land the job easily.&#xD;
&lt;/p&gt;&#xD;
&lt;h2&gt;We can help&lt;/h2&gt;&#xD;
&lt;div class="floatRight" style="width: 150px"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;a href="http://www.cybaea.net/Blogs/Journal/Commercial-Analytics-The-Capabilities.html" title="Click to read Commercial Analytics: The Capabilities"&gt;&#xD;
      &lt;img src="http://static.cybaea.net/files/CCA/commercial-analytics-150.png" width="150" height="150" alt="[capabilities]"&gt;&lt;/img&gt;&#xD;
    &lt;/a&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p class="caption"&gt;Our &lt;a href="http://www.cybaea.net/Blogs/Journal/Commercial-Analytics-The-Capabilities.html"&gt;commercial analytics capabilities model&lt;/a&gt;.&lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
If you want to set up a commercial analytical group we can help you get it right first time.  The right people, the right processes, the right infrastructure and most importantly the right results.  We have done it before and are not tied to any specific tool or vendor.&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
If you want to improve or enhance your existing analytical teams, then we can &lt;a href="http://www.cybaea.net/Services/Reboot.html"&gt;Reboot your Analytics&lt;/a&gt; to deliver both rapid and sustained commercial results.&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
And if you just want the results we can provide commercial analytics as a service where we provide the insights and then work with you to turn those insights into commercial actions and better understanding of your business, markets, and customers, leaving you to focus on what you do best.&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;p class="link"&gt;&#xD;
&lt;a href="http://www.cybaea.net/Contact/"&gt;Contact us&lt;/a&gt; now and get results from your analytics.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;Jump to &lt;a href="http://www.cybaea.net/Blogs/Data/R-versus-SAS_SPSS-in-corporations.html#h2_entry_comments"&gt;comments&lt;/a&gt;.&lt;/p&gt;&#xD;
&lt;div class="seealso"&gt;&#xD;
&lt;h1&gt;You may also like these posts:&lt;/h1&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.43]" title="[0.43]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Faster-R-through-better-BLAS.html" title="Can we make our analysis using the R statistical computing and analysis platform run faster? Usually the answer is yes, and the best way is to improve your algorithm and variable selection. But recently David Smith was suggesting that a big benefit of their (commercial) version of R was that it was linked to a better linear algebra library. So I decided to investigate. The quick summary is that it only really makes a difference for fairly artificial benchmark tests. For “normal” work you are unlikely to see a difference most of the time."&gt;Faster R through better BLAS&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Can we make our analysis using the R statistical computing and analysis platform run faster? Usually the answer is yes, and the best way is to improve your algorithm and variable selection. But recently David Smith was suggesting that a big benefit of their (commercial) version of R was that it was linked to a better linear algebra library. So I decided to investigate. The quick summary is that it only really makes a difference for fairly artificial benchmark tests. For “normal” work you are unlikely to see a difference most of the time.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-10.png" width="85" height="16" alt="[0.35]" title="[0.35]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Journal/Commercial-Analytics-The-Capabilities.html" title="Commercial Analytics is the kind that makes money. From data to dollars, insights to income, this is all about how to run the business better. To do it and to do it well you need certain capabilities in place. This article builds a map of those business capabilities to help you assess, understand, and plan your business. Usually we talk about this and we are happy to talk to you about it (just contact us) but we recently had occasion to make a slide pack that covered some of the materials as a stand-alone presentation. This article is based on that pack which is also available for download."&gt;Commercial Analytics: The Capabilities&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Commercial Analytics is the kind that makes money. From data to dollars, insights to income, this is all about how to run the business better. To do it and to do it well you need certain capabilities in place. This article builds a map of those business c…&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-00.png" width="85" height="16" alt="[0.34]" title="[0.34]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Journal/5-common-pitfalls-of-commercial-analytics-projects.html" title="We have seen data mining and other analytics projects fail; we have seen insights teams unable to deliver the insights needed to actually improve the business; we have seen marketing teams unable to use data effectively to guide and quantify their activities; we have seen business leaders who are sitting on piles of data but are effectively flying blind because they can not get from the data to the knowledge they need to inform their decisions. Below we have listed five common pitfalls of analytics in a commercial environment, their warning signs, and what you can do differently."&gt;5 common pitfalls of commercial analytics projects&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/Jo8p0HAP-iI" height="1" width="1"/&gt;</content><published>2011-10-28T11:10:00Z</published><updated>2011-10-28T11:10:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-versus-SAS_SPSS-in-corporations.html</feedburner:origLink></entry><entry><title type="text">Friday quote: what is the question to which this number is the answer?</title><id>urn:uuid:4c6c66d3-6fac-53f0-88fe-87e29b0488f6</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Friday-quote-20110826.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/sbNH3HpbijQ/Friday-quote-20110826.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>John Kay <a href="http://www.johnkay.com/2011/08/24/sex-lies-and-pitfalls-of-overblown-statistics">muses</a> on interpreting statistical data:</p>
<blockquote>
<p>Always ask of such data “<b>what is the question to which this number is the answer?</b>”. “<i>Earnings before interest, tax, depreciation and amortisation on a like-for-like basis before allowance for exceptional restructuring costs</i>” is the answer to the question “<i>what is the highest profit number we can present without attracting flat disbelief?</i>”.</p>
</blockquote></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;John Kay &lt;a href="http://www.johnkay.com/2011/08/24/sex-lies-and-pitfalls-of-overblown-statistics"&gt;muses&lt;/a&gt; on interpreting statistical data:&lt;/p&gt;&#xD;
&lt;blockquote&gt;&#xD;
&lt;p&gt;Always ask of such data “&lt;b&gt;what is the question to which this number is the answer?&lt;/b&gt;”. “&lt;i&gt;Earnings before interest, tax, depreciation and amortisation on a like-for-like basis before allowance for exceptional restructuring costs&lt;/i&gt;” is the answer to the question “&lt;i&gt;what is the highest profit number we can present without attracting flat disbelief?&lt;/i&gt;”.&lt;/p&gt;&#xD;
&lt;/blockquote&gt;&#xD;
&lt;p&gt;And on the pitfalls of powerful data analysis tools:&lt;/p&gt;&#xD;
&lt;blockquote&gt;&#xD;
&lt;p&gt;When the data seem to point to an unexpected finding, always consider the possibility that the problem is a feature of the data, rather than a feature of the world.  […] It is now easy to import data into a computer program without thought. The unwarranted precision of the projected growth in rail traffic – a 96 per cent increase, rather than a doubling – is a clue that the number was generated by a computer, not a skilled interpreter of evidence.&lt;/p&gt;&#xD;
&lt;p&gt;Statistics are only as valid as the sources from which they are drawn and the abilities of those who use them. When I discover something surprising in data, the most common explanation is that I made a mistake.&lt;/p&gt;&#xD;
&lt;/blockquote&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/sbNH3HpbijQ" height="1" width="1"/&gt;</content><published>2011-08-26T09:05:00Z</published><updated>2011-08-26T09:05:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Friday-quote-20110826.html</feedburner:origLink></entry><entry><title type="text">A warning on the R save format</title><id>urn:uuid:52d4ca53-07ff-59e3-92cb-54f97d3dd30e</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/A-warning-on-the-R-save-format.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/CwF2gIjFK2Y/A-warning-on-the-R-save-format.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>The <code>save()</code> function in the <a href="http://www.r-project.org/">R platform for statistical computing</a> is very convenient and I suspect many of us use it a lot.  But I was recently bitten by a “feature” of the format which meant I could not recover my data.</p>
<p>I recommend that you save data in a data format (e.g. CSV or CDF), not using the <code>save()</code> function which is really for objects (data and code).  What is your approach?</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;The &lt;code&gt;save()&lt;/code&gt; function in the &lt;a href="http://www.r-project.org/"&gt;R platform for statistical computing&lt;/a&gt; is very convenient and I suspect many of us use it a lot.  But I was recently bitten by a “feature” of the format which meant I could not recover my data.&lt;/p&gt;&#xD;
&lt;h2&gt;How to lose your data with &lt;code&gt;save()&lt;/code&gt;&lt;/h2&gt;&#xD;
&lt;p&gt;I am using Windows on my travel laptop and Linux on my workstation.  To speed things up on the latter and make use of my many (well, four) cores, I use the ‘multicore’ package, which I do not have available on the Windows machine.&lt;/p&gt;&#xD;
&lt;p&gt;To illustrate the problem with the save file format, I created a file on the Linux machine simply as:&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;library("multicore")&#xD;
a &amp;lt;- list(data = 1:10, fun = mclapply)&#xD;
save(a, file = "a.RData")&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;What could be simpler?  &lt;code&gt;mclapply&lt;/code&gt; is a function from the ‘multicore’ package but it clearly has no impact on the stored data.  (We will show a more realistic example below – work with me here.)&lt;/p&gt;&#xD;
&lt;p&gt;But try to open the save file on a machine without the package installed, like my Windows laptop, and you get:&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;Error in loadNamespace(name) : there is no package called 'multicore'&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&lt;strong&gt;There is no way of getting to your precious data&lt;/strong&gt; without installing the missing package.&lt;/p&gt;&#xD;
&lt;p&gt;If the package has been withdrawn or is no longer available then your data is basically lost.&lt;/p&gt;&#xD;
&lt;h2&gt;What can you do?&lt;/h2&gt;&#xD;
&lt;p&gt;Some suggestions from the helpful people on R-help:&lt;/p&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;(&lt;a href="http://www.statistik.tu-dortmund.de/ligges.html"&gt;Uwe Ligges&lt;/a&gt;): You could try to rewrite &lt;code&gt;./src/main/saveload.c&lt;/code&gt; and &lt;code&gt;serialize.c&lt;/code&gt; to extract only the parts you need.  “This is probably not worth the effort.”&lt;/li&gt;&#xD;
&lt;li&gt;(&lt;a href="http://www.stats.ox.ac.uk/~ripley/"&gt;Prof. Brian Ripley&lt;/a&gt;): You could try installing the missing package; &lt;code&gt;R CMD INSTALL --fake&lt;/code&gt; should be sufficient to let you load the data.  Also suggests that the proposal above would be very hard indeed.&lt;/li&gt;&#xD;
&lt;li&gt;(&lt;a href="http://blog.revolutionanalytics.com/2011/05/the-r-files-martin-morgan.html"&gt;Martin Morgan&lt;/a&gt;): Don't store package functions with your code.&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;p&gt;That is three good answers from three of the heavy-weights in the R community.  Thank you all!&lt;/p&gt;&#xD;
&lt;p&gt;Martin’s comment is worth expanding.  We can change the above example to:&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;library("multicore")&#xD;
computeFunction &amp;lt;- function(...) {&#xD;
    if (require(multicore)) mclapply(...)&#xD;
    else lapply(...) &#xD;
}&#xD;
a &amp;lt;- list(data = 1:10, fun = computeFunction)&#xD;
save(a, file = "a.RData")&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;Now everything works fine!  No data is horribly lost: the file loads fine on the ‘multicore’-less machine.&lt;/p&gt;&#xD;
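&lt;p&gt;One way to see why the wrapper helps is to look at the environment each function carries: as far as I can tell, it is a function whose environment is a package namespace that makes the load ask for that package.  The following is only a minimal sketch and not something from the R-help thread.  &lt;code&gt;functionNamespace&lt;/code&gt; is a hypothetical helper name, and &lt;code&gt;stats::median&lt;/code&gt; is used as a stand-in for a package function so that the code runs without ‘multicore’ installed.&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;### functionNamespace: hypothetical helper, for illustration only.&#xD;
### Returns the name of the package namespace a function is bound to,&#xD;
### or NA if its environment is not a namespace.&#xD;
functionNamespace &amp;lt;- function(f) {&#xD;
    env &amp;lt;- environment(f)&#xD;
    ## Primitives such as sum() have no environment at all&#xD;
    if (is.null(env) || !isNamespace(env)) return(NA_character_)&#xD;
    unname(getNamespaceName(env))&#xD;
}&#xD;
&#xD;
### A wrapper defined in the workspace, in the style of computeFunction above&#xD;
wrapper &amp;lt;- function(...) lapply(...)&#xD;
&#xD;
functionNamespace(stats::median)   # a function taken directly from a package&#xD;
functionNamespace(wrapper)         # a wrapper defined in the workspace&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;The first call should report the ‘stats’ namespace and the second should give &lt;code&gt;NA&lt;/code&gt;, which is the difference that matters here: the wrapper only names the package function in its body and looks it up when it is called.&lt;/p&gt;&#xD;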
&lt;p&gt;And for the more realistic example, I had been using &lt;code&gt;&lt;a href="http://cran.r-project.org/web/packages/caret/index.html"&gt;caret&lt;/a&gt;::rfe&lt;/code&gt; as Martin knew in the example he provided:&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;library("&lt;a href="http://cran.r-project.org/web/packages/caret/index.html"&gt;caret&lt;/a&gt;")&#xD;
data(BloodBrain)&#xD;
&#xD;
x &amp;lt;- scale(bbbDescr[,-nearZeroVar(bbbDescr)])&#xD;
x &amp;lt;- x[, -findCorrelation(cor(x), .8)]&#xD;
x &amp;lt;- as.data.frame(x)&#xD;
&#xD;
set.seed(1)&#xD;
lmProfile &amp;lt;- rfe(x, logBBB,&#xD;
                 sizes = c(2:25, 30, 35, 40, 45, 50, 55, 60, 65),&#xD;
                 rfeControl = rfeControl(functions = lmFuncs,&#xD;
                   number = 5,&#xD;
                   computeFunction=mclapply))&#xD;
save(lmProfile, file = "lmProfile.RData")&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;It is slightly less obvious that this code contains a reference to an external namespace, but it is easy enough to see if you know what to look for.&lt;/p&gt;&#xD;
&lt;p&gt;For old files I will use the &lt;code&gt;R CMD INSTALL --fake&lt;/code&gt; suggestion, but for new data I am going with the last approach and using a &lt;code&gt;computeFunction&lt;/code&gt; like this:&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;### MCCompute: A computeFunction for caret::rfeControl and caret::trainControl &#xD;
### that does not leave a reference to the multicore package in the save file&#xD;
MCCompute &amp;lt;- function(X, FUN, ...) {&#xD;
    FUN &amp;lt;- match.fun(FUN)&#xD;
    if (!is.vector(X) || is.object(X)) &#xD;
        X &amp;lt;- as.list(X)&#xD;
    if (require("multicore")) mclapply(X, FUN, ...)&#xD;
    else lapply(X, FUN, ...)&#xD;
}&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;I know that Max Kuhn is rewriting the &lt;a href="http://cran.r-project.org/web/packages/caret/index.html"&gt;caret&lt;/a&gt; package which should make this a moot point in the near future for that specific case.  But the indirection approach is generally useful and will also be relevant in other situations.&lt;/p&gt;&#xD;
&lt;h2&gt;Recommendations&lt;/h2&gt;&#xD;
&lt;p&gt;My recommendations:&lt;/p&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&lt;strong&gt;Save data in a data format, not using the &lt;code&gt;save()&lt;/code&gt; function which is really for objects (data and code)&lt;/strong&gt;.  Suitable formats include CSV and variants, &lt;a href="http://cran.r-project.org/web/packages/hdf5/index.html"&gt;HDF5&lt;/a&gt;, and &lt;a href="http://cran.r-project.org/web/packages/ncdf4/index.html"&gt;CDF&lt;/a&gt;, as well as others.&lt;/li&gt;&#xD;
&lt;li&gt;Avoid references to packages in your objects by using the one-level indirection trick exemplified by the &lt;code&gt;MCCompute&lt;/code&gt; function shown.&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
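&lt;p&gt;The first recommendation can be sketched as a simple CSV round trip.  This is only an illustration: the data frame and the temporary file are made up for the example, and CSV is just one of the suitable formats.&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;### Illustrative data only: a small data frame standing in for real results&#xD;
results &amp;lt;- data.frame(id = 1:3, score = c(0.5, 1.25, 2))&#xD;
&#xD;
### Write to a plain-text format that needs no packages to read back&#xD;
csvFile &amp;lt;- tempfile(fileext = ".csv")&#xD;
write.csv(results, file = csvFile, row.names = FALSE)&#xD;
&#xD;
### Read it back and check that the round trip preserved the values&#xD;
restored &amp;lt;- read.csv(csvFile)&#xD;
stopifnot(isTRUE(all.equal(results, restored)))&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;Note that a CSV file keeps only the tabular values.  Attributes and functions are not stored, which is rather the point.&lt;/p&gt;&#xD;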
&lt;p&gt;What is your approach?  Suggestions in the comments below, please.&lt;/p&gt;&#xD;
&lt;p&gt;Jump to &lt;a href="http://www.cybaea.net/Blogs/Data/A-warning-on-the-R-save-format.html#h2_entry_comments"&gt;comments&lt;/a&gt;.&lt;/p&gt;&#xD;
&lt;div class="seealso"&gt;&#xD;
&lt;h1&gt;You may also like these posts:&lt;/h1&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.46]" title="[0.46]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-tips-Determine-if-function-is-called-from-specific-package.html" title="I like the multicore library for a particular task. I can easily write a combination of if(require(multicore,...)) that means that my function will automatically use the parallel mclapply() instead of lapply() where it is available. Which is grand 99% of the time, except when my function is called from mclapply() (or one of the lower level functions) in which case much CPU thrashing and grinding of teeth will result. So, I needed a function to determine if my function was called from any function in the multicore library. Here it is."&gt;R tips: Determine if function is called from specific package&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;I like the multicore library for a particular task. I can easily write a combination of if(require(multicore,...)) that means that my function will automatically use the parallel mclapply() instead of lapply() where it is available. Which is grand 99% of the time, except when my function is called from mclapply() (or one of the lower level functions) in which case much CPU thrashing and grinding of teeth will result. So, I needed a function to determine if my function was called from any function in the multicore library. Here it is.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.43]" title="[0.43]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-Eliminating-observed-values-with-zero-variance.html" title="I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform. In other words, I want to find the columns in a data frame that have zero variance. And as fast as possible, because my data sets are large, many, and changing fast. The final result surprised me a little."&gt;R: Eliminating observed values with zero variance&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform. In other words, I want to find the columns in a data frame that have zero variance. And as fast as possible, because my data sets are large, many, and changing fast. The final result surprised me a little.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-30.png" width="85" height="16" alt="[0.42]" title="[0.42]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-tips-Keep-your-packages-up_to_date.html" title="In this entry in a small series of tips for the use of the R statistical analysis and computing tool, we look at how to keep your addon packages up-to-date."&gt;R tips: Keep your packages up-to-date&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;In this entry in a small series of tips for the use of the R statistical analysis and computing tool, we look at how to keep your addon packages up-to-date.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-30.png" width="85" height="16" alt="[0.39]" title="[0.39]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Big-data-for-R.html" title="Revolutions Analytics recently announced their big data solution for R. This is great news and a lovely piece of work by the team at Revolutions. However, if you want to replicate their analysis in standard R, then you can absolutely do so and we show you how."&gt;Big data for R&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Revolutions Analytics recently announced their big data solution for R. This is great news and a lovely piece of work by the team at Revolutions. However, if you want to replicate their analysis in standard R, then you can absolutely do so and we show yo…&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-20.png" width="85" height="16" alt="[0.38]" title="[0.38]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-tips-Eliminating-the-save-workspace-image-prompt-on-exit.html" title="When using R, the statistical analysis and computing platform, I find it really annoying that it always prompts to save the workspace when I exit. This is how I turn it off."&gt;R tips: Eliminating the “save workspace image” prompt on exit&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;When using R, the statistical analysis and computing platform, I find it really annoying that it always prompts to save the workspace when I exit. This is how I turn it off.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-10.png" width="85" height="16" alt="[0.36]" title="[0.36]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Employee-productivity-as-function-of-number-of-workers-revisited.html" title="We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis for the FTSE-100 constituent companies and find that the relation still holds four years later and across a continent."&gt;Employee productivity as function of number of workers revisited&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis …&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-10.png" width="85" height="16" alt="[0.35]" title="[0.35]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Benchmarking-feature-selection-with-Boruta-and-caret.html" title="Feature selection is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering. For commercial data mining, which is often characterised by having too many variables for model building, this is an important step in the analysis process. And since we often work on very large data sets the performance of our process is very important to us. Having looked at feature selection using the Boruta package and feature selection using the caret package separately, we now consider the performance of the two approaches. Neither approach is suitable out of the box for the sizes of data sets that we normally work with."&gt;Benchmarking feature selection with Boruta and caret&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Feature selection is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering. For commercial data mining, which is often characterised by having too many variables for model building, …&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-00.png" width="85" height="16" alt="[0.34]" title="[0.34]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/SNA-with-R-Loading-your-network-data.html" title="We are interested in Social Network Analysis using the statistical analysis and computing platform R. As usual with R, the documentation is pretty bad, so this series collects our notes as we learn more about the available packages and how they work. We use here the statnet group of packages, which seems to be the most comprehensive and most actively maintained set of network analysis packages. The first task, which we consider in this post, is to load our data into a network object, which is how all the statnet packages represent a network. Typically for R, the documentation is voluminous but not always as helpful as one could want."&gt;SNA with R: Loading your network data&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/CwF2gIjFK2Y" height="1" width="1"/&gt;</content><published>2011-08-23T07:20:00Z</published><updated>2011-08-23T07:20:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/A-warning-on-the-R-save-format.html</feedburner:origLink></entry><entry><title type="text">Friday quote: the handmaiden and the whore</title><id>urn:uuid:11daa2ff-5d4a-534e-aef8-66ce1e157cd8</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Friday-quote-20110819.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/KFA3sPOOdCI/Friday-quote-20110819.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>Because it is Friday and because we collect quotes:</p>
<blockquote>
  <p>If mathematics is the handmaiden of science, statistics is the whore: all that scientists are looking for is a quick fix without the encumbrance of a meaningful relationship.  Statisticians are second-class mathematicians, third-rate scientists and fourth-rate thinkers.  They are the hyenas, jackals and vultures of the scientific ecology: picking over the bones and carcasses of the game that the big cats, the biologists, the physicists and the chemists, have brought down.</p>
</blockquote></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;Because it is Friday and because we collect quotes.&lt;/p&gt;&#xD;
&lt;blockquote&gt;&lt;p&gt;If mathematics is the handmaiden of science, statistics is the whore: all that scientists are looking for is a quick fix without the encumbrance of a meaningful relationship.  Statisticians are second-class mathematicians, third-rate scientists and fourth-rate thinkers.  They are the hyenas, jackals and vultures of the scientific ecology: picking over the bones and carcasses of the game that the big cats, the biologists, the physicists and the chemists, have brought down.&lt;/p&gt;&#xD;
&lt;p&gt;Statistics is a wonderful discipline.  It has it all: mathematics and philosophy, analysis and empiricism, as well as applicability, relevance and the fascination of data.  It demands clear thinking, good judgement and flair.  Statisticians are engaged in an exhausting but exhilarating struggle with the biggest challenge that philosophy makes to science: how do we translate information into knowledge?&lt;/p&gt;&#xD;
&lt;p&gt;―Stephen Senn: &lt;a href="http://www.amazon.co.uk/gp/product/0521540232/ref=as_li_ss_tl?ie=UTF8&amp;amp;tag=cybaea-21&amp;amp;linkCode=as2&amp;amp;camp=1634&amp;amp;creative=19450&amp;amp;creativeASIN=0521540232"&gt;Dicing with Death: Chance, Risk and Health&lt;/a&gt;&lt;img src="http://www.assoc-amazon.co.uk/e/ir?t=&amp;amp;l=as2&amp;amp;o=2&amp;amp;a=0521540232" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;"&gt;&lt;/img&gt;&#xD;
&lt;/p&gt;&lt;/blockquote&gt;&#xD;
&lt;p&gt;Which of the two views is closest to your opinion?&lt;/p&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/KFA3sPOOdCI" height="1" width="1"/&gt;</content><published>2011-08-19T12:04:00Z</published><updated>2011-08-19T12:04:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Friday-quote-20110819.html</feedburner:origLink></entry><entry><title type="text">Spreadsheet errors</title><id>urn:uuid:17669694-59c5-5798-a85d-ebb7c8d5802b</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Spreadsheet-errors.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/RcDqZYZa4mM/Spreadsheet-errors.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><div class="floatRight">
<a href="http://www.cybaea.net/Blogs/Data/Spreadsheet-errors.html" title="Read full article"><img src="http://static.cybaea.net/files/GS-spreadsheet-error-thumb.png" width="150" height="150" alt="[Click for article]" /></a>
</div>
<p>For my sins, I have done more than my fair share of analysis in Excel.  I am quite capable of building and maintaining 130 MB spreadsheets (I had a dozen of them for one client).  Excel is pretty much installed everywhere, so it is sometimes the only way to start getting commercial value out of the data in the organisation.  But I don’t like it and let’s have a look at one reason why.  In order not to always pick on Microsoft, we use another application, but you get the same results with Excel.</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;For my sins, I have done more than my fair share of analysis in Excel.  I am quite capable of building and maintaining 130 MB spreadsheets (I had a dozen of them for one client).  Excel is pretty much installed everywhere, so it is sometimes the only way to start getting commercial value out of the data in the organisation.  But I don’t like it and let’s have a look at one reason why.  In order not to always pick on Microsoft, we use another application, but you get the same results with Excel.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
&lt;div style="float: right; margin-left: 1em; overflow: scroll; height: 30em"&gt;&#xD;
&lt;table class="excel"&gt;&#xD;
&lt;thead&gt;&#xD;
&lt;tr&gt;&#xD;
&lt;th&gt;Y&lt;/th&gt;&lt;th&gt;X1&lt;/th&gt;&lt;th&gt;X2&lt;/th&gt;&lt;th&gt;X3&lt;/th&gt;&lt;th&gt;X4&lt;/th&gt;&#xD;
&lt;/tr&gt;&#xD;
&lt;/thead&gt;&#xD;
&lt;tbody&gt;&#xD;
&lt;tr&gt;&lt;td&gt;5.88&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;2.56&lt;/td&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;11.11&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;0.79&lt;/td&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;0.00&lt;/td&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;0.00&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;15.6&lt;/td&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;3.7&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;8.49&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;51.2&lt;/td&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;14.2&lt;/td&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;7.14&lt;/td&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;4.2&lt;/td&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;6.15&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;10.46&lt;/td&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;0.00&lt;/td&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;10.42&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;17.36&lt;/td&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;13.41&lt;/td&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;41.67&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;2.78&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;2.98&lt;/td&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;9.62&lt;/td&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;0.00&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;4.65&lt;/td&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;3.13&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;24.58&lt;/td&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;0.00&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;5.56&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;9.26&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;0.00&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;0.00&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;3.13&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;0.00&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;7.56&lt;/td&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;9.93&lt;/td&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;0.00&lt;/td&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;16.67&lt;/td&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;16.89&lt;/td&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;13.71&lt;/td&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;6.35&lt;/td&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;2.5&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;2.47&lt;/td&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;21.74&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;23.6&lt;/td&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;11.11&lt;/td&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;0.00&lt;/td&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;3.57&lt;/td&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;2.9&lt;/td&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;2.94&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;2.42&lt;/td&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;18.75&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;0.00&lt;/td&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;2.27&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;/tbody&gt;&#xD;
&lt;/table&gt;&#xD;
&lt;/div&gt;&#xD;
Spreadsheets are good for some things, but analysing data is not one of them.  The example data in the table on the right is from Jeffrey S. Simonoff, “&lt;a href="http://pages.stern.nyu.edu/~jsimonof/classes/1305/pdf/excelreg.pdf" title="Statistical analysis using Microsoft Excel"&gt;Statistical analysis using Microsoft Excel&lt;/a&gt;” (2008), and looks at first (and maybe even second) glance like a reasonable set of observations.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
However, the predictors are (accidentally) collinear, so no meaningful fit is possible unless one of them is dropped.  We see that very easily if we try to do the analysis using the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt;:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; d &amp;lt;- read.delim("clipboard")  # Read DATA range from clipboard&#xD;
&amp;gt; summary(lm(Y ~ ., data = d))&#xD;
&#xD;
Call:&#xD;
lm(formula = Y ~ ., data = d)&#xD;
&#xD;
Residuals:&#xD;
    Min      1Q  Median      3Q     Max &#xD;
-11.222  -5.821  -2.546   3.171  40.750 &#xD;
&#xD;
Coefficients: &lt;strong&gt;(1 not defined because of singularities)&lt;/strong&gt;&#xD;
            Estimate Std. Error t value Pr(&amp;gt;|t|)&#xD;
(Intercept)   4.1945     3.9749   1.055    0.296&#xD;
X1            0.3862     0.5652   0.683    0.497&#xD;
X2            0.2308     3.1590   0.073    0.942&#xD;
X3            3.7072     2.9922   1.239    0.221&#xD;
X4                NA         NA      NA       NA&#xD;
&#xD;
Residual standard error: 10.14 on 50 degrees of freedom&#xD;
Multiple R-squared: 0.04767,	Adjusted R-squared: -0.009466 &#xD;
F-statistic: 0.8343 on 3 and 50 DF,  p-value: 0.4814 &#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
We have highlighted the message that R has automatically dropped one of the predictors.&#xD;
&lt;/p&gt;&#xD;
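&lt;p&gt;&#xD;
The singularity can also be checked directly, without fitting a model.  The following is a minimal sketch and not part of the original analysis: it rebuilds only the X2, X3 and X4 columns from the table (X1 is left out for brevity, and the names n, M and q are ours), then uses the rank of the design matrix to show that X4 is an exact linear combination of the intercept, X2 and X3.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;## Sketch: rebuild X2, X3 and X4 from the table (24 + 10 + 10 + 10 = 54 rows)&#xD;
n  &amp;lt;- c(24, 10, 10, 10)&#xD;
X2 &amp;lt;- rep(c(1, 1, 0, 0), times = n)&#xD;
X3 &amp;lt;- rep(c(1, 0, 1, 0), times = n)&#xD;
X4 &amp;lt;- rep(c(1, 2, 3, 4), times = n)&#xD;
&#xD;
## Design matrix roughly as lm() would see it (X1 omitted for brevity)&#xD;
M &amp;lt;- cbind(Intercept = 1, X2 = X2, X3 = X3, X4 = X4)&#xD;
q &amp;lt;- qr(M)&#xD;
q$rank                                    # smaller than ncol(M): rank deficient&#xD;
## Columns that add nothing beyond the ones before them&#xD;
colnames(M)[q$pivot[-seq_len(q$rank)]]&#xD;
&#xD;
## The exact relation behind the singularity&#xD;
all(X4 == 4 - 2 * X2 - X3)                # TRUE&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
A rank below the number of columns is the same condition that makes R report a coefficient as not defined.&#xD;
&lt;/p&gt;&#xD;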
&lt;p&gt;&#xD;
Everybody likes to pick on Excel, so let us load the data into version 3.3.2 of &lt;a href="http://www.libreoffice.org/"&gt;LibreOffice&lt;/a&gt;, the free Open Source personal productivity suite, instead.  It faithfully implements many of the worst features of Excel.  You can grab a copy of the spreadsheet &lt;a href="http://static.cybaea.net/files/GS-spreadsheet-error.ods"&gt;GS-spreadsheet-error.ods&lt;/a&gt; yourself and see the results.  The relevant function in both Excel and LibreOffice for linear regression is LINEST, and applying it to the data set gives us:&#xD;
&lt;/p&gt;&#xD;
&lt;img src="http://static.cybaea.net/files/GS-spreadsheet-error-1.png" width="723" height="119" alt="[Screenshot 1]"&gt;&lt;/img&gt;&#xD;
&lt;p&gt;&#xD;
Of the 16 values returned by the function, fully 12 of them are incorrect (highlighted in red), and the '#VALUE!' entries are the only thing that suggests we may have a problem.  (The '#N/A' values are a feature of the function and not a problem.)  Excluding the X4 values from the function call gives meaningful (and correct) results:&#xD;
&lt;/p&gt;&#xD;
&lt;img src="http://static.cybaea.net/files/GS-spreadsheet-error-2.png" width="602" height="119" alt="[Screenshot 2]"&gt;&lt;/img&gt;&#xD;
&lt;p&gt;&#xD;
There is so much wrong with doing even this trivial analysis in a spreadsheet that it is hard to know where to start.  Some of the problems:&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;dl&gt;&#xD;
&lt;dt&gt;Garbage results instead of errors&lt;/dt&gt;&lt;dd&gt;Instead of giving meaningful errors or warnings, the spreadsheets simply produce garbage results.  This is nearly impossible to debug.&lt;/dd&gt;&#xD;
&lt;dt&gt;No help on how to correct the problem&lt;/dt&gt;&lt;dd&gt;In the erroneous results of the first figure, there is no clue, no hint, no help to figure out how to correct the problem.  You could argue about R correcting the issue “automagically”, but at least it finds a solution to the problem and tells you about it.&lt;/dd&gt;&#xD;
&lt;dt&gt;Error-prone output formats&lt;/dt&gt;&lt;dd&gt;I put in the row and column headings because otherwise it is just too hard to read the data.  Where does the function stuff the F statistic again?&lt;/dd&gt;&#xD;
&lt;/dl&gt;&#xD;
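&lt;p&gt;&#xD;
For contrast with the unlabelled LINEST grid, here is a short sketch of how R hands back the same quantities by name.  It uses the cars data set that ships with R purely as a stand-in (it is not the data from this post), and the variable names fit and s are ours.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;## Sketch: any fitted model will do; 'cars' ships with R&#xD;
fit &amp;lt;- lm(dist ~ speed, data = cars)&#xD;
s   &amp;lt;- summary(fit)&#xD;
&#xD;
coef(fit)        # estimates, labelled by term&#xD;
s$fstatistic     # F statistic with its degrees of freedom, labelled&#xD;
s$r.squared      # multiple R-squared&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Each value is looked up by name, not by its position in a grid, so there is nothing to misread.&#xD;
&lt;/p&gt;&#xD;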
&#xD;
&lt;p&gt;&#xD;
And don’t get me started on version control and documentation.  Don’t even mention that the maths in Excel are wrong.  Remember: Friends do not let friends do data analysis in spreadsheets.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;Jump to &lt;a href="http://www.cybaea.net/Blogs/Data/Spreadsheet-errors.html#h2_entry_comments"&gt;comments&lt;/a&gt;.&lt;/p&gt;&#xD;
&lt;div class="seealso"&gt;&#xD;
&lt;h1&gt;You may also like these posts:&lt;/h1&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.47]" title="[0.47]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Journal/Excel_Tip_1.html" title="I learn something new every day. Thinking I knew pretty much everything there is to know about Microsoft’s Excel spreadsheet application, I was surprised to see that you could turn any array into a boolean array depending on a condition by simply writing (array = value), as in these examples: (A1:A10=foo) SUMPRODUCT((B2:B6=B10)*1, C2:C6) This works in Gnumeric but not in OpenOffice 1.4. More notes and examples below."&gt;Excel Tip: Array boolean operator&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;I learn something new every day. Thinking I knew pretty much everything there is to know about Microsoft’s Excel spreadsheet application, I was surprised to see that you could turn any array into a boolean array depending on a condition by simply writing (array = value), as in these examples: (A1:A10=foo) SUMPRODUCT((B2:B6=B10)*1, C2:C6) This works in Gnumeric but not in OpenOffice 1.4. More notes and examples below.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.45]" title="[0.45]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-tips-Installing-Rmpi-on-Fedora-Linux.html" title="Somebody on the R-help mailing list asked how to get Rmpi working on his Fedora Linux machine so he could do high-performance computing on a cluster of machines (or a single multicore machine) using the R statistical computing and analysis platform . Since it is unusually painful to get working, I might as well copy the instructions here."&gt;R tips: Installing Rmpi on Fedora Linux&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Somebody on the R-help mailing list asked how to get Rmpi working on his Fedora Linux machine so he could do high-performance computing on a cluster of machines (or a single multicore machine) using the R statistical computing and analysis platform . Since it is unusually painful to get working, I might as well copy the instructions here.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-30.png" width="85" height="16" alt="[0.39]" title="[0.39]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Journal/Bubble 2.0.html" title="We are seeing the same thing, if a little less and a little delayed. Does it have to be like this? I don’t think it is just the tech industry but any new and hot growth area. Fred Wilson writes in Bubble 2.0 that we are heading for a new bubble, similar to the one that ended around the year 2000. “ But increasingly money is being made the way we made it from 1998 to early 2000; [momentum] investing, speculation, fast money chasing deals, caution being thrown to the wind, and amateurs jumping in on the action. It’s hard to say no to a good party. I am struggling with the temptations myself. ” I am in two minds about how it will go this time...."&gt;Bubble 2.0&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;We are seeing the same thing, if a little less and a little delayed. Does it have to be like this? I don’t think it is just the tech industry but any new and hot growth area. Fred Wilson writes in Bubble 2.0 that we are heading for a new bubble, similar to…&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-10.png" width="85" height="16" alt="[0.36]" title="[0.36]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Journal/mainstream_2007.html" title="Enterprise Social Software has gone mainstream. I say this based on the fact that the analysts are now releasing stacks of research on this area. Forrester is a good example, and McKinsey is also in on it (yes, Web 2.0 is a strategic management issue now)."&gt;Enterprise Social Software is Mainstream&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Enterprise Social Software has gone mainstream. I say this based on the fact that the analysts are now releasing stacks of research on this area. Forrester is a good example, and McKinsey is also in on it (yes, Web 2.0 is a strategic management issue now).&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/RcDqZYZa4mM" height="1" width="1"/&gt;</content><published>2011-04-20T11:19:00Z</published><updated>2011-04-20T11:19:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Spreadsheet-errors.html</feedburner:origLink></entry><entry><title type="text">Getting started with the Heritage Health Prize competition</title><id>urn:uuid:7e9f3d60-249c-5df1-9c75-a584492c0fa1</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Getting-started-with-HHP.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/DoFsYQmBMRM/Getting-started-with-HHP.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>The US$ 3 million <a href="http://www.heritagehealthprize.com/">Heritage Health Prize</a> competition is on so we take a look at how to get started using the <a href="http://www.r-project.org/">R statistical computing and analysis platform</a>.</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;The US$ 3 million &lt;a href="http://www.heritagehealthprize.com/"&gt;Heritage Health Prize&lt;/a&gt; competition is on so we take a look at how to get started using the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt;.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;We do not have the full set of data yet, so this is a simple warm-up session to predict the days in hospital in year 2 based on the year 1 data.&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;Prerequisites&lt;/h2&gt;&#xD;
&lt;p&gt;Obviously you need to have R installed, and you should also have signed up for the competition (be sure to read the terms carefully) and downloaded and extracted the release 1 data file.&lt;/p&gt;&#xD;
&#xD;
&lt;h2 id="h2DataPrep"&gt;Data preparation&lt;/h2&gt;&#xD;
&lt;p&gt;Let’s load the data into R and do some basic housekeeping:&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;#!/usr/bin/Rscript&#xD;
## example001.R - simple benchmarks for the HHP&#xD;
## Copyright © 2011 CYBAEA Limited - &lt;a href="http://www.cybaea.net/"&gt;http://www.cybaea.net/&lt;/a&gt;&#xD;
&#xD;
##############################&#xD;
#### DATA PREPARATION&#xD;
&#xD;
##++++&#xD;
## Members&#xD;
members   &amp;lt;- read.csv(file = "HHP_release1/Members_Y1.csv",&#xD;
                      colClasses = rep("factor", 3),&#xD;
                      comment.char = "")&#xD;
##----&#xD;
##++++&#xD;
## Claims&#xD;
claims.Y1 &amp;lt;- read.csv(file = "HHP_release1/Claims_Y1.csv",&#xD;
                      colClasses = c(&#xD;
                          rep("factor", 7),&#xD;
                          "integer",    # paydelay&#xD;
                          "character",  # LengthOfStay&#xD;
                          "character",  # dsfs&#xD;
                          "factor",     # PrimaryConditionGroup&#xD;
                          "character"   # CharlsonIndex&#xD;
                          ),&#xD;
                      comment.char = "")&#xD;
## Utility function&#xD;
make.numeric &amp;lt;- function (cv, FUN = mean) {&#xD;
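### make a character vector numeric: pull the digit runs out of each string&#xD;
### (e.g. "1- 2 weeks" gives 1 and 2) and combine them with FUN (default: mean)&#xD;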
    sapply(strsplit(gsub("[^[:digit:]]+",&#xD;
                         " ",&#xD;
                         cv,&#xD;
                         perl = TRUE),&#xD;
                    " ",&#xD;
                    fixed = TRUE),&#xD;
           function (x) FUN(as.numeric(x)))&#xD;
}&#xD;
## Length of stay as days&#xD;
{&#xD;
    z &amp;lt;- make.numeric(claims.Y1$LengthOfStay)&#xD;
    z.week &amp;lt;- grepl("week", claims.Y1$LengthOfStay, fixed = TRUE)&#xD;
    z[z.week] &amp;lt;- z[z.week] * 7          # Weeks are 7 days&#xD;
    z[is.nan(z)] &amp;lt;- 0&#xD;
    claims.Y1$LengthOfStay.days &amp;lt;- z&#xD;
}&#xD;
los.levels &amp;lt;- c("", "1 day", sprintf("%d days", 2:6),&#xD;
                "1- 2 weeks", "2- 4 weeks", "4- 8 weeks", "8-12 weeks",&#xD;
                "12-26 weeks", "26+ weeks")&#xD;
stopifnot(all(claims.Y1$LengthOfStay %in% los.levels))&#xD;
claims.Y1$LengthOfStay &amp;lt;- factor(claims.Y1$LengthOfStay,&#xD;
                                 levels = los.levels,&#xD;
                                 labels = c("0 days", los.levels[-1]),&#xD;
                                 ordered = TRUE)&#xD;
## Months since first claim&#xD;
claims.Y1$dsfs.months &amp;lt;- make.numeric(claims.Y1$dsfs)&#xD;
## dsfs is an ordered factor and gives the ordering of the claims&#xD;
dsfs.levels &amp;lt;- c("0- 1 month", sprintf("%d-%2d months", 1:11, 2:12))&#xD;
claims.Y1$dsfs &amp;lt;- factor(claims.Y1$dsfs, levels = dsfs.levels, ordered = TRUE)&#xD;
## Index as numeric&#xD;
claims.Y1$CharlsonIndex.numeric &amp;lt;- make.numeric(claims.Y1$CharlsonIndex)&#xD;
claims.Y1$CharlsonIndex &amp;lt;- factor(claims.Y1$CharlsonIndex, ordered = TRUE)&#xD;
##----&#xD;
##++++&#xD;
## Days in hospital&#xD;
dih.Y2    &amp;lt;- read.csv(file = "HHP_release1/DayInHospital_Y2.csv",&#xD;
                      colClasses = c("factor", "integer"),&#xD;
                      comment.char = "")&#xD;
names(dih.Y2)[1] &amp;lt;- "MemberID"          # Fix broken file&#xD;
##----&#xD;
save(members, claims.Y1, dih.Y2,&#xD;
     file = "HHPR1.RData")&#xD;
##############################&#xD;
&lt;/pre&gt;&#xD;
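&#xD;
&lt;p&gt;To make the conversion concrete, here is a small sketch (not part of the original script) that applies the same &lt;code&gt;make.numeric&lt;/code&gt; helper to a few range labels.  The vector &lt;code&gt;los&lt;/code&gt; is a made-up sample: ranges become their midpoint, and the empty string becomes &lt;code&gt;NaN&lt;/code&gt;, which is why the script sets &lt;code&gt;NaN&lt;/code&gt; values to 0.&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;## Same helper as in the data preparation script&#xD;
make.numeric &amp;lt;- function (cv, FUN = mean) {&#xD;
    sapply(strsplit(gsub("[^[:digit:]]+", " ", cv, perl = TRUE),&#xD;
                    " ", fixed = TRUE),&#xD;
           function (x) FUN(as.numeric(x)))&#xD;
}&#xD;
&#xD;
los &amp;lt;- c("", "1 day", "3 days", "1- 2 weeks", "26+ weeks")  # made-up sample&#xD;
z &amp;lt;- make.numeric(los)                 # NaN, 1, 3, 1.5, 26&#xD;
z.week &amp;lt;- grepl("week", los, fixed = TRUE)&#xD;
z[z.week] &amp;lt;- z[z.week] * 7             # weeks to days&#xD;
z[is.nan(z)] &amp;lt;- 0                      # empty label: no stay recorded&#xD;
print(z)                                # values: 0, 1, 3, 10.5, 182&#xD;
&lt;/pre&gt;&#xD;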
&#xD;
&lt;h2 id="h2Score"&gt;Scoring&lt;/h2&gt;&#xD;
&lt;p&gt;We will need a function to score our predictions &lt;code&gt;p&lt;/code&gt; against the actual values &lt;code&gt;a&lt;/code&gt;.  The formula is on the &lt;a href="http://www.heritagehealthprize.com/c/hhp/Details/Evaluation"&gt;evaluation page&lt;/a&gt; and we implement it as:&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;#!/usr/bin/Rscript&#xD;
## example001.R - simple benchmarks for the HHP&#xD;
## Copyright © 2011 CYBAEA Limited - &lt;a href="http://www.cybaea.net/"&gt;http://www.cybaea.net/&lt;/a&gt;&#xD;
&#xD;
##############################&#xD;
#### FUNCTION TO CALCULATE SCORE&#xD;
HPPScore &amp;lt;- function (p, a) {&#xD;
### Scoring function after&#xD;
### http://www.heritagehealthprize.com/c/hhp/Details/Evaluation&#xD;
### Base 10 log from http://www.heritagehealthprize.com/forums/default.aspx?g=posts&amp;amp;m=2226#post2226&#xD;
    sqrt(mean((log(1+p, 10) - log(1+a, 10))^2))&#xD;
}&#xD;
##############################&#xD;
&lt;/pre&gt;&#xD;
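&#xD;
&lt;p&gt;A quick sanity check of the scoring function, as a sketch with made-up numbers rather than competition data: a perfect prediction scores 0, and predicting 9 days for a member who had none is off by exactly one unit on the base 10 log scale.&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;HPPScore &amp;lt;- function (p, a) {&#xD;
    sqrt(mean((log(1+p, 10) - log(1+a, 10))^2))&#xD;
}&#xD;
&#xD;
s.perfect &amp;lt;- HPPScore(c(0, 2, 5), c(0, 2, 5))  # identical vectors: 0&#xD;
s.single  &amp;lt;- HPPScore(9, 0)                    # log10(10) - log10(1) = 1&#xD;
s.mixed   &amp;lt;- HPPScore(c(9, 0), c(0, 0))        # sqrt(mean(c(1, 0))), about 0.707&#xD;
print(c(s.perfect, s.single, s.mixed))&#xD;
&lt;/pre&gt;&#xD;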
&#xD;
&lt;h2&gt;The simplest benchmarks&lt;/h2&gt;&#xD;
&lt;p&gt;The simplest models don’t really model at all: they just use the average and are simple benchmarks.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;#!/usr/bin/Rscript&#xD;
## example001.R - simple benchmarks for the HHP&#xD;
## Copyright © 2011 CYBAEA Limited - &lt;a href="http://www.cybaea.net/"&gt;http://www.cybaea.net/&lt;/a&gt;&#xD;
&#xD;
y &amp;lt;- dih.Y2$DaysInHospital_Y2           # Actual&#xD;
p &amp;lt;- rep(mean(y), NROW(dih.Y2))&#xD;
cat(sprintf("Score using mean  : %8.6f\n", HPPScore(p, y)))&#xD;
# Score using mean  : 0.278725&#xD;
&#xD;
p &amp;lt;- rep(median(y), NROW(dih.Y2))&#xD;
cat(sprintf("Score using median: %8.6f\n", HPPScore(p, y)))&#xD;
# Score using median: 0.267969&#xD;
&lt;/pre&gt;&#xD;
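&#xD;
&lt;p&gt;The median scores better than the mean here, and the form of the metric suggests why: the score is a root mean squared error on the log(1 + x) scale, so among constant predictions the one that minimises it is the back-transformed mean of log(1 + y), not the arithmetic mean, which a few long stays pull upwards.  The sketch below illustrates this on a made-up, skewed vector; the values are hypothetical and nothing is implied about scores on the real data.&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;HPPScore &amp;lt;- function (p, a) {&#xD;
    sqrt(mean((log(1+p, 10) - log(1+a, 10))^2))&#xD;
}&#xD;
&#xD;
y &amp;lt;- c(0, 0, 0, 0, 0, 1, 2, 15)        # hypothetical, skewed like days in hospital&#xD;
candidates &amp;lt;- c(mean    = mean(y),&#xD;
                median  = median(y),&#xD;
                logmean = 10^mean(log(1 + y, 10)) - 1)  # best constant for this score&#xD;
scores &amp;lt;- sapply(candidates,&#xD;
                 function (p) HPPScore(rep(p, length(y)), y))&#xD;
print(scores)                           # logmean is never worse than the other two&#xD;
&lt;/pre&gt;&#xD;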
&#xD;
&lt;h2&gt;Simple single-variable linear models&lt;/h2&gt;&#xD;
&lt;p&gt;OK, a model that doesn’t use past data isn’t much of a model, so let’s improve on that:&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;#!/usr/bin/Rscript&#xD;
## example001.R - simple benchmarks for the HHP&#xD;
## Copyright © 2011 CYBAEA Limited - &lt;a href="http://www.cybaea.net/"&gt;http://www.cybaea.net/&lt;/a&gt;&#xD;
library("reshape2")&#xD;
&#xD;
vars &amp;lt;- dcast(claims.Y1, MemberID ~ ., sum, value.var = "LengthOfStay.days")&#xD;
names(vars)[2] &amp;lt;- "LengthOfStay"&#xD;
data &amp;lt;- merge(vars, dih.Y2)&#xD;
&#xD;
model &amp;lt;- lm(DaysInHospital_Y2 ~ LengthOfStay, data = data)&#xD;
p &amp;lt;- predict(model)&#xD;
cat(sprintf("Score using lm(LengthOfStay): %8.6f\n", HPPScore(p, y)))&#xD;
# Score using lm(LengthOfStay): 0.279062&#xD;
&#xD;
model &amp;lt;- glm(DaysInHospital_Y2 ~ LengthOfStay,&#xD;
             family = quasipoisson(),&#xD;
             data = data)&#xD;
p &amp;lt;- predict(model, type="response")&#xD;
cat(sprintf("Score using glm(LengthOfStay): %8.6f\n", HPPScore(p, y)))&#xD;
# Score using glm(LengthOfStay): 0.278914&#xD;
&lt;/pre&gt;&#xD;
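&#xD;
&lt;p&gt;One thing to watch when scoring a model fitted on merged data: &lt;code&gt;merge&lt;/code&gt; sorts by the key and drops members that appear in only one of the inputs, so the predictions follow the row order of the merged frame.  The sketch below uses made-up toy data (hypothetical values, not the competition files) and base R &lt;code&gt;aggregate&lt;/code&gt; as a stand-in for &lt;code&gt;dcast&lt;/code&gt;, and scores against the outcome column of the merged frame so that predictions and actual values are guaranteed to line up.&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;HPPScore &amp;lt;- function (p, a) {&#xD;
    sqrt(mean((log(1+p, 10) - log(1+a, 10))^2))&#xD;
}&#xD;
&#xD;
## Made-up toy data (hypothetical values)&#xD;
claims &amp;lt;- data.frame(MemberID = factor(c("c", "a", "a", "b", "c", "c")),&#xD;
                     LengthOfStay.days = c(1, 0, 2, 0, 3, 1))&#xD;
dih &amp;lt;- data.frame(MemberID = factor(c("c", "b", "a", "d")),&#xD;
                  DaysInHospital_Y2 = c(2L, 0L, 1L, 0L))&#xD;
&#xD;
## Base R aggregate() stands in for dcast() here&#xD;
vars &amp;lt;- aggregate(LengthOfStay.days ~ MemberID, data = claims, FUN = sum)&#xD;
names(vars)[2] &amp;lt;- "LengthOfStay"&#xD;
data &amp;lt;- merge(vars, dih)               # member "d" has no claims and is dropped&#xD;
&#xD;
model &amp;lt;- lm(DaysInHospital_Y2 ~ LengthOfStay, data = data)&#xD;
p &amp;lt;- predict(model)                    # one prediction per row of data&#xD;
score &amp;lt;- HPPScore(p, data$DaysInHospital_Y2)  # actuals in the same row order as p&#xD;
print(score)&#xD;
&lt;/pre&gt;&#xD;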
&#xD;
&lt;p&gt;Let the competition begin.&lt;/p&gt;&#xD;
&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=DoFsYQmBMRM:_PKWgScWvpo:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=DoFsYQmBMRM:_PKWgScWvpo:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=DoFsYQmBMRM:_PKWgScWvpo:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=DoFsYQmBMRM:_PKWgScWvpo:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=DoFsYQmBMRM:_PKWgScWvpo:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=DoFsYQmBMRM:_PKWgScWvpo:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=DoFsYQmBMRM:_PKWgScWvpo:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=DoFsYQmBMRM:_PKWgScWvpo:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=DoFsYQmBMRM:_PKWgScWvpo:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/DoFsYQmBMRM" height="1" width="1"/&gt;</content><published>2011-04-08T08:39:00Z</published><updated>2011-04-08T08:39:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Getting-started-with-HHP.html</feedburner:origLink></entry><entry><title type="text">Benchmarking feature selection with Boruta and caret</title><id>urn:uuid:1a953ff9-7aa7-5db9-9a49-ec6e3ba6872f</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Benchmarking-feature-selection-with-Boruta-and-caret.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/G4iSTwL88Q0/Benchmarking-feature-selection-with-Boruta-and-caret.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><div class="floatRight">
<a href="http://www.cybaea.net/Blogs/Data/Benchmarking-feature-selection-with-Boruta-and-caret.html" title="Click for full article"><img src="http://static.cybaea.net/images/Boruta-feature-benchmark-150.png" width="150" height="150" alt="[Performance of Boruta feature selection]" /></a>
</div>
<p>
<dfn>Feature selection</dfn> is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering.  For commercial data mining, which is often characterised by having too many variables for model building, this is an important step in the analysis process.  And since we often work on very large data sets the performance of our process is very important to us.
</p>
<p>
Having looked at <a href="http://www.cybaea.net/Blogs/Data/Feature-selection-All_relevant-selection-with-the-Boruta-package.html">feature selection using the Boruta package</a> and <a href="http://www.cybaea.net/Blogs/Data/Feature-selection-Using-the-caret-package.html">feature selection using the caret package</a> separately, we now consider the performance of the two approaches.
</p>
<p>
Neither approach is suitable out of the box for the sizes of data sets that we normally work with.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
&lt;dfn&gt;Feature selection&lt;/dfn&gt; is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering.  For commercial data mining, which is often characterised by having too many variables for model building, this is an important step in the analysis process.  And since we often work on very large data sets the performance of our process is very important to us.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Having looked at &lt;a href="http://www.cybaea.net/Blogs/Data/Feature-selection-All_relevant-selection-with-the-Boruta-package.html"&gt;feature selection using the Boruta package&lt;/a&gt; and &lt;a href="http://www.cybaea.net/Blogs/Data/Feature-selection-Using-the-caret-package.html"&gt;feature selection using the caret package&lt;/a&gt; separately, we now consider the performance of the two approaches.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
For our tests we will use an artificially constructed, trivial data set that the automated process should have no problems with (although, as we will see, this expectation will be disappointed).  The data set has an equal number of normal and uniform random variables with mean 0 and variance 1, of which 20% are used for the target variable.  There are 10 times as many observations as variables.  We create a function to set this up:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;make.data &amp;lt;- function (n.var, m.rand = 5, m.obs = 10) {&#xD;
    n.col &amp;lt;- n.var * m.rand&#xD;
    n.obs &amp;lt;- n.col * m.obs * 2&#xD;
    x &amp;lt;- data.frame(N = matrix(rnorm(n = n.col*n.obs),&#xD;
                        nrow = n.obs, ncol = n.col),&#xD;
                    U = matrix(runif(n = n.col*n.obs,&#xD;
                        min = -sqrt(3), max = sqrt(3)), n.obs, n.col))&#xD;
    deps.n &amp;lt;- 1:n.var&#xD;
    deps.u &amp;lt;- (1+n.col):(n.var+n.col)&#xD;
    y &amp;lt;- rowSums(as.matrix(x[, c(deps.n, deps.u)]))&#xD;
    x &amp;lt;- cbind(x, Y = factor(y &amp;gt;= 0, labels=c("N", "P")))&#xD;
    attr(x, "vars") &amp;lt;- names(x)[c(deps.n, deps.u)]&#xD;
    return(x)&#xD;
}&#xD;
&lt;/pre&gt;&#xD;
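&lt;p&gt;&#xD;
As a usage sketch (the seed and the size 2 are arbitrary choices for illustration), the smallest interesting case gives 200 observations of 10 normal and 10 uniform variables plus the target &lt;code&gt;Y&lt;/code&gt;, with the first two columns of each kind driving the target:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;## Same function as above, repeated so that the example runs on its own&#xD;
make.data &amp;lt;- function (n.var, m.rand = 5, m.obs = 10) {&#xD;
    n.col &amp;lt;- n.var * m.rand&#xD;
    n.obs &amp;lt;- n.col * m.obs * 2&#xD;
    x &amp;lt;- data.frame(N = matrix(rnorm(n = n.col*n.obs),&#xD;
                        nrow = n.obs, ncol = n.col),&#xD;
                    U = matrix(runif(n = n.col*n.obs,&#xD;
                        min = -sqrt(3), max = sqrt(3)), n.obs, n.col))&#xD;
    deps.n &amp;lt;- 1:n.var&#xD;
    deps.u &amp;lt;- (1+n.col):(n.var+n.col)&#xD;
    y &amp;lt;- rowSums(as.matrix(x[, c(deps.n, deps.u)]))&#xD;
    x &amp;lt;- cbind(x, Y = factor(y &amp;gt;= 0, labels=c("N", "P")))&#xD;
    attr(x, "vars") &amp;lt;- names(x)[c(deps.n, deps.u)]&#xD;
    return(x)&#xD;
}&#xD;
&#xD;
set.seed(1)                             # arbitrary seed&#xD;
x &amp;lt;- make.data(2)&#xD;
print(dim(x))                           # 200 rows, 21 columns (20 predictors + Y)&#xD;
print(attr(x, "vars"))                  # "N.1" "N.2" "U.1" "U.2"&#xD;
print(table(x$Y))                       # class balance of the target&#xD;
&lt;/pre&gt;&#xD;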
&#xD;
&lt;h2&gt;The Boruta package&lt;/h2&gt;&#xD;
&lt;p&gt;&#xD;
Then we run a test using the Boruta package for different sizes:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;#!/usr/bin/Rscript&#xD;
## bench.R - benchmark Boruta package&#xD;
## Copyright © 2010 Allan Engelhardt (&lt;a href="http://www.cybaea.net/"&gt;http://www.cybaea.net/&lt;/a&gt;)&#xD;
run.name &amp;lt;- "bench-1"&#xD;
library("Boruta")&#xD;
&#xD;
set.seed(1)&#xD;
&#xD;
sizes &amp;lt;- c(1:10, 10*(2:10), 100*(2:10), 1e3*(2:10))&#xD;
n.sizes &amp;lt;- length(sizes)&#xD;
bench &amp;lt;- data.frame(n.vars = sizes, elapsed = NA, right = NA, wrong = NA)&#xD;
file.name &amp;lt;- paste(run.name, "RData", sep = ".")&#xD;
&#xD;
for (n in 1:length(sizes)) {&#xD;
    size &amp;lt;- sizes[n]&#xD;
    cat(sprintf("[%s] Size = %3d: ", as.character(Sys.time()), size))&#xD;
    tries &amp;lt;- max(3, round(10/size, 0))&#xD;
    n.right &amp;lt;- 0&#xD;
    n.wrong &amp;lt;- 0&#xD;
    elapsed &amp;lt;- 0&#xD;
    for (try in 1:tries) {&#xD;
        cat(tries-try, ".", sep = "")&#xD;
        x &amp;lt;- make.data(size)&#xD;
        x.vars &amp;lt;- attr(x, "vars")&#xD;
        elapsed &amp;lt;- elapsed +&#xD;
            system.time({b &amp;lt;- Boruta(x[,-NCOL(x)], x[,NCOL(x)])}&#xD;
                        )["elapsed"]&#xD;
        b.vars &amp;lt;- names(b$finalDecision)[b$finalDecision!="Rejected"]&#xD;
        n.right &amp;lt;- n.right + length(intersect(b.vars, x.vars))&#xD;
        n.wrong &amp;lt;- n.wrong + length(setdiff(b.vars, x.vars))&#xD;
    }&#xD;
    elapsed &amp;lt;- elapsed / tries&#xD;
    cat(" Elapsed = ", round(elapsed, 0), " seconds\n", sep = "")&#xD;
    n.right &amp;lt;- n.right / tries&#xD;
    n.wrong &amp;lt;- n.wrong / tries&#xD;
    bench[n, ] &amp;lt;- c(size, elapsed, n.right, n.wrong)&#xD;
    save(bench, file = file.name, ascii = FALSE, compress = FALSE)&#xD;
}&#xD;
&#xD;
print(bench)&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
As it turned out, our expectations for the size of data set we could handle were wildly optimistic and we killed the process at size 30.  We add to the data set a field with the total number of data elements in the &lt;code&gt;x&lt;/code&gt; data set and plot the results.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;load(file = "bench-1.RData")&#xD;
bench &amp;lt;- na.omit(bench)&#xD;
bench$n.elem &amp;lt;- bench$n.vars^2 * 1e3   # (10 * n.vars columns) x (100 * n.vars rows)&#xD;
plot(elapsed ~ n.elem, data = bench, type = "b",&#xD;
     main = "Feature selections with Boruta",&#xD;
     sub = "Elapsed time versus number of data elements",&#xD;
     log = "xy",&#xD;
     xlab = "Elements in data set", ylab = "Elapsed time (seconds)")&#xD;
&lt;/pre&gt;&#xD;
&lt;div class="floatCenter" style="width: 400px"&gt;&#xD;
&lt;p&gt;&lt;a href="http://static.cybaea.net/images/Boruta-feature-benchmark.png"&gt;&lt;img src="http://static.cybaea.net/images/Boruta-feature-benchmark-400.png" width="400" height="400" alt="[Boruta feature selection benchmark]"&gt;&lt;/img&gt;&lt;/a&gt;&lt;/p&gt;&#xD;
&lt;p class="caption"&gt;Benchmarking results for feature selection with Boruta package shows linear scaling (slope is 1.01 with standard error 0.025 and adjusted R² 0.993)&lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;p&gt;A quick check using &lt;code&gt;summary(lm(log(elapsed) ~ log(n.elem), data = bench))&lt;/code&gt; shows linear scaling with the number of elements (slope is 1.01 with standard error 0.025 and adjusted R² 0.993).  The algorithm selects all the right features up to &lt;code&gt;n.vars = 10&lt;/code&gt;, after which it starts to miss some of them:&#xD;
&lt;/p&gt;&#xD;
&lt;table&gt;&#xD;
&lt;caption&gt;Benchmark results for Boruta package&lt;/caption&gt;&#xD;
&lt;thead&gt;&#xD;
&lt;tr&gt;&lt;th&gt;n.vars&lt;/th&gt;&lt;th&gt;right&lt;/th&gt;&lt;th&gt;wrong&lt;/th&gt;&lt;/tr&gt;&#xD;
&lt;/thead&gt;&#xD;
&lt;tbody style="text-align: right"&gt;&#xD;
&lt;tr&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;2.00000&lt;/td&gt;&lt;td&gt;1.1000000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;4.00000&lt;/td&gt;&lt;td&gt;1.2000000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;6.00000&lt;/td&gt;&lt;td&gt;1.6666667&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;8.00000&lt;/td&gt;&lt;td&gt;1.3333333&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;10.00000&lt;/td&gt;&lt;td&gt;1.6666667&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;12.00000&lt;/td&gt;&lt;td&gt;1.3333333&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;14.00000&lt;/td&gt;&lt;td&gt;1.0000000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;16.00000&lt;/td&gt;&lt;td&gt;1.3333333&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;9&lt;/td&gt;&lt;td&gt;18.00000&lt;/td&gt;&lt;td&gt;0.6666667&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;10&lt;/td&gt;&lt;td&gt;20.00000&lt;/td&gt;&lt;td&gt;0.3333333&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;20&lt;/td&gt;&lt;td&gt;39.33333&lt;/td&gt;&lt;td&gt;0.0000000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;30&lt;/td&gt;&lt;td&gt;56.33333&lt;/td&gt;&lt;td&gt;0.0000000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;/tbody&gt;&#xD;
&lt;/table&gt;&#xD;
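&lt;p&gt;&#xD;
The log-log fit used for the scaling check can be tried on synthetic numbers.  In the sketch below the timings are generated, not measured: &lt;code&gt;elapsed&lt;/code&gt; is set exactly proportional to &lt;code&gt;n.elem&lt;/code&gt; (the constant 0.005 is arbitrary), so the fitted slope is 1 by construction.  A slope near 2 would instead indicate quadratic scaling.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;toy &amp;lt;- data.frame(n.vars = c(1:10, 20, 30))&#xD;
toy$n.elem  &amp;lt;- toy$n.vars^2 * 1e3&#xD;
toy$elapsed &amp;lt;- 0.005 * toy$n.elem      # synthetic, exactly linear timings&#xD;
fit &amp;lt;- lm(log(elapsed) ~ log(n.elem), data = toy)&#xD;
slope &amp;lt;- coef(fit)[["log(n.elem)"]]    # exponent of the power law&#xD;
print(slope)                            # 1, up to rounding&#xD;
&lt;/pre&gt;&#xD;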
&lt;p&gt;&#xD;
A higher accuracy in the feature selection for the larger problems could presumably be achieved by adjusting the &lt;code&gt;maxRuns&lt;/code&gt; and perhaps &lt;code&gt;confidence&lt;/code&gt; parameters on the &lt;code&gt;Boruta&lt;/code&gt; call.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
In summary, the Boruta package performs well up to about 20 features out of 100 (&lt;code&gt;n.vars = 10&lt;/code&gt;) which runs in about 11 minutes on my machine.  If we changed the technical implementation to support multicore, MPI, and other parallel frameworks, then the out of the box settings would be useful up to &lt;code&gt;n.vars&lt;/code&gt; of 20 or 30 (40-60 features out of 200-300) which an 8-core machine should be able to complete in 20 minutes or so.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
This is still a lot less than the size of data sets we normally work with.  (Our usual benchmark is 15,000 variables and 50,000 observations.)&#xD;
&lt;/p&gt;&#xD;
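&lt;p&gt;&#xD;
To put that gap in numbers, here is a back-of-envelope extrapolation.  It assumes that the roughly linear scaling and the 11 minute run time at &lt;code&gt;n.vars = 10&lt;/code&gt; continue to hold far outside the measured range, which is a strong assumption, so treat the result as an order of magnitude only.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;ref.elem    &amp;lt;- 10^2 * 1e3              # n.vars = 10: 100 columns x 1,000 rows&#xD;
ref.minutes &amp;lt;- 11                      # reported run time at that size&#xD;
target.elem &amp;lt;- 15000 * 50000           # usual benchmark: variables x observations&#xD;
est.minutes &amp;lt;- ref.minutes * target.elem / ref.elem&#xD;
est.days    &amp;lt;- est.minutes / (60 * 24)&#xD;
print(est.days)                         # roughly 57 days on a single core&#xD;
&lt;/pre&gt;&#xD;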
&#xD;
&lt;h2&gt;The caret package&lt;/h2&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
One of the nice features of the caret package is that it supports most parallel processing frameworks out of the box, but for comparison with the previous analysis we will (somewhat unfairly) not use that here.  The setup is then quite simple, using the same &lt;code&gt;make.data&lt;/code&gt; function as before.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;#!/usr/bin/Rscript&#xD;
## bench.R - benchmark caret package&#xD;
## Copyright © 2010 Allan Engelhardt (&lt;a href="http://www.cybaea.net/"&gt;http://www.cybaea.net/&lt;/a&gt;)&#xD;
run.name &amp;lt;- "bench-2"&#xD;
library("caret")&#xD;
library("randomForest")&#xD;
set.seed(1)&#xD;
&#xD;
control &amp;lt;- rfeControl(functions = rfFuncs, verbose = FALSE,&#xD;
                      returnResamp = "final")&#xD;
&#xD;
## if ( require("multicore", quietly = TRUE, warn.conflicts = FALSE) ) {&#xD;
##     control$workers &amp;lt;- multicore:::detectCores()&#xD;
##     control$computeFunction &amp;lt;- mclapply&#xD;
##     control$computeArgs &amp;lt;- list(mc.preschedule = FALSE, mc.set.seed = FALSE)&#xD;
## }&#xD;
&#xD;
our.sizes &amp;lt;- c(2:10, 10*(2:10), 100*(2:10), 1e3*(2:10))&#xD;
n.sizes &amp;lt;- length(our.sizes)&#xD;
bench &amp;lt;- data.frame(n.vars = our.sizes, elapsed = NA, right = NA, wrong = NA)&#xD;
file.name &amp;lt;- paste(run.name, "RData", sep = ".")&#xD;
&#xD;
for (n in 1:length(our.sizes)) {&#xD;
    size &amp;lt;- our.sizes[n]&#xD;
    cat(sprintf("[%s] Size = %3d: ", as.character(Sys.time()), size))&#xD;
    tries &amp;lt;- max(3, round(10/size, 0))&#xD;
    n.right &amp;lt;- 0&#xD;
    n.wrong &amp;lt;- 0&#xD;
    elapsed &amp;lt;- 0&#xD;
    for (try in 1:tries) {&#xD;
        cat(tries-try, ".", sep = "")&#xD;
        x &amp;lt;- make.data(size)&#xD;
        x.vars &amp;lt;- attr(x, "vars")&#xD;
        elapsed &amp;lt;- elapsed + &#xD;
            system.time({p &amp;lt;- rfe(x[,-NCOL(x)], x[,NCOL(x)],&#xD;
                                  sizes = 1:(2*size), rfeControl = control)}&#xD;
                        )["elapsed"]&#xD;
        p.vars &amp;lt;- predictors(p)&#xD;
        n.right &amp;lt;- n.right + length(intersect(p.vars, x.vars))&#xD;
        n.wrong &amp;lt;- n.wrong + length(setdiff(p.vars, x.vars))&#xD;
    }&#xD;
    elapsed &amp;lt;- elapsed / tries&#xD;
    cat(" Elapsed = ", round(elapsed, 0), " seconds\n", sep = "")&#xD;
    n.right &amp;lt;- n.right / tries&#xD;
    n.wrong &amp;lt;- n.wrong / tries&#xD;
    bench[n, ] &amp;lt;- c(size, elapsed, n.right, n.wrong)&#xD;
    save(bench, file = file.name, ascii = FALSE, compress = FALSE)&#xD;
}&#xD;
&#xD;
print(bench)&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
This uses the &lt;code&gt;randomForest&lt;/code&gt; classifier from the package of the same name.  To use the &lt;code&gt;ipredbagg&lt;/code&gt; bagging classifier from Andrea Peters and Torsten Hothorn's &lt;a href="http://CRAN.R-project.org/package=ipred"&gt;ipred: Improved Predictors&lt;/a&gt; package we simply change the &lt;code&gt;control&lt;/code&gt; object to:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;control &amp;lt;- rfeControl(functions = treebagFuncs, verbose = FALSE,&#xD;
                      returnResamp = "final")&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
As usual, we were wildly optimistic in our guesses for the size of problems we could handle, and had to abort the run.&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;div class="floatCenter"&gt;&#xD;
&lt;div style="width: 400px; margin-right: 10px; display: inline-block;"&gt;&#xD;
&lt;p&gt;&lt;a href="http://static.cybaea.net/images/caret-rf-feature-benchmark.png"&gt;&lt;img src="http://static.cybaea.net/images/caret-rf-feature-benchmark-400.png" width="400" height="400" alt="[caret feature selection benchmark]"&gt;&lt;/img&gt;&lt;/a&gt;&lt;/p&gt;&#xD;
&lt;p class="caption"&gt;Benchmarking results for feature selection with caret package using randomForest classifier (slope is 1.17 with standard error 0.024 and adjusted R² 0.996)&lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div style="width: 400px; display: inline-block;"&gt;&#xD;
&lt;p&gt;&lt;a href="http://static.cybaea.net/images/caret-treebag-feature-benchmark.png"&gt;&lt;img src="http://static.cybaea.net/images/caret-treebag-feature-benchmark-400.png" width="400" height="400" alt="[caret feature selection benchmark]"&gt;&lt;/img&gt;&lt;/a&gt;&lt;/p&gt;&#xD;
&lt;p class="caption"&gt;Benchmarking results for feature selection with caret package using treebag classifier shows non-power behaviour (nevertheless, a linear log-log fit gives a slope of 1.12 with standard error 0.067 and adjusted R² 0.96)&lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;div class="floatCenter"&gt;&#xD;
&lt;div style="width: 400px; margin-right: 10px; display: inline-block;"&gt;&#xD;
&lt;table&gt;&#xD;
&lt;caption&gt;Benchmark results for caret package using randomForest classifier&lt;/caption&gt;&#xD;
&lt;thead&gt;&#xD;
&lt;tr&gt;&lt;th&gt;n.vars&lt;/th&gt;&lt;th&gt;right&lt;/th&gt;&lt;th&gt;wrong&lt;/th&gt;&lt;/tr&gt;&#xD;
&lt;/thead&gt;&#xD;
&lt;tbody style="text-align: right"&gt;&#xD;
&lt;tr&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;3.20000&lt;/td&gt;&lt;td&gt;3.200000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;5.00000&lt;/td&gt;&lt;td&gt;0.000000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;7.00000&lt;/td&gt;&lt;td&gt;0.000000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;9.00000&lt;/td&gt;&lt;td&gt;0.000000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;11.00000&lt;/td&gt;&lt;td&gt;0.000000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;13.00000&lt;/td&gt;&lt;td&gt;0.000000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;14.66667&lt;/td&gt;&lt;td&gt;0.000000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;9&lt;/td&gt;&lt;td&gt;16.66667&lt;/td&gt;&lt;td&gt;0.000000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;10&lt;/td&gt;&lt;td&gt;19.00000&lt;/td&gt;&lt;td&gt;0.000000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;20&lt;/td&gt;&lt;td&gt;38.66667&lt;/td&gt;&lt;td&gt;1.333333&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;30&lt;/td&gt;&lt;td&gt;54.00000&lt;/td&gt;&lt;td&gt;86.000000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;/tbody&gt;&#xD;
&lt;/table&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div style="width: 400px; display: inline-block;"&gt;&#xD;
&lt;table&gt;&#xD;
&lt;caption&gt;Benchmark results for caret package using ipredbagg classifier&lt;/caption&gt;&#xD;
&lt;thead&gt;&#xD;
&lt;tr&gt;&lt;th&gt;n.vars&lt;/th&gt;&lt;th&gt;right&lt;/th&gt;&lt;th&gt;wrong&lt;/th&gt;&lt;/tr&gt;&#xD;
&lt;/thead&gt;&#xD;
&lt;tbody style="text-align: right"&gt;&#xD;
&lt;tr&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;3.00000&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;5.00000&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;7.00000&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;9.00000&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;10.33333&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;13.00000&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;14.33333&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;9&lt;/td&gt;&lt;td&gt;16.00000&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;10&lt;/td&gt;&lt;td&gt;18.66667&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;20&lt;/td&gt;&lt;td&gt;35.33333&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;30&lt;/td&gt;&lt;td&gt;54.33333&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;40&lt;/td&gt;&lt;td&gt;69.66667&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;/tbody&gt;&#xD;
&lt;/table&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
Remember that the right number of significant features is &lt;code&gt;2 * n.vars&lt;/code&gt;.  The caret package apparently always misses one feature in its selection, which is very odd and possibly a bug.  It is less likely to select the wrong features than Boruta, but that could be partially due to the "Tentative" data in Boruta.  Timing-wise, performance is a little worse in the non-parallel situation, but realistically it is a lot better than Boruta, depending on the number of cores on your processor or nodes in your cluster.&#xD;
&lt;/p&gt;&#xD;
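&lt;p&gt;To make the shortfall explicit, the sketch below subtracts the &lt;code&gt;right&lt;/code&gt; column of the randomForest table above from the expected &lt;code&gt;2 * n.vars&lt;/code&gt;.  The numbers are copied from that table; the variable name &lt;code&gt;shortfall&lt;/code&gt; is ours and purely illustrative.&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;## Values copied from the randomForest benchmark table above&#xD;
n.vars &amp;lt;- c(2:10, 20, 30)&#xD;
right  &amp;lt;- c(3.2, 5, 7, 9, 11, 13, 14.66667, 16.66667, 19, 38.66667, 54)&#xD;
&#xD;
## Expected number of significant features minus the number found&#xD;
shortfall &amp;lt;- 2 * n.vars - right&#xD;
print(data.frame(n.vars, right, expected = 2 * n.vars, shortfall))&#xD;
&lt;/pre&gt;&#xD;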
&lt;p&gt;&#xD;
Neither approach is suitable out of the box for the sizes of data sets that we normally work with.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;Jump to &lt;a href="http://www.cybaea.net/Blogs/Data/Benchmarking-feature-selection-with-Boruta-and-caret.html#h2_entry_comments"&gt;comments&lt;/a&gt;.&lt;/p&gt;&#xD;
&lt;div class="seealso"&gt;&#xD;
&lt;h1&gt;You may also like these posts:&lt;/h1&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-50.png" width="85" height="16" alt="[0.52]" title="[0.52]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Feature-selection-All_relevant-selection-with-the-Boruta-package.html" title="Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building. There are two main approaches to selecting the features (variables) we will use for the analysis: the minimal-optimal feature selection which identifies a small (ideally minimal) set of variables that gives the best possible classification result (for a class of classification models) and the all-relevant feature selection which identifies all variables that are in some circumstances relevant for the classification. In this article we take a first look at the problem of all-relevant feature selection using the Boruta package by Miron B. Kursa and Witold R. Rudnicki. This package is developed fo…"&gt;Feature selection: All-relevant selection with the Boruta package&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building. There are two main approaches to selecting the features (variables) we will use for the analysis: the minimal-optimal feature selection which identifies a small (ideally minimal) set of variables that gives the best possible classification result (for a class of classification models) and the all-relevant feature selection which identifies all variables that are in some circumstances relevant for the classification. In this article we take a first look at the problem of all-relevant feature selection using the Boruta package by Miron B. Kursa and Witold R. Rudnicki. This package is developed fo…&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-50.png" width="85" height="16" alt="[0.51]" title="[0.51]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Feature-selection-Using-the-caret-package.html" title="Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building. In a previous post we looked at all-relevant feature selection using the Boruta package while in this post we consider the same (artificial, toy) examples using the caret package. Max Kuhn kindly listed me as a contributor for some performance enhancements I submitted, but the genius behind the package is all his."&gt;Feature selection: Using the caret package&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building. In a previous post we looked at all-relevant feature selection using the Boruta package while in this post we consider the same (artificial, toy) examples using the caret package. Max Kuhn kindly listed me as a contributor for some performance enhancements I submitted, but the genius behind the package is all his.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.46]" title="[0.46]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Employee-productivity-as-function-of-number-of-workers-revisited.html" title="We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis for the FTSE-100 constituent companies and find that the relation still holds four years later and across a continent."&gt;Employee productivity as function of number of workers revisited&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis for the FTSE-100 constituent companies and find that the relation still holds four years later and across a continent.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-30.png" width="85" height="16" alt="[0.41]" title="[0.41]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Area-Plots-with-Intensity-Coloring.html" title="I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line which R does not support natively so we have to roll our own code for that. Unfortunately, lines(..., type=l) does not recycle the colour col= argument, so we end up with rather more loops than I thought would be necessary. We also get a nice opportunity to use the under-appreciated read.fwf function."&gt;Area Plots with Intensity Coloring&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line which R does not …&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-30.png" width="85" height="16" alt="[0.39]" title="[0.39]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Big-data-for-R.html" title="Revolutions Analytics recently announced their big data solution for R. This is great news and a lovely piece of work by the team at Revolutions. However, if you want to replicate their analysis in standard R , then you can absolutely do so and we show you how."&gt;Big data for R&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Revolutions Analytics recently announced their big data solution for R. This is great news and a lovely piece of work by the team at Revolutions. However, if you want to replicate their analysis in standard R , then you can absolutely do so and we show yo…&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-20.png" width="85" height="16" alt="[0.38]" title="[0.38]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Getting-started-with-HHP.html" title="The US$ 3 million Heritage Health Price competition is on so we take a look at how to get started using the R statistical computing and analysis platform ."&gt;Getting started with the Heritage Health Price competition&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;The US$ 3 million Heritage Health Price competition is on so we take a look at how to get started using the R statistical computing and analysis platform .&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-20.png" width="85" height="16" alt="[0.38]" title="[0.38]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-Eliminating-observed-values-with-zero-variance.html" title="I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform . In other words, I want to find the columns in a data frame that has zero variance. And as fast as possible, because my data sets are large, many, and changing fast. The final result surprised me a little."&gt;R: Eliminating observed values with zero variance&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform . In other words, I want to find the columns in a data frame that has zero variance. And as fast as possible…&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-10.png" width="85" height="16" alt="[0.35]" title="[0.35]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/A-warning-on-the-R-save-format.html" title="The save() function in the R platform for statistical computing is very convenient and I suspect many of us use it a lot. But I was recently bitten by a “feature” of the format which meant I could not recover my data. I recommend that you save data in a data format (e.g. CSV or CDF), not using the save() function which is really for objects (data and code). What is your approach?"&gt;A warning on the R save format&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;The save() function in the R platform for statistical computing is very convenient and I suspect many of us use it a lot. But I was recently bitten by a “feature” of the format which meant I could not recover my data. I recommend that you save data in a d…&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=G4iSTwL88Q0:G-MFRqmO24E:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=G4iSTwL88Q0:G-MFRqmO24E:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=G4iSTwL88Q0:G-MFRqmO24E:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=G4iSTwL88Q0:G-MFRqmO24E:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=G4iSTwL88Q0:G-MFRqmO24E:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=G4iSTwL88Q0:G-MFRqmO24E:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=G4iSTwL88Q0:G-MFRqmO24E:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=G4iSTwL88Q0:G-MFRqmO24E:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=G4iSTwL88Q0:G-MFRqmO24E:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/G4iSTwL88Q0" height="1" width="1"/&gt;</content><published>2010-11-25T13:43:00Z</published><updated>2010-11-25T13:43:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Benchmarking-feature-selection-with-Boruta-and-caret.html</feedburner:origLink></entry><entry><title type="text">Feature selection: Using the caret package</title><id>urn:uuid:1dda2c01-4d41-54a6-b70c-8d9c5be380fc</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Feature-selection-Using-the-caret-package.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/O6IQ4h7grTk/Feature-selection-Using-the-caret-package.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building.  In a previous post we looked at <a href="http://www.cybaea.net/Blogs/Data/Feature-selection-All_relevant-selection-with-the-Boruta-package.html">all-relevant feature selection using the Boruta package</a> while in this post we consider the same (artificial, toy) examples using the <a href="http://CRAN.R-project.org/package=caret">caret</a> package.  Max Kuhn kindly listed me as a contributor for some performance enhancements I submitted, but the genius behind the package is all his.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building.  In a previous post we looked at &lt;a href="http://www.cybaea.net/Blogs/Data/Feature-selection-All_relevant-selection-with-the-Boruta-package.html"&gt;all-relevant feature selection using the Boruta package&lt;/a&gt; while in this post we consider the same (artificial, toy) examples using the &lt;a href="http://CRAN.R-project.org/package=caret"&gt;caret&lt;/a&gt; package.  Max Kuhn kindly listed me as a contributor for some performance enhancements I submitted, but the genius behind the package is all his.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
The caret package provides a very flexible framework for the analysis as we shall see, but first we set up the artificial test data set as in the previous article.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;## Feature-bc.R - Compare Boruta and caret feature selection&#xD;
## Copyright © 2010 Allan Engelhardt (http://www.cybaea.net/)&#xD;
run.name &amp;lt;- "feature-bc"&#xD;
library("caret")&#xD;
&#xD;
## Load early to get the warnings out of the way:&#xD;
library("randomForest")&#xD;
library("ipred")&#xD;
library("gbm")&#xD;
&#xD;
set.seed(1)&#xD;
&#xD;
## Set up artificial test data for our analysis&#xD;
n.var &amp;lt;- 20&#xD;
n.obs &amp;lt;- 200&#xD;
x &amp;lt;- data.frame(V = matrix(rnorm(n.var*n.obs), n.obs, n.var))&#xD;
n.dep &amp;lt;- floor(n.var/5)&#xD;
cat( "Number of dependent variables is", n.dep, "\n")&#xD;
m &amp;lt;- diag(n.dep:1)&#xD;
&#xD;
## These are our four test targets&#xD;
y.1 &amp;lt;- factor( ifelse( x$V.1 &amp;gt;= 0, 'A', 'B' ) )&#xD;
y.2 &amp;lt;- ifelse( rowSums(as.matrix(x[, 1:n.dep]) %*% m) &amp;gt;= 0, "A", "B" )&#xD;
y.2 &amp;lt;- factor(y.2)&#xD;
y.3 &amp;lt;- factor(rowSums(x[, 1:n.dep] &amp;gt;= 0))&#xD;
y.4 &amp;lt;- factor(rowSums(x[, 1:n.dep] &amp;gt;= 0) %% 2)&#xD;
&lt;/pre&gt;&#xD;
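&lt;p&gt;The second target is the least obvious of the four: multiplying by &lt;code&gt;diag(n.dep:1)&lt;/code&gt; and summing the rows is simply a weighted sum of the first &lt;code&gt;n.dep&lt;/code&gt; variables, with weights 4, 3, 2, 1.  A small check, using a hypothetical ten-row data set rather than the one above:&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;set.seed(1)&#xD;
n.dep &amp;lt;- 4&#xD;
x.small &amp;lt;- data.frame(V = matrix(rnorm(10 * n.dep), 10, n.dep))&#xD;
m &amp;lt;- diag(n.dep:1)                      # weights 4, 3, 2, 1 on the diagonal&#xD;
&#xD;
via.matrix &amp;lt;- rowSums(as.matrix(x.small) %*% m)&#xD;
by.hand    &amp;lt;- with(x.small, 4 * V.1 + 3 * V.2 + 2 * V.3 + 1 * V.4)&#xD;
&#xD;
## Both routes give the same weighted sum&#xD;
stopifnot(isTRUE(all.equal(unname(via.matrix), by.hand)))&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;So &lt;code&gt;y.2&lt;/code&gt; is "A" exactly when this weighted sum is non-negative, which means the first variable matters most and the fourth matters least.&lt;/p&gt;&#xD;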
&lt;p&gt;&#xD;
The flexibility of the caret package is to a large extent implemented by using control objects.  Here we specify the &lt;code&gt;randomForest&lt;/code&gt; classification algorithm (which is also what Boruta uses), and if the multicore package is available we use it for extra performance (you can also use MPI etc.; see the documentation):&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;control &amp;lt;- rfeControl(functions = rfFuncs, method = "boot", verbose = FALSE,&#xD;
                      returnResamp = "final", number = 50)&#xD;
&#xD;
if ( require("multicore", quietly = TRUE, warn.conflicts = FALSE) ) {&#xD;
    control$workers &amp;lt;- multicore:::detectCores()&#xD;
    control$computeFunction &amp;lt;- mclapply&#xD;
    control$computeArgs &amp;lt;- list(mc.preschedule = FALSE, mc.set.seed = FALSE)&#xD;
}&#xD;
&lt;/pre&gt;&#xD;
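&lt;p&gt;The multicore package has since been superseded by the parallel package that ships with R, and later caret releases parallelise through the foreach framework rather than through &lt;code&gt;computeFunction&lt;/code&gt;.  The following is only a sketch of the equivalent set-up, assuming a recent caret and the doParallel package; neither assumption comes from the original script, and the name &lt;code&gt;control.par&lt;/code&gt; is ours.&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;library("caret")&#xD;
library("doParallel")                   # assumption: installed alongside caret&#xD;
&#xD;
cl &amp;lt;- makePSOCKcluster(2)               # two workers; adjust to your machine&#xD;
registerDoParallel(cl)&#xD;
&#xD;
control.par &amp;lt;- rfeControl(functions = rfFuncs, method = "boot", verbose = FALSE,&#xD;
                          returnResamp = "final", number = 50,&#xD;
                          allowParallel = TRUE)&#xD;
## ... call rfe(x, y, sizes = sizes, rfeControl = control.par) here ...&#xD;
&#xD;
stopCluster(cl)&#xD;
registerDoSEQ()                         # return to sequential execution&#xD;
&lt;/pre&gt;&#xD;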
&lt;p&gt;&#xD;
We will consider from one to six features (using the &lt;code&gt;sizes&lt;/code&gt; variable) and then we simply let it loose:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;sizes &amp;lt;- 1:6&#xD;
&#xD;
## Use randomForest for prediction&#xD;
profile.1 &amp;lt;- rfe(x, y.1, sizes = sizes, rfeControl = control)&#xD;
cat( "rf     : Profile 1 predictors:", predictors(profile.1), fill = TRUE )&#xD;
profile.2 &amp;lt;- rfe(x, y.2, sizes = sizes, rfeControl = control)&#xD;
cat( "rf     : Profile 2 predictors:", predictors(profile.2), fill = TRUE )&#xD;
profile.3 &amp;lt;- rfe(x, y.3, sizes = sizes, rfeControl = control)&#xD;
cat( "rf     : Profile 3 predictors:", predictors(profile.3), fill = TRUE )&#xD;
profile.4 &amp;lt;- rfe(x, y.4, sizes = sizes, rfeControl = control)&#xD;
cat( "rf     : Profile 4 predictors:", predictors(profile.4), fill = TRUE )&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The results are:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;rf     : Profile 1 predictors: V.1 V.16 V.6&#xD;
rf     : Profile 2 predictors: V.1 V.2&#xD;
rf     : Profile 3 predictors: V.4 V.1 V.2&#xD;
rf     : Profile 4 predictors: V.10 V.11 V.7&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
If you recall the &lt;a href="http://www.cybaea.net/Blogs/Data/Feature-selection-All_relevant-selection-with-the-Boruta-package.html"&gt;feature selection with Boruta&lt;/a&gt; article, then the results there were&#xD;
&lt;/p&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;Profile 1: &lt;code&gt;V.1, (V.16, V.17)&lt;/code&gt;&lt;/li&gt;&#xD;
&lt;li&gt;Profile 2: &lt;code&gt;V.1, V.2, V.3, (V.8, V.9, V.4)&lt;/code&gt;&lt;/li&gt;&#xD;
&lt;li&gt;Profile 3: &lt;code&gt;V.1, V.4, V.3, V.2, (V.7, V.6)&lt;/code&gt;&lt;/li&gt;&#xD;
&lt;li&gt;Profile 4: &lt;code&gt;V.10, (V.11, V.13)&lt;/code&gt;&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
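&lt;p&gt;Since we built the targets ourselves, we know which variables are truly relevant: &lt;code&gt;V.1&lt;/code&gt; for the first target and &lt;code&gt;V.1&lt;/code&gt; to &lt;code&gt;V.4&lt;/code&gt; for the other three.  The sketch below tallies the randomForest-based caret selections printed above against that truth; the names &lt;code&gt;truth&lt;/code&gt;, &lt;code&gt;selected&lt;/code&gt; and &lt;code&gt;tally&lt;/code&gt; are ours and only for illustration.&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;## Variables each target was actually built from (see the set-up code)&#xD;
truth &amp;lt;- list(y.1 = "V.1",&#xD;
              y.2 = paste("V", 1:4, sep = "."),&#xD;
              y.3 = paste("V", 1:4, sep = "."),&#xD;
              y.4 = paste("V", 1:4, sep = "."))&#xD;
&#xD;
## Predictors reported by rfe() with rfFuncs in the output above&#xD;
selected &amp;lt;- list(y.1 = c("V.1", "V.16", "V.6"),&#xD;
                 y.2 = c("V.1", "V.2"),&#xD;
                 y.3 = c("V.4", "V.1", "V.2"),&#xD;
                 y.4 = c("V.10", "V.11", "V.7"))&#xD;
&#xD;
tally &amp;lt;- data.frame(&#xD;
    right  = mapply(function(s, t) length(intersect(s, t)), selected, truth),&#xD;
    wrong  = mapply(function(s, t) length(setdiff(s, t)),   selected, truth),&#xD;
    missed = mapply(function(s, t) length(setdiff(t, s)),   selected, truth))&#xD;
print(tally)&#xD;
&lt;/pre&gt;&#xD;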
&lt;p&gt;To show the flexibility of caret, we can run the analysis with another of the built-in classifiers:&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;## Use ipred::ipredbag for prediction&#xD;
control$functions &amp;lt;- treebagFuncs&#xD;
profile.1 &amp;lt;- rfe(x, y.1, sizes = sizes, rfeControl = control)&#xD;
cat( "treebag: Profile 1 predictors:", predictors(profile.1), fill = TRUE )&#xD;
profile.2 &amp;lt;- rfe(x, y.2, sizes = sizes, rfeControl = control)&#xD;
cat( "treebag: Profile 2 predictors:", predictors(profile.2), fill = TRUE )&#xD;
profile.3 &amp;lt;- rfe(x, y.3, sizes = sizes, rfeControl = control)&#xD;
cat( "treebag: Profile 3 predictors:", predictors(profile.3), fill = TRUE )&#xD;
profile.4 &amp;lt;- rfe(x, y.4, sizes = sizes, rfeControl = control)&#xD;
cat( "treebag: Profile 4 predictors:", predictors(profile.4), fill = TRUE )&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;This gives:&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;treebag: Profile 1 predictors: V.1 V.16&#xD;
treebag: Profile 2 predictors: V.2 V.1&#xD;
treebag: Profile 3 predictors: V.1 V.3 V.2&#xD;
treebag: Profile 4 predictors: V.10 V.11 V.1 V.7 V.13&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;And of course, if you have your own favourite model class that is not already implemented, then you can easily add it yourself.  We like &lt;code&gt;gbm&lt;/code&gt; from the package of the same name, which is kind of silly to use here because it provides variable importance automatically as part of the fitting process, but may still be useful.  It needs a numeric response so we do:&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;## Use gbm for prediction&#xD;
y.1 &amp;lt;- as.numeric(y.1)-1&#xD;
y.2 &amp;lt;- as.numeric(y.2)-1&#xD;
y.3 &amp;lt;- as.numeric(y.3)-1&#xD;
y.4 &amp;lt;- as.numeric(y.4)-1&#xD;
&#xD;
gbmFuncs &amp;lt;- treebagFuncs&#xD;
gbmFuncs$fit &amp;lt;- function (x, y, first, last, ...) {&#xD;
    library("gbm")&#xD;
    n.levels &amp;lt;- length(unique(y))&#xD;
    if ( n.levels == 2 ) {&#xD;
        distribution = "bernoulli"&#xD;
    } else {&#xD;
        distribution = "gaussian"&#xD;
    }&#xD;
    gbm.fit(x, y, distribution = distribution, ...)&#xD;
}&#xD;
gbmFuncs$pred &amp;lt;- function (object, x) {&#xD;
    n.trees &amp;lt;- suppressWarnings(gbm.perf(object,&#xD;
                                         plot.it = FALSE,&#xD;
                                         method = "OOB"))&#xD;
    if ( n.trees &amp;lt;= 0 ) n.trees &amp;lt;- object$n.trees&#xD;
    predict(object, x, n.trees = n.trees, type = "link")&#xD;
}&#xD;
control$functions &amp;lt;- gbmFuncs&#xD;
&#xD;
n.trees &amp;lt;- 1e2                          # Default value for gbm is 100&#xD;
&#xD;
profile.1 &amp;lt;- rfe(x, y.1, sizes = sizes, rfeControl = control, verbose = FALSE,&#xD;
                 n.trees = n.trees)&#xD;
cat( "gbm    : Profile 1 predictors:", predictors(profile.1), fill = TRUE )&#xD;
profile.2 &amp;lt;- rfe(x, y.2, sizes = sizes, rfeControl = control, verbose = FALSE,&#xD;
                 n.trees = n.trees)&#xD;
cat( "gbm    : Profile 2 predictors:", predictors(profile.2), fill = TRUE )&#xD;
profile.3 &amp;lt;- rfe(x, y.3, sizes = sizes, rfeControl = control, verbose = FALSE,&#xD;
                 n.trees = n.trees)&#xD;
cat( "gbm    : Profile 3 predictors:", predictors(profile.3), fill = TRUE )&#xD;
profile.4 &amp;lt;- rfe(x, y.4, sizes = sizes, rfeControl = control, verbose = FALSE,&#xD;
                 n.trees = n.trees)&#xD;
cat( "gbm    : Profile 4 predictors:", predictors(profile.4), fill = TRUE )&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;And we get the results below:&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;gbm    : Profile 1 predictors: V.1 V.10 V.11 V.12 V.13&#xD;
gbm    : Profile 2 predictors: V.1 V.2&#xD;
gbm    : Profile 3 predictors: V.4 V.1 V.2 V.3 V.7&#xD;
gbm    : Profile 4 predictors: V.11 V.10 V.1 V.6 V.7 V.18&#xD;
&lt;/pre&gt;&#xD;
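&lt;p&gt;As noted, &lt;code&gt;gbm&lt;/code&gt; already reports variable importance as part of the fit, so for comparison here is a minimal sketch that reads it directly, without caret.  It rebuilds the first target from scratch so that it runs on its own; the names &lt;code&gt;x.gbm&lt;/code&gt;, &lt;code&gt;y.gbm&lt;/code&gt; and &lt;code&gt;influence&lt;/code&gt; are ours, and the settings are simply the &lt;code&gt;gbm.fit&lt;/code&gt; defaults, not tuned values.&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;library("gbm")&#xD;
set.seed(1)&#xD;
&#xD;
## Same shape as the test data above: 200 observations, 20 variables&#xD;
x.gbm &amp;lt;- data.frame(V = matrix(rnorm(20 * 200), 200, 20))&#xD;
y.gbm &amp;lt;- as.numeric(factor(ifelse(x.gbm$V.1 &amp;gt;= 0, "A", "B"))) - 1&#xD;
&#xD;
fit &amp;lt;- gbm.fit(x.gbm, y.gbm, distribution = "bernoulli",&#xD;
               n.trees = 100, verbose = FALSE)&#xD;
&#xD;
## Relative influence of each variable, largest first&#xD;
influence &amp;lt;- relative.influence(fit, n.trees = 100, sort. = TRUE)&#xD;
print(head(influence, 5))&#xD;
&lt;/pre&gt;&#xD;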
&lt;p&gt;It is all good and very flexible, for sure, but I can’t really say it is better than the Boruta approach for these simple examples.&lt;/p&gt;&#xD;
&lt;p&gt;Jump to &lt;a href="http://www.cybaea.net/Blogs/Data/Feature-selection-Using-the-caret-package.html#h2_entry_comments"&gt;comments&lt;/a&gt;.&lt;/p&gt;&#xD;
&lt;div class="seealso"&gt;&#xD;
&lt;h1&gt;You may also like these posts:&lt;/h1&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-50.png" width="85" height="16" alt="[0.51]" title="[0.51]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Benchmarking-feature-selection-with-Boruta-and-caret.html" title="Feature selection is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering. For commercial data mining, which is often characterised by having too many variables for model building, this is an important step in the analysis process. And since we often work on very large data sets the performance of our process is very important to us. Having looked at feature selection using the Boruta package and feature selection using the caret package separately, we now consider the performance of the two approaches. Neither approach is suitable out of the box for the sizes of data sets that we normally work with."&gt;Benchmarking feature selection with Boruta and caret&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Feature selection is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering. For commercial data mining, which is often characterised by having too many variables for model building, this is an important step in the analysis process. And since we often work on very large data sets the performance of our process is very important to us. Having looked at feature selection using the Boruta package and feature selection using the caret package separately, we now consider the performance of the two approaches. Neither approach is suitable out of the box for the sizes of data sets that we normally work with.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.47]" title="[0.47]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Feature-selection-All_relevant-selection-with-the-Boruta-package.html" title="Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building. There are two main approaches to selecting the features (variables) we will use for the analysis: the minimal-optimal feature selection which identifies a small (ideally minimal) set of variables that gives the best possible classification result (for a class of classification models) and the all-relevant feature selection which identifies all variables that are in some circumstances relevant for the classification. In this article we take a first look at the problem of all-relevant feature selection using the Boruta package by Miron B. Kursa and Witold R. Rudnicki. This package is developed fo…"&gt;Feature selection: All-relevant selection with the Boruta package&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building. There are two main approaches to selecting the features (variables) we will use for the analysis: the minimal-optimal feature selection which identifies a small (ideally minimal) set of variables that gives the best possible classification result (for a class of classification models) and the all-relevant feature selection which identifies all variables that are in some circumstances relevant for the classification. In this article we take a first look at the problem of all-relevant feature selection using the Boruta package by Miron B. Kursa and Witold R. Rudnicki. This package is developed fo…&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.45]" title="[0.45]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Area-Plots-with-Intensity-Coloring.html" title="I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line which R does not support natively so we have to roll our own code for that. Unfortunately, lines(..., type=l) does not recycle the colour col= argument, so we end up with rather more loops than I thought would be necessary. We also get a nice opportunity to use the under-appreciated read.fwf function."&gt;Area Plots with Intensity Coloring&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line which R does not support natively so we have to roll our own code for that. Unfortunately, lines(..., type=l) does not recycle the colour col= argument, so we end up with rather more loops than I thought would be necessary. We also get a nice opportunity to use the under-appreciated read.fwf function.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-30.png" width="85" height="16" alt="[0.38]" title="[0.38]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-Eliminating-observed-values-with-zero-variance.html" title="I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform . In other words, I want to find the columns in a data frame that has zero variance. And as fast as possible, because my data sets are large, many, and changing fast. The final result surprised me a little."&gt;R: Eliminating observed values with zero variance&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform . In other words, I want to find the columns in a data frame that has zero variance. And as fast as possible…&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-20.png" width="85" height="16" alt="[0.38]" title="[0.38]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Employee-productivity-as-function-of-number-of-workers-revisited.html" title="We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis for the FTSE-100 constituent companies and find that the relation still holds four years later and across a continent."&gt;Employee productivity as function of number of workers revisited&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis …&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=O6IQ4h7grTk:tc7aeSALqaE:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=O6IQ4h7grTk:tc7aeSALqaE:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=O6IQ4h7grTk:tc7aeSALqaE:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=O6IQ4h7grTk:tc7aeSALqaE:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=O6IQ4h7grTk:tc7aeSALqaE:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=O6IQ4h7grTk:tc7aeSALqaE:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=O6IQ4h7grTk:tc7aeSALqaE:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=O6IQ4h7grTk:tc7aeSALqaE:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=O6IQ4h7grTk:tc7aeSALqaE:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/O6IQ4h7grTk" height="1" width="1"/&gt;</content><published>2010-11-16T19:35:00Z</published><updated>2010-11-18T06:58:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Feature-selection-Using-the-caret-package.html</feedburner:origLink></entry><entry><title type="text">Feature selection: All-relevant selection with the Boruta package</title><id>urn:uuid:72b78e0b-1552-5e4c-8305-a363cc446cea</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Feature-selection-All_relevant-selection-with-the-Boruta-package.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/0S81Gxhmv0s/Feature-selection-All_relevant-selection-with-the-Boruta-package.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><div class="floatRight">
  <a href="http://www.cybaea.net/Blogs/Data/Feature-selection-All_relevant-selection-with-the-Boruta-package.html" title="Click for full article">
    <img src="http://static.cybaea.net/images/feature-1.4.150.png" width="150" height="150" alt="[Variable importance example]" />
  </a>
</div>
<p>
Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building.  There are two main approaches to selecting the features (variables) we will use for the analysis: the <dfn>minimal-optimal feature selection</dfn> which identifies a small (ideally minimal) set of variables that gives the best possible classification result (for a class of classification models) and the <dfn>all-relevant feature selection</dfn> which identifies all variables that are in some circumstances relevant for the classification.
</p>
<p>
In this article we take a first look at the problem of all-relevant feature selection using the <a href="http://www.jstatsoft.org/v36/i11/">Boruta package</a> by Miron B. Kursa and Witold R. Rudnicki.  This package is developed for the <a href="http://www.r-project.org/">R statistical computing and analysis platform</a>.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building.  There are two main approaches to selecting the features (variables) we will use for the analysis: the &lt;dfn&gt;minimal-optimal feature selection&lt;/dfn&gt; which identifies a small (ideally minimal) set of variables that gives the best possible classification result (for a class of classification models) and the &lt;dfn&gt;all-relevant feature selection&lt;/dfn&gt; which identifies all variables that are in some circumstances relevant for the classification.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
In this article we take a first look at the problem of all-relevant feature selection using the &lt;a href="http://www.jstatsoft.org/v36/i11/"&gt;Boruta package&lt;/a&gt; by Miron B. Kursa and Witold R. Rudnicki.  This package is developed for the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt;.&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;Background&lt;/h2&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
All-relevant feature selection is extremely useful for commercial data miners.  We deploy it when we want to &lt;em&gt;understand&lt;/em&gt; the mechanisms behind the behaviour or subject of interest, rather than just building a black-box predictive model.  This understanding leads us to a better appreciation of our customers (or other subject under investigation) and not just how, but &lt;em&gt;why&lt;/em&gt; they behave as they do, which is useful for all areas of the business, including strategy and product development.  More narrowly, it also helps us define the variables that we want to observe, which is what will really make a difference in our ability to predict behaviour (as opposed to, say, running the data mining application a little longer).&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
I really like the theoretical approach that the Boruta package tries to implement.  It is based on the more general idea that by adding randomness to a system and then collecting results from random samples of the bigger system, one can actually reduce the misleading impact of randomness in the original sample.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
For the implementation, the Boruta package relies on a random forest classification algorithm.  This provides an intrinsic measure of the importance of each feature, known as the Z score.  While this score is not directly a statistical measure of the significance of the feature, we can compare it to random permutations of (a selection of) the variables to test if it is higher than the scores from random variables.  This is the essence of the implementation in Boruta.&#xD;
&lt;/p&gt;&#xD;
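&lt;p&gt;&#xD;
The comparison against permuted variables can be sketched in a few lines.  This is a simplified, single-pass illustration of the idea and not the Boruta algorithm itself, which repeats the comparison over many random forest runs and applies a statistical test.  It assumes the &lt;code&gt;randomForest&lt;/code&gt; package is installed; the names &lt;code&gt;shadow&lt;/code&gt;, &lt;code&gt;threshold&lt;/code&gt;, and &lt;code&gt;candidates&lt;/code&gt; are ours and purely illustrative.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;## Sketch only: one pass of the "compare with permuted copies" idea&#xD;
library("randomForest")&#xD;
set.seed(1)&#xD;
x &amp;lt;- data.frame(V = matrix(rnorm(20 * 200), 200, 20))&#xD;
y &amp;lt;- factor(ifelse(x$V.1 &amp;gt;= 0, "A", "B"))   # only V.1 matters&#xD;
&#xD;
## Permute each column independently to destroy any link with y&#xD;
shadow &amp;lt;- as.data.frame(lapply(x, sample))&#xD;
names(shadow) &amp;lt;- paste("shadow", names(x), sep = ".")&#xD;
&#xD;
## Fit one forest on real and permuted variables together&#xD;
rf &amp;lt;- randomForest(cbind(x, shadow), y, importance = TRUE)&#xD;
imp &amp;lt;- importance(rf, type = 1)[, 1]   # scaled mean decrease in accuracy&#xD;
&#xD;
## Keep the real variables that beat the best permuted variable&#xD;
threshold &amp;lt;- max(imp[names(shadow)])&#xD;
candidates &amp;lt;- names(x)[imp[names(x)] &amp;gt; threshold]&#xD;
print(candidates)&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
A single pass like this is noisy, which is why repeating it and testing the hits statistically, as Boruta does, is the more trustworthy procedure.&#xD;
&lt;/p&gt;&#xD;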
&#xD;
&lt;h2&gt;The tests&lt;/h2&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
This article is a first investigation into the performance of the Boruta package.  For this initial examination we will use a test data sample that we can control so we know what is important and what is not.  We will consider 200 observations of 20 normally distributed random variables:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;run.name &amp;lt;- "feature-1"&#xD;
library("Boruta")&#xD;
set.seed(1)&#xD;
## Set up artificial test data for our analysis&#xD;
n.var &amp;lt;- 20&#xD;
n.obs &amp;lt;- 200&#xD;
x &amp;lt;- data.frame(V=matrix(rnorm(n.var*n.obs), n.obs, n.var))&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The normal distribution has the advantage of simplicity, but it may not be the best test for commercial applications, where highly non-normally distributed features like money spent are important.  Nevertheless, we will use it for now and define a simple utility function before we get on to the tests:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;## Utility function to make plots of Boruta test results&#xD;
make.plots &amp;lt;- function(b, num,&#xD;
                       true.var = NA,&#xD;
                       main = paste("Boruta feature selection for test", num)) {&#xD;
    write.text &amp;lt;- function(b, true.var) {&#xD;
        if ( !is.na(true.var) ) {&#xD;
            text(1, max(attStats(b)$meanZ), pos = 4,&#xD;
                 labels = paste("True vars are V.1-V.",&#xD;
                     true.var, sep = ""))        &#xD;
        }&#xD;
    }&#xD;
    plot(b, main = main, las = 3, xlab = "")&#xD;
    write.text(b, true.var)&#xD;
    png(paste(run.name, num, "png", sep = "."), width = 8, height = 8,&#xD;
        units = "cm", res = 300, pointsize = 4)&#xD;
    plot(b, main = main, lwd = 0.5, las = 3, xlab = "")&#xD;
    write.text(b, true.var)&#xD;
    dev.off()&#xD;
}&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;h3&gt;Test 1: Simple test of single significant variable&lt;/h3&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
For a simple classification based on a single variable, Boruta performs well: while it identifies three variables as being potentially important, this does include the true variable (V.1) and the plot clearly shows it as being by far the most significant.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;## 1. Simple test of single variable&#xD;
y.1 &amp;lt;- factor( ifelse( x$V.1 &amp;gt;= 0, 'A', 'B' ) )&#xD;
&#xD;
b.1 &amp;lt;- Boruta(x, y.1, doTrace = 2)&#xD;
make.plots(b.1, 1)&#xD;
&lt;/pre&gt;&#xD;
&lt;div class="floatCenter" style="width: 400px"&gt;&#xD;
&lt;p&gt;&lt;a href="http://static.cybaea.net/images/feature-1.1.png"&gt;&lt;img src="http://static.cybaea.net/images/feature-1.1.400.png" width="400" height="400" alt="[Example 1]"&gt;&lt;/img&gt;&lt;/a&gt;&lt;/p&gt;&#xD;
&lt;p class="caption"&gt;Figure 1: Simple test of Boruta feature selection with single variable.&lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;h3&gt;Test 2: Simple test of linear combination of variables&lt;/h3&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
With a test of a linear combination of the first four variables where the weights are decreasing from 4 to 1, we begin to get closer to the limitations of the approach.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;## 2. Simple test of linear combination&#xD;
n.dep &amp;lt;- floor(n.var/5)&#xD;
print(n.dep)&#xD;
&#xD;
m &amp;lt;- diag(n.dep:1)&#xD;
&#xD;
y.2 &amp;lt;- ifelse( rowSums(as.matrix(x[, 1:n.dep]) %*% m) &amp;gt;= 0, "A", "B" )&#xD;
y.2 &amp;lt;- factor(y.2)&#xD;
&#xD;
b.2 &amp;lt;- Boruta(x, y.2, doTrace = 2)&#xD;
make.plots(b.2, 2, n.dep)&#xD;
&lt;/pre&gt;&#xD;
&lt;div class="floatCenter" style="width: 400px"&gt;&#xD;
&lt;p&gt;&lt;a href="http://static.cybaea.net/images/feature-1.2.png"&gt;&lt;img src="http://static.cybaea.net/images/feature-1.2.400.png" width="400" height="400" alt="[Example 2]"&gt;&lt;/img&gt;&lt;/a&gt;&lt;/p&gt;&#xD;
&lt;p class="caption"&gt;Figure 2: Simple test of Boruta feature selection with linear combination of four variables.&lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;p&gt;&#xD;
The implementation correctly identified the first three variables (with weights 4, 3, and 2, respectively) as being important, but it classed the fourth variable only as possibly important, along with the two random variables V.8 and V.9.  Still, six variables are more approachable than twenty.&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;h3&gt;Test 3: Simple test of less-linear combination of four variables&lt;/h3&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
For this test and the following we consider less obvious combinations of the first four variables.  If we just count how many of them are positive, then we get to a situation where Boruta excels (because random forests excel at this type of problem).&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="screen"&gt;## 3. Simple test of less-linear combination&#xD;
y.3 &amp;lt;- factor(rowSums(x[, 1:n.dep] &amp;gt;= 0))&#xD;
print(summary(y.3))&#xD;
b.3 &amp;lt;- Boruta(x, y.3, doTrace = 2)&#xD;
print(b.3)&#xD;
make.plots(b.3, 3, n.dep)&#xD;
&lt;/pre&gt;&#xD;
&lt;div class="floatCenter" style="width: 400px"&gt;&#xD;
&lt;p&gt;&lt;a href="http://static.cybaea.net/images/feature-1.3.png"&gt;&lt;img src="http://static.cybaea.net/images/feature-1.3.400.png" width="400" height="400" alt="[Example 3]"&gt;&lt;/img&gt;&lt;/a&gt;&lt;/p&gt;&#xD;
&lt;p class="caption"&gt;Figure 3: Simple test of Boruta feature selection counting the positives of four variables.&lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;h3&gt;Test 4: Simple test of non-linear combination&lt;/h3&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
For a spectacular fail of the Boruta approach we will have to consider a classification in the hyperplane of the four variables.  For this simple example, we simply count if there are an even or odd number of positive values among the first four variables:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;## 4. Simple test of non-linear combination&#xD;
y.4 &amp;lt;- factor(rowSums(x[, 1:n.dep] &amp;gt;= 0) %% 2)&#xD;
b.4 &amp;lt;- Boruta(x, y.4, doTrace = 2)&#xD;
print(b.4)&#xD;
make.plots(b.4, 4, n.dep)&#xD;
&lt;/pre&gt;&#xD;
&lt;div class="floatCenter" style="width: 400px"&gt;&#xD;
&lt;p&gt;&lt;a href="http://static.cybaea.net/images/feature-1.4.png"&gt;&lt;img src="http://static.cybaea.net/images/feature-1.4.400.png" width="400" height="400" alt="[Example 4]"&gt;&lt;/img&gt;&lt;/a&gt;&lt;/p&gt;&#xD;
&lt;p class="caption"&gt;Figure 4: Simple test of Boruta feature selection with non-linear combination of four variables.&lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;p&gt;&#xD;
Ouch.  The package rejects the four known significant variables.  It is too hard for the random forest approach.  Increasing the number of observations to 1,000 does not help, though at 5,000 observations Boruta correctly identifies the four variables.&#xD;
&lt;/p&gt;&#xD;
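&lt;p&gt;&#xD;
One way to see why this case is so hard: taken one at a time, the four true variables carry almost no information about an even/odd target, whereas in test 1 the sign of V.1 determines the class completely.  The following base R sketch regenerates the test data and compares the two; the names &lt;code&gt;positive&lt;/code&gt;, &lt;code&gt;marginal.1&lt;/code&gt;, and &lt;code&gt;marginal.4&lt;/code&gt; are illustrative, and the classes are coded as 0/1 here only so that we can take correlations.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;## Sketch: marginal signal of each true variable in test 1 versus test 4&#xD;
set.seed(1)&#xD;
n.obs &amp;lt;- 200&#xD;
x &amp;lt;- data.frame(V = matrix(rnorm(20 * n.obs), n.obs, 20))&#xD;
positive &amp;lt;- x[, 1:4] &amp;gt;= 0            # logical matrix of signs&#xD;
y.1 &amp;lt;- as.numeric(positive[, 1])     # test 1: sign of V.1&#xD;
y.4 &amp;lt;- rowSums(positive) %% 2        # test 4: parity of the count&#xD;
&#xD;
## Correlation between the sign of each variable and the class&#xD;
marginal.1 &amp;lt;- apply(positive, 2, function(s) cor(as.numeric(s), y.1))&#xD;
marginal.4 &amp;lt;- apply(positive, 2, function(s) cor(as.numeric(s), y.4))&#xD;
print(round(rbind(test.1 = marginal.1, test.4 = marginal.4), 2))&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
With no marginal signal to latch on to, individual tree splits on V.1 to V.4 look no better than splits on noise until there is enough data to resolve the four-way interaction.&#xD;
&lt;/p&gt;&#xD;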
&#xD;
&lt;h2&gt;Limitations&lt;/h2&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
Some limitations of the Boruta package are worth highlighting:&#xD;
&lt;/p&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;It only works with classification (factor) target variables.  I am not sure why: as far as I remember, the random forest algorithm also provides a variable significance score when it is used as a predictor, not just when it is run as a classifier.&lt;/li&gt;&#xD;
&lt;li&gt;It does not handle missing (&lt;code&gt;NA&lt;/code&gt;) values at all.  This is quite a problem when working with real data sets, and a shame as random forests are in principle very good at handling missing values.  A simple re-write of the package using the &lt;code&gt;party&lt;/code&gt; package instead of &lt;code&gt;randomForest&lt;/code&gt; should be able to fix this issue.&lt;/li&gt;&#xD;
&lt;li&gt;It does not seem to be completely stable.  I have crashed it on several real-world data sets and am working on a minimal set to send to the authors.&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;p&gt;&#xD;
But this is a really promising approach, if somewhat slow on large sets.  I will have a look at some real-world data in a future post.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;Jump to &lt;a href="http://www.cybaea.net/Blogs/Data/Feature-selection-All_relevant-selection-with-the-Boruta-package.html#h2_entry_comments"&gt;comments&lt;/a&gt;.&lt;/p&gt;&#xD;
&lt;div class="seealso"&gt;&#xD;
&lt;h1&gt;You may also like these posts:&lt;/h1&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-50.png" width="85" height="16" alt="[0.52]" title="[0.52]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Benchmarking-feature-selection-with-Boruta-and-caret.html" title="Feature selection is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering. For commercial data mining, which is often characterised by having too many variables for model building, this is an important step in the analysis process. And since we often work on very large data sets the performance of our process is very important to us. Having looked at feature selection using the Boruta package and feature selection using the caret package separately, we now consider the performance of the two approaches. Neither approach is suitable out of the box for the sizes of data sets that we normally work with."&gt;Benchmarking feature selection with Boruta and caret&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Feature selection is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering. For commercial data mining, which is often characterised by having too many variables for model building, this is an important step in the analysis process. And since we often work on very large data sets the performance of our process is very important to us. Having looked at feature selection using the Boruta package and feature selection using the caret package separately, we now consider the performance of the two approaches. Neither approach is suitable out of the box for the sizes of data sets that we normally work with.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.47]" title="[0.47]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Feature-selection-Using-the-caret-package.html" title="Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building. In a previous post we looked at all-relevant feature selection using the Boruta package while in this post we consider the same (artificial, toy) examples using the caret package. Max Kuhn kindly listed me as a contributor for some performance enhancements I submitted, but the genius behind the package is all his."&gt;Feature selection: Using the caret package&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building. In a previous post we looked at all-relevant feature selection using the Boruta package while in this post we consider the same (artificial, toy) examples using the caret package. Max Kuhn kindly listed me as a contributor for some performance enhancements I submitted, but the genius behind the package is all his.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-10.png" width="85" height="16" alt="[0.36]" title="[0.36]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-Eliminating-observed-values-with-zero-variance.html" title="I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform. In other words, I want to find the columns in a data frame that have zero variance. And as fast as possible, because my data sets are large, many, and changing fast. The final result surprised me a little."&gt;R: Eliminating observed values with zero variance&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform. In other words, I want to find the columns in a data frame that have zero variance. And as fast as possible…&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-10.png" width="85" height="16" alt="[0.35]" title="[0.35]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Area-Plots-with-Intensity-Coloring.html" title="I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line which R does not support natively so we have to roll our own code for that. Unfortunately, lines(..., type=l) does not recycle the colour col= argument, so we end up with rather more loops than I thought would be necessary. We also get a nice opportunity to use the under-appreciated read.fwf function."&gt;Area Plots with Intensity Coloring&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line which R does not …&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=0S81Gxhmv0s:xZQ8PYNwtsQ:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=0S81Gxhmv0s:xZQ8PYNwtsQ:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=0S81Gxhmv0s:xZQ8PYNwtsQ:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=0S81Gxhmv0s:xZQ8PYNwtsQ:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=0S81Gxhmv0s:xZQ8PYNwtsQ:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=0S81Gxhmv0s:xZQ8PYNwtsQ:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=0S81Gxhmv0s:xZQ8PYNwtsQ:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=0S81Gxhmv0s:xZQ8PYNwtsQ:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=0S81Gxhmv0s:xZQ8PYNwtsQ:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/0S81Gxhmv0s" height="1" width="1"/&gt;</content><published>2010-11-15T10:04:00Z</published><updated>2010-11-16T19:10:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Feature-selection-All_relevant-selection-with-the-Boruta-package.html</feedburner:origLink></entry><entry><title type="text">Big data for R</title><id>urn:uuid:04001d8b-1947-56b3-86a5-265707a84aa9</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Big-data-for-R.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/pegHIMxElX0/Big-data-for-R.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
Revolution Analytics recently <a href="http://blog.revolutionanalytics.com/2010/08/announcing-big-data-for-revolution-r.html">announced</a> their "big data" solution for R.  This is great news and a lovely piece of work by the team at Revolutions.
</p>
<p>
However, if you want to replicate their analysis in standard <a href="http://www.r-project.org/">R</a>, then you can absolutely do so and we show you how.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
Revolution Analytics recently &lt;a href="http://blog.revolutionanalytics.com/2010/08/announcing-big-data-for-revolution-r.html"&gt;announced&lt;/a&gt; their "big data" solution for R.  This is great news and a lovely piece of work by the team at Revolutions.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
However, if you want to replicate their analysis in standard &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;, then you can absolutely do so and we show you how.&#xD;
&lt;/p&gt;&#xD;
&lt;h2&gt;Data preparation&lt;/h2&gt;&#xD;
&lt;p&gt;&#xD;
First you need to prepare the rather large data set that they use in the Revolutions white paper.  The preparation script shown below makes two passes over all the files, which is not needed: changing it to a single pass is left as an exercise for the reader.  Note that the following script will take a while to run and will need some 30-odd gig of free disk space (another exercise: get rid of the airlines.csv file), but once it is done the analysis is fast.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="big.R"&gt;&#xD;
#!/usr/bin/Rscript&#xD;
## big.R - Preprocess the airline data&#xD;
## Copyright © 2010 Allan Engelhardt (&lt;a href="http://www.cybaea.net/"&gt;http://www.cybaea.net/&lt;/a&gt;)&#xD;
&#xD;
## Install the packages we will use&#xD;
install.packages("bigmemory",&#xD;
                 dependencies = c("Depends", "Suggests", "Enhances"))&#xD;
&#xD;
## Data sets are downloaded from the Data Expo '09 web site at&#xD;
## http://stat-computing.org/dataexpo/2009/the-data.html&#xD;
for (year in 1987:2008) {&#xD;
    file.name &amp;lt;- paste(year, "csv.bz2", sep = ".")&#xD;
    if ( !file.exists(file.name) ) {&#xD;
        url.text &amp;lt;- paste("http://stat-computing.org/dataexpo/2009/",&#xD;
                          year, ".csv.bz2", sep = "")&#xD;
        cat("Downloading missing data file ", file.name, "\n", sep = "")&#xD;
        download.file(url.text, file.name)&#xD;
    }&#xD;
}&#xD;
&#xD;
## Read sample file to get column names and types&#xD;
d &amp;lt;- read.csv("2008.csv.bz2", stringsAsFactors = TRUE)  # no longer the default from R 4.0.0&#xD;
integer.columns &amp;lt;- sapply(d, is.integer)&#xD;
factor.columns  &amp;lt;- sapply(d, is.factor)&#xD;
factor.levels   &amp;lt;- lapply(d[, factor.columns], levels)&#xD;
n.rows &amp;lt;- 0L&#xD;
&#xD;
## Process each file determining the factor levels&#xD;
## TODO: Combine with next loop&#xD;
for (year in 1987:2008) {&#xD;
    file.name &amp;lt;- paste(year, "csv.bz2", sep = ".")&#xD;
    cat("Processing ", file.name, "\n", sep = "")&#xD;
    d &amp;lt;- read.csv(file.name, stringsAsFactors = TRUE)&#xD;
    n.rows &amp;lt;- n.rows + NROW(d)&#xD;
    new.levels &amp;lt;- lapply(d[, factor.columns], levels)&#xD;
    for ( i in seq(1, length(factor.levels)) ) {&#xD;
        ## unique() stops the level vectors growing with duplicates each year&#xD;
        factor.levels[[i]] &amp;lt;- unique(c(factor.levels[[i]], new.levels[[i]]))&#xD;
    }&#xD;
    rm(d)&#xD;
}&#xD;
save(integer.columns, factor.columns, factor.levels, file = "factors.RData")&#xD;
&#xD;
## Now convert all factors to integers so we can create a bigmatrix of the data&#xD;
col.classes &amp;lt;- rep("integer", length(integer.columns))&#xD;
col.classes[factor.columns] &amp;lt;- "character"&#xD;
cols  &amp;lt;- which(factor.columns)&#xD;
first &amp;lt;- TRUE&#xD;
csv.file &amp;lt;- "airlines.csv"   # Write combined integer-only data to this file&#xD;
csv.con  &amp;lt;- file(csv.file, open = "w")&#xD;
&#xD;
for (year in 1987:2008) {&#xD;
    file.name &amp;lt;- paste(year, "csv.bz2", sep = ".")&#xD;
    cat("Processing ", file.name, "\n", sep = "")&#xD;
    d &amp;lt;- read.csv(file.name, colClasses = col.classes)&#xD;
    ## Convert the strings to integers&#xD;
    for ( i in seq(1, length(factor.levels)) ) {&#xD;
        col &amp;lt;- cols[i]&#xD;
        d[, col] &amp;lt;- match(d[, col], factor.levels[[i]])&#xD;
    }&#xD;
    write.table(d, file = csv.con, sep = ",", &#xD;
                row.names = FALSE, col.names = first)&#xD;
    first &amp;lt;- FALSE&#xD;
}&#xD;
close(csv.con)&#xD;
&#xD;
## Now convert to a big.matrix&#xD;
library("bigmemory")&#xD;
backing.file    &amp;lt;- "airlines.bin"&#xD;
descriptor.file &amp;lt;- "airlines.des"&#xD;
data &amp;lt;- read.big.matrix(csv.file, header = TRUE,&#xD;
                        type = "integer",&#xD;
                        backingfile = backing.file,&#xD;
                        descriptorfile = descriptor.file,&#xD;
                        extraCols = c("age"))&#xD;
&lt;/pre&gt;&#xD;
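&lt;p&gt;&#xD;
The core trick in the script is to collect every string level seen across the files and then replace each string with its position in that list, so that the whole table becomes integer-only.  A toy illustration in base R, using made-up carrier codes and the hypothetical names &lt;code&gt;chunks&lt;/code&gt;, &lt;code&gt;levels.seen&lt;/code&gt;, and &lt;code&gt;codes&lt;/code&gt; in place of the yearly files:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;## Toy stand-in for one character column spread over two yearly files&#xD;
chunks &amp;lt;- list(c("AA", "UA", "AA"), c("DL", "UA"))&#xD;
&#xD;
## Pass 1: collect the distinct levels in order of first appearance&#xD;
levels.seen &amp;lt;- character(0)&#xD;
for (chunk in chunks) {&#xD;
    levels.seen &amp;lt;- unique(c(levels.seen, chunk))&#xD;
}&#xD;
&#xD;
## Pass 2: replace each string by its integer position in levels.seen&#xD;
codes &amp;lt;- lapply(chunks, match, table = levels.seen)&#xD;
print(codes)&#xD;
&#xD;
## The mapping is reversible as long as levels.seen is kept&#xD;
print(levels.seen[codes[[2]]])&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
This is why the script saves the level lists to factors.RData: without them the integer codes in the big matrix cannot be translated back.&#xD;
&lt;/p&gt;&#xD;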
&lt;h2&gt;Sample analysis&lt;/h2&gt;&#xD;
&lt;p&gt;&#xD;
All done now.  Sample analysis:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&#xD;
## bigScale.R - Replicate the analysis from &lt;a href="http://bit.ly/aTFXeN"&gt;http://bit.ly/aTFXeN&lt;/a&gt; with normal R&#xD;
##   http://info.revolutionanalytics.com/bigdata.html&#xD;
## See big.R for the preprocessing of the data&#xD;
&#xD;
## Load required libraries&#xD;
library("biglm")&#xD;
library("bigmemory")&#xD;
library("biganalytics")&#xD;
library("bigtabulate")&#xD;
&#xD;
## Use parallel processing if available&#xD;
## (Multicore is for "anything-but-Windows" platforms)&#xD;
if ( require("multicore") ) {&#xD;
    library("doMC")&#xD;
    registerDoMC()&#xD;
} else {&#xD;
    warning("Consider registering a multi-core 'foreach' processor.")&#xD;
}&#xD;
&#xD;
day.names &amp;lt;- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday",&#xD;
               "Saturday", "Sunday")&#xD;
&#xD;
## Attach to the data&#xD;
descriptor.file &amp;lt;- "airlines.des"&#xD;
data &amp;lt;- attach.big.matrix(dget(descriptor.file))&#xD;
&#xD;
## Replicate Table 5 in the Revolutions document:&#xD;
## Table 5&#xD;
t.5 &amp;lt;- bigtabulate(data,&#xD;
                   ccols = "DayOfWeek",&#xD;
                   summary.cols = "ArrDelay", summary.na.rm = TRUE)&#xD;
## Prettify the output&#xD;
stat.names &amp;lt;- dimnames(t.5$summary[[1]])[2][[1]]&#xD;
t.5.p &amp;lt;- cbind(matrix(unlist(t.5$summary), byrow = TRUE,&#xD;
                      nrow = length(t.5$summary),&#xD;
                      ncol = length(stat.names),&#xD;
                      dimnames = list(day.names, stat.names)),&#xD;
               ValidObs = t.5$table)&#xD;
print(t.5.p)&#xD;
#             min  max     mean       sd    NAs ValidObs&#xD;
# Monday    -1410 1879 6.669515 30.17812 385262 18136111&#xD;
# Tuesday   -1426 2137 5.960421 29.06076 417965 18061938&#xD;
# Wednesday -1405 2598 7.091502 30.37856 405286 18103222&#xD;
# Thursday  -1395 2453 8.945047 32.30101 400077 18083800&#xD;
# Friday    -1437 1808 9.606953 33.07271 384009 18091338&#xD;
# Saturday  -1280 1942 4.187419 28.29972 298328 15915382&#xD;
# Sunday    -1295 2461 6.525040 31.11353 296602 17143178&#xD;
&#xD;
## Figure 1&#xD;
plot(t.5.p[, "mean"], type = "l", ylab="Average arrival delay")&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Just like the Revolutions paper.  You can now use &lt;code&gt;biglm.big.matrix&lt;/code&gt; and &lt;code&gt;bigglm.big.matrix&lt;/code&gt; for basic regression and there are also k-means clustering and other functions.&#xD;
&lt;/p&gt;&#xD;
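&lt;p&gt;&#xD;
The idea behind these regression functions is that a linear model can be fitted one chunk of rows at a time, so the full data set never has to sit in memory.  As a small sketch of that idea (using the plain &lt;code&gt;biglm&lt;/code&gt; and &lt;code&gt;update&lt;/code&gt; functions from the &lt;code&gt;biglm&lt;/code&gt; package on the built-in &lt;code&gt;mtcars&lt;/code&gt; data, not the airline data or the big.matrix wrappers), we can check that a chunked fit agrees with an ordinary &lt;code&gt;lm&lt;/code&gt;:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;## Sketch: chunked regression with biglm on a small built-in data set&#xD;
library("biglm")&#xD;
&#xD;
## Pretend the data arrives in two chunks&#xD;
chunks &amp;lt;- split(mtcars, rep(1:2, length.out = nrow(mtcars)))&#xD;
&#xD;
fit &amp;lt;- biglm(mpg ~ wt + hp, data = chunks[[1]])   # fit on the first chunk&#xD;
fit &amp;lt;- update(fit, chunks[[2]])                   # fold in the second chunk&#xD;
&#xD;
## Compare with the all-in-memory fit&#xD;
full &amp;lt;- lm(mpg ~ wt + hp, data = mtcars)&#xD;
print(rbind(chunked = coef(fit), full = coef(full)))&#xD;
&lt;/pre&gt;&#xD;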
&lt;p&gt;&#xD;
I must admit here that I do not understand the Revolutions regression example, so I have not attempted to replicate it here.  It seems kind of sad if they change the syntax to be incompatible with standard R formulas, which is what appears to be happening.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Credit to Michael Kane and Jay Emerson of Yale who showed much of this in their poster &lt;a href="http://stat-computing.org/dataexpo/2009/posters/kane-emerson.pdf"&gt;The Airline Data Set... What's the big deal?&lt;/a&gt;.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;Jump to &lt;a href="http://www.cybaea.net/Blogs/Data/Big-data-for-R.html#h2_entry_comments"&gt;comments&lt;/a&gt;.&lt;/p&gt;&#xD;
&lt;div class="seealso"&gt;&#xD;
&lt;h1&gt;You may also like these posts:&lt;/h1&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.45]" title="[0.45]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Employee-productivity-as-function-of-number-of-workers-revisited.html" title="We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis for the FTSE-100 constituent companies and find that the relation still holds four years later and across a continent."&gt;Employee productivity as function of number of workers revisited&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis for the FTSE-100 constituent companies and find that the relation still holds four years later and across a continent.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.45]" title="[0.45]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Getting-started-with-HHP.html" title="The US$ 3 million Heritage Health Prize competition is on so we take a look at how to get started using the R statistical computing and analysis platform."&gt;Getting started with the Heritage Health Prize competition&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;The US$ 3 million Heritage Health Prize competition is on so we take a look at how to get started using the R statistical computing and analysis platform.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-30.png" width="85" height="16" alt="[0.39]" title="[0.39]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Benchmarking-feature-selection-with-Boruta-and-caret.html" title="Feature selection is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering. For commercial data mining, which is often characterised by having too many variables for model building, this is an important step in the analysis process. And since we often work on very large data sets the performance of our process is very important to us. Having looked at feature selection using the Boruta package and feature selection using the caret package separately, we now consider the performance of the two approaches. Neither approach is suitable out of the box for the sizes of data sets that we normally work with."&gt;Benchmarking feature selection with Boruta and caret&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Feature selection is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering. For commercial data mining, which is often characterised by having too many variables for model building, …&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-30.png" width="85" height="16" alt="[0.39]" title="[0.39]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/A-warning-on-the-R-save-format.html" title="The save() function in the R platform for statistical computing is very convenient and I suspect many of us use it a lot. But I was recently bitten by a “feature” of the format which meant I could not recover my data. I recommend that you save data in a data format (e.g. CSV or CDF), not using the save() function which is really for objects (data and code). What is your approach?"&gt;A warning on the R save format&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;The save() function in the R platform for statistical computing is very convenient and I suspect many of us use it a lot. But I was recently bitten by a “feature” of the format which meant I could not recover my data. I recommend that you save data in a d…&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-00.png" width="85" height="16" alt="[0.34]" title="[0.34]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-Eliminating-observed-values-with-zero-variance.html" title="I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform . In other words, I want to find the columns in a data frame that has zero variance. And as fast as possible, because my data sets are large, many, and changing fast. The final result surprised me a little."&gt;R: Eliminating observed values with zero variance&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-00.png" width="85" height="16" alt="[0.34]" title="[0.34]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Area-Plots-with-Intensity-Coloring.html" title="I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line which R does not support natively so we have to roll our own code for that. Unfortunately, lines(..., type=l) does not recycle the colour col= argument, so we end up with rather more loops than I thought would be necessary. We also get a nice opportunity to use the under-appreciated read.fwf function."&gt;Area Plots with Intensity Coloring&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/pegHIMxElX0" height="1" width="1"/&gt;</content><published>2010-08-05T08:22:00Z</published><updated>2010-08-05T08:22:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Big-data-for-R.html</feedburner:origLink></entry><entry><title type="text">Area Plots with Intensity Coloring</title><id>urn:uuid:6b83e364-13a9-58b5-9f83-ec94683bf592</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Area-Plots-with-Intensity-Coloring.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/6pq1Dbge-y0/Area-Plots-with-Intensity-Coloring.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><div class="floatRight">
  <a href="http://www.cybaea.net/Blogs/Data/Area-Plots-with-Intensity-Coloring.html" title="Click to read full article">
    <img src="http://static.cybaea.net/images/nino-150.png" width="150" height="150" alt="[Graphics output]" />
  </a>
</div>
<p>I am not sure apeescape’s <a href="http://probabilitynotes.wordpress.com/2010/07/10/area-plots-with-intensity-coloring-el-nino-sst-anomalies-w-ggplot2/">ggplot2 area plot with intensity colouring</a> is really the best way of presenting the information, but it had me intrigued enough to replicate it using base <a href="http://www.r-project.org/">R</a> graphics.</p>

<p>The key technique is to draw a gradient line which R does not support natively so we have to roll our own code for that.  Unfortunately, <code>lines(..., type="l")</code> does not recycle the colour <code>col=</code> argument, so we end up with rather more loops than I thought would be necessary.</p>

<p>We also get a nice opportunity to use the under-appreciated <code>read.fwf</code> function.</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;I am not sure apeescape’s &lt;a href="http://probabilitynotes.wordpress.com/2010/07/10/area-plots-with-intensity-coloring-el-nino-sst-anomalies-w-ggplot2/"&gt;ggplot2 area plot with intensity colouring&lt;/a&gt; is really the best way of presenting the information, but it had me intrigued enough to replicate it using base &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt; graphics.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;The key technique is to draw a gradient line which R does not support natively so we have to roll our own code for that.  Unfortunately, &lt;code&gt;lines(..., type="l")&lt;/code&gt; does not recycle the colour &lt;code&gt;col=&lt;/code&gt; argument, so we end up with rather more loops than I thought would be necessary.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;(The answer is not to use &lt;code&gt;lines(..., type="h")&lt;/code&gt; which, confusingly, &lt;em&gt;does&lt;/em&gt; recycle the colour &lt;code&gt;col=&lt;/code&gt; argument.  This one had me fooled for a while, but the &lt;code&gt;type="h"&lt;/code&gt; lines always start from zero, so you do not get the gradient effect.)&lt;/p&gt;&#xD;
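&#xD;
&lt;p&gt;As a minimal sketch of the colour-recycling point (an alternative, not what the script in this post does): base R's &lt;code&gt;segments()&lt;/code&gt; accepts one colour per segment, so a single vertical gradient line can be drawn without an explicit loop.  The names &lt;code&gt;gradient.line&lt;/code&gt; and &lt;code&gt;n&lt;/code&gt; are made up for illustration, and the null PDF device is only there so that the example runs without a display.&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;## Sketch: one vertical gradient line from 0 to y using vectorised segments()&#xD;
gradient.line &amp;lt;- function(x, y, cols) {&#xD;
    n  &amp;lt;- length(cols)&#xD;
    ys &amp;lt;- seq(0, y, length.out = n + 1)      # n + 1 break points give n pieces&#xD;
    segments(x0 = x, y0 = ys[-(n + 1)], x1 = x, y1 = ys[-1], col = cols)&#xD;
    invisible(ys)&#xD;
}&#xD;
&#xD;
pdf(NULL)                                     # draw to a null device&#xD;
plot(c(0, 2), c(0, 1), type = "n", xlab = "", ylab = "")&#xD;
cols &amp;lt;- hsv(0, seq(0, 1, length.out = 10), 1) # white-to-red gradient&#xD;
ys &amp;lt;- gradient.line(1, 0.8, cols)&#xD;
dev.off()&#xD;
&lt;/pre&gt;&#xD;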
&#xD;
&lt;p&gt;We also get a nice opportunity to use the under-appreciated &lt;code&gt;read.fwf&lt;/code&gt; function.&lt;/p&gt;&#xD;
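&#xD;
&lt;p&gt;A minimal sketch of how &lt;code&gt;read.fwf&lt;/code&gt; treats its &lt;code&gt;widths&lt;/code&gt; argument: a positive width reads that many characters into a column and a negative width skips that many.  The two sample lines and their layout are invented for illustration and are not the NOAA file format; &lt;code&gt;sample.lines&lt;/code&gt; and &lt;code&gt;demo&lt;/code&gt; are hypothetical names.&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;## Sketch: negative widths tell read.fwf to skip that many characters&#xD;
sample.lines &amp;lt;- c(" 03JAN1990  23.4-0.4",&#xD;
                  " 10JAN1990  23.4-0.8")     # made-up lines, not real NOAA data&#xD;
tf &amp;lt;- tempfile()&#xD;
writeLines(sample.lines, tf)&#xD;
demo &amp;lt;- read.fwf(tf,&#xD;
                 widths = c(-1, 9, -2, 4, 4), # skip 1, read 9, skip 2, read 4 and 4&#xD;
                 col.names = c("Week", "SST", "SSTA"))&#xD;
unlink(tf)&#xD;
str(demo)&#xD;
&lt;/pre&gt;&#xD;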
&#xD;
&lt;pre class="document"&gt;##!/usr/bin/Rscript&#xD;
## nino.R - another version of &lt;a href="http://bit.ly/9P9Gh1"&gt;http://bit.ly/9P9Gh1&lt;/a&gt;&#xD;
## Copyright © 2010 Allan Engelhardt (&lt;a href="http://www.cybaea.net/"&gt;http://www.cybaea.net/&lt;/a&gt;)&#xD;
&#xD;
## Get the data from the NOAA server&#xD;
nino &amp;lt;- read.fwf("&lt;a href="http://www.cpc.noaa.gov/data/indices/wksst.for"&gt;http://www.cpc.noaa.gov/data/indices/wksst.for&lt;/a&gt;",&#xD;
                 widths=c(-1, 9, rep(c(-5, 4, 4), 4)),&#xD;
                 skip=4,&#xD;
                 col.names=c("Week",&#xD;
                     paste(rep(c("Nino12","Nino3","Nino34","Nino4"), rep(2, 4)),&#xD;
                           c("SST", "SSTA"), sep=".")))&#xD;
&#xD;
## Make the date column something useful&#xD;
nino$Week &amp;lt;- as.Date(nino$Week, format="%d%b%Y")&#xD;
&#xD;
## Make colour gradients&#xD;
ncol &amp;lt;- 50&#xD;
grad.neg &amp;lt;- hsv(4/6, seq(0, 1, length.out=ncol), 1) # Blue gradient&#xD;
grad.pos &amp;lt;- hsv(  0, seq(0, 1, length.out=ncol), 1) # Red gradient&#xD;
&#xD;
## Make plot&#xD;
plot(Nino34.SSTA ~ Week, data=nino, type="n",&#xD;
     main="Nino34", xlab="Date", ylab="SSTA", axes=FALSE)&#xD;
do.call(function (...) rect(..., col="gray85", border=NA),&#xD;
        as.list(par("usr")[c(1, 3, 2, 4)]))&#xD;
&#xD;
y &amp;lt;- nino$Nino34.SSTA                   # The values we will plot&#xD;
x &amp;lt;- nino$Week&#xD;
&#xD;
axis.Date(1, x=x, tck=1, col="white")&#xD;
axis(2, tck=1, col="white")&#xD;
box()&#xD;
&#xD;
idx &amp;lt;- integer(NROW(nino))&#xD;
idx[y &amp;gt;= 0] &amp;lt;- 1 + round( y[y &amp;gt;= 0] * (ncol - 1) / max( y[y &amp;gt;= 0]), 0)&#xD;
idx[y &amp;lt;  0] &amp;lt;- 1 + round(-y[y &amp;lt;  0] * (ncol - 1) / max(-y[y &amp;lt;  0]), 0)&#xD;
&#xD;
draw.gradient &amp;lt;- function(x, ys, cols) {&#xD;
    xs &amp;lt;- rep(x, 2)&#xD;
    for (i in seq(1, length(ys)-1))&#xD;
        plot.xy(list(x=xs, y=c(ys[i], ys[i+1])), type="l", col=cols[i])&#xD;
}&#xD;
&#xD;
for (i in seq_along(x)) {&#xD;
    ys &amp;lt;- seq(0, y[i], length.out=idx[i]+1)&#xD;
    cols &amp;lt;- (if (y[i] &amp;gt;=0) grad.pos else grad.neg)&#xD;
    draw.gradient(x[i], ys, cols)&#xD;
}&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;The result is a decent gradient:&lt;/p&gt;&#xD;
&lt;div class="floatCenter" style="width: 400px"&gt;&#xD;
&lt;a href="http://static.cybaea.net/images/nino-800.png" title="Click for larger version"&gt;&lt;img src="http://static.cybaea.net/images/nino-400.png" width="400" height="400" alt="[Graphics output]"&gt;&lt;/img&gt;&lt;/a&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;p&gt;&#xD;
I deliberately omitted the scale legend on the right hand side following Allan’s First Law of Happy Graphics: Thou shalt not present the same information twice.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
For less dense information, you should increase the line width.  That is left to the reader. (Hint: it is hard to get just right in base graphics, but &lt;code&gt;lwd &amp;lt;- ceiling(par("pin")[1] / dev.size("in")[1] * dev.size("px")[1] / length(x))&lt;/code&gt; could be a starting point for an approximation. We really need gradient-filled polygons in base R.)&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/6pq1Dbge-y0" height="1" width="1"/&gt;</content><published>2010-07-13T07:47:00Z</published><updated>2010-07-13T07:47:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Area-Plots-with-Intensity-Coloring.html</feedburner:origLink></entry><entry><title type="text">Employee productivity as function of number of workers revisited</title><id>urn:uuid:cee42e41-ea6c-5ee6-a0b5-4e4644168052</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Employee-productivity-as-function-of-number-of-workers-revisited.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/mWb7tx4LS94/Employee-productivity-as-function-of-number-of-workers-revisited.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
<div class="floatRight"><a href="http://www.cybaea.net/Blogs/Data/Employee-productivity-as-function-of-number-of-workers-revisited.html" title="Click to read full article"><img width="150" height="150" src="http://static.cybaea.net/images/ftse100-150.png" alt="[Results of analysis shown in graph]" /></a></div>We have a mild obsession with employee productivity and how that declines as companies get bigger.  We have previously found that <a href="http://www.cybaea.net/Blogs/Journal/employee_productivity.html">when you treble the number of workers, you halve their individual productivity</a> which is mildly scary.
</p>
<p>
We revisit the analysis for the FTSE-100 constituent companies and find that the relation still holds four years later and across a continent.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;We have a mild obsession with employee productivity and how that declines as companies get bigger.  We have previously found that &lt;a href="http://www.cybaea.net/Blogs/Journal/employee_productivity.html"&gt;when you treble the number of workers, you halve their individual productivity&lt;/a&gt; which is mildly scary.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;Let’s try the FTSE-100 index of leading UK companies to see if they are significantly different from the S&amp;amp;P 500 leading American companies that &lt;a href="http://www.cybaea.net/Blogs/Journal/employee_productivity.html"&gt;we analyzed four years ago&lt;/a&gt;.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;We will of course use the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt; for our analysis, and once again we are grateful to &lt;a href="http://uk.finance.yahoo.com/"&gt;Yahoo Finance&lt;/a&gt; for providing the data.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;The analysis script is available as &lt;a href="http://static.cybaea.net/files/ftse100.R"&gt;ftse100.R&lt;/a&gt; and is really simple:&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;## ftse100.R - Display employee productivity for FTSE-100 constituents&#xD;
## Copyright © 2010 Allan Engelhardt &amp;lt;http://www.cybaea.net/&amp;gt;&#xD;
## All Rights Reserved.&#xD;
&#xD;
## Get the index constituents.&#xD;
ftse.100 &amp;lt;- read.csv(file = "http://uk.old.finance.yahoo.com/d/quotes.csv?s=@%5EFTSE&amp;amp;f=s&amp;amp;e=.csv", header = FALSE)&#xD;
names(ftse.100) &amp;lt;- c("symbol")&#xD;
data &amp;lt;- data.frame(symbol=NULL, employees=NULL, profit=NULL, sector=NULL)&#xD;
&#xD;
## For each stock symbol, get employees, profit, and sector&#xD;
for (symbol in ftse.100$symbol) {&#xD;
    profile.url &amp;lt;- paste("http://uk.finance.yahoo.com/q/pr?s=", symbol, sep="")&#xD;
    con &amp;lt;- url(profile.url, open = "r")&#xD;
    text &amp;lt;- readChar(con, 2^24)     # enough bytes&#xD;
    close(con)&#xD;
    x &amp;lt;- sub('.*Number of employees:&amp;lt;/td&amp;gt;&amp;lt;td.*?&amp;gt;[[:space:]]*([[:digit:],]+).*', "\\1", text, ignore.case = TRUE)&#xD;
    x &amp;lt;- gsub(',', '', x)&#xD;
    empl &amp;lt;- tryCatch(as.integer(x), warning = function(x) NA)&#xD;
    x &amp;lt;- sub('.*Net Profit.*?&amp;lt;/td&amp;gt;&amp;lt;td.*?&amp;gt;[[:space:]]*([+-]?[[:digit:],]+).*', '\\1', text)&#xD;
    x &amp;lt;- gsub(',', '', x)&#xD;
    profit &amp;lt;- tryCatch(as.integer(x)*1e6, warning = function(x) NA)&#xD;
    sector &amp;lt;- sub('.*Sector:&amp;lt;/td&amp;gt;&amp;lt;td.*?&amp;gt;(.*?)&amp;lt;/td&amp;gt;.*', '\\1', text)&#xD;
    if (any(c(empl, profit) &amp;lt;= 0, is.na(c(empl, profit)))) {&#xD;
        cat("Error parsing symbol", symbol, "see", profile.url, "\n")&#xD;
    } else {&#xD;
        data &amp;lt;- rbind(data, data.frame(symbol=symbol, employees=empl, profit=profit, sector=sector))&#xD;
    }&#xD;
    Sys.sleep(1)&#xD;
}&#xD;
&#xD;
## Save the data so we don't have to hit Yahoo all the time.&#xD;
save(data, file = "data.RData")&#xD;
&#xD;
## Save plot to file:&#xD;
#png(filename="ftse100.png", width=800, height=800, pointsize=14, bg="white", res=100)&#xD;
&#xD;
opar &amp;lt;- par(cex.sub = sqrt(sqrt(2)), font.sub = 3, font.lab = 2)&#xD;
&#xD;
## x and y coordinates of plot and plot limits&#xD;
x &amp;lt;- with(data, employees)&#xD;
y &amp;lt;- with(data, profit/employees)&#xD;
xlim &amp;lt;- c(10^floor(log10(min(x))), 10^ceiling(log10(max(x))))&#xD;
ylim &amp;lt;- c(10^floor(log10(min(y))), 10^ceiling(log10(max(y))))&#xD;
&#xD;
## Set up to display different color and symbols&#xD;
plot_col &amp;lt;- 1&#xD;
plot_pch &amp;lt;- 1&#xD;
markers &amp;lt;- 21:25&#xD;
pchs &amp;lt;- rep(markers, ceiling(length(levels(data$sector))/length(markers)))&#xD;
palette(rainbow(length(levels(data$sector)), start=3/6, end=6/6))&#xD;
&#xD;
# Make empty plot:&#xD;
plot.new()&#xD;
plot(profit/employees ~ employees, data = data[FALSE, ], &#xD;
     type = "p", pch = pchs[plot_pch], col = plot_col,&#xD;
     log="xy", xaxp = c(xlim, 1), yaxp = c(ylim, 1), xlim = xlim, ylim = ylim,&#xD;
     main = "Profit per employee (FTSE 100)", xlab = "Employees", ylab = "Profit per employee (GBP)")&#xD;
&#xD;
## Plot each sector&#xD;
for (sector in levels(data$sector)) {&#xD;
    plot.xy(xy.coords(with(data[data$sector == sector,], employees),&#xD;
                      with(data[data$sector == sector,], profit/employees),&#xD;
                      log = "xy", xlab = "", ylab = ""),&#xD;
            type = "p", pch = pchs[plot_pch], col = plot_col, bg = plot_col)&#xD;
    plot_pch &amp;lt;- plot_pch + 1&#xD;
    plot_col &amp;lt;- plot_col + 1&#xD;
}&#xD;
legend(x = "bottomleft", legend = levels(data$sector), title = "Industry Sectors", &#xD;
       col = palette(), pt.bg = palette(), pch = pchs, cex = 2/3, pt.cex = 1, ncol = 2)&#xD;
&#xD;
## Fit a linear model to the log-log data:&#xD;
m &amp;lt;- lm(log10(y) ~ log10(x))&#xD;
xl &amp;lt;- c(xlim[1]*5, xlim[2]/5)&#xD;
yl &amp;lt;- 10^predict(m, data.frame(x = xl))&#xD;
lines(xl, yl, col = "darkred", lty = "dashed", lwd = 2)&#xD;
t &amp;lt;- sprintf("Power = %0.3g", m$coefficients[2])&#xD;
text(xl[2], yl[2], t, adj = c(0.25, -1.5), col = "darkred", font = 2)&#xD;
&#xD;
## All done.&#xD;
par(opar)&#xD;
dev.off()&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;Leave it to run and this is what you get:&lt;/p&gt;&#xD;
&lt;div class="floatCenter" style="width: 400px"&gt;&#xD;
  &lt;a href="http://static.cybaea.net/images/ftse100.png"&gt;&#xD;
    &lt;img src="http://static.cybaea.net/images/ftse100-400.png" width="400" height="400" alt="[Analysis output]"&gt;&lt;/img&gt;&#xD;
  &lt;/a&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;p&gt;The power law still broadly holds.  In a large company, the productivity of the individual employee is only ¼ of the productivity in a company with one-tenth of the number of workers.&lt;/p&gt;&#xD;
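&#xD;
&lt;p&gt;To connect the two ways of stating the result, here is a small sketch that assumes productivity scales as a pure power of head count.  The helper names &lt;code&gt;productivity.ratio&lt;/code&gt; and &lt;code&gt;implied.slope&lt;/code&gt; are hypothetical and not part of ftse100.R.  The “quarter at ten times the size” statement corresponds to a log-log slope of roughly -0.60, and the earlier “halve when you treble” finding to roughly -0.63, so the two are broadly consistent.&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;## Sketch: under productivity ~ employees^slope, scaling the head count by&#xD;
## `factor` multiplies productivity by factor^slope.&#xD;
productivity.ratio &amp;lt;- function(slope, factor) factor^slope&#xD;
implied.slope &amp;lt;- function(ratio, factor) log(ratio) / log(factor)&#xD;
&#xD;
s.ten    &amp;lt;- implied.slope(1/4, 10)   # "a quarter at ten times the size"&#xD;
s.treble &amp;lt;- implied.slope(1/2, 3)    # "halve when you treble"&#xD;
round(c(s.ten, s.treble), 3)&#xD;
productivity.ratio(s.ten, 10)        # recovers 0.25 by construction&#xD;
&lt;/pre&gt;&#xD;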
&#xD;
&lt;p&gt;The analysis for the FTSE All-Share index is easy (&lt;a href="http://static.cybaea.net/files/ftse-all.R" title="Click for full size"&gt;ftse-all.R&lt;/a&gt;) and gives a slope of -0.7605541 for the 301 companies with the required information, which is a much steeper decline in productivity with size.  More convincingly, fitting only the companies with more than 1,000 employees (to avoid the bias from smaller companies needing large profits per employee in order to be big enough to afford a stock market listing) gives a slope of -0.2838.&lt;/p&gt;&#xD;
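&#xD;
&lt;p&gt;Restricting the fit to larger companies can be sketched with the &lt;code&gt;subset&lt;/code&gt; argument of &lt;code&gt;lm&lt;/code&gt;.  This is an assumption about how such a fit might be written, not a copy of ftse-all.R, and the data frame &lt;code&gt;toy&lt;/code&gt; is synthetic: it is generated from an exact power law with exponent -0.6 so that the fitted slope is known in advance.&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;## Sketch: refit the log-log model on companies above a head-count threshold.&#xD;
## The data frame below is synthetic, generated from an exact power law.&#xD;
toy &amp;lt;- data.frame(employees = round(10^seq(2, 5, length.out = 20)))&#xD;
toy$profit &amp;lt;- 1e6 * toy$employees^0.4    # profit/employees = 1e6 * employees^-0.6&#xD;
fit.big &amp;lt;- lm(log10(profit / employees) ~ log10(employees),&#xD;
              data = toy, subset = employees &amp;gt; 1000)&#xD;
coef(fit.big)[2]                         # slope of the fitted power law&#xD;
nobs(fit.big)                            # number of companies kept by the subset&#xD;
&lt;/pre&gt;&#xD;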
&#xD;
&lt;div class="floatCenter" style="width: 400px"&gt;&#xD;
  &lt;a href="http://static.cybaea.net/images/ftse-all.png" title="Click for full size"&gt;&#xD;
    &lt;img src="http://static.cybaea.net/images/ftse-all-400.png" width="400" height="400" alt="[Analysis output]"&gt;&lt;/img&gt;&#xD;
  &lt;/a&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div class="floatCenter" style="width: 400px"&gt;&#xD;
  &lt;a href="http://static.cybaea.net/images/ftse-all-big.png" title="Click for full size"&gt;&#xD;
    &lt;img src="http://static.cybaea.net/images/ftse-all-big-400.png" width="400" height="400" alt="[Analysis output]"&gt;&lt;/img&gt;&#xD;
  &lt;/a&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&#xD;
&lt;div class="seealso"&gt;&#xD;
&lt;h1&gt;You may also like these posts:&lt;/h1&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-50.png" width="85" height="16" alt="[0.51]" title="[0.51]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Area-Plots-with-Intensity-Coloring.html" title="I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line which R does not support natively so we have to roll our own code for that. Unfortunately, lines(..., type=l) does not recycle the colour col= argument, so we end up with rather more loops than I thought would be necessary. We also get a nice opportunity to use the under-appreciated read.fwf function."&gt;Area Plots with Intensity Coloring&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line which R does not support natively so we have to roll our own code for that. Unfortunately, lines(..., type=l) does not recycle the colour col= argument, so we end up with rather more loops than I thought would be necessary. We also get a nice opportunity to use the under-appreciated read.fwf function.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.46]" title="[0.46]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Benchmarking-feature-selection-with-Boruta-and-caret.html" title="Feature selection is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering. For commercial data mining, which is often characterised by having too many variables for model building, this is an important step in the analysis process. And since we often work on very large data sets the performance of our process is very important to us. Having looked at feature selection using the Boruta package and feature selection using the caret package separately, we now consider the performance of the two approaches. Neither approach is suitable out of the box for the sizes of data sets that we normally work with."&gt;Benchmarking feature selection with Boruta and caret&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Feature selection is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering. For commercial data mining, which is often characterised by having too many variables for model building, this is an important step in the analysis process. And since we often work on very large data sets the performance of our process is very important to us. Having looked at feature selection using the Boruta package and feature selection using the caret package separately, we now consider the performance of the two approaches. Neither approach is suitable out of the box for the sizes of data sets that we normally work with.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.45]" title="[0.45]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Big-data-for-R.html" title="Revolutions Analytics recently announced their big data solution for R. This is great news and a lovely piece of work by the team at Revolutions. However, if you want to replicate their analysis in standard R , then you can absolutely do so and we show you how."&gt;Big data for R&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Revolutions Analytics recently announced their big data solution for R. This is great news and a lovely piece of work by the team at Revolutions. However, if you want to replicate their analysis in standard R , then you can absolutely do so and we show you how.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.44]" title="[0.44]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-Eliminating-observed-values-with-zero-variance.html" title="I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform . In other words, I want to find the columns in a data frame that has zero variance. And as fast as possible, because my data sets are large, many, and changing fast. The final result surprised me a little."&gt;R: Eliminating observed values with zero variance&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform . In other words, I want to find the columns in a data frame that has zero variance. And as fast as possible, because my data sets are large, many, and changing fast. The final result surprised me a little.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.44]" title="[0.44]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Getting-started-with-HHP.html" title="The US$ 3 million Heritage Health Price competition is on so we take a look at how to get started using the R statistical computing and analysis platform ."&gt;Getting started with the Heritage Health Price competition&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;The US$ 3 million Heritage Health Price competition is on so we take a look at how to get started using the R statistical computing and analysis platform .&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-30.png" width="85" height="16" alt="[0.40]" title="[0.40]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/SNA-with-R-Loading-large-networks-using-the-igraph-library.html" title="We are interested in Social Network Analysis using the statistical analysis and computing platform R . The documentation for R is voluminous but typically not very good, so this entry is part of a series where we document what we learn as we explore the tool and the packages. In our previous post on SNA we gave up on using the statnet package because it was not able to handle our data volumes. In this entry we have better success with the igraph package."&gt;SNA with R: Loading large networks using the igraph library&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;We are interested in Social Network Analysis using the statistical analysis and computing platform R . The documentation for R is voluminous but typically not very good, so this entry is part of a series where we document what we learn as we explore the t…&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-30.png" width="85" height="16" alt="[0.40]" title="[0.40]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Journal/employee_productivity.html" title="The more employees your company has, the less productive each of these employees are. It is a generalization, of course, but a useful one and one that is confirmed by most people who have worked for growing organizations. As the company grows, so does the internal processes and the layers of bureaucracy, and the time spent on communications grows rapidly. It is, however, useful to look at the actual numbers. How much does productivity decrease as the organization grows? We analyze the S&amp;amp;P 500 constituents and the answers are frankly frighting: when you triple the number of employees, you halve their productivity ."&gt;The 3/2 rule of employee productivity&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;The more employees your company has, the less productive each of these employees are. It is a generalization, of course, but a useful one and one that is confirmed by most people who have worked for growing organizations. As the company grows, so does the…&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-20.png" width="85" height="16" alt="[0.38]" title="[0.38]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Journal/employee_productivity_sector.html" title="We revisited the 3/2 rule of employee productivity using a larger data set and showing each sector independently. As before, we chose profits per employee as our metric for employee productivity and show it against the number of employees. The resulting per-sector graphs are shown below (click through for a larger version). The data clearly debunk any myths that large companies are more efficient , an oft-quoted statement in merger situations, at least as far as HR is concerned. In total, there is probably a downward trend with size but with a slope of perhaps -0.1 or thereabouts. That still means that when you add 10% employees you lose 1% productivity per employee, which is clearly problematic."&gt;The 3/2 rule revisited&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;We revisited the 3/2 rule of employee productivity using a larger data set and showing each sector independently. As before, we chose profits per employee as our metric for employee productivity and show it against the number of employees. The resulting p…&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-20.png" width="85" height="16" alt="[0.38]" title="[0.38]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Feature-selection-Using-the-caret-package.html" title="Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building. In a previous post we looked at all-relevant feature selection using the Boruta package while in this post we consider the same (artificial, toy) examples using the caret package. Max Kuhn kindly listed me as a contributor for some performance enhancements I submitted, but the genius behind the package is all his."&gt;Feature selection: Using the caret package&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building. In a previous post we looked at all-relevant feature selection using the Boruta package w…&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-20.png" width="85" height="16" alt="[0.38]" title="[0.38]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/The-Knapsack-Problem.html" title="David posts a question about how to solve this knapsack problem using the R statistical computing and analysis platform . My reply in the comments seems to have disappeared for a while so here is my proposed solution:"&gt;The Knapsack Problem&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;David posts a question about how to solve this knapsack problem using the R statistical computing and analysis platform . My reply in the comments seems to have disappeared for a while so here is my proposed solution:&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mWb7tx4LS94:yeSkNsef7zA:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mWb7tx4LS94:yeSkNsef7zA:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=mWb7tx4LS94:yeSkNsef7zA:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mWb7tx4LS94:yeSkNsef7zA:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=mWb7tx4LS94:yeSkNsef7zA:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mWb7tx4LS94:yeSkNsef7zA:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mWb7tx4LS94:yeSkNsef7zA:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=mWb7tx4LS94:yeSkNsef7zA:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mWb7tx4LS94:yeSkNsef7zA:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/mWb7tx4LS94" height="1" width="1"/&gt;</content><published>2010-06-22T11:20:00Z</published><updated>2010-06-22T11:20:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Employee-productivity-as-function-of-number-of-workers-revisited.html</feedburner:origLink></entry><entry><title type="text">Comparing standard R with Revoutions for performance</title><id>urn:uuid:3293adea-fce4-57ac-844d-8c40497745e3</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Comparing-standard-R-with-Revoutions-for-performance.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/SNu7nI9K28g/Comparing-standard-R-with-Revoutions-for-performance.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>Following on from my previous post about <a href="http://www.cybaea.net/Blogs/Data/Faster-R-through-better-BLAS.html">improving performance of R by linking with optimized linear algebra libraries</a>, I thought it would be useful to try out the five benchmarks Revolutions Analytics have on their <a href="http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php">Revolutionary Performance</a> pages.</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;Following on from my previous post about &lt;a href="http://www.cybaea.net/Blogs/Data/Faster-R-through-better-BLAS.html"&gt;improving performance of R by linking with optimized linear algebra libraries&lt;/a&gt;, I thought it would be useful to try out the five benchmarks Revolutions Analytics have on their &lt;a href="http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php"&gt;Revolutionary Performance&lt;/a&gt; pages.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;For convenience I collected their tests into a single script &lt;a href="http://static.cybaea.net/files/revolution_benchmark.R"&gt;revolution_benchmark.R&lt;/a&gt; that I can simply run with &lt;code&gt;Rscript --vanilla revolution_benchmark.R&lt;/code&gt;.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;The results, compared with the speed-up factors Revolution claims for their version:&lt;/p&gt;&#xD;
&#xD;
&lt;table border="1" class="border"&gt;&#xD;
&lt;caption&gt;Revolution’s benchmarks run with stock R and with R + ATLAS on an x86_64 system&lt;/caption&gt;&#xD;
&lt;thead&gt;&#xD;
&lt;tr&gt;&#xD;
&lt;th&gt;&lt;/th&gt;&lt;th&gt;R&lt;/th&gt;&lt;th&gt;R + ATLAS&lt;/th&gt;&lt;th&gt;Speed-up&lt;/th&gt;&lt;th&gt;Revolution’s&lt;br&gt;claimed speed-up&lt;/th&gt;&#xD;
&lt;/tr&gt;&#xD;
&lt;/thead&gt;&#xD;
&lt;tbody&gt;&#xD;
&lt;tr&gt;&lt;td&gt;Matrix Multiply&lt;/td&gt;&lt;td&gt;360.96&lt;/td&gt;&lt;td&gt;9.30&lt;/td&gt;&lt;td&gt;37.8&lt;/td&gt;&lt;td&gt;41.0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;Cholesky Factorization&lt;/td&gt;&lt;td&gt;27.28&lt;/td&gt;&lt;td&gt;5.65&lt;/td&gt;&lt;td&gt;3.8&lt;/td&gt;&lt;td&gt;21.0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;Singular Value Decomposition&lt;/td&gt;&lt;td&gt;98.73&lt;/td&gt;&lt;td&gt;23.57&lt;/td&gt;&lt;td&gt;3.2&lt;/td&gt;&lt;td&gt;12.6&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;Principal Components Analysis&lt;/td&gt;&lt;td&gt;454.55&lt;/td&gt;&lt;td&gt;40.92&lt;/td&gt;&lt;td&gt;10.1&lt;/td&gt;&lt;td&gt;15.2&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;Linear Discriminant Analysis&lt;/td&gt;&lt;td&gt;271.44&lt;/td&gt;&lt;td&gt;79.61&lt;/td&gt;&lt;td&gt;2.4&lt;/td&gt;&lt;td&gt;4.4&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;/tbody&gt;&#xD;
&lt;/table&gt;&#xD;
&#xD;
&lt;p&gt;In all instances Revolution’s claimed speed-up is greater, though probably not significantly so for the Matrix Multiply test and hardly so for the Principal Components Analysis.  (Of course, I do not have a copy of Revolution Analytics’ product, so I can’t verify their claims or make a comparable test.)&lt;/p&gt;&#xD;
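&#xD;
&lt;p&gt;For reference, the Speed-up column above can be reproduced from the two timing columns.  The published figures match the ratio of the two timings minus one, not the plain ratio; that reading is inferred from the numbers and should be treated as an assumption.  A minimal sketch, with the helper name &lt;code&gt;speedup_table&lt;/code&gt; made up for illustration:&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&#xD;
```shell
# speedup_table: read tab-separated "name, stock R time, R + ATLAS time"
# lines on stdin and print "name, speed-up".
# Assumption: speed-up = (stock time / ATLAS time) - 1.
speedup_table() {
  awk -F'\t' '{ printf "%s\t%.1f\n", $1, $2 / $3 - 1 }'
}

# Timings copied from the table above.
printf '%s\t%s\t%s\n' \
  "Matrix Multiply"               360.96  9.30 \
  "Cholesky Factorization"         27.28  5.65 \
  "Singular Value Decomposition"   98.73 23.57 \
  "Principal Components Analysis" 454.55 40.92 \
  "Linear Discriminant Analysis"  271.44 79.61 | speedup_table
```
&lt;/pre&gt;&#xD;
&lt;p&gt;This prints 37.8, 3.8, 3.2, 10.1 and 2.4 for the five tests, in agreement with the table.&lt;/p&gt;&#xD;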
&#xD;
&lt;p&gt;Whether saving 48 seconds on a linear discriminant analysis is enough to justify buying the product is a decision I leave to you: you know what analysis you do.  For me, there are (many) orders of magnitude to be gained from better algorithms and better variable selection, so I am not too worried about factors of 2 or even 10.  For extra raw power, I run R on a cloud service like AWS, which scales well for many problems and is easy to do with stock R; I guess there would be some sort of license implications if you wanted to do the same with Revolution’s product.  (But I &lt;em&gt;like&lt;/em&gt; Revolution and am still trying to find an excuse to use their product.)&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;Your mileage may vary.&lt;/p&gt;&#xD;
&lt;p&gt;Jump to &lt;a href="http://www.cybaea.net/Blogs/Data/Comparing-standard-R-with-Revoutions-for-performance.html#h2_entry_comments"&gt;comments&lt;/a&gt;.&lt;/p&gt;&#xD;
&lt;div class="seealso"&gt;&#xD;
&lt;h1&gt;You may also like these posts:&lt;/h1&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-50.png" width="85" height="16" alt="[0.49]" title="[0.49]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Faster-R-through-better-BLAS.html" title="Can we make our analysis using the R statistical computing and analysis platform run faster? Usually the answer is yes, and the best way is to improve your algorithm and variable selection. But recently David Smith was suggesting that a big benefit of their (commercial) version of R was that it was linked to a to a better linear algebra library. So I decided to investigate. The quick summary is that it only really makes a difference for fairly artificial benchmark tests. For “normal” work you are unlikely to see a difference most of the time."&gt;Faster R through better BLAS&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Can we make our analysis using the R statistical computing and analysis platform run faster? Usually the answer is yes, and the best way is to improve your algorithm and variable selection. But recently David Smith was suggesting that a big benefit of their (commercial) version of R was that it was linked to a to a better linear algebra library. So I decided to investigate. The quick summary is that it only really makes a difference for fairly artificial benchmark tests. For “normal” work you are unlikely to see a difference most of the time.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-30.png" width="85" height="16" alt="[0.41]" title="[0.41]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-tips-Eliminating-the-save-workspace-image-prompt-on-exit.html" title="When using R , the statistical analysis and computing platform, I find it really annoying that it always prompts to save the workspace when I exit. This is how I turn it off."&gt;R tips: Eliminating the “save workspace image” prompt on exit&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;When using R , the statistical analysis and computing platform, I find it really annoying that it always prompts to save the workspace when I exit. This is how I turn it off.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-00.png" width="85" height="16" alt="[0.34]" title="[0.34]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-tips-Keep-your-packages-up_to_date.html" title="In this entry in a small series of tips for the use of the R statistical analysis and computing tool, we look at how to keep your addon packages up-to-date."&gt;R tips: Keep your packages up-to-date&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=SNu7nI9K28g:jpcoNrrR4Rk:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=SNu7nI9K28g:jpcoNrrR4Rk:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=SNu7nI9K28g:jpcoNrrR4Rk:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=SNu7nI9K28g:jpcoNrrR4Rk:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=SNu7nI9K28g:jpcoNrrR4Rk:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=SNu7nI9K28g:jpcoNrrR4Rk:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=SNu7nI9K28g:jpcoNrrR4Rk:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=SNu7nI9K28g:jpcoNrrR4Rk:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=SNu7nI9K28g:jpcoNrrR4Rk:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/SNu7nI9K28g" height="1" width="1"/&gt;</content><published>2010-06-17T09:05:00Z</published><updated>2010-06-17T09:05:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Comparing-standard-R-with-Revoutions-for-performance.html</feedburner:origLink></entry><entry><title type="text">Faster R through better BLAS</title><id>urn:uuid:428f009b-a07d-59dc-a643-50cc9a2b86ca</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Faster-R-through-better-BLAS.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/txxBGby0z-I/Faster-R-through-better-BLAS.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>Can we make our analysis using the <a href="http://www.r-project.org/">R statistical computing and analysis platform</a> run faster?  Usually the answer is yes, and the best way is to improve your algorithm and variable selection.</p>
<p>But recently David Smith was <a href="http://blog.revolutionanalytics.com/2010/06/performance-benefits-of-multithreaded-r.html" title="Performance benefits of linking R to multithreaded math libraries">suggesting</a> that a big benefit of their (commercial) version of R was that it was linked to a better linear algebra library.  So I decided to investigate.</p>
<p>The quick summary is that it only really makes a difference for fairly artificial benchmark tests.  For “normal” work you are unlikely to see a difference most of the time.</p>
</div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;Can we make our analysis using the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt; run faster?  Usually the answer is yes, and the best way is to improve your algorithm and variable selection.&lt;/p&gt;&#xD;
&lt;p&gt;But recently David Smith was &lt;a href="http://blog.revolutionanalytics.com/2010/06/performance-benefits-of-multithreaded-r.html" title="Performance benefits of linking R to multithreaded math libraries"&gt;suggesting&lt;/a&gt; that a big benefit of their (commercial) version of R was that it was linked to a better linear algebra library.  So I decided to investigate.&lt;/p&gt;&#xD;
&lt;p&gt;The quick summary is that it only really makes a difference for fairly artificial benchmark tests.  For “normal” work you are unlikely to see a difference most of the time.&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;The environment&lt;/h2&gt;&#xD;
&lt;p&gt;I use R on a 64-bit &lt;a href="http://fedoraproject.org/"&gt;Fedora&lt;/a&gt; 12 Linux system.  Fortunately, it is very easy to rebuild R using different libraries on this platform.  For the following, I will assume that you have a working &lt;a href="http://www.rpm.org/max-rpm-snapshot/rpmbuild.8.html"&gt;rpmbuild&lt;/a&gt; environment.  The test system has a quad core Intel Xeon E5420 CPU with each core running at 2.50 GHz.&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;Benchmarks&lt;/h2&gt;&#xD;
&lt;p&gt;Benchmarking R is complex.  Very complex.  But for this simple comparison we use two scripts from the &lt;a href="http://r.research.att.com/benchmarks/"&gt;R Benchmarks&lt;/a&gt; page: &lt;a href="http://r.research.att.com/benchmarks/MASS-ex.R"&gt;MASS-ex.R&lt;/a&gt; and &lt;a href="http://r.research.att.com/benchmarks/R-benchmark-25.R"&gt;R-benchmark-25.R&lt;/a&gt;.  The first is a simple benchmark using the examples from the MASS package, and has the advantage that it reflects real-world problems and real-world analysis, albeit small problems and short analysis.  The second is a much more artificial example and primarily tests matrix operations.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;We run the MASS benchmark as:&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;/usr/bin/time -p R --vanilla CMD BATCH MASS-ex.R /dev/null&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;The R-benchmark-25 script is run simply as:&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;Rscript --vanilla R-benchmark-25.R&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;For the MASS benchmark we simply capture the real elapsed time, while R benchmark 2.5 provides more detailed output for its three classes of tests (matrix calculation, matrix functions, and program execution) as well as overall summaries.  They are all shown in the table below.&lt;/p&gt;&#xD;
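&#xD;
&lt;p&gt;A single timing can be noisy, so it may be worth repeating each run a few times.  The following is only a sketch and is not how the numbers in this post were collected; the helper names &lt;code&gt;elapsed_secs&lt;/code&gt; and &lt;code&gt;repeat_timing&lt;/code&gt; are hypothetical, and &lt;code&gt;date +%s&lt;/code&gt; only resolves whole seconds.&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&#xD;
```shell
# elapsed_secs CMD [ARGS...]: run CMD once, discard its output, and print
# the wall-clock time in whole seconds.
elapsed_secs() {
  start=$(date +%s)
  "$@" > /dev/null 2> /dev/null
  end=$(date +%s)
  echo $((end - start))
}

# repeat_timing N CMD [ARGS...]: run CMD N times, one elapsed time per line.
repeat_timing() {
  n=$1
  shift
  i=0
  while [ "$i" -lt "$n" ]; do
    elapsed_secs "$@"
    i=$((i + 1))
  done
}

# Example (assumes R and the benchmark script are in the current directory):
#   repeat_timing 5 R --vanilla CMD BATCH MASS-ex.R /dev/null
```
&lt;/pre&gt;&#xD;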
&#xD;
&lt;h2&gt;Compiler-optimized R&lt;/h2&gt;&#xD;
&lt;p&gt;For the experiments that follow the first thing to do is to grab copies of the source RPMs for R and for ATLAS:&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;cd ~/rpmbuild/SRPMS&#xD;
yumdownloader --source atlas R&#xD;
cd ..&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;At the time I did this, I got &lt;code&gt;R-2.11.0-1.fc12.src.rpm&lt;/code&gt; and &lt;code&gt;atlas-3.8.3-12.fc12.src.rpm&lt;/code&gt;.  I crank up the level of optimization that I do when building from source so the first thing is to edit &lt;code&gt;&lt;a href="http://static.cybaea.net/files/.rpmrc"&gt;~/.rpmrc&lt;/a&gt;&lt;/code&gt; to include the line &lt;code&gt;optflags: x86_64 -O3 -march=native -m64 -g&lt;/code&gt;.  With that in place we can simply do:&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="screen"&gt;rpmbuild --rebuild SRPMS/R-2.11.0-1.fc12.src.rpm  #  Change version numbers as needed&#xD;
su -c 'rpm -Uhv --force RPMS/x86_64/R*2.11.0-1*.rpm RPMS/x86_64/libRmath*2.11.0-1*.rpm'&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;We now have a compiler-optimized version of R and we can re-run our tests.  It doesn't make much difference, but that is also good to know.&lt;/p&gt;&#xD;
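&#xD;
&lt;p&gt;The &lt;code&gt;~/.rpmrc&lt;/code&gt; edit described at the start of this section can also be scripted so that it is safe to repeat.  A minimal sketch, assuming the file needs only the one &lt;code&gt;optflags&lt;/code&gt; line given above; the function name &lt;code&gt;add_optflags&lt;/code&gt; is made up for illustration.&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&#xD;
```shell
# add_optflags FILE: append the x86_64 optflags line to FILE unless an
# x86_64 optflags line is already there (so re-running changes nothing).
add_optflags() {
  rc=$1
  line='optflags: x86_64 -O3 -march=native -m64 -g'
  if ! grep -q '^optflags: x86_64 ' "$rc" 2> /dev/null; then
    echo "$line" >> "$rc"
  fi
}

# Example:
#   add_optflags ~/.rpmrc
```
&lt;/pre&gt;&#xD;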
&#xD;
&lt;h2&gt;ATLAS BLAS libraries&lt;/h2&gt;&#xD;
&lt;p&gt;Now let's try linking to the ATLAS BLAS libraries instead.  I assume you have them installed (&lt;code&gt;yum install atlas&lt;/code&gt; if not) so you can just grab a copy of &lt;a href="http://static.cybaea.net/files/R-atlas.diff"&gt;R-atlas.diff&lt;/a&gt; to change the spec file like this:&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="screen"&gt;rpm -ihv SRPMS/R-2.11.0-1.fc12.src.rpm   # Install to your rpmbuild environment&#xD;
cd SPECS&#xD;
wget &lt;a href="http://static.cybaea.net/files/R-atlas.diff"&gt;http://static.cybaea.net/files/R-atlas.diff&lt;/a&gt;&#xD;
patch -o R-atlas.spec R.spec R-atlas.diff&#xD;
cd ..&#xD;
rpmbuild -bb SPECS/R-atlas.spec&#xD;
su -c 'rpm -Uhv --force RPMS/x86_64/R*2.11.0-1*.rpm RPMS/x86_64/libRmath*2.11.0-1*.rpm'&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;You now have a version of R that uses the ATLAS BLAS libraries, so you can re-run the tests.  The results are in the table below in the “Optimized R + Standard ATLAS” row.&lt;/p&gt;&#xD;
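&#xD;
&lt;p&gt;Before re-running the tests it can be worth checking which linear algebra libraries the rebuilt R actually loads.  This is a sketch only: it assumes R was built as a shared library, so that &lt;code&gt;libR.so&lt;/code&gt; sits under the directory printed by &lt;code&gt;R RHOME&lt;/code&gt;, which is an assumption about the install layout.  The helper name &lt;code&gt;blas_links&lt;/code&gt; is made up for illustration.&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&#xD;
```shell
# blas_links: keep only the BLAS, LAPACK and ATLAS entries from ldd-style
# output read on stdin.
blas_links() {
  grep -i -E 'blas|lapack|atlas'
}

# Example on the rebuilt system (the libR.so path is an assumption):
#   ldd "$(R RHOME)/lib/libR.so" | blas_links
```
&lt;/pre&gt;&#xD;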
&#xD;
&lt;p&gt;As expected, the matrix operations from &lt;code&gt;R-benchmark-25.R&lt;/code&gt; run a lot faster: they complete in roughly 27-39% of the base-install time (see the index columns in the table below).  Much of the gain comes from multi-threading, which puts all four CPU cores to work.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;However, for the analysis-heavy code in &lt;code&gt;MASS-ex.R&lt;/code&gt; there is little difference.  If anything, we see a tiny increase in running time.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
  &lt;em&gt;Multi-threaded BLAS libraries make no significant difference to real-world analysis problems using R.&lt;/em&gt;&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;Other BLAS libraries&lt;/h2&gt;&#xD;
&lt;p&gt;For good measure we also try an optimized version of ATLAS, but it does not make much difference on the x86_64 architecture:&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="screen"&gt;rpmbuild -D "enable_native_atlas 1" --rebuild SRPMS/atlas-3.8.3-12.fc12.src.rpm&#xD;
su -c 'rpm -Uhv --force RPMS/x86_64/atlas*3.8.3-12*.rpm'&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;And (only) for completeness, we also try the standard Netlib BLAS and LAPACK libraries (&lt;code&gt;yum install blas lapack&lt;/code&gt;) by the same method as the ATLAS library above but with a slightly different change to the SPEC file: &lt;code&gt;&lt;a href="http://static.cybaea.net/files/R-blas.diff"&gt;R-blas.diff&lt;/a&gt;&lt;/code&gt;.  It performs a little better than vanilla R.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;For more information about rebuilding R with different BLAS libraries, see the &lt;a href="http://cran.r-project.org/doc/manuals/R-admin.html#Linear-algebra"&gt;linear algebra section in the R Installation and Administration manual&lt;/a&gt;.&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;Benchmark results&lt;/h2&gt;&#xD;
&lt;table border="1" class="border"&gt;&#xD;
&lt;caption&gt;Benchmark results for various optimizations of R and the BLAS library&lt;/caption&gt;&#xD;
&lt;thead&gt;&#xD;
&lt;tr&gt;&#xD;
&lt;th rowspan="3"&gt;R version&lt;/th&gt;&#xD;
&lt;th colspan="2"&gt;&lt;a href="http://r.research.att.com/benchmarks/MASS-ex.R"&gt;MASS-ex.R&lt;/a&gt;&lt;/th&gt;&#xD;
&lt;th colspan="10"&gt;&lt;a href="http://r.research.att.com/benchmarks/R-benchmark-25.R"&gt;R benchmark 2.5&lt;/a&gt;&lt;/th&gt;&#xD;
&lt;/tr&gt;&#xD;
&lt;tr&gt;&#xD;
&lt;th colspan="2"&gt;Real&lt;/th&gt;&#xD;
&lt;th colspan="2"&gt;Total time&lt;/th&gt;&#xD;
&lt;th colspan="2"&gt;Overall mean&lt;/th&gt;&#xD;
&lt;th colspan="2"&gt;Ⅰ. Matrix calc.&lt;/th&gt;&#xD;
&lt;th colspan="2"&gt;Ⅱ. Matrix functions&lt;/th&gt;&#xD;
&lt;th colspan="2"&gt;Ⅲ. Program.&lt;/th&gt;&#xD;
&lt;/tr&gt;&#xD;
&lt;tr&gt;&#xD;
&lt;th&gt;&lt;abbr title="seconds"&gt;secs&lt;/abbr&gt;&lt;/th&gt;&lt;th&gt;index&lt;/th&gt;&lt;th&gt;&lt;abbr title="seconds"&gt;secs&lt;/abbr&gt;&lt;/th&gt;&lt;th&gt;index&lt;/th&gt;&lt;th&gt;&lt;abbr title="seconds"&gt;secs&lt;/abbr&gt;&lt;/th&gt;&lt;th&gt;index&lt;/th&gt;&#xD;
&lt;th&gt;&lt;abbr title="seconds"&gt;secs&lt;/abbr&gt;&lt;/th&gt;&lt;th&gt;index&lt;/th&gt;&lt;th&gt;&lt;abbr title="seconds"&gt;secs&lt;/abbr&gt;&lt;/th&gt;&lt;th&gt;index&lt;/th&gt;&lt;th&gt;&lt;abbr title="seconds"&gt;secs&lt;/abbr&gt;&lt;/th&gt;&lt;th&gt;index&lt;/th&gt;&#xD;
&lt;/tr&gt;&#xD;
&lt;/thead&gt;&#xD;
&lt;tbody&gt;&#xD;
&lt;tr&gt;&#xD;
&lt;td&gt;Base install&lt;/td&gt;&#xD;
&lt;td&gt;19.00&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;td&gt;78.49&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;td&gt;2.11&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;td&gt;2.32&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;td&gt;3.86&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;td&gt;1.05&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&#xD;
&lt;/tr&gt;&#xD;
&lt;tr&gt;&#xD;
&lt;td&gt;Optimized R&lt;/td&gt;&#xD;
&lt;td&gt;18.98&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;td&gt;76.11&lt;/td&gt;&lt;td&gt;0.97&lt;/td&gt;&lt;td&gt;2.02&lt;/td&gt;&lt;td&gt;0.96&lt;/td&gt;&lt;td&gt;2.36&lt;/td&gt;&lt;td&gt;1.02&lt;/td&gt;&lt;td&gt;3.46&lt;/td&gt;&lt;td&gt;0.90&lt;/td&gt;&lt;td&gt;1.02&lt;/td&gt;&lt;td&gt;0.97&lt;/td&gt;&#xD;
&lt;/tr&gt;&#xD;
&lt;tr&gt;&#xD;
&lt;td&gt;Optimized R + Netlib BLAS&lt;/td&gt;&#xD;
&lt;td&gt;18.56&lt;/td&gt;&lt;td&gt;0.98&lt;/td&gt;&lt;td&gt;73.22&lt;/td&gt;&lt;td&gt;0.93&lt;/td&gt;&lt;td&gt;1.81&lt;/td&gt;&lt;td&gt;0.86&lt;/td&gt;&lt;td&gt;2.36&lt;/td&gt;&lt;td&gt;1.02&lt;/td&gt;&lt;td&gt;2.41&lt;/td&gt;&lt;td&gt;0.62&lt;/td&gt;&lt;td&gt;1.04&lt;/td&gt;&lt;td&gt;0.99&lt;/td&gt;&#xD;
&lt;/tr&gt;&#xD;
&lt;tr&gt;&#xD;
&lt;td&gt;Optimized R + Standard ATLAS&lt;/td&gt;&#xD;
&lt;td&gt;19.43&lt;/td&gt;&lt;td&gt;1.02&lt;/td&gt;&lt;td&gt;16.74&lt;/td&gt;&lt;td&gt;0.21&lt;/td&gt;&lt;td&gt;0.97&lt;/td&gt;&lt;td&gt;0.46&lt;/td&gt;&lt;td&gt;0.90&lt;/td&gt;&lt;td&gt;0.39&lt;/td&gt;&lt;td&gt;1.04&lt;/td&gt;&lt;td&gt;0.27&lt;/td&gt;&lt;td&gt;0.99&lt;/td&gt;&lt;td&gt;0.95&lt;/td&gt;&#xD;
&lt;/tr&gt;&#xD;
&lt;tr&gt;&#xD;
&lt;td&gt;Optimized R + Optimized ATLAS&lt;/td&gt;&#xD;
&lt;td&gt;19.31&lt;/td&gt;&lt;td&gt;1.02&lt;/td&gt;&lt;td&gt;16.36&lt;/td&gt;&lt;td&gt;0.21&lt;/td&gt;&lt;td&gt;0.95&lt;/td&gt;&lt;td&gt;0.45&lt;/td&gt;&lt;td&gt;0.84&lt;/td&gt;&lt;td&gt;0.36&lt;/td&gt;&lt;td&gt;1.02&lt;/td&gt;&lt;td&gt;0.26&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;td&gt;0.95&lt;/td&gt;&#xD;
&lt;/tr&gt;&#xD;
&lt;/tbody&gt;&#xD;
&lt;/table&gt;&#xD;
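&lt;p&gt;The index columns are consistent with each time divided by the base install time for the same benchmark, so an index below 1.00 means faster than the base install. As a worked check (the variable names are only illustrative), the total time index can be recomputed in R from the seconds in the table:&lt;/p&gt;
&lt;pre class="code"&gt;## Total time in seconds for R benchmark 2.5, copied from the table
total.secs &amp;lt;- c("Base install"                  = 78.49,
                "Optimized R"                   = 76.11,
                "Optimized R + Netlib BLAS"     = 73.22,
                "Optimized R + Standard ATLAS"  = 16.74,
                "Optimized R + Optimized ATLAS" = 16.36)

## Index: each time relative to the base install, to two decimals
index &amp;lt;- round(total.secs / total.secs[["Base install"]], 2)
print(index)
&lt;/pre&gt;
&lt;p&gt;The result matches the index column: 1.00, 0.97, 0.93, 0.21 and 0.21.&lt;/p&gt;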
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/txxBGby0z-I" height="1" width="1"/&gt;</content><published>2010-06-15T10:21:00Z</published><updated>2010-06-15T10:21:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Faster-R-through-better-BLAS.html</feedburner:origLink></entry><entry><title type="text">R: Eliminating observed values with zero variance</title><id>urn:uuid:5394cf3c-2009-5225-955d-1b6c90ae4445</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-Eliminating-observed-values-with-zero-variance.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/tQRUNQKxFng/R-Eliminating-observed-values-with-zero-variance.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
I needed a fast way of eliminating observed values with zero variance from large data sets using the <a href="http://www.r-project.org/">R statistical computing and analysis platform</a>.  In other words, I want to find the columns in a data frame that have zero variance, and to do so as fast as possible, because my data sets are large, numerous, and changing fast.  The final result surprised me a little.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
I needed a fast way of eliminating observed values with zero variance from large data sets using the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt;.  In other words, I want to find the columns in a data frame that have zero variance, and to do so as fast as possible, because my data sets are large, numerous, and changing fast.  The final result surprised me a little.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
I use the &lt;a href="http://www.kddcup-orange.com/data.php"&gt;KDD Cup 2009 data sets&lt;/a&gt; as my reference for this experiment.  (You will need to register to download the data.)  It is a realistic example of the type of customer data that I usually work with.  It has 50,000 observations of 15,000 variables.  To load it into R you'll need a reasonably beefy machine.  My workstation has 16GB of memory; if yours has less, then use a sample of the data.&#xD;
&lt;/p&gt;&#xD;
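&lt;p&gt;Taking a random sample of the rows might look like the following sketch. The toy data frame stands in for the real data, and the names and sample size are made up for illustration; if the file will not load at all, the &lt;code&gt;nrows&lt;/code&gt; argument of &lt;code&gt;read.delim&lt;/code&gt; can limit how many rows are read instead.&lt;/p&gt;
&lt;pre class="code"&gt;## A stand-in for the real data set (hypothetical column names)
set.seed(2009)  # for a reproducible sample
full &amp;lt;- data.frame(Var1 = rnorm(1000), Var2 = runif(1000))

## Draw 100 distinct rows at random and keep all columns
rows &amp;lt;- sample(nrow(full), size = 100)
small &amp;lt;- full[rows, , drop = FALSE]
print(dim(small))
&lt;/pre&gt;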
&lt;p&gt;&#xD;
We load the data into R and propose a few ways in which we may identify the columns we need:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;#!/usr/bin/Rscript&#xD;
## zero-var.R - find the fastest way of eliminating observations with zero variance&#xD;
## © 2010 Allan Engelhardt, http://www.cybaea.net&#xD;
&#xD;
## Read the data file.&#xD;
## We have already converted it to R format and saved it, so we can do&#xD;
load("train.RData")&#xD;
## instead of something like&#xD;
# train &amp;lt;- read.delim(file="../orange_large_train.data.bz2")&#xD;
&#xD;
## Some suggestions for zero variance functions:&#xD;
zv.1 &amp;lt;- function(x) {&#xD;
    ## The literal approach&#xD;
    y &amp;lt;- var(x, na.rm = TRUE)&#xD;
    return(is.na(y) || y == 0)&#xD;
}&#xD;
zv.2 &amp;lt;- function(x) {&#xD;
    ## As before, but avoiding direct comparison with zero&#xD;
    y &amp;lt;- var(x, na.rm = TRUE)&#xD;
    return(is.na(y) || y &amp;lt; .Machine$double.eps ^ 0.5)&#xD;
}&#xD;
zv.3 &amp;lt;- function(x) {&#xD;
    ## Maybe it is faster to check for equality than to compute?&#xD;
    y &amp;lt;- x[!is.na(x)]&#xD;
    return(all(y == y[1]))&#xD;
}&#xD;
zv.4 &amp;lt;- function(x) {&#xD;
    ## Taking out the special case may speed things up?&#xD;
    ## (At least for this data set where this case is common.)&#xD;
    z &amp;lt;- is.na(x)&#xD;
    if (all(z)) return(TRUE)&#xD;
    y &amp;lt;- x[!z]&#xD;
    return(all(y == y[1]))&#xD;
}&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Now we just have to load the very useful &lt;a href="http://cran.r-project.org/web/packages/rbenchmark/index.html"&gt;rbenchmark&lt;/a&gt; package and let the machine figure it out:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;library("rbenchmark")&#xD;
&#xD;
cat("Running benchmarks:\n")&#xD;
benchmark(&#xD;
          zv1 = { sapply(train, zv.1) },&#xD;
          zv2 = { sapply(train, zv.2) },&#xD;
          zv3 = { sapply(train, zv.3) },&#xD;
          zv4 = { sapply(train, zv.4) },&#xD;
          replications = 5,&#xD;
          columns = c("test", "elapsed", "relative", "sys.self"),&#xD;
          order = "elapsed"&#xD;
          )&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The answer (on my machine) is that it is faster to calculate the variance than to check for equality:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;Running benchmarks:&#xD;
  test elapsed relative sys.self&#xD;
1  zv1  78.619 1.000000    6.395&#xD;
2  zv2  79.276 1.008357    6.586&#xD;
3  zv3 113.024 1.437617    1.735&#xD;
4  zv4 118.579 1.508274    1.716&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The two functions based on the core variance function are easily the fastest (despite having to do arithmetic): the equality-based functions take about 44% and 51% longer.  Taking out the all-missing special case (zv4) makes the equality check slower still, so that is a Bad Idea.&#xD;
&lt;/p&gt;&#xD;
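&lt;p&gt;Whichever test is used, its result is a logical vector with one element per column, which can then be used to drop the zero-variance columns. A minimal sketch on a toy data frame follows; the toy data and the names &lt;code&gt;is.zv&lt;/code&gt; and &lt;code&gt;toy.clean&lt;/code&gt; are made up for illustration, and &lt;code&gt;zv.1&lt;/code&gt; is repeated from above so the example is self-contained.&lt;/p&gt;
&lt;pre class="code"&gt;## Toy data standing in for the real data set
toy &amp;lt;- data.frame(a = c(1, 2, 3, NA),    # varies
                  b = c(5, 5, 5, 5),     # constant
                  c = rep(NA_real_, 4),  # entirely missing
                  d = c(7, NA, 7, 7))    # constant apart from a missing value

## The literal approach, as defined earlier
zv.1 &amp;lt;- function(x) {
    y &amp;lt;- var(x, na.rm = TRUE)
    return(is.na(y) || y == 0)
}

## TRUE for each column with zero (or undefined) variance
is.zv &amp;lt;- vapply(toy, zv.1, logical(1))

## Keep only the columns that vary
toy.clean &amp;lt;- toy[, !is.zv, drop = FALSE]
print(names(toy.clean))
&lt;/pre&gt;
&lt;p&gt;Here only column &lt;code&gt;a&lt;/code&gt; survives: &lt;code&gt;b&lt;/code&gt; is constant, &lt;code&gt;c&lt;/code&gt; is entirely missing, and &lt;code&gt;d&lt;/code&gt; is constant once missing values are ignored.&lt;/p&gt;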
&lt;p&gt;&#xD;
Can you think of an even faster way to do it?&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/tQRUNQKxFng" height="1" width="1"/&gt;</content><published>2010-03-08T14:46:00Z</published><updated>2010-03-08T14:46:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-Eliminating-observed-values-with-zero-variance.html</feedburner:origLink></entry><entry><title type="text">Beautiful Data</title><id>urn:uuid:770cda82-5757-5a26-827a-2aeff8a8a098</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Beautiful-Data.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/nsxeLGyxKLM/Beautiful-Data.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><div class="floatRight">
  <a href="http://www.cybaea.net/Blogs/Data/Beautiful-Data.html" title="Click for full article">
    <img src="http://static.cybaea.net/images/beautiful-data-small.png" width="100" height="131" alt="[book cover]" />
  </a>
</div>
<p>
O'Reilly's recent publication <a href="http://oreilly.com/catalog/9780596157111/">Beautiful Data</a> has a chapter by <a href="http://jeffjonas.typepad.com/jeff_jonas/">Jeff Jonas</a> which is enough reason in itself for me to recommend it.  The chapter, <a href="http://jeffjonas.typepad.com/DataFindsDataFinal.pdf">Data Finds Data</a>, is also available as a PDF download.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
O'Reilly's recent publication &lt;a href="http://oreilly.com/catalog/9780596157111/"&gt;Beautiful Data&lt;/a&gt; has a chapter by &lt;a href="http://jeffjonas.typepad.com/jeff_jonas/"&gt;Jeff Jonas&lt;/a&gt; which is enough reason in itself for me to recommend it.  The chapter, &lt;a href="http://jeffjonas.typepad.com/DataFindsDataFinal.pdf"&gt;Data Finds Data&lt;/a&gt;, is also available as a PDF download.&#xD;
&lt;/p&gt;&#xD;
&lt;div class="floatRight"&gt;&#xD;
  &lt;a href="http://oreilly.com/catalog/9780596157111/" title="Click for book details"&gt;&#xD;
    &lt;img src="http://static.cybaea.net/images/beautiful-data-small.png" width="100" height="131" alt="[book cover]"&gt;&lt;/img&gt;&#xD;
  &lt;/a&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;p&gt;&#xD;
I met Jeff a couple of years ago at an ETech conference, and he is easily one of the smartest people I have ever met who is thinking about data.&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/nsxeLGyxKLM" height="1" width="1"/&gt;</content><published>2009-07-27T19:38:00Z</published><updated>2009-07-27T19:38:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Beautiful-Data.html</feedburner:origLink></entry><entry><title type="text">Massively parallel database for analytics</title><id>urn:uuid:a8ba9e43-b837-551c-bd02-a1a7b4506c41</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Massively-parallel-database-for-analytics.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/Apso1Get0Yk/Massively-parallel-database-for-analytics.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
This is by far the best description of why traditional parallel databases (like Teradata, Greenplum et al.) are an evolutionary dead end.  But much more than a theoretical discussion, the authors have built a solution which they call HadoopDB.  It is based on Hadoop, PostgreSQL, and Hive and is completely Open Source.  Alternative, column-based, backends to PostgreSQL are being implemented now.  Read: <a href="http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html">Announcing release of HadoopDB</a>.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
This is by far the best description of why traditional parallel databases (like Teradata, Greenplum et al.) are an evolutionary dead end.  But much more than a theoretical discussion, the authors have built a solution which they call HadoopDB.  It is based on Hadoop, PostgreSQL, and Hive and is completely Open Source.  Alternative, column-based, backends to PostgreSQL are being implemented now.  Read: &lt;a href="http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html"&gt;Announcing release of HadoopDB&lt;/a&gt;.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;See also:&lt;/p&gt;&#xD;
&lt;ul&gt;&#xD;
&lt;li&gt;&lt;a href="http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-shorter.html"&gt;Short version: key bullet points&lt;/a&gt;&lt;/li&gt;&#xD;
&lt;li&gt;&lt;a href="http://db.cs.yale.edu/hadoopdb/hadoopdb.pdf"&gt;Long version (12 pages, PDF)&lt;/a&gt;&lt;/li&gt;&#xD;
&lt;li&gt;&lt;a href="http://tech.slashdot.org/story/09/07/21/1747241/Researchers-Create-Database-Hadoop-Hybrid?from=rss"&gt;Slashdot discussion&lt;/a&gt;&lt;/li&gt;&#xD;
&lt;li&gt;&lt;a href="http://www.stats.bris.ac.uk/R/web/packages/HadoopStreaming/index.html"&gt;R package HadoopStreaming&lt;/a&gt;&lt;/li&gt;&#xD;
&lt;/ul&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Apso1Get0Yk:udpBUFdH-J4:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Apso1Get0Yk:udpBUFdH-J4:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=Apso1Get0Yk:udpBUFdH-J4:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Apso1Get0Yk:udpBUFdH-J4:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=Apso1Get0Yk:udpBUFdH-J4:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Apso1Get0Yk:udpBUFdH-J4:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Apso1Get0Yk:udpBUFdH-J4:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=Apso1Get0Yk:udpBUFdH-J4:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=Apso1Get0Yk:udpBUFdH-J4:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/Apso1Get0Yk" height="1" width="1"/&gt;</content><published>2009-07-22T13:37:00Z</published><updated>2009-07-22T13:37:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Massively-parallel-database-for-analytics.html</feedburner:origLink></entry><entry><title type="text">The Knapsack Problem</title><id>urn:uuid:6efce6d8-6489-5275-aa88-1ddce86d4e65</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/The-Knapsack-Problem.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/JCEN5oEfIRM/The-Knapsack-Problem.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
<a href="http://blog.revolution-computing.com/2009/07/because-its-friday-the-knapsack-problem.html">David posts a question</a> about how to solve <a href="http://xkcd.com/287/">this</a> <a href="http://en.wikipedia.org/wiki/Knapsack_problem">knapsack problem</a> using the <a href="http://www.r-project.org/">R statistical computing and analysis platform</a>.  My reply in the comments seems to have disappeared for a while, so here is my proposed solution.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;div class="floatCenter" style="width: 640px;"&gt;&#xD;
  &lt;a href="http://xkcd.com/287/"&gt;&#xD;
    &lt;img src="http://imgs.xkcd.com/comics/np_complete.png" width="640" height="414" alt="[Cartoon from XKCD]" title="NP-Complete"&gt;&lt;/img&gt;&#xD;
  &lt;/a&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;p&gt;&#xD;
&lt;a href="http://blog.revolution-computing.com/2009/07/because-its-friday-the-knapsack-problem.html"&gt;David posts a question&lt;/a&gt; about how to solve this &lt;a href="http://en.wikipedia.org/wiki/Knapsack_problem"&gt;knapsack problem &lt;/a&gt; using the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt;.  My reply in the comments seems to have disappeared for a while so here is my proposed solution.  See David’s blog for my earlier proposed solution with a very common error.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&#xD;
## http://blog.revolution-computing.com/2009/07/because-its-friday-the-knapsack-problem.html&#xD;
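appetizer.solution &amp;lt;- local (&#xD;
function (target) {&#xD;
  ## Menu prices from the cartoon&#xD;
  app &amp;lt;- c(2.15, 2.75, 3.35, 3.55, 4.20, 5.80)&#xD;
  r &amp;lt;- 2L   # number of items ordered; grows until a solution is found&#xD;
  repeat {&#xD;
	## All ways of choosing r prices, with repeats, one combination per row&#xD;
	combos &amp;lt;- gtools::combinations(length(app), r=r, v=app, repeats.allowed=TRUE)&#xD;
	s &amp;lt;- rowSums(combos)&#xD;
	## Every order of r items already costs too much: give up&#xD;
	if ( all(s &amp;gt; target) ) {&#xD;
	  print("No solution found")&#xD;
	  break&#xD;
	}&#xD;
	## Compare with a tolerance because the prices are floating point&#xD;
	x &amp;lt;- which( abs(s-target) &amp;lt; 1e-4 )&#xD;
	if ( length(x) &amp;gt; 0L ) {&#xD;
	  cat("Solution found: ", combos[x[1L],], "\n")&#xD;
	  break&#xD;
	}&#xD;
	r &amp;lt;- r + 1L&#xD;
  }&#xD;
})&#xD;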
&#xD;
appetizer.solution(15.05)&#xD;
# Solution found:  2.15 3.55 3.55 5.8 &#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Brute force works; it just doesn’t scale well.  (Note that 7×2.15 is another solution.)&#xD;
&lt;/p&gt;&#xD;
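&lt;p&gt;&#xD;
For comparison, here is a minimal sketch of a dynamic-programming approach, which is a different technique from the brute-force search above and is not taken from David’s post.  It assumes the prices are given in integer cents so that sums compare exactly; the function name &lt;code&gt;appetizer.dp&lt;/code&gt; and its arguments are made up for this illustration.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="code" title="appetizer.dp()"&gt;&#xD;
## Sketch only: appetizer.dp() and its arguments are hypothetical names.&#xD;
## Prices are in integer cents so that sums can be compared exactly.&#xD;
appetizer.dp &amp;lt;- function (target.cents, prices.cents) {&#xD;
  ## last[t + 1] is the price of the last item on one way to reach total t,&#xD;
  ## or NA if t cannot be reached; a total of 0 needs no items.&#xD;
  last &amp;lt;- rep(NA_integer_, target.cents + 1L)&#xD;
  last[1L] &amp;lt;- 0L&#xD;
  for ( t in seq_len(target.cents) ) {&#xD;
    ## Prices that fit, and of those the ones that leave a reachable total&#xD;
    usable &amp;lt;- prices.cents[prices.cents &amp;lt;= t]&#xD;
    hit &amp;lt;- usable[!is.na(last[t - usable + 1L])]&#xD;
    if ( length(hit) &amp;gt; 0L ) last[t + 1L] &amp;lt;- hit[1L]&#xD;
  }&#xD;
  if ( is.na(last[target.cents + 1L]) ) return(NULL)   # no exact order exists&#xD;
  ## Walk back from the target to recover one combination&#xD;
  items &amp;lt;- integer(0)&#xD;
  t &amp;lt;- target.cents&#xD;
  while ( t &amp;gt; 0L ) {&#xD;
    items &amp;lt;- c(items, last[t + 1L])&#xD;
    t &amp;lt;- t - last[t + 1L]&#xD;
  }&#xD;
  items / 100&#xD;
}&#xD;
&#xD;
appetizer.dp(1505L, c(215L, 275L, 335L, 355L, 420L, 580L))&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The work grows with the target in cents times the number of prices rather than with the number of combinations.  The sketch returns only one combination; with the prices in ascending order it should pick the order of seven items at 2.15.&#xD;
&lt;/p&gt;&#xD;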
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=JCEN5oEfIRM:esvQ6McvdMU:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=JCEN5oEfIRM:esvQ6McvdMU:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=JCEN5oEfIRM:esvQ6McvdMU:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=JCEN5oEfIRM:esvQ6McvdMU:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=JCEN5oEfIRM:esvQ6McvdMU:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=JCEN5oEfIRM:esvQ6McvdMU:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=JCEN5oEfIRM:esvQ6McvdMU:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=JCEN5oEfIRM:esvQ6McvdMU:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=JCEN5oEfIRM:esvQ6McvdMU:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/JCEN5oEfIRM" height="1" width="1"/&gt;</content><published>2009-07-10T20:30:00Z</published><updated>2009-07-10T20:30:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/The-Knapsack-Problem.html</feedburner:origLink></entry><entry><title type="text">OECD Statistics</title><id>urn:uuid:43e585f9-9c60-505d-b349-b65d1a20c969</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/OECD-Statistics.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/5fn__mpTK8o/OECD-Statistics.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
I am a sucker for good quality data.  I <a href="http://www.cybaea.net/Blogs/Data/Data_gov.html">wrote about data.gov</a>, the US Government data site, before, and now I find <a href="http://stats.oecd.org/">OECD Statistics</a>, which has some 300 data sets, many of which seem to be readily accessible (though some may require a subscription).
</p>
</div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
I am a sucker for good quality data.  I &lt;a href="http://www.cybaea.net/Blogs/Data/Data_gov.html"&gt;wrote about data.gov&lt;/a&gt;, the US Government data site, before, and now I find &lt;a href="http://stats.oecd.org/"&gt;OECD Statistics&lt;/a&gt;, which has some 300 data sets, many of which seem to be readily accessible (though some may require a subscription).&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Exports in multiple formats, including Excel, CSV, and &lt;a href="http://sdmx.org/"&gt;SDMX&lt;/a&gt;.&#xD;
&lt;/p&gt;&#xD;
&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=5fn__mpTK8o:hTjwnwI_7oI:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=5fn__mpTK8o:hTjwnwI_7oI:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=5fn__mpTK8o:hTjwnwI_7oI:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=5fn__mpTK8o:hTjwnwI_7oI:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=5fn__mpTK8o:hTjwnwI_7oI:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=5fn__mpTK8o:hTjwnwI_7oI:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=5fn__mpTK8o:hTjwnwI_7oI:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=5fn__mpTK8o:hTjwnwI_7oI:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=5fn__mpTK8o:hTjwnwI_7oI:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/5fn__mpTK8o" height="1" width="1"/&gt;</content><published>2009-07-02T20:33:00Z</published><updated>2009-07-02T20:33:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/OECD-Statistics.html</feedburner:origLink></entry><entry><title type="text">R tips: Determine if function is called from specific package</title><id>urn:uuid:08988ce4-0f96-564e-9575-7c6f2ff16147</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-tips-Determine-if-function-is-called-from-specific-package.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/mqFcMzo8FLQ/R-tips-Determine-if-function-is-called-from-specific-package.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
I like the "multicore" library for a particular task.  I can easily write an <code>if(require("multicore",...))</code> test so that my function automatically uses the parallel <code>mclapply()</code> instead of <code>lapply()</code> where it is available.  That is grand 99% of the time, except when my function is itself called from <code>mclapply()</code> (or one of the lower-level functions), in which case much CPU thrashing and grinding of teeth will result.
</p>
<p>
So, I needed a function to determine if my function was called from any function in the "multicore" library.  Here it is.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
I like the "multicore" library for a particular task.  I can easily write an &lt;code&gt;if(require("multicore",...))&lt;/code&gt; test so that my function automatically uses the parallel &lt;code&gt;mclapply()&lt;/code&gt; instead of &lt;code&gt;lapply()&lt;/code&gt; where it is available.  That is grand 99% of the time, except when my function is itself called from &lt;code&gt;mclapply()&lt;/code&gt; (or one of the lower-level functions), in which case much CPU thrashing and grinding of teeth will result.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
So, I needed a function to determine if my function was called from any function in the "multicore" library.  Here it is.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
First define a generally useful function:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="code" title="is.in.namespace()"&gt;&#xD;
is.in.namespace &amp;lt;-&#xD;
function (ns) {&#xD;
  ## Walk the call stack from the outermost frame to this one&#xD;
  for ( frame in seq_len(sys.nframe()) ) {&#xD;
	fun &amp;lt;- sys.function(frame)&#xD;
	env &amp;lt;- environment(fun)&#xD;
	n   &amp;lt;- environmentName(env)   # e.g. "multicore" for a package namespace&#xD;
	if ( identical(n, ns) ) return(TRUE)&#xD;
  }&#xD;
  return(FALSE)&#xD;
}&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Then we use it for our purpose:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&#xD;
is.in.multicore &amp;lt;- function (...) { return(is.in.namespace("multicore")) }&#xD;
library("multicore")&#xD;
stopifnot( mclapply(as.list(1), is.in.multicore)[[1]] == TRUE )&#xD;
stopifnot(   lapply(as.list(1), is.in.multicore)[[1]] == FALSE )&#xD;
stopifnot( local( {mclapply &amp;lt;- function(x) return(x); mclapply(is.in.multicore())} ) == FALSE )&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Easy when you know how.&#xD;
&lt;/p&gt;&#xD;
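&lt;p&gt;&#xD;
To close the loop on the original motivation, the following is a rough sketch of how such a check might guard the choice between &lt;code&gt;lapply()&lt;/code&gt; and &lt;code&gt;mclapply()&lt;/code&gt;.  The names &lt;code&gt;called.from.namespace&lt;/code&gt;, &lt;code&gt;safe.lapply&lt;/code&gt;, and the &lt;code&gt;pkg&lt;/code&gt; argument are invented for this example, and it assumes the package named in &lt;code&gt;pkg&lt;/code&gt; exports a function called &lt;code&gt;mclapply&lt;/code&gt;.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="code" title="safe.lapply()"&gt;&#xD;
## Sketch only: all names defined here are hypothetical.&#xD;
## Same idea as is.in.namespace() above, repeated to keep the example self-contained.&#xD;
called.from.namespace &amp;lt;- function (ns) {&#xD;
  for ( frame in seq_len(sys.nframe()) ) {&#xD;
    fun &amp;lt;- sys.function(frame)&#xD;
    if ( identical(environmentName(environment(fun)), ns) ) return(TRUE)&#xD;
  }&#xD;
  FALSE&#xD;
}&#xD;
&#xD;
## Use the parallel version only when the package is installed and we are&#xD;
## not already running inside one of its functions.&#xD;
safe.lapply &amp;lt;- function (X, FUN, ..., pkg = "multicore") {&#xD;
  use.parallel &amp;lt;- FALSE&#xD;
  if ( !called.from.namespace(pkg) ) {&#xD;
    use.parallel &amp;lt;- requireNamespace(pkg, quietly = TRUE)&#xD;
  }&#xD;
  if ( use.parallel ) {&#xD;
    getExportedValue(pkg, "mclapply")(X, FUN, ...)&#xD;
  } else {&#xD;
    lapply(X, FUN, ...)&#xD;
  }&#xD;
}&#xD;
&#xD;
safe.lapply(1:3, function (i) i^2)&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
If the package is not installed, the sketch quietly falls back to plain &lt;code&gt;lapply()&lt;/code&gt;, so the result is the same list either way.&#xD;
&lt;/p&gt;&#xD;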
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mqFcMzo8FLQ:I9l-WH946Nc:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mqFcMzo8FLQ:I9l-WH946Nc:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=mqFcMzo8FLQ:I9l-WH946Nc:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mqFcMzo8FLQ:I9l-WH946Nc:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=mqFcMzo8FLQ:I9l-WH946Nc:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mqFcMzo8FLQ:I9l-WH946Nc:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mqFcMzo8FLQ:I9l-WH946Nc:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=mqFcMzo8FLQ:I9l-WH946Nc:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=mqFcMzo8FLQ:I9l-WH946Nc:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/mqFcMzo8FLQ" height="1" width="1"/&gt;</content><published>2009-06-16T10:27:00Z</published><updated>2009-06-16T10:27:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-tips-Determine-if-function-is-called-from-specific-package.html</feedburner:origLink></entry><entry><title type="text">R tips: Installing Rmpi on Fedora Linux</title><id>urn:uuid:57259815-f049-5226-bda6-95b15ae0f4f2</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-tips-Installing-Rmpi-on-Fedora-Linux.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/z6AZUNX1s3Y/R-tips-Installing-Rmpi-on-Fedora-Linux.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>Somebody on the R-help mailing list asked how to get <a href="http://cran.r-project.org/web/packages/Rmpi/index.html">Rmpi</a> working on his Fedora Linux machine so he could do high-performance computing on a cluster of machines (or a single multicore machine) using the <a href="http://www.r-project.org/">R statistical computing and analysis platform</a>.  Since it is unusually painful to get working, I might as well copy the instructions here.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;Somebody on the R-help mailing list asked how to get &lt;a href="http://cran.r-project.org/web/packages/Rmpi/index.html"&gt;Rmpi&lt;/a&gt; working on his Fedora Linux machine so he could do high-performance computing on a cluster of machines (or a single multicore machine) using the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt;.  Since it is unusually painful to get working, I might as well copy the instructions here.&#xD;
&lt;/p&gt;&#xD;
&lt;h2&gt;1. Install Open MPI on Fedora Core&lt;/h2&gt;&#xD;
&lt;p&gt;&#xD;
First install the &lt;a href="http://www.open-mpi.org/"&gt;openmpi&lt;/a&gt; libraries using:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;yum install openmpi openmpi-devel openmpi-libs&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The default installation on Fedora still doesn’t &lt;i&gt;quite&lt;/i&gt; work, so you need to execute the following command as root (you only need to do this once, after installing the package):&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;ldconfig /usr/lib64/openmpi/lib/&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
You are not quite done: for R to work right with the libraries, you need to modify the &lt;code&gt;LD_LIBRARY_PATH&lt;/code&gt; environment variable to include the path to the Open MPI libraries.  I have the following in my &lt;code&gt;~/.bash_profile&lt;/code&gt;:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title=".bash_profile"&gt;export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}${LD_LIBRARY_PATH:+:}/usr/lib64/openmpi/lib/"&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Add the same line to your file, run it once at the command prompt, and you are ready to continue.&#xD;
&lt;/p&gt;&#xD;
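&lt;p&gt;&#xD;
If you want to confirm from inside R that the variable is visible to the session, something along these lines may help.  It is only a sketch: &lt;code&gt;has.library.path&lt;/code&gt; is a hypothetical helper, and the directory is the one assumed in the steps above.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="code" title="has.library.path()"&gt;&#xD;
## Sketch only: has.library.path() is a hypothetical helper.&#xD;
## TRUE if 'dir' is one of the colon-separated entries in 'value'.&#xD;
has.library.path &amp;lt;- function (dir, value = Sys.getenv("LD_LIBRARY_PATH")) {&#xD;
  entries &amp;lt;- strsplit(value, ":", fixed = TRUE)[[1]]&#xD;
  ## Ignore trailing slashes so "/a/b/" and "/a/b" compare equal&#xD;
  strip &amp;lt;- function (p) sub("/+$", "", p)&#xD;
  strip(dir) %in% strip(entries)&#xD;
}&#xD;
&#xD;
has.library.path("/usr/lib64/openmpi/lib/")&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
A result of &lt;code&gt;FALSE&lt;/code&gt; suggests that R was started from a shell that has not yet picked up the new setting.&#xD;
&lt;/p&gt;&#xD;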
&#xD;
&lt;h2&gt;2. Install the &lt;code&gt;Rmpi&lt;/code&gt; package for &lt;code&gt;R&lt;/code&gt;&lt;/h2&gt;&#xD;
&lt;p&gt;&#xD;
Now that your Open MPI libraries are set up, what you do next depends on which version of &lt;code&gt;Rmpi&lt;/code&gt; you are installing.  Most likely you are installing the latest version, in which case the following section applies.  The instructions for older versions are retained in a later section for reference.&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;h3&gt;2.1. Current versions of the &lt;code&gt;Rmpi&lt;/code&gt; package&lt;/h3&gt;&#xD;
&lt;p&gt;&#xD;
Make sure you have executed the &lt;code&gt;ldconfig&lt;/code&gt; command and set the &lt;code&gt;LD_LIBRARY_PATH&lt;/code&gt; environment variable as described in the previous section before you continue.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Since at least version 0.5-8 of the &lt;code&gt;Rmpi&lt;/code&gt; library you can install it from the &lt;code&gt;R&lt;/code&gt; command line after you have fixed the Open MPI install.  At the &lt;code&gt;R&lt;/code&gt; prompt do:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;install.packages("Rmpi",&#xD;
                 configure.args =&#xD;
                 c("--with-Rmpi-include=/usr/include/openmpi-x86_64/",&#xD;
                   "--with-Rmpi-libpath=/usr/lib64/openmpi/lib/",&#xD;
                   "--with-Rmpi-type=OPENMPI"))&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
It should work and install OK.  This is obviously quite a mouthful to remember, but help is at hand through the &lt;code&gt;options()&lt;/code&gt; mechanism in R.  In your &lt;code&gt;~/.Rprofile&lt;/code&gt; you can add something like:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title=".Rprofile"&gt;local({&#xD;
    my.configure.args &amp;lt;-&#xD;
        list("Rmpi" =&#xD;
             c("--with-Rmpi-include=/usr/include/openmpi-x86_64/",&#xD;
               "--with-Rmpi-libpath=/usr/lib64/openmpi/lib/",&#xD;
               "--with-Rmpi-type=OPENMPI"),&#xD;
             ## Not needed for Rmpi but shown to illustrate the format&#xD;
             "ncdf" =&#xD;
             c("-with-netcdf_incdir=/usr/include/netcdf",&#xD;
               "-with-netcdf_libdir=/usr/lib64/")&#xD;
             );&#xD;
    options("configure.args" = my.configure.args)&#xD;
})&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;Then you can just type &lt;code&gt;install.packages("Rmpi")&lt;/code&gt; at the R command prompt to install the package.&#xD;
&lt;/p&gt;&#xD;
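&lt;p&gt;&#xD;
Since the include and library directories differ between Fedora releases and architectures, it may be worth checking that the directories named in the option actually exist before installing.  This is a sketch under the assumption that &lt;code&gt;configure.args&lt;/code&gt; has been set as a named list as shown above; &lt;code&gt;configure.dirs.exist&lt;/code&gt; is a hypothetical helper.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="code" title="configure.dirs.exist()"&gt;&#xD;
## Sketch only: configure.dirs.exist() is a hypothetical helper.&#xD;
## For arguments of the form "--with-...=/some/dir", report whether the&#xD;
## directory exists; arguments without an absolute path are skipped.&#xD;
configure.dirs.exist &amp;lt;- function (args) {&#xD;
  is.dir.arg &amp;lt;- grepl("=/", args, fixed = TRUE)&#xD;
  dirs &amp;lt;- sub("^[^=]*=", "", args[is.dir.arg])&#xD;
  ok &amp;lt;- file.exists(dirs)&#xD;
  names(ok) &amp;lt;- dirs&#xD;
  ok&#xD;
}&#xD;
&#xD;
configure.dirs.exist(getOption("configure.args")[["Rmpi"]])&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Any &lt;code&gt;FALSE&lt;/code&gt; entry points at a path to correct in &lt;code&gt;~/.Rprofile&lt;/code&gt; before calling &lt;code&gt;install.packages()&lt;/code&gt;.&#xD;
&lt;/p&gt;&#xD;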
&#xD;
&lt;h3&gt;2.2. Older versions of the &lt;code&gt;Rmpi&lt;/code&gt; package&lt;/h3&gt;&#xD;
&lt;p&gt;&#xD;
The problem is the configuration file &lt;code&gt;configure.ac&lt;/code&gt; which is, unfortunately, completely brain-damaged with hard-coded assumptions about which subdirectories should contain header and library files and no way of overriding it.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Download the latest &lt;a href="http://cran.r-project.org/web/packages/Rmpi/index.html"&gt;Rmpi&lt;/a&gt; package from CRAN and unpack it using &lt;code&gt;tar zxvf Rmpi_0.5-7.tar.gz&lt;/code&gt;.  Go to the new &lt;code&gt;Rmpi&lt;/code&gt; directory and replace the file &lt;code&gt;configure.ac&lt;/code&gt; with the one below (for an x86_64 system; for 32-bit you probably need to change &lt;code&gt;-64&lt;/code&gt; to &lt;code&gt;-32&lt;/code&gt;):&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="configure.ac"&gt; Process this file with autoconf to produce a configure script.&#xD;
&#xD;
AC_INIT(DESCRIPTION)&#xD;
&#xD;
AC_PROG_CC&#xD;
&#xD;
MPI_LIBS=`pkg-config --libs openmpi-1.3.1-gcc-64`&#xD;
MPI_INCLUDE=`pkg-config --cflags openmpi-1.3.1-gcc-64`&#xD;
MPITYPE="OPENMPI"&#xD;
MPI_DEPS="-DMPI2"&#xD;
&#xD;
AC_CHECK_LIB(util, openpty, [ MPI_LIBS="$MPI_LIBS -lutil" ])&#xD;
AC_CHECK_LIB(pthread, main, [ MPI_LIBS="$MPI_LIBS -lpthread" ])&#xD;
&#xD;
PKG_LIBS="${MPI_LIBS} -fPIC"&#xD;
PKG_CPPFLAGS="${MPI_INCLUDE} ${MPI_DEPS} -D${MPITYPE} -fPIC"&#xD;
&#xD;
AC_SUBST(PKG_LIBS)&#xD;
AC_SUBST(PKG_CPPFLAGS)&#xD;
AC_SUBST(DEFS)&#xD;
&#xD;
AC_OUTPUT(src/Makevars) &#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The number 1.3.1 may change in future releases of Fedora: see &lt;code&gt;/usr/lib64/pkgconfig/openmpi-*.pc&lt;/code&gt; for the current value.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Still in the &lt;code&gt;Rmpi&lt;/code&gt; directory do the following in your shell:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;autoconf&#xD;
cd ..&#xD;
tar zcvf Rmpi_0.5-7-F11.tar.gz Rmpi&#xD;
R CMD INSTALL Rmpi_0.5-7-F11.tar.gz &#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;h2&gt;3. Test it&lt;/h2&gt;&#xD;
&#xD;
&lt;p&gt;Now &lt;code&gt;Rmpi&lt;/code&gt; should be working in R:&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; library("Rmpi")&#xD;
&amp;gt; mpi.spawn.Rslaves(nslaves=2)&#xD;
    2 slaves are spawned successfully. 0 failed.&#xD;
master (rank 0, comm 1) of size 3 is running on: server&#xD;
slave1 (rank 1, comm 1) of size 3 is running on: server&#xD;
slave2 (rank 2, comm 1) of size 3 is running on: server&#xD;
&amp;gt; x &amp;lt;- c(10,20)&#xD;
&amp;gt; mpi.apply(x,runif)&#xD;
[[1]]&#xD;
 [1] 0.25142616 0.93505554 0.03162852 0.71783194 0.35916139 0.85082154&#xD;
 [7] 0.35404191 0.14221315 0.60063773 0.71805190&#xD;
&#xD;
[[2]]&#xD;
 [1] 0.84157864 0.63481773 0.38217188 0.67839089 0.27827728 0.35429266&#xD;
 [7] 0.04898744 0.96601584 0.25687905 0.77381186 0.69011927 0.37391028&#xD;
[13] 0.19017369 0.51196594 0.51970563 0.15791524 0.21358237 0.69642478&#xD;
[19] 0.12690207 0.44177656&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=z6AZUNX1s3Y:pAonRiKpcZg:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=z6AZUNX1s3Y:pAonRiKpcZg:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=z6AZUNX1s3Y:pAonRiKpcZg:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=z6AZUNX1s3Y:pAonRiKpcZg:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=z6AZUNX1s3Y:pAonRiKpcZg:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=z6AZUNX1s3Y:pAonRiKpcZg:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=z6AZUNX1s3Y:pAonRiKpcZg:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=z6AZUNX1s3Y:pAonRiKpcZg:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=z6AZUNX1s3Y:pAonRiKpcZg:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/z6AZUNX1s3Y" height="1" width="1"/&gt;</content><published>2009-06-12T10:23:00Z</published><updated>2009-06-12T10:23:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-tips-Installing-Rmpi-on-Fedora-Linux.html</feedburner:origLink></entry><entry><title type="text">Data Mashups in R from O'Reilly</title><id>urn:uuid:edb63dc9-21f9-5664-8b35-afb01d7d6472</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Data-Mashups-in-R-from-O_Reilly.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/QV_4TAhfmFU/Data-Mashups-in-R-from-O_Reilly.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><div class="floatRight">
<a href="http://www.cybaea.net/Blogs/Data/Data-Mashups-in-R-from-O_Reilly.html" title="Click for full article"><img src="http://static.cybaea.net/images/fc_heat_small.png" width="150" height="150" alt="[Philadelphia County July 2009 Foreclosure Heat Map]" /></a>
</div>
<p>
O’Reilly has published <a href="http://oreilly.com/catalog/9780596804770/" title="Data Mashups in R ">Data Mashups in R</a> as a $4.99 PDF download in their Short Cut series.  In 27 pages it takes you through an example of how to combine foreclosure information with maps and geographical information to produce plots like the one here.  This is all done with the <a href="http://www.r-project.org/">R statistical computing and analysis platform</a>.
</p>
</div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
O’Reilly has published &lt;a href="http://oreilly.com/catalog/9780596804770/" title="Data Mashups in R "&gt;Data Mashups in R&lt;/a&gt; as a $4.99 PDF download in their Short Cut series.  In 27 pages it takes you through an example of how to combine foreclosure information with maps and geographical information to produce plots like the one below.  This is all done with the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt;.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
&lt;a href="http://static.cybaea.net/images/fc_heat.png" title="Larger version of Philadelphia County July 2009 Foreclosure Heat Map"&gt;&lt;img src="http://static.cybaea.net/images/fc_heat_medium.png" width="400" height="400" alt="[Philadelphia County July 2009 Foreclosure Heat Map]"&gt;&lt;/img&gt;&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
They show how to:&#xD;
&lt;/p&gt;&#xD;
&lt;ul&gt;&#xD;
&lt;li&gt;Use regular expressions to parse HTML files&lt;/li&gt;&#xD;
&lt;li&gt;Use the &lt;a href="http://cran.r-project.org/web/packages/XML/index.html"&gt;XML&lt;/a&gt; package to parse XML data from a web service (&lt;a href="http://developer.yahoo.com/maps/rest/V1/geocode.html"&gt;Yahoo! Geocode&lt;/a&gt;)&lt;/li&gt;&#xD;
&lt;li&gt;Find ESRI shape files for your maps&lt;/li&gt;&#xD;
&lt;li&gt;Use &lt;a href="http://cran.r-project.org/web/packages/PBSmapping/index.html"&gt;PBSmapping&lt;/a&gt; to process and display geographical data (GIS)&lt;/li&gt;&#xD;
&lt;li&gt;Import and use US Census data with your maps&lt;/li&gt;&#xD;
&lt;/ul&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=QV_4TAhfmFU:sCLEIdyPB64:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=QV_4TAhfmFU:sCLEIdyPB64:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=QV_4TAhfmFU:sCLEIdyPB64:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=QV_4TAhfmFU:sCLEIdyPB64:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=QV_4TAhfmFU:sCLEIdyPB64:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=QV_4TAhfmFU:sCLEIdyPB64:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=QV_4TAhfmFU:sCLEIdyPB64:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=QV_4TAhfmFU:sCLEIdyPB64:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=QV_4TAhfmFU:sCLEIdyPB64:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/QV_4TAhfmFU" height="1" width="1"/&gt;</content><published>2009-06-09T11:23:00Z</published><updated>2009-06-09T11:23:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Data-Mashups-in-R-from-O_Reilly.html</feedburner:origLink></entry><entry><title type="text">How to win the KDD Cup Challenge with R and gbm</title><id>urn:uuid:3fb3545e-ea30-5e3b-8f8b-8902f107b81d</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/How-to-win-the-KDD-Cup-Challenge-with-R-and-gbm.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/iWBVzSGe3Aw/How-to-win-the-KDD-Cup-Challenge-with-R-and-gbm.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
Hugh Miller, the team leader of the winner of the KDD Cup 2009 Slow Challenge (which we wrote about <a href="http://www.cybaea.net/Blogs/Data/R-used-by-KDD-2009-cup-winner-of-slow-challenge.html">recently</a>) kindly provides more information about how to win this public challenge using the <a href="http://www.r-project.org/">R statistical computing and analysis platform</a> on a laptop (!).
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
Hugh Miller, the team leader of the winner of the KDD Cup 2009 Slow Challenge (which we wrote about &lt;a href="http://www.cybaea.net/Blogs/Data/R-used-by-KDD-2009-cup-winner-of-slow-challenge.html"&gt;recently&lt;/a&gt;) kindly provides more information about how to win this public challenge using the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt; on a laptop (!).&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
As a reminder of &lt;a href="http://www.cybaea.net/Blogs/Journal/KDD-Cup-2009.html"&gt;what we wrote before&lt;/a&gt;, the challenge provided two anonymized data sets, each covering 50,000 mobile telecom customers, with each entry having 15,000 variables.  The task was to find the best churn, up-sell, and cross-sell models.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Hugh summarizes his team’s approach:&#xD;
&lt;/p&gt;&#xD;
&lt;blockquote&gt;&#xD;
&lt;p&gt;&#xD;
Feature selection was an important first step [we &lt;a href="http://www.cybaea.net/Blogs/Data/R-used-by-KDD-2009-cup-winner-of-slow-challenge.html"&gt;mentioned before&lt;/a&gt; that this is key for all successful data mining projects – AE]. We looked at how effective each individual variable was as a predictor, which also allowed us to read parts of the data only, &lt;em&gt;as the whole dataset didn’t fit in memory&lt;/em&gt; [my emphasis – AE]. The assessment here was homebrew, making a simple predictor on half the data and measuring performance (by the AUC measure) on the other half:&#xD;
&lt;/p&gt;&#xD;
&lt;ul&gt;&#xD;
&lt;li&gt;For categorical variables we just took the average number of 1's in the response for each category and used this as a predictor&lt;/li&gt;&#xD;
&lt;li&gt;For continuous variables we split the variable up into "bins", as you would a histogram, and again took the average number of 1's in the response for each bin as the predictor.&lt;/li&gt;&#xD;
&lt;/ul&gt;&#xD;
&lt;p&gt;&#xD;
From this we came up with a set of about 200 variables for each model, which we continued to tinker with. The main model was a gradient boosted machine which used the "&lt;a href="http://www.stats.bris.ac.uk/R/web/packages/gbm/index.html"&gt;gbm&lt;/a&gt;" package in &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;. This basically fits a series of small decision trees, up-weighting the observations that are predicted poorly at each iteration. We used Bernoulli loss and also up-weighted the "1" response class. A fair amount of time was spent optimising the number of trees, how big they should be etc, but a fit of 5,000 trees only took a bit over an hour to fit. The package itself is quite powerful as it gives some useful diagnostics such as relative variable importance, allowing us to exclude some and include others.&#xD;
&lt;/p&gt;&lt;p&gt;&#xD;
We used trees to avoid doing much data cleaning – they automatically allow for extreme results, non-linearity, missing values and handle both categorical and continuous variables. The main adjustment we had to make was to aggregate the smaller categories in the categorical variables, as they tended to distort the fits.&#xD;
&lt;/p&gt;&#xD;
&lt;/blockquote&gt;&#xD;
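&lt;p&gt;&#xD;
As a rough illustration of the single-variable screening Hugh describes, here is a minimal sketch in base R.  It is our own reconstruction, not the team’s code: the helper names &lt;code&gt;auc&lt;/code&gt; and &lt;code&gt;screen_var&lt;/code&gt;, the choice of ten equal-width bins, and the simulated data are all assumptions made for the example.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; ### Rank-based AUC (the Mann-Whitney statistic); y must be coded 0/1&#xD;
&amp;gt; auc &amp;lt;- function(score, y) {&#xD;
+     n1 &amp;lt;- as.numeric(sum(y == 1))&#xD;
+     n0 &amp;lt;- as.numeric(sum(y == 0))&#xD;
+     (sum(rank(score)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)&#xD;
+ }&#xD;
&amp;gt; ### Mean response per category (or bin) on the training half,&#xD;
&amp;gt; ### scored by AUC on the held-out half&#xD;
&amp;gt; screen_var &amp;lt;- function(x, y, train, bins = 10) {&#xD;
+     f &amp;lt;- addNA(if (is.numeric(x)) cut(x, breaks = bins) else factor(x))&#xD;
+     rate &amp;lt;- tapply(y[train], f[train], mean)&#xD;
+     pred &amp;lt;- rate[as.integer(f[!train])]&#xD;
+     pred[is.na(pred)] &amp;lt;- mean(y[train])  # category not seen in training&#xD;
+     auc(pred, y[!train])&#xD;
+ }&#xD;
&amp;gt; ### Simulated data: one informative variable and one pure noise variable&#xD;
&amp;gt; set.seed(1)&#xD;
&amp;gt; n &amp;lt;- 2000&#xD;
&amp;gt; signal &amp;lt;- rnorm(n); noise &amp;lt;- rnorm(n)&#xD;
&amp;gt; y &amp;lt;- rbinom(n, 1, plogis(2 * signal))&#xD;
&amp;gt; train &amp;lt;- seq_len(n) %% 2 == 0&#xD;
&amp;gt; c(signal = screen_var(signal, y, train), noise = screen_var(noise, y, train))&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
A variable that carries signal should score well above 0.5 and pure noise should score close to 0.5, so ranking every candidate variable by this score gives a shortlist.  Because each variable is scored on its own, only one column needs to be in memory at a time.&#xD;
&lt;/p&gt;&#xD;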
&lt;p&gt;&#xD;
The team did this on standard Windows laptops (Intel Core 2 Duo 2.66GHz processor, 2GB RAM, 120GB hard drive) against a competition that had more computing clusters available than Imelda Marcos had shoes.  It is not what you’ve got, it’s how you use it &lt;tt&gt;:-)&lt;/tt&gt;.&#xD;
&lt;/p&gt;&#xD;
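&lt;p&gt;&#xD;
For readers who have not used the &lt;code&gt;gbm&lt;/code&gt; package, a fit along the lines Hugh describes might be set up as below.  This is only a sketch on simulated data: the class weight of 3, the tree depth, the shrinkage, and the number of trees are placeholder values of ours, not the settings the team used.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; library("gbm")&#xD;
&amp;gt; set.seed(42)&#xD;
&amp;gt; n &amp;lt;- 1000&#xD;
&amp;gt; d &amp;lt;- data.frame(x1 = rnorm(n), x2 = rnorm(n),&#xD;
+                 x3 = factor(sample(letters[1:4], n, replace = TRUE)))&#xD;
&amp;gt; d$y &amp;lt;- rbinom(n, 1, plogis(-2 + 1.5 * d$x1))&#xD;
&amp;gt; ### Up-weight the rarer "1" response class (the factor 3 is arbitrary)&#xD;
&amp;gt; w &amp;lt;- ifelse(d$y == 1, 3, 1)&#xD;
&amp;gt; fit &amp;lt;- gbm(y ~ x1 + x2 + x3, data = d, weights = w,&#xD;
+            distribution = "bernoulli", n.trees = 500,&#xD;
+            interaction.depth = 2, shrinkage = 0.05)&#xD;
&amp;gt; ### Out-of-bag estimate of a sensible number of trees&#xD;
&amp;gt; best &amp;lt;- gbm.perf(fit, method = "OOB", plot.it = FALSE)&#xD;
&amp;gt; ### Relative variable importance, as mentioned in the quote&#xD;
&amp;gt; summary(fit, n.trees = best, plotit = FALSE)&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
In this toy example only &lt;code&gt;x1&lt;/code&gt; drives the response, so it should dominate the importance table; variables with negligible importance are candidates for exclusion.&#xD;
&lt;/p&gt;&#xD;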
&lt;p&gt;&#xD;
Congratulations to Hugh and his team!&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;Jump to &lt;a href="http://www.cybaea.net/Blogs/Data/How-to-win-the-KDD-Cup-Challenge-with-R-and-gbm.html#h2_entry_comments"&gt;comments&lt;/a&gt;.&lt;/p&gt;&#xD;
&lt;div class="seealso"&gt;&#xD;
&lt;h1&gt;You may also like these posts:&lt;/h1&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-50.png" width="85" height="16" alt="[0.47]" title="[0.47]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-used-by-KDD-2009-cup-winner-of-slow-challenge.html" title="The results from the KDD Cup 2009 challenge (which we wrote about before) are in, and the winner of the slow challenge used the R statistical computing and analysis platform for their winning submission."&gt;R used by KDD 2009 cup winner of slow challenge&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;The results from the KDD Cup 2009 challenge (which we wrote about before) are in, and the winner of the slow challenge used the R statistical computing and analysis platform for their winning submission.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=iWBVzSGe3Aw:qxEUGcIYUEk:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=iWBVzSGe3Aw:qxEUGcIYUEk:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=iWBVzSGe3Aw:qxEUGcIYUEk:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=iWBVzSGe3Aw:qxEUGcIYUEk:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=iWBVzSGe3Aw:qxEUGcIYUEk:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=iWBVzSGe3Aw:qxEUGcIYUEk:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=iWBVzSGe3Aw:qxEUGcIYUEk:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=iWBVzSGe3Aw:qxEUGcIYUEk:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=iWBVzSGe3Aw:qxEUGcIYUEk:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/iWBVzSGe3Aw" height="1" width="1"/&gt;</content><published>2009-06-01T07:07:00Z</published><updated>2009-06-01T07:07:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/How-to-win-the-KDD-Cup-Challenge-with-R-and-gbm.html</feedburner:origLink></entry><entry><title type="text">R used by KDD 2009 cup winner of slow challenge</title><id>urn:uuid:23be031b-ddb6-5244-ab24-77042c61951c</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-used-by-KDD-2009-cup-winner-of-slow-challenge.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/OqKxuXq79pQ/R-used-by-KDD-2009-cup-winner-of-slow-challenge.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
The results from the <a href="http://www.cybaea.net/Blogs/Journal/KDD-Cup-2009.html">KDD Cup 2009 challenge</a> (which we wrote about before) are in, and the winner of the slow challenge used the <a href="http://www.r-project.org">R statistical computing and analysis platform</a> for their winning submission.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
The results from the &lt;a href="http://www.cybaea.net/Blogs/Journal/KDD-Cup-2009.html"&gt;KDD Cup 2009 challenge&lt;/a&gt; (which we wrote about before) are in, and the winner of the slow challenge used the &lt;a href="http://www.r-project.org"&gt;R statistical computing and analysis platform&lt;/a&gt; for their winning submission.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
The &lt;a href="http://www.kddcup-orange.com/factsheet.php?id=21"&gt;write up&lt;/a&gt; (username/password may be required) from &lt;a href="http://www.ms.unimelb.edu.au/Personnel/profile.php?PC_id=590"&gt;Hugh Miller&lt;/a&gt; and team at the University of Melbourne includes these points:&#xD;
&lt;/p&gt;&#xD;
&lt;ul&gt;&#xD;
&lt;li&gt;Decision tree, stub, or Random Forest as base classifiers with Logistic loss or cross-entropy loss function&lt;/li&gt;&#xD;
&lt;li&gt;Models fit in an hour or so&lt;/li&gt;&#xD;
&lt;li&gt;Used the &lt;a href="http://www.r-project.org"&gt;R statistical package&lt;/a&gt;&lt;/li&gt;&#xD;
&lt;li&gt;Most of the models were run on a Windows laptop with an Intel Core 2 Duo 2.66GHz processor, 2GB RAM, and a 120GB hard drive.&lt;/li&gt;&#xD;
&lt;/ul&gt;&#xD;
&lt;p&gt;&#xD;
Impressive hardware selection!  Well done R.  Weka was another popular tool among the top entrants.  Key for all of them was clever data preparation and variable substitution.  The fast track winners from IBM document this in some detail:&#xD;
&lt;/p&gt;&#xD;
&lt;blockquote&gt;&#xD;
&lt;p&gt;&#xD;
We normalized the numerical variables by range, keeping the sparsity. For the categorical variables, we coded them using at most 11 binary columns for each variable. For each categorical variable, we generated a binary feature for each of the ten most common values, encoding whether the instance had this value or not. The eleventh column encoded whether the instance had a value that was not among the top ten most common values. We removed constant attributes, as well as duplicate attributes.&#xD;
&lt;/p&gt;&lt;p&gt;&#xD;
We replaced the missing values by mean for numerical attributes, and coded them as a separate value for discrete attributes. We also added a separate column for each numeric attribute with missing values, indicating whether the value was missing or not. We also tried another approach for imputing missing values based on KNN.&#xD;
&lt;/p&gt;&lt;p&gt;&#xD;
On the large data set we discretized the 100 numerical variables that had the highest mutual information with the target into 10 bins, and added them as extra features.&#xD;
&lt;/p&gt;&lt;p&gt;&#xD;
We tried PCA on the large data set, but it did not seem to help.&#xD;
&lt;/p&gt;&lt;p&gt;&#xD;
Because we noticed that some of the most predictive attributes were not linearly correlated with the targets, we build shallow decision trees (2-4 levels deep) using single numerical attributes and used their predictions as extra features. We also build shallow decision trees using two features at a time and used their prediction as an extra feature in the hope of capturing some non-additive interactions among features.&#xD;
&lt;/p&gt;&#xD;
&lt;/blockquote&gt;&#xD;
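&lt;p&gt;&#xD;
The categorical coding and missing-value handling described in the first two paragraphs of the quote are easy to sketch in base R.  The function names &lt;code&gt;encode_top&lt;/code&gt; and &lt;code&gt;impute_mean&lt;/code&gt; are ours and the toy vectors are made up; this illustrates the idea and is not IBM’s code.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; ### One 0/1 column per top-k value plus an "other" column;&#xD;
&amp;gt; ### missing values are treated as a value of their own&#xD;
&amp;gt; encode_top &amp;lt;- function(x, k = 10) {&#xD;
+     x &amp;lt;- as.character(x)&#xD;
+     x[is.na(x)] &amp;lt;- "(missing)"&#xD;
+     top &amp;lt;- head(names(sort(table(x), decreasing = TRUE)), k)&#xD;
+     cbind(sapply(top, function(v) as.integer(x == v)),&#xD;
+           other = as.integer(!(x %in% top)))&#xD;
+ }&#xD;
&amp;gt; ### Mean imputation plus a 0/1 indicator of where values were missing&#xD;
&amp;gt; impute_mean &amp;lt;- function(x) {&#xD;
+     miss &amp;lt;- is.na(x)&#xD;
+     x[miss] &amp;lt;- mean(x, na.rm = TRUE)&#xD;
+     cbind(value = x, missing = as.integer(miss))&#xD;
+ }&#xD;
&amp;gt; encode_top(c("red", "blue", NA, "red", "green", "red"), k = 2)&#xD;
&amp;gt; impute_mean(c(1, NA, 3))&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
With &lt;code&gt;k = 10&lt;/code&gt; this gives at most 11 binary columns per categorical variable, and every row has exactly one of them set.&#xD;
&lt;/p&gt;&#xD;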
&#xD;
&lt;p&gt;Jump to &lt;a href="http://www.cybaea.net/Blogs/Data/R-used-by-KDD-2009-cup-winner-of-slow-challenge.html#h2_entry_comments"&gt;comments&lt;/a&gt;.&lt;/p&gt;&#xD;
&lt;div class="seealso"&gt;&#xD;
&lt;h1&gt;You may also like these posts:&lt;/h1&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-50.png" width="85" height="16" alt="[0.47]" title="[0.47]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/How-to-win-the-KDD-Cup-Challenge-with-R-and-gbm.html" title="Hugh Miller, the team leader of the winner of the KDD Cup 2009 Slow Challenge (which we wrote about recently) kindly provides more information about how to win this public challenge using the R statistical computing and analysis platform on a laptop (!)."&gt;How to win the KDD Cup Challenge with R and gbm&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Hugh Miller, the team leader of the winner of the KDD Cup 2009 Slow Challenge (which we wrote about recently) kindly provides more information about how to win this public challenge using the R statistical computing and analysis platform on a laptop (!).&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=OqKxuXq79pQ:MYgOwyfum9M:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=OqKxuXq79pQ:MYgOwyfum9M:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=OqKxuXq79pQ:MYgOwyfum9M:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=OqKxuXq79pQ:MYgOwyfum9M:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=OqKxuXq79pQ:MYgOwyfum9M:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=OqKxuXq79pQ:MYgOwyfum9M:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=OqKxuXq79pQ:MYgOwyfum9M:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=OqKxuXq79pQ:MYgOwyfum9M:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=OqKxuXq79pQ:MYgOwyfum9M:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/OqKxuXq79pQ" height="1" width="1"/&gt;</content><published>2009-05-31T13:17:00Z</published><updated>2009-05-31T13:17:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-used-by-KDD-2009-cup-winner-of-slow-challenge.html</feedburner:origLink></entry><entry><title type="text">R tips: Use read.table instead of strsplit to split a text column into multiple columns</title><id>urn:uuid:60775fac-6d0b-5d55-9e76-eb21bdde97c1</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-tips-Use-read_table-instead-of-strsplit-to-split-a-text-column-into-multiple-columns.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/uowubtD_s_4/R-tips-Use-read_table-instead-of-strsplit-to-split-a-text-column-into-multiple-columns.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
Someone on the R-help mailing list had a data frame with a column containing IP addresses in quad-dot format (e.g. 1.10.100.200).  He wanted to sort by this column and I proposed a solution involving <code>strsplit</code>.  But <a href="http://staff.pubhealth.ku.dk/~pd/">Peter Dalgaard</a> comes up with a much nicer method using <code>read.table</code> on a <code>textConnection</code> object:
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
Someone on the R-help mailing list had a data frame with a column containing IP addresses in quad-dot format (e.g. 1.10.100.200).  He wanted to sort by this column and I proposed a solution involving &lt;code&gt;strsplit&lt;/code&gt;.  But &lt;a href="http://staff.pubhealth.ku.dk/~pd/"&gt;Peter Dalgaard&lt;/a&gt; comes up with a much nicer method using &lt;code&gt;read.table&lt;/code&gt; on a &lt;code&gt;textConnection&lt;/code&gt; object:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; a &amp;lt;- data.frame(cbind(color=c("yellow","red","blue","red"),&#xD;
                        status=c("no","yes","yes","no"),&#xD;
                        ip=c("162.131.58.26","2.131.58.16","2.2.58.10","162.131.58.17")))&#xD;
&amp;gt; con &amp;lt;- textConnection(as.character(a$ip))&#xD;
&amp;gt; o &amp;lt;- do.call(order,read.table(con, sep="."))&#xD;
&amp;gt; close(con)&#xD;
&amp;gt; a[o,]&#xD;
   color status            ip&#xD;
3   blue    yes     2.2.58.10&#xD;
2    red    yes   2.131.58.16&#xD;
4    red     no 162.131.58.17&#xD;
1 yellow     no 162.131.58.26&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
That is very, very neat!  Thank you Peter.&#xD;
&lt;/p&gt;&#xD;
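&lt;p&gt;&#xD;
For comparison, a &lt;code&gt;strsplit&lt;/code&gt;-based approach might look something like the sketch below (our reconstruction for illustration, not necessarily the code posted to R-help).  Either way the point is to compare the four octets as numbers: sorting the strings directly would put 162.131.58.26 before 2.2.58.10.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; ip &amp;lt;- c("162.131.58.26", "2.131.58.16", "2.2.58.10", "162.131.58.17")&#xD;
&amp;gt; ### Split on literal dots and build a four-column integer matrix&#xD;
&amp;gt; octets &amp;lt;- do.call(rbind, lapply(strsplit(ip, ".", fixed = TRUE), as.integer))&#xD;
&amp;gt; ### Order by first octet, then second, and so on&#xD;
&amp;gt; o &amp;lt;- do.call(order, as.data.frame(octets))&#xD;
&amp;gt; ip[o]&#xD;
[1] "2.2.58.10"     "2.131.58.16"   "162.131.58.17" "162.131.58.26"&#xD;
&lt;/pre&gt;&#xD;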
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=uowubtD_s_4:JUmeZOR5FJ8:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=uowubtD_s_4:JUmeZOR5FJ8:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=uowubtD_s_4:JUmeZOR5FJ8:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=uowubtD_s_4:JUmeZOR5FJ8:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=uowubtD_s_4:JUmeZOR5FJ8:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=uowubtD_s_4:JUmeZOR5FJ8:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=uowubtD_s_4:JUmeZOR5FJ8:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=uowubtD_s_4:JUmeZOR5FJ8:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=uowubtD_s_4:JUmeZOR5FJ8:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/uowubtD_s_4" height="1" width="1"/&gt;</content><published>2009-05-29T10:53:00Z</published><updated>2009-05-29T10:53:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-tips-Use-read_table-instead-of-strsplit-to-split-a-text-column-into-multiple-columns.html</feedburner:origLink></entry><entry><title type="text">Data.gov</title><id>urn:uuid:a914e8e4-59f0-5054-a5ce-d0aa76d47247</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Data_gov.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/CaDId-zLq1A/Data_gov.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
I am always on the lookout for useful data sources for training in statistics, so I am excited that <a href="http://www.data.gov/">Data.gov</a> has opened for business.  The purpose of Data.gov is to increase public access to high value, machine readable datasets generated by the US Government. 
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
I am always on the lookout for useful data sources for training in statistics, so I am excited that &lt;a href="http://www.data.gov/"&gt;Data.gov&lt;/a&gt; has opened for business.  The purpose of Data.gov is to increase public access to high value, machine readable datasets generated by the US Government. &#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
This is a great initiative which I look forward to exploring when I am not in a tiny airport at 3 am (but hey: they have free wifi) and which I hope other countries will take up.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Are there other catalogues of data sets that you use?&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;Jump to &lt;a href="http://www.cybaea.net/Blogs/Data/Data_gov.html#h2_entry_comments"&gt;comments&lt;/a&gt;.&lt;/p&gt;&#xD;
&lt;div class="seealso"&gt;&#xD;
&lt;h1&gt;You may also like these posts:&lt;/h1&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.43]" title="[0.43]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Beautiful-Data.html" title="O’Reilly’s recent publication Beautiful Data has a chapter by Jeff Jonas which is enough reason in itself for me to recommend it. The chapter, Data Finds Data, is also available as a PDF download."&gt;Beautiful Data&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;O’Reilly’s recent publication Beautiful Data has a chapter by Jeff Jonas which is enough reason in itself for me to recommend it. The chapter, Data Finds Data, is also available as a PDF download.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=CaDId-zLq1A:70r8fJ5coR4:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=CaDId-zLq1A:70r8fJ5coR4:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=CaDId-zLq1A:70r8fJ5coR4:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=CaDId-zLq1A:70r8fJ5coR4:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=CaDId-zLq1A:70r8fJ5coR4:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=CaDId-zLq1A:70r8fJ5coR4:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=CaDId-zLq1A:70r8fJ5coR4:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=CaDId-zLq1A:70r8fJ5coR4:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=CaDId-zLq1A:70r8fJ5coR4:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/CaDId-zLq1A" height="1" width="1"/&gt;</content><published>2009-05-22T02:23:00Z</published><updated>2009-05-22T02:23:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Data_gov.html</feedburner:origLink></entry><entry><title type="text">SNA with R: Loading large networks using the igraph library</title><id>urn:uuid:8764d0b0-00b6-5d9b-9c45-5d3373bc97a8</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/SNA-with-R-Loading-large-networks-using-the-igraph-library.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/UafsWYtoE_U/SNA-with-R-Loading-large-networks-using-the-igraph-library.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
We are interested in Social Network Analysis using the statistical analysis and computing platform <a href="http://www.r-project.org/">R</a>.  The documentation for R is voluminous but typically not very good, so this entry is part of a series where we document what we learn as we explore the tool and the packages.
</p>
<p>
In <a href="http://www.cybaea.net/Blogs/Data/SNA-with-R-Loading-your-network-data.html">our previous post on SNA</a> we gave up on using the <code>statnet</code> package because it was not able to handle our data volumes.  In this entry we have better success with the <code>igraph</code> package.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
We are interested in Social Network Analysis using the statistical analysis and computing platform &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;.  The documentation for R is voluminous but typically not very good, so this entry is part of a series where we document what we learn as we explore the tool and the packages.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
In &lt;a href="http://www.cybaea.net/Blogs/Data/SNA-with-R-Loading-your-network-data.html"&gt;our previous post on SNA&lt;/a&gt; we gave up on using the &lt;code&gt;statnet&lt;/code&gt; package because it was not able to handle our data volumes.  In this entry we have better success with the &lt;code&gt;igraph&lt;/code&gt; package.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
The task we are considering is still how to load the network data into the R package’s internal representation.  We will assume that the raw data for our analysis is in a transactional format that is typical at least in the Telecommunications and Finance industries.  In the former the terminology is Call Detail Record (CDR) and an extract may look a little like the following:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="Sample Call Detail Records"&gt;&#xD;
&lt;b&gt;          src,         dest,     start,  duration,type,...&lt;/b&gt;&#xD;
+447000000005,+447000000006,1238510028,        52,call,...&#xD;
+447000000006,+447000000009,1238510627,       154,call,...&#xD;
+447000000009,+447000000007,1238511103,        48,call,...&#xD;
+447000000006,+447000000005,1238511145,        49,call,...&#xD;
+447000000006,+447000000005,1238511678,        12,call,...&#xD;
+447000000001,+447000000006,1238511735,       147,call,...&#xD;
+447000000007,+447000000009,1238511806,        26,call,...&#xD;
+447000000000,+447000000008,1238511825,        19,call,...&#xD;
+447000000009,+447000000008,1238511900,        28,call,...&#xD;
...&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Here a record indicates that the customer identified as &lt;var&gt;src&lt;/var&gt; called (&lt;var&gt;type&lt;/var&gt;=call) the customer &lt;var&gt;dest&lt;/var&gt; at the given time &lt;var&gt;start&lt;/var&gt; and the call lasted &lt;var&gt;duration&lt;/var&gt; seconds.  In general, there will be (many) more attributes describing the transaction which are represented by the &lt;var&gt;...&lt;/var&gt;.  In a Financial Services example, the records may be money transfers between accounts.&#xD;
&lt;/p&gt;&#xD;
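&lt;p&gt;&#xD;
As a rough sketch of how such records might look once read into R, the following parses a few of the sample rows above with base R.  The column names are taken from the sample header; the variable names &lt;code&gt;cdr_lines&lt;/code&gt; and &lt;code&gt;cdr&lt;/code&gt; are ours and purely illustrative, and the identifiers are read as character strings so that the leading &lt;code&gt;+&lt;/code&gt; is kept.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;# Illustrative only: a few of the sample rows above, one string per line.&#xD;
cdr_lines &amp;lt;- c("src,dest,start,duration,type",&#xD;
               "+447000000005,+447000000006,1238510028,52,call",&#xD;
               "+447000000006,+447000000009,1238510627,154,call",&#xD;
               "+447000000009,+447000000007,1238511103,48,call")&#xD;
&#xD;
# Read the identifiers as character so that the leading "+" is preserved.&#xD;
cdr &amp;lt;- read.csv(text = cdr_lines,&#xD;
                colClasses = c("character", "character",&#xD;
                               "integer", "integer", "character"))&#xD;
&#xD;
# The start column is seconds since the Unix epoch.&#xD;
cdr$start &amp;lt;- as.POSIXct(cdr$start, origin = "1970-01-01", tz = "UTC")&#xD;
str(cdr)&#xD;
&lt;/pre&gt;&#xD;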
&#xD;
&lt;h2&gt;Loading the data in the &lt;code&gt;igraph&lt;/code&gt; package&lt;/h2&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
We are able to load the previous test data with 51 million records easily:&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="screen"&gt;&amp;gt; library("igraph")&#xD;
&amp;gt; m &amp;lt;- matrix(scan(bzfile("cdr.51M.csv.bz2", open="r"), &#xD;
+                  what=integer(0), skip=1, sep=','), &#xD;
+             ncol=4, byrow=TRUE)&#xD;
Read 205266564 items&#xD;
&amp;gt; ### Vertices are numbered from zero in the igraph library&#xD;
&amp;gt; m[,1] &amp;lt;- m[,1]-1; m[,2] &amp;lt;- m[,2]-1&#xD;
&amp;gt; g &amp;lt;- graph.edgelist(m[,c(2,1)])&#xD;
&amp;gt; E(g)$start    &amp;lt;- as.POSIXct(m[,3], origin="1970-01-01", tz="UTC")&#xD;
&amp;gt; E(g)$duration &amp;lt;- m[,4]&#xD;
&amp;gt; ns &amp;lt;- neighborhood.size(g, 1)&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
Time to up the ante!  We have a file of simulated call detail records containing over 700 million entries, and we suspect that the simulation algorithm under-represents nodes with few connections.  Let’s check on the first half billion records (which more or less fit in the available memory on this workstation):&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="screen"&gt;&amp;gt; library("igraph")&#xD;
### Note that R can only handle 2^31-1 elements in a vector (on any&#xD;
### platform, including 64-bit), so we need to read this file as a&#xD;
### list.&#xD;
&amp;gt; s &amp;lt;- scan("cdr.1e6x1e1.csv", what=rep(list(integer(0)),4), skip=1, sep=',', multi.line=FALSE)&#xD;
Read 700466826 records&#xD;
&amp;gt; m &amp;lt;- as.vector(rbind(s[[2]], s[[1]]))&#xD;
&amp;gt; print(length(m))&#xD;
[1] 1400933652&#xD;
&amp;gt; length(m) &amp;lt;- 1e9&#xD;
&amp;gt; g &amp;lt;- graph(m, directed=TRUE)&#xD;
&amp;gt; ns &amp;lt;- neighborhood.size(g, 1)&#xD;
&amp;gt; summary(ns)&#xD;
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. &#xD;
   1.00   35.00   40.00   42.92   47.00  101.00 &#xD;
&amp;gt; hist(ns, xlab="Neighborhood size", main="Distribution of neighborhood size", &#xD;
       sub="From cdr.1e6x1e1.1e9")&#xD;
&lt;/pre&gt;&#xD;
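&lt;p&gt;&#xD;
The &lt;code&gt;rbind&lt;/code&gt;/&lt;code&gt;as.vector&lt;/code&gt; step in the session above may look cryptic.  The &lt;code&gt;graph&lt;/code&gt; constructor takes a flat vector of vertex ids in which each consecutive pair is one edge, and binding the two columns as rows before flattening produces that interleaving because R stores matrices column by column.  A toy illustration in base R (the vectors &lt;code&gt;from&lt;/code&gt; and &lt;code&gt;to&lt;/code&gt; are made up), followed by the arithmetic behind the vector-length remark:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;from &amp;lt;- c(5L, 6L, 9L)   # made-up caller ids&#xD;
to   &amp;lt;- c(6L, 9L, 7L)   # made-up callee ids&#xD;
&#xD;
# rbind() gives a 2 x n matrix; as.vector() reads it column by column,&#xD;
# so the result is from[1], to[1], from[2], to[2], ...&#xD;
edges &amp;lt;- as.vector(rbind(from, to))&#xD;
print(edges)&#xD;
# [1] 5 6 6 9 9 7&#xD;
&#xD;
# Four columns of 700466826 records in one vector would exceed the&#xD;
# 2^31 - 1 element limit, hence reading the file as a list of columns.&#xD;
700466826 * 4 &amp;gt; 2^31 - 1&#xD;
# [1] TRUE&#xD;
&lt;/pre&gt;&#xD;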
&#xD;
&lt;div class="floatRight"&gt;&#xD;
&lt;a href="http://static.cybaea.net/images/neighborhood_hist.png"&gt;&lt;img src="http://static.cybaea.net/images/neighborhood_hist_small.png" width="400" height="400" title="Distribution of neighborhood size" alt="[Distribution of neighborhood size plot]"&gt;&lt;/img&gt;&lt;/a&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
As we suspected, the Monte Carlo algorithm does not generate enough customers with small calling circles.  Fortunately it is very easy to add these separately: the hard part is modelling the larger calling circles.  A mix of the two approaches provides a reasonably good fit to actual customer behaviour.  (The cut-off at 100 is a parameter to our Monte Carlo simulation program, which was indeed set to 100 for this run.)&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;Problems&lt;/h2&gt;&#xD;
&lt;p&gt;However, it is not all perfect.  When we attempt to add the edge attributes in the obvious way, it fails:&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="screen"&gt;&amp;gt; length(s[[3]]) &amp;lt;- 0.5e9&#xD;
&amp;gt; length(s[[4]]) &amp;lt;- 0.5e9&#xD;
&amp;gt; E(g)$start     &amp;lt;- s[[3]]&#xD;
Error: cannot allocate vector of size 3.7 Gb&#xD;
Execution halted&#xD;
&amp;gt; E(g)$duration  &amp;lt;- s[[4]]&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
So we are just at the limit.  Probably 100 million records is OK in this environment.  However, &lt;a href="http://igraph.sourceforge.net/"&gt;the core igraph library&lt;/a&gt; is accessible from C, so better performance can probably be achieved that way.  Pointers are 8-byte structures on this machine, so C code should not be subject to the silly limits that R imposes on us.&#xD;
&lt;/p&gt;&#xD;
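&lt;p&gt;&#xD;
The size in the error message is consistent with a back-of-the-envelope calculation, assuming (this is our assumption; the message does not say so) that the attribute is being copied into a vector of 8-byte doubles with one element per edge:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;n_edges    &amp;lt;- 0.5e9   # edges kept in the graph above&#xD;
bytes_each &amp;lt;- 8       # assumed: one double per edge&#xD;
size_gib   &amp;lt;- n_edges * bytes_each / 2^30&#xD;
round(size_gib, 1)&#xD;
# [1] 3.7&#xD;
&lt;/pre&gt;&#xD;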
&#xD;
&#xD;
&lt;/div&gt;</content><published>2009-05-06T15:33:00Z</published><updated>2009-05-06T15:33:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/SNA-with-R-Loading-large-networks-using-the-igraph-library.html</feedburner:origLink></entry><entry><title type="text">SNA with R: Loading your network data</title><id>urn:uuid:048bbc8f-cad3-5c39-8dee-7c05fb4204ca</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/SNA-with-R-Loading-your-network-data.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/tZM7HWtF50M/SNA-with-R-Loading-your-network-data.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
We are interested in Social Network Analysis using the statistical analysis and computing platform <a href="http://www.r-project.org/">R</a>.  As usual with R, the documentation is pretty bad, so this series collects our notes as we learn more about the available packages and how they work.  We use the <a href="http://csde.washington.edu/statnet/index.shtml">statnet</a> group of packages here, which seems to be the most comprehensive and most actively maintained set of network analysis packages.
</p>
<p>
The first task which we consider in this post is to load our data into a <code>network</code> object, which is how all the <code>statnet</code> packages represent a network.  Typically for R, the documentation is voluminous but not always as helpful as one could want.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
We are interested in Social Network Analysis using the statistical analysis and computing platform &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;.  As usual with R, the documentation is pretty bad, so this series collects our notes as we learn more about the available packages and how they work.  We use the &lt;a href="http://csde.washington.edu/statnet/index.shtml"&gt;statnet&lt;/a&gt; group of packages here, which seems to be the most comprehensive and most actively maintained set of network analysis packages.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
The first task which we consider in this post is to load our data into a &lt;code&gt;network&lt;/code&gt; object, which is how all the &lt;code&gt;statnet&lt;/code&gt; packages represent a network.  Typically for R, the documentation is voluminous but not always as helpful as one could want.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
We will assume that the raw data for our analysis is in a transactional format that is typical at least in the Telecommunications and Finance industries.  In the former the terminology is Call Detail Record (&lt;dfn title="Call Detail Record"&gt;CDR&lt;/dfn&gt;) and an extract may look a little like the following:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="Sample Call Detail Records"&gt;&#xD;
&lt;b&gt;          src,         dest,     start,  duration,type,...&lt;/b&gt;&#xD;
+447000000005,+447000000006,1238510028,        52,call,...&#xD;
+447000000006,+447000000009,1238510627,       154,call,...&#xD;
+447000000009,+447000000007,1238511103,        48,call,...&#xD;
+447000000006,+447000000005,1238511145,        49,call,...&#xD;
+447000000006,+447000000005,1238511678,        12,call,...&#xD;
+447000000001,+447000000006,1238511735,       147,call,...&#xD;
+447000000007,+447000000009,1238511806,        26,call,...&#xD;
+447000000000,+447000000008,1238511825,        19,call,...&#xD;
+447000000009,+447000000008,1238511900,        28,call,...&#xD;
...&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Here a record indicates that the customer identified as &lt;var&gt;src&lt;/var&gt; called (&lt;var&gt;type&lt;/var&gt;=call) the customer &lt;var&gt;dest&lt;/var&gt; at the given time &lt;var&gt;start&lt;/var&gt; and the call lasted &lt;var&gt;duration&lt;/var&gt; seconds.  In general, there will be (many) more attributes describing the transaction which are represented by the &lt;var&gt;...&lt;/var&gt;.  In a Financial Services example, the records may be money transfers between accounts.&#xD;
&lt;/p&gt;&#xD;
&lt;h2&gt;Implementation in the &lt;code&gt;network&lt;/code&gt; class&lt;/h2&gt;&#xD;
&lt;p&gt;&#xD;
In the naive implementation of this data as a network, we would have the sources and destinations (broadly speaking: people) as vertices and the calls as edges.  That seems to make sense: people are connected by the calls they make, and that is the social relationship we wish to model.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
In the terminology of the &lt;code&gt;network&lt;/code&gt; class, that means that our network will be &lt;b&gt;directed&lt;/b&gt; (calls and money transfers have a direction &lt;em&gt;from&lt;/em&gt; one person &lt;em&gt;to&lt;/em&gt; another) and will need to allow &lt;b&gt;multiple&lt;/b&gt; edges between the same endpoints (because any one person can, and indeed usually will, make several calls to the same other person).&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
We could consider dropping the &lt;b&gt;multiple&lt;/b&gt; attribute of the network and instead represent the fact that A has called B with a single edge and perhaps have the number of calls and their total duration as an edge attribute.  We will investigate this another time, but it is surely a less faithful representation of the data that we have (and we would need to drop information like the time of call).&#xD;
&lt;/p&gt;&#xD;
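&lt;p&gt;&#xD;
To make the trade-off concrete, here is a minimal sketch of that collapsed representation using base R's &lt;code&gt;aggregate&lt;/code&gt; on a toy data frame.  The data and the column names &lt;code&gt;n_calls&lt;/code&gt; and &lt;code&gt;total_duration&lt;/code&gt; are made up for illustration; note how the individual call times have no place in the result.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;# Toy call records: the pair (6 -&amp;gt; 5) occurs twice.&#xD;
calls &amp;lt;- data.frame(src      = c(5L, 6L, 6L, 9L),&#xD;
                    dest     = c(6L, 5L, 5L, 7L),&#xD;
                    duration = c(52L, 49L, 12L, 48L))&#xD;
&#xD;
# Count each record once, then sum counts and durations per (src, dest).&#xD;
calls$n_calls &amp;lt;- 1L&#xD;
edges &amp;lt;- aggregate(cbind(n_calls, duration) ~ src + dest,&#xD;
                   data = calls, FUN = sum)&#xD;
names(edges)[names(edges) == "duration"] &amp;lt;- "total_duration"&#xD;
print(edges)&#xD;
&lt;/pre&gt;&#xD;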
&lt;h2&gt;Mapping customer identifiers to network vertex numbers&lt;/h2&gt;&#xD;
&lt;p&gt;&#xD;
One thing the documentation seems to forget to tell you is that when you import your data, your vertex identifiers (in our case customer or account numbers) must be replaced by vertex numbers &lt;em&gt;and&lt;/em&gt; that this numbering must be sequential and start from 1.  Being used to environments where vertex identifiers are arbitrary (and arrays usually start from 0), I was puzzled by this for a while.  The error message that tells you your vertex numbering is not what the package expected is spectacularly unhelpful:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; n &amp;lt;- network(m, matrix.type="edgelist", directed=TRUE, multiple=TRUE)&#xD;
Error in add.edges(g, as.list(x[, 1]), as.list(x[, 2]), edge.check = edge.check) : &#xD;
  (edge check) Illegal vertex reference in addEdges_R.  Exiting.&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
For the discussion that follows, we will assume that you have changed your identifiers externally to R.&#xD;
&lt;/p&gt;&#xD;
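&lt;p&gt;&#xD;
If you would rather do the renumbering inside R, one possible approach (a sketch, not something the &lt;code&gt;network&lt;/code&gt; documentation prescribes) is to build a lookup table of the distinct identifiers and use &lt;code&gt;match&lt;/code&gt;.  The sample identifiers and the names &lt;code&gt;ids&lt;/code&gt;, &lt;code&gt;src_idx&lt;/code&gt; and &lt;code&gt;dest_idx&lt;/code&gt; are ours.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;src  &amp;lt;- c("+447000000005", "+447000000006", "+447000000009")&#xD;
dest &amp;lt;- c("+447000000006", "+447000000009", "+447000000007")&#xD;
&#xD;
# Lookup table: the position in ids is the vertex number (1, 2, 3, ...).&#xD;
ids &amp;lt;- sort(unique(c(src, dest)))&#xD;
&#xD;
src_idx  &amp;lt;- match(src,  ids)&#xD;
dest_idx &amp;lt;- match(dest, ids)&#xD;
&#xD;
cbind(src_idx, dest_idx)&#xD;
# ids[k] maps vertex number k back to the original identifier.&#xD;
&lt;/pre&gt;&#xD;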
&lt;h2&gt;Loading the data&lt;/h2&gt;&#xD;
&lt;p&gt;&#xD;
The good news is that our data is essentially in a format that the &lt;code&gt;network&lt;/code&gt; package calls &lt;b&gt;edge list&lt;/b&gt; and which it can import directly.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
I say “essentially” because for some strange reason the package expects the destination to come before the source which seems ass-backwards to me.  But assume we have our data in a file &lt;code&gt;cdr.csv&lt;/code&gt; like this (we only have calls here):&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="cdr.csv"&gt;       src,      dest,     start,  duration&#xD;
         5,         6,1238510028,        52&#xD;
         6,         9,1238510627,       154&#xD;
         9,         7,1238511103,        48&#xD;
         6,         5,1238511145,        49&#xD;
...&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Then we can load the data into R easily:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; library("network")&#xD;
&amp;gt; m &amp;lt;- matrix(scan(file="cdr.csv", what=integer(0), skip=1, sep=','), ncol=4, byrow=TRUE)&#xD;
Read 1896 items&#xD;
&amp;gt; # &lt;a href="http://www.cybaea.net/Blogs/Data/R-tips-Swapping-columns-in-a-matrix.html"&gt;Swap columns&lt;/a&gt; for ass-backward network package&#xD;
&amp;gt; m[,c(1,2)] &amp;lt;- m[,c(2,1)]&#xD;
&#xD;
&amp;gt; # Create network&#xD;
&amp;gt; net &amp;lt;- network(m, matrix.type="edgelist", directed=TRUE, multiple=TRUE)&#xD;
&#xD;
&amp;gt; summary(net)&#xD;
Network attributes:&#xD;
 vertices = 10&#xD;
 directed = TRUE&#xD;
 hyper = FALSE&#xD;
 loops = FALSE&#xD;
 multiple = TRUE&#xD;
 bipartite = FALSE&#xD;
 total edges = 474 &#xD;
   missing edges = 0 &#xD;
   non-missing edges = 474 &#xD;
 density = 5.266667 &#xD;
&#xD;
Vertex attributes:&#xD;
 vertex.names:&#xD;
   character valued attribute&#xD;
   10 valid vertex names&#xD;
&#xD;
No edge attributes&#xD;
&#xD;
Network adjacency matrix:&#xD;
Error in as.matrix.network.adjacency(x = x, attrname = attrname, ...) : &#xD;
  Multigraphs not currently supported in as.matrix.network.adjacency.  Exiting.&#xD;
In addition: Warning message:&#xD;
In network.density(x) :&#xD;
  Network is multiplex - no general way to define density.  Returning value for a non-multiplex network (hope that's what you wanted).&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
OK, that is one error and one warning from &lt;code&gt;summary&lt;/code&gt;, but the load itself basically worked.  We have figured out how to load our network data into the &lt;a href="http://www.jstatsoft.org/v24/i02/"&gt;network&lt;/a&gt; package in R.&#xD;
&lt;/p&gt;&#xD;
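&lt;p&gt;&#xD;
The numbers in the summary hang together.  The 1896 items read are 474 records of four fields each, and the reported density matches the edge count divided by the number of ordered vertex pairs, n(n-1).  That formula is our reading of what the fallback for a directed network without loops computes, not something the output states:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;n_items  &amp;lt;- 1896&#xD;
n_fields &amp;lt;- 4&#xD;
n_edges  &amp;lt;- n_items / n_fields   # 474 records&#xD;
n_vert   &amp;lt;- 10&#xD;
&#xD;
# Assumed formula: edges per ordered pair of distinct vertices.&#xD;
density &amp;lt;- n_edges / (n_vert * (n_vert - 1))&#xD;
round(density, 6)&#xD;
# [1] 5.266667&#xD;
&lt;/pre&gt;&#xD;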
&#xD;
&lt;h2&gt;Performance&lt;/h2&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
We can’t do an exhaustive performance review now, but let us at least make sure we can load medium-sized networks.  We change our CDR simulator to emit the destination before the source, just as &lt;code&gt;network&lt;/code&gt; likes it, and let it run.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
The first file has 2,645,288 (simulated) CDR lines from 100k customers and it loads OK on our small development workstation even with the default settings:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; library("network")&#xD;
&amp;gt; n &amp;lt;- network(matrix(scan(file="&lt;a href="http://static.cybaea.net/files/cdr.1e5x1e0.csv.bz2"&gt;cdr.1e5x1e0.csv&lt;/a&gt;", &#xD;
                           what=integer(0), skip=1, sep=','), &#xD;
                      ncol=4, byrow=TRUE), &#xD;
               matrix.type="edgelist", directed=TRUE, multiple=TRUE)&#xD;
Read 10581152 items&#xD;
&amp;gt; proc.time()&#xD;
   user  system elapsed &#xD;
138.304   1.597 140.878 &#xD;
&amp;gt; save(n, file="n.RData", ascii=FALSE, compress=FALSE)&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The size of the saved network object is 373MB (only 27MB compressed).&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
We can potentially save some time and memory by explicitly not performing the edge check (again the documentation frustrates us: it is silent on what the defaults are for the &lt;code&gt;network&lt;/code&gt; call we used above).  We try this for our next file, which has 51,316,641 lines of CDR data (again for 100k customers); this approach also saves us some column swapping:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; library("network")&#xD;
&amp;gt; m &amp;lt;- matrix(scan(file="cdr.51M.csv", &#xD;
                   what=integer(0), skip=1, sep=','),&#xD;
              ncol=4, byrow=TRUE)&#xD;
Read 205266564 items&#xD;
&amp;gt; num_vert &amp;lt;- max(m[,1], m[,2])&#xD;
&amp;gt; num_vert&#xD;
[1] 100000&#xD;
&amp;gt; n &amp;lt;- network.initialize(n=num_vert, directed=TRUE, multiple=TRUE)&#xD;
&amp;gt; add.edges(n, tail=m[,2], head=m[,1], edge.check=FALSE)&#xD;
&amp;gt; proc.time()&#xD;
&lt;i&gt;(several hours: I’ll let you know when it is done)&lt;/i&gt;&#xD;
&amp;gt; rm(m)&#xD;
&amp;gt; save(n, file="n.RData", ascii=FALSE, compress=TRUE)&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Our attempted optimization did not seem to matter and this network is too big for the machine and the &lt;code&gt;network&lt;/code&gt; package.  Building the network was painful as I was working on the workstation at the same time.  The machine has 16GB installed RAM, but it was clearly running out and swapping extensively.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
51 million records might be a reasonably sized data set for some Financial Services applications, but it is clearly a trivial number for Telecommunications.  I’ll need to do some more digging around.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Does anybody have any SNA benchmarks?  I like the &lt;a href="http://www.kxen.com/"&gt;KXEN&lt;/a&gt; implementation for its simplicity and speed so I might get a copy and try it out.  Any R performance experts who could make suggestions in the comments?  How big are &lt;em&gt;your&lt;/em&gt; networks?&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;/div&gt;</content><published>2009-04-01T16:08:00Z</published><updated>2009-04-01T16:08:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/SNA-with-R-Loading-your-network-data.html</feedburner:origLink></entry><entry><title type="text">R tips: Swapping columns in a matrix</title><id>urn:uuid:64357922-73f7-5e9a-bb13-1c34f4f98256</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-tips-Swapping-columns-in-a-matrix.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/ez3q0foTWdw/R-tips-Swapping-columns-in-a-matrix.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
Swapping two columns in a matrix is really easy: <code>m[ , c(1,2)]  &lt;- m[ , c(2,1)]</code>.
</p>
</div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
Using &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;, the statistical analysis and computing platform, swapping two columns in a matrix is really easy: &lt;code&gt;m[ , c(1,2)]  &amp;lt;- m[ , c(2,1)]&lt;/code&gt;.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
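Note, however, that this does not swap the column names (if you have any) but only the values.  You could do something like &lt;code&gt;colnames(m)[c(1,2)] &amp;lt;- colnames(m)[c(2,1)]&lt;/code&gt; if you need the names changed as well, but it is perhaps better just to reindex the columns, which moves names and values together (the &lt;code&gt;seq_len&lt;/code&gt; form also works when the matrix has only two columns, where &lt;code&gt;3:ncol(m)&lt;/code&gt; would count down and fail):&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;m &amp;lt;- m[ , c(2, 1, seq_len(ncol(m))[-(1:2)]) ]&#xD;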
&lt;/pre&gt;&#xD;
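&lt;p&gt;&#xD;
A small example makes the difference between the two approaches visible.  The matrix and its column names are made up for illustration.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;m &amp;lt;- matrix(1:6, ncol = 3,&#xD;
            dimnames = list(NULL, c("src", "dest", "start")))&#xD;
&#xD;
# Value-only swap: the names stay where they were.&#xD;
v &amp;lt;- m&#xD;
v[, c(1, 2)] &amp;lt;- v[, c(2, 1)]&#xD;
colnames(v)&#xD;
# [1] "src"   "dest"  "start"&#xD;
&#xD;
# Reindexing moves names and values together.&#xD;
r &amp;lt;- m[, c(2, 1, 3)]&#xD;
colnames(r)&#xD;
# [1] "dest"  "src"   "start"&#xD;
&lt;/pre&gt;&#xD;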
&#xD;
&#xD;
&lt;/div&gt;</content><published>2009-03-31T15:59:00Z</published><updated>2009-03-31T15:59:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-tips-Swapping-columns-in-a-matrix.html</feedburner:origLink></entry><entry><title type="text">R tips: Eliminating the “save workspace image” prompt on exit</title><id>urn:uuid:9db2db2b-1dc0-55ff-b83d-db44ca9cb2b6</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-tips-Eliminating-the-save-workspace-image-prompt-on-exit.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/mZpStaDMWdA/R-tips-Eliminating-the-save-workspace-image-prompt-on-exit.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
When using <a href="http://www.r-project.org/">R</a>, the statistical analysis and computing platform, I find it really annoying that it always prompts to save the workspace when I exit.  This is how I turn it off.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
When using &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;, the statistical analysis and computing platform, I find it really annoying that it always prompts to save the workspace when I exit.  This is how I turn it off.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
I wish there were an option to change the default of the &lt;code&gt;q&lt;/code&gt;/&lt;code&gt;quit&lt;/code&gt; functions.  I start and stop R frequently, so the exit question, which I &lt;em&gt;have&lt;/em&gt; to answer &lt;em&gt;every time&lt;/em&gt;, is really annoying:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;Save workspace image? [y/n/c]:&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Why is there no R option to disable this prompt?  If I wanted to keep the image, I would already have saved it.  And I don’t like the default name anyhow, preferring to give my own with &lt;code&gt;save.image(file=...)&lt;/code&gt;.  For a while, I had a function defined in my &lt;code&gt;~/.Rprofile&lt;/code&gt; that terminated the session without prompting:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="~/.Rprofile"&gt;# Quit without asking whether to save the workspace image&#xD;
exit &amp;lt;- function() { q(save = "no") }&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
This lets me type &lt;code&gt;exit()&lt;/code&gt; and avoid the prompt, but in practice I normally type Control-D to end the session, which still calls the normal &lt;code&gt;q&lt;/code&gt; function with its annoying prompt.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
So instead I use the &lt;code&gt;alias&lt;/code&gt; functionality of my (bash) shell to start R with the &lt;code&gt;--no-save&lt;/code&gt; option by default.  In my &lt;code&gt;~/.bashrc&lt;/code&gt; I now have:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="~/.bashrc"&gt;alias R="$(/usr/bin/which R) --no-save"&lt;/pre&gt;&#xD;
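&lt;p&gt;&#xD;
A shell function is one possible alternative to the alias; this is only a sketch, assuming &lt;code&gt;R&lt;/code&gt; is on the &lt;code&gt;PATH&lt;/code&gt;.  The &lt;code&gt;command&lt;/code&gt; builtin skips the function itself so the call does not recurse, and &lt;code&gt;"$@"&lt;/code&gt; passes any further arguments through.  Use it in place of the alias, not alongside it, since bash may expand an existing &lt;code&gt;R&lt;/code&gt; alias while reading the function definition.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="~/.bashrc"&gt;# Hypothetical alternative to the alias: always start R with --no-save&#xD;
R() { command R --no-save "$@"; }&lt;/pre&gt;&#xD;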
&lt;p&gt;&#xD;
And finally I am happy.  But I still think R should have an option (accessible through &lt;code&gt;options&lt;/code&gt;) to change the default behavior.&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/mZpStaDMWdA" height="1" width="1"/&gt;</content><published>2009-03-26T08:14:00Z</published><updated>2009-03-26T08:14:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-tips-Eliminating-the-save-workspace-image-prompt-on-exit.html</feedburner:origLink></entry></feed>

