<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.cybaea.net/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0"><title>CYBAEA Data and Analysis</title><rights>Copyright by the author(s). All rights reserved.</rights><logo>http://static.cybaea.net/logo2011/cybaea-data-200.png</logo><subtitle type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>Read the CYBAEA Data and Analysis blog for in-depth coverage of selected topics in data analysis, data mining, statistics, causal inference, and related topics.</p><p>This is the blog for practising data analysts and theoretical statisticians.  The business conclusions of any analysis would normally be discussed in the CYBAEA Journal while this blog may contain the details of the analysis.</p></div></subtitle><updated>2012-03-13T16:57:24Z</updated><id>urn:uuid:259dced6-9721-5b16-a8aa-d91dc8e40f56</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/" /><link rel="alternate" type="text/html" href="http://www.cybaea.net/Blogs/Data/" /><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><generator uri="http://www.cybaea.net/atom/feed.pl?short_name=Data" version="$Revision: 97 $">feed.pl</generator><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/atom+xml" href="http://feeds.cybaea.net/CybaeaData" /><feedburner:info uri="cybaeadata" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><feedburner:emailServiceId>CybaeaData</feedburner:emailServiceId><feedburner:feedburnerHostname>http://feedburner.google.com</feedburner:feedburnerHostname><entry><title type="text">R code for Chapter 2 of Non-Life Insurance Pricing with GLM</title><id>urn:uuid:0e5cf672-a81c-599c-a340-821c2cb700fd</id><link 
rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-code-for-Chapter-2-of-Non_Life-Insurance-Pricing-with-GLM.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/t_3H9Qgjiow/R-code-for-Chapter-2-of-Non_Life-Insurance-Pricing-with-GLM.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>We continue working our way through the examples, case studies, and exercises of what is affectionately known here as “the two bears book” (Swedish björn = bear) and more formally as <cite>Non-Life Insurance Pricing with Generalized Linear Models</cite> by Esbjörn Ohlsson and Björn Johansson (Amazon 
<a href="http://www.amazon.co.uk/gp/product/3642107907/ref=as_li_qf_sp_asin_tl?ie=UTF8&amp;tag=cybaea-21&amp;linkCode=as2&amp;camp=1634&amp;creative=6738&amp;creativeASIN=3642107907">UK</a><img src="http://www.assoc-amazon.co.uk/e/ir?t=cybaea-21&amp;l=as2&amp;o=2&amp;a=3642107907" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" /> |
<a href="http://www.amazon.com/gp/product/3642107907/ref=as_li_ss_tl?ie=UTF8&amp;tag=allanengelhardt&amp;linkCode=as2&amp;camp=1789&amp;creative=390957&amp;creativeASIN=3642107907">US</a><img src="http://www.assoc-amazon.com/e/ir?t=allanengelhardt&amp;l=as2&amp;o=1&amp;a=3642107907" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" />).</p>
<p>At this stage, our purpose is to reproduce the analysis from the book using the <a href="http://www.r-project.org/">R</a> statistical computing and analysis platform, and to answer the data analysis elements of the exercises and case studies. Any critique of the approach and of pricing and modeling in the Insurance industry in general will wait for a later article.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&#xD;
&#xD;
&lt;p&gt;We continue working our way through the examples, case studies, and exercises of what is affectionately known here as “the two bears book” (Swedish björn = bear) and more formally as &lt;cite&gt;Non-Life Insurance Pricing with Generalized Linear Models&lt;/cite&gt; by Esbjörn Ohlsson and Björn Johansson (Amazon &#xD;
&lt;a href="http://www.amazon.co.uk/gp/product/3642107907/ref=as_li_qf_sp_asin_tl?ie=UTF8&amp;amp;tag=cybaea-21&amp;amp;linkCode=as2&amp;amp;camp=1634&amp;amp;creative=6738&amp;amp;creativeASIN=3642107907"&gt;UK&lt;/a&gt;&lt;img src="http://www.assoc-amazon.co.uk/e/ir?t=cybaea-21&amp;amp;l=as2&amp;amp;o=2&amp;amp;a=3642107907" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;"&gt;&lt;/img&gt; |&#xD;
&lt;a href="http://www.amazon.com/gp/product/3642107907/ref=as_li_ss_tl?ie=UTF8&amp;amp;tag=allanengelhardt&amp;amp;linkCode=as2&amp;amp;camp=1789&amp;amp;creative=390957&amp;amp;creativeASIN=3642107907"&gt;US&lt;/a&gt;&lt;img src="http://www.assoc-amazon.com/e/ir?t=allanengelhardt&amp;amp;l=as2&amp;amp;o=1&amp;amp;a=3642107907" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;"&gt;&lt;/img&gt;).&lt;/p&gt;&#xD;
&lt;p&gt;At this stage, our purpose is to reproduce the analysis from the book using the &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt; statistical computing and analysis platform, and to answer the data analysis elements of the exercises and case studies. Any critique of the approach and of pricing and modeling in the Insurance industry in general will wait for a later article.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;In the following, we will assume that the reader has a copy of the book and a working installation of R.  We will be using the &lt;a href="http://cran.r-project.org/web/packages/data.table/index.html"&gt;data.table&lt;/a&gt;, &lt;a href="http://cran.r-project.org/web/packages/foreach/index.html"&gt;foreach&lt;/a&gt;, and &lt;a href="http://cran.r-project.org/web/packages/ggplot2/index.html"&gt;ggplot2&lt;/a&gt; packages which are not part of the standard distribution, so the reader should install them first (e.g. by executing &lt;code&gt;install.packages(c("data.table", "foreach", "ggplot2"), dependencies = TRUE)&lt;/code&gt; from within an R session).&lt;/p&gt;&#xD;
&lt;p&gt;The beginning of the code is not so important, but here goes:&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;#!/usr/bin/Rscript&#xD;
## PricingGLM-2.r - Code for Chapter 2 of "Non-Life Insurance Pricing with GLM"&#xD;
## Copyright © 2012 CYBAEA Limited (http://www.cybaea.net)&#xD;
&#xD;
## @book{ohlsson2010non,&#xD;
##   title={Non-Life Insurance Pricing with Generalized Linear Models},&#xD;
##   author={Ohlsson, E. and Johansson, B.},&#xD;
##   isbn={9783642107900},&#xD;
##   series={Eaa Series: Textbook},&#xD;
##   url={http://books.google.com/books?id=l4rjeflJ\_bIC},&#xD;
##   year={2010},&#xD;
##   publisher={Springer Verlag}&#xD;
## }&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;h2&gt;Example 2.5: Moped insurance continued&lt;/h2&gt;&#xD;
&#xD;
&lt;p&gt;We continue the moped insurance example, and we use the data that we saved in our &lt;a href="http://www.cybaea.net/Blogs/Data/R-code-for-Chapter-1-of-Non_Life-Insurance-Pricing-with-GLM.html"&gt;Chapter 1 session&lt;/a&gt;. The goal is to reproduce Table 2.7 so we start building that as a data frame after loading the data.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;## Load the data saved in the Chapter 1 session.&#xD;
if (!exists("table.1.2"))&#xD;
    load("table.1.2.RData")&#xD;
&#xD;
library("foreach")&#xD;
&#xD;
## We are looking to reproduce table 2.7 which we start building here,&#xD;
## adding columns as we go.&#xD;
table.2.7 &amp;lt;-&#xD;
    data.frame(rating.factor =&#xD;
               c(rep("Vehicle class", nlevels(table.1.2$premiekl)),&#xD;
                 rep("Vehicle age",   nlevels(table.1.2$moptva)),&#xD;
                 rep("Zone",          nlevels(table.1.2$zon))),&#xD;
               class =&#xD;
               c(levels(table.1.2$premiekl),&#xD;
                 levels(table.1.2$moptva),&#xD;
                 levels(table.1.2$zon)),&#xD;
               stringsAsFactors = FALSE)&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;We next calculate the duration and number of claims for each level of each rating factor. We also set the contrasts for the levels, using the same idiom as in our &lt;a href="http://www.cybaea.net/Blogs/Data/R-code-for-Chapter-1-of-Non_Life-Insurance-Pricing-with-GLM.html"&gt;Chapter 1 session&lt;/a&gt;. The foreach package is convenient to use here, but you can of course do it with a normal loop and a couple of variables.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;## Calculate duration per rating factor level and also set the&#xD;
## contrasts (using the same idiom as in the code for the previous&#xD;
## chapter). We use foreach here to execute the loop both for its&#xD;
## side-effect (setting the contrasts) and to accumulate the sums.&#xD;
new.cols &amp;lt;-&#xD;
    foreach (rating.factor = c("premiekl", "moptva", "zon"),&#xD;
             .combine = rbind) %do%&#xD;
{&#xD;
    nclaims &amp;lt;- tapply(table.1.2$antskad, table.1.2[[rating.factor]], sum)&#xD;
    sums &amp;lt;- tapply(table.1.2$dur, table.1.2[[rating.factor]], sum)&#xD;
    n.levels &amp;lt;- nlevels(table.1.2[[rating.factor]])&#xD;
    contrasts(table.1.2[[rating.factor]]) &amp;lt;-&#xD;
        contr.treatment(n.levels)[rank(-sums, ties.method = "first"), ]&#xD;
    data.frame(duration = sums, n.claims = nclaims)&#xD;
}&#xD;
table.2.7 &amp;lt;- cbind(table.2.7, new.cols)&#xD;
rm(new.cols)&#xD;
&lt;/pre&gt;&#xD;
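&#xD;
&lt;p&gt;For readers who prefer a normal loop, the accumulation might be sketched as follows. This is only an illustration on a small made-up data frame: the names &lt;code&gt;toy&lt;/code&gt;, &lt;code&gt;f1&lt;/code&gt;, &lt;code&gt;f2&lt;/code&gt;, and &lt;code&gt;toy.cols&lt;/code&gt; are hypothetical and not part of the moped data, and the contrast-setting step is left out for brevity.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;## Made-up data (an assumption for illustration): two rating factors,&#xD;
## a duration, and a claim count per row.&#xD;
toy &amp;lt;- data.frame(f1 = factor(c("1", "2", "1", "2")),&#xD;
                  f2 = factor(c("x", "x", "y", "z")),&#xD;
                  dur = c(1.5, 2, 0.5, 3),&#xD;
                  antskad = c(0L, 1L, 0L, 2L))&#xD;
&#xD;
## Accumulate one block of rows per rating factor with a plain loop.&#xD;
toy.cols &amp;lt;- NULL&#xD;
for (rating.factor in c("f1", "f2")) {&#xD;
    sums &amp;lt;- tapply(toy$dur, toy[[rating.factor]], sum)&#xD;
    nclaims &amp;lt;- tapply(toy$antskad, toy[[rating.factor]], sum)&#xD;
    toy.cols &amp;lt;- rbind(toy.cols,&#xD;
                      data.frame(duration = as.vector(sums),&#xD;
                                 n.claims = as.vector(nclaims)))&#xD;
}&#xD;
print(toy.cols)&#xD;
&lt;/pre&gt;&#xD;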
&#xD;
&lt;p&gt;Next, we need to build the frequency and severity models separately, as in the discussion at the beginning of section 2.3.4 on page 34.&lt;/p&gt;&#xD;
&#xD;
&lt;h3&gt;Frequency model&lt;/h3&gt;&#xD;
&#xD;
&#xD;
&#xD;
&lt;p&gt;Note the use of an &lt;code&gt;offset()&lt;/code&gt; term in the frequency model, as opposed to the &lt;code&gt;weights=&lt;/code&gt; argument we have used before. The offset is &lt;code&gt;log(dur)&lt;/code&gt; because our link function is &lt;code&gt;log&lt;/code&gt;, which makes the expected number of claims proportional to the duration.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;See &lt;code&gt;help("Insurance", package = "MASS")&lt;/code&gt; for a similar example and sections 7.1 (p. 189--190) and 7.3 of &lt;cite&gt;Modern Applied Statistics with S&lt;/cite&gt;&#xD;
(Amazon &#xD;
&lt;a href="http://www.amazon.co.uk/gp/product/0387954570/ref=as_li_ss_tl?ie=UTF8&amp;amp;tag=cybaea-21&amp;amp;linkCode=as2&amp;amp;camp=1634&amp;amp;creative=19450&amp;amp;creativeASIN=0387954570"&gt;UK&lt;/a&gt;&lt;img src="http://www.assoc-amazon.co.uk/e/ir?t=cybaea-21&amp;amp;l=as2&amp;amp;o=2&amp;amp;a=0387954570" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;"&gt;&lt;/img&gt; |&#xD;
&lt;a href="http://www.amazon.com/gp/product/1441930086/ref=as_li_ss_tl?ie=UTF8&amp;amp;tag=allanengelhardt&amp;amp;linkCode=as2&amp;amp;camp=1789&amp;amp;creative=390957&amp;amp;creativeASIN=1441930086"&gt;US&lt;/a&gt;&lt;img src="http://www.assoc-amazon.com/e/ir?t=allanengelhardt&amp;amp;l=as2&amp;amp;o=1&amp;amp;a=1441930086" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;"&gt;&lt;/img&gt;&#xD;
)&#xD;
for a (very brief) discussion.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document" style="clear:both"&gt;model.frequency &amp;lt;-&#xD;
    glm(antskad ~ premiekl + moptva + zon + offset(log(dur)),&#xD;
        data = table.1.2, family = poisson)&#xD;
&#xD;
rels &amp;lt;- coef( model.frequency )&#xD;
rels &amp;lt;- exp( rels[1] + rels[-1] ) / exp( rels[1] )&#xD;
table.2.7$rels.frequency &amp;lt;-&#xD;
    c(c(1, rels[1])[rank(-table.2.7$duration[1:2], ties.method = "first")],&#xD;
      c(1, rels[2])[rank(-table.2.7$duration[3:4], ties.method = "first")],&#xD;
      c(1, rels[3:8])[rank(-table.2.7$duration[5:11], ties.method = "first")])&#xD;
&lt;/pre&gt;&#xD;
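&#xD;
&lt;p&gt;The role of the offset can be checked on a small example. The sketch below uses made-up numbers (the data frame &lt;code&gt;toy&lt;/code&gt; and its columns are hypothetical) and compares a Poisson model for counts with a &lt;code&gt;log(exposure)&lt;/code&gt; offset against a model for claim rates weighted by exposure; the two should give the same coefficients. The &lt;code&gt;quasipoisson&lt;/code&gt; family is used for the second model only to avoid warnings about non-integer responses.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;## Made-up data (an assumption for illustration): claim counts and&#xD;
## exposure (duration) for two groups.&#xD;
toy &amp;lt;- data.frame(group = factor(c("a", "a", "b", "b")),&#xD;
                  claims = c(2, 5, 1, 9),&#xD;
                  exposure = c(10, 20, 5, 30))&#xD;
&#xD;
## Counts with a log(exposure) offset ...&#xD;
m.offset &amp;lt;- glm(claims ~ group + offset(log(exposure)),&#xD;
                data = toy, family = poisson)&#xD;
## ... and claim rates weighted by exposure.&#xD;
m.rate &amp;lt;- glm(claims / exposure ~ group, weights = exposure,&#xD;
              data = toy, family = quasipoisson)&#xD;
&#xD;
## The coefficients should agree up to numerical tolerance.&#xD;
all.equal(coef(m.offset), coef(m.rate), tolerance = 1e-6)&#xD;
## exp() of the group coefficient is the relativity of group b&#xD;
## against the base group a.&#xD;
exp(coef(m.offset)[-1])&#xD;
&lt;/pre&gt;&#xD;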
&#xD;
&lt;h3&gt;Severity model&lt;/h3&gt;&#xD;
&#xD;
&lt;p&gt;There are a couple of points to note here:&lt;/p&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;We will model using a Gamma distribution for the errors, following the discussion on page 20 and also pages 33-34. Note that the Gamma is only one of several plausible candidate distributions.&lt;/li&gt;&#xD;
&lt;li&gt;Because we are using the Gamma distribution we need to remove the zero values from the data; we do this using the &lt;code&gt;table.1.2[table.1.2$medskad &amp;gt; 0, ]&lt;/code&gt; construct.&lt;/li&gt;&#xD;
&lt;li&gt;To reproduce the values from the book, we use the non-canonical &lt;code&gt;"log"&lt;/code&gt; link function even though the canonical function (&lt;code&gt;"inverse"&lt;/code&gt;) gives a slightly better fit (residual deviance 5.9 versus 8.0 on 16 degrees of freedom). This follows the approach discussed in Example 2.3 on page 30.&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;model.severity &amp;lt;-&#xD;
    glm(medskad ~ premiekl + moptva + zon,&#xD;
        data = table.1.2[table.1.2$medskad &amp;gt; 0, ],&#xD;
        family = Gamma("log"), weights = antskad)&#xD;
&#xD;
rels &amp;lt;- coef( model.severity )&#xD;
rels &amp;lt;- exp( rels[1] + rels[-1] ) / exp( rels[1] )&#xD;
## Aside: For the canonical link function use&#xD;
## rels &amp;lt;- rels[1] / (rels[1] + rels[-1])&#xD;
&#xD;
table.2.7$rels.severity &amp;lt;-&#xD;
    c(c(1, rels[1])[rank(-table.2.7$duration[1:2], ties.method = "first")],&#xD;
      c(1, rels[2])[rank(-table.2.7$duration[3:4], ties.method = "first")],&#xD;
      c(1, rels[3:8])[rank(-table.2.7$duration[5:11], ties.method = "first")])&#xD;
&lt;/pre&gt;&#xD;
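&#xD;
&lt;p&gt;To see how the two link functions lead to the two relativity formulas in the code above, here is a minimal sketch on made-up data (the data frame &lt;code&gt;toy&lt;/code&gt; and its columns are hypothetical). With a single factor both models reproduce the claim-weighted group means, so the relativities coincide. With several factors only the log link gives a multiplicative tariff, because the inverse link is additive on the scale of 1/mean.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;## Made-up data (an assumption for illustration): average claim&#xD;
## severity and number of claims for two groups.&#xD;
toy &amp;lt;- data.frame(group = factor(c("a", "a", "b", "b")),&#xD;
                  severity = c(1000, 1400, 2500, 2300),&#xD;
                  n = c(4, 6, 3, 7))&#xD;
&#xD;
m.log &amp;lt;- glm(severity ~ group, data = toy,&#xD;
             family = Gamma("log"), weights = n)&#xD;
m.inv &amp;lt;- glm(severity ~ group, data = toy,&#xD;
             family = Gamma("inverse"), weights = n)&#xD;
&#xD;
## Log link: the relativity is exp() of the coefficient.&#xD;
rel.log &amp;lt;- unname(exp(coef(m.log)[-1]))&#xD;
## Inverse link: the mean is 1 / (linear predictor), so the relativity&#xD;
## against the base group is b[1] / (b[1] + b[k]).&#xD;
b &amp;lt;- coef(m.inv)&#xD;
rel.inv &amp;lt;- unname(b[1] / (b[1] + b[-1]))&#xD;
&#xD;
all.equal(rel.log, rel.inv, tolerance = 1e-6)&#xD;
&lt;/pre&gt;&#xD;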
&#xD;
&lt;h3&gt;Combining the models&lt;/h3&gt;&#xD;
&#xD;
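&lt;p&gt;The relativities for the pure premium are the product of the frequency and severity relativities, so it is now trivial to combine and display the results.&lt;/p&gt;&#xD;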
&#xD;
&lt;pre class="document"&gt;table.2.7$rels.pure.premium &amp;lt;- with(table.2.7, rels.frequency * rels.severity)&#xD;
print(table.2.7, digits = 2)&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;table&gt;&#xD;
&lt;tr&gt; &lt;th&gt;  &lt;/th&gt; &lt;th&gt; rating.factor &lt;/th&gt; &lt;th&gt; class &lt;/th&gt; &lt;th&gt; duration &lt;/th&gt; &lt;th&gt; n.claims &lt;/th&gt; &lt;th&gt; rels.frequency &lt;/th&gt; &lt;th&gt; rels.severity &lt;/th&gt; &lt;th&gt; rels.pure.premium &lt;/th&gt;  &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 1 &lt;/td&gt; &lt;td&gt; Vehicle class &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td align="right"&gt; 9833.20 &lt;/td&gt; &lt;td align="right"&gt; 391 &lt;/td&gt; &lt;td align="right"&gt; 1.00 &lt;/td&gt; &lt;td align="right"&gt; 1.00 &lt;/td&gt; &lt;td align="right"&gt; 1.00 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 2 &lt;/td&gt; &lt;td&gt; Vehicle class &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td align="right"&gt; 8825.10 &lt;/td&gt; &lt;td align="right"&gt; 395 &lt;/td&gt; &lt;td align="right"&gt; 0.78 &lt;/td&gt; &lt;td align="right"&gt; 0.55 &lt;/td&gt; &lt;td align="right"&gt; 0.42 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 11 &lt;/td&gt; &lt;td&gt; Vehicle age &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td align="right"&gt; 1918.40 &lt;/td&gt; &lt;td align="right"&gt; 141 &lt;/td&gt; &lt;td align="right"&gt; 1.55 &lt;/td&gt; &lt;td align="right"&gt; 1.79 &lt;/td&gt; &lt;td align="right"&gt; 2.78 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 21 &lt;/td&gt; &lt;td&gt; Vehicle age &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td align="right"&gt; 16739.90 &lt;/td&gt; &lt;td align="right"&gt; 645 &lt;/td&gt; &lt;td align="right"&gt; 1.00 &lt;/td&gt; &lt;td align="right"&gt; 1.00 &lt;/td&gt; &lt;td align="right"&gt; 1.00 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 12 &lt;/td&gt; &lt;td&gt; Zone &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td align="right"&gt; 1451.40 &lt;/td&gt; &lt;td align="right"&gt; 206 &lt;/td&gt; &lt;td align="right"&gt; 7.10 &lt;/td&gt; &lt;td align="right"&gt; 1.21 &lt;/td&gt; &lt;td align="right"&gt; 8.62 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 22 &lt;/td&gt; &lt;td&gt; Zone &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td align="right"&gt; 2486.30 &lt;/td&gt; &lt;td align="right"&gt; 209 &lt;/td&gt; &lt;td align="right"&gt; 4.17 &lt;/td&gt; &lt;td align="right"&gt; 1.07 &lt;/td&gt; &lt;td align="right"&gt; 4.48 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 3 &lt;/td&gt; &lt;td&gt; Zone &lt;/td&gt; &lt;td&gt; 3 &lt;/td&gt; &lt;td align="right"&gt; 2888.70 &lt;/td&gt; &lt;td align="right"&gt; 132 &lt;/td&gt; &lt;td align="right"&gt; 2.23 &lt;/td&gt; &lt;td align="right"&gt; 1.07 &lt;/td&gt; &lt;td align="right"&gt; 2.38 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 4 &lt;/td&gt; &lt;td&gt; Zone &lt;/td&gt; &lt;td&gt; 4 &lt;/td&gt; &lt;td align="right"&gt; 10069.10 &lt;/td&gt; &lt;td align="right"&gt; 207 &lt;/td&gt; &lt;td align="right"&gt; 1.00 &lt;/td&gt; &lt;td align="right"&gt; 1.00 &lt;/td&gt; &lt;td align="right"&gt; 1.00 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 5 &lt;/td&gt; &lt;td&gt; Zone &lt;/td&gt; &lt;td&gt; 5 &lt;/td&gt; &lt;td align="right"&gt; 246.10 &lt;/td&gt; &lt;td align="right"&gt;   6 &lt;/td&gt; &lt;td align="right"&gt; 1.20 &lt;/td&gt; &lt;td align="right"&gt; 1.21 &lt;/td&gt; &lt;td align="right"&gt; 1.46 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 6 &lt;/td&gt; &lt;td&gt; Zone &lt;/td&gt; &lt;td&gt; 6 &lt;/td&gt; &lt;td align="right"&gt; 1369.20 &lt;/td&gt; &lt;td align="right"&gt;  23 &lt;/td&gt; &lt;td align="right"&gt; 0.79 &lt;/td&gt; &lt;td align="right"&gt; 0.98 &lt;/td&gt; &lt;td align="right"&gt; 0.78 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 7 &lt;/td&gt; &lt;td&gt; Zone &lt;/td&gt; &lt;td&gt; 7 &lt;/td&gt; &lt;td align="right"&gt; 147.50 &lt;/td&gt; &lt;td align="right"&gt;   3 &lt;/td&gt; &lt;td align="right"&gt; 1.00 &lt;/td&gt; &lt;td align="right"&gt; 1.20 &lt;/td&gt; &lt;td align="right"&gt; 1.20 &lt;/td&gt; &lt;/tr&gt;&#xD;
   &lt;/table&gt;&#xD;
&#xD;
&#xD;
&lt;hr&gt;&lt;/hr&gt;&#xD;
&#xD;
&lt;h2&gt;2.4 Case Study: Motorcycle Insurance&lt;/h2&gt;&#xD;
&#xD;
&lt;p&gt;The authors use the term Case Study for larger exercises.  There is quite a lot to cover here, and some of the questions are more about discussing the results. We will focus on getting the results.&lt;/p&gt;&#xD;
&#xD;
&lt;h3&gt;The Data&lt;/h3&gt;&#xD;
&#xD;
&lt;h4&gt;Obtaining the data&lt;/h4&gt;&#xD;
&#xD;
&lt;p&gt;First we read the (fixed width) data. The column names are the same Swedish-based names as in the book.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;columns &amp;lt;- c(agarald = 2L, kon = 1L, zon = 1L, mcklass = 1L, fordald = 2L,&#xD;
             bonuskl = 1L, duration = 8L, antskad = 4L, skadkost = 8L)&#xD;
column.classes &amp;lt;- c("integer", rep("factor", 3), "integer",&#xD;
                    "factor", "numeric", rep("integer", 2))&#xD;
stopifnot(length(columns) == length(column.classes))&#xD;
con &amp;lt;- url("http://www2.math.su.se/~esbj/GLMbook/mccase.txt")&#xD;
mccase &amp;lt;- read.fwf(con, widths = columns, header = FALSE,&#xD;
                   col.names = names(columns),&#xD;
                   colClasses = column.classes,&#xD;
                   na.strings = NULL, comment.char = "")&#xD;
try(close(con), silent = TRUE)&#xD;
rm(columns, column.classes, con)&#xD;
mccase$mcklass &amp;lt;- ordered(mccase$mcklass)&#xD;
mccase$bonuskl &amp;lt;- ordered(mccase$bonuskl)&#xD;
&lt;/pre&gt;&#xD;
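&#xD;
&lt;p&gt;If the download is not available, the fixed-width idiom can be tried on a couple of made-up lines. In this sketch the lines, the column names, and the widths are all hypothetical and much shorter than the real file; the point is only that a named &lt;code&gt;widths&lt;/code&gt; vector drives both the splitting and the column names.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;## Two made-up fixed-width records (an assumption for illustration):&#xD;
## age (2 chars), sex (1), zone (1), number of claims (3, padded).&#xD;
lines &amp;lt;- c("25M1  2",&#xD;
           "40K3 10")&#xD;
widths &amp;lt;- c(age = 2L, sex = 1L, zone = 1L, n = 3L)&#xD;
&#xD;
con &amp;lt;- textConnection(lines)&#xD;
toy &amp;lt;- read.fwf(con, widths = widths, header = FALSE,&#xD;
                col.names = names(widths),&#xD;
                colClasses = c("integer", "factor", "factor", "integer"))&#xD;
close(con)&#xD;
str(toy)&#xD;
&lt;/pre&gt;&#xD;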
&#xD;
&lt;h4&gt;Adding meta-data information&lt;/h4&gt;&#xD;
&#xD;
&lt;p&gt;We are more than a little obsessed with documenting our data, so here goes. See the book for the details.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;comment(mccase) &amp;lt;-&#xD;
    c("Title: Partial casco insurance for motorcycles from Wasa, 1994--1998",&#xD;
      "Source: http://www2.math.su.se/~esbj/GLMbook/mccase.txt",&#xD;
      "Copyright: http://www2.math.su.se/~esbj/GLMbook/")&#xD;
comment(mccase$agarald) &amp;lt;-&#xD;
    c("The owner's age, between 0 and 99",&#xD;
      "Name: Age of Owner")&#xD;
comment(mccase$kon) &amp;lt;-&#xD;
    c("Name: Gender of Owner",&#xD;
      "Code: M=Male",&#xD;
      "Code: K=Female")&#xD;
comment(mccase$zon) &amp;lt;-&#xD;
    c("Name: Geographic Zone",&#xD;
      "Code: 1=Central and semi-central parts of Sweden's three largest cities",&#xD;
      "Code: 2=suburbs and middle-sized towns",&#xD;
      "Code: 3=Lesser towns, except those in 5 or 7",&#xD;
      "Code: 4=Small towns and countryside, except 5--7",&#xD;
      "Code: 5=Northern towns",&#xD;
      "Code: 6=Northern countryside",&#xD;
      "Code: 7=Gotland (Sweden's largest island)")&#xD;
comment(mccase$mcklass) &amp;lt;-&#xD;
    c("Name: MC Class",&#xD;
      "Description: A classification by the EV ratio, defined as (engine power in kW × 100) / (vehicle weight in kg + 75), rounded to the nearest lower integer.",&#xD;
      "Code: 1=EV ratio &amp;lt;= 5",&#xD;
      "Code: 2=EV ratio 6--8",&#xD;
      "Code: 3=EV ratio 9--12",&#xD;
      "Code: 4=EV ratio 13--15",&#xD;
      "Code: 5=EV ratio 16--19",&#xD;
      "Code: 6=EV ratio 20--24",&#xD;
      "Code: 7=EV ratio &amp;gt;= 25")&#xD;
comment(mccase$fordald) &amp;lt;-&#xD;
    c("Vehicle age, between 0 and 99",&#xD;
      "Name: Vehicle Age")&#xD;
comment(mccase$bonuskl) &amp;lt;-&#xD;
    c("Name: Bonus Class",&#xD;
      "Description: A driver starts with bonus class 1; for each claim-free year the bonus class is increased by 1. After the first claim the bonus is decreased by 2; the driver cannot return to class 7 with less than 6 consecutive claim free years.")&#xD;
comment(mccase$duration) &amp;lt;-&#xD;
    c("Name: Duration",&#xD;
      "Comment: The number of policy years",&#xD;
      "Unit: year")&#xD;
comment(mccase$antskad) &amp;lt;-&#xD;
    c("Name: Number of Claims")&#xD;
comment(mccase$skadkost) &amp;lt;-&#xD;
    c("Name: Cost of Claims",&#xD;
      "Unit: SEK")&#xD;
&lt;/pre&gt;&#xD;
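&#xD;
&lt;p&gt;As a reminder of how &lt;code&gt;comment()&lt;/code&gt; behaves, here is a small sketch on a made-up data frame (the name &lt;code&gt;toy&lt;/code&gt; and its contents are hypothetical). Comments are stored as an attribute and are not shown when the object is printed, so they have to be retrieved explicitly.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;## Made-up data frame (an assumption for illustration)&#xD;
toy &amp;lt;- data.frame(age = c(25L, 40L))&#xD;
comment(toy) &amp;lt;- c("Title: Made-up example", "Source: none")&#xD;
comment(toy$age) &amp;lt;- c("Name: Age of Owner", "Unit: year")&#xD;
&#xD;
print(toy)        # the comments are not part of the printed output&#xD;
comment(toy)      # comments on the data frame as a whole&#xD;
comment(toy$age)  # comments on a single column&#xD;
&lt;/pre&gt;&#xD;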
&#xD;
&lt;h4&gt;Rating factors&lt;/h4&gt;&#xD;
&#xD;
&lt;p&gt;We finally add the rating factors from the book.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;mccase$rating.1 &amp;lt;- mccase$zon&#xD;
mccase$rating.2 &amp;lt;- mccase$mcklass&#xD;
mccase$rating.3 &amp;lt;-&#xD;
    cut(mccase$fordald, breaks = c(0, 1, 4, 99),&#xD;
        labels = as.character(1:3), include.lowest = TRUE,&#xD;
        ordered_result = TRUE)&#xD;
mccase$rating.4 &amp;lt;- ordered(mccase$bonuskl) # Drop comments&#xD;
levels(mccase$rating.4) &amp;lt;-                 # Combine levels&#xD;
    c("1", "1", "2", "2", rep("3", 3))&#xD;
&lt;/pre&gt;&#xD;
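&#xD;
&lt;p&gt;The two idioms used here, binning with &lt;code&gt;cut()&lt;/code&gt; and merging levels by assigning duplicated names to &lt;code&gt;levels()&lt;/code&gt;, can be checked on a few made-up values. The vectors &lt;code&gt;age&lt;/code&gt; and &lt;code&gt;bonus&lt;/code&gt; below are hypothetical and not taken from the data.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;## Made-up vehicle ages (an assumption for illustration)&#xD;
age &amp;lt;- c(0L, 1L, 2L, 4L, 5L, 30L)&#xD;
age.class &amp;lt;- cut(age, breaks = c(0, 1, 4, 99),&#xD;
                 labels = as.character(1:3), include.lowest = TRUE,&#xD;
                 ordered_result = TRUE)&#xD;
## Ages 0-1 fall in class 1, ages 2-4 in class 2, older ones in class 3.&#xD;
print(data.frame(age, age.class))&#xD;
&#xD;
## Seven bonus classes merged into three by repeating level names.&#xD;
bonus &amp;lt;- ordered(1:7)&#xD;
levels(bonus) &amp;lt;- c("1", "1", "2", "2", rep("3", 3))&#xD;
table(bonus)&#xD;
&lt;/pre&gt;&#xD;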
&#xD;
&lt;h4&gt;Save the data&lt;/h4&gt;&#xD;
&#xD;
&lt;p&gt;Never forget to save.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;save(mccase, file = "mccase.RData")&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;Now we can start tackling the exercises.&lt;/p&gt;&#xD;
&#xD;
&lt;h3&gt;Problem 1: Aggregate to cells of current tariff&lt;/h3&gt;&#xD;
&#xD;
&lt;blockquote&gt;&#xD;
  &lt;p&gt;Aggregate the data to the cells of the current&#xD;
tariff. Compute the empirical claim frequency and severity at this&#xD;
level.&lt;/p&gt;&#xD;
&lt;/blockquote&gt;&#xD;
&#xD;
&lt;p&gt;We really like the &lt;code&gt;data.table&lt;/code&gt; package. The syntax is much easier on the eye (and the hand). As a bonus, it scales well to much larger data sets than &lt;code&gt;data.frame&lt;/code&gt;. If you are not already using it, now is the time to start.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;if (!exists("mccase"))&#xD;
    load("mccase.RData")&#xD;
&#xD;
## Convert to data.table&#xD;
library("data.table")&#xD;
mccase &amp;lt;- data.table(mccase, key = paste("rating", 1:4, sep = "."))&#xD;
&#xD;
## Aggregate to levels of current rating factors&#xD;
mccase.current &amp;lt;-&#xD;
    mccase[,&#xD;
           list(duration = sum(duration),&#xD;
                antskad = sum(antskad),&#xD;
                skadkost = sum(skadkost),&#xD;
                num.policies = .N),&#xD;
           by = key(mccase)]             &#xD;
&#xD;
## Claim frequency and severity. Change NaN to NA.&#xD;
mccase.current$claim.freq &amp;lt;-&#xD;
    with(mccase.current, ifelse(duration != 0, antskad / duration, NA_real_))&#xD;
mccase.current$severity &amp;lt;-&#xD;
    with(mccase.current, ifelse(antskad != 0, skadkost / antskad, NA_real_))&#xD;
&#xD;
## Save&#xD;
save(mccase.current, file = "mccase.current.RData")&#xD;
&lt;/pre&gt;&#xD;
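&#xD;
&lt;p&gt;The same aggregation pattern can be followed by hand on a tiny made-up table. In this sketch the table &lt;code&gt;toy&lt;/code&gt;, the single key column &lt;code&gt;zone&lt;/code&gt;, and the result &lt;code&gt;agg&lt;/code&gt; are hypothetical names. The second cell has no claims, which shows why the severity is set to &lt;code&gt;NA&lt;/code&gt; rather than left as &lt;code&gt;NaN&lt;/code&gt;.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;library("data.table")&#xD;
&#xD;
## Made-up policies (an assumption for illustration)&#xD;
toy &amp;lt;- data.table(zone = c("1", "1", "2", "2"),&#xD;
                  duration = c(0.5, 1.5, 2, 0),&#xD;
                  antskad = c(0L, 1L, 0L, 0L),&#xD;
                  skadkost = c(0L, 1200L, 0L, 0L),&#xD;
                  key = "zone")&#xD;
&#xD;
## One row per cell of the key, with totals and a policy count&#xD;
agg &amp;lt;- toy[,&#xD;
           list(duration = sum(duration),&#xD;
                antskad = sum(antskad),&#xD;
                skadkost = sum(skadkost),&#xD;
                num.policies = .N),&#xD;
           by = key(toy)]&#xD;
&#xD;
## Claim frequency per policy year and average cost per claim&#xD;
agg$claim.freq &amp;lt;-&#xD;
    with(agg, ifelse(duration != 0, antskad / duration, NA_real_))&#xD;
agg$severity &amp;lt;-&#xD;
    with(agg, ifelse(antskad != 0, skadkost / antskad, NA_real_))&#xD;
print(agg)&#xD;
&lt;/pre&gt;&#xD;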
&#xD;
&lt;h3&gt;Problem 2: Determine how the duration and number of claims is&#xD;
distributed&lt;/h3&gt;&#xD;
&#xD;
&lt;blockquote&gt;&#xD;
  &lt;p&gt;Determine how the duration and number of claims is&#xD;
distributed for each of the rating factor classes, as an indication&#xD;
of the accuracy of the statistical analysis.&lt;/p&gt;&#xD;
&lt;/blockquote&gt;&#xD;
&#xD;
&lt;p&gt;This is one of those 'there is no right answer' questions, but let us have a look at the data. First load it if needed and also load the libraries we will be using.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;## Load data if needed&#xD;
library("data.table")&#xD;
if (!exists("mccase"))&#xD;
    load("mccase.RData")&#xD;
if (!exists("mccase.current"))&#xD;
    load("mccase.current.RData")&#xD;
if (!is(mccase, "data.table"))&#xD;
    mccase &amp;lt;- data.table(mccase, key = paste("rating", 1:4, sep = "."))&#xD;
&#xD;
library("grid")&#xD;
library("ggplot2")&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;h4&gt;1. Number of claims (antskad)&lt;/h4&gt;&#xD;
&#xD;
&lt;p&gt;Let us start with the number of claims for the underlying data. First we plot the number of policies for each number of claims (range of number of claims is {0, 1, 2}); this is one way to do it:&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;plot.titles &amp;lt;- c("Geo. zone", "MC class", "Vehicle age", "Bonus class")&#xD;
plots &amp;lt;-&#xD;
    lapply(1:4,&#xD;
           function(i)&#xD;
           ggplot(mccase, aes(antskad))&#xD;
           + geom_histogram(aes(weight = duration))&#xD;
           + scale_x_discrete(limits = c(0, 2))&#xD;
           + scale_y_log10()&#xD;
           + facet_grid(paste("rating.", i, " ~ .", sep = ""),&#xD;
                        scales = "fixed")&#xD;
           ## We drop the axis titles to make more room for the data&#xD;
           + opts(axis.title.x = theme_blank(), axis.title.y = theme_blank(),&#xD;
                  axis.text.x = theme_blank(),  axis.text.y = theme_blank(),&#xD;
                  axis.ticks = theme_blank(),&#xD;
                  title = plot.titles[i])&#xD;
           )&#xD;
&#xD;
grid.newpage()&#xD;
pushViewport(viewport(layout = grid.layout(nrow = 1, ncol = 4)))&#xD;
## We can ignore the warnings from displaying the plots for now&#xD;
for (i in 1:4)&#xD;
    print(plots[[i]], vp = viewport(layout.pos.col = i))&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;div class="floatCenter" style="width:400px"&gt;&#xD;
&lt;img src="http://static.cybaea.net/files/PricingGLM/PricingGLM-2-1.png" width="400" height="400"&gt;&lt;/img&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;p&gt;There is certainly some variation there, especially on the first rating factor.&lt;/p&gt;&#xD;
&lt;p&gt;Recall from the discussion on page 18 the assumption that the&#xD;
claims for a single policy follow a Poisson process, so that the&#xD;
number of claims in a tariff cell is Poisson distributed.  Let us&#xD;
first look at the whole data set (note that the data.table notation&#xD;
is used; if using data.frame be sure to have drop=FALSE):&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;data &amp;lt;- mccase[order(antskad), list(N = .N, w = sum(duration)), by = antskad]&#xD;
M &amp;lt;- glm(N ~ antskad, family = poisson(), weights = w, data = data)&#xD;
data$predicted &amp;lt;- round(predict(M, data[, list(antskad)], type = "response"))&#xD;
print(data[, list(antskad, N, predicted)], digits = 1)&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;table&gt;&#xD;
&lt;tr&gt; &lt;th&gt;  &lt;/th&gt; &lt;th&gt; antskad &lt;/th&gt; &lt;th&gt; N &lt;/th&gt; &lt;th&gt; predicted &lt;/th&gt;  &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 1 &lt;/td&gt; &lt;td align="right"&gt;  0 &lt;/td&gt; &lt;td align="right"&gt; 63878 &lt;/td&gt; &lt;td align="right"&gt; 63878.0 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 2 &lt;/td&gt; &lt;td align="right"&gt;  1 &lt;/td&gt; &lt;td align="right"&gt; 643 &lt;/td&gt; &lt;td align="right"&gt; 646.0 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 3 &lt;/td&gt; &lt;td align="right"&gt;  2 &lt;/td&gt; &lt;td align="right"&gt; 27 &lt;/td&gt; &lt;td align="right"&gt; 7.0 &lt;/td&gt; &lt;/tr&gt;&#xD;
   &lt;/table&gt;&#xD;
&#xD;
&lt;p&gt;Not too bad, but then it can't really be with that heavy weighting&#xD;
to the first value of antskad. We can show the same split by the&#xD;
levels of the first rating factor as an example:&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;data &amp;lt;-&#xD;
    mccase[order(rating.1, antskad),&#xD;
           list(N = .N, w = sum(duration)),&#xD;
           by = list(rating.1, antskad)]&#xD;
modelPoisson &amp;lt;- function(data) {&#xD;
    M &amp;lt;- glm(N ~ antskad, family = poisson(), weights = w, data = data)&#xD;
    return(M)&#xD;
}&#xD;
predictionPoisson &amp;lt;- function (data) {&#xD;
    M &amp;lt;- modelPoisson(data)&#xD;
    p &amp;lt;- predict(M, data[, list(antskad)], type = "response")&#xD;
    return(p)&#xD;
}&#xD;
data$predicted &amp;lt;-&#xD;
    unlist(lapply(levels(data$rating.1),&#xD;
                  function (l) round(predictionPoisson(data[rating.1 == l]))))&#xD;
print(data[, list(rating.1, antskad, N, predicted)], digits = 1)&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;table&gt;&#xD;
&lt;tr&gt; &lt;th&gt;  &lt;/th&gt; &lt;th&gt; rating.1 &lt;/th&gt; &lt;th&gt; antskad &lt;/th&gt; &lt;th&gt; N &lt;/th&gt; &lt;th&gt; predicted &lt;/th&gt;  &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 1 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td align="right"&gt;  0 &lt;/td&gt; &lt;td align="right"&gt; 8409 &lt;/td&gt; &lt;td align="right"&gt; 8409.0 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 2 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td align="right"&gt;  1 &lt;/td&gt; &lt;td align="right"&gt; 163 &lt;/td&gt; &lt;td align="right"&gt; 164.0 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 3 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td align="right"&gt;  2 &lt;/td&gt; &lt;td align="right"&gt; 10 &lt;/td&gt; &lt;td align="right"&gt; 3.0 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 4 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td align="right"&gt;  0 &lt;/td&gt; &lt;td align="right"&gt; 11632 &lt;/td&gt; &lt;td align="right"&gt; 11632.0 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 5 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td align="right"&gt;  1 &lt;/td&gt; &lt;td align="right"&gt; 157 &lt;/td&gt; &lt;td align="right"&gt; 157.0 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 6 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td align="right"&gt;  2 &lt;/td&gt; &lt;td align="right"&gt;  5 &lt;/td&gt; &lt;td align="right"&gt; 2.0 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 7 &lt;/td&gt; &lt;td&gt; 3 &lt;/td&gt; &lt;td align="right"&gt;  0 &lt;/td&gt; &lt;td align="right"&gt; 12604 &lt;/td&gt; &lt;td align="right"&gt; 12604.0 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 8 &lt;/td&gt; &lt;td&gt; 3 &lt;/td&gt; &lt;td align="right"&gt;  1 &lt;/td&gt; &lt;td align="right"&gt; 113 &lt;/td&gt; &lt;td align="right"&gt; 113.0 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 9 &lt;/td&gt; &lt;td&gt; 3 &lt;/td&gt; &lt;td align="right"&gt;  2 &lt;/td&gt; &lt;td align="right"&gt;  5 &lt;/td&gt; &lt;td align="right"&gt; 1.0 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 10 &lt;/td&gt; &lt;td&gt; 4 &lt;/td&gt; &lt;td align="right"&gt;  0 &lt;/td&gt; &lt;td align="right"&gt; 24626 &lt;/td&gt; &lt;td align="right"&gt; 24626.0 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 11 &lt;/td&gt; &lt;td&gt; 4 &lt;/td&gt; &lt;td align="right"&gt;  1 &lt;/td&gt; &lt;td align="right"&gt; 184 &lt;/td&gt; &lt;td align="right"&gt; 185.0 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 12 &lt;/td&gt; &lt;td&gt; 4 &lt;/td&gt; &lt;td align="right"&gt;  2 &lt;/td&gt; &lt;td align="right"&gt;  6 &lt;/td&gt; &lt;td align="right"&gt; 1.0 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 13 &lt;/td&gt; &lt;td&gt; 5 &lt;/td&gt; &lt;td align="right"&gt;  0 &lt;/td&gt; &lt;td align="right"&gt; 2368 &lt;/td&gt; &lt;td align="right"&gt; 2368.0 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 14 &lt;/td&gt; &lt;td&gt; 5 &lt;/td&gt; &lt;td align="right"&gt;  1 &lt;/td&gt; &lt;td align="right"&gt;  9 &lt;/td&gt; &lt;td align="right"&gt; 9.0 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 15 &lt;/td&gt; &lt;td&gt; 6 &lt;/td&gt; &lt;td align="right"&gt;  0 &lt;/td&gt; &lt;td align="right"&gt; 3867 &lt;/td&gt; &lt;td align="right"&gt; 3867.0 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 16 &lt;/td&gt; &lt;td&gt; 6 &lt;/td&gt; &lt;td align="right"&gt;  1 &lt;/td&gt; &lt;td align="right"&gt; 16 &lt;/td&gt; &lt;td align="right"&gt; 16.0 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 17 &lt;/td&gt; &lt;td&gt; 6 &lt;/td&gt; &lt;td align="right"&gt;  2 &lt;/td&gt; &lt;td align="right"&gt;  1 &lt;/td&gt; &lt;td align="right"&gt; 0.0 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 18 &lt;/td&gt; &lt;td&gt; 7 &lt;/td&gt; &lt;td align="right"&gt;  0 &lt;/td&gt; &lt;td align="right"&gt; 372 &lt;/td&gt; &lt;td align="right"&gt; 372.0 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 19 &lt;/td&gt; &lt;td&gt; 7 &lt;/td&gt; &lt;td align="right"&gt;  1 &lt;/td&gt; &lt;td align="right"&gt;  1 &lt;/td&gt; &lt;td align="right"&gt; 1.0 &lt;/td&gt; &lt;/tr&gt;&#xD;
   &lt;/table&gt;&#xD;
&#xD;
&lt;p&gt;You get the idea....&lt;/p&gt;&#xD;
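&#xD;
&lt;p&gt;As a rough cross-check on the whole-data-set table above, one can also compare the observed counts with a plain Poisson distribution whose mean is the observed number of claims per row. This is only a sketch and not the GLM used in this post: it ignores duration, and the names obs, lambda and expected are made up for the illustration.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;## Observed number of rows with 0, 1 and 2 claims (whole data set table)&#xD;
obs &amp;lt;- c("0" = 63878, "1" = 643, "2" = 27)&#xD;
k &amp;lt;- as.integer(names(obs))&#xD;
## Moment estimate of the Poisson mean: total claims / total rows&#xD;
lambda &amp;lt;- sum(k * obs) / sum(obs)&#xD;
## Expected number of rows with k claims under Poisson(lambda)&#xD;
expected &amp;lt;- sum(obs) * dpois(k, lambda)&#xD;
print(round(rbind(observed = obs, expected = expected), 1))&#xD;
&lt;/pre&gt;&#xD;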
&#xD;
&lt;p&gt;If we can't really see very much at the individual claims level,&#xD;
how does it appear when we look at the levels aggregated to the&#xD;
current tariff cells? Here, of course, we have much more data with&#xD;
up to 33 claims in one cell, but we will limit the display to the&#xD;
lowest few claim counts.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;high.limit &amp;lt;- 7L                 # Only display up to this many claims&#xD;
plots &amp;lt;-&#xD;
    lapply(1:4,&#xD;
           function(i)&#xD;
           ggplot(mccase.current, aes(antskad))&#xD;
           + geom_histogram(breaks = 0:high.limit, aes(weight = duration))&#xD;
           + scale_x_discrete(limits = c(0, high.limit), breaks = 0:high.limit)&#xD;
           ## Following xlim() needed for scale_x_discrete only; see&#xD;
           ## https://groups.google.com/d/msg/ggplot2/wLWGCUz8K6k/DeVudyfXyKgJ&#xD;
           + xlim(0, high.limit)&#xD;
           + scale_y_log10()&#xD;
           + facet_grid(paste("rating.", i, " ~ .", sep = ""),&#xD;
                        scales = "fixed")&#xD;
           ## We drop the axis titles to make more room for the data&#xD;
           + opts(axis.title.x = theme_blank(), axis.title.y = theme_blank(),&#xD;
                  axis.text.x = theme_blank(),  axis.text.y = theme_blank(),&#xD;
                  axis.ticks = theme_blank(),&#xD;
                  title = plot.titles[i])&#xD;
           )&#xD;
&#xD;
grid.newpage()&#xD;
pushViewport(viewport(layout = grid.layout(nrow = 1, ncol = 4)))&#xD;
for (i in 1:4)&#xD;
    print(plots[[i]], vp = viewport(layout.pos.col = i))&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;div class="floatCenter" style="width:400px"&gt;&#xD;
&lt;img src="http://static.cybaea.net/files/PricingGLM/PricingGLM-2-2.png" width="400" height="400"&gt;&lt;/img&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
There is certainly something going on for low bonus classes and&#xD;
high vehicle ages that is not a straightforward Poisson&#xD;
distribution. Looking at the two in combination, we see that the old&#xD;
motorcycle, low bonus class cell (rating.3 = 3 and rating.4 = 1) is the&#xD;
main culprit, but also note that there is relatively little data in&#xD;
these cells.&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;ggplot(mccase.current, aes(antskad)) +&#xD;
    geom_histogram(breaks = 0:high.limit, aes(weight = duration)) +&#xD;
    scale_x_discrete(limits = c(0, high.limit), breaks = 0:high.limit) +&#xD;
    xlim(0, high.limit) +&#xD;
    scale_y_log10() +&#xD;
    facet_grid(rating.4 ~ rating.3, scales = "fixed", labeller = "label_both")&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;div class="floatCenter" style="width:400px"&gt;&#xD;
&lt;img src="http://static.cybaea.net/files/PricingGLM/PricingGLM-2-3.png" width="400" height="400"&gt;&lt;/img&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
We can fit the Poisson distribution using the same&#xD;
predictionPoisson() function as before. We also look at the fit in&#xD;
a simple way (the - much! - better approach is to use the residuals from the&#xD;
fitted model).&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;## rating.3 is the vehicle age.&#xD;
data &amp;lt;-&#xD;
    mccase.current[order(rating.3, antskad),&#xD;
                   list(N = .N, w = sum(duration)),&#xD;
                   by = list(rating.3, antskad)]&#xD;
data$predicted &amp;lt;-&#xD;
    unlist(lapply(levels(data$rating.3),&#xD;
                  function (l) round(predictionPoisson(data[rating.3 == l]))))&#xD;
## Show the fit&#xD;
print(data[antskad &amp;lt;= high.limit,&#xD;
           list(rating.3, antskad, N, predicted)], digits = 1)&#xD;
## Show simplistic residuals per level of the rating factor&#xD;
print(data[antskad &amp;lt;= high.limit,&#xD;
           list(res = sum(abs(predicted - N))),&#xD;
           by = rating.3])&#xD;
&#xD;
## rating.4 is the bonus class&#xD;
data &amp;lt;-&#xD;
    mccase.current[order(rating.4, antskad),&#xD;
                   list(N = .N, w = sum(duration)),&#xD;
                   by = list(rating.4, antskad)]&#xD;
data$predicted &amp;lt;-&#xD;
    unlist(lapply(levels(data$rating.4),&#xD;
                  function (l) round(predictionPoisson(data[rating.4 == l]))))&#xD;
## Show the fit&#xD;
print(data[antskad &amp;lt;= high.limit,&#xD;
           list(rating.4, antskad, N, predicted)], digits = 1)&#xD;
## Show simplistic residuals per level of the rating factor&#xD;
print(data[antskad &amp;lt;= high.limit,&#xD;
           list(res = sum(abs(predicted - N))),&#xD;
           by = rating.4])&#xD;
&lt;/pre&gt;&#xD;
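&#xD;
&lt;p&gt;The residual-based check mentioned above could look like the following sketch. It refits the whole-data-set counts from the first table without the duration weights (an assumption made to keep the example self-contained), so the fitted values differ from those shown earlier. The names toy, M.toy and res are made up for the illustration.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;## Counts of rows with 0, 1 and 2 claims, copied from the first table&#xD;
toy &amp;lt;- data.frame(antskad = 0:2, N = c(63878, 643, 27))&#xD;
M.toy &amp;lt;- glm(N ~ antskad, family = poisson(), data = toy)&#xD;
## Deviance residuals: large absolute values flag poorly fitted counts&#xD;
res &amp;lt;- residuals(M.toy, type = "deviance")&#xD;
print(cbind(toy, fitted = round(fitted(M.toy), 1), residual = round(res, 2)))&#xD;
&lt;/pre&gt;&#xD;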
&#xD;
&lt;p&gt;We will not show the tables here (as we said: there is a better way). Instead, we can show the fitted Poisson distribution:&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;ggplot(data, aes(x = antskad, y = N)) +&#xD;
    geom_bar(breaks = 0L:high.limit, stat = "identity") +&#xD;
    facet_wrap( ~ rating.4) +&#xD;
    geom_line(aes(y = predicted), data = data, colour = "red", size = 1) +&#xD;
    xlim(-0.5, high.limit + 0.5)&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;div class="floatCenter" style="width:400px"&gt;&#xD;
&lt;img src="http://static.cybaea.net/files/PricingGLM/PricingGLM-2-4.png" width="400" height="400"&gt;&lt;/img&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;h3&gt;2. Duration&lt;/h3&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
We can plot the distribution of duration as before. Here 440 is the&#xD;
90% quantile for the duration (412) rounded up to the next&#xD;
multiple of the bin width (40).&#xD;
&lt;/p&gt;&#xD;
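&#xD;
&lt;p&gt;The upper limit can be reproduced with one line of arithmetic. This sketch takes the 412 quoted above as given rather than recomputing the quantile; q90 and upper are names made up for the illustration.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;binwidth &amp;lt;- 40&#xD;
q90 &amp;lt;- 412                       # 90% quantile of duration, as quoted above&#xD;
## Round up to the next multiple of the bin width&#xD;
upper &amp;lt;- ceiling(q90 / binwidth) * binwidth&#xD;
upper                            # 440&#xD;
&lt;/pre&gt;&#xD;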
&#xD;
&lt;pre class="document"&gt;plots &amp;lt;-&#xD;
    lapply(1:4,&#xD;
           function(i)&#xD;
           ggplot(mccase.current, aes(duration))&#xD;
           + geom_histogram(binwidth = 40)&#xD;
           + xlim(0, 440)&#xD;
           + scale_y_log10()&#xD;
           + facet_grid(paste("rating.", i, " ~ .", sep = ""),&#xD;
                        scales = "fixed")&#xD;
           ## We drop the axis titles to make more room for the data&#xD;
           + opts(axis.title.x = theme_blank(), axis.title.y = theme_blank(),&#xD;
                  axis.text.x = theme_blank(),  axis.text.y = theme_blank(),&#xD;
                  axis.ticks = theme_blank(),&#xD;
                  title = plot.titles[i])&#xD;
           )&#xD;
&#xD;
grid.newpage()&#xD;
pushViewport(viewport(layout = grid.layout(nrow = 1, ncol = 4)))&#xD;
## We can ignore the warnings from displaying the plots for now&#xD;
for (i in 1:4)&#xD;
    print(plots[[i]], vp = viewport(layout.pos.col = i))&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;div class="floatCenter" style="width:400px"&gt;&#xD;
&lt;img src="http://static.cybaea.net/files/PricingGLM/PricingGLM-2-5.png" width="400" height="400"&gt;&lt;/img&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
However, in this case it is much more illuminating to look at the overall distribution:&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;ggplot(mccase, aes(duration)) +&#xD;
    geom_histogram(binwidth = 0.05) +&#xD;
    xlim(0, 3) +&#xD;
    opts(title = "Duration of motorcycle policies") +&#xD;
    annotate("text", 3.0, 1e4, label = "Histogram bin width = 0.05",&#xD;
             size = 3, hjust = 1, vjust = 1)&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;div class="floatCenter" style="width:400px"&gt;&#xD;
&lt;img src="http://static.cybaea.net/files/PricingGLM/PricingGLM-2-6.png" width="400" height="400"&gt;&lt;/img&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
Big spikes near 1/2 and 1 year (and in general spikes around 1/2&#xD;
year multiples) suggest that either (1) this is not a random&#xD;
sample of policies or (2) there is a strong seasonal effect in the&#xD;
time of year people initially sign up for the policy.&#xD;
&lt;/p&gt;&#xD;
&#xD;
&#xD;
&#xD;
&lt;h3&gt;Problem 3: Determine the relativities for claim frequency and severity&lt;/h3&gt;&#xD;
&#xD;
&lt;blockquote&gt;&#xD;
  &lt;p&gt;Determine the relativities for claim frequency and&#xD;
severity separately, by using GLMs; use the results to get&#xD;
relativities for the pure premium.&lt;/p&gt;&#xD;
&lt;/blockquote&gt;&#xD;
&#xD;
&lt;p&gt;We first load the data (if needed) and create a data frame to hold the output, following the structure of Table 2.8 in the book.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;library("data.table")&#xD;
if (!exists("mccase.current"))&#xD;
    load("mccase.current.RData")&#xD;
&#xD;
case.2.4 &amp;lt;-&#xD;
    data.frame(rating.factor =&#xD;
               c(rep("Zone",        nlevels(mccase.current$rating.1)),&#xD;
                 rep("MC class",    nlevels(mccase.current$rating.2)),&#xD;
                 rep("Vehicle age", nlevels(mccase.current$rating.3)),&#xD;
                 rep("Bonus class", nlevels(mccase.current$rating.4))),&#xD;
               class =&#xD;
               with(mccase.current,&#xD;
                    c(levels(rating.1), levels(rating.2),&#xD;
                      levels(rating.3), levels(rating.4))),&#xD;
               ## These are the values from Table 2.8 in the book:&#xD;
               relativity =&#xD;
               c(7.678, 4.227, 1.336, 1.000, 1.734, 1.402, 1.402,&#xD;
                 0.625, 0.769, 1.000, 1.406, 1.875, 4.062, 6.873,&#xD;
                 2.000, 1.200, 1.000,&#xD;
                 1.250, 1.125, 1.000),&#xD;
               stringsAsFactors = FALSE)&#xD;
print(case.2.4, digits = 3)&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;table&gt;&#xD;
&lt;tr&gt; &lt;th&gt;  &lt;/th&gt; &lt;th&gt; rating.factor &lt;/th&gt; &lt;th&gt; class &lt;/th&gt; &lt;th&gt; relativity &lt;/th&gt;  &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 1 &lt;/td&gt; &lt;td&gt; Zone &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td align="right"&gt; 7.678 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 2 &lt;/td&gt; &lt;td&gt; Zone &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td align="right"&gt; 4.227 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 3 &lt;/td&gt; &lt;td&gt; Zone &lt;/td&gt; &lt;td&gt; 3 &lt;/td&gt; &lt;td align="right"&gt; 1.336 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 4 &lt;/td&gt; &lt;td&gt; Zone &lt;/td&gt; &lt;td&gt; 4 &lt;/td&gt; &lt;td align="right"&gt; 1.000 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 5 &lt;/td&gt; &lt;td&gt; Zone &lt;/td&gt; &lt;td&gt; 5 &lt;/td&gt; &lt;td align="right"&gt; 1.734 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 6 &lt;/td&gt; &lt;td&gt; Zone &lt;/td&gt; &lt;td&gt; 6 &lt;/td&gt; &lt;td align="right"&gt; 1.402 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 7 &lt;/td&gt; &lt;td&gt; Zone &lt;/td&gt; &lt;td&gt; 7 &lt;/td&gt; &lt;td align="right"&gt; 1.402 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 8 &lt;/td&gt; &lt;td&gt; MC class &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td align="right"&gt; 0.625 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 9 &lt;/td&gt; &lt;td&gt; MC class &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td align="right"&gt; 0.769 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 10 &lt;/td&gt; &lt;td&gt; MC class &lt;/td&gt; &lt;td&gt; 3 &lt;/td&gt; &lt;td align="right"&gt; 1.000 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 11 &lt;/td&gt; &lt;td&gt; MC class &lt;/td&gt; &lt;td&gt; 4 &lt;/td&gt; &lt;td align="right"&gt; 1.406 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 12 &lt;/td&gt; &lt;td&gt; MC class &lt;/td&gt; &lt;td&gt; 5 &lt;/td&gt; &lt;td align="right"&gt; 1.875 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 13 &lt;/td&gt; &lt;td&gt; MC class &lt;/td&gt; &lt;td&gt; 6 &lt;/td&gt; &lt;td align="right"&gt; 4.062 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 14 &lt;/td&gt; &lt;td&gt; MC class &lt;/td&gt; &lt;td&gt; 7 &lt;/td&gt; &lt;td align="right"&gt; 6.873 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 15 &lt;/td&gt; &lt;td&gt; Vehicle age &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td align="right"&gt; 2.000 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 16 &lt;/td&gt; &lt;td&gt; Vehicle age &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td align="right"&gt; 1.200 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 17 &lt;/td&gt; &lt;td&gt; Vehicle age &lt;/td&gt; &lt;td&gt; 3 &lt;/td&gt; &lt;td align="right"&gt; 1.000 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 18 &lt;/td&gt; &lt;td&gt; Bonus class &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td align="right"&gt; 1.250 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 19 &lt;/td&gt; &lt;td&gt; Bonus class &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td align="right"&gt; 1.125 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 20 &lt;/td&gt; &lt;td&gt; Bonus class &lt;/td&gt; &lt;td&gt; 3 &lt;/td&gt; &lt;td align="right"&gt; 1.000 &lt;/td&gt; &lt;/tr&gt;&#xD;
   &lt;/table&gt;&#xD;
&#xD;
&lt;p&gt;Close enough to the book.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
First we set the contrasts so the baseline for the models is the&#xD;
level with the highest duration. This is the same approach we used&#xD;
before.&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;library("foreach")&#xD;
new.cols &amp;lt;-&#xD;
    foreach (rating.factor = paste("rating", 1:4, sep = "."),&#xD;
             .combine = rbind) %do%&#xD;
{&#xD;
    totals &amp;lt;- mccase.current[, list(D = sum(duration),&#xD;
                                    N = sum(antskad),&#xD;
                                    C = sum(skadkost)),&#xD;
                             by = rating.factor]&#xD;
    n.levels &amp;lt;- nlevels(mccase.current[[rating.factor]])&#xD;
    contrasts(mccase.current[[rating.factor]]) &amp;lt;-&#xD;
        contr.treatment(n.levels)[rank(-totals[["D"]], ties.method = "first"), ]&#xD;
    data.frame(duration = totals[["D"]],&#xD;
               n.claims = totals[["N"]],&#xD;
               skadkost = totals[["C"]])&#xD;
}&#xD;
case.2.4 &amp;lt;- cbind(case.2.4, new.cols)&#xD;
rm(new.cols)&#xD;
&lt;/pre&gt;&#xD;
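&#xD;
&lt;p&gt;To see why indexing the contrast matrix with rank() makes the level with the largest duration the baseline, here is a small sketch using the zone durations (rounded to whole numbers) that appear in the comparison table further down. D, cm and baseline are names made up for the illustration.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;## Total duration for zones 1 to 7 (rounded)&#xD;
D &amp;lt;- c(6205, 10103, 11677, 32628, 1582, 2800, 241)&#xD;
## Reorder the rows of the treatment contrasts by descending duration rank&#xD;
cm &amp;lt;- contr.treatment(length(D))[rank(-D, ties.method = "first"), ]&#xD;
## The baseline is the level whose contrast row is all zero&#xD;
baseline &amp;lt;- unname(which(rowSums(cm) == 0))&#xD;
stopifnot(baseline == which.max(D))&#xD;
baseline                         # 4&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;Zone 4 has the largest duration and ends up with the all-zero row, which is consistent with its relativity of 1.000 in the tables.&lt;/p&gt;&#xD;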
&#xD;
&lt;p&gt;We next fit the models and combine the results as in Example 2.5 above. Here we also make a note of a simple measure of the goodness of fit: the residual deviance and the degrees of freedom. The frequency model is a good fit, the severity model is not.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;## Model the frequency&#xD;
&#xD;
model.frequency &amp;lt;-&#xD;
    glm(antskad ~&#xD;
        rating.1 + rating.2 + rating.3 + rating.4 + offset(log(duration)),&#xD;
        data = mccase.current[duration &amp;gt; 0], family = poisson)&#xD;
## Res. dev. 360 on 389 dof&#xD;
&#xD;
rels &amp;lt;- coef( model.frequency )&#xD;
rels &amp;lt;- exp( rels[1] + rels[-1] ) / exp( rels[1] )&#xD;
case.2.4$rels.frequency &amp;lt;-&#xD;
    c(c(1, rels[1:6])[rank(-case.2.4$duration[1:7], ties.method = "first")],&#xD;
      c(1, rels[7:12])[rank(-case.2.4$duration[8:14], ties.method = "first")],&#xD;
      c(1, rels[13:14])[rank(-case.2.4$duration[15:17], ties.method = "first")],&#xD;
      c(1, rels[15:16])[rank(-case.2.4$duration[18:20], ties.method = "first")])&#xD;
&#xD;
## Model the severity. We stick with the non-canonical link function&#xD;
## for the time being.&#xD;
&#xD;
model.severity &amp;lt;-&#xD;
    glm(skadkost ~ rating.1 + rating.2 + rating.3 + rating.4,&#xD;
        data = mccase.current[skadkost &amp;gt; 0,],&#xD;
        family = Gamma("log"), weights = antskad)&#xD;
## Res.dev. 516 on 164 dof&#xD;
&#xD;
rels &amp;lt;- coef( model.severity )&#xD;
rels &amp;lt;- exp( rels[1] + rels[-1] ) / exp( rels[1] )&#xD;
case.2.4$rels.severity &amp;lt;-&#xD;
    c(c(1, rels[1:6])[rank(-case.2.4$duration[1:7], ties.method = "first")],&#xD;
      c(1, rels[7:12])[rank(-case.2.4$duration[8:14], ties.method = "first")],&#xD;
      c(1, rels[13:14])[rank(-case.2.4$duration[15:17], ties.method = "first")],&#xD;
      c(1, rels[15:16])[rank(-case.2.4$duration[18:20], ties.method = "first")])&#xD;
&#xD;
## Combine the frequency and severity&#xD;
case.2.4$rels.pure.prem &amp;lt;- with(case.2.4, rels.frequency * rels.severity)&#xD;
&lt;/pre&gt;&#xD;
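&#xD;
&lt;p&gt;A side note on the relativity calculation above: dividing exp(b0 + b) by exp(b0) cancels the intercept, so the expression gives the same result as exponentiating the non-intercept coefficients directly. A minimal check with made-up coefficients (the values and the names b, rels.long and rels.short are purely illustrative):&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;## Hypothetical coefficients: an intercept and two factor levels&#xD;
b &amp;lt;- c("(Intercept)" = -4.2, level.2 = 0.35, level.3 = -0.10)&#xD;
rels.long  &amp;lt;- exp(b[1] + b[-1]) / exp(b[1])   # form used in the code above&#xD;
rels.short &amp;lt;- exp(b[-1])                      # equivalent shorter form&#xD;
stopifnot(isTRUE(all.equal(as.numeric(rels.long), as.numeric(rels.short))))&#xD;
&lt;/pre&gt;&#xD;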
&#xD;
&lt;p&gt;Finally we convert, save, and compare with the current values.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;## Convert to data.table&#xD;
library("data.table")&#xD;
case.2.4 &amp;lt;- data.table(case.2.4)&#xD;
&#xD;
## Save&#xD;
save(case.2.4, file = "case.2.4.RData")&#xD;
&#xD;
## Compare with current values&#xD;
print(case.2.4[,&#xD;
               list(rating.factor, class, duration, n.claims,&#xD;
                    skadkostK = round(skadkost / 1e3),&#xD;
                    relativity, rels.pure.prem)],&#xD;
      digits = 3)&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;table&gt;&#xD;
&lt;tr&gt; &lt;th&gt;  &lt;/th&gt; &lt;th&gt; rating.factor &lt;/th&gt; &lt;th&gt; class &lt;/th&gt; &lt;th&gt; duration &lt;/th&gt; &lt;th&gt; n.claims &lt;/th&gt; &lt;th&gt; skadkostK &lt;/th&gt; &lt;th&gt; relativity &lt;/th&gt; &lt;th&gt; rels.pure.prem &lt;/th&gt;  &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 1 &lt;/td&gt; &lt;td&gt; Zone &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td align="right"&gt; 6205.310 &lt;/td&gt; &lt;td align="right"&gt;  183 &lt;/td&gt; &lt;td align="right"&gt; 5540.000 &lt;/td&gt; &lt;td align="right"&gt; 7.678 &lt;/td&gt; &lt;td align="right"&gt; 6.239 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 2 &lt;/td&gt; &lt;td&gt; Zone &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td align="right"&gt; 10103.090 &lt;/td&gt; &lt;td align="right"&gt;  167 &lt;/td&gt; &lt;td align="right"&gt; 4811.000 &lt;/td&gt; &lt;td align="right"&gt; 4.227 &lt;/td&gt; &lt;td align="right"&gt; 3.179 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 3 &lt;/td&gt; &lt;td&gt; Zone &lt;/td&gt; &lt;td&gt; 3 &lt;/td&gt; &lt;td align="right"&gt; 11676.573 &lt;/td&gt; &lt;td align="right"&gt;  123 &lt;/td&gt; &lt;td align="right"&gt; 2523.000 &lt;/td&gt; &lt;td align="right"&gt; 1.336 &lt;/td&gt; &lt;td align="right"&gt; 0.993 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 4 &lt;/td&gt; &lt;td&gt; Zone &lt;/td&gt; &lt;td&gt; 4 &lt;/td&gt; &lt;td align="right"&gt; 32628.493 &lt;/td&gt; &lt;td align="right"&gt;  196 &lt;/td&gt; &lt;td align="right"&gt; 3775.000 &lt;/td&gt; &lt;td align="right"&gt; 1.000 &lt;/td&gt; &lt;td align="right"&gt; 1.000 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 5 &lt;/td&gt; &lt;td&gt; Zone &lt;/td&gt; &lt;td&gt; 5 &lt;/td&gt; &lt;td align="right"&gt; 1582.112 &lt;/td&gt; &lt;td align="right"&gt;    9 &lt;/td&gt; &lt;td align="right"&gt; 105.000 &lt;/td&gt; &lt;td align="right"&gt; 1.734 &lt;/td&gt; &lt;td align="right"&gt; 0.155 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 6 &lt;/td&gt; &lt;td&gt; Zone &lt;/td&gt; &lt;td&gt; 6 &lt;/td&gt; &lt;td align="right"&gt; 2799.945 &lt;/td&gt; &lt;td align="right"&gt;   18 &lt;/td&gt; &lt;td align="right"&gt; 288.000 &lt;/td&gt; &lt;td align="right"&gt; 1.402 &lt;/td&gt; &lt;td align="right"&gt; 0.223 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 7 &lt;/td&gt; &lt;td&gt; Zone &lt;/td&gt; &lt;td&gt; 7 &lt;/td&gt; &lt;td align="right"&gt; 241.288 &lt;/td&gt; &lt;td align="right"&gt;    1 &lt;/td&gt; &lt;td align="right"&gt; 1.000 &lt;/td&gt; &lt;td align="right"&gt; 1.402 &lt;/td&gt; &lt;td align="right"&gt; 0.002 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 8 &lt;/td&gt; &lt;td&gt; MC class &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td align="right"&gt; 5190.351 &lt;/td&gt; &lt;td align="right"&gt;   46 &lt;/td&gt; &lt;td align="right"&gt; 993.000 &lt;/td&gt; &lt;td align="right"&gt; 0.625 &lt;/td&gt; &lt;td align="right"&gt; 0.395 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 9 &lt;/td&gt; &lt;td&gt; MC class &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td align="right"&gt; 3990.115 &lt;/td&gt; &lt;td align="right"&gt;   57 &lt;/td&gt; &lt;td align="right"&gt; 883.000 &lt;/td&gt; &lt;td align="right"&gt; 0.769 &lt;/td&gt; &lt;td align="right"&gt; 0.536 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 10 &lt;/td&gt; &lt;td&gt; MC class &lt;/td&gt; &lt;td&gt; 3 &lt;/td&gt; &lt;td align="right"&gt; 21665.679 &lt;/td&gt; &lt;td align="right"&gt;  166 &lt;/td&gt; &lt;td align="right"&gt; 5372.000 &lt;/td&gt; &lt;td align="right"&gt; 1.000 &lt;/td&gt; &lt;td align="right"&gt; 1.000 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 11 &lt;/td&gt; &lt;td&gt; MC class &lt;/td&gt; &lt;td&gt; 4 &lt;/td&gt; &lt;td align="right"&gt; 11739.882 &lt;/td&gt; &lt;td align="right"&gt;   98 &lt;/td&gt; &lt;td align="right"&gt; 2192.000 &lt;/td&gt; &lt;td align="right"&gt; 1.406 &lt;/td&gt; &lt;td align="right"&gt; 0.574 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 12 &lt;/td&gt; &lt;td&gt; MC class &lt;/td&gt; &lt;td&gt; 5 &lt;/td&gt; &lt;td align="right"&gt; 13439.926 &lt;/td&gt; &lt;td align="right"&gt;  149 &lt;/td&gt; &lt;td align="right"&gt; 3297.000 &lt;/td&gt; &lt;td align="right"&gt; 1.875 &lt;/td&gt; &lt;td align="right"&gt; 1.413 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 13 &lt;/td&gt; &lt;td&gt; MC class &lt;/td&gt; &lt;td&gt; 6 &lt;/td&gt; &lt;td align="right"&gt; 8880.134 &lt;/td&gt; &lt;td align="right"&gt;  175 &lt;/td&gt; &lt;td align="right"&gt; 4161.000 &lt;/td&gt; &lt;td align="right"&gt; 4.062 &lt;/td&gt; &lt;td align="right"&gt; 4.876 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 14 &lt;/td&gt; &lt;td&gt; MC class &lt;/td&gt; &lt;td&gt; 7 &lt;/td&gt; &lt;td align="right"&gt; 330.723 &lt;/td&gt; &lt;td align="right"&gt;    6 &lt;/td&gt; &lt;td align="right"&gt; 145.000 &lt;/td&gt; &lt;td align="right"&gt; 6.873 &lt;/td&gt; &lt;td align="right"&gt; 0.602 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 15 &lt;/td&gt; &lt;td&gt; Vehicle age &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td align="right"&gt; 4955.403 &lt;/td&gt; &lt;td align="right"&gt;  126 &lt;/td&gt; &lt;td align="right"&gt; 4964.000 &lt;/td&gt; &lt;td align="right"&gt; 2.000 &lt;/td&gt; &lt;td align="right"&gt; 4.715 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 16 &lt;/td&gt; &lt;td&gt; Vehicle age &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td align="right"&gt; 9753.811 &lt;/td&gt; &lt;td align="right"&gt;  145 &lt;/td&gt; &lt;td align="right"&gt; 5507.000 &lt;/td&gt; &lt;td align="right"&gt; 1.200 &lt;/td&gt; &lt;td align="right"&gt; 2.021 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 17 &lt;/td&gt; &lt;td&gt; Vehicle age &lt;/td&gt; &lt;td&gt; 3 &lt;/td&gt; &lt;td align="right"&gt; 50527.597 &lt;/td&gt; &lt;td align="right"&gt;  426 &lt;/td&gt; &lt;td align="right"&gt; 6570.000 &lt;/td&gt; &lt;td align="right"&gt; 1.000 &lt;/td&gt; &lt;td align="right"&gt; 1.000 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 18 &lt;/td&gt; &lt;td&gt; Bonus class &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td align="right"&gt; 19893.370 &lt;/td&gt; &lt;td align="right"&gt;  207 &lt;/td&gt; &lt;td align="right"&gt; 4558.000 &lt;/td&gt; &lt;td align="right"&gt; 1.250 &lt;/td&gt; &lt;td align="right"&gt; 0.797 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 19 &lt;/td&gt; &lt;td&gt; Bonus class &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td align="right"&gt; 9615.764 &lt;/td&gt; &lt;td align="right"&gt;  121 &lt;/td&gt; &lt;td align="right"&gt; 3627.000 &lt;/td&gt; &lt;td align="right"&gt; 1.125 &lt;/td&gt; &lt;td align="right"&gt; 0.604 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 20 &lt;/td&gt; &lt;td&gt; Bonus class &lt;/td&gt; &lt;td&gt; 3 &lt;/td&gt; &lt;td align="right"&gt; 35727.677 &lt;/td&gt; &lt;td align="right"&gt;  369 &lt;/td&gt; &lt;td align="right"&gt; 8857.000 &lt;/td&gt; &lt;td align="right"&gt; 1.000 &lt;/td&gt; &lt;td align="right"&gt; 1.000 &lt;/td&gt; &lt;/tr&gt;&#xD;
   &lt;/table&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
It is really rather different from the current relativity, as we will also see below. But remember that the severity model was a poor fit: that would be my first place to start looking if staying within the insurance pricing framework outlined in this book.&#xD;
&lt;/p&gt;&#xD;
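&#xD;
&lt;p&gt;The goodness-of-fit remarks can be put in rough numbers using the residual deviances and degrees of freedom noted in the model code comments. This is only a rule-of-thumb sketch: the chi-squared comparison is an approximation for the Poisson model, and for the Gamma model the deviance would first need scaling by the estimated dispersion, so only the ratio is shown for it. The names dev, dof and p.frequency are made up for the illustration.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;## Residual deviance and degrees of freedom as noted in the model code&#xD;
dev &amp;lt;- c(frequency = 360, severity = 516)&#xD;
dof &amp;lt;- c(frequency = 389, severity = 164)&#xD;
## Ratios well above 1 hint at a poor fit&#xD;
print(round(dev / dof, 2))&#xD;
## Approximate chi-squared tail probability for the Poisson frequency model&#xD;
p.frequency &amp;lt;- pchisq(dev[["frequency"]], dof[["frequency"]],&#xD;
                      lower.tail = FALSE)&#xD;
print(p.frequency)&#xD;
&lt;/pre&gt;&#xD;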
&#xD;
&lt;h3&gt;Problem 4: Discussions&lt;/h3&gt;&#xD;
&#xD;
&lt;p&gt;Just note here that the ratio of the largest to the smallest relativity, for the relativities we calculated and for the existing ones, can be found as&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;with(case.2.4, max(rels.pure.prem) / min(rels.pure.prem)) # 3552&#xD;
with(case.2.4, max(relativity) / min(relativity))         #   12&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;Note the huge range of relativities in our new model (3552 versus 12).&lt;/p&gt;&#xD;
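&#xD;
&lt;p&gt;The second figure can be checked without the fitted models, since it depends only on the Table 2.8 relativities entered earlier. A minimal sketch (the vector is copied from the code above; the name table.2.8 is made up for the illustration):&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;table.2.8 &amp;lt;- c(7.678, 4.227, 1.336, 1.000, 1.734, 1.402, 1.402,&#xD;
               0.625, 0.769, 1.000, 1.406, 1.875, 4.062, 6.873,&#xD;
               2.000, 1.200, 1.000,&#xD;
               1.250, 1.125, 1.000)&#xD;
## Ratio of the largest to the smallest current relativity&#xD;
max(table.2.8) / min(table.2.8)  # about 12.3&#xD;
&lt;/pre&gt;&#xD;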
&lt;div class="seealso"&gt;&#xD;
&lt;h1&gt;You may also like these posts:&lt;/h1&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-50.png" width="85" height="16" alt="[0.83]" title="[0.83]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-code-for-Chapter-1-of-Non_Life-Insurance-Pricing-with-GLM.html" title="Insurance pricing is backwards and primitive, harking back to an era before computers. One standard (and good) textbook on the topic is Non-Life Insurance Pricing with Generalized Linear Models by Esbjorn Ohlsson and Born Johansson. We have been doing some work in this area recently. Needing a robust internal training course and documented methodology, we have been working our way through the book again and converting the examples and exercises to R , the statistical computing and analysis platform. This is part of a series of posts containing elements of the R code."&gt;R code for Chapter 1 of Non-Life Insurance Pricing with GLM&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Insurance pricing is backwards and primitive, harking back to an era before computers. One standard (and good) textbook on the topic is Non-Life Insurance Pricing with Generalized Linear Models by Esbjorn Ohlsson and Born Johansson. We have been doing some work in this area recently. Needing a robust internal training course and documented methodology, we have been working our way through the book again and converting the examples and exercises to R , the statistical computing and analysis platform. This is part of a series of posts containing elements of the R code.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-50.png" width="85" height="16" alt="[0.54]" title="[0.54]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Area-Plots-with-Intensity-Coloring.html" title="I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line which R does not support natively so we have to roll our own code for that. Unfortunately, lines(..., type=l) does not recycle the colour col= argument, so we end up with rather more loops than I thought would be necessary. We also get a nice opportunity to use the under-appreciated read.fwf function."&gt;Area Plots with Intensity Coloring&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line which R does not support natively so we have to roll our own code for that. Unfortunately, lines(..., type=l) does not recycle the colour col= argument, so we end up with rather more loops than I thought would be necessary. We also get a nice opportunity to use the under-appreciated read.fwf function.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.46]" title="[0.46]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Feature-selection-Using-the-caret-package.html" title="Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building. In a previous post we looked at all-relevant feature selection using the Boruta package while in this post we consider the same (artificial, toy) examples using the caret package. Max Kuhn kindly listed me as a contributor for some performance enhancements I submitted, but the genius behind the package is all his."&gt;Feature selection: Using the caret package&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building. In a previous post we looked at all-relevant feature selection using the Boruta package while in this post we consider the same (artificial, toy) examples using the caret package. Max Kuhn kindly listed me as a contributor for some performance enhancements I submitted, but the genius behind the package is all his.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.45]" title="[0.45]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Employee-productivity-as-function-of-number-of-workers-revisited.html" title="We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis for the FTSE-100 constituent companies and find that the relation still holds four years later and across a continent."&gt;Employee productivity as function of number of workers revisited&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis for the FTSE-100 constituent companies and find that the relation still holds four years later and across a continent.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-30.png" width="85" height="16" alt="[0.42]" title="[0.42]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Spreadsheet-errors.html" title="For my sins, I have done more than my fair share of analysis in Excel. I am quite capable of building and maintaining 130Mb spreadsheets (I had a dozen of them for one client). Excel is pretty much installed everywhere, so it is sometimes the only way to get started getting commercial value of the data in the organisation. But I don’t like it and let’s have a look at one reason why. In order not to always pick on Microsoft, we use another application, but you get the same results with Excel."&gt;Spreadsheet errors&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;For my sins, I have done more than my fair share of analysis in Excel. I am quite capable of building and maintaining 130Mb spreadsheets (I had a dozen of them for one client). Excel is pretty much installed everywhere, so it is sometimes the only way to …&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/t_3H9Qgjiow" height="1" width="1"/&gt;</content><published>2012-03-13T16:57:00Z</published><updated>2012-03-13T16:57:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-code-for-Chapter-2-of-Non_Life-Insurance-Pricing-with-GLM.html</feedburner:origLink></entry><entry><title type="text">R code for Chapter 1 of Non-Life Insurance Pricing with GLM</title><id>urn:uuid:714f195a-4055-505f-93f7-56313d44013e</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-code-for-Chapter-1-of-Non_Life-Insurance-Pricing-with-GLM.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/slPno6edH6w/R-code-for-Chapter-1-of-Non_Life-Insurance-Pricing-with-GLM.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>Insurance pricing is backwards and primitive, harking back to an era before computers.  One standard (and good) textbook on the topic is <cite>Non-Life Insurance Pricing with Generalized Linear Models</cite> by Esbjorn Ohlsson and Born Johansson.  We have been doing some work in this area recently.  Needing a robust internal training course and documented methodology, we have been working our way through the book again and converting the examples and exercises to <a href="http://www.r-project.org">R</a>, the statistical computing and analysis platform.  This is part of a series of posts containing elements of the R code.</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;div class="floatRight" style="width:110px"&gt;&#xD;
&lt;p&gt;&#xD;
&lt;a href="http://www.amazon.co.uk/gp/product/3642107907/ref=as_li_qf_sp_asin_il?ie=UTF8&amp;amp;tag=cybaea-21&amp;amp;linkCode=as2&amp;amp;camp=1634&amp;amp;creative=6738&amp;amp;creativeASIN=3642107907"&gt;&lt;img border="0" src="http://ws.assoc-amazon.co.uk/widgets/q?_encoding=UTF8&amp;amp;Format=_SL160_&amp;amp;ASIN=3642107907&amp;amp;MarketPlace=GB&amp;amp;ID=AsinImage&amp;amp;WS=1&amp;amp;tag=cybaea-21&amp;amp;ServiceVersion=20070822"&gt;&lt;/img&gt;&lt;/a&gt;&lt;img src="http://www.assoc-amazon.co.uk/e/ir?t=cybaea-21&amp;amp;l=as2&amp;amp;o=2&amp;amp;a=3642107907" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;"&gt;&lt;/img&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="caption"&gt;Amazon &lt;a href="http://www.amazon.co.uk/gp/product/3642107907/ref=as_li_qf_sp_asin_tl?ie=UTF8&amp;amp;tag=cybaea-21&amp;amp;linkCode=as2&amp;amp;camp=1634&amp;amp;creative=6738&amp;amp;creativeASIN=3642107907"&gt;UK&lt;/a&gt;&lt;img src="http://www.assoc-amazon.co.uk/e/ir?t=cybaea-21&amp;amp;l=as2&amp;amp;o=2&amp;amp;a=3642107907" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;"&gt;&lt;/img&gt; |&#xD;
&lt;a href="http://www.amazon.com/gp/product/3642107907/ref=as_li_ss_tl?ie=UTF8&amp;amp;tag=allanengelhardt&amp;amp;linkCode=as2&amp;amp;camp=1789&amp;amp;creative=390957&amp;amp;creativeASIN=3642107907"&gt;US&lt;/a&gt;&lt;img src="http://www.assoc-amazon.com/e/ir?t=allanengelhardt&amp;amp;l=as2&amp;amp;o=1&amp;amp;a=3642107907" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;"&gt;&lt;/img&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;p&gt;Insurance pricing is backwards and primitive, harking back to an era before computers.  One standard (and good) textbook on the topic is &lt;cite&gt;Non-Life Insurance Pricing with Generalized Linear Models&lt;/cite&gt; by Esbjörn Ohlsson and Björn Johansson (Amazon &#xD;
&lt;a href="http://www.amazon.co.uk/gp/product/3642107907/ref=as_li_qf_sp_asin_tl?ie=UTF8&amp;amp;tag=cybaea-21&amp;amp;linkCode=as2&amp;amp;camp=1634&amp;amp;creative=6738&amp;amp;creativeASIN=3642107907"&gt;UK&lt;/a&gt; |&#xD;
&lt;a href="http://www.amazon.com/gp/product/3642107907/ref=as_li_ss_tl?ie=UTF8&amp;amp;tag=allanengelhardt&amp;amp;linkCode=as2&amp;amp;camp=1789&amp;amp;creative=390957&amp;amp;creativeASIN=3642107907"&gt;US&lt;/a&gt;).  We have been doing some work in this area recently.  Needing a robust internal training course and documented methodology, we have been working our way through the book again and converting the examples and exercises to &lt;a href="http://www.r-project.org"&gt;R&lt;/a&gt;, the statistical computing and analysis platform.  This is part of a series of posts containing elements of the R code.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document" style="clear:both"&gt;#!/usr/bin/Rscript&#xD;
## PricingGLM-1.r - Code for Chapter 1 of "&lt;cite&gt;Non-Life Insurance Pricing with GLM&lt;/cite&gt;"&#xD;
## Copyright © 2012 CYBAEA Limited (&lt;a href="http://www.cybaea.net/"&gt;http://www.cybaea.net/&lt;/a&gt;)&#xD;
&#xD;
## @book{ohlsson2010non,&#xD;
##   title={Non-Life Insurance Pricing with Generalized Linear Models},&#xD;
##   author={Ohlsson, E. and Johansson, B.},&#xD;
##   isbn={9783642107900},&#xD;
##   series={Eaa Series: Textbook},&#xD;
##   url={http://books.google.com/books?id=l4rjeflJ\_bIC},&#xD;
##   year={2010},&#xD;
##   publisher={Springer Verlag}&#xD;
## }&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;With the preliminaries out of the way, let us get started.&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;Example 1.2&lt;/h2&gt;&#xD;
&lt;p&gt;We grab the data for Table 1.2 from the book's web site and store it as an R object with lots of good meta information.&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;################&#xD;
### Example 1.2&#xD;
con &amp;lt;- url("&lt;a href="http://www2.math.su.se/~esbj/GLMbook/moppe.sas"&gt;http://www2.math.su.se/~esbj/GLMbook/moppe.sas&lt;/a&gt;")&#xD;
data &amp;lt;- readLines(con, n = 200L, warn = FALSE, encoding = "unknown")&#xD;
close(con)&#xD;
## Find the data range: the lines between "cards;" and the first ";" after it&#xD;
data.start &amp;lt;- grep("^cards;", data) + 1L&#xD;
data.end   &amp;lt;- grep("^;", data[data.start:length(data)])[1L] + data.start - 2L&#xD;
table.1.2  &amp;lt;- read.table(text = data[data.start:data.end],&#xD;
                         header = FALSE, sep = "", quote = "",&#xD;
                         col.names = c("premiekl", "moptva", "zon", "dur",&#xD;
                             "medskad", "antskad", "riskpre", "helpre", "cell"),&#xD;
                         na.strings = NULL,&#xD;
                         colClasses = c(rep("factor", 3), "numeric",&#xD;
                             rep("integer", 4), "NULL"),&#xD;
                         comment.char = "")&#xD;
rm(con, data, data.start, data.end)     # Cleanup&#xD;
comment(table.1.2) &amp;lt;-&#xD;
    c("Title: Partial casco moped insurance from Wasa insurance, 1994--1999",&#xD;
      "Source: http://www2.math.su.se/~esbj/GLMbook/moppe.sas",&#xD;
      "Copyright: &lt;a href="http://www2.math.su.se/~esbj/GLMbook/"&gt;http://www2.math.su.se/~esbj/GLMbook/&lt;/a&gt;")&#xD;
## See the SAS code for this derived field&#xD;
table.1.2$skadfre &amp;lt;- with(table.1.2, antskad / dur)&#xD;
## English language column names as comments:&#xD;
comment(table.1.2$premiekl) &amp;lt;-&#xD;
    c("Name: Class",&#xD;
      "Code: 1=Weight over 60kg and more than 2 gears",&#xD;
      "Code: 2=Other")&#xD;
comment(table.1.2$moptva)   &amp;lt;-&#xD;
    c("Name: Age",&#xD;
      "Code: 1=At most 1 year",&#xD;
      "Code: 2=2 years or more")&#xD;
comment(table.1.2$zon)      &amp;lt;-&#xD;
    c("Name: Zone",&#xD;
      "Code: 1=Central and semi-central parts of Sweden's three largest cities",&#xD;
      "Code: 2=suburbs and middle-sized towns",&#xD;
      "Code: 3=Lesser towns, except those in 5 or 7",&#xD;
      "Code: 4=Small towns and countryside, except 5--7",&#xD;
      "Code: 5=Northern towns",&#xD;
      "Code: 6=Northern countryside",&#xD;
      "Code: 7=Gotland (Sweden's largest island)")&#xD;
comment(table.1.2$dur)      &amp;lt;-&#xD;
    c("Name: Duration",&#xD;
      "Unit: year")&#xD;
comment(table.1.2$medskad)  &amp;lt;-&#xD;
    c("Name: Claim severity",&#xD;
      "Unit: SEK")&#xD;
comment(table.1.2$antskad)  &amp;lt;- "Name: No. claims"&#xD;
comment(table.1.2$riskpre)  &amp;lt;-&#xD;
    c("Name: Pure premium",&#xD;
      "Unit: SEK")&#xD;
comment(table.1.2$helpre)   &amp;lt;-&#xD;
    c("Name: Actual premium",&#xD;
      "Note: The premium for one year according to the tariff in force 1999",&#xD;
      "Unit: SEK")&#xD;
comment(table.1.2$skadfre)  &amp;lt;-&#xD;
    c("Name: Claim frequency",&#xD;
      "Unit: /year")&#xD;
## Save results for later&#xD;
save(table.1.2, file = "table.1.2.RData")&#xD;
## Print the table (not as pretty as the book)&#xD;
print(table.1.2)&#xD;
################&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;table&gt;&#xD;
&lt;tr&gt; &lt;th&gt;  &lt;/th&gt; &lt;th&gt; premiekl &lt;/th&gt; &lt;th&gt; moptva &lt;/th&gt; &lt;th&gt; zon &lt;/th&gt; &lt;th&gt; dur &lt;/th&gt; &lt;th&gt; medskad &lt;/th&gt; &lt;th&gt; antskad &lt;/th&gt; &lt;th&gt; riskpre &lt;/th&gt; &lt;th&gt; helpre &lt;/th&gt; &lt;th&gt; skadfre &lt;/th&gt;  &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 1 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td align="right"&gt; 62.9000 &lt;/td&gt; &lt;td align="right"&gt; 18256 &lt;/td&gt; &lt;td align="right"&gt;    17 &lt;/td&gt; &lt;td align="right"&gt;  4936 &lt;/td&gt; &lt;td align="right"&gt;  2049 &lt;/td&gt; &lt;td align="right"&gt; 0.2703 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 2 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td align="right"&gt; 112.9000 &lt;/td&gt; &lt;td align="right"&gt; 13632 &lt;/td&gt; &lt;td align="right"&gt;     7 &lt;/td&gt; &lt;td align="right"&gt;   845 &lt;/td&gt; &lt;td align="right"&gt;  1230 &lt;/td&gt; &lt;td align="right"&gt; 0.0620 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 3 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td&gt; 3 &lt;/td&gt; &lt;td align="right"&gt; 133.1000 &lt;/td&gt; &lt;td align="right"&gt; 20877 &lt;/td&gt; &lt;td align="right"&gt;     9 &lt;/td&gt; &lt;td align="right"&gt;  1411 &lt;/td&gt; &lt;td align="right"&gt;   762 &lt;/td&gt; &lt;td align="right"&gt; 0.0676 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 4 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td&gt; 4 &lt;/td&gt; &lt;td align="right"&gt; 376.6000 &lt;/td&gt; &lt;td align="right"&gt; 13045 &lt;/td&gt; &lt;td align="right"&gt;     7 &lt;/td&gt; &lt;td align="right"&gt;   242 &lt;/td&gt; &lt;td align="right"&gt;   396 &lt;/td&gt; &lt;td align="right"&gt; 0.0186 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 5 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td&gt; 5 &lt;/td&gt; &lt;td align="right"&gt; 9.4000 &lt;/td&gt; &lt;td align="right"&gt;     0 &lt;/td&gt; &lt;td align="right"&gt;     0 &lt;/td&gt; &lt;td align="right"&gt;     0 &lt;/td&gt; &lt;td align="right"&gt;   990 &lt;/td&gt; &lt;td align="right"&gt; 0.0000 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 6 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td&gt; 6 &lt;/td&gt; &lt;td align="right"&gt; 70.8000 &lt;/td&gt; &lt;td align="right"&gt; 15000 &lt;/td&gt; &lt;td align="right"&gt;     1 &lt;/td&gt; &lt;td align="right"&gt;   212 &lt;/td&gt; &lt;td align="right"&gt;   594 &lt;/td&gt; &lt;td align="right"&gt; 0.0141 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 7 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td&gt; 7 &lt;/td&gt; &lt;td align="right"&gt; 4.4000 &lt;/td&gt; &lt;td align="right"&gt;  8018 &lt;/td&gt; &lt;td align="right"&gt;     1 &lt;/td&gt; &lt;td align="right"&gt;  1829 &lt;/td&gt; &lt;td align="right"&gt;   396 &lt;/td&gt; &lt;td align="right"&gt; 0.2273 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 8 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td align="right"&gt; 352.1000 &lt;/td&gt; &lt;td align="right"&gt;  8232 &lt;/td&gt; &lt;td align="right"&gt;    52 &lt;/td&gt; &lt;td align="right"&gt;  1216 &lt;/td&gt; &lt;td align="right"&gt;  1229 &lt;/td&gt; &lt;td align="right"&gt; 0.1477 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 9 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td align="right"&gt; 840.1000 &lt;/td&gt; &lt;td align="right"&gt;  7418 &lt;/td&gt; &lt;td align="right"&gt;    69 &lt;/td&gt; &lt;td align="right"&gt;   609 &lt;/td&gt; &lt;td align="right"&gt;   738 &lt;/td&gt; &lt;td align="right"&gt; 0.0821 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 10 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td&gt; 3 &lt;/td&gt; &lt;td align="right"&gt; 1378.3000 &lt;/td&gt; &lt;td align="right"&gt;  7318 &lt;/td&gt; &lt;td align="right"&gt;    75 &lt;/td&gt; &lt;td align="right"&gt;   398 &lt;/td&gt; &lt;td align="right"&gt;   457 &lt;/td&gt; &lt;td align="right"&gt; 0.0544 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 11 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td&gt; 4 &lt;/td&gt; &lt;td align="right"&gt; 5505.3000 &lt;/td&gt; &lt;td align="right"&gt;  6922 &lt;/td&gt; &lt;td align="right"&gt;   136 &lt;/td&gt; &lt;td align="right"&gt;   171 &lt;/td&gt; &lt;td align="right"&gt;   238 &lt;/td&gt; &lt;td align="right"&gt; 0.0247 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 12 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td&gt; 5 &lt;/td&gt; &lt;td align="right"&gt; 114.1000 &lt;/td&gt; &lt;td align="right"&gt; 11131 &lt;/td&gt; &lt;td align="right"&gt;     2 &lt;/td&gt; &lt;td align="right"&gt;   195 &lt;/td&gt; &lt;td align="right"&gt;   594 &lt;/td&gt; &lt;td align="right"&gt; 0.0175 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 13 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td&gt; 6 &lt;/td&gt; &lt;td align="right"&gt; 810.9000 &lt;/td&gt; &lt;td align="right"&gt;  5970 &lt;/td&gt; &lt;td align="right"&gt;    14 &lt;/td&gt; &lt;td align="right"&gt;   103 &lt;/td&gt; &lt;td align="right"&gt;   356 &lt;/td&gt; &lt;td align="right"&gt; 0.0173 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 14 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td&gt; 7 &lt;/td&gt; &lt;td align="right"&gt; 62.3000 &lt;/td&gt; &lt;td align="right"&gt;  6500 &lt;/td&gt; &lt;td align="right"&gt;     1 &lt;/td&gt; &lt;td align="right"&gt;   104 &lt;/td&gt; &lt;td align="right"&gt;   238 &lt;/td&gt; &lt;td align="right"&gt; 0.0161 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 15 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td align="right"&gt; 191.6000 &lt;/td&gt; &lt;td align="right"&gt;  7754 &lt;/td&gt; &lt;td align="right"&gt;    43 &lt;/td&gt; &lt;td align="right"&gt;  1740 &lt;/td&gt; &lt;td align="right"&gt;  1024 &lt;/td&gt; &lt;td align="right"&gt; 0.2244 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 16 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td align="right"&gt; 237.3000 &lt;/td&gt; &lt;td align="right"&gt;  6933 &lt;/td&gt; &lt;td align="right"&gt;    34 &lt;/td&gt; &lt;td align="right"&gt;   993 &lt;/td&gt; &lt;td align="right"&gt;   615 &lt;/td&gt; &lt;td align="right"&gt; 0.1433 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 17 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td&gt; 3 &lt;/td&gt; &lt;td align="right"&gt; 162.4000 &lt;/td&gt; &lt;td align="right"&gt;  4402 &lt;/td&gt; &lt;td align="right"&gt;    11 &lt;/td&gt; &lt;td align="right"&gt;   298 &lt;/td&gt; &lt;td align="right"&gt;   381 &lt;/td&gt; &lt;td align="right"&gt; 0.0677 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 18 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td&gt; 4 &lt;/td&gt; &lt;td align="right"&gt; 446.5000 &lt;/td&gt; &lt;td align="right"&gt;  8214 &lt;/td&gt; &lt;td align="right"&gt;     8 &lt;/td&gt; &lt;td align="right"&gt;   147 &lt;/td&gt; &lt;td align="right"&gt;   198 &lt;/td&gt; &lt;td align="right"&gt; 0.0179 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 19 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td&gt; 5 &lt;/td&gt; &lt;td align="right"&gt; 13.2000 &lt;/td&gt; &lt;td align="right"&gt;     0 &lt;/td&gt; &lt;td align="right"&gt;     0 &lt;/td&gt; &lt;td align="right"&gt;     0 &lt;/td&gt; &lt;td align="right"&gt;   495 &lt;/td&gt; &lt;td align="right"&gt; 0.0000 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 20 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td&gt; 6 &lt;/td&gt; &lt;td align="right"&gt; 82.8000 &lt;/td&gt; &lt;td align="right"&gt;  5830 &lt;/td&gt; &lt;td align="right"&gt;     3 &lt;/td&gt; &lt;td align="right"&gt;   211 &lt;/td&gt; &lt;td align="right"&gt;   297 &lt;/td&gt; &lt;td align="right"&gt; 0.0362 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 21 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td&gt; 7 &lt;/td&gt; &lt;td align="right"&gt; 14.5000 &lt;/td&gt; &lt;td align="right"&gt;     0 &lt;/td&gt; &lt;td align="right"&gt;     0 &lt;/td&gt; &lt;td align="right"&gt;     0 &lt;/td&gt; &lt;td align="right"&gt;   198 &lt;/td&gt; &lt;td align="right"&gt; 0.0000 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 22 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td align="right"&gt; 844.8000 &lt;/td&gt; &lt;td align="right"&gt;  4728 &lt;/td&gt; &lt;td align="right"&gt;    94 &lt;/td&gt; &lt;td align="right"&gt;   526 &lt;/td&gt; &lt;td align="right"&gt;   614 &lt;/td&gt; &lt;td align="right"&gt; 0.1113 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 23 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td align="right"&gt; 1296.0000 &lt;/td&gt; &lt;td align="right"&gt;  4252 &lt;/td&gt; &lt;td align="right"&gt;    99 &lt;/td&gt; &lt;td align="right"&gt;   325 &lt;/td&gt; &lt;td align="right"&gt;   369 &lt;/td&gt; &lt;td align="right"&gt; 0.0764 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 24 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td&gt; 3 &lt;/td&gt; &lt;td align="right"&gt; 1214.9000 &lt;/td&gt; &lt;td align="right"&gt;  4212 &lt;/td&gt; &lt;td align="right"&gt;    37 &lt;/td&gt; &lt;td align="right"&gt;   128 &lt;/td&gt; &lt;td align="right"&gt;   229 &lt;/td&gt; &lt;td align="right"&gt; 0.0305 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 25 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td&gt; 4 &lt;/td&gt; &lt;td align="right"&gt; 3740.7000 &lt;/td&gt; &lt;td align="right"&gt;  3846 &lt;/td&gt; &lt;td align="right"&gt;    56 &lt;/td&gt; &lt;td align="right"&gt;    58 &lt;/td&gt; &lt;td align="right"&gt;   119 &lt;/td&gt; &lt;td align="right"&gt; 0.0150 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 26 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td&gt; 5 &lt;/td&gt; &lt;td align="right"&gt; 109.4000 &lt;/td&gt; &lt;td align="right"&gt;  3925 &lt;/td&gt; &lt;td align="right"&gt;     4 &lt;/td&gt; &lt;td align="right"&gt;   144 &lt;/td&gt; &lt;td align="right"&gt;   297 &lt;/td&gt; &lt;td align="right"&gt; 0.0366 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 27 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td&gt; 6 &lt;/td&gt; &lt;td align="right"&gt; 404.7000 &lt;/td&gt; &lt;td align="right"&gt;  5280 &lt;/td&gt; &lt;td align="right"&gt;     5 &lt;/td&gt; &lt;td align="right"&gt;    65 &lt;/td&gt; &lt;td align="right"&gt;   178 &lt;/td&gt; &lt;td align="right"&gt; 0.0124 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 28 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td&gt; 7 &lt;/td&gt; &lt;td align="right"&gt; 66.3000 &lt;/td&gt; &lt;td align="right"&gt;  7795 &lt;/td&gt; &lt;td align="right"&gt;     1 &lt;/td&gt; &lt;td align="right"&gt;   118 &lt;/td&gt; &lt;td align="right"&gt;   119 &lt;/td&gt; &lt;td align="right"&gt; 0.0151 &lt;/td&gt; &lt;/tr&gt;&#xD;
   &lt;/table&gt;&#xD;
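&lt;p&gt;The same parsing pattern can be tried offline. The sketch below uses a few invented lines standing in for a SAS file (they are not the contents of &lt;code&gt;moppe.sas&lt;/code&gt;), picks out the rows between &lt;code&gt;cards;&lt;/code&gt; and the closing &lt;code&gt;;&lt;/code&gt;, and attaches metadata with &lt;code&gt;comment()&lt;/code&gt;. All names and values here are assumptions for illustration.&lt;/p&gt;

```r
## Hypothetical stand-in for the lines of a SAS file; not the real moppe.sas
sas.lines = c("data toy;",
              "input grp $ n;",
              "cards;",
              "a 1",
              "b 2",
              ";",
              "run;")
## First data row is the line after "cards;"
first = grep("^cards;", sas.lines) + 1L
## Last data row is the line before the first ";" that follows
last  = first + grep("^;", sas.lines[first:length(sas.lines)])[1L] - 2L
toy = read.table(text = sas.lines[first:last], header = FALSE,
                 col.names = c("grp", "n"),
                 colClasses = c("factor", "integer"))
## Metadata for the whole object and for a single column
comment(toy) = "Title: toy data, invented for illustration"
comment(toy$n) = c("Name: Count", "Unit: items")
print(toy)
print(comment(toy$n))
```

&lt;p&gt;Note that &lt;code&gt;print()&lt;/code&gt; does not show the comment attribute, so the metadata has to be retrieved explicitly with &lt;code&gt;comment()&lt;/code&gt;.&lt;/p&gt;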
&#xD;
&#xD;
&lt;p&gt;That was easy.  Now for something a little harder.&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;Example 1.3&lt;/h2&gt;&#xD;
&lt;p&gt;Here we are concerned with replicating Table 1.4.  We do it slowly, step-by-step, for pedagogical reasons.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;################&#xD;
### Example 1.3&#xD;
if (!exists("table.1.2"))&#xD;
    load("table.1.2.RData")&#xD;
## We calculate each of the columns individually and slowly here&#xD;
## to show each step&#xD;
&#xD;
## First we have simply the labels of the table&#xD;
rating.factor &amp;lt;-&#xD;
    with(table.1.2,&#xD;
         c(rep("Vehicle class", nlevels(premiekl)),&#xD;
           rep("Vehicle age", nlevels(moptva)),&#xD;
           rep("Zone", nlevels(zon))))&#xD;
&#xD;
## The Class column&#xD;
class.num &amp;lt;- with(table.1.2, c(levels(premiekl), levels(moptva), levels(zon)))&#xD;
&#xD;
## The Duration is the sum of durations within each class&#xD;
duration.total &amp;lt;-&#xD;
    c(with(table.1.2, tapply(dur, premiekl, sum)),&#xD;
      with(table.1.2, tapply(dur, moptva, sum)),&#xD;
      with(table.1.2, tapply(dur, zon, sum)))&#xD;
&#xD;
## Calculate relativities in the tariff&#xD;
## The denominator of the fraction is the class with the highest exposure&#xD;
## (i.e. the maximum total duration): we make that explicit with the&#xD;
## which.max() construct.  We also set the contrasts to use this as the base,&#xD;
## which will be useful for the glm() model later.&#xD;
class.base &amp;lt;- which.max(duration.total[1:2])&#xD;
age.base   &amp;lt;- which.max(duration.total[3:4])&#xD;
zone.base  &amp;lt;- which.max(duration.total[5:11])&#xD;
&#xD;
rt.class &amp;lt;- with(table.1.2, tapply(helpre, premiekl, sum))&#xD;
rt.class &amp;lt;- rt.class / rt.class[class.base]&#xD;
rt.age   &amp;lt;- with(table.1.2, tapply(helpre, moptva, sum))&#xD;
rt.age   &amp;lt;- rt.age / rt.age[age.base]&#xD;
rt.zone  &amp;lt;- with(table.1.2, tapply(helpre, zon, sum))&#xD;
rt.zone  &amp;lt;- rt.zone / rt.zone[zone.base]&#xD;
&#xD;
contrasts(table.1.2$premiekl) &amp;lt;-&#xD;
    contr.treatment(nlevels(table.1.2$premiekl))[rank(-duration.total[1:2],&#xD;
                                                      ties.method = "first"), ]&#xD;
contrasts(table.1.2$moptva) &amp;lt;-&#xD;
    contr.treatment(nlevels(table.1.2$moptva))[rank(-duration.total[3:4],&#xD;
                                                    ties.method = "first"), ]&#xD;
contrasts(table.1.2$zon) &amp;lt;-&#xD;
    contr.treatment(nlevels(table.1.2$zon))[rank(-duration.total[5:11],&#xD;
                                                 ties.method = "first"), ]&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;The contrasts could also have been set with the &lt;code&gt;base=&lt;/code&gt; argument, e.g. &lt;code&gt;contrasts(table.1.2$zon) &amp;lt;- contr.treatment(nlevels(table.1.2$zon), base = zone.base)&lt;/code&gt;, which would be closer in spirit to the SAS code. But I like the idiom presented here where we follow the duration order; it also extends well to other (i.e. not treatment) contrasts. I just wish &lt;code&gt;rank()&lt;/code&gt; had a &lt;code&gt;decreasing=&lt;/code&gt; argument like &lt;code&gt;order()&lt;/code&gt;, which I think would be clearer than using &lt;code&gt;rank(-x)&lt;/code&gt; to get a decreasing sort order.&#xD;
&lt;/p&gt;&#xD;
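&lt;p&gt;To compare the two ways of choosing the base level, here is a small sketch on an invented three-level factor. The factor &lt;code&gt;f&lt;/code&gt; and the &lt;code&gt;exposure&lt;/code&gt; values are assumptions made up for illustration. In both matrices the base level is the row of zeros.&lt;/p&gt;

```r
## Invented factor and exposures; level "y" has the largest exposure
f = factor(c("x", "y", "z"))
exposure = c(x = 5, y = 20, z = 10)
base = which.max(exposure)

## (a) Name the base level directly
via.base = contr.treatment(nlevels(f), base = base)
## (b) Reorder the rows of the default matrix by decreasing exposure
via.rank = contr.treatment(nlevels(f))[rank(-exposure, ties.method = "first"), ]

print(via.base)
print(via.rank)
```

&lt;p&gt;Both approaches should put the all-zero row on level &lt;code&gt;y&lt;/code&gt;. What differs is the order of the columns: with &lt;code&gt;base=&lt;/code&gt; the remaining levels keep their original order, whereas the rank idiom orders them by decreasing exposure, which is why the coefficients have to be mapped back with the same &lt;code&gt;rank()&lt;/code&gt; expression.&lt;/p&gt;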
&lt;p&gt;That was the easy part.  At this stage in the book you are not really expected to understand the next step so do not despair!  We just show how easy it is to replicate the SAS code in R.  An alternative approach using direct optimization is outlined in Exercise 1.3 below.&lt;/p&gt;&#xD;
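&lt;p&gt;As a warm-up, here is a toy sketch of how coefficients from a log-link model turn into relativities. The data are invented and the default treatment contrasts are assumed, so this is an illustration of the idea and not the book's calculation.&lt;/p&gt;

```r
## Invented toy data: group "b" has twice the mean count of group "a"
toy = data.frame(grp = factor(c("a", "a", "b", "b")),
                 y   = c(10, 12, 20, 24))
fit = glm(y ~ grp, data = toy, family = poisson("log"))
b = coef(fit)
## With default treatment contrasts "a" is the base level, so
## b[1] is log(mean of a) and b[2] is log(mean of b / mean of a)
rel.long  = exp(b[1] + b[-1]) / exp(b[1])   # fitted level over fitted base
rel.short = exp(b[-1])                      # the same thing, simplified
print(rbind(rel.long, rel.short))
```

&lt;p&gt;Both rows should agree and be numerically equal to 2, the ratio of the group means (22 over 11). The longer form spells out that a relativity is the fitted value for a level divided by the fitted value for the base; the intercept cancels because the link is &lt;code&gt;log()&lt;/code&gt;.&lt;/p&gt;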
&#xD;
&lt;pre class="document"&gt;## Relativities of MMT; we use the glm approach here as per the book's&#xD;
## SAS code at &lt;a href="http://www2.math.su.se/~esbj/GLMbook/moppe.sas"&gt;http://www2.math.su.se/~esbj/GLMbook/moppe.sas&lt;/a&gt;&#xD;
m &amp;lt;- glm(riskpre ~ premiekl + moptva + zon, data = table.1.2,&#xD;
         family = poisson("log"), weights = dur)&#xD;
&#xD;
## If the next line is a mystery then you need to&#xD;
## (1) read up on contrasts or&#xD;
## (2) remember that the link function is log() which is why we use exp here&#xD;
rels &amp;lt;- exp( coef(m)[1] + coef(m)[-1] ) / exp(coef(m)[1])&#xD;
&#xD;
rm.class &amp;lt;- c(1, rels[1])               # See rm.zone below for the&#xD;
rm.age   &amp;lt;- c(rels[2], 1)               # general approach&#xD;
rm.zone  &amp;lt;- c(1, rels[3:8])[rank(-duration.total[5:11], ties.method = "first")]&#xD;
&#xD;
## Create and save the data frame&#xD;
table.1.4 &amp;lt;-&#xD;
    data.frame(Rating.factor = rating.factor, Class = class.num,&#xD;
               Duration = duration.total,&#xD;
               Rel.tariff = c(rt.class, rt.age, rt.zone),&#xD;
               Rel.MMT    = c(rm.class, rm.age, rm.zone))&#xD;
save(table.1.4, file = "table.1.4.RData")&#xD;
print(table.1.4, digits = 3)&#xD;
rm(rating.factor, class.num, duration.total, class.base, age.base, zone.base,&#xD;
   rt.class, rt.age, rt.zone, rm.class, rm.age, rm.zone, m, rels)&#xD;
################&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;The result is something like this:&lt;/p&gt;&#xD;
&lt;table&gt;&#xD;
&lt;tr&gt; &lt;th&gt;  &lt;/th&gt; &lt;th&gt; Rating.factor &lt;/th&gt; &lt;th&gt; Class &lt;/th&gt; &lt;th&gt; Duration &lt;/th&gt; &lt;th&gt; Rel.tariff &lt;/th&gt; &lt;th&gt; Rel.MMT &lt;/th&gt;  &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 1 &lt;/td&gt; &lt;td&gt; Vehicle class &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td align="right"&gt; 9833.20 &lt;/td&gt; &lt;td align="right"&gt; 1.00 &lt;/td&gt; &lt;td align="right"&gt; 1.00 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 2 &lt;/td&gt; &lt;td&gt; Vehicle class &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td align="right"&gt; 8825.10 &lt;/td&gt; &lt;td align="right"&gt; 0.50 &lt;/td&gt; &lt;td align="right"&gt; 0.43 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 3 &lt;/td&gt; &lt;td&gt; Vehicle age &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td align="right"&gt; 1918.40 &lt;/td&gt; &lt;td align="right"&gt; 1.67 &lt;/td&gt; &lt;td align="right"&gt; 2.73 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 4 &lt;/td&gt; &lt;td&gt; Vehicle age &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td align="right"&gt; 16739.90 &lt;/td&gt; &lt;td align="right"&gt; 1.00 &lt;/td&gt; &lt;td align="right"&gt; 1.00 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 5 &lt;/td&gt; &lt;td&gt; Zone &lt;/td&gt; &lt;td&gt; 1 &lt;/td&gt; &lt;td align="right"&gt; 1451.40 &lt;/td&gt; &lt;td align="right"&gt; 5.17 &lt;/td&gt; &lt;td align="right"&gt; 8.97 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 6 &lt;/td&gt; &lt;td&gt; Zone &lt;/td&gt; &lt;td&gt; 2 &lt;/td&gt; &lt;td align="right"&gt; 2486.30 &lt;/td&gt; &lt;td align="right"&gt; 3.10 &lt;/td&gt; &lt;td align="right"&gt; 4.19 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 7 &lt;/td&gt; &lt;td&gt; Zone &lt;/td&gt; &lt;td&gt; 3 &lt;/td&gt; &lt;td align="right"&gt; 2888.70 &lt;/td&gt; &lt;td align="right"&gt; 1.92 &lt;/td&gt; &lt;td align="right"&gt; 2.52 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 8 &lt;/td&gt; &lt;td&gt; Zone &lt;/td&gt; &lt;td&gt; 4 &lt;/td&gt; &lt;td align="right"&gt; 10069.10 &lt;/td&gt; &lt;td align="right"&gt; 1.00 &lt;/td&gt; &lt;td align="right"&gt; 1.00 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 9 &lt;/td&gt; &lt;td&gt; Zone &lt;/td&gt; &lt;td&gt; 5 &lt;/td&gt; &lt;td align="right"&gt; 246.10 &lt;/td&gt; &lt;td align="right"&gt; 2.50 &lt;/td&gt; &lt;td align="right"&gt; 1.24 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 10 &lt;/td&gt; &lt;td&gt; Zone &lt;/td&gt; &lt;td&gt; 6 &lt;/td&gt; &lt;td align="right"&gt; 1369.20 &lt;/td&gt; &lt;td align="right"&gt; 1.50 &lt;/td&gt; &lt;td align="right"&gt; 0.74 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; 11 &lt;/td&gt; &lt;td&gt; Zone &lt;/td&gt; &lt;td&gt; 7 &lt;/td&gt; &lt;td align="right"&gt; 147.50 &lt;/td&gt; &lt;td align="right"&gt; 1.00 &lt;/td&gt; &lt;td align="right"&gt; 1.23 &lt;/td&gt; &lt;/tr&gt;&#xD;
   &lt;/table&gt;&#xD;
&lt;p&gt;Note the rather unusual and apparently inconsistent rounding in the book: 147, 1.66, and 5.16 would be better as 148 (the value is 147.5), 1.67, and 5.17.&lt;/p&gt;&#xD;
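&lt;p&gt;One small observation on the &lt;code&gt;rels&lt;/code&gt; line in the code above: because exp(a + b) / exp(a) = exp(b), the expression reduces to &lt;code&gt;exp(coef(m)[-1])&lt;/code&gt;.  The sketch below checks this on simulated data; the data frame &lt;code&gt;d&lt;/code&gt;, the model &lt;code&gt;fit&lt;/code&gt;, and the other names are made up for illustration and have nothing to do with the moped data.&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;## Simulated data, purely for illustration&#xD;
set.seed(1)&#xD;
d   &amp;lt;- data.frame(g = gl(3, 20), y = rpois(60, lambda = 5))&#xD;
fit &amp;lt;- glm(y ~ g, data = d, family = poisson("log"))&#xD;
&#xD;
## The form used above ...&#xD;
rels.long  &amp;lt;- exp( coef(fit)[1] + coef(fit)[-1] ) / exp(coef(fit)[1])&#xD;
## ... and the shorter equivalent, since exp(a + b) / exp(a) == exp(b)&#xD;
rels.short &amp;lt;- exp(coef(fit)[-1])&#xD;
&#xD;
all.equal(unname(rels.long), unname(rels.short))&#xD;
&lt;/pre&gt;&#xD;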
&#xD;
&lt;h2&gt;Exercise 1.3&lt;/h2&gt;&#xD;
&lt;p&gt;Here it gets interesting as we get a different value from the authors.  Possibly a small bug on our part but at least we provide the code for you to check.  So if you spot a problem let us know in the comments.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;################&#xD;
## Exercise 1.3&#xD;
&#xD;
## The values from the book&#xD;
g0  &amp;lt;- 0.03305&#xD;
g12 &amp;lt;- 2.01231&#xD;
g22 &amp;lt;- 0.74288&#xD;
dim.names &amp;lt;- list(Mileage = c("Low", "High"),&#xD;
                  Age = c("New", "Old"))&#xD;
pyears &amp;lt;- matrix(c(47039, 56455, 190513, 28612), nrow = 2,&#xD;
                 dimnames = dim.names)&#xD;
claims &amp;lt;- matrix(c(0.033, 0.067, 0.025, 0.049), nrow = 2,&#xD;
                 dimnames = dim.names)&#xD;
&#xD;
## Function to calculate the error of the estimate&#xD;
GvalsError &amp;lt;- function (gvals) {&#xD;
    ## The current estimates&#xD;
    g0  &amp;lt;- gvals[1]&#xD;
    g12 &amp;lt;- gvals[2]&#xD;
    g22 &amp;lt;- gvals[3]&#xD;
    ## The current estimates in convenient matrix form&#xD;
    G  &amp;lt;- matrix(c(1, 1, g12, g22), nrow = 2)&#xD;
    G1 &amp;lt;- matrix(c(1, g12), nrow = 2, ncol = 2)&#xD;
    G2 &amp;lt;- matrix(c(1, g22), nrow = 2, ncol = 2, byrow = TRUE)&#xD;
    ## The calculated values&#xD;
    G0  &amp;lt;- addmargins(claims * pyears)["Sum", "Sum"] / ( sum(pyears * G1 * G2) )&#xD;
    G12 &amp;lt;- addmargins(claims * pyears)["High", "Sum"] /&#xD;
        ( g0 * addmargins(pyears * G2)["High", "Sum"] )&#xD;
    G22 &amp;lt;- addmargins(claims * pyears)["Sum", "Old"] /&#xD;
        ( g0 * addmargins(pyears * G1)["Sum", "Old"] )&#xD;
    ## The sum of squared errors&#xD;
    error &amp;lt;- (g0 - G0)^2 + (g12 - G12)^2 + (g22 - G22)^2&#xD;
    return(error)&#xD;
}&#xD;
&#xD;
## Minimize the error function to obtain our estimate&#xD;
gamma &amp;lt;- optim(c(g0, g12, g22), GvalsError)&#xD;
stopifnot(gamma$convergence == 0)&#xD;
gamma &amp;lt;- gamma$par&#xD;
&#xD;
values &amp;lt;- data.frame(legend = c("Our calculation", "Book value"),&#xD;
                     g0  = c(gamma[1], g0),&#xD;
                     g12 = c(gamma[2], g12),&#xD;
                     g22 = c(gamma[3], g22),&#xD;
                     row.names = "legend")&#xD;
print(values, digits = 4)&#xD;
&#xD;
## Close, but not the same.&#xD;
&#xD;
rm(g0, g12, g22, dim.names, pyears, claims, gamma, values)&#xD;
################&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;The resulting table is something like:&lt;/p&gt;&#xD;
&#xD;
&lt;table&gt;&#xD;
&lt;tr&gt; &lt;th&gt;  &lt;/th&gt; &lt;th&gt; g0 &lt;/th&gt; &lt;th&gt; g12 &lt;/th&gt; &lt;th&gt; g22 &lt;/th&gt;  &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; Our calculation &lt;/td&gt; &lt;td align="right"&gt; 0.0334 &lt;/td&gt; &lt;td align="right"&gt; 1.9951 &lt;/td&gt; &lt;td align="right"&gt; 0.7452 &lt;/td&gt; &lt;/tr&gt;&#xD;
  &lt;tr&gt; &lt;td align="right"&gt; Book value &lt;/td&gt; &lt;td align="right"&gt; 0.0331 &lt;/td&gt; &lt;td align="right"&gt; 2.0123 &lt;/td&gt; &lt;td align="right"&gt; 0.7429 &lt;/td&gt; &lt;/tr&gt;&#xD;
   &lt;/table&gt;&#xD;
&#xD;
&lt;p&gt;Close, but not the same.  Perhaps they used a different error function.&lt;/p&gt;&#xD;
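&lt;p&gt;One way to cross-check the &lt;code&gt;optim()&lt;/code&gt; result is to use the fact that the marginal totals equations for a multiplicative model are the same as the estimating equations of a Poisson GLM with log link, which is why the glm approach worked for Table 1.4.  The sketch below refits the exercise data that way.  The data frame &lt;code&gt;d&lt;/code&gt; and its column names are ours, and we swap in &lt;code&gt;quasipoisson&lt;/code&gt;, which gives the same point estimates as &lt;code&gt;poisson&lt;/code&gt;, only to avoid warnings about non-integer responses.&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;## The exercise data in long form; rows follow the column-major order&#xD;
## of the pyears and claims matrices above&#xD;
d &amp;lt;- data.frame(&#xD;
    mileage = factor(c("Low", "High", "Low", "High"), levels = c("Low", "High")),&#xD;
    age     = factor(c("New", "New", "Old", "Old"),   levels = c("New", "Old")),&#xD;
    pyears  = c(47039, 56455, 190513, 28612),&#xD;
    freq    = c(0.033, 0.067, 0.025, 0.049))&#xD;
&#xD;
fit &amp;lt;- glm(freq ~ mileage + age, data = d,&#xD;
           family = quasipoisson("log"), weights = pyears)&#xD;
&#xD;
## exp(coef) gives g0, g12, g22 in that order&#xD;
g.glm &amp;lt;- exp(unname(coef(fit)))&#xD;
print(g.glm, digits = 4)&#xD;
&#xD;
## Check the marginal totals: observed minus fitted claims should be&#xD;
## close to zero in total, for high mileage, and for old vehicles&#xD;
obs &amp;lt;- d$pyears * d$freq&#xD;
est &amp;lt;- d$pyears * fitted(fit)&#xD;
c(total = sum(obs) - sum(est),&#xD;
  high  = sum(obs[d$mileage == "High"]) - sum(est[d$mileage == "High"]),&#xD;
  old   = sum(obs[d$age == "Old"]) - sum(est[d$age == "Old"]))&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;Comparing &lt;code&gt;g.glm&lt;/code&gt; with the two rows of the table above may help narrow down where the difference from the book comes from.&lt;/p&gt;&#xD;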
&#xD;
&lt;p&gt;Jump to &lt;a href="http://www.cybaea.net/Blogs/Data/R-code-for-Chapter-1-of-Non_Life-Insurance-Pricing-with-GLM.html#h2_entry_comments"&gt;comments&lt;/a&gt;.&lt;/p&gt;&#xD;
&lt;div class="seealso"&gt;&#xD;
&lt;h1&gt;You may also like these posts:&lt;/h1&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-50.png" width="85" height="16" alt="[0.83]" title="[0.83]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-code-for-Chapter-2-of-Non_Life-Insurance-Pricing-with-GLM.html" title="We continue working our way through the examples, case studies, and exercises of what is affectionately known here as “the two bears book” (Swedish björn = bear) and more formally as Non-Life Insurance Pricing with Generalized Linear Models by Esbjörn Ohlsson and Björn Johansson (Amazon UK | US ). At this stage, our purpose is to reproduce the analysis from the book using the R statistical computing and analysis platform, and to answer the data analysis elements of the exercises and case studies. Any critique of the approach and of pricing and modeling in the Insurance industry in general will wait for a later article."&gt;R code for Chapter 2 of Non-Life Insurance Pricing with GLM&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;We continue working our way through the examples, case studies, and exercises of what is affectionately known here as “the two bears book” (Swedish björn = bear) and more formally as Non-Life Insurance Pricing with Generalized Linear Models by Esbjörn Ohlsson and Björn Johansson (Amazon UK | US ). At this stage, our purpose is to reproduce the analysis from the book using the R statistical computing and analysis platform, and to answer the data analysis elements of the exercises and case studies. Any critique of the approach and of pricing and modeling in the Insurance industry in general will wait for a later article.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-50.png" width="85" height="16" alt="[0.56]" title="[0.56]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Area-Plots-with-Intensity-Coloring.html" title="I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line which R does not support natively so we have to roll our own code for that. Unfortunately, lines(..., type=l) does not recycle the colour col= argument, so we end up with rather more loops than I thought would be necessary. We also get a nice opportunity to use the under-appreciated read.fwf function."&gt;Area Plots with Intensity Coloring&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line which R does not support natively so we have to roll our own code for that. Unfortunately, lines(..., type=l) does not recycle the colour col= argument, so we end up with rather more loops than I thought would be necessary. We also get a nice opportunity to use the under-appreciated read.fwf function.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.47]" title="[0.47]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Feature-selection-Using-the-caret-package.html" title="Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building. In a previous post we looked at all-relevant feature selection using the Boruta package while in this post we consider the same (artificial, toy) examples using the caret package. Max Kuhn kindly listed me as a contributor for some performance enhancements I submitted, but the genius behind the package is all his."&gt;Feature selection: Using the caret package&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building. In a previous post we looked at all-relevant feature selection using the Boruta package while in this post we consider the same (artificial, toy) examples using the caret package. Max Kuhn kindly listed me as a contributor for some performance enhancements I submitted, but the genius behind the package is all his.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.46]" title="[0.46]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Journal/Excel_Tip_1.html" title="I learn something new every day. Thinking I knew pretty much everything there is to know about Microsoft’s Excel spreadsheet application, I was surprised to see that you could turn any array into a boolean array depending on a condition by simply writing ( array = value ) , as in these examples: (A1:A10=foo) SUMPRODUCT((B2:B6=B10)*1, C2:C6) This works in Gnumeric but not in OpenOffice 1.4. More notes and examples below."&gt;Excel Tip: Array boolean operator&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;I learn something new every day. Thinking I knew pretty much everything there is to know about Microsoft’s Excel spreadsheet application, I was surprised to see that you could turn any array into a boolean array depending on a condition by simply writing ( array = value ) , as in these examples: (A1:A10=foo) SUMPRODUCT((B2:B6=B10)*1, C2:C6) This works in Gnumeric but not in OpenOffice 1.4. More notes and examples below.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.44]" title="[0.44]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Employee-productivity-as-function-of-number-of-workers-revisited.html" title="We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis for the FTSE-100 constituent companies and find that the relation still holds four years later and across a continent."&gt;Employee productivity as function of number of workers revisited&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis for the FTSE-100 constituent companies and find that the relation still holds four years later and across a continent.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=slPno6edH6w:dnfdBBqOuwY:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=slPno6edH6w:dnfdBBqOuwY:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=slPno6edH6w:dnfdBBqOuwY:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=slPno6edH6w:dnfdBBqOuwY:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=slPno6edH6w:dnfdBBqOuwY:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=slPno6edH6w:dnfdBBqOuwY:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=slPno6edH6w:dnfdBBqOuwY:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=slPno6edH6w:dnfdBBqOuwY:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=slPno6edH6w:dnfdBBqOuwY:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/slPno6edH6w" height="1" width="1"/&gt;</content><published>2012-03-01T18:11:00Z</published><updated>2012-03-12T19:53:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-code-for-Chapter-1-of-Non_Life-Insurance-Pricing-with-GLM.html</feedburner:origLink></entry><entry><title type="text">doSMP pulled</title><id>urn:uuid:e16429f3-c3ee-5128-9715-c6bf45bef2fa</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/doSMP-pulled.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/p-MI9MfeXSc/doSMP-pulled.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>They have finally pulled that buggy unreliable piece of code that was <a href="http://cran.r-project.org/web/packages/doSMP/" rev="vote-against">doSMP</a> from the CRAN mirrors while (I hear) Revolutions are re-writing it.  To use all your cores for analysis on the Windows platform, you can try <a href="http://cran.r-project.org/web/packages/doSNOW/" rev="vote-abstain">doSNOW</a> instead; my code is something like the fragment below.  Neither option is as attractive as <a href="http://cran.r-project.org/web/packages/doMC/" rev="vote-for">doMC</a> on anything-but-Windows platforms, but sometimes you have to work with legacy systems.</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;They have finally pulled that buggy unreliable piece of code that was &lt;a href="http://cran.r-project.org/web/packages/doSMP/" rev="vote-against"&gt;doSMP&lt;/a&gt; from the CRAN mirrors while (I hear) Revolutions are re-writing it.  
To use all your cores for analysis on the Windows platform, you can try &lt;a href="http://cran.r-project.org/web/packages/doSNOW/" rev="vote-abstain"&gt;doSNOW&lt;/a&gt; instead; my code is something like the fragment below.  Neither option is as attractive as &lt;a href="http://cran.r-project.org/web/packages/doMC/" rev="vote-for"&gt;doMC&lt;/a&gt; on anything-but-Windows platforms, but sometimes you have to work with legacy systems.&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;&#xD;
library("foreach")&#xD;
if (.Platform$OS.type != "windows" &amp;amp;&amp;amp; require("doMC")) {  # doMC provides registerDoMC()&#xD;
    registerDoMC()&#xD;
} else if (FALSE &amp;amp;&amp;amp;                     # doSMP is buggy&#xD;
           require("doSMP")) {&#xD;
    w &amp;lt;- startWorkers()&#xD;
    on.exit(stopWorkers(w), add = TRUE)&#xD;
    registerDoSMP(w)&#xD;
} else if (require("doSNOW")) {&#xD;
    cl &amp;lt;- snow::makeCluster(4, type = "SOCK")&#xD;
    on.exit(snow::stopCluster(cl), add = TRUE)&#xD;
    registerDoSNOW(cl)&#xD;
} else {&#xD;
    registerDoSEQ()&#xD;
}&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Change the number 4 to the number of cores that you want to use on the machine.  The explicit name space (&lt;code&gt;snow::&lt;/code&gt;) is to avoid confusion if you load the "parallel" package or any of the other packages that also define a &lt;code&gt;makeCluster()&lt;/code&gt; function.&#xD;
&lt;/p&gt;&#xD;
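&lt;p&gt;&#xD;
Rather than hard-coding the number, the worker count can be derived from the machine.  This is a minimal sketch, assuming the "parallel" package (bundled with R from version 2.14.0) is available; the name &lt;code&gt;n.cores&lt;/code&gt; is introduced here, and leaving one core free is just a habit, not a requirement.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;&#xD;
## detectCores() can return NA on some platforms, so fall back to 1&#xD;
n.cores &amp;lt;- max(1L, parallel::detectCores() - 1L, na.rm = TRUE)&#xD;
n.cores&#xD;
## then, in the fragment above: snow::makeCluster(n.cores, type = "SOCK")&#xD;
&lt;/pre&gt;&#xD;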
&lt;p&gt;&#xD;
I hope Revolutions does a good job on the new version: it needs some love.&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=p-MI9MfeXSc:BSbfu4tCX9Q:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=p-MI9MfeXSc:BSbfu4tCX9Q:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=p-MI9MfeXSc:BSbfu4tCX9Q:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=p-MI9MfeXSc:BSbfu4tCX9Q:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=p-MI9MfeXSc:BSbfu4tCX9Q:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=p-MI9MfeXSc:BSbfu4tCX9Q:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=p-MI9MfeXSc:BSbfu4tCX9Q:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=p-MI9MfeXSc:BSbfu4tCX9Q:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=p-MI9MfeXSc:BSbfu4tCX9Q:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/p-MI9MfeXSc" height="1" width="1"/&gt;</content><published>2012-03-01T09:16:00Z</published><updated>2012-03-01T09:16:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/doSMP-pulled.html</feedburner:origLink></entry><entry><title type="text">R versus SAS/SPSS in corporations</title><id>urn:uuid:96da848e-e1c0-526a-b214-213b613df848</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-versus-SAS_SPSS-in-corporations.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/Jo8p0HAP-iI/R-versus-SAS_SPSS-in-corporations.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><div class="floatRight">
  <a href="http://www.cybaea.net/Blogs/Data/R-versus-SAS_SPSS-in-corporations.html" title="Click for full article">
    <img src="http://static.cybaea.net/images/graph_151-150.png" width="150" height="150" alt="[graph]" title="Graph from http://addictedtor.free.fr/graphiques/RGraphGallery.php?graph=151" />
  </a>
</div>
<p>A recent question on one of the LinkedIn groups about the advantages of using <a href="http://www.r-project.org/">R</a> over commercial tools like SAS or IBM SPSS Modeller drew lots of comments for R.  We like R a lot and we use it extensively, but I also wanted to balance the discussion.  R is great, but looking at commercial organizations near the end of 2011 it is not necessarily the right choice to make.</p>
</div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;A recent question on one of the LinkedIn groups about the advantages of using &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt; over commercial tools like SAS or IBM SPSS Modeller drew lots of comments for R.  We like R a lot and we use it extensively, but I also wanted to balance the discussion.  R is great, but looking at commercial organizations near the end of 2011 it is not necessarily the right choice to make.&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;Background&lt;/h2&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
We have created and managed analytics teams in commercial organizations (mainly telecommunications) across Europe.  The teams were using SAS or SPSS.  Our company now has a commercial analytics as a service offering and we mainly use R.&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
The benefit of R is productivity.  We want to spend time on the actions that follow from the analytical insights, not on the coding, and we choose our tool accordingly.  Being a consulting-type organization, it is easier for us to attract and retain talent.&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;The advantages of SAS/SPSS in a commercial environment&lt;/h2&gt;&#xD;
&#xD;
&#xD;
&lt;h3&gt;1. You can buy the tool for money.&lt;/h3&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
Big corporations have procurement departments who do not have a process for free software.  Also, software spend goes on the balance sheet in a way that the CFO prefers to spending on people, while something like R will take a little talent to set up initially.  (And yes, we know the Revolutions guys well, but they are not really credible in Europe yet.)&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
This will change as (a) companies become more mature in their procurement and as (b) commercial support for R improves.  (On the latter point, &lt;a href="http://www.oracle.com/us/corporate/features/features-oracle-r-enterprise-498732.html"&gt;Oracle’s R integration&lt;/a&gt; to the database is great news.)&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;h3&gt;2. You can recruit for the commercial tool.&lt;/h3&gt;&#xD;
&#xD;
&lt;ol&gt;&#xD;
&#xD;
&lt;li&gt;Recruiters are familiar with SAS and SPSS but not with R, so it is easier to brief them and to get good quality CVs.  This will change as R becomes ever more popular and prevalent.  [And yes, we could in theory change recruiters to someone clued in, but again in large corporations there are procurement processes to be followed and existing agreements to be honoured so it will all take months or years.]&#xD;
&lt;/li&gt;&#xD;
&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;There are recognised training programmes for SPSS and (especially) SAS which makes it easier to recruit the technical skills.  How do you know what somebody knows when they say they “know R”?  How do you even &lt;em&gt;begin&lt;/em&gt; to quantify it from a CV?  How do you separate the guy who downloaded the tool and just read “&lt;a href="http://cran.r-project.org/doc/manuals/R-intro.html"&gt;An Introduction to R&lt;/a&gt;” from the Frank Harrells of this world?&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Yes,  I would argue (and in fact have argued in &lt;a href="http://www.cybaea.net/Blogs/Journal/Commercial-Analytics-The-Capabilities.html"&gt;Commercial Analytics: The Capabilities&lt;/a&gt;) that technical skill is not the most important in an analyst (and can be learned anyhow) but it does help filter the CVs and, you guessed it, fits well with the corporation’s processes.&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
(One reason we use R internally is that we find that it is, on average, a more interesting type of analyst who is proficient in that tool.  It seems to encourage curiosity and love of learning in a way that menu-based tools do not.)&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
I think the commercial R companies are really missing a trick here to provide recognised certification.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
You can’t search for R.  Seriously: try searching for R on LinkedIn (tip: there is &lt;a href="http://www.linkedin.com/skills/skill/R"&gt;another way&lt;/a&gt;).  It is much easier to find SAS / SPSS skills in a large CV database (like LinkedIn where this discussion started).&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&#xD;
&lt;h3&gt;3. You can recruit for the commercial tool.&lt;/h3&gt;&#xD;
&lt;p&gt;&#xD;
Yes I know I already said that but there is another reason why this is critical.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
R takes talent to use.  (That is kind of why we like it.)  It takes talent to maintain.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
My problem as the manager of a commercial analytical insights team is that it is very hard for me to retain that talent.  Think about it: what can I offer in terms of career progression?  If you are an analyst you might become a senior analyst but you will always be an analyst.  There are no examples of a way up the organization (except perhaps out through IT and then up to CIO).    [This too will change with time.]  And new challenges: yes, some, but we are not a research university and it tends to be the same few problem types that we are always working on.  So if you are an analyst looking for new challenges and more pay, the best thing – the logical and rational thing to do – is to get a new job.  And your time with Big Corporation will look good on your CV and you will probably land the job easily.&#xD;
&lt;/p&gt;&#xD;
&lt;h2&gt;We can help&lt;/h2&gt;&#xD;
&lt;div class="floatRight" style="width: 150px"&gt;&#xD;
  &lt;p&gt;&#xD;
    &lt;a href="http://www.cybaea.net/Blogs/Journal/Commercial-Analytics-The-Capabilities.html" title="Click to read Commercial Analytics: The Capabilities"&gt;&#xD;
      &lt;img src="http://static.cybaea.net/files/CCA/commercial-analytics-150.png" width="150" height="150" alt="[capabilities]"&gt;&lt;/img&gt;&#xD;
    &lt;/a&gt;&#xD;
  &lt;/p&gt;&#xD;
  &lt;p class="caption"&gt;Our &lt;a href="http://www.cybaea.net/Blogs/Journal/Commercial-Analytics-The-Capabilities.html"&gt;commercial analytics capabilities model&lt;/a&gt;.&lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
If you want to set up a commercial analytical group we can help you get it right first time.  The right people, the right processes, the right infrastructure and most importantly the right results.  We have done it before and are not tied to any specific tool or vendor.&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
If you want to improve or enhance your existing analytical teams, then we can &lt;a href="http://www.cybaea.net/Services/Reboot.html"&gt;Reboot your Analytics&lt;/a&gt; to deliver both rapid and sustained commercial results.&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
And if you just want the results we can provide commercial analytics as a service where we provide the insights and then work with you to turn those insights into commercial actions and better understanding of your business, markets, and customers, leaving you to focus on what you do best.&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;p class="link"&gt;&#xD;
&lt;a href="http://www.cybaea.net/Contact/"&gt;Contact us&lt;/a&gt; now and get results from your analytics.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;Jump to &lt;a href="http://www.cybaea.net/Blogs/Data/R-versus-SAS_SPSS-in-corporations.html#h2_entry_comments"&gt;comments&lt;/a&gt;.&lt;/p&gt;&#xD;
&lt;div class="seealso"&gt;&#xD;
&lt;h1&gt;You may also like these posts:&lt;/h1&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.43]" title="[0.43]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Faster-R-through-better-BLAS.html" title="Can we make our analysis using the R statistical computing and analysis platform run faster? Usually the answer is yes, and the best way is to improve your algorithm and variable selection. But recently David Smith was suggesting that a big benefit of their (commercial) version of R was that it was linked to a better linear algebra library. So I decided to investigate. The quick summary is that it only really makes a difference for fairly artificial benchmark tests. For “normal” work you are unlikely to see a difference most of the time."&gt;Faster R through better BLAS&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Can we make our analysis using the R statistical computing and analysis platform run faster? Usually the answer is yes, and the best way is to improve your algorithm and variable selection. But recently David Smith was suggesting that a big benefit of their (commercial) version of R was that it was linked to a better linear algebra library. So I decided to investigate. The quick summary is that it only really makes a difference for fairly artificial benchmark tests. For “normal” work you are unlikely to see a difference most of the time.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-10.png" width="85" height="16" alt="[0.35]" title="[0.35]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Journal/Commercial-Analytics-The-Capabilities.html" title="Commercial Analytics is the kind that makes money. From data to dollars, insights to income, this is all about how to run the business better. To do it and to do it well you need certain capabilities in place. This article builds a map of those business capabilities to help you assess, understand, and plan your business. Usually we talk about this and we are happy to talk to you about it (just contact us ) but we recently had occasion to make a slide pack that covered some of the materials as a stand-alone presentation. This article is based on that pack which is also available for download."&gt;Commercial Analytics: The Capabilities&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Commercial Analytics is the kind that makes money. From data to dollars, insights to income, this is all about how to run the business better. To do it and to do it well you need certain capabilities in place. This article builds a map of those business c…&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-00.png" width="85" height="16" alt="[0.34]" title="[0.34]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Journal/5-common-pitfalls-of-commercial-analytics-projects.html" title="We have seen data mining and other analytics projects fail; we have seen insights teams unable to deliver the insights needed to actually improve the business; we have seen marketing teams unable to use data effectively to guide and quantify their activities; we have seen business leaders who are sitting on piles of data but are effectively flying blind because they can not get from the data to the knowledge they need to inform their decisions. Below we have listed five common pitfalls of analytics in a commercial environment, their warning signs, and what you can do differently."&gt;5 common pitfalls of commercial analytics projects&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/Jo8p0HAP-iI" height="1" width="1"/&gt;</content><published>2011-10-28T11:10:00Z</published><updated>2011-10-28T11:10:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-versus-SAS_SPSS-in-corporations.html</feedburner:origLink></entry><entry><title type="text">Friday quote: what is the question to which this number is the answer?</title><id>urn:uuid:4c6c66d3-6fac-53f0-88fe-87e29b0488f6</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Friday-quote-20110826.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/sbNH3HpbijQ/Friday-quote-20110826.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>John Kay <a href="http://www.johnkay.com/2011/08/24/sex-lies-and-pitfalls-of-overblown-statistics">muses</a> on interpreting statistical data:</p>
<blockquote>
<p>Always ask of such data “<b>what is the question to which this number is the answer?</b>”. “<i>Earnings before interest, tax, depreciation and amortisation on a like-for-like basis before allowance for exceptional restructuring costs</i>” is the answer to the question “<i>what is the highest profit number we can present without attracting flat disbelief?</i>”.</p>
</blockquote></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;John Kay &lt;a href="http://www.johnkay.com/2011/08/24/sex-lies-and-pitfalls-of-overblown-statistics"&gt;muses&lt;/a&gt; on interpreting statistical data:&lt;/p&gt;&#xD;
&lt;blockquote&gt;&#xD;
&lt;p&gt;Always ask of such data “&lt;b&gt;what is the question to which this number is the answer?&lt;/b&gt;”. “&lt;i&gt;Earnings before interest, tax, depreciation and amortisation on a like-for-like basis before allowance for exceptional restructuring costs&lt;/i&gt;” is the answer to the question “&lt;i&gt;what is the highest profit number we can present without attracting flat disbelief?&lt;/i&gt;”.&lt;/p&gt;&#xD;
&lt;/blockquote&gt;&#xD;
&lt;p&gt;And on the pitfalls of powerful data analysis tools:&lt;/p&gt;&#xD;
&lt;blockquote&gt;&#xD;
&lt;p&gt;When the data seem to point to an unexpected finding, always consider the possibility that the problem is a feature of the data, rather than a feature of the world.  […] It is now easy to import data into a computer program without thought. The unwarranted precision of the projected growth in rail traffic – a 96 per cent increase, rather than a doubling – is a clue that the number was generated by a computer, not a skilled interpreter of evidence.&lt;/p&gt;&#xD;
&lt;p&gt;Statistics are only as valid as the sources from which they are drawn and the abilities of those who use them. When I discover something surprising in data, the most common explanation is that I made a mistake.&lt;/p&gt;&#xD;
&lt;/blockquote&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/sbNH3HpbijQ" height="1" width="1"/&gt;</content><published>2011-08-26T09:05:00Z</published><updated>2011-08-26T09:05:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Friday-quote-20110826.html</feedburner:origLink></entry><entry><title type="text">A warning on the R save format</title><id>urn:uuid:52d4ca53-07ff-59e3-92cb-54f97d3dd30e</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/A-warning-on-the-R-save-format.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/CwF2gIjFK2Y/A-warning-on-the-R-save-format.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>The <code>save()</code> function in the <a href="http://www.r-project.org/">R platform for statistical computing</a> is very convenient and I suspect many of us use it a lot.  But I was recently bitten by a “feature” of the format which meant I could not recover my data.</p>
<p>I recommend that you save data in a data format (e.g. CSV or CDF), not using the <code>save()</code> function which is really for objects (data and code).  What is your approach?</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;The &lt;code&gt;save()&lt;/code&gt; function in the &lt;a href="http://www.r-project.org/"&gt;R platform for statistical computing&lt;/a&gt; is very convenient and I suspect many of us use it a lot.  But I was recently bitten by a “feature” of the format which meant I could not recover my data.&lt;/p&gt;&#xD;
&lt;h2&gt;How to lose your data with &lt;code&gt;save()&lt;/code&gt;&lt;/h2&gt;&#xD;
&lt;p&gt;I am using Windows on my travel laptop and Linux on my workstation.  To speed things up on the latter and make use of my many (well, four) cores, I use the ‘multicore’ package, which I do not have available on the Windows machine.&lt;/p&gt;&#xD;
&lt;p&gt;To illustrate the problem with the save file format, I created a file on the Linux machine simply as:&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;library("multicore")&#xD;
a &amp;lt;- list(data = 1:10, fun = mclapply)&#xD;
save(a, file = "a.RData")&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;What could be simpler?  &lt;code&gt;mclapply&lt;/code&gt; is a function from the ‘multicore’ package, but it clearly has no impact on the stored data.  (We will show a more realistic example below; work with me here.)&lt;/p&gt;&#xD;
&lt;p&gt;But try to open the save file on a machine without the package installed, like my Windows laptop, and you get:&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;Error in loadNamespace(name) : there is no package called 'multicore'&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&lt;strong&gt;There is no way of getting to your precious data&lt;/strong&gt; without installing the missing package.&lt;/p&gt;&#xD;
&lt;p&gt;If the package has been withdrawn or is no longer available then your data is basically lost.&lt;/p&gt;&#xD;
&lt;h2&gt;What can you do?&lt;/h2&gt;&#xD;
&lt;p&gt;Some suggestions from the helpful people on R-help:&lt;/p&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;(&lt;a href="http://www.statistik.tu-dortmund.de/ligges.html"&gt;Uwe Ligges&lt;/a&gt;): You could try to rewrite &lt;code&gt;./src/main/saveload.c&lt;/code&gt; and &lt;code&gt;serialize.c&lt;/code&gt; to extract only the parts you need.  “This is probably not worth the effort.”&lt;/li&gt;&#xD;
&lt;li&gt;(&lt;a href="http://www.stats.ox.ac.uk/~ripley/"&gt;Prof. Brian Ripley&lt;/a&gt;): You could try installing the missing package; &lt;code&gt;R CMD INSTALL --fake&lt;/code&gt; should be sufficient to let you load the data.  Also suggests that the proposal above would be very hard indeed.&lt;/li&gt;&#xD;
&lt;li&gt;(&lt;a href="http://blog.revolutionanalytics.com/2011/05/the-r-files-martin-morgan.html"&gt;Martin Morgan&lt;/a&gt;): Don't store package functions with your code.&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;p&gt;That is three good answers from three of the heavyweights in the R community.  Thank you all!&lt;/p&gt;&#xD;
&lt;p&gt;Martin’s comment is worth expanding.  We can change the above example to:&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;library("multicore")&#xD;
computeFunction &amp;lt;- function(...) {&#xD;
    if (require(multicore)) mclapply(...)&#xD;
    else lapply(...) &#xD;
}&#xD;
a &amp;lt;- list(data = 1:10, fun = computeFunction)&#xD;
save(a, file = "a.RData")&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;Now everything works fine!  No data is horribly lost: the file loads fine on the ‘multicore’-less machine.&lt;/p&gt;&#xD;
&lt;p&gt;And for the more realistic example, I had been using &lt;code&gt;&lt;a href="http://cran.r-project.org/web/packages/caret/index.html"&gt;caret&lt;/a&gt;::rfe&lt;/code&gt;, as Martin knew in the example he provided:&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;library("&lt;a href="http://cran.r-project.org/web/packages/caret/index.html"&gt;caret&lt;/a&gt;")&#xD;
data(BloodBrain)&#xD;
&#xD;
x &amp;lt;- scale(bbbDescr[,-nearZeroVar(bbbDescr)])&#xD;
x &amp;lt;- x[, -findCorrelation(cor(x), .8)]&#xD;
x &amp;lt;- as.data.frame(x)&#xD;
&#xD;
set.seed(1)&#xD;
lmProfile &amp;lt;- rfe(x, logBBB,&#xD;
                 sizes = c(2:25, 30, 35, 40, 45, 50, 55, 60, 65),&#xD;
                 rfeControl = rfeControl(functions = lmFuncs,&#xD;
                   number = 5,&#xD;
                   computeFunction=mclapply))&#xD;
save(lmProfile, file = "lmProfile.RData")&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;It is slightly less obvious that this code contains a reference to the external namespace (the &lt;code&gt;computeFunction=mclapply&lt;/code&gt; argument), but it is easy enough to see if you know what to look for.&lt;/p&gt;&#xD;
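&lt;p&gt;If you would rather check than look, the following is a minimal sketch of a helper that reports such references before you save.  The name &lt;code&gt;findNamespaceRefs&lt;/code&gt; is hypothetical (it is not part of R or ‘caret’), and the always-installed ‘stats’ package merely stands in for ‘multicore’ so that the example runs anywhere.  It only inspects the environment of each stored function, so it will not catch references hidden elsewhere, for example in variables captured by a closure.&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;### findNamespaceRefs: hypothetical helper (an assumption, not part of R or caret).&#xD;
### Returns the names of the package namespaces referenced by functions stored,&#xD;
### at any depth, in a list-like object.&#xD;
findNamespaceRefs &amp;lt;- function(x) {&#xD;
    if (is.function(x)) {&#xD;
        env &amp;lt;- environment(x)    # NULL for primitive functions&#xD;
        if (is.null(env)) return(character(0))&#xD;
        if (isNamespace(env)) return(unname(getNamespaceName(env)))&#xD;
        return(character(0))&#xD;
    }&#xD;
    if (!is.list(x)) return(character(0))&#xD;
    ## Recurse into the elements and collect the distinct namespace names&#xD;
    as.character(unique(unlist(lapply(x, findNamespaceRefs), use.names = FALSE)))&#xD;
}&#xD;
&#xD;
direct  &amp;lt;- list(data = 1:10, fun = stats::sd)                     # stores a package function&#xD;
wrapped &amp;lt;- list(data = 1:10, fun = function(...) stats::sd(...))  # one level of indirection&#xD;
findNamespaceRefs(direct)    # "stats"&#xD;
findNamespaceRefs(wrapped)   # character(0)&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;An empty result does not prove that a file will load everywhere, but a non-empty one is a good hint that the object drags a package along with it.&lt;/p&gt;&#xD;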
&lt;p&gt;For old files I will use the &lt;code&gt;R CMD INSTALL --fake&lt;/code&gt; suggestion, but for new data I am going with the last approach and using a &lt;code&gt;computeFunction&lt;/code&gt; like this:&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;### MCCompute: A computeFunction for caret::rfeControl and caret::trainControl &#xD;
### that does not leave a reference to the multicore package in the save file&#xD;
MCCompute &amp;lt;- function(X, FUN, ...) {&#xD;
    FUN &amp;lt;- match.fun(FUN)&#xD;
    if (!is.vector(X) || is.object(X)) &#xD;
        X &amp;lt;- as.list(X)&#xD;
    if (require("multicore")) mclapply(X, FUN, ...)&#xD;
    else lapply(X, FUN, ...)&#xD;
}&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;I know that Max Kuhn is rewriting the &lt;a href="http://cran.r-project.org/web/packages/caret/index.html"&gt;caret&lt;/a&gt; package, which should make this a moot point in the near future for that specific case.  But the indirection approach is generally useful and will also be relevant in other situations.&lt;/p&gt;&#xD;
&lt;h2&gt;Recommendations&lt;/h2&gt;&#xD;
&lt;p&gt;My recommendations:&lt;/p&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&lt;strong&gt;Save data in a data format, not using the &lt;code&gt;save()&lt;/code&gt; function which is really for objects (data and code)&lt;/strong&gt;.  Suitable formats include CSV and variants, &lt;a href="http://cran.r-project.org/web/packages/hdf5/index.html"&gt;HDF5&lt;/a&gt;, and &lt;a href="http://cran.r-project.org/web/packages/ncdf4/index.html"&gt;CDF&lt;/a&gt;, as well as others.&lt;/li&gt;&#xD;
&lt;li&gt;Avoid references to packages in your objects by using the one-level indirection trick exemplified by the &lt;code&gt;MCCompute&lt;/code&gt; function shown.&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
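&lt;p&gt;As a minimal sketch of the first recommendation, the data part of the toy example can be written to CSV and read back using only base R.  The file location and the column name &lt;code&gt;value&lt;/code&gt; are arbitrary choices made for this illustration.  Bear in mind that CSV stores plain text, so column types are guessed again when the file is read.&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;a &amp;lt;- list(data = 1:10, fun = mean)&#xD;
&#xD;
## Write only the data, not the function, to a plain-text file&#xD;
csvFile &amp;lt;- file.path(tempdir(), "a.csv")    # illustrative location&#xD;
write.csv(data.frame(value = a$data), file = csvFile, row.names = FALSE)&#xD;
&#xD;
## Reading it back needs no package beyond base R&#xD;
restored &amp;lt;- read.csv(csvFile)$value&#xD;
identical(restored, a$data)    # TRUE&#xD;
&lt;/pre&gt;&#xD;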
&lt;p&gt;What is your approach?  Suggestions in the comments below, please.&lt;/p&gt;&#xD;
&lt;p&gt;Jump to &lt;a href="http://www.cybaea.net/Blogs/Data/A-warning-on-the-R-save-format.html#h2_entry_comments"&gt;comments&lt;/a&gt;.&lt;/p&gt;&#xD;
&lt;div class="seealso"&gt;&#xD;
&lt;h1&gt;You may also like these posts:&lt;/h1&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.46]" title="[0.46]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-tips-Determine-if-function-is-called-from-specific-package.html" title="I like the multicore library for a particular task. I can easily write a combination of if(require(multicore,...)) that means that my function will automatically use the parallel mclapply() instead of lapply() where it is available. Which is grand 99% of the time, except when my function is called from mclapply() (or one of the lower level functions) in which case much CPU thrashing and grinding of teeth will result. So, I needed a function to determine if my function was called from any function in the multicore library. Here it is."&gt;R tips: Determine if function is called from specific package&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;I like the multicore library for a particular task. I can easily write a combination of if(require(multicore,...)) that means that my function will automatically use the parallel mclapply() instead of lapply() where it is available. Which is grand 99% of the time, except when my function is called from mclapply() (or one of the lower level functions) in which case much CPU thrashing and grinding of teeth will result. So, I needed a function to determine if my function was called from any function in the multicore library. Here it is.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.43]" title="[0.43]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-Eliminating-observed-values-with-zero-variance.html" title="I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform. In other words, I want to find the columns in a data frame that have zero variance. And as fast as possible, because my data sets are large, many, and changing fast. The final result surprised me a little."&gt;R: Eliminating observed values with zero variance&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform. In other words, I want to find the columns in a data frame that have zero variance. And as fast as possible, because my data sets are large, many, and changing fast. The final result surprised me a little.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-30.png" width="85" height="16" alt="[0.42]" title="[0.42]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-tips-Keep-your-packages-up_to_date.html" title="In this entry in a small series of tips for the use of the R statistical analysis and computing tool, we look at how to keep your addon packages up-to-date."&gt;R tips: Keep your packages up-to-date&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;In this entry in a small series of tips for the use of the R statistical analysis and computing tool, we look at how to keep your addon packages up-to-date.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-30.png" width="85" height="16" alt="[0.39]" title="[0.39]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Big-data-for-R.html" title="Revolution Analytics recently announced their big data solution for R. This is great news and a lovely piece of work by the team at Revolution. However, if you want to replicate their analysis in standard R, then you can absolutely do so and we show you how."&gt;Big data for R&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Revolution Analytics recently announced their big data solution for R. This is great news and a lovely piece of work by the team at Revolution. However, if you want to replicate their analysis in standard R, then you can absolutely do so and we show yo…&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-20.png" width="85" height="16" alt="[0.38]" title="[0.38]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-tips-Eliminating-the-save-workspace-image-prompt-on-exit.html" title="When using R, the statistical analysis and computing platform, I find it really annoying that it always prompts to save the workspace when I exit. This is how I turn it off."&gt;R tips: Eliminating the “save workspace image” prompt on exit&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;When using R, the statistical analysis and computing platform, I find it really annoying that it always prompts to save the workspace when I exit. This is how I turn it off.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/CwF2gIjFK2Y" height="1" width="1"/&gt;</content><published>2011-08-23T07:20:00Z</published><updated>2011-08-23T07:20:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/A-warning-on-the-R-save-format.html</feedburner:origLink></entry><entry><title type="text">Friday quote: the handmaiden and the whore</title><id>urn:uuid:11daa2ff-5d4a-534e-aef8-66ce1e157cd8</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Friday-quote-20110819.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/KFA3sPOOdCI/Friday-quote-20110819.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>Because it is Friday and because we collect quotes:</p>
<blockquote>
  <p>If mathematics is the handmaiden of science, statistics is the whore: all that scientists are looking for is a quick fix without the encumbrance of a meaningful relationship.  Statisticians are second-class mathematicians, third-rate scientists and fourth-rate thinkers.  They are the hyenas, jackals and vultures of the scientific ecology: picking over the bones and carcasses of the game that the big cats, the biologists, the physicists and the chemists, have brought down.</p>
</blockquote></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;Because it is Friday and because we collect quotes.&lt;/p&gt;&#xD;
&lt;blockquote&gt;&lt;p&gt;If mathematics is the handmaiden of science, statistics is the whore: all that scientists are looking for is a quick fix without the encumbrance of a meaningful relationship.  Statisticians are second-class mathematicians, third-rate scientists and fourth-rate thinkers.  They are the hyenas, jackals and vultures of the scientific ecology: picking over the bones and carcasses of the game that the big cats, the biologists, the physicists and the chemists, have brought down.&lt;/p&gt;&#xD;
&lt;p&gt;Statistics is a wonderful discipline.  It has it all: mathematics and philosophy, analysis and empiricism, as well as applicability, relevance and the fascination of data.  It demands clear thinking, good judgement and flair.  Statisticians are engaged in an exhausting but exhilarating struggle with the biggest challenge that philosophy makes to science: how do we translate information into knowledge?&lt;/p&gt;&#xD;
&lt;p&gt;―Stephen Senn: &lt;a href="http://www.amazon.co.uk/gp/product/0521540232/ref=as_li_ss_tl?ie=UTF8&amp;amp;tag=cybaea-21&amp;amp;linkCode=as2&amp;amp;camp=1634&amp;amp;creative=19450&amp;amp;creativeASIN=0521540232"&gt;Dicing with Death: Chance, Risk and Health&lt;/a&gt;&lt;img src="http://www.assoc-amazon.co.uk/e/ir?t=&amp;amp;l=as2&amp;amp;o=2&amp;amp;a=0521540232" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;"&gt;&lt;/img&gt;&#xD;
&lt;/p&gt;&lt;/blockquote&gt;&#xD;
&lt;p&gt;Which of the two views is closest to your opinion?&lt;/p&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/KFA3sPOOdCI" height="1" width="1"/&gt;</content><published>2011-08-19T12:04:00Z</published><updated>2011-08-19T12:04:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Friday-quote-20110819.html</feedburner:origLink></entry><entry><title type="text">Spreadsheet errors</title><id>urn:uuid:17669694-59c5-5798-a85d-ebb7c8d5802b</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Spreadsheet-errors.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/RcDqZYZa4mM/Spreadsheet-errors.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><div class="floatRight">
<a href="http://www.cybaea.net/Blogs/Data/Spreadsheet-errors.html" title="Read full article"><img src="http://static.cybaea.net/files/GS-spreadsheet-error-thumb.png" width="150" height="150" alt="[Click for article]" /></a>
</div>
<p>For my sins, I have done more than my fair share of analysis in Excel.  I am quite capable of building and maintaining 130 MB spreadsheets (I had a dozen of them for one client).  Excel is pretty much installed everywhere, so it is sometimes the only way to get started getting commercial value out of the data in the organisation.  But I don’t like it and let’s have a look at one reason why.  In order not to always pick on Microsoft, we use another application, but you get the same results with Excel.</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;For my sins, I have done more than my fair share of analysis in Excel.  I am quite capable of building and maintaining 130 MB spreadsheets (I had a dozen of them for one client).  Excel is pretty much installed everywhere, so it is sometimes the only way to get started getting commercial value out of the data in the organisation.  But I don’t like it and let’s have a look at one reason why.  In order not to always pick on Microsoft, we use another application, but you get the same results with Excel.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
&lt;div style="float: right; margin-left: 1em; overflow: scroll; height: 30em"&gt;&#xD;
&lt;table class="excel"&gt;&#xD;
&lt;thead&gt;&#xD;
&lt;tr&gt;&#xD;
&lt;th&gt;Y&lt;/th&gt;&lt;th&gt;X1&lt;/th&gt;&lt;th&gt;X2&lt;/th&gt;&lt;th&gt;X3&lt;/th&gt;&lt;th&gt;X4&lt;/th&gt;&#xD;
&lt;/tr&gt;&#xD;
&lt;/thead&gt;&#xD;
&lt;tbody&gt;&#xD;
&lt;tr&gt;&lt;td&gt;5.88&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;2.56&lt;/td&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;11.11&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;0.79&lt;/td&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;0.00&lt;/td&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;0.00&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;15.6&lt;/td&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;3.7&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;8.49&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;51.2&lt;/td&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;14.2&lt;/td&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;7.14&lt;/td&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;4.2&lt;/td&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;6.15&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;10.46&lt;/td&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;0.00&lt;/td&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;10.42&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;17.36&lt;/td&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;13.41&lt;/td&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;41.67&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;2.78&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;2.98&lt;/td&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;9.62&lt;/td&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;0.00&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;4.65&lt;/td&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;3.13&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;24.58&lt;/td&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;0.00&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;5.56&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;9.26&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;0.00&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;0.00&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;3.13&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;0.00&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;7.56&lt;/td&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;9.93&lt;/td&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;0.00&lt;/td&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;16.67&lt;/td&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;16.89&lt;/td&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;13.71&lt;/td&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;6.35&lt;/td&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;2.5&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;2.47&lt;/td&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;21.74&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;23.6&lt;/td&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;11.11&lt;/td&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;0.00&lt;/td&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;3.57&lt;/td&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;2.9&lt;/td&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;2.94&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;2.42&lt;/td&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;18.75&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;0.00&lt;/td&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;2.27&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;/tbody&gt;&#xD;
&lt;/table&gt;&#xD;
&lt;/div&gt;&#xD;
Spreadsheets are good for some things, but analysing data is not one of them.  The example data in the table on the right is from  Jeffrey S. Simonoff, “&lt;a href="http://pages.stern.nyu.edu/~jsimonof/classes/1305/pdf/excelreg.pdf" title="Statistical analysis using Microsoft Excel"&gt;Statistical analysis using Microsoft Excel&lt;/a&gt;” (2008), and looks at first (and maybe even second) glance like a reasonable set of observations.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
However, the predictors are (accidentally) collinear so no meaningful fit is possible unless one of them is dropped.  We see that very easily if we try to do the analysis using the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt;:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; d &amp;lt;- read.delim("clipboard")  # Read DATA range from clipboard&#xD;
&amp;gt; summary(lm(Y ~ ., data = d))&#xD;
&#xD;
Call:&#xD;
lm(formula = Y ~ ., data = d)&#xD;
&#xD;
Residuals:&#xD;
    Min      1Q  Median      3Q     Max &#xD;
-11.222  -5.821  -2.546   3.171  40.750 &#xD;
&#xD;
Coefficients: &lt;strong&gt;(1 not defined because of singularities)&lt;/strong&gt;&#xD;
            Estimate Std. Error t value Pr(&amp;gt;|t|)&#xD;
(Intercept)   4.1945     3.9749   1.055    0.296&#xD;
X1            0.3862     0.5652   0.683    0.497&#xD;
X2            0.2308     3.1590   0.073    0.942&#xD;
X3            3.7072     2.9922   1.239    0.221&#xD;
X4                NA         NA      NA       NA&#xD;
&#xD;
Residual standard error: 10.14 on 50 degrees of freedom&#xD;
Multiple R-squared: 0.04767,	Adjusted R-squared: -0.009466 &#xD;
F-statistic: 0.8343 on 3 and 50 DF,  p-value: 0.4814 &#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
We have highlighted the message that R has automatically dropped one of the predictors.&#xD;
&lt;/p&gt;&#xD;
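&lt;p&gt;&#xD;
To see where the singularity comes from, it can help to check the rank of the design matrix directly.  The following is a minimal sketch rather than part of the original analysis: it uses twelve of the rows shown in the table (three from each group) and assumes the columns are named &lt;code&gt;Y&lt;/code&gt; and &lt;code&gt;X1&lt;/code&gt; to &lt;code&gt;X4&lt;/code&gt;, as in the R output above.  The data frame name &lt;code&gt;toy&lt;/code&gt; is ours.  In the rows shown, &lt;code&gt;X4&lt;/code&gt; equals &lt;code&gt;4 - 2 * X2 - X3&lt;/code&gt; exactly, which is the kind of hidden dependency that the base R functions &lt;code&gt;qr&lt;/code&gt; and &lt;code&gt;alias&lt;/code&gt; can expose:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;## Sketch only: a 12-row subset of the table, three rows per group&#xD;
toy &amp;lt;- data.frame(&#xD;
    Y  = c(15.6, 3.7, 8.49, 4.65, 3.13, 24.58, 7.56, 9.93, 0, 23.6, 11.11, 0),&#xD;
    X1 = c(8, 4, 3, 5, 3, 6, 5, 6, 8, 8, 8, 7),&#xD;
    X2 = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0),&#xD;
    X3 = c(1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0),&#xD;
    X4 = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4))&#xD;
## X4 is an exact linear combination of X2 and X3 in these rows&#xD;
stopifnot(all(toy$X4 == 4 - 2 * toy$X2 - toy$X3))&#xD;
&#xD;
X &amp;lt;- model.matrix(Y ~ ., data = toy)&#xD;
ncol(X)                          # 5 columns, including the intercept&#xD;
qr(X)$rank                       # 4, so one column is redundant&#xD;
&#xD;
fit &amp;lt;- lm(Y ~ ., data = toy)&#xD;
names(which(is.na(coef(fit))))   # "X4", the coefficient lm() could not estimate&#xD;
alias(fit)                       # expresses X4 in terms of the other columns&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Comparing the rank with the number of columns is a cheap check to run before trusting any regression output, whatever tool produced it.&#xD;
&lt;/p&gt;&#xD;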
&lt;p&gt;&#xD;
Everybody likes to pick on Excel, so let us load the data into version 3.3.2 of &lt;a href="http://www.libreoffice.org/"&gt;LibreOffice&lt;/a&gt;, the free Open Source personal productivity suite, instead.  It faithfully implements many of the worst features of Excel.  You can grab a copy of the spreadsheet &lt;a href="http://static.cybaea.net/files/GS-spreadsheet-error.ods"&gt;GS-spreadsheet-error.ods&lt;/a&gt; yourself and see the results.  The relevant function in both Excel and LibreOffice for linear regression is LINEST, and applying it to the data set gives us:&#xD;
&lt;/p&gt;&#xD;
&lt;img src="http://static.cybaea.net/files/GS-spreadsheet-error-1.png" width="723" height="119" alt="[Screenshot 1]"&gt;&lt;/img&gt;&#xD;
&lt;p&gt;&#xD;
Of the 16 values returned by the function, fully 12 of them are incorrect (highlighted in red), and the '#VALUE!' entries are the only thing that suggests we may have a problem.  (The '#N/A' values are a feature of the function and not a problem.)  Excluding the X4 values from the function call gives meaningful (and correct) results:&#xD;
&lt;/p&gt;&#xD;
&lt;img src="http://static.cybaea.net/files/GS-spreadsheet-error-2.png" width="602" height="119" alt="[Screenshot 2]"&gt;&lt;/img&gt;&#xD;
&lt;p&gt;&#xD;
There is so much wrong with doing even this trivial analysis in a spreadsheet that it is hard to know where to start.  Some of the problems:&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;dl&gt;&#xD;
&lt;dt&gt;Garbage results instead of errors&lt;/dt&gt;&lt;dd&gt;Instead of giving meaningful errors or warnings, the spreadsheets simply produce garbage results.  This is nearly impossible to debug.&lt;/dd&gt;&#xD;
&lt;dt&gt;No help on how to correct the problem&lt;/dt&gt;&lt;dd&gt;In the erroneous results of the first figure, there is no clue, no hint, no help to figure out how to correct the problem.  You could argue about R correcting the issue “automagically”, but at least it finds a solution to the problem and tells you about it.&lt;/dd&gt;&#xD;
&lt;dt&gt;Error-prone output formats&lt;/dt&gt;&lt;dd&gt;I put in the row and column headings because otherwise it is just too hard to read the data.  Where does the function stuff the F statistic again?&lt;/dd&gt;&#xD;
&lt;/dl&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
And don’t get me started on version control and documentation.  Don’t even mention that the maths in Excel are wrong.  Remember: Friends do not let friends do data analysis in spreadsheets.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;Jump to &lt;a href="http://www.cybaea.net/Blogs/Data/Spreadsheet-errors.html#h2_entry_comments"&gt;comments&lt;/a&gt;.&lt;/p&gt;&#xD;
&lt;div class="seealso"&gt;&#xD;
&lt;h1&gt;You may also like these posts:&lt;/h1&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.47]" title="[0.47]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Journal/Excel_Tip_1.html" title="I learn something new every day. Thinking I knew pretty much everything there is to know about Microsoft’s Excel spreadsheet application, I was surprised to see that you could turn any array into a boolean array depending on a condition by simply writing ( array = value ) , as in these examples: (A1:A10=foo) SUMPRODUCT((B2:B6=B10)*1, C2:C6) This works in Gnumeric but not in OpenOffice 1.4. More notes and examples below."&gt;Excel Tip: Array boolean operator&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;I learn something new every day. Thinking I knew pretty much everything there is to know about Microsoft’s Excel spreadsheet application, I was surprised to see that you could turn any array into a boolean array depending on a condition by simply writing ( array = value ) , as in these examples: (A1:A10=foo) SUMPRODUCT((B2:B6=B10)*1, C2:C6) This works in Gnumeric but not in OpenOffice 1.4. More notes and examples below.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.45]" title="[0.45]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-tips-Installing-Rmpi-on-Fedora-Linux.html" title="Somebody on the R-help mailing list asked how to get Rmpi working on his Fedora Linux machine so he could do high-performance computing on a cluster of machines (or a single multicore machine) using the R statistical computing and analysis platform . Since it is unusually painful to get working, I might as well copy the instructions here."&gt;R tips: Installing Rmpi on Fedora Linux&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Somebody on the R-help mailing list asked how to get Rmpi working on his Fedora Linux machine so he could do high-performance computing on a cluster of machines (or a single multicore machine) using the R statistical computing and analysis platform . Since it is unusually painful to get working, I might as well copy the instructions here.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.43]" title="[0.43]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-code-for-Chapter-1-of-Non_Life-Insurance-Pricing-with-GLM.html" title="Insurance pricing is backwards and primitive, harking back to an era before computers. One standard (and good) textbook on the topic is Non-Life Insurance Pricing with Generalized Linear Models by Esbjörn Ohlsson and Björn Johansson. We have been doing some work in this area recently. Needing a robust internal training course and documented methodology, we have been working our way through the book again and converting the examples and exercises to R , the statistical computing and analysis platform. This is part of a series of posts containing elements of the R code."&gt;R code for Chapter 1 of Non-Life Insurance Pricing with GLM&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Insurance pricing is backwards and primitive, harking back to an era before computers. One standard (and good) textbook on the topic is Non-Life Insurance Pricing with Generalized Linear Models by Esbjörn Ohlsson and Björn Johansson. We have been doing some work in this area recently. Needing a robust internal training course and documented methodology, we have been working our way through the book again and converting the examples and exercises to R , the statistical computing and analysis platform. This is part of a series of posts containing elements of the R code.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-30.png" width="85" height="16" alt="[0.42]" title="[0.42]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-code-for-Chapter-2-of-Non_Life-Insurance-Pricing-with-GLM.html" title="We continue working our way through the examples, case studies, and exercises of what is affectionately known here as “the two bears book” (Swedish björn = bear) and more formally as Non-Life Insurance Pricing with Generalized Linear Models by Esbjörn Ohlsson and Björn Johansson (Amazon UK | US ). At this stage, our purpose is to reproduce the analysis from the book using the R statistical computing and analysis platform, and to answer the data analysis elements of the exercises and case studies. Any critique of the approach and of pricing and modeling in the Insurance industry in general will wait for a later article."&gt;R code for Chapter 2 of Non-Life Insurance Pricing with GLM&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;We continue working our way through the examples, case studies, and exercises of what is affectionately known here as “the two bears book” (Swedish björn = bear) and more formally as Non-Life Insurance Pricing with Generalized Linear Models by Esbjörn Ohl…&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-30.png" width="85" height="16" alt="[0.39]" title="[0.39]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Journal/Bubble 2.0.html" title="We are seeing the same thing, if a little less and a little delayed. Does it have to be like this? I don’t think it is just the tech industry but any new and hot growth area. Fred Wilson writes in Bubble 2.0 that we are heading for a new bubble, similar to the one that ended around the year 2000. “ But increasingly money is being made the way we made it from 1998 to early 2000; [momentum] investing, speculation, fast money chasing deals, caution being thrown to the wind, and amateurs jumping in on the action. It’s hard to say no to a good party. I am struggling with the temptations myself. ” I am in two minds about how it will go this time...."&gt;Bubble 2.0&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;We are seeing the same thing, if a little less and a little delayed. Does it have to be like this? I don’t think it is just the tech industry but any new and hot growth area. Fred Wilson writes in Bubble 2.0 that we are heading for a new bubble, similar to…&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/RcDqZYZa4mM" height="1" width="1"/&gt;</content><published>2011-04-20T11:19:00Z</published><updated>2011-04-20T11:19:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Spreadsheet-errors.html</feedburner:origLink></entry><entry><title type="text">Getting started with the Heritage Health Prize competition</title><id>urn:uuid:7e9f3d60-249c-5df1-9c75-a584492c0fa1</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Getting-started-with-HHP.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/DoFsYQmBMRM/Getting-started-with-HHP.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>The US$ 3 million <a href="http://www.heritagehealthprize.com/">Heritage Health Prize</a> competition is on so we take a look at how to get started using the <a href="http://www.r-project.org/">R statistical computing and analysis platform</a>.</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;The US$ 3 million &lt;a href="http://www.heritagehealthprize.com/"&gt;Heritage Health Prize&lt;/a&gt; competition is on so we take a look at how to get started using the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt;.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;We do not have the full set of data yet, so this is a simple warm-up session to predict the days in hospital in year 2 based on the year 1 data.&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;Prerequisites&lt;/h2&gt;&#xD;
&lt;p&gt;Obviously you need to have R installed, and you should also have signed up for the competition (be sure to read the terms carefully) and downloaded and extracted the release 1 data file.&lt;/p&gt;&#xD;
&#xD;
&lt;h2 id="h2DataPrep"&gt;Data preparation&lt;/h2&gt;&#xD;
&lt;p&gt;Let’s load the data into R and do some basic housekeeping:&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;#!/usr/bin/Rscript&#xD;
## example001.R - simple benchmarks for the HHP&#xD;
## Copyright © 2011 CYBAEA Limited - &lt;a href="http://www.cybaea.net/"&gt;http://www.cybaea.net/&lt;/a&gt;&#xD;
&#xD;
##############################&#xD;
#### DATA PREPARATION&#xD;
&#xD;
##++++&#xD;
## Members&#xD;
members   &amp;lt;- read.csv(file = "HHP_release1/Members_Y1.csv",&#xD;
                      colClasses = rep("factor", 3),&#xD;
                      comment.char = "")&#xD;
##----&#xD;
##++++&#xD;
## Claims&#xD;
claims.Y1 &amp;lt;- read.csv(file = "HHP_release1/Claims_Y1.csv",&#xD;
                      colClasses = c(&#xD;
                          rep("factor", 7),&#xD;
                          "integer",    # paydelay&#xD;
                          "character",  # LengthOfStay&#xD;
                          "character",  # dsfs&#xD;
                          "factor",     # PrimaryConditionGroup&#xD;
                          "character"   # CharlsonIndex&#xD;
                          ),&#xD;
                      comment.char = "")&#xD;
## Utility function&#xD;
make.numeric &amp;lt;- function (cv, FUN = mean) {&#xD;
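### make a character vector numeric: extract the numbers in each string&#xD;
### (e.g. both ends of a range such as "1- 2 weeks") and combine them with FUN&#xD;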
    sapply(strsplit(gsub("[^[:digit:]]+",&#xD;
                         " ",&#xD;
                         cv,&#xD;
                         perl = TRUE),&#xD;
                    " ",&#xD;
                    fixed = TRUE),&#xD;
           function (x) FUN(as.numeric(x)))&#xD;
}&#xD;
## Length of stay as days&#xD;
{&#xD;
    z &amp;lt;- make.numeric(claims.Y1$LengthOfStay)&#xD;
    z.week &amp;lt;- grepl("week", claims.Y1$LengthOfStay, fixed = TRUE)&#xD;
    z[z.week] &amp;lt;- z[z.week] * 7          # Weeks are 7 days&#xD;
    z[is.nan(z)] &amp;lt;- 0&#xD;
    claims.Y1$LengthOfStay.days &amp;lt;- z&#xD;
}&#xD;
los.levels &amp;lt;- c("", "1 day", sprintf("%d days", 2:6),&#xD;
                "1- 2 weeks", "2- 4 weeks", "4- 8 weeks", "8-12 weeks",&#xD;
                "12-26 weeks", "26+ weeks")&#xD;
stopifnot(all(claims.Y1$LengthOfStay %in% los.levels))&#xD;
claims.Y1$LengthOfStay &amp;lt;- factor(claims.Y1$LengthOfStay,&#xD;
                                 levels = los.levels,&#xD;
                                 labels = c("0 days", los.levels[-1]),&#xD;
                                 ordered = TRUE)&#xD;
## Months since first claim&#xD;
claims.Y1$dsfs.months &amp;lt;- make.numeric(claims.Y1$dsfs)&#xD;
## dsfs is an ordered factor and gives the ordering of the claims&#xD;
dsfs.levels &amp;lt;- c("0- 1 month", sprintf("%d-%2d months", 1:11, 2:12))&#xD;
claims.Y1$dsfs &amp;lt;- factor(claims.Y1$dsfs, levels = dsfs.levels, ordered = TRUE)&#xD;
## Index as numeric&#xD;
claims.Y1$CharlsonIndex.numeric &amp;lt;- make.numeric(claims.Y1$CharlsonIndex)&#xD;
claims.Y1$CharlsonIndex &amp;lt;- factor(claims.Y1$CharlsonIndex, ordered = TRUE)&#xD;
##----&#xD;
##++++&#xD;
## Days in hospital&#xD;
dih.Y2    &amp;lt;- read.csv(file = "HHP_release1/DayInHospital_Y2.csv",&#xD;
                      colClasses = c("factor", "integer"),&#xD;
                      comment.char = "")&#xD;
names(dih.Y2)[1] &amp;lt;- "MemberID"          # Fix broken file&#xD;
##----&#xD;
save(members, claims.Y1, dih.Y2,&#xD;
     file = "HHPR1.RData")&#xD;
##############################&#xD;
&lt;/pre&gt;&#xD;
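&#xD;
&lt;p&gt;The helper &lt;code&gt;make.numeric&lt;/code&gt; does most of the work above, so a quick look at what it returns for a few range labels may be useful.  This is only a sketch: it repeats the function so that it runs on its own, and the vector &lt;code&gt;los&lt;/code&gt; is our own illustration built from labels in the level definitions above.&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;## Same helper as in the data preparation script&#xD;
make.numeric &amp;lt;- function (cv, FUN = mean) {&#xD;
    sapply(strsplit(gsub("[^[:digit:]]+", " ", cv, perl = TRUE),&#xD;
                    " ", fixed = TRUE),&#xD;
           function (x) FUN(as.numeric(x)))&#xD;
}&#xD;
&#xD;
los &amp;lt;- c("", "1 day", "1- 2 weeks", "26+ weeks")   # illustrative labels&#xD;
make.numeric(los)                   # NaN, 1, 1.5, 26&#xD;
make.numeric(los[-1], FUN = max)    # 1, 2, 26 (upper end of each range)&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;With the default &lt;code&gt;FUN = mean&lt;/code&gt; a range is represented by its midpoint, and an empty string gives &lt;code&gt;NaN&lt;/code&gt;, which is why the script sets those entries to zero.  Using the midpoint is a modelling choice rather than a property of the data, and open-ended labels such as “26+ weeks” are simply taken at their lower bound.&lt;/p&gt;&#xD;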
&#xD;
&lt;h2 id="h2Score"&gt;Scoring&lt;/h2&gt;&#xD;
&lt;p&gt;We will need a function to score our predictions &lt;code&gt;p&lt;/code&gt; against the actual values &lt;code&gt;a&lt;/code&gt;.  The formula is on the &lt;a href="http://www.heritagehealthprize.com/c/hhp/Details/Evaluation"&gt;evaluation page&lt;/a&gt; and we implement it as:&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;#!/usr/bin/Rscript&#xD;
## example001.R - simple benchmarks for the HHP&#xD;
## Copyright © 2011 CYBAEA Limited - &lt;a href="http://www.cybaea.net/"&gt;http://www.cybaea.net/&lt;/a&gt;&#xD;
&#xD;
##############################&#xD;
#### FUNCTION TO CALCULATE SCORE&#xD;
HPPScore &amp;lt;- function (p, a) {&#xD;
### Scoring function after&#xD;
### http://www.heritagehealthprize.com/c/hhp/Details/Evaluation&#xD;
### Base 10 log from http://www.heritagehealthprize.com/forums/default.aspx?g=posts&amp;amp;m=2226#post2226&#xD;
    sqrt(mean((log(1+p, 10) - log(1+a, 10))^2))&#xD;
}&#xD;
##############################&#xD;
&lt;/pre&gt;&#xD;
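&#xD;
&lt;p&gt;The score is a root mean squared error on the scale of log&lt;sub&gt;10&lt;/sub&gt;(1 + days), so lower is better and a perfect prediction scores zero.  As a sanity check, here is a small worked example; the function is repeated so the sketch runs on its own, and the vector &lt;code&gt;a&lt;/code&gt; holds made-up values rather than competition data.&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;## Same scoring function as above&#xD;
HPPScore &amp;lt;- function (p, a) {&#xD;
    sqrt(mean((log(1 + p, 10) - log(1 + a, 10))^2))&#xD;
}&#xD;
&#xD;
a &amp;lt;- c(0, 99)            # hypothetical actual days in hospital&#xD;
HPPScore(a, a)           # 0: perfect predictions&#xD;
HPPScore(c(0, 9), a)     # sqrt(0.5), about 0.7071&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;In the second call the first prediction is exact and the second is off by one unit on the log scale (log&lt;sub&gt;10&lt;/sub&gt; 10 = 1 against log&lt;sub&gt;10&lt;/sub&gt; 100 = 2), so the mean squared error is 0.5.&lt;/p&gt;&#xD;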
&#xD;
&lt;h2&gt;The simplest benchmarks&lt;/h2&gt;&#xD;
&lt;p&gt;The simplest models don’t really model at all: they just use the average and are simple benchmarks.&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;#!/usr/bin/Rscript&#xD;
## example001.R - simple benchmarks for the HHP&#xD;
## Copyright © 2011 CYBAEA Limited - &lt;a href="http://www.cybaea.net/"&gt;http://www.cybaea.net/&lt;/a&gt;&#xD;
&#xD;
y &amp;lt;- dih.Y2$DaysInHospital_Y2           # Actual&#xD;
p &amp;lt;- rep(mean(y), NROW(dih.Y2))&#xD;
cat(sprintf("Score using mean  : %8.6f\n", HPPScore(p, y)))&#xD;
# Score using mean  : 0.278725&#xD;
&#xD;
p &amp;lt;- rep(median(y), NROW(dih.Y2))&#xD;
cat(sprintf("Score using median: %8.6f\n", HPPScore(p, y)))&#xD;
# Score using median: 0.267969&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;h2&gt;Simple single-variable linear models&lt;/h2&gt;&#xD;
&lt;p&gt;OK, a model that doesn’t use past data isn’t much of a model, so let’s improve on that:&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="document"&gt;#!/usr/bin/Rscript&#xD;
## example001.R - simple benchmarks for the HHP&#xD;
## Copyright © 2011 CYBAEA Limited - &lt;a href="http://www.cybaea.net/"&gt;http://www.cybaea.net/&lt;/a&gt;&#xD;
library("reshape2")&#xD;
&#xD;
vars &amp;lt;- dcast(claims.Y1, MemberID ~ ., sum, value.var = "LengthOfStay.days")  # value.var in current reshape2&#xD;
names(vars)[2] &amp;lt;- "LengthOfStay"&#xD;
data &amp;lt;- merge(vars, dih.Y2)&#xD;
&#xD;
model &amp;lt;- lm(DaysInHospital_Y2 ~ LengthOfStay, data = data)&#xD;
p &amp;lt;- predict(model)&#xD;
cat(sprintf("Score using lm(LengthOfStay): %8.6f\n", HPPScore(p, y)))&#xD;
# Score using lm(LengthOfStay): 0.279062&#xD;
&#xD;
model &amp;lt;- glm(DaysInHospital_Y2 ~ LengthOfStay,&#xD;
             family = quasipoisson(),&#xD;
             data = data)&#xD;
p &amp;lt;- predict(model, type="response")&#xD;
cat(sprintf("Score using glm(LengthOfStay): %8.6f\n", HPPScore(p, y)))&#xD;
# Score using glm(LengthOfStay): 0.278914&#xD;
&lt;/pre&gt;&#xD;
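&#xD;
&lt;p&gt;One detail is worth checking before comparing these scores with the benchmarks.  By default &lt;code&gt;merge&lt;/code&gt; sorts its result by the key column and keeps only the keys found in both data frames, so the rows of &lt;code&gt;data&lt;/code&gt; need not line up with those of &lt;code&gt;dih.Y2&lt;/code&gt;.  The sketch below uses made-up member IDs to illustrate the point.  Whether this affects the competition files is an assumption on our part, and we have not re-run the figures above.&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;## Toy data with hypothetical member IDs, deliberately in different orders&#xD;
vars &amp;lt;- data.frame(MemberID = c("b", "a", "c"),&#xD;
                   LengthOfStay = c(2, 0, 5),&#xD;
                   stringsAsFactors = FALSE)&#xD;
dih  &amp;lt;- data.frame(MemberID = c("c", "a", "b", "d"),&#xD;
                   DaysInHospital_Y2 = c(3L, 0L, 1L, 0L),&#xD;
                   stringsAsFactors = FALSE)&#xD;
&#xD;
data &amp;lt;- merge(vars, dih)&#xD;
data$MemberID                 # "a" "b" "c": sorted by key, and "d" is gone&#xD;
&#xD;
model &amp;lt;- lm(DaysInHospital_Y2 ~ LengthOfStay, data = data)&#xD;
p &amp;lt;- predict(model)&#xD;
y.aligned &amp;lt;- data$DaysInHospital_Y2   # actual values in the same row order as p&#xD;
stopifnot(length(p) == length(y.aligned))&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;Taking the actual values from the merged data frame, as &lt;code&gt;y.aligned&lt;/code&gt; does here, keeps each prediction paired with the right member when the score is computed.&lt;/p&gt;&#xD;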
&#xD;
&lt;p&gt;Let the competition begin.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;Jump to &lt;a href="http://www.cybaea.net/Blogs/Data/Getting-started-with-HHP.html#h2_entry_comments"&gt;comments&lt;/a&gt;.&lt;/p&gt;&#xD;
&lt;div class="seealso"&gt;&#xD;
&lt;h1&gt;You may also like these posts:&lt;/h1&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.45]" title="[0.45]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Big-data-for-R.html" title="Revolution Analytics recently announced their big data solution for R. This is great news and a lovely piece of work by the team at Revolution. However, if you want to replicate their analysis in standard R , then you can absolutely do so and we show you how."&gt;Big data for R&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Revolution Analytics recently announced their big data solution for R. This is great news and a lovely piece of work by the team at Revolution. However, if you want to replicate their analysis in standard R , then you can absolutely do so and we show you how.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.44]" title="[0.44]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Employee-productivity-as-function-of-number-of-workers-revisited.html" title="We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis for the FTSE-100 constituent companies and find that the relation still holds four years later and across a continent."&gt;Employee productivity as function of number of workers revisited&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis for the FTSE-100 constituent companies and find that the relation still holds four years later and across a continent.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-30.png" width="85" height="16" alt="[0.40]" title="[0.40]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-code-for-Chapter-2-of-Non_Life-Insurance-Pricing-with-GLM.html" title="We continue working our way through the examples, case studies, and exercises of what is affectionately known here as “the two bears book” (Swedish björn = bear) and more formally as Non-Life Insurance Pricing with Generalized Linear Models by Esbjörn Ohlsson and Björn Johansson (Amazon UK | US ). At this stage, our purpose is to reproduce the analysis from the book using the R statistical computing and analysis platform, and to answer the data analysis elements of the exercises and case studies. Any critique of the approach and of pricing and modeling in the Insurance industry in general will wait for a later article."&gt;R code for Chapter 2 of Non-Life Insurance Pricing with GLM&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;We continue working our way through the examples, case studies, and exercises of what is affectionately known here as “the two bears book” (Swedish björn = bear) and more formally as Non-Life Insurance Pricing with Generalized Linear Models by Esbjörn Ohl…&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-30.png" width="85" height="16" alt="[0.40]" title="[0.40]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Area-Plots-with-Intensity-Coloring.html" title="I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line which R does not support natively so we have to roll our own code for that. Unfortunately, lines(..., type=l) does not recycle the colour col= argument, so we end up with rather more loops than I thought would be necessary. We also get a nice opportunity to use the under-appreciated read.fwf function."&gt;Area Plots with Intensity Coloring&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line which R does not …&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-20.png" width="85" height="16" alt="[0.38]" title="[0.38]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-Eliminating-observed-values-with-zero-variance.html" title="I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform . In other words, I want to find the columns in a data frame that have zero variance. And as fast as possible, because my data sets are large, many, and changing fast. The final result surprised me a little."&gt;R: Eliminating observed values with zero variance&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform . In other words, I want to find the columns in a data frame that have zero variance. And as fast as possible…&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/DoFsYQmBMRM" height="1" width="1"/&gt;</content><published>2011-04-08T08:39:00Z</published><updated>2011-04-08T08:39:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Getting-started-with-HHP.html</feedburner:origLink></entry><entry><title type="text">Benchmarking feature selection with Boruta and caret</title><id>urn:uuid:1a953ff9-7aa7-5db9-9a49-ec6e3ba6872f</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Benchmarking-feature-selection-with-Boruta-and-caret.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/G4iSTwL88Q0/Benchmarking-feature-selection-with-Boruta-and-caret.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><div class="floatRight">
<a href="http://www.cybaea.net/Blogs/Data/Benchmarking-feature-selection-with-Boruta-and-caret.html" title="Click for full article"><img src="http://static.cybaea.net/images/Boruta-feature-benchmark-150.png" width="150" height="150" alt="[Performance of Boruta feature selection]" /></a>
</div>
<p>
<dfn>Feature selection</dfn> is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering.  For commercial data mining, which is often characterised by having too many variables for model building, this is an important step in the analysis process.  And since we often work on very large data sets the performance of our process is very important to us.
</p>
<p>
Having looked at <a href="http://www.cybaea.net/Blogs/Data/Feature-selection-All_relevant-selection-with-the-Boruta-package.html">feature selection using the Boruta package</a> and <a href="http://www.cybaea.net/Blogs/Data/Feature-selection-Using-the-caret-package.html">feature selection using the caret package</a> separately, we now consider the performance of the two approaches.
</p>
<p>
Neither approach is suitable out of the box for the sizes of data sets that we normally work with.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
&lt;dfn&gt;Feature selection&lt;/dfn&gt; is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering.  For commercial data mining, which is often characterised by having too many variables for model building, this is an important step in the analysis process.  And since we often work on very large data sets the performance of our process is very important to us.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Having looked at &lt;a href="http://www.cybaea.net/Blogs/Data/Feature-selection-All_relevant-selection-with-the-Boruta-package.html"&gt;feature selection using the Boruta package&lt;/a&gt; and &lt;a href="http://www.cybaea.net/Blogs/Data/Feature-selection-Using-the-caret-package.html"&gt;feature selection using the caret package&lt;/a&gt; separately, we now consider the performance of the two approaches.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
For our tests we will use an artificially constructed, trivial data set that the automated process should have no problems with (although, as we will see, we will be disappointed in this expectation).  The data set has an equal number of normal and uniform random variables with mean 0 and variance 1, of which 20% are used for the target variable.  There are 10 times as many observations as variables.  We create a function to set this up:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;make.data &amp;lt;- function (n.var, m.rand = 5, m.obs = 10) {&#xD;
    n.col &amp;lt;- n.var * m.rand&#xD;
    n.obs &amp;lt;- n.col * m.obs * 2&#xD;
    x &amp;lt;- data.frame(N = matrix(rnorm(n = n.col*n.obs),&#xD;
                        nrow = n.obs, ncol = n.col),&#xD;
                    U = matrix(runif(n = n.col*n.obs,&#xD;
                        min = -sqrt(3), max = sqrt(3)), n.obs, n.col))&#xD;
    deps.n &amp;lt;- 1:n.var&#xD;
    deps.u &amp;lt;- (1+n.col):(n.var+n.col)&#xD;
    y &amp;lt;- rowSums(as.matrix(x[, c(deps.n, deps.u)]))&#xD;
    x &amp;lt;- cbind(x, Y = factor(y &amp;gt;= 0, labels=c("N", "P")))&#xD;
    attr(x, "vars") &amp;lt;- names(x)[c(deps.n, deps.u)]&#xD;
    return(x)&#xD;
}&#xD;
&lt;/pre&gt;&#xD;
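&lt;p&gt;&#xD;
The uniform bounds of ±√3 in &lt;code&gt;make.data&lt;/code&gt; are what give the uniform columns unit variance, since a uniform variable on [a, b] has variance (b − a)²/12.  A minimal check of that arithmetic, as a sketch only (the seed, the sample size and the names &lt;code&gt;a&lt;/code&gt;, &lt;code&gt;b&lt;/code&gt; and &lt;code&gt;u&lt;/code&gt; are our own choices for illustration):&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;## Variance of a uniform variable on [a, b] is (b - a)^2 / 12&#xD;
a &amp;lt;- -sqrt(3)&#xD;
b &amp;lt;- sqrt(3)&#xD;
theoretical.var &amp;lt;- (b - a)^2 / 12   # (2 * sqrt(3))^2 / 12 = 12 / 12&#xD;
&#xD;
set.seed(42)                        # arbitrary seed, for reproducibility only&#xD;
u &amp;lt;- runif(1e5, min = a, max = b)   # illustrative sample size&#xD;
print(c(theoretical = theoretical.var, sample = var(u), mean = mean(u)))&#xD;
&lt;/pre&gt;&#xD;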
&#xD;
&lt;h2&gt;The Boruta package&lt;/h2&gt;&#xD;
&lt;p&gt;&#xD;
Then we run a test using the Boruta package for different sizes:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;#!/usr/bin/Rscript&#xD;
## bench.R - benchmark Boruta package&#xD;
## Copyright © 2010 Allan Engelhardt (&lt;a href="http://www.cybaea.net/"&gt;http://www.cybaea.net/&lt;/a&gt;)&#xD;
run.name &amp;lt;- "bench-1"&#xD;
library("Boruta")&#xD;
&#xD;
set.seed(1)&#xD;
&#xD;
sizes &amp;lt;- c(1:10, 10*(2:10), 100*(2:10), 1e3*(2:10))&#xD;
n.sizes &amp;lt;- length(sizes)&#xD;
bench &amp;lt;- data.frame(n.vars = sizes, elapsed = NA, right = NA, wrong = NA)&#xD;
file.name &amp;lt;- paste(run.name, "RData", sep = ".")&#xD;
&#xD;
for (n in 1:length(sizes)) {&#xD;
    size &amp;lt;- sizes[n]&#xD;
    cat(sprintf("[%s] Size = %3d: ", as.character(Sys.time()), size))&#xD;
    tries &amp;lt;- max(3, round(10/size, 0))&#xD;
    n.right &amp;lt;- 0&#xD;
    n.wrong &amp;lt;- 0&#xD;
    elapsed &amp;lt;- 0&#xD;
    for (try in 1:tries) {&#xD;
        cat(tries-try, ".", sep = "")&#xD;
        x &amp;lt;- make.data(size)&#xD;
        x.vars &amp;lt;- attr(x, "vars")&#xD;
        elapsed &amp;lt;- elapsed +&#xD;
            system.time({b &amp;lt;- Boruta(x[,-NCOL(x)], x[,NCOL(x)])}&#xD;
                        )["elapsed"]&#xD;
        b.vars &amp;lt;- names(b$finalDecision)[b$finalDecision!="Rejected"]&#xD;
        n.right &amp;lt;- n.right + length(intersect(b.vars, x.vars))&#xD;
        n.wrong &amp;lt;- n.wrong + length(setdiff(b.vars, x.vars))&#xD;
    }&#xD;
    elapsed &amp;lt;- elapsed / tries&#xD;
    cat(" Elapsed = ", round(elapsed, 0), " seconds\n", sep = "")&#xD;
    n.right &amp;lt;- n.right / tries&#xD;
    n.wrong &amp;lt;- n.wrong / tries&#xD;
    bench[n, ] &amp;lt;- c(size, elapsed, n.right, n.wrong)&#xD;
    save(bench, file = file.name, ascii = FALSE, compress = FALSE)&#xD;
}&#xD;
&#xD;
print(bench)&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
As it turned out, our expectations for the size of data set we could handle were wildly optimistic and we killed the process at size 30.  We add to the data set a field with the total number of elements (observations times predictor variables) in the &lt;code&gt;x&lt;/code&gt; data set and plot the results.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;load(file = "bench-1.RData")&#xD;
bench &amp;lt;- na.omit(bench)&#xD;
## 10 * n.vars predictors times 100 * n.vars observations&#xD;
bench$n.elem &amp;lt;- bench$n.vars^2 * 1e3&#xD;
plot(elapsed ~ n.elem, data = bench, type = "b",&#xD;
     main = "Feature selections with Boruta",&#xD;
     sub = "Elapsed time versus number of data elements",&#xD;
     log = "xy",&#xD;
     xlab = "Elements in data set", ylab = "Elapsed time (seconds)")&#xD;
&lt;/pre&gt;&#xD;
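&lt;p&gt;&#xD;
The factor of 1,000 in &lt;code&gt;n.elem&lt;/code&gt; follows from the defaults of &lt;code&gt;make.data&lt;/code&gt;: each size gives &lt;code&gt;10 * n.var&lt;/code&gt; predictor columns and &lt;code&gt;100 * n.var&lt;/code&gt; rows.  A small sketch of that arithmetic (the three sizes and the name &lt;code&gt;n.pred&lt;/code&gt; are arbitrary choices for illustration):&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;m.rand &amp;lt;- 5                      # default in make.data&#xD;
m.obs &amp;lt;- 10                      # default in make.data&#xD;
n.var &amp;lt;- c(1, 10, 30)            # example sizes&#xD;
n.col &amp;lt;- n.var * m.rand          # columns per distribution&#xD;
n.pred &amp;lt;- 2 * n.col              # normal plus uniform predictors&#xD;
n.obs &amp;lt;- n.col * m.obs * 2       # rows: 10 per predictor&#xD;
n.elem &amp;lt;- n.pred * n.obs         # elements in the predictor matrix&#xD;
stopifnot(all(n.elem == n.var^2 * 1e3))&#xD;
print(data.frame(n.var, n.pred, n.obs, n.elem))&#xD;
&lt;/pre&gt;&#xD;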
&lt;div class="floatCenter" style="width: 400px"&gt;&#xD;
&lt;p&gt;&lt;a href="http://static.cybaea.net/images/Boruta-feature-benchmark.png"&gt;&lt;img src="http://static.cybaea.net/images/Boruta-feature-benchmark-400.png" width="400" height="400" alt="[Boruta feature selection benchmark]"&gt;&lt;/img&gt;&lt;/a&gt;&lt;/p&gt;&#xD;
&lt;p class="caption"&gt;Benchmarking results for feature selection with Boruta package shows linear scaling (slope is 1.01 with standard error 0.025 and adjusted R² 0.993)&lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
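&lt;p&gt;&#xD;
A slope on a log-log plot is the exponent of a power law: if elapsed = c · n.elem^k then log(elapsed) = log(c) + k · log(n.elem).  The sketch below shows how such a slope and its standard error can be read off a linear fit.  The timings here are synthetic (generated with k = 1, an arbitrary constant and arbitrary noise), not the measured benchmark values:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;set.seed(1)&#xD;
n.vars &amp;lt;- c(1:10, 20, 30)&#xD;
n.elem &amp;lt;- n.vars^2 * 1e3&#xD;
## Synthetic timings: exponent 1, arbitrary constant, 5% log-normal noise&#xD;
elapsed &amp;lt;- 2e-4 * n.elem * exp(rnorm(length(n.elem), sd = 0.05))&#xD;
fit &amp;lt;- lm(log(elapsed) ~ log(n.elem))&#xD;
slope &amp;lt;- coef(summary(fit))["log(n.elem)", c("Estimate", "Std. Error")]&#xD;
print(round(slope, 3))&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Note that because &lt;code&gt;n.elem&lt;/code&gt; grows as the square of &lt;code&gt;n.vars&lt;/code&gt; in our set-up, a slope of 1 against elements corresponds to quadratic growth in &lt;code&gt;n.vars&lt;/code&gt;.&#xD;
&lt;/p&gt;&#xD;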
&lt;p&gt;A quick check using &lt;code&gt;summary(lm(log(elapsed) ~ log(n.elem), data = bench))&lt;/code&gt; shows that the elapsed time scales linearly with the number of elements (slope is 1.01 with standard error 0.025 and adjusted R² 0.993).  The algorithm selects all the right features (&lt;code&gt;2 * n.vars&lt;/code&gt; of them) up to &lt;code&gt;n.vars = 10&lt;/code&gt;, after which it starts to miss some of them:&#xD;
&lt;/p&gt;&#xD;
&lt;table&gt;&#xD;
&lt;caption&gt;Benchmark results for Boruta package&lt;/caption&gt;&#xD;
&lt;thead&gt;&#xD;
&lt;tr&gt;&lt;th&gt;n.vars&lt;/th&gt;&lt;th&gt;right&lt;/th&gt;&lt;th&gt;wrong&lt;/th&gt;&lt;/tr&gt;&#xD;
&lt;/thead&gt;&#xD;
&lt;tbody style="text-align: right"&gt;&#xD;
&lt;tr&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;2.00000&lt;/td&gt;&lt;td&gt;1.1000000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;4.00000&lt;/td&gt;&lt;td&gt;1.2000000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;6.00000&lt;/td&gt;&lt;td&gt;1.6666667&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;8.00000&lt;/td&gt;&lt;td&gt;1.3333333&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;10.00000&lt;/td&gt;&lt;td&gt;1.6666667&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;12.00000&lt;/td&gt;&lt;td&gt;1.3333333&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;14.00000&lt;/td&gt;&lt;td&gt;1.0000000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;16.00000&lt;/td&gt;&lt;td&gt;1.3333333&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;9&lt;/td&gt;&lt;td&gt;18.00000&lt;/td&gt;&lt;td&gt;0.6666667&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;10&lt;/td&gt;&lt;td&gt;20.00000&lt;/td&gt;&lt;td&gt;0.3333333&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;20&lt;/td&gt;&lt;td&gt;39.33333&lt;/td&gt;&lt;td&gt;0.0000000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;30&lt;/td&gt;&lt;td&gt;56.33333&lt;/td&gt;&lt;td&gt;0.0000000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;/tbody&gt;&#xD;
&lt;/table&gt;&#xD;
&lt;p&gt;&#xD;
A higher accuracy in the feature selection for the larger problems could presumably be achieved by adjusting the &lt;code&gt;maxRuns&lt;/code&gt; and perhaps &lt;code&gt;confidence&lt;/code&gt; parameters on the &lt;code&gt;Boruta&lt;/code&gt; call.&#xD;
&lt;/p&gt;&#xD;
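&lt;p&gt;&#xD;
As a sketch of what such an adjustment could look like, the snippet below raises &lt;code&gt;maxRuns&lt;/code&gt; on a small artificial data set.  It assumes that the Boruta package is installed and that the installed release provides the &lt;code&gt;maxRuns&lt;/code&gt; argument and the &lt;code&gt;getSelectedAttributes&lt;/code&gt; helper; the data set, the seed and the value 200 are illustrative only, and we have not benchmarked this variant.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;set.seed(1)&#xD;
n.obs &amp;lt;- 200&#xD;
x &amp;lt;- data.frame(V = matrix(rnorm(n.obs * 6), n.obs, 6))&#xD;
y &amp;lt;- factor(x$V.1 + x$V.2 &amp;gt;= 0, labels = c("N", "P"))&#xD;
&#xD;
selected &amp;lt;- character(0)&#xD;
if (requireNamespace("Boruta", quietly = TRUE)) {&#xD;
    ## maxRuns caps the number of importance runs; 200 is an arbitrary example&#xD;
    b &amp;lt;- Boruta::Boruta(x, y, maxRuns = 200)&#xD;
    ## Keep "Tentative" features, as the benchmark script above does&#xD;
    selected &amp;lt;- Boruta::getSelectedAttributes(b, withTentative = TRUE)&#xD;
}&#xD;
print(selected)&#xD;
&lt;/pre&gt;&#xD;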
&lt;p&gt;&#xD;
In summary, the Boruta package performs well up to about 20 features out of 100 (&lt;code&gt;n.vars = 10&lt;/code&gt;), which runs in about 11 minutes on my machine.  If the implementation were changed to support multicore, MPI, and other parallel frameworks, then the out-of-the-box settings would be useful up to &lt;code&gt;n.vars&lt;/code&gt; of 20 or 30 (40-60 features out of 200-300), which an 8-core machine should be able to complete in 20 minutes or so.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
This is still a lot less than the size of data sets we normally work with.  (Our usual benchmark is 15,000 variables and 50,000 observations.)&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;The caret package&lt;/h2&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
One of the nice features of the caret package is that it supports most parallel processing frameworks out of the box, but for comparison with the previous analysis we will (somewhat unfairly) not use that here.  The setup is then quite simple, using the same &lt;code&gt;make.data&lt;/code&gt; function as before.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;#!/usr/bin/Rscript&#xD;
## bench.R - benchmark caret package&#xD;
## Copyright © 2010 Allan Engelhardt (&lt;a href="http://www.cybaea.net/"&gt;http://www.cybaea.net/&lt;/a&gt;)&#xD;
run.name &amp;lt;- "bench-2"&#xD;
library("caret")&#xD;
library("randomForest")&#xD;
set.seed(1)&#xD;
&#xD;
control &amp;lt;- rfeControl(functions = rfFuncs, verbose = FALSE,&#xD;
                      returnResamp = "final")&#xD;
&#xD;
## if ( require("multicore", quietly = TRUE, warn.conflicts = FALSE) ) {&#xD;
##     control$workers &amp;lt;- multicore:::detectCores()&#xD;
##     control$computeFunction &amp;lt;- mclapply&#xD;
##     control$computeArgs &amp;lt;- list(mc.preschedule = FALSE, mc.set.seed = FALSE)&#xD;
## }&#xD;
&#xD;
our.sizes &amp;lt;- c(2:10, 10*(2:10), 100*(2:10), 1e3*(2:10))&#xD;
n.sizes &amp;lt;- length(our.sizes)&#xD;
bench &amp;lt;- data.frame(n.vars = our.sizes, elapsed = NA, right = NA, wrong = NA)&#xD;
file.name &amp;lt;- paste(run.name, "RData", sep = ".")&#xD;
&#xD;
for (n in 1:length(our.sizes)) {&#xD;
    size &amp;lt;- our.sizes[n]&#xD;
    cat(sprintf("[%s] Size = %3d: ", as.character(Sys.time()), size))&#xD;
    tries &amp;lt;- max(3, round(10/size, 0))&#xD;
    n.right &amp;lt;- 0&#xD;
    n.wrong &amp;lt;- 0&#xD;
    elapsed &amp;lt;- 0&#xD;
    for (try in 1:tries) {&#xD;
        cat(tries-try, ".", sep = "")&#xD;
        x &amp;lt;- make.data(size)&#xD;
        x.vars &amp;lt;- attr(x, "vars")&#xD;
        elapsed &amp;lt;- elapsed + &#xD;
            system.time({p &amp;lt;- rfe(x[,-NCOL(x)], x[,NCOL(x)],&#xD;
                                  sizes = 1:(2*size), rfeControl = control)}&#xD;
                        )["elapsed"]&#xD;
        p.vars &amp;lt;- predictors(p)&#xD;
        n.right &amp;lt;- n.right + length(intersect(p.vars, x.vars))&#xD;
        n.wrong &amp;lt;- n.wrong + length(setdiff(p.vars, x.vars))&#xD;
    }&#xD;
    elapsed &amp;lt;- elapsed / tries&#xD;
    cat(" Elapsed = ", round(elapsed, 0), " seconds\n", sep = "")&#xD;
    n.right &amp;lt;- n.right / tries&#xD;
    n.wrong &amp;lt;- n.wrong / tries&#xD;
    bench[n, ] &amp;lt;- c(size, elapsed, n.right, n.wrong)&#xD;
    save(bench, file = file.name, ascii = FALSE, compress = FALSE)&#xD;
}&#xD;
&#xD;
print(bench)&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
This uses the &lt;code&gt;randomForest&lt;/code&gt; classifier from the package of the same name.  To use the &lt;code&gt;ipredbagg&lt;/code&gt; bagging classifier from Andrea Peters and Torsten Hothorn's &lt;a href="http://CRAN.R-project.org/package=ipred"&gt;ipred: Improved Predictors&lt;/a&gt; package we simply change the &lt;code&gt;control&lt;/code&gt; object to:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;control &amp;lt;- rfeControl(functions = treebagFuncs, verbose = FALSE,&#xD;
                      returnResamp = "final")&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
As usual, we were wildly optimistic in our guesses for the size of problems we could handle, and had to abort the run.&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;div class="floatCenter"&gt;&#xD;
&lt;div style="width: 400px; margin-right: 10px; display: inline-block;"&gt;&#xD;
&lt;p&gt;&lt;a href="http://static.cybaea.net/images/caret-rf-feature-benchmark.png"&gt;&lt;img src="http://static.cybaea.net/images/caret-rf-feature-benchmark-400.png" width="400" height="400" alt="[caret feature selection benchmark]"&gt;&lt;/img&gt;&lt;/a&gt;&lt;/p&gt;&#xD;
&lt;p class="caption"&gt;Benchmarking results for feature selection with caret package using randomForest classifier (slope is 1.17 with standard error 0.024 and adjusted R² 0.996)&lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div style="width: 400px; display: inline-block;"&gt;&#xD;
&lt;p&gt;&lt;a href="http://static.cybaea.net/images/caret-treebag-feature-benchmark.png"&gt;&lt;img src="http://static.cybaea.net/images/caret-treebag-feature-benchmark-400.png" width="400" height="400" alt="[caret feature selection benchmark]"&gt;&lt;/img&gt;&lt;/a&gt;&lt;/p&gt;&#xD;
&lt;p class="caption"&gt;Benchmarking results for feature selection with caret package using treebag classifier shows non-power behaviour (nevertheless, a linear log-log fit gives a slope of 1.12 with standard error 0.067 and adjusted R² 0.96)&lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;div class="floatCenter"&gt;&#xD;
&lt;div style="width: 400px; margin-right: 10px; display: inline-block;"&gt;&#xD;
&lt;table&gt;&#xD;
&lt;caption&gt;Benchmark results for caret package using randomForest classifier&lt;/caption&gt;&#xD;
&lt;thead&gt;&#xD;
&lt;tr&gt;&lt;th&gt;n.vars&lt;/th&gt;&lt;th&gt;right&lt;/th&gt;&lt;th&gt;wrong&lt;/th&gt;&lt;/tr&gt;&#xD;
&lt;/thead&gt;&#xD;
&lt;tbody style="text-align: right"&gt;&#xD;
&lt;tr&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;3.20000&lt;/td&gt;&lt;td&gt;3.200000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;5.00000&lt;/td&gt;&lt;td&gt;0.000000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;7.00000&lt;/td&gt;&lt;td&gt;0.000000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;9.00000&lt;/td&gt;&lt;td&gt;0.000000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;11.00000&lt;/td&gt;&lt;td&gt;0.000000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;13.00000&lt;/td&gt;&lt;td&gt;0.000000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;14.66667&lt;/td&gt;&lt;td&gt;0.000000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;9&lt;/td&gt;&lt;td&gt;16.66667&lt;/td&gt;&lt;td&gt;0.000000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;10&lt;/td&gt;&lt;td&gt;19.00000&lt;/td&gt;&lt;td&gt;0.000000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;20&lt;/td&gt;&lt;td&gt;38.66667&lt;/td&gt;&lt;td&gt;1.333333&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;30&lt;/td&gt;&lt;td&gt;54.00000&lt;/td&gt;&lt;td&gt;86.000000&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;/tbody&gt;&#xD;
&lt;/table&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div style="width: 400px; display: inline-block;"&gt;&#xD;
&lt;table&gt;&#xD;
&lt;caption&gt;Benchmark results for caret package using ipredbagg classifier&lt;/caption&gt;&#xD;
&lt;thead&gt;&#xD;
&lt;tr&gt;&lt;th&gt;n.vars&lt;/th&gt;&lt;th&gt;right&lt;/th&gt;&lt;th&gt;wrong&lt;/th&gt;&lt;/tr&gt;&#xD;
&lt;/thead&gt;&#xD;
&lt;tbody style="text-align: right"&gt;&#xD;
&lt;tr&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;3.00000&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;5.00000&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;7.00000&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;9.00000&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;10.33333&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;13.00000&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;14.33333&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;9&lt;/td&gt;&lt;td&gt;16.00000&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;10&lt;/td&gt;&lt;td&gt;18.66667&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;20&lt;/td&gt;&lt;td&gt;35.33333&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;30&lt;/td&gt;&lt;td&gt;54.33333&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;40&lt;/td&gt;&lt;td&gt;69.66667&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;/tbody&gt;&#xD;
&lt;/table&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
Remember that the right number of significant features is &lt;code&gt;2 * n.vars&lt;/code&gt;.  We see that the caret package apparently always misses one feature in its selection, which is very odd and possibly a bug.  It is less likely to select the wrong features than Boruta, but that could be partially due to our counting the "Tentative" features in Boruta as selected.  Timing-wise, performance is a little worse in the non-parallel situation, but realistically it will of course be a lot better than Boruta, depending on the number of cores on your processor or nodes in your cluster.&#xD;
&lt;/p&gt;&#xD;
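&lt;p&gt;&#xD;
To make the "right" and "wrong" columns concrete, the following toy example applies the same &lt;code&gt;intersect&lt;/code&gt; and &lt;code&gt;setdiff&lt;/code&gt; bookkeeping as the benchmark scripts to a hypothetical selection for &lt;code&gt;n.vars = 2&lt;/code&gt;.  The selected names and the extra &lt;code&gt;n.missed&lt;/code&gt; count are invented for illustration and are not benchmark output:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;x.vars &amp;lt;- c("N.1", "N.2", "U.1", "U.2")       # true features: 2 * n.vars of them&#xD;
p.vars &amp;lt;- c("N.1", "U.1", "U.2", "N.7")       # hypothetical selection&#xD;
n.right &amp;lt;- length(intersect(p.vars, x.vars))  # selected and relevant&#xD;
n.wrong &amp;lt;- length(setdiff(p.vars, x.vars))    # selected but irrelevant&#xD;
n.missed &amp;lt;- length(setdiff(x.vars, p.vars))   # relevant but not selected&#xD;
print(c(right = n.right, wrong = n.wrong, missed = n.missed))&#xD;
&lt;/pre&gt;&#xD;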
&lt;p&gt;&#xD;
Neither approach is suitable out of the box for the sizes of data sets that we normally work with.&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=G4iSTwL88Q0:G-MFRqmO24E:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=G4iSTwL88Q0:G-MFRqmO24E:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=G4iSTwL88Q0:G-MFRqmO24E:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=G4iSTwL88Q0:G-MFRqmO24E:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=G4iSTwL88Q0:G-MFRqmO24E:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=G4iSTwL88Q0:G-MFRqmO24E:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=G4iSTwL88Q0:G-MFRqmO24E:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=G4iSTwL88Q0:G-MFRqmO24E:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=G4iSTwL88Q0:G-MFRqmO24E:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/G4iSTwL88Q0" height="1" width="1"/&gt;</content><published>2010-11-25T13:43:00Z</published><updated>2010-11-25T13:43:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Benchmarking-feature-selection-with-Boruta-and-caret.html</feedburner:origLink></entry><entry><title type="text">Feature selection: Using the caret package</title><id>urn:uuid:1dda2c01-4d41-54a6-b70c-8d9c5be380fc</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Feature-selection-Using-the-caret-package.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/O6IQ4h7grTk/Feature-selection-Using-the-caret-package.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building.  In a previous post we looked at <a href="http://www.cybaea.net/Blogs/Data/Feature-selection-All_relevant-selection-with-the-Boruta-package.html">all-relevant feature selection using the Boruta package</a> while in this post we consider the same (artificial, toy) examples using the <a href="http://CRAN.R-project.org/package=caret">caret</a> package.  Max Kuhn kindly listed me as a contributor for some performance enhancements I submitted, but the genius behind the package is all his.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building.  In a previous post we looked at &lt;a href="http://www.cybaea.net/Blogs/Data/Feature-selection-All_relevant-selection-with-the-Boruta-package.html"&gt;all-relevant feature selection using the Boruta package&lt;/a&gt; while in this post we consider the same (artificial, toy) examples using the &lt;a href="http://CRAN.R-project.org/package=caret"&gt;caret&lt;/a&gt; package.  Max Kuhn kindly listed me as a contributor for some performance enhancements I submitted, but the genius behind the package is all his.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
The caret package provides a very flexible framework for the analysis as we shall see, but first we set up the artificial test data set as in the previous article.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;## Feature-bc.R - Compare Boruta and caret feature selection&#xD;
## Copyright © 2010 Allan Engelhardt (http://www.cybaea.net/)&#xD;
run.name &amp;lt;- "feature-bc"&#xD;
library("caret")&#xD;
&#xD;
## Load early to get the warnings out of the way:&#xD;
library("randomForest")&#xD;
library("ipred")&#xD;
library("gbm")&#xD;
&#xD;
set.seed(1)&#xD;
&#xD;
## Set up artificial test data for our analysis&#xD;
n.var &amp;lt;- 20&#xD;
n.obs &amp;lt;- 200&#xD;
x &amp;lt;- data.frame(V = matrix(rnorm(n.var*n.obs), n.obs, n.var))&#xD;
n.dep &amp;lt;- floor(n.var/5)&#xD;
cat( "Number of dependent variables is", n.dep, "\n")&#xD;
m &amp;lt;- diag(n.dep:1)&#xD;
&#xD;
## These are our four test targets&#xD;
y.1 &amp;lt;- factor( ifelse( x$V.1 &amp;gt;= 0, 'A', 'B' ) )&#xD;
y.2 &amp;lt;- ifelse( rowSums(as.matrix(x[, 1:n.dep]) %*% m) &amp;gt;= 0, "A", "B" )&#xD;
y.2 &amp;lt;- factor(y.2)&#xD;
y.3 &amp;lt;- factor(rowSums(x[, 1:n.dep] &amp;gt;= 0))&#xD;
y.4 &amp;lt;- factor(rowSums(x[, 1:n.dep] &amp;gt;= 0) %% 2)&#xD;
&lt;/pre&gt;&#xD;
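&lt;p&gt;&#xD;
To see how the third and fourth targets relate, this small sketch tabulates the number of non-negative dependent variables in each row next to the two classes derived from it.  It uses its own eight-row data frame &lt;code&gt;x.small&lt;/code&gt; (a name and size chosen only for display), not the &lt;code&gt;x&lt;/code&gt; above:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;set.seed(1)&#xD;
n.dep &amp;lt;- 4&#xD;
x.small &amp;lt;- data.frame(V = matrix(rnorm(8 * n.dep), 8, n.dep))&#xD;
n.pos &amp;lt;- rowSums(x.small &amp;gt;= 0)   # how many of V.1 to V.4 are non-negative&#xD;
y.3 &amp;lt;- factor(n.pos)             # class is the count itself&#xD;
y.4 &amp;lt;- factor(n.pos %% 2)        # class is the parity of the count&#xD;
print(data.frame(n.pos, y.3, y.4))&#xD;
&lt;/pre&gt;&#xD;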
&lt;p&gt;&#xD;
The flexibility of the caret package is to a large extent implemented through control objects.  Here we specify the &lt;code&gt;randomForest&lt;/code&gt; classification algorithm (which is also what Boruta uses), and if the multicore package is available we use it for extra performance (you can also use MPI etc.; see the documentation):&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;control &amp;lt;- rfeControl(functions = rfFuncs, method = "boot", verbose = FALSE,&#xD;
                      returnResamp = "final", number = 50)&#xD;
&#xD;
if ( require("multicore", quietly = TRUE, warn.conflicts = FALSE) ) {&#xD;
    control$workers &amp;lt;- multicore:::detectCores()&#xD;
    control$computeFunction &amp;lt;- mclapply&#xD;
    control$computeArgs &amp;lt;- list(mc.preschedule = FALSE, mc.set.seed = FALSE)&#xD;
}&#xD;
&lt;/pre&gt;&#xD;
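&lt;p&gt;&#xD;
The &lt;code&gt;workers&lt;/code&gt;, &lt;code&gt;computeFunction&lt;/code&gt;, and &lt;code&gt;computeArgs&lt;/code&gt; elements above belong to the caret release that was current when this was written.  As far as I know, later releases of caret run the resampling loop through the &lt;code&gt;foreach&lt;/code&gt; package instead, so a rough equivalent might look like the sketch below.  It assumes such a release and that the &lt;code&gt;doParallel&lt;/code&gt; package is installed; &lt;code&gt;n.cores&lt;/code&gt; is just a name used for this illustration.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;## Sketch only: assumes a later caret release that uses foreach for resampling&#xD;
library("caret")&#xD;
library("doParallel")&#xD;
&#xD;
## detectCores() can return NA, so fall back to a single worker&#xD;
n.cores &amp;lt;- parallel::detectCores()&#xD;
if ( is.na(n.cores) ) n.cores &amp;lt;- 1&#xD;
registerDoParallel(cores = n.cores)&#xD;
&#xD;
## Same control object as above, with parallel resampling allowed&#xD;
control &amp;lt;- rfeControl(functions = rfFuncs, method = "boot", verbose = FALSE,&#xD;
                      returnResamp = "final", number = 50,&#xD;
                      allowParallel = TRUE)&#xD;
&lt;/pre&gt;&#xD;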
&lt;p&gt;&#xD;
We will consider from one to six features (using the &lt;code&gt;sizes&lt;/code&gt; variable) and then we simply let it loose:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;sizes &amp;lt;- 1:6&#xD;
&#xD;
## Use randomForest for prediction&#xD;
profile.1 &amp;lt;- rfe(x, y.1, sizes = sizes, rfeControl = control)&#xD;
cat( "rf     : Profile 1 predictors:", predictors(profile.1), fill = TRUE )&#xD;
profile.2 &amp;lt;- rfe(x, y.2, sizes = sizes, rfeControl = control)&#xD;
cat( "rf     : Profile 2 predictors:", predictors(profile.2), fill = TRUE )&#xD;
profile.3 &amp;lt;- rfe(x, y.3, sizes = sizes, rfeControl = control)&#xD;
cat( "rf     : Profile 3 predictors:", predictors(profile.3), fill = TRUE )&#xD;
profile.4 &amp;lt;- rfe(x, y.4, sizes = sizes, rfeControl = control)&#xD;
cat( "rf     : Profile 4 predictors:", predictors(profile.4), fill = TRUE )&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The results are:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;rf     : Profile 1 predictors: V.1 V.16 V.6&#xD;
rf     : Profile 2 predictors: V.1 V.2&#xD;
rf     : Profile 3 predictors: V.4 V.1 V.2&#xD;
rf     : Profile 4 predictors: V.10 V.11 V.7&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
If you recall the &lt;a href="http://www.cybaea.net/Blogs/Data/Feature-selection-All_relevant-selection-with-the-Boruta-package.html"&gt;feature selection with Boruta&lt;/a&gt; article, then the results there were&#xD;
&lt;/p&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;Profile 1: &lt;code&gt;V.1, (V.16, V.17)&lt;/code&gt;&lt;/li&gt;&#xD;
&lt;li&gt;Profile 2: &lt;code&gt;V.1, V.2, V.3, (V.8, V.9, V.4)&lt;/code&gt;&lt;/li&gt;&#xD;
&lt;li&gt;Profile 3: &lt;code&gt;V.1, V.4, V.3, V.2, (V.7, V.6)&lt;/code&gt;&lt;/li&gt;&#xD;
&lt;li&gt;Profile 4: &lt;code&gt;V.10, (V.11, V.13)&lt;/code&gt;&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;p&gt;To show the flexibility of caret, we can run the analysis with another of the built-in classifiers:&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;## Use ipred::ipredbagg for prediction&#xD;
control$functions &amp;lt;- treebagFuncs&#xD;
profile.1 &amp;lt;- rfe(x, y.1, sizes = sizes, rfeControl = control)&#xD;
cat( "treebag: Profile 1 predictors:", predictors(profile.1), fill = TRUE )&#xD;
profile.2 &amp;lt;- rfe(x, y.2, sizes = sizes, rfeControl = control)&#xD;
cat( "treebag: Profile 2 predictors:", predictors(profile.2), fill = TRUE )&#xD;
profile.3 &amp;lt;- rfe(x, y.3, sizes = sizes, rfeControl = control)&#xD;
cat( "treebag: Profile 3 predictors:", predictors(profile.3), fill = TRUE )&#xD;
profile.4 &amp;lt;- rfe(x, y.4, sizes = sizes, rfeControl = control)&#xD;
cat( "treebag: Profile 4 predictors:", predictors(profile.4), fill = TRUE )&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;This gives:&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;treebag: Profile 1 predictors: V.1 V.16&#xD;
treebag: Profile 2 predictors: V.2 V.1&#xD;
treebag: Profile 3 predictors: V.1 V.3 V.2&#xD;
treebag: Profile 4 predictors: V.10 V.11 V.1 V.7 V.13&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;And of course, if you have your own favourite model class that is not already implemented, then you can easily add it yourself.  We like &lt;code&gt;gbm&lt;/code&gt; from the package of the same name, which is kind of silly to use here because it provides variable importance automatically as part of the fitting process, but may still be useful.  It needs a numeric response, so we convert the targets first:&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;## Use gbm for prediction&#xD;
y.1 &amp;lt;- as.numeric(y.1)-1&#xD;
y.2 &amp;lt;- as.numeric(y.2)-1&#xD;
y.3 &amp;lt;- as.numeric(y.3)-1&#xD;
y.4 &amp;lt;- as.numeric(y.4)-1&#xD;
&#xD;
gbmFuncs &amp;lt;- treebagFuncs&#xD;
gbmFuncs$fit &amp;lt;- function (x, y, first, last, ...) {&#xD;
    library("gbm")&#xD;
    n.levels &amp;lt;- length(unique(y))&#xD;
    if ( n.levels == 2 ) {&#xD;
        distribution = "bernoulli"&#xD;
    } else {&#xD;
        distribution = "gaussian"&#xD;
    }&#xD;
    gbm.fit(x, y, distribution = distribution, ...)&#xD;
}&#xD;
gbmFuncs$pred &amp;lt;- function (object, x) {&#xD;
    n.trees &amp;lt;- suppressWarnings(gbm.perf(object,&#xD;
                                         plot.it = FALSE,&#xD;
                                         method = "OOB"))&#xD;
    if ( n.trees &amp;lt;= 0 ) n.trees &amp;lt;- object$n.trees&#xD;
    predict(object, x, n.trees = n.trees, type = "link")&#xD;
}&#xD;
control$functions &amp;lt;- gbmFuncs&#xD;
&#xD;
n.trees &amp;lt;- 1e2                          # Default value for gbm is 100&#xD;
&#xD;
profile.1 &amp;lt;- rfe(x, y.1, sizes = sizes, rfeControl = control, verbose = FALSE,&#xD;
                 n.trees = n.trees)&#xD;
cat( "gbm    : Profile 1 predictors:", predictors(profile.1), fill = TRUE )&#xD;
profile.2 &amp;lt;- rfe(x, y.2, sizes = sizes, rfeControl = control, verbose = FALSE,&#xD;
                 n.trees = n.trees)&#xD;
cat( "gbm    : Profile 2 predictors:", predictors(profile.2), fill = TRUE )&#xD;
profile.3 &amp;lt;- rfe(x, y.3, sizes = sizes, rfeControl = control, verbose = FALSE,&#xD;
                 n.trees = n.trees)&#xD;
cat( "gbm    : Profile 3 predictors:", predictors(profile.3), fill = TRUE )&#xD;
profile.4 &amp;lt;- rfe(x, y.4, sizes = sizes, rfeControl = control, verbose = FALSE,&#xD;
                 n.trees = n.trees)&#xD;
cat( "gbm    : Profile 4 predictors:", predictors(profile.4), fill = TRUE )&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;And we get the results below:&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;gbm    : Profile 1 predictors: V.1 V.10 V.11 V.12 V.13&#xD;
gbm    : Profile 2 predictors: V.1 V.2&#xD;
gbm    : Profile 3 predictors: V.4 V.1 V.2 V.3 V.7&#xD;
gbm    : Profile 4 predictors: V.11 V.10 V.1 V.6 V.7 V.18&#xD;
&lt;/pre&gt;&#xD;
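&lt;p&gt;&#xD;
To put the three sets of caret results side by side, a small tally can help.  The sketch below simply retypes the predictors printed above for profile 3, where the true variables are V.1 to V.4; the names &lt;code&gt;true.vars&lt;/code&gt;, &lt;code&gt;selected&lt;/code&gt;, and &lt;code&gt;tally&lt;/code&gt; are made up for this illustration.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;## Sketch: count true and spurious predictors for profile 3&#xD;
true.vars &amp;lt;- paste("V", 1:4, sep = ".")&#xD;
&#xD;
## Predictors as printed in the results above&#xD;
selected &amp;lt;- list(&#xD;
    rf      = c("V.4", "V.1", "V.2"),&#xD;
    treebag = c("V.1", "V.3", "V.2"),&#xD;
    gbm     = c("V.4", "V.1", "V.2", "V.3", "V.7")&#xD;
)&#xD;
&#xD;
## One column per model: how many selected variables are true, how many are not&#xD;
tally &amp;lt;- sapply(selected, function (s) {&#xD;
    c(hits = sum(s %in% true.vars), false.pos = sum(!(s %in% true.vars)))&#xD;
})&#xD;
print(tally)&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The same few lines work for the other profiles if you swap in the relevant predictors and true variables.&#xD;
&lt;/p&gt;&#xD;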
&lt;p&gt;It is all good and very flexible, for sure, but I can’t really say it is better than the Boruta approach for these simple examples.&lt;/p&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/O6IQ4h7grTk" height="1" width="1"/&gt;</content><published>2010-11-16T19:35:00Z</published><updated>2010-11-18T06:58:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Feature-selection-Using-the-caret-package.html</feedburner:origLink></entry><entry><title type="text">Feature selection: All-relevant selection with the Boruta package</title><id>urn:uuid:72b78e0b-1552-5e4c-8305-a363cc446cea</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Feature-selection-All_relevant-selection-with-the-Boruta-package.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/0S81Gxhmv0s/Feature-selection-All_relevant-selection-with-the-Boruta-package.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><div class="floatRight">
  <a href="http://www.cybaea.net/Blogs/Data/Feature-selection-All_relevant-selection-with-the-Boruta-package.html" title="Click for full article">
    <img src="http://static.cybaea.net/images/feature-1.4.150.png" width="150" height="150" alt="[Variable importance example]" />
  </a>
</div>
<p>
Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building.  There are two main approaches to selecting the features (variables) we will use for the analysis: the <dfn>minimal-optimal feature selection</dfn> which identifies a small (ideally minimal) set of variables that gives the best possible classification result (for a class of classification models) and the <dfn>all-relevant feature selection</dfn> which identifies all variables that are in some circumstances relevant for the classification.
</p>
<p>
In this article we take a first look at the problem of all-relevant feature selection using the <a href="http://www.jstatsoft.org/v36/i11/">Boruta package</a> by Miron B. Kursa and Witold R. Rudnicki.  This package is developed for the <a href="http://www.r-project.org/">R statistical computing and analysis platform</a>.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building.  There are two main approaches to selecting the features (variables) we will use for the analysis: the &lt;dfn&gt;minimal-optimal feature selection&lt;/dfn&gt; which identifies a small (ideally minimal) set of variables that gives the best possible classification result (for a class of classification models) and the &lt;dfn&gt;all-relevant feature selection&lt;/dfn&gt; which identifies all variables that are in some circumstances relevant for the classification.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
In this article we take a first look at the problem of all-relevant feature selection using the &lt;a href="http://www.jstatsoft.org/v36/i11/"&gt;Boruta package&lt;/a&gt; by Miron B. Kursa and Witold R. Rudnicki.  This package is developed for the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt;.&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;Background&lt;/h2&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
All-relevant feature selection is extremely useful for commercial data miners.  We deploy it when we want to &lt;em&gt;understand&lt;/em&gt; the mechanisms behind the behaviour or subject of interest, rather than just building a black-box predictive model.  This understanding leads us to a better appreciation of our customers (or other subject under investigation) and not just how, but &lt;em&gt;why&lt;/em&gt; they behave as they do, which is useful for all areas of the business, including strategy and product development.  More narrowly, it also helps us define the variables that we want to observe, which is what will really make a difference in our ability to predict behaviour (as opposed to, say, running the data mining application a little longer).&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
I really like the theoretical approach that the Boruta package tries to implement.  It is based on the more general idea that by adding randomness to a system and then collecting results from random samples of the bigger system, one can actually reduce the misleading impact of randomness in the original sample.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
For the implementation, the Boruta package relies on a random forest classification algorithm.  This provides an intrinsic measure of the importance of each feature, known as the Z score.  While this score is not directly a statistical measure of the significance of the feature, we can compare it to random permutations of (a selection of) the variables to test if it is higher than the scores from random variables.  This is the essence of the implementation in Boruta.&#xD;
&lt;/p&gt;&#xD;
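&lt;p&gt;&#xD;
To make the idea concrete, here is a rough sketch of the comparison against permuted variables.  It is not the Boruta implementation itself: it makes a single pass, adds one shuffled copy of every column, and uses the scaled permutation importance from &lt;code&gt;randomForest&lt;/code&gt; as a stand-in for the score.  The names &lt;code&gt;shadow&lt;/code&gt;, &lt;code&gt;imp&lt;/code&gt;, &lt;code&gt;threshold&lt;/code&gt;, and &lt;code&gt;candidates&lt;/code&gt; are made up for this illustration.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;## Sketch only: one pass of a shuffled-copy comparison, not Boruta itself&#xD;
library("randomForest")&#xD;
set.seed(1)&#xD;
&#xD;
## Toy data: the target depends on V.1 only&#xD;
x &amp;lt;- data.frame(V = matrix(rnorm(20 * 200), 200, 20))&#xD;
y &amp;lt;- factor(ifelse(x$V.1 &amp;gt;= 0, "A", "B"))&#xD;
&#xD;
## Shuffle each column independently to break any link with the target&#xD;
shadow &amp;lt;- as.data.frame(lapply(x, sample))&#xD;
names(shadow) &amp;lt;- paste("shadow", names(x), sep = ".")&#xD;
&#xD;
## Fit on real and shuffled columns together and extract the importance&#xD;
fit &amp;lt;- randomForest(cbind(x, shadow), y, importance = TRUE)&#xD;
imp &amp;lt;- importance(fit, type = 1)[, 1]&#xD;
&#xD;
## Keep the variables that beat the best shuffled column&#xD;
threshold &amp;lt;- max(imp[names(shadow)])&#xD;
candidates &amp;lt;- names(imp)[imp &amp;gt; threshold]&#xD;
print(candidates)&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
A single pass like this is noisy, which is why repeating the comparison many times and testing the outcome, as the package does, matters.&#xD;
&lt;/p&gt;&#xD;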
&#xD;
&lt;h2&gt;The tests&lt;/h2&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
This article is a first investigation into the performance of the Boruta package.  For this initial examination we will use a test data sample that we can control so we know what is important and what is not.  We will consider 200 observations of 20 normally distributed random variables:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;run.name &amp;lt;- "feature-1"&#xD;
library("Boruta")&#xD;
set.seed(1)&#xD;
## Set up artificial test data for our analysis&#xD;
n.var &amp;lt;- 20&#xD;
n.obs &amp;lt;- 200&#xD;
x &amp;lt;- data.frame(V=matrix(rnorm(n.var*n.obs), n.obs, n.var))&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The normal distribution has the advantage of simplicity, but it may not be the best test for commercial applications where highly non-normally distributed features, like money spent, are important.  Nevertheless, we will use it for now and define a simple utility function before we get on to the tests:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;## Utility function to make plots of Boruta test results&#xD;
make.plots &amp;lt;- function(b, num,&#xD;
                       true.var = NA,&#xD;
                       main = paste("Boruta feature selection for test", num)) {&#xD;
    write.text &amp;lt;- function(b, true.var) {&#xD;
        if ( !is.na(true.var) ) {&#xD;
            text(1, max(attStats(b)$meanZ), pos = 4,&#xD;
                 labels = paste("True vars are V.1-V.",&#xD;
                     true.var, sep = ""))        &#xD;
        }&#xD;
    }&#xD;
    plot(b, main = main, las = 3, xlab = "")&#xD;
    write.text(b, true.var)&#xD;
    png(paste(run.name, num, "png", sep = "."), width = 8, height = 8,&#xD;
        units = "cm", res = 300, pointsize = 4)&#xD;
    plot(b, main = main, lwd = 0.5, las = 3, xlab = "")&#xD;
    write.text(b, true.var)&#xD;
    dev.off()&#xD;
}&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;h3&gt;Test 1: Simple test of single significant variable&lt;/h3&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
For a simple classification based on a single variable, Boruta performs well: while it identifies three variables as being potentially important, this does include the true variable (V.1) and the plot clearly shows it as being by far the most significant.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;## 1. Simple test of single variable&#xD;
y.1 &amp;lt;- factor( ifelse( x$V.1 &amp;gt;= 0, 'A', 'B' ) )&#xD;
&#xD;
b.1 &amp;lt;- Boruta(x, y.1, doTrace = 2)&#xD;
make.plots(b.1, 1)&#xD;
&lt;/pre&gt;&#xD;
&lt;div class="floatCenter" style="width: 400px"&gt;&#xD;
&lt;p&gt;&lt;a href="http://static.cybaea.net/images/feature-1.1.png"&gt;&lt;img src="http://static.cybaea.net/images/feature-1.1.400.png" width="400" height="400" alt="[Example 1]"&gt;&lt;/img&gt;&lt;/a&gt;&lt;/p&gt;&#xD;
&lt;p class="caption"&gt;Figure 1: Simple test of Boruta feature selection with single variable.&lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;h3&gt;Test 2: Simple test of linear combination of variables&lt;/h3&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
With a test of a linear combination of the first four variables where the weights are decreasing from 4 to 1, we begin to get closer to the limitations of the approach.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;## 2. Simple test of linear combination&#xD;
n.dep &amp;lt;- floor(n.var/5)&#xD;
print(n.dep)&#xD;
&#xD;
m &amp;lt;- diag(n.dep:1)&#xD;
&#xD;
y.2 &amp;lt;- ifelse( rowSums(as.matrix(x[, 1:n.dep]) %*% m) &amp;gt;= 0, "A", "B" )&#xD;
y.2 &amp;lt;- factor(y.2)&#xD;
&#xD;
b.2 &amp;lt;- Boruta(x, y.2, doTrace = 2)&#xD;
make.plots(b.2, 2, n.dep)&#xD;
&lt;/pre&gt;&#xD;
&lt;div class="floatCenter" style="width: 400px"&gt;&#xD;
&lt;p&gt;&lt;a href="http://static.cybaea.net/images/feature-1.2.png"&gt;&lt;img src="http://static.cybaea.net/images/feature-1.2.400.png" width="400" height="400" alt="[Example 2]"&gt;&lt;/img&gt;&lt;/a&gt;&lt;/p&gt;&#xD;
&lt;p class="caption"&gt;Figure 2: Simple test of Boruta feature selection with linear combination of four variables.&lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;p&gt;&#xD;
The implementation correctly identified the first three variables (with weights 4, 3, and 2, respectively) as being important, but it had the fourth variable as possible along with the two random variables V.8 and V.9.  Still, six variables are more approachable than twenty.&#xD;
&lt;/p&gt;&#xD;
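&lt;p&gt;&#xD;
In case the matrix product in the code above looks opaque: multiplying by &lt;code&gt;diag(4:1)&lt;/code&gt; and summing the rows is just the weighted sum 4 V.1 + 3 V.2 + 2 V.3 + V.4.  The short check below regenerates the same test data and confirms this; &lt;code&gt;lhs&lt;/code&gt; and &lt;code&gt;rhs&lt;/code&gt; are names used only for this illustration.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;## Sketch: the diag() product is a weighted sum of the first four columns&#xD;
set.seed(1)&#xD;
x &amp;lt;- data.frame(V = matrix(rnorm(20 * 200), 200, 20))&#xD;
m &amp;lt;- diag(4:1)&#xD;
&#xD;
lhs &amp;lt;- rowSums(as.matrix(x[, 1:4]) %*% m)&#xD;
rhs &amp;lt;- 4 * x$V.1 + 3 * x$V.2 + 2 * x$V.3 + 1 * x$V.4&#xD;
&#xD;
## Stops with an error if the two ever differ&#xD;
stopifnot(isTRUE(all.equal(unname(lhs), rhs)))&#xD;
&lt;/pre&gt;&#xD;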
&#xD;
&lt;h3&gt;Test 3: Simple test of less-linear combination of four variables&lt;/h3&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
For this test and the following one we consider less obvious combinations of the first four variables.  If we just count how many of them are positive, then we get to a situation where Boruta excels (because random forests excel at this type of problem).&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="screen"&gt;## 3. Simple test of less-linear combination&#xD;
y.3 &amp;lt;- factor(rowSums(x[, 1:n.dep] &amp;gt;= 0))&#xD;
print(summary(y.3))&#xD;
b.3 &amp;lt;- Boruta(x, y.3, doTrace = 2)&#xD;
print(b.3)&#xD;
make.plots(b.3, 3, n.dep)&#xD;
&lt;/pre&gt;&#xD;
&lt;div class="floatCenter" style="width: 400px"&gt;&#xD;
&lt;p&gt;&lt;a href="http://static.cybaea.net/images/feature-1.3.png"&gt;&lt;img src="http://static.cybaea.net/images/feature-1.3.400.png" width="400" height="400" alt="[Example 3]"&gt;&lt;/img&gt;&lt;/a&gt;&lt;/p&gt;&#xD;
&lt;p class="caption"&gt;Figure 3: Simple test of Boruta feature selection counting the positives of four variables.&lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;h3&gt;Test 4: Simple test of non-linear combination&lt;/h3&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
For a spectacular fail of the Boruta approach we will have to consider a classification in the hyperplane of the four variables.  For this example, we simply check whether there is an even or odd number of positive values among the first four variables:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;## 4. Simple test of non-linear combination&#xD;
y.4 &amp;lt;- factor(rowSums(x[, 1:n.dep] &amp;gt;= 0) %% 2)&#xD;
b.4 &amp;lt;- Boruta(x, y.4, doTrace = 2)&#xD;
print(b.4)&#xD;
make.plots(b.4, 4, n.dep)&#xD;
&lt;/pre&gt;&#xD;
&lt;div class="floatCenter" style="width: 400px"&gt;&#xD;
&lt;p&gt;&lt;a href="http://static.cybaea.net/images/feature-1.4.png"&gt;&lt;img src="http://static.cybaea.net/images/feature-1.4.400.png" width="400" height="400" alt="Example 4"&gt;&lt;/img&gt;&lt;/a&gt;&lt;/p&gt;&#xD;
&lt;p class="caption"&gt;Figure 4: Simple test of Boruta feature selection with non-linear combination of four variables&lt;/p&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;p&gt;&#xD;
Ouch.  The package rejects the four known significant variables.  It is too hard for the random forest approach.  Increasing the number of observations to 1,000 does not help, though at 5,000 observations Boruta correctly identifies the four variables.&#xD;
&lt;/p&gt;&#xD;
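&lt;p&gt;&#xD;
One way to see why this target is so hard: taken on its own, none of the four variables says anything about the parity, because whatever the sign of one variable, the other three make an even or odd count equally likely.  The sketch below tabulates the sign of V.1 against the target on the same test data; in expectation each row is split evenly, although a sample of 200 will wobble around that.  The name &lt;code&gt;marginal&lt;/code&gt; is used only for this illustration.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;## Sketch: a single variable carries no marginal signal about the parity&#xD;
set.seed(1)&#xD;
x &amp;lt;- data.frame(V = matrix(rnorm(20 * 200), 200, 20))&#xD;
y.4 &amp;lt;- factor(rowSums(x[, 1:4] &amp;gt;= 0) %% 2)&#xD;
&#xD;
## Row proportions: share of even (0) and odd (1) targets by sign of V.1&#xD;
marginal &amp;lt;- prop.table(table(V.1.positive = x$V.1 &amp;gt;= 0, parity = y.4),&#xD;
                       margin = 1)&#xD;
print(marginal)&#xD;
&lt;/pre&gt;&#xD;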
&#xD;
&lt;h2&gt;Limitations&lt;/h2&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
Some limitations of the Boruta package are worth highlighting:&#xD;
&lt;/p&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;It only works with classification (factor) target variables.  I am not sure why: as far as I remember, the random forest algorithm also provides a variable significance score when it is used for regression, not just when it is run as a classifier.&lt;/li&gt;&#xD;
&lt;li&gt;It does not handle missing (&lt;code&gt;NA&lt;/code&gt;) values at all.  This is quite a problem when working with real data sets, and a shame as random forests are in principle very good at handling missing values.  A simple re-write of the package using the &lt;code&gt;party&lt;/code&gt; package instead of &lt;code&gt;randomForest&lt;/code&gt; should be able to fix this issue.&lt;/li&gt;&#xD;
&lt;li&gt;It does not seem to be completely stable.  I have crashed it on several real-world data sets and am working on a minimal set to send to the authors.&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;p&gt;&#xD;
But this is a really promising approach, if somewhat slow on large sets.  I will have a look at some real-world data in a future post.&#xD;
&lt;/p&gt;&#xD;
&lt;div class="seealso"&gt;&#xD;
&lt;h1&gt;You may also like these posts:&lt;/h1&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-50.png" width="85" height="16" alt="[0.52]" title="[0.52]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Benchmarking-feature-selection-with-Boruta-and-caret.html" title="Feature selection is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering. For commercial data mining, which is often characterised by having too many variables for model building, this is an important step in the analysis process. And since we often work on very large data sets the performance of our process is very important to us. Having looked at feature selection using the Boruta package and feature selection using the caret package separately, we now consider the performance of the two approaches. Neither approach is suitable out of the box for the sizes of data sets that we normally work with."&gt;Benchmarking feature selection with Boruta and caret&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Feature selection is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering. For commercial data mining, which is often characterised by having too many variables for model building, this is an important step in the analysis process. And since we often work on very large data sets the performance of our process is very important to us. Having looked at feature selection using the Boruta package and feature selection using the caret package separately, we now consider the performance of the two approaches. Neither approach is suitable out of the box for the sizes of data sets that we normally work with.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.47]" title="[0.47]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Feature-selection-Using-the-caret-package.html" title="Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building. In a previous post we looked at all-relevant feature selection using the Boruta package while in this post we consider the same (artificial, toy) examples using the caret package. Max Kuhn kindly listed me as a contributor for some performance enhancements I submitted, but the genius behind the package is all his."&gt;Feature selection: Using the caret package&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building. In a previous post we looked at all-relevant feature selection using the Boruta package while in this post we consider the same (artificial, toy) examples using the caret package. Max Kuhn kindly listed me as a contributor for some performance enhancements I submitted, but the genius behind the package is all his.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-30.png" width="85" height="16" alt="[0.40]" title="[0.40]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-code-for-Chapter-2-of-Non_Life-Insurance-Pricing-with-GLM.html" title="We continue working our way through the examples, case studies, and exercises of what is affectionately known here as “the two bears book” (Swedish björn = bear) and more formally as Non-Life Insurance Pricing with Generalized Linear Models by Esbjörn Ohlsson and Börn Johansson (Amazon UK | US ). At this stage, our purpose is to reproduce the analysis from the book using the R statistical computing and analysis platform, and to answer the data analysis elements of the exercises and case studies. Any critique of the approach and of pricing and modeling in the Insurance industry in general will wait for a later article."&gt;R code for Chapter 2 of Non-Life Insurance Pricing with GLM&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;We continue working our way through the examples, case studies, and exercises of what is affectionately known here as “the two bears book” (Swedish björn = bear) and more formally as Non-Life Insurance Pricing with Generalized Linear Models by Esbjörn Ohl…&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-10.png" width="85" height="16" alt="[0.37]" title="[0.37]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-code-for-Chapter-1-of-Non_Life-Insurance-Pricing-with-GLM.html" title="Insurance pricing is backwards and primitive, harking back to an era before computers. One standard (and good) textbook on the topic is Non-Life Insurance Pricing with Generalized Linear Models by Esbjorn Ohlsson and Born Johansson. We have been doing some work in this area recently. Needing a robust internal training course and documented methodology, we have been working our way through the book again and converting the examples and exercises to R , the statistical computing and analysis platform. This is part of a series of posts containing elements of the R code."&gt;R code for Chapter 1 of Non-Life Insurance Pricing with GLM&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Insurance pricing is backwards and primitive, harking back to an era before computers. One standard (and good) textbook on the topic is Non-Life Insurance Pricing with Generalized Linear Models by Esbjorn Ohlsson and Born Johansson. We have been doing som…&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-10.png" width="85" height="16" alt="[0.36]" title="[0.36]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-Eliminating-observed-values-with-zero-variance.html" title="I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform . In other words, I want to find the columns in a data frame that has zero variance. And as fast as possible, because my data sets are large, many, and changing fast. The final result surprised me a little."&gt;R: Eliminating observed values with zero variance&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform . In other words, I want to find the columns in a data frame that has zero variance. And as fast as possible…&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=0S81Gxhmv0s:xZQ8PYNwtsQ:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=0S81Gxhmv0s:xZQ8PYNwtsQ:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=0S81Gxhmv0s:xZQ8PYNwtsQ:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=0S81Gxhmv0s:xZQ8PYNwtsQ:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=0S81Gxhmv0s:xZQ8PYNwtsQ:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=0S81Gxhmv0s:xZQ8PYNwtsQ:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=0S81Gxhmv0s:xZQ8PYNwtsQ:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=0S81Gxhmv0s:xZQ8PYNwtsQ:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=0S81Gxhmv0s:xZQ8PYNwtsQ:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/0S81Gxhmv0s" height="1" width="1"/&gt;</content><published>2010-11-15T10:04:00Z</published><updated>2010-11-16T19:10:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Feature-selection-All_relevant-selection-with-the-Boruta-package.html</feedburner:origLink></entry><entry><title type="text">Big data for R</title><id>urn:uuid:04001d8b-1947-56b3-86a5-265707a84aa9</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Big-data-for-R.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/pegHIMxElX0/Big-data-for-R.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
Revolution Analytics recently <a href="http://blog.revolutionanalytics.com/2010/08/announcing-big-data-for-revolution-r.html">announced</a> their "big data" solution for R.  This is great news and a lovely piece of work by the team at Revolutions.
</p>
<p>
However, if you want to replicate their analysis in standard <a href="http://www.r-project.org/">R</a>, then you can absolutely do so and we show you how.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
Revolution Analytics recently &lt;a href="http://blog.revolutionanalytics.com/2010/08/announcing-big-data-for-revolution-r.html"&gt;announced&lt;/a&gt; their "big data" solution for R.  This is great news and a lovely piece of work by the team at Revolutions.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
However, if you want to replicate their analysis in standard &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;, then you can absolutely do so and we show you how.&#xD;
&lt;/p&gt;&#xD;
&lt;h2&gt;Data preparation&lt;/h2&gt;&#xD;
&lt;p&gt;&#xD;
First you need to prepare the rather large data set that they use in the Revolutions white paper.  The preparation script shown below makes two passes over all the files, which is not strictly needed: changing it to a single pass is left as an exercise for the reader.  Note that the script will take a while to run and will need some 30-odd GB of free disk space (another exercise: get rid of the airlines.csv file), but once it is done the analysis is fast.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="big.R"&gt;&#xD;
#!/usr/bin/Rscript&#xD;
## big.R - Preprocess the airline data&#xD;
## Copyright © 2010 Allan Engelhardt (&lt;a href="http://www.cybaea.net/"&gt;http://www.cybaea.net/&lt;/a&gt;)&#xD;
&#xD;
## Install the packages we will use&#xD;
install.packages("bigmemory",&#xD;
                 dependencies = c("Depends", "Suggests", "Enhances"))&#xD;
&#xD;
## Data sets are downloaded from the Data Expo '09 web site at&#xD;
## http://stat-computing.org/dataexpo/2009/the-data.html&#xD;
for (year in 1987:2008) {&#xD;
    file.name &amp;lt;- paste(year, "csv.bz2", sep = ".")&#xD;
    if ( !file.exists(file.name) ) {&#xD;
        url.text &amp;lt;- paste("http://stat-computing.org/dataexpo/2009/",&#xD;
                          year, ".csv.bz2", sep = "")&#xD;
        cat("Downloading missing data file ", file.name, "\n", sep = "")&#xD;
        download.file(url.text, file.name)&#xD;
    }&#xD;
}&#xD;
&#xD;
## Read sample file to get column names and types&#xD;
d &amp;lt;- read.csv("2008.csv.bz2", stringsAsFactors = TRUE)  # default is FALSE since R 4.0.0&#xD;
integer.columns &amp;lt;- sapply(d, is.integer)&#xD;
factor.columns  &amp;lt;- sapply(d, is.factor)&#xD;
factor.levels   &amp;lt;- lapply(d[, factor.columns], levels)&#xD;
n.rows &amp;lt;- 0L&#xD;
&#xD;
## Process each file determining the factor levels&#xD;
## TODO: Combine with next loop&#xD;
for (year in 1987:2008) {&#xD;
    file.name &amp;lt;- paste(year, "csv.bz2", sep = ".")&#xD;
    cat("Processing ", file.name, "\n", sep = "")&#xD;
    d &amp;lt;- read.csv(file.name, stringsAsFactors = TRUE)&#xD;
    n.rows &amp;lt;- n.rows + NROW(d)&#xD;
    new.levels &amp;lt;- lapply(d[, factor.columns], levels)&#xD;
    for ( i in seq_along(factor.levels) ) {&#xD;
        ## union() keeps each level once across files&#xD;
        factor.levels[[i]] &amp;lt;- union(factor.levels[[i]], new.levels[[i]])&#xD;
    }&#xD;
    rm(d)&#xD;
}&#xD;
save(integer.columns, factor.columns, factor.levels, file = "factors.RData")&#xD;
&#xD;
## Now convert all factors to integers so we can create a bigmatrix of the data&#xD;
col.classes &amp;lt;- rep("integer", length(integer.columns))&#xD;
col.classes[factor.columns] &amp;lt;- "character"&#xD;
cols  &amp;lt;- which(factor.columns)&#xD;
first &amp;lt;- TRUE&#xD;
csv.file &amp;lt;- "airlines.csv"   # Write combined integer-only data to this file&#xD;
csv.con  &amp;lt;- file(csv.file, open = "w")&#xD;
&#xD;
for (year in 1987:2008) {&#xD;
    file.name &amp;lt;- paste(year, "csv.bz2", sep = ".")&#xD;
    cat("Processing ", file.name, "\n", sep = "")&#xD;
    d &amp;lt;- read.csv(file.name, colClasses = col.classes)&#xD;
    ## Convert the strings to integers&#xD;
    for ( i in seq(1, length(factor.levels)) ) {&#xD;
        col &amp;lt;- cols[i]&#xD;
        d[, col] &amp;lt;- match(d[, col], factor.levels[[i]])&#xD;
    }&#xD;
    write.table(d, file = csv.con, sep = ",", &#xD;
                row.names = FALSE, col.names = first)&#xD;
    first &amp;lt;- FALSE&#xD;
}&#xD;
close(csv.con)&#xD;
&#xD;
## Now convert to a big.matrix&#xD;
library("bigmemory")&#xD;
backing.file    &amp;lt;- "airlines.bin"&#xD;
descriptor.file &amp;lt;- "airlines.des"&#xD;
data &amp;lt;- read.big.matrix(csv.file, header = TRUE,&#xD;
                        type = "integer",&#xD;
                        backingfile = backing.file,&#xD;
                        descriptorfile = descriptor.file,&#xD;
                        extraCols = c("age"))&#xD;
&lt;/pre&gt;&#xD;
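&lt;p&gt;&#xD;
The string-to-integer conversion in the script rests on &lt;code&gt;match()&lt;/code&gt; against a list of levels accumulated over all the files.  A minimal sketch of that idea on made-up data (the names &lt;code&gt;chunks&lt;/code&gt;, &lt;code&gt;all.levels&lt;/code&gt; and &lt;code&gt;codes&lt;/code&gt; are hypothetical and not part of big.R):&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;&#xD;
## Toy illustration only: two "files" holding carrier codes as strings&#xD;
chunks &amp;lt;- list(c("AA", "UA"), c("UA", "DL", "AA"))&#xD;
&#xD;
## Pass 1: collect every distinct level, in order of first appearance&#xD;
all.levels &amp;lt;- character(0)&#xD;
for (chunk in chunks) all.levels &amp;lt;- union(all.levels, chunk)&#xD;
&#xD;
## Pass 2: replace each string by its position in the level list&#xD;
codes &amp;lt;- lapply(chunks, match, table = all.levels)&#xD;
print(all.levels)   # "AA" "UA" "DL"&#xD;
print(codes[[2]])   # 2 3 1&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Because the codes index into one shared level list, the same string maps to the same integer in every file, which is what makes the combined integer-only CSV meaningful.&#xD;
&lt;/p&gt;&#xD;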
&lt;h2&gt;Sample analysis&lt;/h2&gt;&#xD;
&lt;p&gt;&#xD;
All done now.  Sample analysis:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&#xD;
## bigScale.R - Replicate the analysis from &lt;a href="http://bit.ly/aTFXeN"&gt;http://bit.ly/aTFXeN&lt;/a&gt; with normal R&#xD;
##   http://info.revolutionanalytics.com/bigdata.html&#xD;
## See big.R for the preprocessing of the data&#xD;
&#xD;
## Load required libraries&#xD;
library("biglm")&#xD;
library("bigmemory")&#xD;
library("biganalytics")&#xD;
library("bigtabulate")&#xD;
&#xD;
## Use parallel processing if available&#xD;
## (Multicore is for "anything-but-Windows" platforms)&#xD;
if ( require("multicore") ) {&#xD;
    library("doMC")&#xD;
    registerDoMC()&#xD;
} else {&#xD;
    warning("Consider registering a multi-core 'foreach' processor.")&#xD;
}&#xD;
&#xD;
day.names &amp;lt;- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday",&#xD;
               "Saturday", "Sunday")&#xD;
&#xD;
## Attach to the data&#xD;
descriptor.file &amp;lt;- "airlines.des"&#xD;
data &amp;lt;- attach.big.matrix(dget(descriptor.file))&#xD;
&#xD;
## Replicate Table 5 in the Revolutions document:&#xD;
## Table 5&#xD;
t.5 &amp;lt;- bigtabulate(data,&#xD;
                   ccols = "DayOfWeek",&#xD;
                   summary.cols = "ArrDelay", summary.na.rm = TRUE)&#xD;
## Prettify the output&#xD;
stat.names &amp;lt;- dimnames(t.5$summary[[1]])[2][[1]]&#xD;
t.5.p &amp;lt;- cbind(matrix(unlist(t.5$summary), byrow = TRUE,&#xD;
                      nrow = length(t.5$summary),&#xD;
                      ncol = length(stat.names),&#xD;
                      dimnames = list(day.names, stat.names)),&#xD;
               ValidObs = t.5$table)&#xD;
print(t.5.p)&#xD;
#             min  max     mean       sd    NAs ValidObs&#xD;
# Monday    -1410 1879 6.669515 30.17812 385262 18136111&#xD;
# Tuesday   -1426 2137 5.960421 29.06076 417965 18061938&#xD;
# Wednesday -1405 2598 7.091502 30.37856 405286 18103222&#xD;
# Thursday  -1395 2453 8.945047 32.30101 400077 18083800&#xD;
# Friday    -1437 1808 9.606953 33.07271 384009 18091338&#xD;
# Saturday  -1280 1942 4.187419 28.29972 298328 15915382&#xD;
# Sunday    -1295 2461 6.525040 31.11353 296602 17143178&#xD;
&#xD;
## Figure 1&#xD;
plot(t.5.p[, "mean"], type = "l", ylab="Average arrival delay")&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Just like the Revolutions paper.  You can now use &lt;code&gt;biglm.big.matrix&lt;/code&gt; and &lt;code&gt;bigglm.big.matrix&lt;/code&gt; for basic regression and there are also k-means clustering and other functions.&#xD;
&lt;/p&gt;&#xD;
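&lt;p&gt;&#xD;
As a rough sketch of such a regression, assuming the &lt;code&gt;biglm.big.matrix(formula, data)&lt;/code&gt; interface from the biganalytics package and using a small made-up matrix rather than the airline data (the values are random, so the coefficients mean nothing, and &lt;code&gt;toy&lt;/code&gt; is a hypothetical name):&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;&#xD;
library("bigmemory")&#xD;
library("biganalytics")   # biglm.big.matrix(); needs the biglm package&#xD;
&#xD;
## Made-up stand-in for the airline data&#xD;
set.seed(42)&#xD;
m &amp;lt;- cbind(ArrDelay = rpois(1000, 10),&#xD;
           Distance = sample(100:2000, 1000, replace = TRUE))&#xD;
toy &amp;lt;- as.big.matrix(m)&#xD;
&#xD;
## Ordinary R formula syntax, fitted chunk by chunk&#xD;
fit &amp;lt;- biglm.big.matrix(ArrDelay ~ Distance, data = toy)&#xD;
print(summary(fit))&#xD;
&lt;/pre&gt;&#xD;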
&lt;p&gt;&#xD;
I must admit that I do not understand the Revolutions regression example, so I have not attempted to replicate it here.  It would be rather sad if they have changed the syntax to be incompatible with standard R formulas, which is what appears to be happening.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Credit to Michael Kane and Jay Emerson of Yale who showed much of this in their poster &lt;a href="http://stat-computing.org/dataexpo/2009/posters/kane-emerson.pdf"&gt;The Airline Data Set... What's the big deal?&lt;/a&gt;.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;Jump to &lt;a href="http://www.cybaea.net/Blogs/Data/Big-data-for-R.html#h2_entry_comments"&gt;comments&lt;/a&gt;.&lt;/p&gt;&#xD;
&lt;div class="seealso"&gt;&#xD;
&lt;h1&gt;You may also like these posts:&lt;/h1&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.45]" title="[0.45]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Employee-productivity-as-function-of-number-of-workers-revisited.html" title="We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis for the FTSE-100 constituent companies and find that the relation still holds four years later and across a continent."&gt;Employee productivity as function of number of workers revisited&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis for the FTSE-100 constituent companies and find that the relation still holds four years later and across a continent.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.45]" title="[0.45]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Getting-started-with-HHP.html" title="The US$ 3 million Heritage Health Price competition is on so we take a look at how to get started using the R statistical computing and analysis platform ."&gt;Getting started with the Heritage Health Price competition&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;The US$ 3 million Heritage Health Price competition is on so we take a look at how to get started using the R statistical computing and analysis platform .&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-30.png" width="85" height="16" alt="[0.39]" title="[0.39]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Benchmarking-feature-selection-with-Boruta-and-caret.html" title="Feature selection is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering. For commercial data mining, which is often characterised by having too many variables for model building, this is an important step in the analysis process. And since we often work on very large data sets the performance of our process is very important to us. Having looked at feature selection using the Boruta package and feature selection using the caret package separately, we now consider the performance of the two approaches. Neither approach is suitable out of the box for the sizes of data sets that we normally work with."&gt;Benchmarking feature selection with Boruta and caret&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Feature selection is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering. For commercial data mining, which is often characterised by having too many variables for model building, …&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-30.png" width="85" height="16" alt="[0.39]" title="[0.39]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/A-warning-on-the-R-save-format.html" title="The save() function in the R platform for statistical computing is very convenient and I suspect many of us use it a lot. But I was recently bitten by a “feature” of the format which meant I could not recover my data. I recommend that you save data in a data format (e.g. CSV or CDF), not using the save() function which is really for objects (data and code). What is your approach?"&gt;A warning on the R save format&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;The save() function in the R platform for statistical computing is very convenient and I suspect many of us use it a lot. But I was recently bitten by a “feature” of the format which meant I could not recover my data. I recommend that you save data in a d…&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-00.png" width="85" height="16" alt="[0.34]" title="[0.34]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-Eliminating-observed-values-with-zero-variance.html" title="I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform . In other words, I want to find the columns in a data frame that has zero variance. And as fast as possible, because my data sets are large, many, and changing fast. The final result surprised me a little."&gt;R: Eliminating observed values with zero variance&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=pegHIMxElX0:jg-xZLG8yVc:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=pegHIMxElX0:jg-xZLG8yVc:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=pegHIMxElX0:jg-xZLG8yVc:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=pegHIMxElX0:jg-xZLG8yVc:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=pegHIMxElX0:jg-xZLG8yVc:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=pegHIMxElX0:jg-xZLG8yVc:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=pegHIMxElX0:jg-xZLG8yVc:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=pegHIMxElX0:jg-xZLG8yVc:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=pegHIMxElX0:jg-xZLG8yVc:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/pegHIMxElX0" height="1" width="1"/&gt;</content><published>2010-08-05T08:22:00Z</published><updated>2010-08-05T08:22:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Big-data-for-R.html</feedburner:origLink></entry><entry><title type="text">Area Plots with Intensity Coloring</title><id>urn:uuid:6b83e364-13a9-58b5-9f83-ec94683bf592</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Area-Plots-with-Intensity-Coloring.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/6pq1Dbge-y0/Area-Plots-with-Intensity-Coloring.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><div class="floatRight">
  <a href="http://www.cybaea.net/Blogs/Data/Area-Plots-with-Intensity-Coloring.html" title="Click to read full article">
    <img src="http://static.cybaea.net/images/nino-150.png" width="150" height="150" alt="[Graphics output]" />
  </a>
</div>
<p>I am not sure apeescape’s <a href="http://probabilitynotes.wordpress.com/2010/07/10/area-plots-with-intensity-coloring-el-nino-sst-anomalies-w-ggplot2/">ggplot2 area plot with intensity colouring</a> is really the best way of presenting the information, but it had me intrigued enough to replicate it using base <a href="http://www.r-project.org/">R</a> graphics.</p>

<p>The key technique is to draw a gradient line which R does not support natively so we have to roll our own code for that.  Unfortunately, <code>lines(..., type="l")</code> does not recycle the colour <code>col=</code> argument, so we end up with rather more loops than I thought would be necessary.</p>

<p>We also get a nice opportunity to use the under-appreciated <code>read.fwf</code> function.</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;I am not sure apeescape’s &lt;a href="http://probabilitynotes.wordpress.com/2010/07/10/area-plots-with-intensity-coloring-el-nino-sst-anomalies-w-ggplot2/"&gt;ggplot2 area plot with intensity colouring&lt;/a&gt; is really the best way of presenting the information, but it had me intrigued enough to replicate it using base &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt; graphics.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;The key technique is to draw a gradient line which R does not support natively so we have to roll our own code for that.  Unfortunately, &lt;code&gt;lines(..., type="l")&lt;/code&gt; does not recycle the colour &lt;code&gt;col=&lt;/code&gt; argument, so we end up with rather more loops than I thought would be necessary.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;(The answer is not to use &lt;code&gt;lines(..., type="h")&lt;/code&gt; which, confusingly, &lt;em&gt;does&lt;/em&gt; recycle the colour &lt;code&gt;col=&lt;/code&gt; argument.  This one had me for a while, but the &lt;code&gt;type=h&lt;/code&gt; lines always start from zero so you do not get the gradient feature.)&lt;/p&gt;&#xD;
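&lt;p&gt;As an aside, &lt;code&gt;segments()&lt;/code&gt; does accept one colour per segment, so a single gradient line can also be drawn without an explicit loop.  A minimal sketch with made-up values, which is an alternative to, and not the approach used in, the script in this post:&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;&#xD;
## Sketch: one vertical gradient line at x = 1, running from 0 up to 2.5&#xD;
n.seg &amp;lt;- 10&#xD;
cols  &amp;lt;- hsv(0, seq(0, 1, length.out = n.seg), 1)   # white-to-red gradient&#xD;
ys    &amp;lt;- seq(0, 2.5, length.out = n.seg + 1)         # segment end points&#xD;
&#xD;
plot(c(0, 2), c(0, 3), type = "n", xlab = "x", ylab = "y")&#xD;
## Segment i runs from ys[i] to ys[i + 1] and is drawn in cols[i]&#xD;
segments(x0 = 1, y0 = ys[-length(ys)], x1 = 1, y1 = ys[-1],&#xD;
         col = cols, lwd = 3)&#xD;
&lt;/pre&gt;&#xD;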
&#xD;
&lt;p&gt;We also get a nice opportunity to use the under-appreciated &lt;code&gt;read.fwf&lt;/code&gt; function.&lt;/p&gt;&#xD;
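&lt;p&gt;For readers who have not met it, here is a small sketch of how &lt;code&gt;read.fwf&lt;/code&gt; treats its &lt;code&gt;widths&lt;/code&gt; argument.  The two input lines are made up to fit the first few widths used in the script and are not taken from the NOAA file:&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;&#xD;
## Layout: 1 blank, a 9-character date, 5 blanks, then two 4-character numbers&#xD;
txt &amp;lt;- c(" 03JAN1990     23.4-0.4",&#xD;
         " 10JAN1990     23.4-0.8")&#xD;
toy &amp;lt;- read.fwf(textConnection(txt),&#xD;
                widths = c(-1, 9, -5, 4, 4),   # negative widths are skipped&#xD;
                col.names = c("Week", "SST", "SSTA"))&#xD;
print(toy)&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;Note that the two numeric fields touch each other with no separator, which is exactly the case where a delimiter-based reader gives up.&lt;/p&gt;&#xD;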
&#xD;
&lt;pre class="document"&gt;#!/usr/bin/Rscript&#xD;
## nino.R - another version of &lt;a href="http://bit.ly/9P9Gh1"&gt;http://bit.ly/9P9Gh1&lt;/a&gt;&#xD;
## Copyright © 2010 Allan Engelhardt (&lt;a href="http://www.cybaea.net/"&gt;http://www.cybaea.net/&lt;/a&gt;)&#xD;
&#xD;
## Get the data from the NOAA server&#xD;
nino &amp;lt;- read.fwf("&lt;a href="http://www.cpc.noaa.gov/data/indices/wksst.for"&gt;http://www.cpc.noaa.gov/data/indices/wksst.for&lt;/a&gt;",&#xD;
                 widths=c(-1, 9, rep(c(-5, 4, 4), 4)),&#xD;
                 skip=4,&#xD;
                 col.names=c("Week",&#xD;
                     paste(rep(c("Nino12","Nino3","Nino34","Nino4"), rep(2, 4)),&#xD;
                           c("SST", "SSTA"), sep=".")))&#xD;
&#xD;
## Make the date column something useful&#xD;
nino$Week &amp;lt;- as.Date(nino$Week, format="%d%b%Y")&#xD;
&#xD;
## Make colour gradients&#xD;
ncol &amp;lt;- 50&#xD;
grad.neg &amp;lt;- hsv(4/6, seq(0, 1, length.out=ncol), 1) # Blue gradient&#xD;
grad.pos &amp;lt;- hsv(  0, seq(0, 1, length.out=ncol), 1) # Red gradient&#xD;
&#xD;
## Make plot&#xD;
plot(Nino34.SSTA ~ Week, data=nino, type="n",&#xD;
     main="Nino34", xlab="Date", ylab="SSTA", axes=FALSE)&#xD;
do.call(function (...) rect(..., col="gray85", border=NA),&#xD;
        as.list(par("usr")[c(1, 3, 2, 4)]))&#xD;
&#xD;
y &amp;lt;- nino$Nino34.SSTA                   # The values we will plot&#xD;
x &amp;lt;- nino$Week&#xD;
&#xD;
axis.Date(1, x=x, tck=1, col="white")&#xD;
axis(2, tck=1, col="white")&#xD;
box()&#xD;
&#xD;
idx &amp;lt;- integer(NROW(nino))&#xD;
idx[y &amp;gt;= 0] &amp;lt;- 1 + round( y[y &amp;gt;= 0] * (ncol - 1) / max( y[y &amp;gt;= 0]), 0)&#xD;
idx[y &amp;lt;  0] &amp;lt;- 1 + round(-y[y &amp;lt;  0] * (ncol - 1) / max(-y[y &amp;lt;  0]), 0)&#xD;
&#xD;
draw.gradient &amp;lt;- function(x, ys, cols) {&#xD;
    xs &amp;lt;- rep(x, 2)&#xD;
    for (i in seq(1, length(ys)-1))&#xD;
        plot.xy(list(x=xs, y=c(ys[i], ys[i+1])), type="l", col=cols[i])&#xD;
}&#xD;
&#xD;
for (i in 1:length(x)) {&#xD;
    ys &amp;lt;- seq(0, y[i], length.out=idx[i]+1)&#xD;
    cols &amp;lt;- (if (y[i] &amp;gt;=0) grad.pos else grad.neg)&#xD;
    draw.gradient(x[i], ys, cols)&#xD;
}&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;The result is a decent gradient:&lt;/p&gt;&#xD;
&lt;div class="floatCenter" style="width: 400px"&gt;&#xD;
&lt;a href="http://static.cybaea.net/images/nino-800.png" title="Click for larger version"&gt;&lt;img src="http://static.cybaea.net/images/nino-400.png" width="400" height="400" alt="[Graphics output]"&gt;&lt;/img&gt;&lt;/a&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;p&gt;&#xD;
I deliberately omitted the scale legend on the right hand side following Allan’s First Law of Happy Graphics: Thou shalt not present the same information twice.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
For less dense information, you should increase the line width.  That is left to the reader. (Hint: it is hard to get just right in base graphics, but &lt;code&gt;lwd &amp;lt;- ceiling(par("pin")[1] / dev.size("in")[1] * dev.size("px")[1] / length(x))&lt;/code&gt; could be a starting point for an approximation. We really need gradient-filled polygons in base R.)&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;Jump to &lt;a href="http://www.cybaea.net/Blogs/Data/Area-Plots-with-Intensity-Coloring.html#h2_entry_comments"&gt;comments&lt;/a&gt;.&lt;/p&gt;&#xD;
&lt;div class="seealso"&gt;&#xD;
&lt;h1&gt;You may also like these posts:&lt;/h1&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-50.png" width="85" height="16" alt="[0.56]" title="[0.56]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-code-for-Chapter-1-of-Non_Life-Insurance-Pricing-with-GLM.html" title="Insurance pricing is backwards and primitive, harking back to an era before computers. One standard (and good) textbook on the topic is Non-Life Insurance Pricing with Generalized Linear Models by Esbjorn Ohlsson and Born Johansson. We have been doing some work in this area recently. Needing a robust internal training course and documented methodology, we have been working our way through the book again and converting the examples and exercises to R , the statistical computing and analysis platform. This is part of a series of posts containing elements of the R code."&gt;R code for Chapter 1 of Non-Life Insurance Pricing with GLM&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Insurance pricing is backwards and primitive, harking back to an era before computers. One standard (and good) textbook on the topic is Non-Life Insurance Pricing with Generalized Linear Models by Esbjorn Ohlsson and Born Johansson. We have been doing some work in this area recently. Needing a robust internal training course and documented methodology, we have been working our way through the book again and converting the examples and exercises to R , the statistical computing and analysis platform. This is part of a series of posts containing elements of the R code.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-50.png" width="85" height="16" alt="[0.54]" title="[0.54]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-code-for-Chapter-2-of-Non_Life-Insurance-Pricing-with-GLM.html" title="We continue working our way through the examples, case studies, and exercises of what is affectionately known here as “the two bears book” (Swedish björn = bear) and more formally as Non-Life Insurance Pricing with Generalized Linear Models by Esbjörn Ohlsson and Börn Johansson (Amazon UK | US ). At this stage, our purpose is to reproduce the analysis from the book using the R statistical computing and analysis platform, and to answer the data analysis elements of the exercises and case studies. Any critique of the approach and of pricing and modeling in the Insurance industry in general will wait for a later article."&gt;R code for Chapter 2 of Non-Life Insurance Pricing with GLM&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;We continue working our way through the examples, case studies, and exercises of what is affectionately known here as “the two bears book” (Swedish björn = bear) and more formally as Non-Life Insurance Pricing with Generalized Linear Models by Esbjörn Ohlsson and Börn Johansson (Amazon UK | US ). At this stage, our purpose is to reproduce the analysis from the book using the R statistical computing and analysis platform, and to answer the data analysis elements of the exercises and case studies. Any critique of the approach and of pricing and modeling in the Insurance industry in general will wait for a later article.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-50.png" width="85" height="16" alt="[0.51]" title="[0.51]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Employee-productivity-as-function-of-number-of-workers-revisited.html" title="We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis for the FTSE-100 constituent companies and find that the relation still holds four years later and across a continent."&gt;Employee productivity as function of number of workers revisited&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis for the FTSE-100 constituent companies and find that the relation still holds four years later and across a continent.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-50.png" width="85" height="16" alt="[0.48]" title="[0.48]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-Eliminating-observed-values-with-zero-variance.html" title="I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform. In other words, I want to find the columns in a data frame that have zero variance. And as fast as possible, because my data sets are large, many, and changing fast. The final result surprised me a little."&gt;R: Eliminating observed values with zero variance&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform. In other words, I want to find the columns in a data frame that have zero variance. And as fast as possible, because my data sets are large, many, and changing fast. The final result surprised me a little.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.45]" title="[0.45]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Feature-selection-Using-the-caret-package.html" title="Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building. In a previous post we looked at all-relevant feature selection using the Boruta package while in this post we consider the same (artificial, toy) examples using the caret package. Max Kuhn kindly listed me as a contributor for some performance enhancements I submitted, but the genius behind the package is all his."&gt;Feature selection: Using the caret package&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Feature selection is an important step for practical commercial data mining which is often characterised by data sets with far too many variables for model building. In a previous post we looked at all-relevant feature selection using the Boruta package while in this post we consider the same (artificial, toy) examples using the caret package. Max Kuhn kindly listed me as a contributor for some performance enhancements I submitted, but the genius behind the package is all his.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/6pq1Dbge-y0" height="1" width="1"/&gt;</content><published>2010-07-13T07:47:00Z</published><updated>2010-07-13T07:47:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Area-Plots-with-Intensity-Coloring.html</feedburner:origLink></entry><entry><title type="text">Employee productivity as function of number of workers revisited</title><id>urn:uuid:cee42e41-ea6c-5ee6-a0b5-4e4644168052</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Employee-productivity-as-function-of-number-of-workers-revisited.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/mWb7tx4LS94/Employee-productivity-as-function-of-number-of-workers-revisited.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
<div class="floatRight"><a href="http://www.cybaea.net/Blogs/Data/Employee-productivity-as-function-of-number-of-workers-revisited.html" title="Click to read the full article"><img width="150" height="150" src="http://static.cybaea.net/images/ftse100-150.png" alt="[Results of analysis shown in graph]" /></a></div>We have a mild obsession with employee productivity and how that declines as companies get bigger.  We have previously found that <a href="http://www.cybaea.net/Blogs/Journal/employee_productivity.html">when you treble the number of workers, you halve their individual productivity</a> which is mildly scary.
</p>
<p>
We revisit the analysis for the FTSE-100 constituent companies and find that the relation still holds four years later and across a continent.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;We have a mild obsession with employee productivity and how that declines as companies get bigger.  We have previously found that &lt;a href="http://www.cybaea.net/Blogs/Journal/employee_productivity.html"&gt;when you treble the number of workers, you halve their individual productivity&lt;/a&gt; which is mildly scary.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;Let’s try the FTSE-100 index of leading UK companies to see if they are significantly different from the S&amp;amp;P 500 leading American companies that &lt;a href="http://www.cybaea.net/Blogs/Journal/employee_productivity.html"&gt;we analyzed four years ago&lt;/a&gt;.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;We will of course use the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt; for our analysis, and once again we are grateful to &lt;a href="http://uk.finance.yahoo.com/"&gt;Yahoo Finance&lt;/a&gt; for providing the data.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;The analysis script is available as &lt;a href="http://static.cybaea.net/files/ftse100.R"&gt;ftse100.R&lt;/a&gt; and is really simple:&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;## ftse100.R - Display employee productivity for FTSE-100 constituents&#xD;
## Copyright © 2010 Allan Engelhardt &amp;lt;http://www.cybaea.net/&amp;gt;&#xD;
## All Rights Reserved.&#xD;
&#xD;
## Get the index constituents.&#xD;
ftse.100 &amp;lt;- read.csv(file = "http://uk.old.finance.yahoo.com/d/quotes.csv?s=@%5EFTSE&amp;amp;f=s&amp;amp;e=.csv", header = FALSE)&#xD;
names(ftse.100) &amp;lt;- c("symbol")&#xD;
data &amp;lt;- data.frame(symbol=NULL, employees=NULL, profit=NULL, sector=NULL)&#xD;
&#xD;
## For each stock symbol, get employees, profit, and sector&#xD;
for (symbol in ftse.100$symbol) {&#xD;
    profile.url &amp;lt;- paste("http://uk.finance.yahoo.com/q/pr?s=", symbol, sep="")&#xD;
    con &amp;lt;- url(profile.url, open = "r")&#xD;
    text &amp;lt;- readChar(con, 2^24)     # enough bytes&#xD;
    close(con)&#xD;
    x &amp;lt;- sub('.*Number of employees:&amp;lt;/td&amp;gt;&amp;lt;td.*?&amp;gt;[[:space:]]*([[:digit:],]+).*', "\\1", text, ignore.case = TRUE)&#xD;
    x &amp;lt;- gsub(',', '', x)&#xD;
    empl &amp;lt;- tryCatch(as.integer(x), warning = function(x) NA)&#xD;
    x &amp;lt;- sub('.*Net Profit.*?&amp;lt;/td&amp;gt;&amp;lt;td.*?&amp;gt;[[:space:]]*([+-]?[[:digit:],]+).*', '\\1', text)&#xD;
    x &amp;lt;- gsub(',', '', x)&#xD;
    profit &amp;lt;- tryCatch(as.integer(x)*1e6, warning = function(x) NA)&#xD;
    sector &amp;lt;- sub('.*Sector:&amp;lt;/td&amp;gt;&amp;lt;td.*?&amp;gt;(.*?)&amp;lt;/td&amp;gt;.*', '\\1', text)&#xD;
    if (any(c(empl, profit) &amp;lt;= 0, is.na(c(empl, profit)))) {&#xD;
        cat("Error parsing symbol", symbol, "see", profile.url, "\n")&#xD;
    } else {&#xD;
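        ## stringsAsFactors = TRUE keeps sector a factor (needed for levels() below) in R &amp;gt;= 4.0.&#xD;
        data &amp;lt;- rbind(data, data.frame(symbol=symbol, employees=empl, profit=profit, sector=sector, stringsAsFactors=TRUE))&#xD;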
    }&#xD;
    Sys.sleep(1)&#xD;
}&#xD;
&#xD;
## Save the data so we don't have to hit Yahoo all the time.&#xD;
save(data, file = "data.RData")&#xD;
&#xD;
## Save plot to file:&#xD;
#png(filename="ftse100.png", width=800, height=800, pointsize=14, bg="white", res=100)&#xD;
&#xD;
opar &amp;lt;- par(cex.sub = sqrt(sqrt(2)), font.sub = 3, font.lab = 2)&#xD;
&#xD;
## x and y coordinates of plot and plot limits&#xD;
x &amp;lt;- with(data, employees)&#xD;
y &amp;lt;- with(data, profit/employees)&#xD;
xlim &amp;lt;- c(10^floor(log10(min(x))), 10^ceiling(log10(max(x))))&#xD;
ylim &amp;lt;- c(10^floor(log10(min(y))), 10^ceiling(log10(max(y))))&#xD;
&#xD;
## Set up to display different color and symbols&#xD;
plot_col &amp;lt;- 1&#xD;
plot_pch &amp;lt;- 1&#xD;
markers &amp;lt;- 21:25&#xD;
pchs &amp;lt;- rep(markers, ceiling(length(levels(data$sector))/length(markers)))&#xD;
palette(rainbow(length(levels(data$sector)), start=3/6, end=6/6))&#xD;
&#xD;
# Make empty plot:&#xD;
plot.new()&#xD;
plot(profit/employees ~ employees, data = data[FALSE, ], &#xD;
     type = "p", pch = pchs[plot_pch], col = plot_col,&#xD;
     log="xy", xaxp = c(xlim, 1), yaxp = c(ylim, 1), xlim = xlim, ylim = ylim,&#xD;
     main = "Profit per employee (FTSE 100)", xlab = "Employees", ylab = "Profit per employee (GBP)")&#xD;
&#xD;
## Plot each sector&#xD;
for (sector in levels(data$sector)) {&#xD;
    plot.xy(xy.coords(with(data[data$sector == sector,], employees),&#xD;
                      with(data[data$sector == sector,], profit/employees),&#xD;
                      log = "xy", xlab = "", ylab = ""),&#xD;
            type = "p", pch = pchs[plot_pch], col = plot_col, bg = plot_col)&#xD;
    plot_pch &amp;lt;- plot_pch + 1&#xD;
    plot_col &amp;lt;- plot_col + 1&#xD;
}&#xD;
legend(x = "bottomleft", legend = levels(data$sector), title = "Industry Sectors", &#xD;
       col = palette(), pt.bg = palette(), pch = pchs, cex = 2/3, pt.cex = 1, ncol = 2)&#xD;
&#xD;
## Fit a linear model to the log-log data:&#xD;
m &amp;lt;- lm(log10(y) ~ log10(x))&#xD;
xl &amp;lt;- c(xlim[1]*5, xlim[2]/5)&#xD;
yl &amp;lt;- 10^predict(m, data.frame(x = xl))&#xD;
lines(xl, yl, col = "darkred", lty = "dashed", lwd = 2)&#xD;
t &amp;lt;- sprintf("Power = %0.3g", m$coefficients[2])&#xD;
text(xl[2], yl[2], t, adj = c(0.25, -1.5), col = "darkred", font = 2)&#xD;
&#xD;
## All done.&#xD;
par(opar)&#xD;
dev.off()&#xD;
&lt;/pre&gt;&#xD;
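&#xD;
&lt;p&gt;The scraping step in the script boils down to one regular-expression substitution per field, with a failed match ending up as &lt;code&gt;NA&lt;/code&gt;. A minimal sketch on a made-up, single-line page fragment: the &lt;code&gt;sample.html&lt;/code&gt; string and the &lt;code&gt;parse.employees&lt;/code&gt; helper are hypothetical, and &lt;code&gt;perl = TRUE&lt;/code&gt; is an addition here (the script uses R's default regex engine) so that the lazy &lt;code&gt;.*?&lt;/code&gt; follows Perl semantics.&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;## Hypothetical page fragment; the real profile page layout may differ.&#xD;
sample.html &amp;lt;- '&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;Number of employees:&amp;lt;/td&amp;gt;&amp;lt;td class="x"&amp;gt; 12,345&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;'&#xD;
&#xD;
## Hypothetical helper: extract the employee count, or NA if the field is missing.&#xD;
parse.employees &amp;lt;- function(text) {&#xD;
    x &amp;lt;- sub('.*Number of employees:&amp;lt;/td&amp;gt;&amp;lt;td.*?&amp;gt;[[:space:]]*([[:digit:],]+).*',&#xD;
             "\\1", text, ignore.case = TRUE, perl = TRUE)&#xD;
    x &amp;lt;- gsub(',', '', x)               # drop thousands separators&#xD;
    ## If the pattern did not match, x is still the whole input and&#xD;
    ## as.integer() warns; the handler turns that warning into NA.&#xD;
    tryCatch(as.integer(x), warning = function(w) NA_integer_)&#xD;
}&#xD;
&#xD;
parse.employees(sample.html)            # 12345&#xD;
parse.employees("no such field")        # NA&#xD;
&lt;/pre&gt;&#xD;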
&#xD;
&lt;p&gt;Leave it to run and this is what you get:&lt;/p&gt;&#xD;
&lt;div class="floatCenter" style="width: 400px"&gt;&#xD;
  &lt;a href="http://static.cybaea.net/images/ftse100.png"&gt;&#xD;
    &lt;img src="http://static.cybaea.net/images/ftse100-400.png" width="400" height="400" alt="[Analysis output]"&gt;&lt;/img&gt;&#xD;
  &lt;/a&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;p&gt;The power law still broadly holds.  In a large company, the productivity of the individual employee is only ¼ of the productivity in a company with one-tenth of the number of workers.&lt;/p&gt;&#xD;
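&#xD;
&lt;p&gt;To see how a power translates into such a statement: if profit per employee scales as employees&lt;sup&gt;b&lt;/sup&gt;, then multiplying the head count by a factor k multiplies productivity by k&lt;sup&gt;b&lt;/sup&gt;. A small sketch of the arithmetic; the exponent below is not read off the plot but derived from the earlier “treble the workers, halve the productivity” finding, so treat it as an illustrative assumption.&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;## Exponent implied by "treble the workers, halve the productivity" (assumed, not fitted).&#xD;
b &amp;lt;- log(1/2) / log(3)&#xD;
b           # about -0.63&#xD;
&#xD;
## Productivity ratio for a company with ten times the workers:&#xD;
10^b        # about 0.23, i.e. roughly a quarter&#xD;
&lt;/pre&gt;&#xD;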
&#xD;
&lt;p&gt;The analysis for the FTSE All-Share index is easy (&lt;a href="http://static.cybaea.net/files/ftse-all.R" title="Click for full size"&gt;ftse-all.R&lt;/a&gt;) and gives a slope of -0.7605541 for the 301 companies with the required information, which is a much steeper decline.  More convincingly, fitting only the companies with more than 1,000 employees (to avoid some bias from smaller companies needing large profits per employee in order to be big enough to afford a stock market listing) gives a slope of -0.2838.&lt;/p&gt;&#xD;
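&#xD;
&lt;p&gt;Restricting the fit to larger companies is a one-line change to the log-log regression. The sketch below uses simulated stand-in data, not the FTSE figures, and the &lt;code&gt;fit.power&lt;/code&gt; helper is hypothetical; it only illustrates the idea and is not taken from ftse-all.R.&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;## Simulated stand-in data (NOT the FTSE figures): profit per employee&#xD;
## falls as employees^(-0.3), with log-normal noise.&#xD;
set.seed(1)&#xD;
employees &amp;lt;- round(10^runif(200, min = 2, max = 5.5))&#xD;
sim &amp;lt;- data.frame(employees = employees,&#xD;
                  profit = employees * 1e5 * employees^(-0.3) * 10^rnorm(200, sd = 0.3))&#xD;
&#xD;
## Hypothetical helper: slope of the log-log fit for companies above a size cut-off.&#xD;
fit.power &amp;lt;- function(d, min.employees = 0) {&#xD;
    d &amp;lt;- subset(d, employees &amp;gt; min.employees)&#xD;
    m &amp;lt;- lm(log10(profit / employees) ~ log10(employees), data = d)&#xD;
    unname(coef(m)[2])&#xD;
}&#xD;
&#xD;
fit.power(sim)                          # all companies&#xD;
fit.power(sim, min.employees = 1000)    # larger companies only&#xD;
&lt;/pre&gt;&#xD;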
&#xD;
&lt;div class="floatCenter" style="width: 400px"&gt;&#xD;
  &lt;a href="http://static.cybaea.net/images/ftse-all.png" title="Click for full size"&gt;&#xD;
    &lt;img src="http://static.cybaea.net/images/ftse-all-400.png" width="400" height="400" alt="[Analysis output]"&gt;&lt;/img&gt;&#xD;
  &lt;/a&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;div class="floatCenter" style="width: 400px"&gt;&#xD;
  &lt;a href="http://static.cybaea.net/images/ftse-all-big.png" title="Click for full size"&gt;&#xD;
    &lt;img src="http://static.cybaea.net/images/ftse-all-big-400.png" width="400" height="400" alt="[Analysis output]"&gt;&lt;/img&gt;&#xD;
  &lt;/a&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&#xD;
&lt;p&gt;Jump to &lt;a href="http://www.cybaea.net/Blogs/Data/Employee-productivity-as-function-of-number-of-workers-revisited.html#h2_entry_comments"&gt;comments&lt;/a&gt;.&lt;/p&gt;&#xD;
&lt;div class="seealso"&gt;&#xD;
&lt;h1&gt;You may also like these posts:&lt;/h1&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-50.png" width="85" height="16" alt="[0.51]" title="[0.51]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Area-Plots-with-Intensity-Coloring.html" title="I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line which R does not support natively so we have to roll our own code for that. Unfortunately, lines(..., type=l) does not recycle the colour col= argument, so we end up with rather more loops than I thought would be necessary. We also get a nice opportunity to use the under-appreciated read.fwf function."&gt;Area Plots with Intensity Coloring&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line which R does not support natively so we have to roll our own code for that. Unfortunately, lines(..., type=l) does not recycle the colour col= argument, so we end up with rather more loops than I thought would be necessary. We also get a nice opportunity to use the under-appreciated read.fwf function.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.46]" title="[0.46]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Benchmarking-feature-selection-with-Boruta-and-caret.html" title="Feature selection is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering. For commercial data mining, which is often characterised by having too many variables for model building, this is an important step in the analysis process. And since we often work on very large data sets the performance of our process is very important to us. Having looked at feature selection using the Boruta package and feature selection using the caret package separately, we now consider the performance of the two approaches. Neither approach is suitable out of the box for the sizes of data sets that we normally work with."&gt;Benchmarking feature selection with Boruta and caret&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Feature selection is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering. For commercial data mining, which is often characterised by having too many variables for model building, this is an important step in the analysis process. And since we often work on very large data sets the performance of our process is very important to us. Having looked at feature selection using the Boruta package and feature selection using the caret package separately, we now consider the performance of the two approaches. Neither approach is suitable out of the box for the sizes of data sets that we normally work with.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.45]" title="[0.45]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Big-data-for-R.html" title="Revolution Analytics recently announced their big data solution for R. This is great news and a lovely piece of work by the team at Revolution. However, if you want to replicate their analysis in standard R, then you can absolutely do so and we show you how."&gt;Big data for R&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Revolution Analytics recently announced their big data solution for R. This is great news and a lovely piece of work by the team at Revolution. However, if you want to replicate their analysis in standard R, then you can absolutely do so and we show you how.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.45]" title="[0.45]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-code-for-Chapter-2-of-Non_Life-Insurance-Pricing-with-GLM.html" title="We continue working our way through the examples, case studies, and exercises of what is affectionately known here as “the two bears book” (Swedish björn = bear) and more formally as Non-Life Insurance Pricing with Generalized Linear Models by Esbjörn Ohlsson and Björn Johansson (Amazon UK | US). At this stage, our purpose is to reproduce the analysis from the book using the R statistical computing and analysis platform, and to answer the data analysis elements of the exercises and case studies. Any critique of the approach and of pricing and modeling in the insurance industry in general will wait for a later article."&gt;R code for Chapter 2 of Non-Life Insurance Pricing with GLM&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;We continue working our way through the examples, case studies, and exercises of what is affectionately known here as “the two bears book” (Swedish björn = bear) and more formally as Non-Life Insurance Pricing with Generalized Linear Models by Esbjörn Ohlsson and Björn Johansson (Amazon UK | US). At this stage, our purpose is to reproduce the analysis from the book using the R statistical computing and analysis platform, and to answer the data analysis elements of the exercises and case studies. Any critique of the approach and of pricing and modeling in the insurance industry in general will wait for a later article.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.44]" title="[0.44]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-Eliminating-observed-values-with-zero-variance.html" title="I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform. In other words, I want to find the columns in a data frame that have zero variance. And as fast as possible, because my data sets are large, many, and changing fast. The final result surprised me a little."&gt;R: Eliminating observed values with zero variance&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform. In other words, I want to find the columns in a data frame that have zero variance. And as fast as possible, because my data sets are large, many, and changing fast. The final result surprised me a little.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/mWb7tx4LS94" height="1" width="1"/&gt;</content><published>2010-06-22T11:20:00Z</published><updated>2010-06-22T11:20:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Employee-productivity-as-function-of-number-of-workers-revisited.html</feedburner:origLink></entry><entry><title type="text">Comparing standard R with Revolution for performance</title><id>urn:uuid:3293adea-fce4-57ac-844d-8c40497745e3</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Comparing-standard-R-with-Revoutions-for-performance.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/SNu7nI9K28g/Comparing-standard-R-with-Revoutions-for-performance.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>Following on from my previous post about <a href="http://www.cybaea.net/Blogs/Data/Faster-R-through-better-BLAS.html">improving performance of R by linking with optimized linear algebra libraries</a>, I thought it would be useful to try out the five benchmarks Revolution Analytics have on their <a href="http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php">Revolutionary Performance</a> pages.</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;Following on from my previous post about &lt;a href="http://www.cybaea.net/Blogs/Data/Faster-R-through-better-BLAS.html"&gt;improving performance of R by linking with optimized linear algebra libraries&lt;/a&gt;, I thought it would be useful to try out the five benchmarks Revolution Analytics have on their &lt;a href="http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php"&gt;Revolutionary Performance&lt;/a&gt; pages.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;For convenience I collected their tests into a single script &lt;a href="http://static.cybaea.net/files/revolution_benchmark.R"&gt;revolution_benchmark.R&lt;/a&gt; that I can simply run with &lt;code&gt;Rscript --vanilla revolution_benchmark.R&lt;/code&gt;.&lt;/p&gt;&#xD;
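&#xD;
&lt;p&gt;The script is not reproduced here, but the pattern for this kind of benchmark is simple: build a test object, then wrap the operation in &lt;code&gt;system.time()&lt;/code&gt;. A minimal sketch for a matrix multiply; the matrix size and the &lt;code&gt;time.op&lt;/code&gt; helper are assumptions for illustration, not Revolution's settings, and the size is far smaller than a real benchmark would use.&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;## Hypothetical helper: elapsed wall-clock seconds for evaluating an expression.&#xD;
time.op &amp;lt;- function(expr) unname(system.time(expr)["elapsed"])&#xD;
&#xD;
set.seed(1)&#xD;
m &amp;lt;- 500                                # illustrative size (assumed)&#xD;
a &amp;lt;- matrix(rnorm(m * m), nrow = m)&#xD;
&#xD;
elapsed &amp;lt;- time.op(crossprod(a))        # t(a) %*% a, handed to the BLAS&#xD;
cat("Matrix multiply:", elapsed, "seconds\n")&#xD;
&lt;/pre&gt;&#xD;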
&#xD;
&lt;p&gt;The results, compared with the speed-up factors Revolution claims for their version:&lt;/p&gt;&#xD;
&#xD;
&lt;table border="1" class="border"&gt;&#xD;
&lt;caption&gt;Revolution’s benchmarks run with standard R and with R + ATLAS on an x86_64 system&lt;/caption&gt;&#xD;
&lt;thead&gt;&#xD;
&lt;tr&gt;&#xD;
&lt;th&gt;&lt;/th&gt;&lt;th&gt;R&lt;/th&gt;&lt;th&gt;R + ATLAS&lt;/th&gt;&lt;th&gt;Speed-up&lt;/th&gt;&lt;th&gt;Revolution’s&lt;br /&gt;claimed speed-up&lt;/th&gt;&#xD;
&lt;/tr&gt;&#xD;
&lt;/thead&gt;&#xD;
&lt;tbody&gt;&#xD;
&lt;tr&gt;&lt;td&gt;Matrix Multiply&lt;/td&gt;&lt;td&gt;360.96&lt;/td&gt;&lt;td&gt;9.30&lt;/td&gt;&lt;td&gt;37.8&lt;/td&gt;&lt;td&gt;41.0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;Cholesky Factorization&lt;/td&gt;&lt;td&gt;27.28&lt;/td&gt;&lt;td&gt;5.65&lt;/td&gt;&lt;td&gt;3.8&lt;/td&gt;&lt;td&gt;21.0&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;Singular Value Decomposition&lt;/td&gt;&lt;td&gt;98.73&lt;/td&gt;&lt;td&gt;23.57&lt;/td&gt;&lt;td&gt;3.2&lt;/td&gt;&lt;td&gt;12.6&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;Principal Components Analysis&lt;/td&gt;&lt;td&gt;454.55&lt;/td&gt;&lt;td&gt;40.92&lt;/td&gt;&lt;td&gt;10.1&lt;/td&gt;&lt;td&gt;15.2&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;tr&gt;&lt;td&gt;Linear Discriminant Analysis&lt;/td&gt;&lt;td&gt;271.44&lt;/td&gt;&lt;td&gt;79.61&lt;/td&gt;&lt;td&gt;2.4&lt;/td&gt;&lt;td&gt;4.4&lt;/td&gt;&lt;/tr&gt;&#xD;
&lt;/tbody&gt;&#xD;
&lt;/table&gt;&#xD;
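&#xD;
&lt;p&gt;A note on reading the Speed-up column: the figures appear to be (slower time / faster time) - 1 rather than a plain ratio; at least, that formula reproduces every value in the table. A quick check, with the timings copied from the table (the definition itself is inferred, not stated in the post):&lt;/p&gt;&#xD;
&lt;pre class="document"&gt;timings &amp;lt;- data.frame(&#xD;
    test  = c("Matrix Multiply", "Cholesky Factorization",&#xD;
              "Singular Value Decomposition", "Principal Components Analysis",&#xD;
              "Linear Discriminant Analysis"),&#xD;
    r     = c(360.96, 27.28, 98.73, 454.55, 271.44),&#xD;
    atlas = c(9.30, 5.65, 23.57, 40.92, 79.61))&#xD;
&#xD;
## Speed-up as (slower / faster) - 1; an inferred definition.&#xD;
timings$speedup &amp;lt;- round(timings$r / timings$atlas - 1, 1)&#xD;
timings$speedup     # 37.8  3.8  3.2 10.1  2.4&#xD;
&lt;/pre&gt;&#xD;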
&#xD;
&lt;p&gt;In all instances Revolution’s claimed speed-up is greater, though probably not significantly so for the Matrix Multiply test and hardly so for the Principal Components Analysis.  (Of course, I do not have a copy of Revolution Analytics’ product, so I can’t verify their claims or make a comparable test.)&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;Whether saving 48 seconds on a linear discriminant analysis is enough to justify buying the product is a decision I leave to you: you know what analysis you do.  For me, there are (many) orders of magnitude to be gained from better algorithms and better variable selection, so I am not too worried about factors of 2 or even 10.  For extra raw power, I run R on a cloud service like AWS, which scales well for many problems and is easy to do with stock R; I guess there would be some sort of license implications if you wanted to do the same with Revolution’s product.  (But I &lt;em&gt;like&lt;/em&gt; Revolution and am still trying to find an excuse to use their product.)&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;Your mileage may vary.&lt;/p&gt;&#xD;
&lt;p&gt;Jump to &lt;a href="http://www.cybaea.net/Blogs/Data/Comparing-standard-R-with-Revoutions-for-performance.html#h2_entry_comments"&gt;comments&lt;/a&gt;.&lt;/p&gt;&#xD;
&lt;div class="seealso"&gt;&#xD;
&lt;h1&gt;You may also like these posts:&lt;/h1&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-50.png" width="85" height="16" alt="[0.49]" title="[0.49]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Faster-R-through-better-BLAS.html" title="Can we make our analysis using the R statistical computing and analysis platform run faster? Usually the answer is yes, and the best way is to improve your algorithm and variable selection. But recently David Smith was suggesting that a big benefit of their (commercial) version of R was that it was linked to a better linear algebra library. So I decided to investigate. The quick summary is that it only really makes a difference for fairly artificial benchmark tests. For “normal” work you are unlikely to see a difference most of the time."&gt;Faster R through better BLAS&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;Can we make our analysis using the R statistical computing and analysis platform run faster? Usually the answer is yes, and the best way is to improve your algorithm and variable selection. But recently David Smith was suggesting that a big benefit of their (commercial) version of R was that it was linked to a better linear algebra library. So I decided to investigate. The quick summary is that it only really makes a difference for fairly artificial benchmark tests. For “normal” work you are unlikely to see a difference most of the time.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-30.png" width="85" height="16" alt="[0.41]" title="[0.41]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-tips-Eliminating-the-save-workspace-image-prompt-on-exit.html" title="When using R, the statistical analysis and computing platform, I find it really annoying that it always prompts to save the workspace when I exit. This is how I turn it off."&gt;R tips: Eliminating the “save workspace image” prompt on exit&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;When using R, the statistical analysis and computing platform, I find it really annoying that it always prompts to save the workspace when I exit. This is how I turn it off.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-00.png" width="85" height="16" alt="[0.34]" title="[0.34]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-tips-Keep-your-packages-up_to_date.html" title="In this entry in a small series of tips for the use of the R statistical analysis and computing tool, we look at how to keep your addon packages up-to-date."&gt;R tips: Keep your packages up-to-date&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/SNu7nI9K28g" height="1" width="1"/&gt;</content><published>2010-06-17T09:05:00Z</published><updated>2010-06-17T09:05:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Comparing-standard-R-with-Revoutions-for-performance.html</feedburner:origLink></entry><entry><title type="text">Faster R through better BLAS</title><id>urn:uuid:428f009b-a07d-59dc-a643-50cc9a2b86ca</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Faster-R-through-better-BLAS.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/txxBGby0z-I/Faster-R-through-better-BLAS.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>Can we make our analysis using the <a href="http://www.r-project.org/">R statistical computing and analysis platform</a> run faster?  Usually the answer is yes, and the best way is to improve your algorithm and variable selection.</p>
<p>But recently David Smith was <a href="http://blog.revolutionanalytics.com/2010/06/performance-benefits-of-multithreaded-r.html" title="Performance benefits of linking R to multithreaded math libraries">suggesting</a> that a big benefit of their (commercial) version of R was that it was linked to a better linear algebra library.  So I decided to investigate.</p>
<p>The quick summary is that it only really makes a difference for fairly artificial benchmark tests.  For “normal” work you are unlikely to see a difference most of the time.</p>
</div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;Can we make our analysis using the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt; run faster?  Usually the answer is yes, and the best way is to improve your algorithm and variable selection.&lt;/p&gt;&#xD;
&lt;p&gt;But recently David Smith was &lt;a href="http://blog.revolutionanalytics.com/2010/06/performance-benefits-of-multithreaded-r.html" title="Performance benefits of linking R to multithreaded math libraries"&gt;suggesting&lt;/a&gt; that a big benefit of their (commercial) version of R was that it was linked to a better linear algebra library.  So I decided to investigate.&lt;/p&gt;&#xD;
&lt;p&gt;The quick summary is that it only really makes a difference for fairly artificial benchmark tests.  For “normal” work you are unlikely to see a difference most of the time.&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;The environment&lt;/h2&gt;&#xD;
&lt;p&gt;I use R on a 64-bit &lt;a href="http://fedoraproject.org/"&gt;Fedora&lt;/a&gt; 12 Linux system.  Fortunately, it is very easy to rebuild R using different libraries on this platform.  For the following, I will assume that you have a working &lt;a href="http://www.rpm.org/max-rpm-snapshot/rpmbuild.8.html"&gt;rpmbuild&lt;/a&gt; environment.  The test system has a quad core Intel Xeon E5420 CPU with each core running at 2.50 GHz.&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;Benchmarks&lt;/h2&gt;&#xD;
&lt;p&gt;Benchmarking R is complex.  Very complex.  But for this simple comparison we use two tests from the &lt;a href="http://r.research.att.com/benchmarks/"&gt;R Benchmarks&lt;/a&gt; page: &lt;a href="http://r.research.att.com/benchmarks/MASS-ex.R"&gt;MASS-ex.R&lt;/a&gt; and &lt;a href="http://r.research.att.com/benchmarks/R-benchmark-25.R"&gt;R-benchmark-25.R&lt;/a&gt;.  The first is a simple benchmark using the examples from the MASS package, and has the advantage that it reflects real-world problems and real-world analysis, albeit small problems and short analysis.  The second is a much more artificial example and primarily tests matrix operations.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;We run the MASS benchmark as:&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;/usr/bin/time -p R --vanilla CMD BATCH MASS-ex.R /dev/null&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;While the R-benchmark-25 is simply:&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;Rscript --vanilla R-benchmark-25.R&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;For the MASS benchmark we simply capture the real elapsed time, while R benchmark 2.5 provides more detailed output for its three classes of tests (matrix calculation, matrix functions, and program execution) as well as overall summaries.  They are all shown in the table below.&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;Compiler-optimized R&lt;/h2&gt;&#xD;
&lt;p&gt;For the experiments that follow the first thing to do is to grab copies of the source RPMs for R and for ATLAS:&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;cd ~/rpmbuild/SRPMS&#xD;
yumdownloader --source atlas R&#xD;
cd ..&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;At the time I did this, I got &lt;code&gt;R-2.11.0-1.fc12.src.rpm&lt;/code&gt; and &lt;code&gt;atlas-3.8.3-12.fc12.src.rpm&lt;/code&gt;.  I crank up the optimization level when building from source, so the first step is to edit &lt;code&gt;&lt;a href="http://static.cybaea.net/files/.rpmrc"&gt;~/.rpmrc&lt;/a&gt;&lt;/code&gt; to include the line &lt;code&gt;optflags: x86_64 -O3 -march=native -m64 -g&lt;/code&gt;.  With that in place we can simply do:&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="screen"&gt;rpmbuild --rebuild SRPMS/R-2.11.0-1.fc12.src.rpm  #  Change version numbers as needed&#xD;
su -c 'rpm -Uhv --force RPMS/x86_64/R*2.11.0-1*.rpm RPMS/x86_64/libRmath*2.11.0-1*.rpm'&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;We now have a compiler-optimized version of R and we can re-run our tests.  It doesn't make much difference, but that is also good to know.&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;ATLAS BLAS libraries&lt;/h2&gt;&#xD;
&lt;p&gt;Now let's try linking to the ATLAS BLAS libraries instead.  I assume you have them installed (&lt;code&gt;yum install atlas&lt;/code&gt; if not) so you can just grab a copy of &lt;a href="http://static.cybaea.net/files/R-atlas.diff"&gt;R-atlas.diff&lt;/a&gt; to change the spec file like this:&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="screen"&gt;rpm -ihv SRPMS/R-2.11.0-1.fc12.src.rpm   # Install to your rpmbuild environment&#xD;
cd SPECS&#xD;
wget &lt;a href="http://static.cybaea.net/files/R-atlas.diff"&gt;http://static.cybaea.net/files/R-atlas.diff&lt;/a&gt;&#xD;
patch -o R-atlas.spec R.spec R-atlas.diff&#xD;
cd ..&#xD;
rpmbuild -bb SPECS/R-atlas.spec&#xD;
su -c 'rpm -Uhv --force RPMS/x86_64/R*2.11.0-1*.rpm RPMS/x86_64/libRmath*2.11.0-1*.rpm'&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;You now have a version of R that uses the ATLAS BLAS libraries, so you can re-run the tests.  The results are in the table below in the “Optimized R + Standard ATLAS” row.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;As expected, the matrix operations from &lt;code&gt;R-benchmark-25.R&lt;/code&gt; run a lot faster: they complete in about 30-40% of the time, and much of that gain comes from multi-threading, so that all four CPU cores are used.&lt;/p&gt;&#xD;
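&#xD;
&lt;p&gt;A quick way to see this effect for yourself is to time a single large matrix operation from within R.  The following is only a minimal sketch and not part of the benchmark scripts; the matrix size &lt;code&gt;n&lt;/code&gt; and the seed are arbitrary assumptions of mine.  With a multi-threaded BLAS you would expect the reported user time to exceed the elapsed time, because several cores work in parallel.&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;## Sketch only: time one BLAS-heavy operation (sizes are arbitrary assumptions)&#xD;
set.seed(42)&#xD;
n &amp;lt;- 1000&#xD;
m &amp;lt;- matrix(rnorm(n * n), nrow = n)&#xD;
## crossprod(m) computes t(m) %*% m, which R normally hands to the BLAS library&#xD;
timing &amp;lt;- system.time(cp &amp;lt;- crossprod(m))&#xD;
## Compare CPU time used by this process with wall-clock time&#xD;
print(timing[c("user.self", "elapsed")])&#xD;
&lt;/pre&gt;&#xD;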
&#xD;
&lt;p&gt;However, for the analysis-heavy code in &lt;code&gt;MASS-ex.R&lt;/code&gt; there is little difference.  If anything, we see a tiny increase in running time.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
  &lt;em&gt;Multi-threaded BLAS libraries make no significant difference to real-world analysis problems using R.&lt;/em&gt;&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;Other BLAS libraries&lt;/h2&gt;&#xD;
&lt;p&gt;For good measure we also try an optimized version of ATLAS, but it does not make much difference on the x86_64 architecture:&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="screen"&gt;rpmbuild -D "enable_native_atlas 1" --rebuild SRPMS/atlas-3.8.3-12.fc12.src.rpm&#xD;
su -c 'rpm -Uhv --force RPMS/x86_64/atlas*3.8.3-12*.rpm'&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;And (only) for completeness, we also try the standard Netlib BLAS and LAPACK libraries (&lt;code&gt;yum install blas lapack&lt;/code&gt;) by the same method as the ATLAS library above but with a slightly different change to the SPEC file: &lt;code&gt;&lt;a href="http://static.cybaea.net/files/R-blas.diff"&gt;R-blas.diff&lt;/a&gt;&lt;/code&gt;.  It performs a little better than vanilla R.&lt;/p&gt;&#xD;
&#xD;
&lt;p&gt;For more information about rebuilding R with different BLAS libraries, see the &lt;a href="http://cran.r-project.org/doc/manuals/R-admin.html#Linear-algebra"&gt;linear algebra section in the R Installation and Administration manual&lt;/a&gt;.&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;Benchmark results&lt;/h2&gt;&#xD;
&lt;table border="1" class="border"&gt;&#xD;
&lt;caption&gt;Benchmark results for various optimizations of R and the BLAS library&lt;/caption&gt;&#xD;
&lt;thead&gt;&#xD;
&lt;tr&gt;&#xD;
&lt;th rowspan="3"&gt;R version&lt;/th&gt;&#xD;
&lt;th colspan="2"&gt;&lt;a href="http://r.research.att.com/benchmarks/MASS-ex.R"&gt;MASS-ex.R&lt;/a&gt;&lt;/th&gt;&#xD;
&lt;th colspan="10"&gt;&lt;a href="http://r.research.att.com/benchmarks/R-benchmark-25.R"&gt;R benchmark 2.5&lt;/a&gt;&lt;/th&gt;&#xD;
&lt;/tr&gt;&#xD;
&lt;tr&gt;&#xD;
&lt;th colspan="2"&gt;Real&lt;/th&gt;&#xD;
&lt;th colspan="2"&gt;Total time&lt;/th&gt;&#xD;
&lt;th colspan="2"&gt;Overall mean&lt;/th&gt;&#xD;
&lt;th colspan="2"&gt;Ⅰ. Matrix calc.&lt;/th&gt;&#xD;
&lt;th colspan="2"&gt;Ⅱ. Matrix functions&lt;/th&gt;&#xD;
&lt;th colspan="2"&gt;Ⅲ. Program.&lt;/th&gt;&#xD;
&lt;/tr&gt;&#xD;
&lt;tr&gt;&#xD;
&lt;th&gt;&lt;abbr title="seconds"&gt;secs&lt;/abbr&gt;&lt;/th&gt;&lt;th&gt;index&lt;/th&gt;&lt;th&gt;&lt;abbr title="seconds"&gt;secs&lt;/abbr&gt;&lt;/th&gt;&lt;th&gt;index&lt;/th&gt;&lt;th&gt;&lt;abbr title="seconds"&gt;secs&lt;/abbr&gt;&lt;/th&gt;&lt;th&gt;index&lt;/th&gt;&#xD;
&lt;th&gt;&lt;abbr title="seconds"&gt;secs&lt;/abbr&gt;&lt;/th&gt;&lt;th&gt;index&lt;/th&gt;&lt;th&gt;&lt;abbr title="seconds"&gt;secs&lt;/abbr&gt;&lt;/th&gt;&lt;th&gt;index&lt;/th&gt;&lt;th&gt;&lt;abbr title="seconds"&gt;secs&lt;/abbr&gt;&lt;/th&gt;&lt;th&gt;index&lt;/th&gt;&#xD;
&lt;/tr&gt;&#xD;
&lt;/thead&gt;&#xD;
&lt;tbody&gt;&#xD;
&lt;tr&gt;&#xD;
&lt;td&gt;Base install&lt;/td&gt;&#xD;
&lt;td&gt;19.00&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;td&gt;78.49&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;td&gt;2.11&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;td&gt;2.32&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;td&gt;3.86&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;td&gt;1.05&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&#xD;
&lt;/tr&gt;&#xD;
&lt;tr&gt;&#xD;
&lt;td&gt;Optimized R&lt;/td&gt;&#xD;
&lt;td&gt;18.98&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;td&gt;76.11&lt;/td&gt;&lt;td&gt;0.97&lt;/td&gt;&lt;td&gt;2.02&lt;/td&gt;&lt;td&gt;0.96&lt;/td&gt;&lt;td&gt;2.36&lt;/td&gt;&lt;td&gt;1.02&lt;/td&gt;&lt;td&gt;3.46&lt;/td&gt;&lt;td&gt;0.90&lt;/td&gt;&lt;td&gt;1.02&lt;/td&gt;&lt;td&gt;0.97&lt;/td&gt;&#xD;
&lt;/tr&gt;&#xD;
&lt;tr&gt;&#xD;
&lt;td&gt;Optimized R + Netlib BLAS&lt;/td&gt;&#xD;
&lt;td&gt;18.56&lt;/td&gt;&lt;td&gt;0.98&lt;/td&gt;&lt;td&gt;73.22&lt;/td&gt;&lt;td&gt;0.93&lt;/td&gt;&lt;td&gt;1.81&lt;/td&gt;&lt;td&gt;0.86&lt;/td&gt;&lt;td&gt;2.36&lt;/td&gt;&lt;td&gt;1.02&lt;/td&gt;&lt;td&gt;2.41&lt;/td&gt;&lt;td&gt;0.62&lt;/td&gt;&lt;td&gt;1.04&lt;/td&gt;&lt;td&gt;0.99&lt;/td&gt;&#xD;
&lt;/tr&gt;&#xD;
&lt;tr&gt;&#xD;
&lt;td&gt;Optimized R + Standard ATLAS&lt;/td&gt;&#xD;
&lt;td&gt;19.43&lt;/td&gt;&lt;td&gt;1.02&lt;/td&gt;&lt;td&gt;16.74&lt;/td&gt;&lt;td&gt;0.21&lt;/td&gt;&lt;td&gt;0.97&lt;/td&gt;&lt;td&gt;0.46&lt;/td&gt;&lt;td&gt;0.90&lt;/td&gt;&lt;td&gt;0.39&lt;/td&gt;&lt;td&gt;1.04&lt;/td&gt;&lt;td&gt;0.27&lt;/td&gt;&lt;td&gt;0.99&lt;/td&gt;&lt;td&gt;0.95&lt;/td&gt;&#xD;
&lt;/tr&gt;&#xD;
&lt;tr&gt;&#xD;
&lt;td&gt;Optimized R + Optimized ATLAS&lt;/td&gt;&#xD;
&lt;td&gt;19.31&lt;/td&gt;&lt;td&gt;1.02&lt;/td&gt;&lt;td&gt;16.36&lt;/td&gt;&lt;td&gt;0.21&lt;/td&gt;&lt;td&gt;0.95&lt;/td&gt;&lt;td&gt;0.45&lt;/td&gt;&lt;td&gt;0.84&lt;/td&gt;&lt;td&gt;0.36&lt;/td&gt;&lt;td&gt;1.02&lt;/td&gt;&lt;td&gt;0.26&lt;/td&gt;&lt;td&gt;1.00&lt;/td&gt;&lt;td&gt;0.95&lt;/td&gt;&#xD;
&lt;/tr&gt;&#xD;
&lt;/tbody&gt;&#xD;
&lt;/table&gt;&#xD;
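&lt;p&gt;The index columns appear to be each timing divided by the corresponding base-install timing, so values below 1.00 are faster than the base install.  As a small check under that assumption, this sketch recomputes the index for the “Total time” column from the seconds in the table (the variable names are mine):&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;## Seconds from the "Total time" column of the table above&#xD;
secs &amp;lt;- c("Base install"                  = 78.49,&#xD;
          "Optimized R"                   = 76.11,&#xD;
          "Optimized R + Netlib BLAS"     = 73.22,&#xD;
          "Optimized R + Standard ATLAS"  = 16.74,&#xD;
          "Optimized R + Optimized ATLAS" = 16.36)&#xD;
## Assumed definition: index = seconds relative to the base install&#xD;
index &amp;lt;- round(secs / secs[["Base install"]], 2)&#xD;
print(index)&#xD;
&lt;/pre&gt;&#xD;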
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=txxBGby0z-I:XtMdQd8RATE:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=txxBGby0z-I:XtMdQd8RATE:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=txxBGby0z-I:XtMdQd8RATE:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=txxBGby0z-I:XtMdQd8RATE:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=txxBGby0z-I:XtMdQd8RATE:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=txxBGby0z-I:XtMdQd8RATE:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=txxBGby0z-I:XtMdQd8RATE:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=txxBGby0z-I:XtMdQd8RATE:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=txxBGby0z-I:XtMdQd8RATE:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/txxBGby0z-I" height="1" width="1"/&gt;</content><published>2010-06-15T10:21:00Z</published><updated>2010-06-15T10:21:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Faster-R-through-better-BLAS.html</feedburner:origLink></entry><entry><title type="text">R: Eliminating observed values with zero variance</title><id>urn:uuid:5394cf3c-2009-5225-955d-1b6c90ae4445</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-Eliminating-observed-values-with-zero-variance.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/tQRUNQKxFng/R-Eliminating-observed-values-with-zero-variance.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
I needed a fast way of eliminating observed values with zero variance from large data sets using the <a href="http://www.r-project.org/">R statistical computing and analysis platform</a>.  In other words, I want to find the columns in a data frame that have zero variance.  And as fast as possible, because my data sets are large, many, and changing fast.  The final result surprised me a little.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
I needed a fast way of eliminating observed values with zero variance from large data sets using the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt;.  In other words, I want to find the columns in a data frame that have zero variance.  And as fast as possible, because my data sets are large, many, and changing fast.  The final result surprised me a little.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
I use the &lt;a href="http://www.kddcup-orange.com/data.php"&gt;KDD Cup 2009 data sets&lt;/a&gt; as my reference for this experiment.  (You will need to register to download the data.)  It is a realistic example of the type of customer data that I usually work with.  It has 50,000 observations of 15,000 variables.  To load it into R you'll need a reasonably beefy machine.  My workstation has 16GB of memory; if yours has less, use a sample of the data.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
We load the data into R and propose a few ways in which we may identify the columns we need:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;#!/usr/bin/Rscript&#xD;
## zero-var.R - find the fastest way of eliminating observations with zero variance&#xD;
## © 2010 Allan Engelhardt, http://www.cybaea.net&#xD;
&#xD;
## Read the data file.&#xD;
## We have already converted it to R format and saved it, so we can do&#xD;
load("train.RData")&#xD;
## instead of something like&#xD;
# train &amp;lt;- read.delim(file="../orange_large_train.data.bz2")&#xD;
&#xD;
## Some suggestions for zero variance functions:&#xD;
zv.1 &amp;lt;- function(x) {&#xD;
    ## The literal approach&#xD;
    y &amp;lt;- var(x, na.rm = TRUE)&#xD;
    return(is.na(y) || y == 0)&#xD;
}&#xD;
zv.2 &amp;lt;- function(x) {&#xD;
    ## As before, but avoiding direct comparison with zero&#xD;
    y &amp;lt;- var(x, na.rm = TRUE)&#xD;
    return(is.na(y) || y &amp;lt; .Machine$double.eps ^ 0.5)&#xD;
}&#xD;
zv.3 &amp;lt;- function(x) {&#xD;
    ## Maybe it is faster to check for equality than to compute?&#xD;
    y &amp;lt;- x[!is.na(x)]&#xD;
    return(all(y == y[1]))&#xD;
}&#xD;
zv.4 &amp;lt;- function(x) {&#xD;
    ## Taking out the special case may speed things up?&#xD;
    ## (At least for this data set where this case is common.)&#xD;
    z &amp;lt;- is.na(x)&#xD;
    if ( all(z) ) return(TRUE);&#xD;
    y &amp;lt;- x[!z]&#xD;
    return(all(y == y[1]))&#xD;
}&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Now we just have to load the very useful &lt;a href="http://cran.r-project.org/web/packages/rbenchmark/index.html"&gt;rbenchmark&lt;/a&gt; package and let the machine figure it out:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;library("rbenchmark")&#xD;
&#xD;
cat("Running benchmarks:\n")&#xD;
benchmark(&#xD;
          zv1 = { sapply(train, zv.1) },&#xD;
          zv2 = { sapply(train, zv.2) },&#xD;
          zv3 = { sapply(train, zv.3) },&#xD;
          zv4 = { sapply(train, zv.4) },&#xD;
          replications = 5,&#xD;
          columns = c("test", "elapsed", "relative", "sys.self"),&#xD;
          order = "elapsed"&#xD;
          )&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The answer (on my machine) is that it is faster to calculate the variance than to check for equality:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;Running benchmarks:&#xD;
  test elapsed relative sys.self&#xD;
1  zv1  78.619 1.000000    6.395&#xD;
2  zv2  79.276 1.008357    6.586&#xD;
3  zv3 113.024 1.437617    1.735&#xD;
4  zv4 118.579 1.508274    1.716&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The two functions based on the core variance function are easily the fastest (despite having to do arithmetic) while taking out the special case in the equality functions is a Bad Idea.&#xD;
&lt;/p&gt;&#xD;
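&lt;p&gt;&#xD;
To actually drop the offending columns, the logical vector that such a function returns for each column can be used as a column filter.  Here is a minimal sketch on a made-up data frame; &lt;code&gt;toy&lt;/code&gt; and its column names are hypothetical and not taken from the KDD data, and &lt;code&gt;zv.1&lt;/code&gt; is repeated so that the example runs on its own:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="code"&gt;## Sketch only: remove zero-variance columns from a small, made-up data frame&#xD;
zv.1 &amp;lt;- function(x) {&#xD;
    y &amp;lt;- var(x, na.rm = TRUE)&#xD;
    return(is.na(y) || y == 0)&#xD;
}&#xD;
toy &amp;lt;- data.frame(a = c(1, 2, 3),                    # varies: keep&#xD;
                  b = c(5, 5, 5),                    # constant: drop&#xD;
                  c = c(NA_real_, NA_real_, NA_real_), # all missing: drop&#xD;
                  d = c(2, NA, 2))                   # constant apart from NA: drop&#xD;
drop &amp;lt;- vapply(toy, zv.1, logical(1))&#xD;
kept &amp;lt;- toy[, !drop, drop = FALSE]                # drop = FALSE keeps a data frame&#xD;
print(names(kept))&#xD;
&lt;/pre&gt;&#xD;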
&lt;p&gt;&#xD;
Can you think of an even faster way to do it?&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tQRUNQKxFng:XLrBfXK16uw:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tQRUNQKxFng:XLrBfXK16uw:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=tQRUNQKxFng:XLrBfXK16uw:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tQRUNQKxFng:XLrBfXK16uw:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=tQRUNQKxFng:XLrBfXK16uw:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tQRUNQKxFng:XLrBfXK16uw:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tQRUNQKxFng:XLrBfXK16uw:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=tQRUNQKxFng:XLrBfXK16uw:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=tQRUNQKxFng:XLrBfXK16uw:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/tQRUNQKxFng" height="1" width="1"/&gt;</content><published>2010-03-08T14:46:00Z</published><updated>2010-03-08T14:46:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-Eliminating-observed-values-with-zero-variance.html</feedburner:origLink></entry><entry><title type="text">Beautiful Data</title><id>urn:uuid:770cda82-5757-5a26-827a-2aeff8a8a098</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Beautiful-Data.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/nsxeLGyxKLM/Beautiful-Data.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><div class="floatRight">
  <a href="http://www.cybaea.net/Blogs/Data/Beautiful-Data.html" title="Click for full article">
    <img src="http://static.cybaea.net/images/beautiful-data-small.png" width="100" height="131" alt="[book cover]" />
  </a>
</div>
<p>
O'Reilly's recent publication <a href="http://oreilly.com/catalog/9780596157111/">Beautiful Data</a> has a chapter by <a href="http://jeffjonas.typepad.com/jeff_jonas/">Jeff Jonas</a> which is enough reason in itself for me to recommend it.  The chapter, <a href="http://jeffjonas.typepad.com/DataFindsDataFinal.pdf">Data Finds Data</a>, is also available as a PDF download.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
O'Reilly's recent publication &lt;a href="http://oreilly.com/catalog/9780596157111/"&gt;Beautiful Data&lt;/a&gt; has a chapter by &lt;a href="http://jeffjonas.typepad.com/jeff_jonas/"&gt;Jeff Jonas&lt;/a&gt; which is enough reason in itself for me to recommend it.  The chapter, &lt;a href="http://jeffjonas.typepad.com/DataFindsDataFinal.pdf"&gt;Data Finds Data&lt;/a&gt;, is also available as a PDF download.&#xD;
&lt;/p&gt;&#xD;
&lt;div class="floatRight"&gt;&#xD;
  &lt;a href="http://oreilly.com/catalog/9780596157111/" title="Click for book details"&gt;&#xD;
    &lt;img src="http://static.cybaea.net/images/beautiful-data-small.png" width="100" height="131" alt="[book cover]"&gt;&lt;/img&gt;&#xD;
  &lt;/a&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;p&gt;&#xD;
I met Jeff a couple of years ago at an ETech conference, and he is easily one of the smartest people I have ever met who is thinking about data.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;Jump to &lt;a href="http://www.cybaea.net/Blogs/Data/Beautiful-Data.html#h2_entry_comments"&gt;comments&lt;/a&gt;.&lt;/p&gt;&#xD;
&lt;div class="seealso"&gt;&#xD;
&lt;h1&gt;You may also like these posts:&lt;/h1&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.43]" title="[0.43]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Data_gov.html" title="I am always on the lookout for useful data sources for training in statistics, so I am excited that Data.gov has opened for business. The purpose of Data.gov is to increase public access to high value, machine readable datasets generated by the US Government."&gt;Data.gov&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;I am always on the lookout for useful data sources for training in statistics, so I am excited that Data.gov has opened for business. The purpose of Data.gov is to increase public access to high value, machine readable datasets generated by the US Government.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-00.png" width="85" height="16" alt="[0.34]" title="[0.34]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Journal/When-Big-Data-Matters.html" title="Big Data is a buzzword, but is it real: does it address real business issues or is it just an excuse to sell more computers, software, and consulting services? We argue that it is real and it does matter, but only in some well-defined circumstances: it is not a universal solution or requirement to every problem. We provide a framework for determining where the Big Data applications are within your work and where traditional approaches apply. Get this article as a PDF: When Big Data matters ."&gt;When Big Data Matters&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/nsxeLGyxKLM" height="1" width="1"/&gt;</content><published>2009-07-27T19:38:00Z</published><updated>2009-07-27T19:38:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Beautiful-Data.html</feedburner:origLink></entry><entry><title type="text">Massively parallel database for analytics</title><id>urn:uuid:a8ba9e43-b837-551c-bd02-a1a7b4506c41</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Massively-parallel-database-for-analytics.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/Apso1Get0Yk/Massively-parallel-database-for-analytics.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
This is by far the best description of why traditional parallel databases (like Teradata, Greenplum et al.) are an evolutionary dead end.  But it is much more than a theoretical discussion: the authors have built a solution which they call HadoopDB.  It is based on Hadoop, PostgreSQL, and Hive and is completely Open Source.  Alternative, column-based backends to PostgreSQL are being implemented now.  Read: <a href="http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html">Announcing release of HadoopDB</a>.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
This is by far the best description of why traditional parallel databases (like Teradata, Greenplum et al.) are an evolutionary dead end.  But it is much more than a theoretical discussion: the authors have built a solution which they call HadoopDB.  It is based on Hadoop, PostgreSQL, and Hive and is completely Open Source.  Alternative, column-based backends to PostgreSQL are being implemented now.  Read: &lt;a href="http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html"&gt;Announcing release of HadoopDB&lt;/a&gt;.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;See also:&lt;/p&gt;&#xD;
&lt;ul&gt;&#xD;
&lt;li&gt;&lt;a href="http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-shorter.html"&gt;Short version: key bullet points&lt;/a&gt;&lt;/li&gt;&#xD;
&lt;li&gt;&lt;a href="http://db.cs.yale.edu/hadoopdb/hadoopdb.pdf"&gt;Long version (12 pages, PDF)&lt;/a&gt;&lt;/li&gt;&#xD;
&lt;li&gt;&lt;a href="http://tech.slashdot.org/story/09/07/21/1747241/Researchers-Create-Database-Hadoop-Hybrid?from=rss"&gt;Slashdot discussion&lt;/a&gt;&lt;/li&gt;&#xD;
&lt;li&gt;&lt;a href="http://www.stats.bris.ac.uk/R/web/packages/HadoopStreaming/index.html"&gt;R package HadoopStreaming&lt;/a&gt;&lt;/li&gt;&#xD;
&lt;/ul&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/Apso1Get0Yk" height="1" width="1"/&gt;</content><published>2009-07-22T13:37:00Z</published><updated>2009-07-22T13:37:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Massively-parallel-database-for-analytics.html</feedburner:origLink></entry><entry><title type="text">The Knapsack Problem</title><id>urn:uuid:6efce6d8-6489-5275-aa88-1ddce86d4e65</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/The-Knapsack-Problem.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/JCEN5oEfIRM/The-Knapsack-Problem.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
<a href="http://blog.revolution-computing.com/2009/07/because-its-friday-the-knapsack-problem.html">David posts a question</a> about how to solve <a href="http://xkcd.com/287/">this</a> <a href="http://en.wikipedia.org/wiki/Knapsack_problem">knapsack problem </a> using the <a href="http://www.r-project.org/">R statistical computing and analysis platform</a>.  My reply in the comments seems to have disappeared for a while so here is my proposed solution:
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;div class="floatCenter" style="width: 640px;"&gt;&#xD;
  &lt;a href="http://xkcd.com/287/"&gt;&#xD;
    &lt;img src="http://imgs.xkcd.com/comics/np_complete.png" width="640" height="414" alt="[Cartoon from XKCD]" title="NP-Complete"&gt;&lt;/img&gt;&#xD;
  &lt;/a&gt;&#xD;
&lt;/div&gt;&#xD;
&lt;p&gt;&#xD;
&lt;a href="http://blog.revolution-computing.com/2009/07/because-its-friday-the-knapsack-problem.html"&gt;David posts a question&lt;/a&gt; about how to solve this &lt;a href="http://en.wikipedia.org/wiki/Knapsack_problem"&gt;knapsack problem &lt;/a&gt; using the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt;.  My reply in the comments seems to have disappeared for a while so here is my proposed solution.  See David’s blog for my earlier proposed solution with a very common error.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&#xD;
## http://blog.revolution-computing.com/2009/07/because-its-friday-the-knapsack-problem.html&#xD;
appetizer.solution &amp;lt;- local (&#xD;
function (target) {&#xD;
  app &amp;lt;- c(2.15, 2.75, 3.35, 3.55, 4.20, 5.80)&#xD;
  r &amp;lt;- 2L&#xD;
  repeat {&#xD;
	c &amp;lt;- gtools::combinations(length(app), r=r, v=app, repeats.allowed=TRUE)&#xD;
	s &amp;lt;- rowSums(c)&#xD;
	if ( all(s &amp;gt; target) ) {&#xD;
	  print("No solution found")&#xD;
	  break&#xD;
	}&#xD;
	x &amp;lt;- which( abs(s-target) &amp;lt; 1e-4 )&#xD;
	if ( length(x) &amp;gt; 0L ) {&#xD;
	  cat("Solution found: ", c[x,], "\n")&#xD;
	  break&#xD;
	}&#xD;
	r &amp;lt;- r + 1L&#xD;
  }&#xD;
})&#xD;
&#xD;
appetizer.solution(15.05)&#xD;
# Solution found:  2.15 3.55 3.55 5.8 &#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Brute force works; it just doesn’t scale well.  (Note that 7×2.15 is another solution.)&#xD;
&lt;/p&gt;&#xD;
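&lt;p&gt;&#xD;
Where brute force becomes too slow, one alternative is dynamic programming over the prices expressed in integer cents, which takes time roughly proportional to the target amount times the number of menu items.  The following is only a sketch under that assumption: &lt;code&gt;appetizer.dp&lt;/code&gt; is a hypothetical name, not part of any package, and it returns a single combination rather than all of them.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&#xD;
## Sketch only: appetizer.dp() is a hypothetical name, not from any package.&#xD;
appetizer.dp &amp;lt;- function (target, app = c(2.15, 2.75, 3.35, 3.55, 4.20, 5.80)) {&#xD;
  cents &amp;lt;- as.integer(round(app * 100))     # work in integer cents&#xD;
  total &amp;lt;- as.integer(round(target * 100))&#xD;
  ## last[amt + 1] is the index of the item added last to reach amt cents;&#xD;
  ## NA means amt is unreachable, 0 marks the empty order.&#xD;
  last &amp;lt;- rep(NA_integer_, total + 1L)&#xD;
  last[1L] &amp;lt;- 0L&#xD;
  for ( amt in seq_len(total) ) {&#xD;
    prev &amp;lt;- amt - cents                      # amounts left after removing each item&#xD;
    ok   &amp;lt;- which(prev &amp;gt;= 0L)&#xD;
    ok   &amp;lt;- ok[!is.na(last[prev[ok] + 1L])]  # keep items whose remainder is reachable&#xD;
    if ( length(ok) &amp;gt; 0L ) last[amt + 1L] &amp;lt;- ok[1L]&#xD;
  }&#xD;
  if ( is.na(last[total + 1L]) ) return(NULL)  # no combination sums to target&#xD;
  items &amp;lt;- numeric(0)&#xD;
  amt &amp;lt;- total&#xD;
  while ( amt &amp;gt; 0L ) {                        # walk back through the table&#xD;
    i &amp;lt;- last[amt + 1L]&#xD;
    items &amp;lt;- c(items, app[i])&#xD;
    amt &amp;lt;- amt - cents[i]&#xD;
  }&#xD;
  items&#xD;
}&#xD;
&#xD;
sum(appetizer.dp(15.05))  # compare with the target&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Working in integer cents also avoids the floating point tolerance needed in the brute force version.&#xD;
&lt;/p&gt;&#xD;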
&lt;p&gt;Jump to &lt;a href="http://www.cybaea.net/Blogs/Data/The-Knapsack-Problem.html#h2_entry_comments"&gt;comments&lt;/a&gt;.&lt;/p&gt;&#xD;
&lt;div class="seealso"&gt;&#xD;
&lt;h1&gt;You may also like these posts:&lt;/h1&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-20.png" width="85" height="16" alt="[0.38]" title="[0.38]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Employee-productivity-as-function-of-number-of-workers-revisited.html" title="We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis for the FTSE-100 constituent companies and find that the relation still holds four years later and across a continent."&gt;Employee productivity as function of number of workers revisited&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis …&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/JCEN5oEfIRM" height="1" width="1"/&gt;</content><published>2009-07-10T20:30:00Z</published><updated>2009-07-10T20:30:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/The-Knapsack-Problem.html</feedburner:origLink></entry><entry><title type="text">OECD Statistics</title><id>urn:uuid:43e585f9-9c60-505d-b349-b65d1a20c969</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/OECD-Statistics.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/5fn__mpTK8o/OECD-Statistics.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
I am a sucker for good quality data.  I <a href="http://www.cybaea.net/Blogs/Data/Data_gov.html">wrote about data.gov</a>, the US Government data site, before, and now I find <a href="http://stats.oecd.org/">OECD Statistics</a>, which has some 300 data sets, many of which seem to be readily accessible (though some may require a subscription).
</p>
</div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
I am a sucker for good quality data.  I &lt;a href="http://www.cybaea.net/Blogs/Data/Data_gov.html"&gt;wrote about data.gov&lt;/a&gt;, the US Government data site, before, and now I find &lt;a href="http://stats.oecd.org/"&gt;OECD Statistics&lt;/a&gt;, which has some 300 data sets, many of which seem to be readily accessible (though some may require a subscription).&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
The site exports in multiple formats, including Excel, CSV, and &lt;a href="http://sdmx.org/"&gt;SDMX&lt;/a&gt;.&#xD;
&lt;/p&gt;&#xD;
&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/5fn__mpTK8o" height="1" width="1"/&gt;</content><published>2009-07-02T20:33:00Z</published><updated>2009-07-02T20:33:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/OECD-Statistics.html</feedburner:origLink></entry><entry><title type="text">R tips: Determine if function is called from specific package</title><id>urn:uuid:08988ce4-0f96-564e-9575-7c6f2ff16147</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-tips-Determine-if-function-is-called-from-specific-package.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/mqFcMzo8FLQ/R-tips-Determine-if-function-is-called-from-specific-package.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
I like the "multicore" library for a particular task.  I can easily write a combination of<code> if(require("multicore",...))</code> that means that my function will automatically use the parallel <code>mclapply()</code> instead of <code>lapply()</code> where it is available.  Which is grand 99% of the time, except when my function is called from <code>mclapply()</code> (or one of the lower level functions) in which case much CPU thrashing and grinding of teeth will result.
</p>
<p>
So, I needed a function to determine if my function was called from any function in the "multicore" library.  Here it is.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
I like the "multicore" library for a particular task.  I can easily write a combination of&lt;code&gt; if(require("multicore",...))&lt;/code&gt; that means that my function will automatically use the parallel &lt;code&gt;mclapply()&lt;/code&gt; instead of &lt;code&gt;lapply()&lt;/code&gt; where it is available.  Which is grand 99% of the time, except when my function is called from &lt;code&gt;mclapply()&lt;/code&gt; (or one of the lower level functions) in which case much CPU thrashing and grinding of teeth will result.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
So, I needed a function to determine if my function was called from any function in the "multicore" library.  Here it is.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
First define a generally useful function:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="code" title="is.in.namespace()"&gt;&#xD;
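is.in.namespace &amp;lt;-&#xD;
function (ns) {&#xD;
  ## Walk the call stack from the outermost frame to this one&#xD;
  for ( frame in seq_len(sys.nframe()) ) {&#xD;
    fun &amp;lt;- sys.function(frame)   # the function running in that frame&#xD;
    env &amp;lt;- environment(fun)      # its enclosing environment&#xD;
    n   &amp;lt;- environmentName(env)  # e.g. "multicore" for a package namespace&#xD;
    ## identical() is safe even if no environment name is available&#xD;
    if ( identical(n, ns) ) return(TRUE)&#xD;
  }&#xD;
  return(FALSE)&#xD;
}&#xD;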
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Then we use it for our purpose:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&#xD;
is.in.multicore &amp;lt;- function (...) { return(is.in.namespace("multicore")) }&#xD;
library("multicore")&#xD;
stopifnot( mclapply(as.list(1), is.in.multicore)[[1]] == TRUE )&#xD;
stopifnot(   lapply(as.list(1), is.in.multicore)[[1]] == FALSE )&#xD;
stopifnot( local( {mclapply &amp;lt;- function(x) return(x); mclapply(is.in.multicore())} ) == FALSE )&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Easy when you know how.&#xD;
&lt;/p&gt;&#xD;
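&lt;p&gt;&#xD;
As an illustration of how the check might be used, the sketch below puts the choice between &lt;code&gt;mclapply()&lt;/code&gt; and &lt;code&gt;lapply()&lt;/code&gt; in one place.  The name &lt;code&gt;my.lapply&lt;/code&gt; is hypothetical, and &lt;code&gt;is.in.namespace()&lt;/code&gt; is repeated in compact form only so that the snippet runs on its own; it falls back to &lt;code&gt;lapply()&lt;/code&gt; if "multicore" is not installed.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="code" title="my.lapply()"&gt;&#xD;
## Compact copy of is.in.namespace() so this snippet is self-contained&#xD;
is.in.namespace &amp;lt;- function (ns) {&#xD;
  for ( frame in seq_len(sys.nframe()) ) {&#xD;
    n &amp;lt;- environmentName(environment(sys.function(frame)))&#xD;
    if ( identical(n, ns) ) return(TRUE)&#xD;
  }&#xD;
  FALSE&#xD;
}&#xD;
&#xD;
## Hypothetical wrapper: parallel only when we are the outermost parallel call&#xD;
my.lapply &amp;lt;- function (X, FUN, ...) {&#xD;
  ## Already running inside multicore: stay serial to avoid nested forking&#xD;
  if ( is.in.namespace("multicore") ) return(lapply(X, FUN, ...))&#xD;
  ## Package not available: stay serial&#xD;
  if ( !suppressWarnings(require("multicore", quietly = TRUE)) ) return(lapply(X, FUN, ...))&#xD;
  mclapply(X, FUN, ...)&#xD;
}&#xD;
&#xD;
my.lapply(1:3, function (i) i * 2)&#xD;
&lt;/pre&gt;&#xD;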
&lt;p&gt;Jump to &lt;a href="http://www.cybaea.net/Blogs/Data/R-tips-Determine-if-function-is-called-from-specific-package.html#h2_entry_comments"&gt;comments&lt;/a&gt;.&lt;/p&gt;&#xD;
&lt;div class="seealso"&gt;&#xD;
&lt;h1&gt;You may also like these posts:&lt;/h1&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.46]" title="[0.46]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/A-warning-on-the-R-save-format.html" title="The save() function in the R platform for statistical computing is very convenient and I suspect many of us use it a lot. But I was recently bitten by a “feature” of the format which meant I could not recover my data. I recommend that you save data in a data format (e.g. CSV or CDF), not using the save() function which is really for objects (data and code). What is your approach?"&gt;A warning on the R save format&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;The save() function in the R platform for statistical computing is very convenient and I suspect many of us use it a lot. But I was recently bitten by a “feature” of the format which meant I could not recover my data. I recommend that you save data in a data format (e.g. CSV or CDF), not using the save() function which is really for objects (data and code). What is your approach?&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-10.png" width="85" height="16" alt="[0.37]" title="[0.37]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-Eliminating-observed-values-with-zero-variance.html" title="I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform . In other words, I want to find the columns in a data frame that has zero variance. And as fast as possible, because my data sets are large, many, and changing fast. The final result surprised me a little."&gt;R: Eliminating observed values with zero variance&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform . In other words, I want to find the columns in a data frame that has zero variance. And as fast as possible…&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/mqFcMzo8FLQ" height="1" width="1"/&gt;</content><published>2009-06-16T10:27:00Z</published><updated>2009-06-16T10:27:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-tips-Determine-if-function-is-called-from-specific-package.html</feedburner:origLink></entry><entry><title type="text">R tips: Installing Rmpi on Fedora Linux</title><id>urn:uuid:57259815-f049-5226-bda6-95b15ae0f4f2</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-tips-Installing-Rmpi-on-Fedora-Linux.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/z6AZUNX1s3Y/R-tips-Installing-Rmpi-on-Fedora-Linux.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>Somebody on the R-help mailing list asked how to get <a href="http://cran.r-project.org/web/packages/Rmpi/index.html">Rmpi</a> working on his Fedora Linux machine so he could do high-performance computing on a cluster of machines (or a single multicore machine) using the <a href="http://www.r-project.org/">R statistical computing and analysis platform</a>.  Since it is unusually painful to get working, I might as well copy the instructions here.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;Somebody on the R-help mailing list asked how to get &lt;a href="http://cran.r-project.org/web/packages/Rmpi/index.html"&gt;Rmpi&lt;/a&gt; working on his Fedora Linux machine so he could do high-performance computing on a cluster of machines (or a single multicore machine) using the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt;.  Since it is unusually painful to get working, I might as well copy the instructions here.&#xD;
&lt;/p&gt;&#xD;
&lt;h2&gt;1. Install Open MPI on Fedora Core&lt;/h2&gt;&#xD;
&lt;p&gt;&#xD;
First install the &lt;a href="http://www.open-mpi.org/"&gt;openmpi&lt;/a&gt; libraries using:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;yum install openmpi openmpi-devel openmpi-libs&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The default installation on Fedora still doesn’t &lt;i&gt;quite&lt;/i&gt; work, so you need to execute the following command as root (only once is required, after installation of the package):&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;ldconfig /usr/lib64/openmpi/lib/&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
You are not quite done: for R to work right with the libraries, you need to modify the &lt;code&gt;LD_LIBRARY_PATH&lt;/code&gt; environment variable to include the path to the Open MPI libraries.  I have the following in my &lt;code&gt;~/.bash_profile&lt;/code&gt;:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title=".bash_profile"&gt;export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}${LD_LIBRARY_PATH:+:}/usr/lib64/openmpi/lib/"&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Add the same line to your own file and execute it at the command prompt, and you are ready to continue.&#xD;
&lt;/p&gt;&#xD;
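&lt;p&gt;&#xD;
To confirm from inside R that the setting took effect, something like the following may help.  It is only a sketch: &lt;code&gt;has.lib.path&lt;/code&gt; is a hypothetical helper, not part of R or &lt;code&gt;Rmpi&lt;/code&gt;, and R must be started from a shell in which the variable is already set.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&#xD;
## Sketch only: has.lib.path() is a hypothetical helper.&#xD;
has.lib.path &amp;lt;- function (dir, value = Sys.getenv("LD_LIBRARY_PATH")) {&#xD;
  entries &amp;lt;- strsplit(value, ":", fixed = TRUE)[[1]]&#xD;
  ## Ignore trailing slashes so "/a/b/" and "/a/b" compare equal&#xD;
  sub("/+$", "", dir) %in% sub("/+$", "", entries)&#xD;
}&#xD;
&#xD;
has.lib.path("/usr/lib64/openmpi/lib/")  # TRUE if the Open MPI path is listed&#xD;
&lt;/pre&gt;&#xD;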
&#xD;
&lt;h2&gt;2. Install the &lt;code&gt;Rmpi&lt;/code&gt; package for &lt;code&gt;R&lt;/code&gt;&lt;/h2&gt;&#xD;
&lt;p&gt;&#xD;
Now that your Open MPI libraries are set up, what you do next depends on which version of &lt;code&gt;Rmpi&lt;/code&gt; you are installing.  Most likely you are installing the latest version, in which case the following section applies.  The instructions for older versions are retained in a later section for reference.&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;h3&gt;2.1. Current versions of the &lt;code&gt;Rmpi&lt;/code&gt; package&lt;/h3&gt;&#xD;
&lt;p&gt;&#xD;
Make sure you have executed the &lt;code&gt;ldconfig&lt;/code&gt; command and set the &lt;code&gt;LD_LIBRARY_PATH&lt;/code&gt; environment variable as described in the previous section before you continue.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Since at least version 0.5-8 of the &lt;code&gt;Rmpi&lt;/code&gt; library you can install it from the &lt;code&gt;R&lt;/code&gt; command line after you have fixed the Open MPI install.  At the &lt;code&gt;R&lt;/code&gt; prompt do:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;install.packages("Rmpi",&#xD;
                 configure.args =&#xD;
                 c("--with-Rmpi-include=/usr/include/openmpi-x86_64/",&#xD;
                   "--with-Rmpi-libpath=/usr/lib64/openmpi/lib/",&#xD;
                   "--with-Rmpi-type=OPENMPI"))&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
It should work and install OK.  This is obviously quite a mouthful to remember, but help is at hand through the &lt;code&gt;options()&lt;/code&gt; mechanism in R.  In your &lt;code&gt;~/.Rprofile&lt;/code&gt; you can add something like:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title=".Rprofile"&gt;local({&#xD;
    my.configure.args &amp;lt;-&#xD;
        list("Rmpi" =&#xD;
             c("--with-Rmpi-include=/usr/include/openmpi-x86_64/",&#xD;
               "--with-Rmpi-libpath=/usr/lib64/openmpi/lib/",&#xD;
               "--with-Rmpi-type=OPENMPI"),&#xD;
             ## Not needed for Rmpi but shown to illustrate the format&#xD;
             "ncdf" =&#xD;
             c("-with-netcdf_incdir=/usr/include/netcdf",&#xD;
               "-with-netcdf_libdir=/usr/lib64/")&#xD;
             );&#xD;
    options("configure.args" = my.configure.args)&#xD;
})&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;Then you can just type &lt;code&gt;install.packages("Rmpi")&lt;/code&gt; at the R command prompt to install the package.&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;h3&gt;2.2. Older versions of the &lt;code&gt;Rmpi&lt;/code&gt; package&lt;/h3&gt;&#xD;
&lt;p&gt;&#xD;
The problem is the configuration file &lt;code&gt;configure.ac&lt;/code&gt; which is, unfortunately, completely brain-damaged with hard-coded assumptions about which subdirectories should contain header and library files and no way of overriding it.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Download the latest &lt;a href="http://cran.r-project.org/web/packages/Rmpi/index.html"&gt;Rmpi&lt;/a&gt; package from CRAN and unpack it using &lt;code&gt;tar zxvf Rmpi_0.5-7.tar.gz&lt;/code&gt;.  Go to the new &lt;code&gt;Rmpi&lt;/code&gt; directory and replace the file &lt;code&gt;configure.ac&lt;/code&gt; with the one below (for an x86_64 system; for 32 bit you probably need to change &lt;code&gt;-64&lt;/code&gt; to &lt;code&gt;-32&lt;/code&gt;):&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="configure.ac"&gt;# Process this file with autoconf to produce a configure script.&#xD;
&#xD;
AC_INIT(DESCRIPTION)&#xD;
&#xD;
AC_PROG_CC&#xD;
&#xD;
MPI_LIBS=`pkg-config --libs openmpi-1.3.1-gcc-64`&#xD;
MPI_INCLUDE=`pkg-config --cflags openmpi-1.3.1-gcc-64`&#xD;
MPITYPE="OPENMPI"&#xD;
MPI_DEPS="-DMPI2"&#xD;
&#xD;
AC_CHECK_LIB(util, openpty, [ MPI_LIBS="$MPI_LIBS -lutil" ])&#xD;
AC_CHECK_LIB(pthread, main, [ MPI_LIBS="$MPI_LIBS -lpthread" ])&#xD;
&#xD;
PKG_LIBS="${MPI_LIBS} -fPIC"&#xD;
PKG_CPPFLAGS="${MPI_INCLUDE} ${MPI_DEPS} -D${MPITYPE} -fPIC"&#xD;
&#xD;
AC_SUBST(PKG_LIBS)&#xD;
AC_SUBST(PKG_CPPFLAGS)&#xD;
AC_SUBST(DEFS)&#xD;
&#xD;
AC_OUTPUT(src/Makevars) &#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The number 1.3.1 may change in future releases of Fedora: see &lt;code&gt;/usr/lib64/pkgconfig/openmpi-*.pc&lt;/code&gt; for the current value.&#xD;
&lt;/p&gt;&#xD;
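&lt;p&gt;&#xD;
The installed module names can also be listed from R.  This is a small sketch that assumes the 64-bit pkg-config directory named above; it prints an empty vector if no matching files exist.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&#xD;
## Sketch only: list the Open MPI pkg-config modules found on this machine.&#xD;
pc.files   &amp;lt;- Sys.glob("/usr/lib64/pkgconfig/openmpi-*.pc")&#xD;
pc.modules &amp;lt;- sub("\\.pc$", "", basename(pc.files))  # drop directory and extension&#xD;
print(pc.modules)  # use one of these names in the pkg-config lines of configure.ac&#xD;
&lt;/pre&gt;&#xD;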
&lt;p&gt;&#xD;
Still in the &lt;code&gt;Rmpi&lt;/code&gt; directory do the following in your shell:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;autoconf&#xD;
cd ..&#xD;
tar zcvf Rmpi_0.5-7-F11.tar.gz Rmpi&#xD;
R CMD INSTALL Rmpi_0.5-7-F11.tar.gz &#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;h2&gt;3. Test it&lt;/h2&gt;&#xD;
&#xD;
&lt;p&gt;Now &lt;code&gt;Rmpi&lt;/code&gt; should be working in R:&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; library("Rmpi")&#xD;
&amp;gt; mpi.spawn.Rslaves(nslaves=2)&#xD;
    2 slaves are spawned successfully. 0 failed.&#xD;
master (rank 0, comm 1) of size 3 is running on: server&#xD;
slave1 (rank 1, comm 1) of size 3 is running on: server&#xD;
slave2 (rank 2, comm 1) of size 3 is running on: server&#xD;
&amp;gt; x &amp;lt;- c(10,20)&#xD;
&amp;gt; mpi.apply(x,runif)&#xD;
[[1]]&#xD;
 [1] 0.25142616 0.93505554 0.03162852 0.71783194 0.35916139 0.85082154&#xD;
 [7] 0.35404191 0.14221315 0.60063773 0.71805190&#xD;
&#xD;
[[2]]&#xD;
 [1] 0.84157864 0.63481773 0.38217188 0.67839089 0.27827728 0.35429266&#xD;
 [7] 0.04898744 0.96601584 0.25687905 0.77381186 0.69011927 0.37391028&#xD;
[13] 0.19017369 0.51196594 0.51970563 0.15791524 0.21358237 0.69642478&#xD;
[19] 0.12690207 0.44177656&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/z6AZUNX1s3Y" height="1" width="1"/&gt;</content><published>2009-06-12T10:23:00Z</published><updated>2009-06-12T10:23:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-tips-Installing-Rmpi-on-Fedora-Linux.html</feedburner:origLink></entry><entry><title type="text">Data Mashups in R from O'Reilly</title><id>urn:uuid:edb63dc9-21f9-5664-8b35-afb01d7d6472</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Data-Mashups-in-R-from-O_Reilly.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/QV_4TAhfmFU/Data-Mashups-in-R-from-O_Reilly.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><div class="floatRight">
<a href="http://www.cybaea.net/Blogs/Data/Data-Mashups-in-R-from-O_Reilly.html" title="Click for full article"><img src="http://static.cybaea.net/images/fc_heat_small.png" width="150" height="150" alt="[Philadelphia County July 2009 Foreclosure Heat Map]" /></a>
</div>
<p>
O’Reilly has published <a href="http://oreilly.com/catalog/9780596804770/" title="Data Mashups in R ">Data Mashups in R</a> as a $4.99 PDF download in their Short Cut series.  In 27 pages it takes you through an example of how to combine foreclosure information with maps and geographical information to produce plots like the one here.  This is all done with the <a href="http://www.r-project.org/">R statistical computing and analysis platform</a>.
</p>
</div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
O’Reilly has published &lt;a href="http://oreilly.com/catalog/9780596804770/" title="Data Mashups in R "&gt;Data Mashups in R&lt;/a&gt; as a $4.99 PDF download in their Short Cut series.  In 27 pages it takes you through an example of how to combine foreclosure information with maps and geographical information to produce plots like the one below.  This is all done with the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt;.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
&lt;a href="http://static.cybaea.net/images/fc_heat.png" title="Larger version of Philadelphia County July 2009 Foreclosure Heat Map"&gt;&lt;img src="http://static.cybaea.net/images/fc_heat_medium.png" width="400" height="400" alt="[Philadelphia County July 2009 Foreclosure Heat Map]"&gt;&lt;/img&gt;&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
They show how to:&#xD;
&lt;/p&gt;&#xD;
&lt;ul&gt;&#xD;
&lt;li&gt;Use regular expressions to parse HTML files&lt;/li&gt;&#xD;
&lt;li&gt;Use the &lt;a href="http://cran.r-project.org/web/packages/XML/index.html"&gt;XML&lt;/a&gt; package to parse XML data from a web service (&lt;a href="http://developer.yahoo.com/maps/rest/V1/geocode.html"&gt;Yahoo! Geocode&lt;/a&gt;)&lt;/li&gt;&#xD;
&lt;li&gt;Find ESRI shapefiles for your maps&lt;/li&gt;&#xD;
&lt;li&gt;Use &lt;a href="http://cran.r-project.org/web/packages/PBSmapping/index.html"&gt;PBSmapping&lt;/a&gt; to process and display geographical data (GIS)&lt;/li&gt;&#xD;
&lt;li&gt;Import and use US Census data with your maps&lt;/li&gt;&#xD;
&lt;/ul&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/QV_4TAhfmFU" height="1" width="1"/&gt;</content><published>2009-06-09T11:23:00Z</published><updated>2009-06-09T11:23:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Data-Mashups-in-R-from-O_Reilly.html</feedburner:origLink></entry><entry><title type="text">How to win the KDD Cup Challenge with R and gbm</title><id>urn:uuid:3fb3545e-ea30-5e3b-8f8b-8902f107b81d</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/How-to-win-the-KDD-Cup-Challenge-with-R-and-gbm.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/iWBVzSGe3Aw/How-to-win-the-KDD-Cup-Challenge-with-R-and-gbm.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
Hugh Miller, the team leader of the winner of the KDD Cup 2009 Slow Challenge (which we wrote about <a href="http://www.cybaea.net/Blogs/Data/R-used-by-KDD-2009-cup-winner-of-slow-challenge.html">recently</a>) kindly provides more information about how to win this public challenge using the <a href="http://www.r-project.org/">R statistical computing and analysis platform</a> on a laptop (!).
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
Hugh Miller, the team leader of the winner of the KDD Cup 2009 Slow Challenge (which we wrote about &lt;a href="http://www.cybaea.net/Blogs/Data/R-used-by-KDD-2009-cup-winner-of-slow-challenge.html"&gt;recently&lt;/a&gt;) kindly provides more information about how to win this public challenge using the &lt;a href="http://www.r-project.org/"&gt;R statistical computing and analysis platform&lt;/a&gt; on a laptop (!).&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
As a reminder of &lt;a href="http://www.cybaea.net/Blogs/Journal/KDD-Cup-2009.html"&gt;what we wrote before&lt;/a&gt;, the challenge provided two anonymized data sets, each covering 50,000 mobile telecom customers, with each entry having 15,000 variables.  The task was to find the best churn, up-sell, and cross-sell models.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Hugh summarizes his team’s approach:&#xD;
&lt;/p&gt;&#xD;
&lt;blockquote&gt;&#xD;
&lt;p&gt;&#xD;
Feature selection was an important first step [we &lt;a href="http://www.cybaea.net/Blogs/Data/R-used-by-KDD-2009-cup-winner-of-slow-challenge.html"&gt;mentioned before&lt;/a&gt; that this is key for all successful data mining projects – AE]. We looked at how effective each individual variable was as a predictor, which also allowed us to read only parts of the data, &lt;em&gt;as the whole dataset didn’t fit in memory&lt;/em&gt; [my emphasis – AE]. The assessment here was homebrew, making a simple predictor on half the data and measuring performance (by the AUC measure) on the other half:&#xD;
&lt;/p&gt;&#xD;
&lt;ul&gt;&#xD;
&lt;li&gt;For categorical variables we just took the average number of 1's in the response for each category and used this as a predictor&lt;/li&gt;&#xD;
&lt;li&gt;For continuous variables we split the variable up into "bins", as you would a histogram, and again took the average number of 1's in the response for each bin as the predictor.&lt;/li&gt;&#xD;
&lt;/ul&gt;&#xD;
&lt;p&gt;&#xD;
From this we came up with a set of about 200 variables for each model, which we continued to tinker with. The main model was a gradient boosted machine which used the "&lt;a href="http://www.stats.bris.ac.uk/R/web/packages/gbm/index.html"&gt;gbm&lt;/a&gt;" package in &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;. This basically fits a series of small decision trees, up-weighting the observations that are predicted poorly at each iteration. We used Bernoulli loss and also up-weighted the "1" response class. A fair amount of time was spent optimising the number of trees, how big they should be etc, but a fit of 5,000 trees only took a bit over an hour to fit. The package itself is quite powerful as it gives some useful diagnostics such as relative variable importance, allowing us to exclude some and include others.&#xD;
&lt;/p&gt;&lt;p&gt;&#xD;
We used trees to avoid doing much data cleaning – they automatically allow for extreme results, non-linearity, missing values and handle both categorical and continuous variables. The main adjustment we had to make was to aggregate the smaller categories in the categorical variables, as they tended to distort the fits.&#xD;
&lt;/p&gt;&#xD;
&lt;/blockquote&gt;&#xD;
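&lt;p&gt;&#xD;
The single-variable screening that Hugh describes could be sketched along the following lines in base R.  This is only an illustration under our own assumptions, not the team’s actual code: the function names &lt;code&gt;auc&lt;/code&gt; and &lt;code&gt;screen_variable&lt;/code&gt;, the choice of ten quantile-based bins, and the simulated data are all ours.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;# Rank-based AUC (the Mann-Whitney statistic); ties get average ranks.&#xD;
auc &amp;lt;- function(score, y) {&#xD;
  r &amp;lt;- rank(score)&#xD;
  n1 &amp;lt;- sum(y == 1)&#xD;
  n0 &amp;lt;- sum(y == 0)&#xD;
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)&#xD;
}&#xD;
&#xD;
# Score one variable: estimate the mean response per category (or bin) on the&#xD;
# training half, then use that mean as the predictor on the other half.&#xD;
screen_variable &amp;lt;- function(x, y, train, bins = 10) {&#xD;
  if (is.numeric(x)) {&#xD;
    # Bin continuous variables, as for a histogram (assumption: quantile bins).&#xD;
    inner &amp;lt;- quantile(x[train], probs = seq_len(bins - 1) / bins, na.rm = TRUE)&#xD;
    x &amp;lt;- cut(x, breaks = c(-Inf, unique(inner), Inf))&#xD;
  }&#xD;
  x &amp;lt;- addNA(factor(x))                      # missing values become their own level&#xD;
  rate &amp;lt;- tapply(y[train], x[train], mean)   # average number of 1's per level&#xD;
  score &amp;lt;- as.vector(rate[as.integer(x[!train])])&#xD;
  score[is.na(score)] &amp;lt;- mean(y[train])      # levels not seen in the training half&#xD;
  auc(score, y[!train])&#xD;
}&#xD;
&#xD;
# Simulated example: one informative and one pure-noise variable.&#xD;
set.seed(1)&#xD;
n &amp;lt;- 2000&#xD;
good &amp;lt;- rnorm(n)&#xD;
noise &amp;lt;- rnorm(n)&#xD;
y &amp;lt;- rbinom(n, 1, plogis(2 * good))&#xD;
train &amp;lt;- rep(c(TRUE, FALSE), length.out = n)&#xD;
sapply(list(good = good, noise = noise), screen_variable, y = y, train = train)&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The informative variable should score well above 0.5 and the noise variable close to 0.5.  Because each variable is scored on its own, only one column needs to be in memory at a time, which is presumably what made the approach workable on a laptop.&#xD;
&lt;/p&gt;&#xD;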
&lt;p&gt;&#xD;
They did this on standard Windows laptops (Intel Core 2 Duo 2.66GHz processor, 2GB RAM, 120GB hard drive) against a competition that had more computing clusters available than Imelda Marcos had shoes.  It is not what you’ve got, it’s how you use it &lt;tt&gt;:-)&lt;/tt&gt;.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Congratulations to Hugh and his team!&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/iWBVzSGe3Aw" height="1" width="1"/&gt;</content><published>2009-06-01T07:07:00Z</published><updated>2009-06-01T07:07:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/How-to-win-the-KDD-Cup-Challenge-with-R-and-gbm.html</feedburner:origLink></entry><entry><title type="text">R used by KDD 2009 cup winner of slow challenge</title><id>urn:uuid:23be031b-ddb6-5244-ab24-77042c61951c</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-used-by-KDD-2009-cup-winner-of-slow-challenge.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/OqKxuXq79pQ/R-used-by-KDD-2009-cup-winner-of-slow-challenge.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
The results from the <a href="http://www.cybaea.net/Blogs/Journal/KDD-Cup-2009.html">KDD Cup 2009 challenge</a> (which we wrote about before) are in, and the winner of the slow challenge used the <a href="http://www.r-project.org">R statistical computing and analysis platform</a> for their winning submission.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
The results from the &lt;a href="http://www.cybaea.net/Blogs/Journal/KDD-Cup-2009.html"&gt;KDD Cup 2009 challenge&lt;/a&gt; (which we wrote about before) are in, and the winner of the slow challenge used the &lt;a href="http://www.r-project.org"&gt;R statistical computing and analysis platform&lt;/a&gt; for their winning submission.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
The &lt;a href="http://www.kddcup-orange.com/factsheet.php?id=21"&gt;write up&lt;/a&gt; (username/password may be required) from &lt;a href="http://www.ms.unimelb.edu.au/Personnel/profile.php?PC_id=590"&gt;Hugh Miller&lt;/a&gt; and team at the University of Melbourne includes these points:&#xD;
&lt;/p&gt;&#xD;
&lt;ul&gt;&#xD;
&lt;li&gt;Decision tree, stub, or Random Forest as base classifiers with Logistic loss or cross-entropy loss function&lt;/li&gt;&#xD;
&lt;li&gt;Models fit in an hour or so&lt;/li&gt;&#xD;
&lt;li&gt;Used the &lt;a href="http://www.r-project.org"&gt;R statistical package&lt;/a&gt;&lt;/li&gt;&#xD;
&lt;li&gt;Most of the models ran on a Windows laptop with an Intel Core 2 Duo 2.66GHz processor, 2GB RAM, and a 120GB hard drive.&lt;/li&gt;&#xD;
&lt;/ul&gt;&#xD;
&lt;p&gt;&#xD;
Impressive hardware selection!  Well done R.  Weka was another popular tool among the top entrants.  Key for all of them were clever data preparation and variable substitution.  The fast track winners from IBM document this in some detail:&#xD;
&lt;/p&gt;&#xD;
&lt;blockquote&gt;&#xD;
&lt;p&gt;&#xD;
We normalized the numerical variables by range, keeping the sparsity. For the categorical variables, we coded them using at most 11 binary columns for each variable. For each categorical variable, we generated a binary feature for each of the ten most common values, encoding whether the instance had this value or not. The eleventh column encoded whether the instance had a value that was not among the top ten most common values. We removed constant attributes, as well as duplicate attributes.&#xD;
&lt;/p&gt;&lt;p&gt;&#xD;
We replaced the missing values by mean for numerical attributes, and coded them as a separate value for discrete attributes. We also added a separate column for each numeric attribute with missing values, indicating whether the value was missing or not. We also tried another approach for imputing missing values based on KNN.&#xD;
&lt;/p&gt;&lt;p&gt;&#xD;
On the large data set we discretized the 100 numerical variables that had the highest mutual information with the target into 10 bins, and added them as extra features.&#xD;
&lt;/p&gt;&lt;p&gt;&#xD;
We tried PCA on the large data set, but it did not seem to help.&#xD;
&lt;/p&gt;&lt;p&gt;&#xD;
Because we noticed that some of the most predictive attributes were not linearly correlated with the targets, we built shallow decision trees (2-4 levels deep) using single numerical attributes and used their predictions as extra features. We also built shallow decision trees using two features at a time and used their prediction as an extra feature in the hope of capturing some non-additive interactions among features.&#xD;
&lt;/p&gt;&#xD;
&lt;/blockquote&gt;&#xD;
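&lt;p&gt;&#xD;
The categorical coding and the missing-value handling described in the quote could look something like this in base R.  This is a sketch under our own assumptions, not IBM’s code: the helper names &lt;code&gt;encode_top_levels&lt;/code&gt; and &lt;code&gt;impute_mean&lt;/code&gt;, the &lt;code&gt;"(missing)"&lt;/code&gt; label, and the toy data are hypothetical.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;# One 0/1 column for each of the `top` most common values, plus an "other" column.&#xD;
encode_top_levels &amp;lt;- function(x, top = 10) {&#xD;
  x &amp;lt;- as.character(x)&#xD;
  x[is.na(x)] &amp;lt;- "(missing)"   # missing values are coded as a separate value&#xD;
  common &amp;lt;- head(names(sort(table(x), decreasing = TRUE)), top)&#xD;
  out &amp;lt;- outer(x, common, "==") + 0L&#xD;
  colnames(out) &amp;lt;- common&#xD;
  cbind(out, other = as.integer(!(x %in% common)))&#xD;
}&#xD;
&#xD;
# Replace missing numeric values by the mean and keep a 0/1 missing indicator.&#xD;
impute_mean &amp;lt;- function(x) {&#xD;
  is_missing &amp;lt;- is.na(x)&#xD;
  x[is_missing] &amp;lt;- mean(x, na.rm = TRUE)&#xD;
  data.frame(value = x, was_missing = as.integer(is_missing))&#xD;
}&#xD;
&#xD;
colour &amp;lt;- c("red", "blue", "red", NA, "green", "red", "blue")&#xD;
encode_top_levels(colour, top = 2)&#xD;
impute_mean(c(1, NA, 3))&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
With &lt;code&gt;top = 10&lt;/code&gt; this gives at most 11 binary columns per categorical variable, as in the quote.&#xD;
&lt;/p&gt;&#xD;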
&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/OqKxuXq79pQ" height="1" width="1"/&gt;</content><published>2009-05-31T13:17:00Z</published><updated>2009-05-31T13:17:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-used-by-KDD-2009-cup-winner-of-slow-challenge.html</feedburner:origLink></entry><entry><title type="text">R tips: Use read.table instead of strsplit to split a text column into multiple columns</title><id>urn:uuid:60775fac-6d0b-5d55-9e76-eb21bdde97c1</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/R-tips-Use-read_table-instead-of-strsplit-to-split-a-text-column-into-multiple-columns.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/uowubtD_s_4/R-tips-Use-read_table-instead-of-strsplit-to-split-a-text-column-into-multiple-columns.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
Someone on the R-help mailing list had a data frame with a column containing IP addresses in quad-dot format (e.g. 1.10.100.200).  He wanted to sort by this column and I proposed a solution involving <code>strsplit</code>.  But <a href="http://staff.pubhealth.ku.dk/~pd/">Peter Dalgaard</a> comes up with a much nicer method using <code>read.table</code> on a <code>textConnection</code> object:
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
Someone on the R-help mailing list had a data frame with a column containing IP addresses in quad-dot format (e.g. 1.10.100.200).  He wanted to sort by this column and I proposed a solution involving &lt;code&gt;strsplit&lt;/code&gt;.  But &lt;a href="http://staff.pubhealth.ku.dk/~pd/"&gt;Peter Dalgaard&lt;/a&gt; comes up with a much nicer method using &lt;code&gt;read.table&lt;/code&gt; on a &lt;code&gt;textConnection&lt;/code&gt; object:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; a &amp;lt;- data.frame(cbind(color=c("yellow","red","blue","red"),&#xD;
                        status=c("no","yes","yes","no"),&#xD;
                        ip=c("162.131.58.26","2.131.58.16","2.2.58.10","162.131.58.17")))&#xD;
&amp;gt; con &amp;lt;- textConnection(as.character(a$ip))&#xD;
&amp;gt; o &amp;lt;- do.call(order,read.table(con, sep="."))&#xD;
&amp;gt; close(con)&#xD;
&amp;gt; a[o,]&#xD;
   color status            ip&#xD;
3   blue    yes     2.2.58.10&#xD;
2    red    yes   2.131.58.16&#xD;
4    red     no 162.131.58.17&#xD;
1 yellow     no 162.131.58.26&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
That is very, very neat!  Thank you Peter.&#xD;
&lt;/p&gt;&#xD;
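&lt;p&gt;&#xD;
In more recent versions of R, &lt;code&gt;read.table&lt;/code&gt; also accepts a &lt;code&gt;text&lt;/code&gt; argument, which should make the explicit connection unnecessary.  A minimal sketch of the same sort, assuming such a version of R (the variable name &lt;code&gt;octets&lt;/code&gt; is ours):&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;a &amp;lt;- data.frame(color  = c("yellow", "red", "blue", "red"),&#xD;
                status = c("no", "yes", "yes", "no"),&#xD;
                ip     = c("162.131.58.26", "2.131.58.16", "2.2.58.10", "162.131.58.17"))&#xD;
# Split each address on "." into four integer columns (V1 to V4).&#xD;
octets &amp;lt;- read.table(text = as.character(a$ip), sep = ".")&#xD;
# Order by the first octet, then the second, and so on.&#xD;
a[do.call(order, octets), ]&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
The idea is unchanged: let &lt;code&gt;read.table&lt;/code&gt; do the splitting and the type conversion, so that the addresses are sorted numerically rather than as strings.&#xD;
&lt;/p&gt;&#xD;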
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/uowubtD_s_4" height="1" width="1"/&gt;</content><published>2009-05-29T10:53:00Z</published><updated>2009-05-29T10:53:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/R-tips-Use-read_table-instead-of-strsplit-to-split-a-text-column-into-multiple-columns.html</feedburner:origLink></entry><entry><title type="text">Data.gov</title><id>urn:uuid:a914e8e4-59f0-5054-a5ce-d0aa76d47247</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/Data_gov.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/CaDId-zLq1A/Data_gov.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
I am always on the lookout for useful data sources for training in statistics, so I am excited that <a href="http://www.data.gov/">Data.gov</a> has opened for business.  The purpose of Data.gov is to increase public access to high value, machine readable datasets generated by the US Government. 
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
I am always on the lookout for useful data sources for training in statistics, so I am excited that &lt;a href="http://www.data.gov/"&gt;Data.gov&lt;/a&gt; has opened for business.  The purpose of Data.gov is to increase public access to high value, machine readable datasets generated by the US Government. &#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
This is a great initiative which I look forward to exploring when I am not in a tiny airport at 3 am (but hey: they have free wifi) and which I hope other countries will take up.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
Are there other catalogues of data sets that you use?&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;Jump to &lt;a href="http://www.cybaea.net/Blogs/Data/Data_gov.html#h2_entry_comments"&gt;comments&lt;/a&gt;.&lt;/p&gt;&#xD;
&lt;div class="seealso"&gt;&#xD;
&lt;h1&gt;You may also like these posts:&lt;/h1&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.43]" title="[0.43]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Beautiful-Data.html" title="OReillys recent publication Beautiful Data has a chapter by Jeff Jonas which is enough reason in itself for me to recommend it. The chapter, Data Finds Data , is also available as a PDF download."&gt;Beautiful Data&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;OReillys recent publication Beautiful Data has a chapter by Jeff Jonas which is enough reason in itself for me to recommend it. The chapter, Data Finds Data , is also available as a PDF download.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-00.png" width="85" height="16" alt="[0.33]" title="[0.33]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Journal/When-Big-Data-Matters.html" title="Big Data is a buzzword, but is it real: does it address real business issues or is it just an excuse to sell more computers, software, and consulting services? We argue that it is real and it does matter, but only in some well-defined circumstances: it is not a universal solution or requirement to every problem. We provide a framework for determining where the Big Data applications are within your work and where traditional approaches apply. Get this article as a PDF: When Big Data matters ."&gt;When Big Data Matters&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=CaDId-zLq1A:70r8fJ5coR4:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=CaDId-zLq1A:70r8fJ5coR4:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=CaDId-zLq1A:70r8fJ5coR4:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=CaDId-zLq1A:70r8fJ5coR4:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=CaDId-zLq1A:70r8fJ5coR4:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=CaDId-zLq1A:70r8fJ5coR4:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=CaDId-zLq1A:70r8fJ5coR4:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=CaDId-zLq1A:70r8fJ5coR4:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=CaDId-zLq1A:70r8fJ5coR4:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/CaDId-zLq1A" height="1" width="1"/&gt;</content><published>2009-05-22T02:23:00Z</published><updated>2009-05-22T02:23:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/Data_gov.html</feedburner:origLink></entry><entry><title type="text">SNA with R: Loading large networks using the igraph library</title><id>urn:uuid:8764d0b0-00b6-5d9b-9c45-5d3373bc97a8</id><link rel="alternate" type="application/xhtml+xml" href="http://www.cybaea.net/Blogs/Data/SNA-with-R-Loading-large-networks-using-the-igraph-library.html" /><link rel="alternate" type="text/html" href="http://feeds.cybaea.net/~r/CybaeaData/~3/UafsWYtoE_U/SNA-with-R-Loading-large-networks-using-the-igraph-library.html" /><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>
We are interested in Social Network Analysis using the statistical analysis and computing platform <a href="http://www.r-project.org/">R</a>.  The documentation for R is voluminous but typically not very good, so this entry is part of a series where we document what we learn as we explore the tool and the packages.
</p>
<p>
In <a href="http://www.cybaea.net/Blogs/Data/SNA-with-R-Loading-your-network-data.html">our previous post on SNA</a> we gave up on using the <code>statnet</code> package because it was not able to handle our data volumes.  In this entry we have better success with the <code>igraph</code> package.
</p></div></summary><content type="html">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;&#xD;
We are interested in Social Network Analysis using the statistical analysis and computing platform &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;.  The documentation for R is voluminous but typically not very good, so this entry is part of a series where we document what we learn as we explore the tool and the packages.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
In &lt;a href="http://www.cybaea.net/Blogs/Data/SNA-with-R-Loading-your-network-data.html"&gt;our previous post on SNA&lt;/a&gt; we gave up on using the &lt;code&gt;statnet&lt;/code&gt; package because it was not able to handle our data volumes.  In this entry we have better success with the &lt;code&gt;igraph&lt;/code&gt; package.&#xD;
&lt;/p&gt;&#xD;
&lt;p&gt;&#xD;
The task we are considering is still how to load the network data into the R package’s internal representation.  We will assume that the raw data for our analysis is in a transactional format that is typical at least in the Telecommunications and Finance industries.  In the former the terminology is Call Detail Record (CDR) and an extract may look a little like the following:&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="document" title="Sample Call Detail Records"&gt;&#xD;
&lt;b&gt;          src,         dest,     start,  duration,type,...&lt;/b&gt;&#xD;
+447000000005,+447000000006,1238510028,        52,call,...&#xD;
+447000000006,+447000000009,1238510627,       154,call,...&#xD;
+447000000009,+447000000007,1238511103,        48,call,...&#xD;
+447000000006,+447000000005,1238511145,        49,call,...&#xD;
+447000000006,+447000000005,1238511678,        12,call,...&#xD;
+447000000001,+447000000006,1238511735,       147,call,...&#xD;
+447000000007,+447000000009,1238511806,        26,call,...&#xD;
+447000000000,+447000000008,1238511825,        19,call,...&#xD;
+447000000009,+447000000008,1238511900,        28,call,...&#xD;
...&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Here a record indicates that the customer identified as &lt;var&gt;src&lt;/var&gt; called (&lt;var&gt;type&lt;/var&gt;=call) the customer &lt;var&gt;dest&lt;/var&gt; at the given time &lt;var&gt;start&lt;/var&gt; and the call lasted &lt;var&gt;duration&lt;/var&gt; seconds.  In general, there will be (many) more attributes describing the transaction which are represented by the &lt;var&gt;...&lt;/var&gt;.  In a Financial Services example, the records may be money transfers between accounts.&#xD;
&lt;/p&gt;&#xD;
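&lt;p&gt;&#xD;
The loading code in this entry reads every column as an integer, which suggests that the test files already carry integer customer identifiers rather than phone numbers.  If your extract looks like the sample above, one way to get there is to recode the numbers first.  The following is only a sketch in base R: the names &lt;var&gt;lines&lt;/var&gt;, &lt;var&gt;ids&lt;/var&gt;, &lt;var&gt;src.id&lt;/var&gt;, and &lt;var&gt;dest.id&lt;/var&gt; are ours for illustration, and the trailing &lt;var&gt;...&lt;/var&gt; attributes are left out.  Reading &lt;var&gt;src&lt;/var&gt; and &lt;var&gt;dest&lt;/var&gt; as character keeps the leading + and avoids turning the numbers into floating point values.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; lines &amp;lt;- c("src,dest,start,duration,type",&#xD;
+            "+447000000005,+447000000006,1238510028,52,call",&#xD;
+            "+447000000006,+447000000009,1238510627,154,call",&#xD;
+            "+447000000009,+447000000007,1238511103,48,call")&#xD;
&amp;gt; cdr &amp;lt;- read.csv(text=lines,&#xD;
+                 colClasses=c("character", "character", "integer", "integer", "character"))&#xD;
&amp;gt; ### One integer id per distinct phone number, numbered from 1&#xD;
&amp;gt; ids &amp;lt;- sort(unique(c(cdr$src, cdr$dest)))&#xD;
&amp;gt; cdr$src.id  &amp;lt;- match(cdr$src, ids)&#xD;
&amp;gt; cdr$dest.id &amp;lt;- match(cdr$dest, ids)&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Keeping &lt;var&gt;ids&lt;/var&gt; around lets us translate vertex numbers back to customers after the analysis.&#xD;
&lt;/p&gt;&#xD;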
&#xD;
&lt;h2&gt;Loading the data in the &lt;code&gt;igraph&lt;/code&gt; package&lt;/h2&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
We are able to load the previous test data with 51 million records easily:&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="screen"&gt;&amp;gt; library("igraph")&#xD;
&amp;gt; m &amp;lt;- matrix(scan(bzfile("cdr.51M.csv.bz2", open="r"), &#xD;
+                  what=integer(0), skip=1, sep=','), &#xD;
+             ncol=4, byrow=TRUE)&#xD;
Read 205266564 items&#xD;
&amp;gt; ### Vertices are numbered from zero in the igraph library&#xD;
&amp;gt; m[,1] &amp;lt;- m[,1]-1; m[,2] &amp;lt;- m[,2]-1&#xD;
&amp;gt; g &amp;lt;- graph.edgelist(m[,c(2,1)])&#xD;
&amp;gt; E(g)$start    &amp;lt;- as.POSIXct(m[,3], origin="1970-01-01", tz="UTC")&#xD;
&amp;gt; E(g)$duration &amp;lt;- m[,4]&#xD;
&amp;gt; ns &amp;lt;- neighborhood.size(g, 1)&#xD;
&lt;/pre&gt;&#xD;
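&#xD;
&lt;p&gt;&#xD;
The last line computes, for each vertex, the size of its neighbourhood of order 1.  As far as we understand the igraph documentation, this counts the vertex itself plus every vertex one step away, ignoring the direction of the calls by default.  A base R sketch for the nine sample records shown earlier illustrates what is being counted.  Customers are abbreviated to their last digit, and the names &lt;var&gt;src&lt;/var&gt;, &lt;var&gt;dest&lt;/var&gt;, &lt;var&gt;links&lt;/var&gt;, and &lt;var&gt;ns.small&lt;/var&gt; are ours for illustration; this is not how igraph does it internally.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; src  &amp;lt;- c(5L, 6L, 9L, 6L, 6L, 1L, 7L, 0L, 9L)&#xD;
&amp;gt; dest &amp;lt;- c(6L, 9L, 7L, 5L, 5L, 6L, 9L, 8L, 8L)&#xD;
&amp;gt; ### Each call links both parties; repeated pairs count only once&#xD;
&amp;gt; links &amp;lt;- unique(cbind(from=c(src, dest), to=c(dest, src)))&#xD;
&amp;gt; ### Distinct neighbours per customer, plus one for the customer itself&#xD;
&amp;gt; ns.small &amp;lt;- table(links[, "from"]) + 1L&#xD;
&amp;gt; ns.small&#xD;
&#xD;
0 1 5 6 7 8 9 &#xD;
2 2 2 4 2 3 4 &#xD;
&lt;/pre&gt;&#xD;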
&#xD;
&lt;p&gt;&#xD;
Time to up the ante!  We have a file of simulated call detail records containing over 700 million entries, and we suspect that the algorithm used to generate it produces too few nodes with a small number of connections.  Let’s check on the first ½ billion records (which seem to more or less fit in the available memory on this workstation):&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="screen"&gt;&amp;gt; library("igraph")&#xD;
### Note that R can only handle 2^31-1 elements in a vector (on any&#xD;
### platform, including 64-bit), so we need to read this file as a&#xD;
### list.&#xD;
&amp;gt; s &amp;lt;- scan("cdr.1e6x1e1.csv", what=rep(list(integer(0)),4), skip=1, sep=',', multi.line=FALSE)&#xD;
Read 700466826 records&#xD;
&amp;gt; m &amp;lt;- as.vector(rbind(s[[2]], s[[1]]))&#xD;
&amp;gt; print(length(m))&#xD;
[1] 1400933652&#xD;
&amp;gt; length(m) &amp;lt;- 1e9&#xD;
&amp;gt; g &amp;lt;- graph(m, directed=TRUE)&#xD;
&amp;gt; ns &amp;lt;- neighborhood.size(g, 1)&#xD;
&amp;gt; summary(ns)&#xD;
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. &#xD;
   1.00   35.00   40.00   42.92   47.00  101.00 &#xD;
&amp;gt; hist(ns, xlab="Neighborhood size", main="Distribution of neighborhood size", &#xD;
       sub="From cdr.1e6x1e1.1e9")&#xD;
&lt;/pre&gt;&#xD;
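&#xD;
&lt;p&gt;&#xD;
The arithmetic behind the comment in the listing above can be checked directly.  All four columns of the file in a single vector would exceed the 2^31-1 limit, which is why we read the file as a list, while the interleaved &lt;var&gt;src&lt;/var&gt; and &lt;var&gt;dest&lt;/var&gt; columns still fit.  The names &lt;var&gt;records&lt;/var&gt; and &lt;var&gt;limit&lt;/var&gt; are ours for illustration.  (The limit describes R at the time of writing; as far as we know, R 3.0.0 and later support longer vectors on 64-bit platforms.)&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; records &amp;lt;- 700466826&#xD;
&amp;gt; limit   &amp;lt;- 2^31 - 1&#xD;
&amp;gt; 4 * records &amp;gt; limit   # all four columns in one vector&#xD;
[1] TRUE&#xD;
&amp;gt; 2 * records &amp;gt; limit   # src and dest interleaved&#xD;
[1] FALSE&#xD;
&lt;/pre&gt;&#xD;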
&#xD;
&lt;div class="floatRight"&gt;&#xD;
&lt;a href="http://static.cybaea.net/images/neighborhood_hist.png"&gt;&lt;img src="http://static.cybaea.net/images/neighborhood_hist_small.png" width="400" height="400" title="Distribution of neighborhood size" alt="[Distribution of neighborhood size plot]"&gt;&lt;/img&gt;&lt;/a&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
As we suspected, the Monte Carlo algorithm does not generate enough customers with small calling circles.  Fortunately it is very easy to add these separately: the hard part is modelling the larger calling circles.  A mix of these two algorithms provides a reasonably good fit to actual customer behaviour.  (The cut-off at 100 is a parameter to our Monte Carlo simulation program, which was indeed set to 100 for this run.)&#xD;
&lt;/p&gt;&#xD;
&#xD;
&lt;h2&gt;Problems&lt;/h2&gt;&#xD;
&lt;p&gt;However, not everything works.  When we attempt to add the edge attributes in the obvious way, it fails:&lt;/p&gt;&#xD;
&#xD;
&lt;pre class="screen"&gt;&amp;gt; length(s[[3]]) &amp;lt;- 0.5e9&#xD;
&amp;gt; length(s[[4]]) &amp;lt;- 0.5e9&#xD;
&amp;gt; E(g)$start     &amp;lt;- s[[3]]&#xD;
Error: cannot allocate vector of size 3.7 Gb&#xD;
Execution halted&#xD;
&amp;gt; E(g)$duration  &amp;lt;- s[[4]]&#xD;
&lt;/pre&gt;&#xD;
&#xD;
&lt;p&gt;&#xD;
So we are just at the limit.  Probably 100 million records is OK in this environment.  However, &lt;a href="http://igraph.sourceforge.net/"&gt;the core igraph library&lt;/a&gt; is accessible from C, so better performance can probably be achieved that way.  Pointers are 8 bytes on this machine, so C code should not be subject to the silly limits that R imposes on us.&#xD;
&lt;/p&gt;&#xD;
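&#xD;
&lt;p&gt;&#xD;
The size in the error message is consistent with two readings, both of which are our guesses rather than something the message states: a copy of the half billion attribute values stored as 8-byte doubles, or a copy of the one billion 4-byte integers that make up the edge list.  The names &lt;var&gt;attr.Gb&lt;/var&gt; and &lt;var&gt;edge.Gb&lt;/var&gt; are ours for illustration.&#xD;
&lt;/p&gt;&#xD;
&lt;pre class="screen"&gt;&amp;gt; attr.Gb &amp;lt;- 0.5e9 * 8 / 2^30   # half a billion 8-byte doubles&#xD;
&amp;gt; edge.Gb &amp;lt;- 1e9 * 4 / 2^30     # one billion 4-byte integers&#xD;
&amp;gt; round(c(attr.Gb, edge.Gb), 1)&#xD;
[1] 3.7 3.7&#xD;
&lt;/pre&gt;&#xD;
&lt;p&gt;&#xD;
Either way, a single extra copy of this size is enough to exhaust the workstation once the graph itself is in memory.&#xD;
&lt;/p&gt;&#xD;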
&#xD;
&lt;p&gt;Jump to &lt;a href="http://www.cybaea.net/Blogs/Data/SNA-with-R-Loading-large-networks-using-the-igraph-library.html#h2_entry_comments"&gt;comments&lt;/a&gt;.&lt;/p&gt;&#xD;
&lt;div class="seealso"&gt;&#xD;
&lt;h1&gt;You may also like these posts:&lt;/h1&gt;&#xD;
&lt;ol&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-50.png" width="85" height="16" alt="[0.62]" title="[0.62]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/SNA-with-R-Loading-your-network-data.html" title="We are interested in Social Network Analysis using the statistical analysis and computing platform R . As usual with R, the documentation is pretty bad, so this series collects our notes as we learn more about the available packages and how they work. We use here the statnet group of packages, which seems to be the most comprehensive and most actively maintained network analysis packages. The first task which we consider in this post is to load our data into a network object, which is how all the statnet packages represent a network. Typically for R, the documentation is voluminous but not always as helpful as one could want."&gt;SNA with R: Loading your network data&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;We are interested in Social Network Analysis using the statistical analysis and computing platform R . As usual with R, the documentation is pretty bad, so this series collects our notes as we learn more about the available packages and how they work. We use here the statnet group of packages, which seems to be the most comprehensive and most actively maintained network analysis packages. The first task which we consider in this post is to load our data into a network object, which is how all the statnet packages represent a network. Typically for R, the documentation is voluminous but not always as helpful as one could want.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-40.png" width="85" height="16" alt="[0.44]" title="[0.44]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/TechNotes/Mason-utf-8-clean.html" title="This is a note for people who are using the Mason system for high-performance, dynamic web site authoring with Apache , mod_perl , and a relational database like PostgreSQL accessed through DBI, and who want to be utf-8 Unicode clean in all their data. You want to be able to write accented letters in any language in your web pages. You want your users to be able to enter any characters in web forms, and you want that data to get in and out of your relational database and still display correctly and be handled correctly by perl. That is, unfortunately, not how it works out of the box, at least not on Red Hat Enterprise Linux 5 or on Fedora 10. This article shows how we made it work right."&gt;4 easy steps to make Mason utf-8 Unicode clean with Apache, mod_perl, and DBI&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;This is a note for people who are using the Mason system for high-performance, dynamic web site authoring with Apache , mod_perl , and a relational database like PostgreSQL accessed through DBI, and who want to be utf-8 Unicode clean in all their data. You want to be able to write accented letters in any language in your web pages. You want your users to be able to enter any characters in web forms, and you want that data to get in and out of your relational database and still display correctly and be handled correctly by perl. That is, unfortunately, not how it works out of the box, at least not on Red Hat Enterprise Linux 5 or on Fedora 10. This article shows how we made it work right.&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-30.png" width="85" height="16" alt="[0.40]" title="[0.40]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Employee-productivity-as-function-of-number-of-workers-revisited.html" title="We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis for the FTSE-100 constituent companies and find that the relation still holds four years later and across a continent."&gt;Employee productivity as function of number of workers revisited&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;We have a mild obsession with employee productivity and how that declines as companies get bigger. We have previously found that when you treble the number of workers, you halve their individual productivity which is mildly scary. We revisit the analysis …&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-30.png" width="85" height="16" alt="[0.39]" title="[0.39]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/Area-Plots-with-Intensity-Coloring.html" title="I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line which R does not support natively so we have to roll our own code for that. Unfortunately, lines(..., type=l) does not recycle the colour col= argument, so we end up with rather more loops than I thought would be necessary. We also get a nice opportunity to use the under-appreciated read.fwf function."&gt;Area Plots with Intensity Coloring&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;I am not sure apeescape’s ggplot2 area plot with intensity colouring is really the best way of presenting the information, but it had me intrigued enough to replicate it using base R graphics. The key technique is to draw a gradient line which R does not …&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;li&gt;&#xD;
&lt;p&gt;&#xD;
&lt;img src="http://static.cybaea.net/logo2011/cybaea-rate-20.png" width="85" height="16" alt="[0.38]" title="[0.38]"&gt;&lt;/img&gt;&#xD;
&lt;a href="http://www.cybaea.net/Blogs/Data/R-code-for-Chapter-2-of-Non_Life-Insurance-Pricing-with-GLM.html" title="We continue working our way through the examples, case studies, and exercises of what is affectionately known here as “the two bears book” (Swedish björn = bear) and more formally as Non-Life Insurance Pricing with Generalized Linear Models by Esbjörn Ohlsson and Börn Johansson (Amazon UK | US ). At this stage, our purpose is to reproduce the analysis from the book using the R statistical computing and analysis platform, and to answer the data analysis elements of the exercises and case studies. Any critique of the approach and of pricing and modeling in the Insurance industry in general will wait for a later article."&gt;R code for Chapter 2 of Non-Life Insurance Pricing with GLM&lt;/a&gt;&#xD;
&lt;/p&gt;&#xD;
&lt;p class="summary"&gt;We continue working our way through the examples, case studies, and exercises of what is affectionately known here as “the two bears book” (Swedish björn = bear) and more formally as Non-Life Insurance Pricing with Generalized Linear Models by Esbjörn Ohl…&lt;/p&gt;&#xD;
&lt;/li&gt;&#xD;
&lt;/ol&gt;&#xD;
&lt;/div&gt;&#xD;
&#xD;
&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=UafsWYtoE_U:VuX19eZpOZo:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=UafsWYtoE_U:VuX19eZpOZo:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=UafsWYtoE_U:VuX19eZpOZo:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=UafsWYtoE_U:VuX19eZpOZo:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=UafsWYtoE_U:VuX19eZpOZo:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=UafsWYtoE_U:VuX19eZpOZo:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=UafsWYtoE_U:VuX19eZpOZo:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?i=UafsWYtoE_U:VuX19eZpOZo:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.cybaea.net/~ff/CybaeaData?a=UafsWYtoE_U:VuX19eZpOZo:TzevzKxY174"&gt;&lt;img src="http://feeds.feedburner.com/~ff/CybaeaData?d=TzevzKxY174" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CybaeaData/~4/UafsWYtoE_U" height="1" width="1"/&gt;</content><published>2009-05-06T15:33:00Z</published><updated>2009-05-06T15:33:00Z</updated><author><name>Allan Engelhardt</name><uri>http://www.cybaea.net/</uri></author><feedburner:origLink>http://www.cybaea.net/Blogs/Data/SNA-with-R-Loading-large-networks-using-the-igraph-library.html</feedburner:origLink></entry></feed>

