Here are detailed comments on the book. Elsewhere there is a review of the book.
How to read R For Dummies
In order to learn R you need to do something with it. After you have read a little of the book, find something to do. Mix reading and doing your project.
You cannot win if you do not play.
Two complementary documents
They are also complimentary.
Some hints for the R beginner
“Some hints for the R beginner” is a set of pages that give you the basics of the R language. It is a completely different approach to the one R For Dummies takes — you may want to investigate it.
The R Inferno
If you are just at the beginning of learning R, you should ignore The R Inferno (except perhaps Circle 1).
When you start using R for real and run into problems, that is the time to pick it up and see if it helps.
There is one thing that I think is missing in R For Dummies. Actually it isn’t missing; it comes at the very end, while I think it should be at the start.
That piece is the search function, or more specifically the way that R operates that is highlighted by the results of the search function. The start of “Some hints for the R beginner” talks about search and how R finds objects.
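A minimal sketch of what search shows, for anyone who wants to try it right away. The exact entries depend on which packages you have loaded, so the printed list will differ from session to session.

```r
# Show the search list: the places R looks, in order, to find an object.
path <- search()
print(path)

# The global environment (your own objects) comes first,
# and the base package comes last.
path[1]
path[length(path)]

# find() reports where on the search list a name lives.
find("mean")
```

When you type a name, R walks down this list and uses the first match it finds.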
How to use these annotations
If you are new to R and first reading the book, then you should probably mostly ignore my comments. However, when you are confused by something in the book, you can look to see if there is a comment on that page that pertains to what you are confused about.
On further reading, these comments are more likely to be of use. Some are clarifications, some are extensions.
Page by page comments
These comments are based on the first printing.
There is more history in the Inferno-ish R presentation.
I’m not a lawyer, but I think the phrasing about redistribution is not right. I think it should say “change and redistribute” rather than “change or redistribute”.
If what you do never leaves your entity, then you can do absolutely whatever you want. That is the free as in speech part. Legalities only come into play if what you do is made available to others. It is a common misunderstanding that you are restricted in what you do within your own world.
The book highlights that R runs on many operating systems. It fails to make clear that the objects that it creates on the operating systems are all the same. You can start a project on a Linux machine at work, continue it while you commute with your Mac laptop, and then finish it on your Windows machine at home. No problem.
The book should tell you not to be afraid of new words. New words like “vector”. You don’t need to make friends with them right away, but don’t be scared off.
(technical) Unhappily the word “vector” in R has several meanings — so it is unfortunate that it is the first new word. The meaning used throughout the book is the most common meaning. See The R Inferno (Circle 5.1) for the gory details.
Pretty much everywhere in the book where it says “statistics” I would prefer “data analysis” instead. Statistics in many people’s mind is formal and academic, not like what they do. More people can feel comfortable doing data analysis than statistics.
In addition to the fear factor, there really is a (slight) difference between data analysis and statistics. I think data analysis is more important even though I’m trained as a statistician.
fields of study
There are additional fields of study where R is used that are not considered to be data hotbeds, such as music and literature. The flexibility of R becomes very important for data in non-traditional forms.
If you are new to R, you shouldn’t expect yourself to understand this discussion. Just let it sink in over time.
Always put spaces around the assignment operator. That makes the code much more readable.
The book tells you on page 63 that you can use = as well. You will see both used. They are mostly the same (differences are explained in The R Inferno, Circle 8.2.26). I agree with the book’s approach to use <- but really you can use either.
A nice feature of the RStudio workspace view is that it categorizes the objects.
Windows pathnames (technical)
The book implies that you cannot write Windows pathnames with backslashes. Actually you can; you just need to put a double backslash where you want a backslash. Hence it is easier and (often) less confusing to use slashes rather than backslashes.
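Here is a small sketch of the two spellings. The path itself is hypothetical; nothing is read from disk.

```r
# A hypothetical Windows path written two ways.
withBackslashes <- "C:\\temp\\data.csv"   # each \\ is one real backslash
withSlashes     <- "C:/temp/data.csv"

# cat() shows the string as R actually stores it: single backslashes.
cat(withBackslashes, "\n")

# Both strings have the same number of characters.
nchar(withBackslashes)
nchar(withSlashes)

# Converting backslashes to slashes gives the second form.
converted <- chartr("\\", "/", withBackslashes)
identical(converted, withSlashes)
```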
loading objects (technical)
It is possible to use attach instead of load. If you load an object, then it is put into your global environment. If you attach an object, it is put separately on the search list. If you modify an object that has been attached, then the modified version goes into your global environment.
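The behavior described above can be seen in a short sketch. The object name hypo, the temporary file, and the search-list label "hypoData" are all made up for illustration, and the sketch assumes it is run at the top level of an R session.

```r
# Save a hypothetical object to a temporary file, then remove it.
tf <- tempfile(fileext = ".RData")
hypo <- 1:3
save(hypo, file = tf)
rm(hypo)

# load(tf) would put 'hypo' into the global environment.
# attach() puts the file's contents on the search list instead.
attach(tf, name = "hypoData")
"hypoData" %in% search()

# Modifying the object creates a new copy in your workspace;
# the attached copy is unchanged.
hypo <- hypo * 2
attachedCopy <- get("hypo", envir = as.environment("hypoData"))
hypo
attachedCopy

# Tidy up the search list.
detach("hypoData")
```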
There are different forms of vectorization, and the book doesn’t make that explicit. Vectorization can be put into three categories:
- summary-type vectorization
- vectorization along vectors
- vectorization across arguments
Functions like sum and mean are vectorized in the sense that they take a vector and summarize it. This is done in pretty much all languages; it is not special.
Vectorization as it is commonly spoken of in R is vectorization along vectors. For example the addition operator as seen on page 24. This is the form of vectorization that is so useful and powerful in R.
You should not expect the third form of vectorization in R. However, it does exist in a few functions. The sum and mean functions do summary-type vectorization:
> sum(1:3)
[1] 6
> mean(1:3)
[1] 2
The sum function also does vectorization across arguments:
> sum(1, 2, 3)
[1] 6
That is basically anomalous. The mean function is more typical by not doing this form of vectorization:
> mean(1, 2, 3) # WRONG
[1] 1
Unfortunately you don’t get an error or a warning in this case. Do not expect this form of vectorization.
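The silent wrong answer comes about because only the first argument of mean is treated as data; in the default method the next two positions are the trim and na.rm arguments. A minimal sketch of the wrong and the right way:

```r
# Only the first argument of mean() is data; 2 and 3 are matched to
# the 'trim' and 'na.rm' arguments of the default method.
wrong <- mean(1, 2, 3)

# Combine the values into one vector first.
right <- mean(c(1, 2, 3))

wrong
right
```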
Getting error messages can be frightening for a while. But it’s not the end of the world. Relax.
In fact it is possible to get any name that you want, but you probably don’t want to.
return is not a reserved word, but you should treat it as if it were.
> break <- 1
Error in break <- 1 : invalid (NULL) left side of assignment
> while <- 1
Error: unexpected assignment in "while <-"
> return <- 1 # do NOT do this
>
F and T
I wish to emphasize the advice in the book:
- never abbreviate TRUE and FALSE to T and F
- avoid using T and F as object names
The book suggests (with a slight revision on page 361) loading packages with the library function. Some of us prefer require instead of library for this use. The best use of library is without arguments, which gives you a list of available packages.
> library(fortunes) # load package
> require(fortunes) # same thing
> library() # get list of packages
> require() # don't do this
Loading required package:
Failed with error: ‘invalid package name’
I think the authors might be being a little too polite in their description of the quality of contributed packages.
I find base R to be phenomenally clean code — it is hard to find commercial code that is less buggy. The quality of contributed packages varies widely. A few are up to the standards of base R, some are quite good, I’m sure there are a few dreadful ones.
With contributed packages you need to be more cautious than when only using base R functionality. Or perhaps I should say that you always need to be vigilant, but if you are using contributed packages, there is a larger chance that a problem is due to a package rather than your own fault.
Without inspecting the code, I know of two clues to suggest a package is of good quality:
- widely used
- good documentation
A widely used package — such as those highlighted in the book — is an indication that a lot of problems with the code have been fixed or didn’t exist in the first place.
Many people use the test of the cleanliness of restaurant restrooms to infer the cleanliness of the kitchen. Likewise, carefully written documentation is likely to be a sign of clean code.
It is not a good idea to use ** to mean exponentiation; it is not out of the question for that to go away. Stick to using the ^ operator.
log and exp
The sentence a little below mid-page about creating the vector inside
exp should say inside the
The last sentence on the page should say
10^310 rather than
You are unlikely to use any of these except for is.na, which you may use quite a lot.
types of vectors
All of the types of vectors listed may have missing values (NA).
integer versus double
One of the nice things about R is that you hardly ever need to worry about whether something is stored as an integer or a double.
largest integer (technical)
We can see how big the biggest integer is in a couple of different ways:
> format(2^31 - 1, big.mark=",")
[1] "2,147,483,647"
> .Machine$integer.max
[1] 2147483647
What is called “indexing” in the book is more commonly called “subscripting”.
missing value testing
It is a common mistake to try testing missing values with a command like:
> x == NA
That doesn’t work; you need to use is.na.
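A small sketch of the difference, using a made-up vector x:

```r
x <- c(1, NA, 3)   # hypothetical data with one missing value

# Comparing with NA yields NA everywhere: "is it equal to an unknown?"
viaEquals <- x == NA
viaEquals

# is.na() answers the question that was actually intended.
viaIsNa <- is.na(x)
viaIsNa
```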
any and all
The last sentence on the page is a false statement. The any and all functions are smart enough to know when they can know the answer and when they can’t:
> all(c(NA, FALSE))
[1] FALSE
> all(c(NA, TRUE))
[1] NA
> any(c(NA, FALSE))
[1] NA
> any(c(NA, TRUE))
[1] TRUE
assigning to character (technical)
It is more correct to think of the mode being character than the class being character.
Alternatively, you can use the value argument of grep:
> grep("New", state.name, value=TRUE)
[1] "New Hampshire" "New Jersey"    "New Mexico"
[4] "New York"
sub versus gsub
Here is an example that should make clear the difference between sub and gsub:
> gsub("e", "a", c("sheep", "cheap", "cheep"))
[1] "shaap" "chaap" "chaap"
> sub("e", "a", c("sheep", "cheap", "cheep"))
[1] "shaep" "chaap" "chaep"
factor attributes (technical)
The book says:
[factors are] neither character vectors nor numeric vectors, although they have some attributes of both.
This sentence is using “attribute” in the non-technical sense. But attributes in the technical sense do come into play: factors have “class” and “levels” attributes.
factor versus character
Notice how the factor is printed differently than the character vector.
American regions (off topic)
There is a brilliant analysis of North American regions called The Nine Nations of North America.
You might wonder what happens if you start on the thirty-first of the month rather than the first. If you wonder something, try it out to see what happens:
> myStart <- as.Date("2012-12-31")
> seq(myStart, by="1 month", length=6)
[1] "2012-12-31" "2013-01-31" "2013-03-03" "2013-03-31"
[5] "2013-05-01" "2013-05-31"
The result is rather literal-minded, and not to everyone’s taste. But perhaps we can do better:
> seq(myStart + 1, by="1 month", length=6) - 1
[1] "2012-12-31" "2013-01-31" "2013-02-28" "2013-03-31"
[5] "2013-04-30" "2013-05-31"
Wondering is great, experimenting is even greater.
one-dimensional arrays (technical)
Regular vectors are not dimensional at all in the technical sense, but we think of them as being one-dimensional. But there really are one-dimensional arrays. They are almost like plain vectors but not quite.
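A minimal sketch of the distinction, with made-up object names:

```r
plainVec <- 1:3                  # an ordinary vector
oneDim   <- array(1:3, dim = 3)  # a one-dimensional array

dim(plainVec)      # NULL: no dim attribute at all
dim(oneDim)        # a single dimension of length 3

is.array(plainVec)
is.array(oneDim)

# The values are the same; only the attributes differ.
all(plainVec == oneDim)
```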
playing with attributes
For large objects you often won’t like the response you get when you do:
Often better is to just look at what attributes the object has:
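The commands being contrasted are presumably along the lines of attributes(x) and names(attributes(x)); that is my assumption, and the object bigmat below is made up for illustration.

```r
# A hypothetical large object: a matrix with 1000 row names.
bigmat <- matrix(rnorm(2000), nrow = 1000,
                 dimnames = list(paste0("row", 1:1000), c("a", "b")))

# attributes(bigmat) would print all 1000 row names to the console.
# Asking only for the names of the attributes is much quieter.
attrNames <- names(attributes(bigmat))
attrNames
```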
extracting values from matrices
The flexibility of subscripting matrices (and data frames) as vectors is a curse as well as a blessing.
If you want to do:
and you do:
then you will get an entirely different result. This can be a hard mistake to find — a few pixels difference on your screen can have a big impact.
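One pair of commands that differ by only a few pixels is a comma versus a period inside the brackets. This is a guess at the kind of slip meant here, shown on a hypothetical matrix m.

```r
m <- matrix(1:12, nrow = 3)   # hypothetical matrix, filled by column

withComma  <- m[1, 2]   # row 1, column 2
withPeriod <- m[1.2]    # a single subscript, truncated to 1

withComma
withPeriod
```

Neither command produces an error or a warning, which is what makes the mistake hard to spot.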
The example on this page assumes that first.matrix is as it was first created, not as it has been modified in the intervening exercises.
So adding numbers by row is easy. How to add them by column? One way is:
> fmat <- matrix(1:12, ncol=4)
> fmat + rep((1:4)*10, each=nrow(fmat))
     [,1] [,2] [,3] [,4]
[1,]   11   24   37   50
[2,]   12   25   38   51
[3,]   13   26   39   52
This uses the rep function to create a vector with as many elements as the matrix has (assuming the vector being replicated has length equal to the number of columns), and the replicated values are in the desired positions.
inverting a matrix
The command to invert a matrix is not intuitive because it is seldom the case that (explicitly) inverting a matrix is a good idea.
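Usually what you really want is the solution of a linear system, and solve will give you that directly when handed a right-hand side. A minimal sketch with made-up numbers:

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # hypothetical coefficients
b <- c(1, 2)

# Explicit inversion followed by multiplication.
viaInverse <- solve(A) %*% b

# Solving the system directly: the inverse is never formed.
direct <- solve(A, b)

all.equal(as.vector(viaInverse), direct)
```

The two answers agree here, but the direct form does less work and tends to be numerically safer on badly conditioned matrices.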
vectors as arrays (technical)
Actually vectors, in general, are not arrays at all. The difference is of little consequence, however.
third array dimension (technical)
I call the items in the third dimension of an array “slices” rather than “tables”. I’m not aware of any standardized nomenclature. I don’t think “tables” is such a good choice because there are other meanings of “table” in R.
array filling (technical)
I’m not able to follow the sentence in the book describing how arrays are filled. How I think of it is that the first subscripts vary fastest (no matter how many dimensions are in the array).
rows and columns (technical)
Maybe my brain went on strike, but I think that “rows” and “columns” are reversed in the first paragraph on the page.
data frame structure
Note that all the vectors that make up the columns need to be the same length.
data frame structure (technical)
It is possible for a “column” of a data frame to be a matrix, in which case the number of rows needs to match.
data frame length
Note that the length of a data frame is different from the length of the equivalent matrix. The length of the data frame is the number of columns, while the length of the matrix is the number of columns times the number of rows.
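A quick check of that, using a small made-up data frame:

```r
df <- data.frame(a = 1:3, b = c(2.5, 3.5, 4.5))   # hypothetical data
mat <- as.matrix(df)

length(df)    # number of columns
length(mat)   # rows times columns
```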
character versus factor
The book suggests always making sure that data frames hold character vectors instead of factors in order to reduce problems. The other main route to avoid frustration is to always assume that there are factors.
The thing you don’t want to do is assume that what is really a factor is a character vector.
If in the middle of the page where it says “In the previous section” you don’t know what they are talking about, not to worry — you’re not alone.
as with matrices
I’m not clear on the reference to matrices at the very bottom of the page.
data frame subscripting
You can get a column of a data frame using either the $ or [ form of subscripting. But there is a difference:
> baskets.df$Granny
[1] 12  4  5  6  9  3
> baskets.df[,Granny]
Error in `[.data.frame`(baskets.df, , Granny) : object 'Granny' not found
> baskets.df[,"Granny"]
[1] 12  4  5  6  9  3
Note the quotes or lack thereof.
pieces of a list
I prefer calling the pieces of a list "components" rather than "elements". One reason is that a component of a list can be another list, and hence not very elementary.
The functions that you write are essentially the same as the inbuilt functions. They are first-class citizens.
You can very effectively use R without having a clue what "functional programming" means. The important idea behind functional programming is safety -- the data that you want to use is almost surely the data that really is being used.
The object names were obviously changed midstream.
fifty should be
hundred should be
generic functions (technical)
A detail that only occasionally really matters is that the argument names in methods should match the argument name in the generic. You don't want to have the argument called x in the generic but object in a method.
looping without loops
Using apply functions is really hiding loops rather than eliminating them.
number of apply functions
Not that it matters, but I count 8 apply functions in the base package in version 2.15.0. There is also a reasonably large number of apply functions in contributed packages.
error checking (technical)
Another way to write the check for out of bounds values is:
stopifnot(all(x >= 0 & x <= 1))
This will create an appropriate error message if there is a violation.
This will take multiple conditions separated by commas. So you can have checks like:
stopifnot(is.matrix(x), is.data.frame(y))
to make sure that x is a matrix and y is a data frame.
technical tip (technical)
The first sentence starts:
In fact, functions are generic ...
It should read:
In fact, some functions are generic ...
factor to numeric
The book gives the efficient method of converting a factor to numeric:
The slightly less efficient but easier to remember method is:
Don't forget the as.character -- it matters.
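The book's exact command is not reproduced here. The efficient form in the sketch below is the one recommended in R's own help page for factor, which I assume is the one meant; the factor f is made up.

```r
f <- factor(c("10", "20", "10", "30"))   # hypothetical factor

# Efficient form, as documented in ?factor: convert the few levels,
# then index them by the factor's integer codes.
efficient <- as.numeric(levels(f))[f]

# Easier to remember: go through character first.
viaChar <- as.numeric(as.character(f))

# Omitting as.character returns the internal codes, not the values.
codes <- as.numeric(f)

efficient
viaChar
codes
```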
problems with factors (technical)
Circle 8.2 of The R Inferno starts with a number of items about factors.
Unfortunately, I think the authors are painting too rosy a picture of the quality of R documentation. There probably is some great documentation for any task or issue that you have, but you may have a significant search on your hands to find that great document.
It takes practice to learn how to use help files well. It doesn't help that sections of the help files are in the wrong order (in my opinion). The "See also" and "Examples" should be near the top, "Details" should be at the bottom.
The examples often are the most important part. The book implies that all examples are reproducible. Not all are, but many are.
You don't need to understand the whole of a help file the first time around. The goal should be to improve your understanding of the function.
It is possible to subscribe via RSS to R tags.
With the cards I'm used to, the command to create cards should include
2:10 rather than
The book says that it is sometimes helpful to include the results of sessionInfo() in questions. I would change that from "sometimes" to "often".
reading in data
The start of Circle 8.3 in The R Inferno has a number of items about problems reading data in.
If you are using the RGui, there is a "change dir" item in the File menu.
three subset operators
The [[ operator always gets one component. The result is often not a list.
In contrast the [ operator can get any number of items and (except for dropping) gives you back the same type of object.
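A small sketch of the two operators on a made-up list:

```r
lst <- list(a = 1:3, b = "text", c = TRUE)   # hypothetical list

one  <- lst[["a"]]        # the component itself: an integer vector
sub1 <- lst["a"]          # still a list, of length one
sub2 <- lst[c("a", "c")]  # a list of length two

class(one)
class(sub1)
length(sub2)
```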
The book shows the removal of duplicates using both logical subscripts and negative numeric subscripts. Be careful with the latter of these:
> vec <- 1:5
> dups <- duplicated(vec)
> vec[!dups]
[1] 1 2 3 4 5
> vec[-which(dups)]
integer(0)
If you create a vector of negative subscripts, you need to make sure it has at least one element. Otherwise you get nothing when you want everything.
The book is in error when it says that the result of apply is always a vector. Other possible results include a matrix and a list.
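All three kinds of result can be produced from the same small matrix. The matrix and the functions applied are made up for illustration.

```r
m <- matrix(1:6, nrow = 2)   # rows are (1, 3, 5) and (2, 4, 6)

# One value per row: the result is a plain vector.
asVector <- apply(m, 1, sum)

# Two values per row: the result is a matrix.
asMatrix <- apply(m, 1, range)

# A different number of values per row: the result is a list.
asList <- apply(m, 1, function(r) r[r > 3])

asVector
asMatrix
asList
```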
sapply example (technical)
The example at the very top of the page that uses ifelse would be more in the spirit of R if it instead used:
if(is.numeric(x)) mean(x) else NA
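Here is that expression in a complete call. The data frame is made up and is not the one from the book.

```r
# Hypothetical data frame with one non-numeric column.
df <- data.frame(height = c(1.5, 1.8), name = c("a", "b"),
                 stringsAsFactors = FALSE)

# is.numeric(x) is a single TRUE or FALSE, so if/else is the natural
# tool; ifelse is meant for element-by-element choices on vectors.
colSummary <- sapply(df, function(x) if(is.numeric(x)) mean(x) else NA)
colSummary
```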
aggregate include the
by function (if you have a data frame) and the
Something seems to have gone wrong. That the phrase "doesn't make sense at all" appears in the paragraph seems apropos.
Often checking data with graphics is best. Do plots look as expected?
There is a mode function in R, but it does not have the same meaning as the mode in the discussion of location.
missing values (technical)
You might think that "pairwise" should be the default choice since it uses the most data. The problem with it is that the resulting correlation matrix is not guaranteed to be positive definite.
I wondered if prop.table recognized a table that had added margins. The answer is no, it thinks the margins are part of the data.
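You can try the experiment yourself. The sketch below uses a tiny made-up table and assumes the margins were added with addmargins.

```r
# Hypothetical two-way table with three observations.
tab <- table(c("a", "a", "b"), c("x", "y", "x"))
withMargins <- addmargins(tab)

plain  <- prop.table(tab)
fooled <- prop.table(withMargins)

# Proportions from the plain table sum to one.
sum(plain)

# With margins included, the grand total is treated as just another
# cell, so it gets a proportion of its own.
fooled["Sum", "Sum"]
```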
multiple plots (technical)
If you want to put the graphics device back into a single plot state without using the old.par trick, then say:
It doesn't matter which you say.
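The two interchangeable commands are presumably par(mfrow=c(1,1)) and par(mfcol=c(1,1)); that is my assumption. The sketch below opens a null pdf device only so that it can run without drawing anything.

```r
pdf(NULL)                      # null device: no file is written

par(mfrow = c(2, 2))           # four plots per page
par(mfrow = c(1, 1))           # back to a single plot
afterMfrow <- par("mfrow")

par(mfcol = c(2, 2))
par(mfcol = c(1, 1))           # same effect via mfcol
afterMfcol <- par("mfrow")

dev.off()
```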
If you are putting your graphics into a word processor, then often
If you are putting your graphics onto a webpage or into a presentation, then png can be a good choice.
To be clear: whiskers are at most 1.5 times the width of the box.
changing directory (technical)
To change the working directory and then change it back to the original, you would do something like:
> origwd <- getwd()
> setwd("blah/blah")
> # do stuff
> setwd(origwd)
CRAN mirrors (technical)
While all mirrors are conceptually the same as the primary CRAN site, it takes time for changes to propagate. This is unlikely to be an issue unless you are trying to get a brand new release.
As of 2012 October 14, CRAN has 4087 contributed packages.
I've used R pretty much every day for over a decade and never unloaded a package. I doubt this will be a big issue for you.
R-Forge also provides mailing lists. The immediate significance of this for you is that some of your favorite contributed packages might have a dedicated mailing list.
own repository (technical)
You can even set up your own repository and fill it with packages that you write.
Do you appreciate the meaning of:
knowledge <- apply(theory, 1, sum)
I saw a little teddy bear.
Well, I said to myself,
"I know what I want. I gotta get a bear some way."
from "You cannot win if you do not play" by Steve Forbert