How to Take Samples from Data in R - dummies

By Andrie de Vries, Joris Meys

Statisticians often have to take samples of data and then calculate statistics. Taking a sample is easy with R because a sample is really nothing more than a subset of data. To take one, you use sample(), which takes a vector as input; you then tell it how many elements to draw from that vector.

Say you want to simulate rolls of a die and get ten results. Because the outcome of a single roll of a die is a whole number from one to six, your code looks like this:

> sample(1:6, 10, replace=TRUE)
 [1] 2 2 5 3 5 3 5 6 3 5

You tell sample() to return ten values, each in the range 1:6. Because every roll of the die is independent of every other roll, you’re sampling with replacement. This means that after you draw one element from the vector, the vector is reset to its original state (in other words, you put the element you’ve just drawn back before drawing again).

To do this, you add the argument replace=TRUE, as in the example.
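The difference between sampling with and without replacement is easy to see in a short sketch. The variable names here (faces, shuffled, too_many, rolls) are made up for illustration and are not part of the original example:

```r
# The six faces of a die.
faces <- 1:6

# Without replacement (the default), drawing all six values
# returns each face exactly once, in random order.
shuffled <- sample(faces, 6)

# Asking for ten values without replacement fails, because
# only six values are available. tryCatch() captures the error
# message instead of stopping the script.
too_many <- tryCatch(
  sample(faces, 10),
  error = function(e) conditionMessage(e)
)

# With replacement, every draw sees the full vector again,
# so ten rolls are no problem.
rolls <- sample(faces, 10, replace = TRUE)
```

After running this, sort(shuffled) gives back 1:6, too_many holds the text of the error, and rolls contains ten values that may repeat.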

Because the values that sample() returns are randomly determined, you’ll get different results each time you call the function. This is the correct behavior in most cases, but sometimes you may want to get repeatable results every time you run the function.

Usually, this will occur only when you develop and test your code, or if you want to be certain that someone else can test your code and get the same values you did. In this case, it’s customary to specify a so-called seed value.

If you provide a seed value, the random-number sequence will be reset to a known state. This is because R doesn’t create truly random numbers, but only pseudo-random numbers. A pseudo-random sequence is a set of numbers that, for all practical purposes, seem to be random but were generated by an algorithm. When you set a starting seed for a pseudo-random process, R always returns the same pseudo-random sequence.

But if you don’t set the seed, R draws from the current state of the random number generator (RNG). On startup R may set a random seed to initialize the RNG, but each time you call a random function such as sample(), R continues from the next value in the RNG stream. You can read the Help page ?RNG to get more detail.

In R, you use the set.seed() function to specify your seed starting value. The argument to set.seed() can be any integer value.

> set.seed(1)
> sample(1:6, 10, replace=TRUE)
 [1] 2 3 4 6 2 6 6 4 4 1

If you draw another sample, without setting a seed, you get a different set of results, as you would expect:

> sample(1:6, 10, replace=TRUE)
 [1] 2 2 5 3 5 3 5 6 3 5

Now, to demonstrate that set.seed() actually does reset the RNG, try it again. But this time, set the seed once more:

> set.seed(1)
> sample(1:6, 10, replace=TRUE)
 [1] 2 3 4 6 2 6 6 4 4 1

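You get exactly the same results as the first time you used set.seed(1). The exact numbers depend on your version of R: R 3.6.0 changed the default algorithm that sample() uses, so a newer version prints a different sequence for the same seed. That sequence is just as repeatable.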
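Rather than comparing the printed numbers by eye, you can let R do the check. The following is a minimal sketch; the names first, second, and again are chosen here for illustration:

```r
# Set the seed and draw twice. The second draw continues
# from wherever the first one left the RNG stream.
set.seed(1)
first  <- sample(1:6, 10, replace = TRUE)
second <- sample(1:6, 10, replace = TRUE)

# Reset the seed and draw again.
set.seed(1)
again <- sample(1:6, 10, replace = TRUE)

# identical() returns TRUE only if both vectors match exactly.
same_after_reset <- identical(first, again)
```

Here same_after_reset is TRUE on any version of R, even though the numbers themselves differ between versions. The vector second comes from further along the RNG stream, so you shouldn’t expect it to match first.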

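You can use sample() to take samples from the data frame iris. To do so, you sample from the row numbers and then use the result to index the data frame. In this case, you may want to use the argument replace=FALSE, so that no row is selected more than once. Because this is the default value of the replace argument, you don’t need to write it explicitly: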

> set.seed(123)
> index <- sample(1:nrow(iris), 5)
> index
[1] 44 119 62 133 142
> iris[index, ]
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
44           5.0         3.5          1.6         0.6     setosa
119          7.7         2.6          6.9         2.3  virginica
62           5.9         3.0          4.2         1.5 versicolor
133          6.4         2.8          5.6         2.2  virginica
142          6.9         3.1          5.1         2.3  virginica
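The same index vector can also give you the rows that were not sampled, which is handy if you want to keep the rest of the data. This is a sketch only; the names picked and rest are invented for this example, and seq_len(nrow(iris)) is used as a slightly safer equivalent of 1:nrow(iris):

```r
set.seed(123)

# Draw five distinct row numbers (replace = FALSE is the default).
index <- sample(seq_len(nrow(iris)), 5)

# The sampled rows ...
picked <- iris[index, ]

# ... and everything else, using negative indexing to drop them.
rest <- iris[-index, ]

# A statistic calculated on the sample only.
mean_sepal_length <- mean(picked$Sepal.Length)
```

Because the rows are drawn without replacement, picked has 5 rows and rest has the remaining 145 rows of iris, with no row in both.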