How do you get a random sample, and when is it important?

If you're doing a survey or observational study, you really need to go the extra mile to get a genuinely random sample: that is, create a list of all the possible units that could be sampled, and then use a random number generator to select from this list.

The point is that there are various ways to put effort into your project. If you're doing a survey or an observational study with the goal of learning about a larger population (e.g., what percent of Columbia students believe X), then it's important to put some effort into actually getting a random sample. Doing so is also a good way to get hands-on experience with an important topic in probability and statistics.

For example:

Sampling freshmen: get a list of all the freshmen, number the list from 1 to N, then use Stata to generate a list of 100 (or however many) distinct random numbers between 1 and N, then interview the selected students.
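The random-number step can be sketched in a few lines. The text uses Stata for this; the version below is a Python stand-in, and the population size of 1,000, the function name, and the seed are all made up for illustration.

```python
import random

def draw_simple_random_sample(population_size, sample_size, seed=None):
    """Return sample_size distinct unit numbers from 1..population_size."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so no unit is picked twice.
    return sorted(rng.sample(range(1, population_size + 1), sample_size))

# Hypothetical values: 1,000 freshmen on the list, 100 to be interviewed.
selected = draw_simple_random_sample(population_size=1000, sample_size=100, seed=2024)
print(selected[:10])  # the first few list positions to look up and interview
```

Fixing the seed makes the draw reproducible, so the group can show afterward exactly how the sample was chosen.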

Sampling dorm rooms: find out how many rooms are in each dorm, add up these numbers to get a total N, and number all the rooms in all the dorms from 1 to N. Then use Stata to generate random numbers, and proceed as in the example above.
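Once a number between 1 and N has been drawn, it has to be translated back into a particular dorm and a room within that dorm. Here is one possible way to do that bookkeeping, again in Python rather than Stata. The dorm names and room counts are invented, and the "room" returned is a position on that dorm's room list, not a door number.

```python
from bisect import bisect_left
from itertools import accumulate

# Hypothetical room counts; replace with the real numbers for each dorm.
rooms_per_dorm = {"Dorm A": 120, "Dorm B": 85, "Dorm C": 200}

def locate_room(unit_number, rooms_per_dorm):
    """Map a number in 1..N to (dorm name, position on that dorm's room list)."""
    names = list(rooms_per_dorm)
    cumulative = list(accumulate(rooms_per_dorm.values()))  # running totals
    if not 1 <= unit_number <= cumulative[-1]:
        raise ValueError("unit_number must be between 1 and N")
    # First dorm whose running total reaches unit_number.
    i = bisect_left(cumulative, unit_number)
    rooms_before = cumulative[i - 1] if i > 0 else 0
    return names[i], unit_number - rooms_before

print(locate_room(121, rooms_per_dorm))  # first room on the Dorm B list
```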

Sampling physics classes: get a list of all the physics classes and make a list of all the possible days you might go and do the data collection. This gives a grid of classes and days. Number the entries on this grid from 1 to N, then use Stata to generate random numbers, and proceed as in the sampling freshmen example.
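The grid of classes and days can be built and numbered automatically. The sketch below (Python standing in for Stata) uses made-up class names, a made-up list of days, and an arbitrary sample size of 4.

```python
import random
from itertools import product

# Hypothetical lists; use the real course listing and your own available days.
classes = ["Physics 101", "Physics 102", "Physics 201"]
days = ["Mon", "Tue", "Wed", "Thu", "Fri"]

# Every (class, day) pair is one entry on the grid; entry k has index k - 1.
grid = list(product(classes, days))

rng = random.Random(7)
chosen_numbers = sorted(rng.sample(range(1, len(grid) + 1), 4))
visits = [grid[k - 1] for k in chosen_numbers]

for number, (class_name, day) in zip(chosen_numbers, visits):
    print(number, class_name, day)
```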

Observing people at street corners: make a list of the street corners that you might want to sample. Number these as 1 to N_1. Make a list of all the days that you might want to take measurements. Number these as 1 to N_2. Make a list of all the hours in the day that you might go and take the measurement. Number these as 1 to N_3. Now use Stata to pick 1 random number between 1 and N_1. This is your street corner. Pick 1 random number between 1 and N_2. This is your day. Pick 1 random number between 1 and N_3. This is your hour. Repeat the above process to pick a bunch of street corner, day, and hour combinations, then go out and collect the data.
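The three-draw procedure might look like the following in Python (used here in place of Stata). The counts of 12 corners, 10 days, and 8 hours are invented. One assumption goes beyond the description above: if the same corner, day, and hour come up twice, the repeat is ignored and another draw is made.

```python
import random

def draw_observation_slots(n_corners, n_days, n_hours, n_slots, seed=None):
    """Draw n_slots distinct (corner, day, hour) triples, one component at a time."""
    if n_slots > n_corners * n_days * n_hours:
        raise ValueError("cannot draw more distinct slots than exist")
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < n_slots:
        corner = rng.randint(1, n_corners)  # randint includes both endpoints
        day = rng.randint(1, n_days)
        hour = rng.randint(1, n_hours)
        chosen.add((corner, day, hour))  # assumption: a repeated triple is ignored
    return sorted(chosen)

# Hypothetical sizes: N_1 = 12 corners, N_2 = 10 days, N_3 = 8 hours.
slots = draw_observation_slots(n_corners=12, n_days=10, n_hours=8, n_slots=20, seed=1)
print(slots[:3])
```

Because the three numbers are drawn independently, every corner, day, and hour combination is equally likely, which is the same as sampling from the full N_1 x N_2 x N_3 grid.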

A bad example of random sampling:

In the absence of direct guidance from the instructor, students typically take convenience samples. For example, one group, misunderstanding the idea of random sampling, wrote:

"To ensure randomization, we handed out surveys at many different places, and at different times. Moreover, by choosing to sample a relatively large population, we were able to ensure that the average results of many individual results would produce a stable result (law of large numbers---reduce bias, increase randomization)."

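This approach is problematic because the group does not define a specific population or a strategy for reaching it. Handing out surveys at many places and times is still a convenience sample, and a large sample size reduces variability, not bias.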

Good examples of random sampling:

If the data collection involves sampling, you must perform random sampling from a defined population.

For a successful past example, see the table below, which is an example of simple random sampling of school library hours. When looking at this table, consider the following questions:

a. Why did the students sample 15-minute blocks instead of full hours?
b. Why is the sample unbalanced with respect to times of night and days of week? Will the imbalance cause problems?
c. Is it a problem that the names are not assigned to time slots randomly (for example, Therese is only assigned to Mondays; Yves is only assigned times between 2:30 and 5:00 am)?

Time     Sunday     Monday     Tuesday    Wednesday  Thursday
11:00pm  Sabrina    Therese    Sandra     85         113
11:15pm  Sabrina    30         58         86         114
11:30pm  3          Therese    Sandra     Sandra     115
11:45pm  4          32         60         88         116
12:00am  Sandra     33         61         89         117
12:15am  6          34         62         90         Sandra
12:30am  Sandra     Therese    63         91         119
12:45am  Sandra     36         64         92         Sandra
1:00am   9          37         65         93         121
1:15am   10         38         66         94         122
1:30am   11         Therese    67         95         123
1:45am   12         40         68         96         124
2:00am   13         41         69         97         125
2:15am   14         42         70         98         126
2:30am   15         Therese    71         99         127
3:00am   17         45         Yves       101        129
3:15am   Yves       46         74         Sabrina    130
3:30am   19         47         75         Sabrina    Yves
3:45am   20         Therese    76         104        132
4:00am   Yves       49         Yves       105        Yves
4:15am   22         50         78         106        Yves
4:30am   23         51         79         107        135
4:45am   24         Therese    80         108        136
5:00am   25         53         81         109        Sabrina
5:15am   26         54         82         110        Sabrina
5:30am   27         55         83         111        139

The above table is the sampling plan from a group of four students who were studying the use of the school library on school nights. The students divided the time into 140 15-minute slots and then took a simple random sample of 32 of these slots. Slots are numbered down each column, and a student's name marks a slot that was sampled and assigned to that student.
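A plan like this one can be generated mechanically. The sketch below is Python and not the students' actual procedure. It assumes 28 slots per night starting at 11:00pm, which is what 140 slots over five nights and the numbering in the table imply. The seed and helper name are arbitrary.

```python
import random

NIGHTS = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday"]
SLOTS_PER_NIGHT = 28          # assumption: 140 slots / 5 nights
FIRST_SLOT_MINUTE = 23 * 60   # 11:00pm, in minutes after midnight

def describe_slot(slot_number):
    """Map a slot number in 1..140 to (night, start time of the 15-minute block)."""
    night_index, block = divmod(slot_number - 1, SLOTS_PER_NIGHT)
    minute_of_day = (FIRST_SLOT_MINUTE + 15 * block) % (24 * 60)
    hour, minute = divmod(minute_of_day, 60)
    suffix = "am" if hour < 12 else "pm"
    hour12 = hour % 12 or 12  # show hour 0 as 12
    return NIGHTS[night_index], f"{hour12}:{minute:02d}{suffix}"

rng = random.Random(32)
sampled = sorted(rng.sample(range(1, 141), 32))  # simple random sample of 32 slots
for slot in sampled[:5]:
    print(slot, *describe_slot(slot))
```

The sampled slots would then be divided among the group members according to who can be at the library at each time.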


Here is another example of a good sampling design:

" Our population is defined as Columbia College sophomores and juniors. This population is listed in the facebooks … In selecting our sample, we will first divide the population into two strata: males and females. Next, each student will be assigned an integer value. These numbers will be assigned separately for each stratum … We will use two sets of random numbers to select 200 people from each stratum.

After the subjects are randomly selected, each of the four members of the group will survey 100 students at their dormitories. The locations for individual students will be obtained using the online Columbia directory. We will personally hand each subject the questionnaire … If anyone refuses to complete a survey or if they cannot be reached, we will replace that subject with another of the same sex, using the randomization procedure outlined above."
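One way to set up a stratified design like this, with replacements ready in advance, is to put each stratum's list in random order, take the first 200 as the sample, and keep the rest as an ordered reserve. This Python sketch is an illustration, not the group's actual procedure, and the stratum sizes are invented.

```python
import random

def stratified_plan(strata_sizes, per_stratum, seed=None):
    """For each stratum, return a random sample plus a randomly ordered reserve."""
    rng = random.Random(seed)
    plan = {}
    for name, size in strata_sizes.items():
        if per_stratum > size:
            raise ValueError(f"stratum {name!r} has fewer than {per_stratum} units")
        order = list(range(1, size + 1))  # units numbered separately per stratum
        rng.shuffle(order)
        plan[name] = {
            "sample": order[:per_stratum],   # first units in random order
            "reserve": order[per_stratum:],  # replacements, used in this order
        }
    return plan

# Hypothetical stratum sizes; the real ones come from the population list.
strata_sizes = {"male": 1000, "female": 1050}
plan = stratified_plan(strata_sizes, per_stratum=200, seed=5)

# If a sampled woman refuses or cannot be reached, take the next reserve unit.
replacement = plan["female"]["reserve"].pop(0)
print(replacement)
```

Taking replacements from a randomly ordered reserve keeps the choice of substitute out of the surveyor's hands.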

Experiences in data collection are often discouraging, but sometimes you can have unexpected success; for example,

"We had originally intended on having 300 samples, but there were a number of people who were unavailable when we went to their rooms and a very small number of people refused to take part in the survey. It was surprising how eager other people were to take part, as floormates would often brag about their head circumferences to each other after being measured."