On data collection protocol design

The project is primarily a data collection project. The data analysis is nevertheless critical to the purpose of the data collection, since a design starts with the questions one wants to answer using the data. The data collected should give an unbiased and representative picture of what is going on in the population.

As we discuss in the lectures (probably the first one), any set of data has two dimensions: individuals (who the data are collected on) and variables (what the data are collected about). Thus, the design should be developed along these two dimensions.

The theory and principles of design can be found in your textbook and lecture notes. In the following, several examples are discussed just to give you a taste of the kinds of things that should be considered in the design step of your project.

For projects involving surveys or observational studies, care should be taken to ensure a random sample. Please read "when do you have to insure a random sample? And how?" for more details.
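As a rough illustration of what "a random sample" can mean in practice, here is a minimal sketch of a simple random sample drawn from a sampling frame. The roster of 200 IDs, the sample size of 20 and the function name are all made up for this example; your own frame would be whatever complete list of the population you can actually get.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the frame; every subset of size n is equally likely."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(frame, n)

# Hypothetical sampling frame: made-up IDs standing in for a complete roster.
roster = [f"student_{i:03d}" for i in range(1, 201)]

sample = simple_random_sample(roster, n=20, seed=1)
print(sample)
```

The point of drawing from a list, rather than picking whoever happens to be nearby, is that every individual on the list has the same chance of being chosen.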

Example 1: an experiment on dropping spaghetti from different heights

Several groups in the past conducted their projects on this topic. It sounded easy at first, but there were a lot of things to consider during the experiment: for example, the angle between the pasta and the ground when it is dropped, the temperature, the possibility of wind, the time elapsed between opening the pasta box or bag and testing a strand of pasta, etc. Neglecting these possible hidden factors, any of which could affect the outcome of the experiment, could result in unwanted bias. Two things can be done here: control and randomization. Please refer to your textbook and your lecture notes on these two principles in experimental design.
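To give a feel for the randomization half of that advice, here is a minimal sketch that shuffles the order of the drops, so that a hidden factor that drifts over time (such as how long the box has been open) does not line up with any one height. The three heights and the four replicates per height are hypothetical values chosen only for illustration.

```python
import random

def randomized_run_order(levels, replicates, seed=None):
    """List every (level, replicate) run once, then shuffle the run order."""
    runs = [level for level in levels for _ in range(replicates)]
    rng = random.Random(seed)  # a fixed seed makes the order reproducible
    rng.shuffle(runs)
    return runs

# Hypothetical drop heights in meters; not taken from any past project.
heights_m = [0.5, 1.0, 1.5]

order = randomized_run_order(heights_m, replicates=4, seed=7)
for run_number, height in enumerate(order, start=1):
    print(run_number, height)
```

Control is the other half: factors you can hold fixed, such as the drop angle or the room, are kept the same for every run, and randomization takes care of what you cannot hold fixed.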

Example 2: survey on how much time the first-year students spend on the phone with their families.

The purpose of this survey was to find out whether first-year students spend more time calling home than second-year students, given that they have the same access to phones (including cell phones). So, naturally, the group's design included sampling a group of first-year students and a group of second-year students. To add a twist to this study, they also asked the second-year students to recall how much time they spent calling home in their first year.

The most important part of this design was to find a way to sample so that there would be no sampling bias associated with the variable being collected (in this case, time spent calling home). Could they have collected their sample by stopping people on College Walk or annoying people in the reading room of Butler Library?

Instead of giving you a definite answer, let me ask you two questions: (1) would students who spend more time in the library have different calling habits? (2) who would be most likely to be included in the sample when one samples at random in the library? What about the dining hall, then? The gym?

See? There are a lot of things to think about.
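The two questions about the library can be made concrete with a toy calculation. The five students below and all of their numbers are invented, and the pattern built into them (more library time, less calling) is an assumption made purely for illustration, not a finding. The sketch assumes that the chance of meeting a student in the library is proportional to the hours he or she spends there.

```python
# Invented toy population: (library hours per week, minutes per week calling home).
population = [
    (2, 90),
    (5, 60),
    (10, 45),
    (20, 30),
    (30, 15),
]

def population_mean(units):
    """Average calling time when every student counts equally."""
    return sum(minutes for _, minutes in units) / len(units)

def expected_answer_in_library(units):
    """Expected calling time of a student met in the library, assuming the
    chance of meeting a student is proportional to library hours."""
    total_hours = sum(hours for hours, _ in units)
    return sum(hours * minutes for hours, minutes in units) / total_hours

true_mean = population_mean(population)
library_mean = expected_answer_in_library(population)
print(true_mean, library_mean)
```

With these made-up numbers the library figure comes out lower than the population mean, simply because heavy library users are over-represented. If the two variables were unrelated, the two figures would be about the same, which is exactly why question (1) matters.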

Example 3: an observational study on "whether Columbia men are more likely to hold doors for the opposite sex."

This is a pretty straightforward topic. Since it is an observational study, the design is not about what the students do, but about how they do it. When, where, who, what and how, the same five questions needed for storytelling, also apply here.

      When: would you think people behave differently at various times of the day, say running to class or wandering around the bookstore?
      Where: would you think the people you observe at the gym and at the library belong to the same league?
      Who: are we talking about students, staff and faculty, any male on the campus of Columbia or just the students? Undergraduates only?
      What: is there any demographic information that might be helpful in explaining the variation in behavior?
      How: are we going to collect information on every male we observe, or use some randomization?
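For the "how" question, one possible form of randomization (my suggestion here, not something the original study is known to have used) is systematic sampling with a random start: pick a random starting point among the first k men who approach the door, then record every k-th one after that. The numbers below, 30 passersby and k = 5, are hypothetical.

```python
import random

def systematic_selection(n_passersby, k, seed=None):
    """Return the 1-based positions of passersby to record:
    a random start between 1 and k, then every k-th person after it."""
    rng = random.Random(seed)
    start = rng.randint(1, k)  # randint includes both endpoints
    return list(range(start, n_passersby + 1, k))

positions = systematic_selection(n_passersby=30, k=5, seed=3)
print(positions)
```

Deciding the positions before you start observing keeps the observer from choosing, consciously or not, the cases that look interesting.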

Now you see that developing a design for data collection involves a lot of discussion. For your group, the meeting to discuss these issues may be the most important meeting of your project. There is a saying in statistics that 'no method can save bad data'. The quality of your project depends on the quality of your topic, your data and your analysis, and your analysis relies heavily on the data you collect. Far too often, we have seen projects with an interesting topic fail because of a careless design. Spending time on the design will save you time on the analysis.