Sometimes one wants to know something about a group of people: how much money they make, how many books they have read, what percentage of them buy a certain product, or anything else one might be curious about. If the population is small enough and can be divided into groups whose names need no further clarification, it is not much of a problem to gather this information by asking each person (assuming he or she tells the truth) and then to make claims about these groups.
When the population is so large that gathering such information from each person would be extremely difficult, or too heterogeneous for clear-cut categorization, a practice called random sampling is used, alongside a rather relaxed definition of group names: a more manageable number of people are selected from the population at random, in the hope of still answering questions about the whole population. In such a case, one would say that the sample is representative of the population. Yet however careful the selection is, and however well it qualifies as random, can one really generalize from it to the whole population? What is the mathematical basis, if any, for the belief (and confidence) that such generalization is possible? Before we get to these questions, let us look a little closer at what random sampling is.
The kind of information one wants to know, as in the examples above, is called a variable. In the case of the number of books, a number is assigned to each person in the population considered. Say there are forty people; then one can safely make many claims about the percentage of them who have read a certain number of books. For a very large and disparate population that cannot be managed so easily, however, a much smaller group (a sample) is selected from the population in the hope of arriving at some answers. The term “random” is attached to “sample” to say that each person in the population has an equal chance of being selected. There are many ways of doing this sampling, among them simple random sampling, stratified random sampling, and multistage random sampling.
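To make the idea of an equal chance concrete, here is a minimal sketch in Python of simple random sampling. The forty people and their book counts are invented for illustration, and the helper name `simple_random_sample` is my own; only `random.sample` comes from Python's standard library.

```python
import random


def simple_random_sample(population, k, seed=None):
    """Pick k distinct members; every member is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(population, k)


# Hypothetical data: forty people, each assigned a number of books read.
data_rng = random.Random(0)
books_read = {f"person_{i}": data_rng.randint(0, 30) for i in range(40)}

# Draw ten of the forty names without replacement.
sample_names = simple_random_sample(sorted(books_read), k=10, seed=1)

sample_mean = sum(books_read[name] for name in sample_names) / len(sample_names)
population_mean = sum(books_read.values()) / len(books_read)

print("sampled:", sample_names)
print("sample mean:", sample_mean)
print("population mean:", population_mean)
```

With only forty people one could of course ask everyone; the point of the sketch is just to show what "each person has an equal chance" looks like when a computer, rather than a person in the street, does the picking.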
The business of generalization, whatever the motive behind it, can be a tricky one; it becomes murkier when the domain over which the generalization is made is not predictable. In our case, this domain consists of humans. Suppose one wants to know the number of books read by adults between 20 and 25 years old in a city of 2.3 million inhabitants. The ideal case would be to ask each person in this category how many books he or she has read. If one cannot identify these people beforehand by, say, having a list of all the 20-25 year-old residents of the city, then one would need to consult each resident and ask his or her age before asking about the books (I guess one can also use appearance to estimate someone’s age, but appearance is sometimes deceitful). If this could be done, one would have a rather careful answer to one’s inquiry (of course, there would be other factors to consider, such as the number of people moving into the city while one is contacting others, the number who might die or leave the city, or the number who decline to answer). Hence, an apparently simple task turns out to be rather difficult, if not impossible. Because of this difficulty, suppose one instead decides to question 1,000 residents about their age and the number of books they’ve read, picking people in the downtown streets on Saturdays between 2:00 PM and 6:00 PM during June. While these specifications might increase the likelihood of finding people in that age group, would the result necessarily be representative of all people in that age group?
Maybe the ones found in the downtown streets are the more outgoing ones, who might not spend much time reading books, or tourists from countries where people in this age group read more than the city’s residents do. On a Saturday, one might miss those living in the suburbs, who only go downtown for work on weekdays. Between 2:00 and 6:00 might be the time when some people visit museums or attend other indoor events, so they would not be in the streets. And June, depending on where the city is, might be too rainy to find many people in the streets. After all these considerations, to what extent would one say one’s sample is representative of all the people in the intended category?
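The worry can be played out in a small simulation. Everything in the sketch below is an assumption made up for illustration: the population of 10,000 people, the idea that outgoing people read fewer books, and the chances of being downtown on a Saturday. It only shows how a downtown sample could drift away from the population it is meant to represent if those assumptions held.

```python
import random

rng = random.Random(42)


def make_person():
    # Assumption: half the people are "outgoing" and tend to read fewer books.
    outgoing = rng.random() < 0.5
    books = rng.randint(0, 10) if outgoing else rng.randint(5, 25)
    # Assumption: outgoing people are far more likely to be downtown.
    downtown = rng.random() < (0.6 if outgoing else 0.1)
    return {"books": books, "downtown": downtown}


def mean_books(people):
    return sum(p["books"] for p in people) / len(people)


# Hypothetical population of 20-25 year-olds.
population = [make_person() for _ in range(10_000)]

# Sample 1: 1,000 people picked only among those found downtown.
downtown_crowd = [p for p in population if p["downtown"]]
convenience_sample = rng.sample(downtown_crowd, 1_000)

# Sample 2: 1,000 people picked with equal chance from everyone.
random_sample = rng.sample(population, 1_000)

population_mean = mean_books(population)
convenience_mean = mean_books(convenience_sample)
random_mean = mean_books(random_sample)

print("population mean: ", round(population_mean, 2))
print("downtown sample: ", round(convenience_mean, 2))
print("random sample:   ", round(random_mean, 2))
```

Under these invented numbers, the downtown sample underestimates the average because the people it can reach differ from the people it cannot. Whether real downtown crowds behave this way is exactly what one does not know in advance.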
Answers about the number of books read may not be that hard to find; no great amount of calculation seems to be needed to provide one. However, some questions have less definite answers, since so many factors are involved that answers could differ unpredictably from one sample to another. Opinion questions seem to fall into that category; a very simplistic one would be “Are you in favor or disfavor of …?”. The wording chosen to complete this question may by itself produce very different outcomes, even within the same sample. It would not be surprising, then, for someone to arrange to select people in a setting that easily yields the desired answers, knowing those people’s attitude toward the topic of the question. Again, how could such a sample be said to represent what the entire population thinks when such variability exists? Could one even go so far as to say that it is quite pretentious to make a few people speak for everybody?
An answer to those questions might be that it is randomization, grounded in mathematics, that makes the sample representative: since each person has the same “chance” of being picked, not picking someone is not the result of any premeditation. It seems less of a problem to use mathematics to calculate the probability of selecting an item from a collection and to conclude, in some cases, that each item has the same probability of being picked. When it comes to humans, however, too many factors seem to be involved to even assign such a probability. Although one might ignore these constraints and still assign some probability to each person, it remains unclear to me how mathematics would help to generalize any statement made about one sample to the whole population. While it may be possible to apply some mathematics to whatever collection of numerical data one obtains, I do not see, or am not aware of, how mathematics could grasp the distinctness of each sample to determine whether any generalization is possible at all.
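One way to probe this question numerically is to draw many random samples from the same population and see how far their averages stray from the true one. The sketch below does this under a strong assumption that the paragraph above doubts: that every person really can be picked with equal probability. The population of 100,000 book counts is invented, and the sample size of 1,000 and the 200 repetitions are arbitrary choices.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical population: a number of books read for each of 100,000 people.
population = [rng.randint(0, 30) for _ in range(100_000)]
population_mean = statistics.fmean(population)


def sample_mean(n):
    """Average of one equal-chance sample of n people."""
    return statistics.fmean(rng.sample(population, n))


# Repeat the sampling 200 times and record each sample's average.
means = [sample_mean(1_000) for _ in range(200)]

spread = statistics.pstdev(means)
largest_miss = max(abs(m - population_mean) for m in means)

print("population mean:      ", round(population_mean, 2))
print("spread of sample means:", round(spread, 2))
print("largest miss:          ", round(largest_miss, 2))
```

A sketch like this says something about numbers drawn by a computer; it does not settle whether people in a real city can be sampled in the same way, which is the part that remains unclear to me.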
So what do you think?