Organizing Data (A QUICK REVIEW GUIDE)
How do social scientists organize the raw numbers they collect into an easy-to-understand summary form? The first step is to construct a frequency distribution.
Frequency distributions of nominal data have two columns. The left column lists the categories of analysis (cry, play with another toy, etc.). The right column, headed FREQUENCY or a small f, gives the number of children in each category as well as the total number of children in the sample, which is indicated by the letter N.
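As a rough sketch of how such a two-column distribution can be built from raw observations, the Python snippet below tallies one response per child. The list of responses is hypothetical and invented only for illustration, as is the "other" category.

```python
from collections import Counter

# Hypothetical raw observations, one response per child (illustrative only).
responses = [
    "cry", "play with another toy", "cry", "other",
    "play with another toy", "play with another toy",
]

# f: the number of children in each category.
freq = Counter(responses)

# N: the total number of children in the sample.
N = sum(freq.values())

# Print the two columns: category on the left, frequency on the right.
for category, f in freq.most_common():
    print(f"{category:<25}{f:>3}")
print(f"{'N':<25}{N:>3}")
```

The frequencies always sum to N, which is a quick check that no case was dropped or counted twice.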
For more general use we need to be able to compare groups despite differences in size and total frequencies. The most popular methods for doing this are proportion and percentage.
The proportion compares the number of cases in a given category with the total size of the distribution. We can convert any frequency into a proportion by dividing the number of cases in any given category (f) by the total number of cases in the distribution.
P = f / N
For example, if 15 out of 50 girls found an alternative toy, this can be expressed as the following proportion:
P = 15 / 50 = .30
However, most people prefer to report their results as percentages. A percentage is the frequency of occurrence of a category per 100 cases. To calculate a percentage, we simply multiply any given proportion by 100.
% = (100)(f / N)
So the 15 out of 50 girls who responded by finding an alternate toy can be expressed as the proportion .30 or as the percentage (100)(15/50) = 30%.
30% of the girls located another toy to play with.
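The two formulas can be sketched in Python as follows. The function names are made up for this illustration; the numbers are the 15 out of 50 girls from the example.

```python
def proportion(f, n):
    """Return P = f / N, the share of all cases that fall in one category."""
    if n <= 0:
        raise ValueError("N must be a positive number of cases")
    return f / n


def percentage(f, n):
    """Return the frequency per 100 cases: (100)(f / N)."""
    if n <= 0:
        raise ValueError("N must be a positive number of cases")
    return 100 * f / n


print(proportion(15, 50))  # 0.3
print(percentage(15, 50))  # 30.0
```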
A less common method for comparing size is the ratio. Ratios compare the number of cases falling into one category with the number of cases falling into another category. A ratio is calculated in the following manner:
Ratio = f1 / f2
For example, if we had a sample that consisted of 150 males and 100 females, and we were interested in comparing the number of males to females in the sample, we would set f1 = 150 (males) and f2 = 100 (females).
The ratio = 150/100 = 3/2, which means there are 3 male respondents for every 2 female respondents in the sample.
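A small Python sketch of the ratio calculation, using the frequencies 150 and 100 from the example. The helper name is an assumption for illustration; the standard-library Fraction type is used only because it reduces the ratio to lowest terms automatically.

```python
from fractions import Fraction


def ratio(f1, f2):
    """Return f1 / f2 reduced to lowest terms."""
    if f2 == 0:
        raise ValueError("f2 must be nonzero")
    return Fraction(f1, f2)


r = ratio(150, 100)
print(f"{r.numerator} to {r.denominator}")  # 3 to 2
```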
Another type of ratio that is more widely used is known as the rate. Social scientists often use rates to analyze reproduction, death, divorce, crime, and unemployment in a population. Where most other ratios compare the number of cases in one category or subgroup with the number of cases in another subgroup, rates compare the number of actual cases with the number of potential cases. For example, to find the divorce rate in a given population we might compare the number of divorces with the number of marriages that occur in a given year. Rates are often expressed per 1,000 potential cases. So if 500 divorces occur in the same year as 4,000 marriages, the divorce rate would be 125 for every 1,000 marriages.
Divorce rate = (1000)(f divorces / f marriages) = (1000)(500 / 4000) = 125
There is nothing special about calculating rates per 1,000. The base you decide to use depends on what is most common or convenient for the analysis you are presenting. For example, murder rates are often calculated as the number of murders per 100,000 residents.
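The rate calculation might be sketched in Python like this, with the base left as a parameter since any convenient base can be used. The function and parameter names are illustrative assumptions.

```python
def rate(actual, potential, base=1000):
    """Return the number of actual cases per `base` potential cases."""
    if potential <= 0:
        raise ValueError("the number of potential cases must be positive")
    return base * actual / potential


# 500 divorces in the same year as 4,000 marriages, per 1,000 marriages.
print(rate(500, 4000))            # 125.0

# The same comparison on a base of 100 is simply a percentage.
print(rate(500, 4000, base=100))  # 12.5
```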
Another type of rate, the rate of change, is used to compare the same population at two different time periods. To compute the rate of change, we compare the actual change between time period 1 and time period 2 with the level at time period 1, which serves as our base.
The rate of change for a population that increases from 20,000 to 30,000 between 1980 and 1990 would be calculated as follows:
Rate of change = (100)(time 2f - time 1f) / time 1f = (100)(30,000 - 20,000) / 20,000 = 50%
The population increased 50% in the ten-year time period.
The rate of change can also be negative. The calculation for a decrease over time is the same. For instance, if a population goes from 15,000 to 12,000 over a period of time:
(100)(12,000 - 15,000) / 15,000 = -20%
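Both rate-of-change calculations can be reproduced with a short Python sketch. The function name is an assumption for illustration; the figures are the ones used in the two examples.

```python
def rate_of_change(time1_f, time2_f):
    """Return the percent change from time 1 to time 2, with time 1 as the base."""
    if time1_f == 0:
        raise ValueError("the time 1 level must be nonzero to serve as a base")
    return 100 * (time2_f - time1_f) / time1_f


print(rate_of_change(20_000, 30_000))  # 50.0  (an increase)
print(rate_of_change(15_000, 12_000))  # -20.0 (a decrease)
```

The sign of the result carries the direction of the change, so no separate formula is needed for decreases.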
Grouped Frequency Distributions of Interval Data
Interval-level scores are often spread over a wide range. To clarify the presentation of the data we construct grouped frequency distributions, which means that the scores are condensed into a smaller number of groups. Each category or group is called a class interval. Each class interval has an upper and a lower limit. At first glance, we might assume that a class interval containing the scores 60-64 has 60 and 64 as its class limits. However, that is not correct. Unlike the highest and lowest score values in an interval, the ‘class limits’ are located at the points halfway between adjacent class intervals. So the class limits for our interval would be 59.5 and 64.5. To find the size of a class interval we use the formula I = U - L: the interval size equals the upper limit minus the lower limit. For our scores this would be
64.5 - 59.5 = 5.
Another characteristic of a class interval is its midpoint, which is the middlemost score of the interval. For example, an interval of 5-9 would have a midpoint of 7.
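The class limits, interval size, and midpoint can be sketched in Python as follows. This assumes scores are recorded in whole numbers, so each limit sits half a unit beyond the lowest or highest score, and it takes the midpoint as the average of the lowest and highest score in the interval. The function names are illustrative.

```python
def class_limits(lowest, highest, unit=1):
    """Return (lower limit, upper limit), each half a unit outside the scores."""
    half = unit / 2
    return lowest - half, highest + half


def interval_size(lowest, highest, unit=1):
    """Return I = U - L, the distance between the upper and lower limits."""
    lower, upper = class_limits(lowest, highest, unit)
    return upper - lower


def midpoint(lowest, highest):
    """Return the middlemost score of the interval."""
    return (lowest + highest) / 2


print(class_limits(60, 64))   # (59.5, 64.5)
print(interval_size(60, 64))  # 5.0
print(midpoint(5, 9))         # 7.0
```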
When constructing your own class intervals, keep in mind that you should have at least 3 and at most 20 intervals. Too many or too few intervals may obscure the group pattern you are trying to reveal. It is generally best to construct intervals that deal with whole numbers. To make calculations easier, it is conventional to make the lowest score in a class interval some multiple of the interval size. That is why most intervals are in multiples of 5 or 10. For example, exam scores are usually categorized as 90-99, 80-89, etc.
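One way to group whole-number scores into class intervals of this kind is sketched below in Python. The exam scores are hypothetical, the function name is invented for illustration, and the sketch lists only intervals that contain at least one score.

```python
from collections import Counter


def grouped_frequencies(scores, size=10):
    """Group whole-number scores into intervals whose lowest score is a multiple of size.

    Returns (lowest score, highest score, f) tuples, highest interval first.
    Intervals that contain no scores are omitted.
    """
    counts = Counter((score // size) * size for score in scores)
    return [(low, low + size - 1, counts[low])
            for low in sorted(counts, reverse=True)]


# Hypothetical exam scores (illustrative only).
scores = [95, 88, 91, 72, 85, 67, 79, 83, 90, 74]

for low, high, f in grouped_frequencies(scores):
    print(f"{low}-{high}  {f}")
```

With a size of 10, these ten scores fall into four intervals, which sits within the suggested range of 3 to 20.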
Sometimes you will need to present data in a cumulative distribution. This is generally desirable when presenting a large number of scores. For example, SAT scores are often presented in a cumulative distribution. This makes it easier to find how many people in the overall group scored at or below a given level.
Typically, the first column shows the class intervals. The second column is the frequency, or number of persons in the sample who obtained scores within the interval. The third column is the percent of the total sample that obtained scores within the interval, and the final column (cf) shows the cumulative frequency. The cumulative frequency indicates the number of cases having a given score or a score that is lower.
In addition to cumulative frequency, we can also construct a distribution that shows cumulative percentage, which is the percentage of cases having a given score or a score that is lower. To calculate the cumulative percentage:
C = (100)(cf / N)
cf = cumulative frequency in any category
N = total number of cases in the distribution.
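A minimal Python sketch of the cumulative columns follows. It assumes the frequencies are listed from the lowest class interval to the highest, so that each running total counts the cases at or below that interval. The frequencies themselves are hypothetical.

```python
from itertools import accumulate


def cumulative_table(freqs):
    """Return (f, cf, C) rows, where C = (100)(cf / N).

    freqs must be ordered from the lowest class interval to the highest.
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("the distribution contains no cases")
    running_totals = accumulate(freqs)  # cf for each interval
    return [(f, cf, 100 * cf / n) for f, cf in zip(freqs, running_totals)]


# Hypothetical frequencies for four intervals, lowest interval first.
for f, cf, c in cumulative_table([1, 3, 3, 3]):
    print(f"f={f}  cf={cf}  C={c}%")
```

The last row always has cf equal to N and a cumulative percentage of 100.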
Sometimes the class intervals will be of different sizes. A good example is a frequency distribution listing different income levels. The lowest income level might consist of people who make less than $5,000 per month. As you go up the scale, however, it would not be feasible to keep every interval in increments of $5,000.
Frequency distributions are almost everywhere you look. They can be found in magazine articles and daily newspapers because they present information in an organized and readily understandable manner. Most social researchers use frequency distributions as a starting point for organizing data that they want to explain or investigate further. For example, we might want to understand why some people fall at one end of the distribution and others at the other end.
To do that we need to expand our data into more dimensions. Often we use a cross tabulation (cross-tab for short), which presents the distribution of frequencies and percents of one variable (usually the dependent variable) across the categories of one or more additional variables (usually the independent variable or variables). For example, we would want to know not simply how many people wear seat belts, but what types of people wear them (what the characteristics of seat belt wearers are). Cross-tabs help us examine two or more variables and their relationship to each other.
When working with cross-tabs, keep in mind that if the independent variable is on the rows you should use the row percents, and if it is on the columns you should use the column percents. In the rare cases where you cannot determine which variable is independent (no variable can be singled out as the cause of another), total percents are frequently used.
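The rule about percents can be sketched in Python: whichever axis holds the independent variable, each cell is divided by the total for its category of the independent variable. The cases below are hypothetical (an invented age grouping paired with seat belt use), and the function name is an assumption for illustration.

```python
from collections import Counter

# Hypothetical cases as (independent variable, dependent variable) pairs.
cases = [
    ("younger", "wears belt"), ("younger", "no belt"),
    ("younger", "no belt"), ("younger", "wears belt"),
    ("older", "wears belt"), ("older", "wears belt"),
    ("older", "wears belt"), ("older", "no belt"),
]


def percents_within_independent(pairs):
    """Return each cell as a percent of its independent-variable category."""
    cell_counts = Counter(pairs)
    category_totals = Counter(independent for independent, _ in pairs)
    return {
        (independent, dependent): 100 * f / category_totals[independent]
        for (independent, dependent), f in cell_counts.items()
    }


table = percents_within_independent(cases)
for (independent, dependent), pct in sorted(table.items()):
    print(f"{independent:<8} {dependent:<11} {pct:.1f}%")
```

Within each category of the independent variable the percents add up to 100, which is what lets the two groups be compared even when they differ in size.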
Graphs
When making presentations, you want to grab the attention of your audience or present data in a manner that is understood more quickly than a table of numbers. Charts and graphs are a wonderful way to do this.
Pie charts, whose pieces add up to 100%, are the simplest way to present data. They are particularly useful for showing differences in frequencies or percentages among the categories of a nominal-level variable. Their drawback is that they work only with data that divides easily into just a few categories. You should avoid pie charts for data that contains more than 5 categories.
Bar graphs are more commonly used in social research because they can accommodate any number of categories at any level of measurement. They are constructed using the standard arrangement: a horizontal base line (or x axis), where the score values or categories are marked off, and a vertical line (or y axis) along the left side that displays the frequencies for each score value or category. Generally, the taller the bar, the greater the frequency in that category. When you are dealing with samples of different sizes, it is better to graph the percents rather than the frequencies.