Data Structure: Statistical Analysis

Data Suitability

Although the structure of the data appears normal (each observation is stored in its own record; replications have separate records; each variable has its own field, which is always in the same column, etc.), not all the variables in the file “Data 02 (1).sav” are suitable for analysis. The data in several variables is missing or appears corrupted. For example, the values 7777 and 9999 in the variable HEIGHT3 cannot be interpreted without further clarification. The variable ORACE2 has only 58 valid values; the remaining 7631 values are missing. The data in the variable IDATE is corrupt because the last digit of the year is missing, although this can be mitigated by using the IYEAR variable instead. The variable HEIGHT3 is difficult to use as it stands: it combines feet and inches in a single number, so it has to be treated as categorical unless it is recoded into a continuous variable (this is done below).

The variable CHILDREN, which is supposed to reflect the number of children in a household, has the value 88 in most cases (5728 out of 7689 values). For certain variables, it might be possible to address the problem. For instance, the value “88” in CHILDREN might mean that the household has no children, and the missing values in the variable PREGNANT might denote “not pregnant,” although it is impossible to differentiate these from genuinely missing data.

On the whole, some of the data is usable, at least for certain purposes; some is corrupt, but the problem can be mitigated; and some is corrupt in ways that cannot be mitigated without additional information.
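The screening described above can be illustrated with a minimal Python sketch that tallies valid, missing, and suspect-code entries in a single variable. The function name, the toy column, and the choice of 88 as a suspect code are assumptions made for illustration; the values are not taken from the data file.

```python
from collections import Counter


def summarize_codes(values, suspect_codes):
    """Count valid, missing (None), and suspect-code entries in one variable."""
    counts = Counter()
    for value in values:
        if value is None:
            counts["missing"] += 1
        elif value in suspect_codes:
            counts[f"code {value}"] += 1
        else:
            counts["valid"] += 1
    return dict(counts)


# Hypothetical toy column standing in for CHILDREN (not real data).
children = [88, 2, 88, None, 1, 88]
summary = summarize_codes(children, suspect_codes={88})
```

A tally of this kind only shows how often a code occurs; deciding what the code means still requires the data documentation.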

Converting and Combining Variables

Converting

Table 1 below is a frequency table for height in feet and inches (file “Data 02 (1).sav”), whereas Table 2 displays frequencies for a new variable, cat_height, obtained via Transform → Recode into Different Variables (Field, 2013). The new variable reflects the categories into which the participants were sorted according to their height: 1 is ≤ 4 feet 11 inches; 2 is 5 feet to 5 feet 11 inches; 3 is 6 feet to 7 feet 8 inches; 0 is ≥ 7 feet 9 inches. The last category (0) contains only the original values 7777 and 9999, which make no sense as heights and should be excluded from the analysis (e.g., via the Filter Data procedure).

Table 1. Height in feet and inches – frequencies (file “Data 02 (1).sav”).

HEIGHT3
		Frequency	Percent	Valid Percent	Cumulative Percent
Valid	400	1	.0	.0	.0
	402	1	.0	.0	.0
	404	1	.0	.0	.0
	405	1	.0	.0	.1
	406	2	.0	.0	.1
	407	3	.0	.0	.1
	408	9	.1	.1	.2
	409	12	.2	.2	.4
	410	21	.3	.3	.7
	411	83	1.1	1.1	1.7
	500	231	3.0	3.0	4.7
	501	264	3.4	3.4	8.2
	502	600	7.8	7.8	16.0
	503	631	8.2	8.2	24.2
	504	873	11.4	11.4	35.5
	505	703	9.1	9.1	44.7
	506	740	9.6	9.6	54.3
	507	603	7.8	7.8	62.2
	508	512	6.7	6.7	68.8
	509	475	6.2	6.2	75.0
	510	419	5.4	5.4	80.4
	511	406	5.3	5.3	85.7
	600	420	5.5	5.5	91.2
	601	247	3.2	3.2	94.4
	602	145	1.9	1.9	96.3
	603	100	1.3	1.3	97.6
	604	55	.7	.7	98.3
	605	22	.3	.3	98.6
	606	17	.2	.2	98.8
	607	8	.1	.1	98.9
	608	5	.1	.1	99.0
	609	4	.1	.1	99.0
	611	1	.0	.0	99.0
	708	1	.0	.0	99.1
	7777	56	.7	.7	99.8
	9999	17	.2	.2	100.0
	Total	7689	100.0	100.0

Table 2. Height categories – frequencies (file “Data 02 (1).sav”).

cat_height
		Frequency	Percent	Valid Percent	Cumulative Percent
Valid	0	73	.9	.9	.9
	1	134	1.7	1.7	2.7
	2	6457	84.0	84.0	86.7
	3	1025	13.3	13.3	100.0
	Total	7689	100.0	100.0
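The recoding of HEIGHT3 into cat_height can be reproduced outside SPSS with the short Python sketch below. It assumes that a HEIGHT3 code is feet × 100 + inches (so that 504 means 5 feet 4 inches), which is inferred from the category boundaries rather than stated in the data file; the function names are illustrative. The frequencies are copied from Table 1.

```python
# Frequencies of HEIGHT3 codes copied from Table 1 (code -> count).
HEIGHT3_FREQ = {
    400: 1, 402: 1, 404: 1, 405: 1, 406: 2, 407: 3, 408: 9, 409: 12,
    410: 21, 411: 83, 500: 231, 501: 264, 502: 600, 503: 631, 504: 873,
    505: 703, 506: 740, 507: 603, 508: 512, 509: 475, 510: 419, 511: 406,
    600: 420, 601: 247, 602: 145, 603: 100, 604: 55, 605: 22, 606: 17,
    607: 8, 608: 5, 609: 4, 611: 1, 708: 1, 7777: 56, 9999: 17,
}


def cat_height(code):
    """Map a HEIGHT3 code to the height category described in the text."""
    if code <= 411:
        return 1  # up to 4 feet 11 inches
    if code <= 511:
        return 2  # 5 feet to 5 feet 11 inches
    if code <= 708:
        return 3  # 6 feet to 7 feet 8 inches
    return 0      # 7 feet 9 inches or more (here only 7777 and 9999)


def recode_frequencies(freq):
    """Aggregate code frequencies into category frequencies."""
    result = {}
    for code, count in freq.items():
        category = cat_height(code)
        result[category] = result.get(category, 0) + count
    return result


category_freq = recode_frequencies(HEIGHT3_FREQ)
```

Under these assumptions, category_freq can be compared directly with the Frequency column of Table 2.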

Combining

Tables 3 and 4 below provide the frequencies for the numbers of men and women in households, respectively (file “Data 02 (1).sav”).

Table 3. Frequencies for men (file “Data 02 (1).sav”).

NUMMEN
		Frequency	Percent	Valid Percent	Cumulative Percent
Valid	0	2237	29.1	34.0	34.0
	1	3872	50.4	58.9	92.9
	2	407	5.3	6.2	99.1
	3	50	.7	.8	99.8
	4	9	.1	.1	100.0
	5	1	.0	.0	100.0
	Total	6576	85.5	100.0
Missing	System	1113	14.5
Total		7689	100.0

Table 4. Frequencies for women (file “Data 02 (1).sav”).

NUMWOMEN
		Frequency	Percent	Valid Percent	Cumulative Percent
Valid	0	736	9.6	11.2	11.2
	1	5109	66.4	77.7	88.9
	2	634	8.2	9.6	98.5
	3	84	1.1	1.3	99.8
	4	8	.1	.1	99.9
	5	4	.1	.1	100.0
	6	1	.0	.0	100.0
	Total	6576	85.5	100.0
Missing	System	1113	14.5
Total		7689	100.0

Table 5 provides frequencies for the total number of men and women in a household; this new variable, males_plus_females, was created via Transform → Compute Variable (Warner, 2013).

Table 5. Frequencies for men and women (file “Data 02 (1).sav”).

males_plus_females
		Frequency	Percent	Valid Percent	Cumulative Percent
Valid	1.00	2662	34.6	40.5	40.5
	2.00	3112	40.5	47.3	87.8
	3.00	584	7.6	8.9	96.7
	4.00	181	2.4	2.8	99.4
	5.00	27	.4	.4	99.8
	6.00	6	.1	.1	99.9
	7.00	3	.0	.0	100.0
	10.00	1	.0	.0	100.0
	Total	6576	85.5	100.0
Missing	System	1113	14.5
Total		7689	100.0
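The computation behind males_plus_females can be sketched in Python as follows. The sketch assumes that a household with a missing count in either source variable receives a missing total, which matches the identical number of missing cases (1113) in Tables 3 to 5. The household values are hypothetical.

```python
def males_plus_females(nummen, numwomen):
    """Add the two household counts; return None if either one is missing."""
    if nummen is None or numwomen is None:
        return None
    return nummen + numwomen


# Hypothetical households as (NUMMEN, NUMWOMEN); None marks a missing value.
households = [(1, 1), (0, 2), (None, None), (2, 1)]
totals = [males_plus_females(men, women) for men, women in households]
```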

Further Data Manipulations

Merging

To combine the files “Data 02 (1).sav” and “Data 03 (1).sav,” which contain the same variables, a variable id_merge was created to enumerate the cases and keep track of them. Cases were enumerated 1 through 7689, and 7690 through 12466 for the named files, respectively.

For the file “Data 02 (1).sav,” the descriptives for the variable WEIGHT2 are shown in Table 6 below. The descriptives for the same variable from the file “Data 03 (1).sav” are shown in Table 7. The descriptives for the same variable from the merged file “02_03_merged.sav” are shown in Table 8.

Table 6. Descriptives for WEIGHT2 in “Data 02 (1).sav.”

Descriptive Statistics
	N	Mean	Std. Deviation
WEIGHT2	7689	522.08	1711.739
Valid N (listwise)	7689

Table 7. Descriptives for WEIGHT2 in “Data 03 (1).sav.”

Descriptive Statistics
	N	Mean	Std. Deviation
WEIGHT2	4777	615.26	1960.493
Valid N (listwise)	4777

Table 8. Descriptives for WEIGHT2 in “02_03_merged.sav.”

Descriptive Statistics
	N	Mean	Std. Deviation
WEIGHT2	12466	557.78	1811.593
Valid N (listwise)	12466

Merging these two files pools their samples. Because both files contain the same variables, merging simply adds the cases from the second data set to those of the first.
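A minimal Python sketch of this kind of merge is given below: the cases of the second file are appended to those of the first and numbered consecutively in id_merge. The toy cases and function names are assumptions for illustration. The last line uses the N and Mean values from Tables 6 and 7 to check that the merged mean in Table 8 is the case-weighted average of the two file means.

```python
def merge_cases(first, second):
    """Append the cases of `second` to `first` and number them from 1."""
    merged = []
    for id_merge, case in enumerate(first + second, start=1):
        merged.append({"id_merge": id_merge, **case})
    return merged


def pooled_mean(n1, mean1, n2, mean2):
    """Mean of the combined sample, weighted by the number of cases."""
    return (n1 * mean1 + n2 * mean2) / (n1 + n2)


# Hypothetical toy cases standing in for the two files (not real data).
file_02 = [{"WEIGHT2": 150}, {"WEIGHT2": 180}]
file_03 = [{"WEIGHT2": 200}]
merged = merge_cases(file_02, file_03)

# Consistency check with N and Mean from Tables 6 and 7.
merged_mean = pooled_mean(7689, 522.08, 4777, 615.26)
```

Because the means in Tables 6 and 7 are rounded to two decimals, merged_mean can be expected to agree with Table 8 only up to rounding.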

Manipulating the Data to Create a New Variable

A new variable in the merged file “02_03_merged.sav” is created by using the command Transform → Compute Variable to multiply the existing variable WEIGHT2 by 0.453592, the number of kilograms in one pound (George & Mallery, 2016). The resulting variable, weight_kg, denotes the weight of the participants in kilograms. Such a variable is useful for calculating the body mass index of the participants (BMI = weight / height², where weight is in kilograms and height is in meters), which makes it possible to assess whether the participants are underweight, of normal weight, overweight, or obese.

The descriptives for WEIGHT2 in this data set can be found in Table 8 above. The descriptives for weight_kg can be found in Table 9 below.

Table 9. Descriptives for weight_kg in “02_03_merged.sav.”

Descriptive Statistics
	N	Mean	Std. Deviation
weight_kg	12466	253.0065	821.72429
Valid N (listwise)	12466
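The conversion to kilograms and the BMI formula mentioned above can be sketched in Python as follows. The cutoffs 18.5, 25, and 30 are the commonly used WHO adult thresholds and are an assumption here, since the text does not state them; the function names and the example person are hypothetical.

```python
LB_TO_KG = 0.453592  # kilograms per pound, the factor used in the text


def weight_kg(weight_lb):
    """Convert a weight in pounds to kilograms."""
    return weight_lb * LB_TO_KG


def bmi(weight_in_kg, height_in_m):
    """Body mass index: weight (kg) divided by height (m) squared."""
    return weight_in_kg / height_in_m ** 2


def bmi_category(value):
    """Classify a BMI value; the cutoffs are assumed WHO adult thresholds."""
    if value < 18.5:
        return "underweight"
    if value < 25:
        return "normal weight"
    if value < 30:
        return "overweight"
    return "obese"


# Hypothetical participant: 150 lb and 1.70 m.
example_bmi = bmi(weight_kg(150), 1.70)
example_category = bmi_category(example_bmi)
```

Applying weight_kg to the mean in Table 8 gives a quick check on the mean in Table 9, since multiplying every value by a constant multiplies the mean by the same constant.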

Manipulating the Data Structure

Manipulating the data structure means changing variables so that they are measured in different units (DeCoster, 2001). In essence, this was done in the previous subsection, when weight in pounds was transformed into weight in kilograms. The same can be done with the variable HEIGHT3 to make it usable in the analysis (file “Data 02 (1).sav”). First, a new variable can be created in which height is measured in inches only. This can be done via Transform → Recode into Different Variables by manually setting the new value for each value of height (from 400 to 708). The resulting variable is height_inches.

There is no point in creating the descriptives for HEIGHT3 because the data is categorical. However, the frequencies are given in Table 1 above. For the data in inches only (height_inches), descriptives are provided in Table 10 below.

It should be noted that the recoding syntax does not contain transformation instructions for the values from 700 through 707 because there are no such values in the data, as can be seen from the frequencies for HEIGHT3 in Table 1. Also, the values 7777 and 9999 (which make no sense as heights) were left unchanged during the transformation, which is why the mean and standard deviation in Table 10 are far too large to describe actual heights. These values can be filtered out by using the command Data → Select Cases, for example.

Table 10. Descriptives for height_inches in “Data 02 (1).sav.”

Descriptive Statistics
	N	Mean	Std. Deviation
height_inches	7689	144.62	803.189
Valid N (listwise)	7689
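Instead of setting each value by hand, the conversion to inches can be expressed arithmetically. The Python sketch below assumes that a HEIGHT3 code is feet × 100 + inches and, as in the text, leaves 7777 and 9999 unchanged. The frequencies are copied from Table 1; the function names and the exclude_special option are illustrative.

```python
# Frequencies of HEIGHT3 codes copied from Table 1 (code -> count).
HEIGHT3_FREQ = {
    400: 1, 402: 1, 404: 1, 405: 1, 406: 2, 407: 3, 408: 9, 409: 12,
    410: 21, 411: 83, 500: 231, 501: 264, 502: 600, 503: 631, 504: 873,
    505: 703, 506: 740, 507: 603, 508: 512, 509: 475, 510: 419, 511: 406,
    600: 420, 601: 247, 602: 145, 603: 100, 604: 55, 605: 22, 606: 17,
    607: 8, 608: 5, 609: 4, 611: 1, 708: 1, 7777: 56, 9999: 17,
}
SPECIAL_CODES = {7777, 9999}  # values that are not real heights


def height_inches(code):
    """Convert a HEIGHT3 code to inches; special codes are left unchanged."""
    if code in SPECIAL_CODES:
        return code
    feet, inches = divmod(code, 100)
    return feet * 12 + inches


def mean_height(freq, exclude_special=False):
    """Frequency-weighted mean of height_inches, optionally filtered."""
    total = 0
    n = 0
    for code, count in freq.items():
        if exclude_special and code in SPECIAL_CODES:
            continue
        total += height_inches(code) * count
        n += count
    return total / n


mean_all = mean_height(HEIGHT3_FREQ)  # comparable to Table 10
mean_valid = mean_height(HEIGHT3_FREQ, exclude_special=True)
```

The exclude_special option plays the role of Data → Select Cases: it drops the 7777 and 9999 cases before the mean is computed.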

References

DeCoster, J. (2001). Transforming and restructuring data. Web.

Field, A. (2013). Discovering statistics using IBM SPSS Statistics (4th ed.). Thousand Oaks, CA: SAGE Publications.

George, D., & Mallery, P. (2016). IBM SPSS Statistics 23 step by step: A simple guide and reference (14th ed.). New York, NY: Routledge.

Warner, R. M. (2013). Applied statistics: From bivariate through multivariate techniques (2nd ed.). Thousand Oaks, CA: SAGE Publications.