
# Logistic Regression Results for Data Analysis


## Introduction

This paper provides the results of a logistic regression that was run to analyze the data contained in the file “helping3.sav.” The data file is described; the assumptions for the logistic regression are articulated and tested; the research question, hypotheses, and alpha level are stated; and the results of the test are reported and explained.

## Data File Description

The data were taken from the file “helping3.sav,” available at https://s3.amazonaws.com/documents.routledge-interactive/9780134320250/eResources.zip. The file comprises real research data assessing people’s helpfulness (George & Mallery, 2016). The present paper uses logistic regression to predict helpers’ helpfulness, as assessed by the friend they tried to help (cathelp, dichotomous: 0 = helpful, 1 = not helpful, after modification for analysis in SPSS), from the helpers’ sympathy toward the friend (sympathy, interval/ratio), their anger toward the friend (angert, interval/ratio), their self-efficacy (effect, interval/ratio), and their ethnicity (ethnic, categorical; dummy-coded in the regression; see Appendix). For all interval/ratio variables, lower numbers represent a lower intensity of the feeling (George & Mallery, 2016). The sample size is N = 537. The demographic variables in the data set suggest that the sample covers a rather broad population.

## Assumptions, Data Screening, and Verification of Assumptions

### Assumptions

Logistic regression requires that certain assumptions be satisfied (Field, 2013; Laerd Statistics, n.d.):

1. The dependent variable is dichotomous and comprises mutually exclusive, exhaustive categories, and the independent variables are interval/ratio or categorical;
2. The observations are independent;
3. There are linear relationships between the continuous independent variables and their logit transformations;
4. There is no multicollinearity in the data.

### Verifying Assumptions

#### Assumption 1

As follows from the “Data File Description” section, the assumption is met.

#### Assumption 2

The observations are independent: each case represents a different person helping a different friend.

#### Assumption 3

To test this assumption, it is advised to run a logistic regression using the “Enter” method that includes interaction terms between each continuous independent variable and its logit transformation (for instance, its natural logarithm), a procedure known as the Box-Tidwell approach (Field, 2013, sec. 19.4.1, 19.8.1).

Accordingly, three variables were created for this purpose: ln_sympatht, ln_effict, and ln_angert. The results of the analysis are supplied in Table 1 below. The assumption is violated for the variable angert, because its interaction term is significant (p = .015).

**Variables in the Equation**

| Step 1ᵃ | B | S.E. | Wald | df | Sig. | Exp(B) |
|---|---|---|---|---|---|---|
| sympathy | .978 | 1.313 | .555 | 1 | .456 | 2.659 |
| angert | 1.341 | .541 | 6.153 | 1 | .013 | 3.823 |
| effect | 4.121 | 2.834 | 2.114 | 1 | .146 | 61.637 |
| ethnic | | | 5.036 | 4 | .284 | |
| ethnic(1) | -.330 | .369 | .801 | 1 | .371 | .719 |
| ethnic(2) | .204 | .486 | .176 | 1 | .675 | 1.226 |
| ethnic(3) | -.105 | .436 | .058 | 1 | .810 | .900 |
| ethnic(4) | -.675 | .451 | 2.239 | 1 | .135 | .509 |
| ln_sympatht by sympathy | -.189 | .521 | .132 | 1 | .716 | .828 |
| angert by ln_angert | -.630 | .260 | 5.882 | 1 | .015 | .533 |
| effect by ln_effict | -1.166 | 1.110 | 1.103 | 1 | .294 | .312 |
| Constant | -15.355 | 5.553 | 7.646 | 1 | .006 | .000 |

a. Variable(s) entered on step 1: sympatht, angert, effict, ethnic, ln_sympatht * sympatht, angert * ln_angert, effict * ln_effict.

Table 1. Testing the assumption of the linear relationship between the independent variables and their logit transformations.

To address this problem, data transformation procedures can be employed to adjust the independent variables (Warner, 2013). However, this will not be done in the current paper due to the need to follow specific instructions from George and Mallery (2016).
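The transformation terms used in this linearity test can be sketched in a few lines (an illustrative Python snippet; the function name and sample values are invented and are not the helping3.sav data):

```python
import math

def box_tidwell_terms(x, name):
    """Return the ln(x) transform and the x * ln(x) interaction term
    for a continuous predictor. The x * ln(x) term is what enters the
    test regression; a significant coefficient on it signals that the
    predictor's relationship with the logit is not linear. Values must
    be strictly positive (shift the variable first if necessary)."""
    ln_x = [math.log(v) for v in x]
    interaction = [v * lv for v, lv in zip(x, ln_x)]
    return {f"ln_{name}": ln_x, f"{name}_by_ln_{name}": interaction}

# Hypothetical anger scores (1 = low anger), purely for illustration
angert = [1.0, 2.5, 4.0, 6.0]
terms = box_tidwell_terms(angert, "angert")
```

In SPSS the same terms are created with COMPUTE statements before running the test regression.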

#### Assumption 4

To test the assumption of non-multicollinearity, it is possible to run a linear regression with the same variables as the main logistic regression while using the SPSS option “collinearity diagnostics” (Field, 2013, sec. 19.8.2). Tolerance values below 0.1 and VIF (variance inflation factor) values greater than 10 indicate multicollinearity. As Table 2 below shows, the analyzed data exhibit no such values. Therefore, the assumption is met.

**Coefficientsᵃ**

| Model 1 | Tolerance | VIF |
|---|---|---|
| sympathy | .945 | 1.058 |
| effect | .963 | 1.038 |
| angert | .982 | 1.018 |
| ethnic | .996 | 1.004 |

a. Dependent Variable: cathelp

Table 2. Coefficients for collinearity diagnostics.
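Since tolerance is defined as 1 - R²(j) (from regressing predictor j on the other predictors) and VIF is its reciprocal, the VIF column of Table 2 can be reproduced directly from the tolerance column (a small Python check):

```python
def vif_from_tolerance(tolerance):
    """VIF is the reciprocal of tolerance (tolerance = 1 - R_j^2)."""
    return 1.0 / tolerance

# Tolerance values reported in Table 2
tolerances = {"sympathy": .945, "effect": .963, "angert": .982, "ethnic": .996}
vifs = {k: round(vif_from_tolerance(t), 3) for k, t in tolerances.items()}
# No tolerance is below 0.1 and no VIF exceeds 10, so the cut-offs are met
```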

## Screening the Data

Figures 1-3 below provide histograms for the sympathy, effect, and angert variables. Sympathy and effect are approximately normally distributed, with no apparent considerable outliers. The angert variable, however, is not normally distributed, owing to a large number of values close to 1 (about 220 cases, or nearly 40% of the sample), and this skew produces some outliers. Nevertheless, the data will not be transformed and no outliers will be excluded, because the analysis follows the specific instructions of George and Mallery (2016); moreover, it is very improbable that such a distribution results from sampling error.

As for out-of-bounds values, the histograms show that there are no such values in the data.

## Inferential Procedure, Hypotheses, Alpha Level

The research question for the given analysis is: “Do any of the following variables predict helpfulness as perceived by those who received help: the sympathy, anger, self-efficacy, or ethnicity of helpers?” The null hypothesis is: “None of the variables (the sympathy, anger, self-efficacy, and ethnicity of helpers) predicts helpfulness as perceived by those who received help.” The alternative hypothesis is: “At least one of the variables (the sympathy, anger, self-efficacy, and ethnicity of helpers) predicts helpfulness as perceived by those who received help.” The alpha level is the standard α = .05. The hypotheses will be evaluated using the model χ² and its significance value.

## Interpretation

A logistic regression was conducted using the variables sympathy, effect, angert, and ethnic as predictors and the variable cathelp as the outcome. As noted, the variable ethnic was recoded by SPSS using the dummy coding technique (see Appendix). The forward stepwise method of entry based on the likelihood ratio was used; the likelihood ratio for variable entry was set at .05, and a likelihood ratio of .10 was used for variable removal.

There were two steps of variable entry. Table 3 below shows how each entered variable improved the model: the first variable added χ2(1) = 114.843, p < .001, and the second variable added χ2(1) = 29.792, p < .001, resulting in a total model χ2(2) = 144.635, p < .001.

**Omnibus Tests of Model Coefficients**

| Step | | Chi-square | df | Sig. |
|---|---|---|---|---|
| Step 1 | Step | 114.843 | 1 | .000 |
| | Block | 114.843 | 1 | .000 |
| | Model | 114.843 | 1 | .000 |
| Step 2 | Step | 29.792 | 1 | .000 |
| | Block | 144.635 | 2 | .000 |
| | Model | 144.635 | 2 | .000 |

Table 3. The variables included in the model.
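The reported significance levels can be sanity-checked against the χ² values, since the chi-square survival function has closed forms for df = 1 and df = 2 (a standalone Python sketch using only the standard library; the χ² inputs are those reported in Table 3):

```python
import math

def chi2_sf(x, df):
    """Right-tail p-value of the chi-square distribution, using the
    closed forms available for df = 1 and df = 2."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("closed form implemented only for df in {1, 2}")

# Model chi-squares from Table 3
p_step1 = chi2_sf(114.843, 1)   # first predictor entered
p_step2 = chi2_sf(29.792, 1)    # improvement from the second predictor
p_model = chi2_sf(144.635, 2)   # full model
# all three are far below the alpha level of .05
```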

As a result, the variables angert and ethnic(1) through ethnic(4) (the dummy variables representing the variable ethnic) were not included in the model (see Table 4 below).

**Variables not in the Equation**

| Step | Variable | Score | df | Sig. |
|---|---|---|---|---|
| Step 1 | sympathy | 29.135 | 1 | .000 |
| | angert | .019 | 1 | .891 |
| | ethnic | 4.379 | 4 | .357 |
| | ethnic(1) | .353 | 1 | .553 |
| | ethnic(2) | 2.681 | 1 | .102 |
| | ethnic(3) | .246 | 1 | .620 |
| | ethnic(4) | 1.709 | 1 | .191 |
| | Overall Statistics | 34.113 | 6 | .000 |
| Step 2 | angert | .640 | 1 | .424 |
| | ethnic | 4.892 | 4 | .299 |
| | ethnic(1) | .464 | 1 | .496 |
| | ethnic(2) | 2.304 | 1 | .129 |
| | ethnic(3) | .357 | 1 | .550 |
| | ethnic(4) | 2.169 | 1 | .141 |
| | Overall Statistics | 5.404 | 5 | .369 |

Table 4. The variables which were not included in the model (the final step).

From Table 5 below, it can be seen that the final model predicted the outcome variable better than the first one, as its -2 log-likelihood was smaller (George & Mallery, 2016). The final model explained roughly 23.6-31.5% of the variance in the data (Cox & Snell R2 = .236, Nagelkerke R2 = .315). Table 6 below shows that the final model correctly classified nearly 70.4% of cases.

**Model Summary**

| Step | -2 Log likelihood | Cox & Snell R Square | Nagelkerke R Square |
|---|---|---|---|
| 1 | 629.506ᵃ | .193 | .257 |
| 2 | 599.713ᵇ | .236 | .315 |

a. Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.
b. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.

Table 5. Model summary for the final step.
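As a cross-check, the two pseudo-R² values in Table 5 can be reproduced from the reported model χ² and -2 log-likelihood (a minimal Python sketch; the formulas are the standard Cox & Snell and Nagelkerke definitions, not taken from the source):

```python
import math

# Values reported in Tables 3 and 5; N = 537
n = 537
chi2_model = 144.635                       # model chi-square at the final step
neg2ll_model = 599.713                     # -2 log-likelihood at the final step
neg2ll_null = neg2ll_model + chi2_model    # -2LL of the intercept-only model

# Cox & Snell R^2 = 1 - (L0 / L1)^(2/n) = 1 - exp(-chi2 / n)
cox_snell = 1.0 - math.exp(-chi2_model / n)

# Nagelkerke R^2 rescales Cox & Snell so that its maximum is 1
nagelkerke = cox_snell / (1.0 - math.exp(-neg2ll_null / n))
```

Both computed values match the SPSS output (.236 and .315) to three decimals.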

**Classification Tableᵃ**

| Step | Observed (cathelp) | Predicted: NOT HELPFUL | Predicted: HELPFUL | Percentage Correct |
|---|---|---|---|---|
| Step 1 | NOT HELPFUL | 176 | 89 | 66.4 |
| | HELPFUL | 79 | 193 | 71.0 |
| | Overall Percentage | | | 68.7 |
| Step 2 | NOT HELPFUL | 181 | 84 | 68.3 |
| | HELPFUL | 75 | 197 | 72.4 |
| | Overall Percentage | | | 70.4 |

a. The cut value is .500

Table 6. Classification table for the final step.
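The overall percentage correct can be reproduced directly from the step 2 cell counts in Table 6 (a minimal Python sketch):

```python
# Step 2 classification counts from Table 6
# keys are (observed, predicted) at the .500 cut value
confusion = {
    ("not helpful", "not helpful"): 181,
    ("not helpful", "helpful"): 84,
    ("helpful", "not helpful"): 75,
    ("helpful", "helpful"): 197,
}
total = sum(confusion.values())            # should equal N = 537
correct = (confusion[("not helpful", "not helpful")]
           + confusion[("helpful", "helpful")])
accuracy = 100.0 * correct / total         # overall percentage correct
```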

Only the variables effect and sympathy were retained in the final model, as can be seen from Table 7 below. Effect significantly predicted helpfulness: Exp(B)=3.046, Wald’s test statistic=76.197, df=1, p<.001. Sympathy also significantly predicted helpfulness: Exp(B)=1.596, Wald’s test statistic=27.459, df=1, p<.001. The Constant coefficient for the model was Exp(B)=.001; it had Wald’s test statistic=93.006, df=1, p<.001.

**Variables in the Equation**

| Step | Variable | B | S.E. | Wald | df | Sig. | Exp(B) |
|---|---|---|---|---|---|---|---|
| Step 1ᵃ | effect | 1.130 | .122 | 85.356 | 1 | .000 | 3.094 |
| | Constant | -5.303 | .585 | 82.030 | 1 | .000 | .005 |
| Step 2ᵇ | sympathy | .467 | .089 | 27.459 | 1 | .000 | 1.596 |
| | effect | 1.114 | .128 | 76.197 | 1 | .000 | 3.046 |
| | Constant | -7.471 | .775 | 93.006 | 1 | .000 | .001 |

a. Variable(s) entered on step 1: effict.
b. Variable(s) entered on step 2: sympathy.

Table 7. Variables in the regression equation at the final step.

Thus, the equation for the regression model at the final step was as follows (George & Mallery, 2016):

ln(Odds(helping)) = ln(P(helping) / P(not helping)) = -7.471 + 0.467 × sympatht + 1.114 × effict,

or, equivalently:

Odds(helping) = P(helping) / P(not helping) = e^(-7.471) × e^(0.467 × sympatht) × e^(1.114 × effict).
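The fitted equation converts to a probability through the logistic transform (a Python sketch using the step 2 coefficients from Table 7; the predictor values below are purely illustrative inputs, not cases from the data set):

```python
import math

def predicted_prob(sympatht, effict):
    """Probability implied by the final regression equation:
    p = 1 / (1 + e^(-logit)), with the logit from Table 7, step 2."""
    logit = -7.471 + 0.467 * sympatht + 1.114 * effict
    return 1.0 / (1.0 + math.exp(-logit))

# Illustrative (invented) values: because both B coefficients are
# positive, higher sympathy and self-efficacy raise the predicted odds
low = predicted_prob(1.0, 1.0)
high = predicted_prob(5.0, 6.0)
```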

Therefore, the null hypothesis was rejected, and evidence was found to support the alternative hypothesis.

One limitation worth noting is that the given equations do not take into account the standard errors of the B coefficients displayed in Table 7 above. Another limitation concerns the entry method: the forward stepwise method based on the likelihood ratio selects variables for the final model on purely mathematical grounds, a practice strongly advised against by Field (2013).

## Conclusion

Thus, logistic regression was run to find out whether the sympathy, anger, self-efficacy, and ethnicity of helpers could predict whether they would be assessed as helpful or non-helpful by friends for whom they provided the aid. It was found that sympathy and self-efficacy could predict their helpfulness, whereas the rest of the independent variables could not. Therefore, the null hypothesis was rejected, and evidence was found to support the alternative hypothesis that at least some of the independent variables could predict the helpfulness of helpers.

## References

Field, A. (2013). Discovering statistics using IBM SPSS Statistics (4th ed.). Thousand Oaks, CA: SAGE Publications.

George, D., & Mallery, P. (2016). IBM SPSS Statistics 23 step by step: A simple guide and reference (14th ed.). New York, NY: Routledge.

Laerd Statistics. (n.d.). Binomial logistic regression using SPSS Statistics. Web.

Warner, R. M. (2013). Applied statistics: From bivariate through multivariate techniques (2nd ed.). Thousand Oaks, CA: SAGE Publications.

## Appendix

Dummy coding of the independent variable ethnic as carried out by SPSS:

**Categorical Variables Codings**

| ethnic | Frequency | (1) | (2) | (3) | (4) |
|---|---|---|---|---|---|
| WHITE | 293 | 1.000 | .000 | .000 | .000 |
| BLACK | 50 | .000 | 1.000 | .000 | .000 |
| HISPANIC | 80 | .000 | .000 | 1.000 | .000 |
| ASIAN | 70 | .000 | .000 | .000 | 1.000 |
| OTHER/DTS | 44 | .000 | .000 | .000 | .000 |
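The coding scheme above can be mimicked in a few lines (a hypothetical Python sketch; the sample list is invented for illustration, and the category ordering here is alphabetical rather than SPSS's frequency-based ordering):

```python
def dummy_code(values, reference):
    """Indicator (dummy) coding with a chosen reference category:
    each non-reference category gets its own 0/1 column, and the
    reference category is coded as all zeros."""
    categories = [c for c in sorted(set(values)) if c != reference]
    return [
        {cat: (1.0 if v == cat else 0.0) for cat in categories}
        for v in values
    ]

# Hypothetical mini-sample; OTHER/DTS is the all-zero reference category
ethnic = ["WHITE", "BLACK", "OTHER/DTS", "HISPANIC", "ASIAN"]
coded = dummy_code(ethnic, reference="OTHER/DTS")
```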