The Statistical Package for the Social Sciences (SPSS) is software that enables researchers in many fields to manage their data and perform statistical analyses on them. Market researchers, health researchers, education researchers, survey companies, sociologists, psychologists and others can use SPSS to obtain:
Descriptive statistics - frequency distributions, cross tabulation
Bivariate statistics - t-test, ANOVA
Predictions - linear regression
And more
Dataset - an organized collection of data
Attributes – characteristics (of persons or things)
Variables – logical grouping of attributes (male → sex, baker → occupation)
Independent variable – the presumed cause; its values are taken as given
Dependent variable – the presumed effect; its values depend on the independent variable
Figure 1 SPSS Icon
Some versions of SPSS start with a pop-up window like the one in Figure 2. You can open a dataset through that dialog, or you can close it and SPSS will open a blank dataset.
Figure 2 IBM SPSS Intro Dialog
The Toolbar consists of two rows. The lower row offers shortcuts to some of the functions you can also access through the top row. For example, the Open Data Document button, which looks like a yellow folder and sits at the far left of the second row, allows you to open a data set to work with. You can also open a data set through File → Open → Data; the File menu is at the far left of the top row.
Other operations that are located on the Toolbar allow you to edit, manage and analyze data, change what you can see on your screen, and get help.
Figure 3 Toolbar
File - Here you can open a data set, import data, and save or rename your data set.
Edit - This tab allows you to edit your work. Here you can find functions such as undo, copy, paste, find, and replace.
View - Allows you to change the settings of what you can see on your screen.
Data - Here you can manipulate the entire data set; for example, you can split the file or sort cases.
Transform - Under this tab you will find functions that allow you to manipulate variables: recode variables, compute new variables, or replace missing values.
Analyze - You will use this tab to run statistical analyses and create reports about your data.
Direct Marketing - This tab is used to analyze customer information and consumer data.
Graphs - You can visually display your data here by creating graphs and charts.
Utilities - This tab contains functions that help you compare datasets and perform data transformation.
Extensions - You can work with IBM SPSS extension bundles.
Window - Allows you to manipulate the settings of the SPSS window.
Help - Use this tab to access SPSS built-in help features.
Variable View (Figure 4)
This view is useful for setting up variables and their properties.
Figure 4 Variable View
Under Name, type a short name for the variable, without spaces.
Under Type define the type of the variable. Types are numeric, comma, dot, scientific notation, date, dollar, custom currency, or string. See Figure 5.
Figure 5 Variable Name and Variable Type
Width determines the maximum number of characters allowed for the variable's values.
If you would like to define the number of decimal places for the variable, click on Decimals.
To label the variable, click on Label and type the text including spaces (up to 256 characters).
Variable values can be defined by clicking on the cell Values and then clicking on the ellipsis. A dialog box will appear. Label the value and click on Add before defining the next value. See Figure 6.
Figure 6 Variable Values
When you click on Missing, you can define which values to exclude from analysis. Sometimes the questionnaire contains data that are not important for analysis; answers such as ‘not applicable’ or ‘I don’t know’ are often excluded.
Measure is where you define what level of measurement you are using for the variable. You can choose Scale, Ordinal or Nominal. Nominal measures are all at the same level; we cannot tell which one is more or less (sex, place of birth, etc.). Ordinal measure tells us the relationship between the cases. We can tell which one is more and which one is less, but we cannot tell by how much (Likert scale, levels of education, etc.). Scale is a measure that tells us the relationship between cases and allows us to tell “by how much” they are different (height, weight, income, etc.).
You can switch to Data View by clicking on the Data View tab on the bottom-left of the window. This is where you manually input your data, view or change your data.
You can use an existing data set and import it into SPSS. For example, to import an Excel file:
File → Import Data → Excel (for example) → Choose the file you would like to work with
i) Click on File
ii) Select Import Data
iii) Select the type of data you are importing, for example Excel
iv) Choose the file from your computer that you would like to work with
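If you prefer syntax, the same import can be done with the GET DATA command. Below is a minimal sketch; the file path and sheet name are placeholders you would replace with your own.

* Read an Excel file; READNAMES=ON treats the first row as variable names.
GET DATA
  /TYPE=XLSX
  /FILE='C:\data\Health.xlsx'
  /SHEET=NAME 'Sheet1'
  /READNAMES=ON.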
You can also enter all your data directly into SPSS.
First, set up all the variables that you will be working with in the Variable View. Typically, you would determine at least the Variable Name, Type, Width, Decimal and Label. You can always go back and revise.
Switch to Data View. Look at the bottom of your screen. You will see two tabs (Data View and Variable View) either on the left or in the middle. Click on Data View. An empty sheet will appear with the names of the variables you set up earlier as the column headers and numbers indicating the rows. (See Figure 7)
In Figure 7, there is a column labeled var that has a slightly lighter color than the rest of the columns. This indicates that there is still room for more variables, but they have not been set up. This column is not going to be included in any analysis and should not contain any data.
Figure 7 Data View with Labeled Variables Age, Gender and ID
In the blank space below the variable names you can enter the data from each case. Make sure you enter data for one case on the same line (the same row). See Figure 8 for an example of a few cases entered for variables ID, Gender and Age.
Figure 8 Data View with Data for Variables ID, Gender and Age
You are a health researcher. You conducted a survey of your patients in which you asked about their marital status, highest level of education, weight and whether or not they smoke. You would like to know what proportion of your sample smokes and is married, which education level is most frequently achieved among your respondents, and what the most likely weight rating of a person in this study is. You received responses from 271 people. Your assistant entered the data into Excel and left notes on what the values he entered mean.
In SPSS, you can view the distribution of your data using the Frequencies command under Descriptive Statistics. Here is how you execute that:
Open SPSS, make sure you start off with a blank file and open the Excel dataset called Health.
Marital = marital status
Edlevel = highest level of education
Weightrate = current weight
Smoke = do you smoke
ID will stay the way it is
The values for the variables we are working with are as follows:
Marital: 1 = single, 2 = married, 3 = divorced, 4 = widowed
Edlevel: 1 = primary school, 2 = secondary school, 3 = trade training/post-secondary training, 4 = undergraduate degree, 5 = postgraduate degree
Weightrate: 1 = very underweight … 10 = very overweight
Smoke: 1 = yes, 2 = no
Figure 1 Example of Labeling Values
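Labeling values can also be done in syntax with the VALUE LABELS command. A sketch using the coding above, with the variable names from this example:

* Attach text labels to the numeric codes of each variable.
VALUE LABELS marital 1 'single' 2 'married' 3 'divorced' 4 'widowed'
  /edlevel 1 'primary school' 2 'secondary school' 3 'trade training/post-secondary training'
           4 'undergraduate degree' 5 'postgraduate degree'
  /smoke 1 'yes' 2 'no'.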
One of the basic functions of SPSS is the frequency distribution. This function displays the data in a frequency table and is useful for viewing how many cases, or what percentage of the sample, selected each answer.
Follow this path to create a Frequency table for your data set:
Start at the top, in the toolbar. Select: Analyze → Descriptive Statistics → Frequencies (Figure 2)
Figure 2 Path to Frequency Distribution
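Clicking Paste instead of OK in the Frequencies dialog writes the equivalent syntax; for the four variables in this example it should look close to this sketch:

* Frequency tables for the four variables in this example.
FREQUENCIES VARIABLES=marital edlevel weightrate smoke
  /ORDER=ANALYSIS.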
Once you execute an operation in SPSS, a separate window called Output will open. This is where all the reports, graphs and code will be displayed.
Figure 3 shows the output created by executing the Frequency Distribution for variables “marital,” “edlevel,” “weightrate,” and “smoke.”
Right at the top of the window you can see the code, where SPSS keeps a log of actions. In this case it tells us where we got our dataset and that we ran a frequency distribution for the variables listed above.
Next on the Output are the frequency distributions. Under the title “Frequencies,” there is a table labeled Statistics.
Statistics show the number of valid answers and the number of missing answers.
Frequency Table shows how the frequency and percentage of answers are distributed in the sample for each variable.
Figure 3 Reading the Output
Figure 4 shows the Frequency Distribution of Respondents’ Marital Status. Under Frequency is the number of people who chose the particular option on the survey. Percent shows what percentage of the total number of respondents that makes. Valid Percent shows the percentage excluding Missing Values. Cumulative Percent adds up the valid percentages line by line.
Figure 4, Frequency Distribution of Respondents’ Marital Status, shows that 19.9 percent of the respondents were single, 69.9 percent were married and 10.7 percent were either divorced or widowed. Most respondents were married.
Figure 4 Frequency Distribution of Respondents’ Marital Status
Pie Charts
Figure 1 Path To Creating a Pie Chart
Figure 2 Type of Pie Chart Dialog
Figure 3 Define Pie Chart Window
Figure 4 Defined Pie Chart
Figure 5 Pie Chart Title
Figure 6 Pie Chart in SPSS Output
Figure 7 Chart Editor
Figure 8 Show Data Labels Icon
Figure 9 Pie Chart with Labeled Data
Figure 10 Properties Window
Figure 11 Text Properties for Data Labels
Figure 12 Pie Chart with Customized Labels
Figure 13 Finished Pie Chart
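As a rough syntax equivalent of the dialog steps illustrated above, the GRAPH command can draw the same chart; this is a sketch assuming you are charting the marital variable:

* Pie chart of counts for each marital status category.
GRAPH
  /PIE=COUNT BY marital
  /TITLE='Marital Status of Respondents'.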
Figure 1 Path to Creating a Histogram
Figure 2 Histogram Popup Window
Figure 3 Display Normal Curve on Histogram
Figure 4 Histogram Titles Window
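In syntax, a histogram with the normal curve overlaid is a one-line GRAPH command; a sketch for the weightrate variable:

* Histogram of weightrate with a normal curve superimposed.
GRAPH
  /HISTOGRAM(NORMAL)=weightrate.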
Creating an Index in SPSS
Figure 1 Path to Compute Variable
Figure 2 Compute Variable Window
Figure 3 Addition of All Variables in the Index
Figure 4 Parentheses Around All Terms in Addition
Figure 5 Division of All Terms
Figure 6 Frequency Distribution of depressionINDEX
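The figures above walk through the Compute Variable dialog; the equivalent syntax is a single COMPUTE statement. A sketch assuming the index averages three hypothetical items d1, d2 and d3:

* Average the items; parentheses ensure the sum is computed before division.
COMPUTE depressionINDEX = (d1 + d2 + d3) / 3.
EXECUTE.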
The t-test helps us test our hypotheses about means. It determines whether there is a statistically significant difference between the means of two groups (Independent Samples T-test), between the means of the same sample at two different times (Paired Samples T-test), or between the mean of a group and a predicted value of that mean (One Sample T-test).
There are several assumptions we have to consider before we run a t-test:
Level of measurement: The dependent variable must be continuous (interval/ratio).
Independence: The observations are independent of one another.
Normality: The dependent variable should be approximately normally distributed.
Outliers: The dependent variable should not contain any outliers.
It is always good to check if the data fits the assumptions outlined above.
The variable has to be measured continuously, meaning it can take any value within a certain range (weight, height, age). Non-continuous variables are categorical (nominal, such as place of birth; or ordinal, such as level of education), where the variable can assume only certain values.
It is hard to test for independence; however, if the sample was chosen randomly, there is a very small chance that the data is biased (not independent).
One of the assumptions we need to consider before running a t-test is that the dependent variable is approximately normally distributed. That means that the data distribution roughly follows the bell curve.
In SPSS, we can check if the data is normally distributed using a histogram with the normal curve. To do that follow this path:
Graphs → Legacy Dialogs → Histogram
Figure 1 Path to creating a histogram
Figure 2 Histogram dialog
Figure 3 Distribution of weightrate data
When performing a t-test it is important to identify and remove outliers. Outliers are data points that are way out of the expected range of responses, or abnormally far from the rest of the data.
Figure 4 First steps to checking for outliers
Figure 5 Explore window
Figure 6 Explore: Plots window
Figure 7 Boxplot for weightrate
Figure 8 Boxplot with outliers
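The Explore procedure shown in Figures 4 through 6 corresponds to the EXAMINE command; a sketch for weightrate:

* Boxplot plus descriptive statistics; outliers appear as points beyond the whiskers.
EXAMINE VARIABLES=weightrate
  /PLOT BOXPLOT
  /STATISTICS DESCRIPTIVES
  /NOTOTAL.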
If we want to compare the mean of a sample against an assumed value, we use One Sample T-test.
The null hypothesis (H0) is that the sample mean (ȳ) and the assumed value (x) are equal.
The test hypothesis (H1) is that the sample mean (ȳ) and the assumed value (x) are not equal.
H0: ȳ = x
H1: ȳ ≠ x
A random sample of 25 eighth-grade students has a mean GPA of 3.5 in English. The marks range from 1 (worst) to 5 (excellent). The mean GPA of all eighth-grade students over the last five years is 3.7. Is the GPA of the 25 students different from the population’s GPA?
In this case, students’ GPA, or sample mean, (ȳ) is compared with the assumed value of mean GPA from the past five years (x).
H0: ȳ = x
H1: ȳ ≠ x
Practice problem and procedure:
You are a health science researcher and you want to know if the data your assistant collected are representative of the population you are working with. You know that the mean weightrate for your population is 6.5. How can you determine if the mean of the sample (6.31) is statistically different from the mean commonly cited in the peer reviewed literature (6.5)?
(Hint: You will run a one sample t-test)
Figure 1 Path to One-Sample T-Test
Figure 2 One-Sample T-Test window
Figure 3 One-Sample T-Test output
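Pasting the dialog in Figure 2 produces syntax close to this sketch, with 6.5 as the test value:

* One-sample t-test of weightrate against the assumed population mean of 6.5.
T-TEST
  /TESTVAL=6.5
  /MISSING=ANALYSIS
  /VARIABLES=weightrate
  /CRITERIA=CI(.95).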
The null hypothesis (H0) is that the sample mean (ȳ) and the assumed value (x) are equal.
The test hypothesis (H1) is that the sample mean (ȳ) and the assumed value (x) are not equal.
H0: ȳ = x
H1: ȳ ≠ x
In this example:
H0: ȳ = 6.5
H1: ȳ ≠ 6.5
The Independent Samples T-Test compares the means of two independent groups in order to determine whether there is statistical evidence that the associated population means are significantly different.
Figure 1 Path To Independent Samples T-test in SPSS
Figure 2 Independent Samples T-Test Popup
Figure 3 Test Variable
Figure 4 Choosing Grouping Variable
Figure 5 T-Test Define Groups
Figure 6 SPSS Output for Independent T-Test
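As a sketch in syntax, assuming weightrate is the test variable and smoke (coded 1 = yes, 2 = no) is the grouping variable:

* Independent samples t-test comparing mean weightrate of smokers and non-smokers.
T-TEST GROUPS=smoke(1 2)
  /MISSING=ANALYSIS
  /VARIABLES=weightrate
  /CRITERIA=CI(.95).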
ANOVA Assumptions
There are several assumptions we have to consider before we run an ANOVA:
Level of measurement: The dependent variable must be continuous (interval/ratio)
Independence: The observations are independent of one another
Normality: The dependent variable should be approximately normally distributed.
Outliers: The dependent variable should not contain any outliers
Variance: The variances of the samples should be homogeneous
It is always good to check if the data fits the assumptions outlined above.
The variable has to be measured continuously, meaning it can take any value within a certain range (weight, height, age). Non-continuous variables are categorical (nominal, such as place of birth; or ordinal, such as level of education), where the variable can assume only certain values.
It is hard to test for independence; however, if the sample was chosen randomly, there is a very small chance that the data is biased (not independent).
One of the assumptions we need to consider before running an ANOVA is that the dependent variable is approximately normally distributed. That means that the data distribution roughly follows the bell curve.
How to check normality in SPSS:
In SPSS, we can check if the data is normally distributed using a histogram with the normal curve. To do that follow this path:
Graphs → Legacy Dialogs → Histogram
Figure 1 Path to creating a histogram
Figure 2 Histogram dialogue
Figure 3 Distribution of weightrate data
When performing an ANOVA it is important to identify and remove outliers. Outliers are data points that are way out of the expected range of responses, or abnormally far from the rest of the data.
Checking for outliers in SPSS:
Figure 4 First steps to checking for outliers
Figure 5 Explore window
Figure 6 Explore: Plots window
Figure 7 Boxplot for weightrate
Figure 8 Boxplot with outliers
Variance
To test if the variances of the samples are homogeneous you need to conduct Levene’s Test. The test of homogeneity can be included in One-Way ANOVA. See instructions on how to do that under the ANOVA tab.
Figure 1 Path to One-Way ANOVA
Figure 2 One-Way ANOVA Window
Figure 3 One-Way ANOVA Options Window
Figure 4 One-Way ANOVA Dialog
Figure 5 One-Way ANOVA Post Hoc Multiple Comparisons Window
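A sketch of the pasted syntax, assuming weightrate is the dependent variable and edlevel is the factor; the HOMOGENEITY keyword requests Levene's test and TUKEY the post hoc comparisons:

* One-way ANOVA with Levene's test and Tukey post hoc comparisons.
ONEWAY weightrate BY edlevel
  /STATISTICS HOMOGENEITY
  /MISSING ANALYSIS
  /POSTHOC=TUKEY ALPHA(0.05).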
Once you execute the One-Way ANOVA command, SPSS will produce the results of the analysis in the Output window.
Figure 6 Test of Homogeneity
Figure 7 ANOVA
The Post Hoc Test (Figure 8) shows which means differ significantly from the other means.
Figure 8 Post Hoc Test - Tukey
Crosstab – also known as cross tabulation or contingency table – is a way to display data. It is a joint frequency distribution of two variables, one in the rows and the other in the columns of a matrix. When we display data this way we can see the basic relationship between the variables.
Table 1 Recoded Shotgun in Home and Respondents’ Sex Crosstabulation (a_shotgun, sex)
Shotgun in Home     Respondents’ Sex                    Total
                    Male            Female
Yes                 193 (24.1%)     153 (15.0%)         346 (19.0%)
No                  609 (75.9%)     868 (85.0%)         1477 (81.0%)
Total               802 (100%)      1021 (100%)         1823 (100%)
Source: General Social Survey 2016
n= 2867
Table 1 shows the joint frequency distribution of the variable sex and the variable a_shotgun. The attributes of the variable sex are displayed in columns and the attributes of the variable a_shotgun (the answers to the question: “Do you happen to have in your home any guns or revolvers?”) are in rows. If we want to know whether there is any relationship between respondents’ sex and the presence of a shotgun in the respondent’s home, we can compare the percentage in the cell of interest to the percentage in the corresponding total (marginal). Notice that there are two Totals, one in the last row and one in the last column. Each total corresponds to the row or column it concludes. The total of all respondents is in the cell at the bottom right (1823). This total is lower than the total number of respondents in the survey (2867), which means that some respondents either refused to answer the question, didn’t know, or the question was not applicable to them.
Respondents’ sex is the independent variable and shotgun is the dependent variable. That is why sex is displayed in columns and shotgun in rows.
If there were no relationship between sex and shotgun in home, the percentages in the same row would be approximately the same. In Table 1 the percentage of male respondents who have a shotgun in their home (24.1) is about 5 percentage points higher than the corresponding marginal (19.0). The percentage of female respondents who have a shotgun in their homes (15.0) is 4 percentage points lower than the corresponding marginal. Looking at the crosstab, we can suspect that sex influences the likelihood of respondents having a shotgun in their homes.
To create a crosstab in SPSS, follow the path Analyze → Descriptive Statistics → Crosstabs. Put the independent variable in columns and the dependent variable in rows. Click on Cells and under Percentages select Columns. Hit Continue and OK.
Figure 1 Crosstabs Window
Figure 2 Display Variable Names and Sort Alphabetically Function
Figure 3 Independent and Dependent Variable in Crosstabs Window
Figure 4 Crosstabs – Cell Display
Figure 5 SPSS Output – Crosstabulation of sex and a_shotgun
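Clicking Paste instead of OK yields syntax along these lines, with the variable names from Table 1:

* Crosstab of a_shotgun (rows) by sex (columns) with column percentages.
CROSSTABS
  /TABLES=a_shotgun BY sex
  /FORMAT=AVALUE TABLES
  /CELLS=COUNT COLUMN
  /COUNT ROUND CELL.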
If we want to find out whether there is a statistically significant relationship between two variables, we conduct a statistical test for association – Chi Square.
To create a crosstab and run the Chi Square test in SPSS, follow the path Analyze → Descriptive Statistics → Crosstabs. Put the independent variable in columns and the dependent variable in rows. Click on Cells, under Percentages select Columns, and hit Continue. Click on Statistics, select Chi Square, and click Continue. In the main window hit OK.
Figure 1 Crosstabs – Statistics Window
Figure 2 SPSS Output for Chi Square Including a Crosstab
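The chi-square version only adds a STATISTICS subcommand to the crosstab syntax; a sketch:

* Same crosstab with a chi-square test of association.
CROSSTABS
  /TABLES=a_shotgun BY sex
  /STATISTICS=CHISQ
  /CELLS=COUNT COLUMN
  /COUNT ROUND CELL.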
Figure 2 shows the output that will be produced by SPSS when a Chi Square test is conducted.
The top table is the crosstabulation of the two variables in the analysis. The independent variable is in columns and the dependent variable is in rows.
The table on the bottom of Figure 2 is the information related to the Chi Square test.
Figure 3 is a close up of the table showing the Chi Square statistic, degrees of freedom and level of significance P.
Figure 3 Chi Square Test Statistic
In rows:
Pearson Chi Square – Statistical test for association
In columns:
Value – the value of Chi Square; if this value is larger than the critical value of Chi Square for the given degrees of freedom, there is support for the alternative hypothesis that there is an association between the independent and dependent variable. If the value is lower than the critical value, we fail to reject the null hypothesis (H0) of no association between the independent and dependent variable.
df – degrees of freedom, the value of degrees of freedom is calculated as (the number of columns – 1) times (the number of rows – 1)
df = (C-1) (R-1)
Asymptotic Significance (2-sided) – this is the p-value that is compared to α (alpha); if the p-value is smaller than alpha, we reject the null hypothesis. If the p-value is larger than alpha, we fail to reject the null hypothesis.
p < α → reject H0
p > α → fail to reject H0
Linear regression assumes a linear relationship between the continuous dependent variable and the independent variables in the model.
Figure 1 Path to Scatter/Dot Plot
Figure 2 Scatter/Dot Plot Options Window
Figure 3 Simple Scatterplot Settings Window
Figure 4 Scatter Dot Plot in SPSS Output
Figure 5 Add Line of Best Fit Icon
Figure 6 Scatter/Dot Plot with Line of Best Fit in Chart Editor
Figure 7 Path to Linear Regression
Figure 8 Linear Regression Settings Window
Figure 9 Linear Regression with Dependent Variable and a List of Independent Variables
The variables used in this analysis are:
Dependent variable: MNTLHLTH – number of work days missed in the past 30 days due to poor mental health
Independent variables:
AGE – respondent’s age in years
s_2 – dummy for sex (male)
r_2 – dummy for race (Black)
r_3 – dummy for race (other)
MOREDAYS – number of days worked extra in the past 30 days
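Pasting the dialog in Figures 8 and 9 produces syntax close to this sketch, using the variables listed above:

* Linear regression of MNTLHLTH on age, the sex and race dummies, and MOREDAYS.
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT MNTLHLTH
  /METHOD=ENTER AGE s_2 r_2 r_3 MOREDAYS.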
This box specifies the number of the model and the variables that were entered into the model or removed from it. Variables are typically removed when you choose a method other than Enter. You can specify criteria for keeping or removing variables if you want to run a stepwise regression, for example.
Table 1 Linear Regression Output - Variables Entered and Removed
The Model Summary box again specifies the number of the model and provides R, R square, adjusted R square and the standard error of the estimate for each model.
Table 2 Linear Regression Output – Model Summary
R is the square root of R square and shows the correlation between the observed and predicted values of the dependent variable.
R square shows how much of the variance of the dependent variable can be explained by the independent variables in the model.
Adjusted R square corrects for the inflation of R square when more variables are added to the model; by adding more variables, R square can increase simply due to chance.
Standard Error of the Estimate is the standard deviation of the error. It measures the accuracy of the prediction. Smaller standard error indicates more accurate prediction.
This table shows the sum of squares, degrees of freedom, mean square, F, and p-value.
Table 3 Linear Regression Output – ANOVA
F and Sig. help answer the question: “Do the independent variables reliably predict the dependent variable?” If Sig. (p-value) is smaller than 0.05, the answer is: “Yes, they do”.
For each model, there are several statistics in the Coefficients table.
Table 4 Linear Regression Output - Coefficients
Model column lists the predictor variables for each model, including the constant (Y-intercept). The Y-intercept shows the value of the dependent variable when all the predictors are held at 0.
Unstandardized Coefficients - B - measures the change in the dependent variable associated with a one-unit increase in the independent variable; the coefficient is unstandardized because it is measured in the variable's natural units. That means that we cannot tell which predictor is more influential.
- Std. Error – shows the standard error of the coefficient and helps determine whether the coefficient is significantly different from 0, form a confidence interval, and calculate the t value.
Standardized Coefficients Beta can be compared to each other because all the variables in the model have been standardized before running the regression.
Sig. shows the p-value for the test of the hypothesis that the coefficient is not different from 0 (2-tailed test). If the value in the Sig. column is lower than alpha (typically 0.05), the coefficient is significantly different from 0.
Sometimes variables have more attributes than we need for our analysis or are assigned a numeric value that does not make sense for our analysis. In such cases, we can reduce the number of attributes or assign a different numeric value to the attributes by recoding the variable.
If we want to recode the attributes of a variable, we need to know how they are coded in the first place. You can either check the coding scheme in the codebook that should be available with the dataset you are using, or you can see how the attributes were coded under Values in the Variable View in SPSS.
SPSS Procedure:
Figure 1 Where to Check the Coding Scheme of a Variable
Figure 2 Coding Scheme for Respondents’ Marital Status
Let’s say that for our analysis, we are only interested in knowing whether or not people are married. The marital variable contains “too much” information. We need to make sure that all those who answered single, divorced or widowed are grouped together in a new category, “not married”. The Recode into Different Variables command is useful here.
SPSS Procedure:
Follow the path: Transform -> Recode into Different Variables
Figure 3 Path to Recode into Different Variables Command
Figure 4 Recode into Different Variables Pop-Up Window
Figure 5 Display Variable Names and Sort Alphabetically Function
Figure 6 Name and Label New Variable
Figure 7 Recode into Different Variables - Old and New Values Pop-Up Window
In this window (Figure 7) you indicate how you want the values to be transformed. The left panel is where you input the old, or original, values. The top right panel is where you input what you want the new values to be. You can see what SPSS will do in the white panel on the bottom right after you have set one command up and clicked Add.
1 = single, 2 = married, 3 = divorced and 4 = widowed
Old value           New value
1 = single          1 = not married
2 = married         2 = married
3 = divorced        1 = not married
4 = widowed         1 = not married
Figure 8 Assigning New Values in Recode into Different Variables - Old and New Values Pop-Up Window
Figure 9 Copy All Unchanged Values
Once you hit OK, it will seem as though nothing happened. If you look at the Output window, the only change there is a few lines of code that were not there before (Figure 10). This code is very important: it tells us the names of the original and new variables and the changes we made.
Figure 10 SPSS Output After Executing Recode into Different Variables Command
Let’s break down how to read the output:
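The pasted lines will look roughly like the sketch below; maritalR is a hypothetical name for the new variable (yours will show whatever name you typed in Figure 6):

* Map single (1), divorced (3) and widowed (4) to 1 = not married; keep married as 2.
RECODE marital (1=1) (3=1) (4=1) (2=2) INTO maritalR.
EXECUTE.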
Another very important thing that happens once you execute the operation is that a new variable appears in the Variable View of the Data window. The new variable will be listed at the very end, after all the other variables, and its values will not be labeled.
It is useful to label the variable values so you always know what the numeric values mean. This is especially true when you create a new variable, because this variable will not be in the codebook, so later there would be nowhere to look up what the values represent.
Figure 11 Recoded Variable Without Value Labels in Variable View
Figure 12 Labeling Attributes of Recoded Variable
Check Your Work
To check if recoding was done correctly we can run a frequency distribution of the original and recoded variable and compare them.
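In syntax, this check is one command (again assuming the new variable is called maritalR):

* Compare the distributions of the original and recoded variable.
FREQUENCIES VARIABLES=marital maritalR
  /ORDER=ANALYSIS.

The “married” counts in the two tables should match exactly, and the “not married” count should equal the sum of single, divorced and widowed.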
Figure 13 Frequency Distribution of Respondents’ Marital Status
Figure 14 Frequency Distribution of Recoded Respondents’ Marital Status