SPSS Assignment 1: Data Quality and Reliability Analysis¶

Jason O'Brien
CI 7303 — Psychometric Methods
22 March 2026


This assignment walks through importing survey data, evaluating data quality, assessing internal consistency (Cronbach's alpha), and creating scale scores. Python is used in place of SPSS; each procedure mirrors its SPSS counterpart.

In [1]:
import pandas as pd
import numpy as np
import pyreadstat
import matplotlib.pyplot as plt
from IPython.display import display, Markdown

# Load the practice data
df, meta = pyreadstat.read_sav('SPSSAssignment1_Practice_Data.sav')
print(f'Dataset: {df.shape[0]} observations, {df.shape[1]} variables')
df.head()
Dataset: 384 observations, 19 variables
Out[1]:
ID sect16_2 sect17_2_r sect18_2 sect19_2 sect20_2_r sect21_2 sect22_2 sect23_2 sft70_2 sft71_2 sft72_2_r sft73_2 sft74_2 Anxiety InformationProcessing Exam_Score Homework_Score CourseCredit
0 3001.0 4.0 4.0 5.0 4.0 4.0 5.0 4.0 4.0 3.0 2.0 1.0 3.0 3.0 2.888889 2.777778 77.163067 44.218159 1.0
1 3002.0 3.0 3.0 4.0 4.0 3.0 4.0 4.0 4.0 3.0 5.0 5.0 3.0 3.0 1.888889 4.000000 58.389741 72.248523 0.0
2 3003.0 5.0 5.0 4.0 5.0 5.0 5.0 5.0 5.0 3.0 4.0 2.0 3.0 3.0 5.000000 2.777778 84.329808 98.737232 1.0
3 3004.0 5.0 5.0 5.0 4.0 5.0 4.0 5.0 4.0 4.0 2.0 5.0 3.0 4.0 2.888889 3.444444 71.864053 92.913460 1.0
4 3005.0 1.0 1.0 3.0 2.0 3.0 2.0 3.0 4.0 5.0 5.0 4.0 3.0 5.0 1.111111 4.666667 47.051978 NaN 0.0

1. Data Overview¶

The dataset contains two sets of scale items, pre-computed scale scores, and outcome variables:

| Variable Group | Variables | Description |
|---|---|---|
| SECT items | sect16_2 – sect23_2 (8 items) | Scale items; _r = reverse-coded |
| SFT items | sft70_2 – sft74_2 (5 items) | Scale items; _r = reverse-coded |
| Pre-computed scales | Anxiety, InformationProcessing | Already-calculated scale scores |
| Outcomes | Exam_Score, Homework_Score, CourseCredit | Performance measures |

2. Descriptive Statistics (Means, SDs, Frequencies)¶

In [2]:
# Means and standard deviations for all variables
desc = df.describe().T[['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']]
desc = desc.round(3)
display(desc)
count mean std min 25% 50% 75% max
ID 384.0 3192.500 110.995 3001.000 3096.750 3192.500 3288.250 3384.0
sect16_2 384.0 3.542 1.419 1.000 2.000 4.000 5.000 5.0
sect17_2_r 384.0 3.159 1.342 1.000 2.000 3.000 4.000 5.0
sect18_2 384.0 3.607 1.202 1.000 3.000 4.000 5.000 9.0
sect19_2 384.0 3.581 1.217 1.000 3.000 4.000 5.000 5.0
sect20_2_r 384.0 3.232 1.398 1.000 2.000 3.000 5.000 5.0
sect21_2 377.0 3.101 1.331 1.000 2.000 3.000 4.000 5.0
sect22_2 384.0 4.042 1.158 1.000 3.000 4.000 5.000 5.0
sect23_2 384.0 3.888 1.108 1.000 3.000 4.000 5.000 5.0
sft70_2 384.0 3.109 1.236 1.000 2.000 3.000 4.000 5.0
sft71_2 384.0 3.013 1.227 0.000 2.000 3.000 4.000 5.0
sft72_2_r 384.0 3.201 1.278 1.000 2.000 3.000 4.000 5.0
sft73_2 384.0 3.169 1.313 1.000 2.000 3.000 4.000 5.0
sft74_2 371.0 3.668 1.208 1.000 3.000 4.000 5.000 5.0
Anxiety 384.0 2.627 1.082 1.000 1.778 2.611 3.361 5.0
InformationProcessing 384.0 3.346 0.745 1.111 2.778 3.333 3.889 5.0
Exam_Score 354.0 64.500 19.595 11.363 50.649 63.806 78.666 115.0
Homework_Score 379.0 55.774 26.492 0.000 38.324 56.464 73.001 150.0
CourseCredit 384.0 0.451 0.498 0.000 0.000 0.000 1.000 1.0
In [3]:
# Frequency distributions for the scale items
sect_items = ['sect16_2', 'sect17_2_r', 'sect18_2', 'sect19_2', 'sect20_2_r', 'sect21_2', 'sect22_2', 'sect23_2']
sft_items = ['sft70_2', 'sft71_2', 'sft72_2_r', 'sft73_2', 'sft74_2']

print('=== SECT Item Frequencies ===')
for item in sect_items:
    print(f'\n{item}:')
    print(df[item].value_counts().sort_index())

print('\n=== SFT Item Frequencies ===')
for item in sft_items:
    print(f'\n{item}:')
    print(df[item].value_counts().sort_index())
=== SECT Item Frequencies ===

sect16_2:
sect16_2
1.0     56
2.0     41
3.0     55
4.0    103
5.0    129
Name: count, dtype: int64

sect17_2_r:
sect17_2_r
1.0    54
2.0    80
3.0    78
4.0    95
5.0    77
Name: count, dtype: int64

sect18_2:
sect18_2
1.0     26
2.0     33
3.0    117
4.0    102
5.0    105
9.0      1
Name: count, dtype: int64

sect19_2:
sect19_2
1.0     27
2.0     45
3.0    101
4.0    100
5.0    111
Name: count, dtype: int64

sect20_2_r:
sect20_2_r
1.0    59
2.0    64
3.0    89
4.0    73
5.0    99
Name: count, dtype: int64

sect21_2:
sect21_2
1.0    51
2.0    84
3.0    97
4.0    66
5.0    79
Name: count, dtype: int64

sect22_2:
sect22_2
1.0     17
2.0     28
3.0     63
4.0     90
5.0    186
Name: count, dtype: int64

sect23_2:
sect23_2
1.0     20
2.0     15
3.0     94
4.0    114
5.0    141
Name: count, dtype: int64

=== SFT Item Frequencies ===

sft70_2:
sft70_2
1.0     46
2.0     67
3.0    139
4.0     63
5.0     69
Name: count, dtype: int64

sft71_2:
sft71_2
0.0      1
1.0     50
2.0     76
3.0    127
4.0     76
5.0     54
Name: count, dtype: int64

sft72_2_r:
sft72_2_r
1.0     53
2.0     55
3.0    106
4.0    102
5.0     68
Name: count, dtype: int64

sft73_2:
sft73_2
1.0     57
2.0     52
3.0    124
4.0     71
5.0     80
Name: count, dtype: int64

sft74_2:
sft74_2
1.0     21
2.0     41
3.0    103
4.0     81
5.0    125
Name: count, dtype: int64

Interpretation¶

Most item means fall in the 3.0–4.0 range with standard deviations between 1.1 and 1.4, which is reasonable for 5-point Likert-type items. Two items stand out in the descriptive statistics as needing attention:

  • sect18_2 has a maximum of 9.0, which exceeds the 1–5 response scale.
  • sft71_2 has a minimum of 0.0, which falls below the 1–5 response scale.

These are non-plausible values and will be addressed in Section 4. Standard deviations are otherwise consistent across items, suggesting no ceiling or floor effects in the remaining data. The pre-computed Anxiety and InformationProcessing scores show means of 2.63 and 3.35 respectively, both within their expected ranges.

3. Missing Data¶

In [4]:
# Missing data summary
missing = pd.DataFrame({
    'Valid': df.count(),
    'Missing': df.isnull().sum(),
    'Pct_Missing': (df.isnull().sum() / len(df) * 100).round(2)
})
display(missing[missing['Missing'] > 0])
print(f'\nVariables with no missing data: {(missing["Missing"] == 0).sum()} of {len(missing)}')
Valid Missing Pct_Missing
sect21_2 377 7 1.82
sft74_2 371 13 3.39
Exam_Score 354 30 7.81
Homework_Score 379 5 1.30
Variables with no missing data: 15 of 19

Interpretation¶

Four variables have missing data. For the scale items, missingness is minimal: sect21_2 is missing 7 cases (1.8%) and sft74_2 is missing 13 cases (3.4%). Among the outcome variables, Exam_Score has the most missingness at 30 cases (7.8%), and Homework_Score is missing 5 cases (1.3%). The remaining 15 variables have complete data.

Missing data is distributed across different variables rather than concentrated in a single item or a single cluster of respondents, suggesting the missingness is random rather than systematic. At these rates, listwise deletion (the default in most analyses) will not meaningfully reduce statistical power or bias results.

A note on missing data mechanisms. Missing data falls into three categories, and which one applies determines how serious the problem is:

  • MCAR (Missing Completely At Random): Missingness is unrelated to anything in the dataset. A server error drops records, or a respondent skips a question because they sneezed. No bias. This is the easiest case.
  • MAR (Missing At Random): Missingness is related to an observed variable. Sophomores skip an item more often than seniors, but you can see class standing. Manageable with techniques like multiple imputation.
  • MNAR (Missing Not At Random): Missingness is related to the missing value itself. A student with high anxiety skips the anxiety question. This is the dangerous case because the missingness is driven by the unobserved value, and no statistical technique can fully correct for it.

In this dataset, the pattern of missingness across unrelated variables is consistent with MCAR or MAR. There is no indication of MNAR.
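One informal way to probe the MCAR/MAR distinction is to compare an observed variable between cases with and without a missing value on another variable; a large group difference argues against MCAR. The sketch below is illustrative only: the `compare_by_missingness` helper and the synthetic `demo` frame are hypothetical, and the real call would use the loaded `df` (e.g., with `missing_var='Exam_Score'` and `observed_var='Anxiety'`).

```python
import numpy as np
import pandas as pd

def compare_by_missingness(df, missing_var, observed_var):
    """Mean of observed_var for cases missing vs. not missing missing_var."""
    is_missing = df[missing_var].isnull()
    return (df.loc[is_missing, observed_var].mean(),
            df.loc[~is_missing, observed_var].mean())

# Synthetic stand-in data; the assignment would pass the real df instead.
rng = np.random.default_rng(0)
demo = pd.DataFrame({'Exam_Score': rng.normal(65, 20, 100),
                     'Anxiety': rng.normal(2.6, 1.1, 100)})
demo.loc[demo.sample(10, random_state=0).index, 'Exam_Score'] = np.nan

m_missing, m_present = compare_by_missingness(demo, 'Exam_Score', 'Anxiety')
print(round(m_missing, 2), round(m_present, 2))
```

A formal alternative is Little's MCAR test, but for missingness this sparse an eyeball comparison of group means is usually sufficient.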

4. Non-Plausible Values and Outliers¶

In [5]:
# Check for non-plausible values in scale items (expected range: 1-5)
all_items = sect_items + sft_items
print('=== Non-Plausible Values (outside 1-5 range) ===')
for item in all_items:
    out_of_range = df[(df[item] < 1) | (df[item] > 5)][item]
    if len(out_of_range) > 0:
        print(f'\n  {item}: {len(out_of_range)} case(s)')
        print(f'    Values: {out_of_range.values}')
        print(f'    IDs: {df.loc[out_of_range.index, "ID"].values}')
=== Non-Plausible Values (outside 1-5 range) ===

  sect18_2: 1 case(s)
    Values: [9.]
    IDs: [3102.]

  sft71_2: 1 case(s)
    Values: [0.]
    IDs: [3334.]
In [6]:
# Boxplots for scale items
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

df[sect_items].boxplot(ax=axes[0], rot=45)
axes[0].set_title('SECT Items')
axes[0].set_ylabel('Response Value')

df[sft_items].boxplot(ax=axes[1], rot=45)
axes[1].set_title('SFT Items')
axes[1].set_ylabel('Response Value')

plt.tight_layout()
plt.show()
[Figure: side-by-side boxplots of the SECT items and SFT items]
In [7]:
# Boxplots for outcome variables
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

for ax, var in zip(axes, ['Exam_Score', 'Homework_Score', 'CourseCredit']):
    df[[var]].boxplot(ax=ax)
    ax.set_title(var)

plt.tight_layout()
plt.show()
[Figure: boxplots of Exam_Score, Homework_Score, and CourseCredit]

Interpretation¶

Two non-plausible values were identified in the scale items:

  1. sect18_2, ID 3102: value = 9. This exceeds the 1–5 response range and cannot represent a valid response. It is a data entry error.
  2. sft71_2, ID 3334: value = 0. This falls below the 1–5 response range. The lowest valid response is 1.

These are not outliers (extreme but possible values); they are impossible values on the measurement scale. Because no valid interpretation exists for these responses, they are recoded to missing (NaN) rather than imputed. Imputation assumes a true score exists to be estimated — for an impossible value, there is no underlying true score to recover.

The boxplots for the outcome variables show some extreme values (e.g., Homework_Score near 150, Exam_Score above 100). These may represent legitimate performance outcomes (extra credit, bonus points) or errors, but without additional context about the scoring rubric, they are retained and flagged for consideration.
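One way to make "flagged for consideration" concrete is the standard boxplot rule: flag values beyond 1.5 × IQR from the quartiles. This is a sketch, not part of the assignment output, and the `demo` Series is synthetic; flagged values are reviewed, not removed.

```python
import pandas as pd

def iqr_flags(s):
    """Boolean mask: True where a value falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Synthetic scores with one extreme value
demo = pd.Series([50, 55, 60, 62, 65, 70, 150])
print(demo[iqr_flags(demo)].tolist())  # [150]
```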

In [8]:
# Recode non-plausible values to NaN
print('Before cleaning:')
print(f'  sect18_2 max: {df["sect18_2"].max()}')
print(f'  sft71_2 min: {df["sft71_2"].min()}')

# sect18_2: value of 9 is impossible on a 1-5 scale
df.loc[df['sect18_2'] > 5, 'sect18_2'] = np.nan

# sft71_2: value of 0 is impossible on a 1-5 scale
df.loc[df['sft71_2'] < 1, 'sft71_2'] = np.nan

print('\nAfter cleaning:')
print(f'  sect18_2 max: {df["sect18_2"].max()}, missing: {df["sect18_2"].isnull().sum()}')
print(f'  sft71_2 min: {df["sft71_2"].min()}, missing: {df["sft71_2"].isnull().sum()}')
Before cleaning:
  sect18_2 max: 9.0
  sft71_2 min: 0.0

After cleaning:
  sect18_2 max: 5.0, missing: 1
  sft71_2 min: 1.0, missing: 1

5. Reliability Analysis (Cronbach's Alpha)¶

In [9]:
def cronbachs_alpha(items_df):
    """Calculate Cronbach's alpha for a set of items."""
    items = items_df.dropna()
    k = items.shape[1]
    item_vars = items.var(ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
    return alpha

def alpha_if_deleted(items_df):
    """Calculate alpha-if-item-deleted and corrected item-total correlations."""
    items = items_df.dropna()
    results = []
    total = items.sum(axis=1)
    
    for col in items.columns:
        # Alpha if this item is deleted
        remaining = items.drop(columns=[col])
        a = cronbachs_alpha(remaining)
        
        # Corrected item-total correlation (correlation with total minus this item)
        corrected_total = total - items[col]
        r = items[col].corr(corrected_total)
        
        results.append({'Item': col, 'Alpha_if_Deleted': round(a, 4), 'Corrected_Item_Total_r': round(r, 4)})
    
    return pd.DataFrame(results)
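As a sanity check on the formula implemented above (alpha = k/(k-1) * (1 - sum of item variances / variance of the total)), three perfectly correlated items should yield alpha = 1.0 exactly. A minimal, self-contained verification:

```python
import pandas as pd

def cronbachs_alpha(items_df):
    """Standard alpha: (k/(k-1)) * (1 - sum of item variances / variance of total)."""
    items = items_df.dropna()
    k = items.shape[1]
    item_vars = items.var(ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Three identical items: item variances sum to 7.5, total variance is 22.5,
# so alpha = (3/2) * (1 - 7.5/22.5) = 1.0.
perfect = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                        'b': [1, 2, 3, 4, 5],
                        'c': [1, 2, 3, 4, 5]})
print(round(cronbachs_alpha(perfect), 3))  # 1.0
```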
In [10]:
# SECT scale reliability (after cleaning non-plausible values)
sect_alpha = cronbachs_alpha(df[sect_items])
print(f'SECT Scale — Cronbach\'s Alpha: {sect_alpha:.4f}')
print(f'Number of items: {len(sect_items)}')
print()

sect_aid = alpha_if_deleted(df[sect_items])
display(sect_aid)
SECT Scale — Cronbach's Alpha: 0.8564
Number of items: 8

Item Alpha_if_Deleted Corrected_Item_Total_r
0 sect16_2 0.8422 0.5807
1 sect17_2_r 0.8358 0.6276
2 sect18_2 0.8344 0.6457
3 sect19_2 0.8411 0.5828
4 sect20_2_r 0.8399 0.5965
5 sect21_2 0.8406 0.5884
6 sect22_2 0.8393 0.6016
7 sect23_2 0.8403 0.5951
In [11]:
# SFT scale reliability (after cleaning non-plausible values)
sft_alpha = cronbachs_alpha(df[sft_items])
print(f'SFT Scale — Cronbach\'s Alpha: {sft_alpha:.4f}')
print(f'Number of items: {len(sft_items)}')
print()

sft_aid = alpha_if_deleted(df[sft_items])
display(sft_aid)
SFT Scale — Cronbach's Alpha: 0.6626
Number of items: 5

Item Alpha_if_Deleted Corrected_Item_Total_r
0 sft70_2 0.6082 0.4215
1 sft71_2 0.6056 0.4274
2 sft72_2_r 0.6247 0.3860
3 sft73_2 0.6103 0.4179
4 sft74_2 0.6065 0.4259

6. Interpretation: Alpha if Item Deleted¶

SECT Scale (α = .856, 8 items) — Good reliability.

All alpha-if-item-deleted values fall below the overall alpha of .856, meaning no individual item weakens the scale. If removing any item caused alpha to increase, that would signal the item is inconsistent with the rest; here the opposite holds, and every item contributes to internal consistency. Corrected item-total correlations range from .58 to .65, indicating each item correlates moderately to strongly with the composite of the remaining items. This is a well-functioning scale.

SFT Scale (α = .663, 5 items) — Below the conventional .70 threshold.

No single item is responsible for the low alpha. All alpha-if-item-deleted values are below .663 (the lowest is .606 for sft70_2), so removing any item would make reliability worse, not better. Corrected item-total correlations (.39–.43) are modest but consistent.

The most likely explanation for the low alpha is the small number of items. Alpha is a function of both inter-item correlation and item count; with only 5 items, even moderate inter-item correlations produce a lower alpha than the same correlations would in an 8-item scale. The SFT items correlate with each other at roughly the same level as the SECT items, but the shorter scale length pulls the coefficient down. This is a known limitation of alpha as a reliability estimate for short scales.
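The scale-length argument can be quantified with the Spearman-Brown prophecy formula, which projects reliability if the scale were lengthened by a factor k, under the assumption that any added items are parallel to the existing ones (an assumption, not a guarantee):

```python
def spearman_brown(alpha, k):
    """Projected reliability after lengthening a scale by factor k,
    assuming added items are parallel to the originals."""
    return k * alpha / (1 + (k - 1) * alpha)

# If the 5-item SFT scale (alpha = .663) were lengthened to 8 items (k = 8/5):
projected = spearman_brown(0.663, 8 / 5)
print(f'{projected:.3f}')  # 0.759
```

Lengthening the SFT scale to 8 parallel items projects an alpha of about .76, consistent with the claim that item count, not item quality, is the limiting factor.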

An alternative possibility is that the 5 SFT items are tapping slightly different facets of the construct, which would reduce internal consistency. Further analysis (e.g., factor analysis) would be needed to evaluate this.

7. Create Scale Scores¶

In [12]:
# Mean scale scores (recommended approach)
df['SECT_Mean'] = df[sect_items].mean(axis=1)
df['SFT_Mean'] = df[sft_items].mean(axis=1)

print('=== New Scale Score Descriptives ===')
display(df[['SECT_Mean', 'SFT_Mean']].describe().round(3))
=== New Scale Score Descriptives ===
SECT_Mean SFT_Mean
count 384.000 384.000
mean 3.518 3.229
std 0.900 0.819
min 1.125 1.400
25% 2.875 2.600
50% 3.625 3.200
75% 4.250 3.800
max 5.000 5.000
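A caveat on mean scoring: pandas' `mean(axis=1)` averages whatever items are present, so a respondent who answered only one item still receives a "full" scale score (which is why the scale scores above show zero missing). SPSS's MEAN.k convention instead requires at least k valid items before computing the mean. A sketch of that rule follows; the `mean_score` helper and `demo` data are illustrative, not part of the assignment.

```python
import numpy as np
import pandas as pd

def mean_score(items_df, min_valid):
    """Row mean, set to NaN for rows with fewer than min_valid answered items
    (analogous to SPSS's MEAN.k convention)."""
    score = items_df.mean(axis=1)
    score[items_df.notna().sum(axis=1) < min_valid] = np.nan
    return score

demo = pd.DataFrame({'i1': [4, np.nan, 5],
                     'i2': [2, np.nan, np.nan],
                     'i3': [3, 1, np.nan]})
print(mean_score(demo, min_valid=2).tolist())  # [3.0, nan, nan]
```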

8. Final Verification¶

In [13]:
# Check scale scores for missing data and outliers
print('=== Scale Score Missing Data ===')
for scale in ['SECT_Mean', 'SFT_Mean']:
    n_missing = df[scale].isnull().sum()
    print(f'{scale}: {n_missing} missing ({n_missing/len(df)*100:.1f}%)')

print()

# Boxplots of final scale scores
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df[['SECT_Mean']].boxplot(ax=axes[0])
axes[0].set_title('SECT Mean Score')
df[['SFT_Mean']].boxplot(ax=axes[1])
axes[1].set_title('SFT Mean Score')
plt.tight_layout()
plt.show()
=== Scale Score Missing Data ===
SECT_Mean: 0 missing (0.0%)
SFT_Mean: 0 missing (0.0%)

[Figure: boxplots of the SECT_Mean and SFT_Mean scale scores]

Summary and Interpretation¶

Data Quality¶

The dataset contained 384 observations across 19 variables. Two non-plausible values were identified in the scale items: a response of 9 on sect18_2 (ID 3102) and a response of 0 on sft71_2 (ID 3334). Both fall outside the valid 1–5 response range and were recoded to missing as data entry errors. Missing data across the dataset was minimal (1.3%–7.8% on affected variables) and distributed across different variables, consistent with data missing at random.

Reliability¶

The SECT scale (8 items) demonstrated good internal consistency (α = .856). All items contributed positively to scale reliability, with corrected item-total correlations between .58 and .65. No items were flagged for removal.

The SFT scale (5 items) fell below the conventional .70 threshold (α = .663). However, no individual item was responsible for the low reliability — all alpha-if-item-deleted values were lower than the overall alpha. The most plausible explanation is the small number of items rather than poor item quality. The corrected item-total correlations (.39–.43) are consistent but modest, and the scale would likely benefit from additional items to improve reliability.

Scale Scores¶

Mean scale scores were computed for both scales. The SECT mean score (M = 3.52, SD = 0.90) and SFT mean score (M = 3.23, SD = 0.82) both fall near the midpoint of the 1–5 range, indicating adequate variability in the sample. Neither scale score showed evidence of floor or ceiling effects.

Deliverables¶

| Deliverable | Status |
|---|---|
| Descriptive statistics table | Section 2 |
| Reliability output (Cronbach's alpha) | Section 5 |
| Interpretation of weak items | Section 6 |
| Final scale scores | Sections 7–8 |

AI Use Disclosure¶

This assignment was completed using Claude (Anthropic) as a collaborative tool through the Claude Code CLI. The AI assisted with:

  • Code execution: Python was used in place of SPSS. Claude wrote the analysis code (pyreadstat for data import, pandas for descriptives, custom functions for Cronbach's alpha and alpha-if-item-deleted).
  • Interpretation drafting: Initial interpretations were drafted by Claude and then reviewed, discussed, and revised through a Socratic dialogue process.
  • Concept reinforcement: Key concepts (Cronbach's alpha, non-plausible values vs. outliers, missing data patterns, mean vs. sum scoring) were discussed before interpretations were finalized.

What I did: Directed the analysis, identified what needed to be done at each step, explained my understanding of each concept before Claude drafted interpretations, caught and corrected conceptual gaps, and made final editorial decisions. I reviewed all output and confirmed that interpretations matched my understanding of the material.

What Claude did: Wrote the Python code, generated statistical output, drafted interpretation text based on our discussions, and connected psychometric concepts to my professional context.

Process documentation: A detailed experiment log documenting every interaction, concept discussion, and decision point is maintained at for_jason/coursework-ai-experiment-log.md and version-controlled via git. This log records my initial understanding of each concept, where corrections were needed, and what was reinforced — serving as evidence that the learning process, not just the output, was the focus of this work.