Mental Illness, Aging and Self-Reported Health in Baby Boomers

Identifying Mental Health Trends in Aging Adults

Miguel Angel Santana II, MBA
Jan 30, 2021

Methodology

My background includes a long stint in Medicare insurance sales. For context, these insurance products work alongside Medicare in the United States, which covers individuals aged 65 and older (with some exceptions).

The baby boomer generation (55–75 years old) is often underserved as our society has become entangled with technology. Research shows an increase in diagnoses of depression and other mental illnesses in aging adults. Additional research on these trends is available via The Family Institute at Northwestern.

Given COVID-19’s effect on mental health around the world, it is especially important to understand what trends affect our elders as we work toward favorable living conditions for all. The Centers for Disease Control and Prevention released the Alzheimer’s Disease and Healthy Aging dataset used in this analysis.

The Jupyter notebook and clustering analysis can be found on GitHub under the username ‘miguelangelsantana’.

Our analysis will look for commonalities across individuals based on ethnicity, age, gender and geographic region. While no definitive conclusions will be drawn due to limitations of the dataset, K-Means clustering algorithms will help identify highly related features and insight into baby boomers as they age in our society.

The OSEMN framework was used to structure this analysis.

- Obtain
- Scrub
- Explore
- Model
- Interpret

Scrub

During the scrubbing phase, a few key decisions were made to make the dataset more digestible for exploratory data analysis. Columns that duplicated existing values (such as abbreviations for an existing column) and other redundancies were removed. Examples include the columns Datasource and Response, each of which holds a single value across the entire dataset: BRFSS (Behavioral Risk Factor Surveillance System) and None, respectively.

Next, generic labels like StratificationCategory were renamed to better reflect the data they headed. Many values were also shortened, since long phrases made visualization challenging. Special care was taken when reducing the Topic values, as they represent the key factors of our analysis. Please see our Jupyter notebook for additional insight into these changes. Lastly, our analysis focused on individuals who reside in the western United States, referred to here as the West Coast. The states included were Arizona, California, Colorado, Nevada, Oregon and Washington.
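The scrubbing steps above can be sketched roughly as follows. This is a minimal illustration on toy data, not the notebook’s actual code: the column names `LocationDesc` and `Stratification1`, and the rename target used here, are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the CDC extract (the real file has many more columns).
# 'LocationDesc' and 'Stratification1' are assumed column names.
raw = pd.DataFrame({
    "LocationDesc": ["Oregon", "Texas", "Colorado", "Nevada"],
    "Datasource": ["BRFSS"] * 4,
    "Response": [None] * 4,
    "Stratification1": ["50-64", "65+", "Overall", "Overall"],
    "Topic": ["Obesity", "Obesity", "Binge drinking", "Obesity"],
    "Value": [30.1, 33.0, 12.4, 28.7],
})

# A column with a single distinct value (counting NaN) carries no information.
constant_cols = [c for c in raw.columns if raw[c].nunique(dropna=False) <= 1]

# Drop the constant columns and give a generic label a descriptive name.
scrubbed = raw.drop(columns=constant_cols).rename(
    columns={"Stratification1": "Age_Group"}  # hypothetical mapping
)

# Keep only the six states used in the analysis.
west_states = ["Arizona", "California", "Colorado", "Nevada", "Oregon", "Washington"]
scrubbed = scrubbed[scrubbed["LocationDesc"].isin(west_states)].reset_index(drop=True)

print(constant_cols)
print(scrubbed)
```

Detecting constant columns programmatically, rather than listing them by hand, makes it harder to miss a redundant column in a wide dataset.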

Exploratory Data Analysis

Three key columns represent the stratifications of the data set. They are Ethnicity_Gender, Age_Group and GenderEthnLabel. With so many factors, the analysis looked to find the most significant variation amongst divided categories (age, gender, ethnicity) before narrowing down one category and exploring the subdivisions. For reference, each category is divided as follows:

Ethnicity: ‘White, non-Hispanic’, ‘Asian/Pacific Islander’, ‘Native Am/Alaskan Native’, ‘Black, non-Hispanic’, ‘Hispanic’, ‘Overall’

Gender: ‘Male’, ‘Female’, ‘Overall’

Age: ‘50–64’, ‘65+’, ‘Overall’

Overall (By State)

Prior to addressing the three categories, a quick visualization was created using common features per West Coast state.

# Overall Age, Ethnicity, Gender
overall = data.loc[(data['Ethnicity_Gender'] == 'Overall') &
                   (data['Age_Group'] == 'Overall') &
                   (data['Value'] != 0)]  # removing null values

A key obstacle in the analysis was the fact that health questionnaires did not always address the same topics (per year) so the distribution of values was in question. To make up for this, the ‘overall’ analysis was conducted on questionnaires that occurred in the same year, on topics that occurred across each questionnaire.
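One way to restrict a comparison to a single year and to topics that every state reported is sketched below. The toy data and the `State` and `Year` column names are assumptions for illustration; the notebook may implement this differently.

```python
import pandas as pd

# Toy questionnaire data; 'State' and 'Year' are assumed column names.
surveys = pd.DataFrame({
    "State": ["Oregon", "Oregon", "Colorado", "Colorado", "Nevada", "Nevada", "Oregon"],
    "Year": [2018, 2018, 2018, 2018, 2018, 2018, 2017],
    "Topic": ["Obesity", "Sleep", "Obesity", "Sleep", "Obesity", "Leisure", "Sleep"],
    "Value": [30.0, 62.0, 24.0, 70.0, 29.0, 18.0, 61.0],
})

# Step 1: keep a single questionnaire year.
year_df = surveys[surveys["Year"] == 2018]

# Step 2: keep only topics reported by every state in that year.
n_states = year_df["State"].nunique()
states_per_topic = year_df.groupby("Topic")["State"].nunique()
common_topics = states_per_topic[states_per_topic == n_states].index.tolist()

comparable = year_df[year_df["Topic"].isin(common_topics)]
print(common_topics)
print(comparable)
```

In this toy example only ‘Obesity’ survives, since ‘Sleep’ and ‘Leisure’ were not asked in all three states.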

For obesity, binge drinking (in the last month) and self-reported mental distress, Oregon reported the highest percentage of adults across the three categories combined (in 2018). Interestingly enough, Colorado reported the lowest percentages of adult obesity and mental distress while reporting the highest percentage of adult binge drinking (past month).

For sufficient sleep, lifetime diagnosis of depression and lack of leisure time (in the past month), Colorado questionnaires reported the highest percentage of adults with sufficient sleep, the lowest rate of lifetime depression diagnosis and the lowest ‘No Leisure’ scores (meaning most individuals had sufficient leisure).

Seems like 2018 was a great year for baby boomers in Colorado when compared to other West Coast states.

Ethnicity

In the following exploratory analysis, all states and ages were represented with the only stratification being ethnicity.

ethn_df = data.loc[(data['Age_Group'] == 'Overall') &
                   (data['GenderEthnLabel'] == 'Race/Ethnicity') &
                   (data['Ethnicity_Gender'].isin(ethnicity)) &
                   (data['Value'] != 0)]

Unfortunately, the last row of features includes several ethnicities that were not represented across each of the topics and as such, our analysis will continue by considering age and gender instead of ethnicity. It is interesting to note commonalities amongst ethnic groups who report certain qualities but ultimately no significant trends are apparent and additional factors like sample size and financial status may heavily influence these features.

Age

The data frame was subdivided into age categories ‘50–64’ and ‘65+’.

agedf = data.loc[(data['Age_Group'] != 'Overall') &
                 (data['GenderEthnLabel'] == 'Overall') &
                 (data['Ethnicity_Gender'] == 'Overall') &
                 (data['Value'] != 0)]

Interesting trends begin to emerge. Compared to those over the age of 65, baby boomers in the 50–64 age group report substantially higher percentages of smoking, need for assistance (related to cognitive decline), lifetime depression diagnosis, mental distress, binge drinking (in the last month), being a caregiver for another person (in the last month), expecting to care for another person (in the next two years), and impacts on the performance of daily activities.
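A comparison like the one above can be tabulated by averaging each topic per age group and taking the difference. The sketch below uses made-up values and ASCII age labels; both are assumptions for illustration and do not come from the dataset.

```python
import pandas as pd

# Illustrative labels and values only; not figures from the CDC data.
YOUNGER, OLDER = "50-64", "65+"
toy = pd.DataFrame({
    "Topic": ["Smoking"] * 4 + ["Sufficient sleep"] * 4,
    "Age_Group": [YOUNGER, YOUNGER, OLDER, OLDER] * 2,
    "Value": [15.0, 17.0, 9.0, 11.0, 60.0, 62.0, 70.0, 72.0],
})

# Mean reported percentage per topic and age group, one column per group.
comparison = (
    toy.groupby(["Topic", "Age_Group"])["Value"]
    .mean()
    .unstack("Age_Group")
)

# Positive gap: the younger group reports a higher percentage.
comparison["gap"] = comparison[YOUNGER] - comparison[OLDER]
comparison = comparison.sort_values("gap", ascending=False)
print(comparison)
```

Sorting by the gap puts the topics where the 50–64 group most exceeds the 65+ group at the top of the table.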

While this data is limited in scope, research has attempted to quantify the stress of approaching retirement, where readiness is affected by increases in the cost of living, medical expenses, dependence on technology and long-term care. Many of the significant variations may be influenced by outside factors, but the commonalities across features help us understand some of the more pressing stressors facing aging baby boomers.

Gender

The data frame was subdivided into gender categories ‘Male’ and ‘Female’.

Overall = ['Overall']
genderdf = data.loc[(data['Age_Group'].isin(Overall)) &
                    (data['GenderEthnLabel'] == 'Gender') &
                    (data['Ethnicity_Gender'] != 'Overall') &
                    (data['Value'] != 0)]

In the health questionnaires, adult females report higher rates of binge drinking, lifetime depression diagnosis, high blood pressure, issues with daily activities and a need for assistance as a result of self-reported cognitive decline. These figures span all ages and ethnicities in the data.

Our clustering analysis will continue using age as it showed the most variation amongst values per feature (more features, larger variation).

Model | All Ages

df = data2.loc[(data2['Age_Group'] != 'Overall') &
               (data2['GenderEthnLabel'] == 'Overall') &
               (data2['Ethnicity_Gender'] == 'Overall') &
               (data2['Value'] != 0)]
df_age = df[['Topic','Value']]
# one column per Topic, one row per observation within that Topic
df_age = df_age.sort_values('Topic').groupby('Topic')['Value'].apply(lambda s: s.reset_index(drop=True)).unstack().T
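The reshaping step turns a long table of (Topic, Value) rows into a wide table with one column per topic. A similar shape can be produced with `cumcount` and `pivot`, as in the sketch below; this is an alternative written for illustration on toy data, not the notebook’s code, and the helper column `obs` is a hypothetical name.

```python
import pandas as pd

# Toy long-format data: topic A has three observations, topic B has two.
long_df = pd.DataFrame({
    "Topic": ["A", "B", "A", "B", "A"],
    "Value": [1.0, 4.0, 2.0, 5.0, 3.0],
})

# Number the observations within each topic (0, 1, 2, ...).
long_df["obs"] = long_df.groupby("Topic").cumcount()

# One row per observation number, one column per topic.
wide = long_df.pivot(index="obs", columns="Topic", values="Value")
print(wide)
```

Topics with fewer observations end up with NaN in the trailing rows, which is where the null values discussed next come from.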

Several topics were asked only periodically while a handful were asked consistently, so the resulting table contains many nulls. Initially, nulls are filled with the mean of their column; these sparse categories are dropped as the analysis becomes more granular.

Prior to the filling of missing values, any column with more than 50 percent null values is dropped.

df_age.dropna(thresh=24,axis=1,inplace=True) 
# keeping columns with at least half valid vals

The remaining null values are then filled with each column’s mean.

# filling values with respective mean
fillcols = df_age.columns
for col in fillcols:
    df_age[col] = df_age[col].fillna(value=df_age[col].mean())
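For reference, the drop-then-fill sequence can also be written without a loop and with the threshold derived from the row count instead of hard-coded. The toy frame below is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy wide table: 'Sleep' is mostly missing, the others are mostly complete.
wide = pd.DataFrame({
    "Obesity": [30.0, np.nan, 28.0, 32.0],
    "Sleep": [np.nan, np.nan, np.nan, 60.0],
    "Depression": [18.0, 20.0, np.nan, 22.0],
})

# Keep columns with at least half of their values present.
min_valid = int(np.ceil(len(wide) / 2))
kept = wide.dropna(axis=1, thresh=min_valid)

# Fill each remaining null with its own column's mean in one call.
filled = kept.fillna(kept.mean())
print(filled)
```

Mean filling keeps every row but pulls values toward the center of each column, which can blur differences between clusters later on.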

Lastly, the data is prepared using a standard scaler and our first round of modeling begins.
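A minimal sketch of the scaling step, assuming scikit-learn’s `StandardScaler` and a toy feature table (the column names and values are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy features on different scales.
features = pd.DataFrame({
    "Obesity": [28.0, 30.0, 32.0],
    "Sleep": [60.0, 65.0, 70.0],
})

# Standardize each column to zero mean and unit variance.
scaler = StandardScaler()
df_scaled = scaler.fit_transform(features)

print(df_scaled.mean(axis=0))  # approximately 0 for each column
print(df_scaled.std(axis=0))   # approximately 1 for each column
```

Scaling matters here because both PCA and K-Means are distance-based, so a feature with a larger numeric range would otherwise dominate.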

Modeling

Modeling is completed in three stages. The first clustering analysis represents both age groups, the second clustering represents only individuals age 50–64 and the last clustering represents individuals age 65 and over. Analysis is completed using dimensionality reduction (Principal Component Analysis) followed by K-Means Clustering.

Both Age Groups | 3 Components, 5 Clusters

In order to retain as much information as possible, the data frame is transformed into three components using PCA. Those components are added to our data frame for analysis.

# Reducing dimensionality | three components
pca = PCA(n_components=3)
principal_comp = pca.fit_transform(df_scaled) # passing scaled data
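How much information three components actually retain can be checked with `explained_variance_ratio_`. The sketch below runs on random stand-in data, so the printed numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for the scaled feature table (48 rows, 8 features).
rng = np.random.default_rng(0)
X_scaled = StandardScaler().fit_transform(rng.normal(size=(48, 8)))

pca = PCA(n_components=3)
components = pca.fit_transform(X_scaled)

# Share of total variance captured by each component, and in total.
per_component = pca.explained_variance_ratio_
retained = per_component.sum()
print(per_component, retained)
```

If the retained share is low, the clusters found in the reduced space may not reflect the structure of the original features.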

Cluster Distribution

The optimal number of clusters is selected using the elbow method and silhouette coefficient.
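The two selection criteria can be sketched as follows. The synthetic blobs, the range of k and the `random_state` are assumptions chosen so the example is reproducible; they are not taken from the notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups.
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=42,
)

inertias, silhouettes = {}, {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_                          # elbow method input
    silhouettes[k] = silhouette_score(X, km.labels_)   # higher is better

# Elbow: look for the k where inertia stops dropping sharply.
# Silhouette: pick the k with the highest average score.
best_k = max(silhouettes, key=silhouettes.get)
print(inertias)
print(best_k)
```

The elbow is read off a plot of inertia against k, while the silhouette score gives a single number per k; using both guards against an ambiguous elbow.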

As we profiled and inspected the 5 clusters, we found that they did not yield meaningful, interpretable trends. This is in addition to any distortion introduced by filling nulls with mean values and by using a groupby function with a mean aggregation to observe trends. Ultimately, we decided to move forward with a clustering analysis per age group.

Model | Age 50–64

The data was divided into two groups, halving the number of values per feature from 48 to 24. Null values were not filled but dropped in order to maintain the integrity of the dataset.

Our multicollinearity plot for this age group reflects a significant positive correlation between lifetime diagnosis of depression and self-reported ‘fair-poor’ health. The same plot illustrated a significant negative relationship between self-reported disability and self-reported ‘good-excellent’ health.
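The relationships read off such a plot come from a pairwise correlation matrix. The sketch below computes one with pandas on synthetic data and ranks the feature pairs by strength; the column names and the built-in relationship are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic features: 'Fair_Poor_Health' is built to track 'Depression'.
rng = np.random.default_rng(1)
depression = rng.normal(20, 2, size=24)
toy = pd.DataFrame({
    "Depression": depression,
    "Fair_Poor_Health": depression * 0.8 + rng.normal(0, 0.5, size=24),
    "Sleep": rng.normal(65, 3, size=24),
})

# Pearson correlation between every pair of features.
corr = toy.corr()

# Keep the upper triangle so each pair appears once, then rank by strength.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(key=abs, ascending=False)
print(pairs)
```

With only 24 rows per feature, even fairly large correlations can arise by chance, so a ranked list like this is a starting point rather than evidence of a causal link.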

With half as much data, PCA was used to reduce dimensionality to two components prior to selecting the optimal number of clusters for analysis.

# Reducing dimensionality | two components
pca = PCA(n_components=2)
principal_comp = pca.fit_transform(df_scaled) # passing scaled data
# Creating dataframe out of 2 component result
pca_df = pd.DataFrame(data = principal_comp, columns=['PCA1','PCA2'])

Three clusters were selected using the elbow method and silhouette coefficient.
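A minimal end-to-end sketch of fitting three clusters and attaching the labels follows. The toy values are made up, and fitting K-Means on the two PCA components (rather than on the scaled features) is an assumption; only the `Three_Clusters`, `PCA1` and `PCA2` column names come from the notebook snippets.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy feature table with three loose groups of two rows each.
features = pd.DataFrame({
    "Binge_Drinking": [10.0, 11.0, 18.0, 19.0, 14.0, 15.0],
    "Depression": [15.0, 16.0, 24.0, 25.0, 20.0, 19.0],
    "No_Leisure": [12.0, 13.0, 22.0, 21.0, 17.0, 18.0],
})

# Scale, reduce to two components, then cluster.
scaled = StandardScaler().fit_transform(features)
pca_df = pd.DataFrame(
    PCA(n_components=2).fit_transform(scaled), columns=["PCA1", "PCA2"]
)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(pca_df)

# Attach labels to the original features and profile each cluster by its mean.
clustered = pd.concat([features, pca_df], axis=1)
clustered["Three_Clusters"] = kmeans.labels_
profile = clustered.drop(columns=["PCA1", "PCA2"]).groupby("Three_Clusters").mean()
print(profile)
```

Profiling on the original features rather than on the components keeps the cluster means in interpretable units.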

Cluster Distribution

A groupby function using a mean aggregation is used to profile and inspect the clusters.

three = df50_cluster.drop(['PCA1','PCA2'],axis=1).groupby('Three_Clusters').mean().head()
three.style.background_gradient(cmap='Blues')

Group number ‘2’ is highlighted by the highest (or tied for highest) scores across multiple features: binge drinking, lifetime depression diagnosis, fair-poor health (self-reported), influenza vaccine (in the last year) and no leisure (in the last month).

Model | Age 65+

Multicollinearity in features for individuals over 65 years old.

Our multicollinearity plot for this age group reflects fewer relationships between features. A strong negative correlation exists between self-reported disability and having received a pneumococcal vaccine (at any point in life).

PCA was used to reduce dimensionality to two components prior to selecting the optimal number of clusters for analysis.

# Reducing dimensionality | two components
pca = PCA(n_components=2)
principal_comp = pca.fit_transform(df_scaled) # passing scaled data
# Creating dataframe out of 2 component result
pca_df = pd.DataFrame(data = principal_comp, columns=['PCA1','PCA2'])

Three clusters were selected using the elbow method and silhouette coefficient.

Cluster Distribution

A groupby function using a mean aggregation is used to profile and inspect the clusters.

three = df65_cluster.drop(['PCA1','PCA2'],axis=1).groupby('Three_Clusters').mean().head()
three.style.background_gradient(cmap='Reds')

While there is not one group that shows significant scores across the majority of categories, the largest group is cluster ‘2’ with high scores across activity limitations, lifetime diagnosis of depression, and no leisure (in the last month).

Interpretations and Limitations

The CDC’s Alzheimer’s Disease and Healthy Aging data set illustrates strong relationships between features in individuals age 50–64 at the time of the health questionnaire.

The clustering analysis identified a single group with high scores across the majority of the features included. High scores were observed in categories: binge drinking (in the last month), lifetime depression diagnosis, ‘fair-poor’ self-reported health and no leisure (in the past month).

The study is limited by missing data. Questionnaires were conducted in varying time periods and each questionnaire appeared to address different topics. Not every ethnicity was represented in all cases; this may be due to geographic region, but no additional information is available to support this.

Additionally, values were collected through self-reporting, and outside factors occurring immediately prior to the assessment may have influenced the results.

Future Work

Ultimately, the analysis points to a trend among baby boomers who are approaching retirement in the western states: in the 50–64 age group, binge drinking, lifetime depression diagnosis, ‘fair-poor’ self-reported health and lack of leisure time cluster together. Future work should include a larger variety of topics, such as ones that capture social factors, family features, financial health and more. As baby boomers continue to age toward retirement, it is our responsibility to create a cycle of care so that future generations can worry a little less when they prepare for retirement.

Sources Cited

Counseling Staff, 2018. Boom in Aging Adults Could Overwhelm Mental Health Care Field. [online] Counseling.northwestern.edu. Available at: https://counseling.northwestern.edu/blog/boom-in-aging-adults-could-overwhelm-mental-health-care-field/ [Accessed 30 January 2021].

U.S. Department of Health & Human Services, 2020. Alzheimer’s Disease and Healthy Aging Data. [online] Catalog.data.gov. Available at: https://catalog.data.gov/dataset/alzheimers-disease-and-healthy-aging-data [Accessed 28 January 2021].

Miguel Angel Santana II, MBA

Data Scientist who enjoys awesome collaborative work environments. When not coding, I spend time with family and fight my pug as he barks at strangers.