Starting off as a muggle that naïve to the Math's and Data Science world.

Day 98 (2)

Statistical Treatment of Clustered Data

First step is to understand relationship between independent observation in a single classroom, or intraclass correlation.

Second step is to perform aggregation to the observations such as simple statistics (mean, standard error of the mean, variance, correlation, reliability of aggregates). This is the treatment for two-stage sampling designs.


Why this is important?

1. Avoid Ecological Fallacies 生态谬误

This occur when conclusions about individual data/behavior one drawn from group level data. Can be explained using regression.

IndividualGroup
Level 1Level 2
MicroMacro
Within Group Regression
– Understand how variables interact within a single group or class.

– Eg. Examining the correlation between study time and test scores within a single classroom.
Between Group Regression
– Compare how variables differ or relate between different groups or classes.
– Eg. Comparing the average test scores between different classrooms.
Same classroomClassroom 1 & 2

2. Group level mask individual level variations

Testing group effect, using individual granularity

3. Misinterpret group data lead to incorrect assumptions about individual behavior. Eg.

Group A score: 52, 54, 53, 51, 99, mean = 61.8
Group B score: 30, 61, 63, 65, 66, mean = 57

Group A score is 61.8, Group B score is 57. Wrong. 99 and 30 are being ignored; Individual effect are being ignore.


Method

  • Aggregation
  • Disaggregation

Aggregation

Common procedure is to aggregate micro level data to macro level data to perform two-level data analysis.

However, it would cause “shift of meaning” (Firebaugh, 1978; Huttner, 1981). Eg. Employee satisfaction to company performance. Aggregate all employee regardless department to single value and correlate to company performance. This is because production employee would weight more than helpdesk employee.

Second potential error is ecological fallacy. Eg. Percentage of black inhabitants could related to average political view in the neighborhood. ps. This is done at macro level. Now use the percentage as race to relate with political analysis, Wrong.

Third potential error is neglect original data structure.

(X, Y) values for 5 groups indicated by symbol, group average by “•”

Micro Level

If we look within group, all are positive slope, different group may appear above/below regression line.

Macro Level

if we aggregate into single value per group, it is negative slope, the “•” are almost ontop of regression line.

This would lead us to draw wrong conclusion that there are no group differences.

Last potential error is prevent potential cross-level interaction effects of a specified micro-level variable with an as yet unspecified macro-level variable. Eg. Whether the sentence differential between black and white suspects is different between judges, when allowance is made for differences in seriousness of crimes.

Leave a comment