Starting off as a muggle who is naïve to the world of Math and Data Science.

Day 96

The interpretation of the collected data might have been wrong.


Under simple random sampling, one household is independent of another household.

Every single unit in the population has the same chance of being selected into the sample.

Suppose the sampling was actually two-stage, but we pretend the secondary units were selected independently. If we pick households from the same housing area and claim that these households are independent, are they really independent or dependent?

Usually the researcher assumes that each data point collected (e.g. housing area X – household) was chosen separately from the others. In reality these “households” are often chosen from the same “housing area”, which means they are not truly independent; overlooking the clustering of secondary units leads to erroneous inferences and misleading conclusions.
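To see why this matters, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration: 20 housing areas, 10 households each, and a shared "area effect" that makes households in the same area look alike. It compares the naive standard error (which pretends all 200 households are independent) with how much the sample mean really moves from one survey to the next.

```python
import random
import statistics

def clustered_sample(rng, n_areas=20, n_households=10):
    """Simulate one survey: households in the same area share an area effect."""
    values = []
    for _ in range(n_areas):
        area_effect = rng.gauss(0, 1)  # shared by every household in this area
        for _ in range(n_households):
            values.append(area_effect + rng.gauss(0, 1))  # plus household noise
    return values

rng = random.Random(42)
means, naive_ses = [], []
for _ in range(500):  # repeat the hypothetical survey 500 times
    sample = clustered_sample(rng)
    means.append(statistics.mean(sample))
    # Naive standard error: treats all households as independent draws.
    naive_ses.append(statistics.stdev(sample) / len(sample) ** 0.5)

actual_spread = statistics.stdev(means)        # how much the mean really varies
average_naive_se = statistics.mean(naive_ses)  # what the naive formula claims

print("actual spread of the sample mean:", round(actual_spread, 3))
print("average naive standard error:   ", round(average_naive_se, 3))
```

Under these made-up settings the naive standard error should come out clearly smaller than the actual spread, which is the "overconfident" kind of mistake: the analysis claims more precision than the data can support.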


Types of mistakes that can occur

  • Error rate – wrong predictions / total sample
  • Misleading estimates – biased effect sizes
  • Overconfidence – the model performs well on the test dataset but does not work on the validation dataset
  • Invalid generalizations – the sample wrongly represents the population
  • Inefficient analysis

To avoid wrong sampling, multi-stage sampling is introduced. For example, 5 out of 10 clusters are picked, then 5 out of 10 observations within each picked cluster are picked.
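The 5-out-of-10 idea can be sketched in a few lines of Python. The function name `two_stage_sample` and the made-up population of 10 housing areas with 10 households each are my own assumptions for illustration, not a standard library API.

```python
import random

def two_stage_sample(population, n_clusters, n_per_cluster, seed=0):
    """population: dict mapping cluster id -> list of units in that cluster."""
    rng = random.Random(seed)
    # Stage 1: pick the clusters (primary units) at random, without replacement.
    chosen = rng.sample(sorted(population), n_clusters)
    # Stage 2: inside each chosen cluster, pick the units (secondary units).
    return {c: rng.sample(population[c], n_per_cluster) for c in chosen}

# Hypothetical population: 10 housing areas, 10 households in each.
population = {
    f"area_{a}": [f"area_{a}_house_{h}" for h in range(10)]
    for a in range(10)
}

sample = two_stage_sample(population, n_clusters=5, n_per_cluster=5)
for area, households in sample.items():
    print(area, households)
```

The result keeps the households grouped by housing area on purpose, so the later analysis can still see which households came from the same cluster.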

Travelling to 100 housing areas and interviewing 10 households in each is “cheaper than” travelling to 1,000 housing areas and interviewing 1 household in each, even though both give 1,000 interviews.
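A rough back-of-the-envelope sketch of that cost argument. The cost figures (50 per trip to a housing area, 5 per interview) are invented purely for illustration.

```python
def survey_cost(n_areas, households_per_area, travel_cost=50, interview_cost=5):
    """Total cost = one trip per housing area + one interview per household.

    travel_cost and interview_cost are hypothetical numbers.
    """
    trips = n_areas * travel_cost
    interviews = n_areas * households_per_area * interview_cost
    return trips + interviews

clustered = survey_cost(n_areas=100, households_per_area=10)
spread_out = survey_cost(n_areas=1000, households_per_area=1)

print(clustered)   # 10000
print(spread_out)  # 55000
```

Both designs interview 1,000 households, so the whole difference comes from the number of trips.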

The type of question to answer: do households in a multi-type housing area differ from those in other housing areas?


How to tell primary from secondary units? In this case, the housing area is the primary unit, aka macro-level unit, aka cluster; the household is the secondary unit, aka micro-level unit, aka elementary unit. The micro level is the lower level and the macro level is the higher level, similar to an organization chart.

However, some interesting questions arise when the nesting can go either way:

doctor : 1 – 10 : patients; 1 doctor handles 10 patients.

patient : 1 – 5 : doctors; 1 patient is attended by 5 different specialist doctors on different dates.


Identifying this correctly is crucial. For example, if pupils' achievement levels within the same school vary a lot, the variation likely has to do with individual performance, not the school's contribution.

On the other hand, if pupils within a school have a small standard deviation while the schools differ from one another, the results can reveal school capability such as teaching method, resources and environment.
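A small sketch of the two situations, using invented exam scores. It splits the variation into a within-school part and a between-school part, then reports the share that sits between schools. This is a rough descriptive ratio that I am assuming for illustration, not a formal estimator.

```python
import statistics

def between_school_share(schools):
    """schools: dict mapping school name -> list of pupil scores.

    Returns (within, between, share), where share = between / (within + between).
    """
    # Within: average of the variances inside each school.
    within = statistics.mean([statistics.pvariance(s) for s in schools.values()])
    # Between: variance of the school means.
    between = statistics.pvariance([statistics.mean(s) for s in schools.values()])
    return within, between, between / (within + between)

# Hypothetical case 1: pupils vary a lot inside each school, schools look alike.
individual_driven = {"school_a": [50, 70, 90], "school_b": [52, 68, 92]}
# Hypothetical case 2: pupils inside a school are alike, schools differ.
school_driven = {"school_c": [60, 61, 62], "school_d": [80, 81, 82]}

share_1 = between_school_share(individual_driven)[2]
share_2 = between_school_share(school_driven)[2]
print("share between schools, case 1:", round(share_1, 3))
print("share between schools, case 2:", round(share_2, 3))
```

A share near 0 points at individual pupils, while a share near 1 points at the school itself.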

