Biases, exclusions, and oversights in data greatly influence our understanding of disease systems and the underlying interactions between environments, pathogens, and hosts. These data limitations harm our capacity to identify problems and assess solutions for human infectious diseases. In human diseases, data exclusion often impacts the most vulnerable populations. These groups are most likely to be excluded from both baseline population measurements and disease surveillance. In some cases, their environments are also less likely to be mapped or characterized, further obscuring elements of exposure to pathogens, movement and connectivity, and access to health care. Integrating multiple data streams that measure the same elements of a disease system can help compensate for or overcome these biases, or at least help identify data biases and gaps and minimize their harm. We compare multiple data sources to assess data representation across measurements of environments, populations, and disease burden. We focus on populations with recurring outbreaks of vaccine-preventable diseases. We consider urgent scenarios, such as outbreak response, and routine scenarios, such as health care capacity assessments and routine immunization planning.
In the systems we studied, we found strong evidence that data biases impeded public health efforts. In urgent settings, data biases limited the speed, access, and breadth of outbreak detection and response efforts. In routine settings, data biases tended to exclude the populations that most needed improvements in basic public health services. In some instances, access to existing health services was linked directly to data collection, creating a feedback loop of missingness that impacted both urgent and routine public health efforts. For example, infectious disease burden data were often collected at points of care. These data triggered outbreak alerts and determined resource allocations for health services, yet they excluded populations and individuals lacking access to care, where access was influenced by location, transportation, and financial resources. We also found biases in data streams that relied on local digital or technological resources, perpetuating the ‘digital divide.’ To overcome these gaps and biases in data, we integrated conventional public health data sources with unconventional data sources, such as passive surveillance designed for unrelated purposes and small-scale active surveillance. In some cases, this approach reduced representation biases; in others, it helped identify data biases but could not overcome them.