Methodology of the Vaccine Allocation Planner for COVID-19
Ariadne Labs and Surgo Foundation
October 26, 2020, v1

Contents

A. Context
B. This document
C. General principles about methods
D. The NASEM report
E. VAPC function 1: Select groups to vaccinate
   1. Estimating county populations by group
      a. Employment data
      b. Imputation of suppressed data
      c. Volunteer firefighters
      d. Critical risk workers
      e. Comorbidity estimates
      f. Older adults in congregate settings
      g. Limitations to marginal estimates
   2. Estimation of overlapping populations
      a. Conditional probabilities for each pair of groups
      b. Resolving each pair of conditional probabilities
      c. Generating a covariance matrix for each county
      d. Monte Carlo simulation for each county
      e. Final analytic dataset
      f. Accuracy of marginal and overlap estimates
F. VAPC function 2: Count available doses
G. VAPC function 3: Allocate doses to counties
   1. Proportional allocation
   2. Adjustment for SVI or CCVI
H. Refining the methods
Appendix A: Conditional probability estimates

A. Context

Across the United States, people are eagerly awaiting the arrival of safe, effective vaccines against the coronavirus that causes COVID-19. Vaccines will be the surest sustainable way of protecting health, saving lives, and getting the country beyond the pandemic. Health officials and government decision makers at state and local levels must plan carefully to ensure that vaccines are distributed as quickly as possible once stocks are made available.

But there will not be enough stocks to vaccinate everyone immediately. Officials will have to prioritize and allocate them to the groups most in need. The Vaccine Allocation Planner for COVID-19 (VAPC) provides state and county decision makers with the localized data they need to plan vaccine distribution, based on available vaccine doses, priority populations, and vulnerable communities in each state.

Vaccination against a transmissible disease such as COVID-19 is an individual, community, and governmental responsibility that transcends borders. Equitable access to immunization is a core component of the right to health. Strong vaccine allocation systems during extreme resource scarcity, such as the situation we will soon face, are essential to combating the virus causing the current pandemic. Informed decisions and implementation strategies are critical to ensuring the sustainability of vaccination programs. The full potential of vaccination can only be realized through learning, continuous improvement, and innovation in research and development, as well as quality improvement across all aspects of vaccination. By prioritizing vaccination for our frontline workers and for those most vulnerable to COVID-19, equitable allocation will have far-reaching effects on the rest of the general public.

B. This document

This document is arranged according to the three main functions in the website:

1. Select groups to vaccinate
2. Count available doses
3. Allocate doses to counties

The reader may want to open the VAPC site to follow along with each section.

Our goal is to provide enough information about our statistical methods for analysts to understand and potentially recreate each step. If you would like more detail or have other questions, please email [email protected].

This document does not describe the design or build of the VAPC website itself.

C. General principles about methods

We strive for transparency at every step.

We will be updating the VAPC continuously. Results may change as we refine our methods, as recommendations change from various official bodies, and as the qualities of the available vaccines become clear.

All of the estimates in the VAPC reflect our best efforts. We selected the most reliable data sources available, but we did need to make assumptions and imputations at several points, as described in this document. Further, we plan to refine some of our methods going forward, such as providing ranges rather than point estimates. As the statistician George Box wrote, “All models are wrong but some are useful.” Recognizing that the VAPC will be inaccurate at times, we strive for it to be useful.

The VAPC is centered on the county level because this is the smallest geographic unit with reasonably reliable data for all the priority populations. If data becomes available at smaller units, such as the municipality or census tract, we will consider switching.

We use only publicly available data, and plan to do so moving forward. We will resist using proprietary or commercial data unless the gains to accuracy outweigh the goal of transparency.

The data science teams used a mix of R, Python, and SAS to implement these methods.

D. The NASEM report

The VAPC closely reflects the NASEM guidance [1], relying on their careful ethical deliberations regarding vaccine prioritization. We recommend reading the full report, which describes the ethics and rationale behind the settings reflected in the VAPC.

In particular, the VAPC centers on the 13 populations arranged in prioritized phases, as presented in the NASEM report’s “Table 3-2, Applying the Allocation Criteria to Specific Population Groups.” These 13 populations are presented by phase in Table 1 below.

The default values in the VAPC reflect the NASEM recommendations, such as the pre-selection of both populations in phase 1a (high risk health care workers and first responders) and the pre-selected option to take a 10% holdout.

The NASEM report also recommends that “Programs should do everything possible to reach all individuals in one priority group before proceeding to the next one.” (page 4-4) At the moment, the VAPC does not reflect this recommendation, but distributes vaccines among all the populations selected by the user, regardless of phase (see the section on VAPC function 3, below).

E. VAPC function 1: Select groups to vaccinate

The first function of the VAPC is the most complex to calculate, requiring estimates of the size of the 13 priority populations and their overlaps in every county.

1. Estimating county populations by group

We estimated population sizes in all US counties for the 13 priority groups. The NASEM report estimates the national total for each group, which we took as a rough benchmark to match with the sum of our county-level estimates. The NASEM report does give sources for its totals, and recognizes that precise estimates are difficult to come by. We followed NASEM’s lead in sourcing data, and strove to generally match the NASEM national numbers for each group, unless we had a direct reason for a variance, as described below.

The output of this step is a data frame with one row for each county (n = 3,142), one column with the county FIPS code (a standard identifier), one column with the total population of the county from Census estimates, and one column each for the 13 groups, with the number of people (integers) in each group in that county. This section describes how we estimated the 13 groups, and Table 1 summarizes the definitions and data sources for each.

[1] National Academies of Sciences, Engineering, and Medicine. 2020. Framework for Equitable Allocation of COVID-19 Vaccine. Washington, DC: The National Academies Press.

Table 1: Population group definitions and data sources

PHASE 1A

1. High risk workers in health care facilities
   Subgroup: Hospitals, physician and other health practitioner offices, outpatient care centers, home healthcare services, pharmacies and drug stores, and nursing and residential care facilities and homes (skilled nursing, mental health, developmental disability, mental and substance abuse, assisted living, retirement communities, other residential care)
   VAPC data source: Bureau of Labor Statistics 2020 Quarterly Census of Employment and Wages
   Note: Raw data from BLS QCEW at the county level is highly suppressed (see main text on the imputation method used)

2. First responders
   Subgroup: Police
   VAPC data source: ArcGIS, CA Governor’s Office of Emergency Services
   Subgroup: Fire protection services
   VAPC data source: Bureau of Labor Statistics 2020 Quarterly Census of Employment and Wages
   Subgroup: Other ambulatory health care services
   VAPC data source: Bureau of Labor Statistics 2020 Quarterly Census of Employment and Wages

PHASE 1B

3. People with 2+ significant comorbid conditions
   Subgroup: Obesity (BMI ≥ 30 kg/m2), diabetes mellitus, COPD, heart disease, chronic kidney disease, and any (1+) condition
   VAPC data source: Direct estimates of comorbidity rates by county from the CDC (Razzaghi et al. 2020) are adjusted for multimorbidity using Clark et al. 2020 estimates for 1 and 2+ comorbidity populations

4. Older adults in congregate settings
   Subgroup: Nursing residents
   VAPC data source: Centers for Medicare & Medicaid Services - Division of Nursing Homes/Quality, Safety, and Oversight Group/Center for Clinical Standards and Quality
   Subgroup: Residential care residents
   VAPC data source: Department of Homeland Security Homeland Infrastructure Foundation-Level Data
   Note: Includes residents of assisted living facilities for the elderly and continuing care retirement communities
   Subgroup: Crowded households with adults over 65
   VAPC data source: CDC Social Vulnerability Index, American Community Survey 2014-2018 5-year Estimates
   Note: Calculated as a product of crowding (more people than rooms) and persons over 65

Table 1: Population group definitions and data sources (continued)

PHASE 2

5. Critical risk workers (part 1)
   Subgroup: Workers in dentist offices, medical and diagnostic laboratories, food and beverage manufacturing facilities and stores, gas stations, cosmetic and beauty supply stores, optical goods stores, other health and personal care stores, transportation industries (air, rail, water, truck, public transit and ground passenger, pipeline, support activities), postal service and other couriers and messengers, general warehousing and storage establishments, and pharmaceutical and medicine manufacturing facilities
   VAPC data source: Bureau of Labor Statistics 2020 Quarterly Census of Employment and Wages

6. Teachers and school staff
   Subgroup: Elementary and secondary school teachers
   VAPC data source: Bureau of Labor Statistics 2020 Quarterly Census of Employment and Wages
   Subgroup: Child day care service staff
   VAPC data source: Bureau of Labor Statistics 2020 Quarterly Census of Employment and Wages

7. People with 1 significant comorbid condition
   Subgroup: (see above)
   VAPC data source: Direct estimates of comorbidity rates by county from the CDC (Razzaghi et al. 2020) are adjusted for multimorbidity using Clark et al. 2020 estimates for 1 and 2+ comorbidity populations

8. All older adults
   Subgroup: Persons over 65
   VAPC data source: CDC Social Vulnerability Index, American Community Survey 2014-2018 5-year Estimates

9. People and staff in homeless shelters or group homes
   Subgroup: People living in non-institutional group quarters (homeless shelters, group homes for adults, residential rehab treatment centers for adults)
   VAPC data source: Census Bureau 2010 Decennial Census
   Note: Will be updated with 2020 census data once available.
   Subgroup: Staff in community food and housing, and emergency and other relief services, and vocational rehabilitation services
   VAPC data source: Bureau of Labor Statistics 2020 Quarterly Census of Employment and Wages

10. Incarcerated / detained people and staff
   Subgroup: Staff in correctional institution establishments
   VAPC data source: Bureau of Labor Statistics 2020 Quarterly Census of Employment and Wages
   Subgroup: Incarcerated population
   VAPC data source: Vera Institute of Justice Incarceration Trends Dataset

PHASE 3

11. Young adults (18-30)
   Subgroup: Persons age 18-30
   VAPC data source: Census Bureau 2019 American Community Survey

12. Children (3-18)
   Subgroup: Persons age 3-18
   VAPC data source: Census Bureau 2019 American Community Survey

13. Critical risk workers (part 2)
   Subgroup: Workers in the following services, establishments, and stores: waste management and remediation, transportation equipment manufacturing, utilities, crop production, specialty trade contractors, oil and gas extraction, animal production and aquaculture, mining (coal, metal ore, nonmetallic mineral), construction of buildings, hardware, clothing and clothing accessories, food services and drinking places, and credit intermediation and related activities
   VAPC data source: Bureau of Labor Statistics 2020 Quarterly Census of Employment and Wages

a. Employment data

For all professions, unless noted otherwise, we relied on industry data from the Bureau of Labor Statistics (BLS) Quarterly Census of Employment and Wages (QCEW), and our numbers are based on employment conditions pre-pandemic (January-March 2020) [2]. The industries were located through their North American Industry Classification System (NAICS) codes. The BLS QCEW affords the finest geographic granularity of employment data available at the county level. While the data is by industry (e.g. education), and not by occupation (e.g. 2nd grade teacher), QCEW data includes all pertinent staff who work alongside these critical workers and would therefore need vaccination as well. Employers in the United States fall into four major ownership types: private, federal, state, and local government. The employment numbers are dispersed among these ownership types and need to be summed to get the total number of employees per industry per county. The average number of employees per industry is then taken across the first 3 months of 2020.

b. Imputation of suppressed data

Privacy laws, based upon stipulations from a Federal Register notice [3], introduce suppression issues when accessing employment data from the BLS QCEW at the county level. Approximately 60 percent of the most detailed level data are suppressed for confidentiality reasons. These issues can arise when an industry has few employers within a county or when an industry is dominated by state and local government (e.g. education). There are various levels of suppression that can lead to large underestimations at the county level. These levels include primary suppression (dubbed the 80/3 rule) and secondary suppression. Primary suppression occurs in a county when either a single establishment employs over 80% of the employees or there are fewer than three establishments in total.
Secondary suppression occurs when the value of the primary suppressed data can be back-calculated with simple arithmetic from the data that is not suppressed in that county. Another undisclosed level of suppression ensures the integrity of the hidden data.

The regulations for suppression can differ depending on the ownership type. For example, federal data is always disclosable, but state governments have much stricter guidelines on what data can be made available. However, all levels of suppression could lead to large differences between the sum of employed individuals per industry in the county-level data and the national estimates. For instance, the education industry has employment data in 99% of the 3,142 counties. However, at least 1 ownership type (i.e. state, local, or private) has suppressed data in 89% of those counties. This amount of suppression leads to 5.9 million employees being unaccounted for within the education industry.

The magnitude of the suppression in almost every industry made it impossible to simply ignore, and we sought a systematic approach to imputing the suppressed data. Since we know that specific rules determine which data are withheld, multiple imputation is not warranted, because that method assumes the data are suppressed at random [4]. Also, imputing data based on correlations between industry data and covariates (e.g. socioeconomic status, household income, etc.) would require a targeted approach to each individual industry, which was not feasible within the project timeline.

Instead, we used data from other sources for industries where employment information had been compiled and completely disclosed (e.g. law enforcement employees at the county level). For the majority of industries, other sources did not exist. Therefore, we implemented a simple approach given what we knew about the suppressed data. We distributed the difference between national and county-level totals, by ownership type, to the counties with suppressed data, weighted by population. We know exactly which counties have suppressed data per industry, as well as the ownership type (i.e. state, local, private) in which the data was suppressed. QCEW also lists the national totals by ownership type per industry. Therefore, we can distribute the difference between the national total and the sum of the disclosed county values among those counties that actually employ that industry, ensuring that no employee is assigned to a county where that industry does not exist. Since we can confirm that the industry exists in the county, it is a reasonable assumption to distribute based on population, since more employees tend to work in more highly populated areas [5] on average.

[2] Quarterly Census of Employment and Wages. Retrieved from …es.htm
[3] Federal Register, 69, Department of Labor - Bureau of Labor Statistics 19452 (April 13, 2004). …correction.pdf
[4] Sterne, J., White, I. R., Carlin, J., Spratt, M., Royston, P., Kenward, M., Carpenter, J. (2009). Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ, 338(b2393).

c. Volunteer firefighters

We could not identify a county-level count of volunteer firefighters, who are more numerous than their paid counterparts (the latter are included in the QCEW data above). There are 1,823 counties (containing 81m people) that neither had paid firefighters listed in the BLS data nor had their count suppressed by BLS. We distributed the 800,000 volunteer firefighters across these counties weighted by population, meaning about 1% of the population of these counties was designated as being a volunteer firefighter.
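
The population-weighted distribution used for both the suppressed QCEW counts and the volunteer firefighters can be sketched as below. This is a minimal illustration, not the project's actual code: the function name `distribute_by_population`, the example FIPS codes and populations, and the largest-remainder rounding used to keep integer counts summing to the total are all assumptions made for the example.

```python
def distribute_by_population(total, populations):
    """Split an integer `total` across counties in proportion to population.

    `populations` maps a county FIPS code to its population. The result
    holds integer counts that sum exactly to `total`. Largest-remainder
    rounding is an assumption here; the source only says counts are integers.
    """
    pop_sum = sum(populations.values())
    if pop_sum <= 0:
        raise ValueError("populations must sum to a positive number")
    # Exact (fractional) share of the total owed to each county.
    raw = {fips: total * pop / pop_sum for fips, pop in populations.items()}
    # Truncate to whole people, then hand back the units lost to truncation.
    counts = {fips: int(share) for fips, share in raw.items()}
    leftover = total - sum(counts.values())
    by_fraction = sorted(raw, key=lambda f: raw[f] - counts[f], reverse=True)
    for fips in by_fraction[:leftover]:
        counts[fips] += 1
    return counts


# Hypothetical example: 1,000 employees of one ownership type are missing
# from the county files, and three counties have suppressed records.
suppressed_counties = {"01001": 50_000, "01003": 200_000, "01005": 25_000}
imputed = distribute_by_population(1_000, suppressed_counties)
```

The same helper would cover the volunteer firefighter step: passing 800,000 and the populations of the 1,823 eligible counties gives each county roughly 800,000 / 81,000,000, or about 1%, of its population.
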
We considered an alternative approach of distributing the 800,000 across all non-urban counties (as most urban counties have paid firefighters [6]). However, this would have left a substantial number of urban counties with no firefighters, while counting 2% of the population in non-urban counties as volunteer firefighters.

d. Critical risk workers

The critical risk workers were defined according to the national guidelines set out by the Department of Homeland Security [7]. The list of industries included under critical risk workers is extensive and includes: healthcare and public health, law enforcement and first responders, education, food and agriculture, energy, water and wastewater, transportation, public works, communications, critical manufacturing, hazardous materials, financial services, chemical, defense, real estate, and hygiene services. Some workers in these “essential” industries do not have a high risk of exposure to COVID-19. For instance, 37 percent of jobs in the US [8] can be done remotely, and some employees are white collar and can avoid direct interactions with others. The NASEM guidelines acknowledge that there is no complete list of all workers who are critical. The critical worker designation involves both working in industries vital to the functioning of society and working in occupations where exposure risk cannot be avoided.

The critical risk workers are separated into 2 groups within the NASEM guidelines: critical workers in high-risk settings (Phase 2) and workers at moderately higher risk of exposure (Phase 3). The high-risk setting designation (Phase 2) covers occupations that have potential exposure both from colleagues and from the public, as well as an inability to social distance or wear personal protective equipment (PPE) (e.g. cashiers and food store workers, public transit, etc.). The moderately higher risk designation (Phase 3) covers occupations where the potential exposure comes from colleagues, or where PPE and social distancing measures are easily implemented (e.g. factory workers in production, bank tellers, agriculture, etc.).

e. Comorbidity estimates

NASEM guidelines split people with comorbidities into two groups. The severe comorbidities group (Phase 1b) includes people with two or more of the underlying conditions outlined by CDC [9] as leading to an increased risk of severe illness if infected with COVID-19. The high risk group (Phase 2) has exactly one of these conditions. Counting the people in a county who have one, or two or more, of these comorbidities quickly becomes inaccurate if the overlap between conditions is left unaddressed, because the same person is counted more than once. For example, an obese person may also have type 2 diabetes mellitus and cancer. If only the total numbers for each underlying condition are added up, this would lead to a drastic overestimation.

Clark et al. (2020) [10] estimated the number of individuals at increased risk of severe disease (as defined by the CDC, WHO, and UK public health agencies at the time of publication in June) by age (5-year age groups), sex, and country for 188 countries, using prevalence data from the Global Burden of Diseases, Injuries, and Risk Factors Study (GBD) and UN population estimates for 2020. They also analyzed data from Scottish and Chinese multimorbidity studies to calculate the multimorbidity fraction for each country.

However, this approach is biased toward individuals with higher socioeconomic status. It is known that income and education correlate with the presence of comorbidities. Lower socioeconomic populations suffer from greater rates of comorbidities. Therefore, projecting national estimates to the county level will overestimate the true population in wealthier counties, effectively creating an inequitable recommendation for vaccine allocation. We have developed a correction for these estimates based on the county-level prevalence of chronic conditions associated with severe COVID-19, calculated by the CDC [11] from the Behavioral Risk Factor Surveillance System (BRFSS) state surveys.

[5] BLS (2020). County Employment and Wages - First Quarter 2020 [Press …]. …tr.pdf
[6] Evarts, Ben; Stein, Gary P. (February 2020). "U.S. Fire Department Profile through 2020". National Fire Protection Association Fire Analysis and Research Division. Retrieved May 6, 2020.
[7] Krebs, C. C. (2020). Advisory memorandum on ensuring essential critical infrastructure workers ability to work during the COVID-19 response. Retrieved from …blications/Version 4.0 CISA Guidance on Essential Critical Infrastructure Workers FINAL%20AUG%2018v3.pdf
[8] Dingel, J., and B. Neiman. 2020. White Paper: How many jobs can be done at home? Chicago, IL: Becker Friedman Institute for Economics at the University of Chicago.
[9] CDC (October 16, 2020). People with Certain Medical Conditions. …itions.html
[10] Clark, A., Jit, M., Warren-Gash, C., Guthrie, B., Wang, H. H. X., & Mercer, S. W. (2020). Global, regional, and national estimates of the population at increased risk of severe COVID-19 due to underlying health conditions in 2020: a modelling study. The Lancet Global Health, 8(8), E1003-E1017.
[11] Razzaghi H, Wang Y, Lu H, et al. Estimated County-Level Prevalence of Selected Underlying Medical Conditions Associated with Increased Risk for Severe COVID-19 Illness - United States, 2018. MMWR Morb Mortal Wkly Rep 2020;69:945–950.

The CDC used BRFSS data and small area estimation to generate estimates of county-level populations with any of five conditions (obesity (BMI ≥ 30), heart disease, diabetes mellitus, COPD, and CKD), taking overlap into account. Because these are direct measures of conditions, they already reflect any differences in morbidity by age structure, SES, or any other covariate. This is a strength. Another strength is that this measure reflects obesity, an important predictor for the US, whereas Clark does not. But these estimates pool people with exactly one condition and those with two or more conditions. We need a way to divide them into two subgroups.

According to Clark, 94.3m people in the US have one or more comorbidities, comprising 64.7m (69%) with exactly one comorbidity and 29.5m (31%) with two or more comorbidities. We divided the BRFSS estimates into two groups based on these proportions. So if a county has 1,000 people with one or more comorbidities according to BRFSS, we assigned 690 to the group with exactly one comorbidity and 310 to the group with two or more comorbidities. Applying this ratio across all counties is clearly a large assumption that may be inaccurate in some counties. We plan to refine this methodology in particular moving forward.

f. Older adults in congregate settings

Alongside residents in nursing homes and residential care facilities, NASEM includes all adults over the age of 65 who live below the poverty line in this category, as a proxy for older adults living in overcrowded settings. Their contention is that older adults living in overcrowded settings may live in multigenerational households [12], which are more often found in lower-income communities.

We found a more direct method of identifying older adults living in overcrowded settings outside of congregate care facilities such as nursing homes. The ACS collects data on people living in crowded settings, defined as households with more people than rooms.
This variable is available at the county level, so it captures variation between counties. The variable does not disaggregate by age, so we multiplied the percentage of people living in crowded settings by the number of people over 65 to get this subset within the older adults in congregate settings group.

g. Limitations to marginal estimates

Whereas some groups are clearly defined in the NASEM recommendation, such as healthcare workers, teachers, or those with severe comorbidities, other groups are not as clearly defined, such as critical workers. Individual states may have their own definition of which workers are critical, which would change the allocation of doses.

Data on employment from the BLS is currently available for the first quarter of 2020 only (as of November 2020) and may not reflect current levels, given the massive unemployment spurred by the pandemic.

Our estimate of older adults in congregate settings may be an underestimate, since we are not accounting for multigenerational households where there are the same number of people as rooms, or fewer.

[12] Miller, R. B., and C. A. Nebeker-Adams. 2017. Multigenerational households. In Encyclopedia of couple and family therapy, edited by J. Lebow, A. Chambers, and D. C. Breunlin. Cham, Switzerland: Springer International Publishing. Pp. 1–3.
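
The two county-level calculations described in sections e and f are simple arithmetic, and the sketch below shows one way they could be written. The function names, the example crowding share, and the choice to round to whole people are assumptions for illustration; only the 69%/31% split and the product of crowding share and persons over 65 come from the text.

```python
# Shares taken from the Clark et al. figures quoted in section e:
# 64.7m of 94.3m (69%) with exactly one condition, 29.5m (31%) with two or more.
SHARE_TWO_OR_MORE = 0.31


def split_comorbidity(any_condition):
    """Split a county's BRFSS count of people with one or more conditions.

    Returns (exactly_one, two_or_more). Rounding to whole people is an
    assumption; the remainder goes to the exactly-one group so the two
    parts always add back to the input.
    """
    two_or_more = round(any_condition * SHARE_TWO_OR_MORE)
    exactly_one = any_condition - two_or_more
    return exactly_one, two_or_more


def crowded_older_adults(crowded_share, persons_over_65):
    """Estimate adults over 65 in crowded households (section f).

    `crowded_share` is the county's share of people living in crowded
    settings (more people than rooms), applied uniformly to older adults
    because the ACS variable is not broken down by age.
    """
    return round(crowded_share * persons_over_65)


# The worked example from the text: 1,000 people with one or more conditions.
exactly_one, two_or_more = split_comorbidity(1_000)

# Hypothetical county: 4% of residents in crowded housing, 5,000 people over 65.
crowded_65 = crowded_older_adults(0.04, 5_000)
```

Both functions apply a single national or county-wide ratio to a subgroup, which is the source of the limitations noted above.
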

2. Estimation of overlapping populations

After getting these marginal estimates of the 13 groups for each county, our next task was to estimate the amount of overlap between these groups. Understanding the degree of overlap is important because it has a large effect on the number of vaccine courses needed by a population. Many groups clearly overlap strongly, such as elderly in congregate settings and those with 2+ comorbidities, and other groups do not overlap at all, such as children and elderly. For this reason, we did not feel we could assume independence.

We took a Monte Carlo approach to estimate the overlap between all 13 groups. Given the expected correlations between groups and the marginal populations, we would randomly generate a simulated population of 1 million for each county, with 13 binary variables indicating group membership.

Our goal was to estimate the proportion of people in every possible combination of belonging to 13 groups. For each county, the desired dataset would have 13 columns of binary variables indicating membership in each group, and one column with the proportion of people in that county who were members of exactly those groups. This would imply 8,192 (2^13) rows for each of 3,142 counties, for a total dataset of 14 columns and about 26 million rows. Luckily, membership in more than a few groups is rare, and the final dataset is much smaller and more manageable.

a. Conditional probabilities for each pair of groups

To generate the simulated population, we needed a correlation matrix for the 13 groups. Our first step was to consider the conditional probabilities of group membership. Within one county, there are 78 pairs of group memberships. Each pair of membership in group X and group Y relates to a standard 2x2 table of probabilities as shown in Table 2.

Table 2: Sample 2x2 table

                              Member of group Y?
                              Yes         No
Member of group X?    Yes     A           B           P(X)
                      No      C           D           1 - P(X)
                              P(Y)        1 - P(Y)    1

We have P(X) and P(Y) from the estimates of population size in each county.
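
To make Table 2 and the counts quoted above concrete, the sketch below checks the arithmetic (78 pairs, 8,192 combinations, roughly 26 million rows) and fills in cells B, C, and D once the two marginals and a value for cell A are known. The function name `two_by_two` and the example probabilities are hypothetical, and the feasibility check on cell A (the standard bounds implied by the marginals) is an added sanity check, not a step the source describes.

```python
from math import comb

N_GROUPS = 13
N_COUNTIES = 3142

n_pairs = comb(N_GROUPS, 2)            # distinct pairs of groups
n_combinations = 2 ** N_GROUPS         # possible membership patterns
n_rows = n_combinations * N_COUNTIES   # rows in the full, unpruned dataset


def two_by_two(p_x, p_y, a):
    """Return cells (A, B, C, D) of Table 2 for marginals P(X), P(Y).

    `a` is the joint probability of belonging to both groups. It must lie
    between max(0, P(X) + P(Y) - 1) and min(P(X), P(Y)); otherwise at least
    one of the other cells would be negative.
    """
    lower = max(0.0, p_x + p_y - 1.0)
    upper = min(p_x, p_y)
    if not lower <= a <= upper:
        raise ValueError(f"cell A must be between {lower} and {upper}")
    b = p_x - a                  # in X but not Y (row X sums to P(X))
    c = p_y - a                  # in Y but not X (column Y sums to P(Y))
    d = 1.0 - p_x - p_y + a      # in neither group
    return a, b, c, d


# Hypothetical county: 20% in group X, 10% in group Y, 5% in both.
a, b, c, d = two_by_two(0.20, 0.10, 0.05)
```

Setting `a` to zero corresponds to two groups that cannot overlap, and setting it to `min(p_x, p_y)` corresponds to the smaller group sitting entirely inside the larger one.
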
Our goal was to estimate cell A for each pair, which would be the amount of overlap between the two groups. Under independence, cell A would equal P(X)*P(Y), but we wanted to reflect more accurate conditional probabilities. In the cases with zero overlap, cell A is zero, as are the conditional probabilities. In cases with larger overlap, the conditional probabilities could range widely. Further, there are two “directions” in which to resolve cell A: the probability of being in group X given membership in group Y [ P(X|Y) ] and the probability of being in group Y given membership in group X [ P(Y|X) ].

We assembled estimates of every P(X|Y) and P(Y|X) (see Appendix A). We began by populating all the conditional probabilities that are logically zero:

• Children, young adults, and older adults are mutually exclusive groups (also affects older adults in congregate settings)
