pseudopeople is a Python package that generates realistic simulated data about a fictional United States population, designed for use in testing entity resolution (record linkage) methods or other data science algorithms at scale.

🙈 Simulated: These are made-up people! No need to worry about confidentiality.
📝 Versatile: Generate multiple datasets about the same population: censuses, surveys, and administrative records.
✔️ Verifiable: Ground-truth unique identifiers are present in every dataset for checking link correctness.
⚙️ Customizable: Configure the levels of noise in each dataset.
💪 Full-scale: Supports generating datasets at the size of the real-life US population.

The fictional US population was generated by stochastically simulating multiple decades of population dynamics such as fertility, mortality, migration and employment. pseudopeople builds on this fictional population data by simulating errors in the data collection process to create realistic, noisy datasets.

pseudopeople is currently in a public beta release. Things are still in flux! If you notice any issues, please let us know on GitHub.

Introduction

The Simulation Science Team of the University of Washington’s Institute for Health Metrics and Evaluation is excited to introduce pseudopeople, a Python package that simplifies entity resolution research and development. The package generates large-scale, simulated population data according to user specifications, replicating a range of complexities found in real applications of probabilistic record linkage software. Because entity resolution often requires sensitive data, developing and testing new methods and software has been a challenge. pseudopeople addresses this by creating realistic simulated data, including names, addresses, and dates of birth, without compromising privacy.

Our work builds on previous data synthesis projects such as FEBRL, GeCo, and SOG. It generates a simulated population from real, publicly accessible data about the US population, using our simulation platform, Vivarium.

Want to know more about pseudopeople? Please visit the pseudopeople project website, where you can find out more about the principles and processes underlying this work.

Quickstart

pseudopeople requires a version of Python between 3.8 and 3.11 (inclusive) to be installed. Once Python is installed, you can install pseudopeople with pip by running the command:

$ pip install pseudopeople

Alternatively, you can install from source using the pseudopeople GitHub repository.

Then, generate a small-scale simulated decennial census:

$ python
>>> import pseudopeople as psp
>>> census = psp.generate_decennial_census()
>>> census
      simulant_id household_id first_name middle_initial  last_name age date_of_birth  ... state zipcode housing_type relationship_to_reference_person     sex race_ethnicity  year
0             0_2          0_7      Diana              P      Kelly  25    05/06/1994  ...    WA   00000    Household                 Reference person  Female            NaN  2020
1             0_3          0_7       Anna              A      Kelly  25    09/29/1994  ...    WA   00000    Household                   Other relative  Female          White  2020
2           0_923       0_8033     Gerald              R      Allen  76    11/03/1943  ...    WA   00000    Household                 Reference person    Male          Black  2020
3          0_2641       0_1066    Loretta              T       Lowe  61    06/01/1958  ...    WA   00000    Household                 Reference person  Female          White  2020
4          0_2801       0_1138    Richard              R     Pinard  73    03/03/1947  ...    WA   00000    Household                 Reference person    Male          White  2020
...           ...          ...        ...            ...        ...  ..           ...  ...   ...     ...          ...                              ...     ...            ...   ...
10215     0_18969       0_7630      Patty              E  Palmisano  87    01/11/1933  ...    WA   00000    Household              Opposite-sex spouse  Female          White  2020
10216     0_19008       0_8361       John              V    Skeeter  58    12/29/1961  ...    WA   00000    Household                 Reference person    Male          Black  2020
10217     0_20165       0_7999   Kimberly              K      Suitt  65    04/05/1955  ...    WA   00000    Household                 Reference person  Female          White  2020
10218     0_19020       0_8130   Virginia              G     Hoover  93    10/02/1926  ...    WA   00000    Household                 Reference person  Female          White  2020
10219     0_20163       0_7998   Victoria              R    Simmons  27    04/21/1992  ...    WA   00000    Household                 Reference person  Female          White  2020

[10220 rows x 18 columns]

And W-2 and 1099 tax forms from the same simulated population:

>>> taxes = psp.generate_taxes_w2_and_1099()
>>> taxes
     simulant_id household_id employer_id          ssn  wages  ... mailing_address_city mailing_address_state mailing_address_zipcode tax_form tax_year
0            0_4          0_8          95  584-16-0130  10192  ...              Anytown                    WA                   00000       W2     2020
1            0_5          0_8          29  854-13-6295  28355  ...              Anytown                    WA                   00000       W2     2020
2            0_5          0_8          30  854-13-6295  18243  ...              Anytown                    WA                   00000       W2     2020
3         0_5621       0_2289          46  674-27-1745   7704  ...              Anytown                    WA                   00000       W2     2020
4         0_5623       0_2289          83  794-23-1522   3490  ...              Anytown                    WA                   00000       W2     2020
...          ...          ...         ...          ...    ...  ...                  ...                   ...                     ...      ...      ...
9911     0_18936       0_7621          23  006-92-7857   9585  ...              Anytown                    WA                   00000       W2     2020
9912     0_18936       0_7621          90  006-92-7857  57906  ...              Anytown                    WA                   00000       W2     2020
9913     0_18937       0_7621           1  182-82-5017  19609  ...              Anytown                    WA                     NaN     1099     2020
9914     0_18937       0_7621         105  182-82-5017   8061  ...              Anytown                    WA                   00000     1099     2020
9915     0_18939       0_7621           9  283-97-5940   4961  ...              Anytown                    WA                   00000       W2     2020

[9916 rows x 24 columns]

The simulated people in these datasets are called “simulants.” Both datasets have a simulant_id column that uniquely identifies an individual. Because the same simulant_id appears in both datasets, it provides a truth deck: the ground truth about which records refer to the same person, which we wouldn’t have for a linkage task with real, sensitive data.

Note that in the small-scale simulated population that is available by default, every address in these datasets has its city, state, and zip code set to the fictitious location of Anytown, WA 00000. To read more about obtaining large-scale data with more realistic city, state, and zip code data, please see Simulated populations.

>>> # To find how many matches there are between records about a given simulant,
>>> # we need to multiply the number of records about that simulant in the census by
>>> # the number of records about that simulant in taxes
>>> true_matches = census.groupby("simulant_id").size().multiply(
...    taxes.groupby("simulant_id").size(), fill_value=0
... ).sum().astype(int)
>>> print(f"There are {true_matches:,} true matches to find between these datasets!")
There are 9,034 true matches to find between these datasets!

Now, see how many your record linkage method can find – without access to the truth deck, of course!
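Once a linkage method has produced candidate links, the truth deck can be used to score them. The sketch below shows one way this might be done, assuming the method's output can be expressed as pairs of (census row label, taxes row label). The toy DataFrames and the predicted_pairs set are hypothetical stand-ins for illustration, not pseudopeople output.

```python
import pandas as pd

# Toy stand-ins for the census and taxes DataFrames (hypothetical values).
census = pd.DataFrame({"simulant_id": ["0_2", "0_3", "0_923"]})
taxes = pd.DataFrame({"simulant_id": ["0_3", "0_3", "0_923", "0_5"]})

# Hypothetical output of a linkage method:
# pairs of (census row label, taxes row label) that it believes match.
predicted_pairs = {(1, 0), (1, 1), (0, 3)}

# Enumerate the true matching pairs by joining on the truth deck.
# reset_index() exposes each row label as a column named "index".
truth = census.reset_index().merge(
    taxes.reset_index(), on="simulant_id", suffixes=("_census", "_taxes")
)
true_pairs = set(zip(truth["index_census"], truth["index_taxes"]))

# Score the predictions against the truth.
true_positives = len(predicted_pairs & true_pairs)
precision = true_positives / len(predicted_pairs)
recall = true_positives / len(true_pairs)

print(f"{len(true_pairs)} true pairs, precision={precision:.2f}, recall={recall:.2f}")
```

The number of pairs produced by the join equals the count computed with groupby and multiply above, since each simulant contributes (records in census) × (records in taxes) pairs. For full-scale data, enumerating every true pair this way may be memory-intensive, so the count-based approach is likely preferable when only the total is needed.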

Not linking in Python? Just save your datasets as files, for example CSV files:

>>> census.to_csv('census.csv')
>>> taxes.to_csv('taxes.csv')

Now you can load these datasets in any environment that can read CSV.
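One thing to watch when reloading such files with pandas: by default, read_csv infers column types, so a zip code such as 00000 is read back as the integer 0. The minimal sketch below uses a toy table as a hypothetical stand-in for real pseudopeople output and shows how passing dtype=str keeps those columns as text.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for a generated dataset (hypothetical values).
toy = pd.DataFrame(
    {
        "simulant_id": ["0_2", "0_3"],
        "zipcode": ["00000", "00000"],
        "age": [25, 25],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "census.csv")
    # index=False omits the row index, which to_csv writes by default.
    toy.to_csv(path, index=False)

    inferred = pd.read_csv(path)            # types are inferred from the text
    as_text = pd.read_csv(path, dtype=str)  # every column is kept as a string

print(inferred["zipcode"].tolist())  # leading zeros are lost
print(as_text["zipcode"].tolist())   # leading zeros are preserved
```

Reading everything as text and converting individual columns afterwards is one option. Passing a per-column mapping to dtype is another.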

What’s next?

Now that you’ve generated a simulated dataset with pseudopeople, here are some next steps: