pseudopeople is a Python package that generates realistic simulated data about a fictional United States population, designed for use in testing entity resolution (record linkage) methods or other data science algorithms at scale.
The fictional US population was generated by stochastically simulating multiple decades of population dynamics such as fertility, mortality, migration and employment. pseudopeople builds on this fictional population data by simulating errors in the data collection process to create realistic, noisy datasets.
pseudopeople is currently in a public beta release. Things are still in flux! If you notice any issues, please let us know on GitHub.
Introduction
The Simulation Science Team of the University of Washington’s Institute for Health Metrics and Evaluation is excited to introduce pseudopeople, a Python package that simplifies entity resolution research and development. The package generates large-scale simulated population data according to user specifications, replicating a range of complexities found in real applications of probabilistic record linkage software. Because entity resolution often requires sensitive data, obtaining data to develop and test new methods and software has been a challenge. Our approach creates realistic simulated data, including name, address, and date of birth, without compromising privacy.
Our work builds on the success of previous data synthesis projects, such as FEBRL, GeCo, and SOG, and generates a simulated population using real, publicly accessible data about the US population by leveraging the power of our simulation platform Vivarium.
Want to know more about pseudopeople? Please visit the pseudopeople project website, where you can find out more about the principles and processes underlying this work.
Quickstart
pseudopeople requires a version of Python between 3.8 and 3.11 (inclusive) to be installed. Once Python is installed, you can install pseudopeople with pip by running the command:
$ pip install pseudopeople
Alternatively, you can install from source using the pseudopeople GitHub repository.
Then, generate a small-scale simulated decennial census:
$ python
>>> import pseudopeople as psp
>>> census = psp.generate_decennial_census()
>>> census
simulant_id household_id first_name middle_initial last_name age date_of_birth ... state zipcode housing_type relationship_to_reference_person sex race_ethnicity year
0 0_2 0_7 Diana P Kelly 25 05/06/1994 ... WA 00000 Household Reference person Female NaN 2020
1 0_3 0_7 Anna A Kelly 25 09/29/1994 ... WA 00000 Household Other relative Female White 2020
2 0_923 0_8033 Gerald R Allen 76 11/03/1943 ... WA 00000 Household Reference person Male Black 2020
3 0_2641 0_1066 Loretta T Lowe 61 06/01/1958 ... WA 00000 Household Reference person Female White 2020
4 0_2801 0_1138 Richard R Pinard 73 03/03/1947 ... WA 00000 Household Reference person Male White 2020
... ... ... ... ... ... .. ... ... ... ... ... ... ... ... ...
10215 0_18969 0_7630 Patty E Palmisano 87 01/11/1933 ... WA 00000 Household Opposite-sex spouse Female White 2020
10216 0_19008 0_8361 John V Skeeter 58 12/29/1961 ... WA 00000 Household Reference person Male Black 2020
10217 0_20165 0_7999 Kimberly K Suitt 65 04/05/1955 ... WA 00000 Household Reference person Female White 2020
10218 0_19020 0_8130 Virginia G Hoover 93 10/02/1926 ... WA 00000 Household Reference person Female White 2020
10219 0_20163 0_7998 Victoria R Simmons 27 04/21/1992 ... WA 00000 Household Reference person Female White 2020
[10220 rows x 18 columns]
And W-2 and 1099 tax forms from the same simulated population:
>>> taxes = psp.generate_taxes_w2_and_1099()
>>> taxes
simulant_id household_id employer_id ssn wages ... mailing_address_city mailing_address_state mailing_address_zipcode tax_form tax_year
0 0_4 0_8 95 584-16-0130 10192 ... Anytown WA 00000 W2 2020
1 0_5 0_8 29 854-13-6295 28355 ... Anytown WA 00000 W2 2020
2 0_5 0_8 30 854-13-6295 18243 ... Anytown WA 00000 W2 2020
3 0_5621 0_2289 46 674-27-1745 7704 ... Anytown WA 00000 W2 2020
4 0_5623 0_2289 83 794-23-1522 3490 ... Anytown WA 00000 W2 2020
... ... ... ... ... ... ... ... ... ... ... ...
9911 0_18936 0_7621 23 006-92-7857 9585 ... Anytown WA 00000 W2 2020
9912 0_18936 0_7621 90 006-92-7857 57906 ... Anytown WA 00000 W2 2020
9913 0_18937 0_7621 1 182-82-5017 19609 ... Anytown WA NaN 1099 2020
9914 0_18937 0_7621 105 182-82-5017 8061 ... Anytown WA 00000 1099 2020
9915 0_18939 0_7621 9 283-97-5940 4961 ... Anytown WA 00000 W2 2020
[9916 rows x 24 columns]
The simulated people in these datasets are called “simulants.”
Both datasets have a simulant_id column that uniquely identifies an individual. The unique simulant_id present in both datasets provides us with a truth deck, which we wouldn’t have for a linkage task with real, sensitive data.
Note that in the small-scale simulated population that is available by default, all addresses have their city, state, and zip code set to the fictitious location of Anytown, WA 00000. To read more about obtaining large-scale data with more realistic city, state, and zip code data, please see Simulated populations.
>>> # To find how many matches there are between records about a given simulant,
>>> # we need to multiply the number of records about that simulant in the census by
>>> # the number of records about that simulant in taxes
>>> true_matches = census.groupby("simulant_id").size().multiply(
... taxes.groupby("simulant_id").size(), fill_value=0
... ).sum().astype(int)
>>> print(f"There are {true_matches:,} true matches to find between these datasets!")
There are 9,034 true matches to find between these datasets!
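To see why the counts are multiplied, here is a minimal sketch that uses only the standard library and made-up simulant IDs (not pseudopeople output). A simulant with m census records and n tax records contributes m × n matching record pairs, and a simulant missing from either dataset contributes none.

```python
from collections import Counter

# Hypothetical simulant_id values, one entry per record (illustrative only).
census_ids = ["0_1", "0_2", "0_2", "0_3"]
tax_ids = ["0_1", "0_1", "0_2", "0_4"]

census_counts = Counter(census_ids)
tax_counts = Counter(tax_ids)

# Counter returns 0 for a missing key, which mirrors fill_value=0 above:
# a simulant absent from one dataset contributes no matches.
true_matches = sum(n * tax_counts[sid] for sid, n in census_counts.items())
print(true_matches)  # prints 4: 1*2 for "0_1" plus 2*1 for "0_2"
```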
Now, see how many your record linkage method can find – without access to the truth deck, of course!
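One common way to score a linkage result against the truth deck is to compute precision and recall over predicted record pairs. The sketch below is illustrative and not part of pseudopeople: the score_links helper, the row labels, and the toy IDs are all hypothetical.

```python
def score_links(predicted_pairs, census_ids, tax_ids):
    """Return (precision, recall) for predicted (census_row, tax_row) pairs.

    census_ids and tax_ids map each row label to its true simulant_id.
    """
    predicted = set(predicted_pairs)
    # Every cross-dataset pair of rows sharing a simulant_id is a true match.
    truth = {
        (c_row, t_row)
        for c_row, c_id in census_ids.items()
        for t_row, t_id in tax_ids.items()
        if c_id == t_id
    }
    correct = len(predicted & truth)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical truth deck: row label -> simulant_id.
census_ids = {0: "0_1", 1: "0_2", 2: "0_3"}
tax_ids = {0: "0_1", 1: "0_1", 2: "0_2"}

# Hypothetical output of a linkage method; the last pair is a false match.
predicted_pairs = [(0, 0), (1, 2), (2, 1)]

precision, recall = score_links(predicted_pairs, census_ids, tax_ids)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Building the truth set by comparing every pair of rows is quadratic, so this form only suits small samples; on larger data, grouping rows by simulant_id first would avoid the full comparison.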
Not linking in Python? Just save your datasets as files, for example CSV files:
>>> census.to_csv('census.csv')
>>> taxes.to_csv('taxes.csv')
Now you can load these datasets in any environment that can read CSV.
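As a rough illustration of reading such a file outside pandas, the sketch below parses CSV text with Python's standard csv module. The text is a hand-written stand-in for census.csv, trimmed to a few columns; it assumes the default to_csv behavior of writing the row index as an unnamed first column.

```python
import csv
import io

# Stand-in for the contents of census.csv, trimmed to a few columns.
# pandas writes the row index as an unnamed first column by default.
csv_text = (
    ",simulant_id,first_name,last_name,zipcode\n"
    "0,0_2,Diana,Kelly,00000\n"
    "1,0_3,Anna,Kelly,00000\n"
)

rows = list(csv.DictReader(io.StringIO(csv_text)))

# The csv module keeps every field as a string, so leading zeros survive.
print(rows[0]["simulant_id"], rows[0]["zipcode"])
# The unnamed index column is exposed under the empty-string key.
print(rows[1][""])
```

Tools that infer column types may parse a value like 00000 as the integer 0, so consider reading identifier-like columns such as zipcode and ssn as text.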
What’s next?
Now that you’ve generated a simulated dataset with pseudopeople, here are some next steps:
To get started with customizing the noise in your datasets, try out the tutorial on configuring noise.
To learn more about the kinds of simulated datasets that are available, check out our Datasets page.
If you need larger datasets with millions instead of thousands of rows, take a look at the Simulated populations page.
To dive deeper into noise, read the docs about noise and noise configuration.
To stay informed and receive updates about this software package, join the mailing list here.