Datasets

This page describes the realistic simulated datasets that users can generate with pseudopeople for developing and testing entity resolution algorithms and software. These datasets are analogous to “real world” administrative records, such as tax documents and routinely generated files of Social Security numbers.

Each of the datasets that can be generated using pseudopeople has “noise” added to it, thereby realistically simulating how population data can be corrupted or distorted, which creates challenges in linking those records. To read more about the different kinds of noise that can be applied to the different datasets, please see the Noise page.

pseudopeople generates datasets about a single simulated US population, which is followed through time between January 1st, 2019 and May 1st, 2041. Most datasets are yearly and can be generated for any year between 2019 and 2041 (inclusive), though 2041 data will be partial.

There are two kinds of street addresses present in pseudopeople datasets: physical addresses and mailing addresses. A physical address represents the physical location where a simulant lives, which is where they are recorded in the Decennial Census and surveys. A mailing address represents the address a simulant uses to receive mail, which may be different – for example, a PO box. Mailing addresses, not physical addresses, are recorded in tax filings.

Note that in the small-scale simulated population that is available by default, these addresses all have their city/state/zip code set to the fictitious location of Anytown, WA 00000. This is to ensure that linking is not unrealistically easy with the sample population (i.e., using these attributes to eliminate clear non-matches is not possible, as they are all identical). To read more about obtaining large-scale data with more realistic city, state, and zip code data, please see Simulated populations.
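To see why identical location fields cannot help, consider blocking, a common entity resolution step in which only records sharing a key are compared. The sketch below uses made-up records and a hypothetical helper, count_candidate_pairs; it illustrates the reasoning and is not part of pseudopeople.

```python
from collections import Counter

def count_candidate_pairs(records, key):
    """Count record pairs that share the same value of `key` (hypothetical helper)."""
    block_sizes = Counter(record[key] for record in records)
    return sum(n * (n - 1) // 2 for n in block_sizes.values())

# Made-up records shaped like the default sample population.
records = [
    {"last_name": "Smith", "zipcode": "00000"},
    {"last_name": "Smith", "zipcode": "00000"},
    {"last_name": "Jones", "zipcode": "00000"},
    {"last_name": "Lee", "zipcode": "00000"},
]

# Blocking on zipcode removes nothing: all records fall into one block.
pairs_by_zipcode = count_candidate_pairs(records, "zipcode")
# Blocking on last_name leaves only the two Smith records to compare.
pairs_by_last_name = count_candidate_pairs(records, "last_name")
```

With every zipcode equal to 00000, blocking on that column keeps all possible pairs, so it does no filtering at all in the sample population.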

Some fields are not applicable to every record in a simulated dataset, so some columns may contain “missing” values, even if no noise has been added to the data. For example, most addresses do not have a unit number, and some do not have a street number, so the unit_number and/or street_number fields will be “missing” for many rows in any dataset that contains addresses. Similarly, columns pertaining to spouse or dependents in the 1040 tax dataset are not applicable to every simulant, so these columns also contain missing values. Values that are missing because they are not applicable are represented by numpy.nan.
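Because numpy.nan is a floating-point NaN, it does not compare equal to anything, including itself, so missing values need an explicit check. The following is a minimal sketch on made-up rows using only the standard library; is_missing is a hypothetical helper, and with a pandas DataFrame a function such as pandas.isna would serve the same purpose.

```python
import math

def is_missing(value):
    """True when a value is a float NaN (hypothetical helper)."""
    return isinstance(value, float) and math.isnan(value)

nan = float("nan")  # the same kind of value as numpy.nan

# Made-up address rows; NaN marks fields that are not applicable.
rows = [
    {"street_number": "123", "street_name": "Main St", "unit_number": nan},
    {"street_number": "45", "street_name": "Oak Ave", "unit_number": "Apt 2"},
    {"street_number": nan, "street_name": "Rural Route 9", "unit_number": nan},
]

# Count missing values per column.
missing_counts = {
    column: sum(is_missing(row[column]) for row in rows)
    for column in ("street_number", "street_name", "unit_number")
}
```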

The datasets that can be generated are listed below.

US Decennial Census

The Decennial Census dataset is a simulated enumeration of the US Census Bureau’s Decennial Census of Population and Housing. To find out more about the Decennial Census, please visit the Decennial Census homepage.

It is only possible to generate Decennial Census data for decennial years – 2020, 2030, and 2040.

Generate Decennial Census data with pseudopeople.generate_decennial_census().
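As a small illustration of the year restriction, the sketch below derives the decennial years from the simulated timespan and validates a requested year before any data is generated. The names SIMULATION_YEARS and check_decennial_year are hypothetical and are not part of the pseudopeople API.

```python
SIMULATION_YEARS = range(2019, 2042)  # 2019 through 2041 inclusive

# Decennial years are the simulated years divisible by ten.
decennial_years = [year for year in SIMULATION_YEARS if year % 10 == 0]

def check_decennial_year(year):
    """Return `year` if it is a decennial year, else raise ValueError (hypothetical helper)."""
    if year not in decennial_years:
        raise ValueError(
            f"{year} is not a decennial year; choose from {decennial_years}"
        )
    return year
```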

The following columns are included in this dataset:

Dataset columns

| Attribute Name | Column Name | Notes |
| --- | --- | --- |
| Unique simulant ID | simulant_id | Not affected by noise; intended use is “ground truth” for testing and validation; consistent across all datasets. |
| Unique household ID | household_id | Not affected by noise; intended use is “ground truth” for testing and validation; consistent across all datasets. |
| First name | first_name | |
| Middle initial | middle_initial | |
| Last name | last_name | |
| Age | age | Rounded down to an integer. |
| Date of birth | date_of_birth | Formatted as MM/DD/YYYY. |
| Physical address street number | street_number | |
| Physical address street name | street_name | |
| Physical address unit number | unit_number | |
| Physical address city | city | Default simulated population always has value “Anytown”. |
| Physical address state | state | Default simulated population always has value “WA”. |
| Physical address ZIP code | zipcode | Default simulated population always has value “00000”. |
| Housing type | housing_type | Possible values for housing type are “Household” for an individual household, or one of six different types of group quarters. The types of institutional group quarters are “Carceral”, “Nursing home”, and “Other institutional”. The types of noninstitutional group quarters are “College”, “Military”, and “Other noninstitutional”. |
| Relationship to reference person | relationship_to_reference_person | Possible values for this field include: “Reference person”; “Opposite-sex spouse”; “Opposite-sex unmarried partner”; “Same-sex spouse”; “Same-sex unmarried partner”; “Biological child”; “Adopted child”; “Stepchild”; “Sibling”; “Parent”; “Grandchild”; “Parent-in-law”; “Child-in-law”; “Other relative”; “Roommate or housemate”; “Foster child”; “Other nonrelative”; “Institutionalized group quarters population”; and “Noninstitutionalized group quarters population”. |
| Sex | sex | Binary; “male” or “female”. |
| Race/ethnicity | race_ethnicity | The categories for the single composite “race/ethnicity” field are as follows: “White”; “Black”; “Latino”; “American Indian and Alaskan Native (AIAN)”; “Asian”; “Native Hawaiian and Other Pacific Islander (NHOPI)”; and “Multiracial or Some Other Race”. |
| Year | year | Year in which data were collected; metadata that would not be collected directly; not affected by noise. |
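The seven housing_type values listed above fall into three broad groups. A minimal sketch of collapsing them, assuming the values appear exactly as documented; housing_category is a hypothetical helper, not a pseudopeople function.

```python
INSTITUTIONAL = {"Carceral", "Nursing home", "Other institutional"}
NONINSTITUTIONAL = {"College", "Military", "Other noninstitutional"}

def housing_category(housing_type):
    """Collapse housing_type into three broad categories (hypothetical helper)."""
    if housing_type == "Household":
        return "household"
    if housing_type in INSTITUTIONAL:
        return "institutional group quarters"
    if housing_type in NONINSTITUTIONAL:
        return "noninstitutional group quarters"
    return "unrecognized"  # a missing or unexpected value

categories = [housing_category(v) for v in ("Household", "Carceral", "College")]
```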

American Community Survey (ACS)

ACS is one of two household surveys that can currently be simulated using pseudopeople. It is an ongoing household survey conducted by the US Census Bureau that gathers information on a rolling basis about American community populations. Information collected includes ancestry, citizenship, education, income, language proficiency, migration, employment, disability, and housing characteristics. To find out more about ACS, please visit the ACS homepage.

pseudopeople can generate ACS data for a user-specified year, which will include records from simulated surveys conducted throughout that calendar year.

Generate ACS data with pseudopeople.generate_american_community_survey().

The following columns are included in this dataset:

Dataset columns

| Attribute Name | Column Name | Notes |
| --- | --- | --- |
| Unique simulant ID | simulant_id | Not affected by noise; intended use is “ground truth” for testing and validation; consistent across all datasets. |
| Unique household ID | household_id | Not affected by noise; intended use is “ground truth” for testing and validation; consistent across all datasets. |
| Survey date | survey_date | Date on which the survey was conducted; metadata that would not be collected directly; not affected by noise. Stored as a pandas.Timestamp, which displays in YYYY-MM-DD format by default. |
| First name | first_name | |
| Middle initial | middle_initial | |
| Last name | last_name | |
| Age | age | Rounded down to an integer. |
| Date of birth | date_of_birth | Formatted as MM/DD/YYYY. |
| Physical address street number | street_number | |
| Physical address street name | street_name | |
| Physical address unit number | unit_number | |
| Physical address city | city | Default simulated population always has value “Anytown”. |
| Physical address state | state | Default simulated population always has value “WA”. |
| Physical address ZIP code | zipcode | Default simulated population always has value “00000”. |
| Housing type | housing_type | Possible values for housing type are “Household” for an individual household, or one of six different types of group quarters. The types of institutional group quarters are “Carceral”, “Nursing home”, and “Other institutional”. The types of noninstitutional group quarters are “College”, “Military”, and “Other noninstitutional”. |
| Relationship to reference person | relationship_to_reference_person | Possible values for this field include: “Reference person”; “Opposite-sex spouse”; “Opposite-sex unmarried partner”; “Same-sex spouse”; “Same-sex unmarried partner”; “Biological child”; “Adopted child”; “Stepchild”; “Sibling”; “Parent”; “Grandchild”; “Parent-in-law”; “Child-in-law”; “Other relative”; “Roommate or housemate”; “Foster child”; “Other nonrelative”; “Institutionalized group quarters population”; and “Noninstitutionalized group quarters population”. |
| Sex | sex | Binary; “male” or “female”. |
| Race/ethnicity | race_ethnicity | The categories for the single composite “race/ethnicity” field are as follows: “White”; “Black”; “Latino”; “American Indian and Alaskan Native (AIAN)”; “Asian”; “Native Hawaiian and Other Pacific Islander (NHOPI)”; and “Multiracial or Some Other Race”. |
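Because simulant_id is consistent across all datasets and unaffected by noise, it can be used to score a linkage between two datasets, for example Decennial Census and ACS records. The sketch below is one way to do this on made-up records. The simulant_id values, the use of row positions as record identifiers, and both helper functions are assumptions made for illustration.

```python
def true_match_pairs(records_a, records_b):
    """Pairs of row positions (i, j) whose simulant_id values agree."""
    positions_b = {}
    for j, record in enumerate(records_b):
        positions_b.setdefault(record["simulant_id"], []).append(j)
    return {
        (i, j)
        for i, record in enumerate(records_a)
        for j in positions_b.get(record["simulant_id"], [])
    }

def precision_recall(predicted, truth):
    """Precision and recall of predicted pairs against ground-truth pairs."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# Made-up records; only the ground-truth column is shown.
census = [{"simulant_id": "0_1"}, {"simulant_id": "0_2"}, {"simulant_id": "0_3"}]
acs = [{"simulant_id": "0_2"}, {"simulant_id": "0_9"}, {"simulant_id": "0_3"}]

truth = true_match_pairs(census, acs)
predicted = {(1, 0), (0, 1)}  # output of some hypothetical linker
precision, recall = precision_recall(predicted, truth)
```

The linker itself should never see simulant_id; the column is only for checking results afterwards.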

Current Population Survey (CPS)

CPS is the other household survey that can be simulated using pseudopeople. CPS is conducted jointly by the US Census Bureau and the US Bureau of Labor Statistics. It collects labor force data, such as annual work activity and income, veteran status, school enrollment, contingent employment, worker displacement, job tenure, and more. To find out more about CPS, please visit the CPS homepage.

pseudopeople can generate CPS data for a user-specified year, which will include records from simulated surveys conducted throughout that calendar year.

Generate CPS data with pseudopeople.generate_current_population_survey().

The following columns are included in this dataset:

Dataset columns

| Attribute Name | Column Name | Notes |
| --- | --- | --- |
| Unique simulant ID | simulant_id | Not affected by noise; intended use is “ground truth” for testing and validation; consistent across all datasets. |
| Unique household ID | household_id | Not affected by noise; intended use is “ground truth” for testing and validation; consistent across all datasets. |
| Survey date | survey_date | Date on which the survey was conducted; metadata that would not be collected directly; not affected by noise. Stored as a pandas.Timestamp, which displays in YYYY-MM-DD format by default. |
| First name | first_name | |
| Middle initial | middle_initial | |
| Last name | last_name | |
| Age | age | Rounded down to an integer. |
| Date of birth | date_of_birth | Formatted as MM/DD/YYYY. |
| Physical address street number | street_number | |
| Physical address street name | street_name | |
| Physical address unit number | unit_number | |
| Physical address city | city | Default simulated population always has value “Anytown”. |
| Physical address state | state | Default simulated population always has value “WA”. |
| Physical address ZIP code | zipcode | Default simulated population always has value “00000”. |
| Sex | sex | Binary; “male” or “female”. |
| Race/ethnicity | race_ethnicity | The categories for the single composite “race/ethnicity” field are as follows: “White”; “Black”; “Latino”; “American Indian and Alaskan Native (AIAN)”; “Asian”; “Native Hawaiian and Other Pacific Islander (NHOPI)”; and “Multiracial or Some Other Race”. |

Women, Infants, and Children (WIC)

The Special Supplemental Nutrition Program for Women, Infants, and Children (WIC) is a government benefits program designed to support mothers and young children. The main qualifications are income and the presence of young children in the home. To find out more about this service, please visit the WIC homepage.

pseudopeople can generate a simulated version of the administrative data that would be recorded by WIC. This is a yearly file of information about all simulants enrolled in the program as of the end of that year. For the final year available, 2041, the file includes those enrolled as of May 1st, because this is the end of our simulated timespan.

Generate WIC data with pseudopeople.generate_women_infants_and_children().

The following columns are included in this dataset:

Dataset columns

| Attribute Name | Column Name | Notes |
| --- | --- | --- |
| Unique simulant ID | simulant_id | Not affected by noise; intended use is “ground truth” for testing and validation; consistent across all datasets. |
| Unique household ID | household_id | Not affected by noise; intended use is “ground truth” for testing and validation; consistent across all datasets. |
| First name | first_name | |
| Middle initial | middle_initial | |
| Last name | last_name | |
| Date of birth | date_of_birth | Formatted as MMDDYYYY. |
| Physical address street number | street_number | |
| Physical address street name | street_name | |
| Physical address unit number | unit_number | |
| Physical address city | city | Default simulated population always has value “Anytown”. |
| Physical address state | state | Default simulated population always has value “WA”. |
| Physical address ZIP code | zipcode | Default simulated population always has value “00000”. |
| Sex | sex | Binary; “male” or “female”. |
| Race/ethnicity | race_ethnicity | The categories for the single composite “race/ethnicity” field are as follows: “White”; “Black”; “Latino”; “American Indian and Alaskan Native (AIAN)”; “Asian”; “Native Hawaiian and Other Pacific Islander (NHOPI)”; and “Multiracial or Some Other Race”. |
| Year | year | Year in which benefits were received; metadata that would not be collected directly; not affected by noise. |

Social Security Administration

The Social Security Administration (SSA) is the US federal government agency that administers Social Security, the social insurance program that consists of retirement, disability and survivor benefits. To find out more about this program, visit the SSA homepage.

pseudopeople can generate a simulated version of a subset of the administrative data that would be recorded by SSA. Currently, the simulated SSA data includes records of SSN creation and dates of death. This is a yearly data file that is cumulative – when you specify a year, you will receive all records up to the end of that year.

The simulated SSA data files will not include records about simulants who died before 2019 (the start of our simulated timespan). Therefore, while SSA data files can be generated for years prior to 2019, they will only include records for SSN creation, and only for simulants who were still alive in 2019.
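The meaning of “cumulative” can be shown with a small sketch: a file for a given year contains every event dated on or before December 31 of that year. The events and the events_through helper below are made up for illustration and do not describe how pseudopeople builds the file. The example assumes un-noised event_date values in the documented YYYYMMDD format.

```python
def events_through(events, year):
    """Keep events dated on or before December 31 of `year` (hypothetical helper)."""
    cutoff = f"{year}1231"
    # YYYYMMDD strings sort chronologically, so string comparison is enough here.
    return [event for event in events if event["event_date"] <= cutoff]

# Made-up SSA-style events.
events = [
    {"simulant_id": "a", "event_type": "Creation", "event_date": "19850214"},
    {"simulant_id": "b", "event_type": "Creation", "event_date": "20200630"},
    {"simulant_id": "a", "event_type": "Death", "event_date": "20210105"},
]

through_2020 = events_through(events, 2020)  # both creation events
through_2021 = events_through(events, 2021)  # adds the 2021 death event
```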

Generate SSA data with pseudopeople.generate_social_security().

The following columns are included in this dataset:

Dataset columns

| Attribute Name | Column Name | Notes |
| --- | --- | --- |
| Unique simulant ID | simulant_id | Not affected by noise; intended use is “ground truth” for testing and validation; consistent across all datasets. |
| Social security number | ssn | By default, the SSN column in the SSA dataset has no column-based noise. However, it can be configured to have noise if desired. |
| First name | first_name | |
| Middle name | middle_name | |
| Last name | last_name | |
| Date of birth | date_of_birth | Formatted as YYYYMMDD. |
| Sex | sex | Binary; “male” or “female”. |
| Type of event | event_type | Possible values are “Creation” and “Death”. |
| Date of event | event_date | Formatted as YYYYMMDD. |
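The date_of_birth column uses a different string format depending on the dataset: MM/DD/YYYY in the Decennial Census, ACS, and CPS; MMDDYYYY in WIC; and YYYYMMDD in SSA. Comparing dates across datasets therefore requires normalizing them first. The sketch below shows one way to do that with the standard library; the dictionary keys and the parse_date_of_birth helper are hypothetical labels chosen for this example.

```python
from datetime import datetime

# Formats as documented on this page; the keys are hypothetical labels.
DATE_FORMATS = {
    "decennial_census": "%m/%d/%Y",
    "american_community_survey": "%m/%d/%Y",
    "current_population_survey": "%m/%d/%Y",
    "women_infants_and_children": "%m%d%Y",
    "social_security": "%Y%m%d",
}

def parse_date_of_birth(value, dataset):
    """Return a datetime.date, or None if the value is missing or unparseable."""
    if not isinstance(value, str):
        return None  # e.g. NaN for a missing value
    try:
        return datetime.strptime(value, DATE_FORMATS[dataset]).date()
    except ValueError:
        return None  # e.g. a value distorted by noise

# The same made-up birth date as it would appear in three datasets.
dob_census = parse_date_of_birth("07/04/1990", "decennial_census")
dob_wic = parse_date_of_birth("07041990", "women_infants_and_children")
dob_ssa = parse_date_of_birth("19900704", "social_security")
```

Returning None for unparseable values keeps noisy records in the pipeline instead of raising an error, which is usually what a linkage workflow wants.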

Tax forms: W-2 & 1099

Administrative data reported in annual tax forms, such as W-2s and 1099s, can also be simulated by pseudopeople. 1099 forms are used for independent contractors or self-employed individuals, while a W-2 form is submitted by an employer for their employee (as the employer withholds payroll taxes from employee earnings).

pseudopeople can generate a simulated version of the data collected by W-2 and 1099 forms. This is a yearly dataset, where the user-specified year is the tax year of the data. That is, the data for 2022 will be the result of tax forms filed in early 2023. Tax data can be generated for tax years 2019 through 2040 (inclusive).

Generate W-2 and 1099 data with pseudopeople.generate_taxes_w2_and_1099().

The following columns are included in these datasets:

Dataset columns

| Attribute Name | Column Name | Notes |
| --- | --- | --- |
| Unique simulant ID | simulant_id | Not affected by noise; intended use is “ground truth” for testing and validation; consistent across all datasets. |
| Unique household ID | household_id | Not affected by noise; intended use is “ground truth” for testing and validation; consistent across all datasets. |
| Employer ID | employer_id | |
| Social security number | ssn | |
| Wages | wages | |
| Employer name | employer_name | |
| Employer street number | employer_street_number | |
| Employer street name | employer_street_name | |
| Employer unit number | employer_unit_number | |
| Employer city | employer_city | Default simulated population always has value “Anytown”. |
| Employer state | employer_state | Default simulated population always has value “WA”. |
| Employer ZIP code | employer_zipcode | Default simulated population always has value “00000”. |
| First name | first_name | |
| Middle initial | middle_initial | |
| Last name | last_name | |
| Mailing address street number | mailing_address_street_number | |
| Mailing address street name | mailing_address_street_name | |
| Mailing address unit number | mailing_address_unit_number | |
| Mailing address PO box | mailing_address_po_box | |
| Mailing address city | mailing_address_city | Default simulated population always has value “Anytown”. |
| Mailing address state | mailing_address_state | Default simulated population always has value “WA”. |
| Mailing address ZIP code | mailing_address_zipcode | Default simulated population always has value “00000”. |
| Type of tax form | tax_form | Possible values are “W2” or “1099”. |
| Tax year | tax_year | Year for which tax data were collected; metadata that would not be collected directly; not affected by noise. |

Tax form: 1040

As with data collected from W-2 and 1099 forms, pseudopeople enables the simulation of administrative records from 1040 forms, which are also reported to the IRS on an annual basis. While W-2 forms are submitted by an employer to the IRS, 1040 forms are submitted by the employee. To find out more about the 1040 tax form, visit the IRS information page.

A single row in a pseudopeople-generated 1040 dataset may contain information about several simulants: the primary filer, the primary filer’s joint filer (spouse) if they are married filing jointly, and up to four claimed dependents. When not applicable, all relevant fields are numpy.nan; for example, a row representing a 1040 filed by only one simulant, without a joint filer, would have missingness in all joint filer columns.

If a simulant claims fewer than four dependents, the dependent columns are filled in starting with dependent_1. For example, a simulant claiming three dependents would have missingness in all dependent_4 columns. A simulant may claim more than four dependents, but only four will appear in the dataset; the rest are omitted.

All columns not otherwise labeled are about the primary filer; for example, the first_name column is the first name of the primary filer. The simulant_id and household_id columns represent the “ground truth” of which simulant is the primary filer, and which household that primary filer lives in. It is not guaranteed that all simulants described in a 1040 row live in the same household; for example, college students may be claimed as dependents while living elsewhere.

A single simulant can appear in multiple rows in this dataset, for example if they filed a 1040 and were also claimed as a dependent on another simulant’s 1040.
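Since one 1040 row can describe up to six people, linking against person-level datasets often starts by unpacking each row into one entry per person. The sketch below shows one way to do this with the column names documented for this dataset. The row contents, the role labels, and the helpers is_missing and people_on_1040 are made up for illustration.

```python
import math

def is_missing(value):
    """True for None or a float NaN (hypothetical helper)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def people_on_1040(row):
    """List (role, first_name, last_name, ssn) for each person on a 1040 row."""
    prefixes = [("primary", ""), ("joint filer", "spouse_")]
    prefixes += [(f"dependent {n}", f"dependent_{n}_") for n in range(1, 5)]
    people = []
    for role, prefix in prefixes:
        first_name = row.get(f"{prefix}first_name")
        last_name = row.get(f"{prefix}last_name")
        ssn = row.get(f"{prefix}ssn")
        if all(is_missing(v) for v in (first_name, last_name, ssn)):
            continue  # this slot is not applicable on this return
        people.append((role, first_name, last_name, ssn))
    return people

nan = float("nan")
# Made-up row: a single filer with one dependent; other slots are omitted for brevity.
row = {
    "first_name": "Ana", "last_name": "Ruiz", "ssn": "000-00-0001",
    "spouse_first_name": nan, "spouse_last_name": nan, "spouse_ssn": nan,
    "dependent_1_first_name": "Leo", "dependent_1_last_name": "Ruiz",
    "dependent_1_ssn": "000-00-0002",
}
people = people_on_1040(row)
```

A slot is skipped only when all three of its fields are missing, so a person whose SSN alone is missing still appears in the output.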

This is a yearly dataset, where the user-specified year is the tax year of the data. 1040 data can be generated for tax years 2019 through 2040 (inclusive).

Generate 1040 data with pseudopeople.generate_taxes_1040().

The following columns are included in this dataset:

Dataset columns

| Attribute Name | Column Name | Notes |
| --- | --- | --- |
| Unique simulant ID | simulant_id | Not affected by noise; intended use is “ground truth” for testing and validation; consistent across all datasets. |
| Unique household ID | household_id | Not affected by noise; intended use is “ground truth” for testing and validation; consistent across all datasets. |
| First name | first_name | |
| Middle initial | middle_initial | |
| Last name | last_name | |
| Social Security Number (SSN) | ssn | Individual Taxpayer Identification Number (ITIN) if no SSN. |
| Mailing address street number | mailing_address_street_number | |
| Mailing address street name | mailing_address_street_name | |
| Mailing address unit number | mailing_address_unit_number | |
| Mailing address PO box | mailing_address_po_box | |
| Mailing address city | mailing_address_city | Default simulated population always has value “Anytown”. |
| Mailing address state | mailing_address_state | Default simulated population always has value “WA”. |
| Mailing address ZIP code | mailing_address_zipcode | Default simulated population always has value “00000”. |
| Joint filer first name | spouse_first_name | |
| Joint filer middle initial | spouse_middle_initial | |
| Joint filer last name | spouse_last_name | |
| Joint filer social security number | spouse_ssn | Individual Taxpayer Identification Number (ITIN) if no SSN. |
| Dependent 1 first name | dependent_1_first_name | |
| Dependent 1 last name | dependent_1_last_name | |
| Dependent 1 social security number | dependent_1_ssn | Individual Taxpayer Identification Number (ITIN) if no SSN. |
| Dependent 2 first name | dependent_2_first_name | |
| Dependent 2 last name | dependent_2_last_name | |
| Dependent 2 social security number | dependent_2_ssn | Individual Taxpayer Identification Number (ITIN) if no SSN. |
| Dependent 3 first name | dependent_3_first_name | |
| Dependent 3 last name | dependent_3_last_name | |
| Dependent 3 social security number | dependent_3_ssn | Individual Taxpayer Identification Number (ITIN) if no SSN. |
| Dependent 4 first name | dependent_4_first_name | |
| Dependent 4 last name | dependent_4_last_name | |
| Dependent 4 social security number | dependent_4_ssn | Individual Taxpayer Identification Number (ITIN) if no SSN. |
| Tax year | tax_year | Year for which tax data were collected; metadata that would not be collected directly; not affected by noise. |