Dataset Generation Functions

Each of the following functions generates one of the simulated datasets documented on the Datasets page. For example, pseudopeople.generate_decennial_census() generates the Decennial Census dataset.

The dataset generation functions share the same optional parameters, except that pseudopeople.generate_social_security() does not take state. Notable parameters include:

a source path to the root directory of pseudopeople simulated population data (defaults to using the sample population included with pseudopeople).

a config path to a YAML file, a Python dictionary, or the special value pseudopeople.NO_NOISE, to override the default configuration.

a year (defaults to 2020).

For applied examples of using these functions, see the Quickstart and tutorials.
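Because the functions share most of their keyword arguments, one set of options can be reused across them. The following is a minimal sketch, not part of pseudopeople: call_with_supported is a hypothetical helper that uses the standard library's inspect.signature to drop any keyword a function does not declare (for example, state for generate_social_security, per the signatures on this page). The two stub functions are stand-ins that only mimic those signatures, so the sketch runs without pseudopeople installed.

```python
import inspect


def call_with_supported(func, **options):
    """Call func with only the keyword options that its signature declares.

    Hypothetical helper for illustration; it is not part of pseudopeople.
    """
    accepted = inspect.signature(func).parameters
    supported = {name: value for name, value in options.items() if name in accepted}
    return func(**supported)


# Stand-ins that mimic the signatures listed on this page (not the real functions).
def fake_generate_decennial_census(source=None, seed=0, config=None, year=2020,
                                   state=None, verbose=False, engine="pandas"):
    return {"year": year, "state": state, "seed": seed}


def fake_generate_social_security(source=None, seed=0, config=None, year=2020,
                                  verbose=False, engine="pandas"):
    return {"year": year, "seed": seed}


shared = {"seed": 42, "year": 2020, "state": "OH"}
census_args = call_with_supported(fake_generate_decennial_census, **shared)
ssa_args = call_with_supported(fake_generate_social_security, **shared)  # state is dropped
```

With pseudopeople installed, the stubs would be replaced by pseudopeople.generate_decennial_census and pseudopeople.generate_social_security.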

pseudopeople.generate_american_community_survey(source=None, seed=0, config=None, year=2020, state=None, verbose=False, engine='pandas')

Generates a pseudopeople ACS dataset, which represents simulated responses to the ACS.

The American Community Survey (ACS) is an ongoing household survey conducted by the US Census Bureau that gathers information on a rolling basis about American community populations. Information collected includes ancestry, citizenship, education, income, language proficiency, migration, employment, disability, and housing characteristics.

Parameters:

source (Path | str) – The root directory containing pseudopeople simulated population data. Defaults to using the included sample population when source is None.
seed (int) – An integer seed for randomness. Defaults to 0.
config (Path | str | Dict[str, Dict]) – An optional override to the default configuration. Can be a path to a configuration YAML file, a configuration dictionary, or the sentinel value pseudopeople.NO_NOISE, which will generate a dataset without any configurable noise.
year (int | None) – The year for which to generate simulated American Community Surveys of the simulated population (format YYYY, e.g., 2036); the simulated dataset will contain records for surveys conducted on any date in the specified year. Default is 2020. If None is passed instead, data for all available years are included in the returned dataset.
state (str | None) – The US state for which to generate simulated American Community Surveys of the simulated population, or None (default) to generate data for all available US states. The returned dataset will contain survey data for simulants living in the specified state during the specified year. Can be a full state name or a state abbreviation (e.g., “Ohio” or “OH”).
verbose (bool) – Enable verbose logging if True. Default is False.
engine (Literal['pandas', 'dask']) – Engine to use for loading data; determines the return type. The default, “pandas”, returns a pandas DataFrame. “dask” returns a Dask DataFrame and requires Dask to be installed (e.g., pip install pseudopeople[dask]); it runs the dataset generation on a Dask cluster, which can parallelize the work and run out-of-core.

Returns:

A DataFrame of simulated ACS data.

Raises:

ConfigurationError – An invalid config is provided.
DataSourceError – An invalid pseudopeople simulated population data source is provided.
ValueError – The simulated population has no data for this dataset in the specified year or state.

Return type:

DataFrame
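The state parameter and the documented ValueError suggest a simple pattern for collecting data one state at a time. The sketch below is illustrative only: generate_by_state is a hypothetical helper, and fake_generate is a stand-in for pseudopeople.generate_american_community_survey so that the code runs without pseudopeople installed. It assumes that a ValueError means "no data for this year or state"; in real use, other causes of ValueError would be skipped as well.

```python
def generate_by_state(generate, states, **options):
    """Return {state: dataset}, omitting states for which no data are available."""
    datasets = {}
    for state in states:
        try:
            datasets[state] = generate(state=state, **options)
        except ValueError:
            # Documented above: raised when the simulated population has no
            # data for the requested year or state.
            continue
    return datasets


# Stand-in generator (hypothetical): only Ohio has data.
def fake_generate(state=None, year=2020, **ignored):
    if state not in {"OH", "Ohio"}:
        raise ValueError(f"no data for {state} in {year}")
    return [f"{state}-{year}-record"]


result = generate_by_state(fake_generate, ["OH", "WA"], year=2020)
```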

pseudopeople.generate_current_population_survey(source=None, seed=0, config=None, year=2020, state=None, verbose=False, engine='pandas')

Generates a pseudopeople CPS dataset, which represents simulated responses to the CPS.

The Current Population Survey (CPS) is a household survey conducted by the US Census Bureau and the US Bureau of Labor Statistics. This survey is administered by Census Bureau field representatives across the country through both personal and telephone interviews. CPS collects labor force data, such as annual work activity and income, veteran status, school enrollment, contingent employment, worker displacement, job tenure, and more.

Parameters:

source (Path | str) – The root directory containing pseudopeople simulated population data. Defaults to using the included sample population when source is None.
seed (int) – An integer seed for randomness. Defaults to 0.
config (Path | str | Dict[str, Dict]) – An optional override to the default configuration. Can be a path to a configuration YAML file, a configuration dictionary, or the sentinel value pseudopeople.NO_NOISE, which will generate a dataset without any configurable noise.
year (int | None) – The year for which to generate simulated Current Population Surveys of the simulated population (format YYYY, e.g., 2036); the simulated dataset will contain records for surveys conducted on any date in the specified year. Default is 2020. If None is passed instead, data for all available years are included in the returned dataset.
state (str | None) – The US state for which to generate simulated Current Population Surveys of the simulated population, or None (default) to generate data for all available US states. The returned dataset will contain survey data for simulants living in the specified state during the specified year. Can be a full state name or a state abbreviation (e.g., “Ohio” or “OH”).
verbose (bool) – Enable verbose logging if True. Default is False.
engine (Literal['pandas', 'dask']) – Engine to use for loading data; determines the return type. The default, “pandas”, returns a pandas DataFrame. “dask” returns a Dask DataFrame and requires Dask to be installed (e.g., pip install pseudopeople[dask]); it runs the dataset generation on a Dask cluster, which can parallelize the work and run out-of-core.

Returns:

A DataFrame of simulated CPS data.

Raises:

ConfigurationError – An invalid config is provided.
DataSourceError – An invalid pseudopeople simulated population data source is provided.
ValueError – The simulated population has no data for this dataset in the specified year or state.

Return type:

DataFrame
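The seed parameter can be varied to produce several replicates of the same dataset. The sketch below assumes, without confirmation from this page, that a fixed seed reproduces the same output and that different seeds give different noise. generate_replicates is a hypothetical helper, and fake_generate is a stand-in for pseudopeople.generate_current_population_survey that draws from Python's random module purely to make the sketch runnable.

```python
import random


def generate_replicates(generate, seeds, **options):
    """Return one dataset per seed, keyed by seed."""
    return {seed: generate(seed=seed, **options) for seed in seeds}


# Stand-in generator (hypothetical): seeded pseudo-random values instead of survey rows.
def fake_generate(seed=0, year=2020, **ignored):
    rng = random.Random(seed)
    return [round(rng.random(), 6) for _ in range(3)]


replicates = generate_replicates(fake_generate, [0, 1], year=2020)
rerun = generate_replicates(fake_generate, [0], year=2020)
```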

pseudopeople.generate_decennial_census(source=None, seed=0, config=None, year=2020, state=None, verbose=False, engine='pandas')

Generates a pseudopeople decennial census dataset which represents simulated responses to the US Census Bureau’s Census of Population and Housing.

Parameters:

source (Path | str) – The root directory containing pseudopeople simulated population data. Defaults to using the included sample population when source is None.
seed (int) – An integer seed for randomness. Defaults to 0.
config (Path | str | Dict[str, Dict]) – An optional override to the default configuration. Can be a path to a configuration YAML file, a configuration dictionary, or the sentinel value pseudopeople.NO_NOISE, which will generate a dataset without any configurable noise.
year (int | None) – The year for which to generate a simulated decennial census of the simulated population (format YYYY, e.g., 2030). Must be a decennial year (e.g., 2020, 2030, 2040). Default is 2020. If None is passed instead, data for all available years are included in the returned dataset.
state (str | None) – The US state for which to generate a simulated census of the simulated population, or None (default) to generate data for all available US states. The returned dataset will contain data for simulants living in the specified state on Census Day (April 1) of the specified year. Can be a full state name or a state abbreviation (e.g., “Ohio” or “OH”).
verbose (bool) – Enable verbose logging if True. Default is False.
engine (Literal['pandas', 'dask']) – Engine to use for loading data; determines the return type. The default, “pandas”, returns a pandas DataFrame. “dask” returns a Dask DataFrame and requires Dask to be installed (e.g., pip install pseudopeople[dask]); it runs the dataset generation on a Dask cluster, which can parallelize the work and run out-of-core.

Returns:

A DataFrame of simulated decennial census data.

Raises:

ConfigurationError – An invalid config is provided.
DataSourceError – An invalid pseudopeople simulated population data source is provided.
ValueError – The simulated population has no data for this dataset in the specified year or state.

Return type:

DataFrame
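Since year must be a decennial year for this function, it can be worth checking the value before calling it. The helper below is a hypothetical sketch, not part of pseudopeople. It assumes that "decennial year" means a year divisible by 10, which matches the examples given (2020, 2030, 2040), and it returns the Census Day (April 1) referred to in the state parameter description.

```python
from datetime import date


def census_day(year):
    """Return Census Day (April 1) for a decennial year, else raise ValueError."""
    if year % 10 != 0:
        raise ValueError(f"{year} is not a decennial year")
    return date(year, 4, 1)


reference = census_day(2030)
```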

pseudopeople.generate_social_security(source=None, seed=0, config=None, year=2020, verbose=False, engine='pandas')

Generates a pseudopeople SSA dataset which represents simulated Social Security Administration (SSA) data.

Parameters:

source (Path | str) – The root directory containing pseudopeople simulated population data. Defaults to using the included sample population when source is None.
seed (int) – An integer seed for randomness. Defaults to 0.
config (Path | str | Dict[str, Dict]) – An optional override to the default configuration. Can be a path to a configuration YAML file, a configuration dictionary, or the sentinel value pseudopeople.NO_NOISE, which will generate a dataset without any configurable noise.
year (int | None) – The final year of simulated social security records to include in the dataset (format YYYY, e.g., 2036); will also include records from all previous years. Default is 2020. If None is passed instead, data for all available years are included in the returned dataset.
verbose (bool) – Enable verbose logging if True. Default is False.
engine (Literal['pandas', 'dask']) – Engine to use for loading data; determines the return type. The default, “pandas”, returns a pandas DataFrame. “dask” returns a Dask DataFrame and requires Dask to be installed (e.g., pip install pseudopeople[dask]); it runs the dataset generation on a Dask cluster, which can parallelize the work and run out-of-core.

Returns:

A DataFrame of simulated SSA data.

Raises:

ConfigurationError – An invalid config is provided.
DataSourceError – An invalid pseudopeople simulated population data source is provided.
ValueError – The simulated population has no data for this dataset in the specified year or any prior years.

Return type:

DataFrame
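The engine parameter requires Dask only when “dask” is selected, so a script can fall back to “pandas” when Dask is missing. The sketch below is one possible approach: pick_engine is a hypothetical helper that uses the standard library's importlib.util.find_spec. It checks only that a top-level dask package can be found, not that every dependency of pseudopeople[dask] is present.

```python
import importlib.util


def pick_engine(prefer_dask=True):
    """Return "dask" if requested and importable, otherwise "pandas"."""
    if prefer_dask and importlib.util.find_spec("dask") is not None:
        return "dask"
    return "pandas"


engine = pick_engine()
```

The result could then be passed as engine=engine to any of the generation functions on this page.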

pseudopeople.generate_taxes_1040(source=None, seed=0, config=None, year=2020, state=None, verbose=False, engine='pandas')

Generates a pseudopeople 1040 tax dataset which represents simulated tax form data.

Parameters:

source (Path | str) – The root directory containing pseudopeople simulated population data. Defaults to using the included sample population when source is None.
seed (int) – An integer seed for randomness. Defaults to 0.
config (Path | str | Dict[str, Dict]) – An optional override to the default configuration. Can be a path to a configuration YAML file, a configuration dictionary, or the sentinel value pseudopeople.NO_NOISE, which will generate a dataset without any configurable noise.
year (int | None) – The tax year for which to generate records (format YYYY, e.g., 2036); the simulated dataset will contain the 1040 tax forms filed by simulants for the specified year. Default is 2020. If None is passed instead, data for all available years are included in the returned dataset.
state (str | None) – The US state for which to generate tax records from the simulated population, or None (default) to generate data for all available US states. The returned dataset will contain 1040 tax forms filed by simulants living in the specified state during the specified tax year. Can be a full state name or a state abbreviation (e.g., “Ohio” or “OH”).
verbose (bool) – Enable verbose logging if True. Default is False.
engine (Literal['pandas', 'dask']) – Engine to use for loading data; determines the return type. The default, “pandas”, returns a pandas DataFrame. “dask” returns a Dask DataFrame and requires Dask to be installed (e.g., pip install pseudopeople[dask]); it runs the dataset generation on a Dask cluster, which can parallelize the work and run out-of-core.

Returns:

A DataFrame of simulated 1040 tax data.

Raises:

ConfigurationError – An invalid config is provided.
DataSourceError – An invalid pseudopeople simulated population data source is provided.
ValueError – The simulated population has no data for this dataset in the specified year or state.

Return type:

DataFrame
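Passing pseudopeople.NO_NOISE as config yields a dataset without configurable noise, which can be compared against a default run. The sketch below is illustrative only: generate_noisy_and_clean is a hypothetical helper, and fake_generate and FAKE_NO_NOISE are stand-ins for pseudopeople.generate_taxes_1040 and pseudopeople.NO_NOISE. The row-by-row comparison assumes both outputs contain the same records in the same order, which real noisy data may not satisfy.

```python
def generate_noisy_and_clean(generate, no_noise, **options):
    """Return (noisy, clean) datasets generated with otherwise identical options."""
    noisy = generate(config=None, **options)
    clean = generate(config=no_noise, **options)
    return noisy, clean


# Stand-ins (hypothetical): a sentinel object and a generator that adds one typo.
FAKE_NO_NOISE = object()


def fake_generate(config=None, year=2020, **ignored):
    if config is FAKE_NO_NOISE:
        return ["Alice", "Robert"]
    return ["Alcie", "Robert"]


noisy, clean = generate_noisy_and_clean(fake_generate, FAKE_NO_NOISE, year=2020)
changed = sum(a != b for a, b in zip(noisy, clean))  # number of differing entries
```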

pseudopeople.generate_taxes_w2_and_1099(source=None, seed=0, config=None, year=2020, state=None, verbose=False, engine='pandas')

Generates a pseudopeople W2 and 1099 tax dataset which represents simulated tax form data.

Parameters:

source (Path | str) – The root directory containing pseudopeople simulated population data. Defaults to using the included sample population when source is None.
seed (int) – An integer seed for randomness. Defaults to 0.
config (Path | str | Dict[str, Dict]) – An optional override to the default configuration. Can be a path to a configuration YAML file, a configuration dictionary, or the sentinel value pseudopeople.NO_NOISE, which will generate a dataset without any configurable noise.
year (int | None) – The tax year for which to generate records (format YYYY, e.g., 2036); the simulated dataset will contain the W2 & 1099 tax forms filed by simulated employers for the specified year. Default is 2020. If None is passed instead, data for all available years are included in the returned dataset.
state (str | None) – The US state for which to generate tax records from the simulated population, or None (default) to generate data for all available US states. The returned dataset will contain W2 & 1099 tax forms filed for simulants living in the specified state during the specified tax year. Can be a full state name or a state abbreviation (e.g., “Ohio” or “OH”).
verbose (bool) – Enable verbose logging if True. Default is False.
engine (Literal['pandas', 'dask']) – Engine to use for loading data; determines the return type. The default, “pandas”, returns a pandas DataFrame. “dask” returns a Dask DataFrame and requires Dask to be installed (e.g., pip install pseudopeople[dask]); it runs the dataset generation on a Dask cluster, which can parallelize the work and run out-of-core.

Returns:

A DataFrame of simulated W2 and 1099 tax data.

Raises:

ConfigurationError – An invalid config is provided.
DataSourceError – An invalid pseudopeople simulated population data source is provided.
ValueError – The simulated population has no data for this dataset in the specified year or state.

Return type:

DataFrame

pseudopeople.generate_women_infants_and_children(source=None, seed=0, config=None, year=2020, state=None, verbose=False, engine='pandas')

Generates a pseudopeople WIC dataset which represents a simulated version of the administrative data that would be recorded by WIC. This is a yearly file of information about all simulants enrolled in the program as of the end of that year.

The Special Supplemental Nutrition Program for Women, Infants, and Children (WIC) is a government benefits program designed to support mothers and young children. The main qualifications are income and the presence of young children in the home.

Parameters:

source (Path | str) – The root directory containing pseudopeople simulated population data. Defaults to using the included sample population when source is None.
seed (int) – An integer seed for randomness. Defaults to 0.
config (Path | str | Dict[str, Dict]) – An optional override to the default configuration. Can be a path to a configuration YAML file, a configuration dictionary, or the sentinel value pseudopeople.NO_NOISE, which will generate a dataset without any configurable noise.
year (int | None) – The year for which to generate WIC administrative records (format YYYY, e.g., 2036); the simulated dataset will contain records for simulants enrolled in WIC at the end of the specified year (or on May 1, 2041 if year=2041 since that is the end date of the simulation). Default is 2020. If None is passed instead, data for all available years are included in the returned dataset.
state (str | None) – The US state for which to generate WIC administrative records from the simulated population, or None (default) to generate data for all available US states. The returned dataset will contain records for enrolled simulants living in the specified state at the end of the specified year (or on May 1, 2041 if year=2041 since that is the end date of the simulation). Can be a full state name or a state abbreviation (e.g., “Ohio” or “OH”).
verbose (bool) – Enable verbose logging if True. Default is False.
engine (Literal['pandas', 'dask']) – Engine to use for loading data; determines the return type. The default, “pandas”, returns a pandas DataFrame. “dask” returns a Dask DataFrame and requires Dask to be installed (e.g., pip install pseudopeople[dask]); it runs the dataset generation on a Dask cluster, which can parallelize the work and run out-of-core.

Returns:

A DataFrame of simulated WIC data.

Raises:

ConfigurationError – An invalid config is provided.
DataSourceError – An invalid pseudopeople simulated population data source is provided.
ValueError – The simulated population has no data for this dataset in the specified year or state.

Return type:

DataFrame
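The year and state descriptions above imply a reference date for each WIC file: the end of the specified year, or May 1, 2041 when year=2041. The helper below is a hypothetical sketch of that rule, not part of pseudopeople. It assumes "end of the year" means December 31 and does not check whether the simulated population has data for the requested year.

```python
from datetime import date

# End date of the simulation, as stated in the year parameter description.
SIMULATION_END = date(2041, 5, 1)


def wic_reference_date(year):
    """Return the date as of which WIC enrollment is recorded for the given year."""
    if year == SIMULATION_END.year:
        return SIMULATION_END
    return date(year, 12, 31)


typical = wic_reference_date(2020)
final = wic_reference_date(2041)
```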