.. _tutorial_configuring_noise:

=================
Configuring Noise
=================

In this tutorial, we will walk through an example of how to customize the amount of noise
in a simulated dataset generated by pseudopeople.
If you haven't already used pseudopeople to generate a dataset with the default settings,
follow the :ref:`quickstart` before continuing with this tutorial.

The problem of fake names
-------------------------

Sometimes when people respond to a survey, they don't want to share their personal information.
If the survey (whether online, on paper, or in person) requires a response, they might just make something up.
pseudopeople has a noise type to simulate these sorts of responses for first and/or last names:
:ref:`"Use a fake name." `

With pseudopeople's default settings, this happens just 1% of the time.
But let's say we're concerned it will happen more often in the future,
and we want to see how robust our entity resolution methods are to this issue.

Generating a simulated Current Population Survey
------------------------------------------------

Let's generate some 2025 Current Population Survey (CPS) data in a future scenario where
30% of people use fake last names on the survey.
To do this, we will pass some **configuration** to the
:func:`pseudopeople.generate_current_population_survey` function.

.. code-block:: python

   import pseudopeople as psp

   cps_2025 = psp.generate_current_population_survey(year=2025, config={
       'current_population_survey': {
           'column_noise': {
               'last_name': {
                   'use_fake_name': {
                       'cell_probability': 0.3,
                   },
               },
           },
       },
   })

In English, our configuration says: in the **CPS** dataset, for the **last name** column,
the **fake name** noise type's **cell probability** parameter should be 0.3 (30%).
The full set of parameters for each noise type is documented at :ref:`Noise Type Details `.
The :code:`column_noise` key specifies that we are configuring column-based noise;
the categories of noise are explained in more detail on the :ref:`noise page `.

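
The nesting in the configuration above follows a fixed path: dataset, then noise category,
then column, then noise type, then parameter.
As a rough sketch of that structure, a small helper can build the nesting from a flat list of settings.
The helper name :code:`build_column_noise_config` is hypothetical and is not part of pseudopeople;
it only produces a plain dictionary in the shape shown above.

.. code-block:: python

   def build_column_noise_config(dataset, settings):
       """Build a nested config dict from (column, noise_type, parameter, value) tuples.

       Hypothetical helper for illustration only; not part of pseudopeople.
       """
       column_noise = {}
       for column, noise_type, parameter, value in settings:
           # Create each level of nesting on demand, then set the parameter value.
           column_noise.setdefault(column, {}).setdefault(noise_type, {})[parameter] = value
       return {dataset: {'column_noise': column_noise}}


   config = build_column_noise_config(
       'current_population_survey',
       [('last_name', 'use_fake_name', 'cell_probability', 0.3)],
   )

The resulting dictionary has the same shape as the one written out by hand,
so it could be passed as the :code:`config` argument in the same way.
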
Let's take a look at the names in our generated CPS dataset:

.. code-block:: pycon

   >>> cps_2025[['first_name', 'middle_initial', 'last_name']]
       first_name  middle_initial          last_name
   0      Gregory               J               Four
   1    Bridgette               J            Kennedy
   2      William               G           Phillips
   3      Valerie               M               Male
   4        Molly               A            Wheeler
   5       Thomas               K             Eastep
   6      Kenneth               C             Harper
   7       Daniel               M             Harper
   8        Susan               M              Adult
   9      Dorothy               P             Gaytan
   10       Daisy               R           Williams
   11        Mark               T               Rock
   12     Mohamed               C             Person
   13     Giselle               L              Weber
   14        Jean               F              Stull
   15        Lila               L                  C
   16       Carli               G            Mckamey
   17      Justin               B                  E
   18         Ana               S  Davidson Granados
   19        Rose               K           Carrillo
   20      Nayeli               A           Carrillo
   21      Robert               O           Carrillo
   22      Emilio               P           Carrillo
   23       Mindy               K             Walton
   24         Lee               J          Household
   25   Janautica               K            Clapper
   26      Tanner               M            Clapper
   27       Helen               K        Of The Home
   28       Tonya               A                  Y
   29       Chris               C            Frazier
   30      Arnold               J             Friend
   31    Patricia               C               Wife
   32      Jessie               L             Madden
   33       Laura               V        Fortenberry
   34        Zaid               B        Fortenberry
   35       Sammy               R           Mcfarlan

As expected, we see a number of strings in the last name column that are unlikely to be true last names.
We can check exactly which ones are fake names by comparing to the same dataset without fake name noise
in the last name column.
For brevity, we do not show the steps to do this here, but we would find that there are
eleven such strings, which is almost exactly 30% of our 36 respondents.

.. Code to do this:
   cps_2025_nonoise = psp.generate_current_population_survey(year=2025, config=psp.NO_NOISE)
   difference = cps_2025.set_index('simulant_id').pipe(lambda s: s.last_name != cps_2025_nonoise.set_index('simulant_id').loc[s.index].last_name)
   print(f'there are {difference.sum()} such strings, {difference.mean():.5%} of our {len(cps_2025)} respondents')

Increasing noise in first names
-------------------------------

Imagine we also want to increase the probability of a fake first name from its default of 1%.
We can do this by modifying the configuration dictionary.
This time, we'll save the configuration dictionary to a variable before using it to generate the dataset:

.. code-block:: python

   config = {
       'current_population_survey': {
           'column_noise': {
               'last_name': {
                   'use_fake_name': {
                       'cell_probability': 0.3,
                   },
               },
               'first_name': {
                   'use_fake_name': {
                       'cell_probability': 0.2,
                   },
               },
           },
       },
   }
   cps_2025 = psp.generate_current_population_survey(year=2025, config=config)

By specifying multiple keys within :code:`column_noise`, we are able to independently adjust
noise settings for different columns.
Here we have set the probability of a fake first name to 0.2 (20%) while retaining the
0.3 (30%) probability of a fake last name.

Let's see how our CPS data look now:

.. code-block:: pycon

   >>> cps_2025[['first_name', 'middle_initial', 'last_name']]
          first_name  middle_initial          last_name
   0         Gregory               J               Four
   1       Bridgette               J            Kennedy
   2         William               G           Phillips
   3         Valerie               M               Male
   4           Molly               A            Wheeler
   5          Thomas               K             Eastep
   6      Man In The               C             Harper
   7             Man               M             Harper
   8               R               M              Adult
   9   Granddaughter               P             Gaytan
   10              G               R           Williams
   11           Mark               T               Rock
   12           Girl               C             Person
   13        Giselle               L              Weber
   14         Friend               F              Stull
   15           Lila               L                  C
   16          Carli               G            Mckamey
   17         Justin               B                  E
   18            Ana               S  Davidson Granados
   19           Rose               K           Carrillo
   20         Son Of               A           Carrillo
   21         Sister               O           Carrillo
   22         Emilio               P           Carrillo
   23          Mindy               K             Walton
   24            Lee               J          Household
   25      Janautica               K            Clapper
   26         Tanner               M            Clapper
   27          Helen               K        Of The Home
   28          Tonya               A                  Y
   29          Chris               C            Frazier
   30         Arnold               J             Friend
   31        Brother               C               Wife
   32          House               L             Madden
   33              T               V        Fortenberry
   34           Zaid               B        Fortenberry
   35              H               R          Mcfarlane

Here we see that 13 respondents have used fake first names.
Why aren't there closer to :math:`0.2 \times 36 = 7.2` respondents with fake first names?
Remember that the parameter we set was called cell **probability** -- there is randomness involved
in whether or not each cell in the column actually receives noise.

.. code to check which respondents used fake first names:
   cps_2025_nonoise = psp.generate_current_population_survey(year=2025, config=psp.NO_NOISE)
   difference = cps_2025.set_index('simulant_id').pipe(lambda s: s.first_name != cps_2025_nonoise.set_index('simulant_id').loc[s.index].first_name)
   cps_2025[cps_2025.simulant_id.isin(difference[difference].index)][['first_name', 'middle_initial', 'last_name']]

An alternate format for configuration
-------------------------------------

It is also possible to specify configuration in a YAML file.
The file equivalent to our final configuration above would be:

.. literalinclude:: configuration_example.yaml
   :caption:

If :code:`configuration_example.yaml` is in the current working directory,
it can be used like so:

.. code-block:: python

   cps_2025 = psp.generate_current_population_survey(year=2025, config='configuration_example.yaml')

For more on configuration, see the :ref:`Configuration page `.
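
Because the YAML example relies on the current working directory, a relative path can fail
when a script is run from somewhere else.
One possible approach, sketched here using only the standard library, is to resolve the file
against a known directory and confirm it exists before generating data.
The function name :code:`resolve_config_path` is hypothetical and not part of pseudopeople,
and the commented-out call assumes pseudopeople has been imported as :code:`psp`.

.. code-block:: python

   from pathlib import Path


   def resolve_config_path(filename, base_dir=None):
       """Return an absolute path string to a config file, or raise if it is missing.

       Hypothetical helper for illustration only; not part of pseudopeople.
       """
       # Fall back to the current working directory when no base directory is given.
       base = Path(base_dir) if base_dir is not None else Path.cwd()
       path = (base / filename).resolve()
       if not path.is_file():
           raise FileNotFoundError(f'No configuration file at {path}')
       return str(path)


   # Example usage (assumes configuration_example.yaml exists in the working directory):
   # config_path = resolve_config_path('configuration_example.yaml')
   # cps_2025 = psp.generate_current_population_survey(year=2025, config=config_path)

Failing early with a clear message makes a missing or misplaced file easier to diagnose
than an error raised partway through dataset generation.
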