.. _tutorial_configuring_noise:

=================
Configuring Noise
=================

In this tutorial, we will walk through an example of how to customize the amount of noise in
a simulated dataset generated by pseudopeople.

If you haven't already used pseudopeople to generate a dataset with the default settings,
follow the :ref:`quickstart` before continuing with this tutorial.

The problem of fake names
-------------------------

Sometimes when people respond to a survey, they don't want to share their personal information.
If the survey (whether online, on paper, or in person) requires a response, they might just make
something up.

pseudopeople has a noise type to simulate these sorts of responses for first and/or last names:
:ref:`"Use a fake name." <use_a_fake_name>`
With pseudopeople's default settings, this happens just 1% of the time.
But let's say we're concerned it will happen more often in the future, and we want to see how
robust our entity resolution methods are to this issue.

Generating a simulated Current Population Survey
------------------------------------------------

Let's generate some 2025 Current Population Survey (CPS) data in a future scenario where 30% of people use fake last names on the survey.
To do this, we will pass some **configuration** to the :func:`pseudopeople.generate_current_population_survey` function.

.. code-block:: python

    import pseudopeople as psp

    cps_2025 = psp.generate_current_population_survey(year=2025, config={
        'current_population_survey': {
            'column_noise': {
                'last_name': {
                    'use_fake_name': {
                        'cell_probability': 0.3,
                    },
                },
            },
        },
    })

In English, our configuration says: in the **CPS** dataset, for the **last name** column, the **fake name** noise type's **cell probability**
parameter should be 0.3 (30%).
The full set of parameters for each noise type is documented at :ref:`Noise Type Details <noise_type_details>`.
The :code:`column_noise` key specifies that we are configuring column-based noise; the categories of noise are explained in more detail
on the :ref:`noise page <noise_main>`.
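
To make that reading concrete, here is a minimal sketch in plain Python that walks a nested configuration
dictionary and prints each setting as a path of keys.
The helper name :code:`iter_settings` is hypothetical and is not part of pseudopeople;
it relies only on the dictionary structure shown above.

.. code-block:: python

    def iter_settings(config, path=()):
        """Yield (path, value) pairs for every leaf value in a nested dict."""
        for key, value in config.items():
            if isinstance(value, dict):
                # Descend one level, remembering the keys seen so far.
                yield from iter_settings(value, path + (key,))
            else:
                yield path + (key,), value

    config = {
        'current_population_survey': {
            'column_noise': {
                'last_name': {
                    'use_fake_name': {
                        'cell_probability': 0.3,
                    },
                },
            },
        },
    }

    settings = list(iter_settings(config))
    for path, value in settings:
        print(' > '.join(path), '=', value)

Each printed path lists the dataset, the noise category, the column, the noise type, and the parameter, in that order.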

Let's take a look at the names in our generated CPS dataset:

.. code-block:: pycon

    >>> cps_2025[['first_name', 'middle_initial', 'last_name']]
       first_name middle_initial          last_name
    0     Gregory              J               Four
    1   Bridgette              J            Kennedy
    2     William              G           Phillips
    3     Valerie              M               Male
    4       Molly              A            Wheeler
    5      Thomas              K             Eastep
    6     Kenneth              C             Harper
    7      Daniel              M             Harper
    8       Susan              M              Adult
    9     Dorothy              P             Gaytan
    10      Daisy              R           Williams
    11       Mark              T               Rock
    12    Mohamed              C             Person
    13    Giselle              L              Weber
    14       Jean              F              Stull
    15       Lila              L                  C
    16      Carli              G            Mckamey
    17     Justin              B                  E
    18        Ana              S  Davidson Granados
    19       Rose              K           Carrillo
    20     Nayeli              A           Carrillo
    21     Robert              O           Carrillo
    22     Emilio              P           Carrillo
    23      Mindy              K             Walton
    24        Lee              J          Household
    25  Janautica              K            Clapper
    26     Tanner              M            Clapper
    27      Helen              K        Of The Home
    28      Tonya              A                  Y
    29      Chris              C            Frazier
    30     Arnold              J             Friend
    31   Patricia              C               Wife
    32     Jessie              L             Madden
    33      Laura              V        Fortenberry
    34       Zaid              B        Fortenberry
    35      Sammy              R          Mcfarlane

As expected, we see a number of strings in the last name column that are unlikely to be true last names.
We can check exactly which ones are fake names by comparing to the same dataset without fake name noise in
the last name column.
For brevity, we do not show the steps to do this here, but
we would find that there are eleven such strings: about 31% of our 36 respondents, close to the 30% we configured.

..
    Code to do this:
    cps_2025_nonoise = psp.generate_current_population_survey(year=2025, config=psp.NO_NOISE)
    difference = cps_2025.set_index('simulant_id').pipe(lambda s: s.last_name != cps_2025_nonoise.set_index('simulant_id').loc[s.index].last_name)
    print(f'there are {difference.sum()} such strings, {difference.mean():.5%} of our {len(cps_2025)} respondents')

Increasing noise in first names
-------------------------------

Imagine we also want to increase the probability of a fake first name from its default of 1%.
We can do this by modifying the configuration dictionary. This time, we'll save the configuration dictionary to a variable before using it to generate the dataset:

.. code-block:: python

    config = {
        'current_population_survey': {
            'column_noise': {
                'last_name': {
                    'use_fake_name': {
                        'cell_probability': 0.3,
                    },
                },
                'first_name': {
                    'use_fake_name': {
                        'cell_probability': 0.2,
                    },
                },
            },
        },
    }
    cps_2025 = psp.generate_current_population_survey(year=2025, config=config)
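
Because the last-name settings are unchanged, the same dictionary could also be derived from an earlier one
rather than retyped.
The following is a minimal sketch using only the Python standard library; the variable name
:code:`last_name_only` is illustrative and does not come from pseudopeople.

.. code-block:: python

    import copy

    # Configuration from the previous section: fake last names only.
    last_name_only = {
        'current_population_survey': {
            'column_noise': {
                'last_name': {
                    'use_fake_name': {'cell_probability': 0.3},
                },
            },
        },
    }

    # Deep-copy so that adding a column does not modify the original dictionary.
    config = copy.deepcopy(last_name_only)
    config['current_population_survey']['column_noise']['first_name'] = {
        'use_fake_name': {'cell_probability': 0.2},
    }

The resulting :code:`config` has the same contents as the dictionary written out in full above.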

By specifying multiple keys within :code:`column_noise`, we are able to independently adjust noise settings
for different columns.
Here we have set the probability of a fake first name to 0.2 (20%) while retaining the 0.3 (30%) probability
of a fake last name.
Let's see how our CPS data look now:

.. code-block:: pycon

    >>> cps_2025[['first_name', 'middle_initial', 'last_name']]
           first_name middle_initial          last_name
    0         Gregory              J               Four
    1       Bridgette              J            Kennedy
    2         William              G           Phillips
    3         Valerie              M               Male
    4           Molly              A            Wheeler
    5          Thomas              K             Eastep
    6      Man In The              C             Harper
    7             Man              M             Harper
    8               R              M              Adult
    9   Granddaughter              P             Gaytan
    10              G              R           Williams
    11           Mark              T               Rock
    12           Girl              C             Person
    13        Giselle              L              Weber
    14         Friend              F              Stull
    15           Lila              L                  C
    16          Carli              G            Mckamey
    17         Justin              B                  E
    18            Ana              S  Davidson Granados
    19           Rose              K           Carrillo
    20         Son Of              A           Carrillo
    21         Sister              O           Carrillo
    22         Emilio              P           Carrillo
    23          Mindy              K             Walton
    24            Lee              J          Household
    25      Janautica              K            Clapper
    26         Tanner              M            Clapper
    27          Helen              K        Of The Home
    28          Tonya              A                  Y
    29          Chris              C            Frazier
    30         Arnold              J             Friend
    31        Brother              C               Wife
    32          House              L             Madden
    33              T              V        Fortenberry
    34           Zaid              B        Fortenberry
    35              H              R          Mcfarlane

Here we see that 13 respondents have used fake first names.
Why aren't there closer to :math:`0.2 \times 36 = 7.2` respondents with fake first names?
Remember that the parameter we set was called cell **probability** -- there is randomness
involved in whether or not each cell in the column actually receives noise.
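
To get a feel for how much the count can vary, here is a small simulation using only Python's standard library.
It assumes each of the 36 cells is noised independently with probability 0.2, which is a simplification
for illustration and not a description of how pseudopeople applies noise internally.
The function name and the seed are arbitrary choices.

.. code-block:: python

    import math
    import random

    def simulate_noised_counts(n_cells, cell_probability, n_trials, seed=0):
        """Count how many of n_cells receive noise in each of n_trials runs."""
        rng = random.Random(seed)
        return [
            sum(rng.random() < cell_probability for _ in range(n_cells))
            for _ in range(n_trials)
        ]

    n_cells, cell_probability = 36, 0.2
    expected = n_cells * cell_probability  # 7.2
    # Standard deviation of a binomial count: sqrt(n * p * (1 - p)), 2.4 here.
    std_dev = math.sqrt(n_cells * cell_probability * (1 - cell_probability))

    counts = simulate_noised_counts(n_cells, cell_probability, n_trials=1000)
    print(f'expected {expected:.1f}, standard deviation {std_dev:.1f}')
    print(f'simulated counts ranged from {min(counts)} to {max(counts)}')

Under this simplified model, counts a few respondents above or below 7.2 are routine in a dataset this small,
and the spread relative to the number of rows shrinks as the dataset grows.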

..
    code to check which respondents used fake first names:

    cps_2025_nonoise = psp.generate_current_population_survey(year=2025, config=psp.NO_NOISE)
    difference = cps_2025.set_index('simulant_id').pipe(lambda s: s.first_name != cps_2025_nonoise.set_index('simulant_id').loc[s.index].first_name)
    cps_2025[cps_2025.simulant_id.isin(difference[difference].index)][['first_name', 'middle_initial', 'last_name']]

An alternate format for configuration
-------------------------------------

It is also possible to specify configuration in a YAML file.
The file equivalent to our final configuration above would be:

.. literalinclude:: configuration_example.yaml
    :caption:

If :code:`configuration_example.yaml` is in the current working directory,
it can be used like so:

.. code-block:: python

    cps_2025 = psp.generate_current_population_survey(year=2025, config='configuration_example.yaml')

For more on configuration, see the :ref:`Configuration page <configuration_main>`.