Configuring Noise

In this tutorial, we will walk through an example of how to customize the amount of noise in a simulated dataset generated by pseudopeople.

If you haven’t already used pseudopeople to generate a dataset with the default settings, follow the Quickstart before continuing with this tutorial.

The problem of fake names

Sometimes when people respond to a survey, they don’t want to share their personal information. If the survey (whether online, on paper, or in person) requires a response, they might just make something up.

pseudopeople has a noise type to simulate these sorts of responses for first and/or last names: “Use a fake name.” With pseudopeople’s default settings, this happens just 1% of the time. But let’s say we’re concerned it will happen more often in the future, and we want to see how robust our entity resolution methods are to this issue.

Generating a simulated Current Population Survey

Let’s generate some 2025 Current Population Survey (CPS) data in a future scenario where 30% of people use fake last names on the survey. To do this, we will pass some configuration to the pseudopeople.generate_current_population_survey() function.

import pseudopeople as psp

cps_2025 = psp.generate_current_population_survey(year=2025, config={
    'current_population_survey': {
        'column_noise': {
            'last_name': {
                'use_fake_name': {
                    'cell_probability': 0.3,
                },
            },
        },
    },
})

In English, our configuration says: in the CPS dataset, for the last name column, the fake name noise type’s cell probability parameter should be 0.3 (30%). The full set of parameters for each noise type is documented at Noise Type Details. The column_noise key specifies that we are configuring column-based noise; the categories of noise are explained in more detail on the noise page.
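
To make the nesting concrete, here is a minimal sketch that builds the same dictionary programmatically and then reads the value back along the path described above. The helper name `column_noise_config` is hypothetical and is not part of pseudopeople; only the key names are taken from the configuration shown earlier.

```python
def column_noise_config(dataset, column, noise_type, **parameters):
    """Build a nested config: dataset -> column_noise -> column -> noise type -> parameters.

    Hypothetical helper for illustration only; it is not part of pseudopeople.
    """
    return {dataset: {"column_noise": {column: {noise_type: dict(parameters)}}}}


config = column_noise_config(
    "current_population_survey",
    "last_name",
    "use_fake_name",
    cell_probability=0.3,
)

# Reading the value back follows the same path as the English description:
# dataset, then noise category, then column, then noise type, then parameter.
probability = (
    config["current_population_survey"]["column_noise"]["last_name"]
    ["use_fake_name"]["cell_probability"]
)
print(probability)
```

The resulting dictionary is equal to the literal one passed to the function above, so it could be supplied through the same config argument.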

Let’s take a look at the names in our generated CPS dataset:

>>> cps_2025[['first_name', 'middle_initial', 'last_name']]
   first_name middle_initial          last_name
0     Gregory              J               Four
1   Bridgette              J            Kennedy
2     William              G           Phillips
3     Valerie              M               Male
4       Molly              A            Wheeler
5      Thomas              K             Eastep
6     Kenneth              C             Harper
7      Daniel              M             Harper
8       Susan              M              Adult
9     Dorothy              P             Gaytan
10      Daisy              R           Williams
11       Mark              T               Rock
12    Mohamed              C             Person
13    Giselle              L              Weber
14       Jean              F              Stull
15       Lila              L                  C
16      Carli              G            Mckamey
17     Justin              B                  E
18        Ana              S  Davidson Granados
19       Rose              K           Carrillo
20     Nayeli              A           Carrillo
21     Robert              O           Carrillo
22     Emilio              P           Carrillo
23      Mindy              K             Walton
24        Lee              J          Household
25  Janautica              K            Clapper
26     Tanner              M            Clapper
27      Helen              K        Of The Home
28      Tonya              A                  Y
29      Chris              C            Frazier
30     Arnold              J             Friend
31   Patricia              C               Wife
32     Jessie              L             Madden
33      Laura              V        Fortenberry
34       Zaid              B        Fortenberry
35      Sammy              R          Mcfarlane

As expected, we see a number of strings in the last name column that are unlikely to be true last names. We can check exactly which ones are fake names by comparing to the same dataset without fake name noise in the last name column. For brevity, we do not show the steps to do this here, but we would find that there are 11 such strings, or about 31% of our 36 respondents, close to the 30% we configured.
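
One way such a comparison could be sketched is shown below. This is not the procedure used to produce the count above. The function name `changed_cells` is hypothetical, and the reference values are invented placeholders rather than real pseudopeople output. The sketch assumes both versions of the column contain the same respondents in the same order, which would need to be verified for real data. It also counts any difference, whichever noise type caused it.

```python
def changed_cells(noised, reference):
    """Return (position, noised value, reference value) for every cell that differs."""
    if len(noised) != len(reference):
        raise ValueError("both columns must have the same number of rows")
    return [
        (i, n, r)
        for i, (n, r) in enumerate(zip(noised, reference))
        if n != r
    ]


# First five last names from the table above.
noised_last_names = ["Four", "Kennedy", "Phillips", "Male", "Wheeler"]
# Invented placeholder values standing in for the same rows generated
# without fake-name noise; they are NOT real pseudopeople output.
reference_last_names = ["Placeholder A", "Kennedy", "Phillips", "Placeholder B", "Wheeler"]

differences = changed_cells(noised_last_names, reference_last_names)
share = len(differences) / len(noised_last_names)
print(f"{len(differences)} of {len(noised_last_names)} cells differ ({share:.0%})")
```

The function works on plain Python sequences; a pandas column could be passed after converting it with `.tolist()`.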

Increasing noise in first names

Imagine we also want to increase the probability of a fake first name from its default of 1%. We can do this by modifying the configuration dictionary. This time, we’ll save the configuration dictionary to a variable before using it to generate the dataset:

config = {
    'current_population_survey': {
        'column_noise': {
            'last_name': {
                'use_fake_name': {
                    'cell_probability': 0.3,
                },
            },
            'first_name': {
                'use_fake_name': {
                    'cell_probability': 0.2,
                },
            },
        },
    },
}
cps_2025 = psp.generate_current_population_survey(year=2025, config=config)

By specifying multiple keys within column_noise, we are able to independently adjust noise settings for different columns. Here we have set the probability of a fake first name to 0.2 (20%) while retaining the 0.3 (30%) probability of a fake last name. Let’s see how our CPS data look now:

>>> cps_2025[['first_name', 'middle_initial', 'last_name']]
       first_name middle_initial          last_name
0         Gregory              J               Four
1       Bridgette              J            Kennedy
2         William              G           Phillips
3         Valerie              M               Male
4           Molly              A            Wheeler
5          Thomas              K             Eastep
6      Man In The              C             Harper
7             Man              M             Harper
8               R              M              Adult
9   Granddaughter              P             Gaytan
10              G              R           Williams
11           Mark              T               Rock
12           Girl              C             Person
13        Giselle              L              Weber
14         Friend              F              Stull
15           Lila              L                  C
16          Carli              G            Mckamey
17         Justin              B                  E
18            Ana              S  Davidson Granados
19           Rose              K           Carrillo
20         Son Of              A           Carrillo
21         Sister              O           Carrillo
22         Emilio              P           Carrillo
23          Mindy              K             Walton
24            Lee              J          Household
25      Janautica              K            Clapper
26         Tanner              M            Clapper
27          Helen              K        Of The Home
28          Tonya              A                  Y
29          Chris              C            Frazier
30         Arnold              J             Friend
31        Brother              C               Wife
32          House              L             Madden
33              T              V        Fortenberry
34           Zaid              B        Fortenberry
35              H              R          Mcfarlane

Here we see that 13 respondents have used fake first names. Why aren’t there closer to 0.2 × 36 = 7.2 respondents with fake first names? Remember that the parameter we set is a cell probability: whether each cell in the column actually receives noise is decided at random, so the observed count in a small dataset can differ from the expected value.
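
How much variation to expect can be estimated with a short calculation. The sketch below assumes each of the 36 cells is noised independently with probability 0.2, which is a simplifying assumption about pseudopeople's behavior rather than something this tutorial states. Under that assumption the number of fake first names follows a binomial distribution. The code computes its mean and standard deviation, then simulates a few hypothetical datasets with Python's random module (not with pseudopeople itself).

```python
import math
import random

n_cells = 36            # respondents in the example dataset
cell_probability = 0.2  # configured probability of a fake first name

# Binomial mean and standard deviation under the independence assumption.
expected = n_cells * cell_probability
std_dev = math.sqrt(n_cells * cell_probability * (1 - cell_probability))
print(f"expected {expected:.1f} noised cells, standard deviation {std_dev:.1f}")

# Simulate several hypothetical datasets to see the spread directly.
rng = random.Random(0)  # fixed seed so the run is repeatable
counts = [
    sum(rng.random() < cell_probability for _ in range(n_cells))
    for _ in range(10)
]
print(counts)
```

With a mean of 7.2 and a standard deviation of 2.4, counts a few cells above or below the expected value are unsurprising in a dataset this small.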

An alternate format for configuration

It is also possible to specify configuration in a YAML file. The file equivalent to our final configuration above would be:

configuration_example.yaml
current_population_survey:
  column_noise:
    last_name:
      use_fake_name:
        cell_probability: 0.3
    first_name:
      use_fake_name:
        cell_probability: 0.2

If configuration_example.yaml is in the current working directory, it can be used like so:

cps_2025 = psp.generate_current_population_survey(year=2025, config='configuration_example.yaml')

For more on configuration, see the Configuration page.