Configuring Noise

In this tutorial, we will walk through an example of how to customize the amount of noise in a simulated dataset generated by pseudopeople.

If you haven’t already used pseudopeople to generate a dataset with the default settings, follow the Quickstart before continuing with this tutorial.

The problem of fake names

Sometimes when people respond to a survey, they don’t want to share their personal information. If the survey (whether online, on paper, or in person) requires a response, they might just make something up.

pseudopeople has a noise type to simulate these sorts of responses for first and/or last names: “Use a fake name.” With pseudopeople’s default settings, this happens just 1% of the time. But let’s say we’re concerned it will happen more often in the future, and we want to see how robust our entity resolution methods are to this issue.

Generating a simulated Current Population Survey

Let’s generate some 2025 Current Population Survey (CPS) data in a future scenario where 30% of people use fake last names on the survey. To do this, we will pass some configuration to the pseudopeople.generate_current_population_survey() function.

import pseudopeople as psp

cps_2025 = psp.generate_current_population_survey(year=2025, config={
    'current_population_survey': {
        'column_noise': {
            'last_name': {
                'use_fake_name': {
                    'cell_probability': 0.3,
                },
            },
        },
    },
})

In English, our configuration says: in the CPS dataset, for the last name column, the fake name noise type’s cell probability parameter should be 0.3 (30%). The full set of parameters for each noise type is documented at Noise Type Details. The column_noise key specifies that we are configuring column-based noise; the categories of noise are explained in more detail on the noise page.
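The nesting described above (dataset, then noise category, then column, then noise type, then parameter) can be sketched as a small helper. The function name column_noise_config is hypothetical and is not part of pseudopeople; only the key names are taken from the example above.

```python
def column_noise_config(dataset, column, noise_type, parameter, value):
    """Build the nested configuration dict for a single column-noise setting.

    Hypothetical convenience helper (not part of pseudopeople); it only
    reproduces the nesting used in the example above.
    """
    return {dataset: {"column_noise": {column: {noise_type: {parameter: value}}}}}


config = column_noise_config(
    "current_population_survey",  # dataset
    "last_name",                  # column
    "use_fake_name",              # noise type
    "cell_probability",           # parameter
    0.3,                          # value
)
print(config)
```

The resulting dictionary has the same content as the literal one passed as config above.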

Let’s take a look at the names in our generated CPS dataset:

>>> cps_2025[['first_name', 'middle_initial', 'last_name']]
first_name middle_initial          last_name
   Gregory              J               Four
 Bridgette              J            Kennedy
   William              G           Phillips
   Valerie              M               Male
     Molly              A            Wheeler
    Thomas              K             Eastep
   Kenneth              C             Harper
    Daniel              M             Harper
     Susan              M              Adult
   Dorothy              P             Gaytan
     Daisy              R           Williams
      Mark              T               Rock
   Mohamed              C             Person
   Giselle              L              Weber
      Jean              F              Stull
      Lila              L                  C
     Carli              G            Mckamey
    Justin              B                  E
       Ana              S  Davidson Granados
      Rose              K           Carrillo
    Nayeli              A           Carrillo
    Robert              O           Carrillo
    Emilio              P           Carrillo
     Mindy              K             Walton
       Lee              J          Household
 Janautica              K            Clapper
    Tanner              M            Clapper
     Helen              K        Of The Home
     Tonya              A                  Y
     Chris              C            Frazier
    Arnold              J             Friend
  Patricia              C               Wife
    Jessie              L             Madden
     Laura              V        Fortenberry
      Zaid              B        Fortenberry
     Sammy              R          Mcfarlane

As expected, we see a number of strings in the last name column that are unlikely to be true last names. We can check exactly which ones are fake names by comparing to the same dataset without fake name noise in the last name column. For brevity, we do not show the steps to do this here, but we would find that there are eleven such strings, or about 30.6% of our 36 respondents, which is close to the 30% we configured.
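One way to make that comparison, sketched under assumptions: generate a second dataset with the same random seed but with the last name use_fake_name cell_probability set to 0, and assume the rows of the two datasets line up one-to-one. Neither assumption is confirmed by this tutorial. The helper differing_rows below is hypothetical, and the two name lists are made-up toy data standing in for the two last_name columns.

```python
def differing_rows(noised, clean):
    """Return the positions at which two equally long sequences differ."""
    noised, clean = list(noised), list(clean)
    if len(noised) != len(clean):
        raise ValueError("sequences must have the same length to compare row by row")
    return [i for i, (a, b) in enumerate(zip(noised, clean)) if a != b]


# Made-up toy data, NOT taken from a real pseudopeople run.
noised_last_names = ["Adams", "Household", "Baker", "Y"]
clean_last_names = ["Adams", "Clark", "Baker", "Young"]

changed = differing_rows(noised_last_names, clean_last_names)
share_changed = len(changed) / len(noised_last_names)
print(changed)        # positions whose value differs between the two lists
print(share_changed)  # fraction of rows that differ
```

Because the helper only iterates over its inputs, it should also accept the last_name columns of two DataFrames directly, as long as their rows are in the same order.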

Increasing noise in first names

Imagine we also want to increase the probability of a fake first name from its default of 1%. We can do this by modifying the configuration dictionary. This time, we’ll save the configuration dictionary to a variable before using it to generate the dataset:

config = {
    'current_population_survey': {
        'column_noise': {
            'last_name': {
                'use_fake_name': {
                    'cell_probability': 0.3,
                },
            },
            'first_name': {
                'use_fake_name': {
                    'cell_probability': 0.2,
                },
            },
        },
    },
}
cps_2025 = psp.generate_current_population_survey(year=2025, config=config)

By specifying multiple keys within column_noise, we are able to independently adjust noise settings for different columns. Here we have set the probability of a fake first name to 0.2 (20%) while retaining the 0.3 (30%) probability of a fake last name. Let’s see how our CPS data look now:
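If a last-name-only configuration is already stored in a variable, the first-name setting can be layered on top of it rather than retyping the whole dictionary. The sketch below uses only the Python standard library; the variable names last_name_only and both_names are illustrative.

```python
import copy

last_name_only = {
    "current_population_survey": {
        "column_noise": {
            "last_name": {"use_fake_name": {"cell_probability": 0.3}},
        },
    },
}

# Deep-copy first so that editing the new config leaves the original untouched.
both_names = copy.deepcopy(last_name_only)
both_names["current_population_survey"]["column_noise"]["first_name"] = {
    "use_fake_name": {"cell_probability": 0.2},
}

print(both_names)
```

Here both_names has the same content as the config dictionary above, while last_name_only still describes the earlier scenario.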

>>> cps_2025[['first_name', 'middle_initial', 'last_name']]
    first_name middle_initial          last_name
       Gregory              J               Four
     Bridgette              J            Kennedy
       William              G           Phillips
       Valerie              M               Male
         Molly              A            Wheeler
        Thomas              K             Eastep
    Man In The              C             Harper
           Man              M             Harper
             R              M              Adult
 Granddaughter              P             Gaytan
             G              R           Williams
          Mark              T               Rock
          Girl              C             Person
       Giselle              L              Weber
        Friend              F              Stull
          Lila              L                  C
         Carli              G            Mckamey
        Justin              B                  E
           Ana              S  Davidson Granados
          Rose              K           Carrillo
        Son Of              A           Carrillo
        Sister              O           Carrillo
        Emilio              P           Carrillo
         Mindy              K             Walton
           Lee              J          Household
     Janautica              K            Clapper
        Tanner              M            Clapper
         Helen              K        Of The Home
         Tonya              A                  Y
         Chris              C            Frazier
        Arnold              J             Friend
       Brother              C               Wife
         House              L             Madden
             T              V        Fortenberry
          Zaid              B        Fortenberry
             H              R          Mcfarlane

Here we see that 13 respondents have used fake first names. Why aren’t there closer to \(0.2 \times 36 = 7.2\) respondents with fake first names? Remember that the parameter we set is called cell probability: each cell in the column receives noise at random with that probability, so 7.2 is only the expected count, and in a dataset of just 36 rows the actual count can land well above or below it.
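To get a feel for how much variation to expect, the count of noised cells can be modeled as a binomial random variable. This is a simplifying assumption made here for illustration (all 36 cells eligible, each noised independently with probability 0.2); the tutorial does not state that pseudopeople works exactly this way.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 36, 0.2  # number of rows and configured cell probability

expected = n * p                      # mean of the binomial distribution
std_dev = math.sqrt(n * p * (1 - p))  # its standard deviation

# Chance of seeing at least 13 noised cells under the binomial assumption.
prob_13_or_more = sum(binomial_pmf(k, n, p) for k in range(13, n + 1))

print(f"expected count: {expected:.1f}")
print(f"standard deviation: {std_dev:.1f}")
print(f"P(count >= 13): {prob_13_or_more:.3f}")
```

Under this model the standard deviation is 2.4, so a count of 13 sits a little more than two standard deviations above the expected 7.2: unusual, but well within what random variation can produce in a sample this small.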

An alternate format for configuration

It is also possible to specify configuration in a YAML file. The file equivalent to our final configuration above would be:

configuration_example.yaml

current_population_survey:
  column_noise:
    last_name:
      use_fake_name:
        cell_probability: 0.3
    first_name:
      use_fake_name:
        cell_probability: 0.2

If configuration_example.yaml is in the current working directory, it can be used like so:

cps_2025 = psp.generate_current_population_survey(year=2025, config='configuration_example.yaml')

For more on configuration, see the Configuration page.