Configuring Noise
In this tutorial, we will walk through an example of how to customize the amount of noise in a simulated dataset generated by pseudopeople.
If you haven’t already used pseudopeople to generate a dataset with the default settings, follow the Quickstart before continuing with this tutorial.
The problem of fake names
Sometimes when people respond to a survey, they don’t want to share their personal information. If the survey (whether online, on paper, or in person) requires a response, they might just make something up.
pseudopeople has a noise type to simulate these sorts of responses for first and/or last names: “Use a fake name.” With pseudopeople’s default settings, this happens just 1% of the time. But let’s say we’re concerned it will happen more often in the future, and we want to see how robust our entity resolution methods are to this issue.
Generating a simulated Current Population Survey
Let’s generate some 2025 Current Population Survey (CPS) data in a future scenario where 30% of people use fake last names on the survey.
To do this, we will pass some configuration to the pseudopeople.generate_current_population_survey() function.
import pseudopeople as psp

# Raise the probability of a fake last name in the CPS dataset to 30%.
cps_2025 = psp.generate_current_population_survey(year=2025, config={
    'current_population_survey': {
        'column_noise': {
            'last_name': {
                'use_fake_name': {
                    'cell_probability': 0.3,
                },
            },
        },
    },
})
In English, our configuration says: in the CPS dataset (current_population_survey), for the last name column (last_name), the fake name noise type (use_fake_name) should have its cell probability parameter (cell_probability) set to 0.3 (30%).
The full set of parameters for each noise type is documented at Noise Type Details.
The column_noise key specifies that we are configuring column-based noise; the categories of noise are explained in more detail on the noise page.
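To make the nesting concrete, the sketch below walks a configuration dictionary of this shape and prints one line per setting. The helper describe_config is hypothetical and not part of pseudopeople; it assumes only the nesting described above (dataset, then column_noise, then column, then noise type, then parameter).

```python
def describe_config(config):
    """Return one readable line per parameter in a column-noise configuration.

    Hypothetical helper (not part of pseudopeople). Assumes the nesting
    dataset -> 'column_noise' -> column -> noise type -> parameter.
    """
    lines = []
    for dataset, categories in config.items():
        for column, noise_types in categories.get('column_noise', {}).items():
            for noise_type, parameters in noise_types.items():
                for parameter, value in parameters.items():
                    lines.append(
                        f"{dataset} / {column} / {noise_type} / {parameter} = {value}"
                    )
    return lines


config = {
    'current_population_survey': {
        'column_noise': {
            'last_name': {
                'use_fake_name': {
                    'cell_probability': 0.3,
                },
            },
        },
    },
}

for line in describe_config(config):
    print(line)
# current_population_survey / last_name / use_fake_name / cell_probability = 0.3
```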
Let’s take a look at the names in our generated CPS dataset:
>>> cps_2025[['first_name', 'middle_initial', 'last_name']]
    first_name  middle_initial          last_name
0      Gregory               J               Four
1    Bridgette               J            Kennedy
2      William               G           Phillips
3      Valerie               M               Male
4        Molly               A            Wheeler
5       Thomas               K             Eastep
6      Kenneth               C             Harper
7       Daniel               M             Harper
8        Susan               M              Adult
9      Dorothy               P             Gaytan
10       Daisy               R           Williams
11        Mark               T               Rock
12     Mohamed               C             Person
13     Giselle               L              Weber
14        Jean               F              Stull
15        Lila               L                  C
16       Carli               G            Mckamey
17      Justin               B                  E
18         Ana               S  Davidson Granados
19        Rose               K           Carrillo
20      Nayeli               A           Carrillo
21      Robert               O           Carrillo
22      Emilio               P           Carrillo
23       Mindy               K             Walton
24         Lee               J          Household
25   Janautica               K            Clapper
26      Tanner               M            Clapper
27       Helen               K        Of The Home
28       Tonya               A                  Y
29       Chris               C            Frazier
30      Arnold               J             Friend
31    Patricia               C               Wife
32      Jessie               L             Madden
33       Laura               V        Fortenberry
34        Zaid               B        Fortenberry
35       Sammy               R           Mcfarlan
As expected, we see a number of strings in the last name column that are unlikely to be true last names. We can check exactly which ones are fake names by comparing to the same dataset without fake name noise in the last name column. For brevity, we do not show the steps to do this here, but we would find that there are eleven such strings, or about 31% of our 36 respondents, close to the 30% we configured.
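One way to think about that comparison is sketched below. It uses plain Python lists rather than real pseudopeople output: the noised values are the first six last names from the table above, while clean_last_names is a hypothetical stand-in, invented for illustration, for the same column generated without fake name noise. The helper changed_rows is also hypothetical, and it assumes both datasets list the same respondents in the same order.

```python
def changed_rows(noised, clean):
    """Return the positions where the noised value differs from the clean one.

    Hypothetical helper; assumes both sequences describe the same
    respondents in the same order.
    """
    if len(noised) != len(clean):
        raise ValueError("both columns must cover the same respondents")
    return [i for i, (n, c) in enumerate(zip(noised, clean)) if n != c]


# First six last names from the noised output above.
noised_last_names = ['Four', 'Kennedy', 'Phillips', 'Male', 'Wheeler', 'Eastep']
# Hypothetical un-noised values, invented for illustration only.
clean_last_names = ['Foster', 'Kennedy', 'Phillips', 'Mason', 'Wheeler', 'Eastep']

fake_positions = changed_rows(noised_last_names, clean_last_names)
share = len(fake_positions) / len(noised_last_names)

print(fake_positions)  # [0, 3]
print(f"{share:.0%} of these rows differ")
```

The same idea carries over to a full dataset, provided the comparison dataset differs only in the fake name setting for the column being checked.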
Increasing noise in first names
Imagine we also want to increase the probability of a fake first name from its default of 1%. We can do this by modifying the configuration dictionary. This time, we’ll save the configuration dictionary to a variable before using it to generate the dataset:
config = {
    'current_population_survey': {
        'column_noise': {
            'last_name': {
                'use_fake_name': {
                    'cell_probability': 0.3,
                },
            },
            'first_name': {
                'use_fake_name': {
                    'cell_probability': 0.2,
                },
            },
        },
    },
}
cps_2025 = psp.generate_current_population_survey(year=2025, config=config)
By specifying multiple keys within column_noise, we are able to independently adjust noise settings for different columns.
Here we have set the probability of a fake first name to 0.2 (20%) while retaining the 0.3 (30%) probability of a fake last name.
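When several columns share a noise type, the nested dictionary can also be built programmatically instead of typed out by hand. The following is a minimal sketch: fake_name_config is a hypothetical helper, not a pseudopeople function, and it only produces a dictionary of the shape shown above.

```python
def fake_name_config(probabilities, dataset='current_population_survey'):
    """Build a column-noise config that sets use_fake_name per column.

    Hypothetical helper (not part of pseudopeople). `probabilities` maps a
    column name to the desired cell probability for that column.
    """
    for column, probability in probabilities.items():
        if not 0 <= probability <= 1:
            raise ValueError(
                f"cell probability for {column!r} must be between 0 and 1"
            )
    return {
        dataset: {
            'column_noise': {
                column: {'use_fake_name': {'cell_probability': probability}}
                for column, probability in probabilities.items()
            },
        },
    }


config = fake_name_config({'last_name': 0.3, 'first_name': 0.2})
```

The resulting dictionary is identical to the hand-written one, so it can be passed as the config argument in the same way.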
Let’s see how our CPS data look now:
>>> cps_2025[['first_name', 'middle_initial', 'last_name']]
       first_name  middle_initial          last_name
0         Gregory               J               Four
1       Bridgette               J            Kennedy
2         William               G           Phillips
3         Valerie               M               Male
4           Molly               A            Wheeler
5          Thomas               K             Eastep
6      Man In The               C             Harper
7             Man               M             Harper
8               R               M              Adult
9   Granddaughter               P             Gaytan
10              G               R           Williams
11           Mark               T               Rock
12           Girl               C             Person
13        Giselle               L              Weber
14         Friend               F              Stull
15           Lila               L                  C
16          Carli               G            Mckamey
17         Justin               B                  E
18            Ana               S  Davidson Granados
19           Rose               K           Carrillo
20         Son Of               A           Carrillo
21         Sister               O           Carrillo
22         Emilio               P           Carrillo
23          Mindy               K             Walton
24            Lee               J          Household
25      Janautica               K            Clapper
26         Tanner               M            Clapper
27          Helen               K        Of The Home
28          Tonya               A                  Y
29          Chris               C            Frazier
30         Arnold               J             Friend
31        Brother               C               Wife
32          House               L             Madden
33              T               V        Fortenberry
34           Zaid               B        Fortenberry
35              H               R          Mcfarlane
Here we see that 13 respondents have used fake first names. Why aren’t there closer to \(0.2 \times 36 = 7.2\) respondents with fake first names? Remember that the parameter we set is a cell probability: whether each cell in the column actually receives noise is decided at random, so the number of noised cells in a sample as small as 36 can differ noticeably from its expected value.
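To get a feel for how much variation to expect, we can treat each of the 36 cells as an independent draw with probability 0.2. This is a simplifying assumption made for illustration, not a description of pseudopeople's internals. Under that assumption the number of noised cells follows a binomial distribution, and the sketch below computes its mean, its standard deviation, and the chance of seeing 13 or more fake first names.

```python
import math

n_cells = 36  # respondents in the sample shown above
p = 0.2       # configured cell probability for fake first names

# Mean and standard deviation of a binomial(n_cells, p) count.
mean = n_cells * p
std_dev = math.sqrt(n_cells * p * (1 - p))


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability of 13 or more noised cells under the independence assumption.
prob_13_or_more = sum(binomial_pmf(k, n_cells, p) for k in range(13, n_cells + 1))

print(f"expected count: {mean:.1f}")          # 7.2
print(f"standard deviation: {std_dev:.1f}")   # 2.4
print(f"P(13 or more): {prob_13_or_more:.3f}")
```

Under this simple model, counts a few cells above or below 7.2 are common, since one standard deviation is 2.4 cells. A count of 13 sits a little over two standard deviations above the mean, so it is on the high side but still possible in a sample this small.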
An alternate format for configuration
It is also possible to specify configuration in a YAML file. The file equivalent to our final configuration above would be:
current_population_survey:
    column_noise:
        last_name:
            use_fake_name:
                cell_probability: 0.3
        first_name:
            use_fake_name:
                cell_probability: 0.2
If this file is saved as configuration_example.yaml in the current working directory, it can be used like so:
cps_2025 = psp.generate_current_population_survey(year=2025, config='configuration_example.yaml')
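If the YAML file does not exist yet, it can be created from Python with the standard library alone. The sketch below is only one way to do this; write_example_config is a hypothetical helper, not part of pseudopeople, and it writes the same configuration text shown above into a directory of your choice.

```python
from pathlib import Path

# Same configuration as the YAML shown above; indentation is significant.
CONFIG_YAML = """\
current_population_survey:
    column_noise:
        last_name:
            use_fake_name:
                cell_probability: 0.3
        first_name:
            use_fake_name:
                cell_probability: 0.2
"""


def write_example_config(directory: Path) -> Path:
    """Write configuration_example.yaml into `directory` and return its path.

    Hypothetical helper for illustration only.
    """
    path = directory / "configuration_example.yaml"
    path.write_text(CONFIG_YAML, encoding="utf-8")
    return path
```

Calling write_example_config(Path.cwd()) places the file in the current working directory, where the call above expects to find it.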
For more on configuration, see the Configuration page.