Glossary
- Configuration
Settings that let you customize the noise in the datasets generated by pseudopeople. Noise can be configured separately for each dataset, noise type, and (where applicable) column.
- Datasets
The types of data that can be simulated with pseudopeople, each of which is the simulated analog of a “real world” database from a census, survey, or administrative source. For example, pseudopeople’s American Community Survey (ACS) dataset is analogous to the data that would be collected by that survey in real life.
- Entity resolution (ER)
The task of identifying the unique entities associated with a set of records, where multiple records may refer to the same entity. Also called “record linkage,” among other names.
- Noise
Errors introduced into a pseudopeople dataset. These simulate the data errors found in real-life survey and administrative data.
- Noise types
The types of error that can be introduced to a pseudopeople dataset. Each one simulates a specific type of mistake or inaccuracy that could occur in a real-life data collection or generation process. For example, one of the noise types in pseudopeople is a simulant choosing the wrong option from a list of choices on a form.
- Probabilistic record linkage (PRL)
Entity resolution (“record linkage”) methods that use probabilities to represent uncertainty about which records belong to which entities.
- Record linkage
Another term for entity resolution.
- Simulant
A simulated person represented in a pseudopeople-generated dataset.
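
To make the entity resolution and probabilistic record linkage entries more concrete, here is a minimal sketch in plain Python. It does not use pseudopeople. The field names, the m and u probabilities, the prior, and the example records are all hypothetical values chosen for illustration. The scoring is a simplified Fellegi-Sunter-style log-likelihood ratio, not the method of any particular library.

```python
import math

# Hypothetical agreement probabilities (illustrative, not estimated from data):
# m = P(field agrees | both records refer to the same simulant)
# u = P(field agrees | the records refer to different simulants)
FIELD_PROBS = {
    "first_name": (0.90, 0.05),
    "last_name": (0.95, 0.02),
    "birth_year": (0.98, 0.10),
}


def match_weight(rec_a, rec_b):
    """Sum log2 likelihood ratios over fields; higher favors 'same entity'."""
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def match_probability(weight, prior=0.01):
    """Turn a match weight into a probability, given an assumed prior match rate."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** weight
    return posterior_odds / (1 + posterior_odds)


# Hypothetical records: the second contains a typo (noise) in the first name.
census = {"first_name": "Maria", "last_name": "Lopez", "birth_year": 1984}
survey_typo = {"first_name": "Mraia", "last_name": "Lopez", "birth_year": 1984}
survey_other = {"first_name": "John", "last_name": "Smith", "birth_year": 1984}

w_typo = match_weight(census, survey_typo)
w_other = match_weight(census, survey_other)

print("noisy copy of the same simulant:", match_probability(w_typo))
print("different simulant:", match_probability(w_other))
```

In this sketch, the record with the typo still scores higher than the unrelated record because the other fields agree. The score is a probability rather than a yes-or-no decision, which is what makes the approach probabilistic.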