Hierarchical Anonymization

Node for anonymizing sensitive personal data. The node is based on the ARX Data Anonymization Tool.



Attribute type. Possible options:
0 | Identifying
1 | Quasi-identifying
2 | Sensitive
3 | Insensitive
(Either the index or the name can be used in flow variables)
Hierarchy file (*.ahs).
Transformation mode options:
0 | Generalization
1 | Microaggregation
2 | Clustering and microaggregation
(Either the index or the name can be used in flow variables)
Attribute weight. Value in range [0.0, 1.0]. Default is 0.5. Attributes with lower weights are anonymized more strongly; attributes with higher weights are anonymized less.
Minimum fixed generalization level.
Maximum fixed generalization level.
Attribute processing function. Possible options:
0 | Arithmetic mean
1 | Geometric mean
2 | Median
3 | Interval
4 | Mode
(Either the index or the name can be used in flow variables)
Ignore Missing Data
Defines if the generalization function ignores missing data or not.
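As a rough illustration of what the attribute processing functions compute for one group of values, here is a minimal Python sketch. It is not the node's or ARX's implementation: the function name `aggregate`, the use of `None` for missing data, and the behavior when missing data is not ignored are all assumptions made for this example.

```python
import statistics

def aggregate(values, function, ignore_missing=True):
    """Collapse one group of numeric values into a single representative.

    Hypothetical helper; `None` stands in for a missing cell.
    """
    if ignore_missing:
        values = [v for v in values if v is not None]
    elif any(v is None for v in values):
        # Assumption: a missing value makes the whole result missing.
        return None
    if not values:
        return None
    if function == "Arithmetic mean":
        return statistics.mean(values)
    if function == "Geometric mean":
        return statistics.geometric_mean(values)
    if function == "Median":
        return statistics.median(values)
    if function == "Interval":
        return (min(values), max(values))
    if function == "Mode":
        return statistics.mode(values)
    raise ValueError(f"unknown function: {function}")

ages = [30, 32, None, 34, 34]
median_age = aggregate(ages, "Median")
age_interval = aggregate(ages, "Interval")
```

With "Ignore Missing Data" modeled as `ignore_missing=True`, the missing entry is simply dropped before the function is applied.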

Anonymization Config

Number of threads
Number of partitions (threads). The input data is split into this number of partitions, which are processed in separate threads simultaneously. This may reduce anonymization time but can affect the quality of the anonymization.
Partition by column
Partition the table by the specified column. When unchecked, the table is partitioned into Number of threads parts of equal size. For string columns, the table is partitioned by the distinct values of the column; an error is raised if there are more distinct values than the specified Number of threads. For decimal and Date&Time columns, the range of possible values is split into Number of threads intervals of equal length.
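The two column-based partitioning rules can be sketched in Python as follows. This is only an illustration of the behavior described here, not the node's code: the function names are hypothetical, rows are represented by their indices, and the treatment of the upper boundary value is an assumption.

```python
from collections import defaultdict

def partition_by_string(values, num_threads):
    """Group row indices by distinct value (one partition per value)."""
    groups = defaultdict(list)
    for row, value in enumerate(values):
        groups[value].append(row)
    if len(groups) > num_threads:
        raise ValueError("more distinct values than Number of threads")
    return list(groups.values())

def partition_by_range(values, num_threads):
    """Split [min, max] into equal-length intervals and bucket row indices."""
    low, high = min(values), max(values)
    width = (high - low) / num_threads
    parts = [[] for _ in range(num_threads)]
    for row, value in enumerate(values):
        if width == 0:
            index = 0
        else:
            # Assumption: the maximum value falls into the last interval.
            index = min(int((value - low) / width), num_threads - 1)
        parts[index].append(row)
    return parts

string_parts = partition_by_string(["a", "b", "a"], 2)
range_parts = partition_by_range([0, 5, 10, 20], 2)
```

Note that equal-length intervals do not guarantee equally sized partitions: a skewed column can leave some threads with far more rows than others.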
Suppression limit
Defines the suppression limit: the maximum fraction of records that can be removed from the input dataset. Value between 0.0 and 1.0.
Approximate: assume practical monotonicity
The option "Approximate" can be enabled to compute an approximate solution with potentially significantly reduced execution times. The solution is guaranteed to fulfill the given privacy settings, but it might not be optimal with regard to the specified data utility model.
Re-identification Risk Threshold
Thresholds for the highest risk of any record. Used for measuring re-identification risks for three different attacker models: (1) the prosecutor scenario, (2) the journalist scenario and (3) the marketer scenario.
Add Class column to output table
Adds a column containing the equivalence class of each record. An equivalence class is a set of records that are indistinguishable with regard to the specified quasi-identifying variables.
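To make the notion of an equivalence class concrete, the following Python sketch assigns a class label to each record by grouping on the quasi-identifying values. The helper name `add_class_column`, the example column names, and the numbering of classes in order of first appearance are assumptions for illustration; the node's actual labels may differ.

```python
def add_class_column(rows, quasi_identifiers):
    """Return copies of the rows with a 'Class' entry per equivalence class."""
    class_ids = {}
    result = []
    for row in rows:
        # Records with identical quasi-identifier values share one class.
        key = tuple(row[name] for name in quasi_identifiers)
        class_id = class_ids.setdefault(key, len(class_ids))
        result.append({**row, "Class": class_id})
    return result

anonymized = [
    {"age": "30-39", "zip": "123**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "123**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "123**", "diagnosis": "flu"},
]
labeled = add_class_column(anonymized, ["age", "zip"])
classes = [row["Class"] for row in labeled]
```

Here the first two records share the same generalized age and zip values, so they end up in the same class, while the third forms a class of its own.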
Omit rows with missing cells
Exclude rows with missing cells from the input table. If this option is disabled and the table contains a missing cell, an error is thrown.
Omit identifying columns
Exclude 'identifying' columns from the result table.
Heuristic Search Enabled
Defines whether a heuristic search strategy is used.
Limited number of steps
The heuristic search algorithm will terminate after the given number of transformations have been checked.
Limited time [ms]
The heuristic search algorithm will terminate after the given number of milliseconds.
Utility measure
The model for quantifying data quality which will be used as an optimization function during the anonymization process.
Possible options:
0 | Average equivalence class size
1 | Discernability
2 | Height
3 | Loss
4 | Non-uniform entropy
5 | Precision
6 | Ambiguity
7 | Normalized non-uniform entropy
8 | KL-Divergence
9 | Publisher payout (prosecutor)
10 | Publisher payout (journalist)
11 | Entropy-based information loss
12| Classification accuracy
(Either the index or the name can be used in flow variables)
Generalization/Suppression Factor
Value between 0 (generalization) and 1 (suppression) specifying whether generalization or suppression should be preferred when transforming data.
Enable precomputation
Precomputation is switched on when, for each quasi-identifier, the number of distinct data values divided by the total number of records in the dataset is lower than the configured Precomputation threshold.
Precomputation threshold
Value between 0.0 and 1.0.
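The precomputation condition can be expressed as a short check. The Python sketch below follows the rule as worded here (distinct values divided by total records, strictly lower than the threshold, for every quasi-identifier); the function name and the row representation are hypothetical, and the handling of an empty table is an assumption.

```python
def precomputation_enabled(rows, quasi_identifiers, threshold):
    """True if every quasi-identifier has a distinct-value ratio below threshold."""
    total = len(rows)
    if total == 0:
        return False  # assumption: nothing to precompute for an empty table
    return all(
        len({row[name] for row in rows}) / total < threshold
        for name in quasi_identifiers
    )

rows = [
    {"gender": "M", "zip": "10001"},
    {"gender": "F", "zip": "10002"},
    {"gender": "M", "zip": "10003"},
    {"gender": "F", "zip": "10004"},
]
only_gender = precomputation_enabled(rows, ["gender"], 0.6)        # ratio 0.5
gender_and_zip = precomputation_enabled(rows, ["gender", "zip"], 0.6)  # zip ratio 1.0
```

A single quasi-identifier with many distinct values, such as the zip column above, is enough to switch precomputation off.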
Aggregate Function
The aggregation function used to compile the estimates obtained for the individual attributes of a dataset into a global value. Possible options:
0 | SUM
1 | MAX
4 | RANK
(Either the index or the name can be used in flow variables)
The population model is used by the K-Map privacy model and for estimating re-identification risks. Note: Privacy models based on population uniqueness assume that the dataset is a uniform sample of the population. If this is not the case, results may be inaccurate.
One of the regions with predefined population size.
Population size
The population size can be entered manually.

Privacy Models

Privacy Models
Configure privacy models. Refer to the documentation for details.

Research sample

Do not specify research sample.
Use entire input table as sample subset.
Random selection
Selecting records by random sampling.
Random sampling probability. Value between 0.0 and 1.0.
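One plausible reading of random selection is that each record is included independently with the given probability. The Python sketch below illustrates that reading only; whether the node samples exactly this way is an assumption, and `sample_rows` and its `seed` parameter are hypothetical names.

```python
import random

def sample_rows(num_rows, probability, seed=None):
    """Pick row indices, each included independently with `probability`."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0.0 and 1.0")
    rng = random.Random(seed)
    return [row for row in range(num_rows) if rng.random() < probability]

subset = sample_rows(1000, 0.1, seed=42)
```

Under this reading the size of the research sample varies from run to run; only its expected size equals the probability times the number of records.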
Query selection
Selecting records by querying the dataset.
The query syntax is as follows: fields and constants must be enclosed in single quotes. The following operators are supported: >, >=, <, <=, =, or, and, ( and ). Example:
'age'<'40' and 'gender'='M'
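When the query is assembled programmatically, for example before passing it in through a flow variable, a small helper can enforce the quoting rule. The Python sketch below is a hypothetical convenience, not part of the node; it only covers flat conditions joined by a single connector and does not handle parentheses.

```python
ALLOWED_OPERATORS = {">", ">=", "<", "<=", "="}

def build_query(conditions, connector="and"):
    """Join (field, operator, constant) triples into a query string."""
    if connector not in ("and", "or"):
        raise ValueError(f"unsupported connector: {connector}")
    parts = []
    for field, operator, constant in conditions:
        if operator not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported operator: {operator}")
        # Fields and constants are both enclosed in single quotes.
        parts.append(f"'{field}'{operator}'{constant}'")
    return f" {connector} ".join(parts)

query = build_query([("age", "<", "40"), ("gender", "=", "M")])
```

Called with the two conditions above, the helper reproduces the example query shown in this section.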
Flow variable holding sample selection mode. Possible values:
0 | NONE
1 | ALL

Input Ports

Input data table
Hierarchy Configuration

Output Ports

Result table with anonymized data
Statistics table
Suppressed records
Attribute Risks
Statistics converted to flow variables. If partitioning is enabled, only the first row of the statistics table is used.


Interactive View: Transformation View (JS)
Select transformation from available options.



