Data Masking for testing and tuning on realistic data

rade · April 17, 2017, 5:44am

SQL performance tuning requires access to a database with a realistic volume, data distribution and relationships, but with all private data masked, while preserving mandatory formats.There is a need for both on-the-fly masking while accessing the database on-line through a testing PostgreSQL user with limited privileges and to export the schema having all private data masked.
It would be nice to have a simple solution that can be easily adopted by the wider PostgreSQL community and eventually included in the upstream PostgreSQL project.
The definition of private fields and masking expressions should be simple and use just the PostgreSQL data dictionary for saving the definitions, as important masking data (primary and foreign keys, check constraints, table and column names, etc) is already there. The column comment can be used to store the masking expression, using PostgreSQL SQL expression syntax and marked by some special tag (any suggestions for the tag syntax?). The masking expressions should be irreversible and lightweight, not influencing the performance significantly. Some open source tool that can present the schema though ER diagrams and allow easy editing of column comments and having an SQL interpreter for testing masking expressions would be ideal for this task. I think that DBeaver can be used. As it is an Eclipse plugin, it fits into the metasfresh development environment. DBeaver itself has plugins, so it would be possible to make a special masking took plugin.
In the testing user schema, a set of views that mirror the existing tables should be generated by a SQL script. Ideally, no foreign key translation lookup tables should be needed if the masking expressions are deterministic.The tricky part is the need to create shadow indices using the masking expressions in the place of private columns, so the planner can hopefully generate equivalent execution plans when SELECT statements are executed by the testing user instead of the production user. These new indices would require plenty of space, so on-the-fly masking should be an option for special circumstances, and maybe not developed in the first version. The testing user schema can be used to export masked data by a custom SQL script or pg_dump modified to use the original table definitions for generating DDL and using shadow views for generating insert values. The exported schema should behave exactly as the original one, so no shadow indices are needed there.
These are just first ideas, so anybody interested, please comment and suggest your own!

rade · April 20, 2017, 8:33am

Talend released free Data Preparation with a nice data masking function. Talend Open Studio can be used to create an ETL job to extract data from PostgreSQL and mask it with Data Preparation. Gave up after I tried with defining the whole metasfresh public schema in Open Studio, but it froze, it is obviously to big for that tool.
Continuing with creating a script to generate an export of masked data in as plain sql (using copy commands), similar to pg_dump. The schema metadata can be exported with pg_dump in schema-only mode, and then the custom script can emulate the pg_dump in data only mode.

rade · April 26, 2017, 8:07pm

Created the script, but still need to make a collection of deterministic and random masking functions for different types of data. Will try to finish it this weekend.

metasnw · April 29, 2017, 7:19am

hi Rade,

thx for posting your progress !

cheers,
Norbert

rade · April 30, 2017, 11:44am

I need a medium sized database dump to test the masking and some suggestions what columns to mask.

metasnw · May 1, 2017, 8:58am

hi,

as we constantly keep adding columns it might be best to add a column defintion in the appilication dictionary (table ad_column) which allows the dev to mark any column as “sensitive” and then its considered by your script.

What do you think?

rade · May 3, 2017, 7:37pm

That would be OK, fro the start. My idea is to define the desired mode of masking, deterministic or random. We can put ‘det’ or ‘rand’ or a full formula in that column.
det should be used when consistency of values is needed - same values mask into same values.
An explicit formula can be used for text columns when some human readable value is desired, as my default det and rand functions produce just gibberish, within the length or the original data.
rand should be used when this kind of consistency is not needed, as it provides better obfuscation.
As all the primary keys in metasfresh are synthetic and so are the foreign keys, they probably do not need to be masked, so there will be no issues with integrity constraints.
I can change my script can read directly from this additional column in ad_columns. Currently, it is reading the database column description, so no modification is needed in ad_columns, and it is universal for any PostgreSQL database. It finds “msk"det” or “msk"rand” or “msk"some–SQL-expression” in the column description.

rade · May 13, 2017, 9:50pm

Created project https://github.com/rpobulic/pg-mask/

rade · May 24, 2017, 3:35am

I suggest that the developers mark the columns by using PostgreSQL column descriptions as explained in the README at https://github.com/rpobulic/pg-mask/ or to populate the additional column in ad_columns. I can make a script to copy the ad_columns additional column to column descriptions. Once you populate columns of different data types, we can test the masking and I can start working on real performance problems with real customer masked data.

metasnw · May 25, 2017, 10:49am

hi Rade,

thanks for pointing to your masking tool.

Why did you chose to ask for an owner to export and not only using the database name?

cheers,
norbert

rade · May 25, 2017, 1:24pm

I think that database is not granular enough, an average customer can easily have some additional owners and schemata which he does not need to export. As far as I can see, all metasfresh tables belong to s single owner, so for the purpose of tuning metasfresh SQL, the owner level export is adequate. I am actually thinking of adding some optional levels of granularity to my project, to make it more generic, but that is not relevant for metasfresh exports.

rade · August 17, 2017, 8:46am

Did anybody try the masking tool? Please, submit any issues or feature requests using the Github Issues https://github.com/rpobulic/pg-mask/issues
Thanks!