SQL performance tuning requires access to a database with a realistic volume, data distribution and relationships, but with all private data masked, while preserving mandatory formats.There is a need for both on-the-fly masking while accessing the database on-line through a testing PostgreSQL user with limited privileges and to export the schema having all private data masked.
It would be nice to have a simple solution that can be easily adopted by the wider PostgreSQL community and eventually included in the upstream PostgreSQL project.
The definition of private fields and masking expressions should be simple and use just the PostgreSQL data dictionary for saving the definitions, as important masking data (primary and foreign keys, check constraints, table and column names, etc) is already there. The column comment can be used to store the masking expression, using PostgreSQL SQL expression syntax and marked by some special tag (any suggestions for the tag syntax?). The masking expressions should be irreversible and lightweight, not influencing the performance significantly. Some open source tool that can present the schema though ER diagrams and allow easy editing of column comments and having an SQL interpreter for testing masking expressions would be ideal for this task. I think that DBeaver can be used. As it is an Eclipse plugin, it fits into the metasfresh development environment. DBeaver itself has plugins, so it would be possible to make a special masking took plugin.
In the testing user schema, a set of views that mirror the existing tables should be generated by a SQL script. Ideally, no foreign key translation lookup tables should be needed if the masking expressions are deterministic.The tricky part is the need to create shadow indices using the masking expressions in the place of private columns, so the planner can hopefully generate equivalent execution plans when SELECT statements are executed by the testing user instead of the production user. These new indices would require plenty of space, so on-the-fly masking should be an option for special circumstances, and maybe not developed in the first version. The testing user schema can be used to export masked data by a custom SQL script or pg_dump modified to use the original table definitions for generating DDL and using shadow views for generating insert values. The exported schema should behave exactly as the original one, so no shadow indices are needed there.
These are just first ideas, so anybody interested, please comment and suggest your own!