Skip to main content
Skip to main content
Edit this page

Data masking in ClickHouse

Data masking is a technique used for data protection, in which the original data is replaced with a version of the data which maintains its format and structure while removing any personally identifiable information (PII) or sensitive information.

This guide shows you how you can mask data in ClickHouse.

Use string replacement functions

For basic data masking use cases, the replace family of functions offers a convenient way to mask data:

FunctionDescription
replaceOneReplaces the first occurrence of a pattern in a haystack string with the provided replacement string.
replaceAllReplaces all occurrences of a pattern in a haystack string with the provided replacement string.
replaceRegexpOneReplaces the first occurrence of a substring matching a regular expression pattern (in re2 syntax) in a haystack with the provided replacement string.
replaceRegexpAllReplaces all occurrences of a substring matching a regular expression pattern (in re2 syntax) in a haystack with the provided replacement string.

For example, you can replace the name "John Smith" with a placeholder [CUSTOMER_NAME] using the replaceOne function:

SELECT replaceOne(
    'Customer John Smith called about his account',
    'John Smith',
    '[CUSTOMER_NAME]'
) AS anonymized_text;
┌─anonymized_text───────────────────────────────────┐
│ Customer [CUSTOMER_NAME] called about his account │
└───────────────────────────────────────────────────┘

More generically, you can use the replaceRegexpOne to replace any customer name:

SELECT 
    replaceRegexpAll(
        'Customer John Smith called. Later, Mary Johnson and Bob Wilson also called.',
        '\\b[A-Z][a-z]+ [A-Z][a-z]+\\b',
        '[CUSTOMER_NAME]'
    ) AS anonymized_text;
┌─anonymized_text───────────────────────────────────────────────────────────────────────┐
│ [CUSTOMER_NAME] Smith called. Later, [CUSTOMER_NAME] and [CUSTOMER_NAME] also called. │
└───────────────────────────────────────────────────────────────────────────────────────┘

Or you could mask a social security number, leaving only the last 4 digits using the replaceRegexpAll function.

SELECT replaceRegexpAll(
    'SSN: 123-45-6789',
    '(\d{3})-(\d{2})-(\d{4})',
    'XXX-XX-\3'
) AS masked_ssn;

In the query above \3 is used to substitute the third capture group into the resulting string, which produces:

┌─masked_ssn───────┐
│ SSN: XXX-XX-6789 │
└──────────────────┘

Create masked VIEWs

A VIEW can be used in conjunction with the aforementioned string functions to apply transformations to columns containing sensitive data, before they are presented to the user. In this way, the original data remains unchanged, and users querying the view see only the masked data.

To demonstrate, let's imagine that we have a table which stores records of customer orders. We want to make sure that a group of employees can view the information, but we don't want them to see the full information of the customers.

Run the query below to create an example table orders and insert some fictional customer order records into it:

CREATE TABLE orders (
    user_id UInt32,
    name String,
    email String,
    phone String,
    total_amount Decimal(10,2),
    order_date Date,
    shipping_address String
)
ENGINE = MergeTree()
ORDER BY user_id;

INSERT INTO orders VALUES
    (1001, 'John Smith', '[email protected]', '555-123-4567', 299.99, '2024-01-15', '123 Main St, New York, NY 10001'),
    (1002, 'Sarah Johnson', '[email protected]', '555-987-6543', 149.50, '2024-01-16', '456 Oak Ave, Los Angeles, CA 90210'),
    (1003, 'Michael Brown', '[email protected]', '555-456-7890', 599.00, '2024-01-17', '789 Pine Rd, Chicago, IL 60601'),
    (1004, 'Emily Rogers', '[email protected]', '555-321-0987', 89.99, '2024-01-18', '321 Elm St, Houston, TX 77001'),
    (1005, 'David Wilson', '[email protected]', '555-654-3210', 449.75, '2024-01-19', '654 Cedar Blvd, Phoenix, AZ 85001');

Create a view called masked_orders:

CREATE VIEW masked_orders AS
SELECT
    user_id,
    replaceRegexpOne(name, '^([A-Za-z]+)\\s+(.*)$', '\\1 ****') AS name,
    replaceRegexpOne(email, '^(.{0})[^@]*(@.*)$', '\\1****\\2') AS email,
    replaceRegexpOne(phone, '^(\\d{3})-(\\d{3})-(\\d{4})$', '\\1-***-\\3') AS phone,
    total_amount,
    order_date,
    replaceRegexpOne(shipping_address, '^[^,]+,\\s*(.*)$', '*** \\1') AS shipping_address
FROM orders;

In the SELECT clause of the view creation query above, we define transformations using the replaceRegexpOne on the name, email, phone and shipping_address fields, which are the fields containing sensitive information that we wish to partially mask.

Select the data from the view:

SELECT * FROM masked_orders
┌─user_id─┬─name─────────┬─email──────────────┬─phone────────┬─total_amount─┬─order_date─┬─shipping_address──────────┐
│    1001 │ John ****    │ jo****@gmail.com   │ 555-***-4567 │       299.99 │ 2024-01-15 │ *** New York, NY 10001    │
│    1002 │ Sarah ****   │ sa****@outlook.com │ 555-***-6543 │        149.5 │ 2024-01-16 │ *** Los Angeles, CA 90210 │
│    1003 │ Michael **** │ mb****@company.com │ 555-***-7890 │          599 │ 2024-01-17 │ *** Chicago, IL 60601     │
│    1004 │ Emily ****   │ em****@yahoo.com   │ 555-***-0987 │        89.99 │ 2024-01-18 │ *** Houston, TX 77001     │
│    1005 │ David ****   │ dw****@email.net   │ 555-***-3210 │       449.75 │ 2024-01-19 │ *** Phoenix, AZ 85001     │
└─────────┴──────────────┴────────────────────┴──────────────┴──────────────┴────────────┴───────────────────────────┘

Notice that the data returned from the view is partially masked, obfuscating the sensitive information. You can also create multiple views, with differing levels of obfuscation depending on the level of privileged access to information the viewer has.

To ensure that users are only able to access the view returning the masked data, and not the table with the original unmasked data, you should use Role Based Access Control to ensure that specific roles only have grants to select from the view.

First create the role:

CREATE ROLE masked_orders_viewer;

Next grant SELECT privileges on the view to the role:

GRANT SELECT ON masked_orders TO masked_orders_viewer;

Because ClickHouse roles are additive, you must ensure that users who should only see the masked view do not have any SELECT privilege on the base table via any role.

As such, you should explicitly revoke base-table access to be safe:

REVOKE SELECT ON orders FROM masked_orders_viewer;

Finally, assign the role to the appropriate users:

GRANT masked_orders_viewer TO your_user;

This ensures that users with the masked_orders_viewer role are only able to see the masked data from the view and not the original unmasked data from the table.

Use MATERIALIZED columns and column-level access restrictions

In cases where you don't want to create a separate view, you can store masked versions of your data alongside the original data. To do so, you can use materialized columns. Values of such columns are automatically calculated according to the specified materialized expression when rows are inserted, and we can use them to create new columns with masked versions of the data.

Taking the example before, instead of creating a separate VIEW for the masked data, we'll now create masked columns using MATERIALIZED:

DROP TABLE IF EXISTS orders;
CREATE TABLE orders (
    user_id UInt32,
    name String,
    name_masked String MATERIALIZED replaceRegexpOne(name, '^([A-Za-z]+)\\s+(.*)$', '\\1 ****'),
    email String,
    email_masked String MATERIALIZED replaceRegexpOne(email, '^(.{0})[^@]*(@.*)$', '\\1****\\2'),
    phone String,
    phone_masked String MATERIALIZED replaceRegexpOne(phone, '^(\\d{3})-(\\d{3})-(\\d{4})$', '\\1-***-\\3'),
    total_amount Decimal(10,2),
    order_date Date,
    shipping_address String,
    shipping_address_masked String MATERIALIZED replaceRegexpOne(shipping_address, '^[^,]+,\\s*(.*)$', '*** \\1')
)
ENGINE = MergeTree()
ORDER BY user_id;

INSERT INTO orders VALUES
    (1001, 'John Smith', '[email protected]', '555-123-4567', 299.99, '2024-01-15', '123 Main St, New York, NY 10001'),
    (1002, 'Sarah Johnson', '[email protected]', '555-987-6543', 149.50, '2024-01-16', '456 Oak Ave, Los Angeles, CA 90210'),
    (1003, 'Michael Brown', '[email protected]', '555-456-7890', 599.00, '2024-01-17', '789 Pine Rd, Chicago, IL 60601'),
    (1004, 'Emily Rogers', '[email protected]', '555-321-0987', 89.99, '2024-01-18', '321 Elm St, Houston, TX 77001'),
    (1005, 'David Wilson', '[email protected]', '555-654-3210', 449.75, '2024-01-19', '654 Cedar Blvd, Phoenix, AZ 85001');

If you now run the following select query, you will see that the masked data is 'materialized' at insert time and stored alongside the original, unmasked data. It is necessary to explicitly select the masked columns as ClickHouse doesn't automatically include materialized columns in SELECT * queries by default.

SELECT
    *,
    name_masked,
    email_masked,
    phone_masked,
    shipping_address_masked
FROM orders
ORDER BY user_id ASC
   ┌─user_id─┬─name──────────┬─email─────────────────────┬─phone────────┬─total_amount─┬─order_date─┬─shipping_address───────────────────┬─name_masked──┬─email_masked───────┬─phone_masked─┬─shipping_address_masked────┐
1. │    1001 │ John Smith    │ [email protected]      │ 555-123-4567 │       299.99 │ 2024-01-15 │ 123 Main St, New York, NY 10001    │ John ****    │ jo****@gmail.com   │ 555-***-4567 │ **** New York, NY 10001    │
2. │    1002 │ Sarah Johnson │ [email protected] │ 555-987-6543 │        149.5 │ 2024-01-16 │ 456 Oak Ave, Los Angeles, CA 90210 │ Sarah ****   │ sa****@outlook.com │ 555-***-6543 │ **** Los Angeles, CA 90210 │
3. │    1003 │ Michael Brown │ [email protected]        │ 555-456-7890 │          599 │ 2024-01-17 │ 789 Pine Rd, Chicago, IL 60601     │ Michael **** │ mb****@company.com │ 555-***-7890 │ **** Chicago, IL 60601     │
4. │    1004 │ Emily Rogers  │ [email protected]    │ 555-321-0987 │        89.99 │ 2024-01-18 │ 321 Elm St, Houston, TX 77001      │ Emily ****   │ em****@yahoo.com   │ 555-***-0987 │ **** Houston, TX 77001     │
5. │    1005 │ David Wilson  │ [email protected]         │ 555-654-3210 │       449.75 │ 2024-01-19 │ 654 Cedar Blvd, Phoenix, AZ 85001  │ David ****   │ dw****@email.net   │ 555-***-3210 │ **** Phoenix, AZ 85001     │
   └─────────┴───────────────┴───────────────────────────┴──────────────┴──────────────┴────────────┴────────────────────────────────────┴──────────────┴────────────────────┴──────────────┴────────────────────────────┘

To ensure that users are only able to access columns containing the masked data, you can again use Role Based Access Control to ensure that specific roles only have grants to select on masked columns from orders.

Recreate the role that we made previously:

DROP ROLE IF EXISTS masked_order_viewer;
CREATE ROLE masked_order_viewer;

Next, grant SELECT permission to the orders table:

GRANT SELECT ON orders TO masked_data_reader;

Revoke access to any sensitive columns:

REVOKE SELECT(name) ON orders FROM masked_data_reader;
REVOKE SELECT(email) ON orders FROM masked_data_reader;
REVOKE SELECT(phone) ON orders FROM masked_data_reader;
REVOKE SELECT(shipping_address) ON orders FROM masked_data_reader;

Finally, assign the role to the appropriate users:

GRANT masked_orders_viewer TO your_user;

In the case where you want to store only the masked data in the orders table, you can mark the sensitive unmasked columns as EPHEMERAL, which will ensure that columns of this type are not stored in the table.

DROP TABLE IF EXISTS orders;
CREATE TABLE orders (
    user_id UInt32,
    name String EPHEMERAL,
    name_masked String MATERIALIZED replaceRegexpOne(name, '^([A-Za-z]+)\\s+(.*)$', '\\1 ****'),
    email String EPHEMERAL,
    email_masked String MATERIALIZED replaceRegexpOne(email, '^(.{2})[^@]*(@.*)$', '\\1****\\2'),
    phone String EPHEMERAL,
    phone_masked String MATERIALIZED replaceRegexpOne(phone, '^(\\d{3})-(\\d{3})-(\\d{4})$', '\\1-***-\\3'),
    total_amount Decimal(10,2),
    order_date Date,
    shipping_address String EPHEMERAL,
    shipping_address_masked String MATERIALIZED replaceRegexpOne(shipping_address, '^([^,]+),\\s*(.*)$', '*** \\2')
)
ENGINE = MergeTree()
ORDER BY user_id;

INSERT INTO orders (user_id, name, email, phone, total_amount, order_date, shipping_address) VALUES
    (1001, 'John Smith', '[email protected]', '555-123-4567', 299.99, '2024-01-15', '123 Main St, New York, NY 10001'),
    (1002, 'Sarah Johnson', '[email protected]', '555-987-6543', 149.50, '2024-01-16', '456 Oak Ave, Los Angeles, CA 90210'),
    (1003, 'Michael Brown', '[email protected]', '555-456-7890', 599.00, '2024-01-17', '789 Pine Rd, Chicago, IL 60601'),
    (1004, 'Emily Rogers', '[email protected]', '555-321-0987', 89.99, '2024-01-18', '321 Elm St, Houston, TX 77001'),
    (1005, 'David Wilson', '[email protected]', '555-654-3210', 449.75, '2024-01-19', '654 Cedar Blvd, Phoenix, AZ 85001');

If we run the same query as before, you'll now see that only the materialized masked data was inserted into the table:

SELECT
    *,
    name_masked,
    email_masked,
    phone_masked,
    shipping_address_masked
FROM orders
ORDER BY user_id ASC
   ┌─user_id─┬─total_amount─┬─order_date─┬─name_masked──┬─email_masked───────┬─phone_masked─┬─shipping_address_masked───┐
1. │    1001 │       299.99 │ 2024-01-15 │ John ****    │ jo****@gmail.com   │ 555-***-4567 │ *** New York, NY 10001    │
2. │    1002 │        149.5 │ 2024-01-16 │ Sarah ****   │ sa****@outlook.com │ 555-***-6543 │ *** Los Angeles, CA 90210 │
3. │    1003 │          599 │ 2024-01-17 │ Michael **** │ mb****@company.com │ 555-***-7890 │ *** Chicago, IL 60601     │
4. │    1004 │        89.99 │ 2024-01-18 │ Emily ****   │ em****@yahoo.com   │ 555-***-0987 │ *** Houston, TX 77001     │
5. │    1005 │       449.75 │ 2024-01-19 │ David ****   │ dw****@email.net   │ 555-***-3210 │ *** Phoenix, AZ 85001     │
   └─────────┴──────────────┴────────────┴──────────────┴────────────────────┴──────────────┴───────────────────────────┘

Use query masking rules for log data

For users of ClickHouse OSS wishing to mask log data specifically, you can make use of query masking rules (log masking) to mask data.

To do so, you can define regular expression-based masking rules in the server configuration. These rules are applied to queries and all log messages before they are stored in server logs or system tables (such as system.query_log, system.text_log, and system.processes).

This helps prevent sensitive data from leaking into logs only. Note that it does not mask data in query results.

For example, to mask a social security number, you could add the following rule to your server configuration:

<query_masking_rules>
    <rule>
        <name>hide SSN</name>
        <regexp>(^|\D)\d{3}-\d{2}-\d{4}($|\D)</regexp>
        <replace>000-00-0000</replace>
    </rule>
</query_masking_rules>