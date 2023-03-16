

The short answer is "yes". However, we recommend keeping latency between all regions/datacenters in the two-digit millisecond range; otherwise write performance will suffer, because every write goes through a distributed consensus protocol. For example, replication between the US coasts will likely work fine, but replication between the US and Europe likely won't.

Configuration-wise, there's no difference compared to single-region replication: simply use hosts located in different regions as replicas.

For more information, see the full article on data replication.

The clickhouse-local tool makes it quick and easy to read data from MySQL and output the data into lots of different formats, including Parquet, CSV, and JSON. We are going to:

  • Use the mysql table function to read the data
  • Use the INTO OUTFILE _filename_ FORMAT clause and specify the desired output format

The clickhouse-local tool is a part of the ClickHouse binary. Download it using the following:

curl https://clickhouse.com/ | sh

Export MySQL to Parquet

The mysql table function creates a table based on the results of a query sent to a MySQL instance. For example:

SELECT *
FROM
   mysql(
    'localhost:3306',
    'my_sql_database',
    'my_sql_table',
    'user',
    'password'
);

We can pipe the output of this query to a file using INTO OUTFILE. Use FORMAT to specify the format of the file to be created. Let's grab the entire contents of a MySQL table, and send its contents to a Parquet file:

./clickhouse local -q "SELECT * FROM
   mysql(
    'localhost:3306',
    'my_sql_database',
    'my_sql_table',
    'user',
    'password'
)
INTO OUTFILE 'my_output_file.parquet'"
Note: Because the name of the output file has a .parquet extension, ClickHouse assumes we want the Parquet format, which is why we omitted the FORMAT Parquet clause.
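
If the file name carries no recognizable extension, or you want a format other than the one the extension implies, the FORMAT clause can be written out explicitly. The following is a minimal sketch under assumptions: it reuses the placeholder connection details from above, the output name my_output_file.out is made up for illustration, and the CLICKHOUSE variable is just a convenience for pointing at the downloaded binary:

```bash
# Path to the binary; set CLICKHOUSE=... to override (an assumed convention).
CLICKHOUSE="${CLICKHOUSE:-./clickhouse}"

# Placeholder connection details, as in the examples above.
# The file extension is not meaningful here, so FORMAT is stated explicitly.
QUERY="SELECT * FROM
   mysql(
    'localhost:3306',
    'my_sql_database',
    'my_sql_table',
    'user',
    'password'
)
INTO OUTFILE 'my_output_file.out'
FORMAT Parquet"

# Run the export only when the binary is actually present.
if [ -x "$CLICKHOUSE" ]; then
  "$CLICKHOUSE" local -q "$QUERY"
else
  echo "clickhouse binary not found at $CLICKHOUSE" >&2
fi
```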

Export MySQL to CSV

It's the same as for Parquet, except this time we use a .csv extension on the filename. ClickHouse will realize we want a comma-separated output and that's how the data will be written to the file:

./clickhouse local -q "SELECT * FROM
   mysql(
    'localhost:3306',
    'my_sql_database',
    'my_sql_table',
    'user',
    'password'
)
INTO OUTFILE 'my_output_file.csv'"

Export MySQL to JSON

To go from MySQL to JSON, just change the extension on the filename to jsonl or ndjson:

./clickhouse local -q "SELECT * FROM
   mysql(
    'localhost:3306',
    'my_sql_database',
    'my_sql_table',
    'user',
    'password'
)
INTO OUTFILE 'my_output_file.ndjson'"

It's impressive how simple yet powerful the clickhouse-local tool really is. You can easily read data from a database like MySQL and output it into all types of different output formats.

This question usually arises when people see official ClickHouse t-shirts. They have large words “ClickHouse не тормозит” on the front.

Before ClickHouse became open-source, it was developed as an in-house storage system by the largest Russian IT company, Yandex. That’s why it initially got its slogan in Russian, which is “не тормозит” (pronounced as “ne tormozit”). After the open-source release we first produced some of those t-shirts for events in Russia, and it was a no-brainer to use the slogan as-is.

One of the following batches of those t-shirts was supposed to be given away at events outside of Russia, and we tried to make an English version of the slogan. Unfortunately, Russian is rather elegant at expressing this idea concisely, and space on a t-shirt is limited, so we failed to come up with a good enough translation (most options were either too long or inaccurate) and decided to keep the slogan in Russian even on t-shirts produced for international events. It turned out to be a great decision, because people all over the world are positively surprised and curious when they see it.

So, what does it mean? Here are some ways to translate “не тормозит”:

  • If you translate it literally, it’d be something like “ClickHouse does not press the brake pedal”.
  • If you’d want to express it as close to how it sounds to a Russian person with IT background, it’d be something like “If your larger system lags, it’s not because it uses ClickHouse”.
  • Shorter, but not so precise versions could be “ClickHouse is not slow”, “ClickHouse does not lag” or just “ClickHouse is fast”.

If you haven’t seen one of those t-shirts in person, you can check them out online in many ClickHouse-related videos.

P.S. These t-shirts are not for sale; they are given away for free at most ClickHouse Meetups, usually for the best questions or other forms of active participation.

OLAP stands for Online Analytical Processing. It is a broad term that can be looked at from two perspectives: technical and business. But at the very high level, you can just read these words backward:

Processing : Some source data is processed…

Analytical : …to produce some analytical reports and insights…

Online : …in real-time.

OLAP from the Business Perspective

In recent years, business people have started to realize the value of data. Companies that make their decisions blindly more often than not fail to keep up with the competition. The data-driven approach of successful companies forces them to collect all data that might be remotely useful for making business decisions, and they need mechanisms to analyze that data in a timely manner. Here’s where OLAP database management systems (DBMS) come in.

In a business sense, OLAP allows companies to continuously plan, analyze, and report operational activities, thus maximizing efficiency, reducing expenses, and ultimately winning market share. It can be done either in an in-house system or outsourced to SaaS providers like web/mobile analytics services, CRM services, etc. OLAP is the technology behind many Business Intelligence (BI) applications.

ClickHouse is an OLAP database management system that is pretty often used as a backend for those SaaS solutions for analyzing domain-specific data. However, some businesses are still reluctant to share their data with third-party providers and an in-house data warehouse scenario is also viable.

OLAP from the Technical Perspective

All database management systems can be classified into two groups: OLAP (Online Analytical Processing) and OLTP (Online Transactional Processing). The former focuses on building reports, each based on large volumes of historical data, but does so relatively infrequently. The latter usually handles a continuous stream of transactions, constantly modifying the current state of the data.

In practice, OLAP and OLTP are not strict categories; it’s more like a spectrum. Most real systems focus on one of them but provide some solutions or workarounds if the opposite kind of workload is also desired. This situation often forces businesses to operate multiple integrated storage systems, which might not be a big deal, but having more systems makes them more expensive to maintain. So the trend of recent years is HTAP (Hybrid Transactional/Analytical Processing), where both kinds of workload are handled equally well by a single database management system.

Even if a DBMS started as pure OLAP or pure OLTP, it is forced to move in the HTAP direction to keep up with its competition. ClickHouse is no exception: it was initially designed as a fast-as-possible OLAP system and it still does not have full-fledged transaction support, but some features like consistent reads/writes and mutations for updating/deleting data had to be added.

The fundamental trade-off between OLAP and OLTP systems remains:

  • To build analytical reports efficiently it’s crucial to be able to read columns separately, thus most OLAP databases are columnar,
  • While storing columns separately increases costs of operations on rows, like append or in-place modification, proportionally to the number of columns (which can be huge if the systems try to collect all details of an event just in case). Thus, most OLTP systems store data arranged by rows.

You can use clickhouse-local to convert files between any of the input and output formats that ClickHouse supports (which is over 70 different formats!). In this article, we convert a Parquet file in S3 into a CSV file and a JSON file.

Let's start at the beginning. ClickHouse has a collection of table functions that read from files, databases and other resources and convert the data to a table. To demonstrate, suppose we have a Parquet file in S3. We will use the s3 table function to read it (ClickHouse knows it's a Parquet file based on the filename).

But first, let's download the clickhouse binary:

curl https://clickhouse.com/ | sh

Accessing the data using a table function

Let's verify we can read the file by using DESCRIBE on the resulting table that the s3 table function creates:

./clickhouse local -q "DESCRIBE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/house_parquet/house_0.parquet')"

This particular file contains home prices of properties sold in the United Kingdom. The response looks like:

price   Nullable(Int64)
date    Nullable(UInt16)
postcode1   Nullable(String)
postcode2   Nullable(String)
type    Nullable(String)
is_new  Nullable(UInt8)
duration    Nullable(String)
addr1   Nullable(String)
addr2   Nullable(String)
street  Nullable(String)
locality    Nullable(String)
town    Nullable(String)
district    Nullable(String)
county  Nullable(String)

You can run any query you want on the data. For example, let's see which towns have the highest average price of homes:

./clickhouse local -q "SELECT
   town,
   avg(price) AS avg_price
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/house_parquet/house_0.parquet')
GROUP BY town
ORDER BY avg_price DESC
LIMIT 10"

The response looks like:

GATWICK 16818750
CHALFONT ST GILES   938090.0985915493
VIRGINIA WATER  789301.1320224719
COBHAM  699874.7111622555
BEACONSFIELD    677247.5483146068
ESHER   616004.6888297872
KESTON  607585.8597560975
GERRARDS CROSS  566330.2959086584
ASCOT   551491.2975753123
WEYBRIDGE   548974.828692494

Convert the Parquet file to a CSV

You can send the result of any SQL query to a file. Let's grab all the columns from our Parquet file in S3 and send the output to a new CSV file. Because the output file ends in .csv, ClickHouse knows to use the CSV output format:

./clickhouse local -q "SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/house_parquet/house_0.parquet')
INTO OUTFILE 'house_prices.csv'"

Let's verify it worked:

$ tail house_prices.csv
70000,10508,"YO8","9XN","detached",0,"freehold","7","","POPPY CLOSE","SELBY","SELBY","SELBY","NORTH YORKSHIRE"
130000,14274,"YO8","9XP","detached",0,"freehold","10","","HEATHER CLOSE","","SELBY","SELBY","NORTH YORKSHIRE"
150000,18180,"YO8","9XP","detached",0,"freehold","11","","HEATHER CLOSE","","SELBY","SELBY","NORTH YORKSHIRE"
157000,18088,"YO8","9XP","detached",0,"freehold","12","","HEATHER CLOSE","","SELBY","SELBY","NORTH YORKSHIRE"
134000,17333,"YO8","9XP","semi-detached",0,"freehold","16","","HEATHER CLOSE","","SELBY","SELBY","NORTH YORKSHIRE"
250000,13405,"YO8","9YA","detached",0,"freehold","6","","YORKDALE COURT","HAMBLETON","SELBY","SELBY","NORTH YORKSHIRE"
59500,11166,"YO8","9YB","semi-detached",0,"freehold","4","","YORKDALE DRIVE","HAMBLETON","SELBY","SELBY","NORTH YORKSHIRE"
142500,17648,"YO8","9YB","semi-detached",0,"freehold","4A","","YORKDALE DRIVE","HAMBLETON","SELBY","SELBY","NORTH YORKSHIRE"
230000,15125,"YO8","9YD","detached",0,"freehold","1","","ONE ACRE GARTH","HAMBLETON","SELBY","SELBY","NORTH YORKSHIRE"
250000,15950,"YO8","9YD","detached",0,"freehold","3","","ONE ACRE GARTH","HAMBLETON","SELBY","SELBY","NORTH YORKSHIRE"

Convert the Parquet file to a JSON

To convert the Parquet file to JSON, simply change the extension on the output filename:

./clickhouse local -q "SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/house_parquet/house_0.parquet')
INTO OUTFILE 'house_prices.ndjson'"

Let's verify it worked:

$ tail house_prices.ndjson
{"price":"70000","date":10508,"postcode1":"YO8","postcode2":"9XN","type":"detached","is_new":0,"duration":"freehold","addr1":"7","addr2":"","street":"POPPY CLOSE","locality":"SELBY","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"130000","date":14274,"postcode1":"YO8","postcode2":"9XP","type":"detached","is_new":0,"duration":"freehold","addr1":"10","addr2":"","street":"HEATHER CLOSE","locality":"","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"150000","date":18180,"postcode1":"YO8","postcode2":"9XP","type":"detached","is_new":0,"duration":"freehold","addr1":"11","addr2":"","street":"HEATHER CLOSE","locality":"","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"157000","date":18088,"postcode1":"YO8","postcode2":"9XP","type":"detached","is_new":0,"duration":"freehold","addr1":"12","addr2":"","street":"HEATHER CLOSE","locality":"","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"134000","date":17333,"postcode1":"YO8","postcode2":"9XP","type":"semi-detached","is_new":0,"duration":"freehold","addr1":"16","addr2":"","street":"HEATHER CLOSE","locality":"","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"250000","date":13405,"postcode1":"YO8","postcode2":"9YA","type":"detached","is_new":0,"duration":"freehold","addr1":"6","addr2":"","street":"YORKDALE COURT","locality":"HAMBLETON","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"59500","date":11166,"postcode1":"YO8","postcode2":"9YB","type":"semi-detached","is_new":0,"duration":"freehold","addr1":"4","addr2":"","street":"YORKDALE DRIVE","locality":"HAMBLETON","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"142500","date":17648,"postcode1":"YO8","postcode2":"9YB","type":"semi-detached","is_new":0,"duration":"freehold","addr1":"4A","addr2":"","street":"YORKDALE DRIVE","locality":"HAMBLETON","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"230000","date":15125,"postcode1":"YO8","postcode2":"9YD","type":"detached","is_new":0,"duration":"freehold","addr1":"1","addr2":"","street":"ONE ACRE GARTH","locality":"HAMBLETON","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}
{"price":"250000","date":15950,"postcode1":"YO8","postcode2":"9YD","type":"detached","is_new":0,"duration":"freehold","addr1":"3","addr2":"","street":"ONE ACRE GARTH","locality":"HAMBLETON","town":"SELBY","district":"SELBY","county":"NORTH YORKSHIRE"}

Convert CSV to Parquet

It works both ways - we can easily read in the new CSV file and output it into a Parquet file. The local file house_prices.csv can be read in ClickHouse using the file table function, and ClickHouse outputs the file in Parquet format based on the filename ending in .parquet (or we could have added the FORMAT Parquet clause):

./clickhouse local -q "SELECT *
FROM file('house_prices.csv')
INTO OUTFILE 'house_prices.parquet'"

As we mentioned above, you can use any of the ClickHouse input and output formats along with clickhouse local to easily convert files into different formats.
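
One detail to watch in the round trip above: the plain CSV format writes no header row, so column names are not stored in house_prices.csv. A possible variation, sketched here under assumptions, uses the CSVWithNames format in both directions so that the names survive. The output file names house_prices_named.csv and house_prices_named.parquet, and the CLICKHOUSE variable, are illustrative:

```bash
# Path to the binary; set CLICKHOUSE=... to override (an assumed convention).
CLICKHOUSE="${CLICKHOUSE:-./clickhouse}"
S3_URL='https://datasets-documentation.s3.eu-west-3.amazonaws.com/house_parquet/house_0.parquet'

# Step 1: Parquet -> CSV with a header row carrying the column names.
TO_CSV="SELECT * FROM s3('$S3_URL')
INTO OUTFILE 'house_prices_named.csv'
FORMAT CSVWithNames"

# Step 2: CSV with header -> Parquet; the format is passed to file() explicitly
# so the first row is read as column names rather than data.
TO_PARQUET="SELECT * FROM file('house_prices_named.csv', 'CSVWithNames')
INTO OUTFILE 'house_prices_named.parquet'
FORMAT Parquet"

if [ -x "$CLICKHOUSE" ]; then
  "$CLICKHOUSE" local -q "$TO_CSV" && "$CLICKHOUSE" local -q "$TO_PARQUET"
else
  echo "clickhouse binary not found at $CLICKHOUSE" >&2
fi
```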

This one is easy with clickhouse-local:

  • Use the postgresql table function to read the data
  • Use the INTO OUTFILE _filename_ FORMAT clause and specify the desired output format

The output format can be any of the supported output formats in ClickHouse. Let's look at a few examples...

These examples use clickhouse-local, which is a part of the ClickHouse binary. Download it using the following:

curl https://clickhouse.com/ | sh

Export PostgreSQL to Parquet

The postgresql table function allows SELECT (and INSERT) queries to be performed on data that is stored on a remote PostgreSQL server. For example, to view the entire contents of a table in PostgreSQL:

SELECT *
FROM
   postgresql(
    'localhost:5432',
    'postgres_database',
    'postgres_table',
    'user',
    'password'
);

We can pipe the output of this query to a file using INTO OUTFILE. Use FORMAT to specify the format of the file to be created. Let's grab the entire contents of the PostgreSQL table, and send its contents to a Parquet file:

./clickhouse local -q "SELECT * FROM
   postgresql(
    'localhost:5432',
    'postgres_database',
    'postgres_table',
    'user',
    'password'
)
INTO OUTFILE 'my_output_file.parquet'"
Note: Because the name of the output file has a .parquet extension, ClickHouse assumes we want the Parquet format, which is why we omitted the FORMAT Parquet clause.

Export PostgreSQL to CSV

It's the same as for Parquet, except we specify a more appropriate filename for the output:

./clickhouse local -q "SELECT * FROM
   postgresql(
    'localhost:5432',
    'postgres_database',
    'postgres_table',
    'user',
    'password'
)
INTO OUTFILE 'my_output_file.csv'"

That's it! ClickHouse sees the .csv extension on the output file name and outputs the data as comma-separated. Otherwise, it's the exact same command as above.

Export PostgreSQL to JSON

To go from PostgreSQL to JSON, we just change the filename and ClickHouse will figure out the format:

./clickhouse local -q "SELECT * FROM
   postgresql(
    'localhost:5432',
    'postgres_database',
    'postgres_table',
    'user',
    'password'
)
INTO OUTFILE 'my_output_file.ndjson'"
Note: You don't have to stop here - you can use clickhouse-local to pull data from PostgreSQL and send it to all types of output formats.

If ClickHouse cannot determine the output type from the filename extension, or if you want to choose a specific format, add the FORMAT clause:

```bash
./clickhouse local -q "SELECT * FROM
   postgresql(
    'localhost:5432',
    'postgres_database',
    'postgres_table',
    'user',
    'password'
)
INTO OUTFILE 'my_output_file.ndjson'
FORMAT JSONEachRow"
```

Stream PostgreSQL to another process

Instead of using INTO OUTFILE, you can stream the results of a table function to another process. Here's a simple example to demonstrate the syntax - we count the number of rows using the Linux wc -l command:

./clickhouse local -q "SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/house_parquet/house_0.parquet')
FORMAT JSONEachRow" | wc -l

You could just as easily stream the rows to a shell script, a Python script, or any other process.
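
As a sketch of what such a consumer could look like, the shell function below reads one JSONEachRow line at a time from standard input. The function name count_rows and the two sample rows are made up for illustration; the printf call merely stands in for the ClickHouse stream so the snippet runs without network access:

```bash
# Read rows from stdin one line at a time. With FORMAT JSONEachRow each
# line is one complete JSON object, so line-based tools work naturally.
count_rows() {
  n=0
  while IFS= read -r row; do
    n=$((n + 1))
    # A real consumer would process "$row" here.
  done
  echo "rows: $n"
}

# Stand-in for the ClickHouse output: two JSONEachRow-style lines.
printf '%s\n' '{"town":"SELBY"}' '{"town":"ESHER"}' | count_rows
# prints "rows: 2"
```

To consume the real stream, replace the printf line with the clickhouse local command from this section and pipe its output into count_rows in the same way.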

First of all, let’s discuss why people ask this question in the first place. There are two key reasons:

  1. ClickHouse is developed at a pretty high velocity, and there are usually 10+ stable releases per year. That leaves a wide range of releases to choose from, and the choice is not trivial.
  2. Some users want to avoid spending time figuring out which version works best for their use case and just follow someone else’s advice.

The second reason is more fundamental, so we’ll start with that one and then get back to navigating through various ClickHouse releases.

Which ClickHouse Version Do You Recommend?

It’s tempting to hire consultants or trust some known experts to get rid of responsibility for your production environment. You install some specific ClickHouse version that someone else recommended; if there’s some issue with it - it’s not your fault, it’s someone else’s. This line of reasoning is a big trap. No external person knows better than you what’s going on in your company’s production environment.

So how do you properly choose which ClickHouse version to upgrade to? Or how do you choose your first ClickHouse version? First of all, you need to invest in setting up a realistic pre-production environment. In an ideal world, it could be a completely identical shadow copy, but that’s usually expensive.

Here are some key points to get reasonable fidelity in a pre-production environment with not-so-high costs:

  • The pre-production environment needs to run a set of queries as close as possible to what you intend to run in production:
    • Don’t make it read-only with some frozen data.
    • Don’t make it write-only with just copying data without building some typical reports.
    • Don’t wipe it clean instead of applying schema migrations.
  • Use a sample of real production data and queries. Try to choose a sample that’s still representative and makes SELECT queries return reasonable results. Use obfuscation if your data is sensitive and internal policies do not allow it to leave the production environment.
  • Make sure that pre-production is covered by your monitoring and alerting software the same way as your production environment is.
  • If your production spans across multiple datacenters or regions, make your pre-production do the same.
  • If your production uses complex features like replication, distributed tables and cascading materialized views, make sure they are configured similarly in pre-production.
  • There’s a trade-off between using roughly the same number of servers or VMs in pre-production as in production but of smaller size, and using far fewer of them but of the same size. The first option might catch extra network-related issues, while the latter is easier to manage.

The second area to invest in is automated testing infrastructure. Don’t assume that if some kind of query has executed successfully once, it’ll continue to do so forever. It’s OK to have some unit tests where ClickHouse is mocked, but make sure your product has a reasonable set of automated tests that are run against real ClickHouse and check that all important use cases are still working as expected.

An extra step forward could be contributing those automated tests to ClickHouse’s open-source test infrastructure, which is continuously used in its day-to-day development. It will definitely take some additional time and effort to learn how to run it and then how to adapt your tests to this framework, but it’ll pay off by ensuring that ClickHouse releases are already tested against them when they are announced stable, instead of repeatedly losing time on reporting the issue after the fact and then waiting for a bugfix to be implemented, backported and released. Some companies even make contributing such tests to the infrastructure they use an internal policy (called Beyonce’s Rule at Google).

When you have your pre-production environment and testing infrastructure in place, choosing the best version is straightforward:

  1. Routinely run your automated tests against new ClickHouse releases. You can do it even for ClickHouse releases that are marked as testing, but going forward to the next steps with them is not recommended.
  2. Deploy the ClickHouse release that passed the tests to pre-production and check that all processes are running as expected.
  3. Report any issues you discovered to ClickHouse GitHub Issues.
  4. If there were no major issues, it should be safe to start deploying the ClickHouse release to your production environment. Investing in gradual release automation that implements an approach similar to canary releases or blue-green deployments might further reduce the risk of issues in production.

As you might have noticed, there’s nothing specific to ClickHouse in the approach described above - people do that for any piece of infrastructure they rely on if they take their production environment seriously.

How to Choose Between ClickHouse Releases?

If you look into the contents of the ClickHouse package repository, you’ll see two kinds of packages:

  1. stable
  2. lts (long-term support)

Here is some guidance on how to choose between them:

  • stable is the kind of package we recommend by default. They are released roughly monthly (and thus provide new features with reasonable delay), and the three latest stable releases are supported in terms of diagnostics and backporting of bugfixes.
  • lts are released twice a year and are supported for a year after their initial release. You might prefer them over stable in the following cases:
    • Your company has some internal policies that do not allow for frequent upgrades or using non-LTS software.
    • You are using ClickHouse in some secondary products that either do not require any complex ClickHouse features or do not have enough resources to keep it updated.

Many teams who initially think that lts is the way to go often switch to stable anyway because of some recent feature that’s important for their product.

Warning: One more thing to keep in mind when upgrading ClickHouse: we’re always keeping an eye on compatibility across releases, but sometimes it’s not reasonable to preserve it, and some minor details might change. So make sure you check the changelog before upgrading to see if there are any notes about backward-incompatible changes.

A client can view the server logs - even at a different level than what the server log level is configured to - by setting the send_logs_level client setting.

For example, suppose the client runs:

SET send_logs_level = 'trace';

The client will receive trace logs even if the server has log level set to info.
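
The setting can also be supplied when starting the client, since clickhouse-client accepts query settings as command-line options. A minimal sketch, assuming a ClickHouse server is reachable on localhost with default credentials and that the binary sits at ./clickhouse; the SELECT 1 query is only a placeholder:

```bash
# Path to the binary; set CLICKHOUSE=... to override (an assumed convention).
CLICKHOUSE="${CLICKHOUSE:-./clickhouse}"

# Placeholder query; any statement works.
QUERY="SELECT 1"

if [ -x "$CLICKHOUSE" ]; then
  # Server-side log lines for this query are sent back to the client
  # at trace level, alongside the query result.
  "$CLICKHOUSE" client --send_logs_level=trace -q "$QUERY"
else
  echo "clickhouse binary not found at $CLICKHOUSE" >&2
fi
```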

One useful scenario is to use send_logs_level to monitor the insertion of rows into a Distributed table:

  • Enable logs in clickhouse-client using SET send_logs_level = 'trace';
  • Run your INSERT query
  • Inserts into a Distributed table are asynchronous by default. The data is written into a local buffer on disk, then sent to remote servers in the background.
  • Logs will be sent from all nodes participating in the query processing (distributed tracing)

To check the status of distributed inserts, check the system.distribution_queue table. This table contains information about local files that are in the queue to be sent to the shards. These local files contain new parts that are created by inserting new data into the Distributed table in asynchronous mode.
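
A query along the following lines can be used to inspect that queue. This is a sketch, assuming a server on localhost with default credentials and the binary at ./clickhouse; the selection of columns is just one reasonable choice, and your version of ClickHouse may expose additional ones:

```bash
# Path to the binary; set CLICKHOUSE=... to override (an assumed convention).
CLICKHOUSE="${CLICKHOUSE:-./clickhouse}"

# One row per Distributed table directory that still has files waiting to be
# sent; error_count and last_exception help spot shards that are failing.
QUERY="SELECT
    database,
    table,
    is_blocked,
    error_count,
    data_files,
    data_compressed_bytes,
    last_exception
FROM system.distribution_queue
FORMAT Vertical"

if [ -x "$CLICKHOUSE" ]; then
  "$CLICKHOUSE" client -q "$QUERY"
else
  echo "clickhouse binary not found at $CLICKHOUSE" >&2
fi
```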

Note: Please see the blog Working with Time series data in ClickHouse for additional examples of using ClickHouse for time series analysis.

ClickHouse is a generic data storage solution for OLAP workloads, while there are many specialized time-series database management systems. Nevertheless, ClickHouse’s focus on query execution speed allows it to outperform specialized systems in many cases. There are many independent benchmarks on this topic out there, so we’re not going to conduct one here. Instead, let’s focus on ClickHouse features that are important to use if that’s your use case.

First of all, there are specialized codecs that compress typical time-series data well: either common algorithms like DoubleDelta and Gorilla, or ones specific to ClickHouse like T64.
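
To illustrate how such codecs are attached to columns, here is a minimal sketch. The table name sensor_readings and its columns are hypothetical, the pairing of codecs with columns is only one plausible choice, and the command assumes a server on localhost with the binary at ./clickhouse:

```bash
# Path to the binary; set CLICKHOUSE=... to override (an assumed convention).
CLICKHOUSE="${CLICKHOUSE:-./clickhouse}"

# Hypothetical time-series table:
#  - DoubleDelta for timestamps that grow at a steady pace
#  - T64 for integers that use only a small part of their type's range
#  - Gorilla for floating-point values that change slowly
# Each specialized codec is chained with general-purpose LZ4 compression.
DDL="CREATE TABLE IF NOT EXISTS sensor_readings
(
    ts        DateTime CODEC(DoubleDelta, LZ4),
    sensor_id UInt32   CODEC(T64, LZ4),
    value     Float64  CODEC(Gorilla, LZ4)
)
ENGINE = MergeTree
ORDER BY (sensor_id, ts)"

if [ -x "$CLICKHOUSE" ]; then
  "$CLICKHOUSE" client -q "$DDL"
else
  echo "clickhouse binary not found at $CLICKHOUSE" >&2
fi
```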

Second, time-series queries often hit only recent data, like one day or one week old. It makes sense to use servers that have both fast NVMe/SSD drives and high-capacity HDD drives. The ClickHouse TTL feature lets you keep fresh hot data on fast drives and gradually move it to slower drives as it ages. Rollup or removal of even older data is also possible if your requirements demand it.
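
A table-level TTL expressing this kind of tiering might look like the sketch below. Everything named here is an assumption: the table sensor_readings_tiered is hypothetical, the 7-day and 1-year thresholds are arbitrary, and the storage policy hot_cold with a volume called cold must already be defined in the server configuration for the statement to succeed:

```bash
# Path to the binary; set CLICKHOUSE=... to override (an assumed convention).
CLICKHOUSE="${CLICKHOUSE:-./clickhouse}"

# Rows older than 7 days are moved to the 'cold' volume (e.g. HDD);
# rows older than 1 year are deleted.
# 'hot_cold' and 'cold' are hypothetical names from the server's storage config.
DDL="CREATE TABLE IF NOT EXISTS sensor_readings_tiered
(
    ts        DateTime,
    sensor_id UInt32,
    value     Float64
)
ENGINE = MergeTree
ORDER BY (sensor_id, ts)
TTL ts + INTERVAL 7 DAY TO VOLUME 'cold',
    ts + INTERVAL 1 YEAR DELETE
SETTINGS storage_policy = 'hot_cold'"

if [ -x "$CLICKHOUSE" ]; then
  "$CLICKHOUSE" client -q "$DDL"
else
  echo "clickhouse binary not found at $CLICKHOUSE" >&2
fi
```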

Even though it’s against the ClickHouse philosophy of storing and processing raw data, you can use materialized views to meet even tighter latency or cost requirements.
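
As a rough illustration of that last point, the sketch below keeps an hourly average per sensor in a materialized view, so dashboards can read pre-aggregated rows instead of raw ones. The table and view names are hypothetical, and the commands assume a server on localhost with the binary at ./clickhouse:

```bash
# Path to the binary; set CLICKHOUSE=... to override (an assumed convention).
CLICKHOUSE="${CLICKHOUSE:-./clickhouse}"

# Hypothetical raw table that receives the inserts.
RAW_DDL="CREATE TABLE IF NOT EXISTS sensor_readings_raw
(
    ts        DateTime,
    sensor_id UInt32,
    value     Float64
)
ENGINE = MergeTree
ORDER BY (sensor_id, ts)"

# The view stores partial aggregation states (avgState) per sensor and hour;
# it is populated automatically on every insert into the raw table.
MV_DDL="CREATE MATERIALIZED VIEW IF NOT EXISTS sensor_readings_hourly
ENGINE = AggregatingMergeTree
ORDER BY (sensor_id, hour)
AS SELECT
    sensor_id,
    toStartOfHour(ts) AS hour,
    avgState(value) AS avg_value
FROM sensor_readings_raw
GROUP BY sensor_id, hour"

# Reading the rollup: avgMerge combines the stored states into final averages.
READ_QUERY="SELECT sensor_id, hour, avgMerge(avg_value) AS avg_value
FROM sensor_readings_hourly
GROUP BY sensor_id, hour
ORDER BY sensor_id, hour"

if [ -x "$CLICKHOUSE" ]; then
  "$CLICKHOUSE" client -q "$RAW_DDL" \
    && "$CLICKHOUSE" client -q "$MV_DDL" \
    && "$CLICKHOUSE" client -q "$READ_QUERY"
else
  echo "clickhouse binary not found at $CLICKHOUSE" >&2
fi
```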

