Skip to main content

· 2 min read

Question

How can I tell if a projection is used?

Answer

  1. Create a sample database
CREATE database db1;
  1. Create a sample table that will use column1 as the primary key
CREATE table db1.table1_projections
(
column1 Int32,
column2 Int32
)
engine = MergeTree()
order by column1;
  1. Add a projection for_column2 to use column2 as the primary key
ALTER table db1.table1_projections add projection for_column2
(
select *
order by column2
);
  1. Insert test data

*this inserts 100000 rows with random numbers in column1 and column2

INSERT INTO db1.table1_projections
select
floor(randNormal(50, 5)) as column1,
floor(randUniform(1, 100)) as column2
from numbers(100000);
  1. Check sample set of data
clickhouse-cloud :) SELECT * from db1.table1_projections limit 5;

SELECT *
FROM db1.table1_projections
LIMIT 5

Query id: d6940799-b507-4a5e-9843-df55ebe818ab

┌─column1─┬─column2─┐
│ 28 │ 41 │
│ 29 │ 12 │
│ 30 │ 73 │
│ 30 │ 75 │
│ 30 │ 70 │
└─────────┴─────────┘
  1. Test that it is using the original table with column1:
clickhouse-cloud :) explain indexes = 1 
SELECT count() from db1.table1_projections where column1 > 50;

EXPLAIN indexes = 1
SELECT count()
FROM db1.table1_projections
WHERE column1 > 50

Query id: e04d5236-1a05-4f1f-9502-7e41986beb44

┌─explain────────────────────────────────────────────┐
│ Expression ((Projection + Before ORDER BY)) │
│ Aggregating │
│ Expression (Before GROUP BY) │
│ Filter (WHERE) │
│ ReadFromMergeTree (db1.table1_projections) │
│ Indexes: │
│ PrimaryKey │
│ Condition: true │
│ Parts: 1/1 │
│ Granules: 12/12 │
└────────────────────────────────────────────────────┘

*notice that it is reading from db1.table1_projections

  1. Test reading from the projection by using column2 in the where clause
clickhouse-cloud :) explain indexes = 1 
SELECT * from db1.table1_projections where column2 > 50;

EXPLAIN indexes = 1
SELECT *
FROM db1.table1_projections
WHERE column2 > 50

Query id: d2b20e01-93bf-4b60-a370-4aac7b454267

┌─explain─────────────────────────────────────┐
│ Expression ((Projection + Before ORDER BY)) │
│ Filter │
│ ReadFromMergeTree (for_column2) │
│ Indexes: │
│ PrimaryKey │
│ Keys: │
│ column2 │
│ Condition: (column2 in [51, +Inf)) │
│ Parts: 1/1 │
│ Granules: 6/12 │
└─────────────────────────────────────────────┘

*notice that now the for_column2 projection is used.

For more info

Projections: https://clickhouse.com/docs/en/sql-reference/statements/alter/projection

numbers table function: https://clickhouse.com/docs/en/sql-reference/table-functions/numbers

Blog for generating random data: https://clickhouse.com/blog/generating-random-test-distribution-data-for-clickhouse

· 2 min read

Can I just run HTTP requests to ClickHouse server using requests module?

Answer

Yes, here is the sample code.

import requests
import datetime

create_replace_stmt = 'CREATE OR REPLACE TABLE test_table (name String, age UInt8) Engine=MergeTree ORDER BY tuple();'
select_query = 'SELECT count() FROM test_table'
insert_query = 'INSERT INTO test_table SELECT * FROM generateRandom(\'name String, age UInt8\',1,1) LIMIT 300000000'

CH_URL = 'https://your_clickhouse_service_fqdn:8443'
CH_USER = 'default'
CH_PASSWORD = 'secret_pwd'

headers = {}
headers["X-ClickHouse-User"] = CH_USER
headers["X-ClickHouse-Key"] = CH_PASSWORD

now = (datetime.datetime.now())
print("{} - starting...".format(now))


# create/replace table
now = (datetime.datetime.now())
print("{} - creating/replacing table...".format(now))
response = requests.post(url=CH_URL,
params={"database": "default",
"query": create_replace_stmt,
"session_id": "my-session-id-string"
},
headers=headers)

# select count()
response = requests.post(url=CH_URL,
params={"database": "default",
"query": select_query,
"session_id": "my-session-id-string"
},
headers=headers)

now = (datetime.datetime.now())
print("{} - elements in test_table before insert: {}".format(
now, response.content.decode('utf-8')))


# insert
now = (datetime.datetime.now())
print("{} - Inserting data...".format(now))
response = requests.post(url=CH_URL,
params={"database": "default",
"query": insert_query,
"session_id": "my-session-id-string",
"wait_end_of_query": 1
},
headers=headers)

now = (datetime.datetime.now())
print("{} - Done inserting data...".format(now))

response = requests.post(url=CH_URL,
params={"database": "default",
"query": select_query,
"session_id": "my-session-id-string",
},
headers=headers)

now = (datetime.datetime.now())
print("{} - elements in test_table after insert: {}".format(
now, response.content.decode('utf-8')))

Sample expected output:

(venv) ➜  venv/bin/python main.py
2023-07-07 14:54:27.336450 - starting...
2023-07-07 14:54:27.336476 - creating/replacing table...
2023-07-07 14:54:28.125270 - elements in test_table before insert: 0

2023-07-07 14:54:28.125352 - Inserting data...
2023-07-07 14:55:23.788466 - Done inserting data...
2023-07-07 14:55:23.962134 - elements in test_table after insert: 299115357

requirements.txt

requests==2.31.0

See this other KB for steps on how setup your python venv.

· One min read

How do I enforce a time limit on my queries?

Answer

You can use max_execution_time setting:

clickhouse-cloud :) SELECT 1 SETTINGS max_execution_time=0.0001

SELECT 1
SETTINGS max_execution_time = 0.0001

Query id: 3db752a7-b94f-4456-b3b9-ccbf290d1394


0 rows in set. Elapsed: 0.113 sec.

Received exception from server (version 23.5.1):
Code: 159. DB::Exception: Received from service.aws.clickhouse.cloud:9440. DB::Exception: Timeout exceeded: elapsed 0.000557862 seconds, maximum: 0.0001. (TIMEOUT_EXCEEDED)

· One min read

The short answer is "yes". However, we recommend keeping latency between all regions/datacenters in two-digit range, otherwise write performance will suffer as it goes through distributed consensus protocol. For example, replication between US coasts will likely work fine, but between the US and Europe won't.

Configuration-wise there's no difference compared to single-region replication, simply use hosts that are located in different locations for replicas.

For more information, see full article on data replication.

· 2 min read

A columnar database stores the data of each column independently. This allows reading data from disk only for those columns that are used in any given query. The cost is that operations that affect whole rows become proportionally more expensive. The synonym for a columnar database is a column-oriented database management system. ClickHouse is a typical example of such a system.

Key columnar database advantages are:

  • Queries that use only a few columns out of many.
  • Aggregating queries against large volumes of data.
  • Column-wise data compression.

Here is the illustration of the difference between traditional row-oriented systems and columnar databases when building reports:

Traditional row-oriented Traditional row-oriented

Columnar Columnar

A columnar database is the preferred choice for analytical applications because it allows having many columns in a table just in case, but to not pay the cost for unused columns on read query execution time (a traditional OLTP database reads all of the data during queries as the data is stored in rows and not columns). Column-oriented databases are designed for big data processing and data warehousing, they often natively scale using distributed clusters of low-cost hardware to increase throughput. ClickHouse does it with combination of distributed and replicated tables.

· One min read

It’s a combination of “Clickstream” and “Data wareHouse”. It comes from the original use case at Yandex.Metrica, where ClickHouse was supposed to keep records of all clicks by people from all over the Internet, and it still does the job. You can read more about this use case on ClickHouse history page.

This two-part meaning has two consequences:

  • The only correct way to write ClickHouse is with capital H.
  • If you need to abbreviate it, use CH. For some historical reasons, abbreviating as CK is also popular in China, mostly because one of the first talks about ClickHouse in Chinese used this form.
Info

Many years after ClickHouse got its name, this approach of combining two words that are meaningful on their own has been highlighted as the best way to name a database in a research by Andy Pavlo, an Associate Professor of Databases at Carnegie Mellon University. ClickHouse shared his “best database name of all time” award with Postgres.

· One min read

ClickHouse is an open-source project developed on GitHub.

As customary, contribution instructions are published in CONTRIBUTING file in the root of the source code repository.

If you want to suggest a substantial change to ClickHouse, consider opening a GitHub issue explaining what you want to do, to discuss it with maintainers and community first. Examples of such RFC issues.

If your contributions are security related, please check out our security policy too.

· One min read

This is just a short example that illustrates the use of JSONExtract functions.

Create a table:

CREATE TABLE default.json_extract_example
(
`rawJSON` String EPHEMERAL,
`a1` String DEFAULT JSONExtractString(rawJSON, 'a1'),
`a2` Boolean DEFAULT JSONExtractBool(rawJSON, 'a2'),
`a3.aa1` Float DEFAULT JSONExtractFloat(JSONExtractRaw(rawJSON, 'a3'), 'aa1'),
`a3.aa2` UInt8 DEFAULT JSONExtractUInt(JSONExtractRaw(rawJSON, 'a3'), 'aa2')
)
ENGINE = MergeTree
ORDER BY (a1, a2)

Add your JSON raw string:

INSERT INTO default.json_extract_example (rawJSON) VALUES ('{"a1": "XX", "a2": true, "a3":{"aa1":23.11,"aa2":12}}');

Query your data:

SELECT *
FROM json_extract_example
FORMAT Pretty

Yields:

┏━━━━┳━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ a1 ┃ a2 ┃ a3.aa1 ┃ a3.aa2 ┃
┡━━━━╇━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ XX │ true │ 23.11 │ 12 │
└────┴──────┴────────┴────────┘

Each stored as the original JSON type:

SELECT
toTypeName(a1),
toTypeName(a2),
toTypeName(a3.aa1),
toTypeName(a3.aa2)
FROM default.json_extract_example
FORMAT Pretty
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ toTypeName(a1) ┃ toTypeName(a2) ┃ toTypeName(a3.aa1) ┃ toTypeName(a3.aa2) ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ String │ Bool │ Float32 │ UInt8 │
└────────────────┴────────────────┴────────────────────┴────────────────────┘

· 2 min read

The short answer is “no”. The key-value workload is among top positions in the list of cases when NOT to use ClickHouse. It’s an OLAP system after all, while there are many excellent key-value storage systems out there.

However, there might be situations where it still makes sense to use ClickHouse for key-value-like queries. Usually, it’s some low-budget products where the main workload is analytical in nature and fits ClickHouse well, but there’s also some secondary process that needs a key-value pattern with not so high request throughput and without strict latency requirements. If you had an unlimited budget, you would have installed a secondary key-value database for this secondary workload, but in reality, there’s an additional cost of maintaining one more storage system (monitoring, backups, etc.) which might be desirable to avoid.

If you decide to go against recommendations and run some key-value-like queries against ClickHouse, here are some tips:

  • The key reason why point queries are expensive in ClickHouse is its sparse primary index of main MergeTree table engine family. This index can’t point to each specific row of data, instead, it points to each N-th and the system has to scan from the neighboring N-th row to the desired one, reading excessive data along the way. In a key-value scenario, it might be useful to reduce the value of N with the index_granularity setting.
  • ClickHouse keeps each column in a separate set of files, so to assemble one complete row it needs to go through each of those files. Their count increases linearly with the number of columns, so in the key-value scenario, it might be worth avoiding using many columns and put all your payload in a single String column encoded in some serialization format like JSON, Protobuf, or whatever makes sense.
  • There’s an alternative approach that uses Join table engine instead of normal MergeTree tables and joinGet function to retrieve the data. It can provide better query performance but might have some usability and reliability issues. Here’s an usage example.