An Introduction to Materialized Views, with Usage Examples in ClickHouse


Published July 4, 2023

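Preface

I have been busy with the 520 sales promotion lately and have barely had time to catch my breath, so this post sticks to something simple.

Materialized view concept

As we all know, a view in a database is a virtual table derived by querying one or more tables. It reflects changes to the data in the base tables and stores no data itself. So what is a materialized view? The description on English Wikipedia is quite accurate and is quoted here:

"In computing, a materialized view is a database object that contains the results of a query. For example, it may be a local copy of data located remotely, or may be a subset of the rows and/or columns of a table or join result, or may be a summary using an aggregate function. The process of setting up a materialized view is sometimes called materialization. This is a form of caching the results of a query, similar to memoization of the value of a function in functional languages, and it is sometimes described as a form of precomputation. As with other forms of precomputation, database users typically use materialized views for performance reasons, i.e. as a form of optimization."

A materialized view is a persisted copy of a query result set, so it is entirely different from an ordinary view and much closer to a table. "Query result set" is a broad notion: it can be a plain copy of part of a base table, the result (or a subset) of a multi-table join, aggregate metrics computed from the raw data, and so on.

Because the result is stored, a materialized view in this classic sense does not change when the base tables change, which is why it is also called a snapshot. To update it, the user has to act explicitly, for example by running SQL periodically or by using mechanisms such as triggers. The process of producing a materialized view is called materialization.

Broadly speaking, a materialized view is precomputation logic plus an explicit cache inside the database, a typical trade of space for time. Used well, it avoids repeated queries against the base tables and reuses their results, which can improve query performance significantly. It can also take advantage of table features such as indexes.

Among traditional relational databases, Oracle, PostgreSQL, and SQL Server all support materialized views. Kafka and Spark, used as stream processing engines, also support building materialized views over streams. The rest of this post looks at the materialized view feature in ClickHouse.

ClickHouse materialized view examples

We currently use ClickHouse (CK) only as a clickstream data warehouse, so the clickstream log table serves as the base table.

    CREATE TABLE IF NOT EXISTS ods.analytics_access_log
    ON CLUSTER sht_ck_cluster_1 (
      ts_date Date,
      ts_date_time DateTime,
      user_id Int64,
      event_type String,
      from_type String,
      column_type String,
      groupon_id Int64,
      site_id Int64,
      site_name String,
      main_site_id Int64,
      main_site_name String,
      merchandise_id Int64,
      merchandise_name String,
      -- A lot more
    )
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/ods/analytics_access_log','{replica}')
    PARTITION BY ts_date
    ORDER BY (ts_date,toStartOfHour(ts_date_time),main_site_id,site_id,event_type,column_type)
    TTL ts_date + INTERVAL 1 MONTH
    SETTINGS index_granularity = 8192,
    use_minimalistic_part_header_in_zookeeper = 1,
    merge_with_ttl_timeout = 86400;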
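Before moving on to the ClickHouse specifics, the snapshot semantics from the concept section can be made concrete with a small in-memory model. This is only a sketch of the general idea (a stored query result that changes only on an explicit refresh), not of how any particular database implements it. The names `base_table`, `clicks_per_merchandise`, and `SnapshotView` are hypothetical and exist only for this illustration.

```python
from collections import Counter

# Hypothetical in-memory "base table": one dict per click event.
base_table = [
    {"site_id": 10087, "merchandise_id": 1},
    {"site_id": 10087, "merchandise_id": 1},
    {"site_id": 10087, "merchandise_id": 2},
]


def clicks_per_merchandise(rows):
    """The 'view query': count clicks per merchandise_id."""
    return dict(Counter(r["merchandise_id"] for r in rows))


class SnapshotView:
    """Stores the query result; the base table is read only in refresh()."""

    def __init__(self, query, table):
        self.query = query
        self.table = table
        self.result = {}
        self.refresh()

    def refresh(self):
        # Materialization: run the query once and keep the result.
        self.result = self.query(self.table)


view = SnapshotView(clicks_per_merchandise, base_table)
snapshot_before = dict(view.result)

# A new event reaches the base table; the stored result does not change...
base_table.append({"site_id": 10087, "merchandise_id": 2})
stale = dict(view.result)

# ...until the view is refreshed explicitly (e.g. by a periodic job).
view.refresh()
fresh = dict(view.result)

print(snapshot_before, stale, fresh)
```

Reading `view.result` never touches `base_table`, which is the space-for-time trade described above: queries are cheap, at the cost of extra storage and possibly stale data between refreshes.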
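w/ SummingMergeTree

To query the hourly merchandise click counts of one site within a day, we would write the following SQL statement.

    SELECT toStartOfHour(ts_date_time) AS ts_hour,
    merchandise_id,
    count() AS pv
    FROM ods.analytics_access_log_all
    WHERE ts_date = today() AND site_id = 10087
    GROUP BY ts_hour,merchandise_id;

This is a typical aggregation query. If analysts in every region run this kind of query often (changing only the ts_date and site_id conditions), identical statements will be executed many times, and each run reads from the large detail table analytics_access_log_all, which is a clear waste of resources. By combining a CK materialized view with a suitable MergeTree engine, the data can be pre-aggregated, and reading from the materialized view is very efficient.

The materialized view below is built from the conditions of the query above. Note the syntax.

    CREATE MATERIALIZED VIEW IF NOT EXISTS test.mv_site_merchandise_visit
    ON CLUSTER sht_ck_cluster_1
    ENGINE = ReplicatedSummingMergeTree('/clickhouse/tables/{shard}/test/mv_site_merchandise_visit','{replica}')
    PARTITION BY ts_date
    ORDER BY (ts_date,ts_hour,site_id,merchandise_id)
    SETTINGS index_granularity = 8192, use_minimalistic_part_header_in_zookeeper = 1
    AS SELECT
      ts_date,
      toStartOfHour(ts_date_time) AS ts_hour,
      site_id,
      merchandise_id,
      count() AS visit
    FROM ods.analytics_access_log
    GROUP BY ts_date,ts_hour,site_id,merchandise_id;

As shown, a materialized view accepts a table engine, a partition key, a primary key (ORDER BY), and table settings, just like a table. The merchandise click count is a simple additive metric, so we chose SummingMergeTree as the engine (this is a high-availability setup, hence the replicated variant ReplicatedSummingMergeTree). The engine groups rows by the sorting key, which is also the primary key by default, and sums the numeric metric columns automatically. Whenever the table's parts are merged in the background, all rows with the same key are added up into a single row, which saves a great deal of space. Because merges happen asynchronously, queries should still aggregate with sum() rather than assume one row per key.

When creating a materialized view, the user selects the needed columns from the base table through the AS SELECT ... clause, which is very flexible. By default, a newly created materialized view contains no data. As batches of data are written to the base table, the computed results gradually fill the materialized view. To initialize it from historical data, add the POPULATE keyword before the AS SELECT clause. Note that data arriving while POPULATE is backfilling history is ignored, so use it with caution when accuracy requirements are strict.

After the CREATE MATERIALIZED VIEW statement has run, SHOW TABLES lists a table named .inner.[materialized view name]. This is the table that persists the materialized view's data, and we do not operate on it directly. (Newer ClickHouse versions that use the Atomic database engine name it .inner_id.<UUID> instead.)

    SHOW TABLES

    ┌─name─────────────────────────────┐
    │ .inner.mv_site_merchandise_visit │
    │ mv_site_merchandise_visit        │
    └──────────────────────────────────┘

The relationship between the base table, the materialized view, and the materialized view's underlying table was shown in a diagram in the original post: an insert into the base table triggers the view's SELECT over the inserted block, and the result is written to the underlying .inner table.

[Figure not preserved. Source: /blog/clickhouse-materialized-views-illuminated-part-1]

A distributed table can of course be created on top of the materialized view as well.

    CREATE TABLE IF NOT EXISTS test.mv_site_merchandise_visit_all
    ON CLUSTER sht_ck_cluster_1
    AS test.mv_site_merchandise_visit
    ENGINE = Distributed(sht_ck_cluster_1,test,mv_site_merchandise_visit,rand());

Querying a materialized view looks no different from querying an ordinary table, and what comes back is the pre-aggregated data.

    SELECT ts_hour,
    merchandise_id,
    sum(visit) AS visit_sum
    FROM test.mv_site_merchandise_visit_all
    WHERE ts_date = today() AND site_id = 10087
    GROUP BY ts_hour,merchandise_id;

w/ AggregatingMergeTree

SummingMergeTree handles only summation. What if the metrics are not purely additive? Materialized views can also be paired with the more general AggregatingMergeTree engine, which lets the user define the aggregated metrics through aggregate functions. For example, suppose we want per-minute PV and UV for each city's main site. We can create the following materialized view.

    CREATE MATERIALIZED VIEW IF NOT EXISTS dw.main_site_minute_pv_uv
    ON CLUSTER sht_ck_cluster_1
    ENGINE = ReplicatedAggregatingMergeTree('/clickhouse/tables/{shard}/dw/main_site_minute_pv_uv','{replica}')
    PARTITION BY ts_date
    ORDER BY (ts_date,ts_minute,main_site_id)
    SETTINGS index_granularity = 8192, use_minimalistic_part_header_in_zookeeper = 1
    AS SELECT
      ts_date,
      toStartOfMinute(ts_date_time) as ts_minute,
      main_site_id,
      sumState(1) as pv,
      uniqState(user_id) as uv
    FROM ods.analytics_access_log
    GROUP BY ts_date,ts_minute,main_site_id;

A materialized view built on AggregatingMergeTree actually records the intermediate state of each aggregated metric, so the "State" suffix must be appended to the original aggregate function name (such as sum or uniq). The step of creating the distributed table is omitted here.

Querying the materialized view amounts to merging those intermediate states and producing the final result, so the "Merge" suffix must be appended to the original aggregate function name. The -State and -Merge forms are defined by CK and are called aggregate function combinators.

    SELECT ts_date,
    formatDateTime(ts_minute,'%H:%M') AS hour_minute,
    sumMerge(pv) AS pv,
    uniqMerge(uv) AS uv
    FROM dw.main_site_minute_pv_uv_all
    WHERE ts_date = today() AND main_site_id = 10029
    GROUP BY ts_date,hour_minute
    ORDER BY hour_minute ASC;

We can also query the system tables to see the parts that the materialized view actually occupies.

    SELECT
      partition,
      name,
      rows,
      bytes_on_disk,
      modification_time,
      min_date,
      max_date,
      engine
    FROM system.parts
    WHERE (database = 'dw') AND (table = '.inner.main_site_minute_pv_uv')

    ┌─partition──┬─name───────────────┬─rows─┬─bytes_on_disk─┬───modification_time─┬───min_date─┬───max_date─┬─engine─────────────────────────┐
    │ 2020-05-19 │ 20200519_0_169_18  │ 9162 │       4540922 │ 2020-05-19 20:33:29 │ 2020-05-19 │ 2020-05-19 │ ReplicatedAggregatingMergeTree │
    │ 2020-05-19 │ 20200519_170_179_2 │  318 │        294479 │ 2020-05-19 20:37:18 │ 2020-05-19 │ 2020-05-19 │ ReplicatedAggregatingMergeTree │
    │ 2020-05-19 │ 20200519_170_184_3 │  449 │        441282 │ 2020-05-19 20:40:24 │ 2020-05-19 │ 2020-05-19 │ ReplicatedAggregatingMergeTree │
    │ 2020-05-19 │ 20200519_170_189_4 │  696 │        594995 │ 2020-05-19 20:47:40 │ 2020-05-19 │ 2020-05-19 │ ReplicatedAggregatingMergeTree │
    │ 2020-05-19 │ 20200519_180_180_0 │   40 │         33416 │ 2020-05-19 20:37:58 │ 2020-05-19 │ 2020-05-19 │ ReplicatedAggregatingMergeTree │
    │ 2020-05-19 │ 20200519_181_181_0 │   70 │         34200 │ 2020-05-19 20:38:44 │ 2020-05-19 │ 2020-05-19 │ ReplicatedAggregatingMergeTree │
    │ 2020-05-19 │ 20200519_182_182_0 │   83 │         35981 │ 2020-05-19 20:39:32 │ 2020-05-19 │ 2020-05-19 │ ReplicatedAggregatingMergeTree │
    │ 2020-05-19 │ 20200519_183_183_0 │   77 │         35786 │ 2020-05-19 20:39:32 │ 2020-05-19 │ 2020-05-19 │ ReplicatedAggregatingMergeTree │
    │ 2020-05-19 │ 20200519_184_184_0 │   81 │         35766 │ 2020-05-19 20:40:19 │ 2020-05-19 │ 2020-05-19 │ ReplicatedAggregatingMergeTree │
    │ 2020-05-19 │ 20200519_185_185_0 │   42 │         32859 │ 2020-05-19 20:41:54 │ 2020-05-19 │ 2020-05-19 │ ReplicatedAggregatingMergeTree │
    │ 2020-05-19 │ 20200519_186_186_0 │   83 │         35750 │ 2020-05-19 20:43:30 │ 2020-05-19 │ 2020-05-19 │ ReplicatedAggregatingMergeTree │
    │ 2020-05-19 │ 20200519_187_187_0 │   79 │         34272 │ 2020-05-19 20:46:45 │ 2020-05-19 │ 2020-05-19 │ ReplicatedAggregatingMergeTree │
    │ 2020-05-19 │ 20200519_188_188_0 │   75 │         33917 │ 2020-05-19 20:46:45 │ 2020-05-19 │ 2020-05-19 │ ReplicatedAggregatingMergeTree │
    │ 2020-05-19 │ 20200519_189_189_0 │   81 │         35712 │ 2020-05-19 20:47:35 │ 2020-05-19 │ 2020-05-19 │ ReplicatedAggregatingMergeTree │
    └────────────┴────────────────────┴──────┴───────────────┴─────────────────────┴────────────┴────────────┴────────────────────────────────┘

The End

Back to work now. Good night, everyone.

Postscript:

If the table data is not append-only and instead sees fairly frequent deletes or updates (for example, a table fed by a changelog), the materialized view's underlying engine should be switched to CollapsingMergeTree or VersionedCollapsingMergeTree.

If the materialized view is produced by joining two tables, it is updated only when data is inserted into the left table. If data is inserted only into the right table, the view is not updated.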
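The join behavior described in the postscript follows from the fact that a ClickHouse materialized view acts like an insert trigger on its source (left) table. The sketch below models that with plain Python containers. The tables `orders` and `sites` and the two insert functions are hypothetical names made up for this illustration, and the model assumes an inner join.

```python
# Hypothetical tables: `orders` is the left (triggering) table, `sites` the right one.
sites = {1: "Shanghai"}   # right table: site_id -> site_name
orders = []               # left table: (order_id, site_id) rows
mv_target = []            # rows stored by the materialized view


def insert_into_orders(block):
    """Insert into the left table: the view's SELECT runs over this block only."""
    orders.extend(block)
    for order_id, site_id in block:
        if site_id in sites:  # inner join against the right table as it is right now
            mv_target.append((order_id, site_id, sites[site_id]))


def insert_into_sites(block):
    """Insert into the right table: nothing is triggered for the view."""
    sites.update(block)


insert_into_orders([(100, 1), (101, 2)])  # site 2 is unknown at this moment
insert_into_sites({2: "Beijing"})         # arrives later; the view is not updated
insert_into_orders([(102, 2)])            # a new left-side row sees the new site

print(mv_target)
```

Order 101 never reaches `mv_target`, even though its site shows up later. In a real deployment, rows like this have to be handled separately if the right table can lag behind the left one.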
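Looking back at the AggregatingMergeTree section, the reason for storing states rather than final numbers can be shown with a simplified model. In this sketch a running count stands in for `sumState(1)` and an exact Python set stands in for `uniqState(user_id)`. Real ClickHouse states are opaque binary values and `uniq` is approximate, so this illustrates the idea only. The helper names `build_part`, `merge_states`, and `finalize` are hypothetical.

```python
from collections import defaultdict


def build_part(events):
    """Aggregate one inserted block into per-key partial states (the -State step)."""
    part = defaultdict(lambda: {"pv": 0, "uv": set()})
    for minute, site_id, user_id in events:
        state = part[(minute, site_id)]
        state["pv"] += 1           # stands in for sumState(1)
        state["uv"].add(user_id)   # stands in for uniqState(user_id)
    return dict(part)


def merge_states(parts):
    """Combine partial states that share a key (a background merge, or -Merge)."""
    merged = defaultdict(lambda: {"pv": 0, "uv": set()})
    for part in parts:
        for key, state in part.items():
            merged[key]["pv"] += state["pv"]
            merged[key]["uv"] |= state["uv"]
    return dict(merged)


def finalize(merged):
    """Turn merged states into final numbers, as sumMerge/uniqMerge do at query time."""
    return {key: (s["pv"], len(s["uv"])) for key, s in merged.items()}


# Two insert batches for the same minute and site; user 7 appears in both.
part_a = build_part([("20:33", 10029, 7), ("20:33", 10029, 8)])
part_b = build_part([("20:33", 10029, 7), ("20:33", 10029, 9), ("20:34", 10029, 7)])

result = finalize(merge_states([part_a, part_b]))
print(result)
```

Each batch on its own has two distinct users for 20:33, and adding those final counts would give 4. Merging the states gives the correct 3, which is why UV cannot be maintained with a plain SummingMergeTree column.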

Published by: admin. Please credit the source when reposting: http://www.yc00.com/web/1688435425a137435.html
