dbt-audit-helper 从字面意思是dbt 的审计帮助工具,但是实际上我们也可以使用此工具做一些数据质量相关的东西
dbt-audit-helper 提供的macro
- 比较数据输出
包含了compare_relations,compare_queries,compare_row_counts - 比较列
compare_column_values, cmpare_all_columns, compare_which_columns_differ, compare_relation_columns - 以及打印输出到日志&输出到自定义的测试中
参考使用
- compare_queries
{% set old_query %}
select
id as order_id,
amount,
customer_id
from old_database.old_schema.fct_orders
{% endset %}
{% set new_query %}
select
order_id,
amount,
customer_id
from {{ ref('fct_orders') }}
{% endset %}
{{ audit_helper.compare_queries(
a_query = old_query,
b_query = new_query,
primary_key = "order_id"
) }}
- 输出结果
{% set old_query %}
select * from old_database.old_schema.dim_product
where is_latest
{% endset %}
{% set new_query %}
select * from {{ ref('dim_product') }}
{% endset %}
{% set audit_query = audit_helper.compare_column_values(
a_query = old_query,
b_query = new_query,
primary_key = "product_id",
column_to_compare = "status"
) %}
{% set audit_results = run_query(audit_query) %}
{% if execute %}
{% do audit_results.print_table() %}
{% endif %}
- 集成测试的
{% set old_relation = ref('stg_customers') %}
{% set dbt_relation = ref('customers') %}
{{ audit_helper.compare_relations(
a_relation = old_relation,
b_relation = dbt_relation,
primary_key = "customer_id"
) }}
where percent_of_total < 100
- 测试存储其他地方
{{config(schema='test_appv2')}}
{% set old_relation = ref('stg_customers') %}
{% set dbt_relation = ref('customers') %}
{{ audit_helper.compare_relations(
a_relation = old_relation,
b_relation = dbt_relation,
primary_key = "customer_id"
)}}
where percent_of_total = 100
内部实现简单说明
只说明一个简单的compare_queries
- compare_queries 内部实现
从处理上核心是对于sql 查询进行数据集的交叉并处理,通过分析包含的相同数据,之后提供结果信息的汇总
{% macro compare_queries(a_query, b_query, primary_key=None, summarize=true, limit=None) -%}
{{ return(adapter.dispatch('compare_queries', 'audit_helper')(a_query, b_query, primary_key, summarize, limit)) }}
{%- endmacro %}
{% macro default__compare_queries(a_query, b_query, primary_key=None, summarize=true, limit=None) %}
with a as (
{{ a_query }}
),
b as (
{{ b_query }}
),
a_intersect_b as (
select * from a
{{ dbt.intersect() }}
select * from b
),
a_except_b as (
select * from a
{{ dbt.except() }}
select * from b
),
b_except_a as (
select * from b
{{ dbt.except() }}
select * from a
),
all_records as (
select
*,
true as in_a,
true as in_b
from a_intersect_b
union all
select
*,
true as in_a,
false as in_b
from a_except_b
union all
select
*,
false as in_a,
true as in_b
from b_except_a
),
{%- if summarize %}
summary_stats as (
select
in_a,
in_b,
count(*) as count
from all_records
group by 1, 2
),
final as (
select
*,
round(100.0 * count / sum(count) over (), 2) as percent_of_total
from summary_stats
order by in_a desc, in_b desc
)
{%- else %}
final as (
select * from all_records
where not (in_a and in_b)
order by {{ primary_key ~ ", " if primary_key is not none }} in_a desc, in_b desc
)
{%- endif %}
select * from final
{%- if limit and not summarize %}
limit {{ limit }}
{%- endif %}
{% endmacro %}
说明
dbt-audit-helper 是一个很不错的工具包,可以实现不错的模型check,提示数据质量以及做一些模式审计处理
参考资料
https://github.com/dbt-labs/dbt-audit-helper
标签:audit,compare,helper,relation,query,dbt,select From: https://www.cnblogs.com/rongfengliang/p/18167802