
ZSTD Notes

Date: 2023-02-24 20:33:06


Testing compression ratios with different dictionary sizes

Sample set size: 102 MB (107,155,190 bytes); sample count: 173,842

Compression ratio without a dictionary

zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress request/request/* --output-dir-flat req-c

Even with the multi-threading options set, CPU usage is still only about 12.4%! I/O shows as zero.
The memory-limit option does not increase usage either: memory stays flat at 16.8 MB.
Reading the samples took about 20 minutes.
With the progress display on, CPU actually dropped to about 4%, memory rose to 33.8 MB, and I/O was about 1.2 MB.
Started at 10:16:07, finished at 10:43:46; total compression time 00:27:39.
173842 files compressed : 60.55% ( 102 MiB => 61.9 MiB)

Trying training with ZSTD's minimum dictionary size, 256 bytes

zstd --verbose --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress --train --train-cover --maxdict=256 request/request/* -o req.256.dic

CPU usage only about 12.4%, very little I/O
Start time: 11:31:43
! Warning : setting manual memory limit for dictionary training data at 0 MB
Training samples set too large (102 MB); training on 0 MB only...
Trying 82 different sets of parameters
d=6
Total number of training samples is 1 and is invalid. Failed to initialize context
dictionary training failed : Src size is incorrect

zstd --verbose --ultra -22 -T0 --auto-threads=logical --trace noc.log --progress --train --train-cover --maxdict=256 request/request/* -o req.256.dic

CPU usage only about 12.4%, very little I/O
Start time: 11:56:32
End time: 14:07:38
Training time: 02:11:06
Source-to-dictionary size ratio: 107155190/256 = 418,574×
k=146
d=6
steps=40
split=100

zstd --verbose --ultra -22 -T0 --auto-threads=logical --progress -D req.256.dic --output-dir-flat req-c-256 request/request/*

173842 files compressed : 49.99% ( 102 MiB => 51.1 MiB)
Start time: 14:24:46
End time: 15:02:44
Compression time: 00:37:58

zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress req.256.dic -o req.256.c.dic

Compressed dictionary size = 101.56% ( 256 B => 260 B, req.256.c.dic)
Savings efficiency = (100-49.99)/256*1024 = 200.04, i.e. each KB of dictionary buys a ~200-point drop in the compressed-size ratio
Improvement over no dictionary = 60.55-49.99 = 10.56 points
Efficiency of the improvement = (60.55-49.99)/256*1024 = 42.24, i.e. ~42.24 points of improvement per KB of dictionary
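The per-KB efficiency figures used throughout these notes follow one formula; a small Python helper (function names are mine, the arithmetic mirrors the notes) makes it explicit:

```python
def savings_per_kb(ratio_percent: float, dict_bytes: int) -> float:
    """Percentage points of compressed-size ratio saved per KB of dictionary.

    ratio_percent: compressed size as a percentage of the original (lower is better).
    dict_bytes:    dictionary size in bytes.
    """
    return (100 - ratio_percent) / dict_bytes * 1024

def gain_per_kb(baseline_percent: float, ratio_percent: float, dict_bytes: int) -> float:
    """Improvement over a no-dictionary baseline, normalized per KB of dictionary."""
    return (baseline_percent - ratio_percent) / dict_bytes * 1024

# The 256 B dictionary: 49.99% with the dictionary vs. 60.55% without.
print(round(savings_per_kb(49.99, 256), 2))       # 200.04
print(round(gain_per_kb(60.55, 49.99, 256), 2))   # 42.24
```

For the 256 B dictionary this reproduces the 200.04 and 42.24 figures above, and the same two functions reproduce every later "per KB" number in these notes.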

Average sample size: 107155190/173842 ≈ 616 bytes

zstd --verbose --train --train-cover --maxdict=616 request/request/* -o req.616.dic

CPU usage only about 12.4%, very little I/O
Start time: 18:51:07
First training-progress message: 19:12:56 (reading took ≈22 minutes)
Training time = 03:56:51
Total training time = 04:18:40
Source-to-dictionary size ratio: ≈173,842×
k=242
d=6
steps=40
split=100

zstd -D req.616.dic --ultra -22 --progress request/request/* --output-dir-flat req-c-616

173842 files compressed : 37.78% ( 102 MiB => 38.6 MiB)

zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress req.616.dic -o req.616.c.dic

Compressed dictionary size = 89.12% ( 616 B => 549 B, req.616.c.dic)
Savings efficiency = (100-37.78)/549*1024 = 116.053, i.e. ~116 points of ratio drop per KB
Improvement over no dictionary = 60.55-37.78 = 22.77 points
Efficiency of the improvement = (60.55-37.78)/549*1024 = 42.471, i.e. ~42 points per KB

Improvement over the 256 B dictionary = 49.99-37.78 = 12.21 points
Marginal efficiency vs. the 256 B dictionary = (49.99-37.78)/(549-256)*1024 = 42.672, i.e. ~42 points per KB

Setting the dictionary size to 10× the average sample size: 6166 bytes = 6.02 KB

zstd --verbose --train --train-cover --maxdict=6166 request/request/* -o req.6166.dic

CPU usage only about 12.4%, very little I/O
Start time: 19:10:41
First training-progress message: 19:33:20 (reading took ≈22 minutes)
Total training time = 03:59:06
Source-to-dictionary size ratio: 107155190/6166 = 17,378×
k=1250
d=8
steps=40
split=100

zstd -D req.6166.dic --ultra -22 --progress request/request/* --output-dir-flat req-c-6166

173842 files compressed : 21.22% ( 102 MiB => 21.7 MiB)

zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress req.6166.dic -o req.6166.c.dic

Compressed dictionary size = 43.01% ( 6.02 KiB => 2.59 KiB, req.6166.c.dic)
Savings efficiency = (100-21.22)/2652*1024 = 30.419, i.e. ~30 points of ratio drop per KB
Dictionary growth vs. the average-size dictionary = 2652/549 = 4.831×
Efficiency relative to the average-size dictionary = 30.419/116.053 = 26.2%, i.e. a 73.8% efficiency drop
Improvement over no dictionary = 60.55-21.22 = 39.33 points
Efficiency of the improvement = (60.55-21.22)/2652*1024 = 15.186, i.e. ~15 points per KB

Improvement over the 256 B dictionary = 49.99-21.22 = 28.77 points
Marginal efficiency vs. the 256 B dictionary = (49.99-21.22)/(2652-256)*1024 = 12.296, i.e. ~12 points per KB

Manually deleting the visible strings from the end of the dictionary, then retrying compression

Original uncompressed dictionary: 6166 bytes; compressed: 2652 bytes (after compression no meaningful visible strings remain)
After deleting the visible strings: 151 bytes (a hex editor makes the deletion precise)
Compression does not fail right at startup, but eventually aborts with a memory error:
zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress -D req.6166.YeThin.dic --output-dir-flat req-c-6166.YeThin ../request/request/*

Compression start time: 12:11:13
zstd: error 11 : Allocation error : not enough memory

zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log --progress -D req.6166.YeThin.dic --output-dir-flat req-c-6166.YeThin ../request/request/*

Dropping the -M1024 option still fails:
zstd: error 11 : Allocation error : not enough memory

zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress req.6166.YeThin.dic -o req.6166.YeThin.c.dic

Compressed dictionary size = 108.61% ( 151 B => 164 B, req.6166.YeThin.c.dic)
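For what it's worth, the hex-edited file may no longer parse as a formatted dictionary at all: per RFC 8878, a trained zstd dictionary starts with the little-endian magic number 0xEC30A437, followed by the dictionary ID and the entropy tables. A stdlib sketch (function name is mine) to check whether a file still carries that framing:

```python
import struct

ZSTD_DICT_MAGIC = 0xEC30A437  # magic number of a formatted zstd dictionary (RFC 8878)

def looks_like_zstd_dictionary(data: bytes) -> bool:
    """True if the buffer begins with the zstd dictionary magic number."""
    if len(data) < 8:  # magic (4 bytes) + dictionary ID (4 bytes)
        return False
    magic, = struct.unpack_from("<I", data, 0)
    return magic == ZSTD_DICT_MAGIC

# A synthetic header with the right magic passes; arbitrary bytes do not.
print(looks_like_zstd_dictionary(struct.pack("<II", ZSTD_DICT_MAGIC, 1)))  # True
print(looks_like_zstd_dictionary(b"hello world"))                          # False
```

This only checks the header, not the entropy tables, so it cannot prove a dictionary is valid; it can only show quickly that an edited file is not.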

Setting the dictionary size to 100× the average sample size: 61666 bytes = 60.22 KB

zstd --verbose --train --train-cover --maxdict=61666 request/request/* -o req.61666.dic

CPU usage only about 12.4%, very little I/O
Start time: 19:35:56
First training-progress message: 19:57:20 (reading took ≈22 minutes)
Total training time = 03:44:13
Source-to-dictionary size ratio: 107155190/61666 = 1,737×
k=1970
d=6
steps=40
split=100

zstd -D req.61666.dic --ultra -22 --progress request/request/* --output-dir-flat req-c-61666

173842 files compressed : 18.49% ( 102 MiB => 18.9 MiB)

zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress req.61666.dic -o req.61666.c.dic

Compressed dictionary size = 24.78% ( 60.2 KiB => 14.9 KiB, req.61666.c.dic)
Savings efficiency = (100-18.49)/15278*1024 = 5.463, i.e. ~5 points of ratio drop per KB
Dictionary growth vs. the average-size dictionary = 15278/549 = 27.829×
Efficiency relative to the average-size dictionary = 5.463/116.053 = 4.7%, i.e. a 95.3% efficiency drop
Improvement over no dictionary = 60.55-18.49 = 42.06 points
Efficiency of the improvement = (60.55-18.49)/15278*1024 = 2.819, i.e. ~2.8 points per KB

Improvement over the 256 B dictionary = 49.99-18.49 = 31.5 points
Marginal efficiency vs. the 256 B dictionary = (49.99-18.49)/(15278-256)*1024 = 2.147, i.e. ~2 points per KB

Setting the dictionary size to 1,000× the average sample size: 616000 bytes ≈ 601.6 KiB

zstd --verbose --train --train-cover --maxdict=616000 request/request/* -o req.616000.dic

CPU usage only about 12.4%, very little I/O
Start time: 18:59:18
First training-progress message: 19:21:37 (reading took ≈22 minutes)
Total training time = 03:57:46
Source-to-dictionary size ratio: ≈173×
k=1778
d=8
steps=40
split=100

zstd -D req.616000.dic --ultra -22 --progress request/request/* --output-dir-flat req-c-616000

173842 files compressed : 16.00% ( 102 MiB => 16.4 MiB)

zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress req.616000.dic -o req.616000.c.dic

Compressed dictionary size = 19.99% ( 602 KiB => 120 KiB, req.616000.c.dic)
Savings efficiency = (100-16.00)/123151*1024 = 0.698, i.e. ~0.7 points of ratio drop per KB
Dictionary growth vs. the average-size dictionary = 123151/549 = 224.319×
Efficiency relative to the average-size dictionary = 0.698/116.053 = 0.6%, i.e. a 99.4% efficiency drop
Improvement over no dictionary = 60.55-16.00 = 44.55 points
Efficiency of the improvement = (60.55-16.00)/123151*1024 = 0.37, i.e. ~0.37 points per KB

Improvement over the 256 B dictionary = 49.99-16.00 = 33.99 points
Marginal efficiency vs. the 256 B dictionary = (49.99-16.00)/(123151-256)*1024 = 0.283, i.e. ~0.283 points per KB

Setting the dictionary size to 10,000× the average sample size: 6160000 bytes = 5.87 MB

zstd --verbose --train --train-cover --maxdict=6160000 request/request/* -o req.6160000.dic

CPU usage only about 12.4%, very little I/O
Start time: 18:57:15
First training-progress message: 19:19:35 (reading took ≈22 minutes)
Total training time = 03:58:28
Source-to-dictionary size ratio: ≈17.3×
k=1922
d=8
steps=40
split=100

zstd -D req.6160000.dic --ultra -22 --progress request/request/* --output-dir-flat req-c-6160000

173842 files compressed : 10.64% ( 102 MiB => 10.9 MiB)

zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress req.6160000.dic -o req.6160000.c.dic

Compressed dictionary size = 15.15% ( 5.87 MiB => 912 KiB, req.6160000.c.dic)
Savings efficiency = (100-10.64)/933457*1024 = 0.098, i.e. ~0.098 points of ratio drop per KB
Dictionary growth vs. the average-size dictionary = 933457/549 = 1700.286×
Efficiency relative to the average-size dictionary = 0.098/116.053 = 0.1%, i.e. a 99.9% efficiency drop
Improvement over no dictionary = 60.55-10.64 = 49.91 points
Efficiency of the improvement = (60.55-10.64)/933457*1024 = 0.055, i.e. ~0.055 points per KB

Improvement over the 256 B dictionary = 49.99-10.64 = 39.35 points
Marginal efficiency vs. the 256 B dictionary = (49.99-10.64)/(933457-256)*1024 = 0.043, i.e. ~0.043 points per KB
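Putting the runs above side by side (using the compressed dictionary sizes, since that is what would actually be stored; the notes sometimes divided by the raw size instead) shows the diminishing returns directly. This just reorganizes numbers already recorded above:

```python
# (compressed dictionary size in bytes, overall compression ratio %) per run;
# 60.55% is the no-dictionary baseline measured above.
baseline = 60.55
runs = [(260, 49.99), (549, 37.78), (2652, 21.22),
        (15278, 18.49), (123151, 16.00), (933457, 10.64)]

rows = []
for dict_bytes, ratio in runs:
    gain = baseline - ratio               # percentage points saved vs. no dictionary
    per_kb = gain / dict_bytes * 1024     # marginal value of each KB of dictionary
    rows.append(per_kb)
    print(f"{dict_bytes:>7} B  ratio {ratio:5.2f}%  gain {gain:5.2f} pts  {per_kb:8.3f} pts/KB")
```

Past a few KB, each additional KB of dictionary buys almost nothing, which matches the manpage's remark (quoted later in these notes) that dictionary gains are mostly effective in the first few KB.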


Sample set size: 107 KB (110,148 bytes); sample count: 306

Average sample size: 110148/306 ≈ 359.96 bytes

Every training run produces a dictionary of exactly 23788 bytes; any --maxdict larger than that yields the identical dictionary!

zstd --verbose --train --train-cover --maxdict=110148 req/* -o req.110148.dic
zstd --verbose --train --train-cover --maxdict=110141 req/* -o req.110141.dic
zstd --verbose --train --train-cover --maxdict=108KB req/* -o req.108KB.dic

Dictionary size = 23788
Multiple of the average sample size = 23788/360 ≈ 66×
Source-to-dictionary size ratio: 110148/23788 = 4.63×

zstd --ultra -22 --progress req.108KB.dic -o req.108KB.c.dic

req.108KB.dic : 50.25% ( 23.2 KiB => 11.7 KiB, req.108KB.c.dic)
Compressed dictionary size = 11.6 KB (11,954 bytes)

zstd -D req.108KB.dic --ultra -22 --progress req/* --output-dir-flat req-c-108KB

306 files compressed : 17.13% ( 108 KiB => 18.4 KiB)

Retraining with the dictionary size explicitly set to 23788 surprisingly gives an even better compression ratio!

zstd --verbose --train --train-cover --maxdict=23788 req/* -o req.23788.dic

Dictionary size = 23788

zstd --ultra -22 --progress req.23788.dic -o req.23788.c.dic

req.23788.dic : 28.52% ( 23.2 KiB => 6.63 KiB, req.23788.c.dic)
Compressed dictionary size = 6.62 KB (6,785 bytes)

zstd -D req.23788.dic --ultra -22 --progress req/* --output-dir-flat req-c-23788

306 files compressed : 14.37% ( 108 KiB => 15.5 KiB)

A quick test with an arbitrary size larger than the average: 888 bytes

zstd --verbose --train --train-cover --maxdict=888 req/* -o req.888.dic

Dictionary size = 888

zstd --ultra -22 --progress req.888.dic -o req.888.c.dic

req.888.dic : 65.99% ( 888 B => 586 B, req.888.c.dic)
Compressed dictionary size = 586 bytes

zstd -D req.888.dic --ultra -22 --progress req/* --output-dir-flat req-c-888

306 files compressed : 32.51% ( 108 KiB => 35.0 KiB)

Setting the dictionary size to the average file size: 110148/306 ≈ 359.96, i.e. 360 bytes

zstd --verbose --train --train-cover --maxdict=360 req/* -o req.360.dic

Dictionary size = 360

zstd --ultra -22 --progress req.360.dic -o req.360.c.dic

req.360.dic : 94.44% ( 360 B => 340 B, req.360.c.dic)
Compressed dictionary size = 340 bytes

zstd -D req.360.dic --ultra -22 --progress req/* --output-dir-flat req-c-360

306 files compressed : 47.99% ( 108 KiB => 51.6 KiB)

Trying training with ZSTD's minimum dictionary size, 256 bytes

zstd --verbose --train --train-cover --maxdict=256 req/* -o req.256.dic

Dictionary size = 256

zstd --ultra -22 --progress req.256.dic -o req.256.c.dic

req.256.dic : 102.34% ( 256 B => 262 B, req.256.c.dic)
Compressed dictionary size = 262 bytes (it actually got larger!)

zstd -D req.256.dic --ultra -22 --progress req/* --output-dir-flat req-c-256

306 files compressed : 58.17% ( 108 KiB => 62.6 KiB)

Compression ratio without a dictionary

zstd --ultra -22 --progress req/* --output-dir-flat req-c

306 files compressed : 74.81% ( 108 KiB => 80.5 KiB)


Notes on dictionary-training parameters

Dictionaries trained with the following three option spellings all have identical MD5s; apparently I still don't know how to use these parameters:
--train-cover=shrink=2
--train-cover=shrink
--train-cover
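Matching MD5s do mean byte-identical dictionaries; a stdlib way to verify it (substitute the actual dictionary paths for `md5_of`):

```python
import hashlib

def md5_of(path: str) -> str:
    """MD5 of a file, read in chunks so large dictionaries are handled fine."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

# Identical bytes always hash identically, so equal digests imply the
# three --train-cover variants wrote byte-for-byte the same dictionary:
a = hashlib.md5(b"same dictionary bytes").hexdigest()
b = hashlib.md5(b"same dictionary bytes").hexdigest()
print(a == b)  # True
```

Usage would be e.g. `md5_of("req.616.dic") == md5_of("req.616b.dic")`.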

Dictionary training cannot saturate the CPU or memory

zstd --verbose -T0 --auto-threads=logical --train -M1024 --train-cover --maxdict=616 request/request/* -o req.616.dic

Even with the multi-threading options set, CPU usage is still only about 12.4%!
The memory-limit option does not increase usage either: memory stays flat at 16.8 MB.

Help notes on the parameters

"An easy-to-follow introduction to k-nearest neighbors (KNN) and K-D trees, with handwritten-digit recognition" - 简书 (Jianshu)

Kd-tree is short for "k-dimension tree"
KD-tree nearest-neighbor search algorithm

zstd(1) — zstd — Debian unstable — Debian Manpages

Training works if there is some correlation in a family of small data samples. The more data-specific a dictionary is, the more efficient it is (there is no universal dictionary).
Hence, deploying one dictionary per type of data will provide the greatest benefit.
Dictionary gains are mostly effective in the first few KB. Then, the compression algorithm gradually uses previously decoded content to better compress the rest of the file.
--train-cover[=k=#,d=#,steps=#,split=#,shrink[=#]]
If split is not specified or split <= 0, then the default value of 100 is used.
If shrink flag is not used, then the default value for shrinkDict of 0 is used.
If shrink is not specified, then the default value for shrinkDictMaxRegression of 1 is used.
Having shrink enabled takes a truncated dictionary of minimum size and doubles it in size
until the compression ratio of the truncated dictionary is at most shrinkDictMaxRegression% worse than the compression ratio of the largest dictionary.
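As I read that paragraph, shrink starts from a minimum-size truncation and doubles it until the truncated dictionary compresses within shrinkDictMaxRegression percentage points of the full dictionary. A toy simulation of that loop (the ratio curve is invented for illustration; real behavior depends on the samples):

```python
def pick_shrunk_size(full_size: int, min_size: int, ratio_of, max_regression: float = 1.0) -> int:
    """Double a truncated dictionary from min_size until its compression ratio
    is at most max_regression points worse than the full dictionary's ratio."""
    target = ratio_of(full_size) + max_regression
    size = min_size
    while size < full_size and ratio_of(size) > target:
        size *= 2
    return min(size, full_size)

# Invented ratio curve (%): bigger dictionaries help, with diminishing returns.
curve = {256: 50.0, 512: 43.0, 1024: 39.0, 2048: 38.2, 4096: 38.0}
print(pick_shrunk_size(4096, 256, curve.__getitem__))  # 1024: within 1 point of 38.0
```

With the default regression of 1 point, the 1024 B truncation (39.0%) is accepted even though the full 4096 B dictionary does slightly better, which is exactly the size-vs-ratio trade shrink is meant to make.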

Warnings shown when the dictionary size is set poorly

! Warning : data size of samples too small for target dictionary size
! Samples should be about 100x larger than target dictionary size
Trying 5 different sets of parameters
WARNING: The maximum dictionary size 112640 is too large compared to the source size 82775!
size(source)/size(dictionary) = 0.734863, but it should be >= 10!
This may lead to a subpar dictionary!
We recommend training on sources at least 10x, and preferably 100x the size of the dictionary!
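Those warnings encode a simple rule of thumb that is easy to check before launching a multi-hour training run; a small helper (thresholds taken from the messages above, function name is mine):

```python
def check_training_sizes(source_bytes: int, maxdict_bytes: int) -> str:
    """Mirror the CLI's sanity check: sources should be >= 10x, and
    preferably >= 100x, the target dictionary size."""
    ratio = source_bytes / maxdict_bytes
    if ratio < 10:
        return f"too small: size(source)/size(dictionary) = {ratio:.6f}, should be >= 10"
    if ratio < 100:
        return "marginal: samples should be about 100x larger than the dictionary"
    return "ok"

print(check_training_sizes(82775, 112640))     # the warning case quoted above
print(check_training_sizes(107155190, 61666))  # the 60 KB dictionary run
```

The first call reproduces the 0.734863 ratio from the quoted warning; the 60 KB run clears both thresholds.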

Zstandard CLI help text

*** Zstandard CLI (64-bit) v1.5.4, by Yann Collet ***

Compress or decompress the INPUT file(s); reads from STDIN if INPUT is `-` or not provided.

Usage: zstd [OPTIONS...] [INPUT... | -] [-o OUTPUT]

Options:
  -o OUTPUT                     Write output to a single file, OUTPUT.
  -k, --keep                    Preserve INPUT file(s). [Default]
  --rm                          Remove INPUT file(s) after successful (de)compression.

  -#                            Desired compression level, where `#` is a number between 1 and 19;
                                lower numbers provide faster compression, higher numbers yield
                                better compression ratios. [Default: 3]

  -d, --decompress              Perform decompression.
  -D DICT                       Use DICT as the dictionary for compression or decompression.

  -f, --force                   Disable input and output checks. Allows overwriting existing files,
                                receiving input from the console, printing output to STDOUT, and
                                operating on links, block devices, etc. Unrecognized formats will be
                                passed through as-is.

  -h                            Display short usage and exit.
  -H, --help                    Display full help and exit.
  -V, --version                 Display the program version and exit.

Advanced options:
  -c, --stdout                  Write to STDOUT (even if it is a console) and keep the INPUT file(s).

  -v, --verbose                 Enable verbose output; pass multiple times to increase verbosity.
  -q, --quiet                   Suppress warnings; pass twice to suppress errors.
  --trace LOG                   Log tracing information to LOG.

  --[no-]progress               Forcibly show/hide the progress counter. NOTE: Any (de)compressed
                                output to terminal will mix with progress counter text.

  -r                            Operate recursively on directories.
  --filelist LIST               Read a list of files to operate on from LIST.
  --output-dir-flat DIR         Store processed files in DIR.
  --[no-]asyncio                Use asynchronous IO. [Default: Enabled]

  --[no-]check                  Add XXH64 integrity checksums during compression. [Default: Add, Validate]
                                If `-d` is present, ignore/validate checksums during decompression.

  --                            Treat remaining arguments after `--` as files.

Advanced compression options:
  --ultra                       Enable levels beyond 19, up to 22; requires more memory.
  --fast[=#]                    Use very fast compression levels. [Default: 1]
  --adapt                       Dynamically adapt compression level to I/O conditions.
  --long[=#]                    Enable long distance matching with window log #. [Default: 27]
  --patch-from=REF              Use REF as the reference point for Zstandard's diff engine.

  -T#                           Spawn # compression threads. [Default: 1; pass 0 for core count.]
  --single-thread               Share a single thread for I/O and compression (slightly different than `-T1`).
  --auto-threads={physical|logical}
                                Use physical/logical cores when using `-T0`. [Default: Physical]

  -B#                           Set job size to #. [Default: 0 (automatic)]
  --rsyncable                   Compress using a rsync-friendly method (`-B` sets block size).

  --exclude-compressed          Only compress files that are not already compressed.

  --stream-size=#               Specify size of streaming input from STDIN.
  --size-hint=#                 Optimize compression parameters for streaming input of approximately size #.
  --target-compressed-block-size=#
                                Generate compressed blocks of approximately # size.

  --no-dictID                   Don't write `dictID` into the header (dictionary compression only).
  --[no-]compress-literals      Force (un)compressed literals.
  --[no-]row-match-finder       Explicitly enable/disable the fast, row-based matchfinder for
                                the 'greedy', 'lazy', and 'lazy2' strategies.

  --format=zstd                 Compress files to the `.zst` format. [Default]
  --format=gzip                 Compress files to the `.gz` format.
  --format=xz                   Compress files to the `.xz` format.
  --format=lzma                 Compress files to the `.lzma` format.

Advanced decompression options:
  -l                            Print information about Zstandard-compressed files.
  --test                        Test compressed file integrity.
  -M#                           Set the memory usage limit to # megabytes.
  --[no-]sparse                 Enable sparse mode. [Default: Enabled for files, disabled for STDOUT.]
  --[no-]pass-through           Pass through uncompressed files as-is. [Default: Disabled]

Dictionary builder:
  --train                       Create a dictionary from a training set of files.

  --train-cover[=k=#,d=#,steps=#,split=#,shrink[=#]]
                                Use the cover algorithm (with optional arguments).
  --train-fastcover[=k=#,d=#,f=#,steps=#,split=#,accel=#,shrink[=#]]
                                Use the fast cover algorithm (with optional arguments).

  --train-legacy[=s=#]          Use the legacy algorithm with selectivity #. [Default: 9]
  -o NAME                       Use NAME as dictionary name. [Default: dictionary]
  --maxdict=#                   Limit dictionary to specified size #. [Default: 112640]
  --dictID=#                    Force dictionary ID to #. [Default: Random]

Benchmark options:
  -b#                           Perform benchmarking with compression level #. [Default: 3]
  -e#                           Test all compression levels up to #; starting level is `-b#`. [Default: 1]
  -i#                           Set the minimum evaluation to time # seconds. [Default: 3]
  -B#                           Cut file into independent chunks of size #. [Default: No chunking]
  -S                            Output one benchmark result per input file. [Default: Consolidated result]
  --priority=rt                 Set process priority to real-time.

From: https://www.cnblogs.com/AsionTang/p/17153041.html
