
K2-lhotse: Data Loading and Training Pipeline Analysis

Date: 2023-10-12 17:46:29
Tags: cut, reading, CutSet, lhotse, K2, cuts, class, features

class K2SpeechRecognitionDataset(torch.utils.data.Dataset):
The PyTorch Dataset for the speech recognition task using k2 library.

This dataset expects to be queried with lists of cut IDs,
for which it loads features and automatically collates/batches them.

To use it with a PyTorch DataLoader, set ``batch_size=None``
and provide a :class:`SimpleCutSampler` sampler.
Each item in this dataset is a dict of:
    .. code-block::
        {
            'inputs': float tensor with shape determined by :attr:`input_strategy`:
                      - single-channel:
                        - features: (B, T, F)
                        - audio: (B, T)
                      - multi-channel: currently not supported
            'supervisions': [
                {
                    'sequence_idx': Tensor[int] of shape (S,)
                    'text': List[str] of len S
                    # For feature input strategies
                    'start_frame': Tensor[int] of shape (S,)
                    'num_frames': Tensor[int] of shape (S,)
                    # For audio input strategies
                    'start_sample': Tensor[int] of shape (S,)
                    'num_samples': Tensor[int] of shape (S,)

                    # Optionally, when return_cuts=True
                    'cut': List[AnyCut] of len S
                }
            ]
        }
    Dimension symbols legend:
    * ``B`` - batch size (number of Cuts)
    * ``S`` - number of supervision segments (greater or equal to B, as each Cut may have multiple supervisions)
    * ``T`` - number of frames of the longest Cut
    * ``F`` - number of features

<details>

    The 'sequence_idx' field is the index of the Cut used to create the example in the Dataset.
def __getitem__(self, cuts: CutSet) -> Dict[str, Union[torch.Tensor, List[str]]]:
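
To make the `supervisions` layout concrete, here is a minimal pure-Python sketch of how per-cut supervision segments could be flattened into the parallel fields described above. It is not lhotse's implementation: the `collate_supervisions` helper, the tuple layout, and the simple `round(seconds / frame_shift)` frame arithmetic are assumptions for illustration, and plain lists stand in for tensors.

```python
from typing import Dict, List, Tuple

# Hypothetical input: one entry per cut, each a list of (start_sec, duration_sec, text).
Supervision = Tuple[float, float, str]


def collate_supervisions(
    batch: List[List[Supervision]], frame_shift: float = 0.01
) -> Dict[str, list]:
    """Flatten per-cut supervisions into parallel lists of length S."""
    out: Dict[str, list] = {
        "sequence_idx": [], "text": [], "start_frame": [], "num_frames": []
    }
    for sequence_idx, supervisions in enumerate(batch):
        for start, duration, text in supervisions:
            out["sequence_idx"].append(sequence_idx)  # which cut in the batch
            out["text"].append(text)
            out["start_frame"].append(round(start / frame_shift))
            out["num_frames"].append(round(duration / frame_shift))
    return out


batch = [
    [(0.0, 1.5, "hey matt"), (2.0, 0.5, "yes")],  # cut 0 has two supervisions
    [(0.25, 2.0, "oh nothing")],                   # cut 1 has one supervision
]
collated = collate_supervisions(batch)
print(collated["sequence_idx"])  # S = 3 entries for B = 2 cuts
print(collated["start_frame"], collated["num_frames"])
```

Because a cut may carry several supervisions, `sequence_idx` is what lets a loss function map each supervision back to the right row of the `(B, T, F)` input.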

CutSet is defined as follows:

class CutSet(Serializable, AlgorithmMixin):
CutSet ties together all types of data -- audio, features and supervisions, and is suitable to represent
training/dev/test sets.

.. note::
    :class:`~lhotse.cut.CutSet` is the basic building block of PyTorch-style Datasets for speech/audio processing tasks.

When coming from Kaldi, there is really no good equivalent -- the closest concept may be Kaldi's "egs" for training
neural networks, which are chunks of feature matrices and corresponding alignments used respectively as inputs and
supervisions. :class:`~lhotse.cut.CutSet` is different because it provides you with all kinds of metadata,
and you can select just the interesting bits to feed them to your models.
CutSet initialization

Three different ways to cut the original data (alignment information required)

:class:`~lhotse.cut.CutSet` can be created from any combination of :class:`~lhotse.audio.RecordingSet`,
:class:`~lhotse.supervision.SupervisionSet`, and :class:`~lhotse.features.base.FeatureSet`
with :meth:`lhotse.cut.CutSet.from_manifests`::

    >>> from lhotse import CutSet
    >>> cuts = CutSet.from_manifests(recordings=my_recording_set)
    >>> cuts2 = CutSet.from_manifests(features=my_feature_set)
    >>> cuts3 = CutSet.from_manifests(
    ...     recordings=my_recording_set,
    ...     features=my_feature_set,
    ...     supervisions=my_supervision_set,
    ... )

When creating a :class:`.CutSet` with :meth:`.CutSet.from_manifests`, the resulting cuts will have the same duration
as the input recordings or features. For long recordings, it is not viable for training.
We provide several methods to transform the cuts into shorter ones.

Consider the following scenario::

                      Recording
    |-------------------------------------------|
    "Hey, Matt!"     "Yes?"        "Oh, nothing"
    |----------|     |----|        |-----------|

    .......... CutSet.from_manifests() ..........
                        Cut1
    |-------------------------------------------|

    ............. Example CutSet A ..............
        Cut1          Cut2              Cut3
    |----------|     |----|        |-----------|

    ............. Example CutSet B ..............
              Cut1                  Cut2
    |---------------------||--------------------|

    ............. Example CutSet C ..............
                 Cut1        Cut2
                |---|      |------|

The CutSet's A, B and C can be created like::

    >>> cuts_A = cuts.trim_to_supervisions()
    >>> cuts_B = cuts.cut_into_windows(duration=5.0)
    >>> cuts_C = cuts.trim_to_unsupervised_segments()
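
The three cutting strategies in the diagram boil down to interval arithmetic on the recording timeline. The sketch below is a pure-Python illustration, not lhotse code: the helper names, the `(start_sec, end_sec)` tuple layout, and the example timings are assumptions, and lhotse's handling of a final partial window may differ.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


def supervised_segments(supervisions: List[Interval]) -> List[Interval]:
    """CutSet A: one segment per supervision."""
    return sorted(supervisions)


def fixed_windows(total: float, window: float) -> List[Interval]:
    """CutSet B: consecutive windows; the last one may be shorter."""
    windows = []
    start = 0.0
    while start < total:
        windows.append((start, min(start + window, total)))
        start += window
    return windows


def unsupervised_gaps(total: float, supervisions: List[Interval]) -> List[Interval]:
    """CutSet C: the regions not covered by any supervision."""
    gaps, cursor = [], 0.0
    for start, end in sorted(supervisions):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < total:
        gaps.append((cursor, total))
    return gaps


recording_duration = 20.0
supervisions = [(0.0, 4.0), (7.0, 9.0), (14.0, 20.0)]

print(supervised_segments(supervisions))
print(fixed_windows(recording_duration, 5.0))
print(unsupervised_gaps(recording_duration, supervisions))
```

With three supervisions there are two gaps between them, which matches the two cuts shown for Example CutSet C.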
CutSet notes
  • Parallel execution
  • Modifications do not propagate to the original object
.. note::
    Some operations support parallel execution via an optional ``num_jobs`` parameter.
    By default, all processing is single-threaded.

.. caution::
    Operations on cut sets are not mutating -- they return modified copies of :class:`.CutSet` objects,
    leaving the original object unmodified (and all of its cuts are also unmodified).
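
The non-mutating behaviour can be pictured with a toy container built on frozen dataclasses. This is only a sketch of the "return a modified copy" pattern: `ToyCut` and `ToyCutSet` are hypothetical stand-ins and say nothing about how lhotse implements its classes.

```python
from dataclasses import dataclass, replace
from typing import Callable, Tuple


@dataclass(frozen=True)
class ToyCut:
    id: str
    duration: float


@dataclass(frozen=True)
class ToyCutSet:
    cuts: Tuple[ToyCut, ...]

    def filter(self, predicate: Callable[[ToyCut], bool]) -> "ToyCutSet":
        # Build a new set; self.cuts is never modified.
        return ToyCutSet(tuple(c for c in self.cuts if predicate(c)))

    def modify_ids(self, fn: Callable[[ToyCut], str]) -> "ToyCutSet":
        # dataclasses.replace() copies each cut with a new id.
        return ToyCutSet(tuple(replace(c, id=fn(c)) for c in self.cuts))


original = ToyCutSet((ToyCut("a", 3.0), ToyCut("b", 8.0)))
long_only = original.filter(lambda c: c.duration > 5)
renamed = original.modify_ids(lambda c: c.id + "-newid")

print(len(original.cuts), len(long_only.cuts))
print([c.id for c in original.cuts], [c.id for c in renamed.cuts])
```

The practical consequence is that the result of every operation has to be assigned to a name; calling `cuts.filter(...)` and discarding the return value changes nothing.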
CutSet file conversion and dict-like access
:class:`~lhotse.cut.CutSet` can be stored and read from JSON, JSONL, etc. and supports optional gzip compression::

    >>> cuts.to_file('cuts.jsonl.gz')
    >>> cuts4 = CutSet.from_file('cuts.jsonl.gz')

It behaves similarly to a ``dict``::

    >>> 'rec1-1-0' in cuts
    True
    >>> cut = cuts['rec1-1-0']
    >>> for cut in cuts:
    ...     pass
    >>> len(cuts)
    127
CutSet properties and related operations
:class:`~lhotse.cut.CutSet` has some convenience properties and methods to gather information about the dataset::

    >>> ids = list(cuts.ids)
    >>> speaker_id_set = cuts.speakers
    >>> # The following prints a message:
    >>> cuts.describe()
    Cuts count: 547
    Total duration (hours): 326.4
    Speech duration (hours): 79.6 (24.4%)
    ***
    Duration statistics (seconds):
    mean    2148.0
    std      870.9
    min      477.0
    25%     1523.0
    50%     2157.0
    75%     2423.0
    max     5415.0
    dtype: float64


Manipulation examples::

    >>> longer_than_5s = cuts.filter(lambda c: c.duration > 5)
    >>> first_100 = cuts.subset(first=100)
    >>> split_into_4 = cuts.split(num_splits=4)
    >>> shuffled = cuts.shuffle()
    >>> random_sample = cuts.sample(n_cuts=10)
    >>> new_ids = cuts.modify_ids(lambda c: c.id + '-newid')

These operations can be composed to implement more complex operations, e.g.
bucketing by duration:

    >>> buckets = cuts.sort_by_duration().split(num_splits=30)
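
Bucketing by duration is just "sort, then split into contiguous chunks". The sketch below shows that idea on a plain list of durations; `split_evenly` is a hypothetical helper, and the way it distributes the remainder (earlier chunks get one extra item) is an assumption that may not match lhotse's `split`.

```python
from typing import List


def split_evenly(items: List[float], num_splits: int) -> List[List[float]]:
    """Split a list into num_splits contiguous chunks whose sizes differ by at most 1."""
    base, extra = divmod(len(items), num_splits)
    chunks, start = [], 0
    for i in range(num_splits):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


durations = [7.2, 1.1, 3.4, 9.8, 2.5, 5.0, 4.3]
buckets = split_evenly(sorted(durations), num_splits=3)
for bucket in buckets:
    print(bucket)
```

Each bucket then holds cuts of similar length, so batches drawn from a single bucket need less padding.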
Detaching a CutSet from its underlying data
Cuts in a :class:`.CutSet` can be detached from parts of their metadata::

    >>> cuts_no_feat = cuts.drop_features()
    >>> cuts_no_rec = cuts.drop_recordings()
    >>> cuts_no_sup = cuts.drop_supervisions()
Recommended sorting methods for a small CutSet
Sometimes specific sorting patterns are useful when a small CutSet represents a mini-batch::

    >>> cuts = cuts.sort_by_duration(ascending=False)
    >>> cuts = cuts.sort_like(other_cuts)
</details>

<details>
<summary>CutSet batch operations: pad / truncate</summary>

:class:`~lhotse.cut.CutSet` offers some batch processing operations::

    >>> cuts = cuts.pad(num_frames=300)  # or duration=30.0
    >>> cuts = cuts.truncate(max_duration=30.0, offset_type='start')  # truncate from start to 30.0s
    >>> cuts = cuts.mix(other_cuts, snr=[10, 30], mix_prob=0.5)
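
Padding and truncation are what make variable-length cuts stackable into a `(B, T, F)` batch. The following is a minimal sketch on nested lists, not lhotse's implementation: `pad_frames`, `truncate_frames`, and the `pad_value` of `0.0` are assumptions for illustration (lhotse chooses its own padding value for features).

```python
from typing import List

Frames = List[List[float]]  # (num_frames, num_features)


def pad_frames(frames: Frames, num_frames: int, pad_value: float = 0.0) -> Frames:
    """Right-pad with constant frames up to num_frames; longer inputs are copied unchanged."""
    if not frames or len(frames) >= num_frames:
        return [row[:] for row in frames]
    num_features = len(frames[0])
    padding = [[pad_value] * num_features for _ in range(num_frames - len(frames))]
    return [row[:] for row in frames] + padding


def truncate_frames(frames: Frames, max_frames: int) -> Frames:
    """Keep at most max_frames frames, counted from the start."""
    return [row[:] for row in frames[:max_frames]]


short = [[1.0, 2.0]]
long = [[3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

longest = max(len(short), len(long))
batch = [pad_frames(x, longest) for x in (short, long)]  # every item now has T = 3 frames
truncated = truncate_frames(long, 2)

print(batch)
print(truncated)
```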
</details>

<details>
<summary>CutSet data augmentation: speed / volume / reverb</summary>

These operations can also be chained.

:class:`~lhotse.cut.CutSet` supports lazy data augmentation/transformation methods which require adjusting some information
in the manifest (e.g., ``num_samples`` or ``duration``).
Note that in the following examples, the audio is untouched -- the operations are stored in the manifest,
and executed upon reading the audio::

    >>> cuts_sp = cuts.perturb_speed(factor=1.1)
    >>> cuts_vp = cuts.perturb_volume(factor=2.)
    >>> cuts_24k = cuts.resample(24000)
    >>> cuts_rvb = cuts.reverb_rir(rir_recordings)

.. caution::
    If the :class:`.CutSet` contained :class:`~lhotse.features.base.Features` manifests, they will be
    detached after performing audio augmentations such as :meth:`.CutSet.perturb_speed`,
    :meth:`.CutSet.resample`, :meth:`.CutSet.perturb_volume`, or :meth:`.CutSet.reverb_rir`.
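
The "stored in the manifest, executed upon reading" idea can be sketched with a toy object that records transforms and applies them only when the audio is loaded. Everything here is hypothetical (`LazyAudio`, its fields, and `perturbed_duration`); it illustrates the lazy pattern and the kind of metadata adjustment involved, not lhotse's actual classes or rounding rules.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple

Transform = Callable[[List[float]], List[float]]


@dataclass(frozen=True)
class LazyAudio:
    """Toy manifest entry: stored samples plus a tuple of pending transforms."""
    samples: Tuple[float, ...]
    transforms: Tuple[Transform, ...] = ()

    def perturb_volume(self, factor: float) -> "LazyAudio":
        # Only record the operation; the stored samples are not touched.
        scale: Transform = lambda xs: [x * factor for x in xs]
        return replace(self, transforms=self.transforms + (scale,))

    def load_audio(self) -> List[float]:
        # Transforms run here, at read time, in the order they were added.
        audio = list(self.samples)
        for transform in self.transforms:
            audio = transform(audio)
        return audio


def perturbed_duration(duration: float, factor: float) -> float:
    """Speed perturbation by `factor` changes the duration the manifest must report."""
    return duration / factor


clean = LazyAudio(samples=(0.5, -0.25, 1.0))
louder = clean.perturb_volume(2.0).perturb_volume(2.0)  # chained, still lazy

print(clean.load_audio())
print(louder.load_audio())
print(perturbed_duration(11.0, 1.1))
```

Precomputed features describe the original audio, so once a transform like this is attached they no longer match what `load_audio` returns, which is why the caution above says they are detached.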

</details>

<details>
<summary>CutSet parallel computation of mean and variance</summary>

:class:`~lhotse.cut.CutSet` offers parallel feature extraction capabilities
(see :meth:`.CutSet.compute_and_store_features` for details),
and can be used to estimate global mean and variance::

    >>> from lhotse import CutSet, Fbank
    >>> cuts = CutSet()
    >>> cuts = cuts.compute_and_store_features(
    ...     extractor=Fbank(),
    ...     storage_path='/data/feats',
    ...     num_jobs=4
    ... )
    >>> mvn_stats = cuts.compute_global_feature_stats('/data/features/mvn_stats.pkl', max_cuts=10000)
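
Global mean and variance can be estimated in one streaming pass by accumulating per-dimension sums and sums of squares over all frames. The sketch below shows that calculation on nested lists; `global_feature_stats` is a hypothetical helper and is not claimed to be how lhotse computes or stores its statistics.

```python
import math
from typing import Iterable, List, Tuple

Matrix = List[List[float]]  # (num_frames, num_features)


def global_feature_stats(matrices: Iterable[Matrix]) -> Tuple[List[float], List[float]]:
    """Accumulate per-dimension mean and standard deviation over all frames."""
    total_frames = 0
    sums: List[float] = []
    sq_sums: List[float] = []
    for matrix in matrices:
        for frame in matrix:
            if not sums:
                sums = [0.0] * len(frame)
                sq_sums = [0.0] * len(frame)
            for d, value in enumerate(frame):
                sums[d] += value
                sq_sums[d] += value * value
            total_frames += 1
    if total_frames == 0:
        raise ValueError("no frames to compute statistics from")
    means = [s / total_frames for s in sums]
    # Population variance: E[x^2] - E[x]^2, clamped at 0 against rounding error.
    stds = [
        math.sqrt(max(sq / total_frames - m * m, 0.0))
        for sq, m in zip(sq_sums, means)
    ]
    return means, stds


cut_features = [
    [[1.0, 10.0], [3.0, 10.0]],  # cut with 2 frames
    [[5.0, 10.0]],               # cut with 1 frame
    [[7.0, 10.0]],               # cut with 1 frame
]
means, stds = global_feature_stats(cut_features)
print(means)
print(stds)
```

Since only running sums are kept, the pass never needs all features in memory at once, and capping the number of cuts (as `max_cuts` does in the example above) trades accuracy for speed.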

See also:

- :class:`~lhotse.cut.Cut`
</details>

From: https://www.cnblogs.com/lhx9527/p/17759751.html
