Whether you are doing traditional machine learning, deep learning, or working with LLMs, nothing happens without data. Yet in practice many data projects are chaotically organized, lacking guidance and process. Doing data engineering well means following a few rules and establishing a sound project structure, so that the effort we put into a data project actually pays off.
These ten rules are worth studying carefully.
Rule 1: Start organized, and stay organized
Rule 2: Everything comes from somewhere, and raw data is immutable
Rule 3: Version control is basic professionalism
Rule 4: Notebooks (Jupyter) are for exploration, source files (.py) are for repetition
Rule 5: Tests and sanity checks prevent catastrophes
Rule 6: Fail loudly, fail quickly
Rule 7: Project runs are fully automated from raw data to final outputs
Rule 8: Important parameters are extracted and centralized
Rule 9: Project runs are verbose by default and produce tangible artifacts
Rule 10: Start with the simplest possible end-to-end pipeline
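Rules 5 and 6 pay off immediately: check your inputs up front and raise a descriptive error the moment something is wrong, instead of letting bad data flow silently downstream. A minimal Python sketch (the validate_rows helper and the column names are illustrative, not part of any library):

```python
def validate_rows(rows, required_cols):
    """Fail loudly and fast (rules 5 and 6): stop at the first bad
    record with a descriptive error, rather than propagating it."""
    if not rows:
        raise ValueError("input is empty")
    for i, row in enumerate(rows):
        missing = required_cols - row.keys()
        if missing:
            raise ValueError(f"row {i} is missing columns: {sorted(missing)}")
    return rows

# A well-formed batch passes through unchanged; a record without
# 'price' would trigger an immediate error naming the row and column.
good = [{"id": 1, "price": 9.99}, {"id": 2, "price": 4.50}]
validate_rows(good, {"id", "price"})
```

The point is not the specific checks but their placement: run them at the boundary of each pipeline stage, so a failure points at the stage that produced the bad data.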
You also need a project structure that is logical, reasonably standardized, and yet flexible, as shown below.
For example, data always lives in data/: the raw data in data/raw/, and the final cleaned versions used for analysis in data/processed/. Jupyter notebooks live in notebooks/, where a numbering scheme is encouraged to give a sense of order. The project's Python code lives in the source package (shown as {{ cookiecutter.module_name }} in the tree below), which notebooks can import, encouraging deduplication and standardization. A sensible structure like this helps others understand, reproduce, and extend your analysis, and builds trust.
├── LICENSE            <- Open-source license if one is chosen
├── Makefile           <- Makefile with convenience commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default mkdocs project; see www.mkdocs.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── pyproject.toml     <- Project configuration file with package metadata for
│                         {{ cookiecutter.module_name }} and configuration for tools like black
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.cfg          <- Configuration file for flake8
│
└── {{ cookiecutter.module_name }}   <- Source code for use in this project.
    │
    ├── __init__.py    <- Makes {{ cookiecutter.module_name }} a Python module
    │
    ├── config.py      <- Store useful variables and configuration
    │
    ├── dataset.py     <- Scripts to download or generate data
    │
    ├── features.py    <- Code to create features for modeling
    │
    ├── modeling
    │   ├── __init__.py
    │   ├── predict.py <- Code to run model inference with trained models
    │   └── train.py   <- Code to train models
    │
    └── plots.py       <- Code to create visualizations
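The config.py in this layout is where rule 8 ("extract and centralize important parameters") lives. A sketch of what such a file might contain (the path names follow the tree above; the seed and split values are illustrative defaults, not what ccds generates):

```python
# config.py -- centralize paths and parameters (rule 8) so notebooks
# and scripts import them from one place instead of hard-coding strings.
from pathlib import Path

# Resolve the project root relative to this file; fall back to the
# current directory when the code is run interactively.
PROJ_ROOT = (
    Path(__file__).resolve().parents[1]
    if "__file__" in globals()
    else Path.cwd()
)

DATA_DIR = PROJ_ROOT / "data"
RAW_DATA_DIR = DATA_DIR / "raw"              # original, immutable dump (rule 2)
INTERIM_DATA_DIR = DATA_DIR / "interim"      # transformed intermediates
PROCESSED_DATA_DIR = DATA_DIR / "processed"  # final, canonical data sets
MODELS_DIR = PROJ_ROOT / "models"

# Modeling parameters live here too, so every run shares the same defaults.
RANDOM_SEED = 42
TEST_SIZE = 0.2
```

A notebook then imports these instead of repeating literal paths, e.g. `from my_analysis.config import PROCESSED_DATA_DIR` (the package name depends on what you entered at the ccds prompt).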
You can scaffold this project structure quickly by installing and running the following commands:
pip install cookiecutter-data-science
ccds
Or point at the template repository explicitly:
ccds https://github.com/drivendata/cookiecutter-data-science
project_name (project_name): My Analysis
repo_name (my_analysis): my_analysis
module_name (my_analysis):
author_name (Your name (or your organization/company/team)): Dat A. Scientist
description (A short description of the project.): This is my analysis of the data.
python_version_number (3.10): 3.12
Select dataset_storage
1 - none
2 - azure
3 - s3
4 - gcs
Choose from [1/2/3/4] (1): 3
bucket (bucket-name): s3://my-aws-bucket
aws_profile (default):
Select environment_manager
1 - virtualenv
2 - conda
3 - pipenv
4 - none
Choose from [1/2/3/4] (1): 2
Select dependency_file
1 - requirements.txt
2 - environment.yml
3 - Pipfile
Choose from [1/2/3] (1): 1
Select pydata_packages
1 - none
2 - basic
Choose from [1/2] (1): 2
Select open_source_license
1 - No license file
2 - MIT
3 - BSD-3-Clause
Choose from [1/2/3] (1): 2
Select docs
1 - mkdocs
2 - none
Choose from [1/2] (1): 1
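Once the project is scaffolded, rules 7 and 10 suggest wiring up the simplest possible raw-to-output pipeline first and automating it end to end, then growing each stage. A toy Python sketch (the stage functions mirror dataset.py, features.py, and modeling/train.py in name only; their bodies are illustrative stubs, not the generated project's code):

```python
def make_dataset(raw):
    # "Download or generate data" stage: here, just drop bad records.
    return [r for r in raw if r.get("value") is not None]

def build_features(records):
    # Feature stage: derive a simple feature from each record.
    return [{"value": r["value"], "squared": r["value"] ** 2} for r in records]

def train(features):
    # "Model" stage: a trivial summary statistic stands in for training.
    return sum(f["squared"] for f in features) / len(features)

def run_pipeline(raw):
    # One call runs everything from raw input to final output (rule 7).
    return train(build_features(make_dataset(raw)))

raw = [{"value": 1}, {"value": None}, {"value": 3}]
print(run_pipeline(raw))  # mean of squares over valid records -> 5.0
```

Even this trivial chain gives you something rule 9 asks for: a single, repeatable run that turns raw input into a tangible result you can inspect.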
In practice, keep refining your data project's process and structure as circumstances demand, until you have a truly end-to-end, reproducible data generation process that serves your data analysis and machine learning needs.
Source: https://blog.csdn.net/Practicer2015/article/details/140937894