Chapter 6.2-Preparing the dataset

Date: 2025-01-20 19:28:10

Tags: Chapter, spam, df, dataset, train, 6.2, file, path, data

Chapter 6 -Fine-tuning for classification

6.2-Preparing the dataset

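Classification fine-tuning of an LLM is a three-stage process (shown as a figure in the original post):
1. Prepare the dataset.
2. Set up the model.
3. Fine-tune and evaluate the model.

This section prepares the dataset for classification fine-tuning. We use a dataset of spam and non-spam text messages to fine-tune a large language model (LLM) to classify them. First, we download and unzip the dataset.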

import urllib.request
import zipfile
import os
from pathlib import Path

url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip"
zip_path = "sms_spam_collection.zip"
extracted_path = "sms_spam_collection"
data_file_path = Path(extracted_path) / "SMSSpamCollection.tsv"

def download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path):
    if data_file_path.exists():
        print(f"{data_file_path} already exists. Skipping download and extraction.")
        return

    # Downloading the file
    with urllib.request.urlopen(url) as response:
        with open(zip_path, "wb") as out_file:
            out_file.write(response.read())

    # Unzipping the file
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(extracted_path)

    # Add .tsv file extension
    original_file_path = Path(extracted_path) / "SMSSpamCollection"
    os.rename(original_file_path, data_file_path)
    print(f"File downloaded and saved as {data_file_path}")

download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path)
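
The unzip-and-rename step can be tried offline on a throwaway archive. The sketch below is an illustration only: it builds a tiny zip file in a temporary directory (the two sample rows and the helper name `extract_and_rename` are made up for this example), then applies the same `extractall` and rename logic as the function above.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_and_rename(zip_path, extracted_path, member_name, suffix=".tsv"):
    """Extract zip_path into extracted_path and append suffix to member_name."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(extracted_path)
    original = Path(extracted_path) / member_name
    renamed = original.with_name(original.name + suffix)
    original.rename(renamed)
    return renamed


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    zip_path = tmp / "sample.zip"
    # Hypothetical two-line stand-in for the real SMSSpamCollection file
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("SMSSpamCollection", "ham\tSee you at 5\nspam\tWin a prize now\n")

    result = extract_and_rename(zip_path, tmp / "out", "SMSSpamCollection")
    result_name = result.name
    lines = result.read_text().splitlines()
    print(result_name, len(lines))
```

Working in a temporary directory keeps the experiment from touching the real `sms_spam_collection` folder.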

Load the tab-separated file with pandas:

import pandas as pd


df = pd.read_csv(data_file_path, sep='\t', header=None, names=["Label", "Text"])
df
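
To see what the `sep`, `header`, and `names` arguments do without downloading anything, the following sketch parses a few made-up rows from an in-memory string. The sample messages are invented; only the layout (label, tab, text, no header row) mirrors the dataset.

```python
import io

import pandas as pd

# Made-up rows in the same layout as the dataset: label, tab, message text
sample = "ham\tSee you at 5\nspam\tWin a prize now\nham\tOk, thanks\n"

sample_df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",        # fields are separated by tabs, not commas
    header=None,     # the file has no header row
    names=["Label", "Text"],
)
print(sample_df.shape)
print(sample_df["Label"].tolist())
```

Because the separator is a tab, the comma inside "Ok, thanks" stays part of the text column.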

Checking the class distribution shows that "ham" (that is, "not spam") occurs far more often than "spam":

print(df["Label"].value_counts())

"""Output"""
Label
ham     4825
spam     747
Name: count, dtype: int64
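
As a quick worked calculation, the counts above can be turned into class shares. This sketch simply reuses the two numbers from the output and needs no libraries.

```python
# Class counts copied from the output above
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

Spam makes up only about 13% of the 5,572 messages, which is the imbalance the next step addresses.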

To keep fine-tuning fast, we downsample the dataset (one of several ways to handle class imbalance) so that each class contains 747 instances:

def create_balanced_dataset(df):
    
    # Count the instances of "spam"
    num_spam = df[df["Label"] == "spam"].shape[0]
    
    # Randomly sample "ham" instances to match the number of "spam" instances
    ham_subset = df[df["Label"] == "ham"].sample(num_spam, random_state=123)
    
    # Combine ham "subset" with "spam"
    balanced_df = pd.concat(
        [ham_subset, df[df["Label"] == "spam"]],
        ignore_index=True,
    )

    return balanced_df


balanced_df = create_balanced_dataset(df)
print(balanced_df["Label"].value_counts())

"""Output"""
Label
ham     747
spam    747
Name: count, dtype: int64
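
For comparison, the same kind of downsampling can be written more compactly with pandas' group-wise `sample`. This is an alternative sketch on made-up toy data, not the function used above, and it will generally pick different rows than `create_balanced_dataset` even with the same seed.

```python
import pandas as pd

# Toy imbalanced data: 6 "ham" rows and 2 "spam" rows (made up for illustration)
toy_df = pd.DataFrame({
    "Label": ["ham"] * 6 + ["spam"] * 2,
    "Text": [f"message {i}" for i in range(8)],
})

# Size of the smallest class
n_min = int(toy_df["Label"].value_counts().min())

# Draw n_min rows from every class, then renumber the rows
toy_balanced = (
    toy_df.groupby("Label")
    .sample(n=n_min, random_state=123)
    .reset_index(drop=True)
)
print(toy_balanced["Label"].value_counts().to_dict())
```

This form generalizes to more than two classes, since every group is cut down to the size of the smallest one.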

Next, we convert the string class labels "ham" and "spam" into the integer class labels 0 and 1:

balanced_df["Label"] = balanced_df["Label"].map({"ham": 0, "spam": 1})  
balanced_df
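
One detail worth knowing about `map` with a dictionary: any label that is not a key in the dictionary becomes `NaN` rather than raising an error. The sketch below demonstrates this on a made-up Series with a deliberately misspelled label, and shows a simple check that could be run before training.

```python
import pandas as pd

labels = pd.Series(["ham", "spam", "Spam"])  # "Spam" is a deliberately wrong label
encoded = labels.map({"ham": 0, "spam": 1})

# Labels missing from the dictionary become NaN, so count them before training
num_unmapped = int(encoded.isna().sum())
print(num_unmapped)
```

A count of zero on the real data would confirm that every row received an integer label.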

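Now let's define a function that randomly splits the dataset into training, validation, and test subsets: 70% for training, 10% for validation, and 20% for testing.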

def random_split(df, train_frac, validation_frac):
    # Shuffle the entire DataFrame
    df = df.sample(frac=1, random_state=123).reset_index(drop=True)

    # Calculate split indices
    train_end = int(len(df) * train_frac)
    validation_end = train_end + int(len(df) * validation_frac)

    # Split the DataFrame
    train_df = df[:train_end]
    validation_df = df[train_end:validation_end]
    test_df = df[validation_end:]

    return train_df, validation_df, test_df

train_df, validation_df, test_df = random_split(balanced_df, 0.7, 0.1)
# Test size is implied to be 0.2 as the remainder

train_df.to_csv("train.csv", index=False)
validation_df.to_csv("validation.csv", index=False)
test_df.to_csv("test.csv", index=False)
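
The split sizes follow directly from the index arithmetic in `random_split`. As a worked example, the sketch below repeats that arithmetic for the 1,494 rows of the balanced dataset (747 per class), without needing the data itself.

```python
n = 747 * 2  # rows in the balanced dataset
train_frac, validation_frac = 0.7, 0.1

# Same index arithmetic as in random_split
train_end = int(n * train_frac)
validation_end = train_end + int(n * validation_frac)

sizes = {
    "train": train_end,
    "validation": validation_end - train_end,
    "test": n - validation_end,
}
print(sizes)  # {'train': 1045, 'validation': 149, 'test': 300}
```

Because `int()` truncates, the training and validation sets are rounded down and the test set absorbs the remainder, ending up slightly above 20%.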

We have now downloaded the dataset, balanced its classes, and split it into training, validation, and test sets.

6.3-Creating data loaders

Source: https://blog.csdn.net/hbkybkzw/article/details/145268094
