
Chapter 6.2-Preparing the dataset


Chapter 6 -Fine-tuning for classification

6.2-Preparing the dataset

  • Classification fine-tuning of an LLM is a three-stage process:

    1. Prepare the dataset.

    2. Set up the model.

    3. Fine-tune and evaluate the model.

  • This section prepares the dataset for classification fine-tuning. We fine-tune a large language model (LLM) on a dataset of spam and non-spam ("ham") text messages so that it can classify them. First, we download and unzip the dataset.

    import urllib.request
    import zipfile
    import os
    from pathlib import Path
    
    url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip"
    zip_path = "sms_spam_collection.zip"
    extracted_path = "sms_spam_collection"
    data_file_path = Path(extracted_path) / "SMSSpamCollection.tsv"
    
    def download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path):
        if data_file_path.exists():
            print(f"{data_file_path} already exists. Skipping download and extraction.")
            return
    
        # Downloading the file
        with urllib.request.urlopen(url) as response:
            with open(zip_path, "wb") as out_file:
                out_file.write(response.read())
    
        # Unzipping the file
        with zipfile.ZipFile(zip_path, "r") as zip_ref:
            zip_ref.extractall(extracted_path)
    
        # Add .tsv file extension
        original_file_path = Path(extracted_path) / "SMSSpamCollection"
        os.rename(original_file_path, data_file_path)
        print(f"File downloaded and saved as {data_file_path}")
    
    download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path)
    
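    The extracted file is plain text with one message per line in a "label<TAB>text" layout. As an optional sanity check (not part of the original code), we can peek at the first few raw lines before parsing:

    # Optional: print the first three raw lines to confirm the
    # tab-separated "label<TAB>text" format before parsing.
    with open(data_file_path, "r", encoding="utf-8") as f:
        for _ in range(3):
            print(f.readline().rstrip())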

    Next, we load the file into a pandas DataFrame. Note that despite the name read_csv, the file is tab-separated, so we pass sep='\t':

    import pandas as pd
    
    df = pd.read_csv(data_file_path, sep='\t', header=None, names=["Label", "Text"])
    df
    

    [Image: preview of the df DataFrame with its Label and Text columns]

    When we inspect the class distribution, we see that the data contains "ham" (i.e., "not spam") far more frequently than "spam":

    print(df["Label"].value_counts())
    
    """输出"""
    Label
    ham     4825
    spam     747
    Name: count, dtype: int64
    
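    To quantify the imbalance, value_counts(normalize=True) reports class proportions directly (an optional check, not in the original): roughly 87% of the messages are "ham" and 13% are "spam".

    # Optional: class proportions instead of raw counts.
    print(df["Label"].value_counts(normalize=True))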

    To keep fine-tuning fast, we downsample the dataset (one way of handling class imbalance) so that each class contains 747 instances:

    def create_balanced_dataset(df):
        
        # Count the instances of "spam"
        num_spam = df[df["Label"] == "spam"].shape[0]
        
        # Randomly sample "ham" instances to match the number of "spam" instances
        ham_subset = df[df["Label"] == "ham"].sample(num_spam, random_state=123)
        
        # Combine the "ham" subset with all "spam" instances
        balanced_df = pd.concat(
            [ham_subset, df[df["Label"] == "spam"]], ignore_index=True
        )
    
        return balanced_df
    
    
    balanced_df = create_balanced_dataset(df)
    print(balanced_df["Label"].value_counts())
    
    """输出"""
    Label
    ham     747
    spam    747
    Name: count, dtype: int64
    
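    Downsampling is just one option; an alternative that keeps all 5,572 messages is to weight the loss by inverse class frequency during fine-tuning. A minimal PyTorch sketch (an assumption on our part, not part of this chapter's code, which uses the balanced dataset):

    import torch
    
    # Inverse-frequency class weights: the rarer "spam" class gets a larger weight.
    counts = torch.tensor([4825.0, 747.0])  # ham (0), spam (1)
    weights = counts.sum() / (2 * counts)   # roughly [0.58, 3.73]
    loss_fn = torch.nn.CrossEntropyLoss(weight=weights)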

    Next, we convert the string class labels "ham" and "spam" into the integer class labels 0 and 1:

    balanced_df["Label"] = balanced_df["Label"].map({"ham": 0, "spam": 1})  
    balanced_df
    

    [Image: preview of balanced_df after mapping, with Label values 0 and 1]

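    One pitfall of .map is that any label missing from the mapping dictionary silently becomes NaN, so a cheap assertion (an optional check, not in the original) confirms every row was mapped:

    # .map returns NaN for labels absent from the mapping dictionary,
    # so verify that every row ended up as 0 or 1.
    assert balanced_df["Label"].isin([0, 1]).all()
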
    Now let's define a function that randomly splits the dataset into training, validation, and test subsets: 70% for training, 10% for validation, and 20% for testing:

    def random_split(df, train_frac, validation_frac):
        # Shuffle the entire DataFrame
        df = df.sample(frac=1, random_state=123).reset_index(drop=True)
    
        # Calculate split indices
        train_end = int(len(df) * train_frac)
        validation_end = train_end + int(len(df) * validation_frac)
    
        # Split the DataFrame
        train_df = df[:train_end]
        validation_df = df[train_end:validation_end]
        test_df = df[validation_end:]
    
        return train_df, validation_df, test_df
    
    train_df, validation_df, test_df = random_split(balanced_df, 0.7, 0.1)
    # Test size is implied to be 0.2 as the remainder
    
    train_df.to_csv("train.csv", index=None)
    validation_df.to_csv("validation.csv", index=None)
    test_df.to_csv("test.csv", index=None)
    
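    A quick check (optional, not in the original) confirms the split sizes: 70/10/20 of the 1,494 balanced rows yields 1,045 / 149 / 300 rows, which together add up to the full balanced dataset.

    # Verify the three splits cover the balanced dataset exactly.
    print(len(train_df), len(validation_df), len(test_df))  # 1045 149 300
    assert len(train_df) + len(validation_df) + len(test_df) == len(balanced_df)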

    We have now downloaded the dataset, balanced its classes, and split it into training, validation, and test sets.
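
    Before building the data loaders in the next section, the saved CSV files can be reloaded to confirm the round trip (a minimal sketch, not part of the original code):

    # Reload a saved split to confirm it was written correctly.
    reloaded_train_df = pd.read_csv("train.csv")
    print(reloaded_train_df.shape)  # expected: (1045, 2)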


