
A financial-domain pretrained model for classification tasks: a reference for large-model applications


A classification layer is added on top of BERT:

The implementation:

# take BERT's sequence output and keep only the [CLS] token vector
output = bert.model.output
output = Lambda(lambda x: x[:, 0], name='CLS-token')(output)
# map the [CLS] vector to the class probabilities
output = Dense(
    units=num_classes,
    activation='softmax',
    kernel_initializer=bert.initializer
)(output)

model = keras.models.Model(bert.model.input, output)

 

Then it is simply a matter of training a classifier on top of BERT's output.

The complete code is as follows:

#! -*- coding:utf-8 -*-
# FinWoBERT: an enhanced pretrained model for the Chinese financial domain
'''
康明. 深度学习预训练语言模型(案例篇) ——中文金融文本情绪分类研究[M]. 北京: 清华大学出版社, 2022.
Ming Kang. Pretraining Language Models in Deep Learning: A Case Study of Chinese Sentiment Classification for Financial Text.
Beijing: Tsinghua University Press, 2022.
'''

import os, json
import numpy as np
from bert4keras.backend import keras, set_gelu
from bert4keras.tokenizers import Tokenizer
from bert4keras.models import build_transformer_model
from bert4keras.optimizers import Adam, extend_with_piecewise_linear_lr
from bert4keras.snippets import sequence_padding, DataGenerator
from bert4keras.snippets import open
from keras.layers import Lambda, Dense

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix

import jieba_fast as jieba
jieba.initialize()

num_classes = 3
maxlen = 512
batch_size = 32

# BERT configuration
# path = "/Users/sssdjj/bert_source/"
config_path = 'data/chinese_wobert_L-12_H-768_A-12/bert_config.json'
checkpoint_path = 'data/chinese_wobert_L-12_H-768_A-12/bert_model.ckpt'
dict_path = 'data/chinese_wobert_L-12_H-768_A-12/vocab.txt'


labels = {"其他":0,"利多":1,"利空":2}  # 其他 = other/neutral, 利多 = bullish, 利空 = bearish

stop_words = []

# optionally load a stop-word list
# with open("data/cn_stopwords.txt") as f:
#     for i in f:
#         stop_words.append(i.strip())

def load_data(filename):
    """Load the dataset.
    Each sample has the form: (text, label_id)
    """
    D = []
    with open(filename, encoding='utf-8') as f:
        for l in f:
            if len(l.strip().split('|||')) == 2:
                label, text = l.strip().split('|||')
                # optionally strip stop words
                # for i in stop_words:
                #     text = str(text).replace(i, " ")
                D.append((text, labels[label]))
    return D

path = "data/"
# 加载数据集
train_data = load_data(path+'train.txt')
valid_data = load_data(path+'test.txt')

# 增加自定义词库  word.txt 元词表  word_zhengf.txt 加入正负词
jieba.load_userdict(path+"word_zhengf_buzai_vocab.txt")

# build the tokenizer (jieba word segmentation as pre-tokenization, since WoBERT is word-based)
tokenizer = Tokenizer(
    dict_path,
    do_lower_case=True,
    pre_tokenize=lambda s: jieba.cut(s, HMM=False)
)


class data_generator(DataGenerator):
    """Data generator: yields batches of (token_ids, segment_ids) and labels."""
    def __iter__(self, random=False):
        batch_token_ids, batch_segment_ids, batch_labels = [], [], []
        for is_end, (text, label) in self.sample(random):
            token_ids, segment_ids = tokenizer.encode(text, maxlen=maxlen)
            batch_token_ids.append(token_ids)
            batch_segment_ids.append(segment_ids)
            batch_labels.append([label])
            if len(batch_token_ids) == self.batch_size or is_end:
                batch_token_ids = sequence_padding(batch_token_ids)
                batch_segment_ids = sequence_padding(batch_segment_ids)
                batch_labels = sequence_padding(batch_labels)
                yield [batch_token_ids, batch_segment_ids], batch_labels
                batch_token_ids, batch_segment_ids, batch_labels = [], [], []

# encode each custom word into its existing sub-token ids; these are passed
# as compound_tokens below so the new words get entries in the embedding table
ctokens = []
with open(path+"word_zhengf_buzai_vocab.txt") as f:
    for i in f:
        ctokens.append(tokenizer.encode(i.strip())[0][1:-1])

bert = build_transformer_model(
    config_path,
    checkpoint_path,
    return_keras_model=False,
    compound_tokens=ctokens
)

output = bert.model.output
output = Lambda(lambda x: x[:, 0], name='CLS-token')(output)
output = Dense(
    units=num_classes,
    activation='softmax',
    kernel_initializer=bert.initializer
)(output)

model = keras.models.Model(bert.model.input, output)
model.summary()
# AdamLR = extend_with_piecewise_linear_lr(Adam, name='AdamLR')
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=Adam(learning_rate=1e-6),  # use a sufficiently small learning rate
    metrics=['accuracy'],
)

# wrap the datasets in data generators
train_generator = data_generator(train_data, batch_size)
valid_generator = data_generator(valid_data, batch_size)

def norm_index(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    macro_prec = precision_score(y_true, y_pred, average='macro')
    micro_prec = precision_score(y_true, y_pred, average='micro')

    macro_recall = recall_score(y_true, y_pred, average='macro')
    micro_recall = recall_score(y_true, y_pred, average='micro')

    macro_f1 = f1_score(y_true, y_pred, average='macro')
    micro_f1 = f1_score(y_true, y_pred, average='micro')

    cm = confusion_matrix(y_true, y_pred)

    return acc, macro_prec, micro_prec, macro_recall, micro_recall, macro_f1, micro_f1, cm


def evaluate(data):
    pred_list, true_list = [], []
    for x_true, y_true in data:
        y_pred = model.predict(x_true).argmax(axis=1)
        y_true = y_true[:, 0]
        pred_list.extend(y_pred)
        true_list.extend(y_true)
    return norm_index(true_list, pred_list)


class Evaluator(keras.callbacks.Callback):
    """Evaluate after each epoch and keep the best weights."""
    def __init__(self):
        super(Evaluator, self).__init__()
        self.best_val_acc = 0.

    def on_epoch_end(self, epoch, logs=None):
        val_acc, macro_prec,micro_prec, macro_recall, micro_recall,macro_f1,micro_f1, cm = evaluate(valid_generator)
        if val_acc > self.best_val_acc:
            self.best_val_acc = val_acc
            model.save_weights('train/best_model_sentiment.weights')
        
        print(
            u'val_acc: %.15f, best_val_acc: %.15f,loss:%s\n' %
            (val_acc, self.best_val_acc,logs)
        )
        print(
            u'macro_prec: %.15f, micro_prec: %.15f\n' %
            (macro_prec, micro_prec)
        )
        print(
            u'macro_recall: %.15f, micro_recall: %.15f\n' %
            (macro_recall, micro_recall)
        )
        print(
            u'macro_f1: %.15f, micro_f1: %.15f\n' %
            (macro_f1, micro_f1)
        )
        print(cm)


if __name__ == '__main__':

    evaluator = Evaluator()

    model.fit_generator(
        train_generator.forfit(),
        steps_per_epoch=len(train_generator),
        epochs=100,
        callbacks=[evaluator]
    )

else:

    model.load_weights('train/best_model_sentiment.weights')

  

To further improve a finance-domain model, you can also continue pretraining on a corpus specific to the financial domain:
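
A minimal sketch of such continued masked-language-model (MLM) pretraining with bert4keras, reusing config_path and checkpoint_path from the script above; this is an illustration rather than the book's code, and corpus loading, the masking strategy and per-position loss weighting are deliberately omitted:

# Minimal sketch of continued MLM pretraining on a finance corpus (not the
# book's code). Masking and loss weighting over masked positions are omitted.
from bert4keras.models import build_transformer_model
from bert4keras.optimizers import Adam

mlm = build_transformer_model(
    config_path,
    checkpoint_path,
    with_mlm=True,        # keep the masked-language-model head
    return_keras_model=True
)
mlm.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=Adam(1e-5)  # small learning rate limits drift from the original weights
)
# mlm.fit(masked_token_ids, original_token_ids, batch_size=32, epochs=1)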

 

A key technical issue is catastrophic forgetting: when a pretrained model is further trained on a narrow domain corpus, it can lose part of the general-language knowledge acquired during the original pretraining (a simple mitigation is sketched after the reading list below).

A few papers worth reading in depth on this:

Yuqing Zhao, Divya Saxena, Jiannong Cao. Revisiting Parameter Reuse to Overcome Catastrophic Forgetting in Neural Networks. arXiv:2207.11005v1 [cs.LG], 2022.
Matteo Boschini, Lorenzo Bonicelli, Angelo Porrello, et al. Transfer without Forgetting // Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part I. Cham: Springer, 2022.
Yabin Wang, Zhiwu Huang, Xiaopeng Hong. S-Prompts Learning with Pre-trained Transformers: An Occam's Razor for Domain Incremental Learning. arXiv:2207.12819v1 [cs.CV], 2022.
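
One simple and common mitigation (an illustration built on the MLM sketch above, not a method taken from the book) is to freeze the embedding layers and the lower Transformer blocks during continued pretraining, so that only the upper layers adapt to the finance corpus:

# Freeze the embeddings and the first 8 Transformer blocks of the `mlm`
# model sketched earlier; layer names follow bert4keras conventions such as
# 'Embedding-Token' and 'Transformer-0-MultiHeadSelfAttention'.
for layer in mlm.layers:
    if layer.name.startswith('Embedding') or any(
        layer.name.startswith('Transformer-%d-' % i) for i in range(8)
    ):
        layer.trainable = False
# re-compile so the new trainable flags take effect
mlm.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(1e-5))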

  

In addition, to improve interpretability, the pretraining corpus should be kept consistent with the classification task.

 

Finally, to make the model more robust, adversarial training can also be added (generating adversarial examples, e.g. with FGSM):

Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. Explaining and Harnessing Adversarial Examples. arXiv:1412.6572v3 [stat.ML], 2015.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, et al. Intriguing Properties of Neural Networks. arXiv:1312.6199v4 [cs.CV], 2014.
TensorFlow. Adversarial example using FGSM. https://tensorflow.google.cn/tutorials/generative/adversarial_fgsm, 2021.
Nathan Inkawhich. Adversarial Example Generation. https://pytorch.org/tutorials/beginner/fgsm_tutorial.html, 2021.

The adversarial samples are generated with the method from the article above:
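
A minimal sketch of FGSM-style perturbation, following the TensorFlow tutorial cited above rather than the exact method used here; for a text model the perturbation is usually applied to a continuous embedding tensor rather than to the discrete token ids:

# FGSM sketch in TensorFlow 2 (eager mode). `model` is assumed to map a
# continuous input, e.g. an embedding tensor, to class probabilities.
import tensorflow as tf

def fgsm_example(model, x, y_true, epsilon=0.01):
    """Return x perturbed in the direction that increases the loss."""
    x = tf.convert_to_tensor(x)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = loss_fn(y_true, model(x, training=False))
    grad = tape.gradient(loss, x)
    return x + epsilon * tf.sign(grad)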

 

In addition, to prevent overfitting, L1 regularization can be added to the output layer.
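
For example (a sketch; the coefficient 1e-5 is only a placeholder), the classification layer defined above can be given an L1 penalty on its kernel weights:

# Same classification head as above, with an L1 penalty on the kernel weights.
from keras import regularizers

output = Dense(
    units=num_classes,
    activation='softmax',
    kernel_initializer=bert.initializer,
    kernel_regularizer=regularizers.l1(1e-5)  # 1e-5 is a placeholder coefficient
)(output)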

 

 

 

 

