
Deep Learning -- seq2seq RNN English-to-French Translation -- 86



1. Structure

My hand-drawn diagram of the seq2seq (encoder-decoder) structure (image not reproduced here).

2. Code walkthrough

Imports

import nltk
import numpy as np
import re
import shutil
import tensorflow as tf
import os
import unicodedata

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

Dataset preprocessing

First, a small helper that clears and recreates the checkpoint directory before each run:

def clean_up_logs(data_dir):
    checkpoint_dir = os.path.join(data_dir, "checkpoints")
    # remove any checkpoints left over from a previous run, then recreate the directory
    if os.path.exists(checkpoint_dir):
        shutil.rmtree(checkpoint_dir, ignore_errors=True)
    os.makedirs(checkpoint_dir, exist_ok=True)
    return checkpoint_dir

This function gives the input sentence a full clean-up: it strips accent marks, adds a space in front of punctuation, removes every character that is neither a letter nor punctuation, collapses repeated whitespace, and lowercases the result.

def preprocess_sentence(sent):
    
    sent = "".join([c for c in unicodedata.normalize("NFD", sent) if unicodedata.category(c) != "Mn"])
    sent = re.sub(r"([!.?])", r" \1", sent)
    sent = re.sub(r"[^a-zA-Z!.?]+", r" ", sent)
    sent = re.sub(r"\s+", " ", sent)
    sent = sent.lower()
    return sent
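
A quick sanity check of the cleaning steps (the example sentences below are my own, not from the dataset):

print(preprocess_sentence("Comment ça va ?"))    # comment ca va ?
print(preprocess_sentence("I'm 10 years old!"))  # i m years old !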

Note the two decoder-side sequences here:
the decoder input fr_sent_in has BOS (begin of sentence) prepended to every sentence;
the label fr_sent_out, used for evaluation and for the loss computation, has EOS (end of sentence) appended to every sentence. A quick look at the first returned pair follows the function below.

def download_and_read():
    en_sents, fr_sents_in, fr_sents_out = [], [], []
    local_file = os.path.join("datasets", "fra.txt")
    with open(local_file, "r", encoding="utf-8") as fin:
        for i, line in enumerate(fin):
            en_sent, fr_sent, *_ = line.strip().split("\t")
            en_sent = [w for w in preprocess_sentence(en_sent).split()]  # clean and tokenize the English sentence as well
            fr_sent = preprocess_sentence(fr_sent)
            fr_sent_in = [w for w in ("BOS " + fr_sent).split()]  # decoder input (French): prepend the BOS marker
            fr_sent_out = [w for w in (fr_sent + " EOS").split()]  # decoder label (French): append the EOS marker
            en_sents.append(en_sent)
            fr_sents_in.append(fr_sent_in)
            fr_sents_out.append(fr_sent_out)
            if i >= NUM_SENT_PAIRS - 1:
                break
    return en_sents, fr_sents_in, fr_sents_out
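
A minimal check of the BOS/EOS handling (my own addition; it assumes datasets/fra.txt is already in place and that the globals such as NUM_SENT_PAIRS defined further down have been set):

sents_en, sents_fr_in, sents_fr_out = download_and_read()
print(sents_en[0])       # e.g. ['go', '.']
print(sents_fr_in[0])    # e.g. ['BOS', 'va', '!']
print(sents_fr_out[0])   # e.g. ['va', '!', 'EOS']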

The encoder
The encoder's call method takes the input x and an initial state, and returns encoder_out and encoder_state.
The RNN itself is the built-in Keras GRU layer, with return_state=True (and return_sequences=False, so only the final output is kept).

class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, num_timesteps, encoder_dim, **kwargs):
        super(Encoder, self).__init__(**kwargs)

        self.encoder_dim = encoder_dim
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=num_timesteps)
        self.rnn = tf.keras.layers.GRU(encoder_dim, return_sequences=False, return_state=True)

    def call(self, x, state):
        x = self.embedding(x)
        x, state = self.rnn(x, initial_state=state)
        return x, state

    def init_state(self, batch_size):
        return tf.zeros((batch_size, self.encoder_dim))

The decoder
Its call method likewise takes x and a state, and returns decoder_out and decoder_state.
The RNN is again a GRU, with return_state=True and return_sequences=True.
The GRU output x is then passed through a fully connected (Dense) layer whose logits indicate which word comes next.

class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, num_timesteps, decoder_dim, **kwargs):
        super(Decoder, self).__init__(**kwargs)

        self.decoder_dim = decoder_dim
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=num_timesteps)
        self.rnn = tf.keras.layers.GRU(decoder_dim, return_state=True, return_sequences=True)
        self.dense = tf.keras.layers.Dense(vocab_size)

    def call(self, x, state):
        x = self.embedding(x)
        x, state = self.rnn(x, state)
        x = self.dense(x)
        return x, state

The loss is sparse categorical cross-entropy (SparseCategoricalCrossentropy); only once the loss is computed can we take gradients and backpropagate.

def loss_func(ytrue, ypred):
    scce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    # tf.math.equal(ytrue, 0): True where ytrue is 0 (index 0 is the padding token).
    # tf.math.logical_not: invert it, so True marks real tokens and False marks padding.
    mask = tf.math.logical_not(tf.math.equal(ytrue, 0))

    # tf.cast: turn the boolean mask into integers (True -> 1, False -> 0)
    mask = tf.cast(mask, dtype=tf.int64)

    # use the mask as sample weights, so the padded positions contribute nothing to the loss
    loss = scce(ytrue, ypred, sample_weight=mask)
    return loss
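
A small toy check of the masking behaviour (shapes and values are made up purely for illustration):

# batch of 1 sentence, 3 timesteps, vocabulary of 4; the last position is padding (label 0)
ytrue = tf.constant([[2, 1, 0]])
ypred = tf.random.uniform((1, 3, 4))      # random logits
print(loss_func(ytrue, ypred).numpy())    # the padded position is excluded from the loss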

Defining the training step. Note that the decoder receives the ground-truth decoder_in at every timestep (teacher forcing), and its initial state is the encoder's final state.

@tf.function
def train_step(encoder_in, decoder_in, decoder_out, encoder_state):
    with tf.GradientTape() as tape:
        encoder_out, encoder_state = encoder(encoder_in, encoder_state)
        decoder_state = encoder_state
        decoder_pred, decoder_state = decoder(decoder_in, decoder_state)
        loss = loss_func(decoder_out, decoder_pred)
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)  # gradient computation via TF2's automatic differentiation
    optimizer.apply_gradients(zip(gradients, variables))
    return loss

Inference

def predict(encoder, decoder, batch_size, sents_en, data_en, sents_fr_out, word2idx_fr, idx2word_fr):
    # pick a random sentence from the data
    random_id = np.random.choice(len(sents_en))
    print("Input    : ", " ".join(sents_en[random_id]))
    print("Output   : ", " ".join(sents_fr_out[random_id]))

    encoder_in = tf.expand_dims(data_en[random_id], axis=0)

    encoder_state = encoder.init_state(1)
    encoder_out, encoder_state = encoder(encoder_in, encoder_state)
    decoder_state = encoder_state
    # start decoding from the BOS token
    decoder_in = tf.expand_dims(tf.constant([word2idx_fr["BOS"]]), axis=0)

    pred_sent_fr = []
    while True:
        decoder_pred, decoder_state = decoder(decoder_in, decoder_state)
        decoder_pred = tf.argmax(decoder_pred, axis=-1)   # greedy decoding: take the most likely word
        pred_word = idx2word_fr[decoder_pred.numpy()[0][0]]
        pred_sent_fr.append(pred_word)
        # stop on EOS, or cap the length (early in training the model may never emit EOS)
        if pred_word == "EOS" or len(pred_sent_fr) >= maxlen_fr:
            break
        decoder_in = decoder_pred
    print("predict: ", " ".join(pred_sent_fr))

Computing the BLEU score

def evaluate_bleu_score(encoder, decoder, test_dataset, word2idx_fr, idx2word_fr):
    bleu_scores = []
    smooth_fn = SmoothingFunction()

    for encoder_in, decoder_in, decoder_out in test_dataset:
        encoder_state = encoder.init_state(batch_size)
        encoder_out, encoder_state = encoder(encoder_in, encoder_state)
        decoder_state = encoder_state
        decoder_pred, decoder_state = decoder(decoder_in, decoder_state)

        # compute argmax over the vocabulary dimension (greedy decoding)
        decoder_pred = tf.argmax(decoder_pred, axis=-1).numpy()

        # decoder_out holds the reference (y_true) sentences
        for i in range(decoder_out.shape[0]):  # one reference sentence at a time
            ref_sent = [idx2word_fr[j] for j in decoder_out[i].numpy() if j > 0]
            hyp_sent = [idx2word_fr[j] for j in decoder_pred[i] if j > 0]

            # drop the last token (the EOS of the reference; the hypothesis is trimmed the same way)
            ref_sent = ref_sent[0:-1]
            hyp_sent = hyp_sent[0:-1]
            bleu_score = sentence_bleu([ref_sent], hyp_sent, smoothing_function=smooth_fn.method1)
            bleu_scores.append(bleu_score)
    return np.mean(np.array(bleu_scores))  # average over all test sentences
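
For reference, a minimal standalone example of the sentence_bleu call with method1 smoothing (the toy sentences are mine):

ref = ["je", "suis", "fatigue", "."]
hyp = ["je", "suis", "tres", "fatigue", "."]
print(sentence_bleu([ref], hyp, smoothing_function=SmoothingFunction().method1))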

Global settings

NUM_SENT_PAIRS = 30000
EMBEDDING_DIM = 256
ENCODER_DIM, DECODER_DIM = 1024, 1024
BATCH_SIZE = 64
NUM_EPOCHS = 5  # the original setting was 30; reduced here for a quicker run

tf.random.set_seed(30)

data_dir = "datasets"
checkpoint_dir = clean_up_logs(data_dir)

# dataset preparation: fra.txt is assumed to have been downloaded from the URL below
# and unzipped into the datasets/ directory beforehand
download_url = "http://www.manythings.org/anki/fra-eng.zip"
sents_en, sents_fr_in, sents_fr_out = download_and_read()

Tokenizing the samples

tokenizer_en = tf.keras.preprocessing.text.Tokenizer(filters="", lower=False)
tokenizer_en.fit_on_texts(sents_en)
data_en = tokenizer_en.texts_to_sequences(sents_en)
data_en = tf.keras.preprocessing.sequence.pad_sequences(data_en, padding="post")


tokenizer_fr = tf.keras.preprocessing.text.Tokenizer(filters="", lower=False)
tokenizer_fr.fit_on_texts(sents_fr_in)
tokenizer_fr.fit_on_texts(sents_fr_out)

data_fr_in = tokenizer_fr.texts_to_sequences(sents_fr_in)
data_fr_in = tf.keras.preprocessing.sequence.pad_sequences(data_fr_in, padding='post')

data_fr_out = tokenizer_fr.texts_to_sequences(sents_fr_out)
data_fr_out = tf.keras.preprocessing.sequence.pad_sequences(data_fr_out, padding="post")

vocab_size_en = len(tokenizer_en.word_index)
vocab_size_fr = len(tokenizer_fr.word_index)
word2idx_en = tokenizer_en.word_index
idx2word_en = {v: k for k, v in word2idx_en.items()}

word2idx_fr = tokenizer_fr.word_index
idx2word_fr = {v: k for k, v in word2idx_fr.items()}

print(f"Vocab size (en): {vocab_size_en}")
print(f"Vocab size (fr): {vocab_size_fr}")

maxlen_en = data_en.shape[1]
maxlen_fr = data_fr_out.shape[1]
print(f"seq len (en): {maxlen_en}")
print(f"seq len (fr): {maxlen_fr}")

Splitting the dataset

batch_size = BATCH_SIZE
dataset = tf.data.Dataset.from_tensor_slices((data_en, data_fr_in, data_fr_out))
dataset = dataset.shuffle(10000)
test_size = NUM_SENT_PAIRS // 4
test_dataset = dataset.take(test_size).batch(batch_size, drop_remainder=True)
train_dataset = dataset.skip(test_size).batch(batch_size, drop_remainder=True)

Checking the encoder/decoder input and output dimensions

# check encoder/decoder dimensions
embedding_dim = EMBEDDING_DIM
encoder_dim, decoder_dim = ENCODER_DIM, DECODER_DIM

encoder = Encoder(vocab_size_en+1, embedding_dim, maxlen_en, encoder_dim)
decoder = Decoder(vocab_size_fr+1, embedding_dim, maxlen_fr, decoder_dim)

optimizer = tf.keras.optimizers.Adam()
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)

for encoder_in, decoder_in, decoder_out in train_dataset:
    encoder_state = encoder.init_state(batch_size)
    encoder_out, encoder_state = encoder(encoder_in, encoder_state)
    decoder_state = encoder_state
    decoder_pred, decoder_state = decoder(decoder_in, decoder_state)
    break
print("encoder input         :", encoder_in.shape)
print("encoder output        :", encoder_out.shape, "state:    ", encoder_state.shape)
print("decoder output (logits)       :", decoder_pred.shape, "state:    ", decoder_state.shape)
print("decoder output (labels)       :", decoder_out.shape)

Training

# training step
num_epochs = NUM_EPOCHS
for e in range(num_epochs):
    encoder_state = encoder.init_state(batch_size)

    for batch, data in enumerate(train_dataset):
        encoder_in, decoder_in, decoder_out = data
        # decoder_out is the label value
        # decoder_in feed into decoder and will return decoder_pred and state
        # print(encoder_in.shape, decoder_in.shape, decoder_out.shape)

        loss = train_step(
            encoder_in, decoder_in, decoder_out, encoder_state
        )
    print("Epoch: {}, Loss: {:.4f}".format(e+1, loss.numpy()))

    if e % 10 == 0:
        checkpoint.save(file_prefix=checkpoint_prefix)

    predict(encoder, decoder, batch_size, sents_en, data_en, sents_fr_out, word2idx_fr, idx2word_fr)
    eval_score = evaluate_bleu_score(encoder, decoder, test_dataset, word2idx_fr, idx2word_fr)
    print("Eval Score (BLEU): {:.3e}".format(eval_score))

checkpoint.save(file_prefix=checkpoint_prefix)
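
To reuse the trained model in a later session, the latest checkpoint can be restored; a minimal sketch, assuming the encoder, decoder and optimizer have been rebuilt exactly as above:

# rebuild optimizer/encoder/decoder as above, then restore the saved weights
checkpoint = tf.train.Checkpoint(optimizer=optimizer, encoder=encoder, decoder=decoder)
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))
predict(encoder, decoder, batch_size, sents_en, data_en, sents_fr_out, word2idx_fr, idx2word_fr)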

From: https://www.cnblogs.com/cavalier-chen/p/18261308
