
Deep Learning - NLP - Writing Tang Poetry with AI -- 75

Date: 2024-04-21 22:46:16


1. Training

import collections
import numpy as np
import tensorflow as tf

# -------------------------------Data preprocessing---------------------------#

poetry_file = './data/poetry.txt'

# Poem collection
poetrys = []
with open(poetry_file, "r", encoding='utf-8', ) as f:
    for line in f:
        try:
            title, content = line.strip().split(':')
            content = content.replace(' ', '')
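            # Skip poems that contain annotation marks (the original tested '(' twice;
            # one of the two checks is assumed to target the full-width parenthesis)
            if '_' in content or '(' in content or '(' in content or '《' in content or '[' in content: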
                continue
            if len(content) < 5 or len(content) > 79:
                continue
            content = '[' + content + ']'
            poetrys.append(content)
        except Exception as e:
            pass

# Sort the poems by length (number of characters)
poetrys = sorted(poetrys, key=lambda line: len(line), reverse=False)
print('Total number of Tang poems: ', len(poetrys))

# Count how many times each character occurs
all_words = []
for poetry in poetrys:
    temp = [word for word in poetry]
    all_words += temp

counter = collections.Counter(all_words)
print(counter.items())
count_pairs = sorted(counter.items(), key=lambda x: -x[1])
print(count_pairs)
print(*count_pairs)
words, _ = zip(*count_pairs)
# words now holds every character, ordered from most to least frequent
print(words)

# Keep all characters (shorten the slice to keep only the most common ones)
# and append a space, which is used for padding
print(len(words))
words = words[:len(words)] + (' ',)
print(words)
print(len(words))

# Map each character to an integer ID
word_num_map = dict(zip(words, range(len(words))))
print(word_num_map)

# Convert each poem to a vector of IDs
# Lookup helper: a character in the vocabulary gets its index, anything else gets the default len(words)
to_num = lambda word: word_num_map.get(word, len(words))
poetrys_vector = [list(map(to_num, poetry)) for poetry in poetrys]
# [[314, 3199, 367, 1556, 26, 179, 680, 0, 3199, 41, 506, 40, 151, 4, 98, 1],
# [339, 3, 133, 31, 302, 653, 512, 0, 37, 148, 294, 25, 54, 833, 3, 1, 965, 1315, 377, 1700, 562, 21, 37, 0, 2, 1253, 21, 36, 264, 877, 809, 1]
# ....]

# Train on 256 poems per batch
batch_size = 256
# Number of batches needed to go through all the poems once
n_chunk = len(poetrys_vector) // batch_size
# Prepare the data
x_batches = []
y_batches = []
for i in range(n_chunk):
    start_index = i * batch_size
    end_index = start_index + batch_size
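    # Take 256 poems at a time
    batches = poetrys_vector[start_index:end_index]
    # Length of the longest poem among these 256
    length = max(map(len, batches))
    # Create a matrix filled with the ID of the space character
    xdata = np.full((batch_size, length), word_num_map[' '], np.int32)
    # Overwrite the start of each row with that poem's vector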
    for row in range(batch_size):
        xdata[row, :len(batches[row])] = batches[row]
    ydata = np.copy(xdata)
    ydata[:, :-1] = xdata[:, 1:]
    
    # xdata             ydata
    # [6,2,4,6,9]       [2,4,6,9,9]
    # [1,4,2,8,5]       [4,2,8,5,5]
    
    x_batches.append(xdata)
    y_batches.append(ydata)

# ---------------------------------------RNN--------------------------------------#

input_data = tf.placeholder(tf.int32, [batch_size, None])
output_targets = tf.placeholder(tf.int32, [batch_size, None])


# Define the RNN
def neural_network(model='lstm', rnn_size=128, num_layers=2):
    if model == 'rnn':
        cell_fun = tf.nn.rnn_cell.BasicRNNCell
    elif model == 'gru':
        cell_fun = tf.nn.rnn_cell.GRUCell
    elif model == 'lstm':
        cell_fun = tf.nn.rnn_cell.BasicLSTMCell

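    # Only the LSTM cell accepts state_is_tuple; the RNN and GRU cells take just the size
    cell = cell_fun(rnn_size, state_is_tuple=True) if model == 'lstm' else cell_fun(rnn_size)
    # Stack num_layers layers so the network is deeper. Note that [cell] * num_layers repeats the same
    # cell object; how that is handled (separate weights, shared weights, or an error) varies across TensorFlow 1.x releases
    cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)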

    initial_state = cell.zero_state(batch_size, tf.float32)

    with tf.variable_scope('rnnlm'):
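        # The "+ 1" in len(words) + 1 covers the default ID given to characters outside the vocabulary
        # Weight matrix W and bias b that map the cell output to the output layer Y
        softmax_w = tf.get_variable("softmax_w", [rnn_size, len(words) + 1])
        softmax_b = tf.get_variable("softmax_b", [len(words) + 1])
        # Transform the input X before it reaches the cell: in short, turn X into X_in and hand that to the RNN cell
        with tf.device("/cpu:0"):
            embedding = tf.get_variable("embedding", [len(words) + 1, rnn_size])
            # Equivalent to one-hot encoding each character and then projecting it to a dense vector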
            inputs = tf.nn.embedding_lookup(embedding, input_data)

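    # The next line builds the RNN topology
    # With time_major=True, outputs would have shape [steps, batch_size, depth];
    # with the default used here (time_major=False) it is [batch_size, steps, depth]
    outputs, last_state = tf.nn.dynamic_rnn(cell, inputs, initial_state=initial_state, scope='rnnlm')
    # After the reshape the shape is (batch_size * steps, rnn_size)
    output = tf.reshape(outputs, [-1, rnn_size])
    # Project the cell outputs to the output layer Y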
    logits = tf.matmul(output, softmax_w) + softmax_b
    probs = tf.nn.softmax(logits)
    return logits, last_state, probs, cell, initial_state


# Training
def train_neural_network():
    logits, last_state, _, _, _ = neural_network()
    targets = tf.reshape(output_targets, [-1])
    loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example([logits], [targets], [tf.ones_like(targets, dtype=tf.float32)])
    cost = tf.reduce_mean(loss)
    learning_rate = tf.Variable(0.0, trainable=False)
    tvars = tf.trainable_variables()
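    # Gradient clipping guards against exploding gradients: when a single update changes the weights
    # too sharply, the loss can easily diverge. Clipping keeps each update within a reasonable range.
    # The second argument (clip_norm, 5 here) is the maximum allowed global norm, not a ratio;
    # the function returns the clipped gradient tensors together with the global norm.
    # minimize() = compute_gradients() + apply_gradients()
    # Here the gradient computation and the weight update are done as two separate steps.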
    grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars), 5)
    optimizer = tf.train.AdamOptimizer(learning_rate)
    train_op = optimizer.apply_gradients(zip(grads, tvars))

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        saver = tf.train.Saver(tf.global_variables())

        for epoch in range(50):
            sess.run(tf.assign(learning_rate, 0.002 * (0.97 ** epoch)))
            n = 0
            for batch in range(n_chunk):
                train_loss, _ = sess.run([cost, train_op],
                                         feed_dict={input_data: x_batches[n], output_targets: y_batches[n]})
                n += 1
                print(epoch, batch, train_loss)
            if epoch % 7 == 0:
                saver.save(sess, './poetry.module', global_step=epoch)


train_neural_network()
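
The clipping step used in the training script can be illustrated outside TensorFlow. The following is a minimal NumPy sketch of global-norm clipping, assuming the usual rule of scaling every gradient by clip_norm / max(global_norm, clip_norm). The helper name clip_by_global_norm and the gradient values are made up for illustration; this is not the TensorFlow implementation itself.

```python
import numpy as np


def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients by one shared factor so their joint L2 norm is at most clip_norm."""
    # Joint L2 norm over every element of every gradient array.
    global_norm = np.sqrt(sum(np.sum(np.square(g)) for g in grads))
    # If the norm is already within the limit, the factor is 1 and nothing changes.
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm


# Hypothetical gradients for two variables (made-up numbers, not taken from the model).
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm = clip_by_global_norm(grads, clip_norm=5.0)
print(norm)     # joint norm before clipping
print(clipped)  # every gradient scaled by the same factor
```

Because one factor is shared by all gradients, the direction of the overall update is preserved and only its length is limited.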

2. Inference

import collections
import numpy as np
import tensorflow as tf

# -------------------------------Data preprocessing---------------------------#

poetry_file = './data/poetry.txt'

# Poem collection
poetrys = []
with open(poetry_file, "r", encoding='utf-8', ) as f:
    for line in f:
        try:
            title, content = line.strip().split(':')
            content = content.replace(' ', '')
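            # Skip poems that contain annotation marks (the original tested '(' twice;
            # one of the two checks is assumed to target the full-width parenthesis)
            if '_' in content or '(' in content or '(' in content or '《' in content or '[' in content: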
                continue
            if len(content) < 5 or len(content) > 79:
                continue
            content = '[' + content + ']'
            poetrys.append(content)
        except Exception as e:
            pass

# Sort the poems by length (number of characters)
poetrys = sorted(poetrys, key=lambda line: len(line))
print('Total number of Tang poems: ', len(poetrys))

# Count how many times each character occurs
all_words = []
for poetry in poetrys:
    all_words += [word for word in poetry]
counter = collections.Counter(all_words)
count_pairs = sorted(counter.items(), key=lambda x: -x[1])

words, _ = zip(*count_pairs)

# Add one special character, the space, to the vocabulary (used for padding)
words = words[:len(words)] + (' ',)
print(words)

# Map each character to an integer ID
word_num_map = dict(zip(words, range(len(words))))
print(word_num_map)

# Convert each poem to a vector of IDs
to_num = lambda word: word_num_map.get(word, len(words))
poetrys_vector = [list(map(to_num, poetry)) for poetry in poetrys]
# [[314, 3199, 367, 1556, 26, 179, 680, 0, 3199, 41, 506, 40, 151, 4, 98, 1],
# [339, 3, 133, 31, 302, 653, 512, 0, 37, 148, 294, 25, 54, 833, 3, 1, 965, 1315, 377, 1700, 562, 21, 37, 0, 2, 1253, 21, 36, 264, 877, 809, 1]
# ....]

batch_size = 1
n_chunk = len(poetrys_vector) // batch_size
x_batches = []
y_batches = []
for i in range(n_chunk):
    start_index = i * batch_size
    end_index = start_index + batch_size

    batches = poetrys_vector[start_index:end_index]
    length = max(map(len, batches))
    xdata = np.full((batch_size, length), word_num_map[' '], np.int32)
    for row in range(batch_size):
        xdata[row, :len(batches[row])] = batches[row]
    ydata = np.copy(xdata)
    ydata[:, :-1] = xdata[:, 1:]

    # xdata             ydata
    # [6,2,4,6,9]       [2,4,6,9,9]
    # [1,4,2,8,5]       [4,2,8,5,5]
    
    x_batches.append(xdata)
    y_batches.append(ydata)

# ---------------------------------------RNN--------------------------------------#

input_data = tf.placeholder(tf.int32, [batch_size, None])
output_targets = tf.placeholder(tf.int32, [batch_size, None])


# Define the RNN
def neural_network(model='lstm', rnn_size=128, num_layers=2):
    if model == 'rnn':
        cell_fun = tf.nn.rnn_cell.BasicRNNCell
    elif model == 'gru':
        cell_fun = tf.nn.rnn_cell.GRUCell
    elif model == 'lstm':
        cell_fun = tf.nn.rnn_cell.BasicLSTMCell

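    # Only the LSTM cell accepts state_is_tuple; the RNN and GRU cells take just the size
    cell = cell_fun(rnn_size, state_is_tuple=True) if model == 'lstm' else cell_fun(rnn_size)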
    cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)

    initial_state = cell.zero_state(batch_size, tf.float32)

    with tf.variable_scope('rnnlm'):
        softmax_w = tf.get_variable("softmax_w", [rnn_size, len(words) + 1])
        softmax_b = tf.get_variable("softmax_b", [len(words) + 1])
        with tf.device("/cpu:0"):
            embedding = tf.get_variable("embedding", [len(words) + 1, rnn_size])
            inputs = tf.nn.embedding_lookup(embedding, input_data)

    outputs, last_state = tf.nn.dynamic_rnn(cell, inputs, initial_state=initial_state, scope='rnnlm')
    output = tf.reshape(outputs, [-1, rnn_size])

    logits = tf.matmul(output, softmax_w) + softmax_b
    probs = tf.nn.softmax(logits)
    return logits, last_state, probs, cell, initial_state


# -------------------------------Generate a poem---------------------------------#
# Use the trained model

def gen_poetry():
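    def to_word(weights):
        # Sample from the distribution by inverting its cumulative sum
        t = np.cumsum(weights)
        print(t)
        # Scale the draw by t[-1] so float rounding cannot push the index past the end,
        # and clamp it because the last output slot (unknown character) has no entry in words
        sample = int(np.searchsorted(t, np.random.rand() * t[-1]))
        return words[min(sample, len(words) - 1)]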

    _, last_state, probs, cell, initial_state = neural_network()

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        saver = tf.train.Saver(tf.global_variables())
        saver.restore(sess, './poetry.module-49')

        state_ = sess.run(cell.zero_state(batch_size, tf.float32))

        first_letter = '唐'
        x = np.array([list(map(word_num_map.get, first_letter))])
        print(x.shape)
        print(x)
        [probs_, state_] = sess.run([probs, last_state],
                                    feed_dict={input_data: x, initial_state: state_})
        print(probs_)
        word = to_word(probs_)
        # word = words[np.argmax(probs_)]
        poem = first_letter + ''
        while word != '[' and word != ']':
            poem += word
            x = np.zeros((1, 1), dtype=np.int32)
            x[0, 0] = word_num_map[word]
            [probs_, state_] = sess.run([probs, last_state],
                                        feed_dict={input_data: x, initial_state: state_})
            word = to_word(probs_)
            # word = words[np.argmax(probs_)]
        return poem


print(gen_poetry())
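
The sampling done in to_word can be tried on its own, without a trained model. The sketch below assumes a toy probability vector whose last slot plays the role of the out-of-vocabulary ID; the function name sample_index, the seed, and the probabilities are all hypothetical. It draws from the distribution with a cumulative sum and np.searchsorted, and guards against two edge cases: a cumulative sum that ends slightly below 1, and an index that falls on the slot with no matching character.

```python
import numpy as np


def sample_index(weights, vocab_size, rng):
    """Draw one index from a probability vector by inverting its cumulative sum."""
    cdf = np.cumsum(np.ravel(weights))
    # Scale the draw by cdf[-1] so a sum slightly below 1 cannot overshoot the array.
    r = rng.random() * cdf[-1]
    idx = int(np.searchsorted(cdf, r))
    # Clamp: the final slot stands for unknown characters and has no vocabulary entry.
    return min(idx, vocab_size - 1)


rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
probs = np.array([[0.1, 0.6, 0.3, 0.0]])  # toy distribution, shaped (1, 4) like the model output
draws = [sample_index(probs, vocab_size=3, rng=rng) for _ in range(10000)]
counts = np.bincount(draws, minlength=3)
print(counts / counts.sum())  # empirical frequencies, expected to be close to [0.1, 0.6, 0.3]
```

Sampling instead of taking the argmax (the commented-out alternative in gen_poetry) is what makes repeated runs with the same first character produce different poems.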

3. Results

Prompt: 唐
唐鶋庭浪汉长,受素汀筵昼竹。春悲烧谷女屈隋,白花长孔自琵设。万班妄官赋意在,分首殊絮共频心。

Prompt: 春
春万赐绶雪雪芳,袂楼绕日月边时。旬文未必歌似泪,懒采秦愁未能功。经夜旧月暮不病,应饮尔只肯先无。

From: https://www.cnblogs.com/cavalier-chen/p/18149652
