LLM Exploration: Adding Multi-Turn Conversation to the ChatGLM2 gRPC Backend

Tags: completion ChatRequest gRPC ChatGLM2 request current LLM history

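Preface

When I built AIHub earlier, I hooked up ChatGLM and other open-source large models over gRPC. For the LLM side I put together a framework called StarAI, roughly a simplified langchain, which makes it fairly easy to combine various large models and their supporting components.

The overall design follows the OpenAI API to keep the learning curve low. To move fast, though, I only built a simple gRPC interface at the time, and multi-turn conversation was never implemented. This post fills in that gap.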

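Overview

The system has two parts, an LLM backend and a client. The LLM backend exposes its interface over gRPC, and the client is AIHub, which I built with Blazor.

So this change touches the following places:

proto
Client - C# code
AIHub page - Blazor razor code
gRPC server - Python code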

Modifying the proto

Let's rework the proto file.

syntax = "proto3";

import "google/protobuf/wrappers.proto";

option csharp_namespace = "AIHub.RPC";

package aihub;

service ChatHub {
  rpc Chat (ChatRequest) returns (ChatReply);
  rpc StreamingChat (ChatRequest) returns (stream ChatReply);
}

message ChatRequest {
  string prompt = 1;
  repeated Message history = 2;
  int32 max_length = 3;
  float top_p = 4;
  float temperature = 5;
}

message Message {
  string role = 1;
  string content = 2;
}

message ChatReply {
  string response = 1;
}

This adds a Message type, plus a history field on ChatRequest that carries the conversation history.
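
One detail of this schema is worth keeping in mind: in proto3, scalar fields that a client leaves unset arrive as zero values (0 for max_length, 0.0 for top_p and temperature). The sketch below shows one way a Python server might fall back to defaults in that case. The names `SamplingParams`, `DEFAULTS`, and `resolve_sampling_params` are hypothetical, and the default values are simply the ones the C# client in this post sends.

```python
from typing import NamedTuple


class SamplingParams(NamedTuple):
    max_length: int
    top_p: float
    temperature: float


# Hypothetical defaults, chosen to match the values the C# client sends.
DEFAULTS = SamplingParams(max_length=2048, top_p=0.75, temperature=0.95)


def resolve_sampling_params(max_length: int, top_p: float, temperature: float) -> SamplingParams:
    """Replace proto3 zero values (field left unset) with defaults."""
    return SamplingParams(
        max_length=max_length if max_length > 0 else DEFAULTS.max_length,
        top_p=top_p if top_p > 0 else DEFAULTS.top_p,
        temperature=temperature if temperature > 0 else DEFAULTS.temperature,
    )
```

With this approach a client cannot explicitly request a value of 0. If that distinction matters, the wrapper types from google/protobuf/wrappers.proto (already imported by the proto file) are one way to tell "unset" apart from "zero".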

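Modifying the C# gRPC client code

After updating the proto, rebuilding the project regenerates the C# client code. Now let's update the calling code.

ChatRequest now has a History property of type RepeatedField<Message>. The property is read-only, so the conversation history can't be assigned directly; it has to be added to the existing collection on each request.

For convenience, I wrapped the creation of the ChatRequest object in the following method: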

private ChatRequest GetRequest(string prompt, List<Message>? history = null) {
  var request = new ChatRequest {
    Prompt = prompt,
    MaxLength = 2048,
    TopP = 0.75f,
    Temperature = 0.95f
  };

  if (history != null) {
    request.History.AddRange(history);
  }

  return request;
}

Next, rewrite the two chat methods to take an additional history parameter:

public async Task<string> Chat(string prompt, List<Message>? history = null) {
  var resp = await _client.ChatAsync(GetRequest(prompt, history));
  return RenderText(resp.Response);
}

public async IAsyncEnumerable<string> StreamingChat(string prompt, List<Message>? history = null) {
  using var call = _client.StreamingChat(GetRequest(prompt, history));
  await foreach (var resp in call.ResponseStream.ReadAllAsync()) {
    yield return RenderText(resp.Response);
  }
}

Done.
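
The methods above accept a history, but the caller still has to build it up turn by turn. As a language-neutral illustration, here is a minimal Python sketch of that bookkeeping. The `Conversation` class and the `send` callable are hypothetical stand-ins for whatever actually performs the gRPC call; they are not part of AIHub.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]  # {'role': ..., 'content': ...}, OpenAI style


class Conversation:
    """Keeps OpenAI-style history for one chat session (hypothetical helper)."""

    def __init__(self, send: Callable[[str, List[Message]], str]):
        # `send(prompt, history)` is assumed to call the backend and return the reply text
        self._send = send
        self.history: List[Message] = []

    def ask(self, prompt: str) -> str:
        # The new prompt travels separately; history holds only completed turns
        reply = self._send(prompt, list(self.history))
        self.history.append({'role': 'user', 'content': prompt})
        self.history.append({'role': 'assistant', 'content': reply})
        return reply
```

The history is only extended after the reply arrives, so a failed call leaves the stored conversation unchanged.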

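Modifying the Python gRPC server code

First, let's see how ChatGLM2 takes in a conversation.

Debugging the official demo shows that the history passed to the model is a list of tuples, each tuple being one question-and-answer exchange. It's a slightly odd format.

history = [('question 1', 'answer 1'), ('question 2', 'answer 2')]

AIHub conversations, however, follow the OpenAI approach and look like this:

history = [
  {'role': 'user', 'content': 'question 1'},
  {'role': 'assistant', 'content': 'answer 1'},
  {'role': 'user', 'content': 'question 2'},
  {'role': 'assistant', 'content': 'answer 2'},
]

So the OpenAI conversation format needs to be converted into the ChatGLM format.

Here's the code: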

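from typing import List

def messages_to_tuple_history(messages: List[chat_pb2.Message]):
    """Convert a list of chat messages into the list-of-tuples form ChatGLM expects."""
    history = []
    current_completion = ['', '']  # [question, answer] of the pair being built
    is_enter_completion = False    # True once a user message has opened a pair

    for item in messages:
        # A user message opens a new question/answer pair
        if not is_enter_completion and item.role == 'user':
            is_enter_completion = True

        # Messages outside an open pair (e.g. an assistant message with no
        # preceding user message) are skipped
        if is_enter_completion:
            if item.role == 'user':
                # Consecutive user messages are merged, separated by a blank line
                if len(current_completion[0]) > 0:
                    current_completion[0] = f"{current_completion[0]}\n\n{item.content}"
                else:
                    current_completion[0] = item.content
            if item.role == 'assistant':
                if len(current_completion[1]) > 0:
                    current_completion[1] = f"{current_completion[1]}\n\n{item.content}"
                else:
                    current_completion[1] = item.content

                # An assistant message closes the pair
                is_enter_completion = False
                history.append((current_completion[0], current_completion[1]))
                current_completion = ['', '']

    # A trailing user message without a reply is not added to the history
    return history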

For now only the user and assistant roles are handled. OpenAI also has system and function roles; system is fairly easy to handle and could be expressed like this:

[('system prompt1', ''), ('system prompt2', '')]

I haven't tested this, though, and don't need it for now, so it's not in the code.
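
For reference, a minimal sketch of what that system handling might look like, written against plain dicts rather than the generated chat_pb2.Message type so it can run on its own. The function name `dict_messages_to_tuple_history` is hypothetical, and whether ChatGLM2 actually responds well to `(system prompt, '')` pairs remains untested, as noted above.

```python
from typing import Dict, List, Tuple


def dict_messages_to_tuple_history(messages: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Convert OpenAI-style messages to (question, answer) tuples, including system."""
    history: List[Tuple[str, str]] = []
    pending_user: List[str] = []  # user messages waiting for an assistant reply

    for msg in messages:
        role, content = msg['role'], msg['content']
        if role == 'system':
            # Assumption from the text above: a system prompt becomes a pair with an empty answer
            history.append((content, ''))
        elif role == 'user':
            pending_user.append(content)
        elif role == 'assistant' and pending_user:
            # Consecutive user messages are merged with a blank line, then the pair is closed
            history.append(('\n\n'.join(pending_user), content))
            pending_user = []
        # Any other role, or an assistant message with no open question, is skipped

    return history
```

As in the original function, a trailing user message that has no reply yet is left out of the history.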

Next, update the two chat methods:

class ChatService(chat_pb2_grpc.ChatHubServicer):
    def Chat(self, request: chat_pb2.ChatRequest, context):
        response, history = model.chat(
            tokenizer,
            request.prompt,
            history=messages_to_tuple_history(request.history),
            max_length=request.max_length,
            top_p=request.top_p,
            temperature=request.temperature)
        torch_gc()
        return chat_pb2.ChatReply(response=response)

    def StreamingChat(self, request: chat_pb2.ChatRequest, context):
        current_length = 0
        for response, history in model.stream_chat(
                tokenizer,
                request.prompt,
                history=messages_to_tuple_history(request.history),
                max_length=request.max_length,
                top_p=request.top_p,
                temperature=request.temperature,
                return_past_key_values=False):

            print(response[current_length:], end="", flush=True)
            yield chat_pb2.ChatReply(response=response)
            current_length = len(response)

        torch_gc()
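
Note that in StreamingChat each ChatReply carries the full text generated so far; only the server-side print slices off the new part. A consumer that wants incremental output has to do the same slicing. A minimal sketch, where `iter_deltas` is a hypothetical helper name and the input is any iterable of cumulative response strings:

```python
from typing import Iterable, Iterator


def iter_deltas(cumulative: Iterable[str]) -> Iterator[str]:
    """Yield only the newly added text from a stream of cumulative responses."""
    seen = 0  # number of characters already emitted
    for text in cumulative:
        delta = text[seen:]
        seen = len(text)
        if delta:  # skip chunks that add nothing new
            yield delta
```

This assumes each response extends the previous one, which is what the slicing in the server's print statement relies on as well.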

Also, remember to free GPU memory after each conversation finishes:

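import torch

# CUDA_DEVICE is defined elsewhere in the server script (assumed to be a device string such as "cuda:0")
def torch_gc():
    """Release cached GPU memory once a request has finished."""
    if torch.cuda.is_available():
        with torch.cuda.device(CUDA_DEVICE):
            torch.cuda.empty_cache()   # release unused cached GPU memory
            torch.cuda.ipc_collect()   # clean up memory held by CUDA IPC handles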

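And that's it.

PS: loguru is a nice logging library for Python. I only discovered it recently and it works well.

Summary

Developing against gRPC is still somewhat cumbersome, mostly because debugging is awkward. I'm considering switching everything to OpenAI-style API calls. Someone on GitHub has contributed an OpenAI-compatible interface for ChatGLM, which I may look into later.

For vision workloads, though, I'll stick with gRPC, since it transfers data more efficiently. Large language models can get away with HTTP EventSource mainly because the data volume is small, and secondarily because the conversation is one-directional: the user asks the model, and the model never sends messages to the user on its own initiative.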

From: https://www.cnblogs.com/deali/p/17774304.html