首页 > 其他分享 >私密离线聊天新体验!llama-gpt聊天机器人:极速、安全、搭载Llama 2

私密离线聊天新体验!llama-gpt聊天机器人:极速、安全、搭载Llama 2

时间:2023-10-11 10:55:10浏览次数:40  
标签:llama 新体验 RAM 离线 tokens sec 聊天 Llama GB

“私密离线聊天新体验!llama-gpt聊天机器人:极速、安全、搭载Llama 2,尽享Code Llama支持!”

一个自托管的、离线的、类似chatgpt的聊天机器人。由美洲驼提供动力。100%私密,没有数据离开您的设备。

Demo

https://github.com/getumbrel/llama-gpt/assets/10330103/5d1a76b8-ed03-4a51-90bd-12ebfaf1e6cd

1.支持模型

Currently, LlamaGPT supports the following models. Support for running custom models is on the roadmap.

Model name Model size Model download size Memory required
Nous Hermes Llama 2 7B Chat (GGML q4_0) 7B 3.79GB 6.29GB
Nous Hermes Llama 2 13B Chat (GGML q4_0) 13B 7.32GB 9.82GB
Nous Hermes Llama 2 70B Chat (GGML q4_0) 70B 38.87GB 41.37GB
Code Llama 7B Chat (GGUF Q4_K_M) 7B 4.24GB 6.74GB
Code Llama 13B Chat (GGUF Q4_K_M) 13B 8.06GB 10.56GB
Phind Code Llama 34B Chat (GGUF Q4_K_M) 34B 20.22GB 22.72GB

1.1 安装LlamaGPT 在 umbrelOS

Running LlamaGPT on an umbrelOS home server is one click. Simply install it from the Umbrel App Store.

LlamaGPT on Umbrel App Store

1.2 安装LlamaGPT on M1/M2 Mac

Make sure your have Docker and Xcode installed.

Then, clone this repo and cd into it:

git clone https://github.com/getumbrel/llama-gpt.git
cd llama-gpt

Run LlamaGPT with the following command:

./run-mac.sh --model 7b

You can access LlamaGPT at http://localhost:3000.

To run 13B or 70B chat models, replace 7b with 13b or 70b respectively.
To run 7B, 13B or 34B Code Llama models, replace 7b with code-7b, code-13b or code-34b respectively.

To stop LlamaGPT, do Ctrl + C in Terminal.

1.3 在 Docker上安装

You can run LlamaGPT on any x86 or arm64 system. Make sure you have Docker installed.

Then, clone this repo and cd into it:

git clone https://github.com/getumbrel/llama-gpt.git
cd llama-gpt

Run LlamaGPT with the following command:

./run.sh --model 7b

Or if you have an Nvidia GPU, you can run LlamaGPT with CUDA support using the --with-cuda flag, like:

./run.sh --model 7b --with-cuda

You can access LlamaGPT at http://localhost:3000.

To run 13B or 70B chat models, replace 7b with 13b or 70b respectively.
To run Code Llama 7B, 13B or 34B models, replace 7b with code-7b, code-13b or code-34b respectively.

To stop LlamaGPT, do Ctrl + C in Terminal.

Note: On the first run, it may take a while for the model to be downloaded to the /models directory. You may also see lots of output like this for a few minutes, which is normal:

llama-gpt-llama-gpt-ui-1       | [INFO  wait] Host [llama-gpt-api-13b:8000] not yet available...

After the model has been automatically downloaded and loaded, and the API server is running, you'll see an output like:

llama-gpt-ui_1   | ready - started server on 0.0.0.0:3000, url: http://localhost:3000

You can then access LlamaGPT at http://localhost:3000.


1.4 在Kubernetes安装

First, make sure you have a running Kubernetes cluster and kubectl is configured to interact with it.

Then, clone this repo and cd into it.

To deploy to Kubernetes first create a namespace:

kubectl create ns llama

Then apply the manifests under the /deploy/kubernetes directory with

kubectl apply -k deploy/kubernetes/. -n llama

Expose your service however you would normally do that.

2.OpenAI兼容API

Thanks to llama-cpp-python, a drop-in replacement for OpenAI API is available at http://localhost:3001. Open http://localhost:3001/docs to see the API documentation.

  • 基线

We've tested LlamaGPT models on the following hardware with the default system prompt, and user prompt: "How does the universe expand?" at temperature 0 to guarantee deterministic results. Generation speed is averaged over the first 10 generations.

Feel free to add your own benchmarks to this table by opening a pull request.

2.1 Nous Hermes Llama 2 7B Chat (GGML q4_0)

Device Generation speed
M1 Max MacBook Pro (64GB RAM) 54 tokens/sec
GCP c2-standard-16 vCPU (64 GB RAM) 16.7 tokens/sec
Ryzen 5700G 4.4GHz 4c (16 GB RAM) 11.50 tokens/sec
GCP c2-standard-4 vCPU (16 GB RAM) 4.3 tokens/sec
Umbrel Home (16GB RAM) 2.7 tokens/sec
Raspberry Pi 4 (8GB RAM) 0.9 tokens/sec

2.2 Nous Hermes Llama 2 13B Chat (GGML q4_0)

Device Generation speed
M1 Max MacBook Pro (64GB RAM) 20 tokens/sec
GCP c2-standard-16 vCPU (64 GB RAM) 8.6 tokens/sec
GCP c2-standard-4 vCPU (16 GB RAM) 2.2 tokens/sec
Umbrel Home (16GB RAM) 1.5 tokens/sec

2.3 Nous Hermes Llama 2 70B Chat (GGML q4_0)

Device Generation speed
M1 Max MacBook Pro (64GB RAM) 4.8 tokens/sec
GCP e2-standard-16 vCPU (64 GB RAM) 1.75 tokens/sec
GCP c2-standard-16 vCPU (64 GB RAM) 1.62 tokens/sec

2.4 Code Llama 7B Chat (GGUF Q4_K_M)

Device Generation speed
M1 Max MacBook Pro (64GB RAM) 41 tokens/sec

2.5 Code Llama 13B Chat (GGUF Q4_K_M)

Device Generation speed
M1 Max MacBook Pro (64GB RAM) 25 tokens/sec

2.6 Phind Code Llama 34B Chat (GGUF Q4_K_M)

Device Generation speed
M1 Max MacBook Pro (64GB RAM) 10.26 tokens/sec

更多优质内容请关注公号:汀丶人工智能;会提供一些相关的资源和优质文章,免费获取阅读。

标签:llama,新体验,RAM,离线,tokens,sec,聊天,Llama,GB
From: https://www.cnblogs.com/ting1/p/17756541.html

相关文章

  • chat智能聊天机器人api免费分享
    智能聊天机器人,上知天文,下知地理。接口地址:http://youlanjihua.com/youlanApi/v1/chat/index.php?secret=&content=请求方式:GET请求参数:​secret关注公众号【幽蓝计划】发送‘密钥’获取content提问的问题返回示例:​{"data":{"content":"你好,我是小i机器人,一个大......
  • 服务器数据恢复-DS5300存储多块硬盘出现坏道离线导致raid5阵列崩溃的数据恢复案例
    服务器数据恢复环境:某单位一台DS5300存储,1个主机+4个扩展柜,组建了2组RAID5(一组27块硬盘,一组23块盘)。27块盘的那组RAID5阵列存放Oracle数据库文件,存储系统一共分了11个卷。服务器故障:27块盘的那组RAID5阵列中有2块磁盘故障离线,导致RAID阵列崩溃,存储不可用,存储设备已经过保。服务......
  • centos 离线安装docker
    最佳方案就是查看官方文档了https://docs.docker.com/engine/install/centos/#install-from-a-package在docker官网找到centos安装目录,里面有个Installfromapackage 章节,其他系统也可以在相应系统类别里找到对应章节 Installfromapackage Ifyoucan'tuseDocker'......
  • CentOS环境:离线安装配置gitlab(适用于内网环境)
    前言:  此篇是为了完结真实的物理隔离环境下、验证yum缓存的文件包安装配置是否成功,对上篇在线安装文章的补充。1.互联网电脑环境准备1.1电脑环境配置信息IP:192.168.31.164OS:CentOSLinuxrelease7.9.2009(Core)1.2清除yum的rpm包缓存数据包清除前的数据记录:[root@bdlab......
  • 安防视频/集中云存储平台EasyCVR(V3.3)部分通道显示离线该如何解决?
    安防视频监控/视频集中存储/云存储/磁盘阵列EasyCVR平台可拓展性强、视频能力灵活、部署轻快,可支持的主流标准协议有国标GB28181、RTSP/Onvif、RTMP等,以及支持厂家私有协议与SDK接入,包括海康Ehome、海大宇等设备的SDK等。平台既具备传统安防视频监控的能力,也具备接入AI智能分析的能......
  • 视频监控/安防视频监控平台EasyCVR配置集群后有一台显示离线是什么原因?
    开源EasyDarwin视频监控TSINGSEE青犀视频平台EasyCVR能在复杂的网络环境中,将分散的各类视频资源进行统一汇聚、整合、集中管理,在视频监控播放上,TSINGSEE青犀视频安防监控汇聚平台可支持1、4、9、16个画面窗口播放,可同时播放多路视频流,也能支持视频定时轮播。视频监控汇聚平台EasyCV......
  • 离线安装Kubernetes(K8s)方法
    1简述1.1搭建方法介绍 K8s有两种搭建方式:1.使用K8s官方发布的二进制包搭建环境2.使用Kubeadm搭建(推荐该种方式) 本文的K8s搭建流程均基于Kubeadm方式1.2Kubeadm简介 Kubeadm是一款旨在为创建Kubernetes集群提供最佳实践“快速路径”的工具。它执行必要的操作,以用户......
  • 汽车之家主机厂离线化 H5 Hybrid 实践
    1.背景H5页面做秒开优化是业务的常规操作,一般正常通过网络请求的H5页面,我们都是围绕资源加载速度优化展开。优化手段主要分两个方向,一个是提升网络速度,一个是减少资源大小。提升网络速度,一般的手段有DNS预解析、多域名、升级HTTP2、使用CDN、SSR。而即使有......
  • 汽车之家主机厂离线化 H5 Hybrid 实践
    1.背景H5页面做秒开优化是业务的常规操作,一般正常通过网络请求的H5页面,我们都是围绕资源加载速度优化展开。优化手段主要分两个方向,一个是提升网络速度,一个是减少资源大小。提升网络速度,一般的手段有DNS预解析、多域名、升级HTTP2、使用CDN、SSR。而即使有......
  • 汽车之家主机厂离线化 H5 Hybrid 实践
    1.背景H5页面做秒开优化是业务的常规操作,一般正常通过网络请求的H5页面,我们都是围绕资源加载速度优化展开。优化手段主要分两个方向,一个是提升网络速度,一个是减少资源大小。提升网络速度,一般的手段有DNS预解析、多域名、升级HTTP2、使用CDN、SSR。而即使有......