Extending Jupyter: A Custom Development Approach (1)


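The previous article introduced the Jupyter ecosystem and how its key components work. Building on that, this article describes an approach to custom development on top of Jupyter. It starts with the project requirements, then covers the architecture design and a demo implementation, and ends with a summary.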

Requirements

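The goal is to build a Notebook Service for a graph data management and analysis BI platform, supporting data exploration, analysis jobs, SQL operations, Spark operations, and similar tasks. The platform currently has a single-tenant, B2B architecture. It is typically used by a company's data analysis team of a dozen or so people, only a few of whom actually write code in notebooks. Performance and availability requirements are therefore lower than for a consumer-facing (B2C) product.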

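In summary, our Notebook Service should do the following:

Provide users with a ready-to-use environment
Isolate each user's environment so that users do not interfere with one another
Connect platform resources such as Spark, ADB, and OSS so that users can explore and analyze data

The current plan is to get the architecture running quickly with the built-in IPython kernel, and to connect the platform's object storage, database, and Spark cluster later.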

Architecture Design

After some research, we decided to combine the following components:

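Frontend: the classic Jupyter Notebook component, because the project only needs basic notebook functionality.
Server: Jupyter Server, which forwards frontend requests to the kernel for execution.
Kernel: executes code and returns the results to the server. For now this is the IPython kernel; the platform's object storage, database, and Spark cluster will be connected later.
Deployment: JupyterHub, which serves notebooks to multiple users so that each user gets their own Jupyter Server and kernel.
- Authenticators: implement custom login and authentication by defining a custom Authenticator class and specifying it in the configuration file.
- Spawners: when a user logs in, JupyterHub starts a new instance (Jupyter Server + kernel) for that user, and it does so through a Spawner. Several Spawner implementations are available, including ones that start a new notebook server process on the local machine, a Docker container on the local machine, a new Pod in Kubernetes, or a new instance on YARN. These implementations are configurable, and if none of them fits you can write a completely new Spawner. When we later connect Spark, databases, and other resources, we can customize one of the existing Spawners to do so.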
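To make the Authenticator point concrete, here is a minimal sketch of a custom Authenticator declared directly in jupyterhub_config.py. The class name `DemoAuthenticator` and the in-memory `DEMO_USERS` table are hypothetical and exist only for illustration; a real deployment would check credentials against the platform's own account system instead.

```python
# jupyterhub_config.py (fragment): a sketch, not a production authenticator
import hmac

from jupyterhub.auth import Authenticator

# Hypothetical in-memory user table, for illustration only.
DEMO_USERS = {"alice": "change-me"}


class DemoAuthenticator(Authenticator):
    """Accept a login only if the username/password pair is in DEMO_USERS."""

    async def authenticate(self, handler, data):
        # `data` carries the fields submitted by the login form.
        username = data.get("username", "")
        password = data.get("password", "")
        expected = DEMO_USERS.get(username)
        # Constant-time comparison; returning None rejects the login.
        if expected is not None and hmac.compare_digest(
            password.encode("utf-8"), expected.encode("utf-8")
        ):
            return username
        return None


c = get_config()  # noqa: provided by JupyterHub when it loads this file
c.JupyterHub.authenticator_class = DemoAuthenticator
# JupyterHub 5 requires an explicit allow rule for authenticated users.
c.Authenticator.allow_all = True
```

Keeping the class inside the configuration file is convenient for a demo; for anything longer-lived it is usually easier to move it into an installable package and reference it by its import path.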

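The overall architecture is shown in the architecture diagram. After logging in and being authenticated through JupyterHub, each client reaches its own instance through the Jupyter Notebook frontend. Each instance is a Kubernetes Pod, and resources are isolated between Pods.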
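As a rough illustration of the one-Pod-per-user idea, the configuration fragment below selects KubeSpawner and caps each user's resources. It assumes the separate jupyterhub-kubespawner package is installed and that JupyterHub itself runs inside the cluster; the image name and the limits are placeholders, not values from this project.

```python
# jupyterhub_config.py (fragment): a sketch assuming jupyterhub-kubespawner is installed
c = get_config()  # noqa: provided by JupyterHub when it loads this file

# Start each user's server as its own Kubernetes Pod.
c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

# Hypothetical single-user image; replace with the image built for the platform.
c.KubeSpawner.image = "example.registry/notebook-service:latest"

# Per-user resource caps (placeholder values).
c.KubeSpawner.cpu_limit = 2
c.KubeSpawner.mem_limit = "4G"
```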

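Installation and Deployment

Next we deploy JupyterHub locally, installing each component from source. Every component can be extended through its extension mechanisms, but a more complex architecture may later require changes to the source code itself, so we install each component from source and deploy JupyterHub from those local checkouts. Once local development is done, the result can be packaged as an image and deployed to Kubernetes by following the community guides.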

# Install IPyKernel
pip install ipykernel -i https://pypi.tuna.tsinghua.edu.cn/simple

# Install Jupyter Server from source
git clone https://github.com/jupyter-server/jupyter_server.git
cd jupyter_server
pip install -v -e .
cd ../

# Install the JupyterLab frontend from source
# Note: installing JupyterLab directly also pulls in dependencies such as IPyKernel and Jupyter Server;
# they are installed separately in the earlier steps to make the process easier to follow
git clone https://github.com/jupyterlab/jupyterlab.git
cd jupyterlab
# If pip install -v -e . fails, try running yarn install first to install the frontend dependencies
yarn install
pip install -v -e .
# Build the frontend assets
jupyter lab build
cd ../

# Install the JupyterHub proxy, which routes each user's requests to the corresponding Jupyter Server
npm install -g configurable-http-proxy --registry=https://registry.npmmirror.com

# Install JupyterHub from source
git clone https://github.com/jupyterhub/jupyterhub.git
cd jupyterhub
pip install -v -e .
cd ../

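Looking back at the simplified architecture diagram from earlier, the commands above install all of the key components. JupyterHub manages multiple users, and each user has an independent single-user server instance consisting of the JupyterLab interface, Jupyter Server, and an IPython kernel. Users write and run code in JupyterLab, Jupyter Server handles the requests and communicates with the kernel to execute the code, and the results are returned to the user. Together this gives a multi-user interactive computing environment.

With the local Jupyter environment installed, the next step is to get it running.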

# Create a directory for the configuration file
mkdir jupyterhub-config && cd jupyterhub-config
# Generate the configuration file
jupyterhub --generate-config

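Edit the generated file as shown below. Each option is explained by the comments in the generated file, and the JupyterHub documentation is another reference.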

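# Use PAM for authentication: users log in with a valid operating-system username and password
c.JupyterHub.authenticator_class = 'jupyterhub.auth.PAMAuthenticator'

# Allow every user who authenticates successfully to access the Hub
c.Authenticator.allow_all = True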

Then run the following command.

# Run jupyterhub with the settings in jupyterhub_config.py
jupyterhub -f jupyterhub_config.py

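The console prints output similar to the log below.

Note that port 8081 is where the Hub API listens. It is used for communication with other components (such as the Proxy and the Spawner), is intended only for JupyterHub's internal traffic, and does not serve users directly.

Port 8000 is where the JupyterHub proxy listens. The proxy forwards user requests arriving on this port to the corresponding Jupyter Server. This port is normally exposed, and users reach the JupyterHub web interface through it. So open http://127.0.0.1:8000 in a browser and log in with your computer's account name and password.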

[I 2024-08-15 18:25:27.339 JupyterHub app:3307] Running JupyterHub version 5.2.0.dev
[I 2024-08-15 18:25:27.339 JupyterHub app:3337] Using Authenticator: jupyterhub.auth.PAMAuthenticator-5.2.0.dev
[I 2024-08-15 18:25:27.339 JupyterHub app:3337] Using Spawner: jupyterhub.spawner.LocalProcessSpawner-5.2.0.dev
[I 2024-08-15 18:25:27.339 JupyterHub app:3337] Using Proxy: jupyterhub.proxy.ConfigurableHTTPProxy-5.2.0.dev
[I 2024-08-15 18:25:27.342 JupyterHub app:1882] Writing cookie_secret to /jupyterhub_cookie_secret
[I 2024-08-15 18:25:27.764 alembic.runtime.migration migration:215] Context impl SQLiteImpl.
[I 2024-08-15 18:25:27.764 alembic.runtime.migration migration:218] Will assume non-transactional DDL.
[I 2024-08-15 18:25:27.768 alembic.runtime.migration migration:623] Running stamp_revision  -> 4621fec11365
[I 2024-08-15 18:25:27.827 JupyterHub proxy:556] Generating new CONFIGPROXY_AUTH_TOKEN
[I 2024-08-15 18:25:27.844 JupyterHub app:3376] Initialized 0 spawners in 0.003 seconds
[I 2024-08-15 18:25:27.846 JupyterHub metrics:373] Found 0 active users in the last ActiveUserPeriods.twenty_four_hours
[I 2024-08-15 18:25:27.846 JupyterHub metrics:373] Found 0 active users in the last ActiveUserPeriods.seven_days
[I 2024-08-15 18:25:27.846 JupyterHub metrics:373] Found 0 active users in the last ActiveUserPeriods.thirty_days
[W 2024-08-15 18:25:27.846 JupyterHub proxy:748] Running JupyterHub without SSL.  I hope there is SSL termination happening somewhere else...
[I 2024-08-15 18:25:27.846 JupyterHub proxy:752] Starting proxy @ http://:8000
18:25:28.383 [ConfigProxy] info: Proxying http://*:8000 to (no default)
18:25:28.384 [ConfigProxy] info: Proxy API at http://127.0.0.1:8001/api/routes
18:25:28.763 [ConfigProxy] info: 200 GET /api/routes
[I 2024-08-15 18:25:28.763 JupyterHub app:3690] Hub API listening on http://127.0.0.1:8081/hub/
18:25:28.765 [ConfigProxy] info: 200 GET /api/routes
[I 2024-08-15 18:25:28.765 JupyterHub proxy:477] Adding route for Hub: / => http://127.0.0.1:8081
18:25:28.767 [ConfigProxy] info: Adding route / -> http://127.0.0.1:8081
18:25:28.768 [ConfigProxy] info: Route added / -> http://127.0.0.1:8081
18:25:28.768 [ConfigProxy] info: 201 POST /api/routes/
[I 2024-08-15 18:25:28.768 JupyterHub app:3731] JupyterHub is now running at http://:8000
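
The log above shows the proxy registering a single route from / to the Hub on port 8081. Once a user's server starts, JupyterHub adds a more specific route under /user/<name>/, and the proxy picks the longest matching prefix. The toy model below illustrates that lookup in plain Python. It is a simplified sketch, not the code configurable-http-proxy actually runs, and the user name and port 50001 are made up.

```python
def resolve(routes, path):
    """Return the target of the longest route prefix that matches `path`."""
    best = None
    for prefix in routes:
        if path.startswith(prefix) and (best is None or len(prefix) > len(best)):
            best = prefix
    return routes[best] if best is not None else None


# Default route registered at startup, as in the log above.
routes = {"/": "http://127.0.0.1:8081"}

# Hypothetical route added after user "alice" starts her server.
routes["/user/alice/"] = "http://127.0.0.1:50001"

print(resolve(routes, "/hub/login"))       # falls through to the Hub
print(resolve(routes, "/user/alice/lab"))  # goes to alice's Jupyter Server
print(resolve(routes, "/user/bob/lab"))    # no server for bob yet, so the Hub handles it
```

This is why only port 8000 has to be reachable by users: the Hub and every single-user server sit behind the proxy, which decides per request where the traffic goes.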

If no kernel is available at runtime, install one with the commands below and then restart.

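# Check whether any kernels are installed
jupyter kernelspec list
# If none are listed, add IPyKernel
# This registers the current Python environment as a new Jupyter kernel in the user-level directory
python -m ipykernel install --user --name KERNEL_NAME --display-name "Kernel Display Name"
# To remove an installed kernel later, run the following command
jupyter kernelspec remove KERNEL_NAME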
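
For context, a kernelspec is a directory that contains a kernel.json file describing how to launch the kernel. The sketch below writes a file of that shape into a temporary directory using only the standard library. It illustrates the format rather than reproducing what `ipykernel install` does internally, and the names `demo-kernel` and `Demo Kernel` are placeholders.

```python
import json
import sys
import tempfile
from pathlib import Path


def write_kernelspec(base_dir, name, display_name):
    """Create <base_dir>/<name>/kernel.json and return its path."""
    spec_dir = Path(base_dir) / name
    spec_dir.mkdir(parents=True, exist_ok=True)
    spec = {
        # Command used to start the kernel; Jupyter substitutes
        # {connection_file} with the path of the connection file.
        "argv": [sys.executable, "-m", "ipykernel_launcher", "-f", "{connection_file}"],
        "display_name": display_name,
        "language": "python",
    }
    path = spec_dir / "kernel.json"
    path.write_text(json.dumps(spec, indent=2), encoding="utf-8")
    return path


# Write into a throwaway directory so no real kernel directory is touched.
demo_dir = tempfile.mkdtemp()
spec_path = write_kernelspec(demo_dir, "demo-kernel", "Demo Kernel")
print(spec_path.read_text(encoding="utf-8"))
```

Because `argv` points at a specific Python interpreter, each registered kernel is tied to the environment it was installed from, which is what makes a per-environment kernel possible.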

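Custom Authentication

Debugging

First, here is how to run Jupyter in PyCharm to make development and debugging easier (this should be helpful if you are not a Python developer).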

From： https://blog.csdn.net/trl_jdi/article/details/141299839
