FVFL: A Flexible and Verifiable Privacy-Preserving Federated Learning Scheme
Source
IoT 2024
Link: https://ieeexplore.ieee.org/document/10510350
Overview
Abstract
- Federated learning can solve the data-island problem, but it also introduces more serious data-privacy concerns.
- Furthermore, in the process of multi-source data collaboration, the efficiency of the whole federated learning system is usually low.
- This paper proposes a scheme named FVFL, which guarantees local data security and resistance to collusion attacks and, more importantly, supports flexible client participation in federated learning.
- Paillier encryption and secret sharing are adopted to guarantee clients' data security and resistance to collusion attacks; moreover, the encryption mechanism allows clients to participate in federated learning flexibly, and the correctness of the encryption algorithm is not affected by client dropout.
From CSDN:
Secret sharing is a technique for splitting and storing a secret, intended to resist multi-party collusion and intrusion to some extent.
The core idea is to split a secret into n shares and distribute them to participants P1, ..., Pn.
n-out-of-n secret sharing requires all participants to combine their shares to recover the secret; t-out-of-n secret sharing requires at least any t participants to combine theirs.
A secret-sharing scheme therefore consists of a share-distribution algorithm and a secret-reconstruction algorithm.
Since the distribution method determines the reconstruction method, secret sharing is divided, based on distribution, into additive secret sharing based on bit operations and linear secret sharing based on linear algebra.
Original link: https://blog.csdn.net/qq_38798147/article/details/126700015
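To make the additive case concrete, here is a minimal sketch of n-out-of-n additive secret sharing over a public modulus (the modulus and function names are illustrative, not from the paper):

```python
import secrets

MOD = 2**61 - 1  # public modulus; illustrative choice

def share(secret: int, n: int) -> list[int]:
    """Split `secret` into n additive shares whose sum mod MOD recovers it."""
    shares = [secrets.randbelow(MOD) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % MOD)
    return shares

def reconstruct(shares: list[int]) -> int:
    """n-out-of-n reconstruction: every share is required."""
    return sum(shares) % MOD

s = 123456789
parts = share(s, 5)
assert reconstruct(parts) == s  # all 5 shares together recover the secret
# Any 4 of the 5 shares are uniformly random and reveal nothing about s.
```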
- The super-increasing sequence is introduced to reduce the communication overhead of the whole system; simulation results show that the reduction is significant.
- The Lagrange interpolation polynomial and secret sharing are introduced to implement a verification mechanism that prevents malicious forgery of aggregation results in the cloud. The verification mechanism ensures that clients obtain real and reliable aggregation results from the cloud.
- Moreover, the verification mechanism allows clients to participate in federated learning flexibly, and the correctness of the verification algorithm is not affected by client dropout. Experimental results show that the scheme achieves high accuracy and efficiency.
Index Terms—Privacy-preserving, Federated learning, Verifiable, Deep Learning.
Introduction
- With the emergence of artificial intelligence, processing and manipulating big data has become efficient and simple, and AI is widely applied in areas such as face recognition, online medical diagnosis, and speech recognition. An accurate network model requires large amounts of data, which are usually held by different owners.
- However, the data often contain sensitive information, and data owners cannot transmit them directly (e.g., under the GDPR). This creates the so-called data-island problem, which hinders the further development of AI.
Federated learning is an effective solution to the data-security problem: it keeps data local, and all participants can cooperate to build an effective global network model.
Yang Qiang et al. divided federated learning into horizontal federated learning (HFL), vertical federated learning (VFL), and federated transfer learning (FTL) according to the features of client data. These methods apply to almost all kinds of data islands and ensure data privacy.
Plain federated learning is not enough to fully ensure data security; privacy-computing techniques such as homomorphic encryption (HE), differential privacy (DP), and secure multi-party computation (MPC) must be added to guarantee it.
- Phong et al. pointed out that real user data can be obtained from the gradients shared during federated learning. Studies have shown that, using attacks against generative networks, a client's data can be recovered from the parameters shared in federated learning. Therefore, to build a secure and reliable federated learning system, privacy-computing techniques must be adopted.
Privacy-computing technology can guarantee the security of data under the semi-honest model, but it is not enough to prevent malicious attacks by participants in the federated learning system.
In practice, a federated learning client can drop out for a variety of reasons. The whole system must therefore be robust enough that the federated learning process can be completed even when not all clients are online at the same time. Moreover, for a federated learning system with verifiable aggregation results, the verifiability of the aggregation results should not be affected by the absence of a few clients.
In this paper, the client is modeled as a semi-honest threat: clients follow the protocol as specified. The server side, however, may host malicious attackers. Under this threat model, we introduce a scheme named FVFL, which ensures local data security and resistance to collusion attacks and, more importantly, supports flexible client participation in federated learning.
We adopt Paillier encryption and Shamir's secret sharing to guarantee clients' data security; the super-increasing sequence is introduced to reduce the communication overhead of the whole system; and the Lagrange interpolation polynomial and Shamir's secret sharing are introduced to implement a verification mechanism that prevents malicious forgery of aggregation results in the cloud.
Moreover, our verification mechanism allows clients to participate in federated learning flexibly.
In summary, our contributions are as follows:
- We exploit the super-increasing sequence to encode gradient data; more importantly, it significantly reduces the communication overhead of the whole system.
- We design an efficient encryption mechanism for privacy-preserving federated learning that resists collusion attacks and supports client dropout.
- We design an efficient verification mechanism for privacy-preserving federated learning that runs smoothly even if several clients drop out simultaneously.
- We analyze the utility of FVFL through extensive simulation experiments, which demonstrate that our scheme achieves high accuracy and efficiency with acceptable verification overhead.
Problem Statement
A. Problem Definition
In our scheme, there are three entities: the Trusted Authority (TA), the clients, and the central aggregation server. The TA distributes parameters to the clients, the clients use these parameters to encode and encrypt their gradients, and the server aggregates the model parameters.
In this model, three challenges should be considered.
- First, each client's gradient data may compromise the privacy of its local data; clients must be able to ensure local data security and privacy.
- Second, the server may be untrusted. The server's responsibility is to receive and aggregate the gradient data and return real, reliable aggregation results to each client. However, the server may suffer malicious attacks, or cut corners to save computing resources, so the gradient aggregation result cannot be trusted.
- Third, during the federated learning procedure clients may drop out for various reasons; the federated learning system should keep working normally, unaffected by clients' exits.
B. Threat Model and Goals
We define the threat model of FVFL as follows. We believe the TA is safe and reliable and will not be subjected to any malicious attacks. We assume the clients in the federated learning system are honest-but-curious, meaning they follow the protocol settings, although a client may go offline for various reasons. We assume there may be malicious behavior on the server side: the server may maliciously forge aggregation results, or try to compromise some clients to mount a collusion attack and steal other clients' confidential data.
Therefore, the goals of our FVFL model are that the local gradients of the clients remain secure, that clients can judge the reliability of the cloud's aggregation results, and that the whole system resists collusion attacks involving some of the clients. The whole system supports flexible client participation.
Collusion Attack:
In blockchain federated learning, a collusion attack refers to multiple participants cooperating to manipulate the model's training process or its output.
Such attacks may include:
Data tampering: multiple malicious participants collude to tamper with data or submit fake model updates, aiming to change the model's results or harm other participants.
Rejecting legitimate participation: multiple malicious participants may jointly refuse to accept or acknowledge legitimate model updates, blocking the contributions of legitimate participants.
These attacks threaten the security of model aggregation and the trustworthiness of its results in blockchain federated learning.
Defending against them requires secure encryption techniques, sound data-verification mechanisms, and decentralized governance to keep the training process secure and trustworthy.
Preliminaries
A. Federated Learning
(1) Deep Learning:
In general, a deep learning model consists of three layers: an input layer, hidden layers, and an output layer. We can describe a deep learning model as a function $f(x, \Theta) = y$, where $x$ is the user input, $\Theta$ denotes the model parameters, and $y$ is the output of $f$ under parameters $\Theta$.
(2) Stochastic Gradient Descent (SGD)
For gradient descent (GD), without loss of generality, consider $T$ training samples $(x, y)$ forming the dataset $D = \{(x_j, y_j),\ j = 1, 2, \ldots, T\}$. Based on this dataset, the loss function can be described as:

$$L(\Theta) = \frac{1}{T} \sum_{j=1}^{T} \ell\big(f(x_j, \Theta),\ y_j\big)$$

Unlike GD, in the $k$-th iteration SGD randomly selects a subset $D^{*k}$ instead of the whole dataset $D$. The loss function can be expressed as:

$$L_{D^{*k}}(\Theta) = \frac{1}{|D^{*k}|} \sum_{(x_j, y_j) \in D^{*k}} \ell\big(f(x_j, \Theta),\ y_j\big)$$

According to the above loss function, the gradient in the $k$-th iteration is $\omega^k = \nabla_\Theta L_{D^{*k}}(\Theta^k)$. Then the model parameters are updated as follows:

$$\Theta^{k+1} = \Theta^k - \eta\, \omega^k$$

where $\eta$ is the client's learning rate.
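As a concrete illustration of one SGD iteration under a squared-error loss (a minimal sketch; the toy data and batch size are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)  # toy dataset D
theta, eta = np.zeros(3), 0.1                           # parameters and learning rate

# one SGD iteration: sample a mini-batch D*k, compute the gradient, update theta
idx = rng.choice(len(X), size=16, replace=False)        # random subset D*k
pred = X[idx] @ theta
grad = 2 * X[idx].T @ (pred - y[idx]) / len(idx)        # gradient of the mean squared error
theta -= eta * grad                                     # theta^{k+1} = theta^k - eta * omega^k
```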
(3) Federated Aggregation
As shown in Fig. 1, federated learning can bridge various organizations to train a global model; this requires a central server that aggregates the parameters from the data owners.
In the $k$-th iteration of federated learning, client $i$ computes the gradient $\omega_i^k$, and the central server aggregates the gradients of the $m$ clients to obtain $\omega^k = \sum_{i=1}^{m} \omega_i^k$, then sends the aggregation result to each client. Each client updates its local model according to the aggregation result:

$$M_{k+1} = M_k - \eta\, \frac{\omega^k}{m}$$
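A minimal plaintext sketch of this aggregation and update step (FVFL performs the aggregation under encryption; the arrays here are illustrative):

```python
import numpy as np

def server_aggregate(client_grads: list[np.ndarray]) -> np.ndarray:
    """omega^k = sum_{i=1}^{m} omega_i^k : the server sums the m client gradients."""
    return np.sum(client_grads, axis=0)

def client_update(M_k: np.ndarray, omega_k: np.ndarray, eta: float, m: int) -> np.ndarray:
    """M_{k+1} = M_k - eta * omega^k / m : each client updates its local model."""
    return M_k - eta * omega_k / m

grads = [np.array([0.1, -0.2]), np.array([0.3, 0.1]), np.array([-0.1, 0.4])]  # m = 3 clients
omega = server_aggregate(grads)
M_next = client_update(np.zeros(2), omega, eta=0.05, m=len(grads))
```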
B. Paillier Encryption
The Paillier cryptosystem implements additively homomorphic encryption and is widely used in existing work. Concretely, the Paillier cryptosystem consists of three algorithms:
- (1) Key Generation
- (2) Encryption
- (3) Decryption
Notably, the Paillier cryptosystem is provably secure against chosen-plaintext attacks.
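A minimal textbook Paillier sketch showing key generation, encryption, decryption, and the additive homomorphism that FVFL relies on (toy primes, nowhere near secure parameters):

```python
from math import gcd
import secrets

# toy parameters; real deployments use >= 2048-bit N
p, q = 293, 433
N = p * q
N2 = N * N
g = N + 1                                      # standard choice of generator
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lambda = lcm(p-1, q-1)

def L(u: int) -> int:
    return (u - 1) // N

mu = pow(L(pow(g, lam, N2)), -1, N)            # mu = (L(g^lambda mod N^2))^{-1} mod N

def encrypt(m: int) -> int:
    while True:
        r = secrets.randbelow(N)               # random r in Z_N^*
        if r > 0 and gcd(r, N) == 1:
            break
    return (pow(g, m, N2) * pow(r, N, N2)) % N2   # c = g^m * r^N mod N^2

def decrypt(c: int) -> int:
    return (L(pow(c, lam, N2)) * mu) % N          # m = L(c^lambda mod N^2) * mu mod N

c1, c2 = encrypt(37), encrypt(5)
assert decrypt((c1 * c2) % N2) == 42           # Enc(37) * Enc(5) decrypts to 37 + 5
```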
C. Lagrange Interpolation
Given $n$ points, Lagrange interpolation finds the unique polynomial of degree $n-1$ passing through them. A brief introduction follows.

Given $n$ distinct interpolation points $x_i,\ i = 1, 2, \cdots, n$, and the corresponding values $f(x_i)$ of a function $f$, there is a unique polynomial of degree $n-1$ through them. This polynomial $L_n(x)$ can be written in Lagrange form:

$$L_n(x) = \sum_{i=1}^{n} f(x_i)\, l_i(x), \qquad l_i(x) = \prod_{j \ne i} \frac{x - x_j}{x_i - x_j}$$

From the above Lagrange interpolation formula, for a polynomial function of degree $n-1$, choosing any $n$ of its points is enough to recover the function expression.
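A minimal sketch of Lagrange interpolation in floating point (FVFL uses it over modular arithmetic, as in the Shamir sketch below):

```python
def lagrange_eval(points: list[tuple[float, float]], x: float) -> float:
    """Evaluate the unique degree-(n-1) polynomial through `points` at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        li = 1.0                               # basis polynomial l_i(x)
        for j, (xj, _) in enumerate(points):
            if j != i:
                li *= (x - xj) / (xi - xj)
        total += yi * li
    return total

pts = [(1, 2), (2, 5), (3, 10)]                # lies on f(x) = x^2 + 1
assert abs(lagrange_eval(pts, 4) - 17) < 1e-9  # recovers f(4) = 17
```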
D. Shamir's Secret Sharing
A secret-sharing scheme splits a secret into $n$ shares and distributes them to different valid members. If an adversary captures one member of the system, it obtains only a small part of the secret; only by obtaining at least $t$ shares can the adversary recover the whole secret. Shamir's secret sharing is the usual technique for achieving this. The trusted party chooses a polynomial to split a secret:

$$G(x) = S + a_1 x + a_2 x^2 + \cdots + a_{t-1} x^{t-1} \bmod p$$

where $S$ is the master key and the coefficients $a_1, \ldots, a_{t-1}$ are random. According to the formula above, given $x_i$ one obtains $G(x_i)$, and $S_i = (x_i, G(x_i))$ is the corresponding sub-key share. The polynomial $G(x)$ can also be expressed in Lagrange form through any $t$ of its points. When $x = 0$, $G(x) = S$; using $\beta(x_i)$ to denote the Lagrange interpolation coefficients at $x = 0$, the master key $S$ can be expressed as:

$$S = G(0) = \sum_{i} G(x_i)\, \beta(x_i), \qquad \beta(x_i) = \prod_{j \ne i} \frac{-x_j}{x_i - x_j}$$

Using the Lagrange interpolation polynomial, we can easily recover the master key $S$.
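A minimal sketch of Shamir's t-out-of-n sharing with Lagrange recovery of $S = G(0)$ (the prime modulus and share x-coordinates are illustrative):

```python
import secrets

P = 2**127 - 1  # a Mersenne prime modulus (illustrative)

def make_shares(S: int, t: int, n: int) -> list[tuple[int, int]]:
    """G(x) = S + a_1 x + ... + a_{t-1} x^{t-1} mod P; shares are (x_i, G(x_i))."""
    coeffs = [S] + [secrets.randbelow(P) for _ in range(t - 1)]
    def G(x):
        return sum(c * pow(x, k, P) for k, c in enumerate(coeffs)) % P
    return [(i, G(i)) for i in range(1, n + 1)]

def recover(shares: list[tuple[int, int]]) -> int:
    """S = G(0) = sum_i G(x_i) * beta(x_i), with Lagrange coefficients at x = 0."""
    S = 0
    for i, (xi, yi) in enumerate(shares):
        beta = 1
        for j, (xj, _) in enumerate(shares):
            if j != i:
                beta = beta * (-xj) * pow(xi - xj, -1, P) % P
        S = (S + yi * beta) % P
    return S

shares = make_shares(S=31337, t=3, n=5)
assert recover(shares[:3]) == 31337   # any t = 3 of the 5 shares recover S
```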
Our Proposed FVFL Scheme
This section first gives an overview of FVFL and then describes its procedures in detail.
A. The Overview of FVFL
As shown in Fig. 2, the system comprises a Trusted Authority (TA), clients, and a central aggregation server.
- Trusted Authority (TA): The TA initializes the neural network model, generates keys and parameters, and distributes them to the clients.
- Clients: Our scheme consists of $m$ clients, each with its own dataset. During the federated learning process, the system allows clients to be online at different times. In particular, when verifying the correctness of the aggregation results, not all clients need to participate in the aggregation; the correctness of the results can still be verified.
- Central Aggregation Server: In each round of federated learning, the central server aggregates the uploaded ciphertexts and then distributes the aggregated result to each client. In our system model, we consider the central server malicious: it may try to steal clients' private information from the received gradients or, even worse, forge the aggregated result it sends to clients.
As Fig. 3 describes, FVFL includes four phases: initialization; client training and encryption; central server aggregation; and client decryption and local model update.
- Phase 0 (Initialization): The TA initializes the whole system and distributes parameters to the clients, which the clients use to encrypt their data.
- Phase 1 (Clients Train Model): Clients train the model on their local datasets, then encode and encrypt the computed gradients and upload them to the cloud for the aggregation update.
- Phase 2 (Central Server Aggregation): The central server aggregates the received parameters and returns the result to each client.
- Phase 3 (Update Local Model): Clients receive the aggregation result from the cloud, decrypt and decode it, verify its reliability, then recover the concrete aggregation result and update the local model.
B. Initialization Phase
In our FVFL framework, the TA needs to generate and distribute the following parameters.
- (1) Using the security parameter $K_0$ as input, output the parameters of the Paillier encryption system $\{N, g, p, q\}$. We set the public key $PK = \{N, g\}$, the private secret $\lambda = \mathrm{lcm}(p-1, q-1)$, and the private key $SK = \{\lambda, p, q\}$. The TA distributes $\{PK, SK\}$ to all clients; $PK$ and $SK$ are used to encrypt and decrypt gradient data, respectively.
- (2) The TA generates a super-increasing sequence $\vec{a} = (a_1, a_2, \ldots, a_{n+1})$, where $a_{n+1} \gg \sum_{i=1}^{n} a_i$, and calculates the encoding parameters $g_i = g^{a_i},\ i = 1, \ldots, n+1$. The TA chooses a random number $r^* \in \mathbb{Z}_N^*$ and calculates $f_1 = r^{*q} \bmod N^2$; from $g$ it calculates $f_2 = g^{q} \bmod N^2$. The TA sends $\{f_1, f_2, \vec{a}, g_1, g_2, \ldots, g_{n+1}\}$ to all clients. $f_1$ will be used to encrypt gradient data and $f_2$ to verify the aggregation result; $\{\vec{a}, g_1, g_2, \ldots, g_{n+1}\}$ will be used to encode and decode the gradient data. (A minimal sketch of super-increasing encoding appears right after this list.)
- (3) The TA generates two constant sequences $\vec{k} = (k_1, k_2, \cdots, k_{n+1})$ and $\vec{h} = (h_1, h_2, \cdots, h_{n+1})$. These two sequences will be used as interpolation points when processing client gradients.
- (4) According to the private secret $p$, the TA chooses a random number $r$ with $r < p$. The TA constructs two Shamir secret-sharing polynomials $G_1(x)$ and $G_2(x)$ with different coefficients, both with modulus $p$; $r$ and $p - r$ are used as the two master keys to be shared. Running Shamir's secret-sharing algorithm generates $2m$ sub-key pairs: those of master key $r$ are denoted $S_1^i = (x_i, G_1(x_i)),\ i = 1, 2, \cdots, m$, and those of master key $p - r$ are denoted $S_2^i = (x_i, G_2(x_i)),\ i = 1, 2, \cdots, m$, where $\{x_i,\ i = 1, 2, \cdots, m\}$ are $m$ distinct random values and $G_1(x_i)$, $G_2(x_i)$ are the values of the two polynomials at input $x_i$. The TA then distributes $\{S_1^i, S_2^i\}$ to client $i$ and $\{x_1, x_2, \cdots, x_m\}$ to all clients. $\{S_1^i, S_2^i\}$ will be used to encrypt the gradient data and resist collusion attacks; $\{x_1, x_2, \cdots, x_m\}$ will be used to recover the master key.
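The super-increasing sequence in step (2) is what lets a whole vector be packed into a single value and unpacked later; here is a minimal encode/decode sketch (the construction and bound are illustrative assumptions, not the paper's exact parameters):

```python
def make_superincreasing(n: int, bound: int) -> list[int]:
    """Each a_{i+1} exceeds bound * (sum of previous a_j), so slots never overlap."""
    a = [1]
    for _ in range(n - 1):
        a.append(sum(a) * bound + 1)
    return a

def encode(values: list[int], a: list[int]) -> int:
    """Pack the vector into one integer: sum_i values[i] * a_i."""
    return sum(v * ai for v, ai in zip(values, a))

def decode(packed: int, a: list[int]) -> list[int]:
    """Peel slots from the largest a_i downward; works because a is super-increasing."""
    out = []
    for ai in reversed(a):
        out.append(packed // ai)
        packed %= ai
    return out[::-1]

a = make_superincreasing(4, bound=1000)
vals = [7, 42, 999, 0]
assert decode(encode(vals, a), a) == vals
```

Note that a sum of several packed integers still decodes correctly as long as each slot's aggregate stays below `bound`, which is what makes this encoding compatible with additive aggregation and is the source of the communication savings.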
C. Client Training Model Phase
In this phase, each client trains its own model on its local dataset $D = \{(x_j, y_j),\ j = 1, 2, \ldots, T\}$. Based on this dataset, each client uses SGD to obtain gradients: the client randomly selects a subset $D^*$ and computes the loss function. Client $i$ runs SGD to compute the local gradient $\omega_i$, an $n$-dimensional vector $\omega_i = (\omega_{i1}, \omega_{i2}, \cdots, \omega_{in})$.
Next, client $i$ sends a federated-learning request to the server together with its ID. Once the server has waited longer than $t$ for client requests, it collects the IDs of all online clients and returns the ID set to every client. According to the ID set, each client selects the corresponding $x_i$ from $\{x_1, x_2, \ldots, x_m\}$ and computes:

Client $i$ then further computes the sub-key $S_i$, as shown in the following formula:
Finally, to guarantee the security of client data and reduce communication overhead, the client encrypts and packs the gradient data with the parameters distributed by the TA. First, the client interpolates the local gradient data $\omega_i = \{\omega_{i1}, \ldots, \omega_{in}\}$ together with the shared sub-key $S_i$ over the constant sequence $\vec{k} = \{k_1, k_2, \ldots, k_{n+1}\}$, obtaining a function $F_i(x)$. The client then feeds the constant sequence $\vec{h} = \{h_1, h_2, \ldots, h_{n+1}\}$ into $F_i(x)$ to compute the corresponding values $F_i(h_j)$, and client $i$ encrypts the values $\{F_i(h_1), F_i(h_2), \ldots, F_i(h_{n+1})\}$ using the encryption parameters $\{g_1, g_2, \ldots, g_{n+1}, f_1, N\}$ distributed by the TA and the sub-key $S_i$:

After obtaining the ciphertext $C_i$, client $i$ uploads the encrypted data to the server and waits for the server to complete the aggregation update of federated learning.
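A simplified sketch of the client-side interpolate-then-evaluate step, in plain modular arithmetic and omitting the Paillier layer (the sequences, gradient values, and `S_i` are all illustrative stand-ins, not the paper's parameters):

```python
P = 2**127 - 1  # illustrative prime field for the interpolation

def interp_eval(points: list[tuple[int, int]], x: int) -> int:
    """Evaluate the unique polynomial through `points` at x, mod P (Lagrange form)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        li = 1
        for j, (xj, _) in enumerate(points):
            if j != i:
                li = li * (x - xj) * pow(xi - xj, -1, P) % P
        total = (total + yi * li) % P
    return total

k_seq = [3, 5, 7, 11]        # constant sequence k (n = 3 gradient slots + 1 key slot)
h_seq = [13, 17, 19, 23]     # constant sequence h
w_i = [101, 202, 303]        # client i's quantized gradient entries (illustrative)
S_i = 424242                 # client i's sub-key from the TA (illustrative)

# F_i passes through (k_j, w_ij) and (k_{n+1}, S_i); evaluating at the h_j gives
# the values {F_i(h_1), ..., F_i(h_{n+1})} that would then be encoded and encrypted.
points = list(zip(k_seq, w_i + [S_i]))
masked = [interp_eval(points, h) for h in h_seq]
```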
D. Aggregation Phase
After receiving the ciphertexts from all clients, the aggregation server uses the additive homomorphism to aggregate the ciphertexts as:

$$C = \prod_{i=1}^{m} C_i \bmod N^2$$

where $C$ is the aggregated ciphertext. The server returns the aggregated ciphertext to all clients; note that the gradient information is ciphertext, so the server cannot obtain any plaintext. The central cloud server is generally untrusted: to save its own computing resources, it may forge the result. Clients should therefore be able to verify whether the aggregation result returned by the cloud server is correct. Moreover, since clients may be offline, both the encryption mechanism and the verification mechanism must support flexible client participation.
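Under Paillier, adding plaintexts corresponds to multiplying ciphertexts modulo $N^2$; a minimal sketch of the server's step, reusing the toy `encrypt`/`decrypt` and `N2` from the Paillier sketch earlier (the exact FVFL ciphertext structure differs):

```python
from functools import reduce

def aggregate(ciphertexts: list[int], N2: int) -> int:
    """Homomorphic sum: prod_i C_i mod N^2 decrypts to sum_i m_i."""
    return reduce(lambda a, b: a * b % N2, ciphertexts)

# usage with the toy Paillier sketch above:
# C = aggregate([encrypt(w) for w in (3, 4, 5)], N2)
# assert decrypt(C) == 12
```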
E. Update Local Model Phase
After receiving the aggregation result, each client first decrypts the aggregated ciphertext $C$ with the Paillier decryption algorithm: $w = L(C^{\lambda} \bmod N^2) \cdot \mu \bmod N$, where $L(u) = \frac{u-1}{N}$ and $\mu = \big(L(g^{\lambda} \bmod N^2)\big)^{-1} \bmod N$.

Next comes the decoding procedure (Algorithm 2 in the paper): each client decodes the aggregation result using the super-increasing sequence $\vec{a} = \{a_1, a_2, \cdots, a_{n+1}\}$ and obtains the values $\{F(h_1), F(h_2), \cdots, F(h_{n+1})\}$.

As shown in Algorithm 3, the client then performs Lagrange interpolation on the point set $\{(h_1, F(h_1)), (h_2, F(h_2)), \cdots, (h_{n+1}, F(h_{n+1}))\}$:

$$F(x) = \sum_{j=1}^{n+1} F(h_j) \prod_{l \ne j} \frac{x - h_l}{h_j - h_l}$$

After obtaining the function $F(x)$, the client inputs $x = k_{n+1}$ into $F(x)$ to compute $F(k_{n+1})$, and then verifies the correctness of the aggregation result with the secret parameter $f_2$: if $f_2^{F(k_{n+1})} \bmod N^2 \equiv 1$, the client is convinced that the aggregation result is correct. The client then inputs the constant sequence $\vec{k}$ into $F(x)$ and computes $\omega = \{F(k_1), F(k_2), \cdots, F(k_n)\}$; these values of $F(x)$ correspond to the aggregated gradients of the model.

At the end of this phase, each client uses the aggregation result to update the corresponding model parameters $M$ locally in round $k$: $M_{k+1} = M_k - \eta\, \frac{\omega^k}{m}$.
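A minimal sketch of the client-side verification check and model update described above, assuming `F` (the interpolated polynomial, evaluated with the appropriate modular reduction), `k_seq`, `f2`, `N2`, `eta`, and `m` are already available (all names illustrative):

```python
import numpy as np

def verify_and_update(F, k_seq, f2, N2, M_k, eta, m):
    """Accept the aggregate only if f2^{F(k_{n+1})} mod N^2 == 1, then update M."""
    if pow(f2, F(k_seq[-1]), N2) != 1:
        raise ValueError("aggregation result failed verification")
    omega = np.array([F(kj) for kj in k_seq[:-1]], dtype=float)  # aggregated gradients
    return M_k - eta * omega / m                                 # M_{k+1} = M_k - eta*omega/m
```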
Summary
Threat model
- The TA is safe and reliable and is not subject to any malicious attacks;
- Clients in the system are honest-but-curious: they follow the protocol settings but may go offline for various reasons;
- The server may maliciously forge aggregation results, or try to compromise some clients to mount a collusion attack and steal other clients' confidential data.
Problems addressed
- Client dropout during the federated learning process;
- During training, the server may infer clients' data and may also fabricate fake aggregation results.
Methods
- Homomorphic encryption and Shamir's secret sharing ensure that client data are not leaked;
- A super-increasing sequence reduces communication overhead;
- A Lagrange interpolation polynomial and Shamir's secret sharing implement the verification mechanism.
Highlights
- Clients may drop out at any time during the process;
- Clients have a mechanism to verify the global gradient sent by the server;
- Verification does not require all clients to participate.
Thoughts
- Could a different privacy technique be used?
  - Certainly, e.g., differential privacy (DP).
- Could two servers be used, with each server receiving half of each client's local gradient?
  - That seems feasible.
- Are there other verification approaches?
  - To be explored in further reading.