Background:
For machine translation, our monolingual corpus is roughly 20 GB of plain text. A DataLoader cannot read that into memory in one go, hence the problem of how to load a very large corpus.
Solution: scan the file once, recording the byte offset at which every line starts, and pickle the resulting offset list; a DataLoader can then seek() straight to any line instead of keeping the whole file in memory.
data_dealing.py:
import os
import sys

root_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(root_dir)

from CODES.CONFIG import *          # provides datas_dir
from CODES.UTILS import wordcount   # returns the number of lines in the file
from tqdm import tqdm
import pickle


def data_dealing():
    pos = 0
    file_pos_list = []
    file_path = datas_dir / "stc_weibo_train_post"
    with open(file_path, "r", encoding="utf-8") as fr:
        file_length = int(wordcount(file_path))
        pb = tqdm(total=file_length)
        for line in fr:
            file_pos_list.append(pos)           # byte offset where this line starts
            # advance by the line's length in bytes; assumes "\n" line endings,
            # since text-mode newline translation of "\r\n" would skew the offsets
            pos += len(line.encode("utf-8"))
            pb.update(1)
        pb.close()
    with open(datas_dir / "big_file_seek_list.pkl", "wb") as fw:
        pickle.dump(file_pos_list, fw)


if __name__ == "__main__":
    data_dealing()
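With big_file_seek_list.pkl built, the loading side only needs to seek() to a stored offset and read one line. Below is a minimal sketch of a PyTorch Dataset over the corpus, assuming the same datas_dir from CODES.CONFIG; the class name BigCorpusDataset and its parameters are illustrative, not from the original post.

import pickle
from torch.utils.data import Dataset, DataLoader

class BigCorpusDataset(Dataset):
    def __init__(self, corpus_path, seek_list_path):
        self.corpus_path = corpus_path
        with open(seek_list_path, "rb") as f:
            self.offsets = pickle.load(f)  # byte offset of every line
        self.fh = None                     # file handle, opened lazily per worker

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        if self.fh is None:                # each DataLoader worker opens its own handle
            self.fh = open(self.corpus_path, "rb")
        self.fh.seek(self.offsets[idx])    # jump straight to line idx
        return self.fh.readline().decode("utf-8").rstrip("\n")

dataset = BigCorpusDataset(datas_dir / "stc_weibo_train_post",
                           datas_dir / "big_file_seek_list.pkl")
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)

Opening the file inside __getitem__ rather than __init__ keeps a single handle from being shared across DataLoader worker processes: each worker triggers its own lazy open.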
From: https://blog.csdn.net/wtl1992/article/details/140378626