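- Before a Character-Based Language Model can be built, the words in the corpus have to be split up: every letter is pulled out on its own and stored in a separate file for later use;
- The Python script below does the splitting: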
# -*- coding: UTF-8 -*-
import string
import sys


def SplitIntoCharacters(sourceFilePath, outputFileName):
    # Extra (mostly Chinese) punctuation to drop, on top of string.punctuation
    chn_punctuations = "!?。\"#$%&'()*+,-/:;<=>@[\\]^_`{|}~⦅⦆「」、、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏."
    # Assumes the corpus is UTF-8 encoded; the output file is opened in
    # append mode, so repeated runs keep adding to the same file
    with open(sourceFilePath, encoding='utf-8') as sourceFile, \
            open(outputFileName, 'a', encoding='utf-8') as newFile:
        # Split on whitespace, then walk through each word character by character
        for word in sourceFile.read().split():
            for character in word:
                isPunct = character in string.punctuation or character in chn_punctuations
                if not isPunct:
                    # Write one lowercased character per line
                    newFile.write(character.lower() + "\n")
    print("done!")


if __name__ == "__main__":
    # print('args list:', str(sys.argv))
    # Both arguments must be present and non-blank
    if len(sys.argv) != 3 or not sys.argv[1].strip() or not sys.argv[2].strip():
        print("Error: Source file path or the output file name is empty")
    else:
        SplitIntoCharacters(sys.argv[1], sys.argv[2])
# by Alexander Enharjan
- Usage:
python3 wordSpliter (INPUT_FILE_PATH) (OUTPUT_FILE_PATH)
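- To see what the splitting rule does without touching any files, here is a minimal sketch that applies the same rule to an in-memory string. The helper name `split_text_into_characters` and the shortened `CHN_PUNCTUATIONS` set are assumptions made for this illustration and are not part of the script above:

```python
import string

# Shortened punctuation set, used only for this illustration (an assumption);
# the full script uses a much longer list.
CHN_PUNCTUATIONS = "。、「」『』【】《》…—“”‘’"


def split_text_into_characters(text, extra_punctuation=CHN_PUNCTUATIONS):
    """Hypothetical helper: return the lowercased, non-punctuation characters of text."""
    characters = []
    # Split on whitespace first, then walk through each word
    for word in text.split():
        for character in word:
            # Skip ASCII punctuation and the extra punctuation set
            if character in string.punctuation or character in extra_punctuation:
                continue
            characters.append(character.lower())
    return characters


if __name__ == "__main__":
    # Each element here corresponds to one line of the script's output file
    print(split_text_into_characters("Hello, World! 你好。"))
    # → ['h', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd', '你', '好']
```

Returning a list instead of writing to a file makes it easy to check the rule on small samples before running the script on the whole corpus.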
Author: 艾孜尔江·艾尔斯兰
Please be sure to credit the source when reposting!