
Tensor-version CBOW


Small tricks

1

key = ['a','b','c']
value = [1,2,3]

vocab = dict(zip(key,value))
print(vocab)

Output:

{'a': 1, 'b': 2, 'c': 3}

2

key = ['a','b','c']

vocab = dict(zip(key,range(3)))
print(vocab)

Output:

{'a': 0, 'b': 1, 'c': 2}
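
The same mapping can be built in one step with enumerate, the usual idiom for word-to-id vocabularies (a small sketch):

key = ['a','b','c']

# enumerate yields (index, word) pairs, mapping each word to its position
vocab = {word: idx for idx, word in enumerate(key)}
print(vocab)

Output:

{'a': 0, 'b': 1, 'c': 2}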

Reading the corpus

all_word = []
with open('./enwik9_text','r') as f:
    for line in f.readlines():
        all_word.extend(line.strip().split(' '))
print(len(all_word))
print(all_word[0:10])

words = list(set(all_word))
words.sort()
print(len(words))
print(words[0:100])

Output:

124301826
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']
833184
['a', 'aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa', 'aaaaaaa', 'aaaaaaaa', 'aaaaaaaaa', 'aaaaaaaaaaa', 'aaaaaaaaaaaa', 'aaaaaaaaaaaaaaarg', 'aaaaaaaaaaaaahhhhhhhh', 'aaaaaaaaaah', 'aaaaaaaaab', 'aaaaaaaaaghghhgh', 'aaaaaacceglllnorst', 'aaaaaaccegllnorrst', 'aaaaaad', 'aaaaaaf', 'aaaaaah', 'aaaaaalmrsstt', 'aaaaaannrstyy', 'aaaaabbcdrr', 'aaaaaction', 'aaaaae', 'aaaaaf', 'aaaaahh', 'aaaaand', 'aaaaanndd', 'aaaaargh', 'aaaab', 'aaaabb', 'aaaabbbbccc', 'aaaabbbbccccddddeeeeffff', 'aaaabbbbccccddddeeeeffffgggg', 'aaaad', 'aaaae', 'aaaaf', 'aaaah', 'aaaahnkuttie', 'aaaar', 'aaaargh', 'aaaassembly', 'aaaatataca', 'aaaax', 'aaaay', 'aaab', 'aaaba', 'aaabb', 'aaabbbccc', 'aaac', 'aaad', 'aaadietya', 'aaae', 'aaaeealqoff', 'aaaf', 'aaagaaaaaa', 'aaagaattat', 'aaagctactc', 'aaaggacggu', 'aaagh', 'aaah', 'aaahh', 'aaahs', 'aaai', 'aaaimh', 'aaajj', 'aaake', 'aaalac', 'aaam', 'aaamazzarites', 'aaan', 'aaargh', 'aaarm', 'aaarrr', 'aaas', 'aaasthana', 'aaate', 'aaathwan', 'aaavs', 'aaay', 'aab', 'aaba', 'aabaa', 'aabab', 'aababb', 'aabach', 'aabb', 'aabba', 'aabbb', 'aabbcc', 'aabbirem', 'aabbs', 'aabbtrees', 'aabc', 'aabdul', 'aabebwuvev', 'aabehlpt', 'aabel']

The corpus has 124,301,826 tokens in total, the same count fastText reports.

The vocabulary fastText builds contains 218,316 unique words (tokens).

Using set here we get 833,184 unique words (tokens). Judging by the output above, fastText discarded 614,868 junk "words" (a sketch of how it might do that follows).
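
My guess (an assumption, not verified against the fastText settings used here): fastText prunes rare tokens with its minCount parameter, which defaults to 5 for unsupervised word vectors. That filtering can be approximated with a frequency count:

from collections import Counter

# hypothetical approximation of fastText's minCount pruning:
# keep only tokens that occur at least min_count times
counts = Counter(all_word)   # all_word as read in the snippet above
min_count = 5
vocab_words = [w for w, c in counts.items() if c >= min_count]
print(len(vocab_words))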

Removing meaningless words

The enchant library: 67,406

Reference: https://blog.csdn.net/hpulfc/article/details/80997252

1. Install the package:

pip install pyenchant

2. Check a list of candidate words:

import enchant

words = ['a', 'aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa', 'aaaaaaa', 'aaaaaaaa', 'aaaaaaaaa', 'aaaaaaaaaaa', 'aaaaaaaaaaaa', 'aaaaaaaaaaaaaaarg', 'aaaaaaaaaaaaahhhhhhhh', 'aaaaaaaaaah', 'aaaaaaaaab', 'aaaaaaaaaghghhgh', 'aaaaaacceglllnorst', 'aaaaaaccegllnorrst', 'aaaaaad', 'aaaaaaf', 'aaaaaah', 'aaaaaalmrsstt', 'aaaaaannrstyy', 'aaaaabbcdrr', 'aaaaaction', 'aaaaae', 'aaaaaf', 'aaaaahh', 'aaaaand', 'aaaaanndd', 'aaaaargh', 'aaaab', 'aaaabb', 'aaaabbbbccc', 'aaaabbbbccccddddeeeeffff', 'aaaabbbbccccddddeeeeffffgggg', 'aaaad', 'aaaae', 'aaaaf', 'aaaah', 'aaaahnkuttie', 'aaaar', 'aaaargh', 'aaaassembly', 'aaaatataca', 'aaaax', 'aaaay', 'aaab', 'aaaba', 'aaabb', 'aaabbbccc', 'aaac', 'aaad', 'aaadietya', 'aaae', 'aaaeealqoff', 'aaaf', 'aaagaaaaaa', 'aaagaattat', 'aaagctactc', 'aaaggacggu', 'aaagh', 'aaah', 'aaahh', 'aaahs', 'aaai', 'aaaimh', 'aaajj', 'aaake', 'aaalac', 'aaam', 'aaamazzarites', 'aaan', 'aaargh', 'aaarm', 'aaarrr', 'aaas', 'aaasthana', 'aaate', 'aaathwan', 'aaavs', 'aaay', 'aab', 'aaba', 'aabaa', 'aabab', 'aababb', 'aabach', 'aabb', 'aabba', 'aabbb', 'aabbcc', 'aabbirem', 'aabbs', 'aabbtrees', 'aabc', 'aabdul', 'aabebwuvev', 'aabehlpt', 'aabel']

d = enchant.Dict('en_US')
for word in words:
    if d.check(word): 
        print(word)

Running this code raises an error:

Error reference: https://installati.one/install-enchant-ubuntu-20-04/

ImportError: The 'enchant' C library was not found and maybe needs to be installed.

Install the system package (on newer Ubuntu releases the apt package may be named enchant-2):

sudo apt install enchant

Run the code again:

import enchant

words = ['a', 'aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa', 'aaaaaaa', 'aaaaaaaa', 'aaaaaaaaa', 'aaaaaaaaaaa', 'aaaaaaaaaaaa', 'aaaaaaaaaaaaaaarg', 'aaaaaaaaaaaaahhhhhhhh', 'aaaaaaaaaah', 'aaaaaaaaab', 'aaaaaaaaaghghhgh', 'aaaaaacceglllnorst', 'aaaaaaccegllnorrst', 'aaaaaad', 'aaaaaaf', 'aaaaaah', 'aaaaaalmrsstt', 'aaaaaannrstyy', 'aaaaabbcdrr', 'aaaaaction', 'aaaaae', 'aaaaaf', 'aaaaahh', 'aaaaand', 'aaaaanndd', 'aaaaargh', 'aaaab', 'aaaabb', 'aaaabbbbccc', 'aaaabbbbccccddddeeeeffff', 'aaaabbbbccccddddeeeeffffgggg', 'aaaad', 'aaaae', 'aaaaf', 'aaaah', 'aaaahnkuttie', 'aaaar', 'aaaargh', 'aaaassembly', 'aaaatataca', 'aaaax', 'aaaay', 'aaab', 'aaaba', 'aaabb', 'aaabbbccc', 'aaac', 'aaad', 'aaadietya', 'aaae', 'aaaeealqoff', 'aaaf', 'aaagaaaaaa', 'aaagaattat', 'aaagctactc', 'aaaggacggu', 'aaagh', 'aaah', 'aaahh', 'aaahs', 'aaai', 'aaaimh', 'aaajj', 'aaake', 'aaalac', 'aaam', 'aaamazzarites', 'aaan', 'aaargh', 'aaarm', 'aaarrr', 'aaas', 'aaasthana', 'aaate', 'aaathwan', 'aaavs', 'aaay', 'aab', 'aaba', 'aabaa', 'aabab', 'aababb', 'aabach', 'aabb', 'aabba', 'aabbb', 'aabbcc', 'aabbirem', 'aabbs', 'aabbtrees', 'aabc', 'aabdul', 'aabebwuvev', 'aabehlpt', 'aabel']

d = enchant.Dict('en_US')
for word in words:
    if d.check(word): 
        print(word)

Output:

a
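
Only 'a' survives from that garbage list. A quick sanity check on ordinary words (a sketch; the suggestion list will vary with the dictionary version):

import enchant

d = enchant.Dict('en_US')
# real dictionary words pass the check, random strings do not
print(d.check('hello'))     # True
print(d.check('aaaaaaa'))   # False
print(d.suggest('helo'))    # suggestions such as ['hello', 'help', ...]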

Running it over the corpus

import enchant

all_word = []
with open('./enwik9_text','r') as f:
    for line in f.readlines():
        all_word.extend(line.strip().split(' '))
print('-----unchecked words-----')
print(len(all_word))
print(all_word[0:10])

print('---enchant word check---')
d = enchant.Dict('en_US')
all_good_word = []
for word in all_word:
    if d.check(word): 
        all_good_word.append(word)
print(len(all_good_word))
print(all_good_word[0:10])

print('----after word dedup----')
words = list(set(all_good_word))
words.sort()
print(len(words))
print(words[0:10])

Output:

-----unchecked words-----
124301826
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']
---enchant word check---
112301050
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']
----after word dedup----
67406
['a', 'aah', 'aardvark', 'aardvarks', 'ab', 'aback', 'abacus', 'abacuses', 'abaft', 'abalone']

fastText vocabulary: 218,316

After enchant filtering: 67,406

The nltk library:

pip install nltk

words: 72,017

import nltk
nltk.download('words')
from nltk.corpus import words

test = ['a', 'aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa', 'aaaaaaa', 'aaaaaaaa', 'aaaaaaaaa', 'aaaaaaaaaaa', 'aaaaaaaaaaaa', 'aaaaaaaaaaaaaaarg', 'aaaaaaaaaaaaahhhhhhhh', 'aaaaaaaaaah', 'aaaaaaaaab', 'aaaaaaaaaghghhgh', 'aaaaaacceglllnorst', 'aaaaaaccegllnorrst', 'aaaaaad', 'aaaaaaf', 'aaaaaah', 'aaaaaalmrsstt', 'aaaaaannrstyy', 'aaaaabbcdrr', 'aaaaaction', 'aaaaae', 'aaaaaf', 'aaaaahh', 'aaaaand', 'aaaaanndd', 'aaaaargh', 'aaaab', 'aaaabb', 'aaaabbbbccc', 'aaaabbbbccccddddeeeeffff', 'aaaabbbbccccddddeeeeffffgggg', 'aaaad', 'aaaae', 'aaaaf', 'aaaah', 'aaaahnkuttie', 'aaaar', 'aaaargh', 'aaaassembly', 'aaaatataca', 'aaaax', 'aaaay', 'aaab', 'aaaba', 'aaabb', 'aaabbbccc', 'aaac', 'aaad', 'aaadietya', 'aaae', 'aaaeealqoff', 'aaaf', 'aaagaaaaaa', 'aaagaattat', 'aaagctactc', 'aaaggacggu', 'aaagh', 'aaah', 'aaahh', 'aaahs', 'aaai', 'aaaimh', 'aaajj', 'aaake', 'aaalac', 'aaam', 'aaamazzarites', 'aaan', 'aaargh', 'aaarm', 'aaarrr', 'aaas', 'aaasthana', 'aaate', 'aaathwan', 'aaavs', 'aaay', 'aab', 'aaba', 'aabaa', 'aabab', 'aababb', 'aabach', 'aabb', 'aabba', 'aabbb', 'aabbcc', 'aabbirem', 'aabbs', 'aabbtrees', 'aabc', 'aabdul', 'aabebwuvev', 'aabehlpt', 'aabel']

for item in test:
    if item in words.words(): 
        print(item)

Output:

[nltk_data] Downloading package words to /home/universe/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
a
aa

The first run downloads words.zip; reaching the NLTK download server may require a VPN or proxy.
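
If nltk.download() cannot reach the server, the corpus can also be installed by hand (a sketch; the exact URL is an assumption, verify it against the nltk_data repository): download words.zip yourself and unzip it under a directory on nltk.data.path.

import nltk
print(nltk.data.path)   # e.g. ['/home/universe/nltk_data', ...]

# then, outside Python (URL assumed):
# wget https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/words.zip
# unzip words.zip -d ~/nltk_data/corpora/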

Running it over the corpus

import nltk
nltk.download('words')
from nltk.corpus import words
from tqdm import tqdm

all_word = []
with open('./enwik9_text','r') as f:
    for line in f.readlines():
        all_word.extend(line.strip().split(' '))
print('-----unchecked words-----')
print(len(all_word))
print(all_word[0:10])

print('---nltk words check---')
all_good_word = []
words = list(set(words.words()))  # rebinds 'words', shadowing the imported corpus module
for word in tqdm(all_word):
    if word in words: 
        all_good_word.append(word)
print(len(all_good_word))
print(all_good_word[0:10])

print('----after word dedup----')
words_clean = list(set(all_good_word))
words_clean.sort()
print(len(words_clean))
print(words_clean[0:10])

Output:

[nltk_data] Downloading package words to /home/universe/nltk_data...
[nltk_data]   Package words is already up-to-date!
-----unchecked words-----
124301826
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']
---nltk words check---
  0%|                        | 167256/124301826 [08:35<108:40:54, 317.27it/s]

100 hours???

Improving the code a bit

import nltk
nltk.download('words')
from nltk.corpus import words
from tqdm import tqdm

all_word = []
with open('./enwik9_text','r') as f:
    for line in f.readlines():
        all_word.extend(line.strip().split(' '))
print('-----unchecked words-----')
print(len(all_word))
print(all_word[0:10])

print('---nltk words check---')
all_good_word = []
words = list(set(words.words()))
words_clean = set()              # cache of words already verified
print('unique words in nltk corpus:',len(words))
for word in tqdm(all_word):
    if word in words_clean:      # O(1) hit for previously seen words
        all_good_word.append(word)
    else:
        if word in words:        # a first sighting still scans the whole list
            all_good_word.append(word)
            words_clean.add(word)
print(len(all_good_word))
print(all_good_word[0:10])

print('----after word dedup----')
words_clean = list(words_clean)
words_clean.sort()
print(len(words_clean))
print(words_clean[0:10])

Output:

[nltk_data] Downloading package words to /home/universe/nltk_data...
[nltk_data]   Package words is already up-to-date!
-----unchecked words-----
124301826
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']
---nltk words check---
unique words in nltk corpus: 235892
  0%|                            | 101239/124301826 [01:58<43:53:10, 786.13it/s]

Still about 40 hours. The cache only helps for repeated words; every first sighting still scans the list. A simpler fix is sketched below.
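
Before reaching for numpy, it is worth pinpointing the bottleneck (my analysis, not the original author's): the test word in words on a Python list is an O(n) scan per token, while the same test against a set averages O(1). A minimal sketch of the loop with a set:

from nltk.corpus import words as words_corpus
from tqdm import tqdm

words_set = set(words_corpus.words())   # hash lookups: O(1) on average

all_good_word = []
for word in tqdm(all_word):             # all_word as read earlier
    if word in words_set:
        all_good_word.append(word)
print(len(all_good_word))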

Changing the code again:

import nltk
nltk.download('words')
from nltk.corpus import words
from tqdm import tqdm
import numpy as np

all_word = []
with open('./enwik9_text','r') as f:
    for line in f.readlines():
        all_word.extend(line.strip().split(' '))
print('-----unchecked words-----')
print(len(all_word))
print(all_word[0:10])

print('---nltk words check---')
all_good_word = []
words = list(set(words.words()))
words_np = np.array(words, dtype=np.str_)   # contiguous fixed-width string array
words_clean = set()
print('unique words in nltk corpus:',len(words))
for word in tqdm(all_word):
    if word in words_clean:
        all_good_word.append(word)
    else:
        if word in words_np:  # one vectorized scan in C instead of a Python loop
            all_good_word.append(word)
            words_clean.add(word)
print(len(all_good_word))
print(all_good_word[0:10])

print('----after word dedup----')
words_clean = list(words_clean)
words_clean.sort()
print(len(words_clean))
print(words_clean[0:10])

Output:

[nltk_data] Downloading package words to /home/universe/nltk_data...
[nltk_data]   Package words is already up-to-date!
-----unchecked words-----
124301826
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']
---nltk words check---
unique words in nltk corpus: 235892
  0%|                     | 213299/124301826 [01:21<12:33:49, 2743.51it/s]

Down to about 12 hours. Note that word in words_np is still a linear scan per lookup; it is faster because numpy does one vectorized comparison over contiguous memory instead of a Python-level loop.
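
A rough way to compare the two scans (a sketch; absolute numbers depend on the machine):

import numpy as np
import time

test_words = ['word%d' % i for i in range(235892)]
test_np = np.array(test_words)

t0 = time.time()
for _ in range(100):
    'zzz' in test_words   # Python-level scan over list elements
t1 = time.time()
for _ in range(100):
    'zzz' in test_np      # one vectorized comparison in C
t2 = time.time()
print('list: %.3fs  numpy: %.3fs' % (t1 - t0, t2 - t1))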

First, let's look at what the following code does:

x = range(1009)
t = len(x) // 100

for i in range(t+1):
    if (i+1)*100 >= len(x):
        print(i*100,len(x))
    else:
        print(i*100,(i+1)*100)

Output:

0 100
100 200
200 300
300 400
400 500
500 600
600 700
700 800
800 900
900 1000
1000 1009
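
An equivalent and slightly more idiomatic way to produce the same chunk boundaries is range() with a step:

x = range(1009)

# step through the start offsets directly; min() clamps the last chunk
for start in range(0, len(x), 100):
    end = min(start + 100, len(x))
    print(start, end)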

Copy the raw corpus list into numpy arrays chunk by chunk:

import nltk
nltk.download('words')
from nltk.corpus import words
from tqdm import tqdm
import numpy as np

all_word = []
with open('./enwik9_text','r') as f:
    for line in f.readlines():
        all_word.extend(line.strip().split(' '))
print('-----unchecked words-----')
print(len(all_word))
print(all_word[0:10])

print('---nltk words check---')
all_good_word = []
words = list(set(words.words()))
words_np = np.array(words, dtype=np.str_)
print('unique words in nltk corpus:',len(words))
words_clean = set()
num = len(all_word)
clip_num = 20000000
t = num // clip_num
for i in range(t+1):
    if (i+1)*clip_num >= num:
        all_word_sub_np = np.array(all_word[i*clip_num:num])
        print(f'---chunk {i+1}/{t+1}: {i*clip_num}:{num}---')
    else:
        all_word_sub_np = np.array(all_word[i*clip_num:(i+1)*clip_num])
        print(f'---chunk {i+1}/{t+1}: {i*clip_num}:{(i+1)*clip_num}---')
    sub_len = len(all_word_sub_np)
    for index in tqdm(range(sub_len)):
        if all_word_sub_np[index] in words_clean:
            all_good_word.append(all_word_sub_np[index])
        else:
            if all_word_sub_np[index] in words_np: 
                all_good_word.append(all_word_sub_np[index])
                words_clean.add(all_word_sub_np[index])
print(len(all_good_word))
print(all_good_word[0:10])

print('----after word dedup----')
words_clean = list(words_clean)
words_clean.sort()
print(len(words_clean))
print(words_clean[0:10])

Output:

[nltk_data] Downloading package words to /home/universe/nltk_data...
[nltk_data]   Package words is already up-to-date!
-----unchecked words-----
124301826
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']
---nltk words check---
unique words in nltk corpus: 235892
---chunk 1/7: 0:20000000---
  0%|▏                       | 61249/20000000 [00:26<2:11:50, 2520.47it/s]

Also about 12 hours.

Changing the code yet again

np.isin()

import numpy as np

x = np.array(['a','aa','aaa','b','bb','bbb'])
y = np.array(['a','b'])
print(np.isin(x,y))

x = x[np.isin(x,y)]
print(x)

Output:

[ True False False  True False False]
['a' 'b']
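
np.isin returns a boolean mask saying, element-wise, whether each entry of the first array occurs in the second. When both arrays are known to be duplicate-free, the assume_unique flag lets numpy skip its internal deduplication (a small sketch):

import numpy as np

x = np.array(['a','aa','aaa','b','bb','bbb'])
y = np.array(['a','b'])

# both inputs contain no duplicates, so numpy may skip the dedup work
mask = np.isin(x, y, assume_unique=True)
print(x[mask])   # ['a' 'b']

Applying np.isin to the full corpus in chunks:
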
import nltk
nltk.download('words')
from nltk.corpus import words
from tqdm import tqdm
import numpy as np
import time
import datetime

all_word = []
with open('./enwik9_text','r') as f:
    for line in f.readlines():
        all_word.extend(line.strip().split(' '))
print('-----unchecked words-----')
print(len(all_word))
print(all_word[0:10])

print('---nltk words check---')
all_good_word = []
words = list(set(words.words()))
words_np = np.array(words)
print('unique words in nltk corpus:',len(words))
num = len(all_word)
clip_num = 30000000
t = num // clip_num
for i in range(t+1):
    T0 = time.time()
    if (i+1)*clip_num >= num:
        all_word_sub_np = np.array(all_word[i*clip_num:num])
        print(f'chunk {i+1}/{t+1}: {i*clip_num}:{num}')
    else:
        all_word_sub_np = np.array(all_word[i*clip_num:(i+1)*clip_num])
        print(f'chunk {i+1}/{t+1}: {i*clip_num}:{(i+1)*clip_num}')
    index = np.isin(all_word_sub_np,words_np)
    all_word_sub_np = all_word_sub_np[index]
    all_good_word.extend(all_word_sub_np.tolist())
    all_word_sub_np = None   # drop references so the large arrays can be freed
    index = None
    T1 = time.time()
    print(f'chunk {i+1} processing time: {datetime.timedelta(seconds=T1-T0)}')
print(len(all_good_word))
print(all_good_word[0:10])

print('----after word dedup----')
words_clean = list(set(all_good_word))
words_clean.sort()
print(len(words_clean))
print(words_clean[0:10])

Output:

[nltk_data] Error loading words: <urlopen error [Errno 111] Connection
[nltk_data]     refused>
-----unchecked words-----
124301826
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']
---nltk words check---
unique words in nltk corpus: 235892
chunk 1/5: 0:30000000
chunk 1 processing time: 0:00:53.648143
chunk 2/5: 30000000:60000000
Killed

Each chunk is processed quickly now, but during the second chunk the process exhausts memory and gets killed by the system.
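
A streaming alternative (my own sketch, not from the original post): filter each line as it is read, using a set for the dictionary, so neither the full 124M-token list nor any large numpy temporaries sit in memory at once:

from nltk.corpus import words as words_corpus

good_set = set(words_corpus.words())
good_count = 0
vocab = set()

# stream the file line by line: peak memory stays small
with open('./enwik9_text','r') as f:
    for line in f:
        for word in line.strip().split(' '):
            if word in good_set:   # O(1) average lookup
                good_count += 1
                vocab.add(word)
print(good_count)
print(len(vocab))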

Another idea: after reading the corpus, first deduplicate the corpus's own words.

import nltk
nltk.download('words')
from nltk.corpus import words
from tqdm import tqdm
import numpy as np
import time
import datetime

all_word = []
with open('./enwik9_text','r') as f:
    for line in f.readlines():
        all_word.extend(line.strip().split(' '))
print('-----unchecked words, no dedup-----')
print(len(all_word))
print(all_word[0:10])
print('----unchecked words, self-dedup----')
all_word = list(set(all_word))
all_word.sort()
print(len(all_word))
print(all_word[0:10])

Output:

[nltk_data] Error loading words: <urlopen error [Errno 111] Connection
[nltk_data]     refused>
-----unchecked words, no dedup-----
124301826
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']
----unchecked words, self-dedup----
833184
['a', 'aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa', 'aaaaaaa', 'aaaaaaaa', 'aaaaaaaaa', 'aaaaaaaaaaa']

Far fewer words to check: 833,184 unique types instead of 124,301,826 raw tokens, roughly a 150x reduction in membership tests.

import nltk
nltk.download('words')
from nltk.corpus import words
from tqdm import tqdm
import numpy as np
import time
import datetime

all_word = []
with open('./enwik9_text','r') as f:
    for line in f.readlines():
        all_word.extend(line.strip().split(' '))
print('-----unchecked words, no dedup-----')
print(len(all_word))
print(all_word[0:10])
print('----unchecked words, self-dedup----')
all_word = list(set(all_word))
all_word.sort()
print(len(all_word))
print(all_word[0:10])

print('---nltk words check---')
words = list(set(words.words()))
print('unique words in nltk corpus:',len(words))
words_np = np.array(words)
all_word_np = np.array(all_word)
all_good_word = all_word_np[np.isin(all_word_np,words_np)].tolist()
print(type(all_good_word))
print(len(all_good_word))
print(all_good_word[0:10])

Output:

[nltk_data] Error loading words: <urlopen error [Errno 111] Connection
[nltk_data]     refused>
-----unchecked words, no dedup-----
124301826
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']
----unchecked words, self-dedup----
833184
['a', 'aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa', 'aaaaaaa', 'aaaaaaaa', 'aaaaaaaaa', 'aaaaaaaaaaa']
---nltk words check---
unique words in nltk corpus: 235892
<class 'list'>
72017
['a', 'aa', 'aal', 'aam', 'aardvark', 'aardwolf', 'aba', 'abac', 'abaca', 'aback']

Lowercasing words

x = ['A','The','HELLO','hello']
x_lower = [i.lower() for i in x]
print(x_lower)

Output:

['a', 'the', 'hello', 'hello']

import nltk
nltk.download('words')
from nltk.corpus import words
from tqdm import tqdm
import numpy as np
import time
import datetime

all_word = []
with open('./enwik9_text','r') as f:
    for line in f.readlines():
        all_word.extend(line.strip().split(' '))
print('-----unchecked words, no dedup-----')
print(len(all_word))
print(all_word[0:10])
print('----unchecked words, self-dedup----')
all_word = list(set(all_word))
all_word.sort()
print(len(all_word))
print(all_word[0:10])
print('-----unchecked words, lowercased then deduped-----')
all_word = list(set([i.lower() for i in all_word]))
all_word.sort()
print(len(all_word))
print(all_word[0:10])

print('---nltk words check---')
words = list(set(words.words()))
print('unique words in nltk corpus:',len(words))
words_np = np.array(words)
all_word_np = np.array(all_word)
all_good_word = all_word_np[np.isin(all_word_np,words_np)].tolist()
print(type(all_good_word))
print(len(all_good_word))
print(all_good_word[0:10])

Output:

[nltk_data] Error loading words: <urlopen error [Errno 111] Connection
[nltk_data]     refused>
-----unchecked words, no dedup-----
124301826
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']
----unchecked words, self-dedup----
833184
['a', 'aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa', 'aaaaaaa', 'aaaaaaaa', 'aaaaaaaaa', 'aaaaaaaaaaa']
-----unchecked words, lowercased then deduped-----
833184
['a', 'aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa', 'aaaaaaa', 'aaaaaaaa', 'aaaaaaaaa', 'aaaaaaaaaaa']
---nltk words check---
unique words in nltk corpus: 235892
<class 'list'>
72017
['a', 'aa', 'aal', 'aam', 'aardvark', 'aardwolf', 'aba', 'abac', 'abaca', 'aback']

The corpus words are already all lowercase, so the vocabulary is still 72,017.

wordnet: 101,220

import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet

print(wordnet.synsets('hello'))
print(wordnet.synsets('aaaaaaaaaaa'))

if wordnet.synsets('hello'):
    print('hello: yes')

if wordnet.synsets('aaaaaaaaaaa'):
    print('aaaaaaaaaaa: yes')

Output:

[nltk_data] Downloading package wordnet to /home/universe/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[Synset('hello.n.01')]
[]
hello: yes

Running it over the corpus:

from nltk.corpus import wordnet

all_word = []
with open('./enwik9_text','r') as f:
    for line in f.readlines():
        all_word.extend(line.strip().split(' '))
print('-----unchecked words, no dedup-----')
print(len(all_word))
print(all_word[0:10])
print('-----unchecked words, lowercased then deduped-----')
all_word = list(set([i.lower() for i in all_word]))
all_word.sort()
print(len(all_word))
print(all_word[0:10])

print('---nltk wordnet check---')
all_good_word = []
for word in all_word:
    if wordnet.synsets(word):
        all_good_word.append(word)
print(len(all_good_word))
print(all_good_word[0:10])

Output:

-----unchecked words, no dedup-----
124301826
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']
-----unchecked words, lowercased then deduped-----
833184
['a', 'aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa', 'aaaaaaa', 'aaaaaaaa', 'aaaaaaaaa', 'aaaaaaaaaaa']
---nltk wordnet check---
101220
['a', 'aa', 'aaa', 'aaas', 'aachen', 'aachens', 'aah', 'aahs', 'aalborg', 'aalst']
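
wordnet accepts noticeably more words (101,220) than the words corpus (72,017). One likely reason (to the best of my knowledge): synsets() runs its input through WordNet's morphological analyzer (morphy), so inflected forms match their lemma, and WordNet also lists proper nouns such as 'aachen'. A quick check:

from nltk.corpus import wordnet

# synsets() lemmatizes via morphy, so plurals match their base form
print(wordnet.synsets('aardvarks'))   # non-empty: resolved to 'aardvark'
print(wordnet.morphy('aardvarks'))    # aardvark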

From: https://blog.csdn.net/JIA_NG_FA_N/article/details/140121560
