I am trying to remove bigrams created by TfidfVectorizer. I am using text.TfidfVectorizer so that I can use my own preprocessor function.
Init
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as stop_words
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import text
Test strings and preprocessor function:
doc2 = ['this is a test past performance here is another that has aa aa adding builing cat dog horse hurricane',
        'another that has aa aa and start date and hurricane hitting south carolina']

def remove_bigrams(doc):
    gram_2 = ['past performance', 'start date', 'aa aa']
    res = []
    for record in doc:
        the_string = record
        for phrase in gram_2:
            the_string = the_string.replace(phrase, "")
        res.append(the_string)
    return res

remove_bigrams(doc2)
My TfidfVectorizer instantiation and fit_transform:
custom_stop_words = [i for i in stop_words]

vec = text.TfidfVectorizer(stop_words=custom_stop_words,
                           analyzer='word',
                           ngram_range=(2, 2),
                           preprocessor=remove_bigrams)
features = vec.fit_transform(doc2)
Here is my error, which is driving me crazy. I have tried everything I can think of and searched Stack Exchange.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [49], in <cell line: 5>()
3 #t3_cv = CountVectorizer(t2, stop_words = stop_words)
4 vec = text.TfidfVectorizer(stop_words=custom_stop_words, analyzer='word', ngram_range = (2,2), preprocessor = remove_bigrams)
----> 5 features = vec.fit_transform(doc2)
File c:\Development_Solutions\Sandbox\SBVE\lib\site-packages\sklearn\feature_extraction\text.py:2079, in TfidfVectorizer.fit_transform(self, raw_documents, y)
2072 self._check_params()
2073 self._tfidf = TfidfTransformer(
2074 norm=self.norm,
2075 use_idf=self.use_idf,
2076 smooth_idf=self.smooth_idf,
2077 sublinear_tf=self.sublinear_tf,
2078 )
-> 2079 X = super().fit_transform(raw_documents)
2080 self._tfidf.fit(X)
2081 # X is already a transformed view of raw_documents so
2082 # we set copy to False
File c:\Development_Solutions\Sandbox\SBVE\lib\site-packages\sklearn\feature_extraction\text.py:1338, in CountVectorizer.fit_transform(self, raw_documents, y)
1330 warnings.warn(
1331 "Upper case characters found in"
1332 " vocabulary while 'lowercase'"
1333 " is True. These entries will not"
1334 " be matched with any documents"
1335 )
1336 break
-> 1338 vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary_)
1340 if self.binary:
1341 X.data.fill(1)
File c:\Development_Solutions\Sandbox\SBVE\lib\site-packages\sklearn\feature_extraction\text.py:1209, in CountVectorizer._count_vocab(self, raw_documents, fixed_vocab)
1207 for doc in raw_documents:
1208 feature_counter = {}
-> 1209 for feature in analyze(doc):
1210 try:
1211 feature_idx = vocabulary[feature]
File c:\Development_Solutions\Sandbox\SBVE\lib\site-packages\sklearn\feature_extraction\text.py:113, in _analyze(doc, analyzer, tokenizer, ngrams, preprocessor, decoder, stop_words)
111 doc = preprocessor(doc)
112 if tokenizer is not None:
--> 113 doc = tokenizer(doc)
114 if ngrams is not None:
115 if stop_words is not None:
TypeError: expected string or bytes-like object
The error message "TypeError: expected string or bytes-like object" tells you that the tokenizer received a list rather than a string. TfidfVectorizer calls the preprocessor once per document, passing it a single string. Your preprocessor
remove_bigrams
treats its argument as the whole list of documents and returns a list, while the tokenizer expects a string.
You can fix this by rewriting
remove_bigrams
so that it accepts one document (a string) and returns a string:
def remove_bigrams(doc):
    gram_2 = ['past performance', 'start date', 'aa aa']
    the_string = doc
    for phrase in gram_2:
        the_string = the_string.replace(phrase, "")
    return the_string

vec = text.TfidfVectorizer(stop_words=custom_stop_words,
                           analyzer='word',
                           ngram_range=(2, 2),
                           preprocessor=remove_bigrams)
features = vec.fit_transform(doc2)
With this change, your preprocessor returns a string and the tokenizer works as expected.
Also, note that by using
ngram_range = (2,2)
you are telling TfidfVectorizer to extract only bigrams. If you want single words rather than bigrams, use
ngram_range = (1,1)
instead.