gensim中常用的Word2Vec，Phrases，Phraser，KeyedVectors

合集下载

1、下载文档前请自行甄别文档内容的完整性，平台不提供额外的编辑、内容补充、找答案等附加服务。
2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
3、如文档侵犯您的权益，请联系客服反馈,我们会尽快为您处理(人工客服工作时间：9:00-18:30)。

gensim中常⽤的Word2Vec，Phrases，Phraser，KeyedVectors gensim中常⽤的Word2Vec，Phrases，Phraser，KeyedVectors 1. Phrases 和Phraser
gensim.models.phrases.Phrases和gensim.models.phrases.Phraser的⽤处是从句⼦中⾃动检测常⽤的短语表达，N-gram多元词组。

Phrases模型可以构建和实现bigram，trigram，quadgram等，提取⽂档中经常出现的2个词，3个词，4个词。

具体可以查看官⽹,
两者不同：
Phrases基于共现频率提取bigram词组。

基于共现的统计：受min_count与threshold的影响，参数设置越⼤，单词组合成⼆元词组的难度越⼤。

class gensim.models.phrases.Phrases(sentences=None, min_count=5, threshold=10.0, max_vocab_size=40000000, delimiter=b'_', progress_per=10000, scoring='default', common_terms=frozenset({}))
Phraser的⽬的是通过丢弃⼀些Phasers模型（phrases_model）状态，减少短语的内存消耗。

如果后⾯不需要⽤新⽂档更新bigram统计信息，就可以使⽤Phraser代替Phrases。

⼀次性初始化后，Phraser会⽐使⽤Phrases模型占⽤内存⼩且速度更快。

class gensim.models.phrases.Phraser(phrases_model)
Parameters
phrases_model () – Trained phrases instance.
Example：
>>> from gensim.test.utils import datapath
>>> from gensim.models.word2vec import Text8Corpus
>>> from gensim.models.phrases import Phrases, Phraser
>>>
>>> # Load training data.
>>> sentences = Text8Corpus(datapath('testcorpus.txt'))
>>> # The training corpus must be a sequence (stream, generator) of sentences,
>>> # with each sentence a list of tokens:
>>> print(list(sentences)[0][:10])
['computer', 'human', 'interface', 'computer', 'response', 'survey', 'system', 'time', 'user', 'interface']
>>>
>>> # Train a toy bigram model.
>>> phrases = Phrases(sentences, min_count=1, threshold=1)
>>> # Apply the trained phrases model to a new, unseen sentence.
>>> phrases[['trees', 'graph', 'minors']]
['trees_graph', 'minors']
>>> # The toy model considered "trees graph" a single phrase => joined the two
>>> # tokens into a single token, `trees_graph`.
>>>
>>> # Update the model with two new sentences on the fly.
>>> phrases.add_vocab([["hello", "world"], ["meow"]])
>>>
>>> # Export the trained model = use less RAM, faster processing. Model updates no longer possible.
>>> bigram = Phraser(phrases)
>>> bigram[['trees', 'graph', 'minors']] # apply the exported model to a sentence
['trees_graph', 'minors']
>>>
>>> # Apply the exported model to each sentence of a corpus:
>>> for sent in bigram[sentences]:
... pass
>>>
>>> # Save / load an exported collocation model.
>>> bigram.save("/tmp/my_bigram_model.pkl")
>>> bigram_reloaded = Phraser.load("/tmp/my_bigram_model.pkl")
>>> bigram_reloaded[['trees', 'graph', 'minors']] # apply the exported model to a sentence
['trees_graph', 'minors']
2. Word2Vec
涵盖常⽤的word2vec算法skip-gram和CBOW，使⽤hierarchical softmax和负采样。

注意：官⽹⾥Word2Vec函数的vector_size参数应该已经换成了size，在使⽤时⽤size.
class gensim.models.word2vec.Word2Vec(sentences=None, corpus_file=None, size=100, alpha=0.025, window=5, min_count=5,
max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, ns_exponent=0.75,
cbow_mean=1, hashfxn=, epochs=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000, compute_loss=False, callbacks=(), comment=None, max_final_vocab=None)
常⽤的⼏个⽅式：
1. 普通嵌⼊：Word2Vec要求输⼊是可迭代的有序句⼦列表，即是已经分词的句⼦列表，即某个⼆维数组。

不过在数据集较⼤的情况会占
⽤⼤量RAM，可以使⽤⾃定义的⽣成器，只在内存中保存单条语句。

from gensim.models import Word2Vec
# load data
raw_sentences = ["the quick brown fox jumps over the lazy dogs","yoyoyo you go home now to sleep"]
# words cut
sentences= [s.split() for s in raw_sentences]
# initialize and train a word2vec model
model = Word2Vec(sentences, size=300, window=5, min_count=1, workers=4)
# save model
model.save("word2vec.model")
调⽤Word2Vec创建模型实际上会对数据执⾏两次迭代操作，第⼀轮操作会统计词频来构建内部的词典数结构，第⼆轮操作会进⾏神经⽹络训练，⽽这两个步骤是可以分步进⾏的，这样对于某些不可重复的流（譬如 Kafka 等流式数据中）可以⼿动控制:
from gensim.models import Word2Vec
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
# initialize model
model = Word2Vec(size=300, window=5, min_count=1, workers=4)
# prepare the model vocabulary
model.build_vocab(sentences)
# train word vectors
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
2. 多词嵌⼊：利⽤gensim.models.phrases模块，获取⼆元词组，word2vec模型可以学习短语表达，⽐如
new_york，machine_learning。

from gensim.test.utils import common_texts
from gensim.models import Phrases, Phraser
# Train a bigram detector.
phrases = Phrases(common_texts)
bigram = Phraser(phrases)
sentences = bigram[common_texts]
# Apply the trained MWE detector to a corpus, using the result to train a Word2vec model.
model = Word2Vec(sentences, min_count=1, size=300, window=5, iter=7)
3. 预训练模型：Gensim数据存储库中附带了⼏个已经过预训练的模型：。

对于中⽂来说，gensim-data没有中⽂数据，更需要⾃⼰找语
料库训练模型.
import gensim.downloader
# Show all available models in gensim-data
print(list(()['models'].keys()))
#>>>['fasttext-wiki-news-subwords-300',
'conceptnet-numberbatch-17-06-300',
'word2vec-ruscorpora-300',
'word2vec-google-news-300',
'glove-wiki-gigaword-50',
'glove-wiki-gigaword-100',
'glove-wiki-gigaword-200',
'glove-wiki-gigaword-300',
'glove-twitter-25',
'glove-twitter-50',
'glove-twitter-100',
'glove-twitter-200',
'__testing_word2vec-matrix-synopsis']
# Download the "glove-twitter-25" embeddings
glove_vectors = gensim.downloader.load('glove-twitter-25')
# Use the downloaded vectors as usual:
glove_vectors.most_similar('twitter')
3. KeyedVector
gensim.models.keyedvectors模块实现词向量的保存和各种相似性查询。

由于训练后的词向量与训练⽅式⽆关，因此可以⽤⼀个独⽴结构来表⽰。

这个结构叫做 “KeyedVectors”，本质上是键和向量之间的映射。

每个向量都由它的查找键来标识。

参考API：
class gensim.models.keyedvectors.KeyedVectors(vector_size, count=0, dtype=<class 'numpy.float32'>, mapfile_path=None)
classmethod load(fname, mmap=None)
classmethod load_word2vec_format(fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict', limit=None, datatype=<class 'numpy.float32'>, no_header=False)
save(args, kwargs)：Save KeyedVectors to a file.
save_word2vec_format(fname, fvocab=None, binary=False, total_vec=None, write_header=True, prefix='', append=False, sort_attr='count')：Store the input-hidden weight matrix in the same format used by the original C word2vec-tool, for compatibility.
KeyedVector和完整模型的主要区别：KeyedVectors不⽀持进⼀步的训练。

另⼀⽅⾯，通过舍弃训练所需的内部数据结构，KeyedVectors提供了更⼩的RAM占⽤空间和更简单的接⼝。

词向量保存的各种数据格式
⼀个是词向量保存的格式有好⼏种，⼤⼩会不同嘛，哪个速度快，选哪个呢？另外保存时使⽤⼆进制和⾮⼆进制，在读取词向量时可能带来编码问题。

训练⼀个完整的模型，然后访问其model.wv属性，该属性持有独⽴的键控向量。

词向量获取：保留在磁盘上，词向量从磁盘上现有的⽂件中实例化，以原始Google的word2vec C格式作为KeyedVectors实例。

model.save()不能利⽤⽂本编辑器查看，但是保存了训练的全部信息，可以在读取后追加训练;
model.wv.save_word2vec_format()保存为word2vec⽂本格式，但是保存时丢失了词汇树等部分信息，不能追加训练。

from gensim.test.utils import common_texts
from gensim.models import Word2Vec, KeyedVectors
# initialize and train a word2vec model
model = Word2Vec(common_texts, size=300, window=5, min_count=1, workers=4)
# save model
model.save("word2vec.model")
# 1.保存为KeyedVector⽂件
model.wv.save('vectors_300.kv')
# 2.将词向量保存为txt格式/⾮⼆进制
model.wv.save_word2vec_format('token_vec_300.txt', binary=False)
# 3.将词向量保存为bin格式/⼆进制
model.wv.save_word2vec_format('token_vec_300.bin', binary=True)
# 4.将词向量保存为bin.gz格式/压缩⼆进制
model.wv.save_word2vec_format('token_vec_300.bin.gz', binary=True)
# 1. 加载KeyedVectors实例
keyed_vectors = KeyedVectors.load('vectors_300.kv')
# 2. 加载词向量txt格式/⾮⼆进制
wv_from_text = KeyedVectors.load_word2vec_format('token_vec_300.txt', binary=False)
# 3. 加载词向量bin格式/⼆进制
wv_from_bin = KeyedVectors.load_word2vec_format('token_vec_300.bin', binary=True)
# 4. 加载词向量bin.gz格式/压缩⼆进制
wv_from_bin_gz = KeyedVectors.load_word2vec_format('token_vec_300.bin.gz', binary=True)
模型保存/词向量保存格式⼤⼩
model.save("word2vec.model")48KB
model.wv.save('vectors_300.kv')23KB
model.wv.save_word2vec_format('token_vec_300.txt', binary=False)49KB
model.wv.save_word2vec_format('token_vec_300.bin', binary=True)15KB
model.wv.save_word2vec_format('token_vec_300.bin.gz', binary=True)13KB
总结：词向量还是以bin或bin.gz格式保存为佳，占⽤空间少。

词向量⽤处
⽤训练好的向量执⾏各种句法/语义NLP词任务。

import gensim.downloader as api
# load pre-trained word-vectors from gensim-data
word_vectors = api.load("glove-wiki-gigaword-100")
# Check the "most similar words", using the default "cosine similarity" measure.
result = word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])
most_similar_key, similarity = result[0] # look at the first match
print(f"{most_similar_key}: {similarity:.4f}")
#>>> queen: 0.7699。