

Special recommendation | [Text Mining series tutorials]:

Text Mining from Beginner to Expert (16): Use BERT the way you use scikit-learn


Recently, readers have frequently been asking me the following questions, either in the backend of my official account or directly on WeChat:

How do I effectively model a large amount of short-text data?

When building an LDA model, how do I determine the number of topics?

The results produced by my topic model are hard to interpret. What should I do?

When the data has no categories or labels, extracting document topics with unsupervised techniques is the natural idea. Topic models such as LDA and NMF are widely used and in most cases work reasonably well (mainly on long texts), but I have always found the hyperparameter tuning they require rather troublesome.

That is why I want to bring in BERT, the current SOTA model: it has performed excellently on all kinds of NLP tasks in recent years, using the pretrained model requires no labeled data, and, more importantly, BERT can generate excellent, contextualized word embeddings and sentence embeddings.

Below, I use customer review data from the automotive domain as an example to demonstrate the power of a BERT-based topic model.

1 Load the required Python libraries

import numpy as np
import pandas as pd
import jieba
import umap
import hdbscan
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm
import matplotlib.pyplot as plt

2 Load the data and do the necessary preprocessing

The data I use here is word-of-mouth review data from Autohome: 20,000+ reviews, most of them short texts of fewer than 70 characters.

from pyltp import SentenceSplitter
data = pd.read_csv('car_reviews.csv')  # load the raw reviews; the file path is truncated in the original, this one is illustrative
splited_sentences = SentenceSplitter.split(' '.join(data['review'].tolist()))  # split the reviews into sentences; the argument is truncated in the original and reconstructed here

Filter out sentences that are too short:

data['text_length'] = data["review"].apply(lambda x: len(x))
data = data[data['text_length'] > 5]  # drop overly short sentences; the cutoff is truncated in the original, 5 is illustrative

Take a look at the data:

data.head()

   review  text_length
0  也不太满意。说起来是油耗吧。这个车型和车的重量其实也不高,如果再低一点的话…  74
1  发动机的启停没什么用,轴距只有2.9多个,行李箱太大,启停比换成后视镜的电动折叠更多的人感觉到...  73
2  那是走在烂路上时天窗发出噪音。  20
3  班次1~2还是有顿挫感,地板油也要让他考虑1秒左右,所以后期拨一点可以吗?  42
4  非常好的车。颜值、动力、驾驶感觉都很好,但如果说定位是4门5辆轿车跑,我更倾向于。...  91

Tokenize the text with jieba and remove the stop words from each sentence.

Note that this step is only needed for extracting the topic words later; the sentence representations, i.e. the sentence embeddings, are generated from the raw text by the BERT model.

data['review_seg'] = data['review'].apply(
    lambda x: ' '.join([w for w in jieba.cut(x) if w not in stop_words]))  # stop_words: a pre-loaded Chinese stop-word list; the lambda body is truncated in the original and reconstructed here

Check the data again:

data.head()

   review  text_length  review_seg
0  也不太满意。说起来是油耗吧。关于这个车型和车的重量其实也不太高,但是如果再低一点的话…  74  不太满意。关于省油的车的重量其实不用更高。结果是要买的。是...
1  发动机启停没什么用,轴间距离只有2.9多,后备箱太大,启停是后视镜电动折叠的很多人感觉到的...  73  发动机启停是轴间距离2.9后备箱太启停了后视镜电动折叠的人感觉到油 是...
2  那是走在烂路上时天窗发出噪音。  20  说走在烂路上的天窗的异响
3  班次1~2还是有顿挫感,地板油也要让他考虑1秒左右,所以后期拨一点可以吗?  42  班次到二顿挫感地板油思考1秒左右后加装刻度盘比较好
4  非常好的车。颜值、动力、驾驶感觉都很好,但如果说定位是4门5座的车跑,我更倾向于。  91  乘坐非常好的车的颜值动力就好了。说四门五座的车要跑了。是...

3 Create high-quality sentence embeddings

The first step is to convert the documents into numerical vectors while losing as little meaning as possible in the conversion. Many methods could be used, such as doc2vec, skip-thought or ELMo, but given BERT's excellent properties I use it here to extract the sentence embeddings.

First, a SentenceTransformer is used to create document embeddings from the collection of documents. For domain-specific topic modeling like this automotive case, fine-tuning the model on a large amount of unlabeled in-domain text noticeably improves the sentence embeddings, so that texts that mean roughly the same thing but are worded differently end up close together; this matters especially for hard-to-handle document types such as short texts.


If you have long documents, it is advisable to split them into smaller paragraphs or sentences first, because SentenceTransformer is based on BERT, which has an input length limit, typically 512 tokens.
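For example, a long review can be split into sentences and each sentence embedded on its own; a minimal sketch using the SentenceSplitter imported earlier (the review text here is made up):

from pyltp import SentenceSplitter
long_review = '这台车外观很满意。油耗偏高,市区得12个油。隔音一般,高速风噪明显。'  # made-up example text
sentences = list(SentenceSplitter.split(long_review))
print(sentences)  # one string per sentence, split on sentence-final punctuation
# Each sentence, rather than the whole review, is then passed to the embedding model below.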

model = SentenceTransformer(r'Gao_dir/my_pretrained_Chinese_embeddings')  # path to a locally stored pretrained/fine-tuned Chinese model
embeddings = model.encode(data['review'].tolist())  # the argument is truncated in the original and reconstructed here

Check the shape of the document embeddings; it should be (number of sentences, embedding dimension):

embeddings.shape

(9445, 512)

Before we move on to clustering: the HDBSCAN clustering algorithm suffers from the "curse of dimensionality", so the dimensionality must be reduced first.

4 Reduce the dimensionality of the sentence embeddings

I use UMAP to reduce the dimensionality of the sentence embeddings created in the previous step: first, it cuts down computation and memory usage; second, HDBSCAN, which prefers low-dimensional data, then clusters more easily; third, as a manifold method UMAP preserves much of the original high-dimensional structure in the lower-dimensional space.

Personally I think UMAP is currently the best method for reducing the dimensionality of text vectors. It has three important parameters, n_neighbors, n_components and metric, which are worth studying if you have the time; usually, however, the default parameters are good enough.
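For reference, the three parameters can be spelled out as follows (the values here are only illustrative defaults, not a recommendation):

import umap
reducer = umap.UMAP(
    n_neighbors=15,   # size of the local neighborhood; smaller values preserve more local structure
    n_components=5,   # output dimensionality that is handed to HDBSCAN afterwards
    metric='cosine'   # distance metric used on the original sentence embeddings
)
# reducer.fit_transform(embeddings) returns an array of shape (n_sentences, n_components)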

%%time
import sys
sys.setrecursionlimit(100000)
umap_embeddings = umap.UMAP(n_neighbors=25,
                            n_components=5,   # the value is truncated in the original; 5 is a common choice
                            metric='cosine'   # not shown in the original; 'cosine' is a common choice
                            ).fit_transform(embeddings)

Wall time: 53.5 s

5 Document clustering with HDBSCAN

Since UMAP preserves some of the original high-dimensional structure, it makes sense to use HDBSCAN to find high-density clusters, i.e. the salient topics.

This clustering algorithm has two important parameters:

metric: the distance measure I use this time is euclidean (Euclidean distance), which, after the dimensionality reduction, is no longer affected by high dimensionality;

min_cluster_size: the minimum cluster size controls the number of topics; the larger the value, the fewer topics are mined, and vice versa (see the small sweep sketched after the timing output below).

# Everything after HDBSCAN( is truncated in the original; metric follows the text above,
# min_cluster_size=30 and the 'eom' selection method are illustrative values.
%time cluster = hdbscan.HDBSCAN(min_cluster_size=30, metric='euclidean', cluster_selection_method='eom').fit(umap_embeddings)

Wall time: 571 ms
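As mentioned above, min_cluster_size is the main lever for the number of topics. A small sweep like the following (the candidate values are illustrative, not the ones used in this article) makes the trade-off visible:

import hdbscan
for mcs in (15, 30, 60):
    labels = hdbscan.HDBSCAN(min_cluster_size=mcs, metric='euclidean',
                             cluster_selection_method='eom').fit_predict(umap_embeddings)
    n_topics = len(set(labels)) - (1 if -1 in labels else 0)  # -1 is the noise cluster
    print('min_cluster_size={}: {} topics'.format(mcs, n_topics))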

The resulting clusters can be visualized by embedding the data into two-dimensional space with UMAP and coloring the clusters with matplotlib. Some clusters are hard to make out, because more than 50 topics may have been generated and some topics contain only a handful of sentences (the small dots in the figure).

# Prepare data: project the embeddings to 2-D for plotting
umap_data = umap.UMAP(n_neighbors=15, n_components=2, metric='cosine').fit_transform(embeddings)
result = pd.DataFrame(umap_data, columns=['x', 'y'])   # the lines after n_components=2 are truncated in the original and reconstructed here
result['labels'] = cluster.labels_

# Visualize clusters
fig, ax = plt.subplots(figsize=(25, 15))
outliers = result.loc[result.labels == -1, :]
clustered = result.loc[result.labels != -1, :]   # everything after plt.scatter(outliers.x, outliers.y is truncated in the original and reconstructed here
plt.scatter(outliers.x, outliers.y, color='#BDBDBD', s=0.05)                      # noise points in grey
plt.scatter(clustered.x, clustered.y, c=clustered.labels, s=0.05, cmap='hsv_r')   # clustered points colored by topic

6 Display the topic modeling results

Before analyzing the topics properly, we need a few helper functions that parse the clustering results and turn them into the usual presentation of a topic model.

6.1 c-TF-IDF

What we want to know about the generated clusters is: what makes one cluster different from another in terms of content, i.e. semantically? To answer this, we can "hack" TF-IDF so that it surfaces the important words (topic words) of individual topics rather than of individual documents.

When TF-IDF is applied to a collection of documents in the usual way, we are essentially comparing the importance of words between documents. But what if we treat all documents of one category (for example, one cluster) as a single document and then apply TF-IDF? The result is importance scores (TF-IDF values) for the words within one cluster. The more important a word is within a cluster, the more representative it is of that cluster's topic. In other words, if we extract the most important words of each cluster, we obtain a description of the topic and can understand what that topic is about.


Concretely, this is what we do:

Each cluster is converted into a single document, rather than treated as a set of documents. Then, for class i, we take the frequency t of each word and divide it by the total number of words w in that class; this can be seen as a regularized form of the term frequency within the class. Next, the total number of (unjoined) documents m is divided by the total frequency of word t across all n classes, and the logarithm is taken, which gives the inverse class frequency. The c-TF-IDF score of word t in class i is therefore: c-TF-IDF(t, i) = (t_i / w_i) * log(m / sum_j t_j).
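A toy illustration with made-up numbers: if the word "油耗" occurs t_i = 50 times in a cluster that contains w_i = 1000 words in total, occurs 200 times summed over all clusters, and there are m = 9445 documents, then its score is (50 / 1000) * log(9445 / 200) ≈ 0.05 * 3.86 ≈ 0.19.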

This is easy to implement with sklearn's CountVectorizer class.

def c_tf_idf(documents, m, ngram_range=(1, 1)):
    # Load a Chinese stop-word list; the file path is truncated in the original, this one is illustrative
    my_stop_words = [line.strip() for line in open('stop_words.txt', encoding='utf-8').readlines()]
    count = CountVectorizer(ngram_range=ngram_range, stop_words=my_stop_words).fit(documents)
    t = count.transform(documents).toarray()
    w = t.sum(axis=1)                                  # total number of words per class
    tf = np.divide(t.T, w)                             # class-wise term frequency
    sum_t = t.sum(axis=0)                              # total frequency of each word across all classes
    idf = np.log(np.divide(m, sum_t)).reshape(-1, 1)   # inverse class frequency
    tf_idf = np.multiply(tf, idf)
    return tf_idf, count

Next, define helpers that extract the keywords (topic words) of each topic and compute the size of each topic (the number of documents it contains).

def extract_top_n_words_per_topic(tf_idf, count, docs_per_topic, n=20):
    words = count.get_feature_names()
    labels = list(docs_per_topic.topic)
    tf_idf_transposed = tf_idf.T
    indices = tf_idf_transposed.argsort()[:, -n:]
    top_n_words = {label: [(words[j], tf_idf_transposed[i][j]) for j in indices[i]][::-1]
                   for i, label in enumerate(labels)}
    return top_n_words

def extract_topic_sizes(df):
    # only the groupby(['topic'] part survives in the original; the rest of the chain is reconstructed
    topic_sizes = (df.groupby(['topic'])
                     .doc
                     .count()
                     .reset_index()
                     .rename(columns={"topic": "Topic", "doc": "Size"})
                     .sort_values("Size", ascending=False))
    return topic_sizes

6.2 Compute the top topic words of each topic

For convenience we put the results into a pandas DataFrame, and then build docs_per_topic, in which all documents belonging to one cluster are joined together.

docs_df = pd.DataFrame(data['review_seg'].tolist(), columns=["doc"])
docs_df['topic'] = cluster.labels_      # cluster label of each document
docs_df['doc_id'] = range(len(docs_df))
docs_per_topic = docs_df.groupby(['topic'], as_index=False).agg({'doc': ' '.join})  # the original line is truncated; joining the documents per topic is reconstructed from the text below

The number of topics generated this time:

len(docs_per_topic.doc.tolist())

34

Before applying the TF-IDF procedure, all documents within a topic are joined into a single document, and the importance of each word is computed for one topic relative to all other topics. Then, within each cluster, the words with the highest c-TF-IDF values are taken as the topic words of that topic.

tf_idf, count = c_tf_idf(docs_per_topic.doc.values, m=len(data))
top_n_words = extract_top_n_words_per_topic(tf_idf, count, docs_per_topic, n=20)  # the call is truncated in the original and reconstructed here

Let's look at the topic words of the topic with index 18:

top_n_words[18]

[('问题', 0.1025568295028655),
 ('否', 0.09461670245949808),
 ('追评', 0.08719721082551075),
 ('现在', 0.07247799984688022),
 ('发现', 0.06975059757614206),
 ('时间', 0.06712798723943217),
 ('再来', 0.05536254179522642),
 ('味道', 0.05231485606815352),
 ('满意', 0.04991319860040351),
 ('感觉', 0.04842748742196347),
 ('出现', 0.04813242930859443),
 ('一个月', 0.046830037344296443),
 ('月', 0.04624729052000483),
 ('新车', 0.04403148698556992),
 ('异味', 0.04383294309480369),
 ('临时', 0.04334722860887084),
 ('开车', 0.04318017756299482),
 ('当前', 0.04022216122961403),
 ('质量', 0.038859284264954115),
 ('抬起车', 0.03524355499171186)]

However, the cluster with label -1 should be excluded: the model treats it as "noise", and it mixes together many unidentified topics, so it is hard to interpret.

top_n_words[-1]

[('外观', 0.03397788540335169),
 ('喜欢', 0.028063470565283518),
 ('性价比', 0.024877802099763188),
 ('感觉', 0.0027908469892),
 ('感觉', 0.019562629459711357),
 ('否', 0.01852212424611342),
 ('前灯', 0.018513381594025758),
 ('配置', 0.018280959393705866),
 ('比较', 0.017890116980130752),
 ('价格', 0.017679624613747016),
 ('非常', 0.017142161266788858),
 ('企业品牌', 0.017058422370475335),
 ('满意', 0.016970659727685928),
 ('豪华', 0.016424003887498418),
 ('优惠', 0.01609247609255133),
 ('xts', 0.01579185209861865),
 ('设计', 0.015732793408522044),
 ('动力', 0.01541712071670732),
 ('大气', 0.014732855459186593),
 ('点', 0.014718071299553026)]
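Before moving on, it can be useful to check how much of the corpus actually lands in this noise cluster; a quick check (sketch, using the docs_df frame built above):

noise_count = (docs_df.topic == -1).sum()
print('noise documents: {} ({:.1%} of the corpus)'.format(noise_count, noise_count / len(docs_df)))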

List all current topics with their corresponding topic words:

from pprint import pprint
for i in list(range(len(top_n_words) - 1)):
    print('most 20 important words in topic {}:'.format(i))
    pprint(top_n_words[i])
    print('*' * 25)

most 20 important words in topic 0:
[('马', 0.24457507362737524),
 ('马跑', 0.2084888573356569),
 ('不吃', 0.09709590737397493),
 ('油耗', 0.06709136386307156),
 ('现在', 0.059650379616285276),
 ('否', 0.05319169690659243),
 ('想要', 0.04764441180247841),
 ('左右', 0.046580524081679016),
 ('跑得快', 0.045400507911056986),
 ('在哪里', 0.044559365280351336),
 ('公斤', 0.041230968367632854),
 ('高速', 0.039234425817170064),
 ('行驶', 0.03890482349013843),
 ('10', 0.037022144019066686),
 ('个油', 0.03682216481709768),
 ('动力', 0.03616975159734934),
 ('正常', 0.03520558703001095),
 ('市区', 0.034599821025087185),
 ('结局', 0.03458202416009574),
 ('道理', 0.031503940772350914)]
*************************

most 20 important words in topic 1:
[('油耗', 0.09524385306084004),
 ('高速', 0.05653143388720487),
 ('左右', 0.05463694726066372),
 ('市区', 0.04736812727722961),
 ('公斤', 0.04426042823825784),
 ('个油', 0.0437019462752025),
 ('10', 0.04124126267133629),
 ('现在', 0.04106957747526032),
 ('承诺', 0.03392843290427474),
 ('11', 0.03258066460138708),
 ('平均', 0.03254166004110595),
 ('百公里', 0.026974405367215754),
 ('12', 0.02667734417832382),
 ('现在', 0.026547861579869568),
 ('省油', 0.024521146178990254),
 ('比较', 0.023967370074638887),
 ('行驶', 0.02337617146923143),
 ('工作日', 0.02231213384456322),
 ('驾驶', 0.02225259142975045),
 ('磨合期', 0.019891589132560176)]
*************************

most 20 important words in topic 2:
[('虎', 0.1972807028214997),
 ('油耗', 0.08030819950496665),
 ('美系车', 0.051452721555236586),
 ('现在', 0.04511691339526969),
 ('10', 0.04164581302410513),
 ('个油', 0.041420858563077104),
 ('美国', 0.04121728175026878),
 ('左右', 0.03493195487672415),
 ('平均', 0.03288881578728298),
 ('现在', 0.029076698183196633),
 ('12', 0.028824764053369055),
 ('高速', 0.028687350320703176),
 ('11', 0.0263147428710808),
 ('基本', 0.025791405022289656),
 ('百公里', 0.025566436389413978),
 ('驾驶', 0.02511085197343242),
 ('郊外', 0.023879719505057788),
 ('公斤以上', 0.023290821021098026),
 ('习性', 0.023170932368572476),
 ('朋友', 0.022668297504425915)]
*************************

most 20 important words in topic 3:
[('油耗', 0.09774756730680972),
 ('凯迪拉克', 0.08150929317053307),
 ('左右', 0.03704063760365755),
 ('个油', 0.03393914525278086),
 ('省油', 0.033147790968701116),
 ('现在', 0.029322670672030947),
 ('油耗', 0.028607158460688595),
 ('市区', 0.028138942560105483),
 ('11', 0.027057690984927343),
 ('承诺', 0.027035026157737122),
 ('结局', 0.025713800165879153),
 ('现在', 0.025636969123009515),
 ('美系车', 0.025507957831906663),
 ('平均', 0.02536302802175033),
 ('前', 0.024645241362404695),
 ('动力', 0.023532574041308225),
 ('比较', 0.02351138127209341),
 ('下降', 0.021912206107234797),
 ('正常', 0.02137825605852441),
 ('可能', 0.0083805610775)]
*************************

... (output for topics 4 through 30 omitted) ...

most 20 important words in topic 31:
[('满意', 0.4749794864152499),
 ('地方', 0.3926757136985932),
 ('否', 0.21437689162047083),
 ('发现', 0.17910831839903818),
 ('现在', 0.11420499815982257),
 ('临时', 0.09540746799339411),
 ('发掘', 0.08502606632538356),
 ('不好', 0.06606868576085345),
 ('完整', 0.06546918040522966),
 ('批评', 0.06351786367717983),
 ('续', 0.05924768082325757),
 ('其实', 0.05517858296374464),
 ('什么都没有', 0.0467681518553301),
 ('真的', 0.04629681210390699),
 ('癫痫得', 0.04599618482379703),
 ('我太多了', 0.04599618482379703),
 ('设为', 0.04599618482379703),
 ('3w', 0.04599618482379703),
 ('可以吐槽', 0.04599618482379703),
 ('相对', 0.045510230820616476)]
*************************

most 20 important words in topic 32:
[('外观', 0.19202697740762065),
 ('喜欢', 0.09742663275691509),
 ('漂亮', 0.06539925997592003),
 ('吸引', 0.051963718413741596),
 ('时尚', 0.04628469650846298),
 ('大气', 0.045441921472445655),
 ('个性', 0.0447603686071089),
 ('个体', 0.03601467530065024),
 ('反正', 0.03586746904278288),
 ('霸气', 0.03438681357345092),
 ('否', 0.03315500048740606),
 ('漂亮', 0.03302680521368137),
 ('外观设计', 0.032328941456855734),
 ('非常', 0.032326600304463396),
 ('外形', 0.03215438082478295),
 ('感觉', 0.03126961228563091),
 ('不错', 0.029505153223353325),
 ('外观', 0.02949619921569243),
 ('顺眼', 0.026753843592622728),
 ('帅', 0.026252936525869065)]
*************************

At this point the topic-discovery work is essentially done, but sometimes we also want to:

reduce the number of topics because there are too many, ideally by merging similar topics down to a specified number;

discover a more meaningful hierarchy among the topics.

This is where topic merging comes in.

7 Topic merging

Depending on the dataset, you may end up with hundreds of topics! You can tune HDBSCAN's min_cluster_size parameter to reduce the number of topics, but that does not let you specify an exact number of clusters/topics. A more natural approach is to reduce the number of topics by merging the topic vectors that are most similar to each other. We can use the same elegant tool as above: compare the c-TF-IDF vectors of the topics, merge the most similar pair, and finally recompute the c-TF-IDF vectors to update the topic representations.


# Several statements inside this loop are truncated in the original; the missing pieces are reconstructed below.
for i in tqdm(range(20)):
    # Calculate cosine similarity between the topic c-TF-IDF vectors
    similarities = cosine_similarity(tf_idf.T)
    np.fill_diagonal(similarities, 0)

    # Extract the label to merge and the label to merge it into
    topic_sizes = docs_df.groupby(['topic']).count().sort_values("doc", ascending=False).reset_index()
    topic_to_merge = topic_sizes.iloc[-1].topic                               # the smallest topic
    topic_to_merge_into = np.argmax(similarities[topic_to_merge + 1]) - 1     # its most similar topic

    # Adjust topics
    docs_df.loc[docs_df.topic == topic_to_merge, "topic"] = topic_to_merge_into
    old_topics = docs_df.sort_values("topic").topic.unique()
    map_topics = {old_topic: index - 1 for index, old_topic in enumerate(old_topics)}
    docs_df.topic = docs_df.topic.map(map_topics)
    docs_per_topic = docs_df.groupby(['topic'], as_index=False).agg({'doc': ' '.join})

    # Calculate the new topic words
    m = len(data)
    tf_idf, count = c_tf_idf(docs_per_topic.doc.values, m)
    top_n_words = extract_top_n_words_per_topic(tf_idf, count, docs_per_topic, n=20)

topic_sizes = extract_topic_sizes(docs_df)
topic_sizes.head(10)

Conclusion

BERTopic is a topic-modeling technique that leverages BERT embeddings and c-TF-IDF to create dense, easily interpretable clusters while keeping the important words in the topic descriptions. Its core steps are the following three (a compact end-to-end sketch follows the list):

extract sentence embeddings with BERT-based sentence-transformers;

reduce the dimensionality of the embeddings with UMAP and cluster the documents with HDBSCAN, so that sentences with similar meanings end up in the same cluster;

extract the topic words of each cluster with c-TF-IDF.
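Putting the three steps together, a minimal end-to-end sketch looks roughly like this (the model name, file path and parameter values are illustrative, not the exact setup used above):

import pandas as pd
import umap
import hdbscan
from sentence_transformers import SentenceTransformer

docs = pd.read_csv('reviews.csv')['review'].tolist()                    # any collection of short texts
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')    # illustrative multilingual model
embeddings = model.encode(docs)                                          # step 1: sentence embeddings
reduced = umap.UMAP(n_neighbors=15, n_components=5,
                    metric='cosine').fit_transform(embeddings)           # step 2a: reduce dimensionality
labels = hdbscan.HDBSCAN(min_cluster_size=15, metric='euclidean',
                         cluster_selection_method='eom').fit_predict(reduced)  # step 2b: cluster
# Step 3: group the documents by `labels` and feed them into c_tf_idf() /
# extract_top_n_words_per_topic() from section 6 to get the topic words.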

In addition, if you apply topic modeling at the paragraph level rather than to whole documents, it is advisable to split the data accordingly before creating the sentence embeddings.

Finally, I want to stress that the decisive factor for clustering quality is the sentence embeddings extracted by sentence-transformers; to get better results you need to pretrain or fine-tune the model on your own task's domain data and compare the candidates.

References:

github.com/MaartenGr/BERTopic

Original title: "Text Mining from Beginner to Expert (22): BERTopic, a Powerful Tool for Short-Text Topic Modeling"
