Abstract:
To effectively mitigate the anisotropy problem of sentence embeddings in the semantic feature space, a contrastive learning sentence embedding method based on weak semantic samples is proposed, which generates effective sentence embeddings while improving the model's recognition of textual semantic similarity. First, similar samples are constructed through a token repetition algorithm and fed into a masked language model (MLM) to predict and generate samples containing weak semantic relationships; then the original samples are fed into Transformer encoders with different dropout rates to extract different global semantic features. Finally, the feature weights are adjusted through contrastive learning to obtain the sentence embeddings. In a series of comparative experiments on public datasets, the sentence embedding method based on weak semantic samples outperforms other methods, achieving the highest similarity evaluation score of 77.38%, and provides an effective solution for sentence embedding generation and semantic similarity recognition tasks.
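As a rough illustration of the pipeline summarized above, the following is a minimal sketch in Python. It assumes a HuggingFace-style BERT checkpoint (bert-base-uncased) as both the MLM and the sentence encoder, a repetition rate of 0.32, a mask rate of 0.15, and an InfoNCE temperature of 0.05; none of these settings are specified in the abstract, and the function names are hypothetical, so this is an interpretation rather than the paper's implementation.

```python
# Sketch: weak-semantic-sample construction + dual-dropout contrastive encoding.
# Assumed settings (not from the paper): bert-base-uncased, rate=0.32,
# mask_rate=0.15, temperature=0.05.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()
encoder = AutoModel.from_pretrained("bert-base-uncased")

def token_repetition(sentence, rate=0.32):
    """Randomly duplicate a fraction of tokens to build a similar sample."""
    out = []
    for tok in tokenizer.tokenize(sentence):
        out.append(tok)
        if torch.rand(1).item() < rate:
            out.append(tok)  # repeat this token
    return tokenizer.convert_tokens_to_string(out)

def weak_semantic_sample(sentence, mask_rate=0.15):
    """Mask part of the repeated sample and let the MLM fill the blanks,
    yielding a sample with a weakened semantic relation to the original."""
    enc = tokenizer(token_repetition(sentence), return_tensors="pt")
    ids = enc["input_ids"].clone()
    maskable = (ids != tokenizer.cls_token_id) & (ids != tokenizer.sep_token_id)
    mask = (torch.rand(ids.shape) < mask_rate) & maskable
    ids[mask] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = mlm(input_ids=ids, attention_mask=enc["attention_mask"]).logits
    ids[mask] = logits.argmax(-1)[mask]  # MLM's predicted replacement tokens
    return tokenizer.decode(ids[0], skip_special_tokens=True)

def embed(sentences, dropout_p):
    """Encode with a given dropout rate; two passes over the same inputs with
    different rates give the two views used as a contrastive positive pair."""
    encoder.train()  # keep dropout active during encoding
    for m in encoder.modules():
        if isinstance(m, torch.nn.Dropout):
            m.p = dropout_p
    enc = tokenizer(sentences, padding=True, return_tensors="pt")
    return encoder(**enc).last_hidden_state[:, 0]  # [CLS] embedding

def info_nce(z1, z2, temp=0.05):
    """Standard InfoNCE: matched pairs are positives, the rest of the batch negatives."""
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temp
    return F.cross_entropy(sim, torch.arange(z1.size(0)))
```

In this reading, `weak_semantic_sample` supplies the MLM-generated weak positives, while `embed` called twice with different `dropout_p` values produces the two global-feature views whose weights the contrastive loss adjusts.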