Abstract:
To effectively mitigate the anisotropy problem of sentence embeddings in the semantic feature space, a contrastive learning sentence embedding method based on weak semantic samples is proposed, which generates effective sentence embeddings while improving the model's recognition of textual semantic similarity. First, similar samples are constructed through a token repetition algorithm and fed into a masked language model (MLM) to predict and generate samples containing weak semantic relationships; then the original samples are fed into Transformer encoders with different dropout rates to extract different global semantic features. Finally, the feature weights are adjusted through contrastive learning to obtain the sentence embeddings. In a series of comparative experiments on public datasets, the sentence embedding method based on weak semantic samples outperforms other methods, achieving the highest similarity evaluation score of 77.38%, and provides an effective solution for sentence embedding generation and semantic similarity recognition tasks.
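As a rough illustration of the pipeline summarized above, the following is a minimal sketch in Python. It assumes a HuggingFace-style BERT checkpoint (bert-base-uncased) as both the MLM and the sentence encoder, a repetition rate of 0.32, a mask rate of 0.15, and an InfoNCE temperature of 0.05; none of these settings are specified in the abstract, and the function names are hypothetical, so this is an interpretation rather than the paper's implementation.

```python
# Sketch: weak-semantic-sample construction + dual-dropout contrastive encoding.
# Assumed settings (not from the paper): bert-base-uncased, rate=0.32,
# mask_rate=0.15, temperature=0.05.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()
encoder = AutoModel.from_pretrained("bert-base-uncased")

def token_repetition(sentence, rate=0.32):
    """Randomly duplicate a fraction of tokens to build a similar sample."""
    out = []
    for tok in tokenizer.tokenize(sentence):
        out.append(tok)
        if torch.rand(1).item() < rate:
            out.append(tok)  # repeat this token
    return tokenizer.convert_tokens_to_string(out)

def weak_semantic_sample(sentence, mask_rate=0.15):
    """Mask part of the repeated sample and let the MLM fill the blanks,
    yielding a sample with a weakened semantic relation to the original."""
    enc = tokenizer(token_repetition(sentence), return_tensors="pt")
    ids = enc["input_ids"].clone()
    maskable = (ids != tokenizer.cls_token_id) & (ids != tokenizer.sep_token_id)
    mask = (torch.rand(ids.shape) < mask_rate) & maskable
    ids[mask] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = mlm(input_ids=ids, attention_mask=enc["attention_mask"]).logits
    ids[mask] = logits.argmax(-1)[mask]  # MLM's predicted replacement tokens
    return tokenizer.decode(ids[0], skip_special_tokens=True)

def embed(sentences, dropout_p):
    """Encode with a given dropout rate; two passes over the same inputs with
    different rates give the two views used as a contrastive positive pair."""
    encoder.train()  # keep dropout active during encoding
    for m in encoder.modules():
        if isinstance(m, torch.nn.Dropout):
            m.p = dropout_p
    enc = tokenizer(sentences, padding=True, return_tensors="pt")
    return encoder(**enc).last_hidden_state[:, 0]  # [CLS] embedding

def info_nce(z1, z2, temp=0.05):
    """Standard InfoNCE: matched pairs are positives, the rest of the batch negatives."""
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temp
    return F.cross_entropy(sim, torch.arange(z1.size(0)))
```

In this reading, `weak_semantic_sample` supplies the MLM-generated weak positives, while `embed` called twice with different `dropout_p` values produces the two global-feature views whose weights the contrastive loss adjusts.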