Operations Research 2025
Abstract Submission

2221. Unsupervised classification of large-scale text-based datasets with Large Language Model Embeddings

Invited abstract in session TC-12: Insights through Unsupervised Learning, stream Artificial Intelligence, Machine Learning and Optimization.

Thursday, 11:45-13:15
Room: H10

Authors (first author is the speaker)

1. Tim Kunt
2. Ida Litzel
Digital Data and Information for Society, Science, and Culture, Zuse Institute Berlin
3. Thi Huong Vu
Digital Data and Information for Society, Science, and Culture, Zuse Institute Berlin

Abstract

We propose an unsupervised classification approach for large-scale text-based datasets using Large Language Models (LLMs). Large text datasets, such as publications, websites, and other text-based media, exhibit two distinct types of features: (1) the text itself, whose information is conveyed through semantics, and (2) its relationship to other texts through links, references, or shared attributes. While the latter can be described as a graph structure, enabling us to use tools and methods from graph theory as well as conventional classification methods, the former has gained new potential through the use of LLM embedding models.
To demonstrate these possibilities and their practicability, we investigate the Web of Science dataset, containing ~70 million scientific publications, through the lens of our proposed embedding method, revealing a self-structured landscape of texts. Further, we discuss strategies for combining these emerging methods with traditional graph-based approaches, potentially compensating for each other's shortcomings.
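The pipeline described above can be illustrated with a minimal sketch: documents are mapped to embedding vectors, and an unsupervised clustering step groups them without labels. The abstract does not specify the embedding model or clustering algorithm, so here random vectors stand in for LLM embeddings and a simple k-means (a hypothetical choice, not necessarily the authors' method) plays the role of the unsupervised classifier.

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Plain Lloyd's k-means on row vectors in X; returns cluster labels."""
    # Deterministic initialization: first and last row as initial centers.
    centers = X[[0, -1]].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(dists, axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Stand-in for LLM embeddings: two synthetic groups of 8-dimensional vectors
# (in practice these would come from an embedding model applied to the texts).
rng = np.random.default_rng(1)
emb = np.vstack([
    rng.normal(0.0, 0.1, (50, 8)) + 1.0,   # "topic A" documents
    rng.normal(0.0, 0.1, (50, 8)) - 1.0,   # "topic B" documents
])
labels = kmeans(emb, k=2)
```

On this toy data the two synthetic topics separate cleanly; at the scale of ~70 million publications one would of course replace both the embeddings and the naive clustering loop with scalable counterparts.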

Status: accepted
