2221. Unsupervised classification of large-scale text-based datasets with Large Language Model Embeddings
Invited abstract in session TC-12: Insights through Unsupervised Learning, stream Artificial Intelligence, Machine Learning and Optimization.
Thursday, 11:45-13:15, Room: H10
Authors (first author is the speaker)
| 1. | Tim Kunt | |
| 2. | Ida Litzel | Digital Data and Information for Society, Science, and Culture, Zuse Institute Berlin |
| 3. | Thi Huong Vu | Digital Data and Information for Society, Science, and Culture, Zuse Institute Berlin |
Abstract
We propose an unsupervised classification approach for large-scale text-based datasets using Large Language Model (LLM) embeddings. Large text datasets, such as publications, websites, and other text-based media, exhibit two distinct types of features: (1) the text itself, whose information is conveyed through semantics, and (2) its relationships to other texts through links, references, or shared attributes. While the latter can be described as a graph structure, enabling us to use tools and methods from graph theory as well as conventional classification methods, the former has newly found potential through the use of LLM embedding models.
To demonstrate these possibilities and their practicability, we investigate the Web of Science dataset, containing ~70 million scientific publications, through the lens of our proposed embedding method, revealing a self-structured landscape of texts. Further, we discuss strategies for combining these emerging methods with traditional graph-based approaches, potentially compensating for each other's shortcomings.
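The general pipeline the abstract describes (embed each text with an LLM embedding model, then cluster the embedding vectors without labels) can be sketched as follows. This is an illustrative toy example, not the authors' implementation: the 384-dimensional random vectors stand in for real LLM embeddings, and KMeans is just one of many possible unsupervised clusterers.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy stand-in for LLM embeddings: two synthetic "topics" of 50 documents
# each, separated in embedding space. In practice these vectors would come
# from an embedding model applied to the document texts.
topic_a = rng.normal(loc=0.0, scale=0.1, size=(50, 384))
topic_b = rng.normal(loc=1.0, scale=0.1, size=(50, 384))
embeddings = np.vstack([topic_a, topic_b])

# Unsupervised classification: cluster the embedding vectors without labels.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

# Documents from the same synthetic topic end up in the same cluster.
print(len(set(labels[:50])), len(set(labels[50:])))
```

At the scale of ~70 million publications, an exhaustive clusterer like this would be replaced by approximate or mini-batch variants, but the structure of the pipeline is the same.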
Keywords
- Artificial Intelligence
- Big Data
- Graphs and Networks
Status: accepted