Tokenization Workshop (TokShop)

Let's Talk about Tokenization



Tokenization—the process of converting raw data into discrete units for model input and output—has emerged as a critical component across machine learning domains. Originally central to natural language processing (NLP), tokenization is now equally essential in multimodal learning, computer vision, speech processing, and other areas. Recent research has shown that tokenization strategies significantly impact model utility, efficiency, and generalization, sparking a surge of interest in this foundational topic.
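To make the idea concrete for readers outside NLP, here is a minimal, self-contained sketch of subword tokenization in the style of byte-pair encoding. The vocabulary and merge rules are toy values chosen for illustration only; they are not drawn from any particular tokenizer or model.

```python
# Toy illustration of BPE-style subword tokenization.
# MERGES and VOCAB below are hypothetical examples, not from a real model.

MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]           # learned merge rules, in priority order
VOCAB = {"low": 0, "er": 1, "lo": 2, "w": 3, "l": 4,
         "o": 5, "e": 6, "r": 7, "<unk>": 8}             # token -> integer ID

def bpe_tokenize(word: str) -> list[str]:
    """Apply the merge rules to a word's characters until no rule matches."""
    symbols = list(word)
    for a, b in MERGES:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]                # merge the adjacent pair in place
            else:
                i += 1
    return symbols

def encode(text: str) -> list[int]:
    """Map raw text to discrete token IDs, falling back to <unk> for unseen tokens."""
    ids = []
    for word in text.split():
        for tok in bpe_tokenize(word):
            ids.append(VOCAB.get(tok, VOCAB["<unk>"]))
    return ids

print(encode("lower low"))   # -> [0, 1, 0]
```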

The Tokenization Workshop (TokShop) at ICML aims to bring together researchers and practitioners from all corners of machine learning to explore tokenization in its broadest sense. We will discuss innovations, challenges, and future directions for tokenization across diverse data types and modalities. Topics of interest include:

By broadening the scope of tokenization research beyond language, this workshop seeks to foster cross-disciplinary dialogue and inspire new advances at the intersection of representation learning, data efficiency, and model design.


Guidelines

Our author guidelines follow the ICML requirements unless otherwise specified.

Organizers

Tomasz Limisiewicz

Meta
University of Washington

Valentin Hofmann

Allen Institute for AI
University of Washington

Sachin Kumar

The Ohio State University

Jindřich Libovický

Charles University

Jindřich Helcl

University of Oslo

Orevaoghene Ahia

University of Washington

Elizabeth Salesky

Google DeepMind

Farhan Samir

University of British Columbia