Second Tokenization Workshop (TokShop)

Let's Talk about Tokenization

Call for Papers


Important Dates

Topics of Interest

Tokenization, the process of converting raw data into discrete units for model input and output, has emerged as a critical component across machine learning domains. Originally central to natural language processing (NLP), tokenization is now equally essential in multimodal learning, computer vision, speech processing, and other areas. Recent research has shown that tokenization strategies significantly impact model utility, efficiency, and generalization, sparking a surge of interest in this foundational topic.

The Second Tokenization Workshop (TokShop) at COLM 2026 aims to bring together researchers and practitioners from all corners of machine learning to explore tokenization in its broadest sense. We will discuss innovations, challenges, and future directions for tokenization across diverse data types and modalities. Topics of interest include:

By broadening the scope of tokenization research beyond language, this workshop seeks to foster cross-disciplinary dialogue and inspire new advances at the intersection of representation learning, data efficiency, and model design.


Guidelines

Our author guidelines follow the COLM requirements unless otherwise specified.

Organizers

Tomasz Limisiewicz

University of Washington
Meta

Valentin Hofmann

LMU Munich

Sachin Kumar

The Ohio State University

Jindřich Libovický

Charles University

Jindřich Helcl

University of Oslo

Orevaoghene Ahia

University of Washington

Elizabeth Salesky

Google DeepMind

Yuki Asano

University of Technology Nuremberg

Yuval Pinter

Ben-Gurion University of the Negev