Call for Papers
Important Dates
- Submission begins: May 20, 2026
- Submission deadline: June 23, 2026 (11:59pm, anywhere on earth)
- Notification of acceptance: July 24, 2026
- Camera-ready papers due: TBA (11:59pm, anywhere on earth)
- Workshop date: TBA
Topics of Interest
Tokenization, the process of converting raw data into discrete units for model input and output, has emerged as a critical component across machine learning domains. Originally central to natural language processing (NLP), tokenization is now equally essential in multimodal learning, computer vision, speech processing, and other areas. Recent research has shown that tokenization strategies significantly impact model utility, efficiency, and generalization, sparking a surge of interest in this foundational topic.
The Second Tokenization Workshop (TokShop) at COLM 2026 aims to bring together researchers and practitioners from all corners of machine learning to explore tokenization in its broadest sense. We will discuss innovations, challenges, and future directions for tokenization across diverse data types and modalities. Topics of interest include:
- Subword Tokenization. Examination of current techniques such as WordPiece, BPE, and UnigramLM, as well as extensions to improve their efficiency and applicability.
- Tokenization for Various Modalities. Tokenization techniques for images, audio, and video. Study of representation alignment across modalities.
- Multilingual Tokenization. Focus on ensuring tokenization methods are equitable and effective across various languages. Identification of relevant failure modes caused by tokenization.
- Tokenizer Modification. Methods for updating tokenizers after model training to improve the model’s efficiency or performance without retraining from scratch.
- Alternative Approaches to Represent Input. Investigation into alternative input representations for data such as patches, bytes, or pixels.
- Tokenization and Statistics. Statistical analysis of subword properties, for instance the compression effectiveness of different tokenization methods.
By broadening the scope of tokenization research beyond language, this workshop seeks to foster cross-disciplinary dialogue and inspire new advances at the intersection of representation learning, data efficiency, and model design.
Guidelines
Our author guidelines follow the COLM requirements unless otherwise specified.
- Paper submission is hosted on OpenReview.
- Each submission should contain up to 9 pages, not including references or appendix (shorter submissions are also welcome).
- Please use the provided LaTeX template (Style Files) for your submission, and follow the general COLM paper formatting guidelines as specified in the style files. Authors may not modify the style files or use templates designed for other conferences.
- The paper should be anonymized and uploaded to OpenReview as a single PDF.
- You may use as many pages of references and appendix as you wish, but reviewers are not required to read the appendix.
- Posting papers on preprint servers like arXiv is permitted.
- We encourage authors to discuss the limitations as well as the ethical and societal implications of their work, wherever applicable (neither is required). These sections do not count towards the page limit.
- This workshop offers both archival and non-archival options for submissions. Archival papers will be indexed in the proceedings, while non-archival submissions will not.
- The review process will be double-blind.
Organizers