Schedule
8:45 a.m. – 9:00 a.m. – Opening
9:00 a.m. – 9:10 a.m. – Coffee Break
9:10 a.m. – 10:00 a.m. – Invited Talk (Yuval Pinter): Beat them? Join them? Fix them? Tokenization Research in a Downstream World
10:00 a.m. – 10:50 a.m. – Invited Talk (Desmond Elliott): Insights from Pixel Language Modeling
10:50 a.m. – 12:00 p.m. – Poster Session: Tokenization of Text
- Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations – Brian Zheng, Alisa Liu, Orevaoghene M Ahia, Jonathan Hayase, Yejin Choi, Noah Smith
- Subword Tokenization Strategies for Kurdish Word Embeddings – Ali Salehi, Cassandra Jacobs
- Continuous Chain of Thought Enables Parallel Exploration and Reasoning – Halil Alperen Gozeten, Muhammed Emrullah Ildiz, Xuechen Zhang, Hrayr Harutyunyan, Ankit Singh Rawat, Samet Oymak
- Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8 – Preston Firestone, Shubham Ugare, Gagandeep Singh, Sasa Misailovic
- Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives – Ander Artola Velasco, Stratis Tsirtsis, Nastaran Okati, Manuel Gomez-Rodriguez
- Evaluating Morphological Alignment of Tokenizers in 70 Languages – Catherine Arnett, Marisa Hudspeth, Brendan O’Connor
- Byte Latent Transformer: Patches Scale Better Than Tokens – 14 presenters
- Contextual morphologically-guided tokenization for pretrained Latin BERT models – Marisa Hudspeth, Patrick J. Burns, Brendan O’Connor
- SuperBPE: Space Travel for Language Models – Alisa Liu, Jonathan Hayase, Valentin Hofmann, Sewoong Oh, Noah Smith, Yejin Choi
- zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression – Saibo Geng, Nathan Thomas Elian Ranchin, Yunzhen Yao, Maxime Peyrard, Chris Wendler, Michael Gastpar, Robert West
- How Much is Enough? The Diminishing Returns of Tokenization Training Data – Varshini Reddy, Craig Schmidt, Yuval Pinter, Chris Tanner
- FLEXITOKENS: Flexible Tokenization for Evolving Language Models – Abraham Owodunni, Orevaoghene M Ahia, Sachin Kumar
- Sampling from Your Language Model One Byte at a Time – Jonathan Hayase, Alisa Liu, Noah Smith, Sewoong Oh
- BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization – Sander Land, Catherine Arnett
- ByteSpan: Information-Driven Subword Tokenisation – Zebulon Goriely, Suchir Salhan, Pietro Lesci, Julius Cheng, Paula Buttery
With Prerecorded Videos:
- MorphTok: Morphologically Grounded Tokenization for Indic Languages – Maharaj Brahma, N J Karthika, Atul Singh, Devaraja Adiga, Smruti Bhate, Ganesh Ramakrishnan, Rohit Saluja, Maunendra Sankar Desarkar
- Causal Estimation of Tokenisation Bias – Pietro Lesci, Clara Meister, Thomas Hofmann, Andreas Vlachos, Tiago Pimentel
- Adversarial Tokenization – Renato Geh, Zilei Shao, Guy Van den Broeck
- InCa and InDia: Inline Casing and Diacritization Preprocessing For Robust-to-Noise Tokenization and Interpretability – Kirill Semenov, Martin Popel
12:00 p.m. – 1:00 p.m. – Lunch Break
1:00 p.m. – 1:50 p.m. – Invited Talk (Adrian Łańcucki): Learning Dynamic Segmentation and Compression of Sequences in Transformer LLMs
1:50 p.m. – 3:00 p.m. – Poster Session: Tokenization Across Modalities
- How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them – Disen Liao, Freda Shi
- Robust Noise Attenuation via Adaptive Pooling of Transformer Outputs – Greyson Brothers
- Canonical Autoregressive Generation – Ivi Chatzi, Nina Corvelo Benz, Stratis Tsirtsis, Manuel Gomez-Rodriguez
- You Only Train Once: Efficient Tokenizer Selection for Arithmetic in Language Models – Mucong Ding, Sean McLeish, Kazem Meidani, Igor Melnyk, Nam Nguyen, C. Bayan Bruss, Furong Huang
- Conditional Unigram Tokenization with Parallel Data – Gianluca Vico, Jindřich Libovický
- One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression – Keita Miwa, Kento Sasaki, Hidehisa Arai, Tsubasa Takahashi, Yu Yamaguchi
- Tokenizing Nonverbal Communication in Salsa Dance – Bermet Burkanova, Payam Jome Yazdian, Chuxuan Zhang, Trinity Evans, Paige Tuttösí, Angelica Lim
- Watermarking Autoregressive Image Generation – Nikola Jovanović, Ismail Labiad, Tomas Soucek, Martin Vechev, Pierre Fernandez
- QuickMerge++: Token Merging with Autoregressive Prior – Dong Liu, Yanxuan Yu
- Overcoming Vocabulary Constraints with Pixel-level Fallback – Jonas F. Lotz, Hendra Setiawan, Stephan Peitz, Yova Kementchedjhieva
- Continuous Autoregressive Generation with Mixture of Gaussians – Alex Quach, Johnson Tsun-Hsuan Wang, Ramin Hasani, Mathias Lechner, Alexander Amini
- Motion-Focused Tokenization for Source-Free Video Domain Adaptation – Tzu Ling Liu, Ian Stavness, Mrigank Rochan
- Discrete JEPA: Learning Discrete Token Representations without Reconstruction – Junyeob Baek, Hosung Lee, Christopher Hoang, Mengye Ren, Sungjin Ahn
- CAT: Content-Adaptive Image Tokenization – Junhong Shen, Kushal Tirumala, Michihiro Yasunaga, Ishan Misra, Luke Zettlemoyer, Lili Yu, Chunting Zhou
- Entropy-Driven Pre-tokenization for Byte Pair Encoding – Yifan Hu, Ningyue Liang, Dachuan Zhao, Jonathan Geuter, Varshini Reddy, Craig Schmidt, Chris Tanner
With Prerecorded Videos:
- Tokenisation is NP-Complete – Philip Whittington, Gregor Bachmann, Tiago Pimentel
- HH-Codec: High Compression High-fidelity Discrete Neural Codec for Spoken Language Modeling – Rongkun Xue, Yazhe Niu, Shuai Hu, Zixin Yin, Yongqiang Yao, Jing Yang
- GeneticBPE: Motif-Preserving Tokenization for Robust miRNA Modeling – Prabhav Sanga, Jaskaran Singh, Arun Dubey
- Pitfalls, Subtleties, and Techniques in Automata-Based Subword-Level Constrained Generation – Marco Cognetta, David Pohl, Junyoung Lee, Naoaki Okazaki
3:00 p.m. – 3:30 p.m. – Coffee Break
3:30 p.m. – 4:30 p.m. – Panel: Future of Tokenization
- Albert Gu (Carnegie Mellon University)
- Alisa Liu (University of Washington)
- Kris Cao (Cohere)
- Sander Land (Cohere)
- Yuval Pinter (Ben-Gurion University of the Negev)
4:30 p.m. – 5:00 p.m. – Best Paper Session
5:00 p.m. – 5:30 p.m. – Closing Remarks