Schedule

8:45 a.m. – 9:00 a.m. – Opening

9:00 a.m. – 9:10 a.m. – Coffee Break

9:10 a.m. – 10:00 a.m. – Invited Talk (Yuval Pinter): Beat them? Join them? Fix them? Tokenization Research in a Downstream World

10:00 a.m. – 10:50 a.m. – Invited Talk (Desmond Elliott): Insights from Pixel Language Modeling

10:50 a.m. – 12:00 p.m. – Poster Session: Tokenization of Text

  • Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations – Brian Zheng, Alisa Liu, Orevaoghene M Ahia, Jonathan Hayase, Yejin Choi, Noah Smith
  • Subword Tokenization Strategies for Kurdish Word Embeddings – Ali Salehi, Cassandra Jacobs
  • Continuous Chain of Thought Enables Parallel Exploration and Reasoning – Halil Alperen Gozeten, Muhammed Emrullah Ildiz, Xuechen Zhang, Hrayr Harutyunyan, Ankit Singh Rawat, Samet Oymak
  • Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8 – Preston Firestone, Shubham Ugare, Gagandeep Singh, Sasa Misailovic
  • Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives – Ander Artola Velasco, Stratis Tsirtsis, Nastaran Okati, Manuel Gomez-Rodriguez
  • Evaluating Morphological Alignment of Tokenizers in 70 Languages – Catherine Arnett, Marisa Hudspeth, Brendan O'Connor
  • Byte Latent Transformer: Patches Scale Better Than Tokens – 14 presenters
  • Contextual morphologically-guided tokenization for pretrained Latin BERT models – Marisa Hudspeth, Patrick J. Burns, Brendan O'Connor
  • SuperBPE: Space Travel for Language Models – Alisa Liu, Jonathan Hayase, Valentin Hofmann, Sewoong Oh, Noah Smith, Yejin Choi
  • zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression – Saibo Geng, Nathan Thomas Elian Ranchin, Yunzhen Yao, Maxime Peyrard, Chris Wendler, Michael Gastpar, Robert West
  • How Much is Enough? The Diminishing Returns of Tokenization Training Data – Varshini Reddy, Craig Schmidt, Yuval Pinter, Chris Tanner
  • FLEXITOKENS: Flexible Tokenization for Evolving Language Models – Abraham Owodunni, Orevaoghene M Ahia, Sachin Kumar
  • Sampling from Your Language Model One Byte at a Time – Jonathan Hayase, Alisa Liu, Noah Smith, Sewoong Oh
  • BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization – Sander Land, Catherine Arnett
  • ByteSpan: Information-Driven Subword Tokenisation – Zebulon Goriely, Suchir Salhan, Pietro Lesci, Julius Cheng, Paula Buttery

With Prerecorded Videos:

  • MorphTok: Morphologically Grounded Tokenization for Indic Languages – Maharaj Brahma, N J Karthika, Atul Singh, Devaraja Adiga, Smruti Bhate, Ganesh Ramakrishnan, Rohit Saluja, Maunendra Sankar Desarkar
  • Causal Estimation of Tokenisation Bias – Pietro Lesci, Clara Meister, Thomas Hofmann, Andreas Vlachos, Tiago Pimentel
  • Adversarial Tokenization – Renato Geh, Zilei Shao, Guy Van den Broeck
  • InCa and InDia: Inline Casing and Diacritization Preprocessing For Robust-to-Noise Tokenization and Interpretability – Kirill Semenov, Martin Popel

12:00 p.m. – 1:00 p.m. – Lunch Break

1:00 p.m. – 1:50 p.m. – Invited Talk (Adrian Łańcucki): Learning Dynamic Segmentation and Compression of Sequences in Transformer LLMs

1:50 p.m. – 3:00 p.m. – Poster Session: Tokenization Across Modalities

  • How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them – Disen Liao, Freda Shi
  • Robust Noise Attenuation via Adaptive Pooling of Transformer Outputs – Greyson Brothers
  • Canonical Autoregressive Generation – Ivi Chatzi, Nina Corvelo Benz, Stratis Tsirtsis, Manuel Gomez-Rodriguez
  • You Only Train Once: Efficient Tokenizer Selection for Arithmetic in Language Models – Mucong Ding, Sean McLeish, Kazem Meidani, Igor Melnyk, Nam Nguyen, C. Bayan Bruss, Furong Huang
  • Conditional Unigram Tokenization with Parallel Data – Gianluca Vico, Jindřich Libovický
  • One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression – Keita Miwa, Kento Sasaki, Hidehisa Arai, Tsubasa Takahashi, Yu Yamaguchi
  • Tokenizing Nonverbal Communication in Salsa Dance – Bermet Burkanova, Payam Jome Yazdian, Chuxuan Zhang, Trinity Evans, Paige Tuttösí, Angelica Lim
  • Watermarking Autoregressive Image Generation – Nikola Jovanović, Ismail Labiad, Tomas Soucek, Martin Vechev, Pierre Fernandez
  • QuickMerge++: Token Merging with Autoregressive Prior – Dong Liu, Yanxuan Yu
  • Overcoming Vocabulary Constraints with Pixel-level Fallback – Jonas F. Lotz, Hendra Setiawan, Stephan Peitz, Yova Kementchedjhieva
  • Continuous Autoregressive Generation with Mixture of Gaussians – Alex Quach, Johnson Tsun-Hsuan Wang, Ramin Hasani, Mathias Lechner, Alexander Amini
  • Motion-Focused Tokenization for Source-Free Video Domain Adaptation – Tzu Ling Liu, Ian Stavness, Mrigank Rochan
  • Discrete JEPA: Learning Discrete Token Representations without Reconstruction – Junyeob Baek, Hosung Lee, Christopher Hoang, Mengye Ren, Sungjin Ahn
  • CAT: Content-Adaptive Image Tokenization – Junhong Shen, Kushal Tirumala, Michihiro Yasunaga, Ishan Misra, Luke Zettlemoyer, Lili Yu, Chunting Zhou
  • Entropy-Driven Pre-tokenization for Byte Pair Encoding – Yifan Hu, Ningyue Liang, Dachuan Zhao, Jonathan Geuter, Varshini Reddy, Craig Schmidt, Chris Tanner

With Prerecorded Videos:

  • Tokenisation is NP-Complete – Philip Whittington, Gregor Bachmann, Tiago Pimentel
  • HH-Codec: High Compression High-fidelity Discrete Neural Codec for Spoken Language Modeling – Rongkun Xue, Yazhe Niu, Shuai Hu, Zixin Yin, Yongqiang Yao, Jing Yang
  • GeneticBPE: Motif-Preserving Tokenization for Robust miRNA Modeling – Prabhav Sanga, Jaskaran Singh, Arun Dubey
  • Pitfalls, Subtleties, and Techniques in Automata-Based Subword-Level Constrained Generation – Marco Cognetta, David Pohl, Junyoung Lee, Naoaki Okazaki

3:00 p.m. – 3:30 p.m. – Coffee Break

3:30 p.m. – 4:30 p.m. – Panel: Future of Tokenization

  • Albert Gu (Carnegie Mellon University)
  • Alisa Liu (University of Washington)
  • Kris Cao (Cohere)
  • Sander Land (Cohere)
  • Yuval Pinter (Ben-Gurion University of the Negev)

4:30 p.m. – 5:00 p.m. – Best Paper Session

5:00 p.m. – 5:30 p.m. – Closing Remarks