“Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding,” a Presentation from Google

Zizhao Zhang, Staff Research Software Engineer and Tech Lead for Cloud AI Research at Google, presents the “Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding” tutorial at the May 2022 Embedded Vision Summit.

In computer vision, hierarchical structures are popular in vision transformers (ViTs). In this talk, Zhang presents a novel idea: nesting canonical local transformers on non-overlapping image blocks and aggregating them hierarchically. This new design, named NesT, yields a simpler architecture than existing hierarchically structured designs and requires only minor code changes relative to the original ViT.
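The core mechanism can be illustrated with a toy sketch: partition the token grid into non-overlapping blocks, run self-attention locally within each block, then merge neighboring blocks and pool their tokens to form the next, coarser level. This is a minimal numpy illustration of that nesting idea, not the paper's implementation; the simple average-pooling aggregation here stands in for the learned convolution-plus-pooling step NesT actually uses.

```python
import numpy as np

def local_attention(x):
    # Toy (unparameterized) self-attention within one block: x is (tokens, dim).
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def nest_level(blocks):
    # blocks: (blocks_h, blocks_w, tokens_per_block, dim).
    # Attention is applied independently inside each block -- no
    # cross-block communication at this level.
    out = np.empty_like(blocks)
    for i in range(blocks.shape[0]):
        for j in range(blocks.shape[1]):
            out[i, j] = local_attention(blocks[i, j])
    return out

def aggregate(blocks):
    # Merge each 2x2 neighborhood of blocks into one block, then halve the
    # token count by average-pooling token pairs (a stand-in for the
    # learned aggregation in the actual NesT architecture).
    h, w, n, d = blocks.shape
    merged = blocks.reshape(h // 2, 2, w // 2, 2, n, d)
    merged = merged.transpose(0, 2, 1, 3, 4, 5).reshape(h // 2, w // 2, 4 * n, d)
    return merged.reshape(h // 2, w // 2, 2 * n, 2, d).mean(axis=3)

# A 4x4 grid of blocks, 16 tokens per block, 8-dim embeddings.
x = np.random.default_rng(0).normal(size=(4, 4, 16, 8))
level1 = aggregate(nest_level(x))       # shape (2, 2, 32, 8)
level2 = aggregate(nest_level(level1))  # shape (1, 1, 64, 8)
```

Information flows between image regions only through the aggregation steps, which is what decouples local feature learning from hierarchical abstraction in this design.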

The benefits of the proposed judiciously selected design are threefold:

  1. NesT converges faster and requires much less training data to achieve good generalization on both ImageNet and small datasets;
  2. when its key ideas are extended to image generation, NesT leads to a strong decoder that is 8x faster than previous transformer-based generators; and
  3. decoupling the feature learning and abstraction processes via the nested hierarchy enables a novel method (named GradCAT) for visually interpreting the learned model.

See here for a PDF of the slides.
