The fusion of transformer and convolutional designs has led to steady improvements in both the accuracy and efficiency of models. In this work, we present FastViT, a hybrid vision transformer architecture that achieves a state-of-the-art latency-accuracy trade-off. To this end, we introduce a novel token mixing operator, RepMixer, a basic building block of FastViT. RepMixer uses structural reparameterization to lower memory access cost by removing skip-connections in the network. We further employ train-time overparametrization and large kernel convolutions to boost accuracy, while keeping the impact on latency minimal. Our experiments show that our model is 3.5x faster than CMT, a recent state-of-the-art hybrid transformer architecture, 4.9x faster than EfficientNet, and 1.9x faster than ConvNeXt on a mobile device, at the same accuracy on the ImageNet dataset. Moreover, at similar latency, our model obtains 4.2% higher Top-1 accuracy on ImageNet than MobileOne. Our model consistently outperforms competing architectures across several tasks such as image classification, detection, segmentation, and 3D mesh regression, with significant improvements in latency on both mobile devices and desktop GPUs. Furthermore, our model is highly robust to out-of-distribution samples and corruptions, improving over competing robust models.
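The core idea behind RepMixer's structural reparameterization is that a skip-connection around a convolution can be folded into the convolution's kernel at inference time, removing the extra memory access the residual branch incurs. The sketch below is a minimal NumPy illustration of this identity-folding trick for a single-channel 3x3 convolution; it is not the paper's actual implementation (which operates on batch-normalized depthwise convolutions in PyTorch), and `conv2d_single` is a hypothetical helper written here for illustration.

```python
import numpy as np

def conv2d_single(x, k):
    """'Same'-padded cross-correlation of one channel with a 3x3 kernel."""
    pad = np.pad(x, 1)
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(pad[i:i + 3, j:j + 3] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))   # toy input "feature map"
k = rng.standard_normal((3, 3))   # toy 3x3 kernel

# Train-time form: skip-connection plus convolution, y = x + conv(x).
y_train = x + conv2d_single(x, k)

# Inference-time form: fold the identity branch into the kernel by
# adding 1 to its centre tap, so a single convolution suffices.
k_rep = k.copy()
k_rep[1, 1] += 1.0
y_infer = conv2d_single(x, k_rep)

# Both forms produce identical outputs up to floating-point error.
assert np.allclose(y_train, y_infer)
```

Since the reparameterized network evaluates one convolution instead of a convolution plus an element-wise add, it never materializes the residual branch, which is what reduces memory access cost on bandwidth-limited mobile hardware.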