The heads hypothesis: A unifying statistical approach towards understanding multi-headed attention in BERT
Multi-headed attention is a mainstay of transformer-based models. Various methods have been proposed to classify the role of each attention head based on the relations between tokens that have high pair-wise …
