What distinguishes a Transformer architecture from traditional RNN-based models in NLP?
-
A
It processes all positions in parallel using self-attention instead of sequential recurrence
-
B
It uses convolutional filters instead of attention
-
C
It requires pre-trained word embeddings to function
-
D
It can only handle fixed-length input sequences
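
For reference, the two mechanisms contrasted in the question can be sketched side by side. This is a minimal illustration, assuming NumPy, a single attention head, and randomly initialised weights. The names `self_attention`, `rnn_forward`, and the `w_*` matrices are hypothetical and not taken from any particular library. Positional encodings, masking, and multi-head projection are left out.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over all positions at once."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # each (seq_len, d)
    scores = q @ k.T / np.sqrt(k.shape[-1])       # (seq_len, seq_len) pairwise scores
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights                   # no loop over time steps

def rnn_forward(x, w_x, w_h):
    """Plain tanh RNN: step t needs the hidden state produced by step t-1."""
    h = np.zeros(w_h.shape[0])
    states = []
    for x_t in x:                                 # inherently sequential loop
        h = np.tanh(x_t @ w_x + h @ w_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
seq_len, d = 5, 4                                 # illustrative sizes only
x = rng.normal(size=(seq_len, d))
w_q, w_k, w_v, w_x, w_h = (rng.normal(size=(d, d)) for _ in range(5))

attn_out, attn_weights = self_attention(x, w_q, w_k, w_v)
rnn_out = rnn_forward(x, w_x, w_h)
print(attn_out.shape, rnn_out.shape)              # (5, 4) (5, 4)
```

Both functions accept any `seq_len`. The attention output for every position comes from a few matrix products, while the recurrent version must walk the sequence one step at a time.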