Multi-Head vs Grouped Query Attention. Are Claude, Llama 3, and Gemma choosing speed over quality?
Frontier model providers such as Anthropic (Claude 3.5 Sonnet), Google (Gemini / Gemma 2B), and Meta (Llama 3) are trending towards grouped query attention over traditional multi-head attention as the attention mechanism in their transformer models. Interestingly, OpenAI with GPT-4o doesn't seem to be making this trade-off.
Although this choice speeds up AI inference, it does impact content quality for output such as summarization. In this video Chris shows that you get more coherent output from models such as Llama 2 or Claude 3 Opus than from newer models such as Llama 3, Gemini, or Gemma. In the end, in certain scenarios such as summarization or generative content, GPT-4o still beats Sonnet.
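To make the trade-off concrete, here is a minimal NumPy sketch of grouped query attention: several query heads share one key/value head, which shrinks the KV cache and speeds up inference. The shapes, head counts, and helper names here are illustrative assumptions, not taken from any of the models mentioned above.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(q, k, v, n_kv_heads):
    """q: (n_heads, seq, d); k, v: (n_kv_heads, seq, d).
    Each group of n_heads // n_kv_heads query heads shares one K/V head."""
    n_heads, seq, d = q.shape
    group = n_heads // n_kv_heads
    # broadcast each K/V head across its group of query heads
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores) @ v

# MHA is the special case n_kv_heads == n_heads;
# MQA (multi-query attention) is the special case n_kv_heads == 1.
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))
k = rng.standard_normal((2, 4, 16))  # only 2 K/V heads serve 8 query heads
v = rng.standard_normal((2, 4, 16))
out = grouped_query_attention(q, k, v, n_kv_heads=2)
print(out.shape)  # (8, 4, 16)
```

The KV cache here stores 2 heads instead of 8, a 4x memory saving, which is the speed/quality trade-off the video explores.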
repo
https://github.com/chrishayuk/mha_gqa...