Redefining LLM Performance: Meta’s MobileLLM Shines with Smaller Size, Greater Efficiency
Meta researchers are challenging the conventional wisdom in the field of large language models (LLMs) with their groundbreaking project, MobileLLM. Their ambitious goal? To demonstrate that smaller models can achieve high quality not through sheer size, but through intelligent design.
Challenging the Scaling Law
The prevailing belief has been that the performance of transformer models is determined mainly by the number of parameters, the size of the training dataset, and the number of training iterations (Kaplan et al., 2020). MobileLLM challenges this “scaling law” view for small models by focusing on architectural choices rather than on increasing model size.
In the researchers' words: “A prevalent belief (Kaplan et al., 2020) in the field suggests that the performance of transformer models is primarily determined by the number of parameters, the size of the training dataset, and the number of training iterations […] Our experimental results, specifically for small models with limited model capacity, reveals that going deeper is more crucial than going wider for performance improvement.”
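To make the depth-versus-width trade concrete, here is a minimal sketch that counts the weights of a simplified decoder-only transformer. The counting formula (four `dim x dim` attention projections plus a feed-forward block with a 4x hidden expansion, ignoring embeddings, norms, and biases) and both example configurations are illustrative assumptions, not MobileLLM's actual settings.

```python
def block_params(dim: int, ffn_mult: int = 4) -> int:
    """Approximate weight count of one transformer block (no biases, no norms)."""
    attention = 4 * dim * dim                    # Q, K, V and output projections
    feed_forward = 2 * dim * (ffn_mult * dim)    # up- and down-projection
    return attention + feed_forward


def model_params(layers: int, dim: int) -> int:
    """Weight count of a stack of identical blocks (embeddings excluded)."""
    return layers * block_params(dim)


# Two hypothetical shapes that spend the same parameter budget.
wide_shallow = model_params(layers=12, dim=768)
deep_thin = model_params(layers=27, dim=512)

print(f"wide and shallow: {wide_shallow:,}")
print(f"deep and thin:    {deep_thin:,}")
```

Under these assumptions both shapes come to the same number of weights, so the choice between them is purely architectural. The count says nothing about accuracy; the claim that the deeper shape performs better at small scale comes from the researchers' experiments.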
Smart Design: Embedding Sharing and Grouped-Query Attention
To achieve this paradigm shift, Meta researchers combined deep and thin architectures with embedding sharing and grouped-query attention mechanisms. This resulted in four models of varying sizes (125M, 350M, 600M, and 1B parameters) which outperformed previous state-of-the-art models in various tasks.
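Grouped-query attention lets several query heads read from the same key/value head, which shrinks the key and value projections. The sketch below shows the head mapping and the resulting weight count; the dimensions and head counts are illustrative values that do not come from the article, and the helper names are hypothetical.

```python
def kv_head_for(query_head: int, n_query_heads: int, n_kv_heads: int) -> int:
    """Index of the key/value head that a given query head reads from."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("n_query_heads must be a multiple of n_kv_heads")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


def attention_params(dim: int, n_query_heads: int, n_kv_heads: int) -> int:
    """Weights in the Q, K, V and output projections (no biases)."""
    head_dim = dim // n_query_heads
    q_and_out = 2 * dim * dim
    k_and_v = 2 * dim * (n_kv_heads * head_dim)
    return q_and_out + k_and_v


# Illustrative configuration (an assumption, not taken from the article).
DIM, N_Q, N_KV = 576, 9, 3

mapping = [kv_head_for(h, N_Q, N_KV) for h in range(N_Q)]
mha = attention_params(DIM, N_Q, N_Q)    # every query head has its own K/V head
gqa = attention_params(DIM, N_Q, N_KV)   # three query heads share each K/V head

print(mapping)
print(f"multi-head: {mha:,}  grouped-query: {gqa:,}")
```

With these numbers, each key/value head serves three query heads and the attention projections shrink by about a third per layer.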
A key technique is **embedding sharing**, already used in earlier language models. It reuses the same weights for the input and output embedding layers. While less effective for larger models, where embeddings are a small share of the total, embedding sharing pays off in smaller models, where the embedding layers account for a sizable fraction of the parameters.
The researchers quantify the effect: “On a 30-layer 125M-parameter model, sharing the input and output embeddings reduces the number of parameters by 16M, approximately 11.8% of total parameters with a 0.2 points drop in average accuracy. The marginal accuracy drop can be readily restored by reallocating the saved parameters to add more layers.”
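A minimal sketch of the idea in plain Python: one weight table serves both as the input embedding lookup and as the output projection. The class name `TiedEmbeddingLM` is hypothetical, and the vocabulary size (32,000) and embedding dimension (512) are assumptions chosen so that one table comes to roughly the 16M parameters quoted above.

```python
from __future__ import annotations

import random


class TiedEmbeddingLM:
    """Toy input/output layers that share a single vocab x dim weight table."""

    def __init__(self, vocab_size: int, dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # One table is used for both the embedding lookup and the output logits.
        self.table = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                      for _ in range(vocab_size)]

    def embed(self, token_id: int) -> list[float]:
        """Input side: look up the row for a token."""
        return self.table[token_id]

    def logits(self, hidden: list[float]) -> list[float]:
        """Output side: dot product of the hidden state with every table row."""
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.table]


def embedding_params(vocab_size: int, dim: int, shared: bool) -> int:
    """Weights held by the embedding layers, with or without sharing."""
    tables = 1 if shared else 2
    return tables * vocab_size * dim


VOCAB, DIM = 32_000, 512  # assumed values, see the note above
saved = (embedding_params(VOCAB, DIM, shared=False)
         - embedding_params(VOCAB, DIM, shared=True))
print(f"parameters saved by sharing: {saved:,}")

tiny = TiedEmbeddingLM(vocab_size=5, dim=3)
scores = tiny.logits(tiny.embed(2))  # one score per vocabulary entry
```

In frameworks that support it, the same effect is usually obtained by pointing the output layer at the embedding matrix rather than copying it, so both layers update the same weights during training.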
Another noteworthy technique is **immediate block-wise weight sharing**, which reuses one block's weights for the adjacent block, so each set of weights is applied twice in a row. This adds effective depth without increasing model size and keeps the added latency small, a crucial advantage in situations where memory movement dictates overall latency.
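The control flow can be sketched as follows. Each "block" here is a stand-in function with two scalar weights rather than a real transformer block, and the names `make_block` and `forward_with_immediate_sharing` are hypothetical; the point is only the loop order, in which a block is reused immediately before the next one is loaded.

```python
from typing import Callable, List

Block = Callable[[float], float]


def make_block(scale: float, shift: float) -> Block:
    """Stand-in for a transformer block: two 'weights' applied to a scalar."""
    def block(x: float) -> float:
        return scale * x + shift
    return block


def forward_with_immediate_sharing(blocks: List[Block], x: float,
                                   repeats: int = 2) -> float:
    """Apply each block `repeats` times in a row before moving to the next."""
    for block in blocks:
        for _ in range(repeats):
            x = block(x)
    return x


unique_blocks = [make_block(2.0, 1.0), make_block(0.5, -1.0)]
result = forward_with_immediate_sharing(unique_blocks, 1.0)
effective_depth = len(unique_blocks) * 2

print(f"stored blocks: {len(unique_blocks)}, effective depth: {effective_depth}")
print(f"output: {result}")
```

Because the repeat happens back to back, the weights of a block only need to be fetched once for two applications, which is why this ordering suits hardware where moving weights is the bottleneck.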
MobileLLM also shows improvements on question-answering and reading comprehension tasks.
MobileLLM: A Sustainable and Efficient Approach to LLMs
Meta researchers highlight the growing need for LLMs on mobile devices to reduce cloud costs and latency, as well as address the environmental concerns associated with larger models’ increasing energy consumption and carbon emissions. They champion the shift towards on-device models as a viable solution, offering improved performance through reduced latency while also promoting sustainability.
MobileLLM is available on Hugging Face for developers and researchers to explore.