This article introduces OLMoE (Open Mixture-of-Experts Language Models), presented as the first fully open-source language model built on a Mixture-of-Experts (MoE) architecture. The model has about 7B total parameters but activates only roughly 1B per input token, giving it approximately the inference cost of a 1B dense model, and its training code, intermediate checkpoints, training logs, and training data are all publicly released. The motivation is to use MoE to balance performance and cost: high-performance LMs are often inaccessible because they are expensive to build and deploy, and MoE improves efficiency by activating only a small subset of experts in each layer, unlike dense models, which activate all of their parameters for every input.
The article also notes that despite rapid progress in LMs, most MoE models remain proprietary or only partially open, with limited access to their training data, code, and other resources. This lack of open resources hinders the development of cost-effective open-source models that can match the performance of proprietary ones. To address this, researchers from the Allen Institute for AI, Contextual AI, and other institutions have introduced OLMoE, aiming to provide a high-performance, fully open MoE language model that is competitive with state-of-the-art (SOTA) models of comparable cost.
Key features of OLMoE include:
– Pre-training on 5.1 trillion tokens of a 6.9B-parameter model, in which each input token activates only about 1.3B parameters.
– Roughly 2x faster training than dense models with a comparable number of active parameters.
– Strong performance on benchmarks such as MMLU, GSM8K, and HumanEval; when further tuned on instructions and preferences, it surpasses larger models such as Llama2-13B-Chat, OLMo-7B-Instruct (0724), and DeepSeekMoE-16B.
The OLMoE architecture consists of N_L transformer layers in which the single dense feed-forward network of each layer is replaced by a mixture-of-experts (MoE) module containing many smaller expert networks. For each input token, only a subset of k experts is selected and activated. The selection is handled by a learned router r, which maps each token to routing probabilities over the experts; the outputs of the selected experts are weighted by these probabilities and summed to form the layer's output.
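To make the routing concrete, here is a minimal sketch of such a layer in PyTorch. It is not the OLMoE implementation: the hidden sizes (d_model, d_expert) are placeholder assumptions, the expert count and top-k follow the 64/8 configuration described below, and details such as whether the routing weights are renormalized over the selected experts are glossed over.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Sketch of a token-level MoE feed-forward layer (illustrative, not OLMoE's code)."""

    def __init__(self, d_model=2048, d_expert=1024, n_experts=64, top_k=8):
        super().__init__()
        self.top_k = top_k
        # Router r: maps each token to one score per expert.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_expert),
                nn.GELU(),
                nn.Linear(d_expert, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)        # routing probabilities
        top_p, top_idx = probs.topk(self.top_k, dim=-1)  # k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = top_idx[:, slot] == e             # tokens whose slot-th choice is expert e
                if mask.any():
                    # Weight each expert output by its routing probability and accumulate.
                    out[mask] += top_p[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out
```

The double loop over slots and experts is written for readability; production implementations batch tokens per expert instead.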
A central set of design questions when building an MoE model is how to create the sparse model in the first place (for example, by converting, or "upcycling", a large dense model into a sparse one), how to balance the load across experts, and how to train the router. The authors train OLMoE from scratch with 1.3B active parameters out of 6.9B total, using 64 experts per layer, 8 of which are activated for each token. They employ dropless ("no-drop") token-based routing, in which every input token is processed by its 8 selected experts and no tokens are discarded.
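As a back-of-the-envelope illustration of why this keeps inference cheap, the sketch below counts expert parameters under assumed dimensions (d_model = 2048, 16 layers, a SwiGLU-style FFN per expert). These numbers are illustrative guesses chosen to land near the reported ~1.3B-active / ~6.9B-total split, not confirmed hyperparameters.

```python
# Assumed (illustrative) dimensions, not official OLMoE hyperparameters.
d_model, d_expert = 2048, 1024
n_layers, n_experts, top_k = 16, 64, 8

# SwiGLU-style FFN: three weight matrices of shape (d_model, d_expert).
ffn_params_per_expert = 3 * d_model * d_expert

total_ffn = n_layers * n_experts * ffn_params_per_expert   # all experts
active_ffn = n_layers * top_k * ffn_params_per_expert      # only the 8 routed experts

print(f"total expert params : {total_ffn / 1e9:.2f}B")   # ~6.44B
print(f"active expert params: {active_ffn / 1e9:.2f}B")  # ~0.81B per token
# Attention and embedding parameters (always active) add roughly another
# 0.4-0.5B to both counts under these assumptions, bringing the totals close
# to the reported ~6.9B total / ~1.3B active.
```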
To train OLMoE, the authors add auxiliary loss functions, a load-balancing loss and a router z-loss, to keep the experts evenly utilized and the routing numerically stable. They also compare routing strategies and find that dropless token-based routing outperforms expert choice routing.
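Both auxiliary losses have standard formulations in the MoE literature (a Switch-Transformer-style load-balancing loss and the router z-loss from ST-MoE). The sketch below follows those common definitions and does not reproduce OLMoE's exact coefficients or implementation.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, top_idx, n_experts):
    """Encourages tokens to be spread evenly across experts.

    router_logits: (n_tokens, n_experts) raw router scores
    top_idx:       (n_tokens, top_k) indices of the selected experts
    """
    probs = F.softmax(router_logits, dim=-1)
    # f_i: fraction of tokens for which expert i was selected.
    selected = F.one_hot(top_idx, n_experts).float().sum(dim=1)  # (n_tokens, n_experts)
    f = selected.mean(dim=0)
    # P_i: mean routing probability assigned to expert i.
    p = probs.mean(dim=0)
    return n_experts * torch.sum(f * p)

def router_z_loss(router_logits):
    """Penalizes large router logits to keep the routing softmax well behaved."""
    z = torch.logsumexp(router_logits, dim=-1)  # (n_tokens,)
    return torch.mean(z ** 2)
```

In practice each loss is multiplied by a small coefficient and added to the language-modeling loss.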
The training data for OLMoE is drawn from diverse sources such as DCLM and Dolma 1.7, covering web-crawled text, programming problem solutions, mathematical problem solutions, and academic papers. These sources are combined into a new dataset, OLMOE-MIX, with filtering that removes repetitive content, code from GitHub repositories with few stars, and documents dominated by high-frequency words. The data is reshuffled at the start of each training epoch, and over 5 trillion tokens are consumed during pre-training. In the final stage of training, the learning rate is decayed linearly to 0.
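A minimal sketch of a schedule with such a final stage is shown below. The warmup length, main-phase shape (held constant here for simplicity), and all step counts are illustrative assumptions rather than OLMoE's actual settings; only the final linear decay to 0 reflects the description above.

```python
def lr_at_step(step, peak_lr=4e-4, warmup=2_000, total=100_000, final_decay=10_000):
    """Hypothetical schedule: linear warmup, constant main phase, final linear decay to 0."""
    if step < warmup:
        return peak_lr * step / warmup            # linear warmup
    if step < total - final_decay:
        return peak_lr                            # main phase (simplified as constant)
    remaining = max(total - step, 0)
    return peak_lr * remaining / final_decay      # final stage: decay linearly to 0
```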
The article also discusses adaptation for downstream use: OLMoE-1B-7B-INSTRUCT, a variant of OLMoE tuned on instructions and preferences, outperforms larger models on benchmarks, demonstrating the effectiveness of the fine-tuning recipe.
In conclusion, the OLMoE model is presented as a significant advancement in the field of open-source MoE language models, offering a high-performance alternative that can match or even surpass the capabilities of proprietary models while being more accessible and affordable due to its open-source nature.