# Mixture of Experts (MoE)
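
For context on what the systems below optimize, here is a minimal sketch of the core MoE computation: a gate scores experts per token, and each token is processed only by its top-k experts, with outputs mixed by softmax weights over the selected logits. All names (`moe_forward`, `gate_w`, `experts`) are illustrative, not from any specific paper or framework.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route each token to its top-k experts and mix their outputs.

    x: (tokens, d_model) token activations.
    gate_w: (d_model, n_experts) gating weights.
    experts: list of callables mapping (d_model,) -> (d_model,).
    Illustrative sketch only; real systems batch tokens per expert
    and dispatch them across devices (the focus of the papers below).
    """
    logits = x @ gate_w                        # (tokens, n_experts) gate scores
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        topk = np.argsort(logits[t])[-k:]      # indices of the k largest logits
        # Softmax over the selected logits only (standard top-k gating).
        w = np.exp(logits[t][topk] - logits[t][topk].max())
        w /= w.sum()
        for wi, e in zip(w, topk):
            out[t] += wi * experts[e](x[t])    # weighted mix of expert outputs
    return out
```

Because only k of the experts run per token, compute stays near-constant as the expert count grows; the distributed-training challenge is dispatching tokens to the devices holding their chosen experts, which is what the all-to-all optimizations in the papers below target.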

## MoE Training

* Janus: A Unified Distributed Training Framework for Sparse Mixture-of-Experts Models ([SIGCOMM 2023](https://paper.lingyunyang.com/reading-notes/conference/sigcomm-2023)) \[[Paper](https://dl.acm.org/doi/10.1145/3603269.3604869)]
  * THU & ByteDance
* Accelerating Distributed MoE Training and Inference with Lina ([ATC 2023](https://paper.lingyunyang.com/reading-notes/conference/atc-2023)) \[[Paper](https://www.usenix.org/conference/atc23/presentation/li-jiamin)]
  * CityU & ByteDance & CUHK
* SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization ([ATC 2023](https://paper.lingyunyang.com/reading-notes/conference/atc-2023)) \[[Paper](https://www.usenix.org/conference/atc23/presentation/zhai)] \[[Code](https://github.com/thu-pacman/SmartMoE-AE)]
  * THU

## MoE Inference

* Accelerating Distributed MoE Training and Inference with Lina ([ATC 2023](https://paper.lingyunyang.com/reading-notes/conference/atc-2023)) \[[Paper](https://www.usenix.org/conference/atc23/presentation/li-jiamin)]
  * CityU & ByteDance & CUHK
* Optimizing Dynamic Neural Networks with Brainstorm ([OSDI 2023](https://paper.lingyunyang.com/reading-notes/conference/osdi-2023)) \[[Paper](https://www.usenix.org/conference/osdi23/presentation/cui)]
  * SJTU & MSRA & USTC

## Models

* Mixtral-8x7B \[[Hugging Face](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1)] \[[Blog](https://mistral.ai/news/mixtral-of-experts/)]
  * Mistral AI
