Geometric Diffusion Models for Molecular Simulation and Free Energy Calculations

Results of this project will be shown here by the end of June 2024.

Apply to this project here

Goal:
A central challenge in biochemistry is calculating free energies, such as the binding free energy of a small molecule to a protein. Accurate free energies are valuable in many fields. In drug discovery, for example, the binding free energy quantifies how strongly a molecule interacts with a target protein and therefore helps identify drug candidates. Our goal is to use generative modeling methods to explore biophysically relevant properties of molecules’ Boltzmann distributions, such as free energy differences or transition rates in the presence of a catalyst or enzyme.

We aim for a publication at an ML conference (e.g., ICLR, ICML, NeurIPS) and particularly encourage students who are interested in academia (or generally a career in research) to apply to this project. Last year’s project resulted in an ICLR workshop paper [1]. Feel free to reach out with any questions: hstark@mit.edu, celine.marquet@tum.de

Main Methods:
This project builds on geometric deep learning methods such as graph neural networks (GNNs) and SE(3)-equivariant networks, for example those implemented with e3nn [2]. We aim to train diffusion models on the Boltzmann distributions of molecules (the distributions over their possible 3D structures) and to map between these distributions. The purpose is twofold: to compute quantities that measure physical differences between the distributions, such as free energy differences, and to sample 3D structures from them.
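As a rough illustration of the diffusion component, the sketch below applies the closed-form forward noising step of denoising diffusion probabilistic models [7] to toy Cartesian coordinates. The linear noise schedule and its endpoints, the three-atom "conformer", and all function names are assumptions made for this example and do not describe the project's actual setup.

```python
import math
import random


def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed values)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta_t = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta_t
        alpha_bars.append(running)
    return alpha_bars


def noise_conformer(coords, alpha_bar_t, rng):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I) for every coordinate."""
    signal = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [
        tuple(signal * c + sigma * rng.gauss(0.0, 1.0) for c in atom)
        for atom in coords
    ]


rng = random.Random(0)
x0 = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]  # hypothetical toy conformer
alpha_bars = alpha_bar_schedule()
x_early = noise_conformer(x0, alpha_bars[10], rng)   # still close to x0
x_late = noise_conformer(x0, alpha_bars[-1], rng)    # almost pure Gaussian noise
print(alpha_bars[10], alpha_bars[-1])
```

A diffusion model is trained to reverse this noising process; at the final step almost no signal from the original structure remains, so sampling can start from Gaussian noise.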

Highly related reading in the direction of this project:

  • Boltzmann generators [3]: a normalizing flow that samples 3D structures of molecules and computes ensemble quantities such as free energies.
  • Targeted Free Energy Perturbation [4]: a normalizing flow that maps between the distributions of two systems to calculate their free energy difference.
  • Distributional Graphormer [5]: a diffusion model that samples the Boltzmann distribution of proteins.
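To make the idea of mapping between two distributions concrete, here is a minimal sketch of a mapped free energy estimate in the spirit of targeted free energy perturbation [4], using two 1D harmonic potentials for which the exact answer is known. The potentials, the hand-written scaling map (standing in for a learned flow), the value of kT, and all function names are assumptions for illustration only.

```python
import math
import random

KT = 1.0             # thermal energy in arbitrary units (assumption)
K_A, K_B = 1.0, 4.0  # spring constants of toy systems A and B (assumption)


def u_a(x):
    return 0.5 * K_A * x * x


def u_b(x):
    return 0.5 * K_B * x * x


def mapped_delta_f(samples_a, mapping, log_abs_jacobian):
    """Estimate F_B - F_A = -kT log < exp(-(U_B(M(x)) - U_A(x)) / kT + log|dM/dx|) >_A."""
    log_terms = [
        -(u_b(mapping(x)) - u_a(x)) / KT + log_abs_jacobian(x) for x in samples_a
    ]
    shift = max(log_terms)  # log-sum-exp trick for numerical stability
    mean = sum(math.exp(t - shift) for t in log_terms) / len(log_terms)
    return -KT * (shift + math.log(mean))


rng = random.Random(0)
# Exact Boltzmann samples of system A: Gaussian with variance kT / K_A.
samples_a = [rng.gauss(0.0, math.sqrt(KT / K_A)) for _ in range(20000)]

scale = math.sqrt(K_A / K_B)  # this map transports A's distribution exactly onto B's
delta_f_mapped = mapped_delta_f(samples_a, lambda x: scale * x, lambda x: math.log(scale))
# The identity map recovers standard free energy perturbation (no mapping).
delta_f_plain = mapped_delta_f(samples_a, lambda x: x, lambda x: 0.0)
delta_f_exact = 0.5 * KT * math.log(K_B / K_A)

print(delta_f_exact, delta_f_mapped, delta_f_plain)
```

With the exact map every sample contributes the same value, so the estimate has no statistical noise, whereas the unmapped estimate depends on how well the two distributions overlap. A learned map would fall somewhere between these two cases.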

In the project, you will become familiar with:

  • Diffusion models [7], [8]
  • E(3)-equivariant graph neural networks [9]
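The equivariance property can be checked numerically. The sketch below uses a coordinate update of the form found in E(n) equivariant graph neural networks [9], with a fixed function of the squared distance standing in for the learned network, and verifies that updating and then applying a rigid motion gives the same result as applying the rigid motion and then updating. The weight function, the toy coordinates, and all names are assumptions for illustration; a real layer would also carry node features and learned parameters.

```python
import math


def coordinate_update(coords):
    """x_i' = x_i + 1/(n-1) * sum_{j != i} (x_i - x_j) * phi(|x_i - x_j|^2)."""
    n = len(coords)
    updated = []
    for i, x_i in enumerate(coords):
        shift = [0.0, 0.0, 0.0]
        for j, x_j in enumerate(coords):
            if i == j:
                continue
            diff = [a - b for a, b in zip(x_i, x_j)]
            dist_sq = sum(d * d for d in diff)
            weight = 1.0 / (1.0 + dist_sq)  # fixed stand-in for a learned phi
            for axis in range(3):
                shift[axis] += weight * diff[axis] / (n - 1)
        updated.append([a + s for a, s in zip(x_i, shift)])
    return updated


def rigid_transform(coords, angle_z, angle_x, translation):
    """Rotate about the z axis, then about the x axis, then translate."""
    cz, sz = math.cos(angle_z), math.sin(angle_z)
    cx, sx = math.cos(angle_x), math.sin(angle_x)
    tx, ty, tz = translation
    out = []
    for x, y, z in coords:
        x1, y1, z1 = cz * x - sz * y, sz * x + cz * y, z
        x2, y2, z2 = x1, cx * y1 - sx * z1, sx * y1 + cx * z1
        out.append([x2 + tx, y2 + ty, z2 + tz])
    return out


coords = [[0.0, 0.0, 0.0], [1.2, 0.1, -0.3], [0.4, 1.5, 0.7], [-0.9, 0.3, 1.1]]
motion = (0.7, -1.3, (2.0, -1.0, 0.5))  # arbitrary rotation angles and translation

update_then_move = rigid_transform(coordinate_update(coords), *motion)
move_then_update = coordinate_update(rigid_transform(coords, *motion))

max_deviation = max(
    abs(a - b)
    for p, q in zip(update_then_move, move_then_update)
    for a, b in zip(p, q)
)
print(max_deviation)  # expected to be at floating-point round-off level
```

The update is equivariant because it only uses relative positions scaled by functions of distances: relative positions rotate with the molecule, and distances do not change under rotation or translation.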

Data:
Protein structures from the Protein Data Bank (PDB) and protein simulation data of 12 fast-folding proteins [6]. We will also generate our own data with molecular dynamics simulations using OpenMM and employ its energy functions for training.

References:

[1] Mohamed Amine Ketata, Cedrik Laue, Ruslan Mammadov, Hannes Stärk, Menghua Wu, Gabriele Corso, Céline Marquet, Regina Barzilay, Tommi S. Jaakkola, “DiffDock-PP: Rigid Protein-Protein Docking with Diffusion Models”, https://arxiv.org/abs/2304.03889

[2] M. Geiger, T. Smidt, “e3nn: Euclidean Neural Networks”, https://arxiv.org/abs/2207.09453

[3] Frank Noé, Simon Olsson, Jonas Köhler, Hao Wu, “Boltzmann Generators -- Sampling Equilibrium States of Many-Body Systems with Deep Learning”

[4] Andrea Rizzi, Paolo Carloni, Michele Parrinello, “Targeted free energy perturbation revisited: Accurate free energies from mapped reference potentials”

[5] Shuxin Zheng, Jiyan He, Chang Liu, Yu Shi, Ziheng Lu, Weitao Feng, Fusong Ju, Jiaxi Wang, Jianwei Zhu, Yaosen Min, He Zhang, Shidi Tang, Hongxia Hao, Peiran Jin, Chi Chen, Frank Noé, Haiguang Liu, Tie-Yan Liu, “Towards Predicting Equilibrium Distributions for Molecular Systems with Deep Learning”

[6] Kresten Lindorff-Larsen, Stefano Piana, Ron O. Dror, David E. Shaw, “How fast-folding proteins fold”

[7] J. Ho, A. Jain, P. Abbeel, “Denoising Diffusion Probabilistic Models”

[8] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, B. Poole, “Score-Based Generative Modeling through Stochastic Differential Equations”

[9] V. G. Satorras, E. Hoogeboom, M. Welling, “E(n) Equivariant Graph Neural Networks”