Can a given molecule be made biosynthetically?

Biosynthesis makes a molecule through metabolic pathways inside a living organism. These pathways typically break down starting molecules (e.g., D-glucose) and rearrange their atoms into a desired final molecule (e.g., ethanol or curcumin). They usually involve multiple steps and multiple enzymatic catalysts.
At Bota, we produce some (but not all) of our commercially relevant molecules through biosynthesis. Some projects require chemical finishing steps (i.e., in a chemical plant). Identifying whether a molecule can be made purely through metabolic pathways (biosynthesis) or needs chemical finishing steps is a key part of kicking off a project.
The assessment can be complicated because enzymatic catalysts are sometimes evolved by humans, using directed evolution, to extend the reach of wild-type biology. For example, a promiscuous enzyme can be evolved to take on a new reaction (e.g., an enzyme that natively decarboxylates one substrate might be evolved to decarboxylate a similar substrate as well). The scope of what biochemistry alone can achieve is therefore blurry and not always fully defined. We seek algorithms to help humans make a judgment call.

The purpose of this hackathon is to decide whether a given molecule can be made by biosynthesis from simple substrates such as D-glucose, or from known molecules that are Bio-reachable. You can use any dataset to train a model that identifies whether a molecule is Bio-reachable (producible through a metabolic pathway) without any chemical finishing steps.

Datasets (recommended):
MetaNetX - all molecules in these curated metabolic models that have reactions producing them can be assumed to be Bio-reachable, as they are part of the cells' native metabolism. Please use InChIKeys as the input when specifying a molecule, as this representation does not rely on arbitrary human naming conventions. Perhaps features of these molecules can be extracted to build a classifier that separates Bio-reachable molecules from those that require chemistry in a lab? If successful (a valid model under cross-validation on the training data), the model could then be extended to the entire Bio-reachable space.
MetaNetX is certainly incomplete (it does not have ALL Bio-reachable molecules - some papers were missed, and new discoveries await biochemists!). It also has thousands of Bio-reachable molecules with no reactions yet assigned to produce them, and there are gaps in human knowledge about how many of these molecules are biosynthesized. The scope of what biochemistry in cells alone can achieve is therefore blurry and not always fully defined. The algorithms we seek should learn from those molecules that are known and annotated as Bio-reachable today (i.e., have producing reactions).
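As a rough illustration of the labeling idea described for MetaNetX, the sketch below checks that an input looks like an InChIKey and then marks molecules with at least one producing reaction as known Bio-reachable, leaving the rest unlabeled rather than negative. The record layout, the reaction IDs, and the second InChIKey are hypothetical placeholders, not the real MetaNetX file format; the regular expression only checks the 27-character layout, not whether the key corresponds to a real structure.

```python
import re
from typing import Dict, List, Optional

# Layout check only: 14 letters, hyphen, 10 letters, hyphen, 1 letter (27 chars).
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_valid_inchikey(key: str) -> bool:
    """Return True if `key` has the InChIKey character layout."""
    return INCHIKEY_RE.fullmatch(key) is not None


def label_molecules(
    producing_reactions: Dict[str, List[str]],
) -> Dict[str, Optional[int]]:
    """Map InChIKey -> 1 (has producing reactions) or None (unlabeled).

    Molecules without producing reactions are NOT labeled negative,
    because the database is known to be incomplete.
    Entries whose key is not InChIKey-shaped are skipped.
    """
    labels: Dict[str, Optional[int]] = {}
    for key, reactions in producing_reactions.items():
        if not is_valid_inchikey(key):
            continue
        labels[key] = 1 if reactions else None
    return labels


# Hypothetical toy records (assumption: InChIKey -> list of producing reaction IDs).
toy_records = {
    "LFQSCWFLJHTTHZ-UHFFFAOYSA-N": ["rxn_0001"],  # ethanol; reaction ID is made up
    "AAAAAAAAAAAAAA-BBBBBBBBSA-N": [],            # placeholder key, no reactions
    "not-an-inchikey": ["rxn_0003"],              # malformed input, skipped
}

labels = label_molecules(toy_records)
```

Keeping unlabeled molecules separate from negatives is a design choice: it leaves room for positive-unlabeled learning or for adding a curated set of chemistry-only molecules as explicit negatives.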
PubChem - an open chemistry database at the National Institutes of Health (NIH) that has become a key chemical information resource for scientists, students, and the general public. PubChem contains information on chemical structures, identifiers, chemical and physical properties, biological activities, patents, health, safety, and toxicity data, among others. You can therefore extend your features and enrich the dataset through PUG REST, a REST API for accessing this molecule information.
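A minimal sketch of how PUG REST could be queried by InChIKey to pull a few computed properties as candidate features. The helper names (`property_url`, `fetch_properties`) are hypothetical, and the property list is only an example; check the PUG REST documentation for the full set of property names and for usage limits before running this at scale.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List, Sequence

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(inchikey: str, properties: Sequence[str]) -> str:
    """Build a PUG REST URL requesting compound properties by InChIKey."""
    key = urllib.parse.quote(inchikey, safe="")
    props = ",".join(properties)
    return f"{PUG_REST_BASE}/compound/inchikey/{key}/property/{props}/JSON"


def fetch_properties(
    inchikey: str, properties: Sequence[str], timeout: float = 30.0
) -> List[Dict]:
    """Fetch the requested properties; returns one dict per matching compound.

    Performs a network request, so it is not called in this sketch.
    """
    url = property_url(inchikey, properties)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["PropertyTable"]["Properties"]


# Example feature request for ethanol (URL construction only, no network call).
example_url = property_url(
    "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
    ["MolecularWeight", "XLogP", "TPSA"],
)
```

Calling `fetch_properties("LFQSCWFLJHTTHZ-UHFFFAOYSA-N", ["MolecularWeight", "XLogP"])` would then return the property records, which can be joined to the training table on the InChIKey.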

A great reference to read is "Pathway design using de novo steps through uncharted biochemical spaces" (available on PMC). In this paper, known enzymatic reactions and "generalized reaction rules" are seamlessly blended to compute routes to molecules of interest. However, the approach is not open source, and it relies on a complex optimization approach. We believe a simpler and more generalizable machine learning approach could be useful.
This reference is provided for context only. Here we do NOT seek routes to target molecules, only the best guess at whether they are biologically reachable, based on novel AI/ML approaches. It will be a bonus if the model can help interpret why a molecule is or is not Bio-reachable (i.e., which features in the model drive the prediction).
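One possible way to approach the interpretability bonus is permutation importance: shuffle one feature column at a time and measure how much the model's accuracy drops. The sketch below uses a tiny nearest-centroid classifier as a stand-in for whatever model a team actually trains. The two features, their values, and the labels are invented toy data, and the importance is computed on the training rows for brevity; a held-out set would be the more honest choice in practice.

```python
import random
from typing import Dict, List, Sequence


def fit_centroids(X: Sequence[Sequence[float]], y: Sequence[int]) -> Dict[int, List[float]]:
    """Return the mean feature vector of each class (nearest-centroid model)."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids: Dict[int, List[float]], x: Sequence[float]) -> int:
    """Predict the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(c: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, X, y) -> float:
    return sum(predict(centroids, x) == t for x, t in zip(X, y)) / len(y)


def permutation_importance(centroids, X, y, n_repeats: int = 20, seed: int = 0) -> List[float]:
    """Mean accuracy drop per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(centroids, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [x[j] for x in X]
            rng.shuffle(column)
            X_perm = [list(x[:j]) + [v] + list(x[j + 1:]) for x, v in zip(X, column)]
            drops.append(baseline - accuracy(centroids, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data (assumption): feature 0 separates the classes, feature 1 is constant.
X = [[0.90, 1.0], [0.80, 1.0], [0.85, 1.0], [0.95, 1.0],
     [0.10, 1.0], [0.20, 1.0], [0.15, 1.0], [0.05, 1.0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = Bio-reachable, 0 = requires chemistry (toy labels)

model = fit_centroids(X, y)
importances = permutation_importance(model, X, y)
```

In this toy setup the constant feature gets an importance of zero while the separating feature gets a positive one, which is the kind of signal that could help explain a prediction to a human reviewer.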