In this episode, Alison and I talk to Marnix Medema, from Wageningen University in the Netherlands. Marnix and his collaborators have a fantastic set of software tools for exploring secondary metabolite biosynthetic gene clusters (BGCs). Most people in the field are going to be familiar with antiSMASH, which is available for analysis of your genomes on the web, or you can download a Docker container or Anaconda/Bioconda package or the source code and run it yourself on your own computer. MIBiG is a repository of experimentally characterized and manually annotated BGCs. And now BiG-SCAPE and/or BiG-SLiCE can compare your gene clusters and group them into families, which you can also look at in BiG-FAM. This ecosystem for secondary metabolism has become transformative to the natural products community, and Marnix’s enthusiasm for science and excitement about creating fantastic collaborative research projects is infectious. This episode should make a nice jumping-off point into the deeper end of genome mining after the Genome Mining Primer episode last time.
Transcript
Dan Udwary: You’re listening to the Department of Energy Joint Genome Institute’s Natural Prodcast, a podcast about natural products and the science and scientists of secondary metabolism. Hey there, and welcome back to Natural Prodcast, and this will be episode 11. This continues our little sprint on genome mining.
In the last episode, Alison and I talked about the basics of genome mining, which is using DNA sequence to identify and interpret biosynthetic, secondary metabolism pathways. And today, we get to present our conversation with Marnix Medema, from Wageningen University in the Netherlands. Marnix is still pretty young, but he’s been dead center in a lot of the really important recent events in data analysis and secondary metabolism over the last decade or so.
He’s involved with a lot of the common tools that genome miners use, including antiSMASH, MIBiG, and newer efforts, like BiG-SCAPE and BiG-SLiCE, and now the BiG-FAM database of biosynthetic gene cluster families. And he tells us about forthcoming efforts in paired-omics data sets and seeking to interpret function from BGCs.
It’s a long, wide-ranging conversation and I’m so happy we got the chance to talk and that now I can bring it to you. So enjoy, episode 11 of Natural Prodcast.
Alison Takemura: Marnix, where are you right now?
Marnix Medema: I’m in my room at home. Or actually not my room, it’s the room of my little baby daughter, who inherited this room from her sister. And it was supposed to be her room now but, well, it became my office due to the pandemic. And she’s sleeping in our bedroom together with us still while she’s still four months old. So it’s not a bad location for a little baby. Rooming in, is what they call it, right?
Alison Takemura: Yeah. Crashing. Maybe it’s what I’m used to thinking of, like someone crashing at someone else’s place. OK. That helps me.
Dan Udwary: Congratulations.
Alison Takemura: Yeah.
Marnix Medema: Thanks.
Alison Takemura: Definitely. Wow. Yeah. What a time. Four months.
Marnix Medema: Yeah. Remember Dan, when you were supposed to visit us in the Netherlands early this summer, then it was– well, I think the day your visit was planned was just a few days before the due date.
Dan Udwary: Oh, no.
Marnix Medema: It was quite uncertain whether that meeting was going to happen in the first place. And yet everything got canceled due to the pandemic.
Dan Udwary: Right. Well, at least we can get to talk to you today.
Marnix Medema: Yes, absolutely.
Dan Udwary: So we have a lot to talk about. I think there’s always a lot to talk to you about, especially in terms of genome mining and things. But before that, I wanted to tell a quick story. Looking back, the year was probably 2010, 2011, I was a, uh- actually, I’ve got the t-shirt on today. [Dan points to his URI T-shirt.] I was a professor at the University of Rhode Island. And was working on a natural products/biosynthetic gene cluster database, and a tool for predicting natural product structures from genomes.
And I was writing grants furiously to try to get some funding for this thing because I needed more people to work on things, and to actually do some chemistry. And a few months after I submitted one of those big grants that — fingers crossed, tenure relying on this — antiSMASH version 1 comes out. The reviewers were not kind to my submission, they were saying that natural product identification– biosynthetic gene cluster identification was now a solved problem and so my tools were not needed.
It all worked out in the end because later I brought all those tools to Warp Drive and we used them for a while before they switched to antiSMASH. So I just wanted to start by saying, thank you for destroying my academic career, because things are much better now.
Marnix Medema: Well, I had no idea, Dan. I’m so sorry. That was of course never the intention.
Dan Udwary: No! No…
Marnix Medema: To be honest, I didn’t even know of your existence back then. So there were no hostilities.
Dan Udwary: That is certain, yes. So welcome. I think we have all kinds of fun things to talk about today and it will be really good to maybe start with how did you get into this field? How did you get into natural products?
Marnix Medema: Well, it’s actually in a way a fun story, because I was bound to do something else. I was a Master’s student at the University of Groningen and then I wrote my own grant proposal in the hope that it would get funded. And like yours, it didn’t. However – and this was a proposal on synthetic biology of engineering gene expression in bacteria – however, the professor was actually quite enthusiastic about the project and he arranged funding to do it nonetheless from his own group’s funds. So I could start that project, and I was about to start that project when another vacancy came up, that triggered me and was by Eriko Takano and Rainer Breitling at the University of Groningen as well [at that time].
And they had this project about this kind of weird system. And they wanted to look at secondary metabolism in bacteria. And what I got intrigued by was that the project was a mixture, initially, of experimental and computational work. And that it involved some of these crazy genes, some of these enormous genes that encode these very large enzymes. And I was like, OK, wow, these are really cool.
And I got to talk to them. And then I really got very enthusiastic about the project, and especially also about the more computational aspects of it, which I had found out during my Masters that I really liked. So in the end I chose to do that project over the project that I had written myself. But one thing I should admit, the experimental component never happened because I had so much fun with the computer, scripting and coding, that I never ever touched a pipet again except for, of course, some cases when journalists come and want to see you handle a pipet for making a nice video.
Alison Takemura: Marnix, It’s just from the biological perspective, like what was this bacterium, and why did your group want to study it?
Marnix Medema: You mean the PhD project that I didn’t finish?
Alison Takemura: Yeah. Right.
Marnix Medema: This was part of a large initiative in the Netherlands at the time called Genbiotics, which was a large program to find novel antibiotics. And this particular project focused initially on Streptomyces clavuligerus, which was an industrial strain, a Streptomyces strain, the bacterium, that had been worked on by the Dutch company DSM for quite some time. So they had actually sequenced the genome a number of years earlier as one of the first Streptomyces genomes that had just not been released yet. And there was also some transcriptome data.
And my first task during that project was actually to look at the genome and to figure out, OK, can we find interesting biosynthetic gene clusters in there to find new antibiotics. And, as with many projects in bioinformatics, it was something that was born out of frustration, in the sense that I was doing all these manual BLAST searches and HMMER searches trying to find these gene clusters.
And I had heard the first rumors of people sequencing like these enormous numbers of genomes, like 20 genomes, which was just incredible at that time, that people would actually sequence 20 genomes. So I thought OK, if this is really happening, if people are really going to sequence 10 or 20 genomes at once then we really need to have automated tools for this. Because well, you can’t do this all manually, and that’s how for me the antiSMASH journey started during my PhD.
And then I actually found out, through a connection of my supervisor, that in parallel, people in Tübingen in southern Germany were working on a very similar project. Like six months after I started, another PhD student, Kai Blin, actually started there with his PhD project, and when Eriko, one of my supervisors, found out about that, we actually talked to them and we said OK, it’s much better to do it together, to join forces instead of making two competing tools.
If we had known about your tool, Dan, we would have contacted you as well and we could have made an even bigger and better version one of antiSMASH with the three of us, but unfortunately we didn’t know about that. But it was really fun working together with Kai and getting this thing off the ground. Two fanatic PhD students.
Alison Takemura: It sounds great. Yeah, it sounds like it was really needed at that time.
Dan Udwary: So we should probably just back up a tiny bit and just say what antiSMASH actually is and does.
Marnix Medema: Yeah. So antiSMASH. We built antiSMASH as a tool to automate the identification of biosynthetic gene clusters. So it’s actually a software tool and it takes as input a genome sequence, then it uses a number of similarity searches, looking for certain marker genes and domains, to identify enzyme-coding genes that are signatures for the presence of a specific type of biosynthetic gene cluster.
And antiSMASH has a library of those kinds of signature domains and it looks for particular combinations of these and then it automatically finds these gene clusters in your genome and outputs them in the visual HTML file that you can browse.
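[A minimal sketch, in Python, of the rule-based idea Marnix describes: look for particular combinations of signature domains among neighbouring genes. The domain names, the two rules, and the three-gene window are illustrative assumptions, not antiSMASH’s actual rule set or code.]

```python
from typing import Dict, List, Set, Tuple

# Hypothetical rules: cluster type -> signature domains that must co-occur nearby.
RULES: Dict[str, Set[str]] = {
    "NRPS": {"Condensation", "AMP-binding"},
    "T1PKS": {"PKS_KS", "PKS_AT"},
}

Gene = Tuple[str, Set[str]]  # (gene name, domains found in that gene)


def detect_clusters(genes: List[Gene], window: int = 3) -> List[Tuple[str, int, int]]:
    """Slide a window over genes in chromosomal order and report regions
    (cluster type, first gene index, last gene index) where a rule is met.
    Overlapping matching windows of the same type are merged into one region."""
    hits = []
    for cluster_type, required in RULES.items():
        region = None  # region currently being extended, as (first, last)
        for start in range(len(genes)):
            end = min(start + window, len(genes))
            domains: Set[str] = set()
            for _, gene_domains in genes[start:end]:
                domains |= gene_domains
            if not required <= domains:
                continue  # this window lacks at least one signature domain
            if region is not None and start <= region[1]:
                region = (region[0], end - 1)  # extend the open region
            else:
                if region is not None:
                    hits.append((cluster_type, region[0], region[1]))
                region = (start, end - 1)
        if region is not None:
            hits.append((cluster_type, region[0], region[1]))
    return hits


# Toy "genome": eight genes with made-up domain annotations.
genes = [
    ("g1", {"ABC_tran"}),
    ("g2", {"Condensation"}),
    ("g3", {"AMP-binding", "PCP"}),
    ("g4", set()),
    ("g5", {"PKS_KS"}),
    ("g6", set()),
    ("g7", set()),
    ("g8", {"PKS_AT"}),
]
print(detect_clusters(genes))
```

[With this toy input, the two polyketide signature domains sit too far apart to fall in one window, so only the NRPS-like region, spanning gene indices 0 to 3, is reported.]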
Dan Udwary: Yeah. And I think at least one of the things that I think is really important about antiSMASH, is that it’s still under development. That you guys are – Kai probably most, right? – is you guys are still adding to it and increasing its capabilities. It’s really important for software like this.
Marnix Medema: Yeah, absolutely. This is still a collaboration, indeed, between my group and Tilmann Weber’s group, and in particular, Kai Blin and Simon Shaw, who are working with Tilmann on this. But this has also become very open to the community, so the versions of antiSMASH that we’ve published since then (we’re now nearing version six) have seen contributions from people all over the world, who have made additions to antiSMASH and contributed features. That has been great fun to actually work with so many people on this framework. And I think this is really– I personally like this really as a model for a collaboration. You have your code as open source and you can work on it together with whomever is interested.
Alison Takemura: Can I ask a quick question? Just, what is one of your favorite added features, Marnix, to antiSMASH that someone else developed?
Dan Udwary: Pick a favorite child, right? And make everybody else mad.
Marnix Medema: Yeah, it’s difficult. Yeah, one of the features that I do quite like is, for example– and I think it was quite a nice example of collaboration as well– the integration of the RODEO algorithm by the group of Doug Mitchell from Illinois. So this is an algorithm that identifies, using machine learning models, the genes that encode the very small precursor peptides that end up being post-translationally modified into ribosomally synthesized peptide natural products. And they built a few of these very reliable support vector machine models that we were able to integrate into antiSMASH, and I thought that was really a nice example of teamwork and combining things that another group has developed into our framework.
Alison Takemura: Cool, Thank you. I wish I could use this tool. I don’t have this kind of research area though, so.
Dan Udwary: You can, Alison, because antiSMASH is readily available over the web and–
Marnix Medema: In reality, there are many. I would say, perhaps, half or more of the users of antiSMASH are not in this field either. So they use antiSMASH and they report findings using antiSMASH. And it’s actually quite a challenge also in making a tool that is not too easy to– for which it’s not too easy to misinterpret the results. Because of course, if you’ve never heard about biosynthetic gene clusters and it’s one of the 10 analyses that you do, then it’s very easy to think that, OK, I find 25 gene clusters, there must be 25 new compounds.
Or everything that has similarity to a gene cluster with a known compound from the reference database: It must be identical. It must be that compound that is produced from that gene cluster. And this has been quite a challenge, and there are actually some cases where we’ve changed things in antiSMASH because, well, some things were prone to misinterpretation if you were new to the field.
Dan Udwary: For sure. I mean, that goes all the way back to NCBI, gene annotations and ontologies and things that — things have always been misinterpreted by people who don’t quite know what they’re looking at, and hard to do on a large scale. And so–
Marnix Medema: But in the end, I do feel that if you develop a bioinformatic tool… it’s like when you write a book, the reader always knows best. So if all the readers are misinterpreting, you would have done something wrong as a writer. I think you should have the same attitude as a developer of software, because if a lot of users are misinterpreting the results, then you need to do better as a developer. So that has been also a journey of continuous improvement.
Dan Udwary: What’s the development process these days for antiSMASH?
Marnix Medema: So at the moment we actually add features as they come, and this is done through also a process where you say, OK, I want to build this feature, then you– well, Simon is currently basically doing the reviews of the code. So if you submit some code, if you do a pull request, he will check it and see whether the quality is up to standard, and maybe he gives you some feedback as a developer. And then when it’s, well, when the code is good enough, it can be integrated into the code base. And at the moment, we have rolling updates, so maybe a few months later it would appear on the web version of antiSMASH as well.
Initially, in the first years of antiSMASH, we only made updates for every NAR paper that we wrote about, like, a new version of antiSMASH. And then we introduced a bunch of new features all at once to be able to publish a new paper on a new version of antiSMASH. But that gave so much stress around these deadlines, that we thought, OK, it’s actually nicer to be able to continuously add things, and then every two years we just summarize, what are all the new things that happened over those two years? So that’s the new model that we’re now following. And I think proposed it at a moment, and I think that was a very good idea because it really alleviates the amount of stress around this, especially Christmas time when NAR, the Nucleic Acids Research that’s a journal that–
Dan Udwary: Yeah.
Marnix Medema: takes these articles – when they would like to have these submitted.
Dan Udwary: So it sounds very– you guys are very open to outside input, sort of an open source model.
Marnix Medema: Yeah, yeah. We even had, at some point, a software consultant who worked for Novartis. He actually, for Novartis, he built some features. And for them it was actually useful to have it as part of the main codebase, so that if we get updates, then their features would go along. So he actually contributed them. That was also a really nice example, I think, of industry and academia working together. So even something like that can work very well.
Dan Udwary: Yeah, great. And so generating data is super important, but access to data is also pretty critical. And I see that over the last few years I think you’ve delved more into providing data to whoever. And so I wanted to first talk about MIBiG. Is that how I should say that: “mybig?” “M-I-Big?”
Marnix Medema: “M-I-Big.”
Dan Udwary: MIBiG, you’re like that. OK.
Marnix Medema: Yeah.
Dan Udwary: So tell us about MIBiG.
Marnix Medema: So it was actually a project that I did as a postdoc. So I wrote a grant proposal and then got funded to work for two years at the Max Planck Institute for Marine Microbiology in Bremen, Germany. And I wanted to work with the group of Frank Oliver Glöckner, because they are world experts at standardizing data. So they had actually devised a standard called the “Minimum Information about any (x) Sequence” (MIxS) standard a number of years before that, which stipulated, if you submit a DNA sequence to a database, what kind of metadata you would then need to record to make that data interpretable for everyone.
And I wanted to do something similar for MIBiG because I saw that there was this, well, big problem in the field, that there were many papers reporting sequences and pathways– sequence is not right to say. But of gene clusters and pathways encoded by them. Well, they had come out over actually multiple decades of time. But still to be able to use those data, you would actually manually have to go through all those papers, find all that information, and manually edit, and put it together in Excel sheets or something. So this was, well quite terrible for anyone who wants to actually use those data.
Dan Udwary: Yeah.
Marnix Medema: And I was already planning to start a group in computational biology of natural products, and I thought, OK, even for my own research and my future group, I need to be able to have this data in a standardized way to be able to train new algorithms, et cetera. And that’s not just selfishly. Of course, this was also very important for the whole community, so. This resonated with a lot of people, and I thought, OK, let’s make it open. Let’s not just keep it for myself, but let’s make this an open endeavor where we involve everyone in the whole community, whoever wants to contribute. And make this an open resource that is owned by the community.
And then I contacted a number of PIs, and in the end, I think 80 research groups from around the world contributed with, well, more than 150 authors of the paper in the end. And it was actually a fun process of putting together such a standard and also going back into and finding [and] annotating gene clusters that have been published in the decades before that using that standard. And one of the most– one of the things that most impressed me at that time was the number of emails that I was exchanging with people. So I think I wrote 3,000 emails in the course of one and a half years while emailing with all of those 150 plus people.
I have to admit, as a PI, I’m no longer impressed by that number. But as a postdoc, I thought, OK, wow, these are so many emails. But it was great as well, because it was a really fantastic way to get to know the whole community. So I was interacting with almost everybody in the field, and that was so much fun just getting to know people a little bit and knowing what they’re doing, getting familiar with many of these pathways.
Dan Udwary: Yeah, that’s a great way to build out a community.
Marnix Medema: Yeah. It was a lot of fun.
Alison Takemura: And is that just the start, like the conception of this project was a way to bring the whole community together? But it sounds like this database is still an ongoing effort. And so is it a way to keep the community going together?
Marnix Medema: Yeah, so people are indeed still submitting data, so. And especially, actually, at the start of the pandemic, everybody was looking for something to do on their computer, so I got so many submissions. We got so many submissions. But indeed, people are still contributing this data and that’s fantastic too. That indeed this is a great way to also bring attention to the work that people are doing, because if you submit data on something you’ve published and you put it in a public database, people are able to find your gene cluster and cite it. But it also makes the data usable for people who want to do the next steps in science.
Alison Takemura: And what does MIBiG stand for?
Marnix Medema: It’s the “Minimum Information on a Biosynthetic Gene cluster,” where cluster, at the end, doesn’t have its own letter, but that’s simply because MIBiG sounded nicer than MIBiGC or MIBIC.
Dan Udwary: Yeah. And the big family continues. So next I think was BiG-SCAPE and now BiG-SLiCE. How did those come about?
Marnix Medema: Yeah, in several ways, actually. So BiG-SCAPE was one of the things I started working on when I started up my group in Wageningen. And yeah, this builds on technology that was developed in the group of Michael Fischbach, where I was a visiting scholar during my PhD. And I worked together with Peter Cimermancic. He was a grad student with Michael. And together we did the global analysis of biosynthetic gene clusters across, well, at that time, all genomes that were out there. It was 1,100 at that time.
And we used a technology called “sequence similarity networking” of gene clusters, where you would make a network where two– where every gene cluster would be a node, and they would be connected by edges if the gene clusters shared at least a certain level of similarity. So if similarity was higher than a certain kind of value. And if you do that, that allows you to get a bird’s-eye perspective on the biosynthetic diversity in all of these genomes. And you see how they’re– visually, how they are all connected.
And you can color these nodes. For example, based on taxonomy, or based on class of biosynthetic pathways, et cetera. And if you then also put in the known gene clusters, so the ones that ended up in MIBiG in the end, then you would be able to very quickly see, OK, which gene clusters and genomes are related to those known ones? And which groups or families of gene clusters are potentially novel. So it might yield novel pathways of interest.
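[The networking idea can be sketched in a few lines of Python: gene clusters are nodes, an edge joins two clusters whose similarity reaches a cutoff, and connected groups stand in for families. Everything here is an assumption made for illustration: the cluster names, the domain sets, the 0.5 cutoff, and the use of a plain Jaccard index on domain sets as the similarity score. BiG-SCAPE’s real metric is more elaborate.]

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared domains divided by all domains seen in either cluster."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_network(clusters: Dict[str, Set[str]], cutoff: float = 0.5) -> List[Tuple[str, str]]:
    """Edges between every pair of clusters whose similarity is >= cutoff."""
    return [
        (x, y)
        for x, y in combinations(sorted(clusters), 2)
        if jaccard(clusters[x], clusters[y]) >= cutoff
    ]


def families(clusters: Dict[str, Set[str]], edges: List[Tuple[str, str]]) -> List[List[str]]:
    """Connected components of the network, used here as a stand-in for families."""
    neighbours: Dict[str, Set[str]] = {name: set() for name in clusters}
    for x, y in edges:
        neighbours[x].add(y)
        neighbours[y].add(x)
    seen: Set[str] = set()
    result = []
    for name in sorted(clusters):
        if name in seen:
            continue
        stack, component = [name], set()
        while stack:  # depth-first walk over everything reachable from `name`
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        result.append(sorted(component))
    return result


# Made-up clusters; "known_1" plays the role of a characterized reference cluster.
clusters = {
    "bgc_A": {"PKS_KS", "PKS_AT", "ACP"},
    "bgc_B": {"PKS_KS", "PKS_AT", "ACP", "KR"},
    "bgc_C": {"Condensation", "AMP-binding", "PCP"},
    "known_1": {"PKS_KS", "PKS_AT", "ACP", "TE"},
}
edges = build_network(clusters)
print(families(clusters, edges))
```

[In this toy data, bgc_A and bgc_B land in the same group as the known reference, while bgc_C stands alone, which is the kind of cluster one might flag as potentially novel. Because the comparison uses sets of domains, it does not depend on gene order.]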
And at that time, the algorithms that we used were– they were working well, but they were not really as scalable as needed to make this a software for everyone to use. So I got a visiting postdoc in my lab, who ended up staying longer and he was a great person to work with, Jorge Navarro Muñoz. And Jorge actually started working on this project, first, together with a master’s student whom he was supervising, and then he took over the project himself.
And he really streamlined this whole process and made it so much faster than it was before. And then, well, this also became a much bigger project, and another PhD student in my group, Satria [Kautsar] built a really nice visualization tool around it, so you would actually be able to navigate these networks very easily. Also in an HTML format, which is also relatively scalable, so you can put in at least a few thousand gene clusters in a view and it still remains readable. And you can really quickly find relationships between gene clusters of interest.
And yeah, that was really a lot of fun to work on, and I think it’s also useful to have that technology also democratized so that everybody is able to use it. So if you sequence your own bunch of genomes, you’re able to do this to make these networks for your own data.
Dan Udwary: Very helpful, yes. And in that same sort of leapfrogging of data generation to data storage, now– well, behind the scenes, we had to cancel this two weeks ago, and since then your paper on BiG-FAM has come out, which seems to be a database of these gene cluster families, correct?
Marnix Medema: Yeah, maybe I should first explain what a gene cluster family is.
Dan Udwary: Sure.
Marnix Medema: Because a gene cluster family is a, I think, is a handy unit to be able to study biosynthetic diversity. There is no precise definition, like where to draw the boundaries. But in principle, a gene cluster family would be a set of gene clusters across different species, which all encode the production of either the same or similar molecules. And then of course, where you draw the boundary depends on, OK, do you really want them to be the same molecule or similar enough, and then how similar?
Dan Udwary: Yeah, similarity is a slippery slope in chemistry.
Marnix Medema: Yeah, exactly. There is no black and white in clustering in general in computer science.
Dan Udwary: Sure, Sure.
Marnix Medema: But it’s a really helpful way of, for example, finding out which organisms, for example, are likely for which organisms, their genomes contain the same biosynthetic gene clusters or similar biosynthetic gene clusters. And also correlating these gene clusters to metabolites. So knowing, OK, which metabolites are actually associated with these gene clusters, or to activity? So if you see a biological activity and you correlate that with the presence or expression across species in, for example, a strain collection with the presence of a certain type, certain family, of gene clusters. So this is a very helpful concept, I should say.
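[One simple way to picture the correlation Marnix mentions is to compare, strain by strain, whether a gene cluster family is present and whether an activity is observed. The Python sketch below uses the phi coefficient, which is the Pearson correlation for two yes/no variables. The strain names and values are invented, and this is only one possible statistic, not a method described in the episode.]

```python
import math
from typing import Dict


def phi_coefficient(presence: Dict[str, bool], activity: Dict[str, bool]) -> float:
    """Correlation between family presence and observed activity across strains.
    Only strains that appear in both dictionaries are counted."""
    strains = presence.keys() & activity.keys()
    n11 = sum(1 for s in strains if presence[s] and activity[s])
    n10 = sum(1 for s in strains if presence[s] and not activity[s])
    n01 = sum(1 for s in strains if not presence[s] and activity[s])
    n00 = sum(1 for s in strains if not presence[s] and not activity[s])
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # If either variable never varies, the correlation is undefined; report 0.0.
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0


# Invented strain collection: is the family present, and is the strain active?
family_present = {"strain1": True, "strain2": True, "strain3": False, "strain4": False}
antibacterial = {"strain1": True, "strain2": True, "strain3": False, "strain4": False}
print(phi_coefficient(family_present, antibacterial))
```

[A value near 1 means the family and the activity tend to occur in the same strains, and a value near -1 means they tend to exclude each other. Either way, it is a lead for follow-up work rather than proof of function.]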
So you asked about BiG-FAM, and BiG-FAM is a database of those biosynthetic gene cluster families. And those were built using a new algorithm that Satria, whom I just mentioned, who was a student in my group, built. And the idea for BiG-SLiCE came about especially because we noticed that BiG-SCAPE was very scalable compared to approaches before that. But when you wanted to include more than, let’s say, 100,000 gene clusters, then the computation times became too long.
And we were really interested in OK, can we reconstruct gene cluster families for all genomes that are out there. So that would enable you to do a global analysis of those biosynthetic gene cluster families to look [at] and also to provide a global resource of gene cluster families across all genomes. And the acceleration of genome sequencing, especially in bacteria, has been enormous. There are now hundreds of thousands of genomes that are publicly available. So compared to the 20 genomes that I told you about earlier, this is like orders of magnitude of difference.
So that means also that the algorithm had to scale also by orders of magnitude. And that’s why Satria figured out some new clever tricks to make an algorithm that is a little bit more rough, but is a lot more scalable, so that it actually allows you to do this. And that way that algorithm was in the end used to populate BiG-FAM.
Alison Takemura: And is that related to BiG-SLiCE?
Marnix Medema: Yeah, so BiG-SLiCE is the very scalable algorithm that allows you to cluster gene clusters like in this case, 1.2 million gene clusters into gene cluster families. And it does that by not relying on sequence alignments while still taking into account sequence similarity, but it does that by– maybe I should explain it this way.
So before there were basically two ways of dealing with sequence similarity, either we just looked at which protein domains are in a gene cluster or we did sequence alignment, where you really look at the exact sequences of the domains. Well, if you just look at which domains are in there, that usually wasn’t enough information. But if you need to align all the sequences especially if you do this for hundreds of thousands or more than a million gene clusters, this is computationally too expensive.
So the middle road that Satria found is like, he said, “OK, let’s take the 100, 200 most important protein domains that are defining for biosynthetic pathways and let’s split these domains into 100, for example, subdomains”, where he made a big phylogeny of all the proteins that had the protein domain. And then for every clade within that phylogeny he made its own sub-Pfam model, so subdomain model.
So now, when you enter a gene cluster into BiG-SLiCE, BiG-SLiCE identifies, for each gene that belongs to an important enzyme family, to which subfamily, quite specifically, it belongs. And that gives you information that is close to the information that you would get from a sequence alignment, so quite high resolution information. But it doesn’t actually require you to do big sequence alignments of all those sequences against each other. So it’s much faster.
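[To make the "middle road" concrete, here is a small Python sketch of turning domain and subdomain hits into a feature vector without any alignment between clusters. The feature names, the two clades per domain, and the 0/1 encoding are assumptions chosen for illustration. They are not BiG-SLiCE’s actual feature set or scoring.]

```python
from typing import List, Tuple

# Hypothetical feature layout: one slot per domain, plus one slot per
# subdomain clade of that domain.
FEATURES = [
    "PKS_KS", "PKS_KS:clade1", "PKS_KS:clade2",
    "AMP-binding", "AMP-binding:clade1", "AMP-binding:clade2",
]
INDEX = {name: i for i, name in enumerate(FEATURES)}


def featurize(hits: List[Tuple[str, str]]) -> List[int]:
    """hits: (domain, clade) pairs detected in one gene cluster.
    Sets the domain slot and the matching subdomain slot to 1."""
    vector = [0] * len(FEATURES)
    for domain, clade in hits:
        for key in (domain, f"{domain}:{clade}"):
            if key in INDEX:  # ignore anything outside the chosen feature set
                vector[INDEX[key]] = 1
    return vector


# Two made-up clusters that share a domain but fall into different clades.
cluster_x = featurize([("PKS_KS", "clade1")])
cluster_y = featurize([("PKS_KS", "clade2")])
print(cluster_x)
print(cluster_y)
```

[Looking only at which domains are present, the two toy clusters would be indistinguishable. The extra subdomain slots are what separate them, and computing them needs one scan per cluster rather than all-against-all alignments.]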
The fun thing, and this maybe even the coolest feature that Satria came up with, is that this allows you to actually locate a gene cluster family in feature space because a gene cluster is then represented as a feature vector of these domains and subdomains. And what is really useful about that is because a gene cluster family is represented by a point in space. That means that if you have a new gene cluster and you want to know how is it related to all those other gene cluster families, you don’t need to recompute everything, you just compute the feature vector for that gene cluster and you place it into the feature space. So, within seconds you know how it is related to all of those other gene cluster families instead of having to recompute everything when you add a data point.
And this also allows a very fast querying mode where when you run a gene cluster or when you run a genome through antiSMASH, you identify gene clusters and BiG-FAM allows placing each of those gene clusters into the gene cluster families that are surrounding it in the database.
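[The querying idea can be sketched as a nearest-centroid lookup in Python: each family is summarized by one point in feature space, and a new gene cluster is compared against those points only. The four-slot vectors, the family names, and the choice of Euclidean distance are assumptions for illustration, not the actual BiG-SLiCE or BiG-FAM implementation.]

```python
import math
from typing import Dict, List, Tuple


def centroid(vectors: List[List[float]]) -> List[float]:
    """Average feature vector of a family's member clusters."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def place(query: List[float], centroids: Dict[str, List[float]]) -> Tuple[str, float]:
    """Return the nearest family and its distance. Cost grows with the number
    of families, not with the number of clusters already in the database."""
    return min(
        ((name, distance(query, point)) for name, point in centroids.items()),
        key=lambda pair: pair[1],
    )


# Invented families, each built from two member feature vectors.
members = {
    "GCF_1": [[1, 1, 0, 0], [1, 1, 0, 1]],
    "GCF_2": [[0, 0, 1, 1], [0, 0, 1, 0]],
}
centroids = {name: centroid(vectors) for name, vectors in members.items()}

new_cluster = [1, 1, 0, 1]  # feature vector of a newly identified gene cluster
print(place(new_cluster, centroids))
```

[Adding a new data point here means computing one vector and a handful of distances, which is why nothing has to be recomputed for the clusters that are already placed.]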
Dan Udwary: As a point of clarification, for people who aren’t familiar with secondary metabolism or biosynthetic gene clusters, somebody might be thinking, why don’t you just BLAST it? What are you doing? The important thing to remember about a gene cluster is that it’s a collection of genes and sometimes multiple operons. And those genes and operons can rearrange, but yet still make the same molecule, biosynthetically, by the same exact pathway even if the genes are scrambled or altered in the chromosome.
And so it becomes very difficult then to compare especially across species that might be a little bit more genetically distant where the BGCs may have drifted but they might still be making the same natural product molecule. It becomes a tricky task to really make sure that those things are doing the same thing, or at least are closely related.
Marnix Medema: Yeah. Yeah, absolutely. And for example, if you do this querying with BiG-FAM that would usually just be the start of analysis. So you would see, OK, this gene cluster is related to these 500 gene clusters. And now let’s look into detail, let’s make phylogenies of the underlying genes, let’s make alignments of those gene clusters, let’s put those 500 gene clusters with your query one into BiG-SCAPE and see how they relate in detail.
So it’s a very fast way to prioritize the set of gene clusters that are most related to yours, and then do follow-up analysis to be able to study the relationships in more detail. Especially if you want to look at the evolution of these gene clusters, for example, you really need to build phylogenies and do all those more complicated things.
Dan Udwary: That’s right. That’s right.
Alison Takemura: These tools do sound really powerful. And Dan, as someone who works in this field, why do you use Marnix’s tools?
Dan Udwary: Well, I think like we said, so antiSMASH has been under development for a while. Version 5 has been around for a bit. Version 6 is on the way, I hear? You just said. So antiSMASH is definitely the most comprehensive set of tools for identifying gene clusters. There’s nothing that is quite as good. There are a few other efforts maybe that are out there and doing slightly different approaches, but antiSMASH has become the, I don’t want to say industry standard, because it’s more academics, but it is the standard. And I think Marnix’s other tools, BiG-SCAPE and BiG-SLiCE, are all based on antiSMASH results. And so if you want to do deeper analysis on things, then you’re staying in the “BiG” branded family I think.
Marnix Medema: Yeah. It’s an ecosystem of software right?
Dan Udwary: For sure. Yeah. It’s built up that way and it’s great.
Alison Takemura: What’s an area where you’ve been using these tools recently, Dan? What kind of question have you applied them?
Dan Udwary: Well, at JGI we have the Atlas of Biosynthetic Clusters. And this is a very raw database of biosynthetic gene clusters, but with all kinds of hooks into IMG’s data. IMG being Integrated Microbial Genomes at JGI, a large data set of bacterial genomes. And so yeah, we use antiSMASH to identify the BGCs that go into ABC.
And then more recently, we use it for any kind of genome mining. Whenever I do some genome mining, antiSMASH is usually the place where I’m going to start. We talked about this when we did the primer episode, we talked about genome mining through some metagenome-assembled genomes. And yes, antiSMASH is particularly valuable for finding even fragmented pieces of biosynthetic gene clusters that we want to identify.
Marnix Medema: Yeah. There’s so much to do there especially looking at metagenomes. So we’re actually eager soon to also work on better tools to visualize and explore these gene clusters and metagenomes, especially if you want to know to which metagenome assembled genomes within a metagenome they belong and to which – what are the taxonomic assignments of those, et cetera. Instead of just getting a list of 5,000 gene clusters from a metagenome which is great because that’s a big number, but it’s very difficult to analyze.
But Alison actually, my group is a big user of our tools as well because we do not just develop tools, we also do a lot of the analysis. So we actually like to be among the first people to use these tools as well on exciting projects. And like Dan at JGI, we, for example, are really interested in metagenomes and microbiomes and figuring out which gene clusters are responsible for phenotypes that are caused by microbiomes. And what are the functions of these gene clusters? And that’s a whole fascinating area I think that is opening up right now in the field.
Alison Takemura: Do you have a story from your lab that you could share with us?
Marnix Medema: Yeah, for example, some work we published last year it was a collaboration with a group of Jos Raaijmakers at the Netherlands Institute of Ecology and Victor Carrion who was the postdoc leading that project. So they were studying this phenomenon called “disease suppression”, where the microbiome of a plant, so the bacteria living on and inside the roots are protecting the plants against pathogens. So usually fungal or oomycete pathogens.
In this case they worked on a system of sugar beets where bacteria on and inside the roots were protecting the sugar beets against Rhizoctonia solani, which is a big bad fungal pathogen. And they showed very clearly that this was the action of the microbiome because if you sterilize the soil, the protective effect is gone. You can transplant the effect by taking the microbial fraction out of one soil, transplanting it into another soil, and then you also have that same disease suppressive effect. But, well, the causal genes and the mechanisms were still largely unknown.
And thus far in the microbiome field, many people had just been looking at like 16S taxonomic profiles of these communities looking at who is in there, but never like what are they doing? So that’s why we thought, OK, we really need to look at the biosynthetic gene clusters. And then they sequenced a really nice and good assembly. They were able to assemble a really high quality metagenome of the endosphere microbial communities for all the bacteria that are living inside the roots. And from that we got around 800 biosynthetic gene clusters, many of them full length.
And then we started prioritizing, OK, which gene clusters are more highly abundant in the suppressive soils compared to the conducive soils? So which ones are enriched in the soils that are suppressive? And then also, when the pathogen is present, for which gene clusters is expression activated under that condition? And we actually used sequence similarity networking to prioritize a number of these candidates; for that we used an early version of BiG-SCAPE.
And then Victor did qPCRs to figure out a number of gene clusters that were particularly overexpressed in the presence of the pathogen. And there was one gene cluster that really stood out and it was from a Flavobacterium. And then we got help from colleagues at Wageningen University who developed a CRISPR-Cas system for that group of bacteria and that enabled a knockout of that gene cluster. And that was able to show that if the gene cluster is knocked out, the suppressive effect is gone.
So apparently that biosynthetic gene cluster is at least essential for the full suppressive phenotype that you would normally see from that community. It doesn’t mean that the product is necessarily an antifungal. It could also be perhaps eliciting a response by another bacterium, for example. But at least we know for sure that this gene cluster plays an essential role to get this phenotype.
I think that’s just a nice example of going from a whole community with many microbes, hundreds of gene clusters, and then being able to zoom into particular gene clusters that play key roles in these microbiome associated phenotypes.
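The abundance-based prioritization step Marnix describes (which gene clusters are enriched in suppressive versus conducive soils) can be illustrated with a minimal sketch. This is not the analysis from the published study: the BGC names, the abundance values, the pseudocount, and the use of a simple log2 fold change of mean abundance are all assumptions made for illustration.

```python
import math

# Hypothetical per-sample abundances (for example, normalized read coverage)
# for each BGC. Names and numbers are invented for illustration only.
abundance = {
    "bgc_001": {"suppressive": [12.0, 15.0, 14.0], "conducive": [3.0, 4.0, 2.0]},
    "bgc_002": {"suppressive": [5.0, 6.0, 5.5], "conducive": [5.0, 7.0, 6.0]},
    "bgc_003": {"suppressive": [0.5, 1.0, 0.8], "conducive": [4.0, 3.5, 5.0]},
}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def log2_fold_change(suppressive, conducive, pseudocount=1.0):
    """log2 ratio of mean abundances; the pseudocount avoids division by zero."""
    return math.log2((mean(suppressive) + pseudocount) / (mean(conducive) + pseudocount))


def rank_by_enrichment(table):
    """Return (bgc_name, score) pairs, most enriched in suppressive soil first."""
    scores = {
        name: log2_fold_change(groups["suppressive"], groups["conducive"])
        for name, groups in table.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked = rank_by_enrichment(abundance)
for name, score in ranked:
    print(f"{name}\t{score:+.2f}")
```

A real analysis would also need a statistical test and multiple-testing correction before calling a gene cluster enriched; the ranking here only shows the idea of ordering candidates for follow-up.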
Alison Takemura: That’s an amazing story. I’ll have to share that with my partner later. So thank you.
Marnix Medema: With your partner? Is he working in plants?
Alison Takemura: No, he doesn’t work in plants but that’s the kind of story where it’s just so relatable. I mean, take a sugar beet. Think about if a sugar beet might get a disease, but then its microbes might help it. And so maybe my being a communicator lends me to collect little stories and then share them at the dinner table.
Marnix Medema: Yeah. I can imagine that it is a story that I guess appeals to also a broader audience than a new algorithm, which is still more abstract.
Dan Udwary: Tell us about DECIPHER. Where is your group going with that?
Marnix Medema: Yeah. So that’s an ERC grant that we recently got awarded. And this is indeed one of the new directions that my group will be going into. And one of the key aims that’s related to what I was just talking about is elucidating functions of the gene clusters in their community context. So we want to develop new algorithms that are able to predict functions of these biosynthetic gene clusters in a microbiome, for example.
Dan Udwary: That’s a big reach, function. How do you start to tackle that?
Marnix Medema: Yeah. It is very challenging and these grants actually want you to do high risk, high gain research. So if it’s not high risk, they wouldn’t give you the money. So we wanted to go for something high risk and I think this is something that if it works even to some extent I think it would be really useful.
And I really think this needs an integration of many different types of data, including expression data, use of metatranscriptomics, where you measure gene expression levels of multiple microbes within a community, see how the expression of gene clusters is related to each other mutually, sometimes correlated or anti-correlated, also with other cellular functions in the same as well as other bacteria in the community. So that can tell you something about the role that the expression of these gene clusters plays in these interactions.
But also things like trying to predict at least parts of the structures of the molecules whose biosynthesis is encoded by these gene clusters and figuring out what the structure activity relationships are and using those structural features to see if you can use that to predict activities, and the mechanisms of action, perhaps, even.
I think that relates to a whole different field like computational drug discovery field, which has been quite distinct from the natural product drug discovery field. But this is an area where there has been a lot of work on those structure activity relationships using machine learning and artificial intelligence. I think there is a lot of things that we can learn from that scientific community, and of course, hopefully they can learn something from us as well. But I think cross pollination between the omic space/natural product field and the computational drug discovery field, I think that has a lot of potential for the future.
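The idea of checking whether the expression of gene clusters is correlated or anti-correlated across samples can be sketched as follows. This is only an illustration of the concept, not a method from Marnix's group: the gene cluster names and expression values are invented, and a plain Pearson correlation is assumed as the measure.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical expression levels of three gene clusters across five
# metatranscriptome samples. Values are invented for illustration only.
expression = {
    "bgc_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "bgc_B": [2.1, 3.9, 6.2, 8.1, 9.8],
    "bgc_C": [5.0, 4.1, 2.9, 2.2, 1.0],
}

names = sorted(expression)
pairs = {}
for i, first in enumerate(names):
    for second in names[i + 1:]:
        pairs[(first, second)] = pearson(expression[first], expression[second])

for (first, second), r in pairs.items():
    label = "correlated" if r > 0 else "anti-correlated"
    print(f"{first} vs {second}: r = {r:.3f} ({label})")
```

With real metatranscriptomic data, sample counts, normalization, and compositional effects all matter, so a correlation like this would be a starting point for hypotheses about function rather than evidence on its own.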
Dan Udwary: But it sounds like you’ve answered the usual question we ask is, where do you think your lab is going over the next say five or 10 years? Sounds like function is a big part of that. Are there other directions you want to go?
Marnix Medema: Yeah. Function is indeed a big thing. And I think another key component is data integration with genomics with metabolomics and with transcriptomics because I think we can learn much more not only about function, but also about structure and about ecology of gene clusters if we’re able to better integrate these kinds of data. And actually also for that we need also new standards and new platforms.
So we’ve been recently working together with my colleague Justin van der Hooft and together with Peter Dorrestein in San Diego on a new platform for paired omics data, where we’re thus far documenting links between genome and metabolome data, and biosynthetic gene clusters, and tandem mass spectra. So that you would actually document somewhere which metabolome data, for example, in the GNPS framework in San Diego belong to the same samples as genomes that are in GenBank and which gene clusters that are in let’s say MIBiG, for example, belong to the same molecules as which mass spectra from mass spectrometry data in GNPS.
This will also be, again, foundational for the future to be able to build new tools and to be able to leverage those data in a better way. Because that really allows you to connect those heterogeneous types of data in a systematic way, where you can take large scale mass spec data and large scale genome data and start to put them together to learn new things.
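The two kinds of links Marnix describes (metabolome data sets tied to genomes from the same sample, and gene clusters tied to mass spectra of the same molecule) could be recorded in a structure like the one sketched below. The class names, field names, and identifiers are hypothetical placeholders; this is not the schema of the actual paired omics platform.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SampleLink:
    """A genome and a metabolomics data set that come from the same sample."""
    genome_id: str       # placeholder for a genome accession
    metabolome_id: str   # placeholder for a metabolomics data set identifier


@dataclass(frozen=True)
class ClusterSpectrumLink:
    """A gene cluster and a mass spectrum that belong to the same molecule."""
    bgc_id: str          # placeholder for a gene cluster identifier
    spectrum_id: str     # placeholder for a spectrum identifier


@dataclass
class PairedRecord:
    """A hypothetical container for both kinds of documented links."""
    sample_links: list = field(default_factory=list)
    cluster_links: list = field(default_factory=list)

    def metabolomes_for_genome(self, genome_id):
        return [link.metabolome_id for link in self.sample_links
                if link.genome_id == genome_id]

    def spectra_for_bgc(self, bgc_id):
        return [link.spectrum_id for link in self.cluster_links
                if link.bgc_id == bgc_id]


record = PairedRecord()
record.sample_links.append(SampleLink("GENOME_PLACEHOLDER_1", "METABOLOME_PLACEHOLDER_1"))
record.cluster_links.append(ClusterSpectrumLink("BGC_PLACEHOLDER_1", "SPECTRUM_PLACEHOLDER_1"))
record.cluster_links.append(ClusterSpectrumLink("BGC_PLACEHOLDER_1", "SPECTRUM_PLACEHOLDER_2"))

print(record.metabolomes_for_genome("GENOME_PLACEHOLDER_1"))
print(record.spectra_for_bgc("BGC_PLACEHOLDER_1"))
```

Keeping the two link types separate mirrors the distinction in the conversation: one is about which data sets describe the same sample, the other about which cluster and which spectrum describe the same molecule.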
Dan Udwary: So saying that all those pieces need to come together and integrate is one thing, but how do we actually go about starting to do that? We talked to Roger Linington about this as well in the context of NPAtlas in his data there and the big community effort it took to build that. How do we get all of these different big community efforts to really come together?
Marnix Medema: Right. Yeah. Yes, so usually it starts with a few people who share an idea, and then you extend with kind of a seed community. Maybe 10 or 20 people who are enthusiastic and then you start brainstorming to improve the idea and to get the ball rolling. And in the end when you have a decent critical mass, then you really just by email or meeting people at conferences you try to approach everyone and we try to be as inclusive as possible. And then everyone who is willing and able to join, can join the community effort. Then we can do something together.
Dan Udwary: So stop planning and just do it.
Marnix Medema: Yeah, well it requires quite a bit of planning in the sense that you need to have something to show to the community before you can convince them of course. So that needs, also, this seed phase where you really need to get the input already from a smaller group of experts who say, OK, I also think this is a good idea, I support it, and you get critical feedback to get something off the ground that you can show to the rest of the community. Like OK, we have something, would you want to be part of it? If you approach it like we have nothing, would you want to be part of it? They think, OK, that’s going to be a hell of a lot of work. I’m not doing it. If you have something to show and it already shows concrete potential and there is already a group of experts – a little bit larger group of experts – behind it, then people are very happy to join.
Alison Takemura: And what are people signing up for when they do say they want to be part of it?
Marnix Medema: Yeah, we try to be as transparent as possible when we initiate something like that. So in this case, it means providing feedback on the platform, on how it’s set up, on the metadata that is recorded making sure that we’re not missing anything important. And actually supplying and submitting the data, helping out with making and documenting those links. And sometimes people do much more than you require of them. I mean, some people went as far as saying, OK, we’re going to take 300 strains from our labs and measure them on their mass spec because we think it would be great to add that to the database. And they just go and do it. And then you say, wow, this is fantastic.
Alison Takemura: Yeah. That is amazing. All that enthusiasm, it’s so infectious.
Marnix Medema: Yeah. But I do think this is, at least personally, I find this the most rewarding way of doing science collaborating as a community. Maybe from an industrial point of view it might be judged upon as a little bit naive, but in the end we’re using taxpayers’ money for the public good and I think collaborating is the best way to achieve something for the public good, instead of trying to compete with each other or trying to just make lots of separate small things that are all competing for the attention of the community.
Dan Udwary: Well, that brings us full circle back to my introduction. So maybe that’s a good place to wrap it up. Marnix, it’s been great talking to you. Thanks so much for joining us and looking forward to seeing all the great stuff that comes out of your lab in the future.
Marnix Medema: Thanks so much, Dan. It was a pleasure.
Dan Udwary: I’m Dan Udwary, and you’ve been listening to Natural Prodcast, the podcast produced by the US Department of Energy Joint Genome Institute, a DOE Office of Science user facility located at Lawrence Berkeley National Lab. You can find links to transcripts, more information on this episode, and our other episodes at naturalprodcast.com. Special thanks as always to my co-host, Alison Takemura. If you like Alison and you want to hear more science from her, check out her podcast Genome Insider. She talks to lots of great scientists outside of secondary metabolism. And if you like what we’re doing here, you’ll probably enjoy Genome Insider too. So check it out.
My intro and outro music are by Jazzahr. Please help spread the word by leaving a review of Natural Prodcast on Apple Podcasts, Google, Spotify, or wherever you got the podcast. If you have a question or want to give us feedback, tweet us @jgi or to me @danudwary. That’s D-A-N U-D-W-A-R-Y.
If you want to record and send us a question that we might play on air, email us at [email protected]v. That’s [email protected]. And because we’re a user facility, if you’re interested in partnering with us, we want to hear from you. We have projects in genome sequencing, DNA synthesis, transcriptomics, metabolomics, and natural products in plants, fungi, and microorganisms. If you want to collaborate, let us know. Find out more at jgi.doe.gov/user-programs. Thanks and see you next time!