Background Most bacterial genomes contain integrated bacteriophages—prophages—in various states of decay. Many are active and able to excise from the genome and replicate, while others are cryptic prophages, remnants of their former selves. Over the last two decades, many computational tools have been developed to identify the prophage components of bacterial genomes, and it is a particularly active area for the application of machine learning approaches. However, progress is hindered and comparisons thwarted because there are no manually curated bacterial genomes that can be used to test new prophage prediction algorithms. Methods We present a library of gold-standard bacterial genome annotations that include manually curated prophage annotations, and a computational framework to compare the predictions from different algorithms. We use this suite to compare all extant stand-alone prophage prediction algorithms to identify their strengths and weaknesses. We provide a FAIR dataset for prophage identification, and demonstrate the accuracy, precision, recall, and f 1 score from the analysis of seven different algorithms for the prediction of prophages. Results We identified different strengths and weaknesses between the prophage prediction tools. Several tools exhibit exceptional f 1 scores, while others have better recall at the expense of more false positives. The tools vary greatly in runtime performance with few exhibiting all desirable qualities for large-scale analyses. Conclusions Our library of gold-standard prophage annotations and benchmarking framework provide a valuable resource for exploring strengths and weaknesses of current and future prophage annotation tools. We discuss caveats and concerns in this analysis, how those concerns may be mitigated, and avenues for future improvements. This framework will help developers identify opportunities for improvement and test updates. It will also help users in determining the tools that are best suited for their analysis.