"Due to the large amount of oral contributions some reassignments became unavoidable. Therefore we ask for your kind understanding that your contribution "Challenges of curating approved drugs: will the correct structures please stand up?" (Reference code: 5179-12120) cannot be accepted as a lecture." GDCh congress team
Challenges of curating approved drugs: will the correct structures please stand up?
The molecular structures of approved human medicines represent the crown jewels from approximately half a century of global drug R&D. Paradoxically, however, there is neither a consensus “gold standard” set of structural representations nor even agreement on the total count. The cheminformatic problem was highlighted in a 2009 comparison of database subsets of approved drugs that recorded only 807 exact structures in common (PMID 20298516). In this work we have explored analogous intersects that can now be generated within PubChem. For example, selecting DrugBank “approved” maps 1533 substance submissions (SIDs) to 1504 compound identifiers (CIDs). Performing CID intersects with other sources expected to capture the same drugs shows a stepwise drop-off. Starting with ChEMBL produces 1358 matches in common; adding the FDA Substance Registration System drops this to 1028, the Therapeutic Target Database to 808 and “INN or USAN” to 754. Thus adding in each source reduces the consensus by ~10% and the final 5-way intersect is only ~50% of what we might expect. We have generated other metrics to dissect out some of the contributing factors. For example, each of the 754 consensus CIDs had, on average, 93 submitters. This popularity for drugs is unsurprising but it inevitably introduces representational noise. We explored this problem via the PubChem “same connectivity” operator. This established that each of the 754 drugs has, on average, 59 variants (i.e. different CIDs). We dissected this “structural multiplexing” for the representative case of Taxol/paclitaxel. All five sources had chosen CID 36314 (i.e. it was in the 754), as did 194 other sources. However, it is related to 135 “same-connectivity” CIDs from no fewer than 694 SIDs (although some were split mixtures). Further results will be shown that indicate the major causes of multiplexing.
These include alternative salt forms, stereo enumerations and E/Z resolution, as well as virtually deuterated drugs from patents. Our results should not be taken as a criticism of valuable sources that curate drugs. However, the discordance between those examined above (and others inside or outside PubChem) highlights the challenges of database representation. The problems apply not just to approved medicines but to all pharmacologically active chemical structures. We have also observed an increase in drug structure multiplexing in PubChem over the last couple of years that is at least partly due to expanding patent extraction and vendor submissions. This has led to the IUPHAR/BPS Guide to PHARMACOLOGY checking CID consensus sets and, where appropriate, adding cross-pointers to alternative structures as part of our drug curation process (PMID 24234439). The continued need for this strategy indicates that definitive lists will remain elusive until a) there is more collective engagement on standardisation and b) pharmaceutical companies move beyond just paying lip service to transparency by provenancing all their clinically tested drug structures in public databases.
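The stepwise intersect described in this abstract can be sketched in a few lines of Python, for readers who want to check the arithmetic themselves. This is only an illustrative sketch, not the workflow used in the study: the intersects were generated inside PubChem, and the function names `stepwise_retention` and `cumulative_intersect` are hypothetical, as is the toy CID data. The counts are those quoted in the abstract, with the final 5-way count taken as 754, the figure used for the consensus set.

```python
# Counts quoted in the abstract for successive PubChem CID intersects.
# Assumption: the final 5-way count is 754 (the consensus set size).
REPORTED = [
    ("DrugBank approved", 1504),
    ("+ ChEMBL", 1358),
    ("+ FDA Substance Registration System", 1028),
    ("+ Therapeutic Target Database", 808),
    ("+ INN or USAN", 754),
]


def stepwise_retention(counts):
    """For each step return (label, count, fraction kept relative to the
    previous step, fraction kept relative to the first step)."""
    first = counts[0][1]
    previous = first
    rows = []
    for label, n in counts:
        rows.append((label, n, n / previous, n / first))
        previous = n
    return rows


def cumulative_intersect(cid_sets):
    """Sizes of the running intersection over an ordered list of CID sets.

    Each element stands for the set of CIDs one source maps to; real sets
    would have to be exported from PubChem first (not shown here)."""
    running = set(cid_sets[0])
    sizes = [len(running)]
    for cids in cid_sets[1:]:
        running &= set(cids)
        sizes.append(len(running))
    return sizes


if __name__ == "__main__":
    for label, n, vs_previous, vs_first in stepwise_retention(REPORTED):
        print(f"{label:40s} {n:5d}  kept vs previous: {vs_previous:.1%}"
              f"  kept vs start: {vs_first:.1%}")

    # Toy example with made-up CIDs, only to show the set logic.
    print(cumulative_intersect([{1, 2, 3, 4}, {2, 3, 4}, {3, 4, 5}]))
```

Run on the reported counts, the last row gives 754/1504, i.e. the ~50% overall retention quoted above. Applied to real per-source CID lists, `cumulative_intersect` would reproduce the same kind of drop-off table, though the result depends on the order in which sources are added only at the intermediate steps, not in the final intersect.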