Watermarking molecules

This post was chosen as an Editor's Selection for ResearchBlogging.orgI’ve posted twice about Anonymous hacking into Stratfor — and, more generally, their hacktivism has been making bigger and bigger waves.  CNN recently ran a fairly positive story on the support hacktivists are providing the Occupy movement.  Many of these hacktivists are quite active on Twitter and elsewhere.  However, from the perspective of both international and corporate espionage, the “quiet” hacks are the worst: someone makes off with information and the victim never knows.  As security expert Kevin Mandia told the New York Times:

The hacks that do the most damage don’t have Twitter feeds.

Another security expert, Jeremy Falkenrath, in an interview on Bloomberg News (at about 7:00 into the video), discussed, quite matter-of-factly, the hacker-for-hire market that companies in the chemical industry deploy against one another to learn trade secrets.  With this as the backdrop, I’d like to discuss one of the main open questions of cheminformatics: Is secure encryption of molecules possible?  For example, it would be nice if a company could encrypt a molecule, but then allow some third party to run in silico tests with it, having access to the molecule’s properties but not the structure itself.

Encryption of molecules

Part of the reason for the traditional closed-data policies of pharmaceutical companies is the total absence of any way to encrypt chemical structural information.  This has been recognized as an open problem for many years, the American Chemical Society held a special meeting in 2005 about it, a summary of which appeared in Nature.  While there were presenters at that meeting who felt molecular encryption was possible, and others who felt it was impossible, the practical reality as we enter 2012 is that, so far, the voices in favor of “impossible” have been correct.  Almost no new theoretical literature has been produced since 2005, and the industry appears no nearer a practical solution than it was in, say, 1975.

I recently had an idea to expand upon a proposal by Eggers et al. in 2001, to watermark in silico representations of molecules.  My idea, however, is going nowhere — just like all other attempts so far to implement chemical watermarking.  At least I can get a blog entry out of my failure though!  I hope readers of this page find my little attempt entertaining or informative.

Acknowledgement: The material in this post is based on conversations I have had with cheminformaticians Rajarshi Guha and Jörg Kurt Wegner.

Watermarking molecules

A digital watermark, of course, is a special code embedded into a file so that its provenance can be determined.  The file has to be large enough so that perturbations to individual data points are not feasibly discoverable to a would-be forger, even though there are enough data points to produce a unique fingerprint for that file, at least with high probability.  In 2001, Eggers et al. published Digital Watermarking of Chemical Structure Sets, which proposed the use of SCS watermarking on databases of molecules, so that the company that had created or owned the molecules would be able to prove that another company’s claim to independent discovery was, in fact, infringement.

A single molecule is not large enough to embed an indiscernible watermark, which is why Eggers et al. considered “structure sets” — a large list of molecules that are somehow related.  Technically, each molecule is considered as a graph embedded into three-dimensional space.  (In fact, this graph may be the “reduced graph,” or a graph of the molecule with all hydrogen atoms removed, because reduced graphs performed better on one empirical test in their study.)  I will ignore a lot of technical details here, but the core idea behind their watermarking process is to perturb the 3D coordinate of specially-chosen atoms by a small, “random-appearing” amount.  If enough locations of atoms in the structure set are modified this way, it produces a digital watermark.

More interesting to me than the watermarking method was the list of potential attacks Eggers et al. considered.  If the attacker can do anything, it is famously difficult to encrypt data while still providing functionality.  However, we are working in the chemistry domain.  On the one hand, this adds technical obstacles (due to “real-world messiness”), but, on the other hand, it may limit the attacks that can be performed on the system, since the attackers need to obtain a structure that is chemically useful, not just some fragment of plaintext.  Eggers et al. tried to provide a watermarking method resilient to the following attacks.

  • Removal of data from the original dataset.
  • Injection of non-watermarked structures into the dataset.
  • Reordering of individual records in the dataset.
  • Reordering of atoms and bonds in the structure records.
  • Global 3D transforms, e.g., rotations or translations.
  • Changes of structural notation conventions.
  • Removal of hydrogen atoms from structures.

My new idea and why it fails

I will cut to the chase and explain why my idea doesn’t work — and why, perhaps, no watermarking concept is feasible in the chemical domain.  Then I will conclude, anticlimactically, by describing my idea, in case someone can get something out of it.

Consider the domain of music piracy.  The attacker can hear the music, and perhaps even has access to guitar tabs or a symphony score.  However, it would not be cost-benefit for the attacker to hire a brand-new orchestra to re-record the original piece of music — and, even if the attacker did that, the new track would not sound exactly the same.  By contrast, in chemistry, every carbon atom is identical to every other carbon atom.  So in the case of a chemikcal structure set, the attacker could choose a molecule(s) of interest, and then just rebuild the structure from scratch, and use the 3D coordinates (and all other information) from the brand new molecule.  Presto, no watermark.

Eggers et al. consider this attack, sort of.  They say:

We assume that no unlicensed copies of the software used to generate the original protected dataset are provided in circulation.  Furthermore, the computation time for large datasets in often significant.  Depending on the type of algorithm used and the size of the dataset, it can be up to several CPU months.  Thus, simple regeneration of the data is often not a feasible approach.

While the claim about significant recomputation time may have been true in 2001, it no longer appears to be.  As one chemist said to me, recomputation of a molecular structure “wouldn’t be hard at all.”  This leaves the only defense of a watermark the unavailability of licensed software to perform the recomputation.  That, however, relies on an attacker who is willing to perform industrial espionage but who draws the line at software piracy.  So, no go.

Now we get to my idea.  In the field of image processing there has been a lot of work since 2001 on 3D watermarking.  One survey is 3D Watermarking: Techniques and Directions, by Koz et al., which considers strengths and weaknesses of many different watermarking techniques.  One method that caught my eye was Robust Image Watermarking Based on Generalized Radon Transformations, by Simitopolous et al., because it is specifically designed to be resistant to “geometric attacks” like rotation, scaling, translaton and cropping, which closely follow the list of anticipated attacks in the Eggers et al. paper.  Applying this to chemistry might provide a much better watermarking tool than the more generic watermarking method chosen by Eggers et al.

However, I’ve decided not to consider this further, because it seems that the most I could get out of it would be a paper that would be theoretically “cute,” with no chance of being practical, ever.  I would love to be proven wrong, so if you can think of a way to get around the “rebuild the molecule from scratch” attack, please do comment.  In the meantime, I am on to other things.

ResearchBlogging.org Joachim J. Eggers, W.D. Ihlenfeldt, & Bernd Girod (2001). Digital Watermarking of Chemical Structure Sets Information Hiding, 200-214 DOI: 10.1007/3-540-45496-9_15

About these ads

4 responses to “Watermarking molecules

  1. That is a good point about current computer power and rebuilding the molecule.

    I am by no means an expert in cheminformatics or familiar with the industry dynamics. But from what you described here, doesn’t it seem like a method of encryption is much more applicable than a simple watermark on a standardized data format?

    I suppose my question here is what is the advantage of watermarking over an encrypted file and an encryption key? When working with the file you would decrypt it real-time with the key. Time for key-decryption is negligible while trying to crack it would take astronomically more power. Software piracy would not matter in the latter situation either.

    • Thanks for the question. One goal would be for a company to share chemical information with a partner or subcontractor, without leaving itself open to vulnerability. So encryption where you can’t see any molecular feature isn’t what we want. Rather, we want a “molecule in a locked box” somehow, so other entities can use the molecule for certain tests or whatever, but their privileges are limited.

  2. Taking a brief look at the paper and your description I get the feeling any such scheme would be susceptible to the following attack:

    Take the given database of structures and run a standard energy minimization algorithm starting with the structures in the database. Unlike attempts to solve for molecular structure by brute force this can be done fairly cheaply since we only look for the nearest local minimum and don’t bother trying to tunnel across higher energy solutions to find a lower absolute minimum.

    Since the molecular structure database, in order to provide correct results to it’s users, must provide structures very similar to those taken by the molecules in the real world and such real world structures always occupy a local energy minimum the results of our computations surely compute the structures at least as well as the original dataset.

    Moreover, since local energy minimization algorithms should be independent of initial state as long as that state is close to the computed energy minimum our output data would necessarily be indistinguishable from a run of the same program starting on any other nearly correct database of molecular structures.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s