# Monthly Archives: February 2011

## Treewidth of molecular graphs

In 2003, Yamaguchi et al. tested the treewidth of 9712 molecules, and found that 19.4% had treewidth one, 75.7% had treewidth two, 4.9% had treewidth three, and only one molecule had treewidth four.  They also tested the local tree-width; you can see those results in Table 2 of their paper here.  Their conclusion: “tree-width of compounds is not only a useful complexity measure, but it is also an appropriate factor to consider in characterizing chemical compounds.”

Unfortunately, it is not clear whether that conclusion is correct.  Rajarshi Guha recently combined an existing treewidth package with the Chemistry Development Kit (CDK), an open-source chemistry suite.  He tested 10,000 molecules at random, and compared treewidth to other molecular descriptors.  His findings were consistent with the findings from 2003: almost all molecules had treewidth 1, 2 or 3; a small fraction had treewidth 4.  However, treewidth appears “degenerate” from a chemistry point of view, as it does not correlate with molecular size or other features of interest.  His conclusion: “I don’t think treewidth is a useful descriptor for modeling purposes.”  Chemist Joerg Kurt Wegner has followed up on  a Friendfeed discussion, suggesting other ways treewidth might be used.  For now, though, it appears to be a tool in search of a job.

Much more explanation, graphs and links can be found in Rajarshi’s blog entry on this subject; the background to his blog entry is in FriendFeed conversations here and here.

Atsuko Yamaguchi, Kiyoko F. Aoki, & Hiroshi Mamitsuka (2003). Graph Complexity of Chemical Compounds
in Biological Pathways Genome Informatics, 14, 376-377

## Determining the minimum polynomial that bounds a PTIME Turing Machine is undecidable

This blog post is motivated by a question John Sidles recently asked on the Theoretical Computer Science Q&A site.  Emanuele Viola answered it (providing a very clever proof), and Raphael (I don’t know his last name) rephrased that answer, hoping the proof would be a bit clearer.  I think things might be made clearer still, so this is my attempt to express Emanuele’s fine idea in my own words.

Suppose we know for sure that a Turing machine runs in polynomial time.  The question is: can we say more than that?  Is there some algorithm to determine the minimum $k$ so that the Turing machine runs in $\mathcal{O}(n^k)$?  Unfortunately, the answer is NO.

Proposition: Let $M \in \mathsf{PTIME}$ be a Turing machine.  There is no algorithm that determines the minimum $k$ such that $M \in \mathcal{O}(n^k)$.

Proof: We will show that if we can solve this problem, we can solve the Halting Problem.  Let $N$ be an arbitrary Turing machine, and $x$ an arbitrary natural number (intuitively, $x$ is an input to $N$).  Now define the Turing machine $M$ as follows:

1. Let $n$ be the length (i.e., number of bits) of the input to $M$.
2. Simulate $N$ running with input $x$ for $n$ steps.
3. If $N$ accepts during step 2, loop for $n^2$ steps and then halt.
4. Otherwise, loop for $n^3$ steps and then halt.

Note that the value of $x$ is independent of the choice of input to $M$.  So $M \in \mathcal{O}(n^2)$ iff $N$ halts on $x$, and $M \in \mathcal{O}(n^3)$ iff $N$ does not halt on $x$.  We know that $M$ is in $\mathsf{PTIME}$.  (In fact, we know something stronger: it is either in $\mathcal{O}(n^2)$ or $\mathcal{O}(n^3)$.)  If we could determine the minimum exponent $k$ (for simplicity, either 2 or 3) such that $n^k$ upper-bounds the running time of $M$, we would be able to decide whether an arbitrary Turing machine accepts an arbitrary input — the Halting Problem.  So no algorithm can exist that will provide us the value of $k$. $\Box$

I really like this argument.  It’s almost poetic: nothing unnecessary included.

Presentation slide proposing methods to discredit and destroy Wikileaks. Obtained from leaked HBGary emails. Source: ArsTechnica

Looks pretty bad, doesn’t it?  Well, it’s worse.  Not only did the security firm HBGary prepare a package of dirty tricks against Wikileaks, hoping to get paid by Bank of America’s law firm to put them into action, but they also constructed a similar package to use against labor unions, hoping to drum up business from the US Chamber of Commerce.  As I write this, there is no confirmation that either B of A or the US C of C actually paid for these services to be rendered, but the authenticity of the leaked emails does not appear to be in doubt.  The CEO of Palantir, whose company logo appears on the slide I linked, has apologized at least twice, severed all ties with HBGary, and placed on leave the engineer who developed this slide.

The best coverage I have seen of this sordid affair is at Ars Technica.  Many commenters at Ars have stated that Nate Anderson should win a Pulitzer Prize for his coverage of the hack that led to the leaked emails, and the ongoing aftermath.  That’s not hyperbole: this article by Anderson is the most riveting tech news story I have read in years, maybe ever.

A couple months ago, Richard Lipton proposed a method to stop Wikileaks.  Essentially, the method boiled down to this: for every potentially compromising document generated, automatically generate a set of documents that look like it, but are different somehow — statements are contradicted but otherwise identical, numerical values are inflated or deflated, etc.  Then, if the “real” documents are leaked, ensure all the shadow documents are leaked as well, so nobody knows what to believe.  From the Palantir slide’s first bullet point, it appears that practice is keeping abreast of theory, or perhaps leaping ahead: “Create messages around actions to sabotage or discredit the opposing organization.  Submit fake documents and then call out the error.”

More than 70,000 leaked emails from HBGary, HBGary Federal and rootkit.com are available for download and search on at least five mirror sites worldwide.  They got there because Aaron Barr, CEO of HBGary Federal, went to the media with the (incorrect) claim that he had uncovered the identities of key members of the hacker group Anonymous.  In response, Anonymous entered his computers, erased gigabytes of research data, downloaded and decrypted his hashed password database, remotely wiped his iPad, seized control of his LinkedIn profile and Twitter account — and, oh yes, posted 70,000+ emails that tell a story of three companies that specialized in dirty cybertricks, which is why the fallout from this story will be studied for months, or longer.

I took a graduate class in cryptography.  I even did well.  I had heard of salting passwords and dictionary attacks before this, but I didn’t really understand them.  I had an intellectual grasp of them yes, but I’m talking now about the type of understanding that grabs your solar plexus, squeezes and won’t let go until you’ve really really got it.  I believe this Ars Technica article by Peter Bright should be required reading in every cryptography class, and in every CS class when computer security is discussed.  Normally I would also link to Wikipedia articles on “salting passwords,” for example, but not this time, I won’t.  Bright does a superb job of making you feeeeeel how important it is to defend yourself against a dictionary attack, and I don’t want anything to get in the way of that.  Bottom line: security professionals protected themselves like amateurs, and found their defenses easily compromised once their CEO went out of his way to provoke a hacker collective known for its willingness to attack.

It would not surprise me if Lipton’s idea gained traction, very soon.  If nothing else, it would make searching through the email database far more difficult, because naive search algorithms would generate lots of false positives.  It might be worth turning the question around, to ask something I don’t know how to answer: Is there an algorithmic method to separate real documents from shadow documents, assuming they are uploaded together in the same torrent?

## Computing with billions of cores

Supercomputer performance and projection. Source: top500.org

On November 11, 2010, the Chinese supercomputer Tianhe-1A earned the title of “fastest computer in the world,” when it was clocked performing operations at the rate of 2.57 petaflops.  Two of the top five computers are in China; two are in the USA; one is in Japan.  Two weeks ago, IBM announced plans to build a machine that will surpass them all: ten petaflops, with 750,000 cores.  IBM projects that by 2020, they will be able to deliver machines with millions of cores.

FLOPS stands for “floating point operations per second,” so one gigaflop is one billion floating point operations per second.  A teraflop is is one trillion flops, and a petaflop is one quadrillion ($10^{15}$) flops.  A floating point operation is an operation like addition or division of two numbers with fractional components.  The processors of PCs are usually measured differently: MIPS (millions of integer operations per second), or cycles per second, e.g. gigahertz.  The relation between a processor’s cycle speed and its flops can be complex, because it depends on pipelining of data and other factors, so some people consider flops to be a better measure of how fast a computer really is when it is trying to solve problems.

The graphical processing units (GPUs) of desktop PCs are pretty frikkin fast, too.  The AMD Firestream 9270 claims to perform over one teraflop “in raw single precision performance.”  So a high-end home gaming system in 2011 can run about as fast as the world’s fastest computer in 1998.  Perhaps more significantly, my Droid Incredible smartphone houses a Qualcomm Snapdragon processor that runs at 1 Ghz, and Qualcomm plans to release a 2.5 Ghz quad-core chip for use in phones by 2012.  This is critical to our current discussion, because there are now more than five billion cell phones in the world — to say nothing of other mobile devices, like RFID chips — and projects distributed across many computers already run faster than Tianhe-1A.

As a researcher in biomolecular computation, I think about things like, “Suppose we had five hundred million cores, but they could only talk to one another slowly, via diffusion.  What problems might we solve efficiently that we cannot solve now?”  (Erik Winfree in this talk at ASPLOS 2008 made the point that a submarine has $10^6$ parts and $10^{10}$ transistors, while a whale has $10^{17}$ cells and $10^{27}$ proteins.)  Only recently, though, I have come to realize that my question doesn’t just apply at a cellular scale: it applies to the very near future of macro-scale computing as well.

Experts in pervasive computing have understood this for years, I am sure, and I imagine I am unfashionably late to the party.  Still, I wonder how many computer scientists — students, teachers, or practitioners — understand this on a gut level.  According to Wikipedia, the distributed computing project Folding @ Home sustained 6.2 petaflops in April 2010, several months before the highest-performance computers in the world, um, couldn’t even break past three petaflops 😉 (A participant in Folding @ Home loads software onto his or her computer(s) that runs a program in the background (e.g., when the computer is idle) that determines a protein’s structure when it folds up.  There is no known fast algorithm for this — indeed, folding is an NP-complete problem — but the problem parallelizes very well.  Hand each user a different molecule, and everyone is good to go.)

In stark contrast to, say, the Turing machine, which was conceptualized many years before the first modern computer was actually built, current computer science theory appears to be struggling to articulate an understanding of this new reality, even as the high-performance engineers and managers of distributed projects are charging ahead.  In 2009, the Workshop on Theory and Many-Cores went so far as to say:

This low level of activity should be a concern to the theory community, for it is not clear, for example, what validity the theory of algorithms will have if the main model of computation supported by the vendors is allowed to evolve away from any studied by the theory.

This problem seems more like an institutional one than a mathematical one to me.  Current university faculty worldwide were educated with a one-processor world view, and it is not as though all interesting problems there have been solved.  So why not work on interesting problems that are already accessible, instead of trying to think a brand new way and risk that the new problems might be less interesting, or still inaccessible?  Still, the sheer computing power available within a few years by, for example, running an app in the background on every cell phone in China, is a prospect I find both intoxicating and otherworldly.  Whoever answers the question I cannot will literally predict the future.