Random Chip Errors Can Improve Performance

Mistakes in silicon chips could help computers continue to get more powerful, according to researchers in the UK and US.

What’s so great about perfection? Doesn’t making a little mistake every now and then make us all a bit more human, more tolerable?

Maybe so, but until recently, mistakes were not something that chipmakers would tolerate. Now, an article by Mark Ward at BBC News, “Mistakes in silicon chips to help boost computer power,” says researchers believe it could be a good thing for manufacturers to produce chips that are allowed to make a few errors.

This heretical thinking came about because “as components shrink, chip makers struggle to get more performance out of them while meeting power needs. Research suggests [that] relaxing the rules governing how they work and when they work correctly could mean they use less power but get a performance boost,” although special software would be needed to cope with such error-laden chips.

Smaller chips less reliable

One researcher, Professor Asen Asenov from the Department of Electronics and Electrical Engineering at the University of Glasgow, notes that this unreliability increases as components shrink. As the article explains, he “has been using large-scale simulations on grid computers to study how the behaviour of transistors changes as they get smaller.”

Another researcher, Professor Rakesh Kumar at the University of Illinois, says that the tiny components in chips are already starting to give rise to errors. Rather than trying to eliminate those errors, he argues, designers should embrace them and build so-called “stochastic processors” that are subject to random errors. “The hardware is already stochastic, so why continue pretending it’s flawless? Why put in more and more money to make it look flawless?”
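To see what programming for such a chip might feel like, here is a minimal Python sketch, my own illustration rather than anything from the article: an addition that, with a small probability, flips one random bit of its result, much as a stochastic processor would occasionally return a wrong answer. The name stochastic_add and the one-percent rate per operation are assumptions chosen for the demo.

```python
import random

ERROR_RATE = 0.01  # assumed chance that a single addition goes wrong

def stochastic_add(a, b, error_rate=ERROR_RATE, width=32):
    """Add two integers, but occasionally flip one random bit of the
    result: a toy model of hardware that trades correctness for power."""
    result = (a + b) & ((1 << width) - 1)
    if random.random() < error_rate:
        result ^= 1 << random.randrange(width)  # flip one random bit
    return result

# Most answers are exact; roughly one in a hundred is off,
# sometimes by 1, sometimes by millions, depending on the bit flipped.
wrong = sum(stochastic_add(3, 4) != 7 for _ in range(10_000))
print(f"{wrong} of 10,000 additions were wrong")
```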

The article points out that it all depends on how many errors designers are prepared to tolerate. A one percent error rate, for example, can cut power consumption by 23 percent. But there’s good news and bad news: the good news is that, in most cases, the errors will have little impact on the performance of the computer into which the chips are built; the bad news is that, “in other cases, they could cause a system to crash.”
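Why the same physical fault can be harmless in one place and fatal in another is easy to demonstrate. In this sketch (again my own illustration, not the article’s), a single-bit flip that is invisible in a pixel value becomes a crash when it lands in a value used to address memory:

```python
pixels = [128] * 1000  # mid-grey image data

# A bit flip in the data itself: 128 becomes 129, invisible to the eye.
pixels[0] ^= 1

# The same flip in an index used to address that data: potentially fatal.
index = 500
index ^= 1 << 20  # index is now 1049076, far beyond the list
try:
    pixels[index]
except IndexError:
    print("an error in control data, not pixel data, broke the access")
```

Python raises a tidy exception here; on bare hardware there is no such safety net, and a corrupted address can mean a crash or silent memory corruption.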

Professor Kumar and others are looking into ways to make applications more tolerant of mistakes. This “robustification” of software, as he calls it, “involves re-writing it so an error simply causes the execution of instructions to take longer.” Kumar thinks this is something that programmers and designers should be doing now anyway: “The work on applications and programs may be more immediately useful [since] it can be applied to existing applications. This should make them cope with bugs that are showing up now and prepare them for use with future processors.”
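The article does not spell out how this re-writing works, but one common pattern that fits Kumar’s description, where an error costs only time, is to pair an unreliable computation with a cheap check and retry until the check passes. The helper names below (checked_run, verify) are hypothetical, invented for this sketch:

```python
import random

def checked_run(compute, verify, max_tries=10):
    """Run a computation on unreliable hardware, re-executing until a
    cheap verification passes, so an error only makes things slower."""
    for attempt in range(1, max_tries + 1):
        result = compute()
        if verify(result):
            return result, attempt
    raise RuntimeError("no correct result within the retry budget")

data = list(range(100))

def compute():
    # Toy unreliable computation: about 5% of runs return a corrupted sum.
    return sum(data) + (random.randrange(1, 99) if random.random() < 0.05 else 0)

def verify(result):
    # Verification exploits a known invariant: the closed-form sum 0+1+...+99.
    return result == 100 * 99 // 2

result, attempts = checked_run(compute, verify)
print(f"correct result {result} after {attempts} attempt(s)")
```

The trade-off is exactly the one the article describes: the program tolerates flaky hardware, and the price of an error is a retry, that is, extra execution time, rather than a wrong answer or a crash.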