The Sporadic Glitch
A day in the hard life of a firmware developer 😉
It can occasionally happen that a manufacturer does not test their products carefully enough in all variations, leaving it to the customer to find the remaining bugs. This is the banana principle: the product ripens with the consumer. This is, of course, annoying for the customer, since working around such bugs can be a time-consuming process. It is equally understandable that it is becoming more difficult for the manufacturer to cover all test cases, as the technology is growing increasingly complex. This is especially true for devices which use an internal microprocessor to run more or less bug-free software. Luckily, such devices can often be updated, allowing the manufacturer to improve the device after releasing it. If this feature is used to correct fundamental errors, one might sometimes wonder what the manufacturer even tested before shipping the product.
In order to steer clear of such criticism, manufacturers and their developers like to talk of sporadic glitches which are very rare and only occur under certain circumstances and could therefore hardly be predicted. If such problems can be reproduced at will, one can, however, hardly speak of a sporadic issue. The manufacturer has simply neglected this scenario.
And yet they exist, such sporadic issues with technical devices. The main cause is the tolerances to which any hardware is subject. One way of detecting such issues during development is to run the devices under extreme conditions. If a device passes such a test without complaint, it should work reliably under normal conditions. Even here, of course, it is not possible to test everything. One is therefore limited to, for instance, stress testing the software or testing the device in harsh conditions.
Should sporadic issues still occur with a customer’s device, despite all due diligence, the device is replaced. Such situations do arise and most customers show understanding. Should this happen repeatedly, appropriate measures are certainly required. But the first stage is usually perplexity: How does one examine an error that cannot be recreated consistently?
As a developer of the Querx WLAN firmware, I recently had the questionable pleasure of examining such an error. Two customers had reported that the devices showed sensor errors after rebooting. The error disappeared after further reboots. Initially, one attempts to construct expanded test cases. I know from experience that one can usually create conditions in which the error occurs more reliably. But this time I was not successful - neither with our test devices, nor with the returned products. The problem was initially set aside, since no further failures occurred.
As expected, the past always catches up with us. Another customer reported the same problem, and the error occurred just as sporadically in our quality control and even on a device which we are running in a long-term test. Well, as so often, chance helped me out. A device which showed the same error consistently after every reboot turned up in our quality assurance. A quick examination ruled out manufacturing defects that could have caused the same symptoms. Handling the device with kid gloves, we connected this Querx WLAN to a test adapter, saved its internal state and installed test software with all due caution. With great relief, I noted that the error still occurred. After less than an hour we had found the solution. Querx delivered correct sensor data after any number of reboots.
But what exactly had caused this sporadic error? It can sometimes happen that an error disappears unexpectedly after trying out various changes to the software. The developer is blissful, and a less experienced programmer neglects to make further investigations that might mar this state. The older colleague, however, has been caught out by the past too often. I am closer to the latter group, if only due to my advancing age.
You will probably expect me to lay all the facts on the table and end this blog entry. But as I do not know who will read this text (apart from various search engines), I should try to explain the following in layman’s terms. This will be difficult, but I will give it my best.
Querx WLAN TH measures the ambient temperature with a combined temperature and humidity sensor and converts the data into a digital signal. These signals are then transferred via the visible sensor cable to the STM32F4 microprocessor, or rather to its so-called I2C interface. This consists of two wires. One only carries a constant clock signal, which steadily switches between 0 and 3 volts. The other, the so-called data line, either switches its level at the same rate or does not, depending on the data that is to be transmitted. A temperature of 0 is transmitted by the data line remaining at 0 volts for 16 cycles.
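For readers who prefer code to volts, here is a small Python sketch of the idea. It is a deliberate simplification: it only shows how a 16-bit value could be spread over 16 clock cycles, most significant bit first, and ignores everything else a real I2C transfer involves (addressing, start and stop conditions, acknowledge bits). The function names are made up for this illustration and are not taken from the Querx firmware.

```python
def data_line_levels(value, bits=16):
    """Return the data-line level (0 or 1) for each clock cycle, MSB first."""
    if not 0 <= value < (1 << bits):
        raise ValueError("value does not fit into the given number of bits")
    return [(value >> shift) & 1 for shift in range(bits - 1, -1, -1)]


def decode_levels(levels):
    """Rebuild the number from the levels sampled on each clock cycle."""
    value = 0
    for level in levels:
        value = (value << 1) | level
    return value


# A value of 0 keeps the data line low for all 16 cycles.
zero_frame = data_line_levels(0)

# Any other 16-bit value survives the round trip as long as the
# receiver counts exactly 16 cycles.
round_trip = decode_levels(data_line_levels(0x1234))
```

The receiver can only reconstruct the number if it sees exactly as many clock cycles as the sender produced, which is why an extra pulse on the line is so harmful.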
At this point we can run into a problem. Due to their construction, I2C interfaces are vulnerable to external disturbances, for instance electromagnetic fields emitted by mobile phones or electric motors. Such fields can cause short impulses in the leads, turning 16 cycles into 17. That is, by the way, the reason why the sensor cable on Querx TH is rather short, whereas the Querx PT sensor cable can be several hundred meters long. However, disturbances are to be expected even with a short cable, which is why the microprocessor includes a filter that suppresses interference pulses.
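The following Python sketch illustrates what such a filter does in principle. It is not how the STM32F4 implements its filter in hardware; it is merely an assumed, minimal software model in which the output only follows the input once the input has held a new level for a minimum number of samples. The names suppress_short_pulses and count_rising_edges, as well as the sample data, are invented for this example.

```python
def suppress_short_pulses(samples, min_width=3):
    """Follow the input only after it has held a new level for min_width samples."""
    if not samples:
        return []
    stable = samples[0]   # level currently reported at the output
    run = 0               # how long the input has differed from the output
    output = []
    for level in samples:
        if level == stable:
            run = 0
        else:
            run += 1
            if run >= min_width:
                stable = level
                run = 0
        output.append(stable)
    return output


def count_rising_edges(samples):
    """Count transitions from 0 to 1, i.e. the clock cycles a receiver would see."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev == 0 and cur == 1)


# Two genuine clock pulses, each four samples wide ...
clean = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
# ... plus a one-sample interference spike at the start.
noisy = list(clean)
noisy[1] = 1

cycles_without_filter = count_rising_edges(noisy)
cycles_with_filter = count_rising_edges(suppress_short_pulses(noisy))
```

Without the filter, the receiver counts one cycle too many. With it, the short spike is swallowed and only the two genuine pulses remain, at the price of a small delay.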
I have not lost you yet? Great, then we’re almost done. The microprocessor’s logical circuits consist of an accumulation of switches. A dreaded issue in the design of such circuits is the so-called glitch. Think of the light switches in your flat - in the hallway there are usually several switches for one lamp. If two switches are pressed at the same time, nothing happens. The lamp remains on or off. Just as you will not be able to press both switches at the exact same moment, brief inaccuracies can occur within the microprocessor due to tolerances. If the lamp in your hallway flashes briefly, this is the equivalent of what is called a glitch in electrical engineering jargon. Such a glitch causes an error in the filter of the I2C interface which the STM32F4 microprocessor uses. The error thus occurs in the exact component which is implemented in order to prevent errors. The filter then transmits the wrong data to the microprocessor. Due to the complexity of such microprocessor circuits, the prevention of glitches is part of any chip designer’s basic craft. In this case, someone has obviously not done their work entirely correctly. Even worse: The programmer can usually return such an error state to the default state via a reset signal. This fallback was not provided in our case. There is therefore no method of terminating this error state apart from disconnecting Querx from the power source.
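The hallway lamp can be played through in a few lines of Python. This is only a toy model of the analogy, not of the actual circuit inside the STM32F4: the lamp is on whenever the two switches are in different positions, and the hypothetical parameter skew says how many time steps the second switch lags behind the first.

```python
def lamp_trace(skew, steps=6, flip_at=2):
    """Lamp state over time when both switches flip, the second one `skew` steps late."""
    trace = []
    for t in range(steps):
        switch_a = 1 if t >= flip_at else 0
        switch_b = 1 if t >= flip_at + skew else 0
        # Two-way wiring: the lamp is on when the switches disagree.
        trace.append(switch_a ^ switch_b)
    return trace


# Both switches flip at exactly the same moment: the lamp stays off.
simultaneous = lamp_trace(skew=0)

# The second switch is one step late: the lamp flashes for one step.
slightly_late = lamp_trace(skew=1)
```

Before and after the two flips the lamp is off in both cases. Only the tiny timing difference in between produces the brief flash, and that flash is the glitch.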
I managed to solve this problem by changing the order in which the device is initialized, thus preventing the glitch from occurring. The updated firmware version 3.2.15 is now available from our website. We advise you to update your device’s firmware even if you have never encountered this problem, since sporadic errors have a further, unpleasant trait: They will often occur at the most inconvenient of times. In this case, this might be a power outage, after which you will surely have enough to worry about without being bothered by a failing sensor.