How to render cloud FPGAs useless
FPGA instances are now offered by multiple cloud service providers (including Amazon EC2 F1/F2 instances, Alibaba ECS instances, and Microsoft Azure NP-Series). The low-level programmability of FPGAs enables new attack vectors, including denial-of-service (DoS) attacks. Some severe attacks (such as short circuits) cannot easily be deployed because users are prevented from loading their own configuration bitstreams onto cloud FPGAs. Nevertheless, it has been demonstrated that it is possible to leak information (such as cloud instance scheduling policies or the physical topologies of the FPGA servers) or to mount DoS attacks through excessive power hammering. For instance, virtually all cloud FPGAs provide logic cells that can be configured as small shift registers. This allows building toggle shift registers with 10K or more flip-flops, which can draw over 1 kW of power when clocked at a few hundred MHz. In our work, we created fast ring oscillators that bypass all design checks applied during cloud bitstream deployment, and we achieved toggle rates of 8 GHz inside an FPGA by using glitch amplification. The latter was calibrated with the help of a time-to-digital converter (TDC). As a first attack, we used power hammering to crash AWS F1 instances by increasing power consumption to 300 W (three times the allowed power envelope). We used physical unclonable functions (PUFs) to examine the behaviour of the attacked FPGA cloud instances and found that most remained unavailable for several hours after the attack. As a more subtle attack, we tried to cause permanent damage to FPGAs in our lab by driving fast-toggling signals onto virtually every available wire (and primitive) in a small region of the chip. With this, we created hotspot designs that draw 130 W in less than 1% of the available logic and routing resources of a datacenter FPGA. Even though the achieved power density was excessive, it was insufficient to induce permanent damage.
This is largely due to the area inefficiencies of an FPGA, which limit the power density. For instance, FPGAs use large multiplexers to implement the switchable connections, and only one active path is routed through each multiplexer, leaving most of the transistors idle. Similarly, FPGAs provide a large number of configuration memory cells (about 1 Gb on a typical datacenter device) that draw negligible power because they do not switch during operation. All these idle elements force the power-drawing circuits to be spread out, which limits power density. However, when experimenting with different hotspot variants, we found thermal runaway effects and excessive device aging, with up to a 70% increase in delay on some wires. We achieved this aging in just a few days and under normal operating conditions (i.e., by staying within the available power budget and with board cooling running). Such a large increase in latency can be considered to render an FPGA useless, as it will usually no longer be fast enough to host realistic user designs. Beyond exploring these attack vectors, we developed countermeasures and design guidelines to prevent such attacks. These include scans of user designs, usage restrictions on resources such as I/Os and clock trees, as well as runtime monitoring and FPGA health checks. With this, we believe that FPGAs can be operated securely and reliably in a cloud setting.
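
To put the figures quoted above into perspective, the following is a minimal back-of-the-envelope sketch in Python. It is not taken from the presented work. It uses only the numbers stated in the text (130 W in less than 1% of the resources, 300 W being three times the allowed envelope, and a delay increase of up to 70%). All function names are hypothetical, and two simplifications are assumed: the fraction of used logic and routing resources is treated as a proxy for chip area, and the aged wire is assumed to dominate the critical path of a design. The 250 MHz target clock is an illustrative value only.

```python
def uniform_equivalent_power_w(hotspot_power_w, resource_fraction):
    """Power the whole chip would draw if every region ran at the hotspot's density.

    Assumption: the share of used logic/routing resources approximates
    the share of chip area occupied by the hotspot.
    """
    return hotspot_power_w / resource_fraction


def density_vs_envelope(hotspot_power_w, resource_fraction, envelope_w):
    """Hotspot density relative to the envelope spread evenly over the chip."""
    return uniform_equivalent_power_w(hotspot_power_w, resource_fraction) / envelope_w


def relative_speed(delay_increase):
    """Remaining fraction of the original clock rate after a delay increase.

    Assumption: the aged wire dominates the critical path, so the
    achievable clock frequency scales with 1 / delay.
    """
    return 1.0 / (1.0 + delay_increase)


def derated_fmax_mhz(fmax_mhz, delay_increase):
    """Achievable clock frequency after aging, under the same assumption."""
    return fmax_mhz * relative_speed(delay_increase)


# Figures quoted in the text.
envelope_w = 300.0 / 3.0    # 300 W is three times the allowed envelope
hotspot_w = 130.0           # hotspot design power
fraction = 0.01             # "less than 1%", so the results are lower bounds
aging = 0.70                # up to 70% increase in delay

# Illustrative value, not from the text.
target_mhz = 250.0

print(f"uniform equivalent: {uniform_equivalent_power_w(hotspot_w, fraction):.0f} W")
print(f"density vs envelope: {density_vs_envelope(hotspot_w, fraction, envelope_w):.0f}x")
print(f"remaining speed: {relative_speed(aging):.1%}")
print(f"derated clock: {derated_fmax_mhz(target_mhz, aging):.1f} MHz")
```

Under these assumptions, the hotspot corresponds to a power density at least about 130 times that of a design spreading the allowed envelope evenly over the device, and a wire that has aged by 70% supports only about 59% of its original clock rate.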