Posts by Collection

portfolio

Keyword Spotting

This demonstration shows a deep neural network running on an Arduino Nano 33 BLE Sense microcontroller performing keyword spotting. The DNN distinguishes 36 different spoken keywords from Pete Warden's Speech Commands dataset, performs inference in about 800 ms per 1-second audio sample, and requires 190 KB of ROM and 111 KB of RAM.
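The demo's firmware itself is not shown here, but a quantized keyword-spotting model of this kind is typically validated on the host before being deployed to the MCU. A minimal sketch using the TensorFlow Lite Python interpreter, assuming a hypothetical kws_int8.tflite model file and placeholder input features:

```python
# Minimal host-side sketch (not the demo's actual code): run a quantized
# keyword-spotting model with the TFLite interpreter before MCU deployment.
# The model path and the zeroed feature tensor are assumptions.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="kws_int8.tflite")  # hypothetical file
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Placeholder for the spectrogram/MFCC features of one 1-second audio clip.
features = np.zeros(inp["shape"], dtype=inp["dtype"])
interpreter.set_tensor(inp["index"], features)
interpreter.invoke()
scores = interpreter.get_tensor(out["index"])[0]
print("predicted keyword index:", int(np.argmax(scores)))
```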

On-Device DNN Training

Demo of on-device DNN training on an animal classification dataset with 10 classes, running on an STM32H743VI ARM Cortex-M7 microcontroller (OpenMV H7 R2).

publications

Energy-efficient Deployment of Deep Learning Applications on Cortex-M based Microcontrollers using Deep Compression

Published in MBMV 2023; 26th Workshop, 2023

Large Deep Neural Networks (DNNs) are the backbone of today's artificial intelligence due to their ability to make accurate predictions when trained on huge datasets. With advancing technologies such as the Internet of Things, interpreting the large quantities of data generated by sensors is becoming an increasingly important task. However, in many applications not only the predictive performance but also the energy consumption of deep learning models is of major interest. This paper investigates the efficient deployment of deep learning models on resource-constrained microcontroller architectures via network compression. We present a methodology for the systematic exploration of different DNN pruning, quantization, and deployment strategies, targeting different ARM Cortex-M based low-power systems. The exploration makes it possible to analyze trade-offs between key metrics such as accuracy, memory consumption, execution time, and power consumption. We discuss experimental results on three different DNN architectures and show that we can compress them to below 10% of their original parameter count before their predictive quality decreases. This also allows us to deploy and evaluate them on Cortex-M based microcontrollers.
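As a rough illustration of the two compression steps the paper combines, here is a minimal PyTorch sketch of magnitude pruning followed by post-training quantization. It uses stock PyTorch utilities and a toy model, not the authors' exploration framework:

```python
# Illustrative sketch: prune 90% of weights (mirroring the "below 10% of
# the original parameter count" regime), then quantize the rest to int8.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))

# Magnitude pruning: zero out the 90% smallest weights per linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.9)
        prune.remove(module, "weight")  # bake the pruning mask into the weights

# Post-training dynamic quantization of the remaining weights to int8.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
```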

Recommended citation: Deutel, M., Woller, P., Mutschler, C., & Teich, J. (2023, March). Energy-efficient Deployment of Deep Learning Applications on Cortex-M based Microcontrollers using Deep Compression. In MBMV 2023; 26th Workshop (pp. 1-12). VDE.
Download Paper | Download Paper (open) | Download Slides

Combining Multi-Objective Bayesian Optimization with Reinforcement Learning for TinyML

Published in arXiv, 2023

Deploying Deep Neural Networks (DNNs) on microcontrollers (TinyML) is a common way to process the increasing amount of sensor data generated at the edge, but in practice, resource and latency constraints make it difficult to find optimal DNN candidates. Neural Architecture Search (NAS) is an excellent approach to automate this search and can easily be combined with the DNN compression techniques commonly used in TinyML. However, many NAS techniques are not only computationally expensive, especially when combined with hyperparameter optimization (HPO), but also often focus on optimizing a single objective, e.g., maximizing accuracy, without considering additional objectives such as the memory consumption or computational complexity of a DNN, which are key to making deployment at the edge feasible. In this paper, we propose a novel NAS strategy for TinyML based on Multi-Objective Bayesian Optimization (MOBOpt) and an ensemble of competing parametric policies trained as Augmented Random Search (ARS) Reinforcement Learning (RL) agents. Our methodology aims at efficiently finding trade-offs between a DNN's predictive accuracy, memory consumption on a given target system, and computational complexity. Our experiments show that we consistently outperform existing MOBOpt approaches on different datasets and architectures such as ResNet-18 and MobileNetV3.
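For readers unfamiliar with ARS, the toy sketch below shows its core update rule (paired random perturbation directions, with only the best-performing directions contributing to the step). It optimizes a generic black-box objective; the paper's embedding of such agents in a multi-objective NAS loop is not reproduced here:

```python
# Core ARS update (after Mania et al., 2018) on a black-box reward function.
# All hyperparameter values below are illustrative defaults.
import numpy as np

def ars_step(theta, reward_fn, n_dirs=8, top_b=4, nu=0.05, alpha=0.02, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    deltas = rng.standard_normal((n_dirs, *theta.shape))
    r_pos = np.array([reward_fn(theta + nu * d) for d in deltas])
    r_neg = np.array([reward_fn(theta - nu * d) for d in deltas])
    # Keep only the top-b directions, ranked by their best rollout.
    order = np.argsort(np.maximum(r_pos, r_neg))[::-1][:top_b]
    sigma = np.concatenate([r_pos[order], r_neg[order]]).std() + 1e-8
    grad = sum((r_pos[k] - r_neg[k]) * deltas[k] for k in order)
    return theta + alpha / (top_b * sigma) * grad

# Toy usage: maximize -||theta||^2, whose optimum is at the origin.
theta = np.ones(4)
for _ in range(100):
    theta = ars_step(theta, lambda t: -np.sum(t ** 2))
```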

Recommended citation: Deutel, M., Kontes, G., Mutschler, C., & Teich, J. (2023). Combining Multi-Objective Bayesian Optimization with Reinforcement Learning for TinyML. arXiv preprint arXiv:2305.14109.
Download Paper

microYOLO: Towards Single-Shot Object Detection on Microcontrollers

Published in ECML PKDD Conference 2023, at the 4th Workshop on IoT, Edge, and Mobile for Embedded Machine Learning, 2023

This work-in-progress paper presents results on the feasibility of single-shot object detection on microcontrollers using YOLO. Single-shot object detectors such as YOLO are widely used; however, due to their complexity, they mainly run on larger GPU-based platforms. We present microYOLO, which runs on Cortex-M based microcontrollers such as the OpenMV H7 R2, achieving about 3.5 FPS when processing 128x128 RGB images while using less than 800 KB of Flash and less than 350 KB of RAM. Furthermore, we share experimental results for three different object detection tasks and analyze the accuracy of microYOLO on them.
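To make the single-shot idea concrete, here is a numpy sketch of how a YOLO-style grid output is typically decoded into boxes. The layout and the absence of anchor boxes are simplifications for illustration, not microYOLO's actual detection head:

```python
# Decode a YOLO-style grid: each cell predicts box offsets, an objectness
# score, and class scores. Tensor layout below is an assumption.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode(pred, conf_thresh=0.5):
    # pred: (S, S, 5 + C) with [tx, ty, tw, th, objectness, class scores...]
    S = pred.shape[0]
    boxes = []
    for cy in range(S):
        for cx in range(S):
            tx, ty, tw, th, obj = pred[cy, cx, :5]
            if sigmoid(obj) < conf_thresh:
                continue
            x = (cx + sigmoid(tx)) / S             # box center, normalized
            y = (cy + sigmoid(ty)) / S
            w, h = np.exp(tw) / S, np.exp(th) / S  # box size, normalized
            cls = int(np.argmax(pred[cy, cx, 5:]))
            boxes.append((x, y, w, h, cls))
    return boxes
```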

Recommended citation: Deutel, M., Mutschler, C., & Teich, J. (2023). microYOLO: Towards Single-Shot Object Detection on Microcontrollers. In ECML PKDD Conference 2023, at the 4th Workshop on IoT, Edge, and Mobile for Embedded Machine Learning.
Download Paper | Download Slides

On-Device Training of Fully Quantized Deep Neural Networks on Cortex-M Microcontrollers

Published in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2024

On-device training of DNNs allows models to adapt and fine-tune to newly collected data or changing domains while deployed on microcontroller units (MCUs). However, DNN training is a resource-intensive task, making the implementation and execution of DNN training algorithms on MCUs challenging due to low processor speeds, constrained throughput, limited floating-point support, and memory constraints. In this work, we explore on-device training of DNNs on Cortex-M MCUs of different sizes (Cortex-M0+, Cortex-M4, and Cortex-M7). We present a method that enables efficient training of DNNs completely in place on the MCU, using fully quantized training (FQT) and dynamic partial gradient updates. We demonstrate the feasibility of our approach on multiple vision and time-series datasets and provide insights into the trade-offs between training accuracy, memory overhead, energy, and latency on real hardware. The results show that, compared to related work, our approach requires 34.8% less memory and has a 49.0% lower latency per training sample, with dynamic partial gradient updates allowing a speedup of up to 8.7x compared to fully updating all weights.
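The following toy PyTorch sketch illustrates the idea of partial gradient updates: per training step, only a chosen subset of layers receives gradients and weight updates. It uses float arithmetic and therefore omits the integer-only fully quantized training the paper actually performs:

```python
# Conceptual sketch of partial gradient updates, not the paper's FQT kernels.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def train_step(x, y, trainable_layers):
    # Enable gradients only for the selected layer indices.
    for i, layer in enumerate(model):
        for p in layer.parameters():
            p.requires_grad = i in trainable_layers
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()  # layers without gradients are skipped by the optimizer

# e.g., update only the classifier head (layer index 2) on this step:
train_step(torch.randn(8, 32), torch.randint(0, 10, (8,)), {2})
```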

Recommended citation: Deutel, M., Hannig, F., Mutschler, C., & Teich, J. (2024). On-Device Training of Fully Quantized Deep Neural Networks on Cortex-M Microcontrollers. In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
Download Paper | Download Paper (open)

Energy-Efficient AI on the Edge

Published in Unlocking Artificial Intelligence, pp. 359-380, Springer, 2024

This chapter presents methods for the resource-optimized design of AI functionality for edge devices powered by microprocessors or microcontrollers. The goal is to identify Pareto-optimal solutions that satisfy both resource restrictions (energy and memory) and AI performance requirements. To accelerate the design of energy-efficient classical machine learning pipelines, an AutoML tool based on evolutionary algorithms is presented, which uses an energy prediction model derived from assembly instructions (prediction accuracy within 3.1%) to integrate the energy demand into a multi-objective optimization approach. For the deployment of deep neural network-based AI models, deep compression methods are exploited in an efficient design space exploration technique based on reinforcement learning. The resulting DNNs can be executed with a self-developed runtime for embedded devices (dnnruntime), which is benchmarked using the MLPerf Tiny benchmark. The developed methods are intended to enable the fast development of AI functions for the edge by providing AutoML-like solutions for classical as well as deep learning. The developed workflows aim to narrow the gap between data scientists and hardware engineers in order to realize working applications. By iteratively applying the presented methods during the development process, edge AI systems can be realized with minimized project risk.
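To illustrate the idea behind an instruction-level energy model, here is a toy sketch in which total energy is estimated as a weighted sum of executed instruction counts. The per-instruction costs and counts below are invented placeholders, not the chapter's fitted model:

```python
# Toy instruction-level energy estimate: E = sum(count_i * cost_i).
# All cost values are hypothetical placeholders in nanojoules.
INSTR_ENERGY_NJ = {"mul": 0.40, "add": 0.15, "ldr": 0.60, "str": 0.55}

def estimate_energy_nj(instr_counts):
    """Estimate total energy from a map of instruction mnemonics to counts."""
    return sum(INSTR_ENERGY_NJ[op] * n for op, n in instr_counts.items())

# e.g., a small dense layer's (hypothetical) instruction mix:
print(estimate_energy_nj({"mul": 8192, "add": 8192, "ldr": 16384, "str": 128}))
```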

Recommended citation: Witt, N., Deutel, M., Schubert, J., Sobel, C., & Woller, P. (2024). Energy-Efficient AI on the Edge. In: Mutschler, C., Münzenmayer, C., Uhlmann, N., Martin, A. (eds) Unlocking Artificial Intelligence. Springer, Cham.
Download Paper

Fused-Layer CNNs for Memory-Efficient Inference on Microcontrollers

Published in Workshop on Machine Learning and Compression @ NeurIPS 2024, 2024

Convolutional Neural Networks (CNNs) have become the dominant approach to computer vision tasks. As a result, efficient inference of CNNs has become a major concern for processing image data close to where it is generated by camera sensors, most commonly on microcontroller units (MCUs). However, major obstacles to deploying CNNs on MCUs are the strict memory and bandwidth constraints that make processing high-resolution images on many MCUs infeasible. In this work, we propose a method to fuse convolutional layers in quantized CNNs, which serves as an additional dimension for optimizing the memory requirements of CNNs during inference. By fusing the memory-intensive convolutions in the early inverted residual blocks of MobileNetv2-like CNNs, we show that memory requirements during inference can be reduced by up to 54% at the cost of only about a 14% increase in latency and no change in accuracy. As an example, we show that this reduction enables the deployment of an image processing pipeline on a Cortex-M7 MCU that supports image resolutions of up to 320x320 pixels, compared to the 128x128 resolution commonly used in related work.
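The memory saving comes from evaluating fused layers tile by tile, so the full intermediate feature map is never materialized at once. A minimal single-channel numpy sketch of fusing two stacked 3x3 convolutions; the shapes and row tiling are illustrative, whereas the paper fuses quantized inverted residual blocks:

```python
# Fused evaluation of two valid 3x3 convolutions: each output row tile
# only needs a small "halo" of input rows, so the intermediate feature
# map never exists in full. Single-channel toy example.
import numpy as np

def conv3x3(x, k):  # naive valid 2D convolution (cross-correlation)
    H, W = x.shape[0] - 2, x.shape[1] - 2
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(x[i:i + 3, j:j + 3] * k)
    return out

def fused_two_convs(x, k1, k2, tile_rows=8):
    H_out = x.shape[0] - 4  # two valid 3x3 convs shrink the height by 4
    out = np.zeros((H_out, x.shape[1] - 4))
    for r in range(0, H_out, tile_rows):
        rows = min(tile_rows, H_out - r)
        # `rows` output rows need `rows + 2` intermediate rows,
        # which in turn need `rows + 4` input rows (the halo).
        mid = conv3x3(x[r:r + rows + 4], k1)
        out[r:r + rows] = conv3x3(mid, k2)
    return out

x = np.random.rand(32, 32)
k1, k2 = np.random.rand(3, 3), np.random.rand(3, 3)
# Fused, tiled evaluation matches the layer-by-layer result.
assert np.allclose(fused_two_convs(x, k1, k2), conv3x3(conv3x3(x, k1), k2))
```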

Recommended citation: Deutel, M., Hannig, F., Mutschler, C., & Teich, J. (2024). Fused-Layer CNNs for Memory-Efficient Inference on Microcontrollers. In Workshop on Machine Learning and Compression @ NeurIPS 2024.
Download Paper

talks

teaching

Multi-Core Architectures and Programming

Seminar, Friedrich-Alexander-Universität Erlangen-Nürnberg, Department of Hardware-Software-Co-Design

Multi-core processors are already very widespread today. Examples of such architectures include modern graphics processors, which can consist of up to 4608 streaming processors and 576 tensor cores. Multi-core processors offer very high theoretical computing power and therefore open up fascinating new possibilities in scientific and other computationally intensive areas, such as multimedia applications, medical technology, or finance. However, to fully exploit this performance, an efficient mapping of algorithms onto the architecture of the respective multi-core processor must be found. Compared to traditional single-core processors, this often requires a radical rethink of how programs are written.

The aim of the seminar is to provide insights into state-of-the-art multi-core architectures, e.g., AI accelerators, and their programming paradigms. To gain practical development experience, NVIDIA TITAN RTX, Intel Neural Compute Stick, and Tegra AGX systems are offered, among others. The latest software development tools (TensorRT, OpenVINO, C++20, SYCL, CUDA, OpenCL, OpenMP + MPI) are available for project work in teams.
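As a small taste of the data-parallel style this rethink leads to, here is a toy example mapping one workload across CPU cores with Python's standard library; the seminar itself targets GPUs and accelerators via CUDA, OpenCL, SYCL, and OpenMP/MPI:

```python
# Toy data-parallel mapping: the same CPU-bound kernel applied to many
# inputs, distributed across all available cores.
from multiprocessing import Pool

def work(n: int) -> int:
    return sum(i * i for i in range(n))  # dummy compute kernel

if __name__ == "__main__":
    tasks = [200_000] * 32
    with Pool() as pool:                 # one worker process per core
        results = pool.map(work, tasks)  # parallel map over the task list
    print(sum(results))
```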