Authors: Mandar Gogate, Abhishek Taur


The increasing complexity and efficiency demands of embedded systems call for better compilers. Conventional compilers do not meet the basic requirements of embedded systems, such as low power, high efficiency, and high speed, nor do they support architecture-specific instructions such as Single Instruction Multiple Data (SIMD). Moreover, as the market for embedded platforms and the variety of target architectures grow rapidly, there is an unprecedented need for embedded compilers that produce reasonably efficient assembly code from C or C++ for many different platforms. In this paper, we present a comprehensive and fine-grained survey of compiler design techniques for embedded systems. In particular, we systematically analyze retargetable and multiprocessor compilers, two emergent approaches to embedded compiler design issues.

Keywords—Compilers, Retargetable, Embedded system, Multiprocessor System, Design patterns


Compilers have long been central to desktop operating systems; indeed, they are among the most widely used software on the desktop. In embedded systems, by contrast, their use was initially limited. With the growth of embedded systems and rising market demand, the situation is now different. Writing assembly code by hand takes a great deal of time and patience, and this created the need for embedded compilers. It is easier to code in high-level languages such as C or C++, which carry a large body of legacy code, and programmers do not have to understand the underlying hardware as they must for assembly language. This relieves the programmer's burden and increases the portability of the code. Embedded systems face many constraints, such as power, memory, and high-speed requirements. Different embedded systems use different architectures, which calls for reconfigurable compilers. These platforms have different instruction sets that conventional compilers cannot account for. Therefore, conventional compilers are often not suitable for embedded systems, as explained in [1].

A compiler is a large piece of software that can be divided into three phases: the front end, an intermediate representation (IR) optimizer, and the back end. The front end converts source-level code into a machine-independent intermediate representation. The IR optimizer improves the IR code by removing dead code, redundant code, unused variables, and similar inefficiencies. The back end, the most crucial part of the compiler and the one of greatest interest to researchers, converts the machine-independent IR into machine-specific assembly code. Compilers must be modular so that they remain portable and maintain a level of abstraction. This modularity can be used to make them retargetable, which in turn lets the same C or C++ code run on a different architecture. In this paper, we discuss the CCC compiler, a retargetable compiler described in [2]. A retargetable compiler contains target-independent and target-dependent modules, as shown in [3]. We also discuss how GCC and LLVM, two of the most popular C and C++ compilers, are made retargetable [3].
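To make the middle phase concrete, the following is a minimal sketch (not any real compiler's IR) of one IR optimization mentioned above: removing assignments to variables that are never used. The tuple format and names here are invented for illustration.

```python
# Toy IR: a list of (dest, op, operands) tuples; dest=None marks an
# instruction with no result, such as a return.
def eliminate_dead_assignments(ir):
    """Single-pass removal of assignments whose destination is never read."""
    used = set()
    for dest, op, operands in ir:
        used.update(o for o in operands if isinstance(o, str))
    return [(dest, op, operands) for dest, op, operands in ir
            if dest is None or dest in used]

ir = [
    ("t1", "add", ("a", "b")),
    ("t2", "mul", ("a", 4)),      # t2 is never read afterwards: dead code
    (None, "ret", ("t1",)),
]
optimized = eliminate_dead_assignments(ir)
```

A production optimizer would iterate this to a fixed point, since removing one dead assignment can make another dead; the single pass above just illustrates the idea.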

In recent times, multicore processors have made their mark, since performing multiple tasks in parallel increases computational power. The difficulty with multicore processors is extracting fine-grained parallelism, which is tedious because machine-specific partitioning lacks portability and forward performance compatibility. In [4], the different forms of parallelism are discussed, along with how a compiler can exploit them. DSP systems must perform fast algorithmic operations, which requires fast data streaming and limited memory use. In [5], a new model for streaming data is introduced and the use of a C++ compiler is discussed in detail. [6] discusses the design patterns used to exploit parallelism on a multicore system.


Embedded systems present unusual challenges to a compiler because of characteristics such as limited memory, domain- or application-specific instruction set processors (ASIPs), and real-time, performance-sensitive applications. Available compilers fail to satisfy either performance or memory requirements; thus, the user must write assembly code, which requires massive programming effort and is less portable than C or C++. Given the growing demand in the embedded systems market and the development of high-end processor architectures, a human programmer can no longer outperform compilers in terms of code quality. To eliminate the problem of poor code quality, compiler designers need domain-specific optimization methods that go beyond conventional compiler technology.

As shown in Fig. 1, the compiler usually consists of three parts: a front end, an intermediate representation (IR) optimizer, and a back end. The front end converts source code into IR by grouping character strings into tokens, analyzing the syntax for errors, and bookkeeping the identifiers. The IR optimizer iteratively removes any redundancies present, since some optimizations enable opportunities for others. The back end includes three stages: the code selection stage translates IR into a behaviorally equivalent machine-specific assembly program; register allocation handles the assignment of variables and intermediate results to physically available registers; and scheduling arranges instructions into time slots while respecting inter-instruction dependencies.
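The register allocation stage can be sketched with a minimal linear-scan allocator. The live intervals and register names below are invented for illustration, and spilling (what happens when no register is free) is omitted.

```python
# Minimal linear-scan register allocation: walk intervals in order of start
# point, free registers whose interval has ended, assign the next free one.
def linear_scan(intervals, regs):
    """intervals: {virtual_reg: (start, end)}; returns virtual->physical map."""
    active = []                      # (end, vreg) pairs currently in registers
    free = list(regs)
    alloc = {}
    for v, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        for e, u in list(active):    # expire intervals that ended before start
            if e < start:
                active.remove((e, u))
                free.append(alloc[u])
        alloc[v] = free.pop(0)       # a real allocator would spill when empty
        active.append((end, v))
    return alloc

mapping = linear_scan({"v1": (0, 3), "v2": (1, 2), "v3": (4, 6)}, ["r0", "r1"])
```

Here v1 and v2 are live at the same time, so they receive different physical registers, while v3 starts after both have ended and can reuse one.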

To generate highly efficient assembly code for ASIPs from source code, researchers have proposed dedicated code optimization techniques, which consist of: single-instruction multiple-data (SIMD) instructions, which improve the use of functional units and exploit parallelism; address generation units, which allow address computation to execute in parallel with computations in the central data path; and code optimization for low power and low energy, which considers heat dissipation constraints and ensures efficient battery use through energy saving. Despite the many difficulties in designing a compiler for embedded systems, there is good news: unlike for traditional compilers, high time complexity is acceptable if the compiler produces more efficient code. Designers can therefore use more effective (as well as time-consuming) optimization techniques such as genetic algorithms and integer linear programming.
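The SIMD idea above can be illustrated conceptually: a scalar loop over n elements becomes n/4 "vector" operations, each adding four lanes at once. Pure-Python lists stand in for hardware vector registers here; this is a sketch of the transformation an SIMD-aware code selector performs, not real vector code.

```python
# Conceptual SIMD: process the arrays in chunks ("lanes") of 4 so that each
# chunk corresponds to one vector load per input and one SIMD add.
def vector_add(a, b, lanes=4):
    out = []
    for i in range(0, len(a), lanes):
        va, vb = a[i:i+lanes], b[i:i+lanes]        # two vector loads
        out.extend(x + y for x, y in zip(va, vb))  # one SIMD add over 4 lanes
    return out

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10, 20, 30, 40, 50, 60, 70, 80]
assert vector_add(a, b) == [x + y for x, y in zip(a, b)]
```

The payoff on real hardware is that the loop body executes a quarter as many times, with each iteration driving all SIMD lanes of a functional unit.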

As new processors are manufactured nearly every month, compilers for them must be developed in parallel. To enable fast compiler design for new processors, researchers have proposed the concept of retargetable compilers.

Retargetable compilers are cross compilers that generate assembly code for an individual target machine by taking a model of that machine as input along with the source code. There are three degrees of retargetability: developer retargetability, in which the compiler developer ports the compiler to a new target by making significant changes; user retargetability, in which an experienced user can retarget the compiler with small changes to its source code; and parameter retargetability, in which retargeting is done by adjusting parameters such as instruction latencies or register file size. The design of retargetable compilers is both promising and challenging for embedded system design. One open issue that remains is the trade-off between code quality and retargetability in C compilers. As researchers explore new processor-specific code generation techniques that continuously narrow the gap between the quality offered by C compilers and assembly programming, the approach providing the right balance of flexibility, retargeting effort, and compatibility with existing design tools will succeed.


As discussed in [1], different processors use different architectures, which in turn have their own instruction sets. A conventional compiler is programmed for a fixed processor and is therefore not the best fit for embedded systems. Researchers came up with the solution of retargetable compilers: compilers that can be modified to generate code for a different target processor with just a few changes to their own source code. As mentioned in [1], there are different degrees of compiler retargetability: developer retargetability, user retargetability, and parameter retargetability. Developer retargetability requires in-depth knowledge of the compiler back end and is therefore discouraged, while parameter retargetability can only retarget the compiler to a narrow set of target processors, so user retargetability is employed for most compilers. As mentioned in [1], the GCC compiler is one example in which the compiler is modified and is given the source code as well as the target processor information as inputs.

The CCC compiler is discussed in detail in [2]. C++ is used to implement the CCC compiler, and it is modular in nature, i.e., object oriented. This modularity helps make the compiler retargetable. Inheritance from the base modules is the key aspect: each target has its own set of modules that inherit from the base modules, so most of the compiler code remains intact, and adding new target-specific modules is enough to retarget the CCC compiler. The CCC compiler consists of the modules shown in Fig. 2: a back-end module, an instructions module, a resources module, a calling-convention module, a rewriter-rules module, and a code-emitter module. The back-end module encapsulates the target-dependent modules and contains all the mentioned analyses, which are called during compilation. The most important module in the back end for compiling source code is the rewriter module, which converts the high-level IR (HLIR) into low-level IR (LLIR). The rewriter module consists of rewriter rules grouped into rule sets based on the kinds of operations the processor executes, such as arithmetic, logical, execution control, and bitwise operations. During the conversion from HLIR to assembly language, the pseudo instructions in the HLIR are compared with the patterns of each rule in the rule set. If a pattern and a pseudo instruction match by opcode, type, kind, and resources, the HLIR is converted to LLIR. As shown in [2], the retargetability of the CCC compiler was tested on 32-bit MIPS and 32-bit Crystal platforms.
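The rewriter idea can be sketched as a table-driven match on opcode and operand kind. The rule table and MIPS-flavored mnemonics below are invented for illustration and are not taken from [2]; the CCC rules also match on type and resources, which this sketch omits.

```python
# Hypothetical rewriter rules: each rule matches a pseudo instruction by
# (opcode, operand kind) and emits a target-specific instruction template.
RULES = {
    ("add", "reg"): "addu {d}, {a}, {b}",   # register-register add
    ("add", "imm"): "addiu {d}, {a}, {b}",  # add with immediate operand
    ("and", "reg"): "and {d}, {a}, {b}",
}

def rewrite(hlir):
    """Translate a list of (op, kind, dest, src1, src2) pseudo instructions."""
    llir = []
    for op, kind, d, a, b in hlir:
        pattern = RULES[(op, kind)]          # match by opcode and operand kind
        llir.append(pattern.format(d=d, a=a, b=b))
    return llir

out = rewrite([("add", "reg", "$t0", "$t1", "$t2"),
               ("add", "imm", "$t3", "$t0", 8)])
```

Retargeting in this scheme amounts to supplying a new rule table (a new target-specific module) while the matching machinery stays untouched.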

The three steps for making the GCC compiler retargetable are discussed in [3]. As described there, the first step is to understand the architecture of the target processor. The second step is to define the ABI (application binary interface). The last step is to define the three main files that describe the target processor and operating environment for GCC. As explained, GCC uses a modified version of the Davidson-Fraser model for compilation; a Davidson-Fraser compiler can be retargeted by rewriting the expander and the recognizer. GCC uses peephole optimization and define_split for target-specific optimization. These two techniques are essentially opposites: peephole optimization combines two or more consecutive instructions into one, while define_split breaks a complex instruction into two or more simpler ones.
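The two opposing rewrites can be illustrated with a toy instruction stream. This is a Python sketch of the idea only, not GCC's machine description syntax; the mnemonics and the immediate-range limit are invented.

```python
# Peephole: a constant load followed by an add of that register merges into
# one add-immediate.  Split: an add-immediate whose constant is too wide to
# encode breaks back into a load plus a register add.
def peephole(insns):
    out, i = [], 0
    while i < len(insns):
        if (i + 1 < len(insns) and insns[i][0] == "li"
                and insns[i + 1][0] == "add" and insns[i + 1][3] == insns[i][1]):
            _, tmp, const = insns[i]
            _, dst, src, _ = insns[i + 1]
            out.append(("addi", dst, src, const))   # two instructions -> one
            i += 2
        else:
            out.append(insns[i])
            i += 1
    return out

def split_wide_imm(insns, max_imm=32767):
    out = []
    for ins in insns:
        if ins[0] == "addi" and ins[3] > max_imm:   # immediate won't encode
            _, dst, src, const = ins
            out.append(("li", "tmp", const))        # one instruction -> two
            out.append(("add", dst, src, "tmp"))
        else:
            out.append(ins)
    return out

merged = peephole([("li", "t9", 4), ("add", "a0", "a1", "t9")])
split = split_wide_imm([("addi", "a0", "a1", 100000)])
```

In GCC both transformations are written declaratively in the machine description, so porting to a new target means writing new patterns rather than new passes.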


A multicore processor takes advantage of independent instructions in the source code by executing them on different cores at the same time. But finding parallelism is not an easy task; this is where the compiler's responsibility lies. The compiler must find parallelism both outside of and during runtime, called static and dynamic parallelism respectively, as mentioned in [4]. [4] explains in detail the various kinds of parallelism and how a compiler identifies parallelism and redundancies. There are different levels of parallelism: instruction-level parallelism, data-level parallelism, loop-level parallelism, and pipeline-level parallelism. In instruction-level parallelism, multiple independent instructions are executed simultaneously on different cores. In data-level parallelism, the same instruction is performed on different data sets. In loop-level parallelism, independent iterations of a loop are executed in parallel. In pipeline parallelism, the application or task is broken down into stages; after one stage completes, the next stage is executed by another core, so different cores execute different stages of different tasks simultaneously. Different methods, such as pointer analysis and shape analysis, help the compiler exploit resources and reduce redundancies. The compiler can discover different kinds of parallelization, such as DOALL and DOACROSS, mentioned in [4]. This kind of parallelization has to be done by the compiler during the static (non-runtime) stage. It has been found that correct execution of a statically compiled program is possible on many architectures, but it is optimized for only a few. This brings in the need for dynamic compilation.
In pipeline parallelization, if the stages are too long they can lead to stalling; this must be accounted for at runtime, and the compiler should be able to change the stage length or make other suitable changes to optimize. Different profiling and monitoring techniques can be used to exploit parallelism at runtime, during the dynamic compilation phase.
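The DOALL case mentioned above can be sketched directly: when no iteration depends on another, the iterations may run in any order on any core and the result is unchanged. Here a thread pool stands in for the cores a parallelizing compiler would target.

```python
# DOALL loop parallelism: independent iterations dispatched to a thread pool
# produce exactly the same result as the sequential loop.
from concurrent.futures import ThreadPoolExecutor

def body(i):
    return i * i          # no cross-iteration dependence: safe to parallelize

with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(body, range(8)))   # map preserves iteration order

sequential = [body(i) for i in range(8)]
assert parallel == sequential
```

A DOACROSS loop, by contrast, carries a dependence between iterations, so the compiler must insert synchronization rather than dispatch iterations freely.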

Parallelism in a multicore processor leads to communication between different cores, and the compiler should manage this communication using efficient protocols so as to schedule resources efficiently. In [5], a novel programming model for data streaming known as SPRC is introduced; it enhances RPC by supporting data streaming between the RPC client and servers. Since embedded systems have a limited amount of memory, streaming data is a good option. The C++ libraries are divided into layers, and instances of these layers are used to import the functions the application requires. To distinguish SIMD instructions, a dsp prefix is used in the C++ source code so they can be scheduled directly alongside normal instructions. Long SIMD instructions are broken into smaller sets of instructions and executed in parallel. Function offloading can be done using SPRC++ calls. The idea is to use input and output data streams that are connected to a remote server for data streaming; these streams can form an array, since there can be multiple client-server pairs. Parallelism uses design patterns and can occur on either side, i.e., at the client or the server.
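The streaming idea behind this model can be sketched with a bounded queue standing in for the client-to-server stream; none of the names here come from the SPRC API, and a thread stands in for the remote server. The point is that the client never holds the whole data set in memory at once.

```python
# Streaming sketch: a client pushes items into a bounded stream while a
# "server" thread consumes and processes them concurrently.
import queue
import threading

stream = queue.Queue(maxsize=4)     # bounded: models limited embedded memory
results = []

def server():                       # stands in for the remote offload target
    while True:
        item = stream.get()
        if item is None:            # end-of-stream marker
            break
        results.append(item * 2)    # the offloaded computation

t = threading.Thread(target=server)
t.start()
for x in range(6):                  # client streams its data piece by piece
    stream.put(x)                   # blocks when the stream buffer is full
stream.put(None)
t.join()
```

Multiple such client-server pairs, each with its own pair of input and output streams, give the array of streams described above.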

Because parallel design patterns exhibit regularity, they are seen as a great opportunity for power optimization in multicore processors. Design patterns were first used in object-oriented programming to solve recurring problems. These patterns decompose software development into steps such as finding the concurrency, designing algorithms, deciding on implementation mechanisms, and designing the program structure. [6] describes how a compiler can use these design patterns to implement parallelism on a multicore processor; the same patterns can be used to develop other software applications. Pipe and filter is a very popular design pattern for streaming applications. The program is divided into a set of filters, where each filter is a set of tasks, and the filters interact with each other through pipes, which can be a shared queue, a circular buffer, or inter-process communication (IPC). One filter can interact with one or many filters, and filters can be cascaded in communication with each other. Stalling occurs if the receiving and sending filters have different data rates; [6] demonstrates a rate equation showing how to handle this situation by equalizing the overall data rates. The MapReduce design pattern decomposes a task into two phases, map and reduce. In the map phase, each processor is assigned a user-defined map function; in the reduce phase, the results of the mapped functions are collected and summarized. Different map functions take different amounts of time, so [6] proposes a compiler and runtime scheme that exploits the pattern of early-finishing threads with power management to reduce the energy wasted by processors: if the power consumed in the low-power mode during the waiting time, plus the switching power, is less than the power consumed in the running mode, the processor is switched to the low-power mode.
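The pipe-and-filter structure can be sketched compactly with generators, where each function is a filter and the lazy handoff between them plays the role of the pipe (a shared queue would play the same role across cores). The filters and data below are invented for illustration.

```python
# Pipe and filter: a source feeding two cascaded filters through "pipes".
def source():
    yield from range(10)

def keep_even(pipe):                 # filter 1: drop odd values
    for x in pipe:
        if x % 2 == 0:
            yield x

def scale(pipe, k):                  # filter 2: multiply by a constant
    for x in pipe:
        yield x * k

pipeline = scale(keep_even(source()), 10)
result = list(pipeline)
```

The rate problem discussed above appears here as well: if `keep_even` discards half its input, a downstream filter sized for the source's full rate would sit idle half the time, which is exactly the imbalance the rate equation in [6] is meant to remove.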
The puppeteer pattern is used in [6] to adjust processor frequencies provided that the throughput stays above the required level. The BSP model abstracts parallel computation into three elements: 1) multiple processing units, each capable of performing computation or local memory operations; 2) a communication mechanism that allows interaction between the processing units; and 3) barriers to synchronize the processing units at certain intervals. The computation phase and the communication phase between consecutive synchronization points form a super-step. In [6], the application is decoupled into a sequence of super-steps. The number of super-steps and the super-steps themselves are inputs to the compiler, and based on data-flow analysis the compiler generates gating instructions that turn off the components not used in a super-step. The overhead of turning the unused units off and on is taken into account, and a global analysis is then done across all the super-steps.
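One BSP super-step can be sketched with threads and a barrier: each worker computes on its own slice (compute phase), waits at the barrier (synchronization point), and only then reads the others' results (communication phase). The data and worker count are invented for illustration.

```python
# One BSP super-step: local compute, barrier, then read shared results.
import threading

data = [[1, 2], [3, 4], [5, 6]]
partials = [0] * 3
totals = [0] * 3
barrier = threading.Barrier(3)

def worker(i):
    partials[i] = sum(data[i])      # compute phase: local work only
    barrier.wait()                  # synchronization point ending the step
    totals[i] = sum(partials)       # communication phase: read others' work

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The barrier is what makes the gating analysis in [6] tractable: between two synchronization points the compiler knows exactly which units each super-step touches, so the rest can be powered down.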


Compilers have attracted significant attention over the past few years because of the increasing demand for embedded systems, time-to-market constraints, efficiency requirements, the lack of skilled people to write extensive assembly code, and the infeasibility of writing assembly code for large applications that span a variety of processing units. To meet this demand, compiler design for embedded systems has become one of the leading topics among researchers. In this paper, we have presented a rather extensive survey of compilers for embedded systems, covering the design issues and two main techniques for solving them: retargetability and multiprocessor compilation.


[1] R. Leupers, “Compiler design issues for embedded processors,” Design & Test of Computers, IEEE, vol. 19, no. 4, pp. 51–58, 2002.

[2] I. Povazan, M. Popovic, M. Djukic, and N. Cetic, “A retargetable C compiler for embedded systems,” in Engineering of Computer Based Systems (ECBS-EERC), 2013 3rd Eastern European Regional Conference on. IEEE, 2013, pp. 48–54.

[3] L. Ghica and N. Tapus, “Optimized retargetable compiler for embedded processors - GCC vs LLVM,” in Intelligent Computer Communication and Processing (ICCP), 2015 IEEE International Conference on. IEEE, 2015, pp. 103–108.

[4] M. Mehrara, T. Jablin, D. Upton, D. August, K. Hazelwood, and S. Mahlke, “Multicore compilation strategies and challenges,” Signal Processing Magazine, IEEE, vol. 26, no. 6, pp. 55–63, 2009.

[5] C. B. Kuan, J. J. Li, C. K. Chen, and J. K. Lee, “C++ compiler supports for embedded multicore DSP systems,” in 2011 40th International Conference on Parallel Processing Workshops, Sept 2011, pp. 214–221.

[6] C.-Y. Lin, C.-B. Kuan, W.-L. Shih, and J. K. Lee, “Compilers for low power with design patterns on embedded multicore systems,” Journal of Signal Processing Systems, vol. 80, no. 3, pp. 277–293, 2014.

[7] J. A. Fisher, P. Faraboschi, and C. Young, Embedded computing: a VLIW approach to architecture, compilers and tools. Elsevier, 2005.

[8] R. Leupers and P. Marwedel, Retargetable compiler technology for embedded systems: tools and applications. Springer Science & Business Media, 2001.

[9] Y. Srikant and P. Shankar, The compiler design handbook: optimizations and machine code generation. CRC Press, 2002.