Yasm, like other assemblers and compilers, is at its heart just a data processing application. It transforms data from one form (ASCII source code) to another (binary object code). Thus, the data structures used to keep track of the internal state of the assembler are the most important things for a coder working on the assembler to understand. This chapter attempts to present reasoned explanations for the many decisions made while designing the most important data structures in the yasm assembler.
The use of “bytecodes” as the basic building block of the assembler was a fundamental requirement of both the goals (see Chapter 1) and the architecture (see Chapter 2) of yasm. A bytecode is essentially nothing more than a single machine instruction or assembler pseudo-instruction stored in an expanded format that keeps track of all the internal state information needed by the assembler to:
Most, if not all, other assemblers accomplish the above goals by re-parsing the source code in multiple passes. Because yasm parses the code only once, bytecodes must store all the information for every parsed instruction and pseudo-instruction. This fundamental difference trades memory space for processing time: the bytecode method requires the entire source file’s content to be held in memory at one time (its content in terms of assembler state, not the actual ASCII source). To minimize memory use, the yasm implementation tries to keep each bytecode as small as possible.
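The parse-once approach described above can be illustrated with a minimal sketch. It is not yasm's actual code: the names `sketch_node`, `sketch_list`, `sketch_append`, `sketch_assign_offsets`, and `sketch_free` are hypothetical, and the record holds only a length and an offset, on the simplifying assumption that each length is already known when parsing finishes. The point is only that a later pass walks the stored records instead of re-reading the source text.

```c
#include <stdlib.h>

/* Hypothetical, heavily simplified stand-in for a bytecode.
 * Field names are illustrative and not taken from yasm. */
typedef struct sketch_node {
    unsigned long len;      /* encoded size in bytes (assumed known after parsing) */
    unsigned long offset;   /* position in the output, filled in by a later pass */
    struct sketch_node *next;
} sketch_node;

typedef struct sketch_list {
    sketch_node *head;
    sketch_node *tail;
} sketch_list;

/* Called by the single parse, once per instruction or pseudo-instruction.
 * Returns the new node, or NULL if allocation fails. */
sketch_node *sketch_append(sketch_list *list, unsigned long len)
{
    sketch_node *n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    n->len = len;
    n->offset = 0;
    n->next = NULL;
    if (list->tail != NULL)
        list->tail->next = n;
    else
        list->head = n;
    list->tail = n;
    return n;
}

/* A later "pass": walks the stored records and assigns offsets.
 * It never looks at the source text again. Returns the total size. */
unsigned long sketch_assign_offsets(sketch_list *list)
{
    unsigned long offset = 0;
    sketch_node *n;
    for (n = list->head; n != NULL; n = n->next) {
        n->offset = offset;
        offset += n->len;
    }
    return offset;
}

/* Releases every stored record and resets the list to empty. */
void sketch_free(sketch_list *list)
{
    sketch_node *n = list->head;
    while (n != NULL) {
        sketch_node *next = n->next;
        free(n);
        n = next;
    }
    list->head = NULL;
    list->tail = NULL;
}
```

Appending records of length 2, 5, and 1 to an empty list and then calling `sketch_assign_offsets` gives them offsets 0, 2, and 7 and returns a total of 8. The memory cost mentioned above is visible here as well: every record stays allocated until `sketch_free` is called.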
To satisfy the above requirements, a good deal of data must be kept in the bytecode data structure. The main bytecode data structure contains such information as:
Additional data describing the actual contents of the bytecode is associated with the above data. There are two broad categories of this data: assembler pseudo-instructions, which are available on all architectures, and architecture-specific instructions. Data values are treated as pseudo-instructions.
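One common way to express this split in C is a tagged union: a tag selects between architecture-independent pseudo-instruction contents and opaque architecture-specific contents. The sketch below illustrates that idea under stated assumptions and is not yasm's actual layout. Every name in it (`sketch_bc_kind`, `sketch_bytecode`, `sketch_bc_len`, `toy_arch_len`) is hypothetical, and the toy "architecture" simply stores a precomputed length.

```c
#include <stddef.h>

/* Hypothetical tag for the two broad categories of bytecode contents. */
typedef enum {
    SKETCH_BC_DATA,   /* pseudo-instruction: raw data values */
    SKETCH_BC_INSN    /* architecture-specific instruction */
} sketch_bc_kind;

/* Hypothetical bytecode: a tag plus category-specific contents. */
typedef struct sketch_bytecode {
    sketch_bc_kind kind;
    union {
        /* Architecture-independent: handled the same way everywhere. */
        struct {
            const unsigned char *bytes;
            size_t count;
        } data;
        /* Architecture-specific: opaque to the generic code, which only
         * calls back into the architecture to ask questions about it. */
        struct {
            void *arch_contents;
            size_t (*calc_len)(const void *arch_contents);
        } insn;
    } u;
} sketch_bytecode;

/* Generic code dispatches on the tag and needs no architecture knowledge. */
size_t sketch_bc_len(const sketch_bytecode *bc)
{
    switch (bc->kind) {
    case SKETCH_BC_DATA:
        return bc->u.data.count;
    case SKETCH_BC_INSN:
        return bc->u.insn.calc_len(bc->u.insn.arch_contents);
    }
    return 0;
}

/* Toy "architecture" for demonstration: its contents are just a size_t
 * holding the instruction's encoded length. */
size_t toy_arch_len(const void *arch_contents)
{
    return *(const size_t *)arch_contents;
}
```

With this shape, a data bytecode holding three bytes reports a length of 3 directly, while an instruction bytecode defers to whatever `calc_len` callback its architecture supplied. Supporting another architecture would then mean supplying new callbacks rather than changing the generic code.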