Summarizing Compiler Design — Notes and bullet points (part 1)

Nidhi Raniyer
3 min read · Mar 13, 2021


A Compiler is a translator from one language, the input or source language, to another language, the output or target language.

A program written in a high-level language and given to a compiler as input is called the source program; the language it is written in is the source language.

A loader is a program that places programs into memory and prepares them for execution.

Preprocessors are normally fairly simple as in the C language, providing primarily the ability to include files and expand macros.

Assembly code is a mnemonic version of machine code in which names, rather than binary values, are used for machine instructions and memory addresses.

An assembler needs to assign memory locations to symbols (called identifiers) and use those numeric addresses in the target machine language it produces.

The conceptually simplest way to resolve identifiers to their addresses is to make two passes over the input. During the first pass, each time a new identifier is encountered, an address is assigned and the pair (identifier, address) is stored in a symbol table. During the second pass, whenever an identifier is encountered, its address is looked up in the symbol table and that value is used in the generated machine instruction.
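
A minimal Python sketch of that two-pass scheme; the one-word instruction format and the "label ends with a colon" convention are invented for this example:

```python
def two_pass_assemble(lines, base=0):
    """Toy two-pass assembler: labels end with ':', one instruction per line."""
    symtab = {}

    # Pass 1: assign an address to every label, one word per instruction.
    addr = base
    for line in lines:
        if line.endswith(":"):            # a label definition
            symtab[line[:-1]] = addr
        else:
            addr += 1                     # assume fixed-size instructions

    # Pass 2: replace identifier operands with their numeric addresses.
    out, addr = [], base
    for line in lines:
        if line.endswith(":"):
            continue
        op, *operands = line.split()
        resolved = [str(symtab.get(x, x)) for x in operands]
        out.append((addr, op, resolved))
        addr += 1
    return out

# 'loop' is resolved to address 0 in the jump instruction.
print(two_pass_assemble(["loop:", "load x", "jmp loop"]))
```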

The Structure of a Compiler

1. Lexical Analysis (or Scanning)

2. Syntax Analysis (or Parsing)

3. Semantic Analysis

4. Intermediate code generation

5. Code optimization

6. Code generation

7. Symbol-Table Management

8. The Grouping of Phases into Passes

The first phase of compiler is Lexical Analysis. This is also known as linear analysis in which the stream of characters making up the source program is read from left-to-right and grouped into tokens that are sequences of characters having a collective meaning.

The main task of a syntax analyser (parser) is to take the token sequence from the scanner and verify that it forms a syntactically correct program. Its secondary tasks include processing declarations and setting up symbol-table information accordingly, in preparation for semantic analysis. It also constructs a syntax tree in preparation for intermediate code generation.

A language denoted by a regular expression is said to be a regular set.

A Sentinel is a special character that cannot be part of the source program; normally, ‘EOF’ is used as the sentinel. It speeds up the lexical analyser by letting the end-of-buffer check be folded into the ordinary character tests.

Error-recovery actions in a lexical analyser (a code sketch of these repairs follows the list):

  • Deleting an extraneous character
  • Inserting a missing character
  • Replacing an incorrect character by a correct character
  • Transposing two adjacent characters
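
A rough sketch of how those four single-edit repairs might be tried on an unrecognized lexeme; the token patterns and alphabet below are assumptions for the example, not part of any real scanner:

```python
import re
import string

# Hypothetical token patterns, just for the example.
PATTERNS = [re.compile(p) for p in (r"[A-Za-z][A-Za-z0-9]*", r"[0-9]+")]

def matches(s):
    return s != "" and any(p.fullmatch(s) for p in PATTERNS)

def repair(lexeme, alphabet=string.ascii_letters + string.digits):
    """Try each single-edit recovery action; return the first result that matches."""
    n = len(lexeme)
    candidates = []
    for i in range(n):
        candidates.append(lexeme[:i] + lexeme[i+1:])          # delete a character
    for i in range(n + 1):
        for c in alphabet:
            candidates.append(lexeme[:i] + c + lexeme[i:])    # insert a character
    for i in range(n):
        for c in alphabet:
            candidates.append(lexeme[:i] + c + lexeme[i+1:])  # replace a character
    for i in range(n - 1):
        candidates.append(lexeme[:i] + lexeme[i+1] + lexeme[i] + lexeme[i+2:])  # transpose
    return next((c for c in candidates if matches(c)), None)

print(repair("9abc"))   # deleting the stray '9' yields the identifier 'abc'
```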

A recognizer is a machine that accepts the strings belonging to a certain language: if an input string is a valid string of the language, the machine accepts it; otherwise it rejects it. The language accepted by the machine is exactly this set of strings.
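
For instance, here is a tiny deterministic finite automaton recognizing binary strings with an even number of 1s; the state names and transition table are made up for the illustration:

```python
# Transition table for a DFA over {0,1} tracking the parity of 1s seen so far.
TRANSITIONS = {("even", "0"): "even", ("even", "1"): "odd",
               ("odd", "0"): "odd",  ("odd", "1"): "even"}

def accepts(s, start="even", accepting=("even",)):
    """Run the DFA on s; accept iff it halts in an accepting state."""
    state = start
    for ch in s:
        if (state, ch) not in TRANSITIONS:   # symbol outside the alphabet
            return False
        state = TRANSITIONS[(state, ch)]
    return state in accepting

print(accepts("1001"), accepts("10"))   # True False (two 1s vs. one)
```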

A compiler produces a target program, whereas an interpreter directly performs the operations implied by the source program.

A Buffer Pair is a specialized buffering technique used to reduce the overhead of processing an input character. The buffer is divided into two N-character halves, and two pointers (lexeme-begin and forward) are maintained. It is used when the lexical analyzer needs to look ahead several characters beyond the lexeme for a pattern before a match is announced.
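
A toy Python rendering of the idea; real scanners work on raw byte buffers, and the half-size N and sentinel character here are illustrative:

```python
import io

EOF = "\0"   # sentinel: a character that cannot appear in the source program
N = 8        # half-buffer size (tiny here so the reloads actually happen)

class BufferPair:
    """Two N-character halves, each ending in a sentinel; reload one half at a time."""
    def __init__(self, stream):
        self.stream = stream
        self.buf = [EOF] * (2 * N + 2)   # positions N and 2N+1 hold sentinels
        self._fill(0)                    # load the first half
        self.forward = 0

    def _fill(self, start):
        data = self.stream.read(N)
        for i, ch in enumerate(data):
            self.buf[start + i] = ch
        self.buf[start + len(data)] = EOF    # sentinel marks the end of this half

    def next_char(self):
        ch = self.buf[self.forward]
        if ch == EOF:
            if self.forward == N:            # hit end of first half: reload second
                self._fill(N + 1)
                self.forward = N + 1
            elif self.forward == 2 * N + 1:  # hit end of second half: reload first
                self._fill(0)
                self.forward = 0
            else:
                return EOF                   # sentinel mid-half: real end of input
            ch = self.buf[self.forward]
        self.forward += 1
        return ch

bp = BufferPair(io.StringIO("count = count + 1;"))
print("".join(iter(bp.next_char, EOF)))      # prints the source back
```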

Tokens are sequences of characters that have a collective meaning.

There is a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a Pattern, associated with the token.

Lexeme is a sequence of characters in the source program that is matched by the pattern for a token.
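
To make the three terms concrete, here is a minimal scanner sketch; the token names and patterns are invented for the example. Each pattern names a token, and the matched text is the lexeme:

```python
import re

# Each token has a name and a pattern; the matched text is the lexeme.
TOKEN_SPEC = [
    ("NUMBER", r"[0-9]+"),
    ("ID",     r"[A-Za-z][A-Za-z0-9]*"),
    ("ASSIGN", r"="),
    ("PLUS",   r"\+"),
    ("SKIP",   r"[ \t]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokens(source):
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())   # (token, lexeme)

print(list(tokens("count = count + 1")))
# [('ID', 'count'), ('ASSIGN', '='), ('ID', 'count'), ('PLUS', '+'), ('NUMBER', '1')]
```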

Operations on languages:

  • Union
    L U M ={s | s is in L or s is in M}
  • Concatenation
    LM ={st | s is in L and t is in M}
  • Kleene Closure
    L* (zero or more concatenations of L)
  • Positive Closure
    L+ (one or more concatenations of L)

Regular expression for an identifier: letter (letter | digit)*
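
A small worked example using finite languages, so each operation can be computed directly (the sets L and M are chosen arbitrarily); the last two lines connect the operations back to the identifier pattern:

```python
import re
from itertools import product

L, M = {"a", "b"}, {"0", "1"}

print(L | M)                                   # union: {'a', 'b', '0', '1'}
print({s + t for s, t in product(L, M)})       # concatenation LM

def power(lang, n):
    """n-fold concatenation of a language with itself (L^0 = {''})."""
    out = {""}
    for _ in range(n):
        out = {s + t for s, t in product(out, lang)}
    return out

# Kleene closure truncated at length 3: L* = L^0 U L^1 U L^2 U ...
print(sorted(set().union(*(power(L, i) for i in range(4)))))

# The identifier pattern letter (letter | digit)* combines all three ideas:
ident = re.compile(r"[A-Za-z][A-Za-z0-9]*")
print(bool(ident.fullmatch("x1")), bool(ident.fullmatch("1x")))   # True False
```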

Hierarchical analysis, or Parsing, is the phase in which tokens are grouped hierarchically into nested collections with collective meaning.
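
For instance, a tiny recursive-descent parser groups a flat token list into a nested tree; the grammar here (sums of numbers with parentheses) is a made-up example:

```python
# Hypothetical grammar: expr -> term ('+' term)* ; term -> NUMBER | '(' expr ')'
def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expr():
        nonlocal pos
        node = term()
        while peek() == "+":
            pos += 1
            node = ("+", node, term())   # nested collections with meaning
        return node

    def term():
        nonlocal pos
        tok = peek()
        if tok == "(":
            pos += 1
            node = expr()
            assert peek() == ")", "expected ')'"
            pos += 1
            return node
        pos += 1
        return ("num", tok)

    return expr()

print(parse(["1", "+", "(", "2", "+", "3", ")"]))
# ('+', ('num', '1'), ('+', ('num', '2'), ('num', '3')))
```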

In semantic analysis, certain checks are performed to ensure that components of a program fit together meaningfully. It mainly performs type checking.
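
A minimal sketch of such a check over a toy syntax tree; the node shapes and type rules below are assumptions for the example:

```python
def type_of(node, env):
    """Infer a node's type; raise TypeError when components don't fit together."""
    kind = node[0]
    if kind == "num":
        return "int"
    if kind == "str":
        return "string"
    if kind == "var":
        return env[node[1]]                      # declared type from the symbol table
    if kind == "+":
        left, right = type_of(node[1], env), type_of(node[2], env)
        if left != right:
            raise TypeError(f"cannot add {left} and {right}")
        return left
    raise ValueError(f"unknown node kind: {kind}")

env = {"x": "int"}
print(type_of(("+", ("var", "x"), ("num", 0)), env))     # int
try:
    type_of(("+", ("var", "x"), ("str", "hi")), env)     # int + string
except TypeError as e:
    print(e)                                             # cannot add int and string
```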
