Introduction to Lexical Analyzer

  • Lexical Analyzer: the first phase of a compiler; it reads the character stream of the source program, groups the characters into lexemes, and passes a token for each lexeme on to the parser
  • Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one
  • Token: a pair consisting of a token name and an optional attribute value
    • The token name is an abstract symbol representing a kind of lexical unit (e.g. a keyword or identifier)
  • Pattern: a description of the form that the lexemes of a token may take
  • Lexeme: a sequence of characters in the source program that matches the pattern for a token
  • Token Classes (illustrated in the sketch after this list):
    • Keywords
    • Comparison operators
    • Identifiers
    • Constants
    • Punctuation symbols
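
To make the token/pattern/lexeme distinction concrete, here is a minimal scanner sketch in Python (not part of the original notes). The token names and patterns are illustrative assumptions, one per class above; Python's `re` module does the pattern matching.

```python
import re

# Illustrative token classes and patterns (assumed for this sketch).
TOKEN_SPEC = [
    ("KEYWORD", r"\b(?:if|then|else|while)\b"),  # keywords
    ("COMP",    r"<=|>=|==|!=|<|>"),             # comparison operators
    ("ID",      r"[A-Za-z_][A-Za-z0-9_]*"),      # identifiers
    ("NUM",     r"\d+"),                         # constants
    ("PUNCT",   r"[();,]"),                      # punctuation symbols
    ("WS",      r"\s+"),                         # compacted, then discarded
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(source):
    """Yield <token-name, attribute-value> pairs, one per lexeme."""
    for m in MASTER.finditer(source):
        if m.lastgroup != "WS":          # whitespace never becomes a token
            yield (m.lastgroup, m.group())

print(list(tokenize("while (count <= 60)")))
# [('KEYWORD', 'while'), ('PUNCT', '('), ('ID', 'count'),
#  ('COMP', '<='), ('NUM', '60'), ('PUNCT', ')')]
```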
  • The simplest recovery strategy is “panic mode” recovery: we delete successive characters from the remaining input until the lexical analyzer can find a well-formed token at the beginning of what input is left (a sketch follows the list below)
  • Other Error Recovery Techniques:
    • Delete one character from the remaining input
    • Insert a missing character into the remaining input
    • Replace a character by another character
    • Transpose two adjacent characters
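
A minimal sketch of panic-mode recovery, reusing the `MASTER` pattern from the scanner sketch above (the function name and its exact behavior are assumptions for illustration):

```python
def panic_mode(source, pos):
    """Delete successive characters from the remaining input until
    something well-formed begins at the current position."""
    while pos < len(source):
        if MASTER.match(source, pos):    # a well-formed token starts here
            return pos                   # resume normal scanning
        pos += 1                         # delete one more character
    return pos                           # hit end of input while recovering

print(panic_mode("@#$ count", 0))        # 3: scanning resumes at the space
```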
  • Input Buffering:
    • The process of reading the source program from file into a buffer in large blocks, rather than one character at a time
    • Two pointers into the input are maintained (see the sketch below):
      • Pointer lexemeBegin: marks the beginning of the current lexeme
      • Pointer forward: scans ahead until a pattern match is found
    • Sentinel Character:
      • A special character that cannot appear in the source program (e.g. eof), placed at the end of the buffer so that the end-of-buffer test is folded into the character test
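
A sketch of the two-pointer scan, simulating the pointers with list indices in Python (the names and the whitespace-delimiter rule are assumptions; real scanners use buffer pairs with a sentinel at the end of each half):

```python
EOF = "\0"   # sentinel: assumed not to appear in the source program

def next_lexeme(buffer, lexeme_begin):
    """Advance `forward` from `lexeme_begin` until whitespace or the
    sentinel; the sentinel guarantees the loop stops without a separate
    end-of-buffer bounds check on every step."""
    forward = lexeme_begin
    while buffer[forward] != EOF and not buffer[forward].isspace():
        forward += 1
    return "".join(buffer[lexeme_begin:forward]), forward

buffer = list("count = 60") + [EOF]      # sentinel appended at buffer end
print(next_lexeme(buffer, 0))            # ('count', 5)
```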
  • Alphabet: any finite set of symbols, such as letters, digits, and punctuation
  • String over an alphabet: a finite sequence of symbols drawn from that alphabet
  • Language: any countable set of strings over some fixed alphabet
  • Regular Expression Precedence (highest to lowest; demonstrated below):
    • * (closure), then concatenation, then | (union)
  • Extensions for Regular Expressions: ? (zero or one), + (one or more), character classes [ ]
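
A short demonstration of the precedence rule and the extensions, using Python's `re` module (the example patterns are assumptions):

```python
import re

# * binds tighter than concatenation, which binds tighter than |,
# so a|bc* is read as a | (b(c*)), not (a|b)(c*).
print(bool(re.fullmatch(r"a|bc*", "bccc")))   # True:  matches b(c*)
print(bool(re.fullmatch(r"a|bc*", "ac")))     # False: it is not (a|b)(c*)

# The extensions are shorthands over the basic operators:
#   r?     zero or one occurrence of r
#   r+     one or more, i.e. rr*
#   [abc]  character class, i.e. a|b|c
print(bool(re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", "rate_1")))  # True
print(bool(re.fullmatch(r"-?[0-9]+", "-60")))                   # True
```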

Introduction to Compiler Theory

  • Compiler: a program that can read a program in one language – the source language – and translate it into an equivalent program in another language – the target language
  • Interpreter: directly executes the operations specified in the source program on inputs supplied by the user
  • A compiled program is usually faster, but an interpreter can give better error diagnostics
  • Hybrid Compiler: combines compilation and interpretation, e.g. Java source is compiled into bytecode, which a virtual machine then interprets on the target machine
  • Developing an EXE File:
    • Source program -> Preprocessor -> modified source program -> Compiler -> target assembly program -> Assembler -> relocatable machine code -> Linker/Loader -> target machine code
  • Structure of Compiler:
    • Analysis Part (Front End):
      • Breaks up the source program into constituent pieces
      • Then, imposes a grammatical structure on them
      • After that, it uses this structure to create an intermediate representation of the source program
      • It also collects information about the source program and stores it in the symbol table
    • Synthesis Part (Back End):
      • Constructs the desired target program from the intermediate representation and the information in the symbol table
  • Compiler Phases, in order: lexical analysis -> syntax analysis -> semantic analysis -> intermediate code generation -> code optimization -> code generation, all sharing the symbol table
  • Lexical Analysis:
    • The lexical analyzer reads the stream of characters making up the source program and groups the characters into meaningful sequences called lexemes
    • For each lexeme, the lexical analyzer produces as output a token of the form <token-name, attribute-value>
  • Syntax Analyzer (Parser):
    • Creates a tree-like intermediate representation that depicts the grammatical structure of the token stream
  • Semantic Analyzer:
    • Checks the source program for semantic consistency with the language definition
      • Example: type checking of operators and their operands
  • Intermediate Code Generator:
    • Generates an explicit low-level or machine-like intermediate representation, which we can think of as a program for an abstract machine (see the sketch after this list)
    • Properties:
      • Easy to produce
      • Easy to translate into the target machine
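
A sketch of intermediate code generation for the classic assignment position = initial + rate * 60 (the tuple-based AST encoding and the three-address form of the output are assumptions for illustration):

```python
# AST for: position = initial + rate * 60
ast = ("=", "position", ("+", "initial", ("*", "rate", "60")))

counter = 0

def gen(node):
    """Walk the AST bottom-up, emitting one three-address instruction
    per operator and returning the name holding the node's value."""
    global counter
    if isinstance(node, str):
        return node                       # identifier or constant
    op, *args = node
    if op == "=":
        target, expr = args
        print(f"{target} = {gen(expr)}")
        return target
    left, right = gen(args[0]), gen(args[1])
    counter += 1
    temp = f"t{counter}"                  # fresh temporary name
    print(f"{temp} = {left} {op} {right}")
    return temp

gen(ast)
# t1 = rate * 60
# t2 = initial + t1
# position = t2
```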
  • Code Optimization:
    • Attempts to improve the intermediate code so that better target code results, e.g. faster, shorter, or lower-power code
  • Compiler Construction Tools:
    • Parser Generators: automatically produce syntax analyzers from a grammatical description of the language
    • Scanner Generators: produce lexical analyzers from a regular-expression description of the tokens of the language
    • Syntax-Directed Translation Engines:
      • Produce collections of routines for walking a parse tree and generating intermediate code
    • Code Generator Generators:
      • Produce a code generator from a collection of rules for translating each operation of the intermediate language into the machine language for a target machine
    • Data-Flow Analysis Engines:
      • Facilitate the gathering of information about how values are transmitted from one part of a program to each other part; data-flow analysis is a key part of code optimization
    • Compiler-Construction Toolkits:
      • Provide an integrated set of routines for constructing various phases of a compiler
  • Finite-state machines and regular expressions are models useful for describing the lexical units of programs (keywords, identifiers, …) and for describing the algorithms used by the compiler to recognize those units
  • Context-free grammars are used to describe the syntactic structure of programming languages, such as the nesting of parentheses or control constructs (see the example below)
  • Trees for representing the structure of programs and their translation into object code
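
As an example of a context-free grammar, the classic grammar for balanced parentheses is S -> ( S ) S | ε. A minimal recursive recognizer for it (a sketch; the function names are assumptions):

```python
def match_S(s, i=0):
    """Try to derive a prefix of s[i:] from S; return the index just
    past the derived prefix."""
    if i < len(s) and s[i] == "(":
        j = match_S(s, i + 1)            # S -> ( S ...
        if j < len(s) and s[j] == ")":
            return match_S(s, j + 1)     #        ... ) S
    return i                             # S -> ε

def balanced(s):
    return match_S(s) == len(s)

print(balanced("(()())"))  # True
print(balanced("(()"))     # False
```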
  • Compiler optimizations must meet the following design objectives:
    • The optimization must be correct, that is, preserve the meaning of the compiled program
    • The optimization must improve the performance
    • The compilation time must be kept reasonable
    • The engineering effort required must be manageable
  • Data-flow optimizations have been developed to analyze the flow of data through the program and remove redundancies across constructs
  • Optimizations for Computer Architectures:
    • Parallelism:
      • The hardware scheduler can reorder instructions to increase the parallelism in the program
      • Compilers can also exploit a processor’s parallel mechanisms, such as vector instructions that operate on many data elements at once
    • Memory Hierarchies:
      • Cache-management policies implemented by hardware have been found to be ineffective in some cases, especially for scientific code with large data structures
      • It is possible to improve the effectiveness of the memory hierarchy by (see the sketch below):
        • Changing the layout of the data
        • Changing the layout of the code
        • Changing the order of the instructions that access the data
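
A sketch of the last idea, changing the order of data accesses (the matrix size and function names are assumptions): summing a matrix row by row visits memory in the order it is laid out, while the interchanged loops stride across rows. In a compiled language the interchange directly changes cache behavior; in CPython the measured gap also includes interpreter overhead, so treat the timing as illustrative only.

```python
import time

N = 2000
matrix = [[1.0] * N for _ in range(N)]   # each inner list is one row

def sum_row_major(m):
    """Inner loop walks along a row: accesses follow the data layout."""
    total = 0.0
    for i in range(N):
        row = m[i]
        for j in range(N):
            total += row[j]
    return total

def sum_column_major(m):
    """Interchanged loops: each step jumps to a different row."""
    total = 0.0
    for j in range(N):
        for i in range(N):
            total += m[i][j]
    return total

for f in (sum_row_major, sum_column_major):
    start = time.perf_counter()
    f(matrix)
    print(f.__name__, f"{time.perf_counter() - start:.3f}s")
```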
  • Program Translations:
    • Binary Translation
    • Hardware Synthesis
    • Database Query Interpreters
    • Compiled Simulation