Introduction to Lexical Analyzer

  • Lexical Analyzer: the first phase of a compiler; it reads the character stream of the source program, groups the characters into lexemes, and passes a token for each lexeme on to the parser
  • Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one
  • Token: a pair consisting of a token name and an optional attribute value
    • The token name is an abstract symbol representing a kind of lexical unit (e.g. a keyword or identifier)
  • Pattern: a description of the form that the lexemes of a token may take
  • Lexeme: a sequence of characters in the source program that matches the pattern for a token
  • Token Classes (illustrated in the sketch after this list):
    • Keywords
    • Comparison operators
    • Identifiers
    • Constants
    • Punctuation symbols
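
To make the token/pattern/lexeme distinction concrete, here is a minimal scanner sketch in Python (not part of the original notes). The token names and patterns are illustrative assumptions, one per class above; Python's `re` module does the pattern matching.

```python
import re

# Illustrative token classes and patterns (assumed for this sketch).
TOKEN_SPEC = [
    ("KEYWORD", r"\b(?:if|then|else|while)\b"),  # keywords
    ("COMP",    r"<=|>=|==|!=|<|>"),             # comparison operators
    ("ID",      r"[A-Za-z_][A-Za-z0-9_]*"),      # identifiers
    ("NUM",     r"\d+"),                         # constants
    ("PUNCT",   r"[();,]"),                      # punctuation symbols
    ("WS",      r"\s+"),                         # compacted, then discarded
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(source):
    """Yield <token-name, attribute-value> pairs, one per lexeme."""
    for m in MASTER.finditer(source):
        if m.lastgroup != "WS":          # whitespace never becomes a token
            yield (m.lastgroup, m.group())

print(list(tokenize("while (count <= 60)")))
# [('KEYWORD', 'while'), ('PUNCT', '('), ('ID', 'count'),
#  ('COMP', '<='), ('NUM', '60'), ('PUNCT', ')')]
```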
  • The simplest recovery strategy is “panic mode” recovery: we delete successive characters from the remaining input until the lexical analyzer can find a well-formed token at the beginning of what input is left (a sketch follows the list below)
  • Other Error Recovery Techniques:
    • Delete one character from the remaining input
    • Insert a missing character into the remaining input
    • Replace a character by another character
    • Transpose two adjacent characters
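
A minimal sketch of panic-mode recovery, reusing the `MASTER` pattern from the scanner sketch above (the function name and its exact behavior are assumptions for illustration):

```python
def panic_mode(source, pos):
    """Delete successive characters from the remaining input until
    something well-formed begins at the current position."""
    while pos < len(source):
        if MASTER.match(source, pos):    # a well-formed token starts here
            return pos                   # resume normal scanning
        pos += 1                         # delete one more character
    return pos                           # hit end of input while recovering

print(panic_mode("@#$ count", 0))        # 3: scanning resumes at the space
```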
  • Input Buffering:
    • The process of reading the source program from file into a buffer in large blocks, rather than one character at a time
    • Two pointers into the input are maintained (see the sketch below):
      • Pointer lexemeBegin: marks the beginning of the current lexeme
      • Pointer forward: scans ahead until a pattern match is found
    • Sentinel Character:
      • A special character that cannot appear in the source program (e.g. eof), placed at the end of the buffer so that the end-of-buffer test is folded into the character test
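
A sketch of the two-pointer scan, simulating the pointers with list indices in Python (the names and the whitespace-delimiter rule are assumptions; real scanners use buffer pairs with a sentinel at the end of each half):

```python
EOF = "\0"   # sentinel: assumed not to appear in the source program

def next_lexeme(buffer, lexeme_begin):
    """Advance `forward` from `lexeme_begin` until whitespace or the
    sentinel; the sentinel guarantees the loop stops without a separate
    end-of-buffer bounds check on every step."""
    forward = lexeme_begin
    while buffer[forward] != EOF and not buffer[forward].isspace():
        forward += 1
    return "".join(buffer[lexeme_begin:forward]), forward

buffer = list("count = 60") + [EOF]      # sentinel appended at buffer end
print(next_lexeme(buffer, 0))            # ('count', 5)
```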
  • Alphabet: any finite set of symbols, such as letters, digits, and punctuation
  • String over an alphabet: a finite sequence of symbols drawn from that alphabet
  • Language: any countable set of strings over some fixed alphabet
  • Regular Expression Precedence (highest to lowest; demonstrated below):
    • * (closure), then concatenation, then | (union)
  • Extensions for Regular Expressions: ? (zero or one), + (one or more), character classes [ ]
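
A short demonstration of the precedence rule and the extensions, using Python's `re` module (the example patterns are assumptions):

```python
import re

# * binds tighter than concatenation, which binds tighter than |,
# so a|bc* is read as a | (b(c*)), not (a|b)(c*).
print(bool(re.fullmatch(r"a|bc*", "bccc")))   # True:  matches b(c*)
print(bool(re.fullmatch(r"a|bc*", "ac")))     # False: it is not (a|b)(c*)

# The extensions are shorthands over the basic operators:
#   r?     zero or one occurrence of r
#   r+     one or more, i.e. rr*
#   [abc]  character class, i.e. a|b|c
print(bool(re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", "rate_1")))  # True
print(bool(re.fullmatch(r"-?[0-9]+", "-60")))                   # True
```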

Introduction to Compiler Theory

  • Compiler: a program that can read a program in one language – the source language – and translate it into an equivalent program in another language – the target language
  • Interpreter: directly executes the operations specified in the source program on inputs supplied by the user
  • A compiled program is usually faster, but an interpreter can give better error diagnostics
  • Hybrid Compiler: combines compilation and interpretation, e.g. Java source is compiled into bytecode, which a virtual machine then interprets on the target machine
  • Developing an EXE File:
    • Source program -> Preprocessor -> modified source program -> Compiler -> target assembly program -> Assembler -> relocatable machine code -> Linker/Loader -> target machine code
  • Structure of Compiler:
    • Analysis Part (Front End):
      • Breaks up the source program into constituent pieces
      • Then, imposes a grammatical structure on them
      • After that, it uses this structure to create an intermediate representation of the source program
      • It also collects information about the source program and stores it in the symbol table
    • Synthesis Part (Back End):
      • Constructs the desired target program from the intermediate representation and the information in the symbol table
  • Compiler Phases, in order: lexical analysis -> syntax analysis -> semantic analysis -> intermediate code generation -> code optimization -> code generation, all sharing the symbol table
  • Lexical Analysis:
    • The lexical analyzer reads the stream of characters making up the source program and groups the characters into meaningful sequences called lexemes
    • For each lexeme, the lexical analyzer produces as output a token of the form <token-name, attribute-value>
  • Syntax Analyzer (Parser):
    • Creates a tree-like intermediate representation that depicts the grammatical structure of the token stream
  • Semantic Analyzer:
    • Checks the source program for semantic consistency with the language definition
      • Example: type checking of operators and their operands
  • Intermediate Code Generator:
    • Generates an explicit low-level or machine-like intermediate representation, which we can think of as a program for an abstract machine (see the sketch after this list)
    • Properties:
      • Easy to produce
      • Easy to translate into the target machine
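
A sketch of intermediate code generation for the classic assignment position = initial + rate * 60 (the tuple-based AST encoding and the three-address form of the output are assumptions for illustration):

```python
# AST for: position = initial + rate * 60
ast = ("=", "position", ("+", "initial", ("*", "rate", "60")))

counter = 0

def gen(node):
    """Walk the AST bottom-up, emitting one three-address instruction
    per operator and returning the name holding the node's value."""
    global counter
    if isinstance(node, str):
        return node                       # identifier or constant
    op, *args = node
    if op == "=":
        target, expr = args
        print(f"{target} = {gen(expr)}")
        return target
    left, right = gen(args[0]), gen(args[1])
    counter += 1
    temp = f"t{counter}"                  # fresh temporary name
    print(f"{temp} = {left} {op} {right}")
    return temp

gen(ast)
# t1 = rate * 60
# t2 = initial + t1
# position = t2
```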
  • Code Optimization:
    • Attempts to improve the intermediate code so that better target code results, e.g. faster, shorter, or lower-power code
  • Compiler Construction Tools:
    • Parser Generators: automatically produce syntax analyzers from a grammatical description of the language
    • Scanner Generators: produce lexical analyzers from a regular-expression description of the tokens of the language
    • Syntax-Directed Translation Engines:
      • Produce collections of routines for walking a parse tree and generating intermediate code
    • Code Generator Generators:
      • Produce a code generator from a collection of rules for translating each operation of the intermediate language into the machine language for a target machine
    • Data-Flow Analysis Engines:
      • Facilitate the gathering of information about how values are transmitted from one part of a program to each other part; data-flow analysis is a key part of code optimization
    • Compiler-Construction Toolkits:
      • Provide an integrated set of routines for constructing various phases of a compiler
  • Finite-state machines and regular expressions are models useful for describing the lexical units of programs (keywords, identifiers, …) and for describing the algorithms used by the compiler to recognize those units
  • Context-free grammars are used to describe the syntactic structure of programming languages, such as the nesting of parentheses or control constructs (see the example below)
  • Trees for representing the structure of programs and their translation into object code
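
As an example of a context-free grammar, the classic grammar for balanced parentheses is S -> ( S ) S | ε. A minimal recursive recognizer for it (a sketch; the function names are assumptions):

```python
def match_S(s, i=0):
    """Try to derive a prefix of s[i:] from S; return the index just
    past the derived prefix."""
    if i < len(s) and s[i] == "(":
        j = match_S(s, i + 1)            # S -> ( S ...
        if j < len(s) and s[j] == ")":
            return match_S(s, j + 1)     #        ... ) S
    return i                             # S -> ε

def balanced(s):
    return match_S(s) == len(s)

print(balanced("(()())"))  # True
print(balanced("(()"))     # False
```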
  • Compiler optimizations must meet the following design objectives:
    • The optimization must be correct, that is, preserve the meaning of the compiled program
    • The optimization must improve the performance
    • The compilation time must be kept reasonable
    • The engineering effort required must be manageable
  • Data-flow optimizations have been developed to analyze the flow of data through the program and remove redundancies across constructs
  • Optimizations for Computer Architectures:
    • Parallelism:
      • The hardware scheduler can reorder instructions to increase the parallelism in the program
      • Compilers can also exploit a processor’s parallel mechanisms, such as vector instructions that operate on many data elements at once
    • Memory Hierarchies:
      • Cache-management policies implemented by hardware have been found to be ineffective in some cases, especially for scientific code with large data structures
      • It is possible to improve the effectiveness of the memory hierarchy by (see the sketch below):
        • Changing the layout of the data
        • Changing the layout of the code
        • Changing the order of the instructions that access the data
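
A sketch of the last idea, changing the order of data accesses (the matrix size and function names are assumptions): summing a matrix row by row visits memory in the order it is laid out, while the interchanged loops stride across rows. In a compiled language the interchange directly changes cache behavior; in CPython the measured gap also includes interpreter overhead, so treat the timing as illustrative only.

```python
import time

N = 2000
matrix = [[1.0] * N for _ in range(N)]   # each inner list is one row

def sum_row_major(m):
    """Inner loop walks along a row: accesses follow the data layout."""
    total = 0.0
    for i in range(N):
        row = m[i]
        for j in range(N):
            total += row[j]
    return total

def sum_column_major(m):
    """Interchanged loops: each step jumps to a different row."""
    total = 0.0
    for j in range(N):
        for i in range(N):
            total += m[i][j]
    return total

for f in (sum_row_major, sum_column_major):
    start = time.perf_counter()
    f(matrix)
    print(f.__name__, f"{time.perf_counter() - start:.3f}s")
```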
  • Program Translations:
    • Binary Translation
    • Hardware Synthesis
    • Database Query Interpreters
    • Compiled Simulation