Figuring out how to write a Lexer

Intro

Below is what I typed up while figuring out how to design my lexer. There was a lot more work that went into it than what's capture here. Namely, a very lenghty process of considering an input string & exactly what should happen on every single loop, up to 19 loops (1 character from the input string added on each loop). This whole process was exhausting, exciting, and super challenging.

I didn't really neeeed to make a lexer, because there is existing software that generates ASTs from source code... But I did anyway. I started with a very very simple version (long before this doc was written), thinking... I'll abandon it if it takes more than a couple hours to build... But then I've gone over several iterations as this project has become interesting to me.

Now, I have a functional lexer, and I am working on writing Grammars. PhpGrammar will come first. Currently, I have a partial Json Grammar I used when developing the newest version of the lexer, but it misses several things that JSON allows, because I don't care to develop that grammar further. I'm sure I'll need to add some new features while I'm writing the PHPGrammar, too.

Anyway, nothing below is edited from when I originally wrote it. Hope its interesting. Thanks for reading!

There's still more work before its prime time, but you can checkout my lexer on Gitlab at https://gitlab.com/taeluf/php/lexer

My Process

My goal for a lexer is to proccess a string and produce an AST filled with information about the string. This can be done with massive regex-matching on the full string or matching on chunks of the string, one at a time. The chunks of the string could reasonably be line-by-line. Or each chunk can be a single character. Or a buffer can be maintained and have one character (or line) added at a time, then the buffer is checked.

When a chunk (buffer or character or line) is matched by a directive, additional processing is performed. This processing can be done by the lexer by reading the directive. It can be done by a method the directive is mapped to. Or a combination can be used where the lexer does some work automatically based upon declarations AND a custom method can do additional, more complex work than what is avaialable declaritvely.

Now, how directives can be defined and processed has many options. A single string or regex match could invoke processing of the current chunk, allowing the chunk to be analyzed and put into ASTs as needed.

Whether a directive should be considered on the current chunk can also be determined by many approaches. Is the lexer in the state required by the directive? Does a previous directive require this one to be considered? Does the previous sequence allow this directive? Does the following sequence of characters indicate an end to this directive?

And I've started thinking about sequences. A sequence of characters is a string that can be individually evaluated for AST purposes. For example /** comment */namespace \Tlf\Lexer; contains 5 sequences. #1 is the docblock-style comment. #2 is namespace. #3 is . #4 is \Tlf\Lexer. #5 is ;

The docblock sequence may define the following sequence or a sequence before it. For example in <?php\n/** docblock**/\nnamespace Cats;, the docblock describes the Code Block. Add an additional \n after <?php and now it describes namespace Cats;
namespace indicates the next thing we allow & expect. We can allow whitespace, comments, and docblocks. We expect a namespace name in the regex /^[a-zA-Z_\\]+$/ but we can't evaulate this until a semicolon is reached or we'll only have a partial namespace name.
- This is further complicated by namespace \Tlf/**bogus*/\/*comment. Valid segments of the namespace name are ([a-zA-Z0-9_]+|\\). These valid segments need all to be joined while invalid but allowed segments are removed. Invalid allowed statements are docblocks, comments, and whitespace
- So when I encounter namespace I begin seeking the namespace name. The namespace name is built from a series of valid segments which can be interrupted by certain other statements.
- So more accurately, ^[a-zA-Z0-9_]+$ is a valid segment and must be followed by a divider, a comment, whitespace, or a terminator (semi-colon). Anything allowed by namespace is allowed after this segment, except for another one of these segments.

Let's start again & break down the sequences for /**comment*/namespace\ Tlf/*interrupt*/\Lexer /*okay*/ ; NOTE: The \ BEFORE Tlf is not actually allowed for ns declaration & has a different functionality & is used for calling within the current namespace..

Terms

separator: docblock, whitespace, or a comment
terminator: a semi-colon
a-z regex: ^[0-9_a-zA-Z]+$ <- this can ONLY be evaluated when the next breaks the regex

/**comment*/ docblock
- Expect whatever was already expected by the previous scope/directive. (i.e. whatever is expected after <?php)
namespace begin looking for namespace name.
- Expect \ or a separator
  - A separator is a docblock, whitespace, or a comment
  - ^[0-9_a-zA-Z]+$ will be expected after a separator....
\ Begin namespace name & append this char to it
- Expect a separator or the a-z regex
whitespace / separator
- Expect a separator or the a-z regex
Tlf Append to namespace name
- Expect a separator, terminator (;), or \
/*interrupt*/ docblock
- Expect a separator, terminator, or \
\ - append to namespace name
- Expect a separator or a-z regex
Lexer - append to namespace name
- Expect a separator, terminator, or \
whitespace
- Same expectations as #7
/*okay*/ docblock
- Same expectations as #8
whitespace
- Same expectations as #9
; terminator
- Store the namespace name
- pop state / directive (we are no longer looking for a namespace name, so the last set of expectations need be removed)

What sets of expectations are there above?

Expect \ or a separator
Expect a separator or a-z regex
Expect a separator, a terminator, or a \

Every individual thing I can expect is:

\
separator (whitespace, comment, docblock)
a-z regex
;

What are my expectations BEFORE namespace? After ;?

Expectations before come from the <?php directive.
After also comes from the <?php directive.

How else can namespace be limited?

namespace not allowed if there has been another statement before it

An a-z regex also CANNOT be a keyword like class, abstract, or others

namespace begin looking for namespace name.
- Expect \ or a separator
  - A separator is a docblock, whitespace, or a comment
  - ^[0-9_a-zA-Z]+$ will be expected after a separator....

namespace begin looking for namespace name - expect a separator or an a-z regex - if the next non-separator is a \, then cancel namespace name declaration & return to previous scope. But then I have a statement that needs processing & it needs processing

Okay, so there are two directives when namespace is found. The namespace\function(); type & the namespace Declaration; type. They both have to be evaluated continuously until one of them fails validation.

So the first namespace directive:

active when namespace is found at the end of the buffer
de-activated when ; is found
expect a separator or a-z directive or ;
- When separator found, discard it
- When ; found,
  - if namespace name is empty, report error & throw away the namespace. (continue parsing)
  - if namespace name non-empty & ends with \, report error & throw away the namespace. (continue parsing)
  - if namespace name non-empty & begins with \, discard as its the namespace\foo() syntax
- When a-z directive found, append it to namespace name. expect separator, \. Inherit ; expectation
  - When ; found, bubble up to namespace's ; expectation
  - When separator found, discard it
  - When \ found, append it to namespacename. Expect separator or a-z regex. Inherit ; expectation
    - When ; found, report error & bubble up to namespace's ; expectation
    - when separator found, discard it
    - When a-z regex found, run a-z directive (which will ultimately re-run this directive)

<?php
$directives = [
    'namespace'=>[
        'begin'=> 'namespace',
        'expect'=> ['separator', 'a-z', 'terminator'],

        ':separator'=>[
            'is'=> ['docblock', 'comment', 'whitespace']
        ],
        ':a-z directive'=>[
            'terminate_on_regex'=> '/[^[a-zA-Z0-9_]$]/';
            'regex'=> '/^[a-zA-Z0-9_]+$/',
            'onNoLongerMatching' => 'append to namespace',
            'onMatch'=> 'append to namespace', //probably use a function here
            'expect'=>['separator', 'divider', 'inherit:terminator'],
        ],
        ':terminator'=>[

        ],
        ':divider' => [
            'string'=>'\\'
        ],
    ],
    'docblock'=>[], //its whole own directive
    'comment'=>[], // its whole own directive
    'whitespace'=>[], //its whole own directive
];

When anything else is found, this namespace directive is removed from directives being currently evaluated

So there are different ways to do this:

Evaluate after end-character is reached
Evaluate in an ongoing basis & create rigid expectations that limit what the final result is

The 2nd namespace directive: This rule needs to be present to ensure proper functioning of the namespace declarations rule. Without this rule, if someone uses the namespace\go(); syntax, then we will be stuck in a waiting-for-namespac state... unless the other namespace directive still catches ; & will just throw away if the rest of its stuff failed.

active when namespace is found at the end of a buffer
Expect a separator or \.
When \ found, expect a separator or [a-zA-Z0-9_] (code, one char)
- To simplify expecting code, since I'm not trying to add that feature yet, I can expect strings
When code is found, expect a semi-colon or separator
- IF a separator is found, discard it & continue prior expectation
When semi-colon is found, discard everything.