Create Language Parser
To parse a new, currently unsupported language, you'll define Directives, written in a simplified programming language. Directives are backed by PHP code, either built in to the Lexer or added to your language parser's Grammar.
The goal of parsing code is to create an AST (Abstract Syntax Tree). It's just a multi-dimensional array that details the structure of the parsed code.
(The Lexer could potentially parse things other than code, such as cooking recipes.)
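To make the "multi-dimensional array" idea concrete, here is a hand-written sketch of what an AST for a tiny file with one class might look like. The property names ('type', 'name', 'docblock', 'methods') and the nesting are assumptions for illustration, not the exact output of any shipped grammar.

```php
<?php
// Hypothetical AST for a source file that contains one class with one method.
// Every key name here is illustrative; a real grammar decides its own keys.
$ast = [
    'type'  => 'file',
    'class' => [
        [
            'type'     => 'class',
            'name'     => 'Cat',
            'docblock' => 'A small animal.',
            'methods'  => [
                ['type' => 'method', 'name' => 'meow', 'args' => []],
            ],
        ],
    ],
];

// Reading the tree is plain nested array access.
echo $ast['class'][0]['methods'][0]['name'], "\n"; // prints "meow"
```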
In this file
- How the Lexer Works
- Directives
- Instruction Examples
- Available Instructions
How the Lexer Works
The Lexer creates a Token, an Ast stack, and a Directive stack, then loops over each individual character in the input string, adding one character to the Token's buffer on each loop.
On each loop, the head of the Directive stack is processed, and instructions may modify the head ast, add another directive to the top of the stack, create a new ast, rewind the buffer, or do many other tasks. (Note: Most instruction sets start with a match /regex/ instruction, which must succeed for the other instructions to run.)
Each layer of the Directive stack contains an 'unstarted' and a 'started' list (each list is an array of Directives, or an empty array).
Each Directive may contain up to 3 instruction sets: 'start', 'match', and 'stop'. (@todo: We'll talk about the special 'is' instruction set later.)
When the 'started' list is empty, 'unstarted' directives are processed. When an 'unstarted' directive is processed, its 'start' instruction set is executed. If the 'start' instruction set succeeds, the Directive is added to the 'started' list.
When the 'started' list is NOT empty, 'started' directives are processed. When a 'started' directive is processed, its 'match' and 'stop' instruction sets are executed. 'match' goes first. If 'stop' is executed successfully, the Directive is moved back to the 'unstarted' list.
A Directive may add a layer to the directive stack. Then on subsequent loops, the new head directive layer will be processed, and the previous directive layer will be paused, until it is the head layer again.
The Lexer loops in this way, over each character, until all characters are processed.
When the Lexer finishes parsing a string, it returns a detailed AST describing the input file/string.
To recap, Directives and ASTs are both on a stack. Directives are processed on each loop, executing instructions that modify ASTs, create new ASTs, and tell the Lexer which Directives it should run next.
(Tip: the lexer has a 'stop_loop' setting for debugging, to stop after a given number of loops.)
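The loop described in this section can be pictured with a toy model. This is a deliberately reduced sketch, not the real \Tlf\Lexer code: it assumes a single directive layer, replaces each instruction set with one regex, and skips 'match' sets entirely. The function name toyLex and the $stopLoop parameter are made up for the example.

```php
<?php
// Toy model only: one directive layer, and each instruction set is reduced
// to a single regex. The real Lexer runs full instruction sets instead.
function toyLex(string $input, array $directives, int $stopLoop = -1): array
{
    $ast = ['type' => 'file', 'strings' => []];
    $layer = ['unstarted' => $directives, 'started' => []];
    $buffer = '';

    foreach (str_split($input) as $loop => $char) {
        if ($stopLoop >= 0 && $loop >= $stopLoop) {
            break; // debugging aid in the spirit of 'stop_loop'
        }
        $buffer .= $char; // one character is added to the buffer per loop

        if ($layer['started'] === []) {
            // No started directives: try each unstarted directive's 'start'.
            foreach ($layer['unstarted'] as $name => $directive) {
                if (preg_match($directive['start'], $buffer) === 1) {
                    $layer['started'][$name] = $directive;
                    unset($layer['unstarted'][$name]);
                    $buffer = '';
                    break;
                }
            }
            continue;
        }

        // Started directives: try 'stop' ('match' sets are left out here).
        foreach ($layer['started'] as $name => $directive) {
            if (preg_match($directive['stop'], $buffer, $found) === 1) {
                $ast['strings'][] = $found[1]; // record the captured text
                $layer['unstarted'][$name] = $directive;
                unset($layer['started'][$name]);
                $buffer = '';
            }
        }
    }
    return $ast;
}

$directives = ['string' => ['start' => '/"$/', 'stop' => '/^(.*)"$/']];
$result = toyLex('a "hi" b "yo"', $directives);
print_r($result['strings']); // contains the two quoted strings: hi, yo
```

Passing a small $stopLoop (for example 4) ends the loop before the first string is closed, which is the kind of early exit a loop limit is useful for when debugging.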
Directives & Instruction Sets
A Directive is a named array of instruction sets. Each instruction set contains an array of instructions (lol).
Most instruction sets begin with a match /regex/, which matches against the Token's current buffer. If the regex matches, an unstarted directive is started, and the instructions after the match instruction are processed. If the regex does not match, the rest of the instruction set is not processed.
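As a shape-only illustration of "a named array of instruction sets", the sketch below defines one directive with a 'start' and a 'stop' set, each beginning with a match. The directive name 'comment', both regexes, and the choice of follow-up instructions are assumptions for demonstration; only the overall structure is taken from the description above.

```php
<?php
// Illustrative structure: directive name => instruction sets => instructions.
// The name, regexes, and chosen instructions are made up for this example.
$directives = [
    'comment' => [
        'start' => [
            'match' => '/\/\/$/',   // first instruction: match against the buffer
            'buffer.clear',         // runs only if the match succeeded
        ],
        'stop' => [
            'match' => '/\n$/',
            'ast.push comments',    // append the buffer to an array property
            'buffer.clear',
        ],
    ],
];

// List which instruction sets each directive defines.
foreach ($directives as $name => $instructionSets) {
    echo $name, ': ', implode(', ', array_keys($instructionSets)), "\n";
}
```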
-- @TODO sort / improve this
The 'is' instruction set is handled in Grammar.php by expandDirectiveWithIs()
-- end
Instructions
There are 20+ instructions available, and you can directly call methods on your Grammar, the Lexer, or the Token.
Instructions can be a string like 'token.rewind 3' or a key/value pair like 'ast.new' => ['type'=>'class','name'=>'_token:buffer' ...], which creates a new ast.
When defining a key/value pair, the key is the instruction (and may contain arguments), and the value is an argument to pass to the instruction.
Tip: This allows arrays to be passed to instructions.
Tip: If the value is (strict boolean) false, the instruction is disabled.
Tip: If the key begins with an underscore (_), the instruction is disabled.
The instruction can include arguments, separated by a space. The key may end in a special/reserved argument. The reserved arguments are '...', '[]', '!', and '// comment'. The value may be any PHP data type, depending on the instruction's requirements & any reserved args that are used.
(Tip: Add comments if you ever have two identical keys in an instruction set.)
If you only define a value (no string key), then the value is your instruction, and reserved arguments are unavailable.
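The rules above (value-only entries, key/value entries, the two ways to disable an instruction, and trailing comments on keys) can be sketched as a small normalizing function. This is an assumption-laden illustration, not the library's parser: normalizeInstructionSet is a hypothetical helper, and it treats everything after '//' in a key as a comment.

```php
<?php
/**
 * Sketch of the rules above (not the library's actual code): turn one
 * instruction set into a list of [instruction string, value] pairs.
 */
function normalizeInstructionSet(array $set): array
{
    $out = [];
    foreach ($set as $key => $value) {
        if (is_int($key)) {
            // Value-only entry: the value itself is the instruction.
            $out[] = [$value, null];
            continue;
        }
        if ($value === false || strpos($key, '_') === 0) {
            continue; // disabled: strict false value, or key starts with "_"
        }
        // Drop a trailing "// comment" used to keep identical keys unique.
        $instruction = trim(preg_replace('#\s*//.*$#', '', $key));
        $out[] = [$instruction, $value];
    }
    return $out;
}

$set = [
    'match' => '/^class\s$/',
    'rewind 1',
    'ast.new' => ['type' => 'class'],
    '_buffer.clear' => true,                 // disabled by the underscore
    'debug.print' => false,                  // disabled by strict false
    'previous.append // first'  => ['statement'],
    'previous.append // second' => ['method_declaration'],
];

foreach (normalizeInstructionSet($set) as [$instruction, $value]) {
    echo $instruction, "\n";
}
```

The two 'previous.append' keys would collide in a PHP array without the comments, which is why the comment form is handy when an instruction set repeats an instruction.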
Instruction Examples
- "instruction a b c" passes three arguments ('a','b','c') to the instruction.
- "object:method arg1 arg2" calls a PHP object's method, passing two args ('arg1', 'arg2').
- "instruction a" => "b b" passes two arguments ('a', 'b b') to the instruction.
- "instruction a" => ['b', 'c', 'd'] passes two arguments to the instruction ('a', ['b','c','d']).
- "instruction a ..." => ['b', 'c', 'd'] passes four arguments ('a','b','c','d') to the instruction.
- "instruction []" => 'value' throws an exception because [] is reserved for future use.
- "instruction !" => '_object:method arg1 arg2' calls the named PHP object/method, and the return value is passed to the instruction.
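The argument rules in these examples can be checked with a short sketch. buildArgs is a hypothetical helper written for this page, not a function of the library; it splits the key on whitespace (so it would mangle a regex containing spaces) and leaves the '!' form out, since that one needs live objects to call.

```php
<?php
/**
 * Sketch: build [instruction name, argument list] from a key and an
 * optional value, following the examples above. Hypothetical helper.
 */
function buildArgs(string $key, $value = null, bool $hasValue = false): array
{
    $parts = preg_split('/\s+/', trim($key));
    $name = array_shift($parts);
    $last = end($parts); // false when the key has no arguments

    if ($last === '[]') {
        throw new InvalidArgumentException("'[]' is reserved for future use");
    }
    if ($last === '...') {
        array_pop($parts); // drop the '...' marker, then spread the value
        return [$name, array_merge($parts, array_values((array) $value))];
    }
    if ($hasValue) {
        $parts[] = $value; // the value is passed as one extra argument
    }
    return [$name, $parts];
}

echo json_encode(buildArgs('instruction a b c')), "\n";
echo json_encode(buildArgs('instruction a', ['b', 'c', 'd'], true)), "\n";
echo json_encode(buildArgs('instruction a ...', ['b', 'c', 'd'], true)), "\n";
```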
For object:method, you can call any public method on the object, and available objects are:
- lexer, \Tlf\Lexer
- token, \Tlf\Lexer\Token
- ast, \Tlf\Lexer\Ast - the head ast
- this, \Tlf\Lexer\Grammar - the Grammar attached to the current directive (the Grammar that defines the current directive).
- any other grammar name (must be a grammar that's added to your lexer instance).
Non-Grammar methods are called like $object->method(...$args), where $args is what's defined in your instruction, like ['arg1', 'arg2'].
Grammar methods receive ($lexer, $headAst, $token, $directive, $args), where $args is what's defined in your instruction, like ['arg1', 'arg2'].
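The two calling conventions can be compared side by side in a small sketch. Everything here is a stand-in: callInstruction, the $objects map, the $grammarNames list, and the anonymous classes are assumptions made for illustration, and only the two call shapes come from the description above.

```php
<?php
/**
 * Sketch of dispatching "object:method arg1 arg2". The $objects map, the
 * $grammarNames list, and the stand-in objects below are assumptions made
 * for illustration; they are not the library's internals.
 */
function callInstruction(string $instruction, array $objects, array $grammarNames, array $directive)
{
    $args = preg_split('/\s+/', trim($instruction));
    [$objectName, $method] = explode(':', array_shift($args), 2);
    $object = $objects[$objectName];

    if (in_array($objectName, $grammarNames, true)) {
        // Grammar methods receive the parsing context plus the args array.
        return $object->$method($objects['lexer'], $objects['ast'], $objects['token'], $directive, $args);
    }
    // Any other object: the args are spread into the method call.
    return $object->$method(...$args);
}

$objects = [
    'lexer' => new stdClass(),
    'ast'   => new stdClass(),
    'token' => new class {
        public function join(string ...$parts): string
        {
            return implode('+', $parts);
        }
    },
    'this'  => new class {
        public function handleDocblockEnd($lexer, $headAst, $token, $directive, $args): int
        {
            return count($args);
        }
    },
];

echo callInstruction('token:join arg1 arg2', $objects, ['this'], []), "\n";             // arg1+arg2
echo callInstruction('this:handleDocblockEnd arg1 arg2', $objects, ['this'], []), "\n"; // 2
```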
Available Instructions
Additional Instructions are available in code/Commands.php.
- match /regex/ - Start directive & continue processing instruction set if /regex/ matches the current token buffer
    - or buffer.match
- then :directive_name - Add the named directive to the directive stack. Creates a new layer on the stack once per loop.
    - or directive.then
    - then grammarname:directive_name - Add a directive of another grammar, such as from the docblock grammar
    - then directive_name.stop - Add the named directive's 'stop' instruction set as a 'start'
    - "then :+new_directive_name" => $directive - Create a new directive to add to the stack, instead of referencing a named directive.
    - "then _blank" => $directive - Same as the :+ form.
    - "then directive_name" => $directive_overrides - To add a directive, but override parts of it. See getOverriddenDirective() in Grammar.php.
- then.pop :directive_name layers_to_pop - Add a directive to the stack & immediately pop the directive layer when it is matched (instead of the directive's normal functioning).
    - layers_to_pop is an int
    - Rewinds by the length of the first capture group from the target directive's match
    - Creates a new Directive stack layer before the pop if it is the first then call this loop
- buffer.notin key - Check if the current buffer matches a string in your grammar's public $notin. If there is no match, the buffer is cleared; whether the directive is then started or stopped is not confirmed here.
    - Your grammar defines public $notin = array<string key, array array_of_strings>.
    - buffer.notin key checks !in_array($current_buffer, $grammar->notin['key'])
- "ast.new" => [] - Create a new ast
    - arg type=>class or _type - a string type or _object:method arg1 arg2 to call on lexer, token, or head ast
    - arg _setHead=true|false - (optional) true to add to top of ast stack. false to not. default is true.
    - arg _class=>PhpClass - (optional) the Ast's PHP class.
    - arg _setto=>property - (optional) Set the new ast to the current head ast's named property.
    - arg _addto=>property - (optional) Push the new ast to the current head ast's named property.
    - arg _setPrevious=>key - (optional) Set the new ast to the 'previous' key.
        - Ex: we create a docblock ast & we _setPrevious=>'docblock', then we encounter a class and retrieve it with $lexer->previous('docblock'), to set the class's docblock.
    - If neither _setto nor _addto is given, the type decides the target: e.g. if the type is 'class', the new ast is added to the current head ast's 'class' property.
    - Any other key/value pair - the key is the name of a property on the ast. The value is either the value to set, or it calls an _object:method arg1 arg2 if it is a string starting with an underscore (_). Ex: _token:buffer would set the current buffer string to the property.
- debug.die or die - Same as debug.print but exits.
- debug.print or print - Shows what PHP values were created from your instruction.
- directive.inherit [:directive.isn] ["match"] or inherit - Run commands of another directive, except for the 'match' instruction.
    - Ex: inherit :string_instructions.stop
    - Include literal string 'match' to enable the 'match' instruction.
    - arg :directive.isn - isn is the instruction set name ('start', 'match', or 'stop').
- directive.start or start - Mark the current directive as started.
    - You can start a Directive with start instead of match.
- directive.stop or stop - Mark the current directive as stopped
    - Allows a 'match' instruction set to stop a directive
- token.rewind [num_chars] or rewind - Rewind the token.
- token.forward [num_chars] or forward - Move the token forward
- directive.halt or halt - Halt the current directive, so further instructions in the active instruction set will not be processed. Other instruction sets in the active stack list will be processed.
- halt.all - Halt all other instruction sets waiting to be processed. To also halt the active instruction set, call 'directive.halt' AFTER 'halt.all'
- previous.set [key] - Store the current buffer in the 'previous' key/value set, under the given key.
    - Ex: previous.set docblock is used to capture a docblock, then when the next class or function is found, the Directive will call "ast.set docblock !" => '_lexer:previous docblock'
- previous.append [key] - Append the current buffer to the 'previous' key/value set, for the given key.
    - "previous.append" => ['statement', 'method_declaration'] also works
- previous.obj_set [key] [property] [value] - get the previous 'key', which MUST be an object, and set object->property = value on that object.
- previous.ast_push [key] [property] [value] - get the previous 'key', which MUST be an ast object, and push the value to it
- directive.stop_others - Loop through the list of all other started directives and move them to the unstarted list. Does NOT stop the active directive (the one calling directive.stop_others).
- directive.pop [num_layers] - Pop layers off the Directive stack.
- buffer.clear - Clear the buffer
- buffer.clearNext [num_chars] - Move forward [num_chars] characters, but do NOT add those chars to the buffer. May corrupt the token...
- buffer.appendChar [string] - Append a string to the buffer. May corrupt the token...
- ast.pop - Remove the head AST from the top of the stack, unless it's the last one.
- "ast.set [property] !" => '_object:method arg1 arg2' - Set head AST's [property] to $object->method('arg1','arg2')'s return value.
    - ast.set [property] will set the head AST's [property] to the current buffer.
    - Ex: "ast.set docblock !" => '_lexer:previous docblock' will get the docblock from the 'previous' key/value set, and set the 'docblock' property on the head ast.
- ast.push [property] - Append the current buffer to the given array property on the head ast.
- "ast.append [property] !" => '_object:method arg1 arg2' - Append to the head AST's [property]. Same as ast.set, except this appends.
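The docblock hand-off mentioned under previous.set and ast.set can be pictured with a toy key/value store. ToyPrevious is a made-up stand-in that assumes the 'previous' set behaves like a simple map; the comments show which instruction each line loosely corresponds to, and the head ast is reduced to a plain array.

```php
<?php
// Toy stand-in for the 'previous' key/value set (assumed behavior only).
final class ToyPrevious
{
    /** @var array<string, mixed> */
    private $values = [];

    public function set(string $key, $value): void
    {
        $this->values[$key] = $value;
    }

    public function append(string $key, string $text): void
    {
        $this->values[$key] = ($this->values[$key] ?? '') . $text;
    }

    public function get(string $key)
    {
        return $this->values[$key] ?? null;
    }
}

$previous = new ToyPrevious();
$previous->set('docblock', 'Describes a cat.');    // roughly: previous.set docblock
$previous->append('docblock', ' It meows.');       // roughly: previous.append docblock

$headAst = ['type' => 'class', 'name' => 'Cat'];   // roughly: "ast.new" => ['type' => 'class', ...]
$headAst['docblock'] = $previous->get('docblock'); // roughly: "ast.set docblock !" => '_lexer:previous docblock'

echo $headAst['docblock'], "\n"; // Describes a cat. It meows.
```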
Example
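// @todo make a better example
['docblock' => [
    // 'start' runs while the directive is unstarted
    'start' => [
        'match' => '##',
    ],
    // 'stop' runs once the directive has started
    'stop' => [
        'match' => '/(^\s*[^\#])/m',
        'rewind 2',
        'this:handleDocblockEnd', // calls handleDocblockEnd() on the Grammar that defines this directive
        'buffer.clear',
        // 'forward 2'
    ],
]];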