Grammar for message parsing?

smcnamara · Post by **smcnamara** » Sat Feb 25, 2012 4:18 pm

I'm struggling with a pet-project, and am not sure I'm thinking clearly anymore so am looking for some feedback to hopefully get me back on course. Perhaps by writing this out it will become clearer for me, and if not (or in addition), hopefully some of you will be able to shine a light on things.

I'm implementing some CL code to handle streaming messages from an external service. The message formats are not explicitly specified, but I have an existing implementation in another language from which I can gather the message type details. The stream takes the form of a series of null-separated ascii values.

The first thing received is:

<MSG_ID> : An identifier that indicates what type of message is on the wire.

This ID is then followed by a number of fields, with the number and type being dependent on the message type. The fields do not contain any kind of identifier specifying what field or type they might be, and the message processor is simply supposed to understand the entire format based on the MSG_ID originally received.

Each message is different, but there are a few common components I can identify:

<STRING> : a string value
<INT> : an integer value (in ascii format)
<FLOAT> : a floating point number (in ascii format)
<BOOLEAN> : int (in ascii format) 0 if false, true if ascii value != "0"
<STRICT BOOLEAN> : TRUE if ascii = "1", FALSE if != "1"
<COMPOUND FIELD> : some collection of elements that define a compound structure
<ITERATOR> : an integer value (in ascii format) that indicates that some number (0 or more) instances of a compound field follow.

The available stream parser is an abomination of code with the MSG_ID acting as the key to a switch statement, each message type being handled
by a case statement (including all sorts of special-case handling and gobs of repeated code.)

I'd rather not implement the CL code the same way, and have thought this might be a good opportunity to apply a table-driven parser. While I could construct the table by hand, that seems like a very brute force approach since there are approximately 60 different messages that might arrive, not to mention that the message contents seem to be altered fairly frequently. It seems like I should be defining a grammar that expresses the message structure, and then using that grammar to generate the parsing table, but I'm completely flummoxed on how to begin this. To explain where I'm confused (which is pretty darn early):

S = Start
M = MSG_ID
P = Payload

S --> MP
M --> digit digit*

But now I'm stuck, since the specific value of a particular M defines the rule that should be used for the Message.

MSG_ID 1 = LOGON
MSG_ID 2 = ERROR

Is there a common way (or is it OK) to specify distinct terminals in a grammar? Meaning I could add rules like:

M --> "1" | "2" | ... | "60"
1P --> 1 string string int
2P --> 2 string string int string

At this point, I'm no longer sure I'm framing the problem the right way, and am wondering if anyone has experienced a similar situation. I'm not tied to the table driven approach, but for some reason considering the system in the form of a push-down automata and driving the state transitions with a lookup table felt natural.

I appreciate any feedback or suggestions!

Post by **nuntius** » Sat Feb 25, 2012 5:40 pm

EBNF is a common grammar specification.

Regular expressions are another.

A few years ago, I wrote a basic EBNF parser for CL. It had a few shortcomings; so I have started tinkering with a different approach.

For the problem you describe, the grammar might look something like

Code: Select all

msg1="LOGON", integer, integer
msg2="ERROR", string, integer, string, string
message=msg1|msg2

While length encodings (e.g. an integer N followed by N copies of another rule) can be specified in these formalisms, they often require expanding out all possible lengths... It is usually better to allow custom rules to be inserted into the grammar.

smcnamara · Post by **smcnamara** » Sat Feb 25, 2012 5:57 pm

So if I'm understanding you correctly, it's perfectly acceptable to have a numeric literal on the right side, yes? There is never a string like "LOGON" passed. Instead, I will get a ascii-numeric 1 at the start, so in EBNF I could have:

logon_msg = 1, integer, integer
error_msg = 2, string, integer, string, string
message = logon_msg | error_msg

In that case, could I define the iterative constructs like:

block = string, string, int, int, float
iter_block = int, {block}

block_msg = string, string, iter_block, int, float

My purpose is to get to the point where I can generate the table to drive the state transitions. Does this seem like a reasonable approach or is there another direction I could be looking in?

Thanks!

Post by **gugamilare** » Sat Feb 25, 2012 7:48 pm

The solution I comonly use is cl-yacc (manual). It is very simple to use IMO.

There are also other options available.

LispForum

Grammar for message parsing?

Grammar for message parsing?

Re: Grammar for message parsing?

Re: Grammar for message parsing?

Re: Grammar for message parsing?