Grammar for message parsing?
Posted: Sat Feb 25, 2012 4:18 pm
I'm struggling with a pet-project, and am not sure I'm thinking clearly anymore so am looking for some feedback to hopefully get me back on course. Perhaps by writing this out it will become clearer for me, and if not (or in addition), hopefully some of you will be able to shine a light on things.
I'm implementing some CL code to handle streaming messages from an external service. The message formats are not explicitly specified, but I have an existing implementation in another language from which I can gather the message type details. The stream takes the form of a series of null-separated ascii values.
The first thing received is:
<MSG_ID> : An identifier that indicates what type of message is on the wire.
This ID is then followed by a number of fields, with the number and type being dependent on the message type. The fields do not contain any kind of identifier specifying what field or type they might be, and the message processor is simply supposed to understand the entire format based on the MSG_ID originally received.
Each message is different, but there are a few common components I can identify:
<STRING> : a string value
<INT> : an integer value (in ascii format)
<FLOAT> : a floating point number (in ascii format)
<BOOLEAN> : int (in ascii format) 0 if false, true if ascii value != "0"
<STRICT BOOLEAN> : TRUE if ascii = "1", FALSE if != "1"
<COMPOUND FIELD> : some collection of elements that define a compound structure
<ITERATOR> : an integer value (in ascii format) that indicates that some number (0 or more) instances of a compound field follow.
The available stream parser is an abomination of code with the MSG_ID acting as the key to a switch statement, each message type being handled
by a case statement (including all sorts of special-case handling and gobs of repeated code.)
I'd rather not implement the CL code the same way, and have thought this might be a good opportunity to apply a table-driven parser. While I could construct the table by hand, that seems like a very brute force approach since there are approximately 60 different messages that might arrive, not to mention that the message contents seem to be altered fairly frequently. It seems like I should be defining a grammar that expresses the message structure, and then using that grammar to generate the parsing table, but I'm completely flummoxed on how to begin this. To explain where I'm confused (which is pretty darn early):
S = Start
M = MSG_ID
P = Payload
S --> MP
M --> digit digit*
But now I'm stuck, since the specific value of a particular M defines the rule that should be used for the Message.
MSG_ID 1 = LOGON
MSG_ID 2 = ERROR
Is there a common way (or is it OK) to specify distinct terminals in a grammar? Meaning I could add rules like:
M --> "1" | "2" | ... | "60"
1P --> 1 string string int
2P --> 2 string string int string
At this point, I'm no longer sure I'm framing the problem the right way, and am wondering if anyone has experienced a similar situation. I'm not tied to the table driven approach, but for some reason considering the system in the form of a push-down automata and driving the state transitions with a lookup table felt natural.
I appreciate any feedback or suggestions!
I'm implementing some CL code to handle streaming messages from an external service. The message formats are not explicitly specified, but I have an existing implementation in another language from which I can gather the message type details. The stream takes the form of a series of null-separated ascii values.
The first thing received is:
<MSG_ID> : An identifier that indicates what type of message is on the wire.
This ID is then followed by a number of fields, with the number and type being dependent on the message type. The fields do not contain any kind of identifier specifying what field or type they might be, and the message processor is simply supposed to understand the entire format based on the MSG_ID originally received.
Each message is different, but there are a few common components I can identify:
<STRING> : a string value
<INT> : an integer value (in ascii format)
<FLOAT> : a floating point number (in ascii format)
<BOOLEAN> : int (in ascii format) 0 if false, true if ascii value != "0"
<STRICT BOOLEAN> : TRUE if ascii = "1", FALSE if != "1"
<COMPOUND FIELD> : some collection of elements that define a compound structure
<ITERATOR> : an integer value (in ascii format) that indicates that some number (0 or more) instances of a compound field follow.
The available stream parser is an abomination of code with the MSG_ID acting as the key to a switch statement, each message type being handled
by a case statement (including all sorts of special-case handling and gobs of repeated code.)
I'd rather not implement the CL code the same way, and have thought this might be a good opportunity to apply a table-driven parser. While I could construct the table by hand, that seems like a very brute force approach since there are approximately 60 different messages that might arrive, not to mention that the message contents seem to be altered fairly frequently. It seems like I should be defining a grammar that expresses the message structure, and then using that grammar to generate the parsing table, but I'm completely flummoxed on how to begin this. To explain where I'm confused (which is pretty darn early):
S = Start
M = MSG_ID
P = Payload
S --> MP
M --> digit digit*
But now I'm stuck, since the specific value of a particular M defines the rule that should be used for the Message.
MSG_ID 1 = LOGON
MSG_ID 2 = ERROR
Is there a common way (or is it OK) to specify distinct terminals in a grammar? Meaning I could add rules like:
M --> "1" | "2" | ... | "60"
1P --> 1 string string int
2P --> 2 string string int string
At this point, I'm no longer sure I'm framing the problem the right way, and am wondering if anyone has experienced a similar situation. I'm not tied to the table driven approach, but for some reason considering the system in the form of a push-down automata and driving the state transitions with a lookup table felt natural.
I appreciate any feedback or suggestions!