Dumb and obvious thoughts about parsing binary protocols

6 thoughts
last posted May 14, 2014, 6:42 a.m.

The subject of this stream (and all it relates to) is all pretty damn obvious in retrospect, of course.

Maybe some of these lessons might've been learned by just reading C-based tutorials? Let's think about that...

Allons-y!

I recently needed to work with a packed/binary protocol. Not being the greatest fan of C (or rather, of the effort involved and the userland around it), I decided to use Python and the relatively excellent Construct library.

pip install construct after initializing a venv, fire up some test data, and so far so good.

Wrong assumption number one

Validating the start of the data, and then reading forward for the fixed packet/data length.

Why this is wrong

Corruption comes in all flavours, including too-short messages in the datastream. Look at a window of n>=2 message starts and validate that reading your normal message length doesn't eat into the start of the next message.

Example

Assume the first 4 bytes identify a message. In Python-esque pseudocode (any function or variable name not defined here should be considered already implemented for the example case):

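def validate(data):
    # findMessages, messageLength and badMessage are assumed to exist already
    messageOffsets = findMessages(data)
    # a second header before messageLength means the first message is too short
    if len(messageOffsets) > 1 and messageOffsets[1] < messageLength:
        return badMessage
    # validation here
    if data[0:4] == b'\x01\x02\x03\x04':
        return data[:messageLength]
    return badMessage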
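
To make the pseudocode above concrete, here is a minimal runnable sketch. The header bytes come from the example; the 8-byte MESSAGE_LENGTH, the BAD_MESSAGE sentinel and the find_messages helper are assumptions invented for illustration, not part of any real protocol.

```python
HEADER = b"\x01\x02\x03\x04"   # 4-byte message identifier from the example
MESSAGE_LENGTH = 8             # assumed fixed length, header included
BAD_MESSAGE = None             # assumed sentinel for "not usable"


def find_messages(data):
    """Return every offset at which HEADER occurs in data."""
    offsets = []
    pos = data.find(HEADER)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(HEADER, pos + 1)
    return offsets


def validate(data):
    """Return the first message, or BAD_MESSAGE if it looks truncated."""
    offsets = find_messages(data)
    if not offsets or offsets[0] != 0:
        return BAD_MESSAGE          # data does not start with a header
    if len(offsets) > 1 and offsets[1] < MESSAGE_LENGTH:
        return BAD_MESSAGE          # next header starts inside this message
    if len(data) < MESSAGE_LENGTH:
        return BAD_MESSAGE          # stream ended before the message did
    return data[:MESSAGE_LENGTH]


good = HEADER + b"\xaa\xbb\xcc\xdd"
short = HEADER + b"\xaa\xbb"        # two payload bytes went missing

print(validate(good + good))        # first message, intact
print(validate(short + good))       # rejected: second header arrives too early
```

The point of the window is the second check: without it, the short message would be returned with the first two bytes of the next header glued onto its end.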

If you forget lesson one, you are in for a whole lot of pain. None of your following data will make any sense whatsoever.

Lesson two

hd(1) is your friend.

elegua% echo hello | hexdump -v -e '/1 "%02X "'   
68 65 6C 6C 6F 0A

elegua% hd -n 32 example      
00000000  05 02 01 01 54 00 75 02  56 00 80 02 28 00 2d 00  |....T.u.V...(.-.|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

elegua% hd -s 17 -n 16 example
00000011  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

elegua% hd -s 12 -n 10 example
0000000c  28 00 2d 00 00 00 00 00  00 00                    |(.-.......|

This way you can read any file offset you need, to see what's wrong where (as in the case of data being too short).

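Note: hd is an alternate binary name for hexdump; invoked as hd it behaves like hexdump -C (the canonical hex+ASCII display seen in the hd examples above).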
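
When the bytes are already sitting in a Python session, the same kind of offset peeking can be done without leaving it. The hexdump function below is a hypothetical helper written for this note (it is not part of Construct or the standard library); it formats a slice roughly the way hd -s and -n do. The sample bytes are the first 32 bytes of the example file shown above.

```python
def hexdump(data, start=0, count=None):
    """Format data[start:start+count] roughly like `hd -s start -n count`."""
    end = len(data) if count is None else min(len(data), start + count)
    lines = []
    for offset in range(start, end, 16):
        chunk = data[offset:min(offset + 16, end)]
        # Two groups of eight bytes, each padded to a fixed width.
        left = " ".join("%02x" % b for b in chunk[:8])
        right = " ".join("%02x" % b for b in chunk[8:])
        hexpart = "%-23s  %-23s" % (left, right)
        # Printable ASCII is shown as-is, everything else as a dot.
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append("%08x  %s  |%s|" % (offset, hexpart, text))
    return "\n".join(lines)


# First 32 bytes of the example file, copied from the hd output above.
sample = bytes.fromhex("05020101540075025600800228002d00") + bytes(16)

print(hexdump(sample, 0, 32))
print(hexdump(sample, 12, 10))   # same slice as `hd -s 12 -n 10 example`
```

Unlike hd, this sketch does not collapse repeated lines into a single `*`.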

Lesson three

Remember to implement the checksums (if there are any)! This will help in the following cases:

  • your protocol doesn't have a clearly defined end-of-message marker
  • your message might also contain the header bytes in its payload data, in which case our aforementioned findMessages && badMessage structure would break down

In either case, you can attempt a parse and let the checksum tell you whether you have a valid message or not. The validation would then roughly become:

def validate(data):
    messageOffsets = findMessages(data)
    if len(messageOffsets) > 1 and messageOffsets[1] < messageLength:
        # a header shows up early: it may just be payload data, so let the
        # checksum-verifying parse decide
        try:
            parseMessage(data[:messageLength])
            return data[:messageLength]
        except badDecode:
            return failure
    else:
        # validation here
        if data[0:4] == b'\x01\x02\x03\x04':
            return data[:messageLength]
        return failure
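
As a runnable illustration of the idea, here is a minimal sketch. Everything about the message layout is assumed for the example: a 4-byte header, 5 payload bytes and a single trailing checksum byte holding the sum of the preceding bytes modulo 256. A real protocol will define its own checksum; BadDecode, build_message and parse_message are hypothetical names.

```python
HEADER = b"\x01\x02\x03\x04"
MESSAGE_LENGTH = 10            # assumed: 4 header + 5 payload + 1 checksum


class BadDecode(Exception):
    """Raised when a candidate message fails to decode."""


def checksum(body):
    # Assumed scheme: sum of all preceding bytes, modulo 256.
    return sum(body) % 256


def build_message(payload):
    """Build a message for testing: header, payload, checksum byte."""
    body = HEADER + payload
    return body + bytes([checksum(body)])


def parse_message(candidate):
    """Return the payload, or raise BadDecode."""
    if len(candidate) != MESSAGE_LENGTH or candidate[:4] != HEADER:
        raise BadDecode("bad length or header")
    body, received = candidate[:-1], candidate[-1]
    if checksum(body) != received:
        raise BadDecode("checksum mismatch")
    return body[4:]


def validate(data):
    """Return the first message if it parses cleanly, else None."""
    try:
        parse_message(data[:MESSAGE_LENGTH])
    except BadDecode:
        return None
    return data[:MESSAGE_LENGTH]


# A valid message whose payload happens to contain the header bytes.
tricky = build_message(b"\x09" + HEADER)
# A truncated message followed directly by a complete one.
stream = build_message(b"hello")[:7] + build_message(b"world")

print(validate(tricky) == tricky)   # header-in-payload is accepted
print(validate(stream))             # truncated message is rejected
```

An offset-only check cannot tell these two inputs apart, since both show a second header before the message length is reached; the checksum is what separates them.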