utf-8 in SBCL

ramarren
Posts: 613
Joined: Sun Jun 29, 2008 4:02 am
Location: Warsaw, Poland

Re: utf-8 in SBCL

Post by ramarren » Sat Nov 13, 2010 12:32 pm

Your loop has poor (quadratic) complexity because of the repeated concatenation, and it is more complicated than it needs to be. It can be rewritten as (not tested):

(defun read-header (stream)
  (with-output-to-string (header)
    (loop for line = (read-line stream nil)
          while (and line (string/= line (string #\Return)))
          do (write-string line header))))
That might be brittle, though, since I am not sure whether SBCL has some setting under which it converts the #\Return #\Newline sequence to just a newline. In fact, this will not work at all unless the other side staggers the sending so that the #\Return is received separately. Are you sure your code actually works?
sdp wrote: Now I just have to figure out how to read-byte from a non-binary stream.
This is possible in SBCL in principle, but I am not sure how to make usocket produce such a stream, and I am not sure if it is a good idea. I would suggest reading the data as a stream of bytes (preferably using READ-SEQUENCE), locating the header terminator as a binary pattern (which would eliminate the issue above), and then decoding just the header.
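
A minimal sketch of that byte-oriented approach, assuming the header ends with a blank line in CRLF form (the octets 13 10 13 10) and that the bytes received so far are already collected in an octet vector. The names *header-terminator* and split-header are made up for illustration, and sb-ext:octets-to-string is SBCL-specific:

```lisp
;; Sketch only. *HEADER-TERMINATOR* and SPLIT-HEADER are hypothetical names,
;; and the CRLF CRLF terminator is an assumption about the protocol.
(defparameter *header-terminator*
  (coerce #(13 10 13 10) '(vector (unsigned-byte 8))))

(defun split-header (octets)
  "OCTETS must be a (vector (unsigned-byte 8)).
Return the header decoded as UTF-8 and, as a second value, the index of the
first octet after the terminator. Return NIL if no terminator is present."
  (let ((end (search *header-terminator* octets)))
    (when end
      (values (sb-ext:octets-to-string octets
                                       :external-format :utf-8
                                       :start 0
                                       :end end)
              (+ end (length *header-terminator*))))))
```

Called on a buffer that does not yet contain the terminator, it returns NIL, so the caller can read more bytes and try again. The second value says where the non-header data starts, which never passes through the UTF-8 decoder.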

sdp
Posts: 9
Joined: Fri Nov 12, 2010 11:34 pm

Post by sdp » Sat Nov 13, 2010 1:16 pm

It was more subtly broken: the default eol-style on my Mac is just :lf, but the input spec uses :crlf, so I was getting a #\Return at the end of every line. I fixed the eol-style, so this works properly:

(defun read-header (stream)
  (with-output-to-string (header)
    (loop for line = (read-line stream nil)
          while (and line (string/= line ""))
          do (write-string line header))))
I wasn't aware of with-output-to-string, thanks! This is my first non-trivial project in Lisp, so I'm really only aware of a small subset of the spec.
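
The stray #\Return described above is easy to reproduce without a socket: a string stream does no end-of-line conversion, so CRLF-terminated text read from it keeps the #\Return on each line. The sketch below is only a fallback for cases where the stream's external format cannot be changed, and strip-trailing-return is a hypothetical helper name:

```lisp
;; Hypothetical helper: remove #\Return characters left at the end of a line
;; when CRLF input is read with LF-only line handling.
(defun strip-trailing-return (line)
  (string-right-trim '(#\Return) line))

;; READ-LINE removes the #\Newline but leaves the #\Return in place.
(defun read-crlf-line (stream)
  (let ((line (read-line stream nil)))
    (and line (strip-trailing-return line))))
```

With this, the original test against the empty string works even when the eol-style is left at :lf, since a blank CRLF line comes back as "".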
ramarren wrote: This is possible in SBCL in principle, but I am not sure how to make usocket produce such a stream, and I am not sure if it is a good idea.
As I have it now, I have a listening usocket with :element-type '(unsigned-byte 8), and once someone connects, I have a flexi-stream wrapped around the usocket stream which takes an external format specification '(:utf-8 :eol-style :crlf) but can be read bytewise.
ramarren wrote: I would suggest reading the data as a stream of bytes (preferably using READ-SEQUENCE), locating the header terminator as a binary pattern (which would eliminate the issue above), and then decoding just the header.
This is probably the right way to go, since only the header is specifically utf-8.

Post by ramarren » Sat Nov 13, 2010 1:34 pm

sdp wrote: I fixed the eol-style, so this works properly:
Note that READ-LINE discards the newline, so if the header includes internal newlines that you want to keep in the extracted copy, WRITE-LINE would probably be better than WRITE-STRING.
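
As a sketch of that change, here is the same loop with WRITE-LINE, so each header line keeps a terminating newline in the result. It assumes the eol-style is already handled, so a blank line arrives as the empty string; the function name is made up to keep it apart from the version quoted above:

```lisp
;; Variant of READ-HEADER that preserves line boundaries.
;; READ-HEADER-KEEPING-NEWLINES is a hypothetical name.
(defun read-header-keeping-newlines (stream)
  (with-output-to-string (header)
    (loop for line = (read-line stream nil)
          ;; Stop at end of file or at the blank line that ends the header.
          while (and line (string/= line ""))
          ;; WRITE-LINE puts back the newline that READ-LINE discarded.
          do (write-line line header))))
```

With WRITE-STRING, the lines would run together into one string with no separators.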
sdp wrote: I have a flexi-stream wrapped around the usocket stream which takes an external format specification '(:utf-8 :eol-style :crlf) but can be read bytewise.
Flexi-streams might actually be a better solution, especially if the message size has no reasonable upper bound, since the in-memory binary streams would already deal with any necessary buffering. I am not even sure how READ-SEQUENCE would interact with a blocking stream, especially if the end of the message is not signalled by closing the socket and the length is not known. Although I suppose the length is included in the header?

Post by sdp » Sat Nov 13, 2010 1:57 pm

ramarren wrote: ...if the end of the message is not signalled by closing the socket and the length is not known. Although I suppose that is included in the header?
The length is not known; after the header it's a potentially infinite two-way conversation until either the server or the client sends a control sequence.

Post by ramarren » Sat Nov 13, 2010 2:22 pm

sdp wrote: The length is not known; after the header it's a potentially infinite two-way conversation until either the server or the client sends a control sequence.
But is the length of individual messages known? I don't actually have much experience with network programming, but I think that reading one byte at a time is unlikely to be efficient even given system buffering, and you can't do a sequence read on a blocking socket without knowing the message length, because you might, well, block. And I don't think usocket allows non-blocking reads.

For serious network programming I believe iolib is currently the best option, since it handles non-blocking communication and multiplexing. Of course, if your needs are limited, that would be overkill. Another option would be to use the SBCL socket API directly. Since usocket is a compatibility library, it presents only the features common to most of the implementations it covers. I note that it allows socket-receive only for datagram sockets, while the SBCL API allows it for TCP as well and also supports non-blocking operation.

Post by sdp » Sat Nov 13, 2010 3:29 pm

ramarren wrote: But is the length of individual messages known?
It is part of the message headers (not the same header I'm reading at the opening of the socket).
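
With the length taken from the message header, a blocking read of exactly that many bytes can be written in portable Common Lisp. READ-SEQUENCE returns the index of the first element it did not fill, so a short count means the stream ended early. This is only a sketch: read-message-body is a hypothetical name, and it does nothing about the non-blocking case raised in this thread.

```lisp
;; Hypothetical helper: read one message body of known size.
(defun read-message-body (stream length)
  "Read exactly LENGTH octets from the binary input STREAM.
Blocks until they arrive and signals an error if the stream ends first."
  (let* ((buffer (make-array length :element-type '(unsigned-byte 8)))
         (filled (read-sequence buffer stream)))
    (unless (= filled length)
      (error "Stream ended after ~D of ~D octets." filled length))
    buffer))
```

Because the buffer is sized from the header, the read cannot wait for bytes beyond the current message. It can still block for as long as the peer takes to send the rest of that message.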
ramarren wrote: For serious network programming I believe iolib is currently the best option, since it handles non-blocking communication and multiplexing.
After thinking about it, I would expect to use non-blocking communication, so I can make asynchronous reads/writes.

This is the first network programming I've ever done, not counting web apps where this was all abstracted away for me. I mostly get the theory, but I'm learning a lot by sitting down and actually hacking out this project.
