utf-8 in SBCL

Discussion of Common Lisp
sdp
Posts: 9
Joined: Fri Nov 12, 2010 11:34 pm

Post by sdp » Sat Nov 13, 2010 12:00 am

I am trying to read a utf-8 stream:

Code: Select all

(defun read-header (stream)
  (let ((header "")
        (line (read-line stream nil)))
    (loop while (not (eq line ""))
       do
         (setf header (concatenate 'string header line))
         (setf line (read-line stream nil)))
    header))
but I keep running into this error:

Code: Select all

debugger invoked on a SB-INT:STREAM-DECODING-ERROR in thread #<THREAD
                                                               "initial thread" RUNNING
                                                               {1002B083B1}>:
  decoding error on stream
  #<SB-SYS:FD-STREAM for "socket 127.0.0.1:31337, peer: 127.0.0.1:51315"
    {1002C33C01}>
  (:EXTERNAL-FORMAT :UTF-8):
    the octet sequence (159) cannot be decoded.
I am running:
  • SBCL 1.0.44 with these *features*:
    (:QUICKLISP :ASDF2 :ASDF :ANSI-CL :COMMON-LISP :SBCL :SB-DOC :SB-TEST :SB-LDB
    :SB-THREAD :SB-LUTEX :SB-PACKAGE-LOCKS :SB-UNICODE :SB-EVAL
    :SB-SOURCE-LOCATIONS :IEEE-FLOATING-POINT :DARWIN :X86-64 :INODE64
    :DARWIN9-OR-BETTER :UNIX :MACH-O :BSD :DARWIN :MACH-EXCEPTION-HANDLER
    :SB-LUTEX :UD2-BREAKPOINTS :GENCGC :STACK-GROWS-DOWNWARD-NOT-UPWARD
    :C-STACK-IS-CONTROL-STACK :LINKAGE-TABLE :COMPARE-AND-SWAP-VOPS
    :UNWIND-TO-FRAME-AND-CALL-VOP :RAW-INSTANCE-INIT-VOPS
    :STACK-ALLOCATABLE-CLOSURES :STACK-ALLOCATABLE-VECTORS
    :STACK-ALLOCATABLE-LISTS :STACK-ALLOCATABLE-FIXED-OBJECTS :ALIEN-CALLBACKS
    :CYCLE-COUNTER :COMPLEX-FLOAT-VOPS :FLOAT-EQL-VOPS :INLINE-CONSTANTS
    :MEMORY-BARRIER-VOPS :OS-PROVIDES-DLOPEN :OS-PROVIDES-DLADDR
    :OS-PROVIDES-PUTWC :OS-PROVIDES-BLKSIZE-T :OS-PROVIDES-SUSECONDS-T)
  • QuickLisp 2010101600
  • slime-20101107-cvs (latest QuickLisp version)
  • usocket-20101006-svn (latest QuickLisp version)
I imagine I can get around this by reading in the bytes and not trying to coerce them into characters, but U+009F strikes me as a valid enough utf-8 character, so I'm really not sure why it wouldn't work. Am I doing something wrong?

ramarren
Posts: 613
Joined: Sun Jun 29, 2008 4:02 am
Location: Warsaw, Poland
Contact:

Re: utf-8 in SBCL

Post by ramarren » Sat Nov 13, 2010 4:48 am

A byte in the range 128 to 159 is not valid on its own in UTF-8. As code points, 128 to 159 are the C1 control characters, which lie outside ASCII.

Re: utf-8 in SBCL

Post by sdp » Sat Nov 13, 2010 9:05 am

Thanks for your quick reply! =)

Shouldn't it still be read as a character like #\u009F?

Code: Select all

COMMON-LISP-USER> (char-code #\u009F)

159
COMMON-LISP-USER> (code-char 159)

#\Application-Program-Command
159 isn't the only code that fails, for example:

Code: Select all

decoding error on stream
#<SB-SYS:FD-STREAM
  for "socket 127.0.0.1:31338, peer: 127.0.0.1:57163"
  {1003149E21}>
(:EXTERNAL-FORMAT :UTF-8):
  the octet sequence (167) cannot be decoded.
   [Condition of type SB-INT:STREAM-DECODING-ERROR]
But we know:

Code: Select all

COMMON-LISP-USER> (code-char 167)

#\SECTION_SIGN
COMMON-LISP-USER> (char-code #\section_sign)

167

Re: utf-8 in SBCL

Post by sdp » Sat Nov 13, 2010 9:40 am

I explored the characters a little further: I can store such a character in a string, create a stream from that string, and read from it:

Code: Select all

COMMON-LISP-USER> (setf *my-string* (make-string 1))
COMMON-LISP-USER> (setf (char *my-string* 0) #\u009F)

#\Application-Program-Command
COMMON-LISP-USER> *my-string*

"?"
COMMON-LISP-USER> (setf *my-stream* (make-string-input-stream *my-string*))

#<SB-IMPL::STRING-INPUT-STREAM {100323AE61}>
COMMON-LISP-USER> (read-line *my-stream* nil)

"?"
T
Now I'm really confused. I'm using usocket to create a stream from a socket; could the problem have something to do with the sockets?

Re: utf-8 in SBCL

Post by ramarren » Sat Nov 13, 2010 9:53 am

While the character with code 159 does exist, in UTF-8 it is encoded as:

Code: Select all

CL-USER> (sb-ext:string-to-octets (string (code-char 159)) :external-format :utf-8)
#(194 159)
UTF-8 is a binary format, in which byte 159 by itself does not denote anything. Remember that Unicode and a particular translation of it into a byte sequence are not the same thing.
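
To make the difference between a code point and its bytes concrete, here is a minimal sketch of the UTF-8 encoding rules in portable Common Lisp. The name utf-8-octets is a hypothetical helper, not part of SBCL, and the sketch does not reject surrogates or values above #x10FFFF.

```lisp
;; Sketch only: UTF-8-OCTETS is a hypothetical helper, not an SBCL function.
;; It does not reject surrogates or code points above #x10FFFF.
(defun utf-8-octets (code-point)
  "Return the list of octets that encode CODE-POINT in UTF-8."
  (cond ((< code-point #x80)
         ;; ASCII range: one octet, identical to the code point
         (list code-point))
        ((< code-point #x800)
         ;; 11 payload bits: 110xxxxx 10xxxxxx
         (list (logior #xC0 (ldb (byte 5 6) code-point))
               (logior #x80 (ldb (byte 6 0) code-point))))
        ((< code-point #x10000)
         ;; 16 payload bits: 1110xxxx 10xxxxxx 10xxxxxx
         (list (logior #xE0 (ldb (byte 4 12) code-point))
               (logior #x80 (ldb (byte 6 6) code-point))
               (logior #x80 (ldb (byte 6 0) code-point))))
        (t
         ;; 21 payload bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
         (list (logior #xF0 (ldb (byte 3 18) code-point))
               (logior #x80 (ldb (byte 6 12) code-point))
               (logior #x80 (ldb (byte 6 6) code-point))
               (logior #x80 (ldb (byte 6 0) code-point))))))

(utf-8-octets 159) ; => (194 159)
(utf-8-octets 167) ; => (194 167)
```

Every octet after the first has the form 10xxxxxx (128 to 191), which is why a byte such as 159 or 167 arriving without a lead byte in front of it cannot be decoded.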

Re: utf-8 in SBCL

Post by sdp » Sat Nov 13, 2010 10:26 am

ramarren wrote: "UTF-8 is a binary format, in which byte 159 by itself does not denote anything."
After a quick inspection of the UTF-8 standard, I can see that for characters between U+0080 and U+07FF, UTF-8 uses 16 bits (two bytes).
ramarren wrote: "Remember that Unicode and a particular translation of it into a byte sequence are not the same thing."
I'm very fuzzy on the notion of Unicode. I know that UTF-8 (being, as you say, a particular translation of Unicode into a byte sequence) is a method of character encoding. I will hesitantly derive from this that Unicode is the set of characters, but please correct me if I missed the mark.

Fair enough: in that case, it seems like the problem I'm running into is malformed utf-8 bytes. Since my input is small enough, I'll read in a sample in bytes and try to figure out where it's going awry.
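
For inspecting a captured sample, a rough checker along these lines can point at the first suspicious byte. It is a simplified sketch in portable Common Lisp: utf-8-sequence-length and first-malformed-position are hypothetical names, and the check only looks at lead bytes and continuation bytes, so it does not catch overlong three- and four-byte forms or encoded surrogates.

```lisp
;; Sketch only: both functions are hypothetical helpers. The check is
;; simplified and ignores overlong 3/4-octet forms and encoded surrogates.
(defun utf-8-sequence-length (lead)
  "Return how many octets a sequence starting with LEAD should contain,
or NIL if LEAD cannot start a UTF-8 sequence."
  (cond ((< lead #x80) 1)
        ((<= #xC2 lead #xDF) 2)
        ((<= #xE0 lead #xEF) 3)
        ((<= #xF0 lead #xF4) 4)
        (t nil)))

(defun first-malformed-position (octets)
  "Return the index of the first octet at which OCTETS stops looking like
UTF-8, or NIL if the whole sequence passes this simplified check."
  (let ((i 0)
        (n (length octets)))
    (loop while (< i n)
          do (let ((len (utf-8-sequence-length (elt octets i))))
               (unless (and len
                            ;; the sequence must not run past the end
                            (<= (+ i len) n)
                            ;; every following octet must be 10xxxxxx
                            (loop for j from (1+ i) below (+ i len)
                                  always (<= #x80 (elt octets j) #xBF)))
                 (return-from first-malformed-position i))
               (incf i len)))
    nil))

(first-malformed-position #(72 105 194 159)) ; => NIL
(first-malformed-position #(72 159))         ; => 1
```

An index where the check fails is a good place to start looking for data that was never meant to be text.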

Re: utf-8 in SBCL

Post by ramarren » Sat Nov 13, 2010 10:41 am

As the ever helpful Wikipedia says, Unicode is a computer industry standard which includes a great many things, among them the character set and a number of encodings that translate strings of those characters into sequences of bytes. UTF-8 is a fairly popular one of those, but its variable-length nature tends to cause problems such as this. If you are not constrained by transmission speed and control the encoding used by the server, UTF-32 might be better. That is what SBCL uses internally, at least on my system.

Re: utf-8 in SBCL

Post by sdp » Sat Nov 13, 2010 10:57 am

ramarren wrote: "If you are not constrained by transmission speed and control the encoding used by the server, UTF-32 might be better. That is what SBCL uses internally, at least on my system."
I do control the encoding used by the server, but not the client, and the specs say UTF-8, so I'll just have to make it work.

I have found something interesting though:

Code: Select all

decoding error on stream
#<SB-SYS:FD-STREAM
  for "socket 127.0.0.1:31339, peer: 127.0.0.1:58177"
  {10031DFC01}>
(:EXTERNAL-FORMAT :UTF-8):
  the octet sequence (232 253 124) cannot be decoded.
   [Condition of type SB-INT:STREAM-DECODING-ERROR]

Restarts:
 0: [ATTEMPT-RESYNC] Attempt to resync the stream at a character boundary and continue.
Before this conversation, I never thought to try the first restart, but when I do:

Code: Select all

decoding error on stream
#<SB-SYS:FD-STREAM
  for "socket 127.0.0.1:31339, peer: 127.0.0.1:58177"
  {10031DFC01}>
(:EXTERNAL-FORMAT :UTF-8):
  the octet sequence (174) cannot be decoded.
   [Condition of type SB-INT:STREAM-DECODING-ERROR]

Restarts:
 0: [ATTEMPT-RESYNC] Attempt to resync the stream at a character boundary and continue.
After those two errors, it seems to work as expected. I saw elsewhere that SBCL can get out of sync while reading; could that be the case here?

Re: utf-8 in SBCL

Post by ramarren » Sat Nov 13, 2010 11:12 am

It could be, but I think it is a bit more likely that the other side is sending malformed data. You should really verify what bytes are actually received. Are you sure that the stream contains UTF-8 encoded data only, and doesn't contain some additional binary control characters?

Re: utf-8 in SBCL

Post by sdp » Sat Nov 13, 2010 11:41 am

Ah, I wasn't properly finishing my loop at the end of the header and it was continuing to read into a binary section.

Here's the modified code that exits properly:

Code: Select all

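(defun read-header (stream)
  (let ((header "")
        (line (read-line stream nil)))
    ;; Stop at end of file (LINE is NIL) or at the blank line that ends the
    ;; header; READ-LINE strips the linefeed but leaves the carriage return.
    (loop while (and line (not (string= line (string #\Return))))
       do
         (setf header (concatenate 'string header line))
         (setf line (read-line stream nil)))
    header))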
Now I just have to figure out how to read-byte from a non-binary stream.
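
One way to sidestep mixing read-line and read-byte is to treat the whole connection as bytes and decode only the header lines. The sketch below rests on several assumptions: the stream was created with element type (unsigned-byte 8) (usocket lets you request this through its :element-type argument), header lines end in CR LF, and an empty line closes the header. read-octet-line and read-header-lines are hypothetical helper names, and sb-ext:octets-to-string is SBCL-specific.

```lisp
;; Sketch only: READ-OCTET-LINE and READ-HEADER-LINES are hypothetical
;; helpers. STREAM is assumed to have element type (unsigned-byte 8).
(defun read-octet-line (stream)
  "Read octets up to the next LF and return them as a vector, without the
LF and without a CR directly before it. Return NIL at end of file."
  (let ((octets (make-array 0 :element-type '(unsigned-byte 8)
                              :adjustable t :fill-pointer 0)))
    (loop for octet = (read-byte stream nil nil)
          do (cond ((null octet)
                    ;; end of file: return a partial line, or NIL if empty
                    (return (if (zerop (length octets)) nil octets)))
                   ((= octet 10)
                    ;; LF ends the line; drop a CR directly before it
                    (when (and (plusp (length octets))
                               (= (aref octets (1- (length octets))) 13))
                      (vector-pop octets))
                    (return octets))
                   (t
                    (vector-push-extend octet octets))))))

(defun read-header-lines (stream)
  "Return the header lines as a list of strings, stopping after the first
empty line or at end of file. The body is left unread on STREAM."
  (loop for line = (read-octet-line stream)
        while (and line (plusp (length line)))
        collect (sb-ext:octets-to-string
                 (coerce line '(simple-array (unsigned-byte 8) (*)))
                 :external-format :utf-8)))
```

After read-header-lines returns, the stream is positioned at the first byte of the binary section, so plain read-byte or read-sequence calls can take over without any character decoding getting in the way.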
