utf-8 in SBCL
Posted: Sat Nov 13, 2010 12:00 am
by sdp
I am trying to read a utf-8 stream:
Code: Select all
(defun read-header (stream)
  (let ((header "")
        (line (read-line stream nil)))
    (loop while (not (eq line ""))
          do
             (setf header (concatenate 'string header line))
             (setf line (read-line stream nil)))
    header))
but I keep running into this error:
Code: Select all
debugger invoked on a SB-INT:STREAM-DECODING-ERROR in thread #<THREAD
"initial thread" RUNNING
{1002B083B1}>:
decoding error on stream
#<SB-SYS:FD-STREAM for "socket 127.0.0.1:31337, peer: 127.0.0.1:51315"
{1002C33C01}>
(:EXTERNAL-FORMAT :UTF-8):
the octet sequence (159) cannot be decoded.
I am running:
- SBCL 1.0.44 with these *features*:
(:QUICKLISP :ASDF2 :ASDF :ANSI-CL :COMMON-LISP :SBCL :SB-DOC :SB-TEST :SB-LDB
:SB-THREAD :SB-LUTEX :SB-PACKAGE-LOCKS :SB-UNICODE :SB-EVAL
:SB-SOURCE-LOCATIONS :IEEE-FLOATING-POINT :DARWIN :X86-64 :INODE64
:DARWIN9-OR-BETTER :UNIX :MACH-O :BSD :DARWIN :MACH-EXCEPTION-HANDLER
:SB-LUTEX :UD2-BREAKPOINTS :GENCGC :STACK-GROWS-DOWNWARD-NOT-UPWARD
:C-STACK-IS-CONTROL-STACK :LINKAGE-TABLE :COMPARE-AND-SWAP-VOPS
:UNWIND-TO-FRAME-AND-CALL-VOP :RAW-INSTANCE-INIT-VOPS
:STACK-ALLOCATABLE-CLOSURES :STACK-ALLOCATABLE-VECTORS
:STACK-ALLOCATABLE-LISTS :STACK-ALLOCATABLE-FIXED-OBJECTS :ALIEN-CALLBACKS
:CYCLE-COUNTER :COMPLEX-FLOAT-VOPS :FLOAT-EQL-VOPS :INLINE-CONSTANTS
:MEMORY-BARRIER-VOPS :OS-PROVIDES-DLOPEN :OS-PROVIDES-DLADDR
:OS-PROVIDES-PUTWC :OS-PROVIDES-BLKSIZE-T :OS-PROVIDES-SUSECONDS-T)
- QuickLisp 2010101600
- slime-20101107-cvs (latest QuickLisp version)
- usocket-20101006-svn (latest QuickLisp version)
I imagine I can get around this by reading in the bytes and not trying to coerce them into characters, but U+009F strikes me as a valid enough utf-8 character, so I'm not sure why it wouldn't work. Am I doing something wrong?
Re: utf-8 in SBCL
Posted: Sat Nov 13, 2010 4:48 am
by ramarren
Bytes 128 to 159 are not valid on their own in UTF-8. As character codes they correspond to the C1 control characters, which are not part of ASCII.
Re: utf-8 in SBCL
Posted: Sat Nov 13, 2010 9:05 am
by sdp
Thanks for your quick reply! =)
Shouldn't it still be read as a character like #\u009F?
Code: Select all
COMMON-LISP-USER> (char-code #\u009F)
159
COMMON-LISP-USER> (code-char 159)
#\Application-Program-Command
159 isn't the only code that fails, for example:
Code: Select all
decoding error on stream
#<SB-SYS:FD-STREAM
for "socket 127.0.0.1:31338, peer: 127.0.0.1:57163"
{1003149E21}>
(:EXTERNAL-FORMAT :UTF-8):
the octet sequence (167) cannot be decoded.
[Condition of type SB-INT:STREAM-DECODING-ERROR]
But we know:
Code: Select all
COMMON-LISP-USER> (code-char 167)
#\SECTION_SIGN
COMMON-LISP-USER> (char-code #\section_sign)
167
Re: utf-8 in SBCL
Posted: Sat Nov 13, 2010 9:40 am
by sdp
I explored the characters a little further. I can store such a character in a string, create a stream from that string, and read from it:
Code: Select all
COMMON-LISP-USER> (setf *my-string* (make-string 1))
COMMON-LISP-USER> (setf (char *my-string* 0) #\u009F)
#\Application-Program-Command
COMMON-LISP-USER> *my-string*
"?"
COMMON-LISP-USER> (setf *my-stream* (make-string-input-stream *my-string*))
#<SB-IMPL::STRING-INPUT-STREAM {100323AE61}>
COMMON-LISP-USER> (read-line *my-stream* nil)
"?"
T
Now I'm really confused. I'm using usocket to create a stream from a socket. Could the problem have something to do with the sockets?
Re: utf-8 in SBCL
Posted: Sat Nov 13, 2010 9:53 am
by ramarren
While the character with code 159 does exist, in UTF-8 it is encoded as:
Code: Select all
CL-USER> (sb-ext:string-to-octets (string (code-char 159)) :external-format :utf-8)
#(194 159)
UTF-8 is a binary format, in which byte 159 by itself does not denote anything. Remember that unicode and a particular translation of it into a byte sequence are not the same thing.
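To make the point about lone bytes concrete, here is a minimal sketch that classifies a single octet by the role it can play in UTF-8. The function name `utf-8-byte-kind` is made up for this illustration, and the ranges follow the UTF-8 byte layout rather than anything specific to SBCL.

```lisp
(defun utf-8-byte-kind (octet)
  "Return a keyword describing the role OCTET can play in a UTF-8 sequence."
  (cond ((<= #x00 octet #x7F) :ascii)         ; a complete one-byte character
        ((<= #x80 octet #xBF) :continuation)  ; only valid after a lead byte
        ((<= #xC2 octet #xDF) :lead-2)        ; starts a two-byte sequence
        ((<= #xE0 octet #xEF) :lead-3)        ; starts a three-byte sequence
        ((<= #xF0 octet #xF4) :lead-4)        ; starts a four-byte sequence
        (t :invalid)))                        ; #xC0, #xC1, #xF5..#xFF never occur
```

With this, `(utf-8-byte-kind 159)` and `(utf-8-byte-kind 167)` both return `:continuation`, while `(utf-8-byte-kind 194)` returns `:lead-2`. A continuation byte with no lead byte in front of it is what the decoder is complaining about.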
Re: utf-8 in SBCL
Posted: Sat Nov 13, 2010 10:26 am
by sdp
ramarren wrote: UTF-8 is a binary format, in which byte 159 by itself does not denote anything.
After a quick look at the UTF-8 standard, I can see that for characters between U+0080 and U+07FF, UTF-8 uses two bytes (16 bits).
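The two-byte case can be worked through by hand. The following is only a sketch of the bit layout (`110xxxxx 10xxxxxx`), with a hypothetical helper name `encode-two-byte`. It is not how SBCL implements its encoder.

```lisp
(defun encode-two-byte (code-point)
  "Encode CODE-POINT (U+0080..U+07FF) as a list of two UTF-8 octets."
  (assert (<= #x80 code-point #x7FF))
  (list (logior #xC0 (ash code-point -6))        ; 110xxxxx: the top five bits
        (logior #x80 (logand code-point #x3F)))) ; 10xxxxxx: the low six bits
```

For U+009F, `(encode-two-byte 159)` gives `(194 159)`: the high bits `00010` go into the lead byte (`#xC2`, 194) and the low bits `011111` into the continuation byte (`#x9F`, 159).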
ramarren wrote: Remember that unicode and a particular translation of it into a byte sequence are not the same thing.
I'm very fuzzy on the notion of Unicode. I know that UTF-8 (being, as you say, a particular translation of Unicode into a byte sequence) is a method of character encoding. I will hesitantly derive from this that Unicode is the set of characters, but please correct me if I missed the mark.
Fair enough: in that case, it seems like the problem I'm running into is malformed utf-8 bytes. Since my input is small enough, I'll read in a sample in bytes and try to figure out where it's going awry.
Re: utf-8 in SBCL
Posted: Sat Nov 13, 2010 10:41 am
by ramarren
As the ever helpful Wikipedia says, Unicode is a computer industry standard that covers a great many things, among them the character set and a number of encodings that translate strings of those characters into sequences of bytes. UTF-8 is a fairly popular one of those, but its variable-length nature tends to cause problems such as this. If you are not constrained by transmission speed and you control the encoding used by the server, UTF-32 might be better. That is what SBCL uses internally, at least on my system.
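The split between a character and its encoded bytes can be seen by encoding the same character under two external formats. This is a small sketch using `sb-ext:string-to-octets`, which already appears earlier in the thread. The wrapper `octets-of` is a name invented here.

```lisp
(defun octets-of (code external-format)
  "Encode the character with code CODE using EXTERNAL-FORMAT; return a list of octets."
  (coerce (sb-ext:string-to-octets (string (code-char code))
                                   :external-format external-format)
          'list))

(octets-of 159 :latin-1) ; one octet: the code itself
(octets-of 159 :utf-8)   ; two octets: lead byte plus continuation byte
```

The character is the same in both calls. Only the byte sequence that represents it changes, which is why a stream declared as UTF-8 cannot accept a bare byte 159.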
Re: utf-8 in SBCL
Posted: Sat Nov 13, 2010 10:57 am
by sdp
ramarren wrote: If you are not transmission speed constrained and control the encoding used by the server a UTF-32 might be better. That is what SBCL uses internally, at least on my system.
I do control the encoding used by the server, but not the client, and the specs say utf-8, so I'll just have to make it work.
I have found something interesting though:
Code: Select all
decoding error on stream
#<SB-SYS:FD-STREAM
for "socket 127.0.0.1:31339, peer: 127.0.0.1:58177"
{10031DFC01}>
(:EXTERNAL-FORMAT :UTF-8):
the octet sequence (232 253 124) cannot be decoded.
[Condition of type SB-INT:STREAM-DECODING-ERROR]
Restarts:
0: [ATTEMPT-RESYNC] Attempt to resync the stream at a character boundary and continue.
Before this conversation, I never thought to try the first restart, but when I do:
Code: Select all
decoding error on stream
#<SB-SYS:FD-STREAM
for "socket 127.0.0.1:31339, peer: 127.0.0.1:58177"
{10031DFC01}>
(:EXTERNAL-FORMAT :UTF-8):
the octet sequence (174) cannot be decoded.
[Condition of type SB-INT:STREAM-DECODING-ERROR]
Restarts:
0: [ATTEMPT-RESYNC] Attempt to resync the stream at a character boundary and continue.
After those two errors, it seems to work as expected. I saw elsewhere that SBCL can get out of sync while reading. Could that be the case here?
Re: utf-8 in SBCL
Posted: Sat Nov 13, 2010 11:12 am
by ramarren
It could be, but I think it is a bit more likely that the other side is sending malformed data. You should really verify what bytes are actually received. Are you sure that the stream contains UTF-8 encoded data only, and doesn't contain some additional binary control characters?
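One way to check the received bytes is to capture them as octets and look for the first position where they stop being well-formed UTF-8. The sketch below is a simplified checker written for this purpose. The name `first-malformed-offset` is hypothetical, and it only verifies lead bytes and continuation counts, not overlong three- and four-byte forms or surrogates.

```lisp
(defun first-malformed-offset (octets)
  "Return the index where OCTETS first stops being well-formed UTF-8, or NIL.
Simplified: checks lead bytes and continuation counts only."
  (let ((i 0)
        (n (length octets)))
    (loop while (< i n)
          do (let* ((lead (aref octets i))
                    ;; Number of continuation bytes this lead byte demands.
                    (extra (cond ((<= lead #x7F) 0)
                                 ((<= #xC2 lead #xDF) 1)
                                 ((<= #xE0 lead #xEF) 2)
                                 ((<= #xF0 lead #xF4) 3)
                                 (t nil)))) ; stray continuation or invalid byte
               (unless (and extra
                            (<= (+ i extra) (1- n))
                            (loop for k from 1 to extra
                                  always (<= #x80 (aref octets (+ i k)) #xBF)))
                 (return i))
               ;; Skip the whole sequence and examine the next lead byte.
               (incf i (1+ extra))))))
```

Applied to the sequences from the error messages, `(first-malformed-offset #(232 253 124))` returns `0`, because 232 announces a three-byte sequence but 253 is not a continuation byte. A stray `174` is flagged the same way, while `#(194 159)` passes and returns `NIL`.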
Re: utf-8 in SBCL
Posted: Sat Nov 13, 2010 11:41 am
by sdp
Ah, I wasn't properly finishing my loop at the end of the header and it was continuing to read into a binary section.
Here's the modified code that exits properly:
Code: Select all
(defun read-header (stream)
  (let ((header "")
        (line (read-line stream nil)))
    (loop while (not (string= line (string #\Return)))
          do
             (setf header (concatenate 'string header line))
             (setf line (read-line stream nil)))
    header))
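One caveat with the loop above: `read-line` returns `NIL` at end of file, and `NIL` never compares equal to the carriage-return string, so a connection that closes before the blank line would keep the loop spinning. A possible variant, offered as a sketch rather than a replacement, stops on either condition and returns the lines as a list. The name `read-header-lines` is invented here.

```lisp
(defun read-header-lines (stream)
  "Collect header lines from STREAM until a blank line or end of file."
  (loop for line = (read-line stream nil nil)
        ;; Lines end in CRLF on the wire; READ-LINE strips only the LF.
        for trimmed = (and line (string-right-trim '(#\Return) line))
        ;; Stop at end of file (NIL) or at the empty line that ends the header.
        while (and trimmed (string/= trimmed ""))
        collect trimmed))
```

Trimming the trailing `#\Return` also keeps stray carriage returns out of the collected header text.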
Now I just have to figure out how to read-byte from a non-binary stream.
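One possible approach, sketched under assumptions: open the connection as a binary stream instead (usocket's `socket-connect` and `socket-accept` take an `:element-type` argument, so `'(unsigned-byte 8)` should work there), read the header as octets up to the blank line, and decode only that part as UTF-8. The function name `read-header-octets` is hypothetical, and the code assumes the header ends with CRLF CRLF.

```lisp
(defun read-header-octets (stream)
  "Read octets from binary STREAM up to and including the first CRLF CRLF.
Returns them decoded as a UTF-8 string; stops early at end of file."
  (let ((buffer (make-array 64 :element-type '(unsigned-byte 8)
                               :adjustable t :fill-pointer 0)))
    (loop for byte = (read-byte stream nil nil)
          while byte
          do (vector-push-extend byte buffer)
          ;; Stop once the last four octets are CR LF CR LF.
          until (let ((n (fill-pointer buffer)))
                  (and (>= n 4)
                       (null (mismatch #(13 10 13 10) buffer
                                       :start2 (- n 4))))))
    ;; Decode only the header; the stream is left at the first body octet.
    (sb-ext:octets-to-string
     (coerce buffer '(simple-array (unsigned-byte 8) (*)))
     :external-format :utf-8)))
```

After this returns, further `read-byte` calls on the same stream yield the binary section untouched, so bytes such as 159 or 167 never reach the UTF-8 decoder.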