Strings, characters, and binary encodings

Post by **findinglisp** » Thu Jul 10, 2008 10:55 pm

Am I the only one who ever gets frustrated when working with Common Lisp's strings and characters, particularly when you have to stuff them into binary streams? I was trying to do some simple stuff tonight and just about pulled my hair out bumping into what I think is silly behavior. Let me count the annoyances:

1. You can't call READ-BYTE on a text stream. This means that you can't easily use WITH-INPUT-FROM-STRING to create a stream that will be read using READ-BYTE. Can anybody think of a reason that READ-BYTE wouldn't return (CHAR-CODE (READ-BYTE stream)), if stream was a text stream?
2. You can't include any binary data in a string. The reader doesn't interpret things like \n or \0 as C would. I suppose this is a symptom of trying to be compatible with early Lisps that came before C was invented, but this behavior is standard in just about every programming language out there today.
3. Because of the previous limitation, you need to use something like CONCATENATE to put some binary data in the middle of a string. Unfortunately, something like (CONCATENATE 'STRING "Foo" #\newline) doesn't work because #\newline isn't a sequence. So you have to use something like (CONCATENATE 'STRING "Foo" (LIST #\newline)) to create this. It would seem that CONCATENATE could just treat any ATOM as a list of length 1 and be done with it.
4. Then, you're constantly using CODE-CHAR and CHAR-CODE to convert back and forth between character objects and integers.
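
For what it's worth, here is a minimal sketch of how the CONCATENATE and CHAR-CODE annoyances above can be softened with standard functions only: STRING turns a character into a one-character string (which is a sequence), FORMAT's ~% directive emits a newline, and MAP applies CHAR-CODE or CODE-CHAR across a whole sequence at once. The names `*line-a*`, `*line-b*`, `string-to-codes`, and `codes-to-string` are made up for this illustration.

```lisp
;; STRING coerces a character into a one-character string, so CONCATENATE
;; accepts it without wrapping the character in a list.
(defparameter *line-a* (concatenate 'string "Foo" (string #\Newline)))

;; FORMAT's ~% directive inserts a newline, which often reads better.
(defparameter *line-b* (format nil "Foo~%"))

;; Hypothetical helpers: convert a whole sequence in one call instead of
;; calling CHAR-CODE / CODE-CHAR element by element.
(defun string-to-codes (string)
  "Return the list of character codes of STRING."
  (map 'list #'char-code string))

(defun codes-to-string (codes)
  "Return a string built from a sequence of character codes."
  (map 'string #'code-char codes))
```

Both variables end up holding the same four-character string, and `(codes-to-string (string-to-codes "GET"))` gives back "GET". The codes themselves are whatever the implementation uses, which in practice is ASCII or Unicode.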

Any code I write that does a lot of string manipulation to implement a protocol seems to explode in verbosity to work around all that. Sure, I can create wrapper functions/macros that hide it all, but why wouldn't the standard library just do the right thing with all this?

Maybe I've been playing around too much with Perl and Ruby lately and I'm used to all the fast-and-loose type coercion that goes on there. It sure reduces the typing, though.

Post by **Alexander Lehmann** » Fri Jul 11, 2008 12:20 am

Hi,

although this isn't a complete answer to your posting you might want to have a look at this and search for "Reader". The author shows a pretty straightforward way to define a useful reader macro for strings. However, this function would have to be manipulated in order to get the exact behaviour that you desire.

Post by **findinglisp** » Fri Jul 11, 2008 1:02 am

Yup, absolutely. As I said, the great news is that CL is so flexible, all these things are fairly easily corrected by wrapping the existing functionality with other functions or macros, including different readtables. The frustrating thing is that they don't work the way you'd (I'd?) expect right out of the box. It's just an annoyance, not a fatal flaw.

Post by **Wodin** » Fri Jul 11, 2008 1:06 am

I think you're getting tripped up by the fact that a character can be represented in various different ways (using a different number of bytes per character depending on the representation). What happens if you have a text stream made up of characters where the underlying data is in two bytes per character and you read the first byte? If you were to read a character after that it would start half way through the first character and stop half way through the second. This gets even worse with something like utf8 where the number of bytes per character depends on the character. i.e. (CHAR-CODE (READ-BYTE stream)) doesn't make sense, because a byte is not a char. A byte (e.g. the value 97 stored using the normal representation for a small integer in a single byte) might use the same underlying bit pattern as a char (e.g. #\a), but it might not, depending on the encoding you're using.
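
To make the "number of bytes per character depends on the character" point concrete, here is a rough sketch of UTF-8 encoding for a single Unicode code point. The function name `utf-8-octets` is invented for this example, it takes an integer code point rather than a character so that it does not depend on the implementation's CHAR-CODE, and it does no validation (surrogates and out-of-range values are not rejected).

```lisp
(defun utf-8-octets (code)
  "Return the UTF-8 encoding of the code point CODE as a list of octets.
Illustrative only: no checks for surrogates or values above #x10FFFF."
  (cond ((< code #x80)                  ; 1 octet: plain ASCII
         (list code))
        ((< code #x800)                 ; 2 octets
         (list (logior #xC0 (ldb (byte 5 6) code))
               (logior #x80 (ldb (byte 6 0) code))))
        ((< code #x10000)               ; 3 octets
         (list (logior #xE0 (ldb (byte 4 12) code))
               (logior #x80 (ldb (byte 6 6) code))
               (logior #x80 (ldb (byte 6 0) code))))
        (t                              ; 4 octets
         (list (logior #xF0 (ldb (byte 3 18) code))
               (logior #x80 (ldb (byte 6 12) code))
               (logior #x80 (ldb (byte 6 6) code))
               (logior #x80 (ldb (byte 6 0) code))))))
```

Code point 97 (the letter a) encodes as the single octet 97, while #xE9 (e with acute accent) becomes the two octets #xC3 #xA9 and #x20AC (the euro sign) becomes three. Reading "one byte" from such data lands in the middle of a character for anything outside ASCII.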

There must surely be a way to say "convert this string into a bunch of bytes using the following character encoding" and then use that with READ-BYTE. (I have only just started with Lisp, so I don't know how to do this.)

I agree with your point 2.

The rest sounds nasty and point 4 sounds like you might be doing it wrong. (Just a feeling I get :) )

By the way, what was it you were trying to do exactly?

P.S. I am not an expert on character encodings, Unicode, etc. so I may have got this a bit wrong, but I think the main point is BYTE != CHAR.

Post by **findinglisp** » Fri Jul 11, 2008 8:19 am

Wodin wrote: I think you're getting tripped up by the fact that a character can be represented in various different ways (using a different number of bytes per character depending on the representation). What happens if you have a text stream made up of characters where the underlying data is in two bytes per character and you read the first byte? If you were to read a character after that it would start half way through the first character and stop half way through the second. This gets even worse with something like utf8 where the number of bytes per character depends on the character. i.e. (CHAR-CODE (READ-BYTE stream)) doesn't make sense, because a byte is not a char. A byte (e.g. the value 97 stored using the normal representation for a small integer in a single byte) might use the same underlying bit pattern as a char (e.g. #\a), but it might not, depending on the encoding you're using.

Right, I understand some of the corner cases. I think I also made a mistake in what I wrote above. The main issue I ran into is that READ-BYTE wouldn't work on a character stream. So why not have READ-BYTE return (CHAR-CODE (READ-CHAR stream))? That is, what should the semantics of READ-BYTE be when applied to a character stream? It seems like you have a couple of choices:

1. Error. That's what Lisp does now. This seems overly harsh, IMO.
2. Try to return something useful. To me, READ-BYTE suggests "Give me the next object from the stream in numeric form." If the next object is a character in a character stream, it would make sense to return it as the CHAR-CODE of whatever that character is. IMO, you then have two sub-choices for anything other than ISO 8859-1 characters: (a) you could error as now, or (b) you could return a number greater than 255 and let the programmer deal with it. Option (b) is probably more useful in practice.
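
The "return something useful" semantics can be sketched as a wrapper in portable CL. This is only an illustration of the proposal, not anything the standard provides: `read-byte*` is a made-up name, and it takes the "return a number greater than 255 and let the programmer deal with it" route for characters outside ISO 8859-1.

```lisp
(defun read-byte* (stream &optional (eof-error-p t) eof-value)
  "Hypothetical lenient READ-BYTE: on a character stream, return the
CHAR-CODE of the next character; on a binary stream, behave like READ-BYTE."
  (if (subtypep (stream-element-type stream) 'character)
      ;; Character stream: read a character and hand back its code.
      (let ((ch (read-char stream nil nil)))
        (cond (ch (char-code ch))
              (eof-error-p (error 'end-of-file :stream stream))
              (t eof-value)))
      ;; Binary stream: unchanged standard behavior.
      (read-byte stream eof-error-p eof-value)))
```

With this, `(with-input-from-string (s "Hi") (read-byte* s))` returns `(char-code #\H)`, which is 72 on ASCII-compatible implementations, instead of signaling an error.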

Wodin wrote: There must surely be a way to say "convert this string into a bunch of bytes using the following character encoding" and then use that with READ-BYTE. (I have only just started with Lisp, so I don't know how to do this.)

Unfortunately not in standard CL. There is CHAR-CODE to convert a character to the CL implementation's chosen numeric representation, but it doesn't handle all the various issues in today's multi-lingual world. CL was standardized in the early 1990s, based on Lisps that had been around since the 1960s, and so much of this just wasn't a consideration.

That said, many CL implementations provide extensions that do exactly what you said. And there are many libraries like FLEXI-STREAMS to handle the issues, too. It's just frustrating that they have to exist for some of the simple cases as the code just gets longer and more verbose.

Wodin wrote: By the way, what was it you were trying to do exactly?

Just writing a simple set of functions to deal with textual data coming over a socket. I wanted to test them in the REPL. The most obvious way to do that was to use WITH-INPUT-FROM-STRING to create a stream that could be read, but because my functions had been designed to work over a socket, they used READ-BYTE rather than READ-CHAR and so quickly broke.
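
One workaround of the kind I mean, sketched under the assumption that the character codes involved are ASCII-compatible: have the protocol functions pull octets from a closure rather than calling READ-BYTE directly, so the same code can be fed from a string in the REPL or from a binary socket stream. All three names here (`make-octet-source`, `stream-octet-source`, `read-protocol-line`) are invented for the example.

```lisp
(defun make-octet-source (string)
  "Return a closure that yields the CHAR-CODE of each character of STRING
in turn, then NIL at the end. Meant for testing from the REPL."
  (let ((index 0))
    (lambda ()
      (when (< index (length string))
        (prog1 (char-code (char string index))
          (incf index))))))

(defun stream-octet-source (stream)
  "Return a closure that yields octets from a binary STREAM, then NIL at EOF."
  (lambda () (read-byte stream nil nil)))

(defun read-protocol-line (next-octet)
  "Collect octets from NEXT-OCTET up to a linefeed (10) or end of input and
return them as a string. Carriage returns (13) are dropped."
  (with-output-to-string (out)
    (loop for octet = (funcall next-octet)
          until (or (null octet) (= octet 10))
          unless (= octet 13)
            do (write-char (code-char octet) out))))
```

In the REPL, `(read-protocol-line (make-octet-source "HELO"))` exercises the parser without a socket; in production the same function would get `(stream-octet-source socket-stream)`. It works, but it is exactly the sort of extra plumbing I was grumbling about.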

Wodin wrote: P.S. I am not an expert on character encodings, Unicode, etc. so I may have got this a bit wrong, but I think the main point is BYTE != CHAR.

Yes, but in the real world they often are equivalent.

And certainly there are other more rational responses to that issue than "READ-BYTE does not work on a character stream."

Post by **Alexander Lehmann** » Fri Jul 11, 2008 10:44 am

Hm, I'd rather stick with the necessity of using CHAR-CODE & friends than having READ-BYTE read more than *a byte*.

Post by **findinglisp** » Fri Jul 11, 2008 4:30 pm

Alexander Lehmann wrote: Hm, I'd rather stick with the necessity of using CHAR-CODE & friends than having READ-BYTE read more than *a byte*.

But that's the point. It wouldn't ever do that. If you pass READ-BYTE a character stream, it's a stream of abstract, unencoded characters. It would simply pull out the first character and return its CHAR-CODE. Since the stream was an unencoded character stream to begin with, you never could have pulled a single byte out of it to begin with. In the event that the stream is a binary stream, then READ-BYTE does exactly what it does now, returns a single byte, never more, never less.

Post by **Alexander Lehmann** » Sat Jul 12, 2008 8:03 am

Maybe this was a misunderstanding. Of course I see your point; however, I'd not like READ-BYTE to behave in a way other than reading a byte from a given stream (because that's what I think it is supposed to do). Instead I'd prefer another standard solution for the mentioned problem. Either way, as I said, I do understand your complaint about the default behaviour of the standard implementation (at least I think so). It's just that having READ-BYTE _not_ return a byte but an arbitrary, possibly multibyte, char-code could (IMHO) only be worse.
Anyways, I think we'll just have to live with it.

Post by **fofikos** » Sat Jul 12, 2008 8:05 am

I think the "try to return something useful" approach that you advocate here has no place in Common Lisp. CL may have a couple of inconsistencies but the spec is one of extremely high quality compared to the horrors of PHP where this "context sensitive" philosophy is embraced.

Solutions to this non-issue exist (you mentioned FLEXI-STREAMS) and should be used if you have to use streams. Alternatively you could simply work with sequences.

Post by **findinglisp** » Sun Jul 13, 2008 2:15 pm

fofikos wrote: I think the "try to return something useful" approach that you advocate here has no place in common lisp. CL may have a couple of inconsistencies but the spec is one of extremely high quality compared to the horrors of php where this "context sensitive" philosophy is embraced.

I agree with you that the CL spec is very high quality. But that doesn't mean that the behavior it specifies is perfect in all cases. IMO, this is one of the cases where it isn't. While I would agree that you can go too far with the "try to return something useful" approach, IMO throwing an error when READ-BYTE is used on a character stream isn't the answer either. If we disagree on that, fine, you're entitled to your opinion.

fofikos wrote: Solutions to this non-issue exist (you mentioned flexistreams) and should be used if you have to use streams. Alternatively you could simply work with sequences.

Respectfully, the fact that solutions exist for this proves that it really is an issue. I can understand if you think it's not an issue for you, but please don't tell me that it isn't an issue for me. As for ways that I could work around this, I know what the answers are there. I had already done two different implementations with workarounds before I posted here. It's the fact that I know I'm writing workaround code that I posted about, not that I couldn't find a solution.

IMO, the streams portion of the CL spec is one of the weakest sections of the spec. Again, I'm not saying that the spec writing is bad or that it's ill-specified. The spec does a great job at specifying a weak set of functionality. Among the problems I can see are:

1. It doesn't use CLOS for the stream types. Therefore, you must use pseudo-standard Gray Streams to create your own stream types, or go with something somewhat implementation-specific like simple streams. Ironically, the number of types that the standard does define is abnormally large when compared to other languages (things like echo streams, etc.). But in a language that prides itself on being able to be extended, they failed to give you a standard way to create your own stream type. Even if you admit that Gray Streams is standard-enough functionality that it might as well be considered part of the standard, you still have things like Allegro's Simple Streams being created to address other issues in Gray Streams (simplicity of user implementation).
2. It was designed before i18n/Unicode/etc. and so it doesn't have any standard notion of character encodings. This was a simple matter of timing and couldn't be helped, but the fact that the spec hasn't been revised in 17 years means we're still living with this in an age where i18n is now commonplace. Again, implementation-specific solutions exist and many implementations are now adding Unicode support, but the support is far from universal and the interfaces sometimes differ dramatically.
3. The CL spec has a hard line between character streams and binary streams. A stream is either one or the other, and things like READ-CHAR or READ-BYTE don't operate on the opposite stream type. Implementation-specific extensions exist for things like bivalent streams, and FLEXI-STREAMS does a reasonable job of handling this in a fairly portable library. While this doesn't create much of a problem when dealing with files, it's a much bigger issue when working with protocols and sockets that regularly transmit both character and binary data over the same stream (HTTP being the most obvious example).

So, I think it's fair to say that while the CL spec is well written and specifies the language well, there are places where the functionality that is specified could be better. IMO, streams is one of the most obvious. Some of these are accidental issues of timing (i18n). Others were known at the time the standard was being ratified (Gray Streams almost made it in, but didn't because the standardization process was too far along by the time they were proposed). In response, a number of non-standard solutions exist, but the weaknesses in the standard functionality are all the more visible precisely because there are a host of non-standard libraries to deal with the issues.