Wodin wrote: I think you're getting tripped up by the fact that a character can be represented in various different ways (using a different number of bytes per character depending on the representation). What happens if you have a text stream made up of characters where the underlying data is in two bytes per character and you read the first byte? If you were to read a character after that, it would start halfway through the first character and stop halfway through the second. This gets even worse with something like UTF-8, where the number of bytes per character depends on the character. That is, (CHAR-CODE (READ-BYTE stream)) doesn't make sense, because a byte is not a char. A byte (e.g. the value 97 stored using the normal representation for a small integer in a single byte) might use the same underlying bit pattern as a char (e.g. #\a), but it might not, depending on the encoding you're using.
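To make the variable-width point concrete, here is a minimal sketch of UTF-8 encoding in portable CL. UTF-8-OCTETS is a hypothetical helper, not a standard function. It takes an integer code point rather than a character, so it does not depend on what CHAR-CODE returns in any particular implementation, and it does no validation (surrogates and values above #x10FFFF are not rejected).

```lisp
;; Hypothetical helper for illustration only; not part of standard CL.
(defun utf-8-octets (code-point)
  "Return the UTF-8 encoding of the integer CODE-POINT as a list of octets."
  (cond ((< code-point #x80)
         ;; 7 bits or fewer: one octet, identical to ASCII.
         (list code-point))
        ((< code-point #x800)
         ;; Up to 11 bits: two octets.
         (list (logior #xC0 (ash code-point -6))
               (logior #x80 (logand code-point #x3F))))
        ((< code-point #x10000)
         ;; Up to 16 bits: three octets.
         (list (logior #xE0 (ash code-point -12))
               (logior #x80 (logand (ash code-point -6) #x3F))
               (logior #x80 (logand code-point #x3F))))
        (t
         ;; Everything else: four octets.
         (list (logior #xF0 (ash code-point -18))
               (logior #x80 (logand (ash code-point -12) #x3F))
               (logior #x80 (logand (ash code-point -6) #x3F))
               (logior #x80 (logand code-point #x3F))))))

(utf-8-octets 97)     ; U+0061 "a" => (97)
(utf-8-octets #xE9)   ; U+00E9 "é" => (195 169)
(utf-8-octets #x20AC) ; U+20AC "€" => (226 130 172)
```

Reading a single byte out of the second or third sequence leaves the stream in the middle of a character, which is the situation described above.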
Right, I understand some of the corner cases. I think I also made a mistake in what I wrote above. The main issue I ran into is that READ-BYTE wouldn't work on a character stream. So why not have READ-BYTE return (CHAR-CODE (READ-CHAR stream))? That is, what should the semantics of READ-BYTE be when applied to a character stream? It seems like you have a couple of choices:
- Error. That's what Lisp does now. This seems overly harsh, IMO.
- Try to return something useful. To me, READ-BYTE suggests "Give me the next object from the stream in numeric form." If the next object is a character in a character stream, it would make sense to return it as the CHAR-CODE of whatever that character is. IMO, you have two choices for anything other than ISO 8859-1 characters. (1) You could error as now. (2) You could return a number greater than 255 and let the programmer deal with this. #2 is probably more useful in practice.
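The second option can be sketched in portable CL as a wrapper. READ-BYTE-OR-CHAR-CODE is a hypothetical name, and the sketch assumes that STREAM-ELEMENT-TYPE reports a subtype of CHARACTER for character streams, which may not hold for every implementation's bivalent streams.

```lisp
;; Hypothetical helper, not part of standard CL: READ-BYTE with the
;; "return something useful" semantics described above.
(defun read-byte-or-char-code (stream &optional (eof-error-p t) eof-value)
  "Read one element from STREAM and return it as an integer."
  (if (subtypep (stream-element-type stream) 'character)
      ;; Character stream: read a character and return its code.
      ;; The code may exceed 255; the caller has to deal with that.
      (let ((char (read-char stream eof-error-p nil)))
        (if char
            (char-code char)
            eof-value))
      ;; Binary stream: plain READ-BYTE.
      (read-byte stream eof-error-p eof-value)))

(with-input-from-string (in "ab")
  (list (read-byte-or-char-code in)
        (read-byte-or-char-code in)
        (read-byte-or-char-code in nil :eof)))
```

On an implementation whose character codes agree with ASCII, the last form should return (97 98 :EOF).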
There must surely be a way to say "convert this string into a bunch of bytes using the following character encoding" and then use that with READ-BYTE. (I have only just started with Lisp, so I don't know how to do this.)
Unfortunately not in standard CL. There is CHAR-CODE to convert a character to the CL implementation's chosen numeric representation, but it doesn't handle all the various issues in today's multilingual world. CL was standardized in the early 1990s, based on Lisps that had been around since the 1960s, so much of this simply wasn't a consideration.
That said, many CL implementations provide extensions that do exactly what you said. And there are libraries like FLEXI-STREAMS to handle the issues, too. It's just frustrating that they have to exist even for the simple cases, because the code gets longer and more verbose.
By the way, what was it you were trying to do exactly?
Just writing a simple set of functions to deal with textual data coming over a socket. I wanted to test them in the REPL. The most obvious way to do that was to use WITH-INPUT-FROM-STRING to create a stream that could be read, but because my functions had been designed to work over a socket, they used READ-BYTE rather than READ-CHAR and so quickly broke.
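For REPL testing of that kind, one possible workaround is to encode the test string into octets and wrap them in an in-memory binary stream. This is only a sketch: it assumes Quicklisp is installed and uses the third-party FLEXI-STREAMS library (STRING-TO-OCTETS and WITH-INPUT-FROM-SEQUENCE). READ-OCTET-LINE is a hypothetical stand-in for the socket-oriented functions. Evaluate the forms at the REPL one at a time.

```lisp
;; Assumption: Quicklisp is available to load the third-party library.
(ql:quickload :flexi-streams)

;; Hypothetical stand-in for a function written against a socket:
;; reads octets up to a linefeed (code 10) or end of stream.
(defun read-octet-line (stream)
  (loop for octet = (read-byte stream nil nil)
        while (and octet (/= octet 10))
        collect octet))

;; Encode the test string as UTF-8 octets, then read them back
;; through a binary in-memory stream that READ-BYTE accepts.
(let ((octets (flexi-streams:string-to-octets
               (format nil "hello~%world")
               :external-format :utf-8)))
  (flexi-streams:with-input-from-sequence (in octets)
    (read-octet-line in)))
;; => (104 101 108 108 111)
```

The function under test keeps using READ-BYTE unchanged; only the test harness knows about the encoding.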
P.S. I am not an expert on character encodings, Unicode, etc. so I may have got this a bit wrong, but I think the main point is BYTE != CHAR.
Yes, but in the real world they often are equivalent.
And certainly there are other more rational responses to that issue than "READ-BYTE does not work on a character stream."