File I/O woes

Duke
Posts: 38
Joined: Sat Oct 17, 2009 10:40 pm

File I/O woes

Post by Duke » Sun Aug 29, 2010 2:20 pm

I was working on Markov chains yesterday. I tried to read a file as a list of strings, which proved surprisingly difficult. There doesn't seem to be a function analogous to C++'s stream insertion operator, though there is a read-line. I could just do (cl-ppcre:split (read-line stream)) and append the lines together. In fact...

Code:

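(with-open-file (in txt)
  (labels ((rec ()
             ;; PEEK-CHAR with T skips whitespace and returns the
             ;; END-OF-FILE marker once nothing but whitespace is left.
             (unless (equal 'end-of-file
                            (peek-char t in nil 'end-of-file))
               ;; Split the current line on spaces and prepend its
               ;; words to the words of the remaining lines.
               (append
                (cl-ppcre:split " " (read-line in))
                (rec)))))
    (rec)))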
...It took all of about five minutes and worked perfectly on the first try, parsing a rather large book in 0.25 seconds. Don't I feel like an ass.

Here's my attempt at read-word anyway.

Code:

(defun read-word (stream)
  (let ((c (peek-char nil stream nil)))
    (if (or (char= c #\Space)
            (char= c  #\Newline))
        (string (read-char stream))
        (concatenate 'string
                     (string (read-char stream))
                     (read-word stream)))))
This is surely reinventing the wheel, poorly, and preserves whitespace in the returned string. Would anyone care to suggest how I can get the correct behavior without complicating the code? Or the name of a library that already has read-word?
"If you want to improve, be content to be thought foolish and stupid." -Epictetus

ramarren
Posts: 613
Joined: Sun Jun 29, 2008 4:02 am
Location: Warsaw, Poland

Re: File I/O woes

Post by ramarren » Sun Aug 29, 2010 11:33 pm

Duke wrote:There doesn't seem to be a function analogous to C++'s stream insertion operator
I think you meant stream extraction operator? Anyway, there is READ, except obviously since CL is dynamically typed it is not an exact equivalent, since it has to determine the type of the object solely from the representation.

What I would probably do is read an entire file into memory (using READ-FILE-INTO-STRING from Alexandria) and split it there. If you are going to keep the entire list in memory anyway then doubling the required size is usually irrelevant.
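To make the "split it there" step concrete, here is a minimal sketch in plain Common Lisp, with no library dependencies. The names WHITESPACE-CHAR-P and SPLIT-ON-WHITESPACE are made up for this example, and the choice of which characters count as whitespace is an assumption you may want to adjust.

```lisp
(defun whitespace-char-p (char)
  "True if CHAR is a space, tab, newline, or carriage return."
  (member char '(#\Space #\Tab #\Newline #\Return)))

(defun split-on-whitespace (text)
  "Return the whitespace-separated words of TEXT as a list of fresh strings."
  (loop with end = (length text)
        ;; START is the index of the next word, or NIL when none is left.
        for start = (position-if-not #'whitespace-char-p text)
          then (position-if-not #'whitespace-char-p text :start stop)
        ;; STOP is the index just past that word.
        for stop = (and start
                        (or (position-if #'whitespace-char-p text :start start)
                            end))
        while start
        collect (subseq text start stop)))
```

With Alexandria loaded, something like (split-on-whitespace (alexandria:read-file-into-string "book.txt")) would then give the word list, where "book.txt" is just a placeholder file name. Because newlines are treated as whitespace, no stray line breaks end up in the words.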

If you actually needed a word stream, you would most likely have to implement buffering. Not for the I/O itself, which most likely has its own buffering, but to avoid constructing each string character by character, since you would usually either block-copy or just point into the buffer. Note that when constructing strings of unknown length it is best to use WITH-OUTPUT-TO-STRING, since it avoids multiple concatenations.
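As a rough illustration of the WITH-OUTPUT-TO-STRING suggestion, here is one way a READ-WORD could look. This is only a sketch: it still reads character by character, treats space, tab, newline, and carriage return as separators (my assumption), and returns NIL at end of file instead of signalling an error.

```lisp
(defun read-word (stream)
  "Read the next whitespace-delimited word from STREAM; return NIL at end of file."
  (flet ((blankp (c) (member c '(#\Space #\Tab #\Newline #\Return))))
    ;; Skip leading whitespace without consuming the first word character.
    (loop for c = (peek-char nil stream nil nil)
          while (and c (blankp c))
          do (read-char stream))
    ;; Accumulate characters in a string stream until whitespace or EOF,
    ;; so no intermediate strings are concatenated.
    (let ((word (with-output-to-string (out)
                  (loop for c = (read-char stream nil nil)
                        while (and c (not (blankp c)))
                        do (write-char c out)))))
      (if (plusp (length word)) word nil))))
```

Collecting all words is then a matter of (loop for word = (read-word in) while word collect word) inside a WITH-OPEN-FILE form.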

Even better, for large data sets, use a foreign function interface and bind MMAP. I think Osicat has a binding, although the last time I tried it, it had trouble on 32-bit systems because of MMAP's return convention.

Re: File I/O woes

Post by Duke » Mon Aug 30, 2010 12:01 am

Ramarren wrote:
Duke wrote:There doesn't seem to be a function analogous to C++'s stream insertion operator
I think you meant stream extraction operator? Anyway, there is READ, except obviously since CL is dynamically typed it is not an exact equivalent, since it has to determine the type of the object solely from the representation.
Er, yes. Extraction. :facepalm: The problem with READ was that it would barf every time it encountered a comma in the text, as if it were trying to evaluate the stream as Lisp code. READ-LINE didn't, which is a bit boggling. I might have read through the SBCL source to detect a difference between the two functions that would explain this behavior, but I'm not sure what came of that.

As for the rest of your advice, you're quite right. I probably would have done some of it de facto if I were still writing C, but I guess my intuition is taking a backseat while I re-learn the basics. I'll definitely check out Alexandria and Osicat when I pick this up next, and probably read the source to see how things get done low-level.

Much thanks for your reply, Ramarren. :)

Re: File I/O woes

Post by ramarren » Mon Aug 30, 2010 1:29 am

Duke wrote:The problem with READ was that it would barf every time it encountered a comma in the text, as if it were trying to evaluate the stream as Lisp code. READ-LINE didn't, which is a bit boggling.
READ-LINE reads a line as a string; READ reads an s-expression, which has to conform to s-expression syntax as defined by the current *readtable* and the reader algorithm. You could in principle customize the syntax quite heavily, but at some point it is easier to just write a simple parser.
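The difference is easy to see at the REPL. The helper below is a hypothetical name of my own, written only for this demonstration: it calls READ repeatedly on a string and reports whether it stopped at end of input or because the reader signalled an error. It also binds *READ-EVAL* to NIL, which is a sensible precaution whenever READ is pointed at text you did not write.

```lisp
(defun read-all-objects (text)
  "READ objects from TEXT until end of input or the first reader error.
Returns the objects read and, as a second value, :EOF or :READER-ERROR."
  (let ((*read-eval* nil))              ; never evaluate #. forms in the text
    (with-input-from-string (in text)
      (loop for object = (handler-case (read in nil :eof)
                           (error () :reader-error))
            until (member object '(:eof :reader-error))
            collect object into objects
            finally (return (values objects object))))))
```

(read-all-objects "Hello world") reads two symbols and stops at :EOF. With "Hello, world" the reader gets as far as the symbol HELLO and then, at least in SBCL, signals an error on the comma, because a comma is only valid inside a backquote form. READ-LINE on the same input just hands back the string.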

s-imp
Posts: 11
Joined: Sat Aug 28, 2010 5:04 am

Re: File I/O woes

Post by s-imp » Mon Aug 30, 2010 2:12 am

I've been looking through all my books and scratching my head, but I found this nice web page, which I shall be going over with a fine-toothed comb when the time is right:

http://la7dja.org/lisp/clml/utils.lisp

Tom
Posts: 22
Joined: Sat Jun 28, 2008 12:52 pm
Location: Wichita, KS

Re: File I/O woes

Post by Tom » Mon Aug 30, 2010 8:49 pm

I've been doing a bit of text file parsing lately. I break it down into 2 steps. The first step is to use READ-SEQUENCE to map the file to a string.

Code:

(defun valid-size (stream)
  "Ensure that the file does not exceed ARRAY-TOTAL-SIZE-LIMIT."
  (let ((size (file-length stream)))
    (if (< size array-total-size-limit)
        size
        (error "The string size exceeds ARRAY-TOTAL-SIZE-LIMIT."))))

;;; Inspired by an Erik Naggum post from 1998-04-15.
(defun map-file-to-string (pathname)
  "Map a file into a string."
  (with-open-file (input pathname :direction :input)
    (let* ((string (make-string (valid-size input)
                                :initial-element #\Space))
           (end (read-sequence string input)))
      (values string end))))
Then I perform all of the parsing on the string. I like to use META-SEXP.

Code:

(defrule word? (&aux (word (make-char-accum))) ()
  (:* (:type (or white-space? newline?)))
  (:+ (:not (:type (or white-space? newline?)))
      (:char-push word))
  (:return word))

(defun parse-words-in-file (pathname)
  "Return a list of the words in the file."
  (let ((ctx (create-parser-context
              (map-file-to-string pathname))))
    (loop for word = (word? ctx)
          while word collect word)))
There are a lot of details I'm not addressing in this example, but it demonstrates the basic idea.

Good luck,

~ Tom

Re: File I/O woes

Post by Duke » Tue Aug 31, 2010 8:20 pm

Surprisingly, Alexandria's READ-FILE-INTO-STRING was only slightly faster than my wonky read-line/append function, and left unwanted newlines in the output. I think if I filter out the whitespace, the two will be pretty much equivalent.

@Tom
Thanks for the tip. I'll have to try that approach tomorrow.

Re: File I/O woes

Post by ramarren » Tue Aug 31, 2010 11:38 pm

Duke wrote:Surprisingly, Alexandria's READ-FILE-INTO-STRING was only slightly faster than my wonky read-line/append function, and left unwanted newlines in the output. I think if I filter out the whitespace, the two will be pretty much equivalent.
This is not that surprising, since such an operation would be dominated by I/O costs (and presumably READ-LINE uses properly buffered I/O internally, so READ-SEQUENCE doesn't have that much of an advantage) and character decoding. That might not be obvious to people coming from other languages, since the last time I checked many didn't use Unicode by default, but most CL implementations do, and decoding UTF-8 in particular can be quite expensive. If you were concerned about speed, then treating your data as binary or as some single-byte-per-character encoding can be a significant gain. Also, in the presence of variable-length character encodings FILE-LENGTH can be greater than the length of the file in characters, which is why Tom's approach returns the actual length read as a secondary value.
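For the "treat it as binary" route, a minimal sketch could look like the following. Both function names are invented for this example, and the decoding step assumes the data really is ASCII or Latin-1, so that each octet maps directly to one character code; it would mangle UTF-8 text with non-ASCII characters.

```lisp
(defun read-file-octets (pathname)
  "Return the contents of PATHNAME as a vector of (UNSIGNED-BYTE 8)."
  (with-open-file (in pathname :element-type '(unsigned-byte 8))
    ;; On a binary stream FILE-LENGTH counts stream elements, here octets,
    ;; so the vector is exactly as long as the file.
    (let ((octets (make-array (file-length in)
                              :element-type '(unsigned-byte 8))))
      (read-sequence octets in)
      octets)))

(defun octets-to-single-byte-string (octets)
  "Decode OCTETS assuming one character per byte (ASCII or Latin-1)."
  (map 'string #'code-char octets))
```

Whether this is measurably faster than a character stream depends on the implementation and the external format, so it is worth timing on your own data before committing to it.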

Re: File I/O woes

Post by Tom » Wed Sep 01, 2010 7:39 am

Ramarren wrote:
Duke wrote:Surprisingly, Alexandria's READ-FILE-INTO-STRING was only slightly faster than my wonky read-line/append function, and left unwanted newlines in the output. I think if I filter out the whitespace, the two will be pretty much equivalent.
This is not that surprising, since such an operation would be dominated by I/O costs (and presumably READ-LINE uses properly buffered I/O internally, so READ-SEQUENCE doesn't have that much of an advantage) and character decoding. That might not be obvious to people coming from other languages, since the last time I checked many didn't use Unicode by default, but most CL implementations do, and decoding UTF-8 in particular can be quite expensive. If you were concerned about speed, then treating your data as binary or as some single-byte-per-character encoding can be a significant gain. Also, in the presence of variable-length character encodings FILE-LENGTH can be greater than the length of the file in characters, which is why Tom's approach returns the actual length read as a secondary value.
The main place I've run into the problem of FILE-LENGTH exceeding the number of characters is Windows line endings, where CR+LF becomes #\Newline. I think that issue is independent of encoding. Banging my head against it, though, raised my awareness of encoding issues.
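In practice that means the string allocated from FILE-LENGTH can have unused padding at the end. A small variation on the earlier function, sketched here under the name READ-FILE-TRIMMED (not from any library), uses the count returned by READ-SEQUENCE to cut the padding off. It assumes the file never decodes to more characters than FILE-LENGTH reports, and it skips the ARRAY-TOTAL-SIZE-LIMIT check for brevity.

```lisp
(defun read-file-trimmed (pathname)
  "Return the contents of PATHNAME as a string holding only the characters read."
  (with-open-file (in pathname :direction :input)
    (let* ((buffer (make-string (file-length in)))
           ;; READ-SEQUENCE returns the index of the first element it did
           ;; not fill, which is the number of characters actually read.
           (end (read-sequence buffer in)))
      (if (< end (length buffer))
          (subseq buffer 0 end)         ; copy without the unused tail
          buffer))))
```

The SUBSEQ copies the data once more, so for very large files you might prefer to keep the oversized buffer and pass END along instead, as the two-value version does.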

~ Tom

karol.skocik
Posts: 10
Joined: Tue Sep 22, 2009 4:50 pm

Re: File I/O woes

Post by karol.skocik » Sat Sep 04, 2010 12:40 pm

I usually make my life easier with iterate:

Code:

(iter (for line :in-file "/home/md/.emacs" :using #'read-line)
  (print line))
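If you would rather not pull in a library for this, roughly the same thing can be written with the standard LOOP macro. READ-LINES is just a name picked for this sketch.

```lisp
(defun read-lines (pathname)
  "Return the lines of PATHNAME as a list of strings, without the newlines."
  (with-open-file (in pathname)
    ;; READ-LINE returns NIL at end of file because EOF-ERROR-P is NIL.
    (loop for line = (read-line in nil nil)
          while line
          collect line)))
```

Replacing COLLECT LINE with DO (PRINT LINE) gives the same behaviour as the iterate form above.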
