HTML and LISP

Discussion of Common Lisp
Post by totonex » Wed Feb 03, 2010 12:09 pm

Hello everyone.
I am very new to pretty much everything in LISP, and to what LISP is.
I have an assignment to do, and I've decided it would be best to do it in LISP.
What I'm supposed to do is make a desktop application in C# which gets the HTML from some page and, using regex, extracts some very specific text from it.
Since HTML is XML, and XML is a list of items, LISP just popped into my mind.
First of all, I have Emacs + CL Lisp, and LispWorks (which works with CL).
1. Should I do a console application which does this? 1b. How do I make a console application in Emacs?
2. Should I use LispWorks to make a fancy desktop app?
3. Should I not use CL, but use Clojure? //clojure's just fancy.
4. Can someone give me some insight on functions to use to get HTML from a specific URL, and do I still need regex for getting my specific information? If so, what support does CL have for the matter?
Thank you very much.

Post by ramarren » Wed Feb 03, 2010 1:50 pm

totonex wrote:I am very new to pretty much everything in LISP, and to what LISP is.
One thing you need to know: ever since lowercase letters were invented, Lisp has been known as Lisp, and LISP with all capital letters refers to old dialects which are now mostly a historical curiosity.
totonex wrote:I have an assignment to do, and i've decided it would be best to do it in LISP.
How much time do you have for this assignment? If you do not know anything about Lisp, then setting the system up and learning the basics might take some time. Trying to cargo-cult a solution rarely works well, and there are a number of things in most Lisps which are somewhat different from many other popular languages, so it is even harder.
totonex wrote:What i'm supposed to do is make a desktop application in C# which gets the HTML from some page, and using regex, extract some very specific text from it.
Using regular expressions on HTML is generally not recommended, which is stated eloquently in this StackOverflow answer. Are regexes actually in the problem statement? If so, and this is part of some formal education, then I would suggest changing schools ;)

Also, wouldn't a C# requirement conflict with implementing the solution in Lisp?
totonex wrote:First of all, i have Emacs + CL Lisp, and LispWorks (which works with CL).
LispWorks is an implementation of CL, not "works with CL", whatever that means. Common Lisp is a language standardized by ANSI, and there are many implementations of that standard. The CL package for Emacs is not one of them; it just provides a layer of useful functions and macros that mimic some CL features, and it is hardly complete.
totonex wrote:1. Should I do a console application which does this? 1b. How do I make a console application in Emacs?
2. Should I use LispWorks to make a fancy desktop app?
How should we know? You should do what you want or need to do. Most likely not something running inside Emacs, though; that wouldn't make much sense. It is possible to use Emacs in batch mode like this, but that doesn't really fit this case either. If you have the full version of LispWorks you could create a command line application in that, or use some open source CL implementation.
totonex wrote:3. Should i not use CL, but use Clojure? //clojure's just fancy.
Clojure requires the Java Virtual Machine and relies on Java libraries for many things. If you are already comfortable with the Java environment, then you can. I am not sure if there are many people on this forum using Clojure though, so if you choose to do so you might want to look for help elsewhere.
totonex wrote:4. Can someone give me some insight on functions to use to get HTML from a specific URL, and do I still need regex for getting my specific information? If so, what support does CL have for the matter?
LispWorks probably provides all of that functionality built in somehow. Since I have more experience with open source implementations, I would use drakma to download the data, and closure-html and xpath to parse it and extract information from it. If you really want regular expressions there is cl-ppcre. All those work with LispWorks as well, of course.

Installing those libraries and their dependencies is best achieved using clbuild, preferably on a sane operating system, and learning to use ASDF (there is a basic tutorial here). The language in general is best learnt from a book like Practical Common Lisp.

Post by totonex » Wed Feb 03, 2010 3:36 pm

I've been reading this book http://www.scs.cmu.edu/~dst/LispBook/index.html about Lisp, that's how I got caught up in it.
I have to search an HTML page for a very specific <div> tag with a specific rank attached to it, and for the information enclosed within the div. I know using regex will be a headache, and by using Lisp, it will just come down to cdr-ing down a list and its sublists.
I don't have a specific time issue regarding the assignment, and it was suggested to me to build it in C#. It's just that, having a little knowledge of Lisp, I simply can't ignore the fact that an HTML page is just a list with sublists.
Given this info, could anyone point me to some resources (e.g. sample programs done in LispWorks, a comprehensive LispWorks book)? I really feel that this can be done in not-so-many lines of code, given the similarity between HTML and Lisp.
Thank you.
Edit: thank you for the links.

Post by ramarren » Wed Feb 03, 2010 4:12 pm

totonex wrote:I've been reading this book http://www.scs.cmu.edu/~dst/LispBook/index.html about lisp, that's how i got caught into it.
That is a good book as an introduction to programming using Lisp. Practical Common Lisp is a bit more, well, practical, although it sometimes rushes through some topics, so it might be confusing for beginners.
totonex wrote:I have to search in a html page for a very specific <div> tag with a specific rank attached to it, and the information encased within the div.
That is exactly what xpath is for. I have already linked to a pure Common Lisp implementation of it.
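For illustration only, here is a rough sketch of what such a query looks like. It is written in Python with just the standard library, not with the Common Lisp xpath library, and it assumes the page is well-formed XHTML (real-world HTML would first have to go through a forgiving parser such as closure-html). The markup, the `rank-3` class value and the function name are all made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment; assumed well-formed so an XML parser accepts it.
PAGE = """<html><body>
  <div class="rank-1">first</div>
  <div class="rank-3">the <b>wanted</b> text</div>
  <div class="other"><div class="rank-3">nested hit</div></div>
</body></html>"""

def texts_of_divs(markup, css_class):
    """Return the text content of every <div> whose class equals css_class."""
    root = ET.fromstring(markup)
    # ElementTree supports a limited XPath subset, which is enough here.
    hits = root.findall(f".//div[@class='{css_class}']")
    # itertext() walks the element and its children, flattening inline tags.
    return ["".join(div.itertext()) for div in hits]

print(texts_of_divs(PAGE, "rank-3"))
```

The query string does all the work of describing where the element sits in the document, so the surrounding code stays the same when the target changes.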
totonex wrote:and by using Lisp, it will just come down to cdr-ing down a list and its sublists.
This can be done, and I tried doing things like that a few times, but it quickly becomes annoying. There is simply enough structure in an [X/HT]ML document that creating a program aware of that structure from primitives is a bit more complex than one would expect.
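To make that concrete, here is a small sketch of the list-walking approach, in Python with nested lists standing in for s-expressions. The `[tag, attributes, child, ...]` layout, the sample tree and the function names are assumptions made for the example; an actual parser may hand back a differently shaped structure.

```python
# Each element is [tag, attributes, child, child, ...]; plain strings are text.
# This layout is an assumption for the sketch, not what any real parser emits.
TREE = ["html", {},
        ["body", {},
         ["div", {"class": "rank-1"}, "first"],
         ["div", {"class": "other"},
          ["div", {"class": "rank-3"}, "the ", ["b", {}, "wanted"], " text"]]]]

def text_of(node):
    """Concatenate all text below a node, descending into child elements."""
    if isinstance(node, str):
        return node
    return "".join(text_of(child) for child in node[2:])

def find_divs(node, css_class):
    """Walk the nested lists and collect the text of matching <div> elements."""
    if isinstance(node, str):
        return []  # text nodes cannot match
    tag, attrs, children = node[0], node[1], node[2:]
    found = []
    if tag == "div" and attrs.get("class") == css_class:
        found.append(text_of(node))
    for child in children:  # the "cdr-ing down the sublists" part
        found.extend(find_divs(child, css_class))
    return found

print(find_divs(TREE, "rank-3"))
```

Even this toy has to tell text from elements, skip the attribute slot, and decide how to flatten inline tags. Real documents add comments, entities and malformed markup on top of that, which is where hand-rolled walkers tend to get annoying.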
totonex wrote:Given this info, could anyone pinpoint me to some resources (e.g. sample programs done in lispworks, a comprehensive lispworks book)
It is generally recommended not to depend too strongly on a particular implementation. Common Lisp is a standardized language, and there are many common extensions which are handled by compatibility libraries, which means that it is in many cases possible to write a program which will run in most conforming implementations. The exception is when you need some implementation-specific features, like CAPI, which you might if you plan on deploying on Windows.

Since my experience with commercial implementations is limited I don't know any LispWorks-specific resources, but they do have fairly comprehensive documentation.
totonex wrote:i really feel that this can be done in not-so-many lines of code, given the similarity between html/lisp.
Well, it could be done in quite few lines of code, but mostly due to preexisting libraries rather than any similarity between HTML and Lisp, since such similarities are fairly narrow when considering the problem domain. That is, do not expect any magic; this will not be somehow drastically simpler than in any other language. Well, the problem as described is rather "trivial if you already know how to do it", but that is the point... you are likely to spend much more time learning Lisp in general rather than as it applies to the problem, and trying to go around that will only end up in confusion. Of course, learning Lisp is great for many reasons, so you should do that anyway.

Post by totonex » Thu Feb 04, 2010 11:40 am

Thank you very much !

Post by JamesF » Thu Feb 04, 2010 3:01 pm

Welcome!

If you do this in Lisp, you'll want to use a couple of Edi Weitz' libraries: http://weitz.de/drakma/ and http://weitz.de/cl-ppcre/.

A couple of bits of advice:
- first, get it working in a language you already know. Then you at least have a result that you can provide for assessment, even if it's not the one you'd have preferred to give.
- get the core functionality working first, then (if you still have the time) look at making a fancy desktop app out of it. In CL, start by using Drakma to retrieve the HTML, then use a suitable function from cl-ppcre to extract the matching text. I leave the implementation details as an exercise for the student, since writing it for you would defeat the point :)

Is a regular expression actually required? If it was stated that you're to use one, presumably in the interests of helping you learn about those, then that's definitely what you should use. Otherwise, a regex is simply one of the options; the next obvious one is to use an HTML parser to transform the page into a tree, and then write something that'll retrieve the relevant element. Neither option is inherently right or wrong; it's a matter of what best suits the task.
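As a rough illustration of how the two options differ, here is a sketch in Python using only the standard library (not Drakma or cl-ppcre). The input string, the `rank-3` class value and the `DivText` class are invented for the example. It picks a case where the wanted <div> contains another <div>, which is where the two approaches part ways; on flat, predictable markup the regex would do fine.

```python
import re
from html.parser import HTMLParser

# Hypothetical input: the wanted <div> contains another <div>.
PAGE = '<div class="rank-3">outer <div>inner</div> tail</div><div>unrelated</div>'

# Option 1: a regular expression. Non-greedy matching stops at the first
# closing tag, so nested markup cuts the result short.
match = re.search(r'<div class="rank-3">(.*?)</div>', PAGE, re.S)
regex_result = match.group(1)

class DivText(HTMLParser):
    """Option 2: collect the text inside the first <div> with a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # nesting level of <div>s inside the target
        self.done = False   # set once the target <div> has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth > 0:
            self.depth += 1
        elif dict(attrs).get("class") == self.css_class:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)

parser = DivText("rank-3")
parser.feed(PAGE)
parser.close()
parser_result = "".join(parser.parts)

print(repr(regex_result))
print(repr(parser_result))
```

The regex version is shorter, while the parser version keeps track of nesting; which trade-off is better depends on how regular the target page really is.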

Lisp is a marvellously powerful language, but it has a steep learning curve. Actually, it's mostly learning-curve because, once you've learned how to do something, the language pretty much gets out of your way so that all you really notice are the things you don't yet know how to do. Very frustrating until you realise what's going on.
But I digress. My point is that it's definitely worth investing time into learning Lisp, but you should consider how much time you have on hand before committing to delivering a result in it. I'm a sysadmin by profession, working for a software development company; I've seen this kind of mistake made a couple of times, and may even have made it myself.

To help clear up a bit of confusion: CL (Common Lisp, the language) is a specification. That spec has been implemented by a variety of teams, so we have GNU CL, the confusingly-named CLISP, SBCL (a very popular open-source one), ABCL, ECL, Allegro Common Lisp from Franz, and LispWorks, as well as the ones I've momentarily forgotten. Each one of them can legitimately be referred to as "CL" because they all implement the spec. So what you have is Emacs and LispWorks, which can equally be described as Emacs and CL (or Emacs and a CL implementation).
It's similar in spirit to C, where you have a spec and a variety of compilers to choose from, as distinct from simply having Perl or Python, where the implementation is the spec.

Clojure is nearly, but not quite, Common Lisp. Rich Hickey took the language, made a few changes that he thought appropriate, and implemented it on the Java VM. I've not yet played with it myself, so have no strong opinion; if you're keen to learn a variety of languages, I see no reason not to try it as well.


Hope this helps,
James
