Perl & LWP

author	Sean M. Burke
pages	264
publisher	O'Reilly and Associates
rating	9
reviewer	mir
ISBN	0-596-00178-9
summary	Excellent introduction to extracting and processing information from web-sites

Summary

Perl & LWP is a solid, no-nonsense book that will teach you how to do screen-scraping using Perl. It describes how to automatically retrieve and use information from the web. It introduces LWP and related modules, from simple to advanced uses, and shows various ways to extract information from the returned HTML.

The good: nice style, good coverage of the subject, an introduction to all the modules used, reference material, and good, well-developed examples. I really liked the way it describes the basic methodology for developing screen-scraping code, from analyzing an HTML page to extracting and displaying only what you are interested in.

The bad: not much. Some chapters are a little dry, and sometimes the reference material could be better separated from the rest of the text. The book covers only simple access to web sites; I would have liked to see an example where the application carries on a real dialog with the server. The appendixes are not really useful.

More Info

If it had not been published by O'Reilly, "Perl & LWP" could have been titled "Leveraging the Web: Object-Oriented techniques for information re-purposing", or "Web Services, Generation 0". An even better title would have been "Screen-scraping for fun and profit": one day we might all use Web Services and easily get the information we need from various providers using SOAP or REST, but in the meantime the common way to achieve this goal is just to write code that connects to a web server, retrieves a page and extracts the information from the HTML. In short, "screen-scraping". This book will teach you all about using Perl to get Web pages and extract their "substantifique moëlle" (the pith, the essentials) for your own use. It showcases the power of Perl for that kind of job, from regular expressions to powerful CPAN modules.

At 200 pages, plus 40 pages of appendixes and index, it is part of that line of compact O'Reilly books that cover only a narrow topic but cover it well. Just like Perl & XML, its target audience is Perl programmers who need to tackle a new domain. It will give them a toolbox and basic techniques that will give them a jump start and help them avoid many mistakes.

"Perl & LWP" starts from the basics: installing LWP and using LWP::Simple to retrieve a file from a URL. It then goes on to a more complete description of the advanced LWP methods for dealing with forms and munging URLs. It continues with 5 chapters on how to process the HTML you get, using regular expressions, an HTML tokenizer and HTML::TreeBuilder, a powerful module that... builds a tree from the HTML. It goes on to show how to allow your programs to access sites that require cookies, authentication or the use of a specific browser. The final chapter wraps it all up in a bigger example: a web-spider.
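
To give a feel for why the book moves from regular expressions to a real parser, here is a minimal sketch. It is not code from the book: the book works in Perl with LWP and HTML::TokeParser, while this uses only the Python standard library as a rough analogue. The page content and the names PAGE, links_by_regex, LinkCollector and links_by_parser are made up for illustration, and the page is an in-memory string rather than something fetched from a server.

```python
import re
from html.parser import HTMLParser

# Hypothetical page content; a real program would first fetch it over HTTP.
PAGE = """
<html><body>
<!-- <a href="/old.html">commented-out link</a> -->
<a href="/news.html">News</a>
<A
   class="ext" HREF='http://example.com/docs'>Docs</A>
</body></html>
"""


def links_by_regex(html):
    """Quick but brittle: only matches <a href="..."> written exactly this way."""
    return re.findall(r'<a href="([^"]*)"', html)


class LinkCollector(HTMLParser):
    """Collect href values from <a> start tags as the parser tokenizes the page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser hands over tag and attribute names already lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def links_by_parser(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    print("regex :", links_by_regex(PAGE))
    print("parser:", links_by_parser(PAGE))
```

On this sample the regular expression picks up the link inside the HTML comment and misses the one written in upper case with single quotes, while the parser skips the comment and finds both real links. That is the kind of brittleness that motivates the tokenizer and tree chapters.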

The book is well-written and to-the-point. It is structured in a way that mimics what a programmer new to the field would do: start from the docs for a module, play with it, write snippets of code that use the various functions of the module, then go on to coding real-life examples. I particularly liked the fact that the author often explains the whys and not only the hows of the various pieces of code he shows us.

It is interesting to note that going from regular expressions to ever more powerful modules is a path also followed by most Perl programmers, and even by the language itself: when Perl starts being applied to a new domain, first there are no modules, then low-level ones start appearing, then, as the understanding of the problem grows, easier-to-use modules are written.

Finally I would like to thank the author for following his own advice on including interesting examples and above all for not including anything about retrieving stock-quotes.

Another recommended book on the subject is Network Programming with Perl by Lincoln D. Stein, which covers a wider subject but devotes 50 pages to this topic and is also very good.

Breakdown by chapter

1. Introduction to Web Automation (15 pages): an overview of what this book will teach you, how to install Gisle Aas' LWP, some interesting words of caution about the brittleness of screen-scraping code, copyright issues and respect for the servers you are about to hammer, and finally a very simple example that shows the basic process of web automation.
2. Web Basics (16p): describes how to use LWP::Simple, an easy way to do some simple processing.
3. The LWP Class Model (17p): a slightly steeper read, closer to a reference than to a real introduction, that lays the groundwork for the good stuff ahead.
4. URLs (10p): another reference chapter, this one will teach you all you can do with URLs using the URI module. Although the chapter is clear and complete, it includes little explanation as to why you will need to process URLs, and the chapter is not even mentioned in the introduction's roadmap.
5. Forms (28p): a complete and easy to read chapter. It includes a long description of HTML form fields that can be used as a reference, 2 fun examples (how to get the number of people living in any city in the US from the Census web site and how to check that your dream vanity plate is available in California) and how to use LWP to upload files to a server. It also describes the limits of the technique. I appreciated a very educative section showing how to go from a list of fields in a form to more and more useful code that queries that form.
6. Simple HTML processing with Regular Expressions (15p): how to extract info from an HTML page using regexps. The chapter starts with short sections about various useful regexp features, then presents excellent advice on troubleshooting them, the limits of the technique and a series of examples. An interesting chapter, but read on for more powerful ways to process HTML. On the down side, I found the discussion of the s and m regexp modifiers a little confusing.
7. HTML processing with Tokens (19p): using a real HTML parser is a better (safer) way to process HTML than regexps. This chapter uses HTML::TokeParser. It starts with a short, reference-type intro, then a detailed example. Another reference section describes the methods and an alternate way of using the module, with short examples. This is the kind of reference I find the most useful; it is the simplest way to understand how to use a module.
8. Tokenizing walkthrough (13p): a long example showing step-by-step how to write a program that extracts data from a web site, using HTML::TokeParser. The explanations are very good, showing _why_ the code is built this way and including alternatives (both good and bad ones). This chapter describes really well the method readers can use to build their code.
9. HTML processing with Trees (16p): even more powerful than an HTML tokenizer, HTML::TreeBuilder (written by the author of the book) builds a tree from the HTML. This chapter starts with a short reference section, then revisits 2 previous examples of extracting information from HTML using HTML::TreeBuilder.
10. Modifying HTML with Trees (17p): more on the power of HTML::TreeBuilder: a reference/howto on the modification functions of HTML::TreeBuilder, with snippets of code for each function. I really like HTML::TreeBuilder BTW, it is simple yet powerful.
11. Cookies, Authentication and Advanced Requests (13p): back to that LWP business... this chapter is simple and to-the-point: how to use cookies, authentication and referer to access even more web-sites. I just found that it lacked a description of how to code a complete session with cookies.
12. Spiders (20p): a long example describing how to build a link-checking spider. It uses most of the techniques previously described in the book, plus some additional ones to deal with redirection and robots.txt files.
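
As a rough illustration of why URL processing (chapter 4) matters for a spider (chapter 12), here is a minimal sketch of two steps a link checker needs before it fetches anything: turning the hrefs found on a page into absolute URLs, and honoring robots.txt. Again, this is not the book's Perl code (which uses the URI module and LWP) but a Python standard-library analogue. The URLs, the robots.txt body, the agent name and the function names are all hypothetical, and nothing here touches the network.

```python
from urllib.parse import urldefrag, urljoin, urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical page being checked and the raw href values found on it.
BASE = "http://example.com/docs/index.html"
HREFS = [
    "intro.html",
    "../about.html#team",
    "/private/log.html",
    "http://other.example.org/",
    "mailto:webmaster@example.com",
]

# Hypothetical robots.txt body; a real spider would download it from the host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def absolute_links(base, hrefs):
    """Resolve hrefs against the page URL, drop fragments, keep only http(s)."""
    urls = []
    for href in hrefs:
        url = urldefrag(urljoin(base, href)).url
        if urlsplit(url).scheme in ("http", "https"):
            urls.append(url)
    return urls


def same_host(url, base):
    """True when the URL points at the same host as the page being checked."""
    return urlsplit(url).netloc == urlsplit(base).netloc


def allowed_urls(urls, robots_txt, agent="ExampleLinkChecker"):
    """Keep only the URLs that the given robots.txt rules let this agent fetch."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return [url for url in urls if rules.can_fetch(agent, url)]


if __name__ == "__main__":
    found = absolute_links(BASE, HREFS)
    local = [url for url in found if same_host(url, BASE)]
    print(allowed_urls(local, ROBOTS_TXT))
```

The robots.txt rules only make sense for the host they came from, which is why the sketch filters to same-host URLs before applying them. A real spider would keep one set of rules per host.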

Appendixes

I think the Appendixes are actually the weakest part of the book. Most of them are not really useful, apart from the ASCII table (every computer book should have an ASCII table IMHO ;--).
- A. LWP modules (4p): the list and one-line description of all modules in the LWP library, long and impressive! But not very useful,
- B. HTTP status (2p): available elsewhere but still pretty useful,
- C. Common MIME types (2p): lists both the usual extension and the MIME type,
- D. Language Tags (2p): the author is a linguist ;--)
- E. Common Content Encodings (2p): character set codes,
- F. ASCII Table (13p): a very complete table, includes the ASCII/Unicode code, the corresponding HTML entity, description and glyph,
- G. User's View of Object-Oriented Modules (11p): this is a very good idea. A lot of Perl programmers are not very familiar with OO, and in truth they don't need to be. They just need the basics of how to create an object in an existing class and call methods on it. I found the text to be slightly confusing though; in fact I believe it is a little too detailed and might confuse the reader.
- Index (8p): I did not think the index was great (code is listed with references to 5 seemingly random pieces of code, type=file, HTML input element is listed twice, with and without the comma...), but this is not the kind of book where the index is the primary way to access the information. The Table of Contents is complete and the chapters are focussed enough that I have never needed to use the index.