Thursday, July 05, 2007

How to scrape a web site and come up with a book

I've run across many public domain books which someone has taken the loving time to set up as multiple HTML pages on a site.

But what I wanted was this book all in one piece so I could read it. Or have it on my hard drive for research.

The solution I found:

1. Scrape the site with HTTrack or a similar offline-browser tool.
2. Using Acrobat (not Reader), convert that site into a PDF.
3. Save that PDF as a single HTML file (HTML 3.2).
4. Edit that HTML file in NoteTab or another decent text editor with good search-and-replace functions. (NoteTab can strip out the HTML code and leave just the text.)
5. Open this up in OpenOffice (far better than MS Word) and format it, using a Lulu template.
6. Publish to Lulu so others can benefit from your hard work.
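If you'd rather script step 4 than do it by hand, the tag-stripping part can be sketched with Python's standard library alone. This is just an illustration of the idea, not NoteTab's actual algorithm; the class and function names here are my own invention:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the visible text from an HTML page,
    skipping <script> and <style> contents, which aren't prose."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip = 0  # nesting depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.chunks.append(data)

def strip_html(html):
    """Return the page's text with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(" ".join(parser.chunks).split())

# Example: tags and script code are dropped, text is kept.
print(strip_html("<p>Chapter <b>One</b></p><script>var x=1;</script>"))
```

Run it over each of the scraped pages in order and concatenate the results, and you have the raw text ready for formatting in OpenOffice.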

Unless there are numerous typos in the original, I can sometimes have a book up on Lulu within a few hours of finding the site online.

The point is that you are finding and re-publishing public domain works - not someone's currently copyrighted material, which would get you in trouble with Lulu and others.
