How to export a blog as a document

Anyone got smart ideas here? I want to convert my whole blog into a formattable document, including comments, with a view to doing a book format (The Very Best of Evolving Thoughts). I want to be able to edit it, and put it through InDesign, and I want to do the whole thing in one go.

I’ve tried importing WordPress’ XML files, but nothing works. I’ve tried finding an export plugin, but nothing exists (though many have asked for it on the forums). I can export individual posts to InDesign, but that means I can’t convert the resulting markup to Word or Pages.

Any ideas, oh wonderful crowd of readers? Anyone want to write a WXR to RTF filter?
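
For anyone tempted, the first half of such a filter might start out like the rough sketch below. It reads a WordPress export (WXR) file and pulls out each published post together with its comments. The element names (wp:post_type, wp:status, content:encoded, wp:comment and friends) are what I'd expect in a typical export, so treat them as assumptions and check them against your own file. The names posts_from_wxr and text_of are made up for illustration, and turning the result into RTF is left out entirely.

```ruby
require 'rexml/document'

# One blog post plus its comments, as pulled from the export file.
Post = Struct.new(:title, :body, :comments)
Comment = Struct.new(:author, :text)

# Text of a child element (plain text or CDATA), or '' if it is missing.
def text_of(parent, name)
  child = parent.elements[name]
  child ? child.texts.map(&:value).join.strip : ''
end

# Parse a WXR string and return the published posts, skipping pages,
# attachments and drafts. Element names are assumed, not guaranteed.
def posts_from_wxr(xml)
  doc = REXML::Document.new(xml)
  posts = []
  doc.elements.each('rss/channel/item') do |item|
    next unless text_of(item, 'wp:post_type') == 'post'
    next unless text_of(item, 'wp:status') == 'publish'

    comments = []
    item.elements.each('wp:comment') do |c|
      comments << Comment.new(text_of(c, 'wp:comment_author'),
                              text_of(c, 'wp:comment_content'))
    end
    posts << Post.new(text_of(item, 'title'),
                      text_of(item, 'content:encoded'),
                      comments)
  end
  posts
end
```

Something like posts_from_wxr(File.read('export.xml')) would then give a list to loop over, where export.xml is just a stand-in for wherever the export was saved.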


  1. Matthew Platte

    Build a screen scraper?

    Here’s a crude sketch in Ruby:

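    require 'nokogiri'
    require 'open-uri'

    # Fetch and parse one post page; put the post's URL between the quotes.
    # URI.open is used because a bare open() no longer fetches URLs in Ruby 3.
    doc = Nokogiri::HTML(URI.open(""))

    # The theme is assumed to mark up each post with these classes:
    #   entry-title, entry-meta, entry-content

    # Post title
    title = doc.at_css("h1.entry-title").text
    puts title

    # Date, author and category line
    meta = doc.at_css(".entry-meta").text
    puts meta

    # Body of the post, as plain text with the markup stripped
    content = doc.at_css(".entry-content").text
    puts content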

    More sophisticated parsing, extraction and persistent storage would be necessary.

    • It is that “more sophisticated parsing” that causes all the trouble 🙂

    • I just needed a historian to do the research for me! Thanks, Chris – I’ll report back on how well it works.

      • It won’t export comments, and it doesn’t export to anything I can edit.

  2. I’m with Chris – I’ve used Anthologize for this purpose & it’s not bad. It doesn’t give you perfectly clean copy, but it certainly gives you enough to work with.

  3. Ben Breuer

    Perhaps somebody like Ed Yong, who published from his blog, would know better?

    Can you do any LaTeX wizardry to your WordPress files?

    • I can write a grep file in a number of environments, but I really hoped someone else would do that for me. I used to do that for a living, and it’s really, really boring.

  4. When I had to export my whole blog, I was able to set the RSS feed to display all entries. That’s a cinch to import. But WordPress may not have such a setting.
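
If a feed with every entry can be had, stitching it into one editable file is not much work. Here is a minimal sketch, assuming an RSS 2.0 feed whose items carry the full post in content:encoded (falling back to description) and assuming the target program can import plain HTML. The names feed_to_html and child_text are invented for the example. Note that a post feed like this would not include the comments.

```ruby
require 'rexml/document'
require 'erb'

# Text of a child element (plain text or CDATA), or '' if it is missing.
def child_text(parent, name)
  child = parent.elements[name]
  child ? child.texts.map(&:value).join.strip : ''
end

# Turn an RSS 2.0 feed (as a string) into one HTML document.
# Post bodies are copied through as-is, since they are already HTML.
def feed_to_html(feed_xml, book_title)
  doc = REXML::Document.new(feed_xml)
  sections = []
  doc.elements.each('rss/channel/item') do |item|
    title = child_text(item, 'title')
    date  = child_text(item, 'pubDate')
    body  = child_text(item, 'content:encoded')
    body  = child_text(item, 'description') if body.empty?
    sections << "<h2>#{ERB::Util.html_escape(title)}</h2>\n" \
                "<p><em>#{ERB::Util.html_escape(date)}</em></p>\n#{body}\n"
  end
  # Feeds usually list the newest post first, so reverse for reading order.
  "<html><head><meta charset=\"utf-8\">" \
    "<title>#{ERB::Util.html_escape(book_title)}</title></head>\n" \
    "<body>\n#{sections.reverse.join}</body></html>\n"
end
```

Saving the result with File.write('book.html', feed_to_html(feed, 'The Very Best of Evolving Thoughts')) would give a single file to open and tidy up, with book.html being an arbitrary name and feed being the downloaded feed text.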

