Friday, December 21, 2007

Experimenting with XML

An unintended use of this blog is as a central repository for notes on my projects: I already have three unpublished entries that are notes for various projects. This post itself is about a week old and gradually became what you see now.

I plan on working with the English Wikipedia database in XML. At nearly 12 GB, 203 million lines, and millions of entries, it is several orders of magnitude larger than any file I have attempted to process. Most tools simply give up on it. I resorted to vi to save a slice of the document to another file, and to less to view and copy the end of the file, and from those I created a manageable 700-line sample of the database. It was good to look through the XML of Wikipedia and see the wiki tags in their unparsed form. I wasn't very happy with the formatting of the articles, as I will have to do three types of parsing on the document: once for the XML, once to strip out the words in certain tags and documents, and once more for the words themselves. The plan is to parse the tags and words in the same pass, after they have been retrieved from the XML. I did grumble a bit at the wiki tags being embedded in the document instead of the whole thing being pure XML. Made me think of this. Another hurdle, I suppose.
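Since I haven't published the parser yet, here is a minimal sketch in Python of what I mean by combining passes: stream the XML with a pull parser so the whole 12 GB never has to sit in memory, then handle the embedded wiki tags and the word splitting together once each page's text has been pulled out. The file name, the markup regex, and the helper names are just placeholders for illustration, not my actual code.

```python
import re
import xml.etree.ElementTree as ET

# Crude stand-ins for real wiki-tag handling and word splitting.
WIKI_MARKUP = re.compile(r"\[\[|\]\]|\{\{|\}\}|'{2,}")
WORD = re.compile(r"[A-Za-z]+")

def pages(path):
    """Yield (title, text) pairs while streaming the dump, never loading it whole."""
    title, text = None, None
    for event, elem in ET.iterparse(path):
        tag = elem.tag.rsplit("}", 1)[-1]   # drop the MediaWiki XML namespace, if any
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield title, text
            elem.clear()                    # free the finished page as we go

def words(text):
    """Strip the embedded wiki markup (crudely), then split into words in the same pass."""
    return WORD.findall(WIKI_MARKUP.sub(" ", text))

if __name__ == "__main__":
    # "enwiki-sample.xml" is a placeholder; the 700-line sample works for testing.
    for title, text in pages("enwiki-sample.xml"):
        print(title, len(words(text)))
```

The point of the sketch is only the shape of the approach: one streaming pass over the XML, and one combined pass over each article's text for tags and words.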

The English Wikipedia is not the only wiki database available for download. This page has a list of databases in XML that are regularly generated. Wikis on everything from 50 Cent to zombies are available. Uncyclopedia is what I sometimes use to keep myself awake during class, and at 390 MB it is a usable size: small enough to fit in memory, large enough to take time to process. It takes around 1:30 to parse with my current parser, which I hope to publish here soon.

I started reading Uncyclopedia a couple of hours ago, beginning with Lisp, Why stick things in an electrical outlet?, and 667:Neighbor_of_The_Beast. I guess no more coding tonight.
