-- What?: Winter of Code

Showing posts with label Winter of Code. Show all posts

Saturday, January 12, 2008

Avoiding rendezvous to delay starting of tasks

A common problem I encounter is that something needs to be done before task can start running. I've always solved that by using a Start entry on tasks and calling that entry when it is safe for them to start. Calling entries works fine but I'd like to think the following is a bit neater:
(declare_task_test.adb)

with Ada.Text_IO;

-- Alex "ahab" Habeger
-- Quick demonstration on declare blocks and tasks

procedure Declare_Task_Test is

   task type Delaying_Task;

   task body Delaying_Task is
   begin
      for Index in 1..5 loop
         Ada.Text_IO.Put_Line("Delay " & Integer'Image(Index));
         Delay (1.0);
      end loop;
      Ada.Text_IO.Put_Line("Done with loop");
   end Delaying_Task;
   
begin
   Ada.Text_IO.Put_Line("No task yet.");
   declare
      The_Task : Delaying_Task;
   begin
      Ada.Text_IO.Put_Line("In declare block.");
   end;
   Ada.Text_IO.Put_Line("Declare block ended.");
end Declare_Task_Test;

Instead of delaying when the tasks start, delay when the tasks are declared.

Friday, January 11, 2008

Full XML Parser for Wikis

You may want a primer on parsing XML with Ada. That is an... okay... example, but better than many Ada packages have on the web. The following is a more complete, but more complicated example. In XML/Ada you supply an object to the parser that provides the rules for how to parse the document. It is a pretty easy to work with and extend. This code was some of the earliest I started writing in this whole adventure. I waited to finalize it until now so I could completely figure out how the client code would work with it.

This is the client to the XML/Ada library and makes use of the parser object posted below: (test_wiki_parser.adb)

with Ada.Command_Line;
with Ada.Text_IO;
with Wiki_Readers;
with Input_Sources.Large_File;
with Sax.Readers;
with Ada.Strings.Unbounded;

procedure Test_Wiki_Parser is
   
   File_Source : Input_Sources.Large_File.Large_File_Input(16);
   Wiki_Parser : Wiki_Readers.Wiki_Reader;
   
   -- passed to the Wiki_Parser to put the title to standard out
   procedure Put_Title (Article : in Wiki_Readers.Full_Document) is
   begin
      Ada.Text_IO.Put_Line(Ada.Strings.Unbounded.To_String(Article.Title) & 
                           " is article # " & Integer'Image(Article.Number));
   end Put_Title;
   
   -- passed to the XML_Parser to relay the # of documents in the collection
   procedure Put_Num_Articles (Value : in Natural) is
   begin
      Ada.Text_IO.Put_Line(Integer'Image(Value) & " documents in collection");
   end Put_Num_Articles;

begin

   if Ada.Command_Line.Argument_Count /= 1 then
      Ada.Text_IO.Put_Line ("Need to specify a filename!");
      return;
   end if;

   Wiki_Readers.Set_Processes (Handler => Wiki_Parser,
                               Document_Process   => Put_Title'Access, 
                               Collection_Process => Put_Num_Articles'Access);

   Input_Sources.Large_File.Open(Filename => Ada.Command_Line.Argument(1),
                                 Input     => File_Source);
   Wiki_Readers.Parse (Parser => Wiki_Parser,
                       Input  => File_Source);
   Input_Sources.Large_File.Close (File_Source);

exception
   when Sax.Readers.XML_Fatal_Error =>
      Input_Sources.Large_File.Close (File_Source);
end Test_Wiki_Parser;

This uses the Input_Sources.Large_File posted earlier. This entire program boils down to:

Tell parser what to do
Open file
Parse file
Close file

(wiki_reader.ads)

with Sax.Readers;
with Sax.Attributes;
with Unicode.CES;
with Ada.Strings.Unbounded; --use Ada.Strings.Unbounded;

package Wiki_Readers is
           
   type Wiki_Reader is new Sax.Readers.Reader with private;
   
   -- Stores information on documents
   type Full_Document is
      record
         Number : Natural := 0; -- What article # this is
         Title  : Ada.Strings.Unbounded.Unbounded_String;
         Text   : Ada.Strings.Unbounded.Unbounded_String;
      end record;

   procedure End_Document (Handler : in out Wiki_Reader);
   
   procedure Start_Element
     (Handler       : in out Wiki_Reader;
      Namespace_URI : Unicode.CES.Byte_Sequence := "";
      Local_Name    : Unicode.CES.Byte_Sequence := "";
      Qname         : Unicode.CES.Byte_Sequence := "";
      Atts          : Sax.Attributes.Attributes'Class);
      
   procedure End_Element
     (Handler       : in out Wiki_Reader;
      Namespace_URI : Unicode.CES.Byte_Sequence := "";
      Local_Name    : Unicode.CES.Byte_Sequence := "";
      Qname         : Unicode.CES.Byte_Sequence := "");
      
   procedure Characters
     (Handler  : in out Wiki_Reader;
      Sequence : Unicode.CES.Byte_Sequence);

   -- Takes in pointers to procedures on what to do at end of each 
   -- document and at the end of the collection      
   procedure Set_Processes
     (Handler  : in out Wiki_Reader;
      Document_Process   : access procedure (Article : Full_Document);
      Collection_Process : access procedure (Value : Natural));

private
   -- For indicating what field the characters come from.
   type Character_Field_Type is (Unknown, Text, Title);

   type Wiki_Reader is new Sax.Readers.Reader with record
      Document_Process   : access procedure (Article : Full_Document);
      Collection_Process : access procedure (Value : Natural);
      Character_Field  : Character_Field_Type := Unknown;
      Valid_Title      : Boolean; -- Validity of current title (not image, category, or other)
      Page             : Full_Document;
   end record;
end Wiki_Readers;

(wiki_reader.adb)

with Sax.Readers;    use Sax.Readers;
with Unicode.CES;    use Unicode.CES;
with Unicode;        use Unicode;
with Sax.Attributes;
with Ada.Strings;

package body Wiki_Readers is

   ------------------
   -- End_Document --
   ------------------

   procedure End_Document (Handler : in out Wiki_Reader) is
   begin
      Handler.Collection_Process(Handler.Page.Number);
   end End_Document;

   -------------------
   -- Start_Element --
   -------------------

   procedure Start_Element
     (Handler       : in out Wiki_Reader;
      Namespace_URI : Unicode.CES.Byte_Sequence := "";
      Local_Name    : Unicode.CES.Byte_Sequence := "";
      Qname         : Unicode.CES.Byte_Sequence := "";
      Atts          : Sax.Attributes.Attributes'Class) is
   begin
      -- Only care about start of title and text
      if Local_Name = "title" then
         Handler.Character_Field := Title; -- Opening tag of title
      elsif Local_Name = "text" and Handler.Valid_Title then
         Handler.Character_Field := Text; -- Opening tag of text
      end if;
   end Start_Element;

   -----------------
   -- End_Element --
   -----------------

   procedure End_Element
     (Handler : in out Wiki_Reader;
      Namespace_URI : Unicode.CES.Byte_Sequence := "";
      Local_Name    : Unicode.CES.Byte_Sequence := "";
      Qname         : Unicode.CES.Byte_Sequence := "") is
   begin
      Handler.Character_Field := Unknown;
      if Local_Name = "page" then
         -- at the end of a page
         begin
            if Handler.Valid_Title and Ada.Strings.Unbounded.Element(Handler.Page.Text, 1) /= '#' then
               -- title is valid, and article is not blank or a redirect, so the article is passed on
               Handler.Page.Number := Handler.Page.Number + 1;
               Handler.Document_Process(Handler.Page);
            end if;
         exception
            when ADA.STRINGS.INDEX_ERROR =>
               null;
         end;
         -- unconditional re-initialization
         Handler.Page.Title := Ada.Strings.Unbounded.Null_Unbounded_String;
         Handler.Page.Text  := Ada.Strings.Unbounded.Null_Unbounded_String;
      elsif Local_Name = "title" then
         -- Done parsing the characters for the title, check if it is valid or not
         Handler.Valid_Title := Ada.Strings.Unbounded.Index(Handler.Page.Title, ":") = 0;
      end if;
   end End_Element;

   ----------------
   -- Characters --
   ----------------

   procedure Characters
     (Handler  : in out Wiki_Reader;
      Sequence : Unicode.CES.Byte_Sequence) is
   begin
      if Handler.Character_Field = Text then
         Ada.Strings.Unbounded.Append(Handler.Page.Text, Sequence);
      elsif Handler.Character_Field = Title then
         Ada.Strings.Unbounded.Append(Handler.Page.Title, Sequence);
      end if;
   end Characters;
   
   procedure Set_Processes
     (Handler  : in out Wiki_Reader;
      Document_Process   : access procedure (Article : Full_Document);
      Collection_Process : access procedure (Value : Natural)) is
   begin
      Handler.Document_Process   := Document_Process;
      Handler.Collection_Process := Collection_Process;
   end Set_Processes;

end Wiki_Readers;

I tried passing the two processes (Document_Process & Collection_Process) at the declaration of Wiki_Parser, but evidently Ada does not like access types being passed there (and a compiler bug). I could possibly wrap them in a record and pass them through that also. I think this way is not the preferred way, but it works.

Yay closures.

This combo fully parses the new 14 GB version of Wikipedia in 25 minutes. This is not the exact version that will be used in the final product. The Full_Document type will be declared elsewhere. Other than that all the code above will be included.

Friday, January 4, 2008

The difference a review can make

One of my previous classes involved building a search engine for a document. The professor demonstrated different ways of indexing documents. One that really jumped out at me was ngramming. Ada's ability to do Enumeration_Type'Value on an equivalently sized string and return Enumeration_Type really made this attractive to me. Seemed like a pretty straightforward idea and should be easy to implement.

It wasn't.

I wrote a script to define the ngram type of all combinations of 3 letters. The script output a file "aaa, aab, aac, .. zzx, zzy, zzz". There are 17576 combinations of 3 letters. Ada has 12 reserved words that are 3 letters so I had to code around that. My once neat "abs" became "absa" and I had to add a case statement with 7 choices and at least one if-else in every choice of that case. Yipe. What was a very close representation of the data just became a burden. The conversion function and the accompanying type declaration make for a nearly 90 KB file. Ugh.

Using a script to write the data type for me should have been a clue.

I've been looking over that code recently and came up with a much neater solution. Here is the declaration and test of what should be a nearly drop in replacement and is probably faster to boot:
(trigram_test.adb)

with Ada.Text_IO;

procedure Trigram_Test is
   
   subtype Trigram is Positive range 1..17576;
   
   subtype Trigram_String is String (1..3);
   
   function To_Trigram (Item : Trigram_String) return Trigram is
   begin
      return ((Character'Pos(Item(1))) - 96) +
             ((Character'Pos(Item(2))) - 96) * 26 +
             ((Character'Pos(Item(2))) - 96) * 676;
   end To_Trigram;
   
begin
   Ada.Text_IO.Put_Line(Integer'Image(To_Trigram("aaa")));
   Ada.Text_IO.Put_Line(Integer'Image(To_Trigram("zzz")));
end Trigram_Test;

In comparison the original enumeration type declaration took 704 lines. This new type might be not as representative, but I never used the representation in the old implementation. This is by far easier to understand and maintain.

Tuesday, January 1, 2008

Plans gone awry

An unforeseen obstacle is that the file IO included with XML/Ada is not the most robust solution. The included file input source works by reading the entire xml file to a buffer at once. The largest file it can index is 2 GB. That is not going to work for the 12 GB file I have.

My first thought was to read large blocks with Sequential_IO for the majority of the file and then use progressively smaller calls to Direct_IO for the remainder. The size of the calls to Direct_IO would be determined by the leq_pot demonstration posted earlier. I decided against this approach because of the complicated buffer and file slicing. I would've ended up with a massive development headache and then needed to run more extensive tests. Best tests are tests you don't have to run.

Stream_IO is one of the more powerful IO packages in Ada. I had toyed with the idea of using Stream_IO to read the data, then using Unchecked Conversion to change the data to the proper type. I'm not a fan of Unchecked Conversion, and I would need to have multiple copies of the same data in memory. I considered that ugly and moved on. The second attempt was to make a buffered Stream_IO package. I wrote the package pretty quickly and started implementing it. After doing more investigation I discovered that the decoding procedure for Unicode characters requires that the buffer be internal to the Input_Source and not in the IO package. Back to the drawing board.

After visiting the drawing board this time I felt much better about my solution. Using Stream_IO, buffering logic from the previous attempt, and a procedure borrowed from the UTF-8 Unicode encoding I managed to make an object that reads a file using Stream_IO and converts Stream_Elements directly to UTF-8 characters. Additional encoding schemes shouldn't be hard to do, but I'll just stick with UTF-8 for now. This is what I came up with:

(input_source-large_file.ads)

-----------------------------------------------------------------------
--                XML/Ada - An XML suite for Ada95                   --
--                                                                   --
--                       Copyright (C) 2001-2002                     --
--                            ACT-Europe                             --
--                    Modified by Alex Habeger 2008                  --
--                                                                   --
-- This library is free software; you can redistribute it and/or     --
-- modify it under the terms of the GNU General Public               --
-- License as published by the Free Software Foundation; either      --
-- version 2 of the License, or (at your option) any later version.  --
--                                                                   --
-- This library is distributed in the hope that it will be useful,   --
-- but WITHOUT ANY WARRANTY; without even the implied warranty of    --
-- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU --
-- General Public License for more details.                          --
--                                                                   --
-- You should have received a copy of the GNU General Public         --
-- License along with this library; if not, write to the             --
-- Free Software Foundation, Inc., 59 Temple Place - Suite 330,      --
-- Boston, MA 02111-1307, USA.                                       --
--                                                                   --
--
--
--
--
--
--
--

with Ada.Streams.Stream_IO;
with Ada.Streams;
with Unicode;
with Unicode.CES;

package Input_Sources.Large_File is

   type Large_File_Input (Max_Buffer_Size : Positive) is new Input_Source with private;
   -- Max buffer size is the maximum size of the buffer in MB
   
   type Large_File_Input_Access is access all Large_File_Input'Class;
   --  A special implementation of a reader, that reads from a file.

   procedure Open (Filename : String; Input : out Large_File_Input);
   --  Open a new file for reading.
   --  Note that the file is read completly at once, and saved in memory.
   --  This provides a much better access later on, however this might be a
   --  problem for very big files.
   --  The physical file on disk can be modified at any time afterwards, since
   --  it is no longer read.
   --  This function can decode a file if it is coded in Utf8, Utf16 or Utf32

   procedure Close (Input : in out Large_File_Input);
   --  Close the file and free the memory

   procedure Next_Char
     (From : in out Large_File_Input;
      C    : out Unicode.Unicode_Char);
   --  Return the next character in the file.

   function Eof (From : Large_File_Input) return Boolean;
   --  True if From is past the last character in the file.

private
   
   type Stream_Element_Array_Access is access all Ada.Streams.Stream_Element_Array;
   
   -- the parent object
   -- type Input_Source is abstract tagged limited record
   --    Prolog_Size : Natural := 0;
   --    Es          : Unicode.CES.Encoding_Scheme :=
   --      Unicode.CES.Basic_8bit.Basic_8bit_Encoding;
   --    Cs          : Unicode.CCS.Character_Set :=
   --      Unicode.CCS.Unicode_Character_Set;
   --    Public_Id   : Unicode.CES.Byte_Sequence_Access;
   --    System_Id   : Unicode.CES.Byte_Sequence_Access;
   -- end record;
   
   type Large_File_Input (Max_Buffer_Size : Positive) is new Input_Source with
      record
         Buffer_Index : Ada.Streams.Stream_Element_Offset;
         Buffer_Last  : Ada.Streams.Stream_Element_Offset;
         Buffer       : Stream_Element_Array_Access;
         File         : Ada.Streams.Stream_IO.File_Type;
      end record;
end Input_Sources.Large_File;

(input_sources-large_file.adb)

-----------------------------------------------------------------------
--                XML/Ada - An XML suite for Ada95                   --
--                                                                   --
--                       Copyright (C) 2001-2006                     --
--                            ACT-Europe                             --
--                    Modified by Alex Habeger 2008                  --
--                                                                   --
-- This library is free software; you can redistribute it and/or     --
-- modify it under the terms of the GNU General Public               --
-- License as published by the Free Software Foundation; either      --
-- version 2 of the License, or (at your option) any later version.  --
--                                                                   --
-- This library is distributed in the hope that it will be useful,   --
-- but WITHOUT ANY WARRANTY; without even the implied warranty of    --
-- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU --
-- General Public License for more details.                          --
--                                                                   --
-- You should have received a copy of the GNU General Public         --
-- License along with this library; if not, write to the             --
-- Free Software Foundation, Inc., 59 Temple Place - Suite 330,      --
-- Boston, MA 02111-1307, USA.                                       --
--                                                                   --
--
--
--
--
--
--
--
with Unicode;            use Unicode;
with Ada.Streams;        use Ada.Streams;
with Ada.Streams.Stream_IO; 
with Unicode.CES;        use Unicode.CES;
with Unicode.CES.Utf32;  use Unicode.CES.Utf32;
with Unicode.CES.Utf16;  use Unicode.CES.Utf16;
with Unicode.CES.Utf8;   use Unicode.CES.Utf8;
with Unchecked_Deallocation;

package body Input_Sources.Large_File is

   -------------------------
   -- Internal procedures --
   -------------------------
   procedure Free_Stream is new Unchecked_Deallocation
     (Stream_Element_Array, Stream_Element_Array_Access);
   
   procedure Fill_Buffer (Input : in out Large_File_Input) is
   begin
      if not Ada.Streams.Stream_IO.End_Of_File (Input.File) then
         Input.Buffer_Index := 1;
         Ada.Streams.Stream_IO.Read (Input.File, Input.Buffer.All, Input.Buffer_Last);
      end if;
   end Fill_Buffer;
   
   ----------
   -- Open --
   ----------

   procedure Open (Filename : in     String;
                   Input    :    out Large_File_Input) is
      Length : Ada.Streams.Stream_Element_Offset;
      BOM    : Bom_Type;
      BOM_Indicator : String (1..4);
   begin
      Ada.Streams.Stream_IO.Open (Input.File, Ada.Streams.Stream_IO.In_File, Filename);
      Length := Stream_Element_Offset (Ada.Streams.Stream_IO.Size (Input.File));

      --  If the file is empty, we just create a reader that will not return
      --  any character. This will fail later on when the XML document is
      --  parsed, anyway.
      
      
      if Length = 0 then
         -- File is empty
         Input.Buffer := new Ada.Streams.Stream_Element_Array (1 .. 1);
         Input.Buffer_Index := 2;
         return;
      elsif Length > Stream_Element_Offset(1048576 * Input.Max_Buffer_Size) then
         -- file is longer than the max buffer size.
         -- will set length of buffer to Max_Buffer_Size
         Length := Stream_Element_Offset(1048576 * Input.Max_Buffer_Size);
      -- else
         -- buffer is length of file
      end if;

      Input.Buffer := new Ada.Streams.Stream_Element_Array (1 .. Length);
      Fill_Buffer (Input);
      
      for Index in BOM_Indicator'Range loop
         BOM_Indicator(Index) := Character'Val(Input.Buffer.All(Stream_Element_Offset(Index)));
      end loop;

      Read_Bom (BOM_Indicator, Input.Prolog_Size, BOM);
      case BOM is
         when Utf32_LE =>
            -- Set_Encoding (Input, Utf32_LE_Encoding);
            raise Invalid_Encoding;
         when Utf32_BE =>
            -- Set_Encoding (Input, Utf32_BE_Encoding);
            raise Invalid_Encoding;
         when Utf16_LE =>
            -- Set_Encoding (Input, Utf16_LE_Encoding);
            raise Invalid_Encoding;
         when Utf16_BE =>
            -- Set_Encoding (Input, Utf16_BE_Encoding);
            raise Invalid_Encoding;
         when Ucs4_BE | Ucs4_LE | Ucs4_2143 | Ucs4_3412 =>
            raise Invalid_Encoding;
         when Utf8_All | Unknown =>
            Set_Encoding (Input, Utf8_Encoding);
      end case;

      Input.Buffer_Index := Input.Buffer'First + Stream_Element_Offset(Input.Prolog_Size);

      --  Base file name should be used as the public Id
      Set_Public_Id (Input, Filename);
      Set_System_Id (Input, Filename);
   end Open;

   -----------
   -- Close --
   -----------

   procedure Close (Input : in out Large_File_Input) is
   begin
      Ada.Streams.Stream_IO.Close (Input.File);
      Free_Stream (Input.Buffer);
      Input_Sources.Close (Input_Source (Input));
      Input.Buffer_Index := Ada.Streams.Stream_Element_Offset'Last;
   end Close;

   ---------------
   -- Next_Char --
   ---------------

   procedure Next_Char
     (From : in out Large_File_Input;
      C :       out Unicode.Unicode_Char) is
      
      Len  : Ada.Streams.Stream_Element_Offset;
      Val  : Unicode_Char;
      Char : Unicode_Char := Unicode_Char (From.Buffer.All(From.Buffer_Index));
   begin
      -- original:
      -- From.Es.Read (From.Buffer.all, From.Index, C);
      -- C := From.Cs.To_Unicode (C);
      -- Read converts from an 8 bit character string to unicode
      -- To_Unicode returns whatever is passed to it (aka nothing) ?!
      
      -- the following is slightly modified code from Unicode's conversion
      -- to UTF-8 procedure.
      
      -- Compute the length of the encoding given what was in the first byte
      if Char < 128 then
         Len := From.Buffer_Index;
         Val := Char and 16#7f#;
      elsif (Char and 16#E0#) = 16#C0# then
         Len := From.Buffer_Index + 1;
         Val := Char and 16#1f#;
      elsif (Char and 16#F0#) = 16#E0# then
         Len := From.Buffer_Index + 2;
         Val := Char and 16#0f#;
      elsif (Char and 16#F8#) = 16#F0# then
         Len := From.Buffer_Index + 3;
         Val := Char and 16#07#;
      elsif (Char and 16#FC#) = 16#F8# then
         Len := From.Buffer_Index + 4;
         Val := Char and 16#03#;
      elsif (Char and 16#FE#) = 16#FC# then
         Len := From.Buffer_Index + 5;
         Val := Char and 16#01#;
      else
         raise Invalid_Encoding;
      end if;

      if From.Buffer'Last < Len then
         raise Invalid_Encoding;
      end if;

      for Count in From.Buffer_Index + 1 .. Len loop
         Char := Unicode_Char(From.Buffer.All (Count));
         if (Char and 16#C0#) /= 16#80# then
            raise Invalid_Encoding;
         end if;
         Val := (Val * (2 ** 6)) or (Char and 16#3f#);
      end loop;

      From.Buffer_Index := Len + 1;
      C  := Val;
      
      if From.Buffer_Index > From.Buffer_Last then
         -- buffer is empty
         Fill_Buffer(From);
      end if;
   end Next_Char;

   ---------
   -- Eof --
   ---------

   function Eof (From : Large_File_Input) return Boolean is
   begin
      return From.Buffer_Index > From.Buffer_Last;
   end Eof;

end Input_Sources.Large_File;

The time required to parse a file using this input source is much more consistent. For the 390 MB Uncyclopedia the file source included with XML/Ada would take anywhere from 1:24 to 2:01. What I wrote it is always between 1:12 and 1:14. Part of the speedup is probably due to the inlining of the UTF-8 conversion. The more consistent times is likely due to finer grained IO. Using a 16 MB buffer CPU usage on a 2.4 Ghz Core 2 hovers between 90% and 100% utilization of a single core and 18 MB of RAM is consumed. A parser with high standards for what constitutes a document processed the 12 GB wikipedia and counted 1074662 articles in 23 minutes.

Another input source to verify this against with would be nice. I've verified this input source with the smaller Uncyclopedia. For all I know this source is faulty after the first 390 MB. Then again 390 MB worth of data is probably a high enough standard.

Lessons taught:

This is the first tagged type extension, outside of a book exercise, that I have done.
Tagged types cannot have default discriminants.
Don't expect to deal with large files easily.
Simple is better.

What I'd like to continue with:

Make the package handle more than UTF-8.
Find a more optimal buffer size.
Can another thread speed up execution or is it IO bound?

Wednesday, December 26, 2007

If its not one thing, its another

Sometimes I think I try to make my solutions to problems more elegant than needed. I encountered another hurdle in my quest, which sent me on a short detour. I will wait to share the problem, but I've attached a demonstration of the solution.

A demonstration of how to find the power of two that is less than or equal to the supplied value.
(leq_pot.adb)

with Ada.Numerics.Elementary_Functions;
with Ada.Text_IO;

   -- Alex "ahab" Habeger

   -- Demonstration of how to find the closest power of two
   -- that is less than or equal to a number.

procedure LEQ_POT is
   Result : Float;
begin
   -- the power of two that will be tested
   for Power in 1..10 loop
      -- an offset to illustrate boundaries
      for Offset in -1..1 loop
         Result := Ada.Numerics.Elementary_Functions.Log(
                 Base => Float (2),
                 X    => Float (2 ** Power + Offset));
         Ada.Text_IO.Put (Integer'Image(2**(Integer(Float'Floor(Result)))));
         Ada.Text_IO.Put_Line (Integer'Image(2 ** Power + Offset));
      end loop;
      Ada.Text_IO.Put_Line ("------");
   end loop;
end LEQ_POT;

Not an exciting program, but reminds me of some of those groaner homeworks that I've had before.

Friday, December 21, 2007

Experimenting with XML

An unintended use of this blog is that I am using it as a central repository for notes on my projects. I have three unpublished entries that are notes for various projects. This post is actually about a week old and gradually became what you see now.

I plan on working with the English Wikipedia database in XML. At nearly 12 GB, 203 million lines, and millions of entries it is several orders of magnitude larger than any file I have attempted to process. Most tools just give up at dealing with it. I resorted to vi to save a slice of the document to another file, and less to view and copy the end of the file. I created a manageable 700 line sample of the database. It was good to look through the XML of Wikipedia to see the wiki tags in their unparsed form. I wasn't very happy with the formatting of the articles as I will have to do three types of parsing on the document. Once for the XML, once to strip out the words in certain tags and documents, and again for the words. Plans are to parse the tags and words in the same pass after they have been retrieved from the XML. I did grumble a bit at the Wiki tags being embedded in the document instead of the whole thing being pure XML. Made me think of this. Another hurdle I suppose.

The English Wikipedia is not the only wikipedia database available for download. This page has a list of databases in XML that are regularly generated. Wikis on anything from 50 cent to Zombies are available. Uncyclopedia is what I sometimes use to keep myself awake during class and it is a usable size at 390 MB. Small enough to fit in memory, large enough to take time to process. It takes around 1:30 to parse with my current parser that I hope to be publishing here soon.

I started reading Uncyclopedia a couple of hours ago. Started off with Lisp, Why stick things in an electrical outlet?, and 667:Neighbor_of_The_Beast. I guess no more coding tonight.

Tuesday, December 18, 2007

Fixing the heat

Yes, the weather has been cold here lately, but that is not what this post is about.

In the spring of 07 I was given an assignment to solve the heat equation. The heat equation works simply by finding the average of a point and and all the points around it, setting that new found average as the new value, and repeating. My original attempt at this involved a monitor, three copies of the array in memory, and a lockup that I could not trace down. I would have forgotten about this assignment, but earlier this semester I had an epiphany of why it didn't work. The epiphany was that I only had one point of synchronization in a loop. At certain times threads would be released from that point of synchronization then run through their loop, and come back to the point of synchronization. That point of synchronization was still open because other threads were still in it. Once this happened the threads became out of sync and the program would deadlock.

The new solution: (heat.adb)

with Ada.Command_Line;
with Ada.Float_Text_Io;
with Ada.Text_Io;

procedure Heat is
   -- Alex "ahab" Habeger
   -- A quick exercise in thread synchronization.
   -- Solves the heat equasion for a 10x10 space where the temperatures of the left and right edges 
   -- are given and static. Uses 80 threads.

   -- Constants
   Size : constant := 10; -- The height and width of the array to process
   
   -- if Heat_Type is changed processing of the Tolerance parameter may need to be changed
   subtype Heat_Type is Float;
   subtype Heat_Array_Index_Type is Integer range 1..Size;
   type Heat_Array_Type is array (Heat_Array_Index_Type, Heat_Array_Index_Type) of Heat_Type;

   -- In case proper command line parameters are not specified
   Usage_Error : exception;

   -----------------------------------------
   -- The acutal math for calculating heat
   procedure Calculate_Heat (Heat_Array : in out Heat_Array_Type;
                             Tolerance  : in     Heat_Type) is

      -- Local variables
      All_Under_Tolerance : Boolean;
      -----------------------------------------
      -- The protected bodies for Synchronizing
      -----------------------------------------
      protected Synchronizer is
         entry First_Sync (Under_Tolerance : in Boolean); -- Holds all tasks until all check in
         entry Second_Sync; -- Holds all tasks until all check in
      private
         First_Sync_Open  : Boolean := False;
         Second_Sync_Open : Boolean := False;
      end Synchronizer;
      
      protected body Synchronizer is
         entry First_Sync (Under_Tolerance : in Boolean) 
            when First_Sync'Count = Size * (Size - 2) or First_Sync_Open is
         begin
            if First_Sync'Count = Size * (Size - 2) - 1 then
               All_Under_Tolerance := True;
               First_Sync_Open := True;  -- First one opens door for rest
            elsif First_Sync'Count = 0 then
               First_Sync_Open := False; -- Last one out closes the door
            end if;
            All_Under_Tolerance := All_Under_Tolerance and Under_Tolerance;
         end First_Sync;
         -----------------------------------------
         entry Second_Sync 
            when Second_Sync'Count = Size * (Size - 2) or Second_Sync_Open is
         begin
            if Second_Sync'Count = Size * (Size - 2) - 1 then
               Second_Sync_Open := True;  -- First one opens door for rest
            elsif Second_Sync'Count = 0 then
               Second_Sync_Open := False; -- Last one out closes the door
            end if;
         end Second_Sync;
      end Synchronizer;
      -----------------------------------------
      -- The tasks to do the averaging:
      -----------------------------------------
      task type Averaging_Task is
         entry Get_Index (New_X : in Heat_Array_Index_Type;
                          New_Y : in Heat_Array_Index_Type);
      end Averaging_Task;
      -----------------------------------------
      -- 2D array of tasks
      type Averager_Task_Array is array (Heat_Array_Index_Type'First + 1 .. 
         Heat_Array_Index_Type'Last - 1, Heat_Array_Index_Type'range) of Averaging_Task;

      Averagers : Averager_Task_Array;
      -----------------------------------------
      task body Averaging_Task is
         Value : Heat_Type;
         X     : Heat_Array_Index_Type;
         Y     : Heat_Array_Index_Type;
      begin
         accept Get_Index (New_X : in Heat_Array_Index_Type;
                           New_Y : in Heat_Array_Index_Type) do
            X := New_X;
            Y := New_Y;
         end Get_Index;
         Value := Heat_Array(X, Y);
         -- Averages the heat values
         -- One averaging per iteration
         Averaging_Loop:
         loop
            if Y = Heat_Array_Index_Type'First then
               -- Averages the left, right, and above values
               Value := (Heat_Array(X - 1, Y) + Heat_Array(X + 1, Y) +
                         Heat_Array(X, Y + 1) + Value) / Heat_Type(4);
            elsif Y = Heat_Array_Index_Type'Last then
               -- Averages the left, right, and below values
               Value := (Heat_Array(X - 1, Y) + Heat_Array(X + 1, Y) +
                         Heat_Array(X, Y - 1) + Value) / Heat_Type(4);
            else
               -- Averages the left, right, above and below values
               Value := (Heat_Array(X - 1, Y) + Heat_Array(X + 1, Y) +
                         Heat_Array(X, Y + 1) + Heat_Array(X, Y - 1) + Value) / Heat_Type(5);
            end if;
           
            -- Report if the change is under tolerance and sync
            Synchronizer.First_Sync (abs(Value - Heat_Array(X, Y)) < Tolerance);
            Heat_Array(X, Y) := Value;
            Synchronizer.Second_Sync;
            exit Averaging_Loop when All_Under_Tolerance;
         end loop Averaging_Loop;

         Ada.Text_Io.Put_Line(Integer'Image(X) & "," & Integer'Image(Y) & " has been calculated.");
      end Averaging_Task;
      --------------------------
   begin
      -- Initializes the Averagers with their index
      -- One Averager initialized per iteration
      for Index_X in Heat_Array_Index_Type'First + 1 ..Heat_Array_Index_Type'Last - 1 loop
         for Index_Y in Heat_Array_Index_Type'range loop
            Averagers(Index_X, Index_Y).Get_Index(Index_X, Index_Y);
         end loop;
      end loop;
   end Calculate_Heat;
   --------------------------
   -- Sets a heat array to the default values
   procedure Initialize_Heat_Array_Type (Target :    out Heat_Array_Type;
                                         Left   : in     Heat_Type;
                                         Right  : in     Heat_Type) is
   begin
      -- Initializes the left half of the array
      -- Loops across all the columns except the left edge
      -- One column initialized per iteration
      for Index_X in Heat_Array_Index_Type'First..Heat_Array_Index_Type'Last / 2 loop
         -- Loops across all the rows
         -- One element initialized per iteration
         for Index_Y in Heat_Array_Index_Type'range loop
            Target(Index_X, Index_Y) := Left;
         end loop;
      end loop;

      -- Initializes the remainder of the array to right
      -- Loops across all the columns except the left edge
      -- One column initialized per iteration
      for Index_X in Heat_Array_Index_Type'Last / 2 + 1 ..Heat_Array_Index_Type'Last loop
         -- Loops across all the rows
         -- One element initialized per iteration
         for Index_Y in Heat_Array_Index_Type'range loop
            Target(Index_X, Index_Y) := Right;
         end loop;
      end loop;
   end Initialize_Heat_Array_Type;
   --------------------------
   -- Puts a heat array to standard out
   procedure Put_Heat_Array_Type (Source : in Heat_Array_Type) is
   begin
      for Index_Y in Heat_Array_Index_Type loop
         for Index_X in Heat_Array_Index_Type loop
            Ada.Float_Text_Io.Put (
               Item => Source(Index_X, Index_Y),
               Fore => 4,
               Aft  => 1,
               Exp  => 0);
         end loop;
         Ada.Text_IO.New_Line;
      end loop;
   end Put_Heat_Array_Type;
   --------------------------
   Heat_Array : Heat_Array_Type;
   Tolerance  : Heat_Type;
   Left       : Heat_Type;
   Right      : Heat_Type;
begin
   if Ada.Command_Line.Argument_Count /= 3 then
      -- not given the proper number of parameters
      raise Usage_Error;
   end if;

   Tolerance := Heat_Type'Value(Ada.Command_Line.Argument(1));
   Left      := Heat_Type'Value(Ada.Command_Line.Argument(2));
   Right     := Heat_Type'Value(Ada.Command_Line.Argument(3));

   if Tolerance = 0.0 then
      raise Usage_Error;
   end if;

   Initialize_Heat_Array_Type(Heat_Array, Left, Right);
   Calculate_Heat(Heat_Array => Heat_Array,
                  Tolerance  => Tolerance);
   Put_Heat_Array_Type(Heat_Array);
exception
   when Usage_Error =>
      Ada.Text_IO.Put_Line("Usage: heat [TOLERANCE] [LEFT] [RIGHT]");
      Ada.Text_Io.Put_Line("TOLERANCE must be greater than 0.");
      Ada.Text_IO.Put_Line("Example: heat 0.01 999 1");
   when Constraint_Error => -- In case a non-number was given for a parameter
      Ada.Text_IO.Put_Line("Usage: heat [TOLERANCE] [LEFT] [RIGHT]");
      Ada.Text_Io.Put_Line("TOLERANCE must be greater than 0.");
end Heat;

This is not a robust solution, but merely an exercise in task synchronization. I did make two big goofs during development. I wasn't thinking at the time and placed the line "All_Under_Tolerance := True;" in the elsif instead of the if of First_Sync. This had the effect that All_Under_Tolerance was set by the last thread to exit that entry. This took a careful reading of the code to realize this error. The other error was that I had the two lines:

          Synchronizer.Second_Sync;
          exit Averaging_Loop when All_Under_Tolerance;

reversed. This made it so that some tasks would read an unsynchronized value of All_Under_Tolerance. That goof caused some tasks exit their loops and some tasks continue on. That was easy to track down.

What did I learn from all this? I think the Synchronizer in this contains the first burst locks I have ever written. If a burst lock occurs in a loop a second burst lock seems necessary. It might be possible to call one burst lock with a requeue in it so that there are two burst locks in one call. That might be an exercise for later though.

Monday, December 17, 2007

Begining the Winter of Code

If Google can have a summer of code why can't I have a Winter of Code?

Next semester I have a class that makes many people cringe at it's name. It is heavy in Ada and in concurrency. I would like to polish my skills in preparation for this class. I have the free time this winter break to do so, hence the winter of code.

I have two projects on my agenda at the moment. In the Spring of '07 I was given a project to solve the heat problem. My solution worked, but it had some occasional deadlock. Fixing this shouldn't be a big deal, but a nice warmer into heavier things. Also the same semester I created a rather small search engine. I would like to expand it and make it more robust.

A longer goal for this project is to improve my writing ability and possibly some Ada programming advocacy. Those two can only be solved by persistence and topics to write about. I do have my workout journal elsewhere on the internet and this may see the occasional crossover.

-- What?