-- What?: Plans gone awry

An unforeseen obstacle is that the file IO included with XML/Ada is not the most robust solution. The included file input source works by reading the entire xml file to a buffer at once. The largest file it can index is 2 GB. That is not going to work for the 12 GB file I have.

My first thought was to read large blocks with Sequential_IO for the majority of the file and then use progressively smaller calls to Direct_IO for the remainder. The size of the calls to Direct_IO would be determined by the leq_pot demonstration posted earlier. I decided against this approach because of the complicated buffer and file slicing. I would've ended up with a massive development headache and then needed to run more extensive tests. Best tests are tests you don't have to run.

Stream_IO is one of the more powerful IO packages in Ada. I had toyed with the idea of using Stream_IO to read the data, then using Unchecked Conversion to change the data to the proper type. I'm not a fan of Unchecked Conversion, and I would need to have multiple copies of the same data in memory. I considered that ugly and moved on. The second attempt was to make a buffered Stream_IO package. I wrote the package pretty quickly and started implementing it. After doing more investigation I discovered that the decoding procedure for Unicode characters requires that the buffer be internal to the Input_Source and not in the IO package. Back to the drawing board.

After visiting the drawing board this time I felt much better about my solution. Using Stream_IO, buffering logic from the previous attempt, and a procedure borrowed from the UTF-8 Unicode encoding I managed to make an object that reads a file using Stream_IO and converts Stream_Elements directly to UTF-8 characters. Additional encoding schemes shouldn't be hard to do, but I'll just stick with UTF-8 for now. This is what I came up with:

(input_source-large_file.ads)

-----------------------------------------------------------------------
--                XML/Ada - An XML suite for Ada95                   --
--                                                                   --
--                       Copyright (C) 2001-2002                     --
--                            ACT-Europe                             --
--                    Modified by Alex Habeger 2008                  --
--                                                                   --
-- This library is free software; you can redistribute it and/or     --
-- modify it under the terms of the GNU General Public               --
-- License as published by the Free Software Foundation; either      --
-- version 2 of the License, or (at your option) any later version.  --
--                                                                   --
-- This library is distributed in the hope that it will be useful,   --
-- but WITHOUT ANY WARRANTY; without even the implied warranty of    --
-- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU --
-- General Public License for more details.                          --
--                                                                   --
-- You should have received a copy of the GNU General Public         --
-- License along with this library; if not, write to the             --
-- Free Software Foundation, Inc., 59 Temple Place - Suite 330,      --
-- Boston, MA 02111-1307, USA.                                       --
--                                                                   --
--
--
--
--
--
--
--

with Ada.Streams.Stream_IO;
with Ada.Streams;
with Unicode;
with Unicode.CES;

package Input_Sources.Large_File is

   type Large_File_Input (Max_Buffer_Size : Positive) is new Input_Source with private;
   -- Max buffer size is the maximum size of the buffer in MB
   
   type Large_File_Input_Access is access all Large_File_Input'Class;
   --  A special implementation of a reader, that reads from a file.

   procedure Open (Filename : String; Input : out Large_File_Input);
   --  Open a new file for reading.
   --  Note that the file is read completly at once, and saved in memory.
   --  This provides a much better access later on, however this might be a
   --  problem for very big files.
   --  The physical file on disk can be modified at any time afterwards, since
   --  it is no longer read.
   --  This function can decode a file if it is coded in Utf8, Utf16 or Utf32

   procedure Close (Input : in out Large_File_Input);
   --  Close the file and free the memory

   procedure Next_Char
     (From : in out Large_File_Input;
      C    : out Unicode.Unicode_Char);
   --  Return the next character in the file.

   function Eof (From : Large_File_Input) return Boolean;
   --  True if From is past the last character in the file.

private
   
   type Stream_Element_Array_Access is access all Ada.Streams.Stream_Element_Array;
   
   -- the parent object
   -- type Input_Source is abstract tagged limited record
   --    Prolog_Size : Natural := 0;
   --    Es          : Unicode.CES.Encoding_Scheme :=
   --      Unicode.CES.Basic_8bit.Basic_8bit_Encoding;
   --    Cs          : Unicode.CCS.Character_Set :=
   --      Unicode.CCS.Unicode_Character_Set;
   --    Public_Id   : Unicode.CES.Byte_Sequence_Access;
   --    System_Id   : Unicode.CES.Byte_Sequence_Access;
   -- end record;
   
   type Large_File_Input (Max_Buffer_Size : Positive) is new Input_Source with
      record
         Buffer_Index : Ada.Streams.Stream_Element_Offset;
         Buffer_Last  : Ada.Streams.Stream_Element_Offset;
         Buffer       : Stream_Element_Array_Access;
         File         : Ada.Streams.Stream_IO.File_Type;
      end record;
end Input_Sources.Large_File;

(input_sources-large_file.adb)

-----------------------------------------------------------------------
--                XML/Ada - An XML suite for Ada95                   --
--                                                                   --
--                       Copyright (C) 2001-2006                     --
--                            ACT-Europe                             --
--                    Modified by Alex Habeger 2008                  --
--                                                                   --
-- This library is free software; you can redistribute it and/or     --
-- modify it under the terms of the GNU General Public               --
-- License as published by the Free Software Foundation; either      --
-- version 2 of the License, or (at your option) any later version.  --
--                                                                   --
-- This library is distributed in the hope that it will be useful,   --
-- but WITHOUT ANY WARRANTY; without even the implied warranty of    --
-- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU --
-- General Public License for more details.                          --
--                                                                   --
-- You should have received a copy of the GNU General Public         --
-- License along with this library; if not, write to the             --
-- Free Software Foundation, Inc., 59 Temple Place - Suite 330,      --
-- Boston, MA 02111-1307, USA.                                       --
--                                                                   --
--
--
--
--
--
--
--
with Unicode;            use Unicode;
with Ada.Streams;        use Ada.Streams;
with Ada.Streams.Stream_IO; 
with Unicode.CES;        use Unicode.CES;
with Unicode.CES.Utf32;  use Unicode.CES.Utf32;
with Unicode.CES.Utf16;  use Unicode.CES.Utf16;
with Unicode.CES.Utf8;   use Unicode.CES.Utf8;
with Unchecked_Deallocation;

package body Input_Sources.Large_File is

   -------------------------
   -- Internal procedures --
   -------------------------
   procedure Free_Stream is new Unchecked_Deallocation
     (Stream_Element_Array, Stream_Element_Array_Access);
   
   procedure Fill_Buffer (Input : in out Large_File_Input) is
   begin
      if not Ada.Streams.Stream_IO.End_Of_File (Input.File) then
         Input.Buffer_Index := 1;
         Ada.Streams.Stream_IO.Read (Input.File, Input.Buffer.All, Input.Buffer_Last);
      end if;
   end Fill_Buffer;
   
   ----------
   -- Open --
   ----------

   procedure Open (Filename : in     String;
                   Input    :    out Large_File_Input) is
      Length : Ada.Streams.Stream_Element_Offset;
      BOM    : Bom_Type;
      BOM_Indicator : String (1..4);
   begin
      Ada.Streams.Stream_IO.Open (Input.File, Ada.Streams.Stream_IO.In_File, Filename);
      Length := Stream_Element_Offset (Ada.Streams.Stream_IO.Size (Input.File));

      --  If the file is empty, we just create a reader that will not return
      --  any character. This will fail later on when the XML document is
      --  parsed, anyway.
      
      
      if Length = 0 then
         -- File is empty
         Input.Buffer := new Ada.Streams.Stream_Element_Array (1 .. 1);
         Input.Buffer_Index := 2;
         return;
      elsif Length > Stream_Element_Offset(1048576 * Input.Max_Buffer_Size) then
         -- file is longer than the max buffer size.
         -- will set length of buffer to Max_Buffer_Size
         Length := Stream_Element_Offset(1048576 * Input.Max_Buffer_Size);
      -- else
         -- buffer is length of file
      end if;

      Input.Buffer := new Ada.Streams.Stream_Element_Array (1 .. Length);
      Fill_Buffer (Input);
      
      for Index in BOM_Indicator'Range loop
         BOM_Indicator(Index) := Character'Val(Input.Buffer.All(Stream_Element_Offset(Index)));
      end loop;

      Read_Bom (BOM_Indicator, Input.Prolog_Size, BOM);
      case BOM is
         when Utf32_LE =>
            -- Set_Encoding (Input, Utf32_LE_Encoding);
            raise Invalid_Encoding;
         when Utf32_BE =>
            -- Set_Encoding (Input, Utf32_BE_Encoding);
            raise Invalid_Encoding;
         when Utf16_LE =>
            -- Set_Encoding (Input, Utf16_LE_Encoding);
            raise Invalid_Encoding;
         when Utf16_BE =>
            -- Set_Encoding (Input, Utf16_BE_Encoding);
            raise Invalid_Encoding;
         when Ucs4_BE | Ucs4_LE | Ucs4_2143 | Ucs4_3412 =>
            raise Invalid_Encoding;
         when Utf8_All | Unknown =>
            Set_Encoding (Input, Utf8_Encoding);
      end case;

      Input.Buffer_Index := Input.Buffer'First + Stream_Element_Offset(Input.Prolog_Size);

      --  Base file name should be used as the public Id
      Set_Public_Id (Input, Filename);
      Set_System_Id (Input, Filename);
   end Open;

   -----------
   -- Close --
   -----------

   procedure Close (Input : in out Large_File_Input) is
   begin
      Ada.Streams.Stream_IO.Close (Input.File);
      Free_Stream (Input.Buffer);
      Input_Sources.Close (Input_Source (Input));
      Input.Buffer_Index := Ada.Streams.Stream_Element_Offset'Last;
   end Close;

   ---------------
   -- Next_Char --
   ---------------

   procedure Next_Char
     (From : in out Large_File_Input;
      C :       out Unicode.Unicode_Char) is
      
      Len  : Ada.Streams.Stream_Element_Offset;
      Val  : Unicode_Char;
      Char : Unicode_Char := Unicode_Char (From.Buffer.All(From.Buffer_Index));
   begin
      -- original:
      -- From.Es.Read (From.Buffer.all, From.Index, C);
      -- C := From.Cs.To_Unicode (C);
      -- Read converts from an 8 bit character string to unicode
      -- To_Unicode returns whatever is passed to it (aka nothing) ?!
      
      -- the following is slightly modified code from Unicode's conversion
      -- to UTF-8 procedure.
      
      -- Compute the length of the encoding given what was in the first byte
      if Char < 128 then
         Len := From.Buffer_Index;
         Val := Char and 16#7f#;
      elsif (Char and 16#E0#) = 16#C0# then
         Len := From.Buffer_Index + 1;
         Val := Char and 16#1f#;
      elsif (Char and 16#F0#) = 16#E0# then
         Len := From.Buffer_Index + 2;
         Val := Char and 16#0f#;
      elsif (Char and 16#F8#) = 16#F0# then
         Len := From.Buffer_Index + 3;
         Val := Char and 16#07#;
      elsif (Char and 16#FC#) = 16#F8# then
         Len := From.Buffer_Index + 4;
         Val := Char and 16#03#;
      elsif (Char and 16#FE#) = 16#FC# then
         Len := From.Buffer_Index + 5;
         Val := Char and 16#01#;
      else
         raise Invalid_Encoding;
      end if;

      if From.Buffer'Last < Len then
         raise Invalid_Encoding;
      end if;

      for Count in From.Buffer_Index + 1 .. Len loop
         Char := Unicode_Char(From.Buffer.All (Count));
         if (Char and 16#C0#) /= 16#80# then
            raise Invalid_Encoding;
         end if;
         Val := (Val * (2 ** 6)) or (Char and 16#3f#);
      end loop;

      From.Buffer_Index := Len + 1;
      C  := Val;
      
      if From.Buffer_Index > From.Buffer_Last then
         -- buffer is empty
         Fill_Buffer(From);
      end if;
   end Next_Char;

   ---------
   -- Eof --
   ---------

   function Eof (From : Large_File_Input) return Boolean is
   begin
      return From.Buffer_Index > From.Buffer_Last;
   end Eof;

end Input_Sources.Large_File;

The time required to parse a file using this input source is much more consistent. For the 390 MB Uncyclopedia the file source included with XML/Ada would take anywhere from 1:24 to 2:01. What I wrote it is always between 1:12 and 1:14. Part of the speedup is probably due to the inlining of the UTF-8 conversion. The more consistent times is likely due to finer grained IO. Using a 16 MB buffer CPU usage on a 2.4 Ghz Core 2 hovers between 90% and 100% utilization of a single core and 18 MB of RAM is consumed. A parser with high standards for what constitutes a document processed the 12 GB wikipedia and counted 1074662 articles in 23 minutes.

Another input source to verify this against with would be nice. I've verified this input source with the smaller Uncyclopedia. For all I know this source is faulty after the first 390 MB. Then again 390 MB worth of data is probably a high enough standard.

Lessons taught:

This is the first tagged type extension, outside of a book exercise, that I have done.
Tagged types cannot have default discriminants.
Don't expect to deal with large files easily.
Simple is better.

What I'd like to continue with:

Make the package handle more than UTF-8.
Find a more optimal buffer size.
Can another thread speed up execution or is it IO bound?

-- What?

Tuesday, January 1, 2008

Plans gone awry

No comments:

Blog Archive

About Me