...making Linux just a little more fun!

<-- prev

OCaml, an Introduction

By Jurjen Stellingwerff

Object Caml is an ML type of language. For the non-gurus: it's a functional language that can also be programmed in a non-functional and object-oriented way.

This language is really easy to learn. It's powerful and keeps impressing me with its speed. Programs written in this language are almost always stable by default. No segmentation faults, only occasional unending loops for the programmers that still hang on to program their own loops. It is really not needed to write most loops, since the libraries contain standard functions that are good enough in 99% of the cases. So try to use those functions: It really pays off in terms of stability of your programs, and, unless you have intimate knowledge of the inner works of this language, they tend to be better optimised.

The language can be obtained from the website caml.inria.fr. Here, they provide RPMs for the RedHat 7.2/8.0/9 and Mandrake 8.0 distributions. Also MS Windows binaries are available, but not all Unix library functions will work there, for some mysterious reason. The source tarball does compile flawlessly for me. It just has a somewhat unusual makefile layout:

# ./configure
# make world; make opt; make install

The normal libraries include many usable data-structures like balanced trees, hash tables, and streams. Their version of header files (.mli files) contain all the basic documentation you need, and those are directly converted into HTML and published on the Web in their OCaml manual. This manual is not very usable to study this language, so I'll try to explain here some of the basic language constructions. This is just to give you an impression of the power of this language.

Modules & Functions

Now some real life examples. I wrote a program to help administrating a computer. It is a subset of a normal file finder, but is a command line tool and very fast. It helps locating large, not-recently-used files to be deleted from the system. It crawls through the directory tree and show the contents in different layouts.

Every module in OCaml has its own namespace. Specific definitions can be found by adding the module name, with the first character an upper-case character. You can also change the namespace of the current program to include a total module. Normally, only the standard module 'pervasives.mli' is included in the default namespace. The example program 'show.ml' starts with:

open Basics
open Unix
open Unix.LargeFile

This includes my own set of 'basics' functions and 2 standard libraries: 'Unix' and 'Unix.LargeFile'. A module normally consists of 2 files. The first file for exporting definitions 'module.mli' (like the C .h file), and the second one for actual code (the 'module.ml' file). The program uses the function 'string_sub' that provides a foolproof version of the 'String.sub' standard function (from the string.mli module). The basics.mli file contains the lines:

val string_sub: string -> int -> int -> string
(** Get the sub string from a [string] from position [from] with [length].
This is the same function as String.sub, but it will never raise an exception.
And a negative [from] value is counted from the right side of the string. *)

This gives the definition of this function and the description. There is an automatic documentation generator (ocamldoc) that reads .mli files and writes .html files as basic interface documentation. Normal comments start with (* but the documentation generator only writes comments that start with (** to the .html files. This document contains links to the documentation of the used modules. This documentation is really helpful to start programming ocaml. The .mli files are all included in the distribution, but the complete manual and a book can be downloaded from the Web site caml.inria.fr

The function is followed by its type. It wants 3 parameters and provides a string. Normally we need to write 'Basics.string_sub' to use this function. But after the 'open Basics' instruction just 'string_sub' is enough.

Basic operations and function calls

Now, back to the main program again. The first function is 'gettype'. It will try to return the type of a file. The file type is defined as the part of the filename following the last '.'. When there is no dot, the type is unknown and returned empty.

let gettype file =
let pos = String.rindex file '.' in
String.sub file (pos+1) (String.length file-pos-1)
with Not_found -> ""

This function only uses standard functions. First, it catches the Not_found exception in the 'try' 'with Not_found -> ""' code. All other exceptions will be passed to the caller to be handled, and can possibly stop the main program. The local variable pos get is filled with the result of the function rindex. This function is also the reason to catch the exception; otherwise, the main program might stop on the first found file with no '.' in it. Local variables can be declared everywhere inside ocaml with 'let <variable> = <value> in <code>'. After the completion of the given code, the variable is out of scope and will be forgotten. The data will be passed to the garbage collector to be removed from memory. Function calls do normally use brackets. The function call to 'String.sub' gets 3 parameters the string 'file' the integer '(pos+1)' and the integer '(String.length file-pos-1)'. The last parameter calls the function 'String.length' with a single parameter 'file'. So, the functions are eager for their parameters; brackets are needed only when the parameters are filled with calculations.

Also '(+)' and '(-)' are functions of the pervasives module. It is very easy to define your own operators; just add brackets around their definition, and they are ready.

If then else

The next routine 'filesize' in the example code is far longer, but largely introduces sub-functions and 'if <bool-expr> then <expr> else <expr>' statements. This function creates a string from an int64 number for human readable file and directory sizes. The types of parameters are normally not given; they are determined by ocaml through their usage. When something is not clear, the compiler or interpreter will complain about it before executing the code.

let filesize s =
let tostr f =
  if f>9.9 then
    string_of_int (int_of_float (f +. 0.5))
    let res = string_of_float (floor (f *. 10.0 +. 0.5) /. 10.0) in
    if String.length res=2 then
      res ^ "0"
let bytes = Int64.to_float s in
if bytes > 512.0 then
  let kb = bytes /. 1024.0 in
  if kb > 512.0 then
    let mb = kb /. 1024.0 in
    if mb > 512.0 then
      let gb = mb /. 1024.0 in
      tostr gb ^ " Gb"
      tostr mb ^ " Mb"
    tostr kb ^ " kb"
  Int64.to_string s

The ocaml standard library has a set of conversion functions. These functions normally follow the form of 'int_of_float' and 'string_of_float'. Specific types like 'Int64' use shorthand notations like 'Int64.to_float'. String concatenations are done with the operation '(^)'. Normally, functions are defined for only one specific type, so there are new sets of arithmetic functions for floats like '(+.)', '(*.)' and '(/.)'. The 'tostr' sub-function has some extra calculation to change something like '5. Gb' into the nicer form of '5.0 Gb'.

List notation and type conversion

The next function, 'converttime', converts a string into a float. OCaml uses floats for date for 2 reasons. The first is to prevent possible Year 2k problems, and can also be used for less than one-second time measurements. The function accepts English acronyms for month names. So let's introduce the list and the pair to create a translation of acronyms into numbers.

let month = [("jan", 0); ("feb", 1); ("mar", 2); ("apr", 3); ("may", 4); ("jun", 5);
             ("jul", 6); ("aug", 7); ("sep", 8); ("oct", 9); ("nov", 10); ("dec", 11)]

This list is totally static, and can be used easily by the standard function List.assoc to convert a string into the corresponding number.

let converttime str =
begin match
  if str>"a" && str<"z" then
  ( int_of_string (string_sub str (String.rindex str ' '+1) 99),
    List.assoc (string_sub str 0 3) month,
  ( int_of_string (string_sub str 0 (
      try String.index str '-' with Not_found -> 99
    ( try let pos=String.index str '-'+1 in
        int_of_string (string_sub str pos (
          try String.index_from str pos '-'-pos with err -> 99
        with err -> 0
    ( try let pos=String.index str '-'+1 in
        int_of_string (string_sub str (String.index_from str pos '-'+1) 99)
        with err -> 1
with (yr,mn,md) ->
(* print_string ("Last access before: "^
     string_of_int (if yr<50 then yr+2000 else if yr<100 then yr+1900 else yr)^"-"^
     string_of_int (mn+1)^"-"^
     string_of_int md^"\n"); 
  fst (mktime 
  { tm_sec = 0; tm_min = 0; tm_hour = 0;
    tm_mday = md; tm_mon = mn;
    tm_year = if yr<50 then yr+100 else if yr<100 then yr else yr-1900;
    tm_wday = 0; tm_yday = 0; tm_isdst = false
end with err ->
  print_string ("Cannot decipher this date string '" ^ str ^ "'\n"); max_float

The new operation in this function is the 'match <expr> with <template> -> expr'. This is one of the most versatile instructions of ocaml. It can be used to examine the contents of variables and get the needed information out of it. This function creates the triplet (year, month, day-of-month) out of 2 different date notations. To debug this function the 'print_string' instruction is included but commented out to prevent clutter in the output of the program. Normally there is some logging mechanism to make the extra messages optional for the user. The 'print_string' shows the ISO notation of the given date; it creates a 4-digits year and gives a month number with January=1 instead of the internal Unix use of January=0.

This function also shows the use of 'try <expr> with err -> <expr>' that caches every possible exception and fills the variable 'err' with the details of the exception. This function can raise quite a lot of different exceptions, and frankly I am not very interested in the details. The routine just complains to the user about the given date string and gets over it. It returns the maximal possible float to include every filename.

The main standard function is the 'Unix.mktime' function. It wants to get a record filled with numbers about the current time. This function returns a pair with the needed float and a normalized record. With the pervasives function fst returns just the first parameter of the pair.

The ';' before the 'max_float' indicates that the expression results in a float, but the instructions before the ';' are calculated first. This is the first non-functional instruction inside the example code. OCaml is not strictly functional, but has the full power of other functional languages.

Dynamic data structure

Now is the time for a real data structure that is dynamically build and can be used in a lot of different ways.

type entrytype =
| Dir of entry list   (* directory with a list of files *)
| File of string      (* a file inside a directory *)


entry = {
	mutable e_name: string;   (* name of a file or directory *)
	e_type: entrytype;        (* what type is this together with type
                                     related information *)
	e_atime: float;           (* last access time *)
	e_size: int64;            (* size of the file or size of all the matching
                                     files in the directory *)

The 'and' statement is used to glue the two definitions together. They are created at the same time so that 'entrytype' can include 'entry' and vice-versa. 'entrytype' can consist one of 2 things: a directory with a list of entries or a file with its type. The directory entry has a mutable name. This is can be used later on to change a filename info the full path to that file.

As with ANSI C, the operators for Boolean algebra are '(&&)' and '(||)'.


let rec dirwrite el depth sortfn =
List.iter (
  fun e ->
    match e.e_type with
    | Dir lst ->
       if e.e_size <> Int64.of_int 0 then begin
         print_string ((String.make (depth*2) ' ') ^ "Directory " ^
           e.e_name ^ " = (" ^ filesize e.e_size ^ ")\n"); 
         dirwrite lst (depth+1) sortfn
    | File string -> 
       print_string ((String.make (depth*2) ' ') ^ e.e_name ^
         " (" ^ filesize e.e_size ^ ")\n")
  ) (List.sort sortfn el)

Here is the recursive ('rec') function 'dirwrite' that traverses a given tree 'el' and writes the result to the standard output. The parameter 'depth' indicates the amount of spaces to write a tree like structure of filenames. The function sorts all the lists with the given function 'sortfn'. The new language structure here is 'fun <parm-1> ... <parm-n> -> <expr>'. This construction creates a function without a name. The parameters of this function like construction can be used like a template to match pairs.

This function suppresses directories that are 0 bytes in size to reduce clutter.

Variables vs. definitions

(* List of global variables *)

let min_size = ref (Int64.of_int 0) and    (* minimum size of a file in bytes *)
    last_access = ref max_float and        (* last access time in seconds since 1970 *)
    has_type = ref "" and                  (* type of file to show or empty to
                                              show all *)
    name_match = ref "" and                (* regular expression to match the filename
                                              with; empty is show all *)
    name_regexp = ref (Str.regexp "") and  (* pre-calculated regular expression *)
    no_symlinks = ref false                (* don't follow symbolic links to
                                              directories *)

This is a list of variables that can be changed due to the 'ref <expr>' construction. Normally definitions are just a label to their contents. These definitions are pointers to the memory and can be read by '!<variable>' and written by '<variable> := <expr>'. The parameters given to the program can make changes to the way the files are read.

let rec dirread path =
let list = ref [] and
    size = ref (Int64.of_int 0) in
let dh = opendir path in
while true do
  let file = readdir dh in
  if file<>".." && file<>"." && file<>"CVS" && String.sub file 0 1 <> "." then
  let s=stat (path^"/"^file) in
  if s.st_kind = S_DIR &&
    (not !no_symlinks || (lstat (path^"/"^file)).st_kind <> S_LNK)
    let dir = dirread (path^"/"^file) in
    list := 
    { e_name = file;
      e_type = Dir (fst dir);
      e_atime = s.st_atime;
      e_size = snd dir
    } :: !list;
    size := Int64.add !size (snd dir)
  else if 
    (!has_type = "" || gettype file = !has_type) &&
    s.st_size > !min_size && 
    s.st_atime < !last_access &&
    (!name_match = "" || Str.string_match !name_regexp file 0)
  then begin
    list := 
    { e_name = file;
      e_type = File (gettype file);
      e_atime = s.st_atime;
      e_size = s.st_size;
    } :: !list;
    size := Int64.add !size s.st_size
(!list, !size)
| End_of_file -> (!list, !size)
| Unix_error (EACCES, err, parm) -> (!list, !size)

The following functions are introduced in the function 'dirread':

  • Unix.opendir to start reading a directory.
  • Unix.readdir to read a filename.
  • Unix.stat for a record (Unix.stats) of statistics on a file.
  • Unix.lstat for statistics on a link.
  • Int64.add to add two int64 type of variables
  • Str.regexp to create a new interpreted regular expression
  • Str.string_match to match a string against a regular expression
  • Pervasives.(::) to create a list with an extra element in front of the old one
  • Pervasives.true as a Boolean constant
  • Pervasives.snd to return the second part of a pair
  • exception Unix.Unix_error (EACCESS, err, parm) that is raised when an access denied is encountered.
  • There is also a new construction 'while <boolean-expr> do <code> done' it just does what it is supposed to do.

    Small is beautiful

    let rec flat el path =
    List.fold_right (
      fun e ls ->
        match e.e_type with
        | Dir lst -> flat lst (path ^ "/" ^ e.e_name) @ ls
        | File string ->
            e.e_name <- (path ^ "/" ^ e.e_name);
            e :: ls
      ) el []

    This neat little routine 'flat' hits the tree 'el' flat on the ground. It takes every file from every branch and creates a single list of all the encountered files. This is done with one of the most versatile standard routines inside ocaml: the 'List.fold_right' routine. This routine introduces a state machine (scarab) that crawls over a list and operates on every encountered element. It produces a new structure (droppings) as a result -- in this case, a flattened list.

    The construction '<record-field> <- <expr>' changes the contents of a mutable record field. Without mutable fields, you can mutate records only by creating a new one with lots of fields inherited from the old one. This is a shortcut for that.

    let name_order a b =
    compare a.e_name b.e_name
    let type_order a b =
    let typea = match a.e_type with Dir ls -> "dir" | File tp -> tp and
        typeb = match b.e_type with Dir ls -> "dir" | File tp -> tp in
    if compare typea typeb = 0 then
      compare a.e_name b.e_name
    else compare typea typeb
    let atime_order a b =
    compare a.e_atime b.e_atime

    A set of sorting functions to use inside 'dirwrite'. The function 'compare' results in the widely used values of -1 for lower than, 0 for equal and +1 for higher than.

    Command line parameters

    let dir = ref "." and
        sort = ref name_order and
        show = ref 0
    Arg.parse [
       ("-t",Arg.Unit (fun () -> sort := type_order), 
         "Sort by type and filename");
       ("-l",Arg.Unit (fun () -> sort := atime_order),
         "Sort by last access time");
       ("-n",Arg.Unit (fun () -> show := 1),
         "List filenames");
       ("-b",Arg.Unit (fun () -> show := 2),
         "List both filenames and sizes");
       ("-s",Arg.Unit (fun () -> no_symlinks := true),
         "Don't follow symbolic links");
       ("--before",Arg.String (fun s -> last_access := converttime s),
         "Last access older than give date (format 'yyyy-mm-dd' or 'mmm yyyy')");
       ("--size",Arg.Int (fun i ->
            min_size := Int64.mul (Int64.of_int i) (Int64.of_int (1024*1024))
         ), "File size bigger than size in Mbytes");
       ("--type",Arg.String (fun s -> has_type := s),
         "File is specific type");
       ("--name",Arg.String (fun s ->
            name_match := s; name_regexp := Str.regexp (s ^ "$")
         ), "Filename matches regular expression")
    ] (fun d -> dir := d) "show [DIR]";
    let res = dirread !dir in
    if !show=0 then begin
      dirwrite (fst res) 0 !sort;
      print_string ("Total size " ^ filesize (snd res) ^ "\n")
    end else
        (fun e -> 
          print_endline (e.e_name ^ if !show=2 then " ("^filesize e.e_size^")" else "")
        ) (List.sort !sort (flat (fst res) !dir))

    And here is the main routine. It calls the Arg.parse routine to parse the parameters given to the program. But this is too much un-GNU for me. I wrote my own version of it that follows the GNU coding standards a bit more than the default one (Gnuarg). The other version is a bit more complicated so I will include only the sources that use it.

    Generating binaries

    The code can be obtained from here. Just unpack it somewhere with 'tar -xzf show.tar.gz' and move into the source directory with 'cd show/src'. There is also a Makefile that compiles to machine code and installs everything. But Makefiles are too rough for sour eyes to show in this article. The nitty-gritty details are there in the source. The general compile form is.

    ocamlopt -o show unix.cmxa str.cmxa basics.cmx show.ml

    The only non-standard libraries in use here are unix.cmxa and str.cmxa.

    make install
    show --help
    show -s ~ --size 3 --before "apr 2003"

    That concludes this example program.

    Language features

    Garbage collector
    Just forget variables that contain complete data structures. Once it gets out of scope, the total structure will be eliminated from memory in due time.
    Flexible data-structures
    Any 2 data structures can be combined without hassle. Just create an array of records that contain 2 fields with hash tables of strings. No problem there... everything in a single variable than can be passed to functions or can be used globally in the program.
    No pointers needed
    A variable can have any type and when a new variable is created
    Flexible in language boundary checks
    The language can check array and string boundaries automatically, or those checks can be turned off for an extra speed boost. Without it, the language can give a segmentation fault, but that is the programmer's choice.
    High quality error handling
    Totally integrated into the language and no notable performance hit.
    Native code generator and byte code interpreter
    All the tools are there -- interpreter (ocaml), byte code (ocamlc) and native code compiler (ocamlopt) -- every wish is granted. The package comes also with a documentation generator (ocamldoc) and a simple to use profiler (ocamlprof) that adds usage counts as comments to the original source code. The language is also compatible with the more sophisticated profilers around.
    ANSI-C compatibility layer
    It is possible to include ANSI C routines inside OCaml programs, and OCaml routines inside C programs. This has a very easy to use API. Slightly less easy is the creation of OCaml data structures inside C; for me, that was the source of many segfaults. So, my routines call exported OCaml routines to fill data structures and create only OCaml strings and numbers in C. That way I won't have the hassle to debug the C code... OCaml is much easier to debug for me.
    Object orientation
    Not my favourite programming paradigm, but it is possible to build object-oriented programs in this language. Those features are not part of this article. I can live without them.
    An active mailing list
    This list is at caml-list@inria.fr and is normally in English. Yes, this originally French project has taken the burden to translate almost everything they got. This is no easy feat for them, so be grateful.


    Duplicate efforts in libraries
    There are separate libraries for different type of big arrays, big files, and extra long integers. This isn't a big problem, because you can always just start with the normal structures and drop in the special library when need arises. The naming of the different functions is very much standardized, so renaming of function calls isn't needed much. The extra long integers though are too much different from normal integers. That part of the standard functions really need some tuning.
    You need to be familiar with the basis constructions of the language, to make any sense of the actual code. Some constructions can look really weird without intimate knowledge of the language. OCaml is not a very natural language and has a very powerful, short notation for things. But this not much worse than languages like ANSI C, Perl, or lisp.
    Not known enough in the Linux world
    This language has excellent interfaces to standard libraries and easy binding to ANSI C, but still isn't very known. I like to create some articles like this to change that a bit. This is a really great language to program in, and gives you real power without the pitfalls common in other languages. Programmers should give it a try and feel that power once.


    [BIO] Developer at a small technology firm in the Netherlands called V&S bv. (www.v-s.nl) We sell firewall, anti-virus and spam boxes based on the Linux OS. Using more and more the OCaml language to write my applications. Busy writing a lightweight http server with an internal scripting language (camlserv.sourceforge.net, source code here) Interested in writing AI based computer games. Always trying writing one, nothing ready yet.

    Copyright © 2004, Jurjen Stellingwerff. Released under the Open Publication license unless otherwise noted in the body of the article. Linux Gazette is not produced, sponsored, or endorsed by its prior host, SSC, Inc.

    Published in Issue 99 of Linux Gazette, February 2004

    <-- prev