JavalibSawja

Javalib tutorial

Introduction

Javalib is a library written in OCaml with the aim to provide a high level representation of Java .class files. Thus it stands for a good starting point for people who want to develop static analysis for Java byte-code programs, benefiting from the strength of OCaml language.

This document briefly presents the API provided by Javalib and gives some samples of code in OCaml.

Global architecture

All modules of Javalib are sub-modules of the package module Javalib_pack in order to avoid possible namespace conflicts. The user interface consists of three modules, JBasics, JCode and Javalib. These modules are the only ones someone should need to write static analysis. The other ones provide lower level functions and should not be used.

JBasics module

This module gives a representation of all the basic types that appear at the byte-code level.

These types are:

The types representing class names, method signatures and field signatures are abstract. The directives to build them are in JBasics and are respectively make_cn, make_ms and make_fs.

This module also provides some sets and maps containers relative to the abstract class names, field signatures and method signatures. These maps and sets are very efficient because they rely on the hidden indexation of each abstract type, thereby reducing the cost of comparison operations between types. Moreover the maps are based on Patricia Trees implementation, that is known to be really fast.

JCode module

This module provides a representation of the JVM opcodes and of the code structure at the byte-code level.

It is important to notice that the code is represented by an array of opcodes, and that the position of an opcode in this array corresponds to its relative address in the original binary file (this is the same numbering as javap). In Javalib representation, an opcode includes an instruction and its arguments. That's why we needed to introduce a dummy opcode OpInvalid to keep the correct numbering.

Javalib module

This is the main module of Javalib which contains the definition of classes, fields and methods types, some methods to get a high level representation from binary classes and an internal module JPrint to pretty-print or dump every type defined in the three user modules.

Tutorial

To begin this tutorial, open an OCaml toplevel, for instance using the Emacs tuareg-mode.

Two standard libraries str, unix, and four specific libraries, extlib, zip, ptrees and javalib must be loaded. However, paths to find these four latter librairies must be specified with the #directory directives first:

#directory "<package_install_path>extlib"
#directory "<package_install_path>camlzip"
#directory "<package_install_path>ptrees"
#directory "<package_install_path>javalib"
(*<package_install_path> is given by command 'ocamlfind printconf'. 
If it is the same path than standard ocaml library just replace by '+'.*)

Then load the following libraries in the given order:

#load "str.cma"
#load "unix.cma"
#load "extLib.cma"
#load "zip.cma"
#load "ptrees.cma"
#load "javalib.cma"

You can also build a toplevel including all these libraries using the command make ocaml in the sources repository of Javalib. This command builds an executable named ocaml which is the result of the ocamlmktop command.

Making class names, field signatures and method signatures

Imagine you want to access the method m:(Ljava.lang.String;)V and the field f:I of the class A.

You first need to build the signatures associated to each entity. According to the Javalib API you will write:

open Javalib_pack
open JBasics
let aname = make_cn "A"
let java_lang_string = make_cn "java.lang.String"
let ms =
make_ms "m" [TObject (TClass java_lang_string)] None
let fs = make_fs "f" (TBasic `Int)

Getting a class representation from a binary file

The methods you need are in the Javalib module. You can open this module cause you will need it very often.

open Javalib

Then, you need to build a class_path to specify where the classes you want to load have to be found:

let class_path = class_path "./" (* for instance *)

You can now load the class ./A.class corresponding to aname.

let a = get_class class_path aname

When you don't need a classpath any more, close it with close_class_path if you don't want to get file descriptors exceptions in a near futur.

Getting fields and methods from a class

You now have the class a of type Javalib.interface_or_class. You might want to recover its method m of type Javalib.jmethod and field f of type Javalib.any_field.

Simply do:

let m = get_method a ms
let f = get_field a fs
Note:
The methods get_method and get_field raise the exception Not_found if the method or field asked for can't be found.

It's important to notice that a can be a Class of jclass or an Interface of jinterface (see type interface_or_class), but that the methods get_method and get_field work equaly on it. That's why get_field returns a value of type any_field which can be ClassField of class_field or Interface_field of interface_field. Indeed, according to the JVM specification, we need to make the distinction between interface fields and class fields.

A more sophisticated example

Now we would like to write a function that takes a classpath and a classname as parameters and that returns, for each method of this class, a set of the fields accessed for reading (instructions getstatic and getfield).

Here is the code:

open Javalib_pack
open Javalib
open JBasics
open JCode

let get_accessed_fields (class_path : class_path)
 (cn : class_name) =
 (* We first recover the interface or class associated to the
    class name cn. *)
 let c = get_class class_path cn in
 (* Then, we get all the methods of c. *)
 let methods : jcode jmethod MethodMap.t = get_methods c in
 (* For each method of c, we associate a field set containing
    all the accessed fields. *)
   MethodMap.map
    (fun m ->
      match m with
      (* A method can be abstract or concrete. *)
       | AbstractMethod _ ->
         (* An abstract method has no code to parse. *)
          FieldSet.empty
       | ConcreteMethod cm ->
          (match cm.cm_implementation with
          (* A concrete method can be native so that we don't
             know its behaviour. In this case we suppose that
             no fields have been accessed which is not safe. *)
            | Native -> FieldSet.empty
            | Java code ->
              (* The code is stored in a lazy structure, for
                 performance purposes. Indeed when loading a
                 class the Javalib does not parse its methods. *)
               let jcode = Lazy.force code in
               (* We iter on the array of opcodes, building our
                  field set at the same time. *)
                 Array.fold_left
                  (fun s op ->
                    match op with
                     | OpGetField (_, fs)
                     | OpGetStatic (_, fs) ->
                  (* We add the field signature in our field set.
                     In this example, we ignore the classes in
                     which the fields are defined. *)
                        FieldSet.add fs s
                     | _ -> s
                  ) FieldSet.empty jcode.c_code
          )
    ) methods

This method has the signature

Javalib.class_path ->
  JBasics.class_name -> JBasics.FieldSet.t JBasics.MethodMap.t

Another use case

Consider the following class written in java:

public class TestString{
   public boolean m(String s){
      if (s.equals("str")){
         return true;
      } else{
         return false;
      }
   }
}

We see that the method m might raise an NullPointer exception if we call the method equals on an uninitialized string s. To avoid this, a good practice is to replace the test s.equals("str") by the expression "str".equals(s) which will return false rather than raising an exception.

Let's see the bytecode associated to the method m, given by javap:

public boolean m(java.lang.String);
  Code:
   0:   aload_1
   1:   ldc     #2; //String str
   3:   invokevirtual   #3; //Method
           java/lang/String.equals:(Ljava/lang/Object;)Z

We will now write a sample of code that detects instructions of type ldc 'string' followed by an invokevirtual on java.lang.String.equals method.

We first need to write a function that returns the next instruction and its program point in a code, given this code and a current program point:

let rec next_instruction (code : jopcodes) (pp : int)
  : (jopcode * int) option =
 try
   match code.(pp+1) with
    | OpInvalid -> next_instruction code (pp+1)
    | op -> Some (op,pp+1)
 with _ -> None

Now we define a function that takes a classpath and a classname as parameters and that returns a map associating each concrete method signature to a list of (int,string) couples representing the program points and the strings on which the java.lang.String.equals method is called.

let get_equals_calls (class_path : class_path)
  (cn : class_name) =
 (* We first recover the interface or class associated to the
    class name cn. *)
 let java_lang_string = make_cn "java.lang.String" in
 let equals_ms =
   make_ms "equals" [TObject (TClass java_lang_object)]
    (Some (TBasic `Bool)) in
 let c = get_class class_path cn in
 (* Then, we get all the concrete methods of c. *)
 let methods : jcode concrete_method MethodMap.t =
   get_concrete_methods c in
 (* For each concrete method of c, we associate a (int*string) list
    containing all the strings passed as parameters to
    String.equals method, associated to the program point where the
    call occurs. *)
   MethodMap.map
    (fun m ->
     (match m.cm_implementation with
       (* A concrete method can be native so that we don't
          know its behaviour. In this case we suppose that
          no call to String.equals which is not safe. *)
       | Native -> []
       | Java code ->
         (* The code is stored in a lazy structure, for
            performance purposes. Indeed when loading a
            class the Javalib does not parse its methods. *)
          let jcode = Lazy.force code in
          let code = jcode.c_code in
          let l = ref [] in
          (* We iter on the array of opcodes, building
              our list of (int*string) at the same time. *)
            Array.iteri
             (fun pp op ->
               match op with
                | OpConst (`String s) ->
                  (* We detect that a string s is pushed on the
                     stack. The next instruction might be an
                     invokevirtual on String.equals. *)
                   (match (next_instruction code pp) with
                     | Some (inst,ppi) ->
                        (match inst with
                          | OpInvoke (`Virtual (TClass cn), ms)
                             when cn = java_lang_string
                               && ms = equals_ms ->
                            (* We add the program point of the
                               invokevirtual and the pushed string
                               in our list. *)
                             l := (ppi, s) :: !l
                          | _ -> ()
                        )
                     | None -> ()
                   )
                | _ -> ()
             ) code;
          (* We simply return our list, in the reverse order so that
             the program points appear in ascending order. *)
            List.rev !l
     )
    ) methods

This method has the signature

    Javalib.class_path ->
      JBasics.class_name -> (int * string) list JBasics.MethodMap.t

We obtain the expected result on the previous class TestString:

# let cp = class_path ".";;
val cp : Javalib.class_path = <abstr>

# let cn = make_cn "TestString";;
val cn : JBasics.class_name = <abstr>

# let mmap = get_equals_calls cp cn;;
val mmap : (int * string) list JBasics.MethodMap.t = <abstr>

# let l = 
    let sk = List.map (ms_name) (MethodMap.key_elements mmap)
    and sv = MethodMap.value_elements mmap in
      List.combine sk sv;;
val l : (string * (int * string) list) list =
  [("m", [(3, "str")]); ("<init>", [])]

# let () = close_class_path cp;;
Last modification: