2010-07-23

2nd Fundamental Law of EnScript

The Second Fundamental Law of EnScript is more complicated than the First or the Third:

All simple objects are ref-counted, but only root NodeClass objects are ref-counted. NodeClass objects in a list or tree are not ref-counted.

If you want to keep from crashing EnCase with your scripts, you must thoroughly understand this law. Unfortunately, it's a bit tricky.

What does ref-counted mean?

Like most languages in use these days, EnScript features garbage collection. When you create new objects in EnScript, those objects take up space in RAM. Rather than making you, the programmer, tell EnScript when that memory should be released, EnScript safely decides for you. It's really less like having your trash picked up once a week and much more like maid service.

An example of garbage collection in EnScript:

class MainClass {
  void Main(CaseClass c) {
    NameListClass strings = new NameListClass();
    strings.Parse("w h a t e v e r", " ");
    NameListClass strings2 = strings;
    Console.WriteLine(strings2.Count());
  }
}


In this example, you can see that I've allocated at least one NameListClass, and have a second reference. What you don't see is me calling delete, as you would if this were C++, or free, if this were C. Folks with Java, Python, or Perl experience should be used to this.

What happens is that EnScript gets to the end of the Main() function and realizes that the NameListClass object will never be used again. It's at that time that it is destroyed and the memory is released.

Reference counting is the underlying mechanism that makes this work. The key is to understand that the strings and strings2 variables in the above code are variables that refer to a NameListClass object; they don't contain the objects themselves. If you've programmed in Java or Python, this is normal. If you're coming from a C/C++ background, references are essentially pointers that prohibit you from doing fun pointer arithmetic or things like recasting 17 to an object. In the example above, the strings reference is created and assigned the memory location of the new NameListClass object. The Parse() method is called on the object, creating several child objects. The strings2 reference is then created and assigned the memory location within the strings reference; now both references refer to the same object.

Because objects can only be manipulated through references in EnScript, EnScript can track how many references point to each object. When a new object is created, a counter variable is created along with it, and when the object's memory location is assigned to a reference, the counter is incremented. When a reference is destroyed or has another object assigned to it, it decrements the counter of the object it had been pointing to. Here's the magic part: if the counter is zero after the decrement, the reference then says, whoa, I'm the last guy out of the room so I should turn out the lights... and it destroys the object.

Java uses a more advanced form of garbage collection called "mark-and-sweep" where a background thread runs periodically and tries to figure out which objects can be destroyed. Reference counting is used in Python and can also be used in C++ via the new shared_ptr<> template in the standard library (new in the TR1 update; use Boost's shared_ptr<> if you don't have it, since they're functionally equivalent). The upside to reference-counting, though, is that it's deterministic; you know that an object will be destroyed when the last reference goes away, which allows you considerable control over memory usage.

There are a couple of advantages to references and garbage collection. The first is that you don't have to worry about managing memory. The second is that you can't hose yourself with bad pointers the way you can in C or C++. A reference is either null (dereferencing it causes a runtime error, which will kill your script but won't crash EnCase) or it refers to a valid object. That's the theory, at least...

Great, what does this have to do with "root NodeClass objects"?

In our example above, we know that strings and strings2 both refer to the same NameListClass object, and at the end of the Main() function the reference count of the object should therefore be 2. It is. What about the child objects that were created by Parse()?

class MainClass {
  void Main(CaseClass c) {
    NameListClass strings = new NameListClass();
    strings.Parse("w h a t e v e r", " ");
    Console.WriteLine(strings.GetChild(2).Name());
    NameListClass strings2 = strings;
    Console.WriteLine(strings2.Count());
  }
}


Well, as it turns out, child objects in a NodeClass are not ref-counted. They're contained by the root NodeClass object. When the root is destroyed, so, too, are the children.

Okay... why aren't all objects reference counted?

Because. The main reason is that most everything you see in EnCase is derived from NodeClass (as pointed out in the First Law, which we'll revisit in the Third Law). Reference-counting has some overhead and, like most Windows apps, EnCase is written in C++ and can manage memory manually. So, internally, EnCase doesn't use reference counting for all those child objects, since that would be inefficient when working with millions of objects. Rather than make NodeClass work completely differently for EnScript objects than it does for EnCase's internal data, the semantics are the same.

Why do I need to keep this in mind as a Fundamental Law of EnScript?

Because this can hose you. You can write a script that crashes EnCase, killing not just your script, but any work it's done and any unsaved work in the case.

Consider this script:

class MainClass {
  typedef NameListClass[] ListArray;

  void Main(CaseClass c) {
    NameListClass found; // null reference;
    NameListClass strings = new NameListClass();
    strings.Parse("w h a t e v e r", " ");

    found = strings.Find("e"); // found points to child 5
    Console.WriteLine(found.Name());
    strings.Close();           // destroy all children


    // this code causes a lot more memory to be used
    ListArray array();
    for (unsigned int i = 0; i < 1000; ++i) {
      array.add(new NameListClass());
      array[i].Parse("g a r b a g e", " ");
    }


    if (found != null) { // totally true
      found.SetName("buffer overflow"); // can crash
      Console.WriteLine(found.Name());
    }
  }
}



Now, if you run this script... it may work; it just might print out "buffer overflow" to the console without any problems. But it's not guaranteed to work.

The problem is that found is set to refer to the fifth child of strings, and then we delete all the child objects. found, though, still thinks it's pointing to the fifth object... but it's really pointing to the memory formerly occupied by that object, and now that object could be used for something else. And since we ran through some code in the meantime that created 1,000 new lists, each with child objects, chances are high that that memory is now being used for something else.

The term of art for what happens when we call SetName() on found is "undefined behavior." We can't be sure.

If you're running a recent version of EnCase, what will probably happen is that the script will terminate and EnCase will report an error, but keep on running. If you're running an older version of EnCase, it'll probably crash and burn. Newer versions of v6 ask Windows to monitor for crashes that occur in EnScript and give EnCase a chance to recover from them. This is similar to the crash protection in the gallery view.

Even with the crash protection, though, you can still never be sure that you didn't hose everything. That's why you need to be careful when manipulating references to child NodeClass objects, and you need to understand how EnScript's memory model works.

2 comments:

  1. Awesome advice, and posted the same day I needed it! Thanks.

    ReplyDelete