Real Quick C++ : Memory Management and all that

With two programs under our belts, we are practically C++ experts. But there's more to learn. We need to discuss how to create classes that are derived from other classes, pointers and references, and memory management.

My fine inheritance

In the Unigram text classifier example, we created a single class that did not inherit from any other class. Given the way it was written, we could not create other classes that inherit from it. We could, of course, have written our code to allow this.

For example, we might have created a more general TextClassifier class from which our UnigramTextClassifier is derived. Here's a sketch of what the TextClassifier header would look like:

#ifndef TextClassifier_H
#define TextClassifier_H
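#include <map>
#include <iostream>
#include <fstream>
#include <string>
namespace TextClassifier
{
  class TextClassifier
  {
  public:

    TextClassifier();
    TextClassifier(const std::string classification);
    /* virtual destructor, so derived objects can be deleted via a base pointer */
    virtual ~TextClassifier() {}

    unsigned long corpusTotal() { return _corpus_total; }
    /* ... etc ... */
    /* pure virtual: derived classes must define learn */
    virtual void learn(std::istream& in) =0;

    /* virtual: derived classes may override score */
    virtual float score(std::istream& in);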
    /* ... etc ... */
  private:
    /*! internal total number of characters in corpus */
    unsigned long _corpus_total;
    /* ... etc ... */
  };
}
using namespace std;
#endif /* TextClassifier_H */

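Methods that may be overridden are introduced with virtual, as in the declaration of score. Methods that derived classes must define (much like the methods of a Java interface) are declared "pure virtual" by appending =0, as in the declaration of learn. A class with at least one pure virtual method is abstract and cannot be instantiated directly.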

To create a subclass (a 'derived class,' in C++ jargon), you name the base class after the class name, following a colon. Here's an example of what the header for UnigramTextClassifier might look like:

#ifndef UnigramTextClassifier_H
#define UnigramTextClassifier_H
#include "TextClassifier.h"
namespace TextClassifier 
{
  class UnigramTextClassifier : public TextClassifier 
  {
  public:

    /* ... etc ... */

  private:

    /* ... etc ... */
  };
}
using namespace std;
using namespace TextClassifier;
#endif /* UnigramTextClassifier_H */

C++ allows for multiple inheritance of classes (that is, a class can be derived from more than one class).
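
To see virtual and =0 at work, here is a minimal sketch. The classes Classifier and CharCounter and the function learnThroughBase are hypothetical stand-ins invented for illustration; they are not the tutorial's actual TextClassifier code.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical base class, loosely modeled on the TextClassifier sketch.
class Classifier
{
public:
  virtual ~Classifier() {}
  // Pure virtual: every concrete derived class must define learn.
  virtual void learn(std::istream& in) = 0;
  // Virtual with a default body: derived classes may override score.
  virtual float score(std::istream&) { return 0.0f; }
};

// Hypothetical derived class that simply counts the characters it reads.
class CharCounter : public Classifier
{
public:
  CharCounter() : my_total(0) {}
  virtual void learn(std::istream& in)
  {
    char c;
    while (in.get(c)) ++my_total;
  }
  unsigned long total() const { return my_total; }
private:
  unsigned long my_total;
};

// The calls go through a base-class reference, yet the derived learn runs.
unsigned long learnThroughBase(const std::string& text)
{
  CharCounter counter;
  Classifier& base = counter;
  std::istringstream in(text);
  base.learn(in);                  // dispatches to CharCounter::learn
  assert(base.score(in) == 0.0f);  // CharCounter inherits the default score
  return counter.total();
}
```

Writing Classifier c; would not compile, because learn has no definition in the base class.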

What's the point?

No discussion of C++ would be helpful without some discussion of pointers and references. As we all know, variables have to be stored somewhere in memory, although typically, we think of a variable in terms of its value and not its location. A pointer is simply a variable whose value is a location.

Why do we need to know about pointers? C++ requires us to constantly consider whether we are dealing with a value or a pointer.

For example, consider the following statements (perhaps from a text classification program):

    UnigramTextClassifier ut1 = UnigramTextClassifier();
    UnigramTextClassifier ut2 = ut1;

In many languages, ut1 and ut2 would refer to the same object. C++, on the other hand, has "copy-by-value" semantics, so ut1 and ut2 are different objects. Similarly, consider a function such as:

  void test(UnigramTextClassifier ut) { } 

The test function would receive a copy of the object passed to it ("pass-by-value").

Although this is sometimes what we want, often it is not, and so we need pointers instead (and, as we'll see, references).

Pointers to objects (including the built-in types, such as ints and arrays) are declared using a "*". So, our code block above becomes:

    UnigramTextClassifier  ut = UnigramTextClassifier();
    UnigramTextClassifier* put = &ut;

Now, put is a pointer to ut, and, in some sense they "point to the same thing." To change the value of what put points to, we use *put= ....

The & means "address of".
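
The difference between a copy and a pointer can be sketched as follows. The Labeled struct and the function copyThenPointer are hypothetical, invented only to keep the example short and self-contained.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a classifier object; only the label matters here.
struct Labeled
{
  std::string label;
};

// Modifies a copy, then modifies the original through a pointer,
// and returns the original's label at the end.
std::string copyThenPointer()
{
  Labeled original;
  original.label = "english";

  Labeled copy = original;              // copy-by-value: a separate object
  copy.label = "french";
  assert(original.label == "english");  // the original is untouched

  Labeled* p = &original;               // p holds the address of original
  (*p).label = "german";                // writes through the pointer
  return original.label;                // now "german"
}
```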

The syntax for accessing methods and fields differs for 'regular' variables and 'pointer' variables. For regular variables, one appends a "dot" followed by the field or method name. For pointers, one uses "->" (which is just a bit of syntactic sugar; x->y is the same as (*x).y). For example:

    UnigramTextClassifier  ut = UnigramTextClassifier();
    UnigramTextClassifier* put = &ut;
    cout << ut.classification() << "; " << put->classification() << endl;

Sigh, it's all very confusing, and you need to pay careful attention. Of course, the compiler will often catch errors for you.

C++ also allows us to declare reference variables, using & in the declaration. This is especially useful in writing functions and methods, as it allows us to pass objects "by reference" instead of "by value." Again, a short example:

    void test(UnigramTextClassifier& ut, UnigramTextClassifier* put)
    {
      cout << ut.classification() << "; " << put->classification() << endl;
    }

Note that variables passed by reference use the 'dotted' syntax for access. The classic "swap" routine is:

    void swap(int& i, int& j) { int temp=i; i=j; j=temp; }
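
The three ways of passing an argument can be compared side by side. This is a small sketch; the function names are made up for illustration.

```cpp
#include <cassert>

void incrementCopy(int n) { ++n; }     // by value: changes only the local copy
void incrementRef(int& n) { ++n; }     // by reference: changes the caller's variable
void incrementPtr(int* n) { ++(*n); }  // by pointer: changes what n points to

// Calls each function once on the same variable and returns the result.
int countAfterCalls()
{
  int count = 0;
  incrementCopy(count);   // count is still 0
  assert(count == 0);
  incrementRef(count);    // count is now 1
  incrementPtr(&count);   // count is now 2
  return count;
}
```

Note that the call sites for the by-value and by-reference versions look identical; only the function's declaration tells you which one you are getting.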

Misty water-colored memory management

Memory that is dynamically allocated in C++ is not automatically "garbage-collected," and the programmer must carefully manage deleting dynamically allocated objects. Objects are dynamically created using the new keyword, which returns a pointer to the new object, and deleted using the delete keyword. There are (at least) three strategies for dealing with memory management. The first strategy is to carefully pair every call to new with a call to delete, while ensuring that any additional pointers to the deleted object are set to "null" (0). For example:

    void test()
    { 
      UnigramTextClassifier* put1 = new UnigramTextClassifier();
      UnigramTextClassifier* put2 = put1;
      /* do some stuff */
      cout << put1->classification();
      /* do some stuff */
      delete put1;
      put2 = 0;
    }

The second strategy is to use a programming idiom called "resource acquisition is initialization" (RAII). C++ allows one to define destructors as well as constructors, and these can be written so that memory is returned at the end of a code block, even in the face of exceptions. The details are beyond the scope of this tutorial, but a Google search for "resource acquisition is initialization" will take you further. (See, for example, Stroustrup's answer to the question "Why doesn't C++ provide a 'finally' construct?")
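
Without going into detail, the idiom can be sketched roughly like this. Tracked, TrackedOwner, useAndThrow, and liveAfterException are all hypothetical names invented for this example; Tracked just counts how many of its instances are alive so that we can check nothing leaks.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical resource that counts its live instances.
struct Tracked
{
  static int live;
  Tracked()  { ++live; }
  ~Tracked() { --live; }
};
int Tracked::live = 0;

// Minimal owner: acquires in the constructor, releases in the destructor.
class TrackedOwner
{
public:
  TrackedOwner() : my_ptr(new Tracked()) {}
  ~TrackedOwner() { delete my_ptr; }
private:
  // Declared but never defined, so the owner cannot be copied
  // (a copy would lead to a double delete).
  TrackedOwner(const TrackedOwner&);
  TrackedOwner& operator=(const TrackedOwner&);
  Tracked* my_ptr;
};

void useAndThrow()
{
  TrackedOwner owner;           // new happens here
  assert(Tracked::live == 1);
  throw std::runtime_error("something went wrong");
  // No explicit delete: owner's destructor runs as the exception unwinds.
}

// Returns the number of Tracked objects still alive after the exception.
int liveAfterException()
{
  try { useAndThrow(); }
  catch (const std::runtime_error&) { }
  return Tracked::live;
}
```

In C++11 and later, the standard library's std::unique_ptr plays the role of a hand-written owner class like this one.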

The third strategy is to use a language that provides automatic garbage collection (including some implementations of C++). (I'm just being snarky here).