FeatureUSENIX

 

java performance

mccluskey, glen

by Glen McCluskey
<glenm@glenmccl.com>

Glen McCluskey is a consultant with 15 years of experience and has focused on programming languages since 1988. He specializes in Java and C++ performance, testing, and technical documentation areas.

Optimizing String Processing

In the last column I started discussing some of the characteristics of Java performance. One of the issues mentioned was a method call overhead. In this article, I'll look at this a little more in the context of Java string processing and discuss how the immutability of strings affects performance.

Indexing Characters in a String

Suppose you have a need to find a character in a string, that is, return a position >= 0 of where a character occurs, or -1 if not found. In a language like C, with its pointers and low-level approach to programming, it's often just as efficient or more so to code a loop for finding a character directly instead of using a library function like strchr(). Especially with short strings, avoiding the function call overhead is often worth a lot.

Is this true of Java? Suppose we code up a similar example:

public class index {
 public static void main(String args[])
 {
  String s = "aaaaaaaaaa";
  int i = 10000000;
  int n = 0;

  // method #1

  if (args.length > 0 &&
    args[0].compareTo("index") == 0) {
   while (i-- > 0)
    n = s.indexOf('x');
  }

  // method #2

  else {
   while (i-- > 0) {
    int len = s.length();
    n = -1;
    for (int j = 0; j < len; j++) {
     if (s.charAt(j) == 'x') {
      n = j;
      break;
     }
    }
   }
  }
 }
}

When we run this code, we find that method #1 is about three times as fast as method #2, which is the equivalent of using pointers in C. Why is this? There are a couple of reasons. One is that charAt() is a method call, and not merely a quick peek that retrieves a character at a given position.

The other reason for the slower performance is that charAt() checks the index to ensure that it's within bounds. So even though we're iterating over the characters of a string in a safe way (0 to s.length()-1), the checks are done anyway.

Because of the method call overhead and the index checking, indexOf() wins easily. It avoids the call overhead, and the index checking is not done, but instead characters are accessed directly from the internal char[] vector that underlies a string object.

It's probably a little early to say how an area like this one will shake out. charAt() could conceivably be expanded as an inline. The subscript checking is likely to be left in place because it's part of what Java guarantees to the programmer. And compilation to native code might change the characteristics of a string operation of this type.

String Immutability

Another aspect of string processing performance involves the immutability of strings. Suppose you want to create and print a list of numbers, like so:

public class immut1 {
 public static void main(String args[])
 {
  String s = "[";

  for (int i = 1; i <= 5000; i++) {
   if (i > 1)
   s += ",";
   s += Integer.toString(i, 10);
  }

  s += "]";
  System.out.println(s);
  }
}

This works okay, but seems kind of slow. When we profile the program, we find that it seems to be doing lots of data shuffling and so forth, with the garbage collector called repeatedly.

It turns out that using += on strings is quite expensive, and the reason is that strings themselves are immutable, that is, are not changed after creation. To append to a string, you must copy out the string to a StringBuffer, append to it, and then convert it back. StringBuffers are used for doing operations on strings, like + and +=.

This idea can be illustrated by the following example:

public class immut2 {
 public static void main(String args[])
 {
  String s = "aaa"; // sequence #1
  s += "bbb";

  System.out.println(s);

  String ss = "ccc"; // sequence #2
  StringBuffer sb = new StringBuffer();
  sb.append(ss);
  sb.append("ddd");
  String ss_save = ss;
  ss = sb.toString();

  System.out.println(ss_save);
  System.out.println(ss);
 }
}

These two sequences are equivalent; using += causes processing similar to that shown in sequence #2 (except for the ss_save line).

Note that we captured the old value of ss before ss was changed to point to a new string. The old string didn't change when we reassigned ss; we just changed a reference that pointed at it to point at a new string.

Going back to the original example, we can rewrite it as:

public class immut3 {
 public static void main(String args[])
 {
  StringBuffer sb = new StringBuffer();

  sb.append("[");

  for (int i = 1; i <= 5000; i++) {
   if (i > 1)
    sb.append(",");
   sb.append(Integer.toString(i, 10));
  }

  sb.append("]");

  String s = sb.toString();

  System.out.println(s);
 }
}

resulting in a large increase in performance. A Java compiler is allowed to perform certain kinds of optimization on string concatenation operations, but keep in mind that string objects themselves do not change after creation (you can use StringBuffer for mutable strings).

Strings in Java are quite useful and powerful, and it's helpful to know some of their performance characteristics and bottlenecks.

 

?Need help? Use our Contacts page.
First posted: 17th September 1998 efc
Last changed: 17th September 1998 efc
Issue index
;login: index
USENIX home