When playing around with multithreading, I could observe some unexpected but serious performance issues related to AtomicLong (and classes using it, such as java.util.Random), for which I currently have no explanation. However, I created a minimalistic example, which basically consists of two classes: a class "Container", which keeps a reference to a volatile variable, and a class "DemoThread", which operates on an instance of "Container" during thread execution. Note that the references to "Container" and the volatile long are private, and never shared between threads (I know that there's no need to use volatile here, it's just for demonstration purposes) - thus, multiple instances of "DemoThread" should run perfectly parallel on a multiprocessor machine, but for some reason, they do not (Complete example is at the bottom of this post).
private static class Container {
private volatile long value;
public long getValue() {
return value;
}
public final void set(long newValue) {
value = newValue;
}
}
private static class DemoThread extends Thread {
private Container variable;
public void prepare() {
this.variable = new Container();
}
public void run() {
for(int j = 0; j < 10000000; j++) {
variable.set(variable.getValue() + System.nanoTime());
}
}
}
During my test, I repeatedly create 4 DemoThreads, which are then started and joined. The only difference in each loop is the time when "prepare()" gets called (which is obviously required for the thread to run, as it otherwise would result in a NullPointerException):
DemoThread[] threads = new DemoThread[numberOfThreads];
for(int j = 0; j < 100; j++) {
boolean prepareAfterConstructor = j % 2 == 0;
for(int i = 0; i < threads.length; i++) {
threads[i] = new DemoThread();
if(prepareAfterConstructor) threads[i].prepare();
}
for(int i = 0; i < threads.length; i++) {
if(!prepareAfterConstructor) threads[i].prepare();
threads[i].start();
}
joinThreads(threads);
}
For some reason, if prepare() is executed immediately before starting the thread, it will take twice as more time to finish, and even without the "volatile" keyword, the performance differences were significant, at least on two of the machines and OS'es I tested the code. Here's a short summary:
Mac OS Summary:
Java Version: 1.6.0_24
Java Class Version: 50.0
VM Vendor: Sun Microsystems Inc.
VM Version: 19.1-b02-334
VM Name: Java HotSpot(TM) 64-Bit Server VM
OS Name: Mac OS X
OS Arch: x86_64
OS Version: 10.6.5
Processors/Cores: 8
With volatile keyword:
Final results:
31979 ms. when prepare() was called after instantiation.
96482 ms. when prepare() was called before execution.
Without volatile keyword:
Final results:
26009 ms. when prepare() was called after instantiation.
35196 ms. when prepare() was called before execution.
Windows Summary:
Java Version: 1.6.0_24
Java Class Version: 50.0
VM Vendor: Sun Microsystems Inc.
VM Version: 19.1-b02
VM Name: Java HotSpot(TM) 64-Bit Server VM
OS Name: Windows 7
OS Arch: amd64
OS Version: 6.1
Processors/Cores: 4
With volatile keyword:
Final results:
18120 ms. when prepare() was called after instantiation.
36089 ms. when prepare() was called before execution.
Without volatile keyword:
Final results:
10115 ms. when prepare() was called after instantiation.
10039 ms. when prepare() was called before execution.
Linux Summary:
Java Version: 1.6.0_20
Java Class Version: 50.0
VM Vendor: Sun Microsystems Inc.
VM Version: 19.0-b09
VM Name: OpenJDK 64-Bit Server VM
OS Name: Linux
OS Arch: amd64
OS Version: 2.6.32-28-generic
Processors/Cores: 4
With volatile keyword:
Final results:
45848 ms. when prepare() was called after instantiation.
110754 ms. when prepare() was called before execution.
Without volatile keyword:
Final results:
37862 ms. when prepare() was called after instantiation.
39357 ms. when prepare() was called before execution.
Mac OS Details (volatile):
Test 1, 4 threads, setting variable in creation loop
Thread-2 completed after 653 ms.
Thread-3 completed after 653 ms.
Thread-4 completed after 653 ms.
Thread-5 completed after 653 ms.
Overall time: 654 ms.
Test 2, 4 threads, setting variable in start loop
Thread-7 completed after 1588 ms.
Thread-6 completed after 1589 ms.
Thread-8 completed after 1593 ms.
Thread-9 completed after 1593 ms.
Overall time: 1594 ms.
Test 3, 4 threads, setting variable in creation loop
Thread-10 completed after 648 ms.
Thread-12 completed after 648 ms.
Thread-13 completed after 648 ms.
Thread-11 completed after 648 ms.
Overall time: 648 ms.
Test 4, 4 threads, setting variable in start loop
Thread-17 completed after 1353 ms.
Thread-16 completed after 1957 ms.
Thread-14 completed after 2170 ms.
Thread-15 completed after 2169 ms.
Overall time: 2172 ms.
(and so on, sometimes one or two of the threads in the 'slow' loop finish as expected, but most times they don't).
The given example looks theoretically, as it is of no use, and 'volatile' is not needed here - however, if you'd use a 'java.util.Random'-Instance instead of the 'Container'-Class and call, for instance, nextInt() multiple times, the same effects will occur: The thread will be executed fast if you create the object in the Thread's constructor, but slow if you create it within the run()-method. I believe that the performance issues described in Java Random Slowdowns on Mac OS more than a year ago are related to this effect, but I have no idea why it is as it is - besides that I'm sure that it shouldn't be like that, as it would mean that it's always dangerous to create a new object within the run-method of a thread, unless you know that no volatile variables will get involved within the object graph. Profiling doesn't help, as the problem disappears in this case (same observation as in Java Random Slowdowns on Mac OS cont'd), and it also does not happen on a single-core-PC - so I'd guess that it's kind of a thread synchronization problem... however, the strange thing is that there's actually nothing to synchronize, as all variables are thread-local.
Really looking forward for any hints - and just in case you want to confirm or falsify the problem, see the test case below.
Thanks,
Stephan
public class UnexpectedPerformanceIssue {
private static class Container {
// Remove the volatile keyword, and the problem disappears (on windows)
// or gets smaller (on mac os)
private volatile long value;
public long getValue() {
return value;
}
public final void set(long newValue) {
value = newValue;
}
}
private static class DemoThread extends Thread {
private Container variable;
public void prepare() {
this.variable = new Container();
}
@Override
public void run() {
long start = System.nanoTime();
for(int j = 0; j < 10000000; j++) {
variable.set(variable.getValue() + System.nanoTime());
}
long end = System.nanoTime();
System.out.println(this.getName() + " completed after "
+ ((end - start)/1000000) + " ms.");
}
}
public static void main(String[] args) {
System.out.println("Java Version: " + System.getProperty("java.version"));
System.out.println("Java Class Version: " + System.getProperty("java.class.version"));
System.out.println("VM Vendor: " + System.getProperty("java.vm.specification.vendor"));
System.out.println("VM Version: " + System.getProperty("java.vm.version"));
System.out.println("VM Name: " + System.getProperty("java.vm.name"));
System.out.println("OS Name: " + System.getProperty("os.name"));
System.out.println("OS Arch: " + System.getProperty("os.arch"));
System.out.println("OS Version: " + System.getProperty("os.version"));
System.out.println("Processors/Cores: " + Runtime.getRuntime().availableProcessors());
System.out.println();
int numberOfThreads = 4;
System.out.println("\nReference Test (single thread):");
DemoThread t = new DemoThread();
t.prepare();
t.run();
DemoThread[] threads = new DemoThread[numberOfThreads];
long createTime = 0, startTime = 0;
for(int j = 0; j < 100; j++) {
boolean prepareAfterConstructor = j % 2 == 0;
long overallStart = System.nanoTime();
if(prepareAfterConstructor) {
System.out.println("\nTest " + (j+1) + ", " + numberOfThreads + " threads, setting variable in creation loop");
} else {
System.out.println("\nTest " + (j+1) + ", " + numberOfThreads + " threads, setting variable in start loop");
}
for(int i = 0; i < threads.length; i++) {
threads[i] = new DemoThread();
// Either call DemoThread.prepare() here (in odd loops)...
if(prepareAfterConstructor) threads[i].prepare();
}
for(int i = 0; i < threads.length; i++) {
// or here (in even loops). Should make no difference, but does!
if(!prepareAfterConstructor) threads[i].prepare();
threads[i].start();
}
joinThreads(threads);
long overallEnd = System.nanoTime();
long overallTime = (overallEnd - overallStart);
if(prepareAfterConstructor) {
createTime += overallTime;
} else {
startTime += overallTime;
}
System.out.println("Overall time: " + (overallTime)/1000000 + " ms.");
}
System.out.println("Final results:");
System.out.println(createTime/1000000 + " ms. when prepare() was called after instantiation.");
System.out.println(startTime/1000000 + " ms. when prepare() was called before execution.");
}
private static void joinThreads(Thread[] threads) {
for(int i = 0; i < threads.length; i++) {
try {
threads[i].join();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
}