Java hidden gem: CopyOnWriteArrayList

CopyOnWriteArrayList is a cousin of the well-known ArrayList class.

ArrayList is often used for storing items. In my case, I had been working on a multi-threaded program that shared a common ArrayList.

To improve performance, every now and then I wanted to remove some of the items from this list when they matched some criteria. In the past I would use an Iterator, stepping through the items with iterator.next().

To remove an item I'd just call iterator.remove(). However, this approach was failing for some odd reason:
java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification (AbstractList.java:372)
I tried adding synchronized to the relevant methods, but processing just got slower and the error did not go away.

So, what else can one try? Looking around the web I found the not-so-well-known CopyOnWriteArrayList, and to my surprise it solved the problem with a nice performance boost.
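To make the difference concrete, here is a minimal sketch. It is single-threaded for simplicity (my real program was multi-threaded), because the same exception can be triggered by changing a list while looping over it. The class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical demo class, not taken from the original program.
class SnapshotIterationDemo {

    // Removes every even number from the list while iterating over it.
    // Returns true if the loop finished, false if it failed with a
    // ConcurrentModificationException.
    static boolean removeEvensWhileIterating(List<Integer> list) {
        try {
            for (Integer value : list) {
                if (value % 2 == 0) {
                    list.remove(value); // structural change during iteration
                }
            }
            return true;
        } catch (ConcurrentModificationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        List<Integer> plain = new ArrayList<>(Arrays.asList(1, 2, 3, 4, 5, 6));
        List<Integer> cow = new CopyOnWriteArrayList<>(Arrays.asList(1, 2, 3, 4, 5, 6));

        // The ArrayList iterator detects the modification and fails.
        System.out.println("ArrayList finished: " + removeEvensWhileIterating(plain));
        // The CopyOnWriteArrayList iterator walks a snapshot, so the loop completes.
        System.out.println("CopyOnWriteArrayList finished: " + removeEvensWhileIterating(cow));
        System.out.println("Remaining items: " + cow);
    }
}
```

The trade-off is that every write copies the whole backing array, so this class suits lists that are read far more often than they are modified.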

It works in the same manner as a typical ArrayList, but every write copies the underlying array, so iterators read from a snapshot and never throw ConcurrentModificationException. Its iterators do not support remove(), though. To remove items I use a second, decoupled ArrayList and place the items to remove there. Then an independent status thread runs in a loop at three-second intervals to check whether this second list has any items, removing them from the main list asynchronously.
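The deferred-removal setup described above could look roughly like the sketch below. This is not my actual code: the class name, the method names and the use of a ScheduledExecutorService for the status thread are assumptions made for illustration. The pending list is also a CopyOnWriteArrayList here, so that several worker threads can add to it safely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper class illustrating the pattern; all names are invented.
class DeferredRemovalList<T> {
    private final List<T> items = new CopyOnWriteArrayList<>();
    private final List<T> pendingRemoval = new CopyOnWriteArrayList<>();

    void add(T item) {
        items.add(item);
    }

    // Worker threads call this instead of removing from the main list directly.
    void markForRemoval(T item) {
        pendingRemoval.add(item);
    }

    // Removes everything queued so far from the main list.
    // Returns the number of queued entries that were processed.
    int flush() {
        List<T> batch = new ArrayList<>(pendingRemoval); // copy of the current queue
        if (batch.isEmpty()) {
            return 0;
        }
        items.removeAll(batch);          // one bulk write on the main list
        pendingRemoval.removeAll(batch); // clear the processed entries
        return batch.size();
    }

    // Copy of the current contents, safe to hand to callers.
    List<T> snapshot() {
        return new ArrayList<>(items);
    }

    // Starts a background thread that calls flush() at a fixed interval.
    // The caller is responsible for calling shutdown() on the returned executor.
    ScheduledExecutorService startCleaner(long periodSeconds) {
        ScheduledExecutorService cleaner = Executors.newSingleThreadScheduledExecutor();
        cleaner.scheduleAtFixedRate(this::flush, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return cleaner;
    }

    public static void main(String[] args) {
        DeferredRemovalList<String> list = new DeferredRemovalList<>();
        list.add("a");
        list.add("b");
        list.add("c");
        list.markForRemoval("b");
        list.flush(); // in the real setup the cleaner thread would do this
        System.out.println(list.snapshot());
    }
}
```

Calling startCleaner(3) would give the three-second interval mentioned above. Batching the removals means the main list is copied once per flush rather than once per removed item.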

All in all, running the code in multi-threaded mode and adopting CopyOnWriteArrayList reduced the overall processing time for 17 million lines of data from 30 minutes to around 10 minutes, an average of 30k lines/second. The text database used as an example is 12.3 GB in size and contains 2.5 billion snippets that are compared against the 164 methods of my test sample.

This translates to roughly 41 billion comparisons taking place in 10 minutes.

For reference, when my computer is just reading the lines without any processing, it reaches an average speed of 140k lines/second. This value reveals the upper I/O limit set by disk bandwidth. The speed of 30k lines/second is (probably) due to CPU limitations (an i7 core) when doing similarity comparisons between strings.


The performance is not bad, but at this point I'm running out of ideas on how to bring the processing time down further. The bottleneck is still the comparison algorithm. I've already written a cheaper, quick-and-dirty version of the Levenshtein algorithm for faster comparisons, but it is still not enough.


Any ideas?


EDIT

After some more time looking into performance, I noticed that the comparison of two strings was being made using String objects, with redundant conversions back and forth between char[] and String. The code was modified to run using only char[] arrays. Speed doubled and now averages 60k lines/second, taking 5 minutes to complete the same processing because less stress is placed on the CPU.
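To illustrate what comparing directly on char[] looks like, here is a sketch of the standard two-row Levenshtein distance. It is the textbook algorithm, not my cheaper variant, and the class name is invented for this example. The point is that the inner loop only touches primitive arrays and never creates String objects.

```java
// Hypothetical example class; textbook Levenshtein, not the cheaper variant
// mentioned in the post.
class CharArrayLevenshtein {

    // Returns the edit distance (insertions, deletions, substitutions)
    // between two character arrays, using two rows of the DP table.
    static int distance(char[] a, char[] b) {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length; i++) {
            curr[0] = i; // distance from a[0..i) to the empty prefix of b
            for (int j = 1; j <= b.length; j++) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Reuse the two rows instead of allocating a new one per iteration.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length];
    }

    public static void main(String[] args) {
        char[] left = "kitten".toCharArray();
        char[] right = "sitting".toCharArray();
        System.out.println(distance(left, right));
    }
}
```

In a hot loop, the char[] for each line would be produced once and reused for every comparison, instead of converting for each pair.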