Sunday, February 16, 2014

Workaround the sun.nio.cs.FastCharsetProvider bottleneck

A CharsetProvider is the class that, given the String name of a charset, returns the corresponding Charset object.
It is typically used when you call a new String(bytes, "UTF-8") or a Charset.forName("UTF-8").
A closer look at the Charset class shows that it actually uses two levels of caches:

private static Charset lookup2(String charsetName) {
        Object[] a;
        if ((a = cache2) != null && charsetName.equals(a[0])) {
            cache2 = cache1;
            cache1 = a;
            return (Charset)a[1];
        }

        Charset cs;
        if ((cs = standardProvider.charsetForName(charsetName)) != null ||
            (cs = lookupExtendedCharset(charsetName))           != null ||
            (cs = lookupViaProviders(charsetName))              != null)
        {
            cache(charsetName, cs);
            return cs;
        }

        /* Only need to check the name if we didn't find a charset for it */
        checkName(charsetName);
        return null;
    }


If it cannot find your charset in the cache, it falls back to the standardProvider, a sun.nio.cs.StandardCharsets instance that extends sun.nio.cs.FastCharsetProvider, whose implementation is synchronized, as you can see:

public final Charset charsetForName(String charsetName) {
        synchronized (this) {
            return lookup(canonicalize(charsetName));
        }
    }

So if you are unlucky and use more than two different encodings, you will hit this synchronized block and create a contention point in your application, as others have reported here and also in this Java ticket.
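To see the effect, here is a minimal sketch (not a rigorous benchmark; the class name, thread count, and iteration counts are arbitrary) that decodes with three different charset names from several threads, so every call misses the two-entry cache and lands in the synchronized provider:

```java
import java.io.UnsupportedEncodingException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CharsetContention {

    public static void main(String[] args) throws Exception {
        final byte[] bytes = "hello".getBytes("UTF-8");
        // Three names rotate through the two-entry cache, so every
        // new String(bytes, name) falls through to the synchronized provider
        final String[] names = { "UTF-8", "ISO-8859-1", "US-ASCII" };
        ExecutorService pool = Executors.newFixedThreadPool(4);
        long start = System.nanoTime();
        for (int t = 0; t < 4; t++) {
            pool.submit(new Runnable() {
                @Override
                public void run() {
                    try {
                        for (int i = 0; i < 100000; i++) {
                            for (String name : names) {
                                new String(bytes, name);
                            }
                        }
                    } catch (UnsupportedEncodingException e) {
                        throw new RuntimeException(e);
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println("decoded in " + (System.nanoTime() - start) / 1000000 + " ms");
    }
}
```

Profile this under a sampler and the threads will show up blocked on FastCharsetProvider.charsetForName.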

To prevent this issue, we can pass a Charset object directly instead of a String name, which has been possible since Java 1.6. But given all the libraries you depend on, you will have a hard time patching every one of them, as mentioned in this very good post.
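Concretely, instead of passing the charset name on every call, resolve the Charset once and reuse the object (the class and field names below are just for illustration):

```java
import java.nio.charset.Charset;

public class CachedCharset {

    // Resolve the Charset once; the Charset-based overloads of
    // String.getBytes and new String (added in Java 6) skip the
    // provider lookup entirely
    private static final Charset UTF_8 = Charset.forName("UTF-8");

    public static void main(String[] args) {
        byte[] bytes = "caf\u00e9".getBytes(UTF_8);
        System.out.println(new String(bytes, UTF_8)); // prints "café"
    }
}
```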

Or we could patch Java at the source, then use whatever version of the libraries and of Java we want, and apply the patch to old systems as well.


package sandbox;

import java.lang.reflect.Field;
import java.nio.charset.Charset;
import java.nio.charset.spi.CharsetProvider;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.cliffc.high_scale_lib.NonBlockingHashMap;

import com.google.common.collect.ImmutableMap;

/**
 * NonBlockingCharsetProvider to workaround the contention point on
 * {@link CharsetProvider#charsetForName(String)}
 * 
 * @author Leo Lewis
 * @see java.nio.charset.spi.CharsetProvider
 * @see java.nio.charset.Charset
 */
public class NonBlockingCharsetProvider extends CharsetProvider {

    private CharsetProvider parent;

    private boolean lazyInit;

    private Map<String, Charset> cache;

    /**
     * @param parent
     *            parent charset provider
     * @param lazyInit
     *            if true, populate the cache lazily as the application
     *            requests charsets, backed by a ConcurrentMap since it may be
     *            modified and iterated concurrently; otherwise populate it
     *            from the parent in the constructor, backed by a Guava
     *            ImmutableMap
     */
    public NonBlockingCharsetProvider(final CharsetProvider parent, final boolean lazyInit) {
        this.parent = parent;
        this.lazyInit = lazyInit;
        if (!lazyInit) {
            Map<String, Charset> tmp = new HashMap<>();
            Iterator<Charset> it = parent.charsets();
            while (it.hasNext()) {
                Charset charset = it.next();
                tmp.put(charset.name(), charset);
            }
            cache = ImmutableMap.copyOf(tmp);
        } else {
            cache = new NonBlockingHashMap<>();
        }
    }

    @Override
    public Charset charsetForName(final String name) {
        // if not lazyInit, the value should already be in the cache
        if (lazyInit && !cache.containsKey(name)) {
            // no lock here, so we might call the parent several times and put
            // the entry into the cache more than once; it doesn't matter, as
            // the cache will eventually be populated and we won't have to
            // call the parent anymore
            Charset charset = parent.charsetForName(name);
            cache.put(name, charset);
        }
        return cache.get(name);
    }

    @Override
    public Iterator<Charset> charsets() {
        if (lazyInit) {
            return parent.charsets();
        }
        return cache.values().iterator();
    }

    /**
     * Saved so that we can restore the default provider, or set up ours
     * several times
     */
    private static CharsetProvider standardProvider;

    /**
     * Replace the CharsetProvider inside the Charset class by an instance of
     * this {@link NonBlockingCharsetProvider}
     * 
     * @param lazyInit
     *            see
     *            {@link NonBlockingCharsetProvider#NonBlockingCharsetProvider(CharsetProvider, boolean)}
     */
    public static void setUp(boolean lazyInit) throws Exception {
        Field field = Charset.class.getDeclaredField("standardProvider");
        field.setAccessible(true);
        if (standardProvider == null) {
            standardProvider = (CharsetProvider) field.get(null);
        }
        NonBlockingCharsetProvider nonBlocking = new NonBlockingCharsetProvider(standardProvider,
                lazyInit);
        field.set(null, nonBlocking);
    }

    /**
     * Restore the default java provider
     * 
     * @throws Exception
     */
    public static void uninstall() throws Exception {
        if (standardProvider != null) {
            Field field = Charset.class.getDeclaredField("standardProvider");
            field.setAccessible(true);
            field.set(null, standardProvider);
        }
    }
}

Call NonBlockingCharsetProvider.setUp() to replace the Java provider, via reflection, with this non-blocking one.
It provides two modes: a lazy one that fetches the value from the parent when needed and puts it into a concurrent non-blocking hash map (which scales better than the standard ConcurrentHashMap), and a non-lazy one that fetches all the parent's values at initialization and serves them from a thread-safe Guava ImmutableMap. Performance is pretty close in both modes; the difference is whether you want to duplicate every charset supported by the JRE into the cache, or only the ones your application actually uses.
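If patching the JDK via reflection feels too invasive, the same lazy-cache idea can also live at the application level. Here is a sketch (class name is mine, and it uses the JDK's own ConcurrentHashMap rather than the NonBlockingHashMap dependency) of a memoizing wrapper around Charset.forName:

```java
import java.nio.charset.Charset;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class CharsetCache {

    private static final ConcurrentMap<String, Charset> CACHE =
            new ConcurrentHashMap<String, Charset>();

    // Replacement for Charset.forName(String): after the first lookup of
    // a given name, we never touch the synchronized provider again
    public static Charset forName(String name) {
        Charset cs = CACHE.get(name);
        if (cs == null) {
            // benign race: two threads may both resolve the charset,
            // putIfAbsent keeps a single winner
            cs = Charset.forName(name);
            CACHE.putIfAbsent(name, cs);
        }
        return cs;
    }

    public static void main(String[] args) {
        System.out.println(CharsetCache.forName("UTF-8").name()); // prints "UTF-8"
    }
}
```

Unlike the provider patch, though, this only helps code you control, which is exactly the limitation the reflection approach works around.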

Et voila!


The source code is on GitHub
The benchmark source as well

1 comment:

  1. There are problems in both the eager and lazy implementations. The eager implementation is case sensitive. The following test fails:

    @Test
    public void testEager() throws Exception {
        NonBlockingCharsetProvider.setUp(false);
        assertNotNull(Charset.forName("utf-8"));
        assertNotNull(Charset.availableCharsets());
    }

    So then I tried the lazy implementation. This test of the lazy implementation (run in a clean static context, of course) fails with a null pointer exception. I didn't look into why. Both tests pass without using the NonBlockingCharsetProvider.

    @Test
    public void testLazy() throws Exception {
        NonBlockingCharsetProvider.setUp(true);

        final Map<String, Charset> charsets = Charset.availableCharsets();
        assertNotNull(charsets);
        for (Map.Entry<String, Charset> entry : charsets.entrySet()) {
            final Charset c = entry.getValue();
            final String name = entry.getKey();
            System.err.println("charset " + name);
            assertNotNull(Charset.forName(name));
            for (String alias : c.aliases()) {
                System.err.println("... alias " + alias);
                assertNotNull(Charset.forName(alias));
            }
        }
    }
