Rutgers performance issues
During Fall 2007, Rutgers migrated from WebCT to Sakai. This and other events greatly increased the load on our server. During that fall and Spring 2008, we saw a number of serious performance issues, which seem worth documenting.
Long garbage collection times
We have regularly seen minor garbage collections take on the order of a minute. Since this appeared to be a garbage collection issue, we spent lots of time adjusting JVM parameters. This never fixed it. We now believe that the problem was caused by bugs in dbcp and pool, i.e. in the code that manages connection pools to the database.
It's not entirely clear how this resulted in long garbage collections, but we're pretty sure it did. See below for our recommendations on database tuning. Do that before starting to tune your JVM.
We tried c3p0 as a replacement connection pool and it didn't really fix the problem.
JVM setup
We currently use a single JVM on a machine with 4 cores and 16 GB of memory. The biggest tuning challenge was the 2.3 version of Chat, which could generate as much as 200 MB of garbage per second. Almost all of it dies immediately after creation. The only way to survive this is to use a large New space. Under 2.5 this is no longer an issue, but we're retaining the same tuning.
-d64 -Xmx13000m -Xms13000m -Xmn3g
That uses 13 GB of memory, which is the largest amount that appears safe on a 16 GB machine. Java will sometimes use more than it is told, and we also need space for monitoring and utility processes. It configures 3 GB in New.
Java has several garbage collectors. I recommend the "concurrent low-pause" GC, which is
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC
We tried the incremental version. It seemed to produce more long pauses. I believe this may be specific to our configuration, but that's not clear.
Some miscellaneous configuration options follow. You really need to read the documentation to understand them, but they reduce wasted space with large memory configurations:
-XX:SurvivorRatio=8 -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=15
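As a rough illustration of what these values mean together with -Xmn3g, the following sketch works out how New is split into eden and the two survivor spaces. The class and method names are made up for this example. The arithmetic assumes the usual HotSpot meaning of SurvivorRatio (the ratio of eden to one survivor space) and of TargetSurvivorRatio (the percentage of a survivor space the JVM aims to have occupied after a minor collection). The JVM rounds the real sizes for alignment, so treat the results as approximate.

```java
public class YoungGenLayout {
    // New = eden + 2 survivor spaces, and eden = survivorRatio * survivor,
    // so one survivor space is New / (survivorRatio + 2).
    static long survivorBytes(long newBytes, int survivorRatio) {
        return newBytes / (survivorRatio + 2);
    }

    // Eden is whatever is left of New after the two survivor spaces.
    static long edenBytes(long newBytes, int survivorRatio) {
        return newBytes - 2 * survivorBytes(newBytes, survivorRatio);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long newBytes = 3L * 1024L * mb;   // -Xmn3g
        int survivorRatio = 8;             // -XX:SurvivorRatio=8
        int targetSurvivorPercent = 90;    // -XX:TargetSurvivorRatio=90

        long survivor = survivorBytes(newBytes, survivorRatio);
        long eden = edenBytes(newBytes, survivorRatio);
        long target = survivor * targetSurvivorPercent / 100;

        System.out.println("eden (MB):              " + eden / mb);
        System.out.println("each survivor (MB):     " + survivor / mb);
        System.out.println("survivor target (MB):   " + target / mb);
    }
}
```

With our settings this comes to about 2.4 GB of eden and two survivor spaces of a little over 300 MB each. With the target at 90%, most of each survivor space is usable before objects are promoted early.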
Sakai needs more permanent memory than default. Here's what we use:
-XX:MaxPermSize=512m -XX:PermSize=64m
Once you know how large it is going to grow, you might be better with PermSize initially set to the size needed. That will prevent a few high-overhead garbage collections.
Finally, some hacks
-XX:+UseMembar -XX:-UseThreadPriorities
UseMembar is a workaround for a bug that is supposedly fixed in 1.5.0_14. See http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6546278 for further information. Disabling thread priorities causes Java to use the OS thread priorities. It works around a bug that is present only on Solaris, I believe. See http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6518490 for further information. This will be the default in Java 6.
Other issues that can affect garbage collection
In addition to tuning, we found a few cases where local code used Runtime.exec to run an external program. This turns out to be a really, really bad idea. It causes the entire JVM to be duplicated. In theory this shouldn't matter, as the new copy should go away almost immediately, but we saw situations where the system was slowed for 30 min or so. This behavior was present in our WebCT conversion tool and the Respondus import tool. We're no longer using the WebCT converter, and I changed the Respondus tool so that it no longer calls exec.
We had updated inactiveInterval@org.sakaiproject.tool.api.SessionManager in sakai.properties to 8 hours. We prefer to use long idle timeouts, in order to avoid users losing work. However, the longer the timeout, the more user sessions are active at once, and each one holds memory. We ran out of memory in the JVM one day. I am reasonably certain that it was not a tuning problem or a bug: we just had too much data. Putting the idle timeout back to 2 hours seems to have fixed it.
Jgroups timeout
Jforum uses Jgroups to maintain a single distributed cache of data on postings. When we have a long garbage collection on one system, other systems can give up on it. Jgroups is supposed to recover from this, but does not. The result is typically sites where Jforum shows no postings even though postings are actually there. We seem to have fixed it by raising the timeout to be longer than the longest garbage collection time. You can see the times by looking for "Total time for which application threads were stopped" in catalina.out, assuming you use the recommended logging parameters. In jforum-cache-cluster.xml, we use
<FD timeout="10000" max_tries="12" shun="true" up_thread="true" down_thread="true"/>
That results in 12 retries, each 10 sec (10,000 millisec), i.e. 120 sec. That's long enough that it doesn't result in a timeout at Rutgers.
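To pick a timeout for your own site, you need the longest pause actually recorded. The following is a minimal sketch, not a tool we run: it scans a log file for the "Total time for which application threads were stopped" lines and compares the longest pause with the failure-detection window (timeout times max_tries). The class and method names are hypothetical, and it assumes the log line gives the pause as a decimal number followed by "seconds".

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LongestPause {
    // Assumed line format: "...threads were stopped: 0.0123456 seconds"
    private static final Pattern STOPPED = Pattern.compile(
        "Total time for which application threads were stopped: ([0-9]+\\.?[0-9]*) seconds");

    // Returns the pause in seconds, or 0.0 if the line is not a pause report.
    static double pauseSeconds(String line) {
        Matcher m = STOPPED.matcher(line);
        return m.find() ? Double.parseDouble(m.group(1)) : 0.0;
    }

    // Time before a silent member is given up on, from the FD element's
    // timeout (milliseconds) and max_tries attributes.
    static double fdWindowSeconds(long timeoutMillis, int maxTries) {
        return timeoutMillis * maxTries / 1000.0;
    }

    public static void main(String[] args) throws IOException {
        double longest = 0.0;
        // ISO-8859-1 accepts any byte, so stray characters in the log
        // cannot abort the scan. Lines are streamed, not loaded at once.
        try (BufferedReader in = Files.newBufferedReader(
                Paths.get(args[0]), StandardCharsets.ISO_8859_1)) {
            String line;
            while ((line = in.readLine()) != null) {
                longest = Math.max(longest, pauseSeconds(line));
            }
        }
        double window = fdWindowSeconds(10000, 12);  // values from our FD element
        System.out.println("longest pause (s): " + longest);
        System.out.println("FD window (s):     " + window);
        System.out.println(longest < window ? "window is long enough"
                                            : "raise timeout or max_tries");
    }
}
```

Run it with the path to catalina.out as the only argument. If the longest pause is close to the window, leave some margin when raising the values.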
Mysql jdbc connection
With the default parameters, there is a problem talking to Mysql: the connection pool code that Sakai uses does additional database operations for each query. It checks that the connection is valid and then resets any connection parameters that do not have their default value. Each of these operations generates a database query. What's worse, those queries are done under a global lock, so only one connection can proceed at a time. With a quad-processor system, you lose much of the benefit from 3 of your processors, and you generate lots of extra database queries.
I recommend the following parameters in sakai.properties:
hibernate.dialect=org.hibernate.dialect.MySQLDialect
vendor@org.sakaiproject.db.api.SqlService=mysql
driverClassName@javax.sql.BaseDataSource=com.mysql.jdbc.Driver
url@javax.sql.BaseDataSource=jdbc:mysql://mysqlserver.u.edu:3306/sakai?
useUnicode=true&characterEncoding=UTF-8&useServerPrepStmts=false&
cachePrepStmts=true&prepStmtCacheSize=4096&prepStmtCacheSqlLimit=4096&
elideSetAutoCommits=true&useLocalSessionState=true
username@javax.sql.BaseDataSource=xxx
password@javax.sql.BaseDataSource=xxx
testOnBorrow@javax.sql.BaseDataSource=false
validationQuery@javax.sql.BaseDataSource=
defaultTransactionIsolationString@javax.sql.BaseDataSource=TRANSACTION_READ_COMMITTED
initialSize@javax.sql.BaseDataSource=300
maxActive@javax.sql.BaseDataSource=300
maxIdle@javax.sql.BaseDataSource=300
minIdle@javax.sql.BaseDataSource=0
timeBetweenEvictionRunsMillis@javax.sql.BaseDataSource=-1
I have broken the URL line for readability. Of course it must be one line.
The JDBC parameters enable prepared statement caching. The last two parameters set up Mysql so that it will remember the current settings of auto-commit and transaction isolation. This gets rid of the need to do one of the two extra database queries. Setting testOnBorrow to false gets rid of the other one. Setting the time between eviction runs to -1 disables the evictor thread, which is known to be buggy.
This causes Sakai to open 300 database connections to Mysql. This may be unnecessary in the current configuration. If you do this, make sure my.cnf is set for lots of connections. We use
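Since the URL has to end up on one line, here is a small sketch that assembles it from the individual parameters, which can help avoid a stray line break or a missing "&" when editing. The class and method names are invented for this example; the host, database name and parameter values are the ones shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class JdbcUrlBuilder {
    // The connection parameters recommended above, in the same order.
    static Map<String, String> recommendedParams() {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("useUnicode", "true");
        p.put("characterEncoding", "UTF-8");
        p.put("useServerPrepStmts", "false");
        p.put("cachePrepStmts", "true");          // prepared statement caching
        p.put("prepStmtCacheSize", "4096");
        p.put("prepStmtCacheSqlLimit", "4096");
        p.put("elideSetAutoCommits", "true");     // skip redundant auto-commit calls
        p.put("useLocalSessionState", "true");    // track session state in the driver
        return p;
    }

    // Joins base URL and parameters into a single line: base?k1=v1&k2=v2...
    static String buildUrl(String base, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(e.getKey() + "=" + e.getValue());
        }
        return base + "?" + query;
    }

    public static void main(String[] args) {
        String url = buildUrl("jdbc:mysql://mysqlserver.u.edu:3306/sakai",
                              recommendedParams());
        // Paste the printed value after "url@javax.sql.BaseDataSource=".
        System.out.println(url);
    }
}
```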
max_connections = 1600
open_files_limit = 3000
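If you're going to disable testOnBorrow (which is a key part of our improvements), the pool will no longer notice connections that the server has closed for being idle, so you also need to effectively disable timeouts on the server side. In my.cnf, add the following (the value is one year, in seconds):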
wait_timeout = 31536000