(Test Systems) Java out-of-memory leads to zombie worker
(Originated on MM chat: https://chat.civicrm.org/civicrm/pl/otynrhfqeintdbwqpnjjakjdia. Note that it starts out with two different problems; one rc tarball problem is cleared up quickly. This issue is about the other problem. I'm taking the observations about it and trying to compose a full hypothesis of the problem.)
Suppose you have a job like https://test.civicrm.org/job/CiviCRM-Core-Matrix-PR/4551/BKPROF=dfl,SUITES=phpunit-crm,label=bknix-tmp/console -- the job is interrupted because one of the Java based agents (Jenkins master or Jenkins agent) runs out of memory.
Installing build4test_qpf3m database
ok 1806 - CRM_Dedupe_MergerTest::testBatchMergeSelectedDuplicates
ok 1807 - CRM_Dedupe_MergerTest::testBatchMergeAllDuplicates
ok 1808 - CRM_Dedupe_MergerTest::testGetCidRefs
ok 1809 - CRM_Dedupe_MergerTest::testGetMatches
ok 1810 - CRM_Dedupe_MergerTest::testGetMatchesExcludeDeleted with data set #0 (true)
ok 1811 - CRM_Dedupe_MergerTest::testGetMatchesExcludeDeleted with data set #1 (false)
ok 1812 - CRM_Dedupe_MergerTest::testGetMatchesIgnoreLocationType
ok 1813 - CRM_Dedupe_MergerTest::testGetMatchesCriteriaMatched
ok 1814 - CRM_Dedupe_MergerTest::testGetMatchesCriteriaMatchedWithLimit
ok 1815 - CRM_Dedupe_MergerTest::testGetMatchesCriteriaMatchedWithSearchLimit
ok 1816 - CRM_Dedupe_MergerTest::testGetMatchesNoCriteria
ok 1817 - CRM_Dedupe_MergerTest::testGetMatchesNoCriteriaButLimit
ok 1818 - CRM_Dedupe_MergerTest::testGetMatchesCriteriaNotMatched
FATAL: command execution failed
java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.Arrays.copyOf(Arrays.java:3745)
at java.base/java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:120)
at java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:95)
at java.base/java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:156)
at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:102)
at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:61)
Caused: java.io.IOException: Unexpected reader termination
at hudson.remoting.SynchronousCommandTransport$ReaderThread.lambda$new$1(SynchronousCommandTransport.java:50)
at java.base/java.lang.Thread.dispatchUncaughtException(Thread.java:1997)
Caused: java.io.IOException: Backing channel 'test-4' is disconnected.
at hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:215)
at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:285)
at com.sun.proxy.$Proxy74.isAlive(Unknown Source)
at hudson.Launcher$RemoteLauncher$ProcImpl.isAlive(Launcher.java:1215)
at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:1207)
at hudson.tasks.CommandInterpreter.join(CommandInterpreter.java:195)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:145)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:92)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:818)
at hudson.model.Build$BuildExecution.build(Build.java:199)
at hudson.model.Build$BuildExecution.doRun(Build.java:164)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:526)
at hudson.model.Run.execute(Run.java:1900)
at hudson.matrix.MatrixRun.run(MatrixRun.java:153)
at hudson.model.ResourceController.execute(ResourceController.java:107)
at hudson.model.Executor.run(Executor.java:449)
FATAL: Unable to delete script file /tmp/jenkins8376836365637261492.sh
java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.Arrays.copyOf(Arrays.java:3745)
at java.base/java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:120)
at java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:95)
at java.base/java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:156)
at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:102)
at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:61)
Caused: java.io.IOException: Unexpected reader termination
at hudson.remoting.SynchronousCommandTransport$ReaderThread.lambda$new$1(SynchronousCommandTransport.java:50)
at java.base/java.lang.Thread.dispatchUncaughtException(Thread.java:1997)
Caused: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@4a72872d:test-4": Remote call on test-4 failed. The channel is closing down or has closed down
at hudson.remoting.Channel.call(Channel.java:993)
at hudson.FilePath.act(FilePath.java:1186)
at hudson.FilePath.act(FilePath.java:1175)
at hudson.FilePath.delete(FilePath.java:1722)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:163)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:92)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:818)
at hudson.model.Build$BuildExecution.build(Build.java:199)
at hudson.model.Build$BuildExecution.doRun(Build.java:164)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:526)
at hudson.model.Run.execute(Run.java:1900)
at hudson.matrix.MatrixRun.run(MatrixRun.java:153)
at hudson.model.ResourceController.execute(ResourceController.java:107)
at hudson.model.Executor.run(Executor.java:449)
Build step 'Execute shell' marked build as failure
ERROR: Step ‘Publish xUnit test result report’ failed: no workspace for CiviCRM-Core-Matrix-PR/BKPROF=dfl,SUITES=phpunit-crm,label=bknix-tmp #4551
Finished: FAILURE
Here's what happens next:
- Jenkins kills the communication channel.
- Jenkins assumes that the worker-node kills any ongoing work for the test-job.
- Jenkins establishes a new communication channel and begins running new jobs.
-
But the worker-node did not kill everything. (I'm not clear exactly what it did do -- eg if any POSIX signals were sent; eg if worker processes are running or suspended.) For example,
mysqld
is present in the process-table, and it retains a hold on TCP port 5601. - When Jenkins begins another job, it finds the worker-image (
/home/dispatcher/images/bknix-dfl-2.img
) is in use. In fact, all of the images are in use. So it creates a new one (bknix-dfl-5.img
). - When Jenkins starts using
bknix-dfl-5.img
, it tries to launch a new mysqld on TCP port 5601. But it can't; the port is conflicted. You get problems like this:[mysql] Start daemon: mysqld --datadir="/home/homer/_bknix/ramdisk/worker-3/mysql/data" [mysetup] Initialize folder: /home/homer/_bknix/ramdisk/worker-3/mysetup Waiting for MySQL (maxWait=300, interval=0.5, windDown=0.5)... ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/home/homer/_bknix/ramdisk/worker-3/mysql/run/mysql.sock' (2)
If you have several jobs running at the moment of OutOfMemory
, then you may wind up doing this multiple times (e.g. 3 jobs die; 3 zombies left behind; 3 new images created; 3 tcp ports blocked).
(Follow-up) I'm not clear what the exact status is -- if any POSIX signals sent; if worker processes are running or suspended.
What I did observe was that these zombie processes were still around ~2 hours after the original. But after ~2h30m, they had gone way on their own.
If the jobs were quietly executing in a headless fashion, then they should've wrapped up in <30min. So they're probably not running -- it suggests they're somehow suspended (and then some other timeout/reaper mechanism comes in after 2hr). But this is pure speculation.
Brainstorming...
- Maybe figure out which java process ran out of memory -- and why.
- Maybe figure out what - if any - signals are being emitted when this OOM happens. Find a way to kill the zombies/orphans properly.
- Maybe introduce some firmer timeouts within the jobs. (We usually rely on Jenkins to timeout jobs; but obviously that doesn't work here. Try sprinkling
timeout
calls intoCiviCRM-Core-Matrix-PR.job
or similar leverage-point.) - Maybe change the port-allocation function.
- Maybe enable network-namespaces for purposes for any non-interactive jobs.