VSPERF CI consists of several jobs integrated into the OPNFV infrastructure. The jobs are triggered either by OPNFV Jenkins (daily job) or by OPNFV Gerrit (verify and merge jobs). The comprehensive list of jobs, their status and history is visible in the VSPERF specific dashboard at https://build.opnfv.org/ci/view/vswitchperf/
There are two versions of each job: one for the current stable branch and one for the master branch.
In case of the daily job, which executes a set of performance tests, the results are also available in graphical form at VSPERF CI Results; test results, reports and logs are stored inside OPNFV artifacts at http://artifacts.opnfv.org/logs_vswitchperf_intel-pod12.html.
OPNFV Jenkins is operated by the releng team and the configuration of jobs is stored in the releng git repository. The VSPERF specific part can be found in the YAML file vswitchperf.yml. For more info on writing and using Jenkins Job Builder (JJB) definitions see Jenkins Wow.
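For a quick local sanity check of the job definitions, the Jenkins Job Builder CLI can render them without touching the Jenkins server. This is only a sketch; the clone URL and the path to vswitchperf.yml inside the releng repository are assumptions about the current repository layout:

# clone the releng repository and render the VSPERF job definitions locally
git clone https://gerrit.opnfv.org/gerrit/releng
cd releng
jenkins-jobs test jjb/vswitchperf/vswitchperf.yml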
In order to have a more flexible way of job configuration, the VSPERF project stores the detailed job configuration in the VSPERF repository in the build-vsperf.sh script, which is invoked by the generic YAML job configuration above.
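In essence, the YAML job boils down to a shell builder step similar to the sketch below; the exact invocation is defined in vswitchperf.yml, and the script location and job type argument here are assumptions for illustration:

# executed on the Jenkins slave inside the cloned vswitchperf workspace
cd $WORKSPACE/ci
./build-vsperf.sh verify   # or "merge" / "daily", depending on the job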
Links summary:
CI Dashboard: https://build.opnfv.org/ci/view/vswitchperf/
Daily job results: VSPERF CI Results (graphs) and OPNFV artifacts at http://artifacts.opnfv.org/logs_vswitchperf_intel-pod12.html (reports and logs)
Job definition scripts: vswitchperf.yml (releng repository) and build-vsperf.sh (VSPERF repository)
The VSPERF CI jobs are broken down into the DAILY job and the VERIFY and MERGE jobs.
The DAILY job executes a set of performance tests for OVS with DPDK support, Vanilla OVS, VPP and SRIOV. An Ixia traffic generator is used to generate RFC2544 Throughput and Back2Back traffic.
NOTE: The list of testcases to be executed for a particular job type is configured inside build-vsperf.sh. Please refer to the configuration options TESTCASES_* and TESTPARAM_* for additional details.
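The values below are purely illustrative (the real lists live in build-vsperf.sh and change over time); they only show how the TESTCASES_* and TESTPARAM_* options are structured:

# illustrative only - consult build-vsperf.sh for the current testcase selection
TESTCASES_VERIFY="vswitch_add_del_bridge vswitch_add_del_vport"
TESTPARAM_VERIFY="--integration"
TESTCASES_DAILY="phy2phy_tput back2back pvp_tput pvvp_tput"
TESTPARAM_DAILY=""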
The VSPERF project has a dedicated POD hosted at an Intel lab. Please check Intel POD12 and VSPERF in Intel Pharos Lab - Pod 12 for details.
DAILY JOB:
It requires a traffic generator in order to execute the performance testcases. Thus this job is executed at POD12.
The status of Intel POD12 is visible in jenkins at: https://build.opnfv.org/ci/computer/intel-pod12/
VERIFY and MERGE JOB:
They are executed at POD12 or at Ericsson pods as they don't require a traffic generator. POD12 is used as the primary Jenkins slave, because execution at the Ericsson build machines became unreliable once other projects started to use them more extensively. It seems that there is a clash on resources (hugepages). There was an attempt to avoid parallel execution of VSPERF and other jobs, but it didn't help. Contact for the Ericsson Pod: ________
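When such a clash is suspected, a generic check of hugepages availability on the build machine (not VSPERF specific, shown here only as a hint) can reveal whether another job has already consumed them:

# how many hugepages are configured and how many are still free
grep -i hugepages /proc/meminfo
cat /proc/sys/vm/nr_hugepages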
Q: VERIFY JOB has failed. What should I do?
A: Please check the "console output" of the failed job to find out the cause of the failure. The most common failures are:
DPDK, OVS, QEMU or VPP can't be cloned from its repository and thus the job fails. Example of console output in that case:
Cloning into 'dpdk'...
error: RPC failed; result=18, HTTP code = 200
fatal: The remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed
make[1]: *** [dpdk] Error 128
This is often a temporary issue and it is enough to re-trigger the job, e.g. by inserting a comment "reverify" into the gerrit review in question. If the problem persists, please get in touch with the admins responsible for the particular server to verify that the connection to the failing site is not blocked by a firewall.
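A quick connectivity check from the build server can be done along the following lines; the repository URL is only an example, substitute the clone URL of whichever component (DPDK, OVS, QEMU or VPP) failed in the console output:

# example only - substitute the URL of the repository that failed to clone
git ls-remote https://github.com/openvswitch/ovs.git >/dev/null && echo "repository reachable"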
The Jenkins slave went offline during job execution. Example of console output in that case:
FATAL: command execution failed
java.nio.channels.ClosedChannelException
    at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
    at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:179)
    at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:721)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused: java.io.IOException: Backing channel 'JNLP4-connect connection from 10.30.0.3/10.30.0.3:34322' is disconnected.
...
There are two common causes:
Q: DAILY JOB has failed. What should I do?
A: Please first check the answer to "VERIFY JOB has failed" above for causes common to all jobs. Please note that in case of the DAILY job, Intel POD12 is used as the Jenkins slave (the job executor) and the VSPERF community does OS administration of this server themselves, so you can log in and investigate issues directly. In case of the daily job it is possible to re-trigger it from the Jenkins GUI, but only if the Jenkins user is logged in and has the appropriate privileges. Get in touch with the VSPERF PTL and the Linux Foundation helpdesk in order to obtain these privileges. If none of the generic issues above occurred, then the following DAILY job specific issues can occur:
Q: DAILY JOB execution takes too long or its results fluctuate. What should I do?
A: This is caused by the VM where the IxNetwork GUI application is executed. In the past, VSPERF used Intel POD3, where execution of the DAILY job was stable: performance results were consistent among daily job executions and the execution always took about 12 hours. After the move to a different Intel lab and to Intel POD12, the performance started to fluctuate and the daily job takes longer with each execution. Several attempts to fix these issues were made, but they still persist. In order to shorten the DAILY job execution, it is required to log into the VM as the "vsperf_ci" user via remote desktop and to restart the IxNetwork GUI application.
Q: The Jenkins slave at Intel POD12 is offline. What should I do?
A: Check if the Jenkins slave process is running:
[root@pod12-node3 ~]# ps -ef | grep jenkins
jenkins  12995     1  0 Feb13 ?        00:09:40 java -jar slave.jar -jnlpUrl https://build.opnfv.org/ci/computer/intel-pod12/slave-agent.jnlp -secret <secret> -noCertificateCheck
root     17681 17647  0 15:23 pts/0    00:00:00 grep --color=auto jenk
You can also restart it if needed using the "monit stop jenkins" and "monit start jenkins" commands. Example output of "monit status":
[root@pod12-node3 ~]# monit status
Monit 5.25.1 uptime: 73d 5h 29m

Directory 'jenkins_piddir'
  status                       OK
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  permission                   755
  uid                          1001
  gid                          1001
  access timestamp             Mon, 03 Dec 2018 09:54:12
  change timestamp             Wed, 13 Feb 2019 14:35:01
  modify timestamp             Wed, 13 Feb 2019 14:35:01
  data collected               Thu, 14 Feb 2019 15:23:51

Process 'jenkins'
  status                       OK
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  pid                          12995
  parent pid                   1
  uid                          1001
  effective uid                1001
  gid                          1001
  uptime                       1d 0h 48m
  threads                      53
  children                     0
  cpu                          0.0%
  cpu total                    0.0%
  memory                       0.7% [443.8 MB]
  memory total                 0.7% [443.8 MB]
  security attribute           (null)
  disk read                    0 B/s [81.8 MB total]
  disk write                   0 B/s [6.8 GB total]
  data collected               Thu, 14 Feb 2019 15:23:51

System 'pod12-node3.opnfv.local'
  status                       OK
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  load average                 [0.00] [0.00] [0.00]
  cpu                          0.0%us 0.0%sy 0.0%wa
  memory usage                 15.2 GB [24.1%]
  swap usage                   0 B [0.0%]
  uptime                       73d 5h 30m
  boot time                    Mon, 03 Dec 2018 09:53:25
  data collected               Thu, 14 Feb 2019 15:23:51
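A restart of the slave process then boils down to the following two commands; the service name "jenkins" is taken from the monit status output above:

[root@pod12-node3 ~]# monit stop jenkins
[root@pod12-node3 ~]# monit start jenkins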
Q: VSPERF can't connect to the IxNetwork TCL server. What should I do?
A: Currently there are 3 vsperf user accounts for IxNetwork in the Ixia VM. Follow the procedure below to overcome the issue. All IxNetwork port numbers are pre-configured; you just need to restart the services.
1. Connect to the Ixia VM (Remote Desktop) using the 'vsperf_ci' login and password. Once it is connected and the VM is launched, the system should automatically start the IxNetwork service on TCL port 9126. Open the hidden icons arrow button in the task bar and place the mouse pointer on the IxNetwork icon to see whether it shows the TCL Port Configuration. If it is not started automatically, then double click on the IxNetwork icon and it will start the service at port 9126.
2. Connect to the Ixia VM (Remote Desktop) using the 'vsperf_sandbox' login and password. Once it is connected and the VM is launched, the system should automatically start the IxNetwork service on TCL port 9127. Open the hidden icons arrow button in the task bar and place the mouse pointer on the IxNetwork icon to see whether it shows the TCL Port Configuration. If it is not started automatically, then double click on the IxNetwork icon and it will start the service at port 9127.
3. Connect to the Ixia VM (Remote Desktop) using the 'vsperf_sandbox2' login and password. Once it is connected and the VM is launched, the system should automatically start the IxNetwork service on TCL port 9128. Open the hidden icons arrow button in the task bar and place the mouse pointer on the IxNetwork icon to see whether it shows the TCL Port Configuration. If it is not started automatically, then double click on the IxNetwork icon and it will start the service at port 9128.
If the above three IxNetwork TCL services are running fine, then you are good to go.
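As an additional quick check from the Jenkins slave, reachability of the three TCL ports can be tested as sketched below; IXIA_VM_IP is a placeholder for the real Ixia VM address:

# IXIA_VM_IP is a placeholder - use the real address of the Ixia VM
for PORT in 9126 9127 9128 ; do
    nc -zv ${IXIA_VM_IP} ${PORT}
done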
There are several nodes available at Intel POD12 (see Intel POD12). Currently there are two sandboxes; the second sandbox, using node1 and node2, was created recently. It would be possible to reconfigure the 2nd sandbox to be used as another Jenkins slave (or even two). This would speed up execution of VSPERF jobs. However, the releng team must be consulted regarding proper naming, as two different Jenkins slaves would be hosted at the same Intel POD12.
In the past, VERIFY & MERGE jobs were executed at the opnfv-build-ubuntu group of slaves, which consists of the ericsson-build3 and ericsson-build4 machines. The execution was reliable at both of these servers for several months, but later it started to fail. There were several issues; some of them were related to hugepages allocation and usage and to VPP. In case of VPP, it happened several times that it stopped working completely at one of the Ericsson servers. The responsible admins were asked for help, but they were not able to find a root cause. The only solution was to reboot the affected server, after which it worked for some time again. There is a suspicion that both the hugepages and the VPP issues are caused by parallel execution of jobs for VSPERF and other projects. As debugging of such a race condition at a server without any access is hardly possible, both VERIFY & MERGE jobs are primarily executed at Intel POD12. The idea was to execute VERIFY & MERGE jobs at POD12 if it is not occupied by the DAILY job and otherwise to move them to the Ericsson POD. However, the current YAML file definition doesn't work that way: it switches to the Ericsson POD only in case that Intel POD12 is offline. Releng engineers can help us with the YAML file definition to achieve better utilization of available PODs.
Consider pinning the Jenkins health check application to the second NUMA slot, which is not used for performance tests execution. Even better would be a move of that application to a jumphost. However, one would have to solve how to execute vsperf "remotely" and how to configure multiple slaves at the same POD (it is probably not possible to run multiple healthchecks at the same machine; maybe a container would help).
This won't be needed if we configure more Jenkins slaves at Intel POD12.
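One possible interpretation of the pinning idea above, as a rough sketch only (assuming numactl is installed and NUMA node 1 is the slot not used by the performance tests), would be to wrap the health check execution like this:

# hypothetical: run the verify job pinned to NUMA node 1, leaving node 0 free for performance tests
numactl --cpunodebind=1 --membind=1 ./ci/build-vsperf.sh verify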