Amazon's Elastic Compute Cloud command-line tools are useful building blocks for creating more complex shell scripts. They allow you to start and stop instances, get their status, add tags, manage storage, IP addresses, and so on. They have one big disadvantage: they take a long time to run. For instance, running ec2-describe-instances for six instances takes 19 seconds on an m1.small AWS Linux instance. One answer given, is that this is caused by JVM startup overhead. I found that hard to believe, because on the same machine a Java "hello world" program executes in 120ms, and running ec2-describe-instances --help takes just 321ms. So I set out to investigate, and, using multiple tracing tools and techniques, this is what I found.
By running the ec2-describe-instances shell script using the shell's -x flag I found that ec2-describe-instances calls ec2-cmd.
$ bash -x `which ec2-describe-instances` + __ZIP_PREFIX__EC2_HOME=/opt/aws/apitools/ec2 + __RPM_PREFIX__EC2_HOME=/usr/local/aes/cmdline + /opt/aws/apitools/ec2/bin/ec2-cmd DescribeInstancesUsing the same technique I found that ec2-cmd runs Java on the class com.amazon.aes.webservices.client.cmd.DescribeInstances.
$ bash -x /opt/aws/apitools/ec2/bin/ec2-cmd DescribeInstances [...] + /usr/lib/jvm/java/bin/java -classpath :/opt/aws/apitools/ec2/lib/activation-1.1.jar:[...] com.amazon.aes.webservices.client.cmd.DescribeInstancesSo from then on I invoked Java directly to avoid the noise caused by the shell scripts, as their time overhead was negligible.
I then checked the interaction between the command and the AWS web services. I did this by running tcpdump and strace in parallel and looking at the logs' timestamps. The following command starts logging packets, runs the Java program logging, and then stops logging packets. On an otherwise idle machine this gives you a very clear packet trace.
(sudo tcpdump -n >tcpdump.out &) ; \
strace -f -o strace.out -T -tt \
/usr/lib/jvm/java/bin/java \
-classpath :/opt/aws/apitools/ec2/lib/activation-1.1.jar:[...] \
com.amazon.aes.webservices.client.cmd.DescribeInstances ; \
sudo killall tcpdump
18:51:33.053700 IP 10.39.93.124.36632 > 172.16.0.23.domain: 10259+ A? ... 18:51:33.053985 IP 10.39.93.124.36632 > 172.16.0.23.domain: 17669+ AAAA? ... 18:51:33.175941 IP 172.16.0.23.domain > 10.39.93.124.36632: 10259 1/0/0 A (51) 18:51:33.209711 IP 172.16.0.23.domain > 10.39.93.124.36632: 17669 0/1/0 (113) 18:51:33.211305 IP 10.39.93.124.57985 > 172.16.0.23.domain: 64553+ PTR? ...This was 33 seconds after the program had started running.
30775 18:51:00.087534 execve("/usr/lib/jvm/java/bin/java", ["/usr/lib/jvm/java/bin/java", "-XX:-UsePerfData", "-classpath", "/opt/aws/apitools/ec2/lib/activa"..., "com.amazon.aes.webservices.clien"...], [/* 31 vars */]) = 0 <0.000417> 30775 18:51:00.088415 brk(0) = 0x1e9d000 <0.000019> 30775 18:51:00.088576 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f915f52a000 <0.000021> 30775 18:51:00.088679 readlink("/proc/self/exe", "/media/ephemeral0/opt/jdk1.7.0_04/jre/bin/java", 4096) = 46 <0.000045>Therefore, I had to concentrate on Java's startup performance and ignore the AWS interactions. In retrospect that should have been a warning that I should have noticed.
I then run strace with the -c flag, which produces a cumulative count of the system calls and the time spent in them. The results were interesting.
% time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 98.34 18.562341 4428 4192 571 futex 1.07 0.201412 1 178548 gettimeofday 0.27 0.051676 47 1094 mprotect 0.12 0.023182 4 6428 5687 stat 0.11 0.020168 160 126 madvise 0.06 0.011662 1 11292 lseek 0.01 0.002735 0 13208 readReasoning that futex was simply showing the synchronization between threads I decided to try to eliminate gettimeofday, because of a reference I found regarding its effect on virtualized machines. After some searching and reading the command's invocation code I found I could eliminate the gettimeofday calls by setting an environment variable.
export EC2_JVM_ARGS=-XX:-UsePerfData
I then tested how long it took to open and read all the files, by writing a command sequence that got the file names from the trace log and concatenated their contents.
fgrep ' open(' strace.out |
grep -v ENOENT |
sed 's/",.*//;s/.*"//' |
egrep -v '^/(proc|dev)' |
xargs cat |
wc -c
Another thing that struck me in the trace file were thousands of failed calls to open class files in the current directory.
stat("/home/dds/org/codehaus/xfire/fault/XFireFault.class", 0x7f82cb39ffd0) = -1 ENOENT (No such file or directory) stat("/home/dds/org/codehaus/xfire/XFireException.class", 0x7f82cb39e280) = -1 ENOENT (No such file or directory) stat("/home/dds/com/amazon/aes/webservices/client/ImageAttribute.class", 0x7f82cb39ffd0) = -1 ENOENT (No such file or directory) stat("/home/dds/com/amazon/aes/webservices/client/KernelAttribute.class", 0x7f82cb39ffd0) = -1 ENOENT (No such file or directory) stat("/home/dds/com/amazon/aes/webservices/client/ImageValueAttribute.class", 0x7f82cb39e280) = -1 ENOENT (No such file or directory) stat("/home/dds/com/amazon/aes/webservices/client/RamdiskAttribute.class", 0x7f82cb39ffd0) = -1 ENOENT (No such file or directory) stat("/home/dds/com/amazon/aes/webservices/client/BlockDeviceMappingAttribute.class", 0x7f82cb39ffd0) = -1 ENOENT (No such file or directory) stat("/home/dds/com/amazon/aes/webservices/client/ImageMapAttribute.class", 0x7f82cb39e280) = -1 ENOENT (No such file or directory) stat("/home/dds/com/amazon/aes/webservices/client/ProductCodesAttribute.class", 0x7f82cb39ffd0) = -1 ENOENT (No such file or directory) stat("/home/dds/com/amazon/aes/webservices/client/ImageListAttribute.class", 0x7f82cb39e280) = -1 ENOENT (No such file or directory) stat("/home/dds/com/amazon/aes/webservices/client/LaunchPermissionAttribute.class", 0x7f82cb39ffd0) = -1 ENOENT (No such file or directory)I traced the cause to be the way the invocation shell script built the class path by concatenating all jar files in the library.
for jar in "${LIBDIR}"/*.jar ; do
CP="${CP}:${jar}"
done
CP
variable, starting with a colon,
which is taken to mean to search in the current directory.
I fixed that problem by adding the following line after the for
loop:
CP=$(echo $CP | sed 's/://')
While examining the strace files I found many long read/seek sequences, like the following.
13190 15:07:29.981602 lseek(3, 56113794, SEEK_SET) = 56113794 <0.000034> 13190 15:07:29.981713 read(3, "PK\3\4\n\0\0\0\0\0\345)p@\216a\310\346\f\3\0\0\f\3\0\0001\0\0\0", 30) = 30 <0.000034> 13190 15:07:29.981848 lseek(3, 56113873, SEEK_SET) = 56113873 <0.000034> 13190 15:07:29.981953 read(3, "\312\376\272\276\0\0\0001\0%\n\0\4\0\26\n\0\27\0\30\7\0\31\7\0\32\1\0\6"..., 780) = 780 <0.000036> 13190 15:07:29.986162 lseek(3, 56112935, SEEK_SET) = 56112935 <0.000040> 13190 15:07:29.986288 read(3, "PK\3\4\n\0\0\0\0\0\345)p@\261\311$\200\f\3\0\0\f\3\0\0001\0\0\0", 30) = 30 <0.000033> 13190 15:07:29.986410 lseek(3, 56113014, SEEK_SET) = 56113014 <0.000031> 13190 15:07:29.986507 read(3, "\312\376\272\276\0\0\0001\0%\n\0\4\0\26\n\0\27\0\30\7\0\31\7\0\32\1\0\6"..., 780) = 780 <0.000043> 13190 15:07:29.990264 lseek(3, 56112080, SEEK_SET) = 56112080 <0.000036> 13190 15:07:29.990379 read(3, "PK\3\4\n\0\0\0\0\0\345)p@\32\17|\317\10\3\0\0\10\3\0\0001\0\0\0", 30) = 30 <0.000042> 13190 15:07:29.990515 lseek(3, 56112159, SEEK_SET) = 56112159 <0.000034> 13190 15:07:29.990626 read(3, "\312\376\272\276\0\0\0001\0%\n\0\4\0\26\n\0\27\0\30\7\0\31\7\0\32\1\0\6"..., 776) = 776 <0.000040> 13190 15:07:29.994931 lseek(3, 56111220, SEEK_SET) = 56111220 <0.000047> 13190 15:07:29.995065 read(3, "PK\3\4\n\0\0\0\0\0\345)p@$\333\211\31\f\3\0\0\f\3\0\0002\0\0\0", 30) = 30 <0.000044> 13190 15:07:29.995204 lseek(3, 56111300, SEEK_SET) = 56111300 <0.000035> 13190 15:07:29.995316 read(3, "\312\376\272\276\0\0\0001\0%\n\0\4\0\26\n\0\27\0\30\7\0\31\7\0\32\1\0\6"..., 780) = 780 <0.000043> 13190 15:07:29.999541 lseek(3, 56110361, SEEK_SET) = 56110361 <0.000043> 13190 15:07:29.999667 read(3, "PK\3\4\n\0\0\0\0\0\345)p@2u\340\335\v\3\0\0\v\3\0\0002\0\0\0", 30) = 30 <0.000047> 13190 15:07:29.999809 lseek(3, 56110441, SEEK_SET) = 56110441 <0.000037>This got me thinking about the performance implications of these, and whether the seek calls were somehow interacting with Amazon's Simple Storage Service (S3), which was used as the machine's backing storage. Searching for data on AWS's performance, I came across an excellent technical report written by Simson L. Garfinkel of Harvard's Computer Science Group. There he mentioned that S3 performance depended on the size of the objects stored. Although the operating system's file system and buffer cache should isolate the seek system calls from S3, I decided to test whether using local storage would improve the performance. I therefore modified the instance's startup command to attach some local storage to the machine
ec2-run-instances $AMI -b /dev/sdb=ephemeral0 ...
tar cf - /opt/aws | sudo tar -C /media/ephemeral0/ -xf -
sudo ln -s /media/ephemeral0/opt/ /opt
tar cf - /usr/lib/jvm | sudo tar -C /media/ephemeral0/ -xf -
sudo ln -s /media/ephemeral0/usr/lib/jvm/ /usr/lib/jvm
tar cf - /usr/share | sudo tar -C /media/ephemeral0/ -xf -
sudo ln -s /media/ephemeral0/usr/share/ share
I then considered whether all those seek/read calls were an inefficiency of the OpenJDK that comes with Amazon's AWS Linux. Could Sun's (now Oracle's) Java runtime be more efficient? Based on instructions by Dmitry Leskov I downloaded and run Orcacle's Java 1.7, but the performance appeared to deteriorate a bit. Another dead-end.
In the end it turned out that all the solutions I tried produced negligible or (in the case of Oracle's JVM negative) results:
gettimeofday
,
Apparently, AWS command-line tools are indeed slow due to Java's
long startup time.
Dynamic class loading ensures that running small programs or even
an AWS command with the -help
flag isn't slow,
but when the command needs to do some actual work and thus loads
all the required class files, it takes many seconds to run.
So what can one do? Here are some things that do work.
curl -s http://169.254.169.254/2011-01-01/meta-data/instance-id
Last modified: Saturday, June 2, 2012 0:14 am
Unless otherwise expressly stated, all original material on this page created by Diomidis Spinellis is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.