Performance of read-write throughput with iscsi
by Martin Monperrus (tagged as linux, iscsi)

I recently encountered some performance issues using iSCSI. I use the open-iscsi implementation on the client side. After hours of googling and trial and error, here are some points related to the performance of iSCSI.
Readahead
The performance is highly dependent on the block device readahead parameter (the sector count for filesystem read-ahead):
$ blockdev --getra /dev/sda
256
By setting it to 1024 instead of the default 256, I doubled the read throughput:
$ blockdev --setra 1024 /dev/sda
Note: on 2.6 kernels, this is equivalent to hdparm -a (see the blockdev man page: "--setfra N: Set filesystem readahead (same like --setra on 2.6 kernels).")
$ hdparm -a 1024 /dev/sda
/dev/sda:
 setting fs readahead to 1024
 readahead     = 1024 (on)
This is also equivalent to setting /sys/block/sda/queue/read_ahead_kb, except that the unit differs: blockdev and hdparm count 512-byte sectors, while read_ahead_kb counts kilobytes.
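To keep the units straight, here is a small sketch (the device name is an assumption) converting the sector count used by blockdev/hdparm into the kilobyte value expected by read_ahead_kb:

```shell
# Readahead unit conversion: blockdev/hdparm count 512-byte sectors,
# read_ahead_kb counts kilobytes. /dev/sda is an assumed device name.
RA_SECTORS=1024                       # value given to blockdev --setra
RA_KB=$(( RA_SECTORS * 512 / 1024 ))  # the same setting in kilobytes
echo "$RA_KB"                         # prints 512
# Equivalent sysfs write (as root):
# echo $RA_KB > /sys/block/sda/queue/read_ahead_kb
```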
Note that setting the readahead to a value larger than max_sectors_kb (/sys/block/sda/queue/max_sectors_kb) has no effect: the minimum of the two is used.
To see the effect of your changes, look at the avgrq-sz field of iostat -x -d 2 while running hdparm -t.
MTU
On http://publib.boulder.ibm.com/infocenter/iseries/v7r1m0/index.jsp?topic=/rzahq/mtuconsiderations.htm, it is stated that "High bandwidth and low latency is desirable for the iSCSI network. Storage and virtual Ethernet can take advantage of a maximum transmission unit (MTU) up to a 9000 byte 'jumbo' frame if the iSCSI network supports the larger MTU." Jumbo frames also seem to be a solution according to several posts on the web (e.g. this one). The reason is that a basic filesystem block is 4096 bytes, which requires 3 packets with the default MTU of 1500 bytes. With jumbo frames, on the contrary, one network packet can contain one and sometimes even two sequential FS blocks.
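The packet arithmetic behind this claim can be sketched as follows, assuming roughly 40 bytes of IP+TCP headers per packet (no options, no fragmentation):

```shell
# Packets needed to carry one 4096-byte filesystem block, assuming
# ~40 bytes of IP+TCP headers per packet.
BLOCK=4096
for MTU in 1500 9000; do
  PAYLOAD=$(( MTU - 40 ))                      # usable TCP payload per packet
  PKTS=$(( (BLOCK + PAYLOAD - 1) / PAYLOAD ))  # ceiling division
  echo "MTU=$MTU -> $PKTS packet(s) per block"
done
# MTU=1500 -> 3 packet(s) per block
# MTU=9000 -> 1 packet(s) per block (8960 bytes even fits two blocks)
```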
The path MTU is determined by Path MTU discovery, which requires that ICMP packets are not firewalled. To set the MTU of the network interface controller (NIC):
$ ifconfig eth0 mtu 7200
To test the maximum MTU between the initiator and the target:
(initiator) $ tracepath target.com
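Another way to probe whether a candidate MTU passes unfragmented is a ping with the Don't Fragment bit set; on Linux the ICMP payload to send is the MTU minus 28 bytes (20-byte IP header + 8-byte ICMP header). The host name below is a placeholder:

```shell
# Probe a candidate MTU with Linux ping: payload = MTU - 28
# (20-byte IP header + 8-byte ICMP header). target.com is a placeholder.
MTU=9000
PAYLOAD=$(( MTU - 28 ))
echo "$PAYLOAD"   # prints 8972
# -M do sets the Don't Fragment bit; the ping fails if the path MTU is smaller:
# ping -c 3 -M do -s $PAYLOAD target.com
```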
Partition alignment
The partition should be aligned for maximum performance. On my iSCSI disk, I have a single partition which starts at sector #2048. See:
heads and sectors for partition alignment
http://communities.vmware.com/docs/DOC-10510
http://groups.google.com/group/open-iscsi/browse_thread/thread/37741fb3b3eca1e4
http://comments.gmane.org/gmane.linux.iscsi.open-iscsi/2240
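A quick alignment check, sketched below: with 512-byte sectors, a start sector that is a multiple of 2048 sits on a 1 MiB boundary. The sysfs path in the comment assumes a partition named sda1:

```shell
# Check partition alignment: start sector 2048 = 1 MiB with 512-byte sectors.
# In practice read the start sector from sysfs, e.g.:
#   START=$(cat /sys/block/sda/sda1/start)
START=2048       # sample value, as on the iSCSI disk described above
ALIGN=2048       # 1 MiB boundary expressed in 512-byte sectors
if [ $(( START % ALIGN )) -eq 0 ]; then
  echo "partition is aligned"
else
  echo "partition is NOT aligned"
fi
```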
/sys/block/sdX/queue/scheduler
The "noop" scheduler/elevator seems to be the best for iSCSI. In my setup, noop is better because it avoids crashing the server under heavy I/O load (e.g. when writing very large files): echo noop > /sys/block/sda/queue/scheduler
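The sysfs file lists the available schedulers with the active one in brackets, so the current choice can be extracted as sketched below (the sample line stands in for the sysfs read):

```shell
# The active elevator appears in brackets, e.g. "noop deadline [cfq]".
# The sample line stands in for: LINE=$(cat /sys/block/sda/queue/scheduler)
LINE="noop deadline [cfq]"
CURRENT=$(echo "$LINE" | sed 's/.*\[\(.*\)\].*/\1/')
echo "$CURRENT"   # prints cfq
# Switch to noop (as root):
# echo noop > /sys/block/sda/queue/scheduler
```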
/sys/block/sdX/queue/max_sectors_kb
The default value on Linux is 512, meaning a request is at most 512 KB. Since one request is translated into one SCSI Command PDU, the lower max_sectors_kb, the higher the number of SCSI Command PDUs for the same amount of data transferred in sequential read/write. Based on this observation, I found that my sequential write throughput significantly increased after raising max_sectors_kb to 16384 (echo 16384 > /sys/block/sda/queue/max_sectors_kb). To monitor the number of SCSI Command PDUs: iscsiadm -m session --stats (field scsicmd_pdus).
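The effect on the PDU count can be estimated as total data divided by request size, assuming every request is full-sized; the 100 MiB figure is an arbitrary example:

```shell
# Approximate SCSI Command PDUs for a sequential transfer:
# total size / max request size, assuming every request is full-sized.
DATA_KB=$(( 100 * 1024 ))   # 100 MiB of sequential data (example figure)
for MAXSECT_KB in 512 16384; do
  PDUS=$(( DATA_KB / MAXSECT_KB ))
  echo "max_sectors_kb=$MAXSECT_KB -> ~$PDUS command PDUs"
done
# max_sectors_kb=512   -> ~200 command PDUs
# max_sectors_kb=16384 -> ~6 command PDUs
```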
/sys/block/sdX/queue/nr_requests
Increasing the maximal I/O queue size (nr_requests) often improves the performance:
echo 1024 > /sys/block/sda/queue/nr_requests
See http://www.monperrus.net/martin/scheduler+queue+size+and+resilience+to+heavy+IO.
iSCSI R2T
I still don't know whether InitialR2T has an impact on throughput or latency. For a given request, the number of R2T PDUs is approximately equal to Size / MaxBurstLength. For instance, to copy 100 MB of data over a connection configured with MaxBurstLength=262144, there are 100000000/262144 ≈ 381 R2T PDUs. To monitor the number of R2T PDUs: iscsiadm -m session --stats (field r2t_pdus).
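The arithmetic above as a small sketch:

```shell
# R2T PDUs per transfer ≈ transfer size / MaxBurstLength.
SIZE=100000000    # 100 MB, as in the example above
BURST=262144      # negotiated MaxBurstLength in bytes
R2T=$(( SIZE / BURST ))
echo "~$R2T R2T PDUs"   # prints ~381 R2T PDUs
```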
vm.vfs_cache_pressure
Decreasing vm.vfs_cache_pressure makes the kernel retain more of the file system cache, hence decreases the number of accesses to the iSCSI disk. I set it to the usual value of 50:
$ sysctl -w vm.vfs_cache_pressure=50
(you can read the current value with cat /proc/sys/vm/vfs_cache_pressure). I believe the argument, but I have not set up an experiment verifying that it actually improves performance.
To empty the cache (free pagecache, dentries and inodes): $ echo 3 > /proc/sys/vm/drop_caches
Troubleshooting
$ iscsid --version
iscsid version 2.0-870
$ iscsiadm --version
iscsiadm version 2.0-870
$ iscsiadm -m session -P 2
iSCSI Transport Class version 2.0-870
iscsiadm version 2.0-870
Target: iqn.2007-10.net.ovh:r35173vol0
    Current Portal: 91.121.191.30:3260,1
    Persistent Portal: 91.121.191.30:3260,1
        Interface:
        Iface Name: default
        Iface Transport: tcp
        Iface Initiatorname: iqn.2005-03.org.open-iscsi:e4fe229d280f
        Iface IPaddress: 94.23.243.67
        Iface HWaddress: default
        Iface Netdev: default
        SID: 1
        iSCSI Connection State: LOGGED IN
        iSCSI Session State: LOGGED_IN
        Internal iscsid Session State: NO CHANGE
        Negotiated iSCSI params:
        HeaderDigest: None
        DataDigest: None
        MaxRecvDataSegmentLength: 131072
        MaxXmitDataSegmentLength: 8192
        FirstBurstLength: 65536
        MaxBurstLength: 262144
        ImmediateData: Yes
        InitialR2T: Yes
        MaxOutstandingR2T: 1
Tests
$ hdparm -Tt /dev/sda
/dev/sda:
 Timing cached reads: 876 MB in 2.00 seconds = 438.04 MB/sec
 Timing buffered disk reads: 22 MB in 3.66 seconds = 6.01 MB/sec
Read test (the kernel does not cache block devices):
$ dd if=/dev/sda of=/dev/null bs=1024k count=50
50+0 records in
50+0 records out
52428800 bytes (52 MB) copied, 4.68656 s, 11.2 MB/s
Write test (to a real file on top of the filesystem, so first empty the kernel cache):
$ echo 3 > /proc/sys/vm/drop_caches
$ dd if=/dev/zero of=/foo bs=1024k count=50
50+0 records in
50+0 records out
52428800 bytes (52 MB) copied, 4.31652 s, 12.1 MB/s
$ rm /foo
Read and write tests with bonnie++:
$ bonnie++ -f
Version 1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
r35173.ovh.ne 1000M            9132   5  4756   3           11636   2 209.0   8
Latency                      10596ms    3903ms                773ms    2373ms
Version 1.96       ------Sequential Create------ --------Random Create--------
r35173.ovh.net      -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  6180  41 +++++ +++ 10183  46  7648  50 +++++ +++ 10177  46
Latency             58855us    1460us    4112us     463us     241us     200us
Random access time with seeker from http://www.linuxinsight.com/how_fast_is_your_disk.html:
$ ./seeker /dev/sda
Seeker v2.0, 2007-01-15, http://www.linuxinsight.com/how_fast_is_your_disk.html
Benchmarking /dev/sda [20480MB], wait 30 seconds.............................
Results: 20 seeks/second, 49.02 ms random access time
Parallel random access time with seeker_baryluk from http://smp.if.uj.edu.pl/~baryluk/seeker_baryluk.c. I set the number of threads to 32 because it is the maximum number of parallel requests from /sys/block/sda/device/queue_depth. Note that the reported random access time is buggy (see the source code):
$ ./seeker_baryluk /dev/sda 32
Seeker v3.0, 2009-06-17, http://www.linuxinsight.com/how_fast_is_your_disk.html
Benchmarking /dev/sda [41943040 blocks, 21474836480 bytes, 20 GB, 20480 MB, 21 GiB, 21474 MiB] [logical sector size]
32 threads.........
Results: 164 seeks/second, 6.088 ms random access time (67320 < offsets < 21471527830)
To read the current configuration values of node.conn[0].timeo.noop_out_interval, node.conn[0].timeo.noop_out_timeout and node.session.timeo.replacement_timeout (you may have to adapt the host and connection number):
$ cat /sys/class/iscsi_connection/connection1:0/ping_tmo
$ cat /sys/class/iscsi_connection/connection1:0/recv_tmo
$ cat /sys/class/iscsi_session/session1/recovery_tmo
$ cat /sys/class/iscsi_session/session1/abort_tmo
$ cat /sys/class/iscsi_session/session1/lu_reset_tmo