Testing performance of server SSD in Ceph storage

How to check the actual SSD performance
One of the main responsibilities of a service provider is to deliver reliable, high-quality services. We also set ourselves an extra goal: to offer those services to our customers at attractive prices.
Sacrificing reliability and quality for the sake of good prices is not our approach. Of course, to give customers both at once, we had to make an effort: to think things through and look for interesting, sometimes unobvious solutions. And any elegant, effective engineering solution has, in addition to its theoretical part, a practical side.

And if we talk about practice, our clients are most definitely interested in how fast our servers are. It is a good question, however, as we all know, “everybody lies”, and the manufacturers of disk drives are no exception. We have some nerdy and meticulous engineers in our team who just would not trust a vendor’s word. Thus, we decided to check absolutely everything.

Sometimes, the performance of disk subsystems is estimated incorrectly. Testers use methods that depend on cache speed, processor performance, and “convenience” of the file system location on the disk. Sometimes they measure the linear speed, which does not reflect the actual performance.

In our tests, we measured the number of IOPS and latency.
Latency is tricky to measure. In any storage system, some requests complete quickly while others complete slowly, and for a small fraction of requests the delay before the response can be substantially higher than for the rest. The higher the latency, the lower the performance.
This is why, in the tests where we measured sustained latency, three parameters were recorded at once: the average request execution time, the maximum time, and the 99.9th-percentile latency. The last number means that in 99.9% of cases the request completes faster than the specified time value.
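To make the 99.9th percentile concrete: take all recorded latencies, sort them, and pick the value below which 99.9% of the samples fall (nearest-rank method). A minimal sketch over the latency column of a fio latency log (fio lat logs are CSV lines of time, latency, direction, block size; the log file name in the usage example is hypothetical):

```shell
# Nearest-rank p99.9 over the latency column of a fio latency log.
p999() {
  # extract the 2nd (latency) column, sort numerically,
  # then print the value at index floor(N * 0.999)
  awk -F',' '{print $2 + 0}' "$1" | sort -n |
    awk '{v[NR] = $1} END {i = int(NR * 0.999); if (i < 1) i = 1; print v[i]}'
}
```

Usage would look like `p999 bench_lat.1.log` (log name assumed); fio itself can also report percentiles directly in its summary output.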
To get valid test results, we used the fio utility, which, unlike IOmeter, runs reliably under Linux.

Ceph settings
Ceph is a fault-tolerant, distributed, open-source data storage system that runs over TCP. One of its basic qualities is scalability up to petabyte sizes.
Ceph is a free system with high availability and reliability that does not require special hardware.
A normal Ceph cluster workflow involves multiple OSDs (disks) interacting over a network. We trimmed the configuration down to a single OSD on localhost (the same machine where the test runs). This eliminates all network latency and limits Ceph's performance to that of a single disk, which is exactly our goal for benchmarking purposes.
Ceph provides a load that no synthetic test can produce. Therefore, the results of testing with the help of Ceph are much closer to the possible real loads.
We need to clarify one important thing about all the following benchmarks. Ceph uses a journal to record all data before the actual write, which means every write to Ceph storage causes two writes on each underlying OSD. Since we use just one OSD for the benchmark, the measured write IOPS should be multiplied by 2, and the write latency divided by 2, to estimate the raw disk figures.
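The journal adjustment is simple arithmetic; a sketch with hypothetical example numbers (the values below are made up for illustration, not measured results):

```shell
# Single-OSD journal adjustment: each client write hits the disk twice
# (journal + data), so raw-disk write IOPS ~ 2 x measured,
# and raw write latency ~ measured / 2.
measured_iops=5000     # hypothetical write IOPS reported by fio
measured_lat_us=400    # hypothetical average write latency, microseconds
echo "estimated raw disk iops:       $((measured_iops * 2))"
echo "estimated raw disk latency us: $((measured_lat_us / 2))"
```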

Ceph settings in our test:
host reference {
        id -2
        alg straw
        hash 0
        item osd.0 weight 1.000
}
root default {
        id -1
        # weight 7.000
        alg straw
        hash 0
        item reference weight 7.000
}
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

For the Ceph pool, we set the number of placement groups pg_num = 32, the number of placement groups for placement pgp_num = 32, replication size 1, and min_size 1.
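One way to create a pool with these settings, assuming the pool name "bench" used by the fio commands in this article:

```shell
# Create the benchmark pool with 32 placement groups and no replication.
ceph osd pool create bench 32 32   # pg_num = 32, pgp_num = 32
ceph osd pool set bench size 1     # replication size 1
ceph osd pool set bench min_size 1 # serve I/O with a single replica
```

These commands must of course be run against the test cluster itself.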

Testing equipment:
  • Rack server brand: Dell PowerEdge R730xd
  • Processor: Intel® Xeon® CPU E5–2680 v3 @ 2.50GHz
  • RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108 [Invader] (rev 02)
  • SSD Disks
  1. Intel® SSD DC S3500 Series 800 GB / Model: Intel 730 and DC S35x0/3610/3700 Series SSDs INTEL / Firmware version: CVWL420503MU800RGN
  2. Kingston DC400 1.6 TB / Model: KINGSTON SEDC400S371600G / Firmware version: SAFM32T3
  3. Kingston DC400 800 GB / Model: KINGSTON SEDC400S37800G / Firmware version: SAFM32T3
  4. Samsung PM863 480 GB / Model: SAMSUNG MZ7LM480HCHP-00003 / Firmware version: GXT3003Q
  5. Samsung PM863 960 GB / Model: SAMSUNG MZ7LM960HCHP-00003 / Firmware version: GXT3003Q
  6. Samsung SSD 850 PRO 1TB / Model: Samsung SSD 850 PRO 1TB / Serial number: S252NXAG809889W
  7. Crucial M500 SSD 960 GB / Device model: Crucial_CT960M500SSD1 / Serial number: 14240C511DD2
  8. Intel SSD 530 Series 240 GB / Device model: INTEL SSDSC2BW240A4 / Firmware version: DC12
  9. Intel SSD 330 Series 240 GB / Device model: INTEL SSDSC2CT240A3 / Firmware version: 300i
  10. Samsung SM863 240 GB / Device model: MZ7KM240HAGR00D3 / Firmware version: GB52

A disk (RBD image) 100 GB in size was created. It was then replicated between the tested disks (OSDs): replication was performed by adding a new OSD while the old one was marked “out”.
Testing was conducted on single-disk OSDs with a colocated journal (pool size 1). All tests were run on the local machine (localhost) to exclude the effect of network latency. Our task was to evaluate the “clean” performance of each SSD, without the influence of third-party factors.
After the creation, the disk was “heated up” by the following command:
fio --name `hostname` --blocksize=2M --ioengine=rbd --iodepth=1 --direct=1 --buffered=0 --rw=write --pool=bench --rbdname=bench

Preemptive readahead was turned off on all disks, and the queue discipline was set to noop, a simple FIFO queue. We tested the disks themselves, not the advanced algorithms or software features that further increase read and write speed; we needed to eliminate everything that could distort the test results. One more thing: before each read test, the read caches were cleared.
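These preparations boil down to a few writes to sysfs and procfs. A sketch, assuming the OSD sits on /dev/sda (the device name is an example; all commands require root):

```shell
# Disable preemptive readahead for the device.
echo 0 > /sys/block/sda/queue/read_ahead_kb
# Use the noop scheduler, a plain FIFO queue with no reordering.
echo noop > /sys/block/sda/queue/scheduler
# Flush dirty pages, then drop the page, dentry, and inode caches
# (done before each read test).
sync && echo 3 > /proc/sys/vm/drop_caches
```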
The block size in each test was the standard 4 KB: the OS operates on blocks of this size when saving files, so no additional changes or conditions were required.

We tested SSDs of different capacities by Intel, Kingston, Samsung, Crucial, EDGE, SanDisk, and Dell. The results are shown in the charts after each test.
Note that some models were tested twice, with different firmware versions; we will provide additional information about these disks in a separate article. Believe us, there is much to tell!

Test #1
fio --name `hostname` --blocksize=4k --ioengine=rbd --iodepth=1 --size 1G --direct=1 --buffered=0 --rw=randwrite --pool=bench --rbdname=bench

This test measures the average write request execution time (sustained latency). The latency parameters are recorded.
Disks are sorted from the lowest latency to the highest. Higher latency = lower performance, lower latency = higher performance.

Test #2
fio --name `hostname` --blocksize=4k --ioengine=rbd --iodepth=32 --direct=1 --buffered=0 --rw=randwrite --pool=bench --rbdname=bench

Here, the peak IOPS performance for writing is tested. The IOPS and latency parameters are recorded.
Disks are sorted from the highest IOPS to the lowest. Higher IOPS = higher performance, lower IOPS = lower performance.

Test #3
fio --name `hostname` --blocksize=4k --ioengine=rbd --iodepth=1 --size 1G --direct=1 --buffered=0 --rw=randread --pool=bench --rbdname=bench
This test measures the average read request execution time (sustained latency). The latency parameters are recorded.

Test #4
fio --name `hostname` --blocksize=4k --ioengine=rbd --iodepth=32 --direct=1 --buffered=0 --rw=randread --pool=bench --rbdname=bench
Here, the peak IOPS performance for reading is tested. The IOPS and latency parameters are recorded.

Note: in the tests with a queue depth of iodepth = 32, we achieved disk utilization above 80%, while the processor of the test machine was not fully loaded.
Unfortunately, not all disks passed the tests successfully. Some of the selected models had to be disqualified because of issues with TRIM support.
The problems were identical: when trying to create a file system with mkfs.xfs on a disk connected to a Broadcom (LSI) HBA controller via a SAS disk shelf, the entire enclosure crashed.

The names and “passport data” of disqualified disks are listed below:
  1. EDGE E3 SSD
  2. EDGE Boost Pro Plus 7mm SSD / Firmware version: N1007C
  3. Kingston DC400 480 GB / Device model: KINGSTON SEDC400S37480G / Firmware version: SAFM02.G
  4. SanDisk SDLF1DAR-960G-1J / Device model: EDGSD25960GE3XS6 / Firmware version: SAFM02.2
Another important thing: disqualified disks do not simply end up as junk. We work with their vendors on improving the firmware, reporting all the bugs revealed during our tests.