Ceph - OSD restore performance

When Ceph restores an OSD, performance may seem quite slow. This is due to the default settings, which are quite conservative for many application workloads. Especially if you're running workloads with many small objects (files), recovery with the default values may seem too slow.

Adjust OSD daemon configuration

Restore speed is mostly determined by the OSD daemon configuration. If you want to adjust it, you may try the following settings:

# set runtime values
ceph tell 'osd.*' injectargs '--osd-max-backfills 64'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 16'
ceph tell 'osd.*' injectargs '--osd-recovery-op-priority 3'
 
# set persistent values
ceph config set osd osd_max_backfills 64
ceph config set osd osd_recovery_max_active 16
ceph config set osd osd_recovery_op_priority 3
 
# disable delay between recovery operations
ceph tell 'osd.*' config set osd_recovery_sleep_hdd 0
ceph tell 'osd.*' config set osd_recovery_sleep_ssd 0
ceph tell 'osd.*' config set osd_recovery_sleep_hybrid 0
 
# set persistent values
ceph config set osd osd_recovery_sleep_hdd 0
ceph config set osd osd_recovery_sleep_ssd 0
ceph config set osd osd_recovery_sleep_hybrid 0
osd recovery adjustments

These commands adjust the runtime configuration and store the same values persistently:

  • setting max number of backfills per OSD (counted inbound & outbound independently)
  • setting max number of active recovery requests per OSD at the same time
  • disabling delays between recovery operations on hdd, ssd and hybrid configurations
  • setting the priority of recovery operations in the OSD worker queue
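
To check that the adjustments listed above actually took effect, the stored and the running values can be compared. The following is a minimal sketch: the function name show_recovery_settings is made up for this example, and osd.0 is an assumed daemon id that needs to be replaced with one from your cluster.

```shell
#!/bin/sh
# Print the stored (persistent) and the running value of each tuned option.
# show_recovery_settings is a hypothetical helper name; osd.0 is an assumed
# daemon id used as the default.
show_recovery_settings() {
    osd_id="${1:-osd.0}"
    for opt in osd_max_backfills osd_recovery_max_active osd_recovery_op_priority \
               osd_recovery_sleep_hdd osd_recovery_sleep_ssd osd_recovery_sleep_hybrid; do
        # value stored in the cluster configuration database
        stored=$(ceph config get osd "$opt")
        # value the selected daemon is currently running with
        running=$(ceph config show "$osd_id" "$opt")
        printf '%s\n  stored:  %s\n  running: %s\n' "$opt" "$stored" "$running"
    done
}
```

Call it as show_recovery_settings osd.1 to inspect a specific daemon. If the stored and running values differ, the runtime value was probably changed with ceph tell only.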

Depending on your hardware, you will see a rather large increase in the number of objects restored per second (in our test case 20-30 objects/s → 1000-2000 objects/s).

If you're adjusting these values on a production system, I recommend increasing the values in steps (using ceph tell to set runtime values) and staying on values that provide a sufficient restore speed. Going too fast might impact your actual production workload.
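
A stepwise increase could be scripted roughly like this. It is only a sketch: the function name ramp_backfills, the step values and the pause length are assumptions chosen for illustration, not recommendations.

```shell
#!/bin/sh
# Raise osd_max_backfills at runtime in steps and pause after each step so the
# effect on client I/O can be observed before going further.
# ramp_backfills is a hypothetical helper: first argument is the pause in
# seconds, the remaining arguments are the values to step through.
ramp_backfills() {
    pause="$1"
    shift
    for value in "$@"; do
        echo "setting osd_max_backfills to $value"
        ceph tell 'osd.*' injectargs "--osd-max-backfills $value"
        # show the current recovery rate, if a recovery is running
        ceph status | grep 'recovery:' || true
        sleep "$pause"
    done
}
```

For example, ramp_backfills 300 4 8 16 32 64 would wait five minutes between steps. Interrupt the script once production workloads start to suffer, and set the last value that worked well.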

Selecting proper values

You might want to get max speed when restoring data. If your workload allows it, you might increase the values further. Use the output of ceph status as input for your decisions.

  cluster:
    id:     bcdbd2fa-7037-11eb-93b2-9380cdd20e72
    health: HEALTH_WARN
            Degraded data redundancy: 385683/1549110 objects degraded (24.897%), 32 pgs degraded, 32 pgs undersized
  
  services:
    mon: 3 daemons, quorum nuv-dc-apphost1,nuv-dc-apphost2,nuv-dc-apphost3 (age 2h)
    mgr: nuv-dc-apphost1.cpsuzt(active, since 2h), standbys: nuv-dc-apphost2.esvbvr
    mds: cephfs0:1 {0=cephfs0.nuv-dc-apphost2.agwcjj=up:active} 2 up:standby
    osd: 3 osds: 3 up (since 4m), 3 in (since 4m); 32 remapped pgs
  
  data:
    pools:   3 pools, 65 pgs
    objects: 516.37k objects, 17 GiB
    usage:   89 GiB used, 167 GiB / 256 GiB avail
    pgs:     385683/1549110 objects degraded (24.897%)
             33 active+clean
             24 active+undersized+degraded+remapped+backfill_wait
             8  active+undersized+degraded+remapped+backfilling
  
  io:
    recovery: 16 MiB/s, 470 objects/s
ceph status

This shows that you have 8 pgs that are actively backfilling while 24 pgs are waiting for backfill. If, and only if, your hardware still has enough spare capacity, you might want to increase the backfill concurrency further in this case.
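
The comparison of backfilling and waiting pgs can also be extracted from the status output automatically. The sketch below assumes the plain-text format shown above, where each state is preceded by its pg count; backfill_summary is a name made up for this example.

```shell
#!/bin/sh
# Sum the pg counts per backfill state from `ceph status` text read on stdin.
# Every field containing a backfill state is expected to directly follow
# its pg count, as in the status output above.
backfill_summary() {
    awk '
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /backfill_wait/)    waiting += $(i - 1)
                else if ($i ~ /backfilling/) active  += $(i - 1)
            }
        }
        END { printf "backfilling=%d backfill_wait=%d\n", active, waiting }
    '
}
```

Used as ceph status | backfill_summary, a non-zero backfill_wait count indicates that pgs are queued behind the current backfill limit.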

  cluster:
    id:     bcdbd2fa-7037-11eb-93b2-9380cdd20e72
    health: HEALTH_WARN
            Degraded data redundancy: 129439/1549110 objects degraded (8.356%), 24 pgs degraded, 24 pgs undersized
  
  services:
    mon: 3 daemons, quorum nuv-dc-apphost1,nuv-dc-apphost2,nuv-dc-apphost3 (age 2h)
    mgr: nuv-dc-apphost1.cpsuzt(active, since 2h), standbys: nuv-dc-apphost2.esvbvr
    mds: cephfs0:1 {0=cephfs0.nuv-dc-apphost2.agwcjj=up:active} 2 up:standby
    osd: 3 osds: 3 up (since 8m), 3 in (since 8m); 24 remapped pgs
  
  data:
    pools:   3 pools, 65 pgs
    objects: 516.37k objects, 17 GiB
    usage:   113 GiB used, 143 GiB / 256 GiB avail
    pgs:     129439/1549110 objects degraded (8.356%)
             41 active+clean
             24 active+undersized+degraded+remapped+backfilling
  
  io:
    recovery: 48 MiB/s, 1.14k objects/s

This is the same environment running the same restore with backfills increased to 128: no more pgs are waiting for backfill. We're now recovering at full speed. Use this with care if you're running application workloads!
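
Once the restore has finished, you may want to go back to the conservative defaults. A minimal sketch, assuming the persistent values were stored with ceph config set osd as shown earlier; reset_recovery_settings is a hypothetical helper name.

```shell
#!/bin/sh
# Remove the persistent overrides so the OSDs fall back to their defaults.
# reset_recovery_settings is a hypothetical helper name.
reset_recovery_settings() {
    for opt in osd_max_backfills osd_recovery_max_active osd_recovery_op_priority \
               osd_recovery_sleep_hdd osd_recovery_sleep_ssd osd_recovery_sleep_hybrid; do
        ceph config rm osd "$opt"
    done
}
```

Values injected at runtime with ceph tell may stay in effect until the daemons are restarted or the values are set again, so it is worth checking the running configuration afterwards.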