Ceph OSD down: what to do

An OSD in Ceph has run out of space, what should I do?

For a test I deployed Ceph on a single machine. I installed it with ceph-deploy.

I use directories on a disk as OSDs.

I created 7 directories:

/opt/osd1 /opt/osd2 /opt/osd3 ... /opt/osd7

I brought up a RADOS Gateway, which resulted in 6 pools:

# ceph osd pool ls
.rgw.root
default.rgw.control
default.rgw.meta
default.rgw.log
default.rgw.buckets.index
default.rgw.buckets.data

For the test I set the following parameters:

osd pool default size = 1
osd pool default min size = 1
osd pool default pg num = 30
osd pool default pgp num = 30

During the test Ceph warned that one OSD was running out of space. I decided that adding a new OSD would help and that Ceph would redistribute the data by itself (I was wrong!). The cluster status now looks like this:

~# ceph -s
  cluster:
    id:     3ed5c9c-ec59-4223-9104-65f82103a45d
    health: HEALTH_ERR
            Reduced data availability: 28 pgs stale
            1 slow requests are blocked > 32 sec. Implicated osds 0
            4 stuck requests are blocked > 4096 sec. Implicated osds 1,2,5

  services:
    mon: 1 daemons, quorum Rutherford
    mgr: Ruerfr(active)
    osd: 7 osds: 6 up, 6 in
    rgw: 1 daemon active

  data:
    pools:   6 pools, 180 pgs
    objects: 227 objects, 2.93KiB
    usage:   23.0GiB used, 37.0GiB / 60GiB avail
    pgs:     152 active+clean
             28  stale+active+clean

The OSD ran out of space and went into the down state:

# ceph osd tree
ID CLASS WEIGHT  TYPE NAME           STATUS REWEIGHT PRI-AFF
-1       0.06857 root default
-3       0.06857     host Rutherford
 0   hdd 0.00980         osd.0           up  0.95001 1.00000
 1   hdd 0.00980         osd.1           up  0.90002 1.00000
 2   hdd 0.00980         osd.2           up  0.90002 1.00000
 3   hdd 0.00980         osd.3         down        0 1.00000
 4   hdd 0.00980         osd.4           up  1.00000 1.00000
 5   hdd 0.00980         osd.5           up  1.00000 1.00000
 6   hdd 0.00980         osd.6           up  1.00000 1.00000

I understand that the problem is my lack of understanding of how Ceph works, but unfortunately I could not find a solution myself, so I am asking for help. The questions I still could not answer:

  • How do I get Ceph working again now? There is free space on the disk and I can create new OSDs, but how do I make Ceph redistribute the data from one OSD to the others?
  • Why did Ceph write data to only one OSD when I created 7 of them from the start?
[global]
fsid = 1ed3ce2c-ec59-4315-9146-65182123a35d
mon_initial_members = Rut4erfor
mon_host = 8.3.5.1
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd pool default size = 1
osd pool default min size = 1
osd pool default pg num = 30
osd pool default pgp num = 30

[osd]
osd max object size = 1073741824
osd max write size = 1073741824

SergHom
17.07.20 09:10:26 MSK

Why did Ceph write data to only one OSD when I created 7 of them from the start?

ceph osd df
ceph pg ls | tail -20

Pinkbyte ★★★★★
( 17.07.20 16:38:00 MSK )
Last edited: Pinkbyte 17.07.20 16:41:12 MSK (2 edits total)

CRASH
( 17.07.20 18:01:56 MSK )

osd pool default size = 1
osd pool default min size = 1

Translation: «I want to lose my data when something bad happens to a single OSD».

You should set both parameters to at least 2.

And as it is, one of your OSDs went down, together with the only copy of the data that was stored on it.

osd pool default pg num = 30

Too few. In total (that is, counting the replicas) there should be about 100 PGs per OSD.

Bottom line: recreate the cluster.
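For reference, if someone wanted to keep the existing pools rather than recreate everything, replication can be raised per pool; a minimal sketch, using a pool name from the listing above (every pool would need the same treatment, and on a single-host cluster the CRUSH rule must also allow both replicas on one host):

# ceph osd pool set default.rgw.buckets.data size 2
# ceph osd pool set default.rgw.buckets.data min_size 2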

AEP ★★★★★
( 19.07.20 18:13:33 MSK )

As for the missing knowledge: there is also the problem of a lack of up-to-date literature, and the problem of terrible documentation (it still focuses on FileStore, which has long been obsolete now that BlueStore exists). And in general, the problem of sacred knowledge that specialists share only for megabucks.

I suggest signing up for this course, since I simply do not know of any other good sources: https://slurm.io/ceph

AEP ★★★★★
( 19.07.20 18:24:00 MSK )
In reply to: comment by Pinkbyte 17.07.20 16:38:00 MSK

First of all, thank you for responding! Here is the data you asked for:

ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE  USE     DATA    OMAP   META    AVAIL   %USE  VAR  PGS
 0   hdd 0.00980  0.80005 10GiB 9.88GiB 57.4MiB 220KiB 9.83GiB  119MiB 98.84 3.42  36
 1   hdd 0.00980  0.75006 10GiB 9.88GiB 56.6MiB 305KiB 9.83GiB  119MiB 98.84 3.42  26
 2   hdd 0.00980  0.75006 10GiB 9.89GiB 55.9MiB 371KiB 9.83GiB  117MiB 98.86 3.42  11
 3   hdd 0.00980        0    0B      0B      0B     0B      0B      0B     0    0  28
 4   hdd 0.00980  1.00000 10GiB 2.85GiB 57.7MiB     0B 2.79GiB 7.15GiB 28.47 0.99 140
 5   hdd 0.00980  1.00000 10GiB 2.48GiB 57.8MiB     0B 2.43GiB 7.52GiB 24.84 0.86 127
 6   hdd 0.00980  1.00000 10GiB 2.74GiB 57.7MiB     0B 2.68GiB 7.26GiB 27.37 0.95 136
 7   hdd 0.00980  1.00000 10GiB 1.32GiB 57.7MiB     0B 1.26GiB 8.68GiB 13.18 0.46  95
 8   hdd 0.00980  1.00000 10GiB 1.32GiB 57.7MiB     0B 1.26GiB 8.68GiB 13.20 0.46  72
 9   hdd 0.00980  1.00000 10GiB 1.32GiB 57.7MiB     0B 1.26GiB 8.68GiB 13.20 0.46  76
10   hdd 0.00980  1.00000 10GiB 1.32GiB 57.8MiB     0B 1.26GiB 8.68GiB 13.21 0.46  78
11   hdd 0.00980  1.00000 10GiB 1.32GiB 57.7MiB     0B 1.26GiB 8.68GiB 13.17 0.46  76
12   hdd 0.00980  1.00000 10GiB 1.32GiB 57.8MiB     0B 1.26GiB 8.68GiB 13.19 0.46  74
13   hdd 0.00980  1.00000 10GiB 1.32GiB 57.8MiB     0B 1.26GiB 8.68GiB 13.15 0.46  59
14   hdd 0.00980  1.00000 10GiB 1.32GiB 57.7MiB     0B 1.26GiB 8.68GiB 13.19 0.46  54
15   hdd 0.00980  1.00000 10GiB 1.32GiB 57.9MiB     0B 1.26GiB 8.68GiB 13.17 0.46  76
16   hdd 0.00980  1.00000 10GiB 1.32GiB 57.7MiB     0B 1.26GiB 8.68GiB 13.16 0.46  54
17   hdd 0.00980  1.00000 10GiB 1.31GiB 57.7MiB     0B 1.26GiB 8.69GiB 13.15 0.46  66
18   hdd 0.00980  1.00000 10GiB 1.32GiB 57.7MiB     0B 1.26GiB 8.68GiB 13.16 0.46  60
19   hdd 0.00980  1.00000 10GiB 1.31GiB 57.7MiB     0B 1.26GiB 8.69GiB 13.13 0.45  45
                    TOTAL 190GiB 54.8GiB 1.07GiB 896KiB 53.8GiB  135GiB 28.87

ceph pg ls | tail -20

ceph pg ls | tail -20
7.c  0 0 0 0 0 0 0 0 0 0 active+undersized       2020-07-14 12:08:40.728620 0'0 460:343 [6]  6  [6]     6  0'0 2020-07-14 12:06:57.684106 0'0 2020-07-14 12:06:57.684106
7.d  0 0 0 0 0 0 0 0 0 0 active+clean            2020-07-19 08:38:53.212660 0'0 461:176 [16] 16 [16,15] 16 0'0 2020-07-19 08:38:53.212586 0'0 2020-07-14 12:06:57.684106
7.e  0 0 0 0 0 0 0 0 0 0 stale+active+undersized 2020-07-14 12:08:40.440023 0'0 445:327 [2]  2  [2]     2  0'0 2020-07-14 12:06:57.684106 0'0 2020-07-14 12:06:57.684106
7.f  0 0 0 0 0 0 0 0 0 0 active+clean            2020-07-19 10:42:42.538184 0'0 461:176 [16] 16 [16,4]  16 0'0 2020-07-19 10:42:42.538089 0'0 2020-07-14 12:06:57.684106
7.10 0 0 0 0 0 0 0 0 0 0 active+clean            2020-07-19 08:22:00.818971 0'0 461:221 [13] 13 [13,10] 13 0'0 2020-07-19 08:22:00.818882 0'0 2020-07-19 08:22:00.818882
7.11 0 0 0 0 0 0 0 0 0 0 active+clean            2020-07-19 01:55:41.984335 0'0 461:192 [15] 15 [15,7]  15 0'0 2020-07-19 01:55:41.984294 0'0 2020-07-16 15:16:36.600379
7.12 0 0 0 0 0 0 0 0 0 0 active+clean            2020-07-19 06:16:35.490832 0'0 461:260 [10] 10 [10,8]  10 0'0 2020-07-19 06:16:35.490770 0'0 2020-07-18 06:04:58.129850
7.13 0 0 0 0 0 0 0 0 0 0 active+undersized       2020-07-18 22:49:32.916717 0'0 460:211 [14] 14 [14]    14 0'0 2020-07-18 17:47:52.840600 0'0 2020-07-14 12:06:57.684106
7.14 0 0 0 0 0 0 0 0 0 0 active+undersized       2020-07-14 12:08:38.610730 0'0 460:342 [5]  5  [5]     5  0'0 2020-07-14 12:06:57.684106 0'0 2020-07-14 12:06:57.684106
7.15 0 0 0 0 0 0 0 0 0 0 active+undersized       2020-07-15 14:29:42.848823 0'0 460:232 [12] 12 [12]    12 0'0 2020-07-14 12:06:57.684106 0'0 2020-07-14 12:06:57.684106
7.16 0 0 0 0 0 0 0 0 0 0 active+clean            2020-07-19 15:41:18.933142 0'0 461:283 [7]  7  [7,5]   7  0'0 2020-07-19 15:41:18.933077 0'0 2020-07-14 12:06:57.684106
7.17 0 0 0 0 0 0 0 0 0 0 active+clean            2020-07-20 02:46:08.936766 0'0 461:271 [9]  9  [9,5]   9  0'0 2020-07-20 02:46:08.936673 0'0 2020-07-14 12:06:57.684106
7.18 0 0 0 0 0 0 0 0 0 0 active+undersized       2020-07-14 12:08:40.097447 0'0 460:339 [4]  4  [4]     4  0'0 2020-07-14 12:06:57.684106 0'0 2020-07-14 12:06:57.684106
7.19 0 0 0 0 0 0 0 0 0 0 active+clean+remapped   2020-07-19 17:05:12.163909 0'0 461:287 [8]  8  [8,6]   8  0'0 2020-07-19 17:05:12.163859 0'0 2020-07-14 12:06:57.684106
7.1a 0 0 0 0 0 0 0 0 0 0 active+clean            2020-07-19 07:27:46.065812 0'0 461:137 [19] 19 [19,11] 19 0'0 2020-07-19 07:27:46.065720 0'0 2020-07-19 07:27:46.065720
7.1b 0 0 0 0 0 0 0 0 0 0 stale+active+undersized 2020-07-14 12:08:38.159892 0'0 457:337 [0]  0  [0]     0  0'0 2020-07-14 12:06:57.684106 0'0 2020-07-14 12:06:57.684106
7.1c 0 0 0 0 0 0 0 0 0 0 active+clean            2020-07-20 05:52:46.899173 0'0 461:193 [15] 15 [15,9]  15 0'0 2020-07-20 05:52:46.899083 0'0 2020-07-14 12:06:57.684106
7.1d 0 0 0 0 0 0 0 0 0 0 active+undersized       2020-07-18 22:49:31.955385 0'0 460:289 [8]  8  [8]     8  0'0 2020-07-17 23:30:38.960506 0'0 2020-07-14 12:06:57.684106
* NOTE: Omap statistics are gathered during deep scrub and may be inaccurate soon afterwards depending on utilisation. See http://docs.ceph.com/docs/master/dev/placement-group/#omap-statistics for further details.

PS Sorry, the last log got misaligned because of the large number of columns.

Chapter 5. Troubleshooting OSDs

This chapter contains information on how to fix the most common errors related to Ceph OSDs.

Before You Start

  • Verify your network connection. See Chapter 3, Troubleshooting Networking Issues for details.
  • Verify that Monitors have a quorum by using the ceph health command. If the command returns a health status ( HEALTH_OK , HEALTH_WARN , or HEALTH_ERR ), the Monitors are able to form a quorum. If not, address any Monitor problems first. See Troubleshooting Monitors for details. For details about ceph health, see Understanding Ceph Health.
  • Optionally, stop the rebalancing process to save time and resources. See Section 5.2, “Stopping and Starting Rebalancing” for details.

5.1. The Most Common Error Messages Related to OSDs

The following tables list the most common error messages that are returned by the ceph health detail command, or included in the Ceph logs. The tables provide links to corresponding sections that explain the errors and point to specific procedures to fix the problems.

Table 5.1. Error Messages Related to OSDs

  • requests are blocked

Table 5.2. Common Error Messages in Ceph Logs Related to OSDs

  • heartbeat_check: no reply from osd.X (main cluster log)
  • wrongly marked me down (main cluster log)
  • osds have slow requests (main cluster log)
  • FAILED assert(0 == "hit suicide timeout")

5.1.1. Full OSDs

The ceph health detail command returns an error message similar to the following one:

HEALTH_ERR 1 full osds
osd.3 is full at 95%
What This Means

Ceph prevents clients from performing I/O operations on full OSD nodes to avoid losing data. It returns the HEALTH_ERR full osds message when the cluster reaches the capacity set by the mon_osd_full_ratio parameter. By default, this parameter is set to 0.95 which means 95% of the cluster capacity.
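A quick way to confirm the ratios configured on a running cluster is to look at the OSD map; a minimal sketch using standard commands (the values printed are typically the defaults unless they have been changed):

# ceph osd dump | grep -i ratio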

To Troubleshoot This Problem

Determine how many percent of raw storage ( %RAW USED ) is used:

# ceph df

If %RAW USED is above 70-75%, you can:

  • Delete unnecessary data. This is a short-term solution to avoid production downtime. See Section 5.6, “Deleting Data from a Full Cluster” for details.
  • Scale the cluster by adding a new OSD node. This is a long-term solution recommended by Red Hat. For details, see the Adding and Removing OSD Nodes chapter in the Administration Guide for Red Hat Ceph Storage 3.
See Also

5.1.2. Nearfull OSDs

The ceph health detail command returns an error message similar to the following one:

HEALTH_WARN 1 nearfull osds
osd.2 is near full at 85%
What This Means

Ceph returns the nearfull osds message when the cluster reaches the capacity set by the mon_osd_nearfull_ratio parameter. By default, this parameter is set to 0.85 which means 85% of the cluster capacity.

Ceph distributes data based on the CRUSH hierarchy in the best possible way but it cannot guarantee equal distribution. The main causes of the uneven data distribution and the nearfull osds messages are:

  • The OSDs are not balanced among the OSD nodes in the cluster. That is, some OSD nodes host significantly more OSDs than others, or the weight of some OSDs in the CRUSH map is not adequate to their capacity.
  • The Placement Group (PG) count is not proper as per the number of the OSDs, use case, target PGs per OSD, and OSD utilization.
  • The cluster uses inappropriate CRUSH tunables.
  • The back-end storage for OSDs is almost full.
To Troubleshoot This Problem:
  1. Verify that the PG count is sufficient and increase it if needed. See Section 7.5, “Increasing the PG Count” for details.
  2. Verify that you use CRUSH tunables optimal to the cluster version and adjust them if not. For details, see the CRUSH Tunables section in the Storage Strategies guide for Red Hat Ceph Storage 3 and the How can I test the impact CRUSH map tunable modifications will have on my PG distribution across OSDs in Red Hat Ceph Storage? solution on the Red Hat Customer Portal.
  3. Change the weight of OSDs by utilization. See the Set an OSD’s Weight by Utilization section in the Storage Strategies guide for Red Hat Ceph Storage 3.
  4. Determine how much space is left on the disks used by OSDs.
  1. To view how much space OSDs use in general:
# ceph osd df
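Following on from step 3 above, one common way to even out utilization is the reweight-by-utilization machinery; a hedged sketch (the threshold of 120 percent is only an example value):

# ceph osd test-reweight-by-utilization 120
# ceph osd reweight-by-utilization 120

The first command is a dry run that reports which OSDs would be reweighted; the second applies the change.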
See Also

5.1.3. One or More OSDs Are Down

The ceph health command returns an error similar to the following one:

HEALTH_WARN 1/3 in osds are down
What This Means

One of the ceph-osd processes is unavailable due to a possible service failure or problems with communication with other OSDs. As a consequence, the surviving ceph-osd daemons reported this failure to the Monitors.

If the ceph-osd daemon is not running, the underlying OSD drive or file system is either corrupted, or some other error, such as a missing keyring, is preventing the daemon from starting.

In most cases, networking issues cause the situation when the ceph-osd daemon is running but still marked as down .

To Troubleshoot This Problem
  1. Determine which OSD is down :
# ceph health detail
HEALTH_WARN 1/3 in osds are down
osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080

Try restarting the ceph-osd daemon:

systemctl restart ceph-osd@OSD_ID
Replace OSD_ID with the ID of the OSD that is down , for example:

# systemctl restart ceph-osd@0
  1. If you are not able to start ceph-osd , follow the steps in The ceph-osd daemon cannot start.
  2. If you are able to start the ceph-osd daemon but it is marked as down , follow the steps in The ceph-osd daemon is running but still marked as down .
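Before deciding which of the two cases below applies, it can help to look at the daemon state and its recent log; a minimal sketch, with osd.0 used purely as an illustration:

# systemctl status ceph-osd@0
# journalctl -u ceph-osd@0 --since "1 hour ago"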
The ceph-osd daemon cannot start
  1. If you have a node containing a number of OSDs (generally, more that twelve), verify that the default maximum number of threads (PID count) is sufficient. See Section 5.5, “Increasing the PID count” for details.
  2. Verify that the OSD data and journal partitions are mounted properly:
# ceph-disk list
...
/dev/vdb :
 /dev/vdb1 ceph data, prepared
 /dev/vdb2 ceph journal
/dev/vdc :
 /dev/vdc1 ceph data, active, cluster ceph, osd.1, journal /dev/vdc2
 /dev/vdc2 ceph journal, for /dev/vdc1
/dev/sdd1 :
 /dev/sdd1 ceph data, unprepared
 /dev/sdd2 ceph journal

If this error message is returned during boot time of the OSD host, open a support ticket as this might indicate a known issue tracked in the Red Hat Bugzilla 1439210. See Chapter 9, Contacting Red Hat Support Service for details.

  3. An EIO error message similar to the following one indicates a failure of the underlying disk:
FAILED assert(!m_filestore_fail_eio || r != -5)
  4. If the log includes any other FAILED assert errors, such as the following one, open a support ticket. See Chapter 9, Contacting Red Hat Support Service for details.
FAILED assert(0 == "hit suicide timeout")
  5. Check the dmesg output for errors with the underlying disk or file system:
$ dmesg
  1. The xfs_log_force: error -5 message similar to the following one indicates corruption of the underlying XFS file system. For details on how to fix this problem, see the What is the meaning of "xfs_log_force: error -5 returned"? solution on the Red Hat Customer Portal.
xfs_log_force: error -5 returned
  6. If the log includes a segmentation fault error similar to the following one, open a support ticket. See Chapter 9, Contacting Red Hat Support Service for details.
Caught signal (Segmentation fault)
The ceph-osd is running but still marked as down
  1. Check the corresponding log file to determine the cause of the failure. By default, Ceph stores log files in the /var/log/ceph/ directory.
  1. If the log includes error messages similar to the following ones, see Section 5.1.4, “Flapping OSDs”.
wrongly marked me down
heartbeat_check: no reply from osd.2 since back
See Also
  • Section 5.1.4, “Flapping OSDs”
  • Section 7.1.1, “Stale Placement Groups”
  • The Starting, Stopping, Restarting a Daemon by Instances section in the Administration Guide for Red Hat Ceph Storage 3

5.1.4. Flapping OSDs

The ceph -w | grep osds command shows OSDs repeatedly as down and then up again within a short period of time:

# ceph -w | grep osds
2017-04-05 06:27:20.810535 mon.0 [INF] osdmap e609: 9 osds: 8 up, 9 in
2017-04-05 06:27:24.120611 mon.0 [INF] osdmap e611: 9 osds: 7 up, 9 in
2017-04-05 06:27:25.975622 mon.0 [INF] HEALTH_WARN; 118 pgs stale; 2/9 in osds are down
2017-04-05 06:27:27.489790 mon.0 [INF] osdmap e614: 9 osds: 6 up, 9 in
2017-04-05 06:27:36.540000 mon.0 [INF] osdmap e616: 9 osds: 7 up, 9 in
2017-04-05 06:27:39.681913 mon.0 [INF] osdmap e618: 9 osds: 8 up, 9 in
2017-04-05 06:27:43.269401 mon.0 [INF] osdmap e620: 9 osds: 9 up, 9 in
2017-04-05 06:27:54.884426 mon.0 [INF] osdmap e622: 9 osds: 8 up, 9 in
2017-04-05 06:27:57.398706 mon.0 [INF] osdmap e624: 9 osds: 7 up, 9 in
2017-04-05 06:27:59.669841 mon.0 [INF] osdmap e625: 9 osds: 6 up, 9 in
2017-04-05 06:28:07.043677 mon.0 [INF] osdmap e628: 9 osds: 7 up, 9 in
2017-04-05 06:28:10.512331 mon.0 [INF] osdmap e630: 9 osds: 8 up, 9 in
2017-04-05 06:28:12.670923 mon.0 [INF] osdmap e631: 9 osds: 9 up, 9 in

In addition, the Ceph log contains error messages similar to the following ones:

2016-07-25 03:44:06.510583 osd.50 127.0.0.1:6801/149046 18992 : cluster [WRN] map e600547 wrongly marked me down
2016-07-25 19:00:08.906864 7fa2a0033700 -1 osd.254 609110 heartbeat_check: no reply from osd.2 since back 2016-07-25 19:00:07.444113 front 2016-07-25 18:59:48.311935 (cutoff 2016-07-25 18:59:48.906862)
What This Means

The main causes of flapping OSDs are:

  • Certain cluster operations, such as scrubbing or recovery, take an abnormal amount of time. For example, if you perform these operations on objects with a large index or large placement groups. Usually, after these operations finish, the flapping OSDs problem is solved.
  • Problems with the underlying physical hardware. In this case, the ceph health detail command also returns the slow requests error message. For details, see Section 5.1.5, “Slow Requests, and Requests are Blocked”.
  • Problems with network.

OSDs cannot handle well the situation when the cluster (back-end) network fails or develops significant latency while the public (front-end) network operates optimally.

OSDs use the cluster network for sending heartbeat packets to each other to indicate that they are up and in . If the cluster network does not work properly, OSDs are unable to send and receive the heartbeat packets. As a consequence, they report each other as being down to the Monitors, while marking themselves as up .

The following parameters in the Ceph configuration file influence this behavior:

How long OSDs wait for the heartbeat packets to return before reporting an OSD as down to the Monitors.

How many OSDs must report another OSD as down before the Monitors mark the OSD as down

This table shows that in the default configuration, the Ceph Monitors mark an OSD as down if only one OSD made three distinct reports about the first OSD being down . In some cases, if one single host encounters network issues, the entire cluster can experience flapping OSDs. This is because the OSDs that reside on the host will report other OSDs in the cluster as down .
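The table with the option names did not survive here; the settings usually involved are osd_heartbeat_grace and mon_osd_min_down_reporters. A hedged way to inspect their current values is the admin socket (osd.0 is just an example daemon, and the command must be run on the node hosting it):

# ceph daemon osd.0 config show | grep -E 'osd_heartbeat_grace|mon_osd_min_down_reporters'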

The flapping OSDs scenario does not include the situation when the OSD processes are started and then immediately killed.

To Troubleshoot This Problem
  1. Check the output of the ceph health detail command again. If it includes the slow requests error message, see Section 5.1.5, “Slow Requests, and Requests are Blocked” for details on how to troubleshoot this issue.
# ceph health detail
HEALTH_WARN 30 requests are blocked > 32 sec; 3 osds have slow requests
30 ops are blocked > 268435 sec
1 ops are blocked > 268435 sec on osd.11
1 ops are blocked > 268435 sec on osd.18
28 ops are blocked > 268435 sec on osd.39
3 osds have slow requests
# ceph osd tree | grep down
# ceph osd set noup
# ceph osd set nodown

Using the noup and nodown flags does not fix the root cause of the problem but only prevents OSDs from flapping. Open a support ticket, if you are unable to fix and troubleshoot the error by yourself. See Chapter 9, Contacting Red Hat Support Service for details.
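If the flags are set during troubleshooting, they would normally be cleared again once the underlying problem is fixed, for example:

# ceph osd unset noup
# ceph osd unset nodown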

Additional Resources
  • The Verifying the Network Configuration for Red Hat Ceph Storage section in the Red Hat Ceph Storage 3 Installation Guide for Red Hat Enterprise Linux or Installation Guide for Ubuntu
  • The Heartbeating section in the Architecture Guide for Red Hat Ceph Storage 3

5.1.5. Slow Requests, and Requests are Blocked

The ceph-osd daemon is slow to respond to a request and the ceph health detail command returns an error message similar to the following one:

HEALTH_WARN 30 requests are blocked > 32 sec; 3 osds have slow requests
30 ops are blocked > 268435 sec
1 ops are blocked > 268435 sec on osd.11
1 ops are blocked > 268435 sec on osd.18
28 ops are blocked > 268435 sec on osd.39
3 osds have slow requests

In addition, the Ceph logs include an error message similar to the following ones:

2015-08-24 13:18:10.024659 osd.1 127.0.0.1:6812/3032 9 : cluster [WRN] 6 slow requests, 6 included below; oldest blocked for > 61.758455 secs
2016-07-25 03:44:06.510583 osd.50 [WRN] slow request 30.005692 seconds old, received at : osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]
What This Means

An OSD with slow requests is every OSD that is not able to service the I/O operations per second (IOPS) in the queue within the time defined by the osd_op_complaint_time parameter. By default, this parameter is set to 30 seconds.

The main causes of OSDs having slow requests are:

  • Problems with the underlying hardware, such as disk drives, hosts, racks, or network switches
  • Problems with network. These problems are usually connected with flapping OSDs. See Section 5.1.4, “Flapping OSDs” for details.
  • System load

The following table shows the types of slow requests. Use the dump_historic_ops administration socket command to determine the type of a slow request. For details about the administration socket, see the Using the Administration Socket section in the Administration Guide for Red Hat Ceph Storage 3.

waiting for rw locks

The OSD is waiting to acquire a lock on a placement group for the operation.

waiting for subops

The OSD is waiting for replica OSDs to apply the operation to the journal.

no flag points reached

The OSD did not reach any major operation milestone.

waiting for degraded object

The OSDs have not replicated an object the specified number of times yet.
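As an illustration of the dump_historic_ops command mentioned above, the recent slow operations of one implicated OSD can be dumped through its admin socket; osd.39 is taken from the sample output above, and the command has to run on the node that hosts that OSD:

# ceph daemon osd.39 dump_historic_ops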

To Troubleshoot This Problem
  1. Determine if the OSDs with slow or blocked requests share a common piece of hardware, for example a disk drive, host, rack, or network switch.
  2. If the OSDs share a disk:
  1. Use the smartmontools utility to check the health of the disk or the logs to determine any errors on the disk.

The smartmontools utility is included in the smartmontools package.

  2. Use the iostat utility to get the I/O statistics for the disk.

The iostat utility is included in the sysstat package.

  1. Check the RAM and CPU utilization.
  2. Use the netstat utility to see the network statistics on the Network Interface Controllers (NICs) and troubleshoot any networking issues. See also Chapter 3, Troubleshooting Networking Issues for further information.
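A few illustrative commands for the checks above; the device name /dev/sdd is an assumption and should be replaced with the actual OSD disk:

# smartctl -a /dev/sdd     # disk health and error counters (smartmontools package)
# iostat -x 1 5            # extended per-device I/O statistics (sysstat package)
# netstat -i               # per-NIC packet and error counters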
See Also
  • The Using the Administration Socket section in the Administration Guide for Red Hat Ceph Storage 3

5.2. Stopping and Starting Rebalancing

When an OSD fails or you stop it, the CRUSH algorithm automatically starts the rebalancing process to redistribute data across the remaining OSDs.

Rebalancing can take time and resources, therefore, consider stopping rebalancing during troubleshooting or maintaining OSDs. To do so, set the noout flag before stopping the OSD:

# ceph osd set noout

When you finish troubleshooting or maintenance, unset the noout flag to start rebalancing:

# ceph osd unset noout

Placement groups within the stopped OSDs become degraded during troubleshooting and maintenance.

See Also
  • The Rebalancing and Recovery section in the Architecture Guide for Red Hat Ceph Storage 3

5.3. Mounting the OSD Data Partition

If the OSD data partition is not mounted correctly, the ceph-osd daemon cannot start. If you discover that the partition is not mounted as expected, follow the steps in this section to mount it.

Procedure: Mounting the OSD Data Partition
  1. Mount the partition:

# mount -o noatime PARTITION /var/lib/ceph/osd/CLUSTER_NAME-OSD_ID

Replace PARTITION with the path to the partition on the OSD drive dedicated to OSD data. Specify the cluster name and the OSD number, for example:

# mount -o noatime /dev/sdd1 /var/lib/ceph/osd/ceph-0

  2. Start the ceph-osd daemon:

# systemctl start ceph-osd@OSD_ID
Replace OSD_ID with the ID of the OSD, for example:

# systemctl start ceph-osd@0
See Also

5.4. Replacing an OSD Drive

Ceph is designed for fault tolerance, which means that it can operate in a degraded state without losing data. Consequently, Ceph can operate even if a data storage drive fails. In the context of a failed drive, the degraded state means that the extra copies of the data stored on other OSDs will backfill automatically to other OSDs in the cluster. However, if this occurs, replace the failed OSD drive and recreate the OSD manually.

When a drive fails, Ceph reports the OSD as down :

HEALTH_WARN 1/3 in osds are down osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080

Ceph can mark an OSD as down also as a consequence of networking or permissions problems. See Section 5.1.3, “One or More OSDs Are Down” for details.

Modern servers typically deploy with hot-swappable drives so you can pull a failed drive and replace it with a new one without bringing down the node. The whole procedure includes these steps:

  1. Remove the OSD from the Ceph cluster. For details, see the Removing an OSD from the Ceph Cluster procedure.
  2. Replace the drive. For details see, the Replacing the Physical Drive section.
  3. Add the OSD to the cluster. For details, see the Adding an OSD to the Ceph Cluster procedure.
Before You Start
  1. Determine which OSD is down :
# ceph osd tree | grep -i down
ID WEIGHT  TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
 0 0.00999      osd.0    down 1.00000          1.00000

  2. Verify that the OSD process is stopped:

# systemctl status ceph-osd@OSD_ID
Replace OSD_ID with the ID of the OSD marked as down , for example:

# systemctl status ceph-osd@0
...
Active: inactive (dead)
Procedure: Removing an OSD from the Ceph Cluster
  1. Mark the OSD as out :

# ceph osd out osd.OSD_ID
Replace OSD_ID with the ID of the OSD that is marked as down , for example:

# ceph osd out osd.0
marked out osd.0.

If the OSD is down , Ceph marks it as out automatically after 600 seconds when it does not receive any heartbeat packet from the OSD. When this happens, other OSDs with copies of the failed OSD data begin backfilling to ensure that the required number of copies exists within the cluster. While the cluster is backfilling, the cluster will be in a degraded state.
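The 600 seconds correspond to the mon_osd_down_out_interval option; one hedged way to confirm the configured value on this release is the Monitor admin socket (MON_NAME is a placeholder for the Monitor ID):

# ceph daemon mon.MON_NAME config get mon_osd_down_out_interval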

# ceph -w | grep backfill
2017-06-02 04:48:03.403872 mon.0 [INF] pgmap v10293282: 431 pgs: 1 active+undersized+degraded+remapped+backfilling, 28 active+undersized+degraded, 49 active+undersized+degraded+remapped+wait_backfill, 59 stale+active+clean, 294 active+clean; 72347 MB data, 101302 MB used, 1624 GB / 1722 GB avail; 227 kB/s rd, 1358 B/s wr, 12 op/s; 10626/35917 objects degraded (29.585%); 6757/35917 objects misplaced (18.813%); 63500 kB/s, 15 objects/s recovering
2017-06-02 04:48:04.414397 mon.0 [INF] pgmap v10293283: 431 pgs: 2 active+undersized+degraded+remapped+backfilling, 75 active+undersized+degraded+remapped+wait_backfill, 59 stale+active+clean, 295 active+clean; 72347 MB data, 101398 MB used, 1623 GB / 1722 GB avail; 969 kB/s rd, 6778 B/s wr, 32 op/s; 10626/35917 objects degraded (29.585%); 10580/35917 objects misplaced (29.457%); 125 MB/s, 31 objects/s recovering
2017-06-02 04:48:00.380063 osd.1 [INF] 0.6f starting backfill to osd.0 from (0'0,0'0] MAX to 2521'166639
2017-06-02 04:48:00.380139 osd.1 [INF] 0.48 starting backfill to osd.0 from (0'0,0'0] MAX to 2513'43079
2017-06-02 04:48:00.380260 osd.1 [INF] 0.d starting backfill to osd.0 from (0'0,0'0] MAX to 2513'136847
2017-06-02 04:48:00.380849 osd.1 [INF] 0.71 starting backfill to osd.0 from (0'0,0'0] MAX to 2331'28496
2017-06-02 04:48:00.381027 osd.1 [INF] 0.51 starting backfill to osd.0 from (0'0,0'0] MAX to 2513'87544

Remove the OSD from the CRUSH map:

# ceph osd crush remove osd.OSD_ID
Replace OSD_ID with the ID of the OSD that is marked as down , for example:

# ceph osd crush remove osd.0
removed item id 0 name 'osd.0' from crush map

Remove the OSD authentication key:

# ceph auth del osd.OSD_ID
Replace OSD_ID with the ID of the OSD that is marked as down , for example:

# ceph auth del osd.0
updated

Remove the OSD:

# ceph osd rm osd.OSD_ID
Replace OSD_ID with the ID of the OSD that is marked as down , for example:

# ceph osd rm osd.0
removed osd.0

If you have removed the OSD successfully, it is not present in the output of the following command:

# ceph osd tree

Unmount the failed drive:

# umount /var/lib/ceph/osd/CLUSTER_NAME-OSD_ID
Specify the name of the cluster and the ID of the OSD, for example:

# umount /var/lib/ceph/osd/ceph-0/

If you have unmounted the drive successfully, it is not present in the output of the following command:

# df -h
Procedure: Replacing the Physical Drive
  1. See the documentation for the hardware node for details on replacing the physical drive.
  1. If the drive is hot-swappable, replace the failed drive with a new one.
  2. If the drive is not hot-swappable and the node contains multiple OSDs, you might have to shut down the whole node and replace the physical drive. Consider preventing the cluster from backfilling. See Section 5.2, “Stopping and Starting Rebalancing” for details.
Procedure: Adding an OSD to the Ceph Cluster
  1. Add the OSD again.
  1. If you used Ansible to deploy the cluster, run the ceph-ansible playbook again from the Ceph administration server:
# ansible-playbook /usr/share/ceph-ansible/site.yml
# ceph osd tree

ceph osd crush move BUCKET_NAME root=ROOT_NAME
For example, to move the bucket located at ssd:row1 to the root bucket ssd:root:

# ceph osd crush move ssd:row1 root=ssd:root
See Also
  • Section 5.1.3, “One or More OSDs Are Down”
  • The Managing the Cluster Size chapter in the Administration Guide for Red Hat Ceph Storage 3
  • The Red Hat Ceph Storage 3 Installation Guide for Red Hat Enterprise Linux or the Installation Guide for Ubuntu

5.5. Increasing the PID count

If you have a node containing more than 12 Ceph OSDs, the default maximum number of threads (PID count) can be insufficient, especially during recovery. As a consequence, some ceph-osd daemons can terminate and fail to start again. If this happens, increase the maximum possible number of threads allowed.

To temporarily increase the number:

# sysctl -w kernel.pid_max=4194303

To permanently increase the number, update the /etc/sysctl.conf file as follows:

kernel.pid_max = 4194303
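After editing /etc/sysctl.conf, the setting is typically loaded without a reboot as follows:

# sysctl -p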

5.6. Deleting Data from a Full Cluster

Ceph automatically prevents any I/O operations on OSDs that reached the capacity specified by the mon_osd_full_ratio parameter and returns the full osds error message.

This procedure shows how to delete unnecessary data to fix this error.

The mon_osd_full_ratio parameter sets the value of the full_ratio parameter when creating a cluster. You cannot change the value of mon_osd_full_ratio afterwards. To temporarily increase the full_ratio value, use the ceph osd set-full-ratio command instead.

Procedure: Deleting Data from a Full Cluster
  1. Determine the current value of full_ratio , by default it is set to 0.95 :
# ceph osd dump | grep -i full
full_ratio 0.95
  2. Temporarily increase the value of full_ratio :
# ceph osd set-full-ratio 0.97

Red Hat strongly recommends to not set the set-full-ratio to a value higher than 0.97. Setting this parameter to a higher value makes the recovery process harder. As a consequence, you might not be able to recover full OSDs at all.

  3. Verify the new value:
# ceph osd dump | grep -i full
full_ratio 0.97
  4. Monitor the cluster state and delete unnecessary data; once the cluster is no longer full, set full_ratio back to its default value:
# ceph -w
# ceph osd set-full-ratio 0.95
  5. Verify the value again:
# ceph osd dump | grep -i full
full_ratio 0.95
See Also
  • Section 5.1.1, “Full OSDs”
  • Section 5.1.2, “Nearfull OSDs”

Chapter 5. Troubleshooting Ceph OSDs

This chapter contains information on how to fix the most common errors related to Ceph OSDs.

5.1. Prerequisites

  • Verify your network connection. See Troubleshooting networking issues for details.
  • Verify that Monitors have a quorum by using the ceph health command. If the command returns a health status ( HEALTH_OK , HEALTH_WARN , or HEALTH_ERR ), the Monitors are able to form a quorum. If not, address any Monitor problems first. See Troubleshooting Ceph Monitors for details. For details about ceph health see Understanding Ceph health.
  • Optionally, stop the rebalancing process to save time and resources. See Stopping and starting rebalancing for details.

5.2. Most common Ceph OSD errors

The following tables list the most common error messages that are returned by the ceph health detail command, or included in the Ceph logs. The tables provide links to corresponding sections that explain the errors and point to specific procedures to fix the problems.

5.2.1. Prerequisites

  • Root-level access to the Ceph OSD nodes.

5.2.2. Ceph OSD error messages

A table of common Ceph OSD error messages, and a potential fix.

  • requests are blocked

5.2.3. Common Ceph OSD error messages in the Ceph logs

A table of common Ceph OSD error messages found in the Ceph logs, and a link to a potential fix.

  • heartbeat_check: no reply from osd.X (main cluster log)
  • wrongly marked me down (main cluster log)
  • osds have slow requests (main cluster log)
  • FAILED assert(0 == "hit suicide timeout")

5.2.4. Full OSDs

The ceph health detail command returns an error message similar to the following one:

HEALTH_ERR 1 full osds
osd.3 is full at 95%

What This Means

Ceph prevents clients from performing I/O operations on full OSD nodes to avoid losing data. It returns the HEALTH_ERR full osds message when the cluster reaches the capacity set by the mon_osd_full_ratio parameter. By default, this parameter is set to 0.95 which means 95% of the cluster capacity.

To Troubleshoot This Problem

Determine how many percent of raw storage ( %RAW USED ) is used:

ceph df

If %RAW USED is above 70-75%, you can:

  • Delete unnecessary data. This is a short-term solution to avoid production downtime.
  • Scale the cluster by adding a new OSD node. This is a long-term solution recommended by Red Hat.

Additional Resources

  • Nearfull OSDs in the Red Hat Ceph Storage Troubleshooting Guide .
  • See Deleting data from a full storage cluster for details.

5.2.5. Backfillfull OSDs

The ceph health detail command returns an error message similar to the following one:

health: HEALTH_WARN 3 backfillfull osd(s)
        Low space hindering backfill (add storage if this doesn't resolve itself): 32 pgs backfill_toofull

What this means

When one or more OSDs have exceeded the backfillfull threshold, Ceph prevents data from rebalancing to this device. This is an early warning that rebalancing might not complete and that the cluster is approaching full. The default for the backfillfull threshold is 90%.

To troubleshoot this problem

Check utilization by pool:

ceph df

If %RAW USED is above 70-75%, you can carry out one of the following actions:

  • Delete unnecessary data. This is a short-term solution to avoid production downtime.
  • Scale the cluster by adding a new OSD node. This is a long-term solution recommended by Red Hat.
  • Increase the backfillfull ratio for the OSDs that contain the PGs stuck in backfill_toofull to allow the recovery process to continue. Add new storage to the cluster as soon as possible or remove data to prevent filling more OSDs.

Syntax

ceph osd set-backfillfull-ratio VALUE

The range for VALUE is 0.0 to 1.0.
Example

[ceph: root@host01/]# ceph osd set-backfillfull-ratio 0.92
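To check the currently configured thresholds before and after such a change, the OSD map can be inspected; a minimal sketch (the output includes the backfillfull, nearfull, and full ratios):

[ceph: root@host01 /]# ceph osd dump | grep -i ratio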

Additional Resources

  • Nearfull OSDS in the Red Hat Ceph Storage Troubleshooting Guide .
  • See Deleting data from a full storage cluster for details.

5.2.6. Nearfull OSDs

The ceph health detail command returns an error message similar to the following one:

HEALTH_WARN 1 nearfull osds
osd.2 is near full at 85%

What This Means

Ceph returns the nearfull osds message when the cluster reaches the capacity set by the mon_osd_nearfull_ratio parameter. By default, this parameter is set to 0.85 which means 85% of the cluster capacity.

Ceph distributes data based on the CRUSH hierarchy in the best possible way but it cannot guarantee equal distribution. The main causes of the uneven data distribution and the nearfull osds messages are:

  • The OSDs are not balanced among the OSD nodes in the cluster. That is, some OSD nodes host significantly more OSDs than others, or the weight of some OSDs in the CRUSH map is not adequate to their capacity.
  • The Placement Group (PG) count is not proper as per the number of the OSDs, use case, target PGs per OSD, and OSD utilization.
  • The cluster uses inappropriate CRUSH tunables.
  • The back-end storage for OSDs is almost full.

To Troubleshoot This Problem:

  1. Verify that the PG count is sufficient and increase it if needed.
  2. Verify that you use CRUSH tunables optimal to the cluster version and adjust them if not.
  3. Change the weight of OSDs by utilization.
  4. Determine how much space is left on the disks used by OSDs.

  1. To view how much space OSDs use in general:
[ceph: root@host01 /]# ceph osd df

Additional Resources

  • Full OSDs
  • See the Set an OSD’s Weight by Utilization section in the Storage Strategies guide for Red Hat Ceph Storage 5.
  • For details, see the CRUSH Tunables section in the Storage Strategies guide for Red Hat Ceph Storage 5 and the How can I test the impact CRUSH map tunable modifications will have on my PG distribution across OSDs in Red Hat Ceph Storage? solution on the Red Hat Customer Portal.
  • See Increasing the placement group for details.

5.2.7. Down OSDs

The ceph health detail command returns an error similar to the following one:

HEALTH_WARN 1/3 in osds are down

What This Means

One of the ceph-osd processes is unavailable due to a possible service failure or problems with communication with other OSDs. As a consequence, the surviving ceph-osd daemons reported this failure to the Monitors.

If the ceph-osd daemon is not running, the underlying OSD drive or file system is either corrupted, or some other error, such as a missing keyring, is preventing the daemon from starting.

In most cases, networking issues cause the situation when the ceph-osd daemon is running but still marked as down .

To Troubleshoot This Problem

    Determine which OSD is down :

[ceph: root@host01 /]# ceph health detail
HEALTH_WARN 1/3 in osds are down
osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080

Syntax

systemctl restart ceph-FSID@osd.OSD_ID

Example

[root@host01 ~]# systemctl restart ceph-b404c440-9e4c-11ec-a28a-001a4a0001df@osd.0.service
  1. If you are not able to start ceph-osd , follow the steps in The ceph-osd daemon cannot start .
  2. If you are able to start the ceph-osd daemon but it is marked as down , follow the steps in The ceph-osd daemon is running but still marked as `down` .

The ceph-osd daemon cannot start

  1. If you have a node containing a number of OSDs (generally, more than twelve), verify that the default maximum number of threads (PID count) is sufficient. See Increasing the PID count for details.
  2. Verify that the OSD data and journal partitions are mounted properly. You can use the ceph-volume lvm list command to list all devices and volumes associated with the Ceph Storage Cluster and then manually inspect if they are mounted properly. See the mount(8) manual page for details.
  3. If you got the ERROR: missing keyring, cannot use cephx for authentication error message, the OSD is missing a keyring.
  4. If you got the ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-1 error message, the ceph-osd daemon cannot read the underlying file system. See the following steps for instructions on how to troubleshoot and fix this error.

  1. Check the corresponding log file to determine the cause of the failure. By default, Ceph stores log files in the /var/log/ceph/ CLUSTER_FSID / directory after the logging to files is enabled.
  2. An EIO error message indicates a failure of the underlying disk. To fix this problem replace the underlying OSD disk. See Replacing an OSD drive for details.
  3. If the log includes any other FAILED assert errors, such as the following one, open a support ticket. See Contacting Red Hat Support for service for details.
FAILED assert(0 == "hit suicide timeout")
  4. Check the dmesg output for errors with the underlying disk or file system:
dmesg
  1. If the dmesg output includes any SCSI error messages, see the SCSI Error Codes Solution Finder solution on the Red Hat Customer Portal to determine the best way to fix the problem.
  2. Alternatively, if you are unable to fix the underlying file system, replace the OSD drive. See Replacing an OSD drive for details.
  5. If the log includes a segmentation fault error similar to the following one, open a support ticket. See Contacting Red Hat Support for service for details.
Caught signal (Segmentation fault)

The ceph-osd is running but still marked as down

    Check the corresponding log file to determine the cause of the failure. By default, Ceph stores log files in the /var/log/ceph/ CLUSTER_FSID / directory after the logging to files is enabled.

  1. If the log includes error messages similar to the following ones, see Flapping OSDs.
wrongly marked me down
heartbeat_check: no reply from osd.2 since back

Additional Resources

  • Flapping OSDs
  • Stale placement groups
  • See the Ceph daemon logs to enable logging to files.

5.2.8. Flapping OSDs

The ceph -w | grep osds command shows OSDs repeatedly as down and then up again within a short period of time:

ceph -w | grep osds
2022-05-05 06:27:20.810535 mon.0 [INF] osdmap e609: 9 osds: 8 up, 9 in
2022-05-05 06:27:24.120611 mon.0 [INF] osdmap e611: 9 osds: 7 up, 9 in
2022-05-05 06:27:25.975622 mon.0 [INF] HEALTH_WARN; 118 pgs stale; 2/9 in osds are down
2022-05-05 06:27:27.489790 mon.0 [INF] osdmap e614: 9 osds: 6 up, 9 in
2022-05-05 06:27:36.540000 mon.0 [INF] osdmap e616: 9 osds: 7 up, 9 in
2022-05-05 06:27:39.681913 mon.0 [INF] osdmap e618: 9 osds: 8 up, 9 in
2022-05-05 06:27:43.269401 mon.0 [INF] osdmap e620: 9 osds: 9 up, 9 in
2022-05-05 06:27:54.884426 mon.0 [INF] osdmap e622: 9 osds: 8 up, 9 in
2022-05-05 06:27:57.398706 mon.0 [INF] osdmap e624: 9 osds: 7 up, 9 in
2022-05-05 06:27:59.669841 mon.0 [INF] osdmap e625: 9 osds: 6 up, 9 in
2022-05-05 06:28:07.043677 mon.0 [INF] osdmap e628: 9 osds: 7 up, 9 in
2022-05-05 06:28:10.512331 mon.0 [INF] osdmap e630: 9 osds: 8 up, 9 in
2022-05-05 06:28:12.670923 mon.0 [INF] osdmap e631: 9 osds: 9 up, 9 in

In addition the Ceph log contains error messages similar to the following ones:

2022-05-25 03:44:06.510583 osd.50 127.0.0.1:6801/149046 18992 : cluster [WRN] map e600547 wrongly marked me down
2022-05-25 19:00:08.906864 7fa2a0033700 -1 osd.254 609110 heartbeat_check: no reply from osd.2 since back 2021-07-25 19:00:07.444113 front 2021-07-25 18:59:48.311935 (cutoff 2021-07-25 18:59:48.906862)

What This Means

The main causes of flapping OSDs are:

  • Certain storage cluster operations, such as scrubbing or recovery, take an abnormal amount of time, for example, if you perform these operations on objects with a large index or large placement groups. Usually, after these operations finish, the flapping OSDs problem is solved.
  • Problems with the underlying physical hardware. In this case, the ceph health detail command also returns the slow requests error message.
  • Problems with the network.

Ceph OSDs cannot manage situations where the private network for the storage cluster fails, or where there is significant latency on the public client-facing network.

Ceph OSDs use the private network for sending heartbeat packets to each other to indicate that they are up and in . If the private storage cluster network does not work properly, OSDs are unable to send and receive the heartbeat packets. As a consequence, they report each other as being down to the Ceph Monitors, while marking themselves as up .

The following parameters in the Ceph configuration file influence this behavior:

How long OSDs wait for the heartbeat packets to return before reporting an OSD as down to the Ceph Monitors.

How many OSDs must report another OSD as down before the Ceph Monitors mark the OSD as down

This table shows that in the default configuration, the Ceph Monitors mark an OSD as down if only one OSD made three distinct reports about the first OSD being down . In some cases, if one single host encounters network issues, the entire cluster can experience flapping OSDs. This is because the OSDs that reside on the host will report other OSDs in the cluster as down .

The flapping OSDs scenario does not include the situation when the OSD processes are started and then immediately killed.

To Troubleshoot This Problem

    Check the output of the ceph health detail command again. If it includes the slow requests error message, see Slow requests or requests are blocked for details on how to troubleshoot this issue.

ceph health detail
HEALTH_WARN 30 requests are blocked > 32 sec; 3 osds have slow requests
30 ops are blocked > 268435 sec
1 ops are blocked > 268435 sec on osd.11
1 ops are blocked > 268435 sec on osd.18
28 ops are blocked > 268435 sec on osd.39
3 osds have slow requests
ceph osd tree | grep down
ceph osd set noup
ceph osd set nodown

Using the noup and nodown flags does not fix the root cause of the problem but only prevents OSDs from flapping. To open a support ticket, see the Contacting Red Hat Support for service section for details.

Flapping OSDs can be caused by MTU misconfiguration on Ceph OSD nodes, at the network switch level, or both. To resolve the issue, set MTU to a uniform size on all storage cluster nodes, including on the core and access network switches with a planned downtime. Do not tune osd heartbeat min size because changing this setting can hide issues within the network, and it will not solve actual network inconsistency.
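A hedged sketch of checking and aligning the MTU on a storage node; the interface name eth0 and the value 9000 are assumptions and must match the switch configuration, and the ip command change is not persistent across reboots:

ip link show eth0 | grep mtu
ip link set dev eth0 mtu 9000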

Additional Resources

  • See the Ceph heartbeat section in the Red Hat Ceph Storage Architecture Guide for details.
  • See the Slow requests or requests are blocked section in the Red Hat Ceph Storage Troubleshooting Guide .

5.2.9. Slow requests or requests are blocked

The ceph-osd daemon is slow to respond to a request and the ceph health detail command returns an error message similar to the following one:

HEALTH_WARN 30 requests are blocked > 32 sec; 3 osds have slow requests
30 ops are blocked > 268435 sec
1 ops are blocked > 268435 sec on osd.11
1 ops are blocked > 268435 sec on osd.18
28 ops are blocked > 268435 sec on osd.39
3 osds have slow requests

In addition, the Ceph logs include an error message similar to the following ones:

2022-05-24 13:18:10.024659 osd.1 127.0.0.1:6812/3032 9 : cluster [WRN] 6 slow requests, 6 included below; oldest blocked for > 61.758455 secs
2022-05-25 03:44:06.510583 osd.50 [WRN] slow request 30.005692 seconds old, received at : osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]

What This Means

An OSD with slow requests is every OSD that is not able to service the I/O operations per second (IOPS) in the queue within the time defined by the osd_op_complaint_time parameter. By default, this parameter is set to 30 seconds.

The main causes of OSDs having slow requests are:

  • Problems with the underlying hardware, such as disk drives, hosts, racks, or network switches
  • Problems with the network. These problems are usually connected with flapping OSDs. See Flapping OSDs for details.
  • System load

The following table shows the types of slow requests. Use the dump_historic_ops administration socket command to determine the type of a slow request. For details about the administration socket, see the Using the Ceph Administration Socket section in the Administration Guide for Red Hat Ceph Storage 5.

waiting for rw locks

The OSD is waiting to acquire a lock on a placement group for the operation.

waiting for subops

The OSD is waiting for replica OSDs to apply the operation to the journal.

no flag points reached

The OSD did not reach any major operation milestone.

waiting for degraded object

The OSDs have not replicated an object the specified number of times yet.

To Troubleshoot This Problem

  1. Determine if the OSDs with slow or blocked requests share a common piece of hardware, for example, a disk drive, host, rack, or network switch.
  2. If the OSDs share a disk:

  1. Use the smartmontools utility to check the health of the disk or the logs to determine any errors on the disk.

The smartmontools utility is included in the smartmontools package.

  2. Use the iostat utility to get the I/O statistics for the disk.

The iostat utility is included in the sysstat package.

  1. Check the RAM and CPU utilization.
  2. Use the netstat utility to see the network statistics on the Network Interface Controllers (NICs) and troubleshoot any networking issues. See also Troubleshooting networking issues for further information.

Additional Resources

  • See the Using the Ceph Administration Socket section in the Red Hat Ceph Storage Administration Guide for details.

5.3. Stopping and starting rebalancing

When an OSD fails or you stop it, the CRUSH algorithm automatically starts the rebalancing process to redistribute data across the remaining OSDs.

Rebalancing can take time and resources, therefore, consider stopping rebalancing during troubleshooting or maintaining OSDs.

Placement groups within the stopped OSDs become degraded during troubleshooting and maintenance.

Prerequisites

  • Root-level access to the Ceph Monitor node.

Procedure

    Log in to the Cephadm shell:

Example

[root@host01 ~]# cephadm shell

Example

[ceph: root@host01 /]# ceph osd set noout

Example

[ceph: root@host01 /]# ceph osd unset noout

Additional Resources

  • The Rebalancing and Recovery section in the Red Hat Ceph Storage Architecture Guide .

5.4. Replacing an OSD drive

Ceph is designed for fault tolerance, which means that it can operate in a degraded state without losing data. Consequently, Ceph can operate even if a data storage drive fails. In the context of a failed drive, the degraded state means that the extra copies of the data stored on other OSDs will backfill automatically to other OSDs in the cluster. However, if this occurs, replace the failed OSD drive and recreate the OSD manually.

When a drive fails, Ceph reports the OSD as down :

HEALTH_WARN 1/3 in osds are down osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080

Ceph can mark an OSD as down also as a consequence of networking or permissions problems. See Down OSDs for details.

Modern servers typically deploy with hot-swappable drives so you can pull a failed drive and replace it with a new one without bringing down the node. The whole procedure includes these steps:

  1. Remove the OSD from the Ceph cluster. For details, see the Removing an OSD from the Ceph Cluster procedure.
  2. Replace the drive. For details, see Replacing the physical drive section.
  3. Add the OSD to the cluster. For details, see Adding an OSD to the Ceph Cluster procedure.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Root-level access to the Ceph Monitor node.
  • At least one OSD is down .

Removing an OSD from the Ceph Cluster

    Log into the Cephadm shell:

Example

[root@host01 ~]# cephadm shell

Example

[ceph: root@host01 /]# ceph osd tree | grep -i down
ID CLASS WEIGHT  TYPE NAME STATUS REWEIGHT PRI-AFF
 0   hdd 0.00999      osd.0   down 1.00000 1.00000

Syntax

ceph osd out OSD_ID

Example

[ceph: root@host01 /]# ceph osd out osd.0
marked out osd.0.

If the OSD is down , Ceph marks it as out automatically after 600 seconds when it does not receive any heartbeat packet from the OSD based on the mon_osd_down_out_interval parameter. When this happens, other OSDs with copies of the failed OSD data begin backfilling to ensure that the required number of copies exists within the cluster. While the cluster is backfilling, the cluster will be in a degraded state.
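On this release the interval can also be read from the centralized configuration; a minimal sketch:

[ceph: root@host01 /]# ceph config get mon mon_osd_down_out_interval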

Example

[ceph: root@host01 /]# ceph -w | grep backfill
2022-05-02 04:48:03.403872 mon.0 [INF] pgmap v10293282: 431 pgs: 1 active+undersized+degraded+remapped+backfilling, 28 active+undersized+degraded, 49 active+undersized+degraded+remapped+wait_backfill, 59 stale+active+clean, 294 active+clean; 72347 MB data, 101302 MB used, 1624 GB / 1722 GB avail; 227 kB/s rd, 1358 B/s wr, 12 op/s; 10626/35917 objects degraded (29.585%); 6757/35917 objects misplaced (18.813%); 63500 kB/s, 15 objects/s recovering
2022-05-02 04:48:04.414397 mon.0 [INF] pgmap v10293283: 431 pgs: 2 active+undersized+degraded+remapped+backfilling, 75 active+undersized+degraded+remapped+wait_backfill, 59 stale+active+clean, 295 active+clean; 72347 MB data, 101398 MB used, 1623 GB / 1722 GB avail; 969 kB/s rd, 6778 B/s wr, 32 op/s; 10626/35917 objects degraded (29.585%); 10580/35917 objects misplaced (29.457%); 125 MB/s, 31 objects/s recovering
2022-05-02 04:48:00.380063 osd.1 [INF] 0.6f starting backfill to osd.0 from (0'0,0'0] MAX to 2521'166639
2022-05-02 04:48:00.380139 osd.1 [INF] 0.48 starting backfill to osd.0 from (0'0,0'0] MAX to 2513'43079
2022-05-02 04:48:00.380260 osd.1 [INF] 0.d starting backfill to osd.0 from (0'0,0'0] MAX to 2513'136847
2022-05-02 04:48:00.380849 osd.1 [INF] 0.71 starting backfill to osd.0 from (0'0,0'0] MAX to 2331'28496
2022-05-02 04:48:00.381027 osd.1 [INF] 0.51 starting backfill to osd.0 from (0'0,0'0] MAX to 2513'87544

Syntax

ceph orch daemon stop OSD_ID

Example

[ceph: root@host01 /]# ceph orch daemon stop osd.0

Syntax

ceph orch osd rm OSD_ID --replace

Example

[ceph: root@host01 /]# ceph orch osd rm 0 --replace

Replacing the physical drive

See the documentation for the hardware node for details on replacing the physical drive.

  1. If the drive is hot-swappable, replace the failed drive with a new one.
  2. If the drive is not hot-swappable and the node contains multiple OSDs, you might have to shut down the whole node and replace the physical drive. Consider preventing the cluster from backfilling. See the Stopping and Starting Rebalancing chapter in the Red Hat Ceph Storage Troubleshooting Guide for details.
  3. When the drive appears under the /dev/ directory, make a note of the drive path.
  4. If you want to add the OSD manually, find the OSD drive and format the disk.
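If the replacement device still carries old partitions or LVM metadata, it can be wiped before reuse; a hedged example with the orchestrator, where host02 and /dev/sdb mirror the example further below and are assumptions:

[ceph: root@host01 /]# ceph orch device zap host02 /dev/sdb --force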

Adding an OSD to the Ceph Cluster

    Once the new drive is inserted, you can use the following options to deploy the OSDs:

  • The OSDs are deployed automatically by the Ceph Orchestrator if the --unmanaged parameter is not set.

Example

[ceph: root@host01 /]# ceph orch apply osd --all-available-devices

Example

[ceph: root@host01 /]# ceph orch apply osd --all-available-devices --unmanaged=true

Example

[ceph: root@host01 /]# ceph orch daemon add osd host02:/dev/sdb

Example

[ceph: root@host01 /]# ceph osd tree

Additional Resources

  • See the Deploying Ceph OSDs on all available devices section in the Red Hat Ceph Storage Operations Guide .
  • See the Deploying Ceph OSDs on specific devices and hosts section in the Red Hat Ceph Storage Operations Guide .
  • See the Down OSDs section in the Red Hat Ceph Storage Troubleshooting Guide .
  • See the Red Hat Ceph Storage Installation Guide.

5.5. Increasing the PID count

If you have a node containing more than 12 Ceph OSDs, the default maximum number of threads (PID count) can be insufficient, especially during recovery. As a consequence, some ceph-osd daemons can terminate and fail to start again. If this happens, increase the maximum possible number of threads allowed.

Procedure

To temporarily increase the number:

[root@mon ~]# sysctl -w kernel.pid_max=4194303

To permanently increase the number, update the /etc/sysctl.conf file as follows:

kernel.pid_max = 4194303

5.6. Deleting data from a full storage cluster

Ceph automatically prevents any I/O operations on OSDs that reached the capacity specified by the mon_osd_full_ratio parameter and returns the full osds error message.

This procedure shows how to delete unnecessary data to fix this error.

The mon_osd_full_ratio parameter sets the value of the full_ratio parameter when creating a cluster. You cannot change the value of mon_osd_full_ratio afterward. To temporarily increase the full_ratio value, use the ceph osd set-full-ratio command instead.

Prerequisites

  • Root-level access to the Ceph Monitor node.

Procedure

    Log in to the Cephadm shell:

Example

[root@host01 ~]# cephadm shell
Determine the current value of full_ratio , by default it is set to 0.95 :
[ceph: root@host01 /]# ceph osd dump | grep -i full
full_ratio 0.95

Temporarily increase the value of full_ratio :
[ceph: root@host01 /]# ceph osd set-full-ratio 0.97

Red Hat strongly recommends to not set the set-full-ratio to a value higher than 0.97. Setting this parameter to a higher value makes the recovery process harder. As a consequence, you might not be able to recover full OSDs at all.

Verify the new value:
[ceph: root@host01 /]# ceph osd dump | grep -i full
full_ratio 0.97

Monitor the cluster state and delete unnecessary data; once the cluster is no longer full, set full_ratio back to its default value:
[ceph: root@host01 /]# ceph -w
[ceph: root@host01 /]# ceph osd set-full-ratio 0.95

Verify the value again:
[ceph: root@host01 /]# ceph osd dump | grep -i full
full_ratio 0.95

Additional Resources

  • Full OSDs section in the Red Hat Ceph Storage Troubleshooting Guide .
  • Nearfull OSDs section in the Red Hat Ceph Storage Troubleshooting Guide .

Ceph: replacing a failed OSD


Replacing a failed OSD

If for some reason one of the OSDs has gone down (in this example it is osd.0):

ceph osd tree

ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF

-1 0.05878 root default

-3 0.01959 host tst-vsrv-ceph1

0 hdd 0.00980 osd.0 down 1.00000 1.00000

3 hdd 0.00980 osd.3 up 1.00000 1.00000

-5 0.01959 host tst-vsrv-ceph2

1 hdd 0.00980 osd.1 up 0.90002 1.00000

4 hdd 0.00980 osd.4 up 1.00000 1.00000

-7 0.01959 host tst-vsrv-ceph3

2 hdd 0.00980 osd.2 up 0.90002 1.00000

5 hdd 0.00980 osd.5 up 1.00000 1.00000

and osd.0 keeps going down again a while after being restarted:

systemctl restart ceph-osd@0.service

In that case we can simply take the broken OSD out of the cluster, recreate it, and return it to the cluster:

ceph osd out osd.0

ceph osd crush remove osd.0

ceph auth del osd.0

ceph osd rm osd.0

ceph osd tree (here we will see that our OSD has been taken out of the cluster)

umount /var/lib/ceph/osd/ceph-0

lvremove /dev/vg-ceph/ceph (important: the logical volume that was used by osd.0 must be removed; only after that can it be recreated and brought back into the cluster)

lvcreate -L14G -n ceph vg-ceph

ceph-deploy osd create --data vg-ceph/ceph tst-vsrv-ceph1 (run as the ceph user, from the deployment directory)

ceph osd tree (we can see that osd.0 has reappeared)

Wait about 5 minutes for Ceph rebalancing to finish, after which the cluster returns to the HEALTH_OK state.
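While waiting, the rebalancing progress can be watched, for example:

ceph -s (one-off look at the cluster status)

ceph -w (streams cluster events until interrupted)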

ceph health detail (to find out exactly which OSD has filled up)

ceph osd tree

ceph osd df

This replacement procedure can also be applied if the cluster has gone into the HEALTH_ERR state. Since extending the logical volume will not help in that state, we can take the full OSD out of the cluster (we can identify it with ceph health detail), likewise remove its LVM volume and recreate it with the required size. At that point the cluster rebalances data across the OSDs. Wait for the rebalancing to finish and then bring the new OSD into the cluster. A further rebalance follows, and after some time the cluster should return to a working state.

Removing an RBD image that is locked and cannot be deleted

If we have checked everywhere that the RBD image we are deleting is not mounted and is not mapped anywhere, one way to delete it is the following:

rbd rm rbd_name

2019-09-19 18:00:45.834 7f0777fff700 -1 librbd::image::RemoveRequest: 0x5600042e5ad0 check_image_watchers: image has watchers - not removing

Removing image: 0% complete…failed.

rbd: error: image still has watchers

This means the image is still open or the client using it crashed. Try again after closing/unmapping it or waiting 30s for the crashed client to timeout.

rbd status rbd_name

watcher=10.242.146.15:0/231140603 client.27127586 cookie=18446462598732840965

ceph osd blacklist add 10.10.10.15:0/231140603

blacklisting 10.10.10.15:0/231140603 until 2019-09-19 19:03:11.057368 (3600 sec)

rbd rm rbd_name

Removing image: 100% complete…done.

ceph osd blacklist rm 10.10.10.15:0/231140603
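To double-check that no stale entries remain after removing the record, the current blacklist can be listed (a small addition, not from the original post):

ceph osd blacklist ls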
