Troubleshooting

Module ‘dashboard’ has failed: key type unsupported

Error:

$ ceph health detail
HEALTH_ERR Module 'dashboard' has failed: key type unsupported
[ERR] MGR_MODULE_ERROR: Module 'dashboard' has failed: key type unsupported
    Module 'dashboard' has failed: key type unsupported

The fix is to disable SSL, load a valid certificate and key, then re-enable SSL:

ceph config set mgr mgr/dashboard/ssl false; ceph mgr module disable dashboard ; sleep 5; ceph mgr module enable dashboard

ceph dashboard set-ssl-certificate-key -i /etc/pki/realms/pkgdata.net/private/key.pem
ceph dashboard set-ssl-certificate -i /etc/pki/realms/pkgdata.net/acme/cert.pem
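Before turning SSL back on, it can be worth verifying that the certificate and key actually match; a quick check using the paths from the commands above (any diff output means a mismatch):

diff <(openssl x509 -noout -pubkey -in /etc/pki/realms/pkgdata.net/acme/cert.pem) \
     <(openssl pkey -in /etc/pki/realms/pkgdata.net/private/key.pem -pubout)
openssl x509 -noout -subject -enddate -in /etc/pki/realms/pkgdata.net/acme/cert.pem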

ceph config set mgr mgr/dashboard/ssl true; ceph mgr module disable dashboard ; sleep 5; ceph mgr module enable dashboard
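Once the module is re-enabled, the dashboard endpoint should be listed again and the health error should clear:

ceph mgr services
ceph health detail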

Ceph RGW S3 error: No such file or directory

On one backup job, the restic repository was corrupted:

Load(<index/d4b1e7f03d>, 0, 0) failed: The specified key does not exist.

Running ‘restic repair index’ did not fix it either:

# restic repair index
repository dfbf3cf6 opened (version 2, compression level auto)
created new cache in /data/restic-databases/cache
loading indexes...
Load(<index/d4b1e7f03d>, 0, 0) failed: The specified key does not exist.
Load(<index/d4b1e7f03d>, 0, 0) failed: The specified key does not exist.
removing invalid index d4b1e7f03df484f65e458d37d4089fe7e0a2312dee34a8f5350cfb5f09d7925d: The specified key does not exist.
getting pack files to read...
rebuilding index
[0:00] 100.00%  1 / 1 indexes processed
[0:00] 100.00%  2 / 2 old indexes deleted
done

The problem is also visible after mounting the Ceph bucket with s3fs:

s3fs os402.pkgdata.net-databases /mnt/ceph-bucket -o passwd_file=/etc/passwd-s3fs -o url=https://pkgdata.backup -o use_path_request_style

root@on001:/mnt/ceph-bucket/index # ls
ls: cannot access 'd4b1e7f03df484f65e458d37d4089fe7e0a2312dee34a8f5350cfb5f09d7925d': No such file or directory
d4b1e7f03df484f65e458d37d4089fe7e0a2312dee34a8f5350cfb5f09d7925d  f5f669ed7e8bf6afc46a102b790483e2e3720385d566207da99b32dbc076e855
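The broken entry can also be inspected directly on the RGW side before running the fix (bucket and object names taken from the listing above); an error here points at the same inconsistency:

radosgw-admin object stat --bucket=os402.pkgdata.net-databases --object=index/d4b1e7f03df484f65e458d37d4089fe7e0a2312dee34a8f5350cfb5f09d7925d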

To fix it, log in to ov004.pkgdata.net and run:

radosgw-admin bucket check --bucket=os402.pkgdata.net-databases --check-objects --fix
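Once the bucket check has completed, re-checking the restic repository should confirm the index loads cleanly again. A minimal sketch, assuming the repository lives at the endpoint and bucket used for the s3fs mount above and that the usual AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY variables are exported:

restic -r s3:https://pkgdata.backup/os402.pkgdata.net-databases check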

BLUEFS_SPILLOVER

See https://tracker.ceph.com/issues/44509#note-8 and https://github.com/nh2/ceph-fix-spilled-metadata-script/blob/main/ceph-fix-spilled-metadata.sh

The DB is using space on the spinning (slow) disks:

ceph health detail
HEALTH_WARN 10 OSD(s) experiencing BlueFS spillover
[WRN] BLUEFS_SPILLOVER: 10 OSD(s) experiencing BlueFS spillover
     osd.1 spilled over 24 GiB metadata from 'db' device (2.7 GiB used of 50 GiB) to slow device
     osd.2 spilled over 25 GiB metadata from 'db' device (2.4 GiB used of 50 GiB) to slow device
     osd.3 spilled over 23 GiB metadata from 'db' device (2.7 GiB used of 50 GiB) to slow device
     osd.4 spilled over 23 GiB metadata from 'db' device (2.5 GiB used of 50 GiB) to slow device
     osd.5 spilled over 25 GiB metadata from 'db' device (2.1 GiB used of 50 GiB) to slow device
     osd.6 spilled over 23 GiB metadata from 'db' device (2.1 GiB used of 50 GiB) to slow device
     osd.7 spilled over 25 GiB metadata from 'db' device (2.6 GiB used of 50 GiB) to slow device
     osd.8 spilled over 25 GiB metadata from 'db' device (2.8 GiB used of 50 GiB) to slow device
     osd.9 spilled over 24 GiB metadata from 'db' device (1.7 GiB used of 50 GiB) to slow device
     osd.10 spilled over 25 GiB metadata from 'db' device (2.7 GiB used of 50 GiB) to slow device

This comes from the following setting:

ceph daemon /var/run/ceph/9e2d3cee-4d0c-11ef-ba6d-047c16f1285e/ceph-osd.1.asok config show|grep -i bluestore_volume_selection_policy
    "bluestore_volume_selection_policy": "use_some_extra",

but bluefs_used on BDEV_SLOW should be 0, since there is still room on BDEV_DB (only about 2.7 GiB of the 50 GiB DB device is used):

ceph daemon /var/run/ceph/9e2d3cee-4d0c-11ef-ba6d-047c16f1285e/ceph-osd.1.asok bluestore bluefs device info
{
    "dev": {
        "device": "BDEV_DB",
        "total": 53687083008,
        "free": 50804547584,
        "bluefs_used": 2882535424
    },
    "dev": {
        "device": "BDEV_SLOW",
        "total": 22000965779456,
        "free": 6919077777408,
        "bluefs_used": 25998458880,
        "bluefs max available": 6834005999616
    }
}

and:

ceph tell osd.1 bluefs stats
1 : device size 0xc7fffe000 : using 0x660700000(26 GiB)
2 : device size 0x14027fc00000 : using 0xdb184548000(14 TiB)
RocksDBBlueFSVolumeSelector 
>>Settings<< extra=28 GiB, l0_size=1 GiB, l_base=1 GiB, l_multi=8 B
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES       
LOG         0 B         16 MiB      0 B         0 B         0 B         14 MiB      1           
WAL         0 B         126 MiB     0 B         0 B         0 B         108 MiB     5           
DB          0 B         1.1 GiB     0 B         0 B         0 B         717 MiB     52          
SLOW        0 B         24 GiB      0 B         0 B         0 B         22 GiB      352         
TOTAL       0 B         26 GiB      0 B         0 B         0 B         0 B         410         
MAXIMUMS:
LOG         0 B         16 MiB      0 B         0 B         0 B         14 MiB      
WAL         0 B         1.7 GiB     0 B         0 B         0 B         1001 MiB    
DB          0 B         1.2 GiB     0 B         0 B         0 B         801 MiB     
SLOW        0 B         24 GiB      0 B         0 B         0 B         22 GiB      
TOTAL       0 B         27 GiB      0 B         0 B         0 B         0 B         
>> SIZE <<  0 B         48 GiB      19 TiB

To fix it (example for osd.1; a loop covering all ten OSDs follows below):

ceph osd add-noout osd.1
systemctl stop ceph-9e2d3cee-4d0c-11ef-ba6d-047c16f1285e@osd.1.service
ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/9e2d3cee-4d0c-11ef-ba6d-047c16f1285e/osd.1 --devs-source /var/lib/ceph/9e2d3cee-4d0c-11ef-ba6d-047c16f1285e/osd.1/block --dev-target /var/lib/ceph/9e2d3cee-4d0c-11ef-ba6d-047c16f1285e/osd.1/block.db
systemctl start ceph-9e2d3cee-4d0c-11ef-ba6d-047c16f1285e@osd.1.service
ceph osd rm-noout osd.1
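Since the warning lists ten OSDs, the same sequence has to be repeated for each of them. A sketch of a loop, assuming all of them live on the same host and reusing the fsid from above; it waits for each OSD to be marked up again before moving on to the next one:

fsid=9e2d3cee-4d0c-11ef-ba6d-047c16f1285e
for id in 1 2 3 4 5 6 7 8 9 10; do
    ceph osd add-noout osd.$id
    systemctl stop ceph-$fsid@osd.$id.service
    # move the BlueFS files that spilled onto the slow device back to the DB device
    ceph-bluestore-tool bluefs-bdev-migrate \
        --path /var/lib/ceph/$fsid/osd.$id \
        --devs-source /var/lib/ceph/$fsid/osd.$id/block \
        --dev-target /var/lib/ceph/$fsid/osd.$id/block.db
    systemctl start ceph-$fsid@osd.$id.service
    # wait until the OSD is up again before touching the next one
    until ceph osd dump | grep -Eq "^osd\.$id +up"; do sleep 10; done
    ceph osd rm-noout osd.$id
done

After the last OSD has been migrated, ceph health detail should no longer report BLUEFS_SPILLOVER.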