Scope

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

Deploying and operating Mayastor in production contexts requires a foundational knowledge of Mayastor internals and best practices, found elsewhere within this documentation.

This website/page will be End-of-life (EOL) after 31 August 2024. We recommend you visit the OpenEBS Documentation for the latest Mayastor documentation (v2.6 and above).

This quickstart guide describes the actions necessary to perform a basic installation of Mayastor on an existing Kubernetes cluster, sufficient for evaluation purposes. It assumes that the target cluster will pull the Mayastor container images directly from OpenEBS public container repositories. Where preferred, it is also possible to build Mayastor locally from source and deploy the resultant images, but this is outside the scope of this guide.

Basic Architecture

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

The objective of this section is to provide the user and evaluator of Mayastor with a topological view of a Mayastor deployment: the expected pod inventory of a correctly deployed cluster, the roles and functions of the constituent pods and related Kubernetes resource types, and the high-level interactions and orchestration between them.

Dramatis Personae

| Name | Resource Type | Function | Frequency in Cluster |
| --- | --- | --- | --- |
| Control Plane | | | |
| agent-core | Deployment | Principal control plane actor | Single |
| csi-controller | Deployment | Hosts Mayastor's CSI controller implementation and CSI provisioner side car | Single |
| api-rest | Pod | Hosts the public API REST server | Single |
| api-rest | Service | Exposes the REST API server | Single |
| operator-diskpool | Deployment | Hosts DiskPool operator | Single |
| csi-node | DaemonSet | Hosts CSI Driver node plugin containers | All worker nodes |
| etcd | StatefulSet | Hosts etcd Server container | Configurable (recommended: three replicas) |
| etcd | Service | Exposes etcd DB endpoint | Single |
| etcd-headless | Service | Exposes etcd DB endpoint | Single |
| io-engine | DaemonSet | Hosts Mayastor I/O engine | User-selected nodes |
| DiskPool | CRD | Declares a Mayastor pool's desired state and reflects its current state | User-defined, one or many |
| Additional components | | | |
| metrics-exporter-pool | Sidecar container (within io-engine DaemonSet) | Exports pool related metrics in Prometheus format | All worker nodes |
| pool-metrics-exporter | Service | Exposes exporter API endpoint to Prometheus | Single |
| promtail | DaemonSet | Scrapes logs of Mayastor-specific pods and exports them to Loki | All worker nodes |
| loki | StatefulSet | Stores the historical logs exported by promtail pods | Single |
| loki | Service | Exposes the Loki API endpoint via ClusterIP | Single |

Component Roles

io-engine

The io-engine pod encapsulates Mayastor containers, which implement the I/O path from the block devices at the persistence layer, up to the relevant initiators on the worker nodes mounting volume claims. The Mayastor process running inside this container performs four major functions:

  • Creates and manages DiskPools hosted on that node.

  • Creates, exports, and manages volume controller objects hosted on that node.

  • Creates and exposes replicas from DiskPools hosted on that node over NVMe-TCP.

  • Provides a gRPC interface service to orchestrate the creation, deletion and management of the above objects, hosted on that node.

Before the io-engine pod starts running, an init container attempts to verify connectivity to the agent-core in the namespace where Mayastor has been deployed. If a connection is established, the io-engine pod registers itself over gRPC with the agent-core. In this way, the agent-core maintains a registry of nodes and their supported api-versions.

The scheduling of these pods is determined declaratively by using a DaemonSet specification. By default, a nodeSelector field is used within the pod spec to select, as recipients of an io-engine pod, all worker nodes to which the user has attached the label openebs.io/engine=mayastor. In this way, the node count and location are set appropriately to the hardware configuration of the worker nodes and to the capacity and performance demands of the cluster.
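As an illustration, the nodeSelector mechanism described above might look like the following DaemonSet sketch (the name, image and all fields other than the openebs.io/engine label are illustrative assumptions; the real manifest is managed by the Mayastor deployment):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-io-engine          # illustrative name only
  namespace: mayastor
spec:
  selector:
    matchLabels:
      app: example-io-engine
  template:
    metadata:
      labels:
        app: example-io-engine
    spec:
      nodeSelector:
        openebs.io/engine: mayastor   # only nodes carrying this label receive an io-engine pod
      containers:
        - name: io-engine
          image: example.org/io-engine:latest   # placeholder image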

csi-node

The csi-node pods within a cluster implement the node plugin component of Mayastor's CSI driver. Their function is to orchestrate the mounting of Mayastor-provisioned volumes on the worker nodes where application pods consuming those volumes are scheduled. By default, a csi-node pod is scheduled on every node in the target cluster, as determined by a DaemonSet resource of the same name. Each of these pods encapsulates two containers: csi-node and csi-driver-registrar.

The node plugin does not need to run on every worker node within a cluster; this behaviour can be modified, if desired, by applying appropriate node labels and adding a corresponding nodeSelector entry within the pod spec of the csi-node DaemonSet. Note, however, that if a node does not host a plugin pod, it will not be possible to schedule on it any application pod which is configured to mount Mayastor volumes.

etcd

etcd is a distributed reliable key-value store for the critical data of a distributed system. Mayastor uses etcd as a reliable persistent store for its configuration and state data.

Supportability:

The supportability tool is used to create support bundles (archive files) by interacting with multiple services present in the cluster where Mayastor is installed. These bundles contain information about Mayastor resources like volumes, pools and nodes, and can be used for debugging. The tool can collect the following information:

  • Topological information of Mayastor's resource(s) by interacting with the REST service

  • Historical logs by interacting with Loki. If Loki is unavailable, it interacts with the kube-apiserver to fetch logs.

  • Mayastor-specific Kubernetes resources by interacting with the kube-apiserver

  • Mayastor-specific information from etcd (internal) by interacting with the etcd server.

Loki:

Loki aggregates and centrally stores logs from all Mayastor containers which are deployed in the cluster.

Promtail:

Promtail is a log collector built specifically for Loki. It uses the configuration file for target discovery and includes analogous features for labeling, transforming, and filtering logs from containers before ingesting them to Loki.

More detailed guides to Mayastor's components, their design and internal structure, and instructions for building Mayastor from source, are maintained within the project's GitHub repository.

Preparing the Cluster

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

Verify / Enable Huge Page Support

2MiB-sized Huge Pages must be supported and enabled on the Mayastor storage nodes. A minimum of 1024 such pages (i.e. 2GiB total) must be available exclusively to the Mayastor pod on each node, which should be verified thus:

grep HugePages /proc/meminfo

AnonHugePages:         0 kB
ShmemHugePages:        0 kB
HugePages_Total:    1024
HugePages_Free:      671
HugePages_Rsvd:        0
HugePages_Surp:        0

If fewer than 1024 pages are available then the page count should be reconfigured on the worker node as required, accounting for any other workloads which may be scheduled on the same node and which also require them. For example:

echo 1024 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

This change should also be made persistent across reboots by adding the required value to the file /etc/sysctl.conf like so:

echo vm.nr_hugepages = 1024 | sudo tee -a /etc/sysctl.conf

If you modify the Huge Page configuration of a node, you MUST either restart kubelet or reboot the node. Mayastor will not deploy correctly if the available Huge Page count as reported by the node's kubelet instance does not satisfy the minimum requirements.
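To confirm the Huge Page count that the kubelet actually reports for a node, a quick check such as the following can be used (assuming kubectl access to the cluster):

kubectl get node <node_name> -o jsonpath='{.status.allocatable.hugepages-2Mi}'
# expect a value of at least 2Gi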

Label Mayastor Node Candidates

All worker nodes which will have Mayastor pods running on them must be labelled with the OpenEBS engine type "mayastor". This label will be used as a node selector by the Mayastor DaemonSet, which is deployed as part of the Mayastor data plane components installation. To add this label to a node, execute:

kubectl label node <node_name> openebs.io/engine=mayastor

If you set csi.node.topology.nodeSelector: true, then you will need to label the worker nodes according to csi.node.topology.segments. Both the csi-node and agent-ha-node DaemonSets will include the topology segments in their node selectors.
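For illustration, a hypothetical topology segment and the matching node label might look like the following sketch (the key and value openebs.io/nodepool=storage-a are assumptions, not chart defaults):

csi:
  node:
    topology:
      nodeSelector: true
      segments:
        openebs.io/nodepool: storage-a   # hypothetical segment key/value

The matching label would then be applied with kubectl label node <node_name> openebs.io/nodepool=storage-a.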

Prerequisites

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

General

All worker nodes must satisfy the following requirements:

  • x86-64 CPU cores with SSE4.2 instruction support

  • Linux kernel 5.13 or higher (tested on kernel 5.15). The kernel should have the following modules loaded:

    • nvme-tcp

    • ext4 and optionally xfs

  • Each worker node which will host an instance of an io-engine pod must have the following resources free and available for exclusive use by that pod:

    • Two CPU cores

    • 1GiB RAM

    • HugePage support

      • A minimum of 2GiB of 2MiB-sized pages

Networking requirements

  • Ensure that the following ports are not in use on the node:

    • 10124: Mayastor gRPC server will use this port.

    • 8420 / 4421: NVMf targets will use these ports.

  • The firewall settings should not restrict connection to the node.
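A quick way to confirm that these ports are not already in use on a node (a sketch; any equivalent tool such as netstat works too):

ss -tlnp | grep -E ':(10124|8420|4421)\s' || echo "ports 10124, 8420 and 4421 are free"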

Recommended resource requirements

resources:
  limits:
    cpu: "2"
    memory: "1Gi"
    hugepages-2Mi: "2Gi"
  requests:
    cpu: "2"
    memory: "1Gi"
    hugepages-2Mi: "2Gi"
resources:
  limits:
    cpu: "100m"
    memory: "50Mi"
  requests:
    cpu: "100m"
    memory: "50Mi"
resources:
  limits:
    cpu: "32m"
    memory: "128Mi"
  requests:
    cpu: "16m"
    memory: "64Mi"
resources:
  limits:
    cpu: "100m"
    memory: "64Mi"
  requests:
    cpu: "50m"
    memory: "32Mi"
resources:
  limits:
    cpu: "1000m"
    memory: "32Mi"
  requests:
    cpu: "500m"
    memory: "16Mi"
resources:
  limits:
    cpu: "100m"
    memory: "32Mi"
  requests:
    cpu: "50m"
    memory: "16Mi"

DiskPool requirements

  • Disks must be unpartitioned, unformatted, and used exclusively by the DiskPool.

  • The minimum capacity of the disks should be 10 GB.
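For example, to check that a candidate disk carries no partitions or filesystem signatures before assigning it to a DiskPool (the device name is illustrative):

lsblk -o NAME,SIZE,FSTYPE,MOUNTPOINT /dev/sdx
# FSTYPE and MOUNTPOINT should be empty, and the device should list no child partitions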

RBAC permission requirements

  • Kubernetes core v1 API-group resources: Pod, Event, Node, Namespace, ServiceAccount, PersistentVolume, PersistentVolumeClaim, ConfigMap, Secret, Service, Endpoint.

  • Kubernetes batch API-group resources: CronJob, Job

  • Kubernetes apps API-group resources: Deployment, ReplicaSet, StatefulSet, DaemonSet

  • Kubernetes storage.k8s.io API-group resources: StorageClass, VolumeAttachment, CSINode; and snapshot.storage.k8s.io API-group resources: VolumeSnapshot, VolumeSnapshotContent

  • Kubernetes apiextensions.k8s.io API-group resources: CustomResourceDefinition

  • Mayastor Custom Resources, that is, openebs.io API-group resources: DiskPool

  • Custom Resources from Helm chart dependencies of Jaeger that are helpful for debugging:

    • ConsoleLink Resource from console.openshift.io API group

    • ElasticSearch Resource from logging.openshift.io API group

    • Kafka and KafkaUsers from kafka.strimzi.io API group

    • ServiceMonitor from monitoring.coreos.com API group

    • Ingress from networking.k8s.io API group and from extensions API group

    • Route from route.openshift.io API group

    • All resources from jaegertracing.io API group

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: {{ .Release.Name }}-service-account
  namespace: {{ .Release.Namespace }}
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: mayastor-cluster-role
rules:
- apiGroups: ["apiextensions.k8s.io"]
  resources: ["customresourcedefinitions"]
  verbs: ["create", "list"]
  # must read diskpool info
- apiGroups: ["openebs.io"]
  resources: ["diskpools"]
  verbs: ["get", "list", "watch", "update", "replace", "patch"]
  # must update diskpool status
- apiGroups: ["openebs.io"]
  resources: ["diskpools/status"]
  verbs: ["update", "patch"]
  # external provisioner & attacher
- apiGroups: [""]
  resources: ["persistentvolumes"]
  verbs: ["get", "list", "watch", "update", "create", "delete", "patch"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "watch"]

  # external provisioner
- apiGroups: [""]
  resources: ["persistentvolumeclaims"]
  verbs: ["get", "list", "watch", "update"]
- apiGroups: ["storage.k8s.io"]
  resources: ["storageclasses"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["list", "watch", "create", "update", "patch"]
- apiGroups: ["snapshot.storage.k8s.io"]
  resources: ["volumesnapshots"]
  verbs: ["get", "list"]
- apiGroups: ["snapshot.storage.k8s.io"]
  resources: ["volumesnapshotcontents"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "watch"]

  # external attacher
- apiGroups: ["storage.k8s.io"]
  resources: ["volumeattachments"]
  verbs: ["get", "list", "watch", "update", "patch"]
- apiGroups: ["storage.k8s.io"]
  resources: ["volumeattachments/status"]
  verbs: ["patch"]
  # CSI nodes must be listed
- apiGroups: ["storage.k8s.io"]
  resources: ["csinodes"]
  verbs: ["get", "list", "watch"]

  # get kube-system namespace to retrieve Uid
- apiGroups: [""]
  resources: ["namespaces"]
  verbs: ["get"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: mayastor-cluster-role-binding
subjects:
- kind: ServiceAccount
  name: {{ .Release.Name }}-service-account
  namespace: {{ .Release.Namespace }}
roleRef:
  kind: ClusterRole
  name: mayastor-cluster-role
  apiGroup: rbac.authorization.k8s.io

Minimum worker node count

The minimum supported worker node count is three nodes. When using the synchronous replication feature (N-way mirroring), the number of worker nodes to which Mayastor is deployed should be no less than the desired replication factor.

Transport protocols

Mayastor supports the export and mounting of volumes over NVMe-oF TCP only. Worker node(s) on which a volume may be scheduled (to be mounted) must have the requisite initiator support installed and configured. In order to reliably mount Mayastor volumes over NVMe-oF TCP, a worker node's kernel version must be 5.13 or later and the nvme-tcp kernel module must be loaded.
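The following commands can be used on a worker node to verify these requirements (a sketch; persistent module loading may instead be configured via /etc/modules-load.d):

uname -r                 # should report a kernel version of 5.13 or later
sudo modprobe nvme-tcp   # load the NVMe-oF TCP initiator module
lsmod | grep nvme_tcp    # confirm the module is present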

The Helm version must be v3.7 or later.

Deploy a Test Application

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

Objective

If all verification steps in the preceding stages were satisfied, then Mayastor has been successfully deployed within the cluster. In order to verify basic functionality, we will now dynamically provision a Persistent Volume based on a Mayastor StorageClass, mount that volume within a small test pod which we'll create, and use the Flexible I/O Tester utility within that pod to check that I/O to the volume is processed correctly.

Define the PVC

Use kubectl to create a PVC based on a StorageClass that you created in the previous stage. In the example shown below, we'll consider that StorageClass to have been named "mayastor-1". Replace the value of the field "storageClassName" with the name of your own Mayastor-based StorageClass.

For the purposes of this quickstart guide, it is suggested to name the PVC "ms-volume-claim", as this is what will be illustrated in the example steps which follow.

cat <<EOF | kubectl create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ms-volume-claim
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: mayastor-1
EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ms-volume-claim
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: INSERT_YOUR_STORAGECLASS_NAME_HERE

If you used the StorageClass example from the previous stage, the volume binding mode is set to WaitForFirstConsumer. This means that the volume won't be created until there is an application using it. We will go ahead and create the application pod, and then check all the resources that should have been created as part of that in Kubernetes.

Deploy the FIO Test Pod

The Mayastor CSI driver will cause the application pod and the corresponding Mayastor volume's NVMe target/controller ("Nexus") to be scheduled on the same Mayastor Node, in order to assist with restoration of volume and application availability in the event of a storage node failure.

In this version, applications using PVs provisioned by Mayastor can only be successfully scheduled on Mayastor Nodes. This behaviour is controlled by the local: parameter of the corresponding StorageClass, which by default is set to a value of true. This is therefore the only supported value for this release; setting a non-local configuration may cause scheduling of the application pod to fail, as the PV cannot be mounted on a worker node other than a Mayastor Node (MSN). This behaviour will change in a future release.

kubectl apply -f https://raw.githubusercontent.com/openebs/Mayastor/v1.0.2/deploy/fio.yaml
kind: Pod
apiVersion: v1
metadata:
  name: fio
spec:
  nodeSelector:
    openebs.io/engine: mayastor
  volumes:
    - name: ms-volume
      persistentVolumeClaim:
        claimName: ms-volume-claim
  containers:
    - name: fio
      image: nixery.dev/shell/fio
      args:
        - sleep
        - "1000000"
      volumeMounts:
        - mountPath: "/volume"
          name: ms-volume

Verify the Volume and the Deployment

We will now verify the Volume Claim and that the corresponding Volume and Mayastor Volume resources have been created and are healthy.

Verify the Volume Claim

The status of the PVC should be "Bound".

kubectl get pvc ms-volume-claim
NAME                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS     AGE
ms-volume-claim     Bound    pvc-fe1a5a16-ef70-4775-9eac-2f9c67b3cd5b   1Gi        RWO            mayastor-1       15s

Verify the Persistent Volume

Substitute the example volume name with that shown under the "VOLUME" heading of the output returned by the preceding "get pvc" command, as executed in your own cluster

kubectl get pv pvc-fe1a5a16-ef70-4775-9eac-2f9c67b3cd5b
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                       STORAGECLASS     REASON   AGE
pvc-fe1a5a16-ef70-4775-9eac-2f9c67b3cd5b   1Gi        RWO            Delete           Bound    default/ms-volume-claim     mayastor-1       16m

Verify the Mayastor Volume

To verify the status of the volume, the Mayastor kubectl plugin is used. The status of the volume should be "online".

kubectl mayastor get volumes
ID                                    REPLICAS  TARGET-NODE                ACCESSIBILITY STATUS  SIZE

18e30e83-b106-4e0d-9fb6-2b04e761e18a  3         aks-agentpool-12194210-0   nvmf           Online  1073741824 

Verify the application

Verify that the pod has been deployed successfully, having the status "Running". It may take a few seconds after creating the pod before it reaches that status, proceeding via the "ContainerCreating" state.

Note: The example FIO pod resource declaration included with this release references a PVC named ms-volume-claim, consistent with the example PVC created in this section of the quickstart. If you have elected to name your PVC differently, deploy the Pod using the example YAML, modifying the claimName field appropriately.

kubectl get pod fio
NAME   READY   STATUS    RESTARTS   AGE
fio    1/1     Running   0          34s

Run the FIO Tester Application

We now execute the FIO Test utility against the Mayastor PV for 60 seconds, checking that I/O is handled as expected and without errors. In this quickstart example, we use a pattern of random reads and writes, with a block size of 4k and a queue depth of 16.

kubectl exec -it fio -- fio --name=benchtest --size=800m --filename=/volume/test --direct=1 --rw=randrw --ioengine=libaio --bs=4k --iodepth=16 --numjobs=8 --time_based --runtime=60
benchtest: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
fio-3.20
Starting 1 process
benchtest: Laying out IO file (1 file / 800MiB)
Jobs: 1 (f=1): [m(1)][100.0%][r=376KiB/s,w=340KiB/s][r=94,w=85 IOPS][eta 00m:00s]
benchtest: (groupid=0, jobs=1): err= 0: pid=19: Thu Aug 27 20:31:49 2020
  read: IOPS=679, BW=2720KiB/s (2785kB/s)(159MiB/60011msec)
    slat (usec): min=6, max=19379, avg=33.91, stdev=270.47
    clat (usec): min=2, max=270840, avg=9328.57, stdev=23276.01
     lat (msec): min=2, max=270, avg= 9.37, stdev=23.29
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    4], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    4], 90.00th=[    7], 95.00th=[   45],
     | 99.00th=[  136], 99.50th=[  153], 99.90th=[  165], 99.95th=[  178],
     | 99.99th=[  213]
   bw (  KiB/s): min=  184, max= 9968, per=100.00%, avg=2735.00, stdev=3795.59, samples=119
   iops        : min=   46, max= 2492, avg=683.60, stdev=948.92, samples=119
  write: IOPS=678, BW=2713KiB/s (2778kB/s)(159MiB/60011msec); 0 zone resets
    slat (usec): min=6, max=22191, avg=45.90, stdev=271.52
    clat (usec): min=454, max=241225, avg=14143.39, stdev=34629.43
     lat (msec): min=2, max=241, avg=14.19, stdev=34.65
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
     | 30.00th=[    3], 40.00th=[    3], 50.00th=[    3], 60.00th=[    3],
     | 70.00th=[    3], 80.00th=[    4], 90.00th=[   22], 95.00th=[  110],
     | 99.00th=[  155], 99.50th=[  157], 99.90th=[  169], 99.95th=[  197],
     | 99.99th=[  228]
   bw (  KiB/s): min=  303, max= 9904, per=100.00%, avg=2727.41, stdev=3808.95, samples=119
   iops        : min=   75, max= 2476, avg=681.69, stdev=952.25, samples=119
  lat (usec)   : 4=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.02%, 4=82.46%, 10=7.20%, 20=1.62%, 50=1.50%
  lat (msec)   : 100=2.58%, 250=4.60%, 500=0.01%
  cpu          : usr=1.19%, sys=3.28%, ctx=134029, majf=0, minf=17
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=40801,40696,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=2720KiB/s (2785kB/s), 2720KiB/s-2720KiB/s (2785kB/s-2785kB/s), io=159MiB (167MB), run=60011-60011msec
  WRITE: bw=2713KiB/s (2778kB/s), 2713KiB/s-2713KiB/s (2778kB/s-2778kB/s), io=159MiB (167MB), run=60011-60011msec

Disk stats (read/write):
  sdd: ios=40795/40692, merge=0/9, ticks=375308/568708, in_queue=891648, util=99.53%

If no errors are reported in the output then Mayastor has been correctly configured and is operating as expected. You may create and consume additional Persistent Volumes with your own test applications.


Welcome to Mayastor!

The native NVMe-oF CAS engine of OpenEBS

This documentation is staged on an unstable branch

It is provided here as a convenience to the writers, reviewers and editors of Mayastor's user documentation, to provide easy visualisation of content before publishing. It MUST NOT be used to guide the installation or use of Mayastor, other than for pre-release testing outside of production. As a staging branch, it is expected at times to contain errors and to be incomplete.

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

What is Mayastor?

Mayastor is a performance optimised "Container Attached Storage" (CAS) solution of the CNCF project OpenEBS. The goal of OpenEBS is to extend Kubernetes with a declarative data plane, providing flexible persistent storage for stateful applications.

Design goals for Mayastor include:

  • Highly available, durable persistence of data

  • To be readily deployable and easily managed by autonomous SRE or development teams

  • To be a low-overhead abstraction for NVMe-based storage devices

Mayastor incorporates Intel's Storage Performance Development Kit (SPDK). It has been designed from the ground up to leverage the protocol and compute efficiency of NVMe-oF semantics and the performance capabilities of the latest generation of solid-state storage devices, in order to deliver a storage abstraction with performance overhead measured to be within the range of single-digit percentages.

By comparison, most "shared everything" storage systems are widely thought to impart an overhead of at least 40% (and sometimes as much as 80% or more) as compared to the capabilities of the underlying devices or cloud volumes; additionally, traditional shared storage scales in an unpredictable manner as I/O from many workloads interact and compete for resources.

While Mayastor utilizes NVMe-oF it does not require NVMe devices or cloud volumes to operate and can work well with other device types.

Where can I find Mayastor?

Mayastor's source code and documentation are distributed amongst a number of GitHub repositories under the OpenEBS organisation. The following list describes some of the main repositories but is not exhaustive.

  • openebs/mayastor: contains the source code of the data plane components

  • openebs/mayastor-control-plane: contains the source code of the control plane components

  • openebs/mayastor-api: contains common protocol buffer definitions and OpenAPI specifications for Mayastor components

  • openebs/mayastor-dependencies: contains dependencies common to the control and data plane repositories

  • openebs/mayastor-extensions: contains components and utilities that provide extended functionalities like ease of installation, monitoring and observability aspects

  • openebs/mayastor-docs: contains Mayastor's user documentation

Deploy Mayastor

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

Overview

The steps and commands which follow are intended only for use in conjunction with Mayastor version(s) 2.1.x and above.

Before deploying and using Mayastor, please consult the Known Issues section of this guide.

Installation via helm

The Mayastor Helm chart now includes the Dynamic Local Persistent Volume (LocalPV) provisioner as the default option for provisioning storage to etcd and Loki. This simplifies storage setup by utilizing local volumes within your Kubernetes cluster. For etcd, the chart uses the mayastor-etcd-localpv storage class, and for Loki, it utilizes the mayastor-loki-localpv storage class. These storage classes are bundled with the Mayastor chart, ensuring that your etcd and Loki instances are configured to use OpenEBS LocalPV volumes efficiently. The /var/local/{{ .Release.Name }} paths should be persistent across reboots.

  1. Add the OpenEBS Mayastor Helm repository.

helm repo add mayastor https://openebs.github.io/mayastor-extensions/ 
"mayastor" has been added to your repositories

Run the following command to discover all the stable versions of the added chart repository:

helm search repo mayastor --versions
 NAME             	CHART VERSION	APP VERSION  	DESCRIPTION                       
mayastor/mayastor	2.4.0        	2.4.0       	Mayastor Helm chart for Kubernetes

To discover all the versions (including unstable versions), execute: helm search repo mayastor --devel --versions

  2. Run the following command to install Mayastor version 2.4.0.

helm install mayastor mayastor/mayastor -n mayastor --create-namespace --version 2.4.0
NAME: mayastor
LAST DEPLOYED: Thu Sep 22 18:59:56 2022
NAMESPACE: mayastor
STATUS: deployed
REVISION: 1
NOTES:
OpenEBS Mayastor has been installed. Check its status by running:
$ kubectl get pods -n mayastor

For more information or to view the documentation, visit our website at https://openebs.io.

Verify the status of the pods by running the command:

kubectl get pods -n mayastor
NAME                                         READY   STATUS    RESTARTS   AGE
mayastor-agent-core-6c485944f5-c65q6         2/2     Running   0          2m13s
mayastor-agent-ha-node-42tnm                 1/1     Running   0          2m14s
mayastor-agent-ha-node-45srp                 1/1     Running   0          2m14s
mayastor-agent-ha-node-tzz9x                 1/1     Running   0          2m14s
mayastor-api-rest-5c79485686-7qg5p           1/1     Running   0          2m13s
mayastor-csi-controller-65d6bc946-ldnfb      3/3     Running   0          2m13s
mayastor-csi-node-f4fgd                      2/2     Running   0          2m13s
mayastor-csi-node-ls9m4                      2/2     Running   0          2m13s
mayastor-csi-node-xtcfc                      2/2     Running   0          2m13s
mayastor-etcd-0                              1/1     Running   0          2m13s
mayastor-etcd-1                              1/1     Running   0          2m13s
mayastor-etcd-2                              1/1     Running   0          2m13s
mayastor-io-engine-f2wm6                     2/2     Running   0          2m13s
mayastor-io-engine-kqxs9                     2/2     Running   0          2m13s
mayastor-io-engine-m44ms                     2/2     Running   0          2m13s
mayastor-loki-0                              1/1     Running   0          2m13s
mayastor-obs-callhome-5f47c6d78b-fzzd7       1/1     Running   0          2m13s
mayastor-operator-diskpool-b64b9b7bb-vrjl6   1/1     Running   0          2m13s
mayastor-promtail-cglxr                      1/1     Running   0          2m14s
mayastor-promtail-jc2mz                      1/1     Running   0          2m14s
mayastor-promtail-mr8nf                      1/1     Running   0          2m14s

Configure Mayastor

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

Create DiskPool(s)

What is a DiskPool?

When a node allocates storage capacity for a replica of a persistent volume (PV) it does so from a DiskPool. Each node may create and manage zero, one, or more such pools. The ownership of a pool by a node is exclusive. A pool can manage only one block device, which constitutes the entire data persistence layer for that pool and thus defines its maximum storage capacity.

A pool is defined declaratively, through the creation of a corresponding DiskPool custom resource on the cluster. The DiskPool must be created in the same namespace where Mayastor has been deployed. User-configurable parameters of this resource type include a unique name for the pool, the node name on which it will be hosted, and a reference to a disk device which is accessible from that node. The pool definition requires the reference to its member block device to adhere to a discrete range of schemas, each associated with a specific access mechanism, transport, or device type.

Mayastor versions before 2.0.1 had an upper limit on the number of retry attempts in the case of failure of create events in the DiskPool (DSP) operator. With this release, the upper limit has been removed, which ensures that the DiskPool operator indefinitely reconciles with the CR.

Existing schemas in Custom Resource (CR) definitions (in older versions) will be updated from v1alpha1 to v1beta1 after upgrading to Mayastor 2.4 and above. To resolve errors encountered pertaining to the upgrade, refer to the OpenEBS documentation.

Permissible Schemes for spec.disks under DiskPool CR

It is highly recommended to specify the disk using a unique device link that remains unaltered across node reboots. Examples of such device links are: by-path or by-id.

Easy way to retrieve device link for a given node: kubectl mayastor get block-devices worker

DEVNAME       DEVTYPE  SIZE      AVAILABLE  MODEL                             DEVPATH                                                           MAJOR  MINOR  DEVLINKS 
/dev/nvme0n1  disk     894.3GiB  yes        Dell DC NVMe PE8010 RI U.2 960GB  /devices/pci0000:30/0000:30:02.0/0000:31:00.0/nvme/nvme0/nvme0n1  259    0      "/dev/disk/by-id/nvme-eui.ace42e00164f0290", "/dev/disk/by-path/pci-0000:31:00.0-nvme-1", "/dev/disk/by-dname/nvme0n1", "/dev/disk/by-id/nvme-Dell_DC_NVMe_PE8010_RI_U.2_960GB_SDA9N7266I110A814"

Usage of the device name (for example, /dev/sdx) is not advised, as it may change if the node reboots, which might cause data corruption.

| Type | Format | Example |
| --- | --- | --- |
| Disk (non-PCI) with disk-by-guid reference (best practice) | Device File | aio:///dev/disk/by-id/ OR uring:///dev/disk/by-id/ |
| Asynchronous Disk (AIO) | Device File | /dev/sdx |
| Asynchronous Disk I/O (AIO) | Device File | aio:///dev/sdx |
| io_uring | Device File | uring:///dev/sdx |

Once a node has created a pool it is assumed that it henceforth has exclusive use of the associated block device; it should not be partitioned, formatted, or shared with another application or process. Any pre-existing data on the device will be destroyed.

A RAM drive isn't suitable for use in production as it uses volatile memory for backing the data. The memory for this disk emulation is allocated from the hugepages pool. Make sure to allocate sufficient additional hugepages resource on any storage nodes which will provide this type of storage.

Configure Pools for Use with this Quickstart

To get started, it is necessary to create and host at least one pool on one of the nodes in the cluster. The number of pools available limits the extent to which the synchronous N-way mirroring (replication) of PVs can be configured; the number of pools configured should be equal to or greater than the desired maximum replication factor of the PVs to be created. Also, while placing data replicas ensure that appropriate redundancy is provided. Mayastor's control plane will avoid placing more than one replica of a volume on the same node. For example, the minimum viable configuration for a Mayastor deployment which is intended to implement 3-way mirrored PVs must have three nodes, each having one DiskPool, with each of those pools having one unique block device allocated to it.

Using one or more of the following examples as templates, create the required type and number of pools.

cat <<EOF | kubectl create -f -
apiVersion: "openebs.io/v1beta1"
kind: DiskPool
metadata:
  name: pool-on-node-1
  namespace: mayastor
spec:
  node: workernode-1-hostname
  disks: ["/dev/disk/by-id/<id>"]
EOF
apiVersion: "openebs.io/v1beta1"
kind: DiskPool
metadata:
  name: INSERT_POOL_NAME_HERE
  namespace: mayastor
spec:
  node: INSERT_WORKERNODE_HOSTNAME_HERE
  disks: ["INSERT_DEVICE_URI_HERE"]

When using the examples given as guides to creating your own pools, remember to replace the values for the fields "metadata.name", "spec.node" and "spec.disks" as appropriate to your cluster's intended configuration. Note that whilst the "disks" parameter accepts an array of values, the current version of Mayastor supports only one disk device per pool.

Verify Pool Creation and Status

The status of DiskPools may be determined by reference to their cluster CRs. Available, healthy pools should report their State as online. Verify that the expected number of pools have been created and that they are online.

kubectl get dsp -n mayastor
NAME             NODE          STATE    POOL_STATUS   CAPACITY      USED   AVAILABLE
pool-on-node-1   node-1-14944  Created   Online        10724835328   0      10724835328
pool-on-node-2   node-2-14944  Created   Online        10724835328   0      10724835328
pool-on-node-3   node-3-14944  Created   Online        10724835328   0      10724835328

Create Mayastor StorageClass(s)

Mayastor dynamically provisions PersistentVolumes (PVs) based on StorageClass definitions created by the user. Parameters of the definition are used to set the characteristics and behaviour of its associated PVs. For a detailed description of these parameters see the storage class parameter description. Most importantly, the StorageClass definition is used to control the level of data protection afforded to it (that is, the number of synchronous data replicas which are maintained for purposes of redundancy). It is possible to create any number of StorageClass definitions, spanning all permitted parameter permutations.

We illustrate this quickstart guide with two examples of possible use cases; one which offers no data redundancy (i.e. a single data replica), and another having three data replicas. Both the example YAMLs given below have thin provisioning enabled. You can modify these as required to match your own desired test cases, within the limitations of the cluster under test.

cat <<EOF | kubectl create -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mayastor-1
parameters:
  ioTimeout: "30"
  protocol: nvmf
  repl: "1"
  thin: "true"
provisioner: io.openebs.csi-mayastor
EOF
cat <<EOF | kubectl create -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mayastor-3
parameters:
  ioTimeout: "30"
  protocol: nvmf
  repl: "3"
  thin: "true"
provisioner: io.openebs.csi-mayastor
EOF

The default installation of Mayastor includes the creation of a StorageClass with the name mayastor-single-replica. The replication factor is set to 1. Users may either use this StorageClass or create their own.
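To list the StorageClasses available after installation, including mayastor-single-replica and any classes created above (output omitted here; names depend on your configuration):

kubectl get sc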

Node Drain

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

The node drain functionality marks the node as unschedulable and then gracefully moves all the volume targets off the drained node. This feature is in line with the node drain functionality of Kubernetes.

To start the drain operation, execute:

kubectl-mayastor drain node <node_name> <label>

To get the list of nodes on which the drain operation has been performed, execute:

kubectl-mayastor get drain nodes

To halt the drain operation or to make the node schedulable again, execute:

kubectl-mayastor uncordon node <node_name> <label>

Storage Class Parameters

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

The StorageClass resource in Kubernetes is used to supply parameters to volumes when they are created. It is a convenient way of grouping volumes with common characteristics. All parameters take a string value. A brief explanation of each supported Mayastor parameter follows.

The storage class parameter local has been deprecated and is a breaking change in Mayastor version 2.0. Ensure that this parameter is not used.

"fsType"

The file system that will be used when mounting the volume. The supported file systems are ext4, xfs and btrfs, and the default file system when not specified is ext4. We recommend using xfs, which is considered to be more advanced and performant. Please ensure the requested filesystem driver is installed on all worker nodes in the cluster before using it.

"protocol"

The parameter protocol takes the value nvmf (NVMe over TCP). It is used to mount the volume (target) on the application node.

"repl"

The string value should be a number greater than zero. The Mayastor control plane will always try to keep this many copies of the data, if possible. If set to one, the volume does not tolerate any node failure. If set to two, it tolerates one node failure; if set to three, two node failures, and so on.

"thin"

The volumes can either be thick or thin provisioned. Adding the thin parameter to the StorageClass YAML allows the volume to be thinly provisioned. To do so, add thin: true under the parameters spec in the StorageClass YAML. When volumes are thinly provisioned, the user needs to monitor the pools; if the pools start to run out of space, then either new pools must be added or volumes deleted to prevent thinly provisioned volumes from becoming degraded or faulted. This is because when a pool with more than one replica runs out of space, Mayastor moves the largest out-of-space replica to another pool and then executes a rebuild. It then checks whether all the replicas have sufficient space; if not, it moves the next largest replica to another pool, and this process continues until all the replicas have sufficient space. The capacity usage on a pool can be monitored using the exporter metrics.

The agents.core.capacity.thin spec present in the Mayastor helm chart consists of the following configurable parameters that can be used to control the scheduling of thinly provisioned replicas (a minimal values sketch follows the list):

  1. poolCommitment parameter specifies the maximum allowed pool commitment limit (in percent).

  2. volumeCommitment parameter specifies the minimum amount of free space that must be present in each replica pool in order to create new replicas for an existing volume. This value is specified as a percentage of the volume size.

  3. volumeCommitmentInitial parameter specifies the minimum amount of free space that must be present in each replica pool in order to create new replicas for a new volume. This value is specified as a percentage of the volume size.
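A minimal helm values sketch for these parameters is shown below; the percentage values are illustrative assumptions rather than chart defaults.

agents:
  core:
    capacity:
      thin:
        poolCommitment: "250%"          # maximum allowed pool commitment
        volumeCommitment: "40%"         # free space required per pool to add replicas to an existing volume
        volumeCommitmentInitial: "40%"  # free space required per pool to place replicas of a new volume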

Note:

  1. By default, the volumes are provisioned as thick.

  2. For a pool of a particular size, say 10 Gigabytes, a volume > 10 Gigabytes cannot be created, as Mayastor currently does not support pool expansion.

  3. The replicas for a given volume can be either all thick or all thin. The same volume cannot have a combination of thick and thin replicas.

"stsAffinityGroup"

stsAffinityGroup represents a collection of volumes that belong to instances of Kubernetes StatefulSet. When a StatefulSet is deployed, each instance within the StatefulSet creates its own individual volume, which collectively forms the stsAffinityGroup. Each volume within the stsAffinityGroup corresponds to a pod of the StatefulSet.

This feature enforces the following rules to ensure the proper placement and distribution of replicas and targets so that there isn't any single point of failure affecting multiple instances of StatefulSet.

  1. Anti-Affinity among single-replica volumes: This rule ensures that replicas of different volumes are distributed in such a way that there is no single point of failure, by avoiding the colocation of replicas from different volumes on the same node.

  2. Anti-Affinity among multi-replica volumes: If the affinity group volumes have multiple replicas, they already have some level of redundancy. This feature ensures that in such cases, the replicas are distributed optimally for the stsAffinityGroup volumes.

  3. Anti-affinity among targets: The High Availability feature ensures that there is no single point of failure for the targets. The stsAffinityGroup ensures that in such cases, the targets are distributed optimally for the stsAffinityGroup volumes.

By default, the stsAffinityGroup feature is disabled. To enable it, modify the storage class YAML by setting the parameters.stsAffinityGroup parameter to true.
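For example, a StorageClass with the feature enabled might look like the following sketch (the class name is illustrative; the remaining parameters follow the descriptions above):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mayastor-sts-affinity       # illustrative name
parameters:
  protocol: nvmf
  repl: "1"
  stsAffinityGroup: "true"          # enable the StatefulSet affinity group behaviour
provisioner: io.openebs.csi-mayastor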

"cloneFsIdAsVolumeId"

cloneFsIdAsVolumeId is a setting for volume clones/restores with two options: true and false. By default, it is set to false.

  • When set to true, the created clone/restore's filesystem uuid will be set to the restore volume's uuid. This is important because some file systems, like XFS, do not allow duplicate filesystem uuid on the same machine by default.

  • When set to false, the created clone/restore's filesystem uuid will be the same as the original volume's uuid, but it will be mounted using the nouuid flag to bypass duplicate uuid validation.

This option needs to be set to true when using a btrfs filesystem, if the application using the restored volume is scheduled concurrently on the same node where the original volume is mounted.

Mayastor Kubectl Plugin

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

The Mayastor kubectl plugin can be used to view and manage Mayastor resources such as nodes, pools and volumes. It is also used for operations such as scaling the replica count of volumes.

Install kubectl plugin

The Mayastor kubectl plugin is available for the Linux platform. The binary for the plugin can be obtained from the Mayastor GitHub releases.

  • Add the downloaded Mayastor kubectl plugin under $PATH.

To verify the installation, execute:

kubectl mayastor -V
kubectl-plugin 1.0.0

Use kubectl plugin to retrieve data

Sample command to use kubectl plugin:

USAGE:
    kubectl-mayastor [OPTIONS] <SUBCOMMAND>

OPTIONS:
    -h, --help
            Print help information
    -j, --jaeger <JAEGER>
            Trace rest requests to the Jaeger endpoint agent
    -k, --kube-config-path <KUBE_CONFIG_PATH>
            Path to kubeconfig file
    -n, --namespace <NAMESPACE>
            Kubernetes namespace of mayastor service, defaults to mayastor [default: mayastor]
    -o, --output <OUTPUT>
            The Output, viz yaml, json [default: none]
    -r, --rest <REST>
            The rest endpoint to connect to
    -t, --timeout <TIMEOUT>
            Timeout for the REST operations [default: 10s]
    -V, --version
            Print version information
SUBCOMMANDS:
    cordon      'Cordon' resources
    drain       'Drain' resources
    dump        `Dump` resources
    get         'Get' resources
    help        Print this message or the help of the given subcommand(s)
    scale       'Scale' resources
    uncordon    'Uncordon' resources

You can use the plugin with the following options:

Get Mayastor Volumes

kubectl mayastor get volumes
ID                                    REPLICAS  TARGET-NODE  ACCESSIBILITY STATUS  SIZE

18e30e83-b106-4e0d-9fb6-2b04e761e18a  4         mayastor-1   nvmf          Online  10485761
0c08667c-8b59-4d11-9192-b54e27e0ce0f  4         mayastor-2   <none>        Online  10485761

Get Mayastor Pools

kubectl mayastor get pools
ID               TOTAL CAPACITY  USED CAPACITY  DISKS                                                     NODE      STATUS  MANAGED

mayastor-pool-1  5360320512      1111490560     aio:///dev/vdb?uuid=d8a36b4b-0435-4fee-bf76-f2aef980b833  kworker1  Online  true
mayastor-pool-2  5360320512      2172649472     aio:///dev/vdc?uuid=bb12ec7d-8fc3-4644-82cd-dee5b63fc8c5  kworker1  Online  true
mayastor-pool-3  5360320512      3258974208     aio:///dev/vdb?uuid=f324edb7-1aca-41ec-954a-9614527f77e1  kworker2  Online  false
    

Get Mayastor Nodes

kubectl mayastor get nodes

ID          GRPC ENDPOINT   STATUS
mayastor-2  10.1.0.7:10124  Online
mayastor-1  10.1.0.6:10124  Online
mayastor-3  10.1.0.8:10124  Online

All the above resource information can be retrieved for a particular resource using its ID. The command to do so is as follows: kubectl mayastor get <resource_name> <resource_id>

Scale the replica count of a volume

kubectl mayastor scale volume <volume_id> <replica_count>
Volume 0c08667c-8b59-4d11-9192-b54e27e0ce0f Scaled Successfully 🚀

Retrieve resource in any of the output formats (table, JSON or YAML)

Table is the default output format.

kubectl mayastor -ojson get <resource_type>
[{"spec":{"num_replicas":2,"size":67108864,"status":"Created","target":{"node":"ksnode-2","protocol":"nvmf"},"uuid":"5703e66a-e5e5-4c84-9dbe-e5a9a5c805db","topology":{"explicit":{"allowed_nodes":["ksnode-1","ksnode-3","ksnode-2"],"preferred_nodes":["ksnode-2","ksnode-3","ksnode-1"]}},"policy":{"self_heal":true}},"state":{"target":{"children":[{"state":"Online","uri":"bdev:///ac02cf9e-8f25-45f0-ab51-d2e80bd462f1?uuid=ac02cf9e-8f25-45f0-ab51-d2e80bd462f1"},{"state":"Online","uri":"nvmf://192.168.122.6:8420/nqn.2019-05.io.openebs:7b0519cb-8864-4017-85b6-edd45f6172d8?uuid=7b0519cb-8864-4017-85b6-edd45f6172d8"}],"deviceUri":"nvmf://192.168.122.234:8420/nqn.2019-05.io.openebs:nexus-140a1eb1-62b5-43c1-acef-9cc9ebb29425","node":"ksnode-2","rebuilds":0,"protocol":"nvmf","size":67108864,"state":"Online","uuid":"140a1eb1-62b5-43c1-acef-9cc9ebb29425"},"size":67108864,"status":"Online","uuid":"5703e66a-e5e5-4c84-9dbe-e5a9a5c805db"}}]

Retrieve replica topology for specific volumes

kubectl mayastor get volume-replica-topology <volume_id>
 ID                                    NODE         POOL    STATUS  CAPACITY  ALLOCATED  SNAPSHOTS  CHILD-STATUS  REASON  REBUILD 
 a34dbaf4-e81a-4091-b3f8-f425e5f3689b  io-engine-1  pool-1  Online  12MiB     0 B        12MiB      <none>        <none>  <none> 

The plugin requires access to the Mayastor REST server for execution. It gets the master node IP from the kube-config file. In case of any failure, the REST endpoint can be specified using the '--rest' flag.

List available volume snapshots

kubectl mayastor get volume-snapshots
 ID                                    TIMESTAMP             SOURCE-SIZE  ALLOCATED-SIZE  TOTAL-ALLOCATED-SIZE  SOURCE-VOL 
 25823425-41fa-434a-9efd-a356b70b5d7c  2023-07-07T13:20:17Z  10MiB        12MiB           12MiB                 ec4e66fd-3b33-4439-b504-d49aba53da26 

Limitations of kubectl plugin

  • The plugin currently does not have authentication support.

  • The plugin can operate only over HTTP.

High Availability

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

Mayastor 2.0 enhances High Availability (HA) of the volume target with the nexus switch-over feature. In the event of a target failure, the switch-over feature quickly detects the failure and spawns a new nexus to ensure I/O continuity. The HA feature consists of two components: the HA node agent (which runs in each csi-node pod) and the cluster agent (which runs alongside the agent-core). The HA node agent looks for I/O path failures from applications to their corresponding targets. If any such broken path is encountered, the HA node agent informs the cluster agent. The cluster agent then creates a new target on a different (live) node. Once the target is created, the node agent establishes a new path between the application and its corresponding target. The HA feature restores the broken path within seconds, ensuring negligible downtime.

The volume's replica count must be higher than 1 for a new target to be established as part of switch-over.

How do I disable this feature?

We strongly recommend keeping this feature enabled.

The HA feature is enabled by default; to disable it, pass the parameter --set=agents.ha.enabled=false with the helm install command.
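For example, disabling it at install time would look like this (release name and namespace follow the earlier install example):

helm install mayastor mayastor/mayastor -n mayastor --create-namespace --set=agents.ha.enabled=false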

Replica Rebuilds

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

In previous versions, the control plane ensured replica redundancy by monitoring all volume targets and checking for any volume target in the Degraded state, indicating that one or more replicas of that volume target were faulty. When a matching volume target was found, the faulty replica was removed, and a new replica was created and added to the volume target object. As part of adding the new child, the data-plane initiated a full rebuild from one of the existing Online replicas. The drawback of this approach was that even if a replica was inaccessible for only a short period (e.g., due to a node restart), a full rebuild was triggered. This may not have a significant impact on replicas of small sizes, but it is not desirable for large replicas.

The partial rebuild feature overcomes the above problem and helps achieve faster rebuild times. When a volume target encounters an I/O error on a child/replica, it marks the child as Faulted (removing it from the I/O path) and begins to maintain a write log for all subsequent writes. The Core agent starts a default 10-minute wait for the replica to come back. If the child's replica is online again within the timeout, the control plane requests the volume target to bring the child online and add it back to the I/O path, followed by a partial rebuild process using the aforementioned write log.

The control plane waits for 10 minutes before initiating the full rebuild process, as the --faulted-child-wait-period is set to 10 minutes. To configure this parameter, edit values.yaml.

Replica Rebuild History

The data-plane handles both full and partial replica rebuilds. To view the history of rebuilds that an existing volume target has undergone during its lifecycle, use the kubectl plugin as shown below.

To get the output in table format:

kubectl mayastor get rebuild-history {your_volume_UUID}
DST                                   SRC                                   STATE      TOTAL  RECOVERED  TRANSFERRED  IS-PARTIAL  START-TIME            END-TIME
b5de71a6-055d-433a-a1c5-2b39ade05d86  0dafa450-7a19-4e21-a919-89c6f9bd2a97  Completed  7MiB   7MiB       0 B          true        2023-07-04T05:45:47Z  2023-07-04T05:45:47Z

To get the output in JSON format:

kubectl mayastor get rebuild-history {your_volume_UUID} -ojson

For example: kubectl mayastor get rebuild-history e898106d-e735-4edf-aba2-932d42c3c58d -ojson

The volume's rebuild history records are stored and maintained as long as the volume target remains intact without any disruptions caused by node failures or recreation.

Node Cordon

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

Cordoning a node marks or taints the node as unschedulable. This prevents the scheduler from deploying new resources on that node. However, the resources that were deployed prior to cordoning off the node will remain intact.

This feature is in line with the node-cordon functionality of Kubernetes.

To add a label and cordon a node, execute:
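A sketch of the command, by analogy with the drain commands shown earlier (verify the exact syntax with kubectl-mayastor cordon --help):

kubectl-mayastor cordon node <node_name> <label>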

To get the list of cordoned nodes, execute:
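A sketch, assuming the same listing form as the drain commands:

kubectl-mayastor get cordon nodes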

To view the labels associated with a cordoned node, execute:
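A sketch; the exact subcommand form is an assumption:

kubectl-mayastor get cordon node <node_name>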

How to uncordon a node?

In order to make a node schedulable again, execute:
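As shown in the Node Drain section, uncordoning uses:

kubectl-mayastor uncordon node <node_name> <label>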

Monitoring

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

Pool metrics exporter

The Mayastor pool metrics exporter runs as a sidecar container within every io-engine pod and exposes pool usage metrics in Prometheus format. These metrics are exposed on port 9502 using an HTTP endpoint /metrics and are refreshed every five minutes.
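To inspect these metrics directly, one option is to port-forward to an io-engine pod and query the endpoint (the pod name below is taken from the earlier example output and is illustrative):

kubectl -n mayastor port-forward pod/mayastor-io-engine-f2wm6 9502:9502 &
curl http://localhost:9502/metrics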

Supported pool metrics


Stats exporter metrics

When callhome is enabled, the stats exporter operates within the obs-callhome-stats container, located in the callhome pod. The statistics are made accessible through an HTTP endpoint at port 9090, specifically using the /stats route.

Supported stats metrics


Integrating exporter with Prometheus monitoring stack

  1. To install, add the Prometheus-stack helm chart and update the repo.

Then, install the Prometheus monitoring stack and set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues to false. This enables Prometheus to discover custom ServiceMonitor for Mayastor.
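A sketch of those two steps using the community kube-prometheus-stack chart (release name and namespace are illustrative):

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack -n prometheus --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false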

  2. Next, install the ServiceMonitor resource to select services and specify their underlying endpoint objects.
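A minimal ServiceMonitor sketch; the selector labels and port name are assumptions and must be adjusted to match the Mayastor pool-metrics-exporter Service in your cluster:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mayastor-monitoring
  labels:
    app: mayastor
spec:
  selector:
    matchLabels:
      app: mayastor        # assumption: match the labels on the pool-metrics-exporter Service
  endpoints:
    - port: metrics        # assumption: the named Service port that exposes 9502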

Upon successful integration of the exporter with the Prometheus stack, the metrics will be available on the port 9090 and HTTP endpoint /metrics.


CSI metrics exporter

Volume Snapshots

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

Volume snapshots are copies of a persistent volume at a specific point in time. They can be used to restore a volume to a previous state or create a new volume. Mayastor provides support for industry standard copy-on-write (COW) snapshots, which is a popular methodology for taking snapshots by keeping track of only those blocks that have changed. Mayastor incremental snapshot capability enhances data migration and portability in Kubernetes clusters across different cloud providers or data centers. Using standard kubectl commands, you can seamlessly perform operations on snapshots and clones in a fully Kubernetes-native manner.

Use cases for volume snapshots include:

  • Efficient replication for backups.

  • Utilization of clones for troubleshooting.

  • Development against a read-only copy of data.

Volume snapshots allow the creation of read-only incremental copies of volumes, enabling you to maintain a history of your data. These volume snapshots possess the following characteristics:

  • Consistency: The data stored within a snapshot remains consistent across all replicas of the volume, whether local or remote.

  • Immutability: Once a snapshot is successfully created, the data contained within it cannot be modified.

Currently, Mayastor supports the following operations related to volume snapshots:

  1. Creating a snapshot for a PVC

  2. Listing available snapshots for a PVC

  3. Deleting a snapshot for a PVC


Prerequisites

  1. Deploy and configure Mayastor by following the steps given in this guide, and create disk pools.

  2. Create a Mayastor StorageClass with single replica.

Currently, Mayastor only supports snapshots for volumes with a single replica. Snapshot support for volumes with more than one replica will be available in future versions.

  3. Create a PVC using the steps given in this guide and check that the status of the PVC is Bound. Copy the PVC name, for example, ms-volume-claim.

  4. (Optional) Create an application by following the steps given in this guide.


Create a Snapshot

You can create a snapshot (with or without an application) using the PVC. Follow the steps below to create a volume snapshot:

Step 1: Create a Kubernetes VolumeSnapshotClass object

Apply VolumeSnapshotClass details
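A VolumeSnapshotClass sketch for the Mayastor CSI driver (the class name is illustrative; the driver matches the provisioner used earlier in this guide):

kind: VolumeSnapshotClass
apiVersion: snapshot.storage.k8s.io/v1
metadata:
  name: csi-mayastor-snapshotclass        # illustrative name
  annotations:
    snapshot.storage.kubernetes.io/is-default-class: "true"
driver: io.openebs.csi-mayastor
deletionPolicy: Delete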

Step 2: Create the snapshot

Apply the snapshot
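A VolumeSnapshot sketch referencing the example PVC from this guide (the snapshot name is illustrative):

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: mayastor-pvc-snap                 # illustrative name
spec:
  volumeSnapshotClassName: csi-mayastor-snapshotclass
  source:
    persistentVolumeClaimName: ms-volume-claim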

When a snapshot is created on a thick-provisioned volume, the storage system automatically converts it into a thin-provisioned volume.


List Snapshots

To retrieve the details of the created snapshots, use the following command:
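Snapshot objects can be listed with standard kubectl commands, for example:

kubectl get volumesnapshot
kubectl get volumesnapshotcontent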


Delete a Snapshot

To delete a snapshot, use the following command:
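For example (the snapshot name is illustrative):

kubectl delete volumesnapshot mayastor-pvc-snap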

This website/page will be End-of-life (EOL) after 31 August 2024. We recommend you to visit for the latest Mayastor documentation (v2.6 and above).

Mayastor is a performance optimised "Container Attached Storage" (CAS) solution of the CNCF project . The goal of OpenEBS is to extend Kubernetes with a declarative data plane, providing flexible persistent storage for stateful applications.

Mayastor incorporates Intel's . It has been designed from the ground up to leverage the protocol and compute efficiency of NVMe-oF semantics, and the performance capabilities of the latest generation of solid-state storage devices, in order to deliver a storage abstraction with performance overhead measured to be within the range of single-digit percentages.

: contains the source code of the data plane components

: contains the source code of the control plane components

: contains common protocol buffer definitions and OpenAPI specifications for Mayastor components

: contains dependencies common to the control and data plane repositories

: contains components and utilities that provide extended functionalities like ease of installation, monitoring and observability aspects

: contains Mayastor's user documenation

This website/page will be End-of-life (EOL) after 31 August 2024. We recommend you to visit for the latest Mayastor documentation (v2.6 and above).

Before deploying and using Mayastor please consult the section of this guide.

This website/page will be End-of-life (EOL) after 31 August 2024. We recommend you to visit for the latest Mayastor documentation (v2.6 and above).

Existing schemas in Custom Resource (CR) definitions (in older versions) will be updated from v1alpha1 to v1beta1 after upgrading to Mayastor 2.4 and above. To resolve errors encountered pertaining to the upgrade, click .

Mayastor dynamically provisions PersistentVolumes (PVs) based on StorageClass definitions created by the user. Parameters of the definition are used to set the characteristics and behaviour of its associated PVs. For a detailed description of these parameters see . Most importantly StorageClass definition is used to control the level of data protection afforded to it (that is, the number of synchronous data replicas which are maintained, for purposes of redundancy). It is possible to create any number of StorageClass definitions, spanning all permitted parameter permutations.

Both the example YAMLs given below have enabled. You can modify these as required to match your own desired test cases, within the limitations of the cluster under test.

This website/page will be End-of-life (EOL) after 31 August 2024. We recommend you to visit for the latest Mayastor documentation (v2.6 and above).

The node drain functionality marks the node as unschedulable and then gracefully moves all the volume targets off the drained node. This feature is in line with the .

This website/page will be End-of-life (EOL) after 31 August 2024. We recommend you to visit for the latest Mayastor documentation (v2.6 and above).

The volumes can either be thick or thin provisioned. Adding the thin parameter to the StorageClass YAML allows the volume to be thinly provisioned; to do so, add thin: "true" under the parameters spec in the StorageClass YAML. When volumes are thinly provisioned, the user needs to monitor the pools, and if a pool starts to run out of space, either new pools must be added or volumes deleted to prevent thinly provisioned volumes from becoming degraded or faulted. This is because when a pool with more than one replica runs out of space, Mayastor moves the largest out-of-space replica to another pool and then executes a rebuild. It then checks whether all the replicas have sufficient space; if not, it moves the next largest replica to another pool, and this process continues until all the replicas have sufficient space.
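For reference, a minimal StorageClass sketch with thin provisioning enabled might look like the following (the class name is illustrative; the other parameters mirror the examples used elsewhere in this guide):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mayastor-thin          # illustrative name
parameters:
  protocol: nvmf
  repl: "3"
  thin: "true"                 # volumes of this class are thinly provisioned
provisioner: io.openebs.csi-mayastor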

The capacity usage on a pool can be monitored using exporter metrics.

The High Availability feature ensures that there is no single point of failure for the targets. The stsAffinityGroup setting ensures that, in such cases, the targets are distributed optimally for the stsAffinityGroup volumes.
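As a sketch, and assuming the feature is enabled through the stsAffinityGroup parameter of the StorageClass (the class name below is illustrative), the definition could look like:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mayastor-sts           # illustrative name
parameters:
  protocol: nvmf
  repl: "3"
  stsAffinityGroup: "true"     # assumption: distributes targets for StatefulSet volumes
provisioner: io.openebs.csi-mayastor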

This website/page will be End-of-life (EOL) after 31 August 2024. We recommend you visit the OpenEBS Documentation for the latest Mayastor documentation (v2.6 and above).

The Mayastor kubectl plugin is available for the Linux platform. The binary for the plugin can be obtained from the Mayastor project's GitHub releases.

This website/page will be End-of-life (EOL) after 31 August 2024. We recommend you visit the OpenEBS Documentation for the latest Mayastor documentation (v2.6 and above).


When eventing is activated, the stats exporter operates within the obs-callhome-stats container, located in the callhome pod. The statistics are made accessible through an HTTP endpoint at port 9090, specifically using the /stats route.
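As a quick check, the endpoint can be queried directly by forwarding the port locally; the pod name below is a placeholder for the actual callhome pod in your cluster:

kubectl port-forward -n mayastor pod/<callhome-pod-name> 9090:9090
curl http://localhost:9090/stats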


This website/page will be End-of-life (EOL) after 31 August 2024. We recommend you visit the OpenEBS Documentation for the latest Mayastor documentation (v2.6 and above).

Deploy and configure Mayastor by following the steps given in this guide, and create disk pools.

Create a PVC by following the steps given in this guide, and check that the status of the PVC is Bound.

(Optional) Create an application by following the steps given in this guide.

kubectl mayastor get rebuild-history {your_volume_UUID} 
DST                                   SRC                                   STATE      TOTAL  RECOVERED  TRANSFERRED  IS-PARTIAL  START-TIME            END-TIME
b5de71a6-055d-433a-a1c5-2b39ade05d86  0dafa450-7a19-4e21-a919-89c6f9bd2a97  Completed  7MiB   7MiB       0 B          true        2023-07-04T05:45:47Z  2023-07-04T05:45:47Z
b5de71a6-055d-433a-a1c5-2b39ade05d86  0dafa450-7a19-4e21-a919-89c6f9bd2a97  Completed  7MiB   7MiB       0 B          true        2023-07-04T05:45:46Z  2023-07-04T05:45:46Z
kubectl mayastor get rebuild-history {your_volume_UUID} -ojson
{
  "targetUuid": "c9eb4172-e90c-40ca-9db0-26b2ae372b28",
  "records": [
    {
      "childUri": "nvmf://10.1.0.9:8420/nqn.2019-05.io.openebs:b5de71a6-055d-433a-a1c5-2b39ade05d86?uuid=b5de71a6-055d-433a-a1c5-2b39ade05d86",
      "srcUri": "bdev:///0dafa450-7a19-4e21-a919-89c6f9bd2a97?uuid=0dafa450-7a19-4e21-a919-89c6f9bd2a97",
      "rebuildJobState": "Completed",
      "blocksTotal": 14302,
      "blocksRecovered": 14302,
      "blocksTransferred": 0,
      "blocksRemaining": 0,
      "blockSize": 512,
      "isPartial": true,
      "startTime": "2023-07-04T05:45:47.765932276Z",
      "endTime": "2023-07-04T05:45:47.766825274Z"
    },
    {
      "childUri": "nvmf://10.1.0.9:8420/nqn.2019-05.io.openebs:b5de71a6-055d-433a-a1c5-2b39ade05d86?uuid=b5de71a6-055d-433a-a1c5-2b39ade05d86",
      "srcUri": "bdev:///0dafa450-7a19-4e21-a919-89c6f9bd2a97?uuid=0dafa450-7a19-4e21-a919-89c6f9bd2a97",
      "rebuildJobState": "Completed",
      "blocksTotal": 14302,
      "blocksRecovered": 14302,
      "blocksTransferred": 0,
      "blocksRemaining": 0,
      "blockSize": 512,
      "isPartial": true,
      "startTime": "2023-07-04T05:45:46.242015389Z",
      "endTime": "2023-07-04T05:45:46.242927463Z"
    }
  ]
}
kubectl-mayastor cordon node <node_name> <label>
kubectl-mayastor get cordon nodes
kubectl-mayastor get cordon node <node_name>
kubectl-mayastor uncordon node <node_name> <label>

The above command allows the Kubernetes scheduler to deploy resources on the node.
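A typical maintenance sequence using the commands above might look like the following sketch (the node name and label are illustrative):

kubectl-mayastor cordon node worker-node-1 storage-maintenance
kubectl-mayastor get cordon node worker-node-1
kubectl-mayastor uncordon node worker-node-1 storage-maintenance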

Name | Type | Unit | Description
disk_pool_total_size_bytes | Gauge | Integer | Total size of the pool
disk_pool_used_size_bytes | Gauge | Integer | Used size of the pool
disk_pool_status | Gauge | Integer | Status of the pool (0, 1, 2, 3) = {"Unknown", "Online", "Degraded", "Faulted"}
disk_pool_committed_size_bytes | Gauge | Integer | Committed size of the pool in bytes

# HELP disk_pool_status disk-pool status
# TYPE disk_pool_status gauge
disk_pool_status{node="worker-0",name="mayastor-disk-pool"} 1
# HELP disk_pool_total_size_bytes total size of the disk-pool in bytes
# TYPE disk_pool_total_size_bytes gauge
disk_pool_total_size_bytes{node="worker-0",name="mayastor-disk-pool"} 5.360320512e+09
# HELP disk_pool_used_size_bytes used disk-pool size in bytes
# TYPE disk_pool_used_size_bytes gauge
disk_pool_used_size_bytes{node="worker-0",name="mayastor-disk-pool"} 2.147483648e+09
# HELP disk_pool_committed_size_bytes Committed size of the pool in bytes
# TYPE disk_pool_committed_size_bytes gauge
disk_pool_committed_size_bytes{node="worker-0", name="mayastor-disk-pool"} 9663676416
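Once these metrics are scraped by Prometheus (for example via the ServiceMonitor shown later in this section), pool utilisation can be queried through the Prometheus HTTP API; the server address is a placeholder and the expression is only a sketch:

curl -s 'http://<prometheus-server>:9090/api/v1/query' \
  --data-urlencode 'query=disk_pool_used_size_bytes / disk_pool_total_size_bytes'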

Name | Type | Unit | Description
pools_created | Gauge | Integer | Total successful pool creation attempts
pools_deleted | Gauge | Integer | Total successful pool deletion attempts
volumes_created | Gauge | Integer | Total successful volume creation attempts
volumes_deleted | Gauge | Integer | Total successful volume deletion attempts

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install mayastor prometheus-community/kube-prometheus-stack -n mayastor --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mayastor-monitoring
  labels:
    app: mayastor
spec:
  selector:
    matchLabels:
      app: mayastor
  endpoints:
  - port: metrics

Name | Type | Unit | Description
kubelet_volume_stats_available_bytes | Gauge | Integer | Size of the available/usable volume (in bytes)
kubelet_volume_stats_capacity_bytes | Gauge | Integer | The total size of the volume (in bytes)
kubelet_volume_stats_used_bytes | Gauge | Integer | Used size of the volume (in bytes)
kubelet_volume_stats_inodes | Gauge | Integer | The total number of inodes
kubelet_volume_stats_inodes_free | Gauge | Integer | The total number of usable inodes
kubelet_volume_stats_inodes_used | Gauge | Integer | The total number of inodes that have been utilized to store metadata
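These statistics are exposed by the kubelet on each node. If Prometheus is not scraping them yet, they can be sampled directly through the Kubernetes API server's node proxy; this is a generic Kubernetes technique (substitute the node name), not a Mayastor-specific command:

kubectl get --raw /api/v1/nodes/<node-name>/proxy/metrics | grep kubelet_volume_stats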

cat <<EOF | kubectl create -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mayastor-1
parameters:
  ioTimeout: "30"
  protocol: nvmf
  repl: "1"
provisioner: io.openebs.csi-mayastor
EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mayastor-1
parameters:
  ioTimeout: "30"
  protocol: nvmf
  repl: "1"
provisioner: io.openebs.csi-mayastor
kubectl get pvc
NAME                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS     AGE
ms-volume-claim     Bound    pvc-fe1a5a16-ef70-4775-9eac-2f9c67b3cd5b   1Gi        RWO            mayastor-1       15s
cat <<EOF | kubectl create -f -
kind: VolumeSnapshotClass
apiVersion: snapshot.storage.k8s.io/v1
metadata:
  name: csi-mayastor-snapshotclass
  annotations:
    snapshot.storage.kubernetes.io/is-default-class: "true"
driver: io.openebs.csi-mayastor
deletionPolicy: Delete
EOF
kind: VolumeSnapshotClass
apiVersion: snapshot.storage.k8s.io/v1
metadata:
  name: csi-mayastor-snapshotclass
  annotations:
    snapshot.storage.kubernetes.io/is-default-class: "true"
driver: io.openebs.csi-mayastor
deletionPolicy: Delete

Parameters | Type | Description
Name | String | Custom name of the snapshot class
Driver | String | CSI provisioner of the storage provider being requested to create a snapshot (io.openebs.csi-mayastor)

kubectl apply -f class.yaml
volumesnapshotclass.snapshot.storage.k8s.io/csi-mayastor-snapshotclass created
cat <<EOF | kubectl create -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: mayastor-pvc-snap-1
spec:
  volumeSnapshotClassName: csi-mayastor-snapshotclass
  source:
    persistentVolumeClaimName: ms-volume-claim   
EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: mayastor-pvc-snap-1
spec:
  volumeSnapshotClassName: csi-mayastor-snapshotclass
  source:
    persistentVolumeClaimName: ms-volume-claim   

Parameters | Type | Description
Name | String | Name of the snapshot
VolumeSnapshotClassName | String | Name of the created snapshot class
PersistentVolumeClaimName | String | Name of the PVC. Example: ms-volume-claim

kubectl apply -f snapshot.yaml
volumesnapshot.snapshot.storage.k8s.io/mayastor-pvc-snap-1 created
kubectl get volumesnapshot 
NAME                  READYTOUSE   SOURCEPVC         SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS                SNAPSHOTCONTENT                                    CREATIONTIME   AGE
mayastor-pvc-snap-1   true         ms-volume-claim                           1Gi           csi-mayastor-snapshotclass   snapcontent-174d9cd9-dfb2-4e53-9b56-0f3f783518df   57s            57s
kubectl get volumesnapshotcontent
NAME                                               READYTOUSE   RESTORESIZE   DELETIONPOLICY   DRIVER                    VOLUMESNAPSHOTCLASS          VOLUMESNAPSHOT        VOLUMESNAPSHOTNAMESPACE   AGE
snapcontent-174d9cd9-dfb2-4e53-9b56-0f3f783518df   true         1073741824    Delete           io.openebs.csi-mayastor   csi-mayastor-snapshotclass   mayastor-pvc-snap-1   default                   87s
kubectl delete volumesnapshot mayastor-pvc-snap-1  
volumesnapshot.snapshot.storage.k8s.io "mayastor-pvc-snap-1" deleted

Supportability

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

Using the supportability tool

To bundle Mayastor's complete system information, execute:

kubectl mayastor dump system -n mayastor -d <output_directory_path>

To view all the available options and sub-commands that can be used with the dump command, execute:

kubectl mayastor dump --help
`Dump` resources

Usage: kubectl-mayastor dump [OPTIONS] <COMMAND>

Commands:
  system  Collects entire system information
  etcd    Collects information from etcd
  help    Print this message or the help for the given subcommand(s)

Options:
  -r, --rest <REST>
          The rest endpoint to connect to
  -t, --timeout <TIMEOUT>
          Specifies the timeout value to interact with other modules of system [default: 10s]
  -k, --kube-config-path <KUBE_CONFIG_PATH>
          Path to kubeconfig file
  -s, --since <SINCE>
          Period states to collect all logs from last specified duration [default: 24h]
  -l, --loki-endpoint <LOKI_ENDPOINT>
          LOKI endpoint, if left empty then it will try to parse endpoint from Loki service(K8s service resource), if the tool is unable to parse from service then logs will be collected using Kube-apiserver
  -e, --etcd-endpoint <ETCD_ENDPOINT>
          Endpoint of ETCD service, if left empty then will be parsed from the internal service name
  -d, --output-directory-path <OUTPUT_DIRECTORY_PATH>
          Output directory path to store archive file [default: ./]
  -n, --namespace <NAMESPACE>
          Kubernetes namespace of mayastor service [default: mayastor]
  -o, --output <OUTPUT>
          The Output, viz yaml, json [default: none]
  -j, --jaeger <JAEGER>
          Trace rest requests to the Jaeger endpoint agent
  -h, --help
          Print help
          
Supportability - collects state & log information of services and dumps it to a tar file. 
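For example, drawing only on the options shown in the help output above, a system-wide bundle covering the last 48 hours, or an etcd-only dump, could be collected as follows (the output directory is illustrative):

kubectl mayastor dump system -n mayastor -d /tmp/mayastor-support --since 48h
kubectl mayastor dump etcd -n mayastor -d /tmp/mayastor-support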

The archive files generated by the dump command are stored in the specified output directories. The tables below specify the path and the content that will be stored in each archive file.

Topology Information

Folder Path | Mayastor Resource | Node Name | File Name | Description
./topology/node | node | node-01 | node-01-topology.json | Topology of node-01 (all node topologies will be available here)
./topology/pool | pool | pool-01 | pool-01-topology.json | Topology of pool-01 (all pool topologies will be available here)
./topology/volume | volume | volume-01 | volume-01-topology.json | Topology information of volume-01 (all volume topologies will be available here)

Historical Logs

Folder Path | Host Name (optional) | Component | File Name
./logs/core-agents | - | agent-core | loki-agent-core.log
./logs/rest | - | api-rest | loki-api-rest.log
./logs/csi-controller | - | csi-attacher | loki-csi-attacher.log
./logs/csi-controller | - | csi-controller | loki-csi-controller.log
./logs/csi-controller | - | csi-provisioner | loki-csi-provisioner.log
./logs/diskpool-operator | - | operator-diskpool | loki-operator-disk-pool.log
./logs/mayastor | node-02 | csi-driver-registrar | node-02-loki-csi-driver-registrar.log
./logs/mayastor | node-01 | csi-node | node-01-loki-csi-node.log
./logs/mayastor | node-01 | io-engine | node-01-loki-mayastor.log
./logs/blot | node-02 | io-engine | node-02-loki-mayastor.log
./logs/etcd | node-03 | etcd | node-03-loki-etcd.log

Configuration details of Kubernetes resources

Folder Path | Component | File Name
./k8s_resources/configurations/ | agent-core (Deployment) | mayastor-agent-core.yaml
./k8s_resources/configurations/ | api-rest | mayastor-api-rest.yaml
./k8s_resources/configurations/ | csi-controller (Deployment) | mayastor-csi-controller.yaml
./k8s_resources/configurations/ | csi-node (DaemonSet) | mayastor-csi-node.yaml
./k8s_resources/configurations/ | etcd (StatefulSet) | mayastor-etcd.yaml
./k8s_resources/configurations/ | loki (StatefulSet) | mayastor-loki.yaml
./k8s_resources/configurations/ | operator-diskpool | mayastor-operator-disk-pool.yaml
./k8s_resources/configurations/ | promtail (DaemonSet) | mayastor-promtail.yaml
./k8s_resources/configurations/ | io-engine (DaemonSet) | io-engine.yaml
./k8s_resources/configurations/ | disk_pools | k8s_diskPools.yaml
./k8s_resources | events | k8s_events.yaml
./k8s_resources/configurations/ | all pods (deployed under the same namespace as Mayastor) | pods.yaml
./k8s_resources | volume snapshot classes | volume_snapshot_classes.yaml
./k8s_resources | volume snapshot contents | volume_snapshot_contents.yaml

etcd dump information

Folder Path | Component | File Name
./ | etcd | etcd_dump

Supportability tool logs

Folder Path | Component | File Name
./ | Support-tool | support_tool_logs.log

Does the supportability tool expose sensitive data?

The supportability tool generates support bundles, which are used for debugging purposes. These bundles are created in response to the user's invocation of the tool and can be transmitted only by the user. Below is the information collected by the supportability tool that might be identified as 'sensitive' based on the organization's data protection/privacy commitments and security policies.

Logs: The default installation of Mayastor includes the deployment of a log aggregation subsystem based on Grafana Loki. All the pods deployed in the same namespace as Mayastor and labelled with openebs.io/logging=true will have their logs incorporated within this centralized collector. These logs may include the following information:

  • Kubernetes (K8s) node hostnames

  • IP addresses

    • container addresses

  • API endpoints

    • Mayastor

    • K8s

  • Container names

  • K8s Persistent Volume names (provisioned by Mayastor)

  • DiskPool names

  • Block device details (except the content)

K8s Definition Files: The support bundle includes definition files for all the Mayastor components. Some of these are listed below:

  • Deployments

  • DaemonSets

  • StatefulSets

  • VolumeSnapshotClass

  • VolumeSnapshotContent

K8s Events: The archive files generated by the supportability tool contain information on all the events of the Kubernetes cluster present in the same namespace as Mayastor.

etcd Dump: The default installation of Mayastor deploys an etcd instance for its exclusive use. This key-value pair is used to persist state information for Mayastor-managed objects. These key-value pairs are required for diagnostic and troubleshooting purposes. The etcd dump archive file consists of the following information:

  • Kubernetes node hostnames

  • IP addresses

  • PVC/PV names

  • Container names

    • Mayastor

    • User applications within the mayastor namespace

  • Block device details (except data content)

Note: The list provided above is frequently reviewed and updated by the Mayastor maintainers. However, it might not be fully exhaustive, given that "sensitive information" is a subjective term.

Upgrade

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

Supported upgrade paths

  • From 2.0.x to 2.4.0

  • The upgrade process is generally non-disruptive for volumes with a replication factor greater than 1, provided all replicas are healthy prior to starting the upgrade.

To upgrade Mayastor deployment on the Kubernetes cluster, execute:

kubectl mayastor upgrade

To view all the available options and sub-commands that can be used with the upgrade command, execute:

kubectl mayastor upgrade -h
`Upgrade` the deployment

Usage: kubectl-mayastor upgrade [OPTIONS]

Options:
  -d, --dry-run
          Display all the validations output but will not execute upgrade
  -r, --rest <REST>
          The rest endpoint to connect to
  -D, --skip-data-plane-restart
          If set then upgrade will skip the io-engine pods restart
  -k, --kube-config-path <KUBE_CONFIG_PATH>
          Path to kubeconfig file
  -S, --skip-single-replica-volume-validation
          If set then it will continue with upgrade without validating singla replica volume
  -R, --skip-replica-rebuild
          If set then upgrade will skip the repilca rebuild in progress validation
  -C, --skip-cordoned-node-validation
          If set then upgrade will skip the cordoned node validation
  -o, --output <OUTPUT>
          The Output, viz yaml, json [default: none]
  -j, --jaeger <JAEGER>
          Trace rest requests to the Jaeger endpoint agent
  -n, --namespace <NAMESPACE>
          Kubernetes namespace of mayastor service, defaults to mayastor [default: mayastor]
  -h, --help
          Print help

To view the status of upgrade, execute:

kubectl mayastor get upgrade-status
Upgrade From: 2.0.0
Upgrade To: 2.4.0
Upgrade Status: Successfully upgraded Mayastor

To view the logs of upgrade job, execute:

kubectl logs <upgrade-job-pod-name> -n <mayastor-namespace>
  1. The time taken to upgrade is directly proportional to the number of Mayastor nodes and Mayastor volumes.

  2. To upgrade to a particular Mayastor version, ensure you are using the same version of kubectl plugin.

  3. The above process of upgrade creates one Job in the namespace where Mayastor is installed, one ClusterRole, one ClusterRoleBinding and one ServiceAccount.
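Drawing only on the options listed in the help output above, the upgrade validations can be previewed without performing the upgrade, and the io-engine pod restart can be skipped, as in the following sketch:

kubectl mayastor upgrade --dry-run -n mayastor
kubectl mayastor upgrade --skip-data-plane-restart -n mayastor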

Performance Tips

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

CPU isolation

Mayastor will fully utilize each CPU core that it was configured to run on. It will spawn a thread on each and the thread will run in an endless loop serving tasks dispatched to it without sleeping or blocking. There are also other Mayastor threads that are not bound to the CPU and those are allowed to block and sleep. However, the bound threads (also called reactors) rely on being interrupted by the kernel and other userspace processes as little as possible. Otherwise, the latency of IO may suffer.

Ideally, the only thing interrupting Mayastor's reactors would be the kernel's time-based interrupts responsible for CPU accounting. However, achieving that is far from trivial. The isolcpus option that we will be using does not prevent:

  • kernel threads, and

  • other k8s pods

from running on the isolated CPUs. However, it does prevent system services, including the kubelet, from interfering with Mayastor.

Set Linux kernel boot parameter

Note that the best way to accomplish this step may differ, based on the Linux distro that you are using.

Add the isolcpus kernel boot parameter to GRUB_CMDLINE_LINUX_DEFAULT in the grub configuration file, with a value which identifies the CPUs to be isolated (indexing starts from zero here). The location of the configuration file to change is typically /etc/default/grub, but it may vary. For example, when running Ubuntu 20.04 on AWS EC2, the boot parameters are in /etc/default/grub.d/50-cloudimg-settings.cfg.

In the following example we assume a system with 4 CPU cores in total, and that the third and the fourth CPU cores are to be dedicated to Mayastor.

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=2,3"

Update grub

sudo update-grub
Sourcing file `/etc/default/grub'
Sourcing file `/etc/default/grub.d/40-force-partuuid.cfg'
Sourcing file `/etc/default/grub.d/50-cloudimg-settings.cfg'
Sourcing file `/etc/default/grub.d/init-select.cfg'
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.8.0-29-generic
Found initrd image: /boot/microcode.cpio /boot/initrd.img-5.8.0-29-generic
Found linux image: /boot/vmlinuz-5.4.0-1037-aws
Found initrd image: /boot/microcode.cpio /boot/initrd.img-5.4.0-1037-aws
Found Ubuntu 20.04.2 LTS (20.04) on /dev/xvda1
done

Reboot the system

sudo reboot

Verify isolcpus

Basic verification is by outputting the boot parameters of the currently running kernel:

cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.8.0-29-generic root=PARTUUID=7213a253-01 ro console=tty1 console=ttyS0 nvme_core.io_timeout=4294967295 isolcpus=2,3 panic=-1

You can also print a list of isolated CPUs:

cat /sys/devices/system/cpu/isolated
2-3

Update mayastor helm chart for CPU core specification

To allot specific CPU cores for Mayastor's reactors, follow these steps:

  1. Execute the following command to update Mayastor's configuration. Replace <namespace> with the appropriate Kubernetes namespace where Mayastor is deployed.

kubectl mayastor upgrade -n <namespace> --set-args 'io_engine.coreList={2,3}'

In the above command, io_engine.coreList={2,3} specifies that Mayastor's reactors should operate on the third and fourth CPU cores.

The cores are numbered from 0.
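To confirm that the reactors ended up on the intended cores, the CPU placement of the io-engine threads can be inspected from a shell on the worker node; this is a generic Linux check rather than a Mayastor command, and it assumes the data plane process is named io-engine:

ps -eLo psr,comm | grep io-engine

The psr column reports the processor each thread is running on; with the configuration above it should show only the isolated cores (2 and 3 in this example).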

I/O Path Description

This section provides an overview of the topology and function of the Mayastor data plane. Developer level documentation is maintained within the project's GitHub repository.

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

Glossary of Terms

Mayastor Instance

An instance of the mayastor binary running inside a Mayastor container, which is encapsulated by a Mayastor Pod.

Nexus

Mayastor terminology. The data plane object which acts as a volume's target: it presents the volume to initiators over the front-end (NVMe-oF) and dispatches I/O to one or more children, each of which is backed by a replica.

Pool / Storage Pool / Mayastor Storage Pool (MSP)

Mayastor's volume management abstraction. Block devices contributing storage capacity to a Mayastor deployment do so by their inclusion within configured storage pools. Each Mayastor node can host zero or more pools and each pool can "contain" a single base block device as a member. The total capacity of the pool is therefore determined by the size of that device. Pools can only be hosted on nodes running an instance of a mayastor pod.

Multiple volumes can share the capacity of one pool but thin provisioning is not supported. Volumes cannot span multiple pools for the purposes of creating a volume larger in size than could be accommodated by the free capacity in any one pool.

Bdev

A block device abstraction within the SPDK framework; bdevs may be layered, with virtual bdevs routing I/O to other bdevs beneath them.

  • base bdev - Handles I/O directly, e.g. a representation of a physical SSD device

Replica

Mayastor terminology. An lvol bdev (a "logical volume", created within a pool and consuming pool capacity) which is being exported by a Mayastor instance, for consumption by a nexus (local or remote to the exporting instance) as a "child"

Child

Mayastor terminology. An NVMe controller created and owned by a given nexus, which handles I/O downstream from the nexus' target by routing it to a replica associated with that child.

A nexus has a minimum of one child, which must be local (local: exported as a replica from a pool hosted by the same mayastor instance as hosts the nexus itself). If the Mayastor volume being exported by the nexus is derived from a StorageClass with a replication factor greater than 1 (i.e. synchronous N-way mirroring is enabled), then the nexus will have additional children, up to the desired number of data copies.

Export

To allow the discovery of, and acceptance of I/O for, a volume by a client initiator, over a Mayastor storage target.

Basics of I/O Flow

Non-Replicated Volume I/O Path

 ____________________________________________________________
|                         Front-end                          |
|                          NVMe-oF                           |
|                        (user space)                        |
|____________________________________________________________|
                              |                              
 _____________________________v______________________________
|  [Nexus]                    |  I/O path                    |
|                             |                              |  
|                     ________V________                      |
|                    |        |         |                    |
|                    |    NexusChild    |                    |
|                    |                  |                    |
|                    |________|_________|                    |
|_____________________________|______________________________| 
                              |                     
                          <loopback>
                              |  
                        ______V________ 
                       |    Replica    |
                       |    (local)    |   
                       |==== pool =====|
                       |               |
                       |    +----+     |
                       |    |lvol|     |  
                       |    +----+     |   
                       |_______________| 
                              |
                        ______V________ 
                       |  base bdev    |
                       |_______________|
                              |
                              V
                          DISK DEVICE
                         e.g. /dev/sda                              

For volumes based on a StorageClass defined as having a replication factor of 1, a single data copy is maintained by Mayastor. The I/O path is largely (entirely, if using malloc:/// pool devices) constrained to within the bounds of a single mayastor instance, which hosts both the volume's nexus and the storage pool in use as its persistence layer.

Each mayastor instance presents a user-space storage target over NVMe-oF TCP. Worker nodes mounting a Mayastor volume for a scheduled application pod to consume are directed by Mayastor's CSI driver implementation to connect to the appropriate transport target for that volume and perform discovery, after which they are able to send I/O to it, directed at the volume in question. Regardless of how many volumes, and by extension how many nexus a mayastor instance hosts, all share the same target instances.

Application I/O received on a target for a volume is passed to the virtual bdev at the front-end of the nexus hosting that volume. In the case of a non-replicated volume, the nexus is composed of a single child, to which the I/O is necessarily routed. As a virtual bdev itself, the child handles the I/O by routing it to the next device, in this case the replica that was created for this child. In non-replicated scenarios, both the volume's nexus and the pool which hosts its replica are co-located within the same mayastor instance, hence the I/O is passed from child to replica using SPDK bdev routines, rather than a network level transport. At the pool layer, a blobstore maps the lvol bdev exported as the replica concerned to the base bdev on which the pool was constructed. From there, other than for malloc:/// devices, the I/O passes to the host kernel via either aio or io_uring, thence via the appropriate storage driver to the physical disk device.

The disk device's response to the I/O request returns along the same path to the caller's initiator.

Replicated Volume I/O Path

 _______________________________________________________________    _
|                           Front-end                           |    |
|                              NVMe-oF                          |    |
|                          (user space)                         |    |
|_______________________________________________________________|    |
                                |                                    |
 _______________________________|_______________________________     |
|  [Nexus]                      |  I/O path                     |    |
|           ____________________|____________________           |    |
|          |                    |                    |          |    |
|  ________V________    ________V________    ________V________  |    |
| |child 1          |  |child 2          |  |child 3          | |    |
| |                 |  |                 |  |                 | |    |
| |      NVMe-oF    |  |     NVMe-oF     |  |      NVMe-oF    | |    |
| |                 |  |                 |  |                 | |    |  Mayastor
| |________|________|  |________|________|  |________|________| |    |  Instance
|__________|____________________|____________________|__________|    |    "A"
           |                    |                    |               | 
       <network>            <loopback>           <network>           |
           |                    |                    |               | 
           |              ______V________            |               |
           |             |    Replica    |           |               |
           |             |    (local)    |           |               |
           |             |==== pool =====|           |               |  
           |             |               |           |               |
           |             |    +----+     |           |               |
           |             |    |lvol|     |           |               |  
           |             |    +----+     |           |               |  
           |             |_______________|           |               |
           |                    |                    |               |
           |              ______V________            |               |
           |             |  base bdev    |           |               |
           |             |_______________|           |              _|
           |                    |                    |
           |                    V                    |
           |                DISK DEVICE              |               ]  Node "A"
           |                                         |
           |                                         |
           |                                         |
           |                                         |  
     ______V________    _                      ______V________      _
    |    Replica    |    |                    |    Replica    |      |
    |    (remote)   |    |                    |    (remote)   |      |
    |  nvmf target  |    |                    |  nvmf target  |      |
    |               |    |                    |               |      |
    |==== pool =====|    |                    |==== pool =====|      |
    |               |    | Mayastor           |               |      | Mayastor
    |    +----+     |    | Instance           |    +----+     |      | Instance
    |    |lvol|     |    |   "B"              |    |lvol|     |      |   "C"
    |    +----+     |    |                    |    +----+     |      | 
    |_______________|    |                    |_______________|      |
           |             |                           |               |
     ______V________     |                     ______V________       |
    |  base bdev    |    |                    |  base bdev    |      |
    |_______________|   _|                    |_______________|     _|
           |                                         |
           V                                         V
       DISK DEVICE       ] Node "B"              DISK DEVICE         ] Node "C"

If the StorageClass on which a volume is based specifies a replication factor of greater than one, then a synchronous mirroring scheme is employed to maintain multiple redundant data copies. For a replicated volume, creation and configuration of the volume's nexus requires additional orchestration steps. Prior to creating the nexus, not only must a local replica be created and exported as for the non-replicated case, but the requisite count of additional remote replicas required to meet the replication factor must be created and exported from Mayastor instances other than that hosting the nexus itself. The control plane core-agent component will select appropriate pool candidates, which includes ensuring sufficient available capacity and that no two replicas are sited on the same Mayastor instance (which would compromise availability during co-incident failures). Once suitable replicas have been successfully exported, the control plane completes the creation and configuration of the volume's nexus, with the replicas as its children. In contrast to their local counterparts, remote replicas are exported, and so connected to by the nexus, over NVMe-oF using a user-mode initiator and target implementation from the SPDK.

Write I/O requests to the nexus are handled synchronously; the I/O is dispatched to all (healthy) children and only when completion is acknowledged by all is the I/O acknowledged to the calling initiator via the nexus front-end. Read I/O requests are similarly issued to all children, with just the first response returned to the caller.

Volume Restore from Snapshot

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

Volume restore from an existing snapshot creates an exact replica of a storage volume as captured at a specific point in time. Restores serve as an essential tool for data protection, recovery, and efficient management in Kubernetes environments. This section provides a step-by-step guide on how to create a volume restore.

Prerequisites

Step 1: Create a storage class

thin: "true" and repl: "1" is the only supported combination.

cat <<EOF | kubectl create -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mayastor-1-restore
parameters:
  ioTimeout: "30"
  protocol: nvmf
  repl: "1"
  thin: "true"
provisioner: io.openebs.csi-mayastor
EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mayastor-1-restore
parameters:
  ioTimeout: "30"
  protocol: nvmf
  repl: "1"
  thin: "true"
provisioner: io.openebs.csi-mayastor

Note the name of the StorageClass, which, in this example, is mayastor-1-restore.

Step 2: Create a snapshot

Note the snapshot's name, for example, pvc-snap-1.


Create a volume restore of the existing snapshot

After creating a snapshot, you can create a PersistentVolumeClaim (PVC) from it to generate the volume restore. Use the following command:

cat <<EOF | kubectl create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restore-pvc # add a name for your new volume
spec:
  storageClassName: mayastor-1-restore # add your storage class name
  dataSource:
    name: pvc-snap-1 # add your volumeSnapshot name
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restore-pvc # add a name for your new volume
spec:
  storageClassName: mayastor-1-restore # add your storage class name
  dataSource:
    name: pvc-snap-1 # add your volumeSnapshot name
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

By running this command, you create a new PVC named restore-pvc based on the specified snapshot. The restored volume will have the same data and configuration as the original volume had at the time of the snapshot.
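To confirm that the restore has been provisioned, the new claim can be checked in the usual way; once the volume has been created, its status should report Bound:

kubectl get pvc restore-pvc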

Tips and Tricks

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

What if I have no Disk Devices available that I can use for testing?

For basic test and evaluation purposes it may not always be practical or possible to allocate physical disk devices on a cluster to Mayastor for use within its pools. As a convenience, Mayastor supports two disk device type emulations for this purpose:

  • Memory-Backed Disks ("RAM drive")

  • File-Backed Disks

Memory-backed Disks are the most readily provisioned if node resources permit, since Mayastor will automatically create and configure them as it creates the corresponding pool. However they are the least durable option - since the data is held entirely within memory allocated to a Mayastor pod, should that pod be terminated and rescheduled by Kubernetes, that data will be lost. Therefore it is strongly recommended that this type of disk emulation be used only for short duration, simple testing. It must not be considered for production use.

File-backed disks, as their name suggests, store pool data within a file held on a file system which is accessible to the Mayastor pod hosting that pool. Their durability depends on how they are configured; specifically, on which type of volume mount they are located. If located on a path which uses Kubernetes ephemeral storage (e.g. EmptyDir), they may be no more persistent than a RAM drive would be. However, if placed on their own Persistent Volume (e.g. a Kubernetes Host Path volume) then they may be considered 'stable'. They are slightly less convenient to use than memory-backed disks, in that the backing files must be created by the user as a separate step preceding pool creation. However, file-backed disks can be significantly larger than RAM disks as they consume considerably less memory resource within the hosting Mayastor pod.

Using Memory-backed Disks

Creating a memory-backed disk emulation entails using the "malloc" uri scheme within the Mayastor pool resource definition.

apiVersion: "openebs.io/v1alpha1"
kind: DiskPool
metadata:
  name: mempool-1
  namespace: mayastor
spec:
  node: worker-node-1
  disks: ["malloc:///malloc0?size_mb=64"]

The example shown defines a pool named "mempool-1". The Mayastor pod hosted on "worker-node-1" automatically creates a 64MiB emulated disk for it to use, with the device identifier "malloc0" - provided that at least 64MiB of 2MiB-sized Huge Pages are available to that pod after the Mayastor container's own requirements have been satisfied.

The malloc:/// URI schema

The pool definition accepts URIs matching the malloc:/// schema within its disks field for the purposes of provisioning memory-based disks. The general format is:

malloc:///malloc<DeviceId>?<parameters>

Where <DeviceId> is an integer value which uniquely identifies the device on that node, and where the parameter collection <parameters> may include the following:

Parameter | Function | Value Type | Notes
size_mb | Specifies the requested size of the device in MiB | Integer | Mutually exclusive with "num_blocks"
num_blocks | Specifies the requested size of the device in terms of the number of addressable blocks | Integer | Mutually exclusive with "size_mb"
blk_size | Specifies the block size to be reported by the device in bytes | Integer (512 or 4096) | Optional. If not used, block size defaults to 512 bytes
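Combining the parameters above, and assuming standard URI query syntax with & separating parameters, a 128MiB device reporting a 4096-byte block size might be declared as:

disks: ["malloc:///malloc0?size_mb=128&blk_size=4096"]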

Note: Memory-based disk devices are not over-provisioned, and the memory allocated to them is drawn from the 2MiB-sized Huge Page resources available to the Mayastor pod. That is to say, creating a 64MiB device requires that at least 33 (32+1) 2MiB-sized pages are free for that Mayastor container instance to use. Satisfying the memory requirements of this disk type may require additional configuration on the worker node and changes to the resource request and limit spec of the Mayastor daemonset, in order to ensure that sufficient resource is available.
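Before creating a memory-backed pool, the availability of free 2MiB huge pages on the intended worker node can be checked with a standard Linux query, run on that node:

grep -i hugepages /proc/meminfo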

Using File-backed Disks

Mayastor can use file-based disk emulation in place of physical pool disk devices, by employing the aio:/// URI schema within the pool's declaration in order to identify the location of the file resource.

apiVersion: "openebs.io/v1alpha1"
kind: DiskPool
metadata:
  name: filepool-1
  namespace: mayastor
spec:
  node: worker-node-1
  disks: ["aio:///var/tmp/disk1.img"]
apiVersion: "openebs.io/v1alpha1"
kind: DiskPool
metadata:
  name: filepool-1
  namespace: mayastor
spec:
  node: worker-node-1
  disks: ["aio:///tmp/disk1.img?blk_size=4096"]

The examples shown seek to create a pool using a file named "disk1.img" as its member disk device, located within the Mayastor container's file system (/var/tmp in the first example, /tmp in the second). For this operation to succeed, the file must already exist at the specified location (the full path to the file must be given) and this path must be accessible by the Mayastor pod instance running on the corresponding node.

The aio:/// schema requires no other parameters but optionally, "blk_size" may be specified. Block size accepts a value of either 512 or 4096, corresponding to the emulation of either a 512-byte or 4kB sector size device. If this parameter is omitted the device defaults to using a 512-byte sector size.

File-based disk devices are not over-provisioned; to create a 10GiB pool disk device requires that a 10GiB-sized backing file exist on a file system on an accessible path.

The preferred method of creating a backing file is to use the linux truncate command. The following example demonstrates the creation of a 1GiB-sized file named disk1.img within the directory /tmp.

truncate -s 1G /tmp/disk1.img

Legacy Upgrade Support

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

As compared to Mayastor 1.0, the Mayastor 2.0 feature set introduces breaking changes in some of the components, due to which the upgrade process from 1.0 to 2.0 is not seamless. The list of such changes is given below:

ETCD:

  • Control Plane: The prefixes for control plane have changed from /namespace/$NAMESPACE/control-plane/ to /openebs.io/mayastor/apis/v0/clusters/$KUBE_SYSTEM_UID/namespaces/$NAMESPACE/

  • Data Plane: The Data Plane nexus information containing a list of healthy children has been moved from $nexus_uuid to /openebs.io/mayastor/apis/v0/clusters/$KUBE_SYSTEM_UID/namespaces/$NAMESPACE/volume/$volume_uuid/nexus/$nexus_uuid/info

RPC:

  • Control Plane: The RPC for the control plane has been changed from NATS to gRPC.

  • Data Plane: The registration heartbeat has been changed from NATS to gRPC.

Pool CRDs:

  • The pool CRDs have been renamed DiskPools (previously, MayastorPools).

  1. In order to start the upgrade process, the following previously deployed components have to be deleted.

  • To delete the control-plane components, execute:

kubectl delete deploy core-agents -n mayastor
kubectl delete deploy csi-controller -n mayastor
kubectl delete deploy msp-operator -n mayastor
kubectl delete deploy rest -n mayastor
kubectl delete ds mayastor-csi -n mayastor
  • Next, delete the associated RBAC operator. To do so, execute:

kubectl delete -f https://raw.githubusercontent.com/openebs/mayastor-control-plane/<version>/deploy/operator-rbac.yaml

In the above command, add the previously installed Mayastor version in the format v1.x.x.

  • Next, generate the new deployment manifests from the Mayastor 2.0 chart directory by executing the following helm template command:

helm template mayastor . -n mayastor --set etcd.persistence.storageClass="manual" --set loki-stack.loki.persistence.storageClassName="manual" --set etcd.initialClusterState=existing > helm_templates.yaml
  • Next, update the helm_templates.yaml file and add the following helm annotations and label to all the resources that are being created.

metadata: 
  annotations: 
    meta.helm.sh/release-name: $RELEASE_NAME 
    meta.helm.sh/release-namespace: $RELEASE_NAMESPACE 
  labels: 
    app.kubernetes.io/managed-by: Helm
  • Copy the etcd and io-engine specs from helm_templates.yaml and save them in two separate files, say mayastor_2.0_etcd.yaml and mayastor_io_v2.0.yaml. Once done, remove the etcd and io-engine specs from helm_templates.yaml; these components need to be upgraded separately.

  1. Install the new control-plane components using the helm_templates.yaml file.

kubectl apply -f helm_templates.yaml -n mayastor

In the above method of installation, the nexus (target) High Availability (HA) is disabled by default. The steps to enable HA are described later in this document.

  • Verify the status of the pods. Upon successful deployment, all the pods will be in a running state.

kubectl get pods -n mayastor
NAME                                         READY   STATUS    RESTARTS   AGE
mayastor-65cxj                               1/1     Running   0          9m42s
mayastor-agent-core-7d7f59bbb8-nwptm         2/2     Running   0          104s
mayastor-api-rest-6d774fbdd8-hgrxj           1/1     Running   0          104s
mayastor-csi-controller-6469fdf8db-bgs2h     3/3     Running   0          104s
mayastor-csi-node-7zm2v                      2/2     Running   0          104s
mayastor-csi-node-gs76x                      2/2     Running   0          104s
mayastor-csi-node-mfqfq                      2/2     Running   0          104s
mayastor-etcd-0                              1/1     Running   0          13m
mayastor-etcd-1                              1/1     Running   0          13m
mayastor-etcd-2                              1/1     Running   0          13m
mayastor-loki-0                              1/1     Running   0          104s
mayastor-mwc9r                               1/1     Running   0          9m42s
mayastor-obs-callhome-588688bb4d-w9dl4       1/1     Running   0          104s
mayastor-operator-diskpool-8cd67554d-c4zpz   1/1     Running   0          104s
mayastor-promtail-66cj6                      1/1     Running   0          104s
mayastor-promtail-cx9m7                      1/1     Running   0          104s
mayastor-promtail-t789g                      1/1     Running   0          104s
mayastor-x8vtc                               1/1     Running   0          9m42s
nats-0                                       2/2     Running   0          13m
nats-1                                       2/2     Running   0          12m
nats-2                                       2/2     Running   0          12m
  • Verify the etcd prefix and compat mode.

kubectl exec -it mayastor-etcd-0 -n mayastor -- bash
Defaulted container "etcd" out of: etcd, volume-permissions (init)
I have no name!@mayastor-etcd-0:/opt/bitnami/etcd$ export ETCDCTL_API=3
I have no name!@mayastor-etcd-0:/opt/bitnami/etcd$ etcdctl get --prefix ""
LINE NO.3 "mayastor_compat_v1":true - compat mode to look for. 
I have no name!@mayastor-etcd-0:/opt/bitnami/etcd$ etcdctl get --prefix ""
/openebs.io/mayastor/apis/v0/clusters/ce05eb25-50cc-400a-a57f-37e6a5ed9bef/namespaces/mayastor/CoreRegistryConfig/db98f8bb-4afc-45d0-85b9-24c99cc443f2
{"id":"db98f8bb-4afc-45d0-85b9-24c99cc443f2","registration":"Automatic","mayastor_compat_v1":true}
/openebs.io/mayastor/apis/v0/clusters/ce05eb25-50cc-400a-a57f-37e6a5ed9bef/namespaces/mayastor/NexusSpec/069feb5e-ec65-4e97-b094-99262dfc9f44
uuid=8929e13f-99c0-4830-bcc2-d4b12a541b97"}},{"Replica":{"uuid":"9455811d-480e-4522-b94a-4352ba65cb73","share_uri":"nvmf://10.20.30.64:8420/nqn.2019-05.io.openebs:9455811d-480e-4522-b94a-4352ba65cb73?uuid=9455811d-480e-4522-b94a-4352ba65cb73"}}],"size":1073741824,"spec_status":{"Created":"Online"},"share":"nvmf","managed":true,"owner":"bf207797-b23d-447a-8d3f-98d378acfa8a","operation":null}
/openebs.io/mayastor/apis/v0/clusters/ce05eb25-50cc-400a-a57f-37e6a5ed9bef/namespaces/mayastor/NodeSpec/worker-0
{"id":"worker-0","endpoint":"10.20.30.56:10124","labels":{}}
/openebs.io/mayastor/apis/v0/clusters/ce05eb25-50cc-400a-a57f-37e6a5ed9bef/namespaces/mayastor/NodeSpec/worker-1
{"id":"worker-1","endpoint":"10.20.30.57:10124","labels":{}}
/openebs.io/mayastor/apis/v0/clusters/ce05eb25-50cc-400a-a57f-37e6a5ed9bef/namespaces/mayastor/NodeSpec/worker-2
{"id":"worker-2","endpoint":"10.20.30.64:10124","labels":{}}
/openebs.io/mayastor/apis/v0/clusters/ce05eb25-50cc-400a-a57f-37e6a5ed9bef/namespaces/mayastor/PoolSpec/pool-0
{"node":"worker-0","id":"pool-0","disks":["/dev/nvme0n1"],"status":{"Created":"Online"},"labels":{"openebs.io/created-by":"operator-diskpool"},"operation":null}
/openebs.io/mayastor/apis/v0/clusters/ce05eb25-50cc-400a-a57f-37e6a5ed9bef/namespaces/mayastor/PoolSpec/pool-1
{"node":"worker-1","id":"pool-1","disks":["/dev/nvme0n1"],"status":{"Created":"Online"},"labels":{"openebs.io/created-by":"operator-diskpool"},"operation":null}
/openebs.io/mayastor/apis/v0/clusters/ce05eb25-50cc-400a-a57f-37e6a5ed9bef/namespaces/mayastor/PoolSpec/pool-2
{"node":"worker-2","id":"pool-2","disks":["/dev/nvme0n1"],"status":{"Created":"Online"},"labels":{"openebs.io/created-by":"operator-diskpool"},"operation":null}
/openebs.io/mayastor/apis/v0/clusters/ce05eb25-50cc-400a-a57f-37e6a5ed9bef/namespaces/mayastor/ReplicaSpec/8929e13f-99c0-4830-bcc2-d4b12a541b97
{"name":"8929e13f-99c0-4830-bcc2-d4b12a541b97","uuid":"8929e13f-99c0-4830-bcc2-d4b12a541b97","size":1073741824,"pool":"pool-1","share":"nvmf","thin":false,"status":{"Created":"online"},"managed":true,"owners":{"volume":"bf207797-b23d-447a-8d3f-98d378acfa8a"},"operation":null}
/openebs.io/mayastor/apis/v0/clusters/ce05eb25-50cc-400a-a57f-37e6a5ed9bef/namespaces/mayastor/ReplicaSpec/9455811d-480e-4522-b94a-4352ba65cb73
{"name":"9455811d-480e-4522-b94a-4352ba65cb73","uuid":"9455811d-480e-4522-b94a-4352ba65cb73","size":1073741824,"pool":"pool-2","share":"nvmf","thin":false,"status":{"Created":"online"},"managed":true,"owners":{"volume":"bf207797-b23d-447a-8d3f-98d378acfa8a"},"operation":null}
/openebs.io/mayastor/apis/v0/clusters/ce05eb25-50cc-400a-a57f-37e6a5ed9bef/namespaces/mayastor/ReplicaSpec/f65d9888-7699-4c44-8ee2-f6aaa58dead0
{"name":"f65d9888-7699-4c44-8ee2-f6aaa58dead0","uuid":"f65d9888-7699-4c44-8ee2-f6aaa58dead0","size":1073741824,"pool":"pool-0","share":"none","thin":false,"status":{"Created":"online"},"managed":true,"owners":{"volume":"bf207797-b23d-447a-8d3f-98d378acfa8a"},"operation":null}
/openebs.io/mayastor/apis/v0/clusters/ce05eb25-50cc-400a-a57f-37e6a5ed9bef/namespaces/mayastor/StoreLeaseLock/CoreAgent/5e6787b9b88cdc5b
/openebs.io/mayastor/apis/v0/clusters/ce05eb25-50cc-400a-a57f-37e6a5ed9bef/namespaces/mayastor/StoreLeaseOwner/CoreAgent
{"kind":"CoreAgent","lease_id":"5e6787b9b88cdc5b","instance_name":"mayastor-agent-core-7d7f59bbb8-nwptm"}
/openebs.io/mayastor/apis/v0/clusters/ce05eb25-50cc-400a-a57f-37e6a5ed9bef/namespaces/mayastor/VolumeSpec/bf207797-b23d-447a-8d3f-98d378acfa8a
{"uuid":"bf207797-b23d-447a-8d3f-98d378acfa8a","size":1073741824,"labels":{"local":"true"},"num_replicas":3,"status":{"Created":"Online"},"policy":{"self_heal":true},"topology":{"node":{"Explicit":{"allowed_nodes":["worker-1","worker-2","master","worker-0"],"preferred_nodes":["worker-2","master","worker-0","worker-1"]}},"pool":{"Labelled":{"exclusion":{},"inclusion":{"openebs.io/created-by":"operator-diskpool"}}}},"last_nexus_id":"069feb5e-ec65-4e97-b094-99262dfc9f44","operation":null,"thin":false,"target":{"node":"worker-0","nexus":"069feb5e-ec65-4e97-b094-99262dfc9f44","protocol":"nvmf","active":true,"config":{"controllerIdRange":{"start":1,"end":65519},"reservationKey":1,"reservationType":"ExclusiveAccess","preemptPolicy":"Holder"},"frontend":{"host_acl":[]}},"publish_context":null,"volume_group":null}
069feb5e-ec65-4e97-b094-99262dfc9f44
{"children":[{"healthy":true,"uuid":"f65d9888-7699-4c44-8ee2-f6aaa58dead0"},{"healthy":true,"uuid":"8929e13f-99c0-4830-bcc2-d4b12a541b97"},{"healthy":true,"uuid":"9455811d-480e-4522-b94a-4352ba65cb73"}],"clean_shutdown":false}
I have no name!@mayastor-etcd-0:/opt/bitnami/etcd$
  • Verify if the DiskPools are online.

kubectl get dsp -n mayastor
NAME     NODE       STATE    POOL_STATUS   CAPACITY       USED         AVAILABLE
pool-0   worker-0   Online   Online        374710730752   3221225472   371489505280
pool-1   worker-1   Online   Online        374710730752   3221225472   371489505280
pool-2   worker-2   Online   Online        374710730752   3221225472   371489505280
  • Next, verify the status of the volumes.

kubectl mayastor get volumes
ID                                    REPLICAS  TARGET-NODE  ACCESSIBILITY  STATUS SIZE        THIN-PROVISIONED
bf207797-b23d-447a-8d3f-98d378acfa8a  3         worker-0     nvmf           Online  1073741824  false
  1. After upgrading control-plane components, the data-plane pods have to be upgraded. To do so, deploy the io-engine DaemonSet from Mayastor's new version.

Using the command given below, the data-plane pods (now io-engine pods) will be upgraded to Mayastor v2.0.

kubectl apply -f mayastor_io_v2.0.yaml -n mayastor
  • Delete the previously deployed data-plane pods (mayastor-xxxxx). The data-plane pods need to be manually deleted as their update-strategy is set to delete. Upon successful deletion, the new io-engine pods will be up and running.

  • NATS has been replaced by gRPC for Mayastor versions 2.0 or later. Hence, the NATS components (StatefulSets and services) have to be removed from the cluster.

kubectl delete sts nats -n mayastor
kubectl delete svc nats -n mayastor
  1. After the control-plane and io-engine components, etcd has to be upgraded. Before starting the etcd upgrade, label the etcd PVs and PVCs for helm; use the example below to create a labels.yaml file, which is needed to make them helm-compatible.

metadata:
  annotations:
    meta.helm.sh/release-name: mayastor
    meta.helm.sh/release-namespace: mayastor
  labels:
    app.kubernetes.io/managed-by: Helm
kubectl patch pvc <data-mayastor-etcd-x> --patch-file labels.yaml -n mayastor
kubectl patch pv <etcd-volume-x> --patch-file labels.yaml -n mayastor
  • Next, deploy the etcd YAML. To deploy, execute:

kubectl apply -f mayastor_2.0_etcd.yaml -n mayastor

Now, verify the etcd prefix and compat mode. To do so, execute:

kubectl exec -it mayastor-etcd-0 -n mayastor -- bash
Defaulted container "etcd" out of: etcd, volume-permissions (init)
I have no name!@mayastor-etcd-0:/opt/bitnami/etcd$ export ETCDCTL_API=3
I have no name!@mayastor-etcd-0:/opt/bitnami/etcd$ etcdctl get --prefix ""
LINE NO.3 "mayastor_compat_v1":true - compat mode to look for. 
I have no name!@mayastor-etcd-0:/opt/bitnami/etcd$ etcdctl get --prefix ""
/openebs.io/mayastor/apis/v0/clusters/ce05eb25-50cc-400a-a57f-37e6a5ed9bef/namespaces/mayastor/CoreRegistryConfig/db98f8bb-4afc-45d0-85b9-24c99cc443f2
{"id":"db98f8bb-4afc-45d0-85b9-24c99cc443f2","registration":"Automatic","mayastor_compat_v1":true}
/openebs.io/mayastor/apis/v0/clusters/ce05eb25-50cc-400a-a57f-37e6a5ed9bef/namespaces/mayastor/NexusSpec/069feb5e-ec65-4e97-b094-99262dfc9f44
uuid=8929e13f-99c0-4830-bcc2-d4b12a541b97"}},{"Replica":{"uuid":"9455811d-480e-4522-b94a-4352ba65cb73","share_uri":"nvmf://10.20.30.64:8420/nqn.2019-05.io.openebs:9455811d-480e-4522-b94a-4352ba65cb73?uuid=9455811d-480e-4522-b94a-4352ba65cb73"}}],"size":1073741824,"spec_status":{"Created":"Online"},"share":"nvmf","managed":true,"owner":"bf207797-b23d-447a-8d3f-98d378acfa8a","operation":null}
/openebs.io/mayastor/apis/v0/clusters/ce05eb25-50cc-400a-a57f-37e6a5ed9bef/namespaces/mayastor/NodeSpec/worker-0
{"id":"worker-0","endpoint":"10.20.30.56:10124","labels":{}}
/openebs.io/mayastor/apis/v0/clusters/ce05eb25-50cc-400a-a57f-37e6a5ed9bef/namespaces/mayastor/NodeSpec/worker-1
{"id":"worker-1","endpoint":"10.20.30.57:10124","labels":{}}
/openebs.io/mayastor/apis/v0/clusters/ce05eb25-50cc-400a-a57f-37e6a5ed9bef/namespaces/mayastor/NodeSpec/worker-2
{"id":"worker-2","endpoint":"10.20.30.64:10124","labels":{}}
/openebs.io/mayastor/apis/v0/clusters/ce05eb25-50cc-400a-a57f-37e6a5ed9bef/namespaces/mayastor/PoolSpec/pool-0
{"node":"worker-0","id":"pool-0","disks":["/dev/nvme0n1"],"status":{"Created":"Online"},"labels":{"openebs.io/created-by":"operator-diskpool"},"operation":null}
/openebs.io/mayastor/apis/v0/clusters/ce05eb25-50cc-400a-a57f-37e6a5ed9bef/namespaces/mayastor/PoolSpec/pool-1
{"node":"worker-1","id":"pool-1","disks":["/dev/nvme0n1"],"status":{"Created":"Online"},"labels":{"openebs.io/created-by":"operator-diskpool"},"operation":null}
/openebs.io/mayastor/apis/v0/clusters/ce05eb25-50cc-400a-a57f-37e6a5ed9bef/namespaces/mayastor/PoolSpec/pool-2
{"node":"worker-2","id":"pool-2","disks":["/dev/nvme0n1"],"status":{"Created":"Online"},"labels":{"openebs.io/created-by":"operator-diskpool"},"operation":null}
/openebs.io/mayastor/apis/v0/clusters/ce05eb25-50cc-400a-a57f-37e6a5ed9bef/namespaces/mayastor/ReplicaSpec/8929e13f-99c0-4830-bcc2-d4b12a541b97
{"name":"8929e13f-99c0-4830-bcc2-d4b12a541b97","uuid":"8929e13f-99c0-4830-bcc2-d4b12a541b97","size":1073741824,"pool":"pool-1","share":"nvmf","thin":false,"status":{"Created":"online"},"managed":true,"owners":{"volume":"bf207797-b23d-447a-8d3f-98d378acfa8a"},"operation":null}
/openebs.io/mayastor/apis/v0/clusters/ce05eb25-50cc-400a-a57f-37e6a5ed9bef/namespaces/mayastor/ReplicaSpec/9455811d-480e-4522-b94a-4352ba65cb73
{"name":"9455811d-480e-4522-b94a-4352ba65cb73","uuid":"9455811d-480e-4522-b94a-4352ba65cb73","size":1073741824,"pool":"pool-2","share":"nvmf","thin":false,"status":{"Created":"online"},"managed":true,"owners":{"volume":"bf207797-b23d-447a-8d3f-98d378acfa8a"},"operation":null}
/openebs.io/mayastor/apis/v0/clusters/ce05eb25-50cc-400a-a57f-37e6a5ed9bef/namespaces/mayastor/ReplicaSpec/f65d9888-7699-4c44-8ee2-f6aaa58dead0
{"name":"f65d9888-7699-4c44-8ee2-f6aaa58dead0","uuid":"f65d9888-7699-4c44-8ee2-f6aaa58dead0","size":1073741824,"pool":"pool-0","share":"none","thin":false,"status":{"Created":"online"},"managed":true,"owners":{"volume":"bf207797-b23d-447a-8d3f-98d378acfa8a"},"operation":null}
/openebs.io/mayastor/apis/v0/clusters/ce05eb25-50cc-400a-a57f-37e6a5ed9bef/namespaces/mayastor/StoreLeaseLock/CoreAgent/5e6787b9b88cdc5b
/openebs.io/mayastor/apis/v0/clusters/ce05eb25-50cc-400a-a57f-37e6a5ed9bef/namespaces/mayastor/StoreLeaseOwner/CoreAgent
{"kind":"CoreAgent","lease_id":"5e6787b9b88cdc5b","instance_name":"mayastor-agent-core-7d7f59bbb8-nwptm"}
/openebs.io/mayastor/apis/v0/clusters/ce05eb25-50cc-400a-a57f-37e6a5ed9bef/namespaces/mayastor/VolumeSpec/bf207797-b23d-447a-8d3f-98d378acfa8a
{"uuid":"bf207797-b23d-447a-8d3f-98d378acfa8a","size":1073741824,"labels":{"local":"true"},"num_replicas":3,"status":{"Created":"Online"},"policy":{"self_heal":true},"topology":{"node":{"Explicit":{"allowed_nodes":["worker-1","worker-2","master","worker-0"],"preferred_nodes":["worker-2","master","worker-0","worker-1"]}},"pool":{"Labelled":{"exclusion":{},"inclusion":{"openebs.io/created-by":"operator-diskpool"}}}},"last_nexus_id":"069feb5e-ec65-4e97-b094-99262dfc9f44","operation":null,"thin":false,"target":{"node":"worker-0","nexus":"069feb5e-ec65-4e97-b094-99262dfc9f44","protocol":"nvmf","active":true,"config":{"controllerIdRange":{"start":1,"end":65519},"reservationKey":1,"reservationType":"ExclusiveAccess","preemptPolicy":"Holder"},"frontend":{"host_acl":[]}},"publish_context":null,"volume_group":null}
069feb5e-ec65-4e97-b094-99262dfc9f44
{"children":[{"healthy":true,"uuid":"f65d9888-7699-4c44-8ee2-f6aaa58dead0"},{"healthy":true,"uuid":"8929e13f-99c0-4830-bcc2-d4b12a541b97"},{"healthy":true,"uuid":"9455811d-480e-4522-b94a-4352ba65cb73"}],"clean_shutdown":false}
    I have no name!@mayastor-etcd-0:/opt/bitnami/etcd$
  1. Once all the components have been upgraded, the HA module can be enabled via the helm upgrade command.

helm upgrade --install mayastor . -n mayastor --set etcd.persistence.storageClass="manual" --set loki-stack.loki.persistence.storageClassName="manual" --set agents.ha.enabled="true"
Release "mayastor" does not exist. Installing it now.
NAME: mayastor
LAST DEPLOYED: Tue Apr 25 19:20:53 2023
NAMESPACE: mayastor
STATUS: deployed
REVISION: 1
NOTES:
OpenEBS Mayastor has been installed. Check its status by running:
$ kubectl get pods -n mayastor
  1. This concludes the legacy upgrade process. Run the commands below to verify the upgrade:

helm list -n mayastor
NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                           APP VERSION
mayastor        mayastor        1               2023-04-25 19:20:53.43928058 +0000 UTC  deployed        mayastor-2.1.0                  2.1.0
kubectl get pods -n mayastor
NAME                                         READY   STATUS    RESTARTS   AGE
mayastor-agent-core-7d7f59bbb8-nwptm         2/2     Running   0          34m
mayastor-agent-ha-node-fblrn                 1/1     Running   0          6m29s
mayastor-agent-ha-node-g6rf9                 1/1     Running   0          6m29s
mayastor-agent-ha-node-ktjvz                 1/1     Running   0          6m29s
mayastor-api-rest-6d774fbdd8-hgrxj           1/1     Running   0          34m
mayastor-csi-controller-6469fdf8db-bgs2h     3/3     Running   0          34m
mayastor-csi-node-7zm2v                      2/2     Running   0          34m
mayastor-csi-node-gs76x                      2/2     Running   0          34m
mayastor-csi-node-mfqfq                      2/2     Running   0          34m
mayastor-etcd-0                              1/1     Running   0          4m7s
mayastor-etcd-1                              1/1     Running   0          5m16s
mayastor-etcd-2                              1/1     Running   0          6m28s
mayastor-io-engine-6n6bh                     2/2     Running   0          25m
mayastor-io-engine-7gpsj                     2/2     Running   0          25m
mayastor-io-engine-95jjn                     2/2     Running   0          25m
mayastor-loki-0                              1/1     Running   0          34m
mayastor-obs-callhome-588688bb4d-w9dl4       1/1     Running   0          34m
mayastor-operator-diskpool-8cd67554d-c4zpz   1/1     Running   0          34m
mayastor-promtail-66cj6                      1/1     Running   0          34m
mayastor-promtail-cx9m7                      1/1     Running   0          34m
mayastor-promtail-t789g                      1/1     Running   0          34m
nats-0                                       2/2     Running   0          45m
nats-1                                       2/2     Running   0          45m
nats-2                                       2/2     Running   0          45m
kubectl mayastor get volumes
ID                                    REPLICAS  TARGET-NODE  ACCESSIBILITY  STATUS  SIZE        THIN-PROVISIONED
bf207797-b23d-447a-8d3f-98d378acfa8a  3         worker-0     nvmf           Online  1073741824  false

Tested Third Party Software

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

Worker Node OS

Cluster Orchestration

Kubernetes versions:

  • v1.25.10

  • v1.23.7

  • v1.22.10

  • v1.21.13

Migration for Replicated DB

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

This documentation provides a comprehensive guide on migrating cStor application volumes to Mayastor. We utilize Velero for the backup and restoration process, enabling a seamless transition from a cStor cluster to Mayastor. This example specifically focuses on a GKE cluster.

Velero offers support for the backup and restoration of Kubernetes volumes attached to pods directly from the volume's file system. This is known as File System Backup (FSB) or Pod Volume Backup. The data movement is facilitated through the use of modules from free, open-source backup tools such as Restic (which is the tool of choice in this guide).

  • For detailed instructions on Velero basic installation, visit https://velero.io/docs/v1.11/basic-install/.

Replica Operations

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

Basics

When a Mayastor volume is provisioned based on a StorageClass which has a replication factor greater than one (set by its repl parameter), the control plane will attempt to maintain, through a 'Kubernetes-like' reconciliation loop, that number of identical copies of the volume's data ("replicas"; a replica is a nexus target "child") at any point in time. When a volume is first provisioned the control plane will attempt to create the required number of replicas, whilst adhering to its internal heuristics for their location within the cluster (which will be discussed shortly). If it succeeds, the volume will become available and will bind with the PVC. If the control plane cannot identify a sufficient number of eligible Mayastor Pools in which to create the required replicas at the time of provisioning, the operation will fail; the Mayastor Volume will not be created and the associated PVC will not be bound. Kubernetes will periodically retry the volume creation and, if at any time the appropriate number of pools can be selected, the volume provisioning should succeed.
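
For reference, a minimal sketch of such a StorageClass is shown below. The class name is illustrative, and the provisioner string and parameter names (repl, protocol) are those commonly used with Mayastor; verify them against the Storage Class Parameters documentation for your release.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mayastor-3-replica        # illustrative name
provisioner: io.openebs.csi-mayastor
parameters:
  repl: "3"                       # number of data copies the control plane maintains
  protocol: "nvmf"                # volumes are exported to initiators over NVMe-oF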

Once a volume is processing I/O, each of its replicas will also receive I/O. Reads are distributed round-robin across replicas, whilst writes must be written to all. In a real-world environment this brings the possibility that I/O to one or more replicas might fail at any time. Possible reasons include transient loss of network connectivity, node reboots, node or disk failure. If a volume's nexus (NVMe controller) encounters 'too many' failed I/Os for a replica, then that replica's child status will be marked Faulted and it will no longer receive I/O requests from the nexus. It will remain a member of the volume, whose departure from the desired state with respect to replica count will be reflected with a volume status of Degraded. How many I/O failures are considered "too many" in this context is outside the scope of this discussion.

The control plane will first 'retire' the old, faulted replica, which will then no longer be associated with the volume. Once retired, a replica becomes available for garbage collection (deletion from the Mayastor Pool containing it), assuming that the nature of the failure was such that the pool itself is still viable (i.e. the underlying disk device is still accessible). The control plane will then attempt to restore the desired state (replica count) by creating a new replica, following its replica placement rules. If it succeeds, the nexus will "rebuild" that new replica, performing a full copy of all data from a healthy (Online) replica, i.e. the source. This process can proceed whilst the volume continues to process application I/O, although it will contend for disk throughput at both the source and destination disks.

If a nexus is cleanly restarted, i.e. the Mayastor pod hosting it restarts gracefully, with the assistance of the control plane it will 'recompose' itself; all of the previous healthy member replicas will be re-attached to it. If previously faulted replicas are available to be re-connected (Online), then the control plane will attempt to reuse and rebuild them directly, rather than seek replacements for them first. This edge-case therefore does not result in the retirement of the affected replicas; they are simply reused. If the rebuild fails then we follow the above process of removing a Faulted replica and adding a new one. On an unclean restart (i.e. the Mayastor pod hosting the nexus crashes or is forcefully deleted) only one healthy replica will be re-attached and all other replicas will eventually be rebuilt.

Once provisioned, the replica count of a volume can be changed using the kubectl-mayastor plugin scale subcommand. The value of the num_replicas field may be either increased or decreased by one, and the control plane will attempt to satisfy the request by creating or destroying replicas as appropriate, following the same replica placement rules as described herein. If the replica count is reduced, faulted replicas will be selected for removal in preference to healthy ones.
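
For example, a hedged sketch of scaling the volume shown earlier in this guide from 3 to 4 replicas (exact plugin syntax may vary between releases):

kubectl mayastor scale volume bf207797-b23d-447a-8d3f-98d378acfa8a 4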

Replica Placement Heuristics

Accurate predictions of the behaviour of Mayastor with respect to replica placement and management of replica faults can be made by reference to these 'rules', which are a simplified representation of the actual logic:

  • "Rule 1": A volume can only be provisioned if the replica count (and capacity) of its StorageClass can be satisfied at the time of creation

  • "Rule 2": Every replica of a volume must be placed on a different Mayastor Node)

  • "Rule 3": Children with the state Faulted are always selected for retirement in preference to those with state Online

N.B.: By application of the 2nd rule, replicas of the same volume cannot exist within different pools on the same Mayastor Node.

Example Scenarios

Scenario One

A cluster has two Mayastor nodes deployed, "Node-1" and "Node-2". Each Mayastor node hosts two Mayastor pools and currently, no Mayastor volumes have been defined. Node-1 hosts pools "Pool-1-A" and "Pool-1-B", whilst Node-2 hosts "Pool-2-A" and "Pool-2-B". When a user creates a PVC from a StorageClass which defines a replica count of 2, the Mayastor control plane will seek to place one replica on each node (it 'follows' Rule 2). Since in this example it can find a suitable candidate pool with sufficient free capacity on each node, the volume is provisioned and becomes "healthy" (Rule 1). Pool-1-A is selected on Node-1, and Pool-2-A is selected on Node-2 (all pools being of equal capacity and hosting equal numbers of replicas in this initial 'clean' state).

Sometime later, the physical disk of Pool-2-A encounters a hardware failure and goes offline. The volume is in use at the time, so its nexus (NVMe controller) starts to receive I/O errors for the replica hosted in that Pool. The associated replica's child from Pool-2-A enters the Faulted state and the volume state becomes Degraded (as seen through the kubectl-mayastor plugin).

Expected Behaviour: The volume will maintain read/write access for the application via the remaining healthy replica. The faulty replica from Pool-2-A will be removed from the nexus, thus changing the nexus state to Online as its remaining child is healthy. A new replica is created on Node-2 (on Pool-2-B, since Pool-2-A's underlying disk has failed) and added to the nexus. The new replica child is rebuilt and eventually the state of the volume returns to Online.

Scenario Two

A cluster has three Mayastor nodes deployed, "Node-1", "Node-2" and "Node-3". Each Mayastor node hosts one pool: "Pool-1" on Node-1, "Pool-2" on Node-2 and "Pool-3" on Node-3. No Mayastor volumes have yet been defined; the cluster is 'clean'. A user creates a PVC from a StorageClass which defines a replica count of 2. The control plane determines that it is possible to accommodate one replica within the available capacity of each of Pool-1 and Pool-2, and so the volume is created. An application is deployed on the cluster which uses the PVC, so the volume receives I/O.

Unfortunately, due to user error the SAN LUN which is used to persist Pool-2 becomes detached from Node-2, causing I/O failures in the replica which it hosts for the volume. As with scenario one, the volume state becomes Degraded and the faulted child's state becomes Faulted.

Expected Behaviour: Since there is a Mayastor pool on Node-3 which has sufficient capacity to host a replacement replica, a new replica can be created (Rule 2 is satisfied: this 'third' incoming replica isn't located on either of the nodes that host the two original ones). The faulted replica in Pool-2 is retired from the nexus; a new replica is created on Pool-3 and added to the nexus. The new replica is rebuilt and eventually the state of the volume returns to Online.

Scenario Three

In the cluster from Scenario Two, sometime after the Mayastor volume has returned to the Online state, a user scales up the volume, increasing the num_replicas value from 2 to 3. Before doing so they corrected the SAN misconfiguration and ensured that the pool on Node-2 was Online.

Expected Behaviour: The control plane will attempt to reconcile the difference in current (replicas = 2) and desired (replicas = 3) states. Since Node-2 no longer hosts a replica for the volume (the previously faulted replica was successfully retired and is no longer a member of the volume's nexus), the control plane will select it to host the new replica required (Rule 2 permits this). The volume state will become initially Degraded to reflect the difference in actual vs required redundant data copies but a rebuild of the new replica will be performed and eventually the volume state will be Online again.

Scenario Four

A cluster has three Mayastor nodes deployed; "Node-1", "Node-2" and "Node-3". Each Mayastor node hosts two Mayastor pools. Node-1 hosts pools "Pool-1-A" and "Pool-1-B", whilst Node-2 hosts "Pool-2-A" and "Pool-2-B" and Node-3 hosts "Pool-3-A" and "Pool-3-B". A single volume exists in the cluster, which has a replica count of 3. The volume's replicas are all healthy and are located on Pool-1-A, Pool-2-A and Pool-3-A. An application is using the volume, so all replicas are receiving I/O.

The host Node-3 goes down causing failure of all I/O sent to the replica it hosts from Pool-3-A.

Expected Behaviour: The volume will enter and remain in the Degraded state. The associated child for the replica from Pool-3-A will be in the state Faulted, as observed for the volume through the kubectl-mayastor plugin. Said replica will be removed from the nexus, thus changing the nexus state to Online as the other replicas are healthy. The replica will then be disowned from the volume (it won't be possible to delete it since the host is down). Since Rule 2 dictates that every replica of a volume must be placed on a different Mayastor Node, no new replica can be created at this point and the volume remains Degraded indefinitely.

Scenario Five

Given the post-host failure situation of Scenario four, the user scales down the volume, reducing the value of num_replicas from 3 to 2.

Expected Behaviour: The control plane will reconcile the actual (replicas=3) vs desired (replicas=2) state of the volume. The volume state will become Online again.

Scenario Six

In Scenario Five, after scaling down the Mayastor volume the user waits for the volume state to become Online again. The desired and actual replica count are now 2. The volume's replicas are located in pools on both Node-1 and Node-2. Node-3 is now back up and its pools Pool-3-A and Pool-3-B are Online. The user then scales the volume again, increasing the num_replicas from 2 to 3.

Expected Behaviour: The volume's state will become Degraded, reflecting the difference in desired vs actual replica count. The control plane will select a pool on Node-3 as the location for the new replica, since Node-3 does not currently host a replica of this volume and has Online pools with sufficient capacity, making it again a suitable candidate. The new replica will be rebuilt and the volume state will eventually return to Online.

Call-home Metrics

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

Mayastor default information collection

By default, Mayastor collects basic information related to the number and scale of user-deployed instances. The collected data is anonymous and is encrypted at rest. This data is used to understand storage usage trends, which in turn helps maintainers prioritize their contributions to maximize the benefit to the community as a whole.

No user-identifiable information, hostnames, passwords, or volume data are collected. ONLY the below-mentioned information is collected from the cluster.

A summary of the information collected is given below:

Storage location of collected data

The collected information is stored on behalf of the OpenEBS project by DataCore Software Inc. in data centers located in Texas, USA.


Disable specific data collection

To disable the collection of usage data or the generation of events, the corresponding flag can be added to the Helm command, either during installation or by re-executing the command post-installation.

Disable collection of usage data

To disable the collection of data metrics from the cluster, add the following flag to the Helm install command.
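
For example, reusing the install command format shown elsewhere in this guide:

helm install mayastor mayastor/mayastor -n mayastor --create-namespace --set obs.callhome.enabled=false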

Disable generation of events data

When eventing is enabled, NATS pods are created to gather various events from the cluster, including statistical metrics such as pools created. To deactivate eventing within the cluster, include the following flag in the Helm installation command.
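
For example:

helm install mayastor mayastor/mayastor -n mayastor --create-namespace --set eventing.enabled=false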

Migrate etcd

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

By following the given steps, you can successfully migrate etcd from one node to another during maintenance activities such as a node drain, ensuring the continuity and integrity of the etcd data.

Step 1: Draining the etcd Node

  1. Assuming we have a three-node cluster with three etcd replicas, verify the etcd pods with the following commands:

Command to verify pods:

  2. From etcd-0/1/2, verify that all the values are registered in the database. Once etcd has been migrated to the new node, all the key-value pairs should still be available across all the pods. Run the following commands from any etcd pod.

Commands to get etcd data:

  3. In this example, we drain the etcd node worker-0 so that the pod is migrated to the next available node (in this case, the worker-4 node). To drain the node, use the following command:

Command to drain the node:

Step 2: Migrating etcd to the New Node

  1. After draining the worker-0 node, the etcd pod will be scheduled on the next available node, which is the worker-4 node.

  2. The pod may end up in a CrashLoopBackOff status with specific errors in the logs.

  3. When the pod is scheduled on the new node, it attempts to bootstrap the member again, but since the member is already registered in the cluster, it fails to start the etcd server with the error message member already bootstrapped.

  4. To fix this issue, change the cluster's initial state from new to existing by editing the StatefulSet for etcd:

Command to check new etcd pod status

Command to edit the StatefulSet:

Step 3: Validating etcd Key-Value Pairs

Run the appropriate command from the migrated etcd pod to validate the key-value pairs and ensure they are the same as in the existing etcd. This step is crucial to avoid any data loss during the migration process.
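
One illustrative way to compare the keyspaces is to count the keys served by the migrated member and by an existing member; the counts (and contents) should match. The pod names below follow the example in this section:

kubectl exec mayastor-etcd-2 -n mayastor -- env ETCDCTL_API=3 etcdctl get --prefix "" --keys-only | wc -l
kubectl exec mayastor-etcd-0 -n mayastor -- env ETCDCTL_API=3 etcdctl get --prefix "" --keys-only | wc -l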

Scale etcd

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

By default, Mayastor creates three etcd members. If you simply increase the number of etcd replicas, you will encounter an error; however, with the configuration changes discussed in this guide, the new member can be added successfully.

Overview of StatefulSets

StatefulSets are Kubernetes resources designed for managing stateful applications. They provide stable network identities and persistent storage for pods. StatefulSets ensure ordered deployment and scaling, support persistent volume claims, and manage the state of applications. They are commonly used for databases, messaging systems, and distributed file systems. Here's how StatefulSets function:

  • For a StatefulSet with N replicas, when pods are deployed, they are created sequentially in order from {0..N-1}.

  • When pods are deleted, they are terminated in reverse order from {N-1..0}.

  • Before a scaling operation is applied to a pod, all of its predecessors must be running and ready.

  • Before a pod is terminated, all of its successors must be completely shut down.

  • Mayastor uses the etcd database for persisting configuration and state information. etcd is set up as a Kubernetes StatefulSet when Mayastor is installed.

  • From etcd-0/1/2, we can see that all the values are registered in the database. Once we scale up etcd with "n" replicas, all the key-value pairs should be available across all the pods.

To scale up the etcd members, the following steps can be performed:

  1. Add a new etcd member

  2. Add a peer URL

  3. Create a PV (Persistent Volume)

  4. Validate key-value pairs


Step 1: Adding a New etcd Member (Scaling Up etcd Replica)

To increase the number of replicas to 4, use the following kubectl scale command:

The new pod will be created on available nodes but will be in a pending state as there is no PV/PVC created to bind the volumes.

Step 2: Add a New Peer URL

Before creating a PV, we need to add the new peer URL (mayastor-etcd-3=http://mayastor-etcd-3.mayastor-etcd-headless.mayastor.svc.cluster.local:2380) and change the cluster's initial state from "new" to "existing" so that the new member will be added to the existing cluster when the pod comes up after creating the PV. Since the new pod is still in a pending state, the changes will not be applied to the other pods as they will be restarted in reverse order from {N-1..0}. It is expected that all of its predecessors must be running and ready.

Step 3: Create a Persistent Volume

Create a PV with the following YAML. Change the pod name/claim name based on the pod's unique identity.

This is only for the volumes created with "manual" storage class.

Step 4: Validate Key-Value Pairs

Run the following command from the new etcd pod and ensure that the values are the same as those in etcd-0/1/2. Otherwise, it indicates a data loss issue.

The supportability tool collects Mayastor-specific information from the cluster using the kubectl plugin command-line tool. It uses the dump command, which interacts with the Mayastor services to build an archive (ZIP) file that acts as a placeholder for the bundled information.

Note: The information collected by the supportability tool is solely used for debugging purposes. The content of these files is human-readable and can be reviewed, deleted, or redacted as necessary to adhere to the organization's data protection/privacy commitments and security policies before transmitting the bundles. Please refer to the section 'Does the supportability tool expose sensitive data' for more details.

Mayastor supports seamless upgrades starting with target version 2.1.0 and later, and source versions 2.0.0 and later. To upgrade from a previous version (1.0.5 or prior) to 2.1.0 or later, see Legacy Upgrade Support.

The upgrade operation utilises the Mayastor Kubectl Plugin.

Ensure that you have the Mayastor kubectl plugin installed, matching the version of your Mayastor Helm chart deployment (see releases). You can find installation instructions in the Mayastor kubectl plugin documentation.

Mayastor terminology. A data structure instantiated within a Mayastor instance which performs I/O operations for a single Mayastor volume. Each nexus acts as an NVMe controller for the volume it exports. Logically it is composed chiefly of a 'static' function table which determines its base I/O handling behaviour (held in common with all other nexus of the cluster), combined with configuration information specific to the Mayastor volume it exports, such as the identity of its children. The function of a nexus is to route I/O requests for its exported volume which are received on its host container's target to the underlying persistence layer, via any applied transformations ("data services"), and to return responses to the calling initiator back along that same I/O path.

Internally a storage pool is an implementation of an SPDK Logical Volume Store.

A code abstraction of a block-level device to which I/O requests may be sent, presenting a consistent device-independent interface. Mayastor's bdev abstraction layer is based upon that of Intel's Storage Performance Development Kit (SPDK).

logical volume - A bdev representing an SPDK Logical Volume ("lvol bdev")

To begin, you'll need to create a StorageClass that defines the properties of the snapshot to be restored. Refer to Storage Class Parameters for more details. Use the following command to create the StorageClass:

You need to create a volume snapshot before proceeding with the restore. Follow the steps outlined in this guide to create a volume snapshot.

A legacy installation of Mayastor (1.0.x and below) cannot be seamlessly upgraded and needs manual intervention. Follow the below steps if you wish to upgrade from Mayastor 1.0.x to Mayastor 2.1.0 and above. Mayastor uses etcd as a persistent datastore for its configuration. As a first step, take a snapshot of the etcd. The detailed steps for taking a snapshot can be found in the etcd documentation.

Once all the above components have been successfully removed, fetch the latest helm chart from the Mayastor-extension repo and save it to a file, say helm_templates.yaml. To do so, execute:

For cloud providers, you can find the necessary plugins here.

For detailed Velero GKE configuration prerequisites, refer to this resource.

It's essential to note that Velero requires an object storage bucket for storing backups, and in our case, we use a Google Cloud Storage (GCS) bucket.

Take a snapshot of the etcd. Click here for the detailed documentation.

Linux

Distribution: Ubuntu, Version: 20.04 LTS, Kernel version: 5.13.0-27-generic

Windows

Not supported

Cluster information

K8s cluster ID: This is a SHA-256 hashed value of the UID of your Kubernetes cluster's kube-system namespace.

K8s node count: This is the number of nodes in your Kubernetes cluster.

Product name: This field displays the name Mayastor

Product version: This is the deployed version of Mayastor.

Deploy namespace: This is a SHA-256 hashed value of the name of the Kubernetes namespace where Mayastor Helm chart is deployed.

Storage node count: This is the number of nodes on which the Mayastor I/O engine is scheduled.

Pool information

Pool count: This is the number of Mayastor DiskPools in your cluster.

Pool maximum size: This is the capacity of the Mayastor DiskPool with the highest capacity.

Pool minimum size: This is the capacity of the Mayastor DiskPool with the lowest capacity.

Pool mean size: This is the average capacity of the Mayastor DiskPools in your cluster.

Pool capacity percentiles: This calculates and returns the capacity distribution of Mayastor DiskPools for the 50th, 75th and the 90th percentiles.

Pools created: This is the number of successful pool creation attempts.

Pools deleted: This is the number of successful pool deletion attempts.

Volume information

Volume count: This is the number of Mayastor Volumes in your cluster.

Volume minimum size: This is the capacity of the Mayastor Volume with the lowest capacity.

Volume mean size: This is the average capacity of the Mayastor Volumes in your cluster.

Volume capacity percentiles: This calculates and returns the capacity distribution of Mayastor Volumes for the 50th, 75th and the 90th percentiles.

Volumes created: This is the number of successful volume creation attempts.

Volumes deleted: This is the number of successful volume deletion attempts.

Replica Information

Replica count: This is the number of Mayastor Volume replicas in your cluster.

Average replica count per volume: This is the average number of replicas each Mayastor Volume has in your cluster.

--set obs.callhome.enabled=false
--set eventing.enabled=false
kubectl get pods -n mayastor -l app=etcd -o wide
NAME              READY   STATUS    RESTARTS   AGE     IP             NODE       NOMINATED NODE   READINESS GATES
mayastor-etcd-0   1/1     Running   0          4m9s    10.244.1.212   worker-1   <none>           <none>
mayastor-etcd-1   1/1     Running   0          5m16s   10.244.2.219   worker-2   <none>           <none>
mayastor-etcd-2   1/1     Running   0          6m28s   10.244.3.203   worker-0   <none>           <none>
kubectl exec -it mayastor-etcd-0 -n mayastor -- bash
#ETCDCTL_API=3
#etcdctl get --prefix ""
kubectl drain worker-0 --ignore-daemonsets --delete-emptydir-data
node/worker-0 cordoned
Warning: ignoring DaemonSet-managed Pods: kube-system/kube-flannel-ds-pbm7r, kube-system/kube-proxy-jgjs4, mayastor/mayastor-agent-ha-node-jkd4c, mayastor/mayastor-csi-node-mb89n, mayastor/mayastor-io-engine-q2n28, mayastor/mayastor-promethues-prometheus-node-exporter-v6mfs, mayastor/mayastor-promtail-6vgvm, monitoring/node-exporter-fz247
evicting pod mayastor/mayastor-etcd-2
evicting pod mayastor/mayastor-agent-core-7c594ff676-2ph69
evicting pod mayastor/mayastor-operator-diskpool-c8ddb588-cgr29
pod/mayastor-operator-diskpool-c8ddb588-cgr29 evicted
pod/mayastor-agent-core-7c594ff676-2ph69 evicted
pod/mayastor-etcd-2 evicted
node/worker-0 drained
kubectl get pods -n mayastor -l app=etcd -o wide
NAME              READY   STATUS             RESTARTS      AGE   IP             NODE       NOMINATED NODE   READINESS GATES
mayastor-etcd-0   1/1     Running            0   35m   10.244.1.212   worker-1   <none>           <none>
mayastor-etcd-1   1/1     Running            0   36m   10.244.2.219   worker-2   <none>           <none>
mayastor-etcd-2   0/1     CrashLoopBackOff   5 (44s ago)   10m   10.244.0.121   worker-4     <none>           <none>
kubectl edit sts mayastor-etcd -n mayastor
        - name: ETCD_INITIAL_CLUSTER_STATE
          value: existing
kubectl exec -it mayastor-etcd-0 -n mayastor -- bash
#ETCDCTL_API=3
#etcdctl get --prefix ""
kubectl get dsp -n mayastor
NAME     NODE       STATE    POOL_STATUS   CAPACITY       USED          AVAILABLE
pool-0   worker-0   Online   Online        374710730752   22561161216   352149569536
pool-1   worker-1   Online   Online        374710730752   21487419392   353223311360
pool-2   worker-2   Online   Online        374710730752   21793603584   352917127168
kubectl scale sts mayastor-etcd -n mayastor --replicas=4
statefulset.apps/mayastor-etcd scaled
kubectl get pods -n mayastor -l app=etcd
NAME              READY   STATUS    RESTARTS   AGE
mayastor-etcd-0   1/1     Running   0          28d
mayastor-etcd-1   1/1     Running   0          28d
mayastor-etcd-2   1/1     Running   0          28d
mayastor-etcd-3   0/1     Pending   0          2m34s
kubectl edit sts mayastor-etcd -n mayastor 
        - name: ETCD_INITIAL_CLUSTER_STATE
          value: existing
        - name: ETCD_INITIAL_CLUSTER
          value: mayastor-etcd-0=http://mayastor-etcd-0.mayastor-etcd-headless.mayastor.svc.cluster.local:2380,mayastor-etcd-1=http://mayastor-etcd-1.mayastor-etcd-headless.mayastor.svc.cluster.local:2380,mayastor-etcd-2=http://mayastor-etcd-2.mayastor-etcd-headless.mayastor.svc.cluster.local:2380,mayastor-etcd-3=http://mayastor-etcd-3.mayastor-etcd-headless.mayastor.svc.cluster.local:2380
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    meta.helm.sh/release-name: mayastor
    meta.helm.sh/release-namespace: mayastor
    pv.kubernetes.io/bound-by-controller: "yes"
  finalizers:
  - kubernetes.io/pv-protection
  labels:
    app.kubernetes.io/managed-by: Helm
    statefulset.kubernetes.io/pod-name: mayastor-etcd-3
  name: etcd-volume-3
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 2Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: data-mayastor-etcd-3
    namespace: mayastor
  hostPath:
    path: /var/local/mayastor/etcd/pod-3
    type: ""
  persistentVolumeReclaimPolicy: Delete
  storageClassName: manual
  volumeMode: Filesystem
kubectl apply -f pv-etcd.yaml -n mayastor
persistentvolume/etcd-volume-3 created
kubectl exec -it mayastor-etcd-3 -n mayastor -- bash
#ETCDCTL_API=3
#etcdctl get --prefix ""

Migration for Distributed DB

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

This documentation outlines the process of migrating application volumes from cStor to Mayastor. We will leverage Velero for backup and restoration, facilitating the transition from a cStor cluster to a Mayastor cluster. This example specifically focuses on a Google Kubernetes Engine (GKE) cluster.

Velero Support: Velero supports the backup and restoration of Kubernetes volumes attached to pods through File System Backup (FSB) or Pod Volume Backup. This process involves using modules from popular open-source backup tools like Restic (which we will utilize).

Backup from cStor

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

Step 1: Backup from cStor Cluster

In the current setup, we have a cStor cluster serving as the source, with Cassandra running as a StatefulSet, utilizing cStor volumes.

kubectl get pods -n cassandra 
NAME          READY   STATUS    RESTARTS   AGE
cassandra-0   1/1     Running   0          6m22s
cassandra-1   1/1     Running   0          4m23s
cassandra-2   1/1     Running   0          2m15s
kubectl get pvc -n cassandra 
NAME               STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS     AGE
data-cassandra-0   Bound    pvc-05c464de-f273-4d04-b915-600bc434d762   3Gi        RWO            cstor-csi-disk   6m37s
data-cassandra-1   Bound    pvc-a7ac4af9-6cc9-4722-aee1-b8c9e1c1f8c8   3Gi        RWO            cstor-csi-disk   4m38s
data-cassandra-2   Bound    pvc-0980ea22-0b4b-4f02-bc57-81c4089cf55a   3Gi        RWO            cstor-csi-disk   2m30s
kubectl get cvc -n openebs 
NAME                                       CAPACITY   STATUS   AGE
pvc-05c464de-f273-4d04-b915-600bc434d762   3Gi        Bound    6m47s
pvc-0980ea22-0b4b-4f02-bc57-81c4089cf55a   3Gi        Bound    2m40s
pvc-a7ac4af9-6cc9-4722-aee1-b8c9e1c1f8c8   3Gi        Bound    4m48s

Step 2: Velero Installation

To initiate Velero, execute the following command:

velero install --use-node-agent --provider gcp --plugins velero/velero-plugin-for-gcp:v1.6.0 --bucket velero-backup-datacore --secret-file ./credentials-velero --uploader-type restic

Verify the Velero namespace for Node Agent and Velero pods:

kubectl get pods -n velero

Step 3: Create a Sample Database

In this example, we create a new database with sample data in Cassandra, a distributed database.

The data is distributed across all replication instances.
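
A minimal sketch of creating such sample data from cqlsh is shown below; the keyspace, table, and credentials are illustrative (chosen to match the data queried later in the restore walkthrough), and the replication settings should be adapted to your cluster.

cqlsh -u <enter-your-user-name> -p <enter-your-password> cassandra-0.cassandra-headless.cassandra.svc.cluster.local 9042
cassandra@cqlsh> CREATE KEYSPACE IF NOT EXISTS openebs WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
cassandra@cqlsh> CREATE TABLE IF NOT EXISTS openebs.data (replication int PRIMARY KEY, appname text, volume text);
cassandra@cqlsh> INSERT INTO openebs.data (replication, appname, volume) VALUES (3, 'cassandra', 'cStor');
cassandra@cqlsh> SELECT * FROM openebs.data;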

Step 4: Taking Velero Backup

Cassandra is a distributed wide-column store database running in clusters called rings. Each node in a Cassandra ring stores some data ranges and replicates others for scaling and fault tolerance. To back up Cassandra, we must back up all three volumes and restore them at the destination.

Velero offers two approaches for discovering pod volumes to back up using File System Backup (FSB):

  • Opt-in Approach: Annotate every pod containing a volume to be backed up with the volume's name.

  • Opt-out Approach: Back up all pod volumes using FSB, with the option to exclude specific volumes.

Opt-in:

In this case, we opt-in all Cassandra pods and volumes for backup:

kubectl -n cassandra annotate pod/cassandra-0 backup.velero.io/backup-volumes=data
kubectl -n cassandra annotate pod/cassandra-1 backup.velero.io/backup-volumes=data
kubectl -n cassandra annotate pod/cassandra-2 backup.velero.io/backup-volumes=data

To perform the backup, run the following command:

velero backup create cassandra-backup-19-09-23 --include-namespaces cassandra --default-volumes-to-fs-backup --wait

To check the backup status, run the following command:

velero get backup | grep cassandra-backup-19-09-23

Backup from cStor

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

If you are deploying databases using operators, you need to find a way to actively modify the entire deployment through the operator. This ensures that you control and manage changes effectively within the operator-driven database deployment.

Step 1: Backup from cStor Cluster

Currently, we have a cStor cluster as the source, with a clustered MongoDB running as a StatefulSet using cStor volumes.

kubectl get pods
NAME                            READY   STATUS    RESTARTS   AGE
mongo-client-758ddd54cc-h2gwl   1/1     Running   0          47m
mongod-0                        1/1     Running   0          47m
mongod-1                        1/1     Running   0          44m
mongod-2                        1/1     Running   0          42m
ycsb-775fc86c4b-kj5vv           1/1     Running   0          47m
kubectl get pvc
NAME                                        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS     AGE
mongodb-persistent-storage-claim-mongod-0   Bound    pvc-cb115a0b-07f4-4912-b686-e160e8a0690d   3Gi        RWO            cstor-csi-disk   54m
mongodb-persistent-storage-claim-mongod-1   Bound    pvc-c9214764-7670-4cda-87e3-82f0bc59d8c7   3Gi        RWO            cstor-csi-disk   52m
mongodb-persistent-storage-claim-mongod-2   Bound    pvc-fc1f7ed7-d99e-40c7-a9b7-8d6244403a3e   3Gi        RWO            cstor-csi-disk   50m
kubectl get cvc -n openebs
NAME                                       CAPACITY   STATUS   AGE
pvc-c9214764-7670-4cda-87e3-82f0bc59d8c7   3Gi        Bound    53m
pvc-cb115a0b-07f4-4912-b686-e160e8a0690d   3Gi        Bound    55m
pvc-fc1f7ed7-d99e-40c7-a9b7-8d6244403a3e   3Gi        Bound    50m

Step 2: Install Velero

Run the following command to install Velero:

velero install --use-node-agent --provider gcp --plugins velero/velero-plugin-for-gcp:v1.6.0 --bucket velero-backup-datacore --secret-file ./credentials-velero --uploader-type restic
[Installation progress output]

Verify the Velero namespace for Node Agent and Velero pods:

kubectl get pods -n velero
NAME                      READY   STATUS    RESTARTS   AGE
node-agent-cwkrn          1/1     Running   0          43s
node-agent-qg6hd          1/1     Running   0          43s
node-agent-v6xbk          1/1     Running   0          43s
velero-56c45f5c64-4hzn7   1/1     Running   0          43s

Step 3: Data Validation

On the Primary Database (mongod-0), you can see some sample data.

You can also see the data available on the replicated secondary databases.
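
An illustrative way to confirm the replica set state from within the primary pod (depending on the MongoDB image, the shell may be mongo rather than mongosh, and authentication flags may be required):

kubectl exec -it mongod-0 -- mongosh --quiet --eval "rs.status().members.forEach(m => print(m.name, m.stateStr))"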

Step 4: Take Velero Backup

MongoDB uses replication, and data partitioning (sharding) for high availability and scalability. Taking a backup of the primary database is enough as the data gets replicated to the secondary databases. Restoring both primary and secondary at the same time can cause data corruption.

Velero supports two approaches for discovering pod volumes to be backed up using FSB:

  1. Opt-in approach: Annotate pods containing volumes to be backed up.

  2. Opt-out approach: Backup all pod volumes with the ability to opt-out specific volumes.

Opt-In for Primary MongoDB Pod:

To ensure that our primary MongoDB pod, which receives writes and replicates data to secondary pods, is included in the backup, we need to annotate it as follows:

kubectl annotate pod/mongod-0 backup.velero.io/backup-volumes=mongodb-persistent-storage-claim

Opt-Out for Secondary MongoDB Pods and PVCs:

To exclude secondary MongoDB pods and their associated Persistent Volume Claims (PVCs) from the backup, we can label them as follows:

kubectl label pod mongod-1 velero.io/exclude-from-backup=true
pod/mongod-1 labeled
kubectl label pod mongod-2 velero.io/exclude-from-backup=true
pod/mongod-2 labeled
kubectl label pvc mongodb-persistent-storage-claim-mongod-1 velero.io/exclude-from-backup=true
persistentvolumeclaim/mongodb-persistent-storage-claim-mongod-1 labeled
kubectl label pvc mongodb-persistent-storage-claim-mongod-2 velero.io/exclude-from-backup=true
persistentvolumeclaim/mongodb-persistent-storage-claim-mongod-2 labeled

Backup Execution:

Create a backup of the entire namespace. If any other applications run in the same namespace as MongoDB, we can exclude them from the backup using labels or flags from the Velero CLI:

velero backup create mongo-backup-13-09-23 --include-namespaces default --default-volumes-to-fs-backup --wait
Backup request "mongo-backup-13-09-23" submitted successfully.
Waiting for backup to complete. You may safely press ctrl-c to stop waiting - your backup will continue in the background.
...........
Backup completed with status: Completed. You may check for more information using the commands `velero backup describe mongo-backup-13-09-23` and `velero backup logs mongo-backup-13-09-23`.

Backup Verification:

To check the status of the backup using the Velero CLI, you can use the following command. If the backup fails for any reason, you can inspect the logs with the velero backup logs command:

velero get backup | grep 13-09-23
mongo-backup-13-09-23     Completed   0        0          2023-09-13 13:15:32 +0000 UTC   29d       default            <none>

Mayastor Installation on MicroK8s

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

Install Mayastor on MicroK8s

To install Mayastor using Helm on MicroK8s, execute the following command:

helm install mayastor mayastor/mayastor -n mayastor --create-namespace  --set csi.node.kubeletDir="/var/snap/microk8s/common/var/lib/kubelet"
NAME: mayastor
LAST DEPLOYED: Thu Sep 22 18:59:56 2022
NAMESPACE: mayastor
STATUS: deployed
REVISION: 1
NOTES:
OpenEBS Mayastor has been installed. Check its status by running:
$ kubectl get pods -n mayastor
For more information or to view the documentation, visit our website at https://openebs.io.

Resolve Known Issue (Calico Vxlan)

During the installation of Mayastor in MicroK8s, Pods with hostnetwork might encounter a known issue where they get stuck in the init state due to the Calico Vxlan bug.

Expected error:

Resolution:

To resolve this error, execute the following command:

microk8s kubectl patch felixconfigurations default --patch '{"spec":{"featureDetectOverride":"ChecksumOffloadBroken=true"}}' --type=merge
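
To confirm that the setting was applied, an illustrative check:

microk8s kubectl get felixconfigurations default -o jsonpath='{.spec.featureDetectOverride}'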

Restore to Mayastor

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

Cassandra is a popular NoSQL database used for handling large amounts of data with high availability and scalability. In Kubernetes environments, managing and restoring Cassandra backups efficiently is crucial. In this article, we'll walk you through the process of restoring a Cassandra database in a Kubernetes cluster using Velero, and we'll change the storage class to Mayastor for improved performance.

Before you begin, make sure you have the following:

  • Access to a Kubernetes cluster with Velero installed.

  • A backup of your Cassandra database created using Velero.

  • Mayastor configured in your Kubernetes environment.

Step 1: Set Up Kubernetes Credentials and Install Velero

Set up your Kubernetes cluster credentials for the target cluster where you want to restore your Cassandra database. Use the same values for the BUCKET-NAME and SECRET-FILENAME placeholders that you used during the initial Velero installation. This ensures that Velero has the correct credentials to access the previously saved backups. Use the gcloud command if you are using Google Kubernetes Engine (GKE) as shown below:

gcloud container clusters get-credentials CLUSTER_NAME --zone ZONE --project PROJECT_NAME

Install Velero with the necessary plugins, specifying your backup bucket, secret file, and uploader type. Make sure to replace the placeholders with your specific values:

velero install --use-node-agent --provider gcp --plugins velero/velero-plugin-for-gcp:v1.6.0 --bucket BUCKET-NAME --secret-file SECRET-FILENAME --uploader-type restic

Step 2: Verify Backup Availability and Check BackupStorageLocation Status

Confirm that your Cassandra backup is available in Velero. This step ensures that there are no credentials or bucket mismatches:

velero get backup | grep YOUR_BACKUP_NAME

Check the status of the BackupStorageLocation to ensure it's available:

kubectl get backupstoragelocation -n velero

Step 3: Create a Restore Request

Create a Velero restore request for your Cassandra backup:

velero restore create RESTORE_NAME --from-backup YOUR_BACKUP_NAME

Step 4: Monitor Restore Progress

Monitor the progress of the restore operation using the below commands. Velero initiates the restore process by creating an initialization container within the application pod. This container is responsible for restoring the volumes from the backup. As the restore operation proceeds, you can track its status, which typically transitions from in progress to Completed.

In this scenario, the storage class for the PVCs remains as cstor-csi-disk since these PVCs were originally imported from a cStor volume.

Because the storage class was originally set to cstor-csi-disk (these PVCs were imported from a cStor volume), the restore status might temporarily stay as In Progress and the PVCs will remain in Pending status.

velero get restore | grep RESTORE_NAME

Inspect the status of the PVCs in the cassandra namespace:

kubectl get pvc -n cassandra
kubectl get pods -n cassandra

Step 5: Back Up PVC YAML

Create a backup of the Persistent Volume Claims (PVCs) and then modify their storage class to mayastor-single-replica.

The StatefulSet for Cassandra will still reference the cstor-csi-disk storage class at this point. This will be addressed in the later steps.

kubectl get pvc -n cassandra -o yaml > cassandra_pvc_19-09.yaml
ls -lrt | grep cassandra_pvc_19-09.yaml

Edit the PVC YAML to change the storage class to mayastor-single-replica. You can use the provided example YAML snippet and apply it to your PVCs.

apiVersion: v1
items:
- apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    finalizers:
    - kubernetes.io/pvc-protection
    labels:
      app.kubernetes.io/instance: cassandra
      app.kubernetes.io/name: cassandra
      velero.io/backup-name: cassandra-backup-19-09-23
      velero.io/restore-name: cassandra-restore-19-09-23
    name: data-cassandra-0
    namespace: cassandra
  spec:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 3Gi
    storageClassName: mayastor-single-replica
    volumeMode: Filesystem
- apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    finalizers:
    - kubernetes.io/pvc-protection
    labels:
      app.kubernetes.io/instance: cassandra
      app.kubernetes.io/name: cassandra
      velero.io/backup-name: cassandra-backup-19-09-23
      velero.io/restore-name: cassandra-restore-19-09-23
    name: data-cassandra-1
    namespace: cassandra
  spec:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 3Gi
    storageClassName: mayastor-single-replica
    volumeMode: Filesystem
- apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    finalizers:
    - kubernetes.io/pvc-protection
    labels:
      app.kubernetes.io/instance: cassandra
      app.kubernetes.io/name: cassandra
      velero.io/backup-name: cassandra-backup-19-09-23
      velero.io/restore-name: cassandra-restore-19-09-23
    name: data-cassandra-2
    namespace: cassandra
  spec:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 3Gi
    storageClassName: mayastor-single-replica
    volumeMode: Filesystem
kind: List
metadata:
  resourceVersion: ""

Step 6: Delete and Recreate PVCs

Delete the pending PVCs and apply the modified PVC YAML to recreate them with the new storage class:

kubectl delete pvc PVC_NAMES -n cassandra
kubectl apply -f cassandra_pvc.yaml -n cassandra

Step 7: Observe Velero Init Container and Confirm Restore

Observe the Velero init container as it restores the volumes for the Cassandra pods. This process ensures that your data is correctly recovered.

Events:
  Type     Reason                  Age                     From                     Message
  ----     ------                  ----                    ----                     -------
  Warning  FailedScheduling        8m37s                   default-scheduler        0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
  Warning  FailedScheduling        8m36s                   default-scheduler        0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
  Warning  FailedScheduling        83s                     default-scheduler        0/3 nodes are available: 3 persistentvolumeclaim "data-cassandra-0" not found. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
  Warning  FailedScheduling        65s                     default-scheduler        running PreFilter plugin "VolumeBinding": %!!(MISSING)w(<nil>)
  Normal   Scheduled               55s                     default-scheduler        Successfully assigned cassandra/cassandra-0 to gke-mayastor-pool-2acd09ca-4v3z
  Normal   NotTriggerScaleUp       3m34s (x31 over 8m35s)  cluster-autoscaler       pod didn't trigger scale-up:
  Normal   SuccessfulAttachVolume  55s                     attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-bf8a2fb7-8ddb-4e53-aa48-f8bbf2064b41"
  Normal   Pulled                  47s                     kubelet                  Container image "velero/velero-restore-helper:v1.11.1" already present on machine
  Normal   Created                 47s                     kubelet                  Created container restore-wait
  Normal   Started                 47s                     kubelet                  Started container restore-wait
  Normal   Pulled                  41s                     kubelet                  Container image "docker.io/bitnami/cassandra:4.1.3-debian-11-r37" already present on machine
  Normal   Created                 41s                     kubelet                  Created container cassandra
  Normal   Started                 41s                     kubelet                  Started container cassandra

Run this command to check the restore status:

velero get restore | grep cassandra-restore-19-09-23

Run this command to check if all the pods are running:

kubectl get pods -n cassandra

Step 8: Verify Cassandra Data and StatefulSet

Access a Cassandra pod using cqlsh and check the data

  • You can use the following command to access the Cassandra pods. This command establishes a connection to the Cassandra database running on pod cassandra-1:

cqlsh -u <enter-your-user-name> -p <enter-your-password> cassandra-1.cassandra-headless.cassandra.svc.cluster.local 9042
  • The query results should display the data you backed up from cStor.

cassandra@cqlsh> USE openebs;
cassandra@cqlsh:openebs> select * from openebs.data;
 replication | appname   | volume
-------------+-----------+--------
           3 | cassandra |  cStor

(1 rows)
  • After verifying the data, you can exit the Cassandra shell by typing exit.

Modify your Cassandra StatefulSet YAML to use the mayastor-single-replica storage class

  • Before making changes to the Cassandra StatefulSet YAML, create a backup to preserve the existing configuration by running the following command:

kubectl get sts cassandra -n cassandra -o yaml > cassandra_sts_backup.yaml
  • You can modify the Cassandra StatefulSet YAML to change the storage class to mayastor-single-replica. Here's the updated YAML:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    meta.helm.sh/release-name: cassandra
    meta.helm.sh/release-namespace: cassandra
  labels:
    app.kubernetes.io/instance: cassandra
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: cassandra
    helm.sh/chart: cassandra-10.5.3
    velero.io/backup-name: cassandra-backup-19-09-23
    velero.io/restore-name: cassandra-restore-19-09-23
  name: cassandra
  namespace: cassandra
spec:
  podManagementPolicy: OrderedReady
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: cassandra
      app.kubernetes.io/name: cassandra
  serviceName: cassandra-headless
  template:
    # ... (rest of the configuration remains unchanged)
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: cassandra
        app.kubernetes.io/name: cassandra
      name: data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 3Gi
      storageClassName: mayastor-single-replica # Change the storage class here
      volumeMode: Filesystem
  • Save the modified configuration to a file, for example cassandra_sts_modified.yaml. Because volumeClaimTemplates cannot be changed on a live StatefulSet, this file is applied only after the existing StatefulSet has been deleted in the next step.

Delete the Cassandra StatefulSet with the --cascade=orphan flag

Delete the Cassandra StatefulSet while keeping the pods running without controller management:

kubectl delete sts cassandra -n cassandra --cascade=orphan

Recreate the Cassandra StatefulSet using the updated YAML

  • Use the kubectl apply command to apply the modified StatefulSet YAML configuration file, ensuring you are in the correct namespace where your Cassandra deployment resides. Replace <path_to_your_yaml> with the actual path to your YAML file.

kubectl apply -f <path_to_your_yaml> -n cassandra
  • To check the status of the newly created StatefulSet, run:

kubectl get sts -n cassandra
  • To confirm that the pods are running and managed by the controller, run:

kubectl get pods -n cassandra

Basic Troubleshooting

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

Logs

The correct set of log files to collect depends on the nature of the problem. If unsure, it is best to collect log files for all Mayastor containers. In nearly every case, the logs of all of the control plane component pods will be needed:

  • csi-controller

  • core-agent

  • rest

  • msp-operator

kubectl -n mayastor get pods -o wide
NAME                    READY   STATUS    RESTARTS   AGE   IP             NODE       NOMINATED NODE   READINESS GATES
mayastor-csi-7pg82      2/2     Running   0          15m   10.0.84.131    worker-2   <none>           <none>
mayastor-csi-gmpq6      2/2     Running   0          15m   10.0.239.174   worker-1   <none>           <none>
mayastor-csi-xrmxx      2/2     Running   0          15m   10.0.85.71     worker-0   <none>           <none>
mayastor-qgpw6          1/1     Running   0          14m   10.0.85.71     worker-0   <none>           <none>
mayastor-qr84q          1/1     Running   0          14m   10.0.239.174   worker-1   <none>           <none>
mayastor-xhmj5          1/1     Running   0          14m   10.0.84.131    worker-2   <none>           <none>
... etc (output truncated for brevity)
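
If it is not yet clear which pod is at fault, the logs of every pod in the namespace can be captured in one pass with a small loop such as the one below (a convenience sketch; adjust the namespace to match your installation):

for p in $(kubectl -n mayastor get pods -o name); do
  kubectl -n mayastor logs "$p" --all-containers --prefix > "$(basename "$p").log"
done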

Mayastor pod log file

Mayastor containers form the data plane of a Mayastor deployment. A cluster should schedule as many mayastor container instances as there are storage nodes defined. This log file is most useful when troubleshooting I/O errors; however, provisioning and management operations might also fail because of a problem on a storage node.

kubectl -n mayastor logs mayastor-qgpw6 mayastor

CSI agent pod log file

If experiencing problems with (un)mounting a volume on an application node, this log file can be useful. Generally all worker nodes in the cluster will be configured to schedule a mayastor CSI agent pod, so it's good to know which specific node is experiencing the issue and inspect the log file only for that node.

kubectl -n mayastor logs mayastor-csi-7pg82 mayastor-csi

CSI sidecars

These containers implement the CSI spec for Kubernetes and run within the same pods as the csi-controller and mayastor-csi (node plugin) containers. Whilst they are not part of Mayastor's code, they can contain useful information when a Mayastor CSI controller/node plugin fails to register with the Kubernetes cluster.

kubectl -n mayastor logs $(kubectl -n mayastor get pod -l app=moac -o jsonpath="{.items[0].metadata.name}") csi-attacher
kubectl -n mayastor logs $(kubectl -n mayastor get pod -l app=moac -o jsonpath="{.items[0].metadata.name}") csi-provisioner
kubectl -n mayastor logs mayastor-csi-7pg82 csi-driver-registrar

Coredumps

A coredump is a snapshot of a process's memory combined with auxiliary information (PID, state of registers, etc.), saved to a file. It is used for post-mortem analysis and is generated automatically by the operating system in the case of a severe, unrecoverable error (e.g. memory corruption) causing the process to panic. Using a coredump for problem analysis requires deep knowledge of program internals and is usually done only by developers. However, there is one very useful piece of information that users can retrieve from it, and it alone can often identify the root cause of the problem: the stack (backtrace) - a record of what the program was doing at the time it crashed. Here we describe how to obtain it. The steps shown apply specifically to Ubuntu; other Linux distributions might employ variations.

We rely on systemd-coredump, which saves and manages coredumps on the system; the coredumpctl utility, which is part of the same package; and finally the gdb debugger.

sudo apt-get install -y systemd-coredump gdb lz4

If installed correctly then the global core pattern will be set so that all generated coredumps will be piped to the systemd-coredump binary.

cat /proc/sys/kernel/core_pattern
|/lib/systemd/systemd-coredump %P %u %g %s %t 9223372036854775808 %h
coredumpctl list
TIME                            PID   UID   GID SIG COREFILE  EXE
Tue 2021-03-09 17:43:46 UTC  206366     0     0   6 present   /bin/mayastor

If there is a new coredump from the mayastor container, the coredump alone won't be that useful. GDB needs access to the binary of the crashed process in order to print at least some information in the backtrace. For that, we need to copy the contents of the container's filesystem to the host.

docker ps | grep mayadata/mayastor
b3db4615d5e1        mayadata/mayastor                          "sleep 100000"           27 minutes ago      Up 27 minutes                           k8s_mayastor_mayastor-n682s_mayastor_51d26ee0-1a96-44c7-85ba-6e50767cd5ce_0
d72afea5bcc2        mayadata/mayastor-csi                      "/bin/mayastor-csi -…"   7 hours ago         Up 7 hours                              k8s_mayastor-csi_mayastor-csi-xrmxx_mayastor_d24017f2-5268-44a0-9fcd-84a593d7acb2_0
mkdir -p /tmp/rootdir
docker cp b3db4615d5e1:/bin /tmp/rootdir
docker cp b3db4615d5e1:/nix /tmp/rootdir

Now we can start GDB. Don't use the coredumpctl command to start the debugger: it invokes GDB with an invalid path to the debugged binary, so stack unwinding fails for Rust functions. First, extract the compressed coredump.

coredumpctl info | grep Storage | awk '{ print $2 }'
/var/lib/systemd/coredump/core.mayastor.0.6a5e550e77ee4e77a19bd67436ce7a98.64074.1615374302000000000000.lz4
sudo lz4cat /var/lib/systemd/coredump/core.mayastor.0.6a5e550e77ee4e77a19bd67436ce7a98.64074.1615374302000000000000.lz4 >core
gdb -c core /tmp/rootdir$(readlink /tmp/rootdir/bin/mayastor)
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
[New LWP 13]
[New LWP 17]
[New LWP 14]
[New LWP 16]
[New LWP 18]
Core was generated by `/bin/mayastor -l0 -nnats'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007ffdad99fb37 in clock_gettime ()
[Current thread is 1 (LWP 13)]

Once in GDB we need to set a sysroot so that GDB knows where to find the binary for the debugged program.

set auto-load safe-path /tmp/rootdir
set sysroot /tmp/rootdir
Reading symbols from /tmp/rootdir/nix/store/f1gzfqq10dlha1qw10sqvgil34qh30af-systemd-246.6/lib/libudev.so.1...
(No debugging symbols found in /tmp/rootdir/nix/store/f1gzfqq10dlha1qw10sqvgil34qh30af-systemd-246.6/lib/libudev.so.1)
Reading symbols from /tmp/rootdir/nix/store/0kdiav729rrcdwbxws653zxz5kngx8aa-libspdk-dev-21.01/lib/libspdk.so...
Reading symbols from /tmp/rootdir/nix/store/a6rnjp15qgp8a699dlffqj94hzy1nldg-glibc-2.32/lib/libdl.so.2...
(No debugging symbols found in /tmp/rootdir/nix/store/a6rnjp15qgp8a699dlffqj94hzy1nldg-glibc-2.32/lib/libdl.so.2)
Reading symbols from /tmp/rootdir/nix/store/a6rnjp15qgp8a699dlffqj94hzy1nldg-glibc-2.32/lib/libgcc_s.so.1...
(No debugging symbols found in /tmp/rootdir/nix/store/a6rnjp15qgp8a699dlffqj94hzy1nldg-glibc-2.32/lib/libgcc_s.so.1)
Reading symbols from /tmp/rootdir/nix/store/a6rnjp15qgp8a699dlffqj94hzy1nldg-glibc-2.32/lib/libpthread.so.0...
...

After that we can print backtrace(s).

thread apply all bt
Thread 5 (Thread 0x7f78248bb640 (LWP 59)):
#0  0x00007f7825ac0582 in __lll_lock_wait () from /tmp/rootdir/nix/store/a6rnjp15qgp8a699dlffqj94hzy1nldg-glibc-2.32/lib/libpthread.so.0
#1  0x00007f7825ab90c1 in pthread_mutex_lock () from /tmp/rootdir/nix/store/a6rnjp15qgp8a699dlffqj94hzy1nldg-glibc-2.32/lib/libpthread.so.0
#2  0x00005633ca2e287e in async_io::driver::main_loop ()
#3  0x00005633ca2e27d9 in async_io::driver::UNPARKER::{{closure}}::{{closure}} ()
#4  0x00005633ca2e27c9 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#5  0x00005633ca2e27b9 in std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}} ()
#6  0x00005633ca2e27a9 in <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once ()
#7  0x00005633ca2e26b4 in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#8  0x00005633ca723cda in <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/d1206f950ffb76c76e1b74a19ae33c2b7d949454/library/alloc/src/boxed.rs:1546
#9  <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/d1206f950ffb76c76e1b74a19ae33c2b7d949454/library/alloc/src/boxed.rs:1546
#10 std::sys::unix::thread::Thread::new::thread_start () at /rustc/d1206f950ffb76c76e1b74a19ae33c2b7d949454//library/std/src/sys/unix/thread.rs:71
#11 0x00007f7825ab6e9e in start_thread () from /tmp/rootdir/nix/store/a6rnjp15qgp8a699dlffqj94hzy1nldg-glibc-2.32/lib/libpthread.so.0
#12 0x00007f78259e566f in clone () from /tmp/rootdir/nix/store/a6rnjp15qgp8a699dlffqj94hzy1nldg-glibc-2.32/lib/libc.so.6

Thread 4 (Thread 0x7f7824cbd640 (LWP 57)):
#0  0x00007f78259e598f in epoll_wait () from /tmp/rootdir/nix/store/a6rnjp15qgp8a699dlffqj94hzy1nldg-glibc-2.32/lib/libc.so.6
#1  0x00005633ca2e414b in async_io::reactor::ReactorLock::react ()
#2  0x00005633ca583c11 in async_io::driver::block_on ()
#3  0x00005633ca5810dd in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#4  0x00005633ca580e5c in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#5  0x00005633ca723cda in <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/d1206f950ffb76c76e1b74a19ae33c2b7d949454/library/alloc/src/boxed.rs:1546
#6  <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/d1206f950ffb76c76e1b74a19ae33c2b7d949454/library/alloc/src/boxed.rs:1546
#7  std::sys::unix::thread::Thread::new::thread_start () at /rustc/d1206f950ffb76c76e1b74a19ae33c2b7d949454//library/std/src/sys/unix/thread.rs:71
#8  0x00007f7825ab6e9e in start_thread () from /tmp/rootdir/nix/store/a6rnjp15qgp8a699dlffqj94hzy1nldg-glibc-2.32/lib/libpthread.so.0
#9  0x00007f78259e566f in clone () from /tmp/rootdir/nix/store/a6rnjp15qgp8a699dlffqj94hzy1nldg-glibc-2.32/lib/libc.so.6

Thread 3 (Thread 0x7f78177fe640 (LWP 61)):
#0  0x00007f7825ac08b7 in accept () from /tmp/rootdir/nix/store/a6rnjp15qgp8a699dlffqj94hzy1nldg-glibc-2.32/lib/libpthread.so.0
#1  0x00007f7825c930bb in socket_listener () from /tmp/rootdir/nix/store/0kdiav729rrcdwbxws653zxz5kngx8aa-libspdk-dev-21.01/lib/libspdk.so
#2  0x00007f7825ab6e9e in start_thread () from /tmp/rootdir/nix/store/a6rnjp15qgp8a699dlffqj94hzy1nldg-glibc-2.32/lib/libpthread.so.0
#3  0x00007f78259e566f in clone () from /tmp/rootdir/nix/store/a6rnjp15qgp8a699dlffqj94hzy1nldg-glibc-2.32/lib/libc.so.6

Thread 2 (Thread 0x7f7817fff640 (LWP 60)):
#0  0x00007f78259e598f in epoll_wait () from /tmp/rootdir/nix/store/a6rnjp15qgp8a699dlffqj94hzy1nldg-glibc-2.32/lib/libc.so.6
#1  0x00007f7825c7f174 in eal_intr_thread_main () from /tmp/rootdir/nix/store/0kdiav729rrcdwbxws653zxz5kngx8aa-libspdk-dev-21.01/lib/libspdk.so
#2  0x00007f7825ab6e9e in start_thread () from /tmp/rootdir/nix/store/a6rnjp15qgp8a699dlffqj94hzy1nldg-glibc-2.32/lib/libpthread.so.0
#3  0x00007f78259e566f in clone () from /tmp/rootdir/nix/store/a6rnjp15qgp8a699dlffqj94hzy1nldg-glibc-2.32/lib/libc.so.6

Thread 1 (Thread 0x7f782559f040 (LWP 56)):
#0  0x00007fff849bcb37 in clock_gettime ()
#1  0x00007f78259af1d0 in clock_gettime@GLIBC_2.2.5 () from /tmp/rootdir/nix/store/a6rnjp15qgp8a699dlffqj94hzy1nldg-glibc-2.32/lib/libc.so.6
#2  0x00005633ca23ebc5 in <tokio::park::either::Either<A,B> as tokio::park::Park>::park ()
#3  0x00005633ca2c86dd in mayastor::main ()
#4  0x00005633ca2000d6 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#5  0x00005633ca2cad5f in main ()

Diskpool behaviour

The behaviour below may be encountered while upgrading from older releases to Mayastor 2.4 and above.

Get Dsp

Running kubectl get dsp -n mayastor could result in an error due to the stale v1alpha1 schema in the kubectl discovery cache. To resolve this, run kubectl get diskpools.openebs.io -n mayastor; this refreshes the discovery cache with the v1beta1 object for dsp.

Create API

When creating a Disk Pool with kubectl create -f dsp.yaml, you might encounter an error related to v1alpha1 CR definitions. To resolve this, ensure your CR definition is updated to v1beta1 in the YAML file (for example, apiVersion: openebs.io/v1beta1).
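
For illustration, a minimal DiskPool definition using the v1beta1 API might look like the following sketch; the pool name, node name and disk device shown here are placeholders to be replaced with values from your cluster:

apiVersion: openebs.io/v1beta1
kind: DiskPool
metadata:
  name: pool-on-worker-0
  namespace: mayastor
spec:
  node: worker-0
  disks: ["/dev/sdb"]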

You can validate the schema changes by executing kubectl get crd diskpools.openebs.io.
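
To confirm which schema versions the CRD currently serves (and that v1beta1 is among them), a jsonpath query can be used, for example:

kubectl get crd diskpools.openebs.io -o jsonpath='{.spec.versions[*].name}'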

Restore to Mayastor

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

Before you begin, make sure you have the following:

  • Access to a Kubernetes cluster with Velero installed.

  • A backup of your Mongo database created using Velero.

  • Mayastor configured in your Kubernetes environment.

Step 1: Install Velero with GCP Provider on Destination (Mayastor Cluster)

Install Velero with the GCP provider, ensuring you use the same values for the BUCKET-NAME and SECRET-FILENAME placeholders that you used originally. These placeholders should be replaced with your specific values:

velero install --use-node-agent --provider gcp --plugins velero/velero-plugin-for-gcp:v1.6.0 --bucket BUCKET-NAME --secret-file SECRET-FILENAME --uploader-type restic
CustomResourceDefinition/backuprepositories.velero.io: attempting to create resource
CustomResourceDefinition/backuprepositories.velero.io: attempting to create resource client
CustomResourceDefinition/backuprepositories.velero.io: created
CustomResourceDefinition/backups.velero.io: attempting to create resource
CustomResourceDefinition/backups.velero.io: attempting to create resource client
CustomResourceDefinition/backups.velero.io: created
CustomResourceDefinition/backupstoragelocations.velero.io: attempting to create resource
CustomResourceDefinition/backupstoragelocations.velero.io: attempting to create resource client
CustomResourceDefinition/backupstoragelocations.velero.io: created
CustomResourceDefinition/deletebackuprequests.velero.io: attempting to create resource
CustomResourceDefinition/deletebackuprequests.velero.io: attempting to create resource client
CustomResourceDefinition/deletebackuprequests.velero.io: created
CustomResourceDefinition/downloadrequests.velero.io: attempting to create resource
CustomResourceDefinition/downloadrequests.velero.io: attempting to create resource client
CustomResourceDefinition/downloadrequests.velero.io: created
CustomResourceDefinition/podvolumebackups.velero.io: attempting to create resource
CustomResourceDefinition/podvolumebackups.velero.io: attempting to create resource client
CustomResourceDefinition/podvolumebackups.velero.io: created
CustomResourceDefinition/podvolumerestores.velero.io: attempting to create resource
CustomResourceDefinition/podvolumerestores.velero.io: attempting to create resource client
CustomResourceDefinition/podvolumerestores.velero.io: created
CustomResourceDefinition/restores.velero.io: attempting to create resource
CustomResourceDefinition/restores.velero.io: attempting to create resource client
CustomResourceDefinition/restores.velero.io: created
CustomResourceDefinition/schedules.velero.io: attempting to create resource
CustomResourceDefinition/schedules.velero.io: attempting to create resource client
CustomResourceDefinition/schedules.velero.io: created
CustomResourceDefinition/serverstatusrequests.velero.io: attempting to create resource
CustomResourceDefinition/serverstatusrequests.velero.io: attempting to create resource client
CustomResourceDefinition/serverstatusrequests.velero.io: created
CustomResourceDefinition/volumesnapshotlocations.velero.io: attempting to create resource
CustomResourceDefinition/volumesnapshotlocations.velero.io: attempting to create resource client
CustomResourceDefinition/volumesnapshotlocations.velero.io: created
Waiting for resources to be ready in cluster...
Namespace/velero: attempting to create resource
Namespace/velero: attempting to create resource client
Namespace/velero: created
ClusterRoleBinding/velero: attempting to create resource
ClusterRoleBinding/velero: attempting to create resource client
ClusterRoleBinding/velero: created
ServiceAccount/velero: attempting to create resource
ServiceAccount/velero: attempting to create resource client
ServiceAccount/velero: created
Secret/cloud-credentials: attempting to create resource
Secret/cloud-credentials: attempting to create resource client
Secret/cloud-credentials: created
BackupStorageLocation/default: attempting to create resource
BackupStorageLocation/default: attempting to create resource client
BackupStorageLocation/default: created
VolumeSnapshotLocation/default: attempting to create resource
VolumeSnapshotLocation/default: attempting to create resource client
VolumeSnapshotLocation/default: created
Deployment/velero: attempting to create resource
Deployment/velero: attempting to create resource client
Deployment/velero: created
DaemonSet/node-agent: attempting to create resource
DaemonSet/node-agent: attempting to create resource client
DaemonSet/node-agent: created
Velero is installed! ⛵ Use 'kubectl logs deployment/velero -n velero' to view the status.

Step 2: Verify Backup Availability

Check the availability of your previously-saved backups. If the credentials or bucket information doesn't match, you won't be able to see the backups:

velero get backup | grep 13-09-23
mongo-backup-13-09-23     Completed   0        0          2023-09-13 13:15:32 +0000 UTC   29d       default            <none>
kubectl get backupstoragelocation -n velero
NAME      PHASE       LAST VALIDATED   AGE     DEFAULT
default   Available   23s              3m32s   true

Step 3: Restore Using Velero CLI

Initiate the restore process using Velero CLI with the following command:

velero restore create mongo-restore-13-09-23 --from-backup mongo-backup-13-09-23
Restore request "mongo-restore-13-09-23" submitted successfully.
Run `velero restore describe mongo-restore-13-09-23` or `velero restore logs mongo-restore-13-09-23` for more details.

Step 4: Check Restore Status

You can check the status of the restore process by using the velero get restore command.

velero get restore

When Velero performs a restore, it deploys an init container within the application pod, responsible for restoring the volume. Initially, the restore status will be InProgress.

Because the storage class was originally set to cstor-csi-disk (this PVC was imported from a cStor volume), the restore status might temporarily stay InProgress and your PVC will remain in Pending status.

Step 5: Backup PVC and Change Storage Class

  • Retrieve the current configuration of the PVC which is in Pending status using the following command:

kubectl get pvc mongodb-persistent-storage-claim-mongod-0 -o yaml > pvc-mongo.yaml
  • Confirm that the PVC configuration has been saved by checking its existence with this command:

ls -lrt | grep pvc-mongo.yaml
  • Edit the pvc-mongo.yaml file to update its storage class. Below is the modified PVC configuration with mayastor-single-replica set as the new storage class:

The StatefulSet for Mongo will still have the cstor-csi-disk storage class at this point. This will be addressed in the steps that follow.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    role: mongo
    velero.io/backup-name: mongo-backup-13-09-23
    velero.io/restore-name: mongo-restore-13-09-23
  name: mongodb-persistent-storage-claim-mongod-0
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi
  storageClassName: mayastor-single-replica
  volumeMode: Filesystem

Step 6: Resolve the Pending PVC Issue

  • Begin by deleting the problematic PVC with the following command:

kubectl delete pvc mongodb-persistent-storage-claim-mongod-0
  • Once the PVC has been successfully deleted, you can recreate it using the updated configuration from the pvc-mongo.yaml file. Apply the new configuration with the following command:

kubectl apply -f pvc-mongo.yaml

Step 7: Check Velero init container

After recreating the PVC with Mayastor storageClass, you will observe the presence of a Velero initialization container within the application pod. This container is responsible for restoring the required volumes.

You can check the status of the restore operation by running the following command:

kubectl describe pod <enter_your_pod_name>

The output will display the pod's status, including the Velero initialization container. Initially, the status might show as "Init:0/1", indicating that the restore process is in progress.

You can track the progress of the restore by running:

velero get restore
NAME                     BACKUP                  STATUS      STARTED                         COMPLETED                       ERRORS   WARNINGS   CREATED                         SELECTOR
mongo-restore-13-09-23   mongo-backup-13-09-23   Completed   2023-09-13 13:56:19 +0000 UTC   2023-09-13 14:06:09 +0000 UTC   0        4          2023-09-13 13:56:19 +0000 UTC   <none>

You can then verify the data restoration by accessing your MongoDB instance. In the provided example, we used the "mongosh" shell to connect to the MongoDB instance and check the databases and their content. The data should reflect what was previously backed up from the cStor storage.

mongosh mongodb://admin:admin@mongod-0.mongodb-service.default.svc.cluster.local:27017

Step 8: Monitor Pod Progress

Due to the StatefulSet's configuration with three replicas, you will notice that the mongod-1 pod is created but remains in a Pending status. This behavior is expected, as the storage class is still set to cStor in the StatefulSet configuration.

Step 9: Capture the StatefulSet Configuration and Modify Storage Class

Capture the current configuration of the StatefulSet for MongoDB by running the following command:

kubectl get sts mongod -o yaml > sts-mongo-original.yaml

This command will save the existing StatefulSet configuration to a file named sts-mongo-original.yaml. Next, edit this YAML file to change the storage class to mayastor-single-replica.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    backup.velero.io/backup-volumes: mongodb-persistent-storage-claim
    meta.helm.sh/release-name: mongo
    meta.helm.sh/release-namespace: default
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
    velero.io/backup-name: mongo-backup-13-09-23
    velero.io/restore-name: mongo-restore-13-09-23
  name: mongod
  namespace: default
spec:
  podManagementPolicy: OrderedReady
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      role: mongo
  serviceName: mongodb-service
  template:
    metadata:
      creationTimestamp: null
      labels:
        environment: test
        replicaset: rs0
        role: mongo
    spec:
      containers:
      - command:
        - mongod
        - --bind_ip
        - 0.0.0.0
        - --replSet
        - rs0
        env:
        - name: MONGO_INITDB_ROOT_USERNAME
          valueFrom:
            secretKeyRef:
              key: username
              name: secrets
        - name: MONGO_INITDB_ROOT_PASSWORD
          valueFrom:
            secretKeyRef:
              key: password
              name: secrets
        image: mongo:latest
        imagePullPolicy: Always
        lifecycle:
          postStart:
            exec:
              command:
              - /bin/sh
              - -c
              - sleep 90 ; ./tmp/scripts/script.sh  > /tmp/script-log
        name: mongod-container
        ports:
        - containerPort: 27017
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /data/db
          name: mongodb-persistent-storage-claim
        - mountPath: /tmp/scripts
          name: mongo-scripts
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 10
      volumes:
      - configMap:
          defaultMode: 511
          name: mongo-replica
        name: mongo-scripts
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      annotations:
        volume.beta.kubernetes.io/storage-class: mayastor-single-replica #Make the change here
      creationTimestamp: null
      name: mongodb-persistent-storage-claim
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 3Gi
      volumeMode: Filesystem
      

Step 10: Delete StatefulSet (Cascade=False)

Delete the StatefulSet while preserving the pods with the following command:

kubectl delete sts mongod --cascade=false

You can run the following commands to verify the status:

kubectl get sts
kubectl get pods
kubectl get pvc

Step 11: Deleting Pending Secondary Pods and PVCs

Delete the MongoDB Pod mongod-1.

kubectl delete pod mongod-1

Delete the Persistent Volume Claim (PVC) for mongod-1.

kubectl delete pvc mongodb-persistent-storage-claim-mongod-1

Step 12: Recreate StatefulSet

Recreate the StatefulSet with the Yaml file.

kubectl apply -f sts-mongo-mayastor.yaml
statefulset.apps/mongod created
kubectl get pods
NAME                            READY   STATUS    RESTARTS   AGE
mongo-client-758ddd54cc-h2gwl   1/1     Running   0          31m
mongod-0                        1/1     Running   0          31m
mongod-1                        1/1     Running   0          7m54s
mongod-2                        1/1     Running   0          6m13s
ycsb-775fc86c4b-kj5vv           1/1     Running   0          31m
kubectl mayastor get volumes
ID                                    REPLICAS  TARGET-NODE                      ACCESSIBILITY  STATUS  SIZE  THIN-PROVISIONED  ALLOCATED 
 f41c2cdc-5611-471e-b5eb-1cfb571b1b87  1         gke-mayastor-pool-2acd09ca-ppxw  nvmf           Online  3GiB  false             3GiB 
 113882e1-c270-4c72-9c1f-d9e09bfd66ad  1         gke-mayastor-pool-2acd09ca-4v3z  nvmf           Online  3GiB  false             3GiB 
 fb4d6a4f-5982-4049-977b-9ae20b8162ad  1         gke-mayastor-pool-2acd09ca-q30r  nvmf           Online  3GiB  false             3GiB

Step 13: Verify Data Replication on Secondary DB

Verify data replication on the secondary database to ensure synchronization.

root@mongod-1:/# mongosh mongodb://admin:admin@mongod-1.mongodb-service.default.svc.cluster.local:27017 
Current Mongosh Log ID: 6501c744eb148521b3716af5
Connecting to:          mongodb://<credentials>@mongod-1.mongodb-service.default.svc.cluster.local:27017/?directConnection=true&appName=mongosh+1.10.6
Using MongoDB:          7.0.1
Using Mongosh:          1.10.6

For mongosh info see: https://docs.mongodb.com/mongodb-shell/

------
   The server generated these startup warnings when booting
   2023-09-13T14:19:37.984+00:00: Using the XFS filesystem is strongly recommended with the WiredTiger storage engine. See http://dochub.mongodb.org/core/prodnotes-filesystem
   2023-09-13T14:19:38.679+00:00: Access control is not enabled for the database. Read and write access to data and configuration is unrestricted
   2023-09-13T14:19:38.679+00:00: You are running this process as the root user, which is not recommended
   2023-09-13T14:19:38.679+00:00: vm.max_map_count is too low
------

rs0 [direct: secondary] test> use mydb 
switched to db mydb
rs0 [direct: secondary] mydb> db.getMongo().setReadPref('secondary')

rs0 [direct: secondary] mydb> db.accounts.find()
[
  {
    _id: ObjectId("65019e2f183959fbdbd23f00"),
    name: 'john',
    total: '1058'
  },
  {
    _id: ObjectId("65019e2f183959fbdbd23f01"),
    name: 'jane',
    total: '6283'
  },
  {
    _id: ObjectId("65019e31183959fbdbd23f02"),
    name: 'james',
    total: '472'
  }
]
rs0 [direct: secondary] mydb> 

Known Limitations

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

Volume and Pool Capacity Expansion

Once provisioned, neither Mayastor Disk Pools nor Mayastor Volumes can be re-sized. A Mayastor Pool can have only a single block device as a member. Mayastor Volumes are exclusively thick-provisioned.

Snapshots and Clones

Mayastor has no snapshot or cloning capabilities.

Volumes are "Highly Durable" but without multipathing are not "Highly Available"

Mayastor Volumes can be configured (or subsequently re-configured) to be composed of 2 or more "children" or "replicas", causing synchronously mirrored copies of the volume's data to be maintained on more than one worker node and Disk Pool. This contributes additional "durability" at the persistence layer, ensuring that viable copies of a volume's data remain even if a Disk Pool device is lost.

A Mayastor volume is currently accessible to an application only via a single target instance (NVMe-oF) of a single Mayastor pod. However, if that Mayastor pod ceases to run (through the loss of the worker node on which it's scheduled, execution failure, crashloopbackoff etc.), the HA switch-over module detects the failure and moves the target to a healthy worker node to ensure I/O continuity.

Known Issues

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

Installation Issues

A Mayastor pod restarts unexpectedly with exit code 132 whilst mounting a PVC

The Mayastor process has been sent the SIGILL signal as the result of attempting to execute an illegal instruction. This indicates that the host node's CPU does not satisfy the prerequisite instruction set level for Mayastor (SSE4.2 on x86-64).

Deploying Mayastor on RKE & Fedora CoreOS

In addition to ensuring that the general prerequisites for installation are met, it is necessary to add the following directory mapping to the services_kubelet->extra_binds section of the cluster's cluster.yml file:

/opt/rke/var/lib/kubelet/plugins:/var/lib/kubelet/plugins

If this is not done, CSI socket paths won't match expected values and the Mayastor CSI driver registration process will fail, resulting in the inability to provision Mayastor volumes on the cluster.

Other Issues

Mayastor pod may restart if a pool disk is inaccessible

If the disk device used by a Mayastor pool becomes inaccessible or enters the offline state, the hosting Mayastor pod may panic. A fix for this behaviour is under investigation.

Lengthy worker node reboot times

Rebooting a node that runs applications mounting Mayastor volumes can take tens of minutes. The reason is the long default NVMe controller timeout (ctrl_loss_tmo). The solution is to follow Kubernetes best practices and cordon (and drain) the node, ensuring there are no application pods running on it before the reboot. The ioTimeout storage class parameter can be used to fine-tune the timeout.
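
As an illustration, a StorageClass that sets ioTimeout might look like the sketch below; the class name and the other parameter values are examples rather than recommendations:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mayastor-io-timeout
parameters:
  repl: "1"
  protocol: nvmf
  ioTimeout: "60"
provisioner: io.openebs.csi-mayastor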

Node restarts on scheduling an application

Deploying an application pod on a worker node which hosts Mayastor and the Prometheus metrics exporter can cause that node to restart. The issue originated from a kernel bug: once the nexus disconnects, the entries under /host/sys/class/hwmon/ should be removed, which does not happen in this case (the issue was fixed by a kernel patch).

Fix: Use kernel version 5.13 or later if deploying Mayastor in conjunction with the Prometheus metrics exporter.

For reference:

  • For cloud provider plugins, refer to the Velero Docs - Providers section.

  • Velero GKE Configuration (Prerequisites): You can find the prerequisites and configuration details for Velero in a Google Kubernetes Engine (GKE) environment on GitHub.

  • Object Storage Requirement: To store backups, Velero requires an object storage bucket. In this case, a Google Cloud Storage (GCS) bucket is used. Configuration details and setup can be found on GitHub.

  • Velero Basic Installation: For a step-by-step guide on the basic installation of Velero, refer to the Velero Docs - Basic Install section.

  • For the prerequisites, refer to the overview section.

  • MongoDB Backup and Restore Error Using Velero.

  • Prerequisite: Prepare a cluster by following the steps outlined in this guide.

  • For more details about this issue, refer to the GitHub issue.

  • For further configuration of Mayastor, including storage pools, storage classes, persistent volume claims, and application setup, refer to the official documentation.

FAQs

Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.

How does it help to keep my data safe?

Mayastor's storage engine supports synchronous mirroring to enhance the durability of data at rest within whatever physical persistence layer is in use. When volumes are provisioned which are configured for replication (a user can control the count of active replicas which should be maintained, on a per StorageClass basis), write I/O operations issued by an application to that volume are amplified by its controller ("nexus") and dispatched to all its active replicas. Only if every replica completes the write successfully on its own underlying block device will the I/O completion be acknowledged to the controller. Otherwise, the I/O is failed and the caller must make its own decision as to whether it should be retried. If a replica is determined to have faulted (I/O cannot be serviced within the configured timeout period, or not without error), the control plane will automatically take corrective action and remove it from the volume. If spare capacity is available within a Mayastor pool, a new replica will be created as a replacement and automatically brought into synchronisation with the existing replicas. The data path for a replicated volume is described in more detail here.

How is it configured?

This Mayastor documentation contains sections, such as the Quickstart Guide, which are focused on initial 'quickstart' deployment scenarios, including the correct configuration of underlying hardware and software, and of Mayastor features such as "Storage Nodes" (MSNs) and "Disk Pools" (MSPs). Information describing tuning for the optimisation of performance is also provided in the Performance Tips section.

What is the basis for its performance claims? Has it been benchmarked?

Mayastor has been built to leverage the performance potential of contemporary, high-end, solid state storage devices as a foremost design consideration. For this reason, the I/O path is predicated on NVMe, a transport which is both highly CPU efficient and which demonstrates highly linear resource scaling. The data path runs entirely within user space, also contributing efficiency gains as syscalls are avoided, and is both interrupt and lock free.

MayaData has performed its own benchmarking tests in collaboration with Intel, using latest generation Intel P5800X Optane devices "The World's Fastest Data Centre SSD". In those tests it was determined that, on average, across a range of read/write ratios and both with and without synchronous mirroring enabled, the overhead imposed by the Mayastor I/O path was well under 10% (in fact, much closer to 6%). Further information regarding the testing performed may be found here.

What is the on-disk format used by Disk Pools? Is it also open source?

Mayastor makes use of parts of the open source Storage Performance Development Kit (SPDK) project, contributed by Intel. Mayastor's Storage Pools use the SPDK's Blobstore structure as their on-disk persistence layer. Blobstore structures and layout are documented.

Since the replicas (data copies) of Mayastor volumes are held entirely within Blobstores, it is not possible to directly access the data held on a pool's block devices from outside of Mayastor. Equally, Mayastor cannot directly 'import' and use existing volumes which aren't of Mayastor origin. The project's maintainers are considering alternative options for the persistence layer which may support such data migration goals.

Can the size / capacity of a Disk Pool be changed?

The size of a Mayastor Pool is fixed at the time of creation and is immutable. A single pool may have only one block device as a member. These constraints may be removed in later versions.

How can I ensure that replicas aren't scheduled onto the same node? How about onto nodes in the same rack / availability zone?

The replica placement logic of Mayastor's control plane doesn't permit replicas of the same volume to be placed onto the same node, even if it were to be within different Disk Pools. For example, if a volume with replication factor 3 is to be provisioned, then there must be three healthy Disk Pools available, each with sufficient free capacity and each located on its own Mayastor node. Further enhancements to topology awareness are under consideration by the maintainers.

How can I see the node on which the active nexus for a particular volume resides?

The Mayastor kubectl plugin is used to obtain this information.
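
For example, the volume listing shown earlier in this guide includes a TARGET-NODE column, which identifies the node hosting the active nexus for each volume:

kubectl mayastor get volumes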

Is there a way to automatically rebalance data across available nodes? Can data be manually re-distributed?

No. This may be a feature of future releases.

Can Mayastor do async replication to a node in the same cluster? How about a different cluster?

Mayastor does not perform asynchronous replication.

Does Mayastor support RAID?

Mayastor pools do not implement any form of RAID, erasure coding or striping. If higher levels of data redundancy are required, Mayastor volumes can be provisioned with a replication factor of greater than 1, which will result in synchronously mirrored copies of their data being stored in multiple Disk Pools across multiple Storage Nodes. If the block device on which a Disk Pool is created is actually a logical unit backed by its own RAID implementation (e.g. a Fibre Channel attached LUN from an external SAN) it can still be used within a Mayastor Disk Pool whilst providing protection against physical disk device failures.

Does Mayastor perform compression and/or deduplication?

No.

Does Mayastor support snapshots? Clones?

No but these may be features of future releases.

Which CPU architectures are supported? What are the minimum hardware requirements?

Mayastor nightly builds and releases are compiled and tested on x86-64, under Ubuntu 20.04 LTS with a 5.13 kernel. Some effort has been made to allow compilation on ARM platforms but this is currently considered experimental and is not subject to integration or end-to-end testing by Mayastor's maintainers.

Mayastor does not run on Raspberry Pi as the version of the SPDK used by Mayastor requires ARMv8 Crypto extensions which are not currently available for Pi.

Minimum hardware requirements are discussed in the quickstart section of this documentation.

Does Mayastor suffer from TCP congestion when using NVME-TCP?

Mayastor, like any other solution leveraging TCP for network transport, may suffer from network congestion, as TCP will try to slow down transfer speeds. It is important to keep an eye on networking and fine-tune the TCP/IP stack as appropriate. This tuning can include (but is not limited to) send and receive buffers, MSS, congestion control algorithms (e.g. you may try DCTCP), etc.
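
By way of illustration only, congestion control and socket buffer sizes can be adjusted with sysctl; the values below are placeholders to be validated for your environment, and DCTCP additionally requires ECN support on the network path:

sysctl -w net.ipv4.tcp_congestion_control=dctcp
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216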

Why do Mayastor pods show high levels of CPU utilization when there is little or no I/O being processed?

Mayastor has been designed to leverage the performance capabilities of contemporary high-end solid-state storage devices. A significant aspect of this is the selection of a polling-based I/O service queue, rather than an interrupt-driven one. This minimises the latency introduced into the data path, but at the cost of additional CPU utilisation by the "reactor" - the poller operating at the heart of the Mayastor pod. When Mayastor pods have been deployed to a cluster, it is expected that these daemonset instances will make full use of their CPU allocation, even when there is no I/O load on the cluster. This is simply the poller continuing to operate at full speed, waiting for I/O. For the same reason, it is recommended that when configuring the CPU resource limits for the Mayastor daemonset, only full, not fractional, CPU limits are set; fractional allocations will also incur additional latency, resulting in a reduction in overall performance potential. The extent to which this performance degradation is noticeable in practice will depend on the performance of the underlying storage in use, as well as whatever other bottlenecks/constraints may be present in the system as configured.
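
A sketch of the corresponding resource settings for the io-engine (mayastor) container, assuming two dedicated cores; the figures are illustrative, not a sizing recommendation:

resources:
  limits:
    cpu: "2"        # whole CPUs only; fractional limits add latency
    memory: 1Gi
  requests:
    cpu: "2"
    memory: 1Gi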

Does the supportability tool expose sensitive data?

The supportability tool generates support bundles, which are used for debugging purposes. These bundles are created in response to the user's invocation of the tool and can be transmitted only by the user. To view the list of collected information, visit the supportability section.

What happens when a PV with reclaim policy set to retain is deleted?

In Kubernetes, when a PVC is created with the reclaim policy set to 'Retain', the PV bound to this PVC is not deleted even if the PVC is deleted. One can manually delete the PV by issuing the command kubectl delete pv <pv-name>; however, the underlying storage resources could be left behind, as the CSI volume provisioner (external provisioner) is not aware of this. To resolve this issue of dangling storage objects, Mayastor has introduced a PV garbage collector, which is deployed as part of the Mayastor CSI controller-plugin.
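
If, instead, the underlying storage should be cleaned up automatically when the volume is released, the reclaim policy of an existing PV can be switched before it is deleted; <pv-name> is a placeholder:

kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'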

How does the PV garbage collector work?

The PV garbage collector deploys a watcher component, which subscribes to the Kubernetes Persistent Volume deletion events. When a PV is deleted, an event is generated by the Kubernetes API server and is received by this component. Upon a successful validation of this event, the garbage collector deletes the corresponding Mayastor volume resources.

How to disable cow for btrfs filesystem?

To disable CoW for a btrfs filesystem, use nodatacow as a mountOption in the StorageClass used to provision the volume.
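
A sketch of such a StorageClass, assuming a btrfs filesystem and otherwise example parameter values:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mayastor-btrfs-nodatacow
parameters:
  repl: "1"
  protocol: nvmf
  fsType: btrfs
provisioner: io.openebs.csi-mayastor
mountOptions:
- nodatacow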
