Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
This website/page will be End-of-life (EOL) after 31 August 2024. We recommend you to visit OpenEBS Documentation for the latest Mayastor documentation (v2.6 and above).
Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.
The objective of this section is to provide the user and evaluator of Mayastor with a topological view of the gross anatomy of a Mayastor deployment. A description will be made of the expected pod inventory of a correctly deployed cluster, the roles and functions of the constituent pods and related Kubernetes resource types, and of the high level interactions between them and the orchestration thereof.
More detailed guides to Mayastor's components, their design and internal structure, and instructions for building Mayastor from source, are maintained within the project's GitHub repository.
Control Plane
agent-core
Deployment
Principle control plane actor
Single
csi-controller
Deployment
Hosts Mayastor's CSI controller implementation and CSI provisioner side car
Single
api-rest
Pod
Hosts the public API REST server
Single
api-rest
Service
Exposes the REST API server
operator-diskpool
Deployment
Hosts DiskPool operator
Single
csi-node
DaemonSet
Hosts CSI Driver node plugin containers
All worker nodes
etcd
StatefulSet
Hosts etcd Server container
Configurable(Recommended: Three replicas)
etcd
Service
Exposes etcd DB endpoint
Single
etcd-headless
Service
Exposes etcd DB endpoint
Single
io-engine
DaemonSet
Hosts Mayastor I/O engine
User-selected nodes
DiskPool
CRD
Declares a Mayastor pool's desired state and reflects its current state
User-defined, one or many
Additional components
metrics-exporter-pool
Sidecar container (within io-engine DaemonSet)
Exports pool related metrics in Prometheus format
All worker nodes
pool-metrics-exporter
Service
Exposes exporter API endpoint to Prometheus
Single
promtail
DaemonSet
Scrapes logs of Mayastor-specific pods and exports them to Loki
All worker nodes
loki
StatefulSet
Stores the historical logs exported by promtail pods
Single
loki
Service
Exposes the Loki API endpoint via ClusterIP
Single
The io-engine pod encapsulates Mayastor containers, which implement the I/O path from the block devices at the persistence layer, up to the relevant initiators on the worker nodes mounting volume claims. The Mayastor process running inside this container performs four major functions:
Creates and manages DiskPools hosted on that node.
Creates, exports, and manages volume controller objects hosted on that node.
Creates and exposes replicas from DiskPools hosted on that node over NVMe-TCP.
Provides a gRPC interface service to orchestrate the creation, deletion and management of the above objects, hosted on that node. Before the io-engine pod starts running, an init container attempts to verify connectivity to the agent-core in the namespace where Mayastor has been deployed. If a connection is established, the io-engine pod registers itself over gRPC to the agent-core. In this way, the agent-core maintains a registry of nodes and their supported api-versions. The scheduling of these pods is determined declaratively by using a DaemonSet specification. By default, a nodeSelector field is used within the pod spec to select all worker nodes to which the user has attached the label openebs.io/engine=mayastor
as recipients of an io-engine pod. In this way, the node count and location are set appropriately to the hardware configuration of the worker nodes, and the capacity and performance demands of the cluster.
The csi-node pods within a cluster implement the node plugin component of Mayastor's CSI driver. As such, their function is to orchestrate the mounting of Mayastor-provisioned volumes on worker nodes on which application pods consuming those volumes are scheduled. By default, a csi-node pod is scheduled on every node in the target cluster, as determined by a DaemonSet resource of the same name. Each of these pods encapsulates two containers, csi-node, and csi-driver-registrar. The node plugin does not need to run on every worker node within a cluster and this behavior can be modified, if desired, through the application of appropriate node labeling and the addition of a corresponding nodeSelector entry within the pod spec of the csi-node DaemonSet. However, it should be noted that if a node does not host a plugin pod, then it will not be possible to schedule an application pod on it, which is configured to mount Mayastor volumes.
etcd is a distributed reliable key-value store for the critical data of a distributed system. Mayastor uses etcd as a reliable persistent store for its configuration and state data.
The supportability tool is used to create support bundles (archive files) by interacting with multiple services present in the cluster where Mayastor is installed. These bundles contain information about Mayastor resources like volumes, pools and nodes, and can be used for debugging. The tool can collect the following information:
Topological information of Mayastor's resource(s) by interacting with the REST service
Historical logs by interacting with Loki. If Loki is unavailable, it interacts with the kube-apiserver to fetch logs.
Mayastor-specific Kubernetes resources by interacting with the kube-apiserver
Mayastor-specific information from etcd (internal) by interacting with the etcd server.
Loki aggregates and centrally stores logs from all Mayastor containers which are deployed in the cluster.
Promtail is a log collector built specifically for Loki. It uses the configuration file for target discovery and includes analogous features for labeling, transforming, and filtering logs from containers before ingesting them to Loki.
This website/page will be End-of-life (EOL) after 31 August 2024. We recommend you to visit OpenEBS Documentation for the latest Mayastor documentation (v2.6 and above).
Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.
When a node allocates storage capacity for a replica of a persistent volume (PV) it does so from a DiskPool. Each node may create and manage zero, one, or more such pools. The ownership of a pool by a node is exclusive. A pool can manage only one block device, which constitutes the entire data persistence layer for that pool and thus defines its maximum storage capacity.
A pool is defined declaratively, through the creation of a corresponding DiskPool
custom resource on the cluster. The DiskPool must be created in the same namespace where Mayastor has been deployed. User configurable parameters of this resource type include a unique name for the pool, the node name on which it will be hosted and a reference to a disk device which is accessible from that node. The pool definition requires the reference to its member block device to adhere to a discrete range of schemas, each associated with a specific access mechanism/transport/ or device type.
Mayastor versions before 2.0.1 had an upper limit on the number of retry attempts in the case of failure in create events
in the DSP operator. With this release, the upper limit has been removed, which ensures that the DiskPool operator indefinitely reconciles with the CR.
spec.disks
under DiskPool CRIt is highly recommended to specify the disk using a unique device link that remains unaltered across node reboots. One such device link is its UUID
. To get the UUID of a disk, execute: sudo blkid
Usage of the device name (for example, /dev/sdx) is not advised, as it may change if the node reboots, which might cause data corruption.
Disk(non PCI) with disk-by-guid reference (Best Practice)
Device File
aio:////dev/disk/by-uuid/ OR uring:////dev/disk/by-uuid/
Asynchronous Disk(AIO)
Device File
/dev/sdx
Asynchronous Disk I/O (AIO)
Device File
aio:///dev/sdx
io_uring
Device File
uring:///dev/sdx
Once a node has created a pool it is assumed that it henceforth has exclusive use of the associated block device; it should not be partitioned, formatted, or shared with another application or process. Any pre-existing data on the device will be destroyed.
A RAM drive isn't suitable for use in production as it uses volatile memory for backing the data. The memory for this disk emulation is allocated from the hugepages pool. Make sure to allocate sufficient additional hugepages resource on any storage nodes which will provide this type of storage.
To get started, it is necessary to create and host at least one pool on one of the nodes in the cluster. The number of pools available limits the extent to which the synchronous N-way mirroring (replication) of PVs can be configured; the number of pools configured should be equal to or greater than the desired maximum replication factor of the PVs to be created. Also, while placing data replicas ensure that appropriate redundancy is provided. Mayastor's control plane will avoid placing more than one replica of a volume on the same node. For example, the minimum viable configuration for a Mayastor deployment which is intended to implement 3-way mirrored PVs must have three nodes, each having one DiskPool, with each of those pools having one unique block device allocated to it.
Using one or more the following examples as templates, create the required type and number of pools.
When using the examples given as guides to creating your own pools, remember to replace the values for the fields "metadata.name", "spec.node" and "spec.disks" as appropriate to your cluster's intended configuration. Note that whilst the "disks" parameter accepts an array of values, the current version of Mayastor supports only one disk device per pool.
The status of DiskPools may be determined by reference to their cluster CRs. Available, healthy pools should report their State as online
. Verify that the expected number of pools have been created and that they are online.
Mayastor-2.0.1 adds two new fields to the DiskPool operator YAML:
status.cr_state: The cr_state
, which can either be creating, created or terminating, will be used by the operator to reconcile with the CR. The cr_state
is set to Terminating
when a CR delete event is received.
status.pool_status: The pool_status
represents the status of the respective control plane pool resource.
Pool configuration and state information can also be obtained by using the Mayastor kubectl plugin
Mayastor dynamically provisions PersistentVolumes (PVs) based on StorageClass definitions created by the user. Parameters of the definition are used to set the characteristics and behaviour of its associated PVs. For a detailed description of these parameters see storage class parameter description. Most importantly StorageClass definition is used to control the level of data protection afforded to it (that is, the number of synchronous data replicas which are maintained, for purposes of redundancy). It is possible to create any number of StorageClass definitions, spanning all permitted parameter permutations.
We illustrate this quickstart guide with two examples of possible use cases; one which offers no data redundancy (i.e. a single data replica), and another having three data replicas.
Both the example YAMLs given below have thin provisioning enabled. You can modify these as required to match your own desired test cases, within the limitations of the cluster under test.
The default installation of Mayastor includes the creation of a StorageClass with the name mayastor-single-replica
. The replication factor is set to 1. Users may either use this StorageClass or create their own.
This website/page will be End-of-life (EOL) after 31 August 2024. We recommend you to visit OpenEBS Documentation for the latest Mayastor documentation (v2.6 and above).
Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.
Storage class resource in Kubernetes is used to supply parameters to volumes when they are created. It is a convenient way of grouping volumes with common characteristics. All parameters take a string value. Brief explanation of each supported Mayastor parameter follows.
The storage class parameter local
has been deprecated and is a breaking change in Mayastor version 2.0. Ensure that this parameter is not used.
File system that will be used when mounting the volume. The default file system when not specified is 'ext4'. We recommend to use 'xfs' that is considered to be more advanced and performant. Though make sure that XFS is installed on all nodes in the cluster before using it.
Expressed in seconds and it sets the io_timeout
parameter in the linux block device driver for Mayastor volume. It also sets ctrl-loss-tmo
timeout in linux NVMe driver that is used for detecting inaccessible target (nexus in our case). In either case, if IO cannot be served by Mayastor volume, because it is stuck or because the nvmf target is not there, it will take roughly this time to the initiator to return an error to data consumer, which can be either filesystem or an application if using raw block volume. The usual reaction of a filesystem to such error is to switch the filesystem to read-only state and require manual intervention from user - remounting. Be careful when setting this to a low value because any node reboot that does not fit into this time interval may require manual intervention afterwards or can result in application failure - depending on how the error is handled by the application.
The setting is supported only when using "nvmf" protocol.
The parameter 'protocol' takes the value nvmf
(NVMe over TCP protocol). It is used to mount the volume (target) on the application node.
The string value should be a number and the number should be greater than zero. Mayastor control plane will try to keep always this many copies of the data if possible. If set to one then the volume does not tolerate any node failure. If set to two, then it tolerates one node failure. If set to three, then two node failures, etc.
The volumes can either be thick
or thin
provisioned. Adding the thin
parameter to the StorageClass YAML allows the volume to be thinly provisioned. To do so, add thin: true
under the parameters
spec, in the StorageClass YAML. Sample YAML When the volumes are thinly provisioned, the user needs to monitor the pools, and if these pools start to run out of space, then either new pools must be added or volumes deleted to prevent thinly provisioned volumes from getting degraded or faulted. This is because when a pool with more than one replica runs out of space, Mayastor moves the largest out-of-space replica to another pool and then executes a rebuild. It then checks if all the replicas have sufficient space; if not, it moves the next largest replica to another pool, and this process continues till all the replicas have sufficient space.
The capacity usage on a pool can be monitored using exporter metrics.
The agents.core.capacity.thin
spec present in the Mayastor helm chart consists of the following configurable parameters that can be used to control the scheduling of thinly provisioned replicas:
poolCommitment parameter specifies the maximum allowed pool commitment limit (in percent).
volumeCommitment parameter specifies the minimum amount of free space that must be present in each replica pool in order to create new replicas for an existing volume. This value is specified as a percentage of the volume size.
volumeCommitmentInitial minimum amount of free space that must be present in each replica pool in order to create new replicas for a new volume. This value is specified as a percentage of the volume size.
Note:
By default, the volumes are provisioned as thick
.
For a pool of a particular size, say 10 Gigabytes, a volume > 10 Gigabytes cannot be created, as Mayastor 2.2.0 does not support pool expansion.
The replicas for a given volume can be either all thick or all thin. Same volume cannot have a combination of thick and thin replicas
stsAffinityGroup
represents a collection of volumes that belong to instances of Kubernetes StatefulSet. When a StatefulSet is deployed, each instance within the StatefulSet creates its own individual volume, which collectively forms the stsAffinityGroup
. Each volume within the stsAffinityGroup
corresponds to a pod of the StatefulSet.
This feature enforces the following rules to ensure the proper placement and distribution of replicas and targets so that there isn't any single point of failure affecting multiple instances of StatefulSet.
Anti-Affinity among single-replica volumes : This rule ensures that replicas of different volumes are distributed in such a way that there is no single point of failure. By avoiding the colocation of replicas from different volumes on the same node.
Anti-Affinity among multi-replica volumes : If the affinity group volumes have multiple replicas, they already have some level of redundancy. This feature ensures that in such cases, the replicas are distributed optimally for the stsAffinityGroup volumes.
Anti-affinity among targets : The High Availability feature ensures that there is no single point of failure for the targets. The stsAffinityGroup
ensures that in such cases, the targets are distributed optimally for the stsAffinityGroup volumes.
By default, the stsAffinityGroup
feature is disabled. To enable it, modify the storage class YAML by setting the parameters.stsAffinityGroup
parameter to true.
This website/page will be End-of-life (EOL) after 31 August 2024. We recommend you to visit OpenEBS Documentation for the latest Mayastor documentation (v2.6 and above).
Mayastor is now also referred to as OpenEBS Replicated PV Mayastor.
This quickstart guide describes the actions necessary to perform a basic installation of Mayastor on an existing Kubernetes cluster, sufficient for evaluation purposes. It assumes that the target cluster will pull the Mayastor container images directly from OpenEBS public container repositories. Where preferred, it is also possible to build Mayastor locally from source and deploy the resultant images but this is outside of the scope of this guide.
Deploying and operating Mayastor in production contexts requires a foundational knowledge of Mayastor internals and best practices, found elsewhere within this documentation.