Monitoring API

The monitoring API currently provides the following monitoring information:

  • Long-term cluster health indicators
  • Metrics for:
    • Cassandra status
    • read and write operations per second
    • cpu utilization
    • disk utilization
    • pending compactions and active repairs

Metrics are provided either for an individual node or for all nodes in a cluster or data centre.

The API also provides key statistics for each table in the cluster (similar to what is available through "nodetool tablehistograms"):

  • read & write counts (mean, distribution)
  • read & write latency (mean, distribution)
  • live cells & tombstones per read (mean, max)
  • number of sstables read for each read operation (mean, max)

The set of available metrics will expand as we build out this API. Descriptions of each of the metrics can be found in the monitoring section of this support site:
https://support.instaclustr.com/hc/en-us/sections/200689300-Monitoring-Information

Authentication

All requests to the API must use Basic Authentication and contain a valid username and the monitoring API key. API keys are created per user account and can be retrieved via the Instaclustr Console from the Account > API Key tab.
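As a minimal sketch of the authentication described above, the following Python builds the Basic Authentication header using only the standard library. The function name basic_auth_header and the sample credentials are illustrative assumptions, not part of the API.

```python
import base64


def basic_auth_header(username: str, api_key: str) -> dict:
    """Return an HTTP Basic Authentication header for the given credentials."""
    # Basic auth is the base64 encoding of "username:password"; here the
    # monitoring API key takes the place of the password.
    raw = f"{username}:{api_key}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Hypothetical credentials, for illustration only.
headers = basic_auth_header("exampleuser", "example-api-key")
print(sorted(headers))
```

The resulting dictionary can be passed as the headers of any HTTP client request.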

All available metrics are updated every 20 seconds (i.e. requesting the same metric twice in 20 seconds will always return the same response).
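Because metrics only change every 20 seconds, a client can reuse a response instead of repeating the request. The sketch below is one possible approach, not something the API requires: MetricCache and its fetch callable are hypothetical names, and the fetch function is whatever code actually performs the GET request.

```python
import time

REFRESH_SECONDS = 20  # update interval stated in this article


class MetricCache:
    """Reuse a response until the API could have produced a newer one."""

    def __init__(self, fetch, ttl=REFRESH_SECONDS, clock=time.monotonic):
        self._fetch = fetch    # callable taking a URL and returning a response
        self._ttl = ttl
        self._clock = clock    # injectable so the cache can be tested offline
        self._entries = {}     # url -> (fetched_at, response)

    def get(self, url):
        now = self._clock()
        cached = self._entries.get(url)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]   # still within the refresh window
        response = self._fetch(url)
        self._entries[url] = (now, response)
        return response
```

Wrapping the request function this way also helps a polling client stay under the request rate limit.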

Cluster Health Indicator

The Cluster Health Indicator API provides a summary of indicators of the long-term health of your cluster. It is retrieved by making a GET request to https://api.instaclustr.com/monitoring/v1/clusters/<clusterId>/indicators

The API will respond with status 200 OK and a JSON packet containing the following information:

[
    {
        "type": "REPLICATION_STRATEGY",
        "stateDetails": {
            "PASS": [
                {
                    "message": "",
                    "keyspace": "testkeyspace"
                }
            ]
        }
    },
    {
        "type": "REPLICATION_FACTOR",
        "stateDetails": {
            "PASS": [
                {
                    "message": "",
                    "keyspace": "testkeyspace"
                }
            ]
        }
    },
    {
        "type": "DISK_USAGE",
        "stateDetails": {
            "PASS": [
                {
                    "message": "",
                    "publicIp": "52.5.37.217",
                    "privateIp": "10.224.145.126"
                },
                {
                    "message": "",
                    "publicIp": "34.232.115.13",
                    "privateIp": "10.224.80.183"
                },
                {
                    "message": "",
                    "publicIp": "34.233.151.239",
                    "privateIp": "10.224.9.122"
                }
            ]
        }
    },
    {
        "type": "PARTITION_SIZE",
        "stateDetails": {
            "PASS": [
                {
                    "message": "",
                    "keyspace": "testkeyspace",
                    "table": "units"
                },
                {
                    "message": "",
                    "keyspace": "testkeyspace",
                    "table": "students"
                }
            ]
        }
    },
    {
        "type": "TOMBSTONE_LIVECELL",
        "stateDetails": {
            "UNKNOWN": [
                {
                    "message": "No tobmstone/liveCell information found.",
                    "keyspace": "testkeyspace",
                    "table": "units"
                },
                {
                    "message": "No tobmstone/liveCell information found.",
                    "keyspace": "testkeyspace",
                    "table": "students"
                }
            ]
        }
    }
]

Example: Response packet showing cluster health

The output JSON consists of:

  • type: The name of the indicator being returned. The API returns five indicator types: REPLICATION_STRATEGY and REPLICATION_FACTOR for each keyspace, DISK_USAGE for each node, and PARTITION_SIZE and TOMBSTONE_LIVECELL for each table.
  • stateDetails: The state of the indicator, keyed by state name. The state can be PASS, UNKNOWN, FAIL or WARN, with further details provided in the form of a message.

A detailed description of cluster health indicators can be found in this support article:

https://support.instaclustr.com/hc/en-us/articles/226437447-Cluster-Health-Check
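To illustrate how the indicator response might be consumed, here is a small Python sketch that lists every entry whose state is not PASS. It works on a trimmed copy of the sample response above; the function name non_passing is an assumption made for this example.

```python
import json

# Trimmed sample in the shape of the response shown in this section.
SAMPLE = """[
  {"type": "DISK_USAGE",
   "stateDetails": {"PASS": [
     {"message": "", "publicIp": "52.5.37.217", "privateIp": "10.224.145.126"}]}},
  {"type": "TOMBSTONE_LIVECELL",
   "stateDetails": {"UNKNOWN": [
     {"message": "No tobmstone/liveCell information found.",
      "keyspace": "testkeyspace", "table": "units"}]}}
]"""


def non_passing(indicators):
    """Return (type, state, detail) tuples for every entry that is not PASS."""
    findings = []
    for indicator in indicators:
        for state, details in indicator["stateDetails"].items():
            if state == "PASS":
                continue
            for detail in details:
                findings.append((indicator["type"], state, detail))
    return findings


for kind, state, detail in non_passing(json.loads(SAMPLE)):
    print(kind, state, detail.get("message", ""))
```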

Metrics

Metrics are requested by constructing a GET request, consisting of:

  • type: Either 'clusters', 'datacentres' or 'nodes'. Specifying 'clusters' will return the metrics for each node in the cluster. Specifying 'datacentres' will return the metrics for each node belonging to the data centre. Specifying 'nodes' will return the metrics for a specific node.
  • UUID or public IP: If the type is set to 'clusters' or 'datacentres', then the UUID of the cluster or data centre must be specified. However, if the type is set to 'nodes', then either the node's UUID or public IP may be specified.
  • metrics: The metrics to return are specified as a comma-delimited query string parameter. Up to 20 metrics may be specified.
https://api.instaclustr.com/monitoring/v1/clusters/e7342f08-d32f-41af-95be-cfaa0a433a26?metrics=n::cpuUtilization,n::diskUtilization

Example: Endpoint to return the CPU and disk utilization for each node in the cluster with a UUID of e7342f08-d32f-41af-95be-cfaa0a433a26 

https://api.instaclustr.com/monitoring/v1/datacentres/001224dc-989c-4ad0-8b37-1ce345065b8f?metrics=n::cassandraReads,n::cassandraWrites

Example: Endpoint to return the reads and writes per second by Cassandra for each node belonging to the data centre with a UUID of 001224dc-989c-4ad0-8b37-1ce345065b8f

https://api.instaclustr.com/monitoring/v1/nodes/52.70.191.97?metrics=cf::tk1::tcf1::readlatencydistribution

Example: Endpoint to return the read latency distribution for the 'tcf1' table in the 'tk1' keyspace, for just the 52.70.191.97 node. 

For a complete list of available metrics, refer to the Reference section.
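The three request parts described above can be assembled programmatically. The following Python is a sketch under stated assumptions: metrics_url is a hypothetical helper, and the validation simply mirrors the rules given in this section (three valid types, at most 20 metrics).

```python
from urllib.parse import quote

BASE_URL = "https://api.instaclustr.com/monitoring/v1"
VALID_TYPES = ("clusters", "datacentres", "nodes")
MAX_METRICS = 20  # limit stated in this article


def metrics_url(resource_type: str, identifier: str, metrics: list) -> str:
    """Build a metrics endpoint URL from its three parts."""
    if resource_type not in VALID_TYPES:
        raise ValueError(f"type must be one of {VALID_TYPES}")
    if not 1 <= len(metrics) <= MAX_METRICS:
        raise ValueError(f"between 1 and {MAX_METRICS} metrics are required")
    # Keep ':' and ',' unescaped so the query string matches the examples above.
    query = quote(",".join(metrics), safe=":,")
    return f"{BASE_URL}/{resource_type}/{identifier}?metrics={query}"


print(metrics_url(
    "clusters",
    "e7342f08-d32f-41af-95be-cfaa0a433a26",
    ["n::cpuUtilization", "n::diskUtilization"],
))
```

Rejecting invalid input before sending avoids a round trip that would end in a 400 Bad Request.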

Successfully processed metric API requests will return a 200 status code and accompanying JSON packet. JSON packets follow the same basic structure as listed in the following example:

[
   {
      "id":"be456b5e-e81a-4ea3-99f1-23905942d1d9",
      "payload":[
         {
            "metric":"cpuUtilization",
            "type":"percentage",
            "unit":"1",
            "values":[
               {
                  "time":"2017-01-04T03:53:32.000Z",
                  "value":"7.401636"
               }
            ]
         }
      ],
      "publicIp":"123.123.123.123",
      "privateIp":"10.0.0.1",
      "rack":{
         "name":"us-east-1a",
         "dataCentre":{
            "name":"US_EAST_1",
            "provider":"AWS_VPC",
            "customDCName":"AWS_VPC_US_EAST_1"
         },
         "providerAccount":{
            "name":"INSTACLUSTR",
            "provider":"AWS_VPC"
         }
      }
   }
]

Example: Response with CPU Utilization for a single node 

Each payload item represents an individual metric and will consist of:

  • metric:  The name of the metric being returned
  • type: The sub-type of the metric that is being measured (e.g. for the diskUsed metric, the available 'types' are livediskspaceused and totaldiskspaceused)
  • unit:  The unit of measurement.  The following unit abbreviations are used:
    • GB: Gigabyte
    • MB: Megabyte
    • B: Byte
    • s: Second
    • ms: Millisecond
    • us: Microsecond
    • 1: Non-standard unit (e.g. percentage)
    • us/1: Microseconds per non-standard unit (e.g. latency per read operation)
    • 1/s: Non-standard unit per second (e.g. write operations per second)
  • values: An array of time/value maps containing the measurement as recorded by Instaclustr

If multiple metrics are requested, the response will include multiple payload entries:

[
    {
        "id": "ce456b5e-c81a-4ea3-99f1-13805942d1d9",
        "payload": [
            {
                "metric": "diskUtilization",
                "type": "percentage",
                "unit": "1",
                "values": [
                    {
                        "time": "2017-01-04T03:59:14.000Z",
                        "value": "47.104115"
                    }
                ]
            },
            {
                "metric": "cpuUtilization",
                "type": "percentage",
                "unit": "1",
                "values": [
                    {
                        "time": "2017-01-04T03:59:14.000Z",
                        "value": "7.545443"
                    }
                ]
            }
        ],
        "publicIp": "123.123.123.123",
        "privateIp": "10.0.0.1",
        "rack": {
            "name": "us-east-1a",
            "dataCentre": {
                "name": "US_EAST_1",
                "provider": "AWS_VPC",
                "customDCName": "AWS_VPC_US_EAST_1"
            },
            "providerAccount": {
                "name": "INSTACLUSTR",
                "provider": "AWS_VPC"
            }
        }
    }
]

Example: Get CPU Utilization and Disk Utilization for a single node 
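Since values are returned as JSON strings and timestamps as text, a client usually needs to convert them before use. The sketch below flattens a trimmed copy of the sample response into one record per measurement; flatten is a hypothetical helper name, and the timestamp format is inferred from the examples in this article.

```python
import json
from datetime import datetime, timezone

# Trimmed copy of the multi-metric sample response.
SAMPLE = """[{"id": "ce456b5e-c81a-4ea3-99f1-13805942d1d9",
  "publicIp": "123.123.123.123",
  "payload": [
    {"metric": "diskUtilization", "type": "percentage", "unit": "1",
     "values": [{"time": "2017-01-04T03:59:14.000Z", "value": "47.104115"}]},
    {"metric": "cpuUtilization", "type": "percentage", "unit": "1",
     "values": [{"time": "2017-01-04T03:59:14.000Z", "value": "7.545443"}]}
  ]}]"""


def flatten(nodes):
    """Yield one dict per measurement, with numeric values and parsed times."""
    for node in nodes:
        for item in node["payload"]:
            for point in item["values"]:
                yield {
                    "node": node["id"],
                    "metric": item["metric"],
                    "type": item["type"],
                    "unit": item["unit"],
                    # Timestamps in the examples are UTC with milliseconds.
                    "time": datetime.strptime(
                        point["time"], "%Y-%m-%dT%H:%M:%S.%fZ"
                    ).replace(tzinfo=timezone.utc),
                    # Values arrive as strings, so convert them explicitly.
                    "value": float(point["value"]),
                }


rows = list(flatten(json.loads(SAMPLE)))
for row in rows:
    print(row["metric"], row["value"], row["time"].isoformat())
```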

Unsuccessful calls will return the following responses, depending upon the issue:

  • 400 Bad Request: Returned when the expected node or cluster ID is not a valid UUID or an incorrect metric name has been supplied.
  • 401 Unauthorized: Returned when the username or API key is missing or incorrect.
  • 404 Not Found: Returned when accessing an incorrect URL or trying to access a cluster/node not owned by the authenticated user.
  • 429 Too Many Requests: Returned when your user account sends more than 70 requests per second.
  • 500 Server Error: All other errors

> GET /monitoring/v1/nodes/0aa675db-fe5a-4c54-80e7-e6be9dd60f35/badendpoint HTTP/1.1
> Authorization: Basic 12345678==
> User-Agent: curl/7.40.0
> Host: api.instaclustr.com
> Accept: */*
>
< HTTP/1.1 404 Not Found
< Server: nginx/1.9.4
< Date: Thu, 03 Sep 2015 02:10:57 GMT
< Content-Type: application/json
< Content-Length: 68
< Connection: keep-alive
< Set-Cookie: rememberMe=deleteMe; Path=/; Max-Age=0; Expires=Wed, 02-Sep-2015 02:10:57 GMT
<
* Connection #0 to host api.instaclustr.com left intact
{"name":"Endpoint not found","message":"Please check the URL path."}

Example: Error response
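A client can map the status codes listed above to an action. The following Python is a sketch only: the article does not prescribe a retry policy, so treating 429 and 500 as retryable and using exponential backoff are assumptions, as are the helper names describe_status and retry_delay.

```python
RETRYABLE = {429, 500}  # assumption: only these are worth retrying


def describe_status(status: int) -> str:
    """Map a status code from this section to a short explanation."""
    reasons = {
        200: "OK",
        400: "Bad Request: invalid UUID or metric name",
        401: "Unauthorized: missing or incorrect username or API key",
        404: "Not Found: wrong URL, or cluster/node not owned by this user",
        429: "Too Many Requests: rate limit exceeded",
    }
    # The article groups every other failure under 500 Server Error.
    return reasons.get(status, "Server Error or other failure")


def retry_delay(status: int, attempt: int, base: float = 1.0, cap: float = 30.0):
    """Return seconds to wait before retrying, or None if a retry will not help."""
    if status not in RETRYABLE:
        return None
    # Exponential backoff: base, 2*base, 4*base, ... capped at `cap` seconds.
    return min(cap, base * (2 ** attempt))


print(describe_status(404), retry_delay(404, 0))
print(describe_status(429), retry_delay(429, 2))
```

Errors such as 400, 401 and 404 indicate a problem with the request itself, so repeating the same request unchanged is unlikely to succeed.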

Reference

Nodes

General Metrics

Non-table metrics follow the format n::{metricName}.

Each metric type will contain the latest available measurement.

  • n::nodeStatus: Whether Cassandra is available on the node. Returns a "warn" value if no check-in has been logged in the last 30 seconds.
  • n::cpuUtilization: Current CPU utilization as a percentage of the total available. The maximum value is 100%, regardless of the number of cores on the node.
  • n::diskUtilization: Total disk space used by Cassandra, as a percentage of the total available.
  • n::cassandraReads: Reads per second by Cassandra. (Deprecated, please use n::reads)
  • n::cassandraWrites: Writes per second by Cassandra. (Deprecated, please use n::writes)
  • n::compactions: Number of pending compactions.
  • n::repairs: Number of active and pending repair tasks.
  • n::clientRequestRead: 95th percentile distribution and average latency per client read request (i.e. the period from when a node receives a client request, through gathering the records, to responding to the client).
  • n::clientRequestWrite: 95th percentile distribution and average latency per client write request (i.e. the period from when a node receives a client request to when it responds to the client).

Note: All deprecated metrics and endpoints will be removed in the future.
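Since the deprecated names are due to be removed, existing metric lists can be migrated with a simple lookup. This Python sketch uses only the two replacements named in the list above; the helper name modernise is an assumption.

```python
# Replacements named in the list above.
DEPRECATED = {
    "n::cassandraReads": "n::reads",
    "n::cassandraWrites": "n::writes",
}


def modernise(metrics):
    """Return the metric list with deprecated names replaced, order preserved."""
    return [DEPRECATED.get(name, name) for name in metrics]


print(modernise(["n::cpuUtilization", "n::cassandraReads", "n::cassandraWrites"]))
```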

Table Metrics

Table metric names follow the format cf::{keyspace}::{table}::{metricType}. Optionally, a 'sub-type' may be specified to return a specific part of the metric. For example,

cf::tk1::tcf1::readlatencydistribution

will return the various distributions of the read latency metric.

cf::tk1::tcf1::readlatencydistribution::50thPercentile

will only return the 50th percentile distribution of the read latency metric.

Each metric type will contain the latest available measurement.

  • cf::{keyspace}::{table}::readLatencyDistribution: Measurement of local read latency for the table, on the individual node. Available sub-types:
    • 50thPercentile: 50th percentile distribution of read latency
    • 75thPercentile: 75th percentile distribution of read latency
    • 95thPercentile: 95th percentile distribution of read latency
    • 99thPercentile: 99th percentile distribution of read latency
  • cf::{keyspace}::{table}::reads: General measurements of local reads of the table, on the individual node. Available sub-types:
    • latency_per_operation: Average local read latency per operation
    • count_per_second: Reads of the table per second, performed on the individual node
  • cf::{keyspace}::{table}::writeLatencyDistribution: Metrics for local write latency for the table, on the individual node. Available sub-types:
    • 50thPercentile: 50th percentile distribution of write latency
    • 75thPercentile: 75th percentile distribution of write latency
    • 95thPercentile: 95th percentile distribution of write latency
    • 99thPercentile: 99th percentile distribution of write latency
  • cf::{keyspace}::{table}::writes: General measurements of local writes to the table, on the individual node. Available sub-types:
    • latency_per_operation: Average local write latency per operation
    • count_per_second: Writes to the table per second, performed on the individual node
  • cf::{keyspace}::{table}::sstablesPerRead: SSTables accessed per read of the table on the individual node. Available sub-types:
    • average: Average SSTables accessed per read
    • max: Maximum SSTables accessed per read
  • cf::{keyspace}::{table}::tombstonesPerRead: Tombstoned cells accessed per read of the table on the individual node. Available sub-types:
    • average: Average tombstones accessed per read
    • max: Maximum tombstones accessed per read
  • cf::{keyspace}::{table}::liveCellsPerRead: Live cells accessed per read of the table on the individual node. Available sub-types:
    • average: Average live cells accessed per read
    • max: Maximum live cells accessed per read
  • cf::{keyspace}::{table}::diskUsed: Live and total disk used by the table. Available sub-types:
    • livediskspaceused: Disk used by live cells
    • totaldiskspaceused: Disk used by both live cells and tombstones
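The naming format and the sub-types listed above can be captured in a small helper that rejects unknown combinations before a request is sent. This is a sketch: table_metric is a hypothetical function, and it uses the casing from the list above (the inline examples in this article use lower case, and the article does not say whether names are case sensitive).

```python
PERCENTILES = ("50thPercentile", "75thPercentile", "95thPercentile", "99thPercentile")

# Metric types and their sub-types, as listed above.
SUB_TYPES = {
    "readLatencyDistribution": PERCENTILES,
    "writeLatencyDistribution": PERCENTILES,
    "reads": ("latency_per_operation", "count_per_second"),
    "writes": ("latency_per_operation", "count_per_second"),
    "sstablesPerRead": ("average", "max"),
    "tombstonesPerRead": ("average", "max"),
    "liveCellsPerRead": ("average", "max"),
    "diskUsed": ("livediskspaceused", "totaldiskspaceused"),
}


def table_metric(keyspace, table, metric_type, sub_type=None):
    """Build cf::{keyspace}::{table}::{metricType}[::{subType}]."""
    if metric_type not in SUB_TYPES:
        raise ValueError(f"unknown metric type: {metric_type}")
    parts = ["cf", keyspace, table, metric_type]
    if sub_type is not None:
        if sub_type not in SUB_TYPES[metric_type]:
            raise ValueError(f"unknown sub-type for {metric_type}: {sub_type}")
        parts.append(sub_type)
    return "::".join(parts)


print(table_metric("tk1", "tcf1", "readLatencyDistribution", "50thPercentile"))
```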

Listing Monitored Tables

A list of monitored tables, grouped by keyspace, can be generated by making a GET request to:

https://api.instaclustr.com/monitoring/v1/clusters/{cluster-id}/columnFamilies

The API will respond with the following packet:

{
 "keyspace1": [
 "standard1",
 "counter1",
 "Counter3"
 ],
 "keyspace2": [
 "table2",
 "table1"
 ]
}

Example: Response packet listing monitored tables
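The table listing can be turned into metric names for every monitored table. The sketch below uses the sample response above; metric_names and chunks are hypothetical helpers, and the batching reflects the 20-metric limit per request stated earlier in this article.

```python
import json

# Sample response from this section.
SAMPLE = """{"keyspace1": ["standard1", "counter1", "Counter3"],
             "keyspace2": ["table2", "table1"]}"""


def metric_names(tables_by_keyspace, metric_type):
    """Return one cf:: metric name per monitored table."""
    return [
        f"cf::{keyspace}::{table}::{metric_type}"
        for keyspace, tables in tables_by_keyspace.items()
        for table in tables
    ]


def chunks(items, size=20):
    """Split a list into batches no larger than the per-request metric limit."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


names = metric_names(json.loads(SAMPLE), "reads")
for batch in chunks(names):
    print(",".join(batch))
```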

Clusters

Requesting 'cluster' metrics returns the requested measurements for each provisioned node in the cluster and follows the same format as the 'nodes' endpoint. All node metrics are available for use.

For example, this request:

https://api.instaclustr.com/monitoring/v1/clusters/37af4800-5166-3d3c-cb9a-c9a4b960196e?metrics=n::cpuUtilization,cf::tk1::tcf1::reads::count_per_second

would return the following response packet:

[
    {
        "id": "694294d9-ea82-49c2-9f71-aacac81f0325",
        "payload": [
            {
                "metric": "cpuUtilization",
                "type": "percentage",
                "unit": "1",
                "values": [
                    {
                        "time": "2017-01-04T04:19:28.000Z",
                        "value": "7.639166"
                    }
                ]
            },
            {
                "metric": "reads",
                "type": "count_per_second",
                "unit": "1/s",
                "values": [
                    {
                        "time": "2017-01-04T04:19:28.000Z",
                        "value": "3.80952380952381"
                    }
                ]
            }
        ],
        "publicIp": "123.123.123.123",
        "privateIp": "10.0.0.1",
        "rack": {
            "name": "us-east-1c",
            "dataCentre": {
                "name": "US_EAST_1",
                "provider": "AWS_VPC",
                "customDCName": "AWS_VPC_US_EAST_1"
            },
            "providerAccount": {
                "name": "INSTACLUSTR",
                "provider": "AWS_VPC"
            }
        }
    },
    {
        "id": "4d848f48-5e24-41d6-81f2-44c2f578895f",
        "payload": [
            {
                "metric": "cpuUtilization",
                "type": "percentage",
                "unit": "1",
                "values": [
                    {
                        "time": "2017-01-04T04:19:30.000Z",
                        "value": "7.915636"
                    }
                ]
            },
            {
                "metric": "reads",
                "type": "count_per_second",
                "unit": "1/s",
                "values": [
                    {
                        "time": "2017-01-04T04:19:30.000Z",
                        "value": "5.571428571428571"
                    }
                ]
            }
        ],
        "publicIp": "123.123.123.124",
        "privateIp": "10.0.0.2",
        "rack": {
            "name": "us-east-1a",
            "dataCentre": {
                "name": "US_EAST_1",
                "provider": "AWS_VPC",
                "customDCName": "AWS_VPC_US_EAST_1"
            },
            "providerAccount": {
                "name": "INSTACLUSTR",
                "provider": "AWS_VPC"
            }
        }
    }
]
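Because a cluster request returns one entry per node, a common follow-up is to compare nodes. The Python below is a sketch on a trimmed copy of the response above; latest_by_node is a hypothetical helper, and keying the result by public IP is a choice made for this example.

```python
import json

# Trimmed copy of the two-node cluster response.
SAMPLE = """[
  {"id": "694294d9-ea82-49c2-9f71-aacac81f0325", "publicIp": "123.123.123.123",
   "payload": [{"metric": "cpuUtilization", "type": "percentage", "unit": "1",
     "values": [{"time": "2017-01-04T04:19:28.000Z", "value": "7.639166"}]}]},
  {"id": "4d848f48-5e24-41d6-81f2-44c2f578895f", "publicIp": "123.123.123.124",
   "payload": [{"metric": "cpuUtilization", "type": "percentage", "unit": "1",
     "values": [{"time": "2017-01-04T04:19:30.000Z", "value": "7.915636"}]}]}
]"""


def latest_by_node(nodes, metric):
    """Map each node's public IP to its most recent value for `metric`."""
    result = {}
    for node in nodes:
        for item in node["payload"]:
            if item["metric"] == metric and item["values"]:
                # Timestamps share one fixed UTC format, so comparing the
                # strings orders them chronologically.
                latest = max(item["values"], key=lambda point: point["time"])
                result[node["publicIp"]] = float(latest["value"])
    return result


cpu = latest_by_node(json.loads(SAMPLE), "cpuUtilization")
busiest = max(cpu, key=cpu.get)
print(busiest, cpu[busiest])
```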

 

 
