Indexer Clustering Flashcards

1
Q

What is the reason for an automatic detention?

A

The peer's free disk space runs low (default threshold: less than 5 GB)

2
Q

Which features are disabled in a manual detention?

A
  • Indexing of external data (internal data is still indexed)
  • Data replication
  • Inputs can optionally be blocked (not HEC)
  • Not disabled: the peer continues to participate in searches
3
Q

Which features are enabled in automatic detention?

A

None. All indexing features are completely stopped.

4
Q

Does automatic detention mode recover by itself?

A

Yes, as soon as there is sufficient disk space again (default: more than 5 GB)

5
Q

What is a possible scenario for manual detention?

A
  • Shift incoming data from a forwarder to another indexer
  • To partially decommission an old indexer (but still use it for searches for existing data)
  • Troubleshooting
6
Q

When should the maintenance mode be activated?

A

Only for maintenance work (e.g. upgrades, switching from single site to multisite)

7
Q

What happens if a cluster is in maintenance mode?

A

It prevents the cluster from running bucket fixup tasks and from rolling buckets

8
Q

Which command puts the cluster into maintenance mode?

A

./splunk enable maintenance-mode

9
Q

What is the difference between ./splunk stop and ./splunk offline in a clustered environment?

A

./splunk stop is not recommended in a clustered environment. With ./splunk offline, the CM first re-assigns primary markers or copies buckets to other peers so that the cluster stays at least valid. Once the CM finishes this task, the peer goes offline. The CM then waits 60 seconds for the peer to come back (the wait can be extended). If the peer does not come back, the CM starts bucket fixup activities to return the cluster to a complete state.

10
Q

Which command decommissions a peer permanently?

A

./splunk offline --enforce-counts

11
Q

What types of rebalancing does the cluster support?

A

Primary and data rebalancing

12
Q

What does primary rebalancing mean?

A

It means that the CM distributes the primary markers evenly, so that each peer holds approximately the same number of primary copies. It does not copy or move buckets; it only re-assigns the markers among the copies that already exist. Because of this limitation, the distribution of primaries will never be perfect. It happens automatically at the end of a rolling restart or when a peer joins or re-joins the cluster. It can also be triggered through a REST call.
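
The marker-only idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the CM's actual algorithm: the function name `rebalance_primaries`, the greedy least-loaded strategy, and the peer and bucket names are all hypothetical.

```python
from collections import Counter


def rebalance_primaries(copies):
    """Pick one primary per bucket without moving any copy.

    copies maps a bucket id to the peers that already hold a searchable copy.
    Hypothetical greedy strategy: each bucket gets the least-loaded peer
    among the peers that already have it.
    """
    load = Counter()
    primaries = {}
    # Buckets with the fewest candidate peers are placed first.
    for bucket_id in sorted(copies, key=lambda b: (len(copies[b]), b)):
        peer = min(copies[bucket_id], key=lambda p: (load[p], p))
        primaries[bucket_id] = peer
        load[peer] += 1
    return primaries


copies = {"b1": ["idx1", "idx2"], "b2": ["idx1"], "b3": ["idx1"], "b4": ["idx1"]}
primaries = rebalance_primaries(copies)
```

Here idx1 ends up with three primaries and idx2 with one. The result is uneven because only the markers move, never the copies.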

13
Q

What does data rebalancing mean?

A

Data rebalancing distributes the stored data evenly across all peers. It balances primary, searchable, and non-searchable bucket copies so that each indexer holds approximately the same number of buckets. Unlike primary rebalancing, it does move buckets from one peer to another. The rebalance can be started through the GUI, the CLI, or REST. You can set an attribute to make sure that searching is still possible during the rebalance (search-safe feature). An imbalance usually occurs when a new peer joins the cluster or when a forwarder does not distribute its data properly. Best practice is to run a ‘remove excess buckets’ task before rebalancing, so that the process is efficient.

Command to perform a data rebalancing:
splunk rebalance cluster-data -action start [-searchable true] [-index index_name] [-max_runtime interval_in_minutes]

14
Q

What is the idea behind the ‘remove excess buckets’ feature?

A

Let's assume a peer goes offline for a longer time, or a network outage prevents one indexer from connecting properly. The CM notices this because the peer no longer sends a heartbeat. The CM starts fixup activities to get back to a valid/complete state, which means it re-assigns primary markers and may copy buckets. Now assume the indexer comes back. The picture looks different, because other peers have taken over the data that the missing peer held. That means the cluster now holds more copies than the policy requires. This has no negative impact on the cluster itself, but it consumes storage. These excess bucket copies can be removed through this feature.
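
A minimal Python sketch of how excess copies could be counted, assuming a simple mapping of bucket ids to the peers that hold a copy. The function name `excess_copies` and the example data are hypothetical, and the real feature also takes the search factor into account.

```python
def excess_copies(copies, replication_factor):
    """Return how many copies each bucket has beyond the replication factor.

    copies maps a bucket id to the peers holding a copy of that bucket.
    Buckets that meet the policy exactly (or fall short) are left out.
    """
    return {
        bucket_id: len(peers) - replication_factor
        for bucket_id, peers in copies.items()
        if len(peers) > replication_factor
    }


# idx3 was offline, the CM made a new copy of b1 on idx2, then idx3 came back.
copies = {"b1": ["idx1", "idx2", "idx3"], "b2": ["idx1", "idx2"]}
print(excess_copies(copies, replication_factor=2))
```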

15
Q

List the migration procedure from a single site cluster to a multi site cluster

A

1) Breathe
2) Make sure that the new servers meet the system requirements (CPU, RAM, storage, IOPS etc.)
3) Install all the new servers with Splunk Enterprise (same version)
4) Configure the CM with the new multi site configuration (do not delete single site policy)
5) Put the CM into maintenance mode
6) Configure all other new instances (SHs and IDXs) for multi-site (do not delete the single site config)
7) Disable maintenance mode, check log for errors and CM dashboard
8) If required, configure Forwarders for indexer discovery and site awareness

16
Q

What does the configuration look like if you want to set up a CM initially for a single-site environment with a replication factor of 2 and a search factor of 2? (The search factor cannot exceed the replication factor.)

A

./splunk edit cluster-config -mode master -replication_factor 2 -search_factor 2 -secret mysecret

17
Q

What does the configuration look like if you want to set up an indexer initially for a single-site environment with the replication port set to 9887 and a secret ‘mysecret’?

A

./splunk edit cluster-config -mode slave -master_uri https://cm:8089 -replication_port 9887 -secret mysecret

18
Q

What does the configuration look like if you want to set up a search head initially to join the single-site environment?

A

./splunk edit cluster-config -mode searchhead -master_uri https://cm:8089 -secret mysecret

19
Q

What about a CM failover scenario? How could that be implemented?

A

A hot standby is not recommended. Instead, a cold standby (active/passive) works properly. A DNS-based failover is one possible scenario, where the failover system has a clone of /opt/splunk/etc/master-apps and of the [clustering] stanza from server.conf (be careful with hashed passwords).

20
Q

When does the CM block indexing so that it needs to be unblocked with the ./splunk set indexing-ready command?

A

When a site goes down and you later need to restart the master for any reason, or when the site with the master goes down and you bring up a standby master on another site. The reason for the blocking is that the CM needs to make sure the replication factor can be fulfilled. This does not occur while the CM is already running; the CM only checks after a restart.

21
Q

Where does the configuration bundle live on a CM in a clustered environment?

A

$SPLUNK_HOME/var/run/splunk/cluster/remote-bundle

22
Q

What is the upgrade order in a clustered environment?

A
  1. (MC)
  2. CM (after upgrading CM, put it into maintenance mode to avoid bucket replication)
  3. SH
  4. IDX
  5. If required, FWD too
23
Q

How does replication work for indexes with accelerated data models or report acceleration?

A

By default, the acceleration summaries of those indexes are not replicated. It is recommended to activate summary replication on the CM through:

[clustering]
summary_replication = true

24
Q

Explain the different naming scheme of buckets in a clustered environment

A

Buckets have a different naming scheme in a clustered environment.

Buckets beginning with db_XXXX_GUID are original buckets, held by the originating peer.

Buckets beginning with rb_XXXX_GUID are replica buckets.

The GUID is always the GUID of the originating peer.
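
The naming scheme can be illustrated with a small Python parser. This is a sketch that assumes the XXXX part expands to newest time, oldest time, and local bucket id, in that order; the card itself only states db_XXXX_GUID. The class and function names and the example GUID are hypothetical.

```python
import re
from dataclasses import dataclass

# Assumed layout: <prefix>_<newest_time>_<oldest_time>_<local_id>_<origin_guid>
BUCKET_NAME = re.compile(r"^(db|rb)_(\d+)_(\d+)_(\d+)_([0-9A-Fa-f-]+)$")


@dataclass(frozen=True)
class ClusteredBucket:
    is_replica: bool   # True for rb_ (replica), False for db_ (originator)
    newest_time: int
    oldest_time: int
    local_id: int
    origin_guid: str   # always the GUID of the originating peer


def parse_bucket_name(name: str) -> ClusteredBucket:
    match = BUCKET_NAME.match(name)
    if match is None:
        raise ValueError(f"not a clustered bucket directory name: {name!r}")
    prefix, newest, oldest, local_id, guid = match.groups()
    return ClusteredBucket(prefix == "rb", int(newest), int(oldest), int(local_id), guid)


origin = parse_bucket_name("db_1700000500_1700000000_12_A1B2C3D4-0000-4000-8000-1234567890AB")
replica = parse_bucket_name("rb_1700000500_1700000000_12_A1B2C3D4-0000-4000-8000-1234567890AB")
```

Both names carry the same GUID, so a replica can always be traced back to the peer that created the bucket.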

25
Q

What happens if the CM goes offline?

A

The cluster still operates, but new data may not replicate.

26
Q

What does site0 mean?

A

It's also known as the ‘magic site’. Usually a search head is configured with the site in which it participates. If the site is set to site0, the search head no longer participates in any site; it searches globally, and search affinity is disabled.

27
Q

How does the CM/peers communicate?

A

All communication in a cluster goes through REST endpoints (heartbeat, bucket status, generation etc.)

28
Q

Does the indexer discovery stanza on the CM require the same pass4SymmKey as the clustering stanza?

A

No, those are different passwords. It is a different feature and the same password would introduce a security risk of breaking into the cluster.

29
Q

At what interval does the SH poll the CM for the current list of indexers (generation)?

A

The default is 5s

30
Q

What is the default interval for the heartbeat sent from an indexer to the CM?

A

1s

31
Q

What is the default heartbeat timeout set on the CM?

A

60s
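
How the 1s heartbeat and the 60s timeout relate can be sketched in Python. This is a simplified illustration, not the CM's implementation; the function name `peer_is_down` and its parameters are hypothetical, with only the 60s default taken from the card.

```python
def peer_is_down(last_heartbeat: float, now: float, heartbeat_timeout: float = 60.0) -> bool:
    """Treat a peer as down once no heartbeat has arrived within the timeout.

    Both timestamps are in seconds. With a 1s heartbeat period, roughly
    60 consecutive heartbeats must be missed before this returns True.
    """
    return (now - last_heartbeat) > heartbeat_timeout


print(peer_is_down(last_heartbeat=1000.0, now=1030.0))  # 30s of silence
print(peer_is_down(last_heartbeat=1000.0, now=1061.0))  # 61s of silence
```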

32
Q

How should the service_interval in the clustering stanza be tuned?

A

Raise it to one second for every 50k buckets (total)
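
As a worked example of this rule of thumb, a Python helper might look as follows. It is a sketch only: the function name is hypothetical, and rounding up with a floor of one second is an assumption, not something the card states.

```python
import math


def suggested_service_interval(total_buckets: int, buckets_per_second: int = 50_000) -> int:
    """One second of service_interval per 50k buckets in the cluster.

    Assumption: partial steps are rounded up and the value never drops below 1.
    """
    return max(1, math.ceil(total_buckets / buckets_per_second))


print(suggested_service_interval(120_000))
```

For a cluster with 120,000 buckets this suggests a service_interval of 3 seconds.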

33
Q

When should the heartbeat_period be tuned?

A

Tune this if you have lots of indexers (> 50) or lots of buckets (> 100k)

34
Q

Which bucket type behaves differently in terms of notifying the CM when it rolls?

A

The frozen bucket.

A roll from hot to warm and a roll from warm to cold both notify the CM. The CM replies with a list of streaming targets for replication. This list is chosen at random from the available indexers (generation) and contains enough targets to meet the replication policy.

For a frozen bucket, the peer notifies the CM, but the CM does not replicate the change of state. The assumption is that the other indexers will freeze their copies themselves relatively soon. The only caveat is the case where the frozen copy was the primary and one of the remaining copies is searchable; in that case, an existing searchable copy can be flagged as primary.

35
Q

How does the CM recognize primary and non-primary buckets?

A

Each bucket copy has a flag:

0x000000000 and 0xffffffff for non-primary and primary, respectively

The flags can be seen with the REST search command: | rest /services/cluster/master/buckets

36
Q

Is it possible for buckets to have the same ID in a clustered environment? If yes, how does the CM make sure not to mix up the data?

A

Buckets can have the same ID. However, the CM never uses the ID alone as a reference; it uses the combination of ID and originator GUID, which is unique.
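
The composite key can be sketched in Python. The data layout is an assumption made for illustration: the function name `track_copies`, the report tuples, and the peer and GUID values are hypothetical.

```python
from collections import defaultdict


def track_copies(reports):
    """Group reported bucket copies by (local id, originator GUID).

    reports is an iterable of (peer, local_id, origin_guid) tuples.
    Keying on the id alone would merge unrelated buckets that
    happen to share the same local id.
    """
    copies = defaultdict(set)
    for peer, local_id, origin_guid in reports:
        copies[(local_id, origin_guid)].add(peer)
    return dict(copies)


reports = [
    ("idx1", 12, "GUID-A"),  # bucket 12 created on peer A
    ("idx2", 12, "GUID-A"),  # replica of that bucket
    ("idx2", 12, "GUID-B"),  # a different bucket 12, created on peer B
]
print(track_copies(reports))
```

Two distinct entries come out, one per originator, even though all three copies share the local id 12.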

37
Q

What can you read out of the following search results? “index=_internal source=splunkd.log component=CMMaster event=addBucket”

A

If, where and when a new hot bucket was minted.

38
Q

How to recover the pass4SymmKey?

A

Splunk 7.2.2+

/opt/splunk/bin/splunk show-decrypted --value $hash

39
Q

Do buckets in a multisite cluster look different compared to single-site buckets? If yes, what is the difference?

A

They look very similar, with one difference: multisite buckets contain a site marker (written in the journal.gz)

40
Q

If a search is performed, which buckets will be searched?

A

Only bucket copies that have the primary flag

41
Q

What will happen if you change the RF/SF in a clustered environment?

A

It will result in a fixup activity.
Raising RF = “lots of replica copies gone missing”

Raising SF = “lots of searchable copies gone missing”

Reducing RF / SF = “copies have been made redundant”

Important: Always re-calculate the required disk storage when you change the replication policy
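
A rough Python sketch of such a re-calculation follows. It assumes the common rule of thumb that compressed rawdata takes about 15% and index files about 35% of the raw ingest volume; those ratios, the function name, and the example numbers are assumptions and should be replaced with values measured in your own environment.

```python
def estimated_cluster_storage_gb(daily_ingest_gb, retention_days,
                                 replication_factor, search_factor,
                                 rawdata_ratio=0.15, index_ratio=0.35):
    """Estimate total cluster storage for a given RF/SF.

    Every copy stores rawdata (RF copies); only searchable copies
    also store index files (SF copies). Ratios are rule-of-thumb values.
    """
    if search_factor > replication_factor:
        raise ValueError("search factor cannot exceed replication factor")
    per_day = daily_ingest_gb * (rawdata_ratio * replication_factor
                                 + index_ratio * search_factor)
    return per_day * retention_days


before = estimated_cluster_storage_gb(100, 90, replication_factor=2, search_factor=2)
after = estimated_cluster_storage_gb(100, 90, replication_factor=3, search_factor=2)
print(round(before), round(after))
```

Under these assumptions, raising RF from 2 to 3 at 100 GB/day and 90 days retention adds roughly 1,350 GB across the cluster.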

42
Q

Can the internal Splunk replication task in a clustered environment be tuned?

A

Yes through the server.conf on the CM:
max_peer_rep_load
max_peer_build_load

43
Q

Can a search head search across different indexer clusters?

A

Yes, it is possible. In the clustering stanza on the search head you need to add another configuration entry for each additional cluster (pointing to that cluster's CM)

44
Q

What should be the maximum network latency between cluster peers?

A

<100ms.

Network latency should not exceed 100 milliseconds. Higher latencies can significantly slow indexing performance and hinder recovery from cluster node failures.