Linux · SQL Server Tips

SQL Server Failover Clustering on Ubuntu 16.04

It’s been a while since my last post on this blog, but now I am back with a new post about how to set up SQL Server Failover Clustering on Ubuntu 16.04. You can follow this tutorial in whatever environment you have, but for the record, I did all of it in my virtual environment using Hyper-V on Windows 10 Professional.

In this post I configured 2 virtual machines running Ubuntu 16.04 with the following hardware configuration:

  • 2 processor cores
  • 4 GB of memory
  • 2 hard disks (60 GB for the system & 8 GB for the block device)
  • 2 public network interfaces, which will be bonded for redundancy
  • 2 private network interfaces for the heartbeat, which will also be bonded
  • DVD drive for the installation media (loaded from the Ubuntu 16.04 ISO file)

I am using the same hardware configuration for both servers, and my environment will look like the following figure:

[Figure: failover cluster diagram]

I assume you already have Ubuntu Server 16.04 installed, but if not, you can check my other blog post here on how to install Ubuntu Server 16.04 on your server.

Here is the content of this post:

  • Preparation
  • Network Configuration
  • Disk Configuration
  • DRBD Installation & Configuration
  • SQL Server Installation & SQL Server Tools Installation
  • Configure SQL Server Storage
  • Pacemaker & Corosync
  • Configure Pacemaker to Add Cluster Resource

[Preparation]

I’ve named the two servers lnxsql01 and lnxsql02, and I will install NTP on both nodes to ensure both servers have the same time.

$ sudo apt-get install -y ntp
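For a quick sanity check, ntpq (installed along with the ntp package) can show whether the daemon is actually talking to its time sources:

$ ntpq -p

A peer marked with an asterisk (*) is the source the server is currently synchronized to.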

Stop and disable AppArmor on both servers to ensure that the required services will start successfully.

$ sudo invoke-rc.d apparmor stop
$ sudo update-rc.d -f apparmor remove
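To double-check, you can ask AppArmor which profiles are still loaded (apparmor_status, also available as aa-status, ships with the apparmor package on Ubuntu); any profiles that are still in the kernel will no longer be loaded after the next reboot since the service has been removed from startup.

$ sudo apparmor_status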

 

[Network Configuration]
Do all of the following steps on both nodes (lnxsql01 & lnxsql02).

Install ifenslave

$ sudo apt-get install ifenslave

Set up the interface bonding options by creating a file in the /etc/modprobe.d/ directory

$ sudo vi /etc/modprobe.d/bonding.conf

Append the following lines to that file

alias bond0 bonding
alias bond1 bonding
options bonding mode=active-backup miimon=100 max_bonds=2 primary=eth0

Set up the bonding parameters in the network interface configuration

$ sudo vi /etc/network/interfaces

Modify your network interface configuration file as follows:

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
auto bond0 
iface bond0 inet static
    address 172.16.10.11
    netmask 255.255.255.0
    gateway 172.16.10.1
    network 172.16.10.0
    broadcast 172.16.10.255
    post-up ifenslave bond0 eth0 eth1
    pre-down ifenslave -d bond0 eth0 eth1

# The secondary network interface
auto bond1
iface bond1 inet static
    address 10.0.0.11
    netmask 255.255.255.0
    network 10.0.0.0
    broadcast 10.0.0.255
    post-up ifenslave bond1 eth2 eth3
    pre-down ifenslave -d bond1 eth2 eth3

You may need to change the IP address, netmask, broadcast, and network values depending on your network environment. Then ensure that the bonding module will be loaded at boot by checking the /etc/modules file

# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.

loop
lp
bonding

Reboot both servers and verify that the network bonding configuration works by running the following commands on both nodes:

$ cat /proc/net/bonding/bond0
$ cat /proc/net/bonding/bond1

You should get output similar to the following

Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: eth0 (primary_reselect always)
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr: 00:15:5d:40:11:00
Slave queue ID: 0

Slave Interface: eth1
MII Status: up
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr: 00:15:5d:40:11:01
Slave queue ID: 0

Modify the /etc/hosts file on both nodes:

127.0.0.1  localhost
172.16.10.11    lnxsql01.yohja.local
172.16.10.12    lnxsql02.yohja.local
10.0.0.11   lnxsql01-private
10.0.0.12   lnxsql02-private

Verify that all of your network configuration is working: try to ping each IP address and host name and check the result.
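For example, from lnxsql01 you could run something like the following (the host names assume the /etc/hosts entries above):

$ ping -c 3 lnxsql02.yohja.local
$ ping -c 3 lnxsql02-private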

[Disk Configuration]

As I mentioned before, I have 2 hard disks on each server node. Now we will configure the second hard disk, which will be mirrored between both nodes using DRBD.

$ sudo fdisk /dev/sdb
n – add a new partition
p – primary partition
1 – partition number
Enter – accept the default first sector
Enter – accept the default last sector (use the whole disk)
w – write the changes and exit
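To confirm the new partition exists on both nodes, lsblk (or fdisk -l) should now show /dev/sdb1 covering the 8 GB disk:

$ lsblk /dev/sdb
$ sudo fdisk -l /dev/sdb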

 

[DRBD Installation & Configuration]

Install DRBD utility on both server nodes (lnxsql01 & lnxsql02)

$ sudo apt-get update
$ sudo apt-get install -y drbd8-utils

After the installation has finished, we need to modify /etc/drbd.conf or simply add another file inside the /etc/drbd.d/ directory to store our DRBD configuration

$ sudo vi /etc/drbd.d/mssql.res

We should use .res as the file extension, since by default drbd.conf loads all .res files in this directory

resource mssql {
        device minor 0;
        disk /dev/sdb1;
        syncer {
                rate 150M;
                verify-alg md5;
        }
        on lnxsql01 {
                address 10.0.0.11:7788;
                meta-disk internal;
        }
        on lnxsql02 {
                address 10.0.0.12:7788;
                meta-disk internal;
        }
}

Initialize the metadata on both nodes and bring up the resource we’ve just created

$ sudo drbdadm create-md mssql
$ sudo drbdadm up mssql
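Before promoting a node, you can check that both sides see each other. drbdadm (from drbd8-utils) can report the connection and disk state; at this point it will typically show Connected and Inconsistent/Inconsistent, since the initial sync has not run yet:

$ sudo drbdadm cstate mssql
$ sudo drbdadm dstate mssql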

We need to make lnxsql01 our primary node by running the following command on the first node only

$ sudo drbdadm -- --overwrite-data-of-peer primary mssql

Review the DRBD status by running the following command on either lnxsql01 or lnxsql02

$ watch cat /proc/drbd

Wait a while until you see output like this

Every 2.0s: cat /proc/drbd

version: 8.4.5 (api:1/proto:86-101)
srcversion: 37DD7526B304711A1B9AA5A
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:8731520 nr:0 dw:8731888 dr:8444401 al:72 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

Format the DRBD device with the ext4 file system; later we will mount this partition to store the SQL Server data files

$ sudo mkfs.ext4 /dev/drbd0

Remove DRBD from startup, since Pacemaker will control DRBD to fail over between both nodes

$ sudo update-rc.d -f drbd remove

 

[SQL Server Installation & SQL Server Tools Installation]

Click these links to learn how to install SQL Server and the SQL Server tools on Ubuntu 16.04, and do it on both nodes.

[Configure SQL Server Storage]

Stop the SQL Server service on both nodes and remove it from startup

$ sudo systemctl stop mssql-server
$ sudo systemctl disable mssql-server

Install the SQL Server high availability package

$ sudo apt-get update
$ sudo apt-get install -y mssql-server-ha

On lnxsql01, relocate the default SQL Server data to a temporary location

$ su mssql
$ mkdir /var/opt/mssql/tmp
$ cp /var/opt/mssql/data/* /var/opt/mssql/tmp
$ rm /var/opt/mssql/data/*
$ exit

We also need to remove all default SQL Server data on lnxsql02

$ sudo rm /var/opt/mssql/data/*

Mount the DRBD partition on the default SQL Server data directory, give it the proper permissions, and move the default data back from the temporary location

$ sudo mount /dev/drbd0 /var/opt/mssql/data 
$ sudo chown mssql:mssql /var/opt/mssql/data
$ su mssql
$ cp /var/opt/mssql/tmp/* /var/opt/mssql/data
$ exit
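It’s worth confirming that the system databases are back in place on the DRBD-backed mount and still owned by the mssql user before starting the service; a simple listing is enough:

$ ls -l /var/opt/mssql/data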

Validate that the SQL Server service starts successfully with the new file path by running the following commands on lnxsql01

$ sudo systemctl start mssql-server
$ sudo systemctl status mssql-server
$ sudo systemctl stop mssql-server

Create a SQL Server login for Pacemaker on lnxsql01 and grant that login permission to run sp_server_diagnostics

$ sudo systemctl start mssql-server
$ sqlcmd -S <servername> -U <username> -P '<password>'

> USE [master]
> GO
> CREATE LOGIN [loginname] WITH PASSWORD = N'<loginpassword>'
> GRANT VIEW SERVER STATE TO [loginname]
> GO
> exit

$ sudo systemctl stop mssql-server

On both nodes (lnxsql01 and lnxsql02), create a file to store the SQL Server login information we just created for Pacemaker:

$ su mssql
$ touch /var/opt/mssql/secrets/passwd
$ echo '<loginname>' >> /var/opt/mssql/secrets/passwd
$ echo '<loginpassword>' >> /var/opt/mssql/secrets/passwd
$ exit
$ sudo chown root:root /var/opt/mssql/secrets/passwd
$ sudo chmod 0600 /var/opt/mssql/secrets/passwd
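A quick check that the file ended up with the expected content and permissions (two lines, readable by root only):

$ sudo ls -l /var/opt/mssql/secrets/passwd
$ sudo wc -l /var/opt/mssql/secrets/passwd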

 

[Pacemaker & Corosync]

Install pacemaker and corosync on both nodes

$ sudo apt-get update && sudo apt-get install -y pacemaker

By default, the above command will also install corosync as a pacemaker dependency. After pacemaker and corosync are successfully installed, we need to update the corosync configuration on both nodes

$ sudo vi /etc/corosync/corosync.conf
totem {
  version: 2
  cluster_name: mssqlserver
  transport: udpu
  interface {
    ringnumber: 0
    bindnetaddr: server_private_ip_address
    broadcast: yes
    mcastport: 5405
  }
}

quorum {
  provider: corosync_votequorum
  two_node: 1
}

nodelist {
  node {
    ring0_addr: primary_private_ip_address
    name: lnxsql01
    nodeid: 1
  }
  node {
    ring0_addr: secondary_private_ip_address
    name: lnxsql02
    nodeid: 2
  }
}

logging {
  to_logfile: yes
  logfile: /var/log/corosync/corosync.log
  to_syslog: yes
  timestamp: on
}

Generate the corosync authentication key on lnxsql01 and copy it to lnxsql02 so that both nodes have the same authentication key

$ sudo apt-get install haveged
$ sudo corosync-keygen
$ sudo scp /etc/corosync/authkey yohja@lnxsql02:~
$ sudo apt-get remove --purge haveged

On lnxsql02, move the authentication key to the corosync configuration directory and assign the proper permissions

$ sudo mv authkey /etc/corosync/
$ sudo chown root:root /etc/corosync/authkey
$ sudo chmod 0400 /etc/corosync/authkey

Modify the /etc/default/corosync file and append the following line to the end of the file on both nodes

START=yes

Start pacemaker and corosync on both nodes

$ sudo systemctl start pacemaker
$ sudo systemctl start corosync
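If you also want both services to come up automatically after a reboot, you can enable them with systemd (this is optional; the rest of this post only assumes they are running):

$ sudo systemctl enable corosync
$ sudo systemctl enable pacemaker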

Wait a while and verify that both nodes have been clustered together by running the following command

$ sudo corosync-cmapctl | grep members
runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(10.0.0.11)
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(10.0.0.12)
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 2
runtime.totem.pg.mrp.srp.members.2.status (str) = joined

You will notice the private IP addresses of both nodes listed in the above output. You can also run the following command to check the cluster status

$ sudo crm status

 

[Configure Pacemaker to Add Cluster Resource]

Run the following commands on either lnxsql01 or lnxsql02

$ sudo crm configure
# property stonith-enabled=false
# property no-quorum-policy=ignore
# property cluster-name=mssqlserver

# primitive res_drbd_mssql ocf:linbit:drbd params drbd_resource=mssql op start interval=0 timeout=240 op stop interval=0 timeout=120
# ms ms_drbd_mssql res_drbd_mssql meta notify=true master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
# primitive res_fs_mssql ocf:heartbeat:Filesystem params device="/dev/drbd0" directory="/var/opt/mssql/data" fstype="ext4" op start interval=0 timeout=60 op stop interval=0 timeout=120

# primitive res_mssqlsrv ocf:mssql:fci op monitor interval=30 timeout=60 op start interval=0 timeout=60 op stop interval=0 timeout=60

# primitive mssql_vip ocf:heartbeat:IPaddr2 params ip="172.16.10.15" iflabel="mssqlvip" op monitor interval=5
# group mssqlserver res_fs_mssql res_mssqlsrv mssql_vip 
# colocation col_mssql_drbd inf: mssqlserver ms_drbd_mssql:Master
# order ord_mssql inf: ms_drbd_mssql:promote mssqlserver:start
# commit
# exit
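Before checking the resource status, you can also dump the full cluster configuration to make sure everything was committed the way you typed it:

$ sudo crm configure show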

Verify that all resources were added to the cluster

$ sudo crm status

If everything is fine, crm status will show all of the resources started on one node.

You can try to fail over all resources from lnxsql01 to lnxsql02 by shutting down lnxsql01, or just bring this node into standby mode by running the following command

$ sudo crm node standby lnxsql01

Verify that everything failed over successfully by running

$ sudo crm status
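Once you are done testing, you can bring lnxsql01 back out of standby; whether the resources move back to it depends on your resource-stickiness settings:

$ sudo crm node online lnxsql01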

And you can try to connect to SQL Server from SQL Server Management Studio or from SQLCMD, connecting to the virtual IP address we configured before, to verify that the whole configuration works.
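For example, from any machine that can reach the public network, something like the following should connect through the virtual IP (the sa login and its password here are just placeholders; use whatever SQL Server login you set up during installation):

$ sqlcmd -S 172.16.10.15 -U sa -P '<sapassword>'
> SELECT @@SERVERNAME;
> GO
> exit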

Hope this helps!!! 🙂

 
