NOTE: Now on www.mass-storage.org
I have a running two-node cluster built on {DRBD 8.2.4 (P/P) + OCFSv2}. More info here.
It is kind of nice for small workloads (think load-balanced web servers, file servers, SQL servers; careful: Oracle is OK, MySQL needs specific configuration for external locking), but a bit limited as far as scalability goes.
Removing the storage aspect from application servers is the way to go. This is what SANs are for. Let's modify my two-node cluster (ruby and crystal) to allow dynamic growth in terms of application and storage nodes.
For this test, I'll be bringing in a third and a fourth system: "jade" & "glouton", two Debian-based file servers.
The setup will be as follows:
(jade & glouton): SAN targets, each exporting a device through AoE
(ruby & crystal): SAN initiators + application servers
Lexical info: an initiator is a SAN client, whereas targets are servers.
Exporting through AoE
(glouton&jade)# apt-get install aoetools vblade
(glouton)# vblade 0 1 eth0 /dev/sdb1
(jade)# vblade 1 1 eth0 /dev/sdb1
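For reference, the two numbers passed to vblade are the AoE shelf and slot, and an initiator later sees each export as /dev/etherd/e<shelf>.<slot>. A minimal sketch of that mapping (the helper name aoe_dev_path is mine, not part of aoetools):

```shell
#!/bin/sh
# Hypothetical helper: map an AoE shelf/slot pair to the device node
# that the aoe driver creates on the initiator side.
aoe_dev_path() {
    shelf=$1
    slot=$2
    printf '/dev/etherd/e%s.%s\n' "$shelf" "$slot"
}

aoe_dev_path 0 1   # export from glouton -> /dev/etherd/e0.1
aoe_dev_path 1 1   # export from jade    -> /dev/etherd/e1.1
```

Giving each target its own shelf number is what keeps the two exports from colliding on the initiators.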
Note 1: My current setup forces me to use the above configuration. In a true production environment, dual NICs would be preferred (using the Linux bonding module) and the exported device would be an MD array. There is also a lot of fine-tuning that can be done along the way (jumbo frames, multipath algorithm, scheduling algorithm, kernel hacking …)
Note 2: I would advise against passing a fixed list of MAC addresses to the vblade export command. The option is present, but the list is then static. Using ebtables seems to be a valid alternative, since its rules can be modified dynamically.
Importing through AoE
(ruby&crystal)# apt-get install aoetools
(ruby&crystal)# modprobe aoe
If the devices are already exported (from jade & glouton), they will automatically be available in /dev/etherd; otherwise, run "aoe-discover".
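If you want to script this step, something like the following could wait for both exports before going any further. It is only a sketch: the function name, the ETHERD_DIR variable and the one-second retry loop are my own choices, and only aoe-discover comes from aoetools.

```shell
#!/bin/sh
# Sketch: wait until the expected AoE devices show up on an initiator.
# ETHERD_DIR and the retry handling are illustrative, not part of aoetools.
ETHERD_DIR=${ETHERD_DIR:-/dev/etherd}

wait_for_aoe() {
    # usage: wait_for_aoe <retries> <device> [<device> ...]
    retries=$1
    shift
    while [ "$retries" -gt 0 ]; do
        missing=0
        for dev in "$@"; do
            [ -e "$ETHERD_DIR/$dev" ] || missing=1
        done
        if [ "$missing" -eq 0 ]; then
            return 0
        fi
        # Ask the aoe driver to probe the network again, if the tool is installed.
        if command -v aoe-discover >/dev/null 2>&1; then
            aoe-discover
        fi
        retries=$((retries - 1))
        sleep 1
    done
    return 1
}

# Example: wait_for_aoe 10 e0.1 e1.1 && echo "both exports are visible"
```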
Creating MD device for redundancy.
(ruby&crystal)# apt-get install mdadm
(ruby)# mdadm --create /dev/md0 -l1 -n2 /dev/etherd/e0.1 /dev/etherd/e1.1
(crystal)# mdadm --assemble /dev/md0 /dev/etherd/e0.1 /dev/etherd/e1.1
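Before putting a file system on top, it is worth checking that both halves of the mirror are present on each node. The sketch below parses the status map in /proc/mdstat ([UU] means both members are up, an underscore marks a missing one); the function name md_healthy is hypothetical, and the check says nothing about whether a resync is still running.

```shell
#!/bin/sh
# Sketch: succeed only if the given md array lists no missing member.
md_healthy() {
    # usage: md_healthy <array name> [<mdstat file>]
    awk -v md="$1" '
        $1 == md && $2 == ":" { in_array = 1; next }
        in_array && NF == 0   { in_array = 0 }
        in_array && match($0, /\[[U_]+\]/) {
            # An underscore in the status map marks a missing member.
            status = substr($0, RSTART, RLENGTH)
            if (status !~ /_/) ok = 1
            in_array = 0
        }
        END { exit ok ? 0 : 1 }
    ' "${2:-/proc/mdstat}"
}

# Example: md_healthy md0 && echo "md0 has all its members"
```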
So at this point, there are two md RAID devices that use the same resources. They aren't mounted yet. Using OCFSv2 will allow us to control concurrent access.
Still using the same /etc/ocfs2/cluster.conf file (see previous post), we format the RAID device as OCFS2 (note: I now use a label; it simplifies creating identical configuration files on both nodes):
(ruby)# mkfs.ocfs2 -L "san" /dev/md0
(ruby & crystal)# mount -t ocfs2 -L "san" /storage
There we go: once again, shared storage between ruby & crystal.
Note 01: Such a configuration can easily saturate your network. Do not even try it if your maximum speed is 100 Mb/s; performance would be awful (trust me!). Go for gigabit or even InfiniBand if you can afford it.
Note 02: There are a lot of alternative options; you might want to check the md module documentation, under multipath. I know I will ;-)
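To put a rough number on Note 01: with RAID1, every write to /dev/md0 is sent to both targets, so on an initiator with a single NIC the write ceiling is about half the link speed. The helper below is just that arithmetic; it assumes one NIC per initiator and ignores all protocol overhead, so real figures will be lower.

```shell
#!/bin/sh
# Back-of-the-envelope write ceiling for a RAID1 over AoE on a single NIC.
# Assumption: one NIC per initiator, no protocol overhead counted.
write_ceiling_kbs() {
    # usage: write_ceiling_kbs <link speed in Mb/s> <number of mirrors>
    link_mbit=$1
    mirrors=$2
    # Mb/s -> kB/s (x1000 / 8), then shared between the mirror legs.
    echo $(( link_mbit * 1000 / 8 / mirrors ))
}

write_ceiling_kbs 100 2    # Fast Ethernet: prints 6250 (about 6 MB/s)
write_ceiling_kbs 1000 2   # Gigabit:       prints 62500 (about 62 MB/s)
```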
But how exactly is this system scalable?
Application node: If a system is built with aoetools, md device support and OCFS2 installed, it can be hot-added to the network. No restart of any running system is needed. However, it is still a very good idea to update each cluster.conf file.
Storage node: A system with devices exported through AoE can be hot-added up to a certain point, depending on the underlying RAID type (md device), but I would advise against it. In any case, you need to take OCFS2 offline to issue a resize command.
Filesystem size: Currently, due to 32-bit addressing, there seems to be a limit of 16 TB for a file system. A good reminder, though, is that an AoE target can export more than one device…
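The 16 TB figure is what you get from 32-bit block numbers if the block size is 4 KiB. That block size is my assumption here, not something I have verified against the OCFS2 sources, so take this as a sanity check of the arithmetic only:

```shell
#!/bin/sh
# Sketch: largest addressable size, in TiB, for a given number of
# address bits and a given block size in bytes (4096 is an assumption).
max_fs_tib() {
    # usage: max_fs_tib <address bits> <block size in bytes>
    echo $(( (1 << $1) * $2 / (1 << 40) ))
}

max_fs_tib 32 4096   # prints 16
```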