mass-storage.org

Over the last couple of days, I’ve been doing a lot of experimentation on mass-storage systems. I do not want to saturate this blog with high-end labs when most of my friends and family don’t clearly see the difference between a SAN and a NAS. On the other hand, I still want to publish my research process. "Research" might seem a bit presumptuous in light of what I’ve published so far, but this is really just a side effect of this dichotomy.

www.mass-storage.org is my answer to this dilemma. As one of my pet projects, it is an oasis (ok: a small wiki) where I (and any like-minded researcher) can publish information related to mass storage. I’ve already published 2 articles about the recent storage labs I’ve concluded (DRBD, OCFSv2, AoE) and more are on the way (about labs that are currently in progress [Lustre, AoE, DRBD optimization])…

I should start posting more insight into my own life here (hey, it was always noted as MY private little place), and move the storage-related (and more "permanent") info to m-s.org.

If you have any comments, as always, feel free to post.

Pascal Charest, directly from Camellia Sinensis on an IleSansfil connection.

AoE + OCFSv2 (storage fun, part 3)

NOTE: Now on www.mass-storage.org

I have a running two-node {DRBD 8.2.4 (P/P) + OCFSv2} cluster. More info here.

Kinda nice for small workloads (think load-balanced web servers, file servers, SQL servers (careful: Oracle is OK, MySQL needs specific configuration for external locking)) but a bit on the limited side as far as scalability goes.

Removing the storage aspect from application servers is the way to go. This is what SANs are for. Let’s modify my two-node cluster (ruby and crystal) to allow dynamic growth in terms of application and storage nodes.

For this test, I’ll be bringing in a third and fourth system: "jade" & "glouton", two Debian-based file servers.

The setup will be as follows:

(jade & glouton): SAN targets, exporting a device through AoE
(ruby & crystal): SAN initiators + application servers

Lexical info: an initiator is a SAN client, whereas targets are servers.
  
Exporting through AoE

(glouton&jade)# apt-get install aoetools vblade
(glouton)# vblade 0 1 eth0 /dev/sdb1
(jade)# vblade 1 1 eth0 /dev/sdb1

Note 1: My current setup makes me use the above configuration. In a true production environment, dual NICs would be preferred (using the Linux bonding module) & the exported device would be an MD array. There is also a lot of fine-tuning that can be done along the way (jumbo frames, multipath algorithm, scheduling algorithm, kernel hacking…)

Note 2: I would advise against going with an integrated list of MAC addresses in the vblade export command. The option is present, but the list is then static. Using ebtables seems to be a valid alternative since it can be dynamically modified.

Importing through AoE

(ruby&crystal)# apt-get install aoetools
(ruby&crystal)# modprobe aoe

If the devices are already exported (from jade & glouton), they will automatically show up in /dev/etherd; if not, run "aoe-discover".
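
A minimal sketch of that check, assuming the aoe module is already loaded and that the exported devices show up as e0.1 and e1.1. The function name wait_for_aoe, the retry count and the optional directory argument are illustrative only and not part of aoetools:

```shell
#!/bin/sh
# Wait until an AoE device node shows up, re-running discovery between checks.
# Usage: wait_for_aoe DEVICE [DIR] [TRIES]   (DIR defaults to /dev/etherd)
wait_for_aoe() {
    dev="$1"
    dir="${2:-/dev/etherd}"
    tries="${3:-5}"
    while [ "$tries" -gt 0 ]; do
        if [ -e "$dir/$dev" ]; then
            return 0
        fi
        # aoe-discover ships with aoetools; skip it quietly if it is missing.
        if command -v aoe-discover >/dev/null 2>&1; then
            aoe-discover || true
        fi
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}
```

On ruby or crystal, running something like "wait_for_aoe e0.1 && wait_for_aoe e1.1" before touching mdadm should avoid racing the discovery.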

Creating MD device for redundancy.

(ruby&crystal)# apt-get install mdadm
(ruby)# mdadm --create /dev/md0 -l1 -n2 /dev/etherd/e0.1 /dev/etherd/e1.1
(crystal)# mdadm --assemble /dev/md0 /dev/etherd/e0.1 /dev/etherd/e1.1

So at this point, there are two md RAID devices which use the same resources. They aren’t mounted yet. Using OCFSv2 will allow us to control concurrent access.
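
Before formatting, it may be worth confirming on each initiator that the mirror sees both AoE members. Here is a rough sketch that reads /proc/mdstat and looks for the "[UU]" member field. The helper name md_is_complete and the optional file argument are hypothetical, added only so the check can be tried against a saved copy of the file:

```shell
#!/bin/sh
# Return 0 when the named md array lists every member as up ("[UU]"),
# non-zero when a member is missing ("[U_]") or the array is not listed.
# Usage: md_is_complete MD_NAME [MDSTAT_FILE]   (defaults to /proc/mdstat)
md_is_complete() {
    md="$1"
    stat="${2:-/proc/mdstat}"
    # The line after "md0 : active raid1 ..." ends with the member status field.
    awk -v md="$md" '
        $1 == md && $2 == ":" { found = 1; next }
        found { print; exit }
    ' "$stat" | grep -q '\[UU*\][[:space:]]*$'
}
```

Something like "md_is_complete md0 || echo degraded" on both ruby and crystal would be the typical use; "mdadm --detail /dev/md0" gives the long version of the same information.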

Still using the same /etc/ocfs2/cluster.conf file (see previous post), we format the RAID device as OCFS2 (note: I now use a label, which simplifies creating identical configuration files):

(ruby)# mkfs.ocfs2 -L "san" /dev/md0 
(ruby & crystal)# mount -t ocfs2 -L "san" /storage

There we go, once again, a shared storage between ruby & crystal.

Note 01: Such a configuration can easily saturate your network. Do not even try if your max speed is 100 Mb/s; it would give awful performance (trust me!). Go for gigabit or even InfiniBand if you can afford it.

Note 02: There are a lot of alternative options; you might want to check the md module documentation, under multipath. I know I will ;-)

But how exactly is this system scalable?

Application node: If a system is built with aoetools, md device support and ocfs2 installed, it can be hot-added to the network. No restart of any running system is needed. However, it is still a very good idea to update each cluster.conf file.
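
As a sketch of that cluster.conf update, the helper below prints a node stanza in the O2CB layout (header at column 0, parameters indented with a tab). Everything in the example call is a placeholder: the node name "amber", its IP address, the node number, port 7777 and the cluster name "ocfs2" are assumptions, since the real values live in the cluster.conf from the previous post:

```shell
#!/bin/sh
# Print an OCFS2 cluster.conf "node:" stanza for a new application node.
# Usage: ocfs2_node_stanza NAME IP NUMBER [CLUSTER]
ocfs2_node_stanza() {
    name="$1"
    ip="$2"
    number="$3"
    cluster="${4:-ocfs2}"          # assumed cluster name
    printf 'node:\n'
    printf '\tip_port = 7777\n'    # assumed port, adjust to match existing nodes
    printf '\tip_address = %s\n' "$ip"
    printf '\tnumber = %s\n' "$number"
    printf '\tname = %s\n' "$name"
    printf '\tcluster = %s\n' "$cluster"
}

# Hypothetical third application node.
ocfs2_node_stanza amber 192.168.0.13 2
```

The stanza would be appended to /etc/ocfs2/cluster.conf on every node, and node_count in the "cluster:" stanza bumped to match.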

Storage node: A system with devices exported through AoE can be hot-added up to a certain point, depending on the underlying RAID type (md device), but I would advise against it. In any case, you need to take OCFS2 offline to issue a resize command.

Filesystem size: Currently, due to 32-bit addressing, there seems to be a limit of 16 TB for a file system. A good reminder though is that an AoE target can export more than one device…
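
The 16 TB figure is what you get from 32-bit block numbers combined with a 4 KiB block size. The block size is my assumption here, and the helper name is purely illustrative, but the arithmetic is easy to check:

```shell
#!/bin/sh
# Largest volume (in TiB) addressable with BITS-bit block numbers
# and blocks of BLOCK_BYTES bytes.
max_fs_tib() {
    bits="$1"
    block_bytes="$2"
    echo $(( (1 << bits) * block_bytes / (1 << 40) ))
}

max_fs_tib 32 4096   # 2^32 blocks * 4 KiB = 16 TiB
```

With a smaller block size the ceiling drops accordingly (1 KiB blocks would give 4 TiB under the same assumption).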

DRBD-8.2.5 on Debian/SID

While updating my GNU/Linux lab, I’ve decided to put the latest version of DRBD (stable: 8.2.4, unstable: 8.2.5) on the test bench. I wanted to try "online verification" and the "primary/primary" state for cluster filesystems (OCFS2, GFS).

The current version available through the Debian repository is out of date (v8.0.8) and doesn’t have the online verification option, so I had no other choice but to build my own modules & utils. Another problem was the out-of-date ./drbd-8.2/INSTALL file, especially regarding Debian systems; in fact, most of the Debian-related instructions seem to be broken.

So here goes the missing "INSTALL.debian" for DRBD-8.2.x. This is hosted on Google Docs and will change as I invest time into it.

The whole "normal procedure" for the unstable version of DRBD on a minimal Debian/SID install can be summarized as:

# apt-get install git-core
# cd /usr/local/src
# git-clone git://git.drbd.org/drbd-8.2.git drbd-8.2
# apt-get install linux-headers-`uname -r` build-essential flex docbook-utils
# cd /usr/local/src/drbd-8.2
# make
# make doc
# make install

This will give you a valid DRBD-8.2.5 installation. You’ll need to modify /etc/drbd.conf to match your setup. One cool new feature is the "online verification":

You add the following line inside your syncer section of /etc/drbd.conf and modprobe the kernel module:

// in /etc/drbd.conf, syncer section: verify-alg crc32c;
# modprobe crc32c

# drbdadm verify store

where store is my resource name. But… this isn’t the end of my problems, because the command doesn’t work here: it causes my primary system to lose its connection with the secondary node. Humfff… I’ll see what I can do about that tomorrow.

NOTE: finally, the problem is simple enough: the unstable version is not a working version of DRBD.
