Labs: installation & configuration of GlusterFS as a synchronous data storage solution.
By: Pascal Charest, free software consultant
Date: September, 2008.
Synchronizing files in a cloud environment is a challenge on the path to high availability and high performance. From simple load-balanced web sites to full-blown applications, some files always need to be in sync. Some people, for simplicity, rely on asynchronous transfer (e.g. rsync); others deploy bigger solutions (e.g. block device replication through DRBD, or shared storage through the AoE protocol with concurrency management by OCFSv2), or even go for the “lazy”, “no-shared-storage” solution through NFS.
To address this problem in the PraizedMedia software stack, I decided to give the FUSE-based GlusterFS a try. Awesome, really! The technical knowledge needed to deploy a basic solution is very low. The modularity of the program also helps to get “something working right now”. This isn’t meant as a direct alternative to DRBD or a good SAN deployment, but in my use case it fits perfectly.
In this lab, I will guide you through the installation of GlusterFS on two networked systems. Both will be used as “server” and “client” for the GlusterFS filesystem. They will share a directory (on both systems: /mnt/production/brick), re-mounted as /mnt/production/static through GlusterFS. Any write I/O on this directory (from either system) will be synchronized to the pool. This last feature is called “AFR” (for automatic file replication) and is a module (called a translator) of the GlusterFS file system.
The specificity of my environment is around file-locking management: I don’t need any. By design, the application will never try to write the same file twice on any of the servers.
#Installation of requirement (standard tools)
apt-get install flex bison libfuse-dev linux-headers-`uname -r` curl
#download of the sources
cd /usr/local/src/
curl -O http://ftp.zresearch.com/pub/gluster/glusterfs/1.3/glusterfs-CURRENT.tar.gz
tar zxf glusterfs-CURRENT.tar.gz
# configure
cd glusterfs-1.3.11
./configure --prefix=/usr/local/glusterfs-1.3.11
make && make install
ln -s /usr/local/glusterfs-1.3.11 /usr/local/glusterfs
So we now have a basic GlusterFS installation on our two servers. Let’s be honest, that wasn’t really hard! We are still missing the configuration files, though.
#Editing /usr/local/glusterfs/etc/glusterfs/glusterfs-server.vol
#
# glusterfs-servers definition
# volume definition are on first lvl, other are on second lvl (tabbed)
volume brick
    type storage/posix
    option directory /mnt/production/brick
end-volume
volume server
    type protocol/server
    option transport-type tcp/server
    option auth.ip.brick.allow *
    subvolumes brick
end-volume
#Editing the /usr/local/glusterfs/etc/glusterfs/glusterfs-client.vol
#
# glusterfs-client.vol
# volume definition are on first lvl, other are on second lvl (tabbed)
#
volume remote1
    type protocol/client
    option transport-type tcp/client
    option remote-host 002.praized.com
    option remote-subvolume brick
end-volume
volume remote2
    type protocol/client
    option transport-type tcp/client
    option remote-host 001.praized.com
    option remote-subvolume brick
end-volume
volume mirror0
    type cluster/afr
    subvolumes remote1 remote2
end-volume
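If you end up with more than two bricks, writing the client file by hand gets repetitive. Here is a minimal sketch of a generator for the client volfile shown above. The function name and the remoteN naming scheme are my own illustration, not something shipped with GlusterFS, and the output only mirrors the 1.3-style layout used in this lab.

```python
def render_client_volfile(hosts, remote_subvolume="brick", mirror_name="mirror0"):
    """Render a client volfile that mirrors every host in `hosts` through AFR.

    Hypothetical helper: it only reproduces the layout of the hand-written
    glusterfs-client.vol from this lab (one protocol/client volume per host,
    then one cluster/afr volume on top of them).
    """
    if len(hosts) < 2:
        raise ValueError("a mirror needs at least two remote bricks")

    names = []
    blocks = []
    for index, host in enumerate(hosts, start=1):
        name = f"remote{index}"
        names.append(name)
        blocks.append("\n".join([
            f"volume {name}",
            "    type protocol/client",
            "    option transport-type tcp/client",
            f"    option remote-host {host}",
            f"    option remote-subvolume {remote_subvolume}",
            "end-volume",
        ]))

    # The AFR volume lists every remote volume as a subvolume.
    blocks.append("\n".join([
        f"volume {mirror_name}",
        "    type cluster/afr",
        "    subvolumes " + " ".join(names),
        "end-volume",
    ]))
    return "\n".join(blocks) + "\n"


print(render_client_volfile(["002.praized.com", "001.praized.com"]))
```

Redirect the output to glusterfs-client.vol on each client; the same file works on every node since it lists all the bricks.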
#Launching services (servers and clients)
mkdir -p /mnt/production/brick
/usr/local/glusterfs-1.3.11/sbin/glusterfsd -f /usr/local/glusterfs-1.3.11/etc/glusterfs/glusterfs-server.vol
mkdir -p /mnt/production/static
/usr/local/glusterfs-1.3.11/sbin/glusterfs -f /usr/local/glusterfs-1.3.11/etc/glusterfs/glusterfs-client.vol /mnt/production/static/
You now have a directory synchronized between your two systems. Please note that GlusterFS requires TCP port 6996 to be open. This setup can also be improved by adding a locking mechanism and I/O threads; I don’t currently need them, but you might.
Enjoy!
Debugging notes: after starting the server process, you should see a process called glusterfs. All log files are in /usr/local/glusterfs/var/log/glusterfs*. After starting the client, “df -h” should show your new mount point. Be careful with UID/GID (and permissions): there is no such thing as root_squash_fs in GlusterFS yet.
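When the client refuses to mount, the first thing I would rule out is a firewall in front of the server port. A small sketch of a reachability check follows; it assumes the TCP port 6996 mentioned above, and the function name is only for illustration. It tells you that something accepts connections on that port, not that it is actually glusterfsd.

```python
import socket


def port_is_open(host, port=6996, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Call it from each node against the other one, for example port_is_open("001.praized.com") from 002 and the reverse.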
Other notes: Amazon EBS would have been the perfect solution if it allowed a volume to be mounted on multiple servers and let us deal with the concurrency/lock problems ourselves. But it doesn’t.
Oops!
I forgot to mention that the Ottawa Linux Symposium is fast approaching. I will be there, from July 22 to 26, 2008, to blog about the various players in the open source world. Given my background in storage / virtualization / cloud computing, I am particularly interested in the following talks:
“Tux meets Radar O’Reilly - Linux in military telecom”: It is always interesting to see a deployment from a military perspective. In the military, as in banking and medicine, errors are much less tolerated and can have incredible repercussions… To prevent blunders, systems are tested very precisely; ordinary mortals surely have a lot to learn from this mindset. Investing in stability: not a crazy idea!
“A Survey of Virtualization Workloads”: Simple, but if the presentation (and the research!) is well done, there may be a correlation with use cases I encounter in my consulting work. It is followed by a second presentation that looks almost identical; worst case, I’ll go see both.
“Applying Green Computing to clusters and the data center”: I know this field only very superficially, apart from “you configure wake-on-LAN combined with load control”. Being particularly biased toward the “you add more systems” solution, both in terms of creating financial assets and reducing costs, I imagine I will be able to lift the veil of my ignorance and change my position.
“SynergyFS: A Stackable File System Creating Synergies Between Heterogeneous Storage Devices”: Discussions about storage in hybrid environments always hook me. Storage is a problem with more than 50 solutions in the GNU/Linux world (50+ filesystems supported by the kernel), each with its strengths and weaknesses. Let’s see how to take advantage of the strengths by “patching the weaknesses with other file systems”.
“If I turn this knob… what happens?”: The last presentation that earns my interest, and not the least. It is about collecting metrics (io, scheduler, lock_wait, sys/proc fs,…) and acting on the results. In short, it mirrors exactly what I do in capacity planning for the clouds/clusters I deploy. I always enjoy seeing and discussing the processes used by other consultants.
See you there!
Executive summary: give me $10k, a month, 3 PowerEdge servers and a gigabit-capable switch, and I’ll build you a scalable cloud infrastructure ;-).
And, the post:
Last year’s dominant meme was "Virtualization". Since you can’t have the same focus for two consecutive years (there must be a law about that written somewhere), they (for various definitions of "they") had to enhance it. Here comes "Cloud Computing".
Cloud computing, as defined here, here, here, here, here and… is still in its condensation phase. Ideas appear and usability should emerge… soon.
While this is concentrated fun for theoreticians, I would prefer a more technical discussion. I am aware of Montreal-based corporations currently studying cloud/grid systems. One of the next big players in Montreal and the north-eastern USA might be iWeb Technologies: they already have hardware, a customer base, and so much to gain from the scalability aspect of cloud computing. Think about dynamically shutting down unused shared hosting systems and relocating instances according to their impact on server resources. A lot of other corporations are also present in the field.
But I don’t have access to the same quantity of hardware as they do, so let’s see what is available or can be built in my small lab.
SunGrid Engine, as an online service requiring no hardware, has more of a grid heritage than a cloud computing future. Applications are launched and run, and a specific output is gathered and sent back. The list of applications, while impressive, doesn’t include "Apache": this is a system meant for raw processing power, not for offering services.
IBM’s BlueCloud is still more of a vapor cloud around a press release than anything that has to do with computing. Though I’m sure it looks awesome in their lab. But then again, I’m sure their whole lab looks nice.
3TERA’s AppLogic does look neat, yet there is no public price tag. It also looks like the kind of system that is built around templates "which should not be modified". I have no idea how reliable the system remains once customizations are made. And I won’t know… no price tag is a straight no-go for me. If you are ashamed of your pricing model, there is a problem. If that’s not the case, there is no reason not to show "figures".
Another online service, Amazon AWS (EC2 & S3), is one of the current market leaders. Based on Xen, it gives you a remote instance for a couple of cents an hour. The main concern with EC2 is the volatile nature of the storage, which kind of defeats the real purpose of most services: dealing with information.
So ?
While I don’t have much hardware, I still have a lab of 4 dev + 2 prod systems. Let’s see what can be done. Let’s design a home-brewed cloud infrastructure.
Nodes types
ConfigNode :
role : CNode is a standard Debian system. It is the DHCP + PXE + TFTP server. It holds the HardwareNode kernel. All cloud configuration happens on these systems.
min : 1 sys.
normal : 2 sys. {Primary/Slave}, with software RAID + DRBD + heartbeat.
Scalable : no need. 2 systems are more than enough; there isn’t really any CPU/network load.
StorageNode :
role: SNode is a network-booted GNU/Linux system. It serves AoE devices on the network. All nodes (except ConfigNode) use an SNODE as root filesystem.
min: 1 sys.
preferred: 2 sys. {Primary/Primary} with software RAID + DRBD. MD-device multipathing is required on the clients to preserve P/P coherence and resilience to network failure.
Scalable : This is a building block. The limit of SNODE is defined by the network fabric speed.
HardwareNode :
role : HNode is a network-booted GNU/Linux/Xen dom0 system. It uses an SNODE array as its root filesystem. This is where INODEs (instances) are launched. This node is diskless.
min: 1 sys.
preferred: no limit.
Scalable: This is a building block of the infrastructure. The limit of HNODE is defined by the acceptable speed of the root file system located on a SNODE.
Instance :
role : an Instance is a network-booted GNU/Linux/Xen domU system. With VT-capable hardware, it can also be an unmodified guest operating system (read: full-fledged GNU/Linux or Microsoft Windows). It is started on a specific HNODE using SNODE resources.
min : 1 sys.
preferred : no limit.
Scalable : Currently limited to the underlying HNODE resources.
Summary : Using a specific configuration node, we start a StorageNode and a HardwareNode. Then, once the infrastructure is "running", Instances can be dynamically started on a HardwareNode.
Since Instances are Xen domU based and run on shared storage, they can be migrated live, without downtime, between HardwareNodes. A ping to the virtual instance would not fail, even in the middle of the live migration.
Since HardwareNodes are network booted, adding a new server is as simple as adding its MAC address to the DHCP configuration and tagging it as an HNODE. As long as the system is able to PXE boot, adding a new node is really a matter of minutes.
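To give an idea of how little is involved in registering a node, here is a rough sketch that renders a host entry for the DHCP server. I am assuming ISC dhcpd syntax and a pxelinux.0 boot file; the host name, addresses and function name are made up for the example, and how you "tag" a node as HNODE (a group, a naming scheme, a different boot file) is up to your own setup.

```python
import re

MAC_PATTERN = re.compile(r"^([0-9a-f]{2}:){5}[0-9a-f]{2}$")


def dhcp_host_entry(name, mac, ip, tftp_server, boot_file="pxelinux.0"):
    """Render one ISC dhcpd 'host' stanza for a PXE-booted node.

    All values are supplied by the caller; nothing here comes from the
    lab's real configuration.
    """
    # Accept both 00-16-3E-... and 00:16:3e:... notations.
    mac = mac.strip().lower().replace("-", ":")
    if not MAC_PATTERN.match(mac):
        raise ValueError(f"not a valid MAC address: {mac!r}")
    return (
        f"host {name} {{\n"
        f"  hardware ethernet {mac};\n"
        f"  fixed-address {ip};\n"
        f"  next-server {tftp_server};\n"
        f'  filename "{boot_file}";\n'
        f"}}\n"
    )


# Hypothetical node: name and addresses are placeholders.
print(dhcp_host_entry("hnode01", "00-16-3E-00-00-01", "192.168.0.21", "192.168.0.1"))
```

Append the output to dhcpd.conf on the ConfigNode and restart the DHCP service; the new box then only needs to be set to boot from the network.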
Since HardwareNodes are network booted with a remote root filesystem, they do not need a hard drive. This removes one of the main points of failure of current infrastructures. There isn’t much to fail in a server with only a CPU, memory and network interfaces.
The storage aspect is taken care of by the StorageNodes, where good RAID + redundancy + hard-drive snapshots can be used to control the environment. The only limit on the number of storage nodes is the network… but then, link aggregation is your friend.
Since multipathing is used, with DRBD and AoE, a storage node can be shut down without impacting running instances.
Creating a new InstanceNode is easy: either copy an existing instance or debootstrap a new system. Building something similar to 3Tera’s offering would be fairly easy at this point, by creating templates and preparing configuration interfaces/scripts.
What now ?
It took me one weekend day. I have a running ConfigNode, StorageNode (using NFS for now; AoE/multipathing is next), HardwareNode and an Instance. Much of the time was spent waiting for kernel compilation and deploying distcc on my LAN. I had a few problems PXE booting a dom0, but found a fix.
I wonder what someone working full time could accomplish in a month… Does someone want to pay me to find out? ;-) Oh, and it would cost you (in addition to my salary for a month) a copy of Nicholas Carr’s book The Big Switch (which I haven’t read yet, but plan to, as soon as I can get my hands on a copy). I can even do a little presentation first for some kind of financial compensation (yeah, money drives me ;-)).
Seriously, such a setup would be fully scalable and easy to configure dynamically through scripts or a GUI. One limiting factor is the CPU/memory ceiling on instances, since each is tied to a single hardware node; but if Xen (as a commercial solution) is able to create a resource pool, I’m sure there is a way around that limitation.
Jeez, using VT-enabled hardware nodes, you could even start Microsoft Windows instances in your cloud…
Btw, I know that everything I’ve talked about can be done through VMware Infrastructure with VMotion (and maybe 3Tera’s AppLogic), but… then think about the fact that a 2-CPU licence for VMware Infrastructure is a little over US$6,900…
I just don’t understand why there aren’t more clouds out there. This isn’t all that hard to deploy… nor even time consuming…
In the last couple of days, I’ve been doing a lot of experiments on mass-storage systems. I do not want to saturate this blog with high-end labs when most of my friends and family don’t clearly see the difference between a SAN and a NAS. On the other hand, I still want to publish my research process. "Research" might seem a bit presumptuous in light of what I’ve published so far, but this is really just a side effect of this dichotomy.
www.mass-storage.org is my answer to this dilemma. As one of my pet projects, it is an oasis (OK: a small wiki) where I (and any like-minded researcher) can publish information related to mass storage. I’ve already published 2 articles about the storage labs I’ve recently concluded (DRBD, OCFSv2, AoE), and more are on the way (about labs that are currently in progress [Lustre, AoE, DRBD optimization])…
I should start posting more insights into my own life here (hey, it was always meant as MY private little place), and move the storage-related (and more "permanent") info to m-s.org.
If you have any comments, as always, feel free to post.
Pascal Charest, directly from Camellia Sinensis on an IleSansfil connection.
NOTE: Now on www.mass-storage.org
I have a running {DRBD 8.2.4 (P/P) + OCFSv2} two-node cluster. More Info here.
Kinda nice for small workloads (think load-balanced web servers, file servers, SQL servers (careful: Oracle is OK, MySQL needs specific configuration for external locking)) but a bit on the limited side as far as scalability goes.
Removing the storage aspect from application servers is the way to go. This is what SANs are for. Let’s modify my two-node (ruby and crystal) cluster to allow dynamic growth in terms of application and storage nodes.
For this test, I’ll be bringing in a third and a fourth system: "jade" & "glouton", two Debian-based file servers.
The setup will be as follows:
(jade & glouton): SAN target, exporting device through AOE
(ruby & crystal): SAN initiator + application server
Lexical info: an initiator is a SAN client, whereas targets are servers.
Exporting through AoE
(glouton&jade)# apt-get install aoetools vblade
(glouton)# vblade 0 1 eth0 /dev/sdb1
(jade)# vblade 1 1 eth0 /dev/sdb1
Note 1: My current setup forces me to use the above configuration. In a true production environment, dual NICs would be preferred (using the Linux bonding module) and the exported device would be an MD array. There is also a lot of fine-tuning that can be done along the way (jumbo frames, multipath algorithm, scheduling algorithm, kernel hacking…).
Note 2: I would advise against passing a fixed list of MAC addresses to the vblade export command. The option is present, but the list is then static. Using ebtables seems to be a valid alternative, since it can be modified dynamically.
Importing through AoE
(ruby&crystal)# apt-get install aoetools
(ruby&crystal)# modprobe aoe
If the devices are already exported (from jade & glouton), they will automatically be available in /dev/etherd; otherwise, use "aoe-discover".
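The device names under /dev/etherd follow directly from the shelf and slot numbers given to vblade, which is how the two exports above become e0.1 and e1.1 on the initiators. A tiny sketch of that mapping (the helper names are mine, for illustration only):

```python
def aoe_device_path(shelf, slot):
    """Return the initiator-side device path for an AoE export."""
    if shelf < 0 or slot < 0:
        raise ValueError("shelf and slot must be non-negative")
    return f"/dev/etherd/e{shelf}.{slot}"


def device_for_vblade(command):
    """Map a 'vblade <shelf> <slot> <interface> <device>' line to its /dev/etherd path."""
    parts = command.split()
    if len(parts) != 5 or parts[0] != "vblade":
        raise ValueError(f"unexpected vblade command: {command!r}")
    return aoe_device_path(int(parts[1]), int(parts[2]))


print(device_for_vblade("vblade 0 1 eth0 /dev/sdb1"))  # /dev/etherd/e0.1
print(device_for_vblade("vblade 1 1 eth0 /dev/sdb1"))  # /dev/etherd/e1.1
```

Giving each storage node its own shelf number, as done here, keeps the names unique when more targets are added later.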
Creating MD device for redundancy.
(ruby&crystal)# apt-get install mdadm
(ruby)# mdadm --create /dev/md0 -l1 -n2 /dev/etherd/e0.1 /dev/etherd/e1.1
(crystal)# mdadm --assemble /dev/md0 /dev/etherd/e0.1 /dev/etherd/e1.1
So at this point, there are two MD RAID devices which use the same resources. They aren’t mounted yet. Using OCFSv2 will allow us to control concurrent access.
Still using the same /etc/ocfs2/cluster.conf file (see previous post), we format the RAID device as OCFS2 (note: I now use a label; it simplifies the creation of identical configuration files):
(ruby)# mkfs.ocfs2 -L "san" /dev/md0
(ruby & crystal)# mount -t ocfs2 -L "san" /storage
There we go, once again, shared storage between ruby & crystal.
Note 01 : Such a configuration can easily saturate your network. Do not even try if your max speed is 100Mb/s; this would give awful performance (trust me!). Go for gigabit or even InfiniBand if you can afford it.
Note 02 : There are a lot of alternative options; you might want to check the md module documentation, under multipath. I know I will ;-)
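To put a number on the 100Mb/s warning, here is a back-of-the-envelope sketch. It assumes a single NIC on the initiator, that the RAID1 layer sends every write to each mirror over that same link, and it ignores protocol overhead entirely, so real figures will be lower. The function name and parameters are illustrative only.

```python
def mirrored_write_throughput(link_mbps, mirrors=2, efficiency=1.0):
    """Best-case write throughput in MB/s when every write crosses one link once per mirror.

    link_mbps:  nominal link speed in megabits per second
    mirrors:    number of AoE targets in the RAID1 set (assumed: 2, as in this lab)
    efficiency: fraction of the link usable for payload (assumed: 1.0, i.e. no overhead)
    """
    raw_mb_per_s = link_mbps / 8.0  # megabits -> megabytes
    return raw_mb_per_s * efficiency / mirrors


print(mirrored_write_throughput(100))   # 6.25
print(mirrored_write_throughput(1000))  # 62.5
```

Roughly 6 MB/s of writes in the best case on a 100Mb/s link, shared by everything else on the wire, is why I would not bother below gigabit.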
But how exactly is this system scalable ?
Application node : If a system is built with aoetools, md-device support and ocfs2 installed, it can be hot-added to the network. No restart of any running system is needed. However, it is still a very good idea to update each cluster.conf file.
Storage node : A system with devices exported through AoE can be hot-added up to a certain point, depending on the underlying RAID type (md-device), but I would advise against it. In any case, you need to take OCFS2 offline to issue a resize command.
Filesystem size : Currently, due to 32-bit addressing, there seems to be a 16TB limit for a file system. A good reminder, though, is that an AoE target can export more than one device…
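The 16TB figure is what you get if the 32-bit addresses count blocks of 4 KiB each. The block size is my assumption here (it is not stated above), and the quick check below is only arithmetic, not a statement about any particular OCFS2 version.

```python
TIB = 2 ** 40  # one tebibyte in bytes


def max_filesystem_bytes(address_bits=32, block_size=4096):
    """Largest addressable size when `address_bits` bits index blocks of `block_size` bytes."""
    return (2 ** address_bits) * block_size


print(max_filesystem_bytes() // TIB)  # 16
```

With smaller blocks the ceiling drops accordingly (1 KiB blocks would give 4 TiB under the same assumption), which is one more reason to think about block size before formatting.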