Setting up the cluster
Contents
- 1 Building a Beowulf Cluster (MPICH2)
- 1.1 Prerequisites
- 1.2 Let's Go
- 1.3 1? Install Ubuntu
- 1.4 1. Installing Ubuntu For Reals
- 1.5 2. Setting up your hosts file
- 1.6 3. Adding the Cluster User
- 1.7 4. Sharing Your Home Directory With NFS
- 1.8 5. Passwordless SSH
- 1.9 6. MPICH
- 1.10 7. Process Manager
- 1.11 Installing mpi4py
- 1.12 Troubleshooting Errors
Building a Beowulf Cluster (MPICH2)
Prerequisites
- Have computers
- Have the computers on the same network, connected by Ethernet. I can't explain that in much detail because I don't know much about it, but a network switch or router that all of the machines plug into should do.
- Be ok with wiping the computers. I installed clean versions of Ubuntu onto the cluster I made.
- Also, if you want to look at other sources, these are two sources that were very helpful to me
Let's Go
- Get a copy of Ubuntu 14.04 Server. If you really want to, you can install a GUI such as Unity onto it later.
1? Install Ubuntu
This in itself can be a process, and is slow. At least on my computers it was.
- Make the bootable USB.
- At the time of writing this, you can download Ubuntu 14.04 Server here.
- I'm sure that if a newer version comes out, it should work as well, although I tried initially with 15.10 (Desktop) and had some trouble with that.
- Choose a master node
- People recommend that the master node be the most powerful. Of course if you are using identical computers just choose whichever. In my case, I had 1 computer with 4 cores, whereas the rest had 2.
- Plop the usb into the master node and get to the install screen.
- If you're not booting from the usb it may be because it is not the top priority in the boot sequence. To change this you'll need to go to the bios setup screen.
- While the computer is booting up, press F2 or the Delete key. (Manufacturers choose different keys, but it is usually one of those two.)
- Find the settings for something along the lines of "Boot Sequence".
- Identify the usb and move it to the top. Save and Reboot.
Once you've got that all working, we can start installing Ubuntu!
1. Installing Ubuntu For Reals
The installation is mostly straightforward but just follow along to make sure we're on the same page. There are a few things that need to be done during this process - like not encrypting the home directory.
- Do not detect keyboard layout - Choose EN - US (or whichever you'd prefer) manually
- Why? I'm actually not sure, it was just what I was directed to do. Follow directions and stop asking questions. These were directions for installing 12.04, so maybe there was a bug in the autodetect feature.
- Set host name to ub<x>, where x is the node number.
- For example, the master node should be ub0.
- Set the name of the new user to "new-user".
- Set the account name to "new-user"
- This might not be an option for you. I don't think I was given this option.
- Do not encrypt the home directory.
- If you do, setting up a shared folder with nfs will be rather difficult.
- Partition Method: Guided Use Entire Disk
- Remove existing logical volume data
- Basically just allow it to overwrite whatever it wants.
- Leave HTTP proxy information blank
- No Automatic Updates
- Choose software to install. Just select OpenSSH
- Note that you actually have to press the spacebar to select the option. If you press enter it will just go on without installing. If you do this, it's ok. You can install it later with sudo apt-get install openssh-server.
- Install GRUB
- Hooray!
- Repeat this process for your other nodes. You can of course do some now and some later (I did), though it may be easier to just do them all at once.
2. Setting up your hosts file
You'll want to do this so that you don't have to type in an entire IP address every time you want to communicate with another node. Write a list of each IP address and which node it corresponds to.
- Set this up on every node. Go edit the hosts file like so.
allnodes: sudo nano /etc/hosts
Write to the file so it looks like this
127.0.0.1 localhost
192.168.1.6 ub0
192.168.1.7 ub1
192.168.1.8 ub2
192.168.1.9 ub3
Be sure that each name is only used once, and replace the IP addresses with yours. You can find the IP address of each node by running ifconfig in the terminal.
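If you have more than a handful of nodes, typing the same list into every hosts file gets error-prone. Here is a rough Python 3 sketch that builds the text from one table. The NODES table and the render_hosts name are made up for illustration, and the addresses are just the example ones from above.

```python
# Hypothetical node table; replace the addresses with your own.
NODES = {
    "ub0": "192.168.1.6",
    "ub1": "192.168.1.7",
    "ub2": "192.168.1.8",
    "ub3": "192.168.1.9",
}


def render_hosts(nodes):
    """Return /etc/hosts text for the cluster, one 'address name' pair per line."""
    addresses = list(nodes.values())
    # A dict already guarantees unique names; also refuse duplicate addresses.
    if len(set(addresses)) != len(addresses):
        raise ValueError("each node needs its own IP address")
    lines = ["127.0.0.1 localhost"]
    for name, address in nodes.items():
        lines.append("{} {}".format(address, name))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the text so it can be reviewed before pasting into /etc/hosts.
    print(render_hosts(NODES), end="")
```

The script only prints the text. Copying it into /etc/hosts on each node is still a manual step.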
3. Adding the Cluster User
Now we'll make a new user that will be our cluster user. This user will have the same name and password on every node. I will call my user "beo", for beowulf. We also need to specify a user ID, which must be the same on every node. Make the ID a number between 900 and 999, which keeps the user from showing up in the usual GUI login list.
Take note that I will write which node the command needs to be run on, followed by a colon, before writing the actual command.
allnodes: sudo adduser beo --uid 999
4. Sharing Your Home Directory With NFS
- Now we need to set up nfs on the master node so that you can share a folder for programs and whatnot. This is so you don't have to install a lot of things on every single node, or put a script on every single computer. As you can imagine that would be quite tedious, especially if you have many nodes.
- Install nfs-kernel-server on the master node
masternode: sudo apt-get install nfs-kernel-server
- And on the children nodes install nfs-common
childnode: sudo apt-get install nfs-common
- We will need to indicate which folder we want to share in our exports file. Edit it with nano
masternode: sudo nano /etc/exports
/home/beo *(rw,sync,no_subtree_check)
Add the above line to the bottom of /etc/exports and restart the server
masternode: sudo service nfs-kernel-server restart
- Now here, the york article discusses running a sudo ufw allow from <ipaddress>, but I didn't have to do this. If after you run the next few steps you find that your specified folder is not being shared to your other nodes, you may need to check out the york article I posted near the beginning of this post.
- Now we need to edit the /etc/fstab file on the child nodes (nfs-common has to be installed there first). This tells each child node where to mount the shared folder coming from the master node.
childnode: sudo apt-get install nfs-common
childnode: sudo nano /etc/fstab
Add this line to the file.
ub0:/home/beo /home/beo nfs
- Now, when the computer boots it should automatically mount the home directory from the master node, to the child nodes' home directories. Check to see with
childnode: ls /home/beo/
This should mirror what is on the master node
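Looking at ls works, but an empty local /home/beo and a failed mount can look the same. As a rough sketch, this Python 3 script reads /proc/mounts on a child node and checks whether /home/beo really is an NFS mount. The function name is made up for illustration, and it assumes the filesystem type shows up as nfs or nfs4.

```python
def is_nfs_mounted(mounts_text, mount_point="/home/beo"):
    """Return True if mount_point is listed in mounts_text with an NFS filesystem type."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, filesystem type, options, ...
        if len(fields) >= 3 and fields[1] == mount_point:
            if fields[2] in ("nfs", "nfs4"):
                return True
    return False


if __name__ == "__main__":
    try:
        with open("/proc/mounts") as handle:
            mounted = is_nfs_mounted(handle.read())
    except OSError:
        # /proc/mounts only exists on Linux.
        print("could not read /proc/mounts")
    else:
        print("/home/beo is an NFS mount" if mounted else "/home/beo is NOT an NFS mount")
```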
5. Passwordless SSH
To get communication working smoothly between the nodes, we're going to want to set up passwordless ssh.
- Get on master node
- Change into your cluster user (beo)
masternode: su beo
masternode: ssh-keygen
- When asked for a passphrase, do not enter one. Just leave the field blank, so that it will be "passwordless".
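- Once that finishes, run the command below. Because /home/beo is shared over NFS, authorizing the key on localhost authorizes it on every node.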
masternode: ssh-copy-id localhost
- Now you should be able to quickly log into the other nodes in your cluster through ssh.
masternode: ssh ub1
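Logging into every node by hand gets old fast. Here is a rough Python 3 sketch that checks each child node for you. The node list and function names are made up for illustration. BatchMode=yes makes ssh fail instead of asking for a password, so a node that still prompts shows up as a failure.

```python
import subprocess
import sys

CHILD_NODES = ["ub1", "ub2", "ub3"]  # host names from the example hosts file


def ssh_check_command(host):
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",      # never fall back to an interactive prompt
        "-o", "ConnectTimeout=5",   # give up quickly if the node is down
        host,
        "hostname",
    ]


def check(host):
    """Return True if host accepts key-based login and runs 'hostname'."""
    try:
        result = subprocess.run(ssh_check_command(host), capture_output=True, timeout=15)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    for node in CHILD_NODES:
        if "--run" in sys.argv[1:]:
            print(node, "ok" if check(node) else "FAILED")
        else:
            # Dry run: show the command without contacting the node.
            print(" ".join(ssh_check_command(node)))
```

Run it as the cluster user on the master node with --run to actually contact the nodes. Without the flag it only prints the commands it would use.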
If you want to change the default port that is used for ssh, we have to make some changes to config files. Unfortunately we have to do this on all nodes.
allnodes: sudo nano /etc/ssh/ssh_config
Port xxxxx
allnodes: sudo nano /etc/ssh/sshd_config
Port xxxxx
PermitRootLogin no
Note that there is already a line that says "PermitRootLogin". Replace that line with the above.
6. MPICH
So there are two main options for setting up a message passing interface on your cluster as far as I can tell. MPICH, and OpenMPI. I am using MPICH. Right before installing mpich2, we might want to install some other software.
beo@ub0:~$ sudo apt-get install build-essential gfortran gfortran-multilib autoconf
Install MPICH2. Now. There are two ways to do this. The easy way, and the way that worked for me. Here's the easy way.
beo@ub0:~$ sudo apt-get install mpich2
- If this works for you, then great! If not, then you're going to have to do it the way that worked for me. I had to install the package manually. See this guide (http://www.mpich.org/static/downloads/3.1.3/mpich-3.1.3-installguide.pdf) for how to install it. Make sure to install it in your shared directory, otherwise you're going to have to install it on every node.
- You will need a link to the mpich download. At this time, the download is available at
- Use the wget <link> command to download the file and then follow the instructions in the guide.
7. Process Manager
Once mpich2 is installed we need to set up the machine file so that mpich2 knows how many processes to use in each node. We can make this file somewhere in the shared directory.
beo@ub0:~$ sudo nano /home/beo/machinefile
ub0:4
ub1:2
ub2:2
ub3:4
Where we write the node name and how many processes we want to use separated by a colon.
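It helps to know how many processes the machinefile adds up to before you pick a value for -n. Here is a rough Python 3 sketch that parses the host:count format and totals it. The function name is made up for illustration, and treating a bare host name as one process is an assumption on my part.

```python
def parse_machinefile(text):
    """Parse 'host:count' lines into a dict of host name to process count."""
    slots = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        host, _, count = line.partition(":")
        # Assumption: a host listed without ':count' gets one process.
        slots[host] = int(count) if count else 1
    return slots


# The example machinefile from this guide.
EXAMPLE = """\
ub0:4
ub1:2
ub2:2
ub3:4
"""

if __name__ == "__main__":
    slots = parse_machinefile(EXAMPLE)
    for host, count in slots.items():
        print(host, count)
    print("total processes:", sum(slots.values()))
```

For the example file the total is 12, one process per core on my machines.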
Great! Now everything is set up and should be working, theoretically. Try to test it with this hello world program from https://help.ubuntu.com/community/MpichCluster. Save the program in your home directory as "mpi_hello.c".
#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
    int myrank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    printf("Hello from processor %d of %d\n", myrank, nprocs);

    MPI_Finalize();
    return 0;
}
You need to compile it and then run it with...
beo@ub0:~$ mpicc mpi_hello.c -o mpi_hello
beo@ub0:~$ mpiexec -n 8 -f /home/beo/machinefile /home/beo/mpi_hello
Where -n is the flag for how many processes to start, and -f is the flag for the path to the machinefile.
Installing mpi4py
If you're like me and you really like using python, you might want to install mpi4py onto your cluster. This was easy enough for me. I just used pip to install the module.
- Don't use sudo apt-get to install mpi4py. Apparently that has given people problems when being used with mpich2.
beo@ub0:~$ pip install mpi4py --user
- Notice that I added a --user flag in my pip install. This is because if you do not do this, pip will install the module somewhere in the /usr/ folder, but we want it installed in the shared directory.
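To see whether mpi4py works across the nodes, a Python version of the hello world program is a reasonable first test. This is only a sketch. The file name mpi4py_hello.py and the greeting helper are made up for illustration. It uses MPI.COMM_WORLD with Get_rank and Get_size from mpi4py, and it prints a hint instead of crashing on a node where the module cannot be imported.

```python
import socket


def greeting(rank, size, host):
    """Format the line each process prints."""
    return "Hello from process {} of {} on {}".format(rank, size, host)


def main():
    try:
        # Imported here so a missing module gives a readable message per node.
        from mpi4py import MPI
    except ImportError:
        print("mpi4py is not importable on {}".format(socket.gethostname()))
        return
    comm = MPI.COMM_WORLD
    print(greeting(comm.Get_rank(), comm.Get_size(), socket.gethostname()))


if __name__ == "__main__":
    main()
```

Saved in the shared home directory, it should run the same way as the C program, for example with mpiexec -n 8 -f /home/beo/machinefile python /home/beo/mpi4py_hello.py.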
Troubleshooting Errors
During this process I had quite a few errors. I'll try to go through the things that happened to me and tell you how I fixed them.
Installing Ubuntu
- Got an error right off the bat. "Error loading cdrom".
- Moved usb to a new port and hit "retry" and it detected everything fine. Strange Error.
Installing mpich
- Like I said before, I had problems with using the apt-get install method to install mpich2.
- Installed manually. Use the guide I posted.
- http://www.mpich.org/static/downloads/3.1.3/mpich-3.1.3-installguide.pdf
- "cannot find hydraproxy file"
- This file is installed when you install mpich2. What I had to do is add its location to my path variable
- When all was said and done, my .bashrc file had these lines added to it
sudo nano ~/.bashrc
export PATH=/home/beo/mpich-install/bin:/home/beo/mpich-install/lib:$PATH
export LD_LIBRARY_PATH=/usr/lib:/home/beo/mpich-install/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/home/beo/.local/python2.7/site-packages:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
export HYDRA_HOST_FILE=/home/beo/machinefile
export PYTHONPATH=/home/beo/.local/python2.7/site-packages:$PYTHONPATH
export PYTHONPATH=/home/beo/mpich-install/lib:$PYTHONPATH
- The HYDRA_HOST_FILE variable didn't actually do what I was hoping it would do, so that line may be unnecessary.
- Some error about not being able to find libmpich.so files
sudo apt-get install libcr-dev
- I actually ran that on each node
- Problems with getting nodes to communicate
- This was a strange error, but when I was running python programs on the cluster and trying to get the nodes to exchange information, it just refused to work because it couldn't find libmpich.so.10
- For this I actually physically moved the file to somewhere on the path variable. Also, I couldn't find a file called libmpich.so.10, but I had one called libmpich.so.10.0.4. So here's what I did
- Move libmpich.so.10.0.4 from /usr/lib/x86_64-linux-gnu to /home/beo/mpich-install/lib
- Then create a symbolic link while in the mpich-install/lib directory
sudo ln -s libmpich.so.10.0.4 libmpich.so.10
Python Packages "not installed"
- Errors where, after installing a python package via pip install <package> --user, other nodes could not import the module.
- Saw that the nodes did not have read/write permissions in the shared python package folder. Changed permissions with
sudo chmod -R 755 /home/beo/.local/
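If you want to see which files are the problem before changing permissions on everything, here is a rough Python 3 sketch that walks the shared package folder and lists anything that users other than the owner cannot read. The function names are made up for illustration, and checking the "other" permission bits is my assumption about what the nodes were missing.

```python
import os
import stat


def readable_by_others(mode):
    """True if the 'other' bits allow reading (and entering, for directories)."""
    needed = stat.S_IROTH
    if stat.S_ISDIR(mode):
        needed |= stat.S_IXOTH  # directories also need the execute bit
    return (mode & needed) == needed


def find_unreadable(root):
    """Return paths under root that fail the readable_by_others check."""
    problems = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for path in [dirpath] + [os.path.join(dirpath, name) for name in filenames]:
            try:
                mode = os.stat(path).st_mode
            except OSError:
                problems.append(path)  # could not even stat it
                continue
            if not readable_by_others(mode):
                problems.append(path)
    return problems


if __name__ == "__main__":
    for path in find_unreadable(os.path.expanduser("~/.local")):
        print(path)
```

An empty result means every file and folder under ~/.local already has the read bits that mode 755 would give it.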