Running MapReduce over OpenStack Swift

I recently got a chance to experiment with MapReduce over several open-source object stores. OpenStack Swift is definitely one of the best-known object storage services. However, I found it non-trivial to set up for evaluation, so I put some notes below.

This article is only for os=Ubuntu 14.04 and swift=liberty. I noticed that newer versions of Swift come with much better documentation, which is much easier to follow. The appendix contains my early trials of installing Swift from source.

Read the following

Make sure you have some sense of concepts including user, role, tenant, project, endpoint, proxy server, account server, object server, rings, etc.

To my understanding, they are:

  • user: a real user, or a service
  • role: a role for users, corresponding to a set of permissions
  • tenant = project: a set of users
  • endpoint: the entry-point URLs of OpenStack services
  • proxy server: handles user requests in Swift
  • account server: handles user accounts in Swift; an account also serves as a domain or primary namespace
  • object server: handles object storage in Swift
  • rings: the consistent-hashing structures Swift uses to map data to storage locations
  • keystone: the authentication service in OpenStack. Key concepts on keystone can be found here
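To make the ring concept more concrete, here is a toy sketch of consistent hashing in the spirit of Swift's rings: hash the object path, keep the top part_power bits as the partition, and map the partition to a device. This is purely illustrative; Swift's real ring code (in swift/common/ring) also handles zones, weights, and replica placement:

```python
import hashlib

# Illustrative values only; see the single-node setup later in this article.
PART_POWER = 7                  # 2**7 = 128 partitions
DEVICES = ["partition1"]        # one drive in this toy cluster

def partition_for(path, part_power=PART_POWER):
    """Hash an object path and keep the top `part_power` bits as the partition."""
    digest = hashlib.md5(path.encode()).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - part_power)

def device_for(path):
    """Map a partition to a device (trivial here, since there is one device)."""
    return DEVICES[partition_for(path) % len(DEVICES)]

print(partition_for("/AUTH_admin/mycontainer/some_file"))
print(device_for("/AUTH_admin/mycontainer/some_file"))
```

Because every node computes the same hash, any proxy can locate an object without consulting a central lookup table.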

Setup Keystone and Swift

Install dependencies

  • Install MariaDB (or MySQL) and memcached following page

  • Install keystone following page1 and page2. Note that if you want to run MapReduce over Swift, you cannot use the TempAuth approach. Read this page for more details.

  • Install swift following page1, page2, and page3. You can start the swift service with

swift-init all start

Setup Hadoop

  • Setup Hadoop (version >= 2.3) and configure it following page1 and page2.
  • Make sure SwiftNativeFileSystem is in the classpath; read this page for any problems you run into.
  • Configure etc/hadoop/core-site.xml, adding the following contents:
<configuration>
<property>
<name>fs.swift.impl</name>
<value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
</property>
<property>
<name>fs.swift.service.SwiftTest.auth.url</name> <!-- SwiftTest is the name of the service -->
<value>http://127.0.0.1:5000/v2.0/tokens</value> <!-- the ip:port should point to keystone; make sure "/tokens" is there -->
</property>
<property>
<name>fs.swift.service.SwiftTest.auth.endpoint.prefix</name>
<value>/AUTH_</value>
</property>
<property>
<name>fs.swift.service.SwiftTest.http.port</name>
<value>8080</value> <!-- the same as in proxy-server.conf -->
</property>
<property>
<name>fs.swift.service.SwiftTest.region</name>
<value>RegionOne</value>
</property>
<property>
<name>fs.swift.service.SwiftTest.public</name>
<value>true</value>
</property>
<property>
<name>fs.swift.service.SwiftTest.tenant</name>
<value>admin</value> <!-- name of the project, not the role! -->
</property>
<property>
<name>fs.swift.service.SwiftTest.username</name>
<value>admin</value> <!-- user name -->
</property>
<property>
<name>fs.swift.service.SwiftTest.password</name>
<value>adminpassword</value> <!-- password -->
</property>
</configuration>

Verify all setup

  • create a container (swift post nameofcontainer). Important: the name should consist only of letters; no ‘_’ is allowed.
  • upload a file (swift upload nameofcontainer nameoffile).
  • hdfs dfs -ls swift://nameofcontainer.SwiftTest/. This should show the files you uploaded previously.
  • hadoop distcp nameoffile swift://mycontainer.SwiftTest/nameoffile
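The swift:// URIs used above follow the pattern swift://<container>.<service>/<path>, where <service> must match the name used in the fs.swift.service.<name>.* keys of core-site.xml (SwiftTest here). A tiny, hypothetical helper to build them:

```python
def swift_uri(container, service, path=""):
    """Build a swift:// URI for Hadoop's SwiftNativeFileSystem.

    `service` must match the fs.swift.service.<name>.* keys in core-site.xml.
    """
    return f"swift://{container}.{service}/{path.lstrip('/')}"

print(swift_uri("mycontainer", "SwiftTest", "nameoffile"))
# swift://mycontainer.SwiftTest/nameoffile
```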

Appendix: Install swift from source

Dependencies

From the swift book

apt-get install git curl gcc memcached rsync sqlite3 xfsprogs \
git-core libffi-dev python-setuptools
apt-get install python-coverage python-dev python-nose \
python-simplejson python-xattr python-eventlet \
python-greenlet python-pastedeploy python-netifaces \
python-pip python-dnspython python-mock

Install the latest liberasurecode (according to some posts, this may not be necessary)

sudo apt-get install build-essential autoconf automake libtool
cd /opt/
git clone https://github.com/openstack/liberasurecode.git
cd liberasurecode
./autogen.sh
./configure
make
make test
sudo make install

Install the Swift CLI

cd /opt
git clone https://github.com/openstack/python-swiftclient.git
cd /opt/python-swiftclient; sudo pip install -r requirements.txt;
python setup.py install; cd ..

I ran into a problem with an invalid format in requirements.txt. You may need to do some editing in that case.

Install Swift

cd /opt
git clone https://github.com/openstack/swift.git
cd /opt/swift ; sudo python setup.py install; cd ..

The above is from the swift book, but I realized I still needed to run “sudo pip install -r requirements.txt” before the setup.

Another issue I found is that pip takes forever to install cryptography from requirements.txt. I searched Google, and someone said it took 20 minutes to build. For me, after 20 minutes, it was still hanging.

I tried installing all of its dependencies manually, everything below except cryptography itself. After that, cryptography installed successfully:

Requirement already satisfied (use --upgrade to upgrade): cryptography in /usr/local/lib/python2.7/dist-packages
Requirement already satisfied (use --upgrade to upgrade): idna>=2.0 in /usr/local/lib/python2.7/dist-packages (from cryptography)
Requirement already satisfied (use --upgrade to upgrade): pyasn1>=0.1.8 in /usr/local/lib/python2.7/dist-packages (from cryptography)
Requirement already satisfied (use --upgrade to upgrade): six>=1.4.1 in /usr/local/lib/python2.7/dist-packages (from cryptography)
Requirement already satisfied (use --upgrade to upgrade): setuptools>=11.3 in /usr/local/lib/python2.7/dist-packages (from cryptography)
Requirement already satisfied (use --upgrade to upgrade): enum34 in /usr/local/lib/python2.7/dist-packages (from cryptography)
Requirement already satisfied (use --upgrade to upgrade): ipaddress in /usr/local/lib/python2.7/dist-packages (from cryptography)
Requirement already satisfied (use --upgrade to upgrade): cffi>=1.4.1 in /usr/local/lib/python2.7/dist-packages (from cryptography)
Requirement already satisfied (use --upgrade to upgrade): pycparser in /usr/local/lib/python2.7/dist-packages (from cffi>=1.4.1->cryptography)

Copy in Swift configuration files

mkdir -p /etc/swift
cd /opt/swift/etc
cp account-server.conf-sample /etc/swift/account-server.conf
cp container-server.conf-sample /etc/swift/container-server.conf
cp object-server.conf-sample /etc/swift/object-server.conf
cp proxy-server.conf-sample /etc/swift/proxy-server.conf
cp drive-audit.conf-sample /etc/swift/drive-audit.conf
cp swift.conf-sample /etc/swift/swift.conf

chown -R swift:swift /etc/swift

In this setup, we rely on tempauth for authentication.

Configure swift.conf

Setting the hash path prefix and suffix

These two random strings are used to determine the hash results of the partitions on the rings. They are supposed to be kept confidential.
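Roughly speaking, Swift mixes these two strings into the MD5 hash of every account/container/object path (see swift.common.utils.hash_path), which is why changing them after data is stored makes that data unreachable. A simplified sketch, with placeholder secrets:

```python
import hashlib

# Placeholder secrets for illustration only; use your own random values.
SWIFT_HASH_PATH_PREFIX = b"random_32_length_string2"
SWIFT_HASH_PATH_SUFFIX = b"random_32_length_string1"

def hash_path(account, container=None, obj=None):
    """Simplified version of Swift's path hashing with prefix/suffix mixed in."""
    parts = [p for p in (account, container, obj) if p]
    raw = SWIFT_HASH_PATH_PREFIX + b"/" + "/".join(parts).encode() + SWIFT_HASH_PATH_SUFFIX
    return hashlib.md5(raw).hexdigest()

print(hash_path("AUTH_admin", "mycontainer"))
```

An attacker who knew the prefix and suffix could predict where data lands, hence the confidentiality requirement.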

swift_hash_path_suffix = random_32_length_string1
swift_hash_path_prefix = random_32_length_string2

You can use

head -c 32 /dev/random | base64

to generate one random string (32 random bytes, base64-encoded).
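Note that base64-encoding 32 random bytes actually yields a 44-character string; that is fine, since the values only need to be random and kept secret. An equivalent in Python:

```python
import base64
import os

# Generate a random secret suitable for swift_hash_path_prefix/suffix.
secret = base64.b64encode(os.urandom(32)).decode()
print(secret)
```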

The storage policy

You can customize the storage policy as per the swift book, but it is optional.

Build the rings

The 3 parameters are part_power, replicas, and min_part_hours:

  • part_power: determines the number of partitions in the storage cluster (2^part_power). The typical setting is log_2(100 × maximum number of disks); for this setup it is log_2(100 × 1), which is close to 7.
  • replicas: in this setup there is only one drive, so only 1 replica is possible.
  • min_part_hours: the default is 24 hours. Tune it as per the swift book or the official documentation.
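The rule of thumb for part_power can be computed directly; for a single drive, log2(100 × 1) ≈ 6.64, which rounds up to the 7 used in the commands below. A hypothetical helper:

```python
import math

def part_power(max_disks):
    """Rule of thumb: log2(100 * maximum number of disks), rounded up."""
    return math.ceil(math.log2(100 * max_disks))

print(part_power(1))    # 7, as used in this single-drive setup
print(part_power(50))
```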
cd /etc/swift
swift-ring-builder account.builder create 7 1 1
swift-ring-builder container.builder create 7 1 1
swift-ring-builder object.builder create 7 1 1

Prepare the drives

You can use a whole disk or chop off space from an existing disk to provide storage for Swift. Follow step 2 of this article, copied here:

Attach a disk which would be used for storage or chop off some disk space from the existing disk.
Using additional disks:
Most likely this is done when there is large amount of data to be stored. XFS is the recommended filesystem and is known to work well with Swift. If the additional disk is attached as /dev/sdb then following will do the trick:

# fdisk /dev/sdb
# mkfs.xfs /dev/sdb1
# echo "/dev/sdb1 /srv/node/partition1 xfs noatime,nodiratime,nobarrier,logbufs=8 0 0" >> /etc/fstab
# mkdir -p /srv/node/partition1
# mount /srv/node/partition1

Chopping off disk space from the existing disk:
We can chop off disk from existing disks as well. This is usually done for smaller installations or for “proof-of-concept” stage. We can use XFS like before or we can use ext4 as well.

# truncate --size=2G /tmp/swiftstorage
# DEVICE=$(losetup --show -f /tmp/swiftstorage)
# mkfs.ext4 $DEVICE
# mkdir -p /srv/node/partition1
# mount $DEVICE /srv/node/partition1 -t ext4 -o noatime,nodiratime,nobarrier,user_xattr

In either case, you need to

chown -R swift:swift /srv/node/partition1

Also, you need the mount to happen automatically after a system reboot. So put the mount command into a script, e.g., /opt/swift/bin/mount_devices. Then add a file start_swift.conf under /etc/init/ with the content

description "mount swift drives"
start on runlevel [234]
stop on runlevel [0156]
exec /opt/swift/bin/mount_devices

Make sure the script is executable

chmod +x /opt/swift/bin/mount_devices

Add the drives to the rings

The single parameter 100 is the weight for load balancing. Note that partition1 in the command refers to the path /srv/node/partition1; if you change the path, you need to change both places. It is not the name of a device such as /dev/sda. Also double-check the IPs and ports: they are the addresses of the account, container, and object processes. I was blocked by the port for quite a while, simply because the default ports in the conf files are different from the ones used below (not sure why…).

swift-ring-builder account.builder add r1z1-127.0.0.1:6002/partition1 100
swift-ring-builder container.builder add r1z1-127.0.0.1:6001/partition1 100
swift-ring-builder object.builder add r1z1-127.0.0.1:6000/partition1 100
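The add argument packs several fields into one string of the form r<region>z<zone>-<ip>:<port>/<device>. A small, hypothetical parser makes the pieces explicit:

```python
import re

def parse_device(spec):
    """Split an 'r<region>z<zone>-<ip>:<port>/<device>' ring device spec."""
    m = re.match(r"^r(\d+)z(\d+)-([\d.]+):(\d+)/(\w+)$", spec)
    if not m:
        raise ValueError("bad device spec: " + spec)
    region, zone, ip, port, device = m.groups()
    return {"region": int(region), "zone": int(zone),
            "ip": ip, "port": int(port), "device": device}

print(parse_device("r1z1-127.0.0.1:6002/partition1"))
```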

Rebalance the rings to actually create them.

swift-ring-builder account.builder rebalance
swift-ring-builder container.builder rebalance
swift-ring-builder object.builder rebalance

You will find the *.ring.gz files and a backup folder.

Configure the logs

Put a file named 0-swift.conf under the /etc/rsyslog.d directory, containing only one line:

local0.* /var/log/swift/all.log

And then create the directory and set the right permissions.

mkdir /var/log/swift
chown -R syslog.adm /var/log/swift
chmod -R g+w /var/log/swift

service rsyslog restart

Start the proxy server

swift-init proxy start

If you see warning messages as discussed in this post, you can follow the solutions there, copied here:

curl -o libisal2_2.15.0-3~bpo14.04+1_amd64.deb http://mitaka-trusty.pkgs.mirantis.com/debian/pool/trusty-mitaka-backports/main/l/libisal/libisal2_2.15.0-3~bpo14.04+1_amd64.deb 
sudo dpkg -i libisal2_2.15.0-3~bpo14.04+1_amd64.deb
git clone https://bitbucket.org/kmgreen2/pyeclib.git
cd pyeclib; sudo python setup.py install
sudo apt-get remove python-pyeclib

Configure tempauth in proxy-server.conf

We rely on tempauth for authentication in this setup. If you are using keystone, follow this post or this one.

First, start the memcache:

service memcached start

Add your users to proxy-server.conf under the tempauth block.

[filter:tempauth]
use = egg:swift#tempauth
# You can override the default log routing for this filter here:
...
# <account> is from the user_<account>_<user> name.
# Here are example entries, required for running the tests:
user_admin_admin = admin .admin .reseller_admin
user_test_tester = testing .admin
user_test2_tester2 = testing2 .admin
user_test_tester3 = testing3

The syntax is

user_$SWIFTACCOUNT_$SWIFTUSER = $KEY [group] [group] [...] [storage_url]

which means you can optionally set a storage URL for each user.
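A small, illustrative sketch of how such a line decomposes (tempauth's real parsing, in swift/common/middleware/tempauth.py, also handles the optional storage URL and special groups):

```python
def parse_tempauth(option, value):
    """Parse 'user_<account>_<user> = <key> [group ...]' into its parts.

    Simplified: assumes the account name contains no underscores and
    ignores the optional trailing storage_url.
    """
    _, account, user = option.split("_", 2)
    key, *groups = value.split()
    return {"account": account, "user": user, "key": key, "groups": groups}

print(parse_tempauth("user_admin_admin", "admin .admin .reseller_admin"))
```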

Then, modify these two options:

allow_account_management = true
account_autocreate = true

Start the servers

Create the cache directory

mkdir -p /var/swift/recon
chown -R swift:swift /var/swift/recon

Start the services after making sure the conf files are right, especially the IP and port. The IP is typically set to 0.0.0.0, and the ports should be the same as the ones used when adding the drives to the rings.

swift-init account start
swift-init container start
swift-init object start

Then, restart the proxy server

swift-init proxy restart

Verify the setup

Send in the authentication request (note that I’m using the admin user under the admin account defined in proxy-server.conf, with key admin):

curl -v -H 'X-Auth-User: admin:admin' -H 'X-Auth-Key: admin' http://localhost:8080/auth/v1.0/

The above will get you the X-Auth-Token if everything is right. For instance,

* Hostname was NOT found in DNS cache
* Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /auth/v1.0/ HTTP/1.1
> User-Agent: curl/7.35.0
> Host: localhost:8080
> Accept: */*
> X-Auth-User: admin:admin
> X-Auth-Key: admin
>
< HTTP/1.1 200 OK
< X-Storage-Url: http://localhost:8080/v1/AUTH_admin
< X-Auth-Token-Expires: 18098
< X-Auth-Token: AUTH_tka1a0d192e57746839c1749f238ba5419
< Content-Type: text/html; charset=UTF-8
< X-Storage-Token: AUTH_tka1a0d192e57746839c1749f238ba5419
< Content-Length: 0
< X-Trans-Id: txb9682bc6bab448659fd50-005881a50f
< X-Openstack-Request-Id: txb9682bc6bab448659fd50-005881a50f
< Date: Fri, 20 Jan 2017 05:50:07 GMT
<
* Connection #0 to host localhost left intact

Next, use the token to access the account by

curl -v -H 'X-Storage-Token: AUTH_tka1a0d192e57746839c1749f238ba5419' http://127.0.0.1:8080/v1/AUTH_admin/

The admin in the URL AUTH_admin is the account name (the <account> part of user_<account>_<user>), which happens to match the username here. You should get something like the following.

* Hostname was NOT found in DNS cache
* Trying 127.0.0.1...
* Connected to 127.0.0.1 (127.0.0.1) port 8080 (#0)
> GET /v1/AUTH_admin HTTP/1.1
> User-Agent: curl/7.35.0
> Host: 127.0.0.1:8080
> Accept: */*
> X-Storage-Token: AUTH_tka1a0d192e57746839c1749f238ba5419
>
< HTTP/1.1 200 OK
< Content-Length: 12
< X-Account-Object-Count: 0
< X-Account-Storage-Policy-Policy-0-Bytes-Used: 0
< X-Account-Storage-Policy-Policy-0-Container-Count: 1
< X-Timestamp: 1484824088.30547
< X-Account-Storage-Policy-Policy-0-Object-Count: 0
< X-Account-Bytes-Used: 0
< X-Account-Container-Count: 1
< Content-Type: text/plain; charset=utf-8
< Accept-Ranges: bytes
< X-Trans-Id: txf57c870a9ba146d9947d0-005881a5cd
< X-Openstack-Request-Id: txf57c870a9ba146d9947d0-005881a5cd
< Date: Fri, 20 Jan 2017 05:53:17 GMT
<
mycontainer
* Connection #0 to host 127.0.0.1 left intact

I was blocked here for quite a while simply because the port in proxy-server.conf was 6202, not 6002. Be careful!

If something goes wrong, check the log file (at /var/log/swift/all.log) for the error message.

You can further verify the setup by creating a container and uploading a file with

curl -v -H 'X-Storage-Token: AUTH_tka1a0d192e57746839c1749f238ba5419' -X PUT http://127.0.0.1:8080/v1/AUTH_admin/mycontainer
swift -A http://127.0.0.1:8080/auth/v1.0/ -U admin:admin -K admin upload mycontainer some_file

In the last command line, we use the swift client instead of curl. There are other subcommands besides upload: delete, download, list, post, and stat. To avoid putting the URL and user info in every command, you can put the following into .bashrc

export ST_AUTH=http://localhost:8080/auth/v1.0
export ST_USER=admin:admin
export ST_KEY=admin

Start the consistency processes

Swift relies on a few other processes for consistency purposes.

Edit (or create) the /etc/default/rsync file and add the following line:

RSYNC_ENABLE=true

Then, edit (or create) the /etc/rsyncd.conf file, adding the following:

uid = swift
gid = swift
log file = /var/log/rsyncd.log
pid file = /var/run/rsyncd.pid
[account]
max connections = 25
path = /srv/node/
read only = false
lock file = /var/lock/account.lock
[container]
max connections = 25
path = /srv/node/
read only = false
lock file = /var/lock/container.lock
[object]
max connections = 25
path = /srv/node/
read only = false
lock file = /var/lock/object.lock

Start the rsync service

service rsync start

Now, you can start the processes by

swift-init account-replicator start
swift-init container-replicator start
swift-init object-replicator start

One useful way of controlling these processes is by

swift-init all start/stop/restart

Setup on multiple nodes

The plan of deployment depends on the hardware. One should read part IV of the swift book first. I found this article very well written; from it, one can learn how to deploy a small cluster.

There are a few points one should keep in mind.

  • Copy the swift.conf file to all nodes;
  • Add the drives across all nodes to the rings;
  • Copy the rings to all nodes;

References

  1. Install a Stand-alone, Multi-node OpenStack Swift Cluster with VirtualBox or VMware Fusion and Vagrant (https://thornelabs.net/2014/07/14/install-a-stand-alone-multi-node-openstack-swift-cluster-with-virtualbox-or-vmware-fusion-and-vagrant.html)
  2. OpenStack 101: How to Setup OpenStack Swift (http://blog.adityapatawari.com/2014/01/openstack-101-how-to-setup-openstack_12.html)
  3. OpenStack Swift (the swift book)