Download
The Dedoop prototype can be downloaded as a Web Archive. Note that Dedoop is built and tested against Hadoop 0.20.2. We recommend using the Firefox Web browser.
Installation instructions:
The Dedoop Web application can be deployed on any servlet container conforming to the Servlet 2.5 specification, running on any operating system. We limit our explanation to Apache Tomcat 6 and assume a Debian-based OS.
Install and set up the servlet container:
- Download the Dedoop.war file
- Install Tomcat and the Tomcat manager:
sudo aptitude install tomcat6 tomcat6-admin
- Insert <role rolename="dedoop"/> into /etc/tomcat6/tomcat-users.xml
- Add <user username="username" password="pw" roles="manager-gui,dedoop"/> to /etc/tomcat6/tomcat-users.xml in order to grant the user "username" access to the Tomcat manager and to the Dedoop Web application (see the example below)
- Increase the values of <max-file-size> and <max-request-size> in /usr/share/tomcat6-admin/manager/WEB-INF/web.xml to allow the upload of the rather large Dedoop Web Archive (~60 MB)
- Restart Tomcat:
sudo /etc/init.d/tomcat6 restart
- The Web application can now be deployed with the Tomcat manager ("WAR file to deploy")
- Alternatively, delete the elements <security-constraint>, <security-role>, and <login-config> from WEB-INF/web.xml and copy the Web archive to /var/lib/tomcat6/webapps
- Access Dedoop via http://localhost:8080/Dedoop (allow cookies)
Configure default Hadoop cluster:
- Set fs.default.name and mapred.job.tracker in Dedoop.war/WEB-INF/classes/de/uni_leipzig/dbs/cloud_matching/server/cluster_connect/hdfs/ClusterConnectServiceImpl.properties (see the example below)
- Redeploy the Web application
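A minimal sketch of the two entries, assuming the Namenode listens on port 8020 and the Jobtracker on port 8021 (host names and ports are placeholders for your own cluster):
fs.default.name=hdfs://namenode-host:8020
mapred.job.tracker=jobtracker-host:8021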
Configure default S3 credentials:
- Set aws.default.access.key and aws.default.secret.key in WEB-INF/classes/de/uni_leipzig/dbs/cloud_matching/server/cluster_connect/hdfs/ec2/LaunchEC2ClusterServiceImpl.properties (see the example below)
- Redeploy the Web application
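For example (both values are placeholders for your own AWS credentials):
aws.default.access.key=AKIAXXXXXXXXXXXXXXXX
aws.default.secret.key=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX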
The following steps are only relevant if you want to launch or connect to Hadoop clusters running on Amazon EC2.
Allow SOCKS proxy server management:
A major simplification for users is that Dedoop fully supports Hadoop clusters running on Amazon EC2. Dedoop automatically spawns and terminates SOCKS proxy servers (similar to ssh -D port username@host) on the machine it is hosted on to pass connections to Hadoop nodes on EC2. This is required to invoke HDFS commands and to submit MapReduce jobs from outside EC2, mainly due to EC2’s use of internal IPs.
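For illustration, the proxy that Dedoop spawns corresponds roughly to running the following command manually (key file, local port, and host are placeholders):
ssh -i /path/to/KEY_PAIR_NAME.pem -D 6666 -N ubuntu@ec2-176-34-79-226.eu-west-1.compute.amazonaws.com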
- Add permission java.security.AllPermission; to /etc/tomcat6/policy.d/04webapps.policy in order to allow Tomcat to establish ssh connections
- Restart Tomcat:
sudo /etc/init.d/tomcat6 restart
- Connect to running Hadoop clusters using the public DNS of the Namenode and the Jobtracker (e.g. hdfs://ec2-176-34-79-226.eu-west-1.compute.amazonaws.com:8020 and ec2-176-34-79-226.eu-west-1.compute.amazonaws.com:8021)
- Note that Dedoop asks you for a private key file. Assume that the EC2 instances were launched via the AWS EC2 Dashboard with a given SSH public/private key pair. The instances are automatically configured to allow password-less ssh access for this private key. Dedoop requires access to this private key in order to establish an ssh tunnel and to set up a SOCKS proxy server that forwards proxy requests to the Jobtracker node.
- The private key file must be located on the server Dedoop is deployed on and must be readable by the tomcat6 user (see the example below)
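One way to achieve this, assuming the key file has already been copied to a directory of your choice (the path below is a placeholder):
sudo chown tomcat6:tomcat6 /path/to/KEY_PAIR_NAME.pem
sudo chmod 600 /path/to/KEY_PAIR_NAME.pem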
Launch and configure Hadoop clusters on EC2 via Dedoop:
Dedoop expedites the recurring and laborious task of launching a set of virtual machines and spawning a new Hadoop cluster on them via the "Launch EC2 cluster" tab. The user can specify a Linux-based Amazon Machine Image (AMI) stored in S3. The AMI must contain a distribution of Hadoop, a Java Virtual Machine, the command line utility xmlstarlet, and an ssh server with a secret private key. The ssh server must be configured to permit ssh access for the contained private key. On the one hand, this allows password-less ssh connections between VMs created from the AMI; on the other hand, it enables Dedoop to modify the Hadoop configuration files (e.g., IP addresses, map/reduce task capacity, JVM child args) by submitting xmlstarlet commands via ssh (an example command is sketched after the step list below). The following steps lead you through the creation of such an AMI:
- Launch new instance with a base AMI (e.g. Ubuntu 12.04 Server) via AWS’s EC2 Dashboard using the public/private key pair that you want to utilize later to launch Hadoop clusters via Dedoop
- Connect to the instance:
ssh -i PRIVATE_KEY_FILE -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null ubuntu@publicdns
- Install xmlstarlet and the EC2 AMI tools:
sudo aptitude install xmlstarlet ec2-ami-tools
- Increase maximum number of open files for the user "ubuntu"
echo "ubuntu hard nofile 16384" | sudo tee -a /etc/security/limits.conf
echo "ubuntu soft nofile 16384" | sudo tee -a /etc/security/limits.conf
echo "session required pam_limits.so" | sudo tee -a /etc/pam.d/common-session
- Download and install Oracle Java SE6 JDK
- Download and unpack Hadoop 0.20.2
- Set global Hadoop params (e.g. Log dir, Tmp dir, etc.)
- Upload your X.509 Certificate (pk-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.pem and cert-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.pem) to ~/.ssh
- Upload the private key of the utilized key pair (KEY_PAIR_NAME.pem) to ~/.ssh and rename it to ~/.ssh/id_rsa
- Create a file ~/.ssh/config containing:
host 10.*.*.*
user ubuntu
StrictHostKeyChecking no
UserKnownHostsFile /dev/null
CheckHostIP no
IdentityFile ~/.ssh/id_rsa
TCPKeepAlive yes
ServerAliveInterval 60
- Bundle the volume, upload the bundle, and register the AMI:
cd ~ && sudo ec2-bundle-vol -k .ssh/pk-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.pem -c .ssh/cert-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.pem -u MY_AWS_ACCOUNT_NUMBER -a -r x86_64 -i .ssh/id_rsa,.ssh/config
ec2-upload-bundle -b MY_BUCKET/AMI_FOLDER -a MY_ACCESS_KEY -s MY_SECRET_KEY -m /tmp/image.manifest.xml
ec2-register MY_BUCKET/AMI_FOLDER/image.manifest.xml -K .ssh/pk-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.pem -C .ssh/cert-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.pem
- The AWS account number MY_AWS_ACCOUNT_NUMBER is not the AWS access key. It can be found in the Account Activity area and has the form 9999-9999-9999. Leave out the hyphens!
- Ensure that MY_BUCKET/AMI_FOLDER is not publicly readable, since it contains your private key
- The AMI should then be visible in the EC2 Dashboard
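As noted before the step list, Dedoop adjusts the Hadoop configuration on the launched VMs by submitting xmlstarlet commands via ssh. A hypothetical example of such a command, assuming Hadoop is unpacked under ~/hadoop-0.20.2 (the property name, value, and path are placeholders), is:
xmlstarlet ed -L -u "/configuration/property[name='fs.default.name']/value" -v "hdfs://10.0.0.1:8020" ~/hadoop-0.20.2/conf/core-site.xml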
Note that you have to specify at least one security group in order to launch EC2 instances via the "Launch EC2 cluster" tab (the default security group is named "default"). Security groups can be defined via the AWS Management Console and determine whether a network port is open or blocked on your instances. Ensure that port 22/TCP is open to allow ssh access. Furthermore, open the TCP ports for the Namenode, the Jobtracker, the Datanode, and the Tasktracker processes/web interfaces. In general, the port range 50000-50100/TCP should be open (Hadoop default ports) along with the ports from fs.default.name and mapred.job.tracker.
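Assuming the legacy EC2 API tools are available, inbound rules of roughly the following form could be added to a security group (the group name, ports, and source ranges are placeholders; the exact flags may vary between tool versions, and the source ranges should be restricted to what you actually need):
ec2-authorize default -P tcp -p 22 -s 0.0.0.0/0
ec2-authorize default -P tcp -p 50000-50100 -s 0.0.0.0/0
ec2-authorize default -P tcp -p 8020 -s 0.0.0.0/0
ec2-authorize default -P tcp -p 8021 -s 0.0.0.0/0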
Configure (default) EC2 launch options:
- All properties (as well as their default values) that are shown on the "Launch EC2 cluster" tab can be altered by editing WEB-INF/classes/de/uni_leipzig/dbs/cloud_matching/server/cluster_connect/hdfs/ec2/LaunchEC2ClusterServiceImpl.properties (a sketch follows after this list)
- The most important properties are:
aws.default.access.key
aws.default.secret.key
ec2.default.ami
ec2.default.key.name
ec2.default.private.key.file
ec2.default.security.groups
ec2.default.namenode.port
ec2.default.dfs.http.address
ec2.default.jobtracker.port
ec2.default.mapred.job.tracker.http.address
- Redeploy the Web application
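A minimal sketch of such a properties file with placeholder values; every value must be replaced with your own settings, and the value format of the two http.address properties is an assumption:
aws.default.access.key=AKIAXXXXXXXXXXXXXXXX
aws.default.secret.key=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
ec2.default.ami=ami-xxxxxxxx
ec2.default.key.name=KEY_PAIR_NAME
ec2.default.private.key.file=/path/to/KEY_PAIR_NAME.pem
ec2.default.security.groups=default
ec2.default.namenode.port=8020
ec2.default.dfs.http.address=50070
ec2.default.jobtracker.port=8021
ec2.default.mapred.job.tracker.http.address=50030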
Supplementary Notes:
- Dedoop exclusively supports well-formed CSV files as MapReduce input and output (see the example record below). We recommend using opencsv.
- Separator char: ,
- Quote char: "
- Escape char: \
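For example, a well-formed record under these settings might look as follows (all values are made up); an embedded quote character inside a quoted field is escaped with the escape char:
1,"Smith, John","he said \"hello\""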
Changelog:
- Dedoop 0.2 (15.07.2013)
- Authentication: support of the hadoop.job.ugi property
- Amazon S3 file manager
- HDFS file manager: Compression/Decompression
- Copy data within HDFS (Ctrl+Drag&Drop) or between S3 and HDFS (using DistCp)
- Reduction of the submission overhead by utilizing a shared HDFS lib dir in conjunction with Hadoop’s Distributed Cache in order to reuse 3rd party libraries required by multiple MapReduce jobs
- Transitive closure computation and iterative MapReduce job execution
- Optimization of the BlockSplit algorithm
- Bugfix R-S Join (Block-Split)
- Dedoop 0.3 (12.09.2013)
- Bugfix: HDFS Connect (Wrong FS: hdfs://namenode/, expected: hdfs://namenode:port)
- Dedoop 0.4 (22.04.2014)
- Bugfix: BlockCount (de.uni_leipzig.dbs.cloud_matching.map_reduce.io.TextIntPair cannot be cast to de.uni_leipzig.dbs.cloud_matching.map_reduce.io.TextPair)
- Bugfix: Shared Libs directory (Target /user/share/libs/ already exists)
- Optimization of the CC-MR algorithm (see CC-MR-MEM extension)
- Added support for “cloning” a configured Entity Matching workflow
Contact:
- Prof. Dr. Erhard Rahm