HiBench 7.0 Build and Run: A Step-by-Step Guide
This blog is a step-by-step guide to building and running HiBench 7.0 (https://github.com/intel-hadoop/HiBench) on a Big Data cluster.
The Big Data cluster used here is deployed through Bigtop 1.3.0. The components I installed are:
- Hadoop: 2.8.4
- Spark: 2.2.1
For how I deployed Bigtop 1.3.0 on multiple physical nodes, see my other blog post at https://collaborate.linaro.org/pages/viewpage.action?pageId=115311164
1. HiBench Build
This section installs HiBench 7.0 on the Bigtop master node.
1.1. Install Maven
Bigtop requires Maven 3.5.x. Refer to the Apache Maven site for how to download and install it:
# cd /usr/local/src
# tar -xf apache-maven-3.5.4-bin.tar.gz
# mv apache-maven-3.5.4/ apache-maven/
# cd /etc/profile.d/
# vim maven.sh
# Apache Maven Environment Variables
# MAVEN_HOME for Maven 1 - M2_HOME for Maven 2
export M2_HOME=/usr/local/src/apache-maven
export PATH=${M2_HOME}/bin:${PATH}
# chmod +x maven.sh
# source /etc/profile.d/maven.sh
# mvn --version
Apache Maven 3.5.4 ...
1.2. Build HiBench-7.0:
$ git clone https://github.com/intel-hadoop/HiBench
$ cd HiBench
$ git checkout -b working-hibench-7.0 HiBench-7.0
$ sudo yum -y install bc vim
$ mvn -Dspark=2.2 -Dscala=2.11 clean package
...
[INFO] BUILD SUCCESS
Note: the Spark version (2.2) comes from bigtop.bom; the -Dspark/-Dscala flags follow the HiBench build documentation.
2. HiBench Benchmarking
2.1. Configure hadoop.conf and spark.conf
$ cd HiBench
2.1.1. hadoop.conf
$ cp conf/hadoop.conf.template conf/hadoop.conf
$ vi conf/hadoop.conf
key | description | value |
hibench.hadoop.home | The Hadoop installation location | /usr/lib/hadoop |
hibench.hadoop.executable | The path of the hadoop executable. For Apache Hadoop, it is /YOUR/HADOOP/HOME/bin/hadoop | ${hibench.hadoop.home}/bin/hadoop |
hibench.hadoop.configure.dir | Hadoop configuration directory. For Apache Hadoop, it is /YOUR/HADOOP/HOME/etc/hadoop | ${hibench.hadoop.home}/etc/hadoop |
hibench.hdfs.master | The root HDFS path to store HiBench data, e.g. hdfs://localhost:8020/user/username | hdfs://d05-001.bigtop.deploy:8020 |
hibench.hadoop.release | Hadoop release provider. Supported values: apache, cdh5, hdp | apache |
2.1.2. spark.conf
$ cp conf/spark.conf.template conf/spark.conf
$ vi conf/spark.conf
hibench.spark.home /usr/lib/spark
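For reference, a minimal conf/spark.conf for a YARN-backed Bigtop cluster might look like the sketch below. Only hibench.spark.home comes from this post; the hibench.spark.master key and its yarn-client value are assumptions taken from the HiBench template, so adjust them to your cluster.

```
# Spark installation location on the Bigtop deployment (from this post)
hibench.spark.home      /usr/lib/spark
# Run mode; yarn-client is the template's value for a YARN cluster (assumed)
hibench.spark.master    yarn-client
```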
2.2. Additional Steps to Fix Known Issues
Please work through each of the following subsections and apply its fix. Otherwise, problems may show up while running the benchmarks.
2.2.1. Set `hibench.hadoop.examples.test.jar`
The Hadoop deployed by Bigtop lives in /usr/lib/hadoop, so the jar referenced by `hibench.hadoop.examples.test.jar` sits in a different location than HiBench expects, and the setting needs to be adjusted.
- Solution 1:
Add the setting to conf/hibench.conf:
$ vi conf/hibench.conf
hibench.hadoop.examples.test.jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.8.4-tests.jar
- Solution 2:
Patch bin/functions/load_config.py:
$ vi bin/functions/load_config.py
diff --git a/bin/functions/load_config.py b/bin/functions/load_config.py
index 61101dc..041e8e6 100755
--- a/bin/functions/load_config.py
+++ b/bin/functions/load_config.py
@@ -423,7 +423,7 @@ def probe_hadoop_examples_test_jars():
examples_test_jars_candidate_hdp0 = HibenchConf[
'hibench.hadoop.home'] + "/../hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient-tests.jar"
examples_test_jars_candidate_hdp1 = HibenchConf[
- 'hibench.hadoop.home'] + "/../hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar"
+ 'hibench.hadoop.home'] + "/../hadoop-mapreduce/hadoop-mapreduce-client-jobclient*-tests.jar"
examples_test_jars_candidate_list = [
examples_test_jars_candidate_apache0,
2.2.2. JAVA_HOME Not Set
JAVA_HOME needs to be set system-wide:
$ sudo vi /etc/profile.d/javahome.sh
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
$ sudo chmod a+x /etc/profile.d/javahome.sh
$ . /etc/profile.d/javahome.sh
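To see what the javahome.sh one-liner does: `readlink -f` resolves the /usr/bin/java symlink chain to the real binary, and `sed` strips the trailing "bin/java", leaving the JDK/JRE root (with a trailing slash). The resolved path below is a made-up example, not taken from this cluster:

```shell
# Hypothetical result of: readlink -f /usr/bin/java
resolved=/usr/lib/jvm/java-1.8.0-openjdk/jre/bin/java
# Strip the trailing "bin/java" to get the Java home directory
java_home=$(echo "$resolved" | sed "s:bin/java::")
echo "$java_home"   # -> /usr/lib/jvm/java-1.8.0-openjdk/jre/
```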
2.2.3. HDFS Permission Denied
A message like this appears when running the wordcount prepare.sh as user 'guodong':
org.apache.hadoop.security.AccessControlException: Permission denied: user=guodong, access=WRITE, inode="/user":hdfs:hadoop:drwxr-xr-x
It means user 'guodong' wants to WRITE into the HDFS folder '/user', but '/user' is owned by user 'hdfs' and group 'hadoop' with permission drwxr-xr-x, and 'guodong' does not belong to group 'hadoop'. To fix the issue, make '/user' writable by group and others:
$ hadoop fs -chmod 777 hdfs://d05-001.bigtop.deploy:8020/user
(A narrower alternative is to have the HDFS superuser create /user/guodong and chown it to 'guodong'.)
2.2.4. Spark ClassNotFoundException
When running ./bin/workloads/micro/wordcount/spark/run.sh, 'spark-submit' fails with:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
An analysis of the cause can be found at https://spark.apache.org/docs/latest/hadoop-provided.html
To fix it, SPARK_DIST_CLASSPATH must include Hadoop's package jars. In the HiBench/Bigtop environment, run:
$ export SPARK_DIST_CLASSPATH=$(hadoop classpath)
Also, make it take effect automatically in every login shell:
$ sudo vi /etc/profile.d/sparkdistclasspath.sh
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
$ sudo chmod a+x /etc/profile.d/sparkdistclasspath.sh
$ . /etc/profile.d/sparkdistclasspath.sh
2.3. Benchmark Running
Note: please ensure firewalld is disabled on all machines; a reboot can bring firewalld back up.
2.3.1. Hadoop: Micro/wordcount
$ ./bin/workloads/micro/wordcount/prepare/prepare.sh
$ ./bin/workloads/micro/wordcount/hadoop/run.sh
2.3.2. Spark: Micro/wordcount
$ ./bin/workloads/micro/wordcount/prepare/prepare.sh
$ ./bin/workloads/micro/wordcount/spark/run.sh
2.4. Run_all.sh
Edit conf/benchmarks.lst to include only the benchmarks you need.
Edit conf/frameworks.lst to contain only the frameworks you need.
Then, run:
$ ./bin/run_all.sh
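For illustration, a trimmed conf/benchmarks.lst that runs only the wordcount test covered above might look like the sketch below. The one-entry-per-line, category.workload format is assumed from the workload paths used in this post; check your checkout's template before editing.

```
# Keep only the workloads you want, one per line; comment out the rest
micro.wordcount
#micro.sort
```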
2.4.1. Physical Memory for Each Container
The default physical memory for each container is 1 GB. Although that is enough for most test tasks, it does not work for nutchindexing. Please refer to the later section "Nutchindexing: Beyond Physical Memory Limits" for the error messages that pop up.
To fix it, there are two parameters in /usr/lib/hadoop/etc/hadoop/mapred-site.xml that control this; update both to 4096 (MB), i.e. 4 GB:
$ sudo vi /usr/lib/hadoop/etc/hadoop/mapred-site.xml
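The post does not name the two parameters at this point; in a stock Hadoop 2.x mapred-site.xml, the per-container physical memory limits are mapreduce.map.memory.mb and mapreduce.reduce.memory.mb, so a sketch of the change, under that assumption, is:

```
<!-- Assumed parameter names: the standard MapReduce per-container memory limits. -->
<!-- Values are in MB; 4096 = 4 GB. -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>
```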
=========
~Finished~
=========