Set up the additional dependencies needed for Spark.
In this step, we show you the additional steps needed to create a successful Spark image.
We have already provided the files that are needed; all you need to do is copy them into the appropriate locations.
Two dependencies need to be configured: Jupyter, for which we provide a shell script, and the correct version of Java for Spark.
It is always good to check which version of Spark you are using so that you understand the dependencies you may need to set up in your base image.
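For example, on a machine where Spark is already installed, you can check the Spark and Java versions with the following commands (this assumes spark-submit and java are on your PATH):

spark-submit --version
java -version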
Task 1:
Copy the pre-made Jupyter script to the centos folder:
cp ~/test/configure_jupyter.sh ~/Spark/image/centos
Feel free to run the following command to see what the shell script executes:
cat ~/Spark/image/centos/configure_jupyter.sh
You will see the steps required to set up Jupyter listed.
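As a rough illustration only, a Jupyter setup script of this kind often looks something like the sketch below; the package names and steps here are assumptions, and the provided configure_jupyter.sh may differ:

#!/bin/bash
# Illustrative sketch only; the provided configure_jupyter.sh may differ.
# Install Python and pip, then Jupyter (package names are assumptions).
yum install -y python3 python3-pip
pip3 install jupyter
# Generate a default Jupyter configuration for the current user.
jupyter notebook --generate-config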
Next, copy over the Java configuration script:
cp ~/test/configure_java8.sh ~/Spark/image/centos
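Again as an illustration only (the provided configure_java8.sh may differ), a Java 8 setup script for a CentOS-based image typically installs OpenJDK 8 and exports JAVA_HOME:

#!/bin/bash
# Illustrative sketch only; the provided configure_java8.sh may differ.
# Install OpenJDK 8 from the CentOS repositories.
yum install -y java-1.8.0-openjdk java-1.8.0-openjdk-devel
# Export JAVA_HOME so Spark picks up the Java 8 runtime.
echo 'export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk' >> /etc/profile.d/java.sh
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> /etc/profile.d/java.sh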
Task 2:
We need to add additional configuration files under the appconfig directory. We have already created these files for you; to add them, execute the following commands:
Remove the existing appconfig folder from the Spark folder:
rm -rf appconfig
Install wget:
yum install wget -y
Add the appconfig reference files using the command below (due to space constraints in Katacoda, we have uploaded the required appconfig files to Dropbox):
wget https://www.dropbox.com/s/wbnr83q26przbs6/appconfig.zip
Install unzip:
yum install unzip -y
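If you would like to confirm the archive downloaded correctly before extracting it, you can list its contents:

unzip -l appconfig.zip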
Unzip the file
unzip appconfig.zip
Check the files under the appconfig directory:
ls appconfig
Make sure you see the following files under the appconfig folder:
startscript: A script that contains the code to start all required services.
spark-slave: A service script to start, stop, and get the status of the Spark slave service.
spark-master: A service script to start, stop, and get the status of the Spark master service.
spark-defaults.conf: The default system properties included when running spark-submit.
spark-env.sh: Sets up the Spark environment (a minimal illustrative sketch appears after this list).
start_jupyterhub.sh: Used to bring the JupyterHub service up and down.
start_jupyter.sh: Used to bring the Jupyter service up and down.
total_vcores.sh: Used to obtain the total number of virtual CPU cores assigned to the nodes.
macros.sh: Contains all the built-in BlueData macros executed during image creation.
logging.sh: Provides the logging facilities for a catalog configuration bundle.
utils.sh: Contains utility functions that provide the Docker ID, CPU share, memory status, and FQDN of the current container.
p_kernel.json: Provides interactive Python development for Jupyter.
sq_kernel.json: Provides an interactive SQL interpreter for Jupyter.
core-site.xml, hadoop: Used to set up Hadoop-related configuration.
appjob: Provides information on the type of job to be launched; application-specific jobs can also be added here.
systemd.service: Used to bootstrap the user space and to manage system processes after booting.
jupyter and jupyterhub: Contain all the configuration required to run Jupyter and JupyterHub.
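To make the role of spark-env.sh more concrete, here is a minimal illustrative sketch; the values below (master hostname, worker cores, and worker memory) are assumptions and will differ from the file shipped in appconfig.zip:

#!/usr/bin/env bash
# Illustrative spark-env.sh sketch; these values are assumptions,
# not the contents of the shipped file.
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk   # JDK installed by configure_java8.sh
export SPARK_MASTER_HOST=spark-master              # hostname the Spark master binds to
export SPARK_WORKER_CORES=2                        # CPU cores each worker may use
export SPARK_WORKER_MEMORY=2g                      # memory each worker may use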
Remove the zip file from the folder
rm -rf appconfig.zip
Task 3:
When our image is ready to deploy to the EPIC Application Catalog, we need to include a picture that represents it. For your reference, we have already created a .png file for you to use.
cp ~/test/Logo_Spark.png ~/Spark
Logo_Spark.png is a logo file (400px x 200px .png) used to visually identify each application in the App Store.
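You can verify the logo's dimensions with the file utility, which reports the width and height of a PNG:

file ~/Spark/Logo_Spark.png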