Production-grade PySpark jobs
When working with a PySpark job it is often useful to include additional Python packages. Conda usually does a great job of managing these packages, and conda-pack can bundle the resulting environment into an archive that can be distributed to the Spark cluster.
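As a minimal sketch, the environment.yml referenced by the Makefile below could look like this (the environment name my-custom-foo-env matches the examples below; the listed channels and packages are assumptions, not part of the original setup):

name: my-custom-foo-env
channels:
  - conda-forge
dependencies:
  - python=2.7
  - pandas
  - scikit-learn
  - conda-pack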
To automate this, consider a Makefile with the following contents:
# Note that the extra activate is needed to ensure that the activate floats the env to the front of PATH
CONDA_ACTIVATE=source $$(conda info --base)/etc/profile.d/conda.sh ; conda activate ; conda activate
# https://stackoverflow.com/questions/53382383/makefile-cant-use-conda-activate

setup:
	# initial setup. Needs to be re-executed every time a dependency (including transitive dependencies) changes,
	# typically only once when the workflow is deployed
	conda env create -f environment.yml && \
	rm -rf my-custom-foo-env.tar.gz && \
	($(CONDA_ACTIVATE) my-custom-foo-env ; conda pack -n my-custom-foo-env )

prepare:
	# needs to be re-executed any time the contents of the lib module change (basically every time)
	python setup.py bdist_egg

training: prepare
	spark-submit --verbose \
	  --master yarn \
	  --deploy-mode cluster \
	  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=my-custom-foo-env/bin/python \
	  --archives my-custom-foo-env.tar.gz#my-custom-foo-env \
	  --py-files dist/my_library-0.0.1-py2.7.egg \
	  main.py
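Assuming the targets above, a typical invocation could look as follows (setup only needs to be re-run when dependencies change):

make setup      # create and pack the conda environment (one-off)
make training   # rebuild the egg and submit the job to the cluster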
First, the conda environment described in environment.yml is created and then packed with conda pack -n my-custom-foo-env, which uses conda-pack to produce a compressed archive (my-custom-foo-env.tar.gz) of the whole environment.
Second, the job itself must be packaged as a Python library so that all of its submodules are included. For this, python setup.py bdist_egg is used to generate an Egg file (which is mostly just another zip file).
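A minimal setup.py matching the egg file name used above might look like this (the package name my_library and version 0.0.1 are assumptions):

from setuptools import setup, find_packages

setup(
    name="my_library",
    version="0.0.1",
    packages=find_packages(),
)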
Finally, PySpark needs to be told to a) pick up the dependencies and modules and b) use the specific Python interpreter shipped inside the conda-pack archive. For a), the flags
--archives my-custom-foo-env.tar.gz#my-custom-foo-env \
--py-files dist/my_library-0.0.1-py2.7.egg \
are required. The configuration property --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=my-custom-foo-env/bin/python takes care of b).
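To illustrate how the pieces fit together, main.py can then import the packaged library like any installed module; the module, function, and paths below are hypothetical:

from pyspark.sql import SparkSession

# hypothetical module shipped to the executors via --py-files
from my_library.features import build_features

spark = SparkSession.builder.appName("training").getOrCreate()

df = spark.read.parquet("hdfs:///data/raw/events")  # hypothetical input path
build_features(df).write.mode("overwrite").parquet("hdfs:///data/features")

spark.stop()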
Edit: please keep in mind that the example above assumes the job is submitted to Spark in YARN cluster mode! In that case spark.yarn.executorEnv.PYSPARK_PYTHON and spark.yarn.appMasterEnv.PYSPARK_PYTHON would be set to my-custom-foo-env/bin/python (though the executor setting seems to be optional). If you instead want to run in YARN client mode (for example from a Jupyter notebook started on an edge node of the cluster), you must set:
import os
os.environ['PYSPARK_PYTHON'] = 'my-custom-foo-env/bin/python'
as Spark will otherwise not find the right Python version and environment.
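As a sketch of the client-mode variant (the archive is shipped to the executors via spark.yarn.dist.archives; the application name is arbitrary):

import os

# must be set before the SparkSession / SparkContext is created
os.environ['PYSPARK_PYTHON'] = 'my-custom-foo-env/bin/python'

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master('yarn')
    .config('spark.yarn.dist.archives', 'my-custom-foo-env.tar.gz#my-custom-foo-env')
    .appName('notebook-session')
    .getOrCreate()
)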