Spark on MaxCompute supports three running modes: local, cluster, and DataWorks.
Local mode
The local mode is used to facilitate code debugging of applications. In local mode, you can use Spark on MaxCompute in the same way as open source Apache Spark. You can also use Tunnel to read data from and write data to MaxCompute tables. In this mode, you can run Spark on MaxCompute from an integrated development environment (IDE) or from the command line. If you use this mode, you must add the spark.master=local[N] configuration, where N indicates the number of CPU cores to use. The following commands submit the SparkPi example in local mode:
    # /path/to/MaxCompute-Spark: the path where the compiled application JAR package is saved.
    cd $SPARK_HOME
    bin/spark-submit --master local --class com.aliyun.odps.spark.examples.SparkPi \
        /path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar
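Instead of the --master flag shown above, the same setting can be placed in the spark-defaults.conf file of the Spark on MaxCompute client. The following excerpt is only a sketch: local[4] is an arbitrary value, and the spark.hadoop.odps.project.name, spark.hadoop.odps.access.id, and spark.hadoop.odps.access.key keys are assumed here as the account settings that local mode needs to read and write MaxCompute tables through Tunnel. Check the spark-defaults.conf template of your client for the exact key names and values.

    # Run in local mode with four CPU cores.
    spark.master = local[4]
    # Assumed account settings for accessing MaxCompute tables through Tunnel in local mode.
    spark.hadoop.odps.project.name = <your_project_name>
    spark.hadoop.odps.access.id = <your_accesskey_id>
    spark.hadoop.odps.access.key = <your_accesskey_secret>
    spark.hadoop.odps.end.point = <your_maxcompute_endpoint>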
Cluster mode
The following commands submit the SparkPi example in cluster mode:

    # /path/to/MaxCompute-Spark: the path where the compiled application JAR package is saved.
    cd $SPARK_HOME
    bin/spark-submit --master yarn-cluster --class com.aliyun.odps.spark.examples.SparkPi \
        /path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar
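In both modes, additional Spark configurations can also be passed on the spark-submit command line with the standard --conf option instead of editing spark-defaults.conf. The following sketch extends the cluster-mode command above with two illustrative settings; the executor count and memory values are arbitrary examples, not recommendations:

    cd $SPARK_HOME
    bin/spark-submit --master yarn-cluster \
        --conf spark.executor.instances=2 \
        --conf spark.executor.memory=4g \
        --class com.aliyun.odps.spark.examples.SparkPi \
        /path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar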
DataWorks mode
You can run Spark on MaxCompute offline jobs (cluster mode) in DataWorks so that they can be integrated and scheduled with other types of nodes.
- Upload the required resource in the DataWorks business flow and click the Submit icon.
The resource that is uploaded appears in the navigation tree.
- In the created business flow, select ODPS Spark from Data Analytics.
- Double-click the ODPS Spark node and configure the parameters for the Spark job. The Spark Version and Language parameters of the ODPS Spark node each provide two options. The other parameters that you must configure vary based on the value of the Language parameter. Configure the parameters as prompted. For more information, see Create an ODPS Spark node.
- Main JAR Resource: the resource file used by the job. You must upload the resource file to DataWorks before you perform this operation.
- Configuration Items: the configuration items that are used to submit the job.
  You do not need to configure spark.hadoop.odps.end.point. By default, its value is the same as that of the MaxCompute project. You can also explicitly specify this configuration item to overwrite the default value.
  You must add the configurations in the spark-defaults.conf file to the configuration items of the ODPS Spark node one by one, instead of uploading the spark-defaults.conf file itself. These configurations include the number of executors, the memory size, and spark.hadoop.odps.runtime.end.point. An illustrative set of configuration items is shown after this procedure.
  The resource file and the configuration items of the ODPS Spark node map to the parameters and options of the spark-submit command, as described in the following table.
| ODPS Spark node | spark-submit |
| --- | --- |
| Main JAR Resource and Main Python Resource | app jar or python file |
- Manually run the ODPS Spark node to view the operational logs of the job and obtain
the URLs of Logview and Jobview from the logs for further analysis and diagnosis.
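For reference, the Configuration Items field of the ODPS Spark node takes key=value pairs such as the following. This is only a sketch: the executor count and memory values are arbitrary examples, and the endpoint value is a placeholder that depends on your region and network.

    spark.executor.instances=2
    spark.executor.memory=4g
    spark.hadoop.odps.runtime.end.point=<your_maxcompute_runtime_endpoint>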
After the Spark job is defined, orchestrate and schedule nodes of other types in the business flow as required.