Spark on MaxCompute supports three running modes: local, cluster, and DataWorks.

Local mode

The local mode facilitates code debugging for applications. In local mode, you can use Spark on MaxCompute in the same way as open source Spark, and you can use Tunnel to read data from and write data to MaxCompute tables. You can run Spark on MaxCompute in this mode from an integrated development environment (IDE) or from the command line. If you use this mode, you must add the spark.master=local[N] configuration, where N specifies the number of CPU cores to use.

In local mode, Tunnel is used to read data from and write data to tables. Therefore, you must add the Tunnel configuration items to the spark-defaults.conf file before you read or write data in this mode. Specify the endpoint based on the region in which your MaxCompute project resides and your network environment. For more information about how to obtain the endpoint, see Configure endpoints. The following example shows how to run Spark on MaxCompute from the command line in this mode:
# /path/to/MaxCompute-Spark: the path where the compiled application JAR package is saved.
cd $SPARK_HOME
bin/spark-submit --master local[4] --class com.aliyun.odps.spark.examples.SparkPi \
/path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar
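
The following is a minimal spark-defaults.conf sketch for local mode. The property names follow the configuration items mentioned in this topic; all values, including the endpoint URLs and the Tunnel endpoint property, are placeholders that you must replace with those of your own account, MaxCompute project, and region:
# Example: $SPARK_HOME/conf/spark-defaults.conf (all values are placeholders)
spark.hadoop.odps.project.name = your_project_name
spark.hadoop.odps.access.id = your_access_key_id
spark.hadoop.odps.access.key = your_access_key_secret
# MaxCompute endpoint. Replace <region> based on your region and network environment.
spark.hadoop.odps.end.point = http://service.<region>.maxcompute.aliyun.com/api
# Tunnel endpoint, which is required in local mode to read and write table data.
spark.hadoop.odps.tunnel.end.point = http://dt.<region>.maxcompute.aliyun.com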

Cluster mode

In cluster mode, you must specify a Main method as the entry point of your application. A Spark job ends when the Main method succeeds or fails. This mode is suitable for offline jobs and can be used together with DataWorks to schedule jobs. The following example shows how to run Spark on MaxCompute from the command line in this mode:
# /path/to/MaxCompute-Spark: the path where the compiled application JAR package is saved.
cd $SPARK_HOME
bin/spark-submit --master yarn-cluster --class com.aliyun.odps.spark.examples.SparkPi \
/path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar
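
In cluster mode, you typically also set job-level Spark properties with --conf on the command line. The following sketch extends the command above; the executor settings are illustrative values, not requirements:
cd $SPARK_HOME
bin/spark-submit --master yarn-cluster \
--class com.aliyun.odps.spark.examples.SparkPi \
--conf spark.executor.instances=2 \
--conf spark.executor.memory=4g \
/path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar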

DataWorks mode

You can run offline Spark on MaxCompute jobs in cluster mode in DataWorks, which allows you to integrate them with other types of nodes for orchestration and scheduling.
Note DataWorks supports the Spark node in the following regions: China (Hangzhou), China (Beijing), China (Shanghai), China (Shenzhen), China (Hong Kong), US (Silicon Valley), Germany (Frankfurt), India (Mumbai), and Singapore (Singapore).
Procedure:
  1. Upload the required resource in the DataWorks business flow and click the Submit icon.

    The resource that is uploaded appears in the navigation tree.

  2. In the created business flow, select ODPS Spark from Data Analytics.
  3. Double-click the ODPS Spark node and configure the parameters for the Spark job. The Spark Version and Language parameters of the ODPS Spark node each have two options. The other parameters that you must configure vary based on the value of the Language parameter. You can configure the parameters as prompted. For more information, see Create an ODPS Spark node.
    • Main JAR Resource: the resource file used by the job. You must upload the resource file to DataWorks before you perform this operation.
    • Configuration Items: the configuration items for you to submit the job.

      You do not need to configure spark.hadoop.odps.access.id, spark.hadoop.odps.access.key, or spark.hadoop.odps.end.point. By default, the values of these configuration items are those of the MaxCompute project. You can also explicitly specify these configuration items to overwrite the default values.

      You do not need to upload the spark-defaults.conf file itself. Instead, you must add the configurations from the spark-defaults.conf file, such as the number of executors, the memory size, and spark.hadoop.odps.runtime.end.point, one by one to the configuration items of the ODPS Spark node.

      The resource files and configuration items of the ODPS Spark node map to the parameters and options of the spark-submit command, as described in the following table.
      ODPS Spark node                            spark-submit
      -----------------------------------------  ----------------------
      Main JAR Resource or Main Python Resource  app jar or python file
      Configuration Items                        --conf PROP=VALUE
      Main Class                                 --class CLASS_NAME
      Arguments                                  [app arguments]
      JAR Resources                              --jars JARS
      Python Resources                           --py-files PY_FILES
      File Resources                             --files FILES
      Archive Resources                          --archives ARCHIVES
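
      As a worked example of the mapping above, an ODPS Spark node configured with a main JAR resource, a main class, one JAR resource, one file resource, and two arguments corresponds to the following spark-submit command. All names here (my-job.jar, com.example.MyJob, deps.jar, lookup.txt) are hypothetical, for illustration only:
      # --class        -> Main Class
      # --conf         -> Configuration Items
      # --jars         -> JAR Resources
      # --files        -> File Resources
      # app jar + args -> Main JAR Resource and Arguments
      bin/spark-submit --master yarn-cluster \
      --class com.example.MyJob \
      --conf spark.executor.memory=4g \
      --jars deps.jar \
      --files lookup.txt \
      my-job.jar arg1 arg2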
  4. Manually run the ODPS Spark node, view the run logs of the job, and obtain the Logview and Jobview URLs from the logs for further analysis and diagnosis.

    After the Spark job is defined, you can orchestrate and schedule it together with other types of services in the business flow as required.