spark ForeachWriter source code

  • 2022-10-20

spark ForeachWriter code

File path: /sql/core/src/main/scala/org/apache/spark/sql/ForeachWriter.scala

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql

/**
 * The abstract class for writing custom logic to process data generated by a query.
 * This is often used to write the output of a streaming query to arbitrary storage systems.
 * Any implementation of this base class will be used by Spark in the following way.
 *
 * <ul>
 * <li>A single instance of this class is responsible for all the data generated by a single task
 *     in a query. In other words, one instance is responsible for processing one partition of the
 *     data generated in a distributed manner.
 *
 * <li>Any implementation of this class must be serializable because each task will get a fresh
 *     serialized-deserialized copy of the provided object. Hence, it is strongly recommended that
 *     any initialization for writing data (e.g. opening a connection or starting a transaction)
 *     is done after the `open(...)` method has been called, which signifies that the task is
 *     ready to generate data.
 *
 * <li>The lifecycle of the methods are as follows.
 *
 *   <pre>
 *   For each partition with `partitionId`:
 *       For each batch/epoch of streaming data (if it is a streaming query) with `epochId`:
 *           Method `open(partitionId, epochId)` is called.
 *           If `open` returns true:
 *                For each row in the partition and batch/epoch, method `process(row)` is called.
 *           Method `close(errorOrNull)` is called with error (if any) seen while processing rows.
 *   </pre>
 *
 * </ul>
 *
 * Important points to note:
 * <ul>
 * <li>Spark doesn't guarantee same output for (partitionId, epochId), so deduplication
 *     cannot be achieved with (partitionId, epochId). e.g. source provides different number of
 *     partitions for some reason, Spark optimization changes number of partitions, etc.
 *     Refer SPARK-28650 for more details. If you need deduplication on output, try out
 *     `foreachBatch` instead.
 *
 * <li>The `close()` method will be called if `open()` method returns successfully (irrespective
 *     of the return value), except if the JVM crashes in the middle.
 * </ul>
 *
 * Scala example:
 * {{{
 *   datasetOfString.writeStream.foreach(new ForeachWriter[String] {
 *
 *     def open(partitionId: Long, version: Long): Boolean = {
 *       // open connection
 *     }
 *
 *     def process(record: String) = {
 *       // write string to connection
 *     }
 *
 *     def close(errorOrNull: Throwable): Unit = {
 *       // close the connection
 *     }
 *   })
 * }}}
 *
 * Java example:
 * {{{
 *  datasetOfString.writeStream().foreach(new ForeachWriter<String>() {
 *
 *    @Override
 *    public boolean open(long partitionId, long version) {
 *      // open connection
 *    }
 *
 *    @Override
 *    public void process(String value) {
 *      // write string to connection
 *    }
 *
 *    @Override
 *    public void close(Throwable errorOrNull) {
 *      // close the connection
 *    }
 *  });
 * }}}
 *
 * @since 2.0.0
 */
abstract class ForeachWriter[T] extends Serializable {

  // TODO: Move this to org.apache.spark.sql.util or consolidate this with batch API.

  /**
   * Called when starting to process one partition of new data in the executor. See the class
   * docs for more information on how to use the `partitionId` and `epochId`.
   *
   * @param partitionId the partition id.
   * @param epochId a unique id for data deduplication.
   * @return `true` if the corresponding partition and version id should be processed. `false`
   *         indicates the partition should be skipped.
   */
  def open(partitionId: Long, epochId: Long): Boolean

  /**
   * Called to process the data in the executor side. This method will be called only if `open`
   * returns `true`.
   */
  def process(value: T): Unit

  /**
   * Called when stopping to process one partition of new data in the executor side. This is
   * guaranteed to be called whether `open` returns `true` or `false`. However,
   * `close` won't be called in the following cases:
   *
   * <ul>
   * <li>JVM crashes without throwing a `Throwable`</li>
   * <li>`open` throws a `Throwable`.</li>
   * </ul>
   *
   * @param errorOrNull the error thrown during processing data or null if there was no error.
   */
  def close(errorOrNull: Throwable): Unit
}
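
To make the lifecycle described in the class docs concrete, below is a minimal, self-contained Scala sketch of a `ForeachWriter[String]` that appends each record to a per-partition, per-epoch file. The output directory, the file naming scheme, and the use of the built-in `rate` source are illustrative assumptions, not part of the Spark source above.

import java.io.{BufferedWriter, FileWriter}

import org.apache.spark.sql.{ForeachWriter, SparkSession}

// One instance of this writer handles one partition of one epoch, as the
// class docs above describe. The file handle is created in open(), i.e.
// after the object has been serialized and deserialized on the executor.
class FileForeachWriter(outputDir: String) extends ForeachWriter[String] {

  @transient private var writer: BufferedWriter = _

  // Called once per (partitionId, epochId); returning false skips the partition.
  // The output directory is assumed to exist already.
  override def open(partitionId: Long, epochId: Long): Boolean = {
    writer = new BufferedWriter(
      new FileWriter(s"$outputDir/part-$partitionId-epoch-$epochId.txt", true))
    true
  }

  // Called once per row, only if open() returned true.
  override def process(value: String): Unit = {
    writer.write(value)
    writer.newLine()
  }

  // Called after the partition is processed, whether it succeeded or failed,
  // unless open() itself threw or the JVM crashed.
  override def close(errorOrNull: Throwable): Unit = {
    if (writer != null) writer.close()
    if (errorOrNull != null) {
      // A real sink might roll back a transaction here instead.
      errorOrNull.printStackTrace()
    }
  }
}

object ForeachWriterExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("foreach-writer-sketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // A toy source: the built-in `rate` source emits (timestamp, value) rows.
    val lines = spark.readStream
      .format("rate")
      .option("rowsPerSecond", 5)
      .load()
      .select($"value".cast("string"))
      .as[String]

    val query = lines.writeStream
      .foreach(new FileForeachWriter("/tmp/foreach-writer-demo"))
      .start()

    query.awaitTermination()
  }
}

Because each task gets a fresh serialized-deserialized copy of the writer, the file handle is opened in `open(...)` rather than in the constructor, matching the recommendation in the class docs.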
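
The "Important points" above recommend `foreachBatch` when the sink needs deduplication, since (partitionId, epochId) is not a stable key. The sketch below follows the `foreachBatch` form shown in the Structured Streaming programming guide and writes each micro-batch to a directory keyed by `batchId`, so a replayed batch overwrites the same output instead of duplicating it; the output path and layout are assumptions for illustration.

import org.apache.spark.sql.{DataFrame, SparkSession}

object ForeachBatchExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("foreach-batch-sketch")
      .master("local[2]")
      .getOrCreate()

    val lines = spark.readStream
      .format("rate")
      .option("rowsPerSecond", 5)
      .load()

    // foreachBatch hands the whole micro-batch and its batchId to user code,
    // so the sink can write idempotently per batch.
    val query = lines.writeStream
      .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
        // Overwriting a per-batch directory makes a replayed batchId rewrite
        // the same output rather than append duplicates.
        batchDF.write
          .mode("overwrite")
          .parquet(s"/tmp/foreach-batch-demo/batch=$batchId")
      }
      .start()

    query.awaitTermination()
  }
}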

Related information

spark source code directory

Related articles

spark Column source code

spark DataFrameNaFunctions source code

spark DataFrameReader source code

spark DataFrameStatFunctions source code

spark DataFrameWriter source code

spark DataFrameWriterV2 source code

spark Dataset source code

spark DatasetHolder source code

spark ExperimentalMethods source code

spark KeyValueGroupedDataset source code
