spark DataWriterFactory source code

  • 2022-10-20

spark DataWriterFactory code

File path: /sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/DataWriterFactory.java

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.connector.write;

import java.io.Serializable;

import org.apache.spark.TaskContext;
import org.apache.spark.annotation.Evolving;
import org.apache.spark.sql.catalyst.InternalRow;

/**
 * A factory of {@link DataWriter} returned by
 * {@link BatchWrite#createBatchWriterFactory(PhysicalWriteInfo)}, which is responsible for
 * creating and initializing the actual data writer at executor side.
 * <p>
 * Note that, the writer factory will be serialized and sent to executors, then the data writer
 * will be created on executors and do the actual writing. So this interface must be
 * serializable and {@link DataWriter} doesn't need to be.
 *
 * @since 3.0.0
 */
@Evolving
public interface DataWriterFactory extends Serializable {

  /**
   * Returns a data writer to do the actual writing work. Note that, Spark will reuse the same data
   * object instance when sending data to the data writer, for better performance. Data writers
   * are responsible for defensive copies if necessary, e.g. copy the data before buffer it in a
   * list.
   * <p>
   * If this method fails (by throwing an exception), the corresponding Spark write task would fail
   * and get retried until hitting the maximum retry times.
   *
   * @param partitionId A unique id of the RDD partition that the returned writer will process.
   *                    Usually Spark processes many RDD partitions at the same time,
   *                    implementations should use the partition id to distinguish writers for
   *                    different partitions.
   * @param taskId The task id returned by {@link TaskContext#taskAttemptId()}. Spark may run
   *               multiple tasks for the same partition (due to speculation or task failures,
   *               for example).
   */
  DataWriter<InternalRow> createWriter(int partitionId, long taskId);
}
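
To make the contract above concrete, here is a minimal sketch of a connector-side implementation. Everything named Buffering* or CommitMessage below is hypothetical and not part of Spark; only DataWriterFactory, DataWriter, WriterCommitMessage and InternalRow come from the API shown above.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.write.DataWriter;
import org.apache.spark.sql.connector.write.DataWriterFactory;
import org.apache.spark.sql.connector.write.WriterCommitMessage;

// Hypothetical factory: holds only lightweight serializable state, since the
// whole instance is serialized on the driver and shipped to executors.
public class BufferingDataWriterFactory implements DataWriterFactory {
  private final String targetDir;

  public BufferingDataWriterFactory(String targetDir) {
    this.targetDir = targetDir;
  }

  @Override
  public DataWriter<InternalRow> createWriter(int partitionId, long taskId) {
    // partitionId plus taskId uniquely identify this write attempt, so
    // speculative or retried tasks for the same partition never collide.
    return new BufferingDataWriter(targetDir, partitionId, taskId);
  }
}

class BufferingDataWriter implements DataWriter<InternalRow> {
  private final String path;
  private final List<InternalRow> buffer = new ArrayList<>();

  BufferingDataWriter(String targetDir, int partitionId, long taskId) {
    this.path = targetDir + "/part-" + partitionId + "-" + taskId;
  }

  @Override
  public void write(InternalRow record) {
    // Spark reuses the same InternalRow instance between calls, so a
    // defensive copy is required before buffering (see the Javadoc above).
    buffer.add(record.copy());
  }

  @Override
  public WriterCommitMessage commit() throws IOException {
    // A real writer would flush the buffer to `path` here; this sketch only
    // reports what it would have written.
    return new CommitMessage(path, buffer.size());
  }

  @Override
  public void abort() {
    buffer.clear(); // discard buffered rows of the failed attempt
  }

  @Override
  public void close() {
    // release per-task resources (file handles, connections) here
  }
}

// WriterCommitMessage is a Serializable marker carried back to the driver.
class CommitMessage implements WriterCommitMessage {
  final String path;
  final long rowCount;

  CommitMessage(String path, long rowCount) {
    this.path = path;
    this.rowCount = rowCount;
  }
}

The split in this sketch follows from the serialization note in the Javadoc: the factory carries only small serializable configuration across the wire, while non-serializable resources such as file handles are created per task inside createWriter, on the executor.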

Related information

spark source code directory

Related articles

spark BatchWrite source code

spark DataWriter source code

spark DeltaBatchWrite source code

spark DeltaWrite source code

spark DeltaWriteBuilder source code

spark DeltaWriter source code

spark DeltaWriterFactory source code

spark LogicalWriteInfo source code

spark PhysicalWriteInfo source code

spark RequiresDistributionAndOrdering source code
