DataFrame Class


Definition

Namespace: Microsoft.Spark.Sql

Assembly: Microsoft.Spark.dll

Package: Microsoft.Spark v1.0.0

Important

Some information relates to prerelease product that may be substantially modified before it’s released. Microsoft makes no warranties, express or implied, with respect to the information provided here.

A distributed collection of data organized into named columns.

public sealed class DataFrame

type DataFrame = class

Public NotInheritable Class DataFrame

Inheritance: Object → DataFrame

Properties

Item[String]

Selects a column based on the column name.
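
The following is a minimal sketch of the indexer, not an official sample. It assumes the application is launched through spark-submit with the Microsoft.Spark package referenced; the app name and the inline `people` data are illustrative and do not come from this reference.

```csharp
using Microsoft.Spark.Sql;

class IndexerExample
{
    static void Main()
    {
        // Assumption: a Spark runtime is available and this app is started via spark-submit.
        SparkSession spark = SparkSession.Builder().AppName("indexer-example").GetOrCreate();

        // Illustrative inline data; any DataFrame with "name" and "age" columns would do.
        DataFrame df = spark.Sql(
            "SELECT * FROM VALUES ('Ann', 34), ('Bob', 19) AS people(name, age)");

        // df["name"] uses the Item[String] indexer; df.Col("age") is the method form.
        Column name = df["name"];
        Column age = df.Col("age");

        // Both forms yield Column objects that can be passed to Select(Column[]).
        df.Select(name, age).Show();

        spark.Stop();
    }
}
```

The indexer and `Col(String)` are interchangeable here; the indexer is simply shorter when building expressions inline.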

Methods

Agg(Column, Column[])	Aggregates on the entire `DataFrame` without groups.
Alias(String)	Returns a new `DataFrame` with an alias set. Same as As().
As(String)	Returns a new `DataFrame` with an alias set.
Cache()	Persists this `DataFrame` with the default storage level MEMORY_AND_DISK.
Checkpoint(Boolean)	Returns a checkpointed version of this `DataFrame`.
Coalesce(Int32)	Returns a new `DataFrame` that has exactly `numPartitions` partitions when fewer partitions are requested. If a larger number of partitions is requested, it stays at the current number of partitions.
Col(String)	Selects a column based on the column name.
Collect()	Returns an array that contains all rows in this `DataFrame`.
ColRegex(String)	Selects a column based on the column name specified as a regex.
Columns()	Returns all column names.
Count()	Returns the number of rows in the `DataFrame`.
CreateGlobalTempView(String)	Creates a global temporary view using the given name. The lifetime of this temporary view is tied to this Spark application.
CreateOrReplaceGlobalTempView(String)	Creates or replaces a global temporary view using the given name. The lifetime of this temporary view is tied to this Spark application.
CreateOrReplaceTempView(String)	Creates or replaces a local temporary view using the given name. The lifetime of this temporary view is tied to the SparkSession that created this `DataFrame`.
CreateTempView(String)	Creates a local temporary view using the given name. The lifetime of this temporary view is tied to the SparkSession that created this `DataFrame`.
CrossJoin(DataFrame)	Explicit Cartesian join with another `DataFrame`.
Cube(Column[])	Create a multi-dimensional cube for the current `DataFrame` using the specified columns.
Cube(String, String[])	Create a multi-dimensional cube for the current `DataFrame` using the specified columns.
Describe(String[])	Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical or string columns.
Distinct()	Returns a new `DataFrame` that contains only the unique rows from this `DataFrame`. This is an alias for DropDuplicates().
Drop(Column)	Returns a new `DataFrame` with a column dropped. This is a no-op if the `DataFrame` doesn't have a column with an equivalent expression.
Drop(String[])	Returns a new `DataFrame` with columns dropped. This is a no-op if schema doesn't contain column name(s).
DropDuplicates()	Returns a new `DataFrame` that contains only the unique rows from this `DataFrame`. This is an alias for Distinct().
DropDuplicates(String, String[])	Returns a new `DataFrame` with duplicate rows removed, considering only the subset of columns.
DTypes()	Returns all column names and their data types as an IEnumerable of Tuples.
Except(DataFrame)	Returns a new `DataFrame` containing rows in this `DataFrame` but not in another `DataFrame`.
ExceptAll(DataFrame)	Returns a new `DataFrame` containing rows in this `DataFrame` but not in another `DataFrame` while preserving the duplicates.
Explain(Boolean)	Prints the plans (logical and physical) to the console for debugging purposes.
Explain(String)	Prints the plans (logical and physical) with a format specified by a given explain mode.
Filter(Column)	Filters rows using the given condition.
Filter(String)	Filters rows using the given SQL expression.
First()	Returns the first row. Alias for Head().
GroupBy(Column[])	Groups the `DataFrame` using the specified columns so that aggregations can be run on them.
GroupBy(String, String[])	Groups the `DataFrame` using the specified columns.
Head()	Returns the first row.
Head(Int32)	Returns the first `n` rows.
Hint(String, Object[])	Specifies some hint on the current `DataFrame`.
Intersect(DataFrame)	Returns a new `DataFrame` containing rows only in both this `DataFrame` and another `DataFrame`.
IntersectAll(DataFrame)	Returns a new `DataFrame` containing rows only in both this `DataFrame` and another `DataFrame` while preserving the duplicates.
IsEmpty()	Returns true if this DataFrame is empty.
IsLocal()	Returns true if the Collect() and Take() methods can be run locally without any Spark executors.
IsStreaming()	Returns true if this `DataFrame` contains one or more sources that continuously return data as it arrives.
Join(DataFrame)	Join with another `DataFrame`.
Join(DataFrame, Column, String)	Join with another `DataFrame`, using the given join expression.
Join(DataFrame, IEnumerable<String>, String)	Equi-join with another `DataFrame` using the given columns. A cross join with a predicate is specified as an inner join. To perform a cross join explicitly, use the `CrossJoin` method.
Join(DataFrame, String)	Inner equi-join with another `DataFrame` using the given column.
Limit(Int32)	Returns a new `DataFrame` by taking the first `number` rows.
LocalCheckpoint(Boolean)	Returns a locally checkpointed version of this `DataFrame`.
Na()	Returns a `DataFrameNaFunctions` for working with missing data.
Observe(String, Column, Column[])	Defines (named) metrics to observe on the `DataFrame`. This method returns an 'observed' `DataFrame` that returns the same result as the input, with the following guarantees: (1) it computes the defined aggregates (metrics) on all the data flowing through the `DataFrame` at that point; (2) it reports the value of the defined aggregate columns as soon as a completion point is reached. A completion point is either the end of a query (batch mode) or the end of a streaming epoch. The value of the aggregates reflects only the data processed since the previous completion point. Continuous execution is currently not supported.
OrderBy(Column[])	Returns a new `DataFrame` sorted by the given expressions.
OrderBy(String, String[])	Returns a new `DataFrame` sorted by the given expressions.
Persist()	Persists this `DataFrame` with the default storage level MEMORY_AND_DISK.
Persist(StorageLevel)	Persists this `DataFrame` with the given storage level.
PrintSchema()	Prints the schema to the console in a nice tree format.
PrintSchema(Int32)	Prints the schema up to the given level to the console in a nice tree format.
RandomSplit(Double[], Nullable<Int64>)	Randomly splits this `DataFrame` with the provided weights.
Repartition(Column[])	Returns a new `DataFrame` partitioned by the given partitioning expressions, using `spark.sql.shuffle.partitions` as number of partitions.
Repartition(Int32)	Returns a new `DataFrame` that has exactly `numPartitions` partitions.
Repartition(Int32, Column[])	Returns a new `DataFrame` partitioned by the given partitioning expressions into `numPartitions`. The resulting `DataFrame` is hash partitioned.
RepartitionByRange(Column[])	Returns a new `DataFrame` partitioned by the given partitioning expressions, using `spark.sql.shuffle.partitions` as the number of partitions. The resulting `DataFrame` is range partitioned.
RepartitionByRange(Int32, Column[])	Returns a new `DataFrame` partitioned by the given partitioning expressions into `numPartitions`. The resulting `DataFrame` is range partitioned.
Rollup(Column[])	Create a multi-dimensional rollup for the current `DataFrame` using the specified columns.
Rollup(String, String[])	Create a multi-dimensional rollup for the current `DataFrame` using the specified columns.
Sample(Double, Boolean, Nullable<Int64>)	Returns a new `DataFrame` by sampling a fraction of rows, with or without replacement as specified, using a user-supplied seed.
Schema()	Returns the schema associated with this `DataFrame`.
Select(Column[])	Selects a set of column based expressions.
Select(String, String[])	Selects a set of columns. This is a variant of Select() that can only select existing columns using column names (i.e. cannot construct expressions).
SelectExpr(String[])	Selects a set of SQL expressions. This is a variant of Select() that accepts SQL expressions.
Show(Int32, Int32, Boolean)	Displays rows of the `DataFrame` in tabular form.
Sort(Column[])	Returns a new `DataFrame` sorted by the given expressions.
Sort(String, String[])	Returns a new `DataFrame` sorted by the specified column, all in ascending order.
SortWithinPartitions(Column[])	Returns a new `DataFrame` with each partition sorted by the given expressions.
SortWithinPartitions(String, String[])	Returns a new `DataFrame` with each partition sorted by the given expressions.
Stat()	Returns a `DataFrameStatFunctions` for working with statistic functions.
StorageLevel()	Gets the current storage level of this `DataFrame`.
Summary(String[])	Computes specified statistics for numeric and string columns.
Tail(Int32)	Returns the last `n` rows in the `DataFrame`.
Take(Int32)	Returns the first `n` rows in the `DataFrame`.
ToDF()	Converts this strongly typed collection of data to generic `DataFrame`.
ToDF(String[])	Converts this strongly typed collection of data to generic `DataFrame` with columns renamed.
ToJSON()	Returns the content of the DataFrame as a DataFrame of JSON strings.
ToLocalIterator()	Returns an iterator that contains all of the rows in this `DataFrame`. The iterator will consume as much memory as the largest partition in this `DataFrame`.
ToLocalIterator(Boolean)	Returns an iterator that contains all of the rows in this `DataFrame`. The iterator will consume as much memory as the largest partition in this `DataFrame`. With prefetch it may consume up to the memory of the 2 largest partitions.
Transform(Func<DataFrame,DataFrame>)	Concise syntax for chaining custom transformations.
Union(DataFrame)	Returns a new `DataFrame` containing union of rows in this `DataFrame` and another `DataFrame`.
UnionByName(DataFrame)	Returns a new `DataFrame` containing union of rows in this `DataFrame` and another `DataFrame`, resolving columns by name.
Unpersist(Boolean)	Marks the `DataFrame` as non-persistent and removes all blocks for it from memory and disk.
Where(Column)	Filters rows using the given condition. This is an alias for Filter().
Where(String)	Filters rows using the given SQL expression. This is an alias for Filter().
WithColumn(String, Column)	Returns a new `DataFrame` by adding a column or replacing the existing column that has the same name.
WithColumnRenamed(String, String)	Returns a new `DataFrame` with a column renamed. This is a no-op if the schema doesn't contain `existingName`.
WithWatermark(String, String)	Defines an event time watermark for this `DataFrame`. A watermark tracks a point in time before which no more late data is assumed to arrive.
Write()	Interface for saving the content of the non-streaming `DataFrame` out into external storage.
WriteStream()	Interface for saving the content of the streaming `DataFrame` out into external storage.
WriteTo(String)	Create a write configuration builder for v2 sources.
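
The sketch below strings a few of the methods listed above into one small program, to show how they typically combine. It is an illustration under assumptions rather than an official sample: it assumes a Spark runtime reached through spark-submit with the Microsoft.Spark package referenced, and the app name, the inline `people` data, and the `age_next_year` and `avg_age` column names are all hypothetical.

```csharp
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

class DataFrameMethodsExample
{
    static void Main()
    {
        // Assumption: a Spark runtime is available and this app is started via spark-submit.
        SparkSession spark = SparkSession.Builder().AppName("dataframe-methods").GetOrCreate();

        // Illustrative inline data so the sketch does not depend on an input file.
        DataFrame people = spark.Sql(
            "SELECT * FROM VALUES ('Ann', 'NL', 34), ('Bob', 'NL', 19), ('Cy', 'US', 52) " +
            "AS people(name, country, age)");

        // Filter(Column) takes a column expression; Where(String) takes a SQL expression.
        DataFrame adults = people.Filter(people["age"] > 21);
        DataFrame adultsSql = people.Where("age > 21");

        // WithColumn adds a derived column (or replaces one with the same name).
        DataFrame withNext = adults.WithColumn("age_next_year", adults["age"] + 1);

        // GroupBy followed by Agg computes one aggregate row per group.
        DataFrame byCountry = people
            .GroupBy("country")
            .Agg(Avg(people["age"]).Alias("avg_age"));
        byCountry.OrderBy("country").Show();

        // Count() triggers execution and returns the number of rows.
        Console.WriteLine($"Adults: {adults.Count()}, via SQL filter: {adultsSql.Count()}");

        // Collect() brings all rows to the driver; use it only on small results.
        foreach (Row row in withNext.Collect())
        {
            Console.WriteLine(row.Get(0));
        }

        spark.Stop();
    }
}
```

Methods such as `Filter`, `WithColumn`, and `GroupBy` return new objects and leave `people` unchanged, which is why each result is assigned to its own variable.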

Applies to
