I need to transform an array column in my dataframe. The array is called 'cities', has the type Array(City), and I want to put the city names in upper case.
Structure:
val cities: StructField = StructField("cities", ArrayType(CityType), nullable = true)

def CityType: StructType =
  StructType(
    Seq(
      StructField(code, StringType, nullable = true),
      StructField(name, StringType, nullable = true)
    )
  )
Code I tried:
.withColumn(
  newColumn,
  forall(
    col(cities),
    (col: Column) =>
      struct(
        Array(
          col(code),
          upper(col(name))
        ): _*
      )
  )
)
Error says
cannot resolve 'forall(...
forall is not the right function here: it only evaluates a predicate over the array and returns a boolean (and, depending on your Spark version, it may not resolve at all, hence the error). To rewrite each element of the array, use transform instead:
// sample data
val df = spark.sql("select array(struct('1' as code, 'abc' as name), struct('2' as code, 'def' as name)) cities")
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, struct, transform, upper}

val df2 = df.withColumn(
  "newcol",
  transform(
    col("cities"),
    (c: Column) => struct(c("code"), upper(c("name")))
  )
)
df2.show
+--------------------+--------------------+
| cities| newcol|
+--------------------+--------------------+
|[[1, abc], [2, def]]|[[1, ABC], [2, DEF]]|
+--------------------+--------------------+
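For reference, here is a rough PySpark equivalent of the same answer. This is only a sketch and assumes Spark 3.1+, where pyspark.sql.functions.transform accepts a Python lambda:
from pyspark.sql import functions as F

df = spark.sql("select array(struct('1' as code, 'abc' as name), struct('2' as code, 'def' as name)) cities")
df2 = df.withColumn(
    "newcol",
    # rebuild each struct element with the name field upper-cased
    F.transform("cities", lambda c: F.struct(c["code"].alias("code"), F.upper(c["name"]).alias("name")))
)
df2.show(truncate=False)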
I am trying to manually create a pyspark dataframe given certain data:
row_in = [(1566429545575348), (40.353977), (-111.701859)]
rdd = sc.parallelize(row_in)
schema = StructType(
    [
        StructField("time_epocs", DecimalType(), True),
        StructField("lat", DecimalType(), True),
        StructField("long", DecimalType(), True),
    ]
)
df_in_test = spark.createDataFrame(rdd, schema)
This gives an error when I try to display the dataframe, so I am not sure how to do this.
However, the Spark documentation seems to be a bit convoluted to me, and I got similar errors when I tried to follow those instructions.
Does anyone know how to do this?
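As a side note on the snippet above: (1566429545575348) is just an int, not a one-element tuple, so row_in is not a list of rows; DecimalType also typically expects decimal.Decimal values rather than Python floats. A minimal sketch of a likely fix, assuming the three values are meant to form a single row and that LongType/DoubleType are acceptable substitutes for DecimalType:
from pyspark.sql.types import StructType, StructField, LongType, DoubleType

row_in = [(1566429545575348, 40.353977, -111.701859)]  # one row with three columns
schema = StructType(
    [
        StructField("time_epocs", LongType(), True),
        StructField("lat", DoubleType(), True),
        StructField("long", DoubleType(), True),
    ]
)
df_in_test = spark.createDataFrame(row_in, schema)
df_in_test.show()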
Simple dataframe creation:
df = spark.createDataFrame(
    [
        (1, "foo"),  # create your data here, be consistent in the types.
        (2, "bar"),
    ],
    ["id", "label"]  # add your column names here
)
df.printSchema()
root
|-- id: long (nullable = true)
|-- label: string (nullable = true)
df.show()
+---+-----+
| id|label|
+---+-----+
| 1| foo|
| 2| bar|
+---+-----+
According to the official documentation:
When schema is a list of column names, the type of each column will be inferred from the data. (example above ↑)
When schema is pyspark.sql.types.DataType or a datatype string, it must match the real data. (examples below ↓)
# Example with a datatype string
df = spark.createDataFrame(
    [
        (1, "foo"),  # Add your data here
        (2, "bar"),
    ],
    "id int, label string",  # add column names and types here
)
# Example with pyspark.sql.types
from pyspark.sql import types as T
df = spark.createDataFrame(
    [
        (1, "foo"),  # Add your data here
        (2, "bar"),
    ],
    T.StructType(  # Define the whole schema within a StructType
        [
            T.StructField("id", T.IntegerType(), True),
            T.StructField("label", T.StringType(), True),
        ]
    ),
)
df.printSchema()
root
|-- id: integer (nullable = true) # type is forced to Int
|-- label: string (nullable = true)
Additionally, you can create your dataframe from a Pandas dataframe; the schema will be inferred from the Pandas dataframe's types:
import pandas as pd
import numpy as np
pdf = pd.DataFrame(
    {
        "col1": [np.random.randint(10) for x in range(10)],
        "col2": [np.random.randint(100) for x in range(10)],
    }
)
df = spark.createDataFrame(pdf)
df.show()
+----+----+
|col1|col2|
+----+----+
| 6| 4|
| 1| 39|
| 7| 4|
| 7| 95|
| 6| 3|
| 7| 28|
| 2| 26|
| 0| 4|
| 4| 32|
+----+----+
To elaborate/build off of @Steven's answer:
from pyspark.sql.types import StructType, StructField, FloatType, StringType

field = [
    StructField("MULTIPLIER", FloatType(), True),
    StructField("DESCRIPTION", StringType(), True),
]
schema = StructType(field)
multiplier_df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
This will create an empty dataframe.
We can now simply add a row to it:
l = [(2.3, "this is a sample description")]
rdd = sc.parallelize(l)
multiplier_df_temp = spark.createDataFrame(rdd, schema)
multiplier_df = multiplier_df.union(multiplier_df_temp)
This answer demonstrates how to create a PySpark DataFrame with createDataFrame, create_df and toDF.
df = spark.createDataFrame([("joe", 34), ("luisa", 22)], ["first_name", "age"])
df.show()
+----------+---+
|first_name|age|
+----------+---+
| joe| 34|
| luisa| 22|
+----------+---+
You can also pass createDataFrame an RDD and a schema to construct DataFrames with more precision:
from pyspark.sql import Row
from pyspark.sql.types import *
rdd = spark.sparkContext.parallelize([
    Row(name='Allie', age=2),
    Row(name='Sara', age=33),
    Row(name='Grace', age=31)])
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), False)])
df = spark.createDataFrame(rdd, schema)
df.show()
+-----+---+
| name|age|
+-----+---+
|Allie| 2|
| Sara| 33|
|Grace| 31|
+-----+---+
create_df from my Quinn project allows for the best of both worlds - it's concise and fully descriptive:
from pyspark.sql.types import *
from quinn.extensions import *
df = spark.create_df(
    [("jose", "a"), ("li", "b"), ("sam", "c")],
    [("name", StringType(), True), ("blah", StringType(), True)]
)
df.show()
+----+----+
|name|blah|
+----+----+
|jose| a|
| li| b|
| sam| c|
+----+----+
toDF doesn't offer any advantages over the other approaches:
from pyspark.sql import Row
rdd = spark.sparkContext.parallelize([
    Row(name='Allie', age=2),
    Row(name='Sara', age=33),
    Row(name='Grace', age=31)])
df = rdd.toDF()
df.show()
+-----+---+
| name|age|
+-----+---+
|Allie| 2|
| Sara| 33|
|Grace| 31|
+-----+---+
With formatting
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, IntegerType, StringType
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        (1, "foo"),
        (2, "bar"),
    ],
    StructType(
        [
            StructField("id", IntegerType(), False),
            StructField("txt", StringType(), False),
        ]
    ),
)
print(df.dtypes)
df.show()
Extending @Steven's answer:
data = [(i, 'foo') for i in range(1000)] # random data
columns = ['id', 'txt'] # add your columns label here
df = spark.createDataFrame(data, columns)
Note: When schema is a list of column-names, the type of each column will be inferred from data.
If you want to specifically define schema then do this:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([StructField("id", IntegerType(), True), StructField("txt", StringType(), True)])
df1 = spark.createDataFrame(data, schema)
Outputs:
>>> df1
DataFrame[id: int, txt: string]
>>> df
DataFrame[id: bigint, txt: string]
For beginners, here is a full example importing data from a file:
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    ShortType,
    StringType,
    StructType,
    StructField,
    TimestampType,
)
import os
here = os.path.abspath(os.path.dirname(__file__))
spark = SparkSession.builder.getOrCreate()
schema = StructType(
    [
        StructField("id", ShortType(), nullable=False),
        StructField("string", StringType(), nullable=False),
        StructField("datetime", TimestampType(), nullable=False),
    ]
)
# read file or construct rows manually
df = spark.read.csv(os.path.join(here, "data.csv"), schema=schema, header=True)
I have a dataframe and a list of values (possibly a list of strings), and I want to create a new column in my dataframe and add those list values as the values of this new column. I tried
val x = List("def", "cook", "abc")
val c_df = null
x.foldLeft(c_df)((df, column) => df.withColumn("newcolumnname" , lit(column)))
but it throws a StackOverflow exception. I also tried iterating over the list of string values and adding them to the dataframe, but the result is a list of dataframes, while all I want is a single dataframe.
Please help!
Here is the sample input and output dataframe:
You can try the code below.
Create the first DataFrame with an index column:
from pyspark.sql.functions import *
from pyspark.sql import Window
w = Window.orderBy("Col2")
df = spark.createDataFrame([("a", 10), ("b", 20), ("c", 30)], ["Col1", "Col2"])
df1 = df.withColumn("index", row_number().over(w))
df1.show()
Create another DataFrame from the list of values:
from pyspark.sql.types import *

newdf = spark.createDataFrame(['x', 'y', 'z'], StringType())
newdf.show()
Add an index column to the DataFrame created from the list of values in step 2:
w = Window.orderBy("value")
df2 = newdf.withColumn("index", row_number().over(w))
df2.show()
Join df1 and df2 on the index column:
df1.join(df2, "index").show()
In Spark 1.4 and later there is a function array that takes a sequence of Columns and returns a new Column. The function lit takes a Scala value and returns a Column.
import spark.implicits._
import org.apache.spark.sql.functions.{array, lit}

val df = Seq(1, 2, 3).toDF("col1")
df.withColumn("new_col", array(lit("def"), lit("cook"), lit("abc"))).show
+----+----------------+
|col1| new_col|
+----+----------------+
| 1|[def, cook, abc]|
| 2|[def, cook, abc]|
| 3|[def, cook, abc]|
+----+----------------+
Since Spark 2.2.0 there is a function typedLit that takes Scala types and returns a Column. This function can handle parameterized Scala types, e.g. List, Seq and Map.
import org.apache.spark.sql.functions.typedLit

val newDF = df.withColumn("new_col", typedLit(List("def", "cook", "abc")))
newDF.show()
newDF.printSchema()
+----+----------------+
|col1| new_col|
+----+----------------+
| 1|[def, cook, abc]|
| 2|[def, cook, abc]|
| 3|[def, cook, abc]|
+----+----------------+
root
|-- col1: integer (nullable = false)
|-- new_col: array (nullable = false)
| |-- element: string (containsNull = true)
Is this what you wanted to do? You can add when to conditionally add a different list to each row.
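For illustration, a minimal PySpark sketch of that idea; the column name col1 and the condition are made up for this example:
from pyspark.sql import functions as F

df = spark.createDataFrame([(1,), (2,), (3,)], ["col1"])
df.withColumn(
    "new_col",
    # pick a different list depending on a per-row condition
    F.when(F.col("col1") == 1, F.array(F.lit("def"), F.lit("cook"), F.lit("abc")))
     .otherwise(F.array(F.lit("x"), F.lit("y")))
).show(truncate=False)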
I have a dataframe in (py)Spark where one of the columns is of type 'map'. I want to flatten or split that column into multiple columns, which should be added to the original dataframe. I'm able to unfold the column with flatMap; however, I lose the key needed to join the new dataframe (from the unfolded column) back to the original dataframe.
My schema is like this:
root
|-- key: string (nullable = true)
|-- metric: map (nullable = false)
| |-- key: string
| |-- value: float (valueContainsNull = true)
As you can see, the column 'metric' is a map-field. This is the column that I want to flatten. Before flattening it looks like:
+----+---------------------------------------------------+
|key |metric |
+----+---------------------------------------------------+
|123k|Map(metric1 -> 1.3, metric2 -> 6.3, metric3 -> 7.6)|
|d23d|Map(metric1 -> 1.5, metric2 -> 2.0, metric3 -> 2.2)|
|as3d|Map(metric1 -> 2.2, metric2 -> 4.3, metric3 -> 9.0)|
+----+---------------------------------------------------+
To convert that field to columns I do
df2.select('metric').rdd.flatMap(lambda x: x).toDF().show()
which gives
+------------------+-----------------+-----------------+
| metric1| metric2| metric3|
+------------------+-----------------+-----------------+
|1.2999999523162842|6.300000190734863|7.599999904632568|
| 1.5| 2.0|2.200000047683716|
| 2.200000047683716|4.300000190734863| 9.0|
+------------------+-----------------+-----------------+
However, I don't see the key, so I don't know how to add this data back to the original dataframe.
What I want is:
+----+-------+-------+-------+
| key|metric1|metric2|metric3|
+----+-------+-------+-------+
|123k| 1.3| 6.3| 7.6|
|d23d| 1.5| 2.0| 2.2|
|as3d| 2.2| 4.3| 9.0|
+----+-------+-------+-------+
My question thus is: how can I get from df2 back to df (given that I originally don't know df and only have df2)?
To make df2:
from pyspark.sql.types import StructType, StructField, StringType, FloatType

rdd = sc.parallelize([('123k', 1.3, 6.3, 7.6),
                      ('d23d', 1.5, 2.0, 2.2),
                      ('as3d', 2.2, 4.3, 9.0)])
schema = StructType([StructField('key', StringType(), True),
                     StructField('metric1', FloatType(), True),
                     StructField('metric2', FloatType(), True),
                     StructField('metric3', FloatType(), True)])
df = sqlContext.createDataFrame(rdd, schema)
from pyspark.sql.functions import lit, col, create_map
from itertools import chain
metric = create_map(list(chain(*(
(lit(name), col(name)) for name in df.columns if "metric" in name
)))).alias("metric")
df2 = df.select("key", metric)
from pyspark.sql.functions import explode
# fetch column names of the original dataframe from keys of MapType 'metric' column
col_names = df2.select(explode("metric")).select("key").distinct().sort("key").rdd.flatMap(lambda x: x).collect()
exprs = [col("key")] + [col("metric").getItem(k).alias(k) for k in col_names]
df2_to_original_df = df2.select(*exprs)
df2_to_original_df.show()
Output is:
+----+-------+-------+-------+
| key|metric1|metric2|metric3|
+----+-------+-------+-------+
|123k| 1.3| 6.3| 7.6|
|d23d| 1.5| 2.0| 2.2|
|as3d| 2.2| 4.3| 9.0|
+----+-------+-------+-------+
I can select a certain key from a maptype column by doing:
df.select(df.maptypecolumn['key'])
In my example I did it as follows:
from pyspark.sql.functions import lit

columns = df2.select('metric').rdd.flatMap(lambda x: x).toDF().columns
for i in columns:
    df2 = df2.withColumn(i, lit(df2.metric[i]))
You can access key and value for example like this:
from pyspark.sql.functions import explode
df.select(explode("custom_dimensions")).select("key")
I am using the spark libraries in Scala. I have created a DataFrame using
val searchArr = Array(
  StructField("log", IntegerType, true),
  StructField("user", StructType(Array(
    StructField("date", StringType, true),
    StructField("ua", StringType, true),
    StructField("ui", LongType, true))), true),
  StructField("what", StructType(Array(
    StructField("q1", ArrayType(IntegerType, true), true),
    StructField("q2", ArrayType(IntegerType, true), true),
    StructField("sid", StringType, true),
    StructField("url", StringType, true))), true),
  StructField("where", StructType(Array(
    StructField("o1", IntegerType, true),
    StructField("o2", IntegerType, true))), true)
)
val searchSt = new StructType(searchArr)
val searchData = sqlContext.jsonFile(searchPath, searchSt)
I now want to explode the field what.q1, which should contain an array of integers, but the documentation is limited:
http://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrame.html#explode(java.lang.String,%20java.lang.String,%20scala.Function1,%20scala.reflect.api.TypeTags.TypeTag)
So far I have tried a few things without much luck:
val searchSplit = searchData.explode("q1", "rb")(q1 => q1.getList[Int](0).toArray())
Any ideas/examples of how to use explode on an array?
Did you try a UDF on the field "what"? Something like this could be useful:
val explode = udf {
  (aStr: GenericRowWithSchema) =>
    aStr match {
      case null => ""
      case _ => aStr.getList(0).get(0).toString()
    }
}
val newDF = df.withColumn("newColumn", explode(col("what")))
where:
getList(0) returns the "q1" field
get(0) returns the first element of "q1"
I'm not sure but you could try to use getAs[T](fieldName: String) instead of getList(index: Int).
I'm not used to Scala, but in Python/PySpark, an array-type column nested within a struct-type field can be exploded as follows. If this works for you, you can convert it to the corresponding Scala representation.
from pyspark.sql.functions import col, explode
from pyspark.sql.types import ArrayType, IntegerType, LongType, StringType, StructField, StructType
schema = StructType([
    StructField("log", IntegerType()),
    StructField("user", StructType([
        StructField("date", StringType()),
        StructField("ua", StringType()),
        StructField("ui", LongType())])),
    StructField("what", StructType([
        StructField("q1", ArrayType(IntegerType())),
        StructField("q2", ArrayType(IntegerType())),
        StructField("sid", StringType()),
        StructField("url", StringType())])),
    StructField("where", StructType([
        StructField("o1", IntegerType()),
        StructField("o2", IntegerType())]))
])
data = [((1), ("2022-01-01","ua",1), ([1,2,3],[6],"sid","url"), (7,8))]
df = spark.createDataFrame(data=data, schema=schema)
df.show(truncate=False)
Output:
+---+-------------------+--------------------------+------+
|log|user |what |where |
+---+-------------------+--------------------------+------+
|1 |{2022-01-01, ua, 1}|{[1, 2, 3], [6], sid, url}|{7, 8}|
+---+-------------------+--------------------------+------+
With what.q1 exploded:
df.withColumn("what.q1_exploded", explode(col("what.q1"))).show(truncate=False)
Output:
+---+-------------------+--------------------------+------+----------------+
|log|user |what |where |what.q1_exploded|
+---+-------------------+--------------------------+------+----------------+
|1 |{2022-01-01, ua, 1}|{[1, 2, 3], [6], sid, url}|{7, 8}|1 |
|1 |{2022-01-01, ua, 1}|{[1, 2, 3], [6], sid, url}|{7, 8}|2 |
|1 |{2022-01-01, ua, 1}|{[1, 2, 3], [6], sid, url}|{7, 8}|3 |
+---+-------------------+--------------------------+------+----------------+