I'm trying to write some data into BigQuery using Spark Scala. My Spark dataframe looks like this:
root
|-- id: string (nullable = true)
|-- cost: double (nullable = false)
|-- nodes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- settled: string (nullable = true)
| | |-- constant: string (nullable = true)
|-- status: string (nullable = true)
I tried to change the structure of the dataframe:
val schema = StructType(Array(
  StructField("id", StringType, true),
  StructField("cost", DoubleType, true),
  StructField("nodes", StructType(Array(
    StructField("settled", StringType),
    StructField("constant", StringType)))),
  StructField("status", StringType, true)))
val actualDf = spark.createDataFrame(results, schema)
But it didn't work. When this is written into BigQuery, the column names look as follows:
id, cost, nodes.list.element.settled, nodes.list.element.constant, status
Is there a way to change these column names to:
id, cost, settled, constant, status
You can explode the nodes array to get a flattened structure of columns, then write the dataframe to BigQuery.
Example:
import org.apache.spark.sql.functions._
import spark.implicits._

val jsn_ds = Seq("""{"id":1, "cost": "2.0","nodes":[{"settled":"u","constant":"p"}],"status":"s"}""").toDS
spark.read.json(jsn_ds).printSchema
// root
// |-- cost: string (nullable = true)
// |-- id: long (nullable = true)
// |-- nodes: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- constant: string (nullable = true)
// | | |-- settled: string (nullable = true)
// |-- status: string (nullable = true)
spark.read.json(jsn_ds).
  withColumn("expld", explode('nodes)).
  select("*", "expld.*").
  drop("expld", "nodes").
  show()
//+----+---+------+--------+-------+
//|cost| id|status|constant|settled|
//+----+---+------+--------+-------+
//| 2.0| 1| s| p| u|
//+----+---+------+--------+-------+
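To then write the flattened dataframe to BigQuery, a minimal sketch (assuming the spark-bigquery connector is on the classpath; the bucket and table names below are placeholders):

val flattened = spark.read.json(jsn_ds)
  .withColumn("expld", explode('nodes))
  .select("*", "expld.*")
  .drop("expld", "nodes")

flattened.write
  .format("bigquery")
  .option("temporaryGcsBucket", "some-temp-bucket") // staging bucket for the indirect write method
  .save("my_dataset.my_table")                      // destination table: <dataset>.<table>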
I have a dataframe with the following schema:
root
|-- id: string (nullable = true)
|-- collect_list(typeCounts): array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- type: string (nullable = true)
| | | |-- count: long (nullable = false)
Example data:
+-----------+----------------------------------------------------------------+
|id         |collect_list(typeCounts)                                        |
+-----------+----------------------------------------------------------------+
|1          |[WrappedArray([B00XGS,6], [B001FY,5]), WrappedArray([B06LJ7,4])]|
|2          |[WrappedArray([B00UFY,3])]                                      |
+-----------+----------------------------------------------------------------+
How can I flatten collect_list(typeCounts) into a flat array of structs in Scala? I have read some answers on Stack Overflow for similar questions suggesting UDFs, but I am not sure what the UDF method signature should be for structs.
If you're on Spark 2.4+, instead of using a UDF (which is generally less efficient than native Spark functions) you can apply flatten, like below:
df.withColumn("collect_list(typeCounts)", flatten($"collect_list(typeCounts)"))
I am not sure what the UDF method signature should be for structs
A UDF takes structs as Rows for input and may return them as Scala case classes. To flatten the nested collections, you can create a simple UDF as follows:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

case class TC(`type`: String, count: Long)

val flattenLists = udf { (lists: Seq[Seq[Row]]) =>
  lists.flatMap(_.map { case Row(t: String, c: Long) => TC(t, c) })
}
To test out the UDF, let's assemble a DataFrame with your described schema:
val df = Seq(
  ("1", Seq(TC("B00XGS", 6), TC("B001FY", 5))),
  ("1", Seq(TC("B06LJ7", 4))),
  ("2", Seq(TC("B00UFY", 3)))
).toDF("id", "typeCounts").
  groupBy("id").agg(collect_list("typeCounts"))
df.printSchema
// root
// |-- id: string (nullable = true)
// |-- collect_list(typeCounts): array (nullable = true)
// | |-- element: array (containsNull = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- type: string (nullable = true)
// | | | |-- count: long (nullable = false)
Applying the UDF:
df.
  withColumn("collect_list(typeCounts)", flattenLists($"collect_list(typeCounts)")).
  printSchema
// root
// |-- id: string (nullable = true)
// |-- collect_list(typeCounts): array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- type: string (nullable = true)
// | | |-- count: long (nullable = false)
I want to create a predefined schema in Spark/Scala so that I can read the JSON files accordingly.
The structure of the schema is as below:
root
|-- arrayCol: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- email: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- id2: string (nullable = true)
| | |-- price: string (nullable = true)
| | |-- qty: long (nullable = true)
| | |-- window: struct (nullable = true)
| | | |-- end: string (nullable = true)
| | | |-- start: string (nullable = true)
|-- primaryKeys: string (nullable = true)
|-- state: string (nullable = true)
I was able to create the schema, but I am stuck at the point where the elements have two sub-elements. This is what I have tried:
import org.apache.spark.sql.types._

val testSchema = StructType(
  List(
    StructField("primaryKeys", StringType, true),
    StructField("state", IntegerType, true),
    StructField("email", ArrayType(StringType, true), true),
    StructField("id", StringType, true),
    StructField("name", StringType, true),
    StructField("id2", StringType, true),
    StructField("price", StringType, true),
    StructField("qty", StringType, true),
    StructField("window", ArrayType(StringType, true), true)
  ))
I am not able to figure out how start and end can be included inside that window element.
It is a nested struct, so the window field's type should be a StructType:
StructField("window", StructType(Seq(StructField("end", StringType, true), StructField("start", StringType, true))), true)
By the way, you can get the schema from case classes as follows:
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

case class ArrayColWindow(end: String, start: String)
case class ArrayCol(id: String, email: Seq[String], qty: Long, rqty: Long, pids: Seq[String],
                    sqty: Long, id1: String, id2: String, window: ArrayColWindow, otherId: String)
case class FullArrayCols(arrayCol: Seq[ArrayCol], primarykey: String, runtime: String)

val schema = ScalaReflection.schemaFor[FullArrayCols].dataType.asInstanceOf[StructType]
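A short usage sketch for reading the JSON files against the predefined schema (the path is a placeholder):

// Read the files with the predefined schema instead of letting Spark infer it.
val df = spark.read
  .schema(schema)
  .json("/path/to/json/files") // placeholder path

df.printSchema()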
The way I figured out to make it work is as follows:
val arrayStructureSchema = new StructType()
.add("primaryKeys",StringType,true)
.add("runtime", StringType, true)
.add("email", ArrayType(StringType))
.add("id", StringType)
.add("id1", StringType)
.add("id2", StringType)
.add("otherId", StringType)
.add("qty", StringType)
.add("rqty", StringType)
.add("sqty", StringType)
.add("window", new StructType()
.add("end",StringType)
.add("start",StringType))
Structure of the Schema to be created:
|-- col1: boolean (nullable = true)
|-- col2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- col2_1: boolean (nullable = true)
| | |-- col2_2: string (nullable = true)
Code to create schema:
val prodSchema = StructType(Array(StructField("col1", StringType), StructField("col2",ArrayType(Array(StructField("element",StructType(Array(StructField("col2_1",StringType)))))))))
Error:
found : Array[org.apache.spark.sql.types.StructField]
required: org.apache.spark.sql.types.DataType
StructField("col2",ArrayType(Array(StructField("element",StructType(Array(StructField("col2_1",StringType)))))))
Any suggestions on how to correct this schema error?
ArrayType expects a single element DataType (here a StructType), not an Array of StructFields. I think you can write it like this:
val prodSchema =
  StructType(
    List(
      StructField("col1", BooleanType),
      StructField("col2", ArrayType(
        StructType(
          List(
            StructField("col2_1", BooleanType),
            StructField("col2_2", StringType)
          )
        )
      ))
    )
  )
prodSchema.printTreeString()
root
|-- col1: boolean (nullable = true)
|-- col2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- col2_1: boolean (nullable = true)
| | |-- col2_2: string (nullable = true)
Try this:
val schema = StructType(Seq(
  StructField("col1", BooleanType, false),
  StructField("col2", ArrayType(StructType(Seq(
    StructField("col2_1", BooleanType, true),
    StructField("col2_2", StringType, true)
  ))))
))
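To sanity-check a schema like this without any data, one option (a sketch, assuming a SparkSession named spark) is to build an empty dataframe from it and print the tree:

import org.apache.spark.sql.Row

// Empty dataframe with the schema attached; printSchema shows the resulting tree.
val empty = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
empty.printSchema()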
You could use the Schema DSL to create the schema:
import org.apache.spark.sql.types._
import spark.implicits._ // for the $"name" column syntax

val col2 = new StructType().add($"col2_1".boolean).add($"col2_2".string)

val schema = new StructType()
  .add($"col1".boolean)
  .add($"col2".array(col2))
schema.printTreeString()
root
|-- col1: boolean (nullable = true)
|-- col2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- col2_1: boolean (nullable = true)
| | |-- col2_2: string (nullable = true)
Hope it helps.
I have two dataframes created in Spark:
xml_df:
root
|-- _defaultedgetype: string (nullable = true)
|-- _mode: string (nullable = true)
and nodes_df:
root
|-- nodes: struct (nullable = false)
| |-- _id: string (nullable = true)
| |-- _label: string (nullable = true)
xml_df will always have just one row, as shown below:
+----------------+------+
|_defaultedgetype| _mode|
+----------------+------+
| undirected|static|
+----------------+------+
and the nodes_df data:
+-----+
|nodes|
+-----+
|[1,1]|
|[2,2]|
|[3,3]|
|[4,4]|
|[5,5]|
+-----+
I used the struct function on nodes_df to put _id and _label inside the struct. Based on that, I would like to add a third column to the xml_df dataframe that contains the structs created in the nodes_df dataframe.
I tried to use a join, creating a literal id for each entry in nodes_df, but the resulting column was null.
Any light please?
I found out why my join was not working.
I needed to use aggregation for the nodes column; then I was able to properly join the dataframes.
I created an id for the xml_df:
StructType(List(
  StructField("id", IntegerType, true),
  StructField("_defaultedgetype", StringType, true),
  StructField("_mode", StringType, true)))
and the same for nodes_df:
val nodes_schema = StructType(List(
  StructField("id", IntegerType, true),
  StructField("_id", StringType, true),
  StructField("_label", StringType, true)
))
I used the id 666 for both of them and used aggregation on nodes_df:
nodes_df = nodes_df.groupBy("id").agg(collect_set("nodes").as("node"))
and joined with xml_df:
xml_df = xml_df.join(nodes_df, Seq("id"),"right").drop("id")
The result is:
+----------------+------+--------------------+
|_defaultedgetype| _mode| node|
+----------------+------+--------------------+
| undirected|static|[[2,2], [3,3], [5...|
+----------------+------+--------------------+
root
|-- _defaultedgetype: string (nullable = true)
|-- _mode: string (nullable = true)
|-- node: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _id: string (nullable = true)
| | |-- _label: string (nullable = true)
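Alternatively, since xml_df only ever has one row, you could skip the dummy id entirely: aggregate nodes_df into a single-row frame and cross join it (an untested sketch):

import org.apache.spark.sql.functions.collect_set

// Collapse all node structs into one array column (a single row),
// then attach it to the single xml_df row with a cross join.
val nodesAgg = nodes_df.agg(collect_set("nodes").as("node"))
val result   = xml_df.crossJoin(nodesAgg)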
I'm using Spark's MLlib DataFrame ALS functionality on Spark 2.2.0. I had to run my userId and itemId fields through a StringIndexer to get things going.
The method recommendForAllUsers returns the following schema:
root
|-- recommendations: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- itemIdIndex: long (nullable = true)
| | |-- rating: double (nullable = true)
|-- userIdIndex: string (nullable = true)
This is perfect for my needs (I would love not to flatten it), but I need to replace userIdIndex and itemIdIndex with their actual values.
For userIdIndex this was OK (I couldn't simply reverse it with IndexToString, as the ALS fitting seems to erase the link between index and value):
df.join(df2, df2("userIdIndex") === df("userIdIndex"), "left")
  .select(df2("userId"), df("recommendations"))
where df2 looks like this:
+------------------+--------------------+----------+-----------+-----------+
|            userId|              itemId|    rating|userIdIndex|itemIdIndex|
+------------------+--------------------+----------+-----------+-----------+
|glorified-consumer|          item-22302|       3.0|       15.0|        4.0|
+------------------+--------------------+----------+-----------+-----------+
The result is this schema:
root
|-- userId: string (nullable = true)
|-- recommendations: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- itemIdIndex: integer (nullable = true)
| | |-- rating: float (nullable = true)
QUESTION: how can I do the same for itemIdIndex, which is inside an array of structs?
You can explode the array so that only the struct remains:
val tempdf2 = df2.withColumn("recommendations", explode('recommendations))
which should leave you with a schema like:
root
|-- userId: string (nullable = true)
|-- recommendations: struct (nullable = true)
| |-- itemIdIndex: string (nullable = true)
| |-- rating: string (nullable = true)
Do the same for df (the first dataframe) to get tempdf1.
Then you can join them as:
tempdf1.join(tempdf2, tempdf1("recommendations.itemIdIndex") === tempdf2("recommendations.itemIdIndex"))
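To then swap itemIdIndex for the original itemId, one option (a sketch; ratingsDf is a placeholder name for the indexed dataframe shown in the table above) is to join against a distinct index-to-id lookup and select the columns you need:

import org.apache.spark.sql.functions._

// Distinct mapping from the ALS item index back to the original item id,
// taken from the dataframe that went through the StringIndexer.
val itemLookup = ratingsDf.select("itemIdIndex", "itemId").distinct()

val withItemIds = tempdf2
  .join(itemLookup, tempdf2("recommendations.itemIdIndex") === itemLookup("itemIdIndex"), "left")
  // the two index columns may need a cast if their types differ (e.g. double vs integer)
  .select(col("userId"), col("itemId"), col("recommendations.rating").as("rating"))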