I need some help accessing names within nested columns. I have, for example, the following schema:
root
|-- id_1: string (nullable = true)
|-- array_1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id_2: string (nullable = true)
| | |-- post: struct (nullable = true)
| | | |-- value: double (nullable = true)
By using
cols = df.columns
I get a list of all the names at the root level:
cols = [id_1, array_1,...]
However, I would like to access the names within e.g. 'array_1'. Using
df.array_1.columns
simply returns
Column<b'array_1[columns]'>
and no names. Is there any way to access the names within arrays? The same issue arises with structs. This would make it easier for me to loop and write functions. If it is possible to avoid extra modules, that would be beneficial.
Thanks
You can use the schema of the DataFrame to look up column names, via the StructType and StructField APIs. Example Scala Spark code (adapt it to your needs):
import org.apache.spark.sql.types._
case class A(a: Int, b: String)
val df = Seq(("a", Array(A(1, "asd"))), ("b", Array(A(2, "dsa")))).toDF("str_col", "arr_col")
df.schema
> res19: org.apache.spark.sql.types.StructType = StructType(StructField(str_col,StringType,true), StructField(arr_col,ArrayType(StructType(StructField(a,IntegerType,false), StructField(b,StringType,true)),true),true))
val fields = df.schema.fields
fields(0).name
> res22: String = str_col
fields(1).dataType.asInstanceOf[ArrayType].elementType
> res23: org.apache.spark.sql.types.DataType = StructType(StructField(a,IntegerType,false), StructField(b,StringType,true))
.....
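Since the goal is to loop over nested names, here is a minimal sketch of a recursive helper that walks the schema and collects qualified field names, descending into structs and array elements (the function name and the dot-separated naming are my own choices, not part of the answer above):

import org.apache.spark.sql.types._

// Collect fully qualified field names, recursing into structs and array elements.
def fieldNames(dataType: DataType, prefix: String = ""): Seq[String] = dataType match {
  case s: StructType =>
    s.fields.toSeq.flatMap { f =>
      val name = if (prefix.isEmpty) f.name else s"$prefix.${f.name}"
      name +: fieldNames(f.dataType, name)
    }
  case a: ArrayType => fieldNames(a.elementType, prefix)
  case _            => Seq.empty
}

For the df above, fieldNames(df.schema) would return something like Seq("str_col", "arr_col", "arr_col.a", "arr_col.b").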
Related
I have a dataset with the following schema:
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- subEntities: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- status: string (nullable = true)
| | |-- subEntityId: long (nullable = true)
| | |-- subEntityName: string (nullable = true)
dataset.select($"id", $"name", $"subEntities.subEntityId", $"subEntities.subEntityName") put subEntityId and subEntityName into separate arrays. How to select multiple columns and put them into single array?
If you are working on Spark >= 2.4 you can use the transform function to generate an array that contains a subset of the original array's fields:
import org.apache.spark.sql.functions.expr
dataset.withColumn("newArray", expr("transform(subEntities, i -> struct(i.subEntityId, i.subEntityName))"))
// or with select
dataset.select(
$"id",
$"name",
expr("transform(subEntities, i -> struct(i.subEntityId, i.subEntityName))").as("newArray")
)
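For Spark versions before 2.4 (where transform is not available), one possible workaround, sketched here and not part of the answer above, is a small UDF that performs the same projection (the SubEntityRef case class name is my own choice):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// Target element shape: only the two fields we want to keep.
case class SubEntityRef(subEntityId: Long, subEntityName: String)

// UDF that maps each struct in the array to the reduced struct.
val projectSubEntities = udf { rows: Seq[Row] =>
  rows.map(r => SubEntityRef(r.getAs[Long]("subEntityId"), r.getAs[String]("subEntityName")))
}

dataset.withColumn("newArray", projectSubEntities(col("subEntities")))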
.withColumn("status",col("subEntities").getField("status"))
.withColumn("subEntityId",col("subEntities").getField("subEntityId"))
To extract value out of your array
Below is working example
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ExplodeArray {
  def main(args: Array[String]): Unit = {
    // Obtain a SparkSession however your project usually does it
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val df = List(
      bean57("1", Array(bean55("aaa", 2), bean55("aaa1", 21))),
      bean57("2", Array(bean55("bbb", 3), bean55("bbb3", 31)))).toDF

    df
      .withColumn("status", col("subEntities").getField("status"))           // array of all status values
      .withColumn("subEntityId", col("subEntities").getField("subEntityId")) // array of all subEntityId values
      .show()
  }
}

case class bean57(id: String, subEntities: Array[bean55])
case class bean55(status: String, subEntityId: Long)
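Note that getField on an array of structs returns an array of that field's values, so status and subEntityId above end up as array columns (roughly [aaa, aaa1] and [2, 21] for the first row), not scalars. If one row per sub-entity is wanted instead, a hedged sketch using explode (not part of the example above):

df.select(col("id"), explode(col("subEntities")).as("sub"))
  .select(col("id"), col("sub.status"), col("sub.subEntityId"))
  .show()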
I am currently having some problems creating a Spark Row object and converting it to a Spark DataFrame. What I am trying to achieve is:
I have two lists of custom types that look more or less like the classes below:
case class MyObject(name:String,age:Int)
case class MyObject2(xyz:String,abc:Double)
val listOne = List(MyObject("aaa",22),MyObject("sss",223))
val listTwo = List(MyObject2("bbb",23),MyObject2("abc",2332))
Using these two lists, I want to create a DataFrame that has one row and two fields (fieldOne and fieldTwo):
fieldOne --> a list of structs (shaped like MyObject)
fieldTwo --> a list of structs (shaped like MyObject2)
In order to achieve this I created custom StructTypes for MyObject, MyObject2 and my result type:
val myObjSchema = StructType(List(
StructField("name",StringType),
StructField("age",IntegerType)
))
val myObjSchema2 = StructType(List(
StructField("xyz",StringType),
StructField("abc",DoubleType)
))
val myRecType = StructType(
List(
StructField("myField",ArrayType(myObjSchema)),
StructField("myField2",ArrayType(myObjSchema2))
)
)
I populated my data within a Spark Row object and created a DataFrame:
val data = Row(
List(MyObject("aaa",22),MyObject("sss",223)),
List(MyObject2("bbb",23),MyObject2("abc",2332))
)
val df = spark.createDataFrame(
spark.sparkContext.parallelize(Seq(data)),myRecType
)
When I call printSchema on the DataFrame, the output is exactly what I would expect:
root
|-- myField: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- age: integer (nullable = true)
|-- myField2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- xyz: string (nullable = true)
| | |-- abc: double (nullable = true)
However, when I call show, I get a runtime exception:
Caused by: java.lang.RuntimeException: spark_utilities.example.MyObject is not a valid external type for schema of struct<name:string,age:int>
It looks like something is wrong with the Row object; can you please explain what is going wrong here?
Thanks a lot for your help!
PS: I know I can create a custom case class like case class PH(ls: List[MyObject], ls2: List[MyObject2]), populate it, and convert it to a Dataset. But due to some limitations I cannot use this approach and would like to solve it in the way mentioned above.
You cannot simply insert your case class objects inside a Row; you need to convert those objects to Rows themselves:
val data = Row(
  List(Row("aaa", 22), Row("sss", 223)),
  List(Row("bbb", 23d), Row("abc", 2332d))
)
val df = spark.createDataFrame(
spark.sparkContext.parallelize(Seq(data)),myRecType
)
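If the data really starts out as the case-class lists from the question, here is a minimal sketch of building that Row programmatically instead of by hand, mapping each object to a Row field by field (this assumes the listOne, listTwo and myRecType definitions from the question):

import org.apache.spark.sql.Row

// Each case class instance becomes a Row whose field order matches the StructTypes above.
val data = Row(
  listOne.map(o => Row(o.name, o.age)),
  listTwo.map(o => Row(o.xyz, o.abc))
)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(data)), myRecType
)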
I grouped by a few columns and am getting WrappedArray out of these columns, as you can see in the schema. How do I get rid of them so I can proceed to the next step and do an orderBy?
val sqlDF = spark.sql("SELECT * FROM parquet.`parquet/20171009121227/rels/*.parquet`")
Getting a DataFrame:
val final_df = groupedBy_DF.select(
groupedBy_DF("collect_list(relev)").as("rel"),
groupedBy_DF("collect_list(relev2)").as("rel2"))
Then printing the schema with final_df.printSchema gives us:
root
|-- rel: array (nullable = true)
| |-- element: double (containsNull = true)
|-- rel2: array (nullable = true)
| |-- element: double (containsNull = true)
Sample current output:
I am trying to convert to this:
|-- rel: double (nullable = true)
|-- rel2: double (nullable = true)
Desired example output (from the picture above):
-1.0,0.0
-1.0,0.0
In the case where collect_list will always only return one value, use first instead. Then there is no need to handle the issue of having an Array. Note that this should be done during the groupBy step.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
val final_df = df.groupBy(...)
.agg(first($"relev").as("rel"),
first($"relev2").as("rel2"))
Try col(x).getItem:
groupedBy_DF.select(
groupedBy_DF("collect_list(relev)").as("rel"),
groupedBy_DF("collect_list(relev2)").as("rel2")
).withColumn("rel_0", col("rel").getItem(0))
Try split
import org.apache.spark.sql.functions._
val final_df = groupedBy_DF.select(
groupedBy_DF("collect_list(relev)").as("rel"),
groupedBy_DF("collect_list(relev2)").as("rel2"))
.withColumn("rel",split("rel",","))
I have a UDF which returns an array of tuples:
val df = spark.range(1).toDF("i")
val myUDF = udf((l:Long) => {
Seq((1,2))
})
df.withColumn("udf_result",myUDF($"i"))
.printSchema
gives
root
|-- i: long (nullable = false)
|-- udf_result: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: integer (nullable = false)
| | |-- _2: integer (nullable = false)
I want to rename the elements of the struct to something meaningful instead of _1 and _2; how can this be achieved? Note that I'm aware that returning a Seq of case classes would allow me to give proper field names, but using Spark-Notebook (REPL) with Yarn we have many issues using case classes, so I'm looking for a solution without case classes.
I'm using Spark 2, but since these are untyped DataFrames the solution should also be applicable to Spark 1.6.
It is possible to cast the output of the UDF. E.g., to rename the struct fields to x and y, you can do:
type-safe:
import org.apache.spark.sql.types._

val schema = ArrayType(
StructType(
Array(
StructField("x",IntegerType),
StructField("y",IntegerType)
)
)
)
df.withColumn("udf_result",myUDF($"i").cast(schema))
or, unsafe but shorter, using the string argument to cast:
df.withColumn("udf_result",myUDF($"i").cast("array<struct<x:int,y:int>>"))
both will give the schema
root
|-- i: long (nullable = false)
|-- udf_result: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- x: integer (nullable = true)
| | |-- y: integer (nullable = true)
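As a follow-up, once the cast has renamed the fields they can be addressed by name; a hand-written sketch (the x_values alias is arbitrary):

df.withColumn("udf_result", myUDF($"i").cast("array<struct<x:int,y:int>>"))
  .select($"i", $"udf_result".getField("x").as("x_values")) // array of all x values per row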
I have as input a set of files formatted as a single JSON object per line. The problem, however, is that one field in these JSON objects is a JSON-escaped string. Example:
{"clientAttributes":{"backfillId":null,"clientPrimaryKey":"abc"},"escapedJsonPayload":"{\"name\":\"Akash\",\"surname\":\"Patel\",\"items\":[{\"itemId\":\"abc\",\"itemName\":\"xyz\"}"}
When I create a DataFrame by reading the JSON file, it creates the DataFrame below:
val df = spark.sqlContext.read.json("file:///home/akaspate/sample.json")
df: org.apache.spark.sql.DataFrame = [clientAttributes: struct<backfillId: string, clientPrimaryKey: string>, escapedJsonPayload: string]
As we can see "escapedJsonPayload" is String and I need it to be Struct.
Note: I got similar question in StackOverflow and followed it (How to let Spark parse a JSON-escaped String field as a JSON Object to infer the proper structure in DataFrames?) but it is giving me "[_corrupt_record: string]"
I have tried below steps
val df = spark.sqlContext.read.json("file:///home/akaspate/sample.json") (this works fine)
val escapedJsons: RDD[String] = sc.parallelize(Seq("""df"""))
val unescapedJsons: RDD[String] = escapedJsons.map(_.replace("\"{", "{").replace("\"}", "}").replace("\\\"", "\""))
val dfJsons: DataFrame = spark.sqlContext.read.json(unescapedJsons) (This results in [_corrupt_record: string])
Any help would be appreciated
First of all, the JSON you have provided is syntactically invalid. The corrected JSON is as follows:
{"clientAttributes":{"backfillId":null,"clientPrimaryKey":"abc"},"escapedJsonPayload":{\"name\":\"Akash\",\"surname\":\"Patel\",\"items\":[{\"itemId\":\"abc\",\"itemName\":\"xyz\"}]}}
Next, to parse the above JSON correctly, you have to use the following code:
val rdd = spark.read.textFile("file:///home/akaspate/sample.json").toJSON.map(value => value.replace("\\", "").replace("{\"value\":\"", "").replace("}\"}", "}")).rdd
val df = spark.read.json(rdd)
The above code will give you the following output:
df.show(false)
+----------------+-------------------------------------+
|clientAttributes|escapedJsonPayload |
+----------------+-------------------------------------+
|[null,abc] |[WrappedArray([abc,xyz]),Akash,Patel]|
+----------------+-------------------------------------+
With the following schema:
df.printSchema
root
|-- clientAttributes: struct (nullable = true)
| |-- backfillId: string (nullable = true)
| |-- clientPrimaryKey: string (nullable = true)
|-- escapedJsonPayload: struct (nullable = true)
| |-- items: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- itemId: string (nullable = true)
| | | |-- itemName: string (nullable = true)
| |-- name: string (nullable = true)
| |-- surname: string (nullable = true)
I hope this helps!
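A different route worth mentioning, not part of the answer above, is to keep the normal JSON read and decode only the escaped field with from_json against an explicit schema. A minimal sketch, assuming the payload has first been fixed to be valid JSON as described above:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// Schema written by hand to match the escaped payload in the question.
val payloadSchema = StructType(Seq(
  StructField("name", StringType),
  StructField("surname", StringType),
  StructField("items", ArrayType(StructType(Seq(
    StructField("itemId", StringType),
    StructField("itemName", StringType)
  ))))
))

val df = spark.read.json("file:///home/akaspate/sample.json")
val parsed = df.withColumn("escapedJsonPayload",
  from_json(col("escapedJsonPayload"), payloadSchema))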