I grouped by a few columns and am getting WrappedArray out of these columns, as you can see in the schema. How do I get rid of them so I can proceed to the next step and do an orderBy?
val sqlDF = spark.sql("SELECT * FROM parquet.`parquet/20171009121227/rels/*.parquet`")
Getting a DataFrame:
val final_df = groupedBy_DF.select(
groupedBy_DF("collect_list(relev)").as("rel"),
groupedBy_DF("collect_list(relev2)").as("rel2"))
Then printing the schema gives us: final_df.printSchema
|-- rel: array (nullable = true)
| |-- element: double (containsNull = true)
|-- rel2: array (nullable = true)
| |-- element: double (containsNull = true)
I am trying to convert to this:
|-- rel: double (nullable = true)
|-- rel2: double (nullable = true)
Desired example output:
-1.0,0.0
-1.0,0.0
In the case where collect_list will always return only one value, use first instead; then there is no Array to deal with at all. Note that this should be done during the groupBy step.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val final_df = df.groupBy(...)
  .agg(first($"relev").as("rel"),
       first($"relev2").as("rel2"))
Try col(x).getItem:
import org.apache.spark.sql.functions.col

groupedBy_DF.select(
    groupedBy_DF("collect_list(relev)").as("rel"),
    groupedBy_DF("collect_list(relev2)").as("rel2"))
  .withColumn("rel_0", col("rel").getItem(0))
Try split:

import org.apache.spark.sql.functions._

val final_df = groupedBy_DF.select(
    groupedBy_DF("collect_list(relev)").as("rel"),
    groupedBy_DF("collect_list(relev2)").as("rel2"))
  .withColumn("rel", split(col("rel"), ","))
Related
I need some help accessing names within columns. I have, for example, the following schema:
root
|-- id_1: string (nullable = true)
|-- array_1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id_2: string (nullable = true)
| | |-- post: struct (nullable = true)
| | | |-- value: double (nullable = true)
By using
cols = df.columns
I will get a list of all names at root level,
cols = [id_1, array_1,...]
However, I would like to access the names within e.g. 'array_1'. Using
df.array_1.columns
simply returns
Column<b'array_1[columns]'>
and no names. Is there any way to access the names within arrays? The same issue arises with structs. This would make it easier to loop over columns and write functions. If it is possible to avoid extra modules, that would be beneficial.
Thanks
You can use the DataFrame's schema to look up column names, via the StructType and StructField APIs. Example Scala Spark code (adapt it to your needs):
import org.apache.spark.sql.types._
case class A(a: Int, b: String)
val df = Seq(("a", Array(A(1, "asd"))), ("b", Array(A(2, "dsa")))).toDF("str_col", "arr_col")
println(df.schema)
> res19: org.apache.spark.sql.types.StructType = StructType(StructField(str_col,StringType,true), StructField(arr_col,ArrayType(StructType(StructField(a,IntegerType,false), StructField(b,StringType,true)),true),true))
val fields = df.schema.fields
println(fields(0).name)
> res22: String = str_col
println(fields(1).dataType.asInstanceOf[ArrayType].elementType)
> res23: org.apache.spark.sql.types.DataType = StructType(StructField(a,IntegerType,false), StructField(b,StringType,true))
.....
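To get at the names inside the array's struct elements, which is what the question asks about, a small follow-on sketch (continuing from the df and imports above):

// drill into the array's element type, then read the struct's field names
val elemType = df.schema("arr_col").dataType.asInstanceOf[ArrayType].elementType
val nestedNames = elemType.asInstanceOf[StructType].fieldNames
// nestedNames: Array[String] = Array(a, b)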
I am very new to Scala and I have the following issue.
I have a spark dataframe with the following schema:
df.printSchema()
root
|-- word: string (nullable = true)
|-- vector: array (nullable = true)
| |-- element: string (containsNull = true)
I need to convert this to the following schema:
root
|-- word: string (nullable = true)
|-- vector: array (nullable = true)
| |-- element: double (containsNull = true)
I do not want to specify the schema beforehand, but instead change the existing one.
I have tried the following
df.withColumn("vector", col("vector").cast("array<element: double>"))
I have also tried converting it into an RDD, using map to change the elements, and then turning it back into a DataFrame, but I get the data type Array[WrappedArray] and I am not sure how to handle it.
Using pyspark and numpy, I could do this by df.select("vector").rdd.map(lambda x: numpy.asarray(x)).
Any help would be greatly appreciated.
You're close. Try this code:
val df2 = df.withColumn("vector", col("vector").cast("array<double>"))
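For reference, the schema after the cast should match the target above; note that any element that cannot be parsed as a double becomes null. A quick check, continuing from df2:

df2.printSchema
// root
//  |-- word: string (nullable = true)
//  |-- vector: array (nullable = true)
//  |    |-- element: double (containsNull = true)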
I have a DataFrame df with the following schema:
|-- type: string (nullable = true)
|-- record_sales: array (nullable = false)
| |-- element: string (containsNull = false)
|-- record_marketing: array (nullable = false)
| |-- element: string (containsNull = false)
and a map
typemap = Map("sales" -> "record_sales", "marketing" -> "record_marketing")
I want a new column "record" that is either the value of record_sales or record_marketing based on the value of type.
I've tried some variants of this:
val typeMapCol = typedLit(typemap)
val df2 = df.withColumn("record", col(typeMapCol(col("type"))))
But nothing has worked. Does anyone have any idea? Thanks!
You can iterate over the map typemap and use the when function to build case/when expressions depending on the value of the type column:
import org.apache.spark.sql.functions.{when, coalesce, col}

val recordCol = typemap.map{ case (k, v) => when(col("type") === k, col(v)) }.toSeq
val df2 = df.withColumn("record", coalesce(recordCol: _*))
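For the two keys in typemap, the generated expression is equivalent to the following (just the line above expanded by hand):

val df2 = df.withColumn("record",
  coalesce(
    when(col("type") === "sales", col("record_sales")),
    when(col("type") === "marketing", col("record_marketing"))))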
I have a dataset with the following schema:
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- subEntities: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- status: string (nullable = true)
| | |-- subEntityId: long (nullable = true)
| | |-- subEntityName: string (nullable = true)
dataset.select($"id", $"name", $"subEntities.subEntityId", $"subEntities.subEntityName") put subEntityId and subEntityName into separate arrays. How to select multiple columns and put them into single array?
If you are working on Spark >= 2.4, you can use the transform function to generate an array that contains a subset of the original array's fields:
import org.apache.spark.sql.functions.expr
dataset.withColumn("newArray", expr("transform(subEntities, i -> struct(i.subEntityId, i.subEntityName))"))
// or with select
dataset.select(
$"id",
$"name",
expr("transform(subEntities, i -> struct(i.subEntityId, i.subEntityName))").as("newArray")
)
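The newArray column should then hold one struct per original element, with a schema roughly like this (a sketch; the nullability flags may differ):
|-- newArray: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- subEntityId: long (nullable = true)
| | |-- subEntityName: string (nullable = true)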
.withColumn("status",col("subEntities").getField("status"))
.withColumn("subEntityId",col("subEntities").getField("subEntityId"))
To extract value out of your array
Below is working example
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ExplodeArray {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._

    val df = List(
      bean57("1", Array(bean55("aaa", 2), bean55("aaa1", 21))),
      bean57("2", Array(bean55("bbb", 3), bean55("bbb3", 31)))).toDF

    df.withColumn("status", col("subEntities").getField("status"))
      .withColumn("subEntityId", col("subEntities").getField("subEntityId"))
      .show()
  }
}

case class bean57(id: String, subEntities: Array[bean55])
case class bean55(status: String, subEntityId: Long)
I have my input as below.
val inputJson ="""[{"color": "red","value": "#f00"},{"color": "blue","value": "#00f"}]"""
I need to convert the JSON value to arrays.
My output should be as below:
val colorval=Array("red","blue")
val value=Array("#f00","#00f")
Please kindly help.
The following solution should help if you have large data sets.
// input data; I guess you have large data
val inputJson = """[{"color": "red","value": "#f00"},{"color": "blue","value": "#00f"}]"""
// read the JSON data into a DataFrame
val df = sqlContext.read.json(sc.parallelize(inputJson :: Nil))
// apply the built-in collecting functions
import org.apache.spark.sql.functions.collect_list
df.select(collect_list("color").as("colorVal"), collect_list("value").as("value"))
and you should have
+-----------+------------+
|colorVal |value |
+-----------+------------+
|[red, blue]|[#f00, #00f]|
+-----------+------------+
root
|-- colorVal: array (nullable = true)
| |-- element: string (containsNull = true)
|-- value: array (nullable = true)
| |-- element: string (containsNull = true)
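From there, a hedged sketch of pulling the collected lists back to the driver as plain Scala Arrays, matching the colorval/value output the question asks for (column names come from the select above):

val row = df.select(collect_list("color").as("colorVal"),
                    collect_list("value").as("value")).first()
val colorval = row.getAs[Seq[String]]("colorVal").toArray
val value = row.getAs[Seq[String]]("value").toArray
// colorval: Array[String] = Array(red, blue)
// value: Array[String] = Array(#f00, #00f)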
Create a DataFrame from the JSON and explode it. Then use collect_list() or collect_set(), depending on whether you need duplicates or not.
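A hedged sketch of that route using from_json and explode (Spark >= 2.2 is assumed for array schemas in from_json; the schema definition and the intermediate column names are illustrative, not from the original answer):

import org.apache.spark.sql.functions.{from_json, explode, col, collect_list}
import org.apache.spark.sql.types._
import spark.implicits._  // assumes a SparkSession named spark

// schema describing the JSON array of {color, value} objects
val schema = ArrayType(StructType(Seq(
  StructField("color", StringType),
  StructField("value", StringType))))

// parse the JSON string, explode the array into rows, then flatten the struct
val exploded = Seq(inputJson).toDF("json")
  .select(explode(from_json(col("json"), schema)).as("item"))
  .select(col("item.color"), col("item.value"))

exploded.select(collect_list("color"), collect_list("value")).show()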