How can I split a DataFrame schema string into individual structs - Scala

I want to split the schema of a DataFrame into a collection. I am trying the code below, but the schema is printed out as a single string. Is there any way I can split it into a collection per StructType so that I can manipulate it (for example, take only the array columns from the output)? I am trying to flatten a complex multi-level struct + array DataFrame.
import org.apache.spark.sql.functions.explode
import org.apache.spark.sql._
val test = sqlContext.read.json(sc.parallelize(Seq("""{"a":1,"b":[2,3],"d":[2,3]}""")))
test.printSchema
val flattened = test.withColumn("b", explode($"d"))
flattened.printSchema
def identifyArrayColumns(dataFrame : DataFrame) = {
val output = for ( d <- dataFrame.collect()) yield
{
d.schema
}
output.toList
}
identifyArrayColumns(test)
Output currently is
identifyArrayColumns: (dataFrame: org.apache.spark.sql.DataFrame)List[org.apache.spark.sql.types.StructType]
res58: List[org.apache.spark.sql.types.StructType] = List(StructType(StructField(a,LongType,true), StructField(b,ArrayType(LongType,true),true), StructField(d,ArrayType(LongType,true),true)))
It is one full string, so I cannot filter only the array columns. If I do a foreach(println), I get only one line:
scala> output.foreach(println)
StructType(StructField(a,LongType,true), StructField(b,ArrayType(LongType,true),true), StructField(d,ArrayType(LongType,true),true))
What I want is each StructType as a separate element in a collection.

You can simply filter the fields of the DataFrame's schema for fields with type array - no need to inspect the DataFrame's data for this:
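import org.apache.spark.sql.types.{StructField, StructType}
// keep only the top-level fields whose data type is an array
def identifyArrayColumns(schema: StructType): List[StructField] = {
schema.fields.filter(_.dataType.typeName == "array").toList
}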
Note that this is a "shallow" solution that only returns the array fields directly under the root. If you also want to find arrays nested inside arrays, maps, or structs, you need to traverse the schema recursively and collect the filtered results. The version below descends into nested structs only, something like:
// can be converted into a tail-recursive method by adding another argument to accumulate results
def identifyArrayColumns(schema: StructType): List[StructField] = {
val arrays = schema.fields.filter(_.dataType.typeName == "array").toList
val deeperArrays = schema.fields.flatMap {
case f @ StructField(_, s: StructType, _, _) => identifyArrayColumns(s)
case _ => List()
}
arrays ++ deeperArrays
}
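The recursive version above only follows struct fields. As a minimal sketch of a deeper traversal, assuming you want dotted paths rather than StructField objects, the helper below also descends into array elements and map values. The name arrayPaths and the sample schema are hypothetical and exist only for illustration; it needs nothing but the Spark SQL types on the classpath.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: returns the dotted path of every array-typed field,
// descending into structs, array element types, and map value types.
def arrayPaths(dt: DataType, prefix: String = ""): List[String] = dt match {
  case s: StructType =>
    s.fields.toList.flatMap { f =>
      val path = if (prefix.isEmpty) f.name else s"$prefix.${f.name}"
      // record the field itself when it is an array
      val here = f.dataType match {
        case _: ArrayType => List(path)
        case _            => Nil
      }
      here ++ arrayPaths(f.dataType, path)
    }
  case ArrayType(elementType, _) => arrayPaths(elementType, prefix)
  case MapType(_, valueType, _)  => arrayPaths(valueType, prefix)
  case _                         => Nil
}

// Sample schema (an assumption, not taken from the question's data)
val sampleSchema = StructType(Seq(
  StructField("a", LongType),
  StructField("b", ArrayType(LongType)),
  StructField("c", StructType(Seq(
    StructField("d", ArrayType(StringType))
  )))
))

println(arrayPaths(sampleSchema)) // List(b, c.d)
```

Calling arrayPaths(test.schema) on the question's DataFrame would follow the same pattern; an array nested directly inside another array is reported once under the outer field's path.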

Related

Dynamic conversion of Array of double columns into multiple columns in nested spark dataframe

My current DataFrame looks like as below:
{"id":"1","inputs":{"values":{"0.2":[1,1],"0.4":[1,1],"0.6":[1,1]}},"id1":[1,2]}
I want to transform this dataframe into the below dataFrame:
{"id":"1", "v20":[1,1],"v40":[1,1],"v60":[1,1],"id1":[1,2]}
This means that, each 'values' array's items (0.2, 0.4 and 0.6) will be multiplied by 100, prepended with the letter 'v', and extracted into separate columns.
What would the code look like to achieve this? I have tried withColumn but couldn't get it to work.
Try the code below; the inline comments explain each step.
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType
object DynamicCol {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().master("local[*]").getOrCreate()
val df = spark.read.json("src/main/resources/dyamicCol.json") /// Load the JSON file
val dfTemp = df.select(col("inputs.values").as("values")) // Temp Dataframe for fetching the nest values
val index = dfTemp
.schema.fieldIndex("values")
val propSchema = dfTemp.schema(index).dataType.asInstanceOf[StructType]
val dfFinal = propSchema.fields.foldLeft(df)( (df,field) => { // Join Dataframe with the list of nested columns
val colNameInt = (field.name.toDouble * 100).toInt
val colName = s"v$colNameInt"
df.withColumn(colName,col("inputs.values.`" + field.name + "`")) // Add the nested column mappings
} ).drop("inputs") // Drop the extra column
dfFinal.write.mode(SaveMode.Overwrite).json("src/main/resources/dyamicColOut.json") // Output the JSON file
}
}
I would split the column-renaming logic into two parts: one for names that are numeric values, and one for names that don't change.
def stringDecimalToVNumber(colName:String): String =
"v" + (colName.toFloat * 100).toInt.toString
and form a single function that transforms according to the case
val floatRegex = """(\d+\.?\d*)""".r
def transformColumnName(colName:String): String = colName match {
case floatRegex(v) => stringDecimalToVNumber(v) //it's a float, transform it
case x => x // keep it
}
Now that we have the function to transform the column names, let's pick up the schema dynamically.
val flattenDF = df.select("id","inputs.values.*")
val finalDF = flattenDF
.schema.names
.foldLeft(flattenDF)((dfacum,x) => {
val newName = transformColumnName(x)
if (newName == x)
dfacum // the name didn't need to be changed
else
dfacum.withColumnRenamed(x, newName)
})
This will dynamically transform all the columns inside inputs.values to the new name, and put them in next to id.
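One caveat applies to both answers above: multiplying a parsed decimal by 100 and truncating with toInt can be off by one for some inputs, because values such as 0.29 are not exactly representable in binary floating point. The keys in the question (0.2, 0.4, 0.6) happen to be fine, but a minimal sketch that rounds instead of truncating is safer. The name toVName is hypothetical.

```scala
// Same regex as in the answer above: matches names that are plain decimals.
val floatRegex = """(\d+\.?\d*)""".r

// Hypothetical variant of transformColumnName that rounds to the nearest
// integer instead of truncating.
def toVName(colName: String): String = colName match {
  case floatRegex(v) => "v" + math.round(v.toDouble * 100).toString
  case other         => other // non-numeric names such as "id" are kept
}

println(Seq("id", "0.2", "0.4", "0.6", "0.29").map(toVName))
// List(id, v20, v40, v60, v29)

// Truncation for comparison: 0.29 * 100 is slightly below 29 as a Double.
println((0.29 * 100).toInt) // 28
```

The function can be dropped into the foldLeft over schema.names in place of transformColumnName.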

Passing struct type to methods or UDFS in spark sql dataframes

I have joined two dataframes, and the joined dataframe has two columns of struct type. More precisely, each is an array of structs, roughly Array[(String, Int)]. I need to derive a third column based on the elements of these columns.
My code looks like below.
val bdf = Seq(
("a",1,1,10)
,("a",1,2,10)
,("a",1,3,10)
,("a",1,4,10)
,("b",1,1,20)
,("b",1,2,10)
,("a",2,3,10)
,("a",2,4,20)
,("a",2,5,20)
,("c",2,1,10)
,("c",2,2,20)
,("c",2,3,20)
).toDF("contract_number","linenumber","monthdel","open_quant")
val gbdf = bdf.withColumn("bmergedcol",struct(bdf("monthdel"),bdf("open_quant"))).groupBy("contract_number","linenumber").agg(collect_list("bmergedcol"))
val pl = Seq(
("a",1,"FLAT",10)
,("a",1,"FLAT",30)
,("a",1,"NFE",10)
,("b",1,"FLAT",10)
,("b",1,"NFE",10)
,("c",2,"NFE",10)
,("a",3,"NFE",20)
,("c",2,"FLAT",20)).toDF("connum","linnum","type","qnt")
import org.apache.spark.sql.functions._
val gpl = pl.withColumn("mergedcol",struct(pl("type"),pl("qnt"))).groupBy("connum","linnum").agg(collect_list("mergedcol"))
val jdf = gbdf.join(gpl,expr("((contract_number = connum) AND (linenumber = linnum ))"),"left_outer")
My output of jdf is like
How can I pass the two struct-type fields to some method and derive a third one from them?
Both arrays of structs should enter your UDF as Seq[Row], which you can then map into tuples by specifying the types of the struct fields (I think it's String, Int in your case). In this example I use pattern matching on Row, but there are other ways to do it (e.g. using Row#getAs):
val myUDF = udf((arr1:Seq[Row],arr2:Seq[Row]) => {
// convert to tuples
val arr1Tup: Seq[(String, Int)] = arr1.map{case Row(s:String,i:Int) => (s,i)}
val arr2Tup: Seq[(String, Int)] = arr2.map{case Row(s:String,i:Int) => (s,i)}
// now derive the new quantities from arr1Tup and arr2Tup
})
Using the 2 Sequences of Tuples you can derive your new column
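Because the join in the question is a left outer join, the right-hand array column can be null for unmatched rows, and a bare pattern match inside map throws on rows of an unexpected shape. A minimal sketch of a more defensive conversion, assuming (String, Int) structs as in the answer above; the helper name toPairs is hypothetical:

```scala
import org.apache.spark.sql.Row

// Hypothetical helper: converts an array-of-structs column value into tuples.
// A null input (no match in the outer join) becomes an empty sequence, and
// rows that are not (String, Int) are skipped instead of throwing.
def toPairs(rows: Seq[Row]): Seq[(String, Int)] =
  Option(rows).getOrElse(Seq.empty[Row]).collect {
    case Row(s: String, i: Int) => (s, i)
  }

println(toPairs(Seq(Row("FLAT", 10), Row("NFE", 10)))) // List((FLAT,10), (NFE,10))
println(toPairs(null))                                  // List()
```

Inside the UDF body, arr1.map{...} and arr2.map{...} could be replaced by toPairs(arr1) and toPairs(arr2). Note that the bmergedcol structs in the question hold two integers (monthdel, open_quant), so that column would need a (Int, Int) variant.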
User-Defined Functions (UDFs) are a feature of Spark SQL for defining new Column-based functions that transform Datasets. A UDF can be used to take the two struct-type fields and derive a result.
val customUdf = udf((col1: Seq[Row], col2: Int) => {
// This is an example.
col1(1).getAs[String]("type") + "--" + col2
})
val cdf = jdf.withColumn("custom", customUdf(jdf.col("collect_list(mergedcol)"), jdf.col("linnum")))
cdf.show(10)
In the UDF above, col1 is a Seq[Row] because it is an array of structs. If only a single struct has to be accessed, a plain Row should be used instead.

how to convert RDD[(String, Any)] to Array(Row)?

I've got an unstructured RDD with keys and values. The values are of type RDD[Any] and mainly contain Maps; the keys are currently Strings (RDD[String]). I would like to convert them to type Row so I can eventually make a dataframe. Here is my RDD:
removed
Most of the rdd follows a pattern except for the last 4 keys, how should this be dealt with ? Perhaps split them into their own rdd, especially for reverseDeltas ?
Thanks
Edit
This is what I've tried so far, based on the first answer below.
case class MyData(`type`: List[String], libVersion: Double, id: BigInt)
object MyDataBuilder{
def apply(s: Any): MyData = {
// read the input data and convert that to the case class
s match {
case Array(x: List[String], y: Double, z: BigInt) => MyData(x, y, z)
case Array(a: BigInt, Array(x: List[String], y: Double, z: BigInt)) => MyData(x, y, z)
case _ => null
}
}
}
val parsedRdd: RDD[MyData] = rdd.map(x => MyDataBuilder(x))
However, it doesn't seem to match any of those cases. How can I match on a Map in Scala? I keep getting nulls back when printing out parsedRdd.
To convert the RDD to a dataframe you need a fixed schema. Once you define the schema for the RDD, the rest is simple.
Something like:
val rdd2:RDD[Array[String]] = rdd.map( x => getParsedRow(x))
val rddFinal:RDD[Row] = rdd2.map(x => Row.fromSeq(x))
Alternatively:
case class MyData(....) // all the fields of the Schema I want
object MyDataBuilder {
def apply(s:Any):MyData ={
// read the input data and convert that to the case class
}
}
val rddFinal:RDD[MyData] = rdd.map(x => MyDataBuilder(x))
import spark.implicits._
val myDF = rddFinal.toDF
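On the question of matching on a Map: a value typed as Any can be matched with a type pattern such as `case m: Map[_, _]`, after which individual entries can be looked up and type-checked. The sketch below is a minimal illustration under stated assumptions: the actual RDD contents were removed from the question, so the keys "type", "libVersion", and "id" are guesses taken from the MyData case class, and the function name parse is hypothetical. It returns an Option instead of null so that unparseable records can be dropped with flatMap.

```scala
case class MyData(`type`: List[String], libVersion: Double, id: BigInt)

// Hypothetical parser: the Map keys are assumed to mirror the case class fields.
def parse(value: Any): Option[MyData] = value match {
  case m: Map[_, _] =>
    // Type arguments are erased at runtime, so the cast is unchecked;
    // each looked-up value is type-checked individually below.
    val fields = m.asInstanceOf[Map[String, Any]]
    for {
      t <- fields.get("type").collect { case l: List[_] => l.map(_.toString) }
      v <- fields.get("libVersion").collect { case d: Double => d }
      i <- fields.get("id").collect { case b: BigInt => b }
    } yield MyData(t, v, i)
  case _ => None // anything that is not a Map is rejected
}

val sample: Any = Map("type" -> List("a", "b"), "libVersion" -> 1.5, "id" -> BigInt(42))
println(parse(sample))         // Some(MyData(List(a, b),1.5,42))
println(parse("not a map"))    // None
```

With an RDD this could be applied as rdd.flatMap(x => parse(x)), which keeps only the records that parsed successfully.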
There is a method for converting an RDD to a dataframe.
Use it like below:
import spark.implicits._ // brings toDF into scope; assumes a SparkSession named spark
val rdd = sc.textFile("/pathtologfile/logfile.txt")
val df = rdd.toDF()
Now that you have a dataframe, you can do whatever you want with it using SQL-style queries like below:
val textFile = sc.textFile("hdfs://...")
// Creates a DataFrame having a single column named "line"
val df = textFile.toDF("line")
val errors = df.filter(col("line").like("%ERROR%"))
// Counts all the errors
errors.count()
// Counts errors mentioning MySQL
errors.filter(col("line").like("%MySQL%")).count()
// Fetches the MySQL errors as an array of strings
errors.filter(col("line").like("%MySQL%")).collect()

Filtering in Scala

So suppose I have the following data (only the first few rows, this data covers an entire year) -
(2014-08-31T00:05:00.000+01:00, John)
(2014-08-31T00:11:00.000+01:00, Sarah)
(2014-08-31T00:12:00.000+01:00, George)
(2014-08-31T00:05:00.000+01:00, John)
(2014-09-01T00:05:00.000+01:00, Sarah)
(2014-09-01T00:05:00.000+01:00, George)
(2014-09-01T00:05:00.000+01:00, Jason)
I would like to filter the data so that I only see what the names are for a specific date (say, 2014-09-05). I've tried doing this using the filter function in Scala but I keep receiving the following error -
error: value xxxx is not a member of (org.joda.time.DateTime, String)
Is there another way of doing this?
The filter method takes a function, called a predicate, that takes as parameter an element of your (I'm assuming) RDD, and returns a Boolean.
The returned RDD will keep only the rows for which the predicate evaluates to true.
In your case, it seems that what you want is something like
rdd.filter{
case (date, _) => date.withTimeAtStartOfDay() == new DateTime("2017-03-31")
}
I presume from the tag your question is in the context of Spark and not pure Scala. Given that, you could filter a dataframe on a date and get the associated name(s) like this:
import org.apache.spark.sql.functions._
import sparkSession.implicits._
Seq(
("2014-08-31T00:05:00.000+01:00", "John"),
("2014-08-31T00:11:00.000+01:00", "Sarah")
...
)
.toDF("date", "name")
.filter(to_date('date).equalTo(Date.valueOf("2014-09-05")))
.select("name")
Note that the Date above is java.sql.Date.
Here's a function that takes a date, a list of datetime-name pairs, and returns a list of names for the date:
def getNames(d: String, l: List[(String, String)]): List[String] = {
val date = """^([^T]*).*""".r
val dateMap = l.map {
case (x, y) => ( x match { case date(z) => z }, y )
}.
groupBy(_._1) mapValues( _.map(_._2) )
dateMap.getOrElse(d, List[String]())
}
val list = List(
("2014-08-31T00:05:00.000+01:00", "John"),
("2014-08-31T00:11:00.000+01:00", "Sarah"),
("2014-08-31T00:12:00.000+01:00", "George"),
("2014-08-31T00:05:00.000+01:00", "John"),
("2014-09-01T00:05:00.000+01:00", "Sarah"),
("2014-09-01T00:05:00.000+01:00", "George"),
("2014-09-01T00:05:00.000+01:00", "Jason")
)
getNames("2014-09-01", list)
res1: List[String] = List(Sarah, George, Jason)
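The answers above use Joda-Time and a regex on the timestamp string. As an alternative sketch, swapping in the standard java.time API (an assumption, since the question's data is typed as org.joda.time.DateTime), the timestamps can be parsed and compared by local date. The name namesOn is hypothetical, and the sample rows are copied from the question.

```scala
import java.time.{LocalDate, OffsetDateTime}

// Hypothetical helper: keeps the names whose ISO-8601 timestamp falls on `day`,
// using the date as written in the timestamp's own offset.
def namesOn(day: LocalDate, rows: Seq[(String, String)]): Seq[String] =
  rows.collect {
    case (ts, name) if OffsetDateTime.parse(ts).toLocalDate == day => name
  }

val rows = List(
  ("2014-08-31T00:05:00.000+01:00", "John"),
  ("2014-08-31T00:11:00.000+01:00", "Sarah"),
  ("2014-09-01T00:05:00.000+01:00", "Sarah"),
  ("2014-09-01T00:05:00.000+01:00", "George"),
  ("2014-09-01T00:05:00.000+01:00", "Jason")
)

println(namesOn(LocalDate.parse("2014-09-01"), rows)) // List(Sarah, George, Jason)
```

The same predicate can be passed to filter on an RDD of (String, String) pairs.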
val dateTimeStringZero = "2014-08-12T00:05:00.000+01:00"
val dateTimeOne:DateTime = org.joda.time.format.ISODateTimeFormat.dateTime.withZoneUTC.parseDateTime(dateTimeStringZero)
import java.text.SimpleDateFormat
val df = new DateTime(new SimpleDateFormat("yyyy-MM-dd").parse("2014-08-12"))
println(dateTimeOne.getYear==df.getYear)
println(dateTimeOne.getMonthOfYear==df.getMonthOfYear)
...

Filtering One RDD based on another RDD using regex

I have two RDD's of the form:
data_wo_header: RDD[String], data_test_wo_header: RDD[String]
scala> data_wo_header.first
res2: String = 1,2,3.5,1112486027
scala> data_test_wo_header.first
res2: String = 1,2
RDD2 is smaller than RDD1. I am trying to filter RDD1 by removing the elements that match an entry in RDD2.
The 1,2 in the above example represents UserID,MovID. Since it is present in the test set, I want a new RDD in which it has been removed from RDD1.
I have asked a similar question, but the answer there required an unnecessary split of the RDD.
I am trying to do something like this, but it's not working:
def create_training(data_wo_header: RDD[String], data_test_wo_header: RDD[String]): List[String] = {
var ratings_train = new ListBuffer[String]()
data_wo_header.foreach(x => {
data_test_wo_header.foreach(y => {
if (x.indexOf(y) == 0) {
ratings_train += x
}
})
})
val ratings_train_list = ratings_train.toList
return ratings_train_list
}
How should I do a regex match and filter based on it?
You can use a broadcast variable to share the contents of rdd2 and then filter rdd1 against the broadcast value. I replicated your code and this works for me:
def create_training(data_wo_header: RDD[String], data_test_wo_header: RDD[String]): List[String] = {
val rdd2array = sparkSession.sparkContext.broadcast(data_test_wo_header.collect())
val training_set = data_wo_header.filter{
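// String.matches only succeeds on a whole-string match, so compare on the "user,movie," prefix instead
case(x) => !rdd2array.value.exists(y => x.startsWith(y + ","))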
}
training_set.collect().toList
}
Also, with Scala and Spark I recommend avoiding foreach where possible and using a more functional style with the map, flatMap, and filter functions.
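Scanning the whole broadcast array for every line costs time proportional to the size of the test set per record. A minimal sketch of a cheaper predicate, shown on plain Scala collections so it can be run without a cluster: extract the "UserID,MovID" key from each line and look it up in a Set. The helper name keyOf and the sample lines other than the first are made up for illustration.

```scala
// Hypothetical helper: the first two comma-separated fields form the key.
def keyOf(line: String): String = line.split(",", 3).take(2).mkString(",")

// Stand-ins for the collected test RDD and the full data RDD (sample values).
val testKeys: Set[String] = Set("1,2", "3,4")
val data = Seq(
  "1,2,3.5,1112486027",
  "1,23,4.0,1112484676", // shares the textual prefix "1,2" but has a different key
  "2,2,1.0,1112484727"
)

// Keep only the lines whose key does not appear in the test set.
val training = data.filterNot(line => testKeys.contains(keyOf(line)))
println(training) // List(1,23,4.0,1112484676, 2,2,1.0,1112484727)
```

In Spark the same idea would broadcast data_test_wo_header.collect().toSet and call data_wo_header.filter(line => !keys.value.contains(keyOf(line))), where keys is the broadcast variable.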