Converting a String containing a list of JSON objects to a DataFrame - Scala

I have a string:
str=[{"A":120.0,"B":"0005236"},{"A":10.0,"B":"0005200"},{"A":12.0,"B":"00042276"},{"A":20.0,"B":"00052000"}]
I am trying to convert it to a DataFrame like this:
+-------+--------------+
|packQty|gtin |
+-------+--------------+
|120.0 |0005236 |
|10.0 |0005200 |
|12.0 |00042276 |
|20.0 |00052000 |
+-------+--------------+
I have created a schema:
val schema = new StructType()
  .add("packQty", FloatType)
  .add("gtin", StringType)
val df = Seq(str).toDF("testQTY")
val df2 = df.withColumn("jsonData", from_json($"testQTY", schema)).select("jsonData.*")
This returns a DataFrame with only one record:
+-------+--------------+
|packQty|gtin |
+-------+--------------+
|120.0 |0005236|
+-------+--------------+
How do I modify the schema so that I can get all the records?
If the column were an array then I could have used the explode() function to get the values, but I am getting this error:
cannot resolve 'explode(`gtins`)' due to data type mismatch: input to function explode should be array or map type, not string;;
This is how the column is populated:
+----------------------------------------------------------------------+
|gtins                                                                  |
+----------------------------------------------------------------------+
|[{"packQty":120.0,"gtin":"000520"},{"packQty":10.0,"gtin":"0005200"}]  |
+----------------------------------------------------------------------+

Wrap your schema in ArrayType, i.e. ArrayType(new StructType().add("packQty",FloatType).add("gtin", StringType)). As written, this will give you null values because the schema field names do not match the field names in the JSON data.
Change the schema from ArrayType(new StructType().add("packQty",FloatType).add("gtin", StringType)) to ArrayType(new StructType().add("A",FloatType).add("B", StringType)), and after parsing the data rename the columns as required.
Please check the code below.
If the column names in the schema match the JSON data:
scala> val json = Seq("""[{"A":120.0,"B":"0005236"},{"A":10.0,"B":"0005200"},{"A":12.0,"B":"00042276"},{"A":20.0,"B":"00052000"}]""").toDF("testQTY")
json: org.apache.spark.sql.DataFrame = [testQTY: string]
scala> val schema = ArrayType(StructType(StructField("A",DoubleType,true):: StructField("B",StringType,true) :: Nil))
schema: org.apache.spark.sql.types.ArrayType = ArrayType(StructType(StructField(A,DoubleType,true), StructField(B,StringType,true)),true)
scala> json.withColumn("jsonData",from_json($"testQTY",schema)).select(explode($"jsonData").as("jsonData")).select($"jsonData.A".as("packQty"),$"jsonData.B".as("gtin")).show(false)
+-------+--------+
|packQty|gtin |
+-------+--------+
|120.0 |0005236 |
|10.0 |0005200 |
|12.0 |00042276|
|20.0 |00052000|
+-------+--------+
If the column names in the schema do not match the JSON data:
scala> val json = Seq("""[{"A":120.0,"B":"0005236"},{"A":10.0,"B":"0005200"},{"A":12.0,"B":"00042276"},{"A":20.0,"B":"00052000"}]""").toDF("testQTY")
json: org.apache.spark.sql.DataFrame = [testQTY: string]
scala> val schema = ArrayType(StructType(StructField("packQty",DoubleType,true):: StructField("gtin",StringType,true) :: Nil)) // Column names are not matched with json & schema.
schema: org.apache.spark.sql.types.ArrayType = ArrayType(StructType(StructField(packQty,DoubleType,true), StructField(gtin,StringType,true)),true)
scala> json.withColumn("jsonData",from_json($"testQTY",schema)).select(explode($"jsonData").as("jsonData")).select($"jsonData.*").show(false)
+-------+----+
|packQty|gtin|
+-------+----+
|null |null|
|null |null|
|null |null|
|null |null|
+-------+----+
An alternative way of parsing the JSON string into a DataFrame, using a Dataset:
scala> val json = Seq("""[{"A":120.0,"B":"0005236"},{"A":10.0,"B":"0005200"},{"A":12.0,"B":"00042276"},{"A":20.0,"B":"00052000"}]""").toDS // Creating DataSet from json string.
json: org.apache.spark.sql.Dataset[String] = [value: string]
scala> val schema = StructType(StructField("A",DoubleType,true):: StructField("B",StringType,true) :: Nil) // Creating schema.
schema: org.apache.spark.sql.types.StructType = StructType(StructField(A,DoubleType,true), StructField(B,StringType,true))
scala> spark.read.schema(schema).json(json).select($"A".as("packQty"),$"B".as("gtin")).show(false)
+-------+--------+
|packQty|gtin |
+-------+--------+
|120.0 |0005236 |
|10.0 |0005200 |
|12.0 |00042276|
|20.0 |00052000|
+-------+--------+

One more option:
val data = """[{"A":120.0,"B":"0005236"},{"A":10.0,"B":"0005200"},{"A":12.0,"B":"00042276"},{"A":20.0,"B":"00052000"}]"""
val df2 = Seq(data).toDF("gtins")
df2.show(false)
df2.printSchema()
/**
* +--------------------------------------------------------------------------------------------------------+
* |gtins |
* +--------------------------------------------------------------------------------------------------------+
* |[{"A":120.0,"B":"0005236"},{"A":10.0,"B":"0005200"},{"A":12.0,"B":"00042276"},{"A":20.0,"B":"00052000"}]|
* +--------------------------------------------------------------------------------------------------------+
*
* root
* |-- gtins: string (nullable = true)
*/
df2.selectExpr("inline_outer(from_json(gtins, 'array<struct<A:double, B:string>>')) as (packQty, gtin)")
.show(false)
/**
* +-------+--------+
* |packQty|gtin |
* +-------+--------+
* |120.0 |0005236 |
* |10.0 |0005200 |
* |12.0 |00042276|
* |20.0 |00052000|
* +-------+--------+
*/
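If you would rather not write the array schema by hand, Spark 2.4+ can infer it from a sample record with schema_of_json. A small sketch (an addition, assuming Spark 2.4 or later), reusing df2 with the gtins column from the block above:
import org.apache.spark.sql.functions.{explode, from_json, lit, schema_of_json}

// Infer the array schema from one sample element instead of spelling it out.
val sample = """[{"A":120.0,"B":"0005236"}]"""
df2.withColumn("jsonData", from_json($"gtins", schema_of_json(lit(sample))))
  .select(explode($"jsonData").as("jsonData"))
  .select($"jsonData.A".as("packQty"), $"jsonData.B".as("gtin"))
  .show(false)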

Related

Extracting array elements and mapping them to case class

The following is the output I am getting after performing a groupByKey, mapGroups and then a joinWith operation on the case class Dataset:
+------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|_1 |_2 |
+------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[IND0001,Christopher,Black] |null |
|[IND0002,Madeleine,Kerr] |[IND0002,WrappedArray([IND0002,ACC0155,323], [IND0002,ACC0262,60])] |
|[IND0003,Sarah,Skinner] |[IND0003,WrappedArray([IND0003,ACC0235,631], [IND0003,ACC0486,400], [IND0003,ACC0540,53])] |
|[IND0004,Rachel,Parsons] |[IND0004,WrappedArray([IND0004,ACC0116,965])] |
|[IND0005,Oliver,Johnston] |[IND0005,WrappedArray([IND0005,ACC0146,378], [IND0005,ACC0201,34], [IND0005,ACC0450,329])] |
|[IND0006,Carl,Metcalfe] |[IND0006,WrappedArray([IND0006,ACC0052,57], [IND0006,ACC0597,547])] |
The code is as follows:
val test = accountDS.groupByKey(_.customerId).mapGroups{ case (id, xs) => (id, xs.toSeq)}
test.show(false)
val newTest = customerDS.joinWith(test, customerDS("customerId") === test("_1"), "leftouter")
newTest.show(500,false)
Now I want to take the arrays and output them in a format as follows:
+----------+-----------+--------+---------------------------------------------------------------------+--------------+------------+-----------------+
|customerId|forename   |surname |accounts                                                             |numberAccounts|totalBalance|averageBalance   |
+----------+-----------+--------+---------------------------------------------------------------------+--------------+------------+-----------------+
|IND0001   |Christopher|Black   |[]                                                                   |0             |0           |0.0              |
|IND0002   |Madeleine  |Kerr    |[[IND0002,ACC0155,323], [IND0002,ACC0262,60]]                        |2             |383         |191.5            |
|IND0003   |Sarah      |Skinner |[[IND0003,ACC0235,631], [IND0003,ACC0486,400], [IND0003,ACC0540,53]] |3             |1084        |361.3333333333333|
+----------+-----------+--------+---------------------------------------------------------------------+--------------+------------+-----------------+
Note: I cannot use spark.sql.functions._ at all --> training academy rules :(
How do I get the above output which should be mapped to a case class as follows:
case class CustomerAccountOutput(
  customerId: String,
  forename: String,
  surname: String,
  // Accounts for this customer
  accounts: Seq[AccountData],
  // Statistics of the accounts
  numberAccounts: Int,
  totalBalance: Long,
  averageBalance: Double
)
I really need help with this. Stuck with this for weeks without a working solution.
Let's say you have the following DataFrame:
val sourceDf = Seq((1, Array("aa", "CA")), (2, Array("bb", "OH"))).toDF("id", "data_arr")
sourceDf.show()
// output:
+---+--------+
| id|data_arr|
+---+--------+
| 1|[aa, CA]|
| 2|[bb, OH]|
+---+--------+
and you want to convert it to the following schema:
val destSchema = StructType(Array(
  StructField("id", IntegerType, true),
  StructField("name", StringType, true),
  StructField("state", StringType, true)
))
You can do:
import scala.collection.mutable
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder

val destDf: DataFrame = sourceDf
  .map { sourceRow =>
    Row(sourceRow(0), sourceRow.getAs[mutable.WrappedArray[String]](1)(0), sourceRow.getAs[mutable.WrappedArray[String]](1)(1))
  }(RowEncoder(destSchema))
destDf.show()
// output:
+---+----+-----+
| id|name|state|
+---+----+-----+
| 1| aa| CA|
| 2| bb| OH|
+---+----+-----+
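Applied back to the question's data, the same map-based approach (no spark.sql.functions needed) can build the CustomerAccountOutput records directly from the joined Dataset. A rough sketch, assuming illustrative case classes like Customer(customerId, forename, surname) and AccountData(customerId, accountId, balance) with a Long balance (these definitions are not from the original post) and spark.implicits._ in scope:
// Hypothetical case classes mirroring the rows shown in the post.
case class Customer(customerId: String, forename: String, surname: String)
case class AccountData(customerId: String, accountId: String, balance: Long)

// newTest: Dataset[(Customer, (String, Seq[AccountData]))]; the right side is
// null for customers with no accounts because of the left outer joinWith.
val output = newTest.map { case (cust, grouped) =>
  val accounts = Option(grouped).map(_._2).getOrElse(Seq.empty[AccountData])
  val total = accounts.map(_.balance).sum
  val count = accounts.size
  CustomerAccountOutput(
    cust.customerId,
    cust.forename,
    cust.surname,
    accounts,
    count,
    total,
    if (count == 0) 0.0 else total.toDouble / count
  )
}
output.show(500, false)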

How to create a dataframe from Array[Strings]?

I used rdd.collect() to create an Array and now I want to use this Array[String] to create a DataFrame. My test file is in the following format (separated by a pipe |):
TimeStamp
IdC
Name
FileName
Start-0f-fields
column01
column02
column03
column04
column05
column06
column07
column08
column010
column11
End-of-fields
Start-of-data
G0002B|0|13|IS|LS|Xys|Xyz|12|23|48|
G0002A|0|13|IS|LS|Xys|Xyz|12|23|45|
G0002x|0|13|IS|LS|Xys|Xyz|12|23|48|
G0002C|0|13|IS|LS|Xys|Xyz|12|23|48|
End-of-data
document
The column names are between Start-of-fields and End-of-fields.
I want to store the pipe-separated ("|") values in different columns of a DataFrame,
like the example below:
column01 column02 column03 column04 column05 column06 column07 column08 column010 column11
G0002C 0 13 IS LS Xys Xyz 12 23 48
G0002x 0 13 LS MS Xys Xyz 14 300 400
My code:
val rdd = sc.textFile("the above text file")
val columns = rdd.collect.slice(5,16).mkString(",") // it will hold columnnames
val data = rdd.collect.slice(5,16)
val rdd1 = sc.parallelize(rdd.collect())
val df = rdd1.toDf(columns)
but this is not giving me the desired DataFrame shown above.
Could you try this?
import spark.implicits._ // Add to use `toDS()` and `toDF()`
val rdd = sc.textFile("the above text file")
val columns = rdd.collect.slice(5,15) // the column names between Start-of-fields and End-of-fields; `.mkString(",")` is not needed
val dataDS = rdd.collect.slice(17,21) // the data rows between Start-of-data and End-of-data
  .map(_.trim()) // to remove whitespaces
  .map(s => s.substring(0, s.length - 1)) // to remove the trailing pipe '|'
  .toSeq
  .toDS
val df = spark.read
.option("header", false)
.option("delimiter", "|")
.csv(dataDS)
.toDF(columns: _*)
df.show(false)
+--------+--------+--------+--------+--------+--------+--------+--------+---------+--------+
|column01|column02|column03|column04|column05|column06|column07|column08|column010|column11|
+--------+--------+--------+--------+--------+--------+--------+--------+---------+--------+
|G0002B |0 |13 |IS |LS |Xys |Xyz |12 |23 |48 |
|G0002A |0 |13 |IS |LS |Xys |Xyz |12 |23 |45 |
|G0002x |0 |13 |IS |LS |Xys |Xyz |12 |23 |48 |
|G0002C |0 |13 |IS |LS |Xys |Xyz |12 |23 |48 |
+--------+--------+--------+--------+--------+--------+--------+--------+---------+--------+
Calling the spark.read...csv() method without a schema can take a long time on huge data because of schema inference.
In that case, you can specify the schema like below.
/*
column01 STRING,
column02 STRING,
column03 STRING,
...
*/
val schema = columns
.map(c => s"$c STRING")
.mkString(",\n")
val df = spark.read
.option("header", false)
.option("delimiter", "|")
.schema(schema) // no schema inference occurs
.csv(dataDS)
// .toDF(columns: _*) => unnecessary when schema is specified
If the number of columns and the column names are fixed, then you can do it as below:
val columns = rdd.collect.slice(5,15).mkString(",") // it will hold the column names
val data = rdd.collect.slice(17,21)
val d = data.mkString("\n").split('\n').toSeq.toDF()
import org.apache.spark.sql.functions._
val dd = d.withColumn("columnX", split($"value", "\\|"))
  .withColumn("column1", $"columnX".getItem(0))
  .withColumn("column2", $"columnX".getItem(1))
  .withColumn("column3", $"columnX".getItem(2))
  .withColumn("column4", $"columnX".getItem(3))
  .withColumn("column5", $"columnX".getItem(4))
  .withColumn("column6", $"columnX".getItem(5))
  .withColumn("column8", $"columnX".getItem(7))
  .withColumn("column10", $"columnX".getItem(8))
  .withColumn("column11", $"columnX".getItem(9))
  .drop("columnX", "value")
display(dd)
display(dd) will then show each value in its own column.
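For reference, the withColumn chain can also be generated from the header names instead of being spelled out by hand. A small sketch, assuming colNames holds the header lines as a sequence (i.e. rdd.collect.slice(5,15) without the mkString):
import org.apache.spark.sql.functions.{col, split}

// Split once, then project every array element under its header name.
val colNames = rdd.collect.slice(5,15)   // header names as Array[String]
val splitCol = split(col("value"), "\\|")
val expanded = d.select(colNames.zipWithIndex.map { case (name, i) => splitCol.getItem(i).as(name) }: _*)
expanded.show(false)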

Read csv into Dataframe with nested column

I have a csv file like this:
weight,animal_type,animal_interpretation
20,dog,"{is_large_animal=true, is_mammal=true}"
3.5,cat,"{is_large_animal=false, is_mammal=true}"
6.00E-04,ant,"{is_large_animal=false, is_mammal=false}"
And I created case class schema with the following:
package types
case class AnimalsType (
weight: Option[Double],
animal_type: Option[String],
animal_interpretation: Option[AnimalInterpretation]
)
case class AnimalInterpretation (
is_large_animal: Option[Boolean],
is_mammal: Option[Boolean]
)
I tried to load the csv into a dataframe with:
var df = spark.read.format("csv").option("header", "true").load("src/main/resources/animals.csv").as[AnimalsType]
But got the following exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Can't extract value from animal_interpretation#12: need struct type but got string;
Am I doing something wrong? What would be the proper way of doing this?
You cannot assign that schema to the CSV string column directly. You need to transform the CSV string column (animal_interpretation) into JSON format, as I have done in the code below using a UDF. If you can get the input data in a format like df1, there is no need for the UDF; you can continue from df1 and get the final DataFrame df2.
There is no need for any case class, since your header already provides the column names; for the JSON data you only need to declare the schema AnimalInterpretationSch as below.
scala> import org.apache.spark.sql.types._
scala> import org.apache.spark.sql.expressions.UserDefinedFunction
//Input CSV DataFrame
scala> df.show(false)
+--------+-----------+---------------------------------------+
|weight |animal_type|animal_interpretation |
+--------+-----------+---------------------------------------+
|20 |dog |{is_large_animal=true, is_mammal=true} |
|3.5 |cat |{is_large_animal=false, is_mammal=true}|
|6.00E-04|ant |{is_large_animal=false,is_mammal=false}|
+--------+-----------+---------------------------------------+
//UDF to convert "animal_interpretation" column to Json Format
scala> def StringToJson:UserDefinedFunction = udf((data:String,JsonColumn:String) => {
| var out = data
| val JsonColList = JsonColumn.trim.split(",").toList
| JsonColList.foreach{ rr =>
| out = out.replaceAll(rr, "'"+rr+"'")
| }
| out = out.replaceAll("=", ":")
| out
| })
//All columns in the Json
scala> val JsonCol = "is_large_animal,is_mammal"
//New dataframe with Json format
scala> val df1 = df.withColumn("animal_interpretation", StringToJson(col("animal_interpretation"), lit(JsonCol)))
scala> df1.show(false)
+--------+-----------+-------------------------------------------+
|weight |animal_type|animal_interpretation |
+--------+-----------+-------------------------------------------+
|20 |dog |{'is_large_animal':true, 'is_mammal':true} |
|3.5 |cat |{'is_large_animal':false, 'is_mammal':true}|
|6.00E-04|ant |{'is_large_animal':false,'is_mammal':false}|
+--------+-----------+-------------------------------------------+
//Schema declaration for the Json format
scala> val AnimalInterpretationSch = new StructType().add("is_large_animal", BooleanType).add("is_mammal", BooleanType)
//Accessing Json columns
scala> val df2 = df1.select(col("weight"), col("animal_type"),from_json(col("animal_interpretation"), AnimalInterpretationSch).as("jsondata")).select("weight", "animal_type", "jsondata.*")
scala> df2.printSchema
root
|-- weight: string (nullable = true)
|-- animal_type: string (nullable = true)
|-- is_large_animal: boolean (nullable = true)
|-- is_mammal: boolean (nullable = true)
scala> df2.show()
+--------+-----------+---------------+---------+
| weight|animal_type|is_large_animal|is_mammal|
+--------+-----------+---------------+---------+
| 20| dog| true| true|
| 3.5| cat| false| true|
|6.00E-04| ant| false| false|
+--------+-----------+---------------+---------+
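If the goal is still a Dataset[AnimalsType] as in the original attempt, one further step (a sketch, not part of the original answer; it assumes the case classes from the question and spark.implicits._ are in scope) is to cast weight and re-nest the two boolean columns before calling .as:
import org.apache.spark.sql.functions.{col, struct}

// Cast weight to Double and rebuild the nested struct expected by AnimalsType.
val typedDs = df2
  .select(
    col("weight").cast("double").as("weight"),
    col("animal_type"),
    struct(col("is_large_animal"), col("is_mammal")).as("animal_interpretation")
  )
  .as[AnimalsType]

typedDs.show(false)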

Scala: Passing elements from every row of a DataFrame to a function and getting the results back in separate rows

In my requirement, I have to pass two strings from two columns of my DataFrame to a function, get back a string result, and store it back in the DataFrame.
While passing the values as strings, the function always returns the same value, so the same value is populated in every row (in my case PPPP is populated in all rows).
Is there a way to pass the elements of those two columns from every row and get the results back in separate rows?
I am ready to modify my function to accept a DataFrame and return a DataFrame, or to accept an Array[String] and return an Array[String], but I don't know how to do that as I am new to programming. Can someone please help me?
Thanks.
def myFunction(key: String , value :String ) : String = {
//Do my functions and get back a string value2 and return this value2 string
value2
}
val DF2 = DF1.select (
DF1("col1")
,DF1("col2")
,DF1("col5") )
.withColumn("anyName", lit(myFunction ( DF1("col3").toString() , DF1("col4").toString() )))
/* DF1:
+-----+----+--------+----+----+
|col1 |col2|col3    |col4|col5|
+-----+----+--------+----+----+
|Hello|5   |valueAAA|XXX |123 |
|How  |3   |valueCCC|YYY |111 |
|World|5   |valueDDD|ZZZ |222 |
+-----+----+--------+----+----+

DF2:
+-----+----+----+-------+
|col1 |col2|col5|anyName|
+-----+----+----+-------+
|Hello|5   |123 |PPPPP  |
|How  |3   |111 |PPPPP  |
|World|5   |222 |PPPPP  |
+-----+----+----+-------+
*/
After you define the function, you need to register it as a UDF. The udf() function is available in org.apache.spark.sql.functions. Check this out:
scala> val DF1 = Seq(("Hello",5,"valueAAA","XXX",123),
| ("How",3,"valueCCC","YYY",111),
| ("World",5,"valueDDD","ZZZ",222)
| ).toDF("col1","col2","col3","col4","col5")
DF1: org.apache.spark.sql.DataFrame = [col1: string, col2: int ... 3 more fields]
scala> val DF2 = DF1.select ( DF1("col1") ,DF1("col2") ,DF1("col5") )
DF2: org.apache.spark.sql.DataFrame = [col1: string, col2: int ... 1 more field]
scala> DF2.show(false)
+-----+----+----+
|col1 |col2|col5|
+-----+----+----+
|Hello|5 |123 |
|How |3 |111 |
|World|5 |222 |
+-----+----+----+
scala> DF1.select("*").show(false)
+-----+----+--------+----+----+
|col1 |col2|col3 |col4|col5|
+-----+----+--------+----+----+
|Hello|5 |valueAAA|XXX |123 |
|How |3 |valueCCC|YYY |111 |
|World|5 |valueDDD|ZZZ |222 |
+-----+----+--------+----+----+
scala> def myConcat(a:String,b:String):String=
| return a + "--" + b
myConcat: (a: String, b: String)String
scala>
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> val myConcatUDF = udf(myConcat(_:String,_:String):String)
myConcatUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,StringType,Some(List(StringType, StringType)))
scala> DF1.select ( DF1("col1") ,DF1("col2") ,DF1("col5"), myConcatUDF( DF1("col3"), DF1("col4"))).show()
+-----+----+----+---------------+
| col1|col2|col5|UDF(col3, col4)|
+-----+----+----+---------------+
|Hello| 5| 123| valueAAA--XXX|
| How| 3| 111| valueCCC--YYY|
|World| 5| 222| valueDDD--ZZZ|
+-----+----+----+---------------+
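The generated column is named UDF(col3, col4) by default; to get the anyName column from the question, simply alias the UDF result:
scala> DF1.select(DF1("col1"), DF1("col2"), DF1("col5"), myConcatUDF(DF1("col3"), DF1("col4")).as("anyName")).show()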

How to refer to data in Spark SQL using column names?

If I have a file named shades.txt with the following data:
1 | 1 | dark red
2 | 1 | light red
3 | 2 | dark green
4 | 3 | light blue
5 | 3 | sky blue
I can store it in SparkSQL like below:
var shades_part = sc.textFile("shades.txt")
var shades_data = shades_part.map(line => line.split("\\|").map(elem => elem.trim))
shades_data.toDF().createOrReplaceTempView("shades_data")
But when I try to run a query on it
sqlContext.sql("select * from shades_data A where A.shade_name = 'dark green'")
I get an error:
cannot resolve '`A.shade_name`' given input columns: [A.value];
You can use a case class for your code, as below:
case class Shades(id: Int, colorCode:Int, shadeColor: String)
Then modify your code as
var shades_part = sc.textFile("shades.txt")
var shades_data = shades_part.map(line => line.split("\\|")).map(elem => Shades(elem(0).trim.toInt, elem(1).trim.toInt, elem(2).trim))
val df = shades_data.toDF()
You should have a dataframe like this:
+---+---------+----------+
|id |colorCode|shadeColor|
+---+---------+----------+
|1 |1 |dark red |
|2 |1 |light red |
|3 |2 |dark green|
|4 |3 |light blue|
|5 |3 |sky blue |
+---+---------+----------+
Now you can use the filter function as
df.filter($"shadeColor" === "dark green").show(false)
which should give you
+---+---------+----------+
|id |colorCode|shadeColor|
+---+---------+----------+
|3 |2 |dark green|
+---+---------+----------+
Using Schema
You can create a schema as
val schema = StructType(Array(StructField("id", IntegerType, true), StructField("colorCode", IntegerType, true), StructField("shadeColor", StringType, true)))
and use the schema in sqlContext as
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("delimiter", "|")
.schema(schema)
.option("ignoreTrailingWhiteSpace", true)
.option("ignoreLeadingWhiteSpace", true)
.load("shades.txt")
Or you can use createDataFrame function as
var shades_part = sc.textFile("shades.txt")
var shades_data = shades_part.map(line => line.split("\\|").map(_.trim)).map(elem => Row.fromSeq(Seq(elem(0).toInt, elem(1).toInt, elem(2))))
sqlContext.createDataFrame(shades_data, schema).show(false)
Check the schema of the dataframe and you will know the column names:
val dataframe = shades_data.toDF()
dataframe.printSchema()
If there are no column names defined while creating a dataframe, then by default it will use generated names (such as _c0, _c1, ...) as column names.
Or else you can give column names while creating the dataframe, like below:
val dataframe = shades_data.toDF("col1", "col2", "col3")
dataframe.printSchema()
root
|-- col1: integer (nullable = false)
|-- col2: integer (nullable = false)
|-- col3: string (nullable = true)
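Once the columns have real names, the SQL approach from the question also works after registering a temp view; a short sketch, using the id/colorCode/shadeColor DataFrame from the first answer:
// Register the named DataFrame and run the original style of query.
df.createOrReplaceTempView("shades_data")
spark.sql("select * from shades_data A where A.shadeColor = 'dark green'").show(false)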