Spark ORC reading string as decimal value - Scala

I am reading an ORC file with the following data:
| C1 | C2     |
| 1  | 1954E7 |
My column C1 should be int and C2 should be string, but Spark interprets C2 as a decimal. I tried the following code to work around it:
spark.read.option("inferSchema","false").option("header", "true").orc("path to file")
But the Spark ORC reader still uses the file's schema even though I turn inferSchema off. Is there a way to force Spark not to read the schema, so that I can apply my custom schema after the read?

import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.{SparkSession}
// spark: SparkSession
import spark.implicits._
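// Note: ORC is a self-describing format; the column types are stored in the file itself,
// so the CSV-style options inferSchema and header have no effect on the ORC reader.
// The two demos below show that reading back simply returns the types that were written.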
When C2 is a string:
val pathORC =
"<path>/source.orc"
case class O(C1: Int, C2: String)
val source = Seq(O(1, "1954E7")).toDF()
source.printSchema()
// root
// |-- C1: integer (nullable = false)
// |-- C2: string (nullable = true)
source.show(false)
// +---+------+
// |C1 |C2 |
// +---+------+
// |1 |1954E7|
// +---+------+
source.write.mode("overwrite").orc(pathORC)
val res = spark.read.orc(pathORC)
res.printSchema()
// root
// |-- C1: integer (nullable = true)
// |-- C2: string (nullable = true)
res.show(false)
// +---+------+
// |C1 |C2 |
// +---+------+
// |1 |1954E7|
// +---+------+
When C2 is a double (the Scala literal 1954e7 is a Double):
val pathORC1 =
"<path>/source1.orc"
val source1 = Seq((1, 1954e7)).toDF("C1", "C2")
source1.printSchema()
// root
// |-- C1: integer (nullable = false)
// |-- C2: double (nullable = false)
source1.show(false)
// +---+--------+
// |C1 |C2 |
// +---+--------+
// |1 |1.954E10|
// +---+--------+
source1.write.mode("overwrite").orc(pathORC1)
val res1 = spark.read.orc(pathORC1)
res1.printSchema()
// root
// |-- C1: integer (nullable = true)
// |-- C2: double (nullable = true)
res1.show(false)
// +---+--------+
// |C1 |C2 |
// +---+--------+
// |1 |1.954E10|
// +---+--------+
// crude back-conversion to a string: "1.954E10" -> "1954E10"; note the original
// text "1954E7" cannot be recovered once the value has been stored as a double
val dToStr = udf( (v: Double) => { v.toString.replace(".", "") } )
val res2 = res1
.withColumn("C2", dToStr(col("C2")))
res2.printSchema()
// root
// |-- C1: integer (nullable = true)
// |-- C2: string (nullable = true)
res2.show(false)
// +---+-------+
// |C1 |C2 |
// +---+-------+
// |1 |1954E10|
// +---+-------+
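If the goal is only to get C2 back as a string after the read, a cast is usually enough; however, once the value has been stored in the ORC file as a double or decimal, the original text "1954E7" is gone, so the real fix is to write the column as a string in the first place. A minimal hedged sketch (the path is a placeholder):
import org.apache.spark.sql.functions.col
// read with the schema embedded in the ORC file, then cast the column afterwards
val casted = spark.read.orc("path to file")
  .withColumn("C2", col("C2").cast("string"))
casted.printSchema()
// C2 is now a string, but it holds the string form of the stored number,
// not necessarily the text that was in the original source data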

Convert dataset to dataframe from an avro file

I wrote a Scala script to load an Avro file and work with the resulting data (to retrieve the top contributors).
The problem is that loading the file gives a dataset that I cannot convert to a dataframe, because it contains some complex types:
import scala.collection.mutable
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema

val history_src = "path_to_avro_files\\frwiki*.avro"
val revisions_dataset = spark.read.format("avro").load(history_src)
// gives a dataset; we can see the data and call take(1) without problems

val first_essay = revisions_dataset.map(row => (row.getString(0), row.getLong(2),
  row.get(3).asInstanceOf[mutable.WrappedArray[Revision]].array
    .map(x => (x.r_contributor.r_username, x.r_contributor.r_contributor_id, x.r_contributor.r_contributor_ip)))).take(1)
// fails with: GenericRowWithSchema cannot be cast to Revision

val second_essay = revisions_dataset.map(row => (row.getString(0), row.getLong(2),
  row.get(3).asInstanceOf[mutable.WrappedArray[GenericRowWithSchema]].toStream
    .map(x => (x.getLong(0), row.get(3).asInstanceOf[mutable.WrappedArray[GenericRowWithSchema]].map(c => c.getLong(0)))))).take(1)
// fails with: WrappedArray$ofRef cannot be cast to scala.collection.mutable.ArrayBuffer
I tried with Encoders and Encoder using my case classes below, but it didn't work:
case class History (title: String, namespace: Long, id: Long, revisions: Array[Revision])
case class Contributor (r_username: String, r_contributor_id: Long, r_contributor_ip: String)
case class Revision(r_id: Long, r_parent_id: Long, timestamp : Long, r_contributor: Contributor, sha: String)
The schema generated from my revisions_dataset looks like this:
root
|-- p_title: string (nullable = true)
|-- p_namespace: long (nullable = true)
|-- p_id: long (nullable = true)
|-- p_revisions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- r_id: long (nullable = true)
| | |-- r_parent_id: long (nullable = true)
| | |-- r_timestamp: long (nullable = true)
| | |-- r_contributor: struct (nullable = true)
| | | |-- r_username: string (nullable = true)
| | | |-- r_contributor_id: long (nullable = true)
| | | |-- r_contributor_ip: string (nullable = true)
| | |-- r_sha1: string (nullable = true)
My goal is to have a dataframe from which I can retrieve the list of contributors in the revisions array and flatten it, so that the contributors sit at the same level as the title.
Any help, please?
import org.apache.spark.sql.functions._
// spark: SparkSession
import spark.implicits._
val r1 = Revision(1, 1, 1, Contributor("c1", 1, "ip1"), "sha")
val r2 = Revision(1, 1, 1, Contributor("c2", 2, "ip2"), "sha")
val r3 = Revision(1, 1, 1, Contributor("c3", 3, "ip3"), "sha")
val revisions_dataset = Seq(
("title1", 0L, 1L, Array(r1, r2)),
("title1", 0L, 2L, Array(r1, r3)),
("title1", 0L, 3L, Array(r2))
).toDF("p_title", "p_namespace", "p_id", "p_revisions")
val flattened = revisions_dataset.select($"p_title", $"p_id", explode($"p_revisions").alias("p_revision"))
.withColumn("r_contributor_username", $"p_revision.r_contributor.r_username")
.withColumn("r_contributor_id", $"p_revision.r_contributor.r_contributor_id")
.withColumn("r_contributor_ip", $"p_revision.r_contributor.r_contributor_ip")
.drop("p_revision")
flattened.show(false)
Output:
+-------+----+----------------------+----------------+----------------+
|p_title|p_id|r_contributor_username|r_contributor_id|r_contributor_ip|
+-------+----+----------------------+----------------+----------------+
|title1 |1 |c1 |1 |ip1 |
|title1 |1 |c2 |2 |ip2 |
|title1 |2 |c1 |1 |ip1 |
|title1 |2 |c3 |3 |ip3 |
|title1 |3 |c2 |2 |ip2 |
+-------+----+----------------------+----------------+----------------+
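If you also want the contributors gathered back into a single list per page, at the same level as the title, a hedged follow-up sketch on top of flattened:
import org.apache.spark.sql.functions.{collect_list, struct}
// regroup the exploded rows per page and collect the contributor fields into one array
val contributorsPerPage = flattened
  .groupBy("p_title")
  .agg(collect_list(struct("r_contributor_username", "r_contributor_id", "r_contributor_ip")).as("contributors"))
contributorsPerPage.show(false)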

add parent column name as prefix to avoid ambiguity

Check the code below. It generates a dataframe with ambiguous columns when duplicate keys are present. How should we modify the code so that the parent column name is added as a prefix?
I added another column with JSON data.
scala> val df = Seq(
(77, "email1", """{"key1":38,"key3":39}""","""{"name":"aaa","age":10}"""),
(78, "email2", """{"key1":38,"key4":39}""","""{"name":"bbb","age":20}"""),
(178, "email21", """{"key1":"when string","key4":36, "key6":"test", "key10":false }""","""{"name":"ccc","age":30}"""),
(179, "email8", """{"sub1":"qwerty","sub2":["42"]}""","""{"name":"ddd","age":40}"""),
(180, "email8", """{"sub1":"qwerty","sub2":["42", "56", "test"]}""","""{"name":"eee","age":50}""")
).toDF("id", "name", "colJson","personInfo")
scala> df.printSchema
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- colJson: string (nullable = true)
|-- personInfo: string (nullable = true)
scala> df.show(false)
+---+-------+---------------------------------------------------------------+-----------------------+
|id |name |colJson |personInfo |
+---+-------+---------------------------------------------------------------+-----------------------+
|77 |email1 |{"key1":38,"key3":39} |{"name":"aaa","age":10}|
|78 |email2 |{"key1":38,"key4":39} |{"name":"bbb","age":20}|
|178|email21|{"key1":"when string","key4":36, "key6":"test", "key10":false }|{"name":"ccc","age":30}|
|179|email8 |{"sub1":"qwerty","sub2":["42"]} |{"name":"ddd","age":40}|
|180|email8 |{"sub1":"qwerty","sub2":["42", "56", "test"]} |{"name":"eee","age":50}|
+---+-------+---------------------------------------------------------------+-----------------------+
I created a fromJson implicit function. You can pass multiple columns to it, and it will parse the JSON and extract the columns.
scala> :paste
// Entering paste mode (ctrl-D to finish)
import org.apache.spark.sql.{Column, DataFrame, Row}
import org.apache.spark.sql.functions.from_json
implicit class DFHelper(inDF: DataFrame) {
  import inDF.sparkSession.implicits._
  def fromJson(columns: Column*): DataFrame = {
    val schemas = columns.map(column => (column, inDF.sparkSession.read.json(inDF.select(column).as[String]).schema))
    val mdf = schemas.foldLeft(inDF)((df, schema) => {
      df.withColumn(schema._1.toString(), from_json(schema._1, schema._2))
    })
    mdf.selectExpr(mdf.schema.map(c => if (c.dataType.typeName == "struct") s"${c.name}.*" else c.name): _*)
  }
}
// Exiting paste mode, now interpreting.
import org.apache.spark.sql.{Column, DataFrame, Row}
import org.apache.spark.sql.functions.from_json
defined class DFHelper
scala> df.fromJson($"colJson",$"personInfo").show(false)
+---+-------+-----------+-----+----+----+----+------+--------------+---+----+
|id |name |key1 |key10|key3|key4|key6|sub1 |sub2 |age|name|
+---+-------+-----------+-----+----+----+----+------+--------------+---+----+
|77 |email1 |38 |null |39 |null|null|null |null |10 |aaa |
|78 |email2 |38 |null |null|39 |null|null |null |20 |bbb |
|178|email21|when string|false|null|36 |test|null |null |30 |ccc |
|179|email8 |null |null |null|null|null|qwerty|[42] |40 |ddd |
|180|email8 |null |null |null|null|null|qwerty|[42, 56, test]|50 |eee |
+---+-------+-----------+-----+----+----+----+------+--------------+---+----+
scala> df.fromJson($"colJson",$"personInfo").printSchema()
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- key1: string (nullable = true)
|-- key10: boolean (nullable = true)
|-- key3: long (nullable = true)
|-- key4: long (nullable = true)
|-- key6: string (nullable = true)
|-- sub1: string (nullable = true)
|-- sub2: array (nullable = true)
| |-- element: string (containsNull = true)
|-- age: long (nullable = true)
|-- name: string (nullable = true)
Try this:
df.show(false)
df.printSchema()
/**
* +---+-------+---------------------------------------------------------------+-----------------------+
* |id |name |colJson |personInfo |
* +---+-------+---------------------------------------------------------------+-----------------------+
* |77 |email1 |{"key1":38,"key3":39} |{"name":"aaa","age":10}|
* |78 |email2 |{"key1":38,"key4":39} |{"name":"bbb","age":20}|
* |178|email21|{"key1":"when string","key4":36, "key6":"test", "key10":false }|{"name":"ccc","age":30}|
* |179|email8 |{"sub1":"qwerty","sub2":["42"]} |{"name":"ddd","age":40}|
* |180|email8 |{"sub1":"qwerty","sub2":["42", "56", "test"]} |{"name":"eee","age":50}|
* +---+-------+---------------------------------------------------------------+-----------------------+
*
* root
* |-- id: integer (nullable = false)
* |-- name: string (nullable = true)
* |-- colJson: string (nullable = true)
* |-- personInfo: string (nullable = true)
*
* @param inDF
*/
implicit class DFHelper(inDF: DataFrame) {
  import inDF.sparkSession.implicits._
  def fromJson(columns: Column*): DataFrame = {
    val schemas = columns.map(column => (column, inDF.sparkSession.read.json(inDF.select(column).as[String]).schema))
    val mdf = schemas.foldLeft(inDF)((df, schema) => {
      df.withColumn(schema._1.toString(), from_json(schema._1, schema._2))
    })
    mdf // .selectExpr(mdf.schema.map(c => if (c.dataType.typeName == "struct") s"${c.name}.*" else c.name): _*)
  }
}
val p = df.fromJson($"colJson", $"personInfo")
p.show(false)
p.printSchema()
/**
* +---+-------+---------------------------------+----------+
* |id |name |colJson |personInfo|
* +---+-------+---------------------------------+----------+
* |77 |email1 |[38,, 39,,,,] |[10, aaa] |
* |78 |email2 |[38,,, 39,,,] |[20, bbb] |
* |178|email21|[when string, false,, 36, test,,]|[30, ccc] |
* |179|email8 |[,,,,, qwerty, [42]] |[40, ddd] |
* |180|email8 |[,,,,, qwerty, [42, 56, test]] |[50, eee] |
* +---+-------+---------------------------------+----------+
*
* root
* |-- id: integer (nullable = false)
* |-- name: string (nullable = true)
* |-- colJson: struct (nullable = true)
* | |-- key1: string (nullable = true)
* | |-- key10: boolean (nullable = true)
* | |-- key3: long (nullable = true)
* | |-- key4: long (nullable = true)
* | |-- key6: string (nullable = true)
* | |-- sub1: string (nullable = true)
* | |-- sub2: array (nullable = true)
* | | |-- element: string (containsNull = true)
* |-- personInfo: struct (nullable = true)
* | |-- age: long (nullable = true)
* | |-- name: string (nullable = true)
*/
// fetch columns of struct using <parent_col>.<child_col>
p.select($"colJson.key1", $"personInfo.age").show(false)
/**
* +-----------+---+
* |key1 |age|
* +-----------+---+
* |38 |10 |
* |38 |20 |
* |when string|30 |
* |null |40 |
* |null |50 |
* +-----------+---+
*/
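If you do want the flattened layout with the parent column name as a prefix, which is what the question literally asks for, one hedged sketch is to build the select list from the schema of p yourself (the variable names below are mine):
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType
// prefix every nested field with its parent struct name, e.g. colJson_key1, personInfo_name
val prefixed = p.select(p.schema.fields.toSeq.flatMap { f =>
  f.dataType match {
    case st: StructType => st.fieldNames.toSeq.map(n => col(s"${f.name}.$n").as(s"${f.name}_$n"))
    case _              => Seq(col(f.name))
  }
}: _*)
prefixed.printSchema()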

Remove field from array.struct in Spark

I want to delete one field from an array of structs, as follows:
import scala.collection.mutable
import org.apache.spark.sql.functions
// spark: SparkSession
import spark.implicits._

case class myObj(id: String, item_value: String, delete: String)
case class myObj2(id: String, item_value: String)

val df2 = Seq(
  ("1", "2", "..100values", Seq(myObj("A", "1a", "1"), myObj("B", "4r", "2"))),
  ("1", "2", "..100values", Seq(myObj("X", "1p", "11"), myObj("V", "7w", "8")))
).toDF("1", "2", "100fields", "myArr")

val deleteColumn: (mutable.WrappedArray[myObj] => mutable.WrappedArray[myObj2]) = {
  (array: mutable.WrappedArray[myObj]) => array.map(o => myObj2(o.id, o.item_value))
}
val myUDF3 = functions.udf(deleteColumn)
df2.withColumn("newArr", myUDF3($"myArr")).show(false)
The error is very clear:
Exception in thread "main" org.apache.spark.SparkException: Failed to execute user defined function(anonfun$1: (array<struct<id:string,item_value:string,delete:string>>) => array<struct<id:string,item_value:string>>)
The types do not match, but that is exactly what I want to do: map from one structure to another.
I am using a UDF because df.map() is not well suited to mapping a single column and forces me to specify all columns, so I have not found a better way to apply this mapping to one column.
You can rewrite your UDF to take a Row instead of a custom object, as below:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

val deleteColumn = udf((value: Seq[Row]) => {
  value.map(row => myObj2(row.getString(0), row.getString(1)))
})
df2.withColumn("newArr", deleteColumn($"myArr")).show(false)
Output:
+---+---+-----------+---------------------+----------------+
|1 |2 |100fields |myArr |newArr |
+---+---+-----------+---------------------+----------------+
|1 |2 |..100values|[[A,1a,1], [B,4r,2]] |[[A,1a], [B,4r]]|
|1 |2 |..100values|[[X,1p,11], [V,7w,8]]|[[X,1p], [V,7w]]|
+---+---+-----------+---------------------+----------------+
Without using a UDF, one can easily remove fields from an array of structs using dropFields together with transform (Spark 3.1+).
Test input:
import org.apache.spark.sql.functions.{array, col, struct, transform}

val df = spark.createDataFrame(Seq(("v1", "v2", "v3", "v4"))).toDF("f1", "f2", "f3", "f4")
  .select(
    array(
      struct("f1", "f2"),
      struct(col("f3").as("f1"), col("f4").as("f2"))
    ).as("myArr")
  )
df.printSchema()
// root
// |-- myArr: array (nullable = false)
// | |-- element: struct (containsNull = false)
// | | |-- f1: string (nullable = true)
// | | |-- f2: string (nullable = true)
Script:
val df2 = df.withColumn(
"myArr",
transform(
$"myArr",
x => x.dropFields("f2")
)
)
df2.printSchema()
// root
// |-- myArr: array (nullable = false)
// | |-- element: struct (containsNull = false)
// | | |-- f1: string (nullable = true)
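Applied to the question's df2, the same approach drops the delete field directly, without any intermediate case class; a hedged sketch assuming Spark 3.1+ for dropFields:
import org.apache.spark.sql.functions.transform
// rewrite each struct element of the array, removing only the "delete" field
val cleaned = df2.withColumn("newArr", transform($"myArr", x => x.dropFields("delete")))
cleaned.printSchema()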

convert pyspark dataframe value to customized schema

I am receiving streaming data from Kafka. By default, dataframe.value is of "string" type. For example, dataframe.value is:
1.0,2.0,4,'a'
1.1,2.1,3,'a1'
The schema of dataframe.value:
root
|-- value: string (nullable = true)
Now I want to define a schema on this data frame. The schema I want as output:
root
|-- c1: double (nullable = true)
|-- c2: double (nullable = true)
|-- c3: integer (nullable = true)
|-- c4: string (nullable = true)
I define the schema and then load the data from Kafka, but I get the error "Kafka has already defined schema, can not apply the customized one".
Any help on this issue will be highly appreciated.
You can define the schema when you convert to a data frame.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

kafkaRdd = sc.parallelize([(1.0, 2.0, 4, 'a'), (1.1, 2.1, 3, 'a1')])
schema = StructType([
    StructField("c1", DoubleType(), True),
    StructField("c2", DoubleType(), True),
    StructField("c3", IntegerType(), True),
    StructField("c4", StringType(), True),
])
df = kafkaRdd.toDF(schema)
df.show()
df.printSchema()
Here is the output:
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
|1.0|2.0| 4| a|
|1.1|2.1| 3| a1|
+---+---+---+---+
root
|-- c1: double (nullable = true)
|-- c2: double (nullable = true)
|-- c3: integer (nullable = true)
|-- c4: string (nullable = true)

Access names of fields in struct Spark SQL

I am trying to 'lift' the fields of a struct to the top level in a dataframe, as illustrated by this example:
case class A(a1: String, a2: String)
case class B(b1: String, b2: A)
val df = Seq(B("X",A("Y","Z"))).toDF
df.show
+---+-----+
| b1| b2|
+---+-----+
| X|[Y,Z]|
+---+-----+
df.printSchema
root
|-- b1: string (nullable = true)
|-- b2: struct (nullable = true)
| |-- a1: string (nullable = true)
| |-- a2: string (nullable = true)
val lifted = df.withColumn("a1", $"b2.a1").withColumn("a2", $"b2.a2").drop("b2")
lifted.show
+---+---+---+
| b1| a1| a2|
+---+---+---+
| X| Y| Z|
+---+---+---+
lifted.printSchema
root
|-- b1: string (nullable = true)
|-- a1: string (nullable = true)
|-- a2: string (nullable = true)
This works. I would like to create a little utility method which does this for me, probably through pimping DataFrame to enable something like df.lift("b2").
To do this, I think I want a way of obtaining a list of all fields within a Struct. E.g. given "b2" as input, return ["a1","a2"]. How do I do this?
If I understand your question correctly, you want to be able to list the nested fields of column b2.
So you would need to filter on b2, access the StructType of b2 and then map the names of the columns from within the fields (StructField):
import org.apache.spark.sql.types.StructType
val nested_fields = df.schema
.filter(c => c.name == "b2")
.flatMap(_.dataType.asInstanceOf[StructType].fields)
.map(_.name)
// nested_fields: Seq[String] = List(a1, a2)
Alternatively, you can use .fieldNames.toList on the struct's StructType:
val nested_fields = df.schema("b2").dataType.asInstanceOf[StructType].fieldNames.toList
It returns a List[String]. If you want a list of Columns instead, map each name to a column.
I hope it helps.
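Putting the two answers together, a hedged sketch of the df.lift("b2") helper the question describes (the class and method names are mine, not an existing Spark API):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

implicit class StructLiftOps(df: DataFrame) {
  // promote the fields of a struct column to top-level columns and drop the struct itself
  def lift(structCol: String): DataFrame = {
    val nested = df.schema(structCol).dataType.asInstanceOf[StructType].fieldNames.toSeq
    val kept   = df.columns.filterNot(_ == structCol).toSeq.map(col)
    df.select(kept ++ nested.map(f => col(s"$structCol.$f").as(f)): _*)
  }
}
// usage: df.lift("b2") yields the columns b1, a1, a2 from the example above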