I have just started learning Spark. I know that if the inferSchema option is set to true, the schema is inferred automatically. I am reading a simple CSV file. How do I infer the schema dynamically, without specifying any custom schema in my code? The code should be able to build the schema for any incoming dataset.
Is it possible to do so?
I tried using readStream with the format set to csv and skipped the inferSchema option altogether, but it seems I need to provide that option in any case.
val ds1: DataFrame = spark
.readStream
.format("csv")
.load("/home/vaibha/Downloads/C2ImportCalEventSample.csv")
println(ds1.show(2))
You can infer the schema dynamically, although it can get a bit tedious for some CSV files. More details can be found here. Assuming the CSV file in your code sample is the same as the one here, something like the following will give you what you need:
scala> val df = spark.read.
| option("header", "true").
| option("inferSchema", "true").
| option("timestampFormat","MM/dd/yyyy").
| csv("D:\\texts\\C2ImportCalEventSample.csv")
df: org.apache.spark.sql.DataFrame = [Start Date : timestamp, Start Time: string ... 15 more fields]
scala> df.printSchema
root
|-- Start Date : timestamp (nullable = true)
|-- Start Time: string (nullable = true)
|-- End Date: timestamp (nullable = true)
|-- End Time: string (nullable = true)
|-- Event Title : string (nullable = true)
|-- All Day Event: string (nullable = true)
|-- No End Time: string (nullable = true)
|-- Event Description: string (nullable = true)
|-- Contact : string (nullable = true)
|-- Contact Email: string (nullable = true)
|-- Contact Phone: string (nullable = true)
|-- Location: string (nullable = true)
|-- Category: integer (nullable = true)
|-- Mandatory: string (nullable = true)
|-- Registration: string (nullable = true)
|-- Maximum: integer (nullable = true)
|-- Last Date To Register: timestamp (nullable = true)
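To make the approach above reusable for any incoming CSV file, it can be wrapped in a small helper. This is only a sketch: the object and method names (CsvInferSketch, readCsvInferred, streamCsvWithInferredSchema) are made up for illustration, and it assumes the files have a header row. Because file-based readStream sources expect a schema by default, the second method infers the schema once from a sample file with a batch read and then passes it to the stream.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object; the names are illustrative, not part of Spark.
object CsvInferSketch {

  // Batch read: let Spark sample the file and infer column types.
  def readCsvInferred(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")      // first line holds the column names
      .option("inferSchema", "true") // extra pass over the data to guess types
      .csv(path)

  // Streaming read: infer the schema from one sample file, then reuse it
  // for the directory that the stream watches.
  def streamCsvWithInferredSchema(spark: SparkSession,
                                  samplePath: String,
                                  inputDir: String): DataFrame = {
    val inferred = readCsvInferred(spark, samplePath).schema
    spark.readStream
      .schema(inferred)
      .option("header", "true")
      .csv(inputDir)
  }
}
```

Note that show() cannot be called on a streaming DataFrame; it has to be started with writeStream first, so for a quick look at the data the batch variant is the simpler choice.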
How can I convert a column of struct type to a Map or a String? This is the schema:
root
|-- Col1: string (nullable = true)
|-- Col2: struct (nullable = true)
| |-- _1: string (nullable = true)
| |-- _2: integer (nullable = false)
The second column causes the problem when I want to dump the DataFrame into a file. I have tried many different approaches, such as casting to string, but that changed the values in the second column. I also tried to convert Col2 to a map, but I was not successful.
I tried to get the first value in the struct (_1) through a UDF, but it fails with this error:
Failed to execute user defined function($anonfun$1: (struct<_1:string,_2:int>) => string)
Select Col1, Col2._1, Col2._2 from <your table>
Run this through spark.sql, save the result to another DataFrame, and then write that DataFrame to CSV.
In Scala it can be done this way:
val df_new = df_old.select($"Col1", $"Col2._1", $"Col2._2")
You can also use the * notation to expand all the fields of a struct column.
Schema
root
|-- address: struct (nullable = false)
| |-- street: string (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
Expansion SQL
// show() returns Unit, so keep the DataFrame in df1 and call show() separately
val df1 = df.select("address.*")
df1.show(false)
df1.printSchema
root
|-- street: string (nullable = true)
|-- city: string (nullable = true)
|-- state: string (nullable = true)
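Applied to the Col1/Col2 schema from the question, the two options could look like the sketch below. The object name StructColumnSketch and the output column names Col2_1 and Col2_2 are made up for illustration. The second method uses to_json, which assumes Spark 2.1 or later, to turn the struct into a single string column when the nested values should stay together.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, to_json}

// Hypothetical helper object; the names are illustrative only.
object StructColumnSketch {

  // Option 1: pull each struct field out into its own top-level column,
  // which makes the DataFrame writable as CSV.
  def flatten(df: DataFrame): DataFrame =
    df.select(
      col("Col1"),
      col("Col2._1").as("Col2_1"),
      col("Col2._2").as("Col2_2"))

  // Option 2: keep one column, but serialize the struct as a JSON string.
  def asJsonString(df: DataFrame): DataFrame =
    df.withColumn("Col2", to_json(col("Col2")))
}
```

Either result can then be written with df.write.csv(...), since no struct column is left.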
I am trying to add a new column in each row of DataFrame like this
def addNamespace(iter: Iterator[Row]): Iterator[Row] = {
iter.map (row => {
println(row.getString(0))
// Row.fromSeq(row.toSeq ++ Array[String]("shared"))
val newseq = row.toSeq ++ Array[String]("shared")
Row(newseq: _*)
})
iter
}
def transformDf(source: DataFrame)(implicit spark: SparkSession): DataFrame = {
val newSchema = StructType(source.schema.fields ++ Array(StructField("namespace", StringType, nullable = true)))
val df = spark.sqlContext.createDataFrame(source.rdd.mapPartitions(addNamespace), newSchema)
df.show()
df
}
But I keep getting this error on the line df.show(): Caused by: java.lang.RuntimeException: org.apache.spark.unsafe.types.UTF8String is not a valid external type for schema of string
Can somebody please help me figure this out? I have searched through multiple posts, but everything I have tried gives me this error.
I have also tried val again = sourceDF.withColumn("namespace", functions.lit("shared")), but it has the same issue.
Schema of already read data
root
|-- name: string (nullable = true)
|-- data: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- description: string (nullable = true)
| |-- activates_on: timestamp (nullable = true)
| |-- expires_on: timestamp (nullable = true)
| |-- created_by: string (nullable = true)
| |-- created_on: timestamp (nullable = true)
| |-- updated_by: string (nullable = true)
| |-- updated_on: timestamp (nullable = true)
| |-- properties: map (nullable = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
Caused by: java.lang.RuntimeException:
org.apache.spark.unsafe.types.UTF8String is not a valid external type
for schema of string
means that Spark was handed a value it cannot accept as a string in a column whose schema type is string.
This clearly indicates a data type mismatch at the Catalyst level.
See the Spark code here:
override def eval(input: InternalRow): Any = {
val result = child.eval(input)
if (checkType(result)) {
result
} else {
throw new RuntimeException(s"${result.getClass.getName}$errMsg")
}
}
and the error message is s" is not a valid external type for schema of ${expected.catalogString}"
So UTF8String is Spark's internal string representation, not a regular java.lang.String. A Row with a string column expects a java.lang.String, so you need to convert the value before passing it as a string type; otherwise Catalyst will not accept what you are passing.
How to fix it?
Below is the SO content that addresses how to encode/decode between a UTF-8 string and a regular string and vice versa. You may need to adapt a suitable solution from it.
https://stackoverflow.com/a/5943395/647053
string decode utf-8
Note:
This online UTF-8 encoder/decoder tool is very handy for pasting in sample data and converting it to a string. Try this first.
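For comparison, here is a minimal sketch of the question's two helpers that should work on a DataFrame whose rows hold ordinary external types (java.lang.String rather than UTF8String). The object name AddNamespaceSketch and the method withNamespace are made up for illustration. The main change is that addNamespace returns the mapped iterator instead of the original iter, so the appended value actually reaches the new schema. If the error persists even with withColumn, the source DataFrame itself was probably built from rows carrying UTF8String values; that is an assumption about the question's data that cannot be verified from the post.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical helper object; the names are illustrative only.
object AddNamespaceSketch {

  // Append a plain java.lang.String to every row and return the MAPPED
  // iterator (the version in the question returned the untouched `iter`).
  def addNamespace(iter: Iterator[Row]): Iterator[Row] =
    iter.map(row => Row.fromSeq(row.toSeq :+ "shared"))

  // Rebuild the DataFrame with one extra string field in the schema.
  def transformDf(source: DataFrame)(implicit spark: SparkSession): DataFrame = {
    val newSchema = StructType(
      source.schema.fields :+ StructField("namespace", StringType, nullable = true))
    spark.createDataFrame(source.rdd.mapPartitions(addNamespace), newSchema)
  }

  // Simpler alternative: add a constant column without touching the RDD.
  def withNamespace(source: DataFrame): DataFrame =
    source.withColumn("namespace", lit("shared"))
}
```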
I have an RDD that has been created from some JSON, each record in the RDD contains key/value pairs. My RDD looks like:
myRdd.foreach(println)
{"sequence":89,"id":8697344444103393,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527636408955},1],
{"sequence":153,"id":8697389197662617,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637852762},1],
{"sequence":155,"id":8697389381205360,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637858607},1],
{"sequence":136,"id":8697374208897843,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637405129},1],
{"sequence":189,"id":8697413135394406,"trackingInfo":{"row":0,"trackId":14272744,"requestId":"284929d9-6147-4924-a19f-4a308730354c-3348447","rank":0,"videoId":80075830,"location":"PostPlay\/Next"},"type":["Play","Action","Session"],"time":527638558756},1],
{"sequence":130,"id":8697373887446384,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637394083}]
I would like to convert each record to a row in a Spark DataFrame. The nested fields in trackingInfo should be their own columns, and the type list should also be its own column.
So far I've tried to split it using a case class:
case class Event(
sequence: String,
id: String,
trackingInfo:String,
location:String,
row:String,
trackId: String,
listrequestId: String,
videoId:String,
rank: String,
requestId: String,
`type`:String,
time: String)
val dataframeRdd = myRdd.map(line => line.split(",")).
map(array => Event(
array(0).split(":")(1),
array(1).split(":")(1),
array(2).split(":")(1),
array(3).split(":")(1),
array(4).split(":")(1),
array(5).split(":")(1),
array(6).split(":")(1),
array(7).split(":")(1),
array(8).split(":")(1),
array(9).split(":")(1),
array(10).split(":")(1),
array(11).split(":")(1)
))
However, I keep getting java.lang.ArrayIndexOutOfBoundsException: 1 errors.
What is the best way to do this? As you can see, record number 5 has a slightly different ordering of some attributes. Is it possible to parse based on attribute names instead of splitting on "," etc.?
I'm using Spark 1.6.x
The records in your RDD are not valid JSON. You need to convert them to valid JSON first:
val validJsonRdd = myRdd.map(x => x.replace(",1],", ",").replace("}]", "}"))
Then you can use the sqlContext to read the RDD of valid JSON strings into a DataFrame:
val df = sqlContext.read.json(validJsonRdd)
which should give you the following DataFrame (I used the invalid JSON you provided in the question):
+----------------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------+
|id |sequence|time |trackingInfo |type |
+----------------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------+
|8697344444103393|89 |527636408955|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
|8697389197662617|153 |527637852762|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
|8697389381205360|155 |527637858607|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
|8697374208897843|136 |527637405129|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
|8697413135394406|189 |527638558756|[null,PostPlay/Next,0,284929d9-6147-4924-a19f-4a308730354c-3348447,0,14272744,80075830] |[Play, Action, Session]|
|8697373887446384|130 |527637394083|[cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585,Browse,0,ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171,0,14170286,80000778]|[Play, Action, Session]|
+----------------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------------------+
and the schema for the dataframe is
root
|-- id: long (nullable = true)
|-- sequence: long (nullable = true)
|-- time: long (nullable = true)
|-- trackingInfo: struct (nullable = true)
| |-- listId: string (nullable = true)
| |-- location: string (nullable = true)
| |-- rank: long (nullable = true)
| |-- requestId: string (nullable = true)
| |-- row: long (nullable = true)
| |-- trackId: long (nullable = true)
| |-- videoId: long (nullable = true)
|-- type: array (nullable = true)
| |-- element: string (containsNull = true)
I hope the answer is helpful
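The question also asks for the nested trackingInfo fields as their own columns. Given a DataFrame with the schema printed above, one possible sketch selects each nested field by its dotted path. The object name FlattenTrackingInfoSketch is made up for illustration, and the field list is copied from the printed schema; the import shown is for the DataFrame API and the rest is plain select calls.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper object; the name is illustrative only.
object FlattenTrackingInfoSketch {

  // Nested field names, taken from the printed schema of trackingInfo.
  private val trackingFields =
    Seq("listId", "location", "rank", "requestId", "row", "trackId", "videoId")

  // Keep the top-level columns and lift every trackingInfo field to the top.
  def flatten(df: DataFrame): DataFrame = {
    val topLevel = Seq("id", "sequence", "time", "type").map(c => col(c))
    val nested = trackingFields.map(f => col(s"trackingInfo.$f").as(f))
    df.select(topLevel ++ nested: _*)
  }
}
```

Records that lack a field, such as the fifth one without listId, simply get null in that column.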
If your RDD holds (key, json) pairs, you can use sqlContext.read.json(myRdd.map(_._2)) to read the JSON into a DataFrame.
I have as input a set of files formatted as a single JSON object per line. The problem, however, is that one field on these JSON objects is a JSON-escaped String. Example
{"clientAttributes":{"backfillId":null,"clientPrimaryKey":"abc"},"escapedJsonPayload":"{\"name\":\"Akash\",\"surname\":\"Patel\",\"items\":[{\"itemId\":\"abc\",\"itemName\":\"xyz\"}"}
When I create a DataFrame by reading the JSON file, I get the following:
val df = spark.sqlContext.read.json("file:///home/akaspate/sample.json")
df: org.apache.spark.sql.DataFrame = [clientAttributes: struct<backfillId: string, clientPrimaryKey: string>, escapedJsonPayload: string]
As we can see, "escapedJsonPayload" is a String and I need it to be a Struct.
Note: I found a similar question on StackOverflow and followed it (How to let Spark parse a JSON-escaped String field as a JSON Object to infer the proper structure in DataFrames?), but it gives me "[_corrupt_record: string]".
I have tried the steps below:
val df = spark.sqlContext.read.json("file:///home/akaspate/sample.json") (Work file)
val escapedJsons: RDD[String] = sc.parallelize(Seq("""df"""))
val unescapedJsons: RDD[String] = escapedJsons.map(_.replace("\"{", "{").replace("\"}", "}").replace("\\\"", "\""))
val dfJsons: DataFrame = spark.sqlContext.read.json(unescapedJsons) (This results in [_corrupt_record: string])
Any help would be appreciated
First of all, the JSON you have provided is syntactically wrong. The corrected JSON is as follows:
{"clientAttributes":{"backfillId":null,"clientPrimaryKey":"abc"},"escapedJsonPayload":{\"name\":\"Akash\",\"surname\":\"Patel\",\"items\":[{\"itemId\":\"abc\",\"itemName\":\"xyz\"}]}}
Next, to parse the corrected JSON above, you can use the following code:
val rdd = spark.read.textFile("file:///home/akaspate/sample.json").toJSON.map(value => value.replace("\\", "").replace("{\"value\":\"", "").replace("}\"}", "}")).rdd
val df = spark.read.json(rdd)
The code above gives the following output:
df.show(false)
+----------------+-------------------------------------+
|clientAttributes|escapedJsonPayload |
+----------------+-------------------------------------+
|[null,abc] |[WrappedArray([abc,xyz]),Akash,Patel]|
+----------------+-------------------------------------+
With following schema:
df.printSchema
root
|-- clientAttributes: struct (nullable = true)
| |-- backfillId: string (nullable = true)
| |-- clientPrimaryKey: string (nullable = true)
|-- escapedJsonPayload: struct (nullable = true)
| |-- items: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- itemId: string (nullable = true)
| | | |-- itemName: string (nullable = true)
| |-- name: string (nullable = true)
| |-- surname: string (nullable = true)
I hope this helps !
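As an alternative to string replacement, the payload column can be parsed in place with from_json when its structure is known up front. This is only a sketch under several assumptions: Spark 2.1 or later, a file where escapedJsonPayload is a properly quoted, escaped, and syntactically complete JSON string, and a payload schema written by hand from the printed schema above. The object and value names (EscapedPayloadSketch, itemSchema, payloadSchema) are made up for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

// Hypothetical helper object; the names are illustrative only.
object EscapedPayloadSketch {

  // Schema of one element of the "items" array (assumed from the sample).
  val itemSchema: StructType = StructType(Seq(
    StructField("itemId", StringType),
    StructField("itemName", StringType)))

  // Schema of the whole escaped payload (assumed from the sample).
  val payloadSchema: StructType = StructType(Seq(
    StructField("name", StringType),
    StructField("surname", StringType),
    StructField("items", ArrayType(itemSchema))))

  // Replace the string column with a struct parsed from its JSON content.
  // Rows whose payload is not valid JSON come out as null in that column.
  def parsePayload(df: DataFrame): DataFrame =
    df.withColumn(
      "escapedJsonPayload",
      from_json(col("escapedJsonPayload"), payloadSchema))
}
```

The trade-off is that the schema is no longer inferred, but the rest of the record is never rewritten, so backslashes or quotes inside other fields are left alone.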