I'm trying out the Scala Spark creating dataframes.
But this returns an empty DF Schema and empty DF.
Can someone tell me what is the issue here? Thanks!
val simpleData = Seq(Row("James","","Smith","36636"),
Row("Michael","Rose","","40288"),
Row("Robert","","Williams","42114"),
Row("Maria","Anne","Jones","39192"),
Row("Jen","Mary","Brown","")
)
val simpleSchema = StructType(Array(
StructField("firstname",StringType,true),
StructField("middlename",StringType,true),
StructField("lastname",StringType,true),
StructField("id", StringType, true)
))
val df = spark.createDataFrame(
sc.parallelize(simpleData),simpleSchema)
logger.info(s"df printschema: ${df.printSchema()}")
logger.info(s"df show: ${df.show}")
```
df.printSchema() will print the schema of the dataframe to output and not return a value that would be printed by your logger. You may access the schema as a StructType using df.schema.
Try changing
logger.info(s"df printschema: ${df.printSchema()}")
to
logger.info(s"df printschema: ${df.schema.simpleString}")
or
logger.info(s"df printschema: ${df.schema.json}")
for the json representation.
Let me know if this works for you.
Related
I am trying to come up with a schema definition to parse out information from dataframe string column I am using from_json for that . I need help in defining schema which I am somehow not getting it right.
Here is the Json I have
[
{
"sectionid":"838096e332d4419191877a3fd40ed1f4",
"sequence":0,
"questions":[
{
"xid":"urn:com.mheducation.openlearning:lms.assessment.author:qastg.global:assessment_item:2a0f52fb93954f4590ac88d90888be7b",
"questionid":"d36e1d7eeeae459c8db75c7d2dfd6ac6",
"quizquestionid":"d36e1d7eeeae459c8db75c7d2dfd6ac6",
"qtype":"3",
"sequence":0,
"subsectionsequence":-1,
"type":"80",
"question":"<p>This is a simple, 1 question assessment for automation testing</p>",
"totalpoints":"5.0",
"scoring":"1",
"scoringrules":"{\"type\":\"perfect\",\"points\":5.0,\"pointsEach\":null,\"rules\":[]}",
"inputoption":"0",
"casesensitive":"0",
"suggestedscoring":"1",
"suggestedscoringrules":"{\"type\":\"perfect\",\"points\":5.0,\"pointsEach\":null,\"rules\":[]}",
"answers":[
"1"
],
"options":[
]
}
]
}
]
I want to parse this information out which will result in columns
sectionid , sequence, xid, question.sequence, question.question(question text), answers
Here is what I have I have defined a schema for testing like this
import org.apache.spark.sql.types.{StringType, ArrayType, StructType,
StructField}
val schema = new StructType()
.add("sectionid", StringType, true)
.add("sequence", StringType, true)
.add("questions", StringType, true)
.add("answers", StringType, true)
finalDF = finalDF
.withColumn( "parsed", from_json(col("enriched_payload.transformed"),schema) )
But I am getting NULL in result columns the reason I think is my schema is not right.
I am struggling to come up with right definition . How do I come up with correct json schema definition ?
I am using spark 3.0
Try below code.
import org.apache.spark.sql.types._
val schema = ArrayType(
new StructType()
.add("sectionid",StringType,true)
.add("sequence",LongType,true)
.add("questions", ArrayType(
new StructType()
.add("answers",ArrayType(StringType,true),true)
.add("casesensitive",StringType,true)
.add("inputoption",StringType,true)
.add("options",ArrayType(StringType,true),true)
.add("qtype",StringType,true)
.add("question",StringType,true)
.add("questionid",StringType,true)
.add("quizquestionid",StringType,true)
.add("scoring",StringType,true)
.add("scoringrules",StringType,true)
.add("sequence",LongType,true)
.add("subsectionsequence",LongType,true)
.add("suggestedscoring",StringType,true)
.add("suggestedscoringrules",StringType,true)
.add("totalpoints",StringType,true)
.add("type",StringType,true)
.add("xid",StringType,true)
)
)
)
I am new in spark/scala.
I have a created below RDD by loading data from multiple paths. Now i want to create dataframe from same for further operations.
below should be the schema of dataframe
schema[UserId, EntityId, WebSessionId, ProductId]
rdd.foreach(println)
545456,5615615,DIKFH6545614561456,PR5454564656445454
875643,5485254,JHDSFJD543514KJKJ4
545456,5615615,DIKFH6545614561456,PR5454564656445454
545456,5615615,DIKFH6545614561456,PR5454564656445454
545456,5615615,DIKFH6545614561456,PR54545DSKJD541054
264264,3254564,MNXZCBMNABC5645SAD,PR5142545564542515
732543,8765984,UJHSG4240323545144
564574,6276832,KJDXSGFJFS2545DSAS
Will anyone please help me....!!!
I have tried same by defining schema class and mapping same against rdd but getting error
"ArrayIndexOutOfBoundsException :3"
If you treat your columns as String you can create with the following:
import org.apache.spark.sql.Row
val rdd : RDD[Row] = ???
val df = spark.createDataFrame(rdd, StructType(Seq(
StructField("userId", StringType, false),
StructField("EntityId", StringType, false),
StructField("WebSessionId", StringType, false),
StructField("ProductId", StringType, true))))
Note that you must "map" your RDD to a RDD[Row] for the compiler to allow to use the "createDataFrame" method. For the missing fields you can declare the columns as nullable in the DataFrame Schema.
In your example you are using the RDD method spark.sparkContext.textFile(). This method returns a RDD[String] that means that each element of your RDD is a line. But, you need a RDD[Row]. So you need to split your string by commas like:
val list =
List("545456,5615615,DIKFH6545614561456,PR5454564656445454",
"875643,5485254,JHDSFJD543514KJKJ4",
"545456,5615615,DIKFH6545614561456,PR5454564656445454",
"545456,5615615,DIKFH6545614561456,PR5454564656445454",
"545456,5615615,DIKFH6545614561456,PR54545DSKJD541054",
"264264,3254564,MNXZCBMNABC5645SAD,PR5142545564542515",
"732543,8765984,UJHSG4240323545144","564574,6276832,KJDXSGFJFS2545DSAS")
val FilterReadClicks = spark.sparkContext.parallelize(list)
val rows: RDD[Row] = FilterReadClicks.map(line => line.split(",")).map { arr =>
val array = Row.fromSeq(arr.foldLeft(List[Any]())((a, b) => b :: a))
if(array.length == 4)
array
else Row.fromSeq(array.toSeq.:+(""))
}
rows.foreach(el => println(el.toSeq))
val df = spark.createDataFrame(rows, StructType(Seq(
StructField("userId", StringType, false),
StructField("EntityId", StringType, false),
StructField("WebSessionId", StringType, false),
StructField("ProductId", StringType, true))))
df.show()
+------------------+------------------+------------+---------+
| userId| EntityId|WebSessionId|ProductId|
+------------------+------------------+------------+---------+
|PR5454564656445454|DIKFH6545614561456| 5615615| 545456|
|JHDSFJD543514KJKJ4| 5485254| 875643| |
|PR5454564656445454|DIKFH6545614561456| 5615615| 545456|
|PR5454564656445454|DIKFH6545614561456| 5615615| 545456|
|PR54545DSKJD541054|DIKFH6545614561456| 5615615| 545456|
|PR5142545564542515|MNXZCBMNABC5645SAD| 3254564| 264264|
|UJHSG4240323545144| 8765984| 732543| |
|KJDXSGFJFS2545DSAS| 6276832| 564574| |
+------------------+------------------+------------+---------+
With rows rdd you will be able to create the dataframe.
I am seeing the error Cannot have map type columns in DataFrame which calls set operations when using Spark MapType.
Below is the sample code I wrote to reproduce it. I understand this is happening because the MapType objects are not hashable but I have an use case where I need to do the following.
val schema1 = StructType(Seq(
StructField("a", MapType(StringType, StringType, true)),
StructField("b", StringType, true)
))
val df = spark.read.schema(schema1).json("path")
val filteredDF = df.filter($"b" === "apple")
val otherDF = df.except(filteredDF)
Any suggestions for workarounds?
I have an input file which looks much like csv but with custom header:
FIELDS-START
field1
field2
field3
FIELDS-END-DATA-START
val1,2,3
val2,4,5
DATA-END
Task:
To read data to a typed dataframe, schema is obtained dynamically, example for this specific file:
val schema = StructType(
StructField("field1", StringType, true) ::
StructField("field2", IntegerType, true) ::
StructField("field3", IntegerType, true) :: Nil
)
So bacause of custom header I can't use spark csv reader. Other thing I tried:
val file = spark.sparkContext.textFile(...)
val data: RDD[List[String]] = file.filter(_.contains(",")).map(_.split(',').toList)
val df: DataFrame = spark.sqlContext.createDataFrame(data.map(Row.fromSeq(_)), schema)
It fails with runtime exception
java.lang.String is not a valid external type for schema of int which is because createDataFrame doesn't do any casting.
NOTE: Schema is obtained at runtime
Thanks in advance!
I am trying to read parquet file in dataframe with following code:
val data = spark.read.schema(schema)
.option("dateFormat", "YYYY-MM-dd'T'hh:mm:ss").parquet(<file_path>)
data.show()
Here's the schema:
def schema: StructType = StructType(Array[StructField](
StructField("id", StringType, false),
StructField("text", StringType, false),
StructField("created_date", DateType, false)
))
When I try to execute data.show(), it throws the following exception:
Caused by: java.lang.ClassCastException: [B cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
at org.apache.spark.sql.catalyst.expressions.MutableInt.update(SpecificInternalRow.scala:74)
at org.apache.spark.sql.catalyst.expressions.SpecificInternalRow.update(SpecificInternalRow.scala:240)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.set(ParquetRowConverter.scala:159)
at org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addBinary(ParquetRowConverter.scala:89)
at org.apache.parquet.column.impl.ColumnReaderImpl$2$6.writeValue(ColumnReaderImpl.java:324)
at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:372)
at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:198)
Apparently, it's because of date format and DateType in my schema. If I change DateType to StringType, it works fine and outputs the following:
+--------------------+--------------------+----------------------+
| id| text| created_date|
+--------------------+--------------------+----------------------+
|id..................|text................|2017-01-01T00:08:09Z|
I want to read created_date into DateType, do I need to change anything else?
The following works under Spark 2.1. Note the change of the date format and the usage of TimestampType instead of DateType.
val schema = StructType(Array[StructField](
StructField("id", StringType, false),
StructField("text", StringType, false),
StructField("created_date", TimestampType, false)
))
val data = spark
.read
.schema(schema)
.option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss'Z'")
.parquet("s3a://thisisnotabucket")
In older versions of Spark (I can confirm this works under 1.5.2), you can create a UDF to do the conversion for you in SQL.
def cvtDt(d: String): java.sql.Date = {
val fmt = org.joda.time.format.DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss'Z'")
new java.sql.Date(fmt.parseDateTime(d).getMillis)
}
def cvtTs(d: String): java.sql.Timestamp = {
val fmt = org.joda.time.format.DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss'Z'")
new java.sql.Timestamp(fmt.parseDateTime(d).getMillis)
}
sqlContext.udf.register("mkdate", cvtDt(_: String))
sqlContext.udf.register("mktimestamp", cvtTs(_: String))
sqlContext.read.parquet("s3a://thisisnotabucket").registerTempTable("dttest")
val query = "select *, mkdate(created_date), mktimestamp(created_date) from dttest"
sqlContext.sql(query).collect.foreach(println)
NOTE: I did this in the REPL, so I had to create the DateTimeFormat pattern on every call to the cvt* methods to avoid serialization issues. If you're doing this is an application, I recommend extracting the formatter into an object.
object DtFmt {
val fmt = org.joda.time.format.DateTimeFormat.forPattern("yyyy-MM-dd'Tââ'HH:mm:ss'Z'")
}