I have an input file which looks much like CSV but with a custom header:
FIELDS-START
field1
field2
field3
FIELDS-END-DATA-START
val1,2,3
val2,4,5
DATA-END
Task:
Read the data into a typed DataFrame. The schema is obtained dynamically; for this specific file it would be:
val schema = StructType(
  StructField("field1", StringType, true) ::
  StructField("field2", IntegerType, true) ::
  StructField("field3", IntegerType, true) :: Nil
)
Because of the custom header I can't use the Spark CSV reader. Another thing I tried:
val file = spark.sparkContext.textFile(...)
val data: RDD[List[String]] = file.filter(_.contains(",")).map(_.split(',').toList)
val df: DataFrame = spark.sqlContext.createDataFrame(data.map(Row.fromSeq(_)), schema)
It fails with the runtime exception
java.lang.String is not a valid external type for schema of int
because createDataFrame doesn't do any casting.
NOTE: Schema is obtained at runtime
Thanks in advance!
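One way around the missing cast is to convert each split value to the type its schema field expects before building the Row. The per-field conversion can be sketched in plain Scala, no Spark session needed; the "string"/"int" tags below are a hypothetical stand-in for matching on Spark's StringType/IntegerType.

```scala
// Hypothetical stand-in for matching on Spark's DataType: convert one raw
// string to the value the schema field expects.
def convert(value: String, dataType: String): Any = dataType match {
  case "int" => value.toInt
  case _     => value // strings pass through unchanged
}

// Type tags taken from the dynamically obtained schema for this file
val fieldTypes = List("string", "int", "int")

val typedValues: List[Any] =
  "val1,2,3".split(',').toList.zip(fieldTypes).map { case (v, t) => convert(v, t) }
```

With Spark on the classpath, the same map over `schema.fields` (matching on `f.dataType`) would produce the `Seq[Any]` to pass to `Row.fromSeq` before `createDataFrame`.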
Related
I'm trying out creating DataFrames in Spark with Scala.
But this logs an empty DF schema and an empty DF.
Can someone tell me what the issue is here? Thanks!
val simpleData = Seq(
  Row("James", "", "Smith", "36636"),
  Row("Michael", "Rose", "", "40288"),
  Row("Robert", "", "Williams", "42114"),
  Row("Maria", "Anne", "Jones", "39192"),
  Row("Jen", "Mary", "Brown", "")
)
val simpleSchema = StructType(Array(
  StructField("firstname", StringType, true),
  StructField("middlename", StringType, true),
  StructField("lastname", StringType, true),
  StructField("id", StringType, true)
))
val df = spark.createDataFrame(sc.parallelize(simpleData), simpleSchema)
logger.info(s"df printschema: ${df.printSchema()}")
logger.info(s"df show: ${df.show}")
df.printSchema() prints the schema of the dataframe to standard output and returns Unit, so there is no value for your logger to print. You can access the schema as a StructType via df.schema.
Try changing
logger.info(s"df printschema: ${df.printSchema()}")
to
logger.info(s"df printschema: ${df.schema.simpleString}")
or
logger.info(s"df printschema: ${df.schema.json}")
for the json representation.
Let me know if this works for you.
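The underlying behavior is easy to reproduce without Spark: interpolating any `Unit`-returning call yields the string `()`, which is what ends up in the log. A minimal sketch, with a hypothetical stand-in for `printSchema`:

```scala
// Stand-in for df.printSchema(): prints as a side effect, returns Unit
def printSchemaLike(): Unit = println("root")

// The schema text goes to stdout; the interpolated value is Unit, i.e. "()"
val logged = s"df printschema: ${printSchemaLike()}"
```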
I am trying to save a partitioned parquet Spark dataframe to a temporary directory for unit tests; however, for some reason the partitions are not created. The data itself is saved into the directory and can be used for tests.
Here is the method I have created for that:
def saveParquet(df: DataFrame, partitions: String*): String = {
  val path = createTempDir()
  df.repartition(1).write.partitionBy(partitions: _*).parquet(path)
  path
}
val feedPath: String = saveParquet(feedDF.select(feed.schema), "processing_time")
This method works for various dataframes with various schemas, but for some reason it does not generate partitions for this one. I have logged the resulting path and it looks like this:
/var/folders/xg/fur_diuhg83b2ba15ih2rt822000dhst/T/testutils-samples8512758291/jf81n7bsj-95hs-573n-b73h-7531ug04515
But it should look like this:
/var/folders/xg/fur_diuhg83b2ba15ih2rt822000dhst/T/testutils-samples8512758291/jf81n7bsj-95hs-573n-b73h-7531ug04515/processing_time=1591714800000/part-some-random-numbersnappy.parquet
I have checked that the data and all the columns are read just fine before partitioning; the problem occurs as soon as the partitioning call is added. Also, a regex I ran over the output directories failed with a match error on the test samples: ".*processing_time=([0-9]+)/.*parquet".r
So what could be the reason for this problem? How else can I partition the dataframe?
Dataframe schema looks like this:
val schema: StructType = StructType(
  Seq(
    StructField("field1", StringType),
    StructField("field2", LongType),
    StructField("field3", StringType),
    StructField("field4Id", IntegerType, nullable = true),
    StructField("field4", FloatType, nullable = true),
    StructField("field5Id", IntegerType, nullable = true),
    StructField("field5", FloatType, nullable = true),
    StructField("field6Id", IntegerType, nullable = true),
    StructField("field6", FloatType, nullable = true),
    // partition key
    StructField("processing_time", LongType)
  )
)
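The regex from the question can be sanity-checked in plain Scala against the two path shapes (the paths below are hypothetical stand-ins for the real temp directories): it only matches when the `processing_time=` directory exists, which is consistent with the partition directories simply not being written.

```scala
val pattern = ".*processing_time=([0-9]+)/.*parquet".r

// Hypothetical paths mirroring the expected and the actual layout
val withPartition    = "/tmp/samples/abc/processing_time=1591714800000/part-0.snappy.parquet"
val withoutPartition = "/tmp/samples/abc/part-0.snappy.parquet"

val matched   = pattern.findFirstMatchIn(withPartition).map(_.group(1))
val unmatched = pattern.findFirstMatchIn(withoutPartition)
```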
I am seeing the error Cannot have map type columns in DataFrame which calls set operations when using Spark's MapType.
Below is sample code I wrote to reproduce it. I understand this happens because MapType columns are not hashable, but I have a use case where I need to do the following.
val schema1 = StructType(Seq(
  StructField("a", MapType(StringType, StringType, true)),
  StructField("b", StringType, true)
))
val df = spark.read.schema(schema1).json("path")
val filteredDF = df.filter($"b" === "apple")
val otherDF = df.except(filteredDF)
Any suggestions for workarounds?
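One commonly suggested direction (an assumption, not verified against your Spark version) is to substitute a comparable, canonical representation for the map column before the set operation, for example via `to_json`, run `except` on that, and then join the map column back. The canonical-form idea itself, sketched in plain Scala:

```scala
// Sketch of the canonical-form idea: render a map deterministically as a
// string so that two rows with equal maps compare equal.
def canonical(m: Map[String, String]): String =
  m.toSeq.sortBy(_._1).map { case (k, v) => s"$k=$v" }.mkString(",")
```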
I have created a schema with following code
val schema= new StructType().add("city", StringType, true).add("female", IntegerType, true).add("male", IntegerType, true)
Created an RDD from
val data = spark.sparkContext.textFile("cities.txt")
Converted it to an RDD of Row to apply the schema
val cities = data.map(line => line.split(";")).map(row => Row.fromSeq(row.zip(schema.toSeq)))
val citiesRDD = spark.sqlContext.createDataFrame(cities, schema)
This gives me an error
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: scala.Tuple2 is not a valid external type for schema of string
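The mismatch can be reproduced in plain Scala without Spark: zipping the split values with the schema fields yields `(value, field)` tuples rather than plain values, which is exactly why Spark reports `scala.Tuple2` where it expected a string.

```scala
// Stand-in for schema.toSeq: any sequence shows the shape of the result
val fieldNames = Seq("city", "female", "male")

// Each element is a Tuple2, not a String — Row.fromSeq receives tuples
val zipped = "Bern;10;12".split(";").toSeq.zip(fieldNames)
```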
You don't need a schema to create a Row; you need the schema when you create the DataFrame. You also need some logic for converting your split line (which produces 3 strings) into integers.
Here is a minimal solution without exception handling:
val data = sc.parallelize(Seq("Bern;10;12")) // mock for real data
val schema = new StructType().add("city", StringType, true).add("female", IntegerType, true).add("male", IntegerType, true)
val cities = data.map { line =>
  val Array(city, female, male) = line.split(";")
  Row(
    city,
    female.toInt,
    male.toInt
  )
}
val citiesDF = sqlContext.createDataFrame(cities, schema)
I normally use case classes to create a dataframe, because Spark can infer the schema from the case class:
// "schema" for the dataframe, define outside of the main method
case class MyRow(city: Option[String], female: Option[Int], male: Option[Int])

val data = sc.parallelize(Seq("Bern;10;12")) // mock for real data
import sqlContext.implicits._
val citiesDF = data.map { line =>
  val Array(city, female, male) = line.split(";")
  MyRow(
    Some(city),
    Some(female.toInt),
    Some(male.toInt)
  )
}.toDF()
I am trying to read a parquet file into a dataframe with the following code:
val data = spark.read.schema(schema)
.option("dateFormat", "YYYY-MM-dd'T'hh:mm:ss").parquet(<file_path>)
data.show()
Here's the schema:
def schema: StructType = StructType(Array[StructField](
  StructField("id", StringType, false),
  StructField("text", StringType, false),
  StructField("created_date", DateType, false)
))
When I try to execute data.show(), it throws the following exception:
Caused by: java.lang.ClassCastException: [B cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
at org.apache.spark.sql.catalyst.expressions.MutableInt.update(SpecificInternalRow.scala:74)
at org.apache.spark.sql.catalyst.expressions.SpecificInternalRow.update(SpecificInternalRow.scala:240)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.set(ParquetRowConverter.scala:159)
at org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addBinary(ParquetRowConverter.scala:89)
at org.apache.parquet.column.impl.ColumnReaderImpl$2$6.writeValue(ColumnReaderImpl.java:324)
at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:372)
at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:198)
Apparently, it's because of date format and DateType in my schema. If I change DateType to StringType, it works fine and outputs the following:
+--------------------+--------------------+----------------------+
| id| text| created_date|
+--------------------+--------------------+----------------------+
|id..................|text................|2017-01-01T00:08:09Z|
I want to read created_date into DateType, do I need to change anything else?
The following works under Spark 2.1. Note the change of the date format and the usage of TimestampType instead of DateType.
val schema = StructType(Array[StructField](
  StructField("id", StringType, false),
  StructField("text", StringType, false),
  StructField("created_date", TimestampType, false)
))
val data = spark
.read
.schema(schema)
.option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss'Z'")
.parquet("s3a://thisisnotabucket")
In older versions of Spark (I can confirm this works under 1.5.2), you can create a UDF to do the conversion for you in SQL.
def cvtDt(d: String): java.sql.Date = {
  val fmt = org.joda.time.format.DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss'Z'")
  new java.sql.Date(fmt.parseDateTime(d).getMillis)
}

def cvtTs(d: String): java.sql.Timestamp = {
  val fmt = org.joda.time.format.DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss'Z'")
  new java.sql.Timestamp(fmt.parseDateTime(d).getMillis)
}
sqlContext.udf.register("mkdate", cvtDt(_: String))
sqlContext.udf.register("mktimestamp", cvtTs(_: String))
sqlContext.read.parquet("s3a://thisisnotabucket").registerTempTable("dttest")
val query = "select *, mkdate(created_date), mktimestamp(created_date) from dttest"
sqlContext.sql(query).collect.foreach(println)
NOTE: I did this in the REPL, so I had to create the DateTimeFormat pattern on every call to the cvt* methods to avoid serialization issues. If you're doing this in an application, I recommend extracting the formatter into an object.
object DtFmt {
  val fmt = org.joda.time.format.DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss'Z'")
}
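On Java 8+ the joda-time dependency can be avoided entirely; a sketch of the same conversion using the JDK's `java.time` (the date literal is just an example value):

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// DateTimeFormatter is immutable and thread-safe, so unlike the REPL
// workaround above it can safely live in a shared object.
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss'Z'")
val ts  = java.sql.Timestamp.valueOf(LocalDateTime.parse("2017-01-01T00:08:09Z", fmt))
```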