Unable to convert an RDD[Row] to a DataFrame - scala

For the following code - in which a DataFrame is converted to RDD[Row] and data for a new column is appended via mapPartitions:
// df is a DataFrame
val dfRdd = df.rdd.mapPartitions {
  val bfMap = df.rdd.sparkContext.broadcast(factorsMap)
  iter =>
    val locMap = bfMap.value
    iter.map { r =>
      val newseq = r.toSeq :+ locMap(r.getAs[String](inColName))
      Row(newseq)
    }
}
The output for the RDD[Row] looks correct, with the extra column appended:
println("**dfrdd\n" + dfRdd.take(5).mkString("\n"))
**dfrdd
[ArrayBuffer(0021BEC286CC, 4, Series, series, bc514da3e0d534da8207e3aab231d1cb, livetv, 148818)]
[ArrayBuffer(0021BEE7C556, 4, Series, series, bc514da3e0d534da8207e3aab231d1cb, livetv, 26908)]
[ArrayBuffer(8C7F3BFD4B82, 4, Series, series, bc514da3e0d534da8207e3aab231d1cb, livetv, 99942)]
[ArrayBuffer(0021BEC8F8B8, 1, Series, series, 0d2debc63efa3790a444c7959249712b, livetv, 53994)]
[ArrayBuffer(10EA59F10C8B, 1, Series, series, 0d2debc63efa3790a444c7959249712b, livetv, 1427)]
Let us try to convert the RDD[Row] back to a DataFrame:
val newSchema = df.schema.add(StructField("userf",IntegerType))
Now let us create the updated DataFrame:
val df2 = df.sqlContext.createDataFrame(dfRdd,newSchema)
Does the new schema look correct?
newSchema.printTreeString()
root
|-- user: string (nullable = true)
|-- score: long (nullable = true)
|-- programType: string (nullable = true)
|-- source: string (nullable = true)
|-- item: string (nullable = true)
|-- playType: string (nullable = true)
|-- userf: integer (nullable = true)
Notice that we do see the new userf column.
However, the conversion back to a DataFrame does not work:
println("df2: " + df2.take(1))
Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times,
most recent failure: Lost task 0.0 in stage 9.0 (TID 9, localhost, executor driver): java.lang.RuntimeException: Error while encoding:
java.lang.RuntimeException: scala.collection.mutable.ArrayBuffer is not a
valid external type for schema of string
if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, user), StringType), true) AS user#28
+- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, user), StringType), true)
:- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt
: :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
: : +- input[0, org.apache.spark.sql.Row, true]
: +- 0
:- null
So: what detail is missing here?
Note: I am not interested in different approaches (e.g. withColumn or Datasets). Let us please consider only this approach:
convert to RDD
add new data element to each row
update the schema for the new column
convert the new RDD+schema back to DataFrame

There seems to be a small mistake in the call to Row's constructor:
val newseq = r.toSeq :+ locMap(r.getAs[String](inColName))
Row(newseq)
The signature of this "constructor" (apply method, actually) is:
def apply(values: Any*): Row
When you pass a Seq[Any], it is treated as a single value of type Seq[Any], so the resulting Row has only one field. You want to pass the elements of this sequence as individual arguments, therefore you should use:
val newseq = r.toSeq :+ locMap(r.getAs[String](inColName))
Row(newseq: _*)
Once this is fixed, the Rows will match the schema you built, and you'll get the expected result.
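Putting the pieces together, a minimal sketch of the corrected pipeline (assuming df, factorsMap and inColName as in the question, and that factorsMap yields the Int values implied by the IntegerType field; the broadcast is created once on the driver, outside mapPartitions):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField}

// broadcast the lookup map once, on the driver
val bfMap = df.rdd.sparkContext.broadcast(factorsMap)

val dfRdd = df.rdd.mapPartitions { iter =>
  val locMap = bfMap.value
  iter.map { r =>
    val newseq = r.toSeq :+ locMap(r.getAs[String](inColName))
    Row(newseq: _*) // expand the Seq so each element becomes its own field
  }
}

val newSchema = df.schema.add(StructField("userf", IntegerType))
val df2 = df.sqlContext.createDataFrame(dfRdd, newSchema)
df2.show(5)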

Related

Getting Error while encoding from reading text file

I have a pipe-delimited file from which I need to strip the first two rows. So I read it into an RDD, exclude the first two rows, and make it into a data frame.
val rdd = spark.sparkContext.textFile("/path/to/file")
val rdd2 = rdd.mapPartitionsWithIndex{ (id_x, iter) => if (id_x == 0) iter.drop(2) else iter }
val rdd3 = rdd2.map(_.split("\\|")).map(x => org.apache.spark.sql.Row(x: _*))
val df = spark.sqlContext.createDataFrame(rdd3,schema)
That all works. I can do show on the dataframe and it still works. However, when I attempt to save this as a parquet file, I get a huge stack trace with errors like:
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString,
validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]),
0, col1), StringType), true, false) AS col1#5637
This repeats for every column in the data frame, as far as I can tell. What am I doing wrong? Is it something to do with the UTF8 bit?
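For reference, a self-contained sketch of the described pipeline (the paths and the two-column schema are placeholders, not from the question). One detail worth checking, offered only as a guess at the cause: split("\\|") drops trailing empty fields, so rows further down the file can end up with fewer values than the schema has columns; show() only encodes the first few rows, but writing parquet encodes everything. Using split("\\|", -1) keeps the trailing fields:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// placeholder schema; the real one comes from the question's context
val schema = StructType(Seq(
  StructField("col1", StringType, true),
  StructField("col2", StringType, true)))

val rdd = spark.sparkContext.textFile("/path/to/file")
val rdd2 = rdd.mapPartitionsWithIndex { (id_x, iter) => if (id_x == 0) iter.drop(2) else iter }
// limit -1 keeps trailing empty fields, so every Row has as many values as the schema
val rdd3 = rdd2.map(_.split("\\|", -1)).map(x => Row(x: _*))
val df = spark.sqlContext.createDataFrame(rdd3, schema)
df.write.parquet("/path/to/output")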

Casting type of columns in a dataframe

My Spark program needs to read a file which contains a matrix of integers. Columns are separated by ",". The number of columns is not the same each time I run the program.
I read the file as a dataframe:
var df = spark.read.csv(originalPath);
but when I print schema it gives me all the columns as Strings.
I convert all columns to integers as below, but afterwards, when I print the schema of df again, the columns are still strings.
df.columns.foreach(x => df.withColumn(x + "_new", df.col(x).cast(IntegerType))
.drop(x).withColumnRenamed(x + "_new", x));
I appreciate any help to solve the issue of casting.
Thanks.
DataFrames are immutable. Your code creates a new DataFrame for each column and then discards it.
It is best to use map and select:
val newDF = df.select(df.columns.map(c => df.col(c).cast("integer")): _*)
but you could also use foldLeft:
df.columns.foldLeft(df)((df, x) => df.withColumn(x , df.col(x).cast("integer")))
or even (please don't) a mutable reference:
var df = Seq(("1", "2", "3")).toDF
df.columns.foreach(x => df = df.withColumn(x , df.col(x).cast("integer")))
Alternatively, since (as you mentioned) the number of columns is not the same each time, you could take the highest possible number of columns and build a schema from it, with IntegerType as the column type. Supply this schema when loading the file so that the DataFrame columns are read as integers directly; no explicit conversion is required in this case.
import org.apache.spark.sql.types._
val csvSchema = StructType(Array(
  StructField("_c0", IntegerType, true),
  StructField("_c1", IntegerType, true),
  StructField("_c2", IntegerType, true),
  StructField("_c3", IntegerType, true)))
val df = spark.read.schema(csvSchema).csv(originalPath)
scala> df.printSchema
root
|-- _c0: integer (nullable = true)
|-- _c1: integer (nullable = true)
|-- _c2: integer (nullable = true)
|-- _c3: integer (nullable = true)
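Since the number of columns is not fixed, the schema itself can be generated instead of written out by hand. A small sketch (maxCols is a hypothetical upper bound on the column count, not from the question):
import org.apache.spark.sql.types._

val maxCols = 10 // hypothetical upper bound on the number of columns
val csvSchema = StructType((0 until maxCols).map(i => StructField(s"_c$i", IntegerType, true)))
val df = spark.read.schema(csvSchema).csv(originalPath)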

select / drop does not really drop the column?

I think I don't understand how select or drop work.
I am exploding a dataset and I don't want some of the columns to be copied to the newly generated entries.
val ds = spark.sparkContext.parallelize(Seq(
  ("2017-01-01 06:15:00", "ASC_a", "1"),
  ("2017-01-01 06:19:00", "start", "2"),
  ("2017-01-01 06:22:00", "ASC_b", "2"),
  ("2017-01-01 06:30:00", "end", "2"),
  ("2017-01-01 10:45:00", "ASC_a", "3"),
  ("2017-01-01 10:50:00", "start", "3"),
  ("2017-01-01 11:22:00", "ASC_c", "4"),
  ("2017-01-01 11:31:00", "end", "5")
)).toDF("timestamp", "status", "msg")
ds.show()
val foo = ds.select($"timestamp", $"msg")
val bar = ds.drop($"status")
foo.printSchema()
bar.printSchema()
println("foo " + foo.where($"status" === "end").count)
println("bar" + bar.where($"status" === "end").count)
Output:
root
|-- timestamp: string (nullable = true)
|-- msg: string (nullable = true)
root
|-- timestamp: string (nullable = true)
|-- msg: string (nullable = true)
foo 2
bar 2
Why do I still get an output of 2 for both, even though I
a) did not select status
b) dropped status
EDIT:
println("foo " + foo.where(foo.col("status") === "end").count) says that there is no column status. Should this not be the same as println("foo " + foo.where($"status" === "end").count)?
Why do I still get an output of 2 for both
Because the optimizer is free to reorganize the execution plan. In fact, if you check it:
== Physical Plan ==
*Project [_1#4 AS timestamp#8, _3#6 AS msg#10]
+- *Filter (isnotnull(_2#5) && (_2#5 = end))
+- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple3, true])._1, true) AS _1#4, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple3, true])._2, true) AS _2#5, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple3, true])._3, true) AS _3#6]
+- Scan ExternalRDDScan[obj#3]
you'll see that the filter is pushed down as early as possible and executed before the projection. So it is equivalent to:
SELECT _1 AS timestamp, _2 AS msg
FROM ds WHERE _2 IS NOT NULL AND _2 = 'end'
Arguably it is a minor bug, and the code should instead be translated as
SELECT * FROM (
    SELECT _1 AS timestamp, _2 AS msg FROM ds
) WHERE _2 IS NOT NULL AND _2 = 'end'
and throw an exception, since _2 is not visible outside the inner projection.
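If the goal is for the filter not to see the dropped column at all, one workaround (a sketch, not part of the original answer) is to filter while the column still exists and drop it afterwards; referencing the column through the projected DataFrame instead, as in the EDIT, fails at analysis time because the column really is gone from its schema:
import spark.implicits._ // for the $ column syntax, as in the question

// filter first, while "status" is still in the schema, then drop it
val barFiltered = ds.where($"status" === "end").drop("status")
barFiltered.show()

// foo.where(foo.col("status") === "end") // fails: foo's schema has no "status" column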

How to merge multiple binary files into parquet in spark?

I have several binary files that I need to merge into one parquet file in Spark. I have tried this but it doesn't work.
val rdd = spark.sparkContext.binaryFiles("/mypath/*").map { case (filePath, content) =>
  Row("file_name" -> filePath, "content" -> content.toArray())
}
val schema = new StructType().add(StructField("file_name", StringType, true)).add(StructField("content", ArrayType(ByteType), true))
spark.createDataFrame(rdd, schema).show(1)
I got this stack trace:
Caused by: java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: scala.Tuple2 is not a valid external type for schema of string
if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, file_name), StringType), true) AS file_name#55
+- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, file_name), StringType), true)
:- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt
: :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
: : +- input[0, org.apache.spark.sql.Row, true]
: +- 0
:- null
+- staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, file_name), StringType), true)
+- validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, file_name), StringType)
+- getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, file_name)
+- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
+- input[0, org.apache.spark.sql.Row, true]
...
Caused by: java.lang.RuntimeException: scala.Tuple2 is not a valid external type for schema of string
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:276)
... 17 more
Like the error says, you declared the column to be a StringType, but the actual data is a tuple:
Row("file_name" -> filePath, "content" -> content.toArray())
Here you create a row of two tuples, but your schema expects a string column and an array column. Your assumption that you need to pair each value with its column name is incorrect; you just pass the data. Column names come from the schema when you apply it.
Use Row(filePath, content.toArray()) instead to match your schema, or alter your schema to accept the tuples.
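A minimal sketch of the corrected version, keeping the schema from the question (the output path is a placeholder); BinaryType would also be a natural alternative type for the raw bytes:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val rdd = spark.sparkContext.binaryFiles("/mypath/*").map { case (filePath, content) =>
  // plain values only; the column names come from the schema
  Row(filePath, content.toArray())
}
val schema = new StructType()
  .add(StructField("file_name", StringType, true))
  .add(StructField("content", ArrayType(ByteType), true))

spark.createDataFrame(rdd, schema).write.parquet("/mypath/output_parquet")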

Spark createDataFrame failing with ArrayOutOfBoundsException

I'm pretty new to Spark and am having a problem converting an RDD to a DataFrame. What I'm trying to do is take a log file, convert it to JSON using an existing jar (which returns a string), and then make that resulting JSON into a dataframe. Here is what I have so far:
val serverLog = sc.textFile("/Users/Downloads/file1.log")
val jsonRows = serverLog.mapPartitions(partition => {
  val txfm = new JsonParser // jar to parse logs to json
  partition.map(line => {
    Row(txfm.parseLine(line))
  })
})
When I run a take(2) on this I get something like:
[{"pwh":"600","sVe":"10.0","psh":"667","udt":"mobile"}]
[{"pwh":"800","sVe":"10.0","psh":"1000","udt":"desktop"}]
My problem comes here. I create a schema and try to create the df:
val schema = StructType(Array(
  StructField("pwh", StringType, true),
  StructField("sVe", StringType, true), ...))
val jsonDf = sqlSession.createDataFrame(jsonRows, schema)
And the returned error is
java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: 1
if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, pwh), StringType), true) AS _pwh#0
+- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, pwh), StringType), true)
:- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt
: :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
: : +- input[0, org.apache.spark.sql.Row, true]
: +- 0
:- null
Can someone tell me what I'm doing wrong here? Most of the SO answers I've found say I can use either createDataFrame or toDF(), but I've had no luck with either. I also tried converting the RDD to a JavaRDD, but that also did not work. Appreciate any insight you can give.
The schema you defined is for an RDD of rows like:
{"pwh":"600","sVe":"10.0","psh":"667","udt":"mobile"}
{"pwh":"800","sVe":"10.0","psh":"1000","udt":"desktop"}
If you can change your RDD so that the data looks like
{"logs": [{"pwh":"600","sVe":"10.0","psh":"667","udt":"mobile"}]}
and use this schema:
val schema = StructType(Seq(
  StructField("logs", ArrayType(StructType(Seq(
    StructField("pwh", StringType, true),
    StructField("sVe", StringType, true), ...))
  ))
))
sqlContext.read.schema(schema).json(jsonRows)
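Note that read.json expects JSON text rather than an RDD[Row]. A sketch of that route using the question's names (serverLog, JsonParser, schema), keeping the parsed lines as plain strings; it assumes Spark 2.2+, where read.json accepts a Dataset[String], and that spark is the SparkSession:
import spark.implicits._ // for .toDS() on the RDD of strings

val jsonStrings = serverLog.mapPartitions { partition =>
  val txfm = new JsonParser
  partition.map(line => txfm.parseLine(line)) // RDD[String], one JSON object per line
}
// let Spark parse the JSON itself, applying the schema while reading
val jsonDf = spark.read.schema(schema).json(jsonStrings.toDS())
jsonDf.show(2)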