How to merge multiple binary files into parquet in spark?


I have several binary files that I need to merge into a single Parquet file in Spark. I tried the following, but it doesn't work:
val rdd = spark.sparkContext.binaryFiles("/mypath/*").map { case (filePath, content) =>
  Row("file_name" -> filePath, "content" -> content.toArray())
}
val schema = new StructType()
  .add(StructField("file_name", StringType, true))
  .add(StructField("content", ArrayType(ByteType), true))
spark.createDataFrame(rdd, schema).show(1)
I got this stack trace:
Caused by: java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: scala.Tuple2 is not a valid external type for schema of string
if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, file_name), StringType), true) AS file_name#55
+- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, file_name), StringType), true)
:- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt
: :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
: : +- input[0, org.apache.spark.sql.Row, true]
: +- 0
:- null
+- staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, file_name), StringType), true)
+- validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, file_name), StringType)
+- getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, file_name)
+- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
+- input[0, org.apache.spark.sql.Row, true]
...
Caused by: java.lang.RuntimeException: scala.Tuple2 is not a valid external type for schema of string
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:276)
... 17 more

As the error says, you declared the column as StringType, but the actual data is a tuple:
Row("file_name" -> filePath, "content" -> content.toArray())
Here you create a row of two tuples, but your schema describes a string column and an array column. You do not need to pair each value with its column name; pass only the data. The column names come from the schema when you apply it.
Use Row(filePath, content.toArray()) instead to match your schema, or alter your schema to accept the tuples.
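To see why the encoder reports scala.Tuple2, it can help to look at what `->` produces in plain Scala. The sketch below uses no Spark at all: `filePath` and `content` are made-up stand-ins for one pair yielded by binaryFiles, and a Seq[Any] merely mimics the cells a Row would hold.

```scala
object ArrowMakesTuple {
  // Hypothetical stand-ins for one (path, bytes) pair from binaryFiles.
  val filePath: String = "/mypath/part-0.bin"
  val content: Array[Byte] = Array[Byte](1, 2, 3)

  // What the question builds: each cell is a (name, value) Tuple2.
  val tupleCells: Seq[Any] = Seq("file_name" -> filePath, "content" -> content)

  // What the schema describes: the bare values, in schema order.
  val valueCells: Seq[Any] = Seq(filePath, content)

  // Runtime class name of each cell, i.e. what gets compared with the schema.
  def cellTypes(cells: Seq[Any]): Seq[String] =
    cells.map(_.getClass.getSimpleName)
}
```

With the bare values, position 0 holds a String rather than a Tuple2, which is what a StringType column expects.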

Related

What is Error: java.util.Collections$UnmodifiableRandomAccessList is not a valid

EDIT: Still no fix, but I know which value returned from the Cypher query is causing the error:
[43.4171031, 37.5049815, 43.4171031, 37.5049815]
It comes from a bbox returned by a spatial query:
val query = "call blah blah yield node return node.fromDate as fromDate, node.bbox as bbox ORDER BY node.toDateFormatLong DESC";
It does NOT like the "return node.bbox as bbox" part; I have to take that out for the query to work.
If I do, I get the data frame back. If I don't, I get the error:
defined class DateLayerData
defined class ChangeScoreObj
changeScoreMap: java.util.Map[Integer,ChangeScoreObj] = {}
doCalculation: (lat1: BigDecimal, lon1: BigDecimal, lat2: BigDecimal, lon2: BigDecimal, radius: Double)Unit
main: (args: Array[String])Unit
MinLat: 34.6 minlon 40.9 maxlat: 34.7 maxlon: 41
java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1370)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.first(RDD.scala:1367)
at org.neo4j.spark.Neo4j.loadDataFrame(Neo4j.scala:280)
at doCalculation(<console>:114)
at main(<console>:89)
... 81 elided
So guessing I am using Option wrong?
val initialDf2 = neo.cypher(query).loadDataFrame //this seems to fail on empty collection
initialDf2.take(1).headOption.map(_.getString(1)).foreach(println)
The line above is run inside a loop, as part of a doCalculation function that is called repeatedly with different values.
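One possible way to treat "no rows" as an ordinary outcome is sketched below, under loud assumptions: `load` is a hypothetical stand-in for whatever fetches the rows (in this post, neo.cypher(query).loadDataFrame followed by a collect), and each row is modelled as a plain Seq[Any] so the sketch runs without Spark or Neo4j. It only illustrates the Option pattern and does not address the bbox encoding error.

```scala
object EmptyResultHandling {
  // `load` is a hypothetical stand-in for the call that fetches rows.
  // The "empty collection" failure in the trace above is an
  // UnsupportedOperationException, so only that type is caught here.
  def secondField(load: () => Seq[Seq[Any]]): Option[String] = {
    val rows =
      try load()
      catch { case _: UnsupportedOperationException => Seq.empty[Seq[Any]] }
    // headOption is None for an empty result, so nothing downstream throws.
    rows.headOption.map(row => String.valueOf(row(1)))
  }
}
```

A caller inside the loop could then write `secondField(...).foreach(println)` and simply move on when the result is None.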
I am trying to load a data frame whose rows look like this inside Neo4j:
-1 "Detected" 1 20161104 3318 37.5049815 43.4171031 20161023 "filename.val" 9.2 "23OCT16" [43.4171031, 37.5049815, 43.4171031, 37.5049815]
So I make the query call to get one of many rows that looks like the above one:
try {
  val initialDf2 = neo.cypher(query).loadDataFrame
  val someVal = initialDf2.collectAsList()
  val detectt = someVal.get(0).getString(1) // try to get the second field
  println(detectt)
} catch {
  case e: Exception => e.printStackTrace
}
I do have a try catch because sometimes the query sent to the cypher returns nothing (I have no idea how else to handle that)
17/09/18 08:44:48 ERROR TaskSetManager: Task 0 in stage 298.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 298.0 failed 1 times, most recent failure: Lost task 0.0 in stage 298.0 (TID 298, localhost, executor driver): java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.util.Collections$UnmodifiableRandomAccessList is not a valid external type for schema of string
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, altitude), DoubleType) AS altitude#1678
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 1, detect_type), StringType), true) AS detect_type#1679
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 2, gtype), LongType) AS gtype#1680L
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 3, toDateFormatLong), LongType) AS toDateFormatLong#1681L
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 4, change_area), LongType) AS change_area#1682L
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 5, latitude), DoubleType) AS latitude#1683
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 6, longitude), DoubleType) AS longitude#1684
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 7, fromDateFormatLong), LongType) AS fromDateFormatLong#1685L
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 8, iids), StringType), true) AS iids#1686
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 9, detect_strength), DoubleType) AS detect_strength#1687
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 10, fromDate), StringType), true) AS fromDate#1688
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 11, bbox), StringType), true) AS bbox#1689
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:573)

Spark/Neo4j throws error: RuntimeException: java.util.Collections$UnmodifiableRandomAccessList is not a valid external type for schema of string

The exact query:
call spatial.bbox('geom', {lat:37.5,lon:43.4}, {lat:37.6,lon:43.5}) yield node return node.altitude as altitude, node.detect_type as detect_type, node.gtype as gtype, node.toDateFormatLong as toDateFormatLong, node.change_area as change_area, node.latitude as latitude, node.longitude as longitude, node.fromDateFormatLong as fromDateFormatLong, node.iids as iids, node.detect_strength as detect_strength, node.fromDate as fromDate, node.bbox as bbox ORDER BY node.toDateFormatLong DESC
Example data set:
╒══════════╤═════════════╤═══════╤══════════════════╤═════════════╤══════════╤═══════════╤════════════════════╤═════════════════════════════════════════════════════════════════════╤═════════════════╤══════════╤═════════════════════════════════════════════╕
│"altitude"│"detect_type"│"gtype"│"toDateFormatLong"│"change_area"│"latitude"│"longitude"│"fromDateFormatLong"│"iids" │"detect_strength"│"fromDate"│"bbox" │
╞══════════╪═════════════╪═══════╪══════════════════╪═════════════╪══════════╪═══════════╪════════════════════╪═════════════════════════════════════════════════════════════════════╪═════════════════╪══════════╪═════════════════════════════════════════════╡
│-1 │"Arrival" │1 │20161104 │16981 │37.5608649│43.4297988 │20161023 │"23OCT16S1A89377_09_IW1_09_pp_1231_04NOV16S1A90776_09_123_31_TT_QQQQ"│7.2 │"23OCT16" │[43.4297988,37.5608649,43.4297988,37.5608649]│
├──────────┼─────────────┼───────┼──────────────────┼─────────────┼──────────┼───────────┼────────────────────┼─────────────────────────────────────────────────────────────────────┼─────────────────┼──────────┼─────────────────────────────────────────────┤
│-1 │"Arrival" │1 │20161104 │3123 │37.56749 │43.4807208 │20161023 │"23OCT16S1A89377_09_IW1_09_pp_1231_04NOV16S1A90776_09_124_32_TT_QQQQ"│7.5 │"23OCT16" │[43.4807208,37.56749,43.4807208,37.56749] │
├──────────┼─────────────┼───────┼──────────────────┼─────────────┼──────────┼───────────┼────────────────────┼─────────────────────────────────────────────────────────────────────┼─────────────────┼──────────┼─────────────────────────────────────────────┤
I run the query like this:
try {
  val initialDf2 = neo.cypher(query).loadDataFrame
  val someVal = initialDf2.collectAsList()
} catch {
  case e: Exception => e.printStackTrace
}
I get this error:
17/09/18 08:44:48 ERROR TaskSetManager: Task 0 in stage 298.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 298.0 failed 1 times, most recent failure: Lost task 0.0 in stage 298.0 (TID 298, localhost, executor driver): java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.util.Collections$UnmodifiableRandomAccessList is not a valid external type for schema of string
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, altitude), DoubleType) AS altitude#1678
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 1, detect_type), StringType), true) AS detect_type#1679
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 2, gtype), LongType) AS gtype#1680L
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 3, toDateFormatLong), LongType) AS toDateFormatLong#1681L
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 4, change_area), LongType) AS change_area#1682L
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 5, latitude), DoubleType) AS latitude#1683
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 6, longitude), DoubleType) AS longitude#1684
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 7, fromDateFormatLong), LongType) AS fromDateFormatLong#1685L
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 8, iids), StringType), true) AS iids#1686
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 9, detect_strength), DoubleType) AS detect_strength#1687
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 10, fromDate), StringType), true) AS fromDate#1688
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 11, bbox), StringType), true) AS bbox#1689
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:573)
Data comes back if I don't include the bbox.
In the Neo4j browser, I can run the problem query and the results come back:
-1 "Detected" 1 20161104 3318 37.5049815 43.4171031 20161023 "filename.val" 9.2 "23OCT16" [43.4171031, 37.5049815, 43.4171031, 37.5049815]
The problem is that nested list. I might have to return node.bbox.somevalue1 as bbbox1, but I have no idea what the exact syntax would be.
I think this is similar to an issue I was having before:
Neo4j spark connector loadDataFrame gives error
and solved by:
https://github.com/neo4j-contrib/neo4j-spark-connector/issues/40
It just seems like it wants more from what I am returning.

Array properties are not yet supported by the connector, according to the maintainers (see https://neo4j-users.slack.com/archives/C0N7LHVS9/p1534429756000100).
There are two workarounds:
Use UNWIND in the query, then collect the values back into a list on the Spark side.
Convert the array to a string using a REDUCE operation.
I prefer the second approach, but I am not sure how well it scales with big data.
So your query would look something like:
RETURN REDUCE(s = HEAD(bbox), n IN TAIL(bbox) | s + ', ' + n) AS bbox,
And the code that processes it:
neo4j.cypher(QUERY).as[String].map(bbox => bbox.split(", "))
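Once the bbox arrives as one comma-separated string, it still has to be turned back into numbers. A minimal plain-Scala sketch of that step follows; `parseBbox` is a hypothetical helper name, not part of the connector, and it assumes every piece is a valid number.

```scala
object BboxParsing {
  // Splits a string such as "43.4171031, 37.5049815" on commas and parses
  // each piece. Throws NumberFormatException if a piece is not a number
  // (including the case of an empty input string).
  def parseBbox(joined: String): Vector[Double] =
    joined.split(",").map(_.trim.toDouble).toVector
}
```

Splitting on "," and trimming, rather than splitting on ", ", keeps the helper tolerant of a separator written with or without the space.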

With loadDataFrame, you need to declare the DataFrame schema as (fieldName, fieldType) pairs, like this:
val rawGraphnode=neo.cypher("MATCH (n:person)where (n.duration <>0) RETURN n.user as user,n.other as other,n.direction as direction,n.duration as duration,n.timestamp as timestamp")
.loadDataFrame(schema = ("user","object"),("other","object"),("direction","string"),("duration","String"),("timestamp","String"))
rawGraphnode.printSchema()
rawGraphnode.show(10)

Spark createDataFrame failing with ArrayOutOfBoundsException

I'm pretty new to Spark and am having a problem converting an RDD to a DataFrame. What I'm trying to do is take a log file, convert it to JSON using an existing jar (returns a string), and then make that resulting json into a dataframe. Here is what I have so far:
val serverLog = sc.textFile("/Users/Downloads/file1.log")
val jsonRows = serverLog.mapPartitions(partition => {
  val txfm = new JsonParser //*jar to parse logs to json*//
  partition.map(line => {
    Row(txfm.parseLine(line))
  })
})
When I run a take(2) on this I get something like:
[{"pwh":"600","sVe":"10.0","psh":"667","udt":"mobile"}]
[{"pwh":"800","sVe":"10.0","psh":"1000","udt":"desktop"}]
My problem comes here: I create a schema and try to create the DataFrame:
val schema = StructType(Array(
  StructField("pwh",StringType,true),
  StructField("sVe",StringType,true),...))
val jsonDf = sqlSession.createDataFrame(jsonRows, schema)
And the returned error is
java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: 1
if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, pwh), StringType), true) AS _pwh#0
+- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, pwh), StringType), true)
:- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt
: :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
: : +- input[0, org.apache.spark.sql.Row, true]
: +- 0
:- null
Can someone tell me what I'm doing wrong here? Most of the SO answers I've found say I can use either createDataFrame or toDF(), but I've had no luck with either. I also tried converting the RDD to a JavaRDD, but that also did not work. Appreciate any insight you can give.

The schema you defined matches an RDD whose records look like this:
{"pwh":"600","sVe":"10.0","psh":"667","udt":"mobile"}
{"pwh":"800","sVe":"10.0","psh":"1000","udt":"desktop"}
If you can change your RDD so that each record looks like this:
{"logs": [{"pwh":"600","sVe":"10.0","psh":"667","udt":"mobile"}]}
then use this schema:
val schema = StructType(Seq(
  StructField("logs", ArrayType(StructType(Seq(
    StructField("pwh", StringType, true),
    StructField("sVe", StringType, true), ...))
  ))
))
sqlContext.read.schema(schema).json(jsonRows)
Note that json(...) takes an RDD[String] of raw JSON text, so jsonRows should hold the strings returned by parseLine directly rather than Row-wrapped values.
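A likely reading of the ArrayIndexOutOfBoundsException: 1 in the question is that each Row holds a single value (the whole JSON string) while the schema lists several fields, so the lookup at index 1 finds nothing. The plain-Scala sketch below models that mismatch without Spark; `missingFields` is a hypothetical helper, a row is modelled as a Seq[Any], and only the two field names spelled out in the question are used.

```scala
object RowWidthCheck {
  // Returns the schema fields for which the row has no value at that index.
  def missingFields(row: Seq[Any], fieldNames: Seq[String]): Seq[String] =
    fieldNames.drop(row.length)

  // One cell holding the entire JSON text, as Row(txfm.parseLine(line)) builds.
  val row: Seq[Any] =
    Seq("""[{"pwh":"600","sVe":"10.0","psh":"667","udt":"mobile"}]""")

  // The first two fields of the schema from the question.
  val fieldNames: Seq[String] = Seq("pwh", "sVe")
}
```

Here `missingFields(row, fieldNames)` reports "sVe", the field at index 1, which is the index named in the exception.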

Unable to convert an RDD[Row] to a DataFrame

For the following code - in which a DataFrame is converted to RDD[Row] and data for a new column is appended via mapPartitions:
// df is a DataFrame
val dfRdd = df.rdd.mapPartitions {
  val bfMap = df.rdd.sparkContext.broadcast(factorsMap)
  iter =>
    val locMap = bfMap.value
    iter.map { r =>
      val newseq = r.toSeq :+ locMap(r.getAs[String](inColName))
      Row(newseq)
    }
}
The output is correct for the RDD[Row] with another column:
println("**dfrdd\n" + dfRdd.take(5).mkString("\n"))
**dfrdd
[ArrayBuffer(0021BEC286CC, 4, Series, series, bc514da3e0d534da8207e3aab231d1cb, livetv, 148818)]
[ArrayBuffer(0021BEE7C556, 4, Series, series, bc514da3e0d534da8207e3aab231d1cb, livetv, 26908)]
[ArrayBuffer(8C7F3BFD4B82, 4, Series, series, bc514da3e0d534da8207e3aab231d1cb, livetv, 99942)]
[ArrayBuffer(0021BEC8F8B8, 1, Series, series, 0d2debc63efa3790a444c7959249712b, livetv, 53994)]
[ArrayBuffer(10EA59F10C8B, 1, Series, series, 0d2debc63efa3790a444c7959249712b, livetv, 1427)]
Let us try to convert the RDD[Row] back to a DataFrame:
val newSchema = df.schema.add(StructField("userf",IntegerType))
Now let us create the updated DataFrame:
val df2 = df.sqlContext.createDataFrame(dfRdd,newSchema)
Does the new schema look correct?
newSchema.printTreeString()
root
|-- user: string (nullable = true)
|-- score: long (nullable = true)
|-- programType: string (nullable = true)
|-- source: string (nullable = true)
|-- item: string (nullable = true)
|-- playType: string (nullable = true)
|-- userf: integer (nullable = true)
Notice that we do see the new userf column.
However it does not work:
println("df2: " + df2.take(1))
Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times,
most recent failure: Lost task 0.0 in stage 9.0 (TID 9, localhost, executor driver): java.lang.RuntimeException: Error while encoding:
java.lang.RuntimeException: scala.collection.mutable.ArrayBuffer is not a
valid external type for schema of string
if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, user), StringType), true) AS user#28
+- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, user), StringType), true)
:- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt
: :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
: : +- input[0, org.apache.spark.sql.Row, true]
: +- 0
:- null
So: what detail is missing here?
Note: I am not interested in different approaches such as withColumn or Datasets. Please consider only this approach:
convert to RDD
add new data element to each row
update the schema for the new column
convert the new RDD+schema back to DataFrame

There seems to be a small mistake calling Row's constructor:
val newseq = r.toSeq :+ locMap(r.getAs[String](inColName))
Row(newseq)
The signature of this "constructor" (apply method, actually) is:
def apply(values: Any*): Row
When you pass a Seq[Any], it is treated as a single value of type Seq[Any]. You want to pass the elements of this Sequence, therefore you should use:
val newseq = r.toSeq :+ locMap(r.getAs[String](inColName))
Row(newseq: _*)
Once this is fixed, the Rows will match the schema you built, and you'll get the expected result.
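The difference between the two calls can be seen without Spark. In the sketch below, `fields` is a hypothetical stand-in with the same parameter shape as Row's apply method (values: Any*), and the sample values are loosely based on the first row printed in the question.

```scala
object VarargsSplat {
  // Hypothetical stand-in shaped like Row.apply(values: Any*);
  // it simply returns the arguments it received.
  def fields(values: Any*): Seq[Any] = values

  val newseq: Seq[Any] = Seq("0021BEC286CC", 4L, "Series", 148818)

  // Passing the Seq as-is: a single argument, which is the whole Seq.
  val wrapped: Seq[Any] = fields(newseq)

  // Expanding with `: _*`: one argument per element.
  val expanded: Seq[Any] = fields(newseq: _*)
}
```

`wrapped` has length 1 and its only element is the sequence itself, which matches the ArrayBuffer reported in the error; `expanded` has one entry per column.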

java.lang.String is not a valid external type for schema of string

I'm trying to load some CSV data into a Spark cluster and run some queries on it, but I'm running into problems getting the data loaded.
See the code sample below. I've generated a header and am trying to parse the columns, but the process fails when running against the (large, column-rich) data set with an obscure error message: 'java.lang.String is not a valid external type for schema of string'
This doesn't seem to be solved elsewhere on the internet. Does anyone know what the problem might be?
(I originally thought this might be related to null or empty fields being loaded, but the process fails only after some time, and the source data is very sparse.)
var headers = StructType(header_clean.split(",").map(fieldName ⇒ StructField(fieldName, StringType, true)))
var contentRdd = contentNoHeader.map(k => k.split(",")).map(
p => Row(p.map( x => x.replace("\"", "").trim)))
contentRdd.createOrReplaceTempView("someView")
val domains = spark.sql("SELECT DISTINCT domain FROM someView")
For reference, here is the bottom of the error log (very spammy, lots of columns):
if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 87, pageUrl), StringType), true) AS pageUrl#377
+- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 87, pageUrl), StringType), true)
:- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt
: :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
: : +- input[0, org.apache.spark.sql.Row, true]
: +- 87
:- null
+- staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 87, pageUrl), StringType), true)
+- validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 87, pageUrl), StringType)
+- getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 87, pageUrl)
+- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
+- input[0, org.apache.spark.sql.Row, true]
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:279)
at org.apache.spark.sql.SparkSession$$anonfun$5.apply(SparkSession.scala:537)
at org.apache.spark.sql.SparkSession$$anonfun$5.apply(SparkSession.scala:537)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
... 3 more
Caused by: java.lang.RuntimeException: [Ljava.lang.String; is not a valid external type for schema of string
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:276)
... 17 more

I solved this problem by passing each element of the split array to Row as a separate field, instead of passing the whole array as one value. You can do this:
StructType(header_clean.split(",").map(fieldName ⇒ StructField(fieldName, StringType, true)))
var contentRdd = contentNoHeader.map(k => k.split(",")).map(
  p => {
    val ppp = p.map( x => x.replace("\"", "").trim)
    Row(ppp(0),ppp(1),ppp(2))
  })
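With many columns and sparse data, one detail of the splitting step may matter: split(",") drops trailing empty strings, so a line ending in empty fields yields fewer values than the header has columns. The plain-Scala sketch below shows a cleaning step that keeps them. `clean` is a hypothetical helper, not code from the answer, and it does not handle commas inside quoted values.

```scala
object CsvFields {
  // Splits one CSV line and strips quotes and surrounding whitespace.
  // The -1 limit keeps trailing empty fields, which plain split(",") drops.
  // Note: this naive split does not handle commas inside quoted values.
  def clean(line: String): Array[String] =
    line.split(",", -1).map(_.replace("\"", "").trim)
}
```

The cleaned array could then be spread into a row with Row(fields: _*) or Row.fromSeq(fields), which avoids listing every index by hand as the answer does for three columns.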
