I don't think I understand how select or drop work.
I am exploding a dataset and I don't want some of the columns to be copied to the newly generated entries.
val ds = spark.sparkContext.parallelize(Seq(
("2017-01-01 06:15:00", "ASC_a", "1"),
("2017-01-01 06:19:00", "start", "2"),
("2017-01-01 06:22:00", "ASC_b", "2"),
("2017-01-01 06:30:00", "end", "2"),
("2017-01-01 10:45:00", "ASC_a", "3"),
("2017-01-01 10:50:00", "start", "3"),
("2017-01-01 11:22:00", "ASC_c", "4"),
("2017-01-01 11:31:00", "end", "5" )
)).toDF("timestamp", "status", "msg")
ds.show()
val foo = ds.select($"timestamp", $"msg")
val bar = ds.drop($"status")
foo.printSchema()
bar.printSchema()
println("foo " + foo.where($"status" === "end").count)
println("bar" + bar.where($"status" === "end").count)
Output:
root
|-- timestamp: string (nullable = true)
|-- msg: string (nullable = true)
root
|-- timestamp: string (nullable = true)
|-- msg: string (nullable = true)
foo 2
bar 2
Why do I still get an output of 2 for both even though I
a) did not select status
b) dropped status
EDIT:
println("foo " + foo.where(foo.col("status") === "end").count) says that there is no column status. Should this not be the same as println("foo " + foo.where($"status" === "end").count)?
Why do I still get an output of 2 for both
Because the optimizer is free to reorganize the execution plan. In fact, if you check it:
== Physical Plan ==
*Project [_1#4 AS timestamp#8, _3#6 AS msg#10]
+- *Filter (isnotnull(_2#5) && (_2#5 = end))
+- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple3, true])._1, true) AS _1#4, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple3, true])._2, true) AS _2#5, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple3, true])._3, true) AS _3#6]
+- Scan ExternalRDDScan[obj#3]
you'll see that the filter is pushed down as early as possible and executed before the projection. So it is equivalent to:
SELECT _1 AS timestamp, _3 AS msg
FROM ds WHERE _2 IS NOT NULL AND _2 = 'end'
Arguably it is a minor bug, and the code should be translated as
SELECT * FROM (
SELECT _1 AS timestamp, _3 AS msg FROM ds
) WHERE _2 IS NOT NULL AND _2 = 'end'
and throw an exception.
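If you want a query on the dropped column to fail instead of being answered from the original plan, one workaround (just a sketch, not the only option) is to cut the lineage so that the analyzer only sees foo's own columns:
// Rebuilding the DataFrame from its RDD produces a plan whose only attributes are
// foo's columns, so a reference to the dropped "status" column can no longer resolve.
val fooStrict = spark.createDataFrame(foo.rdd, foo.schema)
// fooStrict.where($"status" === "end").count
// => AnalysisException: cannot resolve '`status`' given input columns: [timestamp, msg]
Note that this forces a conversion through the RDD, so it is a trade-off rather than something to apply blindly.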
Related
I am trying to read data from a table that is stored in a CSV file. It does not have a header, so when I query the table using Spark SQL, all the results are null.
I have tried creating a schema struct, and while it does display when I call printSchema(), running select * from tableName does not work; all values are null. I have also tried StructType() with .add(colName) instead of StructField, and that yielded the same results.
val schemaStruct1 = StructType(
StructField( "AgreementVersionID", IntegerType, true )::
StructField( "ProgramID", IntegerType, true )::
StructField( "AgreementID", IntegerType, true )::
StructField( "AgreementVersionNumber", IntegerType, true )::
StructField( "AgreementStatusID", IntegerType, true )::
StructField( "AgreementEffectiveDate", DateType, true )::
StructField( "AgreementEffectiveDateDay", IntegerType, true )::
StructField( "AgreementEndDate", DateType, true )::
StructField( "AgreementEndDateDay", IntegerType, true )::
StructField( "MasterAgreementNumber", IntegerType, true )::
StructField( "MasterAgreementEffectiveDate", DateType, true )::
StructField( "MasterAgreementEffectiveDateDay", IntegerType, true )::
StructField( "MasterAgreementEndDate", DateType, true )::
StructField( "MasterAgreementEndDateDay", IntegerType, true )::
StructField( "SalesContactName", StringType, true )::
StructField( "RevenueSubID", IntegerType, true )::
StructField( "LicenseAgreementContractTypeID", IntegerType, true )::Nil
)
val df1 = session.read
.option( "header", true )
.option( "delimiter", "," )
.schema( schemaStruct1 )
.csv( LicenseAgrmtMaster )
df1.printSchema()
df1.createOrReplaceTempView( "LicenseAgrmtMaster" )
Printing this gives me the following schema, which is correct:
root
|-- AgreementVersionID: integer (nullable = true)
|-- ProgramID: integer (nullable = true)
|-- AgreementID: integer (nullable = true)
|-- AgreementVersionNumber: integer (nullable = true)
|-- AgreementStatusID: integer (nullable = true)
|-- AgreementEffectiveDate: date (nullable = true)
|-- AgreementEffectiveDateDay: integer (nullable = true)
|-- AgreementEndDate: date (nullable = true)
|-- AgreementEndDateDay: integer (nullable = true)
|-- MasterAgreementNumber: integer (nullable = true)
|-- MasterAgreementEffectiveDate: date (nullable = true)
|-- MasterAgreementEffectiveDateDay: integer (nullable = true)
|-- MasterAgreementEndDate: date (nullable = true)
|-- MasterAgreementEndDateDay: integer (nullable = true)
|-- SalesContactName: string (nullable = true)
|-- RevenueSubID: integer (nullable = true)
|-- LicenseAgreementContractTypeID: integer (nullable = true)
However, querying this table yields only null values, even though the file is not full of nulls. I need to be able to read this table in order to join it to another and complete a stored procedure.
I would suggest going with the steps below; you can then adapt the code to your needs.
val df = session.read.option("delimiter", ",").csv("<Path of your file/dir>")
// Example with two columns; define exactly as many names as the file has columns
val columnNames = Seq("name", "id")
val dfWithHeader = df.toDF(columnNames: _*)
// Now the columns have names; every value is still a string, so check the types and cast where needed
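For example, continuing from the sketch above (the hypothetical dfWithHeader with columns name and id), you can cast columns once they are named and then register the view; extend this to all 17 columns of your file:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

// Everything read without a schema is a string; cast the columns that should not stay strings
val typed = dfWithHeader.withColumn("id", col("id").cast(IntegerType))
typed.printSchema()
typed.createOrReplaceTempView("LicenseAgrmtMaster")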
I can make a Spark DataFrame with a vector column with the toDF method.
val dataset = Seq((1.0, org.apache.spark.ml.linalg.Vectors.dense(0.0, 10.0, 0.5))).toDF("id", "userFeatures")
scala> dataset.printSchema()
root
|-- id: double (nullable = false)
|-- userFeatures: vector (nullable = true)
scala> dataset.schema
res5: org.apache.spark.sql.types.StructType = StructType(StructField(id,DoubleType,false), StructField(userFeatures,org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7,true))
I'm not sure how to create a vector column with the createDataFrame method. There isn't a VectorType type in org.apache.spark.sql.types.
This doesn't work:
val rows = spark.sparkContext.parallelize(
List(
Row(1.0, Vectors.dense(1.0, 2.0))
)
)
val schema = List(
StructField("id", DoubleType, true),
StructField("features", new org.apache.spark.ml.linalg.VectorUDT, true)
)
val df = spark.createDataFrame(
rows,
StructType(schema)
)
df.show()
df.printSchema()
To create a Spark vector column with createDataFrame, you can use the following code:
val rows = spark.sparkContext.parallelize(
List(
Row(1.0, org.apache.spark.mllib.linalg.Vectors.dense(1.0, 2.0))
)
)
val schema = List(
StructField("id", DoubleType, true),
StructField("features", new org.apache.spark.mllib.linalg.VectorUDT, true)
)
val df = spark.createDataFrame(
rows,
StructType(schema)
)
df.show()
+---+---------+
| id| features|
+---+---------+
|1.0|[1.0,2.0]|
+---+---------+
df.printSchema()
root
|-- id: double (nullable = true)
|-- features: vector (nullable = true)
The actual issue was the incompatible type: org.apache.spark.ml.linalg.Vectors.dense is not a valid external type for a schema of vector here. So we have to switch to the mllib package instead of the ml package.
I hope it helps!
Note: I am using Spark v2.3.0. Also, the class VectorUDT in org.apache.spark.ml.linalg is private, so it cannot be accessed directly from user code.
For reference - https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib
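If you prefer to stay with the ml package, a sketch that should also work is to get the vector SQL type through org.apache.spark.ml.linalg.SQLDataTypes, which exposes the otherwise private ml VectorUDT:
import org.apache.spark.ml.linalg.{SQLDataTypes, Vectors}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

val rows = spark.sparkContext.parallelize(
  List(Row(1.0, Vectors.dense(1.0, 2.0)))
)
val schema = StructType(List(
  StructField("id", DoubleType, true),
  // SQLDataTypes.VectorType is the public handle to the ml vector type
  StructField("features", SQLDataTypes.VectorType, true)
))
val df = spark.createDataFrame(rows, schema)
df.printSchema()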
I'm pretty new to Spark and am having a problem converting an RDD to a DataFrame. What I'm trying to do is take a log file, convert it to JSON using an existing jar (returns a string), and then make that resulting json into a dataframe. Here is what I have so far:
val serverLog = sc.textFile("/Users/Downloads/file1.log")
val jsonRows = serverLog.mapPartitions(partition => {
val txfm = new JsonParser //*jar to parse logs to json*//
partition.map(line => {
Row(txfm.parseLine(line))
})
})
When I run a take(2) on this I get something like:
[{"pwh":"600","sVe":"10.0","psh":"667","udt":"mobile"}]
[{"pwh":"800","sVe":"10.0","psh":"1000","udt":"desktop"}]
My problem comes here. I create a schema and try to create the DataFrame:
val schema = StructType(Array(
StructField("pwh",StringType,true),
StructField("sVe",StringType,true),...))
val jsonDf = sqlSession.createDataFrame(jsonRows, schema)
And the returned error is
java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: 1
if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, pwh), StringType), true) AS _pwh#0
+- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, pwh), StringType), true)
:- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt
: :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
: : +- input[0, org.apache.spark.sql.Row, true]
: +- 0
:- null
Can someone tell me what I'm doing wrong here? Most of the SO answers I've found say I can use either createDataFrame or toDF(), but I've had no luck with either. I also tried converting the RDD to a JavaRDD, but that also did not work. Appreciate any insight you can give.
Your defined schema is for an RDD whose records look like:
{"pwh":"600","sVe":"10.0","psh":"667","udt":"mobile"}
{"pwh":"800","sVe":"10.0","psh":"1000","udt":"desktop"}
If you can change your RDD so that the data looks like
{"logs": [{"pwh":"600","sVe":"10.0","psh":"667","udt":"mobile"}]}
and use this schema:
val schema = StructType(Seq(
StructField("logs",ArrayType( StructType(Seq(
StructField("pwh",StringType,true),
StructField("sVe",StringType,true), ...))
))
))
sqlContext.read.schema(schema).json(jsonRows)
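Alternatively, since each parsed line is already JSON text, you can skip Row entirely and let Spark do the parsing. This is only a sketch, assuming JsonParser.parseLine returns the JSON string for a line as in the question, and a SparkSession called spark; note that spark.read.json expects an RDD[String] or Dataset[String], not Row objects:
import spark.implicits._

val jsonStrings = serverLog.mapPartitions { partition =>
  val txfm = new JsonParser // hypothetical parser from the question
  partition.map(line => txfm.parseLine(line)) // keep the raw JSON string, no Row wrapper
}

// Spark parses the JSON itself; add .schema(...) before .json(...) to enforce a schema
val jsonDf = spark.read.json(jsonStrings.toDS())
jsonDf.printSchema()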
This is my data:
scala> data.printSchema
root
|-- 1.0: string (nullable = true)
|-- 2.0: string (nullable = true)
|-- 3.0: string (nullable = true)
This doesn't work :(
scala> data.select("2.0").show
Exception:
org.apache.spark.sql.AnalysisException: cannot resolve '`2.0`' given input columns: [1.0, 2.0, 3.0];;
'Project ['2.0]
+- Project [_1#5608 AS 1.0#5615, _2#5609 AS 2.0#5616, _3#5610 AS 3.0#5617]
+- LocalRelation [_1#5608, _2#5609, _3#5610]
...
Try this at home (I'm running on the shell v_2.1.0.5)!
val data = spark.createDataFrame(Seq(
("Hello", ", ", "World!")
)).toDF("1.0", "2.0", "3.0")
data.select("2.0").show
You can use backticks to escape the dot, which is otherwise reserved for accessing fields of struct-type columns:
data.select("`2.0`").show
+---+
|2.0|
+---+
| , |
+---+
The problem is that you cannot use a dot character in a column name directly when selecting from a DataFrame. You can have a look at this similar question.
def sanitize(input: String): String = s"`$input`"

val data = spark.createDataFrame(Seq(
("Hello", ", ", "World!")
)).toDF("1.0", "2.0", "3.0")
data.select(sanitize("2.0")).show
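Another option, if you do not want to backtick-quote every reference, is to rename the columns once so the dots disappear (a sketch reusing the data frame from above):
// Replace dots in all column names, then select without any escaping
val renamed = data.toDF(data.columns.map(_.replace(".", "_")): _*)
renamed.select("2_0").show()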
For the following code - in which a DataFrame is converted to RDD[Row] and data for a new column is appended via mapPartitions:
// df is a DataFrame
val dfRdd = df.rdd.mapPartitions {
val bfMap = df.rdd.sparkContext.broadcast(factorsMap)
iter =>
val locMap = bfMap.value
iter.map { r =>
val newseq = r.toSeq :+ locMap(r.getAs[String](inColName))
Row(newseq)
}
}
The output is correct for the RDD[Row] with another column:
println("**dfrdd\n" + dfRdd.take(5).mkString("\n"))
**dfrdd
[ArrayBuffer(0021BEC286CC, 4, Series, series, bc514da3e0d534da8207e3aab231d1cb, livetv, 148818)]
[ArrayBuffer(0021BEE7C556, 4, Series, series, bc514da3e0d534da8207e3aab231d1cb, livetv, 26908)]
[ArrayBuffer(8C7F3BFD4B82, 4, Series, series, bc514da3e0d534da8207e3aab231d1cb, livetv, 99942)]
[ArrayBuffer(0021BEC8F8B8, 1, Series, series, 0d2debc63efa3790a444c7959249712b, livetv, 53994)]
[ArrayBuffer(10EA59F10C8B, 1, Series, series, 0d2debc63efa3790a444c7959249712b, livetv, 1427)]
Let us try to convert the RDD[Row] back to a DataFrame:
val newSchema = df.schema.add(StructField("userf",IntegerType))
Now let us create the updated DataFrame:
val df2 = df.sqlContext.createDataFrame(dfRdd,newSchema)
Does the new schema look correct?
newSchema.printTreeString()
root
|-- user: string (nullable = true)
|-- score: long (nullable = true)
|-- programType: string (nullable = true)
|-- source: string (nullable = true)
|-- item: string (nullable = true)
|-- playType: string (nullable = true)
|-- userf: integer (nullable = true)
Notice that we do see the new userf column.
However it does not work:
println("df2: " + df2.take(1))
Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times,
most recent failure: Lost task 0.0 in stage 9.0 (TID 9, localhost, executor driver): java.lang.RuntimeException: Error while encoding:
java.lang.RuntimeException: scala.collection.mutable.ArrayBuffer is not a
valid external type for schema of string
if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, user), StringType), true) AS user#28
+- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, user), StringType), true)
:- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt
: :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
: : +- input[0, org.apache.spark.sql.Row, true]
: +- 0
:- null
So: what detail is missing here?
Note: I am not interested in different approaches (e.g. withColumn or Datasets). Let us please consider only this approach:
convert to RDD
add new data element to each row
update the schema for the new column
convert the new RDD+schema back to DataFrame
There seems to be a small mistake in the call to Row's constructor:
val newseq = r.toSeq :+ locMap(r.getAs[String](inColName))
Row(newseq)
The signature of this "constructor" (apply method, actually) is:
def apply(values: Any*): Row
When you pass a Seq[Any], it is treated as a single value of type Seq[Any]. You want to pass the elements of this sequence instead, so you should use:
val newseq = r.toSeq :+ locMap(r.getAs[String](inColName))
Row(newseq: _*)
Once this is fixed, the Rows will match the schema you built, and you'll get the expected result.
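Putting it together, a corrected version of the question's mapPartitions block could look like this (a sketch reusing the question's factorsMap, inColName and newSchema):
val bfMap = df.rdd.sparkContext.broadcast(factorsMap)

val dfRdd = df.rdd.mapPartitions { iter =>
  val locMap = bfMap.value
  iter.map { r =>
    val newseq = r.toSeq :+ locMap(r.getAs[String](inColName))
    Row(newseq: _*) // expand the sequence so each element becomes its own column
  }
}

val df2 = df.sqlContext.createDataFrame(dfRdd, newSchema)
df2.show(5)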