I don't think I understand how select or drop work.
I am exploding a dataset and I don't want some of the columns to be copied to the newly generated entries.
val ds = spark.sparkContext.parallelize(Seq(
("2017-01-01 06:15:00", "ASC_a", "1"),
("2017-01-01 06:19:00", "start", "2"),
("2017-01-01 06:22:00", "ASC_b", "2"),
("2017-01-01 06:30:00", "end", "2"),
("2017-01-01 10:45:00", "ASC_a", "3"),
("2017-01-01 10:50:00", "start", "3"),
("2017-01-01 11:22:00", "ASC_c", "4"),
("2017-01-01 11:31:00", "end", "5" )
)).toDF("timestamp", "status", "msg")
ds.show()
val foo = ds.select($"timestamp", $"msg")
val bar = ds.drop($"status")
foo.printSchema()
bar.printSchema()
println("foo " + foo.where($"status" === "end").count)
println("bar" + bar.where($"status" === "end").count)
Output:
root
|-- timestamp: string (nullable = true)
|-- msg: string (nullable = true)
root
|-- timestamp: string (nullable = true)
|-- msg: string (nullable = true)
foo 2
bar 2
Why do I still get an output of 2 for both even though I
a) did not select status
b) dropped status
EDIT:
println("foo " + foo.where(foo.col("status") === "end").count) says that there is no column status. Should this not be the same as println("foo " + foo.where($"status" === "end").count)?
Why do I still get an output of 2 for both
Because the optimizer is free to reorganize the execution plan. In fact, if you check it:
== Physical Plan ==
*Project [_1#4 AS timestamp#8, _3#6 AS msg#10]
+- *Filter (isnotnull(_2#5) && (_2#5 = end))
+- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple3, true])._1, true) AS _1#4, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple3, true])._2, true) AS _2#5, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple3, true])._3, true) AS _3#6]
+- Scan ExternalRDDScan[obj#3]
you'll see that the filter is pushed down as early as possible and executed before the projection. So it is equivalent to:
SELECT _1 AS timestamp, _3 AS msg
FROM ds WHERE _2 IS NOT NULL AND _2 = 'end'
Arguably it is a minor bug, and the code should be translated as
SELECT * FROM (
SELECT _1 AS timestamp, _3 AS msg FROM ds
) WHERE _2 IS NOT NULL AND _2 = 'end'
and throw an exception.
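If you want a query on the dropped column to fail instead of being answered from the original plan, one workaround (just a sketch, not the only option) is to cut the lineage so that the analyzer only sees foo's own columns:
// Rebuilding the DataFrame from its RDD produces a plan whose only attributes are
// foo's columns, so a reference to the dropped "status" column can no longer resolve.
val fooStrict = spark.createDataFrame(foo.rdd, foo.schema)
// fooStrict.where($"status" === "end").count
// => AnalysisException: cannot resolve '`status`' given input columns: [timestamp, msg]
Note that this forces a conversion through the RDD, so it is a trade-off rather than something to apply blindly.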
Related
I am trying to read data from a table that is stored in a CSV file. It does not have a header, so when I query the table using Spark SQL, all the results are null.
I have tried creating a schema struct, and while it does display when I call printSchema(), running select * from tableName does not work; all values are null. I have also tried StructType() with .add(colName) instead of StructField, and that yielded the same results.
val schemaStruct1 = StructType(
StructField( "AgreementVersionID", IntegerType, true )::
StructField( "ProgramID", IntegerType, true )::
StructField( "AgreementID", IntegerType, true )::
StructField( "AgreementVersionNumber", IntegerType, true )::
StructField( "AgreementStatusID", IntegerType, true )::
StructField( "AgreementEffectiveDate", DateType, true )::
StructField( "AgreementEffectiveDateDay", IntegerType, true )::
StructField( "AgreementEndDate", DateType, true )::
StructField( "AgreementEndDateDay", IntegerType, true )::
StructField( "MasterAgreementNumber", IntegerType, true )::
StructField( "MasterAgreementEffectiveDate", DateType, true )::
StructField( "MasterAgreementEffectiveDateDay", IntegerType, true )::
StructField( "MasterAgreementEndDate", DateType, true )::
StructField( "MasterAgreementEndDateDay", IntegerType, true )::
StructField( "SalesContactName", StringType, true )::
StructField( "RevenueSubID", IntegerType, true )::
StructField( "LicenseAgreementContractTypeID", IntegerType, true )::Nil
)
val df1 = session.read
.option( "header", true )
.option( "delimiter", "," )
.schema( schemaStruct1 )
.csv( LicenseAgrmtMaster )
df1.printSchema()
df1.createOrReplaceTempView( "LicenseAgrmtMaster" )
Printing this gives me the following schema, which is correct:
root
|-- AgreementVersionID: integer (nullable = true)
|-- ProgramID: integer (nullable = true)
|-- AgreementID: integer (nullable = true)
|-- AgreementVersionNumber: integer (nullable = true)
|-- AgreementStatusID: integer (nullable = true)
|-- AgreementEffectiveDate: date (nullable = true)
|-- AgreementEffectiveDateDay: integer (nullable = true)
|-- AgreementEndDate: date (nullable = true)
|-- AgreementEndDateDay: integer (nullable = true)
|-- MasterAgreementNumber: integer (nullable = true)
|-- MasterAgreementEffectiveDate: date (nullable = true)
|-- MasterAgreementEffectiveDateDay: integer (nullable = true)
|-- MasterAgreementEndDate: date (nullable = true)
|-- MasterAgreementEndDateDay: integer (nullable = true)
|-- SalesContactName: string (nullable = true)
|-- RevenueSubID: integer (nullable = true)
|-- LicenseAgreementContractTypeID: integer (nullable = true)
However, querying this table yields only null values, even though the file is not full of nulls. I need to be able to read this table in order to join it to another and complete a stored procedure.
I would suggest going with the steps below; you can then adapt the code to your needs.
val df = session.read.option("delimiter", ",").csv("<Path of your file/dir>")
// Example with two columns; define exactly as many names as the file has columns
val columnNames = Seq("name", "id")
val dfWithHeader = df.toDF(columnNames: _*)
// Now the columns have names; every value is still a string, so check the types and cast where needed
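For example, continuing from the sketch above (the hypothetical dfWithHeader with columns name and id), you can cast columns once they are named and then register the view; extend this to all 17 columns of your file:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

// Everything read without a schema is a string; cast the columns that should not stay strings
val typed = dfWithHeader.withColumn("id", col("id").cast(IntegerType))
typed.printSchema()
typed.createOrReplaceTempView("LicenseAgrmtMaster")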
I can make a Spark DataFrame with a vector column with the toDF method.
val dataset = Seq((1.0, org.apache.spark.ml.linalg.Vectors.dense(0.0, 10.0, 0.5))).toDF("id", "userFeatures")
scala> dataset.printSchema()
root
|-- id: double (nullable = false)
|-- userFeatures: vector (nullable = true)
scala> dataset.schema
res5: org.apache.spark.sql.types.StructType = StructType(StructField(id,DoubleType,false), StructField(userFeatures,org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7,true))
I'm not sure how to create a vector column with the createDataFrame method. There isn't a VectorType type in org.apache.spark.sql.types.
This doesn't work:
val rows = spark.sparkContext.parallelize(
List(
Row(1.0, Vectors.dense(1.0, 2.0))
)
)
val schema = List(
StructField("id", DoubleType, true),
StructField("features", new org.apache.spark.ml.linalg.VectorUDT, true)
)
val df = spark.createDataFrame(
rows,
StructType(schema)
)
df.show()
df.printSchema()
To create a Spark vector column with createDataFrame, you can use the following code:
val rows = spark.sparkContext.parallelize(
List(
Row(1.0, org.apache.spark.mllib.linalg.Vectors.dense(1.0, 2.0))
)
)
val schema = List(
StructField("id", DoubleType, true),
StructField("features", new org.apache.spark.mllib.linalg.VectorUDT, true)
)
val df = spark.createDataFrame(
rows,
StructType(schema)
)
df.show()
+---+---------+
| id| features|
+---+---------+
|1.0|[1.0,2.0]|
+---+---------+
df.printSchema()
root
|-- id: double (nullable = true)
|-- features: vector (nullable = true)
The actual issue was the incompatible type: org.apache.spark.ml.linalg.Vectors.dense is not a valid external type for a schema of vector here. So we have to switch to the mllib package instead of the ml package.
I hope it helps!
Note: I am using Spark v2.3.0. Also, the class VectorUDT in org.apache.spark.ml.linalg is private, so it cannot be accessed directly from user code.
For reference - https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib
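If you prefer to stay with the ml package, a sketch that should also work is to get the vector SQL type through org.apache.spark.ml.linalg.SQLDataTypes, which exposes the otherwise private ml VectorUDT:
import org.apache.spark.ml.linalg.{SQLDataTypes, Vectors}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

val rows = spark.sparkContext.parallelize(
  List(Row(1.0, Vectors.dense(1.0, 2.0)))
)
val schema = StructType(List(
  StructField("id", DoubleType, true),
  // SQLDataTypes.VectorType is the public handle to the ml vector type
  StructField("features", SQLDataTypes.VectorType, true)
))
val df = spark.createDataFrame(rows, schema)
df.printSchema()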
I'm pretty new to Spark and am having a problem converting an RDD to a DataFrame. What I'm trying to do is take a log file, convert it to JSON using an existing jar (returns a string), and then make that resulting json into a dataframe. Here is what I have so far:
val serverLog = sc.textFile("/Users/Downloads/file1.log")
val jsonRows = serverLog.mapPartitions(partition => {
val txfm = new JsonParser //*jar to parse logs to json*//
partition.map(line => {
Row(txfm.parseLine(line))
})
})
When I run a take(2) on this I get something like:
[{"pwh":"600","sVe":"10.0","psh":"667","udt":"mobile"}]
[{"pwh":"800","sVe":"10.0","psh":"1000","udt":"desktop"}]
My problem comes here. I create a schema and try to create the DataFrame:
val schema = StructType(Array(
StructField("pwh",StringType,true),
StructField("sVe",StringType,true),...))
val jsonDf = sqlSession.createDataFrame(jsonRows, schema)
And the returned error is
java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: 1
if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, pwh), StringType), true) AS _pwh#0
+- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, pwh), StringType), true)
:- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt
: :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
: : +- input[0, org.apache.spark.sql.Row, true]
: +- 0
:- null
Can someone tell me what I'm doing wrong here? Most of the SO answers I've found say I can use either createDataFrame or toDF(), but I've had no luck with either. I also tried converting the RDD to a JavaRDD, but that also did not work. Appreciate any insight you can give.
Your defined schema is for an RDD whose records look like:
{"pwh":"600","sVe":"10.0","psh":"667","udt":"mobile"}
{"pwh":"800","sVe":"10.0","psh":"1000","udt":"desktop"}
If you can change your RDD so that the data looks like
{"logs": [{"pwh":"600","sVe":"10.0","psh":"667","udt":"mobile"}]}
and use this schema:
val schema = StructType(Seq(
StructField("logs",ArrayType( StructType(Seq(
StructField("pwh",StringType,true),
StructField("sVe",StringType,true), ...))
))
))
sqlContext.read.schema(schema).json(jsonRows)
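Alternatively, since each parsed line is already JSON text, you can skip Row entirely and let Spark do the parsing. This is only a sketch, assuming JsonParser.parseLine returns the JSON string for a line as in the question, and a SparkSession called spark; note that spark.read.json expects an RDD[String] or Dataset[String], not Row objects:
import spark.implicits._

val jsonStrings = serverLog.mapPartitions { partition =>
  val txfm = new JsonParser // hypothetical parser from the question
  partition.map(line => txfm.parseLine(line)) // keep the raw JSON string, no Row wrapper
}

// Spark parses the JSON itself; add .schema(...) before .json(...) to enforce a schema
val jsonDf = spark.read.json(jsonStrings.toDS())
jsonDf.printSchema()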
This is my data:
scala> data.printSchema
root
|-- 1.0: string (nullable = true)
|-- 2.0: string (nullable = true)
|-- 3.0: string (nullable = true)
This doesn't work :(
scala> data.select("2.0").show
Exception:
org.apache.spark.sql.AnalysisException: cannot resolve '`2.0`' given input columns: [1.0, 2.0, 3.0];;
'Project ['2.0]
+- Project [_1#5608 AS 1.0#5615, _2#5609 AS 2.0#5616, _3#5610 AS 3.0#5617]
+- LocalRelation [_1#5608, _2#5609, _3#5610]
...
Try this at home (I'm running on the shell v_2.1.0.5)!
val data = spark.createDataFrame(Seq(
("Hello", ", ", "World!")
)).toDF("1.0", "2.0", "3.0")
data.select("2.0").show
You can use backticks to escape the dot, which is otherwise reserved for accessing fields of struct-type columns:
data.select("`2.0`").show
+---+
|2.0|
+---+
| , |
+---+
The problem is that you cannot use a dot character in a column name directly when selecting from a DataFrame. You can have a look at this similar question.
def sanitize(input: String): String = s"`$input`"

val data = spark.createDataFrame(Seq(
("Hello", ", ", "World!")
)).toDF("1.0", "2.0", "3.0")
data.select(sanitize("2.0")).show
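Another option, if you do not want to backtick-quote every reference, is to rename the columns once so the dots disappear (a sketch reusing the data frame from above):
// Replace dots in all column names, then select without any escaping
val renamed = data.toDF(data.columns.map(_.replace(".", "_")): _*)
renamed.select("2_0").show()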
For the following code - in which a DataFrame is converted to RDD[Row] and data for a new column is appended via mapPartitions:
// df is a DataFrame
val dfRdd = df.rdd.mapPartitions {
val bfMap = df.rdd.sparkContext.broadcast(factorsMap)
iter =>
val locMap = bfMap.value
iter.map { r =>
val newseq = r.toSeq :+ locMap(r.getAs[String](inColName))
Row(newseq)
}
}
The output is correct for the RDD[Row] with another column:
println("**dfrdd\n" + dfRdd.take(5).mkString("\n"))
**dfrdd
[ArrayBuffer(0021BEC286CC, 4, Series, series, bc514da3e0d534da8207e3aab231d1cb, livetv, 148818)]
[ArrayBuffer(0021BEE7C556, 4, Series, series, bc514da3e0d534da8207e3aab231d1cb, livetv, 26908)]
[ArrayBuffer(8C7F3BFD4B82, 4, Series, series, bc514da3e0d534da8207e3aab231d1cb, livetv, 99942)]
[ArrayBuffer(0021BEC8F8B8, 1, Series, series, 0d2debc63efa3790a444c7959249712b, livetv, 53994)]
[ArrayBuffer(10EA59F10C8B, 1, Series, series, 0d2debc63efa3790a444c7959249712b, livetv, 1427)]
Let us try to convert the RDD[Row] back to a DataFrame:
val newSchema = df.schema.add(StructField("userf",IntegerType))
Now let us create the updated DataFrame:
val df2 = df.sqlContext.createDataFrame(dfRdd,newSchema)
Does the new schema look correct?
newSchema.printTreeString()
root
|-- user: string (nullable = true)
|-- score: long (nullable = true)
|-- programType: string (nullable = true)
|-- source: string (nullable = true)
|-- item: string (nullable = true)
|-- playType: string (nullable = true)
|-- userf: integer (nullable = true)
Notice that we do see the new userf column.
However it does not work:
println("df2: " + df2.take(1))
Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times,
most recent failure: Lost task 0.0 in stage 9.0 (TID 9, localhost, executor driver): java.lang.RuntimeException: Error while encoding:
java.lang.RuntimeException: scala.collection.mutable.ArrayBuffer is not a
valid external type for schema of string
if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, user), StringType), true) AS user#28
+- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, user), StringType), true)
:- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt
: :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
: : +- input[0, org.apache.spark.sql.Row, true]
: +- 0
:- null
So: what detail is missing here?
Note: I am not interested in different approaches (e.g. withColumn or Datasets). Let us please consider only this approach:
convert to RDD
add new data element to each row
update the schema for the new column
convert the new RDD+schema back to DataFrame
There seems to be a small mistake in the call to Row's constructor:
val newseq = r.toSeq :+ locMap(r.getAs[String](inColName))
Row(newseq)
The signature of this "constructor" (apply method, actually) is:
def apply(values: Any*): Row
When you pass a Seq[Any], it is treated as a single value of type Seq[Any]. You want to pass the elements of this sequence instead, so you should use:
val newseq = r.toSeq :+ locMap(r.getAs[String](inColName))
Row(newseq: _*)
Once this is fixed, the Rows will match the schema you built, and you'll get the expected result.
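Putting it together, a corrected version of the question's mapPartitions block could look like this (a sketch reusing the question's factorsMap, inColName and newSchema):
val bfMap = df.rdd.sparkContext.broadcast(factorsMap)

val dfRdd = df.rdd.mapPartitions { iter =>
  val locMap = bfMap.value
  iter.map { r =>
    val newseq = r.toSeq :+ locMap(r.getAs[String](inColName))
    Row(newseq: _*) // expand the sequence so each element becomes its own column
  }
}

val df2 = df.sqlContext.createDataFrame(dfRdd, newSchema)
df2.show(5)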