PySpark: aggregation results in an error

I can print the dataframe fine before aggregation:
(Pdb) df_interesting.printSchema()
root
|-- userId: long (nullable = true)
|-- screen_index: integer (nullable = true)
|-- type: string (nullable = true)
|-- time_delta: float (nullable = true)
|-- app_open_index: integer (nullable = true)
|-- timestamp: timestamp (nullable = true)
(Pdb) df_interesting.show(n=2)
+------+------------+------+----------+--------------+--------------------+
|userId|screen_index| type|time_delta|app_open_index| timestamp|
+------+------------+------+----------+--------------+--------------------+
|214431| 7|screen| 60.0| 13|2020-07-31 07:52:...|
|398910| 3|screen| 60.0| 2|2020-07-29 11:43:...|
+------+------------+------+----------+--------------+--------------------+
However, after aggregation, show() results in an error:
(Pdb) df_interesting.groupBy('app_open_index').agg(F.max("screen_index").alias("max_screen_index")).show(n=2)
[Stage 1:> (0 + 2) / 2]20/08/13 18:07:26 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.IllegalArgumentException: The value (Buffer()) of the type (scala.collection.convert.Wrappers.JListWrapper) cannot be converted to the string type
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:290)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:285)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:248)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:238)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
Edit:
I tried to single out the column, and here's some progress:
(Pdb) df_interesting = df_interesting.select(col('data.userId').alias('userId'))
(Pdb) df_interesting.count()
[Stage 0:> (0 + 2) / 2]20/08/13 18:59:12 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
org.elasticsearch.hadoop.rest.EsHadoopParsingException: org.elasticsearch.hadoop.EsHadoopIllegalStateException: Field 'data.properties.priceObj' not found; typically this occurs with arrays which are not mapped as single value
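Based on that message, it looks like data.properties.priceObj sometimes holds an array in Elasticsearch while the ES-Hadoop connector infers it as a single value. A hedged sketch of the usual fix, telling the connector up front which fields to map as arrays (shown in Scala; the es.read.field.as.array.include option key is the same from PySpark, and the index name is hypothetical):
// Re-read the data, declaring the offending field as an array so schema
// discovery and row parsing agree (field name taken from the error above)
val df_interesting = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.read.field.as.array.include", "data.properties.priceObj")
  .load("my_index")  // hypothetical index/resource name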

Related

Schema changes when writing Dataframe containing Vector

I am writing a Spark dataframe, where one of the columns is of Vector datatype, as ORC. When I load the dataframe back, the schema changes.
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SaveMode}

val df: DataFrame = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")
df.printSchema
df.write.mode(SaveMode.Overwrite).orc("/some/path")
val newDF = spark.read.orc("/some/path")
newDF.printSchema
The output of df.printSchema is:
root
|-- label: double (nullable = false)
|-- features: vector (nullable = true)
The output of newDF.printSchema is:
root
|-- label: double (nullable = true)
|-- features: struct (nullable = true)
| |-- type: byte (nullable = true)
| |-- size: integer (nullable = true)
| |-- indices: array (nullable = true)
| | |-- element: integer (containsNull = true)
| |-- values: array (nullable = true)
| | |-- element: double (containsNull = true)
What is the issue here? I am using Spark 2.2.0 with Scala 2.11.8.
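For what it's worth, ORC (like Parquet) has no notion of Spark's VectorUDT: the vector is persisted as its underlying struct(type, size, indices, values) and read back as that raw struct, and file-based sources also mark columns nullable on read, which explains both schema changes. A hedged sketch of rebuilding the vectors after loading, assuming the struct layout printed above (type 0 is sparse, 1 is dense):
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import spark.implicits._

val toVector = udf { row: Row =>
  row.getAs[Byte]("type") match {
    // sparse vector: rebuild from size, indices and values
    case 0 => Vectors.sparse(row.getAs[Int]("size"),
                             row.getAs[Seq[Int]]("indices").toArray,
                             row.getAs[Seq[Double]]("values").toArray)
    // dense vector: only the values array is populated
    case 1 => Vectors.dense(row.getAs[Seq[Double]]("values").toArray)
  }
}
val restored = newDF.withColumn("features", toVector($"features"))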

Explode array of structs to columns in Spark

I'd like to explode an array of structs to columns (as defined by the struct fields). E.g.
root
|-- arr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = false)
| | |-- name: string (nullable = true)
Should be transformed to
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
I can achieve this with
df
.select(explode($"arr").as("tmp"))
.select($"tmp.*")
How can I do that in a single select statement?
I thought this could work, unfortunately it does not:
df.select(explode($"arr")(".*"))
Exception in thread "main" org.apache.spark.sql.AnalysisException: No such struct field .* in col;
A single-step solution is available only for MapType columns:
val df = Seq(Tuple1(Map((1L, "bar"), (2L, "foo")))).toDF
df.select(explode($"_1") as Seq("foo", "bar")).show
+---+---+
|foo|bar|
+---+---+
| 1|bar|
| 2|foo|
+---+---+
With arrays you can use flatMap:
val df = Seq(Tuple1(Array((1L, "bar"), (2L, "foo")))).toDF
df.as[Seq[(Long, String)]].flatMap(identity)
A single SELECT statement can be written in SQL:
df.createOrReplaceTempView("df")
spark.sql("SELECT x._1, x._2 FROM df LATERAL VIEW explode(_1) t AS x")

Convert array of vectors to DenseVector

I am running Spark 2.1 with Scala. I am trying to convert an array of vectors into a DenseVector.
Here is my dataframe:
scala> df_transformed.printSchema()
root
|-- id: long (nullable = true)
|-- vals: vector (nullable = true)
|-- hashValues: array (nullable = true)
| |-- element: vector (containsNull = true)
scala> df_transformed.show()
+------------+--------------------+--------------------+
| id| vals| hashValues|
+------------+--------------------+--------------------+
|401310732094|[-0.37154,-0.1159...|[[-949518.0], [47...|
|292125586474|[-0.30407,0.35437...|[[-764013.0], [31...|
|362051108485|[-0.36748,0.05738...|[[-688834.0], [18...|
|222480119030|[-0.2509,0.55574,...|[[-1167047.0], [2...|
|182270925238|[0.32288,-0.60789...|[[-836660.0], [97...|
+------------+--------------------+--------------------+
For example, I need to extract the value of the hashValues column into a DenseVector for id 401310732094.
This can be done with a UDF:
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf
import spark.implicits._

// concatenate the values of every vector in the array into one DenseVector
val convertToVec = udf((array: Seq[Vector]) =>
  Vectors.dense(array.flatMap(_.toArray).toArray)
)
val df = df_transformed.withColumn("hashValues", convertToVec($"hashValues"))
This will overwrite the hashValues column with a new one containing a DenseVector.
Tested with a dataframe with the following schema:
root
|-- id: integer (nullable = false)
|-- hashValues: array (nullable = true)
| |-- element: vector (containsNull = true)
The result is:
root
|-- id: integer (nullable = false)
|-- hashValues: vector (nullable = true)
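To then pull out the vector for a single id (the 401310732094 mentioned above), a hedged sketch building on the converted df from the snippet above:
import org.apache.spark.ml.linalg.Vector

// filter down to the one matching row and read the vector out of it
val vec: Vector = df.filter($"id" === 401310732094L)
  .select("hashValues")
  .head()
  .getAs[Vector](0)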

How to get the first value of each list?

My DataFrame looks like this:
+------------------------+----------------------------------------+
|ID |probability |
+------------------------+----------------------------------------+
|583190715ccb64f503a|[0.49128147201958017,0.5087185279804199]|
|58326da75fc764ad200|[0.42143416087939345,0.5785658391206066]|
|583270ff17c76455610|[0.3949217100212508,0.6050782899787492] |
|583287c97ec7641b2d4|[0.4965059792664432,0.5034940207335569] |
|5832d7e279c764f52e4|[0.49128147201958017,0.5087185279804199]|
|5832e5023ec76406760|[0.4775830044196701,0.52241699558033] |
|5832f88859cb64960ea|[0.4360509428173421,0.563949057182658] |
|58332e6238c7643e6a7|[0.48730029128352853,0.5126997087164714]|
+------------------------+----------------------------------------+
and I get the probability column using:
val proVal = Data.select("probability").rdd.map(r => r(0)).collect()
proVal.foreach(println)
the result is:
[0.49128147201958017,0.5087185279804199]
[0.42143416087939345,0.5785658391206066]
[0.3949217100212508,0.6050782899787492]
[0.4965059792664432,0.5034940207335569]
[0.49128147201958017,0.5087185279804199]
[0.4775830044196701,0.52241699558033]
[0.4360509428173421,0.563949057182658]
[0.48730029128352853,0.5126997087164714]
but I want to get the first value of each row's list, like this:
0.49128147201958017
0.42143416087939345
0.3949217100212508
0.4965059792664432
0.49128147201958017
0.4775830044196701
0.4360509428173421
0.48730029128352853
How can this be done?
The input is standard random forest output; the Data above comes from val Data = predictions.select("docID", "probability"), where:
predictions.printSchema()
root
|-- docID: string (nullable = true)
|-- label: double (nullable = false)
|-- features: vector (nullable = true)
|-- indexedLabel: double (nullable = true)
|-- rawPrediction: vector (nullable = true)
|-- probability: vector (nullable = true)
|-- prediction: double (nullable = true)
|-- predictedLabel: string (nullable = true)
and I want to get the first value of the "probability" column
You can use the Column.apply method to get the n-th item of an array column - in this case the first element (using index 0):
import sqlContext.implicits._
val proVal = Data.select($"probability"(0)).rdd.map(r => r(0)).collect()
BTW, if you're using Spark 1.6 or higher, you can also use the Dataset API for a cleaner way to convert the dataframe into Doubles:
val proVal = Data.select($"probability"(0)).as[Double].collect()

SparkSQL Timestamp query failure

I put some log files into sql tables through Spark and my schema looks like this:
root
|-- timestamp: timestamp (nullable = true)
|-- c_ip: string (nullable = true)
|-- cs_username: string (nullable = true)
|-- s_ip: string (nullable = true)
|-- s_port: string (nullable = true)
|-- cs_method: string (nullable = true)
|-- cs_uri_stem: string (nullable = true)
|-- cs_query: string (nullable = true)
|-- sc_status: integer (nullable = false)
|-- sc_bytes: integer (nullable = false)
|-- cs_bytes: integer (nullable = false)
|-- time_taken: integer (nullable = false)
|-- User_Agent: string (nullable = true)
|-- Referrer: string (nullable = true)
As you can see, I created a timestamp field, which I read is supported by Spark (Date wouldn't work, as far as I understood). I would love to use it in queries like "where timestamp>(2012-10-08 16:10:36.0)", but when I run it I keep getting errors.
I tried the two following syntax forms:
For the second one I parse a string, so I'm sure I'm actually passing it in a timestamp format.
I use 2 functions: parse and date2timestamp.
Any hint on how I should handle timestamp values?
Thanks!
1)
scala> sqlContext.sql("SELECT * FROM Logs as l where l.timestamp=(2012-10-08 16:10:36.0)").collect
java.lang.RuntimeException: [1.55] failure: ``)'' expected but 16 found
SELECT * FROM Logs as l where l.timestamp=(2012-10-08 16:10:36.0)
^
2)
sqlContext.sql("SELECT * FROM Logs as l where l.timestamp="+date2timestamp(formatTime3.parse("2012-10-08 16:10:36.0"))).collect
java.lang.RuntimeException: [1.54] failure: ``UNION'' expected but 16 found
SELECT * FROM Logs as l where l.timestamp=2012-10-08 16:10:36.0
^
I figured that the problem was, first of all, the precision of the timestamp, and also that the string I pass to represent the timestamp has to be cast as a String.
So this query works now:
sqlContext.sql("SELECT * FROM Logs as l where cast(l.timestampLog as String) <= '2012-10-08 16:10:36'")
You forgot the quotation marks.
Try something with this syntax:
L.timestamp = '2012-07-16 00:00:00'
Alternatively, try
L.timestamp = CAST('2012-07-16 00:00:00' AS TIMESTAMP)
Cast the string representation of the timestamp to a timestamp: cast('2012-10-10 12:00:00' as timestamp). Then you can compare as timestamps, not strings. Instead of:
sqlContext.sql("SELECT * FROM Logs as l where cast(l.timestamp as String) <= '2012-10-08 16:10:36'")
try
sqlContext.sql("SELECT * FROM Logs as l where l.timestamp <= cast('2012-10-08 16:10:36' as timestamp)")
Sadly this didn't work for me. I am using Apache Spark 1.4.1. The following code is my solution:
// build the query with the current time as a java.sql.Timestamp literal
java.util.Date date = new java.util.Date();
String query = "SELECT * FROM Logs as l where l.timestampLog <= CAST('"
    + new java.sql.Timestamp(date.getTime()) + "' as TIMESTAMP)";
sqlContext.sql(query);
Casting the timestampLog as string did not throw any errors but returned no data.