I can print the dataframe fine before aggregation:
(Pdb) df_interesting.printSchema()
root
|-- userId: long (nullable = true)
|-- screen_index: integer (nullable = true)
|-- type: string (nullable = true)
|-- time_delta: float (nullable = true)
|-- app_open_index: integer (nullable = true)
|-- timestamp: timestamp (nullable = true)
(Pdb) df_interesting.show(n=2)
+------+------------+------+----------+--------------+--------------------+
|userId|screen_index| type|time_delta|app_open_index| timestamp|
+------+------------+------+----------+--------------+--------------------+
|214431| 7|screen| 60.0| 13|2020-07-31 07:52:...|
|398910| 3|screen| 60.0| 2|2020-07-29 11:43:...|
+------+------------+------+----------+--------------+--------------------+
However, after aggregation, show() results in an error:
(Pdb) df_interesting.groupBy('app_open_index').agg(F.max("screen_index").alias("max_screen_index")).show(n=2)
[Stage 1:> (0 + 2) / 2]20/08/13 18:07:26 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.IllegalArgumentException: The value (Buffer()) of the type (scala.collection.convert.Wrappers.JListWrapper) cannot be converted to the string type
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:290)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:285)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:248)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:238)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
Edit: I tried to single out the column, and here's some progress:
(Pdb) df_interesting = df_interesting.select(col('data.userId').alias('userId'))
(Pdb) df_interesting.count()
[Stage 0:> (0 + 2) / 2]20/08/13 18:59:12 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
org.elasticsearch.hadoop.rest.EsHadoopParsingException: org.elasticsearch.hadoop.EsHadoopIllegalStateException: Field 'data.properties.priceObj' not found; typically this occurs with arrays which are not mapped as single value
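The EsHadoopParsingException points at the Elasticsearch connector rather than Spark itself: elasticsearch-hadoop cannot tell from the index mapping that data.properties.priceObj is an array. A hedged sketch of the usual workaround (Scala API shown; the same option applies from PySpark, and the index name here is hypothetical):
val dfFixed = spark.read
  .format("org.elasticsearch.spark.sql")
  // declare which fields the connector should treat as arrays
  .option("es.read.field.as.array.include", "data.properties.priceObj")
  .load("my-index")  // hypothetical index name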
I am writing a Spark dataframe, where one of the columns is of Vector type, as ORC. When I load the dataframe back, the schema changes.
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SaveMode}

val df: DataFrame = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")
df.printSchema
df.write.mode(SaveMode.Overwrite).orc("/some/path")
val newDF = spark.read.orc("/some/path")
newDF.printSchema
The output of df.printSchema is
|-- label: double (nullable = false)
|-- features: vector (nullable = true)
The output of newDF.printSchema is
|-- label: double (nullable = true)
|-- features: struct (nullable = true)
| |-- type: byte (nullable = true)
| |-- size: integer (nullable = true)
| |-- indices: array (nullable = true)
| | |-- element: integer (containsNull = true)
| |-- values: array (nullable = true)
| | |-- element: double (containsNull = true)
What is the issue here? I am using Spark 2.2.0 with Scala 2.11.8.
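The likely cause: ORC does not preserve Spark's user-defined-type metadata, so the VectorUDT is written as its underlying struct and read back as such. A hedged workaround sketch: round-trip through Parquet instead, since Parquet stores the Spark schema (including UDTs) in its file metadata:
df.write.mode(SaveMode.Overwrite).parquet("/some/path")
val newDF = spark.read.parquet("/some/path")
newDF.printSchema  // features comes back as vector, not struct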
I'd like to explode an array of structs to columns (as defined by the struct fields). E.g.
root
|-- arr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = false)
| | |-- name: string (nullable = true)
Should be transformed to
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
I can achieve this with
df
.select(explode($"arr").as("tmp"))
.select($"tmp.*")
How can I do that in a single select statement?
I thought this could work; unfortunately, it does not:
df.select(explode($"arr")(".*"))
Exception in thread "main" org.apache.spark.sql.AnalysisException: No
such struct field .* in col;
A single-step solution is available only for MapType columns:
val df = Seq(Tuple1(Map((1L, "bar"), (2L, "foo")))).toDF
df.select(explode($"_1") as Seq("foo", "bar")).show
+---+---+
|foo|bar|
+---+---+
| 1|bar|
| 2|foo|
+---+---+
With arrays you can use flatMap:
val df = Seq(Tuple1(Array((1L, "bar"), (2L, "foo")))).toDF
df.as[Seq[(Long, String)]].flatMap(identity)
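The flatMap result is a Dataset[(Long, String)] with columns _1 and _2; a short illustrative follow-up restores the original field names:
df.as[Seq[(Long, String)]].flatMap(identity).toDF("id", "name")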
A single SELECT statement can be written in SQL:
df.createOrReplaceTempView("df")
spark.sql("SELECT x._1, x._2 FROM df LATERAL VIEW explode(_1) t AS x")
I am running Spark 2.1 with Scala. I am trying to convert an array of vectors into a DenseVector.
Here is my dataframe:
scala> df_transformed.printSchema()
root
|-- id: long (nullable = true)
|-- vals: vector (nullable = true)
|-- hashValues: array (nullable = true)
| |-- element: vector (containsNull = true)
scala> df_transformed.show()
+------------+--------------------+--------------------+
| id| vals| hashValues|
+------------+--------------------+--------------------+
|401310732094|[-0.37154,-0.1159...|[[-949518.0], [47...|
|292125586474|[-0.30407,0.35437...|[[-764013.0], [31...|
|362051108485|[-0.36748,0.05738...|[[-688834.0], [18...|
|222480119030|[-0.2509,0.55574,...|[[-1167047.0], [2...|
|182270925238|[0.32288,-0.60789...|[[-836660.0], [97...|
+------------+--------------------+--------------------+
For example, I need to extract the value of the hashValues column into a DenseVector for id 401310732094.
This can be done with a UDF:
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf
import spark.implicits._

val convertToVec = udf((array: Seq[Vector]) =>
  Vectors.dense(array.flatMap(_.toArray).toArray)
)
val df = df_transformed.withColumn("hashValues", convertToVec($"hashValues"))
This will overwrite the hashValues column with a new one containing a DenseVector.
Tested with a dataframe with following schema:
root
|-- id: integer (nullable = false)
|-- hashValues: array (nullable = true)
| |-- element: vector (containsNull = true)
The result is:
root
|-- id: integer (nullable = false)
|-- hashValues: vector (nullable = true)
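A hedged quick check of the UDF on a tiny hand-built dataframe (the data here is illustrative):
val demo = Seq((1, Array(Vectors.dense(1.0), Vectors.dense(2.0, 3.0)))).toDF("id", "hashValues")
demo.withColumn("hashValues", convertToVec($"hashValues")).show(false)
// +---+-------------+
// |id |hashValues   |
// +---+-------------+
// |1  |[1.0,2.0,3.0]|
// +---+-------------+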
My DataFrame looks like this:
+------------------------+----------------------------------------+
|ID |probability |
+------------------------+----------------------------------------+
|583190715ccb64f503a|[0.49128147201958017,0.5087185279804199]|
|58326da75fc764ad200|[0.42143416087939345,0.5785658391206066]|
|583270ff17c76455610|[0.3949217100212508,0.6050782899787492] |
|583287c97ec7641b2d4|[0.4965059792664432,0.5034940207335569] |
|5832d7e279c764f52e4|[0.49128147201958017,0.5087185279804199]|
|5832e5023ec76406760|[0.4775830044196701,0.52241699558033] |
|5832f88859cb64960ea|[0.4360509428173421,0.563949057182658] |
|58332e6238c7643e6a7|[0.48730029128352853,0.5126997087164714]|
+------------------------+----------------------------------------+
I get the probability column using:
val proVal = Data.select("probability").rdd.map(r => r(0)).collect()
proVal.foreach(println)
The result is:
[0.49128147201958017,0.5087185279804199]
[0.42143416087939345,0.5785658391206066]
[0.3949217100212508,0.6050782899787492]
[0.4965059792664432,0.5034940207335569]
[0.49128147201958017,0.5087185279804199]
[0.4775830044196701,0.52241699558033]
[0.4360509428173421,0.563949057182658]
[0.48730029128352853,0.5126997087164714]
But I want to get only the first value of each row, like this:
0.49128147201958017
0.42143416087939345
0.3949217100212508
0.4965059792664432
0.49128147201958017
0.4775830044196701
0.4360509428173421
0.48730029128352853
How can this be done?
The input is standard random forest output; Data above is defined as val Data = predictions.select("docID", "probability"), and the full schema is:
predictions.printSchema()
root
|-- docID: string (nullable = true)
|-- label: double (nullable = false)
|-- features: vector (nullable = true)
|-- indexedLabel: double (nullable = true)
|-- rawPrediction: vector (nullable = true)
|-- probability: vector (nullable = true)
|-- prediction: double (nullable = true)
|-- predictedLabel: string (nullable = true)
I want to get the first value of the "probability" column.
You can use the Column.apply method to get the n-th item of an array column; in this case, the first element (index 0):
import sqlContext.implicits._
val proVal = Data.select($"probability"(0)).rdd.map(r => r(0)).collect()
BTW, if you're using Spark 1.6 or higher, you can also use the Dataset API for a cleaner way to convert the dataframe into a Dataset of Doubles:
val proVal = Data.select($"probability"(0)).as[Double].collect()
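One hedged caveat: probability here is an ML vector (VectorUDT) rather than a SQL array, and on some Spark versions $"probability"(0) will not resolve against a vector column. If it doesn't, a small UDF (a sketch using the spark.ml vector type) does the extraction:
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf

val firstProb = udf((v: Vector) => v(0))  // element 0 of each probability vector
val proVal = Data.select(firstProb($"probability")).as[Double].collect()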
I put some log files into SQL tables through Spark, and my schema looks like this:
|-- timestamp: timestamp (nullable = true)
|-- c_ip: string (nullable = true)
|-- cs_username: string (nullable = true)
|-- s_ip: string (nullable = true)
|-- s_port: string (nullable = true)
|-- cs_method: string (nullable = true)
|-- cs_uri_stem: string (nullable = true)
|-- cs_query: string (nullable = true)
|-- sc_status: integer (nullable = false)
|-- sc_bytes: integer (nullable = false)
|-- cs_bytes: integer (nullable = false)
|-- time_taken: integer (nullable = false)
|-- User_Agent: string (nullable = true)
|-- Referrer: string (nullable = true)
As you can see, I created a timestamp field, which I read is supported by Spark (Date wouldn't work, as far as I understood). I would love to use it for queries like "where timestamp > (2012-10-08 16:10:36.0)", but when I run them I keep getting errors.
I tried the two following syntax forms. For the second one, I parse a string, so I'm sure I'm actually passing it in a timestamp format; I use two functions: parse and date2timestamp.
1)
scala> sqlContext.sql("SELECT * FROM Logs as l where l.timestamp=(2012-10-08 16:10:36.0)").collect
java.lang.RuntimeException: [1.55] failure: ``)'' expected but 16 found
SELECT * FROM Logs as l where l.timestamp=(2012-10-08 16:10:36.0)
^
2)
sqlContext.sql("SELECT * FROM Logs as l where l.timestamp="+date2timestamp(formatTime3.parse("2012-10-08 16:10:36.0"))).collect
java.lang.RuntimeException: [1.54] failure: ``UNION'' expected but 16 found
SELECT * FROM Logs as l where l.timestamp=2012-10-08 16:10:36.0
^
Any hint on how I should handle timestamp values?
Thanks!
I figured out that the problem was, first of all, the precision of the timestamp, and also that the timestamp column has to be cast as a String for the comparison.
So this query works now:
sqlContext.sql("SELECT * FROM Logs as l where cast(l.timestampLog as String) <= '2012-10-08 16:10:36'")
You forgot the quotation marks.
Try something with this syntax:
L.timestamp = '2012-07-16 00:00:00'
Alternatively, try
L.timestamp = CAST('2012-07-16 00:00:00' AS TIMESTAMP)
Cast the string representation of the timestamp to a timestamp, e.g. cast('2012-10-10 12:00:00' as timestamp). Then you can do the comparison as timestamps, not strings. Instead of:
sqlContext.sql("SELECT * FROM Logs as l where cast(l.timestamp as String) <= '2012-10-08 16:10:36'")
try
sqlContext.sql("SELECT * FROM Logs as l where l.timestamp <= cast('2012-10-08 16:10:36' as timestamp)")
Sadly this didn't work for me. I am using Apache Spark 1.4.1. The following code is my solution:
import java.util.Date;

Date date = new Date();  // current time, converted to an SQL timestamp literal below
String query = "SELECT * FROM Logs as l where l.timestampLog <= CAST('" + new java.sql.Timestamp(date.getTime()) + "' as TIMESTAMP)";
sqlContext.sql(query);
Casting the timestampLog as string did not throw any errors but returned no data.
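For what it's worth, a minimal sketch of the same filter through the DataFrame API in newer Spark versions (assuming a logs DataFrame with a timestamp column; the names are illustrative):
import org.apache.spark.sql.functions.{col, lit, to_timestamp}

// to_timestamp parses the string literal, so the comparison happens on timestamps
val filtered = logs.filter(col("timestamp") <= to_timestamp(lit("2012-10-08 16:10:36")))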