I am trying to transpose a huge DataFrame (100M x 20K). Since the DataFrame is spread over multiple nodes and is difficult to collect on the driver, I would like to do the transpose by converting it to mllib matrices. The idea seems to have been tested elsewhere, so the procedure I opted for was as follows:
import org.apache.spark.sql.functions._
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.parquet("temp/test.parquet").select("H1","H2","H3","H4")
val matrixColumns = df.columns
val rdd = df.select(array(matrixColumns: _*).as("arr")).as[Array[Int]].rdd
  .zipWithIndex()
  .map { case (arr, index) => IndexedRow(index, Vectors.dense(arr.map(_.toDouble))) }
val dm = new IndexedRowMatrix(rdd).toBlockMatrix().toLocalMatrix()
I noticed a possible typo and tried a substitution:
orig:
val rdd = df.select(array(matrixColumns:_*).as("arr"))....
modified:
val rdd = df.select(Array(matrixColumns:_*)).as("arr")...
However, neither works for me, and the above change throws this error:
scala> df.select(Array(matrixColumns:_*)).as("arr")
^
error: overloaded method select with alternatives:
[U1](c1: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U1]): org.apache.spark.sql.Dataset[U1] <and>
(col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
(cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
cannot be applied to (Array[String])
I am unsure if there is a version issue (I am using Spark 3.3.0) or if the problem is elsewhere. I would be grateful for any help in fixing the above error.
Change the select invocation to:
df.select(matrixColumns.head, matrixColumns.tail: _*)
or
import org.apache.spark.sql.functions.col
df.select(matrixColumns.map(col(_)):_*)
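If you want to plug that back into your original pipeline, here is a hedged sketch (it assumes the selected columns are integer-typed; the toDouble conversion and the transpose step are illustrative, not tested against your data):
import org.apache.spark.sql.functions.{array, col}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

val rdd = df.select(array(matrixColumns.map(col): _*).as("arr"))
  .rdd
  .zipWithIndex()
  .map { case (row, index) =>
    // the single column is an array<int>, so pull it out as Seq[Int] and densify it
    IndexedRow(index, Vectors.dense(row.getSeq[Int](0).map(_.toDouble).toArray))
  }

// BlockMatrix supports a distributed transpose before anything is collected to the driver
val transposed = new IndexedRowMatrix(rdd).toBlockMatrix().transpose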
I want to convert an org.apache.spark.sql.DataFrame to org.apache.spark.rdd.RDD[(String, String)] in Databricks. Can anyone help?
Background (and a better solution is also welcome): I have a Kafka stream which (after some steps) becomes a 2 column data frame. I would like to put this into a Redis cache, first column as a key and second column as a value.
More specifically the type of the input is this: lastContacts: org.apache.spark.sql.DataFrame = [serialNumber: string, lastModified: bigint]. I try to put into Redis as follows:
sc.toRedisKV(lastContacts)(redisConfig)
The error message looks like this:
notebook:20: error: type mismatch;
found : org.apache.spark.sql.DataFrame
(which expands to) org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
required: org.apache.spark.rdd.RDD[(String, String)]
sc.toRedisKV(lastContacts)(redisConfig)
I already played around with some ideas (like function .rdd) but none helped.
You can use df.map(row => ...) to convert the DataFrame to an RDD if you want to map each row to a different RDD element.
For example:
val df = Seq(("table1",432),
("table2",567),
("table3",987),
("table1",789)).
toDF("tablename", "Code").toDF()
df.show()
+---------+----+
|tablename|Code|
+---------+----+
| table1| 432|
| table2| 567|
| table3| 987|
| table1| 789|
+---------+----+
val rddDf = df.map(r => (r(0), r(1))).rdd // Type: RDD[(Any, Any)] -- on Spark 2.x this fails because there is no Encoder for Any, so prefer the String form below
OR
val rdd = df.map(r => (r(0).toString, r(1).toString)).rdd // Type: RDD[(String, String)]
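Applied to the lastContacts DataFrame from the question, a sketch could look like this (it assumes lastContacts is a plain batch DataFrame at this point; if it is still a streaming DataFrame, .rdd will fail with the writeStream error discussed below, and it reuses the sc.toRedisKV call from the question):
import org.apache.spark.rdd.RDD

// spark.implicits._ (pre-imported in Databricks notebooks) supplies the tuple Encoder
val kv: RDD[(String, String)] = lastContacts
  .map(row => (row.getAs[String]("serialNumber"), row.getAs[Long]("lastModified").toString))
  .rdd

sc.toRedisKV(kv)(redisConfig)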
Please refer to https://community.hortonworks.com/questions/106500/error-in-spark-streaming-kafka-integration-structu.html regarding AnalysisException: Queries with streaming sources must be executed with writeStream.start()
You need to wait for the termination of the query using query.awaitTermination() to prevent the process from exiting while the query is active.
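A minimal sketch of that pattern (streamingDf and the console sink are illustrative stand-ins; in practice you would plug in your real sink, e.g. a foreach/foreachBatch writer that pushes to Redis):
val query = streamingDf.writeStream
  .format("console")        // placeholder sink for the sketch
  .outputMode("append")
  .start()                  // actually starts the streaming query

query.awaitTermination()    // blocks so the driver does not exit while the query is active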
I am reading a CSV file into dataframe1 and then filtering some columns into dataframe2. While selecting the columns for dataframe2 from dataframe1, I want to apply my function to a column value, like this:
import utilities._
val Logs = sqlContext.read
.format("csv")
.option("header", "true")
.load("dbfs:/mnt/records/Logs/2016.07.17/2016.07.17.{*}.csv")
val Log = Logs.select(
"key1",
utility.stringToGuid("username"),
"key2",
"key3",
"startdatetime",
"enddatetime")
display(Log)
So here I am calling utility.stringToGuid("username"), and it is giving me this error:
notebook:5: error: overloaded method value select with alternatives:
(col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
(cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
So I actually found the answer to my question: I was passing the string "username" to the utility function instead of passing the column value of "username".
So the argument should be utility.stringToGuid($"username"). In Scala, $"..." is used to pass the column value, and in Python col() is used.
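Also note that once one argument to select is a Column, every argument has to be a Column. A hedged version of the original select (assuming utility.stringToGuid takes and returns a Column; the $"..." syntax comes from spark.implicits._, which Databricks notebooks import for you) would look like:
import org.apache.spark.sql.functions.col

val Log = Logs.select(
  col("key1"),
  utility.stringToGuid($"username").as("username"),
  col("key2"),
  col("key3"),
  col("startdatetime"),
  col("enddatetime"))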
I'm learning Spark in Scala coming from heavy Python abuse, and I'm getting a java.lang.NullPointerException because I'm doing things the Python way.
I have, say, 3 dataframes of shape 4x2 each; the first column is always an index 0,1,2,3 and the second column is some binary feature. The end goal is to have a 4x4 dataframe with a join of all the individual ones. In Python I would first define some master df and then loop over the intermediate ones, assigning at each loop the resulting joined dataframe to the master dataframe variable name (ugly):
dataframes = [temp1, temp2, temp3]
df = pd.DataFrame(index=[0,1,2,3]) # Master df
for temp in dataframes:
df = df.join(temp)
In Spark this doesn't play well:
val q = "select * from table"
val df = sql(q) // works, obviously
scala> val df = df.join(sql(q))
<console>:33: error: recursive value df needs type
val df = df.join(sql(q))
Ok so:
scala> val df:org.apache.spark.sql.DataFrame = df.join(sql(q))
java.lang.NullPointerException
... 50 elided
I think it's highly likely that I'm not doing it the functional way. So I tried (ugliest!):
scala> :paste
// Entering paste mode (ctrl-D to finish)
sql(q).
join(sql(q), "device_id").
join(sql(q), "device_id").
join(sql(q), "device_id")
// Exiting paste mode, now interpreting.
res128: org.apache.spark.sql.DataFrame = [device_id: string, devtype: int ... 3 more fields]
This just looks ugly and inelegant and beginner. What would be a proper functional Scala way to achieve this?
foldLeft:
val dataframes: Seq[String] = ???
val df: Dataset[Row] = ???
dataframes.foldLeft(df)((acc, q) => acc.join(sql(q)))
And if you're looking for an imperative equivalent of your Python code:
val dataframes: Seq[String] = ???
var df: Dataset[Row] = ???   // IMPORTANT: var, because it gets reassigned in the loop
for (q <- dataframes) { df = df.join(sql(q)) }
Even simpler,
val dataframes: Seq[String] = ???
dataframes.map(sql).reduce(_ join _)
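And tying it back to the paste-mode example above, a sketch that joins every query result on the shared key (the repeated q and the "device_id" column are just illustrative):
val queries: Seq[String] = Seq(q, q, q, q)   // stand-ins for the real queries
val joined = queries.map(sql).reduce(_.join(_, "device_id"))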
I have a DataFrame that is created as follows:
val df = sc
.textFile("s3n://bucket/key/data.txt")
.map(_.split(","))
.toDF()
This is the content of data.txt:
123,2016-11-09,1
124,2016-11-09,2
123,2016-11-10,1
123,2016-11-11,1
123,2016-11-12,1
124,2016-11-13,1
124,2016-11-14,1
Is it possible to filter df in order to get the sum of 3rd column values for 123 for the last N days starting from now? I am interested in a flexible solution so that N could be defined as a parameter.
For example, if today would be 2016-11-16 and N would be equal to 5, then the sum of 3rd column values for 124 would be equal to 2.
This is my current solution:
val df = sc
.textFile("s3n://bucket/key/data.txt")
.map(_.split(","))
.toDF(["key","date","qty"])
val starting_date = LocalDate.now().minusDays(x_last_days)
df.filter(col("key") === "124")
.filter(to_date(df("date")).gt(starting_date))
.agg(sum(col("qty")))
but it does not seem to work properly. 1. The line where I define the column names ["key","date","qty"] does not compile for Scala 2.10.6 and Spark 1.6.2. 2. It also returns a DataFrame, while I need an Int. Should I just do toString.toInt?
Neither of the following will compile:
scala> val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1")).map(_.split(",")).toDF(["key","date","qty"])
// <console>:1: error: illegal start of simple expression
// val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1")).map(_.split(",")).toDF(["key","date","qty"])
^
scala> val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1")).map(_.split(",")).toDF
// <console>:27: error: value toDF is not a member of org.apache.spark.rdd.RDD[Array[String]]
// val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1")).map(_.split(",")).toDF
^
The first one won't because it is incorrect syntax, and as for the second, like the error says, toDF is not a member of RDD[Array[String]]; in other words, the operation is not supported.
The latter will compile with Spark 2.x, but the solution below would still apply; otherwise you'd end up with a DataFrame that has a single column of type ArrayType.
Now let's solve the issue:
scala> :pa
// Entering paste mode (ctrl-D to finish)
import sqlContext.implicits._ // you don't need to import this in the shell.
val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1"))
.map{ _.split(",") match { case Array(a,b,c) => (a,b,c) }}.toDF("key","date","qty")
// Exiting paste mode, now interpreting.
// df: org.apache.spark.sql.DataFrame = [key: string, date: string, qty: string]
You can apply any filter you want and compute the aggregation needed, e.g :
scala> val df2 = df.filter(col("key") === "124").agg(sum(col("qty")))
// df2: org.apache.spark.sql.DataFrame = [sum(qty): double]
scala> df2.show
// +--------+
// |sum(qty)|
// +--------+
// | 4.0|
// +--------+
PS: The above code has been tested in Spark 1.6.2 and 2.0.0
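For the remaining two points in the question, here is a hedged sketch of the last-N-days filter and of getting a plain number out of the aggregation (it assumes x_last_days is defined, that at least one row matches, and it casts qty because the CSV-derived column is a string):
import java.time.LocalDate
import org.apache.spark.sql.functions.{col, lit, sum, to_date}

val startingDate = LocalDate.now().minusDays(x_last_days)

val total: Long = df
  .filter(col("key") === "124")
  .filter(to_date(col("date")) > lit(startingDate.toString))
  .agg(sum(col("qty").cast("int")))
  .first()
  .getLong(0)   // the sum over an int column comes back as a bigint, i.e. a Long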