I have existing Scala/Spark code, shown below, where I am supposed to replace repartition() with coalesce(). But after changing to coalesce it no longer compiles, reporting a datatype mismatch because it treats the arguments as Columns rather than a partition count.
How can I change the existing code to coalesce (with column names), or is there no way to do it?
As I am new to Scala, any suggestion would help and be appreciated. Do let me know if you need any more details. Thanks!
val accountList = AccountList(MAPR_DB, src_accountList)
  .filterByAccountType("GAMMA")
  .fetchOnlyAccountsToProcess.df
  .repartition($"Account", $"SecurityNo", $"ActivityDate")
val accountNos = broadcast(accountList.select($"AccountNo", $"Account").distinct)
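For reference, the mismatch comes from two unrelated coalesce signatures; a minimal sketch (the partition count 10 is an arbitrary example):

// Dataset.coalesce takes only a target number of partitions; it cannot
// partition by columns, so there is no direct column-based replacement:
val reducedList = accountList.coalesce(10)

// The separate org.apache.spark.sql.functions.coalesce accepts Columns and returns
// the first non-null value per row (a Column), which is the source of the type error:
// accountList.coalesce($"Account", $"SecurityNo", $"ActivityDate") // does not compile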
I need to assign one DF's column values using another DF's column. For this I wrote the below:
DF1.withColumn("hic_num", lit(DF2.select("hic_num")))
And got this error:
SparkRuntimeException: the feature is not supported: literal for [HICN: string] of class org.apache.spark.sql.Dataset.
Please help me with the above.
lit stands for literal and is, as the name suggests, a literal, or a constant. A column is not a constant.
You can do .withColumn("hic_num2", col("hic_num")); you do not have to wrap this in lit.
Also, in your example, you are trying to create a new column called hic_num with the value of hic_num, which does not make sense.
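If the goal is to copy DF2's hic_num values into DF1, the usual route is a join; a minimal sketch, assuming both frames share a key column "id" (the key is an assumption, the question does not name one):

import org.apache.spark.sql.functions.col

// Join on the assumed key and carry hic_num over from DF2
val result = DF1.join(DF2.select(col("id"), col("hic_num")), Seq("id"), "left")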
I'm kind of new to Spark Scala (and to StackOverflow, so it's a pleasure to be part of the community). There's a query that I cannot get right:
I have this DF where, every time column "number" has the value 446118, I want column "type" to change to "P" while all the other values in "type" stay the same, and without using the withColumn function. I know it could be done with the where function, but I don't know how to use where when it's about changing the value of one column depending on another one.
I would be grateful if someone could help me.
[image: the example DataFrame with "number" and "type" columns]
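One way to do this without withColumn is to rebuild the column list inside a select, using when/otherwise; a sketch assuming the frame is named df (the column names are taken from the question):

import org.apache.spark.sql.functions.{col, when}

// Rebuild every column: "type" is conditionally rewritten, the rest pass through unchanged
val updated = df.select(df.columns.map {
  case "type" => when(col("number") === 446118, "P").otherwise(col("type")).as("type")
  case other  => col(other)
}: _*)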
I am currently reading the Spark in Action book, and in it I came across the same columns being selected in different ways:
val postsIdBody = postsDf.select('id, 'body)
val postsIdBody = postsDf.select($"id", $"body")
val postsIdBody = postsDf.select("id", "body")
All of these give similar results. Is there much difference between them? Can anyone clearly explain in which situations we need each of these forms?
Thanks in advance.
I'm sure the book covers this, but: by importing the implicits package in Scala, you can use these symbols to create Column objects without otherwise typing out new Column(name).
You would use Column objects rather than strings because you can do ordering and aliasing more easily within the DataFrame API, as sketched below.
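A small sketch of the difference (assuming a SparkSession named spark for the implicits):

import spark.implicits._                   // enables the $"name" and 'name syntax
import org.apache.spark.sql.functions.col

postsDf.select($"id".as("postId"), 'body)  // aliasing requires a Column
postsDf.orderBy(col("id").desc)            // .desc exists only on Column, not on String

Plain strings are fine for a simple projection like select("id", "body"), but as soon as you need .as, .desc, arithmetic, or comparisons, you need one of the Column-producing forms.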
I have two dataframes; they contain different numbers of columns.
I need to compare three fields between them to check whether those are equal.
I tried the following approach, but it's not working.
if (df_table_stats("rec_cnt").equals(df_aud("REC_CNT")) ||
    df_table_stats("hashcount").equals(df_aud("HASH_CNT")) ||
    round(df_table_stats("hashsum"), 0).equals(round(df_aud("HASH_TTL"), 0))) {
  println("Job executed successfully")
}
df_table_stats("rec_cnt"), this returns Column rather than actual value hence condition becoming false.
Also, please explain difference between df_table_stats.select("rec_cnt") and df_table_stats("rec_cnt").
Thanks.
Use SQL and inner join both DataFrames on your conditions.
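A sketch of that approach (column names are taken from the question; note that the question's || would need to be && if all three fields must match):

import org.apache.spark.sql.functions.round

val matched = df_table_stats.join(df_aud,
  df_table_stats("rec_cnt") === df_aud("REC_CNT") &&
  df_table_stats("hashcount") === df_aud("HASH_CNT") &&
  round(df_table_stats("hashsum"), 0) === round(df_aud("HASH_TTL"), 0))

// A surviving row means the three fields agree
if (matched.count() > 0) println("Job executed successfully")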
Per my comment, the expressions you're using are simple column references; they don't actually return data. (That's also the difference you asked about: df_table_stats.select("rec_cnt") returns a new single-column DataFrame, while df_table_stats("rec_cnt") returns a Column expression; neither holds the value itself.) Assuming you MUST use Spark for this, you'd want a method that actually returns the data, known in Spark as an action. For this case you can use take to return the first Row of data and extract the desired columns:
import org.apache.spark.sql.Row

val tableStatsRow: Row = df_table_stats.take(1).head
val audRow: Row = df_aud.take(1).head
val tableStatsRecCount = tableStatsRow.getAs[Int]("rec_cnt")
val audRecCount = audRow.getAs[Int]("REC_CNT")
// repeat for the other values you need to capture, then compare the primitives:
// if (tableStatsRecCount == audRecCount) ...
However, Spark is definitely overkill if this is all you're using it for. You could use a simple JDBC library for Scala, like ScalikeJDBC, to do these queries and capture the primitives in the results.
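For instance, a ScalikeJDBC sketch (the connection settings, table, and column names here are hypothetical):

import scalikejdbc._

// Hypothetical connection details, replace with your own
ConnectionPool.singleton("jdbc:postgresql://host/db", "user", "pass")
implicit val session: DBSession = AutoSession

// Fetch the counts as plain Scala values and compare them directly
val statsCnt = sql"select rec_cnt from table_stats".map(_.long("rec_cnt")).single.apply()
val audCnt = sql"select rec_cnt from aud".map(_.long("rec_cnt")).single.apply()
if (statsCnt == audCnt) println("Job executed successfully")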
Is it possible, and what would be the most efficient and neat method, to add a column to a DataFrame?
More specifically, the column may serve as row IDs for the existing DataFrame.
In a simplified case, reading from a file and not tokenizing it, I can think of something like the below (in Scala), but it completes with errors (at line 3), and anyway it doesn't look like the best route possible:
var dataDF = sc.textFile("path/file").toDF()
val rowDF = sc.parallelize(1 to dataDF.count().toInt).toDF("ID")
dataDF = dataDF.withColumn("ID", rowDF("ID"))  // line 3: this is where it fails
It's been a while since I posted the question, and it seems that some other people would like to get an answer as well. Below is what I found.
So the original task was to append a column with row identifiers (basically, a sequence 1 to numRows) to any given data frame, so the rows' order/presence can be tracked (e.g. when you sample). This can be achieved by something along these lines:
import org.apache.spark.sql.Row

val delimiter = ","  // any separator not present in the data
sc.textFile(file)
  .zipWithIndex()
  .map { case (d, i) => i.toString + delimiter + d }
  .map(_.split(delimiter))
  .map(s => Row.fromSeq(s.toSeq))
Regarding the general case of appending any column to any data frame:
The "closest" to this functionality in Spark API are withColumn and withColumnRenamed. According to Scala docs, the former Returns a new DataFrame by adding a column. In my opinion, this is a bit confusing and incomplete definition. Both of these functions can operate on this data frame only, i.e. given two data frames df1 and df2 with column col:
val df = df1.withColumn("newCol", df1("col") + 1) // -- OK
val df = df1.withColumn("newCol", df2("col") + 1) // -- FAIL
So unless you can manage to transform a column of an existing dataframe into the shape you need, you can't use withColumn or withColumnRenamed to append arbitrary columns (standalone or from other data frames).
As it was commented above, the workaround may be to use a join. It would be pretty messy, although possible: attaching unique keys like above with zipWithIndex to both data frames or columns might work, as sketched below. Although efficiency is ...
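A sketch of that join workaround (the helper withRowIndex and the frames df1/df2 are made up for illustration; the usingColumns join variant needs Spark 1.6+):

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Append a synthetic key column "_idx" holding each row's index
def withRowIndex(df: DataFrame): DataFrame = {
  val rdd = df.rdd.zipWithIndex().map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
  sqlContext.createDataFrame(rdd, StructType(df.schema.fields :+ StructField("_idx", LongType, false)))
}

// Attach df2's column "col" to df1 by joining on the synthetic key
val appended = withRowIndex(df1)
  .join(withRowIndex(df2).select("_idx", "col"), Seq("_idx"))
  .drop("_idx")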
It's clear that appending a column to a data frame is not an easy piece of functionality for a distributed environment, and there may not be a very efficient, neat method for it at all. But I think it's still very important to have this core functionality available, even with performance warnings.
Not sure if it works in Spark 1.3, but in Spark 1.5 I use withColumn:

import sqlContext.implicits._
import org.apache.spark.sql.functions._

df.withColumn("newName", lit("newValue"))

I use this when I need to use a value that is not related to the existing columns of the dataframe.
This is similar to @NehaM's answer, but simpler.
I took help from the answer above. However, I found it incomplete if we want to change a DataFrame, and the current APIs are a little different in Spark 1.6.
zipWithIndex() returns a tuple of (Row, Long) which contains each row and the corresponding index. We can use them to create a new Row according to our need.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val rdd = df.rdd.zipWithIndex()
  .map(indexedRow => Row.fromSeq(indexedRow._2.toString +: indexedRow._1.toSeq))
val newStructure = StructType(StructField("Row number", StringType, true) +: df.schema.fields)
sqlContext.createDataFrame(rdd, newStructure).show()
I hope this will be helpful.
You can use row_number with a Window function as below to get a distinct id for each row in a dataframe (note that Window.orderBy without a partitionBy moves all rows into a single partition, so this is best kept to small data):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

df.withColumn("ID", row_number() over Window.orderBy("any column name in the dataframe"))

You can also use monotonically_increasing_id for the same; its ids are unique and increasing, but not necessarily consecutive:

df.withColumn("ID", monotonically_increasing_id())
And there are some other ways too.