update dataframe with row_number column - scala

I have a 243 MB dataset. I need to add a row_number column to my DataFrame, and I tried the methods below:
import org.apache.spark.sql.functions._
df.withColumn("Rownumber", monotonically_increasing_id())
The row numbers are correct up to row 248352, but after that they jump to values like 8589934592.
I also tried:
df.registerTempTable("table")
val query = "select *, ROW_NUMBER() OVER (order by Year) as Rownumber from table"
val z = hiveContext.sql(query)
This method gives the correct result, but it takes much longer, so I can't use it.
I have the same problem with df.rdd.zipWithIndex.
What is the best way to solve this in Spark/Scala? I'm using Spark 2.3.0.
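For reference, the SQL above corresponds to the following DataFrame-API sketch (assuming the Year column exists); it has the same single-partition cost as the SQL window, so it is equally slow on large data:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
// A single unpartitioned window: gives consecutive numbers but pulls all rows into one partition.
val w = Window.orderBy("Year")
val withRowNum = df.withColumn("Rownumber", row_number().over(w))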

Related

Spark SQL: Generate a row id column with auto-incrementing CONSECUTIVE integers

I have a Databricks notebook written in Scala, and a DataFrame generated like this:
val df = spark.sql("SELECT ColumnName FROM TableName")
I want to add another column, RowID, that automatically populates the rows with integers. I don't want to use the row_number() function; I need CONSECUTIVE integers starting from 1. Is there any other way?
I checked this answer, but it does not help me generate consecutive integers, and monotonically_increasing_id is not working for me. Is that function valid on Databricks? Do I need to import some modules?
Thanks!
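One common alternative is to go through the RDD API with zipWithIndex and rebuild the DataFrame; a minimal sketch, assuming the spark session and table name from the question and the RowID column name you mention:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val df = spark.sql("SELECT ColumnName FROM TableName")

// Pair each row with a consecutive index, shifted to start at 1.
val indexedRdd = df.rdd.zipWithIndex.map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ (idx + 1))
}
val schemaWithId = StructType(df.schema.fields :+ StructField("RowID", LongType, nullable = false))
val dfWithId = spark.createDataFrame(indexedRdd, schemaWithId)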

How to optimize broadcast join in spark Scala?

I am a new Spark Scala developer and I want to improve my code by using a broadcast join.
As I understand it, a broadcast join can optimize the code when we join a large DataFrame with a small one. That's exactly my case: my first DataFrame (tab1 in my example) contains more than 3 billion rows, and I have to join it with a second one that has only 900 rows.
Here is my SQL query:
SELECT tab1.id1, regexp_extract(tab2.emp_name, ".*?(\\d+)\\)$", 1) AS city,
topo_2g3g.emp_id AS emp_id, tab1.emp_type
FROM table1 tab1
INNER JOIN table2 tab2
ON (tab1.emp_type = tab2.emp_type AND tab1.start = tab2.code)
And here is my attempt to use a broadcast join:
val tab1 = df1.filter(""" id > 100 """).as("tab1")
val tab2 = df2.filter(""" id > 100 """).as("tab2")
val result = tab1.join(
  broadcast(tab2),
  col("tab1.emp_type") === col("tab2.emp_type") && col("tab1.start") === col("tab2.code"),
  "inner")
The problem is that this version is not optimized at all: the result contains ALL the columns of the two tables, while I don't need all of them. I only need three of them plus the last one (the one with the regex on it). It's as if we generate a very big table first and then reduce it to a small one, whereas in SQL we get the small table directly.
So, after this step:
I have to use withColumn to generate the new column (with the regex)
I have to apply a select/filter to keep only the 3 columns that I need, while in SQL I get them immediately (with no extra step).
Can you please help me optimize my code and my query?
Thanks in advance.
You can select only the columns you need before doing the join:
df1.select("col1", "col2").filter(""" id > 100 """).as("table1")

Most efficient way to select and process data from a dataframe

I would like to load and process data from a dataframe in Spark using Scala.
The raw SQL Statement looks like this:
INSERT INTO TABLE_1
(
key_attribute,
attribute_1,
attribute_2
)
SELECT
MIN(TABLE_2.key_attribute),
CURRENT_TIMESTAMP as attribute_1,
'Some_String' as attribute_2
FROM TABLE_2
LEFT OUTER JOIN TABLE_1
ON TABLE_2.key_attribute = TABLE_1.key_attribute
WHERE
TABLE_1.key_attribute IS NULL
AND TABLE_2.key_attribute IS NOT NULL
GROUP BY
attribute_1,
attribute_2,
TABLE_2.key_attribute
What I've done so far:
I created a DataFrame from the Select Statement and joined it with the TABLE_2 DataFrame.
val table_1 = spark.sql("SELECT key_attribute, current_timestamp() AS attribute_1, 'Some_String' AS attribute_2 FROM TABLE_2")
table_2.join(table_1, Seq("key_attribute"), "left_outer")
Not much progress so far, because I face too many difficulties:
How do I handle the SELECT and process the data efficiently? Should I keep everything in separate DataFrames?
How do I apply the WHERE/GROUP BY clauses with attributes from several sources?
Is there any other/better way than Spark SQL?
A few steps in handling this are:
First, create the DataFrame from your raw data.
Then register it as a temp table.
Use filter() (or a WHERE condition in Spark SQL) to get the resulting DataFrame.
Then, as you already did, you can make use of joins between DataFrames. You can think of a DataFrame as a representation of a table.
Regarding efficiency: since the processing is done in parallel, that is taken care of. If you want anything more specific regarding efficiency, please mention it.
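A minimal sketch of those steps for the SQL in the question, assuming TABLE_1 and TABLE_2 are visible to a SparkSession named spark; the left_anti join stands in for the LEFT OUTER JOIN ... IS NULL pattern:
import org.apache.spark.sql.functions.{col, current_timestamp, lit}

// 1. Create DataFrames from the raw data.
val table1Df = spark.table("TABLE_1")
val table2Df = spark.table("TABLE_2")

// 2. Optionally register a temp view if you prefer to express the filter in SQL.
table2Df.createOrReplaceTempView("table_2_view")

// 3. Filter, then 4. join: keep TABLE_2 keys that have no match in TABLE_1.
val newRows = table2Df
  .filter(col("key_attribute").isNotNull)
  .join(table1Df, Seq("key_attribute"), "left_anti")
  .select(
    col("key_attribute"),
    current_timestamp().as("attribute_1"),
    lit("Some_String").as("attribute_2"))
  .distinct()

// Equivalent of the INSERT INTO.
newRows.write.mode("append").insertInto("TABLE_1")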

How to use CROSS JOIN and CROSS APPLY in Spark SQL

I am very new to Spark and Scala, and I am writing Spark SQL code. I am in a situation where I have to apply a CROSS JOIN and a CROSS APPLY in my logic. Here is the SQL query that I have to convert to Spark SQL.
select Table1.Column1,Table2.Column2,Table3.Column3
from Table1 CROSS JOIN Table2 CROSS APPLY Table3
I need to convert the above query to run through SQLContext in Spark SQL. Kindly help me. Thanks in advance.
First set the property below in the Spark conf:
spark.sql.crossJoin.enabled=true
Then dataFrame1.join(dataFrame2) will do a cross/Cartesian join.
We can also use the query below to do the same:
sqlContext.sql("select * from table1 CROSS JOIN table2 CROSS JOIN table3...")
Set the Spark configuration:
import org.apache.spark.SparkConf
val sparkConf = new SparkConf()
  .set("spark.sql.crossJoin.enabled", "true")
Explicit cross join in Spark 2.x using the crossJoin method:
crossJoin(right: Dataset[_]): DataFrame
val df_new = df1.crossJoin(df2)
Note: cross joins are among the most time-consuming joins and should often be avoided.
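For the three-table query from the question, a minimal sketch that treats the CROSS APPLY as another cross join (as in the SQL above), assuming a Spark 2.x SparkSession named spark and that Table1/Table2/Table3 are registered as temp views:
// Enable cartesian products for this session.
spark.conf.set("spark.sql.crossJoin.enabled", "true")

val result = spark.sql("""
  SELECT t1.Column1, t2.Column2, t3.Column3
  FROM Table1 t1
  CROSS JOIN Table2 t2
  CROSS JOIN Table3 t3
""")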

Sort in descending order, using hive table in spark scala

I have a Hive table with account numbers and their most recent update dates. Not every account is updated each day, so I can't simply select all records from a certain day. I need to group by account number and then sort in descending order to take the most recent 2 days for each account. My script so far:
sc.setLogLevel("ERROR")
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import org.apache.spark.sql.functions._
import sqlContext.implicits._
val df1 = sqlContext.sql("FROM mydb.mytable SELECT account_num, last_updated")
val DFGrouped = df1.groupBy("account_num").orderBy(desc("data_dt"))
I'm getting an error on the orderBy:
value orderBy is not a member of org.apache.spark.sql.GroupedData
Any idea on what I should be doing here?
Grouping will not work here because this is a form of the top N by group problem.
You need to use Spark SQL window functions, in particular, rank() with partition by account ID and order by date descending, followed by selecting the rows with rank <= 2.
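A minimal sketch of that approach, assuming the account_num and last_updated columns from the question:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}

// Rank rows within each account by date, newest first, and keep the top 2.
val w = Window.partitionBy("account_num").orderBy(col("last_updated").desc)

val latestTwo = df1
  .withColumn("rk", rank().over(w))
  .filter(col("rk") <= 2)
  .drop("rk")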