I have a scenario where I join two DataFrames and then want to calculate a value from two of their columns. The logic currently lives in SQL and I want to convert it to Spark DataFrames:
bml.Usage*COALESCE(u.ConValue,1) as acUsage, where bml is a table inner joined with the 'u' table. With DataFrames I have bml.join(u, Seq("id"), "inner").select(COALESCE ???). How can I perform this operation, either directly or through a UDF?
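For reference, a minimal sketch of the coalesce approach in the DataFrame API (assuming DataFrames named bml and u with the columns mentioned above; coalesce and lit come from org.apache.spark.sql.functions, so no UDF is needed):

import org.apache.spark.sql.functions.{coalesce, lit}

// inner join on "id", then multiply Usage by ConValue, falling back to 1 when ConValue is null
val result = bml.join(u, Seq("id"), "inner")
  .select((bml("Usage") * coalesce(u("ConValue"), lit(1))).as("acUsage"))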
When df1 and df2 have the same rows and
df1 and df2 have no duplicated values,
what is the complexity of the join operation df1.join(df2)?
My guess is that it takes O(n^2).
Is it possible to sort both data frames to get better performance?
If not, what is the way to make a join faster in PySpark?
Even if df1 and df2 have the same set of rows, if they are not partitioned, Spark has to partition both data frames on the join key in order to join them. From Spark 2.3 onwards, the sort-merge join is the default join workhorse, which requires both data frames to be partitioned and sorted by the join key before the join is performed. Both data frames also have to be colocated for a sort-merge join.
Is it possible to sort both data frames to get better performance? If not, what is the way to make a join faster in PySpark?
Yes, if you see that a particular data frame is used again and again in joins on the same join key, then you can repartition the data frame on the join key and cache it for further use. Please refer to the link below for more details:
https://deepsense.ai/optimize-spark-with-distribute-by-and-cluster-by/
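For example, a rough sketch of that idea (df1, df2, df3 and the join key "id" are just placeholders):

import org.apache.spark.sql.functions.col

// repartition on the join key once and cache, so repeated joins reuse the same partitioning
val df1Partitioned = df1.repartition(col("id")).cache()
val joined1 = df1Partitioned.join(df2, Seq("id"))
val joined2 = df1Partitioned.join(df3, Seq("id"))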
I need to update a Hive table like this:
update A from B
set
Col5 = A.Col2,
Col2 = B.Col2,
DT_Change = B.DT,
Col3 = B.Col3,
Col4 = B.Col4
where A.Col1 = B.Col1 and A.Col2 <> B.Col2
Using Scala Spark RDD
How can I do this ?
I want to split this question into two questions to explain it simply.
First question : How to write Spark RDD data to Hive table ?
The simplest way is to convert the RDD into a DataFrame using rdd.toDF(). Then register the DataFrame as a temp table using df.registerTempTable("temp_table"). Now you can query the temp table and insert into the Hive table using sqlContext.sql("insert into table my_table select * from temp_table").
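A minimal sketch of those steps, assuming a SQLContext named sqlContext, an existing Hive table my_table, and an RDD of case-class records (the names here are placeholders):

case class Record(id: Int, name: String)

import sqlContext.implicits._

val rdd = sc.parallelize(Seq(Record(1, "a"), Record(2, "b")))
val df = rdd.toDF()                                    // RDD -> DataFrame
df.registerTempTable("temp_table")                     // register as a temp table
sqlContext.sql("insert into table my_table select * from temp_table")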
Second question: How to update Hive table from Spark ?
As of now, Hive is not the best fit for record-level updates. Updates can only be performed on tables that support ACID, and one primary limitation is that only the ORC format supports updatable Hive tables. You can find more information at https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
You can refer to How to Update an ORC Hive table from Spark using Scala for this.
A few methods might have been deprecated in Spark 2.x, so check the Spark 2.0 documentation for the latest methods.
While there could be better approaches, this is the simplest approach that I can think of which works.
I would like to load and process data from a dataframe in Spark using Scala.
The raw SQL Statement looks like this:
INSERT INTO TABLE_1
(
key_attribute,
attribute_1,
attribute_2
)
SELECT
MIN(TABLE_2.key_attribute),
CURRENT_TIMESTAMP as attribute_1,
'Some_String' as attribute_2
FROM TABLE_2
LEFT OUTER JOIN TABLE_1
ON TABLE_2.key_attribute = TABLE_1.key_attribute
WHERE
TABLE_1.key_attribute IS NULL
AND TABLE_2.key_attribute IS NOT NULL
GROUP BY
attribute_1,
attribute_2,
TABLE_2.key_attribute
What I've done so far:
I created a DataFrame from the Select Statement and joined it with the TABLE_2 DataFrame.
val table_1 = spark.sql("Select key_attribute, current_timestamp() as attribute_1, 'Some_String' as attribute_2").toDF();
table_2.join(table_1, Seq("key_attribute"), "left_outer");
Not much progress so far, because I face too many difficulties:
How do I handle the SELECT and process the data efficiently? Keep everything in separate DataFrames?
How do I insert the WHERE/GROUP BY clause with attributes from several sources?
Is there any other/better way except Spark SQL?
A few steps in handling this are:
First, create the dataframe with your raw data.
Then save it as a temp table.
You can use filter() or a where condition in Spark SQL and get the resultant dataframe.
Then, as you already did, you can make use of joins with dataframes. You can think of a dataframe as a representation of a table. A sketch of these steps is shown below.
Regarding efficiency, since the processing is done in parallel, it is taken care of. If you need anything more regarding efficiency, please mention it.
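As a hedged sketch of those steps applied to the raw SQL in the question (table_1 and table_2 are assumed to be the already-loaded DataFrames, and spark is a SparkSession):

table_1.createOrReplaceTempView("table_1_view")
table_2.createOrReplaceTempView("table_2_view")

// select the new keys via the left anti pattern from the question, then insert into the target table
val newRows = spark.sql("""
  SELECT MIN(t2.key_attribute) AS key_attribute,
         current_timestamp()   AS attribute_1,
         'Some_String'         AS attribute_2
  FROM table_2_view t2
  LEFT OUTER JOIN table_1_view t1
    ON t2.key_attribute = t1.key_attribute
  WHERE t1.key_attribute IS NULL
    AND t2.key_attribute IS NOT NULL
  GROUP BY t2.key_attribute
""")

newRows.write.insertInto("TABLE_1")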
I am implementing a project where MySQL data is imported to HDFS using Sqoop. It has nearly 30 tables. I am reading each table as a dataframe by inferring the schema and registering it as a temp table. I have a few questions about doing this...
1. There are several joins that need to be implemented for the tables, say df1 to df10. In MySQL the query would be
select a.id, b.name, c.AccountName from accounts a, priority b, bills c where a.id = b.id and c.name = a.name
Instead of using
sqlContext.sql("select a.id, b.name, c.AccountName from accounts a, priority b, bills c where a.id = b.id and c.name = a.name")
is there another way to join all the data frames effectively based on conditions?
2. Is it the correct way to convert the tables to data frames and query on top of them, or is there a better way to approach this type of joining and querying in Spark?
I had a similar problem and I ended up using:
val df_list = ListBuffer[DataFrame]()   // scala.collection.mutable.ListBuffer; append your DataFrames here first
df_list.toList.reduce((a, b) => a.join(b, a.col(a.schema.head.name) === b.col(b.schema.head.name), "left_outer"))
You could also build a free-form SQL statement in Sqoop and join everything there, or use Spark JDBC to do the same job.
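Alternatively, a hedged sketch of chaining the joins explicitly with the DataFrame API, using the conditions from the question (accounts, priority and bills are the assumed DataFrames registered from the imported tables):

val joined = accounts
  .join(priority, accounts("id") === priority("id"))
  .join(bills, bills("name") === accounts("name"))
  .select(accounts("id"), priority("name"), bills("AccountName"))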
Dataframe A (millions of records): among its columns are create_date and modified_date.
Dataframe B (500 records) has start_date and end_date.
Current approach:
Select a.*,b.* from a join b on a.create_date between start_date and end_date
The above query performs a Cartesian product join in Spark SQL and it takes forever to complete.
Can I achieve the same functionality by some other means?
I tried broadcasting the smaller RDD.
EDIT:
spark version 1.4.1
No. of executors 2
Memory/executor 5g
No. of cores 5
You cannot avoid the Cartesian product, as Spark SQL does not support non-equi joins.
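If upgrading is an option, here is a hedged sketch of the broadcast idea mentioned in the question (functions.broadcast is available from Spark 1.5 onwards; dfA and dfB are placeholder names for the two DataFrames). The 500-row side is shipped to every executor, so the range condition is evaluated as a broadcast nested-loop join locally rather than through a shuffled Cartesian product; the pairwise comparison still happens, but the shuffle does not.

import org.apache.spark.sql.functions.broadcast

// dfB (500 rows) is broadcast; the between condition is evaluated on each executor
val joined = dfA.join(broadcast(dfB), dfA("create_date").between(dfB("start_date"), dfB("end_date")))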