Is there a random number generator in Spark SQL?
For example:
Netezza: sequence number
MySQL: sequence number
Thanks.
A sequence-like function is coming in Spark 1.6: select monotonically_increasing_id() from table. Spark 1.6 is due to be released.
Spark SQL already has random functions; there is a blog post about them.
Alternatively, for numbering rows, Spark SQL also has the row_number() function.
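For reference, here is a minimal Scala sketch of those functions (assuming Spark 1.6+ and a hypothetical table my_table):

import org.apache.spark.sql.functions.{monotonically_increasing_id, rand, row_number}
import org.apache.spark.sql.expressions.Window

val df = sqlContext.sql("SELECT * FROM my_table")

// rand() produces a random double in [0, 1) for every row
val withRandom = df.withColumn("rnd", rand())

// monotonically_increasing_id() gives unique, increasing (but not consecutive) ids
val withId = withRandom.withColumn("id", monotonically_increasing_id())

// row_number() numbers rows 1, 2, 3, ... over a window (single partition here, so use with care on large data)
val withRowNum = withId.withColumn("row_num", row_number().over(Window.orderBy("rnd")))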
I have a Databricks notebook written in Scala, and I have this dataframe generated like this:
val df = spark.sql("SELECT ColumnName FROM TableName")
I want to add another column RowID that will automatically populate the rows with integers. I don't want to use the row_number() function. I need CONSECUTIVE integers starting from 1. Is there any other way?
I checked this answer, but it does not help me generate consecutive integers, and monotonically_increasing_id is not working for me. Is this function valid in Databricks? Do I need to import some modules?
Thanks!
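One workaround worth sketching here (an assumption on my part, not necessarily the thread's accepted answer) is to drop to the underlying RDD and use zipWithIndex, which does produce consecutive indexes, then rebuild the DataFrame with the extra column:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val df = spark.sql("SELECT ColumnName FROM TableName")

// zipWithIndex assigns consecutive 0-based indexes across partitions; shift by 1 to start at 1
val rowsWithId = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ (idx + 1)) }

// extend the original schema with the new RowID column
val schemaWithId = StructType(df.schema.fields :+ StructField("RowID", LongType, nullable = false))
val dfWithRowId = spark.createDataFrame(rowsWithId, schemaWithId)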
I need to update a Hive table, like this:
update A from B
set
Col5 = A.Col2,
Col2 = B.Col2,
DT_Change = B.DT,
Col3 = B.Col3,
Col4 = B.Col4
where A.Col1 = B.Col1 and A.Col2 <> B.Col2
using Scala Spark RDDs.
How can I do this?
I want to split this question into two parts to explain it simply.
First question: How to write Spark RDD data to a Hive table?
The simplest way is to convert the RDD into a DataFrame using rdd.toDF(), then register the DataFrame as a temporary table with df.registerTempTable("temp_table"). You can then query the temp table and insert into the Hive table using sqlContext.sql("insert into table my_table select * from temp_table").
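A minimal sketch of that flow in Scala (assuming sqlContext is a HiveContext, a hypothetical case class Record, and an existing Hive table my_table with a matching schema):

case class Record(id: Int, name: String)

import sqlContext.implicits._  // brings in rdd.toDF()

val rdd = sc.parallelize(Seq(Record(1, "a"), Record(2, "b")))

// RDD -> DataFrame -> temporary table
val df = rdd.toDF()
df.registerTempTable("temp_table")

// insert the temp table's rows into the Hive table
sqlContext.sql("insert into table my_table select * from temp_table")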
Second question: How to update a Hive table from Spark?
As of now, Hive is not a good fit for record-level updates. Updates can only be performed on tables that support ACID, and a key limitation is that only the ORC format supports transactional Hive tables. You can find more information at https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
You can refer to How to Updata an ORC Hive table form Spark using Scala for this.
A few methods may have been deprecated in Spark 2.x, so check the Spark 2.0 documentation for the latest methods.
While there may be better approaches, this is the simplest one I can think of that works.
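Since record-level UPDATE is off the table without ACID/ORC, the usual pattern behind that linked answer is to recompute the rows and rewrite the table. A rough Scala sketch of the update logic from the question (assuming Hive tables A and B are loaded as DataFrames, that A has only the columns shown, and writing to a hypothetical staging table A_updated):

import org.apache.spark.sql.functions.{col, when}

val a = sqlContext.table("A").alias("a")
val b = sqlContext.table("B").alias("b")

// left join keeps every row of A; the condition marks which rows should be "updated"
val joined = a.join(b, col("a.Col1") === col("b.Col1") && col("a.Col2").notEqual(col("b.Col2")), "left")
val hit = col("b.Col1").isNotNull

val updated = joined.select(
  col("a.Col1").alias("Col1"),
  when(hit, col("b.Col2")).otherwise(col("a.Col2")).alias("Col2"),
  when(hit, col("b.Col3")).otherwise(col("a.Col3")).alias("Col3"),
  when(hit, col("b.Col4")).otherwise(col("a.Col4")).alias("Col4"),
  when(hit, col("a.Col2")).otherwise(col("a.Col5")).alias("Col5"),      // Col5 keeps the old A.Col2
  when(hit, col("b.DT")).otherwise(col("a.DT_Change")).alias("DT_Change")
)

// rewrite the result; overwriting the table you are reading from in the same job will fail
updated.write.mode("overwrite").saveAsTable("A_updated")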
I am trying to pivot a column that has more than 10000 distinct values. The default limit in Spark for the maximum number of distinct values is 10000, and I am receiving this error:
The pivot column COLUMN_NUM_2 has more than 10000 distinct values, this could indicate an error. If this was intended, set spark.sql.pivotMaxValues to at least the number of distinct values of the pivot column
How do I set this in PySpark?
You have to add/set this parameter in the Spark interpreter.
I work with Zeppelin notebooks on an AWS EMR cluster, had the same error message as you, and it worked after I added the parameter in the interpreter settings.
Hope this helps...
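Outside of Zeppelin, the same property can be set on the session or context before pivoting. A Scala sketch (the PySpark calls are spelled the same way; 20000 and the column names are placeholder values):

// Spark 2.x: set it on the SparkSession, or pass --conf spark.sql.pivotMaxValues=20000 to spark-submit
spark.conf.set("spark.sql.pivotMaxValues", "20000")

// Spark 1.6.x: set it on the SQLContext
sqlContext.setConf("spark.sql.pivotMaxValues", "20000")

// then pivot as usual
val pivoted = df.groupBy("COLUMN_NUM_1").pivot("COLUMN_NUM_2").count()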
I have a scenario where I join two dataframes and then want to calculate usage from two columns. The logic currently exists in SQL, and I want to convert it to Spark DataFrames:
bml.Usage * COALESCE(u.ConValue, 1) as acUsage, where bml is a table inner-joined with the u table. With DataFrames I have bml.join(u, Seq("id"), "inner").select(COALESCE ???). How do I perform this operation, either directly or through a UDF?
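A sketch of that expression with the DataFrame API, using coalesce and lit from org.apache.spark.sql.functions (bml and u are assumed to already be DataFrames joined on a hypothetical id key, as in the question):

import org.apache.spark.sql.functions.{coalesce, col, lit}

// bml.Usage * COALESCE(u.ConValue, 1) AS acUsage, no UDF needed
val result = bml.join(u, Seq("id"), "inner")
  .select((col("Usage") * coalesce(col("ConValue"), lit(1))).alias("acUsage"))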
Dataframe A (millions of records) has columns including create_date and modified_date.
Dataframe B (500 records) has start_date and end_date.
Current approach:
SELECT a.*, b.* FROM a JOIN b ON a.create_date BETWEEN b.start_date AND b.end_date
The above query performs a Cartesian product join in Spark SQL, and it takes forever to complete.
Can I achieve the same functionality by some other means?
I tried broadcasting the smaller RDD.
EDIT:
Spark version: 1.4.1
No. of executors: 2
Memory/executor: 5g
No. of cores: 5
You cannot avoid the Cartesian product, as Spark SQL does not support non-equi joins.
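That said, the DataFrame API will still accept the range condition, and broadcasting the 500-row side turns the plan into a broadcast nested-loop join instead of a shuffled Cartesian product, which is usually much cheaper. A sketch (assuming dfA and dfB are the two DataFrames; the broadcast function needs Spark 1.5+):

import org.apache.spark.sql.functions.broadcast

// ship the small side to every executor and evaluate the range predicate there
val joined = dfA.join(
  broadcast(dfB),
  dfA("create_date").between(dfB("start_date"), dfB("end_date"))
)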