I am new to Spark and Scala and trying to figure out how to roll up hierarchical data. The input data map looks like this:
key   value
e1    e2
e2    e3
e4    e2
e5    e4
e6    e4
I am trying to find a way to roll up the input map into a data frame like below:
id    level1   level2   level3
e1    e2       e3       null
e2    e3       null     null
e3    null     null     null
e4    e2       null     null
e5    e4       e2       e3
e6    e4       e2       e3
I need to use Scala and Spark.
Any help will be appreciated.
Thanks
Since you are interested in 3 levels, you could perform a left join twice to achieve this. First I prepare df1, using a union so that every value also appears as a key, and an aggregate to get the first level, before renaming the columns.
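For reference, here is a minimal sketch that builds the question's mapping as the DataFrame df used below (it assumes a SparkSession named spark for spark.implicits._; the functions import provides the col and max used later):
import spark.implicits._
import org.apache.spark.sql.functions._

// the key/value mapping from the question, as a DataFrame with columns key and value
val df = Seq(("e1", "e2"), ("e2", "e3"), ("e4", "e2"), ("e5", "e4"), ("e6", "e4"))
  .toDF("key", "value")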
// rename columns and ensure that all values and keys exist as ids
val df1 = df.union(df.selectExpr("value as key", "NULL as value"))
  .groupBy("key")
  .agg(max("value").as("value"))
  .withColumnRenamed("key", "id")
  .withColumnRenamed("value", "level1")

// first repetition, for level 2
val level2 = df1.alias("l1").join(
  df1.alias("l2"),
  col("l1.level1") === col("l2.id"),
  "left"
).selectExpr("l1.*", "l2.level1 as level2")

// second repetition, for level 3
val level3 = level2.alias("l1").join(
  df1.alias("l2"),
  col("l1.level2") === col("l2.id"),
  "left"
).selectExpr("l1.*", "l2.level1 as level3")

level3.show()
The example above was written in a way that I hope makes the pattern between repetitions easy to spot, i.e. you could loop to your desired level or depth in Scala, joining the original one-level mapping back in at every step, e.g.:
val base = df.union(df.selectExpr("value as key", "NULL as value"))
  .groupBy("key")
  .agg(max("value").as("value"))
  .withColumnRenamed("key", "id")
  .withColumnRenamed("value", "level1")
  .cache()

var df1 = base
// 1 to 2 adds level2 and level3; widen the range for deeper hierarchies
for (depth <- 1 to 2) {
  df1 = df1.alias("l1").join(
    base.alias("l2"),
    col("l1.level" + depth) === col("l2.id"),
    "left"
  ).selectExpr("l1.*", "l2.level1 as level" + (depth + 1))
}
df1.show()
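The base one-level mapping is built once and cached because it is joined back in at every iteration; each additional level is then just one more left join against that cached lookup.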
I am trying to convert a DataFrame into an RDD and split it into a specific number of columns, dynamically and elegantly, based on the number of columns in the DataFrame.
i.e.
This is sample data from a Hive table named employee:
Id Name Age State City
123 Bob 34 Texas Dallas
456 Stan 26 Florida Tampa
val temp_df = spark.sql("Select * from employee")
val temp2_rdd = temp_df.rdd.map(x => (x(0), x(1), x(2), x(3)))
I am looking to generate temp2_rdd dynamically based on the number of columns in the table.
It should not be hard-coded as I did.
Since the maximum tuple size in Scala is 22, is there any other collection that can hold the RDD contents efficiently?
Coding language: Spark Scala
Please advise.
Instead of extracting and transforming each element by index, you can use the toSeq method of the Row object.
val temp_df = spark.sql("Select * from employee")
// RDD[List[Any]]
val temp2_rdd = temp_df.rdd.map(_.toSeq.toList)
// RDD[List[String]]
val temp3_rdd = temp_df.rdd.map(_.toSeq.map(_.toString).toList)
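If the column names are needed alongside the values, Row.getValuesMap can pair each value with its column name, which also stays dynamic regardless of how many columns the table has (a sketch building on temp_df above):
val colNames = temp_df.columns.toSeq
// RDD[Map[String, Any]], keyed by column name
val temp4_rdd = temp_df.rdd.map(_.getValuesMap[Any](colNames))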
I have imported a CSV file into a dataframe in Azure Databricks using Scala.
--------------
A B C D E
--------------
a1 b1 c1 d1 e1
a2 b2 c2 d2 e2
--------------
Now I want to perform hash on some selective columns and add the result as a new column to that dataframe.
--------------------------------
A B B2 C D D2 E
--------------------------------
a1 b1 hash(b1) c1 d1 hash(d1) e1
a2 b2 hash(b2) c2 d2 hash(d2) e2
--------------------------------
This is the code I have:
val data_df = spark.read.format("csv").option("header", "true").option("sep", ",").load(input_file)
...
...
for (col <- columns) {
  if (columnMapping.keys.contains(col)) {
    val newColName = col + "_token"
    // Now here I want to add a new column to data_df and the content would be hash of the current value
  }
}
// And here I would like to upload selective columns (B, B2, D, D2) to a SQL database
Any help will be highly appreciated.
Thank you!
Try this -
import org.apache.spark.sql.functions.{col, udf}

val colsToApplyHash = Array("B", "D")

val hashFunction: String => String = <ACTUAL HASH LOGIC>
val hash = udf(hashFunction)

val finalDf = colsToApplyHash.foldLeft(data_df) {
  case (acc, colName) => acc.withColumn(colName + "2", hash(col(colName)))
}
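If a standard hash such as SHA-256 is acceptable, Spark's built-in sha2 function could replace the custom UDF entirely; a sketch reusing colsToApplyHash from above:
import org.apache.spark.sql.functions.{col, sha2}

// 256-bit SHA-2 of each selected column, added as <column name>2
val hashedDf = colsToApplyHash.foldLeft(data_df) {
  case (acc, colName) => acc.withColumn(colName + "2", sha2(col(colName), 256))
}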
I have a dataframe which contains 4 columns.
Dataframe sample
id1   id2   id3   id4
---------------------
a1    a2    a3    a4
b1    b2    b3    b4
b1    b2    b3    b4
c1    c2    c3    c4
      b2
c1
            a3
                  a4
c1
                  d4
There are two types of rows: either all the columns have data, or only one column does.
I want to perform a distinct operation across all the columns such that, when comparing values between rows, only the values present in a row are compared and null values are ignored.
The output dataframe should be:
id1   id2   id3   id4
a1    a2    a3    a4
b1    b2    b3    b4
c1    c2    c3    c4
                  d4
I have looked at multiple examples of UDAFs in Spark, but I have not been able to modify them accordingly.
You can use filter on all the columns as below:
df.filter($"id1" =!= "" && $"id2" =!= "" && $"id3" =!= "" && $"id4" =!= "")
and you should get your final dataframe.
The above code is for a static four-column dataframe. If you have more than four columns, the above method becomes tedious, as you would have to write too many checks.
The solution to that would be to use a udf function as below:
import org.apache.spark.sql.functions._
import scala.collection.mutable

def checkIfNull = udf((co: mutable.WrappedArray[String]) => !(co.contains(null) || co.contains("")))

df.filter(checkIfNull(array(df.columns.map(col): _*))).show(false)
I hope the answer is helpful
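As an alternative to the UDF, the same null/empty check can be assembled from the column list with built-in column expressions (a sketch):
import org.apache.spark.sql.functions.col

// true only when every column is non-null and non-empty
val allPresent = df.columns.map(c => col(c).isNotNull && col(c) =!= "").reduce(_ && _)
df.filter(allPresent).show(false)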
It is possible to take advantage of the fact that dropDuplicates is order-dependent to solve this, see the answer here. However, it is not very efficient; there should be a more efficient solution.
First remove all duplicates with distinct(), then iteratively order by each column and drop its duplicates. The columns are ordered in descending order so that nulls are put last.
Example with four static columns:
val df2 = df.distinct()
  .orderBy($"id1".desc).dropDuplicates("id1")
  .orderBy($"id2".desc).dropDuplicates("id2")
  .orderBy($"id3".desc).dropDuplicates("id3")
  .orderBy($"id4".desc).dropDuplicates("id4")
I have a Spark dataframe (input_dataframe_1); the data in it looks like below:
id value
1 Ab
2 Ai
3 aB
I have another Spark dataframe (input_dataframe_2); the data in it looks like below:
name value
x ab
y iA
z aB
I want to join both dataframes, and the join condition should be case-insensitive. Below is the join condition I am using:
output = input_dataframe_1.join(input_dataframe_2,['value'])
How can I make join condition case insensitive?
from pyspark.sql.functions import lower
#sample data
input_dataframe_1 = sc.parallelize([(1, 'Ab'), (2, 'Ai'), (3, 'aB')]).toDF(["id", "value"])
input_dataframe_2 = sc.parallelize([('x', 'ab'), ('y', 'iA'), ('z', 'aB')]).toDF(["name", "value"])
output = input_dataframe_1.\
    join(input_dataframe_2, lower(input_dataframe_1.value) == lower(input_dataframe_2.value)).\
    drop(input_dataframe_2.value)
output.show()
Assuming you are doing an inner join, find the solution below:
Create input dataframe 1
val inputDF1 = spark.createDataFrame(Seq(("1","Ab"),("2","Ai"),("3","aB"))).withColumnRenamed("_1","id").withColumnRenamed("_2","value")
Create input dataframe 2
val inputDF2 = spark.createDataFrame(Seq(("x","ab"),("y","iA"),("z","aB"))).withColumnRenamed("_1","id").withColumnRenamed("_2","value")
Joining both dataframes on lower(value) column
inputDF1.join(inputDF2,lower(inputDF1.col("value"))===lower(inputDF2.col("value"))).show
id   value   id   value
1    Ab      z    aB
1    Ab      x    ab
3    aB      z    aB
3    aB      x    ab
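If the duplicate value column is unwanted in the result, one of the two can be dropped after the join, mirroring the PySpark answer above (a sketch):
inputDF1.join(inputDF2, lower(inputDF1.col("value")) === lower(inputDF2.col("value")))
  .drop(inputDF2.col("value"))
  .show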
I am learning Apache Spark and the Scala language, so some help please. I get 3 columns (c1, c2, and c3) from querying Cassandra and load them into a dataframe in Scala code. I have to bin c1 (bin size = 3, as in a histogram) and find the mean of c2 and c3 within each c1 bin. Are there any pre-built functions I can use to do this instead of traditional for loops and if conditions?
Try this
// key each record by c1 and carry (c2, c3, 1) so that sums and counts can be accumulated together
val modifiedRDD = rdd.map { case (c1, c2, c3) => (c1, (c2, c3, 1)) }
// sum c2, c3 and the count per key
val reducedRDD = modifiedRDD.reduceByKey { case (x, y) => (x._1 + y._1, x._2 + y._2, x._3 + y._3) }
// divide the sums by the count to get the means per key
val finalRDD = reducedRDD.map { case (c1, (totalC2, totalC3, count)) => (c1, totalC2 / count, totalC3 / count) }
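The code above groups by the exact value of c1. Since the question mentions a dataframe and asks about pre-built functions, the binning and means could also be expressed with the DataFrame API; a sketch, assuming c1 is numeric, df is the dataframe queried from Cassandra, and the bins are [0, 3), [3, 6), and so on:
import org.apache.spark.sql.functions.{avg, col, floor}

val binned = df
  .withColumn("c1_bin", floor(col("c1") / 3))   // bin id for a bin size of 3
  .groupBy("c1_bin")
  .agg(avg("c2").as("mean_c2"), avg("c3").as("mean_c3"))
binned.show()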