How to get distinct count value in scala - scala

I want to reproduce this SQL query in Scala, i.e. count the distinct suppKey values per key:
select
key,
count(distinct suppKey)
from
file
group by
key ;
I wrote this code in Scala, but it doesn't work:
val count= file.map(line=> (line.split('|')(0),line.split('|')(1)).distinct().count())
I split each line because the key is in the first column of the file and suppKey is in the second.
File:
1|52|3956|337.0
1|77|4069|357.8
1|7|14|35.2
2|3|8895|378.4
2|3|4969|915.2
2|3|8539|438.3
2|78|3025|306.3
Expected output:
1|3
2|2

Instead of a file, for simpler testing, I use a String:
scala> val s="""1|52|3956|337.0
| 1|77|4069|357.8
| 1|7|14|35.2
| 2|3|8895|378.4
| 2|3|4969|915.2
| 2|3|8539|438.3
| 2|78|3025|306.3"""
scala> s.split("\n").map (line => {val sp = line.split ('|'); (sp(0), sp(1))}).distinct.groupBy (_._1).map (e => (e._1, e._2.size))
res198: scala.collection.immutable.Map[String,Int] = Map(2 -> 2, 1 -> 3)
Imho, we need a groupBy to specify what to group over, and to count groupwise.
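The groupBy idea can also be written so that each line is split only once and the distinct count is taken inside each group. This is a minimal sketch on plain Scala collections; the object and method names are made up for illustration, and it assumes every line has at least two '|'-separated fields.

```scala
object DistinctPerKey {
  // Parse "key|suppKey|..." lines and count the distinct suppKey values per key.
  def distinctCounts(lines: Seq[String]): Map[String, Int] =
    lines
      .map { line =>
        val fields = line.split('|') // split once per line
        (fields(0), fields(1))       // (key, suppKey)
      }
      .groupBy { case (key, _) => key }
      .map { case (key, pairs) => key -> pairs.map(_._2).distinct.size }
}
```

Called with the seven sample lines from the question, distinctCounts returns a Map with 1 -> 3 and 2 -> 2.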

Done in spark REPL. test.txt is the file with the text you've provided
val d = sc.textFile("test.txt")
d.map(x => (x.split("\\|")(0), x.split("\\|")(1))).distinct.countByKey
scala.collection.Map[String,Long] = Map(2 -> 2, 1 -> 3)
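For comparison, the DataFrame API maps more directly onto the original SQL. The following is only a sketch under assumptions: the object and method names are hypothetical, the file is read as pipe-separated CSV without a header (so Spark names the columns _c0, _c1, ...), and the caller supplies the SparkSession.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, countDistinct}

object DistinctSuppKeys {
  // Equivalent of: select key, count(distinct suppKey) from file group by key
  def distinctPerKey(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("sep", "|")              // fields are separated by '|'
      .csv(path)                       // no header: columns are _c0, _c1, ...
      .select(col("_c0").as("key"), col("_c1").as("suppKey"))
      .groupBy("key")
      .agg(countDistinct("suppKey").as("distinct_suppKeys"))
}
```

In the spark-shell this could be called as DistinctSuppKeys.distinctPerKey(spark, "test.txt").show().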

Related

getting the values of a column with keys - spark scala

I have a Map[String, Int] like this:
val map1 = Map( "S" -> 1 , "T" -> 2, "U" -> 3)
and a DataFrame with a column called mappedcol (type array[string]). The first two rows of the column are [S,U] and [U,U]. I would like to replace every element with the value of its key, so that I get [1,3] instead of [S,U] and [3,3] instead of [U,U]. How can I do this efficiently?
Thanks
The map can be transformed into a SQL expression based on transform and when:
var ex = "transform(value, v -> case ";
for ((k,v) <- map1) ex += s"when v = '${k}' then ${v} "
ex += "else 99 end)"
ex now contains the string
transform(value, v -> case when v = 'S' then 1 when v = 'T' then 2 when v = 'U' then 3 else 99 end)
This expression can now be used to calculate a new column:
import org.apache.spark.sql.functions._
df.withColumn("result", expr(ex)).show();
Output:
+---+------+------+
| id| value|result|
+---+------+------+
| 1|[S, U]|[1, 3]|
| 2|[U, U]|[3, 3]|
+---+------+------+
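The same expression string can be built without a mutable var. This is a minimal sketch: TransformExpr.build and its parameters are hypothetical names, and it assumes the keys contain no single quotes, since nothing is escaped.

```scala
object TransformExpr {
  // Builds: transform(<column>, v -> case when v = 'K' then V ... else <default> end)
  def build(column: String, mapping: Map[String, Int], default: Int): String = {
    val branches = mapping
      .map { case (k, v) => s"when v = '$k' then $v" } // one WHEN branch per map entry
      .mkString(" ")
    s"transform($column, v -> case $branches else $default end)"
  }
}
```

TransformExpr.build("value", map1, 99) produces the expression string shown above, which can then be passed to expr as before. The when branches follow the map's iteration order, which does not affect the result because the keys are distinct.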

I want to create variable from dataframe and need to use in spark scala code

I want to create variables from a DataFrame and use them in my Spark Scala code: I want to go through the rows one by one and put each column value into a variable each time. Can someone help?
Here is my DataFrame:
+---+--------------------+------------------------------------------------------------------+---------------------------+-------------------------------------------------------------------------+----------+
|id |table1_name |table_1_path |table2_name |table_2_path |key_column|
+---+--------------------+------------------------------------------------------------------+---------------------------+-------------------------------------------------------------------------+----------+
|1 |orders-201019-002101|C:/Users/USER/Desktop/Notes/datset/week11/orders-201019-002101.csv|orders-201019-002101 - Copy|C:/Users/USER/Desktop/Notes/datset/week11/orders-201019-002101 - Copy.csv|order_id |
|2 |orders-201019-002101|C:/Users/USER/Desktop/Notes/datset/week11/orders-201019-002101.csv|orders-201019-002101 - Copy|C:/Users/USER/Desktop/Notes/datset/week11/orders-201019-002101 - Copy.csv|order_id |
+---+--------------------+------------------------------------------------------------------+---------------------------+-------------------------------------------------------------------------+----------+
I tried using a list, but that seems very difficult in Scala.
You can collect the DataFrame into a list, then use map to convert each row into a case class. That way you can access the row data through named fields.
Example:
case class schemeExemple(
valueOne: Int,
valueTwo: Int
)
val values = Seq((0,1),(0,1),(10,1),(2,1),(4,1))
val df = spark.createDataFrame(values)
val dfList = df.collect().toList.map(x => schemeExemple(x.getInt(0), x.getInt(1)))
dfList.foreach(x => {
println(s"Print value one -> ${x.valueOne}")
println(s"Print value Two -> ${x.valueTwo}")
println("-----------")
})
Output:
Print value one -> 0
Print value Two -> 1
-----------
Print value one -> 0
Print value Two -> 1
-----------
Print value one -> 10
Print value Two -> 1
-----------
Print value one -> 2
Print value Two -> 1
-----------
Print value one -> 4
Print value Two -> 1
-----------

How to convert Dataframe to Map with key as one of column value?

I am new to Spark/Scala. I have a DataFrame like the one below:
Col1 | Col2
a    | 1
a    | 2
a    | 3
b    | 4
b    | 5
I want to create a map like this:
a-> [1,2,3]
b-> [4,5]
I am having trouble combining the col2 values by col1 value and then creating a map keyed by the col1 value.
Use map with collect_list.
val aggdf = df.groupBy($"col1").agg(map($"col1",collect_list($"col2")).alias("mapped"))
aggdf.select($"mapped").show()
You can do it like this:
val df = Seq(
("a",1),
("a",2),
("a",3),
("b",4),
("b",5)
).toDF("col1","col2")
val map: Map[String, Seq[Int]] = df.groupBy($"col1")
.agg(collect_list($"col2"))
.as[(String,Seq[Int])]
.collect().toMap
gives
Map(b -> List(4, 5), a -> List(1, 2, 3))
But be aware that collect() brings all rows to the driver, so this will blow up for large datasets.
How about this:
val x = df.withColumn("x", array("col2"))
.groupBy("col1")
.agg(collect_list("x"))
x.show()
+----+---------------+
|col1|collect_list(x)|
+----+---------------+
| b| [[4], [5]]|
| a|[[1], [2], [3]]|
+----+---------------+
Not really as you wanted, but we are a step closer :)

Scala - How to convert Spark DataFrame to Map

How can I convert a Spark DataFrame to a Map like the one below? I want to convert it into a Map and then to JSON. Pivot didn't work for reshaping the columns, so any help converting it to a Map as shown below would be appreciated.
Input DataFrame :
+-----+-----+-------+--------------------+
|col1 |col2 |object |values              |
+-----+-----+-------+--------------------+
|one  |two  |main   |[101 -> A, 202 -> B]|
+-----+-----+-------+--------------------+
Expected Output DataFrame :
+-----+-----+-------+--------------------+------------------------------------------------------------------------+
|col1 |col2 |object | values | newMap |
+-----+-----+-------+--------------------+------------------------------------------------------------------------+
|one | two |main |[101 -> A, 202 -> B]|[col1 -> one, col2 -> two, object -> main, main -> [101 -> A, 202 -> B]]|
+-----+-----+-------+--------------------+------------------------------------------------------------------------+
I tried the following, without success:
val toMap = udf((col1: String, col2: String, object: String, values: Map[String, String])) => {
col1.zip(values).toMap // need help for logic
// col1 -> col1_value, col2 -> col2_values, object -> object_value, object_value -> [values_of_Col_Values].toMap
})
df.withColumn("newMap", toMap($"col1", $"col2", $"object", $"values"))
I am stuck on writing this correctly and getting the expected output. Please help, either in Scala or Spark.
It's quite straightforward. The precondition is that all the columns must have the same type, otherwise you will get a Spark error.
import spark.implicits._
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
val df = Seq(("Foo", "L", "10"), ("Boo", "XL", "20"))
.toDF("brand", "size", "sales")
// Prepare the map columns. A bit of nasty iteration work is required.
var preCol: Column = null
var counter = 1
val size = df.schema.fields.length
val mapColumns = df.schema.flatMap { field =>
val res = if (counter == size)
Seq(preCol, col(field.name))
else
Seq(lit(field.name), col(field.name))
//assign the current field name for tracking and increment the counter by 1
preCol = col(field.name)
counter += 1
res
}
df.withColumn("new", map(mapColumns: _*)).show(false)
Result
+-----+----+-----+---------------------------------------+
|brand|size|sales|new |
+-----+----+-----+---------------------------------------+
|Foo |L |10 |Map(brand -> Foo, size -> L, L -> 10) |
|Boo |XL |20 |Map(brand -> Boo, size -> XL, XL -> 20)|
+-----+----+-----+---------------------------------------+

how to parallellize this in spark using spark dataset api

I am using spark-sql-2.4.1v with Java 8.
I have data like the following:
val df_data = Seq(
("Indus_1","Indus_1_Name","Country1", "State1",12789979),
("Indus_2","Indus_2_Name","Country1", "State2",21789933),
("Indus_3","Indus_3_Name","Country1", "State3",21789978),
("Indus_4","Indus_4_Name","Country2", "State1",41789978),
("Indus_5","Indus_5_Name","Country3", "State3",27789978),
("Indus_6","Indus_6_Name","Country1", "State1",27899790),
("Indus_7","Indus_7_Name","Country3", "State1",27899790),
("Indus_8","Indus_8_Name","Country1", "State2",27899790),
("Indus_9","Indus_9_Name","Country4", "State1",27899790)
).toDF("industry_id","industry_name","country","state","revenue");
Given the input lists below:
val countryList = Seq("Country1","Country2");
val stateMap = Map("Country1" -> Seq("State1","State2"), "Country2" -> Seq("State2","State3"));
In the Spark job, for each country and for each state, I need to calculate the total revenue of a few industries.
In other languages we would do this in a for loop, i.e.
for (country <- countryList) {
for (state <- stateMap.getOrElse(country, Seq.empty)) {
// do some calculation for each state's industries
}
}
As I understand it, we should not do it like this in Spark, because not all executors would be utilized.
So what is the correct way to handle this?
I have added a few extra rows to your sample data to make the aggregation visible. I have used a Scala parallel collection: for each country it gets the states, uses those values to filter the given DataFrame, does the aggregation, and at the end combines all the results with union.
scala> val df = Seq(
| ("Indus_1","Indus_1_Name","Country1", "State1",12789979),
| ("Indus_2","Indus_2_Name","Country1", "State2",21789933),
| ("Indus_2","Indus_2_Name","Country1", "State2",31789933),
| ("Indus_3","Indus_3_Name","Country1", "State3",21789978),
| ("Indus_4","Indus_4_Name","Country2", "State1",41789978),
| ("Indus_4","Indus_4_Name","Country2", "State2",41789978),
| ("Indus_4","Indus_4_Name","Country2", "State2",81789978),
| ("Indus_4","Indus_4_Name","Country2", "State3",41789978),
| ("Indus_4","Indus_4_Name","Country2", "State3",51789978),
| ("Indus_5","Indus_5_Name","Country3", "State3",27789978),
| ("Indus_6","Indus_6_Name","Country1", "State1",27899790),
| ("Indus_7","Indus_7_Name","Country3", "State1",27899790),
| ("Indus_8","Indus_8_Name","Country1", "State2",27899790),
| ("Indus_9","Indus_9_Name","Country4", "State1",27899790)
| ).toDF("industry_id","industry_name","country","state","revenue")
df: org.apache.spark.sql.DataFrame = [industry_id: string, industry_name: string ... 3 more fields]
scala> val countryList = Seq("Country1","Country2","Country4","Country5");
countryList: Seq[String] = List(Country1, Country2, Country4, Country5)
scala> val stateMap = Map("Country1" -> ("State1","State2"), "Country2" -> ("State2","State3"),"Country3" -> ("State31","State32"));
stateMap: scala.collection.immutable.Map[String,(String, String)] = Map(Country1 -> (State1,State2), Country2 -> (State2,State3), Country3 -> (State31,State32))
scala>
scala> :paste
// Entering paste mode (ctrl-D to finish)
countryList
.par
.filter(cn => stateMap.exists(_._1 == cn))
.map(country => (country,stateMap(country)))
.map{data =>
df.filter($"country" === data._1 && ($"state" === data._2._1 || $"state" === data._2._2)).groupBy("country","state","industry_name").agg(sum("revenue").as("total_revenue"))
}.reduce(_ union _).show(false)
// Exiting paste mode, now interpreting.
+--------+------+-------------+-------------+
|country |state |industry_name|total_revenue|
+--------+------+-------------+-------------+
|Country1|State2|Indus_8_Name |27899790 |
|Country1|State1|Indus_6_Name |27899790 |
|Country1|State2|Indus_2_Name |53579866 |
|Country1|State1|Indus_1_Name |12789979 |
|Country2|State3|Indus_4_Name |93579956 |
|Country2|State2|Indus_4_Name |123579956 |
+--------+------+-------------+-------------+
scala>
Edit 1: Moved the aggregation code into a separate function.
scala> def processDF(data:(String,(String,String)),adf:DataFrame) = adf.filter($"country" === data._1 && ($"state" === data._2._1 || $"state" === data._2._2)).groupBy("country","state","industry_name").agg(sum("revenue").as("total_revenue"))
processDF: (data: (String, (String, String)), adf: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
scala> :paste
// Entering paste mode (ctrl-D to finish)
countryList.
par
.filter(cn => stateMap.exists(_._1 == cn))
.map(country => (country,stateMap(country)))
.map(data => processDF(data,df))
.reduce(_ union _)
.show(false)
// Exiting paste mode, now interpreting.
+--------+------+-------------+-------------+
|country |state |industry_name|total_revenue|
+--------+------+-------------+-------------+
|Country1|State2|Indus_8_Name |27899790 |
|Country1|State1|Indus_6_Name |27899790 |
|Country1|State2|Indus_2_Name |53579866 |
|Country1|State1|Indus_1_Name |12789979 |
|Country2|State3|Indus_4_Name |93579956 |
|Country2|State2|Indus_4_Name |123579956 |
+--------+------+-------------+-------------+
scala>
It really depends on what you want to do. If you don't need to share state between the (country, state) combinations, then you should create a DataFrame in which each row is a (country, state) pair. You can then control how many rows are processed in parallel through the number of partitions and the number of cores.
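The one-row-per-(country, state) idea can be sketched as follows: flatten the lookup into pairs, turn the pairs into a small DataFrame, and join it with the data so that a single Spark job does the filtering and aggregation. The object and method names are hypothetical, the column names are taken from the question, and stateMap is assumed to map each country to a Seq of states.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum

object RevenueByCountryState {
  // Flatten the lookup into (country, state) pairs; countries without an entry are skipped.
  def allowedPairs(countries: Seq[String],
                   states: Map[String, Seq[String]]): Seq[(String, String)] =
    for {
      country <- countries
      state   <- states.getOrElse(country, Seq.empty)
    } yield (country, state)

  // Keep only rows matching one of the pairs (inner join), then aggregate.
  def totals(spark: SparkSession, df: DataFrame, pairs: Seq[(String, String)]): DataFrame = {
    import spark.implicits._
    val wanted = pairs.toDF("country", "state")
    df.join(wanted, Seq("country", "state"))
      .groupBy("country", "state", "industry_name")
      .agg(sum("revenue").as("total_revenue"))
  }
}
```

Because the filter is expressed as a join rather than a driver-side loop, Spark can distribute the whole computation across the executors in one job.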
You can use flatMapValues to create key-value pairs and then make your calculations in .map step.
scala> val data = Seq(("country1",Seq("state1","state2","state3")),("country2",Seq("state1","state2")))
scala> val rdd = sc.parallelize(data)
scala> val rdd2 = rdd.flatMapValues(s=>s)
scala> rdd2.foreach(println(_))
(country1,state1)
(country2,state1)
(country1,state2)
(country2,state2)
(country1,state3)
Here you can perform operations on each pair. As an example, I've appended # to each state:
scala> rdd2.map(s=>(s._1,s._2+"#")).foreach(println(_))
(country1,state1#)
(country1,state2#)
(country1,state3#)
(country2,state1#)
(country2,state2#)