Pyspark Join Tables - pyspark

I'm new to Pyspark. I have 'Table A' and 'Table B', and I need to join them to get 'Table C'. Can anyone help me, please?
I'm using DataFrames...
I don't know how to join those tables together in the right way...
Table A:
+--+----------+-----+
|id|year_month| qt  |
+--+----------+-----+
| 1|  2015-05 | 190 |
| 2|  2015-06 | 390 |
+--+----------+-----+
Table B:
+----------+-----+
|year_month| sem |
+----------+-----+
|  2016-01 |  1  |
|  2015-02 |  1  |
|  2015-03 |  1  |
|  2016-04 |  1  |
|  2015-05 |  1  |
|  2015-06 |  1  |
|  2016-07 |  2  |
|  2015-08 |  2  |
|  2015-09 |  2  |
|  2016-10 |  2  |
|  2015-11 |  2  |
|  2015-12 |  2  |
+----------+-----+
Table C:
The join adds columns and also adds rows...
+--+----------+-----+-----+
|id|year_month| qt  | sem |
+--+----------+-----+-----+
| 1|  2016-01 |   0 |  1  |
| 1|  2015-02 |   0 |  1  |
| 1|  2015-03 |   0 |  1  |
| 1|  2016-04 |   0 |  1  |
| 1|  2015-05 | 190 |  1  |
| 1|  2015-06 |   0 |  1  |
| 1|  2016-07 |   0 |  2  |
| 1|  2015-08 |   0 |  2  |
| 1|  2015-09 |   0 |  2  |
| 1|  2016-10 |   0 |  2  |
| 1|  2015-11 |   0 |  2  |
| 1|  2015-12 |   0 |  2  |
| 2|  2016-01 |   0 |  1  |
| 2|  2015-02 |   0 |  1  |
| 2|  2015-03 |   0 |  1  |
| 2|  2016-04 |   0 |  1  |
| 2|  2015-05 |   0 |  1  |
| 2|  2015-06 | 390 |  1  |
| 2|  2016-07 |   0 |  2  |
| 2|  2015-08 |   0 |  2  |
| 2|  2015-09 |   0 |  2  |
| 2|  2016-10 |   0 |  2  |
| 2|  2015-11 |   0 |  2  |
| 2|  2015-12 |   0 |  2  |
+--+----------+-----+-----+
Code:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)

lA = [(1, "2015-05", 190), (2, "2015-06", 390)]
tableA = sqlContext.createDataFrame(lA, ["id", "year_month", "qt"])
tableA.show()

lB = [("2016-01", 1), ("2015-02", 1), ("2015-03", 1), ("2016-04", 1),
      ("2015-05", 1), ("2015-06", 1), ("2016-07", 2), ("2015-08", 2),
      ("2015-09", 2), ("2016-10", 2), ("2015-11", 2), ("2015-12", 2)]
tableB = sqlContext.createDataFrame(lB, ["year_month", "sem"])
tableB.show()

It's not really a join, more a Cartesian product (cross join).
Spark 2:
import pyspark.sql.functions as psf

tableA.crossJoin(tableB)\
    .withColumn(
        "qt",
        psf.when(tableB.year_month == tableA.year_month, psf.col("qt")).otherwise(0))\
    .drop(tableA.year_month)

Spark 1.6:
tableA.join(tableB)\
    .withColumn(
        "qt",
        psf.when(tableB.year_month == tableA.year_month, psf.col("qt")).otherwise(0))\
    .drop(tableA.year_month)
+---+---+----------+---+
| id| qt|year_month|sem|
+---+---+----------+---+
| 1| 0| 2015-02| 1|
| 1| 0| 2015-03| 1|
| 1|190| 2015-05| 1|
| 1| 0| 2015-06| 1|
| 1| 0| 2016-01| 1|
| 1| 0| 2016-04| 1|
| 1| 0| 2015-08| 2|
| 1| 0| 2015-09| 2|
| 1| 0| 2015-11| 2|
| 1| 0| 2015-12| 2|
| 1| 0| 2016-07| 2|
| 1| 0| 2016-10| 2|
| 2| 0| 2015-02| 1|
| 2| 0| 2015-03| 1|
| 2| 0| 2015-05| 1|
| 2|390| 2015-06| 1|
| 2| 0| 2016-01| 1|
| 2| 0| 2016-04| 1|
| 2| 0| 2015-08| 2|
| 2| 0| 2015-09| 2|
| 2| 0| 2015-11| 2|
| 2| 0| 2015-12| 2|
| 2| 0| 2016-07| 2|
| 2| 0| 2016-10| 2|
+---+---+----------+---+
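An equivalent way to get the same result, shown here as a minimal sketch (assuming Spark 2.x, where crossJoin is available): build the full (id, year_month) grid first, then left join tableA back in and fill the missing quantities with 0.

import pyspark.sql.functions as psf

# Full grid of every id paired with every year_month from tableB,
# then pull in the known quantities and default the rest to 0.
grid = tableA.select("id").distinct().crossJoin(tableB)
tableC = (grid
          .join(tableA, ["id", "year_month"], "left")
          .na.fill({"qt": 0})
          .orderBy("id", "sem", "year_month"))
tableC.show(24)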

Related

How to solve the following issue with Apache Spark with an optimal solution

I need to solve the following problem without GraphFrames. Please help.
Input Dataframe
|-----------+-----------+--------------|
| ID | prev | next |
|-----------+-----------+--------------|
| 1 | 1 | 2 |
| 2 | 1 | 3 |
| 3 | 2 | null |
| 9 | 9 | null |
|-----------+-----------+--------------|
Output Dataframe
|-----------+------------|
| bill_id | item_id |
|-----------+------------|
| 1 | [1, 2, 3] |
| 9 | [9] |
|-----------+------------|
This is probably quite inefficient, but it works. It is inspired by how GraphFrames does connected components: basically, join the dataframe with itself on the prev column until prev doesn't get any lower, then group.
import pyspark.sql.functions as F

df = sc.parallelize([(1, 1, 2), (2, 1, 3), (3, 2, None), (9, 9, None)]).toDF(['ID', 'prev', 'next'])
df.show()
+---+----+----+
| ID|prev|next|
+---+----+----+
| 1| 1| 2|
| 2| 1| 3|
| 3| 2|null|
| 9| 9|null|
+---+----+----+
converged = False
count = 0
while not converged:
    step = df.join(df.selectExpr('ID as prev', 'prev as lower_prev'), 'prev', 'left').cache()
    print('step', count)
    step.show()
    converged = step.where('prev != lower_prev').count() == 0
    df = step.selectExpr('ID', 'lower_prev as prev')
    print('df', count)
    df.show()
    count += 1
step 0
+----+---+----+----------+
|prev| ID|next|lower_prev|
+----+---+----+----------+
| 2| 3|null| 1|
| 1| 2| 3| 1|
| 1| 1| 2| 1|
| 9| 9|null| 9|
+----+---+----+----------+
df 0
+---+----+
| ID|prev|
+---+----+
| 3| 1|
| 1| 1|
| 2| 1|
| 9| 9|
+---+----+
step 1
+----+---+----------+
|prev| ID|lower_prev|
+----+---+----------+
| 1| 3| 1|
| 1| 1| 1|
| 1| 2| 1|
| 9| 9| 9|
+----+---+----------+
df 1
+---+----+
| ID|prev|
+---+----+
| 3| 1|
| 1| 1|
| 2| 1|
| 9| 9|
+---+----+
df.groupBy('prev').agg(F.collect_set('ID').alias('item_id')).withColumnRenamed('prev', 'bill_id').show()
+-------+---------+
|bill_id| item_id|
+-------+---------+
| 1|[1, 2, 3]|
| 9| [9]|
+-------+---------+
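A hedged variant of the same loop, in case the inefficiency matters: truncating the lineage on each iteration keeps the query plan from growing without bound. This is only a sketch and assumes Spark 2.3+ for DataFrame.localCheckpoint(); the convergence logic itself is unchanged.

import pyspark.sql.functions as F

def resolve_roots(df):
    # Repeatedly replace prev with its own prev until nothing changes,
    # cutting the lineage each round so the plan stays small.
    converged = False
    while not converged:
        step = df.join(df.selectExpr('ID as prev', 'prev as lower_prev'),
                       'prev', 'left').localCheckpoint()
        converged = step.where('prev != lower_prev').count() == 0
        df = step.selectExpr('ID', 'lower_prev as prev')
    return (df.groupBy('prev')
              .agg(F.collect_set('ID').alias('item_id'))
              .withColumnRenamed('prev', 'bill_id'))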

Add column elements to a Dataframe Scala Spark

I have two dataframes, and I want to add one to all rows of the other one.
My dataframes are like:
id | name | rate
1 | a | 3
1 | b | 4
1 | c | 1
2 | a | 2
2 | d | 4
name
a
b
c
d
e
And I want a result like this:
id | name | rate
1 | a | 3
1 | b | 4
1 | c | 1
1 | d | null
1 | e | null
2 | a | 2
2 | b | null
2 | c | null
2 | d | 4
2 | e | null
How can I do this?
It seems it's more than a simple join.
val df = df1.select("id").distinct()
  .crossJoin(df2)
  .join(df1, Seq("name", "id"), "left")
  .orderBy("id", "name")
df.show
+----+---+----+
|name| id|rate|
+----+---+----+
| a| 1| 3|
| b| 1| 4|
| c| 1| 1|
| d| 1|null|
| e| 1|null|
| a| 2| 2|
| b| 2|null|
| c| 2|null|
| d| 2| 4|
| e| 2|null|
+----+---+----+
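For readers on the PySpark side, a minimal sketch of the same pattern (the names df1 and df2 are taken from the question; the session setup and sample data are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [(1, "a", 3), (1, "b", 4), (1, "c", 1), (2, "a", 2), (2, "d", 4)],
    ["id", "name", "rate"])
df2 = spark.createDataFrame([("a",), ("b",), ("c",), ("d",), ("e",)], ["name"])

# Cross the distinct ids with the full name list, then left join the rates back in.
result = (df1.select("id").distinct()
             .crossJoin(df2)
             .join(df1, ["name", "id"], "left")
             .orderBy("id", "name"))
result.show()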

Transforming all new rows into new column in Spark with Scala

I have a dataframe with fixed columns m1_amt to m4_amt, containing data in the format below:
+------+----------+----------+----------+-----------+
|Entity| m1_amt | m2_amt | m3_amt | m4_amt |
+------+----------+----------+----------+-----------+
| ISO | 1 | 2 | 3 | 4 |
| TEST | 5 | 6 | 7 | 8 |
| Beta | 9 | 10 | 11 | 12 |
+------+----------+----------+----------+-----------+
I am trying to convert each row into a new column, as:
+----------+-------+--------+------+
| Entity | ISO | TEST | Beta |
+----------+-------+--------+------+
| m1_amt | 1 | 5 | 9 |
| m2_amt | 2 | 6 | 10 |
| m3_amt | 3 | 7 | 11 |
| m4_amt | 4 | 8 | 12 |
+----------+-------+--------+------+
How can I achieve this in Spark and Scala?
I have tried the following way:
scala> val df=Seq(("ISO",1,2,3,4),
| ("TEST",5,6,7,8),
| ("Beta",9,10,11,12)).toDF("Entity","m1_amt","m2_amt","m3_amt","m4_amt")
df: org.apache.spark.sql.DataFrame = [Entity: string, m1_amt: int ... 3 more fields]
scala> df.show
+------+------+------+------+------+
|Entity|m1_amt|m2_amt|m3_amt|m4_amt|
+------+------+------+------+------+
| ISO| 1| 2| 3| 4|
| TEST| 5| 6| 7| 8|
| Beta| 9| 10| 11| 12|
+------+------+------+------+------+
scala> val selectDf= df.selectExpr("Entity","stack(4,'m1_amt',m1_amt,'m2_amt',m2_amt,'m3_amt',m3_amt,'m4_amt',m4_amt)")
selectDf: org.apache.spark.sql.DataFrame = [Entity: string, col0: string ... 1 more field]
scala> selectDf.show
+------+------+----+
|Entity| col0|col1|
+------+------+----+
| ISO|m1_amt| 1|
| ISO|m2_amt| 2|
| ISO|m3_amt| 3|
| ISO|m4_amt| 4|
| TEST|m1_amt| 5|
| TEST|m2_amt| 6|
| TEST|m3_amt| 7|
| TEST|m4_amt| 8|
| Beta|m1_amt| 9|
| Beta|m2_amt| 10|
| Beta|m3_amt| 11|
| Beta|m4_amt| 12|
+------+------+----+
scala> selectDf.groupBy("col0").pivot("Entity").agg(concat_ws("",collect_list(col("col1")))).withColumnRenamed("col0","Entity").show
+------+----+---+----+
|Entity|Beta|ISO|TEST|
+------+----+---+----+
|m3_amt| 11| 3| 7|
|m4_amt| 12| 4| 8|
|m2_amt| 10| 2| 6|
|m1_amt| 9| 1| 5|
+------+----+---+----+
scala> df.show
+------+------+------+------+------+
|Entity|m1_amt|m2_amt|m3_amt|m4_amt|
+------+------+------+------+------+
| ISO| 1| 2| 3| 4|
| TEST| 5| 6| 7| 8|
| Beta| 9| 10| 11| 12|
+------+------+------+------+------+
scala> val df1 = df.withColumn("amt", to_json(struct(col("m1_amt"), col("m2_amt"), col("m3_amt"), col("m4_amt"))))
         .withColumn("amt", regexp_replace(col("amt"), """[\\{\\"\\}]""", ""))
         .withColumn("amt", explode(split(col("amt"), ",")))
         .withColumn("cols", split(col("amt"), ":")(0))
         .withColumn("val", split(col("amt"), ":")(1))
         .select("Entity", "cols", "val")
scala> df1.show
+------+------+---+
|Entity| cols|val|
+------+------+---+
| ISO|m1_amt| 1|
| ISO|m2_amt| 2|
| ISO|m3_amt| 3|
| ISO|m4_amt| 4|
| TEST|m1_amt| 5|
| TEST|m2_amt| 6|
| TEST|m3_amt| 7|
| TEST|m4_amt| 8|
| Beta|m1_amt| 9|
| Beta|m2_amt| 10|
| Beta|m3_amt| 11|
| Beta|m4_amt| 12|
+------+------+---+
scala> df1.groupBy(col("cols")).pivot("Entity").agg(concat_ws("", collect_set(col("val"))))
         .withColumnRenamed("cols", "Entity")
         .show()
+------+----+---+----+
|Entity|Beta|ISO|TEST|
+------+----+---+----+
|m3_amt| 11| 3| 7|
|m4_amt| 12| 4| 8|
|m2_amt| 10| 2| 6|
|m1_amt| 9| 1| 5|
+------+----+---+----+
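For comparison, the same stack-then-pivot transpose can be written in PySpark. This is a sketch assuming the same column names as above; it uses first() as the aggregate because each (name, Entity) pair holds exactly one value.

from pyspark.sql import SparkSession
from pyspark.sql.functions import first

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("ISO", 1, 2, 3, 4), ("TEST", 5, 6, 7, 8), ("Beta", 9, 10, 11, 12)],
    ["Entity", "m1_amt", "m2_amt", "m3_amt", "m4_amt"])

# Unpivot the four amount columns into (name, value) rows, then pivot Entity back out.
long_df = df.selectExpr(
    "Entity",
    "stack(4, 'm1_amt', m1_amt, 'm2_amt', m2_amt, "
    "'m3_amt', m3_amt, 'm4_amt', m4_amt) as (name, value)")
transposed = (long_df.groupBy("name")
                     .pivot("Entity")
                     .agg(first("value"))
                     .withColumnRenamed("name", "Entity"))
transposed.show()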

Spark dataframe groupby and order group?

I have the following data,
+-------+----+----+
|user_id|time|item|
+-------+----+----+
| 1| 5| ggg|
| 1| 5| ddd|
| 1| 20| aaa|
| 1| 20| ppp|
| 2| 3| ccc|
| 2| 3| ttt|
| 2| 20| eee|
+-------+----+----+
This can be generated by the following code:
val df = sc.parallelize(Array(
  (1, 20, "aaa"),
  (1, 5, "ggg"),
  (2, 3, "ccc"),
  (1, 20, "ppp"),
  (1, 5, "ddd"),
  (2, 20, "eee"),
  (2, 3, "ttt"))).toDF("user_id", "time", "item")
How can I get the result:
+---------+------+------+----------+
| user_id | time | item | order_id |
+---------+------+------+----------+
| 1 | 5 | ggg | 1 |
| 1 | 5 | ddd | 1 |
| 1 | 20 | aaa | 2 |
| 1 | 20 | ppp | 2 |
| 2 | 3 | ccc | 1 |
| 2 | 3 | ttt | 1 |
| 2 | 20 | eee | 2 |
+---------+------+------+----------+
Group by user_id and time, order by time, and rank the groups. Thanks~
To rank the rows you can use the dense_rank window function, and the ordering can be achieved by a final orderBy transformation:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.dense_rank

val w = Window.partitionBy("user_id").orderBy("time")
val result = df
  .withColumn("order_id", dense_rank().over(w))
  .orderBy("user_id", "time")
result.show()
+-------+----+----+--------+
|user_id|time|item|order_id|
+-------+----+----+--------+
| 1| 5| ddd| 1|
| 1| 5| ggg| 1|
| 1| 20| aaa| 2|
| 1| 20| ppp| 2|
| 2| 3| ttt| 1|
| 2| 3| ccc| 1|
| 2| 20| eee| 2|
+-------+----+----+--------+
Note that the order within the item column is not guaranteed.
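For completeness, a PySpark sketch of the same approach (assuming the same data has been loaded into a DataFrame called df):

from pyspark.sql import Window
from pyspark.sql.functions import dense_rank

# Rank each (user_id, time) group within its user_id partition.
w = Window.partitionBy("user_id").orderBy("time")
result = (df.withColumn("order_id", dense_rank().over(w))
            .orderBy("user_id", "time"))
result.show()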

Filter on data which are numeric

Hi, I have a dataframe with a column CODEARTICLE; here is the dataframe:
+-----------+-------------+--------------------+--------+---+------+------+-----+---+
|CODEARTICLE|    STRUCTURE|                 DES|TYPEMARK|TYP|IMPLOC|MARQUE|GAMME|TAR|
+-----------+-------------+--------------------+--------+---+------+------+-----+---+
| GENCFFRIST|9999999999998|xxxxxxxxxxxxxxxxx...|       0|  0| Local|      |     |   |
| GENCFFMARC|9999999999998|xxxxxxxxxxxxxxxxx...|       0|  0| Local|      |     |   |
| GENCFFESCO|9999999999998|xxxxxxxxxxxxxxxxx...|       0|  0| Local|      |     |   |
|  GENCFFTNA|9999999999998|xxxxxxxxxxxxxxxxx...|       0|  0| Local|      |     |   |
| GENCFFEMBA|9999999999998|xxxxxxxxxxxxxxxxx...|       0|  0| Local|      |     |   |
|  789600010|9999999999998|xxxxxxxxxxxxxxxxx...|       7|  1| Local|      |     |   |
|  799700040|9999999999998|xxxxxxxxxxxxxxxxx...|       0|  1| Local|      |     |   |
|  799701000|9999999999998|xxxxxxxxxxxxxxxxx...|       0|  1| Local|      |     |   |
|  899980490|9999999999998|xxxxxxxxxxxxxxxxx...|       0|  9| Local|      |     |   |
|  429600010|9999999999998|xxxxxxxxxxxxxxxxx...|       0|  1| Local|      |     |   |
|  559970040|9999999999998|xxxxxxxxxxxxxxxxx...|       0|  0| Local|      |     |   |
|  679500010|9999999999998|xxxxxxxxxxxxxxxxx...|       0|  1| Local|      |     |   |
|  679500040|9999999999998|xxxxxxxxxxxxxxxxx...|       0|  1| Local|      |     |   |
|  679500060|9999999999998|xxxxxxxxxxxxxxxxx...|       0|  1| Local|      |     |   |
+-----------+-------------+--------------------+--------+---+------+------+-----+---+
I would like to take only the rows having a numeric CODEARTICLE.
// connect to table TMP_STRUCTURE oracle
val spark = sparkSession.sqlContext
val articles_Gold = spark.load("jdbc",
  Map("url" -> "jdbc:oracle:thin:System/maher#//localhost:1521/XE",
      "dbtable" -> "IPTECH.TMP_ARTICLE")).select("CODEARTICLE", "STRUCTURE", "DES", "TYPEMARK", "TYP", "IMPLOC", "MARQUE", "GAMME", "TAR")
val filteredData = articles_Gold.withColumn("test", 'CODEARTICLE.cast(IntegerType)).filter($"test" !== null)
Thank you a lot.
Use na.drop on the casted helper column (note that na.drop takes a sequence of column names):
articles_Gold.withColumn("test", 'CODEARTICLE.cast(IntegerType)).na.drop(Seq("test"))
You can use the .isNotNull function on the column in your filter. You don't even need to create another column for your logic; you can simply do the following:
val filteredData = articles_Gold.withColumn("CODEARTICLE",'CODEARTICLE.cast(IntegerType)).filter('CODEARTICLE.isNotNull)
I hope the answer is helpful.
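And for PySpark users, a minimal sketch of the same cast-and-filter idea (assuming articles_Gold is the DataFrame loaded from the JDBC source above):

from pyspark.sql.functions import col

# Cast to int; non-numeric codes such as 'GENCFFRIST' become null, so keep only non-null rows.
numeric_only = (articles_Gold
                .withColumn("CODEARTICLE", col("CODEARTICLE").cast("int"))
                .filter(col("CODEARTICLE").isNotNull()))
numeric_only.show()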