DataFrame transform in PySpark - pyspark

I am loading data from a JSON file and I end up with this structure:
DataFrame[CodLic: string, Fecha: struct<$date:struct<$numberLong:string>>, IDBus: struct<$numberInt:string>, NumResults: struct<$numberInt:string>, ResponseTime: struct<$numberDecimal:string>, _id: struct<$oid:string>]
To load the file, I use this code:
df = spark.read.format('json').load(pathText)
This returns this DataFrame:
df.show(10)
+-----------+-----------------+-----------+-------------+---------------+--------------------+
| CodLic| Fecha| IDBus| NumResults| ResponseTime| _id|
+-----------+-----------------+-----------+-------------+---------------+--------------------+
| 04P|[[1536761469602]]|[680244294]| [0]| [1404]|[5b991e7de5e8d9c1...|
| 04P|[[1536761469602]]|[680244303]| [0]| [1420]|[5b991e7de5e8d9c1...|
| 04P|[[1536761469602]]|[680244314]| [0]| [1404]|[5b991e7de5e8d9c1...|
| 04P|[[1536761469602]]|[680244316]| [0]| [1388]|[5b991e7de5e8d9c1...|
| 04P|[[1536761469602]]|[680244293]| [0]| [1373]|[5b991e7de5e8d9c1...|
| 04P|[[1536761469618]]|[680244307]| [0]| [1388]|[5b991e7de5e8d9c1...|
| 04P|[[1536761469618]]|[680244272]| [0]| [1404]|[5b991e7de5e8d9c1...|
| 04P|[[1536761469618]]|[680244312]| [0]| [1388]|[5b991e7de5e8d9c1...|
| 04P|[[1536761469618]]|[680244311]| [0]| [1404]|[5b991e7de5e8d9c1...|
| 04P|[[1536761469618]]|[680244317]| [0]| [1388]|[5b991e7de5e8d9c1...|
+-----------+-----------------+-----------+-------------+---------------+--------------------+
only showing top 10 rows
How can I transform this into the following DataFrame?
+-----------+-----------------+-----------+-------------+---------------+--------------------+
| CodLic| Fecha| IDBus| NumResults| ResponseTime| _id|
+-----------+-----------------+-----------+-------------+---------------+--------------------+
| 04P|[[1536761469602]]| 680244294| 0| 1404|[5b991e7de5e8d9c1...|
| 04P|[[1536761469602]]| 680244303| 0| 1420|[5b991e7de5e8d9c1...|
| 04P|[[1536761469602]]| 680244314| 0| 1404|[5b991e7de5e8d9c1...|
| 04P|[[1536761469602]]| 680244316| 0| 1388|[5b991e7de5e8d9c1...|
| 04P|[[1536761469602]]| 680244293| 0| 1373|[5b991e7de5e8d9c1...|
+-----------+-----------------+-----------+-------------+---------------+--------------------+
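One way to get there (a sketch, assuming the MongoDB extended-JSON field names shown in the schema above) is to select the inner struct fields and cast them to plain numeric columns:
from pyspark.sql import functions as F

# Sketch: unwrap the $numberInt / $numberDecimal wrapper structs and cast
# the extracted strings to numeric types; Fecha and _id are left as-is.
flat = df.select(
    "CodLic",
    "Fecha",
    F.col("IDBus").getField("$numberInt").cast("int").alias("IDBus"),
    F.col("NumResults").getField("$numberInt").cast("int").alias("NumResults"),
    F.col("ResponseTime").getField("$numberDecimal").cast("double").alias("ResponseTime"),
    "_id",
)
flat.show(5)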

Related

pyspark check whether each name has 3 data

In PySpark, I have a DataFrame as follows. I want to check whether each name has three action values (0, 1, 2). If any are missing, add a new row with the score column set to 0 and the other columns unchanged (e.g. str1, str2, str3).
+-----+--------+--------+--------+-------+-------+
| name| str1 | str2 | str3 | action| score |
+-----+--------+--------+--------+-------+-------+
| A | str_A1 | str_A2 | str_A3 | 0| 2|
| A | str_A1 | str_A2 | str_A3 | 1| 6|
| A | str_A1 | str_A2 | str_A3 | 2| 74|
| B | str_B1 | str_B2 | str_B3 | 0| 59|
| B | str_B1 | str_B2 | str_B3 | 1| 18|
| C | str_C1 | str_C2 | str_C3 | 0| 3|
| C | str_C1 | str_C2 | str_C3 | 1| 33|
| C | str_C1 | str_C2 | str_C3 | 2| 3|
+-----+--------+--------+--------+-------+-------+
For example, name B has no action 2, so a new row is added as follows:
+-----+--------+--------+--------+-------+-------+
| name| str1 | str2 | str3 | action| score |
+-----+--------+--------+--------+-------+-------+
| A | str_A1 | str_A2 | str_A3 | 0| 2|
| A | str_A1 | str_A2 | str_A3 | 1| 6|
| A | str_A1 | str_A2 | str_A3 | 2| 74|
| B | str_B1 | str_B2 | str_B3 | 0| 59|
| B | str_B1 | str_B2 | str_B3 | 1| 18|
| B | str_B1 | str_B2 | str_B3 | 2| 0|<---- new row data
| C | str_C1 | str_C2 | str_C3 | 0| 3|
| C | str_C1 | str_C2 | str_C3 | 1| 33|
| C | str_C1 | str_C2 | str_C3 | 2| 3|
+-----+--------+--------+--------+-------+-------+
It is also possible that a name has only one row, in which case two new rows need to be added.
+-----+--------+--------+--------+-------+-------+
| name| str1 | str2 | str3 | action| score |
+-----+--------+--------+--------+-------+-------+
| A | str_A1 | str_A2 | str_A3 | 0| 2|
| A | str_A1 | str_A2 | str_A3 | 1| 6|
| A | str_A1 | str_A2 | str_A3 | 2| 74|
| B | str_B1 | str_B2 | str_B3 | 0| 59|
| B | str_B1 | str_B2 | str_B3 | 1| 18|
| B | str_B1 | str_B2 | str_B3 | 2| 0|
| C | str_C1 | str_C2 | str_C3 | 0| 3|
| C | str_C1 | str_C2 | str_C3 | 1| 33|
| C | str_C1 | str_C2 | str_C3 | 2| 3|
| D | str_D1 | str_D2 | str_D3 | 0| 45|
+-----+--------+--------+--------+-------+-------+
+-----+--------+--------+--------+-------+-------+
| name| str1 | str2 | str3 | action| score |
+-----+--------+--------+--------+-------+-------+
| A | str_A1 | str_A2 | str_A3 | 0| 2|
| A | str_A1 | str_A2 | str_A3 | 1| 6|
| A | str_A1 | str_A2 | str_A3 | 2| 74|
| B | str_B1 | str_B2 | str_B3 | 0| 59|
| B | str_B1 | str_B2 | str_B3 | 1| 18|
| B | str_B1 | str_B2 | str_B3 | 2| 0|
| C | str_C1 | str_C2 | str_C3 | 0| 3|
| C | str_C1 | str_C2 | str_C3 | 1| 33|
| C | str_C1 | str_C2 | str_C3 | 2| 3|
| D | str_D1 | str_D2 | str_D3 | 0| 45|
| D | str_D1 | str_D2 | str_D3 | 1| 0|<---- new row data
| D | str_D1 | str_D2 | str_D3 | 2| 0|<---- new row data
+-----+--------+--------+--------+-------+-------+
I am new to pyspark and don't know how to do this operation.
Thank you for your help.
Solution with a UDF
from pyspark.sql import functions as F, types as T

@F.udf(T.MapType(T.StringType(), T.IntegerType()))
def add_missing_values(values):
    # fill in a 0 score for every action (0, 1, 2) missing from the map
    return {i: values.get(i, 0) for i in range(3)}

df = (
    df.groupBy("name", "str1", "str2", "str3")
    .agg(
        F.map_from_entries(F.collect_list(F.struct("action", "score"))).alias("values")
    )
    .withColumn("values", add_missing_values(F.col("values")))
    .select(
        "name", "str1", "str2", "str3", F.explode("values").alias("action", "score")
    )
)
df.show()
+----+------+------+------+------+-----+
|name| str1| str2| str3|action|score|
+----+------+------+------+------+-----+
| A|str_A1|str_A2|str_A3| 0| 2|
| A|str_A1|str_A2|str_A3| 1| 6|
| A|str_A1|str_A2|str_A3| 2| 74|
| B|str_B1|str_B2|str_B3| 0| 59|
| B|str_B1|str_B2|str_B3| 1| 18|
| B|str_B1|str_B2|str_B3| 2| 0|<---- new row data
| C|str_C1|str_C2|str_C3| 0| 3|
| C|str_C1|str_C2|str_C3| 1| 33|
| C|str_C1|str_C2|str_C3| 2| 3|
| D|str_D1|str_D2|str_D3| 0| 45|
| D|str_D1|str_D2|str_D3| 1| 0|<---- new row data
| D|str_D1|str_D2|str_D3| 2| 0|<---- new row data
+----+------+------+------+------+-----+
Full Spark solution:
df = (
    df.groupBy("name", "str1", "str2", "str3")
    .agg(
        F.map_from_entries(F.collect_list(F.struct("action", "score"))).alias("values")
    )
    .withColumn(
        "values",
        F.map_from_arrays(
            F.array([F.lit(i) for i in range(3)]),
            F.array(
                [F.coalesce(F.col("values").getItem(i), F.lit(0)) for i in range(3)]
            ),
        ),
    )
    .select(
        "name", "str1", "str2", "str3", F.explode("values").alias("action", "score")
    )
)
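The same filling can also be done with joins instead of maps. A minimal sketch under the same column names (assuming spark is the active SparkSession):
# Sketch: build the full (name, action) grid with a cross join, left join the
# real scores back in, and fill the missing ones with 0.
actions = spark.range(3).withColumnRenamed("id", "action")
keys = df.select("name", "str1", "str2", "str3").distinct()
full = (
    keys.crossJoin(actions)
    .join(df, ["name", "str1", "str2", "str3", "action"], "left")
    .fillna(0, subset=["score"])
)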

Pyspark - advanced aggregation of monthly data

I have a table of the following format.
|---------------------|------------------|------------------|
| Customer | Month | Sales |
|---------------------|------------------|------------------|
| A | 3 | 40 |
|---------------------|------------------|------------------|
| A | 2 | 50 |
|---------------------|------------------|------------------|
| B | 1 | 20 |
|---------------------|------------------|------------------|
I need it in the format as below
|---------------------|------------------|------------------|------------------|
| Customer | Month 1 | Month 2 | Month 3 |
|---------------------|------------------|------------------|------------------|
| A | 0 | 50 | 40 |
|---------------------|------------------|------------------|------------------|
| B | 20 | 0 | 0 |
|---------------------|------------------|------------------|------------------|
Can you please help me out to solve this problem in PySpark?
This should help; I am assuming you are using SUM to aggregate values from the original DataFrame.
>>> df.show()
+--------+-----+-----+
|Customer|Month|Sales|
+--------+-----+-----+
| A| 3| 40|
| A| 2| 50|
| B| 1| 20|
+--------+-----+-----+
>>> import pyspark.sql.functions as F
>>> df2 = (df.withColumn('COLUMN_LABELS', F.concat(F.lit('Month '), F.col('Month')))
...        .groupby('Customer')
...        .pivot('COLUMN_LABELS')
...        .agg(F.sum('Sales'))
...        .fillna(0))
>>> df2.show()
+--------+-------+-------+-------+
|Customer|Month 1|Month 2|Month 3|
+--------+-------+-------+-------+
| A| 0| 50| 40|
| B| 20| 0| 0|
+--------+-------+-------+-------+
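If the month labels are known up front, you can also pass them to pivot explicitly; this skips the extra job Spark otherwise runs to discover the distinct pivot values and guarantees the output columns (a sketch with the labels hard-coded):
import pyspark.sql.functions as F

months = ['Month 1', 'Month 2', 'Month 3']
df2 = (df.withColumn('COLUMN_LABELS', F.concat(F.lit('Month '), F.col('Month')))
       .groupby('Customer')
       .pivot('COLUMN_LABELS', months)
       .agg(F.sum('Sales'))
       .fillna(0))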

Spark dataframe groupby and order group?

I have the following data,
+-------+----+----+
|user_id|time|item|
+-------+----+----+
| 1| 5| ggg|
| 1| 5| ddd|
| 1| 20| aaa|
| 1| 20| ppp|
| 2| 3| ccc|
| 2| 3| ttt|
| 2| 20| eee|
+-------+----+----+
this could be generated by code:
val df = sc.parallelize(Array(
  (1, 20, "aaa"),
  (1, 5, "ggg"),
  (2, 3, "ccc"),
  (1, 20, "ppp"),
  (1, 5, "ddd"),
  (2, 20, "eee"),
  (2, 3, "ttt"))).toDF("user_id", "time", "item")
How can I get the result:
+---------+------+------+----------+
| user_id | time | item | order_id |
+---------+------+------+----------+
| 1 | 5 | ggg | 1 |
| 1 | 5 | ddd | 1 |
| 1 | 20 | aaa | 2 |
| 1 | 20 | ppp | 2 |
| 2 | 3 | ccc | 1 |
| 2 | 3 | ttt | 1 |
| 2 | 20 | eee | 2 |
+---------+------+------+----------+
I want to group by user_id and time, order by time, and rank the groups. Thanks!
To rank the rows you can use the dense_rank window function, and the ordering can be achieved with a final orderBy transformation:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.dense_rank

val w = Window.partitionBy("user_id").orderBy("user_id", "time")
val result = df
  .withColumn("order_id", dense_rank().over(w))
  .orderBy("user_id", "time")
result.show()
+-------+----+----+--------+
|user_id|time|item|order_id|
+-------+----+----+--------+
| 1| 5| ddd| 1|
| 1| 5| ggg| 1|
| 1| 20| aaa| 2|
| 1| 20| ppp| 2|
| 2| 3| ttt| 1|
| 2| 3| ccc| 1|
| 2| 20| eee| 2|
+-------+----+----+--------+
Note that the order within the item column is not guaranteed.
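For reference, a PySpark version of the same idea (a sketch assuming the same column names):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank the distinct time values within each user_id partition.
w = Window.partitionBy("user_id").orderBy("time")
result = (df.withColumn("order_id", F.dense_rank().over(w))
          .orderBy("user_id", "time"))
result.show()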

Pyspark Join Tables

I'm new to PySpark. I have 'Table A' and 'Table B' and I need to join them to get 'Table C'. Can anyone help me, please?
I'm using DataFrames...
I don't know how to join those tables together the right way...
Table A:
+--+----------+-----+
|id|year_month| qt |
+--+----------+-----+
| 1| 2015-05| 190 |
| 2| 2015-06| 390 |
+--+----------+-----+
Table B:
+----------+----+
|year_month| sem|
+----------+----+
|   2016-01|   1|
|   2015-02|   1|
|   2015-03|   1|
|   2016-04|   1|
|   2015-05|   1|
|   2015-06|   1|
|   2016-07|   2|
|   2015-08|   2|
|   2015-09|   2|
|   2016-10|   2|
|   2015-11|   2|
|   2015-12|   2|
+----------+----+
Table C:
The join adds columns and also adds rows...
+--+----------+-----+-----+
|id|year_month| qt | sem |
+--+----------+-----+-----+
| 1| 2015-05 | 0 | 1 |
| 1| 2016-01 | 0 | 1 |
| 1| 2015-02 | 0 | 1 |
| 1| 2015-03 | 0 | 1 |
| 1| 2016-04 | 0 | 1 |
| 1| 2015-05 | 190 | 1 |
| 1| 2015-06 | 0 | 1 |
| 1| 2016-07 | 0 | 2 |
| 1| 2015-08 | 0 | 2 |
| 1| 2015-09 | 0 | 2 |
| 1| 2016-10 | 0 | 2 |
| 1| 2015-11 | 0 | 2 |
| 1| 2015-12 | 0 | 2 |
| 2| 2015-05 | 0 | 1 |
| 2| 2016-01 | 0 | 1 |
| 2| 2015-02 | 0 | 1 |
| 2| 2015-03 | 0 | 1 |
| 2| 2016-04 | 0 | 1 |
| 2| 2015-05 | 0 | 1 |
| 2| 2015-06 | 390 | 1 |
| 2| 2016-07 | 0 | 2 |
| 2| 2015-08 | 0 | 2 |
| 2| 2015-09 | 0 | 2 |
| 2| 2016-10 | 0 | 2 |
| 2| 2015-11 | 0 | 2 |
| 2| 2015-12 | 0 | 2 |
+--+----------+-----+-----+
Code:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
lA = [(1,"2015-05",190),(2,"2015-06",390)]
tableA = sqlContext.createDataFrame(lA, ["id","year_month","qt"])
tableA.show()
lB = [("2016-01",1),("2015-02",1),("2015-03",1),("2016-04",1),
      ("2015-05",1),("2015-06",1),("2016-07",2),("2015-08",2),
      ("2015-09",2),("2016-10",2),("2015-11",2),("2015-12",2)]
tableB = sqlContext.createDataFrame(lB,["year_month","sem"])
tableB.show()
It's not really a join, but rather a Cartesian product (cross join):
Spark 2
import pyspark.sql.functions as psf
tableA.crossJoin(tableB)\
    .withColumn(
        "qt",
        psf.when(tableB.year_month == tableA.year_month, psf.col("qt")).otherwise(0))\
    .drop(tableA.year_month)
Spark 1.6
tableA.join(tableB)\
    .withColumn(
        "qt",
        psf.when(tableB.year_month == tableA.year_month, psf.col("qt")).otherwise(0))\
    .drop(tableA.year_month)
+---+---+----------+---+
| id| qt|year_month|sem|
+---+---+----------+---+
| 1| 0| 2015-02| 1|
| 1| 0| 2015-03| 1|
| 1|190| 2015-05| 1|
| 1| 0| 2015-06| 1|
| 1| 0| 2016-01| 1|
| 1| 0| 2016-04| 1|
| 1| 0| 2015-08| 2|
| 1| 0| 2015-09| 2|
| 1| 0| 2015-11| 2|
| 1| 0| 2015-12| 2|
| 1| 0| 2016-07| 2|
| 1| 0| 2016-10| 2|
| 2| 0| 2015-02| 1|
| 2| 0| 2015-03| 1|
| 2| 0| 2015-05| 1|
| 2|390| 2015-06| 1|
| 2| 0| 2016-01| 1|
| 2| 0| 2016-04| 1|
| 2| 0| 2015-08| 2|
| 2| 0| 2015-09| 2|
| 2| 0| 2015-11| 2|
| 2| 0| 2015-12| 2|
| 2| 0| 2016-07| 2|
| 2| 0| 2016-10| 2|
+---+---+----------+---+
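Another way to produce Table C (a sketch, reusing tableA and tableB from the question and assuming Spark 2+ for crossJoin) is to build the id x year_month grid first and then left join the quantities:
# Sketch: every id paired with every month, then the real qt values joined
# back in and the gaps filled with 0.
grid = tableA.select("id").distinct().crossJoin(tableB)
tableC = (grid.join(tableA, ["id", "year_month"], "left")
          .fillna(0, subset=["qt"]))
tableC.orderBy("id", "year_month").show()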

filter on data which are numeric

Hi, I have a DataFrame with a column CODEARTICLE. Here is the DataFrame:
|CODEARTICLE| STRUCTURE| DES|TYPEMARK|TYP|IMPLOC|MARQUE|GAMME|TAR|
+-----------+-------------+--------------------+--------+---+------+------+-----+---+
| GENCFFRIST|9999999999998|xxxxxxxxxxxxxxxxx...| 0| 0| Local| | | |
| GENCFFMARC|9999999999998|xxxxxxxxxxxxxxxxx...| 0| 0| Local| | | |
| GENCFFESCO|9999999999998|xxxxxxxxxxxxxxxxx...| 0| 0| Local| | | |
| GENCFFTNA|9999999999998|xxxxxxxxxxxxxxxxx...| 0| 0| Local| | | |
| GENCFFEMBA|9999999999998|xxxxxxxxxxxxxxxxx...| 0| 0| Local| | | |
| 789600010|9999999999998|xxxxxxxxxxxxxxxxx...| 7| 1| Local| | | |
| 799700040|9999999999998|xxxxxxxxxxxxxxxxx...| 0| 1| Local| | | |
| 799701000|9999999999998|xxxxxxxxxxxxxxxxx...| 0| 1| Local| | | |
| 899980490|9999999999998|xxxxxxxxxxxxxxxxx...| 0| 9| Local| | | |
| 429600010|9999999999998|xxxxxxxxxxxxxxxxx...| 0| 1| Local| | | |
| 559970040|9999999999998|xxxxxxxxxxxxxxxxx...| 0| 0| Local| | | |
| 679500010|9999999999998|xxxxxxxxxxxxxxxxx...| 0| 1| Local| | | |
| 679500040|9999999999998|xxxxxxxxxxxxxxxxx...| 0| 1| Local| | | |
| 679500060|9999999999998|xxxxxxxxxxxxxxxxx...| 0| 1| Local| | | |
+-----------+-------------+--------------------+--------+---+------+------+-----+---+
I would like to keep only the rows that have a numeric CODEARTICLE.
// connect to the Oracle table TMP_STRUCTURE
val spark = sparkSession.sqlContext
val articles_Gold = spark.load("jdbc",
  Map("url" -> "jdbc:oracle:thin:System/maher#//localhost:1521/XE",
      "dbtable" -> "IPTECH.TMP_ARTICLE")).select("CODEARTICLE", "STRUCTURE", "DES", "TYPEMARK", "TYP", "IMPLOC", "MARQUE", "GAMME", "TAR")
val filteredData = articles_Gold.withColumn("test", 'CODEARTICLE.cast(IntegerType)).filter($"test" !== null)
Thank you a lot.
Use na.drop (note that the column names must be passed as a sequence, otherwise the single string is interpreted as the how argument):
articles_Gold.withColumn("test", 'CODEARTICLE.cast(IntegerType)).na.drop(Seq("test"))
You can use the .isNotNull function on the column inside your filter. You don't even need to create another column for your logic. You can simply do the following:
val filteredData = articles_Gold.withColumn("CODEARTICLE", 'CODEARTICLE.cast(IntegerType)).filter('CODEARTICLE.isNotNull)
I hope the answer is helpful
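For completeness, the same filter written in PySpark (a sketch, assuming articles_Gold is the equivalent DataFrame loaded on the Python side):
from pyspark.sql import functions as F

# Cast-and-check variant: non-numeric values become null after the cast
# and are filtered out; the helper column is dropped afterwards.
numeric_only = (articles_Gold
                .withColumn("test", F.col("CODEARTICLE").cast("int"))
                .filter(F.col("test").isNotNull())
                .drop("test"))

# Regex variant: keep only rows whose CODEARTICLE consists of digits.
numeric_only = articles_Gold.filter(F.col("CODEARTICLE").rlike("^[0-9]+$"))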