Using PySpark map-reduce methods I created an RDD. I now want to create a DataFrame from this RDD.
The RDD looks like this:
(491023, ((9,), (0.07971896408231094,), 'Debt collection'))
(491023, ((2, 14, 77, 22, 6, 3, 39, 7, 0, 1, 35, 84, 10, 8, 32, 13), (0.017180308460902963, 0.02751921818456658, 0.011887861159888378, 0.00859908577494079, 0.007521091815230704, 0.006522044953782423, 0.01032297079810829, 0.018976833302472455, 0.007634289723749076, 0.003033975857850723, 0.018805184361326378, 0.011217892399539534, 0.05106916198426676, 0.007901136066759178, 0.008895262042995653, 0.006665649645210911), 'Debt collection'))
(491023, ((36, 12, 50, 40, 5, 23, 58, 76, 11, 7, 65, 0, 1, 66, 16, 99, 98, 45, 13), (0.007528732561416072, 0.017248902490279026, 0.008083896178333739, 0.008274896865005982, 0.0210032206108319, 0.02048387345320946, 0.010225319903418824, 0.017842961406992965, 0.012026753813481164, 0.005154201637708568, 0.008274127579967948, 0.0168843021403551, 0.007416385430301767, 0.009257236955148311, 0.00590385362565239, 0.011031745337733267, 0.011076277004617665, 0.01575522984526745, 0.005431270081282964), 'Vehicle loan or lease'))
As you can see, my DataFrame must have 4 different columns. The first one should be the int 491023, the second a tuple (I think DataFrames don't have a tuple type, so an array also works), the third another tuple, and the fourth a string. As you can see, my tuples have different sizes.
The simplest command, rdd.toDF(), doesn't work for me. Any ideas how I can achieve that?
You can create your dataframe like below; for the tuple columns you can pass an array (ArrayType()) or a plain list:
from pyspark.sql import functions as F
df_a = spark.createDataFrame([('N110WA',['12','34'],1590038340000)],[ "reg","val1","val2"])
Output
+------+--------+-------------+
| reg| val1| val2|
+------+--------+-------------+
|N110WA|[12, 34]|1590038340000|
+------+--------+-------------+
Schema
df_a.printSchema()
root
|-- reg: string (nullable = true)
|-- val1: array (nullable = true)
|    |-- element: string (containsNull = true)
|-- val2: long (nullable = true)
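For the RDD in the question itself, one option is to flatten the nested tuple into four fields and pass an explicit schema, since the variable-length tuples map naturally to ArrayType. A minimal sketch, assuming the RDD is bound to rdd and a SparkSession to spark (the column names here are made up):
from pyspark.sql.types import (StructType, StructField, IntegerType,
                               ArrayType, DoubleType, StringType)

# Hypothetical column names -- adjust them to whatever the values actually mean
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("indices", ArrayType(IntegerType()), True),
    StructField("weights", ArrayType(DoubleType()), True),
    StructField("label", StringType(), True),
])

# Flatten (id, ((ints...), (floats...), label)) into a 4-element tuple per row
flat_rdd = rdd.map(lambda r: (r[0], list(r[1][0]), list(r[1][1]), r[1][2]))
df = spark.createDataFrame(flat_rdd, schema)
df.printSchema()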
I want to implement functionality to remove an element from an array of structs in Spark Scala. For the date "2019-01-26" I want to remove the entire struct from the array column. Following is my code:
import org.apache.spark.sql.types._
val df = Seq(
  ("123", "Jack",   Seq(("2020-04-26","200","72","ABC"), ("2020-05-26","300","71","ABC"), ("2019-01-26","200","70","DEF"), ("2019-01-26","200","70","DEF"), ("2019-01-26","200","70","DEF"))),
  ("124", "jones",  Seq(("2020-04-26","200","72","ABC"), ("2020-05-26","300","71","ABC"), ("2020-06-26","200","70","ABC"), ("2020-08-26","300","69","ABC"), ("2020-08-26","300","69","ABC"))),
  ("125", "daniel", Seq(("2019-01-26","200","70","DEF"), ("2019-01-26","200","70","DEF"), ("2019-01-26","200","70","DEF"), ("2019-01-26","200","70","DEF"), ("2019-01-26","200","70","DEF")))
).toDF("id", "name", "history")
 .withColumn("history", $"history".cast("array<struct<infodate:Date,amount1:Integer,amount2:Integer,detail:string>>"))
scala> df.printSchema
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- history: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- infodate: date (nullable = true)
| | |-- amount1: integer (nullable = true)
| | |-- amount2: integer (nullable = true)
| | |-- detail: string (nullable = true)
So for the date 2019-01-26, I want to remove the structs in which it is present, so that they are removed from the array column. I want a solution like this.
I managed to find a solution, but it involves a lot of hardcoding, and I'm searching for a solution/suggestion that is optimal.
Hardcoded solution:
val dfnew=df
.withColumn( "history" ,
array_except(
col("history"),
array(
struct(
lit("2019-01-26").cast(DataTypes.DateType).alias("infodate"),
lit("200").cast(DataTypes.IntegerType).alias("amount1"),
lit("70").cast(DataTypes.IntegerType).alias("amount2"),
lit("DEF").alias("detail")
)
)
)
)
Is there any way of doing this optimally, with only one filter condition on the date "2019-01-26", that removes the matching structs from the array column?
I use an expression / filter here. Obviously it's a string, so you can replace the date with a value so that there is even less hard coding. Filters are handy expressions as they let you use SQL notation to reference sub-components of the struct.
scala> :paste
// Entering paste mode (ctrl-D to finish)
df.withColumn( "history" ,
expr( "filter( history , x -> x.infodate != '2019-01-26' )" )
).show(10,false)
// Exiting paste mode, now interpreting.
+---+------+--------------------------------------------------------------------------------------------------------------------------------------------+
|id |name |history |
+---+------+--------------------------------------------------------------------------------------------------------------------------------------------+
|123|Jack |[[2020-04-26, 200, 72, ABC], [2020-05-26, 300, 71, ABC]] |
|124|jones |[[2020-04-26, 200, 72, ABC], [2020-05-26, 300, 71, ABC], [2020-06-26, 200, 70, ABC], [2020-08-26, 300, 69, ABC], [2020-08-26, 300, 69, ABC]]|
|125|daniel|[] |
+---+------+--------------------------------------------------------------------------------------------------------------------------------------------+
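As a side note, the filter expression is just a SQL string, so the date can come from a variable instead of being hard coded. A small PySpark sketch of the same idea (the DataFrame and column names are the ones from the question):
from pyspark.sql import functions as F

drop_date = "2019-01-26"   # the date whose structs should be removed
df_clean = df.withColumn(
    "history",
    F.expr(f"filter(history, x -> x.infodate != '{drop_date}')")
)
df_clean.show(10, False)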
What is wrong with my code? I am using PySpark to convert the data type of a column.
company_df=company_df.withColumn("Revenue" ,company_df("Revenue").cast(DoubleType())) \
.withColumn("GROSS_PROFIT",company_df("GROSS_PROFIT").cast(DoubleType())) \
.withColumn("Net_Income" ,company_df("Net_Income").cast(DoubleType())) \
.withColumn("Enterprise_Value" ,company_df("Enterprise_Value").cast(DoubleType())) \
I am getting this error:
AttributeError: 'DataFrame' object has no attribute 'cast'
A short, clean, scalable solution
Change some columns, leave the rest untouched
import pyspark.sql.functions as F
# That's not part of the solution, just a creation of a sample dataframe
# df = spark.createDataFrame([(10, 1,2,3,4),(20, 5,6,7,8)],'Id int, Revenue int ,GROSS_PROFIT int ,Net_Income int ,Enterprise_Value int')
cols_to_cast = ["Revenue" ,"GROSS_PROFIT" ,"Net_Income" ,"Enterprise_Value"]
df = df.select([F.col(c).cast('double') if c in cols_to_cast else c for c in df.columns])
df.printSchema()
root
|-- Id: integer (nullable = true)
|-- Revenue: double (nullable = true)
|-- GROSS_PROFIT: double (nullable = true)
|-- Net_Income: double (nullable = true)
|-- Enterprise_Value: double (nullable = true)
If this helps:
from pyspark.sql.functions import col

df = spark.createDataFrame([(1, 0),
(2, 1),
(3 ,1),
(4, 1),
(5, 0),
(6 ,0),
(7, 1),
(8 ,1),
(9 ,1),
(10, 1),
(11, 0),
(12, 0)],
('Time' ,'Tag1'))
df = df.withColumn('a', col('Time').cast('integer')).withColumn('a1', col('Tag1').cast('double'))
df.printSchema()
df.show()
As an alternative to @wwnde's answer, you could do something as below. (Note that company_df("Revenue") is Scala syntax; in PySpark you reference a column as col("Revenue") or company_df["Revenue"].)
from pyspark.sql.functions import *
from pyspark.sql.types import *
company_df = (company_df.withColumn("Revenue_cast" , col("Revenue_cast").cast(DoubleType()))
.withColumn("GROSS_PROFIT_cast", col("GROSS_PROFIT").cast(DoubleType()))
.withColumn("Net_Income_cast" , col("Net_Income").cast(DoubleType()))
.withColumn("Enterprise_Value_cast", col("Enterprise_Value").cast(DoubleType()))
)
Or,
company_df = (company_df.withColumn("Revenue_cast" , company_df["Revenue"].cast(DoubleType()))
.withColumn("GROSS_PROFIT_cast", company_df["GROSS_PROFIT".cast(DoubleType()))
.withColumn("Net_Income_cast" , company_df["Net_Income".cast(DoubleType()))
.withColumn("Enterprise_Value_cast", company_df["Enterprise_Value"].cast(DoubleType()))
)
I have a parquet file with the following schema
root
|-- listOfMetrics: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Action: string (nullable = true)
| | |-- id: long (nullable = true)
| | |-- date: date (nullable = true)
| | |-- female_executives: double (nullable = true)
| | |-- male_executives: double (nullable = true)
| | |-- female_directors: double (nullable = true)
| | |-- male_directors: double (nullable = true)
| | |-- female_executives_and_directors: double (nullable = true)
| | |-- male_executives_and_directors: double (nullable = true)
| | |-- flag: integer (nullable = true)
and df.show() returns something like the following
+----------------------------------+
| listOfMetrics |
+----------------------------------+
| [[ADD, 5394, 2...|
| [[ADD, 527, 20...|
| [[ADD, 714, 20...|
| [[ADD, 765, 20...|
| [[ADD, 996, 20...|
| [[ADD, 146, 20...|
| [[ADD, 947, 20...|
+----------------------------------+
The 'Action' column is what I am targeting. This column can contain 'DELETE' or 'ADD', so based on this I need to separate the rows.
The approach that I took was to flatten, separate the rows using pyspark.sql, and then convert back to the original form, but I failed at the conversion step.
I have the following questions:
Is there a better way to do this in PySpark?
Can we dynamically convert the dataframe back to its original form after the transformation?
Can we separate the rows without flattening them?
I am new to Spark and finding it very difficult to get this working.
Thanks
It is possible to split each array into two parts using filter: one part containing only elements with ADD and one part containing only elements with DELETE. After that, each part becomes a separate row using stack. The original structure is kept and flattening the dataframe is not necessary.
I am using a slightly simplified set of test data:
root
|-- listOfMetrics: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Action: string (nullable = true)
| | |-- flag: long (nullable = true)
| | |-- id: long (nullable = true)
+----------------------------------------------------------------------------------+
|listOfMetrics |
+----------------------------------------------------------------------------------+
|[[ADD, 2, 1], [ADD, 4, 3], [DELETE, 6, 5], [DELETE, 8, 7], [ADD, 10, 9]] |
|[[ADD, 12, 11], [ADD, 14, 13], [DELETE, 16, 15], [DELETE, 18, 17], [ADD, 110, 19]]|
+----------------------------------------------------------------------------------+
The code:
df.withColumn("listOfMetrics", F.expr("""stack(2,
filter(listOfMetrics, e -> e.Action = 'ADD'),
filter(listOfMetrics, e -> e.Action = 'DELETE'))""")) \
.show(truncate=False)
Output:
+----------------------------------------------+
|listOfMetrics |
+----------------------------------------------+
|[[ADD, 2, 1], [ADD, 4, 3], [ADD, 10, 9]] |
|[[DELETE, 6, 5], [DELETE, 8, 7]] |
|[[ADD, 12, 11], [ADD, 14, 13], [ADD, 110, 19]]|
|[[DELETE, 16, 15], [DELETE, 18, 17]] |
+----------------------------------------------+
I have a large DataFrame in Spark 2.4.0 (Scala) that looks like this:
+--------------------+--------------------+--------------------+-------------------+--------------+------+
| cookie| updated_score| probability| date_last_score|partition_date|target|
+--------------------+--------------------+--------------------+-------------------+--------------+------+
|00000000000001074780| 0.1110987111481027| 0.27492987342938174|2019-03-29 16:00:00| 2019-04-07_10| 0|
|00000000000001673799| 0.02621894072693878| 0.2029688362968775|2019-03-19 08:00:00| 2019-04-07_10| 0|
|00000000000002147908| 0.18922034021212567| 0.3520678649755828|2019-03-31 19:00:00| 2019-04-09_12| 1|
|00000000000004028302| 0.06803669083452231| 0.23089047208736854|2019-03-25 17:00:00| 2019-04-07_10| 0|
and this schema:
root
|-- cookie: string (nullable = true)
|-- updated_score: double (nullable = true)
|-- probability: double (nullable = true)
|-- date_last_score: string (nullable = true)
|-- partition_date: string (nullable = true)
|-- target: integer (nullable = false)
Then I create a partitioned table and insert the data into database.table_name. But when I look at the Hive database and type show partitions database.table_name, I only get partition_date=0 and partition_date=1, and 0 and 1 are not values from the partition_date column.
I don't know if I wrote something wrong, if there are some Scala concepts that I don't understand, or if the dataframe is too large.
I've tried different ways to do this, looking at similar questions, such as:
result_df.write.mode(SaveMode.Overwrite).insertInto("table_name")
or
result_df.write.mode(SaveMode.Overwrite).saveAsTable("table_name")
In case it helps, I provide some INFO messages from the Scala job.
Looking at these messages, I think I got my result_df partitions properly:
19/07/31 07:53:57 INFO TaskSetManager: Starting task 11.0 in stage 2822.0 (TID 123456, ip-xx-xx-xx.aws.local.somewhere, executor 45, partition 11, PROCESS_LOCAL, 7767 bytes)
19/07/31 07:53:57 INFO TaskSetManager: Starting task 61.0 in stage 2815.0 (TID 123457, ip-xx-xx-xx-xyz.aws.local.somewhere, executor 33, partition 61, NODE_LOCAL, 8095 bytes)
Then it starts saving the partitions as Vector(0, 1, 2...), but maybe only 0 and 1 are saved? I don't really know.
19/07/31 07:56:02 INFO DAGScheduler: Submitting 35 missing tasks from ShuffleMapStage 2967 (MapPartitionsRDD[130590] at insertInto at evaluate_decay_factor.scala:165) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
19/07/31 07:56:02 INFO YarnScheduler: Adding task set 2967.0 with 35 tasks
19/07/31 07:56:02 INFO DAGScheduler: Submitting ShuffleMapStage 2965 (MapPartitionsRDD[130578] at insertInto at evaluate_decay_factor.scala:165), which has no missing parents
My code looks like this:
val createTableSQL = s"""
CREATE TABLE IF NOT EXISTS table_name (
cookie string,
updated_score float,
probability float,
date_last_score string,
target int
)
PARTITIONED BY (partition_date string)
STORED AS PARQUET
TBLPROPERTIES ('PARQUET.COMPRESSION'='SNAPPY')
"""
spark.sql(createTableSQL)
result_df.write.mode(SaveMode.Overwrite).insertInto("table_name")
Given a dataframe like this:
val result = Seq(
(8, "123", 1.2, 0.5, "bat", "2019-04-04_9"),
(64, "451", 3.2, -0.5, "mouse", "2019-04-04_12"),
(-27, "613", 8.2, 1.5, "horse", "2019-04-04_10"),
(-37, "513", 4.33, 2.5, "horse", "2019-04-04_11"),
(45, "516", -3.3, 3.4, "bat", "2019-04-04_10"),
(12, "781", 1.2, 5.5, "horse", "2019-04-04_11")
I want to run show partitions table_name on the Hive command line and get:
partition_date=2019-04-04_9
partition_date=2019-04-04_10
partition_date=2019-04-04_11
partition_date=2019-04-04_12
Instead, my output is:
partition_date=0
partition_date=1
In this simple example it works perfectly, but with my large dataframe I get the previous output.
To change the number of partitions, use repartition(numOfPartitions)
To change the column you partition by when writing, use partitionBy("col")
example used together: final_df.repartition(40).write.partitionBy("txnDate").mode("append").parquet(destination)
Two helpful hints:
Make your repartition size equal to the number of worker cores for quickest write / repartition. In this example, I have 10 executors, each with 4 cores (40 cores total). Thus, I set it to 40.
When you are writing to a destination, don't specify anything more than the sub bucket -- let spark handle the indexing.
good destination: "s3a://prod/subbucket/"
bad destination: s"s3a://prod/subbucket/txndate=$txndate"
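Putting those hints together for the question's table, a rough sketch in PySpark (the destination path and the repartition count are assumptions, not values from the question; the Scala calls have the same names):
(result_df
    .repartition(40)                     # ideally equal to your total executor cores
    .write
    .partitionBy("partition_date")
    .mode("overwrite")
    .parquet("s3a://prod/subbucket/"))   # hypothetical destination: bucket/prefix only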
Suppose I have a dataframe like this
val customer = Seq(
("C1", "Jackie Chan", 50, "Dayton", "M"),
("C2", "Harry Smith", 30, "Beavercreek", "M"),
("C3", "Ellen Smith", 28, "Beavercreek", "F"),
("C4", "John Chan", 26, "Dayton","M")
).toDF("cid","name","age","city","sex")
How can I get the cid values in one column and the rest of the values in an array<struct<column_name, column_value>> in Spark?
The only difficulty is that arrays must contain elements of the same type. Therefore, you need to cast all the columns to strings before putting them in an array (age is an int in your case). Here is how it goes:
val cols = customer.columns.tail
val result = customer.select('cid,
array(cols.map(c => struct(lit(c) as "name", col(c) cast "string" as "value")) : _*) as "array")
result.show(false)
+---+-----------------------------------------------------------+
|cid|array |
+---+-----------------------------------------------------------+
|C1 |[[name,Jackie Chan], [age,50], [city,Dayton], [sex,M]] |
|C2 |[[name,Harry Smith], [age,30], [city,Beavercreek], [sex,M]]|
|C3 |[[name,Ellen Smith], [age,28], [city,Beavercreek], [sex,F]]|
|C4 |[[name,John Chan], [age,26], [city,Dayton], [sex,M]] |
+---+-----------------------------------------------------------+
result.printSchema()
root
|-- cid: string (nullable = true)
|-- array: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- name: string (nullable = false)
| | |-- value: string (nullable = true)
You can do it using array and struct functions:
customer.select($"cid", array(struct(lit("name") as "column_name", $"name" as "column_value"), struct(lit("age") as "column_name", $"age" as "column_value") ))
will make:
|-- cid: string (nullable = true)
|-- array(named_struct(column_name, name AS `column_name`, NamePlaceholder(), name AS `column_value`), named_struct(column_name, age AS `column_name`, NamePlaceholder(), age AS `column_value`)): array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- column_name: string (nullable = false)
| | |-- column_value: string (nullable = true)
Map columns might be a better way to deal with the overall problem. You can keep different value types in the same map, without having to cast it to string.
from pyspark.sql.functions import array, col, create_map, lit, map_concat, when

df = df.select('cid',
               create_map(lit("name"), col("name"), lit("age"), col("age"),
                          lit("city"), col("city"), lit("sex"), col("sex")
               ).alias('map_col')
     )
Or wrap the map column in an array if you want it, as sketched below.
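A minimal sketch of that wrapping (the alias is made up), building on the map_col column created above:
df_arr = df.select('cid', array(col('map_col')).alias('map_array_col'))
df_arr.printSchema()   # map_array_col is an array containing the single map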
Either way, you can still do numerical or string transformations on the relevant key or value. For example:
df.select('*',
          map_concat(col('map_col'),
                     # the new map's value type should be compatible with map_col's
                     create_map(lit('u_age'), when(col('map_col')['age'] < 18, True)))
)
Hope that makes sense; I typed this straight in here, so forgive me if there's a bracket missing somewhere.