Hello guys, I have a dataframe that gets updated each day. Every day I need to add the new qte and the new ca to the old ones and update the date.
So I need to update the rows that already exist and add the new ones. Here is an example of what I would like to have at the end.
import org.apache.spark.sql.types.{DateType, DoubleType, LongType}
import spark.implicits._ // for the 'columnName symbol syntax

val histocaisse = spark.read
  .format("csv")
  .option("header", "true") // reading the headers
  .load("C:/Users/MHT/Desktop/histocaisse_dte1.csv")

val hist = histocaisse
  .withColumn("pos_id", 'pos_id.cast(LongType))
  .withColumn("article_id", 'article_id.cast(LongType))
  .withColumn("date", 'date.cast(DateType))
  .withColumn("qte", 'qte.cast(DoubleType))
  .withColumn("ca", 'ca.cast(DoubleType))
val histocaisse2 = spark.read
  .format("csv")
  .option("header", "true") // reading the headers
  .load("C:/Users/MHT/Desktop/histocaisse_dte2.csv")

val hist2 = histocaisse2
  .withColumn("pos_id", 'pos_id.cast(LongType))
  .withColumn("article_id", 'article_id.cast(LongType))
  .withColumn("date", 'date.cast(DateType))
  .withColumn("qte", 'qte.cast(DoubleType))
  .withColumn("ca", 'ca.cast(DoubleType))
hist.show(false)
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-07|2.5 |3.5 |
|2 |2 |2000-01-07|14.7|12.0|
|3 |3 |2000-01-07|3.5 |1.2 |
+------+----------+----------+----+----+
hist2.show(false)
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-08|2.5 |3.5 |
|2 |2 |2000-01-08|14.7|12.0|
|3 |3 |2000-01-08|3.5 |1.2 |
|4 |4 |2000-01-08|3.5 |1.2 |
|5 |5 |2000-01-08|14.5|1.2 |
|6 |6 |2000-01-08|2.0 |1.25|
+------+----------+----------+----+----+
And here is the result I would like to get:
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-08|5.0 |7.0 |
|2 |2 |2000-01-08|29.4|24.0|
|3 |3 |2000-01-08|7.0 |2.4 |
|4 |4 |2000-01-08|3.5 |1.2 |
|5 |5 |2000-01-08|14.5|1.2 |
|6 |6 |2000-01-08|2.0 |1.25|
+------+----------+----------+----+----+
Here is what I did:
val histoCombinaison2 = hist2.join(hist, Seq("article_id", "pos_id"), "left")
  .groupBy("article_id", "pos_id")
  .agg((hist2("qte") + hist("qte")) as "qte", (hist2("ca") + hist("ca")) as "ca", hist2("date"))
histoCombinaison2.show()
and I got the following exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: expression '`qte`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:218)
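For reference, the error means that every selected column which is not in the groupBy must be wrapped in an aggregate function. A minimal sketch of that, keeping the question's column names and assuming the intent is to sum matching rows (a union-then-aggregate illustration, not the approach taken in the answer below):
import org.apache.spark.sql.functions.{max, sum}

// every non-grouped column is wrapped in an aggregate, so the analyzer no longer complains
val summed = hist.union(hist2)
  .groupBy("article_id", "pos_id")
  .agg(sum("qte").as("qte"), sum("ca").as("ca"), max("date").as("date"))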
// import functions
import org.apache.spark.sql.functions.{coalesce, lit}

// We might not need groupBy here: after the join, all the information is already on the same row,
// so instead of using an aggregate function we simply combine the related fields into new columns.
val df = hist2.join(hist1, Seq("article_id", "pos_id"), "left")
  .select($"pos_id", $"article_id",
    coalesce(hist2("date"), hist1("date")).alias("date"),
    (coalesce(hist2("qte"), lit(0)) + coalesce(hist1("qte"), lit(0))).alias("qte"),
    (coalesce(hist2("ca"), lit(0)) + coalesce(hist1("ca"), lit(0))).alias("ca"))
  .orderBy("pos_id", "article_id")

df.show()
+------+----------+----------+----+----+
|pos_id|article_id| date| qte| ca|
+------+----------+----------+----+----+
| 1| 1|2000-01-08| 5.0| 7.0|
| 2| 2|2000-01-08|29.4|24.0|
| 3| 3|2000-01-08| 7.0| 2.4|
| 4| 4|2000-01-08| 3.5| 1.2|
| 5| 5|2000-01-08|14.5| 1.2|
| 6| 6|2000-01-08| 2.0|1.25|
+------+----------+----------+----+----+
Thanks.
As I mentioned in the comments, you should define your schema and use it when reading the CSV into a dataframe, as
import sqlContext.implicits._
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("pos_id", LongType, true),
  StructField("article_id", LongType, true),
  StructField("date", DateType, true),
  StructField("qte", DoubleType, true),
  StructField("ca", DoubleType, true)
))
val hist1 = sqlContext.read
.format("csv")
.option("header", "true")
.schema(schema)
.load("C:/Users/MHT/Desktop/histocaisse_dte1.csv")
hist1.show
val hist2 = sqlContext.read
.format("csv")
.option("header", "true") //reading the headers
.schema(schema)
.load("C:/Users/MHT/Desktop/histocaisse_dte2.csv")
hist2.show
Then you should use the when function to define the logic you need, as
val df = hist2.join(hist1, Seq("article_id", "pos_id"), "left")
.select($"pos_id", $"article_id",
when(hist2("date").isNotNull, hist2("date")).otherwise(when(hist1("date").isNotNull, hist1("date")).otherwise(lit(null))).alias("date"),
(when(hist2("qte").isNotNull, hist2("qte")).otherwise(lit(0)) + when(hist1("qte").isNotNull, hist1("qte")).otherwise(lit(0))).alias("qte"),
(when(hist2("ca").isNotNull, hist2("ca")).otherwise(lit(0)) + when(hist1("ca").isNotNull, hist1("ca")).otherwise(lit(0))).alias("ca"))
I hope the answer is helpful
I have the below dataframe:
scala> val df1 = Seq(
| ("1_10","2_20","3_30"),
| ("7_70","8_80","9_90")
| ).toDF("c1","c2","c3")
scala> df1.show
+----+----+----+
| c1| c2| c3|
+----+----+----+
|1_10|2_20|3_30|
|7_70|8_80|9_90|
+----+----+----+
How do I split these into different columns based on the delimiter "_"?
Expected output -
+----+----+----+----+----+----+
| c1| c2| c3|c1_1|c2_1|c3_1|
+----+----+----+----+----+----+
|1 |2 |3 | 10| 20| 30|
|7 |8 |9 | 70| 80| 90|
+----+----+----+----+----+----+
Also, I have 50+ columns in the DF. Thanks in advance.
This is a good use case for foldLeft. Split each column and create a new column for each split value.
val cols = df1.columns
cols.foldLeft(df1) { (acc, name) =>
  acc.withColumn(name, split(col(name), "_"))
    .withColumn(s"${name}_1", col(name).getItem(0))
    .withColumn(s"${name}_2", col(name).getItem(1))
}.drop(cols: _*)
  .show(false)
If you need the column names exactly as in your expected output, then you need to filter the columns that end with _1 and rename them again with foldLeft (see the sketch after the output below).
Output:
+----+----+----+----+----+----+
|c1_1|c1_2|c2_1|c2_2|c3_1|c3_2|
+----+----+----+----+----+----+
|1 |10 |2 |20 |3 |30 |
|7 |70 |8 |80 |9 |90 |
+----+----+----+----+----+----+
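A minimal sketch of that renaming step, assuming the foldLeft result above is kept in a val (here called splitDf) instead of being shown directly; the target names follow the expected output in the question:
val splitDf = cols.foldLeft(df1) { (acc, name) =>
  acc.withColumn(name, split(col(name), "_"))
    .withColumn(s"${name}_1", col(name).getItem(0))
    .withColumn(s"${name}_2", col(name).getItem(1))
}.drop(cols: _*)

// rename c1_1 -> c1 (first half) and c1_2 -> c1_1 (second half), as in the expected output
val renamed = splitDf.columns.foldLeft(splitDf) { (acc, name) =>
  if (name.endsWith("_1")) acc.withColumnRenamed(name, name.stripSuffix("_1"))
  else if (name.endsWith("_2")) acc.withColumnRenamed(name, name.stripSuffix("_2") + "_1")
  else acc
}
renamed.show(false)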
You can use the split method:
split(col("c1"), "_")
This will return an ArrayType(StringType) column.
Then you can access items with the .getItem(index) method.
That works if you have a stable number of elements after splitting; if that isn't the case, you will get null values wherever the indexed position isn't present in the array after splitting.
Example of code:
df.select(
  split(col("c1"), "_").alias("c1_items"),
  split(col("c2"), "_").alias("c2_items"),
  split(col("c3"), "_").alias("c3_items")
).select(
  col("c1_items").getItem(0).alias("c1"),
  col("c1_items").getItem(1).alias("c1_1"),
  col("c2_items").getItem(0).alias("c2"),
  col("c2_items").getItem(1).alias("c2_1"),
  col("c3_items").getItem(0).alias("c3"),
  col("c3_items").getItem(1).alias("c3_1")
)
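To illustrate the null behaviour mentioned above, here is a quick sketch on hypothetical values, where one row has no second element after the split:
import org.apache.spark.sql.functions.{col, split}
import spark.implicits._

Seq("1_10", "7").toDF("c1")
  .select(split(col("c1"), "_").getItem(1).alias("second_part"))
  .show(false)
// "1_10" yields "10"; "7" has no element at index 1 after splitting, so the result is null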
Since you need to do this for 50+ columns, I would suggest wrapping this in a method for a single column plus a withColumn statement, in this kind of way:
def splitMyCol(df: Dataset[_], name: String) = {
  df.withColumn(
    s"${name}_items", split(col(name), "_")
  ).withColumn(
    name, col(s"${name}_items").getItem(0)
  ).withColumn(
    s"${name}_1", col(s"${name}_items").getItem(1)
  ).drop(s"${name}_items")
}
Note that I assume you do not need the _items column to be kept, so I drop it. Also note that, because of the _ immediately following the variable inside the s"" string, you need to wrap the variable name in {}; when no identifier character follows the variable, $ alone is enough.
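A tiny illustration of that interpolation point, with a hypothetical value:
val name = "c1"
s"${name}_items" // "c1_items": braces are needed because '_' is a valid identifier character,
                 // so s"$name_items" would look for a variable called name_items
s"$name.items"   // "c1.items": '.' cannot be part of the identifier, so $ alone is enough here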
You can then wrap this in a fold, in this way:
val result = columnsToExpand.foldLeft(df)(
(acc, next) => splitMyCol(acc, next)
)
PySpark solution:
import pyspark.sql.functions as F

df1 = sqlContext.createDataFrame([("1_10", "2_20", "3_30"), ("7_70", "8_80", "9_90")]).toDF("c1", "c2", "c3")
expr = [F.split(coln, "_") for coln in df1.columns]
df2 = df1.select(*expr)
#%%
df3 = df2.withColumn("clctn", F.flatten(F.array(df2.columns)))
#%% assuming all columns will have data in the same format x_y
arr_size = len(df1.columns) * 2
df_fin = df3.select([F.expr("clctn[" + str(x) + "]").alias("c" + str(x // 2) + "_" + str(x % 2)) for x in range(arr_size)])
Results:
+----+----+----+----+----+----+
|c0_0|c0_1|c1_0|c1_1|c2_0|c2_1|
+----+----+----+----+----+----+
| 1| 10| 2| 20| 3| 30|
| 7| 70| 8| 80| 9| 90|
+----+----+----+----+----+----+
Try to use select instead of foldLeft for better performance, as foldLeft may take longer than a single select.
Check this post - foldLeft, select
val expr = df.columns
  .flatMap(c => Seq(
    split(col(c), "_")(0).as(s"${c}_1"),
    split(col(c), "_")(1).as(s"${c}_2")
  ))
  .toSeq
Result
df.select(expr:_*).show(false)
+----+----+----+----+----+----+
|c1_1|c1_2|c2_1|c2_2|c3_1|c3_2|
+----+----+----+----+----+----+
|1 |10 |2 |20 |3 |30 |
|7 |70 |8 |80 |9 |90 |
+----+----+----+----+----+----+
You can do it like this.
var df = Seq(("1_10","2_20","3_30"),("7_70","8_80","9_90")).toDF("c1","c2","c3")

for (cl <- df.columns) {
  df = df.withColumn(cl + "_temp", split(df.col(cl), "_")(0))
  df = df.withColumn(cl + "_" + cl.substring(1), split(df.col(cl), "_")(1))
  df = df.withColumn(cl, df.col(cl + "_temp")).drop(cl + "_temp")
}
df.show(false)
//Sample output
+---+---+---+----+----+----+
|c1 |c2 |c3 |c1_1|c2_2|c3_3|
+---+---+---+----+----+----+
|1 |2 |3 |10 |20 |30 |
|7 |8 |9 |70 |80 |90 |
+---+---+---+----+----+----+
Consider these two Dataframes:
+---+
|id |
+---+
|1 |
|2 |
|3 |
+---+
+---+-----+
|idz|word |
+---+-----+
|1 |bat |
|1 |mouse|
|2 |horse|
+---+-----+
I am doing a left join on id = idz:
val r = df1.join(df2, df1("id") === df2("idz"), "left_outer")
  .withColumn("ID_EMPLOYE_VENDEUR", when(col("word") =!= "null", col("word")).otherwise(null))
  .drop("word")
r.show(false)
+---+----+------------------+
|id |idz |ID_EMPLOYE_VENDEUR|
+---+----+------------------+
|1 |1 |mouse |
|1 |1 |bat |
|2 |2 |horse |
|3 |null|null |
+---+----+------------------+
But what if I only want to keep the lines whose id has exactly one matching idz? Otherwise, I would like to have null in ID_EMPLOYE_VENDEUR. The desired output is:
+---+----+------------------+
|id |idz |ID_EMPLOYE_VENDEUR|
+---+----+------------------+
|1 |1 |null | -- because the join resulted in two different lines
|2 |2 |horse |
|3 |null|null |
+---+----+------------------+
I should point out that I am working on a large DF, so the solution should not be too expensive in time.
Thank you
Since, as you mentioned, your data is too large, groupBy is not a good option for grouping the data; use a window function and then join, as below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

def windowSpec = Window.partitionBy("idz")

val newDF = df2.withColumn("count", count("idz").over(windowSpec))
  .dropDuplicates("idz")
  .withColumn("word", when(col("count") >= 2, lit(null)).otherwise(col("word")))
  .drop("count")

val r = df1.join(newDF, df1("id") === newDF("idz"), "left_outer")
  .withColumn("ID_EMPLOYE_VENDEUR", when(col("word") =!= "null", col("word")).otherwise(null))
  .drop("word")

r.show
+---+----+------------------+
| id| idz|ID_EMPLOYE_VENDEUR|
+---+----+------------------+
| 1| 1| null|
| 3|null| null|
| 2| 2| horse|
+---+----+------------------+
You can easily retrieve the information that more than one of df2's idz matched a single df1's id with a groupBy and a join.
r.join(
    r.groupBy("id").count().as("g"),
    $"g.id" === r("id")
  )
  .withColumn(
    "ID_EMPLOYE_VENDEUR",
    expr("if(count != 1, null, ID_EMPLOYE_VENDEUR)")
  )
  .drop($"g.id").drop("count")
  .distinct()
  .show()
Note: neither the groupBy nor the join triggers an additional exchange step (shuffle across the network), because the dataframe r is already partitioned on id (it is the result of a join on id).
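If you want to check that on your own data, you can inspect the physical plan (a sketch; result is assumed to hold the dataframe built by the snippet above, before the final .show()):
// look for Exchange operators between the join and the aggregation; ideally the existing
// hash partitioning on "id" is reused and no extra shuffle appears
result.explain(true)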
I am having difficulty extracting email domains. I have the below dataframe.
+---+----------------+
|id |email |
+---+----------------+
|1 |ii#koko.com |
|2 |lol#fsa.org |
|3 |kokojambo#mon.eu|
+---+----------------+
Now I want to have a new field for the domain, which would give:
+---+----------------+------+
|id |email |domain|
+---+----------------+------+
|1 |ii#koko.com |koko |
|2 |lol#fsa.org |fsa |
|3 |kokojambo#mon.eu|mon |
+---+----------------+------+
I tried to do something like this:
val test = df_short.withColumn("email", split($"email", "#."))
But I got the wrong output. Can anybody point me in the right direction?
You can simply use the inbuilt regexp_extract function to get the domain name from the email address.
//create an example dataframe
val df = Seq((1, "ii#koko.com"),
(2, "lol#fsa.org"),
(3, "kokojambo#mon.eu"))
.toDF("id", "email")
//original dataframe
df.show(false)
//output
// +---+----------------+
// |id |email |
// +---+----------------+
// |1 |ii#koko.com |
// |2 |lol#fsa.org |
// |3 |kokojambo#mon.eu|
// +---+----------------+
//using regex get the domain name
df.withColumn("domain",
regexp_extract($"email", "(?<=#)[^.]+(?=\\.)", 0))
.show(false)
//output
// +---+----------------+------+
// |id |email |domain|
// +---+----------------+------+
// |1 |ii#koko.com |koko |
// |2 |lol#fsa.org |fsa |
// |3 |kokojambo#mon.eu|mon |
// +---+----------------+------+
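For reference, a quick breakdown of the regex used above, checked on one hypothetical value ("#" is kept as the separator because that is what appears in this data): (?<=#) is a look-behind for the separator, [^.]+ matches the domain label, and (?=\.) is a look-ahead for the following dot.
import org.apache.spark.sql.functions.regexp_extract
import spark.implicits._

Seq((4, "someone#example.com")).toDF("id", "email")
  .select(regexp_extract($"email", "(?<=#)[^.]+(?=\\.)", 0).alias("domain"))
  .show(false) // prints "example"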
You can do it like this:
import org.apache.spark.sql.functions._
df.withColumn("domain", split(split(df.col("email"),"#")(1),"\\.")(0)).show
Sample Input:
+---------------+
| email|
+---------------+
|manoj#gmail.com|
| abc#ac.in|
+---------------+
Sample Output:
+---------------+------+
| email|domain|
+---------------+------+
|manoj#gmail.com| gmail|
| abc#ac.in| ac|
+---------------+------+
You can create a window to count the number of times a record has occurred in the last 7 days. However, if you try to look at the number of times the record has occurred at the millisecond level, it breaks down.
In short, the expression df.timestamp.astype('Timestamp').cast("long") only converts the timestamp to a long at the granularity of a second; it ignores the milliseconds. How do you turn the entire timestamp, milliseconds included, into a long? The value needs to be a long so that it will work with the window.
from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import unix_timestamp
df = sqlContext.createDataFrame([
("a", "u8u", "2018-02-02 05:46:41.438357"),
("a", "u8u", "2018-02-02 05:46:41.439377"),
("a", "a3a", "2018-02-02 09:48:34.081818"),
("a", "a3a", "2018-02-02 09:48:34.095586"),
("a", "g8g", "2018-02-02 09:48:56.006206"),
("a", "g8g", "2018-02-02 09:48:56.007974"),
("a", "9k9", "2018-02-02 12:50:48.000000"),
("a", "9k9", "2018-02-02 12:50:48.100000"),
], ["person_id", "session_id", "timestamp"])
df = df.withColumn('unix_ts',df.timestamp.astype('Timestamp').cast("long"))
df = df.withColumn("DayOfWeek",F.date_format(df.timestamp, 'EEEE'))
w = Window.partitionBy('person_id','DayOfWeek').orderBy('unix_ts').rangeBetween(-86400*7,-1)
df = df.withColumn('count',F.count('unix_ts').over(w))
df.sort(df.unix_ts).show(20,False)
+---------+----------+--------------------------+----------+---------+-----+
|person_id|session_id|timestamp |unix_ts |DayOfWeek|count|
+---------+----------+--------------------------+----------+---------+-----+
|a |u8u |2018-02-02 05:46:41.438357|1517572001|Friday |0 |
|a |u8u |2018-02-02 05:46:41.439377|1517572001|Friday |0 |
|a |a3a |2018-02-02 09:48:34.081818|1517586514|Friday |2 |
|a |a3a |2018-02-02 09:48:34.095586|1517586514|Friday |2 |
|a |g8g |2018-02-02 09:48:56.006206|1517586536|Friday |4 |
|a |g8g |2018-02-02 09:48:56.007974|1517586536|Friday |4 |
|a |9k9 |2018-02-02 12:50:48.000000|1517597448|Friday |6 |
|a |9k9 |2018-02-02 12:50:48.100000|1517597448|Friday |6 |
+---------+----------+--------------------------+----------+---------+-----+
The count should be 0,1,2,3,4,5... instead of 0,0,2,2,4,4,...
You can use pyspark.sql.functions.unix_timestamp() to convert the string column to a timestamp instead of casting to long.
import pyspark.sql.functions as F
df.select(
"timestamp",
F.unix_timestamp(F.col("timestamp"), "yyyy-MM-dd HH:mm:ss.SSSSSS").alias("unix_ts")
).show(truncate=False)
#+--------------------------+----------+
#|timestamp |unix_ts |
#+--------------------------+----------+
#|2018-02-02 05:46:41.438357|1517568839|
#|2018-02-02 05:46:41.439377|1517568840|
#|2018-02-02 09:48:34.081818|1517582995|
#|2018-02-02 09:48:34.095586|1517583009|
#|2018-02-02 09:48:56.006206|1517582942|
#|2018-02-02 09:48:56.007974|1517582943|
#|2018-02-02 12:50:48.862644|1517594710|
#|2018-02-02 12:50:49.981848|1517594830|
#+--------------------------+----------+
The second argument to unix_timestamp() is the format string. In your case, use "yyyy-MM-dd HH:mm:ss.SSSSSS".
The corresponding change applied to your code would be:
df = df.withColumn(
'unix_ts',
F.unix_timestamp(F.col("timestamp"), "yyyy-MM-dd HH:mm:ss.SSSSSS")
)
df = df.withColumn("DayOfWeek", F.date_format(df.timestamp, 'EEEE'))
w = Window.partitionBy('person_id','DayOfWeek').orderBy('unix_ts').rangeBetween(-86400*7,-1)
df = df.withColumn('count',F.count('unix_ts').over(w))
df.sort(df.unix_ts).show(20,False)
#+---------+----------+--------------------------+----------+---------+-----+
#|person_id|session_id|timestamp |unix_ts |DayOfWeek|count|
#+---------+----------+--------------------------+----------+---------+-----+
#|a |u8u |2018-02-02 05:46:41.438357|1517568839|Friday |0 |
#|a |u8u |2018-02-02 05:46:41.439377|1517568840|Friday |1 |
#|a |g8g |2018-02-02 09:48:56.006206|1517582942|Friday |2 |
#|a |g8g |2018-02-02 09:48:56.007974|1517582943|Friday |3 |
#|a |a3a |2018-02-02 09:48:34.081818|1517582995|Friday |4 |
#|a |a3a |2018-02-02 09:48:34.095586|1517583009|Friday |5 |
#|a |9k9 |2018-02-02 12:50:48.862644|1517594710|Friday |6 |
#|a |9k9 |2018-02-02 12:50:49.981848|1517594830|Friday |7 |
#+---------+----------+--------------------------+----------+---------+-----+
Input Spark dataframe df (OLTP):
+----+---------+------+
|name|date     |amount|
+----+---------+------+
|abc |4/6/2018 | 100  |
|abc |4/6/2018 | 200  |
|abc |4/13/2018| 300  |
+----+---------+------+
Expected DF (OLAP):
+----+---------+------+
|name|date     |amount|
+----+---------+------+
|abc |4/6/2018 | 100  |
|abc |4/6/2018 | 200  |
|abc |4/13/2018| 100  |
|abc |4/13/2018| 200  |
|abc |4/13/2018| 300  |
+----+---------+------+
My code:
val df = df1.union(df1)
+----+---------+------+
|name|date |amount|
+----+---------+------+
|abc |4/6/2018 |100 |
|abc |4/6/2018 |200 |
|abc |4/13/2018|300 |
|abc |4/6/2018 |100 |
|abc |4/6/2018 |200 |
|abc |4/13/2018|300 |
+----+---------+------+
val w1 = org.apache.spark.sql.expressions.Window.orderBy("date")
val ExpectedDF = df.withColumn("previousAmount", lag("amount", 1).over(w1))
  .withColumn("newdate", lag("date", 1).over(w1))
ExpectedDF.show(false)
+----+---------+------+--------------+---------+
|name|date |amount|previousAmount|newdate |
+----+---------+------+--------------+---------+
|abc |4/13/2018|300 |null |null |
|abc |4/13/2018|300 |300 |4/13/2018|
|abc |4/6/2018 |100 |300 |4/13/2018|
|abc |4/6/2018 |200 |100 |4/6/2018 |
|abc |4/6/2018 |100 |200 |4/6/2018 |
|abc |4/6/2018 |200 |100 |4/6/2018 |
+----+---------+------+--------------+---------+
def main(args: Array[String]) {
  val conf = new SparkConf().setAppName("Excel-read-write").setMaster("local")
  val sc = new SparkContext(conf)
  val sqlc = new org.apache.spark.sql.SQLContext(sc)
  val ss = SparkSession.builder().master("local").appName("Excel-read-write").getOrCreate()
  import ss.sqlContext.implicits._

  // read the existing (old) records
  var df1 = sqlc.read.format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("oldRecords.csv")
  df1.show(false)
  println("---- df1 row count ----" + df1.count())

  if (df1.count() > 0) {
    for (i <- 0 until (df1.count().toInt) - 1) {
      // duplicate the current rows
      var df2 = df1.unionAll(df1) //.union(df1)//df3
      //df2.show(false)
      var w1 = org.apache.spark.sql.expressions.Window.orderBy("date")
      // carry the previous row's amount and date onto each row
      var df3 = df2.withColumn("previousAmount", lag("amount", 1).over(w1))
        .withColumn("newdate", lag("date", 1).over(w1))
      // df3.show(false)
      // keep only rows that received a carried-over date
      var df4 = df3.filter(df3.col("newdate").isNotNull) //(df3.col("new_date").isNotNull)
      //df4.show(false)
      var df5 = df4.select("name", "amount", "newdate").distinct()
      println("-----------" + df5.show(false))
      df1 = df5.withColumnRenamed("newdate", "date")
    }
  }
}