I am running a logistic regression model in Scala and I have a DataFrame like the one below:
|x |y |
| 0| 0|
| 0| 33|
| 0| 58|
| 0| 96|
| 0| 1|
| 1| 21|
| 0| 10|
| 0| 65|
| 1| 7|
| 1| 28|
I need to transform this into something like this:
|label| features |
| 0.0|(1,[1],[0]) |
| 0.0|(1,[1],[33]) |
| 0.0|(1,[1],[58]) |
| 0.0|(1,[1],[96]) |
| 0.0|(1,[1],[1]) |
| 1.0|(1,[1],[21]) |
| 0.0|(1,[1],[10]) |
| 0.0|(1,[1],[65]) |
| 1.0|(1,[1],[7]) |
| 1.0|(1,[1],[28]) |
I tried
val lr = new LogisticRegression()
val assembler = new VectorAssembler()
var lrModel= lr.fit(daf.withColumnRenamed("x","label").withColumnRenamed("y","features"))
Any help is appreciated.
Given the dataframe as
|x |y |
|0 |0 |
|0 |33 |
|0 |58 |
|0 |96 |
|0 |1 |
|1 |21 |
|0 |10 |
|0 |65 |
|1 |7 |
|1 |28 |
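and transforming it as below
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.types.DoubleType
val assembler = new VectorAssembler()
  .setInputCols(Array("x", "y"))
  .setOutputCol("features")
val output = assembler.transform(df).select($"x".cast(DoubleType).as("label"), $"features")
would give you the result below. The first row prints as the sparse vector (2,[],[]) because both of its values are zero. Note that x is used here both as the label and as one of the features; to get the single-feature layout from the question, pass Array("y") to setInputCols instead.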
|label|features |
|0.0 |(2,[],[]) |
|0.0 |[0.0,33.0]|
|0.0 |[0.0,58.0]|
|0.0 |[0.0,96.0]|
|0.0 |[0.0,1.0] |
|1.0 |[1.0,21.0]|
|0.0 |[0.0,10.0]|
|0.0 |[0.0,65.0]|
|1.0 |[1.0,7.0] |
|1.0 |[1.0,28.0]|
Now using LogisticRegression is easy:
import org.apache.spark.ml.classification.LogisticRegression
val lr = new LogisticRegression()
val lrModel = lr.fit(output)
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
You will have output as
Coefficients: [1.5672602877378823,0.0] Intercept: -1.4055020984891717
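To make the target layout from the question concrete, here is a minimal plain-Python sketch (no Spark involved) of the row-level transformation: x becomes the label and y becomes the only feature. The helper name `to_labeled_point` is made up for illustration and is not part of any Spark API.

```python
# Plain Python, no Spark: each (x, y) row becomes (label, features).
rows = [(0, 0), (0, 33), (0, 58), (0, 96), (0, 1),
        (1, 21), (0, 10), (0, 65), (1, 7), (1, 28)]


def to_labeled_point(row):
    """Hypothetical helper: x is the label, y is the single feature."""
    x, y = row
    return float(x), [float(y)]


labeled = [to_labeled_point(r) for r in rows]
for label, features in labeled[:3]:
    print(label, features)
```

In Spark the same shape would come from assembling only the y column into the feature vector and casting x to a double label.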
I have two PySpark DataFrames, one with transactions and one with user names:
From id To id Price Date
a b 20 30/05/2019
b c 5 30/05/2019
c a 20 30/05/2019
a d 10 02/06/2019
d c 5 02/06/2019
id Name
a Claudia
b Manuella
c Remy
d Paul
The output that I want is:
Date Name current balance
30/05/2019 Claudia 0
30/05/2019 Manuella 15
30/05/2019 Remy -15
30/05/2019 Paul 0
02/06/2019 Claudia -10
02/06/2019 Manuella 15
02/06/2019 Remy -10
02/06/2019 Paul 5
I want to get the current balance on each day for all users.
My idea is to group by user and compute the sum of Price where the user is in the To column minus the sum where the user is in the From column. But how do I do this per day, given that the balance is cumulative rather than per day?
Thank you.
It took a bit of effort to get the requirements right. Here's my version of the solution.
import pyspark.sql.functions as F
from pyspark import SparkContext, SQLContext
from pyspark.sql import Window

sc = SparkContext('local')
sqlContext = SQLContext(sc)

# Transactions: (sender, receiver, amount, date as dd/MM/yyyy)
data1 = [
    ("a", "b", 20, "30/05/2019"),
    ("b", "c", 5, "30/05/2019"),
    ("c", "a", 20, "30/05/2019"),
    ("a", "d", 10, "02/06/2019"),
    ("d", "c", 5, "02/06/2019"),
]
df1Columns = ["From_Id", "To_Id", "Price", "Date"]
df1 = sqlContext.createDataFrame(data=data1, schema=df1Columns)
df1 = df1.withColumn("Date", F.to_date(F.to_timestamp("Date", 'dd/MM/yyyy')))
print("Actual initial data")
df1.show(truncate=False)

data2 = [("a", "Claudia"), ("b", "Manuella"), ("c", "Remy"), ("d", "Paul")]
df2Columns = ["id", "Name"]
df2 = sqlContext.createDataFrame(data=data2, schema=df2Columns)
print("Actual initial data")
df2.show(truncate=False)

# One row per (date, user), so users with no transactions on a day still appear
alldays_df = df1.select("Date").distinct().repartition(20)
allusers_df = df2.select("id").distinct().repartition(10)
crossjoin_df = alldays_df.crossJoin(allusers_df)
crossjoin_df = crossjoin_df.withColumn("initial", F.lit(0))
crossjoin_df = crossjoin_df.withColumnRenamed("id", "common_id").cache()
crossjoin_df.show(n=40, truncate=False)

# Amount sent per user per day, defaulting to 0 where there is none
from_sum_df = df1.groupby("Date", "From_Id").agg(F.sum("Price").alias("from_sum"))
from_sum_df = from_sum_df.withColumnRenamed("From_Id", "common_id")
from_sum_df.show(truncate=False)
from_sum_df = crossjoin_df.alias('cross').join(
    from_sum_df.alias('from'), ['Date', 'common_id'], how='outer'
).select('Date', 'common_id',
         F.coalesce('from.from_sum', 'cross.initial').alias('from_amount')).cache()
from_sum_df.show(truncate=False)

# Amount received per user per day, defaulting to 0 where there is none
to_sum_df = df1.groupby("Date", "To_Id").agg(F.sum("Price").alias("to_sum"))
to_sum_df = to_sum_df.withColumnRenamed("To_Id", "common_id")
to_sum_df.show(truncate=False)
to_sum_df = crossjoin_df.alias('cross').join(
    to_sum_df.alias('to'), ['Date', 'common_id'], how='outer'
).select('Date', 'common_id',
         F.coalesce('to.to_sum', 'cross.initial').alias('to_amount')).cache()
to_sum_df.show(truncate=False)

# Daily net change, then a running total per user ordered by date
joined_df = to_sum_df.join(from_sum_df, ["Date", "common_id"], how='inner')
joined_df.show(truncate=False)
balance_df = joined_df.withColumn("balance", F.col("to_amount") - F.col("from_amount"))
balance_df.show(truncate=False)
final_df = balance_df.join(df2, F.col("id") == F.col("common_id"))
final_df.show(truncate=False)
final_cum_sum = final_df.withColumn('cumsum_balance', F.sum('balance').over(
    Window.partitionBy('common_id').orderBy('Date')
          .rowsBetween(Window.unboundedPreceding, Window.currentRow)))
final_cum_sum.show()
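The output of each step follows, in order: the two inputs, the date and user cross join, the per-day outgoing and incoming sums (before and after filling in zeros), the joined amounts, the daily balance, the join with the names, and finally the cumulative balance.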
Actual initial data
|From_Id|To_Id|Price|Date |
|a |b |20 |2019-05-30|
|b |c |5 |2019-05-30|
|c |a |20 |2019-05-30|
|a |d |10 |2019-06-02|
|d |c |5 |2019-06-02|
Actual initial data
|id |Name |
|a |Claudia |
|b |Manuella|
|c |Remy |
|d |Paul |
|Date |common_id|initial|
|2019-05-30|a |0 |
|2019-05-30|d |0 |
|2019-05-30|b |0 |
|2019-05-30|c |0 |
|2019-06-02|a |0 |
|2019-06-02|d |0 |
|2019-06-02|b |0 |
|2019-06-02|c |0 |
|Date |common_id|from_sum|
|2019-06-02|a |10 |
|2019-05-30|a |20 |
|2019-06-02|d |5 |
|2019-05-30|c |20 |
|2019-05-30|b |5 |
|Date |common_id|from_amount|
|2019-06-02|a |10 |
|2019-06-02|c |0 |
|2019-05-30|a |20 |
|2019-05-30|d |0 |
|2019-06-02|b |0 |
|2019-06-02|d |5 |
|2019-05-30|c |20 |
|2019-05-30|b |5 |
|Date |common_id|to_sum|
|2019-06-02|c |5 |
|2019-05-30|a |20 |
|2019-06-02|d |10 |
|2019-05-30|c |5 |
|2019-05-30|b |20 |
|Date |common_id|to_amount|
|2019-06-02|a |0 |
|2019-06-02|c |5 |
|2019-05-30|a |20 |
|2019-05-30|d |0 |
|2019-06-02|b |0 |
|2019-06-02|d |10 |
|2019-05-30|c |5 |
|2019-05-30|b |20 |
|Date |common_id|to_amount|from_amount|
|2019-06-02|a |0 |10 |
|2019-06-02|c |5 |0 |
|2019-05-30|a |20 |20 |
|2019-05-30|d |0 |0 |
|2019-06-02|b |0 |0 |
|2019-06-02|d |10 |5 |
|2019-05-30|c |5 |20 |
|2019-05-30|b |20 |5 |
|Date |common_id|to_amount|from_amount|balance|
|2019-06-02|a |0 |10 |-10 |
|2019-06-02|c |5 |0 |5 |
|2019-05-30|a |20 |20 |0 |
|2019-05-30|d |0 |0 |0 |
|2019-06-02|b |0 |0 |0 |
|2019-06-02|d |10 |5 |5 |
|2019-05-30|c |5 |20 |-15 |
|2019-05-30|b |20 |5 |15 |
|Date |common_id|to_amount|from_amount|balance|id |Name |
|2019-05-30|a |20 |20 |0 |a |Claudia |
|2019-06-02|a |0 |10 |-10 |a |Claudia |
|2019-05-30|b |20 |5 |15 |b |Manuella|
|2019-06-02|b |0 |0 |0 |b |Manuella|
|2019-05-30|c |5 |20 |-15 |c |Remy |
|2019-06-02|c |5 |0 |5 |c |Remy |
|2019-06-02|d |10 |5 |5 |d |Paul |
|2019-05-30|d |0 |0 |0 |d |Paul |
| Date|common_id|to_amount|from_amount|balance| id| Name|cumsum_balance|
|2019-05-30| d| 0| 0| 0| d| Paul| 0|
|2019-06-02| d| 10| 5| 5| d| Paul| 5|
|2019-05-30| c| 5| 20| -15| c| Remy| -15|
|2019-06-02| c| 5| 0| 5| c| Remy| -10|
|2019-05-30| b| 20| 5| 15| b|Manuella| 15|
|2019-06-02| b| 0| 0| 0| b|Manuella| 15|
|2019-05-30| a| 20| 20| 0| a| Claudia| 0|
|2019-06-02| a| 0| 10| -10| a| Claudia| -10|
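The arithmetic behind the steps above can be checked without Spark. The following is a minimal plain-Python sketch, not a replacement for the DataFrame code: it nets each day's incoming and outgoing amounts per user and then keeps a running total across days. The function name `running_balances` is made up for illustration.

```python
from collections import defaultdict
from datetime import date

# Same transactions as the question: (sender, receiver, price, day).
transactions = [
    ("a", "b", 20, date(2019, 5, 30)),
    ("b", "c", 5, date(2019, 5, 30)),
    ("c", "a", 20, date(2019, 5, 30)),
    ("a", "d", 10, date(2019, 6, 2)),
    ("d", "c", 5, date(2019, 6, 2)),
]
names = {"a": "Claudia", "b": "Manuella", "c": "Remy", "d": "Paul"}


def running_balances(transactions, users):
    """Hypothetical helper: cumulative balance for every (day, user) pair."""
    daily = defaultdict(int)  # (day, user) -> net change on that day
    for sender, receiver, price, day in transactions:
        daily[(day, sender)] -= price
        daily[(day, receiver)] += price
    balance = {user: 0 for user in users}
    result = []
    for day in sorted({t[3] for t in transactions}):
        for user in users:
            balance[user] += daily[(day, user)]  # 0 when the user was idle
            result.append((day, user, balance[user]))
    return result


balances = running_balances(transactions, sorted(names))
for day, user, amount in balances:
    print(day.isoformat(), names[user], amount)
```

Iterating over every user on every day plays the role of the cross join, and the running `balance` dictionary plays the role of the window sum ordered by date.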
I have the below df:
|student| vars|observed|
| 1| ABC | 19|
| 1| ABC | 1|
| 2| CDB | 1|
| 1| ABC | 8|
| 3| XYZ | 3|
| 1| ABC | 389|
| 2| CDB | 946|
| 1| ABC | 342|
I want to add a new frequency column by grouping on the two columns "student" and "vars" in Scala.
val frequency = df.groupBy($"student", $"vars").count()
This code generates a "count" column with the frequencies, but it loses the observed column from the df.
I would like to create a new df as follows, without losing the "observed" column:
|student| vars|observed|total_count|
|      1| ABC |       9|         22|
|      1| ABC |       1|         22|
|      2| CDB |       1|          7|
|      1| ABC |       2|         22|
|      3| XYZ |       3|          3|
|      1| ABC |       8|         22|
|      2| CDB |       6|          7|
|      1| ABC |       2|         22|
You cannot do this directly with groupBy alone, but there are a couple of ways:
You can join the original df with the count df.
You can collect the observed column while doing the aggregation and explode it again.
With explode:
val frequency = df.groupBy("student", "vars")
  .agg(collect_list("observed").as("observed_list"), count("*").as("total_count"))
  .select($"student", $"vars", explode($"observed_list").alias("observed"), $"total_count")
scala> frequency.show(false)
|3 |XYZ |3 |1 |
|2 |CDB |1 |2 |
|2 |CDB |946 |2 |
|1 |ABC |389 |5 |
|1 |ABC |342 |5 |
|1 |ABC |19 |5 |
|1 |ABC |1 |5 |
|1 |ABC |8 |5 |
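We can use Window functions as well:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count}
val windowSpec = Window.partitionBy("student", "vars")
val frequency = df.withColumn("total_count", count(col("student")).over(windowSpec))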
|3 |XYZ |3 |1 |
|2 |CDB |1 |2 |
|2 |CDB |946 |2 |
|1 |ABC |389 |5 |
|1 |ABC |342 |5 |
|1 |ABC |19 |5 |
|1 |ABC |1 |5 |
|1 |ABC |8 |5 |
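Both approaches boil down to computing one count per (student, vars) group and attaching it to every row of that group. As a rough plain-Python sketch of that idea (no Spark, using the rows from the question):

```python
from collections import Counter

# (student, vars, observed) rows from the question.
rows = [(1, "ABC", 19), (1, "ABC", 1), (2, "CDB", 1), (1, "ABC", 8),
        (3, "XYZ", 3), (1, "ABC", 389), (2, "CDB", 946), (1, "ABC", 342)]

# First pass: size of each (student, vars) group.
group_sizes = Counter((student, var) for student, var, _ in rows)

# Second pass: keep every original row and append its group's size.
with_counts = [(s, v, o, group_sizes[(s, v)]) for s, v, o in rows]
for row in with_counts:
    print(row)
```

The window version does the same thing in a single expression: the partition defines the group and the count is evaluated once per partition but emitted on every row.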
This is an extension of this question, Apache Spark group by combining types and sub types.
val sales = Seq(
("Warsaw", 2016, "facebook","share",100),
("Warsaw", 2017, "facebook","like",200),
("Boston", 2015,"twitter","share",50),
("Boston", 2016,"facebook","share",150),
("Toronto", 2017,"twitter","like",50)
).toDF("city", "year","media","action","amount")
That solution works, but here the counts need to go into different categories conditionally.
So the output should look like this:
| Boston|facebook| 1|
| Boston| share1 | 2|
| Boston| share2 | 2|
| Boston| twitter| 1|
|Toronto| twitter| 1|
|Toronto| like | 1|
| Warsaw|facebook| 2|
| Warsaw|share1 | 1|
| Warsaw|share2 | 1|
| Warsaw|like | 1|
Here, if the action is share, I need it counted in both share1 and share2. When I count it programmatically, I use a case statement: when the action is share, share1 = share1 + 1 and share2 = share2 + 1.
But how can I do this in Scala, PySpark, or SQL?
A simple filter and a few unions should give you the desired output:
import org.apache.spark.sql.functions.lit
val media = sales.groupBy("city", "media").count()
val action = sales.groupBy("city", "action").count().select($"city", $"action".as("media"), $"count")
val share = action.filter($"media" === "share")
media.union(action.filter($"media" =!= "share"))
  .union(share.withColumn("media", lit("share1")))
  .union(share.withColumn("media", lit("share2")))
  .show(false)
which should give you
|city |media |count|
|Boston |facebook|1 |
|Boston |twitter |1 |
|Toronto|twitter |1 |
|Warsaw |facebook|2 |
|Warsaw |like |1 |
|Toronto|like |1 |
|Boston |share1 |2 |
|Warsaw |share1 |1 |
|Boston |share2 |2 |
|Warsaw |share2 |1 |
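The counting rule itself (one row can feed several output categories) can be sketched in plain Python, independent of Spark. The `EXPANSION` mapping below is a made-up name for illustration: it lists which categories each action contributes to, and any action not listed counts as itself.

```python
from collections import Counter

# (city, year, media, action, amount) rows from the question.
sales = [
    ("Warsaw", 2016, "facebook", "share", 100),
    ("Warsaw", 2017, "facebook", "like", 200),
    ("Boston", 2015, "twitter", "share", 50),
    ("Boston", 2016, "facebook", "share", 150),
    ("Toronto", 2017, "twitter", "like", 50),
]

# Hypothetical mapping: a "share" is counted under both share1 and share2.
EXPANSION = {"share": ["share1", "share2"]}

counts = Counter()
for city, _year, media, action, _amount in sales:
    counts[(city, media)] += 1
    for category in EXPANSION.get(action, [action]):
        counts[(city, category)] += 1

for (city, category), n in sorted(counts.items()):
    print(city, category, n)
```

The union-based Spark answer achieves the same effect by duplicating the aggregated share rows under two labels instead of expanding each input row.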
I have a DataFrame like this (columns: webService_Name, responseTime, numberOfSameTime):
| webservice1| 80| 1|
| webservice1| 87| 2|
| webservice1| 283| 1|
| webservice2| 77| 2|
| webservice2| 80| 1|
| webservice2| 81| 1|
| webservice3| 63| 3|
| webservice3| 145| 1|
| webservice4| 167| 1|
| webservice4| 367| 2|
| webservice4| 500| 1|
and I want to get a result like this :
| webservice1| 80| 1| 1|
| webservice1| 87| 2| 3| ==> 2+1
| webservice1| 283| 1| 4| ==> 1+2+1
| webservice2| 77| 2| 2|
| webservice2| 80| 1| 3| ==> 2+1
| webservice2| 81| 1| 4| ==> 2+1+1
| webservice3| 63| 3| 3|
| webservice3| 145| 1| 4| ==> 3+1
| webservice4| 167| 1| 1|
| webservice4| 367| 2| 3| ==> 1+2
| webservice4| 500| 1| 4| ==> 1+2+1
Here the result is the sum of numberOfSameTime over all rows of the same web service whose responseTime is less than or equal to the current one.
I can't find the logic to do that. Can anyone help me?
If you order each webService_Name group by increasing responseTime, you can compute the cumulative sum with a Window function as below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val windowSpec = Window.partitionBy("webService_Name").orderBy("responseTime")
df.withColumn("Result", sum("numberOfSameTime").over(windowSpec)).show(false)
and you should have
|webservice1 |80 |1 |1 |
|webservice1 |87 |2 |3 |
|webservice1 |283 |1 |4 |
|webservice2 |80 |1 |3 |
|webservice2 |81 |1 |4 |
|webservice2 |77 |2 |2 |
|webservice3 |145 |1 |4 |
|webservice3 |63 |3 |3 |
|webservice4 |167 |1 |1 |
|webservice4 |367 |2 |3 |
|webservice4 |500 |1 |4 |
Note that responseTime has to be a numeric type for the above to work, since the sum accumulates in increasing responseTime order within each webService_Name.
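As a cross-check of the expected numbers, here is a minimal plain-Python sketch (no Spark) of a per-group running sum: sort by service and response time, then accumulate within each service. The function name `cumulative_by_service` is made up for illustration. It accumulates row by row; as far as I know, a Spark window with an orderBy and no explicit frame uses a range frame instead, so rows with equal responseTime would share one total there.

```python
from itertools import accumulate, groupby
from operator import itemgetter

# (webService_Name, responseTime, numberOfSameTime) rows from the question.
rows = [
    ("webservice1", 80, 1), ("webservice1", 87, 2), ("webservice1", 283, 1),
    ("webservice2", 77, 2), ("webservice2", 80, 1), ("webservice2", 81, 1),
    ("webservice3", 63, 3), ("webservice3", 145, 1),
    ("webservice4", 167, 1), ("webservice4", 367, 2), ("webservice4", 500, 1),
]


def cumulative_by_service(rows):
    """Hypothetical helper: append a running sum of the third field per service."""
    result = []
    ordered = sorted(rows, key=itemgetter(0, 1))  # by service, then responseTime
    for _name, group in groupby(ordered, key=itemgetter(0)):
        group = list(group)
        totals = accumulate(count for _, _, count in group)
        result.extend(row + (total,) for row, total in zip(group, totals))
    return result


result = cumulative_by_service(rows)
for row in result:
    print(row)
```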
You can use Window function available in spark and calculate the cumulative sum as below.
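import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
import spark.implicits._
// dummy data
val d1 = spark.sparkContext.parallelize(Seq(
  ("webservice1", 80, 1),
  ("webservice1", 87, 2),
  ("webservice1", 283, 1),
  ("webservice2", 77, 2),
  ("webservice2", 80, 1),
  ("webservice2", 81, 1),
  ("webservice3", 63, 3),
  ("webservice3", 145, 1),
  ("webservice4", 167, 1),
  ("webservice4", 367, 2),
  ("webservice4", 500, 1)
)).toDF("webService_Name", "responseTime", "numberOfSameTime")
// window function: order by responseTime within each web service
// (ordering by the partition column itself would leave the row order undefined)
val window = Window.partitionBy("webService_Name").orderBy($"responseTime")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
// create new column for Result
d1.withColumn("Result", sum("numberOfSameTime").over(window)).show(false)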
|webservice4 |167 |1 |1 |
|webservice4 |367 |2 |3 |
|webservice4 |500 |1 |4 |
|webservice2 |77 |2 |2 |
|webservice2 |80 |1 |3 |
|webservice2 |81 |1 |4 |
|webservice3 |63 |3 |3 |
|webservice3 |145 |1 |4 |
|webservice1 |80 |1 |1 |
|webservice1 |87 |2 |3 |
|webservice1 |283 |1 |4 |
Hope this helps!
I have a column of lists in a spark dataframe.
|features         |
|[0,45,63,0,0,0,0]|
|[0,0,0,85,0,69,0]|
|[0,89,56,0,0,0,0]|
How do I convert that to a spark dataframe where each element in the list is a column in the dataframe? We can assume that the lists will be the same size.
For Example,
|0 |45|63|0 |0 |0 |0 |
|0 |0 |0 |85|0 |69|0 |
|0 |89|56|0 |0 |0 |0 |
What you describe is actually the inverse of the VectorAssembler operation.
You can do it by converting to an intermediate RDD, as follows:
spark.version
# u'2.2.0'
# your data:
# +-----------------+
# | features |
# +-----------------+
# |[0,45,63,0,0,0,0]|
# |[0,0,0,85,0,69,0]|
# |[0,89,56,0,0,0,0]|
# +-----------------+
dimensionality = 7
out = df.rdd.map(lambda x: [float(x[0][i]) for i in range(dimensionality)]).toDF(schema=['c'+str(i+1) for i in range(dimensionality)])
# +---+----+----+----+---+----+---+
# | c1| c2| c3| c4| c5| c6| c7|
# +---+----+----+----+---+----+---+
# |0.0|45.0|63.0| 0.0|0.0| 0.0|0.0|
# |0.0| 0.0| 0.0|85.0|0.0|69.0|0.0|
# |0.0|89.0|56.0| 0.0|0.0| 0.0|0.0|
# +---+----+----+----+---+----+---+
You can use getItem:
df.withColumn("c1", df["features"].getItem(0))\
.withColumn("c2", df["features"].getItem(1))\
.withColumn("c3", df["features"].getItem(2))\
.withColumn("c4", df["features"].getItem(3))\
.withColumn("c5", df["features"].getItem(4))\
.withColumn("c6", df["features"].getItem(5))\
.withColumn("c7", df["features"].getItem(6))\
.show()
|0 |45|63|0 |0 |0 |0 |
|0 |0 |0 |85|0 |69|0 |
|0 |89|56|0 |0 |0 |0 |
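Here's an alternative without converting to an RDD:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, FloatType
# Number of elements per row. F.size works on array columns, not on VectorAssembler
# output; it is mainly useful when the lists vary in size.
stop = df.select(F.max(F.size('features')).alias('size')).first()['size']
# Convert an ML vector to a plain list of floats (for VectorAssembler output)
udf1 = F.udf(lambda x: x.toArray().tolist(), ArrayType(FloatType()))
df = df.withColumn('features1', udf1('features'))
# Note: range(1, stop) starts at index 1 and so skips the first element; use range(stop) to keep all of them
df.select(*[df.features1[i].alias('col_{}'.format(i)) for i in range(1, stop)]).show()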
| 45| 63| 0| 0| 0| 0|
| 0| 0| 85| 0| 69| 0|
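desertnaut's answer can also be accomplished with DataFrame operations and a UDF:
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType
dimensionality = 7
column_names = ['c'+str(i+1) for i in range(dimensionality)]
# Bind i as a default argument so each UDF keeps its own index regardless of when the closure is serialized
splits = [F.udf(lambda val, i=i: float(val[i]), FloatType()) for i in range(dimensionality)]
df = df.select(*[s('features').alias(j) for s, j in zip(splits, column_names)])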
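One Python detail is worth keeping in mind when building a list of lambdas in a comprehension, as the UDF list above does: a lambda looks up the loop variable when it is called, not when it is defined. The plain-Python sketch below (no Spark; the sample row is made up) shows the difference. Whether this affects the UDF version depends on when Spark serializes the function, so binding the index explicitly is the safer habit.

```python
# Made-up sample row, standing in for one value of the features column.
row = [10, 20, 30]
dimensionality = len(row)

# Late binding: every lambda reads i when called, after the loop has finished,
# so all of them see the final value of i.
late = [lambda val: val[i] for i in range(dimensionality)]

# Default argument: i is evaluated once, when each lambda is defined.
bound = [lambda val, i=i: val[i] for i in range(dimensionality)]

late_values = [f(row) for f in late]
bound_values = [f(row) for f in bound]
print(late_values)
print(bound_values)
```

With the late-bound list every function returns the last element, while the default-argument list returns one element per position.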