Can't run another paragraph in Zeppelin after VectorAssembler.transform - pyspark

I am using Zeppelin 0.7.1 and Spark 2.1.0.
I've got some data in the dataframe 'dataset':
+-------+-------+-------+-------+
| index | var 1 | var 2 | var 3 |
+-------+-------+-------+-------+
|     0 |     0 |     1 |     0 |
|     1 |     0 |     1 |     0 |
|     2 |     1 |     0 |     1 |
+-------+-------+-------+-------+
and, in order to run a linear regression, I want to put all the columns into one vector column:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=['var 1', 'var 2', 'var 3'], outputCol='features')
output = assembler.transform(dataset)
Well, after running this in Zeppelin, I can't run another paragraph; I have to restart my interpreter...
Does anyone have an idea where the problem might come from?
Thanks!

Version 0.7.2 of Zeppelin should solve your problem.
We had the same problem with the same versions; after this upgrade it worked fine.
Regards

Related

Use Spark mapPartitions function to iterate over dataframe rows and add a new column

I'm new to Spark and Scala. I was trying to use the mapPartitions function on a Spark dataframe to iterate over its rows and derive a new column based on the value of a column from the previous row.
Input Dataframe:
+----------+-----------+------------+
| order_id | person_id | create_dt  |
+----------+-----------+------------+
|        1 |         1 | 2020-01-11 |
|        2 |         1 | 2020-01-12 |
|        3 |         1 | 2020-01-13 |
|        4 |         1 | 2020-01-14 |
|        5 |         1 | 2020-01-15 |
|        6 |         1 | 2020-01-16 |
+----------+-----------+------------+
From the above dataframe, I want to use the mapPartitions function to call a Scala method that takes an Iterator[Row] as a parameter and produces output rows with a new column, date_diff. The new column is the date difference between the create_dt of the current row and that of the previous row.
Expected output dataframe:
+----------+-----------+------------+-----------+
| order_id | person_id | create_dt  | date_diff |
+----------+-----------+------------+-----------+
|        1 |         1 | 2020-01-11 |        NA |
|        2 |         1 | 2020-01-12 |         1 |
|        3 |         1 | 2020-01-13 |         1 |
|        4 |         1 | 2020-01-14 |         1 |
|        5 |         1 | 2020-01-15 |         1 |
|        6 |         1 | 2020-01-16 |         1 |
+----------+-----------+------------+-----------+
Code I tried so far:
// Read input data
val input_data = sc.parallelize(Seq((1, 1, "2020-01-11"), (2, 1, "2020-01-12"), (3, 1, "2020-01-13"),
  (4, 1, "2020-01-14"), (5, 1, "2020-01-15"), (6, 1, "2020-01-16"))).toDF("order_id", "person_id", "create_dt")
// Generate output data using mapPartitions and call the getDateDiff method
val output_data = input_data.mapPartitions(getDateDiff).show()
// getDateDiff method to iterate over each row and derive the date difference
def getDateDiff(srcItr: scala.collection.Iterator[Row]): Iterator[Row] = {
  for (row <- srcItr) { row.get(2) }
  /* derive date difference and generate output row */
}
Could someone help me write the getDateDiff method to get the expected output?
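One possible way to write it (just a sketch, not a definitive implementation): keep the previous row's date in a variable while iterating over the partition and append the difference to each row. The previous-row logic only sees rows inside one partition, so the data is repartitioned to a single partition and sorted first; the RowEncoder usage below is an assumption based on the Spark 2.x API.
import java.time.LocalDate
import java.time.temporal.ChronoUnit
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Append date_diff = number of days between this row's create_dt and the previous row's.
// date_diff is kept as a string so the first row can carry "NA".
def getDateDiff(srcItr: Iterator[Row]): Iterator[Row] = {
  var prev: Option[LocalDate] = None                     // no previous row yet
  srcItr.map { row =>
    val current = LocalDate.parse(row.getString(2))      // create_dt is the third column
    val diff = prev.map(p => ChronoUnit.DAYS.between(p, current).toString).getOrElse("NA")
    prev = Some(current)
    Row.fromSeq(row.toSeq :+ diff)                       // original columns plus date_diff
  }
}

// mapPartitions on a DataFrame needs an explicit encoder for the new row shape.
val outSchema = StructType(input_data.schema.fields :+ StructField("date_diff", StringType))
val output_data = input_data
  .repartition(1)                                        // keep all rows in one partition
  .sortWithinPartitions("create_dt")
  .mapPartitions(getDateDiff)(RowEncoder(outSchema))
output_data.show()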

Dask equivalent of pyspark lead and lag functions

Is it possible to get, in a Dask dataframe, results similar to what the lag or lead window functions produce in PySpark? I want to transform the following dataframe
+-------+
| value |
+-------+
|     1 |
|     2 |
|     3 |
+-------+
to something like this:
+-------+------------+------------+
| value | prev_value | next_value |
+-------+------------+------------+
|     1 |        NaN |          2 |
|     2 |          1 |          3 |
|     3 |          2 |        NaN |
+-------+------------+------------+
The Dask dataframe API just mirrors the pandas interface. In this case the method you want is shift:
In [3]: import pandas as pd
In [4]: df = pd.DataFrame({'a': range(5)})
In [5]: import dask.dataframe as dd
In [6]: ddf = dd.from_pandas(df, npartitions=2)
In [7]: out = ddf.assign(prev_a=ddf.a.shift(1), next_a=ddf.a.shift(-1))
In [8]: out.compute()
Out[8]:
   a  prev_a  next_a
0  0     NaN     1.0
1  1     0.0     2.0
2  2     1.0     3.0
3  3     2.0     4.0
4  4     3.0     NaN
However, if you're trying to align rows to do some kind of windowed or rolling computation, you may be more interested in map_overlap, which would be more performant.

How to merge three DataFrames in Scala

How do I merge 3 DataFrames in Spark/Scala? I have no idea how to do this, and I couldn't find a similar example on Stack Overflow.
I have 3 similar DataFrames: the same column names and the same number of columns; only the row values differ.
DataFrame1:
+------+-------+------+-----+
| type | Model | Name | ID  |
+------+-------+------+-----+
|    1 | wdasd | xyzd | 111 |
|    1 | wd    | zdfd | 112 |
|    1 | bdp   | 2gfs | 113 |
+------+-------+------+-----+
DataFrame2:
+------+-------+------+-----+
| type | Model | Name | ID  |
+------+-------+------+-----+
|    2 | wdasd | xyzd | 221 |
|    2 | wd    | zdfd | 222 |
|    2 | bdp   | 2gfs | 223 |
+------+-------+------+-----+
DataFrame3:
+------+-------+------+-----+
| type | Model | Name | ID  |
+------+-------+------+-----+
|    3 | AAAA  | N_AM | 331 |
|    3 | BBBB  | NA_M | 332 |
|    3 | CCCC  | MA_N | 333 |
+------+-------+------+-----+
And I want to get this type of DataFrame:
MergeDataFrame:
+------+-------+------+-----+
| type | Model | Name | ID  |
+------+-------+------+-----+
|    1 | wdasd | xyzd | 111 |
|    1 | wd    | zdfd | 112 |
|    1 | bdp   | 2gfs | 113 |
|    2 | wdasd | xyzd | 221 |
|    2 | wd    | zdfd | 222 |
|    2 | bdp   | 2gfs | 223 |
|    3 | AAAA  | N_AM | 331 |
|    3 | BBBB  | NA_M | 332 |
|    3 | CCCC  | MA_N | 333 |
+------+-------+------+-----+
Spark provides union and unionAll. unionAll is deprecated, so I would use union, as below:
dataFrame1.union(dataFrame2).union(dataFrame3)
Note that in order to union DataFrames, they must have exactly the same column names in exactly the same order.
See the Spark docs here
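For reference, a minimal self-contained sketch (assuming a spark-shell or notebook session where a SparkSession named spark and its implicits are available; the sample values are the ones from the tables above):
import spark.implicits._   // assumes an existing SparkSession named `spark`

val dataFrame1 = Seq((1, "wdasd", "xyzd", 111), (1, "wd", "zdfd", 112), (1, "bdp", "2gfs", 113))
  .toDF("type", "Model", "Name", "ID")
val dataFrame2 = Seq((2, "wdasd", "xyzd", 221), (2, "wd", "zdfd", 222), (2, "bdp", "2gfs", 223))
  .toDF("type", "Model", "Name", "ID")
val dataFrame3 = Seq((3, "AAAA", "N_AM", 331), (3, "BBBB", "NA_M", 332), (3, "CCCC", "MA_N", 333))
  .toDF("type", "Model", "Name", "ID")

// union appends rows positionally, so all three frames must share the same column order.
val mergedDataFrame = dataFrame1.union(dataFrame2).union(dataFrame3)
mergedDataFrame.show()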

pyspark PipelinedRDD fit to DataFrame column

First of all, I'm new to the Python and Spark world.
I have a homework assignment from university, but I'm stuck in one place.
I clustered my data, and now I have my clusters in a PipelinedRDD
after this:
cluster = featurizedScaledRDD.map(lambda r: kmeansModelMllib.predict(r))
cluster = [2,1,2,0,0,0,1,2]
Now I have cluster and my dataframe dataDf, and I need to fit my cluster as a new column onto dataDf.
I have:            I need:
+---+---+---+      +---+---+---+-------+
| x | y | z |      | x | y | z |cluster|
+---+---+---+      +---+---+---+-------+
| 0 | 1 | 1 |      | 0 | 1 | 1 |   2   |
| 0 | 0 | 1 |      | 0 | 0 | 1 |   1   |
| 0 | 8 | 0 |      | 0 | 8 | 0 |   2   |
| 0 | 8 | 0 |      | 0 | 8 | 0 |   0   |
| 0 | 1 | 0 |      | 0 | 1 | 0 |   0   |
+---+---+---+      +---+---+---+-------+
You can add an index using zipWithIndex, join on it, and convert back to a DataFrame.
swp = lambda x: (x[1], x[0])

cluster.zipWithIndex().map(swp).join(dataDf.rdd.zipWithIndex().map(swp)) \
    .values().toDF(["cluster", "point"])
In some cases (when both RDDs have the same number of partitions and the same number of elements per partition) it should be possible to use zip:
cluster.zip(dataDf.rdd).toDF(["cluster", "point"])
You can follow with .select("cluster", "point.*") to flatten the output.

How to sort data on multiple columns in Apache Spark Scala?

I have a data set like this, which I read from a CSV file and convert into an RDD using Scala.
+--------+------+---------+
| recent | Freq | Monitor |
+--------+------+---------+
|      1 | 1234 |  199090 |
|      4 | 2553 |  198613 |
|      6 | 3232 |  199090 |
|      1 | 8823 |  498831 |
|      7 | 2902 |  890000 |
|      8 | 7991 |  081097 |
|      9 | 7391 |  432370 |
|     12 | 6138 |  864981 |
|      7 | 6812 |  749821 |
+--------+------+---------+
How can I sort the data on all columns?
Thanks
Suppose your input DataFrame is called df.
To sort recent in descending order, and Freq and Monitor both in ascending order, you can do:
import org.apache.spark.sql.functions._
val sorted = df.sort(desc("recent"), asc("Freq"), asc("Monitor"))
You can use df.orderBy(...) as well; it's an alias of sort().
If you are working with the RDD directly, csv.sortBy(r => (r.recent, r.freq)) or an equivalent should do it.
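A minimal sketch of that RDD route (the case class Record, the column parsing, and the file path are assumptions, not from the original post):
import org.apache.spark.rdd.RDD

case class Record(recent: Int, freq: Int, monitor: String)

// Parse the CSV into a typed RDD; adjust the split/parse logic to your actual file.
val csv: RDD[Record] = sc.textFile("data.csv")            // hypothetical path
  .map(_.split(","))
  .map(a => Record(a(0).trim.toInt, a(1).trim.toInt, a(2).trim))

// sortBy with a tuple key sorts by recent first, then freq, then monitor (all ascending);
// pass ascending = false to reverse the whole ordering.
val sorted = csv.sortBy(r => (r.recent, r.freq, r.monitor))
sorted.take(10).foreach(println)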