RxJava: how to retry only part of a flatMap chain (RxJava 2)

Observable.just()
    .flatMap()
    .flatMap()   <-- retry should go back to this line --+
    .flatMap()                                           |
    .flatMap()                                           |
    .flatMap()                                           |
    .flatMap()   -- error occurs here -------------------+
    .flatMap()
    .flatMap()
I have a chain of flatMaps. When one of them emits an error, I'd like to retry from partway through the chain rather than from the beginning. I tried the cache and retry operators, but the resulting retry loop can't be stopped by dispose().

You have to turn the part you want to retry into an inner flow, for example:
Observable.just()
    .flatMap()
    .flatMap(v ->
        Observable.just(v)
            .flatMap()
            .flatMap()
            .flatMap()
            .flatMap()
            .retry()
    )
    .flatMap()
    .flatMap()

Related

Pyspark Dataframes: Does when(cond,value) always evaluate value?

So I am trying to conditionally apply a udf some_function() to column b1, based on the value in a1 (and otherwise not apply it), using pyspark.sql.functions.when(condition, value) and a simple udf:
some_function = udf(lambda x: x.translate(...))
df = df.withColumn('c1',when(df.a1 == 1, some_function(df.b1)).otherwise(df.b1))
With this example data:
+---+------+
| a1|    b1|
+---+------+
|  1|'text'|
|  2|  null|
+---+------+
I am seeing that some_function() is always evaluated (i.e. the udf calls translate() on null and crashes), regardless of the condition, and its result is only applied if the condition is true. To clarify, this is not about udfs handling null correctly, but about when(...) always executing value if value is a udf.
Is this behaviour intended? If so, how can I apply a function conditionally so that it doesn't get executed when the condition is not met?
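A common workaround, sketched below: since Spark does not guarantee that when(...) short-circuits udf evaluation, make the udf itself null-tolerant instead of relying on the condition to guard it. The x.upper() call is a hypothetical stand-in for the elided translate(...) logic, and the tiny DataFrame mirrors the example data above.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, when
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 'text'), (2, None)], ['a1', 'b1'])

# Null-tolerant udf: returns None instead of crashing when b1 is null.
some_function = udf(lambda x: x.upper() if x is not None else None, StringType())

df = df.withColumn('c1', when(df.a1 == 1, some_function(df.b1)).otherwise(df.b1))
df.show()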

Don't recalculate UDF [duplicate]

I have a DataFrame read from a Parquet file and I have to add a new column with some random data, but I need the random values to be different from each other. This is my current code, and the Spark version is 1.5.1-cdh-5.5.2:
val mydf = sqlContext.read.parquet("some.parquet")
// mydf.count()
// 63385686
mydf.cache
val r = scala.util.Random
import org.apache.spark.sql.functions.udf
def myNextPositiveNumber: String = (r.nextInt(Integer.MAX_VALUE) + 1).toString.concat("D")
val myFunction = udf(myNextPositiveNumber _)
val myNewDF = mydf.withColumn("myNewColumn", lit(myNextPositiveNumber))
with this code, I have this data:
scala> myNewDF.select("myNewColumn").show(10,false)
+-----------+
|myNewColumn|
+-----------+
|889488717D |
|889488717D |
|889488717D |
|889488717D |
|889488717D |
|889488717D |
|889488717D |
|889488717D |
|889488717D |
|889488717D |
+-----------+
It looks like the udf myNextPositiveNumber is invoked only once, doesn't it?
update
confirmed, there is only one distinct value:
scala> myNewDF.select("myNewColumn").distinct.show(50,false)
17/02/21 13:23:11 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
...
+-----------+
|myNewColumn|
+-----------+
|889488717D |
+-----------+
What am I doing wrong?
Update 2: finally, with the help of #user6910411, I have this code:
val mydf = sqlContext.read.parquet("some.parquet")
// mydf.count()
// 63385686
mydf.cache
val r = scala.util.Random
import org.apache.spark.sql.functions.udf
val accum = sc.accumulator(1)
def myNextPositiveNumber(): String = {
  accum += 1
  accum.value.toString.concat("D")
}
val myFunction = udf(myNextPositiveNumber _)
val myNewDF = mydf.withColumn("myNewColumn",lit(myNextPositiveNumber))
myNewDF.select("myNewColumn").count
// 63385686
update 3
The current code generates data like this:
scala> mydf.select("myNewColumn").show(5,false)
17/02/22 11:01:57 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
+-----------+
|myNewColumn|
+-----------+
|2D |
|2D |
|2D |
|2D |
|2D |
+-----------+
only showing top 5 rows
It looks like the udf function is invoked only once, doesn't it? I need a new random element in that column.
update 4 #user6910411
I have this current code that increments the id, but it is not concatenating the final char, which is weird. This is my code:
import org.apache.spark.sql.functions.udf
val mydf = sqlContext.read.parquet("some.parquet")
mydf.cache
def myNextPositiveNumber():String = monotonically_increasing_id().toString().concat("D")
val myFunction = udf(myNextPositiveNumber _)
val myNewDF = mydf.withColumn("myNewColumn",expr(myNextPositiveNumber))
scala> myNewDF.select("myNewColumn").show(5,false)
17/02/22 12:00:02 WARN Executor: 1 block locks were not released by TID = 1:
[rdd_4_0]
+-----------+
|myNewColumn|
+-----------+
|0 |
|1 |
|2 |
|3 |
|4 |
+-----------+
I need something like:
+-----------+
|myNewColumn|
+-----------+
|1D |
|2D |
|3D |
|4D |
+-----------+
Spark >= 2.3
It is possible to disable some optimizations using the asNondeterministic method:
import org.apache.spark.sql.expressions.UserDefinedFunction
val f: UserDefinedFunction = ???
val fNonDeterministic: UserDefinedFunction = f.asNondeterministic
Please make sure you understand the guarantees before using this option.
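For reference, the same escape hatch exists in PySpark 2.3+; a minimal, hypothetical sketch (the next_positive_number name is illustrative, not from the question):
import random
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Marking the udf as nondeterministic stops the optimizer from collapsing
# its calls into a single constant value.
next_positive_number = udf(
    lambda: str(random.randint(1, 2**31 - 1)) + "D", StringType()
).asNondeterministic()

# usage: mydf.withColumn("myNewColumn", next_positive_number())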
Spark < 2.3
The function passed to udf should be deterministic (with the possible exception of SPARK-20586), and calls to nullary functions can be replaced by constants. If you want to generate random numbers, use one of the built-in functions:
rand - Generate a random column with independent and identically distributed (i.i.d.) samples from U[0.0, 1.0].
randn - Generate a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution.
and transform the output to obtain the required distribution, for example:
(rand * Integer.MAX_VALUE).cast("bigint").cast("string")
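In PySpark terms, a minimal sketch of the same rand-based approach, assuming mydf is the DataFrame from the question; the "D" suffix is appended with concat:
from pyspark.sql.functions import rand, concat, lit

# Scale U[0, 1) samples to positive integers, cast to string, and append "D".
mydf = mydf.withColumn(
    "myNewColumn",
    concat((rand() * (2**31 - 1)).cast("bigint").cast("string"), lit("D")),
)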
You can make use of monotonically_increasing_id to generate unique values.
Then you can define a UDF to append any string to it after casting it to String, since monotonically_increasing_id returns a Long by default.
scala> var df = Seq(("Ron"), ("John"), ("Steve"), ("Brawn"), ("Rock"), ("Rick")).toDF("names")
+-----+
|names|
+-----+
| Ron|
| John|
|Steve|
|Brawn|
| Rock|
| Rick|
+-----+
scala> val appendD = spark.sqlContext.udf.register("appendD", (s: String) => s.concat("D"))
scala> df = df.withColumn("ID",monotonically_increasing_id).selectExpr("names","cast(ID as String) ID").withColumn("ID",appendD($"ID"))
+-----+---+
|names| ID|
+-----+---+
| Ron| 0D|
| John| 1D|
|Steve| 2D|
|Brawn| 3D|
| Rock| 4D|
| Rick| 5D|
+-----+---+
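For completeness, a hedged PySpark sketch of the same idea; it appends the "D" suffix with concat instead of registering a separate udf, and the sample names are illustrative:
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id, concat, lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Ron",), ("John",), ("Steve",)], ["names"])

# Unique (not random) ids, cast to string and suffixed with "D".
df = df.withColumn("ID", concat(monotonically_increasing_id().cast("string"), lit("D")))
df.show()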

Combine array of maps into single map in pyspark dataframe

Is there a function similar to the collect_list or collect_set to aggregate a column of maps into a single map in a (grouped) pyspark dataframe? For example, this function might have the following behavior:
>>>df.show()
+--+---------------------------------+
|id| map |
+--+---------------------------------+
| 1| Map(k1 -> v1)|
| 1| Map(k2 -> v2)|
| 1| Map(k3 -> v3)|
| 2| Map(k5 -> v5)|
| 3| Map(k6 -> v6)|
| 3| Map(k7 -> v7)|
+--+---------------------------------+
>>>df.groupBy('id').agg(collect_map('map')).show()
+--+----------------------------------+
|id| collect_map(map) |
+--+----------------------------------+
| 1| Map(k1 -> v1, k2 -> v2, k3 -> v3)|
| 2| Map(k5 -> v5)|
| 3| Map(k6 -> v6, k7 -> v7)|
+--+----------------------------------+
It probably wouldn't be too difficult to produce the desired result using one of the other collect_ aggregations and a udf, but it seems like something like this should already exist.
I know it is probably poor form to provide an answer to your own question before others have had a chance to answer, but in case someone is looking for a udf based version, here is one possible answer.
from pyspark.sql.functions import udf, collect_list
from pyspark.sql.types import MapType, StringType

combineMap = udf(lambda maps: {key: f[key] for f in maps for key in f},
                 MapType(StringType(), StringType()))

df.groupBy('id')\
  .agg(collect_list('map').alias('maps'))\
  .select('id', combineMap('maps').alias('combined_map')).show()
The suggested solution with concat_map doesn't work, and this solution doesn't use UDFs.
For Spark >= 2.4:
import pyspark.sql.functions as f

(df
 .groupBy(f.col('id'))
 .agg(f.collect_list(f.col('map')).alias('maps'))
 .select(
     'id',
     f.expr('aggregate(slice(maps, 2, size(maps)), maps[0], (acc, element) -> map_concat(acc, element))').alias('mapsConcatenated')
 )
)
collect_list ignores null values, so there is no need to worry about them when using map_concat in the aggregate function.
It's map_concat in PySpark >= 2.4.

Apply different aggregate function to a PySpark groupby

I have a dataframe with a structure similar to
+----+-----+-------+------+------+------+
| cod| name|sum_vol| date| lat| lon|
+----+-----+-------+------+------+------+
|aggc|23124| 37|201610|-15.42|-32.11|
|aggc|23124| 19|201611|-15.42|-32.11|
| abc| 231| 22|201610|-26.42|-43.11|
| abc| 231| 22|201611|-26.42|-43.11|
| ttx| 231| 10|201610|-22.42|-46.11|
| ttx| 231| 10|201611|-22.42|-46.11|
| tty| 231| 25|201610|-25.42|-42.11|
| tty| 231| 45|201611|-25.42|-42.11|
|xptx| 124| 62|201611|-26.43|-43.21|
|xptx| 124| 260|201610|-26.43|-43.21|
|xptx|23124| 50|201610|-26.43|-43.21|
|xptx|23124| 50|201611|-26.43|-43.21|
+----+-----+-------+------+------+------+
and now I want to aggregate the lat and lon values, but using my own function:
def get_centroid(lat, lon):
    # ...do whatever I need here
    return t_lat, t_lon

get_c = udf(lambda x, y: get_centroid(x, y), FloatType())
gg = df.groupby('cod', 'name').agg(get_c('lat', 'lon'))
but I get the following error:
u"expression 'pythonUDF' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;"
Is there a way to get the elements of the group by and operate on them, without having to use a UDAF? Something similar to pandas:
df.groupby(['cod','name'])[['lat', 'lon']].apply(f).to_frame().reset_index()
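One workable sketch, assuming df is the DataFrame from the question: collect the lat/lon pairs per group with collect_list and apply an ordinary udf to the collected rows, avoiding a UDAF. The get_centroid below is a hypothetical stand-in that simply averages the coordinates.
from pyspark.sql.functions import udf, collect_list, struct, col
from pyspark.sql.types import ArrayType, FloatType

def get_centroid(points):
    # points is a list of Rows with 'lat' and 'lon' fields, one per group member.
    lats = [p['lat'] for p in points]
    lons = [p['lon'] for p in points]
    return [sum(lats) / len(lats), sum(lons) / len(lons)]

centroid_udf = udf(get_centroid, ArrayType(FloatType()))

gg = (df
      .groupby('cod', 'name')
      .agg(collect_list(struct('lat', 'lon')).alias('points'))
      .select('cod', 'name', centroid_udf(col('points')).alias('centroid')))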