How to groupBy tuples within a window in Storm Trident? - group-by

I need to do a groupBy on tuples which fall in a tumblingWindow (the groupBy is based on two other fields plus their time window) in Storm Trident, and then apply an aggregation function on them. The following code aggregates all tuples in a window:
topology.newStream("fixed_spout", fixedSpout)
    .each(new Fields("flows"), new ExtractflowValues2("IPV4_SRC_ADDR"), new Fields("sa"))
    .each(new Fields("flows"), new ExtractflowValues2("L4_DST_PORT"), new Fields("dp"))
    .each(new Fields("flows"), new ExtractflowValues2("IPV4_DST_ADDR"), new Fields("da"))
    .tumblingWindow(BaseWindowedBolt.Duration.seconds(5), wsf, new Fields(groupbyFileds),
        new countDistinct("da"), new Fields("count"));
In the above code I first extract three fields (sa, da and dp) from my tuples, then put the tuples into windows of 5 seconds' duration and count the number of distinct "da" values for each window.
However, what I really need is to put the tuples into windows of 5 seconds' duration, group these tuples by their "sa" and "dp" fields, and then count the number of distinct "da" values per group. How can I achieve this?
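One possible approach (a minimal, untested sketch; the class name GroupedDistinctCount and the output field names are illustrative, not from the question): since the Trident window API applies a single Aggregator to everything in a window, the grouping can be done inside a custom Aggregator whose state is keyed by (sa, dp) and which emits one distinct-"da" count per key when the window completes.

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import org.apache.storm.trident.operation.BaseAggregator;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Values;

// Groups the tuples of one window by (sa, dp) and counts distinct "da" values per group.
public class GroupedDistinctCount extends BaseAggregator<Map<List<Object>, Set<Object>>> {

    @Override
    public Map<List<Object>, Set<Object>> init(Object batchId, TridentCollector collector) {
        // Fresh state for every window.
        return new HashMap<>();
    }

    @Override
    public void aggregate(Map<List<Object>, Set<Object>> state, TridentTuple tuple,
                          TridentCollector collector) {
        // Key the state by the (sa, dp) pair and collect the distinct "da" values seen for it.
        List<Object> key = Arrays.asList(tuple.getValueByField("sa"), tuple.getValueByField("dp"));
        state.computeIfAbsent(key, k -> new HashSet<>()).add(tuple.getValueByField("da"));
    }

    @Override
    public void complete(Map<List<Object>, Set<Object>> state, TridentCollector collector) {
        // When the window closes, emit one tuple per (sa, dp) group with its distinct-da count.
        for (Map.Entry<List<Object>, Set<Object>> entry : state.entrySet()) {
            collector.emit(new Values(entry.getKey().get(0), entry.getKey().get(1),
                    entry.getValue().size()));
        }
    }
}

You would then pass new GroupedDistinctCount() in place of new countDistinct("da") in the tumblingWindow(...) call, with new Fields("sa", "dp", "da") as the input fields and something like new Fields("sa", "dp", "count") as the output fields.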

Related

Pyspark filter a dataframe based on another dataframe containing distinct values

I have a df1 containing two columns, date and value, based on distinct values. There is a df2 that has multiple columns but contains the date and value columns. For each distinct value from df1, I want to filter df2 so that the records before the corresponding date from df1 are dropped. It would be rather easy for a single distinct value: I could filter by value and then apply gt(lit(date)). However, I have over 500 such distinct pairs in df1. A single operation takes around 20 minutes, so using a loop is computationally not feasible. Perhaps somebody can advise me on a better methodology here.
I have tried multiple methodologies but nothing has worked yet.
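One way to avoid the loop might be a single join (a hedged sketch, shown with Spark's Java Dataset API; the same join translates directly to PySpark, and the column names value and date are taken from the description above): join df2 to the small df1 on the value column and keep only the rows whose date is not before the matching df1 date.

import static org.apache.spark.sql.functions.broadcast;
import static org.apache.spark.sql.functions.col;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// df1 is small (~500 rows), so broadcast it; one join replaces the per-value loop.
Dataset<Row> filtered = df2.alias("b")
        .join(broadcast(df1.alias("a")), col("b.value").equalTo(col("a.value")))
        .where(col("b.date").geq(col("a.date")))   // use gt(...) instead if the boundary date should be dropped too
        .selectExpr("b.*");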

How to populate a Spark DataFrame column based on another column's value?

I have a use case where I need to select certain columns from a dataframe containing at least 30 columns and millions of rows.
I'm loading this data from a Cassandra table using Scala and Apache Spark.
I selected the required columns using: df.select("col1","col2","col3","col4")
Now I have to perform a basic groupBy operation to group the data according to src_ip,src_port,dst_ip,dst_port and I also want to have the latest value from a received_time column of the original dataframe.
I want a dataframe with distinct src_ip values with their count and latest received_time in a new column as last_seen.
I know how to use .withColumn and also, I think that .map() can be used here.
Since I'm relatively new in this field, I really don't know how to proceed further. I could really use your help to get done with this task.
Assuming you have a dataframe df with src_ip, src_port, dst_ip, dst_port and received_time, you can try:
val mydf = df.groupBy(col("src_ip"), col("src_port"), col("dst_ip"), col("dst_port"))
  .agg(count("received_time").as("row_count"), max(col("received_time")).as("max_received_time"))
The above calculates, for each combination of the group-by columns, the count of received timestamps as well as the maximum timestamp for that group.
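If what you need is the per-src_ip summary described in the question (a count plus the latest received_time exposed as last_seen), a hedged variant of the same idea, written against the Java Dataset API, would be:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.count;
import static org.apache.spark.sql.functions.max;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// One row per distinct src_ip, with its row count and the newest received_time as last_seen.
Dataset<Row> lastSeen = df.groupBy(col("src_ip"))
        .agg(count(col("received_time")).as("count"),
             max(col("received_time")).as("last_seen"));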

Spark ML Transformer - aggregate over a window using rangeBetween

I would like to create a custom Spark ML Transformer that applies an aggregation function within a rolling window using the over window construct. I would like to be able to use this transformer in a Spark ML Pipeline.
I would like to achieve something that could be done quite easily with withColumn as given in this answer
Spark Window Functions - rangeBetween dates
for example:
val w = Window.orderBy(col("unixTimeMS")).rangeBetween(0, 700)
val df_new = df.withColumn("cts", sum("someColumnName").over(w))
Where
df is my dataframe
unixTimeMS is unix time in milliseconds
someColumnName is the column over which I want to aggregate.
In this example I do a sum over the rows within the window.
The window w includes the current transaction and all transactions within 700 ms of the current transaction.
Is it possible to put such window aggregation into Spark ML Transformer?
I was able to achieve something similar with the Spark ML SQLTransformer, where:
val query = """SELECT *,
sum(someColumnName) over (order by unixTimeMS) as cts
FROM __THIS__"""
new SQLTransformer().setStatement(query)
But I can't figure out how to use rangeBetween in SQL to select a period of time rather than just a number of rows. I need a specific period of time with respect to the unixTimeMS of the current row.
I understand that a UnaryTransformer is not the way to do it because I need to compute an aggregate. Do I need to define a UDAF (user-defined aggregate function) and use it in SQLTransformer?
I wasn't able to find any example of a UDAF containing a window function.
I am answering my own question for future reference. I ended up using SQLTransformer, just like the window function in the example, but using range between:
val query = """SELECT *,
  sum(dollars) over (
    partition by Numerocarte
    order by UnixTime
    range between 1000 preceding and 200 following) as cts
  FROM __THIS__"""
Here, 1000 and 200 in range between are expressed in the units of the order by column.
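For completeness, a hedged sketch (using the Java API; the variable names and the input dataframe df are illustrative) of wiring such a statement into an SQLTransformer and a Pipeline:

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.SQLTransformer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

String query = "SELECT *, "
        + "sum(dollars) over ("
        + "  partition by Numerocarte "
        + "  order by UnixTime "
        + "  range between 1000 preceding and 200 following) as cts "
        + "FROM __THIS__";

// The window aggregation becomes a reusable pipeline stage.
SQLTransformer windowAgg = new SQLTransformer().setStatement(query);
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] { windowAgg });
Dataset<Row> withCts = pipeline.fit(df).transform(df);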

Apache Spark Scala: does groupByKey maintain the order of values in the input RDD or not?

Maybe I am asking a very basic question, apologies for that, but I didn't find the answer on the internet. Does groupByKey maintain the order of values? A value which occurs first in the input RDD should come first in the output RDD. I tried this and it does seem to maintain that order, but I wanted to confirm it with an expert. I need something like the below:
Input RDD [Int, Int]
1 20
2 10
1 8
1 25
Output RDD
1 20 8 25
2 10
No.
Group the values for each key in the RDD into a single sequence. Hash-partitions the resulting RDD with the existing partitioner/parallelism level. The ordering of elements within each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions#groupByKey():org.apache.spark.rdd.RDD[(K,Iterable[V])]
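If the input order actually matters, one hedged workaround sketch (shown with the Java RDD API; sc is an existing JavaSparkContext and the variable names are illustrative) is to tag each element with its position via zipWithIndex before grouping, then sort every group by that position afterwards:

import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// The (key, value) pairs from the question: (1,20), (2,10), (1,8), (1,25).
JavaPairRDD<Integer, Integer> input = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>(1, 20), new Tuple2<>(2, 10), new Tuple2<>(1, 8), new Tuple2<>(1, 25)));

JavaPairRDD<Integer, List<Integer>> grouped = input
        .zipWithIndex()                                        // ((key, value), position)
        .mapToPair(t -> new Tuple2<Integer, Tuple2<Long, Integer>>(t._1._1, new Tuple2<>(t._2, t._1._2)))
        .groupByKey()
        .mapValues(vs -> StreamSupport.stream(vs.spliterator(), false)
                .sorted(Comparator.comparing((Tuple2<Long, Integer> p) -> p._1))  // restore input order
                .map(p -> p._2)
                .collect(Collectors.toList()));
// grouped now contains 1 -> [20, 8, 25] and 2 -> [10].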

SparkSQL PostgreSQL DataFrame partitions

I have a very simple setup of SparkSQL connecting to a Postgres DB and I'm trying to get a DataFrame from a table, with the DataFrame having a number X of partitions (let's say 2). The code would be the following:
Map<String, String> options = new HashMap<String, String>();
options.put("url", DB_URL);
options.put("driver", POSTGRES_DRIVER);
options.put("dbtable", "select ID, OTHER from TABLE limit 1000");
options.put("partitionColumn", "ID");
options.put("lowerBound", "100");
options.put("upperBound", "500");
options.put("numPartitions","2");
DataFrame housingDataFrame = sqlContext.read().format("jdbc").options(options).load();
For some reason, one partition of the DataFrame contains almost all rows.
From what I can understand, lowerBound/upperBound are the parameters used to fine-tune this. In SparkSQL's documentation (Spark 1.4.0 - spark-sql_2.11) it says they are used to define the stride, not to filter/range the partition column. But that raises several questions:
Is the stride the frequency (number of elements returned by each query) with which Spark will query the DB for each executor (partition)?
If not, what is the purpose of these parameters, what do they depend on, and how can I balance my DataFrame partitions in a stable way (not asking that all partitions contain the same number of elements, just that there is an equilibrium - for example 2 partitions, 100 elements: 55/45, 60/40 or even 65/35 would do)?
I can't seem to find a clear answer to these questions and was wondering if some of you could clear these points up for me, because right now this is affecting my cluster performance when processing X million rows and all the heavy lifting goes to one single executor.
Cheers and thanks for your time.
Essentially the lower and upper bound and the number of partitions are used to calculate the increment or split for each parallel task.
Let's say the table has partition column "year", and has data from 2006 to 2016.
If you define the number of partitions as 10, with lower bound 2006 and upper bound 2016, each task will fetch the data for its own year - the ideal case.
Even if you incorrectly specify the lower and/or upper bound, e.g. set lower = 0 and upper = 2016, there will be a skew in data transfer, but you will not "lose" or fail to retrieve any data, because:
The first task will fetch data for year < 0.
The second task will fetch data for year between 0 and 2016/10.
The third task will fetch data for year between 2016/10 and 2*2016/10.
...
And the last task will have a where condition covering everything from 9*2016/10 upwards, so year 2016 is included.
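To make the arithmetic concrete, here is a rough, simplified sketch of how the stride turns into per-partition WHERE clauses for the ideal 2006-2016 example above (this is only an illustration; the exact predicates are built in the columnPartition code linked in the next answer):

// Simplified illustration of the stride arithmetic (not Spark's exact code).
long lowerBound = 2006, upperBound = 2016;
int numPartitions = 10;
long stride = (upperBound - lowerBound) / numPartitions;   // = 1, i.e. roughly one year per task
for (int i = 0; i < numPartitions; i++) {
    long start = lowerBound + i * stride;
    String where;
    if (i == 0) {
        where = "year < " + (start + stride);               // first partition is open at the bottom
    } else if (i == numPartitions - 1) {
        where = "year >= " + start;                         // last partition is open at the top
    } else {
        where = "year >= " + start + " AND year < " + (start + stride);
    }
    System.out.println("partition " + i + ": WHERE " + where);
}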
The lower and upper bounds are indeed used against the partitioning column; refer to this code (the current version at the moment of writing):
https://github.com/apache/spark/blob/40ed2af587cedadc6e5249031857a922b3b234ca/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
Function columnPartition contains the code for the partitioning logic and the use of lower / upper bound.
lowerBound and upperBound do what the previous answers describe. A follow-up question is how to balance the data across partitions without looking up the min/max values, or when your data is heavily skewed.
If your database supports a "hash" function, it can do the trick.
partitionColumn = "hash(column_name)%num_partitions"
numPartitions = 10 // whatever you want
lowerBound = 0
upperBound = numPartitions
This will work as long as the modulus operation returns a uniform distribution over [0, numPartitions).
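Applied to the question's Java code (reusing its options map, DB_URL, POSTGRES_DRIVER and sqlContext), a hedged sketch of this recipe might look as follows; the abs(hashtext(...)) expression is one PostgreSQL-specific way to obtain such a hash and is an assumption here, so swap in whatever hash function your database offers:

int numPartitions = 10;
Map<String, String> options = new HashMap<String, String>();
options.put("url", DB_URL);
options.put("driver", POSTGRES_DRIVER);
// Compute a bucket column in a subquery so the JDBC source can range over it evenly.
options.put("dbtable",
        "(select ID, OTHER, abs(hashtext(ID::text)) % " + numPartitions + " as bucket from TABLE) as t");
options.put("partitionColumn", "bucket");
options.put("lowerBound", "0");
options.put("upperBound", String.valueOf(numPartitions));
options.put("numPartitions", String.valueOf(numPartitions));
DataFrame balancedFrame = sqlContext.read().format("jdbc").options(options).load();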