APACHE BEAM GROUP DATA BY KEY - apache-beam

I have a
PCollection<KV<KV<String, String>, Long>>
Each item of the PCollection looks like this: KV{KV{date,name},long}
I would like to group all the items with the same date, but I don't know how to do that.
Does anyone have an idea?
Thanks in advance

Once you have your PCollection in KV<> format, you can use the GroupByKey PTransform in Apache Beam, which collects all the values that share a key into an Iterable.
Use the code below:
PCollection<KV<KV<String, String>, Iterable<Long>>> outputElement = inputElement.apply(GroupByKey.<KV<String, String>, Long>create());
Then apply a ParDo on the output element to shape the result you want.
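Note that with the key being the whole KV{date,name} pair, GroupByKey groups by (date, name), not by date alone. A minimal Java sketch of re-keying by the date first and then grouping, reusing the inputElement name from the snippet above (the transform name "KeyByDate" is just illustrative):

import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// inputElement is the PCollection<KV<KV<String, String>, Long>> from the question
PCollection<KV<String, Iterable<KV<String, Long>>>> byDate = inputElement
    // re-key each element by the date (the outer key's key)
    .apply("KeyByDate", MapElements
        .into(TypeDescriptors.kvs(
            TypeDescriptors.strings(),
            TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.longs())))
        .via((KV<KV<String, String>, Long> kv) ->
            KV.of(kv.getKey().getKey(), KV.of(kv.getKey().getValue(), kv.getValue()))))
    // all KV{name, long} values that share a date end up in one Iterable
    .apply(GroupByKey.<String, KV<String, Long>>create());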

Related

Spark/Scala - Validate JSON document in a row of a streaming DataFrame

I have a streaming application which is processing a streaming DataFrame with a column "body" that contains a JSON string.
So the body contains something like this (these are four input rows):
{"id":1, "ts":1557994974, "details":[{"id":1,"attr2":3,"attr3":"something"}, {"id":2,"attr2":3,"attr3":"something"}]}
{"id":2, "ts":1557994975, "details":[{"id":1,"attr2":"3","attr3":"something"}, {"id":2,"attr2":"3","attr3":"something"},{"id":3,"attr2":"3","attr3":"something"}]}
{"id":3, "ts":1557994976, "details":[{"id":1,"attr2":3,"attr3":"something"}, {"id":2,"attr2":3}]}
{"id":4, "ts":1557994977, "details":[]}
I would like to check that each row has the correct schema (correct data types and all attributes present). I would like to filter out and log the invalid records somewhere (like a Parquet file). I am especially interested in the "details" array - each of the nested documents must have the specified fields and correct data types.
So in the example above, only the row with id = 1 is valid.
I was thinking about a case class such as:
case class Detail(
  id: Int,
  attr2: Int,
  attr3: String
)
case class Input(
  id: Int,
  ts: Long,
  details: Seq[Detail]
)
and using Try, but I am not sure how to go about it.
Could someone help, please?
Thanks
One approach is to use JSON Schema, which can help you with schema validation of the data. The getting started page is a good place to start if you're new to it.
The other approach would roughly work as follows:
Build models (case classes) for each of the objects like you've attempted in your question.
Use a JSON library like Spray JSON / Play-JSON to parse the input JSON.
Any input that fails to parse into a valid record is most likely invalid, and you can partition that output into a different sink in your Spark code. It would also make this more robust to have an isValid method on the objects that can check whether a parsed record is semantically correct; see the sketch below.
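A minimal sketch of that second approach, assuming Play-JSON and reusing the case classes from the question; the nonEmpty check is just an illustrative extra rule based on the example (row 4 has an empty details array):

import play.api.libs.json._
import scala.util.Try

case class Detail(id: Int, attr2: Int, attr3: String)
case class Input(id: Int, ts: Long, details: Seq[Detail])

implicit val detailReads: Reads[Detail] = Json.reads[Detail]
implicit val inputReads: Reads[Input] = Json.reads[Input]

// None means the body is invalid: it either is not JSON, has wrong types / missing attributes,
// or breaks the illustrative business rule below
def parseAndValidate(body: String): Option[Input] =
  Try(Json.parse(body)).toOption          // not even valid JSON -> None
    .flatMap(_.validate[Input].asOpt)     // wrong types or missing attributes -> None
    .filter(_.details.nonEmpty)           // illustrative extra rule: empty details is invalid

In the streaming job you could then split the valid and invalid bodies with two filters on the result of parseAndValidate and write the invalid ones to a separate Parquet sink, as the question describes.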
The easiest way for me is to create a DataFrame with a schema and then filter with id == 1. This is not the most efficient way.
Here you can find an example of creating a DataFrame with a schema: https://blog.antlypls.com/blog/2016/01/30/processing-json-data-with-sparksql/
Edit
I can't find a pre-filtering option to speed up the JSON search in Scala, but you can use these three options:
spark.read.schema(mySchema).format("json").load("myPath").filter($"id" === 1)
or
spark.read.schema(mySchema).json("myPath").filter($"id" === 1)
or
spark.read.json("myPath").filter($"id" === 1)
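For reference, one possible definition of the mySchema used above, mirroring the case classes from the question (the nullability flags are a judgment call):

import org.apache.spark.sql.types._

val detailSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("attr2", IntegerType, nullable = false),
  StructField("attr3", StringType, nullable = false)
))

val mySchema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("ts", LongType, nullable = false),
  StructField("details", ArrayType(detailSchema), nullable = false)
))

As far as I know, with the default PERMISSIVE mode records that don't match the schema come back with null fields (and the raw text can be kept in a corrupt-record column if you add one to the schema), while DROPMALFORMED drops them and FAILFAST aborts the read, so the mode option is worth checking for the filtering and logging requirement.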

Nested output of Flink

I am processing a Kafka stream using Flink SQL, where every message is pulled from Kafka, processed with Flink SQL, and pushed back into Kafka. I want a nested output where the input is flat and the output is nested. For example, my input is
{'StudentName':'ABC','StudentAge':33}
and I want the output to be
{'Student':{'Name':'ABC','Age':33}}
I tried searching here and a few similar links but could not find anything. Is it possible to do this using the Apache Flink SQL API? I can use user-defined functions if necessary, but I would prefer to avoid them.
You could try something like this:
SELECT
MAP ['Student', MAP ['Name', StudentName, 'Age', StudentAge]]
FROM
students;
I found the MAP function here, but I had to experiment in the SQL Client to figure out the syntax.
I was able to achieve the same result by returning a map from a Flink UDF. The eval() function in the UDF returns a Map, while the Flink SQL query calls the UDF with Student as an alias.
The UDF should look like this:
public class getStudent extends ScalarFunction {
    public Map<String, String> eval(String name, Integer age) {
        Map<String, String> student = new HashMap<>();
        student.put("Name", name);
        student.put("Age", age.toString());
        return student;
    }
}
and the Flink SQL query looks like this:
Select getStudent(StudentName, StudentAge) as `Student` from MyKafkaTopic
The same can be done for Lists as well when trying to get a List out of Flink SQL.

Output Sequence while writing to HDFS using Apache Spark

I am working on a project in Apache Spark, and the requirement is to write the processed output from Spark in a specific format like Header -> Data -> Trailer. For writing to HDFS I am using the .saveAsHadoopFile method and writing the data to multiple files using the key as the file name. But the issue is that the sequence of the data is not maintained; files are written as Data -> Header -> Trailer or some other combination of the three. Is there anything I am missing with the RDD transformations?
OK, so after reading StackOverflow questions, blogs, and mailing list archives on Google, I found out how exactly .union() and other transformations work and how partitioning is managed. When we use .union(), the resulting RDD loses the partition information and also the ordering, and that's why my output sequence was not maintained.
What I did to overcome the issue is number the records, like
Header = 1, Body = 2, and Footer = 3
so, using sortBy on the RDD which is the union of all three, I sorted it by this order number with 1 partition. After that, to write to multiple files using the key as the filename, I used a HashPartitioner so that data with the same key goes into its own file.
val header: RDD[(String,(String,Int))] = ... // this is my header RDD
val data: RDD[(String,(String,Int))] = ... // this is my data RDD
val footer: RDD[(String,(String,Int))] = ... // this is my footer RDD
val finalRDD: RDD[(String,String)] = header.union(data).union(footer).sortBy(x => x._2._2, true, 1).map(x => (x._1, x._2._1))
val output: RDD[(String,String)] = new PairRDDFunctions[String,String](finalRDD).partitionBy(new HashPartitioner(num))
output.saveAsHadoopFile ... // and using MultipleTextOutputFormat save to multiple file using key as filename
This might not be the final or most economical solution, but it worked. I am also trying to find other ways to maintain the output sequence as Header -> Body -> Footer. I also tried .coalesce(1) on all three RDDs and then did the union, but that just added three more transformations to the RDDs; the .sortBy function also takes partition information, which I thought would be the same, but coalescing the RDDs first also worked. If anyone has another approach, please let me know or add more to this; it would be really helpful as I am new to Spark.
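For the part elided above, a hedged sketch of what the output format can look like: a hypothetical MultipleTextOutputFormat subclass that turns the pair's key into the file name, along the lines of the first reference below.

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// hypothetical output format: use the pair's key as the output file name
class KeyAsFileNameOutputFormat extends MultipleTextOutputFormat[String, String] {
  override def generateFileNameForKeyValue(key: String, value: String, name: String): String = key
  // drop the key from the written line so only the value lands in the file
  override def generateActualKey(key: String, value: String): String = null
}

// output.saveAsHadoopFile("/out/path", classOf[String], classOf[String], classOf[KeyAsFileNameOutputFormat])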
References:
Write to multiple outputs by key Spark - one Spark job
Ordered union on spark RDDs
http://apache-spark-user-list.1001560.n3.nabble.com/Union-of-2-RDD-s-only-returns-the-first-one-td766.html -- this one helped a lot

Using groupBy in Spark and getting back to a DataFrame

I am having difficulty working with DataFrames in Spark with Scala. If I have a DataFrame from which I want to extract a column of unique entries, when I use groupBy I don't get a DataFrame back.
For example, I have a DataFrame called logs that has the following form:
machine_id | event | other_stuff
34131231 | thing | stuff
83423984 | notathing | notstuff
34131231 | thing | morestuff
and I would like the unique machine ids where event is "thing" to be stored in a new DataFrame, to allow me to do some filtering of some kind. Using
val machineId = logs
.where($"event" === "thing")
.select("machine_id")
.groupBy("machine_id")
I get a val of GroupedData back, which is a pain to use (or I don't know how to use this kind of object properly). Having got this list of unique machine ids, I then want to use it to filter another DataFrame and extract all events for the individual machine ids.
I can see I'll want to do this kind of thing fairly regularly and the basic workflow is:
Extract unique id's from a log table.
Use unique ids to extract all events for a particular id.
Use some kind of analysis on this data that has been extracted.
It's the first two steps I would appreciate some guidance with here.
I appreciate this example is kind of contrived, but hopefully it explains what my issue is. It may be that I don't know enough about GroupedData objects, or (as I'm hoping) I'm missing something in DataFrames that makes this easy. I'm using Spark 1.5 built on Scala 2.10.4.
Thanks
Just use distinct, not groupBy:
val machineId = logs.where($"event"==="thing").select("machine_id").distinct
Which will be equivalent to SQL:
SELECT DISTINCT machine_id FROM logs WHERE event = 'thing'
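For the second step in the question (pulling all events for those ids from another DataFrame), one option is a left semi join on the distinct ids; otherLogs here is just a stand-in name for that other DataFrame, assumed to also have a machine_id column:

// keep only the rows of otherLogs whose machine_id appears in machineId,
// without adding any columns from machineId to the result
val eventsForThing = otherLogs.join(machineId, otherLogs("machine_id") === machineId("machine_id"), "leftsemi")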
GroupedData is not intended to be used directly. It provides a number of methods, of which agg is the most general; these can be used to apply different aggregate functions and convert the result back to a DataFrame. In terms of SQL, what you have after where and groupBy is equivalent to something like this
SELECT machine_id, ... FROM logs WHERE event = 'thing' GROUP BY machine_id
where ... has to be provided by agg or an equivalent method.
A groupBy in Spark followed by an aggregation and then a select statement will return a DataFrame. For your example it should be something like:
val machineId = logs
  .groupBy("machine_id", "event")
  .agg(max("other_stuff"))
  .where($"event" === "thing")
  .select($"machine_id")

Separate all values from Iterable, Apache Spark

I have grouped all my customers in a JavaPairRDD<Long, Iterable<ProductBean>> by their customerId (of Long type), meaning every customerId has a List of ProductBean.
Now I want to save all the ProductBean objects to the DB irrespective of customerId. I got all the values by using the method
JavaRDD<Iterable<ProductBean>> values = custGroupRDD.values();
Now I want to convert this JavaRDD<Iterable<ProductBean>> into a JavaPairRDD<Object, BSONObject> so that I can save it to Mongo. Remember, every BSONObject is made from a single ProductBean.
I have no idea how to do this in Spark, I mean which Spark transformation should be used for this job. I think this task is some kind of "separate all the values from the Iterable". Please let me know how this is possible.
Any hint in Scala or Python is also OK.
You can use the flatMapValues function:
JavaPairRDD<Long, ProductBean> result = custGroupRDD.flatMapValues(v -> v);
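Since the question says Scala hints are welcome, here is a hedged Scala sketch of going all the way to the pair-of-BSONObject shape; customerGroups stands in for a Scala equivalent of custGroupRDD, and the getName getter on ProductBean is hypothetical:

import org.apache.spark.rdd.RDD
import org.bson.{BSONObject, BasicBSONObject}

// customerGroups: RDD[(Long, Iterable[ProductBean])], i.e. the same grouping as custGroupRDD
val docs: RDD[(Object, BSONObject)] = customerGroups.flatMap { case (customerId, beans) =>
  beans.map { bean =>
    val doc = new BasicBSONObject()
    doc.put("customerId", customerId)   // keep the id if it is useful in Mongo
    doc.put("name", bean.getName)       // hypothetical getter on ProductBean
    (null: Object, doc: BSONObject)     // one document per ProductBean, irrespective of customer
  }
}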