Dynamically cast JsonElement value when mapping Json object - scala

What I'm trying to achieve: a JSON field contains more than one type of element that I need later on, and I want to cast those elements in an earlier stage so the casting doesn't have to happen deep in the execution schedule. In other words, I want to dynamically map each specific JSON type to its corresponding object up front, so there is no need to do it later.
My current situation is the following:
obj.exampleField = json
  .getAsJsonObject(exampleField)
  .entrySet
  .map(entry => entry.getKey -> entry.getValue.getAsString)
  .toMap
At the moment every value is a String, but this will need to change so that exampleField can also contain a field of an Array type.
How can I dynamically map those types within my current .map stage, so that a key whose value is a String is cast with getAsString and, in case it's an array, with getAsJsonArray?
Or is there no option other than dropping the current .map stage and moving the casting to the last stage in the execution schedule?
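One possible approach (a minimal sketch, assuming Gson's JsonElement API; note the resulting map becomes Map[String, Any] because String and JsonArray values end up mixed) is to branch on the element type inside the .map stage:
import scala.collection.JavaConverters._ // or scala.jdk.CollectionConverters._ on Scala 2.13+

val mapped: Map[String, Any] = json
  .getAsJsonObject(exampleField)
  .entrySet.asScala
  .map { entry =>
    val value: Any =
      if (entry.getValue.isJsonArray) entry.getValue.getAsJsonArray // keep arrays as JsonArray
      else entry.getValue.getAsString                               // plain values as String
    entry.getKey -> value
  }
  .toMap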

Related

Azure Data Factory: append array to array in ForEach

In an Azure Data Factory pipeline, I have a ForEach1 loop over Databricks activities. Those Databricks activities output arrays of different sizes. I would like to union those arrays and pass them to another ForEach2 loop, so that every element of every array becomes an item in this new ForEach2 loop.
How could I collect output arrays from ForEach1 into one big array? I've tried Append Variable Activity, but got the following error:
The value of type 'Array' cannot be appended to the variable of type 'Array'.
The action type 'AppendToArrayVariable' only supports values of types 'Float, Integer, String, Boolean, Object'.
Is there a way to union/merge arrays inside the ForEach1 loop? Or is there any other way to pass the arrays to ForEach2 so that each element of each array is considered a separate item and ForEach2 loops over each item?
There is a collection function called union() in Azure Data Factory which takes 2 arguments (both of type array, or both of type object). It can be used to achieve your requirement. Below is an example which I tried with a Get Metadata activity instead of a Databricks Notebook activity.
I have a container called input with 2 folders, a and b, inside it, and these folders contain some files. Using a similar approach, I append the array of child items (file names) generated by the Get Metadata activity in each iteration to get a list of all file names (one big array).
First, I used a Get Metadata activity to get the names of the folders inside the container.
I used @activity('folder_names').output.childItems as the items value of the ForEach activity. Now, inside the ForEach, I used another Get Metadata activity to get the child items of each folder (I created a dataset and gave a dynamic value for the folder name in its path).
You can use the following procedure to achieve the requirement:
I passed the output of the 2nd Get Metadata activity (files in folder) to a Set Variable activity. I created a new variable current_file_list (array type) and set its value to:
@union(variables('list_of_files'), activity('files in folder').output.childItems)
Note: Instead of activity('files in folder').output.childItems in the above union, you can use the array returned by your Databricks activity in ForEach1 in each iteration.
list_of_files is another array-type variable (set using another Set Variable activity) which is the final array that contains all elements as one big array. I assign this variable's value as @variables('current_file_list').
This indirectly means that the list_of_files value is the union of the previously collected elements and the current folder's child item array.
Reason: If we use a single variable (say list_of_files) and set its value to @union(variables('list_of_files'), activity('files in folder').output.childItems), it throws an error. A variable cannot reference itself in Azure Data Factory dynamic content, so we need to create 2 variables to work around this.
When I debug the pipeline, it gives the expected output, and we can see the output produced after each activity.
[Output screenshots omitted: the input to the current_file_list Set Variable activity in the first iteration, the same input in the second iteration, and the final list_of_files array with all elements in a single array.]
You can follow the above approach to build the combined array and use it for your ForEach2 activity.

Merge objects in mutable list on Scala

I have recently started looking at Scala code and I am trying to understand how to go about a problem.
I have a mutable list of objects; these objects have an id: String and values: List[Int]. Because of the way I receive the data, more than one object can have the same id. I am trying to merge the items in the list so that if, for example, I have 3 objects with id 123 and whatever values, I end up with just one object with that id and the values of the 3 combined.
I could do this the Java way, iterating and so on, but I was wondering if there is an easier, Scala-specific way of going about this?
The first thing to do is avoid using mutable data and think about transforming one immutable object into another. So rather than mutating the contents of one collection, think about creating a new collection from the old one.
Once you have done that it is actually very straightforward because this is the sort of thing that is directly supported by the Scala library.
case class Data(id: String, values: List[Int])
val list: List[Data] = ???
val result: Map[String, List[Int]] =
  list.groupMapReduce(_.id)(_.values)(_ ++ _)
The groupMapReduce call breaks down into three parts:
- The first part groups the data by the id field and makes it the key. This gives a Map[String, List[Data]].
- The second part extracts the values field and makes that the data, so the result is now Map[String, List[List[Int]]].
- The third part combines all the values lists into a single list, giving the final result Map[String, List[Int]].
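For a quick illustration of what this produces, using the Data class above (note that groupMapReduce is available from Scala 2.13 onwards):
val data = List(
  Data("123", List(1, 2)),
  Data("123", List(3)),
  Data("456", List(4)),
  Data("123", List(5, 6))
)

val merged: Map[String, List[Int]] =
  data.groupMapReduce(_.id)(_.values)(_ ++ _)
// merged("123") == List(1, 2, 3, 5, 6)
// merged("456") == List(4)

// If Data objects are needed back rather than a Map:
val mergedData: List[Data] = merged.map { case (id, vs) => Data(id, vs) }.toList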

Bulk delete records from HBase - how to convert an RDD to Array[Byte]?

I have an RDD of objects that I want to bulk delete from HBase. After reading HBase documentation and examples I came up with the following code:
hc.bulkDelete[Array[Byte]](salesObjects, TableName.valueOf("salesInfo"),
  putRecord => new Delete(putRecord), 4)
However, as far as I understand, salesObjects has to be converted to Array[Byte].
Since salesObjects is an RDD[Sale] how to convert it to Array[Byte] correctly?
I've tried Bytes.toBytes(salesObjects) but the method doesn't accept RDD[Sale] as an argument. Sale is a complex object so it will be problematic to parse each field to bytes.
For now I've converted the RDD[Sale] with val salesList: List[Sale] = salesObjects.collect().toList, but I'm currently stuck on where to proceed next.
I've never used this method but I'll try to help:
- The method accepts an RDD of any type T: https://github.com/apache/hbase/blob/master/hbase-spark/src/main/scala/org/apache/hadoop/hbase/spark/HBaseContext.scala#L290 ==> so you should be able to use it on your RDD[Sale].
- bulkDelete expects a function transforming your Sale object into HBase's Delete object (https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Delete.html).
- A Delete object represents a row to delete. You can find an example of Delete object initialization here: https://www.tutorialspoint.com/hbase/hbase_delete_data.htm
- Depending on what you want to remove and how, you should convert the relevant parts of your Sale into bytes. For instance, if you want to remove the data by row key, you should extract it and pass it to the Delete object.
In my understanding, the bulkDelete method will accumulate batchSize Delete objects and send them to HBase at once. Otherwise, could you please show some code to give a more concrete idea of what you're trying to do?
Doing val salesList: List[Sale] = salesObjects.collect().toList is not a good idea since it brings all the data into your driver; potentially it can lead to OOM problems.
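Putting those points together, a rough sketch of how the call could look (assuming Sale has an id field that serves as the HBase row key; adjust the Delete construction to your actual schema), working directly on the RDD with no collect():
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Delete
import org.apache.hadoop.hbase.util.Bytes

// salesObjects: RDD[Sale], hc: HBaseContext
hc.bulkDelete[Sale](
  salesObjects,
  TableName.valueOf("salesInfo"),
  sale => new Delete(Bytes.toBytes(sale.id)), // build a Delete from each record's row key
  4                                           // batch size
)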

How to get the values in DataFrame with the correct DataType?

When I tried to get some values in a DataFrame, like:
df.select("date").head().get(0) // type: Any
The result type is Any, which is not expected.
Since a DataFrame contains the schema of the data, it should know the DataType of each column, so when I try to get a value using get(0), it should return the value with the correct type. However, it does not.
Instead, I need to specify which DataType I want by using getDate(0), which seems weird, inconvenient, and makes me mad.
Since I have already specified the schema with the correct DataType for each column when I created the DataFrame, I don't want to have to use a different getXXX() for each column.
Are there some convenient ways that I can get the values with their own correct types? That is to say, how can I get the values with the correct DataType specified in the schema?
Thank you!
Scala is a statically typed language, so the get method defined on Row has to declare a single return type for every call, which is why its return type is Any. It cannot return an Int for one call and a String for another.
You should be calling getInt, getDate and the other get methods provided for each type, or the getAs method to which you can pass the type as a parameter (for example row.getAs[Int](0)).
As mentioned in the comments, other options are:
- use a Dataset instead of a DataFrame
- use Spark SQL
You can call the generic getAs method as getAs[Int](columnIndex), getAs[String](columnIndex) or use specific methods like getInt(columnIndex), getString(columnIndex).
Link to the Scaladoc for org.apache.spark.sql.Row.
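For a concrete illustration of the options (a minimal sketch, assuming a SparkSession named spark and a hypothetical DataFrame with an id and a date column):
import java.sql.Date
import spark.implicits._

val df = Seq(("a", Date.valueOf("2021-01-01"))).toDF("id", "date")

// Row-based access: the expected type is supplied at the call site
val d1: Date = df.select("date").head().getAs[Date](0)
val d2: Date = df.select("date").head().getDate(0)

// Dataset alternative: the schema is carried in the element type, so no per-column casting is needed
case class Event(id: String, date: Date)
val ds = df.as[Event]
val d3: Date = ds.head().date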

Scala 2.10.6 + Spark 1.6.0: Most appropriate way to map from DataFrame to case classes

I'm looking for the most appropriate way to map the information contained in a DataFrame to some case classes I've defined, according to the following situation.
I have 2 Hive tables, and a third table which represents the many-to-many relationship between them. Let's call them "Item", "Group", and "GroupItems".
I'm considering executing a single query joining them all, to get the information of a single group and all its items.
So each row of the resulting DataFrame would contain the fields of the Group and the fields of an Item.
Then I've created 4 different case classes to use this information in my application. Let's call them:
- ItemProps1: its properties match with some of the Item fields
- ItemProps2: its properties match with some of the Item fields
- Item: contains some properties which match with some of the Item fields, and has 1 object of type ItemProps1, and another of type ItemProps2
- Group: its properties match with the Group fields, and contains a list of items
What I want to do is to map the info contained in the resulting DataFrame into these case classes, but I don't know which would be the most appropriate way.
I know DataFrame has a method "as[U]" which is very useful to perform this kind of mapping, but I'm afraid that in my case it won't be useful.
Then, I've found some options to perform the mapping manually, like the following ones:
df.map {
  case Row(foo: Int, bar: String) => Record(foo, bar)
}
-
df.collect().foreach { row =>
  val foo = row.getAs[Int]("foo")
  val bar = row.getAs[String]("bar")
  Record(foo, bar)
}
Is any of these approaches the most appropriate one, or should I do it in another way?
Thanks a lot!
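For reference, a rough sketch of how the manual getAs approach could be applied to the nested case classes described above (the column names here are purely hypothetical placeholders for the joined query's output):
case class ItemProps1(propA: String)
case class ItemProps2(propB: Int)
case class Item(itemId: String, props1: ItemProps1, props2: ItemProps2)
case class Group(groupId: String, items: List[Item])

// Build one (groupId, Item) pair per joined row, then group the items by their Group key.
val pairs = df.map { row =>
  val groupId = row.getAs[String]("group_id")
  val item = Item(
    row.getAs[String]("item_id"),
    ItemProps1(row.getAs[String]("prop_a")),
    ItemProps2(row.getAs[Int]("prop_b"))
  )
  (groupId, item)
}.collect() // brings the joined rows to the driver; fine if the groups' items fit in memory

val groups: List[Group] = pairs
  .groupBy(_._1)
  .map { case (groupId, grouped) => Group(groupId, grouped.map(_._2).toList) }
  .toList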