Converting a Flat File to a Cassandra Data Model using Spark and Scala

I am using Spark (DataFrames) and Scala to transform a flat file into a Cassandra data model for storage. The Cassandra data model has many frozen types nested inside each column, which makes this difficult to achieve.
I tried multiple options using DataFrame/Dataset, but nothing worked, and I do not want to fall back to RDDs.
Cassandra Data Model
Transfer Table
transferNumber - String
orderList - Frozen List -forder
forder Frozen
orderNumber TEXT
lineItem Frozen List -flineitem
flineitem Frozen
lineitemid INT
trackingnumber Frozen List -ftrackingnumber
ftrackingnumber Frozen
trackingnumber TEXT
expectedquantity INT
xList List Text
yList LIST TEXT
DataFrame Output
[order1,10,tracking1,WrappedArray(xlist12,xlist13),null,transfer1]
[order1,20,tracking1,null,WrappedArray(ylist14),transfer1]
[order2,10,tracking2,null,WrappedArray(ylist15),transfer1]
Data Frame Schema
root
 |-- orderNumber: string (nullable = true)
 |-- lineItemId: integer (nullable = true)
 |-- trackingNumber: string (nullable = true)
 |-- xList: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- yList: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- transferNumber: string (nullable = true)
Tried code
val groupByTransferNumber = lineItem.groupBy("transferNumber").agg(collect_set($"orderNumber".alias("order")))
Output:
root
 |-- transferNumber: string (nullable = true)
 |-- collect_set: array (nullable = true)
 |    |-- element: long (containsNull = true)
[transfer1,WrappedArray(order1,order2)]
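To make the target shape concrete, here is a minimal plain-Python sketch (not Spark code) that nests the three flat rows above into transfer -> orders -> line items -> tracking entries. It assumes that xList and yList belong to the tracking-number level, which the model listing does not make explicit, and it leaves out expectedquantity because the flat rows do not carry it. The function name nest_rows is hypothetical. In Spark itself, this kind of nesting is usually built level by level with groupBy plus collect_list(struct(...)).

```python
from collections import defaultdict

# The three flat rows from the DataFrame output above:
# (orderNumber, lineItemId, trackingNumber, xList, yList, transferNumber)
rows = [
    ("order1", 10, "tracking1", ["xlist12", "xlist13"], None, "transfer1"),
    ("order1", 20, "tracking1", None, ["ylist14"], "transfer1"),
    ("order2", 10, "tracking2", None, ["ylist15"], "transfer1"),
]


def nest_rows(flat_rows):
    """Group flat rows into transfer -> order -> line item -> tracking entries."""
    # transferNumber -> orderNumber -> lineItemId -> list of tracking entries
    grouped = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for order, line_id, tracking, x_list, y_list, transfer in flat_rows:
        grouped[transfer][order][line_id].append({
            "trackingnumber": tracking,
            # Assumption: the lists live at the tracking-number level.
            "xList": x_list or [],
            "yList": y_list or [],
        })

    # Convert the nested dictionaries into lists of records, one per level.
    return [
        {
            "transferNumber": transfer,
            "orderList": [
                {
                    "orderNumber": order,
                    "lineItem": [
                        {"lineitemid": line_id, "trackingnumber": trackings}
                        for line_id, trackings in lines.items()
                    ],
                }
                for order, lines in orders.items()
            ],
        }
        for transfer, orders in grouped.items()
    ]


transfers = nest_rows(rows)
```

Each level of the result corresponds to one frozen type in the model, so the same grouping order (tracking entries first, then line items, then orders) is what a DataFrame solution would have to reproduce.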

Related

How to extract DB name and table name from a comma separated string of Dataframe column

I have a DataFrame column, table_name, which holds the following string value:
tradingpartner.parent_supplier,lookup.store,lab_promo_invoice.tl_cc_mbr_prc_wkly_inv,lab_promo_invoice.mpp_club_card_promotion_funding_view,lab_promo_invoice.supplier_sale_apportionment_cc,tradingpartner.supplier,stores.rpm_zone_location_mapping,lookup.calendar
How can I extract the DB names and table names from the above string, and store the DB names in one column and the table names in another?
I want the output as below
One possible solution is to define two separate UDFs.
Starting from this input DataFrame, called dfInput:
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|table_name |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|tradingpartner.parent_supplier,lookup.store,lab_promo_invoice.tl_cc_mbr_prc_wkly_inv,lab_promo_invoice.mpp_club_card_promotion_funding_view,lab_promo_invoice.supplier_sale_apportionment_cc,tradingpartner.supplier,stores.rpm_zone_location_mapping,lookup.calendar,sauces,plant|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
The first UDF, called dbName, extracts all the database names from the input String column:
import org.apache.spark.sql.functions.{col, udf}

def dbNames(k: String): String = {
    // this String is the returned value,
    // containing all the databases from the input string
    var dbNames = ""
    // split the input String by comma
    val arrays = k.split(",")
    for (str <- arrays) {
        // if the input String contains a value like
        // database.table, take just the database part
        if (str.contains(".")) {
            val indexOfPoint = str.indexOf(".")
            dbNames += str.substring(0, indexOfPoint) + ", "
        }
    }
    // drop the trailing ", "
    return dbNames.dropRight(2)
}
val dbName = udf[String, String](dbNames)
The second UDF, called tableName, extracts all the table names from the input String column:
def tableNames(k: String): String = {
    // this String is the returned value,
    // containing all the tables from the input string
    var tableNames = ""
    // split the input String by comma
    val arrays = k.split(",")
    for (str <- arrays) {
        // if the input String contains a value like
        // database.table, take just the table part;
        // otherwise the whole value is taken to be the table name
        if (str.contains(".")) {
            val indexOfPoint = str.indexOf(".")
            tableNames += str.substring(indexOfPoint + 1) + ", "
        }
        else tableNames += str + ", "
    }
    // drop the trailing ", "
    return tableNames.dropRight(2)
}
val tableName = udf[String, String](tableNames)
Then, to obtain the expected output, we call the UDFs as follows:
val dfOutput = dfInput.withColumn("DBName", dbName(col("table_name")))
    .withColumn("Table", tableName(col("table_name")))
// truncate = false, so the long column values are printed in full
dfOutput.show(false)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|table_name |DBName |Table |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|tradingpartner.parent_supplier,lookup.store,lab_promo_invoice.tl_cc_mbr_prc_wkly_inv,lab_promo_invoice.mpp_club_card_promotion_funding_view,lab_promo_invoice.supplier_sale_apportionment_cc,tradingpartner.supplier,stores.rpm_zone_location_mapping,lookup.calendar,sauces,plant|tradingpartner, lookup, lab_promo_invoice, lab_promo_invoice, lab_promo_invoice, tradingpartner, stores, lookup|parent_supplier, store, tl_cc_mbr_prc_wkly_inv, mpp_club_card_promotion_funding_view, supplier_sale_apportionment_cc, supplier, rpm_zone_location_mapping, calendar, sauces, plant|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
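The splitting rule used by the two UDFs can be checked in isolation. The following is a minimal plain-Python sketch of the same logic, outside Spark; the function name split_qualified_names is hypothetical. It follows the Scala code above: an entry is split at its first dot, and entries without a dot are treated as bare table names.

```python
def split_qualified_names(table_name):
    """Return (db_names, table_names) as comma-separated strings."""
    db_names = []
    table_names = []
    for entry in table_name.split(","):
        # partition splits at the first "." only, like indexOf in the Scala version
        db, dot, table = entry.partition(".")
        if dot:
            db_names.append(db)
            table_names.append(table)
        else:
            # no database prefix: the whole entry is the table name
            table_names.append(entry)
    return ", ".join(db_names), ", ".join(table_names)


dbs, tables = split_qualified_names(
    "tradingpartner.parent_supplier,lookup.store,sauces,plant"
)
print(dbs)     # tradingpartner, lookup
print(tables)  # parent_supplier, store, sauces, plant
```

Building the result with join avoids having to trim the trailing separator afterwards.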

com.mongodb.MongoInternalException: The reply message length is less than the maximum message length 4194304

I am trying to read the documents of a MongoDB collection from a Databricks script in PySpark, fetching the data one day at a time.
The script works fine for some days, but for other days it throws the following error:
com.mongodb.MongoInternalException: The reply message length 14484499 is less than the maximum message length 4194304.
I am not sure what this error means or how to resolve it. Any help is appreciated.
This is the sample code I am running:
pipeline = [{'$match': {'$and': [{'UpdatedTimestamp': {'$gte': 1555891200000}},
                                 {'UpdatedTimestamp': {'$lt': 1555977600000}}]}}]
# the outer parentheses let the method chain span several lines
READ_MSG = (spark.read.format("com.mongodb.spark.sql.DefaultSource")
            .option("uri", connectionstring)
            .option("pipeline", pipeline)
            .load())
The timestamps are given in epoch format (milliseconds).
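Since the question fetches one day at a time, the two epoch bounds can be computed rather than hard-coded. Below is a minimal sketch using only the standard library. It assumes the day boundaries are midnight UTC, which is consistent with the two values in the sample code (they correspond to 2019-04-22 and 2019-04-23 UTC); the helper names day_bounds_ms and daily_pipeline are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def day_bounds_ms(year, month, day):
    """Return (start, end) of a UTC day as epoch milliseconds."""
    start = datetime(year, month, day, tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    return int(start.timestamp() * 1000), int(end.timestamp() * 1000)


def daily_pipeline(year, month, day, field="UpdatedTimestamp"):
    """Build the $match stage for one day: start <= field < end."""
    start_ms, end_ms = day_bounds_ms(year, month, day)
    return [{"$match": {"$and": [{field: {"$gte": start_ms}},
                                 {field: {"$lt": end_ms}}]}}]


pipeline = daily_pipeline(2019, 4, 22)
print(day_bounds_ms(2019, 4, 22))  # (1555891200000, 1555977600000)
```

The resulting list has the same shape as the pipeline in the question and can be passed to the reader in the same way.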
This is more of a comment than an answer (I do not have enough reputation to post comments).
I have the same issue.
After some research, I found out that my nested field "survey", which has more than one sublevel, was causing the issue: I was able to read the database by selecting all the other fields except this one:
root
|-- _id: string (nullable = true)
|-- _t: array (nullable = true)
| |-- element: string (containsNull = true)
|-- address: struct (nullable = true)
| |-- streetAddress1: string (nullable = true)
|-- survey: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- SurveyQuestionId: string (nullable = true)
| | |-- updated: array (nullable = true)
| | | |-- element: long (containsNull = true)
| | |-- value: string (nullable = true)
Has anyone found a workaround for what seems to be a bug in the MongoDB Spark connector?
After adding appName to the MongoDB connection string, the issue seems to be resolved; I no longer get this error.

How to pass HashSet to server to test API from postman?

I created an API which I want to test using Postman. My API accepts many parameters, and one of them is a HashSet. I don't know how to pass a HashSet parameter using Postman. Please help me. Thanks in advance.
Here is my code:
@PutMapping
@ApiOperation(value = "collectMultiInvoices", nickname = "collectMultiInvoices")
public BaseResponse collectAmountMultipleInvoices(@RequestParam(value = "invoice_id") HashSet<Integer> invoiceIds,
        @RequestParam("date") String _date,
        @RequestParam(value = "cash", required = false) Float cashAmount,
        @RequestParam(value = "chequeAmount", required = false) Float chequeAmount,
        @RequestParam(value = "chequeNumber", required = false) String chequeNumber,
        @RequestParam(value = "chequeDate", required = false) String _chequeDate,
        @RequestParam(value = "chequeImage", required = false) MultipartFile chequeImage,
        @RequestParam(value = "chequeBankName", required = false) String chequeBankName,
        @RequestParam(value = "chequeBankBranch", required = false) String chequeBankBranch,
        @RequestParam(value = "otherPaymentAmount", required = false) Float otherPaymentAmount,
        @RequestParam(value = "otherPaymentType", required = false) Integer otherPaymentType,
        @RequestParam(value = "otherPaymentTransactionId", required = false) String otherPaymentTransactionId,
        @RequestParam(value = "discountPercentorAmount", required = false) String discountPercentorAmount,
        @RequestParam(value = "discountId", required = false) String discountId) throws AppException.RequestFieldError, AppException.CollectionAmountMoreThanOutstanding {
    //method implementation
}
A Set or HashSet is a Java concept. There is no such thing as a Set from the HTTP perspective, and there is no such thing as a Set in Postman. So from Postman, you need to send the invoice IDs in a format that Spring's parsing library can convert to a HashSet. As @Michael pointed out in the comments, one way to do this is to comma-separate the IDs like this: invoice_id=id1,id2,id3. When Spring processes this request, it will see that you are expecting data in the form of a HashSet, so it will attempt to convert id1,id2,id3 into a HashSet<Integer>, which it knows how to do automatically.
Side note: unless you specifically need a HashSet, it is considered good practice to declare your type using the interface instead of an implementing class. So in this situation I would recommend changing your method signature to accept a Set<Integer> instead of a HashSet<Integer>.
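To see what actually travels over the wire, here is a minimal Python sketch using only the standard library. It builds the comma-separated query string described above and then mimics the server-side conversion into a set of integers. The helper names build_query and parse_invoice_ids and the sample date are hypothetical, and the parsing is only an imitation of the idea, not Spring's implementation. The sketch also accepts the repeated-parameter form (invoice_id=1&invoice_id=2), which Spring can likewise bind to a collection.

```python
from urllib.parse import parse_qs, urlencode


def build_query(invoice_ids, date):
    """Encode the IDs as one comma-separated invoice_id parameter."""
    return urlencode({
        "invoice_id": ",".join(str(i) for i in invoice_ids),
        "date": date,
    })


def parse_invoice_ids(query):
    """Mimic the conversion to a set: accepts '1,2,3' and repeated parameters."""
    values = parse_qs(query)["invoice_id"]
    return {int(part) for value in values for part in value.split(",")}


query = build_query([1, 2, 3], "2019-04-22")
print(query)                     # invoice_id=1%2C2%2C3&date=2019-04-22
print(parse_invoice_ids(query))  # {1, 2, 3}
```

In Postman, the equivalent is a single invoice_id entry with the value 1,2,3 in the request parameters; duplicates collapse on the server because the target type is a set.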

Sorting Boolean variable in a Struct [duplicate]

Mr Vadian, THIS QUESTION IS NOT DUPLICATED. READ WITH A LITTLE ATTENTION THE QUESTION, PLEASE!
I want to group all occurrences with a Boolean variable value of "true" first, followed by all occurrences with a Boolean variable value of "false".
struct myStruct {
    var itemID: Date
    var itemName: String
    var selected: Bool
}
var iTable = [myStruct]()
I tried to sort iTable by the Boolean variable:
let sortedTable = iTable.sorted { (previous, next) -> Bool in
    return next.selected && previous.selected
}
iTable = sortedTable
(I will improve the example to explain myself better)
I use the contents of the internal table (struct) with a table view:
(itemID: 111, itemName: John, selected: true)
(itemID: 477, itemName: Rose, selected: false)
(itemID: 431, itemName: Peter, selected: true)
(itemID: 215, itemName: Mary, selected: true)
(itemID: 442, itemName: Marisa, selected: false)
After sorting (by the Boolean variable "selected"), I list the contents of the internal table "iTable" (only the names) and I get the same, wrong output:
John
Rose
Peter
Mary
Marisa
...
If (I repeat, if) I sort by name, using the "itemName" variable (alphabetically):
iTable = iTable.sorted { $0.itemName < $1.itemName }
_tableView.reloadData()
The result is perfect!
In the same way as the alphabetical sort, but sorting by the Boolean variable, I would like to obtain:
John (true)
Peter (true)
Mary (true)
Rose (false)
Marisa (false)
...
My shabby solution:
Create two internal tables, one with all the occurrences with value "true" and another with all the occurrences with value "false".
Then merge the two internal tables to obtain the desired result.
Try this:
let sortedTable = iTable.sorted(by: {$0.selected && !$1.selected})
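If you only care about the selected property, you may see the shorter form below. Note that it returns true for two selected items in either order, so it is not the strict weak ordering that sorted(by:) requires; the comparator above is the safer choice.
let sortedTable = iTable.sorted { (previous, next) -> Bool in
    return previous.selected
}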
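The grouping rule itself ("selected items first, original order otherwise unchanged") can be sketched outside Swift. The following Python version is only an illustration of the idea, not a translation of the Swift API; the Item class and its field names are hypothetical stand-ins for the struct in the question. It relies on Python's sorted being stable, so items with the same flag keep their relative order.

```python
from dataclasses import dataclass


@dataclass
class Item:
    item_id: int
    item_name: str
    selected: bool


table = [
    Item(111, "John", True),
    Item(477, "Rose", False),
    Item(431, "Peter", True),
    Item(215, "Mary", True),
    Item(442, "Marisa", False),
]

# False sorts before True, so "not selected" puts the selected items first.
# sorted() is stable, so ties keep their original relative order.
sorted_table = sorted(table, key=lambda item: not item.selected)

names = [item.item_name for item in sorted_table]
print(names)  # ['John', 'Peter', 'Mary', 'Rose', 'Marisa']
```

This gives the same result as building one list of selected items and one of unselected items and concatenating them, which is what the "two internal tables" approach in the question does by hand.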

Cannot invoke initializer for type: with an argument list of type '(_Element)'

I am new to Swift. I am trying to convert a string to a character array, and I want the integer value of a character. Here is my code:
var string = "1234"
var temp = Array(string.characters)
var o = Int(temp[0])
But at line 3 I get the above error. What's wrong with this code?
Please help me
You need to convert your Character to a String, because Int has no initializer that takes a Character.
You can either map your Character array to a String array
var temp = string.characters.map(String.init)
or convert the character to a String when initializing your variable
var o = Int(String(temp[0]))
Swift 4
let string = "1234"
let temp = string.map(String.init)
let o = Int(temp[0])