Scala Spark convert date to a specific format

I am reading a JSON file into a data frame, and I want to convert one of its fields into a specific format. The JSON file has a server_received_time field stored as a String in the following format, and I want to convert it to yyyy-MM-dd:hh:
"server_received_time":"2019-01-26T03:04:36Z"
but whatever I tried just returned null:
df.select("server_received_time")
  .withColumn("tx_date", to_date($"server_received_time", "yyy-MM-dd:hh").cast("timestamp"))
  .withColumn("tx_date2", to_timestamp($"server_received_time", "yyy-MM-dd:hh").cast("timestamp"))
  .withColumn("tx_date3", to_date(unix_timestamp($"server_received_time", "yyyy-MM-dd:hh").cast("timestamp")))
  .withColumn("tx_date4", to_utc_timestamp(to_timestamp(col("server_received_time"), "yyyy-MM-dd:hh"), "UTC"))
  .withColumn("tx_date5", to_timestamp($"server_received_time", "yyyy-MM-dd:hh"))
  .show(10, false)
+--------------------+-------+--------+--------+--------+--------+
|server_received_time|tx_date|tx_date2|tx_date3|tx_date4|tx_date5|
+--------------------+-------+--------+--------+--------+--------+
|2019-02-18T16:02:20Z|null   |null    |null    |null    |null    |
|2019-02-18T16:02:20Z|null   |null    |null    |null    |null    |
|2019-02-18T16:02:20Z|null   |null    |null    |null    |null    |
|2019-02-18T16:02:20Z|null   |null    |null    |null    |null    |
|2019-02-18T16:02:20Z|null   |null    |null    |null    |null    |
|2019-02-18T16:02:20Z|null   |null    |null    |null    |null    |
|2019-02-18T16:02:20Z|null   |null    |null    |null    |null    |
|2019-02-18T16:02:20Z|null   |null    |null    |null    |null    |
|2019-02-18T16:02:20Z|null   |null    |null    |null    |null    |
|2019-02-18T16:02:20Z|null   |null    |null    |null    |null    |
+--------------------+-------+--------+--------+--------+--------+
I want to have the server_received_time in this format yyyy-MM-dd:hh

The to_* methods take the actual input format, not the desired output format. To format the value you have to convert the data back to a string with date_format:
import org.apache.spark.sql.functions._
val df = Seq("2019-02-18T16:02:20Z").toDF("server_received_time")
df.select(date_format(to_timestamp($"server_received_time"), "yyyy-MM-dd:hh")).show
// +----------------------------------------------------------------+
// |date_format(to_timestamp(`server_received_time`), yyyy-MM-dd:hh)|
// +----------------------------------------------------------------+
// |                                                   2019-02-18:05|
// +----------------------------------------------------------------+

The input format is different from the default, so parse it with an explicit pattern. This should work, as below:
df.select(date_format(to_timestamp($"server_received_time", "yyyy-MM-dd'T'HH:mm:ss'Z'"), "yyyy-MM-dd:hh").as("custom_date"))
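One caveat worth noting: hh in the output pattern is the 12-hour clock field, so 16:02 renders as hour 04. If a 24-hour hour is intended, use HH instead; a minimal sketch of the same select under that assumption:
df.select(
  date_format(
    to_timestamp($"server_received_time", "yyyy-MM-dd'T'HH:mm:ss'Z'"),
    "yyyy-MM-dd:HH"  // HH = 24-hour clock; hh would give 04 instead of 16
  ).as("custom_date")
).show(false)
// 2019-02-18T16:02:20Z -> 2019-02-18:16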

Related

Scala Spark dataframe filter using multiple column based on available value

I need to filter a dataframe with the below criteria.
I have 2 columns: 4Wheel (Subaru, Toyota, GM, null/empty) and 2Wheel (Yamaha, Harley, Indian, null/empty).
I have to filter on 4Wheel with values (Subaru, Toyota); if 4Wheel is empty/null, then filter on 2Wheel with values (Yamaha, Harley).
I couldn't find this type of filtering in any examples. I am new to Spark/Scala, so I could not figure out how to implement this.
Thanks,
Barun.
You can use the Spark SQL built-in function when to check whether a column is null or empty, and filter accordingly:
import org.apache.spark.sql.functions.{col, when}
dataframe.filter(
  when(col("4Wheel").isNull || col("4Wheel").equalTo(""),
    col("2Wheel").isin("Yamaha", "Harley")
  ).otherwise(
    col("4Wheel").isin("Subaru", "Toyota")
  )
)
So if you have the following input:
+---+------+------+
|id |4Wheel|2Wheel|
+---+------+------+
|1 |Toyota|null |
|2 |Subaru|null |
|3 |GM |null |
|4 |null |Yamaha|
|5 | |Yamaha|
|6 |null |Harley|
|7 | |Harley|
|8 |null |Indian|
|9 | |Indian|
|10 |null |null |
+---+------+------+
You get the following filtered output:
+---+------+------+
|id |4Wheel|2Wheel|
+---+------+------+
|1 |Toyota|null |
|2 |Subaru|null |
|4 |null |Yamaha|
|5 | |Yamaha|
|6 |null |Harley|
|7 | |Harley|
+---+------+------+
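Equivalently, since filter accepts any boolean Column, the same logic can be written with plain boolean operators instead of when/otherwise; a sketch, assuming the same dataframe variable:
import org.apache.spark.sql.functions.col
// True when 4Wheel is null or the empty string
val fourWheelMissing = col("4Wheel").isNull || col("4Wheel") === ""
dataframe.filter(
  (fourWheelMissing && col("2Wheel").isin("Yamaha", "Harley")) ||
  (!fourWheelMissing && col("4Wheel").isin("Subaru", "Toyota"))
)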

How to replace the values of all columns starting with "stage_{col}" with NULL when a condition is satisfied

I have a scenario where the final dataframe below is the result of joining stage and foundation.
+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|ID_key |ICC_key |suff_key |stage_{timestamp} |stage_{code}|stage_{dol1}|stage_{dol2}|final_{timestamp} |final_{code}|final_{dol1}|final_{dol2}|
+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|222 |222 |1 |2019-02-02 21:50:25.585|9123 |20.00 |1000.00 |2019-03-02 21:50:25.585|7123 |30.00 |200.00 |
|333 |333 |1 |2020-03-03 21:50:25.585|8123 |30.00 |200.00 |2020-01-03 21:50:25.585|823 |30.00 |200.00 |
|444 |444 |1 |2020-04-03 21:50:25.585|8123 |30.00 |200.00 |null |null |null |null |
|555 |333 |1 |null |null |null |null |2020-05-03 21:50:25.585|813 |30.00 |200.00 |
|111 |111 |1 |2020-01-01 21:50:25.585|A123 |10.00 |99.00 |null |null |null |null |
+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
I am looking for logic that, on each row where final_{timestamp} > stage_{timestamp}, replaces the value of every column starting with stage_{} with null, like below:
+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|ID_key |ICC_key |suff_key |stage_{timestamp} |stage_{code}|stage_{dol1}|stage_{dol2}|final_{timestamp} |final_{code}|final_{dol1}|final_{dol2}|
+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|222 |222 |1 |null |null |null |null |2019-03-02 21:50:25.585|7123 |30.00 |200.00 |
|333 |333 |1 |2020-03-03 21:50:25.585|8123 |30.00 |200.00 |2020-01-03 21:50:25.585|823 |30.00 |200.00 |
|444 |444 |1 |2020-04-03 21:50:25.585|8123 |30.00 |200.00 |null |null |null |null |
|555 |333 |1 |null |null |null |null |2020-05-03 21:50:25.585|813 |30.00 |200.00 |
|111 |111 |1 |2020-01-01 21:50:25.585|A123 |10.00 |99.00 |null |null |null |null |
+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
It would be great if you could help me with the logic.
Here's the code:
// Select all stage columns from the dataframe
val stageColumns = df.columns.filter(_.startsWith("stage"))
// For each stage column, nullify it unless the condition is satisfied
stageColumns.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, when(col("final_{timestamp}") <= col("stage_{timestamp}"), col(c)).otherwise(lit(null)))
}
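As a side note, the .otherwise(lit(null)) is redundant: when without an otherwise already yields null for unmatched rows, so the fold body can be shortened to:
acc.withColumn(c, when(col("final_{timestamp}") <= col("stage_{timestamp}"), col(c)))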
Check the code below.
Condition
scala> val expr = col("final_{timestamp}") > col("stage_{timestamp}")
Condition Matched Columns
scala> val matched = df
.columns
.filter(_.startsWith("stage"))
.map(c => (when(expr,lit(null)).otherwise(col(c))).as(c))
Condition Not Matched Columns
scala> val notMatched = df
.columns
.filter(!_.startsWith("stage"))
.map(c => col(c).as(c))
Combining Not Matched & Matched Columns
scala> val allColumns = notMatched ++ matched
Final Result
scala> df.select(allColumns:_*).show(false)
+------+-------+--------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|ID_key|ICC_key|suff_key|final_{timestamp} |final_{code}|final_{dol1}|final_{dol2}|stage_{timestamp} |stage_{code}|stage_{dol1}|stage_{dol2}|
+------+-------+--------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|222 |222 |1 |2019-03-02 21:50:25.585|7123 |30.00 |200.00 |null |null |null |null |
|333 |333 |1 |2020-01-03 21:50:25.585|823 |30.00 |200.00 |2020-03-03 21:50:25.585|8123 |30.00 |200.00 |
|444 |444 |1 |null |null |null |null |null |null |null |null |
|555 |333 |1 |2020-05-03 21:50:25.585|813 |30.00 |200.00 |null |null |null |null |
|111 |111 |1 |null |null |null |null |null |null |null |null |
+------+-------+--------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
PySpark solution:
import pyspark.sql.functions as F
# test data
tst = sqlContext.createDataFrame([
    (867, 0.12, 'G', '2020-07-01 17-49-32', '2020-07-02 17-49-32'),
    (430, 0.72, 'R', '2020-07-01 17-49-32', '2020-07-02 17-49-32'),
    (658, 0.32, 'A', '2020-07-01 17-49-32', '2020-06-01 17-49-32'),
    (157, 0.83, 'R', '2020-07-01 17-49-32', '2020-06-01 17-49-32'),
    (521, 0.12, 'G', '2020-07-01 17-49-32', '2020-08-01 17-49-32'),
    (867, 0.49, 'A', '2020-07-01 16-45-32', '2020-08-01 17-49-32'),
    (430, 0.14, 'G', '2020-07-01 16-45-32', '2020-07-01 17-49-32'),
    (867, 0.12, 'G', '2020-07-01 16-45-32', '2020-07-01 17-49-32')],
    schema=['stage_1', 'stage_2', 'RAG', 'timestamp', 'timestamp1'])
# change the string columns to dates
tst_format = (tst.withColumn('timestamp', F.to_date('timestamp', format='yyyy-MM-dd'))
                 .withColumn('timestamp1', F.to_date('timestamp1', format='yyyy-MM-dd')))
# extract the columns whose names contain 'stage_' and retain the others
col_info = [x for x in tst_format.columns if 'stage_' in x]
col_orig = list(set(tst_format.columns) - set(col_info))
# build the query expression
expr = col_orig + [(F.when(F.col('timestamp1') > F.col('timestamp'), F.col(x))
                     .otherwise(F.lit(None))).alias(x) for x in col_info]
# execute the query
tst_res = tst_format.select(*expr)
The results are:
+---+----------+----------+-------+-------+
|RAG|timestamp1| timestamp|stage_1|stage_2|
+---+----------+----------+-------+-------+
| G|2020-07-02|2020-07-01| 867| 0.12|
| R|2020-07-02|2020-07-01| 430| 0.72|
| A|2020-06-01|2020-07-01| null| null|
| R|2020-06-01|2020-07-01| null| null|
| G|2020-08-01|2020-07-01| 521| 0.12|
| A|2020-08-01|2020-07-01| 867| 0.49|
| G|2020-07-01|2020-07-01| null| null|
| G|2020-07-01|2020-07-01| null| null|
+---+----------+----------+-------+-------+

Concatenate names of the columns in a dataframe which contain true

I have a dataframe that looks like this:
|id|type|isBlack|isHigh|isLong|
|1 |A |true |false |null |
|2 |B |true |true |true |
|3 |C |false |null |null |
I'm trying to concatenate names of columns which contain 'true' into another column to get this:
|id|type|isBlack|isHigh|isLong|Description |
|1 |A |true |false |null |isBlack |
|2 |B |true |true |true |isBlack,isHigh,isLong|
|3 |C |false |null |null |null |
Now, I have a predefined list of column names that I need to check for (in this example it'd be Seq("isBlack", "isHigh", "isLong")), all of which are present in the dataframe (this list can be somewhat long).
val cols = Seq("isBlack", "isHigh", "isLong")
df.withColumn("description", concat_ws(",", cols.map(x => when(col(x), x)): _*)).show(false)
+---+----+-------+------+------+---------------------+
|id |type|isBlack|isHigh|isLong|description |
+---+----+-------+------+------+---------------------+
|1 |A |true |false |null |isBlack |
|2 |B |true |true |true |isBlack,isHigh,isLong|
|3 |C |false |null |null | |
+---+----+-------+------+------+---------------------+
First, map the columns to their names when the value is true:
cols.map(x => when(col(x), x))
Then use concat_ws to combine them: concat_ws(",", cols.map(x => when(col(x), x)): _*)
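One difference from the desired output above: concat_ws skips nulls and returns an empty string when nothing matches (row 3), not null. If a null is required there, one option is to wrap the expression; a sketch under that assumption:
import org.apache.spark.sql.functions.{col, concat_ws, length, when}
val desc = concat_ws(",", cols.map(x => when(col(x), x)): _*)
// when without otherwise yields null for rows where the description is empty
df.withColumn("description", when(length(desc) > 0, desc)).show(false)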

Value of column changes after changing the Date format in scala spark

This is my data frame before the date-format change:
+---------------------+---------------+-------------------------+----------------+------------+-----+-----------+-------------------------+---------------------------+-------------------------+-----------------------------+----------------------------+---------------------+-------------------------+----------------------+--------------------------+--------------+------------------------+-------------+---------------+-------------------------+
|Source_organizationId|Source_sourceId|FilingDateTime_1 |SourceTypeCode_1|DocumentId_1|Dcn_1|DocFormat_1|StatementDate_1 |IsFilingDateTimeEstimated_1|ContainsPreliminaryData_1|CapitalChangeAdjustmentDate_1|CumulativeAdjustmentFactor_1|ContainsRestatement_1|FilingDateTimeUTCOffset_1|ThirdPartySourceCode_1|ThirdPartySourcePriority_1|SourceTypeId_1|ThirdPartySourceCodeId_1|FFAction|!|_1|DataPartition_1|TimeStamp |
+---------------------+---------------+-------------------------+----------------+------------+-----+-----------+-------------------------+---------------------------+-------------------------+-----------------------------+----------------------------+---------------------+-------------------------+----------------------+--------------------------+--------------+------------------------+-------------+---------------+-------------------------+
|4295876589 |1 |1977-02-14T03:00:00+00:00|YUH |null |null |null |1976-12-31T00:00:00+00:00|true |false |1976-12-31T00:00:00+00:00 |0.82457 |false |540 |SS |1 |3013057 |1000716240 |I|!| |Japan |2018-05-03T07:03:27+00:00|
|4295876589 |8 |1984-02-14T03:00:00+00:00|YUH |null |null |null |1983-12-31T00:00:00+00:00|true |false |1983-12-31T00:00:00+00:00 |0.82457 |false |540 |SS |1 |3013057 |1000716240 |I|!| |Japan |2018-05-03T09:46:58+00:00|
|4295876589 |1 |1977-02-14T03:00:00+00:00|YUH |null |null |null |1976-12-31T00:00:00+00:00|true |false |1976-12-31T00:00:00+00:00 |0.82457 |false |540 |SS |1 |3013057 |1000716240 |I|!| |Japan |2018-05-03T07:30:16+00:00|
+---------------------+---------------+-------------------------+----------------+------------+-----+-----------+-------------------------+---------------------------+-------------------------+-----------------------------+----------------------------+---------------------+-------------------------+----------------------+--------------------------+--------------+------------------------+-------------+---------------+-------------------------+
This is what I do to change the date format:
val df2resultTimestamp = finalXmlDf.withColumn("FilingDateTime_1", date_format(col("FilingDateTime_1"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
.withColumn("StatementDate_1", date_format(col("StatementDate_1"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
.withColumn("CapitalChangeAdjustmentDate_1", date_format(col("CapitalChangeAdjustmentDate_1"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
.withColumn("CumulativeAdjustmentFactor_1", regexp_replace(format_number($"CumulativeAdjustmentFactor_1".cast(DoubleType), 5), ",", ""))
This is the output I get, where the FilingDateTime_1 column value has changed:
+---------------------+---------------+--------------------+----------------+------------+-----+-----------+--------------------+---------------------------+-------------------------+-----------------------------+----------------------------+---------------------+-------------------------+----------------------+--------------------------+--------------+------------------------+-------------+---------------+-------------------------+
|Source_organizationId|Source_sourceId|FilingDateTime_1 |SourceTypeCode_1|DocumentId_1|Dcn_1|DocFormat_1|StatementDate_1 |IsFilingDateTimeEstimated_1|ContainsPreliminaryData_1|CapitalChangeAdjustmentDate_1|CumulativeAdjustmentFactor_1|ContainsRestatement_1|FilingDateTimeUTCOffset_1|ThirdPartySourceCode_1|ThirdPartySourcePriority_1|SourceTypeId_1|ThirdPartySourceCodeId_1|FFAction|!|_1|DataPartition_1|TimeStamp |
+---------------------+---------------+--------------------+----------------+------------+-----+-----------+--------------------+---------------------------+-------------------------+-----------------------------+----------------------------+---------------------+-------------------------+----------------------+--------------------------+--------------+------------------------+-------------+---------------+-------------------------+
|4295876589 |1 |1977-02-14T08:30:00Z|YUH |null |null |null |1976-12-31T05:30:00Z|true |false |1976-12-31T05:30:00Z |0.82457 |false |540 |SS |1 |3013057 |1000716240 |I|!| |Japan |2018-05-03T07:03:27+00:00|
|4295876589 |8 |1984-02-14T08:30:00Z|YUH |null |null |null |1983-12-31T05:30:00Z|true |false |1983-12-31T05:30:00Z |0.82457 |false |540 |SS |1 |3013057 |1000716240 |I|!| |Japan |2018-05-03T09:46:58+00:00|
|4295876589 |1 |1977-02-14T08:30:00Z|YUH |null |null |null |1976-12-31T05:30:00Z|true |false |1976-12-31T05:30:00Z |0.82457 |false |540 |SS |1 |3013057 |1000716240 |I|!| |Japan |2018-05-03T07:30:16+00:00|
+---------------------+---------------+--------------------+----------------+------------+-----+-----------+--------------------+---------------------------+-------------------------+-----------------------------+----------------------------+---------------------+-------------------------+----------------------+--------------------------+--------------+------------------------+-------------+---------------+-------------------------+
The value should be 1984-02-14T03:00:00Z.
I don't know what I am missing here.
All you need is the additional to_timestamp inbuilt function, as below. Without it, date_format implicitly casts the string to a timestamp in the session time zone, which shifts the wall-clock value; parsing with an explicit pattern keeps it intact.
val df2resultTimestamp = df.withColumn("FilingDateTime_1", date_format(to_timestamp(col("FilingDateTime_1"), "yyyy-MM-dd'T'HH:mm:ss"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
.withColumn("StatementDate_1", date_format(to_timestamp(col("StatementDate_1"), "yyyy-MM-dd'T'HH:mm:ss"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
.withColumn("CapitalChangeAdjustmentDate_1", date_format(to_timestamp(col("CapitalChangeAdjustmentDate_1"), "yyyy-MM-dd'T'HH:mm:ss"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
.withColumn("CumulativeAdjustmentFactor_1", regexp_replace(format_number($"CumulativeAdjustmentFactor_1".cast(DoubleType), 5), ",", ""))
which should give you the correct output as
+---------------------+---------------+--------------------+----------------+------------+-----+-----------+--------------------+---------------------------+-------------------------+-----------------------------+----------------------------+---------------------+-------------------------+----------------------+--------------------------+--------------+------------------------+-------------+---------------+-------------------------+
|Source_organizationId|Source_sourceId|FilingDateTime_1 |SourceTypeCode_1|DocumentId_1|Dcn_1|DocFormat_1|StatementDate_1 |IsFilingDateTimeEstimated_1|ContainsPreliminaryData_1|CapitalChangeAdjustmentDate_1|CumulativeAdjustmentFactor_1|ContainsRestatement_1|FilingDateTimeUTCOffset_1|ThirdPartySourceCode_1|ThirdPartySourcePriority_1|SourceTypeId_1|ThirdPartySourceCodeId_1|FFAction|!|_1|DataPartition_1|TimeStamp |
+---------------------+---------------+--------------------+----------------+------------+-----+-----------+--------------------+---------------------------+-------------------------+-----------------------------+----------------------------+---------------------+-------------------------+----------------------+--------------------------+--------------+------------------------+-------------+---------------+-------------------------+
|4295876589 |1 |1977-02-14T03:00:00Z|YUH |null |null |null |1976-12-31T00:00:00Z|true |false |1976-12-31T00:00:00Z |0.82457 |false |540 |SS |1 |3013057 |1000716240 |I|!| |Japan |2018-05-03T07:03:27+00:00|
|4295876589 |8 |1984-02-14T03:00:00Z|YUH |null |null |null |1983-12-31T00:00:00Z|true |false |1983-12-31T00:00:00Z |0.82457 |false |540 |SS |1 |3013057 |1000716240 |I|!| |Japan |2018-05-03T09:46:58+00:00|
|4295876589 |1 |1977-02-14T03:00:00Z|YUH |null |null |null |1976-12-31T00:00:00Z|true |false |1976-12-31T00:00:00Z |0.82457 |false |540 |SS |1 |3013057 |1000716240 |I|!| |Japan |2018-05-03T07:30:16+00:00|
+---------------------+---------------+--------------------+----------------+------------+-----+-----------+--------------------+---------------------------+-------------------------+-----------------------------+----------------------------+---------------------+-------------------------+----------------------+--------------------------+--------------+------------------------+-------------+---------------+-------------------------+
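To make the rendered values deterministic regardless of where the job runs, the session time zone can also be pinned explicitly; a sketch, assuming Spark 2.2+ where this config exists:
// Pin the session time zone so implicit string-to-timestamp casts render in UTC
spark.conf.set("spark.sql.session.timeZone", "UTC")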

How to create columns from multi-layer struct type in spark dataframe?

This is the schema of my main data frame:
root
|-- DataPartition: string (nullable = true)
|-- TimeStamp: string (nullable = true)
|-- _lineItemId: long (nullable = true)
|-- _organizationId: long (nullable = true)
|-- fl:FinancialConceptGlobal: string (nullable = true)
|-- fl:FinancialConceptGlobalId: long (nullable = true)
|-- fl:FinancialConceptLocal: string (nullable = true)
|-- fl:FinancialConceptLocalId: long (nullable = true)
|-- fl:InstrumentId: long (nullable = true)
|-- fl:IsCredit: boolean (nullable = true)
|-- fl:IsDimensional: boolean (nullable = true)
|-- fl:IsRangeAllowed: boolean (nullable = true)
|-- fl:IsSegmentedByOrigin: boolean (nullable = true)
|-- fl:LineItemName: struct (nullable = true)
| |-- _VALUE: string (nullable = true)
| |-- _languageId: long (nullable = true)
|-- fl:LocalLanguageLabel: struct (nullable = true)
| |-- _VALUE: string (nullable = true)
| |-- _languageId: long (nullable = true)
|-- fl:SegmentChildDescription: struct (nullable = true)
| |-- _VALUE: string (nullable = true)
| |-- _languageId: long (nullable = true)
|-- fl:SegmentGroupDescription: string (nullable = true)
|-- fl:Segments: struct (nullable = true)
| |-- fl:SegmentSequence: struct (nullable = true)
| | |-- _VALUE: long (nullable = true)
| | |-- _segmentId: long (nullable = true)
|-- fl:StatementTypeCode: string (nullable = true)
|-- FFAction|!|: string (nullable = true)
From this my required output is below:
LineItem.organizationId|^|LineItem.lineItemId|^|StatementTypeCode|^|LineItemName|^|LocalLanguageLabel|^|FinancialConceptLocal|^|FinancialConceptGlobal|^|IsDimensional|^|InstrumentId|^|LineItemSequence|^|PhysicalMeasureId|^|FinancialConceptCodeGlobalSecondary|^|IsRangeAllowed|^|IsSegmentedByOrigin|^|SegmentGroupDescription|^|SegmentChildDescription|^|SegmentChildLocalLanguageLabel|^|LocalLanguageLabel.languageId|^|LineItemName.languageId|^|SegmentChildDescription.languageId|^|SegmentChildLocalLanguageLabel.languageId|^|SegmentGroupDescription.languageId|^|SegmentMultipleFundbDescription|^|SegmentMultipleFundbDescription.languageId|^|IsCredit|^|FinancialConceptLocalId|^|FinancialConceptGlobalId|^|FinancialConceptCodeGlobalSecondaryId|^|FFAction|!|
4295879842|^|1246|^|CUS|^|Net Sales-Customer Segment|^|相手先別の販売高(相手先別)|^|JCSNTS|^|REXM|^|False|^||^||^||^||^|False|^|False|^|CUS_JCSNTS|^||^||^|505126|^|505074|^|505074|^|505126|^|505126|^||^|505074|^|True|^|3020155|^|3015249|^||^|I|!|
To get the above output, this is what I have tried:
val dfContentEnvelope = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "env:ContentEnvelope").load("C://Users//u6034690//Desktop//SPARK//trfsmallfffile//XML")
val dfContentItem = dfContentEnvelope.withColumn("column1", explode(dfContentEnvelope("env:Body.env:ContentItem"))).select($"env:Header.fun:DataPartitionId".as("DataPartition"), $"env:Header.env:info.env:TimeStamp".as("TimeStamp"), $"column1.*")
val dfType = dfContentItem.select(getDataPartition($"DataPartition").as("DataPartition"), $"TimeStamp".as("TimeStamp"), $"env:Data.fl:LineItem.*", getFFActionParent($"_action").as("FFAction|!|")).filter($"env:Data.fl:LineItem._organizationId".isNotNull)
With this I am getting the below output:
+------------------+-------------------------+-----------+---------------+-------------------------+---------------------------+------------------------+--------------------------+---------------+-----------+----------------+-----------------+----------------------+-----------------------------------------------------------------------------------------------------+---------------------+--------------------------+--------------------------+-----------+--------------------+-----------+
|DataPartition |TimeStamp |_lineItemId|_organizationId|fl:FinancialConceptGlobal|fl:FinancialConceptGlobalId|fl:FinancialConceptLocal|fl:FinancialConceptLocalId|fl:InstrumentId|fl:IsCredit|fl:IsDimensional|fl:IsRangeAllowed|fl:IsSegmentedByOrigin|fl:LineItemName |fl:LocalLanguageLabel|fl:SegmentChildDescription|fl:SegmentGroupDescription|fl:Segments|fl:StatementTypeCode|FFAction|!||
+------------------+-------------------------+-----------+---------------+-------------------------+---------------------------+------------------------+--------------------------+---------------+-----------+----------------+-----------------+----------------------+-----------------------------------------------------------------------------------------------------+---------------------+--------------------------+--------------------------+-----------+--------------------+-----------+
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|3 |4298009288 |XTOT |3016350 |null |null |null |true |false |false |false |[Total Assets,505074] |null |null |null |null |BAL |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|9 |4298009288 |XTCOI |3016329 |null |null |21521455386 |true |false |false |false |[S/O-Ordinary Shares,505074] |null |null |null |null |BAL |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|10 |4298009288 |XTCOC |3016328 |null |null |null |true |false |false |false |[Total Equivalent No of Common Shares O/S,505074] |null |null |null |null |BAL |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|11 |4298009288 |XTCTI |3016331 |null |null |21521455386 |true |false |false |false |[T/S-Ordinary Shares,505074] |null |null |null |null |BAL |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|19 |4298009288 |ESGA |3018991 |null |null |null |false |false |false |false |[General and administrative expense,505074] |null |null |null |null |INC |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|20 |4298009288 |XTOE |3016349 |null |null |null |false |false |false |false |[Total Operating Expense,505074] |null |null |null |null |INC |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|21 |4298009288 |XIBT |3016299 |null |null |null |true |false |false |false |[Net Income Before Taxes,505074] |null |null |null |null |INC |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|22 |4298009288 |TTAX |3019472 |null |null |null |false |false |false |false |[Income tax benefit,505074] |null |null |null |null |INC |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|23 |4298009288 |XIAT |3016297 |null |null |null |true |false |false |false |[Net Income After Taxes,505074] |null |null |null |null |INC |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|24 |4298009288 |XBXP |3016252 |null |null |null |true |false |false |false |[Net Income Before Extra. Items,505074] |null |null |null |null |INC |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|25 |4298009288 |XNIC |3019922 |null |null |null |true |false |false |false |[Net loss,505074] |null |null |null |null |INC |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|26 |4298009288 |XNCN |3016316 |null |null |null |true |false |false |false |[Income Available to Com Excl ExtraOrd,505074] |null |null |null |null |INC |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|27 |4298009288 |XNCX |3016318 |null |null |null |true |false |false |false |[Net loss,505074] |null |null |null |null |INC |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|29 |4298009288 |CDNI |3018735 |null |null |null |true |false |false |false |[Diluted Net Income,505074] |null |null |null |null |INC |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|30 |4298009288 |XTAX |3019589 |null |null |null |false |false |false |false |[Income Taxes - Total,505074] |null |null |null |null |INC |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|33 |4298009288 |RNTS |3015275 |null |null |null |true |false |false |false |[Revenues,505074] |null |null |null |null |INC |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|34 |4298009288 |XTLR |3016345 |null |null |null |true |false |false |false |[Total revenues,505074] |null |null |null |null |INC |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|35 |4298009288 |XTCII |3016326 |null |null |21521455386 |true |false |false |null |[Common Shares Issued - (Instrument Level),505074] |null |null |null |null |BAL |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|36 |4298009288 |XTCTIPF |1002023922 |null |null |21521455386 |true |false |false |null |[Common Treasury Shares on Instrument Level Multiplied to its Conversion to Primary Factor,505074] |null |null |null |null |BAL |I|!| |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|37 |4298009288 |XTCOIPF |1002023921 |null |null |21521455386 |true |false |false |null |[Common Shares Outstanding on Instrument Level Multiplied to its Conversion to Primary Factor,505074]|null |null |null |null |BAL |I|!| |
+------------------+-------------------------+-----------+---------------+-------------------------+---------------------------+------------------------+--------------------------+---------------+-----------+----------------+-----------------+----------------------+-----------------------------------------------------------------------------------------------------+---------------------+--------------------------+--------------------------+-----------+--------------------+-----------+
My issue is with the column named fl:LineItemName.
It is a struct type, and I need to create two different columns out of it:
one for the _VALUE as LineItemName and another for the _languageId as LanguageId.
I have to do the same for fl:LocalLanguageLabel and fl:SegmentChildDescription.
Do I have to do this using the withColumn option, or is there a way to do it without that?
This is working for me except for the last line:
val dfType = dfContentItem.select(getDataPartition($"DataPartition").as("DataPartition"), $"TimeStamp".as("TimeStamp"), $"env:Data.fl:LineItem.*", getFFActionParent($"_action").as("FFAction|!|")).filter($"env:Data.fl:LineItem._organizationId".isNotNull)
val dfnewTemp = dfType
  .withColumn("LineItemName", $"fl:LineItemName._VALUE")
  .withColumn("LineItemName.languageId", $"fl:LineItemName._languageId")
  .withColumn("LocalLanguageLabel", $"fl:LocalLanguageLabel._VALUE")
  .withColumn("LocalLanguageLabel.languageId", $"fl:LocalLanguageLabel._languageId")
  .withColumn("SegmentChildDescription", $"fl:SegmentChildDescription._VALUE")
  .withColumn("SegmentChildDescription.languageId", $"fl:SegmentChildDescription._languageId")
  .drop($"fl:LineItemName")
  .drop($"fl:LocalLanguageLabel")
  .drop($"fl:SegmentChildDescription")
dfnewTemp.show(false)
val temp = dfnewTemp.select(dfnewTemp.columns.filter(x => !x.equals("fl:Segments")).map(x => col(x).as(x.replace("_", "LineItem_").replace("fl:", ""))): _*)
What you have to do is use withColumn and select the fields inside the structs. The fl:LineItemName column contains a struct with two values, _VALUE and _languageId, which can be selected as follows:
val df = dfType.withColumn("LineItemName", $"fl:LineItemName._VALUE")
.withColumn("LanguageId", $"fl:LineItemName._languageId")
.drop("fl:LineItemName")
For the other two mentioned columns, simply do the same thing.
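For completeness, the same pattern applied to the other two struct columns might look like this (a sketch, assuming the schema shown above; dfFlat is a hypothetical name):
val dfFlat = df
  .withColumn("LocalLanguageLabel", $"fl:LocalLanguageLabel._VALUE")
  .withColumn("LocalLanguageLabel.languageId", $"fl:LocalLanguageLabel._languageId")
  .withColumn("SegmentChildDescription", $"fl:SegmentChildDescription._VALUE")
  .withColumn("SegmentChildDescription.languageId", $"fl:SegmentChildDescription._languageId")
  .drop("fl:LocalLanguageLabel", "fl:SegmentChildDescription")
// Note: column names containing dots must be backtick-quoted when referenced later,
// e.g. col("`LocalLanguageLabel.languageId`")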