copy avro schema of one data frame to another - pyspark

I have a dataset A with schema A, and a dataset B with schema B. Datasets A and B are mostly similar (they have the same columns, but the data types differ for a few of them). One example: a column in dataset A holds a date value ('2020-08-03', represented as a string), while the same column in dataset B is represented as an epoch number (long). Now I have to merge these two datasets, and to merge them I have to use the same data types in both.
Could you please suggest how I do this? Is this possible?

You have to use Spark SQL functions (from pyspark.sql.functions) to change the column types. For example, you can convert your string date to a Unix timestamp:
df.withColumn("date", unix_timestamp("date", "yyyy-MM-dd"))
Then you can union the two dataframes.
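A minimal end-to-end sketch, assuming both datasets have a column called date that is a 'yyyy-MM-dd' string in A and epoch seconds (long) in B; the other column name and the sample values are only illustrative:
from pyspark.sql import SparkSession
from pyspark.sql.functions import unix_timestamp

spark = SparkSession.builder.getOrCreate()

# Dataset A stores the date as a string, dataset B as epoch seconds (long).
df_a = spark.createDataFrame([("2020-08-03", "a")], ["date", "value"])
df_b = spark.createDataFrame([(1596412800, "b")], ["date", "value"])

# Align the types: convert A's string date to epoch seconds so it matches B.
df_a_aligned = df_a.withColumn("date", unix_timestamp("date", "yyyy-MM-dd"))

# Now both schemas agree, so a union is possible.
# unionByName (Spark 2.3+) matches columns by name; with identical column order, union() works too.
merged = df_a_aligned.unionByName(df_b)
merged.printSchema()
merged.show()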

Related

ADF Unpivot Dynamically With New Column

There is an Excel worksheet where I want to unpivot all the columns after "Currency Code" into rows. The number of columns that need to be unpivoted might vary, and new columns might be added after "NetIUSD". Is there a way to dynamically unpivot this worksheet with unknown columns?
It worked when I projected all the fields, defined the data type of all the numerical fields as "double", and set the unpivot column data type to "double" as well. However, the issue is that additional columns might be added to the source file, and I won't be able to define their data types ahead of time; in that case, if a new column has a data type other than "double", it throws an error saying the new column is not of the same unpivot data type.
I tried to repro this in a data flow with sample input data.
Take the unpivot transformation and, in the unpivot settings, do the following:
Ungroup by: Code, Currency_code
Unpivot column: Currency
Unpivoted Columns: Column arrangement: Normal
Column name: Amount
Type: string
Data Preview
All columns other than those listed in Ungroup by are dynamically unpivoted, even if you add additional fields.
I can confirm Aswin's answer. I had the same issue: the data flow failed when new columns were added dynamically. The reason was the data type of the unpivoted columns; changing it to string made everything go smoothly.
The imported projection does not affect this case. I've tried both an imported and a manually coded projection, and both work with the "string" data type.
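For comparison only, the same idea can be sketched in PySpark rather than ADF: unpivot every column that is not one of the fixed key columns and cast each value to string, so newly added columns with other data types cannot break the unpivot. The sample data and the extra "Fee" column below are made up for illustration:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Made-up sample data; "Fee" stands in for a column added later to the source file.
df = spark.createDataFrame(
    [("C1", "USD", 10.5, 20)],
    ["Code", "Currency_code", "NetIUSD", "Fee"],
)

key_cols = ["Code", "Currency_code"]
value_cols = [c for c in df.columns if c not in key_cols]  # everything else is unpivoted

# Cast every value column to string so mixed input types all fit one output column.
pairs = ", ".join("'{0}', cast(`{0}` as string)".format(c) for c in value_cols)
stack_expr = "stack({0}, {1}) as (Currency, Amount)".format(len(value_cols), pairs)

unpivoted = df.selectExpr(*(key_cols + [stack_expr]))
unpivoted.show()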

reduce function not working in derived column in adf mapping data flow

I am trying to create a derived column based on a matching condition and to sum the values of multiple matching columns dynamically, so I am using the reduce function in a derived column in an ADF mapping data flow. But the column is not getting created even though the transformation looks correct.
Columns from source
Derived column logic
Derived column data preview without the new columns as per logic
I can see only the fields from the source, but not the derived column fields. If I use only array($$), I can see the fields getting created.
Derived column data preview with logic only array($$)
How can I get the derived column with the summation of all the fields matching the condition?
We receive 48 weeks of forecast data, and the data has to be prepared on a monthly basis.
eg: Input data
Output data:
JAN
----
506 -- This is for first record i.e. (94 + 105 + 109 + 103 + 95)
The problem is that the array($$) in the reduce function has only one element, so the reduce function cannot accumulate the content of the matching columns correctly.
You can solve this by using two derived columns and a data flow parameter as follows:
Create derived columns with pattern matching for each month-week as you did before, but put the reference $$ into the value field instead of the reduce(...) function.
This will create derived columns like 0jan, 1jan, etc. containing a copy of the original values. For example, Week 0 (1 Jan - 7 Jan) => 0jan with value 95.
This step gives you a predefined set of column names for each week, which you can use to summarize the values with specific column names.
Define Data Flow parameters for each month containing the month-week column names in a string array, like this:
ColNamesJan = ['0jan', '1jan', etc.], ColNamesFeb = ['0feb', '1feb', etc.], and so on.
You will use these column names in a reduce function to summarize the month-week columns to monthly column in the next step.
Create a derived column for each month, which will contain the monthly totals, and use the following reduce function to sum the weekly values:
reduce(array(byNames($ColNamesJan)), 0, #acc + toInteger(toString(#item)), #result)
Replace the parameter name accordingly.
I was able to summarize the columns dynamically with the above solution.
Please let me know if you need more information (e.g. screenshots) to reproduce the solution.
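Outside of ADF, the same two-step idea (fix the set of weekly column names, then fold them into one monthly total) can be sketched in PySpark; the column names and sample values below are only illustrative:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Weekly forecast columns named like the '0jan', '1jan', ... columns above (sample data).
df = spark.createDataFrame(
    [(1, 94, 105, 109, 103, 95)],
    ["id", "0jan", "1jan", "2jan", "3jan", "4jan"],
)

# Counterpart of the ColNamesJan data flow parameter.
col_names_jan = ["0jan", "1jan", "2jan", "3jan", "4jan"]

# Fold the listed weekly columns into a single monthly total (506 for this record).
jan_total = sum((F.col(c).cast("int") for c in col_names_jan), F.lit(0))
df.withColumn("JAN", jan_total).show()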
Update -- Here are the screenshots from my test environment.
Data source (data preview):
Derived columns with pattern matching (settings)
Derived columns with pattern matching (data preview)
Data flow parameter:
Derived column for monthly sum (settings):
Derived column for monthly sum (data preview):

How to populate a Spark DataFrame column based on another column's value?

I have a use case where I need to select certain columns from a dataframe containing at least 30 columns and millions of rows.
I'm loading this data from a Cassandra table using Scala and Apache Spark.
I selected the required columns using: df.select("col1","col2","col3","col4")
Now I have to perform a basic groupBy operation to group the data according to src_ip,src_port,dst_ip,dst_port and I also want to have the latest value from a received_time column of the original dataframe.
I want a dataframe with distinct src_ip values with their count and latest received_time in a new column as last_seen.
I know how to use .withColumn, and I also think that .map() could be used here.
Since I'm relatively new in this field, I really don't know how to proceed further. I could really use your help to get done with this task.
Assuming you have a dataframe df with src_ip,src_port,dst_ip,dst_port and received_time, you can try:
val mydf = df
  .groupBy(col("src_ip"), col("src_port"), col("dst_ip"), col("dst_port"))
  .agg(count("received_time").as("row_count"), max(col("received_time")).as("max_received_time"))
The above calculates, for each combination of the group-by columns, the count of received timestamps as well as the maximum timestamp for that group.
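For reference, the same aggregation can be written in PySpark, aliasing the max timestamp as last_seen as asked in the question (df is assumed to already hold the selected columns):
from pyspark.sql import functions as F

# df is assumed to contain src_ip, src_port, dst_ip, dst_port and received_time.
mydf = (
    df.groupBy("src_ip", "src_port", "dst_ip", "dst_port")
      .agg(
          F.count("received_time").alias("row_count"),
          F.max("received_time").alias("last_seen"),
      )
)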

Fastest way to get specific record in Spark DataFrame

I want to collect a specific Row of a Spark 1.6 DataFrame which originates from a partitioned Hive table (the table is partitioned by a String column named date and saved as Parquet).
A record is unambiguously identified by date, section, sample.
In addition, I have the following constraints:
date and section are Strings, sample is Long
date is unique and the Hive table is partitioned by date, but there may be more than one file on HDFS for each date
section is also unique across the dataframe
sample is unique for a given section
So far I use this query, but it takes quite a long time to execute (~25 seconds using 10 executors):
sqlContext.table("mytable")
.where($"date"=== date)
.where($"section"=== section)
.where($"sample" === sample)
.collect()(0)
I also tried to replace collect()(0) with take(1)(0) which is not faster.
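One thing worth trying (not guaranteed to be faster, and shown here as a PySpark sketch using the names from the question): since the date/section/sample triple identifies exactly one row, a limit(1) can be appended so Spark only has to return a single matching row, while the filter on the partition column date still allows partition pruning:
from pyspark.sql import functions as F

row = (
    sqlContext.table("mytable")          # sqlContext as in the original snippet
    .where(F.col("date") == date)        # filter on the partition column first
    .where(F.col("section") == section)
    .where(F.col("sample") == sample)
    .limit(1)                            # at most one row can match anyway
    .collect()[0]
)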

UNION with different data types in db2 server

I have built a query which contains a UNION ALL, but the two parts of it do not have the same data type. I have to display one column, but the formats of the two columns from which I get the data differ.
For example:
select a,b
from c
union all
select d,b
from e
a and d are numbers, but they have different formats: a's length is 15 and d's length is 13. There are no digits after the decimal point.
Using digits, varchar, integer and decimal didn't work; I always get the message: Data conversion or data mapping error.
How can I convert these fields to the same format?
I have no DB2 experience, but can't you just cast a and d to the same type, one that is large enough to handle both formats?
I used the cast function to convert the columns to the same type (varchar with a large length), and then the union worked without problems. When I needed the original type back again, I used the same cast function (this time converting the values to float), and I got the result I wanted.
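To illustrate the cast-then-union idea in code (written here against Spark SQL, since this page is about PySpark; the DB2 syntax is analogous, and DECIMAL(15, 0) is only an assumed common type):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# c and e are assumed to be registered as tables or temp views.
merged = spark.sql("""
    SELECT CAST(a AS DECIMAL(15, 0)) AS a, b FROM c
    UNION ALL
    SELECT CAST(d AS DECIMAL(15, 0)) AS d, b FROM e
""")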