How to extract a specific value in a dictionary column with multiple lists - data-cleaning

I'm trying to extract a specific value inside a column of a dataframe, as you can see in the image below, without any success; referring back to a similar question still didn't work for my code.
Is there any way to extract the values as [Culture, Climate change, technology, ...]?
[screenshot: Data]
[screenshot: First try]
I have tried the split() function, but I reached a dead end: I still need the exact value after the word "name", and this new dataframe contains 75 columns. If I could only get a for loop to extract the value after the word "name" - that's my latest idea for solving the problem.
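A minimal sketch of one possible approach, assuming each cell stores a stringified list of dicts with a "name" key (the exact structure isn't visible in the screenshot, so the column name "topics" and the sample values are hypothetical):

import ast
import pandas as pd

# Hypothetical data mirroring the screenshot: each cell is a string
# holding a list of dicts, each dict carrying an "id" and a "name".
df = pd.DataFrame({
    "topics": [
        '[{"id": 1, "name": "Culture"}, {"id": 2, "name": "Climate change"}]',
        '[{"id": 3, "name": "technology"}]',
    ]
})

def extract_names(cell):
    # Parse the string into a real list of dicts, then pull each "name".
    return [d["name"] for d in ast.literal_eval(cell)]

df["topic_names"] = df["topics"].apply(extract_names)
print(df["topic_names"].tolist())
# [['Culture', 'Climate change'], ['technology']]

This avoids split() entirely by parsing each cell into Python objects first, so no string positions need to be guessed.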

Related

Stack overflow error occurred when same data frame is repeated inside pyspark

When the same dataframe is reused inside a loop, a stack overflow error occurs.
The data volume is only 40k records. I have tried a single-node cluster with 14 GB/28 GB.
Sample data:
FT/RT,Country,Charge_Type,Tariff_Loc,Charge_No,Status,Validity_from,Validity_to,Range_Basis,Limited_Parties,Charge_Detail,Freetime_Unit,Freetime,Count_Holidays,Majeure,Start_Event,Same/Next_Day,Next_Day_if_AFTER,Availability_Date,Route_Group,Route_Code,Origin,LoadZone,FDischZone,PODZone,FDestZone,Equipment_Group,Equipment_Type,Range_From,Range_To,Cargo_Type,commodity,SC_Group,SC_Number,IMO,Shipper_Group,Cnee_Group,Direction,Service,haulage,Transport_Type,Option1,Option2,1st_of_Route_Group,1st_of_LoadZone,1st_of_FDischZone,1st_of_PODZone,1st_of_FDestZone,1st_of_Equipment_Group,1st_of_SC_Group,1st_of_Shipper_Group,1st_of_Cnee_Group,operationalFacilityGroup,operationalFacility,operator,commodityGroup,equipmentType,consignee,consigneeGroup,shipper,shipperGroup,serviceContract,serviceContractGroup,transportMode,agreementType
FT,IN,DET,INCCU,34298,EXPIRED,02-07-2020,30-11-2020,C/B,Y,2,DAY,14,Y,N,DISCHARG,S,null,N,MSL,null,null,null,null,null,null,ADRY,null,null,null,null,2313,null,ONLINE1,null,null,null,IMP,null,null,null,null,null,A1,null,null,null,null,20BULK,null,null,null,INCCU,,MSL,MSL,null,,null,,null,ONLINE1,null,null,SPOT
The expected output is as below.
It works for a few records, but if the dataframe has more records a stack overflow error occurs.
Please find the attached screenshot.
The error occurs mainly because of the usage of DataFrame.withColumn(). Calling this method many times, or inside a loop, can cause this error. Refer to the official documentation to understand DataFrame.withColumn():
pyspark.sql.DataFrame.withColumn — PySpark 3.3.0 documentation (apache.org)
The way to counter this error is to optimize the code. Since you want to convert the data of multiple columns to JSON data, you can try the following approach: instead of using a loop to add a new column of JSON data, use the create_map() function, which combines multiple PySpark columns into one MapType() column.
from pyspark.sql.functions import create_map, lit, col

# Build one MapType() column keyed by the source column names.
df = df.withColumn("dealKeys", create_map(
    lit("Direction"), create_map(lit("Value"), col("Direction"), lit("Description"), lit("...")),
    lit("Country"), create_map(lit("Value"), col("Country"), lit("Description"), lit("..."))
))
The output will be as shown in the image below. I have created a MapType() whose keys are the column names and whose values are themselves MapType() values; each inner map holds the column's value and its description as key-value pairs.
Though this output is not the same as your requirement, data in this shape is much easier to access and transform further, even without loops. You can use df['dealKeys.Direction'] to get the inner value (a MapType()), or df['dealKeys.Direction.Value'] to directly get the value it holds.
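For example, the same values can be pulled out with bracket syntax, which is equivalent to the dotted access above:

from pyspark.sql.functions import col

# getItem()-style access into the nested map built by create_map():
df.select(col("dealKeys")["Direction"]["Value"].alias("direction_value")).show()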

Remove rows with a certain value in any column in pyspark

I am working in PySpark to clean a data set. The data set has "?" in various rows and various columns, and I want to remove any row that has that value anywhere in it. I tried the following:
df = df.replace("?", "np.Nan")
df=df.dropna()
However, it did not remove those values.
I keep looking online but can't find any understandable answers (I am a newbie).
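The likely problem is that "np.Nan" in quotes is just another literal string, so the dataframe still contains no real nulls for dropna() to act on. A minimal sketch of a fix, using hypothetical sample data, is to replace "?" with an actual None:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the real data set: "?" marks missing values.
df = spark.createDataFrame([("a", "?"), ("b", "c"), ("?", "d")], ["col1", "col2"])

# Replace the "?" strings with real nulls, then drop any row holding one.
cleaned = df.replace("?", None).dropna()
cleaned.show()
# Only the row ("b", "c") remains.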

Spark - Update target data if primary keys match?

Is it possible to overwrite a record in the target if specific conditions are met, using Spark, without reading the target into a dataframe? I know we can do this if both sets of data are loaded into dataframes, but I would like to know if there is a way to perform this action without loading the target into a dataframe - basically, a way to specify overwrite/update conditions.
I am guessing no, but I figured I would ask before I dive into this project. I know we have the write options of append and overwrite. What I really want is this: if a few specific columns already exist in the data target, then overwrite the record and fill in the other columns with the new data. For example:
File1:
id,name,date,score
1,John,"1-10-17",35
2,James,"1-11-17",43
File2:
id,name,date,score
3,Michael,"1-10-17",23
4,James,"1-11-17",56
5,James,"1-12-17",58
I would like the result to look like this:
id,name,date,score
1,John,"1-10-17",35
3,Michael,"1-10-17",23
4,James,"1-11-17",56
5,James,"1-12-17",58
Basically, the name and date columns act like primary keys in this scenario. I want an update to occur when those two columns match, and a new record to be created otherwise. As you can see, id 4 overwrites id 2, but id 5 is appended because the date column did not match. Thanks ahead, guys!
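For reference, the dataframe-based version of this upsert (the approach the question hopes to avoid, but it pins down the intended semantics) might look like the following sketch, where a left anti join lets File2 win on the (name, date) key:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

file1 = spark.createDataFrame(
    [(1, "John", "1-10-17", 35), (2, "James", "1-11-17", 43)],
    ["id", "name", "date", "score"])
file2 = spark.createDataFrame(
    [(3, "Michael", "1-10-17", 23), (4, "James", "1-11-17", 56),
     (5, "James", "1-12-17", 58)],
    ["id", "name", "date", "score"])

# Keep every row of file2, plus only the rows of file1 whose (name, date)
# key does not appear in file2; the anti join drops the matched rows.
result = file1.join(file2, ["name", "date"], "left_anti").unionByName(file2)
result.orderBy("id").show()
# Rows with ids 1, 3, 4, 5 remain: id 2 was replaced by id 4.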

Getting the value of a DataFrame column in Spark

I am trying to retrieve the value of a DataFrame column and store it in a variable. I tried this:
val name=df.select("name")
val name1=name.collect()
But neither of the above returns the value of the column "name".
Spark version: 2.2.0
Scala version: 2.11.11
There are a couple of things here. If you want to see all the data, collect is the way to go. However, if your data is too large, it will cause the driver to fail.
So the alternative is to check a few items from the dataframe. What I generally do is:
df.limit(10).select("name").as[String].collect()
This will output 10 elements, but the output doesn't look good.
So the second alternative is:
df.select("name").show(10)
This will print the first 10 elements. Sometimes, if a column value is long, it prints "..." instead of the actual value, which is annoying.
Hence there is a third option:
df.select("name").take(10).foreach(println)
This takes 10 elements and prints them.
Now, in none of these cases will you get a fair sample of the data, as the first 10 rows are always picked. So to truly pick randomly from the dataframe you can use:
df.select("name").sample(.2, true).show(10)
or
df.select("name").sample(.2, true).take(10).foreach(println)
You can check the sample function on DataFrame.
The first one will do :)
val name = df.select("name") will return another DataFrame. You can do, for example, name.show() to show the content of the DataFrame. You can also do collect or collectAsMap to materialize the results on the driver, but be aware that the amount of data should not be too big for the driver.
You can also do:
val names = df.select("name").as[String].collect()
This will return an array of the names in this DataFrame.

Transpose data using Talend

I have this kind of data:
I need to transpose this data into something like this using Talend:
Help would be much appreciated.
dbh's suggestion should indeed work, but I did not try it.
However, I have another solution which doesn't require changing the input format and is not too complicated to implement. Indeed, the job has only 2 transformation components (tDenormalize and tMap).
The job looks like the following:
Explanation:
Your input is read from a CSV file (it could be a database or any other kind of input).
The tDenormalize component denormalizes your value column (column 2), grouping on the id column (column 1) and separating fields with a specific delimiter (";" in my case), which results in 2 rows.
tMap: split the aggregated column into multiple columns by using Java's String.split() method and spreading the resulting array across multiple columns. The tMap should look like this:
Since Talend doesn't accept storing Array objects, make sure to store the split String as an Object. Then cast that Object into an Array on the right side of the map.
That approach should give you the expected result; a sketch of the two transformation steps follows below.
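To make the two steps concrete outside the Talend GUI, here is a small conceptual sketch in Python (the (id, value) input shape is assumed from the screenshots; in the real job these operations are configured in tDenormalize and the tMap expression editor):

# Input assumed from the screenshots: (id, value) pairs.
rows = [(1, "a"), (1, "b"), (1, "c"), (2, "x"), (2, "y"), (2, "z")]

# Step 1 (tDenormalize): concatenate all values sharing an id,
# separated by the chosen delimiter (";").
groups = {}
for id_, value in rows:
    groups.setdefault(id_, []).append(value)
denormalized = {id_: ";".join(vals) for id_, vals in groups.items()}
# {1: 'a;b;c', 2: 'x;y;z'}

# Step 2 (tMap): split the aggregated string and spread the resulting
# array across one output column per element.
transposed = {id_: s.split(";") for id_, s in denormalized.items()}
# {1: ['a', 'b', 'c'], 2: ['x', 'y', 'z']}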
IMPORTANT:
tDenormalize might shuffle the rows, meaning that for bigger inputs you might encounter unsorted output. Make sure to sort it if needed, or use tDenormalizeSortedRow instead.
tDenormalize is similar to an aggregation component, meaning it scans the whole input before processing, which can lead to performance issues with particularly big inputs (tens of millions of records).
Your input is probably wrong (you have 5 entries with id 1 and 6 entries with id 2). Six columns are expected, meaning you should always have 6 lines per id. If not, then you should implement dbh's solution, and you probably HAVE TO add a key column.
You can use Talend's tPivotToColumnsDelimited component to achieve this. You will most likely need an additional column in your data to represent the field name, like "Identifier, field name, value".
Then you can use this component to pivot the data and write a file as output. If you need to process the data further, read the resulting file with tFileInputDelimited.
See the docs and an example at:
https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide521EN/13.43+tPivotToColumnsDelimited