How to drop null values when loading a JSON file into a dataframe - pyspark

I'm trying to load a JSON file into a dataframe, and I need to include only those values in a column which are not null.
There is only one column in a file.
I tried
.option("column", isNotNull)
.filter("column" is notNul)
.dropna("column")
but nothing works.
I also can't find any documentation on removing nulls when reading a JSON file.
Could anyone help me to resolve this, please? Thanks!
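In case it helps: as far as I know there is no read option that drops nulls while loading the JSON, so the usual approach is to read first and then filter or drop. A minimal sketch, where the file path and the column name ("column") are placeholders for your own:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Read the JSON file as-is, then keep only rows where the column is not null.
df = spark.read.json("path/to/file.json")
non_null = df.filter(col("column").isNotNull())

# Equivalent: drop rows that have a null in that column.
non_null = df.na.drop(subset=["column"])

non_null.show()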

Related

Azure data factory Dataset

I have a DelimitedText ADF dataset. It is pipe-delimited. When I use it as the source in a Copy Data activity in a pipeline and write the file data to a SQL database table, blank values are loaded as NULL.
How can I avoid this? I want blank values to be read as blanks and written to the database table as blanks.
I tried setting the NULL value property to blank and "treatEmptyAsNull": false in the dataset JSON; neither worked.
Any suggestions?
I tested setting the NULL value property to '' and it works; I compared the results to confirm. I also tested the expression concat(''), which works as well.
Hope this helps.

How to import CSV into PostgreSQL

Respected,
I have problems importing a CSV into PostgreSQL via pgAdmin. No matter what I do, it shows the following error:
ERROR: extra data after last expected column.
Can anyone please help me and point me to a possible solution?
Thank you.
Milorad K.
Check that your data is formatted as PostgreSQL expects it to be.
That error could be caused by specifying the wrong quote character or the wrong field separator, or your input file could be corrupt.
I've had corrupt CSV files from banks before, so don't trust anyone.
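One way to narrow it down before importing is to scan the file and check the field count on every row; a rough sketch, where the path, delimiter, quote character, and expected column count are placeholders to adjust:

import csv

path = "data.csv"            # placeholder: your CSV file
expected_columns = 5         # placeholder: number of columns in the target table

with open(path, newline="", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter=",", quotechar='"')
    for line_no, row in enumerate(reader, start=1):
        if len(row) != expected_columns:
            # These are the rows that trigger "extra data after last expected column".
            print(f"line {line_no}: {len(row)} fields -> {row}")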

Talend shuffle the order of the columns

I was trying to merge all the rows of a file into columns based on a certain sequence number. This has been achieved with tpivotToColumnDelimited (this has to be used; it cannot be changed).
But after using it, the column ordering has changed.
Is there any way of reading a file according to one schema and writing it according to another schema in Talend? (Basically, shuffling the column ordering in a file.)
I tried setting tdynamicschema on the input and output but was not able to read and write the data properly.
Any help would be highly appreciated.
I solved the issue.
I simply added a column with the index number read from the file; before using tpivotToColumnDelimited, I used that column dynamically to sort the results and write them to a tmp file. With the help of tpivotToColumnDelimited, the output now follows the input schema.
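Outside Talend, just to illustrate the idea: carry a sequence column, sort on it, then write the columns out in whatever target order you need. A sketch in pandas with made-up column names, separators, and paths:

import pandas as pd

input_schema = ["seq", "name", "age", "city"]    # hypothetical input column order
output_schema = ["seq", "city", "name", "age"]   # hypothetical target column order

# Read with the input schema, keep rows in sequence-number order,
# then write the columns back out in the target order.
df = pd.read_csv("input.txt", sep="|", header=None, names=input_schema)
df = df.sort_values("seq")
df[output_schema].to_csv("output.txt", sep="|", index=False)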

Generate a CSV file with combination of unique field and XML using spark dataframe

I am reading an XML file into a Spark DataFrame using com.databricks.spark.xml and trying to generate a CSV file as output.
My input is like below:
<id>1234</id>
<dtl>
<name>harish</name>
<age>21</age>
<class>II</class>
</dtl>
My output should be a CSV file with the combination of the id and the remaining whole XML tag, like:
id, xml
1234,<dtl><name>harish</name><age>21</age><class>II</class></dtl>
Is there a way to achieve output in the above format?
Your help is very much appreciated.
Create a plain RDD by loading the XML as a text file using sc.textFile(), without parsing it.
Extract the id manually with regex/XPath, and slice the RDD string from the start of your tag to the end of your tag.
Once that's done you will have your data in pairs like (id, "xml").
I hope this tactical solution helps you.
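Along those lines, a rough PySpark sketch of the regex/slicing idea; it assumes one record per file, placeholder paths, and regexes that are illustrative rather than a robust XML parser:

import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Load each XML file as one raw string so the <dtl> block stays unparsed.
raw = sc.wholeTextFiles("path/to/xml/*.xml").values()

def to_pair(xml_text):
    id_match = re.search(r"<id>(.*?)</id>", xml_text, re.DOTALL)
    dtl_match = re.search(r"<dtl>.*?</dtl>", xml_text, re.DOTALL)
    record_id = id_match.group(1).strip() if id_match else ""
    # Collapse newlines so the <dtl> block comes out as a single CSV field.
    dtl_xml = re.sub(r"\s*\n\s*", "", dtl_match.group(0)) if dtl_match else ""
    return (record_id, dtl_xml)

pairs = raw.map(to_pair)
spark.createDataFrame(pairs, ["id", "xml"]).write.csv("path/to/output", header=True)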

Varbinary Data copying issue

I have a table with a varbinary data column. I want to copy the data to another table, but when I try I get errors, even though I have set the receiving column to varbinary.
I am obviously missing something here; can anyone help?
Regards