Fill null columns dynamically in a DataFrame using PySpark

I have a situation where my DataFrame has 3 columns, and there is a possibility that column3 contains nulls. The DataFrame has 2 million records in total.
I need to fill these null values with a value from a MySQL database (basically by calling a function which returns a value). I could loop over each row, but that would be very time consuming given the amount of data.
How can I achieve this? I know how to fill the nulls with a static value, but here the value is completely dynamic.
Thanks for the help.
Regards,
Robin

If I understand your question correctly, you want to put some unique value in a column wherever there was a null before. One possible method is the following code, which checks for null values in the value column. If it finds a null, it uses monotonically_increasing_id to replace it; otherwise the original value is kept.
from pyspark.sql.functions import col, when, monotonically_increasing_id

test_df = spark.createDataFrame([
    ('a', '2018'),
    ('b', None),
    ('c', None)
], ("col_name", "value"))

test_df.withColumn("col3", when(col("value").isNull(), monotonically_increasing_id()).otherwise(col("value"))).show(truncate=False)
Result:
+--------+-----+------------+
|col_name|value|col3 |
+--------+-----+------------+
|a |2018 |2018 |
|b |null |403726925824|
|c |null |609885356032|
+--------+-----+------------+
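If the replacement values really do have to come from MySQL, a join is usually much cheaper than calling a function once per row. Below is only a rough sketch, assuming a hypothetical lookup table null_replacements(col_name, replacement) reachable over JDBC; the connection details and names are placeholders, not from the question.
# Hedged sketch: read the lookup table once over JDBC and join instead of looping.
from pyspark.sql.functions import coalesce, col

lookup_df = (spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/mydb")   # placeholder connection details
    .option("dbtable", "null_replacements")            # hypothetical table: (col_name, replacement)
    .option("user", "user")
    .option("password", "password")
    .load())

filled_df = (test_df
    .join(lookup_df, on="col_name", how="left")
    .withColumn("value", coalesce(col("value"), col("replacement")))
    .drop("replacement"))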
PS: For future requests, it would be good if you could include a sample from your data set and your desired output. This often helps to understand the problem.

For the above test case I would update the value column for only the two rows with the command below:
test_df.withColumn("value", when(col("value").isNull(), monotonically_increasing_id()).otherwise(col("value"))).show(truncate=False)
Thanks for all the comments and help.

How to get the numeric value of missing values in a PySpark column?

I am working with the OpenFoodFacts dataset using PySpark. There are quite a lot of columns which are made up entirely of missing values, and I want to drop those columns. I have been looking up ways to retrieve the number of missing values in each column, but the results are displayed as a table instead of giving me the numeric value of the total null count.
The following code shows the number of missing values in a column but displays it in a table format:
from pyspark.sql.functions import col, isnan, when, count
data.select([count(when(isnan("column") | col("column").isNull(), "column"))]).show()
I have tried the following codes:
This one does not work as intended, as it doesn't drop any columns (expected, since it compares a DataFrame with a number rather than comparing counts):
for c in data.columns:
    if data.select([count(when(isnan(c) | col(c).isNull(), c))]) == data.count():
        data = data.drop(c)
data.show()
This one I am currently trying, but it takes ages to execute:
for c in data.columns:
    if data.filter(data[c].isNull()).count() == data.count():
        data = data.drop(c)
data.show()
Is there a way to get ONLY the number? Thanks
If you need the number instead of a table display, you need to use .collect():
list_of_values = data.select([count(when(isnan("column") | col("column").isNull(), "column"))]).collect()
What you get is a list of Row objects, which contain all the information from the table.
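To get the plain integer out of that result you can index into the first Row. As a further sketch that is not part of the original answer, you can also count the nulls for every column in a single aggregation and then drop the all-null columns without scanning the data once per column (isNull() alone is used here; add isnan() for float columns if needed):
# The count is the first (and only) field of the first Row.
null_count = list_of_values[0][0]

# Hedged sketch: one pass over the data to count nulls in every column,
# then drop the columns whose null count equals the total row count.
from pyspark.sql.functions import col, count, when

total_rows = data.count()
null_counts = data.select([
    count(when(col(c).isNull(), c)).alias(c) for c in data.columns
]).collect()[0].asDict()

all_null_cols = [c for c, n in null_counts.items() if n == total_rows]
data = data.drop(*all_null_cols)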

Transpose a group of repeating columns in large horizontal dataframe into a new vertical dataframe using Scala or PySpark in databricks

Although this question may seem to have been answered before, it has not. All the transposing answers I can find relate to a single column and pivoting the data in that column. I want to make a vertical table from a horizontal set of columns. Take this example:
|MyPrimaryKey             |Insurer_Factor_1_Name|Insurer_Factor_1_Code|Insurer_Factor_1_Value|Insurer_Factor_2_Name|Insurer_Factor_2_Code|Insurer_Factor_2_Value|Insurer_Factor_[n]_Name|Insurer_Factor_[n]_Code|Insurer_Factor_[n]_Value|
|XX-ABCDEF-1234-ABCDEF123 |Special              |SP1                  |2500                  |Awesome              |AW2                  |3500                  |ecetera                |etc                    |999999                  |
[n] being any number of iterations
transforming it into a new vertical representation DataFrame:
|MyPrimaryKey             |Insurer_Factor_ID|Insurer_Factor_Name|Insurer_Factor_Code|Insurer_Factor_Value|
|XX-ABCDEF-1234-ABCDEF123 |1                |Special            |SP1                |2500                |
|XX-ABCDEF-1234-ABCDEF123 |2                |Awesome            |AW2                |3500                |
|XX-ABCDEF-1234-ABCDEF123 |[n]              |ecetera            |etc                |999999              |
There is also the possibility that the "Code" column may be missing, in which case we only receive the name and value and a null needs to be added to the code column.
I've searched high and low for this, but there just doesn't seem to be anything out there.
Also, there could be many rows in the first example...
The reason you haven't found it is that there is no magic trick for moving an 'interestingly' designed table into a well designed table. You are going to have to hand-code a query that either unions the rows into your table, or selects arrays that you then explode.
Sure, you could probably write some code to generate the SQL that you want, but really there isn't a built-in feature that will magically translate this column format into a row-based format.
In order of preference:
Revisit your decision to send multiple files:
It sounds like it would save a lot of work if you just sent multiple files.
Change the column schema:
Put a delimiter (after every 4th column) into the column schema so that the rows become visible; we can then read the file in as rows using that delimiter.
Write your own custom data source:
You can use the existing text data source as an example of how to write your own, one that could interpret every 3 columns as a row.
Write a custom UDF that takes all columns as a parameter and returns an array of rows, which you then call explode on to turn into rows. This will be slow, so I give it to you as the final option.
*** WARNING ***
This is going to use a lot of memory. With 6000 rows it will be slow and may run out of memory. If it works, great, but I suggest you code your own data source, as that is likely a better/faster strategy.
If you want to do this with a UDF and you are only doing it for a couple of rows, you can do this:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
/* spark.sql("select * from info").show();
+----+-------+----+
|type|db_type|info|
+----+-------+----+
| bot| x_bot| x|
| bot| x_bnt| x|
| per| xper| b|
+----+-------+----+ */
val schema = ArrayType(new StructType().add("name","string").add("info","string"))
val myUDF = udf((s: Row) => {
  Seq( Row(s.get(0).toString, s.get(1).toString), Row(s.get(2).toString, s.get(2).toString) )
}, schema)
val records = spark.sql("select * from info");
val arrayRecords = records.select( myUDF(struct(records.columns.map(records(_)) : _*)).alias("Arrays") )
arrayRecords.select( explode(arrayRecords("Arrays")).alias("myCol") )
  .select( col("myCol.*") ).show()
+----+-----+
|name| info|
+----+-----+
| bot|x_bot|
| x| x|
| bot|x_bnt|
| x| x|
| per| xper|
| b| b|
+----+-----+
Pseudo-code:
Create a schema for the rows.
Create a udf (with that schema); here I only show a small manipulation, but you can obviously use more complicated logic in your case.
Select the data.
Apply the udf.
Explode the array.
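For the OP's actual Insurer_Factor_* layout, a rough PySpark equivalent of the "union the rows" option could look like the sketch below. It assumes the wide DataFrame is called wide_df and that the factor groups are numbered 1..n; none of these names come from the answer above.
# Hedged sketch: unpivot the repeating Insurer_Factor_<i>_* groups with a union of selects.
from functools import reduce
from pyspark.sql import functions as F

n = 3  # placeholder: however many factor groups the file actually contains

parts = [
    wide_df.select(
        F.col("MyPrimaryKey"),
        F.lit(i).alias("Insurer_Factor_ID"),
        F.col("Insurer_Factor_{}_Name".format(i)).alias("Insurer_Factor_Name"),
        # if the Code column is missing for a group, substitute F.lit(None).cast("string") here
        F.col("Insurer_Factor_{}_Code".format(i)).alias("Insurer_Factor_Code"),
        F.col("Insurer_Factor_{}_Value".format(i)).alias("Insurer_Factor_Value"),
    )
    for i in range(1, n + 1)
]

tall_df = reduce(lambda a, b: a.unionByName(b), parts)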

Splitting a Tmap output into several tables based on the value of a column

I have a tMap output as shown below:
|src_table|src_columname
--------------------------
|Account |ID
|Account |Name
|Account |Owner
|Contact |ID
|Contact |Name
|Contact |FirstName
|Contact |LastName
I want the output in two tables, the first Account and the second Contact:
Account
-----------------
ID |Name |Owner |
Contact
-------------------------------
ID |Name |FirstName |LastName |
I am a beginner in Talend. Please tell me which component I need to use to get the above output.
Actually, I am not an expert user and I haven't found a solution. The scenario is:
I'm trying to migrate some 10 tables from a SQL Server DB to an Oracle DB and I wish to use Talend, but I don't know how to do it. First I tried the method below: I created many sub-jobs, mapping table by table in one job. Because each table has a different structure, I created a different sub-job with the corresponding schema, for example:
tOracleInput_1 --main-- tMSSQLOutput_1 (migrate table1)
|
onsubjobok
|
tOracleInput_2 --main-- tMSSQLOutput_2 (migrate table2)
|
onsubjobok
|
...other subjobs for other tables...
But I do not want to create many sub-jobs. Is there a way to create one sub-job for all tables?
You'll have to use multiple outputs in tMap, and use the filter option in it.
Add a second output to your tMap.
Then activate a filter on both outputs (filter button in the yellow title bar).
In output #1, put a filter on src_table, like "Account".equals(row2.src_table)
In output #2, put a filter on src_table, like "Contact".equals(row2.src_table)
Then you will have only Accounts in your first output and only Contacts in your second output.

SSRS Column Grouping Based on Row and Column

I've looked everywhere without luck and did some tinkering which didn't amount to much. I have a table that displays the following result set:
| Name | Value|
| Pat | 1.6 |
| Pat | 1.4 |
I have to group them together by row based on the first column (which is not a problem), but I'm trying to make the report put the two numeric values in one cell in the Tablix.
This is what I need to do:
And this is what I have achieved
I achieved the third one by grouping it by the first column of my result set as a Row group.
Any nudge to the right direction will be very much appreciated!
Add a new tablix and add the Name field to the Row Groups pane, then delete the details group.
In the Value cell, use the expression below:
=join(LookupSet(
Fields!Name.Value,
Fields!Name.Value,
Fields!Value.Value,
"DataSetName"
),Environment.NewLine)
Replace DataSetName with the actual name of your dataset. You will get:
UPDATE: Expression to surround the second value with parentheses.
=join(LookupSet(
Fields!Name.Value,
Fields!Name.Value,
Fields!Value.Value,
"DataSet4"
),Environment.NewLine & "(") & ")"
Let me know if this helps.

Tableau Crosstab Row Percentage of Whole

I am new to Tableau and I have created a crosstab that shows a count of items per type. I want to add a column to my table showing the percentage of the whole. I looked up a couple of things but I can't seem to find this exact problem.
Type |Count |% of Whole
-----------------------------------
A |10 |1%
B |99 |9.9%
C |256 |25.6%
D |300 |30%
E |335 |33.5%
After reading a bit, I think my issue is that I am not sure how to derive a calculation that gives a TOTAL # of Types. In Excel I would take the row value divided by the sum of all rows. Additionally, I am fairly certain this will lead to an issue once I filter this table; I'm not sure how to preserve the percentages with filters.
I am using Tableau 9.2. Thanks in advance for any help.
You can create the following calculated field:
SUM([Count])/TOTAL(SUM([Count]))
TOTAL takes into consideration all values of your variable.
Alternatively, you can use a quick table calculation by right-clicking on Count (here I'm using an example from the Superstore dataset).