spark bad records: bad records show reason for only one column - pyspark

I am trying to filter out bad records from a CSV file using pyspark. The code snippet is given below:
from pyspark.sql import SparkSession
schema="employee_id int,name string,address string,dept_id int"
spark = SparkSession.builder.appName("TestApp").getOrCreate()
data = (spark.read.format("csv")
        .option("header", True)
        .schema(schema)
        .option("badRecordsPath", "/tmp/bad_records")
        .load("/path/to/csv/file"))
schema_for_bad_record="path string,record string,reason string"
bad_records_frame=spark.read.schema(schema_for_bad_record).json("/tmp/bad_records")
bad_records_frame.select("reason").show()
The valid dataframe is
+-----------+-------+-------+-------+
|employee_id| name|address|dept_id|
+-----------+-------+-------+-------+
| 1001| Floyd| Delhi| 1|
| 1002| Watson| Mumbai| 2|
| 1004|Thomson|Chennai| 3|
| 1005| Bonila| Pune| 4|
+-----------+-------+-------+-------+
In one of the records, both employee_id and dept_id have incorrect values, but the reason shows only one column's issue:
java.lang.NumberFormatException: For input string: "abc"
Is there any way to show reasons for multiple columns in case of failure?
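One possible workaround (a sketch, not a built-in feature -- badRecordsPath appears to capture only the first exception hit while parsing a record): read every column as a string and validate the casts yourself, collecting all failing columns per row. Column names follow the schema above; the reason messages are made up for illustration.
from pyspark.sql import functions as F

# Read everything as strings so no record is rejected up front
raw = spark.read.format("csv").option("header", True).load("/path/to/csv/file")

# One check per typed column; each yields a reason string or null
checks = [
    F.when(F.col("employee_id").isNotNull() & F.col("employee_id").cast("int").isNull(),
           F.lit("employee_id is not an int")),
    F.when(F.col("dept_id").isNotNull() & F.col("dept_id").cast("int").isNull(),
           F.lit("dept_id is not an int")),
]

# Collect all non-null reasons per row, then split good and bad records
flagged = (raw.withColumn("reasons", F.array(*checks))
              .withColumn("reasons", F.expr("filter(reasons, r -> r IS NOT NULL)")))
bad_records = flagged.filter(F.size("reasons") > 0)
good_records = flagged.filter(F.size("reasons") == 0).drop("reasons")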

Related

Ambiguous Column in DataFrame Join - Unable to Alias or Call

I am getting into Databricks from a SQL background, working with some dataframe samples for basic join transformations, and I am having issues isolating the correct dataframe.column for other transformations after the join.
For DF1, I have 3 columns: user_id, user_ts, email. For DF2, I have two columns: email, converted.
Below is how I have the logic for the join. This works and returns 5 columns; however, there are two email columns in the schema:
df3 = (df1
.join(df2, df1.email == df2.email, "outer")
)
I am trying to do a basic transformation on the df2 email column as part of the dataframe chain, but I receive the error:
"Cannot resolve column name "df2.email" among (user_id, user_ts, email, email, converted)"
df3 = (df1
.join(df2, df1.email == df2.email, "outer")
.na.fill(False,["df2.email"])
)
If I remove the df2 from the fill(), I get the error that the columns are ambiguous.
How can I specify which column I want to transform when it has the same name as another column? In SQL, I would just use a table alias to qualify the column, but this doesn't seem to be how pyspark is best used.
Suggestions?
If you want to avoid duplicate key columns in the join result and get a single combined column, you can pass a list of key columns as an argument to the join() method.
If you want to retain the key columns from both dataframes, you have to rename one of them before doing the transformation; otherwise Spark will throw an ambiguous column error.
df1 = spark.createDataFrame([(1, 'abc#gmail.com'),(2,'def#gmail.com')],["id1", "email"])
df2 = spark.createDataFrame([(1, 'abc#gmail.com'),(2,'ghi#gmail.com')],["id2", "email"])
df1.join(df2,['email'], 'outer').show()
'''
+-------------+----+----+
| email| id1| id2|
+-------------+----+----+
|def#gmail.com| 2|null|
|ghi#gmail.com|null| 2|
|abc#gmail.com| 1| 1|
+-------------+----+----+'''
df1.join(df2,df1['email'] == df2['email'], 'outer').show()
'''
+----+-------------+----+-------------+
| id1| email| id2| email|
+----+-------------+----+-------------+
| 2|def#gmail.com|null| null|
|null| null| 2|ghi#gmail.com|
| 1|abc#gmail.com| 1|abc#gmail.com|
+----+-------------+----+-------------+'''
df1.join(df2, df1['email'] == df2['email'], 'outer') \
    .select('id1', 'id2', df1['email'], df2['email'].alias('email2')) \
    .na.fill('False', ['email2']).show()
'''
+----+----+-------------+-------------+
| id1| id2| email| email2|
+----+----+-------------+-------------+
| 2|null|def#gmail.com| False|
|null| 2| null|ghi#gmail.com|
| 1| 1|abc#gmail.com|abc#gmail.com|
+----+----+-------------+-------------+ '''
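Another option is to alias each dataframe and qualify columns through the alias, which mirrors the table-alias style from SQL; a sketch reusing the df1/df2 defined above:
from pyspark.sql import functions as F

# Refer to columns through the aliases "a" and "b" instead of the original variables
joined = (df1.alias("a")
          .join(df2.alias("b"), F.col("a.email") == F.col("b.email"), "outer")
          .select("a.id1", "b.id2", F.col("a.email"), F.col("b.email").alias("email2")))
joined.na.fill("False", ["email2"]).show()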

How to select a column based on the value of another in Pyspark?

I have a dataframe, where some column special_column contains values like one, two. My dataframe also has columns one_processed and two_processed.
I would like to add a new column my_new_column whose values are taken from other columns of my dataframe, based on the processed values from special_column. For example, if special_column == one I would like my_new_column to be set to one_processed.
I tried .withColumn("my_new_column", F.col(F.concat(F.col("special_column"), F.lit("_processed")))), but Spark complains that I cannot parametrize F.col with a column.
How could I get the string value of the concatenation, so that I can select the desired column?
from pyspark.sql.functions import when, col, lit, concat_ws

sdf.withColumn("my_new_column", when(col("special_column") == "one", col("one_processed"))
    .otherwise(concat_ws("_", col("special_column"), lit("processed"))))
The easiest way in your case would be just a simple when/otherwise, like:
>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame([(1, 2, "one"), (1,2,"two")], ["one_processed", "two_processed", "special_column"])
>>> df.withColumn("my_new_column", F.when(F.col("special_column") == "one", F.col("one_processed")).otherwise(F.col("two_processed"))).show()
+-------------+-------------+--------------+-------------+
|one_processed|two_processed|special_column|my_new_column|
+-------------+-------------+--------------+-------------+
| 1| 2| one| 1|
| 1| 2| two| 2|
+-------------+-------------+--------------+-------------+
As far as I know there is no way to select a column whose name is computed from the data, as the execution plan would then depend on the data.
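If there are many *_processed columns, the same when/otherwise chain can be built dynamically from the column names rather than written out by hand; a sketch reusing the df created above:
from functools import reduce
from pyspark.sql import functions as F

# Build one when(...) branch per "<x>_processed" column
processed = [c for c in df.columns if c.endswith("_processed")]
suffix = len("_processed")
expr = reduce(
    lambda acc, c: acc.when(F.col("special_column") == c[:-suffix], F.col(c)),
    processed[1:],
    F.when(F.col("special_column") == processed[0][:-suffix], F.col(processed[0])),
)
df.withColumn("my_new_column", expr).show()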

How to convert numerical values to a categorical variable using pyspark

I have a pyspark dataframe with a range of numerical variables.
For example, my dataframe has a column with values from 1 to 100.
1-10 - group1 <== rows with a value from 1 to 10 should get group1
11-20 - group2
.
.
.
91-100 group10
How can I achieve this using a pyspark dataframe?
# Creating an arbitrary DataFrame
df = spark.createDataFrame([(1,54),(2,7),(3,72),(4,99)], ['ID','Var'])
df.show()
+---+---+
| ID|Var|
+---+---+
| 1| 54|
| 2| 7|
| 3| 72|
| 4| 99|
+---+---+
Once the DataFrame has been created, we use the floor() function to find the integral part of a number. For example, floor(15.5) is 15. We find the integral part of Var/10 and add 1 to it, because the group indexing starts from 1, as opposed to 0. Finally, we need to prepend group to the value. Concatenation can be achieved with the concat() function, but keep in mind that since the prepended word group is not a column, we need to wrap it in lit(), which creates a column of a literal value.
# Requisite packages needed
from pyspark.sql.functions import col, floor, lit, concat
df = df.withColumn('Var',concat(lit('group'),(1+floor(col('Var')/10))))
df.show()
+---+-------+
| ID| Var|
+---+-------+
| 1| group6|
| 2| group1|
| 3| group8|
| 4|group10|
+---+-------+
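One thing to watch with floor(Var/10) + 1: exact multiples of 10 (10, 20, ..., 100) land in the next group (100 becomes group11), which differs from the ranges in the question. If those boundaries matter, ceil() gives the 1-10 -> group1, ..., 91-100 -> group10 mapping; a small variation, assuming Var still holds the original numeric values:
from pyspark.sql.functions import ceil, col, concat, lit

# ceil(Var / 10) puts 1-10 in group1, 11-20 in group2, ..., 91-100 in group10
df = df.withColumn('Var', concat(lit('group'), ceil(col('Var') / 10)))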

Spark SQL Dataframe API - build filter condition dynamically

I have two Spark dataframes, df1 and df2:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| ramesh| 1212| 29|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
+------+-----+---+-----+
| eName| eNo|age| city|
+------+-----+---+-----+
|aarush|12121| 15|malmo|
|ramesh| 1212| 29|malmo|
+------+-----+---+-----+
I need to get the non-matching records from df1, based on a number of columns which are specified in another file.
For example, the column look up file is something like below:
df1col,df2col
name,eName
empNo, eNo
Expected output is:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
The question is how to build a where condition dynamically for the above scenario, because the lookup file is configurable and might have 1 to n fields.
You can use the except dataframe method. I'm assuming that the columns to use are in two lists for simplicity. It's necessary that the order of both lists is correct; the columns at the same position in the lists will be compared (regardless of column name). After except, use join to get the remaining columns from the first dataframe.
import org.apache.spark.sql.functions.broadcast
import spark.implicits._

val df1 = Seq(("shankar","12121",28),("ramesh","1212",29),("suresh","1111",30),("aarush","0707",15))
  .toDF("name", "empNo", "age")
val df2 = Seq(("aarush", "12121", 15, "malmo"),("ramesh", "1212", 29, "malmo"))
  .toDF("eName", "eNo", "age", "city")

val df1Cols = List("name", "empNo")
val df2Cols = List("eName", "eNo")

val tempDf = df1.select(df1Cols.head, df1Cols.tail: _*)
  .except(df2.select(df2Cols.head, df2Cols.tail: _*))

val df = df1.join(broadcast(tempDf), df1Cols)
The resulting dataframe will look as wanted:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
| aarush| 0707| 15|
| suresh| 1111| 30|
|shankar|12121| 28|
+-------+-----+---+
If you're doing this from a SQL query I would remap the column names in the SQL query itself with something like Changing a SQL column title via query. You could do a simple text replace in the query to normalize them to the df1 or df2 column names.
Once you have that you can diff using something like
How to obtain the difference between two DataFrames?
If you need more columns that wouldn't be used in the diff (e.g. age) you can reselect the data again based on your diff results. This may not be the optimal way of doing it but it would probably work.
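For completeness, a PySpark sketch of the same except-then-rejoin idea; the two column lists are assumed to come from parsing the lookup file, and subtract() is PySpark's equivalent of Scala's except:
from pyspark.sql.functions import broadcast

# Column pairs would normally be parsed from the configurable lookup file
df1_cols = ["name", "empNo"]
df2_cols = ["eName", "eNo"]

# Rows of df1 (on the chosen columns) with no match in df2, then rejoin to
# recover the remaining df1 columns such as age
temp = df1.select(*df1_cols).subtract(df2.select(*df2_cols))
result = df1.join(broadcast(temp), on=df1_cols, how="inner")
result.show()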

How to randomly select rows from one dataframe using information from another dataframe

The following I am attempting in Scala-Spark.
I'm hoping someone can give me some guidance on how to tackle this problem or provide me with some resources to figure out what I can do.
I have a dateCountDF with a count corresponding to a date. I would like to randomly select a certain number of entries for each dateCountDF.month from another Dataframe entitiesDF where dateCountDF.FirstDate < entitiesDF.Date && entitiesDF.Date <= dateCountDF.LastDate, and then place all the results into a new Dataframe. See below for a data example.
I'm not at all sure how to approach this problem from a Spark-SQL or Spark-MapReduce perspective. The furthest I got was the naive approach, where I use a foreach on a dataframe and then refer to the other dataframe within the function. But this doesn't work because of the distributed nature of Spark.
val randomEntites = dateCountDF.foreach(x => {
val count:Int = x(1).toString().toInt
val result = entitiesDF.take(count)
return result
})
DataFrames
**dateCountDF**
+----------+-----+
|      Date|Count|
+----------+-----+
|2016-08-31|    4|
|2015-12-31|    1|
|2016-09-30|    5|
|2016-04-30|    5|
|2015-11-30|    3|
|2016-05-31|    7|
|2016-11-30|    2|
|2016-07-31|    5|
|2016-12-31|    9|
|2014-06-30|    4|
+----------+-----+
only showing top 10 rows
**entitiesDF**
+----+----------+----------+
|  ID| FirstDate|  LastDate|
+----+----------+----------+
| 296|2014-09-01|2015-07-31|
| 125|2015-10-01|2016-12-31|
| 124|2014-08-01|2015-03-31|
| 447|2017-02-01|2017-01-01|
| 307|2015-01-01|2015-04-30|
| 574|2016-01-01|2017-01-31|
| 613|2016-04-01|2017-02-01|
| 169|2009-08-23|2016-11-30|
| 205|2017-02-01|2017-02-01|
| 433|2015-03-01|2015-10-31|
+----+----------+----------+
only showing top 10 rows
Edit:
For clarification.
My inputs are entitiesDF and dateCountDF. I want to loop through dateCountDF and for each row I want to select a random number of entities in entitiesDF where dateCountDF.FirstDate<entitiesDF.Date && entitiesDF.Date <= dateCountDF.LastDate
To select random rows you can do something like this (note: the snippet below is PySpark, even though the question is in Scala):
import random

def sampler(df, col, records):
    # Calculate number of rows
    colmax = df.count()
    # Create random sample from range
    vals = random.sample(range(1, colmax), records)
    # Use 'vals' to filter DataFrame using 'isin'
    return df.filter(df[col].isin(vals))
Select the random number of rows you want and store them in a dataframe, then add this data to the other dataframe; for that you can use unionAll.
You can also refer to this answer.
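A rough sketch of that collect, filter, sample and union loop in PySpark; the column names (Date, Count, FirstDate, LastDate) are taken from the tables above, and the date condition is an assumption that may need adjusting to the real schema:
from functools import reduce
from pyspark.sql import functions as F

pieces = []
for row in dateCountDF.collect():  # driver-side loop over the per-date counts
    candidates = entitiesDF.filter(
        (entitiesDF.FirstDate < row["Date"]) & (entitiesDF.LastDate >= row["Date"])
    )
    # orderBy(rand()) + limit picks exactly Count random rows (fine for modest sizes)
    pieces.append(candidates.orderBy(F.rand()).limit(row["Count"]))

randomEntities = reduce(lambda a, b: a.unionAll(b), pieces)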