How to rename a column in Spark dataframe while using explode function - pyspark

I am trying to use the explode array function in PySpark, and below is the code -
df_map_transformation.select(col("_name") , explode(arrays_zip(col("instances.Instance._name"), col("instances.Instance._id") ))).select(col("_name"), col("col.*")).printSchema()
Output -
root
|-- _name: string (nullable = true)
|-- 0: string (nullable = true)
|-- 1: string (nullable = true)
When I try to select the "_name" column, I am able to do so like this -
df_map_transformation.select(col("_name") , explode(arrays_zip(col("instances.Instance._name"), col("instances.Instance._id") ))).select(col("_name"), col("col.*")).select(col("_name")).show(50,False)
But the same does not work when trying to access the "0" or "1" column -
Error -
File "/usr/local/spark/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o1614.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: _gen_alias_696#696
Is there any way to rename columns "0" and "1" or extract them via select on the dataframe?

Use explode('col_name').alias('new_name')
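As a hedged sketch of that approach (not from the original answer): aliasing the explode gives the exploded struct column a proper name instead of the default col, while the inner fields keep the names shown in the question's schema ("0" and "1" here).
from pyspark.sql.functions import col, explode, arrays_zip
# Sketch only: rename the exploded struct column itself; "instance" is an arbitrary name.
df_exploded = df_map_transformation.select(
    col("_name"),
    explode(arrays_zip(
        col("instances.Instance._name"),
        col("instances.Instance._id")
    )).alias("instance")
)
df_exploded.printSchema()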

Try casting the col column to struct<cola:string,colb:string>. You can choose your own column names inside the struct; for example, I have used cola & colb.
Check the code below.
df_map_transformation.select(col("_name") , explode(arrays_zip(col("instances.Instance._name"), col("instances.Instance._id") ))).select(col("_name"), col("col").cast("struct<cola:string,colb:string>")).select(col("_name"),col("col.cola"),col("col.colb")).printSchema()
root
|-- _name: string (nullable = true)
|-- cola: string (nullable = true)
|-- colb: string (nullable = true)
You can also use withColumnRenamed:
(df_map_transformation
 .select(col("_name"),
         explode(arrays_zip(col("instances.Instance._name"),
                            col("instances.Instance._id"))))
 .select(col("_name"), col("col.*"))
 .withColumnRenamed("0", "cola")
 .withColumnRenamed("1", "colb"))

df = (df_map_transformation
      .select(
          col("_name"),
          explode(arrays_zip(
              col("instances.Instance._name"),
              col("instances.Instance._id")
          )).alias('name', 'id')  # <-- alias can take multiple names
      ))

Related

Deequ - How to put validation on a subset of dataset?

I have a use case where I want to put certain validations on a subset of data that satisfies a specific condition.
For example, I have a dataframe which has 4 columns: colA, colB, colC, colD.
df.printSchema
root
|-- colA: string (nullable = true)
|-- colB: integer (nullable = false)
|-- colC: string (nullable = true)
|-- colD: string (nullable = true)
I want to put a validation that, wherever colA == "x" and colB > 20, the combination of colC and colD should be unique (basically, hasUniqueness(Seq("colC", "colD"), Check.IsOne)).
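For reference, the intended rule can also be expressed in plain PySpark (outside of Deequ) by filtering first and then checking uniqueness of the pair; a minimal sketch, assuming the df and column names from the question:
from pyspark.sql import functions as F
# Restrict to the rows the rule applies to (colA == "x" and colB > 20) ...
subset = df.filter((F.col("colA") == "x") & (F.col("colB") > 20))
# ... then verify that the (colC, colD) combination is unique within that subset.
total_rows = subset.count()
distinct_pairs = subset.select("colC", "colD").distinct().count()
assert total_rows == distinct_pairs, "(colC, colD) is not unique on the filtered subset"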

In pyspark 2.4, how to handle columns with the same name resulting from a self join?

Using pyspark 2.4, I am doing a left join of a dataframe on itself.
df = df.alias("t1") \
    .join(df.alias("t2"),
          col(t1_anc_ref) == col(t2_anc_ref), "left")
The resulting structure of this join is the following:
root
|-- anc_ref_1: string (nullable = true)
|-- anc_ref_2: string (nullable = true)
|-- anc_ref_1: string (nullable = true)
|-- anc_ref_2: string (nullable = true)
I would like to be able to drop the penultimate column of this dataframe (anc_ref_1).
Using the column name is not possible, as there are duplicates. So instead of this, I select the column by index and then try to drop it:
col_to_drop = len(df.columns) - 2
df= df.drop(df[col_to_drop])
However, that gives me the following error:
pyspark.sql.utils.AnalysisException: "Reference 'anc_ref_1' is
ambiguous, could be: t1.anc_ref_1, t2.anc_ref_1.;"
Question:
When I print the schema, there is no mention of t1 and t2 in the column names. Yet they are mentioned in the stack trace. Why is that, and can I use them to reference a column?
I tried df.drop("t2.anc_ref_1") but it had no effect (no column dropped).
EDIT: Works well with df.drop(col("t2.anc_ref_1"))
How can I handle the duplicate column names? I would like to rename/drop them so that the result is:
root
|-- anc_ref_1: string (nullable = true)
|-- anc_ref_2: string (nullable = true)
|-- anc_ref_1: string (nullable = true) -> dropped
|-- anc_ref_2: string (nullable = true) -> renamed to anc_ref_3
Option 1
Drop the column by referring to the original source dataframe.
Data
df = spark.createDataFrame([('Value1', 'Something'),
                            ('Value2', '1057873 1057887'),
                            ('Value3', 'Something Something'),
                            ('Value4', None),
                            ('Value5', '13139'),
                            ('Value6', '1463451 1463485'),
                            ('Value7', 'Not In Database'),
                            ('Value8', '1617275 16288')],
                           ('anc_ref_1', 'anc_ref'))
df.show()
Code
df_as1 = df.alias("df_as1")
df_as2 = df.alias("df_as2")
df1 = df_as1.join(df_as2, df_as1.anc_ref == df_as2.anc_ref, "left").drop(df_as1.anc_ref_1)#.drop(df_as2.anc_ref)
df1.show()
Option 2
Use the join column name (a string, or a sequence of names) to join and then select the join column.
df_as1.join(df_as2, "anc_ref", "left").select('anc_ref',df_as1.anc_ref_1).show()
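To end up with the layout described in the question (the right-hand anc_ref_1 dropped and the remaining right-hand duplicate renamed), here is a hedged sketch building on the df_as1/df_as2 aliases above; anc_ref_3 is just the illustrative target name the asker mentioned.
from pyspark.sql.functions import col
# Keep the left-hand columns as-is and rename the right-hand duplicate so that
# no ambiguous column names remain in the result.
result = (df_as1.join(df_as2, df_as1.anc_ref == df_as2.anc_ref, "left")
          .select(col("df_as1.anc_ref_1"),
                  col("df_as1.anc_ref"),
                  col("df_as2.anc_ref").alias("anc_ref_3")))
result.printSchema()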

How to assign a value to an existing empty dataframe column in Scala?

I am reading a CSV file which has a | delimiter at the end of each line, so in Spark 1.6 the load method creates a last column in the dataframe with no name and no values.
df.withColumnRenamed(df.columns(83),"Invalid_Status").drop(df.col("Invalid_Status"))
val df = sqlContext.read.format("com.databricks.spark.csv").option("delimiter","|").option("header","true").load("filepath")
val df2 = df.withColumnRenamed(df.columns(83),"Invalid_Status").
I expected this result:
root
|-- FddCell: string (nullable = true)
|-- Trn_time: string (nullable = true)
|-- CELLNAME.FddCell: string (nullable = true)
|-- Invalid_Status: string (nullable = true)
but the actual output is
root
|-- FddCell: string (nullable = true)
|-- Trn_time: string (nullable = true)
|-- CELLNAME.FddCell: string (nullable = true)
|-- : string (nullable = true)
with no values in the column, so I have to drop this column and create a new column again.
It is not completely clear whether you want to just rename the column to Invalid_Status or to drop it entirely. What I understand is that you are trying to operate (rename/drop) on the last column, which has no name.
But I will try to help you with both solutions -
To rename the column, keeping its (blank) values as they are:
val df2 = df.withColumnRenamed(df.columns.last,"Invalid_Status")
To drop just the last column without knowing its name, use:
val df3 = df.drop(df.columns.last)
And then add the "Invalid_Status" column with default values:
val requiredDf = df3.withColumn("Invalid_Status", lit("Any_Default_Value"))

How to compare 2 JSON schemas using pyspark?

I have 2 JSON schemas as below -
df1.printSchema()
# root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)
df2.printSchema()
# root
# |-- name: array (nullable = true)
# |-- gender: integer (nullable = true)
# |-- age: long (nullable = true)
How can I compare these 2 schemas and highlight the differences using pyspark, as I am using pyspark-sql to load data from the JSON file into a DF?
While it is not clear what you mean by "compare", the following code will give you the fields (StructField objects) which are in df2 and not in df1.
set(df2.schema.fields) - set(df1.schema.fields)
set() will take your list and remove the duplicates.
I find the following one-line code useful and neat. It also gives you the two-directional differences at a column level:
set(df1.schema).symmetric_difference(set(df2.schema))
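To see which side each difference comes from, the same idea can be split per direction; a small sketch using the df1 and df2 from the question:
# Fields that exist only in df1 and fields that exist only in df2.
only_in_df1 = set(df1.schema) - set(df2.schema)
only_in_df2 = set(df2.schema) - set(df1.schema)
print("Only in df1:", [f"{f.name}: {f.dataType.simpleString()}" for f in only_in_df1])
print("Only in df2:", [f"{f.name}: {f.dataType.simpleString()}" for f in only_in_df2])
Note that a field present on both sides with a different type (such as name here, string vs. array) will show up in both sets.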

SparkSQL Timestamp query failure

I put some log files into SQL tables through Spark, and my schema looks like this:
|-- timestamp: timestamp (nullable = true)
|-- c_ip: string (nullable = true)
|-- cs_username: string (nullable = true)
|-- s_ip: string (nullable = true)
|-- s_port: string (nullable = true)
|-- cs_method: string (nullable = true)
|-- cs_uri_stem: string (nullable = true)
|-- cs_query: string (nullable = true)
|-- sc_status: integer (nullable = false)
|-- sc_bytes: integer (nullable = false)
|-- cs_bytes: integer (nullable = false)
|-- time_taken: integer (nullable = false)
|-- User_Agent: string (nullable = true)
|-- Referrer: string (nullable = true)
As you can see, I created a timestamp field, which I read is supported by Spark (Date wouldn't work as far as I understood). I would love to use it for queries like "where timestamp>(2012-10-08 16:10:36.0)", but when I run them I keep getting errors.
I tried the 2 following syntax forms:
For the second one I parse a string, so I'm sure I'm actually passing it in a timestamp format.
I use 2 functions: parse and date2timestamp.
Any hint on how I should handle timestamp values?
Thanks!
1)
scala> sqlContext.sql("SELECT * FROM Logs as l where l.timestamp=(2012-10-08 16:10:36.0)").collect
java.lang.RuntimeException: [1.55] failure: ``)'' expected but 16 found
SELECT * FROM Logs as l where l.timestamp=(2012-10-08 16:10:36.0)
^
2)
sqlContext.sql("SELECT * FROM Logs as l where l.timestamp="+date2timestamp(formatTime3.parse("2012-10-08 16:10:36.0"))).collect
java.lang.RuntimeException: [1.54] failure: ``UNION'' expected but 16 found
SELECT * FROM Logs as l where l.timestamp=2012-10-08 16:10:36.0
^
I figured out that the problem was, first of all, the precision of the timestamp, and also that the comparison against the string I pass to represent the timestamp has to be done with the column cast as a String.
So this query works now:
sqlContext.sql("SELECT * FROM Logs as l where cast(l.timestampLog as String) <= '2012-10-08 16:10:36'")
You forgot the quotation marks.
Try something with this syntax:
L.timestamp = '2012-07-16 00:00:00'
Alternatively, try
L.timestamp = CAST('2012-07-16 00:00:00' AS TIMESTAMP)
Cast the string representation of the timestamp to a timestamp: cast('2012-10-10 12:00:00' as timestamp). Then you can do the comparison as timestamps, not strings. Instead of:
sqlContext.sql("SELECT * FROM Logs as l where cast(l.timestamp as String) <= '2012-10-08 16:10:36'")
try
sqlContext.sql("SELECT * FROM Logs as l where l.timestamp <= cast('2012-10-08 16:10:36' as timestamp)")
Sadly this didn't work for me. I am using Apache Spark 1.4.1. The following code is my solution:
Date date = new Date();
String query = "SELECT * FROM Logs as l where l.timestampLog <= CAST('" + new java.sql.Timestamp(date.getTime()) + "' as TIMESTAMP)";
sqlContext.sql(query);
Casting the timestampLog as string did not throw any errors but returned no data.