What is the PySpark SQL equivalent of pyspark.pandas.DataFrame.to_string?

Pandas API function: https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.to_string.html
I also found another answer, though it doesn't work for me: pyspark : Convert DataFrame to RDD[string]
Following that post's advice, I tried
data.rdd.map(lambda row: [str(c) for c in row])
Then I get this error
TypeError: 'PipelinedRDD' object is not iterable
I would like it to output rows of strings, similar to to_string() above. Is this possible?

Would pyspark.sql.DataFrame.show satisfy your expectations for the console output? You can sort the DataFrame via pyspark.sql.DataFrame.sort before printing if required.

Related

Parsing nested XML in Databricks

I am trying to read the XML into a DataFrame and flatten the data using explode, as below.
val df = spark.read.format("xml").option("rowTag","on").option("inferschema","true").load("filepath")
val parsxml = df
  .withColumn("exploded_element", explode(("prgSvc.element")))
I am getting the below error.
command-5246708674960:4: error: type mismatch;
found : String("prgSvc.element")
required: org.apache.spark.sql.Column
.withColumn("exploded_element", explode(("prgSvc.element")))**
Before reading the XML into the data frame, I also tried manually assigning a custom schema and reading the XML file, but the output is all NULL. Could you please let me know whether my approach is valid, and how to resolve this issue and achieve the desired output?
Thank you.
Use this
import spark.implicits._
val parsxml = df.withColumn("exploded_element", explode($"prgSvc.element"))
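If you would rather not import spark.implicits._, a minimal sketch of the same call using col() from org.apache.spark.sql.functions (assuming the same prgSvc.element column path as in the question):
import org.apache.spark.sql.functions.{col, explode}
// col("prgSvc.element") builds a Column from the path, which is what explode expects
val parsxml = df.withColumn("exploded_element", explode(col("prgSvc.element")))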

regexp_extract in a Scala data frame is giving an error

I am trying to convert the Hive SQL statement below into a Spark DataFrame expression and am getting an error.
trim(regexp_extract(message_comment_txt, '(^.*paid\\s?\\$?)(.*?)(\\s?toward.*)', 2))
Sample data: message_comment_txt = "DAY READER, paid 12.76 toward the cost"
I need to get the output as 12.76
Please help me with the equivalent Spark DataFrame statement.
Try the paid\\s+(.*?)\\s+toward regex:
df.withColumn("extract",regexp_extract(col("message_comment_txt"),"paid\\s+(.*?)\\s+toward",1)).show(false)
//for case insensitive matching
df.withColumn("extract",regexp_extract(col("message_comment_txt"),"(?i)paid\\s+(.*?)\\s+toward",1)).show(false)
//+--------------------------------------+-------+
//|message_comment_txt |extract|
//+--------------------------------------+-------+
//|DAY READER, paid 12.76 toward the cost|12.76 |
//+--------------------------------------+-------+

Adding a column in a Spark DataFrame

Hi, I am trying to add a column to my Spark DataFrame, calculating its value based on an existing column. I am writing the code below.
val df1 = spark.sql("select id, dt1, salary from dbdt1.tabledt1")
val df2=df1.withColumn("new_date",WHEN (month(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM- yyyy')))
IN (01,02,03)) THEN
CONCAT(CONCAT(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM- yyyy')))-1,'-'),
substr(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM-yyyy'))),3,4))
.otherwise(CONCAT(CONCAT(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM- yyyy'))),'-')
,SUBSTR(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM-yyyy')))+1,3,4))))
But it always shows the error: unclosed character literal. Can someone please guide me on how I should add this new column or modify the existing code?
There is incorrect syntax in many places. First, I suggest you look at a few Spark SQL examples online, and also at the org.apache.spark.sql.functions API documentation, because your uses of WHEN, CONCAT, and IN are all incorrect.
Scala strings are enclosed in double quotes; you appear to be using SQL string syntax.
'dd-MM-yyyy' should be "dd-MM-yyyy"
To reference a column dt1 on DataFrame df1 you can use one of the following:
df1("dt1")
col("dt1") // if you import org.apache.spark.sql.functions.col
$"dt1" // if you import spark.implicits._ locally
For example:
from_unixtime(unix_timestamp(col("dt1")), "dd-MM-yyyy")
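Putting those pieces together, a minimal sketch of the whole expression using when/otherwise and concat from org.apache.spark.sql.functions might look like this (the dt1 column name, the dd-MM-yyyy input format, and the fiscal-year-style label are taken from your snippet; treat it as a starting point rather than tested code):
import org.apache.spark.sql.functions._

// Parse dt1 once and reuse the derived year/month columns
val parsed = to_date(from_unixtime(unix_timestamp(col("dt1")), "dd-MM-yyyy"))
val yr = year(parsed)

val df2 = df1.withColumn("new_date",
  when(month(parsed).isin(1, 2, 3),
    // Jan-Mar: label as previousYear-YY (last two digits of the current year)
    concat((yr - 1).cast("string"), lit("-"), substring(yr.cast("string"), 3, 2)))
  .otherwise(
    // Other months: label as year-YY (last two digits of the next year)
    concat(yr.cast("string"), lit("-"), substring((yr + 1).cast("string"), 3, 2))))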

pyspark.sql.functions abs() fails with PySpark Column input

I'm trying to convert the following HiveQL query into PySpark:
SELECT *
FROM ex_db.ex_tbl
WHERE dt >= 20180901 AND
dt < 20181001 AND
(ABS(HOUR(FROM_UNIXTIME(local_timestamp))-13)>6 OR
(DATEDIFF(FROM_UNIXTIME(local_timestamp), '2018-12-31') % 7 IN (0,6))
I am not great at PySpark, but I have looked through the list of functions. I have gotten to the point where I am attempting the ABS() function, but am struggling to do so in PySpark. Here is what I have tried:
import pyspark.sql.functions as F
df1.withColumn("abslat", F.abs("lat"))
An error occurred while calling z:org.apache.spark.sql.functions.abs
It doesn't work. I read that the input must be a PySpark Column. I checked and that condition is met.
type(df1.lat)
<class 'pyspark.sql.column.Column'>
Can someone please point me in the right direction?
You're passing a string to abs, which would be valid in Scala with the $ operator, which turns a string into a Column.
In PySpark you need to pass a Column to abs(), like this: F.abs(df.column_name)
For your case, try this one:
df1.withColumn("abslat", F.abs(df1.lat))

flatten a spark data frame's column values and put it into a variable

Spark version 1.60, Scala version 2.10.5.
I have a spark-sql dataframe df like this,
+---------------+-------------------------------+
|address        |attributes                     |
+---------------+-------------------------------+
|1314 44 Avenue |Tours, Mechanics, Shopping     |
|115 25th Ave   |Restaurant, Mechanics, Brewery |
+---------------+-------------------------------+
From this dataframe, I would like to get the distinct values, as below:
Tours, Mechanics, Shopping, Brewery
If I do this,
df.select(df("attributes")).collect().foreach(println)
I get,
[Tours, Mechanics, Shopping]
[Restaurant, Mechanics, Brewery]
I thought I could use flatMap instead, and found this, so I tried to put this into a variable using:
val allValues = df.withColumn(df("attributes"), explode("attributes"))
but I am getting an error:
error: type mismatch;
found   : org.apache.spark.sql.Column
required: String
I was thinking that if I can get an output using explode, I can use distinct to get the unique values after flattening them.
How can I get the desired output?
I strongly recommend you use a Spark 2.x version. In Cloudera, when you issue "spark-shell", it launches the 1.6.x version; however, if you issue "spark2-shell", you get the 2.x shell. Check with your admin.
But if you need a Spark 1.6 / RDD solution, try this:
import spark.implicits._
import scala.collection.mutable._
val df = Seq(("1314 44 Avenue",Array("Tours", "Mechanics", "Shopping")),
("115 25th Ave",Array("Restaurant", "Mechanics", "Brewery"))).toDF("address","attributes")
df.rdd.flatMap( x => x.getAs[mutable.WrappedArray[String]]("attributes") ).distinct().collect.foreach(println)
Results:
Brewery
Shopping
Mechanics
Restaurant
Tours
If the "attribute" column is not an array, but comma separated string, then use the below one which gives you same results
val df = Seq(("1314 44 Avenue","Tours,Mechanics,Shopping"),
("115 25th Ave","Restaurant,Mechanics,Brewery")).toDF("address","attributes")
df.rdd.flatMap( x => x.getAs[String]("attributes").split(",") ).distinct().collect.foreach(println)
The problem is that withColumn expects a String as its first argument (which is the name of the added column), but you're passing it a Column here: df.withColumn(df("attributes"), ...).
You only need to pass "attributes" as a String.
Additionally, you need to pass a Column to the explode function, but you're passing a String - to make it a Column you can use df("columnName") or the Scala shorthand $ syntax, $"columnName".
Hope this example can help you.
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for the $"..." column syntax
val allValues = df.select(explode($"attributes").as("attributes")).distinct
Note that this will only preserve the attributes column, since you want the distinct elements of that one.
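If the goal is still to put the values into a plain Scala variable, as the question title suggests, a small sketch of collecting them to the driver could look like this (the allValuesLocal name is just illustrative):
// Collect the distinct attribute values from the DataFrame above into a local Array[String]
val allValuesLocal: Array[String] = allValues.collect().map(_.getString(0))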