Spark DataFrameWriter omits trailing tab delimiters when saving (Spark 1.6) - scala

I am leaving my question below as it was originally posted for the sake of future developers who run into this problem. The issue was resolved once I moved to Spark 2.0, i.e. the output was as I expected without making any changes to my original code. It looks like an implementation difference in the 1.6 version I used at first.
I have Spark 1.6 Scala code that reads a TSV (CSV with tab delimiter) and writes it to TSV output (without changing the input - just filtering the input).
The input data sometimes has null values in the last columns of a row.
When I use the delimiter "," the output has trailing commas.
E.g.
val1, val2, val3,val4,val5
val1, val2, val3,,
but if I use tab (\t) as a delimiter, the output does not include the trailing tabs. E.g. (I am writing TAB here where \t appears):
val1 TAB val2 TAB val3 TAB val4 TAB val5
val1 TAB val2 TAB val3 <= **here I expected two more tabs (as with the comma delimiter)**
I also tried other delimiters and saw that when the delimiter is a whitespace character (e.g. the ' ' character) the trailing delimiters are not in the output.
If I use another visible delimiter (e.g. the letter 'z') it works fine, as with the comma separator, and I do get trailing delimiters.
I thought this might have to do with the options ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace, but setting them both to false when writing didn't help either.
My code looks like this:
val df = sqlContext.read.format("com.databricks.spark.csv").option("delimiter", "\t").load(params.inputPathS3)
val df_filtered = df.filter(...)
df_filtered.write.format("com.databricks.spark.csv").option("delimiter", "\t").save(outputPath)
I also tried (as I wrote above):
df_filtered.write.format("com.databricks.spark.csv").option("delimiter", "\t").option("ignoreLeadingWhiteSpace", "false").option("ignoreTrailingWhiteSpace", "false").save(outputPath)
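As noted at the top, moving to Spark 2.0 made the output come out as expected without code changes. For reference, here is a minimal sketch of the equivalent read/filter/write using the built-in csv source of Spark 2.0+; the paths and the filter condition are placeholders standing in for the originals:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("tsv-filter").getOrCreate()
import spark.implicits._

// Built-in csv source in Spark 2.0+; no external spark-csv package needed
val df = spark.read
  .option("delimiter", "\t")
  .csv("s3://bucket/input")                    // placeholder for params.inputPathS3

val dfFiltered = df.filter($"_c0".isNotNull)   // placeholder for the real filter(...)

dfFiltered.write
  .option("delimiter", "\t")
  .csv("s3://bucket/output")                   // placeholder for outputPath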

Below is a working example (with Spark 1.6):
Input file (with some trailing spaces at the end):
1,2,3,,
scala> val df = sqlContext.read.option("ignoreLeadingWhiteSpace", "false").option("ignoreTrailingWhiteSpace", "false").format("com.databricks.spark.csv").option("delimiter", ",").load("path")
df: org.apache.spark.sql.DataFrame = [C0: string, C1: string, C2: string, C3: string, C4: string]
scala> df.show
+---+---+---+---+---+
| C0| C1| C2| C3| C4|
+---+---+---+---+---+
| 1| 2| 3| | |
+---+---+---+---+---+
scala> df.write.option("nullValue", "null").option("quoteMode", "ALL").mode("overwrite").format("com.databricks.spark.csv").option("delimiter", "\t").save("path")
scala> sqlContext.read.format("com.databricks.spark.csv").option("delimiter", "\t").load("path").show
+---+---+---+---+---+
| C0| C1| C2| C3| C4|
+---+---+---+---+---+
| 1| 2| 3| | |
+---+---+---+---+---+
Please refer to the Databricks spark-csv documentation for all the options available when reading and writing with that library.
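If you want to double-check whether the trailing delimiters really are missing from the files on disk (rather than being dropped when read back as CSV), you can read the output directory back as plain text; a small sketch, assuming Spark 1.6 and the same output path:
// Read the written files back as raw lines and make any tabs visible
sqlContext.read.text("path").collect()
  .foreach(row => println(row.getString(0).replace("\t", "<TAB>")))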

Related

Column Renaming in pyspark dataframe

I have column names with special characters. I renamed the columns, and when I try to save, the save fails saying the columns have special characters. I ran printSchema on the dataframe and I see the column names without any special characters. Here is the code I tried:
for c in df_source.columns:
    df_source = df_source.withColumnRenamed(c, c.replace("(", ""))
    df_source = df_source.withColumnRenamed(c, c.replace(")", ""))
    df_source = df_source.withColumnRenamed(c, c.replace(".", ""))
df_source.coalesce(1).write.format("parquet").mode("overwrite").option("header","true").save(stg_location)
and I get the following error:
Caused by: org.apache.spark.sql.AnalysisException: Attribute name "Number_of_data_samples_(input)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.
One more thing I noticed: when I do df_source.show() or display(df_source), both show the same error, while printSchema shows that there are no special characters.
Can someone help me find a solution for this?
Try using it as below.
Input df:
from pyspark.sql.types import *
from pyspark.sql.functions import *
data = [("xyz", 1)]
schema = StructType([StructField("Number_of_data_samples_(input)", StringType(), True), StructField("id", IntegerType())])
df = spark.createDataFrame(data=data, schema=schema)
df.show()
+------------------------------+---+
|Number_of_data_samples_(input)| id|
+------------------------------+---+
| xyz| 1|
+------------------------------+---+
Method 1
Using regular expressions to replace the special characters and then use toDF()
import re
cols = [re.sub(r"\.|\)|\(", "", i) for i in df.columns]
df.toDF(*cols).show()
+----------------------------+---+
|Number_of_data_samples_input| id|
+----------------------------+---+
| xyz| 1|
+----------------------------+---+
Method 2
Using .withColumnRenamed()
for i, j in zip(df.columns, cols):
    df = df.withColumnRenamed(i, j)
df.show()
+----------------------------+---+
|Number_of_data_samples_input| id|
+----------------------------+---+
| xyz| 1|
+----------------------------+---+
Method 3
Using .withColumn to create a new column and drop the existing column
df = df.withColumn("Number_of_data_samples_input", lit(col("Number_of_data_samples_(input)"))).drop(col("Number_of_data_samples_(input)"))
df.show()
+---+----------------------------+
| id|Number_of_data_samples_input|
+---+----------------------------+
| 1| xyz|
+---+----------------------------+
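For completeness, a similar rename can be sketched in Scala as well; this is a standalone example that just mirrors the data above:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("rename-example").getOrCreate()
import spark.implicits._

val df = Seq(("xyz", 1)).toDF("Number_of_data_samples_(input)", "id")

// Strip the characters Parquet rejects (parentheses and dots) from every column name
val cleaned = df.columns.map(_.replaceAll("[().]", ""))
df.toDF(cleaned: _*).show()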

Using rlike with list to create new df scala

I just started with Scala 2 days ago.
Here's the thing: I have a df and a list. The df contains two columns, paragraphs and authors, and the list contains words (strings). I need to get the count of all the paragraphs where each word in the list appears, by author.
So far my idea was to loop over the list, query the df using rlike and create a new df, but even if this does work, I wouldn't know how to do it. Any help is appreciated!
Edit: Adding example data and expected output
// Example df and list
val df = Seq(("auth1", "some text word1"), ("auth2","some text word2"),("auth3", "more text word1").toDF("a","t")
df.show
+-------+---------------+
| a| t|
+-------+---------------+
|auth1 |some text word1|
|auth2 |some text word2|
|auth1 |more text word1|
+-------+---------------+
val list = List("word1", "word2")
// Expected output
newDF.show
+-------+-----+----------+
| word| a|text count|
+-------+-----+----------+
|word1 |auth1| 2|
|word2 |auth2| 1|
+-------+-----+----------+
You can do a filter and aggregation for each word in the list, and combine all the resulting dataframes using unionAll:
import org.apache.spark.sql.functions.{count, lit}

val result = list.map(word =>
  df.filter(df("t").rlike(s"\\b${word}\\b"))
    .groupBy("a")
    .agg(lit(word).as("word"), count(lit(1)).as("text count"))
).reduce(_ unionAll _)
result.show
+-----+-----+----------+
| a| word|text count|
+-----+-----+----------+
|auth3|word1| 1|
|auth1|word1| 1|
|auth2|word2| 1|
+-----+-----+----------+
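As a side note, unionAll is deprecated from Spark 2.0 onwards in favour of union; a sketch of the same approach for newer versions, with the columns reordered to match the expected output above:
import org.apache.spark.sql.functions.{count, lit}

// Spark 2.x variant: union instead of the deprecated unionAll,
// then reorder the columns to match the expected output
val result2 = list.map { word =>
    df.filter(df("t").rlike(s"\\b${word}\\b"))
      .groupBy("a")
      .agg(lit(word).as("word"), count(lit(1)).as("text count"))
  }
  .reduce(_ union _)
  .select("word", "a", "text count")
result2.show()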

substring from lastIndexOf in spark scala

I have a column in my dataframe which contains the filename
test_1_1_1_202012010101101
I want to get the string after the last underscore (lastIndexOf("_")).
I tried this and it works:
val timestamp_df =file_name_df.withColumn("timestamp",split(col("filename"),"_").getItem(4))
But I want to make it more generic, so that if in the future the filename has any number of _ in it, it can still split on the last _. I also tried:
val timestamp_df =file_name_df.withColumn("timestamp", expr("substring(filename, length(filename)-15,17)"))
This is also not generic, as the character length can vary.
Can anyone help me use lastIndexOf with withColumn?
You can use the element_at function (available since Spark 2.4) with split to get the last element of the array.
Example:
df.withColumn("timestamp",element_at(split(col("filename"),"_"),-1)).show(false)
+--------------------------+---------------+
|filename |timestamp |
+--------------------------+---------------+
|test_1_1_1_202012010101101|202012010101101|
+--------------------------+---------------+
You can use substring_index
scala> val df = Seq(("a-b-c", 1),("d-ef-foi",2)).toDF("c1","c2")
df: org.apache.spark.sql.DataFrame = [c1: string, c2: int]
scala> df.show
+--------+---+
| c1| c2|
+--------+---+
| a-b-c| 1|
|d-ef-foi| 2|
+--------+---+
scala> df.withColumn("c3", substring_index(col("c1"), "-", -1)).show
+--------+---+---+
| c1| c2| c3|
+--------+---+---+
| a-b-c| 1| c|
|d-ef-foi| 2|foi|
+--------+---+---+
Per docs: When the last argument "is negative, everything to the right of the final delimiter (counting from the right) is returned"
val timestamp_df =file_name_df.withColumn("timestamp",reverse(split(reverse(col("filename")),"_").getItem(0)))
This works as well.
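Since the question explicitly asks about lastIndexOf, a UDF-based sketch is also possible (the built-in functions above are generally preferable; file_name_df is the question's dataframe):
import org.apache.spark.sql.functions.{col, udf}

// UDF mirroring String.lastIndexOf: returns the part after the last '_',
// the whole string if there is no '_', and null for null input
val afterLastUnderscore = udf((s: String) =>
  Option(s).map(x => x.substring(x.lastIndexOf("_") + 1)).orNull
)
file_name_df.withColumn("timestamp", afterLastUnderscore(col("filename"))).show(false)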

Pyspark removing multiple characters in a dataframe column

Looking at PySpark, I see translate and regexp_replace to help me replace a single character that exists in a dataframe column.
I was wondering if there is a way to supply multiple strings to regexp_replace or translate so that it would parse them and replace them with something else.
Use case: remove all $, #, and commas (,) in a column A.
You can use pyspark.sql.functions.translate() to make multiple replacements. Pass in a string of letters to replace and another string of equal length which represents the replacement values.
For example, let's say you had the following DataFrame:
import pyspark.sql.functions as f
df = sqlCtx.createDataFrame([("$100,00",),("#foobar",),("foo, bar, #, and $",)], ["A"])
df.show()
#+------------------+
#| A|
#+------------------+
#| $100,00|
#| #foobar|
#|foo, bar, #, and $|
#+------------------+
and wanted to replace ('$', '#', ',') with ('X', 'Y', 'Z'). Simply use translate like:
df.select("A", f.translate(f.col("A"), "$#,", "XYZ").alias("replaced")).show()
#+------------------+------------------+
#| A| replaced|
#+------------------+------------------+
#| $100,00| X100Z00|
#| #foobar| Yfoobar|
#|foo, bar, #, and $|fooZ barZ YZ and X|
#+------------------+------------------+
If instead you wanted to remove all instances of ('$', '#', ','), you could do this with pyspark.sql.functions.regexp_replace().
df.select("A", f.regexp_replace(f.col("A"), "[\$#,]", "").alias("replaced")).show()
#+------------------+-------------+
#| A| replaced|
#+------------------+-------------+
#| $100,00| 10000|
#| #foobar| foobar|
#|foo, bar, #, and $|foo bar and |
#+------------------+-------------+
The pattern "[\$#,]" means match any of the characters inside the brackets. The $ has to be escaped because it has a special meaning in regex.
If someone needs to do this in Scala, you can do it as in the code below:
val df = Seq(("Test$",19),("$#,",23),("Y#a",20),("ZZZ,,",21)).toDF("Name","age")
import org.apache.spark.sql.functions._
val df1 = df.withColumn("NewName",translate($"Name","$#,","xyz"))
display(df1)
You can see the output below: each '$' becomes 'x', '#' becomes 'y' and ',' becomes 'z', so the NewName column holds Testx, xyz, Yya and ZZZzz.
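The regexp_replace variant can also be sketched in Scala, reusing the same df as the Scala example above:
import org.apache.spark.sql.functions.regexp_replace

// Remove every '$', '#' and ',' character (regex character class)
val df2 = df.withColumn("NewName", regexp_replace($"Name", "[\\$#,]", ""))
df2.show()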

Using rlike in org.apache.spark.sql.Column

I am trying to implement a query in my Scala code which uses a regexp on a Spark Column to find all the rows in the column which contain a certain value like:
column.rlike(".*" + str + ".*")
str is a String that can be anything (except null or empty).
This works fine for the basic queries that I am testing. However, being new to Spark / Scala, I am unsure whether there are any special cases that could break the code here that I need to take care of. Are there any characters that I need to escape or special cases I need to worry about?
This can be broken by any invalid regexp. You don't even have to try hard:
Seq("[", "foo", " ba.r ").toDF.filter($"value".rlike(".*" + "[ " + ".*")).show
or it can give unexpected results if str is a non-trivial pattern itself. For simple cases like this you'll be better off with Column.contains:
Seq("[", "foo", " ba.r ").toDF.filter($"value".contains("[")).show
Seq("[", "foo", " ba.r ").toDF.filter($"value".contains("a.r")).show
You can use rlike as zero suggested and Pattern.quote to handle the special regex characters. Suppose you have this DF:
val df = Seq(
("hel?.o"),
("bbhel?.o"),
("hel?.oZZ"),
("bye")
).toDF("weird_string")
df.show()
+------------+
|weird_string|
+------------+
| hel?.o|
| bbhel?.o|
| hel?.oZZ|
| bye|
+------------+
Here's how to find all the strings that contain "hel?.o".
import java.util.regex.Pattern
df
.withColumn("has_hello", $"weird_string".rlike(Pattern.quote("hel?.o")))
.show()
+------------+---------+
|weird_string|has_hello|
+------------+---------+
| hel?.o| true|
| bbhel?.o| true|
| hel?.oZZ| true|
| bye| false|
+------------+---------+
You could also add the quote characters manually to get the same result:
df
.withColumn("has_hello", $"weird_string".rlike("""\Qhel?.o\E"""))
.show()
If you don't properly escape the regex, you won't get the right result:
df
.withColumn("has_hello", $"weird_string".rlike("hel?.o"))
.show()
+------------+---------+
|weird_string|has_hello|
+------------+---------+
| hel?.o| false|
| bbhel?.o| false|
| hel?.oZZ| false|
| bye| false|
+------------+---------+
See this post for more details.
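Coming back to the pattern in the original question, quoting the user-supplied value would look roughly like this (reusing the df with the weird_string column from above; str stands in for the arbitrary, non-null input):
import java.util.regex.Pattern

val str = "hel?.o"  // stands in for the user-supplied value
// Pattern.quote makes any regex metacharacters in str match literally;
// the surrounding ".*" wildcards are kept from the original query
df.filter($"weird_string".rlike(".*" + Pattern.quote(str) + ".*")).show()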