How to save a file with a multi-character delimiter in Spark - PySpark

I need to save a file delimited by the "|~" characters, but I get an error when I execute the command below. Can I save a file using a multi-character delimiter in Spark?
mydf1.coalesce(1).write.option("compression","none").format("csv").mode("Overwrite").option("delimiter","|~").save("my_hdfs_path")
# Error: pyspark.sql.utils.IllegalArgumentException: u'Delimiter cannot be more than one character: |~'

AFAIK, we are still waiting for an "official" solution, because the issue "support for multiple delimiter in Spark CSV read" is still open and SparkSQL still relies on univocity-parsers. In univocity's CSV settings, the delimiter can only be a single character, which constrains both the parser (reader) and the generator (writer).
Workarounds
Finding a universally fast and safe way to write as CSV is hard. But depending on your data size and the complexity of the CSV contents (date format? currency? quoting?), we may find a shortcut. The following are just some, hopefully inspiring, thoughts...
Write to CSV with a special character (say ⊢), then substitute it with |~.
(not benchmarked, but IMO this is likely to be the fastest)
df.coalesce(1).write.option("compression","none").option("delimiter", "⊢").mode("overwrite").csv("raw-output")
then post-process (ideally locally) with, say, sed:
sed -i '.bak' 's/⊢/\|~/g' raw-output/*.csv
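(Note: the -i '.bak' form above is BSD/macOS sed syntax; GNU sed expects the suffix attached to the flag, e.g. sed -i.bak 's/⊢/|~/g' raw-output/*.csv.)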
Within PySpark, concatenate each row into a single string, then write it as a text file.
(can be flexible enough to deal with locality and special needs -- with a bit more work)
d = [{'name': 'Alice', 'age': 1},{'name':'Bob', 'age':3}]
df = spark.createDataFrame(d, "name:string, age:int")
df.show()
#+-----+---+
#| name|age|
#+-----+---+
#|Alice| 1|
#| Bob| 3|
#+-----+---+
from pyspark.sql.functions import udf

@udf("string")
def mkstr(name, age):
    """
    Build one delimited row; for example, the string field {name}
    should be quoted with `"`.
    """
    return '"{name}"|~{age}'.format(name=name, age=age)
# unparse a CSV row back to a string
df_unparsed = df.select(mkstr("name", "age").alias("csv_row"))
df_unparsed.show()
#+----------+
#| csv_row|
#+----------+
#|"Alice"|~1|
#| "Bob"|~3|
#+----------+
df_unparsed.coalesce(1).write.option("compression", "none").mode("overwrite").text("output")
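If you would rather avoid the Python UDF (which adds serialization overhead), a minimal sketch of the same row-building step using only the built-in format_string is shown below; the format pattern is an assumption matching the example schema above.
from pyspark.sql import functions as F
# build the same '"name"|~age' strings with a built-in function instead of a UDF
df_native = df.select(F.format_string('"%s"|~%d', F.col("name"), F.col("age")).alias("csv_row"))
df_native.coalesce(1).write.option("compression", "none").mode("overwrite").text("output")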
numpy.savetxt allows multiple characters as a delimiter, so ...
(numpy has lots of built-ins if you care about the precision of floating-point numbers)
import pandas as pd
import numpy as np
# convert `Spark.DataFrame` to `Pandas.DataFrame`
df_pd = df.toPandas()
# use `numpy.savetxt` to save `Pandas.DataFrame`
np.savetxt("a-long-day.csv", df_pd, delimiter="|~", fmt="%s")
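Note that df.toPandas() collects the entire DataFrame onto the driver, so this route only suits data that fits in driver memory.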

From Spark 3.0 onwards we don't have this issue, but if you are using an earlier version (Spark 2.3 or later) this can also be used as a solution: basically, concatenate all the columns, fill nulls with blanks, and write the data out with the desired delimiter along with the header. This is more generic than hardcoding and it also retains the header.
from pyspark.sql.functions import col, concat_ws
target_row_delimited = "|,"
df=df.select([col(c).cast("string") for c in df.columns])
df=df.na.fill("")
headername=target_row_delimited.join(df.columns)
df=df.withColumn(headername, concat_ws(target_row_delimited, *df.columns))
(df.select(df[headername])
   .write.format("csv")
   .mode(modeval)                 # modeval: your desired write mode, e.g. "overwrite"
   .option("quoteAll", "false")
   .option("quote", "\u0000")
   .option("header", "true")
   .save(tgt_path + "/"))         # tgt_path: your target directory
If you need to read with multiple delimiters, the following solution can be used:
source_delimiter = "|_|"
headcal = spark.read.text(source_filename)
headers = headcal.take(1)[0]['value']
header_column = headers.split(source_delimiter)
df = (sc.textFile(source_filename)
        .filter(lambda line: line != headers)   # drop the header row from the data
        .map(lambda x: x.split(source_delimiter))
        .toDF(header_column))
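If you are on Spark 3.0+, where (per the note above) the reader reportedly supports multi-character delimiters, a hedged sketch is to pass the delimiter straight to the CSV reader and let it handle the header; verify this on your version before relying on it.
df = (spark.read
          .option("sep", source_delimiter)   # multi-character sep is assumed to work on 3.0+
          .option("header", "true")
          .csv(source_filename))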

Related

How to remove all characters that start with "_" from a spark string column

I'm trying to modify a column in my DataFrame by removing the suffix from all the rows under that column, and I need it in Scala.
The values in the column have different lengths and the suffixes also differ.
For example, I have the following values:
09E9894DB868B70EC3B55AFB49975390-0_0_0_0_0
0978C74C69E8D559A62F860EA36ADF5E-28_3_1
0C12FA1DAFA8BCD95E34EE70E0D71D10-0_3_1
0D075AA40CFC244E4B0846FA53681B4D_0_1_0_1
22AEA8C8D403643111B781FE31B047E3-0_1_0_0
I need to remove everything from the first "_" onwards so that I can get the following values:
09E9894DB868B70EC3B55AFB49975390-0
0978C74C69E8D559A62F860EA36ADF5E-28
0C12FA1DAFA8BCD95E34EE70E0D71D10-0
0D075AA40CFC244E4B0846FA53681B4D
22AEA8C8D403643111B781FE31B047E3-0
As @werner pointed out in his comment, substring_index provides a simple solution to this. It is not necessary to wrap this in a call to selectExpr.
Whereas @AminMal has provided a working solution using a UDF, if a native Spark function can be used then this is preferable for performance.[1]
import org.apache.spark.sql.functions.{col, substring_index}
import spark.implicits._

val df = List(
  "09E9894DB868B70EC3B55AFB49975390-0_0_0_0_0",
  "0978C74C69E8D559A62F860EA36ADF5E-28_3_1",
  "0C12FA1DAFA8BCD95E34EE70E0D71D10-0_3_1",
  "0D075AA40CFC244E4B0846FA53681B4D_0_1_0_1",
  "22AEA8C8D403643111B781FE31B047E3-0_1_0_0"
).toDF("col0")
df
.withColumn("col0", substring_index(col("col0"), "_", 1))
.show(false)
gives:
+-----------------------------------+
|col0 |
+-----------------------------------+
|09E9894DB868B70EC3B55AFB49975390-0 |
|0978C74C69E8D559A62F860EA36ADF5E-28|
|0C12FA1DAFA8BCD95E34EE70E0D71D10-0 |
|0D075AA40CFC244E4B0846FA53681B4D |
|22AEA8C8D403643111B781FE31B047E3-0 |
+-----------------------------------+
[1] Is there a performance penalty when composing spark UDFs
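For PySpark users, a minimal equivalent sketch (assuming a DataFrame df with a string column col0) would be:
from pyspark.sql import functions as F
df = df.withColumn("col0", F.substring_index(F.col("col0"), "_", 1))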

cast big number in human readable format

I'm working with Databricks in a notebook.
I have a column with numbers like this: 103503119090884718216391506040
They are in string format. I can print them and read them easily.
For debugging purposes I need to be able to read them. However, I also need to be able to apply the .sort() method to them. Casting them as IntegerType() returns null values, and casting them as double makes them unreadable.
How can I convert them into a human-readable format on which .sort() still works? Do I need to create two separate columns?
To make the column sortable, you could convert it to DecimalType(precision, scale) (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.types.DecimalType.html#pyspark.sql.types.DecimalType). For this data type you can choose the possible value range via its two arguments:
from pyspark.sql import SparkSession, Row, types as T, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    Row(string_column='103503119090884718216391506040'),
    Row(string_column='103503119090884718216391506039'),
    Row(string_column='103503119090884718216391506041'),
    Row(string_column='90'),
])
(
    df
    .withColumn('decimal_column', F.col('string_column').cast(T.DecimalType(30, 0)))
    .sort('decimal_column')
    .show(truncate=False)
)
# Output
+------------------------------+------------------------------+
|string_column |decimal_column |
+------------------------------+------------------------------+
|90 |90 |
|103503119090884718216391506039|103503119090884718216391506039|
|103503119090884718216391506040|103503119090884718216391506040|
|103503119090884718216391506041|103503119090884718216391506041|
+------------------------------+------------------------------+
Concerning "human readability" I'm not sure whether that helps, though.
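If readability means something like thousands separators for eyeballing values while debugging, one hedged sketch is to keep the sortable decimal column and add a display-only string column next to it with the built-in format_number (the column name readable is just an illustration):
(
    df
    .withColumn('decimal_column', F.col('string_column').cast(T.DecimalType(30, 0)))
    .withColumn('readable', F.format_number('decimal_column', 0))
    .sort('decimal_column')
    .show(truncate=False)
)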

Convert dataframe String column to Array[Int]

I am new to Scala and Spark and I am trying to read a csv file locally (for testing):
val spark = org.apache.spark.sql.SparkSession.builder.master("local").appName("Spark CSV Reader").getOrCreate;
val topics_df = spark.read.format("csv").option("header", "true").load("path-to-file.csv")
topics_df.show(10)
Here's what the file looks like:
+-----+--------------------+--------------------+
|topic| termindices| termweights|
+-----+--------------------+--------------------+
| 15|[21,31,51,108,101...|[0.0987100701,0.0...|
| 16|[42,25,121,132,55...|[0.0405490884,0.0...|
| 7|[1,23,38,7,63,0,1...|[0.1793091892,0.0...|
| 8|[13,40,35,104,153...|[0.0737646511,0.0...|
| 9|[2,10,93,9,158,18...|[0.1639456608,0.1...|
| 0|[28,39,71,46,123,...|[0.0867449145,0.0...|
| 1|[11,34,36,110,112...|[0.0729913664,0.0...|
| 17|[6,4,14,82,157,61...|[0.1583892199,0.1...|
| 18|[9,27,74,103,166,...|[0.0633899386,0.0...|
| 19|[15,81,289,218,34...|[0.1348582482,0.0...|
+-----+--------------------+--------------------+
with
ReadSchema: struct<topic:string,termindices:string,termweights:string>
The termindices column is supposed to be of type Array[Int], but when saved to CSV it is a String (this usually would not be a problem if I were pulling from a database).
How do I convert the type and eventually cast the DataFrame to a:
case class TopicDFRow(topic: Int, termIndices: Array[Int], termWeights: Array[Double])
I have the function ready to perform the conversion:
termIndices.substring(1, termIndices.length - 1).split(",").map(_.toInt)
I have looked at udf and a few other solutions but I am convinced that there should be a much cleaner and faster way to perform said conversion. Any help is greatly appreciated!
UDFs should be avoided when it is possible to use the more efficient built-in Spark functions. To my knowledge there is no better way than the one proposed: remove the first and last characters of the string, then split and convert.
Using the built-in functions, this can be done as follows:
import org.apache.spark.sql.functions.{length, lit, split}
import spark.implicits._

df.withColumn("termindices", split($"termindices".substr(lit(2), length($"termindices") - 2), ",").cast("array<int>"))
  .withColumn("termweights", split($"termweights".substr(lit(2), length($"termweights") - 2), ",").cast("array<double>"))
  .as[TopicDFRow]
substr is 1-index based, so to remove the first character we start from 2. The second argument is the length to take (not the end position), hence the -2.
The last command casts the DataFrame to a Dataset of type TopicDFRow.
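For reference, a minimal PySpark sketch of the same approach (assuming the same column names as above) could look like this:
from pyspark.sql import functions as F
parsed = (
    df
    .withColumn("termindices",
                F.split(F.col("termindices").substr(F.lit(2), F.length("termindices") - 2), ",").cast("array<int>"))
    .withColumn("termweights",
                F.split(F.col("termweights").substr(F.lit(2), F.length("termweights") - 2), ",").cast("array<double>"))
)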

Spark Scala read CSV which has a comma in the data

My CSV file, which is inside a zip file, has the below data:
"Potter, Jr",Harry,92.32,09/09/2018
John,Williams,78,01/02/1992
And I read it using the Spark Scala CSV reader. If I use
.option("quote", "\"")
.option("escape", "\"")
I do not get a fixed number of columns as output: line 1 yields 5 columns and line 2 yields 4. The desired output should have only 4 columns. Is there any way to read it as a DF or RDD?
Thanks,
Ash
For the given input data, I was able to read it using:
val input = spark.read.csv("input_file.csv")
This gave me a Dataframe with 4 string columns.
Check this.
val df = spark.read.csv("in/potter.txt").toDF("fname","lname","value","dt")
df.show()
+----------+--------+-----+----------+
| fname| lname|value| dt|
+----------+--------+-----+----------+
|Potter, Jr| Harry|92.32|09/09/2018|
| John|Williams| 78|01/02/1992|
+----------+--------+-----+----------+

How to drop particular records with junk characters (like $, "NA", etc.) from a CSV file using PySpark

I have a CSV file from which I want to remove records that contain particular characters like "$", "NA", "##".
I am not able to figure out any function to drop the records in this scenario.
How can I achieve this?
Hello all,
I tried the code below and it is working fine, but it only handles one particular pattern,
and I want to remove multiple occurrences of garbage
values (#, ##, ###, $, $$, $$$) like these.
e.g. filter_list = ['##', '$']
df = df.filter(df.color.isin(*filter_list) == False)
df.show()
In this example I used the single column "color", but instead of a single column
I want to work with multiple columns (passing an array).
Thanks in advance.
You can accomplish this by using the filter function
(http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.filter)
Here's some example code:
from pyspark.sql import functions as F
# create some test data
df = spark.createDataFrame([(17, "2017-03-10T15:27:18+00:00", "orange"),
                            (13, "2017-03-15T12:27:18+00:00", "$"),
                            (25, "2017-03-18T11:27:18+00:00", "##")],
                           ["dollars", "timestampGMT", "color"])
df.show()
Here's what the data looks like:
+-------+--------------------+------+
|dollars| timestampGMT| color|
+-------+--------------------+------+
| 17|2017-03-10T15:27:...|orange|
| 13|2017-03-15T12:27:...| $|
| 25|2017-03-18T11:27:...| ##|
+-------+--------------------+------+
You can create a filter list and then filter out the records that match (in this case from the color column):
filter_list = ['##', '$']
df = df.filter(df.color.isin(*filter_list) == False)
df.show()
Here's what the data looks like now:
+-------+--------------------+------+
|dollars| timestampGMT| color|
+-------+--------------------+------+
| 17|2017-03-10T15:27:...|orange|
+-------+--------------------+------+
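To extend this to multiple columns, as asked in the follow-up above, a hedged sketch is to build one combined condition over a list of column names (cols_to_check below is a hypothetical name) and keep only the rows where none of those columns contain a junk value:
from functools import reduce
from pyspark.sql import functions as F
filter_list = ['#', '##', '###', '$', '$$', '$$$', 'NA']
cols_to_check = ['color', 'timestampGMT']   # hypothetical: pass whichever columns you need
# keep a row only if none of the checked columns contain a junk value
condition = reduce(lambda a, b: a & b, [~F.col(c).isin(filter_list) for c in cols_to_check])
df_clean = df.filter(condition)
df_clean.show()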