Spark TSV file and incorrect column split - Scala

I have a TSV file with many lines. Most of the lines work fine, but I have an issue with the following line:
tt7841930 tvEpisode "Stop and Hear the Cicadas/Cold-Blooded "Stop and Hear the Cicadas/Cold-Blooded 0 2018 \N 24 Animation,Family
I use Spark and Scala to load the file into a DataFrame:
val titleBasicsDf = spark.read
.format("org.apache.spark.csv")
.option("header", true)
.option("inferSchema", true)
.option("delimiter", " ")
.csv("title.basics.tsv.gz")
As a result I receive:
+---------+---------+-------------------------------------------------------------------------------+-------------+-------+---------+-------+----------------+------+-------------+--------+------------+------------+-------------+
|tconst |titleType|primaryTitle |originalTitle|isAdult|startYear|endYear|runtimeMinutes |genres|averageRating|numVotes|parentTconst|seasonNumber|episodeNumber|
+---------+---------+-------------------------------------------------------------------------------+-------------+-------+---------+-------+----------------+------+-------------+--------+------------+------------+-------------+
|tt7841930|tvEpisode|"Stop and Hear the Cicadas/Cold-Blooded "Stop and Hear the Cicadas/Cold-Blooded|0 |2018 |\N |24 |Animation,Family|null |null |null |tt4947580 |6 |2 |
+---------+---------+-------------------------------------------------------------------------------+-------------+-------+---------+-------+----------------+------+-------------+--------+------------+------------+-------------+
So as you can see, the following data in the line:
"Stop and Hear the Cicadas/Cold-Blooded "Stop and Hear the Cicadas/Cold-Blooded
is not properly split into two separate values for the primaryTitle and originalTitle columns; instead, primaryTitle contains both of them:
{
"runtimeMinutes":"Animation,Family",
"tconst":"tt7841930",
"seasonNumber":"6",
"titleType":"tvEpisode",
"averageRating":null,
"originalTitle":"0",
"parentTconst":"tt4947580",
"startYear":null,
"endYear":"24",
"numVotes":null,
"episodeNumber":"2",
"primaryTitle":"\"Stop and Hear the Cicadas/Cold-Blooded\t\"Stop and Hear the Cicadas/Cold-Blooded",
"isAdult":2018,
"genres":null
}
What am I doing wrong, and how do I configure Spark to properly parse and split this line? As I mentioned, most of the other lines in this file are split correctly into the proper columns.

I found the answer here: https://github.com/databricks/spark-csv/issues/89
The way to turn off the default escaping of the double quote character (") with the backslash character (\), i.e. to avoid escaping for all characters entirely, is to add an .option() method call with just the right parameters after the .write() (or .read()) method call. The goal of the option() method call is to change how the csv() method "finds" instances of the "quote" character as it is emitting (or parsing) the content. To do this, you must change the default of what a "quote" actually means, i.e. change the character sought from the double quote character (") to a Unicode "\u0000" character (essentially providing the Unicode NUL character, assuming it won't ever occur within the document).
The following magic option did the trick:
.option("quote", "\u0000")

Related

pyspark regexp_replace replacing multiple values in a column

I have the URL https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r in a dataset. I want to remove https:// at the start of the string and \r at the end of the string.
Creating a dataframe to replicate the issue:
c = spark.createDataFrame([('https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r',)], ['str'])
I tried the regexp_replace below with a pipe in the pattern, but it is not working as expected.
c.select(F.regexp_replace('str', 'https:// | \\r', '')).first()
Actual output:
www.youcuomizei.comEquaion-Kid-Backack-Peronalized301793
Expected output:
www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793
the "backslash"r (\r) is not showing in your original spark.createDataFrame object because you have to escape it. so your spark.createDataFrame should be. please note the double backslashes
c = spark.createDataFrame([("https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\\r",)], ['str'])
which will give this output:
+------------------------------------------------------------------------------+
|str |
+------------------------------------------------------------------------------+
|https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r|
+------------------------------------------------------------------------------+
Your regex https://|[\\r] will not remove the \r. The regex should be:
c = (c
.withColumn("str", F.regexp_replace("str", "https://|[\\\\]r", ""))
)
which will give this output:
+--------------------------------------------------------------------+
|str |
+--------------------------------------------------------------------+
|www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793|
+--------------------------------------------------------------------+

Removing trailing tabs from a string column in a Spark Dataframe

I need to clean a column in a Dataframe which contains trailing whitespace. Something like this:
'17063256 '
'17403492 '
'17390052 '
First, I tried to remove white spaces using trim:
df.withColumn("col1_cleansed", trim(col("col1")))
Then I thought it may be trailing "tabs", so I also tried:
df.withColumn("col1_cleansed", regexp_replace(col("col1"), "\t", ""))
However, neither of these two solutions seems to be working.
What is the correct way to remove "tab" characters from a string column in Spark?
The trim and rtrim methods do seem to have a problem handling general whitespace characters (such as tabs). To remove trailing whitespace, consider using regexp_replace with the regex pattern \\s+$ (with '$' representing end of string), as shown below:
val df = Seq(
"17063256 ", // space
"17403492 ", // tab
"17390052 " // space + tab
).toDF("c1")
df.withColumn("c1_trimmed", regexp_replace($"c1", "\\s+$", "")).show
// Output (prettified)
// +------------+----------+
// | c1|c1_trimmed|
// +------------+----------+
// | 17063256 | 17063256|
// | 17403492 | 17403492|
// |17390052 | 17390052|
// +------------+----------+
Try the udf below and change it as per your needs.
val normalize = udf((in: String) => {
  import java.text.Normalizer.{normalize ⇒ jnormalize, _}
  // lower-case, strip diacritics, then keep only alphanumerics and '-'
  val cleaned = in.trim.toLowerCase
  val normalized = jnormalize(cleaned, Form.NFD)
    .replaceAll("[\\p{InCombiningDiacriticalMarks}\\p{IsM}\\p{IsLm}\\p{IsSk}]+", "")
  normalized.replaceAll("'s", "")
    .replaceAll("ß", "ss")
    .replaceAll("ø", "o")
    .replaceAll("[^a-zA-Z0-9-]+", " ")
})
df.withColumn("col1_cleansed", normalize(col("col1")))
You can use regexp_replace to replace the space with "":
df.withColumn("new", regexp_replace($"id", " ",""))
.show(false)
Output:
+------------------------------------------------------+----------+
|id |new |
+------------------------------------------------------+----------+
|'17063256 ' |'17063256'|
|'17403492 '|'17403492'|
|'17390052 ' |'17390052'|
+------------------------------------------------------+----------+
Another way to look at the problem: extract only the required portion from the column. This will work if you are expecting only alphanumeric values and nothing else.
Feel free to modify it to accept numbers only if required.
df.withColumn("cleansed_col",regexp_extract(col("input"),"[a-z0-9]+",0))

Scala Spark Handle Single Quote Characters With Commas

I'm reading a CSV in Spark with Scala, and it handles line one correctly in the example below. In line two of the example, however, the line has an ending quote character but no leading quote character for the first column. This causes an issue by moving the data over and outputting bad|col in the final result, which is incorrect.
"good,col","good,col"
bad,col","good,col"
Is there an option to handle quote characters that don't have a leading (or ending) quote in the option specification when reading the file in spark with scala?
Hm... by using an RDD and some replacements, I can obtain what you want.
// assumes rdd is the raw file loaded as an RDD[String], e.g. via sc.textFile(...)
val df = rdd.map(r => r.replaceAll("\",\"", "|").replaceAll("\"", "").split("\\|"))
  .map { case Array(a, b) => (a, b) }.toDF("col1", "col2")
df.show()
+--------+--------+
| col1| col2|
+--------+--------+
|good,col|good,col|
| bad,col|good,col|
+--------+--------+
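If you prefer to stay in the DataFrame API, roughly the same cleanup can be done by reading the lines as text and mirroring the replacements above (a sketch under the same two-column assumption; "input.csv" is a placeholder path):
import org.apache.spark.sql.functions._
import spark.implicits._

// Sketch: same cleanup as the RDD version, but with DataFrame functions.
// 1) turn the "," separator into a pipe, 2) drop any stray quotes, 3) split on the pipe.
val parsed = spark.read.text("input.csv")   // placeholder path
  .withColumn("parts", split(regexp_replace(regexp_replace($"value", "\",\"", "|"), "\"", ""), "\\|"))
  .select($"parts".getItem(0).as("col1"), $"parts".getItem(1).as("col2"))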

How to save a file with multiple delimiters in Spark

I need to save a file delimited by "|~" characters but I get an error when I execute the command below. Can I save a file using multiple delimiters in Spark?
mydf1.coalesce(1).write.option("compression","none").format("csv").mode("Overwrite").option("delimiter","|~").save("my_hdfs_path")
// Error : pyspark.sql.utils.IllegalArgumentException: u'Delimiter cannot be more than one character: |~'
AFAIK, we are still waiting for an "official" solution, because the issue "support for multiple delimiter in Spark CSV read" is still open, and SparkSQL still relies on univocity-parsers. In univocity CSV settings, the CSV delimiter can only be a single character, which constrains both the parser (reader) and generator (writer).
Workarounds
Finding a universally fast and safe way to write as CSV is hard. But depending on your data size and the complexity of the CSV contents (date format? currency? quoting?), we may find a shortcut. The following are just some, hopefully inspiring, thoughts...
Write to CSV with a special character (say ⊢), then substitute it with |~.
(This hasn't been benchmarked, but IMO it is likely to be the fastest.)
df.coalesce(1).write.option("compression","none").option("delimiter", "⊢").mode("overwrite").csv("raw-output")
then post-process (ideally locally) with, say, sed:
sed -i '.bak' 's/⊢/\|~/g' raw-output/*.csv
Within PySpark, concatenate each row into a string, then write it as a text file.
(This can be flexible to deal with locality and special needs, with a bit more work.)
d = [{'name': 'Alice', 'age': 1},{'name':'Bob', 'age':3}]
df = spark.createDataFrame(d, "name:string, age:int")
df.show()
#+-----+---+
#| name|age|
#+-----+---+
#|Alice| 1|
#| Bob| 3|
#+-----+---+
# udf: register mkstr so it can be applied to DataFrame columns
from pyspark.sql.functions import udf

@udf("string")
def mkstr(name, age):
    """
    for example, the string field {name} should be quoted with `"`
    """
    return '"{name}"|~{age}'.format(name=name, age=age)

# unparse a CSV row back to a string
df_unparsed = df.select(mkstr("name", "age").alias("csv_row"))
df_unparsed.show()
#+----------+
#| csv_row|
#+----------+
#|"Alice"|~1|
#| "Bob"|~3|
#+----------+
df_unparsed.coalesce(1).write.option("compression", "none").mode("overwrite").text("output")
numpy.savetxt allows multiple characters as a delimiter, so ...
(numpy has lots of built-ins if you care about the precision of floating-point numbers)
import pandas as pd
import numpy as np
# convert `Spark.DataFrame` to `Pandas.DataFrame`
df_pd = df.toPandas()
# use `numpy.savetxt` to save `Pandas.DataFrame`
np.savetxt("a-long-day.csv", df_pd, delimiter="|~", fmt="%s")
From Spark 3.0 we don't have this issue. But if you are using a prior version (Spark 2.3 or later), the following can also be used as a solution: basically, concatenate all the columns, fill nulls with blanks, and write the data with the desired delimiter along with the header. This is more generic than hardcoding and also allows the header to be retained.
from pyspark.sql.functions import *
from pyspark.sql import functions as F
target_row_delimited = "|,"
df=df.select([col(c).cast("string") for c in df.columns])
df=df.na.fill("")
headername=target_row_delimited.join(df.columns)
df=df.withColumn(headername, concat_ws(target_row_delimited, *df.columns))
df.select(df[headername]).write.format("csv").mode(modeval).option("quoteAll", "false").option("quote","\u0000").option("header", "true").save(tgt_path + "/")
In case we need to read with multiple delimiters, the following solution can be used:
source_delimiter = "|_|"
headcal = spark.read.text(source_filename)
headers = headcal.take(1)[0]['value']
header_column = headers.split(source_delimiter)
df = (sc.textFile(source_filename)
        .filter(lambda x: x != headers)   # drop the header line from the data
        .map(lambda x: x.split(source_delimiter))
        .toDF(header_column))
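As noted above, from Spark 3.0 the built-in CSV source accepts a multi-character separator directly on read, so on 3.0+ the manual split is unnecessary. A minimal Scala sketch ("source_path" is a placeholder):
// Spark 3.0+ sketch: a multi-character "sep" is accepted on read.
val df = spark.read
  .option("header", "true")
  .option("sep", "|_|")        // multi-character delimiter
  .csv("source_path")          // placeholder path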

Scala spark dataframe keep leading zeroes

I'm reading the following csv file:
id,hit,name
0001,00000,foo
0002,00001,bar
0003,00150,toto
as a Spark Dataframe with an SQLContext, which gives the output:
+--+---+----+
|id|hit|name|
+--+---+----+
|1 |0 |foo |
|2 |1 |bar |
|3 |150|toto|
+--+---+----+
I need to keep the leading zeros in the Dataframe.
I tried with the option "allowNumericLeadingZeros" set to true, it doesn't work.
I saw some posts saying it's an Excel issue, but my issue is that the leading zeros are removed inside the Dataframe.
How can I keep the leading zeros inside the Dataframe ?
Thanks !
public Dataset csv(String... paths)
Loads a CSV file and returns the result as a DataFrame.
This function will go through the input once to determine the input schema if inferSchema is enabled. To avoid going through the entire data once, disable inferSchema option or specify the schema explicitly using schema.
You can set the following CSV-specific options to deal with CSV files:
sep (default ,): sets the single character as a separator for each field and value.
encoding (default UTF-8): decodes the CSV files by the given encoding type.
quote (default "): sets the single character used for escaping quoted values where the separator can be part of the value. If you would like to turn off quotations, you need to set not null but an empty string. This behaviour is different from com.databricks.spark.csv.
escape (default \): sets the single character used for escaping quotes inside an already quoted value.
comment (default empty string): sets the single character used for skipping lines beginning with this character. By default, it is disabled.
header (default false): uses the first line as names of columns.
inferSchema (default false): infers the input schema automatically from data. It requires one extra pass over the data.
ignoreLeadingWhiteSpace (default false): defines whether or not leading whitespaces from values being read should be skipped.
ignoreTrailingWhiteSpace (default false): defines whether or not trailing whitespaces from values being read should be skipped.
nullValue (default empty string): sets the string representation of a null value. Since 2.0.1, this applies to all supported types including the string type.
nanValue (default NaN): sets the string representation of a non-number value.
positiveInf (default Inf): sets the string representation of a positive infinity value.
negativeInf (default -Inf): sets the string representation of a negative infinity value.
dateFormat (default yyyy-MM-dd): sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to date type.
timestampFormat (default yyyy-MM-dd'T'HH:mm:ss.SSSZZ): sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type.
maxColumns (default 20480): defines a hard limit of how many columns a record can have.
maxCharsPerColumn (default 1000000): defines the maximum number of characters allowed for any given value being read.
maxMalformedLogPerPartition (default 10): sets the maximum number of malformed rows Spark will log for each partition. Malformed records beyond this number will be ignored.
mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing.
PERMISSIVE : sets other fields to null when it meets a corrupted record. When a schema is set by user, it sets null for extra fields.
DROPMALFORMED : ignores the whole corrupted records.
FAILFAST : throws an exception when it meets corrupted records.
Since:
2.0.0
https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/DataFrameReader.html
Example :
val dataframe= sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("/FileStore/tables/1wmfde6o1508943117023/Book2.csv")
val selectedData = dataframe.select("id","hit","name")
Result :
+---+---+----+
| id|hit|name|
+---+---+----+
| 1| 0| foo|
| 2| 1| bar|
| 3|150|toto|
+---+---+----+
Now change .option("inferSchema", "false")
val dataframe= sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "false") // Automatically infer data types
.load("/FileStore/tables/1wmfde6o1508943117023/Book2.csv")
val selectedData = dataframe.select("id","hit","name")
Result :
+----+-----+----+
| id| hit|name|
+----+-----+----+
|0001|00000| foo|
|0002|00001| bar|
|0003|00150|toto|
+----+-----+----+
I would suggest you create a schema for your dataframe and define the types as strings.
You can create the schema as:
import org.apache.spark.sql.types._
val schema = StructType(Seq(StructField("id", StringType, true), StructField("hit", StringType, true), StructField("name", StringType, true)))
and use it in sqlContext as
val df = sqlContext.read.option("header", true).schema(schema).format("com.databricks.spark.csv")
.csv("path to csv file")
As others have answered, the real culprit is that you might be using .option("inferSchema", true), which reads the id and hit columns as integers, so the leading 0s are removed.
So you can read the csv file without .option("inferSchema", true), or with the schema defined as above.
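On Spark 2.3+ the same idea can be written more compactly with a DDL-style schema string (a minimal sketch, assuming a SparkSession named spark):
// Sketch: declare every column as STRING so the leading zeros survive.
val df = spark.read
  .option("header", "true")
  .schema("id STRING, hit STRING, name STRING")
  .csv("path to csv file")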
I hope the answer is helpful
You must have set inferSchema to true while reading the dataframe; remove this option or set it to false:
sparkSession.read.option("header","true").option("inferSchema","false").csv("path")
Through this option, Spark infers the schema of the dataframe and sets the data types according to the values found, so Spark is basically inferring that the id and hit columns are numeric in nature and is therefore removing all the leading zeros.