How to handle multi line rows in spark? - scala

I am having a dataframe which has some multi-line observations:
+--------------------+----------------+
| col1| col2|
+--------------------+----------------+
|something1 |somethingelse1 |
|something2 |somethingelse2 |
|something3 |somethingelse3 |
|something4 |somethingelse4 |
|multiline
row | somethings|
|something |somethingall |
What I want is to save in csv format(or txt) this dataframe. Using the following:
df
.write
.format("csv")
.save("s3://../adf/")
But when I check the file it seperates the observations to multiple lines. What I want is the lines that have 'multiline' observatios to be one the same row in the txt/csv file. I tried to save it as txt file:
df
.as[(String,String)]
.rdd
.saveAsTextFile("s3://../adf")
but the same output was observed.
I can imagine that one way is to replace \n with something else and after when loading back do the reverse function. But Is there a way to save it in the desired way without doing any kind of transformation to the data?

Assuming the multi-line data is properly quoted, you can parse multi-line csv data using the univocity parser and the multiLine setting
sparkSession.read
.option("parserLib", "univocity")
.option("multiLine", "true")
.csv(file)
Note that this requires reading the entire file onto as single executor, and may not work if your data is too large. The standard text file reading will split the file by lines before doing any other parsing which will prevent you from working with data records containing newlines unless there is a different record delimiter you can use. If not you may need to implement a custom TextInputFormat to handle multiline records.

By default spark saveTextFile considers a different row if it encounters \n. This is same with csv. In csv reading you can specify the delimiter with option("delimiter", "\t").
In my opinion the best way to read multiline input is through hadoopAPI. You can specify your own delimiter and process the data.
Something like this :
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
val conf = new Configuration
conf.set("textinputformat.record.delimiter", "<your delimiter>")
val data: RDD[(LongWritable, Text)] =spark.sparkContext.newAPIHadoopFile(<"filepath">, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
Here in the data Text is your delimiter separated string

Related

How to display DataFrame in Databricks?

I recently started working with Databricks and I am new to Pyspark. I am trying to display a tidy and understandable dataset from a text file in pyspark.
Here is the code snippet:
# File location and type
file_location = "/FileStore/tables/2014_GE_By_Precinct.txt"
file_type = "txt"
# The applied options are for txt files. For other file types, these will be ignored.
df = spark.read.option("inferSchema", "true") \
.option("header", "true") \
.csv(file_location)
df.show(truncate=False)
Here is the result I am getting:
I want the dataframe to be displayed in a way so that I can scroll it horizontally and all my column headers fit in one top line instead of a few of them coming in the next line and making it hard to understand which column header represents which column.
In Databricks, use display(df) command.
%python
display(df)
Read about this and more in Apache Sparkā„¢ Tutorial: Getting Started with Apache Spark on Databricks.

Adjusting columns from txt to parquet

this is my first time trying to convert a txt file to parquet format so please bear with me.
I have a txt file which originally looks like this:
id|roads|weights
a01|1026|1172|1
a02|DT:SR:0|2|NE|DT:SR:1|2|NE|1
b01|DT:SR:1|7|SW|1
And I'd like to make it to parquet format like this:
+---+-------------------------+-------+
|id |roads |weights|
+---+-------------------------+-------+
|a01|1026|1172 |1 |
|a02|DT:SR:0|2|NE|DT:SR:1|2|NE|1 |
|b01|DT:SR:1|7|SW |1 |
So far, I have uploaded my txt file to the HDFS, and tried to use spark to convert it to parquet format with:
val textfile = spark.read.text("hdfs:some/path/file.txt")
textfile.write.parquet("some.parquet")
val parquetfile = spark.read.parquet("hdfs:some/path/some.parquet")
But I my column names are now considered a row and everything has been put together as a single column call "value".
Any help would be appreciated!
read.text loads the text file and returns a single column named "value".You can make use of read.csv to read the delimited file .The following piece of code should work for you.
val textFile=spark.read.option("delimiter","|").option("header",true).csv("hdfs:some/path/file.txt")
textFile.write.parquet(parquet_file_path)

Spark Streaming Context to read CSV

I'm having a tough time using StreamingContext to read a CSV and send each row to another method that does other processing. I tried splitting by newline but it splits after three columns (there are about 10 columns per
row):
val lines = ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/tmpFolder/")
lines.map{row => {
val columnValues = row.split("\n")
(columnValues(0), "\n")
}}.print()
If I open the CSV in Excel, there are about 10 values per column. If I open the same file using Sublime or some text editor, there appears to be a newline after those first 3 values. Not sure if it's an encoding thing or just the way Sublime displays it. In any case I'm trying to get the entire row in Spark - not sure if there's a way to do that.
ssc.textFileStream internally creates a file stream and start splitting on the new line character. But your data is containing the text qualifiers
1996, Jeep, "Grand Cherokee, MUST SELL!
air", moon roof, loaded, 4799.00
Here some text is in double quotes and the row is multi lined row. If you try to split the data by , it will be:
[1996, Jeep, "Grand Cherokee,MUST SELL!]
It will miss the other data points because you are splitting by comma. To avoid, that you can use sqlContext
df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema","true")
.option("multiLine","true")
.option("quoteMode","ALL")
.load(path)
Or you can pre-process your CSV using Univocity Parser to handle multi-line and double quotes and other special characters, and put these files in to the directory and start your ssc.textFileStream after that.

Spark CSV package not able to handle \n within fields

I have a CSV file which I am trying to load using Spark CSV package and it does not load data properly because few of the fields have \n within them for e.g. the following two rows
"XYZ", "Test Data", "TestNew\nline", "OtherData"
"XYZ", "Test Data", "blablablabla
\nblablablablablalbal", "OtherData"
I am using the following code which is straightforward I am using parserLib as univocity as read in internet it solves multiple newline problem but it does not seems to be the case for me.
SQLContext sqlContext = new SQLContext(sc);
DataFrame df = sqlContext.read()
.format("com.databricks.spark.csv")
.option("inferSchema", "true")
.option("header", "true")
.option("parserLib","univocity")
.load("data.csv");
How do I replace newline within fields which starts with quotes. Is there any easier way?
According to SPARK-14194 (resolved as a duplicate) fields with new line characters are not supported and will never be.
I proposed to solve this via wholeFile option and it seems merged. I am resolving this as a duplicate of that as that one has a PR.
That's however Spark 2.0, and you use spark-csv module.
In the referenced SPARK-19610 it was fixed with the pull request:
hmm, I understand the motivation for this, though my understanding with csv generally either avoid having newline in field or some implementation would require quotes around field value with newline
In other words, use wholeFile option in Spark 2.x (as you can see in CSVDataSource).
As to spark-csv, this comment might be of some help (highlighting mine):
However, that there are a quite bit of similar JIRAs complaining about this and the original CSV datasource tried to support this although that was incorrectly implemented. This tries to match it with JSON one at least and it might be better to provide a way to process such CSV files. Actually, current implementation requires quotes :). (It was told R supports this case too actually).
In spark-csv's Features you can find the following:
The package also supports saving simple (non-nested) DataFrame. When writing files the API accepts several options:
quote: by default the quote character is ", but can be set to any character. This is written according to quoteMode.
quoteMode: when to quote fields (ALL, MINIMAL (default), NON_NUMERIC, NONE), see Quote Modes
There is an option available to users of Spark 2.2 to account for line breaks in CSV files. It was originally discussed as being called wholeFile but prior to release was renamed multiLine.
Here is an example of loading in a CSV to a dataframe with that option:
var webtrends_data = (sparkSession.read
.option("header", "true")
.option("inferSchema", "true")
.option("multiLine", true)
.option("delimiter", ",")
.format("csv")
.load("hdfs://hadoop-master:9000/datasource/myfile.csv"))
Upgrade to Spark 2.x. Newline is actually CRLF represented by ascii 13 and 10. But backslash and 'n' are different ascii which are programatically interpreted and written. Spark 2.x will read correctly.. I tried it..s.b.
val conf = new SparkConf().setAppName("HelloSpark").setMaster("local[2]")
val sc = SparkSession.builder().master("local").getOrCreate()
val df = sc.read.csv("src/main/resources/data.csv")
df.foreach(row => println(row.mkString(", ")))
If you cant upgrade, then do a cleanup of \n on RDD with regex. This wont remove end of line since it is $ in regex. S.b.
val conf = new SparkConf().setAppName("HelloSpark").setMaster("local")
val sc = new SparkContext(conf)
val rdd1 = sc.textFile("src/main/resources/data.csv")
val rdd2 = rdd1.map(row => row.replace("\\n", ""))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val df = rdd2.toDF()
df.foreach(row => println(row.mkString(", ")))

How to parse a csv that uses ^A (i.e. \001) as the delimiter with spark-csv?

Terribly new to spark and hive and big data and scala and all. I'm trying to write a simple function that takes an sqlContext, loads a csv file from s3 and returns a DataFrame. The problem is that this particular csv uses the ^A (i.e. \001) character as the delimiter and the dataset is huge so I can't just do a "s/\001/,/g" on it. Besides, the fields might contain commas or other characters I might use as a delimiter.
I know that the spark-csv package that I'm using has a delimiter option, but I don't know how to set it so that it will read \001 as one character and not something like an escaped 0, 0 and 1. Perhaps I should use hiveContext or something?
If you check the GitHub page, there is a delimiter parameter for spark-csv (as you also noted).
Use it like this:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.option("delimiter", "\u0001")
.load("cars.csv")
With Spark 2.x and the CSV API, use the sep option:
val df = spark.read
.option("sep", "\u0001")
.csv("path_to_csv_files")