I'm having a tough time using StreamingContext to read a CSV and send each row to another method that does other processing. I tried splitting by newline but it splits after three columns (there are about 10 columns per row):
val lines = ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/tmpFolder/")
lines.map{row => {
val columnValues = row.split("\n")
(columnValues(0), "\n")
}}.print()
If I open the CSV in Excel, there are about 10 values per row. If I open the same file in Sublime or another text editor, there appears to be a newline after the first 3 values. I'm not sure if it's an encoding thing or just the way Sublime displays it. In any case, I'm trying to get the entire row in Spark, and I'm not sure if there's a way to do that.
ssc.textFileStream internally creates a file stream and splits it on the newline character. But your data contains text qualifiers:
1996, Jeep, "Grand Cherokee, MUST SELL!
air", moon roof, loaded, 4799.00
Here some text is in double quotes and the row spans multiple lines. If you try to split the data by comma, you get:
[1996, Jeep, "Grand Cherokee,MUST SELL!]
It misses the other data points, because the line was already cut at the newline inside the quotes before you split by comma. To avoid that, you can use sqlContext:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("multiLine", "true") // multiLine is available from Spark 2.2
  .option("quoteMode", "ALL")
  .load(path)
Or you can pre-process your CSV with the univocity parser to handle multi-line fields, double quotes and other special characters, put the cleaned files into the directory, and start your ssc.textFileStream after that.
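As a rough illustration of what such a pre-processing pass has to do, here is a minimal sketch in plain Scala that glues physical lines back into logical records by tracking whether a double quote is still open. The object and method names (QuotedRecordJoiner, joinRecords) are made up for this example, and it assumes quotes inside fields are escaped by doubling them (""). It is not a replacement for a real CSV parser such as univocity.

```scala
object QuotedRecordJoiner {
  // An odd number of double quotes means a quoted field is still open.
  // Doubled quotes ("") add two to the count, so they do not change parity.
  private def hasOpenQuote(s: String): Boolean = s.count(_ == '"') % 2 == 1

  /** Merges physical lines into logical CSV records. */
  def joinRecords(lines: Seq[String]): Seq[String] = {
    val records = scala.collection.mutable.ArrayBuffer.empty[String]
    var pending: Option[String] = None
    for (line <- lines) {
      // Re-attach the line to an unfinished record, restoring the newline
      val current = pending.map(_ + "\n" + line).getOrElse(line)
      if (hasOpenQuote(current)) pending = Some(current)
      else { records += current; pending = None }
    }
    // An unterminated quote at the end of the input is kept, not dropped
    pending.foreach(records += _)
    records.toList
  }
}
```

Applied to the two physical lines of the Jeep example, joinRecords returns a single record with the newline preserved inside the quoted field. Note that this only works where the lines arrive in order, which is why it fits a pre-processing step better than a per-line map on the stream.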
HD|20211210
DT|D-|12/22/2017|12/22/2017 09:41:45.828000|11/01/2017|01/29/2018 14:46:10.666000|1.2|1.2|ABC|ABC|123|123|4554|023|11/01/2017|ACDF|First|0012345||f|ABCD|ABCDEFGH|ABCDEFGH||||
DT|D-|12/25/2017|12/25/2017 09:24:20.202000|12/13/2017|01/29/2018 07:52:23.607000|6.4|6.4|ABC|ABC|123|123|4540|002|12/13/2017|ACDF|First|0012345||f|ABC|ABCDEF|ABCDEFGH||||
TR|0000000002
File name is Datafile.Dat. Scala version 2.11
I need to create a header DataFrame from the first line, excluding "HD|"; a trailer DataFrame from the last line, excluding "TR|"; and the actual DataFrame from the remaining lines, skipping the first and last line and excluding "DT|" from each line.
Please help me with this.
I see you have a defined schema for your dataframe (except for the first and last rows).
What you can do is read that file with '|' as the separator and enable "DROPMALFORMED" mode.
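import org.apache.spark.sql.types.StructType

val schema: StructType = ??? // define your schema here
val df = spark.read
  .option("mode", "DROPMALFORMED") // drop rows that do not fit the schema
  .option("delimiter", "|")
  .option("header", "true")
  .schema(schema)
  .csv("Datafile.Dat")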
Another way is to use zipWithIndex.
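A minimal sketch of the zipWithIndex idea, written against a plain Scala Seq so it runs without a cluster. The names HeaderTrailerSplit, Sections and split are hypothetical. On an RDD the same shape should apply, with the difference that RDD.zipWithIndex yields Long indices and the last index has to come from count().

```scala
object HeaderTrailerSplit {
  final case class Sections(header: String, details: Seq[Seq[String]], trailer: String)

  def split(lines: Seq[String]): Sections = {
    require(lines.length >= 2, "need at least a header and a trailer line")
    val last = lines.length - 1
    // Pair every line with its position so first and last can be singled out
    val indexed = lines.zipWithIndex

    val header  = indexed.collect { case (l, 0) => l.stripPrefix("HD|") }.head
    val trailer = indexed.collect { case (l, i) if i == last => l.stripPrefix("TR|") }.head
    // Everything in between is a detail row: drop "DT|" and split on '|',
    // using limit -1 so trailing empty fields are kept
    val details = indexed.collect {
      case (l, i) if i > 0 && i < last => l.stripPrefix("DT|").split("\\|", -1).toSeq
    }
    Sections(header, details, trailer)
  }
}
```

Each of the three parts can then be turned into its own DataFrame with whatever schema fits.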
I have a CSV file which I am converting to Parquet files using the Databricks library in Scala. I am using the code below:
val spark = SparkSession.builder().master("local[*]").config("spark.sql.warehouse.dir", "local").getOrCreate()
var csvdf = spark.read.format("org.apache.spark.csv").option("header", true).csv(csvfile)
csvdf.write.parquet(csvfile + "parquet")
Now, the above code works fine if there are no spaces in my column headers. But if a CSV file has spaces in its column headers, it fails with an error about invalid column headers. My CSV files are delimited by ,.
Also, I cannot change the spaces in the column names of the CSV. The column names have to stay as they are, even if they contain spaces, because they are given by the end user.
Any idea how to fix this?
Per @CodeHunter's request:
Sadly, the Parquet file format does not allow spaces in column names;
the error it spits out when you try is: contains invalid character(s) among " ,;{}()\n\t=".
ORC also does not allow spaces in column names :(
Most SQL engines don't support column names with spaces, so you'll probably be best off converting your columns to your preference of foo_bar or fooBar or something along those lines.
I would rename the offending columns in the dataframe, changing spaces to underscores, before saving. This could be done with select "foo bar" as "foo_bar" or with .withColumnRenamed("foo bar", "foo_bar").
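To avoid renaming columns one by one, the new names can be computed first. The sketch below is plain Scala, and ColumnNames, sanitize and renames are names invented for this example. It replaces every character from the error message quoted above with an underscore. Applying the result to a DataFrame is shown only as a comment, since df is assumed to exist elsewhere.

```scala
object ColumnNames {
  // Characters listed in the Parquet error message
  private val Invalid: Set[Char] = " ,;{}()\n\t=".toSet

  /** Replaces each invalid character with an underscore. */
  def sanitize(name: String): String =
    name.map(c => if (Invalid(c)) '_' else c)

  /** Returns (oldName, newName) pairs for the columns that actually change. */
  def renames(columns: Seq[String]): Seq[(String, String)] =
    columns.map(c => c -> sanitize(c)).filter { case (from, to) => from != to }
}

// With a DataFrame `df` in scope (an assumption, not defined here):
//   val cleaned = ColumnNames.renames(df.columns.toSeq)
//     .foldLeft(df) { case (acc, (from, to)) => acc.withColumnRenamed(from, to) }
```

Keeping the (old, new) pairs around also makes it possible to map the names back to what the end user supplied when the data is read again. Two different headers can collapse to the same sanitized name (for example "foo bar" and "foo_bar"), so a uniqueness check may be needed.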
I have a dataframe which has some multi-line observations:
+--------------------+----------------+
| col1| col2|
+--------------------+----------------+
|something1 |somethingelse1 |
|something2 |somethingelse2 |
|something3 |somethingelse3 |
|something4 |somethingelse4 |
|multiline
row | somethings|
|something |somethingall |
What I want is to save this dataframe in csv (or txt) format, using the following:
df
.write
.format("csv")
.save("s3://../adf/")
But when I check the file, it separates the observations into multiple lines. What I want is for the rows that have 'multiline' observations to be on the same row in the txt/csv file. I tried to save it as a txt file:
df
.as[(String,String)]
.rdd
.saveAsTextFile("s3://../adf")
but the same output was observed.
I can imagine that one way is to replace \n with something else and do the reverse when loading the data back. But is there a way to save it in the desired way without doing any kind of transformation to the data?
Assuming the multi-line data is properly quoted, you can parse multi-line csv data using the univocity parser and the multiLine setting
sparkSession.read
.option("parserLib", "univocity")
.option("multiLine", "true")
.csv(file)
Note that this requires reading the entire file onto a single executor, and may not work if your data is too large. Standard text file reading splits the file by lines before doing any other parsing, which prevents you from working with records containing newlines unless there is a different record delimiter you can use. If not, you may need to implement a custom TextInputFormat to handle multi-line records.
By default, Spark's saveAsTextFile starts a new row whenever it encounters \n. The same is true for csv. When reading csv you can specify the field delimiter with option("delimiter", "\t").
In my opinion the best way to read multi-line input is through the Hadoop API. You can specify your own record delimiter and process the data.
Something like this:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.rdd.RDD

val conf = new Configuration()
// Records are split on this string instead of on newline
conf.set("textinputformat.record.delimiter", "<your delimiter>")
// Each element is a (byte offset, record) pair
val data: RDD[(LongWritable, Text)] = spark.sparkContext.newAPIHadoopFile(
  "<filepath>", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
Here, each Text value in data is one record, i.e. the string between two occurrences of your delimiter.
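For the write side, the replace-and-restore idea mentioned in the question could look roughly like the sketch below. It is plain Scala, and NewlineCodec, escape and unescape are hypothetical names. It does transform the data, so it is a workaround rather than the transformation-free save the question asks for. Backslashes are escaped first so that a field which already contains a literal backslash followed by n survives the round trip.

```scala
object NewlineCodec {
  /** Turns real newlines into the two characters backslash + n. */
  def escape(s: String): String =
    s.replace("\\", "\\\\").replace("\n", "\\n")

  /** Reverses escape: backslash + n becomes a newline, backslash + x becomes x. */
  def unescape(s: String): String = {
    val out = new StringBuilder
    var i = 0
    while (i < s.length) {
      val c = s.charAt(i)
      if (c == '\\' && i + 1 < s.length) {
        s.charAt(i + 1) match {
          case 'n'   => out.append('\n')
          case other => out.append(other)
        }
        i += 2
      } else {
        out.append(c)
        i += 1
      }
    }
    out.toString
  }
}
```

escape would be applied to each string column before the write, and unescape after the load. Carriage returns are left untouched here. If the data can contain \r, it would need the same treatment.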
I have a CSV file which I am trying to load using the Spark CSV package, and it does not load the data properly because a few of the fields have \n within them, e.g. the following two rows:
"XYZ", "Test Data", "TestNew\nline", "OtherData"
"XYZ", "Test Data", "blablablabla
\nblablablablablalbal", "OtherData"
I am using the following code, which is straightforward. I am using univocity as parserLib because I read online that it solves the multi-line problem, but that does not seem to be the case for me.
SQLContext sqlContext = new SQLContext(sc);
DataFrame df = sqlContext.read()
.format("com.databricks.spark.csv")
.option("inferSchema", "true")
.option("header", "true")
.option("parserLib","univocity")
.load("data.csv");
How do I replace newlines within fields that start with quotes? Is there an easier way?
According to SPARK-14194 (resolved as a duplicate) fields with new line characters are not supported and will never be.
I proposed to solve this via wholeFile option and it seems merged. I am resolving this as a duplicate of that as that one has a PR.
That's however Spark 2.0, and you use spark-csv module.
In the referenced SPARK-19610 it was fixed with the pull request:
hmm, I understand the motivation for this, though my understanding with csv generally either avoid having newline in field or some implementation would require quotes around field value with newline
In other words, use wholeFile option in Spark 2.x (as you can see in CSVDataSource).
As to spark-csv, this comment might be of some help (highlighting mine):
However, that there are a quite bit of similar JIRAs complaining about this and the original CSV datasource tried to support this although that was incorrectly implemented. This tries to match it with JSON one at least and it might be better to provide a way to process such CSV files. Actually, current implementation requires quotes :). (It was told R supports this case too actually).
In spark-csv's Features you can find the following:
The package also supports saving simple (non-nested) DataFrame. When writing files the API accepts several options:
quote: by default the quote character is ", but can be set to any character. This is written according to quoteMode.
quoteMode: when to quote fields (ALL, MINIMAL (default), NON_NUMERIC, NONE), see Quote Modes
There is an option available to users of Spark 2.2 to account for line breaks in CSV files. It was originally discussed as being called wholeFile but prior to release was renamed multiLine.
Here is an example of loading in a CSV to a dataframe with that option:
var webtrends_data = (sparkSession.read
.option("header", "true")
.option("inferSchema", "true")
.option("multiLine", true)
.option("delimiter", ",")
.format("csv")
.load("hdfs://hadoop-master:9000/datasource/myfile.csv"))
Upgrade to Spark 2.x. A real newline is CR LF, represented by ASCII 13 and 10, whereas a backslash followed by 'n' is two different ASCII characters that are only interpreted as a newline programmatically. Spark 2.x will read it correctly; I tried it. It should be:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("HelloSpark").master("local[2]").getOrCreate()
val df = spark.read.csv("src/main/resources/data.csv")
df.foreach(row => println(row.mkString(", ")))
If you can't upgrade, clean up the literal \n on the RDD. This won't remove the actual end of line, which is a different character (matched by $ in a regex). It should be:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("HelloSpark").setMaster("local")
val sc = new SparkContext(conf)
val rdd1 = sc.textFile("src/main/resources/data.csv")
// Removes the two-character sequence backslash + n, not real line breaks
val rdd2 = rdd1.map(row => row.replace("\\n", ""))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val df = rdd2.toDF()
df.foreach(row => println(row.mkString(", ")))
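The difference between a real newline and the literal characters backslash and n can be checked in plain Scala. The sketch below only illustrates that point. NewlineVsLiteral and stripLiteral are names made up here, and stripLiteral uses the same replace call as the cleanup above.

```scala
object NewlineVsLiteral {
  val realNewline: String = "\n"        // one character, code 10
  val literalBackslashN: String = "\\n" // two characters, codes 92 and 110

  // Removes only the two-character literal; real line breaks are untouched
  def stripLiteral(s: String): String = s.replace("\\n", "")
}
```

So a field such as "TestNew\nline" that contains the literal two characters is cleaned by the replace, while a field that is physically broken across two lines is not, and still needs multi-line aware parsing.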
Terribly new to spark and hive and big data and scala and all. I'm trying to write a simple function that takes an sqlContext, loads a csv file from s3 and returns a DataFrame. The problem is that this particular csv uses the ^A (i.e. \001) character as the delimiter and the dataset is huge so I can't just do a "s/\001/,/g" on it. Besides, the fields might contain commas or other characters I might use as a delimiter.
I know that the spark-csv package that I'm using has a delimiter option, but I don't know how to set it so that it will read \001 as one character and not something like an escaped 0, 0 and 1. Perhaps I should use hiveContext or something?
If you check the GitHub page, there is a delimiter parameter for spark-csv (as you also noted).
Use it like this:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.option("delimiter", "\u0001")
.load("cars.csv")
With Spark 2.x and the CSV API, use the sep option:
val df = spark.read
.option("sep", "\u0001")
.csv("path_to_csv_files")
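To confirm that "\u0001" in Scala source really is the single ^A character and not an escaped 0, 0 and 1, a quick check in plain Scala can help. CtrlADelimiter and splitFields are hypothetical names used only for this sketch.

```scala
import java.util.regex.Pattern

object CtrlADelimiter {
  // The compiler turns the unicode escape into one character with code 1
  val Sep: String = "\u0001"

  /** Splits a line on ^A, keeping trailing empty fields (limit -1). */
  def splitFields(line: String): Array[String] =
    line.split(Pattern.quote(Sep), -1)
}
```

Because the separator is a control character, commas inside fields need no special handling: splitting "a,b" + Sep + "c" yields the two fields "a,b" and "c".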