new line in csv file dealing with spark - scala

One of my input files is a CSV (comma separated). One of the fields is an address, which contains newline characters. This causes me considerable trouble when I read it using Spark, because one input record gets split into multiple records.
Has anyone been able to find a solution to deal with this? The current workaround is to remove the newline characters from the data at the source before reading it into Spark.
I would like to create a general solution for this in Spark. I use the Scala DataFrame API.

You can try the multiLine option of the CSV reader:
spark.read.csv(file, multiLine=True)
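In the Scala DataFrame API the equivalent would be roughly the following; the header and escape options are assumptions about the file, not requirements:
val df = spark.read
  .option("multiLine", "true")   // keep quoted newlines inside a single record
  .option("header", "true")      // assuming the file has a header row
  .option("escape", "\"")        // often needed when quoted fields also contain quotes
  .csv("/path/to/file.csv")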

Related

Scala Spark read with partitions drop partitions

There is an HDFS directory:
/home/path/date=2022-12-02, where date=2022-12-02 is a partition.
A Parquet file with the partition "date=2022-12-02" has been written to this directory.
To read the file with its partition, I use:
spark
.read
.option("basePath", "/home/path")
.parquet("/home/path/date=2022-12-02")
The file is read successfully with all partition fields.
But the partition folder ("date=2022-12-02") is dropped from the directory.
I can't grasp what the reason is and how to avoid it.
There are two ways to have the date as part of your table:
Read the path like this: .parquet("/home/path/")
Add a new column using the input_file_name() function, then manipulate the string until you get the date column (should be fairly easy: take the last part after the slash, split on the equals sign and take index 1); a sketch follows below.
I don't think there is another way to do what you want directly.
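A minimal sketch of the second option (the regex and the new column name are illustrative, not part of the original answer):
import org.apache.spark.sql.functions.{input_file_name, regexp_extract}

val df = spark.read
  .parquet("/home/path/date=2022-12-02")
  // the full file path still contains ".../date=2022-12-02/part-...", so pull the value back out
  .withColumn("date", regexp_extract(input_file_name(), "date=([^/]+)", 1))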

how to save a Dataset[Row] as text file in spark? [duplicate]

This question already has answers here:
Write single CSV file using spark-csv
(16 answers)
Write spark dataframe to single parquet file
(2 answers)
Closed 3 years ago.
I would like to save a Dataset[Row] as a text file with a specific name in a specific location.
Can anybody help me?
I have tried this, but it produces a folder (LOCAL_FOLDER_TEMP/filename) with a parquet file inside of it:
Dataset.write.save(LOCAL_FOLDER_TEMP+filename)
Thanks
You can't save your dataset to a specific filename using the Spark API; there are multiple workarounds for that:
As Vladislav offered, collect your dataset, then write it to your filesystem using the Scala/Java/Python API.
Apply repartition/coalesce(1), write your dataset, and then change the filename.
Neither is really recommended, because on large datasets it can cause OOM, or you simply lose the power of Spark's parallelism.
The second issue, that you are getting a parquet file, is because parquet is Spark's default format; you should use:
df.write.format("text").save("/path/to/save")
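A rough sketch of the second workaround (the output directory and file name below are made up):
import org.apache.hadoop.fs.{FileSystem, Path}

// Write to a single partition so Spark produces exactly one part file.
// Note: the "text" format expects a single string column; build one first
// (e.g. with concat_ws) if your Dataset has several columns.
df.coalesce(1)
  .write
  .format("text")
  .save("/tmp/output_dir")

// Spark still writes a directory, so rename the part file afterwards.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val partFile = fs.globStatus(new Path("/tmp/output_dir/part-*"))(0).getPath
fs.rename(partFile, new Path("/tmp/myfile.txt"))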
Please use
RDD.saveAsTextFile()
It writes the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
Refer Link : rdd-programming-guide
Spark always creates multiple files - one file per partition. If you want a single file, you need to do collect() and then just write it to a file the usual way.
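For example, a small sketch of the collect-and-write approach (only sensible when the data fits comfortably in driver memory; the output path is made up):
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

// Bring all rows to the driver and write them out with plain Java I/O.
val lines = df.collect().map(_.mkString(",")).toSeq
Files.write(Paths.get("/tmp/myfile.txt"), lines.asJava)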

How to read large CSV with Beam?

I'm trying to figure out how to use Apache Beam to read large CSV files. By "large" I mean, several gigabytes (so that it would be impractical to read the entire CSV into memory at once).
So far, I've tried the following options:
Use TextIO.read(): this is no good because a quoted CSV field could contain a newline. In addition, this tries to read the entire file into memory at once.
Write a DoFn that reads the file as a stream and emits records (e.g. with commons-csv). However, this still reads the entire file all at once.
Try a SplittableDoFn as described here. My goal with this is to have it gradually emit records as an Unbounded PCollection - basically, to turn my file into a stream of records. However, (1) it's hard to get the counting right (2) it requires some hacky synchronizing since ParDo creates multiple threads, and (3) my resulting PCollection still isn't unbounded.
Try to create my own UnboundedSource. This seems to be ultra-complicated and poorly documented (unless I'm missing something?).
Does Beam provide anything simple to allow me to parse a file the way I want, and not have to read the entire file into memory before moving on to the next transform?
TextIO should be doing the right thing from Beam's perspective, which is reading in the text file as fast as possible and emitting events to the next stage.
I'm guessing you are using the DirectRunner for this, which is why you are seeing a large memory footprint. Hopefully this isn't too much explanation: the DirectRunner is a test runner for small jobs, so it buffers intermediate steps in memory rather than on disk. If you are still testing your pipeline, you should use a small sample of your data until you think it is working. Then you can use the Apache Flink runner or Google Cloud Dataflow runner, which will both write intermediate stages to disk when needed.
In general, splitting CSV files with quoted newlines is hard, as it may require arbitrary look-back to determine whether a given newline is or is not inside a quoted segment. If you can arrange for the CSV to have no quoted newlines, TextIO.read() works well. Otherwise:
If you're using Beam Python, consider the dataframe operation apache_beam.dataframe.io.read_csv, which will handle quoting correctly (and efficiently).
In another language, you can either use that as a cross-language transform, or create a PCollection of file paths (e.g. via FileIO.MatchAll) followed by a DoFn that reads and emits rows incrementally using your CSV library of choice (a rough sketch follows below). With the exception of a direct/local runner, this should not require reading the entire file into memory (though it will cause each individual file to be read by a single worker, possibly limiting parallelism).
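To illustrate that second approach, here is a sketch in Scala against the Beam Java SDK using commons-csv; the class name, file pattern and header handling are assumptions, and you may need to set an explicit coder (e.g. SerializableCoder) for CSVRecord:
import java.io.InputStreamReader
import java.nio.channels.Channels
import java.nio.charset.StandardCharsets

import org.apache.beam.sdk.io.FileIO
import org.apache.beam.sdk.transforms.{DoFn, ParDo}
import org.apache.beam.sdk.transforms.DoFn.ProcessElement
import org.apache.commons.csv.{CSVFormat, CSVRecord}

import scala.collection.JavaConverters._

// Streams one file and emits CSV records one by one, so quoted newlines are
// handled by commons-csv and the whole file never sits in memory at once.
class ReadCsvRecordsFn extends DoFn[FileIO.ReadableFile, CSVRecord] {
  @ProcessElement
  def processElement(c: DoFn[FileIO.ReadableFile, CSVRecord]#ProcessContext): Unit = {
    val reader = new InputStreamReader(
      Channels.newInputStream(c.element().open()), StandardCharsets.UTF_8)
    val parser = CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(reader)
    try {
      parser.iterator().asScala.foreach(record => c.output(record))
    } finally {
      parser.close()
    }
  }
}

// Usage: match the file pattern, read the matches, then parse incrementally.
val records = pipeline
  .apply(FileIO.`match`().filepattern("/data/big-*.csv"))
  .apply(FileIO.readMatches())
  .apply(ParDo.of(new ReadCsvRecordsFn))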
You can use the logic in Text to Cloud Spanner for handling new lines while reading a CSV.
This template reads data from a CSV and writes to Cloud Spanner.
The specific files containing the logic to read CSV with newlines are in ReadFileShardFn and SplitIntoRangesFn.

Spark Multi-class classification -Categorical Variables

I have a data set as a CSV file. It has around 50 columns, most of which are categorical. I am planning to run a RandomForest multi-class classification with a new test data set.
The pain point of this is handling the categorical variables. What would be the best way to handle them? I read the guide for Pipeline on the Spark website http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline which creates a DataFrame from a hard-coded sequence, with the features as a space-delimited string. This looks very specific, and I wanted to achieve the same thing, in the way they use HashingTF for the features, but with the CSV file I have.
In short I want to achieve the same thing as in the link but using a CSV file.
Any suggestions?
EDIT:
Data -> 50 features, 100k rows, most of it alphanumeric categorical
I am pretty new to MLlib and hence struggling to find the proper pipeline for my data from CSV. I tried creating a DataFrame from the file, but I am confused as to how I should encode the categorical columns. The doubts I have are as follows:
1. The example in the link above tokenizes the data and uses it, but I have a DataFrame.
2. Also, even if I try using a StringIndexer, should I write an indexer for every column? Shouldn't there be one method which accepts multiple columns?
3. How will I get back the label from the StringIndexer for showing the prediction?
4. For new test data, how will I keep the encoding consistent for every column?
I would suggest having a look at the feature transformers http://spark.apache.org/docs/ml-features.html and in particular the StringIndexer and VectorAssembler.
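For instance, a rough sketch of such a pipeline, assuming the CSV has already been loaded into training/test DataFrames and that the column names below are placeholders:
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler}

val categoricalCols = Array("cat1", "cat2", "cat3")   // placeholder column names

// One StringIndexer per categorical column (doubt 2): each emits "<col>_idx".
val indexers = categoricalCols.map { c =>
  new StringIndexer()
    .setInputCol(c)
    .setOutputCol(s"${c}_idx")
    .setHandleInvalid("keep")   // Spark 2.3+: bucket unseen categories instead of failing
}

// Index the label too, keeping the fitted model so predictions can be mapped back.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("label_idx")
  .fit(training)

// Assemble all indexed columns into the single feature vector MLlib expects.
val assembler = new VectorAssembler()
  .setInputCols(categoricalCols.map(c => s"${c}_idx"))
  .setOutputCol("features")

val rf = new RandomForestClassifier()
  .setLabelCol("label_idx")
  .setFeaturesCol("features")

// Convert numeric predictions back to the original string labels (doubt 3).
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

val stages: Array[PipelineStage] =
  indexers ++ Array(labelIndexer, assembler, rf, labelConverter)
val pipeline = new Pipeline().setStages(stages)

// Fitting once and reusing the same PipelineModel keeps the encoding
// consistent between training data and new test data (doubt 4).
val model = pipeline.fit(training)
val predictions = model.transform(test)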

Spark Streaming Iterative Algorithm

I want to create a Spark Streaming application coded in Scala.
I want my application to:
read from an HDFS text file line by line
analyze every line as a String and, if needed, modify it, and:
keep the state needed for the analysis in some kind of data structure (hashes, probably)
output everything to text files (any kind)
I've had no problems with the first step:
val lines = ssc.textFileStream("hdfs://localhost:9000/path/")
My analysis consists of searching for a match in the hashes for some fields of the string being analyzed, which is why I need to maintain state and do the process iteratively.
The data in those hashes is also extracted from the strings being analyzed.
What can I do for next steps?
Since you just have to read one HDFS text file line by line, you probably do not need Spark Streaming for that. You can just use Spark.
val lines = sparkContext.textFile("...")
Then you can use mapPartitions to do distributed processing of the whole partitioned file.
val processedLines = lines.mapPartitions { partitionAsIterator =>
  processPartitionAndReturnNewIterator(partitionAsIterator)
}
In that function, you can walk through the lines in the partition, store state stuff in a hashmap, etc. and finally return another iterator of output records corresponding to that partition.
Now if you want to share state across partitions, then you probably have to do some more aggregation, like groupByKey() or reduceByKey(), on the processedLines dataset.
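Putting the pieces together, a simplified sketch of that pattern (the paths and the line format are made up; your real matching logic would go where the HashMap is used):
import scala.collection.mutable

val lines = sparkContext.textFile("hdfs://localhost:9000/path/")

val processedLines = lines.mapPartitions { partitionAsIterator =>
  // Per-partition state, rebuilt for every partition.
  val seen = mutable.HashMap.empty[String, Int]

  partitionAsIterator.map { line =>
    val fields = line.split(",", -1)
    val key = fields(0)
    val count = seen.getOrElse(key, 0) + 1
    seen(key) = count
    s"$key,$count"   // emit the (possibly modified) line
  }
}

processedLines.saveAsTextFile("hdfs://localhost:9000/output/")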