pyspark writing accented character in a dataframe - encoding

I have a Spark job that reads a file, parses each line as JSON, and, as an example, reads just one of the JSON keys:
logs = sc.textFile(path_to_file)
# log['a_key'] contain UTF data (accented character)
logs = logs.map(lambda x: json.loads(x)).map(lambda x: x['a_key'])
df = sql_context.createDataFrame(logs, ["test_column"])
df.coalesce(1).write.format("com.databricks.spark.csv").options(header=True).save(destination_path)
When I look at the output CSV file, all the accented characters are replaced by strange characters.
How do I make PySpark write accented characters to the CSV file? I have tried using log['a_key'].encode('UTF-8'), but the result was the same.
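One possible cause, offered only as a guess since the question does not say how the file was inspected: the CSV may already be valid UTF-8, and the program used to view it may be decoding it with a single-byte encoding such as Latin-1. The sketch below uses only the Python standard library and a made-up sample string to show how that mismatch produces "strange characters":

```python
# Hypothetical sample value; any accented text behaves the same way.
original = "caf\u00e9"  # "café"

# The bytes a correct UTF-8 writer would put on disk.
utf8_bytes = original.encode("utf-8")

# What a viewer displays if it assumes Latin-1 instead of UTF-8.
misread = utf8_bytes.decode("latin-1")

# What a viewer displays if it assumes UTF-8.
correct = utf8_bytes.decode("utf-8")

print(repr(misread))
print(repr(correct))
```

If the garbled text in the output file looks like the first printed value (two odd characters per accented letter), the file itself may be fine and the viewer's encoding setting is the thing to check.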

Related

How to read a CSV file whose data contains double quotes and is comma separated, using a Spark DataFrame in Databricks

I'm trying to read a CSV file using a Spark DataFrame in Databricks. The CSV file contains double-quoted, comma-separated columns. I tried the code below but was not able to read the file correctly, even though I can see the file when I check the data lake.
The code, input, and output are as follows:
df = spark.read.format("com.databricks.spark.csv")\
.option("header","true")\
.option("quoteAll","true")\
.option("escape",'"')\
.csv("mnt/A/B/test1.csv")
The input file data (with header):
"A","B","C"
"123","dss","csc"
"124","sfs","dgs"
Output:
"A"|"B"|"C"|

Converting csv to parquet in spark gives error if csv column headers contain spaces

I have a CSV file which I am converting to Parquet files using the Databricks library in Scala. I am using the code below:
val spark = SparkSession.builder().master("local[*]").config("spark.sql.warehouse.dir", "local").getOrCreate()
var csvdf = spark.read.format("org.apache.spark.csv").option("header", true).csv(csvfile)
csvdf.write.parquet(csvfile + "parquet")
The code above works fine if there are no spaces in my column headers. But if a CSV file has spaces in its column headers, it fails with an error about invalid column headers. My CSV files are delimited by ",".
Also, I cannot remove the spaces from the column names. The column names have to stay as they are, even if they contain spaces, because they are supplied by the end user.
Any idea how to fix this?
Per @CodeHunter's request:
Sadly, the Parquet file format does not allow spaces in column names;
the error it will produce when you try is: contains invalid character(s) among " ,;{}()\n\t=".
ORC also does not allow spaces in column names :(
Many SQL engines don't support column names with spaces unless they are quoted, so you'll probably be best off converting your columns to your preferred style, such as foo_bar or fooBar.
I would rename the offending columns in the DataFrame, changing spaces to underscores, before saving. This could be done with select `foo bar` as foo_bar, or with .withColumnRenamed("foo bar", "foo_bar").
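Since the column names come from end users, the renaming can be automated instead of written by hand. The following is a minimal sketch in plain Python, not part of either answer: the helper names `sanitize_column_name` and `build_rename_map` and the sample headers are hypothetical, and the character set is taken from the Parquet error message quoted above.

```python
import re

# Characters listed in the Parquet error message: space , ; { } ( ) \n \t =
INVALID_CHARS = re.compile(r"[ ,;{}()\n\t=]")


def sanitize_column_name(name, replacement="_"):
    """Replace every character Parquet rejects with `replacement`."""
    return INVALID_CHARS.sub(replacement, name)


def build_rename_map(columns):
    """Map each original column name to its sanitized form."""
    return {col: sanitize_column_name(col) for col in columns}


# Hypothetical user-supplied headers.
headers = ["foo bar", "price (usd)", "id"]
rename_map = build_rename_map(headers)
print(rename_map)
```

Keeping the map around makes it possible to restore the user-facing names when the data is read back for display. Note that two different headers (for example "a b" and "a_b") can collapse to the same sanitized name, so a real job would need to check for collisions. In PySpark, the sanitized names could then be applied with something like `df.toDF(*[sanitize_column_name(c) for c in df.columns])`.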

Spark Streaming Context to read CSV

I'm having a tough time using StreamingContext to read a CSV and send each row to another method that does other processing. I tried splitting by newline but it splits after three columns (there are about 10 columns per row):
val lines = ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/tmpFolder/")
lines.map{row => {
val columnValues = row.split("\n")
(columnValues(0), "\n")
}}.print()
If I open the CSV in Excel, there are about 10 values per row. If I open the same file using Sublime or another text editor, there appears to be a newline after the first 3 values. I'm not sure if it's an encoding issue or just the way Sublime displays it. In any case, I'm trying to get the entire row in Spark, and I'm not sure if there's a way to do that.
ssc.textFileStream internally creates a file stream and splits it on the newline character. But your data contains text qualifiers (double quotes) around a field that itself contains a newline:
1996, Jeep, "Grand Cherokee, MUST SELL!
air", moon roof, loaded, 4799.00
Here some of the text is in double quotes, and the row spans multiple lines. Because the stream has already been cut at the newline, splitting the first line by "," gives only:
[1996, Jeep, "Grand Cherokee,MUST SELL!]
The remaining data points are lost, because the rest of the row ends up in a separate record. To avoid that, you can use sqlContext:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema","true")
.option("multiLine","true")
.option("quoteMode","ALL")
.load(path)
Or you can pre-process your CSV with the univocity parser to handle multi-line values, double quotes, and other special characters, put the resulting files into the directory, and start your ssc.textFileStream after that.
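To make the difference concrete, here is a small standard-library Python sketch (not Spark code) that contrasts splitting on newlines with a quote-aware CSV parser, using the sample row from this answer. `skipinitialspace=True` is an assumption needed because the sample has a space after each comma.

```python
import csv
import io

# The sample record from the answer: one logical row spread over two lines.
raw = '1996, Jeep, "Grand Cherokee, MUST SELL!\nair", moon roof, loaded, 4799.00\n'

# Naive approach: cut on newlines first, as a line-based reader does.
naive_rows = raw.splitlines()

# Quote-aware approach: the parser keeps reading until the quote is closed.
reader = csv.reader(io.StringIO(raw, newline=""), skipinitialspace=True)
parsed_rows = list(reader)

print(len(naive_rows), "physical lines")
print(len(parsed_rows), "logical row with", len(parsed_rows[0]), "fields")
print(parsed_rows[0])
```

The line-based split sees two records, while the quote-aware parser returns a single row of six fields with the newline preserved inside the third field. This is the behaviour a multi-line CSV option or a pre-processing step is meant to provide.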

How to handle multi line rows in spark?

I have a DataFrame that has some multi-line observations:
+--------------------+----------------+
| col1| col2|
+--------------------+----------------+
|something1 |somethingelse1 |
|something2 |somethingelse2 |
|something3 |somethingelse3 |
|something4 |somethingelse4 |
|multiline
row | somethings|
|something |somethingall |
What I want is to save this DataFrame in CSV (or txt) format, using the following:
df
.write
.format("csv")
.save("s3://../adf/")
But when I check the file, it separates the observations into multiple lines. What I want is for the 'multiline' observations to be on the same row in the txt/CSV file. I tried to save it as a text file:
df
.as[(String,String)]
.rdd
.saveAsTextFile("s3://../adf")
but the same output was observed.
I can imagine that one way is to replace \n with something else and apply the reverse transformation when loading the data back. But is there a way to save it in the desired form without transforming the data at all?
Assuming the multi-line data is properly quoted, you can parse multi-line CSV data using the univocity parser and the multiLine setting:
sparkSession.read
.option("parserLib", "univocity")
.option("multiLine", "true")
.csv(file)
Note that this requires reading the entire file onto a single executor, and may not work if your data is too large. The standard text file reader splits the file by lines before doing any other parsing, which prevents you from working with records that contain newlines unless there is a different record delimiter you can use. If there is not, you may need to implement a custom TextInputFormat to handle multi-line records.
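What "properly quoted" means can be illustrated outside Spark. The sketch below uses Python's standard `csv` module with two rows modeled on the DataFrame in the question. It shows that a multi-line value is written inside quotes, occupies more than one physical line in the file, and still comes back as a single record when read by a quote-aware parser.

```python
import csv
import io

# Two rows modeled on the question's DataFrame; the second has a newline in col1.
rows = [["something1", "somethingelse1"], ["multiline\nrow", "somethings"]]

# Write the rows as CSV text; the writer quotes the field containing "\n".
buffer = io.StringIO(newline="")
csv.writer(buffer, lineterminator="\n").writerows(rows)
text = buffer.getvalue()

# The file has more physical lines than records...
physical_lines = text.splitlines()

# ...but a quote-aware reader restores the original records unchanged.
restored = list(csv.reader(io.StringIO(text, newline="")))

print(text)
print(len(physical_lines), "physical lines,", len(restored), "records")
```

So a CSV file in which the multi-line value appears on several lines is not necessarily broken. Whether it is read back as one row depends on the reader honoring the quotes, which is what the multiLine setting is for.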
By default, Spark's saveAsTextFile starts a new row whenever it encounters \n. The same is true for CSV. When reading CSV you can specify the field delimiter with option("delimiter", "\t").
In my opinion the best way to read multi-line input is through the Hadoop API. You can specify your own record delimiter and process the data.
Something like this:
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.rdd.RDD
val conf = new Configuration
// Records are split on this string instead of on newlines
conf.set("textinputformat.record.delimiter", "<your delimiter>")
val data: RDD[(LongWritable, Text)] = spark.sparkContext.newAPIHadoopFile("<filepath>", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
Here, each Text value in data is one record, that is, the string between two occurrences of your delimiter.

Converting Tab separated file to csv file

I am new to Scala and Spark. I am trying to convert a tab-separated file into a CSV file so that I can then convert it into an RDD.
I actually tried to convert the tab-separated file to an RDD using sc.textFile. It runs, but the results of subsequent calls like .first() and .take(n) are not well structured and are hard to read, even after using foreach(println).
I tried converting the file to CSV using Excel, but because the data is very large, it does not load in the first place.
Is there any simple way to convert a tab-separated file to CSV so as to get well-structured results for the problem described above?
Here is a mini tutorial:
Let's say your TSV data is:
row11 \t row12 \t row13... \t row1n
row21 \t row22 \t row23... \t row2n
Read this file as an RDD of strings:
val readFile = sc.textFile("FILEPATH HERE")
Parse its contents by using the tab delimiter:
val parseRows = readFile.map(row => row.split("\t"))
Convert the row arrays into strings delimited by ",":
val outputCsvRdd = parseRows.map(row => row.mkString(","))
Write out the file, which will be a CSV:
outputCsvRdd.saveAsTextFile("OUTPUTPATH")
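One caveat with joining fields by "," is that any field which itself contains a comma will be split into extra columns when the CSV is read back. The sketch below is a plain-Python illustration of that point, not Spark code. The helper name `tsv_to_csv` and the sample data are hypothetical, and it assumes the TSV input does not use quoting of its own.

```python
import csv
import io


def tsv_to_csv(tsv_text):
    """Convert tab-separated text to CSV, quoting fields that contain commas."""
    reader = csv.reader(
        io.StringIO(tsv_text, newline=""),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # assumption: the TSV has no quoting of its own
    )
    out = io.StringIO(newline="")
    csv.writer(out, lineterminator="\n").writerows(reader)
    return out.getvalue()


# Hypothetical sample: the name field contains a comma.
sample = "id\tname\tnote\n1\tSmith, John\tok\n"

converted = tsv_to_csv(sample)

# The plain split-and-join approach, for comparison.
naive = "\n".join(",".join(line.split("\t")) for line in sample.splitlines())

print(converted)
print(naive)
```

In the converted text the name is wrapped in quotes and stays one field, while the naive join produces a second line with four comma-separated pieces instead of three. If the data never contains commas or quotes, the simple split-and-join shown in the tutorial is sufficient.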