I am new to Scala and Spark. I am trying to convert a tab-separated file into a CSV file so that I can then load it as an RDD.
I first tried reading the tab-separated file directly into an RDD using sc.textFile. That works, but the results of calls such as .first() and .take(n) are not well structured and are hard to read, even when printed with foreach(println).
I also tried converting the file to CSV in Excel, but the data is so large that the file does not even load.
Is there a simple way to convert a tab-separated file to CSV so that I get well-structured results?
Here is a mini tutorial:
Let's say your TSV data is:
row11 \t row12 \t row13... \t row1n
row21 \t row22 \t row23... \t row2n
Read this file as an RDD of strings:
val readFile = sc.textFile("FILEPATH HERE")
Parse its contents by using the tab delimiter:
val parseRows = readFile.map(row => row.split("\t"))
Convert the row arrays into strings delimited by ",":
val outputCsvRdd = parseRows.map(row => row.mkString(","))
Write out the result, which will be CSV:
outputCsvRdd.saveAsTextFile("OUTPUTPATH")
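One caveat with joining on "," is that a field which itself contains a comma ends up unquoted. As a rough sketch of the same conversion outside Spark, Python's standard csv module can stream the file row by row and quote such fields. The function name tsv_to_csv and the sample data are made up for illustration.

```python
import csv
import io


def tsv_to_csv(tsv_lines, out):
    """Stream tab-separated lines into `out` as comma-separated rows."""
    # QUOTE_NONE: treat quote characters in the TSV input as ordinary text.
    reader = csv.reader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    # The default writer quotes only fields that need it (e.g. contain a comma).
    writer = csv.writer(out, lineterminator="\n")
    count = 0
    for row in reader:
        writer.writerow(row)
        count += 1
    return count


# Hypothetical sample: the third field of the first row contains a comma.
sample = io.StringIO("row11\trow12\trow, 13\nrow21\trow22\trow23\n")
result = io.StringIO()
rows_written = tsv_to_csv(sample, result)
csv_text = result.getvalue()
```

For real files, the two StringIO objects would be replaced by files opened with newline="" so the csv module handles line endings itself.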
Related
I am trying to check for incomplete records and identify the bad records in Spark.
For example, here is a sample test.txt file in record format, with columns separated by \t:
L1C1 L1C2 L1C3 L1C4
L2C1 L2C2 L2C3
L3C1 L3C2 L3C3 L3C4
scala> sc.textFile("test.txt").filter(_.split("\t").length < 4).collect.foreach(println)
L2C1 L2C2 L2C3
The second line is printed as having fewer columns.
How should I parse the file without ignoring the empty column at the end of the second line?
Scala's String.split removes trailing empty substrings.
The behavior is the same as in Java. To keep all the substrings, pass a negative limit:
"L2C1\tL2C2\tL2C3\t".split("\t", -1)
I'm trying to read a CSV file into a Spark DataFrame in Databricks. The CSV file contains double-quoted, comma-separated columns. I tried the code below, but I am not able to read the CSV file, even though I can see the file when I check the data lake.
The input and output are as follows:
df = spark.read.format("com.databricks.spark.csv")\
.option("header","true")\
.option("quoteAll","true")\
.option("escape",'"')\
.csv("mnt/A/B/test1.csv")
The input file data, including the header:
"A","B","C"
"123","dss","csc"
"124","sfs","dgs"
Output:
"A"|"B"|"C"|
I have a CSV file which I am converting to Parquet files using the Databricks library in Scala. I am using the code below:
val spark = SparkSession.builder().master("local[*]").config("spark.sql.warehouse.dir", "local").getOrCreate()
var csvdf = spark.read.format("org.apache.spark.csv").option("header", true).csv(csvfile)
csvdf.write.parquet(csvfile + "parquet")
Now the above code works fine if I don't have spaces in my column headers. But if a CSV file has spaces in its column headers, it doesn't work and errors out stating invalid column headers. My CSV files are delimited by ",".
Also, I cannot change the spaces in the column names of the CSV. The column names have to stay as they are, even if they contain spaces, because they are given by the end user.
Any idea on how to fix this?
Per @CodeHunter's request:
Sadly, the Parquet file format does not allow spaces in column names;
the error that it'll spit out when you try is: contains invalid character(s) among " ,;{}()\n\t=".
ORC also does not allow spaces in column names :(
Most SQL engines don't support column names with spaces, so you'll probably be best off converting your columns to your preference of foo_bar or fooBar or something along those lines.
I would rename the offending columns in the dataframe, to change space to underscore, before saving. Could be with select "foo bar" as "foo_bar" or .withColumnRenamed("foo bar", "foo_bar")
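As a rough sketch of the renaming step, the snippet below builds an old-name to new-name mapping by replacing every character from the error message above with an underscore. The helper name parquet_safe_names and the sample column names are hypothetical; each pair in the result could then be fed to .withColumnRenamed.

```python
import re

# Characters listed in the Parquet error message quoted above.
INVALID_CHARS = re.compile(r"[ ,;{}()\n\t=]")


def parquet_safe_names(columns, replacement="_"):
    """Map each original column name to one without the invalid characters."""
    return {name: INVALID_CHARS.sub(replacement, name) for name in columns}


# Hypothetical user-supplied headers.
renames = parquet_safe_names(["foo bar", "price (usd)", "id"])
```

Keeping the mapping around means the original, user-facing names can be restored when the data is read back. Note that two different originals (say "foo bar" and "foo_bar") would collide after renaming, so a real job should check for duplicates.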
I'm having a tough time using StreamingContext to read a CSV and send each row to another method that does other processing. I tried splitting by newline, but it splits after three columns (there are about 10 columns per row):
val lines = ssc.textFileStream("file:///Users/userName/Documents/Notes/MoreNotes/tmpFolder/")
lines.map{row => {
val columnValues = row.split("\n")
(columnValues(0), "\n")
}}.print()
If I open the CSV in Excel, there are about 10 values per row. If I open the same file using Sublime or some other text editor, there appears to be a newline after the first 3 values. I'm not sure if it's an encoding thing or just the way Sublime displays it. In any case, I'm trying to get the entire row in Spark, and I'm not sure if there's a way to do that.
ssc.textFileStream internally creates a file stream and splits it on the newline character. But your data contains text qualifiers:
1996, Jeep, "Grand Cherokee, MUST SELL!
air", moon roof, loaded, 4799.00
Here some text is in double quotes and the row spans multiple lines. If you try to split the data by ",", it will be:
[1996, Jeep, "Grand Cherokee,MUST SELL!]
It will miss the other data points because you are splitting by comma. To avoid that, you can use sqlContext:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("multiLine", "true")
  .option("quoteMode", "ALL")
  .load(path)
Or you can pre-process your CSV using the Univocity parser to handle multi-line records, double quotes, and other special characters, put the resulting files into the directory, and start your ssc.textFileStream after that.
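To make the difference concrete, here is a small sketch that parses the sample record above with a quote-aware parser. It uses Python's standard csv module as a stand-in, not the Univocity parser or Spark itself. The skipinitialspace setting is an assumption needed because the sample has a space after each comma.

```python
import csv
import io

# The sample record from above: one logical row spread over two physical lines.
raw = '1996, Jeep, "Grand Cherokee, MUST SELL!\nair", moon roof, loaded, 4799.00\n'

# Splitting on newlines, as a line-based reader does, yields two broken pieces.
naive_lines = raw.splitlines()

# A quote-aware parser keeps the quoted field, newline included, in one record.
# newline="" leaves line endings untouched so the csv module can handle them.
reader = csv.reader(io.StringIO(raw, newline=""), skipinitialspace=True)
records = list(reader)
```

The line-based view produces two fragments, while the csv reader returns a single record with six fields, the third of which contains both the comma and the embedded newline.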
I have a Spark job that reads a file, parses it as JSON line by line, and reads just one of the JSON keys. For example:
logs = sc.textFile(path_to_file)
# log['a_key'] contains UTF-8 data (accented characters)
logs = logs.map(lambda x: json.loads(x)).map(lambda x: x['a_key'])
df = sql_context.createDataFrame(logs, ["test_column"])
df.coalesce(1).write.format("com.databricks.spark.csv").options(header=True).save(destination_path)
When I look at the output CSV file, all the accented characters are replaced by strange characters.
How do I make PySpark write accented characters to the CSV file? I have tried using log['a_key'].encode('UTF-8'), but the result was the same.