This is my first time trying to convert a txt file to Parquet format, so please bear with me.
I have a txt file which originally looks like this:
id|roads|weights
a01|1026|1172|1
a02|DT:SR:0|2|NE|DT:SR:1|2|NE|1
b01|DT:SR:1|7|SW|1
And I'd like to convert it to Parquet format like this:
+---+-------------------------+-------+
|id |roads |weights|
+---+-------------------------+-------+
|a01|1026|1172 |1 |
|a02|DT:SR:0|2|NE|DT:SR:1|2|NE|1 |
|b01|DT:SR:1|7|SW |1 |
So far, I have uploaded my txt file to HDFS and tried to use Spark to convert it to Parquet format with:
val textfile = spark.read.text("hdfs:some/path/file.txt")
textfile.write.parquet("some.parquet")
val parquetfile = spark.read.parquet("hdfs:some/path/some.parquet")
But my column names are now treated as a data row, and everything has been put together into a single column called "value".
Any help would be appreciated!
read.text loads the text file and returns a DataFrame with a single column named "value". You can make use of read.csv to read the delimited file instead. The following piece of code should work for you:
val textFile = spark.read.option("delimiter", "|").option("header", true).csv("hdfs:some/path/file.txt")
textFile.write.parquet(parquet_file_path)
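If you want to sanity-check the result, here is a minimal sketch of reading the Parquet output back (the path below is just a placeholder for whatever you passed as parquet_file_path):
// Placeholder path: substitute whatever you used as parquet_file_path above.
val parquetFile = spark.read.parquet("hdfs:some/path/some.parquet")
parquetFile.printSchema()   // inspect the resulting column names and types
parquetFile.show(false)     // show(false) prints rows without truncating wide columns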
I am trying to read a .dat file in AWS S3 using the Spark Scala shell, and create a new file with just the first record of the .dat file.
Let's say my file path to the .dat file is "s3a://filepath.dat"
I assume my logic should look something like this, but I wasn't able to figure out how to get the first record.
val file = sc.textFile("s3a://filepath.dat")
val onerecord = file.getFirstRecord()
onerecord.saveAsTextFile("s3a://newfilepath.dat")
I've been trying to follow these solutions
How to skip first and last line from a dat file and make it to dataframe using scala in databricks
https://stackoverflow.com/questions/51809228/spark-scalahow-to-read-data-from-dat-file-transform-it-and-finally-store-in-h#:~:text=dat%20file%20in%20Spark%20RDD,be%20delimited%20by%20%22%20%25%24%20%22%20signs
It depends on how records are separated in your .dat file, but in general, you could do something like this (assuming the delimiter is '|'):
val raw = session.sqlContext.read.format("csv").option("delimiter","|").load("data/input.txt")
val firstItem = raw.first()
It may look odd to read a .dat file with the csv reader, but it will solve your problem.
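If you also need to write just that first record out as a new file, here is a minimal sketch using limit(1) on the same DataFrame (the output path is a placeholder, and Spark will write a directory of part files there):
// Keep only the first row; coalesce(1) makes Spark write a single part file.
raw.limit(1)
  .coalesce(1)
  .write
  .option("delimiter", "|")
  .csv("s3a://newfilepath")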
I have a csv file with many rows, each having 101 columns, with the 101st column being a char while the rest of the columns are doubles. E.g.
1,-2.2,3 ... 98,99,100,N
I implemented a filter to operate on the numbers and wrote the result to a different file, but now I need to map the last column of my old csv onto my new csv. How should I approach this?
I did the original loading using loadcsv, but that didn't seem to load the character, so how should I proceed?
In MATLAB there are many ways to do it; this answer expands on the use of tables:
Input
test.csv
1,2,5,A
2,3,5,G
5,6,8,C
8,9,7,T
test2.csv
1,2,1.2
2,3,8
5,6,56
8,9,3
Script
t1 = readtable('test.csv'); % Read the csv file
lastcol = t1{:,end}; % Extract the last column
t2 = readtable('test2.csv'); % Read the second csv file
t2.addedvar = lastcol; % Add the last column of the first file to the table from the second file
writetable(t2,'test3.csv','Delimiter',',','WriteVariableNames',false) % write the new table in a file
Note that test3.csv is a new file but you could also overwrite test2.csv
'WriteVariableNames',false allows you to write the csv file without the headers of the table.
Output
test3.csv
1,2,1.2,A
2,3,8,G
5,6,56,C
8,9,3,T
I'm trying to read a csv file using a Spark dataframe in Databricks. The csv file contains double-quoted, comma-separated columns. I tried the code below but was not able to read the csv file, although if I check the file in the data lake I can see it.
The input and output are as follows.
df = spark.read.format("com.databricks.spark.csv")\
.option("header","true")\
.option("quoteAll","true")\
.option("escape",'"')\
.csv("mnt/A/B/test1.csv")
The input file data (header and rows):
"A","B","C"
"123","dss","csc"
"124","sfs","dgs"
Output:
"A"|"B"|"C"|
I have multiple files in an S3 bucket and have to unzip these files and merge them into a single CSV file with a single header. All the files contain the same header.
The data files look like the ones below.
Storage system: S3 bucket.
part-0000-XXXX.csv.gz
part_0001-YYYY.csv.gz
part-0002-ZZZZ.csv.gz
.
.
.
.
part-0010_KKKK.csv.gz.
I want one single CSV file from all the files shown above. Please help me with how to unzip and merge all the files.
After unzipping and merging all the files into a single CSV, I can then use this file for data comparison with previous files.
I am using Spark 2.3.0 and Scala 2.11.
Many thanks.
The below-mentioned code seems to be working fine.
scala> val rdd = sc.textFile("/root/data")
rdd: org.apache.spark.rdd.RDD[String] = /root/data MapPartitionsRDD[1] at textFile at <console>:24
scala> rdd.coalesce(1).saveAsTextFile("/root/combinedCsv", classOf[org.apache.hadoop.io.compress.GzipCodec])
You can see that the input data is in the /root/data directory and the combined csv, in gzip format, is stored in the /root/combinedCsv directory.
Update
If you want to store the data in plain (uncompressed) csv format, strip off the GzipCodec part:
scala> rdd.coalesce(1).saveAsTextFile("/root/combinedCsv")
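One thing to watch for with the plain textFile approach: since every part file contains the same header, the combined output will repeat that header once per input file. A minimal sketch of keeping just one copy, assuming you know the header line (the column names and output path here are illustrative):
// Hypothetical header line; replace with the actual column names in your files.
val header = "col1,col2,col3"
val lines = sc.textFile("/root/data")
// Drop every copy of the header, then put a single copy back at the front.
val body = lines.filter(line => line != header)
val combined = sc.parallelize(Seq(header)).union(body)
combined.coalesce(1).saveAsTextFile("/root/combinedCsvSingleHeader")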
You can use the below code; you can also read directly from the gz files without extracting them:
val filePath = "/home/harneet/<Dir where all gz/csv files are present>"
var cdnImpSchema = StructType(Array(
StructField("idate", TimestampType, true),
StructField("time", StringType, true),
StructField("anyOtherColumn", StringType, true)
))
var cdnImpDF = spark.read.format("csv"). // Use "csv" regardless of TSV or CSV.
option("delimiter", ","). // Set delimiter to tab or comma or whatever you want.
schema(cdnImpSchema). // Schema that was built above.
load(filePath)
cdnImpDF.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save("mydata.csv")
repartition(1) -> produces a single part file as output (written inside the mydata.csv output directory).
I have a dataframe which has some multi-line observations:
+--------------------+----------------+
| col1| col2|
+--------------------+----------------+
|something1 |somethingelse1 |
|something2 |somethingelse2 |
|something3 |somethingelse3 |
|something4 |somethingelse4 |
|multiline
row | somethings|
|something |somethingall |
What I want is to save this dataframe in csv (or txt) format, using the following:
df
.write
.format("csv")
.save("s3://../adf/")
But when I check the file, it separates the observations into multiple lines. What I want is for the lines that have 'multiline' observations to be on the same row in the txt/csv file. I tried to save it as a txt file:
df
.as[(String,String)]
.rdd
.saveAsTextFile("s3://../adf")
but the same output was observed.
I can imagine that one way is to replace \n with something else and then, when loading back, apply the reverse transformation. But is there a way to save it in the desired form without doing any kind of transformation to the data?
Assuming the multi-line data is properly quoted, you can parse multi-line csv data using the univocity parser and the multiLine setting:
sparkSession.read
.option("parserLib", "univocity")
.option("multiLine", "true")
.csv(file)
Note that this requires reading the entire file onto a single executor, and may not work if your data is too large. The standard text file reading will split the file by lines before doing any other parsing, which will prevent you from working with data records containing newlines unless there is a different record delimiter you can use. If not, you may need to implement a custom TextInputFormat to handle multiline records.
By default Spark's saveAsTextFile starts a new row whenever it encounters \n, and the same applies to csv. When reading csv you can specify the field delimiter with option("delimiter", "\t").
In my opinion the best way to read multiline input is through the Hadoop API. You can specify your own record delimiter and process the data.
Something like this:
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.rdd.RDD

val conf = new Configuration
conf.set("textinputformat.record.delimiter", "<your delimiter>")

val data: RDD[(LongWritable, Text)] = spark.sparkContext.newAPIHadoopFile(
  "<filepath>", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
Here, the Text in data is your delimiter-separated record as a string.
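From there, a minimal sketch of turning the (LongWritable, Text) pairs into plain strings and splitting each record on a field delimiter (the '|' field delimiter and the column names here are assumptions for illustration):
// Convert each record to a plain String; the key (byte offset) is usually not needed.
val records: RDD[String] = data.map { case (_, text) => text.toString }

// Example only: split on an assumed '|' field delimiter and build a DataFrame
// with hypothetical column names.
import spark.implicits._
val df = records
  .map(_.split("\\|"))
  .map(fields => (fields(0), fields(1)))
  .toDF("col1", "col2")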