This is my first time trying to convert a txt file to Parquet format, so please bear with me.
I have a txt file which originally looks like this:
id|roads|weights
a01|1026|1172|1
a02|DT:SR:0|2|NE|DT:SR:1|2|NE|1
b01|DT:SR:1|7|SW|1
And I'd like to convert it to Parquet format like this:
+---+-------------------------+-------+
|id |roads |weights|
+---+-------------------------+-------+
|a01|1026|1172 |1 |
|a02|DT:SR:0|2|NE|DT:SR:1|2|NE|1 |
|b01|DT:SR:1|7|SW |1 |
So far, I have uploaded my txt file to HDFS and tried to use Spark to convert it to Parquet format with:
val textfile = spark.read.text("hdfs:some/path/file.txt")
textfile.write.parquet("some.parquet")
val parquetfile = spark.read.parquet("hdfs:some/path/some.parquet")
But my column names are now treated as a data row, and everything has been put together into a single column called "value".
Any help would be appreciated!
read.text loads the text file and returns a DataFrame with a single column named "value". You can make use of read.csv to read the delimited file instead. The following piece of code should work for you:
val textFile = spark.read.option("delimiter", "|").option("header", true).csv("hdfs:some/path/file.txt")
textFile.write.parquet(parquet_file_path)
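If you want to sanity-check the result, here is a minimal sketch of reading the Parquet output back (the path below is just a placeholder for whatever you passed as parquet_file_path):
// Placeholder path: substitute whatever you used as parquet_file_path above.
val parquetFile = spark.read.parquet("hdfs:some/path/some.parquet")
parquetFile.printSchema()   // inspect the resulting column names and types
parquetFile.show(false)     // show(false) prints rows without truncating wide columns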
I am trying to read a .dat file in AWS S3 using the Spark Scala shell, and create a new file with just the first record of the .dat file.
Let's say my file path to the .dat file is "s3a://filepath.dat"
I assume my logic should look something like this, but I wasn't able to figure out how to get the first record.
val file = sc.textFile("s3a://filepath.dat")
val onerecord = file.getFirstRecord()
onerecord.saveAsTextFile("s3a://newfilepath.dat")
I've been trying to follow these solutions
How to skip first and last line from a dat file and make it to dataframe using scala in databricks
https://stackoverflow.com/questions/51809228/spark-scalahow-to-read-data-from-dat-file-transform-it-and-finally-store-in-h#:~:text=dat%20file%20in%20Spark%20RDD,be%20delimited%20by%20%22%20%25%24%20%22%20signs
It depends on how records are separated in your .dat file, but in general, you could do something like this (assuming the delimiter is '|'):
val raw = session.sqlContext.read.format("csv").option("delimiter","|").load("data/input.txt")
val firstItem = raw.first()
It may look odd to read a .dat file with the csv reader, but it will solve your problem.
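If you also need to write just that first record out as a new file, here is a minimal sketch using limit(1) on the same DataFrame (the output path is a placeholder, and Spark will write a directory of part files there):
// Keep only the first row; coalesce(1) makes Spark write a single part file.
raw.limit(1)
  .coalesce(1)
  .write
  .option("delimiter", "|")
  .csv("s3a://newfilepath")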
I have a csv file with many rows, each having 101 columns, with the 101st column being a char while the rest of the columns are doubles. E.g.
1,-2.2,3 ... 98,99,100,N
I implemented a filter to operate on the numbers and wrote the result to a different file, but now I need to map the last column of my old csv onto my new csv. How should I approach this?
I did the original loading using loadcsv, but that didn't seem to load the character, so how should I proceed?
In MATLAB there are many ways to do it; this answer expands on the use of tables:
Input
test.csv
1,2,5,A
2,3,5,G
5,6,8,C
8,9,7,T
test2.csv
1,2,1.2
2,3,8
5,6,56
8,9,3
Script
t1 = readtable('test.csv'); % Read the csv file
lastcol = t1{:,end}; % Extract the last column
t2 = readtable('test2.csv'); % Read the second csv file
t2.addedvar = lastcol; % Add the last column of the first file to the table from the second file
writetable(t2,'test3.csv','Delimiter',',','WriteVariableNames',false) % write the new table in a file
Note that test3.csv is a new file but you could also overwrite test2.csv
'WriteVariableNames',false allows you to write the csv file without the headers of the table.
Output
test3.csv
1,2,1.2,A
2,3,8,G
5,6,56,C
8,9,3,T
I'm trying to read a csv file using a Spark dataframe in Databricks. The csv file contains double-quoted, comma-separated columns. I tried the code below but was not able to read the csv file, although if I check the file in the data lake I can see it.
The input and output are as follows.
df = spark.read.format("com.databricks.spark.csv")\
.option("header","true")\
.option("quoteAll","true")\
.option("escape",'"')\
.csv("mnt/A/B/test1.csv")
The input file data (header and rows):
"A","B","C"
"123","dss","csc"
"124","sfs","dgs"
Output:
"A"|"B"|"C"|
I have multiple files in an S3 bucket and have to unzip these files and merge them into a single CSV file with a single header. All the files contain the same header.
The data files look like the ones below.
Storage system: S3 bucket.
part-0000-XXXX.csv.gz
part_0001-YYYY.csv.gz
part-0002-ZZZZ.csv.gz
.
.
.
.
part-0010_KKKK.csv.gz.
I want one single CSV file from all the files shown above. Please help me with how to unzip and merge all the files.
After unzipping and merging all the files into a single CSV, I can then use this file for data comparison with previous files.
I am using Spark 2.3.0 and Scala 2.11.
Many thanks.
The below-mentioned code seems to be working fine.
scala> val rdd = sc.textFile("/root/data")
rdd: org.apache.spark.rdd.RDD[String] = /root/data MapPartitionsRDD[1] at textFile at <console>:24
scala> rdd.coalesce(1).saveAsTextFile("/root/combinedCsv", classOf[org.apache.hadoop.io.compress.GzipCodec])
You can see that the input data is in the /root/data directory and the combined csv, in gzip format, is stored in the /root/combinedCsv directory.
Update
If you want to store the data in plain (uncompressed) csv format, strip off the GzipCodec part:
scala> rdd.coalesce(1).saveAsTextFile("/root/combinedCsv")
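One thing to watch for with the plain textFile approach: since every part file contains the same header, the combined output will repeat that header once per input file. A minimal sketch of keeping just one copy, assuming you know the header line (the column names and output path here are illustrative):
// Hypothetical header line; replace with the actual column names in your files.
val header = "col1,col2,col3"
val lines = sc.textFile("/root/data")
// Drop every copy of the header, then put a single copy back at the front.
val body = lines.filter(line => line != header)
val combined = sc.parallelize(Seq(header)).union(body)
combined.coalesce(1).saveAsTextFile("/root/combinedCsvSingleHeader")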
You can use the below code; you can also read directly from the gz files without extracting them:
val filePath = "/home/harneet/<Dir where all gz/csv files are present>"
var cdnImpSchema = StructType(Array(
StructField("idate", TimestampType, true),
StructField("time", StringType, true),
StructField("anyOtherColumn", StringType, true)
))
var cdnImpDF = spark.read.format("csv"). // Use "csv" regardless of TSV or CSV.
option("delimiter", ","). // Set delimiter to tab or comma or whatever you want.
schema(cdnImpSchema). // Schema that was built above.
load(filePath)
cdnImpDF.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save("mydata.csv")
repartition(1) -> produces a single part file as output (written inside the mydata.csv output directory).
I have a dataframe which has some multi-line observations:
+--------------------+----------------+
| col1| col2|
+--------------------+----------------+
|something1 |somethingelse1 |
|something2 |somethingelse2 |
|something3 |somethingelse3 |
|something4 |somethingelse4 |
|multiline
row | somethings|
|something |somethingall |
What I want is to save this dataframe in csv (or txt) format, using the following:
df
.write
.format("csv")
.save("s3://../adf/")
But when I check the file, it separates the observations into multiple lines. What I want is for the lines that have 'multiline' observations to be on the same row in the txt/csv file. I tried to save it as a txt file:
df
.as[(String,String)]
.rdd
.saveAsTextFile("s3://../adf")
but the same output was observed.
I can imagine that one way is to replace \n with something else and then, when loading back, apply the reverse transformation. But is there a way to save it in the desired form without doing any kind of transformation to the data?
Assuming the multi-line data is properly quoted, you can parse multi-line csv data using the univocity parser and the multiLine setting:
sparkSession.read
.option("parserLib", "univocity")
.option("multiLine", "true")
.csv(file)
Note that this requires reading the entire file onto a single executor, and may not work if your data is too large. The standard text file reading will split the file by lines before doing any other parsing, which will prevent you from working with data records containing newlines unless there is a different record delimiter you can use. If not, you may need to implement a custom TextInputFormat to handle multiline records.
By default Spark's saveAsTextFile starts a new row whenever it encounters \n, and the same applies to csv. When reading csv you can specify the field delimiter with option("delimiter", "\t").
In my opinion the best way to read multiline input is through the Hadoop API. You can specify your own record delimiter and process the data.
Something like this:
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.rdd.RDD

val conf = new Configuration
conf.set("textinputformat.record.delimiter", "<your delimiter>")

val data: RDD[(LongWritable, Text)] = spark.sparkContext.newAPIHadoopFile(
  "<filepath>", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
Here, the Text in data is your delimiter-separated record as a string.
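From there, a minimal sketch of turning the (LongWritable, Text) pairs into plain strings and splitting each record on a field delimiter (the '|' field delimiter and the column names here are assumptions for illustration):
// Convert each record to a plain String; the key (byte offset) is usually not needed.
val records: RDD[String] = data.map { case (_, text) => text.toString }

// Example only: split on an assumed '|' field delimiter and build a DataFrame
// with hypothetical column names.
import spark.implicits._
val df = records
  .map(_.split("\\|"))
  .map(fields => (fields(0), fields(1)))
  .toDF("col1", "col2")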