JSONL to CSV in Apache Beam and Java - apache-beam

I have a JSONL file which should be converted into CSV file(s) with headers.
Input JSONL file:
{"city":"Mumbai", "pincode":"2012A"}
{"city":"Delhi", "pincode":"2012N"}
Desired CSV file
["city", "pincode"]
["Mumbai","2012A"]
["Delhi", "2012N"]

Related

Reading .parquet file from spark throwing exceptions when empty

While reading a .parquet file as below:
src = spark.read.parquet(filePath)
When the file is empty, it throws the error below:
filename.parquet is not a Parquet file (too small length: 0)
How to check if the file is empty before reading?
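One common guard is to check the path with the Hadoop FileSystem API before calling read.parquet. A minimal sketch in Java, assuming the caller passes in the session's Hadoop configuration (the class and method names here are just placeholders):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParquetGuard {
  // True when the path exists and holds at least one byte of data.
  // getContentSummary works both for a single file and for a directory of part files.
  public static boolean hasData(String filePath, Configuration conf) throws IOException {
    Path path = new Path(filePath);
    FileSystem fs = path.getFileSystem(conf);
    return fs.exists(path) && fs.getContentSummary(path).getLength() > 0;
  }
}

Calling this with the configuration from sparkContext.hadoopConfiguration and skipping the read (or substituting an empty DataFrame) when it returns false avoids the "too small length: 0" failure.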

How to read utf-8 encoding file in Spark Scala

I am trying to read a UTF-8 encoded file into Spark Scala. I am doing this:
val nodes = sparkContext.textFile("nodes.csv")
where the given CSV file is in UTF-8, but Spark converts non-English characters to ?. How do I get it to read the actual values? I tried it in PySpark and it works fine, because PySpark's textFile() function has an encoding option and supports UTF-8 by default (it seems).
I am sure the file is in UTF-8 encoding. I did this to confirm:
➜ workspace git:(f/playground) ✗ file -I nodes.csv
nodes.csv: text/plain; charset=utf-8
Using this post, we can read the file first and then feed it to the SparkContext:
import java.nio.charset.CodingErrorAction
import scala.io.{Codec, Source}

val decoder = Codec.UTF8.decoder.onMalformedInput(CodingErrorAction.IGNORE)
val rdd = sc.parallelize(Source.fromFile(filename)(decoder).getLines().toList)

Facing issue on adding Parallelism feature in an Avroconvertor application

I have an application which takes zip files and converts the text files inside them into Avro files.
It executes the process serially, in the following way:
1) Pick a zip file and unzip it
2) Take each text file under that zip file and read its content
3) Take the avsc (schema) files from a different location
4) Merge the text file content with the respective schema, thereby producing an Avro file
But this process is done serially (one file at a time).
Now I want to execute this process in parallel. I have all the zip files under a folder:
folder/
A.zip
B.zip
C.zip
1) Under each zip file there are text files which consist only of data (without schema/headers).
My text file looks like this:
ABC 1234
XYZ 2345
EFG 3456
PQR 4567
2) Secondly, I have avsc files which hold the schema for these text files.
My avsc file looks like:
{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "Name", "type": "string"},
    {"name": "Employee_Id", "type": "int"}
  ]
}
As an experiment I used
SparkContext.parallelize(Folder having all the zip files).map {each file => //code of avro conversion}
but in the Avro conversion part (which is inside SparkContext.parallelize) I have used Spark's SparkContext.newAPIHadoopFile, which also returns an RDD.
So when I run the application with these changes, I get a Task not serializable issue.
I suspect this issue has two causes:
1) I have used the SparkContext inside SparkContext.parallelize
2) I have created an RDD inside an RDD.
org.apache.spark.SparkException: Task not serializable
I still need the parallelism, but I am not sure whether there is an alternative approach to achieve it for this use case, or how to resolve this Task not serializable issue.
I am using Spark 1.6 and Scala version 2.10.5.
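One way around the Task not serializable error is to keep every use of the SparkContext on the driver and ship only plain I/O work to the executors: parallelize the list of zip paths, and inside each task use java.util.zip plus the Avro library directly instead of newAPIHadoopFile. A rough Java sketch under those assumptions (the paths, the schema location, and the whitespace-delimited record layout are placeholders taken from the description above):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ZipToAvroDriver {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("zip-to-avro"));

    // The list of zip files is built on the driver with plain Java (no nested RDD).
    List<String> zipPaths = Arrays.asList("folder/A.zip", "folder/B.zip", "folder/C.zip");

    // One task per zip file. The closure below touches no SparkContext and no RDD,
    // so nothing non-serializable is captured.
    sc.parallelize(zipPaths, zipPaths.size()).foreach(zipPath -> {
      Schema schema = new Schema.Parser().parse(new File("/schemas/employee.avsc")); // hypothetical path
      try (ZipInputStream zin = new ZipInputStream(new FileInputStream(zipPath))) {
        ZipEntry entry;
        while ((entry = zin.getNextEntry()) != null) {
          if (entry.isDirectory()) continue;
          // Output location is an assumption: one .avro file next to the zip per text file.
          File out = new File(zipPath + "-" + entry.getName() + ".avro");
          try (DataFileWriter<GenericRecord> writer =
                   new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, out);
            BufferedReader reader = new BufferedReader(new InputStreamReader(zin, "UTF-8"));
            String line;
            while ((line = reader.readLine()) != null) {
              String[] parts = line.trim().split("\\s+");
              GenericRecord record = new GenericData.Record(schema);
              record.put("Name", parts[0]);                        // field names follow the schema sketch above
              record.put("Employee_Id", Integer.parseInt(parts[1]));
              writer.append(record);
            }
          }
        }
      }
    });

    sc.stop();
  }
}

The sketch assumes the zip files and the output location are on storage visible to every executor (or that the job runs locally); on HDFS the same structure works with the Hadoop FileSystem API in place of FileInputStream and File. The important part is that only the driver ever touches the SparkContext.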

How do I update the source file name in Talend

I have the source file on HDFS and I want to write the output file with a new column containing the name of the source file for each row. My Talend job looks like:
tHDFSGet --> tInputFilePositional --> tmap --> tfileoutputfile
Please help me get the file name into the new column for each row.
I used tHDFSList to get the filename and used:
StringHandling.RIGHT(StringHandling.LEFT(((String)globalMap.get("tHDFSList_2_CURRENT_FILEPATH")),StringHandling.LEN(((String)globalMap.get("tHDFSList_2_CURRENT_FILEPATH")))+6),7)
This trimmed the filepath to just the filename.
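For reference, the same trimming can also be done without hard-coded lengths by cutting at the last path separator. A plain-Java expression for the tMap column, assuming the same tHDFSList_2_CURRENT_FILEPATH global variable and '/' as the path separator (just a sketch):

// everything after the last '/' in the stored file path
((String) globalMap.get("tHDFSList_2_CURRENT_FILEPATH"))
    .substring(((String) globalMap.get("tHDFSList_2_CURRENT_FILEPATH")).lastIndexOf('/') + 1)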

Reading Avro container files in Spark

I am working on a scenario where I need to read Avro container files from HDFS and do analysis using Spark.
Input files directory: hdfs:///user/learner/20151223/*.lzo
Note: the input Avro files are lzo compressed.
import com.databricks.spark.avro._
val df = sqlContext.read.avro("/user/learner/20151223/*.lzo")
When I run the above command, it throws an error:
java.io.FileNotFoundException: No avro files present at file:/user/learner/20151223/*.lzo
at com.databricks.spark.avro.AvroRelation$$anonfun$11.apply(AvroRelation.scala:225)
at com.databricks.spark.avro.AvroRelation$$anonfun$11.apply(AvroRelation.scala:225)
at scala.Option.getOrElse(Option.scala:120)
at com.databricks.spark.avro.AvroRelation.newReader(AvroRelation.scala:225)
This makes sense, because the method read.avro() expects files with a .avro extension as input. So I extracted and renamed the input .lzo file to .avro, and I am able to read the data in the Avro file properly.
Is there any way to read lzo-compressed Avro files in Spark?
Update: a solution that worked, but...
I have found a way to solve this issue. I created a shell wrapper in which I decompress the .lzo files into .avro format in the following way:
hadoop fs -text <file_path>*.lzo | hadoop fs -put - <file_path>.avro
I succeeded in decompressing the lzo files, but the problem is that I have at least 5000 files in compressed format. Uncompressing and converting them one by one takes more than an hour to run this job.
Is there any way to do this decompression in bulk?
Thanks again!
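One way to avoid converting the 5000 files one at a time is to drive the same decompression from a small program that processes many files concurrently. A rough Java sketch using the Hadoop FileSystem API and a thread pool; it assumes hadoop-lzo is on the classpath with LzopCodec registered in io.compression.codecs (otherwise getCodec() returns null), and the glob is just the example path from this question:

import java.io.InputStream;
import java.io.OutputStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class BulkLzoDecompress {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    CompressionCodecFactory codecs = new CompressionCodecFactory(conf);

    // Same input location as in the question; adjust the glob as needed.
    FileStatus[] files = fs.globStatus(new Path("/user/learner/20151223/*.lzo"));

    // Decompress many files at the same time instead of one by one.
    ExecutorService pool = Executors.newFixedThreadPool(16);
    for (FileStatus status : files) {
      pool.submit(() -> {
        Path in = status.getPath();
        CompressionCodec codec = codecs.getCodec(in);  // resolved from the .lzo extension
        if (codec == null) {
          return null;                                 // LzopCodec not configured
        }
        // foo.lzo -> foo.avro, mirroring the shell wrapper above
        Path out = new Path(in.getParent(), in.getName().replaceAll("\\.lzo$", ".avro"));
        try (InputStream is = codec.createInputStream(fs.open(in));
             OutputStream os = fs.create(out, true)) {
          IOUtils.copyBytes(is, os, conf);
        }
        return null;
      });
    }
    pool.shutdown();
    pool.awaitTermination(2, TimeUnit.HOURS);
  }
}

The per-file work is completely independent, so the same idea also works by fanning the existing shell wrapper out with something like xargs -P instead of writing a program.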