How to read a UTF-8 encoded file in Spark Scala

I am trying to read a UTF-8 encoded file into Spark with Scala. I am doing this:
val nodes = sparkContext.textFile("nodes.csv")
where the given CSV file is in UTF-8, but Spark converts the non-English characters to ?. How do I get it to read the actual values? I tried the same thing in PySpark and it works fine, because PySpark's textFile() has an encoding option and supports UTF-8 by default (it seems).
I am sure the file is in UTF-8 encoding. I ran this to confirm:
➜ workspace git:(f/playground) ✗ file -I nodes.csv
nodes.csv: text/plain; charset=utf-8

Following another post, one workaround is to read and decode the file first, then feed it to the SparkContext:
import java.nio.charset.CodingErrorAction
import scala.io.{Codec, Source}

val decoder = Codec.UTF8.decoder.onMalformedInput(CodingErrorAction.IGNORE)
val rdd = sc.parallelize(Source.fromFile(filename)(decoder).getLines().toList)
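For what it's worth, that workaround reads the whole file on the driver, so it only suits local files that fit in memory. A sketch of an alternative that stays distributed is to go through Hadoop's TextInputFormat and decode each record's bytes as UTF-8 explicitly (path as in the question; whether this helps depends on where the mangling actually happens, since Hadoop's Text type is itself UTF-8 and the culprit is often the console encoding when printing):

import java.nio.charset.StandardCharsets
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Read through Hadoop's TextInputFormat and decode each line's bytes as UTF-8
// ourselves instead of relying on the platform default charset.
val nodes = sparkContext
  .hadoopFile("nodes.csv", classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
  .map { case (_, line) => new String(line.getBytes, 0, line.getLength, StandardCharsets.UTF_8) }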

Related

How to read a .pkl file in PySpark

I have a dictionary saved in .pkl format, written with the following code in Python 3.x:
import pickle as cpick
OutputDirectory = "My data file path"
with open("".join([OutputDirectory, 'data.pkl']), mode='wb') as fp:
    cpick.dump(data_dict, fp, protocol=cpick.HIGHEST_PROTOCOL)
I want to read this file in PySpark. Can you suggest how to do that? Currently I'm using Spark 2.0 and Python 2.7.13.

Reading a compressed file in Spark with Scala

I am trying to read the contents of a .gz file into a DataFrame/RDD in Spark/Scala using the following code:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
val sc = new SparkContext(conf)
val data = sc.wholeTextFiles("path to gz file")
data.collect().foreach(println)
The .gz file is 28 MB, and when I do the spark-submit using this command
spark-submit --class sample --master local[*] target\spark.jar
it gives me a Java heap space error in the console.
Is this the best way to read a .gz file, and if so, how can I solve the Java heap space error?
Thanks
Disclaimer: as written, that code simply reads a small compressed text file with Spark, collects every line into an array on the driver, and prints the entire file to the console. The number of ways (and reasons) to do this outside of Spark far outnumber the reasons to do it in Spark.
1) Use SparkSession instead of SparkContext if you can swing it. sparkSession.read.text() is the command to use (it automatically handles a few compression formats); see the sketch after this list.
2) Or at least use sc.textFile() instead of wholeTextFiles(), which loads each entire file as a single record.
3) You're calling .collect on that data, which brings the entire file back to the driver (in this case, since you're running locally, it isn't even network bound). Add the --driver-memory option to spark-submit to increase memory if you MUST do the collect.
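To make 1) and 3) concrete, here is a minimal sketch (app name is a placeholder, the path is the one from the question) that reads the compressed file through the DataFrame API and inspects a bounded sample instead of collecting everything:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("read-gz-sample") // placeholder name
  .master("local[*]")
  .getOrCreate()

// spark.read.text() goes through Hadoop's compression codecs, so the .gz file
// is decompressed transparently; each line becomes a row in a "value" column.
val data = spark.read.text("path to gz file")

// Inspect a bounded sample rather than collect()-ing the whole file to the driver.
data.show(20, truncate = false)
println(s"lines: ${data.count()}")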

Spark Read/Write (csv) ISO-8859-1

I need to read an ISO-8859-1 encoded file, do some operations, then save it (with ISO-8859-1 encoding). To test this, I'm loosely mimicking a test case I found in the Databricks CSV package:
https://github.com/databricks/spark-csv/blob/master/src/test/scala/com/databricks/spark/csv/CsvSuite.scala
-- specifically: test("DSL test for iso-8859-1 encoded file")
val fileDF = spark.read.format("com.databricks.spark.csv")
.option("header", "false")
.option("charset", "iso-8859-1")
.option("delimiter", "~") // bogus - hopefully something not in the file, just want 1 record per line
.load("s3://.../cars_iso-8859-1.csv")
fileDF.collect // I see the non-ascii characters correctly
val selectedData = fileDF.select("_c0") // just to show an operation
selectedData.write
.format("com.databricks.spark.csv")
.option("header", "false")
.option("delimiter", "~")
.option("charset", "iso-8859-1")
.save("s3://.../carOutput8859")
This code runs without error, but it doesn't seem to honor the ISO-8859-1 option on output. At a Linux prompt (after copying from S3 to local Linux):
file -i cars_iso-8859-1.csv
cars_iso-8859-1.csv: text/plain; charset=iso-8859-1
file -i carOutput8859.csv
carOutput8859.csv: text/plain; charset=utf-8
I'm just looking for some good examples of reading and writing non-UTF-8 files. At this point, I have plenty of flexibility in the approach (it doesn't have to be a CSV reader). Any recommendations/examples?
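Not a definitive answer, but for what it's worth: the built-in CSV source accepts an encoding (alias charset) option on read, and newer releases (Spark 3.0+, as far as I recall) also honor it on write, which removes the dependency on the external spark-csv package. A sketch with the same placeholder S3 paths:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("iso-8859-1-roundtrip").getOrCreate()

// Read with the ISO-8859-1 charset.
val fileDF = spark.read
  .option("header", "false")
  .option("encoding", "ISO-8859-1")
  .option("delimiter", "~")
  .csv("s3://.../cars_iso-8859-1.csv")

// The write-side "encoding" option is only honored by newer Spark versions;
// older ones silently write UTF-8, which matches the behaviour seen above.
fileDF.select("_c0").write
  .option("header", "false")
  .option("delimiter", "~")
  .option("encoding", "ISO-8859-1")
  .csv("s3://.../carOutput8859")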

PySpark: Empty RDD on reading gzipped BSON files

I have a script to analyse BSON dumps; however, it works only with uncompressed files. I get an empty RDD when reading gzipped BSON files.
from datetime import datetime
from json import JSONEncoder

from bson import ObjectId
from pyspark import SparkConf, SparkContext

pyspark_location = 'lib/pymongo_spark.py'
HDFS_HOME = 'hdfs://1.1.1.1/'
INPUT_FILE = 'big_bson.gz'

class BsonEncoder(JSONEncoder):
    def default(self, obj):
        if isinstance(obj, ObjectId):
            return str(obj)
        elif isinstance(obj, datetime):
            return obj.isoformat()
        return JSONEncoder.default(self, obj)

def setup_spark_with_pymongo(app_name='App'):
    conf = SparkConf().setAppName(app_name)
    sc = SparkContext(conf=conf)
    sc.addPyFile(pyspark_location)
    return sc

def main():
    spark_context = setup_spark_with_pymongo('PysparkApp')
    filename = HDFS_HOME + INPUT_FILE
    import pymongo_spark
    pymongo_spark.activate()
    rdd = spark_context.BSONFileRDD(filename)
    print(rdd.first())  # raises ValueError("RDD is empty")
I am using mongo-java-driver-3.2.2.jar, mongo-hadoop-spark-1.5.2.jar, pymongo-3.2.2-py2.7-linux-x86_64 and pymongo_spark, along with spark-submit.
The version of Spark deployed is 1.6.1 along with Hadoop 2.6.4.
I am aware that the library does not support splitting compressed BSON files; however, it should work with a single split.
I have hundreds of compressed BSON files to analyse, and decompressing each of them doesn't seem to be a viable option.
Any idea how should I proceed further?
Thanks in advance!
I've just tested this in the same environment: mongo-hadoop-spark-1.5.2.jar, Spark 1.6.1 for Hadoop 2.6.4, and PyMongo 3.2.2. The source file is a compressed mongodump output, small enough for a single split (uncompressed collection size of 105 MB). Running it through PySpark:
from pyspark import SparkContext, SparkConf
import pymongo_spark

pymongo_spark.activate()
conf = SparkConf().setAppName("pyspark-bson")
sc = SparkContext.getOrCreate(conf)  # reuse the shell's context if one already exists
file_path = "/file/example_bson.gz"
rdd = sc.BSONFileRDD(file_path)
rdd.first()
It was able to read the compressed BSON file and list the first document. Please make sure you can reach the input file and that it is in valid BSON format.

Reading Avro container files in Spark

I am working on a scenario where I need to read Avro container files from HDFS and do analysis using Spark.
Input files directory: hdfs:///user/learner/20151223/*.lzo
Note: the input Avro files are LZO-compressed.
import com.databricks.spark.avro._
val df = sqlContext.read.avro("/user/learner/20151223/*.lzo")
When I run the above command, it throws an error:
java.io.FileNotFoundException: No avro files present at file:/user/learner/20151223/*.lzo
at com.databricks.spark.avro.AvroRelation$$anonfun$11.apply(AvroRelation.scala:225)
at com.databricks.spark.avro.AvroRelation$$anonfun$11.apply(AvroRelation.scala:225)
at scala.Option.getOrElse(Option.scala:120)
at com.databricks.spark.avro.AvroRelation.newReader(AvroRelation.scala:225)
This makes sense, because read.avro() expects files with a .avro extension as input. So I extracted and renamed the input .lzo files to .avro, and then I am able to read the data in the Avro files properly.
Is there any way to read LZO-compressed Avro files in Spark?
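One thing that may be worth trying before extracting and renaming anything: spark-avro's extension filter can be switched off via the standard avro-mapred setting shown below, so the reader at least attempts the .lzo files. Whether it can then actually decode them depends on how they were compressed, so treat this as a sketch rather than a confirmed fix:

import com.databricks.spark.avro._

// Tell spark-avro not to skip inputs that lack the .avro extension.
sqlContext.sparkContext.hadoopConfiguration
  .set("avro.mapred.ignore.inputs.without.extension", "false")

val df = sqlContext.read.avro("/user/learner/20151223/*.lzo")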
A workaround that worked, but with a catch!
I found a way around the issue: I created a shell wrapper that decompresses each .lzo file into .avro format like this:
hadoop fs -text <file_path>*.lzo | hadoop fs -put - <file_path>.avro
I can successfully decompress the LZO files, but the problem is that I have at least 5000 files in compressed format. Uncompressing and converting them one by one takes over an hour to run.
Is there any way to do this decompression in bulk?
Thanks again !
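As for doing the decompression in bulk: one idea, untested against this exact setup, is to let Spark fan the per-file work out across the cluster instead of running the hadoop fs -text pipeline serially on one host. It assumes the LZO codec is registered in the Hadoop configuration visible to the executors, which the working hadoop fs -text command already suggests:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.IOUtils
import org.apache.hadoop.io.compress.CompressionCodecFactory

// One task per input file: open the .lzo file, decompress it with whatever codec
// the Hadoop configuration maps to that extension, and stream the result back
// to HDFS under a .avro name.
sc.binaryFiles("hdfs:///user/learner/20151223/*.lzo").foreach { case (pathStr, stream) =>
  val conf = new Configuration()
  val codec = new CompressionCodecFactory(conf).getCodec(new Path(pathStr)) // LzopCodec, if registered
  val in = codec.createInputStream(stream.open())
  val outPath = new Path(pathStr.stripSuffix(".lzo") + ".avro")
  val out = outPath.getFileSystem(conf).create(outPath)
  IOUtils.copyBytes(in, out, conf, true) // closes both streams when finished
}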