How can I read and parse a BSON dump file in Spark? - mongodb

I have a couple of BZ2-compressed MongoDB BSON dumps in HDFS that need to be analysed. I'm using Spark 2.0.1 and Scala 2.11.8. Currently I'm using the Spark shell.
I tried using the mongo-hadoop connector's BSONFileInputFormat by creating an RDD as follows:
val rdd = sc.newAPIHadoopFile(
  path = "hdfs:///pathtofile/dump.bson.bz2",
  fClass = classOf[com.mongodb.hadoop.BSONFileInputFormat]
    .asSubclass(classOf[org.apache.hadoop.mapreduce.lib.input.FileInputFormat[Object, org.bson.BSONObject]]),
  kClass = classOf[Object],
  vClass = classOf[org.bson.BSONObject])
and then simply reading it using rdd.take(1).
Executing that gives me java.lang.IllegalStateException: unread block data.
I also tried the same steps after extracting the bz2 archive; it results in the same error.
How can I address this error? Is there an alternative way to read BSON dumps in Spark?

Related

Is there a way to use Impala rather than Hive in PySpark?

I have queries that work in Impala but not Hive. I am creating a simple PySpark file such as:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, HiveContext
sconf = SparkConf()
sc = SparkContext.getOrCreate(conf=sconf)
sqlContext = HiveContext(sc)
sqlContext.sql('use db1')
...
When I run this script, its queries hit the same errors I get when I run them in the Hive editor (they work in the Impala editor). Is there a way to fix this so that I can run these queries in the script using Impala?
You can use Impala or HiveServer2 from Spark SQL via the JDBC data source. That requires you to install the Impala JDBC driver and configure a connection to Impala in your Spark application. But "you can" doesn't mean "you should", because it incurs overhead and adds extra dependencies without any particular benefit.
Typically (and that is what your current application is trying to do), Spark SQL runs directly against the underlying file system, without going through either the HiveServer2 or Impala coordinators. In that scenario, Spark only (re)uses the Hive metastore to retrieve metadata -- the database and table definitions.
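For completeness, a minimal sketch of the JDBC route (shown in Scala; the PySpark DataFrameReader API is analogous). The host, port, driver class and table name below are placeholders, and the Impala JDBC driver jar you install must be on the Spark classpath:
// Hypothetical connection details -- substitute your own Impala host, database,
// table, and the driver class that matches the JDBC driver version you installed.
val impalaDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:impala://impala-host:21050/db1")
  .option("driver", "com.cloudera.impala.jdbc41.Driver")
  .option("dbtable", "some_table")
  .load()
impalaDF.show()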

spark read from hdfs with kerberos and write on local filesystem

I am trying to implement the following use case:
Spark reads Parquet files from HDFS secured with Kerberos
Spark writes those files out in CSV format
If I write to HDFS, it works perfectly. If I try to write to the local filesystem, it doesn't work: Exception in thread "main" java.io.IOException: Can't get Master Kerberos principal for use as renewer
I am using Spark 1.6.2.
To summarize, my code is:
val dfIn = sqc.read.parquet(pathIsilon)
dfIn.coalesce(1).write.format("com.databricks.spark.csv").save(pathFilesystem)

Spark SQL build for hive?

I have downloaded the Spark 1.3.1 release, package type "Pre-built for Hadoop 2.6 and later".
Now I want to run the Scala code below using the Spark shell, so I followed these steps:
1. bin/spark-shell
2. val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
3. sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
Now the problem is, if I verify it in the Hue browser with:
select * from src;
then I get
table not found exception
That means the table was not created. How do I configure Hive with the Spark shell to make this work? I want to use Spark SQL, and I also need to read and write data from Hive.
I heard that we need to copy the hive-site.xml file somewhere into the Spark directory.
Can someone please explain the steps for configuring Spark SQL with Hive?
Thanks
Tushar
Indeed, the hive-site.xml direction is correct. Take a look at https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables.
It also sounds like you want to create a Hive table from Spark; for that, look at "Saving to Persistent Tables" in the same document.
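A minimal sketch of that flow, assuming hive-site.xml has been copied into Spark's conf directory so that the Spark shell points at the same metastore Hue uses (the src_copy table name is just a placeholder):
// With hive-site.xml in Spark's conf directory, HiveContext talks to the same
// Hive metastore that Hue queries, so tables created here show up there too.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
// Read the table back and persist the result as a second Hive table;
// saveAsTable registers it in the metastore ("Saving to Persistent Tables").
val df = sqlContext.sql("SELECT key, value FROM src")
df.saveAsTable("src_copy")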

Spark utf 8 error, non-English data becomes `??????????`

One of the fields in our data is in a non-English language (Thai).
We can load the data into HDFS and the system displays the non-English field correctly when we run:
hadoop fs -cat /datafile.txt
However, when we use Spark to load and display the data, all the non-English data shows ??????????????
We have added the following when we run Spark:
System.setProperty("file.encoding", "UTF-8")
Has anyone else seen this? What do I need to do to use non-English data in Spark?
We are running Spark 1.3.0, Scala 2.10.4 on Ubuntu 14.04.
The command we run to test is:
val textFile = sc.textFile(inputFileName)
textFile.take(10).foreach(println)
We are running Spark in Docker, and the problem turned out to be the locale settings.
To set the locale in Docker, you need to run update-locale and then source /etc/default/locale; restarting Docker will not do this for you.
Thanks #lmm for the inspiration.

hadoop with mongodb plugin - read data

I know that it is possible to read and write MongoDB data via Hadoop.
I want to know whether this adapter, when reading data from a MongoDB collection, uses the native MongoDB driver and goes through the mongod instance, or whether it reads the collection data directly.
Also, when Hadoop reads MongoDB data for processing in a MapReduce job, doesn't that MapReduce lock the MongoDB collection?
In other words, when Hadoop reads MongoDB data, does it store that data internally for its own use, so that the MapReduce works on data retrieved from MongoDB but kept inside Hadoop for processing, without interfering with the MongoDB data?
No data is cached or saved within Hadoop using the mongo-hadoop plugin.
Instead, each chunk is read into Hadoop as an individual input split to parallelize the Hadoop MapReduce job.
The only locking that occurs in mongodb is a light read lock as data is read from Mongo.
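To make that concrete, here is a hedged sketch of how the mongo-hadoop input format is typically wired up, shown through Spark's newAPIHadoopRDD for consistency with the rest of this page; the connection URI is a placeholder, and the chunk-per-split behaviour described above is handled by MongoInputFormat itself:
import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject
import com.mongodb.hadoop.MongoInputFormat
// Placeholder URI -- point this at your own mongos/mongod host and collection.
val mongoConf = new Configuration()
mongoConf.set("mongo.input.uri", "mongodb://mongo-host:27017/mydb.mycollection")
// Each MongoDB chunk becomes one input split, so the read is parallelized
// without copying or caching the collection inside Hadoop.
val mongoRdd = sc.newAPIHadoopRDD(
  mongoConf,
  classOf[MongoInputFormat],
  classOf[Object],
  classOf[BSONObject])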