PySpark 1.6 file compression issue

We are using PySpark 1.6 and are trying to convert text to other file formats
(like JSON, CSV, etc.) with compression (gzip, lz4, snappy, etc.), but we cannot get compression to work.
Please find the code we tried below. Please help us pinpoint the issue in our code, or else suggest a workaround.
Just to add to the question: none of the compression codecs work in 1.6, but the same code works fine in Spark 2.x.
Option 1:
from pyspark import SparkContext, SparkConf
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
df = sqlContext.read.format('parquet').load('hdfs:///user/U1/json_parque_snappy')
df.write.format('json').save('hdfs:///user/U1/parquet_json_snappy')
Option 2:
df = sqlContext.read.format('parquet').load('hdfs:///user/U1/json_parque_snappy')
df.write.format('json').option('codec','com.apache.hadoop.io.compress.SnappyCodec').save('hdfs:///user/U1/parquet_json_snappy_4')
Option 3:
df = sqlContext.read.format('parquet').load('hdfs:///user/U1/json_parque_snappy')
df.write.format('json').option('compression','snappy').save('hdfs:///user/U1/parquet_json_snappy')

For Spark 1.6, to save compressed text/JSON output, try setting the
spark.hadoop.mapred.output.compression.* parameters; there are four parameters to set. This has been answered already in more detail elsewhere.
With Spark 2.x, the API is simpler and you can use
df.write.option("compression", "gzip")
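A minimal sketch of the 1.6 approach, assuming the relevant knobs are the standard Hadoop output-compression properties passed through the "spark.hadoop." prefix (the gzip codec is only an example, and the read/write paths are taken from the question):
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Text/JSON writers in 1.6 go through the Hadoop output format, so compression
# is driven by Hadoop properties rather than a writer option.
conf = (SparkConf()
        .set("spark.hadoop.mapred.output.compress", "true")
        .set("spark.hadoop.mapred.output.compression.codec",
             "org.apache.hadoop.io.compress.GzipCodec")
        .set("spark.hadoop.mapred.output.compression.type", "BLOCK")
        .set("spark.hadoop.mapreduce.output.fileoutputformat.compress", "true"))

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

df = sqlContext.read.format('parquet').load('hdfs:///user/U1/json_parque_snappy')
df.write.format('json').save('hdfs:///user/U1/parquet_json_snappy')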

Related

I want to find the equivalent of Databricks' "describe history" in pyspark. Does such a thing exist?

The title says it all, really: I'm trying to find the latest version of a Delta table, but because I'm testing locally I don't have access to Databricks.
I tried googling but haven't had much luck, I'm afraid.
If you configure SparkSession correctly as described in the documentation, then you can run SQL commands as well. But you can also access history using the Python or Scala APIs (see docs), like this:
from delta.tables import *
deltaTable = DeltaTable.forPath(spark, pathToTable)
fullHistoryDF = deltaTable.history()
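If you only need the latest version number, one way (sketched here against the deltaTable above) is to limit the history to the most recent commit and read its "version" column:
# history(1) returns only the most recent commit
latest_version = deltaTable.history(1).select("version").collect()[0][0]
print(latest_version)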

Load CSV file as dataframe from resources within an Uber Jar

So, I made a Scala application to run in Spark and created the uber JAR using sbt assembly.
The file I load is a lookup needed by the application, so the idea is to package it together. It works fine from within IntelliJ using the path "src/main/resources/lookup01.csv".
I am developing on Windows and testing locally, before deploying it to a remote test server.
But when I call spark-submit on the Windows machine, I get the error:
"org.apache.spark.sql.AnalysisException: Path does not exist: file:/H:/dev/Spark/spark-2.4.3-bin-hadoop2.7/bin/src/main/resources/"
It seems it tries to find the file relative to the Spark home location instead of inside the JAR file.
How can I express the path so that the file is looked up from within the JAR package?
Example code of the way I load the DataFrame (after loading it I transform it into other structures, like Maps):
val v_lookup = sparkSession.read.option( "header", true ).csv( "src/main/resources/lookup01.csv")
What I would like to achieve is a way to express the path so it works in every environment where I run the JAR, ideally also from within IntelliJ while developing.
Edit: the Scala version is 2.11.12
Update:
It seems that to get at the file inside the JAR, I have to read it as a stream. The code below worked, but I can't figure out a reliable way to extract the headers of the file the way SparkSession.read.option does.
val fileStream = scala.io.Source.getClass.getResourceAsStream("/lookup01.csv")
val inputDF = sparkSession.sparkContext.makeRDD(scala.io.Source.fromInputStream(fileStream).getLines().toList).toDF
When makeRDD is applied, I get the RDD and can then convert it to a DataFrame, but it seems I lose the ability to use the "header" option from "read" that parsed the headers out as the schema.
Is there any way around this when using makeRDD?
The other problem is that it seems I will have to manually parse the lines into columns.
You have to get the correct path from the classpath.
Considering that your file is under src/main/resources:
val path = getClass.getResource("/lookup01.csv").getPath
val v_lookup = sparkSession.read.option("header", true).csv(path)
So it all points to this: once the file is inside the JAR, it can only be accessed as an input stream that reads the data out of the compressed archive.
I arrived at a solution. Even though it is not pretty, it does what I need: read a CSV file, take the first two columns, turn them into a DataFrame, and then load them into a key-value structure (in this case I created a case class to hold the pairs).
I am considering migrating these lookups to a HOCON file, which may make loading them less convoluted.
import sparkSession.implicits._

// Read the resource from inside the JAR as a stream and turn its lines into a single-column DataFrame
val fileStream = scala.io.Source.getClass.getResourceAsStream("/lookup01.csv")
val input = sparkSession.sparkContext
  .makeRDD(scala.io.Source.fromInputStream(fileStream).getLines().toList)
  .toDF()

// Split each line manually and keep the first two columns as key-value pairs
val myRdd = input.map { line =>
  val col = utils.Utils.splitCSVString(line.getString(0))
  KeyValue(col(0), col(1))
}

// Collect into an in-memory map keyed by the first column
val myDF = myRdd.rdd.map(x => (x.key, x.value)).collectAsMap()
fileStream.close()

In Spark MLlib, how do I save a BisectingKMeansModel with Python to HDFS?

In Spark MLlib, the BisectingKMeansModel in pyspark has no save/load function.
Why?
How can I save or load a BisectingKMeans model with Python to HDFS?
It may be your Spark version. For bisecting k-means it is recommended to use 2.1.0 or above.
You can find a complete example of the class pyspark.ml.clustering.BisectingKMeans here; hope it helps:
https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.clustering.BisectingKMeans
The last part of the example code includes a model save/load:
model_path = temp_path + "/bkm_model"
model.save(model_path)
model2 = BisectingKMeansModel.load(model_path)
It works for HDFS as well, but make sure that the temp_path/bkm_model folder does not exist before saving the model, or it will give you an error:
(java.io.IOException: Path <temp_path>/bkm_model already exists)
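For reference, a self-contained sketch of the full round trip on Spark 2.1+ (the toy data and the HDFS path below are made up for illustration, and spark is assumed to be an existing SparkSession):
from pyspark.ml.clustering import BisectingKMeans, BisectingKMeansModel
from pyspark.ml.linalg import Vectors

# Toy data, just to have something to fit
data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
        (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
df = spark.createDataFrame(data, ["features"])

model = BisectingKMeans(k=2, seed=1).fit(df)

# Any reachable HDFS URI works; the target directory must not already exist
model_path = "hdfs:///tmp/bkm_model"    # hypothetical path
model.save(model_path)
model2 = BisectingKMeansModel.load(model_path)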

Loading a trained Word2Vec model in Spark

I am trying to load Google's pre-trained word2vec vectors 'GoogleNews-vectors-negative300.bin.gz' into Spark.
I converted the bin file to txt and created a smaller chunk for testing that I called 'vectors.txt'. I tried loading it as follows:
val sparkSession = SparkSession.builder
  .master("local[*]")
  .appName("Word2VecExample")
  .getOrCreate()

val model2 = Word2VecModel.load(sparkSession.sparkContext, "src/main/resources/vectors.txt")
val synonyms = model2.findSynonyms("the", 5)

for ((synonym, cosineSimilarity) <- synonyms) {
  println(s"$synonym $cosineSimilarity")
}
and to my surprise I am faced with the following error:
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/elievex/Repository/ARCANA/src/main/resources/vectors.txt/metadata
I'm not sure where the 'metadata' after 'vectors.txt' came from.
I am using Spark, Scala and Scala IDE for Eclipse.
What am I doing wrong? Is there a different way to load a pre-trained model in Spark? I would appreciate any tips.
How exactly did you get vectors.txt? If you read the Javadoc for Word2VecModel.save, you will see that:
This saves:
- human-readable (JSON) model metadata to path/metadata/
- Parquet formatted data to path/data/
The model may be loaded using Loader.load.
So what you need is the model in Parquet format, which is the standard for Spark ML models.
Unfortunately, loading from Google's native format has not been implemented yet (see SPARK-9484).
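To see what a loadable model directory looks like, here is a quick sketch with the pyspark.mllib API (the Scala org.apache.spark.mllib API used in the question mirrors it); the toy corpus and the path are made up, and sc is assumed to be the SparkContext:
from pyspark.mllib.feature import Word2Vec, Word2VecModel

# Tiny toy corpus: an RDD of token sequences
corpus = sc.parallelize([["the", "quick", "brown", "fox"],
                         ["the", "lazy", "dog"]] * 50)
model = Word2Vec().setVectorSize(10).setMinCount(1).fit(corpus)

# save() writes the layout described above: <path>/metadata (JSON) and <path>/data (Parquet)
path = "hdfs:///tmp/word2vec_model"    # hypothetical path
model.save(sc, path)

# load() expects exactly that layout, which is why pointing it at a plain
# text file fails while looking for vectors.txt/metadata
same_model = Word2VecModel.load(sc, path)
print(list(same_model.findSynonyms("the", 2)))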

ERROR KeyProviderCache: Could not find uri with key

Below is the simple code to create a Hive table and load data into it.
import java.util.Properties
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("HIVE_Test").setMaster("local").set("spark.executor.memory", "1g").set("spark.driver.allowMultipleContexts", "true")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._

sqlContext.sql("CREATE TABLE test_amit_hive12(VND_ID INT,VND_NM STRING,VND_SHORT_NM STRING,VND_ADR_LN_1_TXT STRING,VND_ADR_LN_2_TXT STRING,VND_CITY_CD STRING,VND_ZIP_CD INT,LOAD_TS FLOAT,UPDT_TS FLOAT, PROMO_STTS_CD STRING, VND_STTS_CD STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'path_to/amitesh/part.txt' INTO TABLE test_amit_hive12")
exit()
I have two queries:
1) In the CREATE TABLE I have hard-coded the table name, but how would the code know which delimiter the file uses? When we create a Hive table through the Hive prompt, we write the following lines:
FIELDS TERMINATED BY ‘’
LINES TERMINATED BY ‘’
So, don't we need to do that while working with Spark/Scala?
2) While executing the code through spark-shell, I am getting the error below:
ERROR KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
res1: org.apache.spark.sql.DataFrame = [result: string]
I found a post on Stack Overflow, but it was unanswered. On another website I found that it is a bug in Hadoop 2.7.1. I checked mine: I have 2.7.2. So what are the chances of the bug existing in my version? I am using IBM's BigInsights. Following are my version details:
Hadoop 2.7.2-IBM-12
However, is there anyone who could help me resolve this issue? I will need very strong proof to establish this as a bug to my manager.
Below is one link where people say the error is a bug:
https://talendexpert.com/talend-spark-error/
A bit late, but does this solve your problem?
I got the same error, but it was not really a problem for me.
After the error the code ran just fine. Sometimes it pops up and sometimes it doesn't, so maybe it is connected to the executor nodes on our cluster that are involved in the particular Spark job.
It is not directly related to the Hadoop version, but it is based on the Spark version you run.
Bug and solution are reported here: https://issues.apache.org/jira/browse/SPARK-20594.
That is, upgrading to Spark 2.2.0 probably will solve this issue.