I want to find the equivalent of Databricks' "DESCRIBE HISTORY" in PySpark. Does such a thing exist?

The title says it all, really: I'm trying to find the latest version of a Delta table, but because I'm testing locally I don't have access to Databricks.
I tried googling but had not much luck, I'm afraid.

If you configure the SparkSession correctly, as described in the documentation, then you can run SQL commands (including DESCRIBE HISTORY) as well. But you can also access the history using the Python or Scala APIs (see the docs), like this:
from delta.tables import DeltaTable

# Load the Delta table by path and query its commit history
deltaTable = DeltaTable.forPath(spark, pathToTable)
fullHistoryDF = deltaTable.history()
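Since you are testing locally, here is a minimal sketch of a Delta-enabled local session plus a latest-version lookup (assuming the delta-spark pip package is installed; pathToTable is the same placeholder as above):
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Build a local SparkSession with the Delta extensions enabled
builder = (SparkSession.builder
    .appName("local-delta")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# history(1) returns only the most recent commit, so its 'version' is the latest
latest_version = DeltaTable.forPath(spark, pathToTable).history(1).collect()[0]["version"]

# The SQL form also works once the session is configured
spark.sql(f"DESCRIBE HISTORY delta.`{pathToTable}`").show()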

Related

libraryDependencies for `TFNerDLGraphBuilder()` for Spark with Scala

Can anyone tell me what the libraryDependencies entry for TFNerDLGraphBuilder() is for Spark with Scala? It gives me the error: Cannot resolve symbol TFNerDLGraphBuilder.
I see it works in the notebook given below:
https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/4.NERDL_Training.ipynb
TensorFlow graphs in Spark NLP are built using the TensorFlow Python API. As far as I know, the Java/Scala version for creating the Conv1D/BiLSTM/CRF graph is not included.
So you need to create the graph first, following the instructions at:
https://nlp.johnsnowlabs.com/docs/en/training#tensorflow-graphs
That will create a TensorFlow .pb file that you then have to point the NerDLApproach annotator to. For example:
val nerTagger = new NerDLApproach()
  .setInputCols("sentence", "token", "embeddings")
  .setOutputCol("ner")
  .setLabelColumn("label")
  .setMaxEpochs(100)
  .setRandomSeed(0)
  .setPo(0.03f)
  .setLr(0.2f)
  .setDropout(0.5f)
  .setBatchSize(100)
  .setVerbose(Verbose.Epochs)
  .setGraphFolder(tfGraphPath) // folder containing the generated .pb graph
Note that you have to include the embeddings annotation first, and that the training process is executed on the driver. It is not distributed, as it could be with BigDL.
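For the Python API the same setup looks roughly like the sketch below (assuming the spark-nlp Python package, a pretrained glove_100d embeddings model, and a hypothetical graph folder path):
from pyspark.ml import Pipeline
from sparknlp.annotator import WordEmbeddingsModel, NerDLApproach

# The embeddings stage must come before the NER trainer in the pipeline
embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

nerTagger = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") \
    .setLabelColumn("label") \
    .setMaxEpochs(100) \
    .setGraphFolder("path/to/custom_tf_graphs")  # folder holding the generated .pb file

pipeline = Pipeline(stages=[embeddings, nerTagger])
# model = pipeline.fit(trainingData)  # training runs on the driver, as noted above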

pyLDAvis visualization from gensim not displaying the result in google colab

import pyLDAvis.gensim
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis
The above code displayed the visualization of the LDA model in Google Colab, but after reopening the notebook it stopped displaying.
I even tried
pyLDAvis.display(vis, template_type='notebook')
but it is still not working.
When I set
pyLDAvis.enable_notebook(local=True)
it does display the result, but not the labels. Any help would be appreciated!
When you install pyLDAvis, make sure to specify version 2.1.2:
!pip install pyLDAvis==2.1.2
The newer versions don't seem to play well with Colab.
They renamed the gensim submodule in newer releases. Use it like this:
import pyLDAvis.gensim_models
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)
vis
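Putting the two suggestions together, a minimal Colab sketch with the renamed module (assuming lda_model, corpus and id2word already exist):
import pyLDAvis
import pyLDAvis.gensim_models  # renamed from pyLDAvis.gensim in newer releases

# Render inline in the notebook
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)
pyLDAvis.display(vis)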

LibreOffice BASIC : Connect to PostgreSQL

I've created a PostgreSQL connection file using LibreOffice Base (6.1), and I can run SQL queries in there just fine, but I was wondering if it's possible to use this Base connection in a LibreOffice BASIC function.
I know you can use JDBC connections for MySQL:
mysql://hostname:port/database_name
But I'm hoping there's a way to use the Base file, seeing as it works so well.
I've been trying to find documentation on this online, but I'm struggling to find anything that bridges the gap between BASIC and Base.
I've found the answer: the solution was to use createUnoService, which allows you to look up, by name, the .odb data source that was set up in Base.
oService = createUnoService("com.sun.star.sdb.DatabaseContext")
oBase = oService.getByName("basePostgreSQL") ' registered name of the .odb data source
oConn = oBase.getConnection("", "")          ' user name and password (empty here)
oQuery = oConn.createStatement()
oSql = "SELECT col FROM table"
oResult = oQuery.executeQuery(oSql)
While oResult.next()
    MsgBox oResult.getString(1)              ' first column of the current row
Wend
oConn.close()

Pyspark 1.6 File compression issue

We are using PySpark 1.6 and are trying to convert text to other file formats (like JSON, CSV, etc.) with compression (gzip, lz4, snappy, etc.), but we are unable to see the compression working.
Please find the code we tried below. Please help us point out the issue in our code, or else suggest a workaround.
Just to add to the question: none of the compression codecs work in 1.6, but they work fine in Spark 2.x.
Option 1:
from pyspark import SparkContext, SparkConf
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
df = sqlContext.read.format('parquet').load('hdfs:///user/U1/json_parque_snappy')
df.write.format('json').save('hdfs:///user/U1/parquet_json_snappy')
Option 2:
df = sqlContext.read.format('parquet').load('hdfs:///user/U1/json_parque_snappy')
df.write.format('json').option('codec','com.apache.hadoop.io.compress.SnappyCodec').save('hdfs:///user/U1/parquet_json_snappy_4')
Option 3:
df = sqlContext.read.format('parquet').load('hdfs:///user/U1/json_parque_snappy')
df.write.format('json').option('compression','snappy').save('hdfs:///user/U1/parquet_json_snappy')
For Spark 1.6, to save text/JSON output, try using the
spark.hadoop.mapred.output.compression.codec parameter.
There are four parameters to be set. This has been answered already, and more details are in this link.
With Spark 2.x, the API is simpler and you can use:
df.write.option("compression", "gzip")
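As a rough PySpark 1.6-style sketch of those settings (assuming the existing sc and sqlContext; the output path is made up, and gzip is used as the example codec):
# Set both the old- and new-style Hadoop output compression properties
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("mapred.output.compress", "true")
hadoop_conf.set("mapred.output.compression.codec",
                "org.apache.hadoop.io.compress.GzipCodec")
hadoop_conf.set("mapreduce.output.fileoutputformat.compress", "true")
hadoop_conf.set("mapreduce.output.fileoutputformat.compress.codec",
                "org.apache.hadoop.io.compress.GzipCodec")

df = sqlContext.read.format('parquet').load('hdfs:///user/U1/json_parque_snappy')
df.write.format('json').save('hdfs:///user/U1/parquet_json_gzip')

# In Spark 2.x this reduces to a single writer option:
# df.write.option("compression", "gzip").json('hdfs:///user/U1/parquet_json_gzip')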

ERROR KeyProviderCache: Could not find uri with key

Below is the simple code to create a Hive table and load data into it.
import java.util.Properties
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("HIVE_Test").setMaster("local").set("spark.executor.memory", "1g").set("spark.driver.allowMultipleContexts", "true")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._

sqlContext.sql("CREATE TABLE test_amit_hive12(VND_ID INT,VND_NM STRING,VND_SHORT_NM STRING,VND_ADR_LN_1_TXT STRING,VND_ADR_LN_2_TXT STRING,VND_CITY_CD STRING,VND_ZIP_CD INT,LOAD_TS FLOAT,UPDT_TS FLOAT, PROMO_STTS_CD STRING, VND_STTS_CD STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'path_to/amitesh/part.txt' INTO TABLE test_amit_hive12")
exit()
I have two questions:
1) In the CREATE TABLE I have hard-coded the table name and columns, but how would the code understand what delimiter the file uses? When we create a Hive table through the Hive prompt, we write the following lines:
FIELDS TERMINATED BY ''
LINES TERMINATED BY ''
So, don't we need to do that while working with Spark/Scala? (A rough sketch of passing those clauses through appears after this question.)
2) While executing the code through spark-shell, I am getting the error below:
ERROR KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
res1: org.apache.spark.sql.DataFrame = [result: string]
I found a post on Stack Overflow, but it was unanswered. On another website, I found that it is a bug in Hadoop 2.7.1. I checked mine: I have 2.7.2, so what is the possibility of the bug existing in my version? I am using IBM's BigInsights. Following are my version details:
Hadoop 2.7.2-IBM-12
However, is there anyone who could help me resolve this issue? I will need very strong proof to establish this as a bug to my manager.
Below is one of the links where people say the error is a bug:
https://talendexpert.com/talend-spark-error/
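Regarding the first question about delimiters: the same FIELDS TERMINATED BY / LINES TERMINATED BY clauses can go in the DDL string passed to the Hive context. A rough PySpark sketch (the comma delimiter and shortened column list are just placeholders):
from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)
# The delimiter is declared in the DDL itself, just as at the Hive prompt
sqlContext.sql(r"""
    CREATE TABLE test_amit_hive12 (VND_ID INT, VND_NM STRING)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE
""")
sqlContext.sql("LOAD DATA LOCAL INPATH 'path_to/amitesh/part.txt' INTO TABLE test_amit_hive12")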
A bit late, but does this solve your problem?
I got the same error, but it was not really a problem for me. After the error the code ran just fine. Sometimes it pops up and sometimes it doesn't, so maybe it is connected to the executor nodes on our cluster that are involved in the particular Spark job.
It is not directly related to the Hadoop version; it depends on the Spark version you run.
The bug and its solution are reported here: https://issues.apache.org/jira/browse/SPARK-20594.
That is, upgrading to Spark 2.2.0 will probably solve this issue.