I'm currently working on a recommender system using PySpark and IPython Notebook. I want to generate recommendations from data stored in BigQuery. There are two options: the Spark BQ connector and the Python BQ library.
What are the pros and cons of these two tools?
The Python BQ library is the standard way to interact with BigQuery from Python, so it exposes the full BigQuery API. The Spark BQ connector you mention is the Hadoop Connector: a Java Hadoop library that lets you read from and write to BigQuery through abstracted Hadoop classes. This more closely resembles how you interact with native Hadoop inputs and outputs.
You can find example usage of the Hadoop Connector here.
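To give a feel for the Hadoop-style interaction, here is a minimal read sketch in Scala. It assumes a SparkContext sc (as in a notebook or shell) and the BigQuery Hadoop connector jar on the classpath; the project, bucket, and table IDs are placeholders.

import com.google.cloud.hadoop.io.bigquery.{BigQueryConfiguration, GsonBigQueryInputFormat}
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable

// Point the connector at a project, a GCS staging bucket, and an input table
val conf = sc.hadoopConfiguration
conf.set(BigQueryConfiguration.PROJECT_ID_KEY, "my-project")          // placeholder
conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, "my-staging-bucket")   // placeholder
BigQueryConfiguration.configureBigQueryInput(conf, "publicdata:samples.shakespeare")

// Rows arrive as (id, JSON) pairs, just like a native Hadoop input format
val tableData = sc.newAPIHadoopRDD(
  conf,
  classOf[GsonBigQueryInputFormat],
  classOf[LongWritable],
  classOf[JsonObject])

tableData.take(5).foreach { case (_, row) => println(row) }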
Related
I am working on a small project. The aim of the project is to use framework ingestion tools to ingest data into a data lake.
- I will be ingesting data in batches.
- The data sources will be RDBMS tables, CSV files, and flat files.
I've done my research on ingestion tools and found plenty, such as Sqoop, Flume, Gobblin, and Kafka.
My question is: what ingestion tools or approaches do you recommend for this small project? (Keep in mind I'll be using HDFS as my lake.)
I want to use the spark-redshift library to write data from AWS S3 to AWS Redshift using the following code.
Before using it, I would like to know whether spark-redshift is open source and free to use, or whether it has to be licensed via Databricks.
val query = "delete from emp where empno=7790"

// Write data to Redshift; the "preactions" query runs before the append
mydf.coalesce(1).write
  .format("com.databricks.spark.redshift")
  .option("url", redShiftUrl)
  .option("dbtable", "emp")
  .option("tempdir", s3dir)
  .option("forward_spark_s3_credentials", "true")
  .option("preactions", query)
  .mode("append")
  .save()
spark-redshift is a package maintained by Databricks, with community contributions from SwiftKey and other companies. It is open source under the Apache 2.0 license and free to use; no license from Databricks is needed.
I have been using Apache Zeppelin for some time now, and I want to decide on a good reporting tool to use with Spark/Scala. Please compare Zeppelin and Tableau.
Thanks
The Syncfusion Dashboard Designer supports designing dashboards against big data through a Spark SQL or Hive connection. Refer to:
https://help.syncfusion.com/dashboard-platform/dashboard-designer/connecting-to-data/connecting-to-data#connecting-to-spark-sql-data
https://help.syncfusion.com/dashboard-platform/dashboard-designer/connecting-to-data/connecting-to-data#connecting-to-hive-data
I am planning to use OrientDB in production via the JDBC driver, so I need to confirm some points.
Does the JDBC driver give access to all of OrientDB's features (transactions, links, etc.), or is the Java API the better choice?
I also noticed that there is a Spring Data implementation in the OrientDB GitHub. Is it ready to use in production?
At this link there is a discussion of the issue you raised.
In general, the JDBC driver supports only a subset of OrientDB: only the part you can drive with SQL commands.
If you're a Java developer, I suggest using the Java Graph API: http://orientdb.com/docs/last/Graph-Database-Tinkerpop.html
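To give an idea of the difference, here is a rough Scala sketch against the Graph API rather than the JDBC driver; the database URL, credentials, and the Person vertex class are placeholders, and it assumes the OrientDB graphdb dependencies are on the classpath.

import com.tinkerpop.blueprints.impls.orient.OrientGraphFactory

// Open a transactional graph instance from a factory
val factory = new OrientGraphFactory("remote:localhost/mydb", "admin", "admin")  // placeholders
val graph = factory.getTx
try {
  // Create two vertices and a link (edge) between them inside a transaction
  val alice = graph.addVertex("class:Person", "name", "Alice")
  val bob   = graph.addVertex("class:Person", "name", "Bob")
  graph.addEdge(null, alice, bob, "Knows")
  graph.commit()
} catch {
  case e: Exception =>
    graph.rollback()
    throw e
} finally {
  graph.shutdown()
  factory.close()
}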
Is it possible to build a big data application in the cloud with Red Hat's PaaS, OpenShift? I'm looking into how to build a Scala application in the cloud with Hadoop (HDFS), Spark, and Apache Mahout, but I can't find anything about it. I've seen something with Hortonworks, but nothing clear about how to install it in an OpenShift environment, or how to add an HDFS node in the cloud either.
It's possible on Amazon, but my question is: is it possible on OpenShift?
It really depends on what you're ultimately trying to achieve. I know you mention building a big data application on OpenShift with Scala, but what will the application ultimately be doing?
I've gotten Hadoop running in a gear before, but if you want a better example, check out this quickstart to get an idea of how it's done: https://github.com/ryanj/flask-hbase-todos. I know it's not Scala, but here's a good article that will show you how to put together a Scala app: https://www.openshift.com/blogs/building-distributed-and-event-driven-applications-in-java-or-scala-with-akka-on-openshift.
What will the application ultimately be doing?
Forecasting football match results for several football leagues: a web application (Ruby), plus statistical computation and data mining calculations in Scala with Apache frameworks (Spark and Mahout).
We get the data via CSV files, process it, and save it in a NoSQL DB (Cassandra).
And all of this in the cloud (OpenShift); that's the idea.
I've seen the info at https://github.com/ryanj/flask-hbase-todos. I'll try it this way, but with Scala.
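Roughly, the Scala part would look like the sketch below. This is just an outline, assuming the spark-cassandra-connector package and an existing football.team_stats table; the paths, keyspace, and column names are made up.

import org.apache.spark.sql.SparkSession

// Read the CSV match results, compute a simple per-team statistic, save to Cassandra
val spark = SparkSession.builder()
  .appName("football-forecast")
  .config("spark.cassandra.connection.host", "127.0.0.1")  // placeholder host
  .getOrCreate()
import spark.implicits._

val matches = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///data/matches.csv")  // placeholder path

val teamStats = matches
  .groupBy($"home_team")
  .avg("home_goals")
  .withColumnRenamed("avg(home_goals)", "avg_goals")

teamStats.write
  .format("org.apache.spark.sql.cassandra")
  .option("keyspace", "football")   // placeholder keyspace
  .option("table", "team_stats")    // table must already exist
  .mode("append")
  .save()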