How can I write a Spark DataFrame to DynamoDB using the emr-dynamodb-connector and Python?
I can't find how to create a new JobConf with PySpark.
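For reference, a minimal sketch of the JobConf-driven write path the emr-dynamodb-connector expects, written in Scala (the table name, region, endpoint and attribute names below are placeholders, and df stands for the DataFrame being written); whichever language drives the job, these are the Hadoop properties that ultimately have to be set:

import com.amazonaws.services.dynamodbv2.model.AttributeValue
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.JobConf

// Hadoop job configuration for the connector (property keys follow the EMR documentation)
val jobConf = new JobConf(spark.sparkContext.hadoopConfiguration)
jobConf.set("dynamodb.servicename", "dynamodb")
jobConf.set("dynamodb.output.tableName", "my_table")                        // placeholder table
jobConf.set("dynamodb.regionid", "us-east-1")                               // placeholder region
jobConf.set("dynamodb.endpoint", "dynamodb.us-east-1.amazonaws.com")
jobConf.set("mapred.output.format.class", classOf[DynamoDBOutputFormat].getName)

// Map each row to the (Text, DynamoDBItemWritable) pair the output format expects
val itemsRdd = df.rdd.map { row =>
  val attrs = new java.util.HashMap[String, AttributeValue]()
  attrs.put("id", new AttributeValue().withS(row.getAs[String]("id")))      // placeholder columns
  attrs.put("value", new AttributeValue().withS(row.getAs[String]("value")))
  val item = new DynamoDBItemWritable()
  item.setItem(attrs)
  (new Text(""), item)
}

itemsRdd.saveAsHadoopDataset(jobConf)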
I'm trying to achieve something similar to Updating BigQuery data using Java (https://cloud.google.com/bigquery/docs/updating-data), but with Spark and Scala. I want to update existing data and also insert new data into a BigQuery table. Is there some sort of DML we can run from Spark to do an upsert operation against BigQuery?
I found that BigQuery supports MERGE, but I'm not sure if we can do something similar with Spark and Scala.
Google BQ - how to upsert existing data in tables?
The Spark API does not support upsert yet. The best workaround at the moment is to write the DataFrame to a temporary table, run a MERGE job, and then delete the temporary table.
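A rough Scala sketch of that workaround, assuming the spark-bigquery-connector for the staging write and the BigQuery Java client for the MERGE (project, dataset, table, bucket and column names are placeholders):

import com.google.cloud.bigquery.{BigQueryOptions, QueryJobConfiguration, TableId}

// 1) Stage the DataFrame in a temporary BigQuery table
df.write
  .format("bigquery")
  .option("temporaryGcsBucket", "staging-bucket")      // placeholder GCS bucket
  .mode("overwrite")
  .save("my_dataset.target_temp")                      // placeholder temp table

// 2) Run a MERGE from the temp table into the target table
val bigquery = BigQueryOptions.getDefaultInstance.getService
val mergeSql =
  """MERGE `my_project.my_dataset.target` T
    |USING `my_project.my_dataset.target_temp` S
    |ON T.id = S.id
    |WHEN MATCHED THEN UPDATE SET T.value = S.value
    |WHEN NOT MATCHED THEN INSERT (id, value) VALUES (S.id, S.value)""".stripMargin
bigquery.query(QueryJobConfiguration.newBuilder(mergeSql).build())

// 3) Drop the temporary table
bigquery.delete(TableId.of("my_dataset", "target_temp"))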
I am trying to find a solution to read data from an HBase table using Spark Streaming and write the data to another HBase table.
I found numerous samples on the internet that create a DStream from HDFS files, but I was unable to find any examples that read data from HBase tables.
For example, if I have an HBase table 'SAMPLE' with the columns 'name' and 'activeStatus', how can I retrieve new data from SAMPLE based on the activeStatus column using Spark Streaming?
Any example that reads data from an HBase table with Spark Streaming is welcome.
Regards,
Adarsh K S
You can connect to HBase from Spark in multiple ways:
Hortonworks Spark HBase connector (SHC): https://github.com/hortonworks-spark/shc
Unicredit hbase-rdd: https://github.com/unicredit/hbase-rdd
Hortonworks SHC reads HBase directly into a DataFrame using a user-defined catalog, whereas hbase-rdd reads it as an RDD that can be converted to a DataFrame with the toDF method. hbase-rdd also has a bulk-write option (writing HFiles directly), which is preferred for massive writes.
What you need is a library that enables Spark to interact with HBase. Hortonworks' SHC is such an extension:
https://github.com/hortonworks-spark/shc
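As a sketch of the SHC approach, assuming an existing SparkSession named spark and that 'name' is the row key of the SAMPLE table (the column family and types are assumptions):

import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Catalog describing how the HBase table maps onto DataFrame columns
val catalog =
  s"""{
     |  "table":{"namespace":"default", "name":"SAMPLE"},
     |  "rowkey":"key",
     |  "columns":{
     |    "name":{"cf":"rowkey", "col":"key", "type":"string"},
     |    "activeStatus":{"cf":"info", "col":"activeStatus", "type":"string"}
     |  }
     |}""".stripMargin

val sampleDF = spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

// e.g. keep only the active rows before further processing
sampleDF.filter(sampleDF("activeStatus") === "Y").show()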
In Spark with Scala, is there a way to create a local DataFrame on the executors, like pandas in PySpark? Inside the mapPartitions method I want to convert the iterator to a local DataFrame (like a pandas DataFrame in Python) so that DataFrame features can be used instead of hand-coding them over iterators.
That is not possible.
A DataFrame is a distributed collection in Spark, and DataFrames can only be created on the driver node (i.e. outside of transformations/actions).
Additionally, in Spark you cannot execute operations on RDDs/DataFrames/Datasets inside other operations; e.g. the following code will produce errors:
rdd.map(v => rdd1.filter(e => e == v))
DataFrames and Datasets are backed by RDDs underneath, so the same restriction applies to them.
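A minimal sketch of the usual alternatives: do the per-partition work on plain Scala collections inside mapPartitions, and keep DataFrame-level logic on the distributed collection created from the driver (the normalisation logic here is only an illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mapPartitions-sketch").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(1 to 100)

// Inside mapPartitions you only ever see a local iterator, so use ordinary
// Scala collections for per-partition logic instead of trying to build a DataFrame there
val normalised = rdd.mapPartitions { iter =>
  val values = iter.toVector                   // local copy of this partition only
  if (values.isEmpty) Iterator.empty
  else {
    val partitionMax = values.max
    values.map(_.toDouble / partitionMax).iterator
  }
}

// DataFrame operations belong on the distributed collection itself, created on the driver
import spark.implicits._
val df = rdd.toDF("value")
df.groupBy().max("value").show()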
I am using HBase Version 1.2.0-cdh5.8.2 and Spark version 1.6.0.
I am using the toHBaseTable() method of the it.nerdammer.spark.hbase package to save a Spark RDD to HBase.
val experiencesDataset = sc.parallelize(Seq(("001", null.asInstanceOf[String]), ("001", "2016-10-25")))
experiencesDataset.toHBaseTable(experienceTableName).save()
But I want to save the data to HBase from Spark using a bulk load.
I am not able to understand how to use the bulk-load option. Please assist me.
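For reference, a rough sketch of the classic bulk-load path from Spark: write HFiles with HFileOutputFormat2, then hand them to LoadIncrementalHFiles. Table name, column family, qualifier and staging directory are placeholders, and the rows must be written in sorted row-key order:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job

val hbaseConf = HBaseConfiguration.create()
val tableName = TableName.valueOf("experiences")                 // placeholder table name
val connection = ConnectionFactory.createConnection(hbaseConf)
val table = connection.getTable(tableName)
val regionLocator = connection.getRegionLocator(tableName)

// Configure an HFileOutputFormat2 job against the target table's region boundaries
val job = Job.getInstance(hbaseConf)
HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator)

// Build (rowkey, KeyValue) pairs; HFiles must be written in sorted row-key order
val data = sc.parallelize(Seq(("001", "2016-10-25"), ("002", "2016-11-01")))
val kvs = data
  .sortByKey()                                                   // ASCII keys sort like bytes
  .map { case (rowKey, date) =>
    val kv = new KeyValue(Bytes.toBytes(rowKey), Bytes.toBytes("cf"),    // placeholder family
                          Bytes.toBytes("date"), Bytes.toBytes(date))    // placeholder qualifier
    (new ImmutableBytesWritable(Bytes.toBytes(rowKey)), kv)
  }

// Write the HFiles to a staging directory, then hand them over to HBase
val stagingDir = "/tmp/hfile-staging"
kvs.saveAsNewAPIHadoopFile(stagingDir, classOf[ImmutableBytesWritable],
  classOf[KeyValue], classOf[HFileOutputFormat2], job.getConfiguration)

new LoadIncrementalHFiles(hbaseConf)
  .doBulkLoad(new Path(stagingDir), connection.getAdmin, table, regionLocator)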
I am new to Apache Ignite as well as to Spark.
Can anyone help with an example of converting an Ignite RDD to a Spark RDD in Scala?
Update:
Use case:
I will receive DataFrames of HBase tables. I will execute some logic to build a report out of them and save the result to an Ignite RDD, and that same Ignite RDD will be updated for each table. Once all the tables are processed, the final Ignite RDD will be converted to a Spark or Java RDD and a last rule will be executed on that RDD. To run that rule I need that RDD converted into a DataFrame, and that DataFrame will be saved as the final report in Hive.
What do you mean by converting? IgniteRDD is a Spark RDD; technically it's a subtype of RDD.
Spark internally has many types of RDDs: MappedRDD, HadoopRDD, LogicalRDD, and so on. IgniteRDD is only one possible type of RDD, and after some transformations it too will be wrapped by another RDD type, e.g. MappedRDD.
You can also write your own RDD :)
Example from documentation:
val cache = igniteContext.fromCache("partitioned")
val result = cache.filter(_._2.contains("Ignite")).collect()
After filtering the cache RDD, the type will be different: the IgniteRDD will be wrapped in a FilteredRDD. However, it is still an implementation of RDD.
Update after comment:
First, have you imported the implicits? import spark.implicits._
SparkSession has various createDataFrame methods that will convert your RDD into a DataFrame / Dataset.
If that still doesn't help, please share the error you're getting while creating the DataFrame, along with a code example.
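For reference, a minimal sketch assuming an existing SparkSession named spark and that the IgniteRDD cache from the example above holds (String, String) pairs (the Hive table name is a placeholder):

import spark.implicits._

// IgniteRDD is an RDD[(K, V)], so the usual conversions apply directly
val reportDF = cache.toDF("key", "value")
// ...or equivalently:
val reportDF2 = spark.createDataFrame(cache)

// The final report can then be saved to Hive
reportDF.write.mode("overwrite").saveAsTable("final_report")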