Sorry if it sounds vague, but can someone explain the steps for writing an existing DataFrame "df" into a MySQL table, say "product_mysql", and the other way around?
Please see this Databricks article: Connecting to SQL Databases using JDBC.
import org.apache.spark.sql.SaveMode
val df = spark.table("...")
println(df.rdd.partitions.length)
// given the number of partitions above, users can reduce the partition value by calling coalesce() or increase it by calling repartition() to manage the number of connections.
df.repartition(10).write.mode(SaveMode.Append).jdbc(jdbcUrl, "product_mysql", connectionProperties)
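For the other direction (loading product_mysql back into a DataFrame), a minimal sketch reusing the same jdbcUrl and connectionProperties would be:

// Read the MySQL table back into a DataFrame over the same connection settings.
val productDF = spark.read.jdbc(jdbcUrl, "product_mysql", connectionProperties)
println(productDF.rdd.partitions.length)

For large tables, the jdbc overload that takes a partition column, lower/upper bounds, and a number of partitions lets the read run in parallel.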
I'm trying to read data from Cassandra and write it to a specific Redis database index, let's say Redis DB 5.
I need to write all the data into Redis DB index 5 in hash format.
val spark = SparkSession.builder()
.appName("redis-df")
.master("local[*]")
.config("spark.redis.host", "localhost")
.config("spark.redis.port", "6379")
.config("spark.redis.db", 5)
.config("spark.cassandra.connection.host", "localhost")
.getOrCreate()
import spark.implicits._
val someDF = Seq(
(8, "bat"),
(64, "mouse"),
(-27, "horse")
).toDF("number", "word")
someDF.write
.format("org.apache.spark.sql.redis")
.option("keys.pattern", "*")
//.option("table", "person") // Is it mandatory?
.save()
Can I save data into Redis without a table name? I just want to save all the data into Redis index 5 without a table name; is that possible?
I have gone through the documentation of the spark-redis connector and I don't see any example related to this.
Doc link : https://github.com/RedisLabs/spark-redis/blob/master/doc/dataframe.md#writing
I'm currently using this version of the spark-redis connector:
<dependency>
<groupId>com.redislabs</groupId>
<artifactId>spark-redis_2.11</artifactId>
<version>2.5.0</version>
</dependency>
Has anyone faced this issue? Is there any workaround?
This is the error I get if I do not set the table name in the options:
FAILED
java.lang.IllegalArgumentException: Option 'table' is not set.
at org.apache.spark.sql.redis.RedisSourceRelation$$anonfun$tableName$1.apply(RedisSourceRelation.scala:208)
at org.apache.spark.sql.redis.RedisSourceRelation$$anonfun$tableName$1.apply(RedisSourceRelation.scala:208)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.redis.RedisSourceRelation.tableName(RedisSourceRelation.scala:208)
at org.apache.spark.sql.redis.RedisSourceRelation.saveSchema(RedisSourceRelation.scala:245)
at org.apache.spark.sql.redis.RedisSourceRelation.insert(RedisSourceRelation.scala:121)
at org.apache.spark.sql.redis.DefaultSource.createRelation(DefaultSource.scala:30)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
The table option is mandatory. The idea is that you specify the table name so that it is possible to read the DataFrame back from Redis by providing that table name.
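For example, a minimal round trip with the someDF from the question might look like this (the table name "words" is just an illustration):

// Write with an explicit table name so the frame can be located again later.
someDF.write
  .format("org.apache.spark.sql.redis")
  .option("table", "words")
  .mode("overwrite")
  .save()

// Read it back by the same table name.
val wordsDF = spark.read
  .format("org.apache.spark.sql.redis")
  .option("table", "words")
  .load()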
In your case, another option is to convert the DataFrame to a key/value RDD and use sc.toRedisKV(rdd).
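A rough sketch of that RDD route, again with someDF from the question (the choice of key and value columns here is only an assumption for illustration):

import com.redislabs.provider.redis._  // brings the toRedisKV implicit into scope

val sc = spark.sparkContext
// Map each row to a (key, value) pair of strings.
val kvRDD = someDF.rdd.map(row => (row.getAs[Int]("number").toString, row.getAs[String]("word")))
sc.toRedisKV(kvRDD)  // writes plain string key/value pairs to the DB selected by spark.redis.db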
I have to disagree. I'm dealing with the exact same issues you are. Here's what I have found:
You must reference either a table or a keys pattern, e.g.:
df = (spark.read.format("org.apache.spark.sql.redis")
      .option("keys.pattern", "rec-*")
      .option("infer.schema", True)
      .load())
In my case, I'm using a HASH and the HASH keys all begin with "rec-" followed by an int.
The spark-redis code considers the "rec-" prefix a table. As mentioned, the trick is if you want to read the data back into Spark: it wants a table name, but it seems to use a colon as the delimiter. Since I want to do reads and writes, I simply changed my table names to "rec:" and was good to go.
I think your confusion stems from the fact that in your example you only have one record defined in Spark. What if you have two? Redis needs to create two different keys, like "person:1" and "person:2". It uses the term table to describe "person". Is it a key or a table? The docs don't seem to be consistent.
My issue at the moment is being able to save to a different Redis DB by somehow changing the DB context with .config("spark.redis.db", 5). This doesn't seem to work for me when I use it with df.write.format. Any ideas?
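To make the prefix behaviour concrete, here is a rough sketch that assumes a hypothetical id column: with key.column set, spark-redis builds keys as <table>:<key value>, which is why a table of "rec" lines up with keys such as rec:1.

// Table and column names below are assumptions for illustration.
df.write
  .format("org.apache.spark.sql.redis")
  .option("table", "rec")
  .option("key.column", "id")  // keys land in Redis as "rec:<id>"
  .mode("append")
  .save()

// Reading back by the same table name.
val recDF = spark.read
  .format("org.apache.spark.sql.redis")
  .option("table", "rec")
  .load()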
I have some data stored in an S3 bucket in Parquet format, following a Hive-like partitioning style with these partition keys: retailer, year, month, day.
E.g.:
my-bucket/
  retailer=a/
    year=2020/
      ...
  retailer=b/
    year=2020/
      month=2/
        ...
I want to read all this data in a SageMaker notebook, and I want the partitions to appear as columns of my DynamicFrame, so that they are included when I call df.printSchema().
If I use Glue's suggested method, the partitions don't get included in my schema. Here's the code I'm using:
df = glueContext.create_dynamic_frame.from_options(
connection_type='s3',
connection_options={
'paths': ['s3://my-bucket/'],
"partitionKeys": [
"retailer",
"year",
"month",
"day"
]
},
format='parquet'
)
Using plain Spark code and the DataFrame class instead, it works and the partitions get included in my schema:
df = spark.read.parquet('s3://my-bucket/')
I wonder if there is a way to do it with AWS Glue's specific methods or not.
Maybe you could try crawling the data with a Glue crawler and reading it using the from_catalog option (glueContext.create_dynamic_frame.from_catalog). Although I would think you don't need to mention the partition keys, since it should see that the key=value path segments are partitions, especially considering Glue is just a wrapper around Spark.
Does anyone have a nice, neat, and stable way to achieve the equivalent of:
pandas.read_sql(sql, con, chunksize=None)
and/or
pandas.read_sql_table(table_name, con, schema=None, chunksize=None)
connected to Redshift with SQLAlchemy & psycopg2, directly into a Dask DataFrame?
The solution should be able to handle large amounts of data.
You might consider the read_sql_table function in dask.dataframe.
http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.read_sql_table
>>> df = dd.read_sql_table('accounts', 'sqlite:///path/to/bank.db',
... npartitions=10, index_col='id') # doctest: +SKIP
This relies on the pandas.read_sql_table function internally, so it should operate under the same restrictions, except that you are now asked to provide a number of partitions and an index column.
I am new to Spark. I have some JSON data that comes as an HTTP response, and I need to store it in Hive tables. Every HTTP GET request returns a JSON document that will be a single row in the table. Because of this, I am having to write single rows as files in the Hive table directory.
But I feel that having too many small files will reduce speed and efficiency. So is there a way I can iteratively add new rows to the DataFrame and write it to the Hive table directory all at once? I feel this will also reduce the runtime of my Spark code.
Example:
for (i <- 1 to 10) {
  newDF = hiveContext.read.json("path")
  df = df.union(newDF)
}
df.write()
I understand that DataFrames are immutable. Is there a way to achieve this?
Any help would be appreciated. Thank you.
You are mostly on the right track: what you want to do is obtain multiple single-record DataFrames as a Seq[DataFrame], and then reduce that Seq[DataFrame] to a single DataFrame by unioning them.
Going from the code you provided:
val BatchSize = 100
val HiveTableName = "table"
(0 until BatchSize).
  map(_ => hiveContext.read.json("path")).
  reduce(_ union _).
  write.insertInto(HiveTableName)
Alternatively, if you want to perform the HTTP requests as you go, we can do that too. Let's assume you have a function that does the HTTP request and converts it into a DataFrame:
def obtainRecord(...): DataFrame = ???
You can do something along the lines of:
val HiveTableName = "table"
val OtherHiveTableName = "other_table"
val jsonArray = ???
val batched: DataFrame =
  jsonArray.
    map { parameter =>
      obtainRecord(parameter)
    }.
    reduce(_ union _)

batched.write.insertInto(HiveTableName)
batched.select($"...").write.insertInto(OtherHiveTableName)
You are clearly misusing Spark. Apache Spark is an analytical system, not a database API. There is no benefit to using Spark to modify a Hive database like this; it will only bring a severe performance penalty without benefiting from any of Spark's features, including distributed processing.
Instead, you should use a Hive client directly to perform transactional operations.
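If you go that route, a bare-bones sketch of talking to HiveServer2 over its JDBC driver could look like the following; the host, port, database, credentials, table name and inserted value are all placeholders.

import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
try {
  val stmt = conn.createStatement()
  stmt.execute("INSERT INTO my_table VALUES ('placeholder json payload')")  // one row per response
  stmt.close()
} finally {
  conn.close()
}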
If you can batch-download all of the data first (for example with a script using curl or some other program) and store it in a file (or many files; Spark can load an entire directory at once), you can then load those files all at once into Spark to do your processing. I would also check whether the web API has any endpoints to fetch all the data you need instead of just one record at a time.
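A minimal sketch of that approach, assuming the responses were already saved as JSON files under a single staging directory (the path and table name are placeholders):

// Load every JSON file in the directory in one pass, then write a single batch to Hive.
val df = hiveContext.read.json("hdfs:///staging/http_responses/")
df.write.insertInto("my_hive_table")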
I need to run Spark SQL queries with my own custom correspondence from table names to Parquet data. Reading Parquet data to DataFrames with sqlContext.read.parquet and registering the DataFrames with df.registerTempTable isn't cutting it for my use case, because those calls have to be run before the SQL query, when I might not even know what tables are needed.
Rather than using registerTempTable, I'm trying to write an Analyzer that resolves table names using my own logic. However, I need to be able to resolve an UnresolvedRelation to a LogicalPlan representing Parquet data, but sqlContext.read.parquet gives a DataFrame, not a LogicalPlan.
A DataFrame seems to have a logicalPlan attribute, but that's marked protected[sql]. There's also a ParquetRelation class, but that's private[sql]. That's all I found for ways to get a LogicalPlan.
How can I resolve table names to Parquet with my own logic? Am I even on the right track with Analyzer?
You can actually retrieve the logical plan of your DataFrame with:
val myLogicalPlan: LogicalPlan = myDF.queryExecution.logical
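From there, one way to wire this into custom resolution logic, assuming a hypothetical name-to-path map, is to build the plan on demand:

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical lookup from table name to Parquet location; replace with your own logic.
val parquetTables: Map[String, String] = Map("product" -> "/data/product.parquet")

def planForTable(name: String): Option[LogicalPlan] =
  parquetTables.get(name).map(path => sqlContext.read.parquet(path).queryExecution.logical)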