I am looking to get Spark Row(s) from an Apache Spark DataFrame. I have tried searching online but could not find anything.
You can do
dataframe.collect()
This returns an Array[Row], since every element of a DataFrame is already a Row. For a single row, dataframe.first() (or dataframe.head()) returns one Row.
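For a bit more context, here is a minimal, self-contained Scala sketch; the DataFrame, column names, and values are made up for illustration:

import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().appName("rows-example").getOrCreate()
import spark.implicits._

// Illustrative data; any DataFrame works the same way.
val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "letter")

val allRows: Array[Row] = df.collect()   // every element is a Row
val firstRow: Row = df.first()           // a single Row
val id = firstRow.getAs[Int]("id")       // read a typed value out of a Row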
I am trying to insert records from a DataFrame into a Hive table using the command below. The command succeeds, but the target table is not loaded with any records.
mergerdd.write.mode("append").insertInto("db.tablename")
I expect the records to be loaded into the Hive table.
Please check my solution; it worked for me.
df.repartition(1).write.format("csv").insertInto('db.tablename',overwrite=True) # CSV
df.repartition(1).write.format("orc").insertInto('db.tablename',overwrite=True) # ORC
df.repartition(1).write.format("parquet").insertInto('db.tablename',overwrite=True) #PARQUET
This way works for me via spark.sql:
# num_output_files, temp_table_name, db, and tablename are placeholder variables
df.coalesce(num_output_files).createOrReplaceTempView(temp_table_name)
spark.sql(f"insert into {db}.{tablename} select * from {temp_table_name}")
Also, is mergerdd an RDD or a Spark DataFrame?
Here is another way to achieve what you are trying to do:
df.write.mode("append").saveAsTable("db.tablename")
I use this all the time without any problems.
Hope that helps.
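For reference, the two write paths behave differently; here is a small sketch of the distinction, assuming mergerdd is a DataFrame rather than an RDD (as asked above):

// insertInto requires db.tablename to already exist in the metastore and
// matches the DataFrame's columns to the table's columns by position.
mergerdd.write.mode("append").insertInto("db.tablename")

// saveAsTable creates the table from the DataFrame's schema if it does not
// exist, and with mode("append") appends to it if it does.
mergerdd.write.mode("append").saveAsTable("db.tablename")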
I am trying to do some transformations on a Dataset with Spark using Scala. I am currently using Spark SQL, but I want to shift the code to native Scala code. I want to know whether to use filter or map for operations like matching the values in a column and getting a single column, after the transformation, into a different Dataset.
SELECT * FROM TABLE WHERE COLUMN = ''
I used to write something like this in Spark SQL. Can someone tell me an alternative way to write the same thing using map or filter on the Dataset, and which one is faster?
You can read the documentation on the Apache Spark website; the API documentation is at https://spark.apache.org/docs/2.3.1/api/scala/index.html#package.
Here is a little example -
val df = sc.parallelize(Seq((1, "ABC"), (2, "DEF"), (3, "GHI"))).toDF("col1", "col2")
// filter keeps only the rows that satisfy the predicate
val df1 = df.filter("col1 > 1")
df1.show()
// map transforms each row; here it turns col1 into a Dataset[Int]
val df2 = df1.map(x => x.getInt(0) + 3)
df2.show()
If I understand your question correctly, you need to rewrite your SQL query with the DataFrame API. Your query reads all columns from table TABLE and filters the rows where COLUMN is empty. You can do this with a DataFrame in the following way:
spark.read.table("TABLE")
  .where($"COLUMN".eqNullSafe(""))
  .show(10)
Performance will be the same as with your SQL. Use the dataFrame.explain(true) method to understand what Spark will do.
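As for map versus filter: filter (or where) is the natural choice for selecting rows, while map is for transforming each row into something else. Here is a small sketch along those lines, reusing the TABLE/COLUMN names from the question; note that a Column expression stays visible to the Catalyst optimizer (so it can be pushed down), whereas a typed lambda is a black box to it, which is why the lambda form is generally not faster than the SQL version:

import spark.implicits._   // for the $-syntax and the String encoder below

// Column-expression filter: equivalent to the SQL WHERE clause above.
val filtered = spark.read.table("TABLE").where($"COLUMN".eqNullSafe(""))

// Typed lambda filter: selects the same rows, but the predicate cannot be pushed down.
val filteredTyped = spark.read.table("TABLE").filter(row => row.getAs[String]("COLUMN") == "")

// map is for per-row transformation, e.g. projecting one column into a Dataset[String].
val singleColumn = filtered.map(row => row.getAs[String]("COLUMN"))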
I am new to PySpark and am trying to recreate some code I wrote in Python. I am trying to create a new dataframe that has the averages of every 60 observations from the old dataframe. Here is the original Python code:
new_df=old_df.groupby(old_df.index // 60).mean()
I am struggling with how to do the same thing in Databricks using PySpark.
I think that if you have an index column in your dataframe, you can do something similar to what you proposed (note the floor, which mirrors the // 60 integer division in your original code):
from pyspark.sql.functions import avg, col, floor
new_df = old_df.withColumn("new_index", floor(col(index) / 60)).groupBy("new_index").agg(avg(YOUR_COLUMN_FOR_AVERAGE))
Best Regards,
I am trying to collect the distinct values of a Spark DataFrame column into a list using Scala. I have tried different options:
df.select(columns_name).distinct().rdd.map(r => r(0).toString).collect().toList
df.groupBy(col(column_name)).agg(collect_list(col(column_name))).rdd.map(r => r(0).toString).collect().toList
and they both work, but for the volume of my data, the process is pretty slow, so I am trying to speed things up. Does anyone have a suggestion I could try?
I am using Spark 2.1.1
thanks!
You can try
df.select("colName").dropDuplicates().rdd.map(row =>row(0)).collect.toList
Or you can try collecting the distinct values in a single aggregation with collect_set, which removes duplicates as part of the aggregation:
import org.apache.spark.sql.functions.collect_set
df.agg(collect_set("colName")).first().getSeq[String](0).toList
I have Spark Java code that looks like this. The code pulls data from an Oracle table using JDBC and displays the groupBy output.
DataFrame jdbcDF = sqlContext.read().format("jdbc").options(options).load();
jdbcDF.show();
jdbcDF.groupBy("VA_HOSTNAME").count().show();
Long ll = jdbcDF.count();
System.out.println("ll="+ll);
When I ran the code, jdbcDF.show() worked, whereas the groupBy and count did not print anything, and no errors were thrown.
My column name is correct. I tried printing that column and it worked, but with groupBy it is not working.
Can someone help me with the DataFrame output? I am using Spark 1.6.3.
You can try
import static org.apache.spark.sql.functions.count;
jdbcDF.groupBy("VA_HOSTNAME").agg(count("*")).show();