select data from Hive where column value in list - scala

I would like to select rows from Hive only when a column's value is in a given List.
Example data in the Hive table:
Col1   | Col2 | Col3
-------+------+--------
Joe    | 32   | Place-1
Nancy  | 28   | Place-2
Shalyn | 35   | Place-1
Andy   | 20   | Place-3
I am querying the Hive table as:
val name = List("Sherley","Joe","Shalyan","Dan")
var dataFromHive = sqlCon.sql("select Col1,Col2,Col3 from default.NameInfo where Col1 in (${name})")
I know that my query is wrong, as it throws an error, but I am not able to find a proper replacement for where Col1 in (${name}).

A better idea is to convert name to a DataFrame and join it with the Hive table. The inner join has the same effect as the filter, keeping only the rows whose Col1 appears in both.
import sqlCon.implicits._   // needed for toDF on a local collection

val nameDf = List("Sherley","Joe","Shalyan","Dan").toDF("Col1")
var dataFromHive = sqlCon.table("default.NameInfo").join(nameDf, "Col1").select("Col1", "Col2", "Col3")
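Since the name list is tiny, you can also hint a broadcast join so the small side is shipped to every executor. A minimal sketch, using the broadcast function from org.apache.spark.sql.functions:
import org.apache.spark.sql.functions.broadcast

// broadcast() hints that nameDf is small enough to ship to every executor
var dataFromHive = sqlCon.table("default.NameInfo")
  .join(broadcast(nameDf), "Col1")
  .select("Col1", "Col2", "Col3")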
Try to use the DataFrame API. It will make the code easy to read.

Convert your List to a String (formatted so that it can be dropped into the Hive query):
val name = List("Sherley","Joe","Shalyan","Dan")
val name_string = name.mkString("('","','", "')")
//name_string: String = ('Sherley','Joe','Shalyan','Dan')
var dataFromHive = sqlCon.sql("select Col1,Col2,Col3 from default.NameInfo where Col1 in " + name_string )
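If you would rather stay in the DataFrame API instead of building the IN clause by hand, Column.isin accepts varargs, so the same list can be passed to a filter. A minimal sketch, reusing the name list and sqlCon from above:
import org.apache.spark.sql.functions.col

// isin takes varargs, so splat the Scala List with : _*
var dataFromHive = sqlCon.table("default.NameInfo")
  .filter(col("Col1").isin(name: _*))
  .select("Col1", "Col2", "Col3")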

Related

.isin() with a column from a dataframe

How can I query a table using isin() with another dataframe? For example there is this dataframe, df1:
| id | rank |
|---------|------|
| SE34SER | 1 |
| SEF3445 | 2 |
| 5W4G4F | 3 |
I want to query a table where a column of the table is in df1.id, using isin(). I tried doing so like this:
t = (
    spark.table('mytable')
    .where(sf.col('id').isin(df1.id))
    .select('*')
).show()
However it errors:
AttributeError: 'NoneType' object has no attribute 'id'
Unfortunately, you can't pass another DataFrame's column to the isin() method. You could collect all the values of that column into a list and pass the list to isin(), but that is not a good approach.
You can do an inner join between those two DataFrames instead.
df2 = spark.table('mytable')
df2.join(df1.select('id'),df1.id == df2.id, 'inner')

Setting the column value based on the column value of complete df in spark scala

Can someone help me solve this use case? Below is the dataset:
+-----------+-------------+------------+
|artistId |musicalGroups|displayName |
+-----------+-------------+------------+
|wa_16 |wa_31 |Exods |
|wa_38 |wa_16 |Kirk |
+-----------+-------------+------------+
I want to populate a new column, name, based on the musicalGroups value, setting it to the displayName of the artistId it refers to.
For example, wa_16 is the artistId whose displayName is Exods, so any row whose musicalGroups is wa_16 should get Exods in its name column.
Example:
+-----------+-------------+------------+-----+
|artistId   |musicalGroups|displayName |name |
+-----------+-------------+------------+-----+
|wa_16      |wa_31        |Exods       |null |
|wa_38      |wa_16        |Kirk        |Exods|
+-----------+-------------+------------+-----+
I tried a self join on artistId and musicalGroups, but it was not working.
Can someone help me solve this use case?
val df = `your existing dataframe`
// Derive new dataset from the original dataset
val newDF = df.select("artistId", "displayName").distinct()
// Join new dataset with original dataset based on the common key and select the relevant columns
val combinedDF = df.join(newDF, df.col("musicalGroups") === newDF.col("artistId"), "leftOuter")
  .select(
    df.col("artistId") as "artistId",
    df.col("musicalGroups") as "musicalGroups",
    df.col("displayName") as "displayName",
    newDF.col("displayName") as "name"
  )
IIUC, you can use pivot() and groupBy():
from pyspark.sql import functions as F

df = spark.createDataFrame([("wa_16","wa_31","Exods"),("wa_38","wa_16","Krik")],["artistId","musicalGroups","displayName"])
df_grp = df.groupBy("artistId", "musicalGroups", "displayName").pivot("displayName").agg(F.first(F.col("artistId")))
df.show()
df_grp.show()
+--------+-------------+-----------+
|artistId|musicalGroups|displayName|
+--------+-------------+-----------+
| wa_16| wa_31| Exods|
| wa_38| wa_16| Krik|
+--------+-------------+-----------+
+--------+-------------+-----------+-----+-----+
|artistId|musicalGroups|displayName|Exods| Krik|
+--------+-------------+-----------+-----+-----+
| wa_16| wa_31| Exods|wa_16| null|
| wa_38| wa_16| Krik| null|wa_38|
+--------+-------------+-----------+-----+-----+

How to Split Json format column values in Spark dataframe using foreach [duplicate]

This question already has answers here:
How to query JSON data column using Spark DataFrames?
(5 answers)
Closed 4 years ago.
I want to split the JSON format column results in a Spark dataframe:
allrules_internal table in Hive:
+-----------+-------------------------------------+--------+
| tablename | condition                           | filter |
+-----------+-------------------------------------+--------+
| documents | {"col_list":"document_id,comments"} | NA     |
| person    | {"per_list":"person_id, name, age"} | NA     |
+-----------+-------------------------------------+--------+
Code:
val allrulesDF = spark.read.table("default" + "." + "allrules_internal")
allrulesDF.show()
val df1 = allrulesDF.select(allrulesDF.col("tablename"), allrulesDF.col("condition"), allrulesDF.col("filter"), allrulesDF.col("dbname")).collect()
Here I want to split the condition column values. From the example above, I want to keep the "document_id, comments" part. In other words, the condition column holds a key/value pair, but I only want the value part.
If there is more than one row in the allrules_internal table, how do I split out the value for each row?
df1.foreach(row => {
  // condition = row.getAs("condition").toString() // here, how to retrieve it?
  println(condition)
  val tableConditionDF = spark.sql("SELECT " + condition + " FROM " + db_name + "." + table_name)
  tableConditionDF.show()
})
You can use the from_json function:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

allrulesDF
  .withColumn("condition", from_json($"condition", StructType(Seq(StructField("col_list", DataTypes.StringType, true)))))
  .select($"tablename", $"condition.col_list".as("condition"))
It will print:
+---------+---------------------+
|tablename|condition |
+---------+---------------------+
|documents|document_id, comments|
+---------+---------------------+
Explanation:
With the withColumn method, you can create a new column from a function of one or more existing columns. In this case, we're using the from_json function, which takes the column containing a JSON string and a StructType object describing the schema of that JSON. Finally, you just have to select the columns that you need.
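If you still need one query per table, as in your foreach attempt, you can collect the parsed rows to the driver and loop over them. A rough sketch, assuming the database name is available in a db_name variable; rows whose JSON uses a different key (such as per_list) simply parse to null and are skipped:
val rules = allrulesDF
  .withColumn("condition", from_json($"condition", StructType(Seq(StructField("col_list", DataTypes.StringType, true)))))
  .select($"tablename", $"condition.col_list".as("col_list"))
  .collect()

rules.foreach { row =>
  val tableName = row.getAs[String]("tablename")
  val colList = row.getAs[String]("col_list")
  if (colList != null) {
    // db_name is assumed to hold the target database name
    spark.sql(s"SELECT $colList FROM $db_name.$tableName").show()
  }
}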
Hope it helped!

How to refer broadcast variable in dataframes

I use Spark 1.6. I tried to broadcast a lookup built from an RDD and am not sure how to access the broadcast variable in the DataFrames.
I have two dataframes employee & department.
Employee Dataframe
------------------------------
Emp Id | Emp Name | Emp_Age
-------+----------+---------
1      | john     | 25
2      | David    | 35

Department Dataframe
------------------------------
Dept Id | Dept Name | Emp Id
--------+-----------+--------
1       | Admin     | 1
2       | HR        | 2
import scala.collection.Map
val df_emp = hiveContext.sql("select * from emp")
val df_dept = hiveContext.sql("select * from dept")
val rdd = df_emp.rdd.map(row => (row.getInt(0),row.getString(1)))
val lkp = rdd.collectAsMap()
val bc = sc.broadcast(lkp)
print(bc.value.get(1).get)
// Below statement doesn't work
val combinedDF = df_dept.withColumn("emp_name",bc.value.get($"emp_id").get)
How do I refer the broadcast variable in the above combinedDF statement?
How to handle if the lkp doesn't return any value?
Is there a way to return multiple records from the lkp (let's assume there are 2 records for emp_id=1 in the lookup; I would like to get both records)?
How do I return more than one value from the broadcast... (emp_name & emp_age)?
How do I refer the broadcast variable in the above combinedDF statement?
Use a udf. If emp_id is an Int:
import org.apache.spark.sql.functions.udf

val f = udf((emp_id: Int) => bc.value.get(emp_id))
df_dept.withColumn("emp_name", f($"emp_id"))
How to handle if the lkp doesn't return any value?
Don't call .get on the Option as in your bc.value.get(1).get; the udf above returns the Option directly, which becomes null when the key is missing.
Is there a way to return multiple records from the lkp
Use groupByKey (re-broadcast the grouped map and have the udf return a Seq, so a key can carry several names):
val lkpMulti = rdd.groupByKey.collectAsMap()
val bcMulti = sc.broadcast(lkpMulti)
val fMulti = udf((emp_id: Int) => bcMulti.value.get(emp_id).map(_.toSeq).getOrElse(Seq.empty[String]))
and explode:
df_dept.withColumn("emp_name", fMulti($"emp_id")).withColumn("emp_name", explode($"emp_name"))
or just skip all the steps and broadcast:
import org.apache.spark.sql.functions._
df_emp.join(broadcast(df_dept), Seq("Emp Id"), "left")
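For the last point (returning both emp_name and emp_age), the broadcast join above already carries every employee column across. With the manual broadcast you could collect a map of tuples instead; a rough sketch, assuming the department column is called emp_id and Emp_Age is an integer:
// assumes df_emp columns are (id, name, age) in that order and df_dept has an emp_id column
val lkpBoth = df_emp.rdd.map(row => (row.getInt(0), (row.getString(1), row.getInt(2)))).collectAsMap()
val bcBoth = sc.broadcast(lkpBoth)

val empName = udf((emp_id: Int) => bcBoth.value.get(emp_id).map(_._1).orNull)
val empAge = udf((emp_id: Int) => bcBoth.value.get(emp_id).map(_._2))

df_dept
  .withColumn("emp_name", empName($"emp_id"))
  .withColumn("emp_age", empAge($"emp_id"))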

In spark and scala, how to convert or map a dataframe to specific columns info?

Scala.
Spark.
IntelliJ IDEA.
I have a DataFrame (multiple rows, multiple columns) from a CSV file,
and I want to map it to another specific set of columns.
I am thinking of a Scala class (not a case class, because the column count is > 22) or map()...,
but I don't know how to do the conversion.
Example
A dataframe from the CSV file:
----------------------
| No | price | name |
----------------------
| 1  | 100   | "A"  |
| 2  | 200   | "B"  |
----------------------
Another specific set of column info:
=> {product_id, product_name, seller}
First, product_id maps to 'No'.
Second, product_name maps to 'name'.
Third, seller is null or "" (an empty string).
So, finally, I want a dataframe with the new column info:
-----------------------------------------
| product_id | product_name | seller |
-----------------------------------------
| 1          | "A"          |        |
| 2          | "B"          |        |
-----------------------------------------
If you already have a DataFrame (e.g. old_df):
val new_df = old_df.withColumnRenamed("No", "product_id").
  withColumnRenamed("name", "product_name").
  drop("price").
  withColumn("seller", ... )
Let's say your CSV file is "products.csv".
First you have to load it in Spark; you can do that using:
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load("products.csv")
Once the data is loaded you will have all the column names in the DataFrame df. As you mentioned, your columns will be "No", "price", "name".
To change a column's name you just have to use the withColumnRenamed API of the DataFrame:
val renamedDf = df.withColumnRenamed("No", "product_id").
  withColumnRenamed("name", "product_name")
Your renamedDf will have the column names you assigned.