I have spaces in the column names coming into and out of this pyspark command:
results2 = results.select( cast("`HCAHPS Base Score`" as int), `HCAHPS Base Score`, cast("HCAHPS Consistency Score" as int) `HCAHPS Consistency Score` )
and I'm getting a SyntaxError at the beginning of the first as int (in front of as).
Casting in PySpark is done like this:
from pyspark.sql.types import IntegerType

results2 = results.select(
    results['HCAHPS Base Score'].cast(IntegerType()),
    results['HCAHPS Consistency Score'].cast(IntegerType())
)
I have a fixed date "2000/01/01" and a dataframe:
data1 = [{'index':1,'offset':50}]
data_p = sc.parallelize(data1)
df = spark.createDataFrame(data_p)
I want to create a new column by adding the offset column to this fixed date
I tried different methods, but I cannot pass the column into the function, and expr raises an error like:
function is neither a registered temporary function nor a permanent function registered in the database 'default'
The only solution I can think of is
df = df.withColumn("zero",lit(datetime.strptime('2000/01/01', '%Y/%m/%d')))
df.withColumn("date_offset",expr("date_add(zero,offset)")).drop("zero")
Since I cannot use lit and datetime.strptime in the expr, I have to use this approach which creates a redundant column and redundant operations.
Any better way to do it?
Since you have tagged this as a PySpark question, in Python you can do the following:
df_a3.withColumn("date_offset",F.lit("2000-01-01").cast("date") + F.col("offset").cast("int")).show()
Edit: as per the comment below, let's assume there is an extra column named type; based on it, the code below can be used:
df_a3.withColumn("date_offset",F.expr("case when type ='month' then add_months(cast('2000-01-01' as date),offset) else date_add(cast('2000-01-01' as date),cast(offset as int)) end ")).show()
I have to filter a Cassandra table in Spark: after getting data from a table via Spark, I apply a filter function on the returned RDD. We don't want to use the where clause of the Cassandra API, which can filter but needs a custom SASI index on the filter column, and that has a disk overhead issue due to multiple SSTable scans in Cassandra.
For example:
val ct = sc.cassandraTable("keyspace1", "table1")
val fltr = ct.filter(x=x.contains "zz")
table1 fields are :
dirid uuid
filename text
event int
eventtimestamp bigint
fileid int
filetype int
Basically, we need to filter data based on filename with an arbitrary string. Since the returned RDD is of type
com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow] = CassandraTableScanRDD
and filter operations are restricted only to the methods of the CassandraRow type, which are shown below.
val ct = sc.cassandraTable("keyspace1", "table1")
scala> ct
res140: com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow] = CassandraTableScanRDD[171] at RDD at CassandraRDD.scala:19
When I hit Tab after "x." in the filter function below, it shows the following methods of the CassandraRow class:
scala> ct.filter(x=>x.
columnValues getBooleanOption getDateTime getFloatOption getLongOption getString getUUIDOption length
contains getByte getDateTimeOption getInet getMap getStringOption getVarInt metaData
copy getByteOption getDecimal getInetOption getRaw getTupleValue getVarIntOption nameOf
dataAsString getBytes getDecimalOption getInt getRawCql getTupleValueOption hashCode size
equals getBytesOption getDouble getIntOption getSet getUDTValue indexOf toMap
get getDate getDoubleOption getList getShort getUDTValueOption isNullAt toString
getBoolean getDateOption getFloat getLong getShortOption getUUID iterator
You need to get the string field from the CassandraRow object and then perform the filtering on it, so the code will look as follows:
val fltr = ct.filter(x => x.getString("filename").contains("zz"))
I have the following object which mimics an enumeration:
object ColumnNames {
  val JobSeekerID = "JobSeekerID"
  val JobID = "JobID"
  val Date = "Date"
  val BehaviorType = "BehaviorType"
}
Then I want to group a DF by a column. The following does not compile:
userJobBehaviourDF.groupBy($(ColumnNames.JobSeekerID))
If I change it to
userJobBehaviourDF.groupBy($"JobSeekerID")
It works.
How can I use $ and ColumnNames.JobSeekerID together to do this?
$ is a Scala feature called a string interpolator.
Starting in Scala 2.10.0, Scala offers a new mechanism to create strings from your data: String Interpolation. String Interpolation allows users to embed variable references directly in processed string literals.
Spark leverages string interpolators in Spark SQL to convert $"col name" into a column.
scala> spark.version
res0: String = 2.3.0-SNAPSHOT
scala> :type $"hello"
org.apache.spark.sql.ColumnName
ColumnName type is a subtype of Column type and that's why you can use $-prefixed strings as column references where values of Column type are expected.
import org.apache.spark.sql.Column
val c: Column = $"columnName"
scala> :type c
org.apache.spark.sql.Column
How can I use $ and ColumnNames.JobSeekerID together to do this?
You cannot.
You should either map the column names (in the "enumerator") to the Column type using $ directly (that would require changing their types to Column), or use the col or column functions when Columns are required (a sketch of the first option follows the function signatures below).
col(colName: String): Column Returns a Column based on the given column name.
column(colName: String): Column Returns a Column based on the given column name.
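For illustration, here is a minimal sketch of the first option, changing the "enumerator" to hold Column values (the names are carried over from the question, not from any Spark API):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

object ColumnNames {
  // Each member is now a Column, so it can be passed to groupBy directly.
  val JobSeekerID: Column = col("JobSeekerID")
  val JobID: Column = col("JobID")
  val Date: Column = col("Date")
  val BehaviorType: Column = col("BehaviorType")
}

// No $ needed: groupBy accepts Column values.
userJobBehaviourDF.groupBy(ColumnNames.JobSeekerID)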
$s Elsewhere
What's interesting is that Spark MLlib uses $-prefixed strings for ML parameters, but in this case $ is just a regular method.
protected final def $[T](param: Param[T]): T = getOrDefault(param)
It's also worth mentioning that (another) $ string interpolator is used in Catalyst DSL to create logical UnresolvedAttributes that could be useful for testing or Spark SQL internals exploration.
import org.apache.spark.sql.catalyst.dsl.expressions._
scala> :type $"hello"
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
String Interpolator in Scala
The string interpolator feature works (is resolved to a proper value) at compile time so either it is a string literal or it's going to fail.
$ is akin to the s string interpolator:
Prepending s to any string literal allows the usage of variables directly in the string.
Scala provides three string interpolation methods out of the box: s, f and raw, and you can write your own interpolator, as Spark did.
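For reference, a minimal sketch of such a custom interpolator, mirroring how Spark defines its own $ (an implicit class on StringContext); the name c is made up for this example:
import org.apache.spark.sql.ColumnName

// c"..." builds a ColumnName, which is what Spark's own $"..." does internally.
implicit class ColumnStringContext(val sc: StringContext) {
  def c(args: Any*): ColumnName = new ColumnName(sc.s(args: _*))
}

val grouped = userJobBehaviourDF.groupBy(c"JobSeekerID")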
You can only use $ with string literals (values). If you want to use ColumnNames, you can do it as below:
userJobBehaviourDF.groupBy(userJobBehaviourDF(ColumnNames.JobSeekerID))
userJobBehaviourDF.groupBy(col(ColumnNames.JobSeekerID))
From the Spark Docs for Column, here are different ways of representing a column:
df("columnName") // On a specific `df` DataFrame.
col("columnName") // A generic column no yet associated with a DataFrame.
col("columnName.field") // Extracting a struct field
col("`a.column.with.dots`") // Escape `.` in column names.
$"columnName" // Scala short hand for a named column.
Hope this helps!
I am new to Spark. I have two tables in HDFS. One table (table 1) is a tag table, composed of some text, which could be some words or a sentence. Another table (table 2) has a text column. Every row in table 2 could contain more than one keyword from table 1. My task is to find all the matched keywords in table 1 for the text column in table 2, and output the keyword list for every row in table 2.
The problem is that I have to iterate over every row in table 2 and every row in table 1. If I produce a big list for table 1 and use a map function on table 2, I still have to use a loop to iterate the list inside the map function, and the driver shows a JVM memory limit error, even though the loop is not large (10 thousand iterations).
myTag is the tag list of table 1.
def ourMap(line: String, myTag: List[String]): String = {
  var ret = line
  val length = myTag.length
  for (i <- 0 to length - 1) {
    if (line.contains(myTag(i)))
      ret = ret.replaceAll(myTag(i), "_")
  }
  ret
}
val matched = result.map(b => ourMap(b, tagList))
Any suggestions on how to finish this task, with or without Spark?
Many thanks!
An example is as follows:
table1
row1|Spark
row2|RDD
table2
row1| Spark is a fast and general engine. RDD supports two types of operations.
row2| All transformations in Spark are lazy.
row3| It is for test. I am a sentence.
Expected result:
row1| Spark,RDD
row2| Spark
MAJOR EDIT:
The first table may actually contain sentences, not just simple keywords:
row1| Spark
row2| RDD
row3| two words
row4| I am a sentence
Here you go, considering the data sample that you have provided:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

val table1: Seq[(String, String)] = Seq(("row1", "Spark"), ("row2", "RDD"), ("row3", "Hashmap"))
val table2: Seq[String] = Seq("row1##Spark is a fast and general engine. RDD supports two types of operations.", "row2##All transformations in Spark are lazy.")
val rdd1: RDD[(String, String)] = sc.parallelize(table1)
val rdd2: RDD[(String, String)] = sc.parallelize(table2).map(_.split("##").toList).map(l => (l.head, l.tail(0))).cache
We'll build an inverted index of the second data table, which we will join to the first table:
val df1: DataFrame = rdd1.toDF("key", "value")
val df2: DataFrame = rdd2.toDF("key", "text")
val df3: DataFrame = rdd2.flatMap { case (row, text) =>
  text.trim.split("""[^\p{IsAlphabetic}]+""").map(word => (word, row))
}.groupByKey.mapValues(_.toSet.toSeq).toDF("word", "index")
import org.apache.spark.sql.functions.explode
val results: RDD[(String, String)] = df3.join(df1, df1("value") === df3("word")).drop("key").drop("value").withColumn("index", explode($"index")).rdd.map {
case r: Row => (r.getAs[String]("index"), r.getAs[String]("word"))
}.groupByKey.mapValues(i => i.toList.mkString(","))
results.take(2).foreach(println)
// (row1,Spark,RDD)
// (row2,Spark)
MAJOR EDIT:
As mentioned in the comment: the specifications of the issue changed. Keywords are no longer simple keywords; they might be sentences. In that case, this approach wouldn't work; it's a different kind of problem. One way to do it is using a Locality-Sensitive Hashing (LSH) algorithm for nearest-neighbor search.
An implementation of this algorithm is available here.
The algorithm and its implementation are unfortunately too long to discuss on SO.
From what I could gather from your problem statement, you are trying to tag the data in Table 2 with the keywords present in Table 1. For this, instead of loading Table 1 as a list and then doing keyword pattern matching for each row in Table 2, do this:
Load Table 1 as a HashSet.
Traverse Table 2 and, for each word in a row's text, do a lookup in the above HashSet. I assume the number of words you have to look up here is small compared to pattern matching against every keyword. Remember, a lookup is now an O(1) operation, whereas pattern matching is not.
Also, in this process, you can filter out words like "is, are, when, if", etc., as they will never be used for tagging. That reduces the number of words you need to look up in the HashSet.
The HashSet can be loaded into memory (I think 10K keywords should not take more than a few MBs). This variable can be shared across executors through a broadcast variable.
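A minimal sketch of this approach, assuming table 2 is already available as an RDD of (rowId, text) pairs named table2Rdd and the keywords from table 1 are in a local list tagList (both names are illustrative):
import org.apache.spark.rdd.RDD

val stopWords = Set("is", "are", "when", "if")
val bcKeywords = sc.broadcast(tagList.toSet)  // shipped once per executor instead of captured per task

val tagged: RDD[(String, String)] = table2Rdd.map { case (rowId, text) =>
  // Split the text into words, drop stop words, and keep only words found in the keyword set.
  val words = text.split("""[^\p{IsAlphabetic}]+""").filterNot(stopWords.contains)
  val hits = words.filter(bcKeywords.value.contains).distinct
  (rowId, hits.mkString(","))
}

tagged.collect().foreach(println)
// e.g. (row1,Spark,RDD)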
I wrote a DataFrame as a Parquet file, and I would like to read the file with Hive, using the metadata from Parquet.
Output from the Parquet write:
_common_metadata
_metadata
_SUCCESS
part-r-00000-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet
part-r-00001-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet
part-r-00002-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet
part-r-00003-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet
Hive table
CREATE TABLE testhive
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'/home/gz_files/result';
FAILED: SemanticException [Error 10043]: Either list of columns or a custom serializer should be specified
How can I infer the metadata from the Parquet file?
If I open _common_metadata, I see the content below:
PAR1LHroot
%TSN%
%TS%
%Etype%
)org.apache.spark.sql.parquet.row.metadata▒{"type":"struct","fields":[{"name":"TSN","type":"string","nullable":true,"metadata":{}},{"name":"TS","type":"string","nullable":true,"metadata":{}},{"name":"Etype","type":"string","nullable":true,"metadata":{}}]}
Or how can I parse the metadata file?
Here's a solution I've come up with to get the metadata from parquet files in order to create a Hive table.
First start a spark-shell (Or compile it all into a Jar and run it with spark-submit, but the shell is SOO much easier)
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.DataFrame
val df=sqlContext.parquetFile("/path/to/_common_metadata")
def creatingTableDDL(tableName: String, df: DataFrame): String = {
  val cols = df.dtypes
  var ddl1 = "CREATE EXTERNAL TABLE " + tableName + " ("
  // look at the datatypes and column names and put them into a string
  val colCreate = (for (c <- cols) yield (c._1 + " " + c._2.replace("Type", ""))).mkString(", ")
  ddl1 += colCreate + ") STORED AS PARQUET LOCATION '/wherever/you/store/the/data/'"
  ddl1
}
val test_tableDDL = creatingTableDDL("test_table", df)
It will provide you with the datatypes that Hive will use for each column as they are stored in Parquet.
E.g.: CREATE EXTERNAL TABLE test_table (COL1 Decimal(38,10), COL2 String, COL3 Timestamp) STORED AS PARQUET LOCATION '/path/to/parquet/files'
I'd just like to expand on James Tobin's answer. There's a StructField class which provides Hive's data types without doing string replacements.
// Tested on Spark 1.6.0.
import org.apache.spark.sql.DataFrame
def dataFrameToDDL(dataFrame: DataFrame, tableName: String): String = {
  val columns = dataFrame.schema.map { field =>
    " " + field.name + " " + field.dataType.simpleString.toUpperCase
  }
  s"CREATE TABLE $tableName (\n${columns.mkString(",\n")}\n)"
}
This solves the IntegerType problem.
scala> val dataFrame = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("x", "y")
dataFrame: org.apache.spark.sql.DataFrame = [x: int, y: string]
scala> print(dataFrameToDDL(dataFrame, "t"))
CREATE TABLE t (
x INT,
y STRING
)
This should work with any DataFrame, not just with Parquet. (e.g., I'm using this with a JDBC DataFrame.)
As an added bonus, if your target DDL supports nullable columns, you can extend the function by checking StructField.nullable.
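For example, a hedged sketch of that extension (assuming the target DDL dialect accepts NOT NULL column constraints):
import org.apache.spark.sql.DataFrame

def dataFrameToDDLWithNullability(dataFrame: DataFrame, tableName: String): String = {
  val columns = dataFrame.schema.map { field =>
    // Append NOT NULL only when the DataFrame schema says the field cannot be null.
    val constraint = if (field.nullable) "" else " NOT NULL"
    " " + field.name + " " + field.dataType.simpleString.toUpperCase + constraint
  }
  s"CREATE TABLE $tableName (\n${columns.mkString(",\n")}\n)"
}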
A small improvement over Victor's answer (adding backticks around field.name), modified to bind the table to a local Parquet file (tested on Spark 1.6.1):
def dataFrameToDDL(dataFrame: DataFrame, tableName: String, absFilePath: String): String = {
  val columns = dataFrame.schema.map { field =>
    " `" + field.name + "` " + field.dataType.simpleString.toUpperCase
  }
  s"CREATE EXTERNAL TABLE $tableName (\n${columns.mkString(",\n")}\n) STORED AS PARQUET LOCATION '" + absFilePath + "'"
}
Also notice that:
A HiveContext is needed since SQLContext does not support creating an external table.
The path to the parquet folder must be an absolute path.
I would like to expand on James' answer.
The following code will work for all datatypes including ARRAY, MAP and STRUCT.
It has been tested on Spark 2.2.
val df = spark.read.parquet("parquetFilePath")
val schema = df.schema
val columns = schema.fields
val tableName = "hive_test1"  // example table name
var ddl1 = "CREATE EXTERNAL TABLE " + tableName + " ("
val cols = (for (column <- columns) yield column.name + " " + column.dataType.sql).mkString(",")
ddl1 = ddl1 + cols + " ) STORED AS PARQUET LOCATION '/tmp/hive_test1/'"
spark.sql(ddl1)
I had the same question. It might be hard to implement from the practical side though, as Parquet supports schema evolution:
http://www.cloudera.com/content/www/en-us/documentation/archive/impala/2-x/2-0-x/topics/impala_parquet.html#parquet_schema_evolution_unique_1
For example, you could add a new column to your table and you don't have to touch data that's already in the table. Only new data files will have the new metadata (compatible with the previous version).
Schema merging has been switched off by default since Spark 1.5.0, because it is a "relatively expensive operation":
http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging
So inferring the most recent schema may not be as simple as it sounds, although quick-and-dirty approaches are quite possible, e.g. by parsing the output from
$ parquet-tools schema /home/gz_files/result/000000_0
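Alternatively, if you do want Spark itself to merge the schemas at read time, here is a minimal sketch (Spark 2.x reader API; on 1.x use sqlContext.read instead, and the path is just the example directory from the question):
// Schema merging is opt-in; this inspects the footers of all part files under the path.
val merged = spark.read.option("mergeSchema", "true").parquet("/home/gz_files/result")
merged.printSchema()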
Actually, Impala supports
CREATE TABLE LIKE PARQUET
(with no columns section at all):
https://docs.cloudera.com/runtime/7.2.15/impala-sql-reference/topics/impala-create-table.html
Your question is tagged "hive" and "spark", and I don't see this implemented in Hive, but in case you use CDH, it may be what you were looking for.