How to remove the first 2 rows using zipWithIndex in Spark Scala - scala

I have two headers in the file and have to remove them. I tried with zipWithIndex; it assigns the index from zero onwards, but it shows an error when I perform a filter condition on it.
val data=spark.sparkContext.textFile(filename)
val s=data.zipWithIndex().filter(row=>row[0]>1) --> throwing error here
Any help here please.
Sample data:
===============
sno,empno,name --> need to remove
c1,c2,c3 ==> need to remove
1,123,ramakrishna
2,234,deepthi
Error: identifier expected but integer literal found
value row of type (String,Long) does not take type parameters.
not found: type <error>

If you have to remove rows, you can add a row_num column and easily drop the first two by filtering on row_num > 2.
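Alternatively, sticking with the zipWithIndex route the question attempts, here is a minimal sketch (assuming the filename and spark session from the question); the key point is that the (value, index) pair is a Scala tuple, accessed with ._1/._2 or pattern matching, not row[0]:
val data = spark.sparkContext.textFile(filename)
val withoutHeaders = data
  .zipWithIndex()                        // RDD[(String, Long)], index starts at 0
  .filter { case (_, idx) => idx > 1 }   // drop the first two lines (indexes 0 and 1)
  .map { case (line, _) => line }        // back to RDD[String]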

Related

Drools Rule Engine: Is it possible to have a param be a list and run the condition for each value in the list in a decision table?

I am trying to create a rules engine via a decision table. I want to run a rule on each value in a given list.
For example, I have the following condition column:
[decision table column picture]
I am trying to follow the snippet from section 6.1.4.3 of the docs, where it states:
A text according to the pattern forall(delimiter){snippet} is expanded by repeating the snippet once for each of the values of the comma-separated list of values in each of the cells below, inserting the value in place of the symbol $ and by joining these expansions by the given delimiter. Note that the forall construct may be surrounded by other text.
However when I try the above snippet condition, I get the following error:
java.lang.RuntimeException: Error while creating KieBase[Message [id=1, kieBase=rules, level=ERROR, path=rules_for_jpmc.xls, line=7, column=0
text=[ERR 102] Line 7:123 mismatched input 'param' in rule "Green Scenario 1.2"], Message [id=2, kieBase=rules, level=ERROR, path=rules_for_jpmc.xls, line=0, column=0
text=Parser returned a null Package]]
I just want to run productCurrent == $param on both pizza and calzone, and if one is met, the condition is true, without having to use $1, $2, etc. Is there a pattern for running a condition on a parameter list?

flatten a spark data frame's column values and put it into a variable

Spark version 1.60, Scala version 2.10.5.
I have a spark-sql dataframe df like this,
+---------------+-------------------------------+
|address        |attributes                     |
+---------------+-------------------------------+
|1314 44 Avenue |Tours, Mechanics, Shopping     |
|115 25th Ave   |Restaurant, Mechanics, Brewery |
+---------------+-------------------------------+
From this dataframe, I would like values as below,
Tours, Mechanics, Shopping, Brewery
If I do this,
df.select(df("attributes")).collect().foreach(println)
I get,
[Tours, Mechanics, Shopping]
[Restaurant, Mechanics, Brewery]
I thought I could use flatMap instead and found this, so I tried to put it into a variable using:
val allValues = df.withColumn(df("attributes"), explode("attributes"))
but I am getting an error:
error: type mismatch;
found   : org.apache.spark.sql.Column
required: String
I was thinking if I can get an output using explode I can use distinct to get the unique values after flattening them.
How can I get the desired output?
I strongly recommend using a Spark 2.x version. In Cloudera, when you issue "spark-shell", it launches the 1.6.x version; however, if you issue "spark2-shell", you get the 2.x shell. Check with your admin.
But if you need a Spark 1.6 RDD solution, try this.
import spark.implicits._
import scala.collection.mutable
val df = Seq(("1314 44 Avenue",Array("Tours", "Mechanics", "Shopping")),
("115 25th Ave",Array("Restaurant", "Mechanics", "Brewery"))).toDF("address","attributes")
df.rdd.flatMap( x => x.getAs[mutable.WrappedArray[String]]("attributes") ).distinct().collect.foreach(println)
Results:
Brewery
Shopping
Mechanics
Restaurant
Tours
If the "attribute" column is not an array, but comma separated string, then use the below one which gives you same results
val df = Seq(("1314 44 Avenue","Tours,Mechanics,Shopping"),
("115 25th Ave","Restaurant,Mechanics,Brewery")).toDF("address","attributes")
df.rdd.flatMap( x => x.getAs[String]("attributes").split(",") ).distinct().collect.foreach(println)
The problem is that withColumn expects a String as its first argument (the name of the added column), but you're passing it a Column here: df.withColumn(df("attributes"), ...).
You only need to pass "attributes" as a String.
Additionally, you need to pass a Column to the explode function, but you're passing a String - to make it a Column you can use df("columnName") or the Scala shorthand $ syntax, $"columnName".
Hope this example can help you.
import org.apache.spark.sql.functions._
val allValues = df.select(explode($"attributes").as("attributes")).distinct
Note that this will only preserve the attributes Column, since you want the distinct elements on that one.
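If the goal is a single Scala variable holding those values (as in the question), a small follow-up sketch, assuming Spark 2.x with spark.implicits._ in scope for the $ syntax:
import org.apache.spark.sql.functions.explode

// Explode the array column, de-duplicate, and pull the values back to the driver.
val distinctAttributes: Array[String] =
  df.select(explode($"attributes").as("attributes"))
    .distinct()
    .collect()
    .map(_.getString(0))

// A single comma-separated value, e.g. "Tours, Mechanics, Shopping, Brewery" (order not guaranteed).
val allValues: String = distinctAttributes.mkString(", ")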

How to keep the original column after applying data validation in the same column

I have a task to validate decimal and date fields. I am able to validate the decimal and date fields in the same column, but I am not able to keep the old column values.
Input:
id,amt1
1,123
2,321
3,345
4,543
5,789
Current Output:
id,amt1
1,12.3
2,32.1
3,34.5
4,54.3
5,78.9
Expected Output:
id,amt1,original_amt1_values
1,12.3,123
2,32.1,321
3,34.5,345
4,54.3,543
5,78.9,789
Below is the code. I am able to validate the decimal field but not able to keep the original values. Kindly help me with this. I want to keep the original column in the dataframe itself.
SourceFileDF = SourceFileDF.withColumn("amt1", DecimalConversion(col(amt1)))
DecimalConversion is my UDF and SourceFileDF is my dataframe.
You can use a temporary column name for "amt1" and then rename the columns (chaining the calls, since each one returns a new DataFrame):
SourceFileDF
  .withColumn("amt1_converted", DecimalConversion(col("amt1")))
  .withColumnRenamed("amt1", "original_amt1_values")
  .withColumnRenamed("amt1_converted", "amt1")
You can use select and provide the aliases in a single statement (keeping the id column as well):
sourceFileDF.select(
  $"id",
  DecimalConversion($"amt1").as("amt1"),
  $"amt1".as("original_amt1_values")
)
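For a self-contained test of either approach, DecimalConversion (which the question does not show) could be a hypothetical UDF chosen only to match the sample output (123 -> 12.3, i.e. inserting a decimal point before the last digit):
import org.apache.spark.sql.functions.udf

// Hypothetical stand-in for the question's DecimalConversion UDF, for illustration only.
val DecimalConversion = udf { (s: String) =>
  if (s != null && s.length > 1) s.dropRight(1) + "." + s.takeRight(1) else s
}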

dataframe: how to groupBy/count then filter on count in Scala

Spark 1.4.1
I encounter a situation where grouping a dataframe, then counting and filtering on the 'count' column, raises the exception below.
import sqlContext.implicits._
import org.apache.spark.sql._
case class Paf(x:Int)
val myData = Seq(Paf(2), Paf(1), Paf(2))
val df = sc.parallelize(myData, 2).toDF()
Then grouping and filtering:
df.groupBy("x").count()
.filter("count >= 2")
.show()
Throws an exception:
java.lang.RuntimeException: [1.7] failure: ``('' expected but `>=' found
count >= 2
Solution:
Renaming the column makes the problem vanish (I suspect because there is no longer a conflict with the interpreted 'count' function):
df.groupBy("x").count()
.withColumnRenamed("count", "n")
.filter("n >= 2")
.show()
So, is that behavior to be expected, a bug, or is there a canonical way to work around it?
Thanks, Alex
When you pass a string to the filter function, the string is interpreted as SQL. Count is a SQL keyword and using count as a variable confuses the parser. This is a small bug (you can file a JIRA ticket if you want to).
You can easily avoid this by using a column expression instead of a String:
df.groupBy("x").count()
.filter($"count" >= 2)
.show()
So, is that behavior to be expected, a bug
Truth be told, I am not sure. It looks like the parser is interpreting count not as a column name but as a function and expects parentheses to follow. This looks like a bug, or at least a serious limitation of the parser.
is there a canonical way to work around it?
Some options have already been mentioned by Herman and mattinbits, so here is a more SQL-ish approach from me:
import org.apache.spark.sql.functions.count
df.groupBy("x").agg(count("*").alias("cnt")).where($"cnt" > 2)
I think a solution is to put count in backticks:
.filter("`count` >= 2")
http://mail-archives.us.apache.org/mod_mbox/spark-user/201507.mbox/%3C8E43A71610EAA94A9171F8AFCC44E351B48EDF#fmsmsx124.amr.corp.intel.com%3E
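For completeness, an end-to-end sketch combining the question's data with the column-expression fix (assuming the Spark 1.x sc/sqlContext setup from the question):
import sqlContext.implicits._

case class Paf(x: Int)
val df = sc.parallelize(Seq(Paf(2), Paf(1), Paf(2)), 2).toDF()

// The column-expression filter bypasses the SQL string parser entirely,
// leaving only x = 2, which occurs twice.
df.groupBy("x").count().filter($"count" >= 2).show()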

search a name in dataset, error: Undefined function 'eq' for input arguments of type 'cell'

I load a file which has some columns of data. The first line contains ,CITY,YEAR2000.
The first column has the names of cities. The other columns contain numeric data.
I am trying to search for a specific city using:
data(data.CITY=='Athens',3:end)
where
data = dataset('File','cities.txt','Delimiter',',')
but I receive an error
Undefined function 'eq' for input arguments of type 'cell'.
--------UPDATE-----------------------------
OK, use:
data(find(strncmp(data.CITY,'Athens',length('Athens'))),3:end)
Have you tried using strncmp combined with find?
I would use it this way
find(strncmp(data.CITY,'ATHENS',length('ATHENS')))
EDIT
Another option to exploit would be strfind:
strfind(data.CITY,'ATHENS')
EDIT 2
You could also try with
data(ismember(data.CITY,'ATHENS'),3:end)
This should lead you to the results you expect (at least I guess so).
EDIT 3
Given your last request I would go for this solution:
inp = input('Name of the CITY: ','s')
Name of the CITY: ATHENS
data(find(strncmp(data.CITY,inp,length(inp))),3:end)