I have a text file with the below content:
.....
Phone: 123-456-7899, 555-555-5555, 999-333-7890
Names: Bob Jones, Mary Smith, Bob McAlly,
Sally Fields, Tom Hanks, Jeffery Cook,
Betty White, Tom McDonald, Bruce Harris
Address: 1234 Main, 445 Westlake, 3332 Front Street
.....
I am looking to grab all of the names starting from Bob Jones and ending with Bruce Harris from the file. I have this Scala code, but it only gets the first line:
Bob Jones, Mary Smith, Bob McAlly,
Here is the code:
val addressBookRDD = sc.textFile(file)
val myRDD = addressBookRDD.filter(line => line.contains("Names: "))
I don't know how to deal with the newlines in the text file, so the code only grabs the first line of names, not the names that continue on the following lines. I am looking for this type of result:
Bob Jones, Mary Smith, Bob McAlly, Sally Fields, Tom Hanks, Jeffery Cook, Betty White, Tom McDonald, Bruce Harris
As I pointed out in a comment, reading a file structured this way is not really something Spark is well suited for. If the file is not very large, plain Scala would probably be a better way to do it. Here is a Scala implementation:
val lines = scala.io.Source.fromFile(file).getLines

// keep only the lines between the "Names: " marker and the "Address: " marker
val nameLines = lines
  .dropWhile(line => !line.startsWith("Names: "))
  .takeWhile(line => !line.startsWith("Address: "))
  .toSeq

// strip the 7-character "Names: " prefix from the first line, join the lines,
// then split on commas and tidy up the individual names
val names = (nameLines.head.drop(7) +: nameLines.tail)
  .mkString(",")
  .split(",")
  .map(_.trim)
  .filter(_.nonEmpty)
Printing names using names foreach println will give you:
Bob Jones
Mary Smith
Bob McAlly
Sally Fields
Tom Hanks
Jeffery Cook
Betty White
Tom McDonald
Bruce Harris
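If you want the single comma-separated line from the question instead, the extracted names just need to be joined back together:

// join the extracted names back into one comma-separated line
println(names.mkString(", "))
// Bob Jones, Mary Smith, Bob McAlly, Sally Fields, Tom Hanks, Jeffery Cook, Betty White, Tom McDonald, Bruce Harris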
I have a table
users
name: varchar(20)
data: jsonb
Records look something like this:
adam, {"car": "chevvy", "fruit": "apple"}
john, {"car": "toyota", "fruit": "orange"}
I want to extract all the fields like this:
name | type  | value
adam | car   | chevvy
adam | fruit | apple
john | car   | toyota
john | fruit | orange
For your example you can do:
SELECT name, d.key AS type, d.value
FROM users u,
     JSONB_EACH_TEXT(u.data) AS d;
output:
name | type | value
------+-------+--------
adam | car | chevvy
adam | fruit | apple
john | car | toyota
john | fruit | orange
(4 rows)
There is a good explanation here: PostgreSQL - jsonb_each
Compare two rows in a Spark dataframe and remove a row if 90 percent of the columns match (if there are 10 columns and 9 match). How to do this?
Name     Country  City    Married Salary
Tony     India    Delhi   Yes     30000
Carol    USA      Chicago Yes     35000
Shuaib   France   Paris   No      25000
Dimitris Spain    Madrid  No      28000
Richard  Italy    Milan   Yes     32000
Adam     Portugal Lisbon  Yes     36000
Tony     India    Delhi   Yes     22000   <--
Carol    USA      Chicago Yes     21000   <--
Shuaib   France   Paris   No      20000   <--
The marked rows have to be removed, since 4 out of 5 of their column values match an already existing row. How to do this with a PySpark DataFrame? TIA
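No answer is shown for this one, but here is a minimal sketch of one approach, written in the Scala API used elsewhere in this thread rather than the PySpark the question asks for, and assuming the data is already in a DataFrame df with the five columns above (treating 4-of-5 matching columns as the threshold, per the example):

import org.apache.spark.sql.functions._

// columns to compare and the match threshold from the example above
val cols = Seq("Name", "Country", "City", "Married", "Salary")
val threshold = 4

// cache so monotonically_increasing_id stays stable across the self-join
val withId = df.withColumn("id", monotonically_increasing_id()).cache()
val a = withId.alias("a")
val b = withId.alias("b")

// for each pair of rows, count how many columns hold equal values
val matchCount = cols
  .map(c => when(col(s"a.$c") === col(s"b.$c"), 1).otherwise(0))
  .reduce(_ + _)

// ids of later rows that near-duplicate an earlier row
val dupIds = a.join(b, col("a.id") < col("b.id"))
  .where(matchCount >= threshold)
  .select(col("b.id").as("id"))
  .distinct()

// keep only the rows that were not flagged
val deduped = withId.join(dupIds, Seq("id"), "left_anti").drop("id")

Note that the pairwise self-join is quadratic in the number of rows, so this is only practical for modest data sizes; for large data, something like grouping on column subsets would be needed.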
I have a df and I need to search whether any elements from a list of keywords are present in it or not. If yes, I need to put all of these keywords, # separated, in a new column called foundornot.
My df is like
utid | description
123 | my name is harry and I live in newyork
234 | my neighbour is daniel and he plays hockey
The list is quite big, something like list = {harry, daniel, hockey, newyork}
The output should be like:
utid | description | foundornot
123 | my name is harry and I live in newyork | harry#newyork
234 | my neighbour is daniel and he plays hockey | daniel#hockey
The list is quite big, some 20k keywords. Also, in case nothing is found, print NF.
You can check, inside a udf function, which elements of the list exist in each row of the description column, and return them as a #-separated string, or the string NF if none match:
val list = List("harry", "daniel", "hockey", "newyork")
import org.apache.spark.sql.functions._
// collect the keywords contained in the description, or NF if none match
def checkUdf = udf((strCol: String) => {
  val found = list.filter(strCol.contains(_))
  if (found.nonEmpty) found.mkString("#") else "NF"
})
df.withColumn("foundornot", checkUdf(col("description"))).show(false)
which should give you
+----+------------------------------------------+-------------+
|utid|description |foundornot |
+----+------------------------------------------+-------------+
|123 |my name is harry and i live in newyork |harry#newyork|
|234 |my neighbour is daniel and he plays hockey|daniel#hockey|
+----+------------------------------------------+-------------+
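With around 20k keywords, it may also be worth broadcasting the list so each executor gets one read-only copy instead of the list being shipped with every task closure. A sketch of the same udf on top of a broadcast variable (assuming a SparkSession named spark):

// broadcast the keyword list once per executor instead of per task
val keywords = spark.sparkContext.broadcast(list)

def checkUdfBroadcast = udf((strCol: String) => {
  val found = keywords.value.filter(strCol.contains(_))
  if (found.nonEmpty) found.mkString("#") else "NF"
})

df.withColumn("foundornot", checkUdfBroadcast(col("description"))).show(false)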
This is my df with 2 columns:
utid|description
12342|my name is 123 amrud and nitesh
2345|my name is anil
2122|my name is 1234 mohan
and a list like list = {"mohan", "nitesh"}
I need to search whether an element from this list is present in the description column; if yes, then print "found", else print "not found", in a different column of the dataframe. The list is far bigger than this, around 20k elements.
The output dataframe should be like below:
utid|description|foundornot
12342|my name is 123 amrud and nitesh|found
2345|my name is anil|not found
2122|my name is 1234 mohan|found
Any help is welcome.
You can simply define a udf function that checks for the condition and returns one of the "found" or "not found" strings:
val list = List("mohan","nitesh")
import org.apache.spark.sql.functions._
def checkUdf = udf((strCol: String) => if (list.exists(strCol.contains)) "found" else "not found")
df.withColumn("foundornot", checkUdf(col("description"))).show(false)
That's it, and you should be getting:
+-----+-------------------------------+----------+
|utid |description |foundornot|
+-----+-------------------------------+----------+
|12342|my name is 123 amrud and nitesh|found |
|2345 |my name is anil |not found |
|2122 |my name is 1234 mohan |found |
+-----+-------------------------------+----------+
I hope the answer is helpful
With the SPLIT function, I'm trying to split a string of vertical-bar-delimited names (firstname lastname) and return the names (initial lastname), each on a new line. Thanks for the assistance.
--data
Tom Smith | Tim Jones | Mary Adams
--output
T Smith
T Jones
M Adams
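No answer is included for this one, and the question doesn't say which SQL dialect its SPLIT function comes from. As a sketch only, in keeping with the Spark examples above, here is the same transformation with Spark's split and explode functions (the single-column DataFrame and its names are illustrative):

import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named spark

// one row holding the vertical-bar-delimited names
val df = Seq("Tom Smith | Tim Jones | Mary Adams").toDF("names")

val result = df
  // split on the vertical bar and give each name its own row
  .select(explode(split(col("names"), "\\|")).as("full_name"))
  .select(trim(col("full_name")).as("full_name"))
  // first initial + space + last name (element_at requires Spark 2.4+)
  .select(concat(
    substring(col("full_name"), 1, 1),
    lit(" "),
    element_at(split(col("full_name"), " "), 2)
  ).as("short_name"))

result.show(false)
// T Smith
// T Jones
// M Adams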