pyspark: how to get_json_object for names with spaces (or other weird characters in the name)?

Normally,
F.get_json_object(name, "$.element_name")
works fine to extract the element_name from a JSON object like this
{"element_name" : 1}
But what if the name has a space in it? How do I quote the name?
{"element name" : 1}
This obviously doesn't work:
F.get_json_object(name, "$.elementname")
This is not a pyspark-specific problem, but pyspark (and perhaps the underlying Java implementation) seems to follow a slightly different spec for JSONPath.

For Spark, one of the following two forms should work: (1) dot notation, .name, where name may not contain a dot . or an opening bracket [; or (2) bracket notation, ['name'], where name may not contain a single quote ' or a question mark ?. For example:
F.get_json_object('name', "$['element name']")
F.get_json_object('name', "$.element name")
See the relevant Scala source in JsonPathParser:
// parse `.name` or `['name']` child expressions
def named: Parser[List[PathInstruction]] =
  for {
    name <- '.' ~> "[^\\.\\[]+".r | "['" ~> "[^\\'\\?]+".r <~ "']"
  } yield {
    Key :: Named(name) :: Nil
  }
Thus, if the name contains a dot or an opening bracket, use ['name']; if it contains a single quote or a question mark, use .name; otherwise either form works. More examples of working expressions:
F.get_json_object('name', "$.Trader Joe's")
F.get_json_object('name', "$['amazon.com']")
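To make the quoting rules concrete, here is a minimal sketch using the Scala API (the same paths work from pyspark's F.get_json_object; the session setup and column names are assumptions, not part of the question):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.get_json_object

val spark = SparkSession.builder().master("local[*]").appName("jsonpath-demo").getOrCreate()
import spark.implicits._

val df = Seq("""{"element name": 1, "amazon.com": 2}""").toDF("payload")

df.select(
  get_json_object($"payload", "$['element name']").as("bracket"), // space in key: either notation works
  get_json_object($"payload", "$.element name").as("dot"),        // dot notation also allows the space
  get_json_object($"payload", "$['amazon.com']").as("dotted_key") // dot in key: bracket notation required
).show()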

For JSON keys that have names that are unfriendly to property access, you'll need to use the indexer syntax:
$["element name"]
(Single quotes should also work; note that Spark's JsonPathParser quoted above only accepts the single-quoted form.)

Related

In Spark Scala, how to create a column with substring() using locate() as a parameter?

I have a dataset that is like the following:
val df = Seq("samb id 12", "car id 13", "lxu id 88").toDF("list")
I want to create a column that will be a string containing only the values after id. The result would be something like:
val df_result = Seq(("samb id 12",12), ("car id 13",13), ("lxu id 88",88)).toDF("list", "id_value")
For that, I am trying to use substring. For the parameter giving the starting position of the substring, I am trying to use locate. But it gives me an error saying that it should be an Int and not a Column type.
What I am trying is like:
df
.withColumn("id_value", substring($"list", locate("id", $"list") + 2, 2))
The error I get is:
error: type mismatch;
found : org.apache.spark.sql.Column
required: Int
.withColumn("id_value", substring($"list", locate("id", $"list") + 2, 2))
^
How can I fix this and continue using locate() as a parameter?
UPDATE
Updating to give an example in which @wBob's answer doesn't work for my real-world data: my data is indeed a bit more complicated than the examples above.
It is something like this:
val df = Seq(":option car, lorem :ipsum: :ison, ID R21234, llor ip", "lst ID X49329xas ipsum :ion: ip_s-")
The values are very long strings that don't have a specific pattern.
Somewhere in the string there is always a part written as ID XXXXX. The XXXXX varies, but it is always the same size (5 characters) and always comes after ID .
I have not been able to get either split or regexp_extract to work on this pattern.
It is not clear whether you want the third item or the first number from the list, but here are a couple of examples which should help:
// Assign sample data to dataframe
val df = Seq("samb id 12", "car id 13", "lxu id 88").toDF("list")
df
.withColumn("t1", split($"list", "\\ ")(2))
.withColumn("t2", regexp_extract($"list", "\\d+", 0))
.withColumn("t3", regexp_extract($"list", "(id )(\\d+)", 2))
.withColumn("t4", regexp_extract($"list", "ID [A-Z](\\d{5})", 1))
.show()
You can use functions like split and regexp_extract with withColumn to create new columns based on existing values.
split breaks the list into an array on the delimiter you pass in; I have used a space, escaped with two backslashes. The array is zero-based, hence specifying 2 gets the third item in the array.
regexp_extract uses regular expressions to extract from strings; here \\d matches a digit and + matches it one or more times. The third column, t3, again uses regexp_extract with a similar regex, but uses parentheses to group up sections and group index 2 to get the second group, i.e. the (\\d+). NB the additional backslash in the regex escapes the backslash used in \d.
If your real data is more complicated please post a few simple examples where this code does not work and explain why.
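As for the original type error: the substring function needs Int arguments, but the Column method substr accepts Column arguments, so locate can be used directly. And for the updated ID XXXXX format, a regex anchored on the literal ID should work. A sketch under those assumptions (note the sample codes R21234 and X49329 actually run to six characters, so the {5} width below, taken from the question, may need adjusting):
import org.apache.spark.sql.functions.{lit, locate, regexp_extract}

// Column-based variant of the question's expression: substr (a Column method)
// accepts Column arguments, unlike the substring() function.
df.withColumn("id_value", $"list".substr(locate("id", $"list") + 2, lit(2)))

// Updated format: capture the 5 characters following the literal "ID ".
val df2 = Seq(
  ":option car, lorem :ipsum: :ison, ID R21234, llor ip",
  "lst ID X49329xas ipsum :ion: ip_s-"
).toDF("list")
df2.withColumn("id_value", regexp_extract($"list", "ID (\\S{5})", 1))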

Postgres - append to jsonb string

In Postgres, I have a jsonb column foo which stores an array of strings
["a","b","c"]
I need a query which appends another string to whatever is currently there, at a specified index
e.g. Append "!" at index 1
run query: ["a","b","c"] -> ["a","b!","c"]
run again: ["a","b","c"] -> ["a","b!!","c"]
run again: ["a","b","c"] -> ["a","b!!!","c"]
I've implemented this in Postgres v11.2 as follows
UPDATE my_table
SET foo = jsonb_set(foo, '{1}', CONCAT('"', foo->>1, '!', '"')::jsonb)
WHERE id = '12345';
Note the index 1 and the string '!' are just hardcoded here for simplicity - but they'd be variables.
It works, but I find it quite inelegant. As you can see, I select the string at the given index as text using the ->> operator, feed that to CONCAT to append the '!', and then rebuild the correct syntax to cast it back to a jsonb string. That is far more work than seems necessary simply to append to a string at a given path.
Is there a simpler way to do this? A built-in function or operator perhaps, or a simpler way of appending than using CONCAT? (I tried using the || operator in various ways but couldn't seem to make anything work with the syntax & types)
I don't think there is a better way than jsonb_set().
The concat can be replaced by || as follows:
jsonb_set(foo, '{1}', ('"' || (foo->>1) || '!"')::jsonb)
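If the index and the appended text are variables in application code, the same expression can be parameterized. A hypothetical JDBC sketch in Scala (connection details, table and column names are assumptions; the appended text is assumed to contain no double quotes):
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:postgresql://localhost:5432/mydb", "user", "secret")
val sql =
  """UPDATE my_table
    |SET foo = jsonb_set(foo, ARRAY[?]::text[], ('"' || (foo ->> ?::int) || ? || '"')::jsonb)
    |WHERE id = ?""".stripMargin
val st = conn.prepareStatement(sql)
st.setString(1, "1")     // path element: array index 1
st.setInt(2, 1)          // same index for the ->> text lookup
st.setString(3, "!")     // text to append
st.setString(4, "12345")
st.executeUpdate()
st.close(); conn.close()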

How to get substring using patterns and replace quotes in json value field using scala?

I have a few JSON messages like
{"column1":"abc","column2":"123","column3":qwe"r"ty,"column4":"abc123"}
{"column1":"defhj","column2":"45","column3":asd"f"gh,"column4":"def12d"}
I need to add double quotes on both sides of the column3 value and replace the double quotes within the column3 value with single quotes, using Scala.
You have mentioned in the comment above
I have a huge dataset in Kafka. I am trying to read from Kafka and write to HDFS through Spark using Scala. I am using a JSON parser but am unable to parse because of the column3 issue, so I need to manipulate the message to turn it into valid JSON.
So you must have a collection of malformed JSON strings as in the question. I have created a list as
val kafkaMsg = List("""{"column1":"abc","column2":"123","column3":qwe"r"ty,"column4":"abc123"}""", """{"column1":"defhj","column2":"45","column3":asd"f"gh,"column4":"def12d"}""")
and since you are reading it through Spark, you must have an RDD such as
val rdd = sc.parallelize(kafkaMsg)
All you need is some parsing of the malformed text to make it a valid JSON string:
val validJson = rdd.map(msg => msg.replaceAll("[}\"{]", "").split(",").map(_.split(":").mkString("\"", "\":\"", "\"")).mkString("{", ",", "}"))
validJson should be
{"column1":"abc","column2":"123","column3":"qwerty","column4":"abc123"}
{"column1":"defhj","column2":"45","column3":"asdfgh","column4":"def12d"}
You can create a dataframe from the validJson rdd as
sqlContext.read.json(validJson).show(false)
which should give you
+-------+-------+-------+-------+
|column1|column2|column3|column4|
+-------+-------+-------+-------+
|abc |123 |qwerty |abc123 |
|defhj |45 |asdfgh |def12d |
+-------+-------+-------+-------+
Or you can process it further as your requirements dictate.
Goal
Add double quotes on both sides of the column3 value and replace the double quotes in the column3 value with single quotes, using Scala.
I would recommend using a regex here because it gives you more flexibility.
Here is the solution:
val kafkaMsg = List("""{"column1":"abc","column2":"123","column3":qwe"r"ty,"column4":"abc123"}""", """{"column1":"defhj","column2":"45","column3":asd"f"gh,"column4":"def12d"}""", """{"column1":"defhj","column2":"45","column3":without-quotes,"column4":"def12d"}""")
val rdd = sc.parallelize(kafkaMsg)
val rePattern = """(^\{.*)("column3":)(.*)(,"column4":.*)""".r
val newRdd = rdd.map(r =>
  r match {
    case rePattern(start, col3, col3Value, end) =>
      start + col3 + '"' + col3Value.replaceAll("\"", "'") + '"' + end
    case _ => r
  }
)
newRdd.foreach(println)
Explanation:
The first two statements are the RDD initialization.
The third line defines the regex pattern; you may need to adjust it to your situation.
The regex produces 4 groups of values (whatever is inside a () is a group):
the string starting with "{" and everything after it, until we meet "column3":
"column3": itself
whatever comes after "column3": but before ,"column4":
everything from ,"column4": onwards
I use these 4 groups in the next statement.
Iterate over your RDD, run each record against the regex, and change it: replace the double quotes with single quotes and add opening/closing quotes. If there is no match, the original string is returned.
Because the regex was defined with 4 groups, I use 4 variables to map the matches:
case rePattern(start, col3, col3Value, end) =>
Note: the code doesn't check whether the value contains a double quote or not; it just runs the update. You can add validation of your own if you need it.
Show the results.
Important notes:
The regex I used is strictly tied to your source string format. Keep in mind that you have JSON, so the order of your keys is not guaranteed; as a result, "column4" (which is used to mark the end of the column3 value) may come before "column3".
If you use a comma as a key/value delimiter, make sure it doesn't appear inside the column3 value.
Bottom line: you need to adjust my regex to properly identify the end of the column3 value.
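To see the four groups concretely, here is a quick standalone check of the pattern above against the first sample message:
val rePattern = """(^\{.*)("column3":)(.*)(,"column4":.*)""".r

"""{"column1":"abc","column2":"123","column3":qwe"r"ty,"column4":"abc123"}""" match {
  case rePattern(start, col3, col3Value, end) =>
    // start     = {"column1":"abc","column2":"123",
    // col3      = "column3":
    // col3Value = qwe"r"ty
    // end       = ,"column4":"abc123"}
    println(start + col3 + '"' + col3Value.replaceAll("\"", "'") + '"' + end)
  case _ => println("no match")
}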
Hope it helps.

MongoDB - Contains (LIKE) query on concatenated field

I am new to MongoDB.
I am writing an application with Spring Data and MongoDB, and I have one class with two fields: firstname and lastname.
I need one query for documents that contain one string in the full name (firstname + lastname).
For example: firstname = "Hansen David", lastname = "Gonzalez Sastoque" and I have a query to find David Gonzalez. In this example I expect there to be a match.
Concatenating the two strings would solve it, but I don't know how to do that in a query.
Create a new array field (call it names) in the document and in that array put each name split by space. In your example the array would have the following contents:
hansen
david
gonzalez
sastoque
(make them all lower case to avoid case-sensitivity issues)
Before you do your query, convert your input to lower case and split it by spaces as well.
Now, you can use the $all operator to achieve your objective:
db.persons.find( { names: { $all: [ "david", "gonzalez" ] } } );
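Since the question mentions Spring Data, the same $all query can also be expressed there. A hypothetical sketch (the Person class and a wired mongoTemplate are assumptions):
import org.springframework.data.mongodb.core.query.{Criteria, Query}

// Equivalent of: db.persons.find({ names: { $all: ["david", "gonzalez"] } })
val terms = "David Gonzalez".toLowerCase.split("\\s+")
val query = Query.query(Criteria.where("names").all(terms: _*))
// val results = mongoTemplate.find(query, classOf[Person])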
You can use the $where modifier in your queries:
db.users.findOne({$where: %JavaScript to match the document%})
In your case it may look like this:
db.users.findOne({$where: "this.firstname + ' ' + this.lastname == 'Gonzalez Sastoque'"})
or this:
db.users.findOne({$where: "this.firstname.match(/Gonzalez/) && this.lastname.match(/Sastoque/)"})
My last example does exactly what you want.
Update: Try following code:
db.users.findOne({$where: "(this.firstname + ' ' + this.lastname).match('David Gonzalez'.replace(' ', '( .*)? '))"})
You should split your full name into a first name and a last name, then do your query on both fields, using the appropriate MongoDB query selectors.

Scala Play Framework Anorm SQL.on disable wrapping replacements with ' '

Whenever I replace placeholders in the SQL query using on, it surrounds the replacement with quotes (''). Is there a way to prevent this?
It means I can't do things like
SQL("SELECT * FROM {table} blah").on("table" -> tabletouse)
because it wraps the table name with '' which causes an SQL syntax error.
You could certainly combine both approaches, using the format function for the data you don't want to be escaped:
SQL(
  """
    select %s from %s
    where
      name = {name} and
      date between {start} and {end}
    order by %s
  """.format(fields, table, order)
).on(
  'name -> name,
  'start -> startDate,
  'end -> endDate
)
Just take into account that the data you send through the format function should NOT come from user input; otherwise it must be properly sanitized.
You cannot do what you are trying. Anorm's replacement is based on PreparedStatements, meaning all data is automatically escaped, so you cannot use replacement for:
table names,
column names,
whatever operand, SQL keyword, etc.
The best you can do here is string concatenation (which is really a bad approach, in my opinion):
SQL("SELECT * FROM " + tabletouse + " blah").as(whatever *)
PS: Check out this question about table names in PreparedStatements.
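If a dynamic table name is unavoidable, a simple whitelist check before concatenating keeps the injection risk contained. A minimal sketch (the allowed set and the row parser from the answer above are assumptions):
// Guard the concatenation with a whitelist of known-good table names.
val allowedTables = Set("users", "orders")
require(allowedTables.contains(tabletouse), s"unexpected table name: $tabletouse")
SQL("SELECT * FROM " + tabletouse + " blah").as(whatever *)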