Sublime Text 2: increment numbers

I have a JSON file that looks like this:
"Algeriet" :
[
{
"name" : "Nyårsdagen",
"date" : "2013-01-01",
"ID" : "1"
},
{
"name" : "Mawlid En Nabaoui Echarif",
"date" : "2013-01-24",
"ID" : "2"
},
{
"name" : "Första maj",
"date" : "2013-05-01",
"ID" : "3"
},
...
]
Now I would like to begin incrementing the IDs from 0 instead of 1. How can I do this with Sublime Text 2? I have installed the Text Pastry plugin but I'm not sure how to select the IDs in the text so that I can replace these values.

Solved it by doing these steps (a scripted equivalent is sketched below):
Do a find and replace for the regex "ID" : "\d+", replacing it with a string which I know does not exist anywhere in the file (I replaced it with "ID" : "xyz").
Make a multiple selection on "ID" : "xyz".
Use the Text Pastry plugin's "Number Sequence (\i)" command on the multiple selection.
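For reference, the same renumbering can also be scripted outside the editor. Below is a minimal Scala sketch that uses the same regex; the file names are hypothetical, and it is not part of the plugin-based workflow above.
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Hypothetical input/output file names.
val inPath  = Paths.get("holidays.json")
val outPath = Paths.get("holidays_renumbered.json")

val text = new String(Files.readAllBytes(inPath), StandardCharsets.UTF_8)

// Same pattern as the find-and-replace step above: "ID" : "<digits>"
val idPattern = "\"ID\" : \"\\d+\"".r

// Hand out 0, 1, 2, ... in document order, mirroring the
// "Number Sequence" step of the Text Pastry plugin.
val counter = Iterator.from(0)
val renumbered = idPattern.replaceAllIn(text, _ => s"\"ID\" : \"${counter.next()}\"")

Files.write(outPath, renumbered.getBytes(StandardCharsets.UTF_8))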

You can use Increment Selection. Just press Ctrl+Alt+I over multiple selections of numbers.
You will need Package Control first. All of this shouldn't take more than 30 seconds to install.
Steps:
Install Package Control.
Open Command Palette: Ctrl+Shift+P (Windows/Unix) / Command+Shift+P (Mac).
Type Install Package Control and click to install.
Install Increment Selection package.
Open Command Palette again.
Type Package Control: Install Package, click on it and wait a short period.
Type Increment Selection and click on it to install.
Select multiple numbers and press Ctrl+Alt+I (Windows/Unix) / Command+Control+I (Mac).
Other examples
Increment Selection can also prefix numbers with leading zeroes, increment letters, increment by a step, generate line numbers and more.
[1] text [1] text [1] -> 1| text 2| text 3|
[a] text [a] text [a] -> a| text b| text c|
[01] text [01] text [01] -> 01| text 02| text 03|
[05,3] text [05,3] text [05,3] -> 05| text 08| text 11|
[5,-1] text [5,-1] text [5,-1] -> 5| text 4| text 3|
[#] line -> 1| line
[#] line -> 2| line
[#] line -> 3| line
[#] line -> 4| line
[#] line -> 5| line
Hint: [] stands for a selection, | stands for a caret.

With the new add-text-with-sequence feature of the Text Pastry plugin, it takes one step fewer:
Find all "ID" : "\d+" (with regex search activated)
In the Text Pastry Command Line, enter as "ID" : "0"

Related

How to check if at least one element of a list is included in a text column?

I have a data frame with a column containing text and a list of keywords. My goal is to build a new column showing if the text column contains at least one of the keywords. Let's look at some mock data:
test_data = [('1', 'i like stackoverflow'),
             ('2', 'tomorrow the sun will shine')]
test_df = spark.sparkContext.parallelize(test_data).toDF(['id', 'text'])
With a single keyword ("sun") the solution would be:
from pyspark.sql import functions as F

test_df.withColumn(
    'text_contains_keyword', F.array_contains(F.split(test_df.text, ' '), 'sun')
).show()
The word "sun" is included in the second row of the text column, but not in the first. Now, let's say I have a list of keywords:
test_keywords = ['sun', 'foo', 'bar']
How can I check, for each of the words in test_keywords, whether it is included in the text column? Unfortunately, if I simply replace "sun" with the list, it leads to this error:
Unsupported literal type class java.util.ArrayList [sun, foo, bar]
You can do that using the built-in rlike function with the following code (a Scala equivalent is sketched after the output below).
from pyspark.sql import functions

# Match any keyword as a whole word: preceded by the start of the string
# or whitespace, and followed by whitespace or the end of the string.
test_df = test_df.withColumn(
    "text_contains_word",
    functions.col('text').rlike(r'(^|\s)(' + '|'.join(test_keywords) + r')(\s|$)')
)
test_df.show()
+---+--------------------+------------------+
| id| text|text_contains_word|
+---+--------------------+------------------+
| 1|i like stackoverflow| false|
| 2|tomorrow the sun ...| true|
+---+--------------------+------------------+
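For readers following the Scala examples elsewhere on this page, roughly the same approach in Scala could look like the sketch below. This is an untested illustration: testDf and testKeywords are assumed to mirror the PySpark variables above, and Pattern.quote is used so that keywords containing regex metacharacters are matched literally.
import java.util.regex.Pattern
import org.apache.spark.sql.functions.col

val testKeywords = Seq("sun", "foo", "bar")

// Quote each keyword so characters like '.' or '+' are matched literally,
// then require a whole-word match as in the answer above.
val pattern = "(^|\\s)(" + testKeywords.map(Pattern.quote).mkString("|") + ")(\\s|$)"

val result = testDf.withColumn("text_contains_word", col("text").rlike(pattern))
result.show()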

Spark DataFrameWriter omits trailing tab delimiters when saving (Spark 1.6)

I am leaving my question below as it was originally posted for the sake of future developers who run into this problem. The issue was resolved once I moved to Spark 2.0, i.e. the output was as I expected without making any changes to my original code. It looks like some implementation difference in the 1.6 version I used at first.
I have Spark 1.6 Scala code that reads a TSV (CSV with tab delimiter) and writes it to TSV output (without changing the input - just filtering the input).
The input data sometimes has null values in the last columns of a row.
When I use the delimiter "," the output has trailing commas.
E.g.
val1, val2, val3,val4,val5
val1, val2, val3,,
but if I use tab (\t) as the delimiter the output does not include the trailing tabs. E.g. (I am writing TAB here where \t appears):
val1 TAB val2 TAB val3 TAB val4 TAB val5
val1 TAB val2 TAB val3 <= **here I expected two more tabs (as with the comma delimiter)**
I also tried other delimiters and saw that when the delimiter is a whitespace character (e.g. the ' ' character) the trailing delimiters are not in the output.
If I use another visible delimiter (e.g. the letter 'z') it works fine, as with the comma separator, and I have trailing delimiters.
I thought this might have to do with the options ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace but setting them both to false when writing didn't help either.
My code looks like this:
val df = sqlContext.read.format("com.databricks.spark.csv").option("delimiter", "\t").load(params.inputPathS3)
val df_filtered = df.filter(...)
df_filtered.write.format("com.databricks.spark.csv").option("delimiter", "\t").save(outputPath)
I also tried (as I wrote above):
df_filtered.write.format("com.databricks.spark.csv").option("delimiter", "\t").option("ignoreLeadingWhiteSpace", "false").option("ignoreTrailingWhiteSpace", "false").save(outputPath)
Below is a working example (with Spark 1.6):
Input file (with some trailing spaces at the end):
1,2,3,,
scala> val df = sqlContext.read.option("ignoreLeadingWhiteSpace", "false").option("ignoreTrailingWhiteSpace", "false").format("com.databricks.spark.csv").option("delimiter", ",").load("path")
df: org.apache.spark.sql.DataFrame = [C0: string, C1: string, C2: string, C3: string, C4: string]
scala> df.show
+---+---+---+---+---+
| C0| C1| C2| C3| C4|
+---+---+---+---+---+
| 1| 2| 3| | |
+---+---+---+---+---+
scala> df.write.option("nullValue", "null").option("quoteMode", "ALL").mode("overwrite").format("com.databricks.spark.csv").option("delimiter", "\t").save("path")
scala> sqlContext.read.format("com.databricks.spark.csv").option("delimiter", "\t").load("path").show
+---+---+---+---+---+
| C0| C1| C2| C3| C4|
+---+---+---+---+---+
| 1| 2| 3| | |
+---+---+---+---+---+
Please refer to the databricks spark-csv documentation for all of the options available when reading and writing with the library.
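Since the asker reports the behaviour changed once they moved to Spark 2.0, here is a minimal, hypothetical sketch of the same round trip with the built-in CSV source of Spark 2.x (the paths are placeholders and the filter is left as a pass-through); the quoteAll write option plays the same role as the quoteMode "ALL" workaround above, forcing every field to be quoted so trailing empty columns survive.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("tsv-roundtrip").getOrCreate()

// Read the tab-separated input (placeholder path).
val df = spark.read.option("sep", "\t").csv("/path/to/input")

// Apply the real filter(...) here; kept as a pass-through in this sketch.
val dfFiltered = df

// quoteAll quotes every field, so empty trailing columns still produce
// the expected trailing delimiters in the output.
dfFiltered.write
  .mode("overwrite")
  .option("sep", "\t")
  .option("quoteAll", "true")
  .csv("/path/to/output")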

Multiple regex replace together in scala

A function in Scala receives as input a dataframe that has a column named vin.
The column has values in the format below:
1. UJ123QR8467
2. 0UJ123QR846
3. /UJ123QR8467
4. -UJ123QR8467
and so on.
The requirement is to clean the column vin based on the following rules:
1. replace the characters "/_- with ""
2. replace the first 0 with ""
3. if the value is more than 10 characters then set the value to NULL.
I would like to know if there is any simplified way to achieve the above.
I can only think of doing multiple .withColumn calls with a regex replace each time.
I would combine all regex-related changes into a single transformation and the length condition into another, as shown below:
import org.apache.spark.sql.functions._
// assumes spark-shell, where the implicits needed for toDF and $ are already imported

val df = Seq(
  "UJ123QR8467", "00UJ123QR84", "/UJ123QR8467",
  "-UJ123QR8467", "UJ0123QR84", "UJ123-QR_846"
).toDF("vin")

df.
  withColumn("vin2", regexp_replace($"vin", "^[0]|[/_-]", "")).
  withColumn("vin2", when(length($"vin2") <= 10, $"vin2")).
  show
// +------------+----------+
// | vin| vin2|
// +------------+----------+
// | UJ123QR8467| null|
// | 00UJ123QR84|0UJ123QR84|
// |/UJ123QR8467| null|
// |-UJ123QR8467| null|
// | UJ0123QR84|UJ0123QR84|
// |UJ123-QR_846|UJ123QR846|
// +------------+----------+
Note that I've slightly expanded the sample dataset to cover cases such as non-leading 0, [/_-].
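If the literal double quote from rule 1 ("/_-) also needs to be stripped, the character class can be extended; this is a small, untested variation on the transformation above rather than part of the original answer.
// Also strip double quotes; the hyphen stays last in the character class
// so it is treated literally.
df.withColumn("vin2", regexp_replace($"vin", "^0|[\"/_-]", ""))
  .withColumn("vin2", when(length($"vin2") <= 10, $"vin2"))
  .show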

Spark dataframe - replace all values from key/value list in Scala

I've found some similar solutions, but none that accomplish exactly what I want to do. I have a set of key/value pairs that I want to use for string substitution. e.g.
val replacements = Map( "STREET" -> "ST", "STR" -> "ST")
I am reading a table into a dataframe, and I would like to modify a column to replace all instances of the key in my map with their values. So in the above map, look at the "street" column and replace all values of "STREET" with "ST" and all values of "STR" with "ST" etc.
I've been looking at some foldLeft implementations, but haven't been able to finagle it into working.
A basic solution would be great, but an optimal solution would be something I could plug into a Column function that someone wrote that I was hoping to update. Specifically a line like this:
val CleanIt: Column = trim(regexp_replace(regexp_replace(regexp_replace(colName," OF "," ")," AT "," ")," AND "," "))
You can create this helper method that transforms a given column and a map of replacements into a new Column expression:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.regexp_replace

def withReplacements(column: Column, replacements: Map[String, String]): Column =
  replacements.foldLeft[Column](column) {
    case (col, (from, to)) => regexp_replace(col, from, to)
  }
Then use it on your street column with your replacements map:
val result = df.withColumn("street", withReplacements($"street", replacements))
For example:
df.show()
// +------------+------+
// | street|number|
// +------------+------+
// | Main STREET| 1|
// |Broadway STR| 2|
// | 1st Ave| 3|
// +------------+------+
result.show()
// +-----------+------+
// | street|number|
// +-----------+------+
// | Main ST| 1|
// |Broadway ST| 2|
// | 1st Ave| 3|
// +-----------+------+
NOTE: the keys in the map must be valid regular expressions. That means, for example, that if you want to replace the string "St." with "ST", you should use Map("St\\." -> "ST") (escaping the dot, which would otherwise be interpreted as the regex "any character").
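To plug the helper into the CleanIt expression mentioned in the question, one possibility is sketched below. Here colName stands in for the Column the original line used, and an insertion-ordered ListMap is chosen so that the longer STREET pattern is applied before STR (each key is still a regular expression).
import scala.collection.immutable.ListMap
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, trim}

// Stand-in for the Column used in the original CleanIt line.
val colName: Column = col("street")

// Insertion-ordered map: overlapping keys such as STREET and STR are applied
// longest-first, so "Main STREET" becomes "Main ST" rather than "Main STEET".
val allReplacements = ListMap(
  "STREET" -> "ST", "STR" -> "ST",
  " OF " -> " ", " AT " -> " ", " AND " -> " "
)

val CleanIt: Column = trim(withReplacements(colName, allReplacements))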

fetch more than 20 rows and display full value of column in spark-shell

I am using CassandraSQLContext from spark-shell to query data from Cassandra. I want to know two things: first, how to fetch more than 20 rows using CassandraSQLContext, and second, how do I display the full value of a column? As you can see below, by default it appends dots to long string values.
Code :
val csc = new CassandraSQLContext(sc)
csc.setKeyspace("KeySpace")
val maxDF = csc.sql("SQL_QUERY" )
maxDF.show
Output:
+--------------------+--------------------+-----------------+--------------------+
| id| Col2| Col3| Col4|
+--------------------+--------------------+-----------------+--------------------+
|8wzloRMrGpf8Q3bbk...| Value1| X| K1|
|AxRfoHDjV1Fk18OqS...| Value2| Y| K2|
|FpMVRlaHsEOcHyDgy...| Value3| Z| K3|
|HERt8eFLRtKkiZndy...| Value4| U| K4|
|nWOcbbbm8ZOjUSNfY...| Value5| V| K5|
If you want to print the whole value of a column, in Scala you just need to set the truncate argument of the show method to false:
maxDF.show(false)
and if you wish to show more than 20 rows:
// example showing 30 rows of
// maxDF untruncated
maxDF.show(30, false)
For PySpark, you'll need to specify the argument name:
maxDF.show(truncate=False)
Alternatively, maxDF.take(50) also fetches more rows, but you won't get them in a nice tabular form; instead they are returned as a Scala Array of Row objects:
maxDF.take(50)