str_detect for multiple patterns - stringr

I am using str_detect within the stringr package and I am having trouble searching a string with more than one pattern.
Here is the code I am using, however it is not returning anything even though my vector ("Notes-Title") contains these patterns.
filter(str_detect(`Notes-Title`, c("quantity","single")))
The logic I want to code is:
Search each row and filter it if it contains the string "quantity" or "single".

You need to use the | separator in your search, all within one set of "".
> words <- c("quantity", "single", "double", "triple", "awful")
> set.seed(1234)
> df = tibble(col = sample(words,10, replace = TRUE))
> df
# A tibble: 10 x 1
col
<chr>
1 triple
2 single
3 awful
4 triple
5 quantity
6 awful
7 triple
8 single
9 single
10 triple
> df %>% filter(str_detect(col, "quantity|single"))
# A tibble: 4 x 1
col
<chr>
1 single
2 quantity
3 single
4 single

Related

how to find max value from multiple columns in dataframe in spark [duplicate]

This question already has an answer here:
Scala/Spark dataframes: find the column name corresponding to the max
(1 answer)
Closed 3 years ago.
I have input spark dataframe as
sample A B C D
1 1 3 5 7
2 6 8 10 9
3 6 7 8 1
I need to find the max among A,B,C,D columns which are subject marks.
I need to create a new dataframe with max_marks as the new column.
sample A B C D max_marks
1 1 3 5 7 7
2 6 8 10 9 10
3 6 7 8 1 8
I have done this using scala as
val df = df.columns.toSeq
val df1=df.foldLeft(df){(df,colName)=> df.withColumn("max_sub",max((colName)))
df.show()
I am getting an error message
"main" org.apache.spark.sql.AnalysisException:grouping expression sequence is empty
this dataframe has about 100 columns so how to iterate over this dataframe
It would be helpful to iterate over the data frame as the columns where the mean has to be found out are about 10 out of 100 column dataframe with about 10000 records
I am looking to dynamically pass the columns without giving the column names manually which means to loop over the columns that i choose and perform any mathematical operation
There are many ways to accomplish this one of the ways would be using map.
Simple pseudo code to do what you want (It wont work in anyway but I think the idea is clear)
df = df.withColumn("max_sub", "A")
df.map({x=> {
max = "A"
maxVal = 0
for col in x{
if(col != "max_sub" && x.col > maxVal){
max = col
maxVal = x.col
}
}
x.max_sub = max
x
})

Create new binary column based off of join in spark

My situation is I have two spark data frames, dfPopulation and dfSubpopulation.
dfSubpopulation is just that, a subpopulation of dfPopulation.
I would like a clean way to create a new column in dfPopulation that is binary of whether the dfSubpopulation key was in the dfPopulation key. E.g. what I want is to create the new DataFrame dfPopulationNew:
dfPopulation = X Y key
1 2 A
2 2 A
3 2 B
4 2 C
5 3 C
dfSubpopulation = X Y key
1 2 A
3 2 B
4 2 C
dfPopulationNew = X Y key inSubpopulation
1 2 A 1
2 2 A 0
3 2 B 1
4 2 C 1
5 3 C 0
I know this could be down fairly simply with a SQL statement, however given that a lot of Sparks optimization is now using the DataFrame construct, I would like to utilize that.
Using SparkSQL compared to DataFrame operations should make no difference from a performance perspective, the execution plan is the same. That said, here is one way to do it using a join
val dfPopulationNew = dfPopulation.join(
dfSubpopulation.withColumn("inSubpopulation", lit(1)),
Seq("X", "Y", "key"),
"left_outer")
.na.fill(0, Seq("inSubpopulation"))

OpenRefine: Fill down with increasing counter

Is it possible in OpenRefine to fill down blank cells with a counter instead of copying the top non-blank value?
In this example image:
Or here the same example as typed text - image this as a column from top to bottom:
1
1
blank
1
blank
blank
blank
blank
blank
1
I would like to see the column filled as follows (again, imagine top to bottom):
1
1
2
1
2
3
4
5
6
1
Thanks, help is very much appreciated.
It's not really simple. You have to:
1 Replace the blanks with something else, such as an "x"
2 Create a unique record for the entire dataset
3 Use this Jython script:
import itertools
data = row['record']['cells']['YOUR COLUMN NAME']['value']
x = itertools.count(2)
liste = []
for i, el in enumerate(data):
if data[i] == "x":
liste.append(x.next())
else:
x = itertools.count(2)
liste.append(el)
return ",".join([str(x) for x in liste])
4 Use Blank down to clear duplicates
5 Split the first multivalued cell.
Here is a screencast of the operations described above.
If you know a little Python, you can also transform your file using pandas. I do not know what is the most elegant way to do it, but this script should work.
import itertools
import pandas as pd
x = itertools.count(2)
def set_x():
global x
x = itertools.count(2)
set_x()
def increase(value):
if not value:
return next(x)
else:
set_x()
return value
data = pd.read_csv("your_file.csv", na_values=['nan'], keep_default_na=False)
data['column 1'] = data['column 1'].apply(lambda row: increase(row))
print(data)
data.to_csv("final_file.csv")
Here are two simple solutions using GREL.
Use records
You could move the column to the beginning, telling OpenRefine to use the numbers as records. You might need to transform the column to text to really convince OpenRefine to use it as records.
Then either add a new column or transform the existing one with the following expression.
1 + row.index - row.record.fromRowIndex
Use record markers
In case you don't want to use records or don't have a static number, you can create a similar setup. Imagine you have an incomplete counter like in the following table and want to fill it.
Origin
Desired
1
1
2
1
1
2
2
3
1
1
To fill the missing cells first add a new column based on your orignal column using the following expression and name it record_row_index.
if(isNonBlank(value), row.index, "")
After that fill down the original column and the new column record_row_index.
Then create a new column based on the original filled column using the following expression.
value + row.index - cells["record_row_index"].value
Hint: the expression is expecting both columns to be of type number.
If one of them is of type text, you can either transform the column beforehand or use toNumber() in the expression.
The following table shows how these operations are working together.
Origin
Origin filled
row.index
record_row_index
Desired
1
1
0
0
1 + 0 - 0 = 1
1
1
0
1 + 1 - 0 = 2
1
1
2
2
1 + 2 - 2 = 1
2
2
3
3
2 + 3 - 3 = 2
2
4
3
2 + 4 - 3 = 3
1
1
5
5
1 + 5 - 5 = 1

How can I quickly extract rows based on content in a specific content in a fixed width text file using command line tools?

I have a large text file (> 4 gb) which is in fixed-width format. I want to get a subset of that file based on content in specific columns. What would be the fastest way to do this?
For example the file will have the following format:
Column width 1 = 3
Column width 2 = 3
Column width 3 = 2
Column width 4 = 2
Column width 5 = 1
Column width 6 = 2
Column width 7 = 2
Column width 8 = 2
Colwidth 9 = 2
And a line of the file might look like:
150-9912 17 7 1 0 0
If I wanted to search based on the values of column 2 (e.g. where value of column 2 == -99), what would be the most efficient way to do this? I have multiple files ~ 4GB in size with close to 10 million lines in each file. Appreciate the help!
Using GNU awk:
awk 'BEGIN{FIELDWIDTHS="3 3 2 2 1 2 2 2 2"} $2==-99'
The above will get you well on the way.

select rows by comparing columns using HDFStore

How can I select some rows by comparing two columns from hdf5 file using Pandas? The hdf5 file is too big to load into memory. For example, I want to select rows where column A and columns B is equal. The dataframe is save in file 'mydata.hdf5'. Thanks.
import pandas as pd
store = pd.HDFstore('mydata.hdf5')
df = store.select('mydf',where='A=B')
This doesn't work. I know that store.select('mydf',where='A==12') will work. But I want to compare column A and B. The example data looks like this:
A B C
1 1 3
1 2 4
. . .
2 2 5
1 3 3
You cannot directly do this, but the following will work
In [23]: df = DataFrame({'A' : [1,2,3], 'B' : [2,2,2]})
In [24]: store = pd.HDFStore('test.h5',mode='w')
In [26]: store.append('df',df,data_columns=True)
In [27]: store.select('df')
Out[27]:
A B
0 1 2
1 2 2
2 3 2
In [28]: store.select_column('df','A') == store.select_column('df','B')
Out[28]:
0 False
1 True
2 False
dtype: bool
This should be pretty efficient.