How to create subset groups by identifying unique strings within column names

I have a dataframe:
`df <- data.frame(Year = 1940:2000,
sp99_002_04A_01 = rnorm(61, 1:100),
sp99_002_04B_01 = rnorm(61, 1:100),
sp99_002_05A_01 = rnorm(61, 1:100),
sp99_006_01A_14 = rnorm(61, 1:100),
sp99_023a_02B_06 = rnorm(61, 1:100),
sp99_023a_05B_06 = rnorm(61, 1:100),
sp99_010_03B_03 = rnorm(61, 1:100))`
Each column name is formatted as speciesyear_plot#(subset)_sample#_trial#, as shown above.
I need to group the columns by plot number for further analysis. That means grouping all columns that share the same character string in the xxxx_THIS_xxxx_xx position of the column name, without having to call each plot # by name.

I ended up doing this:
`library(data.table)
mydt <- fread("data.csv")
mydt <- melt.data.table(mydt, id.vars = "Year", value.name = "measure",
                        variable.name = "ID")
mydt <- na.omit(mydt)
mydt[, ID := as.character(ID)]
# the plot number is the second "_"-separated field of each column name
splitNames <- lapply(mydt$ID, FUN = function(x) {
  splitUpID <- strsplit(x, split = "_")
  plot <- splitUpID[[1]][2]
  return(plot)
})
plots <- unlist(splitNames)
mydt[, plot := plots]`
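
For comparison, the same grouping idea can be sketched outside R. The following is a minimal Python sketch, not part of the original post. It assumes every data column has the four "_"-separated fields described above and treats the second field as the plot number; the function name group_columns_by_plot is made up for illustration.

```python
from collections import defaultdict


def group_columns_by_plot(column_names):
    """Group column names by their second "_"-separated field (the plot number)."""
    groups = defaultdict(list)
    for name in column_names:
        parts = name.split("_")
        # Assumption: data columns have exactly four fields; anything else
        # (such as "Year") is treated as an id column and skipped.
        if len(parts) != 4:
            continue
        groups[parts[1]].append(name)
    return dict(groups)


columns = [
    "Year",
    "sp99_002_04A_01", "sp99_002_04B_01", "sp99_002_05A_01",
    "sp99_006_01A_14",
    "sp99_023a_02B_06", "sp99_023a_05B_06",
    "sp99_010_03B_03",
]
groups = group_columns_by_plot(columns)
for plot, names in groups.items():
    print(plot, names)
```

Each value in groups is the list of columns for one plot, so a later analysis step can loop over the plots without naming any of them.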


stats::step failed in function because can't find the data in lm object

Hi everyone!
I tried using the step function inside my own function, but it seems that step only looks for the data in the global environment and not among the variables inside the function.
Here is my example code:
library(tidyverse)
# simple test function
my_step_function <- function(model_data, formula) {
  mod <- lm(formula, model_data, x = TRUE, y = TRUE)
  step_mod <- step(mod, direction = "both", trace = FALSE)
  summary(step_mod)
}
# test data
test <- tibble(
  x1 = 1:100,
  x2 = -49:50 + 9 * rnorm(100),
  x3 = 50 + 5 * rnorm(100),
  x4 = 10 * rnorm(100),
  x5 = sqrt(1:100),
  y = 5 * x1 + 2 * x2 + 10 * x5 + rnorm(100)
) %>% nest(data = everything())
# does not work inside map(), which is where I first found the problem
test %>%
  mutate(RW = map(
    data,
    ~ my_step_function(.x, formula = formula(y ~ .))
  ))
# error: can't find object 'model_data'
# does not work when called directly
my_step_function(test$data[[1]], formula = (y ~ .))
# error: can't find object 'model_data'
# still does not work when the data is given its own variable name
test_data <- test$data[[1]]
my_step_function(test_data, formula = (y ~ .))
# error: can't find object 'model_data'
# works when the global variable has the same name as the function's argument
model_data <- test$data[[1]]
my_step_function(model_data, formula = (y ~ .))
# success!
I would appreciate it if someone could solve my puzzle! Thanks, everyone!

How to remove a single number from a list with multiples of that number

As I'm a beginner in coding, I wanted to try to find the first three repeated numbers in a list. My problem is that when a number is repeated three times, my code breaks.
The usual remove, pop, and del don't work, as they each delete only one element from the list.
import random
r = random.randint
string = ""
def first_repeat(myList):
    myList = sorted(list(myList))
    print(myList)
    number = 0
    final_numbers = []
    loop = 0
    while loop < 2:
        try:
            if number == 0:
                number += 1
            else:
                if myList[loop] == myList[loop-1]:
                    final_numbers.append(myList[loop])
                else:
                    myList.pop(loop)
                    myList.pop(loop-1)
                    number = 0
            if loop == 0:
                loop += 1
            else:
                loop -= 1
            if len(final_numbers) > 3:
                return final_numbers[0], final_numbers[1], final_numbers[2]
            if len(myList) <= 1:
                loop += 2
        except:
            continue
    return final_numbers
for n in range(20):
    string = string+str(r(0,9))
print(first_repeat(string))
The expected result should be the first three repeated numbers.
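On the remove/pop/del point: each of those deletes a single element, so dropping every copy of a value is usually done by building a new list instead. A minimal sketch follows; the helper name remove_all is hypothetical and does not come from the code above.

```python
def remove_all(items, value):
    """Return a new list with every occurrence of value removed."""
    return [item for item in items if item != value]


digits = ["1", "3", "3", "3", "7", "3"]
without_threes = remove_all(digits, "3")
print(without_threes)
```

The original list is left untouched, which avoids the index shifting that happens when elements are popped while looping.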
I added some print statements so you can step through your program and find out where the logic goes wrong.
import random
r = random.randint
string = ""
def first_repeat(myList):
    myList = sorted(list(myList))
    print(myList)
    number = 0
    final_numbers = []
    loop = 0
    while loop < 2:
        print( 'inside while loop: loop = {}'.format( loop ))
        try:
            if number == 0:
                number += 1
            else:
                if myList[loop] == myList[loop-1]:
                    print( 'in -> if myList[loop] == myList[loop-1]' )
                    final_numbers.append(myList[loop])
                    print( 'final_numbers: [{}]'.format( ','.join( final_numbers )))
                else:
                    print( 'in first -> else' )
                    myList.pop(loop)
                    myList.pop(loop-1)
                    number = 0
                    print( 'myList: [{}]'.format( ','.join( myList ) ))
            if loop == 0:
                loop += 1
            else:
                loop -= 1
            if len(final_numbers) > 3:
                print( 'returning final numbers' )
                print( final_numbers )
                return final_numbers[0], final_numbers[1], final_numbers[2]
            if len(myList) <= 1:
                loop += 2
        except:
            continue
        print( 'at end of this loop final numbers is: [{}]'.format( ','.join( final_numbers)))
        print( 'press any key to continue loop: ')
        input()
    return final_numbers
for n in range(20):
    string = string+str(r(0,9))
print(first_repeat(string))
The following is a method that takes advantage of Python's defaultdict:
https://docs.python.org/2/library/collections.html#collections.defaultdict
# import defaultdict to keep track of number counts
import random
from collections import defaultdict

# changed parameter name since you are passing in a string, not a list
def first_repeat( numbers_string ):
    # create a dictionary - defaultdict( int ) is a dictionary whose missing
    # keys start at 0 (instead of raising a KeyError)
    number_count = defaultdict( int )
    # convert your string to a list of integers - look up list comprehensions
    numbers = [ int( s ) for s in numbers_string ]
    # to store the repeated numbers
    first_three_repeats = []
    for number in numbers:
        # for each number in the list, increment when it is seen
        number_count[number] += 1
        # on the second occurrence of a number, record it as a repeat
        if number_count[number] == 2:
            first_three_repeats.append( number )
            if len( first_three_repeats ) == 3:
                return first_three_repeats
    # if here - fewer than three repeated numbers were found
    return []

string = ""
for n in range(20):
    string = string+str(random.randint(0,9))
print( first_repeat( string ))
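To check the counting idea by hand, here is a small sketch that uses a fixed digit string instead of random input. It is a variant, not the answer's exact code: it tracks a set of digits already seen rather than a defaultdict of counts, and it returns whatever repeats it found even when there are fewer than three. The name first_three_repeats is an assumption for illustration.

```python
def first_three_repeats(digits):
    """Return up to three digits, in the order each is first seen a second time."""
    seen = set()
    repeats = []
    for ch in digits:
        number = int(ch)
        # A digit counts as a repeat the first time it shows up again.
        if number in seen and number not in repeats:
            repeats.append(number)
            if len(repeats) == 3:
                break
        seen.add(number)
    return repeats


result = first_three_repeats("1231245536")
print(result)
```

Walking through "1231245536": the second 1, the second 2 and the second 5 are the first three repeats, so the loop stops before reaching the final 3 and 6.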

Code not training fast. I gave 3500000 rows of input as 'data.csv' and the system hung. Even after 24 hours there was no output

I am trying to return the category of the input data. The training data is 'data.csv', which has 3500000 rows, each a sentence and its class.
import nltk
from nltk.stem.lancaster import LancasterStemmer
import os
import csv
import json
import datetime
stemmer = LancasterStemmer()
training_data = []
with open('data.csv') as f:
    training_data = [{k: str(v) for k, v in row.items()}
                     for row in csv.DictReader(f, skipinitialspace=True)]
words = []
classes = []
documents = []
ignore_words = ['?','.','_','-'] #words to be ignored in input data file
for pattern in training_data:
    w = nltk.word_tokenize(pattern['sentence'])
    words.extend(w)
    documents.append((w, pattern['class']))
    if pattern['class'] not in classes:
        classes.append(pattern['class'])
words = [stemmer.stem(a.lower()) for a in words if a not in ignore_words]
words = list(set(words)) #remove duplicates
classes = list(set(classes))
# create our training data
training = []
output = []
output_empty = [0] * len(classes)
for doc in documents:
    # initialize our bag of words
    bag = []
    # list of tokenized words for the pattern
    pattern_words = doc[0]
    # stem each word
    pattern_words = [stemmer.stem(word.lower()) for word in pattern_words]
    for w in words:
        bag.append(1) if w in pattern_words else bag.append(0)
    training.append(bag)
    output_row = list(output_empty)
    output_row[classes.index(doc[1])] = 1
    output.append(output_row)
import numpy as np
import time
def sigmoid(x):
    output = 1/(1+np.exp(-x))
    return output
def sigmoid_output_to_derivative(output):
    return output*(1-output)
def clean_up_sentence(sentence):
    sentence_words = nltk.word_tokenize(sentence)
    sentence_words = [stemmer.stem(word.lower()) for word in sentence_words]
    return sentence_words
def bow(sentence, words, show_details=False):
    # tokenize the pattern
    sentence_words = clean_up_sentence(sentence)
    # bag of words
    bag = [0]*len(words)
    for s in sentence_words:
        for i,w in enumerate(words):
            if w == s:
                bag[i] = 1
    return(np.array(bag))
# returns the calculated value of the output after multiplying with the sigmoids
def think(sentence, show_details=False):
    x = bow(sentence.lower(), words, show_details)
    # input layer is our bag of words
    l0 = x
    # matrix multiplication of input and hidden layer
    l1 = sigmoid(np.dot(l0, synapse_0))
    # output layer
    l2 = sigmoid(np.dot(l1, synapse_1))
    return l2
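
The post does not show where the time goes, so the following is only a sketch under the assumption that the dense bag-of-words loop is the bottleneck: for each of the 3500000 documents it scans the whole vocabulary and tests `w in pattern_words` against a list. One common alternative is to map each word to a column index once and store only the indices that would be set to 1. Tokenizing and stemming are left out here (the input is assumed to be already tokenized), and build_vocab and sparse_bag are hypothetical helper names.

```python
def build_vocab(tokenized_docs):
    """Map each distinct token to a column index, in order of first appearance."""
    vocab = {}
    for tokens in tokenized_docs:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab


def sparse_bag(tokens, vocab):
    """Return the sorted column indices that would hold a 1 in the dense bag."""
    # Dictionary lookups cost roughly constant time per token, so the work
    # per document depends on its length, not on the vocabulary size.
    return sorted({vocab[tok] for tok in tokens if tok in vocab})


docs = [["hello", "world"], ["hello", "there", "hello"]]
vocab = build_vocab(docs)
bags = [sparse_bag(doc, vocab) for doc in docs]
print(vocab)
print(bags)
```

Storing indices instead of full 0/1 rows also keeps memory proportional to the number of words actually present, which matters when both the document count and the vocabulary are large.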

F# SQLProvider Columns Order Doesn't match the order in the table

I select from a PostgreSQL view/table and export the values into an Excel file.
The column order in the Excel file needs to be the same as in the table, but SQLProvider returns the columns in alphabetical order...
My code is:
module ViewToExcel
open System
open System.IO
//open Microsoft.Office.Interop.Excel
open System.Drawing
open Npgsql
open FSharp.Data.Sql
open OfficeOpenXml
open Casaubon
open NpgsqlTypes
let [<Literal>] connectionString = @"Server=localhost;Database=db;User Id=postgres;Password=;"
let [<Literal>] npgPath = @"..\packages\Npgsql.3.1.7\lib\net451"
type sqlConnection = SqlDataProvider<ConnectionString = connectionString,
                                     DatabaseVendor = Common.DatabaseProviderTypes.POSTGRESQL,
                                     ResolutionPath = npgPath,
                                     IndividualsAmount = 1000,
                                     UseOptionTypes = true>
let functionParseViewToExcel (excelPath:string, serverName:string, dbName:string) =
    /////////////////////////////////Get Data Connection///////////////////////
    printf "connect to db\n"
    let connectionUserString = @"Server="+serverName+";Database="+dbName+";User Id=postgres;Password=;"
    let ctx = sqlConnection.GetDataContext(connectionUserString)
    let weekCalcView = ctx.Public.CcVibeWeeklyCalculations
    // weekCalcView|> Seq.toList
    let weekCalcViewSeq = ctx.Public.CcVibeWeeklyCalculations|> Seq.toArray
    ////////////////////////////////// Start Excel//////////////////////////////
    let newExcelFile = FileInfo(excelPath + "cc_vibe_treatment_period_"+ DateTime.Today.ToString("yyyy_dd_MM")+".xlsx");
    if (newExcelFile.Exists) then
        newExcelFile.Delete();
    let pck = new ExcelPackage(newExcelFile);
    //Add the 'xxx' sheet
    let ws = pck.Workbook.Worksheets.Add("xxx");
    //printf "success to start the excel file\n"
    let mutable columNames = "blabla"
    for col in weekCalcViewSeq.[0].ColumnValues do
        let columnName = match col with |(a, _) -> a
        //printf "a %A\n" columnName
        let columnNamewithPsic = "," + columnName
        columNames <- columNames + columnNamewithPsic
    ws.Cells.[1, 1].LoadFromText(columNames.Replace("blabla,",""))|> ignore
    ws.Row(1).Style.Fill.PatternType <- Style.ExcelFillStyle.Solid
    ws.Row(1).Style.Fill.BackgroundColor.SetColor(Color.FromArgb(170, 170, 170))
    ws.Row(1).Style.Font.Bold <- true;
    ws.Row(1).Style.Font.UnderLine <- true;
    let mutable subject = weekCalcViewSeq.[0].StudySubjectLabel.Value // in order to color the rows according to subjects
    let mutable color = 0
    for row in 1.. weekCalcViewSeq.Length do
        let mutable columValues = "blabla"
        for col in weekCalcViewSeq.[row-1].ColumnValues do
            let columnValue = match col with |(_, a) -> a
            //printf "a %A\n" columnValue
            match columnValue with
            | null -> columValues <- columValues + "," + ""
            | _ -> columValues <- columValues + "," + columnValue.ToString()
        ws.Cells.[row + 1, 1].LoadFromText(columValues.Replace("blabla,",""))|> ignore
        /////////////////////Color the row according to subject///////////////
        if (weekCalcViewSeq.[row - 1].StudySubjectLabel.Value = subject) then
            if (color = 0) then
                ws.Row(row + 1).Style.Fill.PatternType <- Style.ExcelFillStyle.Solid
                ws.Row(row + 1).Style.Fill.BackgroundColor.SetColor(Color.FromArgb(255,255,204))
            else
                ws.Row(row + 1).Style.Fill.PatternType <- Style.ExcelFillStyle.Solid
                ws.Row(row + 1).Style.Fill.BackgroundColor.SetColor(Color.White)
        else
            subject <- weekCalcViewSeq.[row - 1].StudySubjectLabel.Value
            if (color = 0) then
                color <- 1
                ws.Row(row + 1).Style.Fill.PatternType <- Style.ExcelFillStyle.Solid
                ws.Row(row + 1).Style.Fill.BackgroundColor.SetColor(Color.White)
            else
                color <- 0
                ws.Row(row + 1).Style.Fill.PatternType <- Style.ExcelFillStyle.Solid
                ws.Row(row + 1).Style.Fill.BackgroundColor.SetColor(Color.FromArgb(255,255,204))
    pck.Save()
The fields in the Excel output are:
bloating_avg,caps_fail,caps_success,date_of_baseline_visit,discomfort_avg, etc.
But that is not the order of the columns in the table.
Could someone help me?
Thanks!
You can write a small helper function to extract the field (column) names via Npgsql. After that you can use this list of column names to create your Excel table. The getColNames function gets the names from a DataReader. You could refactor it further, for example to take the table name as a parameter.
#r @"..\packages\SQLProvider.1.0.33\lib\FSharp.Data.SqlProvider.dll"
#r @"..\packages\Npgsql.3.1.7\lib\net451\Npgsql.dll"
open System
open FSharp.Data.Sql
open Npgsql
open NpgsqlTypes
let conn = new NpgsqlConnection("Host=localhost;Username=postgres;Password=root;Database=postgres;Pooling=false")
conn.Open()
let cmd = new NpgsqlCommand()
cmd.Connection <- conn
cmd.CommandText <- """ SELECT * FROM public."TestTable1" """
let recs = cmd.ExecuteReader()
// read the column names from the reader's schema
let getColNames (recs:NpgsqlDataReader) =
    let columns = recs.GetColumnSchema() |> Seq.toList
    columns |> List.map (fun x -> x.BaseColumnName)
let colnames = getColNames recs
//val colnames : string list = ["ID"; "DT"; "ADAY"]
recs.Dispose()
conn.Dispose()
You can see that the column names are not in alphabetical order. You could use this column name list to get at the records in the correct order. Or just use the Reader object directly without the type provider.
Edit: Using records to map the table
It is also possible to extract the data, using the type provider, in the required format, by wiring up the types, and then using .MapTo<T>:
type DataRec = {
    DT:DateTime
    ADAY:String
    ID:System.Int64
    }
type sql = SqlDataProvider<dbVendor,connString2,"",resPath,indivAmount,useOptTypes>
let ctx = sql.GetDataContext()
let table1 = ctx.Public.TestTable1
let qry = query { for row in table1 do
                  select row} |> Seq.map (fun x -> x.MapTo<DataRec>())
qry |> Seq.toList
val it : DataRec list = [{DT = 2016/09/27 00:00:00;
ADAY = "Tuesday";
ID = 8L;}; {DT = 2016/09/26 00:00:00;
ADAY = "Monday";
ID = 9L;}; {DT = 2016/09/25 00:00:00;
ADAY = "Sunday";

ggvis with tooltip not working with layer_smooths

This code works as expected:
all_values <- function(x) {
  if(is.null(x)) return(NULL)
  row <- mtc[mtc$id == x$id, ]
  paste0(names(row), ": ", format(row), collapse = "<br />")
}
mtc %>% ggvis(x = ~wt, y = ~mpg, key := ~id) %>%
  layer_points() %>%
  add_tooltip(all_values, "hover")
But when I add layer_smooths(stroke := "red", se = T), the code gives me an error:
mtc %>% ggvis(x = ~wt, y = ~mpg, key := ~id) %>%
  layer_points() %>%
  layer_smooths(stroke := "red", se = T) %>%
  add_tooltip(all_values, "hover")
Error in eval(expr, envir, enclos) : object 'id' not found
Why? How can I fix it?
Thanks!
If I hadn't recognized this as an example from one of the ggvis help pages, I wouldn't have known where mtc came from. The problem seems to be that you set the key property in the ggvis() statement, but layer_smooths() evidently doesn't support it, so you need to move it into layer_points(). I got the visualization to run with the following code:
library(ggvis)
mtc <- mtcars
mtc$id <- seq_len(nrow(mtc))
all_values <- function(x)
{
  if(is.null(x)) return(NULL)
  row <- mtc[mtc$id == x$id, ]
  paste0(names(row), ": ", format(row), collapse = "<br />")
}
mtc %>% ggvis(x = ~wt, y = ~mpg) %>%
  layer_smooths(stroke := "red", se = T) %>%
  layer_points(key := ~id) %>%
  add_tooltip(all_values, "hover")
However, when you hover over the smooth or the confidence bands, all of the values associated with the variables are labeled 'character(0)' in the tooltip.