Neo4j import csv to database - import

I want to import publications from a CSV file into Neo4j, and then write a query that selects all the authors of a publication, or at least one of them.
I have a CSV file in this format:
Author,Publication
Sanjeev Saxena,Parallel Integer Sorting and Simulation Amongst CRCW Models.
Hans-Ulrich Simon,Pattern Matching in Trees and Nets.
Nathan Goodman,NP-complete Problems Simplified on Tree Schemas.
Oded Shmueli,NP-complete Problems Simplified on Tree Schemas.
Norbert Blum,On the Power of Chain Rules in Context Free Grammars.
Arnold Sch,rpern der Charakteristik 2.
nhage,rpern der Charakteristik 2.
Juha Honkala,A characterization of rational D0L power series.
Chua-Huang Huang,The Derivation of Systolic Implementations of Programs.
Christian Lengauer,The Derivation of Systolic Implementations of Programs.
I used this query:
USING PERIODIC COMMIT
LOAD CSV FROM 'file:/home/kanion/studia/bazy/clean.csv' AS line
CREATE (:Publikacja { author: line[1], title: line[2]})
and after the import I have:
http://imgur.com/3lpWM3O
So I think the authors aren't being imported. How do I deal with that?

Lists in Cypher are zero-indexed, as arrays are in most programming languages, so it should be line[0] for the author and line[1] for the title:
USING PERIODIC COMMIT
LOAD CSV FROM 'file:/home/kanion/studia/bazy/clean.csv' AS line
CREATE (:Publikacja { author: line[0], title: line[1]})
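
The zero-based indexing can be checked outside Neo4j. Below is a minimal Python sketch (not Cypher) that reads an in-memory copy of the first lines of the file from the question; the names SAMPLE and read_rows are illustrative only. It also shows that, when no header handling is requested, the Author,Publication line comes through as an ordinary row, which is worth keeping in mind when using LOAD CSV without WITH HEADERS.

```python
import csv
import io

# In-memory stand-in for the first lines of clean.csv from the question.
SAMPLE = """Author,Publication
Sanjeev Saxena,Parallel Integer Sorting and Simulation Amongst CRCW Models.
Hans-Ulrich Simon,Pattern Matching in Trees and Nets.
"""

def read_rows(text):
    """Return each CSV line as a list of fields, header line included."""
    return list(csv.reader(io.StringIO(text)))

rows = read_rows(SAMPLE)
header, first = rows[0], rows[1]

# Index 0 is the first column (author), index 1 the second (title).
print(first[0])
print(first[1])
# The header line is just another row unless it is skipped explicitly.
print(header)
```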

What are missing attributes as defined in the hdf5 specification and metadata in group h5md?

I have an HDF5 file (Data File) containing molecular dynamics simulation data. For quick inspection, the h5ls tool is handy. For example:
h5ls -d xaa.h5/particles/lipids/positions/time | less
My question is based on a comment I received about the data format: which attributes are missing according to the HDF5 specification, and which metadata is missing in the h5md group?
Are you trying to get the value of the Time attribute from a dataset? If so, you need to use h5dump, not h5ls. And, the attributes are attached to each dataset, so you have to include the dataset name on the path. Finally, attribute names are case sensitive; Time != time. Here is the required command for dataset_0000 (repeat for 0001 thru 0074):
h5dump -a /particles/lipids/positions/dataset_0000/Time xaa.h5
You can also get attributes with Python code. Simple example below:
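import h5py

# Open the file read-only and walk every dataset under the positions group
with h5py.File('xaa.h5', 'r') as h5f:
    for ds, h5obj in h5f['/particles/lipids/positions'].items():
        # attribute names are case sensitive: "Time", not "time"
        print(f'For dataset={ds}; Time={h5obj.attrs["Time"]}')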

Using tbl_regression with imputed data/pooled regression models

I've had great success using the gtsummary::tbl_regression function to display regression model results. I can't see how to use tbl_regression with pooled regression models from imputed data sets, however, and I'd really like to.
I don't have a reproducible example handy, I just wanted to see if anyone else has found a way to work with, say, mids objects created by the mice package in tbl_regression.
In the current development version of gtsummary, it's possible to summarize models estimated on imputed data from the mice package. Here's an example:
# install dev version of gtsummary
remotes::install_github("ddsjoberg/gtsummary")
library(gtsummary)
packageVersion("gtsummary")
#> [1] ‘1.3.5.9012’
# impute the data
df_imputed <- mice::mice(trial, m = 2)
# build the model
imputed_model <- with(df_imputed, lm(age ~ marker + grade))
# present beautiful table with gtsummary
tbl_regression(imputed_model)
#> pool_and_tidy_mice: Tidying mice model with
#> `mice::pool(x) %>% mice::tidy(exponentiate = FALSE, conf.int = TRUE, conf.level = 0.95)`
Created on 2020-12-16 by the reprex package (v0.3.0)
It's important to note that you pass the mice model object to tbl_regression() BEFORE you pool the results. The tbl_regression() function needs access to the individual models in order to correctly identify the reference row and variable labels (among other things). Internally, the tidying function used on the mice model will first pool the results, then tidy the results. The code used for this process is printed to the console for transparency (as seen in the example above).

java.lang.NumberFormatException: For input string: "Some(12)"

Can anyone please tell me what is wrong with my code?
Below is my Spark code in Scala:
import java.text.SimpleDateFormat
import org.apache.spark.sql.SparkSession
import scala.xml.XML

object TopTenTags09 {
  def main(args: Array[String]) {
    val format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS")
    val format2 = new SimpleDateFormat("yyyy-MM")
    val spark = SparkSession.builder().appName("Number of posts which are questions and contains specified words").master("local").getOrCreate()
    val data = spark.read.textFile("/home/harsh/Hunny/HadoopPractice/Spark/DF/StackOverFlow/Posts.xml").rdd
    val result = data.filter { line => { line.trim().startsWith("<row") } }
      .filter { line => { line.contains("PostTypeId=\"1\"") } }
      .map { line => {
        val xml = XML.loadString(line)
        if (xml.attribute("Tags").mkString.toLowerCase().contains("hadoop") ||
            xml.attribute("Tags").mkString.toLowerCase().contains("spark")) {
          (Integer.parseInt(xml.attribute("Score").toString()), Integer.parseInt(xml.attribute("Score").toString()))
        }
      }}/*.filter(line=>line._1>2)
      .sortByKey(false)*/
    result.foreach(println) //throwing error while printing
    spark.stop
  }
}
And below is the error I am getting while running it:
java.lang.NumberFormatException: For input string: "Some(12)"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
I am new to Spark and the error is driving me crazy because, contrary to what the error message says, there is no "Some" in the code or in the data. Can anyone help me, please?
Sample data:
<row Id="5" PostTypeId="1" CreationDate="2014-05-13T23:58:30.457" Score="7" ViewCount="286" Body="<p>I've always been interested in machine learning, but I can't figure out one thing about starting out with a simple "Hello World" example - how can I avoid hard-coding behavior?</p>
<p>For example, if I wanted to "teach" a bot how to avoid randomly placed obstacles, I couldn't just use relative motion, because the obstacles move around, but I don't want to hard code, say, distance, because that ruins the whole point of machine learning.</p>
<p>Obviously, randomly generating code would be impractical, so how could I do this?</p>
" OwnerUserId="5" LastActivityDate="2014-05-14T00:36:31.077" Title="How can I do simple machine learning without hard-coding behavior?" Tags="<machine-learning>" AnswerCount="1" CommentCount="1" FavoriteCount="1" ClosedDate="2014-05-14T14:40:25.950" />
<row Id="7" PostTypeId="1" AcceptedAnswerId="10" CreationDate="2014-05-14T00:11:06.457" Score="2" ViewCount="266" Body="<p>As a researcher and instructor, I'm looking for open-source books (or similar materials) that provide a relatively thorough overview of data science from an applied perspective. To be clear, I'm especially interested in a thorough overview that provides material suitable for a college-level course, not particular pieces or papers.</p>
" OwnerUserId="36" LastEditorUserId="97" LastEditDate="2014-05-16T13:45:00.237" LastActivityDate="2014-05-16T13:45:00.237" Title="What open-source books (or other materials) provide a relatively thorough overview of data science?" Tags="<education><open-source>" AnswerCount="3" CommentCount="4" FavoriteCount="1" ClosedDate="2014-05-14T08:40:54.950" />
<row Id="9" PostTypeId="2" ParentId="5" CreationDate="2014-05-14T00:36:31.077" Score="4" Body="<p>Not sure if this fits the scope of this SE, but here's a stab at an answer anyway.</p>
<p>With all AI approaches you have to decide what it is you're modelling and what kind of uncertainty there is. Once you pick a framework that allows modelling of your situation, you then see which elements are "fixed" and which are flexible. For example, the model may allow you to define your own network structure (or even learn it) with certain constraints. You have to decide whether this flexibility is sufficient for your purposes. Then within a particular network structure, you can learn parameters given a specific training dataset.</p>
<p>You rarely hard-code behavior in AI/ML solutions. It's all about modelling the underlying situation and accommodating different situations by tweaking elements of the model.</p>
<p>In your example, perhaps you might have the robot learn how to detect obstacles (by analyzing elements in the environment), or you might have it keep track of where the obstacles were and which way they were moving.</p>
" OwnerUserId="51" LastActivityDate="2014-05-14T00:36:31.077" CommentCount="0" />
<row Id="10" PostTypeId="2" ParentId="7" CreationDate="2014-05-14T00:53:43.273" Score="9" Body="<p>One book that's freely available is "The Elements of Statistical Learning" by Hastie, Tibshirani, and Friedman (published by Springer): <a href="http://statweb.stanford.edu/~tibs/ElemStatLearn/">see Tibshirani's website</a>.</p>
<p>Another fantastic source, although it isn't a book, is Andrew Ng's Machine Learning course on Coursera. This has a much more applied-focus than the above book, and Prof. Ng does a great job of explaining the thinking behind several different machine learning algorithms/situations.</p>
" OwnerUserId="22" LastActivityDate="2014-05-14T00:53:43.273" CommentCount="1" />
<row Id="14" PostTypeId="1" CreationDate="2014-05-14T01:25:59.677" Score="14" ViewCount="686" Body="<p>I am sure data science as will be discussed in this forum has several synonyms or at least related fields where large data is analyzed.</p>
<p>My particular question is in regards to Data Mining. I took a graduate class in Data Mining a few years back. What are the differences between Data Science and Data Mining and in particular what more would I need to look at to become proficient in Data Mining?</p>
" OwnerUserId="66" LastEditorUserId="322" LastEditDate="2014-06-17T16:17:20.473" LastActivityDate="2014-06-20T17:36:05.023" Title="Is Data Science the Same as Data Mining?" Tags="<data-mining><definitions>" AnswerCount="4" CommentCount="1" FavoriteCount="2" />
I assume that
Integer.parseInt(xml.attribute("Score").toString())
throws the above-mentioned exception, because xml is of type Elem, and calling the method attribute on it returns an Option[Seq[Node]], not just a single string with the number. Calling toString on that Option yields "Some(12)", which parseInt cannot parse.
You probably want to replace both occurrences of that expression with
Integer.parseInt(xml.attribute("Score").get.toString())
Moreover, you could also replace the cumbersome Integer.parseInt with
xml.attribute("Score").get.toString.toInt
Isolated demo:
scala> val e = XML.loadString("""<foo Score="42" Bar="58"/>""")
e: scala.xml.Elem = <foo Bar="58" Score="42"/>
scala> e.attribute("Score").get.toString.toInt
res4: Int = 42

Tagging and Training NER dataset

I have a data set and I want to tag it for Named Entity Recognition. My dataset is in Persian.
I want to know how I should tag expressions like:
*** آقای مهدی کاظمی = Mr Mehdi Kazemi / Mr Will Smith. >>> (names with titles) Should I tag the whole expression as a person, or only the first name and last name? (I mean, should I also tag "Mr"?)
Mr >> b_per || Mr >> O
Mehdi >> i_per || Mehdi >> b_per
Kazemi >> i_per || Kazemi >> i_per
*** بیمارستان نور = Noor hospital >>> Should I tag the name only or the name and hospital both as Named Entity?
*** Eiffel Tower / The Ministry of Defense (I mean the US DoD, for example) >>> in Persian it is called:
وزارت دفاع (vezarate defa)
Should I tag only "Defense", or the whole expression?
There are many more examples like this for schools, movies, cities, countries and so on, since in Persian the entity class word comes before the name itself.
I would appreciate it if you could help me with tagging this dataset.
I'll give you some examples from the CoNLL 2003 training data:
"Mr." is not tagged as part of the person, so titles are ignored.
"Columbia Presbyterian Hospital" is tagged as (LOC, LOC, LOC)
"a New York hospital" (O, LOC, LOC, O)
"Ministry of Commerce" is (ORG, ORG, ORG)
I think "Eiffel Tower" should be (LOC, LOC)
In general, you tag the data the way you want the output to look. It's up to you whether you want titles included, for example. However, CoreNLP won't tag overlapping entities, so you have to make a decision for cases like a hospital named after someone.
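
As an illustration of the convention where titles are left out, here is a minimal Python sketch that writes BIO tags for a tokenized sentence. The function name bio_tags, the (start, end, label) span format, the tag spelling B-PER/I-PER and the example sentence are all illustrative assumptions, not part of any particular toolkit.

```python
def bio_tags(tokens, spans):
    """Return one BIO tag per token.

    spans is a list of (start, end, label) with end exclusive; spans are
    assumed not to overlap, matching a flat (non-nested) tagging scheme.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

tokens = ["Mr", "Mehdi", "Kazemi", "visited", "Noor", "hospital"]
# Title left out of the person; the class word "hospital" kept in the location.
spans = [(1, 3, "PER"), (4, 6, "LOC")]

for token, tag in zip(tokens, bio_tags(tokens, spans)):
    print(token, tag)
```

Including the title instead would only mean starting the person span at index 0; whichever choice is made should be applied consistently across the whole dataset.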
I believe you are heading towards Stanford NLP and the BIO format. But in case you'd also consider other options, you may want to have a look at structured entities such as those described in: http://www.afcp-parole.org/etape/docs/etape-06022012-quaero-en.pdf.
These allow entities to be described as trees, providing a finer analysis for information extraction. They are more tedious to annotate, but probably relevant if you intend to use the annotation for semantic purposes and not only for indexing.

Text classification using Weka

I'm a beginner with Weka and I'm trying to use it for text classification. I have seen how to use the StringToWordVector filter for classification. My question is: is there any way to add more features to the text I'm classifying? For example, if I wanted to add POS tags and named entity tags to the text, how would I use these features in a classifier?
It depends on the format of your dataset and the preprocessing steps you perform. For instance, let us suppose that you have pre-POS-tagged your texts, so that they look like:
The_det dog_n barks_v ._p
Then you can build a specific tokenizer (see weka.core.tokenizers) to generate two tokens per word: one would be "The" and the other "The_det", so you keep the tag information.
If you want only tagged words, then you can just ensure that "_" is not a delimiter in the weka.core.tokenizers.WordTokenizer.
My advice is to have both the words and the tagged words, so a simpler way would be to write a script that joins the texts and the tagged texts. From a file containing "The dog barks" and another one containing "The_det dog_n barks_v ._p", it would generate a file with "The The_det dog dog_n barks barks_v . ._p". You may even forget about the order unless you are going to make use of n-grams.
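
A minimal Python sketch of such a joining script, assuming the plain and tagged texts are tokenized identically (whitespace-separated, with punctuation as its own token, so one tagged token per plain token). The function name interleave is hypothetical.

```python
def interleave(plain, tagged):
    """Pair each plain token with its tagged counterpart, in order."""
    plain_tokens = plain.split()
    tagged_tokens = tagged.split()
    if len(plain_tokens) != len(tagged_tokens):
        raise ValueError("plain and tagged texts have different token counts")
    merged = []
    for word, word_tag in zip(plain_tokens, tagged_tokens):
        merged.extend([word, word_tag])
    return " ".join(merged)

print(interleave("The dog barks .", "The_det dog_n barks_v ._p"))
```

The joined string can then be stored as the text attribute of each instance and passed to StringToWordVector, provided "_" is not among the tokenizer's delimiters.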