Hive Table Creation for Dynamic Schemas - mongodb

We are investigating whether Hive will allow us to run SQL-like queries on a Mongo-style dynamic schema as a precursor to our map-reduce jobs.
The data comes in the form of several TiB of BSON files; each file contains JSON "samples". An example sample looks like this:
{
    "_id" : "SomeGUID",
    "SomeScanner" :
    {
        "B64LR" : 22,
        "Version" : 192565886128245
    },
    "Parser" :
    {
        "Size" : 73728,
        "Headers" :
        [
            {
                "VAddr" : 4096,
                "VSize" : 7924,
                . . . etc. . . .
As a dynamic schema, only a few of the fields are guaranteed to exist.
We would like to be able to run a query against an input set that may be something
like
SomeScanner.Parser.Headers.VSize > 9000
Having looked into table mapping, I'm not sure whether this is doable with Hive . . . how would one map a column that may or may not be there . . . not to mention that there are about 2k-3k queryable values in a typical sample.
Hence, my questions to the Experts:
Can Hive build a dynamic schema from the data it encounters?
How can one go about building a Hive table with ~3k columns?
Is there a better way?
Appreciated, as always.

OK--with much ado, I can now answer my own questions.
Can Hive build a dynamic schema from the data it encounters?
A: No. However, an excellent tool for this exists. q.v., inf.
How can one go about building a Hive table w/~3K columns
A: Ibidem
Is there a better way?
A: Not that I found; but, with some help, it isn't too difficult.
First, a shout out to Michael Peterson at http://thornydev.blogspot.com/2013/07/querying-json-records-via-hive.html, whose blog post served as the toe-hold to figure all of this out.
Definitely check it out if you're starting out w/Hive.
Now, Hive cannot natively import a JSON document and deduce a schema from it . . . however, Michael Peterson has developed a tool that does: https://github.com/midpeter444/hive-json-schema
Some caveats with it:
* Empty arrays and structs are not handled, so remove (or populate) them. Otherwise, things like { "something" : {} } or { "somethingelse": [] } will throw errors.
* If any field has the name "function", it will have to be renamed prior to executing the CREATE TABLE statement. E.g., the following would throw an error: { "someThing": { "thisIsOK": "true", "function": "thatThrowsAnError" } }. Presumably, this is because "function" is a Hive keyword.
* And, with dynamic schemas in general, I have not found a way to handle a nested leading-underscore name even if the schema is valid: { "somethings": { "_someVal": "123", "otherVal": "456" } } will fail.
* For the common MongoDB "ID" field, this is mappable with the following addition: with serdeproperties("mapping.id" = "_id"), which appears to work like a macro substitution.
Serialization/De-Serialization for JSON can be achieved with https://github.com/rcongiu/Hive-JSON-Serde by adding the following: ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
N.B., the JsonSerDe JAR must have been added to .hiverc or "add jar"'d into Hive to be used.
Thus, the schema:
CREATE TABLE samplesJSON
( id string,
. . . rest of huge schema . . . . )
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH serdeproperties("mapping.id" = "_id");
The JSON data can be loaded into the table with a command along the lines of:
LOAD DATA LOCAL INPATH '/tmp/samples.json' OVERWRITE INTO TABLE samplesJSON;
Finally, queries are actually intuitive and straightforward. Using the example from the original question:
hive> select id, somescanner.parser.headers.vaddr from samplesjson;
OK
id vaddr
119 [4096,53248,57344]
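And to answer the filter from the original question (SomeScanner.Parser.Headers.VSize > 9000): since headers is an array of structs, it has to be exploded before filtering. Below is a minimal sketch, not part of the original write-up, that assumes a HiveServer2 endpoint on localhost and the PyHive client; table and column names follow the example query above.
# Sketch only: assumes HiveServer2 on localhost:10000 and the PyHive client.
from pyhive import hive

conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()

# headers is an array<struct<...>>, so explode it to filter on individual entries.
cursor.execute("""
    SELECT id, hdr.vaddr, hdr.vsize
    FROM samplesjson
    LATERAL VIEW explode(somescanner.parser.headers) h AS hdr
    WHERE hdr.vsize > 9000
""")

for row in cursor.fetchall():
    print(row)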

Related

Pharo, Voyage and MongoDB

I want to build a relatively simple web app using Pharo, Voyage and MongoDB + TeaPot. Before starting the project I did a lot of research, and one question remains: how do I initially upload a bunch of data into MongoDB? I basically have the data in CSV format. Do I have to program an importer in Smalltalk that does that? If I were to do it without Smalltalk, it would be missing all the object IDs, etc. How do you go about things like that?
Thanks,
Henrik
If you have data in CSV format, then I would recommend creating a simple importer. You could use NeoCSV and then save the objects via Pharo. I presume you know how to set up the Mongo repository; in a workspace, do:
| repository |
repository := VOMongoRepository
host: VOMongoRepository defaultHost
database: 'MyMongoDb'.
VORepository setRepository: repository.
First create your two class methods for Voyage:
Kid class >> isVoyageRoot
^ true "instances of this object will be root"
Kid class >> voyageCollectionName
^ 'Kids' "The collection name in MongoDB"
The Kid class should have firstName(:), surname(:), age(:) accessors and instance variables of the same name.
Then simply read the CSV and save the result into MongoDB:
| personalInformation readData columnName columnData aKid |
"init variable"
personalInformation := OrderedDictionary new.
"emulate CSV reading"
readData := (NeoCSVReader on: 'firstName, surname, age\John, Smith, 5' withCRs readStream) upToEnd.
columnName := readData first.
columnData := readData second.
"Repeat for as many number of columns you may have"
1 to: columnName size do: [ :index |
personalInformation at: (columnName at: index) put: (columnData at: index)
].
aKid := Kid new.
"Storing Kid object information"
personalInformation keysAndValuesDo: [ :key :value |
    "For every column, store the value into the Kid object via its setter (you need the accessors for that)"
    aKid perform: (key asString , ':') asSymbol with: value
].
aKid save "Saving into mongoDB"
This is only to give you a rough idea.
To query in your MongoDB do:
db.Kids.find()
You should see the stored information.
Disclaimer: even though the code should be fine, I did not have time to actually test it against MongoDB.

Cypher query equivalent for neo4j-import

I'm trying to create about 27 million relationships along with 15 million nodes. Initially I was using Cypher, but it was taking a lot of time, so I switched to the neo4j-import tool.
I am confused about whether the result of the Cypher query is the same as that of neo4j-import.
My Cypher query was:
load csv from "file://dataframe6.txt" as line fieldterminator" "
MERGE (A :concept{name:line[0]})
WITH line, A
MERGE (B :concept{name:line[1]})
WITH B,A
MERGE (A)-[:test]->(B);
Content in dataframe6 :
C0000005,C0036775,RB_
C0000039,C0000039,SY_sort_version_of
C0000039,C0000039,SY_entry_version_of
C0000039,C0000039,SY_permuted_term_of
C0000039,C0001555,AQ_
C0000039,C0001688,AQ_
My neo4j-import script:
neo4j-import --into graph.db --nodes:concept "nheader,MRREL-nodes" --relationships "rheader,MRREL-relations" --skip-duplicate-node true
rheader : :START_ID,:END_ID,:TYPE
nheader : :ID,name
MRREL-nodes :
C0000005,C0000005
C0000039,C0000039
C0000052,C0000052
C0036775,C0036775
C0001555,C0001555
MRREL-relations
C0000005,C0036775,RB_
C0000039,C0000039,SY_sort_version_of
C0000039,C0000039,SY_entry_version_of
C0000039,C0000039,SY_permuted_term_of
C0000039,C0001555,AQ_
C0000039,C0001688,AQ_
Somehow I don't see the same result.
[EDITED]
If you want your relationships to have dynamically assigned types, then you need to change your Cypher code to make use of line[2] to specify the relationship type (e.g., via the APOC procedure apoc.create.relationship). It is currently always using test as the type.
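For example (a sketch only, not tested against your data), assuming the APOC plugin is installed and using the official Neo4j Python driver purely as a convenient way to send the statement; the bolt URI, credentials, and the comma field terminator are assumptions, and dataframe6.txt is assumed to sit in the database's import directory:
# Sketch: requires the APOC plugin; URI and credentials are placeholders.
from neo4j import GraphDatabase

cypher = """
LOAD CSV FROM 'file:///dataframe6.txt' AS line FIELDTERMINATOR ','
MERGE (a:concept {name: line[0]})
MERGE (b:concept {name: line[1]})
WITH a, b, line
// line[2] (e.g. 'RB_', 'SY_sort_version_of') becomes the relationship type
CALL apoc.create.relationship(a, line[2], {}, b) YIELD rel
RETURN count(rel)
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    print(session.run(cypher).single())
driver.close()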
If, instead, you actually wanted all the relationships imported by neo4j-import to have the same test type, then you need to use the right syntax.
Try removing ",:TYPE" from rheader, and use this import command line ( --relationships has been changed to --relationships:test):
neo4j-import --into graph.db --nodes:concept "nheader,MRREL-nodes" --relationships:test "rheader,MRREL-relations" --skip-duplicate-node true

Entity cannot be found by elasticsearch

I have the following entity in ElasticSearch:
{
    "id": 123,
    "entity-id": 1019,
    "entity-name": "aaa",
    "status": "New",
    "creation-date": "2014-08-06",
    "author": "bubu"
}
I try to query for all entities with status=New, so the above entity should appear there.
I run this code:
qResponse.setQuery(QueryBuilders.termQuery("status", "New"));
return qResponse.setFrom(start).setSize(size).execute().actionGet().toString();
But it returns no results.
If I use this code (a general search, not on a specific field), I get the above entity:
qResponse.setQuery(QueryBuilders.queryString("New"));
return qResponse.setFrom(start).setSize(size).execute().actionGet().toString();
Why?
The problem is a mismatch between a Term Query and using the Standard Analyzer when you index. The Standard Analyzer, among other things, lowercases the field when it's indexed:
Standard Analyzer
An analyzer of type standard is built using the Standard Tokenizer
with the Standard Token Filter, Lower Case Token Filter, and Stop
Token Filter.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html
The Term query, however, matches without analysis:
Term Query
Matches documents that have fields that contain a term (not analyzed).
The term query maps to Lucene TermQuery.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-term-query.html
So in your case, when you index the field status it becomes "new". But when you search with a Term Query it looks for "New", so they don't match. The general search works because the query string is also run through the Standard Analyzer, so both sides end up lowercased.
The default value of index for a string field is analyzed. So when you write "status": "New", it goes through the standard analyzer and is stored as "new".
That is why the term query does not seem to work. If you wish to query the exact value you specified, map the field as "not_analyzed".
For more info. link
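To make the fix concrete, here is a minimal sketch using the Python Elasticsearch client against an ES 1.x/2.x cluster (the question uses the Java API; the index and type names here are made up). It maps status as not_analyzed so the raw value is stored, which lets the term query for "New" match; with the default analyzed mapping you would use a match query instead, since it analyzes the query string the same way as the indexed text.
# Sketch: elasticsearch-py against ES 1.x/2.x; index/type names are made up.
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Map "status" as not_analyzed so the exact value "New" is stored as-is.
es.indices.create(index="entities", body={
    "mappings": {
        "entity": {
            "properties": {
                "status": {"type": "string", "index": "not_analyzed"}
            }
        }
    }
})

es.index(index="entities", doc_type="entity", id=123, body={
    "entity-id": 1019, "entity-name": "aaa", "status": "New",
    "creation-date": "2014-08-06", "author": "bubu"
})
es.indices.refresh(index="entities")

# The term query now matches, because "New" was not lowercased at index time.
print(es.search(index="entities", body={"query": {"term": {"status": "New"}}})["hits"]["total"])

# With the default (analyzed) mapping, a match query would work instead:
# {"query": {"match": {"status": "New"}}}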

How to escape some characters in postgresql

I have this data in one column in postgresql
{
    "geometry":{
        "status":"Point",
        "coordinates":[
            -122.421583,
            37.795027
        ]
    },
and I am using this query:
select * from students where data_json LIKE '%status%' ;
The above query returns results, but this one does not:
select * from students where data_json LIKE '%status:%' ;
How can I fix that?
Of course the 2nd one doesn't find a match, there's no status: text in the value. I think you wanted:
select * from students where data_json LIKE '%"status":%'
... however, like most cases where you attempt text pattern matching on structured data this is in general a terrible idea that will bite you. Just a couple of problem examples:
{
"somekey": "the value is \"status\": true"
}
... where "status": appears as part of the text value and will match even though it shouldn't, and:
{
status : "blah"
}
where status has no quotes and there is a space before the colon. As far as JavaScript is concerned this is the same as "status":, but it won't match your pattern.
If you're trying to find fields within json or extract fields from json, do it with a json parser. PL/V8 may be of interest, or the json libraries available for tools like pl/perl, pl/pythonu, etc. Future PostgreSQL versions will have functions to get a json key by path, test if a json value exists, etc, but 9.2 does not.
At this point you might be thinking "why don't I use regular expressions?". Don't go there; you do not want to try to write a full JSON parser in regex. This blog entry is somewhat relevant.
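As a minimal sketch of the "do it with a json parser" route on 9.2 (Python and psycopg2 are used purely for illustration; the table and column names follow the question, and the connection string is a placeholder):
# Sketch: client-side filtering with a real JSON parser; connection string is a placeholder.
import json
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")
cur = conn.cursor()
cur.execute("SELECT data_json FROM students")

for (raw,) in cur.fetchall():
    try:
        doc = json.loads(raw)
    except ValueError:
        continue  # skip rows that are not valid JSON
    # A parser, not a LIKE pattern, decides whether "status" is really a key.
    if "status" in doc.get("geometry", {}):
        print(doc["geometry"]["status"])

# On PostgreSQL 9.3+ the same filter can be pushed into SQL, e.g.:
#   SELECT * FROM students WHERE data_json::json #>> '{geometry,status}' IS NOT NULL;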

exporting MongoDB to CSV using pymongo

I would like to write a script to generate a CSV file from my MongoDB database, and I would like to know the most convenient way to do it!
First, let me begin with the structure of the collections.
MyDataBase -> setting
              users
              fruits
in setting I have something like
setting -> _id
           data
           _tenant
and the thing I am after is making a CSV file out of the profiles in data,
which have some fields/properties like "name", "address", "postalcode", "email", "age", etc. Not all of these profiles necessarily have all of the fields/properties, and some of them even look like collections (they have sub-branches), which I am not interested in at all!
So, my Python code so far looks like this:
myquery = db.settings.find()  # I am getting everything!
output = csv.writer(open('some.csv', 'wt'))  # writing to this file
for items in myquery[0:10]:  # first 10 entries
    a = list(items['data']['Profile'].values())  # sub-documents are imported as dictionaries; take their values as a list
    tt = list()
    for chiz in a:
        if chiz is not None:
            tt.append(chiz.encode('ascii', 'ignore'))  # encoding
        else:
            tt.append("none")
    output.writerow(tt)
These profiles don't necessarily have all fields, and some of them are even collections (with sub-branches) and get imported as dictionaries! So I have to convert them to lists, and all in all there are quite a few things to take care of in such a process; it doesn't look that straightforward!
My question might sound very general, but is this a typical way to make such a report? If not, can someone clarify?
Yes, I use the same approach.
It is clear and fast, and it works without any additional libraries.
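Since the same approach works, here is a slightly more defensive sketch of it (the profile field names are guesses based on the question; adjust them to the real keys). csv.DictWriter keeps the rows well-formed when a profile is missing some fields, and sub-documents are simply skipped:
# Sketch: field names are guesses from the question; adjust to the real profile keys.
import csv
from pymongo import MongoClient

client = MongoClient()  # assumes a local MongoDB instance
db = client["MyDataBase"]

FIELDS = ["name", "address", "postalcode", "email", "age"]

with open("some.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=FIELDS, restval="none")
    writer.writeheader()
    for item in db.settings.find().limit(10):
        profile = item.get("data", {}).get("Profile", {})
        # Keep only the flat fields we care about; sub-documents and lists are skipped.
        row = {k: v for k, v in profile.items()
               if k in FIELDS and not isinstance(v, (dict, list))}
        writer.writerow(row)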