Going through the Hudi documentation, I saw the Metadata Config section and was curious about how it is used. I created a table with the metadata enabled, and the directory got created under /.hoodie/metadata. Has anybody experimented with this feature? Is the metadata exposed, or is it only used internally by Hudi? What is it used for? I couldn't work it out from the docs.
I used the following Hudi options to create a table in S3 using PySpark.
hudi_options_insert = {
    "hoodie.table.name": "table_p5",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.operation": "bulk_insert",
    "hoodie.datasource.write.partitionpath.field": "ds",
    "hoodie.datasource.write.precombine.field": "id",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.hive_sync.table": "table_p5",
    "hoodie.datasource.hive_sync.database": "poc_hudi",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.partition_fields": "ds",
    "hoodie.insert.shuffle.parallelism": 6,
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.insert.parallelism": 6
}
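For context, here is how I would expect the flag to be used on the read path, if the metadata table is indeed consulted for file listings instead of an S3 LIST call. This is just a sketch of my understanding; the table path below is a placeholder:

df = (
    spark.read.format("hudi")
    # same flag on the read side, so listings come from /.hoodie/metadata
    .option("hoodie.metadata.enable", "true")
    .load("s3://<bucket>/poc_hudi/table_p5")  # placeholder path
)
df.show()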
Thanks a mil.
I tried several answers from Stack Overflow but cannot solve the problem. I want to connect Firestore to LoopBack using this package: loopback-connector-firestore (https://www.npmjs.com/package/loopback-connector-firestore). After creating the datasource using the lb datasource command and starting the server, the error below shows up:
TypeError: Cannot initialize connector undefined: Cannot read property 'replace' of undefined
LoopBack is already connected to other datasources. How can I add Firestore to it?
This is the datasources.json file:
{
  -----other db datasources here-----
  "Firestore": {
    "name": "Firestore",
    "projectId": "project id",
    "clientEmail": "client email",
    "privateKey": "key here",
    "databaseName": "name here",
    "connector": "loopback-connector-firestore"
  }
}
In the server.js file:
var ds = loopback.createDataSource({
  connector: require('loopback-connector-firestore'),
  provider: 'Firestore'
});
var storage = ds.createModel('storage');
app.model(storage);
The environment settings:
* Kubuntu 18.04
* nodejs v10.16
* npm v6.9
* loopback v3
I think some of the connectors aren't supported by LoopBack for now, and loopback-connector-firestore is one of them.
You may need to check the following reference: https://loopback.io/doc/en/lb3/Community-connectors.html
I am trying to index data from a PostgreSQL database into Elasticsearch using Logstash. One of the columns in the database is of type JSON. When I run Logstash, I get a "Missing Converter handling" error; I understand this means Java/Logstash doesn't recognize the JSON column type.
I can typecast the column with col_name::TEXT, which works fine. However, I need that column as JSON in the Elasticsearch index. Is there any workaround?
Note: the keys in the JSON object are not fixed; they vary.
Logstash postgres query example:
input {
  jdbc {
    # Postgres JDBC connection string to our database
    jdbc_connection_string => "***"
    # The user we wish to execute our statement as
    jdbc_user => "***"
    jdbc_password => "***"
    # The path to our downloaded JDBC driver
    jdbc_driver_library => "<jar_file_path>"
    # The name of the driver class for Postgresql
    jdbc_driver_class => "org.postgresql.Driver"
    # Our query
    statement => "SELECT col_1, col_2, col_3, json_col_4 from my_db;"
  }
}
filter {
}
output {
  stdout { codec => json_lines }
}
In short, I need the JSON object from PostgreSQL to arrive in Elasticsearch as JSON.
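One direction I'm considering (untested): cast the column to text in the SQL statement, e.g. json_col_4::TEXT AS json_col_4, and re-parse it in the currently empty filter block using Logstash's json filter, so it reaches Elasticsearch as an object again. A minimal sketch, assuming the cast column keeps the name json_col_4:

filter {
  json {
    # Re-parse the stringified JSON column into a structured field
    source => "json_col_4"
    target => "json_col_4"
  }
}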
I'm having difficulty with starting Logstash.
My logstash.conf looks like this:
input {
  beats {
    port => "5044"
  }
}
filter {
  grok {
    patterns_dir => ["./patterns"]
    match => { "message" => "%{WORD:event_type}\t%{NUMBER:server_time}\t%{NUMBER:market_time}\t%{WORD:instrument}\t%{C_NUMBER:last_price}\t%{C_NUMBER:trade_quantity}\t%{C_NUMBER:bid_price}\t%{C_NUMBER:bid_quantity}\t%{C_NUMBER:ask_price}\t%{C_NUMBER:ask_quantity}\t%{GREEDYDATA:flags}\t%{GREEDYDATA:additional_infos}"}
  }
  # ... and other stuff here...
}
output {
  elasticsearch {
    hosts => [ "localhost:9200" ]
    index => "%{[@metadata][beat]}"
  }
}
Logstash works fine if I comment out the match => line. But with it, it does not start: nothing shows up when I run netstat -na | grep 5044 in the container, so it is simply not listening on 5044.
And when I try to run Logstash manually with /opt/logstash/bin/logstash --path.data /tmp/logstash/data -f /etc/logstash/conf.d/filebeat-config.conf, I get the following:
Sending Logstash's logs to /opt/logstash/logs which is now configured via log4j2.properties
[2018-08-27T09:35:25,883][INFO ][logstash.setting.writabledirectory] Creating directory {:setting=>"path.queue", :path=>"/tmp/logstash/data/queue"}
[2018-08-27T09:35:25,887][INFO ][logstash.setting.writabledirectory] Creating directory {:setting=>"path.dead_letter_queue", :path=>"/tmp/logstash/data/dead_letter_queue"}
[2018-08-27T09:35:26,177][WARN ][logstash.config.source.multilocal] Ignoring the 'pipelines.yml' file because modules or command line options are specified
[2018-08-27T09:35:26,213][INFO ][logstash.agent ] No persistent UUID file found. Generating new UUID {:uuid=>"5abcdba2-475f-46a9-b192-a343ca15ce89", :path=>"/tmp/logstash/data/uuid"}
[2018-08-27T09:35:26,727][INFO ][logstash.runner ] Starting Logstash {"logstash.version"=>"6.3.2"}
[2018-08-27T09:35:29,016][INFO ][logstash.pipeline ] Starting pipeline {:pipeline_id=>"main", "pipeline.workers"=>8, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>50}
[2018-08-27T09:35:29,316][INFO ][logstash.outputs.elasticsearch] Elasticsearch pool URLs updated {:changes=>{:removed=>[], :added=>[http://localhost:9200/]}}
[2018-08-27T09:35:29,325][INFO ][logstash.outputs.elasticsearch] Running health check to see if an Elasticsearch connection is working {:healthcheck_url=>http://localhost:9200/, :path=>"/"}
[2018-08-27T09:35:29,467][WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>"http://localhost:9200/"}
[2018-08-27T09:35:29,510][INFO ][logstash.outputs.elasticsearch] ES Output version determined {:es_version=>6}
[2018-08-27T09:35:29,513][WARN ][logstash.outputs.elasticsearch] Detected a 6.x and above cluster: the `type` event field won't be used to determine the document _type {:es_version=>6}
[2018-08-27T09:35:29,533][INFO ][logstash.outputs.elasticsearch] New Elasticsearch output {:class=>"LogStash::Outputs::ElasticSearch", :hosts=>["//localhost:9200"]}
[2018-08-27T09:35:29,549][INFO ][logstash.outputs.elasticsearch] Using mapping template from {:path=>nil}
[2018-08-27T09:35:29,565][INFO ][logstash.outputs.elasticsearch] Attempting to install template {:manage_template=>{"template"=>"logstash-*", "version"=>60001, "settings"=>{"index.refresh_interval"=>"5s"}, "mappings"=>{"_default_"=>{"dynamic_templates"=>[{"message_field"=>{"path_match"=>"message", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false}}}, {"string_fields"=>{"match"=>"*", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false, "fields"=>{"keyword"=>{"type"=>"keyword", "ignore_above"=>256}}}}}], "properties"=>{"@timestamp"=>{"type"=>"date"}, "@version"=>{"type"=>"keyword"}, "geoip"=>{"dynamic"=>true, "properties"=>{"ip"=>{"type"=>"ip"}, "location"=>{"type"=>"geo_point"}, "latitude"=>{"type"=>"half_float"}, "longitude"=>{"type"=>"half_float"}}}}}}}}
[2018-08-27T09:35:29,689][ERROR][logstash.pipeline ] Error registering plugin {:pipeline_id=>"main", :plugin=>"#<LogStash::FilterDelegator:0x68bd7527 @metric_events_out=org.jruby.proxy.org.logstash.instrument.metrics.counter.LongCounter$Proxy2 - name: out value:0, @metric_events_in=org.jruby.proxy.org.logstash.instrument.metrics.counter.LongCounter$Proxy2 - name: in value:0, @metric_events_time=org.jruby.proxy.org.logstash.instrument.metrics.counter.LongCounter$Proxy2 - name: duration_in_millis value:0, @id=\"e473071da674c7efab2a8ee71c9e682afff58b8a4725d076964bc668f3b2c724\", @klass=LogStash::Filters::Grok, @metric_events=#<LogStash::Instrument::NamespacedMetric:0x5867faed @metric=#<LogStash::Instrument::Metric:0x61ef1454 @collector=#<LogStash::Instrument::Collector:0x51306706 @agent=nil, @metric_store=#<LogStash::Instrument::MetricStore:0x5227344a @store=#<Concurrent::Map:0x00000000000fb4 entries=2 default_proc=nil>, @structured_lookup_mutex=#<Mutex:0x7efeb9ea>, @fast_lookup=#<Concurrent::Map:0x00000000000fb8 entries=75 default_proc=nil>>>>, @namespace_name=[:stats, :pipelines, :main, :plugins, :filters, :e473071da674c7efab2a8ee71c9e682afff58b8a4725d076964bc668f3b2c724, :events]>, @filter=<LogStash::Filters::Grok patterns_dir=>[\"./patterns\"], match=>{\"message\"=>\"%{WORD:event_type}\\\\t%{NUMBER:server_time}\\\\t%{NUMBER:market_time}\\\\t%{WORD:instrument}\\\\t%{C_NUMBER:last_price}\\\\t%{C_NUMBER:trade_quantity}\\\\t%{C_NUMBER:bid_price}\\\\t%{C_NUMBER:bid_quantity}\\\\t%{C_NUMBER:ask_price}\\\\t%{C_NUMBER:ask_quantity}\\\\t%{GREEDYDATA:flags}\\\\t%{GREEDYDATA:additional_infos}\"}, id=>\"e473071da674c7efab2a8ee71c9e682afff58b8a4725d076964bc668f3b2c724\", enable_metric=>true, periodic_flush=>false, patterns_files_glob=>\"*\", break_on_match=>true, named_captures_only=>true, keep_empty_captures=>false, tag_on_failure=>[\"_grokparsefailure\"], timeout_millis=>30000, tag_on_timeout=>\"_groktimeout\">>", :error=>"pattern %{C_NUMBER:last_price} not defined", :thread=>"#<Thread:0x20b6525c run>"}
[2018-08-27T09:35:29,699][ERROR][logstash.pipeline ] Pipeline aborted due to error {:pipeline_id=>"main", :exception=>#<Grok::PatternError: pattern %{C_NUMBER:last_price} not defined>, :backtrace=>["/opt/logstash/vendor/bundle/jruby/2.3.0/gems/jls-grok-0.11.5/lib/grok-pure.rb:123:in `block in compile'", "org/jruby/RubyKernel.java:1292:in `loop'", "/opt/logstash/vendor/bundle/jruby/2.3.0/gems/jls-grok-0.11.5/lib/grok-pure.rb:93:in `compile'", "/opt/logstash/vendor/bundle/jruby/2.3.0/gems/logstash-filter-grok-4.0.3/lib/logstash/filters/grok.rb:281:in `block in register'", "org/jruby/RubyArray.java:1734:in `each'", "/opt/logstash/vendor/bundle/jruby/2.3.0/gems/logstash-filter-grok-4.0.3/lib/logstash/filters/grok.rb:275:in `block in register'", "org/jruby/RubyHash.java:1343:in `each'", "/opt/logstash/vendor/bundle/jruby/2.3.0/gems/logstash-filter-grok-4.0.3/lib/logstash/filters/grok.rb:270:in `register'", "/opt/logstash/logstash-core/lib/logstash/pipeline.rb:340:in `register_plugin'", "/opt/logstash/logstash-core/lib/logstash/pipeline.rb:351:in `block in register_plugins'", "org/jruby/RubyArray.java:1734:in `each'", "/opt/logstash/logstash-core/lib/logstash/pipeline.rb:351:in `register_plugins'", "/opt/logstash/logstash-core/lib/logstash/pipeline.rb:729:in `maybe_setup_out_plugins'", "/opt/logstash/logstash-core/lib/logstash/pipeline.rb:361:in `start_workers'", "/opt/logstash/logstash-core/lib/logstash/pipeline.rb:288:in `run'", "/opt/logstash/logstash-core/lib/logstash/pipeline.rb:248:in `block in start'"], :thread=>"#<Thread:0x20b6525c run>"}
[2018-08-27T09:35:29,724][ERROR][logstash.agent ] Failed to execute action {:id=>:main, :action_type=>LogStash::ConvergeResult::FailedAction, :message=>"Could not execute action: PipelineAction::Create<main>, action_result: false", :backtrace=>nil}
Also, next to my logstash.conf, I have a patterns directory containing a file with the following:
USERNAME [a-zA-Z0-9._-]+
USER %{USERNAME}
INT (?:[+-]?(?:[0-9]+))
BASE10NUM (?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))
NUMBER (?:%{BASE10NUM})
C_NUMBER (?:[+-]?(?:[(0-9)|(*,#,.)]+))
C_NUMBER2 (?:[+-]?(?:[(0-9)|(*,#,.)|null]+))
BASE16NUM (?<![0-9A-Fa-f])(?:[+-]?(?:0x)?(?:[0-9A-Fa-f]+))
BASE16FLOAT \b(?<![0-9A-Fa-f.])(?:[+-]?(?:0x)?(?:(?:[0-9A-Fa-f]+(?:\.[0-9A-Fa-f]*)?)|(?:\.[0-9A-Fa-f]+)))\b
POSINT \b(?:[1-9][0-9]*)\b
NONNEGINT \b(?:[0-9]+)\b
WORD \b\w+\b
NOTSPACE \S+
SPACE \s*
DATA .*?
GREEDYDATA .*
QUOTEDSTRING (?>(?<!\\)(?>"(?>\\.|[^\\"]+)+"|""|(?>'(?>\\.|[^\\']+)+')|''|(?>(?>\\.|[^\\]+)+`)|``))
UUID [A-Fa-f0-9]{8}-(?:[A-Fa-f0-9]{4}-){3}[A-Fa-f0-9]{12}
MAC (?:%{CISCOMAC}|%{WINDOWSMAC}|%{COMMONMAC})
CISCOMAC (?:(?:[A-Fa-f0-9]{4}\.){2}[A-Fa-f0-9]{4})
WINDOWSMAC (?:(?:[A-Fa-f0-9]{2}-){5}[A-Fa-f0-9]{2})
COMMONMAC (?:(?:[A-Fa-f0-9]{2}:){5}[A-Fa-f0-9]{2})
MONTH \b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\b
MONTHNUM (?:0?[1-9]|1[0-2])
MONTHDAY (?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])
DAY (?:Mon(?:day)?|Tue(?:sday)?|Wed(?:nesday)?|Thu(?:rsday)?|Fri(?:day)?|Sat(?:urday)?|Sun(?:day)?)
YEAR (?>\d\d){1,2}
HOUR (?:2[0123]|[01]?[0-9])
MINUTE (?:[0-5][0-9])
SECOND (?:(?:[0-5][0-9]|60)(?:[:.,][0-9]+)?)
TIME (?!<[0-9])%{HOUR}:%{MINUTE}(?::%{SECOND})(?![0-9])
DATE_US %{MONTHNUM}[/-]%{MONTHDAY}[/-]%{YEAR}
DATE_EU %{MONTHDAY}[./-]%{MONTHNUM}[./-]%{YEAR}
ISO8601_TIMEZONE (?:Z|[+-]%{HOUR}(?::?%{MINUTE}))
ISO8601_SECOND (?:%{SECOND}|60)
TIMESTAMP_ISO8601 %{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:?%{MINUTE}(?::?%{SECOND})?%{ISO8601_TIMEZONE}?
TIMESTAMP_CUSTOM %{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:?%{MINUTE}(?::?%{SECOND}.?%{NUMBER})?%{ISO8601_TIMEZONE}?
DATE %{DATE_US}|%{DATE_EU}
DATESTAMP %{DATE}[- ]%{TIME}
TZ (?:[PMCE][SD]T|UTC)
DATESTAMP_RFC822 %{DAY} %{MONTH} %{MONTHDAY} %{YEAR} %{TIME} %{TZ}
DATESTAMP_OTHER %{DAY} %{MONTH} %{MONTHDAY} %{TIME} %{TZ} %{YEAR}
What is wrong with the match => line?
I highly appreciate your help.
You're attempting to use a grok pattern, %{C_NUMBER}, that Logstash doesn't know about; it doesn't appear to be a standard pattern bundled with Logstash. Put %{NUMBER} in its place and restart Logstash.
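In other words, a sketch of the match line with every %{C_NUMBER:...} swapped for the bundled %{NUMBER:...}:

match => { "message" => "%{WORD:event_type}\t%{NUMBER:server_time}\t%{NUMBER:market_time}\t%{WORD:instrument}\t%{NUMBER:last_price}\t%{NUMBER:trade_quantity}\t%{NUMBER:bid_price}\t%{NUMBER:bid_quantity}\t%{NUMBER:ask_price}\t%{NUMBER:ask_quantity}\t%{GREEDYDATA:flags}\t%{GREEDYDATA:additional_infos}" }

Note that the stock NUMBER pattern is stricter than your custom C_NUMBER (it won't match values containing characters like * or #), so some lines may stop matching.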
I was able to resolve the issue by changing patterns_dir => ["./patterns"] to patterns_dir => ["/etc/logstash/conf.d/patterns"].
The match line was referencing a grok pattern that Logstash couldn't find, because a relative patterns_dir path is resolved against Logstash's working directory rather than the location of the config file.
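For reference, the working grok block looks like this (the pattern is unchanged; only the path is now absolute):

filter {
  grok {
    # Absolute path, so the custom pattern file is found regardless of
    # Logstash's working directory
    patterns_dir => ["/etc/logstash/conf.d/patterns"]
    match => { "message" => "%{WORD:event_type}\t%{NUMBER:server_time}\t%{NUMBER:market_time}\t%{WORD:instrument}\t%{C_NUMBER:last_price}\t%{C_NUMBER:trade_quantity}\t%{C_NUMBER:bid_price}\t%{C_NUMBER:bid_quantity}\t%{C_NUMBER:ask_price}\t%{C_NUMBER:ask_quantity}\t%{GREEDYDATA:flags}\t%{GREEDYDATA:additional_infos}" }
  }
}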
I am running OrientDB 2.1.2 from the AWS Marketplace AMI. I have already used ETL to load two sets of vertices. Now I'm trying to load a file of edges into OrientDB with ETL, and I'm getting: IllegalArgumentException: destination vertex is null. I've looked at the documentation and some other examples on the net, and my ETL config looks correct to me. I was hoping someone might have an idea.
My two V subclasses are:
Author (authorId, authGivenName, authSurname), with an index on authorId
Abstract (abstractId), with an index on abstractId
My E subclass:
Authored - no properties or indices defined on it
My edge file:
(authorId, abstractId) - tab-separated fields, with one header line containing those names
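For concreteness, using the first data row visible in the debug output further down, the file looks like this:

authorId	abstractId
9-s2.0-10039026700	2-s2.0-29144536313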
My ETL config:
{
  "config": { "log": "debug" },
  "source": { "file": { "path": "/root/poc1_Datasets/authAbstractEdge1.tsv" } },
  "extractor": { "row": {} },
  "transformers": [
    { "csv": { "separator": "\t" } },
    { "merge": {
        "joinFieldName": "authorId",
        "lookup": "Author.authorId"
    } },
    { "vertex": { "class": "Author" } },
    { "edge": {
        "class": "Authored",
        "joinFieldName": "abstractId",
        "lookup": "Abstract.abstractId",
        "direction": "out"
    } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "remote:localhost/DataSpine1",
      "dbType": "graph",
      "wal": false,
      "tx": false
    }
  }
}
When I run ETL with this config and file I get:
OrientDB etl v.2.1.2 (build #BUILD#) www.orientdb.com
BEGIN ETL PROCESSOR
[file] DEBUG Reading from file /root/poc1_Datasets/authAbstractEdge1.tsv
[0:csv] DEBUG Transformer input: authorId abstractId
[0:csv] DEBUG parsing=authorId abstractId
[0:csv] DEBUG Transformer output: null
2016-06-09 12:15:04:088 WARNI Transformer [csv] returned null, skip rest of pipeline execution [OETLPipeline][1:csv] DEBUG Transformer input: 9-s2.0-10039026700 2-s2.0-29144536313
[1:csv] DEBUG parsing=9-s2.0-10039026700 2-s2.0-29144536313
[1:csv] DEBUG document={authorId:9-s2.0-10039026700,abstractId:2-s2.0-29144536313}
[1:csv] DEBUG Transformer output: {authorId:9-s2.0-10039026700,abstractId:2-s2.0-29144536313}
[1:merge] DEBUG Transformer input: {authorId:9-s2.0-10039026700,abstractId:2-s2.0-29144536313}
[1:merge] DEBUG joinValue=9-s2.0-10039026700, lookupResult=Author#12:10046021{authorId:9-s2.0-10039026700,authGivenName:M. A.,authSurname:Turovskaya,abstractId:2-s2.0-29144536313} v2
[1:merge] DEBUG merged record Author#12:10046021{authorId:9-s2.0-10039026700,authGivenName:M. A.,authSurname:Turovskaya,abstractId:2-s2.0-29144536313} v2 with found record={authorId:9-s2.0-10039026700,abstractId:2-s2.0-29144536313}
[1:merge] DEBUG Transformer output: Author#12:10046021{authorId:9-s2.0-10039026700,authGivenName:M. A.,authSurname:Turovskaya,abstractId:2-s2.0-29144536313} v2
[1:vertex] DEBUG Transformer input: Author#12:10046021{authorId:9-s2.0-10039026700,authGivenName:M. A.,authSurname:Turovskaya,abstractId:2-s2.0-29144536313} v2
[1:vertex] DEBUG Transformer output: v(Author)[#12:10046021]
[1:edge] DEBUG Transformer input: v(Author)[#12:10046021]
[1:edge] DEBUG joinCurrentValue=2-s2.0-29144536313, lookupResult=Abstract#13:16626366{abstractId:2-s2.0-29144536313} v1
Error in Pipeline execution: java.lang.IllegalArgumentException: destination vertex is null
java.lang.IllegalArgumentException: destination vertex is null
at com.tinkerpop.blueprints.impls.orient.OrientVertex.addEdge(OrientVertex.java:888)
at com.tinkerpop.blueprints.impls.orient.OrientVertex.addEdge(OrientVertex.java:832)
at com.orientechnologies.orient.etl.transformer.OEdgeTransformer.createEdge(OEdgeTransformer.java:188)
at com.orientechnologies.orient.etl.transformer.OEdgeTransformer.executeTransform(OEdgeTransformer.java:117)
at com.orientechnologies.orient.etl.transformer.OAbstractTransformer.transform(OAbstractTransformer.java:37)
at com.orientechnologies.orient.etl.OETLPipeline.execute(OETLPipeline.java:114)
at com.orientechnologies.orient.etl.OETLProcessor.executeSequentially(OETLProcessor.java:487)
at com.orientechnologies.orient.etl.OETLProcessor.execute(OETLProcessor.java:291)
at com.orientechnologies.orient.etl.OETLProcessor.main(OETLProcessor.java:161)
ETL process halted: com.orientechnologies.orient.etl.OETLProcessHaltedException: java.lang.IllegalArgumentException: destination vertex is null
As I look at the debug output, it appears that the merge transformer successfully found the Author vertex, and the edge transformer successfully found the Abstract vertex (based on the RIDs in the output). I'm stumped as to why I'm getting the exception. Thanks in advance for any pointers.
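In fact, the destination lookup itself clearly succeeds (the debug shows lookupResult=Abstract#13:16626366), and it can be reproduced by hand in the console with:

SELECT FROM Abstract WHERE abstractId = '2-s2.0-29144536313'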
Have you already tried to see whether the new ETL tool in version 2.2, Teleporter, solves this problem?
There is a description of the new ETL product at this link.
I actually discovered that the ETL loader in OrientDB version 2.2.2 seems to have solved this issue. (Note: version 2.2.0 still had the same issue)