Problems activating the Nutch headings plugin

I'm trying to activate the headings plugin in Nutch 1.8, but somehow it does not work. Here are the relevant parts of my nutch-site.xml:
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika|metatags|headings)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>activates metatag parsing </description>
</property>
<property>
<name>headings</name>
<value>h1;h2</value>
<description>Comma separated list of headings to retrieve from the document</description>
</property>
<property>
<name>headings.multivalued</name>
<value>false</value>
<description>Whether to support multivalued headings.</description>
</property>
<property>
<name>index.parse.md</name>
<value>metatag.description,metatag.title, metatag.keywords, metatag.author,
metatag.author, headings.h1, headings.h2</value>
<description> Comma-separated list of keys to be taken from the parse metadata to generate fields. Can be used e.g. for 'description' or 'keywords' provided that these values are generated by a parser (see parse-metatags plugin)
</description>
</property>
Can someone help?
Thanks, Chris

After struggling with this myself, I've found that the following should work (Apache Nutch 1.9):
<property>
<name>plugin.includes</name>
<value>protocol-http|headings|parse-(html|tika|metatags)|...</value>
</property>
<property>
<name>index.parse.md</name>
<value>h1,h2,h3</value>
</property>
<property>
<name>headings</name>
<value>h1,h2,h3</value>
</property>
<property>
<name>headings.multivalued</name>
<value>true</value>
</property>
The following should be added to your schema.xml file (when using Apache Solr):
<!-- fields for the headings plugin -->
<field name="h1" type="text" stored="true" indexed="true" multiValued="true"/>
<field name="h2" type="text" stored="true" indexed="true" multiValued="true"/>
<field name="h3" type="text" stored="true" indexed="true" multiValued="true"/>

Within
<name>index.parse.md</name>
check for metatag.h1 and metatag.h2:
<property>
<name>index.parse.md</name>
<value>metatag.h1,metatag.h2</value>
...
By the way, headings is not a parse-... plugin (its plugin id is just headings).
You have to use
<name>plugin.includes</name>
<value>headings|parse-(html|tika|metatags)|...
Now it should work...

Related

Spark2 unable to find table or view on remote hdfs cluster

I am using HiveContext to query a Hive table on an HDFS cluster remotely through Spark 1.6.0 and am able to do so successfully. However, when doing so through Spark 2.3.0, it throws the following:
org.apache.spark.sql.AnalysisException:
Table or view not found: `hiveorc_replica`.`appointment`; line 1 pos 21;
'Aggregate [unresolvedalias(count(1), None)]
+- 'UnresolvedRelation `hiveorc_replica`.`appointment`
From this message, the only thing I can infer is that it might be searching for the database locally instead of remotely. I am creating a Spark context using:
val conf = new SparkConf().setAppName("SparkApp").setMaster("local")
val sc=new SparkContext(conf)
val hc = new HiveContext(sc)
val actualRecordCountHC = hc.sql("select count(*) from hiveorc_replica.appointment")
val records = hc.sql("select * from hiveorc_replica.appointment")
All the config files are present in the resources folder of my project. The following is my hive-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<!--Autogenerated by Cloudera Manager-->
<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://fqdn:9083</value>
</property>
<property>
<name>hive.metastore.client.socket.timeout</name>
<value>300</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
<property>
<name>hive.warehouse.subdir.inherit.perms</name>
<value>true</value>
</property>
<property>
<name>hive.auto.convert.join</name>
<value>true</value>
</property>
<property>
<name>hive.auto.convert.join.noconditionaltask.size</name>
<value>20971520</value>
</property>
<property>
<name>hive.optimize.bucketmapjoin.sortedmerge</name>
<value>false</value>
</property>
<property>
<name>hive.smbjoin.cache.rows</name>
<value>10000</value>
</property>
<property>
<name>hive.server2.logging.operation.enabled</name>
<value>true</value>
</property>
<property>
<name>hive.server2.logging.operation.log.location</name>
<value>/var/log/hive/operation_logs</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>-1</value>
</property>
<property>
<name>hive.exec.reducers.bytes.per.reducer</name>
<value>67108864</value>
</property>
<property>
<name>hive.exec.copyfile.maxsize</name>
<value>33554432</value>
</property>
<property>
<name>hive.exec.reducers.max</name>
<value>1099</value>
</property>
<property>
<name>hive.vectorized.groupby.checkinterval</name>
<value>4096</value>
</property>
<property>
<name>hive.vectorized.groupby.flush.percent</name>
<value>0.1</value>
</property>
<property>
<name>hive.compute.query.using.stats</name>
<value>false</value>
</property>
<property>
<name>hive.vectorized.execution.enabled</name>
<value>false</value>
</property>
<property>
<name>hive.vectorized.execution.reduce.enabled</name>
<value>false</value>
</property>
<property>
<name>hive.merge.mapfiles</name>
<value>true</value>
</property>
<property>
<name>hive.merge.mapredfiles</name>
<value>false</value>
</property>
<property>
<name>hive.cbo.enable</name>
<value>false</value>
</property>
<property>
<name>hive.fetch.task.conversion</name>
<value>minimal</value>
</property>
<property>
<name>hive.fetch.task.conversion.threshold</name>
<value>268435456</value>
</property>
<property>
<name>hive.limit.pushdown.memory.usage</name>
<value>0.1</value>
</property>
<property>
<name>hive.merge.sparkfiles</name>
<value>true</value>
</property>
<property>
<name>hive.merge.smallfiles.avgsize</name>
<value>16777216</value>
</property>
<property>
<name>hive.merge.size.per.task</name>
<value>268435456</value>
</property>
<property>
<name>hive.optimize.reducededuplication</name>
<value>true</value>
</property>
<property>
<name>hive.optimize.reducededuplication.min.reducer</name>
<value>4</value>
</property>
<property>
<name>hive.map.aggr</name>
<value>true</value>
</property>
<property>
<name>hive.map.aggr.hash.percentmemory</name>
<value>0.5</value>
</property>
<property>
<name>hive.optimize.sort.dynamic.partition</name>
<value>false</value>
</property>
<property>
<name>hive.execution.engine</name>
<value>mr</value>
</property>
<property>
<name>spark.executor.memory</name>
<value>268435456</value>
</property>
<property>
<name>spark.driver.memory</name>
<value>268435456</value>
</property>
<property>
<name>spark.executor.cores</name>
<value>1</value>
</property>
<property>
<name>spark.yarn.driver.memoryOverhead</name>
<value>26</value>
</property>
<property>
<name>spark.yarn.executor.memoryOverhead</name>
<value>26</value>
</property>
<property>
<name>spark.dynamicAllocation.enabled</name>
<value>true</value>
</property>
<property>
<name>spark.dynamicAllocation.initialExecutors</name>
<value>1</value>
</property>
<property>
<name>spark.dynamicAllocation.minExecutors</name>
<value>1</value>
</property>
<property>
<name>spark.dynamicAllocation.maxExecutors</name>
<value>2147483647</value>
</property>
<property>
<name>hive.metastore.execute.setugi</name>
<value>true</value>
</property>
<property>
<name>hive.support.concurrency</name>
<value>true</value>
</property>
<property>
<name>hive.zookeeper.quorum</name>
<value>fqdn</value>
</property>
<property>
<name>hive.zookeeper.client.port</name>
<value>2181</value>
</property>
<property>
<name>hive.zookeeper.namespace</name>
<value>hive_zookeeper_namespace_CD-HIVE-WAyDdBlP</value>
</property>
<property>
<name>hive.cluster.delegation.token.store.class</name>
<value>org.apache.hadoop.hive.thrift.MemoryTokenStore</value>
</property>
<property>
<name>hive.server2.enable.doAs</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.sasl.enabled</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.kerberos.principal</name>
<value>hive/_HOST@EXAMPLE.COM</value>
</property>
<property>
<name>hive.server2.authentication.kerberos.principal</name>
<value>hive/_HOST@EXAMPLE.COM</value>
</property>
<property>
<name>spark.shuffle.service.enabled</name>
<value>true</value>
</property>
<property>
<name>hive.server2.authentication</name>
<value>LDAP</value>
</property>
</configuration>
fqdn is replaced by the remote HDFS FQDN at runtime. Also, when I run the same code through Spark 2 on the remote cluster itself, where the Hive database is present, it returns results.
So, how do I run the code remotely?
Creating a SparkSession for Spark 2 did the job. Looking at the logs, I found that it was somehow unable to get the value of hive.metastore.uris from hive-site.xml, and setting it through the SparkSession was the answer.
val spark = SparkSession.builder.master("local").config("hive.metastore.uris", "thrift://"+hdfsFQDN+":9083").enableHiveSupport.getOrCreate
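For completeness, a slightly fuller sketch of the same Spark 2 approach (an illustration only; hdfsFQDN is a placeholder for the remote metastore host, and the table name comes from the question):
import org.apache.spark.sql.SparkSession

val hdfsFQDN = "fqdn"  // assumption: the remote metastore host, as in hive-site.xml

// Build a session with Hive support and point it explicitly at the remote metastore
val spark = SparkSession.builder()
  .appName("SparkApp")
  .master("local")
  .config("hive.metastore.uris", s"thrift://$hdfsFQDN:9083")
  .enableHiveSupport()
  .getOrCreate()

// Same queries as in the HiveContext version
val actualRecordCount = spark.sql("select count(*) from hiveorc_replica.appointment")
val records = spark.sql("select * from hiveorc_replica.appointment")
actualRecordCount.show()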
However, I still wonder why it cannot pick up the value of hive.metastore.uris when running remotely, given that it is able to read the file from resources when running through HiveContext.

MapReduce job submitted to a queue is not running on the labelled node and instead runs under the default partition

I have a Hortonworks (HDP 2.4.2) cluster with three data nodes (node1, node2 & node3).
I wanted to check the node labelling feature in Hortonworks. For that I created a node label (x) and mapped it to data node node1.
So now, there are two partitions:
Default partition - contains node2 & node3
Partition x - contains node1
Later on I created a queue named "protegrity" and mapped it to node label x.
Now, when I run any MapReduce job, the job gets executed, but on the "protegrity" queue of the default partition, which I was not expecting. It was supposed to get executed on queue "protegrity" on node1 (labelled partition x). Please refer to the attached screenshot of the scheduler.
[Screenshot: Scheduler]
The job executed was: hadoop jar ./hadoop-mapreduce/hadoop-mapreduce-examples.jar pi -Dmapred.job.queue.name=protegrity -Dnode_label_expression=x 2 2
The configuration from the capacity-scheduler.xml file is shown below:
<property>
<name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
<value>0.2</value>
</property>
<property>
<name>yarn.scheduler.capacity.maximum-applications</name>
<value>10000</value>
</property>
<property>
<name>yarn.scheduler.capacity.node-locality-delay</name>
<value>40</value>
</property>
<property>
<name>yarn.scheduler.capacity.queue-mappings-override.enable</name>
<value>false</value>
</property>
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.accessible-node-labels</name>
<value>x</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.accessible-node-labels.x.capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.accessible-node-labels.x.maximum-capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.acl_administer_queue</name>
<value>yarn</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.acl_administer_queue</name>
<value>yarn </value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
<value>yarn</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.capacity</name>
<value>50</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.state</name>
<value>RUNNING</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.hawqque.acl_administer_queue</name>
<value>yarn</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.hawqque.acl_submit_applications</name>
<value>yarn</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.hawqque.capacity</name>
<value>20</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.hawqque.maximum-capacity</name>
<value>80</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.hawqque.state</name>
<value>RUNNING</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.hawqque.user-limit-factor</name>
<value>2</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.protegrity.accessible-node-labels</name>
<value>x</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.protegrity.accessible-node-labels.x.capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.protegrity.accessible-node-labels.x.maximum-capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.protegrity.acl_administer_queue</name>
<value>yarn </value>
</property>
<property>
<name>yarn.scheduler.capacity.root.protegrity.acl_submit_applications</name>
<value>yarn </value>
</property>
<property>
<name>yarn.scheduler.capacity.root.protegrity.capacity</name>
<value>30</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.protegrity.maximum-capacity</name>
<value>60</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.protegrity.minimum-user-limit-percent</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.protegrity.ordering-policy</name>
<value>fifo</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.protegrity.state</name>
<value>RUNNING</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.protegrity.user-limit-factor</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>default,hawqque,protegrity</value>
</property>
But when I executed an example command given in https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.6/bk_yarn_resource_mgt/content/using_node_labels.html, which doesn't trigger a MapReduce job, the job executed on the labelled node and used the mentioned queue.
The command was: sudo su yarn
hadoop jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar
-shell_command "sleep 100000" -jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar
-queue protegrity -node_label_expression x
So, I am a bit confused here: does node labelling work for MapReduce jobs or not?
If yes, then I need a little help.
Node labelling doesn't work with a MapReduce job unless we attach a node label to a queue and run the MapReduce job using that queue. Per your given configuration, queue protegrity will always use the default partition with a capacity of 30% and not the node you have labelled as x, because you need to make x the default node label for the protegrity queue. Please add the following configuration to make it work:
<property>
<name>yarn.scheduler.capacity.root.protegrity.default-node-label-expression</name>
<value>x</value>
</property>
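Note that after changing capacity-scheduler.xml the queue configuration normally has to be refreshed before the new default node label expression takes effect (Ambari does this when you save; on a manually managed ResourceManager something like the following should work):
yarn rmadmin -refreshQueues
Resubmitting the pi job to the protegrity queue should then place its containers in partition x.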

HTTP/1.1 400 Bad Request executing oozie spark job

I'm trying to execute the Spark Oozie example on the oozie_spark branch against a BigInsights for Apache Hadoop basic cluster.
The workflow.xml looks like this:
<workflow-app xmlns='uri:oozie:workflow:0.5' name='SparkWordCount'>
<start to='spark-node' />
<action name='spark-node'>
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>${master}</master>
<name>Spark-Wordcount</name>
<class>org.apache.spark.examples.WordCount</class>
<jar>/iop/apps/4.2.0.0/spark/jars/spark-assembly.jar,${jobDir}/lib/spark-wordcount-example.jar</jar>
<spark-opts>--conf spark.driver.extraJavaOptions=-Diop.version=4.2.0.0</spark-opts>
<arg>${inputDir}/FILE</arg>
<arg>${outputDir}</arg>
<capture-output/>
</spark>
<ok to="end" />
<error to="fail" />
</action>
<kill name="fail">
<message>Workflow failed, error
message[${wf:errorMessage(wf:lastErrorNode())}]
</message>
</kill>
<end name='end' />
</workflow-app>
The configuration.xml:
<configuration>
<property>
<name>master</name>
<value>local</value>
</property>
<property>
<name>queueName</name>
<value>default</value>
</property>
<property>
<name>user.name</name>
<value>default</value>
</property>
<property>
<name>nameNode</name>
<value>default</value>
</property>
<property>
<name>jobTracker</name>
<value>default</value>
</property>
<property>
<name>jobDir</name>
<value>/user/snowch/test</value>
</property>
<property>
<name>inputDir</name>
<value>/user/snowch/test/input</value>
</property>
<property>
<name>outputDir</name>
<value>/user/snowch/test/output</value>
</property>
<property>
<name>oozie.wf.application.path</name>
<value>/user/snowch/test</value>
</property>
</configuration>
However, the error is:
Exception in thread "main" org.apache.hadoop.gateway.shell.HadoopException:
org.apache.hadoop.gateway.shell.ErrorResponse: HTTP/1.1 400 Bad Request
What am I doing wrong?
The problem appears to be caused by the <capture-output/> element.
I removed it and the error disappeared.
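For reference, this is the spark action from the question with only that element removed (a sketch; every other value is unchanged from the original workflow):
<spark xmlns="uri:oozie:spark-action:0.1">
<!-- <capture-output/> removed; the rest of the action is as in the original workflow -->
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>${master}</master>
<name>Spark-Wordcount</name>
<class>org.apache.spark.examples.WordCount</class>
<jar>/iop/apps/4.2.0.0/spark/jars/spark-assembly.jar,${jobDir}/lib/spark-wordcount-example.jar</jar>
<spark-opts>--conf spark.driver.extraJavaOptions=-Diop.version=4.2.0.0</spark-opts>
<arg>${inputDir}/FILE</arg>
<arg>${outputDir}</arg>
</spark>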

connecting to Oracle XE with myBatis using JDBC in Eclipse

I'm using Eclipse with Maven, and in myBatis-config.xml I have the following code. The H2 part works: I can connect to H2 with my program and access the database. The Oracle part of my code doesn't work. I'm using Oracle Database XE 11.2, Application Express, with workspace: test, username: name, password: 123. When I run a testing class in Eclipse, the H2 tests pass, but when I run the same test using Oracle instead, it fails with an error: "Error selecting key or setting result to parameter object. Cause: java.sql.SQLSyntaxErrorException: ORA-02289: sequence does not exist".
<environment id="H2">
<transactionManager type="JDBC" />
<dataSource type="POOLED">
<property name="driver" value="org.h2.Driver" />
<property name="url" value="jdbc:h2:tcp://localhost:9096/sample/testDB" />
<property name="username" value="sa" />
<property name="password" value="123" />
</dataSource>
</environment>
<environment id="ORACLE">
<transactionManager type="JDBC" />
<dataSource type="POOLED">
<property name="driver" value="oracle.jdbc.OracleDriver" />
<property name="url" value="jdbc:oracle:thin:#localhost:1521:xe" />
<property name="username" value="system" />
<property name="password" value="123" />
</dataSource>
</environment>
Hello, reading the documentation on the official MyBatis site I found the following information:
If you are using the multi-db feature, you will need to configure the databaseIdProvider property in the following way:
<bean id="vendorProperties" class="org.springframework.beans.factory.config.PropertiesFactoryBean">
<property name="properties">
<props>
<prop key="SQL Server">sqlserver</prop>
<prop key="DB2">db2</prop>
<prop key="Oracle">oracle</prop>
<prop key="MySQL">mysql</prop>
</props>
</property>
</bean>
<bean id="databaseIdProvider" class="org.apache.ibatis.mapping.VendorDatabaseIdProvider">
<property name="properties" ref="vendorProperties"/>
</bean>
<bean id="sqlSessionFactory" class="org.mybatis.spring.SqlSessionFactoryBean">
<property name="dataSource" ref="dataSource" />
<property name="mapperLocations" value="classpath*:sample/config/mappers/**/*.xml" />
<property name="databaseIdProvider" ref="databaseIdProvider"/>
</bean>
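If you are configuring MyBatis directly through myBatis-config.xml (as in the question) rather than through Spring, the vendor mapping can be declared there instead; a minimal sketch based on the MyBatis documentation (note that element order inside <configuration> matters):
<!-- assigns a databaseId per JDBC vendor so mappers can switch statements by database -->
<databaseIdProvider type="DB_VENDOR">
<property name="SQL Server" value="sqlserver"/>
<property name="DB2" value="db2"/>
<property name="Oracle" value="oracle"/>
<property name="MySQL" value="mysql"/>
</databaseIdProvider>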
Hope it has been helpful.
Greetings.
NOTE: Since 1.3.0, the configuration property has been added. It allows specifying a Configuration instance directly, without a MyBatis XML configuration file. For example:
mybatis.org/spring/es/factorybean.html

Using optional fields with StaxEventItemReader

I have a Spring Batch application and I'm using StaxEventItemReader as my ItemReader. By default XStream requires us to declare a property for each possible XML tag, or else it throws an UnknownFieldException. There are ways to code around this in Java, but with Spring Batch the input reader doesn't seem to offer a way to modify that. Is there a way to flag fields as optional in the XML?
My beans are configured basically like this:
<job id="synchronizecustomerData" xmlns="http://www.springframework.org/schema/batch">
<step id="readWritecustomers">
<tasklet>
<chunk reader="customerReader"
processor="customerProcessor"
writer="customerSyncWriter"
commit-interval="1"
skip-policy="alwaysSkip" >
</chunk>
</tasklet>
</step>
</job>
<bean id="customerReader" class="org.springframework.batch.item.xml.StaxEventItemReader">
<property name="fragmentRootElementName" value="customer" />
<property name="resource" ref="inputResource" />
<property name="unmarshaller" ref="customerMarshaller" />
</bean>
<bean id="inputResource" class="org.springframework.core.io.FileSystemResource">
<constructor-arg value="c:/sf/data.xml" />
</bean>
<bean id="customerMarshaller" class="org.springframework.oxm.xstream.XStreamMarshaller">
<property name="aliases">
<util:map id="aliases">
<entry key="customer" value="com.company.batchmaster.sf.beans.customer" />
<entry key="name" value="java.lang.String" />
</util:map>
</property>
</bean>
<bean id="customerProcessor" class="org.springframework.batch.item.support.CompositeItemProcessor">
<property name="delegates">
<list>
<ref bean="customerTransformer" />
</list>
</property>
</bean>
<bean id="customerTransformer" class="com.company.batchmaster.sf.chunk.customerTransformer" />
<bean id="customerSyncWriter" class="com.company.batchmaster.sf.chunk.customerSyncWriter" />
My import file looks like this (just getting it up and running):
<?xml version="1.0" encoding="UTF-8"?>
<records>
<customer xmlns="http://springframework.org/batch/sample/io/oxm/domain">
<name>ABC Dealer</name>
<types>CR</types>
</customer>
</records>
Thanks for any help.
I am assuming the Customer class has properties name and type.
Annotate it with XmlAttribute.defaultValue() as described in the JAXB guide.
There is no need for this alias (<entry key="name" value="java.lang.String" />) because you are unmarshalling a complete Customer object from each fragment, as specified with <property name="fragmentRootElementName" value="customer" />.
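For what it's worth, with that entry dropped the marshaller bean would simply be (everything else unchanged from the question):
<bean id="customerMarshaller" class="org.springframework.oxm.xstream.XStreamMarshaller">
<property name="aliases">
<util:map id="aliases">
<entry key="customer" value="com.company.batchmaster.sf.beans.customer" />
</util:map>
</property>
</bean>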