Support Questions

What would be the right command to start Druid Hadoop Indexer for HDP 2.6.3?

I read http://druid.io/docs/latest/ingestion/command-line-hadoop-indexer.html and tried the following command:

java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Dhdp.version=2.6.3.0-235 -classpath /usr/hdp/current/druid-overlord/conf/_common:/usr/hdp/current/druid-overlord/lib/*:/etc/hadoop/conf io.druid.cli.Main index hadoop ./hadoop_index_spec.json

But the job fails with the error below:

2018-03-14T07:37:06,132 INFO [main] io.druid.indexer.JobHelper - Deleting path[/tmp/druid/mmcellh/2018-03-14T071308.731Z_55fbb15cd4d4454885d909c870837f93]
2018-03-14T07:37:06,150 ERROR [main] io.druid.cli.CliHadoopIndexer - failure!!!!
java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_151]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_151]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_151]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_151]
        at io.druid.cli.CliHadoopIndexer.run(CliHadoopIndexer.java:117) [druid-services-0.10.1.2.6.3.0-235.jar:0.10.1.2.6.3.0-235]
        at io.druid.cli.Main.main(Main.java:108) [druid-services-0.10.1.2.6.3.0-235.jar:0.10.1.2.6.3.0-235]
Caused by: io.druid.java.util.common.ISE: Job[class io.druid.indexer.IndexGeneratorJob] failed!
        at io.druid.indexer.JobHelper.runJobs(JobHelper.java:389) ~[druid-indexing-hadoop-0.10.1.2.6.3.0-235.jar:0.10.1.2.6.3.0-235]
        at io.druid.indexer.HadoopDruidIndexerJob.run(HadoopDruidIndexerJob.java:95) ~[druid-indexing-hadoop-0.10.1.2.6.3.0-235.jar:0.10.1.2.6.3.0-235]
        at io.druid.indexer.JobHelper.runJobs(JobHelper.java:369) ~[druid-indexing-hadoop-0.10.1.2.6.3.0-235.jar:0.10.1.2.6.3.0-235]
        at io.druid.cli.CliInternalHadoopIndexer.run(CliInternalHadoopIndexer.java:131) ~[druid-services-0.10.1.2.6.3.0-235.jar:0.10.1.2.6.3.0-235]
        at io.druid.cli.Main.main(Main.java:108) ~[druid-services-0.10.1.2.6.3.0-235.jar:0.10.1.2.6.3.0-235]
        ... 6 more

And the YARN application log shows "xxxx is not a valid DFS filename":

2018-03-14T07:31:41,369 ERROR [main] io.druid.indexer.JobHelper - Exception in retry loop
java.lang.IllegalArgumentException: Pathname /tmp/data/index/output/mmcellh/2014-02-11T10:00:00.000Z_2014-02-11T11:00:00.000Z/2018-03-14T07:13:08.731Z/0/index.zip.3 from hdfs://sandbox-hdp.hortonworks.com:8020/tmp/data/index/output/mmcellh/2014-02-11T10:00:00.000Z_2014-02-11T11:00:00.000Z/2018-03-14T07:13:08.731Z/0/index.zip.3 is not a valid DFS filename.
        at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:217) ~[hadoop-hdfs-2.7.3.2.6.3.0-235.jar:?]
        at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:480) ~[hadoop-hdfs-2.7.3.2.6.3.0-235.jar:?]
        at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:476) ~[hadoop-hdfs-2.7.3.2.6.3.0-235.jar:?]
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) ~[hadoop-common-2.7.3.2.6.3.0-235.jar:?]
        at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:491) ~[hadoop-hdfs-2.7.3.2.6.3.0-235.jar:?]
        at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:417) ~[hadoop-hdfs-2.7.3.2.6.3.0-235.jar:?]
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:930) ~[hadoop-common-2.7.3.2.6.3.0-235.jar:?]
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:891) ~[hadoop-common-2.7.3.2.6.3.0-235.jar:?]
        at io.druid.indexer.JobHelper$4.push(JobHelper.java:415) [druid-indexing-hadoop-0.10.1.2.6.3.0-235.jar:0.10.1.2.6.3.0-235]
...

https://github.com/druid-io/druid/pull/1121 looks similar but this should have been fixed in HDP 2.6.3.

So I'm wondering if the classpath I'm using is correct.

1 ACCEPTED SOLUTION

New Contributor

I'm running this index job via the command line using the jars as described here:

http://druid.io/docs/latest/ingestion/command-line-hadoop-indexer.html

I've determined that Druid 0.12.0 has something odd going on in conjunction with druid-parquet-extensions: the fs.defaultFS set in conf/druid/_common/common.runtime.properties is seemingly not respected at some point (I don't have time to trace through the open source project). So here is the workaround that worked for me; hopefully it will be helpful:

java -Xmx512m -Ddruid.storage.storageDirectory=hdfs://{my_namenode_ip}:{my_namenode_port}/{my_segments_path} -Ddruid.storage.type=hdfs -Dfile.encoding=UTF-8 -classpath extensions/druid-parquet-extensions/*:extensions/druid-avro-extensions:extensions/druid-hdfs-storage:lib/*:conf/druid/_common:{HADOOP_PATH}{HADOOP_JAR} io.druid.cli.Main index hadoop {DRUID_INDEXER_DATA}
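For an HDP install like the one in the question, the relative paths above would map to locations under /usr/hdp/current. A hedged sketch (assuming the stock HDP symlink layout; adjust to your install) that builds the classpath piece by piece and prints it for inspection:

```shell
# Assumption: stock HDP symlink layout under /usr/hdp/current.
DRUID_HOME=/usr/hdp/current/druid-overlord

CP="$DRUID_HOME/extensions/druid-hdfs-storage/*"   # HDFS deep-storage module
CP="$CP:$DRUID_HOME/lib/*"                         # core Druid jars
CP="$CP:$DRUID_HOME/conf/_common"                  # common runtime properties
CP="$CP:/etc/hadoop/conf"                          # Hadoop client config (core-site.xml etc.)

echo "$CP"
```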


7 REPLIES

Contributor

Please also share the spec file (hadoop_index_spec.json) and the complete YARN application logs.

Thank you, @Nishant Bangarwa

I sent those by email.

New Contributor

@Hajime

Having the same problem while testing an upgrade to 0.12.0. I ran into your thread and thought I'd share a seemingly related link from a while ago:
https://groups.google.com/forum/#!topic/druid-development/8u5orNnQlwE

"Druid checks the default file system for replacing ":" with "_" and making a valid DFS file path, What is the value of fs.defaultFS set in hadoop config files ? can you try pointing this to hdfs filesystem, If its not already doing that ?"


The core-site.xml under /etc/hadoop/conf shows:

    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://sandbox-hdp.hortonworks.com:8020</value>
      <final>true</final>
    </property>

So... I guess my config is OK?

Do I need to add "druid.indexer.fork.property.druid.indexer.task.hadoopWorkingPath" to some properties file and include that file on the classpath (-cp)?
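One quick way to double-check what value a client actually resolves for fs.defaultFS is to read it straight out of core-site.xml (the standard `hdfs getconf -confKey fs.defaultFS` command reports the same thing). A small hypothetical sketch, using an inline sample of the config shown above:

```python
# Hypothetical sketch: read a property out of a Hadoop core-site.xml.
import xml.etree.ElementTree as ET

def get_property(core_site_xml: str, key: str):
    """Return the <value> for the <property> whose <name> matches key."""
    root = ET.fromstring(core_site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == key:
            return prop.findtext("value")
    return None

sample = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://sandbox-hdp.hortonworks.com:8020</value>
    <final>true</final>
  </property>
</configuration>"""

print(get_property(sample, "fs.defaultFS"))
# -> hdfs://sandbox-hdp.hortonworks.com:8020
```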

Expert Contributor

I think your classpath is missing the HDFS storage module that is under the extensions directory.



Thanks a lot!

java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Dhdp.version=2.6.3.0-235 -Ddruid.storage.storageDirectory=hdfs://`hostname -f`:8020/tmp/data/index/output -Ddruid.storage.type=hdfs -classpath /usr/hdp/current/druid-overlord/extensions/druid-hdfs-storage/*:/usr/hdp/current/druid-overlord/lib/*:/usr/hdp/current/druid-overlord/conf/_common:/etc/hadoop/conf/ io.druid.cli.Main index hadoop ./hadoop_index_spec.json

The above worked.
Mine is a sandbox, so I used `hostname -f` for the NameNode host.