Member since
02-11-2014
162
Posts
2
Kudos Received
4
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3012 | 12-04-2015 12:46 PM
 | 3482 | 02-12-2015 01:06 PM
 | 2869 | 03-20-2014 12:41 PM
 | 4649 | 03-19-2014 08:54 AM
02-01-2019
02:39 PM
@Harsh J How would I do this for just one job? I tried the settings below but they are not working. The issue is that I want to use a version of Jersey that I bundled into my fat jar; however, the gateway node has an older version of that jar and a class gets loaded from there, resulting in a NoSuchMethodException. My application is not a MapReduce job; I run it with hadoop jar on 5.14.4.
export HADOOP_USER_CLASSPATH_FIRST=true
export HADOOP_CLASSPATH=/projects/poc/test/config:$HADOOP_CLASSPATH
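For what it's worth, a quick way to confirm which copy of a conflicting class actually wins at runtime is to print its code source. A minimal sketch; the Jersey class name here is only an example, substitute the class whose method is missing:

public class WhichJar {
    public static void main(String[] args) throws ClassNotFoundException {
        // Example Jersey 1.x class; replace with the class that throws in your job.
        Class<?> c = Class.forName("com.sun.jersey.api.client.Client");
        // Prints the jar (or directory) the class was loaded from, which shows
        // whether the fat jar's copy or the gateway node's copy was picked up.
        System.out.println(c.getProtectionDomain().getCodeSource().getLocation());
    }
}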
11-20-2017
01:48 PM
Hello, We have a use case for ingesting binary files from a mainframe to HDFS in Avro format. These binary files contain different record types that are variable in length; the first 4 bytes denote the length of the record. I have written a standalone Java program to ingest the data to HDFS using Avro DataFileWriter. Now, these files from the mainframe are much smaller in size (under a block size) and create small files. Some of the options we came up with to avoid this are: 1. Convert the batch process into more of a service that runs behind the scenes, so the Avro DataFileWriter can keep running and flush the data at certain intervals (time/size). I do not see a default implementation for this right now. 2. Write the data into an HDFS tmp location, merge the files every hour or so, and move them to the final HDFS destination. We can afford a latency of an hour before data is made available to consumers. 3. Make use of the Avro append functionality. Appreciate your help!
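If option 3 is pursued, here is a minimal sketch of appending to an existing Avro container file on HDFS, assuming Avro 1.7.x's DataFileWriter.appendTo and a cluster with HDFS append enabled (class and variable names are illustrative):

import java.io.IOException;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.FsInput;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AvroHdfsAppend {
    // Appends records to an Avro container file that already exists on HDFS.
    public static void appendRecords(Configuration conf, Path file,
                                     Iterable<GenericRecord> records) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        DataFileWriter<GenericRecord> writer =
                new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>());
        // appendTo reads the schema and sync marker from the existing file and
        // writes new blocks to the appended output stream.
        writer.appendTo(new FsInput(file, conf), fs.append(file));
        try {
            for (GenericRecord r : records) {
                writer.append(r);
            }
        } finally {
            writer.close(); // flushes the final block
        }
    }
}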
09-12-2017
10:37 AM
If your search is based on a subset of the data that is present in the HBase table, you could create a Solr schema and index the records into Solr as they are inserted into HBase. I used the Lily indexer/Morphlines to do that a couple of years back. You could index the row key in this schema as a hidden value, so if a user needs detailed search information you can use that row key to query HBase. If the search information required is minimal, it can be served from Cloudera Search itself.
09-12-2017
10:33 AM
Hello, I have a Java batch process that reads binary files from a legacy platform, converts them to Avro, and writes to HDFS. The files have variable-length records and I do not want to land them directly on HDFS, so they currently land on an edge node. The batch process runs every hour and creates files that are 10 to 15 MB in size. Since this is not ideal for the NameNode, we run a merge process every 24 hours to merge these files. This Hadoop cluster is not used for analytics, and all the canned reports via Hive queries access only data from the last 24 hours (hence we are able to run the merge job after 24 hours). Now that we plan to bring in more data from legacy systems, we need to run the merge process more often (every couple of hours or so instead of every 24 hours). Other processes, such as Hive queries, could be running during that time and accessing the files that are candidates for the merge, in which case the merge process should wait for them to complete (since there is a mv involved). Is there a technical solution for how this could be accomplished? Thanks, Nishanth
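One pattern that narrows the window, though it does not by itself coordinate with readers that are already in flight, is to build the merged file in a staging directory and publish it with an HDFS rename, which is atomic per file, deleting the small files only after the rename succeeds. A rough sketch with illustrative names:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MergePublish {
    // Moves a merged file from staging into the final directory, then removes
    // the small source files. New readers see either the old small files or
    // the merged file, never a half-written one.
    public static void publish(Configuration conf, Path mergedInStaging,
                               Path finalDir, Iterable<Path> smallFiles) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        Path target = new Path(finalDir, mergedInStaging.getName());
        if (!fs.rename(mergedInStaging, target)) {
            throw new IOException("rename failed: " + mergedInStaging + " -> " + target);
        }
        for (Path small : smallFiles) {
            fs.delete(small, false);
        }
    }
}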
09-12-2017
10:22 AM
What is the error you are getting? Are you trying to connect to HBase?
06-02-2017
09:10 AM
Hello Hive users, We are looking at migrating files (less than 5 MB of data in total) with variable record lengths from a mainframe system to Hive. You could think of this as metadata. Each record can have anywhere from 3 to n columns (that is, each record type has a different number of columns). What would be the best strategy to migrate this to Hive? I was thinking of converting these files into one variable-length CSV file and then importing it into a Hive table. The Hive table would consist of 4 columns, with the 4th column holding a comma-separated list of the values from columns 4 to n. Are there alternative or better approaches? Appreciate any feedback on this. Thanks, Nishanth
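For what it's worth, a tiny sketch of that flattening step; the column positions follow the 4-column layout described above, and using Hive's default '\001' field delimiter is just an illustrative choice so that the embedded commas in column 4 survive:

import java.util.Arrays;
import java.util.List;

public class RecordFlattener {
    // First three fields stay as their own columns; everything from field 4
    // onward is packed into one comma-separated value.
    public static String toRow(List<String> fields) {
        String c1 = fields.size() > 0 ? fields.get(0) : "";
        String c2 = fields.size() > 1 ? fields.get(1) : "";
        String c3 = fields.size() > 2 ? fields.get(2) : "";
        String rest = fields.size() > 3
                ? String.join(",", fields.subList(3, fields.size()))
                : "";
        return String.join("\u0001", Arrays.asList(c1, c2, c3, rest));
    }
}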
10-26-2016
01:38 PM
Hello, We have a use case for ingesting binary files from a mainframe to HDFS in Avro format. These binary files contain different record types that are variable in length; the first 4 bytes denote the length of the record. I have written a standalone Java program to ingest the data to HDFS using Avro DataFileWriter. Now, these files from the mainframe are much smaller in size (under a block size), and I have been creating one output Avro file per input file. What is the best option in this case from a performance and maintainability standpoint? Can we have just one file per record type and append to that file? That file would become very large in the future. Another option would be to have one Avro file per day. Please let me know which one would be best suited. Thanks, Nishanth
08-19-2016
11:00 AM
Hello, I have a requirement to decrypt a few columns in a Hive table. The key is managed in its own appliance and cannot be brought to the Hadoop cluster for security reasons. The UDFs that we wrote did not give much performance. We would want to pass multiple row values in the same decryption request. Should I look at something like a generic UDAF for this? Thanks
05-17-2016
10:58 AM
Hello, I have a data processing flow in which we need to remove duplicates from a dataset. There are 3 phases where we remove duplicates. The first two fit quite well in Hive. In the last phase we have to filter out certain records based on some procedural code. What we need is functionality that takes multiple rows as input and returns multiple rows as output; the function would do the duplicate checking and return the ids of those records. I have looked at generic UDTFs but am not sure whether that is the right approach. Any pointers would be helpful. Thanks, Nishan
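In case it helps to see the shape of it, here is a bare-bones GenericUDTF skeleton (Hive 1.x-era API). The duplicate check itself is a placeholder, and since a UDTF is invoked once per input row, multi-row input would still need to be grouped first, for example with collect_list:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class DuplicateIdsUDTF extends GenericUDTF {

    @Override
    public StructObjectInspector initialize(ObjectInspector[] args)
            throws UDFArgumentException {
        // One output column: the id of a record flagged as a duplicate.
        List<String> names = Arrays.asList("dup_id");
        List<ObjectInspector> ois = Arrays.<ObjectInspector>asList(
                PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(names, ois);
    }

    @Override
    public void process(Object[] row) throws HiveException {
        // Placeholder: run the procedural duplicate check over the (grouped)
        // input and forward one output row per duplicate id found.
        for (String dupId : findDuplicateIds(row)) {
            forward(new Object[] { dupId });
        }
    }

    private List<String> findDuplicateIds(Object[] row) {
        return new ArrayList<String>(); // hypothetical; real logic goes here
    }

    @Override
    public void close() throws HiveException {
        // nothing buffered in this skeleton
    }
}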
03-07-2016
04:33 PM
Thanks Harsh. Yes, my key was void. So I changed the Avro output to use AvroKeyOutputFormat (the earlier value is now the key) instead of AvroKeyValueOutputFormat, and it worked. Thanks, Nishanth
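For anyone hitting the same thing, a minimal sketch of that driver change using the newer org.apache.avro.mapreduce API (the schema argument stands in for however the record schema is obtained):

import org.apache.avro.Schema;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;

public class AvroKeyJobSetup {
    // The record itself becomes the key, so the container file carries only
    // the record schema instead of the KeyValuePair wrapper.
    public static void configure(Job job, Schema recordSchema) {
        AvroJob.setOutputKeySchema(job, recordSchema);
        job.setOutputFormatClass(AvroKeyOutputFormat.class);
        job.setOutputValueClass(NullWritable.class);
        // In the mapper/reducer: context.write(new AvroKey<GenericRecord>(record), NullWritable.get());
    }
}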
02-12-2016
11:22 AM
Hello, I am trying to load data from an Avro-backed Hive table in a Pig script using the command below:
A = LOAD 'dev.avro_test' USING org.apache.hive.hcatalog.pig.HCatLoader();
We are running into the error below. Requesting some direction. We are using CDH 5.5.1.
2016-02-12 19:21:26,515 [main] ERROR org.apache.pig.PigServer - exception during parsing: Error during parsing. Type void not present
Failed to parse: Type void not present
    at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:198)
    at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1688)
    at org.apache.pig.PigServer$Graph.access$000(PigServer.java:1421)
    at org.apache.pig.PigServer.parseAndBuild(PigServer.java:354)
    at org.apache.pig.PigServer.executeBatch(PigServer.java:379)
    at org.apache.pig.PigServer.executeBatch(PigServer.java:365)
    at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:140)
    at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:769)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:372)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
    at org.apache.pig.Main.run(Main.java:484)
    at org.apache.pig.Main.main(Main.java:158)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.TypeNotPresentException: Type void not present
    at org.apache.hive.hcatalog.data.schema.HCatFieldSchema$Type.getPrimitiveHType(HCatFieldSchema.java:92)
    at org.apache.hive.hcatalog.data.schema.HCatFieldSchema.<init>(HCatFieldSchema.java:226)
    at org.apache.hive.hcatalog.data.schema.HCatSchemaUtils.getHCatFieldSchema(HCatSchemaUtils.java:122)
    at org.apache.hive.hcatalog.data.schema.HCatSchemaUtils.getHCatFieldSchema(HCatSchemaUtils.java:115)
    at org.apache.hive.hcatalog.common.HCatUtil.getHCatFieldSchemaList(HCatUtil.java:151)
    at org.apache.hive.hcatalog.common.HCatUtil.getTableSchemaWithPtnCols(HCatUtil.java:184)
    at org.apache.hive.hcatalog.pig.HCatLoader.getSchema(HCatLoader.java:216)
    at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:175)
    at org.apache.pig.newplan.logical.relational.LOLoad.<init>(LOLoad.java:89)
    at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:853)
    at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3568)
    at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1625)
    at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:1102)
    at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
    at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
    at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:188)
    ... 19 more
2016-02-12 19:21:26,518 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Type void not present
02-09-2016
12:09 PM
If you want container files rather than key/value pairs as the map reduce output, you should use https://avro.apache.org/docs/1.7.6/api/java/org/apache/avro/mapred/AvroOutputFormat.html rather than AvroKeyValueOutputFormat.
02-09-2016
11:45 AM
I have used the combination in cases where the data model was complex and changing over time. It is pretty easy to create an Avro schema and the Java bindings. There are cases where Avro is a better fit than Parquet. If you are not sure, it may be worthwhile to start with Avro, do a performance analysis, and then switch to Parquet later, which is very easy. Nishan
02-02-2016
10:27 AM
Hello, My map reduce program currently outputs Avro data in the following format. Is there a way I can avoid the key and the wrapper literals in the output?
{
  "type" : "record",
  "name" : "KeyValuePair",
  "namespace" : "org.apache.avro.mapreduce",
  "doc" : "A key/value pair",
  "fields" : [
    { "name" : "key", "type" : "null", "doc" : "The key" },
    { "name" : "value", "type" : {
        "type" : "record",
        "name" : "RecordType",
        "namespace" : "model",
        "fields" : [ { ]} }
01-29-2016
11:37 AM
Abhinay, Can you try placing the jars as described in this post? http://blog.cloudera.com/blog/2014/05/how-to-use-the-sharelib-in-apache-oozie-cdh-5/ Thanks, Nishan
01-29-2016
11:33 AM
Hey Alex, I am encountering this even in Hive (CDH 5.5) when running a DDL that used to work with CDH 5.3.2 (it was failing silently in the metastore). I am guessing I will have to restructure my schema to avoid this (keep the aggregated column names in the struct under 4000 bytes). The problem is that much of my Java code that operates on this schema has already been written. What would be the best way to reorganize the schema? I would really appreciate some pointers. Thanks, Nishanth
12-30-2015
10:37 AM
Thanks. Please update this thread if there is work in this direction; I could then upgrade my cluster.
12-30-2015
08:15 AM
Hey Alex, If I try to create the table using a DDL with explicit column names in Hive it fails, but if I use the Avro schema to define the table it does not. I guess the only way for me now is to flatten out the schema. I was hopeful that Impala would be able to handle structs if the nesting is less than 100 columns; it looks like there is a limit on the length of the column names in the struct as well. Thanks for your help!
12-22-2015
02:27 PM
I created a Hive table pointing to an Avro schema, and that Avro schema has this particular nesting that comes to more than 4000 characters. It did not cause any issues for me when running queries against that Hive table. Since Impala supports complex data types only with Parquet, I went ahead and created a Parquet file with data in it, tried creating the table in Impala, and ran into this issue. Let me know if you have more questions. Nishan
12-22-2015
11:18 AM
Hello, I am trying to create a table in Impala (CDH 5.5) but it fails with the error message below. The Parquet file has many struct data types whose definitions are more than 4000 characters long. I did not face this issue when using Hive, but it surprisingly came up for Impala. I would appreciate any solution.
[quickstart.cloudera:21000] > CREATE TABLE columns_from LIKE PARQUET '/projects/dps/test/parquet/test.parquet' STORED AS PARQUET;
Query: create TABLE columns_from LIKE PARQUET '/projects/dps/test/parquet/test.parquet' STORED AS PARQUET
ERROR: AnalysisException: Type of column 'segment1' exceeds maximum type length of 4000 characters:
12-04-2015
12:46 PM
I tried using AvroParquetOutputFormat and the MultipleOutputs class and was able to generate Parquet files for one specific schema type. For the other schema type I am running into the error below. Any help is appreciated.
java.lang.ArrayIndexOutOfBoundsException: 2820
    at org.apache.parquet.io.api.Binary.hashCode(Binary.java:489)
    at org.apache.parquet.io.api.Binary.access$100(Binary.java:34)
    at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.hashCode(Binary.java:382)
    at org.apache.parquet.it.unimi.dsi.fastutil.objects.Object2IntLinkedOpenHashMap.getInt(Object2IntLinkedOpenHashMap.java:587)
    at org.apache.parquet.column.values.dictionary.DictionaryValuesWriter$PlainBinaryDictionaryValuesWriter.writeBytes(DictionaryValuesWriter.java:235)
    at org.apache.parquet.column.values.fallback.FallbackValuesWriter.writeBytes(FallbackValuesWriter.java:162)
    at org.apache.parquet.column.impl.ColumnWriterV1.write(ColumnWriterV1.java:203)
    at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.addBinary(MessageColumnIO.java:347)
    at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:257)
    at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:167)
    at org.apache.parquet.avro.AvroWriteSupport.writeRecord(AvroWriteSupport.java:149)
    at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:262)
    at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:167)
    at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:142)
    at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121)
    at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:123)
    at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:42)
    at org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat$LazyRecordWriter.write(LazyOutputFormat.java:115)
    at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.write(MultipleOutputs.java:457)
    at com.visa.dps.mapreduce.logger.LoggerMapper.map(LoggerMapper.java:271)
12-04-2015
10:16 AM
Hello All, We have a Java map reduce application that reads binary files, does some data processing, and converts them to Avro data. Currently we have two Avro schemas and use the AvroMultipleOutputs class to write to multiple locations based on the schema. After some research we found that it would be beneficial if we could store the data as Parquet. What is the best way to do this? Should I change the native map reduce job to convert from Avro to Parquet, or is there some other utility I can use? Thanks, Nishan
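One approach that keeps the existing MapReduce job largely intact is to swap the output format for parquet-avro's AvroParquetOutputFormat, so the tasks keep emitting the same Avro records and the conversion happens at write time. A rough sketch; the package prefix (parquet.avro vs org.apache.parquet.avro) depends on the Parquet version bundled with the CDH release:

import org.apache.avro.Schema;
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.avro.AvroParquetOutputFormat;

public class AvroToParquetDriver {
    // The Avro schema drives the Parquet schema; tasks write (null, record) pairs.
    public static void configure(Job job, Schema avroSchema) {
        job.setOutputFormatClass(AvroParquetOutputFormat.class);
        AvroParquetOutputFormat.setSchema(job, avroSchema);
        // In the task: context.write(null, genericRecord);  // key type is Void
    }
}

For the two-schema case this would be combined with MultipleOutputs, along the lines described in the 12-04-2015 reply.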
11-23-2015
09:24 PM
Can you get the trace from that particular Solr replica's logs? If your tlogs are large, it could be replaying them, but then it would normally be in a recovery state. Are your tlogs corrupt?
11-23-2015
02:49 PM
The issue in my case was that I was not closing the AvroMultipleOutputs instance in the mapper. The combination of LazyOutputFormat and closing the AvroMultipleOutputs instance in the mapper fixed the issue for me.
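A skeleton of what that fix looks like in practice; the named output "typeA", the input key/value types, and the parsing step are illustrative:

import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroMultipleOutputs;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RouterMapper
        extends Mapper<LongWritable, Text, AvroKey<GenericRecord>, NullWritable> {

    private AvroMultipleOutputs amos;

    @Override
    protected void setup(Context context) {
        amos = new AvroMultipleOutputs(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        GenericRecord record = toRecord(value); // placeholder conversion
        amos.write("typeA", new AvroKey<GenericRecord>(record), NullWritable.get());
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Without this close(), buffered Avro blocks are never flushed and the
        // named-output files come out empty.
        amos.close();
    }

    private GenericRecord toRecord(Text value) {
        return null; // real parsing goes here
    }
}

// Driver side, so that empty default part files are not created for tasks
// that emit nothing:
//   org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat
//       .setOutputFormatClass(job, org.apache.avro.mapreduce.AvroKeyOutputFormat.class);
//   AvroMultipleOutputs.addNamedOutput(job, "typeA",
//       org.apache.avro.mapreduce.AvroKeyOutputFormat.class, schemaA);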
09-10-2015
03:19 PM
Hello Harsh, I tried the same for AvroMultipleOutputs files and it still generates empty Avro files. Should something be done in addition when we are using AvroMultipleOutputs? I am using Avro 1.7.7 and CDH 5.4. Please let me know if you have faced this issue. Thanks, Nishanth
02-12-2015
01:06 PM
I would suggest using a curl command to add the replica rather than the Solr UI. This is the wiki reference on how you can do that using the Collections API: https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api_addreplica You may want to do this during a quiet time, as it puts more I/O on the system; again, it depends on what your cluster environment and index size look like. Thanks, Nishan
09-02-2014
03:32 PM
Hey Senthi, Can you restart your Cloudera SCM agents and try again? Thanks, Nisha
07-07-2014
02:05 PM
Hi, Is there a way to find the number of HLogs that are yet to be replicated? I suspect that my HBase replication is slow. Please let me know if there is a way to check this.
05-30-2014
03:35 PM
Hi, Thank you guys. I set up my email server, and on sending a test alert I am getting the exception below. On netstat I see that the port is being used by some process. Can someone help?
2014-05-30 16:30:48,215 WARN org.apache.camel.impl.DefaultPollingConsumerPollStrategy: Consumer Consumer[event://hostname:7184?eventStoreHttpPort=7185&eventsQueryTimeoutMillis=60000] could not poll endpoint: event://hostname:7184?eventStoreHttpPort=7185&eventsQueryTimeoutMillis=60000 caused by: java.net.ConnectException: Connection refused
org.apache.avro.AvroRemoteException: java.net.ConnectException: Connection refused
    at org.apache.avro.ipc.specific.SpecificRequestor.invoke(SpecificRequestor.java:88)
    at com.sun.proxy.$Proxy9.queryEvents(Unknown Source)
    at com.cloudera.cmf.event.query.AvroEventStoreQueryProxy.doQuery(AvroEventStoreQueryProxy.java:160)
    at com.cloudera.enterprise.alertpublisher.component.EventStoreConsumer.poll(EventStoreConsumer.java:167)
    at org.apache.camel.impl.ScheduledPollConsumer.run(ScheduledPollConsumer.java:97)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
    at sun.net.www.http.HttpClient.(HttpClient.java:211)
    at sun.net.www.http.HttpClient.New(HttpClient.java:308)
    at sun.net.www.http.HttpClient.New(HttpClient.java:326)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)
    at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1091)
    at org.apache.avro.ipc.HttpTransceiver.writeBuffers(HttpTransceiver.java:71)
    at org.apache.avro.ipc.Transceiver.transceive(Transceiver.java:58)
    at org.apache.avro.ipc.Transceiver.transceive(Transceiver.java:72)
    at org.apache.avro.ipc.Requestor.request(Requestor.java:147)
    at org.apache.avro.ipc.Requestor.request(Requestor.java:101)
    at org.apache.avro.ipc.specific.SpecificRequestor.invoke(SpecificRequestor.java:72)