Member since
10-01-2015
3933
Posts
1150
Kudos Received
374
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3649 | 05-03-2017 05:13 PM |
| | 3004 | 05-02-2017 08:38 AM |
| | 3261 | 05-02-2017 08:13 AM |
| | 3215 | 04-10-2017 10:51 PM |
| | 1680 | 03-28-2017 02:27 AM |
03-18-2016
06:46 PM
1 Kudo
@Sunile Manjee this is the tool I was looking for to help you https://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/replication/regionserver/ReplicationSyncUp.html
03-18-2016
06:22 PM
you can also identify inconsistencies using this tool https://hbase.apache.org/book.html#_verifying_replicated_data
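For context, the tool behind that link is the VerifyReplication MapReduce job shipped with HBase. A command sketch, where `<peerId>` and `<tableName>` are placeholders for your replication peer id and table:

```
hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication <peerId> <tableName>
```

The job scans the source table and compares each row against the peer cluster, reporting matches and mismatches via GOODROWS/BADROWS counters.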
03-18-2016
06:21 PM
1 Kudo
what version of HBase is it? Here's a sync tool introduced in 1.2 https://issues.apache.org/jira/browse/HBASE-13639
03-18-2016
06:13 PM
@Benjamin Leonhardi it doesn't matter whether the table is text or ORC; percentage sampling with TABLESAMPLE is not working. @gopal is this a bug?

```
hive> SELECT * FROM medicare_part_b.medicare_part_b_2013_orc TABLESAMPLE(20 PERCENT);
FAILED: SemanticException 1:67 Percentage sampling is not supported in org.apache.hadoop.hive.ql.io.HiveInputFormat. Error encountered near token '20'
hive> SELECT * FROM medicare_part_b.medicare_part_b_2013_text TABLESAMPLE(20 PERCENT);
FAILED: SemanticException 1:68 Percentage sampling is not supported in org.apache.hadoop.hive.ql.io.HiveInputFormat. Error encountered near token '20'
hive> SELECT * FROM medicare_part_b.medicare_part_b_2013_raw TABLESAMPLE(20 PERCENT);
FAILED: SemanticException 1:67 Percentage sampling is not supported in org.apache.hadoop.hive.ql.io.HiveInputFormat. Error encountered near token '20'
```
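As the error says, percentage sampling is only supported with CombineHiveInputFormat, not the default HiveInputFormat. Two possible workarounds, sketched against the table names above (the bucket count in the second query is illustrative):

```sql
-- Option 1: switch the input format so PERCENT sampling is accepted
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SELECT * FROM medicare_part_b.medicare_part_b_2013_orc TABLESAMPLE(20 PERCENT);

-- Option 2: bucket sampling works regardless of the input format;
-- ON rand() samples without requiring a bucketed table
SELECT * FROM medicare_part_b.medicare_part_b_2013_orc
TABLESAMPLE(BUCKET 1 OUT OF 5 ON rand());
```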
03-18-2016
04:21 PM
@sivasaravanakumar k I don't see hadoop included. Why not use Maven instead of including each jar manually? Take a look at my example https://github.com/dbist/HBaseNewApi.git, specifically pom.xml; that's all you need.
03-18-2016
02:34 PM
1 Kudo
@sivasaravanakumar k you need to include hbase-client and hadoop-client in your dependencies. What version of HBase and Hadoop are you using?
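A minimal pom.xml dependency snippet for those two artifacts; the version numbers below are assumptions, so adjust them to match your cluster:

```xml
<dependencies>
  <dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>1.1.2</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.1</version>
  </dependency>
</dependencies>
```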
03-18-2016
03:00 AM
excellent, please accept the best answer.
03-18-2016
02:06 AM
Here's a full script; piggybank.jar is in both the pig-client/lib and pig-client directories.

```
REGISTER /usr/hdp/current/pig-client/piggybank.jar;
A = LOAD 'data2' USING PigStorage() as (url, count);
fs -rm -R output;
STORE A INTO 'output' USING org.apache.pig.piggybank.storage.MultiStorage('output', '0');
```

My dataset is:

```
1
2
3
4
5
```

The output would be:

```
-rw-r--r--   3 root hdfs          3 2016-03-18 01:51 /user/root/output/1/1-0,000
Found 1 items
-rw-r--r--   3 root hdfs          3 2016-03-18 01:51 /user/root/output/2/2-0,000
Found 1 items
-rw-r--r--   3 root hdfs          3 2016-03-18 01:51 /user/root/output/3/3-0,000
Found 1 items
-rw-r--r--   3 root hdfs          3 2016-03-18 01:51 /user/root/output/4/4-0,000
Found 1 items
-rw-r--r--   3 root hdfs          3 2016-03-18 01:51 /user/root/output/5/5-0,000
-rw-r--r--   3 root hdfs          0 2016-03-18 01:51 /user/root/output/_SUCCESS
```

and each file contains one line:

```
[root@sandbox ~]# hdfs dfs -cat /user/root/output/5/5-0,000
5
```

In the case of @Rich Raposa's example, the output directory would look like this:

```
[root@sandbox ~]# hdfs dfs -ls output3
Found 6 items
-rw-r--r--   3 root hdfs          0 2016-03-18 01:59 output3/_SUCCESS
-rw-r--r--   3 root hdfs          3 2016-03-18 01:59 output3/part-v003-o000-r-00000
-rw-r--r--   3 root hdfs          3 2016-03-18 01:59 output3/part-v003-o000-r-00001
-rw-r--r--   3 root hdfs          3 2016-03-18 01:59 output3/part-v003-o000-r-00002
-rw-r--r--   3 root hdfs          3 2016-03-18 01:59 output3/part-v003-o000-r-00003
-rw-r--r--   3 root hdfs          3 2016-03-18 01:59 output3/part-v003-o000-r-00004
```

which means that with PARALLEL it creates multiple files within the same directory, while MultiStorage creates a separate directory and file per key. Additionally, with MultiStorage you can pass a compression codec (granted, only bz2 and gz, no snappy) and a delimiter. It's clunky and the documentation is not the best, but if you need that type of control, it's an option.
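For reference, MultiStorage's constructor takes the compression codec and field delimiter as its third and fourth arguments; a sketch of the compressed variant (the tab delimiter here is an assumption, not from the thread):

```
STORE A INTO 'output'
  USING org.apache.pig.piggybank.storage.MultiStorage('output', '0', 'bz2', '\\t');
```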
03-17-2016
05:33 PM
Yep, Java is the way to go. Try MapReduce with Java, it's not too bad: https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
03-17-2016
05:29 PM
@Marco Lanza you can take a look at Hadoop Pipes or Hadoop Streaming to leverage a language other than Java, but if you plan to learn MapReduce on the Hortonworks platform, I'd invest in Java. There's also http://www.cascading.org/, and there are a couple of higher-level languages like Apache Pig and Apache Hive that have a smaller learning curve. You can also look at Apache Spark, as that's where the Big Data industry is going, and it has multi-language support, including C#: http://research.microsoft.com/en-us/projects/spark-clr/. Hadoop Streaming reference: https://hadoop.apache.org/docs/current/hadoop-streaming/HadoopStreaming.html
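To give a feel for Hadoop Streaming, here is a minimal word-count mapper sketch in Python; the file name and the job wiring below are illustrative assumptions, not from this thread:

```python
#!/usr/bin/env python
# wc_mapper.py - a Hadoop Streaming mapper: reads text lines on stdin
# and emits tab-separated (word, 1) pairs on stdout.
import sys

def map_lines(lines):
    """Yield 'word\t1' for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

if __name__ == "__main__":
    for pair in map_lines(sys.stdin):
        print(pair)
```

You would wire it into a job with something along the lines of `hadoop jar hadoop-streaming.jar -input in -output out -mapper wc_mapper.py -reducer aggregate` (exact jar path and options depend on your distribution).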