08-08-2016
04:54 PM
2 Kudos
UPDATE: I'm happy to report that my patch for PIG-4931 was accepted and merged to trunk.

I was browsing through the Apache Pig JIRAs and stumbled on https://issues.apache.org/jira/browse/PIG-4931, which asks for documentation of the Pig "IN" operator. It turns out Pig has had the IN operator since the days of 0.12 and no one had a chance to document it yet. The associated JIRA is https://issues.apache.org/jira/browse/PIG-3269. In this short article I will go over the IN operator; until I'm able to submit a patch to close out the ticket, this should serve as its documentation.

The IN operator in Pig works like IN in SQL: you provide a list of values and the filter returns only the rows whose field matches one of them. It is a lot more convenient than chaining comparisons with OR, for example:

a = LOAD '1.txt' USING PigStorage(',') AS (i:int);
b = FILTER a BY
(i == 1) OR
(i == 22) OR
(i == 333) OR
(i == 4444) OR
(i == 55555);

You can rewrite the same statement as:

a = LOAD '1.txt' USING PigStorage(',') AS (i:int);
b = FILTER a BY i IN (1,22,333,4444,55555);

The best thing about it is that it accepts more than just integers: you can pass float, double, BigDecimal, BigInteger, bytearray and String (chararray) values. Let's review several of these in detail, using the following sample data:

grunt> fs -cat data;
1,Christine,Romero,Female
2,Sara,Hansen,Female
3,Albert,Rogers,Male
4,Kimberly,Morrison,Female
5,Eugene,Baker,Male
6,Ann,Alexander,Female
7,Kathleen,Reed,Female
8,Todd,Scott,Male
9,Sharon,Mccoy,Female
10,Evelyn,Rice,Female

Passing an integer to the IN clause:

A = load 'data' using PigStorage(',') AS (id:int, first:chararray, last:chararray, gender:chararray);
X = FILTER A BY id IN (4, 6);
dump X;
(4,Kimberly,Morrison,Female)
(6,Ann,Alexander,Female)

Passing a String:

A = load 'data' using PigStorage(',') AS (id:chararray, first:chararray, last:chararray, gender:chararray);
X = FILTER A BY id IN ('2', '4', '8');
dump X;
(2,Sara,Hansen,Female)
(4,Kimberly,Morrison,Female)
(8,Todd,Scott,Male)

Passing a ByteArray:

A = load 'data' using PigStorage(',') AS (id:bytearray, first:chararray, last:chararray, gender:chararray);
X = FILTER A BY id IN ('1', '9');
dump X;
(1,Christine,Romero,Female)
(9,Sharon,Mccoy,Female)

Passing a BigInteger and using the NOT operator, which negates the IN clause so that only rows whose id is not in the list are returned:

A = load 'data' using PigStorage(',') AS (id:biginteger, first:chararray, last:chararray, gender:chararray);
X = FILTER A BY NOT id IN (1, 3, 5, 7, 9);
dump X;
(2,Sara,Hansen,Female)
(4,Kimberly,Morrison,Female)
(6,Ann,Alexander,Female)
(8,Todd,Scott,Male)
(10,Evelyn,Rice,Female)

Now, I understand that most cool kids these days are using Spark, but I strongly believe Pig has a place in any Big Data stack and that its livelihood depends on comprehensive and complete documentation. Happy learning!
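
If you want to sanity-check the expected rows without a Pig installation, the following is a minimal shell sketch that mimics the integer IN filter and the NOT IN filter with awk. It is not Pig and says nothing about how Pig evaluates IN internally; the helper names sample_data, keep_ids and drop_ids are hypothetical and exist only for this illustration. The rows are copied from the sample data file shown above.

```shell
#!/bin/sh
# Rows copied from the article's sample 'data' file.
sample_data() {
cat <<'EOF'
1,Christine,Romero,Female
2,Sara,Hansen,Female
3,Albert,Rogers,Male
4,Kimberly,Morrison,Female
5,Eugene,Baker,Male
6,Ann,Alexander,Female
7,Kathleen,Reed,Female
8,Todd,Scott,Male
9,Sharon,Mccoy,Female
10,Evelyn,Rice,Female
EOF
}

# keep_ids "4 6": print rows whose first field is in the list,
# roughly what FILTER A BY id IN (4, 6) returns. Hypothetical helper.
keep_ids() {
  awk -F, -v ids="$1" '
    BEGIN { n = split(ids, list, " "); for (k = 1; k <= n; k++) want[list[k]] = 1 }
    ($1 in want)'
}

# drop_ids "1 3 5 7 9": print rows whose first field is NOT in the list,
# roughly what FILTER A BY NOT id IN (1, 3, 5, 7, 9) returns. Hypothetical helper.
drop_ids() {
  awk -F, -v ids="$1" '
    BEGIN { n = split(ids, list, " "); for (k = 1; k <= n; k++) skip[list[k]] = 1 }
    !($1 in skip)'
}

sample_data | keep_ids "4 6"
sample_data | drop_ids "1 3 5 7 9"
```

Comparing this output with the result of dump X in grunt is a quick way to confirm that a filter list was typed correctly.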
07-16-2016
02:08 AM
5 Kudos
Update: Apparently, when you initiate a support case resolution capture, say for the HBase service, it pulls the HDFS NameNode logs in addition to the HBase logs. You may face the same issue and may have to apply the approach below to overcome timeouts. In SmartSense 1.3.0 this will no longer be an issue. Until then, this is a way to avoid capture timeouts.

First, let's discuss the difference between a capture for analysis and a capture for support case resolution. Analysis bundles do not collect service logs. For support cases, you fetch both configuration and logs. Depending on how much anonymization you apply, large log files can take a long time to collect. This is especially pronounced with HDFS NameNode logs: they tend to be big, and this is exactly the scenario we're trying to address.

Start by increasing the agent timeout threshold in Ambari. In my case it was 30 minutes. Feel free to raise it up to 2 hours on the Ambari SmartSense Operations page.

Then, we're going to exclude anything but the hadoop-hdfs-namenode-*.log logs. That leaves the .out, .out.* and .log.* files out of the collection. On the HST server host (HST is synonymous with SmartSense), go to the /var/lib/smartsense/hst-agent/resources/scripts directory. Notice that we're accessing the hst-agent directory, not hst-server: the collection scripts live on the agent hosts, not on the HST server. Edit the hdfs-scripts.xml file and go to line 100; it may be off by 10 lines or so depending on which version of SmartSense you're running. On 1.2.2, it is line 100. Change the following lines:

if [ `hostname -f` == "${MASTER}" ] && [ `echo "${SLAVES}" | grep -o ',' | wc -l` -gt 1 ] ; then
find $LOG 2>/dev/null -type f -mtime -2 -iname '*' -exec cp '{}' ${outputdir} \;
find $LOG 2>/dev/null -type f -mtime -2 -iname '*' -exec cp '{}' ${outputdir} \;
else
for file in `find $LOG 2>/dev/null -type f -mtime -2 -iname '*' ;
find $LOG 2>/dev/null -type f -mtime -2 -iname '*' ; `
to:

if [ `hostname -f` == "${MASTER}" ] && [ `echo "${SLAVES}" | grep -o ',' | wc -l` -gt 1 ] ; then
# find $LOG 2>/dev/null -type f -mtime -2 -iname '*' -exec cp '{}' ${outputdir} \;
find $LOG 2>/dev/null -type f -mtime -2 -iname '*.log' -exec cp '{}' ${outputdir} \;
else
for file in `find $LOG 2>/dev/null -type f -mtime -2 -iname '*.log' ;
find $LOG 2>/dev/null -type f -mtime -2 -iname '*.log' ; `

The difference is hard to see: we commented out the first find command, and in the second find command we replaced '*' with '*.log'. We then did the same in the for loop and again in the last find command. In short, replace every remaining occurrence of '*' with '*.log'.

As the last step, restart the SmartSense service and agents to propagate the changes to every agent. We only care about the NameNode hosts, but depending on the service and host components, I don't see why we couldn't restart all of them.

One other thing I'd like to point out is that the same directory, /var/lib/smartsense/hst-agent/resources/scripts, contains scripts for other services, so you can apply the same steps to any other service. Granted, this is a pretty narrow corner case, but when you're investigating a high severity issue and have no means of uploading logs besides going at it the hard way, this may be a good approach.

Finally, let's verify the approach. Go to the SmartSense view and initiate a capture. When the capture is complete, go to the SmartSense server node and navigate to the local storage directory. In that directory you will find your latest bundle; uncompress it and cd into the new directory. There you will find another compressed file; uncompress that as well. Then cd into the resulting directory and into the services directory. At this point you will see the various services. We care about HDFS: go inside it and then into the logs directory. There you will find your *.log files.

I want to highlight the fact that this is a hack; use it at your own risk. At the very least, notify your support engineer of the approach. I'd like to thank @Paul Codding and @sheetal for showing me the inner workings of SmartSense. Your feedback is welcome.
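
To see what the pattern change does before touching hdfs-scripts.xml, here is a small, hedged sketch that runs the same find predicates (-type f -mtime -2 -iname) against a throwaway directory and prints the matches instead of copying them. The file names and the list_matches helper are hypothetical and only stand in for a real NameNode log directory.

```shell
#!/bin/sh
# Build a scratch directory with hypothetical NameNode-style file names.
demo_dir=$(mktemp -d)
touch "$demo_dir/hadoop-hdfs-namenode-host1.log" \
      "$demo_dir/hadoop-hdfs-namenode-host1.log.1" \
      "$demo_dir/hadoop-hdfs-namenode-host1.out" \
      "$demo_dir/hadoop-hdfs-namenode-host1.out.1"

# list_matches DIR PATTERN: same predicates as the collection script,
# but the matches are printed rather than copied. Hypothetical helper.
list_matches() {
  find "$1" -type f -mtime -2 -iname "$2" 2>/dev/null | sort
}

echo "Matched by '*' (original script):"
list_matches "$demo_dir" '*'

echo "Matched by '*.log' (edited script):"
list_matches "$demo_dir" '*.log'

# Clean up the scratch directory.
rm -rf "$demo_dir"
```

With '*' all four files are listed, while '*.log' keeps only the file whose name ends in .log, which is why the rotated .log.* files and the .out files drop out of the bundle.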
05-03-2016
02:30 AM
2 Kudos
I'm a long-time user of Apache Bigtop. My experience with Hadoop and Bigtop predates Ambari. I started using Bigtop with version 0.3. I remember pulling the bigtop.repo file and installing Hadoop, Pig and Hive for some quick development. Bigtop makes it convenient and easy. Bigtop has matured since then and there are now multiple ways to deploy it. You can still pull the repo and install manually, but there are better ways now with Vagrant and Docker. I won't rehash how to deploy Bigtop using Docker as it was beautifully described here. Admittedly, I'm running it on a Mac and was not able to provision a cluster using Docker. I did not try on anything other than OS X. This post is about Vagrant. Let's get started.

Install VirtualBox and Vagrant.

Download the 1.1.0 release:

wget http://www.apache.org/dist/bigtop/bigtop-1.1.0/bigtop-1.1.0-project.tar.gz

Uncompress the tarball:

tar -xvzf bigtop-1.1.0-project.tar.gz

Change directory to bigtop-1.1.0/bigtop-deploy/vm/vagrant-puppet-vm:

cd bigtop-1.1.0/bigtop-deploy/vm/vagrant-puppet-vm

Here you can review the README. To keep it short, you can edit vagrantconfig.yaml for any additional customization, such as changing the VM memory, the OS, the number of CPUs, the components (e.g. hadoop, spark, tez, hama, solr) and the number of VMs you'd like to provision. This last part is the killer feature: you can provision a Sandbox with multiple nodes, not just a single VM. The same is true of the Docker provisioner, but I can't confirm that for you. Feel free to read the README in bigtop-1.1.0/bigtop-deploy/vm/vagrant-puppet-docker for that approach.

Then start provisioning your custom sandbox:

vagrant up

Wait 5-10 minutes, and then you can use the standard Vagrant commands to interact with your custom Sandbox:

vagrant ssh bigtop1

Now just create your local user and off you go:

sudo -u hdfs hdfs dfs -mkdir /user/vagrant
sudo -u hdfs hdfs dfs -chown -R vagrant:hdfs /user/vagrant

For your convenience, add the bigtop machine(s) to /etc/hosts.

Now, you're probably wondering why you would use Bigtop over the regular Sandbox. Well, the Sandbox has been getting pretty resource heavy and has a lot of components. I like to provision a small cluster with just a few components like hadoop, spark, yarn and pig. Bigtop makes this possible and runs easily within a memory-strapped VM. One downside is that in the latest Bigtop release Spark is at 1.5.0 while the Hortonworks Sandbox is at 1.6.0, and the story is the same with other components. There are version gaps, but if you can look past them, you have a quick way to prototype without much fuss! This is by no means meant to steal thunder from the excellent Ambari quick start guide; it is meant to demonstrate yet another approach from the rich ecosystem of Hadoop tools.
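
The steps above can be collected into one script. The following is a sketch only, under a few assumptions: the run and provision_bigtop helpers are hypothetical names, the script defaults to a dry run that just prints each command, and the two HDFS commands are passed through vagrant ssh with its -c option instead of being typed in an interactive session as in the article. The URL, directory and VM name are the ones used above.

```shell
#!/bin/sh
# run CMD...: print the command when DRY_RUN=1 (the default),
# execute it when DRY_RUN=0. Hypothetical helper for this sketch.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# provision_bigtop: the article's steps in order. Hypothetical helper.
provision_bigtop() {
  # Download and unpack the Bigtop 1.1.0 release.
  run wget http://www.apache.org/dist/bigtop/bigtop-1.1.0/bigtop-1.1.0-project.tar.gz
  run tar -xvzf bigtop-1.1.0-project.tar.gz

  # Move to the Vagrant provisioner; edit vagrantconfig.yaml here if needed.
  run cd bigtop-1.1.0/bigtop-deploy/vm/vagrant-puppet-vm

  # Provision the VM(s).
  run vagrant up

  # Create the HDFS home directory for the vagrant user on the first node.
  run vagrant ssh bigtop1 -c "sudo -u hdfs hdfs dfs -mkdir /user/vagrant"
  run vagrant ssh bigtop1 -c "sudo -u hdfs hdfs dfs -chown -R vagrant:hdfs /user/vagrant"
}

provision_bigtop
```

Run it as is to review the command sequence, then set DRY_RUN=0 once VirtualBox and Vagrant are installed. Note that the dry-run output does not preserve the quoting of the -c arguments.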
02-23-2018
11:50 AM
@Kuldeep Kulkarni Please add the "deploy JCE policies" steps as prerequisites. I tried without JCE and it fails for me. Let me know if I am missing anything.
12-06-2017
04:07 PM
https://community.hortonworks.com/articles/149910/handling-hl7-records-part-1-hl7-ingest.html
https://community.hortonworks.com/articles/149891/handling-hl7-records-and-storing-in-apache-hive-fo.html
https://community.hortonworks.com/articles/149982/hl7-ingest-part-4-streaming-analytics-manager-and.html
https://community.hortonworks.com/articles/150026/hl7-processing-part-3-apache-zeppelin-sql-bi-and-a.html

Attribute Name Cleaner (needed for messy C-CDA and HL7 attribute names): https://github.com/tspannhw/nifi-attributecleaner-processor
06-30-2017
08:05 PM
@riyer I'd avoid going against HBase with Hive. Generating a snapshot is so trivial that you should consider going that route first. On average, going against a snapshot should perform about 2.5x better than going against HBase directly.
06-20-2016
01:10 PM
Hello @Artem Ervits, to read data from WebHCat, do I have to put the data inside HDFS? Is there a way to read this data directly via the REST API? I mean the data of the table, not the metadata. Thank you so much.
02-09-2018
06:55 AM
Another fix (if you are using the 'guest' user; in my case I had a different user) could be:

su guest
sqoop import --connect jdbc:postgresql://127.0.0.1/ambari --username ambari -P --table hosts --target-dir /user/guest/ambari_hosts_table
05-06-2016
05:10 PM
Right now there are no concrete release dates. I would wait until Hadoop Summit San Jose for any announcements.