Member since: 08-15-2016
Posts: 189
Kudos Received: 63
Solutions: 22

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 5664 | 01-02-2018 09:11 AM
 | 3003 | 12-04-2017 11:37 AM
 | 2146 | 10-03-2017 11:52 AM
 | 21573 | 09-20-2017 09:35 PM
 | 1603 | 09-12-2017 06:50 PM
01-27-2017
09:02 AM
1 Kudo
@Jacqualin jasmin Please try this from within the Beeline client:
0: jdbc:hive2://> !run /tmp/test.hql
The file does not need to be local to HiveServer2; it needs to exist on the node where you run Beeline. Also check:
0: jdbc:hive2://> !help
for many more useful special commands in Beeline.
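If you prefer to run the same file non-interactively, a minimal sketch from the shell (the JDBC URL and user below are placeholders for your own HiveServer2 connection details):

# execute an HQL script file from the node where beeline runs
beeline -u "jdbc:hive2://your-hs2-host:10000/default" -n your_user -f /tmp/test.hql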
01-27-2017
08:38 AM
@Reddy Please don't forget to mark the question as answered if it is answered.
01-26-2017
06:38 PM
@Joshua Petree Don't forget to mark the question as answered, if it is answered.
01-25-2017
09:34 PM
Yes, just enter those OS-level paths ( /mnt/data1,/mnt/data2,/mnt/data3 ) as a comma-separated value in the box for the DataNode data directories ('dfs.datanode.data.dir') on the HDFS config page in Ambari. HDFS is just a logical layer on top of the OS-level filesystem, so you simply hand Ambari/Hadoop the locations on the native OS filesystem where HDFS should 'host' its data.
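To verify what value the DataNodes actually picked up after saving the config and restarting, one quick check (run on a DataNode host with the HDFS client configured):

# print the effective value of the DataNode data directories
hdfs getconf -confKey dfs.datanode.data.dir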
01-25-2017
03:15 PM
1 Kudo
@Joshua Petree It is doable, no problem. You would have to mount the 4 disks to the OS anyway. So mount the OS disk to / and the other 3 HDDs to /hadoop/hdfs/data1, /hadoop/hdfs/data2 and /hadoop/hdfs/data3. In Ambari you can then set these OS-level local folders to be used as HDFS storage, as in the screenshot: property = 'dfs.datanode.data.dir'
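For the mounting step itself, a minimal sketch (the device names /dev/sdb1, /dev/sdc1 and /dev/sdd1 are assumptions; substitute your own, and add matching /etc/fstab entries so the mounts survive a reboot):

# create the mount points and mount the three data disks
mkdir -p /hadoop/hdfs/data1 /hadoop/hdfs/data2 /hadoop/hdfs/data3
mount /dev/sdb1 /hadoop/hdfs/data1
mount /dev/sdc1 /hadoop/hdfs/data2
mount /dev/sdd1 /hadoop/hdfs/data3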
01-25-2017
02:57 PM
@Priyan S Maybe it is because you did not set up preemption on YARN? Without preemption, the order in which the jobs were submitted to Q1, Q2 and Q41 determines the capacity allocations. It may be that since the jobs on Q1 and Q2 were submitted first, they both grabbed the max allocation of their respective queues (40%). When the job for Q41 comes along, there is no more than the remaining 20% left for Q4, and/or 100% of that for any of its sub-leafs Q41, Q42, Q43, Q44. I don't get why Q41 is only getting 10% and not 20%, though.

You can look upon preemption as a way to help restore the state in which all queues get at least their minimum configured allocation, even though the missing part for a queue A operating under its minimum might be in use by another queue B operating above its minimum (B grabbed the excess capacity, up to its maximum, because it was still available at that time). Without preemption, queue A would have to wait for queue B to release capacity as jobs in B finish. With preemption, YARN will actively free up resources of B to allocate them to A, and in the process it might even kill job parts in queue B to do so.
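A quick way to check whether preemption is enabled at all (a sketch only, assuming the stock HDP config location /etc/hadoop/conf rather than going through the Ambari UI):

# preemption requires the scheduler monitor to be enabled and the
# ProportionalCapacityPreemptionPolicy to be listed among the monitor policies
grep -A1 'yarn.resourcemanager.scheduler.monitor.enable' /etc/hadoop/conf/yarn-site.xml
grep -A1 'yarn.resourcemanager.scheduler.monitor.policies' /etc/hadoop/conf/yarn-site.xml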
01-25-2017
02:31 PM
1 Kudo
In my opinion it is best to still regard Hive as an analytical DB. With the ACID (updates) and streaming features, the community is stretching the tool to things it wasn't designed for. These are not to be used at very large scale or under very heavy loads; ACID and streaming will put tremendous strain on the Hive metastore. In the end, the native storage model of Hive is still based on streaming through whole HDFS files, even with ORC. Without true indexes, Hive will never be a really good match for highly transactional workloads. Doing large analytical sweeps/scans through data is still at odds with high-speed random read/write/update/delete. But that is not bad; there are just other components in HDP that do those other jobs right.
01-25-2017
01:42 PM
1 Kudo
@Reddy Because it is an external table there is no one-liner to do it. That is probably the whole point of having external tables. So you need to do:
ALTER TABLE some.table DROP PARTITION (part='some') PURGE;
and
hdfs dfs -rm -R /path/to/table/basedir
I put the PURGE in there intentionally. It would remove the data for non-external (managed) tables, but it just does not for external tables, which is why the hdfs dfs -rm is still needed.
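Put together, a minimal sketch from the shell (the JDBC URL, table name and partition path are placeholders; it also assumes the partition data lives in its own part=some subdirectory under the table's base directory):

# drop the partition from the metastore...
beeline -u "jdbc:hive2://your-hs2-host:10000/default" -e "ALTER TABLE some.table DROP PARTITION (part='some') PURGE;"
# ...then remove the underlying HDFS data yourself, since the table is external
hdfs dfs -rm -R /path/to/table/basedir/part=some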
01-25-2017
12:02 PM
With the help of the remarks by @Aaron Dossett I found a solution to this. Knowing that Storm does not mark the HDFS file currently being written to, and that .addRotationAction is not robust enough in extreme cases, I turned to a low-level solution. HDFS can report the files on a path that are open for write:
hdfs fsck <storm_hdfs_state_output_path> -files -openforwrite
or alternatively you can list only the NON-open files on a path:
hdfs fsck <storm_hdfs_state_output_path> -files
The output is quite verbose, but you can use sed or awk to get the closed/completed files from there. (The Java HDFS API has similar hooks; this is just the CLI-level solution.)
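For example, a rough filter (a sketch only; the path is a placeholder and it assumes the usual fsck -files output format, where each file line contains 'bytes,' and open files are additionally tagged with OPENFORWRITE):

# list completed (closed) files only, by dropping the lines marked OPENFORWRITE
hdfs fsck /storm/output/path -files -openforwrite | grep ' bytes, ' | grep -v OPENFORWRITE | awk '{print $1}'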
01-17-2017
12:03 PM
@Arun Mahadevan @Aaron Dossett @Sriharsha Chintalapani I am kind of confused right now, so let me rephrase what I got so far in my own words: whereas Trident can have strong exactly-once semantics for persisting stream aggregates and for tuples making it into any HDFS file, the action of rotating the file itself is not protected by these same strong guarantees? Or is the rotation protected by exactly-once, but not the .addRotationAction attached to it? It is just not clear in the documentation: https://github.com/apache/storm/tree/master/external/storm-hdfs#hdfs-bolt-support-for-trident-api

Suppose the file rotation is exactly-once; then it could work to set the sync policy to the exact same size limit as the size-based rotation policy. That way the files would only become visible to HDFS clients (synced) once that size limit is met.