Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 7357 | 08-12-2016 01:02 PM |
| | 2708 | 08-08-2016 10:00 AM |
| | 3672 | 08-03-2016 04:44 PM |
| | 7211 | 08-03-2016 02:53 PM |
| | 1863 | 08-01-2016 02:38 PM |
04-07-2016
03:56 PM
@hoda moradi This is slightly unrelated, but the following might be useful: I wrote a sample application that does some parsing and processing in Spark: https://community.hortonworks.com/articles/25726/spark-streaming-explained-kafka-to-phoenix.html
04-07-2016
01:38 PM
Is that really true? You can use the LDAP interface for all frontends, and I think you can even do Linux->AD integration using LDAP as well: https://technet.microsoft.com/en-us/magazine/2008.12.linux.aspx#id0060006 I totally agree with you that this doesn't make much sense, but theoretically it should work, I would think.
04-07-2016
11:53 AM
1 Kudo
There are lots of resources out there, for example: http://info.hortonworks.com/rs/h2source/images/Hadoop-Data-Lake-white-paper.pdf

The main question is what you want to do consumption-wise. Hive has progressed a lot with Tez, ORC, predicate pushdown, a cost-based optimizer, etc., and should easily be able to outdo an Oracle DW for heavy analytical queries, unless you have a really expensive Exadata system (and even then it depends on the queries). It would be great, for example, for running daily/weekly aggregations. However, it will not yet be able to compete with Oracle for smaller interactive queries (LLAP is coming), and it will be hard to connect a reporting tool directly to it for interactive queries, for example.

So the question is what you plan to do with the warehouse. Do you connect to it directly with reporting software that runs in interactive mode? (Hive is not good for that yet, although Phoenix might be.) Or do you mostly use it for aggregation queries, cube generation, etc. (for example, creating a daily Tableau report that is then provided to users)? In that case you might be able to get rid of your warehouse entirely. (Some tools like Platfora provide a neat combination.)

Of course an Oracle warehouse provides some other advantages, like UPDATE/DELETE capabilities (Hive has ACID now, but it's pretty new), stored procedures, foreign-key relationships (though these are often disabled for big loads), integration with most common backup tools, etc. So there may be other reasons to keep a warehouse around as the main store. In that case you would go with one of the following scenarios:

- Data warehouse offloading: use Hadoop for workloads like heavy aggregations that take up too much of the warehouse's capacity and are better done in Hadoop.
- Landing zone: use Hadoop as the initial data landing and transformation zone before data goes into the warehouse. Utilize Hadoop's ability to store data in any format and to keep large amounts of data around.
- Hot archiving: archive old data into Hadoop so it is still accessible.
- Advanced analytics: run things like Spark ML or R on Hadoop that may not be possible in the warehouse, either technically or because of cost.
- Unstructured analytics: augment your warehouse with unstructured data (emails, social media data, ...) in Hadoop.
- Real-time ingestion: using Flume, Kafka, Spark Streaming, Storm.

So, summarizing, all of these approaches are valid:

- Source -> Hadoop (if you can completely replace the warehouse)
- Source -> Hadoop -> Warehouse (with Hadoop as an ETL landing zone, which gives you advantages like keeping source data around if desired, and reducing the need for massive ETL installations; you might still have them, but many can push heavy computations into Hadoop and so could be smaller)
- Source -> Warehouse -> Hadoop (warehouse offloading, archival), which has the advantage that you can keep the existing environment while reducing pressure on your warehouse.

Hope that helps.
04-06-2016
11:42 AM
No idea about the empty lines, sorry; I don't get those lines, as you can see. Just delete them with sed? (Search for "delete empty lines linux".) To add a string to the top of a file, you could also just do it from the outside with plain Linux, i.e.:

echo "The count is:" > out.txt
beeline .... >> out.txt

The double ">>" appends to the file.
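As a minimal sketch of the redirect-and-append pattern above (using printf to stand in for the actual beeline invocation, since beeline and the query are specific to your environment):

```shell
# Write the header line; a single ">" creates or truncates out.txt
echo "The count is:" > out.txt

# In the real case this step would be something like:
#   beeline -u jdbc:hive2://localhost:10000/default -e "select count(*) from t" >> out.txt
# Here printf stands in for the query output:
printf '42\n' >> out.txt

# ">>" appended, so the file now has the header followed by the output
cat out.txt
# prints:
# The count is:
# 42
```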
04-05-2016
06:22 PM
That article might help too: https://community.hortonworks.com/content/kbentry/25726/spark-streaming-explained-kafka-to-phoenix.html The GitHub link has a complete explanation of how to create the Kafka topic, start a producer for testing, etc.
04-05-2016
10:25 AM
How about this?

beeline -u jdbc:hive2://localhost:10000/default --silent=true --outputformat=csv2 -e "select * from sample_07 limit 10" > out.txt

[root@sandbox ~]# cat out.txt
sample_07.code,sample_07.description,sample_07.total_emp,sample_07.salary
00-0000,All Occupations,134354250,40690
11-0000,Management occupations,6003930,96150
11-1011,Chief executives,299160,151370
11-1021,General and operations managers,1655410,103780
11-1031,Legislators,61110,33880
11-2011,Advertising and promotions managers,36300,91100
11-2021,Marketing managers,165240,113400
11-2022,Sales managers,322170,106790
11-2031,Public relations managers,47210,97170
11-3011,Administrative services managers,239360,76370
04-05-2016
10:18 AM
So CASCADE works because it forces the deletion of all objects belonging to that object (similar to DELETE ... CASCADE for row deletes). Now the question is why your DROP FUNCTION did not work, and I don't know; we might have to look into the logs to figure that out. I have seen flakiness with functions in Hive before on an older version, so it might just be a bug, or a restart might be required, or something like that. But again, without logs it's hard to say.
04-05-2016
07:36 AM
1 Kudo
What does it say when you try to drop these functions? Just silent? Anything in the logs? You can also try the DROP ... CASCADE command:

DROP DATABASE IF EXISTS userdb CASCADE;
04-04-2016
04:37 PM
Or distcp/Falcon. It also depends on how you load the data; if that is well defined, they could just duplicate the load. NiFi would come in, in my opinion, in very specific scenarios where you have control over the data source. There is also WANdisco, but that would be a big change.
04-04-2016
03:07 PM
1 Kudo
Aaah, really the exact same datanodes? So not even two different data folders and configs, but the very same datanode with the same blocks pointing to two different namenodes? How would you expect that to work with file changes? Would they be merged? Or do you want the same HDFS, period, monitored by two Ambaris? I fail to see how any of that could even logically be possible. How about just implementing two queues instead: a research queue with a percentage of the cluster, and perhaps a folder with a quota as well? That sounds like about the same result.
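As a rough sketch of the two-queue idea, assuming the YARN Capacity Scheduler and hypothetical queue names `default` and `research` (the 70/30 split is just an example), capacity-scheduler.xml could contain something like:

```xml
<!-- Hypothetical example: split the cluster between a default and a research queue -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,research</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.research.capacity</name>
  <value>30</value>
</property>
```

For the folder side, an HDFS space quota on the research directory (e.g. via `hdfs dfsadmin -setSpaceQuota`) would give a similar storage cap.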