Member since: 05-18-2016
Posts: 71
Kudos Received: 39
Solutions: 6
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3440 | 12-16-2016 06:12 PM
 | 1183 | 11-02-2016 05:35 PM
 | 4582 | 10-06-2016 04:32 PM
 | 1938 | 10-06-2016 04:21 PM
 | 1686 | 09-12-2016 05:16 PM
08-12-2016
06:21 PM
1 Kudo
In Hadoop, an update is effectively a large MapReduce job: find the record(s) that need to be updated, then perform an insert and a delete. From a MapReduce perspective this is an expensive operation involving multiple levels of MapReduce. With ACID turned on, all of the above answers are correct. But you should design your data structures to be append-only, with a date/timestamp and/or a version reference marking the latest state of each record. Even though ACID supports updates, to manage performance I would recommend inserting instead of updating (more like an upsert).
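A minimal sketch of the append-only pattern described above (table, column, and key names are hypothetical, not from the original answer):

```sql
-- Hypothetical append-only table: every change is a new row stamped
-- with a load timestamp, so no UPDATE is ever issued.
CREATE TABLE customer_history (
  customer_id INT,
  name        STRING,
  status      STRING,
  load_ts     TIMESTAMP
)
STORED AS ORC;

-- Latest state per record: keep only the newest row for each key
-- (Hive window functions, available since Hive 0.11).
SELECT customer_id, name, status
FROM (
  SELECT customer_id, name, status,
         ROW_NUMBER() OVER (PARTITION BY customer_id
                            ORDER BY load_ts DESC) AS rn
  FROM customer_history
) latest
WHERE rn = 1;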
08-11-2016
03:38 PM
1 Kudo
This is a great article for anyone looking to ingest data quickly and store it in compressed formats. It will work very well for POC, testing, and sandbox activities. I used something like this and made it production-grade at a client by automating the jobs with Oozie. Once the data was loaded, we ran verification scripts to audit what came in versus what got dropped, plus clean-up scripts that removed the raw data from HDFS once it was set in Hive as compressed, partitioned ORC. With the advent of NiFi and Spark, it's worth building a NiFi processor in conjunction with Spark jobs to load the data seamlessly into Hive/HBase in compressed formats as it arrives.
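As an illustration of the kind of target table used in that flow, here is a hedged sketch of a partitioned, compressed ORC Hive table (the names and the ZLIB codec choice are assumptions, not details from the post):

```sql
-- Hypothetical landing target: partitioned by load date, stored as ORC
-- with ZLIB compression, so the raw HDFS files can be removed once
-- the rows land here.
CREATE TABLE events_orc (
  event_id BIGINT,
  payload  STRING
)
PARTITIONED BY (load_date STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZLIB");
```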
08-04-2016
03:59 PM
Here is a tutorial on using Spark to load CSV into Hive in ORC format: http://hortonworks.com/hadoop-tutorial/using-hive-with-orc-from-apache-spark/
07-29-2016
05:39 PM
@Robert Levas Thank you so much.
07-29-2016
05:12 PM
Thanks Robert, this makes sense. Does it increase the complexity of the install? Also, which versions are recommended, and are there any known issues with particular versions?
07-29-2016
04:46 PM
Labels:
- Hortonworks Data Platform (HDP)
07-25-2016
07:32 PM
2 Kudos
You would have to create the column as a DATE field instead of a STRING field, so that everything is stored with the date datatype. Once that is in place, your ingest can use a UDF as @Sunile Manjee suggested. With the data stored as a date in Hive, you can use any of the Hive date functions to present it however you prefer: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions (search for "Date Functions").
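A minimal sketch of that idea, assuming Hive 1.2+ for `date_format` and `current_date` (table and column names are hypothetical):

```sql
-- Store the value as a real DATE column rather than a STRING.
CREATE TABLE orders (
  order_id   INT,
  order_date DATE
)
STORED AS ORC;

-- Hive date functions can then reshape the value at query time.
SELECT order_id,
       year(order_date)                       AS order_year,
       date_format(order_date, 'dd-MM-yyyy')  AS display_date,
       datediff(current_date, order_date)     AS age_in_days
FROM orders;
```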
06-22-2016
02:53 PM
1 Kudo
Hi @alain
One more way, a 3-step method:

Step 1: Create an external table pointing to an HDFS location and conforming to the schema of your CSV file. You can drop the CSV file(s) into the external table location.
Step 2: Create a managed Hive table in ORC format.
Step 3: INSERT INTO the managed table SELECT FROM the external table. (Once the records are copied, delete the files from the external directory.)

This process can be automated with scripting via Oozie or cron; I have used it for mass batch ingestion. A more recent way of doing this is Apache NiFi with the Hive table processor, which makes life much simpler. :) If you want to read about NiFi, see http://hortonworks.com/products/hdf/ — a sketch of the three steps follows below. Thanks, Satish
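A minimal sketch of the three steps (the paths, table names, and columns are hypothetical):

```sql
-- Step 1: external table over the HDFS directory holding the CSV files.
CREATE EXTERNAL TABLE staging_csv (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/staging/csv';

-- Step 2: managed table in ORC format.
CREATE TABLE managed_orc (
  id   INT,
  name STRING
)
STORED AS ORC;

-- Step 3: copy the rows; afterwards the files under /data/staging/csv
-- can be deleted from HDFS.
INSERT INTO TABLE managed_orc SELECT id, name FROM staging_csv;
```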
06-17-2016
05:52 PM
Hi Jan, WebHCat is the REST interface for HCatalog, Hadoop's table and metadata management layer, so for both Pig and Hive it is HCatalog that stores the schema-related information. HiveServer2 is the actual engine that runs Hive. Please see this tutorial: http://hortonworks.com/hadoop-tutorial/how-to-use-hcatalog-basic-pig-hive-commands/ Data can be accessed via the WebHCat REST APIs, which in turn call the Hive APIs. More references: https://cwiki.apache.org/confluence/display/Hive/WebHCat+Reference and https://cwiki.apache.org/confluence/display/Hive/Hive+APIs+Overview#HiveAPIsOverview-WebHCat%28REST%29
06-16-2016
08:54 PM
Agree with Sindhu's comments; the link provides some basic setup for optimizing queries in Hive. @Roberto Sancho, can you please let us know how the table is partitioned and bucketed? If it is partitioned, please tell us whether the WHERE clause makes use of the partition columns.
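To illustrate why that matters, here is a hedged sketch of partition pruning (table and column names are hypothetical):

```sql
-- Hypothetical partitioned table: Hive lays out one HDFS directory
-- per sale_date value.
CREATE TABLE sales (
  sale_id BIGINT,
  amount  DOUBLE
)
PARTITIONED BY (sale_date STRING)
STORED AS ORC;

-- Filtering on the partition column lets Hive prune to a single
-- partition instead of scanning the whole table.
SELECT sum(amount)
FROM sales
WHERE sale_date = '2016-06-01';
```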