Member since: 06-18-2018
Posts: 34
Kudos Received: 13
Solutions: 2
My Accepted Solutions
Title | Views | Posted
---|---|---
| 86588 | 02-02-2016 03:08 PM
| 1948 | 01-13-2016 09:52 AM
02-02-2016
03:08 PM
2 Kudos
The solution is to dynamically create a table from the Avro schema, and then create a new Parquet-format table from the Avro one. Here is the Hive source code, in case it helps you:
CREATE TABLE avro_test ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS AVRO TBLPROPERTIES ('avro.schema.url'='myHost/myAvroSchema.avsc');
CREATE EXTERNAL TABLE parquet_test LIKE avro_test STORED AS PARQUET LOCATION 'hdfs://myParquetFilesPath';
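If you also need to copy existing Avro data into the Parquet table (rather than pointing the external table at Parquet files that already exist), a minimal sketch, assuming the two tables above, would be:
INSERT OVERWRITE TABLE parquet_test SELECT * FROM avro_test; -- Hive rewrites the selected rows as Parquet files under the table's location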
02-02-2016
03:04 PM
Actually, there is no answer to my question yet; I'll publish the answer soon and accept it.
01-29-2016
10:13 AM
2 Kudos
Let's assume that my HDFS block size is 256 MB and that I need to store 20 GB of data in ORC/Parquet file(s). Is it better to store all the data in one ORC/Parquet file, or to store it in many ORC/Parquet files of 256 MB (the HDFS block size)?
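For reference, a minimal sketch of the Hive session settings one could use to steer output file sizes toward the HDFS block size when writing such a table (the table names are placeholders and the values are assumptions illustrating a 256 MB target, not tested recommendations):
SET hive.merge.mapfiles=true; -- merge small files produced by map-only jobs
SET hive.merge.mapredfiles=true; -- merge small files produced by map-reduce jobs
SET hive.merge.size.per.task=268435456; -- target size of merged output files, roughly 256 MB
SET hive.merge.smallfiles.avgsize=134217728; -- merge when the average output file is smaller than this
INSERT OVERWRITE TABLE orc_target SELECT * FROM staging_source; -- orc_target and staging_source are hypothetical names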
Labels:
- Apache Hadoop
01-27-2016
05:38 PM
Hello back! Sorry for the 6-day delay in my answer. Also, I couldn't find how Ozone stores data on HDFS, so I could see how it handles small files. Do you have any idea? Thanks a lot 🙂
01-20-2016
10:45 PM
Thanks a lot for your answer once again 🙂 1 - What do you mean by source to destination? Is it some kind of ETL on raw data to put into a DW? 2.1 - Is there an MPP database recommended by Hortonworks? 2.2 - If not, what other alternatives exist? Thanks 😉
01-20-2016
10:24 PM
No worries about that; I'm more talking about read performance. I know that HBase performs well at range scans, but is that still true with huge amounts of data, when it runs into operational issues like compaction, node rebuilds, and load distribution?
01-20-2016
06:12 PM
Hello,
In short: can I use HBase over HDFS as a data lake?
In detail: since Hadoop has been designed to store massive amounts of data (as big files), I was wondering whether, given my use case (storing lots of small files),
HBase would be more suitable.
Of course, data in HBase is stored in HDFS, but what about metadata, and what happens when HBase runs into operational issues like compaction, node rebuilds, and load distribution? Thanks in advance.
Labels:
- Apache Hadoop
- Apache HBase
01-20-2016
09:44 AM
First of all, thanks for your answer. The duplication wasn't about the date but more about the data in Parquet and HBase; besides, using Hive over HBase is not really as good as having a columnar format... Have a nice day 🙂
01-19-2016
09:28 PM
Hello, I have some questions related to real-time analytics on Hadoop; here are my use case and questions. I'm trying to use some BI solutions (like Tableau) to do real-time analytics on Hadoop. 1 - What are the most commonly used architectures to achieve my goal? 2 - Does it make sense to use an MPP database as a data mart (loading data according to the business fields from Hadoop into the MPP)? 3 - Can a NoSQL database like Cassandra replace an MPP database? If yes, is it better?
Labels:
- Apache Hadoop
01-19-2016
03:18 PM
Yes, thanks ^^. In my case, I'm using HBase because I'm handling a large number of small files.