Member since 05-16-2016
785 Posts
114 Kudos Received
39 Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 1834 | 06-12-2019 09:27 AM |
|  | 3046 | 05-27-2019 08:29 AM |
|  | 5073 | 05-27-2018 08:49 AM |
|  | 4448 | 05-05-2018 10:47 PM |
|  | 2763 | 05-05-2018 07:32 AM |
02-23-2017
06:09 AM
Just put it under the user directory and set the permissions the same way we do on a Linux filesystem, using the hadoop fs shell commands hadoop fs -chown and hadoop fs -chmod. In addition, for backup we can configure HDFS Snapshots for point-in-time file recovery: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html
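For example, a minimal sketch (the /user/alice/data path, owner, group and mode are placeholders; substitute your own):

hadoop fs -chown alice:hadoop /user/alice/data    # set owner and group, as on a Linux fs
hadoop fs -chmod 750 /user/alice/data             # set permissions
hdfs dfsadmin -allowSnapshot /user/alice/data     # admin step: make the directory snapshottable
hdfs dfs -createSnapshot /user/alice/data snap1   # point-in-time copy for file recovery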
02-22-2017
07:47 PM
Below are the prerequisites for your requirement.

1. We need a timestamp interceptor to inject the timestamp into every event header, if you don't have one already in your flume-conf.properties. For example:
tail1.sources.src1.interceptors = ic1
tail1.sources.src1.interceptors.ic1.type = timestamp

2. If it is a multi-tier Flume agent architecture, it is recommended to use hdfs.useLocalTimeStamp, so that the sink uses a timestamp generated by the Flume agent running the HDFS sink:
tail1.sinks.sink1.hdfs.useLocalTimeStamp = true

3. To make all the files generated in a month land in the same month folder, all we have to do is use just the month and year escape sequences in the path (see the consolidated sketch below):
tail1.sinks.sink1.hdfs.path = flume/collector1/%m-%Y
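Put together, a minimal flume-conf.properties sketch for the names used above (agent tail1, source src1, sink sink1; the channel, source type and file path are assumptions for a runnable example):

tail1.sources = src1
tail1.channels = ch1
tail1.sinks = sink1

# source and channel are assumptions; adjust to your setup
tail1.sources.src1.type = exec
tail1.sources.src1.command = tail -F /var/log/example.log
tail1.sources.src1.channels = ch1
tail1.channels.ch1.type = memory

# inject a timestamp header into every event
tail1.sources.src1.interceptors = ic1
tail1.sources.src1.interceptors.ic1.type = timestamp

# HDFS sink writes one directory per month, resolved from the timestamp header
tail1.sinks.sink1.type = hdfs
tail1.sinks.sink1.channel = ch1
tail1.sinks.sink1.hdfs.path = flume/collector1/%m-%Y
# in multi-tier setups, use the clock of the agent running the HDFS sink
tail1.sinks.sink1.hdfs.useLocalTimeStamp = true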
02-22-2017
03:24 AM
Let me share my thoughts; please correct me if I am wrong in understanding your issue.

1. You can use STORED AS PARQUET when you are creating the table in Impala.

2. "For example, with a school_records table partitioned on a year column, there is a separate data directory for each different year value, and all the data for that year is stored in a data file in that directory. A query that includes a WHERE condition such as YEAR=1966 or YEAR IN (1989,1999) can examine only the data files from the appropriate directory" (quoted from the Cloudera Impala documentation). A short sketch of points 1 and 2 follows below.

3. Would you consider writing a custom interceptor to add the field to the event header, or you could use the UUID interceptor for a unique id? As a second option, though I am not sure about it, you could pull the data from HDFS and run a Python script to add the new field.
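A minimal sketch of points 1 and 2 in impala-shell (only the year partition column comes from the docs example; the other column names are assumptions):

CREATE TABLE school_records (student_name STRING, score INT)
  PARTITIONED BY (year INT)
  STORED AS PARQUET;

-- partition pruning: only the year=1966 directory is scanned
SELECT COUNT(*) FROM school_records WHERE year = 1966;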
02-21-2017
10:20 PM
"But now I need to store them into HDFS as "partition" structure shown below. I have been told this is required in order to let Impala effectively read the data. /hive/warehouse/test/fact_my_service/year=2017/month=2/day=21"

Answer: You can use the HDFS sink escape sequences in the path, for example:

a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d

%d  day of month (01)
%m  month (01..12)
%Y  year (2010)

Refer to the Apache Flume user guide, HDFS sink section: https://flume.apache.org/FlumeUserGuide.html

"Can you please share some hints how to get my data stored that way and how to make Impala understand already stored data?"

If you have a common shared metastore for Hive and Impala, you can create an external table and point it at the location:

CREATE EXTERNAL TABLE table_name
  (userid INT, movid STRING, age TINYINT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '<hive warehouse or any location of your data>';

Note: Make sure to perform INVALIDATE METADATA; since we created the table outside Impala, we have to refresh the metastore metadata in Impala before querying. A combined sketch for the partitioned layout follows below.
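To produce the year=/month=/day= layout asked about, a minimal combined sketch (the column list is carried over from the example above as an assumption; the partition values are illustrative). On the Flume side:

a1.sinks.k1.hdfs.path = /hive/warehouse/test/fact_my_service/year=%Y/month=%m/day=%d

On the Impala side, a partitioned external table over that same location:

CREATE EXTERNAL TABLE fact_my_service (userid INT, movid STRING, age TINYINT)
  PARTITIONED BY (year INT, month INT, day INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/hive/warehouse/test/fact_my_service';

-- register each new directory as a partition, then refresh Impala's metadata
ALTER TABLE fact_my_service ADD PARTITION (year=2017, month=2, day=21);
INVALIDATE METADATA fact_my_service;

Note that %m and %d are zero-padded (month=02, day=21), so the directory names will differ slightly from the unpadded month=2 in the question.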
02-20-2017
09:33 PM
1 Kudo
Hive is designed for schema on read, meaning Hive has no control over the underlying storage. You can damage the data and still manage to query it using Hive. Say the schema does not match the file contents; Hive will try its best to read it, and going further it will produce NULL values, for example where a numeric column holds non-numeric strings. Whereas in a traditional database you write, update and insert, and because the database has control over the storage it enforces the schema while writing; that is why it is schema on write. So to sum up, you won't be able to create a Hive table with NOT NULL constraints and have them enforced, by design.
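A quick way to see this behaviour (a minimal sketch; the table name, column and sample file are assumptions):

CREATE TABLE t (id INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- load a text file whose rows are non-numeric strings such as "abc"
LOAD DATA LOCAL INPATH '/tmp/sample.txt' INTO TABLE t;
-- Hive does not reject the bad rows at write time; it returns NULL at read time
SELECT * FROM t;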
02-20-2017
09:06 AM
Could you run the commands below and post the results? I am curious, what is your replication factor?

hadoop fsck /path/to/directory
hadoop fs -du -s /path/to/directory

The above commands should give us the same result; both calculate only the raw HDFS data size, without considering the replication factor. The command below also accounts for the space consumed across the nodes (on disk), including the replication factor:

hadoop fs -count -q /path/to/directory

We can then compare these results against how much HDFS space the NameNode UI reports as consumed.
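For example, against a made-up directory (/user/example/logs is a placeholder):

hadoop fsck /user/example/logs          # block health plus total logical size and replication info
hadoop fs -du -s /user/example/logs     # summarised logical size of the directory contents
hadoop fs -count -q /user/example/logs  # quota view; space-quota accounting counts replicated bytes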
02-17-2017
08:11 PM
I believe you have configured a separate host that acts as a proxy, making it handle the requests along with Kerberos. Hence I think you won't be able to bypass the proxy, because it works like a session facade: https://www.cloudera.com/documentation/enterprise/5-2-x/topics/impala_proxy.html#proxy_kerberos
02-16-2017
10:19 PM
I believe you are missing the realm setting in core-site.xml; please check the hadoop.registry.kerberos.realm property there. I took this reference from the Apache Hadoop core-site.xml documentation for hadoop.registry.kerberos.realm:

"The Kerberos realm: used to set the realm of system principals which do not declare their realm, and any other accounts that need the value. If empty, the default realm of the running process is used. If neither are known and the realm is needed, then the registry service/client will fail."
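The corresponding entry in core-site.xml might look like this (EXAMPLE.COM is a placeholder realm; use your own):

<property>
  <name>hadoop.registry.kerberos.realm</name>
  <value>EXAMPLE.COM</value>
</property>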
02-14-2017
09:17 PM
Do you mean the host IP that Impala is running on, or the port? Could you tell me?
02-13-2017
07:45 PM
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: 0.0.0.0/0.0.0.0:8022

From the above error it is clear that the external DataNode is having trouble connecting to the NameNode. One thing you can do is check the status of the NameNode you are connecting to:

sudo service hadoop-hdfs-namenode status
sudo service hadoop-hdfs-secondarynamenode status

If it has not started, you may start it by replacing status with start (a short example follows below); if you don't have authorization, you should contact the Hadoop admin. Also please check the same for the SecondaryNameNode. Thanks
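For completeness, starting the daemons if the status check shows they are stopped (a sketch, assuming a package-based install that provides these init scripts):

sudo service hadoop-hdfs-namenode start            # start the NameNode daemon
sudo service hadoop-hdfs-secondarynamenode start   # start the SecondaryNameNode daemon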