Member since: 04-04-2016
Posts: 166
Kudos Received: 168
Solutions: 29

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2897 | 01-04-2018 01:37 PM
 | 4902 | 08-01-2017 05:06 PM
 | 1568 | 07-26-2017 01:04 AM
 | 8917 | 07-21-2017 08:59 PM
 | 2607 | 07-20-2017 08:59 PM
05-06-2016
05:41 PM
3 Kudos
@Nilesh Below is your solution:

Input:

```
mysql> select * from SERDES;
+----------+------+----------------------------------------------------+
| SERDE_ID | NAME | SLIB                                               |
+----------+------+----------------------------------------------------+
|       56 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|       57 | NULL | org.apache.hadoop.hive.ql.io.orc.OrcSerde          |
|       58 | NULL | NULL                                               |
|       59 | NULL | org.apache.hadoop.hive.ql.io.orc.OrcSerde          |
|       60 | NULL | org.apache.hadoop.hive.ql.io.orc.OrcSerde          |
|       61 | NULL | org.apache.hadoop.hive.ql.io.orc.OrcSerde          |
|       62 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
+----------+------+----------------------------------------------------+
7 rows in set (0.00 sec)
```

Command:

```
sqoop import --connect jdbc:mysql://test:3306/hive \
  --username hive \
  --password test \
  --table SERDES \
  --hcatalog-database test \
  --hcatalog-table SERDES \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orcfile" \
  --outdir sqoop_import \
  -m 1 \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
  --driver com.mysql.jdbc.Driver
```

Logs:

```
...
16/05/06 13:30:46 INFO hcat.SqoopHCatUtilities: HCatalog Create table statement: create table `demand_db`.`serdes` (
	`serde_id` bigint,
	`name` varchar(128),
	`slib` varchar(4000))
stored as orcfile
...
16/05/06 13:32:55 INFO mapreduce.Job: Job job_1462201699379_0089 running in uber mode : false
16/05/06 13:32:55 INFO mapreduce.Job:  map 0% reduce 0%
16/05/06 13:33:07 INFO mapreduce.Job:  map 100% reduce 0%
16/05/06 13:33:09 INFO mapreduce.Job: Job job_1462201699379_0089 completed successfully
16/05/06 13:33:09 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=297179
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=87
		HDFS: Number of bytes written=676
		HDFS: Number of read operations=4
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=1
		Other local map tasks=1
		Total time spent by all maps in occupied slots (ms)=14484
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=7242
		Total vcore-seconds taken by all map tasks=7242
		Total megabyte-seconds taken by all map tasks=11123712
	Map-Reduce Framework
		Map input records=8
		Map output records=8
		Input split bytes=87
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=92
		CPU time spent (ms)=4620
		Physical memory (bytes) snapshot=353759232
		Virtual memory (bytes) snapshot=3276144640
		Total committed heap usage (bytes)=175112192
	File Input Format Counters
		Bytes Read=0
	File Output Format Counters
		Bytes Written=0
16/05/06 13:33:09 INFO mapreduce.ImportJobBase: Transferred 676 bytes in 130.8366 seconds (5.1668 bytes/sec)
16/05/06 13:33:09 INFO mapreduce.ImportJobBase: Retrieved 8 records.
```

Output:

```
hive> select * from serdes;
OK
56	NULL	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
57	NULL	org.apache.hadoop.hive.ql.io.orc.OrcSerde
58	NULL	NULL
59	NULL	org.apache.hadoop.hive.ql.io.orc.OrcSerde
60	NULL	org.apache.hadoop.hive.ql.io.orc.OrcSerde
61	NULL	org.apache.hadoop.hive.ql.io.orc.OrcSerde
62	NULL	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
63	NULL	org.apache.hadoop.hive.ql.io.orc.OrcSerde
Time taken: 2.711 seconds, Fetched: 8 row(s)
hive>
```
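Not part of the original answer, but a quick way to confirm that the --create-hcatalog-table / "stored as orcfile" options took effect is to inspect the table's storage metadata. A minimal sketch, assuming the database/table names reported by the create-table log line above (adjust them to your environment):

```
# Show the storage descriptor of the imported table; the SerDe Library and
# InputFormat/OutputFormat lines should point at the ORC classes.
hive -e "describe formatted demand_db.serdes;" | grep -i -E 'SerDe Library|InputFormat|OutputFormat'
```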
05-06-2016
04:34 PM
3 Kudos
@Kaliyug Antagonist

Unicode file:

```
[root@test test]# pwd
/root/test
[root@test test]# cat xyz
Les caractères accentués (Français)
En données nous avons confiance
Données, données, partout
et tous les noeuds étaient déconnecté
Données, données, partout
[root@test test]#
```

External table DDL:

```
create external table demand_db.unicode
(data string)
COMMENT 'External table for data cleansing'
LOCATION '/tmp/test/';
```

External table location:

```
[root@test ~]# hdfs dfs -mkdir -p /tmp/test
[root@test ~]# hdfs dfs -chmod -R 777 /tmp/test
[root@test ~]# hdfs dfs -ls /tmp
```

Output:

```
hive> create external table unicode
    > (data string)
    > COMMENT 'External table for data cleansing'
    > LOCATION '/tmp/test/';
OK
Time taken: 0.502 seconds
hive> select * from unicode;
OK
Les caractères accentués (Français)
En données nous avons confiance
Données, données, partout
et tous les noeuds étaient déconnecté
Données, données, partout
Time taken: 0.897 seconds, Fetched: 8 row(s)
hive>
```

Conclusion: You do not need to convert the Unicode character set, and the STRING type works perfectly in this case. Thanks
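One step is not shown in the output above: the sample file still has to land in the table's HDFS location before the SELECT returns rows. A minimal sketch of that load, assuming the local file /root/test/xyz and the /tmp/test location from the DDL:

```
# Copy the local sample file into the external table's location (paths taken from the post).
hdfs dfs -put -f /root/test/xyz /tmp/test/

# Verify the accented characters survived the upload.
hdfs dfs -cat /tmp/test/xyz
```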
05-06-2016
04:10 PM
@omkar pathallapalli Let us know if the solution I posted worked for you. Thanks
05-06-2016
05:54 AM
@santosh rai Out of curiosity, do you have a specific use case for using 2.2?
05-06-2016
04:49 AM
2 Kudos
@omkar pathallapalli This issue should be resolved by adding --driver com.mysql.jdbc.Driver at the end of the sqoop command. For example:

```
sqoop import --connect $DB_CONNECTION --username $DB_USERNAME --password $DB_PASSWORD \
  --table salaries --target-dir /tmp/salaries --outdir sqoop_import \
  -m 1 --fields-terminated-by ',' --driver com.mysql.jdbc.Driver
```
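Not from the original reply, but if you want to confirm the driver and connection string before kicking off the full import, a quick connectivity check is possible with sqoop's list-tables tool. A minimal sketch reusing the same (placeholder) $DB_* variables:

```
# Lists the tables visible to the connection; if this fails, the import will fail too.
sqoop list-tables \
  --connect $DB_CONNECTION \
  --username $DB_USERNAME \
  --password $DB_PASSWORD \
  --driver com.mysql.jdbc.Driver
```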
Thanks
05-06-2016
03:29 AM
3 Kudos
@Sunile Manjee The supported policies for late data handling are:

- backoff: take the maximum late cut-off and check every specified time.
- exp-backoff (default, recommended): take the maximum cut-off date and check on an exponentially determined time.
- final: take the maximum late cut-off and check once.

For example, a late cut-off of hours(6) means data can be delayed by up to 6 hours:

```
<late-arrival cut-off="hours(6)"/>
```

The late input in the following process specification is handled by the /apps/myapp/latehandle workflow:

```
<late-process policy="exp-backoff" delay="hours(2)">
  <late-input input="input" workflow-path="/apps/myapp/latehandle" />
</late-process>
```

This means the workflow will be retried until the feed arrives or the late cut-off expires; once the feed arrives within that window, the window is reset. Inside /apps/myapp/latehandle you can put your own logic (it may be Sqoop, Hive, shell, etc.); the processing there determines what happens to the late feed. For simple scenarios you can re-run the actual workflow, or use a special workflow that handles the dependencies and boundary cases. Thanks
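As a side note (not in the original answer), once the feed's late-arrival and the process's late-process sections are in place, the entities are pushed to Falcon with the standard CLI. A minimal sketch with hypothetical entity file names:

```
# Submit and schedule the feed and process definitions that carry the
# late-arrival / late-process configuration (file names are placeholders).
falcon entity -type feed -submitAndSchedule -file input-feed.xml
falcon entity -type process -submitAndSchedule -file myapp-process.xml
```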
04-20-2016
07:19 PM
3 Kudos
After my initial research, below is what I found about the security options in HDF:

1. To enable the User Interface to be accessed over HTTPS instead of HTTP, the "security properties" heading in the nifi.properties file needs to be edited.
2. User authentication is handled by the Login Identity Provider, a pluggable mechanism for authenticating users via their username/password.
   a. The Login Identity Provider integrates with a Directory Server to authenticate users using LDAP. Username/password authentication can be enabled by referencing this provider in nifi.properties.
   b. The Login Identity Provider also integrates with a Kerberos Key Distribution Center (KDC) to authenticate users. NiFi can be configured to use Kerberos SPNEGO (or "Kerberos Service") for authentication.
   Note: By default, NiFi requires client certificates for authenticating users over HTTPS, so which Login Identity Provider to use must be configured explicitly in nifi.properties.
3. Levels of access in HDF can be controlled by setting up the user of the Authority Provider (Admin), who can then grant the corresponding roles to requesting users. The supported roles are: i) Administrator, ii) Data Flow Manager, iii) Read Only, iv) Provenance, v) NiFi.
4. Out of the box, NiFi provides several options to encrypt and decrypt data. The EncryptContent processor allows for the encryption and decryption of data, both internal to NiFi and integrated with external systems, such as openssl and other data sources and consumers.

Detailed information can be found in the HDF documentation: https://docs.hortonworks.com/HDPDocuments/HDF1/HDF-1.2/bk_AdminGuide/content/ch_administration_guide.html Thanks
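As an illustration of point 1 above (not from the original post), the keystore and truststore referenced by the security properties in nifi.properties can be created with the standard JDK keytool. A minimal self-signed sketch with placeholder host names, file names, and passwords:

```
# Generate a self-signed key pair for the NiFi node (all values are placeholders).
keytool -genkeypair -alias nifi -keyalg RSA -keysize 2048 -validity 365 \
  -dname "CN=nifi-host.example.com" \
  -keystore keystore.jks -storepass changeit -keypass changeit

# Export the certificate and import it into a truststore for clients.
keytool -exportcert -alias nifi -keystore keystore.jks -storepass changeit -file nifi.cer
keytool -importcert -alias nifi -file nifi.cer -keystore truststore.jks -storepass changeit -noprompt
```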
04-20-2016
07:05 PM
@emaxwell Thank you
04-20-2016
02:23 PM
2 Kudos
I am planning to use HDF for a particular use case: ingesting a large number of flat files and some sensitive metadata from relational databases. It will work in conjunction with an HDP 2.4 cluster. My question is: apart from the out-of-the-box security provided by Apache NiFi itself, what other security best practices should be implemented for HDF? For more context, the HDP cluster will be secured using Kerberos, Ranger, and Knox. Thanks.
Labels:
- Apache NiFi
- Cloudera DataFlow (CDF)
04-20-2016
03:01 AM
1 Kudo
Hi, @Peter Coates Assuming you have a moderate number of files, did you try the option below?

```
bash$ hadoop distcp2 -f hdfs://nn1:8020/srclist hdfs://nn2:8020/bar/foo
```

Where srclist contains (you can populate this file with a recursive listing; see the sketch after this post):

```
hdfs://nn1:8020/foo/dir1/a
hdfs://nn1:8020/foo/dir2/b
```

More info here: https://hadoop.apache.org/docs/r1.2.1/distcp2.html Please let me know if this works. Thanks
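To expand on "populate this file with a recursive listing" (my own sketch, with hypothetical paths and the nn1/nn2 addresses from the example): the fully qualified source paths can be generated with a recursive ls and then placed in HDFS for distcp's -f option:

```
# Build a list of files (not directories) under /foo, fully qualified with the source namenode.
hdfs dfs -ls -R /foo | grep '^-' | awk '{print "hdfs://nn1:8020"$NF}' > srclist

# Put the list where distcp expects it, then run the copy against it.
hdfs dfs -put -f srclist /srclist
hadoop distcp2 -f hdfs://nn1:8020/srclist hdfs://nn2:8020/bar/foo
```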