Member since: 02-03-2016
Posts: 123
Kudos Received: 23
Solutions: 1

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 3826 | 04-13-2017 08:09 AM |
09-08-2024
10:36 PM
With Hive 2.2 or newer, you can use MERGE INTO:

MERGE INTO target_table AS target
USING source_table AS source
ON target.id = source.id
WHEN MATCHED THEN
  UPDATE SET
    target.name = source.name,
    target.age = source.age
WHEN NOT MATCHED THEN
  INSERT (id, name, age)
  VALUES (source.id, source.name, source.age);
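Note that MERGE requires the target table to be an ACID (transactional) table and the transaction manager to be enabled. As a hedged sketch, assuming the statement above is saved in a hypothetical merge.sql and HiveServer2 is reachable at a placeholder hiveserver:10000:

# Hedged sketch; "merge.sql" and "hiveserver:10000" are placeholders, not values from this thread.
beeline -u "jdbc:hive2://hiveserver:10000/default" \
    --hiveconf hive.support.concurrency=true \
    --hiveconf hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager \
    -f merge.sql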
04-02-2018
01:19 PM
If this is a one-off, and that file server is visible to all nodes in the cluster, you can actually use distcp with the source being a file://store/path URL and the destination hdfs://hdfsserver:port/path. Use the -bandwidth option to limit the maximum bandwidth of each mapper so that the (mappers * bandwidth) value stays below the bandwidth of the file server.
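A hedged sketch of what that command might look like, with placeholder paths, host name and port; -bandwidth is in MB/s per mapper, so 4 mappers at 25 MB/s would pull at most 100 MB/s from the file server:

# Hedged sketch; paths, host name and port are placeholders.
# 4 mappers * 25 MB/s = 100 MB/s ceiling on what is read from the file server.
hadoop distcp -m 4 -bandwidth 25 \
    file:///store/path \
    hdfs://hdfsserver:8020/path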
04-13-2017
12:27 PM
Yes, this node was part of one of the old HDP installations. However, we have uninstalled that now and shifted to 2.5.3, a more stable release. I have taken the following steps:
1) Deleted the old 2.3.4 and current folders under /usr/hdp
2) Restarted the Ambari agent
3) Added the new host again and took care of the host check issues (pre-existing old 2.3.4 packages, users and folders, which I removed)
4) The node was added successfully, but I had to install a new RPM, python-argparse
5) Added the DataNode, NodeManager and clients on the new node successfully
Through Ambari I can now see this node with the required services.
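For reference, a minimal sketch of those cleanup commands, assuming a yum-based host; the exact directory names under /usr/hdp are illustrative:

# Hedged sketch of the steps above; directory names and package versions are illustrative.
rm -rf /usr/hdp/2.3.4* /usr/hdp/current   # remove the leftover 2.3.4 and current folders
ambari-agent restart                      # restart the Ambari agent on the host
yum install -y python-argparse            # RPM needed before the node registers cleanly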
04-07-2017
08:01 AM
It worked. Thanks a lot. I have also accepted the best answer.
02-09-2017
02:44 AM
1 Kudo
I had a similar use case recently. You have to approach this understanding that it's a different paradigm:
You can't do I/O the old-fashioned way; whatever dataset you're manipulating must be distributed, i.e. your log file should be on HDFS. So as a first step, opening the log file and creating an RDD would look something like this:

from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("CheckData")\
    .getOrCreate()

# Read the log file from HDFS and keep each line as a plain string
lines = spark.read.text("hdfs://[servername]/[path]/Virtual_Ports.log").rdd.map(lambda r: r[0])

You don't programmatically iterate on the data per se; instead you supply a function to process each value (lines in this case). So the code where you iterate on lines could be put inside a function:

def virtualPortFunction(line):
    # Do something: return the processed output of a line
    return line

virtualPortsSomething = lines.flatMap(lambda x: x.split(' ')) \
    .map(lambda x: virtualPortFunction(x))

This is a very simplistic way to put it, but it will give you a starting point if you decide to go down the PySpark route. Also look at the PySpark examples that ship with the Spark distribution; they are a good place to start. https://github.com/apache/spark/blob/master/examples/src/main/python/wordcount.py
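If it helps, a hedged sketch of how the snippet above might be launched, assuming it is saved as a hypothetical check_data.py and submitted to YARN:

# Hedged sketch; "check_data.py" is a hypothetical file name for the snippet above.
spark-submit --master yarn --deploy-mode client check_data.py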
03-02-2017
07:34 PM
Hi rajdip, I am having the same issue when loading the data into HBase. Did your error get resolved, or are you still waiting for a solution? Do you have any idea how to define the HBase column mappings with multiple row keys?
12-23-2016
07:08 PM
1 Kudo
@rajdip chaudhuri
I used your warehouse.csv and your load command. While it does appear to finish successfully, this is what I see at the end:

ImportTsv
    Bad Lines=5
File Input Format Counters
    Bytes Read=590
File Output Format Counters
    Bytes Written=0

As you can see, there were 5 bad lines, which is the total line count in your file. That means the command ran, but there was a problem with the data. It took a little bit of effort to find the issue, but the problem was that your csv file has an extra , at the end of each line. Here is an example from your file:

1,AAAAAAAABAAAAAAA,Conventional childr,977787,651,6th ,Parkway,Suite 470,Fairview,Williamson County,TN,35709,United States,-5,

It should look like this:

1,AAAAAAAABAAAAAAA,Conventional childr,977787,651,6th ,Parkway,Suite 470,Fairview,Williamson County,TN,35709,United States,-5
Notice that I removed the trailing comma. Now the data was loaded:

ImportTsv
    Bad Lines=0
File Input Format Counters
    Bytes Read=585
File Output Format Counters
    Bytes Written=0

And here is the scan:

hbase(main):011:0> scan 'warehouse'
ROW COLUMN+CELL
1 column=mycf:w_city, timestamp=1482521720833, value=Fairview
1 column=mycf:w_country, timestamp=1482521720833, value=United States
1 column=mycf:w_county, timestamp=1482521720833, value=Williamson County
1 column=mycf:w_gmt_offset, timestamp=1482521720833, value=-5
1 column=mycf:w_state, timestamp=1482521720833, value=TN
1 column=mycf:w_street_name, timestamp=1482521720833, value=6th
1 column=mycf:w_street_number, timestamp=1482521720833, value=651
1 column=mycf:w_street_type, timestamp=1482521720833, value=Parkway
1 column=mycf:w_suite_number, timestamp=1482521720833, value=Suite 470
1 column=mycf:w_warehouse_id, timestamp=1482521720833, value=AAAAAAAABAAAAAAA
1 column=mycf:w_warehouse_name, timestamp=1482521720833, value=Conventional childr
1 column=mycf:w_warehouse_sq_ft, timestamp=1482521720833, value=977787
1 column=mycf:w_zip, timestamp=1482521720833, value=35709
2 column=mycf:w_city, timestamp=1482521720833, value=Fairview
2 column=mycf:w_country, timestamp=1482521720833, value=United States
2 column=mycf:w_county, timestamp=1482521720833, value=Williamson County
2 column=mycf:w_gmt_offset, timestamp=1482521720833, value=-5
2 column=mycf:w_state, timestamp=1482521720833, value=TN
2 column=mycf:w_street_name, timestamp=1482521720833, value=View First
2 column=mycf:w_street_number, timestamp=1482521720833, value=600
2 column=mycf:w_street_type, timestamp=1482521720833, value=Avenue
2 column=mycf:w_suite_number, timestamp=1482521720833, value=Suite P
2 column=mycf:w_warehouse_id, timestamp=1482521720833, value=AAAAAAAACAAAAAAA
2 column=mycf:w_warehouse_name, timestamp=1482521720833, value=Important issues liv
2 column=mycf:w_warehouse_sq_ft, timestamp=1482521720833, value=138504
2 column=mycf:w_zip, timestamp=1482521720833, value=35709
3 column=mycf:w_city, timestamp=1482521720833, value=Fairview
3 column=mycf:w_country, timestamp=1482521720833, value=United States
3 column=mycf:w_county, timestamp=1482521720833, value=Williamson County
3 column=mycf:w_gmt_offset, timestamp=1482521720833, value=-5
3 column=mycf:w_state, timestamp=1482521720833, value=TN
3 column=mycf:w_street_name, timestamp=1482521720833, value=Ash Laurel
3 column=mycf:w_street_number, timestamp=1482521720833, value=534
3 column=mycf:w_street_type, timestamp=1482521720833, value=Dr.
3 column=mycf:w_suite_number, timestamp=1482521720833, value=Suite 0
3 column=mycf:w_warehouse_id, timestamp=1482521720833, value=AAAAAAAADAAAAAAA
3 column=mycf:w_warehouse_name, timestamp=1482521720833, value=Doors canno
3 column=mycf:w_warehouse_sq_ft, timestamp=1482521720833, value=294242
3 column=mycf:w_zip, timestamp=1482521720833, value=35709
4 column=mycf:w_city, timestamp=1482521720833, value=Fairview
4 column=mycf:w_country, timestamp=1482521720833, value=United States
4 column=mycf:w_county, timestamp=1482521720833, value=Williamson County
4 column=mycf:w_gmt_offset, timestamp=1482521720833, value=-5
4 column=mycf:w_state, timestamp=1482521720833, value=TN
4 column=mycf:w_street_name, timestamp=1482521720833, value=Wilson Elm
4 column=mycf:w_street_number, timestamp=1482521720833, value=368
4 column=mycf:w_street_type, timestamp=1482521720833, value=Drive
4 column=mycf:w_suite_number, timestamp=1482521720833, value=Suite 80
4 column=mycf:w_warehouse_id, timestamp=1482521720833, value=AAAAAAAAEAAAAAAA
4 column=mycf:w_warehouse_name, timestamp=1482521720833, value=Bad cards must make.
4 column=mycf:w_warehouse_sq_ft, timestamp=1482521720833, value=621234
4 column=mycf:w_zip, timestamp=1482521720833, value=35709
5 column=mycf:w_city, timestamp=1482521720833, value=Fairview
5 column=mycf:w_country, timestamp=1482521720833, value=United States
5 column=mycf:w_county, timestamp=1482521720833, value=Williamson County
5 column=mycf:w_gmt_offset, timestamp=1482521720833, value=
5 column=mycf:w_state, timestamp=1482521720833, value=TN
5 column=mycf:w_street_name, timestamp=1482521720833, value=
5 column=mycf:w_street_number, timestamp=1482521720833, value=
5 column=mycf:w_street_type, timestamp=1482521720833, value=
5 column=mycf:w_suite_number, timestamp=1482521720833, value=
5 column=mycf:w_warehouse_id, timestamp=1482521720833, value=AAAAAAAAFAAAAAAA
5 column=mycf:w_warehouse_name, timestamp=1482521720833, value=
5 column=mycf:w_warehouse_sq_ft, timestamp=1482521720833, value=
5 column=mycf:w_zip, timestamp=1482521720833, value=35709
5 row(s) in 0.3110 seconds
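Incidentally, the trailing commas can be stripped in bulk before loading; a hedged sketch, assuming GNU sed and the warehouse.csv discussed above:

# Hedged sketch: drop a trailing comma from every line of the csv before running ImportTsv.
sed -i 's/,$//' warehouse.csv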
02-07-2019
08:47 PM
@Timothy Spann.... do we not have a solution to parse/read XML without the Databricks package? I work on HDP 2.0+, Spark 2.1. I am trying to parse XML using PySpark code (manual parsing), but I am having difficulty when converting the list to a dataframe. Any advice? Let me know; I can post the script here. Thanks.
10-31-2016
03:11 PM
@rajdip chaudhuri - That's great. Can you please accept my answer above? I will look into this and file a bug if needed.
12-15-2017
06:02 AM
Did you solve the problem? I'm facing the same problem as you, and I tried all the methods mentioned above, but they didn't work. Here are my exceptions:

[root@cos1 ~]# sqoop list-tables --connect jdbc:mysql://192.168.2.190/experiment3 --username scott -P
17/12/15 19:17:36 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Enter password:
17/12/15 19:17:41 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
17/12/15 19:17:41 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
    at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:856)
    at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
    at org.apache.sqoop.manager.CatalogQueryManager.listTables(CatalogQueryManager.java:102)
    at org.apache.sqoop.tool.ListTablesTool.run(ListTablesTool.java:49)
    at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:236)

Here is my screenshot. I'm using sqoop-1.4.6, hadoop-2.7.4 and mysql-connector-java-5.1.38.jar. @rajdip chaudhuri @Artem Ervits