Member since: 02-03-2016
Posts: 123
Kudos Received: 23
Solutions: 1

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 3826 | 04-13-2017 08:09 AM |
09-08-2024
10:36 PM
With Hive 2.2 or newer, you can use MERGE INTO:

MERGE INTO target_table AS target
USING source_table AS source
ON target.id = source.id
WHEN MATCHED THEN
  UPDATE SET
    target.name = source.name,
    target.age = source.age
WHEN NOT MATCHED THEN
  INSERT (id, name, age)
  VALUES (source.id, source.name, source.age);
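Note that MERGE requires the target table to be an ACID (transactional) table and the transaction manager to be enabled. As a hedged sketch, assuming the statement above is saved in a hypothetical merge.sql and HiveServer2 is reachable at a placeholder hiveserver:10000:

# Hedged sketch; "merge.sql" and "hiveserver:10000" are placeholders, not values from this thread.
beeline -u "jdbc:hive2://hiveserver:10000/default" \
    --hiveconf hive.support.concurrency=true \
    --hiveconf hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager \
    -f merge.sql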
04-02-2018
01:19 PM
If this is a one-off, and that file server is visible to all nodes in the cluster, you can actually use distcp with the source being a file://store/path URL and the destination hdfs://hdfsserver:port/path. Use the -bandwidth option to limit the maximum bandwidth of each mapper so that the (mappers * bandwidth) value stays below the bandwidth of the file server.
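A hedged sketch of what that command might look like, with placeholder paths, host name and port; -bandwidth is in MB/s per mapper, so 4 mappers at 25 MB/s would pull at most 100 MB/s from the file server:

# Hedged sketch; paths, host name and port are placeholders.
# 4 mappers * 25 MB/s = 100 MB/s ceiling on what is read from the file server.
hadoop distcp -m 4 -bandwidth 25 \
    file:///store/path \
    hdfs://hdfsserver:8020/path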
04-13-2017
12:27 PM
Yes, this node was part of one of the old HDP installations. However, we have uninstalled that now and shifted to 2.5.3, a more stable release. I have taken the following steps:
1) Deleted the old 2.3.4 and current folders under /usr/hdp
2) Restarted the Ambari agent
3) Added the new host again and took care of the host check issues (pre-existing old 2.3.4 packages, users and folders, which I removed)
4) The node was added successfully, but I had to install a new RPM, python-argparse
5) Added the DataNode, NodeManager and clients on the new node successfully
Through Ambari I can now see this node with the required services.
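For reference, a minimal sketch of those cleanup commands, assuming a yum-based host; the exact directory names under /usr/hdp are illustrative:

# Hedged sketch of the steps above; directory names and package versions are illustrative.
rm -rf /usr/hdp/2.3.4* /usr/hdp/current   # remove the leftover 2.3.4 and current folders
ambari-agent restart                      # restart the Ambari agent on the host
yum install -y python-argparse            # RPM needed before the node registers cleanly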
04-07-2017
08:01 AM
It worked. Thanks a lot. I have also accepted the best answer.
02-09-2017
02:44 AM
1 Kudo
I had a similar use case recently. You have to approach this understanding that it's a different paradigm:
You can't do I/O the old-fashioned way; whatever dataset you're manipulating must be distributed, i.e. your log file should be on HDFS. So as a first step, opening the log file and creating an RDD would look something like this:

from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("CheckData")\
    .getOrCreate()

# Read the log file from HDFS and keep each line as a plain string
lines = spark.read.text("hdfs://[servername]/[path]/Virtual_Ports.log").rdd.map(lambda r: r[0])

You don't programmatically iterate on the data per se; instead you supply a function to process each value (lines in this case). So the code where you iterate on lines could be put inside a function:

def virtualPortFunction(line):
    # Do something: return the processed output of a line
    return line

virtualPortsSomething = lines.flatMap(lambda x: x.split(' ')) \
    .map(lambda x: virtualPortFunction(x))

This is a very simplistic way to put it, but it will give you a starting point if you decide to go down the PySpark route. Also look at the PySpark examples that ship with the Spark distribution; they are a good place to start. https://github.com/apache/spark/blob/master/examples/src/main/python/wordcount.py
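If it helps, a hedged sketch of how the snippet above might be launched, assuming it is saved as a hypothetical check_data.py and submitted to YARN:

# Hedged sketch; "check_data.py" is a hypothetical file name for the snippet above.
spark-submit --master yarn --deploy-mode client check_data.py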
03-02-2017
07:34 PM
Hi rajdip, I am having the same issue when loading the data into HBase. Did your error get resolved, or are you still waiting for a solution? Do you have any idea how to define the HBase column mappings with multiple row keys?
12-23-2016
07:08 PM
1 Kudo
@rajdip chaudhuri
I used your warehouse.csv and your load command. While it does appear to finish successfully, this is what I see at the end:

ImportTsv
    Bad Lines=5
File Input Format Counters
    Bytes Read=590
File Output Format Counters
    Bytes Written=0

As you can see, there were 5 bad lines, which is the total line count in your file. That means the command ran, but there was a problem with the data. It took a little bit of effort to find the issue, but the problem was that your csv file has an extra , at the end of each line. Here is an example from your file:

1,AAAAAAAABAAAAAAA,Conventional childr,977787,651,6th ,Parkway,Suite 470,Fairview,Williamson County,TN,35709,United States,-5,

It should look like this:

1,AAAAAAAABAAAAAAA,Conventional childr,977787,651,6th ,Parkway,Suite 470,Fairview,Williamson County,TN,35709,United States,-5
Notice that I removed the trailing comma. Now the data was loaded:

ImportTsv
    Bad Lines=0
File Input Format Counters
    Bytes Read=585
File Output Format Counters
    Bytes Written=0

And here is the scan:

hbase(main):011:0> scan 'warehouse'
ROW COLUMN+CELL
1 column=mycf:w_city, timestamp=1482521720833, value=Fairview
1 column=mycf:w_country, timestamp=1482521720833, value=United States
1 column=mycf:w_county, timestamp=1482521720833, value=Williamson County
1 column=mycf:w_gmt_offset, timestamp=1482521720833, value=-5
1 column=mycf:w_state, timestamp=1482521720833, value=TN
1 column=mycf:w_street_name, timestamp=1482521720833, value=6th
1 column=mycf:w_street_number, timestamp=1482521720833, value=651
1 column=mycf:w_street_type, timestamp=1482521720833, value=Parkway
1 column=mycf:w_suite_number, timestamp=1482521720833, value=Suite 470
1 column=mycf:w_warehouse_id, timestamp=1482521720833, value=AAAAAAAABAAAAAAA
1 column=mycf:w_warehouse_name, timestamp=1482521720833, value=Conventional childr
1 column=mycf:w_warehouse_sq_ft, timestamp=1482521720833, value=977787
1 column=mycf:w_zip, timestamp=1482521720833, value=35709
2 column=mycf:w_city, timestamp=1482521720833, value=Fairview
2 column=mycf:w_country, timestamp=1482521720833, value=United States
2 column=mycf:w_county, timestamp=1482521720833, value=Williamson County
2 column=mycf:w_gmt_offset, timestamp=1482521720833, value=-5
2 column=mycf:w_state, timestamp=1482521720833, value=TN
2 column=mycf:w_street_name, timestamp=1482521720833, value=View First
2 column=mycf:w_street_number, timestamp=1482521720833, value=600
2 column=mycf:w_street_type, timestamp=1482521720833, value=Avenue
2 column=mycf:w_suite_number, timestamp=1482521720833, value=Suite P
2 column=mycf:w_warehouse_id, timestamp=1482521720833, value=AAAAAAAACAAAAAAA
2 column=mycf:w_warehouse_name, timestamp=1482521720833, value=Important issues liv
2 column=mycf:w_warehouse_sq_ft, timestamp=1482521720833, value=138504
2 column=mycf:w_zip, timestamp=1482521720833, value=35709
3 column=mycf:w_city, timestamp=1482521720833, value=Fairview
3 column=mycf:w_country, timestamp=1482521720833, value=United States
3 column=mycf:w_county, timestamp=1482521720833, value=Williamson County
3 column=mycf:w_gmt_offset, timestamp=1482521720833, value=-5
3 column=mycf:w_state, timestamp=1482521720833, value=TN
3 column=mycf:w_street_name, timestamp=1482521720833, value=Ash Laurel
3 column=mycf:w_street_number, timestamp=1482521720833, value=534
3 column=mycf:w_street_type, timestamp=1482521720833, value=Dr.
3 column=mycf:w_suite_number, timestamp=1482521720833, value=Suite 0
3 column=mycf:w_warehouse_id, timestamp=1482521720833, value=AAAAAAAADAAAAAAA
3 column=mycf:w_warehouse_name, timestamp=1482521720833, value=Doors canno
3 column=mycf:w_warehouse_sq_ft, timestamp=1482521720833, value=294242
3 column=mycf:w_zip, timestamp=1482521720833, value=35709
4 column=mycf:w_city, timestamp=1482521720833, value=Fairview
4 column=mycf:w_country, timestamp=1482521720833, value=United States
4 column=mycf:w_county, timestamp=1482521720833, value=Williamson County
4 column=mycf:w_gmt_offset, timestamp=1482521720833, value=-5
4 column=mycf:w_state, timestamp=1482521720833, value=TN
4 column=mycf:w_street_name, timestamp=1482521720833, value=Wilson Elm
4 column=mycf:w_street_number, timestamp=1482521720833, value=368
4 column=mycf:w_street_type, timestamp=1482521720833, value=Drive
4 column=mycf:w_suite_number, timestamp=1482521720833, value=Suite 80
4 column=mycf:w_warehouse_id, timestamp=1482521720833, value=AAAAAAAAEAAAAAAA
4 column=mycf:w_warehouse_name, timestamp=1482521720833, value=Bad cards must make.
4 column=mycf:w_warehouse_sq_ft, timestamp=1482521720833, value=621234
4 column=mycf:w_zip, timestamp=1482521720833, value=35709
5 column=mycf:w_city, timestamp=1482521720833, value=Fairview
5 column=mycf:w_country, timestamp=1482521720833, value=United States
5 column=mycf:w_county, timestamp=1482521720833, value=Williamson County
5 column=mycf:w_gmt_offset, timestamp=1482521720833, value=
5 column=mycf:w_state, timestamp=1482521720833, value=TN
5 column=mycf:w_street_name, timestamp=1482521720833, value=
5 column=mycf:w_street_number, timestamp=1482521720833, value=
5 column=mycf:w_street_type, timestamp=1482521720833, value=
5 column=mycf:w_suite_number, timestamp=1482521720833, value=
5 column=mycf:w_warehouse_id, timestamp=1482521720833, value=AAAAAAAAFAAAAAAA
5 column=mycf:w_warehouse_name, timestamp=1482521720833, value=
5 column=mycf:w_warehouse_sq_ft, timestamp=1482521720833, value=
5 column=mycf:w_zip, timestamp=1482521720833, value=35709
5 row(s) in 0.3110 seconds
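Incidentally, the trailing commas can be stripped in bulk before loading; a hedged sketch, assuming GNU sed and the warehouse.csv discussed above:

# Hedged sketch: drop a trailing comma from every line of the csv before running ImportTsv.
sed -i 's/,$//' warehouse.csv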
02-07-2019
08:47 PM
@Timothy Spann.... do we not have a solution to parse/read XML without the Databricks package? I work on HDP 2.0+, Spark 2.1. I am trying to parse XML using PySpark code (manual parsing), but I am having difficulty when converting the list to a dataframe. Any advice? Let me know; I can post the script here. Thanks.
10-31-2016
03:11 PM
@rajdip chaudhuri - That's great. Can you please accept my answer above? I will look into this and file a bug if needed.
12-15-2017
06:02 AM
Did you solve the problem? I'm facing the same problem as you, and I tried all the methods mentioned above, but they didn't work. Here are my exceptions:

[root@cos1 ~]# sqoop list-tables --connect jdbc:mysql://192.168.2.190/experiment3 --username scott -P
17/12/15 19:17:36 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Enter password:
17/12/15 19:17:41 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
17/12/15 19:17:41 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
    at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:856)
    at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
    at org.apache.sqoop.manager.CatalogQueryManager.listTables(CatalogQueryManager.java:102)
    at org.apache.sqoop.tool.ListTablesTool.run(ListTablesTool.java:49)
    at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:236)

Here is my screenshot. I'm using sqoop-1.4.6, hadoop-2.7.4 and mysql-connector-java-5.1.38.jar. @rajdip chaudhuri @Artem Ervits