About punshi

punshi · ‎07-08-2019

Hello, I need to improve speed for my select query currently I am testing how best I can get from Impala with given hardware. I have created a parquet table with partitions and block size = 512mb. Here is some more details - [quickstart.cloudera:21000] > show table stats tbl_parq_512m; Query: show table stats tbl_parq_512m +-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ | source_ip | destination_ip | year | event_date | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location | +-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ | 10.0.0.1 | 20.0.0.1 | 1990 | 1/1/1990 | 4999950 | 1 | 337.76MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=10.0.0.1/destination_ip=20.0.0.1/year=1990/event_date=1%2F1%2F1990 | | 30.0.0.1 | 40.0.0.1 | 1993 | 1/1/1993 | 19999800 | 3 | 1.32GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=30.0.0.1/destination_ip=40.0.0.1/year=1993/event_date=1%2F1%2F1993 | | 41.0.0.1 | 45.0.0.1 | 1994 | 1/1/1994 | 19999800 | 3 | 1.32GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=41.0.0.1/destination_ip=45.0.0.1/year=1994/event_date=1%2F1%2F1994 | | 46.0.0.1 | 50.0.0.1 | 1995 | 1/1/1995 | 48999020 | 7 | 3.23GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=46.0.0.1/destination_ip=50.0.0.1/year=1995/event_date=1%2F1%2F1995 | | 51.0.0.1 | 55.0.0.1 | 1996 | 1/1/1996 | 49999000 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=51.0.0.1/destination_ip=55.0.0.1/year=1996/event_date=1%2F1%2F1996 | | 56.0.0.1 | 60.0.0.1 | 1997 | 1/1/1997 | 49999000 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=56.0.0.1/destination_ip=60.0.0.1/year=1997/event_date=1%2F1%2F1997 | | 61.0.0.1 | 70.0.0.1 | 1998 | 1/1/1998 | 99998000 | 14 | 6.59GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=61.0.0.1/destination_ip=70.0.0.1/year=1998/event_date=1%2F1%2F1998 | | 71.0.0.1 | 75.0.0.1 | 1999 | 1/1/1999 | 49999000 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=71.0.0.1/destination_ip=75.0.0.1/year=1999/event_date=1%2F1%2F1999 | | 80.0.0.1 | 85.0.0.1 | 2000 | 1/1/2000 | 49999000 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=80.0.0.1/destination_ip=85.0.0.1/year=2000/event_date=1%2F1%2F2000 | | 86.0.0.1 | 90.0.0.1 | 2001 | 1/1/2001 | 49999000 | 7 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=86.0.0.1/destination_ip=90.0.0.1/year=2001/event_date=1%2F1%2F2001 | | 91.0.0.1 | 95.0.0.1 | 2002 | 1/1/2002 | 82998340 | 11 | 5.47GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_512m/source_ip=91.0.0.1/destination_ip=95.0.0.1/year=2002/event_date=1%2F1%2F2002 | | Total | | | | 526989910 | 74 | 34.74GB | 0B | | | | | +-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+--------------------------------------------------------- When I run the below query , it takes 28 seconds to complete - [quickstart.cloudera:21000] > select source_ip,destination_ip,avg(volume),sum(duration) as avl from tbl_parq_512m group by source_ip,destination_ip order by avl limit 10; Query: select source_ip,destination_ip,avg(volume),sum(duration) as avl from tbl_parq_512m group by source_ip,destination_ip order by avl limit 10 Query submitted at: 2019-07-08 18:32:31 (Coordinator: http://quickstart.cloudera:25000) Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=88422271f161227b:fe860ef500000000 +-----------+----------------+-------------------+-----------+ | source_ip | destination_ip | avg(volume) | avl | +-----------+----------------+-------------------+-----------+ | 10.0.0.1 | 20.0.0.1 | 4.499997599976 | 22500415 | | 30.0.0.1 | 40.0.0.1 | 4.499744347443475 | 90003989 | | 41.0.0.1 | 45.0.0.1 | 4.498248032480324 | 90023641 | | 46.0.0.1 | 50.0.0.1 | 4.499559623845538 | 220487564 | | 86.0.0.1 | 90.0.0.1 | 4.500191183823676 | 224966377 | | 71.0.0.1 | 75.0.0.1 | 4.500268505370108 | 224977244 | | 80.0.0.1 | 85.0.0.1 | 4.499946078921578 | 225002743 | | 56.0.0.1 | 60.0.0.1 | 4.500426968539371 | 225003328 | | 51.0.0.1 | 55.0.0.1 | 4.500039240784815 | 225013849 | | 91.0.0.1 | 95.0.0.1 | 4.499926612990091 | 373535589 | +-----------+----------------+-------------------+-----------+ Fetched 10 row(s) in 27.17s I am trying to speed up the query so I tried going with gzip compression. [quickstart.cloudera:21000] > show table stats tbl_gzip_512m; Query: show table stats tbl_gzip_512m +-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ | source_ip | destination_ip | year | event_date | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location | +-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ | 10.0.0.1 | 20.0.0.1 | 1990 | 1/1/1990 | 5000950 | 1 | 231.85MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=10.0.0.1/destination_ip=20.0.0.1/year=1990/event_date=1%2F1%2F1990 | | 30.0.0.1 | 40.0.0.1 | 1993 | 1/1/1993 | 19999800 | 2 | 925.94MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=30.0.0.1/destination_ip=40.0.0.1/year=1993/event_date=1%2F1%2F1993 | | 41.0.0.1 | 45.0.0.1 | 1994 | 1/1/1994 | 19999800 | 2 | 925.94MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=41.0.0.1/destination_ip=45.0.0.1/year=1994/event_date=1%2F1%2F1994 | | 46.0.0.1 | 50.0.0.1 | 1995 | 1/1/1995 | 48999020 | 5 | 2.22GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=46.0.0.1/destination_ip=50.0.0.1/year=1995/event_date=1%2F1%2F1995 | | 51.0.0.1 | 55.0.0.1 | 1996 | 1/1/1996 | 49999000 | 5 | 2.26GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=51.0.0.1/destination_ip=55.0.0.1/year=1996/event_date=1%2F1%2F1996 | | 56.0.0.1 | 60.0.0.1 | 1997 | 1/1/1997 | 49999000 | 5 | 2.26GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=56.0.0.1/destination_ip=60.0.0.1/year=1997/event_date=1%2F1%2F1997 | | 61.0.0.1 | 70.0.0.1 | 1998 | 1/1/1998 | 99998000 | 19 | 4.53GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=61.0.0.1/destination_ip=70.0.0.1/year=1998/event_date=1%2F1%2F1998 | | 71.0.0.1 | 75.0.0.1 | 1999 | 1/1/1999 | 49999000 | 5 | 2.26GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=71.0.0.1/destination_ip=75.0.0.1/year=1999/event_date=1%2F1%2F1999 | | 80.0.0.1 | 85.0.0.1 | 2000 | 1/1/2000 | 49999000 | 10 | 2.26GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=80.0.0.1/destination_ip=85.0.0.1/year=2000/event_date=1%2F1%2F2000 | | 86.0.0.1 | 90.0.0.1 | 2001 | 1/1/2001 | 49999000 | 10 | 2.26GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=86.0.0.1/destination_ip=90.0.0.1/year=2001/event_date=1%2F1%2F2001 | | 91.0.0.1 | 95.0.0.1 | 2002 | 1/1/2002 | 82998340 | 16 | 3.76GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_gzip_512m/source_ip=91.0.0.1/destination_ip=95.0.0.1/year=2002/event_date=1%2F1%2F2002 | | Total | | | | 526990910 | 80 | 23.84GB | 0B | | | | | +-----------+----------------+------+------------+-----------+--------+----------+--------------+-------------------+---------+-------------------+--------------------------------------------------------- Size has dropped by 30% , it has reduced from 34gb to 24 gb. I was expecting similar behaviour with query performance as well hoping it will reduce some more time. Below is my query query- [quickstart.cloudera:21000] > select source_ip,destination_ip,avg(volume),sum(duration) as avl from tbl_gzip_512m group by source_ip,destination_ip order by avl limit 10; Query: select source_ip,destination_ip,avg(volume),sum(duration) as avl from tbl_gzip_512m group by source_ip,destination_ip order by avl limit 10 Query submitted at: 2019-07-08 18:39:09 (Coordinator: http://quickstart.cloudera:25000) Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=8d4bdddc207517ff:952aea9d00000000 +-----------+----------------+-------------------+-----------+ | source_ip | destination_ip | avg(volume) | avl | +-----------+----------------+-------------------+-----------+ | 10.0.0.1 | 20.0.0.1 | 4.50001219768244 | 22504983 | | 30.0.0.1 | 40.0.0.1 | 4.499744347443475 | 90003989 | | 41.0.0.1 | 45.0.0.1 | 4.498248032480324 | 90023641 | | 46.0.0.1 | 50.0.0.1 | 4.499559623845538 | 220487564 | | 86.0.0.1 | 90.0.0.1 | 4.500191183823676 | 224966377 | | 71.0.0.1 | 75.0.0.1 | 4.500268505370108 | 224977244 | | 80.0.0.1 | 85.0.0.1 | 4.499946078921578 | 225002743 | | 56.0.0.1 | 60.0.0.1 | 4.500426968539371 | 225003328 | | 51.0.0.1 | 55.0.0.1 | 4.500039240784815 | 225013849 | | 91.0.0.1 | 95.0.0.1 | 4.499926612990091 | 373535589 | +-----------+----------------+-------------------+-----------+ Fetched 10 row(s) in 27.67s As you can see there isn't any improvement moreover it looks little slow despite decrease in data by 1/3rd. So am I doing something wrong here when using gzip ? Thanks

punshi · ‎07-03-2019

Thanks @AcharkiMed I tried that there was no improvement however after enabling hyper threading I was able to reduce it to 25sec from 40. I tried my hands on HDFS cache however even after defining cache_pool size to 3gb only 1 gb data gets cached, Any idea ? Query: show table stats tbl_parq_123 +-------+-------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+ | year | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location | +-------+-------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+ | 1990 | -1 | 2 | 338.45MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1990 | | 1993 | -1 | 6 | 1.32GB | 0B | 1 | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1993 | | 1994 | -1 | 6 | 1.32GB | 1010.95MB | 1 | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1994 | | 1995 | -1 | 14 | 3.24GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1995 | | 1996 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1996 | | 1997 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1997 | | 1998 | -1 | 27 | 6.60GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1998 | | 1999 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=1999 | | 2000 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=2000 | | 2001 | -1 | 14 | 3.30GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=2001 | | 2002 | -1 | 23 | 5.48GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_parq_123/year=2002 | | Total | -1 | 148 | 34.79GB | 1010.95MB | | | | | +-------+-------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+ F [root@quickstart ~]# hdfs cacheadmin -listPools Found 1 result. NAME OWNER GROUP MODE LIMIT MAXTTL three_gig_pool impala hdfs rwxr-xr-x 3000000000 never Thanks

punshi · ‎07-01-2019

punshi · ‎06-26-2019

Hello, Finally I was able to add values into partition table after much reading into the concepts and trial & error. However the load file is still not working and also I came to know while doing that when we specify partition value into query it actually is treated as column. insert into tbl_raw_v12 partition(source_ip="192.168.1.10",destination_ip="172.16.8.177",year,event_date) select * from tbl_123; I managed to add 50k values, tbl_raw_v12 contained 29 columns and tbl_123 contains 27 columns and remaining two column values are added through insert query at runtime. I will try with more values tomorrow to check performance. Thanks

punshi · ‎06-26-2019

Hi @EricL I tried for very small table like you have showed and it worked but for even text files which contain roughly 50k values I was getting error. This method seems not suitable for POC purpose or production. I tried copying mere 50k records which I had loaded into other table and it threw memory error. [quickstart.cloudera:21000] > insert into tbl_raw_part_v11 partition(source_ip,destination_ip,year,event_date) select * from tbl_motadata; Query: insert into tbl_raw_part_v11 partition(source_ip,destination_ip,year,event_date) select * from tbl_motadata Query submitted at: 2019-06-26 18:43:21 (Coordinator: http://quickstart.cloudera:25000) Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=584e2c3facce84f0:a95b21f300000000 WARNINGS: Memory limit exceeded: Error occurred on backend quickstart.cloudera:22000 by fragment 584e2c3facce84f0:a95b21f300000000 Memory left in process limit: -4.25 GB Query(584e2c3facce84f0:a95b21f300000000): Reservation=0 ReservationLimit=2.40 GB OtherMemory=3.96 GB Total=3.96 GB Peak=3.96 GB Fragment 584e2c3facce84f0:a95b21f300000000: Reservation=0 OtherMemory=3.96 GB Total=3.96 GB Peak=3.96 GB HDFS_SCAN_NODE (id=0): Total=22.82 MB Peak=30.82 MB HdfsTableSink: Total=3.94 GB Peak=3.94 GB CodeGen: Total=6.60 KB Peak=690.00 KBProcess: memory limit exceeded. Limit=3.00 GB Total=7.25 GB Peak=7.40 GB Buffer Pool: Free Buffers: Total=0 Buffer Pool: Clean Pages: Total=0 Buffer Pool: Unused Reservation: Total=0 RequestPool=root.root: Total=3.96 GB Peak=4.12 GB Query(584e2c3facce84f0:a95b21f300000000): Reservation=0 ReservationLimit=2.40 GB OtherMemory=3.96 GB Total=3.96 GB Peak=3.96 GB RequestPool=fe-eval-exprs: Total=0 Peak=4.00 KB Untracked Memory: Total=3.29 GB WARNING: The following tables are missing relevant table and/or column statistics. default.tbl_motadata Memory limit exceeded: Error occurred on backend quickstart.cloudera:22000 by fragment 584e2c3facce84f0:a95b21f300000000 Memory left in process limit: -4.25 GB Query(584e2c3facce84f0:a95b21f300000000): Reservation=0 ReservationLimit=2.40 GB OtherMemory=3.96 GB Total=3.96 GB Peak=3.96 GB Fragment 584e2c3facce84f0:a95b21f300000000: Reservation=0 OtherMemory=3.96 GB Total=3.96 GB Peak=3.96 GB HDFS_SCAN_NODE (id=0): Total=22.82 MB Peak=30.82 MB HdfsTableSink: Total=3.94 GB Peak=3.94 GB CodeGen: Total=6.60 KB Peak=690.00 KBProcess: memory limit exceeded. Limit=3.00 GB Total=7.25 GB Peak=7.40 GB Buffer Pool: Free Buffers: Total=0 Buffer Pool: Clean Pages: Total=0 Buffer Pool: Unused Reservation: Total=0 RequestPool=root.root: Total=3.96 GB Peak=4.12 GB Query(584e2c3facce84f0:a95b21f300000000): Reservation=0 ReservationLimit=2.40 GB OtherMemory=3.96 GB Total=3.96 GB Peak=3.96 GB RequestPool=fe-eval-exprs: Total=0 Peak=4.00 KB Untracked Memory: Total=3.29 GB (1 of 2 similar) The same doesn't throw error for non-partitioned tables. Yesterday I copied some 500million records using "INSERT INTO table1 Select * from table2" on same machine though it took 4hours but the task was complete, but here I am getting error for copying mere 50k records. I am actually testing Impala performance for our production system needs however I believe we need to tune it extensivley for better performance, even Select queries are taking too much time ,it took 46 minutes for query to execute and same query I had executed in Clickhouse db which took only 2min from same hardware configuration. We where actually expecting to get result in few seconds as we thought Impala should be way faster then Clickhouse. I don't know why "Insert ...into [Partition] Select *" throws error but without partition it works fine. Is there any other method to check. I am actually checking against 500 million records. Thanks

punshi · ‎06-26-2019

punshi · ‎06-25-2019

Hi @EricL The dynamic part is not that important my concern is how should I upload the data into partitioned table using LOAD, I tried with regular method of directly using - [quickstart.cloudera:21000] > load data inpath '/user/impala/data/id.txt' into table tbl_id_part; But it didn't worked my text file id.txt contained both regular column and partitioned column data. So I tried with method by specifying partitioned value in query- load data inpath '/user/impala/data/id.txt' into table tbl_id_part partition(year=2010); But this is also not working. So how does the given query works? and on what kind of table, what is the structure of data file it needs ? LOAD DATA INPATH 'hdfs_file_or_directory_path' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] I have created a basic partitioned table using below query- create table tbl_id_part(id int) partitioned by (year int) row format delimited fields terminated by ','; I am able to use load into regular tables but with partitioned I am getting errors. What should my data file which I will load into this table look like ? What should be my load statement so that I am able to load data into partitioned table. Edit- I have also tried following hive example but I am getting same error mentioned. https://bigdataprogrammers.com/load-files-into-hive-partitioned-table/ Thanks

punshi · ‎06-25-2019

Hello, I am trying to add data into my table using load data method, however I was getting error so to ease the complexity and understand the concept better I reduced my table however still I am getting error. I have tried two methods both throw same error. There isn't much example available- My create query- create table tbl_id_part(id int) partitioned by (year int) row format delimited fields terminated by ','; Now I tried two methods here - 1. I created a text file id.txt with data 2,2009 10,2009 15,2010 34,2010 My LOAD query- [quickstart.cloudera:21000] > load data inpath '/user/impala/data/id.txt' into table tbl_id_part partition(year=2010); When I executed above query I got- Query: load data inpath '/user/impala/data/id.txt' into table tbl_id_part partition(year=2010) ERROR: AnalysisException: Partition spec does not exist: (year=2010). I thought it must be throwing error id.txt contains year data so I tried with different id1.txt file- 5 10 20 30 My load query- [quickstart.cloudera:21000] > load data inpath '/user/impala/data/id1.txt' into table tbl_id_part partition(year=2010); I am getting same error- Query: load data inpath '/user/impala/data/id1.txt' into table tbl_id_part partition(year=2010) ERROR: AnalysisException: Partition spec does not exist: (year=2010). My question is how should my data text file look like, the partioned column values should be inside file or not. I am pretty much confuse , please guide. Thanks

punshi · ‎06-24-2019

I will be getting data from network via snmp or serial. Loading data from file is very fast , I did it manually first to check and it worked however when I try to do the same with my java code I get different errors. I am trying few steps but getting error -- 1. I wrote 1 miilion records into text file through java. 2. I tried to upload the generate text file into HDFS through my java code. 3. I am executing LOAD query through code. However I am unable to execute step 2- 1. I tried java method of ProcessBuilder(), getRuntime.exec() but it didn't worked. Then I tried searching for some API to do following task but I was getting error- java.io.IOException: No FileSystem for scheme: file at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169) at HdfsWriter.run(impala_crt.java:935) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84) at impala_crt.writeFile(impala_crt.java:861) at impala_crt.main(impala_crt.java:106) Do you have any solution for this or there is some better method ?

punshi · ‎06-20-2019

Hi @Andrew_Sherman I have accepted this answer but yet one thing bothers me addbatch() is supported so what is the use of addbatch() if I cannot execute it. Thanks

Online	Offline
Last Visited	‎07-23-2019 08:17 AM

Member Since	‎06-17-2019 12:17 AM
Last Visited	‎07-23-2019 08:17 AM
Posts	23

Cloudera Community

No change in overall performance when used gzip.

Re: What performance to expect from Cloudera VM ?

What performance to expect from Cloudera VM ?

Re: Load Data throws error for partitioned table.

Re: Load Data throws error for partitioned table.

Re: Load Data throws error for partitioned table.

Re: Load Data throws error for partitioned table.

Load Data throws error for partitioned table.

Re: Insert into works very slow in Impala.

Re: Impala JDBC executeBatch() doesn't work for St...