Member since: 01-11-2016
Posts: 355
Kudos Received: 230
Solutions: 74
My Accepted Solutions
Views | Posted
---|---
8191 | 06-19-2018 08:52 AM
3147 | 06-13-2018 07:54 AM
3574 | 06-02-2018 06:27 PM
3879 | 05-01-2018 12:28 PM
5401 | 04-24-2018 11:38 AM
05-10-2016
04:26 PM
@manikandan ayyasamy Ok, from reading your several posts and comments, I think Hive is not able to access your schema on HDFS. I see that you are trying to point to the schema stored on HDFS; however, you are using http and port 8888, which won't work here. Can you try with this modification:
TBLPROPERTIES ('avro.schema.url'='hdfs://trvlhaddv2gw1.tsh.thomson.com:8020/PATH_TO_YOUR_AVSC_FILE');
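For reference, a hypothetical full DDL for an Avro-backed table could look like the sketch below (the table name, location, and .avsc path are placeholders; the columns are derived from the Avro schema):
CREATE EXTERNAL TABLE my_avro_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/PATH_TO_YOUR_AVRO_DATA'
TBLPROPERTIES ('avro.schema.url'='hdfs://trvlhaddv2gw1.tsh.thomson.com:8020/PATH_TO_YOUR_AVSC_FILE');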
05-10-2016
03:33 PM
@vamsi valiveti The jar cited in the doc should already be on your cluster since it ships with Hive by default. I just checked in my HDP sandbox and it's present here:
/usr/hdp/current/hive-webhcat/share/hcatalog/hive-hcatalog-core.jar
I also tested it and everything works on my side:
0: jdbc:hive2://sandbox.hortonworks.com:10000> CREATE TABLE my_table(a string, b bigint)
0: jdbc:hive2://sandbox.hortonworks.com:10000> ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
No rows affected (0.841 seconds)
0: jdbc:hive2://sandbox.hortonworks.com:10000> insert into table my_table values ('row1',1);
INFO : Session is already open
INFO : Dag name: insert into table my_table values ('row1',1)(Stage-1)
INFO : Tez session was closed. Reopening...
INFO : Session re-established.
INFO :
INFO : Status: Running (Executing on YARN cluster with App id application_1462551928453_0043)
INFO : Map 1: -/-
INFO : Map 1: 0/1
INFO : Map 1: 0(+1)/1
INFO : Map 1: 1/1
INFO : Loading data to table default.my_table from hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/my_table/.hive-staging_hive_2016-05-10_15-26-38_253_2589783258292691814-2/-ext-10000
INFO : Table default.my_table stats: [numFiles=1, numRows=1, totalSize=19, rawDataSize=0]
No rows affected (9.278 seconds)
0: jdbc:hive2://sandbox.hortonworks.com:10000>
As you can see, the data gets created in HDFS:
[root@sandbox tmp]# hdfs dfs -ls /apps/hive/warehouse/my_table
Found 1 items
-rwxrwxrwx 3 hive hdfs 19 2016-05-10 15:26 /apps/hive/warehouse/my_table/000000_0
[root@sandbox tmp]# hdfs dfs -cat /apps/hive/warehouse/my_table/000000_0
{"a":"row1","b":1}
And Hive has the data:
0: jdbc:hive2://sandbox.hortonworks.com:10000> select * from my_table;
+-------------+-------------+--+
| my_table.a | my_table.b |
+-------------+-------------+--+
| row1 | 1 |
+-------------+-------------+--+
1 row selected (0.847 seconds)
For LOAD vs INSERT, as I said, use LOAD if you already have data in a file (local or HDFS) that you want to add to your Hive table. If you want to use INSERT ... VALUES, you need to read about ACID support in Hive and its limitations; Hive is not a "standard" database. This is out of the scope of this question, so I won't spend too much time on it. Just look at the note from the doc: INSERT...VALUES is available starting in Hive 0.14. Inserting values from SQL statements can only be performed on tables that support ACID. See Hive Transactions for details. Does this answer your question?
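For reference, here is a minimal sketch of an ACID-capable table in Hive 0.14+ (the table and column names are hypothetical, and transactions must also be enabled in the Hive configuration):
-- ACID tables must be bucketed and stored as ORC
CREATE TABLE my_acid_table (id int, name string)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
-- INSERT ... VALUES then works on this table
INSERT INTO TABLE my_acid_table VALUES (1, 'row1');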
05-10-2016
01:56 PM
What's the problem you are facing if you didn't get any error messages? Can you see the table tweets when you do:
show tables;
05-10-2016
11:56 AM
Hi @vamsi valiveti
1) You can start with the JSON SerDe that's cited in the Hive documentation: https://cwiki.apache.org/confluence/display/Hive/Json+SerDe
CREATE TABLE my_table(a string, b bigint, ...)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
2) Load and insert are used in different scenarios.
Load is used to copy (or move) data from a file into your Hive table. This is used when you already have data in a file in HDFS or your local file system and you want to create a table to process it with SQL. Here you start by creating a Hive table and you do a load to get the data from the files into your Hive table:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
If your data is in your local file system, use LOCAL. In this case the data will be copied from your local file system into the HDFS directory of your Hive table (by default in the Hive warehouse folder). If your data is already in HDFS, do not use LOCAL, but note that this moves (and does not copy) your HDFS files from their current path to the HDFS directory of your Hive table. You also have other options such as OVERWRITE to decide whether you want to delete existing data in your Hive table or append to it.
Insert is used if you want to copy data from one Hive table to another. Let's say you have table 1 with data in it and you want to create table 2 holding a specific part of the table 1 data. You can combine an insert and a select statement to get the data you want from table 1 and store it in table 2:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
Here again you have the choice between appending new data or replacing everything (OVERWRITE). A concrete sketch follows below.
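To make this concrete, here is a hypothetical sketch (the file paths and table names are placeholders):
-- Copy a local file into the table's directory in the Hive warehouse
LOAD DATA LOCAL INPATH '/tmp/my_data.json' INTO TABLE my_table;
-- Move a file that is already in HDFS into the table's directory
LOAD DATA INPATH '/user/me/my_data.json' INTO TABLE my_table;
-- Copy a subset of table1 into table2 (table2 must already exist)
INSERT INTO TABLE table2 SELECT a, b FROM table1 WHERE b > 100;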
05-10-2016
09:24 AM
Hi @manikandan ayyasamy, Have you looked at the Hive doc on AvroSerDe: https://cwiki.apache.org/confluence/display/Hive/AvroSerDe ? It has examples of how to do it depending on your Hive version. Please try to implement it and ask the community if you run into a particular problem with your approach. Note also that ORC can give the best performance with cost-based optimization and vectorization; give it a try if that is a viable option for you (a quick sketch is below).
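If you decide to test the ORC option, a minimal hypothetical sketch would be something like this (the table names are placeholders; the SET statements enable vectorization and the cost-based optimizer for the session):
SET hive.vectorized.execution.enabled=true;
SET hive.cbo.enable=true;
-- Rewrite the existing table into ORC format
CREATE TABLE my_table_orc STORED AS ORC AS SELECT * FROM my_table;
-- Column statistics feed the cost-based optimizer
ANALYZE TABLE my_table_orc COMPUTE STATISTICS FOR COLUMNS;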
05-09-2016
10:31 PM
Hi @Michel Sumbul There are several details around Spark memory management, especially since Spark 1.6. But to answer your question simply: if the dataset does not fit in memory, Spark spills to disk. This behavior is clearly necessary for Spark to work with large datasets, but it impacts performance. You can get an idea of this in Matei Zaharia's original Spark design dissertation (section 2.6.4, Behavior with Insufficient Memory).
05-08-2016
08:27 AM
4 Kudos
Hi @Raj B What's the nature of the data source? Do you need the data in NiFi only, or in both NiFi and the remote streaming server? The ListenTCP processor listens on a particular port of the local host (listening on a remote host is not possible); this is the case for "Listen" processors in general, while "Get" processors are used to get data from remote systems. To get this data into NiFi on your laptop I see mainly two choices:
1) Change the source's behavior so that it sends the data to NiFi. If you need the data on the other server too, do a Y with NiFi and duplicate the flow; NiFi will forward the data to the other server.
2) Put a NiFi instance on your other server, and ListenTCP will be able to get the data there. You can then forward it to other NiFi servers with a Remote Process Group, as well as to the local machine on a different port.
The first scenario is better if you can change the source.
05-07-2016
08:50 PM
1 Kudo
Hi @Subhasis Roy, SUBSTRING expects int parameters while SIZE returns a long. Try casting the size to int like this:
D = FOREACH C GENERATE SUBSTRING(wage, 1, (int)SIZE(wage)) as wage_new;
05-07-2016
07:34 PM
1 Kudo
Hi @simran kaur,
To import several tables you can write a script and call it with Oozie. In the script you put your table names in a variable and iterate over them:
#!/bin/bash
#
t='Table1 Table2 Table3'
for i in $t;
do
  echo "Start import of table $i" ;
  sqoop import --connect jdbc:******* (the rest of the parameters)
  echo "End import of table $i" ;
done
The import will be sequential, and that is not a bad thing: Sqoop imports can be hard on the database if you have a lot of data and a high degree of parallelism.
To do an incremental import, you need a column in your table that can drive it. There are mainly two options:
1) Your table contains a timestamp column holding the last-modification time of each row. You can use last-modified mode (--incremental lastmodified): rows having a timestamp value greater than --last-value are imported. Use --check-column to specify which column contains the timestamps.
2) The IDs in your table are always incrementing. Here you can use append mode (--incremental append): at each import Sqoop saves the max ID retrieved from your table, and at the next import it gets only rows with an ID greater than the last saved max value (--last-value). Use --check-column to specify which column contains the incrementing ID. Note: this mode assumes each new row is an append and that you don't modify existing rows. Append mode imports only new rows, not modified ones, so if a row is modified after it is imported (for example a value changing from Ohio to Oh), you won't get the new value at the next import.
To have several import frequencies, you can use several Oozie workflows with different parameters.
A hypothetical append-mode command is sketched below. Hope this helps.
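As an illustration, a hypothetical append-mode command could look like this (connection string, credentials, table and column names are placeholders):
sqoop import \
  --connect jdbc:mysql://dbhost:3306/mydb \
  --username myuser -P \
  --table my_table \
  --incremental append \
  --check-column id \
  --last-value 1000
# For last-modified mode: --incremental lastmodified --check-column last_update_ts --last-value '2016-05-01 00:00:00'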