Member since: 06-28-2017
Posts: 279
Kudos Received: 43
Solutions: 24
My Accepted Solutions
Title | Views | Posted
---|---|---
| 1132 | 12-24-2018 08:34 AM
| 3583 | 12-24-2018 08:21 AM
| 1166 | 08-23-2018 07:09 AM
| 5701 | 08-21-2018 05:50 PM
| 3040 | 08-20-2018 10:59 AM
12-26-2018
08:16 AM
Hi, I can't guarantee this, but according to the documentation the global index in Phoenix is intended for 'heavy read' usage. So if that fits your workload, your query should end up using the secondary index (of course only when the query goes through Phoenix). Regards Harald
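Just as an illustration (table, column and index names below are placeholders, not from your setup), a global index is created like this, and EXPLAIN tells you whether the optimizer actually uses it:
-- hypothetical table/column names, for illustration only
CREATE INDEX idx_orders_customer ON orders (customer_id) INCLUDE (order_date, total);
EXPLAIN SELECT order_date, total FROM orders WHERE customer_id = 42;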
... View more
12-26-2018
07:53 AM
Hi @Armanur Rahman, from the error message this looks like a simple syntax error. I think this is the statement causing it: CREATE TABLE IF NOT EXISTS `hr.f_company_xml_tab` ( `RECID` STRING, `XMLTYPE.GETSTRINGVAL(XMLRECORD)` STRING) Your second column is being named 'XMLTYPE.GETSTRINGVAL(XMLRECORD)', which contains a '(' just as the error message claims. Can you rename the column to a simpler name, e.g. 'val', and try again? Regards Harald
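For illustration, with a simpler column name (the name 'val' is just a placeholder, pick whatever fits your data) the statement would look like this:
CREATE TABLE IF NOT EXISTS `hr.f_company_xml_tab` (
  `RECID` STRING,
  `val` STRING);  -- 'val' is a placeholder column name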
... View more
12-26-2018
07:29 AM
Hi, so the Spark jobs have not finished during the few hours? Have they all been hanging? Or did they finish, although NoRouteToHost errors were logged? And you shut down a full physical server, which means you 'lose' 4 of the 8 VMs at the same time, not just one VM? To make sure you don't lose data in this event (it is a 50% loss), you will need to make your cluster rack aware, so that the replicas are guaranteed to be created in both racks (set the rack correctly in your host overview in Ambari, don't leave it at 'default-rack'). Otherwise, with the default replication factor of 3, the cluster is only guaranteed to continue without data loss when at most 2 data nodes are lost. Losing 4 machines can leave you with data the cluster can't recover, since some files/blocks might be located entirely on the machines being shut down. Regards Harald
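As a quick check after assigning the racks in Ambari, you can verify what HDFS itself sees (a sketch; run it with an HDFS superuser, and the path below is a placeholder):
hdfs dfsadmin -printTopology
hdfs fsck /data/important -files -blocks -racks   # shows on which racks the block replicas live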
... View more
12-24-2018
11:39 AM
Hi Ken, the filesystem /var/lib/docker/tmp is located on the Docker host, so depending on how you installed it, that is your AWS CentOS instance or maybe the Sandbox VM itself, where Docker runs. So check the size of the filesystem there (df -h /var/lib/docker/tmp). If you run the Sandbox, there might even be restrictions imposed by the VirtualBox VM executing the Sandbox. Regards Harald
... View more
12-24-2018
11:33 AM
Hi @Abhimanyu Dasarwar, where does it show your memory to be 7.7 GB? Regards Harald
... View more
12-24-2018
11:26 AM
Hi @Pat ODowd, what exactly do you mean by the IP address becoming invalid? Will another machine get that IP address? And does the machine get a new IP address after a restart? 'No route to host' is what IP is supposed to report when the host with that IP disappears from the network. In any case, taking the machine down will have an impact on currently running jobs, which is different from shutting down the service in the Ambari UI, where running jobs finish and new jobs are handled by other nodes. Bringing down the machine will interrupt currently active jobs, though they are supposed to recover after the TCP timeout. The cluster control also needs to become aware of the 'lost' node before letting other nodes handle the jobs. UDP communication is simply lost, while TCP will retransmit a few times before declaring the connection broken. This takes some time (depending on the settings, typically between 30 and 90 seconds). Determining that a port isn't listening on an active machine while trying to connect is much faster than determining that an established TCP connection is broken, due to the specifics of TCP. So I guess the main point is: why don't your jobs recover without intervention? How long did you wait? Regards Harald
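If you want to inspect the kernel settings involved on your nodes (a sketch; the exact timeout behaviour also depends on the application's own socket options):
sysctl net.ipv4.tcp_syn_retries   # SYN retransmissions before a new connection attempt fails
sysctl net.ipv4.tcp_retries2      # retransmissions on an established connection before it is declared broken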
... View more
12-24-2018
09:51 AM
Hi @Shesh Kumar, I don't think you can configure this via Ambari (though I'm not completely sure). What you could do is either set up replication in HBase or simply configure both as one cluster and make it rack aware, to ensure the data is replicated into both racks (i.e. both locations). Regards Harald
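If you go the HBase replication route, the setup is done in the HBase shell on the source cluster, roughly like this (a sketch; the ZooKeeper quorum, table and column family names are placeholders, and replication must be enabled on both clusters):
add_peer '1', CLUSTER_KEY => "zk1.backup.example.com,zk2.backup.example.com,zk3.backup.example.com:2181:/hbase"
alter 'my_table', {NAME => 'cf', REPLICATION_SCOPE => '1'}   # enables replication for this column family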
... View more
12-24-2018
09:48 AM
Hi @Praveen Kumar, this is a quite generic question, so a precise answer is difficult. But given that you want to get the data from an RDBMS, I think you can go with one cluster. What you need to consider is how much throughput you will have to handle; typically the RDBMS will limit this anyway. Just for consideration: Spark and Flink are mainly RAM intensive, while HBase uses HDD and RAM, depending on the load. Hive again mainly uses HDD for MapReduce. But if you plan to create an external Hive table pointing to HBase, you are back in the HBase usage pattern. Assuming you have sufficient RAM available on your nodes (I would go with >= 2 GB per CPU core), I think one cluster for everything would do. If the load increases later, you can scale the cluster out, which is one of the big advantages of Hadoop. Regards Harald
... View more
12-24-2018
09:37 AM
Hi @A Sabatino, thanks for the info. Would be great if you click on 'accept' for the answer. Helps everyone to see the issue is resolved and provides you and me with a reward in terms of reputation points 🙂 Regards Harald
... View more
12-24-2018
08:34 AM
1 Kudo
Hi @hr pyo This really depends, and you will have to understand authentication with SSL to get all the details. In short: if you use self-signed certificates, or you sign the certificates with your own CA, you will get browser warnings about an insecure connection. This means the user has to confirm each time that he wants to continue, until you install either the server's certificate or the CA certificate into the browser. However, there are preinstalled root CAs in every browser. So if you get your certificate signed by one of those root CAs, you don't have to install the certificate itself; due to the chain of trust the browser accepts the signed certificate without further steps. To get a free-of-charge signed certificate you can use Let's Encrypt. At the enterprise level you usually have an enterprise CA whose certificate is installed on all enterprise machines, and you let your certificate be signed by that enterprise CA. Regards Harald
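With Let's Encrypt, for example, the certbot client can request a signed certificate roughly like this (a sketch; the domain is a placeholder and the host must be reachable from the internet on port 80):
certbot certonly --standalone -d ambari.example.com   # example.com domain is a placeholder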
... View more
12-24-2018
08:21 AM
Hi @A Sabatino, I am not sure why you expect that date to result from your epoch value. From what I can see, your value is not what you expect; the conversion itself is fine. The expression language documentation (https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#dates) describes that the value is interpreted as milliseconds since January 1, 1970 00:00:00 GMT. Interpreting '1 545 266 262' as milliseconds results in roughly 17.8 days, so a time on January 18, 1970 is the correct result. It looks as if you lost a factor of 1000 somewhere in your epoch value. Regards Harald
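As an illustration, if your value is in seconds, multiplying by 1000 before formatting should give the date you expect (a sketch; 'epoch' is a placeholder attribute name):
${epoch:toNumber():multiply(1000):format("yyyy-MM-dd HH:mm:ss", "GMT")}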
... View more
12-23-2018
01:25 PM
Not entirely sure, but can you try the hdfs command instead? It should be configured to include the necessary jars for the execution: hdfs dfs -copyFromLocal trial.txt hdfs://sandbox-hdp.hortonworks.com:8020/tmp/
... View more
12-23-2018
01:07 PM
On your dev server, do you have any Hive table defined that you can query? What actually happens when you query the table in Hive?
... View more
12-23-2018
01:01 PM
To upload the file from your Windows machine to a Linux machine, you can use a tool like WinSCP. You configure the session for the Linux machine almost identically to the configuration in PuTTY, and it gives you a GUI to copy files. If, on the other hand, you need to access the Windows machine from Linux, you need to configure an FTP or, better, SFTP server on Windows that allows access to your NTFS path. Or you use Windows network shares and install Samba, an implementation of Windows networking, on the Linux machine.
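If you prefer the command line over the WinSCP GUI, the pscp tool that ships with PuTTY does the same job (a sketch; the local path, user and host are placeholders for your environment):
pscp C:\data\trial.txt myuser@sandbox-hdp.hortonworks.com:/tmp/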
... View more
12-22-2018
06:50 PM
I am guessing a little here, but I believe it's possible that the Hive metastore has statistics (i.e. information on the number of records in the partitions), so the count might actually not read the complete table. The count on the file must read the file in any case. But still, I think 12 minutes is really long for processing 3.8 GB, even if that is the compressed size. Is the count the very first action on the data frame? In that case Spark only executes all the previous statements (I guess reading the file, uncompressing it, etc.) when running the count.
... View more
12-22-2018
08:03 AM
Hi Rajeswaran, I guess you are just using Ambari? Or have you implemented some of your own Python code anywhere? Can you perhaps post some details on what action you are trying to execute? Regards Harald
... View more
12-22-2018
07:58 AM
Hi Ajay, here is a sizing guide, which seems to address exactly your questions: https://community.hortonworks.com/articles/135337/nifi-sizing-guide-deployment-best-practices.html Still, I personally wouldn't start with 8 GB RAM per node but with at least 16 GB (2 GB per core). In any case you will have to be clear about the throughput needed (GB/sec), not only the overall volume. Regards Harald
... View more
12-22-2018
07:46 AM
Hi Sindhu, do you get any error message when it fails? Regards Harald
... View more
12-22-2018
07:34 AM
Can you perhaps also let us know how you try to read the file and the Hive table? Also, where is the file stored?
... View more
08-27-2018
08:39 AM
1 Kudo
It's described at the link, but it's just a few steps (actually for a test setup):
docker pull registry
docker run -d -p 5000:5000 --restart always --name registry registry:2
Now on the machine where you executed the above commands, a Docker registry is available at port 5000. The parameters have this meaning:
-d: run the container in the background and print the container ID
-p: publish a container's port(s) to the host
--restart: restart policy to apply when a container exits
--name: assign a name to the container
For further options refer to: https://docs.docker.com/engine/reference/commandline/run/ For information on how to set it up for real use (not just testing or demonstration): https://docs.docker.com/registry/deploying/ The last link provides you with important information, like how to set up keys and use it in an orchestrated environment.
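To actually use the registry, you tag an image with the registry's host and port and push it (a sketch; 'myimage' is a placeholder for one of your local images):
docker tag myimage localhost:5000/myimage   # 'myimage' is a placeholder
docker push localhost:5000/myimage
docker pull localhost:5000/myimage          # works from any host that can reach port 5000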
... View more
08-25-2018
07:05 AM
It really depends on how you mapped the primary key during the migration to HBase. If you mapped it into a column PK within the column family, you will not see the PK column as in the original table (e.g. Oracle). I suggest you try it as below, which maps the HBase rowkey into the column "rowkey" in your Phoenix view and also maps the column PK from the column family in case it exists. CREATE VIEW "FNL.ADDRESSES" (
rowkey VARCHAR PRIMARY KEY,
"cf".HJMPTS DECIMAL(20,0),
"cf".CREATEDTS TIME,
"cf".MODIFIEDTS TIME,
"cf".TYPEPKSTRING DECIMAL(20,0),
"cf".OWNERPKSTRING DECIMAL(20,0),
"cf".PK DECIMAL(20,0),
"cf".P_ORIGINAL DECIMAL(20,0),
"cf".P_DUPLICATE DECIMAL(1,0),
"cf".P_APPARTMENT VARCHAR(255),
"cf".P_BUILDING VARCHAR(255),
"cf".P_CELLPHONE VARCHAR(255),
"cf".P_COMPANY VARCHAR(255),
"cf".P_COUNTRY DECIMAL(20,0),
"cf".P_DEPARTMENT VARCHAR(255),
"cf".P_DISTRICT VARCHAR(255),
"cf".P_EMAIL VARCHAR(255),
"cf".P_FAX VARCHAR(255),
"cf".P_FIRSTNAME VARCHAR(255),
"cf".P_LASTNAME VARCHAR(255),
"cf".P_MIDDLENAME VARCHAR(255),
"cf".P_MIDDLENAME2 VARCHAR(255),
"cf".P_PHONE1 VARCHAR(255),
"cf".P_PHONE2 VARCHAR(255),
"cf".P_POBOX VARCHAR(255),
"cf".P_POSTALCODE VARCHAR(255),
"cf".P_REGION DECIMAL(20,0),
"cf".P_STREETNAME VARCHAR(255),
"cf".P_STREETNUMBER VARCHAR(255),
"cf".P_TITLE DECIMAL(20,0),
"cf".P_TOWN VARCHAR(255),
"cf".P_GENDER DECIMAL(20,0),
"cf".P_DATEOFBIRTH TIME,
"cf".P_REMARKS VARCHAR(255),
"cf".P_URL VARCHAR(255),
"cf".P_SHIPPINGADDRESS DECIMAL(1,0),
"cf".P_UNLOADINGADDRESS DECIMAL(1,0),
"cf".P_BILLINGADDRESS DECIMAL(1,0),
"cf".P_CONTACTADDRESS DECIMAL(1,0),
"cf".P_VISIBLEINADDRESSBOOK DECIMAL(1,0),
"cf".P_STATE VARCHAR(255),
"cf".P_LANDMARK VARCHAR(255),
"cf".P_CODELIGIBLE DECIMAL(1,0),
"cf".ACLTS DECIMAL(20,0),
"cf".PROPTS DECIMAL(20,0),
"cf".P_ISHOMEADDRESS DECIMAL(1,0)
);
... View more
08-24-2018
10:41 AM
Have you set up SSL or Kerberos as the security for your Kafka broker? In that case, have a look here: https://community.hortonworks.com/content/supportkb/150148/errorwarn-bootstrap-broker-6668-disconnected-orgap.html
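If the broker is kerberized, the client needs matching security settings, roughly like this in the client properties (a sketch assuming SASL/Kerberos without SSL and the default service name; adjust to your setup):
security.protocol=SASL_PLAINTEXT
sasl.kerberos.service.name=kafka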
... View more
08-24-2018
10:36 AM
1 Kudo
I think you will not be able to do this with the SerDe alone. The SerDe reads record by record, which is line by line, and the regex is then applied to the record that has been read, making it impossible to span the pattern over multiple lines. One way to solve it is to create a table with the SerDe using one line per record and then combine the multiple lines via a query in Hive. Another way would be to preprocess the input file outside Hive and write each combined record out as a single line, as needed for your table.
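For the first approach, a table using the built-in RegexSerDe with one line per record could look roughly like this (a sketch; table name, columns, regex and location are placeholders for your actual format):
CREATE EXTERNAL TABLE raw_lines (        -- placeholder table and column names
  ts STRING,
  message STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "^(\\S+)\\s+(.*)$")   -- placeholder pattern
STORED AS TEXTFILE
LOCATION '/data/raw_lines';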
... View more
08-23-2018
01:25 PM
Can you do a klist in the shell with the same user that starts the HBase master (and post the result here)? I am a little curious that the TGT expiry timestamp is exactly the time of your start. Are you using a ticket cache or a keytab for Kerberos? And which principal is configured for HBase?
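For reference, this is roughly what to run as the HBase service user (a sketch; the keytab path and principal are placeholders for your environment):
klist                                                     # shows the cached tickets and their expiry times
klist -kt /etc/security/keytabs/hbase.service.keytab      # placeholder path; lists the principals in the keytab
kinit -kt /etc/security/keytabs/hbase.service.keytab hbase/$(hostname -f)@EXAMPLE.COM   # placeholder realm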
... View more
08-23-2018
10:58 AM
Would be great if you 'accept' the answer if you consider it helpful.
... View more
08-23-2018
10:54 AM
But you still have a column named PK in the column family, I guess (I remember that table was originally migrated from Oracle to HBase)? In that case remove the "primary key" from that column definition. The actual primary key must be outside any column family.
... View more
08-23-2018
07:09 AM
1 Kudo
There might be an application already storing the data in HBase while other people would like to query this data in an SQL manner, or want to combine it with data from other Hive tables. It is also possible that the amount of data being inserted or updated is an argument for using HBase. In principle, HBase has features to handle high volumes of data pretty fast with memory-based processing, while Hive itself is an SQL layer using other storage engines, with the data ending up one way or another in HDFS (or whatever your storage system is). HBase also uses HDFS as the persistence layer, but inserted data is available for queries even before the write to disk takes place. So a typical use case is that data is inserted and updated online in HBase, while someone needs to combine that data with other data in SQL queries. I think it is much less common to insert and update HBase tables only via Hive, but the reasons can vary a lot anyway, e.g. the policies of the ops team, the know-how of the people involved, a cluster having evolved using different tools, established dev or ops procedures, etc.
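For the "combine it with other Hive tables" case, the mapping is done with the HBase storage handler, roughly like this (a sketch; table, column family and column names are placeholders):
CREATE EXTERNAL TABLE hbase_orders (     -- placeholder names throughout
  rowkey STRING,
  amount STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:amount")
TBLPROPERTIES ("hbase.table.name" = "orders");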
... View more
08-23-2018
06:57 AM
An HBase table has only one rowkey, valid for all column families. It is the only primary key, and it can't belong to any column family. So we need to be clear about what you mean by 'I have a column name "PK" which is a primary key belonging to column family "cf"'. You might have a column PK in the column family cf, but certainly not as the rowkey (primary key) in HBase. Within a column family, every column is optional. Can you provide a table description of the HBase table you are trying to map, and the Phoenix view statement?
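You can get the HBase side directly from the shell (a sketch; the table name is taken from your earlier view statement and may differ in your setup):
echo "describe 'FNL.ADDRESSES'" | hbase shell            # shows the column families
echo "scan 'FNL.ADDRESSES', {LIMIT => 1}" | hbase shell  # shows one sample row with its rowkey and columns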
... View more
08-23-2018
06:43 AM
How do you initialize/load the Azure libs? Maybe you also need to configure the working directory in your processor? Do you have any error log? Anyway, it might be a good idea to close this issue (by accepting one answer) and create a new post for the next issue you are facing (your script does not seem to connect correctly).
... View more
08-23-2018
06:38 AM
Would be appreciated if you click on 'accept' for the answer. This lets everyone know the issue is resolved, and it is rewarded by the platform.
... View more