About hsfelix

hsfelix · ‎01-22-2018

Used impyla. Works like a charm 🙂

hsfelix · ‎01-22-2018

It was actually a problem in the Twitter JSON. When we get a tweet which is actually a retweet, Flume truncates it. Problem solved 🙂
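For anyone hitting the same thing, here is a minimal sketch of one way to work around it in Python, assuming the tweet JSON has the same layout as the tweets table on this page: a retweet carries the complete original text in retweeted_status.text, while the top-level text is the cut-off "RT @..." version. The function name and the sample tweets are made up for illustration; this is not necessarily how the problem was fixed in the thread.

```python
def full_tweet_text(tweet):
    """Return the untruncated text of a tweet parsed from Twitter JSON.

    Assumption: for a retweet, retweeted_status.text holds the complete
    original tweet, while the top-level text may be cut off.
    """
    original = tweet.get("retweeted_status")
    if original and original.get("text"):
        return original["text"]
    return tweet.get("text", "")


# Hypothetical sample records, shaped like the fields of the tweets table.
retweet = {
    "text": "RT @someone: a long tweet that gets cut off he\u2026",
    "retweeted_status": {"text": "a long tweet that gets cut off here"},
}
plain = {"text": "just a regular tweet"}

print(full_tweet_text(retweet))
print(full_tweet_text(plain))
```

The same fallback can be expressed in a Hive query by preferring retweeted_status.text over text when the former is not null.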

hsfelix · ‎01-12-2018

@PY Paul-Arnaud, Many thanks for your quick answer. Unfortunately that's not a datatype that HiveQL recognizes... 😞

hsfelix · ‎01-12-2018

Hello friends, I'm working with a Hive table which fetches Twitter data from Flume / Oozie. The problem is that Hive is truncating the tweet text field... Can anybody please help me solve this issue? Here's the table:

CREATE EXTERNAL TABLE tweets (
  id bigint,
  created_at string,
  source STRING,
  favorited BOOLEAN,
  retweeted_status STRUCT<
    text:STRING,
    user:STRUCT<screen_name:STRING,name:STRING>,
    retweet_count:INT>,
  entities STRUCT<
    urls:ARRAY<STRUCT<expanded_url:STRING>>,
    user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
    hashtags:ARRAY<STRUCT<text:STRING>>>,
  lang string,
  retweet_count int,
  text string,
  user STRUCT<
    screen_name:STRING,
    name:STRING,
    friends_count:INT,
    followers_count:INT,
    statuses_count:INT,
    verified:BOOLEAN,
    utc_offset:INT,
    time_zone:STRING>
)
PARTITIONED BY (datehour int)
LOCATION 'hdfs://192.168.1.11:8020/user/flume/tweets'

hsfelix · ‎11-20-2017

I'm trying to read a table located in Hive (Hortonworks) to collect some Twitter data for a machine learning project, using pyhive since pyhs2 is not supported by Python 3.6. Here's my code:

from pyhive import hive
conn = hive.Connection(host='192.168.1.11', port=10000, auth='NOSASL')
import pandas as pd
import sys
df = pd.read_sql("SELECT * FROM my_table", conn)
print(sys.getsizeof(df))
df.head()

When I run it I get this error:

Traceback (most recent call last):
  File "C:\Users\PWST112\Desktop\import.py", line 44, in <module>
    conn = hive.Connection(host='192.168.1.11', port=10000, auth='NOSASL')
  File "C:\Users\PWST112\AppData\Local\Programs\Python\Python36\lib\site-packages\pyhive\hive.py", line 164, in __init__
    response = self._client.OpenSession(open_session_req)
  File "C:\Users\PWST112\AppData\Local\Programs\Python\Python36\lib\site-packages\TCLIService\TCLIService.py", line 187, in OpenSession
    return self.recv_OpenSession()
  File "C:\Users\PWST112\AppData\Local\Programs\Python\Python36\lib\site-packages\TCLIService\TCLIService.py", line 199, in recv_OpenSession
    (fname, mtype, rseqid) = iprot.readMessageBegin()
  File "C:\Users\PWST112\AppData\Local\Programs\Python\Python36\lib\site-packages\thrift\protocol\TBinaryProtocol.py", line 148, in readMessageBegin
    name = self.trans.readAll(sz)
  File "C:\Users\PWST112\AppData\Local\Programs\Python\Python36\lib\site-packages\thrift\transport\TTransport.py", line 60, in readAll
    chunk = self.read(sz - have)
  File "C:\Users\PWST112\AppData\Local\Programs\Python\Python36\lib\site-packages\thrift\transport\TTransport.py", line 161, in read
    self.__rbuf = BufferIO(self.__trans.read(max(sz, self.__rbuf_size)))
  File "C:\Users\PWST112\AppData\Local\Programs\Python\Python36\lib\site-packages\thrift\transport\TSocket.py", line 132, in read
    message='TSocket read 0 bytes')
thrift.transport.TTransport.TTransportException: TSocket read 0 bytes
[Finished in 0.3s]

Here is the pip list:

beautifulsoup4 (4.6.0)
bleach (2.0.0)
colorama (0.3.9)
cycler (0.10.0)
decorator (4.0.11)
entrypoints (0.2.3)
ez-setup (0.9)
future (0.16.0)
html5lib (0.999999999)
impala (0.2)
ipykernel (4.6.1)
ipython (6.1.0)
ipython-genutils (0.2.0)
ipywidgets (6.0.0)
jedi (0.10.2)
Jinja2 (2.9.6)
jsonschema (2.6.0)
jupyter (1.0.0)
jupyter-client (5.1.0)
jupyter-console (5.1.0)
jupyter-core (4.3.0)
konlpy (0.4.4)
MarkupSafe (1.0)
matplotlib (2.0.2)
mistune (0.7.4)
nbconvert (5.2.1)
nbformat (4.3.0)
nltk (3.2.4)
notebook (5.0.0)
numpy (1.13.1+mkl)
pandas (0.20.3)
pandocfilters (1.4.1)
pickleshare (0.7.4)
pip (9.0.1)
prompt-toolkit (1.0.14)
pure-sasl (0.4.0)
Pygments (2.2.0)
PyHive (0.5.0)
pyhs2 (0.6.0)
pyparsing (2.2.0)
python-dateutil (2.6.0)
pytz (2017.2)
pyzmq (16.0.2)
qtconsole (4.3.0)
sasl (0.2.1)
scikit-learn (0.18.2)
scipy (0.19.1)
setuptools (28.8.0)
simplegeneric (0.8.1)
six (1.10.0)
testpath (0.3.1)
thrift (0.10.0)
thrift-sasl (0.3.0)
tornado (4.5.1)
traitlets (4.3.2)
wcwidth (0.1.7)
webencodings (0.5.1)
wheel (0.30.0)
widgetsnbextension (2.0.0)

Can somebody help? I have my sandbox configured for "NONE" authentication, since the NOSASL option is not available. Best regards
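One thing that may be worth checking here: the code passes auth='NOSASL' while the sandbox is configured for "NONE" authentication. As far as I know, PyHive's auth argument is meant to mirror the server's hive.server2.authentication value, and a mismatch between the two can end in exactly this "TSocket read 0 bytes" error. The sketch below builds the connection arguments from the server setting. The helper names, the "hive" username and the LIMIT are assumptions for illustration only, and it has not been run against this sandbox.

```python
def hive_connection_kwargs(host, server_auth="NONE", port=10000, username="hive"):
    """Build keyword arguments for pyhive.hive.Connection.

    Assumption: server_auth is whatever hive.server2.authentication is set
    to on the HiveServer2 side ("NONE" for the sandbox described above).
    """
    return {"host": host, "port": port, "auth": server_auth, "username": username}


def read_sample(table, **conn_kwargs):
    """Open a connection and read a few rows into a pandas DataFrame."""
    # Imported lazily so the helper above can be used without pyhive installed.
    from pyhive import hive
    import pandas as pd

    conn = hive.Connection(**conn_kwargs)
    try:
        return pd.read_sql("SELECT * FROM {} LIMIT 10".format(table), conn)
    finally:
        conn.close()


kwargs = hive_connection_kwargs("192.168.1.11")
print(kwargs)
# df = read_sample("my_table", **kwargs)  # needs a reachable HiveServer2
```

With auth set to "NONE", PyHive goes through a SASL transport, so the sasl and thrift-sasl packages from the pip list need to be importable.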

hsfelix · ‎10-04-2017

@Dan Zaratsian I'm still stuck on this problem... the query only crashes when I select 2 or more columns together with the text column. If I don't select the text column, or select it alone, it works... Do you have any suggestions, please? Many thanks in advance.

hsfelix · ‎09-28-2017

@Dan Zaratsian, Hope you're doing great. After a fresh install of HDP 2.6.1.0 I tried to run the query. First I created the external tweets table:

ADD JAR /tmp/json-serde-1.3.8-jar-with-dependencies.jar;

CREATE EXTERNAL TABLE tweets (
  id bigint,
  created_at string,
  source STRING,
  favorited BOOLEAN,
  retweeted_status STRUCT<
    text:STRING,
    user:STRUCT<screen_name:STRING,name:STRING>,
    retweet_count:INT>,
  entities STRUCT<
    urls:ARRAY<STRUCT<expanded_url:STRING>>,
    user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
    hashtags:ARRAY<STRUCT<text:STRING>>>,
  lang string,
  retweet_count int,
  text string,
  user STRUCT<
    screen_name:STRING,
    name:STRING,
    friends_count:INT,
    followers_count:INT,
    statuses_count:INT,
    verified:BOOLEAN,
    utc_offset:INT,
    time_zone:STRING>
)
PARTITIONED BY (datehour int)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( "ignore.malformed.json" = "true")
LOCATION 'hdfs://192.168.1.11:8020/user/flume/tweets'

The Hive query gives this error on HiveServer2:

2017-09-28 15:22:39,577 ERROR [HiveServer2-Background-Pool: Thread-1161]: SessionState (SessionState.java:printError(993)) - Status: Failed
2017-09-28 15:22:39,578 ERROR [HiveServer2-Background-Pool: Thread-1161]: SessionState (SessionState.java:printError(993)) - Vertex failed, vertexName=Map 1, vertexId=vertex_1506521964877_0091_1_00, diagnostics=[Task failed, taskId=task_1506521964877_0091_1_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"id":913124962290069505,"created_at":"Wed Sep 27 19:35:30 +0000 2017","source":"<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>","retweeted_status":{"text":"MÃ£e de @Cristiano, Dolores Aveiro no seu perfil de Instagram: \"Vim apoiar o meu @Sporting_CP\" #UCL #DiaDeSporting https://t.co/GnPPkt2CYG","user":{"screen_name":"sportingfanspt","name":"SPORTING FANS"},"retweet_count":43},"lang":"pt","retweet_count":0,"text":"RT @sportingfanspt: MÃ£e de @Cristiano, Dolores Aveiro no seu perfil de Instagram: \"Vim apoiar o meu @Sporting_CP\" #UCL #DiaDeSporting httpsâ€¦","user":{"screen_name":"ladraodoapito","name":"Arbitro com Voucher","friends_count":568,"followers_count":319,"statuses_count":1668,"verified":false,"utc_offset":3600,"time_zone":"Lisbon"},"datehour":2017092720}
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:347)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:194)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:185)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:185)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:181)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"id":913124962290069505,"created_at":"Wed Sep 27 19:35:30 +0000 2017","source":"<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>","retweeted_status":{"text":"MÃ£e de @Cristiano, Dolores Aveiro no seu perfil de Instagram: \"Vim apoiar o meu @Sporting_CP\" #UCL #DiaDeSporting https://t.co/GnPPkt2CYG","user":{"screen_name":"sportingfanspt","name":"SPORTING FANS"},"retweet_count":43},"lang":"pt","retweet_count":0,"text":"RT @sportingfanspt: MÃ£e de @Cristiano, Dolores Aveiro no seu perfil de Instagram: \"Vim apoiar o meu @Sporting_CP\" #UCL #DiaDeSporting httpsâ€¦","user":{"screen_name":"ladraodoapito","name":"Arbitro com Voucher","friends_count":568,"followers_count":319,"statuses_count":1668,"verified":false,"utc_offset":3600,"time_zone":"Lisbon"},"datehour":2017092720}
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:91)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:68)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:325)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:150)
    ... 14 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"id":913124962290069505,"created_at":"Wed Sep 27 19:35:30 +0000 2017","source":"<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>","retweeted_status":{"text":"MÃ£e de @Cristiano, Dolores Aveiro no seu perfil de Instagram: \"Vim apoiar o meu @Sporting_CP\" #UCL #DiaDeSporting https://t.co/GnPPkt2CYG","user":{"screen_name":"sportingfanspt","name":"SPORTING FANS"},"retweet_count":43},"lang":"pt","retweet_count":0,"text":"RT @sportingfanspt: MÃ£e de @Cristiano, Dolores Aveiro no seu perfil de Instagram: \"Vim apoiar o meu @Sporting_CP\" #UCL #DiaDeSporting httpsâ€¦","user":{"screen_name":"ladraodoapito","name":"Arbitro com Voucher","friends_count":568,"followers_count":319,"statuses_count":1668,"verified":false,"utc_offset":3600,"time_zone":"Lisbon"},"datehour":2017092720}
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:565)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:83)
    ... 17 more

Can you help? I'm only getting this error when I query the text column....

hsfelix · ‎09-04-2017

@Sindhu Many thanks for your answer... That didn't work, but it apparently worked once I disabled the database check. At least the upgrade completed. Many thanks!

hsfelix · ‎08-28-2017

Dear @Sagar Shimpi I've followed your article but when I hit restart I get this error: Can you help? Best regards

hsfelix · ‎08-25-2017

@Eric Periard Have you managed to find a solution for this? I've been stuck on the same error for over a week. Many thanks in advance. Best regards

Member Since	‎08-01-2017 09:10 AM
Last Visited	‎02-20-2018 04:58 AM
Posts	65
Kudos received	3

Cloudera Community

Re: pyhive connection error: thrift.transport.TTra...

Re: Hive is truncating strings :(

Re: Zeppelin Tutorial : error: value toDF is not a...

Re: Oozie error - HTTP Status 401

Re: pyhive connection error: thrift.transport.TTra...

Re: Hive is truncating strings :(

Re: Hive is truncating strings :(

Hive is truncating strings :(

pyhive connection error: thrift.transport.TTranspo...

Re: How to make hive queries including scala and p...

Re: How to make hive queries including scala and p...

Re: HDP 2.6.1.0 - Error Finalizing HDP upgrade Fai...

Re: HDP upgrade failed on Finalize Upgrade step - ...

Re: HDP Upgrade 2.4.0.0 to 2.4.2.0 Failing a pre-c...