Member since: 07-14-2017
Posts: 99
Kudos Received: 5
Solutions: 4
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 708 | 09-05-2018 09:58 AM
 | 1128 | 07-31-2018 12:59 PM
 | 651 | 01-15-2018 12:07 PM
 | 635 | 11-23-2017 04:19 PM
02-18-2021
09:19 AM
Hi everybody, I am trying the following approach to write data into a Hive table.

import logging
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
from pyspark.streaming.kafka import KafkaUtils
import datetime
from pyspark.sql.functions import lit, unix_timestamp
from os.path import *
from pyspark import Row

warehouseLocation = abspath("spark-warehouse")
spark = SparkSession.builder.appName("spark_streaming").config("spark.sql.warehouse.dir", warehouseLocation).enableHiveSupport().getOrCreate()

kafka = "kafka"
offsets = "earliest"
servers = "server_1:port,server_2:port"
security_protocol = "SSL"
keystore_location = "keystore"
keystore_password = "keystore_password"
kafka_topic = "kafka_topic"
checkpoint_location = "/checkpoint/location"

# foreachBatch handler: expose the micro-batch as a temp view and insert it into Hive
def hiveInsert(df, batchId):
    df.createOrReplaceTempView("updates")
    spark.sql("insert into hive_db.hive_table select value, time_stamp from updates")

# Read from Kafka, cast the value to string, and stamp each row with the current minute
df = spark.readStream.format(kafka) \
    .option("startingOffsets", offsets) \
    .option("kafka.bootstrap.servers", servers) \
    .option("kafka.security.protocol", security_protocol) \
    .option("kafka.ssl.keystore.location", keystore_location) \
    .option("kafka.ssl.keystore.password", keystore_password) \
    .option("subscribe", kafka_topic) \
    .load() \
    .selectExpr("CAST(value AS STRING)") \
    .select('value') \
    .withColumn('time_stamp', lit(datetime.datetime.now().strftime('%Y%m%d%H%M')))

query = df.writeStream.foreachBatch(hiveInsert).start()
query.awaitTermination()

The above code is not working. Any pointers would be of great help!
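One detail worth a quick sketch (an observation on my part, not a confirmed fix): checkpoint_location is defined above but never passed to the stream, so wiring it in is one thing worth trying.

# A sketch reusing the variables already defined above; only the checkpointLocation
# option is new, and treating it as the fix is an assumption.
query = df.writeStream \
    .foreachBatch(hiveInsert) \
    .option("checkpointLocation", checkpoint_location) \
    .start()
query.awaitTermination()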
Labels: Apache Spark
08-21-2019
10:29 AM
Hi,
I am trying to match multiple values in a string using Hive regexp, and I am looking for an optimal solution.
I want to match "first" and "1.11" in the text below.
Column name is col:
This string is the first string with two decimals 1.11 and 2.22 with a special char / and some more extra string.
Table name is t.
The query I was using:
select * from t where t.col regexp '(?=.*first)(?=.*1.11)'
Could you please help me?
Thank you
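For comparison, a sketch of two simpler ways to express the same match (assuming the goal is rows whose col contains both the word "first" and the literal value 1.11; note that the unescaped dots in 1.11 would otherwise match any character):

select * from t where t.col like '%first%' and t.col like '%1.11%';
-- or as a single regexp, with the dots escaped:
select * from t where t.col regexp 'first.*1\\.11|1\\.11.*first';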
Labels: Apache Hive
02-07-2019
04:04 PM
@Shu can you please help me?
02-06-2019
10:29 AM
Hi All, I have a string.
String:
some text with an ip 111.111.111.111 and a decimal 11.2323232 and some text here and then an int 1 and then some HTTP/1.1 with a 503 request and then another ip 222.222.222.222 and some imaginary 999.999.999.999
I want to output all the ip addresses, comma separated. I tried the below:
select regexp_replace(regexp_replace(String,'[^(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})]',' '),'\\s+',',');
+------------------------------------------------------------------------+--+
| _c0 |
+------------------------------------------------------------------------+--+
| ,111.111.111.111,11.2323232,1,1.1,503,222.222.222.222,999.999.999.999 |
+------------------------------------------------------------------------+--+
Expected output: 111.111.111.111,222.222.222.222,999.999.999.999. Could you please help me?
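A sketch of an alternative that avoids the replace-everything approach (assumptions: the text sits in a hypothetical table s with a column str, tokens are whitespace-separated, and anything shaped like a dotted quad counts as an ip):

SELECT concat_ws(',', collect_list(word)) AS ips
FROM (
  SELECT word
  FROM s LATERAL VIEW explode(split(str, '\\s+')) w AS word
  WHERE word RLIKE '^\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}$'
) tokens;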
Labels: Apache Hive
10-16-2018
01:35 PM
@sadapa you can never insert a "?" into a column that has datatype int, because "?" is not a number and Hive knows it. I am not sure why you want to do that, but if you still want to convert a "?" into a number that you can change back later, you can try ascii().
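For illustration, a small sketch of that round trip (assuming a Hive version that has both ascii() and chr()):

select ascii('?');   -- 63, which fits in an int column
select chr(63);      -- '?', to translate it back later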
10-02-2018
01:33 PM
@Shu Thank you
10-02-2018
11:40 AM
@Carlton Patterson myresults.coalesce(1).write.format('csv').save("/tmp/myresults.csv", header='true')
10-01-2018
01:14 PM
Hi All, I need help to get the below result. I have two tables.
Table name: match
+-----------------------------------+----------------+--+
| hint | remarks |
+-----------------------------------+----------------+--+
| 1.1.1.1 | ip |
| 123456789 | contact |
| http://123123123123123123.some_n | url |
+-----------------------------------+----------------+--+
Table name: t1
+-------------------------------------------------------------------------------+-------------------+--+
| t1.text | t1.b |
+-------------------------------------------------------------------------------+-------------------+--+
| This ip is found 1.1.1.1 and is matched with match | table name match |
| This ip is found 1.1.1.2 and is matched with match | table name match |
| This contact is found 123456789 and is matched with match | table name match |
| This contact is found 123456789123456789 and is matched with match | table name match |
| This url is found http://123456789123456789.some_n and is matched with match | table name match |
+-------------------------------------------------------------------------------+-------------------+--+
I want to search the hint column of the match table in the text column of the t1 table and get the complete text column values. So basically I want a query like: select t1.text from t1 join match where t1.text contains (any value in match.hint); It would be helpful if this can be done in Hive, but I can live with pyspark, so pyspark help is also welcome. P.S.: table t1 is a big table and match is a small table with limited values (say 1500). Thank you
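A sketch of the containment join (assumptions: plain substring containment is acceptable, and the small match table can be map-joined):

select distinct t1.text
from t1 cross join match m
where instr(t1.text, m.hint) > 0;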
Labels: Apache Hive
09-18-2018
01:12 PM
@Andy LoPresto That's a nice idea, but I don't have the leverage to use ExecuteScript or ExecuteStreamCommand, as there are no scripts/programs (including awk) waiting for me, and getting them is out of my hands, so I am looking for a solution within my flow. Thank you
09-18-2018
01:09 PM
@Shu
1. Sample data: every value is present in attributes (i.e. every flowfile is parsed and the values in the flowfile are assigned to attributes). There are multiple flowfiles with the same value (user_name) in the attributes. Example:
flowfile1 attributes:
user_name: mark, file_in: 2018-09-18 15:00:00, file_out: 2018-09-18 15:01:00
user_name: michelle, file_in: 2018-09-18 15:00:02, file_out: 2018-09-18 15:01:01
user_name: mark, file_in: 2018-09-18 15:00:05, file_out: 2018-09-18 15:01:01
flowfile2 attributes:
user_name: mark, file_in: 2018-09-18 15:01:00, file_out: 2018-09-18 15:01:10
user_name: stella, file_in: 2018-09-18 15:01:12, file_out: 2018-09-18 15:01:21
2. I want to count all the flowfiles that have a given user_name (in the above example, the count of mark across both flowfiles is 3).
3. The schema of the flowfile is just the above 3 fields, which are assigned to attributes.
Thank you
09-17-2018
01:32 PM
Hi All, I have a use case where I want to find the number of occurrences of a word and perform an action on it. Example:
1. I have multiple flowfiles coming in.
2. I want to extract a word (say, user_name) using the ExtractText processor.
3. Count the word.
4. If user_name_count = 10,
5. do a ReplaceText of 10 as 1, and
6. PutEmail to user_name that the user_name count is 10.
Can you please let me know which processors can be helpful for this use case? Suggestions are appreciated.
Labels: Apache NiFi
09-12-2018
01:57 PM
@rtheron for some reason I cannot follow the first approach. I tried creating an intermediate ORC table with partitions and loaded the data into it from the external table. Now when I load into the destination from the intermediate table, PutHiveQL is taking a lot of time. Any suggestions are appreciated.
09-07-2018
01:35 PM
Hi All, I have a 10GB file arriving every minute at a location (/dir), and there is an external table over that location. The file is as below:
karlon,n_d_1,26,6234,2019-09-08,1536278400
d'lov,research,20,1001,2019-09-08,1536278400
kris'a,b_x_3,20,4532,2019-09-08,1536278400
External table name: ex_t, with columns name, department, age, id, date, time, for example:
karlon | n_d_1    | 26 | 6234 | 2019-09-08 | 1536278400
d'lov  | research | 20 | 1001 | 2018-09-08 | 1536278400
I have a PutHiveQL processor in my flow which gets data from the external table and inserts it into multiple ORC tables.
ORC tables: table_1, table_2, table_3, table_4, table_5, table_6
Every ORC table has the same columns: name (string), department (string), age (int), id (int), date (string), partition_value (int).
The PutHiveQL processor has multiple insert queries in it:
INSERT INTO table_1 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'research' AND time='1536278400';
INSERT INTO table_2 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'n_d_1' AND time='1536278400';
INSERT INTO table_3 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'b_x_3' AND time='1536278400';
INSERT INTO table_4 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'research' AND time='1536278400';
INSERT INTO table_5 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'x_in_1' AND time='1536278400';
INSERT INTO table_6 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'z_e_3' AND time='1536278400';
The above is sent as a flowfile to PutHiveQL, which is scheduled every minute, as the file arrives every minute. PutHiveQL is very slow processing the above and the inserts are not happening frequently. Can you please suggest how to improve the performance of PutHiveQL? I have increased the concurrent tasks but it did not help; sometimes the flowfiles (which have the insert statements) get queued and never execute. Suggestions are highly appreciated.
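One option worth sketching (an assumption, not a confirmed fix): Hive's multi-table insert lets a single statement scan ex_t once and fan out to all the targets, which reduces the work submitted through PutHiveQL each minute. Only two targets are shown; the rest follow the same pattern:

FROM ex_t
INSERT INTO TABLE table_1 PARTITION(partition_value)
  SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int)
  WHERE department = 'research' AND time = '1536278400'
INSERT INTO TABLE table_2 PARTITION(partition_value)
  SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int)
  WHERE department = 'n_d_1' AND time = '1536278400';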
Labels: Apache Hive, Apache NiFi
09-07-2018
10:10 AM
@Diego A Labrador I have used the set parameters in the connection string in the controller service. Could you try and see?
09-06-2018
03:35 AM
@Bryan Bende I have checked my data; it has no extra blank spaces, but it was arriving in batches. I am merging the files and appending using PutHDFS. When I use the configurations you suggested, I sometimes get a new blank line at the beginning of the file that is appended using PutHDFS. Can you please help me avoid the blank line at the beginning of the file? Also, the file is big (1GB).
09-05-2018
09:58 AM
I found an alternate way of doing it. Thank you
09-05-2018
09:42 AM
Hi All, I have JSON data in multiple small files (sometimes only one line in a file). I want to merge all the small files into a single large file, but I am getting the large file in an unexpected format. Example:
file 1: {"code"="1", "color"="green"}
{"code"="2", "color"="blue"}
{"code"="3", "color"="orange"}
file 2: {"code"="4", "color"="yellow"}
{"code"="5", "color"="red"}
I am getting the below output after using MergeContent:
{"code"="1", "color"="green"}
{"code"="2", "color"="blue"}
{"code"="3", "color"="orange"}{"code"="4", "color"="yellow"}
{"code"="5", "color"="red"}
Expected output:
{"code"="1", "color"="green"}
{"code"="2", "color"="blue"}
{"code"="3", "color"="orange"}
{"code"="4", "color"="yellow"}
{"code"="5", "color"="red"}
Labels: Apache NiFi
08-31-2018
04:18 AM
Looks like the JPGs are not aligned as expected, but the names of the JPGs are listed below in order. Thank you
08-31-2018
04:14 AM
Hi, I am getting a plain JSON stream with a '\n' delimiter through TCP. I am listening to TCP using ListenTCP, with batch size set to 10000. My JSON has variable fields, e.g.:
{"a":"20180831","b":"b"}
{"a":"20180831","b":"b","c":"c"}
I want to add a partition_value field to every line in the JSON stream at once. The attribute a is always present in the JSON, so I want to use a as the partition_value. The result should look like:
{"a":"20180831","b":"b","partition_value":"20180831"}
{"a":"20180831","b":"b","c":"c","partition_value":"20180831"}
I have used "UpdateRecord" processor below are the configuration UpdateRecord JsonTreeReader AvroSchemaRegistry AvroRecordSetWriter I used UpdateRecord -> jsontreereader ->avroschemaregistry |_________ -> avrorecordsetwriter Then I have used avrotojson I am getting only one line as output, can you please suggest where it is happening wrong or let me know if there is a better way to do it Thank you {"a":"20180831","b":"b","c":null,"partition_value":"20180831"}
Labels: Apache NiFi
08-14-2018
03:17 PM
@Felix Albani Thanks for the helping hand, I will go through them and may ask for suggestions if required. Thank you.
08-14-2018
08:46 AM
@Felix Albani can you please suggest
08-13-2018
02:14 PM
Hi, I am receiving data from TCP as a JSON stream using pyspark. I want to save the files (append to files; basically a file is per minute, e.g. yyyyMMddHHmm, so all messages in one minute should go to the corresponding file), and in parallel I want to save the JSON to an ORC Hive table. I have two questions.
1. [path: '/folder/file'] When I receive data in a DStream, I flatMap and split("\n") and then repartition(1).saveAsTextFiles(path, "json"):
lines = ssc.socketTextStream("localhost", 9999)
flat_map = lines.flatMap(lambda x: x.split("\n"))
flat_map.repartition(1).saveAsTextFiles(path,"json")
The above saves to the path given, but instead of producing one single file per minute in the folder, it makes three folders, each with a _SUCCESS file and a part-00000 file, which is not expected. Please help me get the expected layout: basically one folder per day and one file per minute under that folder.
2. If I want to save the JSON to an ORC Hive table, can I do it from a DStream, or do I have to change the DStream to an RDD and then perform some processing to save it to ORC? As I am new to pyspark, please help with the above or with some examples.
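For the first question, a sketch of one way to get per-minute output (assumptions: the /folder/<day>/<minute> layout is hypothetical, and saveAsTextFile always writes a directory of part files, so a true single file would still need a downstream merge):

def save_batch(batch_time, rdd):
    # batch_time arrives as a Python datetime in pyspark's foreachRDD callback
    if not rdd.isEmpty():
        path = batch_time.strftime('/folder/%Y%m%d/%Y%m%d%H%M')  # one folder per day, one output per minute
        rdd.coalesce(1).saveAsTextFile(path)

flat_map.foreachRDD(save_batch)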
Labels: Apache Spark
08-10-2018
09:12 AM
@Veerendra Nath Jasthi What is the frequency of the files and how big are the files in the given path? Also, could you please check your JVM heap memory (this is a guess, not a solution)?
08-10-2018
09:08 AM
@Felix Albani Can you help me with the pyspark version of the above please.
07-31-2018
02:36 PM
@Veerendra Nath Jasthi Possibly you have a complicated computation (maybe a regex) running on GetFile which is taking a lot of time to complete; also check how many files it is picking up based on your regex, and that should fix it.
07-31-2018
01:51 PM
@Felix Albani Thank you for the quick response, I will go through the given info.
07-31-2018
01:45 PM
@Veerendra Nath Jasthi That should not be the case; which processors are you using, and where are you seeing the issue?
07-31-2018
01:15 PM
@veerendra If you are using NiFi below 1.7, the best way is to restart NiFi.
07-31-2018
01:10 PM
Hi All, I am a beginner to Spark and want to do the below. A port, 55500, sends JSONs as a stream (ex: {"one":"1","two":"2"}{"three":"3","four":"4"}). I have an ORC table in Hive with the columns given below:
one, two, three, four, spark_streaming_startingtime, spark_streaming_endingtime, partition_value
I want to load the streaming values into the Hive ORC table. Can you please guide me how to achieve this? Thank you for your support.
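A rough sketch of the structured-streaming shape this usually takes (assumptions: Spark 2.4+ for foreachBatch, newline-delimited JSON, and the host, checkpoint path, and output location are placeholders; the socket source is meant only for testing):

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("tcp_json_to_orc").enableHiveSupport().getOrCreate()

# Schema for the incoming JSON lines (string-typed here for simplicity)
schema = StructType([StructField(f, StringType()) for f in ("one", "two", "three", "four")])

raw = spark.readStream.format("socket").option("host", "localhost").option("port", 55500).load()
parsed = raw.select(from_json(col("value"), schema).alias("j")).select("j.*")

def write_batch(df, batch_id):
    # A placeholder sink: append each micro-batch as ORC files; the extra table columns
    # (timestamps, partition_value) would need to be derived and added before a real insert.
    df.write.mode("append").orc("/tmp/tcp_json_orc_out")

query = (parsed.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "/tmp/tcp_json_to_orc_chk")
         .start())
query.awaitTermination()

Inserting into the actual Hive ORC table would additionally need the timestamp and partition_value columns populated to match its schema.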
Labels: Apache Spark
07-31-2018
12:59 PM
@Bryan Bende I checked the nifi-app.log; the JVM heap usage was at the max, which was rejecting the connections and failing the processor. It got resolved once the heap size issue was solved. Thank you for your support.