Member since
09-16-2021
337
Posts
53
Kudos Received
27
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
296 | 11-22-2024 05:29 AM | |
166 | 11-15-2024 06:38 AM | |
383 | 11-13-2024 07:12 AM | |
415 | 11-10-2024 11:19 PM | |
569 | 10-25-2024 05:02 AM |
07-14-2023
02:39 AM
If my understanding is correct, the schema is altered for different input files, which implies that the data itself lacks a structured schema. Given the frequent changes in the schema, it is advisable to store the data in a column-oriented system such as HBASE. The Same HBASE data can be accessed through spark using HBase-Spark Connector. Ref - https://docs.cloudera.com/cdp-private-cloud-base/7.1.8/accessing-hbase/topics/hbase-example-using-hbase-spark-connector.html
... View more
07-14-2023
12:31 AM
In general, a dump.sql file from an RDBMS can be read using the respective RDBMS command line utility. For instance, the mysql dump.sql file can be read using the mysql utility. To read MySQL data from Hive, you have two options: you can either utilize the HIVE JDBC Handler or import the RDBMS data into Hive tables using Sqoop. Ref - https://cwiki.apache.org/confluence/display/Hive/JDBC+Storage+Handler https://docs.cloudera.com/cdp-private-cloud-base/7.1.8/migrating-data-into-hive/topics/hive_moving_data_from_databases_to_hive.html JdbcStorageHandler reads the data from the JDBC data source Sqoop import command that imports data from diverse data source.
... View more
07-13-2023
11:59 PM
use showHeader=false and outputformat as csv2 to extract only count value to bash variable. [root@nightly-cm-ap-2 ~]# count=$(beeline -u "<jdbcURL>" --showHeader=false --outputformat=csv2 -e "select count(*) from sample_07")
SLF4J: Class path contains multiple SLF4J bindings.
.
.
.
INFO : Compiling command(queryId=hive_20230714065303_399f60fb-d947-4fcf-b017-c64a4a39251f): select count(*) from sample_07
INFO : Semantic Analysis Completed (retrial = false)
INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20230714065303_399f60fb-d947-4fcf-b017-c64a4a39251f); Time taken: 0.213 seconds
INFO : Executing command(queryId=hive_20230714065303_399f60fb-d947-4fcf-b017-c64a4a39251f): select count(*) from sample_07
INFO : Completed executing command(queryId=hive_20230714065303_399f60fb-d947-4fcf-b017-c64a4a39251f); Time taken: 0.009 seconds
INFO : OK
1 row selected (0.675 seconds)
[root@nightly-cm-ap-2 ~]# echo $count
822
[root@nightly-cm-ap-2 ~]# For additional information use beeline --help. [root@nightly-cm-ap-2 ~]# beeline --help
SLF4J: Class path contains multiple SLF4J bindings.
.
.
.
Usage: java org.apache.hive.cli.beeline.BeeLine
-u <database url> the JDBC URL to connect to
-c <named url> the named JDBC URL to connect to,
which should be present in beeline-site.xml
as the value of beeline.hs2.jdbc.url.<namedUrl>
-r reconnect to last saved connect url (in conjunction with !save)
-n <username> the username to connect as
-p <password> the password to connect as
-d <driver class> the driver class to use
-i <init file> script file for initialization
-e <query> query that should be executed
-f <exec file> script file that should be executed
-w (or) --password-file <password file> the password file to read password from
--hiveconf property=value Use value for given property
--hivevar name=value hive variable name and value
This is Hive specific settings in which variables
can be set at session level and referenced in Hive
commands or queries.
--property-file=<property-file> the file to read connection properties (url, driver, user, password) from
--color=[true/false] control whether color is used for display
--showHeader=[true/false] show column names in query results
--escapeCRLF=[true/false] show carriage return and line feeds in query results as escaped \r and \n
--headerInterval=ROWS; the interval between which heades are displayed
--fastConnect=[true/false] skip building table/column list for tab-completion
--autoCommit=[true/false] enable/disable automatic transaction commit
--verbose=[true/false] show verbose error messages and debug info
--showWarnings=[true/false] display connection warnings
--showDbInPrompt=[true/false] display the current database name in the prompt
--showNestedErrs=[true/false] display nested errors
--numberFormat=[pattern] format numbers using DecimalFormat pattern
--force=[true/false] continue running script even after errors
--maxWidth=MAXWIDTH the maximum width of the terminal
--maxColumnWidth=MAXCOLWIDTH the maximum width to use when displaying columns
--silent=[true/false] be more silent
--report=[true/false] show number of rows and execution time after query execution
--autosave=[true/false] automatically save preferences
--outputformat=[table/vertical/csv2/tsv2/dsv/csv/tsv] format mode for result display
Note that csv, and tsv are deprecated - use csv2, tsv2 instead
--incremental=[true/false] Defaults to false. When set to false, the entire result set
is fetched and buffered before being displayed, yielding optimal
display column sizing. When set to true, result rows are displayed
immediately as they are fetched, yielding lower latency and
memory usage at the price of extra display column padding.
Setting --incremental=true is recommended if you encounter an OutOfMemory
on the client side (due to the fetched result set size being large).
Only applicable if --outputformat=table.
--incrementalBufferRows=NUMROWS the number of rows to buffer when printing rows on stdout,
defaults to 1000; only applicable if --incremental=true
and --outputformat=table
--truncateTable=[true/false] truncate table column when it exceeds length
--delimiterForDSV=DELIMITER specify the delimiter for delimiter-separated values output format (default: |)
--isolation=LEVEL set the transaction isolation level
--nullemptystring=[true/false] set to true to get historic behavior of printing null as empty string
--maxHistoryRows=MAXHISTORYROWS The maximum number of rows to store beeline history.
--delimiter=DELIMITER set the query delimiter; multi-char delimiters are allowed, but quotation
marks, slashes, and -- are not allowed; defaults to ;
--convertBinaryArrayToString=[true/false] display binary column data as string or as byte array
--getUrlsFromBeelineSite Print all urls from beeline-site.xml, if it is present in the classpath
--help display this message
Example:
1. Connect using simple authentication to HiveServer2 on localhost:10000
$ beeline -u jdbc:hive2://localhost:10000 username password
2. Connect using simple authentication to HiveServer2 on hs.local:10000 using -n for username and -p for password
$ beeline -n username -p password -u jdbc:hive2://hs2.local:10012
3. Connect using Kerberos authentication with hive/localhost@mydomain.com as HiveServer2 principal
$ beeline -u "jdbc:hive2://hs2.local:10013/default;principal=hive/localhost@mydomain.com"
4. Connect using SSL connection to HiveServer2 on localhost at 10000
$ beeline "jdbc:hive2://localhost:10000/default;ssl=true;sslTrustStore=/usr/local/truststore;trustStorePassword=mytruststorepassword"
5. Connect using LDAP authentication
$ beeline -u jdbc:hive2://hs2.local:10013/default <ldap-username> <ldap-password>
... View more
06-16-2023
12:04 AM
Check the possibility of using hive managed table. As part of hive managed tables, you won't require separate merge job , as hive compaction takes care by default if compaction is enabled. You can access managed tables through HWC (Hive Warehouse Connector) from Spark.
... View more
06-15-2023
11:47 PM
@Abdul_ As of now hive won't support row delimiter other new line character . Attaching the corresponding Jira for reference HIVE-11996 As a workaround, Recommend to update the input file using external libraries like awk,...etc and upload the input file in the corresponding FileSystem location to read. Eg - Through AWK [root@c2757-node2 ~]# awk -F "\",\"" 'NF < 3 {getline nextline; $0 = $0 nextline} 1' sample_case.txt
"IM43163","SOUTH,OFC","10-Jan-23"
"IM41763","John:comment added","12-Jan-23"
[root@c2757-node2 ~]# awk -F "\",\"" 'NF < 3 {getline nextline; $0 = $0 nextline} 1' sample_case.txt > sample_text.csv Reading from Hive Table 0: jdbc:hive2://c2757-node2.coelab.cloudera.c> select * from table1;
.
.
.
INFO : Executing command(queryId=hive_20230616064136_333ff98d-636b-43b1-898d-fca66031fe7f): select * from table1
INFO : Completed executing command(queryId=hive_20230616064136_333ff98d-636b-43b1-898d-fca66031fe7f); Time taken: 0.023 seconds
INFO : OK
+---------------+---------------------+---------------+
| table1.col_1 | table1.col_2 | table1.col_3 |
+---------------+---------------------+---------------+
| IM43163 | SOUTH,OFC | 10-Jan-23 |
| IM41763 | John:comment added | 12-Jan-23 |
+---------------+---------------------+---------------+
2 rows selected (1.864 seconds)
... View more
06-14-2023
07:04 AM
Check the bellow helps for your usecase. https://github.com/databricks/spark-xml scala> import com.databricks.spark.xml.util.XSDToSchema
import com.databricks.spark.xml.util.XSDToSchema
scala> import java.nio.file.Paths
import java.nio.file.Paths
scala> val schema = XSDToSchema.read(Paths.get("/tmp/DRAFT1auth.099.001.04_1.3.0.xsd"))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(Document,StructType(StructField(ScrtstnNonAsstBckdComrclPprUndrlygXpsrRpt,StructType(StructField(NewCrrctn,StructType(StructField(ScrtstnRpt,StructType(StructField(ScrtstnIdr,StringType,false), StructField(CutOffDt,StringType,false), StructField(UndrlygXpsrRcrd,ArrayType(StructType(StructField(UndrlygXpsrId,StructType(StructField(NewUndrlygXpsrIdr,StringType,false), StructField(OrgnlUndrlygXpsrIdr,StringType,false), StructField(NewOblgrIdr,StringType,false), StructField(OrgnlOblgrIdr,StringType,false)),false), StructField(UndrlygXpsrData,StructType(StructField(ResdtlRealEsttLn,StructType(StructField(PrfrmgLn,StructType(StructField(UndrlygXpsrCmonData,StructType(StructField(ActvtyDtDtls,StructType(StructField(PoolAddt...
scala> import com.databricks.spark.xml._
import com.databricks.spark.xml._
scala> val df=spark.read.schema(schema).xml("/tmp/DRAFT1auth.099.001.04_non-ABCP_Underlying_Exposure_Report.xml")
23/06/14 13:53:19 WARN util.package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
df: org.apache.spark.sql.DataFrame = [Document: struct<ScrtstnNonAsstBckdComrclPprUndrlygXpsrRpt: struct<NewCrrctn: struct<ScrtstnRpt: struct<ScrtstnIdr: string, CutOffDt: string ... 1 more field>>, Cxl: struct<ScrtstnCxl: array<string>, UndrlygXpsrRptCxl: array<struct<ScrtstnIdr:string,CutOffDt:string>>>>>] spark-shell command used spark-shell --jars /tmp/spark-xml_2.11-0.12.0.jar,/tmp/xmlschema-core-2.2.1.jar --files "/tmp/DRAFT1auth.099.001.04_1.3.0.xsd" FYI - In these example , used opensource databrics sparxml libraries. Would request you to validate the data and store the data in parquet or orc format. Since Spark/hive gives better performance with parquet/orc formats over xml.
... View more
06-13-2023
12:16 AM
Share sample Data file with minimum of 2 records , to understand the structure.
... View more
06-13-2023
12:12 AM
Output of below to identify the exact ouptut records details, explain formatted <query> explain extended <query> explain analyze <query>
... View more
06-06-2023
09:53 PM
Use the below Query to fetch the table location from HMS . select "DBS"."NAME" as DB_NAME, "TBLS"."TBL_NAME", "SDS"."LOCATION" from "DBS" join "TBLS" on "DBS"."DB_ID" = "TBLS"."DB_ID" AND "TBLS"."TBL_TYPE" != 'VIRTUAL_VIEW' join "SDS" on "TBLS"."SD_ID" = "SDS"."SD_ID"; To query the same from hive, Would recommend to use JDBC Storage Handler. In CDP by default in sysdb this tables has been created. you can use the same. Query select dbs.name as db_name , tbls.tbl_name , sds.location from dbs join tbls on dbs.db_id = tbls.db_id and tbls.tbl_type != 'VIRTUAL_VIEW' join sds on tbls.sd_id = sds.sd_id;
... View more
06-06-2023
09:37 PM
Once the data has been read from database, you don't need to write the same data to file (i.e. CSV ) . Instead you can write directly into hive table using DataFrame API's. Once the Data has been loaded you query the same from hive. df.write.mode(SaveMode.Overwrite).saveAsTable("hive_records") Ref - https://spark.apache.org/docs/2.4.7/sql-data-sources-hive-tables.html Sample Code Snippet df = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql://<server name>:5432/<DBNAME>") \
.option("dbtable", "\"<SourceTableName>\"") \
.option("user", "<Username>") \
.option("password", "<Password>") \
.option("driver", "org.postgresql.Driver") \
.load()
df.write.mode('overwrite').saveAsTable("<TargetTableName>")
From hive
INFO : Compiling command(queryId=hive_20230607042851_fa703b79-d6e0-4a4c-936c-efa21ec00a10): select count(*) from TBLS_POSTGRES
INFO : Semantic Analysis Completed (retrial = false)
INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20230607042851_fa703b79-d6e0-4a4c-936c-efa21ec00a10); Time taken: 0.591 seconds
INFO : Executing command(queryId=hive_20230607042851_fa703b79-d6e0-4a4c-936c-efa21ec00a10): select count(*) from TBLS_POSTGRES
.
.
.
+------+
| _c0 |
+------+
| 122 |
+------+
... View more