Monday, October 26, 2015

Hive Notes

1.
Few times where there is files related to server logs where all the information about the server was stored in log format, I created regular expression in the hive queries and only filter the data of my interest from the warehouse file. I used this data to created dashboards in help my team members to know the latest state of server and database without looking into the server logs. I also created many dashboards, system alters , graphs, charts for enrich the presentation of data.
For example, I created the primary and secondary database dashboards.



2. Use of Regular expression in a log file stored in hive table.

insert overwrite table batting
SELECT
regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) player_id,
regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year,
regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) run
from temp_battting
DISTRIBUTE BY tic;


3. you can insert new data into table by two ways.


load the data of a file into table using load command.

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename.
you can insert new data into table by using select query.

INSERT INTO tablename1 select columnlist FROM secondtable;

4. Insert Data into already partitioned tables.

http://stackoverflow.com/questions/30211878/how-to-append-new-data-to-already-existing-hive-table

5. PARQUET is a columnar store that gives us advantages for storing and scanning data. Storing the data column-wise allows for better compression, which gives us faster scans while using less storage. It’s also helpful for “wide” tables and for things like column level aggregations. E.g. avg(degrees).

CREATE TABLE parquet_test (
id int,
str string,
mp MAP,
lst ARRAY,
strct STRUCT)
PARTITIONED BY (part string)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat';

Link 1
Link 2

6. Data insertion into partitioned and non-partitioned tables.
http://unmeshasreeveni.blogspot.com/2014/11/hive-partitioning.html

7. Bucketing in hive

Link

8. Configuration of Hive

There is a precedence hierarchy to setting properties. In the following list, lower numbers
take precedence over higher numbers:
1. The Hive SET command
2. The command-line -hiveconf option.
3. hive-site.xml and the Hadoop site files (core-site.xml, hdfs-site.xml, mapred-site.xml,
and yarn-site.xml).
4. The Hive defaults and the Hadoop default files (core-default.xml, hdfs-default.xml,
mapred-default.xml, and yarn-default.xml).

9.
Metastore
-The metastore is the central repository of Hive metadata.The metastore is divided into two pieces: a service and the backing store for the data.
- only one Hive session open at a time that accesses the same metastore in case of Embedded Metastore.

Using Standalon Database
-The solution to supporting multiple sessions (and therefore multiple users) is to use a standalone database. This configuration is referred to as a local metastore, since the metastore service still runs in the same process as the Hive service but connects to a database running in a separate process, either on the same machine or on a remote machine.

No comments:

Post a Comment