Lets talk Hadoop & Netezza: October 2015

Thursday, October 29, 2015

Hive Properties

https://murshedsqlcat.wordpress.com/2014/04/18/useful-hive-settings/
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties

Join two CSV data set using mapreduce

http://www.codeproject.com/Articles/869383/Implementing-Join-in-Hadoop-Map-Reduce

Diff between Writable and WritableComparable Interface

org.apache.hadoop.io.Writable is a Java interface. Any key or value type in the Hadoop Map-Reduce framework implements this interface. Implementations typically implement a static read(DataInput) method which constructs a new instance, calls readFields(DataInput) and returns the instance.
org.apache.hadoop.io.WritableComparable is a Java interface. Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface. WritableComparable objects can be compared to each other using Comparators.

Wednesday, October 28, 2015

Im-mapper Combining for word count

https://vangjee.wordpress.com/2012/03/07/the-in-mapper-combining-design-pattern-for-mapreduce-programming/

Find the top N most frequent words

1- Let the mapper run as usual writing (key, 1) for reduce phase.

Reduce phase:

1- We override two methods: reduce() and cleanup().
2- at the beginning of the method, we compute the sum of all the values received from the mappers for this key, which is the number of occurrences of this word inside the book; then we put the word and the number of occurrences into a HashMap.
3- We sort the hashmap by count in the map.sortByValues(countMap);
4- in the cleanup() method first we sort the HashMap by values , then we loop over the keyset and output the first 20 items.

Source

Finding the top 10 list from a set

http://blog.pivotal.io/pivotal/products/how-hadoop-mapreduce-can-transform-how-you-build-top-ten-lists

Impala Notes

1-

Cloudera Impala is an addition to tools available for querying big data. Impala does not replace the batch
processing frameworks built on MapReduce such as Hive. Hive and other frameworks built on MapReduce are
best suited for long running batch jobs, such as those involving batch processing of Extract, Transform, and
Load (ETL) type jobs.

2-

Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or
HBase.

Impala daemons

a. Impalad process
b. Impalad process also receive broadcast messages from the catalogd daemon (introduced in Impala 1.2) whenever any
Impala node in the cluster creates, alters, or drops any type of object, or when an INSERT or LOAD DATA statement
is processed through Impala.

Impala Metastore

The Impala component known as the statestore checks on the health of Impala daemons on all the nodes in a
cluster, and continuously relays its findings to each of those daemons.

It is physically represented by a daemon process named statestored. You only need such a process on one node in the cluster.

The Impala Catalog Service

The Impala component known as the catalog service relays the metadata changes from Impala SQL statements to all the nodes in a cluster. It is physically represented by a daemon process named catalogd; you only need such a process on one node in the cluster. Because the requests are passed through the statestore daemon, it makes sense to run the statestored and catalogd services on the same node.

This new component in Impala 1.2 reduces the need for the REFRESH and INVALIDATE METADATA statements.

http://blog.cloudera.com/blog/2014/12/the-impala-cookbook/

Informatica and Netezza notes

------- Informatica-------
-I was mostly working with creating mapping and validating it.
1. check if the informatica services are running,any session failed or not,if a file failed then how many records loaded and how many not loaded yet and figure out how to load rest,also find out the reason for session failure.
2. how to use refreshinterval for incremental loads in tables using informatica?
- What to do if session fails, what to if mapping it wrong OR work flow is wrong.
- how to recover the lost records, do we reinitiate the process or only work to load the remaining records.
- whom to contact if session, workflow, mapping falis.
http://etl-developer.com/2011/12/email-task-session-and-workflow-notification-informatica/#Workflow-and-session-details
- do we raised some tickets? what tool used? ZEONSS tool.
2- What do we used for work flow management in hdfs. ozzie or another tool.
3 - Develop informatica mappings and workflows for new change requests.
what is the process after mapping or workflow creation was done?
what database usually you used in informatica?
4- Optimise the current mappings and workflows?
5 -Ensure daily repository backup is taken.
6 - Follow SLA
-Develop mappings and workflows as per provided requirements.
-Create high level and low level docs for the mappings created.
-Test those mappings and prepare test cases,also ask for peer review and create doc for same.
-Send daily updates to customer.
------------------------------What are you doing ------------------------------------------
Very limited exposure to Netezza. Since I got more passionated to hadoop I started working in that. But I would welcome any opportunity to work with it.

- Informtica, Star schmema, snow flakes schema, elastic search, EMR, RedShift, S3, Oracle analytical queries.

Monday, October 26, 2015

Hive Notes 2

1.Locks

Hive also has support for table- and partition-level locking. Locks prevent, for example, one process from dropping a table while another is reading from it. Locks are managed transparently using ZooKeeper, so the user doesn’t have to acquire or release them, although it is possible to get information about which locks are being held via the SHOW LOCKS statement. By default, locks are not enabled.

2. SerDe

A SerDe will deserialize a row of data from the bytes in the file to objects used internally by Hive to operate on that row of data.
the table’s SerDe will serialize Hive’s internal representation of a row of data into the bytes that are written to the output file.

3. Use EXPLAIN and you will see the abstract syntax tree, the dependency graph, and the plan of each stage.

4.LEFT SEMI JOIN is used as replacement to IN, Exisits clause in hive quries. Although with HIVE 0.13 it support IN and Exists clause.

- Hive allows uncorrelated subqueries, where the subquery is a self-contained query referenced by an IN or EXISTS statement in the WHERE clause.

5. A UDF must satisfy the following two properties:
-A UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF.
-A UDF must implement at least one evaluate() method.

6. Hive Delta or Incremental load

https://pkghosh.wordpress.com/2012/07/08/making-hive-squawk-like-a-real-database/

7. Bulk Insert Update Delete in Data lake

https://pkghosh.wordpress.com/2015/04/26/bulk-insert-update-and-delete-in-hadoop-data-lake/

8. Hive Join Optimization

http://www.datascience-labs.com/hive/hiveql-joins/

9. Use Vectorization

Vectorized query execution improves performance of operations like scans, aggregations, filters and joins, by performing them in batches of 1024 rows at once instead of single row each time.

Introduced in Hive 0.13, this feature significantly improves query execution time, and is easily enabled with two parameters settings:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;

10. We could used TEZ instead of mapreduce execution engine.

Hive Notes

1.
Few times where there is files related to server logs where all the information about the server was stored in log format, I created regular expression in the hive queries and only filter the data of my interest from the warehouse file. I used this data to created dashboards in help my team members to know the latest state of server and database without looking into the server logs. I also created many dashboards, system alters , graphs, charts for enrich the presentation of data.
For example, I created the primary and secondary database dashboards.

2. Use of Regular expression in a log file stored in hive table.

insert overwrite table batting
SELECT
regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) player_id,
regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year,
regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) run
from temp_battting
DISTRIBUTE BY tic;

3. you can insert new data into table by two ways.

load the data of a file into table using load command.

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename.
you can insert new data into table by using select query.

INSERT INTO tablename1 select columnlist FROM secondtable;

4. Insert Data into already partitioned tables.

http://stackoverflow.com/questions/30211878/how-to-append-new-data-to-already-existing-hive-table

5. PARQUET is a columnar store that gives us advantages for storing and scanning data. Storing the data column-wise allows for better compression, which gives us faster scans while using less storage. It’s also helpful for “wide” tables and for things like column level aggregations. E.g. avg(degrees).

CREATE TABLE parquet_test (
id int,
str string,
mp MAP,
lst ARRAY,
strct STRUCT)
PARTITIONED BY (part string)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat';

Link 1
Link 2

6. Data insertion into partitioned and non-partitioned tables.
http://unmeshasreeveni.blogspot.com/2014/11/hive-partitioning.html

7. Bucketing in hive

Link

8. Configuration of Hive

There is a precedence hierarchy to setting properties. In the following list, lower numbers
take precedence over higher numbers:
1. The Hive SET command
2. The command-line -hiveconf option.
3. hive-site.xml and the Hadoop site files (core-site.xml, hdfs-site.xml, mapred-site.xml,
and yarn-site.xml).
4. The Hive defaults and the Hadoop default files (core-default.xml, hdfs-default.xml,
mapred-default.xml, and yarn-default.xml).

9.
Metastore
-The metastore is the central repository of Hive metadata.The metastore is divided into two pieces: a service and the backing store for the data.
- only one Hive session open at a time that accesses the same metastore in case of Embedded Metastore.

Using Standalon Database
-The solution to supporting multiple sessions (and therefore multiple users) is to use a standalone database. This configuration is referred to as a local metastore, since the metastore service still runs in the same process as the Hive service but connects to a database running in a separate process, either on the same machine or on a remote machine.

Map Reduce notes

1. If no custom partitioner is defined in Hadoop then how is data partitioned before it is sent to the reducer?

Ans: The default partitioner computes a hash value for the key and assigns the partition based on this result.

2. Distributed Cache

DistributedCache is a facility provided by the Map-Reduce framework to cache files needed by applications. Once you cache a file for your job, hadoop framework will make it available on each and every data nodes (in file system, not in memory) where you map/reduce tasks are running. Then you can access the cache file as local file in your Mapper Or Reducer job. Now you can easily read the cache file and populate some collection (e.g Array, Hashmap etc.) in your code.

3.

Best Practise in Netezza

Source1
Source2

what is data slice, spu, FPGA and inter-blade network fabric

Disk:
A disk is a physical drive on which data resides. In a Netezza system, host servers have several disks that hold the Netezza software, host operating system, database metadata, and sometimes small user files. The Netezza system also has many more disks that hold the user databases and tables.

Data Slice:
A data slice is a logical representation of the data that is saved on a disk. The data slice contains “pieces” of each user database and table. When users create tables and load their data, they distribute the data for the table across the data slices in the system by using a distribution key. An optimal distribution is one where each data slice has approximately the same amount of each user table as any other. The Netezza system distributes the user data to all of the data slices in the system by using a hashing algorithm.

Data Partition:

A data partition is a logical representation of a data slice that is managed by a specific SPU. That is, each SPU owns one or more data partitions, which contains the user data that the SPU is responsible for processing during queries. For example, in the IBM PureData System for Analytics N200x appliances, each SPU typically owns 40 data partitions although one or two may own 32 partitions. For example, in IBM Netezza 1000 or IBM PureData System for Analytics N1001 systems, each SPU typically owns 8 data partitions although one SPU has only 6 partitions. For a Netezza C1000 system, each SPU owns 9 data partitions by default. SPUs could own more than their default number of partitions; if a SPU fails, its data partitions are reassigned to the other active SPUs in the system.

spu
FPGA
inter-blade network fabric:

Many times people will ask: Why is it that the query runs fast when it's running alone, but when it's running side-by-side with another instance of itself they both slow to a crawl? This is largely due to the lack of co-location in the query. When the query cannot co-locate, it must redistribute the data across the inter-blade network fabric so that all the CPUs are privy to all the data. This quickly saturates the fabric so that when another query launches, they start fighting over fabric bandwidth not the CPU bandwidth.

Taken from
IBM

Responsibilities when working with Netezza

1- I was working with TwinFin(N1001).

Importance of setting the right distribution key for improved performance in Netezza

1- try create table table_name ( field1 bigint, field2 varchar(10))distribute on random

insert into table_name ()

generate statistics on table_name

the key is controlling your data types. if your recordset is large varchar fields, netezza is going to struggle for speed. if you can take your number fields and cast them as integer types, you will see better performance. All your varchar fields need to be "reasonable" (don't use varchar(5000) on every field.

2 -

Perhaps the table statistics are not updated.

generate express statistics on table_name;

3- last option:

What is the distrubution key on the existing table? Perhaps that is driving your problem, in that the table has high skew which means you are not distributing across the box efficiently. If you own the table, you may want to rebuild with a better distribution key, keeping in mind that it will be slow, but the future performance of the table will be better.

4- Same distribution key is required :
If you use a CTAS without distribution it will default to the same distribution as the source table, which may be more efficient.

Source

How could I redistribute the tables in Netezza?

Topics Covered:

Redistribution of KEY

Choosing the right distribution key is one of the key factor for improving performance in netezza system. We will create a table with a distribution key and later realize that it is not giving good performance. In this case, you might need to change the distribution key of a netezza table. You can achieve this by creating a new table.

The CTAS (create table As) comes in handy to create a new table and to load the data at the same time. The syntax of CTAS is

"create table as select * from distribute on (columns);"

While creating the new table, specify the distribution columns. If you do not specify distribution keys, a CTAS table will inherit its distribution from parent table.

The create tables as(CTAS table) are efficient method to change the distribution method of a table without reloading the data.

During the insert operation on CTAS table, the NPS system computes statistics for all columns.

Once you create the new table and loaded with the data, you can drop the old table.

Taken from Source

Netezza introduction

Topics Covered:

Intro , Colocation and distribution key, data skew, process skew, filter-table and filter-join.

It is important to keep in mind that while distribution and co-location are keys to high(er) performance on the machine, the true hero here is the SPU architecture and how it applies to the problem at hand. We have seen cases where applying a distribution alone has provided dramatic effects, such as 30x and 40x boost because of the nature of the data. This is not typical however, since the co-located reads will usually provide only a percentage boost rather than an X-level boost. This is why we often suggest that people sort of clear-their-head of jumping directly into a distribution discussion and instead put together a workable data model with a random distribution. Once loaded, profile the various key candidates to get a feel for which ones work the best and which ones do not. We have seen some users struggle with the data only because they prematurely selected a distribution key that - unbeknownst to them - had a very high skew and made the queries run too slow. This protracted their workstreams and made all kinds of things take longer than they should have. So at inception, go simple, and then work your way up to the ideal. Taken from this Source