Lets talk Hadoop & Netezza: what is data slice, spu, FPGA and inter-blade network fabric

Monday, October 26, 2015

What is a data slice, SPU, FPGA, and inter-blade network fabric

Disk:
A disk is a physical drive on which data resides. In a Netezza system, host servers have several disks that hold the Netezza software, host operating system, database metadata, and sometimes small user files. The Netezza system also has many more disks that hold the user databases and tables.

Data Slice:
A data slice is a logical representation of the data that is saved on a disk. The data slice contains “pieces” of each user database and table. When users create tables and load their data, they distribute the data for the table across the data slices in the system by using a distribution key. An optimal distribution is one where each data slice has approximately the same amount of each user table as any other. The Netezza system distributes the user data to all of the data slices in the system by using a hashing algorithm.
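As a rough illustration of the hashing idea above, the sketch below maps distribution-key values to data slices and measures how evenly the rows land. It is not Netezza's actual hashing algorithm: the MD5-based hash, the slice count of 8, and the names `data_slice_for`, `rows_per_slice`, and `skew` are all assumptions made for the example.

```python
import hashlib
from collections import Counter

NUM_DATA_SLICES = 8  # hypothetical slice count, chosen only for the example


def data_slice_for(key, num_slices=NUM_DATA_SLICES):
    """Map a distribution-key value to a data slice number.

    Assumption: MD5 stands in for the real (undocumented here) hash function.
    """
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_slices


def rows_per_slice(keys, num_slices=NUM_DATA_SLICES):
    """Count how many rows each data slice receives for the given key values."""
    counts = Counter(data_slice_for(k, num_slices) for k in keys)
    return [counts.get(s, 0) for s in range(num_slices)]


def skew(counts):
    """Ratio of the fullest slice to the average slice; 1.0 is perfectly even."""
    return max(counts) / (sum(counts) / len(counts))


# A high-cardinality key (for example a customer id) spreads rows evenly.
good = rows_per_slice(range(100_000))
# A key with only three distinct values can fill at most three slices.
bad = rows_per_slice([i % 3 for i in range(100_000)])

print("even key:  ", good, "skew =", round(skew(good), 2))
print("skewed key:", bad, "skew =", round(skew(bad), 2))
```

With the low-cardinality key, most slices stay empty while a few hold all the rows, which is the opposite of the optimal distribution described above.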

Data Partition:

A data partition is a logical representation of a data slice that is managed by a specific SPU. That is, each SPU owns one or more data partitions, which contain the user data that the SPU is responsible for processing during queries. For example, in IBM PureData System for Analytics N200x appliances, each SPU typically owns 40 data partitions, although one or two may own 32. In IBM Netezza 1000 or IBM PureData System for Analytics N1001 systems, each SPU typically owns 8 data partitions, although one SPU has only 6. In a Netezza C1000 system, each SPU owns 9 data partitions by default. An SPU can own more than its default number of partitions: if an SPU fails, its data partitions are reassigned to the other active SPUs in the system.
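The ownership and failover behaviour can be sketched as a simple mapping from SPUs to partition lists. The source only says that a failed SPU's partitions are reassigned to the other active SPUs; the least-loaded-first policy below, and the names `assign_partitions` and `fail_spu`, are assumptions for illustration rather than how the appliance actually chooses new owners.

```python
def assign_partitions(num_spus, partitions_per_spu):
    """Build the initial SPU -> data partition ownership map."""
    owners = {}
    next_id = 0
    for spu in range(num_spus):
        owners[spu] = list(range(next_id, next_id + partitions_per_spu))
        next_id += partitions_per_spu
    return owners


def fail_spu(owners, failed_spu):
    """Return a new ownership map with the failed SPU's partitions reassigned.

    Assumption: each orphaned partition goes to the surviving SPU that
    currently owns the fewest partitions (ties broken by SPU number).
    """
    survivors = {spu: list(parts) for spu, parts in owners.items()
                 if spu != failed_spu}
    if not survivors:
        raise ValueError("no active SPUs left to take over the partitions")
    for part in owners[failed_spu]:
        target = min(survivors, key=lambda s: (len(survivors[s]), s))
        survivors[target].append(part)
    return survivors


# Four SPUs with 8 partitions each, matching the typical Netezza 1000 figure.
owners = assign_partitions(num_spus=4, partitions_per_spu=8)
after = fail_spu(owners, failed_spu=2)
print({spu: len(parts) for spu, parts in after.items()})
```

After the failure, every surviving SPU owns more than its default 8 partitions, so each has more data to scan per query.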

SPU, FPGA, and inter-blade network fabric:

People often ask why a query runs fast on its own, but slows to a crawl when it runs side by side with another instance of itself. This is largely due to a lack of co-location in the query. When the query cannot co-locate, it must redistribute the data across the inter-blade network fabric so that all the CPUs are privy to all the data. This quickly saturates the fabric, so when another query launches, the two compete for fabric bandwidth rather than CPU bandwidth.
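To make the cost of a non-co-located query concrete, the sketch below counts how many rows of a table would have to move to a different data slice before a join on `customer_id` could run locally. The table, the column names, the MD5-based `slice_for` hash, and the slice count are all hypothetical; the point is only the contrast between a table distributed on the join column and one distributed on some other column.

```python
import hashlib

NUM_SLICES = 8  # hypothetical slice count


def slice_for(value, num_slices=NUM_SLICES):
    """Stand-in hash that maps a column value to a data slice (assumption)."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_slices


def rows_to_redistribute(rows, dist_col, join_col):
    """Count rows whose current slice differs from the slice the join needs.

    A row is stored on the slice chosen by its distribution column, but a
    join on join_col needs it on the slice chosen by that column instead.
    """
    return sum(
        1 for row in rows
        if slice_for(row[dist_col]) != slice_for(row[join_col])
    )


# Hypothetical orders table: 10,000 orders spread over 1,000 customers.
orders = [{"order_id": i, "customer_id": f"C{i % 1000}"} for i in range(10_000)]

# Distributed on the join column: every row is already where the join needs it.
colocated = rows_to_redistribute(orders, "customer_id", "customer_id")
# Distributed on order_id: most rows must cross the fabric before the join.
moved = rows_to_redistribute(orders, "order_id", "customer_id")

print("rows moved when co-located:    ", colocated)
print("rows moved when not co-located:", moved)
```

Every row counted in the second case is traffic on the inter-blade fabric, and two such queries running together each generate that traffic at the same time.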

Taken from IBM
