Notes: Some preparation notes and questions for the Cloudera Certified Developer for Hadoop exam.
These are all for reviewing your knowledge.
-------------
MR and Java
--------------
Q) What are the default key and value types in the Mapper input parameters?
Ans: For mapper input, the keys are the line offsets within the file (LongWritable)
and the values are the lines themselves (Text).
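As a quick illustration (a minimal sketch, not part of the original notes; the class name and the emitted output are placeholders), a mapper reading the default TextInputFormat input has this signature:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper: the input key is the line offset (LongWritable),
    // the input value is the line itself (Text).
    public class LineLengthMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final IntWritable length = new IntWritable();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            length.set(value.getLength());   // length of the line in bytes
            context.write(value, length);    // emit (line, length)
        }
    }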
Q) What is the reason for using Hadoop's own data types, such as LongWritable and
IntWritable, instead of the Java types such as String and Integer?
Ans: Hadoop provides its own set of basic types (the Writable types) that are optimized for network serialization.
Q) What is the purpose of job.setJarByClass() method ?
Ans: It serves two purposes:
1) When we run this job on a Hadoop cluster, we package the code into a JAR
file (which Hadoop will distribute around the cluster).
2) Rather than explicitly specifying the name of the JAR file, we can pass a class to the
Job's setJarByClass() method, which Hadoop will use to locate the relevant JAR file
by looking for the JAR file containing this class.
Q) What is the purpose of FileInputFormat.addInputPath()
Ans: An input path is specified by calling the static addInputPath() method on
FileInputFormat, and it can be a single file, a directory (in which case, the input
forms all the files in that directory), or a file pattern.
Q) How do we add multiple input paths ?
Ans: As the name suggests, addInputPath() can be called more than once to use input
from multiple paths.
Q) What is the purpose of FileOutputFormat.setOutputPath()
Ans: The output path (of which there is only one) is specified by the static
setOutputPath() method on FileOutputFormat. It specifies a directory where the
output files from the reduce function are written.
Q) What happens when the output directory given to FileOutputFormat.setOutputPath() already exists?
Ans: The directory shouldn’t exist before running the job because Hadoop will complain
and not run the job.
This precaution is to prevent data loss (it can be very annoying to accidentally
overwrite the output of a long job with that of another).
Q) What do the Job methods setOutputKeyClass() and setOutputValueClass() do?
Ans: The setOutputKeyClass() and setOutputValueClass() methods control the output
types for the reduce function, and they must match what the Reducer class produces.
Q) Do we need to set Mapper Output Key and Value classes ?
Ans: The map output types default to the same types, so they do not need to be set if
the mapper produces the same types as the reducer.
However, if they are different, the map output types must be set using the
setMapOutputKeyClass() and setMapOutputValueClass() methods.
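Putting the Job methods above together, here is a minimal driver sketch (MyJobDriver, MyMapper, and MyReducer are placeholder class names, not from the original notes):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MyJobDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            job.setJarByClass(MyJobDriver.class);                   // locate the JAR via this class
            job.setJobName("My job");

            FileInputFormat.addInputPath(job, new Path(args[0]));   // may be called more than once
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // directory must not already exist

            job.setMapperClass(MyMapper.class);                     // placeholder mapper class
            job.setReducerClass(MyReducer.class);                   // placeholder reducer class

            job.setOutputKeyClass(Text.class);                      // reduce output types
            job.setOutputValueClass(IntWritable.class);
            // If the map output types differed from these, we would also call
            // job.setMapOutputKeyClass(...) and job.setMapOutputValueClass(...).

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }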
Q) What is the default InputFormat?
Ans: The input types are controlled via the input format; the default is TextInputFormat.
Q) What is the data locality optimization?
Ans: Hadoop does its best to run each map task on a node where the input data resides
in HDFS, so that it doesn't need to use valuable cluster bandwidth.
Q) Which tasks can take advantage of data locality?
Ans: Map tasks.
Note that reduce tasks don't have the advantage of data locality, as the input to a
single reduce task is normally the output from all mappers.
Q) Where does the map output go?
Ans: Map tasks write their output to the local disk, not to HDFS.
Q) Can we have zero reducers?
Ans: Yes, it's possible to have zero reduce tasks.
This can be appropriate when you don’t need the shuffle because the processing
can be carried out entirely in parallel.
Example : Parallel copy with distcp
% hadoop distcp file1 file2 -- equivalent to hadoop fs -cp file1 file2
----------------
HDFS
-------------
Q) What are the high, default (normal), and minimum replication factors for HDFS?
Ans: 10, 3, and 1 (10 is the high replication factor used for job resources such as the job JAR,
controlled by mapreduce.client.submit.file.replication; 3 is dfs.replication; 1 is dfs.namenode.replication.min).
Q) What is an uber task (YARN)?
Ans: If the job is small, the application master (AM) may choose to run the tasks in the same JVM as itself.
This happens when it judges that the overhead of allocating and running tasks in new
containers outweighs the gain to be had in running them in parallel, compared to
running them sequentially on one node.
Such a job is said to be uberized, or run as an uber task.
Q) What qualifies a job as an uber task?
Ans: By default, a small job is one that has fewer than 10 mappers, only one reducer,
and an input size that is less than the size of one HDFS block.
The parameters that decide this are mapreduce.job.ubertask.maxmaps,
mapreduce.job.ubertask.maxreduces, and mapreduce.job.ubertask.maxbytes.
Note: Uber tasks must be enabled explicitly by setting
mapreduce.job.ubertask.enable to true.
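For example (a sketch only; the numeric values below are illustrative, not the documented defaults), these properties can be set on the job's Configuration before submission:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.job.ubertask.enable", true); // must be enabled explicitly
    conf.setInt("mapreduce.job.ubertask.maxmaps", 9);       // illustrative thresholds
    conf.setInt("mapreduce.job.ubertask.maxreduces", 1);
    Job job = Job.getInstance(conf, "small job");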
Q) What are the two types of HDFS cluster nodes?
Ans: An HDFS cluster has two types of nodes operating in a master-worker pattern:
a namenode (the master) and a number of datanodes (workers).
Q) What are the namenode's components?
Ans: The namenode manages the filesystem namespace.
It maintains the filesystem tree and the metadata for all the files and directories in the tree.
This information is stored persistently on the local disk in the form of two files:
the namespace image and the edit log.
Q) What additional information does the namenode contain?
Ans: The namenode also knows the datanodes on which all the blocks for a given file are
located; however, it does not store block locations persistently, because this
information is reconstructed from data nodes when the system starts.
Q) What are the failover-management options for a namenode in the classic model (without HA)?
Ans: Hadoop provides two mechanisms:
- Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems. These writes are synchronous and atomic. The usual configuration choice is to write to local disk as well as a remote NFS mount.
- Run a secondary namenode, which despite its name does not act as a namenode. Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large.
Q) How does the secondary namenode work?
Ans: The secondary namenode usually runs on a separate physical machine because it
requires plenty of CPU and as much memory as the namenode to perform the merge.
It keeps a copy of the merged namespace image, which can be used in the event
of the namenode failing. However, the state of the secondary namenode lags that
of the primary, so in the event of total failure of the primary, data loss is almost
certain. The usual course of action in this case is to copy the namenode’s
metadata files that are on NFS to the secondary and run it as the new primary.
Q) What is block cache ?
Ans: Normally a datanode reads blocks from disk, but for frequently accessed files the
blocks may be explicitly cached in the datanode’s memory, in an off-heap block cache.
By default, a block is cached in only one datanode’s memory, although the
number is configurable on a per-file basis.
Q) What is the advantage of Block Cache ?
Ans: Job schedulers (for MapReduce, Spark, and other frameworks) can take advantage of cached blocks by running tasks on the datanode
where a block is cached, for increased read performance.
A small lookup table used in a join is a good candidate for caching.
Q) Can users control the block cache?
Ans: Users or applications instruct the namenode which files to cache (and for how long) by adding a cache directive to a cache pool.
Q) What is HDFS federation?
Ans: HDFS federation, introduced in the 2.x release series, allows a cluster to scale by
adding namenodes, each of which manages a portion of the filesystem namespace (for example, one namenode might manage all the files rooted under /user).
Q) What changes are needed for HDFS federation?
Ans: To access a federated HDFS cluster, clients use client-side mount tables to map file paths to namenodes.
This is managed in configuration using ViewFileSystem and the viewfs:// URIs.
Q) What happens when a namenode fails (without HA)?
Ans: To recover from a failed namenode in this situation, an administrator starts a new primary namenode with
one of the filesystem metadata replicas and configures datanodes and clients to use this new namenode.
Q) What are the steps for a new primary namenode to start serving?
Ans: The new namenode is not able to serve requests until it has
(i) loaded its namespace image into memory,
(ii) replayed its edit log, and
(iii) received enough block reports from the datanodes to leave safe mode.
Note: On large clusters with many files and blocks, the time it takes for a namenode to start from cold can be 30 minutes or more.
Q) What is HDFS HA (high availability)?
Ans: Hadoop 2 remedied this situation by adding support for HDFS high availability (HA). In this implementation,
there are a pair of namenodes in an active-standby configuration. In the event of the failure of the active namenode,
the standby takes over its duties to continue servicing client requests without a significant interruption.
Q) What architectural changes are needed to allow HA ?
Ans:
1) The namenodes must use highly available shared storage to share the edit log. When a standby namenode comes up,
it reads up to the end of the shared edit log to synchronize its state with the active namenode, and then continues
to read new entries as they are written by the active namenode.
2) Datanodes must send block reports to both namenodes because the block mappings are stored in a namenode’s memory, and not on disk.
3) Clients must be configured to handle namenode failover, using a mechanism that is transparent to users.
4) The secondary namenode’s role is subsumed by the standby, which takes periodic checkpoints of the active namenode’s namespace.
Q) What is the quorum journal manager (QJM)?
Ans: The QJM is a dedicated HDFS implementation of highly available shared storage, designed for the sole purpose of providing a
highly available edit log, and is the recommended choice for most HDFS installations.
Q) What are the three different modes of running Hadoop?
Ans: i) Local (standalone) mode; paths are plain local paths, e.g. /input/docs/
ii) Pseudodistributed mode; the filesystem can be accessed as hdfs://localhost/
iii) Fully distributed mode; the filesystem can be accessed as, e.g., hdfs://abc.com:8021/
Q) What does the second column represent in the following listing?
% hadoop fs -ls .
Found 2 items
drwxr-xr-x - abc supergroup 0 2015-10-04 13:22 books
-rw-r--r-- 1 abc supergroup 119 2015-10-04 13:21 temp.txt
Ans: It represents the replication factor (shown as a dash for directories, because the concept of replication does not apply to them).
Q) How do we change the replication factor?
Ans: Via the dfs.replication property in hdfs-site.xml.
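The replication of an existing file can also be changed from the command line (the path and value here are just an example):
% hadoop fs -setrep -w 2 /user/abc/temp.txt -- the -w flag waits until every block of the file has 2 replicas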
Q) What does the 'x' (execute) permission mean for HDFS files and directories?
Ans: The execute permission is ignored for a file because you can't execute a file on HDFS (unlike UNIX),
and for a directory this permission is required to access its children.
Q) What is the default security setting for file permissions?
Ans: By default, Hadoop runs with security disabled. Because clients are remote, it is possible for a client
to become an arbitrary user simply by creating an account of that name on the remote system.
Q) What is the file permission priority in HDFS?
Ans: Owner -> group -> others, checked in that order: if the client is the owner, the owner permissions apply; otherwise, if the client is a member of the group, the group permissions apply; otherwise, the other permissions apply.
Q) Who is the superuser, and what powers does it have?
Ans: The superuser is the identity of the namenode process. Permission checks are not performed for the superuser.
Q) Which component is responsible for placing block replicas during a write operation in HDFS?
Ans: DFSOutputStream splits the data being written into packets and places them on the data queue.
The data queue is consumed by the DataStreamer, which is responsible for
asking the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas.
Q) How are the blocks replicated in HDFS?
Ans: The list of datanodes given by the namenode to the DataStreamer forms a pipeline; say there are three nodes in the pipeline.
The DataStreamer streams the packets to the first datanode in the pipeline, which stores each packet and forwards
it to the second datanode in the pipeline. Similarly, the second datanode stores the packet and forwards it to the third (and last).
Q) What is the success criteria for a write operation in HDFS ?
Ans: It’s possible, but unlikely, for multiple datanodes to fail while a block is being written.
As long as dfs.namenode.replication.min replicas (which defaults to 1) are written, the write will succeed,
and the block will be asynchronously replicated across the cluster until its target replication factor is reached
(dfs.replication, which defaults to 3).
Q) What is Hadoop's replica placement strategy? (Very important.)
Ans: Hadoop’s default strategy is to place the first replica on the same node as the client
(for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes
that are too full or too busy).
The second replica is placed on a different rack from the first (off-rack), chosen at random.
The third replica is placed on the same rack as the second, but on a different node chosen at random.
Further replicas are placed on random nodes in the cluster, although the system tries to avoid placing
too many replicas on the same rack.
Q) What are hflush() and hsync(), and what is the difference?
Ans: HDFS provides a way to force all buffers to be flushed to the datanodes via the hflush() method on FSDataOutputStream.
After a successful return from hflush(), HDFS guarantees that the data written up to that point in the file has reached
all the datanodes in the write pipeline and is visible to all new readers.
Closing a file in HDFS performs an implicit hflush().
However, hflush() does not guarantee that the datanodes have written the data to disk, only that it is in their memory,
so data could be lost in the event of a power failure.
hsync() goes further and commits the buffered data to disk on each datanode (like a POSIX fsync on a file descriptor).
Because unflushed data is not guaranteed to be visible or durable, applications should call hflush() (or hsync()) at
suitable points, such as after writing a certain number of records or number of bytes.
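A minimal sketch of using these calls (the file path is a placeholder):

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"));     // placeholder path
    out.write("some record\n".getBytes(StandardCharsets.UTF_8));
    out.hflush();   // data is now visible to new readers on every datanode in the pipeline
    out.hsync();    // additionally asks each datanode to sync the data to disk
    out.close();    // close() performs an implicit hflush()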
Q) What is distcp?
Ans: distcp is a program for copying data to and from Hadoop filesystems in parallel.
distcp is implemented as a MapReduce job where the work of copying is done by the maps that run
in parallel across the cluster. There are no reducers.
Q) Give examples of distcp usage.
Ans: 1) % hadoop distcp file1 file2 -- equivalent to hadoop fs -cp file1 file2
2) % hadoop distcp dir1 dir2 -- copies the contents of dir1 into a new directory dir2;
-- if dir2 already exists, dir1 is copied under it as dir2/dir1
3) % hadoop distcp -overwrite dir1 dir2 -- overwrites dir2 with the contents of dir1
4) % hadoop distcp -update dir1 dir2 -- synchronizes dir2 with dir1, copying only the files that have changed
------------
MR Features and HDFS
-------------
Q) What are counters?
Ans: Counters are a useful channel for gathering statistics about a job.
Use a counter instead of a log message to record that a particular condition occurred, because the logs can be huge.
Q) What are the types of counters?
Ans: Broadly, there are two types:
1) Job counters: they measure job-level statistics, not values that change while a task is running.
They are maintained by the jobtracker (MR1) or the application master (YARN).
2) Task counters: they gather information about tasks over the course of their execution,
and the results are aggregated over all the tasks in a job.
Another way of classifying them is:
1) Hadoop built-in counters and 2) user-defined counters. A user-defined counter is sketched below.
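A minimal sketch of a user-defined counter (the enum name and the "bad record" condition are placeholders): counters are usually declared as a Java enum and incremented from the mapper or reducer:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class RecordCheckMapper extends Mapper<LongWritable, Text, Text, Text> {
        enum RecordQuality { MALFORMED }        // user-defined counter

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (value.getLength() == 0) {       // placeholder condition for a bad record
                context.getCounter(RecordQuality.MALFORMED).increment(1);
                return;
            }
            context.write(value, value);
        }
    }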
Q) What are the different kinds of Hadoop built-in counters?
Ans:
MapReduce task counters -- org.apache.hadoop.mapreduce.TaskCounter
Filesystem counters -- org.apache.hadoop.mapreduce.FileSystemCounter
FileInputFormat counters -- org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
FileOutputFormat counters -- org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter
Job counters -- org.apache.hadoop.mapreduce.JobCounter
Q) Name a few samplers in Hadoop.
Ans: 1) SplitSampler, which samples only the first n records in a split; it is not so good for sorted data
because it doesn't select keys from throughout the split.
2) IntervalSampler chooses keys at regular intervals through the split and is a better choice for sorted data.
3) RandomSampler is a good general-purpose sampler.
4) You can also write your own implementation of the Sampler interface.
Q) Can we send both the key and the value inside a single composite key?
Ans: Yes. We can do that by using a pair Writable such as IntPair or TextPair as the key and NullWritable as the value.
Q) What are the join techniques?
Ans: The choice of join depends on the dataset sizes.
If one dataset is large and the other one is small, the small dataset can be distributed to each node in the cluster (for example, via the distributed cache) and joined in the mapper.
If both are huge datasets, then:
Map-side join: if the join is performed by the mapper, it is called a map-side join.
A map-side join between large inputs works by performing the join before the data reaches the map function.
Reduce-side join: if it is performed by the reducer, it is called a reduce-side join.
Q) What is Side data ?
Ans: Side data can be defined as extra read-only data needed by a job to process the main dataset.
Q) What is distributed cache ?
Ans: This is a service for copying files and archives to the task nodes in time for the tasks to use them when they run.
Q) How can the distributed cache be used?
Ans: By passing generic options when you run the hadoop command:
-files for data or lookup files to be made available to the tasks
-archives for archive files (JAR, ZIP, tar, gzipped tar), which are unarchived on the task node
-libjars for JAR files to be added to the task classpath
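For example (the JAR, driver class, and file names are placeholders, and the driver is assumed to parse generic options via ToolRunner/GenericOptionsParser):
% hadoop jar myjob.jar MyDriver -files lookup.txt -libjars extra.jar input/ output/
Inside a task, lookup.txt can then be opened by name from the task's working directory.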
Q) What is the distributed cache default size, and how do we change it?
Ans: The distributed cache default size is 10 GB, and it can be changed with the property yarn.nodemanager.localizer.cache.target-size-mb.
Q) How does the distributed cache work?
Ans: Hadoop copies the files specified by the -files, -archives, and -libjars options to the distributed filesystem (normally HDFS).
Then, before a task is run, the node manager copies the files from the distributed filesystem to a local disk—the cache—so
the task can access the files.
From the task's point of view, the files are simply there, symbolically linked from the task's working directory.
In addition, files specified by -libjars are added to the task’s classpath before it is launched.
Q) When do the files in the distributed cache get deleted?
Ans: The node manager maintains a reference count for the number of tasks using each file in the cache.
Before the task has run, the file’s reference count is incremented by 1;
then, after the task has run, the count is decreased by 1.
Only when the file is not being used (when the count reaches zero) is it eligible for deletion.
Files are deleted to make room for a new file when the node’s cache exceeds a certain size—10 GB by default
—using a least-recently used policy.
Q) Are there any other options for using the distributed cache?
Ans: We can use the Java API on Job, e.g. public void addCacheFile(URI uri) and public void addCacheArchive(URI uri).
For this, the files must already be present at the given URI (typically in HDFS) when the job runs.
Note: these APIs cannot copy files from the local filesystem, whereas the generic option parser (-files, etc.) can.
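A minimal sketch of this API route (the HDFS path is a placeholder, and the file must already exist there):

    import java.net.URI;
    import org.apache.hadoop.mapreduce.Job;

    Job job = Job.getInstance();
    // The lookup file must already be in HDFS; it will be localized on each task node.
    job.addCacheFile(URI.create("hdfs://namenode/user/abc/lookup.txt#lookup"));
    // The optional #lookup fragment sets the name of the symlink in the task's working directory.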
------------------
Hadoop I/O
--------------------------------
Q) What is the error-detection code used in Hadoop?
Ans: CRC-32 is used for checksumming in Hadoop’s ChecksumFileSystem, while HDFS uses a more efficient variant called CRC-32C.
Q) For how many bytes of data does HDFS create a checksum?
Ans: The default is 512 bytes, controlled by the dfs.bytes-per-checksum property.
Q) How does HDFS check for corruption due to “bit rot” in the physical storage media?
Ans: Each datanode runs a DataBlockScanner in a background thread that periodically verifies all the blocks stored on the datanode.
Q) What is the "healing" process in HDFS for currepted blocks of data ?
Ans: Because HDFS stores replicas of blocks, it can “heal” corrupted blocks by copying one of the good replicas
to produce a new, uncorrupt replica.
Q) What is the process of replica healing?
Ans: If a client detects an error when reading a block, it reports the bad block and the datanode it was trying to read
from to the namenode before throwing a ChecksumException.
The namenode marks the block replica as corrupt so it doesn’t direct any more clients to it or try to copy
this replica to another datanode.
It then schedules a copy of the block to be replicated on another datanode, so its replication factor
is back at the expected level.
Once this has happened, the corrupt replica is deleted.
Q) What is the command for checking a file's checksum?
Ans: % hadoop fs -checksum <file>
Q) If an error is detected by ChecksumFileSystem when reading a file, what does it do ?
Ans: It will call its reportChecksumFailure() method. LocalFileSystem moves the offending file and its checksum
to a side directory on the same device called bad_files.
Administrators should periodically check for these bad files and take action on them.
Q) What is Codec ?
Ans: A codec is the implementation of a compression-decompression algorithm.
Q) Which compression approach is better in the case of large files?
Ans: For large files, you should not use a compression format that does not support splitting on the whole file,
because you lose locality and make MapReduce applications very inefficient.
Q) What is the significance of NullWritable ?
Ans: NullWritable is a special type of Writable, as it has a zero-length serialization.
No bytes are written to or read from the stream. It is used as a placeholder.
Q) When do you use NullWritable?
Ans: In MapReduce, a key or a value can be declared as a NullWritable when you don’t need to
use that position, effectively storing a constant empty value.
Q) What is block compression in a sequence file ?
Ans: Block compression compresses multiple records at once; it is therefore more compact than and
should generally be preferred over record compression because it has the opportunity to take
advantage of similarities between records.
Q) How is block compression structured?
Ans: Records are added to a block until it reaches a minimum size in bytes, defined by
the io.seqfile.compress.blocksize property; the default is one million bytes.
A sync marker is written before the start of every block. The format of a block is a field
indicating the number of records in the block, followed by four compressed fields:
the key lengths, the keys, the value lengths, and the values.
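A minimal sketch of writing a block-compressed SequenceFile (the path and the key/value types are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    Configuration conf = new Configuration();
    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            SequenceFile.Writer.file(new Path("/tmp/data.seq")),                   // placeholder path
            SequenceFile.Writer.keyClass(IntWritable.class),
            SequenceFile.Writer.valueClass(Text.class),
            SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK));  // block compression
    writer.append(new IntWritable(1), new Text("first record"));
    writer.close();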
Q) What is a MapFile?
Ans: A MapFile is a sorted SequenceFile with an index to permit lookups by key.
The index is itself a SequenceFile that contains a fraction of the keys in the map (every 128th key, by default).
Q) What is the structure of a MapFile?
Ans: The idea is that the index can be loaded into memory to provide fast lookups from the main data file,
which is another SequenceFile containing all the map entries in sorted key order.
Q) What are the constraints when writing a MapFile?
Ans: MapFile offers a very similar interface to SequenceFile for reading and writing,
but when writing with MapFile.Writer, map entries must be added in order; otherwise an IOException is thrown.
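A minimal sketch using the classic (now-deprecated) MapFile.Writer constructor (the directory name and types are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    MapFile.Writer writer = new MapFile.Writer(conf, fs, "/tmp/numbers.map",   // placeholder directory
            IntWritable.class, Text.class);
    writer.append(new IntWritable(1), new Text("one"));   // keys must be appended in sorted order
    writer.append(new IntWritable(2), new Text("two"));
    writer.close();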
Q) What are the MapFile variants?
Ans: Three variants of MapFile come with Hadoop:
1) SetFile is a specialization of MapFile for storing a set of Writable keys. The keys must be added in sorted order.
2) ArrayFile is a MapFile where the key is an integer representing the index of the element in the array
and the value is a Writable value.
3) BloomMapFile is a MapFile that offers a fast version of the get() method, especially for sparsely populated files.
The implementation uses a dynamic Bloom filter for testing whether a given key is in the map.
The test is very fast because it is in-memory, and it has a nonzero probability of false positives.
Only if the test passes (the key is present) is the regular get() method called.
Q) How do column-oriented formats work?
Ans: A column-oriented layout permits columns that are not accessed in a query to be skipped.
Consider a query on a table that needs to process only column 2.
With column-oriented storage, only the column 2 portions of the file need to be read into memory.
Q) Compare and contrast row-oriented and column-oriented formats.
Ans: Column-oriented formats work well when queries access only a small number of columns in the table.
On the other hand, row-oriented formats are appropriate when a large number of columns of a single row are needed.
---------
Avro
------------
Q) What is Avro?
Ans: Apache Avro is a language-neutral data serialization system. The project was created by Doug Cutting
to address the lack of language portability of Hadoop Writables.
Q) Give two characteristics that Avro datafiles support.
Ans: Avro datafiles support compression and are splittable, which is crucial for a MapReduce data input format.
Q) Explain how an Avro data type maps to different languages, with an example.
Ans: For example, Avro's double type is represented in C, C++, and Java by a double, in Python by a float, and in Ruby by a Float.
Q) What is the file extension for an Avro schema?
Ans: .avsc is the conventional extension for an Avro schema
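For reference, a minimal record schema in a .avsc file looks like this (the record and field names are just an example, and the order attribute on the second field illustrates the sort-order control discussed below):

    {
      "type": "record",
      "name": "StringPair",
      "doc": "A pair of strings.",
      "fields": [
        {"name": "left",  "type": "string"},
        {"name": "right", "type": "string", "order": "descending"}
      ]
    }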
Q) What is projection in Avro schema resolution?
Ans: When the records written with the writer's schema have many fields but the reader needs only a few of them,
the reader's schema can simply omit the fields it does not need. Dropping fields this way is called projection.
Q) What do aliases do in an Avro schema?
Ans: Aliases translate old (writer) names to new (reader) names, which is helpful when naming conventions differ between schemas.
Q) Which type does not have preordained rules for its sort order in the Avro spec?
Ans: Record. All types except record have preordained rules for their sort order, as described in the Avro specification,
that cannot be overridden by the user.
Q) How can a user control the sort order of the record type in Avro?
Ans: For records, you can control the sort order by specifying the order attribute for a field.
Q) What is the power of Avro in sorting ?
Ans: Avro implements efficient binary comparisons. That is to say, Avro does not have to deserialize binary data
into objects to perform the comparison, because it can instead work directly on the byte streams.