Here are some practice questions that I gathered from here and there.
I sincerely suggest that you check the materials and answer them yourself.
I found these questions challenging, so I am putting them here.
Question :
You need to perform statistical analysis in your MapReduce job and would like to call methods in the Apache Commons Math library, which is distributed as a 1.3 megabyte Java archive (JAR) file. Which is the best way to make this library available to your MapReduce job at runtime?
Answer:
- Have your system administrator place the JAR file on a Web server accessible to all cluster nodes and then set the HTTP_JAR_URL environment variable to its location.
- When submitting the job on the command line, specify the -libjars option followed by the JAR file path.
- Have your system administrator copy the JAR to all nodes in the cluster and set its location in the HADOOP_CLASSPATH environment variable before you submit your job.
- Package your code and the Apache Commons Math library into a zip file named JobJar.zip.
----------
Question :
On a cluster running MapReduce v1 (MRv1), a TaskTracker heartbeats into the JobTracker on your cluster, and alerts the JobTracker it has an open map task slot.
What determines how the JobTracker assigns each map task to a TaskTracker?
Answer:
- The location of the InputSplit to be processed in relation to the location of the node.
- The average system load on the TaskTracker node over the past fifteen (15) minutes.
- The amount of free disk space on the TaskTracker node.
- The number and speed of CPU cores on the TaskTracker node.
Question :
In a large MapReduce job with m mappers and n reducers, how many distinct copy operations will there be in the sort/shuffle phase?
Answer:
- m
- m + n (i.e., m plus n)
- m × n (i.e., m multiplied by n)
- n
Question :
For each input key-value pair, mappers can emit:
Answer:
- One intermediate key-value pair, of a different type.
- As many intermediate key-value pairs as desired, as long as all the keys have the same type and all the values have the same type.
- As many intermediate key-value pairs as desired, but they cannot be of the same type as the input key-value pair.
- As many intermediate key-value pairs as desired. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous).
Question :
In a MapReduce job with 500 map tasks, how many map task attempts will there be?
Answer:
- Between 500 and 1000.
- At most 500.
- It depends on the number of reduces in the job.
- At least 500.
Question :
Which process describes the lifecycle of a Mapper?
Answer:
- The TaskTracker spawns a new Mapper to process each key-value pair.
- The TaskTracker spawns a new Mapper to process all records in a single input split.
- The JobTracker calls the TaskTracker’s configure() method, then its map() method and finally its close() method.
- The JobTracker spawns a new Mapper to process all records in a single file.
Question :
You want to populate an associative array in order to perform a map-side join. You’ve decided to put this information in a text file, place that file into the DistributedCache and read it in your Mapper before any records are processed.
Identify which method in the Mapper you should use to implement code for reading the file and populating the associative array.
Answer:
- configure
- map
- init
- combine
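For context on the scenario above, here is a minimal sketch (old org.apache.hadoop.mapred API) of a Mapper that loads a DistributedCache file into an in-memory associative array before any records are processed. The class name, the tab-separated layout of the cached file, and the join logic are illustrative assumptions, not part of the question:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class JoinMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> lookup = new HashMap<String, String>();

  // Called once per task, before any calls to map(): load the cached file.
  public void configure(JobConf job) {
    try {
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      if (cached != null && cached.length > 0) {
        BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
        String line;
        while ((line = reader.readLine()) != null) {
          String[] parts = line.split("\t", 2);   // assumed: key<TAB>value per line
          lookup.put(parts[0], parts.length > 1 ? parts[1] : "");
        }
        reader.close();
      }
    } catch (IOException e) {
      throw new RuntimeException("Could not read DistributedCache file", e);
    }
  }

  // Map-side join: enrich each record with the value looked up in memory.
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    String[] fields = value.toString().split("\t");
    String joined = lookup.get(fields[0]);
    if (joined != null) {
      output.collect(new Text(fields[0]), new Text(joined));
    }
  }
}

The driver would make the file available with DistributedCache.addCacheFile(uri, conf) when setting up the job.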
Question :
Given a directory of files with the following structure: line number, tab character, string:
Example:
1 abialkjfjkaoasdfjksdlkjhqweroij
2 kadfjhuwqounahagtnbvaswslmnbfgy
3 kjfteiomndscxeqalkzhtopedkfsikj
You want to send each line as one record to your Mapper. Which InputFormat should you use to complete the line: conf.setInputFormat(____.class); ?
Answer:
- KeyValueFileInputFormat
- SequenceFileAsTextInputFormat
- BDBInputFormat
- SequenceFileInputFormat
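For reference on where that line is called, here is a minimal old-API driver fragment that sets the InputFormat on the JobConf. It uses KeyValueTextInputFormat purely as an illustration (the stock org.apache.hadoop.mapred class that splits each line at the first tab into a key and a value); the class name and paths are placeholders, not an assertion about the intended answer:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;

public class LineRecordsDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(LineRecordsDriver.class);
    conf.setJobName("line-records");

    // The InputFormat decides how input files are split and turned into records.
    conf.setInputFormat(KeyValueTextInputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}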
Question :
Identify the tool best suited to import a portion of a relational database every day as files into HDFS, and generate Java classes to interact with that imported data?
Answer:
- fuse-dfs
- Sqoop
- Hive
- Hue
Question :
Identify which best defines a SequenceFile.
Answer:
- A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. All keys must be of the same type, and all values must be of the same type.
- A SequenceFile contains a binary encoding of an arbitrary number of WritableComparable objects, in sorted order.
- A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous Writable objects.
- A SequenceFile contains a binary encoding of an arbitrary number of homogeneous Writable objects.
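For reference, here is a small sketch of writing a SequenceFile with the Java API; the point to notice is that the key class and value class are fixed when the writer is created. The path and the records are placeholders, and this particular createWriter overload may be deprecated in newer Hadoop releases in favor of the Writer.Option variants:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(args[0]);

    // All keys are Text and all values are IntWritable for this file.
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
    try {
      writer.append(new Text("example"), new IntWritable(1));
      writer.append(new Text("record"), new IntWritable(2));
    } finally {
      writer.close();
    }
  }
}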
Question :
You want to perform analysis on a large collection of images. You want to store this data in HDFS and process it with MapReduce but you also want to give your data analysts and data scientists the ability to process the data directly from HDFS with an interpreted high-level programming language like Python. Which format should you use to store this data in HDFS?
Answer:
- HTML
- Avro
- SequenceFiles
- JSON
Question :
You use the hadoop fs -put command to write a 300 MB file using an HDFS block size of 64 MB. Just after this command has finished writing 200 MB of this file, what would another user see when trying to access this file?
Answer:
- They would see the current state of the file, up to the last bit written by the command.
- They would see Hadoop throw a ConcurrentFileAccessException when they try to access this file.
- They would see the current state of the file, up to the last completed block.
- They would see no content until the whole file is written and closed.
Question :
Table metadata in Hive is:
Answer:
- Stored as metadata on the NameNode.
- Stored in ZooKeeper.
- Stored in the Metastore.
- Stored along with the data in HDFS.
Question :
When is the earliest point at which the reduce method of a given Reducer can be called?
Answer:
- As soon as at least one mapper has finished processing its input split.
- As soon as a mapper has emitted at least one record.
- Not until all mappers have finished processing all records.
- It depends on the InputFormat used for the job.
Question :
MapReduce v2 (MRv2/YARN) splits which major functions of the JobTracker into separate daemons? Select two.
Answer:
- Job coordination between the ResourceManager and NodeManager
- Launching tasks
- Managing file system metadata
- Job scheduling/monitoring
Question :
What data does a Reducer reduce method process?
Answer:
- All data for a given key, regardless of which mapper(s) produced it.
- All data for a given value, regardless of which mapper(s) produced it.
- All data produced by a single mapper.
- All the data in a single input file.
Question :
What is the disadvantage of using multiple reducers with the default HashPartitioner and distributing your workload across your cluster?
Answer:
- You will no longer be able to take advantage of a Combiner.
- You will not be able to compress the intermediate data.
- By using multiple reducers with the default HashPartitioner, output files may not be in globally sorted order.
- There are no concerns with this approach. It is always advisable to use multiple reducers.
Question :
Identify the utility that allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer?
Answer:
- Oozie
- Hadoop Streaming
- Flume
- Sqoop
Question :
You need to run the same job many times with minor variations. Rather than hardcoding all job configuration options in your driver code, you’ve decided to have your Driver subclass org.apache.hadoop.conf.Configured and implement the org.apache.hadoop.util.Tool interface.
Identify which invocation correctly passes mapred.job.name with a value of Example to Hadoop.
Answer:
- hadoop "mapred.job.name=Example" MyDriver input output
- hadoop setproperty mapred.job.name=Example MyDriver input output
- hadoop MyDriver mapred.job.name=Example input output
- hadoop MyDriver -D mapred.job.name=Example input output
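For context, here is a minimal sketch of a driver that subclasses Configured and implements Tool, so that ToolRunner (via GenericOptionsParser) handles generic options such as -D key=value and -libjars before run() is invoked. The class name and the input/output path handling are placeholders:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    // getConf() already contains any properties passed with -D on the command line.
    JobConf conf = new JobConf(getConf(), MyDriver.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new MyDriver(), args));
  }
}

With this pattern the job is typically launched as hadoop jar myjob.jar MyDriver -D mapred.job.name=Example input output, assuming the driver class is packaged in myjob.jar.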
Question :
All keys used for intermediate output from mappers must:
Answer:
- Implement a splittable compression algorithm.
- Be a subclass of FileInputFormat.
- Override isSplitable.
- Implement WritableComparable.
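As an illustration of such a key, here is a hypothetical custom type implementing WritableComparable; the class name and fields are made up for the example:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class YearProductKey implements WritableComparable<YearProductKey> {
  private int year;
  private String product;

  public YearProductKey() {}   // Hadoop needs a no-argument constructor.

  public YearProductKey(int year, String product) {
    this.year = year;
    this.product = product;
  }

  public void write(DataOutput out) throws IOException {
    out.writeInt(year);
    out.writeUTF(product);
  }

  public void readFields(DataInput in) throws IOException {
    year = in.readInt();
    product = in.readUTF();
  }

  // Defines the sort order applied during the shuffle.
  public int compareTo(YearProductKey other) {
    if (year != other.year) {
      return year < other.year ? -1 : 1;
    }
    return product.compareTo(other.product);
  }

  // hashCode() matters because the default HashPartitioner uses it.
  public int hashCode() {
    return 31 * year + product.hashCode();
  }

  public boolean equals(Object o) {
    if (!(o instanceof YearProductKey)) {
      return false;
    }
    YearProductKey k = (YearProductKey) o;
    return year == k.year && product.equals(k.product);
  }
}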
Question :
Your cluster’s HDFS block size is 64 MB. You have a directory containing 100 plain text files, each of which is 100 MB in size. The InputFormat for your job is TextInputFormat. Determine how many Mappers will run.
Answer:
- 100
- 64
- 640
- 200
Question :
You are developing a MapReduce job for sales reporting. The mapper will process input keys representing the year (IntWritable) and input values representing product identifiers (Text).
Identify what determines the data types used by the Mapper for a given job.
Answer:
- The key and value types specified in the JobConf.setMapInputKeyClass and JobConf.setMapInputValuesClass methods
- The data types specified in HADOOP_MAP_DATATYPES environment variable
- The mapper-specification.xml file submitted with the job determines the mapper’s input key and value types.
- The InputFormat used by the job determines the mapper’s input key and value types.
------------
Question :
You have a directory named jobdata in HDFS that contains four files: _first.txt, second.txt, .third.txt and #data.txt. How many files will be processed by the FileInputFormat.setInputPaths() command when it's given a Path object representing this directory?
Answer:
- Four, all files will be processed
- None, the directory cannot be named jobdata
- Two, file names with a leading period or underscore are ignored
- Three, the pound sign is an invalid character for HDFS file names
Question :
The Hadoop framework provides a mechanism for coping with machine issues such as faulty configuration or impending hardware failure. MapReduce detects that one or a number of machines are performing poorly and starts more copies of a map or reduce task. All the tasks run simultaneously, and the results of whichever task finishes first are used. This is called:
Answer:
- IdentityReducer
- IdentityMapper
- Default Partitioner
- Speculative Execution
Question :
Which project gives you a distributed, scalable data store that allows you random, real-time read/write access to hundreds of terabytes of data?
Answer:
- Pig
- Hue
- HBase
- Hive
Question :
You have the following key-value pairs as output from your Map task:
(the, 1)
(fox, 1)
(faster, 1)
(than, 1)
(the, 1)
(dog, 1)
How many keys will be passed to the Reducer’s reduce method?
Answer:
- Two
- Four
- Five
- Six
Question :
Which best describes what the map method accepts and emits?
Answer:
- It accepts a single key-value pair as input and can emit only one key-value pair as output.
- It accepts a list of key-value pairs as input and can emit only one key-value pair as output.
- It accepts a single key-value pair as input and emits a single key and list of corresponding values as output.
- It accepts a single key-value pair as input and can emit any number of key-value pairs as output, including zero.
Question :
Analyze each scenario below and identify which best describes the behavior of the default partitioner.
Answer:
- The default partitioner implements a round-robin strategy, shuffling the key-value pairs to each reducer in turn. This ensures an even partition of the key space.
- The default partitioner computes the hash of the key and takes that value modulo the number of reducers. The result determines the reducer assigned to process the key-value pair.
- The default partitioner assigns key-value pairs to reducers based on an internal random number generator.
- The default partitioner computes the hash of the key. Hash values between specific ranges are associated with different buckets, and each bucket is assigned to a specific reducer.
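For reference, the behavior of Hadoop's built-in HashPartitioner amounts to roughly the following, shown here as an illustrative custom partitioner in the old mapred API rather than the actual source:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class HashLikePartitioner<K, V> implements Partitioner<K, V> {

  public void configure(JobConf job) {}

  public int getPartition(K key, V value, int numReduceTasks) {
    // Mask off the sign bit so the result is non-negative, then take the
    // hash modulo the number of reducers to pick a partition.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}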
Question :
In the reducer, the MapReduce API provides you with an iterator over Writable values. What does calling the next() method return?
Answer:
- It returns a reference to the same Writable object each time, but populated with different data.
- It returns a reference to a different Writable object each time.
- It returns a reference to a Writable object from an object pool.
- It returns a reference to a Writable object. The API leaves unspecified whether this is a reused object or a new object.
Question :
Identify the MapReduce v2 (MRv2 / YARN) daemon responsible for launching application containers and monitoring application resource usage?
Answer:
- ApplicationMasterService
- ApplicationMaster
- ResourceManager
- NodeManager
Question :
A combiner reduces:
Answer:
- The number of output files a reducer must produce.
- The amount of intermediate data that must be transferred between the mapper and reducer.
- The number of input files a mapper must process.
- The number of values across different keys in the iterator supplied to a single reduce method call.
Question :
You have just executed a MapReduce job. Where is intermediate data written to after being emitted from the Mapper’s map method?
Answer:
- Into in-memory buffers that spill over to the local file system of the TaskTracker node running the Mapper.
- Into in-memory buffers on the TaskTracker node running the Mapper that spill over and are written into HDFS.
- Into in-memory buffers that spill over to the local file system (outside HDFS) of the TaskTracker node running the Reducer.
- Intermediate data is streamed across the network from the Mapper to the Reducer and is never written to disk.
Question :
You need to create a job that does frequency analysis on input data. You will do this by writing a Mapper that uses TextInputFormat and splits each value (a line of text from an input file) into individual characters. For each one of these characters, you will emit the character as a key and an IntWritable as the value. As this will produce proportionally more intermediate data than input data, which two resources should you expect to be bottlenecks?
Answer:
- Disk I/O and network I/O
- Processor and RAM
- Processor and disk I/O
- Processor and network I/O
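A minimal sketch of the kind of Mapper the question describes, emitting one (character, 1) pair per character of every input line; the class name and the use of Text for the single-character key are illustrative assumptions:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CharFrequencyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text character = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    // One intermediate pair per input character, so the intermediate data
    // is proportionally larger than the input data.
    for (int i = 0; i < line.length(); i++) {
      character.set(String.valueOf(line.charAt(i)));
      output.collect(character, ONE);
    }
  }
}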
Question :
You write a MapReduce job to process 100 files in HDFS. Your MapReduce algorithm uses TextInputFormat: the mapper applies a regular expression over input values and emits key-value pairs with the key consisting of the matching text, and the value containing the filename and byte offset. Determine the difference between setting the number of reducers to one and setting the number of reducers to zero.
Answer:
- There is no difference in output between the two settings.
- With zero reducers, no reducer runs and the job throws an exception. With one reducer, instances of matching patterns are stored in a single file on HDFS.
- With zero reducers, instances of matching patterns are stored in multiple files on HDFS. With one reducer, all instances of matching patterns are gathered together in one file on HDFS.
- With zero reducers, all instances of matching patterns are gathered together in one file on HDFS. With one reducer, instances of matching patterns are stored in multiple files on HDFS.
Question :
MapReduce v2 (MRv2/YARN) is designed to address which two issues?
Answer:
- HDFS latency.
- Ability to run frameworks other than MapReduce, such as MPI.
- Resource pressure on the JobTracker.
- Reduce complexity of the MapReduce APIs.
Question :
What types of algorithms are difficult to express in MapReduce v1 (MRv1)?
Answer:
- Algorithms that require global, shared state.
- Relational operations on large amounts of structured and semi-structured data.
- Large-scale graph algorithms that require one-step link traversal.
- Algorithms that require applying the same mathematical function to large numbers of individual binary records.
Question :
When can a reduce class also serve as a combiner without affecting the output of a MapReduce program?
Answer:
- When the signature of the reduce method matches the signature of the combine method.
- Always. Code can be reused in Java since it is a polymorphic object-oriented programming language.
- Always. The point of a combiner is to serve as a mini-reducer directly after the map phase to increase performance.
- When the types of the reduce operation’s input key and input value match the types of the reducer’s output key and output value and when the reduce operation is both commutative and associative.
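As an example of that last condition, a word-count-style summing reducer such as the sketch below is commonly reused as a combiner: addition is commutative and associative, and the input and output types match (Text/IntWritable in, Text/IntWritable out). The class name is hypothetical:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Such a class would be registered with both conf.setCombinerClass(SumReducer.class) and conf.setReducerClass(SumReducer.class).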
Question :
Workflows expressed in Oozie can contain:
Answer:
- Sequences of MapReduce jobs only; no Pig or Hive tasks or jobs. These MapReduce sequences can be combined with forks and path joins.
- Iterative repetition of MapReduce jobs until a desired answer or state is reached.
- Sequences of MapReduce and Pig jobs. These are limited to linear sequences of actions with exception handlers but no forks.
- Sequences of MapReduce and Pig jobs. These sequences can be combined with other actions including forks, decision points, and path joins.
Question :
You wrote a map function that throws a runtime exception when it encounters a control character in input data. The input supplied to your mapper contains twelve such characters in total, spread across five file splits. The first four file splits each have two control characters and the last split has four control characters.
Identify the number of failed task attempts you can expect when you run the job with mapred.max.map.attempts set to 4:
Answer:
- You will have five failed task attempts
- You will have seventeen failed task attempts
- You will have twenty failed task attempts (each of the five splits fails four times).
- You will have forty-eight failed task attempts
Question :
You want to understand more about how users browse your public website, such as which pages they visit prior to placing an order. You have a farm of 200 web servers hosting your website. How will you gather this data for your analysis?
Answer:
- Import all users’ clicks from your OLTP databases into Hadoop, using Sqoop.
- Channel these clickstreams into Hadoop using Hadoop Streaming.
- Write a MapReduce job, with the web servers for mappers, and the Hadoop cluster nodes for reducers.
- Ingest the server web logs into HDFS using Flume. -- Better Approach
Question :
Which best describes how TextInputFormat processes input files and line breaks?
Answer:
- Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReaders of both splits containing the broken line.
- Input file splits may cross line breaks. A line that crosses file splits is ignored.
- Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the beginning of the broken line.
- The input file is split exactly at the line breaks, so each RecordReader will read a series of complete lines.
Question :
In a MapReduce job, you want each of your input files processed by a single map task. How do you configure a MapReduce job so that a single map task processes each input file regardless of how many blocks the input file occupies?
Answer:
- Write a custom MapRunner that iterates over all key-value pairs in the entire file.
- Set the number of mappers equal to the number of input files you want to process.
- Write a custom FileInputFormat and override the method isSplitable to always return false.
- Increase the parameter that controls minimum split size in the job configuration.
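For context, here is what overriding isSplitable looks like in the old mapred API, extending TextInputFormat; the class name is illustrative:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Returning false means each file becomes a single split, so one map task
// processes the whole file no matter how many HDFS blocks it spans.
public class NonSplittableTextInputFormat extends TextInputFormat {
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }
}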