Using Hadoop for Parallel Processing rather than Big Data
I manage a small team of developers, and at any given time we have several ongoing (one-off) data projects that could be considered "embarrassingly parallel" - these involve running a single script on a single computer for several days. A classic example would be processing several thousand PDF files to extract the key text and place it into a CSV file for later insertion into a database.
We are doing enough of these types of tasks now that we started to investigate developing a simple job queue system using RabbitMQ and a few spare servers (with an eye to using Amazon SQS/S3/EC2 for projects that need larger scaling).
In searching for examples of others doing this, I keep coming across the classic Hadoop New York Times example:
The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 (not including bandwidth).
Which sounds perfect, so I researched Hadoop and map/reduce.
But I can't work out how they did it, or why they did it.
Converting TIFFs into PDFs is surely not a map/reduce problem? Wouldn't a simple job queue have been better?
The other classic Hadoop example, the "wordcount" from the Yahoo Hadoop tutorial, seems a perfect fit for map/reduce, and I can see why it is such a powerful tool for Big Data.
I don't understand how these "embarrassingly parallel" tasks are put into the map/reduce pattern.
TL;DR
This is a conceptual question: I want to know how I would fit a task of "processing several thousand PDF files to extract the key text and place it into a CSV file" into the map/reduce pattern.
If you know of examples, that would be perfect; I'm not asking you to write it for me.
(Note: we have code to process the PDFs; I'm not asking for that - it's just an example task. I'm asking about putting processes like it into the Hadoop map/reduce pattern when there are no clear "map" or "reduce" elements to the task.)
Cheers!
Your thinking is right.
The examples mentioned above used only part of the solution that Hadoop offers: they used the parallel computing ability of Hadoop plus the distributed file system. It is not necessary to have a reduce step; you may have no data interdependency between the parallel processes that are run, and in that case you can eliminate the reduce step.
I think your problem also fits into the Hadoop solution domain.
You have huge data - a huge number of PDF files and a long-running job.
You can process these files in parallel by placing them on HDFS and running a MapReduce job. Your processing time theoretically improves with the number of nodes in your cluster. If you do not see a need to aggregate the data sets produced by the individual tasks, you do not need a reduce step; otherwise you will need to design a reduce step as well.
The point here is that even if you do not need a reduce step, you are still leveraging the parallel computing ability of Hadoop, and you are equipped to run the job on inexpensive hardware. A minimal sketch of such a map-only job is shown below.
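As a rough sketch (not the actual NYT setup), one way to express your PDF task as a map-only job is Hadoop Streaming with a Python mapper. Here I'm assuming the input is a text file of PDF paths on HDFS, one per line, and extract_key_text is a placeholder for your existing extraction code; with the reducer count set to zero there is no reduce step at all:

    #!/usr/bin/env python
    # mapper.py - hypothetical map-only Hadoop Streaming mapper (a sketch).
    # stdin:  one PDF path per line (the file list stored on HDFS)
    # stdout: one CSV row per PDF, ready for later insertion into a database
    import csv
    import sys

    def extract_key_text(pdf_path):
        # Placeholder for your existing PDF-processing code.
        # Should return a list of field values extracted from the PDF.
        raise NotImplementedError

    writer = csv.writer(sys.stdout)
    for line in sys.stdin:
        pdf_path = line.strip()
        if not pdf_path:
            continue
        try:
            writer.writerow([pdf_path] + list(extract_key_text(pdf_path)))
        except Exception as exc:
            # Log the failure and keep going rather than killing the whole job.
            sys.stderr.write("failed %s: %s\n" % (pdf_path, exc))

You would then submit it with the streaming jar and zero reducers, something along the lines of: hadoop jar hadoop-streaming.jar -D mapreduce.job.reduces=0 -input pdf_list.txt -output results -mapper mapper.py -file mapper.py. Hadoop splits the input list across the cluster, runs the mapper on each node, and the map output in the results directory is your CSV.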