mapreduce - Hadoop, hardware and bioinformatics

March 15, 2013

We're about to buy new hardware to run our analyses, and are wondering if we're making the right decisions.

The setting:
We're a bioinformatics lab handling DNA sequencing data. The biggest issue our field has is the amount of data, rather than the compute. A single experiment can go into the 10s-100s of GB, and we typically run different experiments at the same time. Obviously, MapReduce approaches are interesting (see http://abhishek-tiwari.com/2010/08/mapreduce-and-hadoop-algorithms-in-bioinformatics-papers.html), but not all of our software uses that paradigm. Also, some software uses ASCII files as in/output while other software works with binary files.

What we might be buying:
The machine we might be buying is a server with 32 cores and 192 GB of RAM, linked to NAS storage (>20 TB). This seems an interesting setup for many of our (non-MapReduce) applications, but will such a configuration prevent us from implementing Hadoop/MapReduce/HDFS in a meaningful way?

Many thanks,
Jan.

You have an interesting configuration. What is the disk I/O of the NAS storage you would be using?

Make your decision based on the following: the MapReduce paradigm is used to solve the problem of handling large amounts of data. Basically, RAM is more expensive than disk storage, so you cannot hold all the data in RAM. Disk storage allows you to store large amounts of data at a cheaper cost, but the speed at which you can read data from disks is not very high. How does MapReduce solve this problem? It distributes the data over multiple machines, each reading from its own disk. The speed at which you can then read data in parallel is greater than what you could achieve with a single storage disk. Suppose the disk I/O speed is 100 Mbps. With 100 machines you can read data at 100 * 100 Mbps = 10 Gbps.
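The arithmetic above can be sketched in a few lines of Python. The 100 Mbps per-disk rate and the 100 machines come from the example; the 100 GB dataset size is an illustrative value taken from the "10s-100s of GB" range in the question, and the function names are hypothetical. The model assumes ideal linear scaling, reads "Mbps" as megabits per second, and ignores network and scheduling overhead, so treat the results as upper bounds rather than predictions.

```python
def aggregate_throughput_mbps(per_disk_mbps: float, machines: int) -> float:
    """Ideal combined read rate when each machine reads its own local disk in parallel."""
    return per_disk_mbps * machines


def read_time_seconds(dataset_gb: float, throughput_mbps: float) -> float:
    """Time to read the dataset once, using decimal units (1 GB = 8000 megabits)."""
    dataset_megabits = dataset_gb * 8000
    return dataset_megabits / throughput_mbps


# Assumed values: 100 Mbps per disk (from the example), 100 GB dataset (illustrative).
single_disk = aggregate_throughput_mbps(100, 1)
cluster = aggregate_throughput_mbps(100, 100)

print("single disk:", read_time_seconds(100, single_disk), "seconds")
print("100 machines:", read_time_seconds(100, cluster), "seconds")
```

The same calculation applies to a single server attached to a NAS: if all cores read through one storage link, the per-disk rate in the model is replaced by the rate of that shared link, and adding cores does not increase it.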

Typically, processor speed is not the bottleneck. Rather, disk I/O is the big bottleneck when processing large amounts of data.

I have a feeling that your setup may not be very efficient for this.
