I want to merge multiple CSV files by a specific condition using Perl
I have multiple CSV files and want to merge them. Sample CSV files are shown below:
m1dl1_interpro_sum.csv
ipr017690,outer membrane, omp85 target,821
ipr014729,rossmann,327
ipr013785,aldolase,304
ipr015421,pyridoxal,224
ipr003594,atpase,179
ipr000531,tonb receptor,150
ipr018248,ef-hand,10

m1dl2_interpro_sum.csv
ipr017690,outer membrane, omp85 target,728
ipr013785,aldolase,300
ipr014729,rossmann,261
ipr015421,pyridoxal,189
ipr011991,winged,113
ipr000873,amp-dependent synthetase/ligase,111

m1dl3_interpro_sum.csv
ipr017690,outer membrane,905
ipr013785,aldolase,367
ipr014729,rossmann,338
ipr015421,pyridoxal,271
ipr003594,atpase,158
ipr018248,ef-hand,3
To merge these files, I have tried the following code:
    @argv = <merge_csvfiles/*.csv>;
    print @argv[0],"\n";
    open(page,">outfile.csv") || die"can't open outfile.csv\n";
    while($i<scalar(@argv))
    {
        open(file,@argv[$i]) || die"can't open ...@argv[$i]...\n";
        $data.=join("",<file>);
        close file;
        print"file completed...",$i+1,"\n";
        $i++;
    }
    @data=split("\n",$data);
    @data2=@data;
    print scalar(@data);
    for($i=0;$i<scalar(@data);$i++)
    {
        @id1=split(",",@data[$i]);
        $id_1=@id1[0];
        @data[$j]=~s/\n//;
        if(@data[$i] ne "")
        {
            print page "\n@data[$i],";
            for($j=$i+1;$j<scalar(@data2);$j++)
            {
                @id2=split(",",@data2[$j]);
                $id_2=@id2[0];
                if($id_1 eq $id_2)
                {
                    @data[$j]=~s/\n//;
                    print page "@data2[$j],";
                    @data2[$j]="";
                    @data[$j]="";
                    print "match found @ ",$i+1," , ",$j+1,"\n";
                }
            }
        }
        print $i+1,"\n";
    }
The merge_csvfiles folder contains the files.
The output of the above code is:
ipr017690,outer membrane,821,ipr017690,outer membrane ,728,ipr017690,outer membrane,905
ipr014729,rossmann,327,ipr014729,rossmann,261,ipr014729,rossmann,338
ipr013785,aldolase,304,ipr013785,aldolase,300,ipr013785,aldolase,367
ipr015421,pyridoxal,224,ipr015421,pyridoxal,189,ipr015421,pyridoxal,271
ipr003594,atpase,179,ipr003594,atpase,158
ipr000531,tonb receptor,150
ipr018248,ef-hand,10,ipr018248,ef-hand,3
ipr011991,winged,113
ipr000873,amp-dependent synthetase/ligase
But I want the output in the following format:
ipr017690,outer membrane,821,ipr017690,outer membrane ,728,ipr017690,outer membrane,905
ipr014729,rossmann,327,ipr014729,rossmann,261,ipr014729,rossmann,338
ipr013785,aldolase,304,ipr013785,aldolase,300,ipr013785,aldolase,367
ipr015421,pyridoxal,224,ipr015421,pyridoxal,189,ipr015421,pyridoxal,271
ipr003594,atpase,179,0,0,0,ipr003594,atpase,158
ipr000531,tonb receptor,150,0,0,0,0,0,0
ipr018248,ef-hand,10,0,0,0,ipr018248,ef-hand,3
0,0,0,ipr011991,winged,113,0,0,0
0,0,0,ipr000873,amp-dependent synthetase/ligase,111,0,0,0
Has anyone got an idea how I can do this? Thanks for your help.
As mentioned in Miguel Prz's comment, you haven't explained how you want the merge to be performed, but, judging by the "desired output" sample, it appears that you want to concatenate the lines with matching IDs from the 3 input files into a single line in the output file, with "0,0,0" taking the place of any lines which don't appear in a given file.
So, then:
    #!/usr/bin/env perl

    use strict;
    use warnings;

    my @input_files = glob 'merge_csvfiles/*.csv';
    my %data;    # ID => array of lines, one slot per input file

    # First loop: read every file and file each line under its ID.
    for my $i (0 .. $#input_files) {
      open my $infh, '<', $input_files[$i]
        or die "Failed to open $input_files[$i]: $!";
      while (<$infh>) {
        chomp;
        my $id = (split ',', $_, 2)[0];
        $data{$id}[$i] = $_;
      }
      print "Input file read: $input_files[$i]\n";
    }

    # Second loop: write one merged line per ID, padding missing files.
    open my $outfh, '>', 'outfile.csv' or die "Failed to open outfile.csv: $!";
    for my $id (sort keys %data) {
      my @merge_data;
      for my $i (0 .. $#input_files) {
        push @merge_data, $data{$id}[$i] || '0,0,0';
      }
      print $outfh join(',', @merge_data) . "\n";
    }
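The first loop collects the lines from each file into a hash of arrays. The hash keys are the IDs, so that lines for the same ID from different files are kept together, and the value for each key is (a reference to) an array of the lines associated with that ID in each file; using an array for this allows us to keep track of which values are missing as well as those which are present.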
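As a rough illustration of that data structure, here is a minimal sketch that builds the same kind of hash of arrays from in-memory data instead of files. The @files array and its contents are hypothetical stand-ins for the three input files, trimmed down to a few lines.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical stand-ins for the contents of three input files.
my @files = (
  [ 'ipr003594,atpase,179', 'ipr000531,tonb receptor,150' ],
  [ 'ipr011991,winged,113' ],
  [ 'ipr003594,atpase,158' ],
);

my %data;
for my $i (0 .. $#files) {
  for my $line (@{ $files[$i] }) {
    my $id = (split ',', $line, 2)[0];
    $data{$id}[$i] = $line;    # slot $i belongs to file number $i
  }
}

# Slots that were never assigned stay undefined, which is how a line
# that is missing from a given file shows up.
for my $id (sort keys %data) {
  my @slots = map { defined $data{$id}[$_] ? 'present' : 'missing' } 0 .. $#files;
  print "$id: @slots\n";
}
```

With this sample data, ipr003594 should be reported as present, missing, present, because it appears in the first and third stand-in files only.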
The second loop then takes the keys of the hash (in alphabetical order) and, for each one, creates a temporary array of the values associated with that ID, substituting "0,0,0" for any missing values, joins them into a single string, and prints that to the output file.
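The padding step can be tried in isolation with a sketch like the one below. The merge_row helper is hypothetical and not part of the code above. It also swaps || for the defined-or operator // (available from Perl 5.10), which only substitutes when a slot is undefined; with || a slot holding an empty string or a lone "0" would be replaced as well, although that should not matter for data like the samples here.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: build one output row from the per-file slots of one ID.
sub merge_row {
  my ($slots, $file_count) = @_;
  my @merge_data;
  for my $i (0 .. $file_count - 1) {
    # // (defined-or) substitutes only when the slot is undefined
    push @merge_data, $slots->[$i] // '0,0,0';
  }
  return join ',', @merge_data;
}

# Present in files 1 and 3, missing from file 2.
print merge_row([ 'ipr003594,atpase,179', undef, 'ipr003594,atpase,158' ], 3), "\n";

# Present in file 2 only; the third slot was never assigned at all.
print merge_row([ undef, 'ipr011991,winged,113' ], 3), "\n";
```

The file count is passed in explicitly because the slot array for an ID can be shorter than the number of files when the ID is missing from the last ones.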
The results, in outfile.csv, are:
ipr000531,tonb receptor,150,0,0,0,0,0,0
0,0,0,ipr000873,amp-dependent synthetase/ligase,111,0,0,0
ipr003594,atpase,179,0,0,0,ipr003594,atpase,158
0,0,0,ipr011991,winged,113,0,0,0
ipr013785,aldolase,304,ipr013785,aldolase,300,ipr013785,aldolase,367
ipr014729,rossmann,327,ipr014729,rossmann,261,ipr014729,rossmann,338
ipr015421,pyridoxal,224,ipr015421,pyridoxal,189,ipr015421,pyridoxal,271
ipr017690,outer membrane, omp85 target,821,ipr017690,outer membrane, omp85 target,728,ipr017690,outer membrane,905
ipr018248,ef-hand,10,0,0,0,ipr018248,ef-hand,3
Edit: Added explanations requested by the OP in comments

Can you explain to me the working of $id = (split ',', $_, 2)[0]; and $# in the program?
my $id = (split ',', $_, 2)[0]; gets the text prior to the first comma in the last line of text that was read:

- Because I didn't specify a variable to put the data in, while (<$infh>) reads it into the default variable $_.
- split ',', $_, 2 splits the value of $_ into a list of comma-separated fields. The 2 at the end tells it to produce at most 2 fields; the code will work fine without the 2, but, since I only need the first field, splitting into more parts isn't necessary.
- Putting (...)[0] around the split command treats the returned list of fields like an (anonymous) array and returns its first element (Perl calls this a list slice). It's the same as if I'd written my @fields = split ',', $_, 2; my $id = $fields[0];, but shorter and without the extra variable.

$#array returns the highest-numbered index in the array @array, so for $i (0 .. $#array) means "loop over the indexes for all elements in @array". (Note that, if I hadn't needed the value of the index counter, I would have instead looped over the array's data directly, by using for $filename (@input_files), but it would have been less convenient to keep track of the missing values if I'd done it that way.)
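To see both constructs in action outside the merge script, a small sketch along these lines may help. The variable names @two_fields, @all_fields and @array are made up for the illustration; the sample line is taken from the first input file.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $line = 'ipr017690,outer membrane, omp85 target,821';

my @two_fields = split ',', $line, 2;    # limit 2: ID, then the rest unsplit
my @all_fields = split ',', $line;       # no limit: split at every comma
my $id         = (split ',', $line, 2)[0];    # list slice: first field only

print "id: $id\n";
print "rest: $two_fields[1]\n";
print "field count without limit: ", scalar(@all_fields), "\n";

my @array = ('a', 'b', 'c');
print "highest index: ", $#array, "\n";    # one less than the element count

# Index-based loop, as used when the position itself matters.
for my $i (0 .. $#array) {
  print "$i => $array[$i]\n";
}
```

The sample line has three commas, so the unlimited split yields four fields, while the limited split stops after the first comma and leaves the description and count together in the second field.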