Parsing delimited files and creating a hash that excludes certain fields
I have inherited an application that, in part, handles processing many large delimited files and merging them into a database. In order to speed up the process of chugging through these files, I create a file that stores the MD5 hash of each row. When the file comes in the next day, I compare the MD5 value of each row being parsed against a set object containing the MD5 values gathered the last time processing occurred. This works great, and has reduced the time it takes to upload to our database by 99%.
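For illustration, a minimal sketch of that scheme in Python (the question is language agnostic, and the file names here are my own invention):

    import hashlib

    def load_previous_hashes(path):
        """Load the MD5 values written by the previous run into a set."""
        try:
            with open(path) as f:
                return {line.strip() for line in f}
        except FileNotFoundError:
            return set()

    def new_rows(data_path, hash_path):
        """Yield only the rows whose MD5 was not seen on the last run."""
        seen = load_previous_hashes(hash_path)
        with open(data_path) as data, open(hash_path + ".new", "w") as out:
            for row in data:
                digest = hashlib.md5(row.encode("utf-8")).hexdigest()
                out.write(digest + "\n")  # becomes tomorrow's comparison set
                if digest not in seen:
                    yield row  # only unseen rows go on to the database merge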
Except for one ridiculous, and rather large (300MB), file. This one has a timestamp of when the file was created as the first field in each row. Since that timestamp changes daily, every row hashes differently from the last run, and the MD5 comparison never matches anything. It's dumb, and I can't ask the sender to change the format.
My question is not how to deal with this problem at a low level, though suggestions are welcome!
This app is running in a Linux environment, so I could run this...
cut -d'|' -f 2- stupid_file.dat > file.dat
...and be on my merry way processing the file. That's the plan if this thread doesn't work out.
What I am hoping for is a higher-level answer on different ways to approach this problem; it is language agnostic. The setup I have is a "definition" for each file that comes in, stored in the database, from which I use the delimiter, expected filename, compression type (if any), etc.
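To make that setup concrete, a hypothetical definition record might look like this (the field names are my own, not the actual schema):

    # One "definition" row per incoming file, loaded from the database.
    # Field names are illustrative only.
    definition = {
        "expected_filename": "stupid_file.dat",
        "delimiter": "|",
        "compression": None,  # e.g. "gzip" if the sender compresses the file
    }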
One thought I had to get the ball rolling was to add a column, let's call it hash_ignore, to the definition table to indicate which fields I want to ignore. I would pass the hashing function the string, do whatever splitting/joining is necessary, and hash the result. The column would store something like "1" to ignore the first value, or "2-3,5-" to include the first and fourth values only. The function would look like:
hashValue = hashDataRow(rowData, hashIgnoreValue);
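A sketch of what such a function could look like in Python, assuming the ignore spec is a comma-separated list of 1-based field numbers and open-ended ranges like "2-3,5-" (hashDataRow becomes hash_data_row here):

    import hashlib

    def parse_ignore_spec(spec, field_count):
        """Expand "1" or "2-3,5-" into a set of 0-based field indexes to drop."""
        ignored = set()
        for part in spec.split(","):
            if "-" in part:
                start, _, end = part.partition("-")
                lo = int(start)
                hi = int(end) if end else field_count  # open-ended "5-" runs to the last field
                ignored.update(range(lo - 1, hi))
            else:
                ignored.add(int(part) - 1)
        return ignored

    def hash_data_row(row_data, hash_ignore, delimiter="|"):
        """MD5 the row with the ignored fields removed."""
        fields = row_data.split(delimiter)
        if hash_ignore:
            ignored = parse_ignore_spec(hash_ignore, len(fields))
            fields = [f for i, f in enumerate(fields) if i not in ignored]
        return hashlib.md5(delimiter.join(fields).encode("utf-8")).hexdigest()

For the problem file, the definition would simply store "1" so the leading timestamp never influences the hash.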
would "acceptable" way go this, or overcomplicating issue? wonder if else has run similar situation when dealing parsing delimited files.
I ended up using the idea I proposed in the question. I added a field to the database that allows me to adjust which fields in the delimited file are used to create the hash. I'm willing to post a few snippets of code if anyone needs a similar solution.
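In that spirit, a minimal sketch of how the pieces might fit together, reusing the hypothetical hash_data_row and definition layout from above:

    # Hypothetical wiring: the definition row now carries the ignore spec.
    definition = {"delimiter": "|", "hash_ignore": "1"}  # "1" drops the leading timestamp

    with open("stupid_file.dat") as f:
        for row in f:
            digest = hash_data_row(row.rstrip("\n"),
                                   definition["hash_ignore"],
                                   definition["delimiter"])
            # digest is now stable day-to-day; compare it against the previous
            # run's set exactly as in the sketch at the top of the question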