Friday, September 23, 2011

Bash Shell: Working with large text files

When working with multi-GB text files, I use these commands:

1. Get the first line, which often contains the column names, and dump it into a small text file

uki $ head -n 1 source_file_name.txt > header_line.txt
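
Once the header is in its own file, it can help to list the column names one per line with their positions. This is only a sketch: demo_source.txt is a made-up stand-in file, and the comma delimiter is an assumption, so adjust -F to match your data.

```shell
# Tiny stand-in for the real file (hypothetical name and contents).
printf 'id,name,city\n1,alpha,Oslo\n2,beta,Lima\n' > demo_source.txt

# Save the header line, as in step 1.
head -n 1 demo_source.txt > header_line.txt

# Print each column name with its 1-based position.
# Assumes a comma delimiter; change -F for tabs or pipes.
awk -F',' '{ for (i = 1; i <= NF; i++) print i ": " $i }' header_line.txt
```

The positions are handy later if you want to pull single columns out with cut or awk.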

2. Get the first record after the header line and dump it into a small text file

uki $ head -n 2 source_file_name.txt | tail -n 1 > first_data_line.txt
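
The same line can also be pulled with a single sed command instead of a head and tail pipeline. This is a possible alternative, not a required change, and demo_source.txt is again a made-up stand-in for the real file.

```shell
# Tiny stand-in for the real file (hypothetical name and contents).
printf 'id,name\n1,alpha\n2,beta\n3,gamma\n' > demo_source.txt

# Print only line 2, then quit so sed does not read the rest of the file.
sed -n '2{p;q;}' demo_source.txt > first_data_line.txt

cat first_data_line.txt
```

Changing the 2 to any other line number pulls that line instead, and the q makes sed stop reading as soon as it has printed it.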

3. Finally, when developing against large files, I take a sample of 1000 records (out of millions) to speed up development. I use 1000 because that is the default row limit for SELECT * in MySQL tools such as MySQL Workbench, but you can use any other number. I would not go too small, as you may not catch memory leak errors. The command below takes the first 2500 lines and keeps the last 1000 of them, that is, lines 1501 through 2500. The number 2500 is arbitrary, and I change it occasionally to pull a different sample, since you do want to sample your data in different places.

uki $ head -n 2500 source_file_name.txt | tail -n 1000 > sample_1000_records.txt
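
If you change the sample position often, it may be easier to keep the two numbers in variables and to put the header line back on top of the sample so the column names travel with it. This is a rough sketch: END, COUNT, demo_source.txt and sample_with_header.txt are names invented for this example.

```shell
# Stand-in file: a header line followed by 5000 numbered records.
{ echo 'value'; seq 1 5000; } > demo_source.txt

END=2500     # last line of the window; change this to sample elsewhere
COUNT=1000   # number of records to keep

# Header first, then the last COUNT lines of the first END lines.
{
  head -n 1 demo_source.txt
  head -n "$END" demo_source.txt | tail -n "$COUNT"
} > sample_with_header.txt

# Show the header plus the first two sampled records.
head -n 3 sample_with_header.txt
```

Keep END greater than COUNT, otherwise the header line itself ends up inside the sampled window and appears twice.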

Resulting files: