Curtis Jackson February 2016

Reading Text File and Using 2D Vector

I am using the method below to read a large space-delimited text file (about 900 MB). It took 879 s to load the data into memory. Is there a more efficient way to read the file?

A related question: is it a good idea to store such a huge data set in a 2D vector?

Here is my code:

void Grid::loadGrid(const char* filePathGrid)
{
    // 2D vector to contain the matrix
    vector<vector<float>> data;

    unsigned nrows, ncols;
    double xllcorner, yllcorner;
    int cellsize, nodataValue;
    const int nRowHeader = 6;
    string line, strtmp;

    ifstream DEMFile;
    DEMFile.open(filePathGrid);
    if (DEMFile.is_open())
    {           
        // read the header (6 lines)
        for (int index = 0; index < nRowHeader; index++)
        {
            getline(DEMFile, line); 
            istringstream  ss(line);
            switch (index)
            {
                case 0: 
                    while (ss >> strtmp)
                    {                       
                        istringstream(strtmp) >> ncols;                     
                    }
                    break;
                case 1:
                    while (ss >> strtmp)
                    {
                        istringstream(strtmp) >> nrows;                     
                    }
                    break;
                case 2: 
                    while (ss >> strtmp)
                    {
                        istringstream(strtmp) >> xllcorner;                     
                    }
                    break;                  
                case 3:
                    while (ss >> strtmp)
                    {
                        istringstream(strtmp) >> yllcorner;                     
                    }
                    break;                      
                case 4:
                    while (ss        

Answers


Deshawn Dawkins February 2016

With respect to your second question:

I would have chosen to use a one-dimensional vector, and then index it by (row*ncols+col).

This will at least reduce memory consumption, but it may also have a significant impact on speed.

I don't remember whether the standard endorses a 'vector of vectors' as an idiom, but unless that case gets special handling, there is a risk of too much copying and memory reallocation going on.
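
To make the (row*ncols+col) indexing concrete, here is a minimal sketch. The class name FlatGrid and its members are hypothetical and not taken from the posted code; it only assumes the grid is stored row by row in a single vector<float>.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): a row-major grid
// kept in one contiguous std::vector<float> instead of a vector of vectors.
class FlatGrid {
public:
    FlatGrid(std::size_t nrows, std::size_t ncols, float fill = 0.0f)
        : nrows_(nrows), ncols_(ncols), cells_(nrows * ncols, fill) {}

    // Element access via the row*ncols+col formula from the answer above.
    float& at(std::size_t row, std::size_t col) {
        assert(row < nrows_ && col < ncols_);
        return cells_[row * ncols_ + col];
    }
    float at(std::size_t row, std::size_t col) const {
        assert(row < nrows_ && col < ncols_);
        return cells_[row * ncols_ + col];
    }

    std::size_t rows() const { return nrows_; }
    std::size_t cols() const { return ncols_; }

    // Read-only view of the underlying storage, e.g. for bulk I/O.
    const std::vector<float>& raw() const { return cells_; }

private:
    std::size_t nrows_;
    std::size_t ncols_;
    std::vector<float> cells_;  // single allocation of nrows * ncols floats
};
```

With this layout, FlatGrid g(nrows, ncols) makes one allocation up front, and g.at(r, c) replaces data[r][c]. Whether it is actually faster for this file would still need to be measured.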


jgreve February 2016

Not so much an answer as additional questions (these don't format well for a comment).

Observation: whether a vector or something else is better for in-memory storage is hard to say without knowing more about the optimization potential; some of the data-related questions below touch on that.

Questions about timing:

Can you modify your timing logic to read header values, then time the following scenarios:

1) Read-only: just pull each line into memory. Disable the grid
   parsing & assignment part. The goal is to benchmark raw reads on the file.
   Also, no "resize" operations on your "data" member.
2) Memory allocation: just read the headers and "resize" the "data" member;
   don't loop through the remainder of the file.
3) Everything (code as posted).

It seems to me that a file under 1 GB in size will be cached by the operating system once it has been read.
So...
I'd encourage you to run each of the above 5 times or so and see whether subsequent runs are consistent. (If you don't check for that, it might look like you got a big speedup from a minor change to the code when it was really just the data file being cached by the OS, and your "gains" evaporate with the next reboot.)
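
As a rough sketch of how scenarios 1 and 2 could be timed separately: the names Timing, timeRawRead and timeAllocation are hypothetical, and the allocation step assumes the posted code sizes its 2D vector with resize, as mentioned above. It uses std::chrono::steady_clock for wall-clock intervals.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helpers for illustration; none of these names are from the thread.
struct Timing {
    std::size_t count;  // lines read (scenario 1) or cells allocated (scenario 2)
    double seconds;     // elapsed wall-clock time
};

// Scenario 1: pull every line into memory; no parsing, no resizing.
Timing timeRawRead(const std::string& path) {
    const auto start = std::chrono::steady_clock::now();
    std::ifstream in(path);
    std::string line;
    std::size_t lines = 0;
    while (std::getline(in, line)) {
        ++lines;
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return {lines, elapsed.count()};
}

// Scenario 2: only size the 2D vector; the file is not touched.
Timing timeAllocation(std::size_t nrows, std::size_t ncols) {
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::vector<float>> data;
    data.resize(nrows, std::vector<float>(ncols));
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    assert(data.size() == nrows);  // sanity check on the allocation
    return {nrows * ncols, elapsed.count()};
}
```

Calling timeRawRead several times in a row on the same file and comparing the seconds field would show how much of the first run was disk I/O rather than parsing.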

Questions about the data file...

To paraphrase, the data file looks like a complete dump of every value in the grid.

Example: in addition to the 6 "header" lines, a 100 x 200 "grid" will have 100 lines in the file with 200 values on each line. So 6 + 100 = 106 lines, holding 100*200 = 20,000 values.

You mentioned a 900 MB data file.

I'll make some assumptions, just for the fun of it.
If your values are consistently formatted (e.g. "0.000000" thru "1.000000") that means 8 chars per value.


Assuming simple encoding (1 byte per character) then you'll fit something like a 10,000^2 grid in 900 MB.

Ignoring the header "lines" and end-of-line delims (which will just be rounding errors):
1kb has 1,024 char
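
The back-of-envelope estimate above can be written out as a small sketch. The function names are hypothetical. It assumes 900 MB means 900 * 1024 * 1024 bytes and that each value costs 9 bytes (the 8 characters mentioned above plus one delimiter, which is an added assumption).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: how many values fit in a text file if every value
// occupies the same number of bytes (digits plus delimiter).
std::uint64_t estimateValueCount(std::uint64_t fileBytes,
                                 std::uint64_t bytesPerValue) {
    assert(bytesPerValue > 0);
    return fileBytes / bytesPerValue;
}

// Side length of a square grid holding that many values, rounded down.
std::uint64_t estimateSquareSide(std::uint64_t valueCount) {
    return static_cast<std::uint64_t>(
        std::sqrt(static_cast<double>(valueCount)));
}
```

Under those assumptions, estimateValueCount(900ULL * 1024 * 1024, 9) gives 104,857,600 values and estimateSquareSide of that gives 10,240, which matches the "something like 10,000^2" figure. Stored as 4-byte floats, that many values would take about 400 MB in memory.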


Curtis Jackson February 2016

Great answers. I think using a binary format instead of plain text will give me a great performance boost.

I had someone send me this code as well. I'll play around some more and come up with a V2.

#include <fstream>
#include <iomanip>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>

#include <deque>
#include <vector>

using namespace std;

/* ////////////////////////////////////////////////////////////////////////////
  timer stuff
//////////////////////////////////////////////////////////////////////////// */

//----------------------------------------------------------------------------
typedef double epoch_time_t;

//----------------------------------------------------------------------------
#if defined(_WIN32) || defined(__WIN32__)  // _WIN32 is what MSVC defines; __WIN32__ only exists on some compilers

  #include <windows.h>

  epoch_time_t get_epoch_time()  // Since 1601 Jan 1
    {
    FILETIME f;
    GetSystemTimeAsFileTime( &f );
    LARGE_INTEGER t =
      { {
      f.dwLowDateTime,
      f.dwHighDateTime
      } };
    return t.QuadPart / 10000000.0;  // 100 nano-second intervals
    }

  void waitforkeypress()
    {
    WaitForSingleObject(
      GetStdHandle( STD_INPUT_HANDLE ),
      INFINITE
      );
    }

//----------------------------------------------------------------------------
#else // POSIX

  #include <sys/time.h>

  epoch_time_t get_epoch_time()  // Since 1970 Jan 1
    {
    struct timeval tv;
    if (gettimeofday( &tv, NULL )) return 0.0;
    return tv.tv_sec + tv.tv_usec / 1000000.0;
    }

  #include <unistd.h>
  #include <termios.h>
  #include <poll.h>

  #define INFINITE (-1)

  void waitforkeypress()
    {
    struct termios initial_settings;
    tcgetattr( STDIN_FILENO, &initial_settings );

    struct termios settings = initial_settings;
    settings.c_lflag &= ~(ICANON);
    tcsetattr( STDIN_FILENO, TCSANOW, &settings );

    struct pollfd pls[ 1 ];
    pls[ 0 ].fd     = S 
