Sujay_K February 2016

How do I store a 100000*100000 matrix in C++?

I have two vectors a[100000] and b[100000]. I want to store a[i]*b[j] in a matrix. How can I do it in C++?

Answers


Pierre February 2016

You can use a std::vector<std::vector<your_type>> to store the result.

int rows = 100000, cols = 100000;

std::vector<std::vector<double>> result;
result.resize(rows);

for(int i=0; i<rows; i++) {
        result[i].resize(cols);
}    

for(int i=0; i<rows; i++) {
    for(int j=0; j<cols; j++){
        result[i][j] = a[i] * b[j];
    }
}

Or you can use a linear algebra library, for example Eigen (you'll may have less to code with this), which will surely be more efficient.


Humam Helfawi February 2016

The NON-contiguity part of this answer should be re-researched. It may be wrong.

If you want to work with big number of element such as 100000*100000. I would not recommend to use vector of vectors due to the "NON-contiguity" property of the inner vectors elements to each other. Small push_back may results in a lot of mess.

I would use single vector with a wrapper. See this for more information Clean ways to write multiple 'for' loops .


Kay February 2016

Using an in-RAM std::vector<double> would probably not feasible if you have less than 80GB of RAM in your system (for a 100000×100000 matrix of doubles).

Here is how you would do this using an mmap'd file. Please see the inline comments:

#include <sys/mman.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

#define ROWS 1000
#define COLS 1000
#define FILENAME "./matrix.doubles"

int main(void)
{
    double (*matrix)[ROWS][COLS]; // pointer to our matrix
    int fd; // file descriptor of backing file

    // open backing file
    fd = open(FILENAME,
              O_CREAT | O_RDWR, // create (if absent) and/or read and writable
              S_IRUSR | S_IWUSR); // (only) user may read and write
    if (fd < 0) {
        perror("Could not open file");
        return 1;
    }

    if ((lseek(fd, sizeof(*matrix), SEEK_SET) == (off_t) -1) ||
        ftruncate(fd, sizeof(*matrix)) ||
        (lseek(fd, 0, SEEK_SET) == (off_t) -1)) {
        perror("Could not set file size.");
        return 1;
    }

    matrix = mmap(NULL, // I don't care were the address starts
                  sizeof(*matrix), // size of matrix in bytes
                  PROT_READ | PROT_WRITE, // readable and writable
                  MAP_PRIVATE, // we access the data exclusively
                  fd, // file descriptor of backing file
                  0); // offset
    if (matrix == MAP_FAILED) {
        perror("Could not mmap file.");
        return 1;
    }

    // operate on matrix
    for (unsigned row = 0; row < ROWS; ++row) {
        for (unsigned col = 0; col < COLS; ++col) {
            (*matrix)[row][col] = row * col;
        }
    }

    // close backing file
    munmap(matrix, sizeof(*matrix));
    close(fd);

    return 0;
}

This is pure C code. You could fancy up the code by using e.g.

user3528438 February 2016

    #include <vector>
    class C
    {
    public:
        C(const std::vector<double>& a_, const std::vector<double>& b_)
            :a(a_),b(b_){};
        double operator()(size_t i, size_t j) const { return a[i]*b[j]; }
    private:
         std::vector<double> a, b;
    };

What the problem actually is?

The original question asks about a way to save C(i,j)=A(i)*B(j) to a matrix.

From an OOP point of view, such an matrix can be defined as an object with a method takes two inputs(i and j), and returns a result (ret=A(i)*B(j)).

This could be implemented using nested array subscriptions(c[i][j]), or linear array indexing(c[i*100000+j]), or a function (c.get(i, j)). The third way could also be simplified to a functor (c.operator()(i, j) or c(i, j)).

Then what?

If you agree to all above that any of the three interfaces serves the purpose, or at least partially (like I mentioned in the comment, if the matrix is only required to provide random read access to its elements). Then we continue to implement one of them, 3rd one being my choice.

Why do it that way?

My observation is that, computing the return value is not expensive, so why not calculate the product "lazily" when the product is actually accessed?

In this way, the storage is very efficient (memory usage is reduced from n^2 to 2n).

Hiding the multiplication in the getter function does not significantly increase access time (two memory accesses and one multiplication, compared with the ideal case being one memory access only, but both cases are constant time, and this implementation is much more cache friendly for the reduced size of the data).


hungptit March 2016

If you can compute a[i]*b[j] on the fly then you should do that because of two reasons:

  • Obtaining the results from a huge matrix may not be faster than compute the product of two double values on the fly.
  • 10000x10000 double matrix requires 80 Gbytes of storage (in-memory or on disk) and some extra work might be needed to access pre-computed data.

In my example below, I see 30x speed up (compiled in release mode using clang 3.8) if I compute the product of two double values on the fly for N=20000.

template <typename T>
void test_lookup(std::vector<T> &data, std::vector<size_t> &index,
                 std::vector<T> &results) {
    const size_t LOOP = index.size() / 2;
    for (size_t idx = 0; idx < LOOP; ++idx) {
        auto row = index[2 * idx];
        auto col = index[2 * idx + 1];
        results[idx] = data[col * LOOP + row];
    }
}

template <typename T>
void test_mul(std::vector<T> &x, std::vector<T> &y, std::vector<T> &results) {
    for (size_t idx = 0; idx < x.size(); ++idx) {
        results[idx] = x[idx] * y[idx];
    }
}

Post Status

Asked in February 2016
Viewed 1,574 times
Voted 4
Answered 5 times

Search




Leave an answer