Nikko February 2016

Closest value different files, with different number of lines and other conditions ( bash awk other)

I have to revive and old question with a modification for long files.

I have the age of two stars in two files (File1 and File2). The column of the age of the stars is $1 and the rest of the columns up to $13 are information that I need to print at the end.

I am trying to find an age in which the stars have the same age or the closest age. Since the files are too large (~25000 lines) I don't want to search in the whole array, for speed issues. Also, they could have a big difference in number of lines (let say ~10000 in some cases)

I am not sure if this is the best way to solve the problem, but in a lack of a better one, this is my idea. (If you have a faster and more efficient method, please do it)

All the values are with 12 decimals of precision. And for now I am only concern in the first column (where the age is).

And I need different loops.

Let's use this value from file 1:

2.326062371284e+05

First the routine should search in file2 all the matches that contain

2.3260e+05

(This loop probably will search in the whole array, but if there is a way to stop the search as soon it reaches 2.3261 then it will save some time)

If it finds just one, then the output should be that value.

Usually, it will find several lines, maybe even up to 1000. It this is the case, it should search again against

2.32606e+05

between the lines founded before. (It is a nested loop I think) Then the number of matches will decrease up to ~200

At that moment, the routine should search the best difference with certain tolerance X between

2.326062371284e+05

and all these 200 lines.

This way having these files

File1

1.83380065035        

Answers


choroba February 2016

Perl to the rescue. This should be very fast, as it does a binary search in the given range.

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

use List::Util qw{ max min };
use constant { SIZE      => 100,
               TOLERANCE => 3000,
           };


my @times2;
open my $F2, '<', 'file2' or die $!;
while (<$F2>) {
    chomp;
    push @times2, $_;
}

my $num = 0;
open my $F1, '<', 'file1' or die $!;
while (my $time = <$F1>) {
    chomp $time;

    my $from = max(0, $num - SIZE);
    my $to   = min($#times2, $num + SIZE);
    my $between;
    while (1) {
        $between = int(($from + $to) / 2);

        if ($time < $times2[$between] && $to != $between) {
            $to = $between;

        } elsif ($time > $times2[$between] && $from != $between) {
            $from = $between;

        } else {
            last
        }
    }
    $num++;
    if ($from != $to) {
        my $f = $time - $times2[$from];
        my $t = $times2[$to] - $time;
        $between = ($f > $t) ? $to : $from;
    }
    say "$time $times2[$between]" if TOLERANCE >= abs $times2[$between] - $time;
}


Tensibai February 2016

Not an awk solution, comes a time when other solutions are great too, so here is an answer using R

New answer with different datas, not reading from file this time to bake an example:

# Sample data for code, use fread to read from file and setnames to name the colmumns accordingly
set.seed(123)
data <- data.table(age=runif(20)*1e6,name=sample(state.name,20),sat=sample(mtcars$cyl,20),dens=sample(DNase$density,20))
data2 <- data.table(age=runif(10)*1e6,name=sample(state.name,10),sat=sample(mtcars$cyl,10),dens=sample(DNase$density,10))

setkey(data,'age') # Set the key for joining to the age column
setkey(data2,'age') # Set the key for joining to the age column

# get the result
result=data[ # To get the whole datas from file 1 and file 2 at end
         data2[ 
           data, # Search for each star of list 1
           .SD, # return columns of file 2
           roll='nearest',by=.EACHI, # Join on each line (left join) and find nearest value
          .SDcols=c('age','name','dens')]
       ][!duplicated(age) & abs(i.age - age) < 1e3,.SD,.SDcols=c('age','i.age','name','i.name','dens','i.dens') ] # filter duplicates in first file and on difference
# Write results to a file (change separator for wish):
write.table(format(result,digits=15,scientific=TRUE),"c:/test.txt",sep=" ")

Code:

# A nice package to have, install.packages('data.table') if it's no present
library(data.table)
# Read the data (the text can be file names)
stars1 <- fread("1.833800650355e+05
1.959443501406e+05
2.085086352458e+05
2.210729203510e+05
2.326062371284e+05
2.441395539059e+05
2.556728706833e+05")

stars2 <- fread("2.210729203510e+05
2.354895663228e+05
2.499062122946e+05
2.643228582664e+05
2.787395042382e+05
2.921130362004e+05
3.054865681626e+05")

# Name the columns (not needed if the file has a header)
colnames(stars1) <- "age"
colnames(stars2) <- "a 

Post Status

Asked in February 2016
Viewed 1,323 times
Voted 13
Answered 2 times

Search




Leave an answer