slaw February 2016

Pandas Set Data From the Last Period As New DataFrame Column

I have a Pandas DataFrame:

import pandas as pd

df = pd.DataFrame([['A', '2014-01-01', '2014-01-07', 1.2],
                   ['B', '2014-01-01', '2014-01-07', 2.5],
                   ['C', '2014-01-01', '2014-01-07', 3.],
                   ['A', '2014-01-08', '2014-01-14', 13.],
                   ['B', '2014-01-08', '2014-01-14', 2.],
                   ['C', '2014-01-08', '2014-01-14', 1.],
                   ['A', '2014-01-15', '2014-01-21', 10.],
                   ['A', '2014-01-21', '2014-01-27', 98.],
                   ['B', '2014-01-21', '2014-01-27', -5.],
                   ['C', '2014-01-21', '2014-01-27', -72.],
                   ['A', '2014-01-22', '2014-01-28', 8.],
                   ['B', '2014-01-22', '2014-01-28', 25.],
                   ['C', '2014-01-22', '2014-01-28', -23.],
                   ['A', '2014-01-22', '2014-02-22', 8.],
                   ['B', '2014-01-22', '2014-02-22', 25.],
                   ['C', '2014-01-22', '2014-02-22', -23.],
                  ], columns=['Group', 'Start Date', 'End Date', 'Value'])

And the output looks like this:

   Group  Start Date    End Date  Value
0      A  2014-01-01  2014-01-07    1.2
1      B  2014-01-01  2014-01-07    2.5
2      C  2014-01-01  2014-01-07    3.0
3      A  2014-01-08  2014-01-14   13.0
4      B  2014-01-08  2014-01-14    2.0
5      C  2014-01-08  2014-01-14    1.0
6      A  2014-01-15  2014-01-21   10.0
7      A  2014-01-21  2014-01-27   98.0
8      B  2014-01-21  2014-01-27   -5.0
9      C  2014-01-21  2014-01-27  -72.0
10     A  2014-01-22  2014-01-28    8.0
11     B  2014-01-22  2014-01-28   25.0
12     C  2014-01-22  2014-01-28  -23.0
13     A  2014-01-22  2014-02-22    8.0
14     B  2014-01-22  2014-02-22   25.0
15     C  2014-01-22  2014-02-22  -23.0

I am trying to add a new column with data from the same group in the previous period (if it exists). So, the output should look like th

Answers


Artur Nowak February 2016

The simplest method (although with quadratic complexity) would be as follows:

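import datetime as dt
# Use item assignment so that real columns are created;
# attribute assignment (df.sd = ...) does not add a column.
df['sd'] = pd.to_datetime(df['Start Date'])
df['ed'] = pd.to_datetime(df['End Date'])

def find_previous_period(row):
  # Assumes the previous period is shifted back by exactly 7 days
  prev_sd = row.sd - dt.timedelta(days=7)
  prev_ed = row.ed - dt.timedelta(days=7)
  prev_period = df[(df.sd == prev_sd) & (df.ed == prev_ed) & (df.Group == row.Group)]
  if prev_period.size > 0:
    # irow() was removed from pandas; iloc is the positional accessor
    return prev_period.iloc[0].Value

df['Last Period Value'] = df.apply(find_previous_period, axis=1)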

A more efficient solution may be needed if you have a lot of data, since this scans the whole DataFrame once per row.


Update for the requirement that the number of days needs to be the same (from the comments):

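def find_previous_period(row):
  # Inclusive length of the current period (e.g. 7 days for 01-08 to 01-14)
  delta = row.ed - row.sd + dt.timedelta(days=1)
  prev_sd = row.sd - delta
  prev_ed = row.ed - delta
  prev_period = df[(df.sd == prev_sd) & (df.ed == prev_ed) & (df.Group == row.Group)]
  if prev_period.size > 0:
    # irow() was removed from pandas; iloc is the positional accessor
    return prev_period.iloc[0].Value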
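
For reference, here is a minimal self-contained sketch of the row-wise lookup described above, run on a three-row subset of the question's data. The subset, the explicit NaN return, and the use of pd.Timedelta in place of dt.timedelta are my own choices for illustration, not part of the original answer.

```python
import pandas as pd

# Three rows taken from the question's data, for illustration only.
df = pd.DataFrame([['A', '2014-01-01', '2014-01-07', 1.2],
                   ['A', '2014-01-08', '2014-01-14', 13.],
                   ['B', '2014-01-08', '2014-01-14', 2.]],
                  columns=['Group', 'Start Date', 'End Date', 'Value'])
df['sd'] = pd.to_datetime(df['Start Date'])
df['ed'] = pd.to_datetime(df['End Date'])

def find_previous_period(row):
    # Inclusive length of the period, e.g. 7 days for 01-08 to 01-14.
    delta = row['ed'] - row['sd'] + pd.Timedelta(days=1)
    prev = df[(df['sd'] == row['sd'] - delta) &
              (df['ed'] == row['ed'] - delta) &
              (df['Group'] == row['Group'])]
    # Return NaN explicitly when the group has no earlier period.
    return prev['Value'].iloc[0] if len(prev) else float('nan')

df['Last Period Value'] = df.apply(find_previous_period, axis=1)
print(df[['Group', 'Start Date', 'End Date', 'Value', 'Last Period Value']])
```

Only the second row (group A, 2014-01-08 to 2014-01-14) has a matching earlier period, so it is the only row that gets a value; the other two are left as NaN.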


howMuchCheeseIsTooMuchCheese February 2016

If I'm understanding your definition of "period" right, this will work and should be pretty fast.

  df['sd'] = pd.to_datetime(df['Start Date'])
  df['ed'] = pd.to_datetime(df['End Date'])
  df['sd2'] = df.sd - pd.Timedelta(days=1)
  df['ed2'] = df.ed - pd.Timedelta(days=1)

  df2 = pd.merge(df, df[['sd2','ed2','Value', 'Group']], left_on=['sd','Group', 'ed'], 
           right_on=['sd2','Group', 'ed2'], how='outer', copy=False)

You'll have to clean up the column names / delete the extra columns.
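
One way to keep the cleanup small is to rename and re-key the right-hand frame before merging, so that no suffixed duplicate columns appear in the result. The sketch below is only an illustration of that idea: the 7-day offset is an assumption chosen to match the weekly rows in the question (the snippet above uses a 1-day offset), and the names shift, prev and 'Last Period Value' are mine.

```python
import pandas as pd

# Three rows taken from the question's data, for illustration only.
df = pd.DataFrame([['A', '2014-01-01', '2014-01-07', 1.2],
                   ['A', '2014-01-08', '2014-01-14', 13.],
                   ['B', '2014-01-08', '2014-01-14', 2.]],
                  columns=['Group', 'Start Date', 'End Date', 'Value'])
df['sd'] = pd.to_datetime(df['Start Date'])
df['ed'] = pd.to_datetime(df['End Date'])

shift = pd.Timedelta(days=7)  # assumed offset between consecutive periods

# Right-hand side: each row re-keyed to the period that follows it,
# with Value already renamed to its final column name.
prev = df[['Group', 'sd', 'ed', 'Value']].rename(
    columns={'Value': 'Last Period Value'})
prev['sd'] = prev['sd'] + shift
prev['ed'] = prev['ed'] + shift

# A left merge keeps every original row; unmatched rows get NaN.
df2 = df.merge(prev, on=['Group', 'sd', 'ed'], how='left')

# Drop the helper datetime columns once the lookup is done.
df2 = df2.drop(columns=['sd', 'ed'])
print(df2)
```

Because the key columns share names on both sides, the merge can use on= instead of left_on/right_on, and the only extra column in the result is the one you actually want.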


unutbu February 2016

Suppose we compute the duration between Start and End for each row:

df['duration'] = df['End']-df['Start']

and suppose we also compute the previous Start value based on that duration:

df['Prev'] = df['Start'] - df['duration'] - pd.Timedelta(days=1)
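
As a quick check of the arithmetic, here is the same computation on a single period, 2014-01-08 to 2014-01-14 (the one-row frame named period is just for illustration):

```python
import pandas as pd

# A single period from the question's data.
period = pd.DataFrame({'Start': pd.to_datetime(['2014-01-08']),
                       'End': pd.to_datetime(['2014-01-14'])})

# End minus Start is 6 days for a period that covers 7 calendar days.
period['duration'] = period['End'] - period['Start']

# Subtracting the duration plus one more day gives the start of the
# period of equal length that ends the day before this one begins.
period['Prev'] = period['Start'] - period['duration'] - pd.Timedelta(days=1)
print(period)
```

Here duration is 6 days and Prev is 2014-01-01, which is the Start of the preceding week in the sample data.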

Then we can express the desired DataFrame as the result of a merge between df and itself where we merge rows whose Group, duration and Prev (in one DataFrame) match the Group, duration and Start (in the other DataFrame):

import pandas as pd

df = pd.DataFrame([['A', '2014-01-01', '2014-01-07', 1.2],
                   ['B', '2014-01-01', '2014-01-07', 2.5],
                   ['C', '2014-01-01', '2014-01-07', 3.],
                   ['A', '2014-01-08', '2014-01-14', 3.],
                   ['B', '2014-01-08', '2014-01-14', 2.],
                   ['C', '2014-01-08', '2014-01-14', 1.],
                   ['A', '2014-01-15', '2014-01-21', 10.],
                   ['A', '2014-01-21', '2014-01-27', 98.],
                   ['B', '2014-01-21', '2014-01-27', -5.],
                   ['C', '2014-01-21', '2014-01-27', -72.],
                   ['A', '2014-01-22', '2014-01-28', 8.],
                   ['B', '2014-01-22', '2014-01-28', 25.],
                   ['C', '2014-01-22', '2014-01-28', -23.],
                   ['A', '2014-01-22', '2014-02-22', 8.],
                   ['B', '2014-01-22', '2014-02-22', 25.],
                   ['C', '2014-01-22', '2014-02-22', -23.],
                  ], columns=['Group', 'Start', 'End', 'Value'])
for col in ['Start', 'End']:
    df[col] = pd.to_datetime(df[col])

df['duration'] = df['End']-df['Start']
df['Prev'] = df['Start'] - df['duration'] - pd.Timedelta(days=1)

result = pd.merge(df, df[['Group','duration','Start','Value']], how='left',
                  left_on=['Group','durat 
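
The listing above is cut off mid-call. Going only by the prose description (match Group, duration and Prev on one side against Group, duration and Start on the other), a merge of that shape might look like the sketch below. The renamed right-hand frame, the on= keys, the final column name and the three-row sample are my reading of the description, not the original code.

```python
import pandas as pd

# Three rows from the answer's sample data, for illustration only.
df = pd.DataFrame([['A', '2014-01-01', '2014-01-07', 1.2],
                   ['A', '2014-01-08', '2014-01-14', 3.],
                   ['A', '2014-01-22', '2014-02-22', 8.]],
                  columns=['Group', 'Start', 'End', 'Value'])
for col in ['Start', 'End']:
    df[col] = pd.to_datetime(df[col])

df['duration'] = df['End'] - df['Start']
df['Prev'] = df['Start'] - df['duration'] - pd.Timedelta(days=1)

# Right-hand side: rename Start to Prev so that a row's Start lines up
# with the Prev computed for the period that follows it.
right = df[['Group', 'duration', 'Start', 'Value']].rename(
    columns={'Start': 'Prev', 'Value': 'Last Period Value'})

# Left merge: every row of df is kept; rows with no earlier period of
# the same group and duration get NaN.
result = pd.merge(df, right, how='left', on=['Group', 'duration', 'Prev'])
result = result.drop(columns=['duration', 'Prev'])
print(result)
```

Including duration in the keys is what keeps the 2014-01-22 to 2014-02-22 row from matching a weekly period: a previous period only counts if it has the same length.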

Post Status

Asked in February 2016
Viewed 3,809 times
Voted 13
Answered 3 times
