Yona February 2016

pandas: when data is NaN logic operations cannot be done

I have a large pandas DataFrame in which two columns either hold a value or are NaN (null) when nothing has been assigned.

I want to populate a third column based on these two: where the source column is not NaN, the new column takes a derived value. This works as follows:

In [16]: import pandas as pd

In [17]: import numpy as np

In [18]: df = pd.DataFrame([[np.NaN, np.NaN],['John', 'Malone'],[np.NaN, np.NaN]], columns = ['col1', 'col2'])

In [19]: df
Out[19]:
   col1    col2
0   NaN     NaN
1  John  Malone
2   NaN     NaN

In [20]: df['col3'] = np.NaN

In [21]: df.loc[df['col1'].notnull(),'col3'] = 'I am ' + df['col1']

In [22]: df
Out[22]:
   col1    col2       col3
0   NaN     NaN        NaN
1  John  Malone  I am John
2   NaN     NaN        NaN

This also works:

In [29]: df.loc[df['col1']== 'John','col3'] = 'I am ' + df['col2']

In [30]: df
Out[30]:
   col1    col2         col3
0   NaN     NaN          NaN
1  John  Malone  I am Malone
2   NaN     NaN          NaN

But if I now make all values NaN and then try this last loc, it gives me an error!

In [31]: df = pd.DataFrame([[np.NaN, np.NaN],[np.NaN, np.NaN],[np.NaN, np.NaN]], columns = ['col1', 'col2'])

In [32]: df
Out[32]:
   col1  col2
0   NaN   NaN
1   NaN   NaN
2   NaN   NaN

In [33]: df['col3'] = np.NaN

In [34]: df.loc[df['col1']== 'John','col3'] = 'I am ' + df['col2']
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
c:\python33\lib\site-packages\pandas\core\ops.py in na_op(x, y)
    552             result = expressions.evaluate(op, str_rep, x, y,
--> 553                                           raise_on_error=True, **eval_kwargs)
    554         except TypeError:

c:\python33\lib\site-packages\pandas\computation\expressions.py in evaluate(op, op_str, a, b, raise_on_error, use_numexpr, **eval_kwargs)
    217         return _evaluate(op, op_str, a, b, raise_on_er        

Answers


Paul H February 2016

The problem here is that if an entire column is np.nan, it's probably stored as floats, not object (text).
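To see the dtype difference for yourself, here is a minimal sketch that rebuilds the two frames from the question and inspects col2. The names all_nan and mixed are mine, not from the question, and the exact dtype reported for the mixed column can depend on your pandas version (object on the version in the question, possibly a string dtype on newer ones).

```python
import numpy as np
import pandas as pd

# Frame where every value is NaN (the failing case in the question).
all_nan = pd.DataFrame([[np.nan, np.nan]] * 3, columns=['col1', 'col2'])

# Frame with one row of text (the working case in the question).
mixed = pd.DataFrame(
    [[np.nan, np.nan], ['John', 'Malone'], [np.nan, np.nan]],
    columns=['col1', 'col2'],
)

# An all-NaN column is inferred as a float column, so string
# concatenation against it has nothing text-like to work with.
print(all_nan['col2'].dtype)

# A column containing text is not stored as float.
print(mixed['col2'].dtype)
```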

So you can do:

if not np.all(pd.isnull(df['mycol'])):
    df = my_string_operation(df)

You could also coerce the column in question to an object type.

df['mycol'] = df['mycol'].astype(object)
df = my_string_operation(df)
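Applied to the frames from the question, the coercion could look something like the sketch below. add_greeting is a hypothetical helper of mine standing in for my_string_operation, and I am assuming you want col3 to stay NaN wherever col1 is not 'John'.

```python
import numpy as np
import pandas as pd


def add_greeting(df):
    """Hypothetical helper: fill col3 from col2 where col1 == 'John'."""
    out = df.copy()
    # Coerce to object so that string concatenation is allowed even
    # when the column holds nothing but NaN.
    out['col2'] = out['col2'].astype(object)
    # Create col3 as an object column of NaN so it can hold text later.
    out['col3'] = pd.Series(np.nan, index=out.index, dtype=object)
    mask = out['col1'] == 'John'
    out.loc[mask, 'col3'] = 'I am ' + out['col2']
    return out


all_nan = pd.DataFrame([[np.nan, np.nan]] * 3, columns=['col1', 'col2'])
mixed = pd.DataFrame(
    [[np.nan, np.nan], ['John', 'Malone'], [np.nan, np.nan]],
    columns=['col1', 'col2'],
)

result_all_nan = add_greeting(all_nan)  # no TypeError; col3 stays all NaN
result_mixed = add_greeting(mixed)      # row 1 gets a greeting
print(result_all_nan)
print(result_mixed)
```

The same function handles both cases, so you don't need a separate check for the all-NaN situation.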


johnchase February 2016

I would argue that all this line is really doing is adding a string to the col1 values wherever they are not null.

df.loc[df['col1'].notnull(),'col3'] = 'I am ' + df['col1']

So you can just check if there are any values that are not null and then only perform the operation if there are:

if df['col1'].notnull().any():
    df['col3'] = 'I am ' + df['col1']

You also don't need to create the col3 column prior to running it this way.
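Here is a rough sketch of that check run against both frames from the question. greet_if_any is a name I made up for illustration. One thing to be aware of: when col1 is entirely null the assignment is skipped, so col3 is never created in that case.

```python
import numpy as np
import pandas as pd


def greet_if_any(df):
    """Hypothetical helper: add col3 only if col1 has a non-null value."""
    out = df.copy()
    if out['col1'].notnull().any():
        # NaN rows in col1 stay NaN in col3; only real strings get the prefix.
        out['col3'] = 'I am ' + out['col1']
    return out


all_nan = pd.DataFrame([[np.nan, np.nan]] * 3, columns=['col1', 'col2'])
mixed = pd.DataFrame(
    [[np.nan, np.nan], ['John', 'Malone'], [np.nan, np.nan]],
    columns=['col1', 'col2'],
)

guarded_all_nan = greet_if_any(all_nan)  # returned unchanged, no col3
guarded_mixed = greet_if_any(mixed)      # col3 added
print(guarded_all_nan.columns.tolist())
print(guarded_mixed)
```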

Post Status

Asked in February 2016
Viewed 2,200 times
Voted 5
Answered 2 times
