
I apologize if I'm butchering the terminology. I'm trying to understand the code in this example on how to chain a custom function onto a PySpark dataframe. I'd really like to understand exactly what it's doing, and whether it is awful practice, before I implement anything.

From the way I'm understanding the code, it:

  1. defines a function g with sub-functions inside of it, that returns a copy of itself
  2. assigns the sub-functions to g as attributes
  3. assigns g as a property of the DataFrame class

I don't think any of them becomes a method at any step in the process (when I check with getattr, it always says "function").

When I run a simplified version of the code (below, reduced as best as I can), the attributes on the function only seem to become available (even outside of the class) after I assign the function to a class as a property, instantiate the class, and access the property on that instance. I want to understand what is happening and why.

An answer [here](https://stackoverflow.com/a/17007966/19871699) indicates that this is a behavior, but doesn't really explain what/why it is. I've read this too but I'm having trouble seeing the connection to the code above.

I read here about the setattr part of the code. The author doesn't mention exactly the use case above. This post has some use cases where people do it, but I'm not understanding how it directly applies to the above, unless I've missed something.

The confusing part is when the inner attributes become available.

class SampleClass():
  def __init__(self):
    pass

def my_custom_attribute(self):
  def inner_function_one():
    pass
  
  setattr(my_custom_attribute,"inner_function",inner_function_one)

  return my_custom_attribute
  
[x for x in dir(my_custom_attribute) if x[0] != "_"]

returns []

then when I do:

SampleClass.custom_attribute = property(my_custom_attribute) 
[x for x in dir(my_custom_attribute) if x[0] != "_"]

it returns []

but when I do:

class_instance = SampleClass()
class_instance.custom_attribute

[x for x in dir(my_custom_attribute) if x[0] != "_"]

it returns ['inner_function']

In the code above though, if I do SampleClass.custom_attribute = my_custom_attribute instead of =property(...) the [x for x... code still returns [].

edit: I'm not intending to access the function itself outside of the class. I just don't understand the behavior, and don't like implementing something I don't understand.

  • You have the `setattr()` *inside* the function - so you have to somehow call the function before the attribute is created. I have no idea what `SampleClass` has to do with any of this, as you don't seem to be trying to add an attribute to instances of that class (which is what `property()` is used for). – jasonharper Feb 14 '23 at 21:59
  • "defines a function g with sub-functions inside of it, that returns a copy of itself" it does not return a copy of itself. It simply returns itself. All of this is designed pretty badly, IMO – juanpa.arrivillaga Feb 14 '23 at 22:02
  • In any case, this has nothing to do with `setattr`, this has everything to do with the way this solution is designed, which frankly, is just vastly overengineered for the solution it is trying to achieve. – juanpa.arrivillaga Feb 14 '23 at 22:13
  • Thank you guys so much. It's sincere even if it doesn't read that way. – jonathan-dufault-kr Feb 15 '23 at 14:17

2 Answers

1

Because when you call a function, the attributes set within that function aren't returned; only the return value is passed back.

In other words the additional attributes are only available on the returned function and not with 'g' itself.

Try moving setattr() outside of the function.
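
A minimal sketch of that suggestion, reusing the names from the question. The attribute is assigned right after the definitions, so it exists without calling the function or involving any class:

```python
def my_custom_attribute(self):
    # Returns the function object itself, as in the question.
    return my_custom_attribute


def inner_function_one():
    pass


# Attach the attribute at definition time, outside the function body.
my_custom_attribute.inner_function = inner_function_one

# Same check as in the question: list the non-underscore attribute names.
public_names = [x for x in dir(my_custom_attribute) if not x.startswith("_")]
print(public_names)
```

With the assignment outside the body, the attribute is already present by the time the check runs.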

linuxgx
1

So, setattr is not relevant here. This would all work exactly the same without it, say, by just doing my_custom_attribute.inner_function = inner_function_one etc. What is relevant is that the approach in the link you showed (whose purpose your example doesn't exactly make clear) relies on using a property, which is a descriptor. The function won't get called unless you access the attribute corresponding to the property on an instance. This comes down to how property works. For any property, given a class Foo:

Foo.attribute_name = property(some_function)

Then some_function won't get called until you do Foo().attribute_name. That is the whole point of property.
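
To make the timing concrete, here is a small sketch that records each call. The `calls` list and the variable names are only there for illustration:

```python
class Foo:
    pass


calls = []  # records every invocation of some_function


def some_function(self):
    calls.append(self)
    return "computed"


Foo.attribute_name = property(some_function)
calls_after_assignment = len(calls)  # wrapping in property() does not call it

class_access = Foo.attribute_name  # access on the class yields the property object
calls_after_class_access = len(calls)  # still no call

value = Foo().attribute_name  # access on an instance triggers the getter
calls_after_instance_access = len(calls)

print(calls_after_assignment, calls_after_class_access, calls_after_instance_access)
```

This is why, in the question, the attributes set inside the function body only show up after `class_instance.custom_attribute` has been evaluated.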

But this whole solution is very confusingly engineered. It relies on the above behavior, and it sets attributes on the function object.

Note, if all you want to do is add some method to your DataFrame class, you don't need any of this. Consider the following example (using pandas for simplicity):

>>> import pandas as pd
>>> def foobar(self):
...     print("in foobar with instance", self)
...
>>> pd.DataFrame.baz = foobar
>>> df = pd.DataFrame(dict(x=[1,2,3], y=['a','b','c']))
>>> df
   x  y
0  1  a
1  2  b
2  3  c
>>> df.baz()
in foobar with instance    x  y
0  1  a
1  2  b
2  3  c

That's it. You don't need all that rigamarole. Of course, if you wanted to add a nested accessor, df.custom.whatever, you would need something a bit more complicated. You could use the approach in the OP, but I would prefer something more explicit:

import pandas as pd

class AccessorDelegator:
    def __init__(self, accessor_type):
        self.accessor_type = accessor_type
    def __get__(self, instance, cls=None):
        return self.accessor_type(instance)

class CustomMethods:
    def __init__(self, instance):
        self.instance = instance
    def foo(self):
        # do something with self.instance as if this were your `self` on the dataframe being augmented
        print(self.instance.value_counts())

pd.DataFrame.custom = AccessorDelegator(CustomMethods)

df = pd.DataFrame(dict(a=[1,2,3], b=['a','b','c']))

df.custom.foo()

The above will print:

a  b
1  a    1
2  b    1
3  c    1
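
For readers who want to try the descriptor pattern without pandas, here is a dependency-free sketch. `Table` and `row_count` are hypothetical stand-ins, and the `instance is None` guard is an addition of this sketch (not part of the code above) so that access on the class returns the descriptor itself instead of building an accessor around `None`:

```python
class AccessorDelegator:
    def __init__(self, accessor_type):
        self.accessor_type = accessor_type

    def __get__(self, instance, cls=None):
        if instance is None:
            # Accessed on the class (e.g. Table.custom): return the descriptor.
            return self
        # Accessed on an instance: wrap that instance in a fresh accessor.
        return self.accessor_type(instance)


class Table:
    """Hypothetical stand-in for a DataFrame-like class."""

    def __init__(self, rows):
        self.rows = rows


class CustomMethods:
    def __init__(self, instance):
        self.instance = instance

    def row_count(self):
        # self.instance plays the role of `self` on the augmented class.
        return len(self.instance.rows)


Table.custom = AccessorDelegator(CustomMethods)

t = Table([1, 2, 3])
count = t.custom.row_count()
print(count)
```

A new `CustomMethods` object is built on every `t.custom` access; caching it would be a separate design decision.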
juanpa.arrivillaga