Skip to content

Conversation

@phofl
Copy link
Member

@phofl phofl commented Dec 9, 2022

cc @mroeschke not saying that this is a good fix, but this shows that we have to use repr for the string dtypes in some way. Is there a more general rule to accomplish this?

@phofl phofl added Bug Output-Formatting __repr__ of pandas objects, to_string Strings String extension data type and string data labels Dec 9, 2022
@mroeschke
Copy link
Member

Looks to be a doctest error

=================================== FAILURES ===================================
_____________ [doctest] pandas.core.generic.NDFrame.convert_dtypes _____________
6466         Convert the DataFrame to use best possible dtypes.
6467 
6468         >>> dfn = df.convert_dtypes()
6469         >>> dfn
6470            a  b      c     d     e      f
6471         0  1  x   True     h    10   <NA>
6472         1  2  y  False     i  <NA>  100.5
6473         2  3  z   <NA>  <NA>    20  200.0
6474 
6475         >>> dfn.dtypes
Differences (unified diff with -expected +actual):
    @@ -1,7 +1,7 @@
    -a      Int32
    -b     string
    -c    boolean
    -d     string
    -e      Int64
    -f    Float64
    +a             Int32
    +b    string[python]
    +c           boolean
    +d    string[python]
    +e             Int64
    +f           Float64
     dtype: object

def test_to_string_string_dtype():
# GH#50099
if pa_version_under6p0:
pytest.skip()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Could you use the pytest.mark.skipif decorator?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, changed

@mroeschke
Copy link
Member

Is there a more general rule to accomplish this?

Yeah I don't think at the moment. Not sure why StringDtype (before StorageDtype was refactored away) has this str vs repr difference.

Looks like I actually hacked around this when creating ArrowDtype lol

class StorageExtensionDtype(ExtensionDtype):
    """ExtensionDtype that may be backed by more than one implementation."""

    name: str
    _metadata = ("storage",)

    def __init__(self, storage=None) -> None:
        self.storage = storage

    def __repr__(self) -> str:
        return f"{self.name}[{self.storage}]"

    def __str__(self) -> str:
        return self.name
class ArrowDtype(StorageExtensionDtype):
    def __repr__(self) -> str:
        return self.name

    @property
    def name(self) -> str:  # type: ignore[override]
        """
        A string identifying the data type.
        """
        return f"{str(self.pyarrow_dtype)}[{self.storage}]"

@phofl
Copy link
Member Author

phofl commented Dec 12, 2022

Thx, fixed the doctest.

Ok then can keep as is :)

@mroeschke mroeschke added this to the 2.0 milestone Dec 13, 2022
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM can merge on green

@phofl phofl merged commit d050408 into pandas-dev:main Dec 13, 2022
Copy link
Contributor

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @phofl!

@jorisvandenbossche
Copy link
Member

Not sure why StringDtype (before StorageDtype was refactored away) has this str vs repr difference.

As far as I remember, this was actually done intentionally. And I personally I would prefer to keep this. But let's discuss on the issue (will reopen that one).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Bug Output-Formatting __repr__ of pandas objects, to_string Strings String extension data type and string data

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: DataFrame.dtypes doesn't include backend for string columns

4 participants