Some interesting DataFrame hate
Jakob N. Andersen griping about DataFrames in the Julia Slack (only copying this here because it will disappear into the Slack hole):
People should be banned for using DataFrames as a data structure in packages 😠. All the worst Python code I've read the last 5 years have been DataFrame abuse. DataFrames encourages balling up all information in one mega collection, and then handily destroys all type information about what data it contains and its invariants. You get functions whose sole argument is df, but which really takes all state in the entire program. It also makes people too lazy when considering parsing. They just read the input with pd.read_csv, and then assume that all the columns are present, with the right types. Which, incidentally, also means it's impossible for a reader of the code to know what data is actually being loaded. People who use DataFrames in package code should be sentenced to 100 hours of community service, refactoring other people's incomprehensible DataFrame packages to use proper types.
I should say, data frames are fine when used interactively, and possibly also if people use them responsibly (which I've never seen)
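To make the parsing complaint concrete, here is a minimal sketch (mine, not from the thread) of what "proper types" could look like at the input boundary. The `Measurement` record, its column names, and `parse_measurements` are all hypothetical. The point is only that the schema is written down once and checked while parsing, instead of being assumed later.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """One validated input row (hypothetical schema, for illustration only)."""
    sample_id: str
    temperature_c: float
    replicate: int


# The columns the rest of the program is allowed to rely on.
REQUIRED_COLUMNS = ("sample_id", "temperature_c", "replicate")


def parse_measurements(text: str) -> list[Measurement]:
    """Parse CSV text into typed records, failing loudly on a bad schema."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    rows = []
    for raw in reader:
        try:
            rows.append(
                Measurement(
                    sample_id=raw["sample_id"],
                    temperature_c=float(raw["temperature_c"]),
                    replicate=int(raw["replicate"]),
                )
            )
        except (TypeError, ValueError) as err:
            # reader.line_num is the last source line the reader consumed.
            raise ValueError(f"line {reader.line_num}: {err}") from err
    return rows
```

A function that takes `list[Measurement]` tells the reader exactly which fields it can touch, which a bare `df` argument does not.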
Some further commentary in the thread:
I teach Python to university students, and I feel like the number 1 thing I advise them to do when developing packages is to remove their pandas dependency. I swear, no matter what task is being proposed, they will first reach for Pandas, and then think of a way to cram their problem into a Pandas-shaped hole. For example, I have this graph problem that I like giving them, and ~100% of my students try to solve it using Pandas, which is a terrible decision for doing graph operations
And, from another commenter:
There's a certain infuriating "EVERYTHING MUST BE A TABLE!!!" mentality that is ubiquitous in some organizations. It requires you to abandon the central paradigm (in python's case OOP, in julia's case multiple dispatch) that the language was built around and leave unused many of the tools the language provides you. Drives me bananas.
Interesting to see some pushback against tables and dataframes because I legitimately default to reasoning about things that way most of the time. I think some reasons for that are:
- Python slowness brainrot. A lot of people think “I need to use a dataframe” because otherwise hand-written Python is frequently so slow that it is legitimately more performant to shove all your crap into dataframes. I definitely find myself thinking this sometimes.
- This conversation seems to be taking normalization/relational stuff out of the picture. I do agree that putting a bunch of unrelated stuff into a dataframe feels ugly and bad. But on the other hand, I think “object-oriented” and “relational data” actually go hand-in-hand and I think about them in conjunction all the time! I think well-normalized data is kind of object-based in nature, and you can go back and forth between “these two things should be separate objects” and “this data should be in two separate tables” pretty naturally. In those cases, whether you’re using structs or tables kind of just comes down to scale, speed, and context: are you operating on millions of objects? Is your language slow at custom functionality? Do you need something hyper portable? Those kinds of questions might help you figure out whether a tabular or an object-like representation is best for a particular situation, but I don’t think it’s crazy to think of both as almost equally viable options for trying to accomplish basically the same things.
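
As a rough illustration of that back-and-forth, here is a small sketch that asks the same question of the same data twice: once over plain objects, once over two normalized tables in an in-memory SQLite database. The `Author`/`Post` schema and both function names are made up for the example, and I'm using the standard library's `sqlite3` as a stand-in for "tables" rather than any particular dataframe library.

```python
import sqlite3
from collections import Counter
from dataclasses import dataclass


# Hypothetical schema: two kinds of "thing", linked by an id.
@dataclass(frozen=True)
class Author:
    author_id: int
    name: str


@dataclass(frozen=True)
class Post:
    post_id: int
    author_id: int  # plays the role of a foreign key
    title: str


def posts_per_author_objects(authors, posts):
    """Count posts per author name by walking the objects directly."""
    counts = Counter(post.author_id for post in posts)
    return {author.name: counts[author.author_id] for author in authors}


def posts_per_author_tables(authors, posts):
    """Same question, asked of two normalized tables with a join."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE author (author_id INTEGER PRIMARY KEY, name TEXT)")
    con.execute(
        "CREATE TABLE post (post_id INTEGER PRIMARY KEY, "
        "author_id INTEGER REFERENCES author(author_id), title TEXT)"
    )
    # One dataclass maps to one table; one instance maps to one row.
    con.executemany(
        "INSERT INTO author VALUES (?, ?)",
        [(a.author_id, a.name) for a in authors],
    )
    con.executemany(
        "INSERT INTO post VALUES (?, ?, ?)",
        [(p.post_id, p.author_id, p.title) for p in posts],
    )
    rows = con.execute(
        "SELECT author.name, COUNT(post.post_id) FROM author "
        "LEFT JOIN post ON post.author_id = author.author_id "
        "GROUP BY author.author_id, author.name"
    ).fetchall()
    con.close()
    return dict(rows)


authors = [Author(1, "Ada"), Author(2, "Grace")]
posts = [Post(10, 1, "Notes"), Post(11, 1, "More notes"), Post(12, 2, "Compilers")]
```

Both functions give the same answer for this data. Which one you'd want seems to depend mostly on the scale and speed questions above, not on any deep difference in how the data is modeled.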
Anyways. I’m still very relational-brained these days. We’ll see if that changes I guess!