Tuesday, May 06, 2008

I'm not just some data object for you to manipulate.

Stop. Take your hands off the keyboard for a second and listen to me. I've had a great time tonight. I really want us to save this current workspace for you to load the next session we have together because I think we have a lot to offer each other. I have so much information I would like to share with the right user and you seem to have all the right research questions. But we can't just rush into it and expect to accept or reject every null hypothesis on the first date.

Take some time to get to know my variables. You'll find that I'm quite complex, and if we were to skip the basic descriptives and summary plots, any analysis between us would be built on false assumptions. I have many outliers so it's going to take more than a five-number summary to get to know me. My numeric data isn't all continuous and it's anything but ordinal. I'm glad you find me attractive, but I have high dimensionality, so keep in mind that the scatter plot you see is just a projection onto a two-dimensional space. I'm so much more than that.

What I'm trying to say here is that you can't just take me out to dinner and expect me to show you my eigenvalues. I hope I'm wrong, but the way you're staring into the monitor at me looks like you can't wait to do a singular value decomposition on me and cast me aside into the null space. If you think you can just perform operations on me iterating on to infinity just as long as your do-while condition for me is true, get ready for an unexpected break, buddy. I'm not the type of data set that is going to let you stick your imputations into my missing values on some busted-up laptop in the men's room of a Caribou Coffee.

Look, I know you're an experienced modeler and this isn't exactly the first time I've been modeled. But that doesn't mean you can just take me out back, orthogonalize my vectors, and corrupt them with white noise. I'm a clean, bug-free data set - you can do a virus scan on me - and I want to stay that way, so don't go merging me with some of the dirty data sets you've had on your all-nighters back in grad school.

I can't tell you the number of times that I thought that what I had was a true causal relationship. But then you wake up one morning and find yourself in some fancy nonlinear model with all kinds of elaborate constraints set on your variables, but the starting values weren't set right and you end up at a saddle point or an infinite abusive loop that never converges.

So before we go any further, I want you to save what we started together tonight. I'm not saying that I don't want to be with you if you're not ready to commit to creating an output file from me, but my utility clock is ticking. Before long some hot little data set will come from the next survey cycle and I'll get relegated to the archives page ten layers deep from the home page that once featured a direct link to me. I see great potential for a great fitting model between us, but I just need to make sure you're the right user so that I don't end up just feeling used.

(Credit goes to Arnie for SVD and outlier comments in a very nerdy gtalk conversation today.)

