
Precision and recall > .90 on holdout data

I'm running ML models (XGBoost and elastic net logistic regression) predicting a 0/1 outcome in a post-period based on pre-period observations in a large, unbalanced dataset. I've undersampled the majority class to get a balanced training set that fits into memory and doesn't take hours to run.

I understand sampling can distort precision and recall metrics. However, I'm evaluating model performance on a raw holdout dataset (no sampling or rebalancing).
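To make the setup concrete, here's a minimal sketch of the scheme described above: undersample the majority class only in the training split, and leave the holdout untouched for computing precision and recall. The labels, class proportions, and split fraction are hypothetical, and the hand-rolled metric function stands in for whatever library call is actually used.

```python
import random

random.seed(0)

# Hypothetical labels with ~1% positives, mimicking a large unbalanced dataset.
labels = [1] * 100 + [0] * 9900
random.shuffle(labels)

# Hold out the last 20% raw -- no sampling or rebalancing is applied to it.
split = int(0.8 * len(labels))
train, holdout = labels[:split], labels[split:]

# Undersample the training majority class down to the minority-class count,
# so only the training set is balanced.
pos = [y for y in train if y == 1]
neg = [y for y in train if y == 0]
balanced_train = pos + random.sample(neg, len(pos))

def precision_recall(y_true, y_pred):
    """Precision and recall for 0/1 labels, computed on the raw holdout."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Because the holdout keeps the original class ratio, precision and recall computed this way reflect real-world performance; what undersampling does distort is the model's predicted probabilities, which shift upward relative to the true base rate.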

Are my crazy high precision and recall numbers valid?

Of course, there could be something fishy with my data, such as a predictor that leaks post-period information into my feature list. I think I've ruled that out.

submitted by /u/RobertWF_47