FUNDAMENTAL SKILLS TO BECOME A SUCCESSFUL DATA SCIENTIST THAT FEW TALK ABOUT

5/19/2025 · 6 min read

Introduction

Imagine you've taken the top Data Science courses on Coursera. You've dived deep into probability, linear algebra, and calculus. You've read countless articles on Medium and practiced with their Python code. You've also mastered various Neural Network architectures, made extensive use of TensorFlow tools, and learned how to use Large Language Models like ChatGPT. Finally, you've tested your skills in Kaggle competitions.

After all this hard work, you may think: "I am ready. I can tackle any challenge! I feel the Force!" Take it easy, Jedi Skywalker, and first let me tell you a little story.

The reality

Suppose you landed your first role as a Data Scientist at a company without a fully fledged Data Science department. As your first project, your director (not a Data Science person) asked you to help Steve, the Sales Manager, improve the weekly sales forecast. So, you talked to Steve and realized he wasn't thrilled about the project. "Never mind," you thought.

You asked Steve to send you some data to build forecasting models. He sent you e-mails with Excel, PDF, and TXT files. Shocked, you thought: "What is this? I just expected to receive a few features to run a good model, as I learned in my Data Science courses."

Lesson learned 1 – Forget the beautiful datasets from Data Science courses. Real-world data is mostly messy, dirty, unstructured, wrong, or missing, and it comes from many different sources.

You discussed the absence of a data pipeline with your director and learned that the company didn't have Data Engineers. So, if you wanted the project to succeed, you would have to build the pipeline yourself. OK, let's go on. After a lot of Python code, SQL queries, and experiments with various ETL (Extract, Transform, and Load) libraries, you finally got nearly 300 features. But which features would be relevant? "It doesn't matter. I'll just feed all these features into a gradient boosting algorithm, and it will select the best ones."

Surprise. Your results were awful. You got a 40% error, while Steve typically reached 20% with old, understandable, and simple rules. As a next step, you used feature selection and dimensionality reduction algorithms. Your new best error was 30%. Well, you had improved, but not by much.

Frustrated, you began to think about what you could have forgotten. Then you realized you hadn't asked Steve's sales team which data (features) were relevant to them.
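The story doesn't say which error metric is behind those percentages. As a minimal sketch, assume it is the mean absolute percentage error (MAPE) and that Steve's "simple rule" is something like "next week will look like last week". Both are assumptions for illustration, and the sales numbers are made up. The point is to measure a simple baseline first, so you know what your model has to beat.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Weeks with zero actual sales are skipped, because the
    percentage error is undefined for them.
    """
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    if not pairs:
        raise ValueError("need at least one non-zero actual value")
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)


def last_week_rule(history):
    """Hypothetical simple rule: forecast each week as the previous week's sales."""
    return history[:-1]


# Made-up weekly sales, for illustration only.
weekly_sales = [100, 120, 90, 110, 130, 100]

actual = weekly_sales[1:]                  # weeks we can score
baseline = last_week_rule(weekly_sales)    # one forecast per scored week
baseline_error = mape(actual, baseline)

print(f"Baseline MAPE: {baseline_error:.1f}%")
```

If your 300-feature model can't beat a number like this, the problem is probably not the algorithm.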

Lesson learned 2 – Always try to learn everything you can from the people who work in the business. Be humble and understand the process that generates the data. Ask them which features they have tried, what has worked, what has not, and what they think could work. Generally, people appreciate it and cooperate when you genuinely try to understand the challenges of their job.

The time spent with Steve's sales team was worth it. They suggested data you didn't know was important, but the remarkable moment came when you tried combining features to better represent what was happening in the process. That's right. You had just begun to discover the power of Feature Engineering.

Lesson learned 3 – You don't need the most advanced Deep Learning algorithm with tons of hyperparameters to tune. Instead, spend your time looking for the smallest set of features that best represents the process and works with simple Machine Learning (ML) algorithms. It makes a huge difference and avoids headaches.
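To make "combining features" concrete, here is a minimal sketch using only the Python standard library. The field names (`units`, `price`, `promo_days`) and the three derived features are hypothetical. They are not the ones from the story, just the kind of thing a sales team might suggest.

```python
from statistics import mean


def engineer_features(weeks, window=4):
    """Build a few combined features for each week.

    `weeks` is a list of dicts with hypothetical keys:
      'units'      - units sold that week (the value to forecast)
      'price'      - planned average price for the week
      'promo_days' - planned number of promotion days (0 to 7)

    Only information known before the week starts is used: planned
    price and promotions, plus sales from strictly earlier weeks.
    """
    rows = []
    for i, week in enumerate(weeks):
        past = weeks[max(0, i - window):i]  # earlier weeks only
        if past:
            avg_units_prev = mean([w["units"] for w in past])
            price_vs_recent = week["price"] / mean([w["price"] for w in past])
        else:
            avg_units_prev = None   # no history yet
            price_vs_recent = None
        rows.append({
            "promo_share": week["promo_days"] / 7,   # fraction of the week on promotion
            "price_vs_recent": price_vs_recent,      # below 1 means cheaper than usual
            "avg_units_prev": avg_units_prev,        # recent sales level
        })
    return rows
```

Three features like these are easy to explain to the sales team, and a simple model trained on them is easy to debug. Note that the rolling averages only look backwards, so the target week never leaks into its own features.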

After two weeks of feature combinations and transformation tests, internet and ChatGPT research about Python libraries for time series feature extraction, and a tough fight against overfitting, you got 16% as your new best error.

A couple of months had passed since you began the project, and, in your mind, you were still far from the 2% to 5% errors you typically read about in Kaggle or Medium articles. You concluded you had failed.

Your director was worried about the lack of updates from you. He asked you about the project results. Sheepishly, you told him the bad news. "What! You have reduced the sales team's forecasting error by 20%. It will save the company good money. You will present this to our CEO. Good job," said your director.

Lesson learned 4 – In the end, the metrics that matter are the business metrics, not the Machine Learning (ML) metrics. For business decision-makers, that means how much money the ML model can save or earn. Also, the same ML metric value doesn't mean the same thing for different problems. For example, a 5% error is unacceptable for some problems, while 20% is quite impressive for others.
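The director's "20%" is the relative improvement: going from a 20% error to a 16% error removes one fifth of the original error. The sketch below shows that arithmetic and one rough way to turn it into money. The cost model and every number passed to `estimated_annual_saving` are made-up assumptions; in practice, the cost of a forecasting error has to come from the business.

```python
def relative_error_reduction(old_error_pct, new_error_pct):
    """Relative improvement, in percent of the old error."""
    return 100.0 * (old_error_pct - new_error_pct) / old_error_pct


def estimated_annual_saving(weekly_sales_value, old_error_pct, new_error_pct,
                            cost_per_unit_of_error, weeks_per_year=52):
    """Very rough saving estimate under a hypothetical linear cost model.

    Assumes each currency unit of forecast error costs the company
    `cost_per_unit_of_error` (overstock, lost sales, rush orders...).
    """
    avoided_error_value = weekly_sales_value * (old_error_pct - new_error_pct) / 100.0
    return avoided_error_value * cost_per_unit_of_error * weeks_per_year


# Error rates from the story: Steve's rules at 20%, the new model at 16%.
improvement = relative_error_reduction(20, 16)

# Made-up business numbers, for illustration only.
saving = estimated_annual_saving(
    weekly_sales_value=1_000_000,
    old_error_pct=20,
    new_error_pct=16,
    cost_per_unit_of_error=0.10,
)

print(f"Relative error reduction: {improvement:.0f}%")
print(f"Estimated annual saving: {saving:,.0f}")
```

A number like the second one is what your director heard when you said "16%".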

Yes, you are the guy. That's your big chance to show the CEO how good a Data Scientist you are. A 30-slide PowerPoint presentation is all you need to achieve this. It included math formulas to prove you knew the theory. You also added beautiful charts to show how you avoided model overfitting. Finally, you mentioned an incredible Python library for hyperparameter tuning that you had spent many nights researching and learning.

As a test, your director asked you to give him the presentation first. Hmm. He didn't like it. He suggested you be more objective and less complicated. You disagreed. It was crystal clear. You were 100% sure the CEO would like it.

Your presentation was a complete disaster. On the fourth slide, the CEO interrupted you and said your presentation was too complex. Next, he asked about the project results. What benefits would the model bring to the company? How much money would the model save or earn? (Do you remember Lesson learned 4?) Finally, he ordered you to make a new presentation with four slides at most.

Lesson learned 5 – CEOs, board members, and business decision-makers are, most of the time, not Data Scientists. They don't care about the ML models or the math behind them. Instead, they care about the results. Think like them. Speak their language. Find out which KPIs (Key Performance Indicators) or business metrics matter to them and use those to show the project's impact.

You went ahead and remade the presentation, and this time you succeeded. The CEO understood the potential results and asked you when the sales team could start using an application (app) with the new model. You explained to him that this should be a task for a Machine Learning Engineer, not you. He replied that there wasn't a Machine Learning Engineer available, and that the model was useless without a working app that Steve's team could use.

You took a deep breath and dove into this new challenge.

You spent the following months researching, studying, choosing, coding, and testing many topics, such as Windows vs. Linux, company hardware vs. cloud services, SQL vs. NoSQL databases, containers, Python and library versioning, data orchestration, MLOps, data drift, and building and deploying ML web apps.

Finally, the app was running and ready to be used. You taught Steve's sales team how to use it and held the project closing meeting with all stakeholders. Congratulations and champagne for everyone. Now, it was up to Steve's team to use the app.

Three weeks later, you called Steve to find out how the new model was performing. To your surprise, he answered that his team was not using the app. You asked him why, and this is what he told you: he didn't understand the model. It didn't make sense to him. Because of this, he would not allow his team to use the app. Nobody had asked him what kind of forecasting he needed or what app he would like to have. Finally, there was something he didn't say, but you got it: he considered the app a threat. He was the only employee in the company who knew how to run the weekly forecast. This knowledge meant power to Steve.

Lesson learned 6 – Your final goal must be to deliver a product or service that solves a significant problem for your client or brings them some gain. If the end user doesn't want to use your product, most of your hard work will end up in the trash. To avoid this, you must understand your client's needs and concerns.

Here the story ends.

Conclusion

I told you this story to illustrate that to become a successful Data Scientist, it's not enough to just develop the technical skills in Data Science. You must also:

Enhance your abilities to understand and communicate with people. After all, they are the ones who will use your products or services. If they don't like them, they won't use them, and you will fail.
Learn the principles of the fundamental processes of any business (for example, Sales, Procurement, Production, Budgeting, and Accounting). It's much more effective when someone explains their needs to a Data Scientist who understands these processes.
Whenever possible, select your projects carefully. Balance the project's resource requirements against its potential return. For example, a US$ 10 thousand problem might demand the same resources and effort to solve as a US$ 10 million problem.

Thank you for reading.
