A pathway and mindset to becoming a successful data science generalist (DSG)
In my previous article, I shared reasons why aspiring or junior data scientists should consider becoming data science generalists (DSGs).
In this article, I want to share how aspiring or entry-level data scientists can position themselves and navigate the pathway towards becoming solid data science generalists: people who have built competence across the different areas and roles of data science.
A generalist is not meant to be a master at anything, but rather have a full grasp of the grand picture of things. A jack of all trades, a master of none.
The quote above is also representative of who a DSG is. The data science generalist understands the full picture of the entire data science project lifecycle, from understanding the business problem to communicating results and monitoring the implemented solution.
How does one become a data science generalist (DSG)? What is the pathway to getting there and what kind of mindset does one need to have to achieve this?
To be plain and simple, it is not an easy path to navigate. It can even be argued that it is impossible to become a DSG straight after graduating or as an entry-level data scientist. Why? Because it takes years to acquire all the skills needed to become one, even though you don’t need to master all of them. However, beginning a career in data science and choosing the path that leads to becoming a DSG is not an impossible choice. As long as it fits your vision for a future career in data science, you just need the right mindset and to set your feet on the right trail for the journey.
Pathway to becoming a DSG
If you’re starting out in data science, possibly the easiest pathway to becoming a DSG is to work for a start-up or a small company. There is a high likelihood that you will be the only data scientist in the company, so you will have the opportunity to wear many hats and quickly become exposed to the different areas of data science. You will build many products from scratch, and this will provide the exposure needed to become a proficient builder (data engineering), refiner/transformer (data mechanics), modeller (data science), and analyst (data analysis).
These areas of data science were discussed in the previous article as well, bar data mechanics. Data mechanics includes tasks like data formatting, data cleaning, data handling (querying, slicing and joining) and value interpretation. Below is a structured pathway that could be used as a guide in your learning journey towards becoming a DSG. It combines the phases of the project lifecycle with the different areas of data science, and main subjects/tools that are worth focusing on:
Business problem understanding (All areas)
This phase involves understanding the problem which the business wants to solve with data science.
- Learn how to do requirements gathering
Data acquisition (Data engineering)
This involves obtaining the data required for solving the problem.
1. Learn about creating data pipelines:
- using APIs to fetch data; the ETL process; orchestration and transformation tools (e.g. Airflow, dbt); data streaming to transport data (e.g. AWS Kinesis)
2. Learn about databases, data storage and data access:
- Postgres; SQL; NoSQL databases (e.g. MongoDB, graph databases); data lakes (e.g. AWS S3); data warehouses (e.g. Amazon Redshift)
3. Learn about data structures and formats:
- Structured data; unstructured data (e.g. text files, documents); JSON; arrays; dictionaries
4. Learn about distributed systems & how to use big data technologies/processing frameworks:
- Hadoop Distributed File System (HDFS); Sqoop; Flume; ZooKeeper; Apache Spark; MapReduce; AWS Lambda
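To make this phase concrete, here is a minimal ETL sketch that uses only Python's standard library. Everything specific in it is made up for illustration: the JSON payload stands in for an API response, and the `orders` table and its columns are hypothetical. A real pipeline would fetch the data over HTTP and would likely be scheduled by an orchestration tool.

```python
import json
import sqlite3

# Hypothetical API payload; a real pipeline would fetch this over HTTP.
RAW_RESPONSE = """
[
  {"order_id": 1, "customer": "alice", "amount": "19.99", "currency": "usd"},
  {"order_id": 2, "customer": "bob", "amount": "5.00", "currency": "USD"},
  {"order_id": 3, "customer": "alice", "amount": null, "currency": "usd"}
]
"""


def extract(raw):
    """Extract: parse the JSON text into a list of dictionaries."""
    return json.loads(raw)


def transform(records):
    """Transform: drop rows without an amount and normalise types."""
    rows = []
    for rec in records:
        if rec["amount"] is None:
            continue  # skip incomplete records
        rows.append(
            (rec["order_id"], rec["customer"], float(rec["amount"]), rec["currency"].upper())
        )
    return rows


def load(rows, conn):
    """Load: write the cleaned rows into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def run_pipeline():
    """Run extract -> transform -> load, then query the result with SQL."""
    conn = sqlite3.connect(":memory:")  # in-memory database, nothing written to disk
    load(transform(extract(RAW_RESPONSE)), conn)
    totals = conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
    ).fetchall()
    conn.close()
    return totals


totals = run_pipeline()
```

The same three-step shape (extract, transform, load) carries over when the source becomes a real API and the target becomes Postgres or a data warehouse; only the connection details change.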
Data Understanding (Data science & data analysis)
This is the phase where we explore the data to understand its structure, quality and patterns.
1. Learn basic statistics:
- Descriptive statistics; Inferential statistics; Distributions; Univariate and multivariate analysis
2. Learn how to use data visualization libraries:
- matplotlib; seaborn
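As a small illustration of the statistics listed above, the sketch below computes univariate descriptive statistics and a simple bivariate measure (Pearson correlation) with Python's built-in `statistics` module. The ad spend and revenue figures are invented, and the helper names `describe` and `pearson` are my own.

```python
import statistics

# Made-up sample: monthly ad spend and revenue (illustrative numbers only).
ad_spend = [10.0, 12.0, 15.0, 11.0, 14.0, 40.0]
revenue = [100.0, 118.0, 152.0, 108.0, 139.0, 410.0]


def describe(values):
    """Return common univariate descriptive statistics."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient for a simple bivariate analysis."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


summary = describe(ad_spend)
r = pearson(ad_spend, revenue)
```

Here the mean of `ad_spend` sits above its median because of the single large value, which is the kind of skew that a histogram in matplotlib or seaborn would make visible at a glance.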
Data Preparation/Feature Engineering (Data mechanics & Data science)
This phase involves transforming the data into a suitable format for machine learning models.
1. Learn about data transformation techniques:
- Log transform; Square-root transformation; Categorical encoding; Vectorization; Power functions and Scaling
2. Learn about dealing with:
- Skewed data; Bias mitigation; Binning; Outlier detection
3. Learn how to use preparation and analysis tools:
- pandas; numpy
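The sketch below shows a few of the techniques from this phase in plain Python so that each step is visible: a log transform, min-max scaling, one-hot categorical encoding and outlier detection with the interquartile range (IQR) rule. The data and function names are made up for illustration, and the 1.5 multiplier is the common convention rather than a fixed rule. In practice pandas and numpy provide vectorised versions of these operations.

```python
import math
import statistics


def log_transform(values):
    """log(1 + x) compresses large values and reduces right skew."""
    return [math.log1p(v) for v in values]


def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(categories):
    """Encode each category as a 0/1 vector, one column per distinct value."""
    columns = sorted(set(categories))
    return columns, [[int(c == col) for col in columns] for c in categories]


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


# Made-up data: one numeric column with an extreme value, one categorical column.
incomes = [20.0, 22.0, 25.0, 21.0, 24.0, 23.0, 300.0]
plans = ["basic", "pro", "basic", "free", "pro", "basic", "free"]

scaled = min_max_scale(log_transform(incomes))
columns, encoded = one_hot(plans)
outliers = iqr_outliers(incomes)
```

Applying the log transform before scaling keeps the extreme income from squashing every other value towards zero, which is why the two steps are chained here.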
Model Implementation (Data science)
This is the core task of a data science project that requires writing, running and refining the programs to analyse and derive meaningful business insights from data.
1. Learn how to code better with a programming language:
- Python; R; Java; object-oriented programming; writing clean code; writing algorithm functions
2. Learn math for data science:
- linear algebra; vector calculus; probability and distributions; dimensionality reduction
3. Learn about how machine learning and deep learning models work under the hood:
- logistic regression; naive Bayes; SVM; random forest; neural networks; autoencoders
4. Learn how to use machine learning and deep learning libraries and frameworks:
- scikit-learn; TensorFlow; Keras
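To give a feel for what "under the hood" means, here is a minimal sketch of logistic regression with one feature, trained by batch gradient descent on the log loss. It is written from scratch for learning purposes and is not how a library such as scikit-learn implements it. The data (hours studied versus passing an exam), the learning rate and the number of epochs are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(xs, ys, lr=0.1, epochs=2000):
    """Fit weight w and bias b by batch gradient descent on the mean log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction errors for the current parameters.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean log loss with respect to w and b.
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict(x, w, b, threshold=0.5):
    """Turn the predicted probability into a 0/1 class label."""
    return int(sigmoid(w * x + b) >= threshold)


# Made-up data: hours studied -> passed the exam (1) or not (0).
hours = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
passed = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = train_logistic_regression(hours, passed)
```

Once the loop makes sense, the fitted `w` and `b` are easier to relate to the coefficients that a machine learning library reports for the same kind of model.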
Model Deployment (Data science and data engineering)
Once we have trained and evaluated the model, deploying the model into a pre-production or production environment is next.
1. Learn about different ways to put the model into production:
- Model training methods — One-off training; batch training; real-time training
- Model serving methods — Batch prediction; online-inference
2. Learn to use cloud technologies for ML model deployment:
- Amazon SageMaker; Google Cloud Platform; Azure ML
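The difference between the two serving methods above can be sketched in a few lines. In this hypothetical example the "model" is a pair of logistic regression parameters saved as JSON; the parameter values, the `hours` field and the `OnlineService` class are invented for illustration, and a production setup would sit behind a web framework or a managed cloud endpoint.

```python
import json
import math


def export_model(w, b):
    """Serialise trained parameters so another process can load them."""
    return json.dumps({"w": w, "b": b})


def load_model(artifact):
    """Rebuild a scoring function from the serialised parameters."""
    params = json.loads(artifact)
    return lambda x: 1.0 / (1.0 + math.exp(-(params["w"] * x + params["b"])))


def batch_predict(artifact, rows):
    """Batch prediction: score a whole dataset in one scheduled run."""
    score = load_model(artifact)
    return [round(score(x), 3) for x in rows]


class OnlineService:
    """Online inference: load the model once, then score one request at a time."""

    def __init__(self, artifact):
        self.score = load_model(artifact)

    def handle(self, request_body):
        x = json.loads(request_body)["hours"]
        return json.dumps({"probability": round(self.score(x), 3)})


artifact = export_model(w=2.0, b=-5.0)  # made-up parameters
nightly_scores = batch_predict(artifact, [1.0, 2.5, 4.0])
service = OnlineService(artifact)
response = service.handle('{"hours": 4.0}')
```

Batch prediction suits cases where results can be computed ahead of time and stored, while online inference suits cases where each prediction depends on a request that arrives in real time.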
Communicating Results (Data analysis & data science)
This phase involves presenting and communicating the model’s results to business people and stakeholders.
- Learn about dashboard architecture
- Learn how to use visualization tools (Amazon QuickSight, Power BI, Tableau)
Mindset
Always learning: Receiving your diploma or landing your first job isn’t the end of the journey. In fact, that’s when the work begins for aspiring DSGs. Because of the number of skills to be acquired, the DSG is constantly building exposure and acquiring knowledge in the different areas of data science. A major caveat here is that it’s not just any type of exposure or information that should be acquired, but the RIGHT kind. Here’s what I mean:
If you decide to gain knowledge by doing data science online courses, be intentional in picking a course that trains you for a particular phase in the lifecycle (e.g. data preparation using pandas) or a particular area of data science (e.g. building data pipelines in data engineering). When you become comfortable with the subject matter, look for a project or use case in that phase/area to practice the materials learned, then tick it off from your pathway list. The goal is continuous learning.
“The more you know the more you know, for sure, there’s many things you don’t know”
Bonus Tip → Since continuous learning and seeing the big picture are key to becoming a data science generalist, I suggest diversifying your data-related blogs and learning platforms as much as possible. These learning platforms have helped me in my journey so far:
CloudxLab (lots of data engineering related content)
Dataquest (data engineering, data science and data analysis content)
DataCamp (data engineering, data science and data analysis content)
Mode blog (articles about all things data)
Medium (data science related content)
Data School (data mechanics related content e.g. pandas tricks)