Many guides give you advice on how to get started in data science: which online courses to take, which projects to implement for your portfolio, and which skills to acquire. But what if you got started with your learning journey, and now you are somewhere in the middle and don’t know where to go next?
I want to share the 5 things I focused on that really helped me boost my data science learning and build more profound knowledge and skills. They strengthened both my technical and non-technical abilities, helping me implement successful projects and deliver results. If you’re at this middle point, you will learn more by spending your time on these 5 things than by taking another online course or implementing the ten-thousandth iteration of a Kaggle competition.
1. Implement large projects from start to finish
Most people tend to learn best by doing. That is why online courses will often teach you a skill or concept and then ask you to implement it in a small exercise or project. So working on projects will teach you a lot. Implementing large projects from start to finish with the potential for many pitfalls will teach you even more.
If you work on a project as part of an online course, the research question, dataset, and sometimes even the models and evaluation metrics are chosen for you. So you only need to focus on the code and implementation. However, choosing the research question, dataset, model, and evaluation metric is actually the tricky part! If you ever had to write a thesis as part of a university degree, you know that defining a research question is very difficult.
Whether you are a data analyst, BI analyst, or data scientist, part of your job is to identify patterns in large amounts of data without anyone telling you what exactly to look for. In other cases, you might be tasked with investigating a specific question, but you don’t have a dataset and need to think about what data could answer this question and how to procure it. These examples show that implementation alone does not fully prepare you for a data science job.
To practice your skills beyond implementation, you should develop your own project from scratch — starting with the research question and ending with deployment. The following resources help you tackle this task:
- Developing project ideas: If you need inspiration to come up with unique data science project ideas, you can check out my guide, where I discuss 5 steps that will help you come up with interesting and executable ideas.
- Project planning approaches: This blog post introduces 10 different methods to help you plan and manage your data science project, including the CRISP-DM, agile, ad hoc, and waterfall approaches.
- Data science project structure: Cookiecutter Data Science provides a standardized project structure that helps you create reproducible, self-documenting projects. It considers all steps in a project — from collecting and transforming data to generating reports with the results.
Aside from the start-to-finish implementation of a project, the focus on larger projects will increase your learning as well. Dealing with large datasets, implementing more models, and answering more questions will cause more problems and struggles along the way. While struggling can be frustrating, it actually teaches you valuable knowledge and skills. Facing a problem and then solving it yourself is a much more effective way to learn than being told about a potential problem and its solution.
2. Create your own datasets
For many data scientists, modeling is the most exciting part — figuring out which algorithms to use, implementing, fine-tuning, and evaluating them. However, as a professional data scientist, you have to deal with data collection and cleaning, which can consume up to 80% of your time.
If you work at a company without designated data engineers, you’ll likely be responsible for data acquisition. So understanding what data is relevant for a particular research question, where and how to acquire this data, and what preprocessing steps to take is vital. You should practice web-scraping (but keep it legal and ethical), familiarize yourself with sources that provide existing datasets and APIs (which you can combine and expand on), and transform the data for further analysis and modeling.
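To make the scraping idea concrete, here is a minimal sketch using only Python’s standard library. The HTML snippet, the `price` class, and the page structure are all hypothetical stand-ins for whatever site you legally and ethically scrape; a real project would fetch the page first (e.g. with `urllib.request`) and respect the site’s terms and `robots.txt`.

```python
from html.parser import HTMLParser

# Hypothetical sample page standing in for a real site you are allowed to scrape.
SAMPLE_HTML = """
<html><body>
  <ul>
    <li class="price">19.99</li>
    <li class="price">4.50</li>
  </ul>
</body></html>
"""

class PriceScraper(HTMLParser):
    """Collects the text of every element whose class attribute is 'price'."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # Flag that the next text chunk belongs to a price element.
        if ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(float(data.strip()))
            self.in_price = False

scraper = PriceScraper()
scraper.feed(SAMPLE_HTML)
print(scraper.prices)  # the extracted prices, ready for transformation
```

For messier real-world pages, libraries like Beautiful Soup make this kind of extraction far less fragile, but the principle — locate the elements you need and pull out structured values — stays the same.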
While many portfolio projects require a one-time acquisition of data, real-world applications often need ETL pipelines that continuously extract, transform, and load new data. So turn your data acquisition into an ETL process by writing a script that keeps pulling new data, transforms it, and saves it to a database.
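The extract-transform-load steps above can be sketched in a few lines of Python with the standard library’s `sqlite3`. The `extract` function here returns static, made-up records purely for illustration; in a real pipeline it would call an API or a scraper, and the script would run on a schedule.

```python
import sqlite3
from datetime import date

def extract():
    # Hypothetical extract step: in a real pipeline this would call an
    # API or scrape a site; static records are used here for illustration.
    return [
        {"city": "Berlin", "temp_f": 68.0},
        {"city": "Oslo", "temp_f": 50.0},
    ]

def transform(records):
    # Convert Fahrenheit to Celsius and stamp each row with the load date.
    return [
        (r["city"], round((r["temp_f"] - 32) * 5 / 9, 1), date.today().isoformat())
        for r in records
    ]

def load(rows, conn):
    # Append the transformed rows to a database table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS temperatures (city TEXT, temp_c REAL, loaded_on TEXT)"
    )
    conn.executemany("INSERT INTO temperatures VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # use a file path instead to persist between runs
load(transform(extract()), conn)
rows = conn.execute("SELECT city, temp_c FROM temperatures").fetchall()
print(rows)
```

Because `load` only appends, rerunning the script accumulates new data over time — the defining property of an ETL pipeline versus a one-time acquisition.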
The following resources help you get started on creating your own datasets and ETL pipelines:
- Datasets on Kaggle: Users on Kaggle have shared hundreds of datasets they created. Many of them explain the sources and processes used to acquire the data and give you an idea of where to look for data.
- Web-scraping for data science: Kerry Parker wrote a great guide on web-scraping for data scientists.
- APIs for data science: This article presents 22 APIs — including IBM Watson, Spotify, and Census.gov — that provide data for data science and machine learning.
- Awesome-ETL: Overview of resources to build ETL pipelines in various languages — including Python.
3. Read academic papers
When I want to get a high-level overview of a topic or understand the basic mechanics of an algorithm, my go-to resources are blog posts on Towards Data Science and other websites. However, a high-level understanding will only get you so far.
Reading the academic papers that introduce, compare, and contrast algorithms and machine learning approaches will provide you with more profound knowledge than any blog post ever could. For example, you learn why a particular algorithm was introduced, how it works mathematically, what other research and models exist addressing a similar problem, and what questions need to be addressed by future research.
Moreover, reading academic papers helps you stay on top of new developments within your field. All of your favorite ML algorithms and NLP models were developed by researchers and introduced in papers — random forests, XGBoost, BERT, GPT-3. Other papers help in understanding which algorithms perform best in specific scenarios.
With the knowledge gained from regularly reading academic papers, you will be better equipped to explain the inner workings of algorithms, choose the suitable models for your use case, and justify your decision. Yes, it can be difficult and exhausting to read scientific writing. But it is worth your while, and you will get better at it over time. The focus and effort you put into understanding a paper pay off in faster, deeper learning.
The following resources can help you get started with academic papers:
- Guide to reading academic papers: Kyle M Shannon explains why reading academic writings is so hard and how you can get better at it.
- Deep learning papers to read: Robert Lange publishes monthly summaries of his favorite new deep learning papers.
- RSS feeds: Use an RSS feed to stay up to date on the latest research. The arXiv feed allows access to research related to specific categories, such as computer science, statistics, or machine learning.
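If you prefer to script your reading list, an RSS feed is just XML you can parse with the standard library. The snippet below works on a trimmed sample that mimics the structure of an arXiv category feed; the paper titles and IDs are made up, and in practice you would download the real feed (e.g. for `cs.LG`) instead of using a hard-coded string.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mimicking the structure of an arXiv RSS feed;
# a real script would download the current feed for a category instead.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>cs.LG updates</title>
    <item>
      <title>A Hypothetical Paper on Random Forests</title>
      <link>https://arxiv.org/abs/0000.00001</link>
    </item>
    <item>
      <title>Another Hypothetical Paper on BERT</title>
      <link>https://arxiv.org/abs/0000.00002</link>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE_FEED)
# Each <item> is one new paper; collect (title, link) pairs.
papers = [(item.findtext("title"), item.findtext("link")) for item in root.iter("item")]
for title, link in papers:
    print(f"{title} -> {link}")
```

A dedicated feed reader or the `feedparser` library handles the same job with less code, but even this bare-bones version is enough to build a daily digest of new papers in your categories.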
4. Write technical blog posts
According to the Feynman technique, explaining a topic to someone else is a great way to learn it yourself. When you write technical articles on data science on Medium or TDS, you need to understand the material in enough detail to explain it to your audience. Thus, writing about data science is an excellent application of the Feynman technique. Writers on TDS have confirmed this numerous times.
A fantastic bonus of writing technical articles is that you have a resource for yourself. You can go back to your article if you want a refresher on an algorithm or project you implemented a while ago. In addition, these blog posts can demonstrate to employers that you understand a topic and are knowledgeable in data science in general. Finally, writing helps you practice your communication skills, which are vital for data scientists!
The following resources help you get started and improve your technical writing abilities:
- How to get started with data science writing: Rashi Desai provides reasons and guidance on how to write data science articles.
- Guides to writing about data science: Elliot Gunn created an overview of TDS articles discussing how to improve your data science writing.
- Tools and advice to publish successful articles: Anne Bonner shares general advice for blogging on Medium and discusses valuable tools, such as Grammarly and Hemingway App.
5. Work with others
I used to underestimate the value of working with or for others on data science and coding projects. That changed when I worked with a friend to design an online game. I noticed how much I could learn from him, not because he taught me anything, but because we exchanged ideas, shared our code, and had to communicate our vision and progress to each other.
Collaborating with a fellow data scientist or programmer, as well as solving a data-related problem for a friend or your current employer, teaches you skills that online courses cannot:
- Communicating with and presenting your findings to technical and non-technical audiences.
- Solving business problems where mistakes have real consequences (when you work in a business context).
- Adjusting your idea and code based on user feedback.
To get started on working with others, speak to a friend who might face a problem that you could solve with your data and coding skills. If you know data scientists or programmers, ask them to collaborate on a joint project. Pay attention at work to identify potential opportunities where you could leverage your data science skills. There are also numerous opportunities to work on projects with an online community, such as DataKind, Data for Good, or Statistics Without Borders.
Summary
Learning data science is hard, and not just because of all the technical knowledge you need to acquire. Structuring your learning journey, sticking to it, and feeling that your hard work pays off are challenging in themselves. I used to think online courses were the secret to becoming a great data scientist.
I discovered that working on large projects from start to finish, creating my own datasets and ETL pipelines, reading research papers, writing technical blog posts, and collaborating with others taught me invaluable skills I otherwise wouldn’t have acquired. I hope you will try out some of these tips yourself and boost your data science learning.