Interview with Soledad Galli on “Python Feature Engineering Cookbook, Second Edition”

Soledad Galli is the author of Python Feature Engineering Cookbook. We got the chance to sit down with her and find out more about her experience of writing with Packt.

Q: What was your motivation for writing this book? What inspired you?

Soledad: My goal has always been to make feature engineering simple and accessible to all data practitioners and data scientists worldwide.

Building machine learning models starts with feature engineering. In its native format, raw data is almost never suited to train machine learning models. In fact, data pre-processing takes up at least half of the time that data scientists spend on any data science project.

The good news is that there is a growing number of Python open-source libraries that aim to make feature engineering easier. Feature-engine, the Python open-source library I maintain, is one of them.

My intention, through this book, my courses, and Feature-engine, is first to show data practitioners the many ways in which we can transform data to make it suitable for machine learning, and second to help them simplify their feature engineering pipelines with open-source software, thereby reducing the time they spend on data pre-processing and feature engineering.

Q: What kind of research did you do, and how long did you spend researching before beginning the book?

Soledad: This is the second edition of the “Python Feature Engineering Cookbook”, which was, in turn, conceived on the back of my online course “Feature Engineering for Machine Learning” and the development of Feature-engine. Hence, I would say that I have been actively researching and working in the field of feature engineering for machine learning for the past 7 years.

Throughout this time, I’ve continuously reviewed scientific articles, articles published at the end of data science competitions, and white papers produced by the industry. I try to stay on top of new developments by attending meetings and following open-source developer communities.

Q: What’s your take on the technologies discussed in the book? Where do you see these technologies heading in the future?

Soledad: In the book, I discuss the implementation of feature engineering techniques using open-source Python libraries. Since the first edition of the book, almost every library has evolved towards incorporating more functionality to expand the options for data transformation.

Category Encoders, for example, now supports more options to represent categorical variables. Feature-engine now supports the creation of features for time series forecasting, in addition to more methods for tabular data transformation. Scikit-learn now also supports working with pandas dataframes, among other functionality enhancements.

There is a great appetite for automating feature engineering. But, since data preprocessing tends to be specific to the knowledge domain, and the needs for data transformation vary according to the models we want to train, the variables in our data, and our need to make sense of the predictions, this has proved really hard.

While there will continue to be a pursuit for automation, the changes that I’ve seen so far go towards making the knowledge and tools accessible to the public, empowering people to utilize them in their datasets and projects without the need to review an entire field of literature, therefore narrowing the gap between data gathering and project development. And I think, in the coming years, it will continue to go in this direction.

Q: What would you say makes this book unique or different?

Soledad: This book is, to my knowledge, the most exhaustive regarding feature engineering for machine learning. It goes into great detail about the data transformations we need or can perform on tabular data for supervised and unsupervised learning. It also includes a chapter on how to create features from text.

In the second edition, I included two new chapters: one showing how to extract features “automatically” from relational databases utilizing the open-source library Featuretools, and a second chapter to extract features automatically from time series data for classification and regression utilizing the open-source package tsfresh. With these, I think this book leaves almost no aspect of feature engineering untouched.

I am also the developer and maintainer of one of the Python open-source libraries for feature engineering. This gives me a unique opportunity to learn first-hand from users what features, functionality and data transformations they require to advance their projects.

Feature-engine has been vastly expanded thanks to users’ feedback and suggestions for new functionality, as well as many contributions from the community. Thanks to users, we can first make the functionality accessible through our library and then share some of the knowledge further in the second edition of this book, as well as in our courses and online articles and documentation.

Q: What are the key takeaways you want readers to come away with from the book?

Soledad: The book’s most significant lessons are probably these two:

  • There are multiple ways in which we can transform the data, depending on the nature of the variables and the machine learning models we want to create.
  • Even though it may appear overwhelming at first, open-source libraries make the implementation of feature engineering methods simple.
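As a small illustration of these two points (a sketch with made-up data, not an example taken from the book), the same categorical variable can be represented in more than one way with a few lines of pandas. Which representation works better typically depends on the model being trained.

```python
import pandas as pd

# Toy categorical variable; the values are illustrative only.
df = pd.DataFrame({"colour": ["red", "blue", "red", "green", "red", "blue"]})

# Option 1: one-hot encoding, which creates one binary column per category.
one_hot = pd.get_dummies(df["colour"], prefix="colour", dtype=int)

# Option 2: count encoding, which replaces each category with the number
# of times it appears and keeps the variable in a single column.
counts = df["colour"].value_counts()
df["colour_count"] = df["colour"].map(counts)

print(one_hot)
print(df)
```

Libraries such as Feature-engine and Category Encoders wrap transformations like these in reusable transformers, so the plain pandas version here is only meant to show the underlying idea.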

Q: How do you see these technologies benefiting society in the long run?

Soledad: These technologies will help data scientists, machine learning engineers and organisations develop and put into production entire machine learning pipelines in less time, with less effort and less code, and thus less maintenance, as the libraries take care of code, tests, versioning, growth and documentation.

Open-source is a great tool to spread knowledge. Anyone can contribute to the library. So techniques used by single organisations or individuals can be made available to the entire community.

The knowledge barrier to the use of machine learning might be lowered because, with the right documentation and guidance, the tools to transform data can be made available to less experienced practitioners. This is particularly useful for organisations with fewer resources, such as non-profits, as they will be able to implement machine learning and data science solutions more easily.

Q: What advice would you give to readers learning tech? Do you have any top tips?

Soledad: The most important thing, in my opinion, while learning any technology, is to use it. Therefore, my advice would be to code a lot and then some more. I believe that there is much to be learned from studying other people’s code. Hence, checking out code on websites for data science competitions can be useful. Additionally, looking at well-known repositories like Scikit-learn, Yellowbrick, or tsfresh can help to advance our coding abilities. At least that was true for me.

In terms of learning new concepts, I prefer to understand fully how things are done, why they’re done in a certain way, and what their benefits and drawbacks are. Only when I have all this information do I feel well positioned to decide which approach will best resolve the challenge I am working on and to anticipate potential future problems. So, as tempting as it is to read a blog and crack on with an implementation, spending some time to really understand what’s going on pays off in the long run.

Q: How do you keep up-to-date on your tech?

Soledad: I read a lot, mostly online. I regularly do online searches on relevant topics. I talk to colleagues and listen to their challenges and how they overcame them. I follow the main Python open-source libraries and subscribe to their mailing lists to get an overview of what is going on. I regularly attend and/or speak at meetups. I try to keep engaged with the community one way or another.

Q: Do you have a blog that readers can follow?

Soledad: We write a lot about feature engineering and selection at https://www.blog.trainindata.com/

Q: Anything else that you would like to share with your readers?

Soledad: Only that writing this book, teaching my courses and developing the open-source library Feature-engine have been some of the most rewarding activities that I’ve done in my life. I really appreciate readers’ and users’ feedback. So, don’t hesitate to connect on LinkedIn and tell us how we are doing, or jump on to our repo to make requests for functionality or contribute code.

Q: How would you describe your author journey with Packt? Would you recommend Packt to aspiring authors?

Soledad: Packt editors made it really easy for me to write the book. The guidelines were clear, and so were the timelines. Everything ran smoothly. So I guess I would 🙂

Soledad’s book, Python Feature Engineering Cookbook, is available on Amazon.com.
