Dipanjan (DJ) Sarkar

237 points

Dipanjan (DJ) Sarkar is a Data Scientist at Red Hat, a published author, consultant and trainer. He has consulted and worked with several startups as well as Fortune 500 companies like Intel. He primarily works on leveraging data science, machine learning and deep learning to build large-scale intelligent systems. He holds a Master of Technology degree with specializations in Data Science and Software Engineering. He is also an avid supporter of self-learning and massive open online courses. He has recently ventured into the world of open-source products to improve the productivity of developers across the world.

Dipanjan has been an analytics practitioner for several years, specializing in machine learning, natural language processing, statistical methods and deep learning. With a passion for data science and education, he also acts as an AI Consultant and Mentor at various organizations like Springboard, where he helps people build their skills in areas like Data Science and Machine Learning. He is also a key contributor and Editor for Towards Data Science, a leading online journal focusing on Artificial Intelligence and Data Science. Dipanjan has also authored several books on R, Python, Machine Learning, Social Media Analytics, Natural Language Processing and Deep Learning.

Dipanjan's interests include learning about new technology, financial markets, disruptive start-ups, data science, artificial intelligence and deep learning. In his spare time he loves reading, gaming, and watching popular sitcoms and football. He also writes articles at https://medium.com/@dipanzan.sarkar and https://www.linkedin.com/in/dipanzan. He is a strong supporter of open source and publishes the code and analyses from his books and articles on GitHub at https://github.com/dipanjanS.


Authored Comments

Absolutely. Overall, the data structures are similar yet different, which makes it a bit confusing. But if you check the history of Spark's evolution (https://stackoverflow.com/questions/31508083/difference-between-datafra…), we first had RDDs, then DataFrames came into the picture in 2013, and finally Datasets spun off from DataFrames in 2015 as a type-safe version of DataFrames.

Datasets are pretty good and work quite well in native Spark (leveraging Scala), but since we use Python in our example, where the typed Dataset API is not available, we have to go with Spark DataFrames. Traditionally, Datasets have been slightly slower than DataFrames, but their performance is catching up (https://databricks.com/session/demystifying-dataframe-and-dataset). Hope this helps!