Data Science Online Camp 2021 Summer
3 Tracks
11 Hours of Content
15 Videos
Conference Agenda
Data Science Solutions
09:30 – 10:00
Registration
10:00 – 10:45
Deep Learning for Chemical Reactions Screening
Oleksandr Gurbych
This presentation is about a machine learning approach for chemical reaction yield prediction. Chemical yield is the percentage of the reactants converted to the desired products. Chemists use yield models to select high-yielding reactions and score synthesis routes, saving time and reagents.

Here we adapted the Directed Message Passing Neural Network to work with incomplete reaction schemes and estimate chemical yield, taking molecular graphs as inputs. For a baseline and comparison, we evaluated several machine learning models and chemical structure representations. The models include Linear and Logistic Regression, Support Vector Machines, CatBoost, and Bidirectional Encoder Representations from Transformers. The methods use binary ECFP Morgan fingerprints, SMILESVec embeddings, and textual molecular representations as input features. Within each of the methods, we trained classification and regression models. The goal of each classification model was to separate zero-yield reactions from those with non-zero yields.

The models were trained and validated on a set of commercial reactions with multiple mechanisms and benchmarked on two single-mechanism reaction datasets. The graph neural network was either better than or comparable to previously investigated machine learning methods on the benchmark datasets, and displayed the highest predictive power among the tested models on the commercial dataset.

In this talk, I will
1. introduce the domain and the project goal
2. describe methods and chemical data representations
3. analyze results and practical applicability
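
A minimal baseline sketch of the fingerprint-based classification idea (not the speaker's exact pipeline): featurize molecules with binary ECFP Morgan fingerprints via RDKit and train a scikit-learn classifier to separate zero-yield from non-zero-yield reactions. The SMILES strings and labels below are illustrative placeholders.

```python
# Baseline sketch: ECFP (Morgan) fingerprints + logistic regression for
# zero- vs non-zero-yield classification. SMILES and labels are placeholders.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.linear_model import LogisticRegression

def ecfp(smiles, radius=2, n_bits=2048):
    """Binary Morgan fingerprint for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits))

# Hypothetical reactant SMILES and zero/non-zero-yield labels.
reactants = ["CCO", "c1ccccc1Br", "CC(=O)O", "c1ccncc1"]
labels = [0, 1, 0, 1]  # 1 = non-zero yield, 0 = zero yield

X = np.stack([ecfp(s) for s in reactants])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict_proba(X)[:, 1])  # estimated probability of non-zero yield
```

A real reaction scheme would combine the fingerprints of all reactants and reagents, and regression on the yield percentage would follow the same pattern.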
10:45 – 11:30
Predictive stock allocation
Aviv Gruber
Integrated real-time decision-support tools can provide significant value in the supply chain by reducing the impact of inventory imbalances and responding appropriately to the volatile nature of demand processes. This problem is addressed by the stock allocation model (SAM), which considers a warehouse at a central repair depot that serves multiple operational fields.

However, in recent years, technological advancements have gradually made it possible to repair or replace parts in the field, in addition to the central repair depot. Although more expensive to facilitate, this change could dramatically reduce operational and logistical costs.

In this talk, I will introduce
1. the domain and the questions that are constantly dealt with,
2. how those questions are currently dealt with,
3. an expansion of the mathematical model to support a local repair facility, and
4. how learning can be applied to the optimization space with a classifier, in order to draw structural conclusions without the expense of ongoing optimization tasks (a minimal sketch follows below).
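
A purely hypothetical sketch of point 4, not the speaker's SAM formulation: label historical inventory states with the decision an optimizer chose (ship to the central depot vs. repair in the field) and fit a classifier, so structural conclusions about recurring decisions can be read off without re-running the optimization each time.

```python
# Hypothetical illustration: learn the optimizer's past decisions from state features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Assumed state features: field stock level, depot stock level,
# expected repair time, demand rate (all synthetic here).
X = rng.random((500, 4))
y = (X[:, 0] < 0.3).astype(int)  # stand-in for past optimal decisions (1 = repair in field)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.feature_importances_)  # which state variables drive the decision
```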
11:30 – 12:15
TBD
Zach Frank
12:15 – 13:00
Getting Insights through Conversational BI & Analytics
Anand Ranganathan
It often takes a long time for business users in enterprises to get the right data and insights. As a result, access to data and analytics has traditionally been limited to power users and specialist data scientists with varying degrees of analytical and technical skills. This is where conversational analytics comes in. This paradigm allows any user to ask questions of their data in natural language, by text or voice, and receive a natural-language and visual answer from a bot.

In this talk, I shall present our efforts and experiences in building a conversational analytics platform on a chat-based interface. Our approach combines natural-language understanding, dialog management, natural-language generation/narration, data understanding and modeling, augmented analytics and automated visualization generation.
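
A toy, rule-based sketch of the question-to-answer loop described above; the table, the regex "understanding", and the narration template are all invented for illustration, whereas the real platform relies on full NLU, dialog management, and automated visualization rather than regexes.

```python
# Toy conversational-analytics loop: parse a question, run SQL, narrate the answer.
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("East", 120.0), ("West", 95.5), ("East", 40.0)])

def answer(question: str) -> str:
    match = re.search(r"total sales in (\w+)", question.lower())
    if not match:
        return "Sorry, I can only answer questions like 'total sales in <region>'."
    region = match.group(1).capitalize()
    total = conn.execute(
        "SELECT SUM(amount) FROM sales WHERE region = ?", (region,)
    ).fetchone()[0] or 0.0
    return f"Total sales in {region} were {total:.2f}."  # natural-language narration

print(answer("What are the total sales in East?"))
```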
13:00 – 13:45
Good, Fast, Cheap: How to do Data Science with Missing Data
Matt Brems
If you've never heard of the "good, fast, cheap" dilemma, it goes something like this: You can have something good and fast, but it won't be cheap. You can have something good and cheap, but it won't be fast. You can have something fast and cheap, but it won't be good. In short, you can pick two of the three but you can't have all three.

If you've done a data science problem before, I can all but guarantee that you've run into missing data. How do we handle it? Well, we can avoid, ignore, or try to account for missing data. The problem is, none of these strategies are good, fast, *and* cheap.

We'll start by visualizing missing data and identifying the three different types of missing data, which will allow us to see how they affect whether we should avoid, ignore, or account for the missing data. We will walk through the advantages and disadvantages of each approach, as well as how to visualize and implement it. We'll wrap up with practical tips for working with missing data and recommendations for integrating these approaches into your workflow!
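
As a quick taste of the "account for it" option, the placeholder example below inspects where values are missing with pandas and imputes them with scikit-learn instead of dropping rows; the column names and the median strategy are illustrative only.

```python
# Inspect missingness, then impute rather than drop rows.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [34, np.nan, 52, 41],
                   "income": [48000, 61000, np.nan, np.nan]})

print(df.isna().sum())   # count of missing values per column
print(df.isna().mean())  # share of missing values per column

imputer = SimpleImputer(strategy="median")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)
```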
13:45 – 14:30
Using Transformers in NLP
Ravi Ilango
Objective: Participants will learn how transformers are used in NLP and their use cases. They will also get hands-on experience building and using an NLP model in Google Colab.

An introduction to various sequence learning models and use cases

The Hugging Face Transformers library

Exercise: Participants will complete a simple hands-on exercise (a minimal example is sketched below). They will need a computer, a browser, and an internet connection.
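
For orientation, here is roughly what the first lines of such an exercise might look like with the Hugging Face Transformers library in a Colab-style notebook; the workshop's exact model and task are not specified here, so this sketch relies on the library's default sentiment-analysis checkpoint.

```python
# Minimal Hugging Face Transformers example: a ready-made sentiment pipeline.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a pretrained model on first run
print(classifier("Transformers make transfer learning in NLP remarkably easy."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```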
14:30 – 14:45
Conference Closing
Machine Learning
09:30 – 10:00
Registration
10:00 – 10:45
Continuous SQL with Kafka and Flink
Timothy Spann
Streaming, Real-Time Data, IoT, Kafka, Flink, NiFi
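
As a flavour of the topic, here is a hedged PyFlink sketch of a continuous SQL query over a Kafka topic; the topic name, schema, and broker address are assumptions, and the Flink Kafka connector jar must already be on the classpath.

```python
# Continuous SQL over Kafka with PyFlink: define a Kafka-backed table, then run
# a standing aggregation whose results update as new events arrive.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE sensor_readings (
        device_id STRING,
        temperature DOUBLE,
        event_time TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'iot-sensors',
        'properties.bootstrap.servers' = 'localhost:9092',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

t_env.execute_sql(
    "SELECT device_id, AVG(temperature) AS avg_temp "
    "FROM sensor_readings GROUP BY device_id"
).print()
```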
10:45 – 11:30
MLOps Will Change Machine Learning
Magdalena Stenius
Machine learning has evolved from the experimenting stage to real-world production systems with a need for automated quality assurance and delivery, reproducibility and deployment consistency. How can we extend our DevOps processes into MLOps and how will that impact the machine learning systems we use today?
11:30 – 12:15
Making your machine learning model usable by others
John Robert
Building the most accurate machine learning models is amazing, but others should also get to experience how amazing your model is. Machine learning models should not only live on your computer; they should be turned into a product or service. This process involves deploying your machine learning model and is part of what is called MLOps.

The world is changing, and the data we generate is changing with it. This means your machine learning models should be able to adapt to changes in the data. Training your model once is not enough; you need to retrain it with new data, because your model is meant to learn continuously. You also need to manage the lifecycle of your model, for instance by monitoring its performance.
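
As one minimal, generic illustration of the "turn it into a service" step (one of many possible MLOps stacks, not necessarily the one covered in the talk), the sketch below loads a previously trained, pickled model and exposes it behind an HTTP endpoint with Flask; the file name and feature layout are placeholders.

```python
# Serve a trained model over HTTP so other applications can use it.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)
with open("model.pkl", "rb") as f:  # a previously trained, pickled model (placeholder path)
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]  # e.g. [[5.1, 3.5, 1.4, 0.2]]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```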
12:15 – 13:00
What Do I See in This Data? Evolution of visual tools to enhance data understanding
Max Novelli
At the Rehab Neural Engineering Labs (RNEL), where I held the position of "Head of Data and Informatics," scientists collect multi-dimensional, multi-modal, multi-format data from a multitude of cutting-edge neuroscience and neural engineering experiments, such as developing brain-computer interfaces to control robotic limbs in a subject with spinal cord injury and enhancing natural sensory feedback in neuro-prosthetic limbs. A primary challenge is managing the collected data.

A lack of transparency regarding the foundation of the data structure can prevent users from completely understanding the full range of data descriptors within the datasets, leading to large-scale data duplication or sub-optimal usage. These situations can impact the FAIR principles by degrading appropriate curation and effective use of the datasets. Users are often confronted with a steep learning curve in familiarizing themselves with a large and constantly evolving data structure. The traditional data dictionary models are not helpful in such a dynamic landscape.

Data visualization techniques based on readily-available libraries like d3.js and plotly can provide a highly effective means for exploring this underlying structure and facilitating better understanding. In RNEL, we have produced multiple proof-of-concept multi-dimensional visualizations of complex datasets that can address the issues mentioned above. This talk will illustrate how the Radial Schema Tree, or RadSaT, has been designed and developed to address one specific case. Additionally, the talk will highlight the evolution of the tool during its lifespan and the reasons behind it.
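
For a rough idea of the approach, the snippet below uses plotly to render an invented, nested schema as a radial (sunburst) chart, in the spirit of, but far simpler than, the Radial Schema Tree described above.

```python
# Toy radial view of a hierarchical data schema with plotly; node names are invented.
import plotly.express as px

names   = ["dataset", "subject", "session", "trial", "neural", "kinematics"]
parents = ["",        "dataset", "subject", "session", "trial",  "trial"]

fig = px.sunburst(names=names, parents=parents)
fig.update_layout(title="Toy radial view of a hierarchical data schema")
fig.show()
```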
13:00 – 13:45
How we can use PySpark for building and training an ML model
Cvetanka Eftimoska
When you type "machine learning" into the Google search bar, you will find the following definition: machine learning is a method of data analysis that automates analytical model building. If we go deeper into machine learning and the definitions available online, we can further say that ML is, in fact, a branch of AI that has data as its basis, i.e. all decision-making processes, pattern identification, etc. are carried out with minimal human involvement. Since ML allows computers to find hidden insights without being explicitly programmed where to look, it is widely used in all domains for doing exactly that – finding hidden information in data.

Machine learning is nothing like it was at the beginning. It has gone through many developments and is becoming more popular day by day. Its lifecycle is defined by two phases: training and testing.

Next is PySpark MLlib, a scalable machine learning library that works on distributed systems. In PySpark MLlib we can find implementations of many machine learning algorithms (Linear Regression, Classification, Clustering, and so on). MLlib comes with its own data structures, including dense and sparse vectors as well as local and distributed matrices. The library APIs are very user-friendly and efficient.
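
A compact sketch of that training/testing lifecycle with PySpark's DataFrame-based ML API, using a made-up two-feature dataset; the column names and the choice of LogisticRegression are illustrative, not prescriptive.

```python
# Train/test lifecycle with PySpark ML on a tiny synthetic dataset.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

df = spark.createDataFrame(
    [(1.0, 0.5, 1), (0.2, 0.1, 0), (0.9, 0.8, 1), (0.1, 0.4, 0)],
    ["feature_a", "feature_b", "label"],
)

assembler = VectorAssembler(inputCols=["feature_a", "feature_b"], outputCol="features")
train, test = assembler.transform(df).randomSplit([0.75, 0.25], seed=42)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
model.transform(test).select("features", "label", "prediction").show()
```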
13:45 – 14:30
Data Science at scale with VerticaPy
Badr Ouali
VerticaPy allows our customers to grant data warehouse access to their data scientists and developers who are more conversant in Python. Please join us to learn more about our journey of more than three years that has resulted in some amazing and unique functionality that helps Vertica users do data analysis, exploration and machine learning modelling at scale using Python and Vertica.
14:30 – 15:15
Operationalize machine learning with Kubeflow
Mohamed Sabri
A presentation of Kubeflow and an introduction to MLOps.
Designing an architecture for deployment and re-deployment.
A live demo of deploying a pipeline for a real-life NLP use case (a minimal sketch follows below).
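
The sketch below is a bare-bones Kubeflow Pipelines example (assuming the KFP v1 SDK): two lightweight Python components chained into a pipeline and compiled to a YAML spec that can be uploaded to a Kubeflow cluster. The component names and logic are placeholders, not the actual NLP use case from the demo.

```python
# Two-step Kubeflow pipeline: build components from Python functions, chain them,
# and compile the pipeline definition to YAML.
import kfp
from kfp import dsl
from kfp.components import create_component_from_func

def preprocess(text: str) -> str:
    """Pretend preprocessing step."""
    return text.lower()

def train(corpus: str) -> str:
    """Pretend training step."""
    return f"model trained on {len(corpus)} characters"

preprocess_op = create_component_from_func(preprocess)
train_op = create_component_from_func(train)

@dsl.pipeline(name="nlp-demo", description="Toy two-step NLP pipeline")
def nlp_pipeline(text: str = "Hello Kubeflow"):
    cleaned = preprocess_op(text)
    train_op(cleaned.output)

if __name__ == "__main__":
    kfp.compiler.Compiler().compile(nlp_pipeline, "nlp_pipeline.yaml")
```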
15:15 – 15:30
Conference Closing
AI Business
09:30 – 10:00
Registration
10:00 – 10:45
Building the 21st Century: AI is the Disruptor… and Foundational Building Block
Neil Sahota
Artificial intelligence (AI) is not the future; it is very much the present. Much as with the iPhone, we are at a pivotal time where AI will be incorporated into many aspects of our professional and personal lives. However, many organizations are focused on automating what they are already doing rather than tapping into the new capabilities AI gives us. In this presentation, you will learn a simple, repeatable framework for moving beyond automation and unlocking true innovation with AI.
10:45 – 11:30
Enterprise Data & AI Strategy & Platform Designing
Rahat Yasir
The ideal components of a data & AI platform, the components available in Azure, designing a data and AI strategy, and generating revenue from data & AI.
11:30 – 12:15
Want to adopt AI in your business? Good luck!
Olivier Blais
Did you know that only 15% of all AI projects developed today are actually used? Popular case studies circulating on social networks and in the media feature the giants of this world, the likes of Google, UPS and Facebook.

They are developing AI projects at breakneck speed thanks to their army of data scientists and billions invested in R&D.

Don't try this at home: don't try to replicate their success in your company, in your context.
Replicating their recipe and waiting to meet all the winning conditions before launching an AI project will lead you to failure. You will lose time, and you will be left behind by your competitors.

Don't worry, all is not lost.

In full transparency, Olivier Blais delivers a talk on the current state of AI in business and proposes a new perspective based on his learnings in the field. You'll discover the best practices and the right steps for adopting AI in your company.

If you have data and a business goal, don't miss this talk.
12:15 – 12:30
Conference Closing