A Data Scientist’s Lessons from Building Artificial Intelligence (AI) Applications
Over the last three months, I've been diving into the world of artificial intelligence (AI), building two applications centered on Large Language Models (LLMs). One is a personal financial advisor built with the OpenAI API and Streamlit. The other, an automatic data analysis tool, builds on LangChain and Streamlit. These projects, while prototypes, have been invaluable learning experiences for me, especially as a data scientist. Throughout the process, I've gained insights into five key aspects of AI application development that have significantly shaped my approach to and understanding of the AI field.
80% of development time is spent experimenting with prompts
Prompt engineering plays a pivotal role in getting the most out of Large Language Model (LLM) APIs. It goes beyond merely framing questions: it extends to controlling the format and logic of the API outputs. Several prompt engineering strategies have proven effective, notably Chain of Thought (CoT) and Algorithm of Thoughts (AoT), especially in complex reasoning scenarios such as solving math problems correctly. The quality of API outputs is heavily influenced by the specificity and detail provided in the prompts, so personalized prompts yield more accurate results. However, there is no one-size-fits-all approach to prompt engineering. While online guides and templates are available, developers and practitioners need to spend substantial time experimenting with and tailoring prompts to their unique use cases for optimal outcomes.
Things to consider: As a starting point, DeepLearning.AI offers a practical prompt engineering course.
Things to consider: Set up different user profiles or use cases, experiment with prompts accordingly, and evaluate what works well
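To make that experimentation repeatable, it helps to assemble prompts from reusable parts (persona, user context, task, reasoning cue) rather than rewriting them by hand each time. Below is a minimal sketch of this idea; the function name, the financial-advisor persona, and the example figures are all illustrative, not from any particular library.

```python
def build_prompt(role, task, context, reason_steps=True):
    """Assemble a structured prompt from reusable parts.

    role:         a persona line, e.g. "You are a personal financial advisor."
    task:         the user's request.
    context:      user-specific details that make answers more accurate.
    reason_steps: append a chain-of-thought cue for multi-step reasoning.
    """
    parts = [role, f"Context: {context}", f"Task: {task}"]
    if reason_steps:
        # The classic CoT cue: ask the model to reason before answering.
        parts.append("Think through the problem step by step before answering.")
    parts.append("Answer in at most three short paragraphs.")
    return "\n".join(parts)

# Hypothetical profile for one of the user personas mentioned above.
prompt = build_prompt(
    role="You are a personal financial advisor.",
    task="Suggest a monthly savings plan.",
    context="Income $5,000/month, rent $1,800, goal: $10,000 emergency fund.",
)
```

Swapping `context` per user profile then lets you A/B-compare prompt variants against the same task.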
User interfaces (UI) make a huge difference for AI applications
Effective user interfaces and experiences (UI/UX) are essential for reaching and engaging a broader audience. The user interface serves as the first impression of an application, significantly influencing user adoption and retention. ChatGPT's remarkable success is partly attributed to its user-friendly chat format, which offers natural and intuitive interaction. However, not every AI application fits the standard chatbox interface, so selecting a suitable, tailored interface is crucial to building a successful AI application. Emerging examples, such as Adobe Firefly's UI and Notion AI's UX, illustrate the evolving landscape of AI interfaces, underlining the importance of adaptable, user-centric designs that ensure meaningful user interaction and engagement.
Things to consider: “Emerging UI/UX patterns in Gen AI” is worth reading
Things to consider: Design based on user needs and preferences and solve their pain points by leveraging AI
The choice between open-source and closed-source models depends on your use case
OpenAI operates as a closed-source API, providing stability and consistency in its services. In contrast, LangChain, an open-source framework, offers access to a wider range of Large Language Models (LLMs). However, LangChain's open-source nature brings unique challenges. For instance, frequent API changes make web applications built on it prone to failures, and debugging and customization are more complex due to its API design. A third alternative is fine-tuning or customizing your own model, which grants control but demands considerable resources: data, skills, time, and computing power. Choosing the right option involves weighing these factors against the specific needs of your project.
Things to consider: Pick the model or API most suitable for your use case, and be prepared for its strengths and weaknesses
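One way to stay prepared for API churn is to keep the application decoupled from any single provider behind a thin wrapper, so that swapping OpenAI for a LangChain-served model (or a local one) changes a single constructor argument instead of every call site. The sketch below is a simplified illustration of that pattern; `LLMClient` and the stub backend are hypothetical names, not part of either library.

```python
class LLMClient:
    """Thin wrapper isolating the app from a specific LLM API.

    backend: any callable that takes a prompt string and returns text.
    Real backends would wrap an OpenAI or LangChain call; a breaking
    change in either library is then confined to one adapter function.
    """

    def __init__(self, backend):
        self._backend = backend

    def complete(self, prompt):
        try:
            return self._backend(prompt)
        except Exception as exc:
            # Surface provider breakages in one place instead of
            # scattering try/except blocks across the app.
            raise RuntimeError(f"LLM backend failed: {exc}") from exc

# Stub backend standing in for a real API call in this sketch.
def echo_backend(prompt):
    return f"(stub) {prompt}"

client = LLMClient(echo_backend)
reply = client.complete("Summarize this dataset.")
```

When a provider's interface changes, only the adapter function needs updating, which is exactly the failure mode described above.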
Accuracy and consistency of LLM outputs matter
The accuracy of Large Language Models (LLMs) poses a significant challenge in AI application development, because they can produce results that end users cannot easily verify or recognize as false. This presents a unique hurdle for developers and practitioners. Techniques like retrieval-augmented generation (RAG) have proven effective at mitigating inaccuracies under some circumstances. Another pain point is the inconsistency of LLMs: with both the OpenAI and LangChain APIs, it is difficult to predict what the model will produce on any given call. Outcomes can vary widely, making consistency an ongoing concern for developers and practitioners. Implementing checkpoints could help address the issue, but the OpenAI API does not expose this functionality yet.
Things to consider: Implement best practices to mitigate performance flaws
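One common-practice example for reducing run-to-run variation is pinning the sampling settings in every request. The sketch below builds an OpenAI-style chat request payload with `temperature` set to 0 and a fixed `seed`; note that seeding is best-effort on the provider's side (it improves but does not guarantee reproducibility), and the model name here is just a placeholder.

```python
def deterministic_request(model, messages):
    """Build a chat-completion payload tuned for consistency.

    temperature=0 makes the model pick the most likely token at each
    step; a fixed seed asks the provider for reproducible sampling
    where supported. Neither fully eliminates variation.
    """
    return {
        "model": model,
        "messages": messages,
        "temperature": 0,
        "seed": 42,  # best-effort reproducibility, not a guarantee
    }

# Placeholder model name; substitute whatever model your app targets.
payload = deterministic_request(
    "gpt-4o-mini",
    [{"role": "user", "content": "Summarize last month's spending."}],
)
```

The same payload dict can then be passed to the provider's chat-completion call, so every code path shares one consistency policy.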
AI is a powerful tool but it has clear limitations
Artificial intelligence (AI) is a transformative technology with remarkable capabilities, especially at tasks such as generating images and creating text. However, AI has known limitations. One is the phenomenon of hallucination, where models generate false or misleading information. Bias is another significant limitation: AI systems can inherit biases present in their training data, leading to unfair or discriminatory outcomes. Moreover, safety is a paramount concern, especially in applications where AI interacts with the real world, such as content moderation. As the AI landscape continues to evolve, addressing these concerns and advancing responsible AI practices will be essential for the future of AI-driven applications.
Things to consider: Apply AI to solve appropriate business problems
Things to consider: Be proactive in addressing AI's caveats with cross-functional partners
Demystifying Data Science: Unveiling Five Distinct Paths for Career Success!
In the ever-changing field of analytics, I've accumulated nearly a decade of experience working as a data scientist across various organizations, and I aim to bust common myths about this profession. Contrary to the belief in a one-size-fits-all data scientist, I've identified five distinct types within the field. In this post, I'll dive into their specific functions, tools, and skills, offering an in-depth look at each type. My goal is to empower fellow data scientists to reskill and upskill for their career growth. Additionally, for business partners, this information can aid in understanding when and how to engage with data scientists effectively. Companies and HR professionals can also use this guide to assess whether integrating data science into their operations is necessary at this point.
Business/product analysts: This category of data scientists is most similar to business analysts, commonly handling ad-hoc data using tools such as Excel. Their tasks often involve running descriptive statistics to cater to specific business requirements. The key to their success lies in profound business acumen and an understanding of business dynamics.
Business intelligence engineers: Data Scientists in this category share similarities with Business Intelligence Engineers (BIEs). They focus on metric development, reporting, and tracking. While they diverge from BIEs in terms of data engineering and automation tasks, they often utilize similar reporting tools like Tableau and PowerBI. Their success hinges on a deep understanding of business contexts, expertise in dashboard creation, and skillful data visualizations.
Data analysts: In this category, there are two distinct types of data scientists, each with specialized expertise. One group specializes in measurements, involving tasks like A/B testing and experiments. The other group focuses on statistical analyses, including inferential and other applied statistics. Both types employ common tools like R, Python, and SQL. Their success is driven by the insights and recommendations derived from analytical outcomes.
Applied scientists: These data scientists perform tasks that align with Machine Learning Engineers (MLEs) to some extent. They leverage machine learning techniques and artificial intelligence algorithms to tackle business problems, although their emphasis on software development is NOT as prominent as that of MLEs. Their success is usually measured by the impact their models have on improving business outcomes.
While these five types of data scientists may vary in their functions, tools, and technical skills, they all rely on shared non-technical skills to succeed. Proficiency in understanding business contexts is essential: it allows data scientists to align their analyses with organizational goals and priorities and to make the most meaningful business impact. Strong project management skills enable efficient handling of tasks, ensuring timely delivery and effective resource utilization.
Additionally, effective communication is crucial. Data scientists need to convey complex findings and insights in a comprehensible manner to diverse audiences, including both technical and non-technical stakeholders. Clear and concise communication ensures that the implications of data analyses are well-understood, actionable and impactful.
Collaboration is another vital skill. Data scientists often work in interdisciplinary teams, collaborating with colleagues from various functions. The ability to collaborate fosters an inclusive working environment where ideas can be shared, refined, and implemented collectively.
By nurturing these non-technical skills alongside their technical expertise, data scientists can maximize their impact, contributing significantly both to their organizations' success and to their own career growth.
Cracking the Data Code: Challenges Faced and Solutions Found in Data Science
In every profession, challenges are inevitable. As a data scientist, I've identified three major categories of pain points: technical hurdles, which can be addressed through embracing new tools and collaborative efforts; business-related issues, which can be tackled with support from engaged business partners; and impact challenges, which can be overcome through clear communication and delivering results effectively. By recognizing these challenges, my goal is to offer precise strategies to mitigate these issues and to foster problem resolutions in the world of data science.
Technical hurdles can be broadly classified into three categories: data-related problems, methodological challenges, and tooling issues. Data-related problems include integrating data from multiple sources, dealing with diverse data formats, managing structured and unstructured data, ensuring data accuracy and completeness, handling data complexity, and crafting effective data visualizations. Methodological challenges involve selecting appropriate methods and models and finding a balance between methods and available resources. Tooling issues encompass the need to learn and integrate new tools to effectively address business problems and to navigate a complex, dynamic landscape of past, current, and emerging technologies. These challenges underscore the multidimensional nature of technical issues faced in the field of data science.
Strategies:
To address data-related problems, data scientists can collaborate closely with data engineers and other peers to gain a deeper understanding of data. Additionally, data scientists can adopt new tools and technologies to effectively resolve some data-related problems.
For overcoming methodological challenges, data scientists can engage with both internal and external communities to learn and share best practices, empowering them to make more informed decisions.
The ability to learn and adapt to new technologies is an essential skill for successful data scientists in resolving tooling issues.
Business-related issues revolve around understanding the specific contexts within the business landscape. This includes gaining a deep and comprehensive understanding of the business's priorities, goals, and challenges. Successful data scientists need to align analyses with overarching business objectives, scope meaningful analyses within business contexts, and address business challenges with data insights. This alignment ensures that data-driven analyses are not just accurate but also strategically meaningful for the business.
Strategies:
Business partners are our allies in understanding the importance of these business challenges and addressing them using data-driven solutions.
By maintaining continuous dialogue and feedback loops with business partners, data scientists can fine-tune analyses to better match the needs and expectations of the business.
Impact challenges center on the effective delivery of meaningful insights and on orchestrating the landing of those insights within the organization. A major difficulty lies in the timing of execution and the timely delivery of insights to relevant stakeholders. Timing is crucial, as insights lose their potency if not delivered when they are most relevant. Data scientists need to strategize, plan, and execute analyses within a reasonable timeframe to maximize their impact. Moreover, landing data insights involves a careful balance between technical accuracy and accessibility. It is NOT just about presenting complex data and findings; it is about transforming intricate analyses into digestible narratives that resonate with both technical and non-technical stakeholders. Additionally, understanding organizational dynamics and tailoring insights to different stakeholders' objectives ensures that the impact of the analysis is maximized.
Strategies:
Ensure clear, concise, and actionable communication so that the findings resonate with various partners effectively.
Effective visualization, storytelling, and interpretation of results are crucial skills too. Data scientists can deliberately cultivate and enhance these skills through practice and continuous learning.