Explore all episodes of the Data Science Tech Brief By HackerNoon podcast
Dive into the complete list of Data Science Tech Brief By HackerNoon episodes. Each episode is catalogued with a detailed description, making it easy to find and explore specific topics. Follow every episode of your favorite podcast and never miss relevant content.
This story was written by: @liorb. Learn more about this writer by checking @liorb's about page,
and for more stories, please visit hackernoon.com.
Even the most well-equipped organizations can find themselves serving up a mess instead of actionable insights. Here's a step-by-step process for fixing your data strategy, so that you serve up actionable data instead of a recipe for disaster. In the following sections, we'll dive into the common data strategy nightmares.
How To Measure The Results Of In-App Events When Onelinks Don’t Work
Many app developers and marketing managers face the challenge of accurately measuring the impact of In-App Events (IAEs) on the App Store. While IAEs have proven effective for re-engaging users, attracting new downloads, and increasing revenue, traditional tracking methods like OneLink don’t actually include IAEs. Major mobile attribution platforms confirm that there is currently no way to track IAEs properly. At Social Discovery Group, our portfolio of 60+ dating and entertainment brands is supported by a team of over 100 marketers dedicated to app growth and development. We’re used to measuring all our marketing efforts in terms of financial value. We eventually developed our own composite method for evaluating IAEs, and we share it here.
When and When Not to Use Apache Kafka as a Database
This story was written by: @aahil.
Apache Kafka, while not a traditional database, has database-like properties such as data retention and querying capabilities. This article explores when Kafka can be used for database-like purposes and when it is best suited as a streaming platform.
Random Forest Regression in R: Code and Interpretation
This story was written by: @nikolao.
Random forest is one of the most popular algorithms for multiple machine learning tasks. This story looks into random forest regression in R, focusing on understanding the output and variable importance. The package with the original implementation is called randomForest.
9 Best Data Engineering Courses You Should Take in 2023
This story was written by: @balapriya.
Recently, data engineering has become an increasingly coveted space. With an average salary of over 112K USD, the demand for skilled data engineers is growing with every passing day. Data engineers combine their data and software engineering expertise to facilitate the data infrastructure of an organization.
Are you an aspiring data engineer, or someone with experience in the data space looking to pivot into data engineering?
In this list, you'll find some of the best data engineering courses and career paths that can help you jumpstart your data engineering journey!
A Beginner's Guide to Understanding Unstructured Data Analysis with LangChain and DeepInfra
LangChain and DeepInfra are powerful tools for unstructured data analysis. We'll explore their capabilities, understand the importance of data-driven decisions, and learn how to extract valuable insights. Get ready to uncover hidden patterns and make informed choices using these powerful tools.
How To Plot A Decision Boundary For Machine Learning Algorithms in Python
This story was written by: @kvssetty.
A popular diagnostic for understanding the decisions made by a classification algorithm is the decision surface. This is a plot that shows how a trained machine learning algorithm predicts on a coarse grid across the input feature space. A decision surface plot is a powerful tool for understanding how a given model ‘sees’ the prediction task and how it has decided to divide up the feature space by class label. The complete source code is available at my git repository.
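The grid idea behind a decision surface can be sketched in a few lines of plain Python. This is only an illustration, not the article's code: the nearest-centroid classifier, the two centroids, and the function names are assumptions made up for the example, and the surface is printed as text rather than plotted.

```python
def nearest_centroid_predict(point, centroids):
    """Return the label of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(point, centroids[label])),
    )


def decision_surface(predict, x_range, y_range, steps):
    """Evaluate `predict` on a steps-by-steps grid; the first row holds the highest y value."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    rows = []
    for j in range(steps):
        y = y_max - j * (y_max - y_min) / (steps - 1)
        row = []
        for i in range(steps):
            x = x_min + i * (x_max - x_min) / (steps - 1)
            row.append(predict((x, y)))
        rows.append(row)
    return rows


# Hypothetical "trained model": two class centroids in a 2-D feature space.
centroids = {"A": (0.0, 0.0), "B": (4.0, 4.0)}
surface = decision_surface(
    lambda p: nearest_centroid_predict(p, centroids), (0, 4), (0, 4), 5
)
for row in surface:
    print(" ".join(row))
```

A plotting library would typically color each grid cell by its predicted label; the coarse text grid already shows where the model switches from one class to the other.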
Demystifying Dimensional Modelling: Unveiling the What, Why, and Who's
This story was written by: @disa.
Dimensional modelling is a database design philosophy. It is the most widely used style of relational database. It has all the basic ingredients of a relational database, i.e. primary keys, foreign keys, and multiple tables. It differs from a 3NF relational database mainly in its ease of understanding and its superior query performance.
As a data engineer, your job involves handling lots of information (we call it data). You need to think about where all this information is coming from, what it looks like, and how it might need to be changed or fixed up. You also need to think about where it's going and what questions it can help answer.
From Crashing to Lift-Off: How to Thrive as the First Data Scientist in a Startup
This story was written by: @breus.
This piece utilizes the game Factorio as a metaphor for a data scientist's progression in a startup, spanning four stages: Manual/Foundation, Initial Automation, Scale, and Flight. Each stage represents different facets of the journey - from scrappy, hands-on work, automating routine tasks, scaling for growth, to evolving in response to changing landscapes.
Data-driven Marketing: Unleashing the Power of Big Data for Targeted Campaigns
Big data provides marketers with a treasure trove of information. By tapping into this wealth of data, businesses can better understand their customers, make informed decisions, and develop targeted campaigns. The benefits of data-driven marketing are far-reaching, from increased customer engagement and loyalty to improved conversion rates.
This story was written by: @mrogati.
3 Best Ways To Import JSON To Google Sheets [Ultimate Guide]
This story was written by: @liorb.
Despite having more information than ever, making informed decisions seems increasingly challenging. This guide is designed to help you transform data from a source of frustration into a powerful tool for driving business growth. From my own experience, I've seen professionals dedicating up to 50% of their workweek to validating data.
Exploring Obyte Use Cases: Programmable Payments, Chatbots, and Beyond - Part I
This story was written by: @obyte.
Obyte is an open-source distributed ledger (DAG) system. DAGs can be used to pay for goods and services without using banks or middlemen. Obyte has many features and use cases to explore.
A Professional Sports Gambler Used Analytics to Turn a $700,000 Loan Into More Than $300 Million
1) Let's start with some history...
Matthew Benham graduated from the world-renowned University of Oxford in 1989 with a degree in Physics.
He spent the next 12 years working in finance, eventually being named a VP at Bank of America.
But in 2001, he decided to change careers.
Big Tech Companies Have Your Health Info Thanks to Telehealth Startups
This story was written by: @TheMarkup.
A joint investigation by STAT and The Markup found that 50 direct-to-consumer telehealth companies were leaking sensitive medical information to the world’s largest advertising platforms. Trackers on 25 sites, including those run by industry leaders Hims & Hers, Ro, and Thirty Madison, told at least one big tech platform that the user had added an item like a prescription medication to their cart, or checked out with a subscription for a treatment plan.
Data has become an essential resource for businesses, driving decision-making and innovation. As the volume of data continues to grow, ensuring data quality and compliance is more important than ever. One way to achieve better data governance is through data lineage, which tracks the flow of data throughout an organization. This article will discuss how data lineage can help in user data governance and explore how serverless technology can be incorporated.
Solving Time Series Forecasting Problems: Principles and Techniques
This story was written by: @teenl0ve.
This article delves into time series analysis, discussing its significance in decision-making processes. It elucidates various techniques such as cross-validation, decomposition, and transformation of time series, as well as feature engineering. It provides a deep understanding of different modeling approaches, including but not limited to, Exponential Smoothing, ARIMA, Prophet, Gradient Boosting, Recurrent Neural Networks (RNNs), N-BEATS, and Temporal Fusion Transformers (TFT). Despite the wide range of techniques covered, the article emphasizes the need for experimentation to choose the method that yields the best performance given the data characteristics and problem specifics.
This story was written by: @nfrankel.
In the previous post, I proposed a sample architecture where location-based routing happened at two different stages. In this post, we'll see how we can implement routing at the two levels. We'll use Apache ShardingSphere as an indirect layer between the application and the data sources.
Tales of the Undead Salmon: Exploring Bonferroni Correction in Multiple Hypothesis Testing
This article explains the problem of testing multiple hypotheses without proper adjustments. It introduces the Bonferroni correction as a solution to control false positive results. Simulation demonstrates the effectiveness of the correction. Understanding and applying corrections in multiple hypothesis testing is essential for accurate data analysis and decision-making.
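A small simulation along these lines can be sketched as follows. It is an illustration under stated assumptions, not the article's own simulation: every null hypothesis is taken to be true, so each p-value is drawn uniformly from [0, 1], and the function names are made up for the example.

```python
import random


def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 for each test whose p-value is at most alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def family_wise_error_rate(n_experiments, n_tests, alpha=0.05, correct=True, seed=0):
    """Fraction of simulated experiments with at least one false positive.

    Assumes all null hypotheses are true, so p-values are uniform on [0, 1].
    """
    rng = random.Random(seed)
    runs_with_false_positive = 0
    for _ in range(n_experiments):
        p_values = [rng.random() for _ in range(n_tests)]
        if correct:
            rejected = bonferroni_reject(p_values, alpha)
        else:
            rejected = [p <= alpha for p in p_values]
        if any(rejected):
            runs_with_false_positive += 1
    return runs_with_false_positive / n_experiments


print("uncorrected:", family_wise_error_rate(2000, 20, correct=False))
print("bonferroni: ", family_wise_error_rate(2000, 20, correct=True))
```

With 20 independent tests at alpha = 0.05, the chance of at least one false positive without correction is 1 - 0.95^20, roughly 0.64 in expectation, while the Bonferroni threshold keeps it at or below about 0.05.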
This story was written by: @epappas.
As large language models (LLMs) like GPT-4 emerge, managing high-dimensional data structures becomes increasingly important. LangChain, an LLM-powered application development framework, integrates with DataOps and VectorOps processes and utilizes vector databases to create data-aware, interactive applications.
Study: PR Professionals Struggle with Data Literacy, Impeding Communication of Value to Tech C-Suite
This story was written by: @sarahevans.
Half of PR pros said they have presented a metric they didn't understand. Here's what reporting is needed to support the C-suite and show PR value.
Seamlessly Migrate Your On-Premise Data Pipeline to Azure with These Key Steps
This guide details the process of migrating an on-premise Cloudera data system to Azure, covering key considerations, challenges, and best practices to ensure a smooth and secure transition.
5 Skills Every Successful MLOps Engineer Should Have
This story was written by: @huwfulcher.
MLOps engineering is a rapidly growing field, thanks to the increasing importance of deploying and maintaining machine learning models in today’s business landscape. If you’re looking to excel as an MLOps Engineer, there are certain skills that will set you apart from the competition. In this article, we’ll explore five key skills that every successful MLOps Engineer should have.
7 Strategies to Reduce Training Data Acquisition Cost
Acquiring high-quality training datasets can be expensive, but there are various strategies you can use to minimize the cost. Start by defining your project requirements and target audience, then consider using existing datasets or outsourcing to a data collection service. You can also leverage crowd-sourcing platforms, data partnerships, and data augmentation techniques to reduce the cost of data collection. By following these strategies, you can acquire the data you need without breaking the bank and optimize your machine-learning models for success.
Foursquare Enters the Future With a Geospatial Knowledge Graph
This story was written by: @linked_do.
Foursquare Graph is the company’s first application of graph technology to geospatial data. The company has 9 billion-plus visits monthly from 500 million unique devices. Its data is used to power the likes of Apple, Uber and Coca-Cola. We caught up with FSQ Distinguished Engineer Vikram Gundeti to learn more about what kind of data the company deals with.
This story was written by: @davisdavid.
Data science is an ever-evolving field, with new technologies and techniques being developed all the time. As we started the journey of 2023, it’s important for data scientists to stay on top of the latest trends and advancements in order to remain competitive in the job market. In this article, we will explore why it's essential to become a better data scientist in 2023 and provide some tips.
A/B Testing was a Jerk, Until we Found the Replacement for Druid
This story was written by: @HarisDou.
The recipe for successful A/B testing is quick computation, no duplication, and no data loss. For that, we used Apache Flink and Apache Doris to build our data platform.
How High-Quality Datasets Can Revolutionize Business Outcomes with Machine Learning
In machine learning, the quality of the dataset is just as important as the complexity of the model. Without high-quality data, even the most advanced algorithms and models will not be able to deliver accurate results. In this article, we will explore the correlation between datasets and models, and how the accuracy of a model can impact business outcomes.
Discover how product managers can bridge the gap between intuition and data to optimize product improvement. This guide explores the importance of data-driven decision-making, offering best practices and real-world examples from companies like NuBank, Monzo, Deliveroo, and Booking.com. Learn how to acquire insights from customer feedback, track performance metrics, monitor market trends, and refine product roadmaps through iterative experimentation. Become a data-driven PM and create products that users will love.
Leveraging Data Granularity, Distribution, and Modeling for Effective Product Management
Granularity determines the level of detail available in the data, which directly impacts what you can observe and analyze. For instance, finer granularity provides more detailed insights but may require more sophisticated handling and processing techniques.
Distribution helps identify the patterns and spread of data, which is critical for selecting the appropriate analysis techniques and ensuring the accuracy of predictive models.
Data Modeling uses the insights gained from understanding granularity and distribution to build predictive or descriptive models that inform decision-making and strategy.
How Vectors, RAG and Llama 3 Are Changing First-Party Data
The push for first-party data generally goes that companies need to become better stewards of data acquisition and management. Consumers increasingly want to know who is hanging onto their personal information, how they got it, why they have it, and what is being done with it. The push to take back control of data seems essential, but is it practical?
16 Best Sklearn Datasets for Building Machine Learning Models
This story was written by: @datasets.
Sklearn is a Python module for machine learning built on top of SciPy. It is unique due to its wide range of algorithms and ease of use. Data powers machine learning algorithms and scikit-learn. Sklearn offers high quality datasets that are widely used by researchers, practitioners and enthusiasts.
Enhancing Audit Processes With Advanced Analytical Tools
Developers can leverage advanced analytics tools to streamline and improve software, compliance and internal controls auditing. Advanced analytics tools like artificial intelligence, complex event processing and data mining enable 100% population testing. They eliminate the need for sampling, thereby reducing bias and error risks. Autonomous technologies like AI are particularly beneficial since they eliminate human error.
Go Clean to Be Lean: Data Optimization for Improved Business Efficiency
This article discusses cost optimization with clean data. It explains how businesses can save resources by decreasing the load for data analysts, among other opportunities. It also discusses the differences between raw and clean data and who can benefit from switching to the latter. You'll also find 4 ways in which clean data reduces time to value.
How AI-Powered Data Mapping is Democratizing Data Management
AI is revolutionizing data mapping by automating and simplifying the process, making data management more efficient and accessible for businesses and non-technical users alike.
Efficient Data Management and Workflow Orchestration with Apache Doris Job Scheduler
This story was written by: @frankzzz.
The built-in Doris Job Scheduler triggers pre-defined operations efficiently and reliably. It is useful in many cases including ETL and data lake analytics.
Scaling Ethereum: Data Bloat, Data Availability, and the Cloudless Solution
This story was written by: @logos.
Codex is a cloudless, trustless, p2p storage protocol seeking to offer strong data persistence and durability guarantees for the Ethereum ecosystem and beyond. Due to the rapid development and implementation of new protocols, the Ethereum blockchain has become bloated with data. This data bloat can also be defined as “network congestion,” where transaction data clogs the network and undermines scalability. Codex offers a solution to the data availability (DA) problem that also provides data persistence.
This story was written by: @smileek.
Backend developers can help frontend developers work with their API more efficiently and ship the product with as little friction as possible. Here are a few simple things that can decrease your time-to-market or improve other fancy metrics your managers want you to improve. I will tell it from the web developers’ point of view, but from what I remember, the same works for mobile development.
How to Build an AI Chatbot with Python and Gemini API
This story was written by: @proflead.
This guide walks you through building a web-based AI chatbot using Python and the Gemini API. From setting up your environment to running your chatbot, you'll learn each step to create your own AI assistant.
DNS servers play a crucial role in translating human-friendly domain names into IP addresses that computers use to identify each other on the network. Setting up your own local DNS server can be beneficial for various reasons, including local development, internal network management, and educational purposes. We’ll create a simple HTTP server using Python’s built-in `http.server` module to serve the HTML files.
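As a rough sketch of the last step, serving a folder of HTML files with the standard library might look like the following. The helper name, the default port, and the `site` directory in the usage comment are assumptions for illustration, not details from the episode.

```python
import functools
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_static_server(directory, host="127.0.0.1", port=8000):
    """Build (but do not start) an HTTP server that serves files from `directory`."""
    # SimpleHTTPRequestHandler accepts a `directory` keyword since Python 3.7.
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    return ThreadingHTTPServer((host, port), handler)


# Usage (blocks until interrupted; "site" is a hypothetical folder of HTML files):
#   make_static_server("site").serve_forever()
```

Binding to 127.0.0.1 keeps the server reachable only from the local machine, which suits local development; passing `port=0` lets the operating system pick a free port.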
The Collective Loves Data: How Big Data Is Shaping and Predicting Our Future
This story was written by: @manoj123.
Big data surrounds us! From social media posts to sensor readings, vast amounts of information shape our world. This article by a Google engineer dives into what big data is (think massive, varied, and ever-growing data sets) and how it's analyzed to predict trends and make smarter decisions. Learn about real-world applications and exciting future possibilities like AI and quantum computing.
Apache Doris for Log and Time Series Data Analysis in NetEase: Why Not Elasticsearch and InfluxDB?
This story was written by: @frankzzz.
NetEase has replaced Elasticsearch and InfluxDB with Apache Doris in its monitoring and time series data analysis platforms, respectively, achieving 11X query performance and saving 70% of resources.
Unlocking the Power of Data Lakes for Embedded Analytics in Multi-Tenant SaaS
This story was written by: @goqrvey.
Analytics should extract maximum insight, right? Well, to do that, you’ll need complete access to all relevant data. A data lake is a central storage for all kinds of data in its original, unstructured form. Data lakes are generally more cost-effective than data warehouses for embedded analytics use cases.
The LinkedIn Nanotargeting Experiment that Broke All the Rules
A study demonstrates the feasibility of nanotargeting on LinkedIn, bypassing audience size restrictions and achieving successful campaigns by employing JavaScript code to reactivate campaign launch buttons, employing various targeting strategies, and verifying success through campaign metrics and user interaction.
Data Science Interview Question: Creating ROC & Precision Recall Curves From Scratch
This is a popular data science interview question that asks you to create the ROC and similar curves from scratch. For the purposes of this story, I will assume that readers know the meaning of these metrics, the calculations behind them, and how they are interpreted. We start by importing the necessary libraries (we import math as well because that module is used in the calculations).
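One possible from-scratch version of the ROC part is sketched below. It is not the story's own code: the function names and the toy labels and scores are assumptions for illustration, and it needs no imports at all. It assumes binary 0/1 labels with at least one positive and one negative example.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by sweeping a threshold over the distinct scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear curve, computed with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data: 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6]
curve = roc_points(y_true, scores)
print(curve)
print("AUC:", auc(curve))
```

A precision-recall curve follows the same threshold sweep, recording precision (tp / (tp + fp)) and recall (tp / positives) at each step instead of FPR and TPR.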
Data Engineering: What’s the Value of API Security in the Generative AI Era?
API security is crucial in the era of Generative AI, ensuring data integrity, protecting user privacy, and enabling secure and efficient AI integration. Robust API protection helps prevent unauthorized access, data breaches, and potential misuse of AI capabilities.