Fraud prevention best practices when crowdsourcing AI data
Artificial intelligence (AI) model builders are increasingly turning to crowdsourcing to fulfill training requirements for their models. Crowdsourcing AI data enables experts with varied professional backgrounds, locales and experiences to collaborate to enhance model performance and better align models with human needs. By engaging a community through on-demand participation, crowdsourcing offers unprecedented opportunities for collaborative problem-solving and large-scale contributions to shared resources.
While crowdsourcing has the potential to solve a variety of AI training data challenges, it is important to be aware of the risks. Crowd workers' anonymity makes it difficult to verify identities and qualifications, which can raise concerns about data quality and, ultimately, model accuracy.
Ensuring the authenticity and reliability of crowdsourced workers can be challenging due to a range of deceptive practices. From selling accounts to misrepresenting identities, there are bad actors who attempt to exploit the open and decentralized nature of crowdsourcing platforms in various ways, compromising data integrity and skewing the reliability of AI models.
The implications are profound. Compromised data doesn't just lead to inaccurate models; it undermines the very trust that is essential for the widespread adoption of AI technologies.
Read on to explore the potential security challenges when crowdsourcing AI training data, including key vulnerabilities to watch for, and defense strategies that organizations can adopt to better protect their data from evolving threats.
Three common security threats that can arise when crowdsourcing AI trainers
Understanding how individuals with ill intent can infiltrate your system and contaminate your training data is the essential first step in preventing harmful noise from entering the data collection process. Three common security threats to be aware of when crowdsourcing AI training data are:
1. Account misuse and unauthorized transfers
Misuse of accounts commonly works in two ways. First, unethical actors may exploit disenfranchised and impoverished individuals by hiring them to perform tasks in exchange for a fraction of the payment. Second, accounts are sold or transferred to unqualified contributors, compromising data authenticity and quality.
2. Multiple accounts and proxy contributions
To maximize their returns, some contributors manage several accounts themselves or answer on behalf of others. In other cases, multiple people share a single account. Both practices distort data uniqueness and diversity and can significantly skew model outcomes when gathering important training data through contributors.
3. Identity misrepresentation
Identity abuse, or impersonation, involves falsifying one's identity, credentials or location. It can mislead the data collection process, skewing the data and undermining its reliability for model training.
The impact of training AI on unregulated crowdsourced data
Unchecked crowdsourcing creates several compounding risks to AI safety and performance, including opening the door to deliberate data poisoning attacks and the inadvertent inclusion of low-quality contributions from inexpert sources.
Inexpert data contributions
Annotations (or contributions) from domain experts are highly valued in AI training due to their effectiveness and accuracy. It is also operationally efficient to train AI using annotations from a relatively small pool of qualified experts, rather than running large-scale projects with many generalists. For tasks that require specific expertise, a supervised machine learning model trained with expert annotations can outperform one trained with inexpert annotations. When that expertise is misrepresented, an unqualified contributor can distort and corrupt the training dataset, degrading the model training process.
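To make this concrete, the following is a minimal simulation, with purely illustrative accuracy rates, comparing majority-vote label quality from a small pool of accurate experts against a larger pool of less accurate generalists:

```python
import random

random.seed(0)

def simulate_label_quality(num_items: int, num_annotators: int, accuracy: float) -> float:
    """Simulate majority-vote aggregation of binary labels.

    Each annotator reports the true label with probability `accuracy`;
    returns the fraction of items whose aggregated label is correct.
    """
    correct = 0
    for _ in range(num_items):
        true_label = random.choice([0, 1])
        votes = [
            true_label if random.random() < accuracy else 1 - true_label
            for _ in range(num_annotators)
        ]
        majority = 1 if sum(votes) > num_annotators / 2 else 0
        correct += majority == true_label
    return correct / num_items

# Assumed accuracy rates, purely for illustration:
# a small expert pool vs. a larger inexpert pool.
print("3 experts (95% accurate):    ", simulate_label_quality(10_000, 3, 0.95))
print("7 generalists (60% accurate):", simulate_label_quality(10_000, 7, 0.60))
```

Even with fewer annotators, the higher per-annotator accuracy of the expert pool yields far more reliable aggregated labels.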
Further, in terms of responsible AI, data curation by non-vetted contributors can result in poor-quality and biased data, which can mislead learning algorithms and cause AI models to behave inconsistently or unfairly. This can result in discriminatory outputs that negatively impact certain communities. It can also result in damage to brand reputation, loss of sensitive intellectual property, disruption to customers and financial consequences for your business.
Data poisoning attacks
The openness that makes crowdsourcing so flexible also creates opportunities for adversaries to launch data poisoning attacks, in which malicious contributors introduce carefully crafted data to sabotage datasets and corrupt aggregated results. These malicious workers can blend in with legitimate contributors, submitting data that subtly deviates from accurate labels in order to influence aggregate values and exploit AI models.
The objective of these adversaries is to intentionally increase estimation errors in targeted data points, which can have serious real-world implications. For example, data poisoning can corrupt a classifier so that it produces inaccurate predictions, such as a website classifier labeling a gambling site as a children's entertainment site.
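As a minimal illustration, with hypothetical numbers, the sketch below shows how a few colluding contributors can shift an aggregated estimate, and how a robust aggregator such as the median blunts the same attack:

```python
import statistics

# Honest contributors estimate a quantity near its true value of 10.0.
honest = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]

# A small group of colluding workers submits subtly biased values
# (hypothetical numbers, for illustration) to drag the aggregate upward.
poisoned = honest + [13.0, 13.5, 12.8]

print("mean before attack:  ", round(statistics.mean(honest), 2))    # 10.0
print("mean after attack:   ", round(statistics.mean(poisoned), 2))  # 10.85
# Robust aggregators are far less sensitive to a small poisoned minority.
print("median after attack: ", round(statistics.median(poisoned), 2))  # 10.1
```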
A security framework to help you defend against threats and inaccuracies
Common strategies for managing crowdsourced data — such as qualification tests, demographic filters, incentives, gold standards and advanced worker models — may fall short when dealing with sophisticated adversaries, who can mimic legitimate users, making them harder to detect.
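As an example of why, consider a basic gold-standard check. The sketch below, with illustrative field names and an assumed 80% threshold, flags workers whose accuracy on items with known labels falls short; an adversary who answers the gold items correctly while corrupting the rest of a batch passes it undetected:

```python
# Gold-standard screening: score each worker's answers on items with
# known labels, and flag workers below an accuracy threshold.
# Item names and the 0.8 threshold are illustrative assumptions.

GOLD_LABELS = {"item_1": "cat", "item_2": "dog", "item_3": "cat"}
THRESHOLD = 0.8

def gold_accuracy(answers: dict[str, str]) -> float:
    scored = [answers[i] == label for i, label in GOLD_LABELS.items() if i in answers]
    return sum(scored) / len(scored) if scored else 0.0

worker_answers = {"item_1": "cat", "item_2": "cat", "item_3": "cat"}
if gold_accuracy(worker_answers) < THRESHOLD:
    print("flag worker for review")  # 2/3 correct -> flagged

# Limitation: a sophisticated adversary who answers the gold items
# correctly while poisoning non-gold items passes this check.
```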
To address these challenges effectively, organizations require a comprehensive security framework that goes beyond traditional measures, integrating advanced threat detection and proactive risk management to safeguard data integrity and authenticity.
Enforce a zero-trust security policy
Zero trust operates on the principle of never granting trust without verification, rather than assuming implicit trust for all users. This granular security approach helps address the security risks posed by crowdsourced workers, and is built on three core principles:
1. Least privilege access
Granting users and devices least-privilege access to resources ensures they have only the minimum permissions necessary to complete their tasks or roles, with these permissions revoked once the session ends. Organizations can further enhance security by implementing dynamic access control policies that assess each access request based on factors like user privileges, physical location and unusual behavior. This approach minimizes the risk of unauthorized access and helps maintain a secure environment.
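A minimal sketch of such a dynamic access-control check might look like the following; the roles, permissions and risk signals here are hypothetical, and real policies would be far richer:

```python
from dataclasses import dataclass

# Hypothetical role-to-permission mapping enforcing least privilege.
ROLE_PERMISSIONS = {
    "annotator": {"read_task", "submit_annotation"},
    "reviewer": {"read_task", "read_annotations", "approve_annotation"},
}

@dataclass
class AccessRequest:
    role: str
    permission: str
    country: str            # location of the current request
    expected_country: str   # location on file for the worker
    anomaly_score: float    # from behavioral monitoring, 0..1

def allow(req: AccessRequest) -> bool:
    # Least privilege: only permissions mapped to the role are granted.
    if req.permission not in ROLE_PERMISSIONS.get(req.role, set()):
        return False
    # Dynamic signals: deny on location mismatch or unusual behavior.
    if req.country != req.expected_country or req.anomaly_score > 0.7:
        return False
    return True

print(allow(AccessRequest("annotator", "submit_annotation", "CA", "CA", 0.1)))   # True
print(allow(AccessRequest("annotator", "approve_annotation", "CA", "CA", 0.1)))  # False
```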
2. Proactive threat neutralization
Security teams should maintain continuous, real-time monitoring of data patterns to detect anomalies or suspicious behaviors and respond swiftly to minimize the risk of fraud. For example, setting up real-time alerts for internet protocol (IP) discrepancies and deploying automated responses, such as temporarily suspending user accounts, enables teams to quickly address and neutralize potential threats.
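The sketch below illustrates one such automated response, assuming a hypothetical login-event feed that includes the country resolved from each IP address:

```python
# A minimal sketch of automated response to IP discrepancies. The event
# format and the suspension action are hypothetical placeholders.

last_seen_country: dict[str, str] = {}

def suspend_account(worker_id: str) -> None:
    print(f"ALERT: suspending {worker_id} pending re-verification")

def handle_login_event(worker_id: str, ip_country: str) -> None:
    previous = last_seen_country.get(worker_id)
    if previous is not None and previous != ip_country:
        # Country changed between sessions: raise an alert and
        # automatically suspend until the worker re-verifies.
        suspend_account(worker_id)
    last_seen_country[worker_id] = ip_country

handle_login_event("worker_42", "PH")
handle_login_event("worker_42", "PH")  # no change, no action
handle_login_event("worker_42", "RO")  # discrepancy -> suspended
```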
3. Continuous monitoring and validation
Crowdsourced workers should be continuously monitored and periodically re-authenticated. Monitoring workers in production interactions can help detect potential signals of account sharing, location-based misrepresentation, bot use, policy violations and more.
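One such signal is "impossible travel": two sessions on the same account from different countries within a short window. A minimal sketch, with illustrative timestamps and window size, might look like this:

```python
from datetime import datetime, timedelta

# Flag account-sharing when consecutive sessions come from different
# countries too close together. The two-hour window is an assumption.
WINDOW = timedelta(hours=2)

def sharing_suspected(sessions: list[tuple[datetime, str]]) -> bool:
    sessions = sorted(sessions)  # order by timestamp
    for (t1, c1), (t2, c2) in zip(sessions, sessions[1:]):
        if c1 != c2 and (t2 - t1) < WINDOW:
            return True
    return False

sessions = [
    (datetime(2024, 5, 1, 9, 0), "IN"),
    (datetime(2024, 5, 1, 9, 45), "US"),  # 45 minutes later, different country
]
print(sharing_suspected(sessions))  # True -> trigger re-authentication
```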
Implement a layered security framework
Defense in depth is a security strategy that employs multiple layers of protection to safeguard an organization's assets, aiming to contain vulnerabilities and mitigate threats even if one defense measure fails. By implementing multiple complementary security solutions, each addressing specific concerns, organizations reinforce their overall security posture, building resilience against potential breaches.
A well-designed defense-in-depth model typically maps logical security layers, such as ID verification, to established security practices, like biometric authentication, creating a series of protective measures that span the entire data pipeline. These workflows help define necessary actions based on assessed risk levels, ensuring a proactive, comprehensive approach to security. For example, sudden changes in IP location in an account may suggest multiple users sharing the same account. By implementing an automated workflow to monitor and analyze IP logs throughout the data pipeline, organizations can promptly detect and respond to suspicious activity, including terminating accounts involved in fraudulent behavior.
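A simplified sketch of such a layered workflow appears below; the layer checks, weights and thresholds are illustrative assumptions, not a prescribed configuration:

```python
# Defense in depth as code: each layer contributes a risk score, and the
# combined score maps to an action. Layers, weights and thresholds are
# hypothetical, chosen only to illustrate the pattern.

def id_verification_risk(worker: dict) -> float:
    return 0.0 if worker["id_verified"] else 0.5

def ip_log_risk(worker: dict) -> float:
    # e.g., number of distinct countries observed in recent IP logs
    return 0.3 * max(0, len(set(worker["ip_countries"])) - 1)

def task_accuracy_risk(worker: dict) -> float:
    return 0.4 if worker["gold_accuracy"] < 0.8 else 0.0

LAYERS = [id_verification_risk, ip_log_risk, task_accuracy_risk]

def assess(worker: dict) -> str:
    score = sum(layer(worker) for layer in LAYERS)
    if score >= 0.6:
        return "terminate account"
    if score >= 0.3:
        return "suspend and re-verify"
    return "allow"

worker = {"id_verified": True, "ip_countries": ["US", "RO"], "gold_accuracy": 0.95}
print(assess(worker))  # two countries in the IP logs -> "suspend and re-verify"
```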
Secure your data pipeline from threats and misrepresentation with the right partner
When crowdsourcing AI data collection, choose a provider with proven safeguards to counter unethical actors. Ensure they verify contributors at every stage — from sourcing and onboarding to project completion — using robust methods to uphold data quality and integrity.
For instance, at TELUS Digital, we implement a multi-stage verification process that spans the entire data pipeline. During sourcing, contributors undergo rigorous ID verification, anti-money laundering checks and facial recognition, including live video "selfies," to prevent fraud and flag high-risk individuals. In production, IP monitoring and re-verification measures help detect location shifts and secure accounts.
Alongside verification and monitoring, your service provider should implement response mechanisms to promptly address any potential threats that may arise. At TELUS Digital, our system continuously monitors expert identity, location and task accuracy, with real-time event tracking and automated responses to ensure data quality and integrity.
Further, our security architecture has built-in feedback loops that our sourcing team uses to refine data points and prevent future fraudulent activity. This proactive strategy improves the quality of the data collected and reduces the costs associated with fraud recovery.
Crowdsourcing is a powerful tool for AI development, but its full potential can only be realized with the proper security frameworks in place. By establishing the right checks and balances — including expert oversight, advanced data verification systems and structured contribution frameworks — organizations can harness the collective intelligence of the crowd while mitigating the risks. Working with an experienced partner like TELUS Digital can be invaluable when navigating these challenges. Contact us to learn how our secure crowdsourcing efforts can enhance your AI training while maintaining data integrity and model reliability.