We introduce a massive dataset on #ESG transparency and performance and a new machine learning framework for extraction in our recently released paper "Assessing Corporate Sustainability with Large Language Models: Evidence from Europe" with Kerstin Forster, Lucas Keil, Victor Wagner, Thorsten Sellhorn, and Stefan Feuerriegel.

We apply a large-scale machine learning framework to extract 2.9 million quantitative ESG indicators from the annual and sustainability reports of the 600 largest listed European firms over 2014–2023. Our open-source framework enables systematic, indicator-level tracking of both ESG transparency and performance in line with ESRS standards.

🔍 Key insights:
– Firms with top ESG ratings disclose 22% more indicators than bottom-rated peers, but this gap is narrowing
– Scope 1 and 2 emissions dropped sharply, while Scope 3 emissions increased 5.6x, likely due to improved transparency around value chain emissions
– Gender equality indicators show progress; other social indicators stagnate
– Our open-source dataset and ML pipeline democratize access to ESG data

👉 Read the paper: https://lnkd.in/ezF6Y-dn
📊 Explore the data: https://lnkd.in/eXxcwznT
💾 Download the data: https://osf.io/q2jpv/
🛠 Code & framework: https://lnkd.in/eEi67_Dg

We hope this helps researchers, policymakers, and financial actors monitor and drive progress toward sustainability goals. A key insight from our work is that transparency and performance must be analyzed together, as increasing transparency can reveal previously hidden poor ESG performance.

While we validate our framework and find high agreement overall, accuracy varies across indicators. The method is likely better suited to analyzing trends across larger samples than to high-stakes decisions at the individual firm level, where manual validation or tagged data remain important to ensure precision.

Sustainability Reporting Navigator TRR 266 Accounting for Transparency

#ESG #AI #SustainabilityReporting #MachineLearning #CSRD #Transparency #OpenScience #LLM
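For anyone curious what indicator-level extraction can look like in code, here is a minimal sketch. This is not the paper's actual pipeline: it just sends one report excerpt and one indicator definition to an LLM and asks for a JSON answer. The `openai` client usage is standard; the model name, indicator, and excerpt are placeholders.

```python
# Minimal illustration of LLM-based ESG indicator extraction. NOT the authors'
# pipeline - just the general idea, with made-up inputs and a placeholder model.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

INDICATOR = "Scope 1 greenhouse gas emissions in tCO2e (ESRS E1)"
EXCERPT = (
    "In 2023 our direct (Scope 1) emissions amounted to 412,000 tonnes CO2e, "
    "down from 455,000 tonnes in 2022."
)

prompt = (
    f"Indicator: {INDICATOR}\n"
    f"Report text: {EXCERPT}\n"
    "Return JSON with keys 'reported' (true/false), 'value', 'unit', 'year'. "
    "If the indicator is not reported in the text, set 'reported' to false."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
)

print(json.loads(response.choices[0].message.content))
# e.g. {"reported": true, "value": 412000, "unit": "tCO2e", "year": 2023}
```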
AI Tools For Data Analysis
Explore top LinkedIn content from expert professionals.
-
Best LLM-based open-source tool for data visualization, and it's friendly to non-technical users.

CanvasXpress is a JavaScript library with built-in LLM and copilot features. This means users can chat with the LLM directly, with no code needed. It also works from visualizations embedded in a web page, in R, or in Python.

It's funny how I came across this tool first and only later realized it was built by someone I know: Isaac Neuhaus. I called Isaac, of course. The tool was originally built internally at the company he works for to analyze genomics and research data, which requires it to meet a high bar for reliability and accuracy.

➡️ Link: https://lnkd.in/gk5y_h7W

As an open-source tool, it's very powerful and worth exploring. Here are the features that stand out the most to me:

𝐀𝐮𝐭𝐨𝐦𝐚𝐭𝐢𝐜 𝐆𝐫𝐚𝐩𝐡 𝐋𝐢𝐧𝐤𝐢𝐧𝐠: Visualizations on the same page are automatically connected. Selecting data points in one graph highlights them in the other graphs. No extra code is needed.

𝐏𝐨𝐰𝐞𝐫𝐟𝐮𝐥 𝐓𝐨𝐨𝐥𝐬 𝐟𝐨𝐫 𝐂𝐮𝐬𝐭𝐨𝐦𝐢𝐳𝐚𝐭𝐢𝐨𝐧:
- Filtering data like in Spotfire.
- An interactive data table for exploring datasets.
- A detailed customizer designed for end users.

𝐀𝐝𝐯𝐚𝐧𝐜𝐞𝐝 𝐀𝐮𝐝𝐢𝐭 𝐓𝐫𝐚𝐢𝐥: Tracks every customization and keeps a detailed record. (This feature stands out compared to other open-source tools I've tried.)

➡️ Explore it here: https://lnkd.in/gk5y_h7W

Isaac's team has also published this tool in a peer-reviewed journal and is working on publishing its LLM capabilities.

#datascience #datavisualization #programming #datanalysis #opensource
-
Uber processes millions of invoices globally, in different formats, currencies, tax codes, and languages. Traditional rule-based OCR pipelines just don't scale for that level of variability.

Interesting to see how Uber solved this using a two-stage GenAI approach:
1. LLM-based field extraction: zero-shot parsing of key fields like vendor, total amount, and tax ID.
2. Post-processing logic: country-specific rules (e.g. GST validation for India).

The system improves itself through feedback. But this is where data labeling becomes critical. Without accurately labeled fields and validation, the model can hallucinate or misinterpret formats, especially for low-resource languages or unusual layouts.

Labeling ensures:
1. Feedback loop quality
2. Accuracy tracking by field
3. Reliable onboarding of new invoice types

It's a solid example of blending GenAI with traditional ML workflows and domain logic for real-world scale.

Worth a read 👇
https://lnkd.in/gFqMS9zW

#GenAI #DataScience #UberAI #DocumentUnderstanding #LLM #AIInOperations #DataLabeling #InvoiceAutomation
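To make the two-stage pattern concrete, here is a rough sketch of what it could look like. This illustrates the idea from the post, not Uber's implementation: stage 1 asks an LLM for key fields as JSON, stage 2 applies deterministic country-specific checks. The field names, model name, and the simplified GSTIN check are assumptions.

```python
# Sketch of a two-stage approach: LLM field extraction + rule-based validation.
# Not Uber's implementation; field names and rules are illustrative.
import json
import re
from openai import OpenAI

client = OpenAI()

def extract_fields(invoice_text: str) -> dict:
    """Stage 1: zero-shot field extraction with an LLM."""
    prompt = (
        "Extract vendor_name, total_amount, currency, country_code and tax_id "
        "from this invoice. Return JSON only.\n\n" + invoice_text
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

def validate(fields: dict) -> list[str]:
    """Stage 2: deterministic, country-specific post-processing rules."""
    issues = []
    if fields.get("country_code") == "IN":
        # Very simplified check: an Indian GSTIN is 15 alphanumeric characters.
        if not re.fullmatch(r"[0-9A-Z]{15}", str(fields.get("tax_id", ""))):
            issues.append("tax_id does not look like a valid GSTIN")
    if not isinstance(fields.get("total_amount"), (int, float)):
        issues.append("total_amount is not numeric")
    return issues

fields = extract_fields("Invoice from Acme Pvt Ltd, total INR 12400, GSTIN 27AAAAA0000A1Z5")
print(fields, validate(fields))
```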
-
Can large language models make Data Scientists more productive?

I've been experimenting lately with large language models (LLMs) for #datascience and wanted to share some thoughts on how they can support a Data Scientist's day-to-day work:

🚀 Literature Review - LLMs can quickly summarise, synthesise and extract key insights from lots of research papers. This helps Data Scientists stay on top of state-of-the-art algorithms, techniques and datasets.

🧠 Code Generation - LLMs can generate viable code to explore, clean, process and model data. This significantly speeds up what's usually a manual, trial-and-error process.

💡 Code Explanation - LLMs can automatically add comments explaining what each section of code is doing in plain language. This is invaluable when documenting code or understanding an inherited codebase!

🛠️ Code Refactoring - LLMs can inspect code to suggest improvements in structure, efficiency and style. This allows Data Scientists to improve and optimise their code.

⚙️ Task Automation - LLMs can automate repetitive coding tasks like data loading, cleaning and processing by turning them into functions and scripts. This frees up Data Scientists to focus on value-add activities.

📃 Report Generation - LLMs can generate data analysis reports, documentation and even README files. Say goodbye to mundane and time-consuming documentation tasks!

📊 Results Presentation - LLMs can create stories to convey results and insights to different audiences. LLMs can also provide an independent, critical opinion of a Data Scientist's content.

The key takeaway? LLMs have considerable potential to enable Data Scientists to be more productive, insightful and impactful. However, Data Scientists shouldn't blindly follow outputs from LLMs. Instead, Data Scientists should view LLMs as assistants that augment intelligence rather than replace it.

What are your thoughts on how LLMs can best support Data Scientists? Please let me know in the comments below.

#ai #llm #augmentedintelligence #productivity

Disclaimer: The opinions expressed in this post are my own and do not represent the views of my employer.
-
There were really two key innovations that ultimately enabled LLMs to become a great data analysis tool:

1) Code Interpreter - the ability for LLMs not only to write code, but also to run it in real time in the browser. ChatGPT enabled this with Python (using WebAssembly), and Claude did it with JavaScript.

2) Model Context Protocol - the ability for LLMs to connect to any data store via an open standard protocol, which means you can quickly find connectors to every major data source.

The combination of these two abilities is what is quickly making LLMs the fastest way to conduct data analysis.
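As a rough illustration of the MCP side, here is a tiny server that exposes a local SQLite database as a tool an LLM client can call. It assumes the official MCP Python SDK (`pip install mcp`) and its FastMCP helper; the database path and tool name are made up for the example.

```python
# Minimal MCP server sketch: expose a read-only SQL tool over a local SQLite file.
# Assumes the official MCP Python SDK; "analytics.db" is a placeholder path.
import sqlite3
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("local-analytics")

@mcp.tool()
def run_sql(query: str) -> list:
    """Run a read-only SQL query against the local analytics database."""
    conn = sqlite3.connect("file:analytics.db?mode=ro", uri=True)
    try:
        return conn.execute(query).fetchall()
    finally:
        conn.close()

if __name__ == "__main__":
    mcp.run()  # stdio transport by default; an MCP-capable client can now call run_sql
```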
-
I was incredibly impressed with the technical depth of Intrinsec's reporting: their Doppelgänger deep-dive was a masterclass in infrastructure analysis. But their newest work on UAC-0050 and UAC-0006 (codenames for nefarious actors) takes it even further. It doesn't just map tools; it shows how they're being operationalized in #hybridwarfare, blending malware, psychological ops, and #narrative influence under a single infrastructure model.

🧠 5 Technical Tactics with Real Examples:

Weaponized Spam Chains Delivering Custom Malware
- UAC-0050 deployed phishing emails masquerading as the Ukrainian logistics company "Nova Pochta", delivering LiteManager malware via password-protected Dropbox links.
- Recipients included government agencies in Canada, India, and Ukraine's energy sector, each targeted with multi-stage ZIP/RAR/VBS payload chains.

False-Flag #PSYOP and Bomb Threats
- Threat actors sent fake bomb threats to Ukrainian energy facilities, posing as disabled veterans and demanding action against Zelensky.
- The goal? Induce fear, create confusion, and force evacuations. Messages were laced with emotional appeals and Telegram links to arms dealers, reinforcing authenticity.

Spoofed Documents via Public Repositories
- Campaigns leveraged Bitbucket-hosted Remcos payloads disguised as "pre-trial legal claims". These files were encrypted, multi-layered, and password protected, ensuring evasion and believability.
- The linked C2 was hosted by Shinjiru Technology, a Malaysian hosting firm previously seen in Doppelgänger ops.

Bulletproof Hosting Shell Games
- C2s and redirect infrastructure were continuously rotated through ASNs like AS215540 (Global Connectivity Solutions) and AS214943 (Railnet LLC).
- These networks were registered via shell companies in Seychelles and linked to operators sanctioned for LockBit ransomware activity.

Cross-Campaign Infrastructure Reuse
- IPs and domains previously observed in QakBot ransomware campaigns were repurposed to deliver sLoad and NetSupport Manager payloads, showing tight overlap between #cybercrime and state-linked PSYOP.

💡 #DefenseTech Implications:
1. Detecting payloads isn't enough. We need visibility into autonomous system behavior, peering history, and hosting reputation.
2. Apply honeypot telemetry and BGP monitoring to uncover the behavioral signatures of bulletproof infrastructure.
3. Integrate public repo scanning (e.g., Bitbucket, Dropbox) into pipeline alerts; this is how adversaries are scaling malware delivery while staying under the radar.

The reality? These are no longer isolated malware campaigns. They are multi-modal, infrastructure-driven #influence ops with tailored payloads, false documents, and narrative design. We need to evolve beyond content scanning and defend the internet's plumbing.

#DefTech #VannevarLabs #FIMI #InfoOps
-
I have been learning about an emerging type of AI agent I'll call "Smart Document Agents" (SDAs).

It's exciting to think through how SDAs can boost efficiency by 5–10x by:
- converting unstructured documents (PDFs, faxes, images) into structured data
- embedding these "smart documents" into relevant high-value workflows
- communicating across multiple parties to get things done automatically or with humans in the loop

My friend Andrei Radulescu-Banu (founder of https://docrouter.ai/) and I recently discussed several compelling use cases; I know some of these are being worked on.

1. B2B Procurement: Hospitals, for example, order countless supplies for patient-specific procedures as well as ongoing clinical care. Meanwhile, underlying all this, they have thousands of unstructured PDF and paper contracts that need to be adhered to. Normally, someone manually extracts the details of what is to be ordered and when from the EHR/ERP, checks contract terms, creates the corresponding order, and inputs it into a supplier's workflow. An SDA can automatically parse the EHR (patient info, procedure date, item details) or the ERP to understand what is to be ordered and when, choose the right supplier, verify pricing and contract terms, and create and submit orders. This should reduce 80% of the manual work and errors on either side while speeding up the process.

2. Tax Prep Automation: While W-2s and 1099s are structured, other tax documents vary widely (charitable donation letters, client-prepared schedules, property tax payments, K-1s with income classified generically in box 11ZZ). SDAs could learn these formats over time, reduce the manual burden of tax prep, and significantly lower costs.

3. Pre- and Post-Anesthesia Screening: Medical history, medication lists, allergies, vital signs, and post-operative notes often reside in unstructured or semi-structured formats (scanned intake forms, typed or handwritten notes, PDF lab reports). SDAs can extract them to flag risk factors, populate checklists, and ensure compliance. Post-surgery, they can collect outcomes, trends, and potential complications for swift follow-up. This reduces errors, enhances patient safety, and expedites billing and auditing.

4. VC/PE/Consulting Firms: Analysts reviewing large volumes of 10-Ks and 10-Qs could use an SDA to extract key financial metrics, risk factors, and strategically relevant points, accelerating analysis and comparison across companies and time periods.

5. Clinical Trials: A lab invoice might detail services, dates, and amounts to be billed to a trial. An SDA can verify charges against contract terms, flag discrepancies, and submit a verified invoice requiring much lower touch.

6. Shipping Logistics: Shipping container manifests list items, routes, weights, and special instructions. An SDA could automatically verify these details against physical inventory, saving time and reducing errors.

What other SDA applications do you find exciting?
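One practical building block behind ideas like these is making the target schema explicit and validating whatever the model returns against it, so a human-in-the-loop step can catch anything that doesn't fit. Below is a small sketch using pydantic; the schema and the stubbed LLM output are invented for the example.

```python
# Illustrative only: validate LLM-extracted document data against an explicit schema.
from pydantic import BaseModel, ValidationError

class PurchaseOrder(BaseModel):
    supplier: str
    item_code: str
    quantity: int
    unit_price: float
    procedure_date: str  # ISO date, kept as a string for simplicity

# Pretend this JSON came back from an LLM that read a scanned order form.
llm_output = {
    "supplier": "MedSupply GmbH",
    "item_code": "KNEE-IMPL-42",
    "quantity": "2",          # numeric strings are coerced where safe
    "unit_price": 1890.0,
    "procedure_date": "2025-03-14",
}

try:
    order = PurchaseOrder(**llm_output)
    print("validated order:", order)
except ValidationError as err:
    # Instead of auto-submitting, route the document to a human for review.
    print("needs review:", err)
```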
-
🎯 Excited to share that AI_EXTRACT() is now live in Snowflake in Public Preview!

This new AI function is designed to pull structured data from any input, whether you're dealing with text, images, or documents.

What makes this exciting? Instead of wrestling with messy data extraction, you can now ask natural language questions and get clean, structured responses. Need to pull names and addresses from a pile of PDFs? Just tell AI_EXTRACT() what you're looking for and watch it work its magic.

The flexibility is what caught my attention: you can define your extraction format however works best for your use case. Arrays, objects, JSON schema - it adapts to how you think about your data.

Perfect timing for anyone drowning in unstructured documents who needs to turn that chaos into actionable insights. The possibilities for automating data extraction workflows are pretty impressive.

Ready to clean up your data extraction process? Check out the docs and let me know what you're planning to extract!

The docs: https://lnkd.in/gjpcxG4M

#Snowflake #AI #DataExtraction
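If you want to try it from Python, something along these lines should work via the standard Snowflake connector. The connection code is standard; the AI_EXTRACT argument names below are only my reading of the preview announcement, so verify them against the linked docs before relying on this.

```python
# Hedged example: calling AI_EXTRACT() through the Snowflake Python connector.
# The connector calls are standard; the AI_EXTRACT argument names are an assumption
# based on the feature description - check the official docs for the exact signature.
import snowflake.connector

conn = snowflake.connector.connect(
    account="YOUR_ACCOUNT", user="YOUR_USER", password="YOUR_PASSWORD",
    warehouse="COMPUTE_WH", database="DEMO_DB", schema="PUBLIC",
)

sql = """
SELECT AI_EXTRACT(
    text => 'Invoice 4711 issued to Jane Doe, 12 Example Street, Berlin, total EUR 230.00',
    responseFormat => ['customer_name', 'address', 'total_amount']
) AS extracted
"""

cur = conn.cursor()
try:
    cur.execute(sql)
    print(cur.fetchone()[0])
finally:
    cur.close()
    conn.close()
```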
-
I've been building and managing data systems at Amazon for the last 8 years. Now that AI is everywhere, the way we work as data engineers is changing fast.

Here are 5 real ways I (and many in the industry) use LLMs to work smarter every day as a Senior Data Engineer:

1. Code Review and Refactoring
LLMs help break down complex pull requests into simple summaries, making it easier to review changes across big codebases. They can also identify anti-patterns in PySpark, SQL, and Airflow code, helping you catch bugs or risky logic before it lands in prod. If you're refactoring old code, LLMs can point out where your abstractions are weak or naming is inconsistent, so your codebase stays cleaner as it grows.

2. Debugging Data Pipelines
When Spark jobs fail or SQL breaks in production, LLMs help translate ugly error logs into plain English. They can suggest troubleshooting steps or highlight what part of the pipeline to inspect next, helping you zero in on root causes faster. If you're stuck on a recurring error, LLMs can propose code-level changes or optimizations you might have missed.

3. Documentation and Knowledge Sharing
Turning notebooks, scripts, or undocumented DAGs into clear internal docs is much easier with LLMs. They can help structure your explanations, highlight the "why" behind key design choices, and make onboarding or handover notes quick to produce. Keeping platform wikis and technical documentation up to date becomes much less of a chore.

4. Data Modeling and Architecture Decisions
When you're designing schemas, deciding on partitioning, or picking between technologies (like Delta, Iceberg, or Hudi), LLMs can offer quick pros/cons, highlight trade-offs, and provide code samples. If you need to visualize a pipeline or architecture, LLMs can help you draft Mermaid or PlantUML diagrams for clearer communication with stakeholders.

5. Cross-Team Communication
When collaborating with PMs, analytics, or infra teams, LLMs help you draft clear, focused updates, whether it's a Slack message, an email, or a JIRA comment. They're useful for summarizing complex issues, outlining next steps, or translating technical decisions into language that business partners understand.

LLMs won't replace data engineers, but they're rapidly raising the bar for what you can deliver each week. Start by picking one recurring pain point in your workflow, then see how an LLM can speed it up. This is the new table stakes for staying sharp as a data engineer.
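For point 2 in particular, even a tiny helper goes a long way. Here is a hedged sketch (assuming the `openai` package; the model name is a placeholder) that hands an ugly error log to an LLM and asks for a plain-English summary plus one next debugging step.

```python
# Toy error-log triage helper in the spirit of "Debugging Data Pipelines" above.
# Assumes the `openai` package and an API key; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

def triage(error_log: str) -> str:
    """Summarize an error log in plain English and suggest one next step."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a data engineering assistant. Explain the error briefly "
                    "and suggest exactly one concrete next debugging step."
                ),
            },
            {"role": "user", "content": error_log},
        ],
    )
    return resp.choices[0].message.content

print(triage("org.apache.spark.sql.AnalysisException: Column 'order_ts' does not exist."))
```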
-
Everyone's getting AI in security wrong. 🤖🤔

The real shift isn't chatbots or threat detection. It's the Model Context Protocol (MCP): AI that works with EVERY security tool. And it's completely changing the game. 🤯

Think about your daily security tasks: you write complex queries, you switch between tools, you document investigations. Now imagine just telling an AI, "Find all failed login attempts from Russia in the last hour," and it does it all for you.

Here's what's already being done with MCP:

Active Directory Security
🔵 BloodHound-MCP-AI: Natural language AD analysis – https://lnkd.in/g-dkkxSC
🔵 ROADRecon MCP: Azure AD security made simple – https://lnkd.in/gt44Pz5x
🔵 Mythic MCP: AI-powered red teaming – https://lnkd.in/gyd2FEB6

Reverse Engineering
🔵 IDA Pro MCP Plugin: AI-assisted malware analysis – https://lnkd.in/gX9qRWWm
🔵 Ghidra MCP: Automated binary investigation – https://lnkd.in/gZhTV6Xq
🔵 OALabs Integration: Smart reverse engineering – https://lnkd.in/gM8_DhdC

Cloud Security
🔵 CloudWatch-Logs-MCP: AI log analysis – https://lnkd.in/gNpQPZDn
🔵 AWS Labs MCP: Full AWS security integration – https://lnkd.in/gBMWgs-3
🔵 ActivePieces: 280+ security automations – https://lnkd.in/ggwJX6XZ

Advanced Tools
🔵 BrowserMCP: AI-controlled security testing – https://browsermcp.io
🔵 GitMCP: Smart code security analysis – https://gitmcp.io
🔵 Panther's SecOps AI: Streamlined investigations – https://lnkd.in/gVSnrmdM

The best part? You don't need to be a coding expert. Security analysts are using these tools to do in minutes what used to take hours. Junior team members are handling complex investigations on day one. Even incident response teams are cutting their investigation time in half.

But here's what nobody's talking about: MCP is making security expertise accessible to everyone. Small companies are now running enterprise-grade security operations.

Want to get started? Check out these additional resources:
🔵 MCP Introduction Video: https://lnkd.in/gM2iSmMM
🔵 Complete Tool Directory: https://mcp.so
🔵 Latest Updates: https://lnkd.in/eE7FRUWJ

Here's my question: which of these tools would you try first in your security workflow?