Adversaries may search websites owned by the victim for information that can be used during targeting. Corporate sites, employee portals, partner pages, and forgotten subdomains all yield valuable intelligence for operational planning.
Corporate websites are among the most richly rewarding OSINT targets available to adversaries. Unlike social media or third-party databases, victim-owned websites are curated by the organization itself, making them authoritative sources for employee names and roles, office locations, technology stacks, business partners, and organizational structure. According to CISA's AA24-242A advisory on the PRC MSS-affiliated group APT40, threat actors extensively searched victim-owned websites for targeting information before launching intrusion campaigns.
Job postings are a particularly insidious leak vector: they routinely reveal specific software versions (e.g., "Apache Tomcat 9," "Oracle Database 19c"), security tools in use ("Splunk SIEM," "Palo Alto firewalls"), team structures, and even internal project codenames. Employee directory pages ("Our Team" sections) provide direct targets for spear-phishing, while press releases reveal business relationships, merger plans, and new office openings that expand the attack surface. Meanwhile, technical files like sitemap.xml and robots.txt can inadvertently disclose hidden administrative paths, staging environments, and forgotten development directories.
The Silent Librarian (Mabna Institute) threat group famously conducted extensive pre-compromise reconnaissance on victim university and corporate websites, scraping faculty directories, research profiles, and institutional branding to craft convincing phishing campaigns. APT41 similarly performed thorough website research as part of their operational planning against potential targets. In one documented case, adversaries scraped the source code, branding assets, and organizational contact information from a victim's website to create pixel-perfect phishing pages that bypassed email security filters.
T1594 (Search Victim-Owned Websites) is a reconnaissance technique where adversaries systematically browse, scrape, and analyze websites operated by the target organization to collect intelligence that supports future attacks. This includes the main corporate website, employee portals, partner pages, subdomains, and any forgotten or deprecated web properties.
Everyday Analogy: Like a burglar casing a house by walking through the neighborhood: reading the family's mail, peeking through windows, checking which security company's sign is on the lawn, and noting when people come and go, all from the public sidewalk. They never trespass, yet they learn the home's layout, security system, family schedule, and valuable contents before ever setting foot on the property. Your corporate website is that house, and every page, file, and directory is a window into your organization.
| Term | Definition |
|---|---|
| OSINT | Open Source Intelligence: collecting information from publicly available sources, including websites, social media, and public records, to build a profile of the target. |
| Web Scraping | Automated extraction of data from websites using scripts or bots (e.g., Scrapy, BeautifulSoup) that can rapidly download and parse thousands of pages for names, emails, and technical details. |
| sitemap.xml | An XML file on web servers that lists all URLs the site owner wants search engines to index; it also gives attackers a complete map of the site's structure, including hidden or administrative pages. |
| robots.txt | A text file that tells web crawlers which pages to avoid; ironically, it reveals the most sensitive directories (such as /admin/, /internal/, /backup/) to anyone who reads it. |
| Directory Enumeration | Systematically probing a web server for hidden directories and files using wordlists, often revealing forgotten development pages, backup files, or unprotected admin panels. |
| Google Dorking | Using advanced search operators (e.g., site:novatech.com filetype:pdf) to find sensitive information indexed by search engines that the site owner may have overlooked. |
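The robots.txt irony described in the glossary is easy to demonstrate. The sketch below, using only the Python standard library, extracts the Disallow paths a site operator hoped crawlers would skip; the sample robots.txt body mirrors the directories mentioned in this article and is purely illustrative.

```python
import urllib.request

def disallowed_paths(robots_txt: str) -> list[str]:
    """Extract Disallow paths from a robots.txt body."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop inline comments
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:
                paths.append(path)
    return paths

def fetch_disallowed(base_url: str) -> list[str]:
    """Fetch a site's robots.txt and return its Disallow paths."""
    with urllib.request.urlopen(base_url.rstrip("/") + "/robots.txt") as resp:
        return disallowed_paths(resp.read().decode("utf-8", "replace"))

# Illustrative robots.txt matching the scenario in this article:
sample_robots = """User-agent: *
Disallow: /admin/
Disallow: /internal/
Disallow: /backup/
"""
print(disallowed_paths(sample_robots))  # ['/admin/', '/internal/', '/backup/']
```

The same two dozen lines serve both sides: an attacker uses the output as a target list, a defender uses it to audit what their own robots.txt is advertising.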
NovaTech Solutions had a modern, well-designed corporate website that the marketing team was proud of. The "Our Team" page featured professional headshots, full names, job titles, and email addresses for all 12 executives and department heads. The careers section listed open positions with detailed technology requirements: "Experience with Apache Tomcat 9, Oracle Database 19c, and VMware vSphere required." The "Contact Us" page included three office addresses across Austin, Denver, and Seattle, complete with floor-level details. The company's sitemap.xml listed 47 URLs including an /admin/ portal and an /internal/docs/ path that were supposed to be restricted but were visible in the file.
An APT group operating from Eastern Europe spent three weeks meticulously scraping NovaTech's web presence. They extracted 23 employee names and the company's email-address format from the "Our Team" page and other public pages. Job postings revealed the exact technology stack (Apache Tomcat, Oracle Database, Palo Alto firewalls, Splunk SIEM). The sitemap.xml disclosed a hidden /admin/ portal and an /internal/docs/ directory containing outdated API documentation. Using Google Dorking, the attackers found a forgotten staging subdomain (staging.novatech-solutions.com) running an older version of the website with default credentials still active. They scraped all branding assets (logos, color schemes, fonts) from the public /branding/ directory.
The attackers then crafted a pixel-perfect phishing page mimicking NovaTech's login portal, sent spear-phishing emails to three C-suite executives using the harvested names and email patterns, and leveraged knowledge of the Tomcat version to search for known CVEs. Within 72 hours of the phishing emails being sent, the CFO's credentials were compromised.
The breach resulted in $2.8 million in direct costs: incident response forensics, legal fees, regulatory fines, and customer notification. The attackers accessed financial records and exfiltrated 14,000 client records before detection. NovaTech's stock price dropped 12% on news of the breach. The board of directors launched an independent investigation, and Sarah Chen's team spent six months overhauling the company's entire public web presence: removing unnecessary employee data, restricting sitemap.xml access, implementing rate limiting, and deploying a web application firewall.
Map the organization's complete web footprint including the main site, subdomains, related domains, staging environments, and legacy properties.
Systematically discover the full architecture of the victim's web presence using sitemap.xml, robots.txt, and automated directory enumeration tools.
Harvest names, roles, contact information, department structures, and office locations from public-facing web pages.
Identify software versions, frameworks, server configurations, and security tools through job postings, source code, and HTTP headers.
Find forgotten pages, debug directories, backup files, API documentation, and other resources not intended for public access.
Download logos, color schemes, templates, email signatures, and language patterns needed to craft convincing phishing pages and impersonations.
Combine all collected intelligence into a comprehensive target profile that informs phishing, vulnerability exploitation, and social engineering operations.
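The structure-mapping step above can be sketched with the standard library alone. The XML namespace is the standard sitemaps.org format; the sample sitemap and its URLs are illustrative, and the keyword list is an assumption about which path segments an adversary would prioritize.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap.xml body."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]

def interesting_paths(urls, keywords=("admin", "internal", "staging", "backup")):
    """Flag URLs containing path segments an attacker would prioritize."""
    return [u for u in urls if any(k in u.lower() for k in keywords)]

sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/admin/</loc></url>
  <url><loc>https://example.com/internal/docs/</loc></url>
</urlset>"""

print(interesting_paths(sitemap_urls(sample_sitemap)))
# ['https://example.com/admin/', 'https://example.com/internal/docs/']
```

Run against your own sitemap, the same filter doubles as a quick self-audit for paths that should never have been indexed.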
Automated scrapers typically request pages at speeds no human can match: 50+ pages per minute from a single IP address. Look for sequential URL patterns in access logs that suggest scripted crawling rather than organic browsing behavior. Time-based clustering analysis reveals bot patterns that simple request counting misses.
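A minimal sliding-window version of that rate check might look like the sketch below; the threshold and window values come from the 50-pages-per-minute figure above, and the IP addresses are documentation examples.

```python
from collections import defaultdict

def flag_scrapers(events, threshold=50, window=60.0):
    """Flag IPs issuing more than `threshold` requests inside any
    `window`-second span. `events` is an iterable of (timestamp, ip)
    pairs, e.g. parsed from web server access logs."""
    by_ip = defaultdict(list)
    for ts, ip in events:
        by_ip[ip].append(ts)
    flagged = set()
    for ip, times in by_ip.items():
        times.sort()
        lo = 0
        for hi, t in enumerate(times):
            while t - times[lo] > window:  # shrink window from the left
                lo += 1
            if hi - lo + 1 > threshold:
                flagged.add(ip)
                break
    return flagged

# 120 requests in one minute from one IP trips the threshold;
# 5 requests spread over 100 seconds does not.
bot = [(i * 0.5, "203.0.113.9") for i in range(120)]
human = [(i * 20.0, "198.51.100.7") for i in range(5)]
print(flag_scrapers(bot + human))  # {'203.0.113.9'}
```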
Legitimate search engine crawlers identify themselves with recognizable user-agents. Attackers often use generic libraries ("python-requests/2.28.0"), headless browser identifiers ("HeadlessChrome"), or attempt to impersonate legitimate bots with slight variations. Monitor for user-agents that don't match known search engine patterns, especially those requesting sitemap.xml or robots.txt immediately upon connection.
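User-agent triage can be a simple string match, as in this sketch; the marker lists are illustrative starting points, not a complete taxonomy, and (as the docstring notes) a UA string alone proves nothing because it is trivially spoofed.

```python
KNOWN_CRAWLERS = ("googlebot", "bingbot", "duckduckbot", "yandexbot")
SUSPECT_MARKERS = ("python-requests", "headlesschrome", "curl/",
                   "scrapy", "go-http-client")

def classify_user_agent(ua: str) -> str:
    """Rough triage of a User-Agent string: 'suspect' for scripting
    libraries and headless browsers, 'crawler' for self-identified
    search engines, 'unknown' otherwise. Real deployments should also
    verify crawler claims (e.g. via reverse DNS), since UA strings are
    trivially spoofed."""
    ua_l = ua.lower()
    if any(m in ua_l for m in SUSPECT_MARKERS):
        return "suspect"
    if any(b in ua_l for b in KNOWN_CRAWLERS):
        return "crawler"
    return "unknown"

print(classify_user_agent("python-requests/2.28.0"))  # suspect
print(classify_user_agent("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # crawler
```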
A hallmark of T1594 is the specific access sequence: first sitemap.xml or robots.txt to discover the site structure, followed by systematic requests to discovered paths. Sessions that begin with these reconnaissance files and then proceed to enumerate multiple directories within minutes strongly indicate automated recon. Track the "first request" for each session and correlate with subsequent depth.
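That first-request-then-enumerate pattern can be tracked per session, as in this sketch; the session-ID field, path counts, and time window are assumptions about what your log pipeline provides.

```python
RECON_FILES = ("/robots.txt", "/sitemap.xml")

def recon_sessions(log, min_paths=10, window=300.0):
    """Find session IDs whose first request hits robots.txt or
    sitemap.xml and that then touch at least `min_paths` distinct
    paths within `window` seconds. `log` is an iterable of
    (timestamp, session_id, path) tuples in arrival order."""
    first_seen = {}  # session -> (start_ts, started_with_recon_file)
    paths = {}       # session -> distinct paths seen inside the window
    flagged = set()
    for ts, sid, path in log:
        if sid not in first_seen:
            first_seen[sid] = (ts, path in RECON_FILES)
        start, recon_start = first_seen[sid]
        if recon_start and ts - start <= window:
            paths.setdefault(sid, set()).add(path)
            if len(paths[sid]) >= min_paths:
                flagged.add(sid)
    return flagged

# A session that opens with sitemap.xml and fans out across 12 paths
# in seconds is flagged; a casual visitor is not.
log = [(float(i), "bot-1", p) for i, p in enumerate(
    ["/sitemap.xml"] + [f"/page{n}/" for n in range(12)])]
log += [(0.0, "user-7", "/"), (30.0, "user-7", "/contact/")]
print(recon_sessions(log))  # {'bot-1'}
```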
Attackers preparing phishing campaigns often bulk-download images, logos, CSS files, and JavaScript from /branding/, /assets/, or /images/ directories. Look for sessions with abnormally high ratios of static asset requests to HTML page requests, or concentrated downloads from media directories followed by session termination, indicating the attacker got what they needed.
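The asset-ratio heuristic reduces to a few lines; the extension list, 90% threshold, and minimum session size below are illustrative tuning values, not established baselines.

```python
ASSET_EXTS = (".png", ".jpg", ".svg", ".css", ".js", ".woff", ".ico")

def asset_ratio(paths):
    """Fraction of a session's requests that target static assets."""
    if not paths:
        return 0.0
    hits = sum(1 for p in paths if p.lower().endswith(ASSET_EXTS))
    return hits / len(paths)

def looks_like_asset_harvest(paths, threshold=0.9, min_requests=20):
    """Heuristic for branding-asset scraping: a session that made
    almost nothing but static-asset requests."""
    return len(paths) >= min_requests and asset_ratio(paths) >= threshold

# 24 of 25 requests hit /branding/ assets: a harvest pattern.
session = [f"/branding/logo-{i}.svg" for i in range(24)] + ["/index.html"]
print(looks_like_asset_harvest(session))  # True
```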
Deploy a layered detection approach: Web Analytics (Google Analytics, Matomo) for behavioral anomaly detection on the application layer; WAF Logs (Cloudflare, AWS WAF, ModSecurity) for rule-based bot detection and rate limiting; SIEM Correlation (Splunk, ELK) to cross-reference web access logs with threat intelligence on known malicious IPs; Honeypots (canary tokens, fake admin pages) that generate alerts when accessed, confirming active reconnaissance. Combining these layers provides defense-in-depth against T1594.
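The honeypot layer is the simplest to reason about: any hit on a decoy path is a true positive. A minimal log-side sketch, with purely illustrative decoy paths and log entries:

```python
# Decoy paths deployed on the site but never linked anywhere (illustrative).
CANARY_PATHS = {"/admin-old/", "/backup.zip", "/internal/api-keys.txt"}

def canary_alerts(log):
    """Return (timestamp, ip, path) entries that touched a decoy path.
    No legitimate workflow links to these paths, so every hit is
    evidence of enumeration or reconnaissance."""
    return [entry for entry in log if entry[2] in CANARY_PATHS]

log = [
    (100.0, "203.0.113.9", "/backup.zip"),   # decoy hit -> alert
    (101.0, "198.51.100.7", "/"),            # normal traffic
]
print(canary_alerts(log))  # [(100.0, '203.0.113.9', '/backup.zip')]
```

In production the same idea is usually implemented with canary tokens or WAF rules that page a responder on the first hit rather than a batch log scan.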
Every page, every file, and every directory listing on your public-facing website is potential intelligence for an adversary. Take 30 minutes today to audit your organization's web presence: review your sitemap.xml, check robots.txt for sensitive path leaks, minimize employee information exposure, and verify that staging and legacy subdomains are properly secured.
Have questions about website reconnaissance or want to share your defensive strategies? Join the discussion in the comments below. Your insights could help other defenders protect their organizations against T1594 attacks.
Reference: MITRE ATT&CK T1594 (attack.mitre.org/techniques/T1594) | CISA Advisory AA24-242A (cisa.gov)
Every contribution moves us closer to our goal: making world-class cybersecurity education accessible to ALL.
Donate any amount you choose.