Adversaries may search websites owned by the victim for information that can be used during targeting. Corporate sites, employee portals, partner pages, and forgotten subdomains all yield valuable intelligence for operational planning.
Corporate websites are among the most richly rewarding OSINT targets available to adversaries. Unlike social media or third-party databases, victim-owned websites are curated by the organization itself, making them authoritative sources for employee names and roles, office locations, technology stacks, business partners, and organizational structure. According to CISA's AA24-242A advisory on the PRC MSS-affiliated group APT40, threat actors extensively searched victim-owned websites for targeting information before launching intrusion campaigns.
Job postings are a particularly insidious leak vector: they routinely reveal specific software versions (e.g., "Apache Tomcat 9," "Oracle Database 19c"), security tools in use ("Splunk SIEM," "Palo Alto firewalls"), team structures, and even internal project codenames. Employee directory pages ("Our Team" sections) provide direct targets for spear-phishing, while press releases reveal business relationships, merger plans, and new office openings that expand the attack surface. Meanwhile, technical files like sitemap.xml and robots.txt can inadvertently disclose hidden administrative paths, staging environments, and forgotten development directories.
The Silent Librarian (Mabna Institute) threat group famously conducted extensive pre-compromise reconnaissance on victim university and corporate websites, scraping faculty directories, research profiles, and institutional branding to craft convincing phishing campaigns. APT41 similarly performed thorough website research as part of their operational planning against potential targets. In one documented case, adversaries scraped the source code, branding assets, and organizational contact information from a victim's website to create pixel-perfect phishing pages that bypassed email security filters.
T1594 (Search Victim-Owned Websites) is a reconnaissance technique where adversaries systematically browse, scrape, and analyze websites operated by the target organization to collect intelligence that supports future attacks. This includes the main corporate website, employee portals, partner pages, subdomains, and any forgotten or deprecated web properties.
Everyday Analogy: Like a burglar casing a house by walking through the neighborhood: reading the family's mail, peeking through windows, checking which security company's sign is on the lawn, and noting when people come and go, all from the public sidewalk. They never trespass, yet they learn the home's layout, security system, family schedule, and valuable contents before ever setting foot on the property. Your corporate website is that house, and every page, file, and directory is a window into your organization.
| Term | Definition |
|---|---|
| OSINT | Open Source Intelligence: collecting information from publicly available sources, including websites, social media, and public records, to build a profile of the target. |
| Web Scraping | Automated extraction of data from websites using scripts or bots (e.g., Scrapy, BeautifulSoup) that can rapidly download and parse thousands of pages for names, emails, and technical details. |
| sitemap.xml | An XML file on web servers that lists all URLs the site owner wants search engines to index; it also gives attackers a complete map of the site's structure, including hidden or administrative pages. |
| robots.txt | A text file that tells web crawlers which pages to avoid; ironically, it reveals the most sensitive directories (such as /admin/, /internal/, /backup/) to anyone who reads it. |
| Directory Enumeration | Systematically probing a web server for hidden directories and files using wordlists, often revealing forgotten development pages, backup files, or unprotected admin panels. |
| Google Dorking | Using advanced search operators (e.g., site:novatech.com filetype:pdf) to find sensitive information indexed by search engines that the site owner may have overlooked. |
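The robots.txt irony described in the glossary is easy to demonstrate. The sketch below, using only the Python standard library, extracts the Disallow paths a site operator hoped crawlers would skip; the sample robots.txt body mirrors the directories mentioned in this article and is purely illustrative.

```python
import urllib.request

def disallowed_paths(robots_txt: str) -> list[str]:
    """Extract Disallow paths from a robots.txt body."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop inline comments
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:
                paths.append(path)
    return paths

def fetch_disallowed(base_url: str) -> list[str]:
    """Fetch a site's robots.txt and return its Disallow paths."""
    with urllib.request.urlopen(base_url.rstrip("/") + "/robots.txt") as resp:
        return disallowed_paths(resp.read().decode("utf-8", "replace"))

# Illustrative robots.txt matching the scenario in this article:
sample_robots = """User-agent: *
Disallow: /admin/
Disallow: /internal/
Disallow: /backup/
"""
print(disallowed_paths(sample_robots))  # ['/admin/', '/internal/', '/backup/']
```

The same two dozen lines serve both sides: an attacker uses the output as a target list, a defender uses it to audit what their own robots.txt is advertising.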
NovaTech Solutions had a modern, well-designed corporate website that the marketing team was proud of. The "Our Team" page featured professional headshots, full names, job titles, and email addresses for all 12 executives and department heads. The careers section listed open positions with detailed technology requirements: "Experience with Apache Tomcat 9, Oracle Database 19c, and VMware vSphere required." The "Contact Us" page included three office addresses across Austin, Denver, and Seattle, complete with floor-level details. The company's sitemap.xml listed 47 URLs including an /admin/ portal and an /internal/docs/ path that were supposed to be restricted but were visible in the file.
An APT group operating from Eastern Europe spent three weeks meticulously scraping NovaTech's web presence. They extracted 23 employee names and the company's email-address format from the "Our Team" page and other public pages. Job postings revealed the exact technology stack (Apache Tomcat, Oracle Database, Palo Alto firewalls, Splunk SIEM). The sitemap.xml disclosed a hidden /admin/ portal and an /internal/docs/ directory containing outdated API documentation. Using Google Dorking, the attackers found a forgotten staging subdomain (staging.novatech-solutions.com) running an older version of the website with default credentials still active. They scraped all branding assets (logos, color schemes, fonts) from the public /branding/ directory.
The attackers then crafted a pixel-perfect phishing page mimicking NovaTech's login portal, sent spear-phishing emails to three C-suite executives using the harvested names and email patterns, and leveraged knowledge of the Tomcat version to search for known CVEs. Within 72 hours of the phishing emails being sent, the CFO's credentials were compromised.
The breach resulted in $2.8 million in direct costs: incident response forensics, legal fees, regulatory fines, and customer notification. The attackers accessed financial records and exfiltrated 14,000 client records before detection. NovaTech's stock price dropped 12% on news of the breach. The board of directors launched an independent investigation, and Sarah Chen's team spent six months overhauling the company's entire public web presence: removing unnecessary employee data, restricting sitemap.xml access, implementing rate limiting, and deploying a web application firewall.
Map the organization's complete web footprint including the main site, subdomains, related domains, staging environments, and legacy properties.
Systematically discover the full architecture of the victim's web presence using sitemap.xml, robots.txt, and automated directory enumeration tools.
Harvest names, roles, contact information, department structures, and office locations from public-facing web pages.
Identify software versions, frameworks, server configurations, and security tools through job postings, source code, and HTTP headers.
Find forgotten pages, debug directories, backup files, API documentation, and other resources not intended for public access.
Download logos, color schemes, templates, email signatures, and language patterns needed to craft convincing phishing pages and impersonations.
Combine all collected intelligence into a comprehensive target profile that informs phishing, vulnerability exploitation, and social engineering operations.
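The structure-mapping step above can be sketched with the standard library alone. The XML namespace is the standard sitemaps.org format; the sample sitemap and its URLs are illustrative, and the keyword list is an assumption about which path segments an adversary would prioritize.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap.xml body."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]

def interesting_paths(urls, keywords=("admin", "internal", "staging", "backup")):
    """Flag URLs containing path segments an attacker would prioritize."""
    return [u for u in urls if any(k in u.lower() for k in keywords)]

sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/admin/</loc></url>
  <url><loc>https://example.com/internal/docs/</loc></url>
</urlset>"""

print(interesting_paths(sitemap_urls(sample_sitemap)))
# ['https://example.com/admin/', 'https://example.com/internal/docs/']
```

Run against your own sitemap, the same filter doubles as a quick self-audit for paths that should never have been indexed.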
Automated scrapers typically request pages at speeds no human can match: 50+ pages per minute from a single IP address. Look for sequential URL patterns in access logs that suggest scripted crawling rather than organic browsing behavior. Time-based clustering analysis reveals bot patterns that simple request counting misses.
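A minimal sliding-window version of that rate check might look like the sketch below; the threshold and window values come from the 50-pages-per-minute figure above, and the IP addresses are documentation examples.

```python
from collections import defaultdict

def flag_scrapers(events, threshold=50, window=60.0):
    """Flag IPs issuing more than `threshold` requests inside any
    `window`-second span. `events` is an iterable of (timestamp, ip)
    pairs, e.g. parsed from web server access logs."""
    by_ip = defaultdict(list)
    for ts, ip in events:
        by_ip[ip].append(ts)
    flagged = set()
    for ip, times in by_ip.items():
        times.sort()
        lo = 0
        for hi, t in enumerate(times):
            while t - times[lo] > window:  # shrink window from the left
                lo += 1
            if hi - lo + 1 > threshold:
                flagged.add(ip)
                break
    return flagged

# 120 requests in one minute from one IP trips the threshold;
# 5 requests spread over 100 seconds does not.
bot = [(i * 0.5, "203.0.113.9") for i in range(120)]
human = [(i * 20.0, "198.51.100.7") for i in range(5)]
print(flag_scrapers(bot + human))  # {'203.0.113.9'}
```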
Legitimate search engine crawlers identify themselves with recognizable user-agents. Attackers often use generic libraries ("python-requests/2.28.0"), headless browser identifiers ("HeadlessChrome"), or attempt to impersonate legitimate bots with slight variations. Monitor for user-agents that don't match known search engine patterns, especially those requesting sitemap.xml or robots.txt immediately upon connection.
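User-agent triage can be a simple string match, as in this sketch; the marker lists are illustrative starting points, not a complete taxonomy, and (as the docstring notes) a UA string alone proves nothing because it is trivially spoofed.

```python
KNOWN_CRAWLERS = ("googlebot", "bingbot", "duckduckbot", "yandexbot")
SUSPECT_MARKERS = ("python-requests", "headlesschrome", "curl/",
                   "scrapy", "go-http-client")

def classify_user_agent(ua: str) -> str:
    """Rough triage of a User-Agent string: 'suspect' for scripting
    libraries and headless browsers, 'crawler' for self-identified
    search engines, 'unknown' otherwise. Real deployments should also
    verify crawler claims (e.g. via reverse DNS), since UA strings are
    trivially spoofed."""
    ua_l = ua.lower()
    if any(m in ua_l for m in SUSPECT_MARKERS):
        return "suspect"
    if any(b in ua_l for b in KNOWN_CRAWLERS):
        return "crawler"
    return "unknown"

print(classify_user_agent("python-requests/2.28.0"))  # suspect
print(classify_user_agent("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # crawler
```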
A hallmark of T1594 is the specific access sequence: first sitemap.xml or robots.txt to discover the site structure, followed by systematic requests to discovered paths. Sessions that begin with these reconnaissance files and then proceed to enumerate multiple directories within minutes strongly indicate automated recon. Track the "first request" for each session and correlate with subsequent depth.
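That first-request-then-enumerate pattern can be tracked per session, as in this sketch; the session-ID field, path counts, and time window are assumptions about what your log pipeline provides.

```python
RECON_FILES = ("/robots.txt", "/sitemap.xml")

def recon_sessions(log, min_paths=10, window=300.0):
    """Find session IDs whose first request hits robots.txt or
    sitemap.xml and that then touch at least `min_paths` distinct
    paths within `window` seconds. `log` is an iterable of
    (timestamp, session_id, path) tuples in arrival order."""
    first_seen = {}  # session -> (start_ts, started_with_recon_file)
    paths = {}       # session -> distinct paths seen inside the window
    flagged = set()
    for ts, sid, path in log:
        if sid not in first_seen:
            first_seen[sid] = (ts, path in RECON_FILES)
        start, recon_start = first_seen[sid]
        if recon_start and ts - start <= window:
            paths.setdefault(sid, set()).add(path)
            if len(paths[sid]) >= min_paths:
                flagged.add(sid)
    return flagged

# A session that opens with sitemap.xml and fans out across 12 paths
# in seconds is flagged; a casual visitor is not.
log = [(float(i), "bot-1", p) for i, p in enumerate(
    ["/sitemap.xml"] + [f"/page{n}/" for n in range(12)])]
log += [(0.0, "user-7", "/"), (30.0, "user-7", "/contact/")]
print(recon_sessions(log))  # {'bot-1'}
```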
Attackers preparing phishing campaigns often bulk-download images, logos, CSS files, and JavaScript from /branding/, /assets/, or /images/ directories. Look for sessions with abnormally high ratios of static asset requests to HTML page requests, or concentrated downloads from media directories followed by session termination, indicating the attacker got what they needed.
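The asset-ratio heuristic reduces to a few lines; the extension list, 90% threshold, and minimum session size below are illustrative tuning values, not established baselines.

```python
ASSET_EXTS = (".png", ".jpg", ".svg", ".css", ".js", ".woff", ".ico")

def asset_ratio(paths):
    """Fraction of a session's requests that target static assets."""
    if not paths:
        return 0.0
    hits = sum(1 for p in paths if p.lower().endswith(ASSET_EXTS))
    return hits / len(paths)

def looks_like_asset_harvest(paths, threshold=0.9, min_requests=20):
    """Heuristic for branding-asset scraping: a session that made
    almost nothing but static-asset requests."""
    return len(paths) >= min_requests and asset_ratio(paths) >= threshold

# 24 of 25 requests hit /branding/ assets: a harvest pattern.
session = [f"/branding/logo-{i}.svg" for i in range(24)] + ["/index.html"]
print(looks_like_asset_harvest(session))  # True
```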
Deploy a layered detection approach: Web Analytics (Google Analytics, Matomo) for behavioral anomaly detection on the application layer; WAF Logs (Cloudflare, AWS WAF, ModSecurity) for rule-based bot detection and rate limiting; SIEM Correlation (Splunk, ELK) to cross-reference web access logs with threat intelligence on known malicious IPs; Honeypots (canary tokens, fake admin pages) that generate alerts when accessed, confirming active reconnaissance. Combining these layers provides defense-in-depth against T1594.
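The honeypot layer is the simplest to reason about: any hit on a decoy path is a true positive. A minimal log-side sketch, with purely illustrative decoy paths and log entries:

```python
# Decoy paths deployed on the site but never linked anywhere (illustrative).
CANARY_PATHS = {"/admin-old/", "/backup.zip", "/internal/api-keys.txt"}

def canary_alerts(log):
    """Return (timestamp, ip, path) entries that touched a decoy path.
    No legitimate workflow links to these paths, so every hit is
    evidence of enumeration or reconnaissance."""
    return [entry for entry in log if entry[2] in CANARY_PATHS]

log = [
    (100.0, "203.0.113.9", "/backup.zip"),   # decoy hit -> alert
    (101.0, "198.51.100.7", "/"),            # normal traffic
]
print(canary_alerts(log))  # [(100.0, '203.0.113.9', '/backup.zip')]
```

In production the same idea is usually implemented with canary tokens or WAF rules that page a responder on the first hit rather than a batch log scan.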
Every page, every file, and every directory listing on your public-facing website is potential intelligence for an adversary. Take 30 minutes today to audit your organization's web presence: review your sitemap.xml, check robots.txt for sensitive path leaks, minimize employee information exposure, and verify that staging and legacy subdomains are properly secured.
Have questions about website reconnaissance or want to share your defensive strategies? Join the discussion in the comments below. Your insights could help other defenders protect their organizations against T1594 attacks.
Reference: MITRE ATT&CK T1594 (attack.mitre.org/techniques/T1594) | CISA Advisory AA24-242A (cisa.gov)
Every contribution moves us closer to our goal: making world-class cybersecurity education accessible to ALL.
Donate any amount you choose.