A Simple Java Web Crawler Tutorial

php中文网 2024-10-15 11:51:53

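How do you build a simple crawler in Java? Create a Maven project and add the dependencies, then write the crawler logic: send HTTP requests, parse HTML documents, extract links, and crawl pages recursively. Limit the number of concurrent requests, parse HTML with the Jsoup library, and crawl pages asynchronously with CompletableFuture.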

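How to Build a Crawler in Java

Java is well suited to building web crawlers: it has mature libraries, solid concurrency support, and good scalability. This tutorial covers the basics of building a simple crawler in Java.

Prerequisites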

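Java development environment (JDK 11 or later, which java.net.http.HttpClient requires)
Maven or Gradle build tool

Dependencies

Jsoup (for parsing HTML documents)
HttpClient (for sending HTTP requests; the code in this tutorial uses java.net.http.HttpClient, which is built into JDK 11 and later)

Step 1: Create a Maven project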

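Create a pom.xml file in the project root with the following content:

<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>crawler</artifactId>
  <version>1.0-SNAPSHOT</version>

  <properties>
    <!-- java.net.http.HttpClient requires Java 11 or later -->
    <maven.compiler.source>11</maven.compiler.source>
    <maven.compiler.target>11</maven.compiler.target>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.jsoup</groupId>
      <artifactId>jsoup</artifactId>
      <version>1.15.3</version>
    </dependency>
    <!-- Optional: Apache HttpClient 4.x. The Crawler class in step 2 uses the
         JDK's built-in java.net.http.HttpClient and does not need it. -->
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
      <version>4.5.13</version>
    </dependency>
  </dependencies>
</project>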

Step 2: Write the crawler logic

Write the following logic in the Crawler.java class (under src/main/java):

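import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Crawler {

  public static void main(String[] args) {
    // Create the HTTP client
    HttpClient client = HttpClient.newHttpClient();

    // Start URL
    String url = "https://example.com";

    // Maximum number of concurrent requests (used as the thread pool size)
    int maxConcurrentRequests = 10;

    // Maximum link depth to follow from the start URL (keeps the crawl finite)
    int maxDepth = 2;

    // URLs already scheduled, so that no page is fetched twice (thread-safe set)
    Set<String> visited = ConcurrentHashMap.newKeySet();

    ExecutorService pool = Executors.newFixedThreadPool(maxConcurrentRequests);
    try {
      // Crawl recursively and wait for all requests to complete
      crawlRecursively(client, url, 0, maxDepth, visited, pool).join();
    } finally {
      pool.shutdown();
    }
    System.out.println("Visited " + visited.size() + " URLs");
  }

  // Returns a future that completes once this page and all pages reachable
  // from it (up to maxDepth) have been crawled.
  private static CompletableFuture<Void> crawlRecursively(HttpClient client, String url, int depth,
      int maxDepth, Set<String> visited, ExecutorService pool) {
    // Stop at the depth limit or if the URL has already been seen
    if (depth > maxDepth || !visited.add(url)) {
      return CompletableFuture.completedFuture(null);
    }

    // Fetch the page asynchronously on the bounded pool, then crawl its links
    return CompletableFuture
        .supplyAsync(() -> fetchLinks(client, url), pool)
        .thenCompose(links -> {
          CompletableFuture<?>[] children = links.stream()
              .map(next -> crawlRecursively(client, next, depth + 1, maxDepth, visited, pool))
              .toArray(CompletableFuture[]::new);
          return CompletableFuture.allOf(children);
        });
  }

  // Downloads one page and returns the absolute http(s) links found on it.
  private static List<String> fetchLinks(HttpClient client, String url) {
    List<String> links = new ArrayList<>();
    try {
      // Send a GET request
      HttpRequest request = HttpRequest.newBuilder().GET().uri(URI.create(url)).build();
      HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
      System.out.println("Crawled: " + url);

      // Parse the HTML document; the base URI lets Jsoup resolve relative links
      Document doc = Jsoup.parse(response.body(), url);

      // Extract the links on the page
      for (Element link : doc.select("a[href]")) {
        String nextUrl = link.absUrl("href");

        // Filter out links that are not http(s), such as javascript: or mailto:
        if (nextUrl.startsWith("http")) {
          links.add(nextUrl);
        }
      }
    } catch (IOException | IllegalArgumentException e) {
      // Network failure or malformed URL: report it and skip this page
      System.err.println("Failed to crawl " + url + ": " + e.getMessage());
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return links;
  }
}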

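Step 3: Run the crawler

Run the following commands from the project root. They compile the project and start the Crawler class through the exec-maven-plugin, which puts Jsoup on the classpath (the plain jar produced by mvn install has no Main-Class entry and does not bundle its dependencies, so it cannot be started with java -jar):

mvn clean compile
mvn exec:java -Dexec.mainClass=Crawler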

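How to limit the number of concurrent requests

Set the maxConcurrentRequests variable to limit the number of concurrent requests. This helps avoid overloading the target server.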
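
One common way to enforce such a limit is to run the requests on a fixed-size thread pool. The following is a minimal sketch under stated assumptions: the class name ConcurrencyLimitDemo and the method runTasks are hypothetical, and Thread.sleep stands in for a blocking HTTP request so that the example runs without network access. It records the highest number of tasks that were running at the same time, which should never exceed the pool size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

class ConcurrencyLimitDemo {

  // Runs taskCount simulated requests on a pool of maxConcurrentRequests threads
  // and returns the highest number of tasks observed running at the same time.
  static int runTasks(int taskCount, int maxConcurrentRequests) {
    ExecutorService pool = Executors.newFixedThreadPool(maxConcurrentRequests);
    AtomicInteger running = new AtomicInteger();
    AtomicInteger peak = new AtomicInteger();
    List<CompletableFuture<Void>> futures = new ArrayList<>();
    try {
      for (int i = 0; i < taskCount; i++) {
        futures.add(CompletableFuture.runAsync(() -> {
          int now = running.incrementAndGet();
          peak.accumulateAndGet(now, Math::max);
          try {
            Thread.sleep(20); // stand-in for a blocking HTTP request (assumption)
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          } finally {
            running.decrementAndGet();
          }
        }, pool));
      }
      // Wait until every simulated request has finished
      CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
    } finally {
      pool.shutdown();
    }
    return peak.get();
  }

  public static void main(String[] args) {
    int peak = runTasks(50, 10);
    System.out.println("Peak concurrent tasks: " + peak);
  }
}
```

Because each pool thread handles one task at a time, the pool size acts as the concurrency cap; a java.util.concurrent.Semaphore is another option when the requests themselves are non-blocking.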

How to parse HTML documents

Use the Jsoup library to parse HTML documents. It provides convenient methods for extracting elements and links from a page.
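
As an illustration, the sketch below parses an HTML string and collects the absolute URLs of its links. It assumes Jsoup is on the classpath; the class name LinkExtractor, the method extractLinks, and the sample HTML are hypothetical and exist only for this example. Passing a base URI to Jsoup.parse lets absUrl resolve relative links such as "/about".

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class LinkExtractor {

  // Parses html and returns the absolute http(s) URLs of all <a href> links.
  // baseUri is used to resolve relative links.
  static List<String> extractLinks(String html, String baseUri) {
    Document doc = Jsoup.parse(html, baseUri);
    List<String> links = new ArrayList<>();
    for (Element link : doc.select("a[href]")) {
      String absolute = link.absUrl("href");
      // Keep only http and https links; this drops javascript:, mailto:, etc.
      if (absolute.startsWith("http")) {
        links.add(absolute);
      }
    }
    return links;
  }

  public static void main(String[] args) {
    String html = "<html><head><title>Demo</title></head><body>"
        + "<a href=\"/about\">About</a>"
        + "<a href=\"https://example.org/docs\">Docs</a>"
        + "<a href=\"javascript:void(0)\">Skip</a>"
        + "</body></html>";
    System.out.println(extractLinks(html, "https://example.com/"));
  }
}
```

The selector a[href] matches only anchor elements that actually carry an href attribute, so anchors used purely as in-page targets are skipped.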

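How to crawl pages asynchronously

Use CompletableFuture to crawl pages asynchronously. This allows several pages to be fetched in parallel, which improves throughput.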
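
The pattern can be sketched as follows: start one asynchronous task per URL, wait for all of them with CompletableFuture.allOf, then collect the results. This is a minimal sketch rather than a full crawler. The names AsyncFetchDemo, fakeFetch, and fetchAll are hypothetical, and fakeFetch simulates a download with a short sleep instead of making a real HTTP request.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

class AsyncFetchDemo {

  // Stand-in for a real HTTP fetch (assumption): waits briefly and returns fake content.
  static String fakeFetch(String url) {
    try {
      Thread.sleep(50);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return "content of " + url;
  }

  // Starts one asynchronous task per URL, waits for all of them,
  // and returns the results in the same order as the input list.
  static List<String> fetchAll(List<String> urls) {
    List<CompletableFuture<String>> futures = urls.stream()
        .map(url -> CompletableFuture.supplyAsync(() -> fakeFetch(url)))
        .collect(Collectors.toList());

    // allOf completes when every future in the array has completed
    CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();

    // Every future is already done here, so join() returns immediately
    return futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> pages = fetchAll(List.of("https://example.com/a", "https://example.com/b"));
    pages.forEach(System.out::println);
  }
}
```

With a real client, java.net.http.HttpClient.sendAsync returns a CompletableFuture directly and could take the place of the supplyAsync call.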

Article URL: http://www.ipsmc.com/java/10540.html