[SSA] Testing for Deadlock When Resending Failed Notifications

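I previously implemented scheduled resending for notifications that had failed.

While making that processing asynchronous, I ran into a bug caused by the thread-locals of the newly created thread,
and I also fixed a bug related to the transaction propagation scope.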

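However, with asynchronous non-blocking processing, the transaction propagation scope matters.

If reading the failed notifications from the database and sending them through FCM takes a long time, each in-flight retry keeps its database connection for the whole duration of the FCM call.

As a result, every other thread can fail to acquire a connection, even when it is doing work that has nothing to do with FCM requests.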

Of course, it may not be very likely that FCM takes a long time to send even a few dozen messages.

Still, it could happen, so I wanted to test exactly this situation.
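To make the failure mode concrete, here is a minimal sketch in plain Java. It is not the project's code: a `Semaphore` stands in for the connection pool, `Thread.sleep` stands in for the slow FCM call, and the class name, method name, and scaled-down timings are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class PoolExhaustionDemo {

    /** Runs taskCount concurrent tasks and returns {succeeded, timedOut}. */
    static int[] run(int poolSize, int taskCount, long holdMillis, long acquireTimeoutMillis) {
        Semaphore pool = new Semaphore(poolSize, true); // stand-in for the connection pool
        ExecutorService executor = Executors.newFixedThreadPool(taskCount);
        AtomicInteger succeeded = new AtomicInteger();
        AtomicInteger timedOut = new AtomicInteger();

        List<CompletableFuture<Void>> futures = new ArrayList<>();
        for (int i = 0; i < taskCount; i++) {
            futures.add(CompletableFuture.runAsync(() -> {
                try {
                    // Wait for a "connection", giving up after the timeout
                    if (!pool.tryAcquire(acquireTimeoutMillis, TimeUnit.MILLISECONDS)) {
                        timedOut.incrementAndGet();
                        return;
                    }
                    try {
                        Thread.sleep(holdMillis); // slow external call while the connection is held
                        succeeded.incrementAndGet();
                    } finally {
                        pool.release();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, executor));
        }

        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        executor.shutdown();
        return new int[] {succeeded.get(), timedOut.get()};
    }

    public static void main(String[] args) {
        // 2 "connections", 10 tasks, each holding for 500 ms, with a 100 ms acquire timeout
        int[] result = run(2, 10, 500, 100);
        System.out.println("succeeded=" + result[0] + ", timedOut=" + result[1]);
    }
}
```

Because the hold time is longer than the acquire timeout, only the tasks that grab a permit right away can finish, and the rest give up while waiting. Work that has nothing to do with the slow call would be starved in the same way if it needed a permit from the same pool.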

This is the test path.

Below are the scheduling method and the notification sending code.

@Scheduled(fixedDelay = 600000) // runs 10 minutes after the previous run completes
public void retryFailedNotification() {
    List<FailedNotification> targets = failedNotificationRepository.findByRetryCountLessThan(MAX_RETRY_COUNT);

    if (targets.isEmpty()) {
        log.info("No notifications to retry.");
        return;
    }

    log.info("Running the notification retry schedule. {} target(s) in total.", targets.size());

    List<CompletableFuture<Void>> futures = targets.stream()
            .map(failedNotification -> CompletableFuture.runAsync(() -> {
                // Pass only the ID (or the entity) and let a separate service handle the transaction
                notificationRetryService.processSingleRetry(failedNotification.getId());
            }, failedNotificationExecutor))
            .toList();

    try {
        // join() blocks the scheduler thread until every async task has finished
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        log.info("Notification retry schedule finished. {} item(s) processed.", futures.size());
    } catch (Exception e) {
        log.error("Some of the notification retry tasks threw an exception.", e);
    }
}

@Transactional(propagation = Propagation.REQUIRES_NEW)
public void processSingleRetry(Long id) {
    failedNotificationRepository.findByIdWithLock(id).ifPresent(failedNotification -> {
        try {
            NotificationRequest request = NotificationRequest.fromFail(failedNotification);
            // Wait for the response; the transaction and its connection stay open while this blocks
            fcmService.retryFcmNotification(request).get();

            // Delete on success
            failedNotificationRepository.delete(failedNotification);
        } catch (Exception e) {
            log.warn("Notification resend failed. ID: {}, Reason: {}", failedNotification.getId(), e.getMessage());
            // On failure, increment the retry count and update the row
            failedNotification.incrementRetryCount();
            failedNotificationRepository.save(failedNotification);
        }
    });
}

What I want to test is that this fails.

The test configuration is as follows.

@TestPropertySource(properties = {
        "spring.datasource.hikari.maximum-pool-size=2",       // connection pool of 2
        "spring.datasource.hikari.connection-timeout=2000",   // wait up to 2 seconds for a connection
        "spring.datasource.hikari.minimum-idle=2"             // keep at least 2 idle connections
})

The test behaved as expected (it is a test that expects failures).

@Test
// Test name: "a timeout error occurs when the connection pool is small during async processing"
void 비동기_처리시_커넥션풀이_작으면_타임아웃_에러가_발생한다() throws InterruptedException {
    // given: create 10 failed-notification rows
    int dataCount = 10;
    for (int i = 0; i < dataCount; i++) {
        failedNotificationRepository.save(new FailedNotification(
                1L, 2L, "title", "body", NotificationType.FCM, "reason"
        ));
    }

    // given: assume each FCM send takes 3 seconds (longer than the 2-second connection timeout)
    // This keeps the transaction (@Transactional) open for 3 seconds
    given(fcmService.retryFcmNotification(any())).willAnswer(invocation -> {
        log.info("--> FCM send started (waiting 3 seconds) ...");
        Thread.sleep(3000);
        log.info("<-- FCM send finished");
        return ApiFutures.immediateFuture("test-message-id");
    });

    // when: run the scheduler (async)
    log.info("=== Scheduler run started ===");
    notificationRetryScheduler.retryFailedNotification();

    // Wait long enough for the async tasks to finish
    // A CountDownLatch or similar would be better in practice; sleep is used to keep the test simple
    Thread.sleep(6000);
    log.info("=== Scheduler run finished (wait over) ===");

    // then: check the result
    // The number of successes should be equal or close to the pool size set above; the rest fail with a timeout
    // A failure increments the count, so count the rows with retry_count > 0
    long failedCount = failedNotificationRepository.findAll().stream()
            .filter(fn -> fn.getRetryCount() > 0)
            .count();

    // Deleted rows are the successful ones, but rows may remain here because of rollbacks and the like
    long successCount = failedNotificationRepository.count() - failedCount;

    log.info("Estimated successes: {}", successCount);
    log.info("Failures (timeouts, etc.): {}", failedCount);

    // Verify: not every request succeeds, and at least some must fail
    assertThat(failedCount).isGreaterThan(0);
}

The log reported an error saying that a transaction could not be opened.
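The test comment mentions that a `CountDownLatch` would be a better way to wait than a fixed `Thread.sleep`. The sketch below shows the general pattern in plain Java. The class and method names are hypothetical, and the sleeping task is only a stand-in for the mocked FCM call.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class LatchWaitDemo {

    /** Starts taskCount async tasks and waits for them; returns true if all finished within the timeout. */
    static boolean runAndAwait(int taskCount, long taskMillis, long timeoutMillis) {
        ExecutorService executor = Executors.newFixedThreadPool(taskCount);
        CountDownLatch done = new CountDownLatch(taskCount);
        try {
            for (int i = 0; i < taskCount; i++) {
                executor.submit(() -> {
                    try {
                        Thread.sleep(taskMillis); // stand-in for the slow call
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        done.countDown(); // count down on success and on failure alike
                    }
                });
            }
            // Returns as soon as the count reaches zero, or false once the timeout elapses
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

With a latch the test waits only as long as the work actually takes, and a `false` return makes a hang visible instead of silently passing the wait. One caveat for this particular test: a task that fails to open its transaction presumably never reaches the FCM stub, so the `countDown()` would have to live somewhere every task passes through. Where exactly that should be is left open here.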

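For the asynchronous handling of notification sending and failed-notification resending in the current project,

I came up with the following approach. It is not perfect, but it should address the problem to a reasonable degree.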

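Asynchronous non-blocking: kept for the initial, normal send
Retry logic: switched to sequential, synchronous processing for stability (scheduler)
Transaction/deadlock: the scheduler only reads, without a transaction; processing is split into individual transactions
Count increment: the retry count is incremented when an exception occurs
Delete on success: the row is deleted right after synchronous processing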
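The points above could be sketched roughly as follows. This is a plain-Java illustration under stated assumptions, not the project's implementation: `Store` is an in-memory stand-in for the failed-notification table, `Sender` is a hypothetical stand-in for the FCM client, and `processOne` marks the unit that would run in its own transaction in the real service.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SequentialRetryDemo {

    /** Hypothetical stand-in for the FCM client; throws when sending fails. */
    interface Sender {
        void send(long id) throws Exception;
    }

    /** In-memory stand-in for the failed-notification table: id -> retry count. */
    static final class Store {
        final Map<Long, Integer> retryCounts = new LinkedHashMap<>();

        List<Long> findRetryTargets(int maxRetryCount) {
            List<Long> ids = new ArrayList<>(); // a copy, so rows can be deleted while iterating over it
            for (Map.Entry<Long, Integer> entry : retryCounts.entrySet()) {
                if (entry.getValue() < maxRetryCount) {
                    ids.add(entry.getKey());
                }
            }
            return ids;
        }

        void delete(long id) {
            retryCounts.remove(id);
        }

        void incrementRetryCount(long id) {
            retryCounts.merge(id, 1, Integer::sum);
        }
    }

    /** Scheduler side: read the targets once, then handle them one at a time on the same thread. */
    static void retryAll(Store store, Sender sender, int maxRetryCount) {
        for (long id : store.findRetryTargets(maxRetryCount)) {
            processOne(store, sender, id);
        }
    }

    /** In the real service this would be the unit wrapped in its own transaction. */
    static void processOne(Store store, Sender sender, long id) {
        try {
            sender.send(id);
            store.delete(id);              // delete on success
        } catch (Exception e) {
            store.incrementRetryCount(id); // count the failed attempt
        }
    }

    public static void main(String[] args) {
        Store store = new Store();
        store.retryCounts.put(1L, 0);
        store.retryCounts.put(2L, 0);
        store.retryCounts.put(3L, 0);

        // Simulate a send that fails only for id 2
        retryAll(store, id -> {
            if (id == 2L) {
                throw new Exception("simulated FCM failure");
            }
        }, 3);

        System.out.println(store.retryCounts); // prints {2=1}
    }
}
```

Because the items are handled one after another, at most one unit of work is in flight at any moment, so the retry job cannot occupy more than one connection regardless of how slow a single send is.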

### Conclusion

What matters in retry logic is

not 'sending even one second faster'

but 'handling things a little more reliably'. That is the priority, in my view.

So I will most likely drop the asynchronous approach and process the notifications sequentially (Sequential Processing), in chunks of the batch size.
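Splitting the targets into batch-sized chunks might look something like this. The helper name `partition` and the batch size of 4 are assumptions for illustration only; the post does not specify either.

```java
import java.util.ArrayList;
import java.util.List;

class BatchPartitionDemo {

    /** Splits items into consecutive chunks of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < items.size(); from += batchSize) {
            int to = Math.min(from + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(from, to))); // copy so each batch is independent
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> ids = List.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        // Batches are handled one after another; items inside a batch are handled in order too
        for (List<Integer> batch : partition(ids, 4)) {
            System.out.println(batch);
        }
        // prints:
        // [1, 2, 3, 4]
        // [5, 6, 7, 8]
        // [9, 10]
    }
}
```

Chunking keeps each scheduler run bounded: a run can stop after a fixed number of batches and leave the remainder for the next run.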

I expect to think more about this implementation and about a few alternative approaches, and I will write those up in another post!
