[SSA] 스케줄러를 통해 Retry를 하게 될 경우 트랜잭션 처리

이전에 알림 모듈에서 알림 전송에 실패한 메시지를 데이터베이스에 저장하고

스프링에서 @EnableRetry 설정값을 부여하면 retry 스케줄러를 등록하여 로직을 처리할 수 있습니다.

그런데 이때 주의할 게 있습니다.

여러 번 재시도를 하게 될 텐데 이때

재시도를 처리하는 로직 자체에 비동기 처리를 하게 될 경우

@Transactional 설정에 대한 주의입니다.

다음은 문제의 코드입니다.

@Scheduled(fixedDelay = 60000) // 1분마다 실행
@Transactional
public void retryFailedNotifications() {
    // ...
    List<CompletableFuture<Void>> futures = targets.stream()
            .map(failedNotification -> CompletableFuture.runAsync(() -> {
                try {
                    NotificationRequest event = NotificationRequest.fromFail(failedNotification);
                    fcmService.sendFcmNotification(event);

                    // 성공 시 재시도 목록에서 삭제
                    failedNotificationRepository.delete(failedNotification);
                } catch (Exception e) {
                    // 재시도 또 실패 시, 카운트 증가
                    failedNotification.incrementRetryCount();
                    failedNotificationRepository.save(failedNotification);
                }
            }, notificationExecutor))
            .toList();
    // ...
}

이렇게 하게 되면 @Transactional을 달아 놔도 제대로 원하는 대로 동작하지 않습니다.

왜냐하면 비동기 호출을 하게 되면 현재 메서드를 수행하는 스레드가 아니라 완전히 다른 스레드가 해당 로직을 수행하기 때문입니다.

따라서, 다른 스레드가 수행하는 메서드는 @Transactional의 영향을 받지 않게 됩니다.

### 스프링의 @Transactional이 동작하는 메커니즘 중 하나는 ThreadLocal

스프링의 동작 방식:
1. 트랜잭션이 시작되면, 스프링의 TransactionSynchronizationManager는 현재 스레드의 ThreadLocal 변수에 데이터베이스 커넥션(또는 세션) 정보를 저장합니다.
2. 이후 JPA Repository가 호출되면, ThreadLocal을 찾아서 '현재 참여 중인 트랜잭션(커넥션)이 있는지'를 확인하고 그것을 가져다 씁니다.
문제 발생 (비동기 호출 시):
- CompletableFuture.runAsync는 작업을 Executor라는 다른 스레드 풀의 스레드에게 위임합니다.
- 스레드 A(스케줄러)의 ThreadLocal과 스레드 B(비동기 작업)의 ThreadLocal은 완전히 분리된 메모리입니다.
- 결과적으로 스레드 B는 스레드 A가 열어둔 트랜잭션을 볼 수 없으며, 트랜잭션 없이 동작하거나(JPA 설정에 따라 에러 발생), Repository가 자체적으로 새로운 트랜잭션을 매번 새로 엽니다.

### 메인 스레드(스케줄러)가 @Transactional을 잡고 있으면 DB 커넥션 1개를 점유한 채로 대기

시나리오:
1. 스케줄러 스레드: @Transactional 시작 -> DB 커넥션풀에서 [Conn-A]를 빌립니다.
2. 스케줄러 스레드: CompletableFuture.allOf(...).join()을 호출하며 자식 스레드들이 끝날 때까지 [Conn-A]를 가진 채로 멈춰 있습니다(Blocking).
3. 비동기 스레드들: 각자 작업을 수행하기 위해 failedNotificationRepository.save() 등을 호출. 이때 새로운 커넥션 [Conn-B, Conn-C...]가 필요합니다.
문제점:
- 만약 처리량이 많아져서 비동기 스레드들이 커넥션을 다 써버리거나, 반대로 스케줄러 스레드들이 커넥션을 다 쥐고 놓아주지 않으면 ?
- 비동기 스레드는 커넥션을 못 얻어 대기하고, 스케줄러 스레드는 비동기 스레드가 끝나길 대기하는 데드락(Deadlock) 상태에 빠집니다.

### AOP 프록시의 자기 호출(Self-Invocation) 문제

@Transactional의 동작 원리:
- 스프링은 빈(Bean)을 생성할 때, @Transactional이 붙은 클래스를 감싸는 프록시(Proxy) 객체를 만듭니다.
- 외부에서 scheduler.retryFailedNotifications()를 호출하면, 실제로는 프록시가 먼저 가로채서 Transaction.begin()을 하고 실제 메서드를 실행합니다.
같은 클래스 내 메서드 호출 (this.method):
- 그러면 @Transactional이 붙은 메서드 아래에 private 메서드를 생성하면 될까요 ?
- 그리고 만약 NotificationRetryScheduler 클래스 안에서 @Transactional이 붙은 다른 메서드(private처럼)를 this.processSingleRetry() 처럼 호출하면 어떻게 될까요 ?
- 이것은 프록시를 거치지 않고 자기 자신의 내부 인스턴스 메서드를 직접 호출하는 것입니다.
- 결과적으로 트랜잭션 코드가 적용되지 않은 메서드만 실행됩니다.

따라서, 트랜잭션을 적용하려면 외부의 다른 빈(Service)을 주입받아 호출해야 합니다.

다음은 기존의 레포지토리 코드입니다.

public interface FailedNotificationRepository extends JpaRepository<FailedNotification, Long> {

    @Lock(LockModeType.PESSIMISTIC_WRITE)
    List<FailedNotification> findByRetryCountLessThan(int retryCount);
}

조건에 맞는 전체 실패 메시지를 가져 오는데

이렇게 되면 스케줄러의 메서드를 수행하는 스레드에서 스케줄러가 100개를 가져오면서 락을 걸어버리면,

비동기로 실행되는 100개의 processSingleRetry 스레드들이 락 대기 상태에 빠져 데드락이 걸리게 될 겁니다.

따라서 리스트는 락 없이 읽기만 하고, 실제 수정할 때 하나의 레코드씩 잠그는 것으로 수정하려고 합니다.

수정한 레포지토리 코드입니다.

public interface FailedNotificationRepository extends JpaRepository<FailedNotification, Long> {

    List<FailedNotification> findByRetryCountLessThan(int retryCount);

    @Lock(LockModeType.PESSIMISTIC_WRITE)
    @Query("select fn from FailedNotification fn where fn.id = :id")
    Optional<FailedNotification> findByIdWithLock(@Param("id") Long id);
}

기존의 스케줄러 메인 메서드입니다.

@Scheduled(fixedDelay = 60000) // 1분마다 실행
public void retryFailedNotifications() {
    // ...
    List<CompletableFuture<Void>> futures = targets.stream()
            .map(failedNotification -> CompletableFuture.runAsync(() -> {
                try {
                    NotificationRequest event = NotificationRequest.fromFail(failedNotification);
                    fcmService.sendFcmNotification(event);

                    // 성공 시 재시도 목록에서 삭제
                    failedNotificationRepository.delete(failedNotification);
                } catch (Exception e) {
                    // 재시도 또 실패 시, 카운트 증가
                    failedNotification.incrementRetryCount();
                    failedNotificationRepository.save(failedNotification);
                }
            }, notificationExecutor))
            .toList();
    // ...
}

기존 코드대로 작동하면

Lock wait timeout exceeded 에러가 발생합니다.

이는 스케줄러(메인 스레드)가 락을 쥐고 놓아 주지 않아서 자식 스레드들이 DB에 접근하지 못하고 타임아웃이 걸린 것입니다.

가끔 가다가 정상적으로 동작하긴 합니다만, 데드락이 계속 걸리게 됩니다.

수정한 스케줄러 메인 메서드입니다.

@Scheduled(fixedDelay = 600000) // 10분마다 실행
public void retryFailedNotifications() {
    List<FailedNotification> targets = failedNotificationRepository.findByRetryCountLessThan(MAX_RETRY_COUNT);

    if (targets.isEmpty()) {
        return;
    }

    log.info("알림 재처리 스케줄을 수행합니다. 대상 총 {}.", targets.size());

    List<CompletableFuture<Void>> futures = targets.stream()
            .map(failedNotification -> CompletableFuture.runAsync(() -> {
                // ID만 넘기거나, 엔티티를 넘겨서 별도 서비스에서 트랜잭션 처리
                notificationRetryService.processSingleRetry(failedNotification.getId());
            }, notificationExecutor))
            .toList();

    try {
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        log.info("알림 재처리 스케줄이 성공적으로 완료되었습니다. 총 {}건 처리.", futures.size());
    } catch (Exception e) {
        log.error("알림 재처리 작업 중 일부에서 예외가 발생했습니다.", e);
    }
}

새로 추가해서 호출하는 인스턴스의 메서드입니다.

@Transactional(propagation = Propagation.REQUIRES_NEW)
public void processSingleRetry(Long id) {
    failedNotificationRepository.findByIdWithLock(id).ifPresent(failedNotification -> {
        try {
            NotificationRequest event = NotificationRequest.fromFail(failedNotification);
            fcmService.sendFcmNotification(event);

            // 성공 시 삭제
            failedNotificationRepository.delete(failedNotification);
        } catch (Exception e) {
            log.warn("알림 재전송 실패. ID: {}", failedNotification.getId());
            // 실패 시 카운트 증가 및 업데이트
            failedNotification.incrementRetryCount();
            failedNotificationRepository.save(failedNotification);
        }
    });
}

'SSA > Back' 카테고리의 다른 글

[SSA] 실패한 알림 재전송 시 데드락 테스트 (0)	2025.12.13
[SSA] 비동기 콜백 등록 후 스케줄러를 통해 Retry를 하게 될 경우 무한 루프에 빠진다 ..? (0)	2025.12.13
[SSA] 카프카 메시지를 발행할 때 ZERO-PAYLOAD ? 아니면 Event-Carried State Transfer ? (0)	2025.12.11
[SSA] Redis의 특징을 활용한 동시성 제어 (Redis 분산 락 아님) (0)	2025.09.13
[SSA] 특강 개설 시 선착순 동시성 이슈 해결 (2)	2025.09.10

'SSA > Back' 카테고리의 다른 글

티스토리툴바