Pipelines in Java

Edoardo Midali

Le pipeline rappresentano un pattern architetturale fondamentale per l’elaborazione di dati, permettendo di comporre operazioni sequenziali in modo modulare e riutilizzabile. In Java, le pipeline possono essere implementate attraverso diverse tecnologie, dalle Stream API native a framework specializzati, offrendo soluzioni per elaborazione sincrona, asincrona e distribuita.

Concetti Fondamentali

Una pipeline è una sequenza di stadi di elaborazione connessi, dove l’output di uno stadio diventa l’input del successivo. Questo pattern promuove la separazione delle responsabilità, facilita il testing e permette ottimizzazioni specifiche per ogni stadio.

Caratteristiche delle Pipeline

Composabilità: Gli stadi possono essere combinati in modi diversi per creare pipeline complesse.

Riutilizzabilità: Singoli stadi possono essere riutilizzati in pipeline diverse.

Parallelizzazione: Stadi indipendenti possono essere eseguiti in parallelo.

Scalabilità: Le pipeline possono essere distribuite su più thread o macchine.

Testabilità: Ogni stadio può essere testato indipendentemente.

// Esempio concettuale di pipeline
public interface PipelineStage<T, R> {
    R process(T input);

    default <V> PipelineStage<T, V> andThen(PipelineStage<R, V> next) {
        return input -> next.process(this.process(input));
    }
}

// Implementazione di stadi specifici
PipelineStage<String, String> normalize = String::toLowerCase;
PipelineStage<String, String> trim = String::trim;
PipelineStage<String, Integer> length = String::length;

// Composizione della pipeline
PipelineStage<String, Integer> pipeline = normalize
    .andThen(trim)
    .andThen(length);

// Utilizzo
int result = pipeline.process("  HELLO WORLD  "); // 11

Stream API come Pipeline Foundation

Le Stream API di Java forniscono il meccanismo più naturale per implementare pipeline di elaborazione dati, offrendo operazioni lazy, ottimizzazioni automatiche e supporto per parallelismo.

Pipeline Dichiarative

Le Stream API permettono di esprimere pipeline di elaborazione in modo dichiarativo, separando il “cosa” dal “come” dell’elaborazione.

public class DataProcessingPipeline {

    public static List<ProcessedData> processTransactions(List<Transaction> transactions) {
        return transactions.stream()
            // Stadio 1: Filtro di validazione
            .filter(transaction -> transaction.getAmount() > 0)
            .filter(transaction -> transaction.getDate().isAfter(LocalDate.now().minusDays(30)))

            // Stadio 2: Trasformazione e arricchimento
            .map(transaction -> enrichTransaction(transaction))

            // Stadio 3: Aggregazione per cliente
            .collect(Collectors.groupingBy(Transaction::getCustomerId))
            .entrySet().stream()

            // Stadio 4: Calcolo statistiche
            .map(entry -> calculateCustomerStats(entry.getKey(), entry.getValue()))

            // Stadio 5: Filtro finale e ordinamento
            .filter(stats -> stats.getTotalAmount() > 1000)
            .sorted(Comparator.comparing(ProcessedData::getTotalAmount).reversed())
            .collect(Collectors.toList());
    }

    private static Transaction enrichTransaction(Transaction transaction) {
        // Arricchimento con dati esterni
        CustomerInfo customer = customerService.getCustomerInfo(transaction.getCustomerId());
        return transaction.withCustomerInfo(customer);
    }

    private static ProcessedData calculateCustomerStats(String customerId, List<Transaction> transactions) {
        double totalAmount = transactions.stream()
            .mapToDouble(Transaction::getAmount)
            .sum();

        long transactionCount = transactions.size();
        double averageAmount = totalAmount / transactionCount;

        return new ProcessedData(customerId, totalAmount, transactionCount, averageAmount);
    }
}

Pipeline con Gestione Errori

Le pipeline reali devono gestire errori senza interrompere l’intera elaborazione, implementando strategie di resilienza.

public class ResilientPipeline {

    public static class Result<T> {
        private final T value;
        private final Exception error;
        private final boolean success;

        private Result(T value, Exception error, boolean success) {
            this.value = value;
            this.error = error;
            this.success = success;
        }

        public static <T> Result<T> success(T value) {
            return new Result<>(value, null, true);
        }

        public static <T> Result<T> failure(Exception error) {
            return new Result<>(null, error, false);
        }

        public boolean isSuccess() { return success; }
        public T getValue() { return value; }
        public Exception getError() { return error; }
    }

    public static List<Result<ProcessedDocument>> processDocuments(List<String> documentPaths) {
        return documentPaths.stream()
            // Stadio 1: Caricamento sicuro
            .map(ResilientPipeline::safeLoadDocument)

            // Stadio 2: Elaborazione solo documenti caricati con successo
            .map(result -> {
                if (result.isSuccess()) {
                    return safeProcessDocument(result.getValue());
                } else {
                    return Result.<ProcessedDocument>failure(result.getError());
                }
            })

            .collect(Collectors.toList());
    }

    private static Result<Document> safeLoadDocument(String path) {
        try {
            Document doc = loadDocument(path);
            return Result.success(doc);
        } catch (Exception e) {
            return Result.failure(e);
        }
    }

    private static Result<ProcessedDocument> safeProcessDocument(Document document) {
        try {
            ProcessedDocument processed = processDocument(document);
            return Result.success(processed);
        } catch (Exception e) {
            return Result.failure(e);
        }
    }
}

Pipeline Asincrone

Per elaborazioni che coinvolgono operazioni I/O o calcoli pesanti, le pipeline asincrone permettono di sfruttare meglio le risorse sistema.

CompletableFuture Pipeline

public class AsyncPipeline {

    private final ExecutorService executor = ForkJoinPool.commonPool();

    public CompletableFuture<List<AnalysisResult>> processDataAsync(List<DataSource> sources) {

        List<CompletableFuture<AnalysisResult>> futures = sources.stream()
            .map(this::createProcessingPipeline)
            .collect(Collectors.toList());

        // Combina tutti i risultati
        return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
            .thenApply(v -> futures.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList()));
    }

    private CompletableFuture<AnalysisResult> createProcessingPipeline(DataSource source) {
        return CompletableFuture
            // Stadio 1: Caricamento asincrono
            .supplyAsync(() -> loadData(source), executor)

            // Stadio 2: Validazione
            .thenApply(this::validateData)

            // Stadio 3: Trasformazione (CPU-intensive)
            .thenApplyAsync(this::transformData, executor)

            // Stadio 4: Analisi
            .thenApply(this::analyzeData)

            // Stadio 5: Persistenza asincrona
            .thenCompose(this::saveResultAsync)

            // Gestione errori
            .exceptionally(this::handleError);
    }

    private RawData loadData(DataSource source) {
        // Simulazione I/O
        try {
            Thread.sleep(100);
            return new RawData(source.getData());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }

    private ValidatedData validateData(RawData data) {
        if (data.isValid()) {
            return new ValidatedData(data);
        }
        throw new IllegalArgumentException("Invalid data");
    }

    private TransformedData transformData(ValidatedData data) {
        // Simulazione elaborazione CPU-intensive
        return new TransformedData(data.transform());
    }

    private AnalysisResult analyzeData(TransformedData data) {
        return new AnalysisResult(data.analyze());
    }

    private CompletableFuture<AnalysisResult> saveResultAsync(AnalysisResult result) {
        return CompletableFuture.supplyAsync(() -> {
            // Simulazione salvataggio asincrono
            saveToDatabase(result);
            return result;
        }, executor);
    }

    private AnalysisResult handleError(Throwable error) {
        System.err.println("Pipeline error: " + error.getMessage());
        return AnalysisResult.empty();
    }
}

Reactive Streams Pipeline

Per elaborazioni con backpressure e gestione avanzata del flusso, i reactive streams offrono controllo granulare.

// Utilizzando una libreria reactive come RxJava (concettuale)
public class ReactivePipeline {

    public Observable<ProcessedEvent> processEventStream(Observable<RawEvent> eventStream) {
        return eventStream
            // Stadio 1: Filtro rapido
            .filter(event -> event.isValid())

            // Stadio 2: Raggruppamento per finestre temporali
            .window(Duration.ofSeconds(5))

            // Stadio 3: Elaborazione batch per finestra
            .flatMap(window -> window
                .toList()
                .map(this::processBatch)
                .flatMapObservable(Observable::fromIterable)
            )

            // Stadio 4: Arricchimento asincrono
            .flatMap(event ->
                enrichEventAsync(event).toObservable()
            )

            // Stadio 5: Gestione backpressure
            .onBackpressureBuffer(1000)

            // Gestione errori con retry
            .retry(3)
            .onErrorResumeNext(error -> {
                System.err.println("Stream error: " + error.getMessage());
                return Observable.empty();
            });
    }

    private List<ProcessedEvent> processBatch(List<RawEvent> events) {
        return events.stream()
            .map(this::processEvent)
            .collect(Collectors.toList());
    }

    private Single<ProcessedEvent> enrichEventAsync(ProcessedEvent event) {
        return Single.fromCallable(() -> {
            // Arricchimento asincrono
            return enrichEvent(event);
        }).subscribeOn(Schedulers.io());
    }
}

Pipeline Personalizzate

Per casi d’uso specifici, è spesso necessario implementare pipeline personalizzate con controllo granulare su esecuzione, error handling e performance.

Framework Pipeline Custom

public class CustomPipelineFramework {

    public static class Pipeline<T> {
        private final List<Stage<?, ?>> stages = new ArrayList<>();
        private final ExecutorService executor;

        public Pipeline(ExecutorService executor) {
            this.executor = executor;
        }

        @SuppressWarnings("unchecked")
        public <R> Pipeline<R> addStage(Stage<T, R> stage) {
            stages.add(stage);
            return (Pipeline<R>) this;
        }

        public CompletableFuture<List<Object>> processAsync(List<T> inputs) {
            return CompletableFuture.supplyAsync(() -> {
                List<Object> currentData = new ArrayList<>(inputs);

                for (Stage<?, ?> stage : stages) {
                    currentData = processStage(stage, currentData);
                }

                return currentData;
            }, executor);
        }

        @SuppressWarnings("unchecked")
        private <I, O> List<Object> processStage(Stage<I, O> stage, List<Object> inputs) {
            return inputs.stream()
                .map(input -> {
                    try {
                        return stage.process((I) input);
                    } catch (Exception e) {
                        return stage.handleError(e, (I) input);
                    }
                })
                .filter(Objects::nonNull)
                .collect(Collectors.toList());
        }
    }

    public interface Stage<T, R> {
        R process(T input) throws Exception;

        default R handleError(Exception error, T input) {
            System.err.println("Stage error for input " + input + ": " + error.getMessage());
            return null; // Skip questo elemento
        }
    }

    // Implementazioni di stadi specifici
    public static class ValidationStage implements Stage<String, String> {
        @Override
        public String process(String input) throws Exception {
            if (input == null || input.trim().isEmpty()) {
                throw new IllegalArgumentException("Invalid input");
            }
            return input.trim();
        }
    }

    public static class TransformationStage implements Stage<String, ProcessedData> {
        @Override
        public ProcessedData process(String input) throws Exception {
            // Simulazione elaborazione
            Thread.sleep(10);
            return new ProcessedData(input.toUpperCase(), input.length());
        }
    }

    // Utilizzo del framework
    public static void demonstrateCustomPipeline() {
        ExecutorService executor = Executors.newFixedThreadPool(4);

        Pipeline<String> pipeline = new Pipeline<String>(executor)
            .addStage(new ValidationStage())
            .addStage(new TransformationStage());

        List<String> inputs = Arrays.asList("hello", "", "world", null, "java");

        pipeline.processAsync(inputs)
            .thenAccept(results -> {
                System.out.println("Pipeline results: " + results);
            })
            .exceptionally(error -> {
                System.err.println("Pipeline failed: " + error.getMessage());
                return null;
            });
    }
}

Pipeline per ETL e Data Processing

Le pipeline sono fondamentali per operazioni ETL (Extract, Transform, Load) e elaborazione di grandi volumi di dati.

ETL Pipeline

public class ETLPipeline {

    public static class ETLJob<S, T> {
        private final DataExtractor<S> extractor;
        private final DataTransformer<S, T> transformer;
        private final DataLoader<T> loader;
        private final BatchProcessor batchProcessor;

        public ETLJob(DataExtractor<S> extractor,
                     DataTransformer<S, T> transformer,
                     DataLoader<T> loader,
                     int batchSize) {
            this.extractor = extractor;
            this.transformer = transformer;
            this.loader = loader;
            this.batchProcessor = new BatchProcessor(batchSize);
        }

        public CompletableFuture<ETLResult> execute() {
            return CompletableFuture.supplyAsync(() -> {
                ETLResult result = new ETLResult();

                try (Stream<S> sourceData = extractor.extract()) {
                    sourceData
                        .map(data -> {
                            try {
                                return transformer.transform(data);
                            } catch (Exception e) {
                                result.addError(e);
                                return null;
                            }
                        })
                        .filter(Objects::nonNull)
                        .forEach(transformedData -> {
                            batchProcessor.add(transformedData);

                            if (batchProcessor.isFull()) {
                                loadBatch(batchProcessor.flush(), result);
                            }
                        });

                    // Load rimanenti dati
                    if (!batchProcessor.isEmpty()) {
                        loadBatch(batchProcessor.flush(), result);
                    }

                } catch (Exception e) {
                    result.addError(e);
                }

                return result;
            });
        }

        private void loadBatch(List<T> batch, ETLResult result) {
            try {
                loader.load(batch);
                result.addProcessedCount(batch.size());
            } catch (Exception e) {
                result.addError(e);
            }
        }
    }

    private static class BatchProcessor {
        private final List<Object> batch;
        private final int batchSize;

        public BatchProcessor(int batchSize) {
            this.batchSize = batchSize;
            this.batch = new ArrayList<>(batchSize);
        }

        public void add(Object item) {
            batch.add(item);
        }

        public boolean isFull() {
            return batch.size() >= batchSize;
        }

        public boolean isEmpty() {
            return batch.isEmpty();
        }

        @SuppressWarnings("unchecked")
        public <T> List<T> flush() {
            List<T> result = new ArrayList<>((List<T>) batch);
            batch.clear();
            return result;
        }
    }

    // Interfacce per componenti ETL
    public interface DataExtractor<T> {
        Stream<T> extract() throws Exception;
    }

    public interface DataTransformer<S, T> {
        T transform(S source) throws Exception;
    }

    public interface DataLoader<T> {
        void load(List<T> data) throws Exception;
    }

    public static class ETLResult {
        private int processedCount = 0;
        private final List<Exception> errors = new ArrayList<>();

        public void addProcessedCount(int count) {
            this.processedCount += count;
        }

        public void addError(Exception error) {
            this.errors.add(error);
        }

        public int getProcessedCount() { return processedCount; }
        public List<Exception> getErrors() { return new ArrayList<>(errors); }
        public boolean hasErrors() { return !errors.isEmpty(); }
    }
}

Performance e Ottimizzazioni

Parallel Pipeline Processing

public class ParallelPipelineOptimizations {

    // Pipeline con parallelismo controllato
    public static <T, R> List<R> processParallel(List<T> data,
                                                 Function<T, R> processor,
                                                 int parallelism) {
        ForkJoinPool customThreadPool = new ForkJoinPool(parallelism);
        try {
            return customThreadPool.submit(() ->
                data.parallelStream()
                    .map(processor)
                    .collect(Collectors.toList())
            ).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            customThreadPool.shutdown();
        }
    }

    // Pipeline con chunking per ottimizzare throughput
    public static <T, R> List<R> processInChunks(List<T> data,
                                                 Function<List<T>, List<R>> batchProcessor,
                                                 int chunkSize) {
        return IntStream.range(0, (data.size() + chunkSize - 1) / chunkSize)
            .parallel()
            .mapToObj(i -> {
                int start = i * chunkSize;
                int end = Math.min(start + chunkSize, data.size());
                return data.subList(start, end);
            })
            .map(batchProcessor)
            .flatMap(List::stream)
            .collect(Collectors.toList());
    }

    // Pipeline con backpressure
    public static <T> void processWithBackpressure(Stream<T> dataStream,
                                                   Consumer<T> processor,
                                                   int bufferSize) {
        BlockingQueue<T> buffer = new ArrayBlockingQueue<>(bufferSize);

        // Producer thread
        CompletableFuture<Void> producer = CompletableFuture.runAsync(() -> {
            dataStream.forEach(item -> {
                try {
                    buffer.put(item); // Blocca se buffer pieno
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException(e);
                }
            });
        });

        // Consumer thread
        CompletableFuture<Void> consumer = CompletableFuture.runAsync(() -> {
            try {
                T item;
                while (!producer.isDone() || !buffer.isEmpty()) {
                    item = buffer.poll(100, TimeUnit.MILLISECONDS);
                    if (item != null) {
                        processor.accept(item);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        });

        CompletableFuture.allOf(producer, consumer).join();
    }
}

Best Practices

Composabilità: Progetta stadi piccoli e riutilizzabili che fanno una cosa sola bene.

Error Handling: Implementa gestione errori robusta che non interrompe l’intera pipeline.

Monitoring: Aggiungi metriche e logging per monitorare performance e errori.

Testing: Testa ogni stadio indipendentemente e l’intera pipeline end-to-end.

Resource Management: Gestisci appropriatamente thread pool, connessioni e memory usage.

Backpressure: Implementa meccanismi di controllo del flusso per prevenire overflow.

Conclusione

Le pipeline rappresentano un pattern architetturale potente per l’elaborazione strutturata dei dati in Java. Dalle semplici Stream API alle implementazioni asincrone complesse, le pipeline offrono modularità, testabilità e scalabilità.

La scelta dell’implementazione dipende dai requisiti specifici: Stream API per elaborazioni sincrone semplici, CompletableFuture per parallelismo controllato, reactive streams per backpressure avanzata, o framework custom per controllo granulare.

Una comprensione solida dei pattern pipeline e delle loro implementazioni è essenziale per architettare sistemi di elaborazione dati robusti e performanti in Java moderno.