
[Cours] Calculs distribués sur données massives

AMA!

    11 October 2018 at 12:18:46

    Hello,

    I have a question about the working environment.

    Is it possible to use Spark SQL (activity 2) on Windows with Anaconda?

    Thanks.

      11 October 2018 at 20:26:28

      Hello,

      Spark SQL works perfectly well with Python. To use it, install pyspark, launch your script with spark-submit, and initialize a SparkSession object at the start, for example like this:

          from pyspark import SparkContext
          from pyspark.sql import SparkSession

          # Create the Spark context, then build (or reuse) a SparkSession for Spark SQL
          sc = SparkContext()
          spark = SparkSession.builder.appName("myapp").getOrCreate()
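
      The script is then launched with spark-submit as mentioned above, for example: spark-submit myapp.py (assuming the script is saved as myapp.py; on Windows, spark-submit typically sits in the bin folder of the Spark distribution).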

      However, to give a complete answer I would need to know which versions of Spark and Python (Anaconda) you are using. It turns out that some Spark-Python version combinations do not work, for example Spark 2.1.0 with Python 3.6.
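
      If it helps, here is a minimal way to print the exact versions in use from within the script itself (sc.version and sys.version are standard PySpark/Python attributes):

          import sys
          from pyspark import SparkContext

          sc = SparkContext()
          print("Spark version:", sc.version)    # e.g. 2.3.x
          print("Python version:", sys.version)  # the interpreter pyspark actually runs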

      Michel.

      -
      Edited by Michel D. on 11 October 2018 at 20:28:16

      Nothing in life happens... neither as we fear it, nor as we hope for it.
        12 October 2018 at 11:03:33

        Hello Michel,

        Thank you for your reply.

        As for versions, I am using Anaconda 1.9.2, notebook 5.6 and Python 3.7.0.

        Here is the error I got after running these lines:

            from pyspark import SparkContext
            from pyspark.sql import SparkSession

            sc = SparkContext()
        ---------------------------------------------------------------------------
        Exception                                 Traceback (most recent call last)
        <ipython-input-3-2dfc28fca47d> in <module>()
        ----> 1 sc = SparkContext()
        
        ~\Anaconda3\lib\site-packages\pyspark\context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
         113         """
         114         self._callsite = first_spark_call() or CallSite(None, None, None)
        --> 115         SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
         116         try:
         117             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
        
        ~\Anaconda3\lib\site-packages\pyspark\context.py in _ensure_initialized(cls, instance, gateway, conf)
         290         with SparkContext._lock:
         291             if not SparkContext._gateway:
        --> 292                 SparkContext._gateway = gateway or launch_gateway(conf)
         293                 SparkContext._jvm = SparkContext._gateway.jvm
         294 
        
        ~\Anaconda3\lib\site-packages\pyspark\java_gateway.py in launch_gateway(conf)
         91 
         92             if not os.path.isfile(conn_info_file):
        ---> 93                 raise Exception("Java gateway process exited before sending its port number")
         94 
         95             with open(conn_info_file, "rb") as info:
        
        Exception: Java gateway process exited before sending its port number
          1 November 2018 at 22:32:48

          Hello,

          The link to the movies dataset in the PageRank exercise returns a 404 error.

          Would it be possible to re-upload the dataset to the OpenClassrooms site?

          Thanks in advance.

          -------

          EDIT:

          Answering my own question:

          The dataset was archived on archive.org; I found it at this URL:

          https://web.archive.org/web/20070314100741/http://www.cs.toronto.edu/~tsap/experiments/download/_movies.tar.Z

          Can those who have completed this course confirm that it is indeed the same file? Could the exercise link be updated to this one, now that the original page is gone?

          -
          Edited by mikachou34 on 1 November 2018 at 22:48:07

            3 November 2018 at 12:55:17

            @Aimad

            Sorry for the late reply! You may already have solved your problem; if not, could you tell me which version of Spark you are using, and also which OS you are working on?

            I see you are using Python 3.7. I don't know whether it works with 3.7, which is quite recent.

            I also have Anaconda 1.9.2, but I use Spyder 3.3.1, which loads a Python 3.6.3 environment, and I get the same error.

            However, in a terminal with the Python 3.6 CLI, I do not get this error.
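
            For what it's worth, the "Java gateway process exited before sending its port number" error generally means that Spark could not start its Java side. A quick sanity check (plain Python, not from the course) is to verify that Java is visible from the environment the notebook runs in:

                import os, shutil

                # If JAVA_HOME is unset or java is not on the PATH of this environment,
                # launching the Py4J gateway fails with the error above.
                print("JAVA_HOME =", os.environ.get("JAVA_HOME"))
                print("java on PATH:", shutil.which("java"))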

            Hope this helps.

            Nothing in life happens... neither as we fear it, nor as we hope for it.
              24 November 2018 at 23:12:41

              Hello,

              I'm back with another small issue: the iliad_odyssey.py script from the « Apprenez à débugger une application spark » part does not seem to work:

              $ env PYSPARK_PYTHON=python3 ./spark-2.4.0-bin-hadoop2.7/bin/spark-submit ./iliad_odyssey.py
              18/11/24 22:47:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
                File "/home/michael/openclassrooms/spark/./iliad_odyssey.py", line 31
                  .map(lambda (word, count): (word, count/float(word_count)))\
                              ^
              SyntaxError: invalid syntax

              Note that even after removing the parentheses after the lambda keyword, I run into a new error:

              $ env PYSPARK_PYTHON=python3 ./spark-2.4.0-bin-hadoop2.7/bin/spark-submit ./iliad_odyssey.py
              18/11/24 22:54:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
              18/11/24 22:54:34 ERROR PythonRunner: Python worker exited unexpectedly (crashed)
              org.apache.spark.api.python.PythonException: Traceback (most recent call last):
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 357, in main
                  eval_type = read_int(infile)
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 714, in read_int
                  raise EOFError
              EOFError
              
              	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
              	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
              	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
              	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
              	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
              	at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1124)
              	at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1130)
              	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
              	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
              	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
              	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
              	at org.apache.spark.scheduler.Task.run(Task.scala:121)
              	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
              	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
              	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
              	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              	at java.lang.Thread.run(Thread.java:748)
              Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 372, in main
                  process()
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 367, in process
                  serializer.dump_stream(func(split_index, iterator), outfile)
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 390, in dump_stream
                  vs = list(itertools.islice(iterator, batch))
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/util.py", line 99, in wrapper
                  return f(*args, **kwargs)
              TypeError: <lambda>() missing 1 required positional argument: 'count'
              
              	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
              	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
              	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
              	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
              	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
              	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
              	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
              	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
              	at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:557)
              	at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345)
              	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
              	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
              [... the same EOFError / "TypeError: <lambda>() missing 1 required positional argument: 'count'" stack traces are repeated for the other worker tasks ...]
              18/11/24 22:54:34 ERROR TaskSetManager: Task 3 in stage 4.0 failed 1 times; aborting job
              Traceback (most recent call last):
                File "/home/michael/openclassrooms/spark/./iliad_odyssey.py", line 60, in <module>
                  main()
                File "/home/michael/openclassrooms/spark/./iliad_odyssey.py", line 47, in main
                  emerging_words = join_words.takeOrdered(10, lambda word, freq_diff: -freq_diff)
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1304, in takeOrdered
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 844, in reduce
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 816, in collect
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
              py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
              : org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 4.0 failed 1 times, most recent failure: Lost task 3.0 in stage 4.0 (TID 11, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 372, in main
                  process()
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 367, in process
                  serializer.dump_stream(func(split_index, iterator), outfile)
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 390, in dump_stream
                  vs = list(itertools.islice(iterator, batch))
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/util.py", line 99, in wrapper
                  return f(*args, **kwargs)
              TypeError: <lambda>() missing 1 required positional argument: 'count'
              
              	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
              	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
              	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
              	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
              	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
              	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
              	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
              	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
              	at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:557)
              	at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345)
              	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
              	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
              
              Driver stacktrace:
              	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1887)
              	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1875)
              	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1874)
              	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
              	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
              	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1874)
              	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
              	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
              	at scala.Option.foreach(Option.scala:257)
              	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
              	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2108)
              	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2057)
              	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2046)
              	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
              	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
              	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
              	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
              	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
              	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
              	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
              	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
              	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
              	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
              	at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
              	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:166)
              	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
              	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
              	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
              	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
              	at java.lang.reflect.Method.invoke(Method.java:498)
              	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
              	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
              	at py4j.Gateway.invoke(Gateway.java:282)
              	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
              	at py4j.commands.CallCommand.execute(CallCommand.java:79)
              	at py4j.GatewayConnection.run(GatewayConnection.java:238)
              	at java.lang.Thread.run(Thread.java:748)
              Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 372, in main
                  process()
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 367, in process
                  serializer.dump_stream(func(split_index, iterator), outfile)
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 390, in dump_stream
                  vs = list(itertools.islice(iterator, batch))
                File "/home/michael/openclassrooms/spark/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/util.py", line 99, in wrapper
                  return f(*args, **kwargs)
              TypeError: <lambda>() missing 1 required positional argument: 'count'
              
              	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
              	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
              	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
              	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
              	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
              	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
              	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
              	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
              	at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:557)
              	at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345)
              	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
              	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
               


              Next, I'd like to bring up the _movies link in the PageRank exercise of the Hadoop part again: it returns a 404 (the page no longer exists). I found the dataset a few weeks ago on archive.org, and I suggest the course maintainers update the link to: https://web.archive.org/web/20070314100741/http://www.cs.toronto.edu/~tsap/experiments/download/_movies.tar.Z

              Finally, for the Hadoop exercise, I was initially asked to complete 6 peer reviews. Along the way that dropped to 5 ("you have done your share of the work"), but since this evening it is back to 6. Why???

                27 November 2018 at 9:43:22

                Hello,

                Regarding the exception in the iliad script: the error comes from the fact that the syntax used in the course is Python 2, whereas you must be running Python 3. In Python 3, a lambda can no longer unpack a tuple in its parameter list, so you have to index into the argument instead. You therefore need to write:

                .map(lambda wordcount: (wordcount[0], wordcount[1]/float(word_count)))\
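
                To make the difference concrete, here is a minimal, self-contained illustration of the two forms (plain Python, outside Spark):

                    # Python 2 allowed tuple unpacking directly in lambda parameters:
                    #     lambda (word, count): (word, count / float(word_count))
                    # In Python 3 the pair arrives as a single argument that must be indexed:
                    word_count = 7
                    pairs = [("sword", 2), ("ship", 5)]
                    freqs = list(map(lambda wc: (wc[0], wc[1] / float(word_count)), pairs))
                    print(freqs)  # [('sword', 0.2857...), ('ship', 0.7142...)]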

                As for _movies.tar, I could have suggested a link, but you found it on your own.

                For the handling of the peer reviews, send an email to hello@openclassrooms.com if you do not get an answer on this forum.

                Michel.

                Nothing in life happens... neither as we fear it, nor as we hope for it.
                  Team OC - 28 November 2018 at 14:52:49

                  Hello Mika,

                  Thank you for your feedback! As a first step, we have updated the activity with the new link.

                  Best regards,

                    4 December 2018 at 0:07:53

                    Good evening Michel and Luc,

                    Thank you for your answers, and sorry for my late reply.

                    @Michel:

                    Indeed, the tuple parameters had to be replaced by indexing, at the spot you pointed out and also a little further down. Building on your hints, here is the script I propose (it works for me with Python 3):

                    from pyspark import SparkContext
                    
                    sc = SparkContext()
                    
                    def filter_stop_words(word):
                        from nltk.corpus import stopwords
                        english_stop_words = stopwords.words("english")
                        return word not in english_stop_words
                    
                    def load_text(text_path):
                        # Split text in words
                        # Remove empty word artefacts
                        # Remove stop words ('I', 'you', 'a', 'the', ...)
                        vocabulary = sc.textFile(text_path)\
                            .flatMap(lambda lines: lines.lower().split())\
                            .flatMap(lambda word: word.split("."))\
                            .flatMap(lambda word: word.split(","))\
                            .flatMap(lambda word: word.split("!"))\
                            .flatMap(lambda word: word.split("?"))\
                            .flatMap(lambda word: word.split("'"))\
                            .flatMap(lambda word: word.split("\""))\
                            .filter(lambda word: word is not None and len(word) > 0)\
                            .filter(filter_stop_words)
                    
                        # Count the total number of words in the text
                        word_count = vocabulary.count()
                    
                        # Compute the frequency of each word: frequency = #appearances/#word_count
                        word_freq = vocabulary.map(lambda word: (word, 1))\
                            .reduceByKey(lambda count1, count2: count1 + count2)\
                            .map(lambda wordcount: (wordcount[0], wordcount[1]/float(word_count)))
                    
                        return word_freq
                    
                    def main():
                        iliad = load_text('./iliad.mb.txt')
                        odyssey = load_text('./odyssey.mb.txt')
                    
                        # Join the two datasets and compute the difference in frequency
                        # Note that we need to write (freq or 0) because some words do not appear
                        # in one of the two books. Thus, some frequencies are equal to None after
                        # the full outer join.
                        join_words = iliad.fullOuterJoin(odyssey)\
                            .map(lambda wordfreqs: (wordfreqs[0], (wordfreqs[1][1] or 0) - (wordfreqs[1][0] or 0)))
                    
                        # 10 words that get a boost in frequency in the sequel
                        emerging_words = join_words.takeOrdered(10, lambda freq_diff: -freq_diff[1])
                        # 10 words that get a decrease in frequency in the sequel
                        disappearing_words = join_words.takeOrdered(10, lambda freq_diff: freq_diff[1])
                    
                        # Print results
                        for word, freq_diff in emerging_words:
                            print("%.2f" % (freq_diff*10000), word)
                        for word, freq_diff in disappearing_words[::-1]:
                            print("%.2f" % (freq_diff*10000), word)
                    
                        input("press ctrl+c to exit")
                    
                    if __name__ == "__main__":
                        main()
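
                    For readers hitting the same error, a toy illustration (made-up data, not the course's) of the change described above: Python 3 (PEP 3113) removed tuple unpacking in function parameters, so a lambda now has to take a single argument and index into it.

                    pairs = [("achilles", 3), ("ulysses", 5)]

                    # Python 2 allowed tuple unpacking directly in lambda parameters:
                    #   emerging = sorted(pairs, key=lambda (word, count): -count)
                    # Python 3 rejects this syntax; take a single argument and index into it:
                    emerging = sorted(pairs, key=lambda word_count: -word_count[1])
                    print(emerging)  # [('ulysses', 5), ('achilles', 3)]

                    The same substitution is presumably what turns the tuple-parameter lambdas of the original (Python 2) script into the indexed wordcount / wordfreqs lambdas used above.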

                    @Luc:

                    Great! Thanks for updating the exercise!

                      5 December 2018 at 18:56:06

                      Good evening Mika.

                      Yes, that is exactly it. I think your script could usefully replace or complement (for Python 3) the one in the course... but that is not my call!

                      All the best.

                      Michel.

                      Nothing in life ever happens... neither as we fear it, nor as we hope for it.
                        6 April 2019 at 0:15:20

                        Hello,

                        For information, the file in the S3 bucket used in the final quiz is no longer accessible...

                        One point gone :/...

                        Otherwise, great course! Well done. A few notes all the same:

                        The exercises of the first practical assignment (Hadoop), especially the PageRank one, are quite tough, but they force you to brush up on the maths and the algorithmics. However, although the code is correct, all the submissions I graded (3) as well as my own have identical TF-IDF results that differ from the official correction (roughly 30-40% overlap), and I cannot see where the difference comes from. Could the stop word list not be the same?
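
                        If it helps anyone chasing this down, one quick way to test the stop word hypothesis is to diff the two lists directly. A minimal sketch, assuming NLTK and its stopwords corpus are installed; short_list is a hypothetical stand-in for whatever list another submission hard-codes:

                        from nltk.corpus import stopwords  # requires: python -m nltk.downloader stopwords

                        # Hypothetical hand-written list, standing in for a submission's own choice
                        short_list = {"the", "a", "an", "and", "of", "to", "in"}
                        nltk_list = set(stopwords.words("english"))

                        # Words dropped by one list but kept by the other end up with different
                        # frequencies, which is enough to reshuffle a TF-IDF ranking
                        print(sorted(nltk_list - short_list)[:20])
                        print(sorted(short_list - nltk_list))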

                        As for Spark, some additional details on Python 2/3 and pip, as well as on the interactions with the various modules (boto3 for AWS, for example), would be welcome.

                        -
                        Edited by JulienLeroux38, 6 April 2019 at 0:16:17

                          14 May 2019 at 1:03:09

                          Hello,
                          I am having serious trouble getting Hadoop to run while following the instructions of the course part "Familiarisez-vous avec Hadoop".
                          The Java compilation goes fine and I end up with the file "ooc_cours1_wordcount.jar".
                          However, when I run the command
                          hadoop jar ooc_cours1_wordcount.jar ooc.cours1.wordcount.WordCountDriver /input/lejourseleve.txt /results

                          I get:

                          19/05/14 00:48:00 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
                          Exception in thread "main" java.net.ConnectException: Call From da-ThinkPad-X280/127.0.1.1 to localhost:9000 failed on connection exception: java.net.ConnectException: Connexion refusée; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
                          	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
                          	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
                          	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
                          	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
                          	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
                          	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
                          	at org.apache.hadoop.ipc.Client.call(Client.java:1479)
                          	at org.apache.hadoop.ipc.Client.call(Client.java:1412)
                          	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
                          	at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
                          	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
                          	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                          	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
                          	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                          	at java.lang.reflect.Method.invoke(Method.java:498)
                          	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
                          	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
                          	at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)
                          	at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2108)
                          	at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
                          	at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
                          	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
                          	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1301)
                          	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
                          	at ooc.cours1.wordcount.WordCountDriver.run(WordCountDriver.java:51)
                          	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
                          	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
                          	at ooc.cours1.wordcount.WordCountDriver.main(WordCountDriver.java:60)
                          	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                          	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
                          	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                          	at java.lang.reflect.Method.invoke(Method.java:498)
                          	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
                          	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
                          Caused by: java.net.ConnectException: Connexion refusée
                          	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
                          	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
                          	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
                          	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
                          	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
                          	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)
                          	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)
                          	at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
                          	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
                          	at org.apache.hadoop.ipc.Client.call(Client.java:1451)
                          	... 27 more
                          

                          I have exhausted all the advice on Stack Overflow, so I am turning to you.

                          (I tried restarting the server with start-all.sh in the hadoop-2.7.3/sbin directory.)


                            14 May 2019 at 9:39:06

                            Hello,

                            I have just graded the TF-IDF and PageRank exercises, and I do not understand why the official TF-IDF correction gives a list that is very different from mine and from all the ones I graded (defoe.txt); maybe a different stop word list?

                            Regarding PageRank, the official correction does not mention the number of iterations. My output for a single iteration matches the correction, but like quite a few people I (foolishly?) ran nine iterations, and the results then differ quite a bit.

                            Moreover, the statement says s=0.15 while the correction uses 0.25.
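
                            To make the iteration-count and damping questions concrete, here is a minimal pure-Python sketch of PageRank (not the course's MapReduce solution; the graph is made up, and s is assumed to be the teleport probability, as in the statement). It shows that both the number of iterations and the value of s change the ranking:

                            # Toy link graph: page -> pages it links to (hypothetical data)
                            links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

                            def pagerank(links, s=0.15, iterations=1):
                                pages = list(links)
                                n = len(pages)
                                rank = {p: 1.0 / n for p in pages}
                                for _ in range(iterations):
                                    # Each page keeps a teleport share s/n and receives (1 - s) times
                                    # the rank its in-neighbours spread evenly over their outgoing links
                                    new_rank = {p: s / n for p in pages}
                                    for page, targets in links.items():
                                        share = rank[page] / len(targets)
                                        for target in targets:
                                            new_rank[target] += (1 - s) * share
                                    rank = new_rank
                                return rank

                            print(pagerank(links, s=0.15, iterations=1))
                            print(pagerank(links, s=0.15, iterations=9))
                            print(pagerank(links, s=0.25, iterations=9))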

                            Apart from that, I find the course really well designed, with real exercises that make you roll up your sleeves!

                            -
                            Edited by VincentLarmet, 14 May 2019 at 10:03:51

                              16 May 2019 at 12:17:36

                              @Damien

                              Try adding export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native

                              to spark-env.sh

                              Nothing in life ever happens... neither as we fear it, nor as we hope for it.
                                16 May 2019 at 23:54:50

                                Could it be the yarn-env.sh file instead?

                                  18 May 2019 at 10:51:42

                                  Or rather hadoop-env.sh. In fact the same problem also exists with Spark, which I think you will run into later.

                                  See https://github.com/rancher/community-catalog/issues/698

                                  In any case, you have to try and see which env file fixes the problem.

                                  Nothing in life ever happens... neither as we fear it, nor as we hope for it.
                                    14 June 2019 at 0:11:51

                                    Good evening,

                                    I am currently working on the following assignment:

                                    http://exercices.openclassrooms.com/assessment/625?id=4297166&slug=realisez-des-calculs-distribues-sur-des-donnees-massives&login=6429877&tk=ee595542426b98a0f8a158bc08fc2aae&sbd=2016-02-01&sbdtk=fa78d6dd3126b956265a25af9b322d55

                                    Does each step require its own MapReduce job?

                                    Thanks in advance

                                    Edit: I am not 100% sure, but I think the answer is yes.

                                    -
                                    Edited by Bibilerikiki, 15 June 2019 at 10:38:45

